# Working with big files

 

## Key idea 1: read the file streaming, and unpack on the fly

### Why?
> This scales: you do NOT want to hold a big file in memory if you only need it bit by bit. <br/>
And why waste time and disk space unpacking a file completely?

1. You do not have to unpack a zip, gzip or bzip2 file before you can read it.
2. You can read it as a stream, piece by piece.
    * Even on the command line:
        * `zcat` for gzipped files
        * or `gunzip -c <file> | more`
        * See <https://en.wikibooks.org/wiki/Guide_to_Unix/Commands/File_Compression>
3. In Python:

```
import gzip

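# 'rt' opens the file in text mode, so each line is a str rather than bytes
# (UTF-8 is assumed here; adjust the encoding to your data)
with gzip.open('input.gz', 'rt', encoding='utf-8') as fin:
    for line in fin:  # one line at a time, never the whole file
        print('got line', line)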
```
4. For bz2 files, the `bz2` module offers `bz2.open` and `BZ2File` with a similar interface.
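
Because the gzip and bz2 interfaces are so similar, the choice between them can be hidden behind one small helper. The sketch below is only an illustration: the names `open_text` and `count_lines` are made up for this example, the format is guessed from the file extension, and UTF-8 is assumed as the encoding.

```python
import bz2
import gzip


def open_text(path, encoding="utf-8"):
    """Open a plain, gzip or bzip2 file as a text stream, chosen by extension."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding=encoding)
    if path.endswith(".bz2"):
        return bz2.open(path, "rt", encoding=encoding)
    return open(path, "r", encoding=encoding)


def count_lines(path):
    """Count lines while holding only one line in memory at a time."""
    with open_text(path) as fin:
        return sum(1 for _ in fin)
```

The rest of the program can then call `count_lines('input.gz')` or `count_lines('input.bz2')` without caring how the file was packed.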

## Key idea 2: clean memory when you are done
1. This holds especially when working with XML files and `lxml`.
2. Even when you read an XML file as a stream and remove the context,
    * `lxml` still stores the internal tree structure
    * so your memory consumption keeps going up,
    * your machine starts to swap like hell
    * and basically stalls.
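
One way to keep memory flat is to drop each record from the tree as soon as it has been processed. The sketch below uses the standard library's `xml.etree.ElementTree` instead of `lxml`, so that it runs without extra packages. The function name `count_elements` is made up, and it assumes the records of interest are direct children of the root element. `lxml.etree.iterparse` has a similar interface, but there the cleanup usually also involves deleting already-processed siblings from their parent.

```python
import xml.etree.ElementTree as ET


def count_elements(source, tag):
    """Stream through an XML source and count <tag> elements, freeing each after use."""
    count = 0
    # Ask for both events: "start" to get hold of the root, "end" for complete elements
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            count += 1  # real processing of elem would go here
            # Assumption: records are direct children of the root, so clearing
            # the root drops everything that has been processed so far
            root.clear()
    return count
```

`source` can be a filename or an open binary file object, so this combines with key idea 1: `count_elements(gzip.open('dump.xml.gz', 'rb'), 'page')`, where the filename and tag are placeholders.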
    
    
    

## Key idea 3: divide and conquer
1. If you have enough RAM, but your input files are still so big that processing takes ages,
2. you can **divide** the work over several machines or cores
3. and afterwards **combine** the results.
4. Sometimes you have to divide the input yourself, sometimes you get the input data already in several files.
    * E.g., you can download the complete Wikipedia dump in 1 file or in 4 files.
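
The divide and combine steps can be sketched as follows, assuming the input already comes as several gzipped text files and the task is a simple word count. All function names are made up for this illustration, and UTF-8 is assumed as the encoding. Each file is counted on its own, and the partial counts are merged at the end.

```python
import gzip
from collections import Counter
from concurrent.futures import ProcessPoolExecutor


def count_words(path):
    """Divide step: stream one gzipped text file and count its words."""
    counts = Counter()
    with gzip.open(path, "rt", encoding="utf-8") as fin:
        for line in fin:
            counts.update(line.split())
    return counts


def count_words_parallel(paths, executor):
    """Hand each file to a worker of the executor, then combine the partial counts."""
    total = Counter()
    for partial in executor.map(count_words, paths):
        total.update(partial)  # combine step: add up the per-file counts
    return total


def main(paths):
    """Run the count with one worker process per available core (the default)."""
    with ProcessPoolExecutor() as executor:
        return count_words_parallel(paths, executor)
```

Passing the executor in as an argument keeps the divide and combine logic independent of how the work is distributed. When running this as a script, call `main` under `if __name__ == '__main__':` so that worker processes can import the module safely.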
    

# Three examples

1. [Reading a big text file](http://nbviewer.jupyter.org/format/slides/url/maartenmarx.nl/teaching/DataScience/NoteBooks/ReadingFilesFromTheWeb.ipynb#Reading-gzipped-file-line-by-line)  
    * We have done this before several times.
1. [Reading a big XML file](http://nbviewer.jupyter.org/format/slides/url/maartenmarx.nl/teaching/DataScience/NoteBooks/ParseWikipediaDump.ipynb)
1. [Reading a big spreadsheet](http://nbviewer.jupyter.org/format/slides/url/maartenmarx.nl/teaching/DataScience/NoteBooks/ParseBigSpreadsheet.ipynb)