## Large data hacks

### Reading large text files that are "out-of-memory"

Instead of calling `.read()`, which loads the entire file into memory, you can call `.readline()` to read one line at a time, or iterate over the open file object directly.

In [1]:
BIGFILE = "/home/deren/Documents/kmerkit/data/amaranths/hybridus_SLH_AL_1060_R1_concat.fastq.gz"

In [6]:
import gzip

### How many lines are in this file?

In [32]:
# you can iterate over the lines in a file from the open file object
nlines = 0
openfile = gzip.open(BIGFILE, 'r')
for line in openfile:
    nlines += 1
openfile.close()
print(nlines)

3979812


In [33]:
# by contrast, reading the entire file into memory can be wasteful
# if you only need one line at a time.
with gzip.open(BIGFILE, 'r') as indata:
    nlines1 = indata.readlines()
    print(len(nlines1))

3979812


In [34]:
# count the lines with a generator expression, without storing them in a list
with gzip.open(BIGFILE, 'r') as indata:
    nlines2 = sum(1 for i in indata)
    print(nlines2)

3979812


In [35]:
# read and print only the first 10 lines
with gzip.open(BIGFILE, 'r') as indata:
    for i in range(10):
        print(indata.readline())

b'@NB551405:60:H7T2GAFXY:1:11101:3993:1049 1:N:0:CATTCGGT+NNNNNTAA\n'
b'ATCGGCTAATAGTAGAGGTGTTGTGCCCTAACCGAAATATTCAAGTGAGATCACCCAAGTAAGAAAGATATTTATTCTTTCATTTGCCAATTAATTCATAATATTTTTCTTATGCAATATCTTAGTTTTTATCCATTCT\n'
b'+\n'
b'EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEAAAA6A<A<A\n'
b'@NB551405:60:H7T2GAFXY:1:11101:11941:1058 1:N:0:AAGTCGAG+NNNNAACC\n'
b'ATCGGATGCCATATATTCATAAACCAGTAGCCTATGCTCATCTTCACAACAGTATCCAATCAGCTTTACAAGGTTCGGGTGGCTAAGTTGTCCTAAGTAGCTAACTTCAGTCTGTATTATGAAACATATAAGCATTGTAAGC\n'
b'+\n'
b'EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAEEEEEEEAAA<EEEEEEEEAAAAAEEEAA\n'
b'@NB551405:60:H7T2GAFXY:1:11101:10608:1061 1:N:0:TATCGGTC+NNNNTAAA\n'
b'ATCGGTTTCTGGGTGGGTTAGGATGGTAAGGGTGCGGTTTTGGGTTTAGGTTAGGGTGGGATGGATTATTGGATCTGGTATATTGGGGGTTTGAAGTAATTTGGGATTTAGGGTTAGGGTTAGGGCGCCAGAATTGGGGATT\n'


### Generators
Python generators are a complex topic. They can be extremely useful, but their benefit is hard to recognize until you are fairly experienced or run into a use case that needs them. These use cases tend to involve very large data files, where you want to process only a portion of the data at a time without reading the entire file into memory. Generators make this possible by turning the "reading" process into a step-by-step approach: the generator reads a chunk of data, then waits at the point in the file where it left off until you request more. The `.readline()` method above works in a similar spirit: each call returns one line and leaves the file positioned where it stopped.

https://wiki.python.org/moin/Generators
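
To make the "pause and resume" behavior described above concrete, here is a minimal sketch of a generator function. The function name `nonempty_lines` and the in-memory sample text are made up for illustration; any open file object could be passed in place of the `io.StringIO` handle.

```python
import io

def nonempty_lines(handle):
    """Yield stripped, non-empty lines from an open file-like object."""
    for line in handle:
        line = line.strip()
        if line:
            # execution pauses here until the caller asks for the next value
            yield line

# a small in-memory stand-in for a real file (hypothetical data)
handle = io.StringIO("first\n\nsecond\nthird\n")

lines = nonempty_lines(handle)   # nothing has been read yet
first = next(lines)              # reads just far enough to produce one value
second = next(lines)             # resumes where it left off, skipping the blank line
remaining = list(lines)          # consumes whatever is left
print(first, second, remaining)
```

Calling the function only creates the generator. Lines are read from the handle as each value is requested, so only one line needs to be held at a time.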

### Reading chunks of a file using generators

In [36]:
import itertools

In [None]:
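# izip is Python 2 only; in Python 3 the built-in zip lazily groups lines in fours (assumed intent)
quartets = zip(*[iter(gzip.open(BIGFILE, 'rt'))] * 4)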
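
One way to read a file in fixed-size chunks is sketched below, using `itertools.islice` inside a generator. This is an illustrative approach under stated assumptions, not necessarily the one this notebook intended: the helper name `read_chunks`, the chunk size of 4 (one FASTQ record), and the tiny demo file written to a temporary directory are all hypothetical.

```python
import gzip
import itertools
import os
import tempfile

def read_chunks(handle, size=4):
    """Yield lists of up to `size` lines until the file is exhausted."""
    while True:
        # islice pulls at most `size` lines from the handle's current position
        chunk = list(itertools.islice(handle, size))
        if not chunk:
            return
        yield chunk

# write a tiny gzipped FASTQ-like file so the example is self-contained (made-up reads)
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "demo.fastq.gz")
with gzip.open(path, "wt") as out:
    out.write("@read1\nACGT\n+\nEEEE\n@read2\nTTGA\n+\nAAAA\n")

# 'rt' opens the file in text mode, so lines are str rather than bytes
with gzip.open(path, "rt") as indata:
    records = [chunk for chunk in read_chunks(indata, size=4)]

names = [rec[0].strip() for rec in records]
print(names)
```

For a real file you would process each chunk inside the loop rather than collecting every record into a list, so that only `size` lines are held in memory at a time.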

### Reading large tabular data files that are "out-of-memory"