## The problem

As in the first week, can we predict which buildings are likely to fail inspection?

In [2]:
# So, this would actually fit in memory, but let's pretend not...
%ls -lh data

total 1633872
-rw-r-----@ 1 dav  staff   227M Jun  3 12:11 Building_Permits.csv
-rw-r-----@ 1 dav  staff   571M Jun  3 12:11 Building_Violations.csv


## Back to the future

Let's pretend I have ~128MB of RAM, which really was the case a long time ago. In practice, we should probably just put our data in a database, so don't take the following as a recommendation. It's meant to illustrate the idea of working out-of-core.

In [32]:
# Just get in the habit of always doing this
%matplotlib inline
import pandas as pd

# I like my code to be Python3-centric, but this is for those of you still on Python 2
from __future__ import division
from six import print_

Pandas wants to be told how many rows to read per chunk, but our constraint is in bytes. So, let's have a look at our data.

In [17]:
with open('data/Building_Permits.csv') as permit_f:
    # 2 ** 20 Bytes is the definition of MB used by ls
    some_lines = permit_f.readlines(50 * 2 ** 20)

print_('So 50MB is about {} lines'.format(len(some_lines)))

So 50MB is about 85875 lines


In [18]:
max_len = 0
for line in some_lines:
    if len(line) > max_len:
        max_len = len(line)

print_('The longest line in that chunk is {} bytes'.format(max_len))

The longest line in that chunk is 2340 bytes


To be safe, we'll assume every line is 2340 bytes long, so we'll tell pandas to read about this many lines per chunk:

$$
50\ MB \over 2340\ bytes\, /\, line
$$

In [20]:
# Integer division, since we want a whole number of lines
50 * 2 ** 20 // 2340

22405
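
The same arithmetic can be wrapped in a small helper. This is only a sketch: `rows_per_chunk` and the sample lines are made up for illustration, and it assumes the sampled lines are representative of the whole file.

```python
def rows_per_chunk(budget_bytes, sample_lines):
    """Conservative rows per chunk: pretend every line is as long as the longest sampled one."""
    # Encode so that we count bytes rather than characters
    longest = max(len(line.encode('utf-8')) for line in sample_lines)
    return budget_bytes // longest


# Made-up sample standing in for `some_lines`
sample = ["a,b,c\n", "1,2,3\n", "10,20,300000\n"]
n_rows = rows_per_chunk(50 * 2 ** 20, sample)
print('Read about {} rows per chunk'.format(n_rows))
```

Using the longest line overestimates the bytes per row, so the real chunks should come in under budget.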

In [28]:
# Ironically, we are setting low_memory to False,
# even though we are pretending to have low memory.
# This actually won't solve the issue of "mixed types".
# The chunksize is the 22405 from above, rounded down to 22400.
permit_iterator = pd.read_csv('data/Building_Permits.csv', chunksize=22400, low_memory=False)
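
One way to deal with "mixed types" when reading in chunks is to pass an explicit `dtype` to `read_csv`, since each chunk otherwise infers its column types on its own. Here is a minimal sketch on a made-up CSV; the column names `id` and `code` are hypothetical and are not taken from the permits file.

```python
import io
import pandas as pd

# Made-up data: `code` looks numeric in the first rows and textual later
csv_text = "id,code\n1,10\n2,20\n3,A1\n4,B2\n"

# Without dtype, each chunk guesses the type of `code` independently
inferred = [str(chunk['code'].dtype)
            for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2)]

# With an explicit dtype, every chunk agrees
fixed = [str(chunk['code'].dtype)
         for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2,
                                  dtype={'code': str})]

print('inferred per chunk:', inferred)
print('explicit per chunk:', fixed)
```

For the real file you would have to work out which columns need this, for instance by inspecting the first chunk.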

In [36]:
for first_chunk in permit_iterator:
    # We would normally iterate through all chunks, but here we stop after the first assignment.
    # We take advantage of the fact that Python leaves the last value
    # of the loop variable defined after the loop ends.
    break

In [34]:
# Careful: .size is the number of cells (rows * columns), not bytes,
# so this is the cell count in units of 2 ** 20, not the chunk's size in MB
first_chunk.size / 2 ** 20

2.7984619140625

In [35]:
first_chunk.shape

(22400, 131)
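
Note that `DataFrame.size` counts cells, which is why 22400 × 131 = 2,934,400 cells divided by 2 ** 20 gives the 2.798 above. To see how many bytes a chunk actually occupies, `memory_usage(deep=True)` is a better fit. A minimal sketch on a small made-up frame (the column names are hypothetical):

```python
import pandas as pd

# Small made-up frame standing in for `first_chunk`
df = pd.DataFrame({'permit_id': range(1000),
                   'status': ['ISSUED'] * 1000})

n_cells = df.size                            # rows * columns, not bytes
n_bytes = df.memory_usage(deep=True).sum()   # includes the index and string contents

print('{} cells, {:.3f} MB'.format(n_cells, n_bytes / 2 ** 20))
```

On the real data, `first_chunk.memory_usage(deep=True).sum() / 2 ** 20` would give the chunk's footprint in MB, which is the number to compare against the memory budget.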

At this point, we can do piece-wise computations on these chunks.
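
As an illustration of a piece-wise computation, here is a sketch that counts the values of one column across all chunks while only holding one chunk in memory at a time. The CSV text and the `status` column are made up for the example; the same pattern should apply to an iterator like `permit_iterator`.

```python
import io
from collections import Counter

import pandas as pd

# Made-up data standing in for the permits file
csv_text = ("permit_id,status\n"
            "1,ISSUED\n2,DENIED\n3,ISSUED\n4,ISSUED\n5,DENIED\n")

totals = Counter()
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # Summarize this chunk alone, then fold it into the running totals
    totals.update(chunk['status'].value_counts().to_dict())

print(dict(totals))
```

This works because counts can be combined by simple addition. Statistics like a median do not decompose this way and need more care.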