# Reading, processing, and writing large files

In this section, we look at our second strategy for working with large files: read a file in chunks, process each chunk, then append the results to a single output file.

In [1]:
import pandas as pd
from dfply import *

## Hiding stack traceback

We hide the exception traceback for didactic reasons (code source: [see this post](https://stackoverflow.com/questions/46222753/how-do-i-suppress-tracebacks-in-jupyter)).  Don't run this cell if you want to see a full traceback.

In [2]:
import sys
ipython = get_ipython()

def hide_traceback(exc_tuple=None, filename=None, tb_offset=None,
                   exception_only=False, running_compiled_code=False):
    etype, value, tb = sys.exc_info()
    return ipython._showtraceback(etype, value, ipython.InteractiveTB.get_exception_only(etype, value))

ipython.showtraceback = hide_traceback

## Example 2 - Adding some dateparts and writing out the result

Now suppose that, instead of aggregating and visualizing, our goal is to add some new columns to the data set and write the result to a CSV file.  Again, we start by prototyping our code on the first chunk, then transform and write all of the chunks.

## Summary of the process

* Read and prototype on the first chunk
    * Outcome: helper functions for processing each chunk
* Reset the data frame iterator
* Process and write all chunks
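
The steps above can be sketched end to end on a small stand-in file. This is only a minimal sketch: the temporary file, the `add_hour` helper, and the tiny chunk size are all hypothetical and chosen for illustration. It uses plain pandas rather than `dfply`, and it creates a fresh iterator right before the write loop so that every chunk is written exactly once.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the Uber file: a tiny CSV with a timestamp column.
tmp_dir = tempfile.mkdtemp()
in_path = os.path.join(tmp_dir, "trips.csv")
out_path = os.path.join(tmp_dir, "trips-with-hour.csv")
pd.DataFrame({
    "date": pd.date_range("2014-04-01", periods=25, freq="60min"),
    "base": ["B02512"] * 25,
}).to_csv(in_path, index=False)


def add_hour(df):
    """Helper of the kind developed while prototyping on the first chunk."""
    return df.assign(hour=df["date"].dt.hour)


# Create a fresh iterator so no chunk has already been consumed by prototyping.
chunks = pd.read_csv(in_path, parse_dates=["date"], chunksize=10)

for i, chunk in enumerate(chunks):
    is_first = (i == 0)
    add_hour(chunk).to_csv(
        out_path,
        header=is_first,                # write the header once, at the top
        mode="w" if is_first else "a",  # create the file, then append to it
        index=False,
    )

result = pd.read_csv(out_path)
print(len(result), list(result.columns))
```

Deciding `header` and `mode` from the chunk index keeps the whole write in one loop, so there is no separate "first chunk" step that can fall out of sync with the iterator.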

#### Step 1 - Prototype on the first chunk

In [4]:
c_size = 10000
new_names = ['date', 'lat', 'lon', 'base']
date_cols = ['date']
df_iter = pd.read_csv("./data/uber/uber-trip-data/uber-raw-data-apr14.csv", 
                      header=0, names=new_names, 
                      parse_dates=date_cols,
                      chunksize=c_size,
                      sep=',',
                      engine='python')

In [10]:
from toolz import first
first_chunk = first(df_iter) 
first_chunk.head()

Unnamed: 0,date,lat,lon,base
30000,2014-04-26 23:32:00,40.7288,-73.994,B02512
30001,2014-04-26 23:33:00,40.7261,-73.9986,B02512
30002,2014-04-26 23:33:00,40.7279,-74.0021,B02512
30003,2014-04-26 23:34:00,40.7738,-73.9486,B02512
30004,2014-04-26 23:34:00,40.7296,-74.0024,B02512


#### Build an expression

In [8]:
(first_chunk >>
  mutate(weekday = X.date.dt.weekday_name,
         weekofyear = X.date.dt.weekofyear,
         day = X.date.dt.day,
         hour = X.date.dt.hour) >>
  head)

Unnamed: 0,date,lat,lon,base,weekday,weekofyear,day,hour
20000,2014-04-17 18:14:00,40.7222,-74.0095,B02512,Thursday,16,17,18
20001,2014-04-17 18:14:00,40.7464,-73.9739,B02512,Thursday,16,17,18
20002,2014-04-17 18:15:00,40.723,-74.0021,B02512,Thursday,16,17,18
20003,2014-04-17 18:15:00,40.723,-74.0021,B02512,Thursday,16,17,18
20004,2014-04-17 18:15:00,40.759,-73.9645,B02512,Thursday,16,17,18


#### Encapsulate the expression in a function and test on the first chunk

In [9]:
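def add_dateparts(df):
    # `weekday_name` and `weekofyear` need an older pandas; newer releases
    # provide `dt.day_name()` and `dt.isocalendar().week` instead.
    return (df
            >> mutate(weekday = X.date.dt.weekday_name,
                      weekofyear = X.date.dt.weekofyear,
                      day = X.date.dt.day,
                      hour = X.date.dt.hour))

add_dateparts(first_chunk).head()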

Unnamed: 0,date,lat,lon,base,weekday,weekofyear,day,hour
20000,2014-04-17 18:14:00,40.7222,-74.0095,B02512,Thursday,16,17,18
20001,2014-04-17 18:14:00,40.7464,-73.9739,B02512,Thursday,16,17,18
20002,2014-04-17 18:15:00,40.723,-74.0021,B02512,Thursday,16,17,18
20003,2014-04-17 18:15:00,40.723,-74.0021,B02512,Thursday,16,17,18
20004,2014-04-17 18:15:00,40.759,-73.9645,B02512,Thursday,16,17,18
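
On recent pandas releases, `dt.weekday_name` and `dt.weekofyear` are no longer available. The following is a possible plain-pandas variant of the helper, assuming pandas 1.1 or later. The name `add_dateparts_pandas` and the two-row sample frame are hypothetical; the sample timestamps are taken from the outputs shown in this section.

```python
import pandas as pd


def add_dateparts_pandas(df):
    """Plain-pandas variant of add_dateparts (hypothetical name)."""
    iso = df["date"].dt.isocalendar()  # DataFrame with year, week, day columns
    return df.assign(
        weekday=df["date"].dt.day_name(),    # replaces dt.weekday_name
        weekofyear=iso["week"].astype(int),  # replaces dt.weekofyear
        day=df["date"].dt.day,
        hour=df["date"].dt.hour,
    )


sample = pd.DataFrame(
    {"date": pd.to_datetime(["2014-04-17 18:14:00", "2014-04-26 23:32:00"])}
)
out = add_dateparts_pandas(sample)
print(out)
```

Because this version takes and returns an ordinary data frame, it can be dropped into the chunk loop in place of the `dfply` expression.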


#### Process and write the first chunk

For the first chunk, use

* `header=True`
* `mode='w'`
    * `'w'` == write $\rightarrow$ creates a new file (overwriting any existing file)

In [11]:
out_file = "./data/uber-raw-data-apr14-with-datepart.csv"
add_dateparts(first_chunk).to_csv(out_file, header=True, mode='w')

#### Process and write the remaining chunks

For the remaining chunks, use

* `mode='a'`
    * `'a'` == append $\rightarrow$ adds lines to the existing file
* `header=False`
    * No headers in the middle of the file

In [12]:
for i, chunk in enumerate(df_iter):
    print("writing chunk {0}".format(i+1))
    add_dateparts(chunk).to_csv(out_file, header=False, mode='a')

writing chunk 1
writing chunk 2
writing chunk 3
writing chunk 4
writing chunk 5
writing chunk 6
writing chunk 7
writing chunk 8
writing chunk 9
writing chunk 10
writing chunk 11
writing chunk 12
writing chunk 13
writing chunk 14
writing chunk 15
writing chunk 16
writing chunk 17
writing chunk 18
writing chunk 19
writing chunk 20
writing chunk 21
writing chunk 22
writing chunk 23
writing chunk 24
writing chunk 25
writing chunk 26
writing chunk 27
writing chunk 28
writing chunk 29
writing chunk 30
writing chunk 31
writing chunk 32
writing chunk 33
writing chunk 34
writing chunk 35
writing chunk 36
writing chunk 37
writing chunk 38
writing chunk 39
writing chunk 40
writing chunk 41
writing chunk 42
writing chunk 43
writing chunk 44
writing chunk 45
writing chunk 46
writing chunk 47
writing chunk 48
writing chunk 49
writing chunk 50
writing chunk 51
writing chunk 52
writing chunk 53


In [13]:
!head -n 10 ./data/uber-raw-data-apr14-with-datepart.csv

,date,lat,lon,base,weekday,weekofyear,day,hour
30000,2014-04-26 23:32:00,40.7288,-73.994,B02512,Saturday,17,26,23
30001,2014-04-26 23:33:00,40.7261,-73.9986,B02512,Saturday,17,26,23
30002,2014-04-26 23:33:00,40.7279,-74.0021,B02512,Saturday,17,26,23
30003,2014-04-26 23:34:00,40.7738,-73.9486,B02512,Saturday,17,26,23
30004,2014-04-26 23:34:00,40.7296,-74.0024,B02512,Saturday,17,26,23
30005,2014-04-26 23:34:00,40.7637,-73.9793,B02512,Saturday,17,26,23
30006,2014-04-26 23:35:00,40.7737,-73.9476,B02512,Saturday,17,26,23
30007,2014-04-26 23:35:00,40.7368,-73.9734,B02512,Saturday,17,26,23
30008,2014-04-26 23:36:00,40.7561,-73.9654,B02512,Saturday,17,26,23
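
The first rows shown above carry index 30000, which may indicate that some chunks were consumed from `df_iter` before the first write (for example, by re-running a cell). One possible sanity check is to compare the number of rows in the source file and in the output file. The sketch below counts rows chunk by chunk; `count_rows` and the demo file are hypothetical.

```python
import os
import tempfile

import pandas as pd


def count_rows(path, chunksize=10000):
    """Count data rows in a CSV without loading the whole file at once."""
    return sum(len(chunk) for chunk in pd.read_csv(path, chunksize=chunksize))


# Hypothetical demo file with 25 rows, standing in for the real data.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
pd.DataFrame({"lat": [40.7] * 25, "lon": [-74.0] * 25}).to_csv(demo_path, index=False)

n_rows = count_rows(demo_path, chunksize=10)
print(n_rows)
```

Applied to the files in this section, the same function could be called once on the raw April file and once on `out_file`; the two counts should match if every chunk was written.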


## A note on out of memory errors

* These can happen when reading data in chunks, even for files that fit comfortably in memory

#### Example - MoMA Artwork

Even though this file is not large (it easily fits in memory on modern machines), we get an out-of-memory error when iterating through the chunks.

In [None]:
[chunk for chunk in pd.read_csv('./data/Artworks.csv', chunksize=500)]

ParserError: Error tokenizing data. C error: out of memory

## Solution 1 - Specify the `sep` and set `engine='python'`

In [181]:
df_iter = pd.read_csv('./data/Artworks.csv', 
                      chunksize=500, # Pick a reasonable chunk size; I had memory errors with a smaller size
                      sep=',', # Explicit separator, to help the parser not run out of memory
                      dtype={'BeginDate':str}, # Keep BeginDate as strings so that string methods work on it
                      engine='python') # Switching to the Python engine is how I fixed the parsing errors
chunks = [chunk for chunk in df_iter] # Iterate over the reconfigured reader, not a new default one

## Solution 2 - Use `csv.reader` or `csv.DictReader`

In [38]:
import pandas as pd
from csv import DictReader, Sniffer
from toolz import partition_all

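with open('./data/Artworks.csv', newline='') as csvfile:
    # Guess the CSV dialect (delimiter, quoting) from a short sample
    dialect = Sniffer().sniff(csvfile.read(50))
    csvfile.seek(0)  # Rewind so the reader starts at the header row
    reader = DictReader(csvfile, dialect=dialect)
    columns = reader.fieldnames
    chunksize = 10000
    # partition_all lazily groups the rows into tuples of at most `chunksize` dicts
    for i, chunk in enumerate(partition_all(chunksize, reader)):
        print('creating df {0}'.format(i))
        _ = pd.DataFrame(list(chunk), columns=columns)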

creating df 0
creating df 1
creating df 2
creating df 3
creating df 4
creating df 5
creating df 6
creating df 7
creating df 8
creating df 9
creating df 10
creating df 11
creating df 12
creating df 13
creating df 14
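
If `toolz` is not available, the batching done by `partition_all` can be approximated with the standard library. The sketch below is one way to do it, using `itertools.islice`; the `batches` helper and the three-row in-memory CSV are hypothetical and only serve to show the grouping.

```python
import csv
import io
from itertools import islice


def batches(iterable, size):
    """Yield tuples of at most `size` items, in the spirit of toolz.partition_all."""
    it = iter(iterable)
    while True:
        batch = tuple(islice(it, size))
        if not batch:
            return
        yield batch


# Hypothetical in-memory CSV standing in for Artworks.csv.
text = "Title,Artist\nA,X\nB,Y\nC,Z\n"
reader = csv.DictReader(io.StringIO(text))

all_batches = list(batches(reader, 2))
sizes = [len(b) for b in all_batches]
print(sizes)
```

Each batch is a tuple of row dictionaries, so it can be passed to `pd.DataFrame(list(batch))` in the same way as the chunks produced by `partition_all`.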


## What does `DictReader` do?

* Reads one row at a time
* Returns each row as a `dict` of `(col_name, value)` pairs

In [39]:
from toolz import take

with open('./data/Artworks.csv') as csvfile:
    reader = DictReader(csvfile, dialect=dialect)
    columns = reader.fieldnames
    head = list(take(2, reader))
head


[OrderedDict([('\ufeffTitle',
               'Ferdinandsbrücke Project, Vienna, Austria, Elevation, preliminary version'),
              ('Artist', 'Otto Wagner'),
              ('ConstituentID', '6210'),
              ('ArtistBio', '(Austrian, 1841–1918)'),
              ('Nationality', '(Austrian)'),
              ('BeginDate', '(1841)'),
              ('EndDate', '(1918)'),
              ('Gender', '(Male)'),
              ('Date', '1896'),
              ('Medium', 'Ink and cut-and-pasted painted pages on paper'),
              ('Dimensions', '19 1/8 x 66 1/2" (48.6 x 168.9 cm)"'),
              ('CreditLine',
               'Fractional and promised gift of Jo Carole and Ronald S. Lauder'),
              ('AccessionNumber', '885.1996'),
              ('Classification', 'Architecture'),
              ('Department', 'Architecture & Design'),
              ('DateAcquired', '1996-04-09'),
              ('Cataloged', 'Y'),
              ('ObjectID', '2'),
              ('URL', 'http://ww
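
The first key in the output above is `'\ufeffTitle'` rather than `'Title'`: the file starts with a UTF-8 byte order mark (BOM), which ends up in the first column name. A minimal sketch of one way to avoid this is to open the file with `encoding='utf-8-sig'`, which strips the BOM on read. The small temporary file below is hypothetical and is written with a BOM purely to reproduce the effect.

```python
import csv
import os
import tempfile

# Hypothetical demo file, written with a leading BOM.
path = os.path.join(tempfile.mkdtemp(), "bom.csv")
with open(path, "w", encoding="utf-8-sig", newline="") as f:
    f.write("Title,Artist\nUntitled,Someone\n")

# Plain utf-8 keeps the BOM as part of the first field name.
with open(path, encoding="utf-8", newline="") as f:
    raw_names = csv.DictReader(f).fieldnames

# utf-8-sig removes the BOM before the csv module sees the text.
with open(path, encoding="utf-8-sig", newline="") as f:
    clean_names = csv.DictReader(f).fieldnames

print(raw_names)
print(clean_names)
```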

## <font color="red"> Exercise 3 </font>

Create a file for the May Uber pick-ups that contains the dateparts added in the last example.  Include the ***day of the year*** and ***week of the year***.  Use the `bash` `head` command to inspect the first 10 rows of the result.

In [49]:
# Your code here
