# Speeding Up I/O

I/O is one of the most time-consuming things we do in any programming language. Let's explore some ways to do it more efficiently.

## Numpy

We used numpy's loadtxt and genfromtxt in October's sessions. These are a quick way to read columns of text data into arrays.

In [9]:
import numpy as np

firstColFlt, secColFlt = np.loadtxt( "random.txt", dtype=float, usecols=[0,1], unpack=True )
print firstColFlt, secColFlt

[0. 1. 2. 3. 4. 5. 6. 7. 8. 9.] [ 10.  21.  32.  43.  54.  65.  76.  87.  98. 109.]


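Numpy can save multiple arrays in a binary format with savez. Arrays passed positionally are stored under the names arr_0, arr_1, ... in argument order, but watch the ordering of the files list, which may not match the order in which the arrays were passed.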

In [10]:
np.savez('binaryNonCompressed.npz', firstColFlt, secColFlt)
npzFile = np.load('binaryNonCompressed.npz')
print npzFile.files
print npzFile['arr_1']

['arr_1', 'arr_0']
[ 10.  21.  32.  43.  54.  65.  76.  87.  98. 109.]
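
One way to avoid relying on the arr_0/arr_1 ordering is to pass the arrays to savez as keyword arguments, which become the stored names. The sketch below is a minimal illustration; the names `first` and `second`, the stand-in arrays, and the temporary output path are assumptions made for the example, not part of the session files. It uses the print() function (via `__future__`) so it should run on both Python 2 and Python 3.

```python
from __future__ import print_function  # print() behaves the same on Python 2 and 3

import os
import tempfile

import numpy as np

# Stand-in data shaped like the two columns read from random.txt
first_col = np.arange(10, dtype=float)
second_col = 11.0 * np.arange(10) + 10.0

# Hypothetical output path in a temporary directory
npz_path = os.path.join(tempfile.mkdtemp(), "namedColumns.npz")

# Keyword arguments become the names of the stored arrays
np.savez(npz_path, first=first_col, second=second_col)

# Look arrays up by name, so the order of .files no longer matters
with np.load(npz_path) as npz_file:
    stored_names = sorted(npz_file.files)
    loaded_second = npz_file["second"]

print(stored_names)
print(loaded_second)
```

Looking arrays up by an explicit name keeps the reading code correct even if the order of the files list changes between numpy or Python versions.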


Numpy can also compress the binary data with savez_compressed. This is slower (it takes time to compress and decompress) but may be worthwhile if you are moving data between clusters.

In [11]:
np.savez_compressed('binaryCompressed.npz', firstColFlt, secColFlt)
npzCFile = np.load('binaryCompressed.npz')
print npzCFile.files
print npzCFile['arr_0']

['arr_1', 'arr_0']
[0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
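
To get a feel for the trade-off, you can compare the size on disk of the two formats. The sketch below is only illustrative: the array of zeros is a deliberately repetitive stand-in that compresses far better than typical measured data, and the file names are hypothetical.

```python
from __future__ import print_function  # print() behaves the same on Python 2 and 3

import os
import tempfile

import numpy as np

# Highly repetitive stand-in data; real data will usually compress less well
data = np.zeros(100000)

out_dir = tempfile.mkdtemp()
plain_path = os.path.join(out_dir, "plain.npz")
packed_path = os.path.join(out_dir, "packed.npz")

np.savez(plain_path, data=data)
np.savez_compressed(packed_path, data=data)

plain_size = os.path.getsize(plain_path)
packed_size = os.path.getsize(packed_path)
print("uncompressed bytes:", plain_size)
print("compressed bytes:  ", packed_size)

# Both files load back to the same array
with np.load(packed_path) as packed_file:
    round_trip = packed_file["data"]
```

Whether the smaller file is worth the extra CPU time depends on your data and on how slow the transfer is, so it is worth measuring on a representative sample.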


## Pickle

Binary reads and writes are generally faster than ASCII. Read your data in its original form once, then save it in a binary form that is quicker to read on subsequent runs. The Python pickle package is a simple way to do this.

In [24]:
import pickle as pk

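# Use "with" so each file is closed (and flushed) as soon as we are done with it
with open( "firstCol.pkl", "wb" ) as pklFile:
    pk.dump(firstColFlt, pklFile)
with open( "firstCol.pkl", "rb" ) as pklFile:
    readInFirstCol = pk.load(pklFile)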
print readInFirstCol


[0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
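
The "read the text once, reuse the binary copy afterwards" idea can be wrapped in a small helper. This is a minimal sketch under stated assumptions: `load_with_cache` and the file names are hypothetical, and the helper does not check whether the text file has changed since the cache was written.

```python
from __future__ import print_function  # print() behaves the same on Python 2 and 3

import os
import pickle
import tempfile

import numpy as np


def load_with_cache(text_path, cache_path):
    """Return the array stored in text_path, using a pickle cache if present."""
    if os.path.exists(cache_path):
        # Fast path: a binary copy already exists
        with open(cache_path, "rb") as cache_file:
            return pickle.load(cache_file)
    # Slow path: parse the text file once, then write the cache
    data = np.loadtxt(text_path)
    with open(cache_path, "wb") as cache_file:
        pickle.dump(data, cache_file)
    return data


# Build a small text file to demonstrate with
work_dir = tempfile.mkdtemp()
text_path = os.path.join(work_dir, "example.txt")
cache_path = os.path.join(work_dir, "example.pkl")
np.savetxt(text_path, np.arange(6, dtype=float).reshape(3, 2))

first_read = load_with_cache(text_path, cache_path)   # parses text, writes cache
second_read = load_with_cache(text_path, cache_path)  # reads the pickle
print(second_read)
```

If the source text file can change, delete the cache file or compare modification times before trusting the cached copy.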


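Python 2 has cPickle, a C implementation of pickle that is considerably faster for large data dumps and reads. (In Python 3, the standard pickle module uses the C implementation automatically, so cPickle no longer exists.)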

In [27]:
import cPickle as pk

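# Use "with" so each file is closed (and flushed) as soon as we are done with it
with open( "secCol.pkl", "wb" ) as pklFile:
    pk.dump(secColFlt, pklFile)
with open( "secCol.pkl", "rb" ) as pklFile:
    readInSecCol = pk.load(pklFile)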
print readInSecCol

[ 10.  21.  32.  43.  54.  65.  76.  87.  98. 109.]


Note that we use the same modes as with standard Python files (w for write, r for read), plus b to open the file in binary mode.
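
The b mode only controls how the file is opened; the encoding of the data is set by the pickle protocol. In Python 2 the default is protocol 0, which is text-based, so it can help to request a binary protocol explicitly. The sketch below compares the two on a list of random floats (an arbitrary stand-in chosen for the example), using dumps so nothing is written to disk.

```python
from __future__ import print_function  # print() behaves the same on Python 2 and 3

import pickle
import random

# Arbitrary stand-in data: 1000 floats with long decimal representations
random.seed(0)
values = [random.random() for _ in range(1000)]

# Protocol 0 writes each float as text; the highest protocol packs it as 8 bytes
text_bytes = pickle.dumps(values, protocol=0)
binary_bytes = pickle.dumps(values, protocol=pickle.HIGHEST_PROTOCOL)

print("protocol 0 size:      ", len(text_bytes))
print("highest protocol size:", len(binary_bytes))

# The protocol is detected automatically when loading
restored = pickle.loads(binary_bytes)
```

The same `protocol` argument is accepted by `pickle.dump` when writing to a file. A file written with a newer protocol may not load in an older Python, so pick a lower number if the file has to be shared across versions.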

## Pandas

Pandas has read_csv, as well as some specialized I/O methods

In [15]:
import pandas as pd

testFrame = pd.read_csv( 'testScores.csv' )
print testFrame

   Names  Score  Unnamed: 2
0  Billy     98         NaN
1   Joel     95         NaN
2  Elton     96         NaN
3   John     85         NaN
4  James     92         NaN
5   Earl     91         NaN
6  Jones     88         NaN


Note that there is an extra, unnamed column with NaNs because each line ends with a ","
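
If you would rather not read the empty column at all, read_csv can be told which columns to keep with usecols. The sketch below is a minimal illustration that uses a short in-memory string as a stand-in for testScores.csv, assuming the header line ends with a comma just as the data lines do.

```python
from __future__ import print_function  # print() behaves the same on Python 2 and 3

import io

import pandas as pd

# In-memory stand-in for a CSV file whose lines all end with ","
csv_text = u"Names,Score,\nBilly,98,\nJoel,95,\n"

# Only the named columns are parsed; the empty trailing column is skipped
frame = pd.read_csv(io.StringIO(csv_text), usecols=["Names", "Score"])
print(frame)
```

Skipping unwanted columns at read time also saves a little parsing work and memory on large files.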

Pandas has a built-in method to store data using pickle

In [18]:
noUnnamedFrame = testFrame.drop( columns = "Unnamed: 2" )
noUnnamedFrame.to_pickle( 'testFrame.pkl' )
readFrame = pd.read_pickle( 'testFrame.pkl' )
print readFrame

   Names  Score
0  Billy     98
1   Joel     95
2  Elton     96
3   John     85
4  James     92
5   Earl     91
6  Jones     88


Pandas can also use HDF5 (through the PyTables package), a format that many other libraries and languages can read

In [22]:
namesFrame = noUnnamedFrame.drop( columns = "Score" )
hdf5File = pd.HDFStore( 'storeFrame.h5' )
hdf5File['namesFrame'] = namesFrame
readInNames = hdf5File['namesFrame']
hdf5File.close()  # flush to disk and release the file handle
print readInNames

   Names
0  Billy
1   Joel
2  Elton
3   John
4  James
5   Earl
6  Jones
