# Speeding Up I/O

I/O is often one of the most time-intensive parts of a program, in any language. If you find yourself reading in large data sets, or reading in the same data sets over and over, it is worth checking whether there is a better input method, such as saving your data once in a format that is faster to read.

This notebook looks at some different I/O methods:
* Numpy's binary forms (compressed and non-compressed)
* Pickle
* Pandas
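
Before switching formats, it helps to measure. The following is a minimal sketch of timing a text read against a binary read; the `time_call` helper, the file names, and the random data are assumptions made for illustration, and it uses `np.save`/`np.load` on a single array for simplicity. Actual timings will vary with your machine and data.

```python
import os
import tempfile
import time

import numpy as np


def time_call(func, *args, **kwargs):
    """Hypothetical helper: return (result, elapsed_seconds) for one call."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Illustrative data: two columns of random floats
data = np.random.default_rng(0).random((100_000, 2))

with tempfile.TemporaryDirectory() as tmp:
    txt_path = os.path.join(tmp, "demo.txt")
    npy_path = os.path.join(tmp, "demo.npy")

    # Write the same array once as text and once as binary
    np.savetxt(txt_path, data)
    np.save(npy_path, data)

    # Time reading each version back in
    from_txt, t_txt = time_call(np.loadtxt, txt_path)
    from_npy, t_npy = time_call(np.load, npy_path)

print(f"loadtxt (text):  {t_txt:.4f} s")
print(f"load (binary):   {t_npy:.4f} s")
```

A single timing is noisy, so repeat the measurement a few times before drawing conclusions.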

## Numpy

We used numpy's loadtxt and genfromtxt in October's sessions. These are quick to use, although parsing text is generally slower than loading binary data.

In [None]:
import numpy as np

# Read in the data
firstColFlt, secColFlt = np.loadtxt( "random.txt", dtype=float, usecols=[0,1], unpack=True )

print( "#-#-# firstColFlt")
print( firstColFlt )

print( "\n#-#-# secColFlt")
print( secColFlt )

### Saving in binary 

Numpy can save multiple arrays in a binary format with savez. If you read in data multiple times, it's likely worth it to write it to binary once, and then continue to read that. Watch the ordering: arrays passed as positional arguments are not saved under their variable names, but as arr_0, arr_1, and so on, in the order given.

In [None]:
# Save the data
np.savez('binaryNonCompressed.npz', firstColFlt, secColFlt)

# Read it back in
npzFile = np.load('binaryNonCompressed.npz')

print( "#-#-# files")
print( npzFile.files )

print( "\n#-#-# arr_1")
print( npzFile['arr_1'] )
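
If keeping track of positional order is error-prone, `np.savez` also accepts keyword arguments, and the keywords become the names stored in the file. A small sketch, with made-up array contents and a temporary file path chosen for illustration:

```python
import os
import tempfile

import numpy as np

# Illustrative arrays standing in for the two columns
first = np.array([1.0, 2.0, 3.0])
second = np.array([10.0, 20.0, 30.0])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "named.npz")

    # Keyword names are stored in the file instead of arr_0, arr_1
    np.savez(path, firstCol=first, secCol=second)

    # The with-block closes the underlying file when we are done
    with np.load(path) as npz:
        names = sorted(npz.files)
        first_back = npz["firstCol"]
        second_back = npz["secCol"]

print(names)
```

The same keyword form works with `np.savez_compressed`.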

Numpy also has a way to compress the binary data. This is slower (it takes time to compress and decompress) but may be beneficial if you are moving data between clusters.

In [None]:
# Save the data
np.savez_compressed('binaryCompressed.npz', firstColFlt, secColFlt)

# Read it back in
npzCFile = np.load('binaryCompressed.npz')

print( "#-#-# files")
print( npzCFile.files )

print( "\n#-#-# arr_0")
print( npzCFile['arr_0'] )
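
To judge whether compression is worth the extra time, compare the file sizes on disk. The sketch below uses an array of zeros, chosen because it is highly repetitive; the file names are illustrative. How much space you save depends heavily on the data, and random floats typically compress far less.

```python
import os
import tempfile

import numpy as np

# Highly repetitive data, picked so the size difference is easy to see
values = np.zeros(100_000)

with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "plain.npz")
    packed_path = os.path.join(tmp, "packed.npz")

    np.savez(plain_path, values)
    np.savez_compressed(packed_path, values)

    # Sizes in bytes of the two files
    plain_size = os.path.getsize(plain_path)
    packed_size = os.path.getsize(packed_path)

print(f"non-compressed: {plain_size} bytes")
print(f"compressed:     {packed_size} bytes")
```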

## Pickle

Binary reads and writes are generally faster than ASCII. Read your data in its original form once, then save it in a form that is faster to read on subsequent runs. The python pickle package stores python objects in a binary format and allows for very efficient I/O.

In [None]:
import pickle as pk

# Save to a pickle file; the with-block closes the file when done
with open( "firstCol.pkl", "wb" ) as pklFile:
    pk.dump( firstColFlt, pklFile )

# Read it back in
with open( "firstCol.pkl", "rb" ) as pklFile:
    readInFirstCol = pk.load( pklFile )

print( "#-#-# readInFirstCol")
print( readInFirstCol )


Python 2 has cPickle, which is considerably faster for large data dumps and reads. In Python 3, the standard pickle module automatically uses the C implementation when it is available, so no change to your code is needed.

In [None]:
# Use cPickle if you are using python 2
#import cPickle as pk

# If you wish to manually force the C implementation of pickle, you can use _pickle
import _pickle as pk

# Save to a pickle file; the with-block closes the file when done
with open( "secCol.pkl", "wb" ) as pklFile:
    pk.dump( secColFlt, pklFile )

# Read it back in
with open( "secCol.pkl", "rb" ) as pklFile:
    readInSecCol = pk.load( pklFile )

print( "#-#-# readInSecCol")
print( readInSecCol )

Note that we use the same modes as with standard python files (w for write and r for read), with b added to open the file in binary mode.
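
Pickle is not limited to numpy arrays; it can store most python objects, including containers that mix arrays with other values. The following is a minimal sketch with a made-up dictionary and file name. Passing `protocol=pickle.HIGHEST_PROTOCOL` is optional and selects the newest pickle format the running interpreter supports.

```python
import os
import pickle
import tempfile

import numpy as np

# Illustrative record mixing a string, an array, and a float
record = {"label": "run-1", "values": np.arange(5), "scale": 0.5}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "record.pkl")

    # "wb": write in binary mode
    with open(path, "wb") as handle:
        pickle.dump(record, handle, protocol=pickle.HIGHEST_PROTOCOL)

    # "rb": read in binary mode
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

print(restored["label"], restored["values"], restored["scale"])
```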

## Pandas

Pandas has read_csv, as well as some specialized I/O methods

In [1]:
import pandas as pd

testFrame = pd.read_csv( 'testScores.csv' )
print( testFrame )

   Names  Score  Unnamed: 2
0  Billy     98         NaN
1   Joel     95         NaN
2  Elton     96         NaN
3   John     85         NaN
4  James     92         NaN
5   Earl     91         NaN
6  Jones     88         NaN


Note that there is an extra, unnamed column with NaNs because each line ends with a ","
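
One way to avoid the extra column at read time is the `usecols` argument of `read_csv`. The sketch below uses a short in-memory CSV with the same trailing commas; the text is made up for illustration and only mimics the layout of testScores.csv.

```python
import io

import pandas as pd

# Illustrative CSV text in which every line ends with a comma
csv_text = "Names,Score,\nBilly,98,\nJoel,95,\n"

# Keep only the named columns; the empty trailing column is skipped
frame = pd.read_csv(io.StringIO(csv_text), usecols=["Names", "Score"])

print(frame)
```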

### Pandas binary using Pickle

Pandas has a built-in method to store data using pickle.

In [2]:
# Make a new frame without the unnamed rightmost column
noUnnamedFrame = testFrame.drop( columns = "Unnamed: 2" )

# Save this to a pickle file
noUnnamedFrame.to_pickle( 'testFrame.pkl' )

# Read the file back in
readFrame = pd.read_pickle( 'testFrame.pkl' )

print( readFrame )

   Names  Score
0  Billy     98
1   Joel     95
2  Elton     96
3   John     85
4  James     92
5   Earl     91
6  Jones     88


Pandas can also read and write HDF5, a format that many other libraries can read. This relies on the optional PyTables package being installed.

In [5]:
# Create a new frame without the score column
namesFrame = noUnnamedFrame.drop( columns = "Score" )

# Store this into an hdf5 file
namesFrame.to_hdf('writingHDF5.h5', key='data')
#hdf5File =  pd.HDFStore( 'storeFrame.h5' )
#hdf5File['namesFrame'] = namesFrame
#hdf5File.close()

# Read this back in
readInData = pd.read_hdf('writingHDF5.h5', mode='r')

print( readInData )

   Names
0  Billy
1   Joel
2  Elton
3   John
4  James
5   Earl
6  Jones


# Check yourself

In [6]:
# Variables to use in our examples
import numpy as np
alpha = np.arange(500)

Write alpha to a compressed binary file using numpy.  Read this back in and save the variable as beta.

In [7]:
# Try it here


Save beta as a pickle file.  Read this back in and save the variable as omega.

In [8]:
# Try it here
