## Reading binary data formats



-   Storing data efficiently in binary format
-   **Serialization**: converting an in-memory object into a byte stream that can be stored and restored later
-   The `pickle` module allows writing Python objects to disk
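As a minimal sketch of what serialization with `pickle` looks like outside pandas, the block below writes a plain Python dictionary to disk and reads it back using only the standard library. The `record` dictionary and the temporary file name are made up for illustration.

```python
import os
import pickle
import tempfile

# Hypothetical object to serialize; any picklable Python object would do.
record = {'station': 'demo', 'values': [1.0, 2.5, 4.0]}

# Write to a throwaway directory so the example leaves no files behind in the project.
path = os.path.join(tempfile.mkdtemp(), 'record.pkl')

# Pickle files are binary, so open them with 'wb' / 'rb'.
with open(path, 'wb') as fh:
    pickle.dump(record, fh)

with open(path, 'rb') as fh:
    restored = pickle.load(fh)
```

`restored` is a new object equal to `record`, not the same object in memory.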



In [1]:
import pandas as pd
frame = pd.read_csv('examples/ex1.csv')
frame

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [2]:
frame.to_pickle('frame.pkl')

We can then read the pickled object back with `pd.read_pickle`



In [3]:
pd.read_pickle('frame.pkl')

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


### A few tips



-   `pickle` is recommended only as a short-term storage format: an object pickled today may fail to unpickle with a later version of the library that created it
-   pandas can also use HDF5
-   `bcolz`: a compressible column-oriented binary format based on the Blosc compression library
-   `Feather`: a cross-language column-oriented file format based on the Apache Arrow columnar memory format



## Using the HDF5 format



-   HDF5 is a well-regarded file format intended for storing large quantities of scientific array data
-   It is available as a C library, and it has interfaces available in many other languages
-   The "HDF" in HDF5 stands for hierarchical data format; the format supports
    -   multiple datasets in one file
    -   metadata
    -   on-the-fly compression
    -   reading and writing small sections of large arrays



Although we can work with HDF5 files directly through modules such as `PyTables` or `h5py`, pandas provides a higher-level interface for storing Series and DataFrame objects

In [4]:
import pandas as pd
import numpy as np
frame = pd.DataFrame({'a': np.random.randn(100)})

In [5]:
store = pd.HDFStore('mydata.h5', 'w')
store['obj1'] = frame
store['obj1_col'] = frame['a']
store

<class 'pandas.io.pytables.HDFStore'>
File path: mydata.h5

Objects stored in the file can be retrieved by key, as with a dictionary



In [6]:
store['obj1']

Unnamed: 0,a
0,-0.306696
1,0.738166
2,0.534466
3,1.570626
4,-0.745497
5,0.291285
6,-2.115760
7,-0.632431
8,0.691021
9,0.208223


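`HDFStore` supports two storage schemas, `'fixed'` (the default) and `'table'`. The `'table'` format is generally slower, but it supports query operations through `where`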



In [7]:
store.put('obj2', frame, format='table')

In [8]:
store.select('obj2', where=['index >= 10 and index <= 15'])

Unnamed: 0,a
10,-0.656137
11,0.583059
12,-1.380217
13,0.222848
14,2.687638
15,0.453271
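The difference between the two schemas can be sketched as follows. This is an illustrative example, not part of the original notebook: the `demo` frame, the key names, and the temporary file are assumptions, and it requires the `PyTables` package to be installed. A `where` query works on a `'table'` object, while the same kind of query on a `'fixed'` object is rejected.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical frame with a default integer index 0..19.
demo = pd.DataFrame({'a': np.arange(20, dtype=float)})
path = os.path.join(tempfile.mkdtemp(), 'formats_demo.h5')

# The context manager closes the store (and flushes writes) on exit.
with pd.HDFStore(path, mode='w') as demo_store:
    demo_store.put('fixed_obj', demo)                   # default format: 'fixed'
    demo_store.put('table_obj', demo, format='table')   # queryable format

    # 'table' objects can be filtered on disk; both bounds are inclusive here.
    subset = demo_store.select('table_obj', where='index >= 10 and index <= 15')

    # 'fixed' objects must be read in their entirety, so a query is refused.
    try:
        demo_store.select('fixed_obj', where='index >= 10')
        fixed_query_ok = True
    except (TypeError, ValueError):
        fixed_query_ok = False
```

A reasonable rule of thumb is to use `'fixed'` when an object is always read whole and `'table'` when only parts of it will be selected.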


Closing the `HDFStore` flushes any pending writes to disk and releases the file



In [1]:
store.close()

`DataFrame.to_hdf` writes a frame to the file without managing an `HDFStore` explicitly



In [1]:
frame.to_hdf('mydata.h5', key='obj3', format='table')

`pd.read_hdf` reads it back and accepts the same `where` queries



In [1]:
pd.read_hdf('mydata.h5', 'obj3', where=['index < 5'])

## Reading Microsoft Excel files



-   pandas also supports reading tabular data stored in Excel
-   Internally it uses the packages `xlrd` and `openpyxl` to read XLS and XLSX files, respectively



In [10]:
frame = pd.read_excel('examples/ex1.xlsx', 'Sheet1')
frame

Unnamed: 0.1,Unnamed: 0,a,b,c,d,message
0,0,1,2,3,4,hello
1,1,5,6,7,8,world
2,2,9,10,11,12,foo


Writing Excel files is simple



In [11]:
frame.to_excel('examples/ex2.xlsx')
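By default `to_excel` also writes the row index as an extra column, which is why a file written this way comes back with an `Unnamed: 0` column when read again. The sketch below, using a made-up `demo` frame and a temporary file, shows one way to avoid that with `index=False`. It assumes `openpyxl` is installed.

```python
import os
import tempfile

import pandas as pd

# Hypothetical frame standing in for the data read from ex1.xlsx.
demo = pd.DataFrame({'a': [1, 5, 9], 'message': ['hello', 'world', 'foo']})
path = os.path.join(tempfile.mkdtemp(), 'demo.xlsx')

# index=False keeps the row labels out of the spreadsheet.
demo.to_excel(path, sheet_name='Sheet1', index=False)

# Read the same sheet back; the engine is named explicitly for clarity.
roundtrip = pd.read_excel(path, sheet_name='Sheet1', engine='openpyxl')
```

If the index carries meaning (dates, station IDs), keep it in the file and pass `index_col=0` to `read_excel` instead.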

## Homework



1.  Write a function that reads streamflow data from USGS webpages by taking a station ID, a start and an end date as input arguments.
2.  Retrieve and plot time series of 3 stations of your choosing (same period for all three).
3.  For the time series you retrieved, calculate the correlations between them.
4.  Without using the resample function from pandas, calculate the monthly average for one of the time series you downloaded.

