
# 6.2 Binary Data Formats
One of the easiest ways to store data (also known as *serialization*) efficiently in binary format is using Python’s built-in `pickle` serialization. pandas objects all have a `to_pickle` method that writes the data to disk in `pickle` format:

In [1]:
import pandas as pd
import numpy as np

In [2]:
frame = pd.read_csv('../examples/ex1.csv')
frame

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [3]:
frame.to_pickle('../examples/frame_pickle')

You can read any “pickled” object stored in a file by using the built-in `pickle` module directly, or even more conveniently by using `pandas.read_pickle`:

In [4]:
pd.read_pickle('../examples/frame_pickle')

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo
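
The two reading routes mentioned above can be compared side by side. This is a minimal sketch: the small frame is an illustrative stand-in for `ex1.csv`, and the file is written to a temporary directory rather than `../examples/`.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd

# Illustrative stand-in for the data loaded from ex1.csv above.
frame = pd.DataFrame(
    {"a": [1, 5, 9], "b": [2, 6, 10], "message": ["hello", "world", "foo"]}
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "frame_pickle"
    frame.to_pickle(path)

    # Route 1: the standard-library pickle module.
    with open(path, "rb") as fh:
        via_pickle = pickle.load(fh)

    # Route 2: the pandas convenience wrapper.
    via_pandas = pd.read_pickle(path)

# Both routes should give back a frame equal to the original.
print(via_pickle.equals(frame), via_pandas.equals(frame))
```

Either route runs whatever code the pickle contains while loading, so it is only appropriate for files from a trusted source.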


#### `pandas` has built-in support for two more binary data formats: `HDF5` and `MessagePack` (the latter only in releases before pandas 1.0, which removed it).

### Using HDF5 Format
* HDF5 is a well-regarded file format intended for storing large quantities of scientific array data. It is available as a C library, and it has interfaces available in many other languages, including Java, Julia, MATLAB, and Python.
* The “HDF” in HDF5 stands for hierarchical data format.
* HDF5 can be a good choice for working with very large datasets that don’t fit into memory, as you can efficiently read and write small sections of much larger arrays.
* pandas provides a high-level interface that simplifies storing Series and DataFrame objects. The `HDFStore` class works like a dict and handles the low-level details:

In [5]:
frame = pd.DataFrame({'a': np.random.randn(100)})
frame

Unnamed: 0,a
0,-1.429856
1,-0.207395
2,1.057811
3,2.223234
4,-0.129972
...,...
95,-1.415094
96,0.691735
97,-0.210971
98,-3.007026


In [6]:
store = pd.HDFStore('mydata.h5')
store['obj1'] = frame
store['obj1_col'] = frame['a']
store

<class 'pandas.io.pytables.HDFStore'>
File path: mydata.h5

Objects contained in the `HDF5` file can then be **retrieved with the same dict-like API:**

In [7]:
store['obj1']

Unnamed: 0,a
0,-1.429856
1,-0.207395
2,1.057811
3,2.223234
4,-0.129972
...,...
95,-1.415094
96,0.691735
97,-0.210971
98,-3.007026
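
The dict-like behaviour goes a little further than item assignment and lookup. The sketch below assumes the optional PyTables package (`tables`) is installed; the function name and the file name `demo.h5` are hypothetical, chosen only for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def hdfstore_roundtrip():
    """Write a frame and one column to a temporary HDF5 file, then inspect the store."""
    frame = pd.DataFrame({"a": np.random.randn(100)})
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.h5")  # hypothetical file name
        # The with-block closes the file even if an error occurs.
        with pd.HDFStore(path, mode="w") as store:
            store["obj1"] = frame
            store["obj1_col"] = frame["a"]
            keys = sorted(store.keys())   # stored keys carry a leading "/"
            has_obj1 = "obj1" in store    # membership test, as with a dict
            restored = store["obj1"]
            del store["obj1_col"]         # remove an entry, as with a dict
            remaining = store.keys()
    return frame, restored, keys, has_obj1, remaining
```

Using `HDFStore` as a context manager is an alternative to calling `store.close()` by hand, which is easy to forget in an interactive session.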


* `HDFStore` supports two storage schemas, 'fixed' and 'table'. 
* The latter is generally slower, **but it supports query operations** using a special syntax:

In [8]:
store.put('obj2', frame, format='table')
store.select('obj2', where=['index >= 10 and index <= 15'])

Unnamed: 0,a
10,-2.113088
11,-1.493659
12,-1.372625
13,0.834503
14,-0.160536
15,0.471164


In [9]:
store.close()
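
To make the difference between the two formats concrete, the sketch below stores the same frame both ways and queries it. It assumes PyTables is installed; the key names `fixed_copy` and `table_copy` and the function name are hypothetical. Passing `data_columns=True` is what allows the column `a`, and not only the index, to appear in a `where` clause.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def query_table_store():
    """Store one frame in both formats and query the 'table' copy."""
    rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
    frame = pd.DataFrame({"a": rng.standard_normal(100)})
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.h5")  # hypothetical file name
        with pd.HDFStore(path, mode="w") as store:
            store.put("fixed_copy", frame)  # 'fixed' is the default format
            # data_columns=True makes every column usable in a where clause.
            store.put("table_copy", frame, format="table", data_columns=True)

            by_index = store.select("table_copy", where="index >= 10 & index <= 15")
            by_value = store.select("table_copy", where="a > 0")

            # A 'fixed' store can only be read in its entirety.
            try:
                store.select("fixed_copy", where="index >= 10")
                fixed_rejected = False
            except TypeError:
                fixed_rejected = True
    return frame, by_index, by_value, fixed_rejected
```

The trade-off is that the queryable `table` copy is generally slower to write and read than the `fixed` copy.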

* `put` is an explicit version of the `store['obj2'] = frame` assignment, but it also lets us set other options, such as the storage format.
* The `pandas.read_hdf` function gives you a shortcut to these tools:

In [10]:
frame.to_hdf('mydata.h5', key='obj3', format='table')
pd.read_hdf('mydata.h5', 'obj3', where=['index < 5'])

Unnamed: 0,a
0,-1.429856
1,-0.207395
2,1.057811
3,2.223234
4,-0.129972
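
Because a `table`-format dataset can be read in sections, `pandas.read_hdf` can also fetch a range of rows by position through its `start` and `stop` parameters. This is a sketch assuming PyTables is installed; the function name, the file name, and the simple `0..99` data are illustrative choices.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def read_slices():
    """Write a table-format dataset, then read back only parts of it."""
    frame = pd.DataFrame({"a": np.arange(100, dtype="float64")})
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.h5")  # hypothetical file name
        frame.to_hdf(path, key="obj3", format="table", mode="w")

        # Select rows with a query on the index ...
        head_by_where = pd.read_hdf(path, key="obj3", where="index < 5")
        # ... or by row position, without writing a query.
        head_by_rows = pd.read_hdf(path, key="obj3", start=0, stop=5)
        middle = pd.read_hdf(path, key="obj3", start=40, stop=45)
    return head_by_where, head_by_rows, middle
```

Both approaches avoid loading the full dataset, which is the property that makes HDF5 attractive for data that does not fit into memory.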


### Reading Microsoft Excel Files
* pandas also supports reading tabular data stored in Excel 2003 (and higher) files using either the `ExcelFile` class or `pandas.read_excel` function.
* Internally these tools use the add-on packages `xlrd` and `openpyxl` to read XLS and XLSX files, respectively. 

In [11]:
xlsx = pd.ExcelFile('../examples/ex1.xlsx')
pd.read_excel(xlsx, 'Sheet1')

Unnamed: 0.1,Unnamed: 0,a,b,c,d,message
0,0,1,2,3,4,hello
1,1,5,6,7,8,world
2,2,9,10,11,12,foo


#### If you are reading multiple sheets in a file, then it is faster to create the `ExcelFile`, but you can also simply pass the filename to `pandas.read_excel`:

In [12]:
frame = pd.read_excel('../examples/ex1.xlsx', 'Sheet1')
frame

Unnamed: 0.1,Unnamed: 0,a,b,c,d,message
0,0,1,2,3,4,hello
1,1,5,6,7,8,world
2,2,9,10,11,12,foo
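
The multi-sheet case can be sketched as follows, assuming an Excel engine such as `openpyxl` is installed. The helper names and the two-sheet workbook are hypothetical; the workbook is built only so the example is self-contained. Passing `index_col=0` treats the first column as the index, which avoids an extra `Unnamed: 0` column like the one in the output above.

```python
import os
import tempfile

import pandas as pd


def read_all_sheets(path):
    """Return {sheet name: DataFrame} for every sheet in the workbook at `path`."""
    # Opening the workbook once avoids re-parsing the file for each sheet.
    with pd.ExcelFile(path) as xlsx:
        return {name: xlsx.parse(name, index_col=0) for name in xlsx.sheet_names}


def demo_read_all_sheets():
    """Build a small two-sheet workbook (illustrative data) and read it back."""
    first = pd.DataFrame({"a": [1, 5, 9], "message": ["hello", "world", "foo"]})
    second = pd.DataFrame({"b": [2, 6, 10]})
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.xlsx")  # hypothetical file name
        with pd.ExcelWriter(path) as writer:
            first.to_excel(writer, sheet_name="Sheet1")
            second.to_excel(writer, sheet_name="Sheet2")
        return read_all_sheets(path)
```

`demo_read_all_sheets()` returns a dict keyed by sheet name, so a single sheet is available as, for example, `sheets["Sheet1"]`.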


#### To write pandas data to Excel format, you must first create an `ExcelWriter`, then write data to it using pandas objects’ `to_excel` method:

In [13]:
writer = pd.ExcelWriter('../examples/ex2.xlsx')
frame.to_excel(writer, sheet_name='Sheet1')
writer.close()  # saves the file; writer.save() was removed in pandas 2.0

You can also pass a file path to `to_excel` and avoid the `ExcelWriter`:

In [14]:
frame.to_excel('../examples/ex2.xlsx')
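
An `ExcelWriter` is still needed when several frames go into one workbook. The sketch below assumes an Excel engine such as `openpyxl` is installed; the helper names, sheet names, and data are hypothetical. It uses the writer as a context manager, so the file is saved on leaving the `with` block, and passes `index=False` so the row labels are not written as a column.

```python
import os
import tempfile

import pandas as pd


def write_workbook(path, frames):
    """Write each DataFrame in `frames` (sheet name -> DataFrame) to one workbook."""
    # Leaving the with-block saves and closes the file.
    with pd.ExcelWriter(path) as writer:
        for sheet_name, df in frames.items():
            # index=False omits the row labels, so no "Unnamed: 0" column
            # appears when the sheet is read back.
            df.to_excel(writer, sheet_name=sheet_name, index=False)


def demo_write_workbook():
    """Write two illustrative sheets, then read every sheet back as a dict."""
    frames = {
        "numbers": pd.DataFrame({"a": [1, 5, 9], "b": [2, 6, 10]}),
        "words": pd.DataFrame({"message": ["hello", "world", "foo"]}),
    }
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.xlsx")  # hypothetical file name
        write_workbook(path, frames)
        # sheet_name=None asks read_excel for all sheets at once.
        return pd.read_excel(path, sheet_name=None)
```

Whether to keep the index depends on the data: a meaningful index (dates, identifiers) is usually worth writing, while a default `0, 1, 2, ...` index rarely is.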