# [Pandas and HDF5](http://pandas.pydata.org/)

In [None]:
%matplotlib inline

In [None]:
import os
import shutil
import glob
import sqlite3 as sqlite
DATADIR = os.path.join(os.path.expanduser("~"),
                       "DATA", "Bioinf")
print(os.path.exists(DATADIR))

import pandas as pd
import seaborn as sns

sns.set()

## Reading/Writing Text Data with Pandas

One of the beauties of Pandas is how easy it makes data input/output. It can read a variety of common data formats, including:
* Tabular text data
    * [``read_csv``](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html): read comma-separated files
    * [``read_table``](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_table.html): read tab-separated files
        * These are both wrappers around the same parser with different default values
* Excel
    * [``read_excel``](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html)
* Databases
    * [``read_sql``](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql.html)
* HDF5, a high performance file format for very large data
    * [``read_hdf``](http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.read_hdf.html)
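
As a minimal sketch of how ``read_csv`` and ``read_table`` relate, the cell below parses the same small table twice, once comma-separated and once tab-separated. The gene names and values are made-up toy data (not taken from the PANCAN12 file), and ``io.StringIO`` stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical toy data, used only for illustration.
comma_text = "gene,sample_a,sample_b\nTP53,1.5,2.0\nBRCA1,0.3,0.7\n"
tab_text = comma_text.replace(",", "\t")

# read_csv defaults to a comma separator, read_table to a tab separator.
from_csv = pd.read_csv(io.StringIO(comma_text))
from_table = pd.read_table(io.StringIO(tab_text))

# Passing sep explicitly lets read_table parse the comma-separated text too.
from_table_sep = pd.read_table(io.StringIO(comma_text), sep=",")

print(from_csv)
print("Same result from both readers:", from_csv.equals(from_table))
```

Because only the default separator differs, the choice between the two functions is mostly a matter of which default matches the file at hand.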



## What is HDF5?

> HDF5 is a data model, library, and file format for storing and managing data. It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data. HDF5 is portable and is extensible, allowing applications to evolve in their use of HDF5. ([HDF5 Home Page](https://www.hdfgroup.org/HDF5/))

> HDF5 is a unique technology suite that makes possible the management of extremely large and complex data collections. ([HDF5 FAQ](https://www.hdfgroup.org/about/hdf_technologies.html))

HDF5 has its roots in the late 1980s at the National Center for Supercomputing Applications, and the format was adopted shortly thereafter by NASA as the data format for its Earth Observing System ([Wikipedia](https://en.wikipedia.org/wiki/Hierarchical_Data_Format)).

## Why Use HDF5?

HDF5 offers both disk utilization and performance improvements over naive file formats such as CSV files. We can illustrate this with gene expression data. In the ``/home/jovyan/DATA/Bioinf`` directory we have the PANCAN12 data file stored in two formats:

1. The original text file (``PANCAN12.IlluminaHiSeq_RNASeqV2.geneExp.tumor_whitelist``)
1. The data stored as an HDF5 file (``PANCAN12.IlluminaHiSeq_RNASeqV2.geneExp.tumor_whitelist.hdf5``)

The HDF5 file is only about 2/3 the size of the original file. We will also see noticeable improvements in how long it takes to read the data from disk.

```bash
-rw-r--r-- 1 jovyan staff 538029908 Jun  8 20:49 PANCAN12.IlluminaHiSeq_RNASeqV2.geneExp.tumor_whitelist.hdf5                      
-rwxr-xr-x 1 jovyan staff 839788724 Jun  8 20:50 PANCAN12.IlluminaHiSeq_RNASeqV2.geneExp.tumor_whitelist                           
```
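
The "about 2/3" figure can be checked directly from the byte counts in the listing above. This is only a quick arithmetic sketch; the variable names are chosen here for illustration.

```python
# Byte counts copied from the directory listing above.
hdf5_bytes = 538029908
text_bytes = 839788724

# Fraction of the original size that the HDF5 file occupies.
ratio = hdf5_bytes / text_bytes

# Disk space saved, in megabytes (1 MB = 10**6 bytes).
saved_mb = (text_bytes - hdf5_bytes) / 1e6

print("HDF5 / text size ratio: %.3f" % ratio)
print("Space saved: %.1f MB" % saved_mb)
```

The ratio comes out at roughly 64%, a little under two thirds.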

## Simple Profiling

One simple way we can evaluate the performance of our programs is by using Python's ``time`` module to measure how long it takes for a command to execute. This is somewhat naive because our computers are multitasking, so a program might take a longer or shorter amount of time depending on how many competing tasks the computer is running. But it is a reasonable starting approach.

Within the ``time`` module is a function [``time``](https://docs.python.org/3/library/time.html) that, on Unix systems, returns the number of elapsed seconds since January 1, 1970 ([see Unix epoch](https://en.wikipedia.org/wiki/Unix_time)).

We can save the Unix time before we start the command and compare that to the Unix time after our command executes.

**Note:** it takes tens of seconds to read the file in.
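
The before/after pattern can be wrapped in a small helper so it is easy to reuse. The function name ``timed`` is hypothetical, and this sketch uses ``time.perf_counter`` rather than ``time.time``: it is a standard-library clock intended for measuring short durations, but the idea is the same.

```python
import time


def timed(func, *args, **kwargs):
    """Call func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Toy workload standing in for a slow file read.
total, seconds = timed(sum, range(1_000_000))
print("Elapsed time: %.4f s" % seconds)
```

A single measurement like this is still subject to the multitasking noise described above, so repeating the call a few times and comparing the results gives a more trustworthy picture.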

In [None]:
import time

# Path to the original tab-separated text file
url_txt = os.path.join(
    DATADIR, "PANCAN12.IlluminaHiSeq_RNASeqV2.geneExp.tumor_whitelist")
start = time.time()
data = pd.read_table(url_txt)  # keep the result so later cells can use it
time_table = time.time() - start
print("Elapsed time to read original file", time_table)
#data

### Now Measure Time to Read in HDF5

In [None]:
# Path to the same data stored as HDF5
url_hdf = os.path.join(
    DATADIR, "PANCAN12.IlluminaHiSeq_RNASeqV2.geneExp.tumor_whitelist.hdf5")
start = time.time()
pd.read_hdf(url_hdf)
time_hdf5 = time.time() - start
print("Elapsed time to read hdf5 file", time_hdf5)

## Slicing

In [None]:
data[0:2]


## Pandas and HDF5

* HDF5 is a high-performance binary data format whose reference library is written in C
* HDF5 facilitates a number of performance enhancements, such as accessing parts of the data without having to read the whole dataset into memory
* Python has two different packages that provide an HDF5 interface
    * [h5py](http://www.h5py.org/)
    * [pytables](http://www.pytables.org/moin)
* Pandas uses PyTables to interface with HDF5
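
To illustrate reading only part of a dataset, here is a minimal sketch that writes a small toy table to a temporary HDF5 file and then reads back just a few rows. The table, the file name ``toy.hdf5`` and the key ``expr`` are all made up for this example. It assumes the data is written with ``format="table"``, which is the layout that supports row selection; the ``try``/``except`` only guards against PyTables not being installed.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical toy table standing in for a large expression matrix.
frame = pd.DataFrame(np.arange(20.0).reshape(5, 4),
                     columns=["s1", "s2", "s3", "s4"])

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "toy.hdf5")
    try:
        # format="table" writes a layout that can be queried row by row.
        frame.to_hdf(path, key="expr", format="table")
        # Read only the first two rows from disk.
        first_rows = pd.read_hdf(path, key="expr", start=0, stop=2)
        # Read only the rows whose index satisfies a condition.
        late_rows = pd.read_hdf(path, key="expr", where="index >= 3")
    except ImportError:
        # PyTables is an optional pandas dependency; skip if it is missing.
        first_rows = late_rows = None

print(first_rows)
print(late_rows)
```

On a five-row table the saving is negligible, but the same calls applied to a file of several hundred megabytes avoid loading rows that are never used.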


In [None]:
!conda install pytables -y

In [None]:
url_hdf = os.path.join(DATADIR, "PANCAN12.IlluminaHiSeq_RNASeqV2.geneExp.tumor_whitelist.hdf5")
start = time.time()
data_hdf5 = pd.read_hdf(url_hdf)
time_hdf5 = time.time()-start
print("HDF5 %5.4f x faster than traditional read"%(time_table/time_hdf5))

In [None]:
data_hdf5

In [None]:
21000*3000*8
