# Pickle to HDF

The bird classifier data from [Machine Learning is Fun! Part 3: Deep Learning and Convolutional Neural Networks](https://medium.com/@ageitgey/machine-learning-is-fun-part-3-deep-learning-and-convolutional-neural-networks-f40359318721#.tq9h4p590) is distributed as a pickle file.

Front and center in the [pickle documentation](https://docs.python.org/3/library/pickle.html) is this warning: **The pickle module is not secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.**
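To make the warning concrete, here is a minimal sketch of why loading a pickle can run code. The `Payload` class is hypothetical and deliberately harmless: its `__reduce__` method tells pickle to call `eval` on an arithmetic string when the data is loaded. A hostile file could name any importable callable instead.

```python
import pickle


class Payload:
    """Hypothetical object that controls how it is rebuilt when unpickled."""

    def __reduce__(self):
        # pickle stores "call eval('6 * 7')" instead of the object's state.
        # A malicious pickle could reference any importable callable here.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# Loading the bytes calls the stored callable; no Payload instance comes back.
result = pickle.loads(blob)
print(result)
```

The loader never asks for confirmation, which is why the dataset file should only be unpickled if its source is trusted.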

Additionally, the pickle file is much larger than it needs to be. [HDF5](https://support.hdfgroup.org/HDF5/) is a file format designed for storing large numerical datasets, and it supports transparent compression.

In [1]:
import h5py
import pickle
import py
from time import time

## Read Pickle

Read the pickled dataset and record how long it takes.

The file `full_dataset.pkl` comes from [`data.zip`](https://s3-us-west-2.amazonaws.com/ml-is-fun/data.zip).

In [2]:
dataset = "full_dataset.pkl"

In [3]:
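t1 = time()
# Only unpickle files you trust. encoding="bytes" loads Python 2 strings as bytes.
with open(dataset, "rb") as fid:
    X, Y, X_test, Y_test = pickle.load(fid, encoding="bytes")
t2 = time()
mb = py.path.local(dataset).size() / 1024 ** 2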


In [4]:
print("It took {:.2f}s to load the {:.1f}MB bird classifier dataset from pickle".format(t2-t1,mb))

It took 23.35s to load the 5229.6MB bird classifier dataset from pickle
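The `t1`/`t2` pattern above is repeated for every step in this notebook. As a sketch only, the same measurement can be wrapped in a small context manager. The `stopwatch` helper and the `timings` dictionary are hypothetical names, and `perf_counter` is swapped in for `time` because it is the standard-library clock intended for measuring intervals.

```python
from contextlib import contextmanager
from time import perf_counter, sleep


@contextmanager
def stopwatch(label, results):
    """Store the elapsed wall-clock seconds of the with-block in results[label]."""
    start = perf_counter()
    try:
        yield
    finally:
        results[label] = perf_counter() - start


timings = {}
with stopwatch("demo", timings):
    sleep(0.01)  # stand-in for pickle.load or an h5py read

print("demo took {:.2f}s".format(timings["demo"]))
```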


## Save HDF5

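Resave the pickled dataset as an HDF5 file with gzip compression. The available options are described in the h5py documentation on [lossless compression filters](http://docs.h5py.org/en/latest/high/dataset.html#lossless-compression-filters).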
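The effect of the compression options can be tried on a small array before committing to a long write. The sketch below uses synthetic, highly repetitive data (an assumption; real image data will compress less well) and compares file sizes with no compression, gzip level 1, and gzip level 9.

```python
import os
import tempfile

import h5py
import numpy as np

# Synthetic stand-in for the dataset: 1000x1000 bytes with a repeating pattern.
data = np.zeros((1000, 1000), dtype=np.uint8)
data[::2] = 255

variants = {
    "none": {},
    "gzip-1": {"compression": "gzip", "compression_opts": 1},
    "gzip-9": {"compression": "gzip", "compression_opts": 9},
}

sizes = {}
with tempfile.TemporaryDirectory() as tmp:
    for label, opts in variants.items():
        path = os.path.join(tmp, label + ".hdf5")
        with h5py.File(path, "w") as fid:
            fid.create_dataset("X", data=data, **opts)
        sizes[label] = os.path.getsize(path)  # size on disk in bytes

for label, size in sizes.items():
    print("{:>7}: {:.1f} KB".format(label, size / 1024))
```

Higher gzip levels generally trade longer write times for smaller files, so a lower level may be worth testing when write time matters more than file size.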

In [5]:
hdf_dataset = "full_dataset.hdf5"

# gzip level 9 gives the smallest file at the cost of the slowest write
opts = dict(compression="gzip", compression_opts=9)

t1 = time()
with h5py.File(hdf_dataset, "w") as fid:
    # Wrapping each array in a 1-tuple adds a leading axis of length 1,
    # which the read step strips again with [0].
    fid.create_dataset("X", data=(X,), **opts)
    fid.create_dataset("Y", data=(Y,), **opts)
    fid.create_dataset("X_test", data=(X_test,), **opts)
    fid.create_dataset("Y_test", data=(Y_test,), **opts)
t2 = time()
mb = py.path.local(hdf_dataset).size() / 1024 ** 2

In [6]:
print("It took {:.2f}s to save the {:.1f}MB bird classifier dataset as hdf5 with gzip-9 compression".format(t2-t1,mb))

It took 114.47s to save the 368.4MB bird classifier dataset as hdf5 with gzip-9 compression
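The save cell passes `data=(X,)` rather than `data=X`. The sketch below, using a small hypothetical array `X_demo` and an in-memory file that is never written to disk, shows what that wrapping does to the stored shape and why the stored dataset must be indexed with `[0]` to recover the original array.

```python
import h5py
import numpy as np

X_demo = np.arange(12).reshape(3, 4)  # hypothetical stand-in for X

# driver="core" with backing_store=False keeps the file in memory only.
with h5py.File("demo.hdf5", "w", driver="core", backing_store=False) as fid:
    wrapped = fid.create_dataset("wrapped", data=(X_demo,))
    direct = fid.create_dataset("direct", data=X_demo)

    wrapped_shape = wrapped.shape  # extra leading axis of length 1
    direct_shape = direct.shape    # same shape as X_demo
    first = wrapped[0]             # strips the extra axis again

print(wrapped_shape, direct_shape)
```

Passing the array directly would store it without the extra axis, in which case the read step would not need the `[0]` index.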


## Read HDF5

Read in the HDF5 file.

In [7]:
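t1 = time()
with h5py.File(hdf_dataset, "r") as fid:
    # Dataset.value was removed in h5py 3.0; index the dataset directly.
    # [0] strips the leading axis added when the data was saved.
    X2 = fid["X"][0]
    Y2 = fid["Y"][0]
    X_test2 = fid["X_test"][0]
    Y_test2 = fid["Y_test"][0]
t2 = time()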

In [8]:
print("It took {:.2f}s to read the {:.1f}MB bird classifier dataset from hdf5".format(t2-t1,mb))

It took 6.58s to read the 368.4MB bird classifier dataset from hdf5
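The read above loads every array in full. An h5py dataset can also be sliced, so that only the requested rows are read from the file. A minimal sketch with a hypothetical 100x10 dataset held in an in-memory file:

```python
import h5py
import numpy as np

with h5py.File("partial.hdf5", "w", driver="core", backing_store=False) as fid:
    fid.create_dataset(
        "X", data=np.arange(1000).reshape(100, 10), compression="gzip"
    )

    dset = fid["X"]
    shape = dset.shape  # taken from metadata, no array data is read
    head = dset[:5]     # reads only the first five rows into a NumPy array

print(shape, head.shape)
```

Unlike `pickle.load`, which has to deserialize the whole file, this lets you inspect shapes or sample a few records without reading everything into memory.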


## Data Integrity

Check that the HDF5 file and the pickle file contain the same data.

In [9]:
import numpy.testing

In [10]:
numpy.testing.assert_array_equal(X, X2)
numpy.testing.assert_array_equal(X_test, X_test2)
numpy.testing.assert_array_equal(Y, Y2)
numpy.testing.assert_array_equal(Y_test, Y_test2)
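`assert_array_equal` is silent when the arrays match and raises `AssertionError` when they differ, so a cell that finishes without output means the check passed. The sketch below shows both cases on small hypothetical arrays, with the four comparisons written as a loop.

```python
import numpy as np
import numpy.testing

# Hypothetical stand-ins for the pickle data and the HDF5 round trip.
original = {"X": np.arange(6).reshape(2, 3), "Y": np.array([0, 1])}
restored = {name: array.copy() for name, array in original.items()}

# Passes silently: every pair of arrays is identical.
for name in original:
    numpy.testing.assert_array_equal(original[name], restored[name])

# Corrupt one value to confirm that a mismatch is actually reported.
restored["Y"][0] = 9
try:
    numpy.testing.assert_array_equal(original["Y"], restored["Y"])
    mismatch_detected = False
except AssertionError:
    mismatch_detected = True

print("mismatch detected:", mismatch_detected)
```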