# Pickle to HDF

The bird classifier data from [Machine Learning is Fun! Part 3: Deep Learning and Convolutional Neural Networks](https://medium.com/@ageitgey/machine-learning-is-fun-part-3-deep-learning-and-convolutional-neural-networks-f40359318721#.tq9h4p590) is distributed as a pickle file.

Front and center in the [pickle documentation](https://docs.python.org/3/library/pickle.html) is this warning: **The pickle module is not secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.**
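To make the warning concrete, here is a minimal sketch of why loading a pickle amounts to running code. The `Payload` class is hypothetical and deliberately harmless: it asks pickle to call `sorted` on load, where a hostile file could name any importable callable.

```python
import pickle


class Payload:
    """Hypothetical stand-in for an object crafted by an attacker."""

    def __reduce__(self):
        # Tells pickle: "to rebuild me, call sorted([3, 1, 2])".
        # Any importable callable could be named here instead.
        return (sorted, ([3, 1, 2],))


blob = pickle.dumps(Payload())

# Loading does not return a Payload; it returns whatever the call produced.
result = pickle.loads(blob)
print(type(result).__name__, result)
```

The point is that `pickle.loads` invoked a function chosen by whoever wrote the bytes, which is why the format is only appropriate for trusted data.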

Additionally, the pickle file is much larger than it needs to be: 5229.6MB on disk, versus 368.4MB once the same arrays are resaved as gzip-compressed HDF5.

In [12]:
import h5py
import pickle
from time import time
import py

## Read Pickle

Read the pickled dataset and record how long it took.

In [31]:
dataset = "full_dataset.pkl"
t1 = time()
# Open the file in a with-block so the handle is closed after loading.
with open(dataset, "rb") as f:
    X, Y, X_test, Y_test = pickle.load(f, encoding="bytes")
t2 = time()
mb = py.path.local(dataset).size()/1024 ** 2

In [32]:
print("It took {:.2f}s to load the {:.1f}MB bird classifier dataset from pickle".format(t2-t1,mb))

It took 48.07s to load the 5229.6MB bird classifier dataset from pickle


## Save HDF5

Resave the pickled dataset as an HDF5 file, using one of h5py's [lossless compression filters](http://docs.h5py.org/en/latest/high/dataset.html#lossless-compression-filters).

In [33]:
hdf_dataset = "full_dataset.hdf5"

# gzip is lossless; level 9 is the strongest and slowest setting.
opts = dict(compression="gzip", compression_opts=9)

t1 = time()
with h5py.File(hdf_dataset, "w") as fid:
    # Each array is wrapped in a 1-tuple, which adds a leading axis of length 1.
    fid.create_dataset("X", data=(X,), **opts)
    fid.create_dataset("Y", data=(Y,), **opts)
    fid.create_dataset("X_test", data=(X_test,), **opts)
    fid.create_dataset("Y_test", data=(Y_test,), **opts)
t2 = time()
mb = py.path.local(hdf_dataset).size()/1024 ** 2

In [36]:
print("It took {:.2f}s to save the {:.1f}MB bird classifier dataset as hdf5 with gzip-9 compression".format(t2-t1,mb))

It took 113.11s to save the 368.4MB bird classifier dataset as hdf5 with gzip-9 compression
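The `compression_opts` value is the gzip level. HDF5's gzip filter is built on the same DEFLATE algorithm as Python's `zlib` module, so the level setting can be explored with the standard library alone. This is only a sketch on a hypothetical repeating byte pattern, not on the bird images, and the sizes it prints say nothing about how the real dataset compresses.

```python
import zlib

# Hypothetical stand-in for pixel data: 1 MiB of a repeating byte pattern.
raw = bytes(range(256)) * 4096

sizes = {}
for level in (1, 9):
    packed = zlib.compress(raw, level)
    # Lossless: decompressing returns the exact original bytes.
    assert zlib.decompress(packed) == raw
    sizes[level] = len(packed)

for level, size in sizes.items():
    print("level {}: {} bytes from {}".format(level, size, len(raw)))
```

Higher levels spend more CPU time searching for matches, which is consistent with the save taking longer than either read in this notebook.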


## Read HDF5

Read in the HDF5 file.

In [37]:
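t1 = time()
with h5py.File(hdf_dataset, "r") as fid:
    # Each dataset was saved as a 1-tuple, so index [0] to unwrap it.
    # (Dataset.value was removed in h5py 3.0; plain indexing works everywhere.)
    X2 = fid["X"][0]
    Y2 = fid["Y"][0]
    # The test arrays must come from their own keys, not "X" and "Y".
    X_test2 = fid["X_test"][0]
    Y_test2 = fid["Y_test"][0]
t2 = time()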

In [38]:
print("It took {:.2f}s to read the {:.1f}MB bird classifier dataset from hdf5".format(t2-t1,mb))

It took 11.64s to read the 368.4MB bird classifier dataset from hdf5
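The recorded figures can be put side by side with a little arithmetic. The numbers below are copied from the cell outputs in this notebook; they are single-run timings, so treat the ratios as rough.

```python
# Figures copied from the recorded cell outputs in this notebook.
pickle_mb, hdf5_mb = 5229.6, 368.4
pickle_read_s, hdf5_read_s, hdf5_write_s = 48.07, 11.64, 113.11

size_ratio = pickle_mb / hdf5_mb            # how many times smaller on disk
read_speedup = pickle_read_s / hdf5_read_s  # how many times faster to load
# Number of reads needed before the one-off conversion time is repaid.
break_even_reads = hdf5_write_s / (pickle_read_s - hdf5_read_s)

print("{:.1f}x smaller, {:.1f}x faster to read".format(size_ratio, read_speedup))
print("conversion pays for itself after {:.1f} reads".format(break_even_reads))
```

On these figures the HDF5 file is roughly 14 times smaller and about 4 times faster to load, and the conversion time is recovered after a handful of reads.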


## Data Integrity 

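Check that the HDF5 file and the pickle file hold the same data. The assertion output recorded below comes from a run in which `X_test2` and `Y_test2` were read from the `"X"` and `"Y"` keys, so the test-set comparison fails on shape: `(15000, 32, 32, 3)` for `X_test` versus `(56780, 32, 32, 3)` for the training array. With the test arrays read from `"X_test"` and `"Y_test"`, the shapes line up.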

In [42]:
import numpy.testing

In [44]:
numpy.testing.assert_array_equal(X, X2)
numpy.testing.assert_array_equal(X_test, X_test2)
numpy.testing.assert_array_equal(Y, Y2)
numpy.testing.assert_array_equal(Y_test, Y_test2)

AssertionError: 
Arrays are not equal

(shapes (15000, 32, 32, 3), (56780, 32, 32, 3) mismatch)
 x: array([[[[ 0.619608,  0.439216,  0.192157],
         [ 0.623529,  0.435294,  0.184314],
         [ 0.647059,  0.454902,  0.2     ],...
 y: array([[[[ 0.047059,  0.117647,  0.035294],
         [ 0.070588,  0.117647,  0.054902],
         [ 0.047059,  0.078431,  0.035294],...
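A shape mismatch like this one is typical of a copy-and-paste slip in the dataset keys. One way to rule it out is to drive the write, the read, and the comparison from a single dictionary of names. The sketch below assumes `h5py` and `numpy` are installed; `save_arrays`, `load_arrays`, and the small random arrays are hypothetical stand-ins, not the bird dataset.

```python
import os
import tempfile

import h5py
import numpy as np


def save_arrays(path, arrays, **opts):
    """Write each named array to its own HDF5 dataset."""
    with h5py.File(path, "w") as fid:
        for name, array in arrays.items():
            fid.create_dataset(name, data=array, **opts)


def load_arrays(path, names):
    """Read each named dataset fully into memory, keyed by name."""
    with h5py.File(path, "r") as fid:
        return {name: fid[name][()] for name in names}


# Small stand-in arrays (hypothetical data, not the bird dataset).
rng = np.random.default_rng(0)
original = {
    "X": rng.random((8, 32, 32, 3)),
    "Y": rng.integers(0, 2, size=(8, 2)),
    "X_test": rng.random((3, 32, 32, 3)),
    "Y_test": rng.integers(0, 2, size=(3, 2)),
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.hdf5")
    save_arrays(path, original, compression="gzip", compression_opts=9)
    restored = load_arrays(path, original.keys())

# Every array is compared against the dataset stored under the same name.
for name, array in original.items():
    np.testing.assert_array_equal(array, restored[name])
```

Because each name appears only once, a test array cannot be compared against a training array by accident. Here the arrays are stored directly rather than wrapped in a tuple, so no `[0]` is needed when reading.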