# Pickle to HDF

The bird classifier data from [Machine Learning is Fun! Part 3: Deep Learning and Convolutional Neural Networks](https://medium.com/@ageitgey/machine-learning-is-fun-part-3-deep-learning-and-convolutional-neural-networks-f40359318721#.tq9h4p590) is distributed as a pickle file.

The [pickle documentation](https://docs.python.org/3/library/pickle.html) puts this warning front and center: **Warning: The pickle module is not secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.**
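
To make the warning concrete, here is a minimal sketch of why unpickling is unsafe. The `Payload` class is hypothetical and deliberately harmless: its `__reduce__` method tells pickle to call `eval` on a fixed arithmetic string when the data is loaded. A malicious file could name any importable callable instead.

```python
import pickle


class Payload:
    """Hypothetical stand-in for a maliciously constructed object."""

    def __reduce__(self):
        # pickle records this (callable, args) pair and invokes it on load.
        # The expression here is harmless; an attacker controls both parts.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# Loading the bytes runs eval("6 * 7") instead of rebuilding a Payload.
result = pickle.loads(blob)
print(result)
```

The loaded value is whatever the embedded call returns, so simply opening an untrusted pickle is enough to run code chosen by its author.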

Additionally, the pickle file is much larger than it should be: it takes up more than 5 GB on disk.

In [12]:
import h5py
import pickle
from time import time
import py

Read the pickled dataset and record how long it took.

In [18]:
dataset = "full_dataset.pkl"
t1 = time()
# Use a context manager so the file handle is closed after loading.
with open(dataset, "rb") as f:
    X, Y, X_test, Y_test = pickle.load(f, encoding="bytes")
t2 = time()
mb = py.path.local(dataset).size()/1024 ** 2
print("It took {:.2f}s to load the {:.1f}MB bird classifier dataset from pickle".format(t2-t1,mb))

It took 50.02s to load the 5229.6MB bird classifier dataset from pickle
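
Before converting, it can help to check what pickle actually returned. The sketch below uses a hypothetical `describe` helper that reports the type of an object plus its `dtype` and `shape` when those attributes exist. It relies only on the standard library, and the list passed in is stand-in data, not the bird dataset.

```python
def describe(name, obj):
    """Return a one-line summary of an object's type, dtype, shape, and length."""
    dtype = getattr(obj, "dtype", None)   # present on NumPy arrays
    shape = getattr(obj, "shape", None)   # present on NumPy arrays
    length = len(obj) if hasattr(obj, "__len__") else None
    return "{}: type={}, dtype={}, shape={}, len={}".format(
        name, type(obj).__name__, dtype, shape, length
    )


# Stand-in data; in the notebook this would be X, Y, X_test, and Y_test.
summary = describe("Y", [1, 2, 3])
print(summary)
```

An array that reports `dtype=object`, or a plain list of arrays, is a hint that the data will need converting before it can be written to HDF5.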


Resave the pickled dataset as an HDF5 file, using h5py's [lossless compression filters](http://docs.h5py.org/en/latest/high/dataset.html#lossless-compression-filters).

In [26]:
hdf_dataset = "full_dataset.hdf5"

opts=dict()
opts["compression"]="gzip"
opts["compression_opts"]=9

t1 = time()
with h5py.File(hdf_dataset, "w") as fid:
    fid.create_dataset("X", data=(X,), **opts)
    fid.create_dataset("Y", data=(Y,), **opts)
    fid.create_dataset("X_test", data=(X_test, ), **opts)
    fid.create_dataset("Y_test", data=(Y_test, ), **opts)
t2 = time()

TypeError: Object dtype dtype('O') has no native HDF5 equivalent
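
The error indicates that at least one of the inputs ends up with NumPy's object dtype, which HDF5 cannot store directly. One possible workaround, sketched here on small stand-in data, is to stack the samples into a single array with a fixed numeric dtype before writing. This assumes every sample is numeric and has the same shape. The `to_native_array` helper and the choice of `float32` are illustrative, not taken from the original notebook.

```python
import os
import tempfile

import h5py
import numpy as np


def to_native_array(data, dtype=np.float32):
    """Stack equally shaped numeric samples into one array with a fixed dtype."""
    return np.stack([np.asarray(sample, dtype=dtype) for sample in data])


# Stand-in for the unpickled data: an object array holding four 2x2 arrays.
samples = np.empty(4, dtype=object)
for i in range(4):
    samples[i] = np.full((2, 2), i, dtype=np.float64)

stacked = to_native_array(samples)

# Write with the same gzip settings as above, then read the data back.
path = os.path.join(tempfile.mkdtemp(), "demo.hdf5")
with h5py.File(path, "w") as fid:
    fid.create_dataset("X", data=stacked, compression="gzip", compression_opts=9)

with h5py.File(path, "r") as fid:
    restored = fid["X"][...]

print(restored.shape, restored.dtype)
```

The sketch passes the array to `create_dataset` directly. If the input is already an array, wrapping it in a one-element tuple, as in `data=(X,)`, would add a leading axis of length 1. If the samples differ in shape, `np.stack` raises a `ValueError`, and a different layout (for example, one dataset per sample) would be needed.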