# Python Object Serialization


## Carlos Santillan 
___Principal Software Engineer___

_Bentley Systems Inc._
## PyData meetup December, 2017






## Pickle 

Pickle is a powerful standard-library module for serializing and deserializing Python objects.

Use cases:

* Save the state of a program to disk
* Transfer data over a network connection
* Store Python objects in a database
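As a rough illustration of the last use case, the sketch below stores a pickled object as a BLOB in an in-memory SQLite database and reads it back. The table name `objects`, its columns, and the `settings` dictionary are hypothetical names chosen for this example only.

```python
import pickle
import sqlite3

# Hypothetical object to persist
settings = {"theme": "dark", "recent_files": ["a.txt", "b.txt"]}

# An in-memory database keeps the example self-contained
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (name TEXT PRIMARY KEY, payload BLOB)")

# pickle.dumps returns bytes, which sqlite3 stores as a BLOB
conn.execute(
    "INSERT INTO objects (name, payload) VALUES (?, ?)",
    ("settings", pickle.dumps(settings)),
)

# Read the bytes back and rebuild the object
row = conn.execute(
    "SELECT payload FROM objects WHERE name = ?", ("settings",)
).fetchone()
restored = pickle.loads(row[0])
conn.close()

print(restored == settings)
```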

In [1]:
import pickle
import sys

The `pickle` module implements a fundamental but powerful algorithm for serializing and de-serializing a Python object structure. “Pickling” is the process whereby a Python object hierarchy is converted into a byte stream, and “unpickling” is the inverse operation, whereby a byte stream is converted back into an object hierarchy.
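A minimal sketch of that round trip, done entirely in memory with `pickle.dumps` and `pickle.loads`. The `animals` dictionary is just sample data for this example.

```python
import pickle

# A small nested object hierarchy (sample data for illustration)
animals = {"Mammal": {"Feline": ["Cat", "Tiger"]}, "Reptile": {"Snakes": ["Mamba"]}}

# Pickling: object hierarchy -> byte stream
payload = pickle.dumps(animals)

# Unpickling: byte stream -> a new, equal object hierarchy
restored = pickle.loads(payload)

print(type(payload).__name__)   # the serialized form is a bytes object
print(restored == animals)      # same contents
print(restored is animals)      # but a separate copy, not the same object
```

The restored object compares equal to the original but is an independent copy, which is why the notebook can delete a variable and later rebuild it from the file.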

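`cPickle` is a faster C implementation of the same API, available as a separate module in Python 2.

Note: in Python 3, `pickle` automatically uses its C accelerator when one is available, so no separate import is needed.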

Protocol versions:
* 0 is the original “human-readable” protocol and is backwards compatible with earlier versions of Python.
* 1 is an old binary format which is also compatible with earlier versions of Python.
* 2 was introduced in Python 2.3. It provides much more efficient pickling of new-style classes.
* 3 was added in Python 3.0. It has explicit support for bytes objects and cannot be unpickled by Python 2.x. 
* 4 was added in Python 3.4. It adds support for very large objects, pickling more kinds of objects, and some data format optimizations.
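To get a feel for how the protocols differ, the sketch below pickles the same list with every protocol the running interpreter supports and reports the size of each result. The sample list is arbitrary, and the exact byte counts will vary with the Python version.

```python
import pickle

# Arbitrary sample data: a list of small integers
data = list(range(1000))

sizes = {}
for protocol in range(pickle.HIGHEST_PROTOCOL + 1):
    payload = pickle.dumps(data, protocol=protocol)
    # Every protocol must round-trip to an equal object
    assert pickle.loads(payload) == data
    sizes[protocol] = len(payload)

for protocol, size in sizes.items():
    print(f"protocol {protocol}: {size} bytes")
```

For data like this, the text-based protocol 0 typically produces a noticeably larger byte stream than the binary protocols.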


In [2]:
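# Helper functions
def savePickle(obj, filename, protocol=pickle.HIGHEST_PROTOCOL):
    # Use a context manager so the file is flushed and closed after writing
    with open(filename, "wb") as f:
        pickle.dump(obj, f, protocol)

def loadPickle(filename):
    # Read the byte stream back into a Python object
    with open(filename, "rb") as f:
        return pickle.load(f)

def printVariables():
    # Print the shallow size (in bytes) of every public global variable
    for var, obj in globals().items():
        if not var.startswith("_"):
            print(var + ":  " + str(sys.getsizeof(obj)))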

## Dictionary Sample

In [3]:
myDict = {'Mammal': { 'Feline':['Cat','Tiger'],'Canine':['Dog','Wolf']},
          'Reptile':{'Snakes':['Rattler','King Cobra','Mamba'],'Lizard':['Monitor','Gila Monster']  }
         }
print(myDict["Mammal"])
printVariables()



{'Feline': ['Cat', 'Tiger'], 'Canine': ['Dog', 'Wolf']}
In:  96
Out:  240
get_ipython:  64
exit:  56
quit:  56
pickle:  80
sys:  80
savePickle:  136
loadPickle:  136
printVariables:  136
myDict:  240


In [4]:
savePickle(myDict,"MyDictionary.p")

%ls
#%cat "MyDictionary.p"

 Volume in drive G is New Volume
 Volume Serial Number is 22C9-4C20

 Directory of G:\dev\python\PySerialization

11/25/2017  08:49 PM    <DIR>          .
11/25/2017  08:49 PM    <DIR>          ..
11/12/2017  06:23 PM             1,258 .gitignore
11/12/2017  06:26 PM    <DIR>          .ipynb_checkpoints
11/25/2017  08:49 PM           551,417 DecisionTree.p
11/12/2017  06:23 PM            11,558 LICENSE
11/25/2017  08:49 PM            16,576 MyDF.p
11/25/2017  08:49 PM            16,576 MyDFpd.p
11/25/2017  08:49 PM             4,252 MyDFpdCompressed.p
11/25/2017  08:49 PM               178 MyDictionary.p
11/25/2017  08:48 PM            18,741 Python Serialization.ipynb
11/12/2017  06:23 PM                69 README.md
               9 File(s)        620,625 bytes
               3 Dir(s)  688,604,397,568 bytes free


In [5]:
# Delete the variable from memory
del myDict
printVariables()
#print(myDict)

In:  128
Out:  240
get_ipython:  64
exit:  56
quit:  56
pickle:  80
sys:  80
savePickle:  136
loadPickle:  136
printVariables:  136


In [6]:
newDict = loadPickle("MyDictionary.p")

print(newDict)
print(newDict["Mammal"])


{'Mammal': {'Feline': ['Cat', 'Tiger'], 'Canine': ['Dog', 'Wolf']}, 'Reptile': {'Snakes': ['Rattler', 'King Cobra', 'Mamba'], 'Lizard': ['Monitor', 'Gila Monster']}}
{'Feline': ['Cat', 'Tiger'], 'Canine': ['Dog', 'Wolf']}


## Save Pandas to pickle



In [7]:
import pandas as pd

df = pd.DataFrame([range(1000), range(1,10000,10)])
df.head()


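| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 990 | 991 | 992 | 993 | 994 | 995 | 996 | 997 | 998 | 999 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 990 | 991 | 992 | 993 | 994 | 995 | 996 | 997 | 998 | 999 |
| 1 | 1 | 11 | 21 | 31 | 41 | 51 | 61 | 71 | 81 | 91 | ... | 9901 | 9911 | 9921 | 9931 | 9941 | 9951 | 9961 | 9971 | 9981 | 9991 |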


In [8]:
# Save using the pickle API
savePickle(df, "MyDF.p")
# Save using the DataFrame's own API
df.to_pickle("MyDFpd.p")
# Save using the DataFrame's API with gzip compression
df.to_pickle("MyDFpdCompressed.p", compression='gzip')
%ls

 Volume in drive G is New Volume
 Volume Serial Number is 22C9-4C20

 Directory of G:\dev\python\PySerialization

11/25/2017  08:49 PM    <DIR>          .
11/25/2017  08:49 PM    <DIR>          ..
11/12/2017  06:23 PM             1,258 .gitignore
11/12/2017  06:26 PM    <DIR>          .ipynb_checkpoints
11/25/2017  08:49 PM           551,417 DecisionTree.p
11/12/2017  06:23 PM            11,558 LICENSE
11/25/2017  08:49 PM            16,576 MyDF.p
11/25/2017  08:49 PM            16,576 MyDFpd.p
11/25/2017  08:49 PM             4,252 MyDFpdCompressed.p
11/25/2017  08:49 PM               178 MyDictionary.p
11/25/2017  08:48 PM            18,741 Python Serialization.ipynb
11/12/2017  06:23 PM                69 README.md
               9 File(s)        620,625 bytes
               3 Dir(s)  688,604,397,568 bytes free
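The listing shows the compressed file is considerably smaller than the uncompressed ones. A minimal sketch of the matching load step with `pd.read_pickle` follows, written against a temporary directory so it does not touch the files above. Because the file name here ends in `.p` rather than `.gz`, this sketch passes `compression="gzip"` explicitly when reading instead of relying on inference from the extension.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame([range(1000), range(1, 10000, 10)])

# Work in a temporary directory (an assumption made to keep the sketch self-contained)
with tempfile.TemporaryDirectory() as tmpdir:
    plain_path = os.path.join(tmpdir, "MyDFpd.p")
    gzip_path = os.path.join(tmpdir, "MyDFpdCompressed.p")

    df.to_pickle(plain_path)
    df.to_pickle(gzip_path, compression="gzip")

    plain_size = os.path.getsize(plain_path)
    gzip_size = os.path.getsize(gzip_path)

    # Load both files back; the gzip file needs the compression stated explicitly
    restored_plain = pd.read_pickle(plain_path)
    restored_gzip = pd.read_pickle(gzip_path, compression="gzip")

print(plain_size, gzip_size)
print(df.equals(restored_plain), df.equals(restored_gzip))
```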


## Save model


Pickle can be used to save and load a pre-trained scikit-learn model.


In [9]:
from sklearn import datasets
from sklearn import tree
# sklearn.cross_validation was deprecated in 0.18 and removed in 0.20; use model_selection
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

print ("* Training sample size : ", len(X_train))
print ("* Validation sample size : ", len(X_test))

dt = tree.DecisionTreeClassifier(criterion='gini',min_samples_split=5, random_state=1024)
dt.fit(X_train, y_train)


print(dt)
savePickle(dt,"DecisionTree.p")


* Training sample size :  353
* Validation sample size :  89
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=5,
            min_weight_fraction_leaf=0.0, presort=False, random_state=1024,
            splitter='best')




In [10]:
def printScores(amodel, xtrain,ytrain):
    tscores = amodel.score( xtrain, ytrain)
    print ("Training score is %f" % tscores)
    print ("Model depth is %i" % amodel.tree_.max_depth )
    
printScores(dt,X_train,y_train)
   

Training score is 0.592068
Model depth is 24


## Load the Model 


In [11]:
newDT = loadPickle("DecisionTree.p")

print(newDT)

printScores(newDT,X_train,y_train)


DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=5,
            min_weight_fraction_leaf=0.0, presort=False, random_state=1024,
            splitter='best')
Training score is 0.592068
Model depth is 24
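Matching scores suggest the model survived the round trip. A slightly stricter check is to compare predictions directly, sketched below. As an assumption made only to keep the example small, it swaps in the iris dataset and pickles to an in-memory byte string instead of using the diabetes data and `DecisionTree.p`.

```python
import pickle

import numpy as np
from sklearn import datasets, tree
from sklearn.model_selection import train_test_split

# Assumption: iris replaces the diabetes data purely to keep this sketch small
iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.20, random_state=42
)

model = tree.DecisionTreeClassifier(min_samples_split=5, random_state=1024)
model.fit(X_train, y_train)

# Round trip through an in-memory byte string instead of a file
restored = pickle.loads(pickle.dumps(model, protocol=pickle.HIGHEST_PROTOCOL))

# The restored model should behave exactly like the original
same_predictions = np.array_equal(model.predict(X_test), restored.predict(X_test))
same_depth = model.tree_.max_depth == restored.tree_.max_depth
print(same_predictions, same_depth)
```

The scikit-learn model persistence guide listed in the references recommends loading a pickled model with the same library version that saved it.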


## References


 * [pickle: Python object serialization](https://docs.python.org/3/library/pickle.html)
 * [Scikit Learn Model persistence](http://scikit-learn.org/stable/modules/model_persistence.html)
 * [Pandas and Pickle](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_pickle.html)
 * [Understanding Python pickling and how to use it securely](https://www.synopsys.com/blogs/software-security/python-pickling/)
 * [Serialization](https://en.wikipedia.org/wiki/Serialization)
 
### Other Serialization options
 
  * [MessagePack](https://msgpack.org/)
  * [JSON Serializer](https://docs.python.org/2/library/json.html)
  * [CloudPickle](https://github.com/cloudpipe/cloudpickle)
  * [pyyaml](http://pyyaml.org/wiki/PyYAMLDocumentation)
  * [Camel](http://camel.readthedocs.io/en/latest/)
  * [Joblib.dump](https://pythonhosted.org/joblib/generated/joblib.dump.html)
 