# Data Science: Understanding Pickling in Python

## What is Pickling?
&emsp;**Pickling in Python**, or **data serialization/de-serialization**, is an important tool in a Data Scientist's arsenal!<br><br>
&emsp;Data serialization is the process of converting structured data (objects) into a series of bytes so that the data can be stored on disk or shared over a network and reconstructed later. A compact binary format can also reduce disk space or bandwidth requirements.<br>
&emsp;Python's ```pickle``` module serializes and de-serializes Python object structures.
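
&emsp;To make "a series of bytes" concrete, here is a minimal sketch of an in-memory round trip using ```pickle.dumps()``` and ```pickle.loads()```. The ```record``` dictionary is made-up example data, not something used later in this article.

```python
import pickle

# A small object to serialize (hypothetical example data).
record = {"name": "house_1", "area": 2600, "bedrooms": 3}

# dumps() returns the serialized form as a bytes object.
payload = pickle.dumps(record)
print(type(payload), len(payload))

# loads() rebuilds an equal, but separate, object from those bytes.
restored = pickle.loads(payload)
print(restored == record, restored is record)
```

The restored object compares equal to the original but is a distinct object in memory.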

## Types of Data (based on structure)
- Flat Data<br>
    ```{"Category" : "XYZ", "parameter1": "value1", "parameter2": "value2", "parameter3": "value3" }```
- Nested Data<br>
    ```{"XYZ" : {"parameter1": "value1", "parameter2": "value2", "parameter3": "value3" }}```

&emsp;As can be seen, Flat Data has a single level of properties, or “key : value” pairs, while Nested Data has multiple levels, with sub-objects within.
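
&emsp;The difference shows up in how values are reached. The short sketch below reuses the two shapes above (trimmed to two parameters) and also checks that a nested structure survives a pickle round trip; the variable names are illustrative only.

```python
import pickle

flat = {"Category": "XYZ", "parameter1": "value1", "parameter2": "value2"}
nested = {"XYZ": {"parameter1": "value1", "parameter2": "value2"}}

# Flat data: a single lookup reaches the value.
print(flat["parameter1"])

# Nested data: one lookup per level.
print(nested["XYZ"]["parameter1"])

# pickle handles both shapes; the inner dictionary is rebuilt as well.
restored = pickle.loads(pickle.dumps(nested))
print(restored["XYZ"]["parameter1"])
```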


## Using Pickle Module
&emsp;Pickle is part of Python's standard library, so it can be imported directly:

In [1]:
import pickle

### To serialize data
Let's take some arbitrary data to "dump" it using pickle.

In [2]:
# An arbitrary collection of objects supported by pickle.
data = {
    'area' : [2600,3000,3200,3600,4000,4100],
    'bedrooms' : [3,4,4,3,5,6],
    'age' : [20,15,18,30,8,8],
    'price' : [550000,565000,610000,595000,760000,810000]
}

To serialize an object hierarchy, you simply call the ```dump()``` function. It writes the pickled representation of the object to an open file object.

In [3]:
with open('mydata.pickle', 'wb') as f:
    # Pickle the 'data' dictionary using the highest protocol available.
    pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)

In the above code, we've opened the ```mydata.pickle``` file in binary write mode (```'wb'```) and dumped the object held in the ```data``` variable using ```pickle.HIGHEST_PROTOCOL```, the newest pickle protocol version that the running interpreter supports. The alternative, ```pickle.DEFAULT_PROTOCOL```, is the version used when no protocol is given. If the file is absent from the working directory, it is created automatically; if it already exists, it is overwritten.
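
Both constants are plain integers that identify a pickle protocol version. As a rough sketch, the snippet below serializes a shortened copy of the data with every protocol the current interpreter supports and records the size of each result. The exact numbers depend on the Python version, so none are shown here.

```python
import pickle

# Shortened copy of the housing data, for illustration only.
data = {'area': [2600, 3000, 3200], 'price': [550000, 565000, 610000]}

print(pickle.DEFAULT_PROTOCOL, pickle.HIGHEST_PROTOCOL)

# Serialize with each supported protocol and compare payload sizes.
sizes = {}
for protocol in range(pickle.HIGHEST_PROTOCOL + 1):
    payload = pickle.dumps(data, protocol=protocol)
    assert pickle.loads(payload) == data  # every protocol round-trips the data
    sizes[protocol] = len(payload)
print(sizes)
```

Newer protocols are generally more efficient, but a file written with a newer protocol may not be readable by an older Python version.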

**Advantages of dumping to a file:**
- As discussed before, the result is a single, self-contained binary file that can be shared easily, and anyone with a compatible Python environment can retrieve the data by unpickling it and loading it for further use.
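
If file size matters when sharing, the pickle stream can also be written through ```gzip```, since ```pickle.dump()``` only needs a binary file-like object. The sketch below is one way to do that; the larger ```data``` dictionary and the temporary paths are assumptions made for the example.

```python
import gzip
import os
import pickle
import tempfile

# Hypothetical, larger data so that the size difference is visible.
data = {'area': list(range(1000)), 'price': [550000] * 1000}

tmpdir = tempfile.mkdtemp()
plain_path = os.path.join(tmpdir, 'mydata.pickle')
gz_path = os.path.join(tmpdir, 'mydata.pickle.gz')

# Plain pickle file.
with open(plain_path, 'wb') as f:
    pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)

# Same pickle stream, compressed on the way to disk.
with gzip.open(gz_path, 'wb') as f:
    pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)

print(os.path.getsize(plain_path), os.path.getsize(gz_path))

# Reading it back mirrors the write: open with gzip, then unpickle.
with gzip.open(gz_path, 'rb') as f:
    restored = pickle.load(f)
print(restored == data)
```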

### To de-serialize data
Let's take our arbitrary data from ```mydata.pickle``` file and "load" it using pickle.

In [5]:
# Open the file in binary read mode; the 'with' block closes it automatically.
with open('mydata.pickle', 'rb') as datafile:
    unpickled_data = pickle.load(datafile)
for key, items in unpickled_data.items():
    print(key, ':', items)

area : [2600, 3000, 3200, 3600, 4000, 4100]
bedrooms : [3, 4, 4, 3, 5, 6]
age : [20, 15, 18, 30, 8, 8]
price : [550000, 565000, 610000, 595000, 760000, 810000]


In [6]:
for keys,items in data.items():
    print(keys, ':', items)

area : [2600, 3000, 3200, 3600, 4000, 4100]
bedrooms : [3, 4, 4, 3, 5, 6]
age : [20, 15, 18, 30, 8, 8]
price : [550000, 565000, 610000, 595000, 760000, 810000]


As you can see, the data retrieved from the ```mydata.pickle``` file is exactly the same as the original ```data```.

There is one warning, though, as stated in the official pickle documentation:
> Warning: The pickle module is not secure. Only unpickle data you trust.

So in case you're trying to unpickle a file provided by someone else, do so with the utmost care: unpickling can execute arbitrary code.
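
To illustrate why, the sketch below follows the "restricting globals" idea from the pickle documentation: a subclass of ```pickle.Unpickler``` whose ```find_class()``` refuses every global lookup. That is enough for plain data such as our dictionary of lists. The names ```PlainDataUnpickler``` and ```restricted_loads``` are hypothetical helpers, and the hand-written payload is a harmless stand-in for a malicious one (it would only call ```print```). This narrows the risk but does not make untrusted pickles safe.

```python
import io
import pickle


class PlainDataUnpickler(pickle.Unpickler):
    """Hypothetical helper: allows only built-in containers and scalars."""

    def find_class(self, module, name):
        # Any reference to a class or function is rejected outright.
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(payload):
    """Unpickle bytes with the restricted unpickler above."""
    return PlainDataUnpickler(io.BytesIO(payload)).load()


# Plain data needs no globals, so it loads normally.
safe_payload = pickle.dumps({'area': [2600, 3000], 'age': [20, 15]})
print(restricted_loads(safe_payload))

# Hand-written protocol-0 pickle that asks the loader to call print('hi').
# A real attack would reference something far more dangerous.
unsafe_payload = b"cbuiltins\nprint\n(S'hi'\ntR."
try:
    restricted_loads(unsafe_payload)
    blocked = False
except pickle.UnpicklingError as exc:
    blocked = True
    print('blocked:', exc)
```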

## Practical Usage
&emsp;We can use this unpickled data to train a simple Linear Regression Model to predict House Prices! Let's give it a try.

In [10]:
import pandas as pd
df = pd.DataFrame(data=unpickled_data)
df

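|   | area | bedrooms | age | price  |
|---|------|----------|-----|--------|
| 0 | 2600 | 3        | 20  | 550000 |
| 1 | 3000 | 4        | 15  | 565000 |
| 2 | 3200 | 4        | 18  | 610000 |
| 3 | 3600 | 3        | 30  | 595000 |
| 4 | 4000 | 5        | 8   | 760000 |
| 5 | 4100 | 6        | 8   | 810000 |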


In [13]:
from sklearn.linear_model import LinearRegression
myMLmodel = LinearRegression()
myMLmodel.fit(df.drop('price',axis='columns'),df['price'])

LinearRegression()

In [14]:
myMLmodel.predict([[3000, 3, 40]])

array([498408.25158031])

Speaking of practical usage, we can even dump our fitted ML model object to a pickle file. That way we have a ready-to-use model that does not need to be trained again and again.

In [15]:
with open('myMLmodel.pickle', 'wb') as f:
    # Pickle the ML model using the highest protocol available.
    pickle.dump(myMLmodel, f, protocol=pickle.HIGHEST_PROTOCOL)

Let's test our theory and see whether our model got serialized or not.

In [16]:
# Load the model back; the 'with' block closes the file automatically.
with open('myMLmodel.pickle', 'rb') as modelfile:
    unpickled_MLmodel = pickle.load(modelfile)

In [17]:
# Predict with the unpickled model (not the original) to verify the round trip.
unpickled_MLmodel.predict([[3000, 3, 40]])

array([498408.25158031])
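
The "train once, reuse later" idea can be wrapped in a small helper. The sketch below is one possible shape, not a standard API: ```load_or_train``` is a hypothetical function, and ```train``` returns a plain dictionary as a stand-in for a fitted model so that the example runs without scikit-learn.

```python
import os
import pickle
import tempfile


def load_or_train(path, train):
    """Hypothetical helper: load the pickled object at `path`, or train and save it.

    Returns (model, trained), where `trained` says whether training ran.
    """
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f), False
    model = train()
    with open(path, 'wb') as f:
        pickle.dump(model, f, protocol=pickle.HIGHEST_PROTOCOL)
    return model, True


calls = []


def train():
    calls.append(1)  # count how often training actually runs
    # Stand-in for a fitted model: made-up parameters in a plain dict.
    return {'coef': [1.5, -2.0], 'intercept': 10.0}


path = os.path.join(tempfile.mkdtemp(), 'model.pickle')
first, trained_first = load_or_train(path, train)
second, trained_second = load_or_train(path, train)
print(trained_first, trained_second, first == second)
```

The second call finds the file on disk and skips training entirely.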

Voila! That is how pickling can be helpful to a Data Scientist!