<a href="https://colab.research.google.com/github/baut-jc/ddds-c18/blob/lectures/Lectures/6-1a_File_Handling_with_Pickle_end.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pickle

The `pickle` module serializes and de-serializes Python objects.
Serialization encodes an object as a stream of bytes that can be stored or transmitted and later de-serialized back into a Python object.
<br>
A pickled object contains all of the information needed to reconstruct it later (for instances of custom classes, the class definition must still be importable when unpickling).
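
As a minimal sketch of that round trip, the example below serializes a small dictionary to bytes with `pickle.dumps` and rebuilds it with `pickle.loads`. The `record` dictionary is made up for illustration.

```python
import pickle

# Hypothetical object used only for illustration.
record = {"id": "train_3", "target": False, "values": [4.6739, 22.3915]}

# Serialize: Python object -> bytes.
payload = pickle.dumps(record)

# De-serialize: bytes -> a new, equal Python object.
restored = pickle.loads(payload)

print(type(payload))        # the serialized form is a bytes object
print(restored == record)   # compares contents
print(restored is record)   # compares identity
```

The restored dictionary has the same contents as the original but is a separate object in memory.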

**Pros**
- Easy to use and lightweight.
- Lets you move Python objects across a network.
- Useful for pausing and resuming a long-running script (you can dump the program's state and load it later).
- Useful for persistence across program runs.
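
The pause-and-resume idea can be sketched as a checkpoint file. This is an illustrative example under assumptions, not part of the lecture's workflow: the file name `checkpoint.p`, the `state` dictionary, and the helper `run` are all hypothetical.

```python
import os
import pickle
import tempfile

# Hypothetical checkpoint location, placed in a temporary directory.
checkpoint_path = os.path.join(tempfile.mkdtemp(), "checkpoint.p")

def run(total_steps, stop_after=None):
    """Sum the integers below total_steps, resuming from a checkpoint if one exists."""
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "rb") as file:
            state = pickle.load(file)      # resume from saved state
    else:
        state = {"step": 0, "total": 0}    # fresh start

    while state["step"] < total_steps:
        if stop_after is not None and state["step"] >= stop_after:
            break                          # simulate an interruption
        state["total"] += state["step"]
        state["step"] += 1
        # Save progress after every step so the work can be resumed.
        with open(checkpoint_path, "wb") as file:
            pickle.dump(state, file)
    return state

first = run(total_steps=10, stop_after=4)   # stops early
second = run(total_steps=10)                # picks up where it left off
print(first, second)
```

Writing the checkpoint after every step is the simplest choice; a real script would likely save less often to limit disk writes.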

**Cons**
- The format is Python-specific, so other programming languages generally cannot reconstruct pickled objects.
- Not human-readable.
- Security: unpickling can execute arbitrary code, so you can accidentally run malicious code if you unpickle data from a source you don't trust.
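
To make the security point concrete, here is a deliberately harmless sketch. A class can define `__reduce__` to tell pickle which callable to invoke when the data is loaded. Here that callable is `os.getcwd`, but a malicious pickle could name any importable function. The class `Innocent` is hypothetical.

```python
import os
import pickle

class Innocent:
    """Hypothetical class whose pickle runs a function when loaded."""
    def __reduce__(self):
        # (callable, args): pickle.loads will call os.getcwd() on load.
        return (os.getcwd, ())

payload = pickle.dumps(Innocent())

# Loading does not return an Innocent instance; it returns whatever
# the embedded callable returned.
result = pickle.loads(payload)
print(type(result).__name__, result == os.getcwd())
```

The loader never asks for permission before calling the function, which is why pickle files should only be read when their origin is trusted.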

## Pickling a dictionary of a data frame

In [None]:
import numpy as np
import pandas as pd
import pickle

Load a CSV file.

In [None]:
url = 'https://ddc-datascience.s3.amazonaws.com/Projects/Project.1-Transactions/Data/Transaction.train.big.csv'


In [None]:
!curl -O {url}

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  178M  100  178M    0     0  38.7M      0  0:00:04  0:00:04 --:--:-- 39.2M


In [None]:
data = pd.read_csv( url )
data = data.drop(['Unnamed: 0'], axis=1)

In [None]:
data.shape

(1050000, 103)

In [None]:
data.head()

Unnamed: 0,ID_code,target,var_0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,...,var_91,var_92,var_93,var_94,var_95,var_96,var_97,var_98,var_99,var_100
0,train_0,,,,,,,,,,...,,,,,,,,,,
1,train_1,,,,,,,,,,...,,,,,,,,,,
2,train_2,,,,,,,,,,...,,,,,,,,,,
3,train_3,0.0,4.6739,22.3915,15.6015,,0.0464,,,-1.9254,...,,11.1077,,-12.6465,,,,,,14.0618
4,train_4,,,,,,,,,,...,,,,,,,,,,


In [None]:
data.iloc[:,0:3].info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1050000 entries, 0 to 1049999
Data columns (total 3 columns):
 #   Column   Non-Null Count    Dtype  
---  ------   --------------    -----  
 0   ID_code  1050000 non-null  object 
 1   target   180000 non-null   float64
 2   var_0    180000 non-null   float64
dtypes: float64(2), object(1)
memory usage: 24.0+ MB


In [None]:
data['target'].value_counts( dropna=False)


Unnamed: 0_level_0,count
target,Unnamed: 1_level_1
,870000
0.0,161960
1.0,18040


In [None]:
data['target'] = data['target'].astype("boolean")


In [None]:
data.iloc[:,0:3].info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1050000 entries, 0 to 1049999
Data columns (total 3 columns):
 #   Column   Non-Null Count    Dtype  
---  ------   --------------    -----  
 0   ID_code  1050000 non-null  object 
 1   target   180000 non-null   boolean
 2   var_0    180000 non-null   float64
dtypes: boolean(1), float64(1), object(1)
memory usage: 18.0+ MB


In [None]:
data['target'].value_counts(dropna=False)


Unnamed: 0_level_0,count
target,Unnamed: 1_level_1
,870000
False,161960
True,18040


Convert the data frame to a dictionary of Series, then pickle the dictionary.

In [None]:
dict_data = data.to_dict(orient = 'series')
dict_data


{'ID_code': 0                train_0
 1                train_1
 2                train_2
 3                train_3
 4                train_4
                ...      
 1049995    train_1049995
 1049996    train_1049996
 1049997    train_1049997
 1049998    train_1049998
 1049999    train_1049999
 Name: ID_code, Length: 1050000, dtype: object,
 'target': 0           <NA>
 1           <NA>
 2           <NA>
 3          False
 4           <NA>
            ...  
 1049995     True
 1049996     <NA>
 1049997    False
 1049998     <NA>
 1049999    False
 Name: target, Length: 1050000, dtype: boolean,
 'var_0': 0              NaN
 1              NaN
 2              NaN
 3           4.6739
 4              NaN
             ...   
 1049995    11.3251
 1049996        NaN
 1049997     8.9941
 1049998        NaN
 1049999     2.1828
 Name: var_0, Length: 1050000, dtype: float64,
 'var_1': 0              NaN
 1              NaN
 2              NaN
 3          22.3915
 4              NaN
             .
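
To see what `to_dict(orient='series')` produces without printing a million rows, the sketch below applies it to a small, made-up frame (`small` and its values are illustrative, not taken from the dataset). Each key is a column name and each value is that column as a pandas `Series`; `pd.DataFrame.from_dict` reassembles the frame.

```python
import pandas as pd

# Small illustrative frame; values are invented for the example.
small = pd.DataFrame({
    "ID_code": ["train_0", "train_1", "train_2"],
    "target": pd.array([None, False, True], dtype="boolean"),
    "var_0": [float("nan"), 4.5, 8.25],
})

as_dict = small.to_dict(orient="series")
print(list(as_dict))                  # column names become the keys
print(type(as_dict["target"]))        # each value is a pandas Series

# Rebuild the frame from the dictionary of Series and compare.
rebuilt = pd.DataFrame.from_dict(as_dict)
print(rebuilt.equals(small))
print(rebuilt.dtypes["target"])       # dtype of the rebuilt target column
```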

In [None]:
ls -la --si


total 188M
drwxr-xr-x 1 root root 4.1k Jul 10 22:07 ./
drwxr-xr-x 1 root root 4.1k Jul 10 22:03 ../
drwxr-xr-x 4 root root 4.1k Jul  9 21:17 .config/
drwxr-xr-x 1 root root 4.1k Jul  9 21:17 sample_data/
-rw-r--r-- 1 root root 188M Jul 10 22:07 Transaction.train.big.csv


In [None]:
# Pickle writes bytes, so open the file in binary write mode ('wb').
with open('dict_data.p', 'wb') as file:
    pickle.dump(dict_data, file)


In [None]:
ls -la

total 1028884
drwxr-xr-x 1 root root      4096 Jul 10 22:11 ./
drwxr-xr-x 1 root root      4096 Jul 10 22:03 ../
drwxr-xr-x 4 root root      4096 Jul  9 21:17 .config/
-rw-r--r-- 1 root root 866218159 Jul 10 22:12 dict_data.p
drwxr-xr-x 1 root root      4096 Jul  9 21:17 sample_data/
-rw-r--r-- 1 root root 187331086 Jul 10 22:07 Transaction.train.big.csv


In [None]:
!ls -la --si ./dict_data.p


-rw-r--r-- 1 root root 867M Jul 10 22:12 ./dict_data.p
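
The listings above show the pickle at 867M against 188M for the CSV. One plausible contributor (an interpretation, not something the lecture states) is that most cells in this dataset are empty: CSV writes nothing between the commas for a missing value, while a pickled `float64` column stores 8 bytes per cell regardless. The sketch below checks that effect on a small synthetic frame; the shape, the roughly 15% fill rate, and the variable names are assumptions.

```python
import pickle
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic frame: 5,000 rows x 20 float columns, about 85% missing.
values = rng.normal(size=(5_000, 20)).round(4)
values[rng.random(values.shape) > 0.15] = np.nan
frame = pd.DataFrame(values, columns=[f"var_{i}" for i in range(20)])

# Size of each serialized form, measured in memory rather than on disk.
csv_bytes = len(frame.to_csv(index=False).encode("utf-8"))
pickle_bytes = len(pickle.dumps(frame))

print(f"CSV:    {csv_bytes:>9,} bytes")
print(f"pickle: {pickle_bytes:>9,} bytes")
print(f"ratio:  {pickle_bytes / csv_bytes:.1f}x")
```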


Read the pickle file.

In [None]:
# Open in binary read mode ('rb'); only unpickle files you created or trust.
with open('dict_data.p', 'rb') as file:
    dict_data_read = pickle.load(file)

type(dict_data_read)

dict

In [None]:
df_dict_data_read = pd.DataFrame.from_dict(dict_data_read)
df_dict_data_read.head()


Unnamed: 0,ID_code,target,var_0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,...,var_91,var_92,var_93,var_94,var_95,var_96,var_97,var_98,var_99,var_100
0,train_0,,,,,,,,,,...,,,,,,,,,,
1,train_1,,,,,,,,,,...,,,,,,,,,,
2,train_2,,,,,,,,,,...,,,,,,,,,,
3,train_3,False,4.6739,22.3915,15.6015,,0.0464,,,-1.9254,...,,11.1077,,-12.6465,,,,,,14.0618
4,train_4,,,,,,,,,,...,,,,,,,,,,


In [None]:
df_dict_data_read.shape

(1050000, 103)

In [None]:
type(df_dict_data_read)

In [None]:
df_dict_data_read.iloc[:,0:3].info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1050000 entries, 0 to 1049999
Data columns (total 3 columns):
 #   Column   Non-Null Count    Dtype  
---  ------   --------------    -----  
 0   ID_code  1050000 non-null  object 
 1   target   180000 non-null   boolean
 2   var_0    180000 non-null   float64
dtypes: boolean(1), float64(1), object(1)
memory usage: 18.0+ MB


In [None]:
df_dict_data_read['target'] = df_dict_data_read['target'].astype("Int64")


In [None]:
df_dict_data_read.iloc[:,0:3].info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1050000 entries, 0 to 1049999
Data columns (total 3 columns):
 #   Column   Non-Null Count    Dtype  
---  ------   --------------    -----  
 0   ID_code  1050000 non-null  object 
 1   target   180000 non-null   Int64  
 2   var_0    180000 non-null   float64
dtypes: Int64(1), float64(1), object(1)
memory usage: 25.0+ MB
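
The jump from 18.0 MB to 25.0 MB follows from how pandas stores nullable columns: a values array plus a one-byte-per-row mask marking missing entries. `boolean` and `Int8` use 1 + 1 bytes per row, while `Int64` uses 8 + 1. A small sketch on an invented Series (the name `target` and its values are illustrative):

```python
import numpy as np
import pandas as pd

# Invented stand-in for the target column: floats with missing values.
target = pd.Series([np.nan, 0.0, 1.0, np.nan, 0.0])

as_boolean = target.astype("boolean")
as_int64 = as_boolean.astype("Int64")
as_int8 = as_boolean.astype("Int8")

for name, column in [("boolean", as_boolean), ("Int64", as_int64), ("Int8", as_int8)]:
    # nbytes counts the values array plus the missing-value mask.
    print(f"{name:8} bytes per row: {column.nbytes / len(column):.0f}   "
          f"missing: {column.isna().sum()}")
```

Missing values stay missing through each cast; only the width of the stored values changes.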


In [None]:
df_dict_data_read.head()


Unnamed: 0,ID_code,target,var_0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,...,var_91,var_92,var_93,var_94,var_95,var_96,var_97,var_98,var_99,var_100
0,train_0,,,,,,,,,,...,,,,,,,,,,
1,train_1,,,,,,,,,,...,,,,,,,,,,
2,train_2,,,,,,,,,,...,,,,,,,,,,
3,train_3,0.0,4.6739,22.3915,15.6015,,0.0464,,,-1.9254,...,,11.1077,,-12.6465,,,,,,14.0618
4,train_4,,,,,,,,,,...,,,,,,,,,,


## Pickling a data frame

In [None]:
data = pd.read_csv( url )
data = data.drop(['Unnamed: 0'], axis=1)


In [None]:
data.head()

Unnamed: 0,ID_code,target,var_0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,...,var_91,var_92,var_93,var_94,var_95,var_96,var_97,var_98,var_99,var_100
0,train_0,,,,,,,,,,...,,,,,,,,,,
1,train_1,,,,,,,,,,...,,,,,,,,,,
2,train_2,,,,,,,,,,...,,,,,,,,,,
3,train_3,0.0,4.6739,22.3915,15.6015,,0.0464,,,-1.9254,...,,11.1077,,-12.6465,,,,,,14.0618
4,train_4,,,,,,,,,,...,,,,,,,,,,


In [None]:
data.iloc[:,0:3].info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1050000 entries, 0 to 1049999
Data columns (total 3 columns):
 #   Column   Non-Null Count    Dtype  
---  ------   --------------    -----  
 0   ID_code  1050000 non-null  object 
 1   target   180000 non-null   float64
 2   var_0    180000 non-null   float64
dtypes: float64(2), object(1)
memory usage: 24.0+ MB


In [None]:
data['target'] = data['target'].astype("boolean")


In [None]:
data.iloc[:,0:3].info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1050000 entries, 0 to 1049999
Data columns (total 3 columns):
 #   Column   Non-Null Count    Dtype  
---  ------   --------------    -----  
 0   ID_code  1050000 non-null  object 
 1   target   180000 non-null   boolean
 2   var_0    180000 non-null   float64
dtypes: boolean(1), float64(1), object(1)
memory usage: 18.0+ MB


Pickle the data frame.

In [None]:
with open('data.p', 'wb') as file:
    pickle.dump(data, file)


In [None]:
!ls -la --si ./*.p


-rw-r--r-- 1 root root 867M Jul 10 22:15 ./data.p
-rw-r--r-- 1 root root 867M Jul 10 22:12 ./dict_data.p


Read the pickle file.

In [None]:
with open('data.p', 'rb') as file:
    data_read = pickle.load(file)
data_read.head()


Unnamed: 0,ID_code,target,var_0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,...,var_91,var_92,var_93,var_94,var_95,var_96,var_97,var_98,var_99,var_100
0,train_0,,,,,,,,,,...,,,,,,,,,,
1,train_1,,,,,,,,,,...,,,,,,,,,,
2,train_2,,,,,,,,,,...,,,,,,,,,,
3,train_3,False,4.6739,22.3915,15.6015,,0.0464,,,-1.9254,...,,11.1077,,-12.6465,,,,,,14.0618
4,train_4,,,,,,,,,,...,,,,,,,,,,


In [None]:
type(data_read)

In [None]:
data_read.iloc[:,0:3].info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1050000 entries, 0 to 1049999
Data columns (total 3 columns):
 #   Column   Non-Null Count    Dtype  
---  ------   --------------    -----  
 0   ID_code  1050000 non-null  object 
 1   target   180000 non-null   boolean
 2   var_0    180000 non-null   float64
dtypes: boolean(1), float64(1), object(1)
memory usage: 18.0+ MB
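
The `info()` output confirms that the nullable `boolean` dtype survives the round trip, with no conversion step needed. pandas also ships convenience wrappers, `DataFrame.to_pickle` and `pd.read_pickle`, that handle the open, dump, and load steps for you. The lecture does not use them, so treat the sketch below as an alternative; the frame and the temporary path are invented.

```python
import os
import tempfile
import pandas as pd

# Small illustrative frame; values are invented for the example.
small = pd.DataFrame({
    "ID_code": ["train_0", "train_1", "train_2"],
    "target": pd.array([None, False, True], dtype="boolean"),
})

# Hypothetical output path in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "small.p")

small.to_pickle(path)              # writes the frame as a pickle file
restored = pd.read_pickle(path)    # only read pickles from sources you trust

print(restored.dtypes)
print(restored.equals(small))
```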


In [None]:
data_read['target'] = data_read['target'].astype("Int8")


In [None]:
data_read.iloc[:,0:3].info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1050000 entries, 0 to 1049999
Data columns (total 3 columns):
 #   Column   Non-Null Count    Dtype  
---  ------   --------------    -----  
 0   ID_code  1050000 non-null  object 
 1   target   180000 non-null   Int8   
 2   var_0    180000 non-null   float64
dtypes: Int8(1), float64(1), object(1)
memory usage: 18.0+ MB


In [None]:
data_read.head()
