<a href="https://colab.research.google.com/github/docdecoder/javainfant/blob/master/Pickle.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
import pickle

This code imports the pickle module in Python, which is used for serializing and de-serializing Python objects.



# Serializing Python Data Structures with Pickle
# Lists

In [None]:
system_names= [ 'Cardiovascular System', 'Respiratory System', 'Digestive System', 'Neurological System', 'Lymphatic System', ' Endocrinological System','Immunological System','Skeletal System']



This code uses the Python pickle module to serialize a list of system names and save it to a file named "system_file.pkl".

The with statement is used to open the file in binary write mode ('wb') and assign it to the variable f. This ensures that the file is properly closed after the block of code is executed, even if an error occurs.

The pickle.dump() function is then used to serialize the system_names list and write it to the file f. Serialization is the process of converting an object into a format that can be stored or transmitted, in this case, a binary format.

Overall, this code saves a list of system names to a file using the pickle module.



In [None]:
with open('system_file.pkl', 'wb') as f:  # open the file in binary write mode
    pickle.dump(system_names, f) # serialize the list

The extension does not have to be .pkl. You can name this anything you’d like, and the file will still be created. However, it is good practice to use the .pkl extension so that you are reminded that this is a Pickle file.

Also, notice that we opened the file in wb mode. This means that you are writing the file in binary mode, so the data is written as bytes rather than as text.

Then, we use the dump() function to store the system_names list in the file.

Because the file was opened in a with statement, it has already been closed by this point. An explicit close() is only needed when a file is opened without with; calling it here is harmless:

In [None]:
f.close()

This code closes a file that was previously opened using the open() function in Python. The close() method frees any system resources held by the file object. Files should always be closed after use, either explicitly or through a with statement, so that buffered data is flushed to disk and nothing is lost.
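As noted earlier, the with statement closes the file when its block ends. The following is a minimal sketch of that behaviour, assuming a throwaway temporary directory; the path `demo_path` and the sample list are illustrative and not part of the notebook.

```python
import os
import pickle
import tempfile

# Hypothetical scratch location so the sketch does not touch system_file.pkl
demo_path = os.path.join(tempfile.mkdtemp(), "demo.pkl")

with open(demo_path, "wb") as f:
    pickle.dump(["a", "b"], f)

# The with block has already closed the file at this point
closed_after_with = f.closed
print(closed_after_with)

# Calling close() on an already closed file does nothing and raises no error
f.close()
```

The file object's `closed` attribute is a quick way to confirm whether an explicit close() is still needed.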

In [None]:
with open('system_file.pkl', 'rb') as f:

    system_names_loaded = pickle.load(f) # deserialize using load()
    print(system_names_loaded) # print the system names

['Cardiovascular System', 'Respiratory System', 'Digestive System', 'Neurological System', 'Lymphatic System', ' Endocrinological System', 'Immunological System', 'Skeletal System']


This output is a Python list containing 8 strings, from 'Cardiovascular System' through 'Skeletal System'. *Lists* are a type of data structure in Python that can *hold multiple values of different types.* In this case, the list holds strings, which are a type of data that represents text. The square brackets indicate that this is a list, and the commas separate the individual elements within the list.

Notice that to deserialize the file, we need to use the rb mode, which stands for read binary. Then, we unpickle the object using the load() function, after which we can store the data in a different variable and use it as we see fit. Next, check the type of the object we just unpickled!

In [None]:
type(system_names_loaded)

list

Great! We have preserved the original state and data type of this list.
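The same round trip can also be checked in memory. This sketch uses pickle.dumps() and pickle.loads(), the bytes-based counterparts of dump() and load() in the standard library; the shortened list is only an illustration.

```python
import pickle

original = ["Cardiovascular System", "Respiratory System", "Digestive System"]

# dumps() returns the pickled bytes instead of writing them to a file
payload = pickle.dumps(original)

# loads() rebuilds an equal, but separate, list object from those bytes
restored = pickle.loads(payload)

print(type(payload).__name__)   # bytes
print(restored == original)     # True
print(restored is original)     # False
```

Comparing with `==` confirms that the contents survived, while `is` shows that unpickling creates a new object rather than returning the original one.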

# Numpy arrays

Now we will create a 5 by 5 array of ones.

In [None]:
import numpy as np
numpy_array = np.ones((5,5)) # 5*5 array

This code imports the NumPy library and creates a 5x5 NumPy array filled with ones using the np.ones() function. The np prefix is used to indicate that the function is part of the NumPy library. The ones() function takes a tuple as an argument that specifies the shape of the array. In this case, the tuple is (5,5) which creates a 5x5 array. The resulting array is assigned to the variable numpy_array.

Then, just like we did before, let’s call the dump() function to serialize this array to a file:

In [None]:
with open('my_array.pkl','wb') as f:
    pickle.dump(numpy_array, f)

This code uses the Python pickle module to save a NumPy array to a file named my_array.pkl.

The with statement is used to open the file in binary write mode ('wb') and assign it to the variable f. This ensures that the file is properly closed after the block of code is executed, even if an error occurs.

The pickle.dump() function is then used to serialize the numpy_array object and write it to the file f. This allows the array to be saved in a format that can be easily loaded back into memory later.

Overall, this code is a simple way to save a NumPy array to a file for later use.



Finally, let’s unpickle this array and check its shape and data type to ensure that it has retained its original state:

In [None]:
with open('my_array.pkl','rb') as f:
    unpickled_array = pickle.load(f)
    print('Array shape: '+str(unpickled_array.shape))
    print('Data type: '+str(type(unpickled_array)))

Array shape: (5, 5)
Data type: <class 'numpy.ndarray'>


The unpickled array has the same shape and data type as the object we just serialized.
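Shape and type can match even when values differ, so a stricter check compares the contents as well. This is a small sketch, assuming NumPy is installed; it pickles into an in-memory io.BytesIO buffer instead of my_array.pkl so that it leaves no file behind.

```python
import io
import pickle

import numpy as np

original_array = np.ones((5, 5))

# Pickle into an in-memory buffer rather than a file on disk
buffer = io.BytesIO()
pickle.dump(original_array, buffer)

# Rewind the buffer before reading the pickle back
buffer.seek(0)
restored_array = pickle.load(buffer)

same_values = np.array_equal(original_array, restored_array)
same_dtype = original_array.dtype == restored_array.dtype
print(same_values, same_dtype, restored_array.shape)
```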

# pandas DataFrames

A data frame is an object that data scientists work with daily. The most popular way to load and save a Pandas DataFrame is to read and write it as a CSV file.

However, this process is slower than pickling and can become extremely time-consuming if the data frame is large.

Let's contrast the efficiency of saving and loading a pandas dataframe using Pickle versus CSV by comparing the respective time taken.

First, let’s create a pandas dataframe with 100,000 rows of fake data:

In [None]:
import pandas as pd
import numpy as np

# Set random seed
np.random.seed(123)

data = {'Column1': np.random.randint(0, 10, size=100000),
        'Column2': np.random.choice(['A', 'B', 'C'], size=100000),
        'Column3': np.random.rand(100000)}


# Create Pandas dataframe
df = pd.DataFrame(data)

This code imports the pandas and numpy libraries in Python. It then sets a random seed using numpy to ensure that the random numbers generated are reproducible.

Next, it creates a dictionary called data with three keys: Column1, Column2, and Column3. The values for Column1 are generated using numpy's *randint function*, which *generates random integers from 0 (inclusive) to 10 (exclusive)*, with a size of 100000. The values for Column2 are generated using numpy's *choice function*, which *randomly selects elements from the given array* (in this case, ['A', 'B', 'C']) with a size of 100000. The values for Column3 are generated using numpy's *rand function*, which *generates random floats in the range [0, 1)* with a size of 100000.

Finally, the code creates a pandas dataframe called df using the pd.DataFrame() function and passing in the data dictionary as an argument. This creates a dataframe with three columns (Column1, Column2, and Column3) and 100000 rows.



Now, let's calculate the amount of time taken to save this dataframe as a csv file:

In [None]:
import time

start = time.time()

df.to_csv('pandas_dataframe.csv')

end = time.time()
print(end - start)

0.29283785820007324


This code imports the time module and uses it to measure the time it takes to execute the code between the start and end variables.

The start variable is assigned the current time using the time.time() function. Then, the df DataFrame is exported to a CSV file named pandas_dataframe.csv using the to_csv() method.

After the export is complete, the end variable is assigned the current time using time.time(). The difference between end and start is then printed to the console, which gives the total time it took to export the DataFrame to a CSV file.



It took us 0.29 seconds to save a Pandas dataframe with three columns and 100,000 rows to a csv file.
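time.time() is adequate for rough measurements like this one. For short durations, the standard library also offers time.perf_counter(), a clock intended for timing. The sketch below wraps it in a small helper; `time_call` is a hypothetical name introduced here, not something the notebook defines.

```python
import pickle
import time


def time_call(func, *args, **kwargs):
    """Hypothetical helper: run func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


sample = list(range(100_000))

# Time a single in-memory pickling of the sample list
payload, seconds = time_call(pickle.dumps, sample)
print(f"pickled {len(sample)} integers in {seconds:.4f} s")
```

A single run is noisy, so repeating the measurement a few times and comparing the results gives a more reliable picture than one number.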

Let’s see if using Pickle can help improve performance. The pandas library has a method called to_pickle() that allows us to serialize dataframes to pickle files in just one line of code:

In [None]:
start = time.time()

df.to_pickle("my_pandas_dataframe.pkl")

end = time.time()
print(end - start)

0.00867009162902832


It only took us about 9 milliseconds to save the same Pandas DataFrame to a Pickle file, which is a significant performance improvement when compared to saving it as a csv.

Now, let’s read the file back to Pandas and see if loading a Pickle file offers any performance benefits as opposed to simply reading a csv file:

In [None]:
# Reading the csv file into Pandas:
start1 = time.time()
# Save the DataFrame to a CSV file first. Note that this write happens inside
# the timed region, so the figure printed below covers both the write and the read.
df.to_csv("my_pandas_dataframe.csv", index=False)
df_csv = pd.read_csv("my_pandas_dataframe.csv")
end1 = time.time()
print("Time taken to read the csv file: " + str(end1 - start1) + "\n")

# Reading the Pickle file into Pandas:
start2 = time.time()
df_pkl = pd.read_pickle("my_pandas_dataframe.pkl")
end2 = time.time()
print("Time taken to read the Pickle file: " + str(end2 - start2))

Time taken to read the csv file: 0.2929713726043701

Time taken to read the Pickle file: 0.005082368850708008


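The CSV figure above also includes the time spent writing the file, because to_csv() is called inside the timed block, so it overstates the read time. Even so, reading the Pickle file is much faster here, and serializing large Pandas dataframes with Pickle can result in considerable time savings. Pickle also preserves the data type of each column, which a CSV round trip does not guarantee, and the resulting file is often smaller than the equivalent CSV.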
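The point about data types can be illustrated without pandas. The following is a standard-library analogue rather than the notebook's DataFrame workflow: the csv module returns every field as a string, whereas pickle restores the original Python types. pandas' read_csv() infers many column types on its own, so the effect there is less drastic than in this sketch.

```python
import csv
import io
import pickle

rows = [[1, 0.5, "A"], [2, 1.5, "B"]]

# CSV round trip: every field comes back as a string
text_buffer = io.StringIO()
csv.writer(text_buffer).writerows(rows)
text_buffer.seek(0)
rows_from_csv = list(csv.reader(text_buffer))

# Pickle round trip: ints, floats and strings keep their types
rows_from_pickle = pickle.loads(pickle.dumps(rows))

print(rows_from_csv)      # [['1', '0.5', 'A'], ['2', '1.5', 'B']]
print(rows_from_pickle)   # [[1, 0.5, 'A'], [2, 1.5, 'B']]
```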

# Dictionaries

In [None]:
Organs = {'Cardiovascular System':'Heart','Respiratory System':'Lung','Digestive System':'Stomach','Neurological System':'Brain','Lymphatic System':'Spleen','Endocrinological System':'Thyroid','Immunological System':'Thymus','Skeletal System':'Spine'}


This code creates a dictionary called Organs that maps each body system to one of its organs. Each entry is a key-value pair, where the key is a string holding the system's name and the value is a string holding the organ's name.

For example, the first entry has the key 'Cardiovascular System' and the value 'Heart'.
The second and third entries use the keys 'Respiratory System' and 'Digestive System', with the values 'Lung' and 'Stomach' respectively.

Overall, this is a flat dictionary. The next cell redefines Organs as a nested dictionary, where each system maps to a dictionary holding the organ's name, tissue, and place.

In [7]:
Organs = {
    'Cardiovascular System': {'Name': 'Heart', 'Tissues': 'Heart Tissue', 'Place': 'Chest'},
    'Respiratory System': {'Name': 'Lung', 'Tissues': 'Lung Tissue', 'Place': 'Chest'},
    'Digestive System': {'Name': 'Stomach', 'Tissues': 'Stomach Tissue', 'Place': 'Abdomen'},
    'Neurological System': {'Name': 'Brain', 'Tissues': 'Brain Tissue', 'Place': 'Head'},
    'Lymphatic System': {'Name': 'Spleen', 'Tissues': 'Vascular Tissue', 'Place': 'Abdomen'},
    'Endocrinological System': {'Name': 'Thyroid', 'Tissues': 'Thyroid Tissue', 'Place': 'Neck'},
    'Immunological System': {'Name': 'Thymus', 'Tissues': 'Thymus Tissue', 'Place': 'Neck'},
    'Skeletal System': {'Name': 'Spine', 'Tissues': 'Neuronal Tissues', 'Place': ' Back'},
}
# serialize the dictionary to a pickle file
with open("Organs_dict.pkl", "wb") as f:
    # write the Organs dictionary to the file
    pickle.dump(Organs, f)

# deserialize the dictionary and print it out
with open("Organs_dict.pkl", "rb") as f:
    deserialized_dict = pickle.load(f)
    print(deserialized_dict)


{'Cardiovascular System': {'Name': 'Heart', 'Tissues': 'Heart Tissue', 'Place': 'Chest'}, 'Respiratory System': {'Name': 'Lung', 'Tissues': 'Lung Tissue', 'Place': 'Chest'}, 'Digestive System': {'Name': 'Stomach', 'Tissues': 'Stomach Tissue', 'Place': 'Abdomen'}, 'Neurological System': {'Name': 'Brain', 'Tissues': 'Brain Tissue', 'Place': 'Head'}, 'Lymphatic System': {'Name': 'Spleen', 'Tissues': 'Vascular Tissue', 'Place': 'Abdomen'}, 'Endocrinological System': {'Name': 'Thyroid', 'Tissues': 'Thyroid Tissue', 'Place': 'Neck'}, 'Immunological System': {'Name': 'Thymus', 'Tissues': 'Thymus Tissue', 'Place': 'Neck'}, 'Skeletal System': {'Name': 'Spine', 'Tissues': 'Neuronal Tissues', 'Place': ' Back'}}


In [8]:
type(deserialized_dict)

dict

In [20]:
print(
    "The first organ's name is "
    + deserialized_dict["Cardiovascular System"]["Name"]
    + " and its tissue is "
    + (str(deserialized_dict["Cardiovascular System"]["Tissues"]))
    +","
    + " and place of the organ is in the "
    + (str(deserialized_dict["Cardiovascular System"]["Place"]))
    +"."
)

The first organ's name is Heart and its tissue is Heart Tissue, and place of the organ is in the Chest.


Error 1:
deserialized_dict is structured so that "Tissues" is a key inside the nested dictionary stored under each system, such as "Cardiovascular System" or "Respiratory System". The first attempt tried to read "Tissues" directly from the top-level dictionary deserialized_dict, where that key does not exist. The system has to be selected first, and "Tissues" is then read from that system's nested dictionary.
Suggested code used:
print(
    "The first system's name is "
    + deserialized_dict["Cardiovascular System"]["Name"]
    + " and tissue is "
    + (str(deserialized_dict["Cardiovascular System"]["Tissues"]))  # Access Tissues within "Cardiovascular System"
    + " place of the organ is in the "  # Added a space for better readability
    + (str(deserialized_dict["Cardiovascular System"]["Place"]))  # Access Place within "Cardiovascular System"
)

Error 2:

KeyError: 'Tissues '
The error message KeyError: 'Tissues ' indicates that the key 'Tissues ' was not found in the nested dictionary deserialized_dict["Cardiovascular System"]. The cause is a typo in the key name: when the Organs dictionary was created, the key was defined as 'Tissues' (without a trailing space), but the print statement looked up 'Tissues ' (with a trailing space). Dictionary keys must match exactly, so the extra space raises the KeyError.
Suggested code used:
print(
    "The first system's name is "
    + deserialized_dict["Cardiovascular System"]["Name"]
    + " and tissue is "
    + (str(deserialized_dict["Cardiovascular System"]["Tissues"])) # Removed the extra space in "Tissues "
    + " place of the organ is in the "
    + (str(deserialized_dict["Cardiovascular System"]["Place"]))
)


Error 3:
TypeError: bad operand type for unary +: 'str'
This error occurs because a comma was placed after (str(deserialized_dict["Cardiovascular System"]["Tissues"])). Inside print(), a comma separates arguments, so Python treats the text after the comma as a second argument. That argument begins with + " and place of the organ is in the ", which applies the unary + operator to a string, and unary + is not defined for strings.
Suggested code used:

print(
    "The first organ's name is "
    + deserialized_dict["Cardiovascular System"]["Name"]
    + " and its tissue is "
    + (str(deserialized_dict["Cardiovascular System"]["Tissues"]))
    + " and place of the organ is in the "
    + (str(deserialized_dict["Cardiovascular System"]["Place"]))
)


Explanation of Changes:
    Removed the comma after (str(deserialized_dict["Cardiovascular System"]["Tissues"])): This keeps the whole expression as a single argument to print, avoiding the unintended unary + operator.
    Combined all strings into a single expression: This ensures that Python concatenates string with string rather than with any other data type.




Great! The dictionary retained all its original properties and can be accessed just like it was before serialization.
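The errors listed above come from exact key matching and from long chains of + inside print(). One possible alternative, sketched here with a shortened copy of the dictionary (the variable names `organs`, `heart`, and `fallback` are illustrative), is to build the sentence with an f-string and to use dict.get() where a key might be missing.

```python
organs = {
    "Cardiovascular System": {"Name": "Heart", "Tissues": "Heart Tissue", "Place": "Chest"},
}

heart = organs["Cardiovascular System"]

# An f-string avoids chains of + and the stray-comma pitfall
sentence = (
    f"The first organ's name is {heart['Name']} and its tissue is "
    f"{heart['Tissues']}, and place of the organ is in the {heart['Place']}."
)
print(sentence)

# Keys must match exactly: 'Tissues ' (trailing space) is a different key
try:
    heart["Tissues "]
except KeyError as err:
    missing_key = err.args[0]
    print("missing key:", repr(missing_key))

# dict.get returns a default instead of raising when a key is absent
fallback = heart.get("Tissues ", "unknown")
print(fallback)
```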

# Serializing Machine Learning Models with Pickle


Pickle allows you to serialize machine learning models in their existing state, making it possible to use them again as needed.

First, generate some fake data and build a linear regression model with the Scikit-Learn library:



In [21]:
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression

# generate regression dataset
X, y = make_regression(n_samples=5000, n_features=5, noise=0.3, random_state=101)

# train regression model
linear_model = LinearRegression()
linear_model.fit(X, y)

This code imports the LinearRegression class from the sklearn.linear_model module and the make_regression function from the sklearn.datasets module.

The make_regression function is used to generate a synthetic regression dataset with 5000 samples, 5 features, and a noise level of 0.3. The random_state parameter is set to 101 to ensure reproducibility.

The function returns the dataset as two arrays, X and y, where X contains the input features and y contains the corresponding target values.

Finally, a LinearRegression object is created and trained on the generated dataset using the fit method. This trains the model to learn the relationship between the input features and the target values, so that it can make predictions on new data.



In [22]:
# summary of the model
print('Model intercept :', linear_model.intercept_)
print('Model coefficients : ', linear_model.coef_)
print('Model score : ', linear_model.score(X, y))

Model intercept : 0.0026360037414828175
Model coefficients :  [ 7.79640942 33.43726819 57.51459187 64.4743176   7.76001334]
Model score :  0.9999896250777828


This code snippet prints out a summary of a linear regression model.

The first line prints out the **intercept of the model, which is the value of the dependent variable (y) when all independent variables (x) are equal to zero.**

The second line prints out the **coefficients of the model, which represent the change in the dependent variable (y) for a one-unit change in each independent variable.**

The third line prints out the score of the model, which is the coefficient of determination (R-squared) that measures the **proportion of the variance in the dependent variable that is explained by the independent variables.**

The variables X and y are assumed to be previously defined as the independent and dependent variables, respectively, used to fit the linear regression model. The linear_model object is also assumed to be previously defined as the fitted linear regression model.



In [23]:
with open("linear_regression.pkl", "wb") as f:
    pickle.dump(linear_model, f)

This code uses the Python pickle module to save the trained linear regression model to a file named 'linear_regression.pkl'.

The with statement opens the file in write binary ('wb') mode and assigns it to the variable f.

The pickle.dump() function is then used to serialize the linear_model object and write it to the file f.

By saving the model to a file, it can be loaded and used later without retraining it from scratch.

The model is now saved to a pickle file. It can be deserialized using the load() function.

In [25]:
with open("linear_regression.pkl", "rb") as f:
    unpickled_linear_model = pickle.load(f)

This code uses the open() function in Python to open a file named "linear_regression.pkl" in binary mode with the "rb" argument. The with statement is used to ensure that the file is properly closed after it is used.

Once the file is opened, the pickle.load() function is used to load the contents of the file into a variable named unpickled_linear_model. This function is used to deserialize a Python object hierarchy from a file.

Overall, this code is used to load a previously saved linear regression model from a file.

The deserialized model is now loaded into a variable called unpickled_linear_model. Check the model's parameters to see whether they are the same as those of the original model we serialized.

In [26]:
# summary of the model
print('Model intercept :', unpickled_linear_model.intercept_)
print('Model coefficients : ', unpickled_linear_model.coef_)
print('Model score : ', unpickled_linear_model.score(X, y))

Model intercept : 0.0026360037414828175
Model coefficients :  [ 7.79640942 33.43726819 57.51459187 64.4743176   7.76001334]
Model score :  0.9999896250777828


This code snippet prints out a summary of a linear regression model.

The first line prints out the intercept of the model, which is the value of the dependent variable when all independent variables are equal to zero.

The second line prints out the coefficients of the model, which represent the change in the dependent variable for a one-unit change in each independent variable.

The third line prints out the R-squared score of the model, which is a measure of how well the model fits the data. The best possible score is 1, and higher values indicate a better fit; the score can be negative for a model that fits worse than always predicting the mean.

The unpickled_linear_model object is assumed to be a trained linear regression model that has been previously saved and loaded using the pickle module. The X and y variables are assumed to be the independent and dependent variables used to train the model, respectively.


Great! The parameters of the model we just unpickled are the same as the one we initially created.

We can now proceed to use this model to make predictions on a test dataset, train on top of it, or transfer it to a different environment.

# Increasing Python Pickle Performance For Large Objects

Pickle is an efficient serialization format for Python objects that has often proved to be faster than JSON, XML, and HDF5 in various benchmarks, although the results depend on the data being serialized.

When dealing with extremely large data structures or huge machine learning models, however, Pickle can slow down considerably, and serialization can become a bottleneck in your workflow.

Here are some ways you can reduce the time taken to save and load Pickle files:



# Use the PROTOCOL argument
The **default protocol** used when saving Pickle files is **currently 4** in the Python version used here, and it can be read by Python 3.4 and later. When loading, the protocol is detected automatically.

However, if you want **to speed up your workflow**, you can pass **protocol=pickle.HIGHEST_PROTOCOL**, which selects the highest protocol version available and is generally the most efficient.

To compare the performance difference between the default protocol and the highest protocol, let's first serialize a dictionary of NumPy arrays using the default protocol. This is the protocol version that Pickle uses if no specific protocol is explicitly stated.

In [27]:
import pickle
import time
import numpy as np

# Set random seed
np.random.seed(1008)

data = {'Column1': np.random.randint(0, 10, size=500000),
        'Column2': np.random.choice(['A', 'B', 'C'], size=500000),
        'Column3': np.random.rand(500000)}

# serialize to a file

start = time.time()

with open("df1.pkl", "wb") as f:
    pickle.dump(data, f)

end = time.time()
print(end - start)

0.023430585861206055




> pickle.dump(obj, file, protocol=None, *, fix_imports=True, buffer_callback=None)



Write a pickled representation of obj to the open file object file.

This is equivalent to Pickler(file, protocol).dump(obj), but may
be more efficient.

The optional *protocol* argument tells the pickler to use the given
protocol; supported protocols are 0, 1, 2, 3, 4 and 5.  The default
protocol is 4. It was introduced in Python 3.4, and is incompatible
with previous versions.

Specifying a negative protocol version selects the highest protocol
version supported.  The higher the protocol used, the more recent the
version of Python needed to read the pickle produced.

The *file* argument must have a write() method that accepts a single
bytes argument.  It can thus be a file object opened for binary
writing, an io.BytesIO instance, or any other custom object that meets
this interface.

If *fix_imports* is True and protocol is less than 3, pickle will try
to map the new Python 3 names to the old module names used in Python
2, so that the pickle data stream is readable with Python 2.

If *buffer_callback* is None (the default), buffer views are serialized
into *file* as part of the pickle stream.  It is an error if
*buffer_callback* is not None and *protocol* is None or smaller than 5.

It took around 23 milliseconds for us to serialize the data using Pickle's default protocol.

Now, let's pickle the same data using the highest protocol:

In [28]:
start = time.time()

with open("df2.pkl", "wb") as f:
    pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)

end = time.time()
print(end - start)

0.01298069953918457


With the highest protocol, we managed to serialize the data in roughly half the amount of time (about 13 milliseconds versus 23).
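Protocols also differ in how compactly they encode data. The sketch below, which uses a small made-up dictionary rather than the notebook's arrays, prints the default and highest protocol numbers for the running interpreter and compares the pickled size under every available protocol. Exact numbers depend on the Python version.

```python
import pickle

# Illustrative sample; not the data used in the timing cells
sample = {"numbers": list(range(1000)), "label": "demo"}

print("default:", pickle.DEFAULT_PROTOCOL, "highest:", pickle.HIGHEST_PROTOCOL)

# Pickled size in bytes for each protocol the interpreter supports
size_by_protocol = {
    protocol: len(pickle.dumps(sample, protocol=protocol))
    for protocol in range(pickle.HIGHEST_PROTOCOL + 1)
}
for protocol, size in size_by_protocol.items():
    print(f"protocol {protocol}: {size} bytes")

# A negative protocol value selects the highest supported protocol
same_as_highest = (
    pickle.dumps(sample, protocol=-1)
    == pickle.dumps(sample, protocol=pickle.HIGHEST_PROTOCOL)
)
print(same_as_highest)
```

Protocol 0 is a text-based format, so it is usually the largest; the binary protocols produce smaller output for this kind of data.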

# Use cPickle instead of Pickle

In Python 2, the cPickle module was a faster version of Pickle written in C, alongside a pickle module implemented purely in Python.

Note that in Python 3, the C implementation has been renamed to _pickle, and the standard pickle module uses it automatically when it is available. Importing _pickle directly, as we do below, is therefore not expected to change performance much.



In [29]:
import _pickle as cPickle

start = time.time()

with open("df3.pkl", "wb") as f:
    cPickle.dump(data, f)

end = time.time()

print(end-start)

0.02864813804626465


This code imports the _pickle module as cPickle. The _pickle module is used for serializing and de-serializing Python objects.

The code then starts a timer using the time module.

Next, the code opens a file named "df3.pkl" in write binary mode using the with statement. The cPickle.dump() method is used to serialize the data object and write it to the file.

After the serialization is complete, the file is closed automatically due to the 'with' statement.

The timer is stopped and the elapsed time is printed to the console. This can be useful for measuring the time it takes to serialize and write the data to the file.



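Serialization with _pickle took approximately 29 milliseconds, which is comparable to the 23 milliseconds measured with the pickle module's default protocol. This is expected in Python 3, where both names use the same C implementation.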
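In Python 3, the standard pickle module re-exports the functions of the C accelerator when it is available. The following sketch checks this for the running interpreter. Note that _pickle is an implementation detail of CPython and may be absent on other interpreters, which is why the import is guarded.

```python
import pickle

try:
    import _pickle  # C accelerator; an implementation detail of CPython
except ImportError:
    _pickle = None

# When the accelerator is present, pickle.dumps and _pickle.dumps
# are expected to be the very same function object
uses_c_accelerator = _pickle is not None and pickle.dumps is _pickle.dumps
print("pickle uses the C implementation:", uses_c_accelerator)
```

If this prints True, swapping `import pickle` for `import _pickle` changes nothing about which code runs, and any timing difference between the two is measurement noise.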

Even with workarounds to make serialization faster, the process can still be very slow for large objects.

To improve performance, you can break the data structure down and only serialize necessary subsets.

When working with dictionaries, for instance, you can keep only the key-value pairs that you will need to access again. Reducing the size of the dictionary before serializing it cuts down the object's complexity and can speed up the process significantly.
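As a rough sketch of this idea, the example below keeps only selected keys with a dictionary comprehension before pickling. The shortened `organs` dictionary and the `needed_keys` list are hypothetical choices made for illustration.

```python
import pickle

organs = {
    "Cardiovascular System": {"Name": "Heart", "Place": "Chest"},
    "Respiratory System": {"Name": "Lung", "Place": "Chest"},
    "Digestive System": {"Name": "Stomach", "Place": "Abdomen"},
}

# Hypothetical selection: only these systems are needed later
needed_keys = ["Cardiovascular System", "Digestive System"]
subset = {key: organs[key] for key in needed_keys}

# Compare the pickled size of the full dictionary and the subset
full_size = len(pickle.dumps(organs))
subset_size = len(pickle.dumps(subset))
print(full_size, subset_size)
```

With a dictionary this small the saving is negligible; the approach pays off when the discarded entries are large.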