# Save and Load Model (Scikit-learn)

**WHAT**
On many occasions while developing a Machine Learning model, we need to save our prediction models to a file and later restore them so that we can reuse our previous work.


**WHY**
We need to save our ML model and restore/reload it later in order to:

a) test our model on/with new data, 

b) compare multiple models, 

c) or anything else. 

**object serialization**
This process of saving an ML model is also known as object serialization: representing an object as a stream of bytes in order to store it on disk, send it over a network, or save it to a database.

**deserialization**
Restoring/reloading the ML model is known as deserialization.
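
As a minimal sketch of these two terms, the snippet below serializes a plain Python dictionary into bytes and restores it again. The `settings` dictionary is only an illustrative stand-in for a model object; the same `pickle` calls apply to a trained estimator.

```python
import pickle

# A plain Python object standing in for a trained model (illustrative only)
settings = {"C": 0.1, "solver": "liblinear", "max_iter": 20}

# Serialization: object -> stream of bytes
payload = pickle.dumps(settings)
print(type(payload))

# Deserialization: stream of bytes -> an equal, but separate, object
restored = pickle.loads(payload)
print(restored == settings, restored is settings)
```

The restored object compares equal to the original but is a new object in memory, which is what lets us reload a model in a later session.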

In this Kernel, we will explore 3 ways to save and reload ML models in Python and scikit-learn, and discuss the pros and cons of each method.

We will cover the following 3 approaches to saving and reloading an ML model:

1) Pickle Approach

2) Joblib Approach

3) Manual Save and Restore to JSON approach

Now, let's develop an ML model which we will save and reload in this Kernel.

**ML Model Creation**

For the purpose of this demo, we will create a basic Logistic Regression model on the IRIS dataset.

Dataset used : IRIS 

Model        : Logistic Regression using Scikit Learn

**Step - 1** : Import Packages

In [1]:
# Import Required packages 
#-------------------------

# Import the Logistic Regression Module from Scikit Learn
from sklearn.linear_model import LogisticRegression  

# Import the IRIS Dataset to be used in this Kernel
from sklearn.datasets import load_iris  

# Load the Module to split the Dataset into Train & Test 
from sklearn.model_selection import train_test_split


**Step - 2** : Load the IRIS Data

In [2]:
# Load the data
Iris_data = load_iris()  


**Step - 3** : Split the IRIS Data into Training & Testing Data

In [3]:
# Split data
Xtrain, Xtest, Ytrain, Ytest = train_test_split(Iris_data.data, 
                                                Iris_data.target, 
                                                test_size=0.3, 
                                                random_state=4)  

Now, let's build the Logistic Regression model on the IRIS data.

Note : The model creation in this Kernel is for demonstration only and does not cover the details of model creation.

In [4]:
# Define the Model
# Note: n_jobs has no effect when solver='liblinear';
# scikit-learn emits a warning about this when fitting
LR_Model = LogisticRegression(C=0.1,  
                               max_iter=20, 
                               fit_intercept=True, 
                               n_jobs=3, 
                               solver='liblinear')

# Train the Model
LR_Model.fit(Xtrain, Ytrain)  



LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=20,
                   multi_class='warn', n_jobs=3, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

Now that the model has been created and trained, we might want to save the trained model for future use.


**Approach 1 : Pickle approach**

In the following lines of code, the LR_Model which we created in the previous step is saved to a file and then loaded as a new object called Pickled_LR_Model.

The loaded model is then used to calculate the accuracy score and predict outcomes on new unseen (test) data.

In [5]:
# Import pickle Package

import pickle


In [6]:
# Save the Model to file in the current working directory

Pkl_Filename = "Pickle_RL_Model.pkl"  

with open(Pkl_Filename, 'wb') as file:  
    pickle.dump(LR_Model, file)


In [7]:
# Load the Model back from file
with open(Pkl_Filename, 'rb') as file:  
    Pickled_LR_Model = pickle.load(file)

Pickled_LR_Model

LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=20,
                   multi_class='warn', n_jobs=3, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

In [8]:
# Use the Reloaded Model to 
# Calculate the accuracy score and predict target values

# Calculate the Score 
score = Pickled_LR_Model.score(Xtest, Ytest)  
# Print the Score
print("Test score: {0:.2f} %".format(100 * score))  

# Predict the Labels using the reloaded Model
Ypredict = Pickled_LR_Model.predict(Xtest)  

Ypredict

Test score: 91.11 %


array([2, 0, 2, 2, 2, 2, 2, 0, 0, 2, 0, 0, 0, 2, 2, 0, 1, 0, 0, 2, 0, 2,
       1, 0, 0, 0, 0, 0, 0, 2, 2, 0, 2, 0, 1, 2, 2, 1, 1, 0, 2, 0, 1, 0,
       2])

**Let's Reflect back on Pickle approach :**

PROs of Pickle :

1) Saving and restoring our learning models is quick - we can do it in two lines of code. 

2) It is useful if you have optimized the model's parameters on the training data, so you don't need to repeat this step. 


CONs of Pickle :

1) It doesn't save the test results or any data; only the model object itself is stored. 
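
One possible way around this limitation is to pickle a dictionary that holds the model together with whatever else we want to keep. The sketch below is an illustration under stated assumptions, not part of the original Kernel: the `bundle` dictionary, its keys, and the file name are hypothetical, and the model is refit here with the default solver only so that the block is self-contained.

```python
import os
import pickle
import tempfile

import sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Rebuild a small model so that this block runs on its own
iris = load_iris()
Xtrain, Xtest, Ytrain, Ytest = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=4)
model = LogisticRegression(C=0.1, max_iter=1000).fit(Xtrain, Ytrain)

# Hypothetical bundle: the model plus the extras that pickle alone would not keep
bundle = {
    "model": model,
    "test_score": model.score(Xtest, Ytest),
    "sklearn_version": sklearn.__version__,
}

# Hypothetical file name, written to the system temp directory
bundle_path = os.path.join(tempfile.gettempdir(), "Pickle_LR_Bundle.pkl")

with open(bundle_path, "wb") as file:
    pickle.dump(bundle, file)

with open(bundle_path, "rb") as file:
    restored_bundle = pickle.load(file)

print("Stored test score: {0:.2f} %".format(100 * restored_bundle["test_score"]))
```

Recording the library version next to the model is a design choice that makes it easier to tell later which scikit-learn release produced the file.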


**Approach 2 - Joblib** :

Joblib is a standalone package that is installed as a dependency of Scikit Learn, and it is intended to be a replacement for Pickle for objects containing large data. Older Scikit Learn releases also exposed it as sklearn.externals.joblib, but that import path has since been removed.

This approach still saves our ML model in the pickle format, but we don't need to install any additional libraries, as Joblib comes along with the Scikit Learn package which we will invariably use for developing our ML models.

In the following Python scripts, we will show how to save and reload ML models using Joblib.

Import the required library for using Joblib

In [9]:
# Import the Joblib package (installed alongside Scikit Learn)
# Note: `from sklearn.externals import joblib` no longer works in recent releases

import joblib




Save the Model using Joblib

In [10]:
# Save LR_Model to file in the current working directory

joblib_file = "joblib_RL_Model.pkl"  
joblib.dump(LR_Model, joblib_file)


['joblib_RL_Model.pkl']

Reload the saved Model using Joblib

In [11]:
# Load from file

joblib_LR_model = joblib.load(joblib_file)


joblib_LR_model

LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=20,
                   multi_class='warn', n_jobs=3, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

Use the reloaded model to calculate the accuracy score and predict outcomes

In [12]:
# Use the Reloaded Joblib Model to 
# Calculate the accuracy score and predict target values

# Calculate the Score 
score = joblib_LR_model.score(Xtest, Ytest)  
# Print the Score
print("Test score: {0:.2f} %".format(100 * score))  

# Predict the Labels using the reloaded Model
Ypredict = joblib_LR_model.predict(Xtest)  

Ypredict

Test score: 91.11 %


array([2, 0, 2, 2, 2, 2, 2, 0, 0, 2, 0, 0, 0, 2, 2, 0, 1, 0, 0, 2, 0, 2,
       1, 0, 0, 0, 0, 0, 0, 2, 2, 0, 2, 0, 1, 2, 2, 1, 1, 0, 2, 0, 1, 0,
       2])

**Let's Reflect back on Joblib approach :**

PROs of Joblib :

1) The Joblib library offers a slightly simpler workflow compared to Pickle. 

2) While Pickle requires a file object to be passed as an argument, Joblib works with both file objects and string filenames. 

3) In case our model contains large arrays of data, the save and restore procedure will remain the same. Older Joblib versions stored each array in a separate file, while recent versions write everything to a single file by default. 

4) Joblib also allows different compression methods, such as 'zlib', 'gzip', 'bz2', and different levels of compression.
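
The points above can be sketched as follows. This is an illustration only: the `big_object` dictionary and the file names are hypothetical stand-ins, and the files are written to a temporary directory rather than the working directory.

```python
import os
import tempfile

import joblib
import numpy as np

# A stand-in object containing a large array (illustrative only)
big_object = {"weights": np.zeros((1000, 100)), "label": "demo"}

tmp_dir = tempfile.mkdtemp()

# 1) Dump to a string filename, using gzip compression at level 3
gz_path = os.path.join(tmp_dir, "demo_object.pkl.gz")
joblib.dump(big_object, gz_path, compress=("gzip", 3))

# 2) Dump to an already opened file object, without compression
raw_path = os.path.join(tmp_dir, "demo_object.pkl")
with open(raw_path, "wb") as fh:
    joblib.dump(big_object, fh)

# Loading looks the same either way; the compression is detected automatically
from_gz = joblib.load(gz_path)
with open(raw_path, "rb") as fh:
    from_raw = joblib.load(fh)

print("compressed bytes  :", os.path.getsize(gz_path))
print("uncompressed bytes:", os.path.getsize(raw_path))
```

Higher compression levels usually give smaller files at the cost of slower saving, so the level is a trade-off to tune for the model at hand.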


**Approach 3 - Manual Save and Restore to JSON** :

Whenever we want to have full control over the save and restore process, the best way is to build our own functions manually.

The following script shows an example of manually saving and restoring objects using JSON. This approach allows us to select the data which needs to be saved, such as the model parameters, coefficients, training data, and anything else we need.

For simplicity, we'll save only three model parameters and the training data. Some additional data we could store with this approach is, for example, a cross-validation score on the training set, test data, accuracy score on the test data, etc.

Import the required libraries

In [13]:
# Import required packages

import json  
import numpy as np


Since we want to save all of this data in a single object, one possible way to do it is to create a new class which inherits from the model class, which in our example is LogisticRegression. The new class, called MyLogReg, then implements the methods save_json and load_json for saving and restoring to/from a JSON file, respectively.

In [14]:
class MyLogReg(LogisticRegression):

    # Override the class constructor
    def __init__(self, C=1.0, solver='liblinear', max_iter=100, X_train=None, Y_train=None):
        super().__init__(C=C, solver=solver, max_iter=max_iter)
        self.X_train = X_train
        self.Y_train = Y_train

    # A method for saving object data to JSON file
    def save_json(self, filepath):
        dict_ = {}
        dict_['C'] = self.C
        dict_['max_iter'] = self.max_iter
        dict_['solver'] = self.solver
        dict_['X_train'] = self.X_train.tolist() if self.X_train is not None else 'None'
        dict_['Y_train'] = self.Y_train.tolist() if self.Y_train is not None else 'None'

        # Create JSON text and save it to file
        json_txt = json.dumps(dict_, indent=4)
        with open(filepath, 'w') as file:
            file.write(json_txt)

    # A method for loading data from JSON file
    def load_json(self, filepath):
        with open(filepath, 'r') as file:
            dict_ = json.load(file)

        self.C = dict_['C']
        self.max_iter = dict_['max_iter']
        self.solver = dict_['solver']
        self.X_train = np.asarray(dict_['X_train']) if dict_['X_train'] != 'None' else None
        self.Y_train = np.asarray(dict_['Y_train']) if dict_['Y_train'] != 'None' else None

Next we create an object mylogreg, pass the training data to it, and save it to file. 

Then we create a new object json_mylogreg and call the load_json method to load the data from file.

In [16]:
filepath = "mylogreg.json"

# Create a model that holds the training data (it is not fitted here) and save it
mylogreg = MyLogReg(X_train=Xtrain, Y_train=Ytrain)  
mylogreg.save_json(filepath)

# Create a new object and load its data from JSON file
json_mylogreg = MyLogReg()  
json_mylogreg.load_json(filepath)  
json_mylogreg  

MyLogReg(C=1.0,
         X_train=array([[4.3, 3. , 1.1, 0.1],
       [5.7, 4.4, 1.5, 0.4],
       [5.9, 3. , 4.2, 1.5],
       [6.1, 3. , 4.6, 1.4],
       [6.5, 3. , 5.5, 1.8],
       [5.2, 3.5, 1.5, 0.2],
       [5.6, 2.5, 3.9, 1.1],
       [7.7, 2.6, 6.9, 2.3],
       [6.3, 3.4, 5.6, 2.4],
       [6.2, 2.9, 4.3, 1.3],
       [5.7, 2.9, 4.2, 1.3],
       [5. , 3.5, 1.6, 0.6],
       [5.6, 2.9, 3.6, 1.3],
       [6. , 2.2, 5. , 1.5],
       [5.5, 2.6, 4.4, 1.2],
       [4.6, 3.4, 1.4, 0.3],
       [5.6, 3. , 4.1, 1.3],
       [5.1, 3.4, 1.5, 0.2],
       [6.4...
       [6.4, 3.1, 5.5, 1.8],
       [7. , 3.2, 4.7, 1.4],
       [6.3, 2.3, 4.4, 1.3],
       [6.5, 3. , 5.8, 2.2],
       [7.2, 3. , 5.8, 1.6],
       [7.7, 2.8, 6.7, 2. ]]),
         Y_train=array([0, 0, 1, 1, 2, 0, 1, 2, 2, 1, 1, 0, 1, 2, 1, 0, 1, 0, 1, 2, 1, 2,
       1, 0, 2, 2, 0, 1, 2, 0, 2, 1, 2, 1, 0, 2, 1, 2, 0, 2, 1, 2, 1, 2,
       1, 1, 2, 1, 1, 2, 1, 1, 0, 2, 0, 1, 0, 1, 1, 1, 1, 0, 2, 2, 1, 1,
       1, 0, 0, 2,

**Let's reflect back on the JSON approach**

PROs :

Since data serialization using JSON saves the object in a text format rather than as a byte stream, the 'mylogreg.json' file can be opened and modified with a text editor.


CONs :

Although this approach would be convenient for the developer, it is less secure since an intruder can view and amend the content of the JSON file. 

Moreover, this approach is more suitable for objects with a small number of instance variables, such as the scikit-learn models, because any addition of new variables requires changes in the save and restore methods.
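
The MyLogReg example stores the parameters and the training data, but not the fitted coefficients, so a reloaded object would have to be trained again before it can predict. A minimal sketch of also saving the coefficients is shown below. It rests on some assumptions: the helper names `save_fitted_json` and `load_fitted_json` are hypothetical, the model uses the default solver, and setting the fitted attributes `classes_`, `coef_`, `intercept_` and `n_features_in_` by hand is assumed to be enough for `predict` on a LogisticRegression, which may not hold for other estimators or library versions.

```python
import json
import os
import tempfile

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
Xtrain, Xtest, Ytrain, Ytest = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=4)

fitted = LogisticRegression(C=0.1, max_iter=1000).fit(Xtrain, Ytrain)


def save_fitted_json(model, filepath):
    """Hypothetical helper: write parameters and fitted coefficients to JSON."""
    state = {
        "params": {"C": model.C, "max_iter": model.max_iter},
        # NumPy arrays are not JSON serializable, so convert them to lists
        "classes_": model.classes_.tolist(),
        "coef_": model.coef_.tolist(),
        "intercept_": model.intercept_.tolist(),
    }
    with open(filepath, "w") as fh:
        json.dump(state, fh, indent=4)


def load_fitted_json(filepath):
    """Hypothetical helper: rebuild a model from the JSON file without refitting."""
    with open(filepath, "r") as fh:
        state = json.load(fh)
    model = LogisticRegression(**state["params"])
    # Assumption: these attributes are all that predict() needs
    model.classes_ = np.asarray(state["classes_"])
    model.coef_ = np.asarray(state["coef_"])
    model.intercept_ = np.asarray(state["intercept_"])
    model.n_features_in_ = model.coef_.shape[1]
    return model


json_path = os.path.join(tempfile.gettempdir(), "fitted_logreg.json")
save_fitted_json(fitted, json_path)
restored = load_fitted_json(json_path)

print("Same predictions:", np.array_equal(restored.predict(Xtest), fitted.predict(Xtest)))
```

Because the coefficients are written as plain numbers, the file remains readable in a text editor, at the cost of having to know which fitted attributes each model type needs.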

# Exercise

Create a new class which inherits from the model class Random Forest Classifier. The new class, called MyRF, then implements the methods save_json and load_json for saving and restoring to/from a JSON file, respectively.

The random forest shall only build 10 trees with 10 as the maximum depth of each tree.

MyRF(X_train=array([[4.3, 3. , 1.1, 0.1],
       [5.7, 4.4, 1.5, 0.4],
       [5.9, 3. , 4.2, 1.5],
       [6.1, 3. , 4.6, 1.4],
       [6.5, 3. , 5.5, 1.8],
       [5.2, 3.5, 1.5, 0.2],
       [5.6, 2.5, 3.9, 1.1],
       [7.7, 2.6, 6.9, 2.3],
       [6.3, 3.4, 5.6, 2.4],
       [6.2, 2.9, 4.3, 1.3],
       [5.7, 2.9, 4.2, 1.3],
       [5. , 3.5, 1.6, 0.6],
       [5.6, 2.9, 3.6, 1.3],
       [6. , 2.2, 5. , 1.5],
       [5.5, 2.6, 4.4, 1.2],
       [4.6, 3.4, 1.4, 0.3],
       [5.6, 3. , 4.1, 1.3],
       [5.1, 3.4, 1.5, 0.2],
       [6.4, 2.9, 4.3, 1...
       [6.4, 3.1, 5.5, 1.8],
       [7. , 3.2, 4.7, 1.4],
       [6.3, 2.3, 4.4, 1.3],
       [6.5, 3. , 5.8, 2.2],
       [7.2, 3. , 5.8, 1.6],
       [7.7, 2.8, 6.7, 2. ]]),
     Y_train=array([0, 0, 1, 1, 2, 0, 1, 2, 2, 1, 1, 0, 1, 2, 1, 0, 1, 0, 1, 2, 1, 2,
       1, 0, 2, 2, 0, 1, 2, 0, 2, 1, 2, 1, 0, 2, 1, 2, 0, 2, 1, 2, 1, 2,
       1, 1, 2, 1, 1, 2, 1, 1, 0, 2, 0, 1, 0, 1, 1, 1, 1, 0, 2, 2, 1, 1,
       1, 0, 0, 2, 2, 0, 0, 0