# Saving and loading a Machine Learning model

Saving and loading machine learning models is crucial for efficiency and practicality. Training a model, especially a complex one, can take significant time and computing resources. Saving the trained model lets you reuse it on new data without retraining, which is especially useful for real-time applications or those handling a constant stream of incoming data. A saved model can also be deployed on other systems without being retrained on each machine. In short, saving and loading models makes machine learning work more efficient and reusable.

In Python, there are two main ways of saving a machine learning model: using the built-in `pickle` module, or replacing its `dump` and `load` functions with joblib's equivalents, which can offer better performance and optional compression for models containing large NumPy arrays.

- [`pickle`](#pickle)
  - [Dump](#dump)
  - [Load](#load)
- [`joblib`](#joblib)
  - [Dump](#dump)
  - [Load](#load)


In [30]:
# Importing packages

# Utilities
import numpy as np
import pandas as pd

# Saving models
import pickle
from joblib import dump, load

# Model
from sklearn.ensemble import RandomForestClassifier

# Pre-processing
from sklearn.model_selection import train_test_split

# Model improvement
from sklearn.model_selection import GridSearchCV

# Evaluation metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [21]:
# Importing dataset
heart_disease = pd.read_csv('../datasets/heart-disease.csv')

In [22]:
# Creating model
# Creating new dict with sets of hyperparameters
grid_2 = {
    'n_estimators': [100, 200, 500],
    'max_depth': [None],
    'max_features': ['log2', 'sqrt'],
    'min_samples_split': [6],
    'min_samples_leaf': [1, 2],
}

# Setting seed
np.random.seed(42)

# Shuffling dataset
heart_disease_shuffled = heart_disease.sample(frac=1)

# Split into X and y
X = heart_disease_shuffled.drop('target', axis=1)
y = heart_disease_shuffled['target']

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiating model
clf = RandomForestClassifier(n_jobs=1)

# Setup GridSearchCV
gs_clf = GridSearchCV(
    estimator=clf,
    param_grid=grid_2,
    cv=5,
    verbose=2,
)

# Fitting the GridSearchCV version of clf
gs_clf.fit(X_train, y_train)

Fitting 5 folds for each of 12 candidates, totalling 60 fits
[CV] END max_depth=None, max_features=log2, min_samples_leaf=1, min_samples_split=6, n_estimators=100; total time=   0.1s
[CV] END max_depth=None, max_features=log2, min_samples_leaf=1, min_samples_split=6, n_estimators=100; total time=   0.0s
[CV] END max_depth=None, max_features=log2, min_samples_leaf=1, min_samples_split=6, n_estimators=100; total time=   0.0s
[CV] END max_depth=None, max_features=log2, min_samples_leaf=1, min_samples_split=6, n_estimators=100; total time=   0.0s
[CV] END max_depth=None, max_features=log2, min_samples_leaf=1, min_samples_split=6, n_estimators=100; total time=   0.0s
[CV] END max_depth=None, max_features=log2, min_samples_leaf=1, min_samples_split=6, n_estimators=200; total time=   0.1s
[CV] END max_depth=None, max_features=log2, min_samples_leaf=1, min_samples_split=6, n_estimators=200; total time=   0.1s
[CV] END max_depth=None, max_features=log2, min_samples_leaf=1, min_samples_split=6, 

In [23]:
# Defining evaluate_preds function
def evaluate_preds(y_true, y_preds) -> dict:
    """
    Compares y_true labels against y_preds labels for a classification
    model and returns accuracy, precision, recall and F1 as a dict.
    """
    accuracy = accuracy_score(y_true=y_true, y_pred=y_preds)
    precision = precision_score(
        y_true=y_true,
        y_pred=y_preds,
    )
    recall = recall_score(y_true=y_true, y_pred=y_preds)
    f1 = f1_score(y_true=y_true, y_pred=y_preds)

    metric_dict: dict[str, float] = {
        'accuracy': float(f'{accuracy:.2f}'),
        'precision': float(f'{precision:.2f}'),
        'recall': float(f'{recall:.2f}'),
        'f1': float(f'{f1:.2f}'),
    }

    # Double quotes outside, single quotes inside: reusing the same quote
    # type inside an f-string only works on Python 3.12 and later
    print(f"Accuracy: {metric_dict['accuracy']}")
    print(f"Precision: {metric_dict['precision']}")
    print(f"Recall: {metric_dict['recall']}")
    print(f"f1: {metric_dict['f1']}")

    return metric_dict

## `pickle`

The `pickle` module is a bridge between Python objects and persistent storage. It takes a complex machine learning model, which is essentially a Python object with its internal parameters and learned knowledge, and converts it into a format that can be saved as a file.

This process, called pickling, transforms the model into a stream of bytes. When the model is needed later, unpickling reverses the process and reconstructs the original model object from the saved file.

A trained model can therefore be saved after a potentially lengthy training process and reused for future predictions on new data, saving significant time and resources. The pickled model can also be shared with others for deployment or further analysis.

### Dump

The `dump` function turns a Python object into a binary file. It takes two required arguments: the object to be saved and a file object opened in write-binary (`'wb'`) mode. It serializes the object into a byte stream and writes that stream directly to the file. This is ideal for saving a model to a specific location on the computer's disk for later use.

The `dumps` function also serializes Python objects. The difference lies in where the result goes: `dump` writes the binary data to a file, whereas `dumps` returns the entire pickled representation as a bytes object in memory. This is useful for keeping the model in a variable, passing it around in the program, sending it over a network, or storing it in a database.

### Load

The `load` function is designed to work with files. It takes a single required argument, a file object opened in read-binary (`'rb'`) mode. `load` reads the byte stream containing the pickled data from the file and deserializes it, recreating the original Python object. This makes it a convenient way of bringing a saved machine learning model back into a program to make predictions on new data. Because unpickling can execute arbitrary code, only load pickle files from sources you trust.

Much like `dump` and `dumps`, the `load` function has a `loads` counterpart. In contrast to `load`, the `loads` function works with pickled data in memory, represented as a bytes object. It takes this bytes object as its single required argument and deserializes it, returning the original Python object. It pairs naturally with `dumps`, since it can load pickled data received over a network or retrieved from a database in that format.
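
The in-memory round trip described above can be sketched as follows. This is a minimal illustration that uses only the standard library; `model_like` is a hypothetical stand-in for a trained model, since any picklable Python object behaves the same way.

```python
import pickle

# Hypothetical stand-in for a trained model (not the notebook's gs_clf)
model_like = {"weights": [0.25, -1.5, 3.0], "bias": 0.1}

# dumps returns the pickled representation as a bytes object held in memory
payload = pickle.dumps(model_like)

# loads rebuilds an equal, but distinct, object from those bytes
restored = pickle.loads(payload)

print(type(payload).__name__)
print(restored == model_like)
print(restored is model_like)
```

The `payload` bytes are what would be written to a database column or sent over a socket; the receiving side only needs `pickle.loads` to get the object back.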


In [24]:
# Saving model to file (the with block closes the file once writing finishes)
with open('../models/gs_random_forest_model.pkl', 'wb') as model_file:
    pickle.dump(gs_clf, model_file)

In [25]:
# Loading model (the with block closes the file once reading finishes)
with open('../models/gs_random_forest_model.pkl', 'rb') as model_file:
    loaded_pickle_model = pickle.load(model_file)

In [26]:
# Making predictions using loaded model
pickle_y_preds = loaded_pickle_model.predict(X_test)
evaluate_preds(y_true=y_test, y_preds=pickle_y_preds)

Accuracy: 0.82
Precision: 0.87
Recall: 0.79
f1: 0.83


{'accuracy': 0.82, 'precision': 0.87, 'recall': 0.79, 'f1': 0.83}

## `joblib`

Joblib is a popular alternative to `pickle` for saving machine learning models. It is built on top of `pickle` and is optimized for objects that carry large NumPy arrays, which are common in machine learning models. Joblib can optionally compress the data while saving, which produces smaller files at the cost of some extra time spent compressing and decompressing. Joblib also offers parallel execution and function-result caching, but those are separate utilities and play no part in saving a model. Overall, Joblib provides a streamlined way to persist models that contain large arrays.

### Dump

Joblib's `dump` function takes the model and a destination (a filename or a file object), much like pickle's `dump`, with two differences that matter for machine learning workflows. First, it handles the large NumPy arrays that are common in these models efficiently. Second, it can compress the data while saving when the `compress` argument is set. Compression is off by default, and higher compression levels trade smaller files for slower reads and writes. When given a filename, `dump` also opens and closes the file itself and returns the list of filenames it wrote.
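
As a rough sketch of the opt-in compression, assuming joblib and NumPy are installed: the example below saves the same object twice and compares file sizes. `model_like` and the file names are hypothetical stand-ins, and an array of zeros is a best case for compression, so a real model will usually shrink far less.

```python
import os
import tempfile

import numpy as np
from joblib import dump

# Hypothetical stand-in for a model with a large internal array
model_like = {"coefficients": np.zeros((1000, 1000))}

with tempfile.TemporaryDirectory() as tmp_dir:
    raw_path = os.path.join(tmp_dir, "model_raw.joblib")
    compressed_path = os.path.join(tmp_dir, "model_compressed.joblib")

    # Default behaviour: compress=0, so nothing is compressed
    dump(model_like, raw_path)

    # Opt in to compression with an integer level between 0 and 9
    dump(model_like, compressed_path, compress=3)

    # Read the sizes before the temporary directory is removed
    raw_size = os.path.getsize(raw_path)
    compressed_size = os.path.getsize(compressed_path)

print(f"uncompressed: {raw_size} bytes")
print(f"compressed:   {compressed_size} bytes")
```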

### Load

The counterpart to `dump`, Joblib's `load` function retrieves saved machine learning models. Its main argument is a filename or an open file object pointing to the saved model. If the data was compressed by `dump`, `load` detects this and decompresses it on the fly, so you get back the original model. The reconstructed object can then be used in your program to make predictions on new data.
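
The two ways of calling `load` can be sketched like this, again assuming joblib and NumPy are installed and using a hypothetical `model_like` object in place of a real model. The file is written with compression to show that `load` needs no extra argument to read it back.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load

# Hypothetical stand-in for a trained model
model_like = {"coefficients": np.arange(10, dtype=float)}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_like.joblib")
    dump(model_like, path, compress=3)

    # Load by filename; compression is detected from the file itself
    from_path = load(path)

    # Load from a file object opened in read-binary mode
    with open(path, "rb") as model_file:
        from_file = load(model_file)

# Both routes should give back arrays equal to the original
same_from_path = np.array_equal(from_path["coefficients"], model_like["coefficients"])
same_from_file = np.array_equal(from_file["coefficients"], model_like["coefficients"])
print(same_from_path, same_from_file)
```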


In [32]:
# Save model to file
dump(gs_clf, filename='../models/gs_random_forest_model_1.joblib')

['../models/gs_random_forest_model_1.joblib']

In [33]:
# Loading saved model
loaded_joblib_model = load(filename='../models/gs_random_forest_model_1.joblib')

In [34]:
# Making and evaluating joblib predictions
joblib_y_preds = loaded_joblib_model.predict(X_test)
evaluate_preds(y_true=y_test, y_preds=joblib_y_preds)

Accuracy: 0.82
Precision: 0.87
Recall: 0.79
f1: 0.83


{'accuracy': 0.82, 'precision': 0.87, 'recall': 0.79, 'f1': 0.83}