Saving and Sharing Models
===

Author: Nathan A. Mahynski

Date: 2024/07/10

Description: After creating a great model, how can I (easily) save it for future use or share it with someone else?

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mahynski/pychemauth/blob/main/docs/jupyter/api/sharing_models.ipynb)

In [None]:
if 'google.colab' in str(get_ipython()):
    !pip install git+https://github.com/mahynski/pychemauth@main
    import os
    os.kill(os.getpid(), 9) # Automatically restart the runtime to reload libraries

In [1]:
try:
    import pychemauth
except ImportError as err:
    raise ImportError("pychemauth not installed") from err

import matplotlib.pyplot as plt
%matplotlib inline

import watermark
%load_ext watermark

%load_ext autoreload
%autoreload 2

In [2]:
import imblearn
import sklearn

import numpy as np

from sklearn.datasets import load_iris as load_data
from sklearn.model_selection import train_test_split

from pychemauth.preprocessing.scaling import CorrectedScaler
from pychemauth.classifier.simca import DDSIMCA_Model
from pychemauth.utils import HuggingFace

In [3]:
%watermark -t -m -v --iversions

Python implementation: CPython
Python version       : 3.10.12
IPython version      : 7.34.0

Compiler    : GCC 11.4.0
OS          : Linux
Release     : 6.1.85+
Machine     : x86_64
Processor   : x86_64
CPU cores   : 2
Architecture: 64bit

numpy     : 1.24.3
watermark : 2.4.3
pychemauth: 0.0.0b4
matplotlib: 3.7.2
sklearn   : 1.3.0
imblearn  : 0.11.0



Create a Model
---

Let's create a simple model as an example to work with.  In this case, we will build a pipeline that uses a DD-SIMCA model to describe a single species of flower in the iris dataset.

In [4]:
X, y = load_data(return_X_y=True, as_frame=True)

# Let's turn the indices into names
names = dict(zip(np.arange(3), ['setosa', 'versicolor', 'virginica']))
y = y.apply(lambda x: names[x])

In [5]:
X_train, X_test, y_train, y_test = train_test_split(
    X.values,
    y.values, # The target is the iris species name for each sample.
    shuffle=True,
    random_state=42,
    test_size=0.2,
    stratify=y # It is usually important to balance the test and train set so they have the same fraction of classes
)

In [6]:
# Let's just model a single type of iris for this example
chosen_class = 'setosa'

X_train_dds = X_train[y_train == chosen_class]
y_train_dds = y_train[y_train == chosen_class]

X_test_dds = X_test[y_test == chosen_class]
y_test_dds = y_test[y_test == chosen_class]

In [7]:
# Now let's build a simple pipeline
model = imblearn.pipeline.Pipeline(
    steps=[
        ("autoscaler", CorrectedScaler( # First, center and scale the data
            with_mean=True,
            with_std=True,
            pareto=False
            )
        ),
        ("my_chosen_model", DDSIMCA_Model( # Finally, pass the scaled data to the model
            n_components=1,
            scale_x=True
            )
        )
    ]
)

In [8]:
_ = model.fit(X_train_dds, y_train_dds)

In [9]:
model.predict(X_test_dds)

array([ True,  True, False,  True,  True,  True, False,  True,  True,
        True])

In [10]:
model.named_steps

{'autoscaler': <pychemauth.preprocessing.scaling.CorrectedScaler at 0x7a7bed6be080>,
 'my_chosen_model': DDSIMCA_Model(n_components=1)}

If we just want to save the model to disk, a process called "serialization", there are several ways to do so.  Perhaps the simplest is to use <a href="https://docs.python.org/3/library/pickle.html">`pickle`</a>, the object serialization module in Python's standard library.  The commands look like this:

```python
import pickle

# To save the model to disk, open the file in binary "w"rite mode ('wb')
with open('my_model.pkl', 'wb') as f:
    pickle.dump(model, f, protocol=4)

# To load the model from disk, open the file in binary "r"ead mode ('rb')
with open('my_model.pkl', 'rb') as f:
    stored_model = pickle.load(f)
```

Hugging Face
---

However, pickling is not an ideal way to store models long term.  A bare file is easy to lose track of, and if it is renamed or transferred somewhere else we may forget the details of how the model works, what it was trained on, etc.
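
One lightweight way to reduce this risk, sketched here under stated assumptions, is to write a small descriptive file next to each pickle.  The helper `save_with_sidecar` and the JSON field names are hypothetical (they are not part of PyChemAuth), and a plain dictionary stands in for a fitted model so that the example runs on its own.

```python
import hashlib
import json
import os
import pickle
import tempfile
from datetime import date


def save_with_sidecar(obj, path, notes):
    """Pickle `obj` to `path` and write `<path>.json` describing it (hypothetical helper)."""
    with open(path, "wb") as f:
        pickle.dump(obj, f, protocol=4)

    # Hash the bytes on disk so the notes can be matched to the file later.
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()

    sidecar = {
        "file": os.path.basename(path),
        "sha256": digest,
        "saved_on": date.today().isoformat(),
        "notes": notes,
    }
    with open(path + ".json", "w") as f:
        json.dump(sidecar, f, indent=2)
    return sidecar


# A plain dict stands in for a fitted model in this illustration.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "my_model.pkl")
    info = save_with_sidecar(
        {"n_components": 1},
        model_path,
        notes="DD-SIMCA on iris 'setosa'; 80/20 stratified split, random_state=42",
    )
    with open(model_path + ".json") as f:
        reloaded_info = json.load(f)
```

The recorded SHA-256 digest makes it possible to confirm later that a renamed or copied file is still the one the notes describe.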

A better solution is to use a centralized hub service which can store, deploy, track, and document the model.  [Hugging Face](https://huggingface.co/) is one such service.

<h4>From the Hugging Face Hub <a href="https://huggingface.co/docs/hub/index">documentation</a>:</h4>

> "The Hugging Face Hub is a platform with over 350k models, 75k datasets, and 150k demo apps (Spaces), all open source and publicly available, in an online platform where people can easily collaborate and build ML together. The Hub works as a central place where anyone can explore, experiment, collaborate, and build technology with Machine Learning.
>
> <h5>What can you find on the Hub?</h5>
>
> The Hugging Face Hub hosts Git-based repositories, which are version-controlled buckets that can contain all your files. 💾
>
> On it, you’ll be able to upload and discover…
>
> * Models, hosting the latest state-of-the-art models for NLP, vision, and audio tasks
> * Datasets, featuring a wide variety of data for different domains and modalities..
> * Spaces, interactive apps for demonstrating ML models directly in your browser.
>
> The Hub offers versioning, commit history, diffs, branches, and over a dozen library integrations! You can learn more about the features that all repositories share in the Repositories documentation."

We strongly encourage you to read the [Model Hub](https://huggingface.co/docs/hub/models-the-hub) and [Model Card](https://huggingface.co/docs/hub/model-cards) documentation.  The former explains how models are stored and accessed from the hub, while the latter explains how models are documented.

<h3>Pushing to the Hub</h3>

PyChemAuth provides some simple utilities to get you started saving your models on the HF Hub.  However, the commands below create only a very basic Card, so you should go to your (newly created) repo and document your model further there.
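
On the Hub, the Model Card is the `README.md` file in the repo: a YAML metadata block between `---` lines, followed by ordinary markdown.  As a rough sketch of what a fuller Card might contain, the hypothetical helper below (`build_model_card` is not part of PyChemAuth or any Hub client) assembles such a file as a string.  The tags and license shown are placeholders, not recommendations.

```python
def build_model_card(title, description, tags, license_id):
    """Return README.md text: a YAML metadata header followed by markdown."""
    lines = ["---", f"license: {license_id}", "tags:"]
    lines += [f"- {tag}" for tag in tags]
    lines += ["---", "", f"# {title}", "", description, ""]
    return "\n".join(lines)


# All values below are placeholders for illustration only.
card_text = build_model_card(
    title="DD-SIMCA model for Iris setosa",
    description=(
        "Pipeline: CorrectedScaler followed by DDSIMCA_Model(n_components=1), "
        "trained on the 'setosa' samples of the iris dataset."
    ),
    tags=["pychemauth", "one-class-classification"],
    license_id="mit",
)
print(card_text)
```

The resulting text could be pasted into the Card editor on the repo page and then extended with details on the training data, intended use, and known limitations.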

In [12]:
# Check out the documentation for more information.
?HuggingFace.push_to_hub

In [13]:
# To create repos you will need to specify a token which acts as a password behind the scenes.
# To do this, go to huggingface.co and create a token under Settings > Access Tokens.

if 'google.colab' in str(get_ipython()):
    # Colab has a nice way to store these "secrets" which you can learn about in this YouTube video:
    # https://www.youtube.com/watch?v=LPa51KxqUAw
    from google.colab import userdata
    TOKEN = userdata.get(
        'HF_TOKEN' # Change this to whatever name you saved your HF token under in the Secrets menu on Colab
    )
else:
    # Otherwise, you can just paste the token here; but be careful not to share this with anyone.
    TOKEN = "hf_*"
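
A token pasted into a notebook is easy to leak by accident when the notebook is shared.  A common alternative, sketched below, is to read it from an environment variable instead.  The helper `read_token` and the variable name `DEMO_HF_TOKEN` are hypothetical names used only for this illustration; they are not part of PyChemAuth.

```python
import os


def read_token(env_var="HF_TOKEN"):
    """Return the token stored in `env_var`, or raise if it is not set."""
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return token


# Demonstration only: a fake value stands in for a real token.
os.environ["DEMO_HF_TOKEN"] = "hf_not_a_real_token"
demo_token = read_token("DEMO_HF_TOKEN")
```

In practice the variable would be set in the shell before launching Jupyter, so the token never appears in the notebook itself.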

In [14]:
# Let's push the model to the hub.
# The first time a model is pushed to a repo that doesn't exist, it is created with a basic Card.
# In the future, this will only update the repo and should not overwrite anything you put in the Card.
HuggingFace.push_to_hub(
    model=model,
    namespace="mahynski",
    repo_name="pychemauth-sharing-demo", # Create a name for this model
    private=False, # The default is True, but since this is a demonstration we will set this to public
    token=TOKEN
)

CommitInfo(commit_url='https://huggingface.co/mahynski/pychemauth-sharing-demo/commit/e2390ed8adc8df53fb6a7d27eda3c090aedc4662', commit_message='Pushing model on 2024-07-10 14:17:24.764004', commit_description='', oid='e2390ed8adc8df53fb6a7d27eda3c090aedc4662', pr_url=None, pr_revision=None, pr_num=None)

Now you can go check the model out at https://huggingface.co/mahynski/pychemauth-sharing-demo.

<h3>Downloading Pre-trained Models</h3>

Once your model is on the Hub, anyone can download it! This way, you can share models with colleagues easily by just sending them a link to the repo. The commands to download a model created by PyChemAuth are given below.

**Note: you can also control access to your model by keeping the repo private, or by using [gating](https://huggingface.co/docs/hub/en/models-gated).**

In [15]:
# Check out the documentation for more information.
?HuggingFace.from_pretrained

In [16]:
downloaded_model = HuggingFace.from_pretrained(
    model_id="mahynski/pychemauth-sharing-demo",
    token=None # For public models we don't need a token to access them!
)

model.pkl:   0%|          | 0.00/3.17k [00:00<?, ?B/s]

In [17]:
downloaded_model.predict(X_test_dds)

array([ True,  True, False,  True,  True,  True, False,  True,  True,
        True])

In [18]:
downloaded_model.named_steps

{'autoscaler': <pychemauth.preprocessing.scaling.CorrectedScaler at 0x7a7bed782200>,
 'my_chosen_model': DDSIMCA_Model(n_components=1)}