# Registering a scikit-learn model on Verta

Within Verta, a "Model" can be any arbitrary function: a traditional ML model (e.g., sklearn, PyTorch, TF); a plain function (e.g., squaring a number, making a DB call); or a mixture of the above (e.g., pre-processing code, a DB call, and then a model application). See the [registry concepts documentation](https://docs.verta.ai/verta/registry/concepts) for more.
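
To make the "plain function" case concrete, here is a minimal sketch of a model that only squares its inputs. `SquareModel` is a hypothetical name used for illustration; it mirrors the `__init__(self, artifacts)` / `predict(self, batch_input)` shape used later in this notebook but does not import or depend on Verta.

```python
class SquareModel:
    """Hypothetical stand-in: a "model" that squares each input value."""

    def __init__(self, artifacts=None):
        # A pure function needs no artifacts; the argument is kept only to
        # mirror the constructor shape used later in this notebook.
        self.artifacts = artifacts

    def predict(self, batch_input):
        # One output per input, in the same order.
        return [value * value for value in batch_input]


square_model = SquareModel()
print(square_model.predict([1, 2, 3]))
```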

This notebook provides an example of how to register a scikit-learn model on Verta as a Verta Standard Model either via convenience functions or by extending [VertaModelBase](https://verta.readthedocs.io/en/master/_autogen/verta.registry.VertaModelBase.html?highlight=VertaModelBase#verta.registry.VertaModelBase).

<a href="https://colab.research.google.com/github/VertaAI/examples/blob/registry_examples/registry/sklearn/sklearn-register-model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 0. Imports

In [None]:
!pip install wget  # you may need pip3
!pip install verta # restart colab if prompted

In [None]:
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning, module="sklearn")
warnings.filterwarnings("ignore", category=FutureWarning, module="sklearn")
warnings.filterwarnings("ignore", category=UserWarning, module="sklearn")

import numpy as np
import pandas as pd

from sklearn import model_selection
from sklearn.neighbors import KNeighborsClassifier

## 1. Register Model

### 1.1 (Optional) Model Training 

A model has to exist before we can register it, so we will train one here in the notebook.

If you already have a trained sklearn model pickled into a file, you can skip this step and register it on the catalog directly.

#### 1.1.1 Load Training Data

In [None]:
import os
import wget

train_data_url = "http://s3.amazonaws.com/verta-starter/census-train.csv"
train_data_filename = wget.detect_filename(train_data_url)
if not os.path.isfile(train_data_filename):
    wget.download(train_data_url)

test_data_url = "http://s3.amazonaws.com/verta-starter/census-test.csv"
test_data_filename = wget.detect_filename(test_data_url)
if not os.path.isfile(test_data_filename):
    wget.download(test_data_url)

In [None]:
df_train = pd.read_csv(train_data_filename)
X_train = df_train.iloc[:,:-1]
y_train = df_train.iloc[:, -1]

df_test = pd.read_csv(test_data_filename)
X_test = df_test.iloc[:,:-1]
y_test = df_test.iloc[:, -1]


df_train.head()

#### 1.1.2 Train/Test code

**Model Info**

We'll be training a K Nearest Neighbors classifier on the [Adult-Census](https://archive.ics.uci.edu/ml/datasets/Adult) dataset to predict whether a person's income is above or below 50k.
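
For readers new to the algorithm, the idea behind k-nearest neighbors can be sketched in a few lines of plain Python: find the `k` training points closest to a query and take a majority vote over their labels. This is an illustration only, not scikit-learn's implementation; `knn_predict` and the toy data are made up for the example.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=5):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its label,
    # then sort so the closest points come first.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy data, invented for illustration: [age, hours worked per week] -> income label
points = [[25, 20], [30, 25], [28, 30], [45, 50], [50, 60], [48, 55]]
labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(points, labels, [47, 52], k=3))
print(knn_predict(points, labels, [27, 22], k=3))
```

`KNeighborsClassifier()` uses five neighbors by default, which is why the sketch defaults to `k=5`.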

In [None]:
# create validation split
(X_val_train, X_val_test,
 y_val_train, y_val_test) = model_selection.train_test_split(X_train, y_train,
                                                             test_size=0.2,
                                                             shuffle=True)

# train on the training portion only, so the validation rows stay unseen
model = KNeighborsClassifier()
model.fit(X_val_train, y_val_train)

# calculate and print validation accuracy
val_acc = model.score(X_val_test, y_val_test)
print(val_acc)

# refit on the full training set before registering
model.fit(X_train, y_train)

### 1.2 Register Model to Verta Model Catalog

Now that the model is in good shape, we can register it on the Verta platform.

We'll connect to Verta through the [Verta Python Client](https://verta.readthedocs.io/en/main/_autogen/verta.Client.html), 
create a [registered model](https://verta.readthedocs.io/en/master/_autogen/verta.registry.entities.RegisteredModel.html) for our scikit-learn model 
and a [version](https://verta.readthedocs.io/en/master/_autogen/verta.registry.entities.RegisteredModelVersion.html) to associate this particular model with on the catalog.

All of these can be viewed in the Verta web app once they are created.

In [None]:
# Fill in your Verta credentials below (or set them earlier in the notebook) to connect to the Verta platform

from verta import Client

client = Client(
        #   host="app.verta.ai",
        #   email="user@verta.ai",
        #   dev_key="a765b2de-786d-466c-b2d8-thiye06f80d5",
        )
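
As an alternative to pasting credentials into the notebook, the Verta client can pick them up from the `VERTA_HOST`, `VERTA_EMAIL`, and `VERTA_DEV_KEY` environment variables. The sketch below only checks whether those variables are set before `Client()` is called; `missing_verta_env` is a hypothetical helper, not part of the Verta library.

```python
import os

# Environment variables the Verta client reads when no arguments are passed.
REQUIRED_VARS = ("VERTA_HOST", "VERTA_EMAIL", "VERTA_DEV_KEY")


def missing_verta_env(environ=os.environ):
    """Return the names of the required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]


missing = missing_verta_env()
if missing:
    print("Set these before calling Client():", ", ".join(missing))
```

Keeping the dev key out of the notebook also avoids committing it by accident when the notebook is shared.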

In [None]:
# Create/Get a Verta registered model

registered_model = client.get_or_create_registered_model(
    name="census-sklearn-example", # Name to identify on the catalog
    labels=["tabular", "sklearn"] # tags/labels to make it easier to search and categorize
    )

#### 1.2.1 Register from the model object
 
If the model object is available in the current session (either trained above or loaded from a pickle file with joblib/cloudpickle), use the code below to package the model.

In [None]:
from verta.environment import Python

# uncomment the lines below to load the model from a pickle file
# import cloudpickle
# with open("<filepath_to_model.pkl>", "rb") as f:
#     model = cloudpickle.load(f)

model_version_v1 = registered_model.create_standard_model_from_sklearn(
    model,  # the model object: trained above or deserialized with joblib/cloudpickle
    environment=Python(requirements=["scikit-learn"]),  # libraries the model needs to run
    name="v1",  # name that identifies the version in the model versions tab
)

#### (OR) 1.2.2 Register from serialized object by extending VertaModelBase

This approach is useful if you have preprocessing functions that can't be integrated into the scikit-learn model/pipeline. In addition, defining the `describe` and `example` methods helps populate the Model API.
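
The pattern can be sketched without Verta or scikit-learn: a wrapper class that turns raw records into feature rows and then delegates to the fitted model. Everything here is hypothetical and for illustration only (`ThresholdModel` stands in for a fitted estimator, and the field names are invented); in the real class below, the wrapper extends `VertaModelBase` and loads the model from its artifacts.

```python
class ThresholdModel:
    """Stand-in for a fitted estimator: predicts 1 when the first feature exceeds 40."""

    def predict(self, rows):
        return [1 if row[0] > 40 else 0 for row in rows]


class PreprocessingClassifier:
    """Hypothetical wrapper: raw records in, labels out."""

    def __init__(self, model):
        self.model = model

    def _to_features(self, record):
        # Custom step that lives outside the model: parse strings, fill defaults.
        hours = float(record.get("hours_per_week", 0))
        age = float(record.get("age", 0))
        return [hours, age]

    def predict(self, batch_input):
        features = [self._to_features(record) for record in batch_input]
        return list(self.model.predict(features))


wrapper = PreprocessingClassifier(ThresholdModel())
print(wrapper.predict([{"age": "39", "hours_per_week": "45"}, {"age": "52"}]))
```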

In [None]:
# pickle artifacts

import cloudpickle
with open("model.pkl", "wb") as f:
    cloudpickle.dump(model, f)

In [None]:
from verta.registry import VertaModelBase, verify_io

class CensusIncomeClassifier(VertaModelBase):
    def __init__(self, artifacts):
        # close the file handle once the model has been deserialized
        with open(artifacts["serialized_model"], "rb") as f:
            self.knn = cloudpickle.load(f)
        
    @verify_io
    def predict(self, batch_input):
        results = []
        for one_input in batch_input:
            results.extend(self.knn.predict([one_input]).tolist())
        return results
    
    # Optional: populates the model playground
    def describe(self):
        """Return a description of the service."""
        return {
            "method": "predict",
            "args": ",".join(list(X_train.columns)),
            "returns": "income_label",
            "description": """
                Predicts whether a person has >50k income based on census data.
            """,
            "input_description": """
                Batch of census information, one sample per entry.
            """,
            "output_description": """
                Binary classification, with 1 representing the prediction that the
                person earns more than 50k a year.
            """
        }
    
    # Optional: populates the model playground
    def example(self):
        """Return example input json-serializable data."""
        return [
            [71.67822567370767, 0.0, 0.0, 99.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
            [6.901547652701675, 0.0, 1887.0, 50.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
            [72.84132724180968, 0.0, 0.0, 40.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
        ]

As a sanity check, we can validate that our model is instantiable and can produce predictions.

In [None]:
artifacts_dict = {"serialized_model" : "model.pkl"}
clf = CensusIncomeClassifier(artifacts_dict)
clf.predict(clf.example())
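
A slightly stricter sanity check could also confirm that there is one prediction per input row and that inputs and outputs survive JSON serialization, since a deployed model exchanges JSON. The sketch below uses a hypothetical `StubClassifier` so that it runs on its own; `check_model` is likewise an illustrative helper, not a Verta function, and the same call should work with `clf` from the cell above.

```python
import json


class StubClassifier:
    """Hypothetical stand-in exposing the same example()/predict() methods."""

    def example(self):
        return [[39.0, 0.0, 40.0], [52.0, 1.0, 60.0]]

    def predict(self, batch_input):
        return [1 if row[2] > 45 else 0 for row in batch_input]


def check_model(model):
    """Run the model on its own example and verify the shape of the result."""
    batch = model.example()
    predictions = model.predict(batch)
    assert len(predictions) == len(batch), "expected one prediction per input row"
    # json.dumps raises TypeError for values such as NumPy arrays.
    json.dumps(batch)
    json.dumps(predictions)
    return predictions


print(check_model(StubClassifier()))
```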

In [None]:
from verta.environment import Python
from verta.utils import ModelAPI

model_version = registered_model.create_standard_model(
    model_cls=CensusIncomeClassifier,
    environment=Python(requirements=["scikit-learn", "pandas"]),
    code_dependencies=[],
    artifacts=artifacts_dict,
    model_api=ModelAPI(X_train, y_train),
    name="v1-vbm"  # different name to avoid a clash if version "v1" was created in 1.2.1
)

That's it! You should now be able to see your model in your catalog.

---