# Tracking scikit-learn models using Verta

Verta's experiment management system lets data scientists track rich information about their modeling experiments, including metrics, hyperparameters, confusion matrices, and examples of input and output data.

This notebook shows how to use Verta's experiment management system with models developed in scikit-learn. See the Verta [documentation](https://docs.verta.ai/verta/experiment-management) for full details on its experiment management capabilities.

Updated for Verta version: 0.18.2

This example features:
- **scikit-learn**'s `LogisticRegression` model
- **scikit-learn**'s `GridSearchCV` utility for performing grid search and cross-validation
- **verta**'s Python client logging the grid search results
- **verta**'s Python client retrieving the best run from the grid search to calculate and log full training accuracy

<a href="https://colab.research.google.com/github/VertaAI/modeldb/blob/master/client/workflows/examples/sklearn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 0. Imports

In [1]:
from __future__ import print_function

import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

import itertools
import time
import os

import six

import numpy as np
import pandas as pd

import sklearn
from sklearn import datasets
from sklearn import linear_model
from sklearn import metrics
from sklearn import model_selection

### 0.1 Verta import and setup

In [2]:
# restart your notebook if prompted on Colab
try:
    import verta
except ImportError:
    !pip install verta

In [3]:
# import os
# os.environ['VERTA_EMAIL'] = 
# os.environ['VERTA_DEV_KEY'] = 
# os.environ['VERTA_HOST'] = 
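
The client cell that follows reads `VERTA_HOST` from the environment, and the commented-out lines above suggest `VERTA_EMAIL` and `VERTA_DEV_KEY` are expected there as well. A minimal sketch of a check that reports unset variables before the client is created; `missing_verta_settings` is a hypothetical helper, not part of the Verta client, and the host value is a placeholder:

```python
import os

# Variable names taken from the commented-out cell above.
REQUIRED_VARS = ("VERTA_HOST", "VERTA_EMAIL", "VERTA_DEV_KEY")


def missing_verta_settings(environ=None):
    """Return the required variable names that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not environ.get(name)]


# Example with an explicit mapping instead of the real environment.
example_env = {"VERTA_HOST": "verta.example.com", "VERTA_EMAIL": ""}
print(missing_verta_settings(example_env))  # ['VERTA_EMAIL', 'VERTA_DEV_KEY']
```

Calling `missing_verta_settings()` with no argument inspects the real environment, so it can be run just before `Client(...)` to catch a missing setting with a clearer message than a `KeyError`.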

In [4]:
from verta import Client
from verta.utils import ModelAPI

client = Client(os.environ['VERTA_HOST'])

---

## 1. Model Training

### 1.1 Prepare Data

In [5]:
data = datasets.load_iris()

X = data['data']
y = data['target']
labels = data['target_names'].tolist()

In [6]:
df = pd.DataFrame(np.hstack((X, y.reshape(-1, 1))),
                  columns=data['feature_names'] + ['species'])

df.head()

### 1.2 Prepare Hyperparameters

In [7]:
grid = {
    'C': [1e-4, 1e-3, 1e-2],
    'solver': ['lbfgs'],
    'max_iter': [10000, 100000],  # integers: recent scikit-learn releases reject float max_iter
}
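
`GridSearchCV` evaluates one candidate per combination of the listed values, so this grid yields 3 × 1 × 2 = 6 candidates. A minimal sketch of that expansion using only `itertools`; it illustrates the idea and is not scikit-learn's internal code:

```python
import itertools

# Values mirror the grid in the cell above; max_iter is written as integers here.
grid = {
    "C": [1e-4, 1e-3, 1e-2],
    "solver": ["lbfgs"],
    "max_iter": [10000, 100000],
}

# Cartesian product of the value lists, one dict per candidate.
keys = sorted(grid)
candidates = [dict(zip(keys, values))
              for values in itertools.product(*(grid[k] for k in keys))]

print(len(candidates))  # 6
for candidate in candidates:
    print(candidate)
```

With `cv=5`, each of these candidates is fitted five times, once per validation fold.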

### 1.3 Train model

In [8]:
model = linear_model.LogisticRegression(multi_class='auto')
grid_search = model_selection.GridSearchCV(model, grid,
                                           cv=5, return_train_score=False)
grid_search.fit(X, y)

In [9]:
# set up structures in Verta
proj = client.set_project("Iris Classification")
expt = client.set_experiment("Logistic Regression")

In [10]:
results = pd.DataFrame(grid_search.cv_results_)

for _, run_result in results.iterrows():
    run = client.set_experiment_run()
    
    # log hyperparameters
    run.log_hyperparameters(run_result['params'])
    
    # log accuracy for each validation fold
    for obs_key in ["split{}_test_score".format(i) for i in range(5)]:
        run.log_observation("fold_acc", run_result[obs_key])
    
    # log summary stats of validation
    run.log_metric("val_acc_mean", run_result['mean_test_score'])
    run.log_metric("val_acc_std", run_result['std_test_score'])
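
The `val_acc_mean` and `val_acc_std` metrics summarize the five per-fold scores that are also logged as `fold_acc` observations. A small sketch of that relationship with made-up fold scores, assuming `std_test_score` is the population standard deviation of equally weighted folds (worth confirming against your scikit-learn version):

```python
import statistics

# Hypothetical per-fold accuracies for one hyperparameter combination.
fold_acc = [0.90, 0.95, 1.00, 0.95, 0.90]

val_acc_mean = statistics.mean(fold_acc)
# Population standard deviation: divide by n, not n - 1.
val_acc_std = statistics.pstdev(fold_acc)

print("val_acc_mean: {:.4f}".format(val_acc_mean))
print("val_acc_std:  {:.4f}".format(val_acc_std))
```

Logging the individual folds alongside the summary makes it possible to spot a candidate whose mean looks good only because of one unusually easy fold.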

---

## 2. Retrieve logged metadata

In [11]:
best_run = expt.expt_runs.sort("metrics.val_acc_mean", descending=True)[0]
print("Validation Accuracy: {:.4f}".format(best_run.get_metric("val_acc_mean")))

best_hyperparams = best_run.get_hyperparameters()
print("Hyperparameters: {}".format(best_hyperparams))
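
Sorting by `metrics.val_acc_mean` in descending order and taking the first element selects the run with the highest mean validation accuracy. The same selection on plain dictionaries, with invented values standing in for logged runs, looks roughly like this:

```python
# Hypothetical records standing in for logged runs; all values are made up.
runs = [
    {"hyperparameters": {"C": 1e-4, "solver": "lbfgs", "max_iter": 10000},
     "metrics": {"val_acc_mean": 0.81}},
    {"hyperparameters": {"C": 1e-3, "solver": "lbfgs", "max_iter": 10000},
     "metrics": {"val_acc_mean": 0.88}},
    {"hyperparameters": {"C": 1e-2, "solver": "lbfgs", "max_iter": 10000},
     "metrics": {"val_acc_mean": 0.93}},
]

# Sort descending by the metric and take the first record.
ranked = sorted(runs, key=lambda r: r["metrics"]["val_acc_mean"], reverse=True)
best = ranked[0]

print("Validation Accuracy: {:.4f}".format(best["metrics"]["val_acc_mean"]))
print("Hyperparameters: {}".format(best["hyperparameters"]))
```

The benefit of doing this through the logged runs rather than `grid_search` itself is that the selection still works in a later session, after the in-memory search object is gone.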

---

## 3. Train on Full Dataset using best hyperparams

In [12]:
model = linear_model.LogisticRegression(multi_class='auto', **best_hyperparams)
model.fit(X, y)

In [13]:
train_acc = model.score(X, y)
# best_run.log_metric("full_train_acc", train_acc)
best_run.log_tag("best-run")

# compute and log confusion matrix
from sklearn.metrics import confusion_matrix
from verta.data_types import ConfusionMatrix
# use a separate name so the imported function is not shadowed
cm = confusion_matrix(y, model.predict(X))
best_run.log_attribute("confusion_matrix", ConfusionMatrix(cm.tolist(), labels))


# log additional info for reproducibility
from verta.code import Notebook
from verta.environment import Python
best_run.log_artifact("notebook", "sklearn-gridsearch.ipynb")
best_run.log_environment(Python(["sklearn", "numpy", "pandas"]))
best_run.log_artifact("data", df)

print("Training accuracy: {:.4f}".format(train_acc))
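
The confusion matrix logged in this cell is a nested list in which row `i`, column `j` counts the samples whose true class is `i` and whose predicted class is `j` (the scikit-learn convention). A pure-Python sketch of that layout with made-up labels; `confusion_counts` is a hypothetical helper for illustration, not a replacement for `sklearn.metrics.confusion_matrix`:

```python
def confusion_counts(y_true, y_pred, n_classes):
    """Count samples per (true class, predicted class) pair."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


# Hypothetical labels for three classes (0, 1, 2).
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2, 1]

matrix = confusion_counts(y_true, y_pred, n_classes=3)
for row in matrix:
    print(row)

# Accuracy is the diagonal sum divided by the number of samples.
accuracy = sum(matrix[i][i] for i in range(3)) / len(y_true)
print("Accuracy: {:.4f}".format(accuracy))
```

The diagonal holds the correct predictions, which is why the accuracy reported by `model.score` equals the diagonal sum divided by the sample count.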

---