# ModelDB Features Demo -- couture.ai

This notebook trains a sample scikit-learn model and connects it to a running ModelDB instance, using ModelDB as the model repository and for experiment logging.

The scikit-learn model used is taken from the [scikit-learn documentation](https://scikit-learn.org/stable/tutorial/basic/tutorial.html).

ModelDB references:

- [this blog by the founder](https://blog.verta.ai/blog/model-versioning-done-right-a-modeldb-2.0-walkthrough)
- [webinar 1 code](https://github.com/VertaAI/modeldb/blob/master/demos/webinar-2020-5-6/02-mdb_versioned/01-train/02%20Positive%20Data%20NLP.ipynb)
- [webinar 2 code]()

### Setting up Git for the Notebook and other settings

ModelDB uses **git** to track model versions, so this notebook should live in a proper git repository. Ensure that you've set up the repository with a git origin URL.

Before running this notebook:
- make sure ModelDB is running at localhost:3000
- create a new test folder for this repo and run ```git init``` in it
- in this new folder, set an origin ```git remote add origin <a_git_remote_location_for_code>.git``` 
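
The checklist above can be verified from Python before running the rest of the notebook. This is a minimal sketch, not part of the original notebook: `has_git_origin` is a hypothetical helper name, and it assumes the `git` command-line tool is on the `PATH`.

```python
import subprocess


def has_git_origin(path="."):
    """Return True if `path` is inside a git repository with an `origin` remote."""
    try:
        result = subprocess.run(
            ["git", "-C", path, "remote", "get-url", "origin"],
            capture_output=True,
            text=True,
        )
    except FileNotFoundError:
        # the git executable itself could not be found
        return False
    # git exits with a non-zero status when there is no repository or no such remote
    return result.returncode == 0 and bool(result.stdout.strip())
```

Calling `has_git_origin(".")` from the notebook's folder should return `True` once both `git init` and `git remote add origin ...` have been run.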

### Imports
- Scikit Learn (Dataset and models)
- Verta (ModelDB API calls)
- Itertools

In [192]:
# imports and package installations
import itertools
try:
    import sklearn
except ImportError:
    !pip3 install scikit-learn
try:
    import verta
except ImportError:
    !pip3 install verta

### Importing a sample dataset
The sklearn [digits](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html) dataset is used and split into training and test sets in a roughly 2:1 ratio.

In [193]:
# getting sample dataset
from sklearn import datasets
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.33)
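
As a rough check of that ratio, the arithmetic below works out the split sizes by hand. It is a sketch under two assumptions: that the digits dataset holds 1,797 samples, and that scikit-learn rounds a fractional test size up to the next whole sample.

```python
import math

N_SAMPLES = 1797   # number of samples in the sklearn digits dataset
TEST_SIZE = 0.33   # same fraction as passed to train_test_split above

n_test = math.ceil(TEST_SIZE * N_SAMPLES)  # fractional test sizes are rounded up
n_train = N_SAMPLES - n_test               # the remainder is used for training

print(n_train, n_test)
```

Because the cell above does not pass `random_state` to `train_test_split`, each execution draws a different split, so the accuracies logged later will vary slightly between sessions.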

### Constants for ModelDB Naming and Behaviour

Specify the following:
- Project Name
- Experiment Name
- Repository Name
- Branch Name
- *whether to merge the branch to master*

In [196]:
# Projects Constants
HOST = "http://localhost:3000"
PROJECT_NAME = "Couture Sklearn Test"
EXPERIMENT_NAME = "Experiment 1"
REPOSITORY_NAME = "couture-sklearn-repo"
BRANCH_NAME = "svm-test"
MERGE_TO_MASTER = True

#### Setting up the ModelDB Project, Experiment, Repository and Dataset

In [188]:
# setting up ModelDB project, experiment and repository
from verta import Client
from verta.utils import ModelAPI

client = Client(HOST)
proj = client.set_project(PROJECT_NAME)
expt = client.set_experiment(EXPERIMENT_NAME)

repo = client.set_repository(REPOSITORY_NAME)
commit = repo.get_commit(branch='master').new_branch(BRANCH_NAME)

dataset = client.set_dataset(name="Test Dataset 1",
                             type="local")

# arguments for create_version depend on the type of dataset
dataset_version = dataset.create_version(path="digits.csv")

# run.log_dataset_version("training_data", dataset_version)

connection successfully established
got existing Project: Couture Sklearn Test v9
got existing Experiment: Experiment 1
set existing Repository: couture-sklearn-v9 from personal workspace
set existing Dataset: Test Dataset 1 from personal workspace
created new DatasetVersion: 1


### Hyperparameters

Multiple values can be specified for each hyperparameter, and an experiment run is generated for each possible combination of values. For example, if ```n``` values are declared for *parameter A*, ```m``` values for *parameter B* and ```p``` values for *parameter C*, the experiment will contain a total of ```n*m*p``` runs.

This behaviour can be changed by using a different ```itertools``` function in place of ```itertools.product```. See the [itertools documentation](https://docs.python.org/3/library/itertools.html).

In [189]:
# set model hyperparameters
hyperparam_candidates = {
    'C': [90., 95.],
    'gamma': [0.002, 0.001],
}
hyperparam_sets = [dict(zip(hyperparam_candidates.keys(), values))
                   for values
                   in itertools.product(*hyperparam_candidates.values())]
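
To make the combination count concrete, the sketch below rebuilds the same grid and then shows one possible alternative, pairing values position by position with the built-in `zip`. The names `grid_sets` and `paired_sets` are illustrative and do not appear in the notebook.

```python
import itertools

hyperparam_candidates = {
    'C': [90., 95.],
    'gamma': [0.002, 0.001],
}

# Full grid: every C is combined with every gamma (2 * 2 = 4 sets),
# which is what the cell above builds
grid_sets = [dict(zip(hyperparam_candidates.keys(), values))
             for values in itertools.product(*hyperparam_candidates.values())]

# Paired alternative: the i-th C is used only with the i-th gamma (2 sets)
paired_sets = [dict(zip(hyperparam_candidates.keys(), values))
               for values in zip(*hyperparam_candidates.values())]

for params in grid_sets:
    print(params)
```

The grid grows multiplicatively with every added hyperparameter, so the paired form can be a cheaper choice when only a few hand-picked combinations are of interest.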

### Experiment Runs
This section is the core of the notebook: the model is defined and trained, and an experiment run is logged for each combination of hyperparameters. In-line comments explain each step.

In [190]:
from verta.code import Notebook
from verta.configuration import Hyperparameters
from verta.environment import Python

from sklearn import svm

def run_experiment(hyperparams, index):
    ...
    code_ver = Notebook()
    config_ver = Hyperparameters(hyperparams)
    train_ver = digits.data
    test_ver = digits.target
    env_ver = Python(Python.read_pip_environment())

    commit.update("notebooks/couture-sklearn-test", code_ver)
    commit.update("config/hyperparams", config_ver)
#     commit.update("data/train", train_ver)
#     commit.update("data/test", test_ver)
    commit.update("env/python", env_ver)
    commit.save("Hyperparameter tuning Run: " + str(index))

    # create object to track experiment run
    run = client.set_experiment_run("Run : " + str(index+1))
    
    # log hyperparameters
    run.log_hyperparameters(hyperparams)
    
    # model definition
    model = svm.SVC(**hyperparams)
    
    # model training
    model.fit(X_train, y_train)
    
    # calculate and log validation accuracy
    val_acc = model.score(X_test, y_test)
    run.log_metric("val_acc", val_acc)
    print("Validation accuracy: {:.4f}".format(val_acc))
    
    # save and log model
    run.log_model(model)
    
    # log dataset snapshot as version
    run.log_dataset_version("0.67 training set", dataset_version)
    
    # log Git information as code version
    run.log_code()
    
    run.log_commit(
        commit,
        {
            'notebook': "notebooks/couture-sklearn-test",
            'hyperparameters': "config/hyperparams",
#             'training_data': "data/train",
#             'test_data': "data/test",
            'python_env': "env/python",
        },
    )
     

# to run all experiments
for i, hyperparams in enumerate(hyperparam_sets):
    run_experiment(hyperparams, i)

created new ExperimentRun: Run : 1
Validation accuracy: 0.9815
upload complete (custom_modules)
upload complete (model.pkl)
upload complete (model_api.json)
Git repository successfully located at /home/kaushal/Projects/couture-ai/modeldb/test1/

created new ExperimentRun: Run : 2
Validation accuracy: 0.9865
upload complete (custom_modules)
upload complete (model.pkl)
upload complete (model_api.json)
Git repository successfully located at /home/kaushal/Projects/couture-ai/modeldb/test1/

created new ExperimentRun: Run : 3
Validation accuracy: 0.9815
upload complete (custom_modules)
upload complete (model.pkl)
upload complete (model_api.json)
Git repository successfully located at /home/kaushal/Projects/couture-ai/modeldb/test1/

created new ExperimentRun: Run : 4
Validation accuracy: 0.9865
upload complete (custom_modules)
upload complete (model.pkl)
upload complete (model_api.json)
Git repository successfully located at /home/kaushal/Projects/couture-ai/modeldb/test1/


In [176]:
# Print the commit log
for c in commit.log():
    print(c)

Commit 3c868fd588106ebf2d310398ac7efbe38c30dea48e571984f28751ea4708b635 (Branch: log-reg)
Date: 2020-09-03 17:37:31

    Hyperparameter tuning Run: 3

Commit a4dd5a3da5310830f89bbe7f2b2a5821a141e7132dcfaf434f8d9f1b68f6ef4f
Date: 2020-09-03 17:37:29

    Hyperparameter tuning Run: 2

Commit 63b7b5d8b93df6108f1eca1da467aba2a248cec279922ccadeaf09a81b88f6f4
Date: 2020-09-03 17:37:26

    Hyperparameter tuning Run: 1

Commit 0fd9eaac0d08e421496466a0d00085d3917f9486c0777717422d0ae187004d5e
Date: 2020-09-03 17:37:23

    Hyperparameter tuning Run: 0

Commit 81b14e128ad215c9977fef7b88225e86fc81a8bc1997ce9bf330b71910029924
Date: 2020-09-03 17:01:15

    Initial commit



In [191]:
# merge to master if specified in config
master = repo.get_commit(branch='master')
if MERGE_TO_MASTER:
    master.merge(commit)

### Get Best Run
Sort all experiment runs by validation accuracy and retrieve the best-performing one.

In [178]:
best_run = expt.expt_runs.sort("metrics.val_acc", descending=True)[0]
print("Validation Accuracy: {:.4f}".format(best_run.get_metric("val_acc")))

best_hyperparams = best_run.get_hyperparameters()
print("Hyperparameters: {}".format(best_hyperparams))

Validation Accuracy: 0.9865
Hyperparameters: {'C': 90, 'gamma': 0.0010000000474974513}
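
The selection logic can be mimicked in plain Python, without a ModelDB server, using the accuracies printed earlier. This is only a sketch: it assumes the four runs were created in `itertools.product` order, and how ModelDB itself breaks ties between equal accuracies is not something the output confirms. The last lines illustrate why `gamma` prints with trailing digits; the value matches `0.001` rounded to single precision, which is an inference from the printed number rather than a documented ModelDB behaviour.

```python
import struct

# (run name, hyperparameters, validation accuracy) taken from the output above
runs = [
    ("Run : 1", {'C': 90.0, 'gamma': 0.002}, 0.9815),
    ("Run : 2", {'C': 90.0, 'gamma': 0.001}, 0.9865),
    ("Run : 3", {'C': 95.0, 'gamma': 0.002}, 0.9815),
    ("Run : 4", {'C': 95.0, 'gamma': 0.001}, 0.9865),
]

# Sort descending by accuracy; Python's sort is stable, so tied runs keep
# their original order and the earlier run comes first
best_name, best_params, best_acc = sorted(runs, key=lambda r: r[2], reverse=True)[0]
print(best_name, best_params, best_acc)

# Round-trip 0.001 through a 32-bit float to see the rounding error
gamma32 = struct.unpack('f', struct.pack('f', 0.001))[0]
print(gamma32)
```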
