# Logistic Regression with Grid Search (scikit-learn)

<a href="https://colab.research.google.com/github/VertaAI/modeldb/blob/master/client/workflows/demos/census-with-managed-versioning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# restart your notebook if prompted on Colab
try:
    import verta
except ImportError:
    !pip install verta

This example builds on our [basic census income classification example](census-end-to-end.ipynb) by incorporating [S3 data versioning](https://verta.readthedocs.io/en/master/_autogen/verta.dataset.S3.html).

In [2]:
HOST = "app.verta.ai"

PROJECT_NAME = "Census Income Classification - S3 Data"
EXPERIMENT_NAME = "Logistic Regression"

In [3]:
# import os
# os.environ['VERTA_EMAIL'] = ''
# os.environ['VERTA_DEV_KEY'] = ''

## Imports

In [4]:
from __future__ import print_function

import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

import itertools
import os

import numpy as np
import pandas as pd

import sklearn
from sklearn import model_selection
from sklearn import linear_model

In [5]:
try:
    import wget
except ImportError:
    !pip install wget  # you may need pip3
    import wget

---

# Log Workflow

This section demonstrates logging model metadata and training artifacts to ModelDB.

## Instantiate Client

In [6]:
from verta import Client
from verta.utils import ModelAPI

client = Client(HOST)

## Prepare Data

In [7]:
bucket = "verta-starter"
key = "census-train.csv"

First we download our data from S3 for use in this notebook.

In [8]:
data_dir = os.curdir
train_data_filename = os.path.join(data_dir, key)

if not os.path.exists(train_data_filename):
    train_data_url = "http://s3.amazonaws.com/{}/{}".format(bucket, key)
    wget.download(train_data_url, train_data_filename)

Next we version our dataset. With `enable_mdb_versioning=True`, the client retrieves the data file(s) from S3 and stores a copy in ModelDB, so the exact contents can be recovered later.

In [9]:
from verta.dataset import S3

dataset = client.set_dataset(name="Census Income S3")
content = S3("s3://{}/{}".format(bucket, key), enable_mdb_versioning=True)
version = dataset.create_version(content)

In [10]:
df_train = pd.read_csv(train_data_filename)
X_train = df_train.iloc[:, :-1]
y_train = df_train.iloc[:, -1]

df_train.head()

## Prepare Hyperparameters

In [11]:
hyperparam_candidates = {
    'C': [1e-6, 1e-4],
    'solver': ['lbfgs'],
    'max_iter': [15, 28],
}
hyperparam_sets = [dict(zip(hyperparam_candidates.keys(), values))
                   for values
                   in itertools.product(*hyperparam_candidates.values())]
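
To make the expansion above concrete, the following sketch rebuilds the same grid and prints each combination. It is a standalone illustration that uses only the standard library; the `expand_grid` helper name is ours and is not part of the notebook or of Verta.

```python
import itertools


def expand_grid(candidates):
    """Return one dict per combination of candidate values (Cartesian product)."""
    keys = list(candidates.keys())
    return [dict(zip(keys, values))
            for values in itertools.product(*candidates.values())]


# same candidate values as the cell above
hyperparam_candidates = {
    'C': [1e-6, 1e-4],
    'solver': ['lbfgs'],
    'max_iter': [15, 28],
}
grid = expand_grid(hyperparam_candidates)

for hyperparams in grid:
    print(hyperparams)
```

With two values for `C`, one for `solver`, and two for `max_iter`, the grid holds 2 × 1 × 2 = 4 hyperparameter sets, so the training loop creates four experiment runs.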

## Train Models

In [12]:
def run_experiment(hyperparams):
    # create object to track experiment run
    run = client.set_experiment_run()
    
    # create validation split
    (X_val_train, X_val_test,
     y_val_train, y_val_test) = model_selection.train_test_split(X_train, y_train,
                                                                 test_size=0.2,
                                                                 shuffle=True)

    # log hyperparameters
    run.log_hyperparameters(hyperparams)
    print(hyperparams, end=' ')
    
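    # create and train model on the training portion of the split only,
    # so that validation accuracy is measured on rows the model has not seen
    model = linear_model.LogisticRegression(**hyperparams)
    model.fit(X_val_train, y_val_train)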
    
    # calculate and log validation accuracy
    val_acc = model.score(X_val_test, y_val_test)
    run.log_metric("val_acc", val_acc)
    print("Validation accuracy: {:.4f}".format(val_acc))
    
    # create deployment artifacts
    model_api = ModelAPI(X_train, y_train)
    requirements = ["scikit-learn"]
    
    # save and log model
    run.log_model(model, model_api=model_api)
    run.log_requirements(requirements)
    
    # log dataset snapshot as version
    run.log_dataset_version("train", version)


proj = client.set_project(PROJECT_NAME)
expt = client.set_experiment(EXPERIMENT_NAME)
for hyperparams in hyperparam_sets:
    run_experiment(hyperparams)
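
A hold-out split only gives an honest accuracy estimate if the model is fitted on the training portion and scored on the held-out portion. The sketch below illustrates the splitting step with the standard library only. `holdout_split` is a hypothetical helper, not scikit-learn's `train_test_split`, and the row indices are placeholders rather than the census data.

```python
import random


def holdout_split(rows, test_size=0.2, seed=None):
    """Shuffle rows and split them into (train, test) lists.

    Hypothetical stand-in for sklearn's train_test_split, for illustration only.
    """
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))  # placeholder row indices, not the census data
train_rows, test_rows = holdout_split(rows, test_size=0.2, seed=0)

# the two portions never overlap and together cover every row
assert not set(train_rows) & set(test_rows)
assert sorted(train_rows + test_rows) == rows
print(len(train_rows), len(test_rows))
```

Because the two portions are disjoint, a model fitted on `train_rows` has never seen `test_rows`, which is what makes the resulting score a validation metric rather than a training metric.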

---

# Retrieve Data

Suppose the local copy of our data file is lost at some point in the future. Here we simulate that by deleting it.

In [13]:
os.remove(train_data_filename)

Because ModelDB manages our data, we can retrieve the dataset version logged to an experiment run and use it to download the original file again.

In [14]:
run = proj.expt_runs[0]
version = run.get_dataset_version("train")

version.get_content().download(
    "s3://{}/{}".format(bucket, key),
    download_to_path=train_data_filename,
)

In [15]:
pd.read_csv(train_data_filename).head()
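
Reading the first rows back is a quick sanity check. A stricter option, sketched below, is to compare checksums, assuming a digest of the original file was recorded before it was removed. This is a standard-library illustration and not part of the Verta client; `file_sha256` is a hypothetical helper, and the demo uses throwaway files with placeholder contents.

```python
import hashlib
import os
import tempfile


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# demo with throwaway files (placeholder contents, not the census data)
with tempfile.TemporaryDirectory() as tmp_dir:
    original = os.path.join(tmp_dir, "original.csv")
    recovered = os.path.join(tmp_dir, "recovered.csv")
    for path in (original, recovered):
        with open(path, "w") as f:
            f.write("age,income\n39,<=50K\n")

    digest_before = file_sha256(original)  # record before the file is lost
    os.remove(original)                    # simulate losing the file
    digest_after = file_sha256(recovered)  # compute on the recovered copy

print(digest_before == digest_after)
```

In this notebook, the equivalent would be to call `file_sha256(train_data_filename)` once before `os.remove` and once after the download, then compare the two digests.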

---