# Logistic Regression (scikit-learn) with HDFS/Spark Data Versioning

This example builds on our [basic census income classification example](census-end-to-end.ipynb). It uses a local setup of ModelDB and its client, and versions the training data with [HDFS/Spark data versioning](https://verta.readthedocs.io/en/master/_autogen/verta.dataset.HDFSPath.html).

In [1]:
!pip install /path/to/verta-0.15.10-py2.py3-none-any.whl

In [2]:
HOST = "localhost:8080"

PROJECT_NAME = "Census Income Classification - HDFS Data"
EXPERIMENT_NAME = "Logistic Regression"

## Imports

In [3]:
from __future__ import print_function

import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

import itertools
import os

import numpy as np
import pandas as pd

import sklearn
from sklearn import model_selection
from sklearn import linear_model

---

# Log Workflow

This section demonstrates logging model metadata and training artifacts to ModelDB.

## Instantiate Client

In [4]:
from verta import Client
from verta.utils import ModelAPI

client = Client(HOST)
proj = client.set_project(PROJECT_NAME)
expt = client.set_experiment(EXPERIMENT_NAME)

## Prepare Data

In [5]:
from pyspark import SparkContext

sc = SparkContext("local")

In [6]:
from verta.dataset import HDFSPath

hdfs = "hdfs://HOST:PORT"  # placeholder: replace HOST:PORT with your HDFS NameNode address

dataset = client.set_dataset(name="Census Income HDFS")
blob = HDFSPath.with_spark(sc, "{}/data/census/*".format(hdfs))
version = dataset.create_version(blob)

version

In [7]:
csv = sc.textFile("{}/data/census/census-train.csv".format(hdfs)).collect()

In [8]:
from verta.external.six import StringIO

df_train = pd.read_csv(StringIO('\n'.join(csv)))
X_train = df_train.iloc[:,:-1]
y_train = df_train.iloc[:, -1]

df_train.head()
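
The two cells above pull the CSV to the driver as a list of text lines, join them back into one string, and split the parsed table into features (every column but the last) and labels (the last column). Note that `collect()` loads the whole file into driver memory, so this pattern only suits data that fits there. The following is a minimal standard-library sketch of the same join-then-parse step; the column names and values are hypothetical stand-ins, not the real census data.

```python
import csv
from io import StringIO

# Hypothetical stand-in for the list of lines returned by sc.textFile(...).collect()
lines = [
    "age,hours_per_week,income",
    "39,40,0",
    "50,13,1",
]

# Rejoin the lines into one CSV document and parse it from an in-memory buffer
reader = csv.reader(StringIO("\n".join(lines)))
header = next(reader)
rows = list(reader)

# Mirror df.iloc[:, :-1] and df.iloc[:, -1]: last column is the label
X = [row[:-1] for row in rows]
y = [row[-1] for row in rows]

print(header[:-1], X)
print(header[-1], y)
```

Unlike `pd.read_csv`, the `csv` module leaves every field as a string, so numeric conversion would be a separate step.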

## Prepare Hyperparameters

In [9]:
hyperparam_candidates = {
    'C': [1e-6, 1e-4],
    'solver': ['lbfgs'],
    'max_iter': [15, 28],
}
hyperparam_sets = [dict(zip(hyperparam_candidates.keys(), values))
                   for values
                   in itertools.product(*hyperparam_candidates.values())]
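
To see what the grid expansion above produces, the snippet below repeats it in isolation and prints each resulting set. It is a sketch that assumes Python 3.7+ (where dicts preserve insertion order), so the combinations appear in the order shown by `itertools.product`.

```python
import itertools

# Same candidate grid as in the cell above
hyperparam_candidates = {
    'C': [1e-6, 1e-4],
    'solver': ['lbfgs'],
    'max_iter': [15, 28],
}

# itertools.product yields one tuple per combination of candidate values;
# zipping each tuple with the keys rebuilds a keyword-argument dict
hyperparam_sets = [dict(zip(hyperparam_candidates.keys(), values))
                   for values
                   in itertools.product(*hyperparam_candidates.values())]

for hyperparams in hyperparam_sets:
    print(hyperparams)
```

With two values for `C`, one for `solver`, and two for `max_iter`, this yields 2 × 1 × 2 = 4 hyperparameter sets, so the training loop creates four experiment runs.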

## Train Models

In [10]:
def run_experiment(hyperparams):
    # create object to track experiment run
    run = client.set_experiment_run()
    
    # create validation split
    (X_val_train, X_val_test,
     y_val_train, y_val_test) = model_selection.train_test_split(X_train, y_train,
                                                                 test_size=0.2,
                                                                 shuffle=True)
    
    # log hyperparameters
    run.log_hyperparameters(hyperparams)
    print(hyperparams, end=' ')
    
    # create and train model
    model = linear_model.LogisticRegression(**hyperparams)
    model.fit(X_val_train, y_val_train)
    
    # calculate and log validation accuracy
    val_acc = model.score(X_val_test, y_val_test)
    run.log_metric("val_acc", val_acc)
    print("Validation accuracy: {:.4f}".format(val_acc))
    
    # save and log model
    run.log_model(model)
    
    # log dataset snapshot as version
    run.log_dataset_version("train", version)

for hyperparams in hyperparam_sets:
    run_experiment(hyperparams)
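
A validation score is only informative when the model has not seen the held-out rows during fitting. The sketch below illustrates this with standard-library code only. `holdout_split` and `Memorizer` are hypothetical helpers written for this illustration (they are not scikit-learn or Verta APIs), and the data is synthetic with random labels, so an honest score should sit near chance.

```python
import random

def holdout_split(rows, test_size=0.2, seed=0):
    """Shuffle rows and split them into (train, held_out) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]

class Memorizer:
    """Toy model that remembers every (features, label) pair it was fitted on."""

    def fit(self, rows):
        self.seen = {features: label for features, label in rows}
        return self

    def score(self, rows):
        # Features never seen during fit fall back to predicting label 0
        correct = sum(self.seen.get(features, 0) == label
                      for features, label in rows)
        return correct / float(len(rows))

# Synthetic data: unique integer features with random binary labels
rng = random.Random(42)
rows = [((i,), rng.randint(0, 1)) for i in range(1000)]
train, held_out = holdout_split(rows)

leaky_acc = Memorizer().fit(rows).score(held_out)    # fitted on all rows
honest_acc = Memorizer().fit(train).score(held_out)  # fitted on the 80% split only

print("fitted on all rows:   {:.4f}".format(leaky_acc))
print("fitted on train only: {:.4f}".format(honest_acc))
```

The model fitted on all rows scores perfectly on the held-out rows because it has memorized them, even though the labels are random; the model fitted on the training split alone cannot do that.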

---

# Revisit Workflow

This section demonstrates querying and retrieving runs via the Client.

## Retrieve Best Run

In [11]:
best_run = expt.expt_runs.sort("metrics.val_acc", descending=True)[0]
print("Validation Accuracy: {:.4f}".format(best_run.get_metric("val_acc")))

best_hyperparams = best_run.get_hyperparameters()
print("Hyperparameters: {}".format(best_hyperparams))
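
Conceptually, the query above orders the experiment's runs by their logged `val_acc` metric, highest first, and takes the first one. The sketch below shows that selection on plain dictionaries; the run IDs and metric values are made up for illustration and the dict layout is an assumption, not the client's internal representation.

```python
# Hypothetical stand-ins for logged runs (IDs and values are invented)
runs = [
    {"id": "run-a", "metrics": {"val_acc": 0.71},
     "hyperparameters": {"C": 1e-6, "solver": "lbfgs", "max_iter": 15}},
    {"id": "run-b", "metrics": {"val_acc": 0.78},
     "hyperparameters": {"C": 1e-4, "solver": "lbfgs", "max_iter": 28}},
    {"id": "run-c", "metrics": {"val_acc": 0.74},
     "hyperparameters": {"C": 1e-4, "solver": "lbfgs", "max_iter": 15}},
]

# Sort by validation accuracy, highest first, and keep the top run
best = sorted(runs, key=lambda r: r["metrics"]["val_acc"], reverse=True)[0]

print("Best run: {}".format(best["id"]))
print("Hyperparameters: {}".format(best["hyperparameters"]))
```

The hyperparameters of the selected run can then be passed straight back into the model constructor as keyword arguments.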

## Train on Full Dataset

In [12]:
model = linear_model.LogisticRegression(multi_class='auto', **best_hyperparams)
model.fit(X_train, y_train)

## Calculate Accuracy on Full Training Set

In [13]:
train_acc = model.score(X_train, y_train)
print("Training accuracy: {:.4f}".format(train_acc))

---