<center>
<h1> AMLD Workshop </h1>
    <h2>The Full Machine Learning Lifecycle - How to Use Machine Learning in Production (MLOps)</h2>
    <h3>March 27, 2022</h3>
<hr>
<h1>Model tuning - Improve your model performance</h1>
<hr>
 </center>

If you have opened this notebook, your model probably needs improvement. Use it to find a setup that produces a better model: you might experiment with hyperparameter tuning, or even try different algorithms.

First, as in the MLflow exercise, let's import the relevant libraries ...

In [None]:
import mlflow
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
)
import os
import numpy as np
import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'

# functions needed for data preprocessing
from cd4ml.data_processing.ingest_data import get_data
from cd4ml.data_processing.split_train_test import get_train_test_split
from cd4ml.data_processing.transform_data import get_transformed_data

... set the paths and variables ...

In [None]:
_raw_data_dir = '/mnt/raw/winji/day2'
_root_dir = os.environ.get('PROJECT_PATH')

if _root_dir is None:
    raise ValueError('PROJECT_PATH environment variable not set')

_data_dir = os.path.join(_root_dir, 'data')

... and prepare the validation data

In [None]:
# get the data
n_days_validation_set = 20

df_raw = get_data(_raw_data_dir)
df_all_train_data, _ = get_train_test_split(df_raw, n_days_test=20)
df_train, df_val = get_train_test_split(df_all_train_data, n_days_validation_set)
x_train, y_train = get_transformed_data(df_train)
x_val, y_val = get_transformed_data(df_val)
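
The two calls to `get_train_test_split` carve the data up twice: first a test period is set aside, then a validation period is taken from what remains. The internals of the `cd4ml` helper are not shown in this notebook, so the following is only a sketch of the idea, assuming the split is chronological and that the most recent days form the held-out set. The helper `split_last_n_days`, the `date` column and the toy data are hypothetical.

```python
import pandas as pd


def split_last_n_days(df, n_days, date_col="date"):
    """Hypothetical helper: hold out the most recent `n_days` calendar days."""
    cutoff = df[date_col].max() - pd.Timedelta(days=n_days)
    older = df[df[date_col] <= cutoff]
    recent = df[df[date_col] > cutoff]
    return older, recent


# Toy frame with one row per day for 100 days (made-up data).
toy = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=100, freq="D"),
    "value": range(100),
})

# Same nesting as above: test first, then validation from the remainder.
toy_train_all, toy_test = split_last_n_days(toy, n_days=20)
toy_train, toy_val = split_last_n_days(toy_train_all, n_days=20)

print(len(toy_train), len(toy_val), len(toy_test))
```

Splitting by time rather than at random keeps the validation period strictly later than the training period, which mirrors how the model would be used on future data.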

Now check how well the model performs on your training and validation data, then try to adjust and improve it. Make sure you set ```_experiment_name``` and use the MLflow UI to inspect your model runs. Log all relevant metrics and parameters, as you did in the MLflow exercise.

In [None]:
# DEFINE YOUR MODEL HERE:
def get_model():
    C = 1.0
    iterations = 50
    model = LogisticRegression(C=C, max_iter=iterations)
    return model

# SET THE MLFLOW EXPERIMENT HERE:
_experiment_name = None  # replace None with e.g. "#####_tracking_exercise"

if not _experiment_name:
    raise ValueError('_experiment_name not set')

mlflow.set_experiment(_experiment_name)
mlflow.autolog()

with mlflow.start_run() as run:
    print(f"\nActive run_id: {run.info.run_id}")

    # fit your classifier
    clf = get_model()
    clf.fit(x_train, y_train)

    y_val_pred = clf.predict(x_val)

    # CALCULATE YOUR METRICS HERE:
    val_accuracy = accuracy_score(y_val, y_val_pred)
    val_f1 = f1_score(y_val, y_val_pred, average='macro')
    print('Accuracy validation set:', val_accuracy)
    print('F1-score validation set:', val_f1)

    # LOG COMMANDS HERE:
    mlflow.log_metric('val_acc', val_accuracy)
    mlflow.log_metric('val_f1', val_f1)

mlflow.end_run()
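
To compare several settings rather than editing `get_model()` by hand each time, a small sweep can help. The following is a minimal sketch on synthetic data from `make_classification`, so it runs without the workshop data. The helper `evaluate_candidates`, the candidate values of `C` and the extra metric names are assumptions for illustration. In this notebook you would pass `x_train`, `y_train`, `x_val` and `y_val` instead.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split


def evaluate_candidates(x_tr, y_tr, x_va, y_va, c_values, max_iter=200):
    """Fit one LogisticRegression per value of C and collect validation metrics."""
    results = []
    for c in c_values:
        clf = LogisticRegression(C=c, max_iter=max_iter)
        clf.fit(x_tr, y_tr)
        y_pred = clf.predict(x_va)
        results.append({
            "C": c,
            "val_acc": accuracy_score(y_va, y_pred),
            "val_precision": precision_score(y_va, y_pred, average="macro", zero_division=0),
            "val_recall": recall_score(y_va, y_pred, average="macro", zero_division=0),
            "val_f1": f1_score(y_va, y_pred, average="macro", zero_division=0),
        })
    return results


# Synthetic stand-in for the workshop data (an assumption, not the real set).
x, y = make_classification(n_samples=400, n_features=10, random_state=0)
x_tr, x_va, y_tr, y_va = train_test_split(x, y, test_size=0.25, random_state=0)

results = evaluate_candidates(x_tr, y_tr, x_va, y_va, c_values=[0.01, 0.1, 1.0, 10.0])
best = max(results, key=lambda r: r["val_f1"])

for r in results:
    print(r)
print("Best C by macro F1:", best["C"])
```

If each candidate is fitted inside its own `mlflow.start_run()` block, its values can be recorded with `mlflow.log_param` and `mlflow.log_metric`, so the runs can be compared side by side in the MLflow UI.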

Once you are satisfied with the model performance, adjust the ```get_model()``` function in the code (located in ```cd4ml-workshop/cd4ml/model_training/train_model.py```) accordingly and (re-)run the CI pipeline with Apache Airflow.