# Model Experimentation Tracking (MLflow) - Hyperparameter Optimization

Record and query experiments: Code, data, config, results, parameters, metrics

![Data](images/MLflow_Model_experimentation.png)

## Import Packages

In [2]:
# Data analysis library
import numpy as np
import pandas as pd
import joblib

# Machine Learning library
import sklearn
from sklearn.metrics import roc_curve, auc, accuracy_score, plot_confusion_matrix, plot_roc_curve
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import lightgbm as lgb
from lightgbm import LGBMClassifier
from lightgbm import plot_importance, plot_metric

# Model experimentation library
import mlflow
import mlflow.lightgbm
from mlflow.tracking import MlflowClient

# Hyperparameter tuning library
import optuna

# Plotting library
import matplotlib.pyplot as plt
# Turn interactive mode off so that figures are not displayed automatically
plt.ioff()
import warnings
warnings.filterwarnings("ignore")

In [3]:
print(f'Numpy version is {np.__version__}')
print(f'Pandas version is {pd.__version__}')
print(f'sklearn version is {sklearn.__version__}')
print(f'mlflow version is {mlflow.__version__}')
print(f'joblib version is {joblib.__version__}')
print(f'optuna version is {optuna.__version__}')

Numpy version is 1.21.5
Pandas version is 1.4.2
sklearn version is 1.0.2
mlflow version is 1.28.0
joblib version is 1.1.0
optuna version is 3.0.2


## Download data 

### Campus Recruitment Dataset
#### Academic and Employability Factors influencing placement

https://www.kaggle.com/benroshan/factors-affecting-campus-placement

## Load data

In [6]:
## Files
data_file = 'Placement_Data_Full_Class.csv'

# Load the campus recruitment dataset
try:
    data = pd.read_csv(data_file)
    print("The dataset has {} samples with {} features.".format(*data.shape))
except FileNotFoundError:
    print("The dataset could not be loaded. Is the dataset missing?")

The dataset has 215 samples with 15 features.


## Introduction To The Data

In [7]:
data.head()

|   | sl_no | gender | ssc_p | ssc_b | hsc_p | hsc_b | hsc_s | degree_p | degree_t | workex | etest_p | specialisation | mba_p | status | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | M | 67.0 | Others | 91.0 | Others | Commerce | 58.0 | Sci&Tech | No | 55.0 | Mkt&HR | 58.8 | Placed | 270000.0 |
| 1 | 2 | M | 79.33 | Central | 78.33 | Others | Science | 77.48 | Sci&Tech | Yes | 86.5 | Mkt&Fin | 66.28 | Placed | 200000.0 |
| 2 | 3 | M | 65.0 | Central | 68.0 | Central | Arts | 64.0 | Comm&Mgmt | No | 75.0 | Mkt&Fin | 57.8 | Placed | 250000.0 |
| 3 | 4 | M | 56.0 | Central | 52.0 | Central | Science | 52.0 | Sci&Tech | No | 66.0 | Mkt&HR | 59.43 | Not Placed | NaN |
| 4 | 5 | M | 85.8 | Central | 73.6 | Central | Commerce | 73.3 | Comm&Mgmt | No | 96.8 | Mkt&Fin | 55.5 | Placed | 425000.0 |


In [8]:
data['status'].value_counts()

Placed        148
Not Placed     67
Name: status, dtype: int64

## Start MLflow UI

Run the **mlflow ui** command from a command prompt. By default the UI is served at http://127.0.0.1:5000.

In [6]:
!mlflow ui

^C


## Initialize MLflow

**Experiments** : You can organize runs into experiments, which group together runs for a specific task. 

**Tracking URI**: MLflow runs can be recorded to local files, to a database, or remotely to a tracking server. By default, the MLflow Python API logs runs locally to files in an `mlruns` directory under the directory where you ran your program.

#### MLflow Tracking Servers 
An MLflow tracking server has two storage components: a **backend store** and an **artifact store**.

The **backend store** is where MLflow Tracking Server stores experiment and run metadata as well as params, metrics, and tags for runs. MLflow supports two types of backend stores: **file store and database-backed store**.

The **artifact store** is a location suitable for large data (such as an S3 bucket or shared NFS file system) and is where clients log their artifact output (for example, models).

Supported artifact stores include:

    Amazon S3 and S3-compatible storage
    Azure Blob Storage
    Google Cloud Storage
    FTP server
    SFTP Server
    NFS
    HDFS

In [9]:
experiment_name = "campus_recruitment_experiments_v2"
artifact_repository = './mlflow-run'

# Provide uri and connect to your tracking server
mlflow.set_tracking_uri('http://127.0.0.1:5000/')

# Initialize client
client = MlflowClient()

# Create the experiment if it does not exist yet;
# otherwise reuse the existing experiment id to run the experiments
experiment = client.get_experiment_by_name(experiment_name)
if experiment is None:
    experiment_id = client.create_experiment(experiment_name, artifact_location=artifact_repository)
else:
    experiment_id = experiment.experiment_id

## Prepare data for model training

In [10]:
exclude_feature = ['sl_no', 'salary', 'status']
# Define the target column
target = data['status'].map({"Placed": 0 , "Not Placed": 1})

# Define numeric and categorical features
numeric_columns = data.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_columns = data.select_dtypes(include=['object']).columns.tolist()
numeric_features = [col for col in numeric_columns if col not in exclude_feature]
categorical_features = [col for col in categorical_columns if col not in exclude_feature]

# Define final feature list for training and validation
features = numeric_features + categorical_features
# Final data for training and validation
data = data[features]
data = data.fillna(0)

# Split data into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(data, target, test_size=0.15, random_state=10)

# Perform label encoding for categorical variables
for feature in categorical_features:
    le = LabelEncoder()
    le.fit(X_train.loc[:, feature])
    X_train.loc[:, feature] = le.transform(X_train.loc[:, feature])
    X_valid.loc[:, feature] = le.transform(X_valid.loc[:, feature])
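
`LabelEncoder.transform` raises an error when the validation split contains a category that was not present in the training split. One possible alternative, sketched below on made-up values, is `OrdinalEncoder` with `handle_unknown="use_encoded_value"` (available in scikit-learn 0.24 and later), which maps unseen categories to a sentinel value. The `"Vocational"` category is hypothetical and does not come from the dataset.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy single-column data; "Vocational" is a hypothetical category unseen during fit
train_column = np.array([["Commerce"], ["Science"], ["Arts"]], dtype=object)
valid_column = np.array([["Science"], ["Vocational"]], dtype=object)

# Unseen categories are encoded as -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train_column)

valid_encoded = encoder.transform(valid_column)
print(encoder.categories_[0])  # categories learned from the training column
print(valid_encoded.ravel())   # codes for the validation column
```

Whether a sentinel code is acceptable depends on the model. Tree-based learners such as LightGBM can usually split on it, but this is a design choice rather than a drop-in replacement for the loop above.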

## LightGBM Hyperparameter Tuning + MLflow for model tracking

### Define model training function to train and track model results

In [12]:
def model_training_tracking(params):
    # Multiple runs can be launched in one program because the ActiveRun object
    # returned by mlflow.start_run() is a Python context manager, so each run
    # can be scoped to a single block of code:
    with mlflow.start_run(experiment_id=experiment_id, run_name='Lightgbm_model') as run:
        # Get the run id
        run_id = run.info.run_id
        
        # Set the notes for the run
        MlflowClient().set_tag(run_id,
                               "mlflow.note.content",
                               "Hyperparameter optimization experiment for LightGBM models on the Campus Recruitment Dataset")
        
        # Define custom tags
        tags = {"Application": "Payment Monitoring Platform",
                "release.candidate": "PMP",
                "release.version": "2.2.0"}

        # Set Tag
        mlflow.set_tags(tags)
                        
        # Log python environment details
        mlflow.log_artifact('requirements.txt')
        
        # logging params
        mlflow.log_params(params)

        # Perform model training
        lgb_clf = LGBMClassifier(**params)
        lgb_clf.fit(X_train, y_train, 
                    eval_set = [(X_train, y_train), (X_valid, y_valid)], 
                    early_stopping_rounds=50,
                    verbose=20)

        # Log model artifacts
        mlflow.sklearn.log_model(lgb_clf, "model")

        # Perform model evaluation 
        lgb_valid_prediction = lgb_clf.predict_proba(X_valid)[:, 1]
        fpr, tpr, thresholds = roc_curve(y_valid, lgb_valid_prediction)
        roc_auc = auc(fpr, tpr) # compute area under the curve
        print("=====================================")
        print("Validation AUC:{}".format(roc_auc))
        print("=====================================")   

        # log metrics
        mlflow.log_metrics({"Validation_AUC": roc_auc})

        # Plot and save feature importance details
        ax = plot_importance(lgb_clf, height=0.4)
        filename = './images/lgb_validation_feature_importance.png'
        plt.savefig(filename)
        # log model artifacts
        mlflow.log_artifact(filename)

        ax = plot_metric(lgb_clf.evals_result_)
        filename = './images/lgb_validation_metrics_comparision.png'
        plt.savefig(filename)
        # log model artifacts
        mlflow.log_artifact(filename)

        # Plot and save metrics details    
        plot_confusion_matrix(lgb_clf, X_valid, y_valid, 
                              display_labels=['Placed', 'Not Placed'],
                              cmap='magma')
        plt.title('Confusion Matrix')
        filename = './images/lgb_validation_confusion_matrix.png'
        plt.savefig(filename)
        # log model artifacts
        mlflow.log_artifact(filename)

        # Plot and save AUC details  
        plot_roc_curve(lgb_clf, X_valid, y_valid, name='Validation')
        plt.title('ROC AUC Curve')
        filename = './images/lgb_validation_roc_curve.png'
        plt.savefig(filename)
        # log model artifacts
        mlflow.log_artifact(filename)
        
        return roc_auc
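
The `Validation_AUC` metric logged by this function is the area under the ROC curve. An equivalent reading is the probability that a randomly chosen positive sample (here "Not Placed", encoded as 1) receives a higher score than a randomly chosen negative one. The sketch below computes that pairwise form in plain Python on made-up labels and scores; `pairwise_auc` is an illustrative helper, not part of scikit-learn.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")

    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Toy labels and predicted probabilities (made up for illustration)
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

toy_auc = pairwise_auc(y_true, scores)
print(toy_auc)
```

On the same inputs, `auc(fpr, tpr)` from `roc_curve` should give the same value. With a small validation set, a single swapped pair moves the metric by a visible step, which is worth keeping in mind when comparing trials.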

### Define an objective function to be maximized

In [13]:
def objective(trial):

    param = {
        "objective": "binary",
        "metric": "auc",
        "learning_rate": trial.suggest_float("learning_rate", 1e-2, 1e-1, log=True),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.4, 1.0),
        "subsample": trial.suggest_float("subsample", 0.4, 1.0),
        "random_state": 42,
    }
    
    auc = model_training_tracking(param)
    return auc
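
The `log=True` argument asks Optuna to search `learning_rate` on a logarithmic scale, so small values get as much attention as large ones. The plain-Python sketch below illustrates the idea with independent log-uniform draws over the same range. It is not how Optuna's sampler works internally, and `sample_log_uniform` is an illustrative helper.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniformly distributed between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_log_uniform(1e-2, 1e-1, rng) for _ in range(10_000)]

# On a log scale the midpoint of the range is the geometric mean, not the arithmetic mean
geometric_mean = math.sqrt(1e-2 * 1e-1)
share_below = sum(s < geometric_mean for s in samples) / len(samples)

print(f"min={min(samples):.4f}, max={max(samples):.4f}")
print(f"share of samples below {geometric_mean:.4f}: {share_below:.2f}")
```

Roughly half of the draws fall below about 0.032, whereas a linear-scale draw over the same range would place only about a quarter of them there.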

### Create a study object and optimize the objective function

In [18]:
# Create a study object and optimize the objective function.
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=10)
trial = study.best_trial
print('AUC: {}'.format(trial.value))
print("Best hyperparameters: {}".format(trial.params))

[I 2022-09-15 16:47:45,996] A new study created in memory with name: no-name-617d02ef-a277-4cdc-9e8e-12bf8c63cee3


[20]	training's auc: 0.98435	valid_1's auc: 0.898496
[40]	training's auc: 0.996051	valid_1's auc: 0.902256
[60]	training's auc: 0.999269	valid_1's auc: 0.906015
Validation AUC:0.9172932330827067


[I 2022-09-15 16:47:56,094] Trial 0 finished with value: 0.9172932330827067 and parameters: {'learning_rate': 0.0792807017079039, 'colsample_bytree': 0.9932833411854791, 'subsample': 0.7063891578872163}. Best is trial 0 with value: 0.9172932330827067.


[20]	training's auc: 0.982887	valid_1's auc: 0.87218
[40]	training's auc: 0.994735	valid_1's auc: 0.902256
Validation AUC:0.9060150375939849


[I 2022-09-15 16:48:03,970] Trial 1 finished with value: 0.9060150375939849 and parameters: {'learning_rate': 0.09935115666239229, 'colsample_bytree': 0.5143539035235656, 'subsample': 0.6441054702962785}. Best is trial 0 with value: 0.9172932330827067.


[20]	training's auc: 0.964239	valid_1's auc: 0.853383
[40]	training's auc: 0.973453	valid_1's auc: 0.857143
[60]	training's auc: 0.976964	valid_1's auc: 0.887218
[80]	training's auc: 0.98062	valid_1's auc: 0.898496
[100]	training's auc: 0.983033	valid_1's auc: 0.894737
Validation AUC:0.8947368421052632


[I 2022-09-15 16:48:12,051] Trial 2 finished with value: 0.8947368421052632 and parameters: {'learning_rate': 0.012423708760618324, 'colsample_bytree': 0.7448402251997226, 'subsample': 0.49860645459550434}. Best is trial 0 with value: 0.9172932330827067.


[20]	training's auc: 0.97572	valid_1's auc: 0.868421
[40]	training's auc: 0.990054	valid_1's auc: 0.902256
[60]	training's auc: 0.994881	valid_1's auc: 0.898496
[80]	training's auc: 0.997514	valid_1's auc: 0.909774
[100]	training's auc: 0.999269	valid_1's auc: 0.921053
Validation AUC:0.9172932330827068


[I 2022-09-15 16:48:19,850] Trial 3 finished with value: 0.9172932330827068 and parameters: {'learning_rate': 0.05172037508341735, 'colsample_bytree': 0.5645865666068733, 'subsample': 0.44657980932054714}. Best is trial 3 with value: 0.9172932330827068.


[20]	training's auc: 0.95305	valid_1's auc: 0.887218
[40]	training's auc: 0.961094	valid_1's auc: 0.87594
Validation AUC:0.8984962406015037


[I 2022-09-15 16:48:27,693] Trial 4 finished with value: 0.8984962406015037 and parameters: {'learning_rate': 0.01038399829372392, 'colsample_bytree': 0.46949095344370356, 'subsample': 0.5012208812550029}. Best is trial 3 with value: 0.9172932330827068.


[20]	training's auc: 0.964019	valid_1's auc: 0.883459
[40]	training's auc: 0.977329	valid_1's auc: 0.87218
Validation AUC:0.8984962406015037


[I 2022-09-15 16:48:35,134] Trial 5 finished with value: 0.8984962406015037 and parameters: {'learning_rate': 0.02945661833987922, 'colsample_bytree': 0.5225637763221862, 'subsample': 0.724601738692791}. Best is trial 3 with value: 0.9172932330827068.


[20]	training's auc: 0.967822	valid_1's auc: 0.868421
[40]	training's auc: 0.977329	valid_1's auc: 0.887218
[60]	training's auc: 0.983619	valid_1's auc: 0.87594
[80]	training's auc: 0.988591	valid_1's auc: 0.894737
[100]	training's auc: 0.991517	valid_1's auc: 0.902256
Validation AUC:0.8984962406015038


[I 2022-09-15 16:48:43,057] Trial 6 finished with value: 0.8984962406015038 and parameters: {'learning_rate': 0.022404039821532176, 'colsample_bytree': 0.5908173181537603, 'subsample': 0.7072421827228921}. Best is trial 3 with value: 0.9172932330827068.


[20]	training's auc: 0.970528	valid_1's auc: 0.868421
[40]	training's auc: 0.978353	valid_1's auc: 0.890977
[60]	training's auc: 0.984935	valid_1's auc: 0.890977
[80]	training's auc: 0.98786	valid_1's auc: 0.894737
[100]	training's auc: 0.991517	valid_1's auc: 0.906015
Validation AUC:0.9060150375939849


[I 2022-09-15 16:48:51,154] Trial 7 finished with value: 0.9060150375939849 and parameters: {'learning_rate': 0.022766650623132302, 'colsample_bytree': 0.7330369957223126, 'subsample': 0.7293983726006674}. Best is trial 3 with value: 0.9172932330827068.


[20]	training's auc: 0.977476	valid_1's auc: 0.887218
[40]	training's auc: 0.988153	valid_1's auc: 0.898496
[60]	training's auc: 0.994442	valid_1's auc: 0.898496
[80]	training's auc: 0.997367	valid_1's auc: 0.902256
Validation AUC:0.9097744360902256


[I 2022-09-15 16:48:58,735] Trial 8 finished with value: 0.9097744360902256 and parameters: {'learning_rate': 0.04974454863479166, 'colsample_bytree': 0.9780676295645269, 'subsample': 0.5117619998773019}. Best is trial 3 with value: 0.9172932330827068.


[20]	training's auc: 0.962849	valid_1's auc: 0.849624
[40]	training's auc: 0.971771	valid_1's auc: 0.879699
[60]	training's auc: 0.974477	valid_1's auc: 0.913534
[80]	training's auc: 0.977987	valid_1's auc: 0.898496
[100]	training's auc: 0.982448	valid_1's auc: 0.913534
Validation AUC:0.9135338345864661


[I 2022-09-15 16:49:06,799] Trial 9 finished with value: 0.9135338345864661 and parameters: {'learning_rate': 0.011115678045179418, 'colsample_bytree': 0.9217459477245878, 'subsample': 0.8186485799614089}. Best is trial 3 with value: 0.9172932330827068.


AUC: 0.9172932330827068
Best hyperparameters: {'learning_rate': 0.05172037508341735, 'colsample_bytree': 0.5645865666068733, 'subsample': 0.44657980932054714}


## Load best lightgbm model

Check the MLflow UI and pick the best model for deployment.

In [22]:
# Load best model
lgb_best_model = mlflow.sklearn.load_model("D:/Practice/MLflow/Campus recruitment/mlflow-run/c0db9b1be4ff46a9ba009476ff6ca334/artifacts/model")

# Make predictions against the validation data
lgb_best_val_prediction = lgb_best_model.predict(X_valid)
lgb_best_val_prediction

array([0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1], dtype=int64)

## Reference

### Model experimentation
https://www.mlflow.org/docs/latest/tracking.html#

### Hyperparameter Optimization
https://github.com/optuna/optuna