# Model Tracking with MLflow

This demo explores MLflow, an open source platform for managing the machine learning lifecycle. MLflow provides tools for tracking experiments, packaging code into reproducible runs, and sharing and deploying models.

We focus on MLflow's tracking and logging components. First we track an experiment with MLflow, then we use its custom logging features to record parameters, metrics, figures, and arbitrary artifacts.

## Learning Objectives:

By the end of this demo, you will be able to:

* Train a model using a Feature Store table as the modeling set
* Manually log parameters, metrics, models, and figures with MLflow tracking
* Log the training dataset alongside the model in MLflow
* Log additional artifacts to a model run
* Review an experiment using the MLflow UI

In [0]:
table_name = "sdd_dev.sohag_test.diabetes_binary_health_indicators_brfss_2015"
feature_dataset = spark.read.table(table_name)
feature_data_pd = feature_dataset.toPandas()
feature_data_pd.head()

In [0]:
# Convert all columns in the Dataframe to double data type
for column in feature_data_pd.columns:
    feature_data_pd[column] = feature_data_pd[column].astype(float)

print(feature_data_pd.dtypes)

In [0]:
from sklearn.model_selection import train_test_split

# Split the target variable into its own dataset
target_col = "Diabetes_binary"

X_all = feature_data_pd.drop(target_col, axis=1)
y_all = feature_data_pd[target_col]

# split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size=0.2, random_state=0)

print(f"We have {len(X_train)} rows in the training set and {len(X_test)} rows in the test set")

## MLflow Parameter Logging

In the cells below, we use MLflow to start a run and log the Decision Tree hyperparameters, such as `criterion` and `max_depth`. After fitting the model on the training data, we evaluate it on both the training and test sets and log accuracy, precision, recall, and F1 as metrics.

### Important Notes:
- **MLflow autologging is enabled by default on Databricks**. This means you don't need to do anything for supported libraries. In the next section, we disable it and log parameters, metrics, and other details manually, to demonstrate how to record custom model information when you need to.
- **Note**: If you don't set an experiment, all runs generated in a Databricks notebook are logged under the notebook's own experiment. The setup cell in this demo sets one explicitly with `mlflow.set_experiment`.

In [0]:
dtc_params = {
    'criterion': 'gini',
    'max_depth': 10,
    'min_samples_split': 20,
    'min_samples_leaf': 5
}

In [0]:
import mlflow
from mlflow.models.signature import infer_signature
from sklearn.tree import DecisionTreeClassifier
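# Import only the metrics used in this notebook instead of a wildcard import
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)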

UC_PATH = "databricks-uc"
EXP_PATH = "/Users/sohagahammed.siyam@kone.com/Databricks Training/MLOps/MLflow - Model Tracking"
MODEL_NAME = "sdd_dev.sohag_test.diabetes_prediction"

# Register models in Unity Catalog
mlflow.set_registry_uri(UC_PATH)

# Create experiment if it does not exist
experiment = mlflow.get_experiment_by_name(EXP_PATH)
if experiment is None:
    mlflow.create_experiment(EXP_PATH)

mlflow.set_experiment(EXP_PATH)

# Turn off autologging
mlflow.sklearn.autolog(disable=True)

In [0]:
# Start an MLflow run
with mlflow.start_run(run_name="Model tracking demo") as run:
    # Log the dataset as artifacts
    feature_dataset.toPandas().to_csv("/dbfs/tmp/feature_dataset.csv", index=False)
    X_train.to_csv("/dbfs/tmp/X_train.csv", index=False)
    X_test.to_csv("/dbfs/tmp/X_test.csv", index=False)
    
    mlflow.log_artifact("/dbfs/tmp/feature_dataset.csv", artifact_path="datasets/source")
    mlflow.log_artifact("/dbfs/tmp/X_train.csv", artifact_path="datasets/train")
    mlflow.log_artifact("/dbfs/tmp/X_test.csv", artifact_path="datasets/test")
    
    # log our parameters
    mlflow.log_params(dtc_params)
    
    # fit the model
    dtc = DecisionTreeClassifier(**dtc_params)
    dtc_mdl = dtc.fit(X_train, y_train)

    # Define model signature
    signature = infer_signature(X_all, y_all)
    
    # log the model
    mlflow.sklearn.log_model(
        sk_model=dtc_mdl,
        artifact_path="model-artifact",
        signature=signature,
        registered_model_name=MODEL_NAME
    )

    # Evaluate on the training set
    y_pred = dtc_mdl.predict(X_train)

    # Log metrics like accuracy, precision, recall, f1
    mlflow.log_metric("train_accuracy", accuracy_score(y_train, y_pred))
    mlflow.log_metric("train_precision", precision_score(y_train, y_pred))
    mlflow.log_metric("train_recall", recall_score(y_train, y_pred))
    mlflow.log_metric("train_f1", f1_score(y_train, y_pred))

    # Evaluate on the test set
    y_pred = dtc_mdl.predict(X_test)
    mlflow.log_metric("test_accuracy", accuracy_score(y_test, y_pred))
    mlflow.log_metric("test_precision", precision_score(y_test, y_pred))
    mlflow.log_metric("test_recall", recall_score(y_test, y_pred))
    mlflow.log_metric("test_f1", f1_score(y_test, y_pred))
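
The four scores logged for each split all derive from confusion-matrix counts. As a rough illustration, the sketch below recomputes them in plain Python for a handful of made-up labels. The names `binary_scores`, `toy_true`, and `toy_pred` are hypothetical and the values are not taken from the dataset. It treats label 1 as the positive class, which matches the default of scikit-learn's `precision_score` for binary targets.

```python
def binary_scores(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative labels only (assumption: not drawn from the diabetes table)
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 0, 1, 1, 0]

toy_scores = binary_scores(toy_true, toy_pred)
print(toy_scores)
```

A dictionary shaped like `toy_scores` can be passed to `mlflow.log_metrics` inside an active run to log all values in one call.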

## Log Model Artifacts
In addition to logging the model, we can also log other artifacts such as the training dataset, feature dataset, and model run metadata. Let's set up an MLflow client to log artifacts after the run is completed.

In [0]:
run.info

In [0]:
from mlflow.client import MlflowClient
import matplotlib.pyplot as plt
import seaborn as sns

client = MlflowClient()

# Log confusion matrix
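# Store the result under a new name so the confusion_matrix function is not
# overwritten, which would make this cell fail when it is run a second time
cm = confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots(1, 1, figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", ax=ax)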
ax.set_title("Confusion Matrix")
ax.set_xlabel("Predicted")
ax.set_ylabel("Actual")
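# mlflow.log_figure(fig, ...) logs to the active run. The run above has already
# ended, so we log through the client and pass the run ID explicitly.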

client.log_figure(run.info.run_id, artifact_file="confusion_matrix.png", figure=fig)
plt.show()