# 1. Running and tracking machine learning experiments

## 1.0. The data we use: Palmer Penguins

The data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network. The dataset is well suited to data exploration and visualization, and serves as an alternative to the iris dataset.

We will use this dataset in a classification setting to predict the penguins’ species from anatomical information.

Each penguin is from one of the three following species: Adelie, Gentoo, and Chinstrap.

<!-- ![palmer penguins](../img/palmer_penguins.png "Palmer Penguins") -->
<img src='../img/palmer_penguins.png' alt='' width='600'>

This is a classification problem because the target is categorical. We will use features based on the penguins’ culmen measurements.

<!-- ![Culmen features](../img/culmen_depth.png) -->
<img src='../img/culmen_depth.png' alt='' width='600'>

## 1.1. Load and prepare the data

In [1]:
import pandas as pd

culmen_columns = ["Culmen Length (mm)", "Culmen Depth (mm)"]
target_column = "Species"

data_path = "../data/penguins_classification.csv"
data = pd.read_csv(data_path)

data.sample(5)

Unnamed: 0,Culmen Length (mm),Culmen Depth (mm),Species
212,46.2,14.9,Gentoo
253,47.2,15.5,Gentoo
94,40.8,18.9,Adelie
19,37.8,18.3,Adelie
112,42.2,19.5,Adelie


In [2]:
from sklearn.model_selection import train_test_split

data, target = data[culmen_columns], data[target_column]
data_train, data_test, target_train, target_test = train_test_split(data, target, random_state=0)

## 1.2. The modelling: Decision Tree

In [3]:
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(max_depth=3, max_leaf_nodes=4)
tree.fit(data_train, target_train)

In [4]:
test_score = tree.score(data_test, target_test)
print(f"Accuracy of the DecisionTreeClassifier: {test_score:.1%}")

Accuracy of the DecisionTreeClassifier: 96.5%
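For a classifier, `score` reports the mean accuracy: the fraction of test samples whose predicted label matches the true label. The following is a minimal sketch of that calculation; the `accuracy` helper and the toy labels are illustrative and not part of the original notebook.

```python
def accuracy(true_labels, predicted_labels):
    """Return the fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("both sequences must have the same length")
    if len(true_labels) == 0:
        raise ValueError("cannot compute the accuracy of an empty sequence")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Toy labels, made up for illustration only (3 of 4 predictions are correct).
toy_true = ["Adelie", "Gentoo", "Chinstrap", "Adelie"]
toy_pred = ["Adelie", "Gentoo", "Adelie", "Adelie"]
print(f"Accuracy: {accuracy(toy_true, toy_pred):.1%}")
```

With the fitted tree, `accuracy(list(target_test), list(tree.predict(data_test)))` should agree with `tree.score(data_test, target_test)`.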


## 1.3. Setup and configure the tracking server (Scenario 3b)

We want to use a database as the tracking store and a local directory as the artifact store.
A database-backed tracking store is required for later steps in the tutorial, such as managing deployments. In this setup, artifacts are stored under the local `./mlruns` directory, and MLflow entities are written to the SQLite database file `mflow.db`.

Run the following command to start the tracking server for Scenario 3b:

```
mlflow server --backend-store-uri sqlite:///mflow.db --default-artifact-root mlruns/ --host 0.0.0.0 --port 5003
```

Now you should be able to see the Tracking UI in a browser at `http://localhost:5003`.
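Before pointing the notebook at the server, it can help to confirm that something is listening on the chosen port. The sketch below uses only the standard library; `is_port_open` is a hypothetical helper, and it only checks that a TCP connection succeeds, not that the listener is actually an MLflow server.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host not resolvable.
        return False


# Port 5003 is the one used in the `mlflow server` command above.
if is_port_open("localhost", 5003):
    print("Something is listening on localhost:5003")
else:
    print("Nothing is listening on localhost:5003; is the tracking server running?")
```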

In [5]:
import mlflow
import mlflow.sklearn

In [6]:
server_uri = "http://localhost:5003"     # port 5003: a local tracking server 
#server_uri = "http://localhost:5007"     # port 5007: a local docker container running a tracking server

mlflow.set_tracking_uri(server_uri)       # or set the MLFLOW_TRACKING_URI in the env

In [7]:
mlflow.tracking.get_tracking_uri()

'http://localhost:5003'
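As the comment in the cell above notes, the tracking URI can also come from the `MLFLOW_TRACKING_URI` environment variable. One possible way to combine both options is sketched below; `resolve_tracking_uri` is a hypothetical helper, and the default is simply the local server address used in this notebook.

```python
import os


def resolve_tracking_uri(default="http://localhost:5003"):
    """Return MLFLOW_TRACKING_URI from the environment, or the default if unset or empty."""
    return os.environ.get("MLFLOW_TRACKING_URI") or default


print(f"Tracking URI to use: {resolve_tracking_uri()}")
```

The result can then be passed on with `mlflow.set_tracking_uri(resolve_tracking_uri())`, so the same notebook works against the local server or against another server selected through the environment.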

### Create a new experiment

In [8]:
exp_name = "penguin_classification"

mlflow.create_experiment(exp_name)

'1'
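Note that `mlflow.create_experiment` raises an error if an experiment with the same name already exists, so re-running the cell above fails. A minimal sketch of a get-or-create pattern follows; `get_or_create_experiment` and its `client` parameter are hypothetical, introduced only so the helper can be exercised without a running server.

```python
def get_or_create_experiment(name, client=None):
    """Return the ID of the experiment called `name`, creating it if needed.

    `client` is any object offering `get_experiment_by_name` and
    `create_experiment`; by default the `mlflow` module itself is used.
    """
    if client is None:
        import mlflow  # imported lazily so the helper can be tested with a stub
        client = mlflow
    experiment = client.get_experiment_by_name(name)  # None if it does not exist
    if experiment is not None:
        return experiment.experiment_id
    return client.create_experiment(name)
```

In the notebook this would be called as `get_or_create_experiment(exp_name)`. `mlflow.set_experiment(exp_name)` also creates the experiment when it does not exist yet.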

## 1.4. Track parameters and metrics

Basic things to track:
- Parameters: key-value input parameters, logged with `mlflow.log_param` or `mlflow.log_params`
- Metrics: key-value metrics whose value is numeric and can be updated over the run, logged with `mlflow.log_metric` or `mlflow.log_metrics`
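Updating a metric over the run means logging the same key several times, optionally with a `step`. The sketch below illustrates this; `log_metric_series` and its `log_metric` parameter are hypothetical, and by default the helper forwards to `mlflow.log_metric`.

```python
def log_metric_series(key, values, log_metric=None):
    """Log each value under the same metric key, using its position as the step."""
    if log_metric is None:
        import mlflow  # lazy import; logs to the currently active run
        log_metric = mlflow.log_metric
    for step, value in enumerate(values):
        log_metric(key, float(value), step=step)
```

Inside a `with mlflow.start_run():` block, a call such as `log_metric_series("train_loss", losses)` (with `losses` being any list of numbers you computed) would record one point per step, and the tracking UI can then plot the metric history.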

In [9]:
mlflow.set_experiment(exp_name)                         # <-- set the experiment we want to track to

with mlflow.start_run() as run:                         # <-- start a run of the experiment
    print(f"Started run {run.info.run_id}")
    # Load dataset
    print("Load dataset...")
    culmen_columns = ["Culmen Length (mm)", "Culmen Depth (mm)"]
    target_column = "Species"

    data_path = "../data/penguins_classification.csv"
    data = pd.read_csv(data_path)

    # Prepare a train-test-split
    print("Prepare a train-test-split...")
    data, target = data[culmen_columns], data[target_column]
    data_train, data_test, target_train, target_test = train_test_split(data, target, random_state=0)

    # Initialize and fit a classifier
    max_depth = 3
    max_leaf_nodes = 4
    print(f"Initialize and fit a DecisionTreeClassifier with max_depth={max_depth}, max_leaf_nodes={max_leaf_nodes}")

    mlflow.log_params(                                   # <-- Track parameters
        {"max_depth": max_depth,
         "max_leaf_nodes": max_leaf_nodes}
    )
    tree = DecisionTreeClassifier(
        max_depth=max_depth,
        max_leaf_nodes=max_leaf_nodes
    )
    tree.fit(data_train, target_train)

    # Calculate test scores
    test_score = tree.score(data_test, target_test)
    mlflow.log_metric("test_accuracy", test_score)       # <-- Track metrics
    print(f"Result: Accuracy of the DecisionTreeClassifier: {test_score:.1%}")

Started run d812ecae304e4df084a152f38e7023dd
Load dataset...
Prepare a train-test-split...
Initialize and fit a DecisionTreeClassifier with max_depth=3, max_leaf_nodes=4
Result: Accuracy of the DecisionTreeClassifier: 96.5%


Have a look at the tracking UI to see how it played out!

<img src='../img/mlflow_ui_pinguins_experiment_runs_list.png' alt='' width='1000'>

<img src='../img/mlflow_ui_pinguins_experiment_first_run_details.png' alt='' width='1000'>

## 1.5. Track artifacts: the notebook

What else could we want to track?

- **Code Version**: Git commit hash used for the run (if it was run from an MLflow Project)
- **Start & End Time**: Start and end time of the run
- **Source code**: which code was run?
- **Notebook**: which notebook was run?
- **Plots**
- **Properties of the input data**
- **Model**
- ...
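As one way to capture properties of the input data, the file itself can be fingerprinted so that each run records exactly which data it saw. This is a sketch using only the standard library; `describe_file` and the key names it returns are illustrative choices, not an MLflow convention.

```python
import hashlib
from pathlib import Path


def describe_file(path, chunk_size=1 << 20):
    """Return the name, size in bytes, and SHA-256 hex digest of a file."""
    path = Path(path)
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return {
        "data_file": path.name,
        "data_bytes": path.stat().st_size,
        "data_sha256": digest.hexdigest(),
    }
```

Inside a run, the resulting dictionary could be attached with `mlflow.set_tags(describe_file(data_path))`, which makes it easy to spot runs that were trained on a different version of the CSV file.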

In [17]:
import os
print(f"Current working directory: {os.getcwd()}")

# POSSIBLE SOLUTION:

mlflow.set_experiment(exp_name)                                                                    # <-- set the experiment we want to track to

with mlflow.start_run() as run:                                                                    # <-- start a run of the experiment
    print(f"Started run {run.info.run_id}")
    # Load dataset
    print("Load dataset...")
    culmen_columns = ["Culmen Length (mm)", "Culmen Depth (mm)"]
    target_column = "Species"

    data_path = "../data/penguins_classification.csv"
    data = pd.read_csv(data_path)
    mlflow.log_param("num_samples", data.shape[0])                                                 # <-- ADDED: track the number of samples in the dataset

    # Prepare a train-test-split
    print("Prepare a train-test-split...")
    data, target = data[culmen_columns], data[target_column]
    data_train, data_test, target_train, target_test = train_test_split(
        data, target, random_state=0)

    # Initialize and fit a classifier
    max_depth = 3
    max_leaf_nodes = 4
    print(f"Initialize and fit a DecisionTreeClassifier with max_depth={max_depth}, max_leaf_nodes={max_leaf_nodes}")

    mlflow.log_params(                                                                             # <-- Track parameters
        {"max_depth": max_depth,
         "max_leaf_nodes": max_leaf_nodes}
    )
    tree = DecisionTreeClassifier(
        max_depth=max_depth,
        max_leaf_nodes=max_leaf_nodes
    )
    tree.fit(data_train, target_train)

    # Calculate test scores
    test_score = tree.score(data_test, target_test)
    mlflow.log_metric("test_accuracy", test_score)                                                 # <-- Track metrics

    # Track artifacts:
    os.chdir("../../../mlflow")                                                                    # Change to the directory the tracking server was started from
    print(f"Current working directory temporarily moved to: {os.getcwd()}")

    mlflow.log_artifact("../examples/palmer_pinguins/notebooks/1_Run_and_track_experiments.ipynb") # <-- ADDED: track the source code of the notebook
    print(f"Notebook '1_Run_and_track_experiments.ipynb' stored in: {run.info.artifact_uri}")

    os.chdir("../examples/palmer_pinguins/notebooks")                                              # Change back to the original working directory
    print(f"Current working directory: {os.getcwd()}")

    # Show results:
    print(f"Result: Accuracy of the DecisionTreeClassifier: {test_score:.1%}")

Current working directory: /home/gustavo/training/GitHub/Training.MLOps.MLFlow/2_MLFlow_Backend_and_Artifact_Storage_Scenarios_1_2_and_3/examples/palmer_pinguins/notebooks
Started run 6b45743850f24bcdaeb320c147fb1acf
Load dataset...
Prepare a train-test-split...
Initialize and fit a DecisionTreeClassifier with max_depth=3, max_leaf_nodes=4
Current working directory temporarily moved to: /home/gustavo/training/GitHub/Training.MLOps.MLFlow/2_MLFlow_Backend_and_Artifact_Storage_Scenarios_1_2_and_3/mlflow
Notebook '1_Run_and_track_experiments.ipynb' stored in: ./mlruns/1/6b45743850f24bcdaeb320c147fb1acf/artifacts
Current working directory: /home/gustavo/training/GitHub/Training.MLOps.MLFlow/2_MLFlow_Backend_and_Artifact_Storage_Scenarios_1_2_and_3/examples/palmer_pinguins/notebooks
Result: Accuracy of the DecisionTreeClassifier: 96.5%
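The cell above changes the working directory by hand, presumably because the reported artifact URI (`./mlruns/...`) is relative and is therefore resolved against the client's working directory. If `log_artifact` raises, the notebook is left in the wrong directory. A small context manager avoids that; `working_directory` is a hypothetical helper, and Python 3.11+ ships an equivalent as `contextlib.chdir`.

```python
import os
from contextlib import contextmanager


@contextmanager
def working_directory(path):
    """Temporarily change the current working directory, restoring it on exit."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        # Runs even if the body raises, so the notebook always ends up back here.
        os.chdir(previous)
```

Within the run, the artifact could then be logged with `with working_directory("../../../mlflow"): mlflow.log_artifact(...)`, and the second `os.chdir` call is no longer needed.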


## 1.6. Track artifacts: the trained Model

We want to store the model artifacts so that we can reuse them for deployment or later experimentation.
Since we used a scikit-learn model here, we can use the built-in `mlflow.sklearn` module to store the model in sklearn format.

```mlflow.sklearn.log_model(tree, "model")```

There are built-in modules for many kinds of models, as well as the option to specify a custom format. Autologging is available too!

In [20]:
mlflow.set_experiment(exp_name)                                                                    # <-- set the experiment we want to track to

with mlflow.start_run() as run:                                                                    # <-- start a run of the experiment
    print(f"Started run {run.info.run_id}")
    # Load dataset
    print("Load dataset...")
    culmen_columns = ["Culmen Length (mm)", "Culmen Depth (mm)"]
    target_column = "Species"

    data_path = "../data/penguins_classification.csv"
    data = pd.read_csv(data_path)
    mlflow.log_param("num_samples", data.shape[0])                                                 # <-- track the number of samples in the dataset

    # Prepare a train-test-split
    print("Prepare a train-test-split...")
    data, target = data[culmen_columns], data[target_column]
    data_train, data_test, target_train, target_test = train_test_split(
        data, target, random_state=0)

    # Initialize and fit a classifier
    max_depth = 3
    max_leaf_nodes = 4
    print(f"Initialize and fit a DecisionTreeClassifier with max_depth={max_depth}, max_leaf_nodes={max_leaf_nodes}")

    mlflow.log_params(                                                                             # <-- Track parameters
        {"max_depth": max_depth,
         "max_leaf_nodes": max_leaf_nodes}
    )
    tree = DecisionTreeClassifier(
        max_depth=max_depth,
        max_leaf_nodes=max_leaf_nodes
    )
    tree.fit(data_train, target_train)

    # Calculate test scores
    test_score = tree.score(data_test, target_test)
    mlflow.log_metric("test_accuracy", test_score)                                                 # <-- Track metrics
    print(f"Result: Accuracy of the DecisionTreeClassifier: {test_score:.1%}")

    # Track artifacts
    os.chdir("../../../mlflow")                                                                    # Change to the directory the tracking server was started from
    print(f"Current working directory temporarily moved to: {os.getcwd()}")

    mlflow.log_artifact("../examples/palmer_pinguins/notebooks/1_Run_and_track_experiments.ipynb") # <-- Track the source code of the notebook
    print(f"Notebook '1_Run_and_track_experiments.ipynb' stored in: {run.info.artifact_uri}")

    # Log the model
    mlflow.sklearn.log_model(tree, "model")                                                        # <-- ADDED: Log the model
    print(f"Model stored in: {run.info.artifact_uri}/model")

    os.chdir("../examples/palmer_pinguins/notebooks")                                              # Change back to the original working directory
    print(f"Current working directory: {os.getcwd()}")


Started run f7a63ac5ca554188a67636e3ed8007fe
Load dataset...
Prepare a train-test-split...
Initialize and fit a DecisionTreeClassifier with max_depth=3, max_leaf_nodes=4
Result: Accuracy of the DecisionTreeClassifier: 96.5%
Current working directory temporarily moved to: /home/gustavo/training/GitHub/Training.MLOps.MLFlow/2_MLFlow_Backend_and_Artifact_Storage_Scenarios_1_2_and_3/mlflow
Notebook '1_Run_and_track_experiments.ipynb' stored in: ./mlruns/1/f7a63ac5ca554188a67636e3ed8007fe/artifacts
Model stored in: ./mlruns/1/f7a63ac5ca554188a67636e3ed8007fe/artifacts/model
Current working directory: /home/gustavo/training/GitHub/Training.MLOps.MLFlow/2_MLFlow_Backend_and_Artifact_Storage_Scenarios_1_2_and_3/examples/palmer_pinguins/notebooks
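A model logged this way can later be addressed through a `runs:/` URI built from the run ID and the artifact path. The helper below is a small sketch of that URI scheme; `model_uri_for_run` is a hypothetical name, and `"model"` is the artifact path passed to `log_model` above.

```python
def model_uri_for_run(run_id, artifact_path="model"):
    """Build the `runs:/<run_id>/<artifact_path>` URI for a logged model."""
    if not run_id:
        raise ValueError("run_id must be a non-empty string")
    return f"runs:/{run_id}/{artifact_path.strip('/')}"


# Run ID taken from the output above.
print(model_uri_for_run("f7a63ac5ca554188a67636e3ed8007fe"))
```

The URI can then be passed to `mlflow.sklearn.load_model(...)` to get the fitted `DecisionTreeClassifier` back. With the relative artifact root used in this scenario, the call likely has to be made from the directory the tracking server was started in.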


Have a look at the tracking UI to see how it played out!

<img src='../img/mlflow_ui_pinguins_experiment_runs_list_3_with_model.png' alt='' width='1000'>

<img src='../img/mlflow_ui_pinguins_experiment_run_details_3_with_model.png' alt='' width='1000'>

We can take a look at the MLmodel package that was created:

<img src='../img/mlflow_ui_pinguins_experiment_run_details_3_with_model-MLmodel.png' alt='' width='1000'>

The conda environment specification:

<img src='../img/mlflow_ui_pinguins_experiment_run_details_3_with_model-Conda.png' alt='' width='1000'>

And the Python environment specification:

<img src='../img/mlflow_ui_pinguins_experiment_run_details_3_with_model-PythonEnv.png' alt='' width='1000'>
<img src='../img/mlflow_ui_pinguins_experiment_run_details_3_with_model-Requirements.png' alt='' width='1000'>