- What to log
    - model name
    - model hyper-parameters
    - model features
    - performance metrics
- Model
    - Save
    - Load
- Create experiments
- Search the runs of a given experiment with a SQL-like query - https://docs.faculty.ai/user-guide/experiments/index.html#experiments-multiple
- Create runs


# Doc

## MLflow Tracking
documentation > https://www.mlflow.org/docs/latest/tracking.html

**Vocabulary**
- *run*: An MLflow run is a collection of parameters, metrics, tags, and artifacts associated with a machine learning model training process.
- *experiment*: Experiments are the primary unit of organization in MLflow; all MLflow runs belong to an experiment. Each experiment lets you visualize, search, and compare runs, as well as download run artifacts or metadata for analysis in other tools.
- *MLflow entities*: runs, parameters, metrics, tags, notes, metadata, etc
- ...

**What can be recorded by an MLflow run?** > https://www.mlflow.org/docs/latest/tracking.html#concepts

**Where are runs recorded?** > https://www.mlflow.org/docs/latest/tracking.html#where-runs-are-recorded

They can be recorded
- to local files (by default to *mlruns* directory)
    - `mlflow ui`
- to a SQLAlchemy-compatible database
    - `mlflow.set_tracking_uri('sqlite:///mlflow.db')`
    - `mlflow ui --backend-store-uri sqlite:///mlflow.db`
- remotely to a tracking server

To show the current tracking URI: `mlflow.get_tracking_uri()`
    
**How are they recorded?** > https://www.mlflow.org/docs/latest/tracking.html#how-runs-and-artifacts-are-recorded

MLflow uses two components for storage:
- backend store: for MLflow entities (runs, parameters, metrics, tags, notes, metadata, etc)
- artifact store: for artifacts (files, models, images, in-memory objects, or model summary, etc)

**How to visualize the logged runs?**
- You can use the MLflow tracking UI: `mlflow ui` (run it from the folder that contains the *mlruns* directory)

### Logging

**What to log**


**How**
- Manual logging
    - Log the fitted model: `mlflow.sklearn.log_model(rf, 'random-forest-model')`
    - Log the model parameters: `mlflow.log_param('num_trees', n_estimators)`
    - Log the evaluation metrics: `mlflow.log_metric('mse', mse)`
    - Log other artifacts: `mlflow.log_artifact('predictions.csv')`

- Automatic logging with MLflow autolog
    - With MLflow's autologging capabilities, a single line of code automatically logs the resulting model, the parameters used to create the model, and a model score > https://www.mlflow.org/docs/latest/tracking.html#automatic-logging
    - Call the `mlflow.<framework>.autolog()` API before running the training code to log model-specific metrics, parameters, and model artifacts. Many ML frameworks are supported (sklearn, tensorflow, etc).

### Other


# Code

In [4]:
import numpy as np
import pandas as pd
import mlflow



# First pipeline

## Load dataset

In [24]:
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

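def get_dataset() -> list:
    """Load the diabetes dataset and split it into train and test sets."""
    db = load_diabetes()
    X, y = db.data, db.target
    # Returns [X_train, X_test, y_train, y_test] as NumPy arrays
    return train_test_split(X, y)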



In [28]:
X_train, X_test, y_train, y_test = get_dataset()
X_train.shape, X_test.shape

((331, 10), (111, 10))

## Train model

In [29]:

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def evaluate_model(model, X_test, y_test):
    y_pred = model.predict(X_test)
    
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    print(f'RMSE = {rmse:.2f}, MAE = {mae:.2f}, R2 = {r2:.2f}')
    return rmse, mae, r2

def train_model(X_train, X_test, y_train, y_test, model_class, **model_kwargs) -> tuple:
    """Fit `model_class` on the training set and return (rmse, mae, r2) on the test set."""
    model = model_class(random_state=42, **model_kwargs)
    model.fit(X_train, y_train)
    return evaluate_model(model, X_test, y_test)

In [30]:
from sklearn.linear_model import ElasticNet
from sklearn.ensemble import RandomForestRegressor

model_dict_list = [
    {'model_class' : ElasticNet, 'model_kwargs': {'alpha': 0.01, 'l1_ratio': 0.75}},
    {'model_class' : RandomForestRegressor, 'model_kwargs': {'n_estimators': 100, 'max_depth': 6, 'max_features': 3}}
]

# MLflow

In [31]:
import mlflow

mlflow_backend_store_sqlite_db_uri = 'sqlite:///mlflow.db'
mlflow.set_tracking_uri(mlflow_backend_store_sqlite_db_uri)

In [34]:
mlflow.sklearn.autolog()

mlflow.set_experiment('experiment 2')

for model_dict in model_dict_list:
    with mlflow.start_run():
        train_model(X_train, X_test, y_train, y_test, model_dict['model_class'], **model_dict['model_kwargs'])



INFO: 'experiment 2' does not exist. Creating a new experiment
RMSE = 58.78, MAE = 49.96, R2 = 0.44
RMSE = 58.75, MAE = 48.96, R2 = 0.44


In [36]:
# Searches the active experiment; `metrics.` is the documented key prefix
mlflow.search_runs(filter_string="metrics.training_mae < 30")



|  | run_id | experiment_id | status | artifact_uri | start_time | end_time | metrics.training_rmse | metrics.training_mse | metrics.training_mae | metrics.training_score | ... | params.random_state | params.min_samples_split | params.max_samples | params.verbose | tags.mlflow.source.type | tags.estimator_class | tags.estimator_name | tags.mlflow.source.name | tags.mlflow.log-model.history | tags.mlflow.user |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 67cf7cf4c4a34b618c566b0d951bf094 | 2 | FINISHED | ./mlruns/2/67cf7cf4c4a34b618c566b0d951bf094/ar... | 2021-05-12 23:54:31.977000+00:00 | 2021-05-12 23:54:32.474000+00:00 | 35.647433 | 1270.739478 | 29.775595 | 0.78143 | ... | 42 | 2 |  | 0 | LOCAL | sklearn.ensemble._forest.RandomForestRegressor | RandomForestRegressor | /Users/alaa.bakhti/miniconda3/envs/dsp/lib/pyt... | `[{"run_id": "67cf7cf4c4a34b618c566b0d951bf094"...` | alaa.bakhti |
