# Week 2
    
## Important Concepts
* ML experiment: The process of building an ML model
* Experiment run: Each trial in an ML experiment
* Run artifact: Any file that is associated with an ML run
* Experiment metadata: All information related to the experiment


## What is Experiment Tracking?
Experiment tracking is the process of keeping track of all the **relevant information** from an **ML experiment**, which includes:
* Source code
* Environment
* Data
* Model
* Hyperparameters
* Metrics
* ...

What exactly counts as 'relevant information' depends on the specific experiment.

## Why is Experiment Tracking so important?
There are three main reasons:
* Reproducibility
* Organization
* Optimization

## MLflow
* Python package that contains 4 main modules:
    * Tracking
    * Models
    * Model Registry
    * Projects
* Here we focus on tracking
    * The MLflow tracking module lets you organize your experiments into runs and keep track of:
        * Parameters
        * Metrics
        * Metadata
        * Artifacts
        * Models
    * Along with this information, MLflow automatically logs extra information about the run:
        * Source code
        * Version of the code (git commit)
        * Start and end time
        * Author


## Getting started
* ```pip install mlflow```
* Typing ```mlflow``` shows the options you have:
![mlflow](mlflow.png)
* Have a look at the ```ui``` option:
    * ```mlflow ui```
    * This runs mlflow ui locally
    * This gives you access to the experiments via the browser
![mlflow](mlflow_ui_terminal.png)
![mlflow](mlflow_ui.png)


## Example: How to add logging to a Jupyter Notebook
* Create conda environment: ```conda create --name exp-tracking-env python=3.9```
* Activate the environment: ```conda activate exp-tracking-env```
* Install the requirements: ```pip install -r requirements.txt```
* Start the MLflow UI: ```mlflow ui --backend-store-uri sqlite:///mlflow.db```
    * The option ```--backend-store-uri``` here means that we want to store the experiment metadata (parameters, metrics, tags) in an SQLite database; artifacts are written to the artifact store, which is a local folder by default
* Copy the notebook mlops-zoomcamp/01-intro/duration-prediction.ipynb
* Create a kernel from the environment: ```conda install -c anaconda ipykernel```, then ```python -m ipykernel install --user --name=exp-tracking-env```
* Open the notebook and choose the kernel ```exp-tracking-env```
* Add this to the notebook:
    ```
    import mlflow
    mlflow.set_tracking_uri("sqlite:///mlflow.db")
    mlflow.set_experiment("nyc-taxi-experiment")
    ```
* To track a run, add ```with mlflow.start_run():```
    * Everything inside the ```with``` statement is then associated with the current run


## Experiment Tracking with MLflow
* Add parameter tuning to the notebook
    * Use a second model as example: xgboost
    * Use hyperopt for hyperparameter tuning
    * Documentation for hyperopt: https://hyperopt.github.io/hyperopt/getting-started/search_spaces
* Show how it looks in MLflow
    * Different visualisation possibilities: Parallel Coordinates Plot, Scatter Plot, Contour Plot
    * Possibility to filter results, e.g. by tags
* Select the best one
    * One way to select the best model is to sort the results by the metric
    * Also consider: training time, model size
* Autolog
    * Works only with certain frameworks: mlflow.org/docs/latest/tracking.html#automatic-logging
    * Enables us to log a lot of information automatically with less code; additional logging may still be necessary
    * For xgboost: ```mlflow.xgboost.autolog()```
        * This automatically saves a lot of useful parameters and artifacts
* Use the notebook from the previous session again: ```duration-prediction.ipynb```


# Model Management
![ml_lifecycle](ml_lifecycle.png)
(https://neptune.ai/blog/ml-experiment-tracking)
* After deploying the model we may realize that the model needs to be updated
* Once we deploy the model, the prediction monitoring stage starts
* As with experiment tracking, the way we manage our models can be automated
    * E.g. we could manage our models manually using different folders and filenames. This has several disadvantages:
        * it is very error prone
        * there is no versioning
        * there is no model lineage
    * Alternatively we can use mlflow to manage our models
        * the simplest way to save a model is: ```mlflow.log_artifact(local_path="models/lin_reg.bin", artifact_path="models_pickle/")```. Note: for me this did not work; only ```mlflow.log_artifact(local_path="models/lin_reg.bin")``` did
        * better way to save the models:
          ```
          import mlflow
          import xgboost as xgb
          from sklearn.metrics import mean_squared_error

          # train and valid are xgb.DMatrix objects and y_val holds the validation
          # targets; all three are created earlier in the notebook
          with mlflow.start_run():
              best_params = {'learning_rate': 0.20905792515510074,
                             'max_depth': 7,
                             'min_child_weight': 0.5241500975917085,
                             'objective': 'reg:squarederror',
                             'reg_alpha': 0.13309121698466933,
                             'reg_lambda': 0.11277257081373988,
                             'seed': 42}
              mlflow.log_params(best_params)
              booster = xgb.train(
                  # the best parameters are passed to xgboost
                  params=best_params,
                  # training on train data
                  dtrain=train,
                  # set boosting rounds
                  num_boost_round=100,
                  # validation is done on validation dataset
                  evals=[(valid, 'validation')],
                  # stop if the model does not improve for 50 rounds
                  early_stopping_rounds=50
              )

              # make predictions
              y_pred = booster.predict(valid)
              # calculate RMSE as the square root of the MSE
              # (works whether or not scikit-learn supports squared=False)
              rmse = mean_squared_error(y_val, y_pred) ** 0.5
              # log metric
              mlflow.log_metric("rmse", rmse)
              # log the model
              mlflow.xgboost.log_model(booster, artifact_path="models_mlflow")
          ```
        * Note: We have to disable the autolog, else the model will be saved twice: ```mlflow.xgboost.autolog(disable=True)```
* Additionally we should log the preprocessor as an artifact
* The MLflow UI also shows us how to make predictions
    * under ```models_mlflow```, examples for Spark and pandas DataFrame predictions are shown, with the specific model id
    * we can load the model by its model id and use it to make predictions

# Model Registry
* With MLflow we have a tracking server, where we can track our models, their metrics etc.
* At some point we decide that some models are ready for production. We can then store them in the model registry of MLflow
* The model registry does not deploy a model; it is a list of models that are ready for deployment
* To decide on a specific model, consider
    * metric
    * training time
    * model size
* Once you have decided to register a model, click on "Register Model"
* In our case we don't have any model registered yet, i.e. we have to create a new one. Call it "nyc-taxi-regressor"
* Note: For this to work we need to set the tracking URI with ```mlflow.set_tracking_uri```
* We can then see our registered models under the "Models" tab in mlflow
* The registered models can be assigned to different stages:
    * Staging
    * Production
    * Archived
* We can interact with the models using ```MlflowClient```
    * See notebook: 'mlflow.client.ipynb'

## MLflow in Practice
* Depending on the scenario, different aspects of MLflow are needed. Sometimes local storage of the experiments is sufficient; in other cases (e.g. when several people develop the model together) it is important to share the results using a remote server.
* Configuring mlflow
    * backend store
        * local filesystem (if no backend store is set, this is the default)
        * SQLAlchemy-compatible database (e.g. SQLite)
    * artifacts store
        * local filesystem (default)
        * remote (e.g. S3 bucket)
    * tracking server
        * no tracking server
        * localhost
        * remote
* Example notebooks of these 3 scenarios can be found in [running-mlflow-examples](running-mlflow-examples)
### Benefits of a remote tracking server
* The tracking server can easily be deployed on the cloud
* Share experiments with other Data Scientists
* Collaborate with others to build and deploy models
* Give more visibility of the data science efforts
### Issues with running a remote (shared) MLflow server
* Security
    * restrict access to the server (e.g. access through VPN)
* Scalability
* Isolation
    * define standards for naming experiments, models and a set of default tags
    * restrict access to artifacts (e.g. use s3 buckets living on different AWS accounts)
### MLflow limitations
* Authentication & Users: The open source version of MLflow doesn't provide any sort of authentication
* Data versioning: To ensure full reproducibility we need to version the data used to train a model. MLflow doesn't provide a built-in solution for that, but there are a few ways to deal with this limitation
* Model/Data Monitoring & Alerting: This is outside of the scope of MLflow; dedicated monitoring tools are better suited for this.
