# Homework
The goal of this homework is to get familiar with tools like MLflow for experiment tracking and model management.

## Q1. Install the package
To get started with MLflow you'll need to install the appropriate Python package.  
Once you have installed the package, run the command mlflow --version and check the output.

What's the version that you have?

In [None]:
!python -V

In [None]:
!mlflow --version
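
The same check can also be done from inside Python, which is handy when the notebook kernel and the shell use different environments. This is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of the homework.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        # The package is not installed in the environment running this code.
        return None


# Prints the MLflow version seen by this interpreter, or None if it is absent.
print(installed_version("mlflow"))
```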

## Q2. Download and preprocess the data
We'll use the Green Taxi Trip Records dataset to predict the amount of tips for each trip.

So what's the size of the saved DictVectorizer file?

* 54 kB
* 154 kB
* 54 MB
* 154 MB

In [None]:
!cd ~/notebooks/data/output && ls -l

The size of the saved DictVectorizer file is about 154 kB.
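
Since `ls -l` reports sizes in bytes, a small helper can make the comparison with the answer options easier. The sketch below assumes decimal units (1 kB = 1000 bytes); the function names and the file name `dv.pkl` in the usage note are assumptions, not something stated above.

```python
from pathlib import Path


def human_size(num_bytes: int) -> str:
    """Format a byte count using decimal (SI) units: B, kB, MB, GB."""
    size = float(num_bytes)
    for unit in ("B", "kB", "MB", "GB"):
        # Stop at the first unit where the value drops below 1000,
        # or at the largest unit we know about.
        if size < 1000 or unit == "GB":
            return f"{size:.1f} {unit}"
        size /= 1000


def file_size(path) -> str:
    """Return the human-readable size of the file at `path` ("~" is expanded)."""
    return human_size(Path(path).expanduser().stat().st_size)
```

For example, `file_size("~/notebooks/data/output/dv.pkl")` would report the size of the vectorizer file, assuming it was saved under that name.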

## Q3. Train a model with autolog
We will train a RandomForestRegressor (from Scikit-Learn) on the taxi dataset.

We have prepared the training script train.py.

The script will:

* load the datasets produced by the previous step,
* train the model on the training set,
* calculate the RMSE score on the validation set.

Your task is to modify the script to enable autologging with MLflow, execute the script and then launch the MLflow UI to check that the experiment run was properly tracked.

Tip 1: don't forget to wrap the training code in a with mlflow.start_run(): block, as we showed in the videos.

Tip 2: don't modify the hyperparameters of the model to make sure that the training will finish quickly.
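
For reference, enabling autologging for scikit-learn could look roughly like the sketch below. It is not the homework's train.py: the function name `train_with_autolog` is hypothetical, the data is passed in as arrays rather than loaded from the pickled files, and the tracking URI and experiment are left at whatever is configured globally.

```python
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error


def train_with_autolog(X_train, y_train, X_val, y_val):
    """Fit a random forest inside an MLflow run and return (run_id, rmse)."""
    # Enable scikit-learn autologging before fit() is called, so the
    # estimator's parameters are recorded without explicit log_param calls.
    mlflow.sklearn.autolog()

    with mlflow.start_run() as run:
        # Same hyperparameters as in the homework script.
        rf = RandomForestRegressor(max_depth=10, random_state=0)
        rf.fit(X_train, y_train)
        y_pred = rf.predict(X_val)
        # RMSE computed as the square root of the MSE.
        rmse = float(mean_squared_error(y_val, y_pred) ** 0.5)
        # The validation metric is still logged by hand under a fixed name.
        mlflow.log_metric("rmse", rmse)

    return run.info.run_id, rmse
```

After a call to this function, the run's Parameters section in the MLflow UI should list the estimator's settings, including max_depth.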

What is the value of the max_depth parameter?

* 4
* 6
* 8
* 10

In [None]:
import os
import pickle
import click

import mlflow

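# SQLite URIs do not expand "~", so resolve the home directory explicitly.
mlflow.set_tracking_uri(
    "sqlite:///" + os.path.expanduser("~/mlops-zoomcamp/cohorts/2023/02-experiment-tracking/homework/mlflow.db")
)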
mlflow.set_experiment("nyc-taxi-experiment-rf")

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error


def load_pickle(filename: str):
    with open(filename, "rb") as f_in:
        return pickle.load(f_in)


@click.command()
@click.option(
    "--data_path",
    default="/home/ubuntu/notebooks/data/output",
    help="Location where the processed NYC taxi trip data was saved"
)
def run_train(data_path: str):
    with mlflow.start_run():
        
        mlflow.set_tag("developer", "t")
        
        mlflow.log_param("train-data-path", data_path)
        mlflow.log_param("valid-data-path", data_path)

        max_depth=10
        random_state=0
        mlflow.log_param("max_depth", max_depth)
        mlflow.log_param("random_state", random_state)
        X_train, y_train = load_pickle(os.path.join(data_path, "train.pkl"))
        X_val, y_val = load_pickle(os.path.join(data_path, "val.pkl"))

        rf = RandomForestRegressor(max_depth=max_depth, random_state=random_state)
        rf.fit(X_train, y_train)
        y_pred = rf.predict(X_val)

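        # Take the square root of the MSE explicitly; the squared=False
        # argument is not available in recent scikit-learn releases.
        rmse = mean_squared_error(y_val, y_pred) ** 0.5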
        mlflow.log_metric("rmse", rmse)


if __name__ == '__main__':
    run_train()


The value of the max_depth parameter is 10.

## Q4. Tune model hyperparameters

What's the best validation RMSE that you got?

* 1.85
* 2.15
* 2.45
* 2.85

In [None]:
import os
import pickle
import click
import mlflow
import optuna

from optuna.samplers import TPESampler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

mlflow.set_tracking_uri("http://127.0.0.1:5000")
mlflow.set_experiment("random-forest-hyperopt")


def load_pickle(filename):
    with open(filename, "rb") as f_in:
        return pickle.load(f_in)


@click.command()
@click.option(
    "--data_path",
    default="/home/ubuntu/notebooks/data/output",
    help="Location where the processed NYC taxi trip data was saved"
)
@click.option(
    "--num_trials",
    default=10,
    help="The number of parameter evaluations for the optimizer to explore"
)
def run_optimization(data_path: str, num_trials: int):

    X_train, y_train = load_pickle(os.path.join(data_path, "train.pkl"))
    X_val, y_val = load_pickle(os.path.join(data_path, "val.pkl"))

    def objective(trial):
        with mlflow.start_run():
            mlflow.set_tag("Question4", "RandomForestRegressor")
            params = {
                # step defaults to 1, so it does not need to be passed positionally.
                'n_estimators': trial.suggest_int('n_estimators', 10, 50),
                'max_depth': trial.suggest_int('max_depth', 1, 20),
                'min_samples_split': trial.suggest_int('min_samples_split', 2, 10),
                'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 4),
                'random_state': 42,
                'n_jobs': -1
            }
            mlflow.log_params(params)
            rf = RandomForestRegressor(**params)
            rf.fit(X_train, y_train)
            y_pred = rf.predict(X_val)
            # Square root of the MSE; squared=False is not available in recent scikit-learn releases.
            rmse = mean_squared_error(y_val, y_pred) ** 0.5
            mlflow.log_metric("rmse", rmse)
        return rmse

    sampler = TPESampler(seed=42)
    study = optuna.create_study(direction="minimize", sampler=sampler)
    study.optimize(objective, n_trials=num_trials)


if __name__ == '__main__':
    run_optimization()


The best validation RMSE is 2.45.

## Q5. Promote the best model to the model registry

Your task is to update the script register_model.py so that it selects the model with the lowest RMSE on the test set and registers it to the model registry.
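
The selection and registration step could be sketched as follows. This is only an illustration under stated assumptions, not the contents of register_model.py: the metric name `test_rmse`, the artifact path `model`, and both function names are hypothetical, and the experiment and model names are supplied by the caller.

```python
import mlflow
from mlflow.entities import ViewType
from mlflow.tracking import MlflowClient


def find_best_run(client, experiment_name, metric="test_rmse"):
    """Return the active run with the lowest value of `metric`, or None."""
    experiment = client.get_experiment_by_name(experiment_name)
    if experiment is None:
        return None
    # Ask the tracking backend to sort by the metric and return one run.
    runs = client.search_runs(
        experiment_ids=[experiment.experiment_id],
        run_view_type=ViewType.ACTIVE_ONLY,
        max_results=1,
        order_by=[f"metrics.{metric} ASC"],
    )
    return runs[0] if runs else None


def register_best_run(client, experiment_name, model_name):
    """Register the model of the best run under `model_name`."""
    best_run = find_best_run(client, experiment_name)
    if best_run is None:
        raise ValueError(f"No runs found in experiment {experiment_name!r}")
    # Assumption: each run logged its model under the artifact path "model".
    model_uri = f"runs:/{best_run.info.run_id}/model"
    # mlflow.register_model uses the globally configured tracking/registry URI.
    return mlflow.register_model(model_uri=model_uri, name=model_name)
```

Sorting on the server side with order_by avoids fetching every run just to compare metrics locally.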

What is the test RMSE of the best model?

* 1.885
* 2.185
* 2.555
* 2.955

The RMSE of the best model is .

## Q6. Model metadata

Now explore your best model in the model registry using the UI. What information does the model registry contain about each model?

* Version number
* Source experiment
* Model signature
* All the above answers are correct

Answer: All the above answers are correct.