## Q1. Install the package
To get started with MLflow you'll need to install the appropriate Python package.

For this, we recommend creating a separate Python environment (for example, a conda environment) and then installing the package there with pip or conda.

Once you've installed the package, run the command `mlflow --version` and check the output.

What's the version that you have?

In [None]:
! mlflow --version

In [None]:
from preprocess_data import run_data_prep

In [None]:
run_data_prep(raw_data_path='data', dest_path='output')

## Q2. What's the size of the saved DictVectorizer file?

- 54 kB
- 154 kB
- 54 MB
- 154 MB

In [None]:
! du -h output/dv.pkl

In [None]:
from train import run_train

In [None]:
run_train('output')

## Q3. What is the value of the `max_depth` parameter?

- 4
- 6
- 8
- 10

## Q4. Tune model hyperparameters
Now let's try to reduce the validation error by tuning the hyperparameters of the `RandomForestRegressor` using `optuna`. We have prepared the script `hpo.py` for this exercise.

Your task is to modify the script `hpo.py` so that the validation RMSE is logged to the tracking server for each run of the hyperparameter optimization (you will need to add a few lines of code to the `objective` function). Then run the script without passing any parameters.

After that, open the MLflow UI and explore the runs from the experiment called `random-forest-hyperopt` to answer the question below.

Note: Don't use autologging for this exercise.

The idea is to just log the information that you need to answer the question below, including:
- the list of hyperparameters that are passed to the `objective` function during the optimization
- the `RMSE` obtained on the validation set (February 2022 data).
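
As a rough illustration of the two items above, here is a minimal sketch of manual logging for one trial. It assumes that the `objective` function already has a dict of hyperparameters and the validation predictions at hand. The helpers `rmse` and `log_trial` and the metric key `"rmse"` are hypothetical names chosen for this example; they are not taken from `hpo.py`.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def log_trial(params, y_val, y_pred):
    """Log one optimization trial as its own MLflow run (no autologging).

    `params` is assumed to be the dict of hyperparameters for this trial.
    """
    import mlflow  # imported here so rmse() stays usable without MLflow

    with mlflow.start_run():
        mlflow.log_params(params)         # hyperparameters of this trial
        value = rmse(y_val, y_pred)
        mlflow.log_metric("rmse", value)  # validation RMSE (assumed key name)
    return value
```

Inside `objective`, one possible use is to call `log_trial(params, y_val, y_pred)` right after predicting on the validation set and return its result, so that the value reported to the optimizer and the logged metric stay identical.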

### What's the best validation RMSE that you got?

- 1.85
- 2.15
- 2.45
- 2.85


In [None]:
from hpo import run_optimization

In [None]:
run_optimization('output', num_trials=10)

## Q5. Promote the best model to the model registry

The results from the hyperparameter optimization are quite good, so we can assume that we are ready to test some of these models in production. In this exercise, you'll promote the best model to the model registry. We have prepared a script called `register_model.py`, which will check the results from the previous step and select the top 5 runs. After that, it will calculate the RMSE of those models on the test set (March 2022 data) and save the results to a new experiment called `random-forest-best-models`.

Your task is to update the script `register_model.py` so that it selects the model with the lowest RMSE on the test set and registers it to the model registry.

Tips for MLflow:

1. You can use the method `search_runs` from the `MlflowClient` to get the model with the lowest RMSE.
2. To register the model, you can use the method `mlflow.register_model`. You will need to pass the right `model_uri` as a string that looks like this: `"runs:/<RUN_ID>/model"`, and the name of the model (make sure to choose a good one!).
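
The two tips can be combined roughly as in the sketch below. It is only an illustration under assumptions: the metric key `test_rmse`, the helper names `model_uri` and `register_best_model`, and the artifact path `model` are hypothetical choices for this example and may differ from what `register_model.py` actually uses.

```python
def model_uri(run_id, artifact_path="model"):
    """Build a URI of the form "runs:/<RUN_ID>/model"."""
    return f"runs:/{run_id}/{artifact_path}"


def register_best_model(experiment_name, model_name, metric="test_rmse"):
    """Register the run with the lowest value of `metric` (assumed key name)."""
    import mlflow  # imported here so model_uri() stays usable without MLflow
    from mlflow.entities import ViewType
    from mlflow.tracking import MlflowClient

    client = MlflowClient()
    experiment = client.get_experiment_by_name(experiment_name)
    if experiment is None:
        raise ValueError(f"Experiment {experiment_name!r} not found")

    # Sort ascending by the metric and keep only the best run.
    best_run = client.search_runs(
        experiment_ids=[experiment.experiment_id],
        run_view_type=ViewType.ACTIVE_ONLY,
        max_results=1,
        order_by=[f"metrics.{metric} ASC"],
    )[0]

    return mlflow.register_model(model_uri(best_run.info.run_id), name=model_name)
```

A call such as `register_best_model("random-forest-best-models", "my-model-name")` would then register the best run, where the model name is a placeholder you should replace with your own.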

### What is the test RMSE of the best model?


- 1.885
- 2.185
- 2.555
- 2.955

In [None]:
from register_model import run_register_model

In [None]:
run_register_model('output', top_n=10)

## Q6. Model metadata
Now explore your best model in the model registry using the MLflow UI. What information does the model registry contain about each model?

- Version number
- Source experiment
- Model signature
- All the above answers are correct