# Homework 2

The goal of this homework is to get familiar with tools like MLflow for experiment tracking and model management.

See [questions](https://github.com/DataTalksClub/mlops-zoomcamp/blob/main/cohorts/2023/02-experiment-tracking/homework.md).

## Q1. Install the package

In your local terminal, run the following commands to create a new conda environment.

```bash
cd ~/github/mlops-zoomcamp-2023/notebooks  # Change this for your folder.
conda create -n mlops-zoomcamp-env python=3.9
conda activate mlops-zoomcamp-env
pip install mlflow jupyter scikit-learn pandas seaborn hyperopt xgboost fastparquet boto3
```

From the same folder, open VS Code from the terminal with this command.

```bash
code .
```

Then, in the notebook, select the `mlops-zoomcamp-env` kernel.


In [35]:
import mlflow

print(mlflow.__version__)

2.3.2


## Q2. Download and preprocess the data

Download the data for January, February and March 2022 in parquet format from [here](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page).
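As an alternative to `wget`, the same three files could be fetched from Python. This is only a sketch: the helper names `build_urls` and `download_all` are hypothetical, and the URL pattern is copied from the `wget` commands in the next cell. Skipping files that already exist avoids the duplicate `.parquet.1` copies that `wget` writes when the cell is run a second time.

```python
from pathlib import Path
from urllib.request import urlretrieve

# URL pattern copied from the wget commands used in this notebook.
BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"


def build_urls(year, months, taxi="green"):
    """Return the parquet URLs for the given year and list of months."""
    return [
        f"{BASE_URL}/{taxi}_tripdata_{year}-{month:02d}.parquet"
        for month in months
    ]


def download_all(urls, dest_dir):
    """Download each URL into dest_dir, skipping files that already exist."""
    dest = Path(dest_dir).expanduser()
    dest.mkdir(parents=True, exist_ok=True)
    for url in urls:
        target = dest / url.rsplit("/", 1)[-1]
        if not target.exists():
            urlretrieve(url, str(target))


urls = build_urls(2022, [1, 2, 3])
# download_all(urls, "~/data")  # uncomment to actually download the files
```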

In [36]:
!wget -P ~/data https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2022-01.parquet
!wget -P ~/data https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2022-02.parquet
!wget -P ~/data https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2022-03.parquet

--2023-05-30 15:14:44--  https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2022-01.parquet
Resolving d37ci6vzurychx.cloudfront.net (d37ci6vzurychx.cloudfront.net)... 13.225.189.130, 13.225.189.87, 13.225.189.178, ...
Connecting to d37ci6vzurychx.cloudfront.net (d37ci6vzurychx.cloudfront.net)|13.225.189.130|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1254291 (1.2M) [binary/octet-stream]
Saving to: '/Users/boisalai/data/green_tripdata_2022-01.parquet.1'


2023-05-30 15:14:44 (9.65 MB/s) - '/Users/boisalai/data/green_tripdata_2022-01.parquet.1' saved [1254291/1254291]

--2023-05-30 15:14:44--  https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2022-02.parquet
Resolving d37ci6vzurychx.cloudfront.net (d37ci6vzurychx.cloudfront.net)... 13.225.189.130, 13.225.189.87, 13.225.189.178, ...
Connecting to d37ci6vzurychx.cloudfront.net (d37ci6vzurychx.cloudfront.net)|13.225.189.130|:443... connected.
HTTP request sent, a

Use the `preprocess_data.py` script, located in the `homework` folder, to preprocess the data.

In [37]:
!python preprocess_data.py --raw_data_path ~/data --dest_path ./output

What's the size of the saved `DictVectorizer` file?

In [38]:
import os

file_name = "./output/dv.pkl"
file_stats = os.stat(file_name)

print(file_stats)
print(f"File size is {file_stats.st_size} bytes, {file_stats.st_size / 1024:.1f} kB, {file_stats.st_size / (1024 * 1024):.3f} MB.")

os.stat_result(st_mode=33188, st_ino=60199724, st_dev=16777231, st_nlink=1, st_uid=501, st_gid=20, st_size=153660, st_atime=1685468066, st_mtime=1685474087, st_ctime=1685474087)
File size is 153660 bytes, 150.1 kB, 0.147 MB.
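The figures above use binary units (1 kB = 1024 bytes). If the answer is expected in decimal units (1 kB = 1000 bytes), the same file size reads slightly differently. The helper below, with the hypothetical name `describe_size`, is a small sketch that shows both conventions for the byte count reported above.

```python
def describe_size(n_bytes, base=1024):
    """Format a byte count in kB and MB for the given unit base (1024 or 1000)."""
    kb = n_bytes / base
    mb = kb / base
    return f"{kb:.1f} kB, {mb:.3f} MB"


size = 153660  # st_size reported by os.stat for ./output/dv.pkl above
print(describe_size(size))             # binary units: 150.1 kB, 0.147 MB
print(describe_size(size, base=1000))  # decimal units: 153.7 kB, 0.154 MB
```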


## Q3. Train a model with autolog

Our task is to modify the script to enable autologging with MLflow, execute the script and then launch the MLflow UI to check that the experiment run was properly tracked.

See the modified `train.py` script.
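For reference, here is a minimal sketch of what enabling autologging might look like. It is not the actual `train.py`: the file names `train.pkl` and `val.pkl` and the function names are assumptions, while the experiment name `random-forest` and the `max_depth=10`, `random_state=0` values are taken from the run output shown later in this notebook.

```python
import os
import pickle


def load_pickle(filename):
    """Load a pickled object from disk."""
    with open(filename, "rb") as f_in:
        return pickle.load(f_in)


def run_train(data_path="./output"):
    """Train a random forest with MLflow autologging enabled."""
    # Imported here so load_pickle stays usable without MLflow installed.
    import mlflow
    import mlflow.sklearn
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error

    mlflow.set_experiment("random-forest")
    mlflow.sklearn.autolog()  # logs params, training metrics and the model

    # Assumed file names written by the preprocessing step.
    X_train, y_train = load_pickle(os.path.join(data_path, "train.pkl"))
    X_val, y_val = load_pickle(os.path.join(data_path, "val.pkl"))

    with mlflow.start_run():
        rf = RandomForestRegressor(max_depth=10, random_state=0)
        rf.fit(X_train, y_train)
        y_pred = rf.predict(X_val)
        rmse = mean_squared_error(y_val, y_pred) ** 0.5
        mlflow.log_metric("rmse", rmse)  # validation RMSE, logged explicitly
    return rmse
```

Autologging only has to be switched on once, before `fit` is called; the hyperparameters listed in the run output below are then recorded without any explicit `log_param` calls.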

Execute the script with the following command.

In [43]:
!python train.py --data_path ./output




Launch the MLflow UI to check that the experiment run was properly tracked.

In [44]:
!mlflow ui

[2023-05-30 15:23:56 -0400] [7264] [INFO] Starting gunicorn 20.1.0
[2023-05-30 15:23:56 -0400] [7264] [INFO] Listening at: http://127.0.0.1:5000 (7264)
[2023-05-30 15:23:56 -0400] [7264] [INFO] Using worker: sync
[2023-05-30 15:23:56 -0400] [7265] [INFO] Booting worker with pid: 7265
[2023-05-30 15:23:56 -0400] [7266] [INFO] Booting worker with pid: 7266
[2023-05-30 15:23:56 -0400] [7267] [INFO] Booting worker with pid: 7267
[2023-05-30 15:23:56 -0400] [7268] [INFO] Booting worker with pid: 7268
^C
[2023-05-30 15:24:30 -0400] [7264] [INFO] Handling signal: int
[2023-05-30 15:24:30 -0400] [7265] [INFO] Worker exiting (pid: 7265)
[2023-05-30 15:24:30 -0400] [7266] [INFO] Worker exiting (pid: 7266)
[2023-05-30 15:24:30 -0400] [7268] [INFO] Worker exiting (pid: 7268)
[2023-05-30 15:24:30 -0400] [7267] [INFO] Worker exiting (pid: 7267)


What is the value of the `max_depth` parameter?

In [40]:
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Retrieve the experiment ID from its name.
experiment_name = "random-forest"
experiment = client.get_experiment_by_name(experiment_name)
experiment_id = experiment.experiment_id

# Retrieve information about the runs in the experiment.
runs = client.search_runs(experiment_ids=[experiment_id])
for run in runs:
    run_id = run.info.run_id
    params = run.data.params  # search_runs already returns each run's parameters
    print(f"Hyperparameters for run {run_id}: {params}")
    max_depth = params.get("max_depth")
    print(f"max_depth for run {run_id}: {max_depth}")

Hyperparameters for run 086e10c841bf40d2803a1c5a78aa1cd9: {'bootstrap': 'True', 'max_depth': '10', 'max_samples': 'None', 'min_weight_fraction_leaf': '0.0', 'max_leaf_nodes': 'None', 'min_samples_leaf': '1', 'random_state': '0', 'min_impurity_decrease': '0.0', 'verbose': '0', 'n_estimators': '100', 'criterion': 'squared_error', 'oob_score': 'False', 'ccp_alpha': '0.0', 'warm_start': 'False', 'max_features': '1.0', 'n_jobs': 'None', 'min_samples_split': '2'}
max_depth for run 086e10c841bf40d2803a1c5a78aa1cd9: 10
Hyperparameters for run 7a84350b10c942918dffdb2d8b3a0756: {'bootstrap': 'True', 'max_depth': '10', 'max_samples': 'None', 'min_weight_fraction_leaf': '0.0', 'max_leaf_nodes': 'None', 'min_samples_leaf': '1', 'random_state': '0', 'min_impurity_decrease': '0.0', 'verbose': '0', 'n_estimators': '100', 'criterion': 'squared_error', 'oob_score': 'False', 'ccp_alpha': '0.0', 'warm_start': 'False', 'max_features': '1.0', 'n_jobs': 'None', 'min_samples_split': '2'}
max_depth for run 7a8

## Launch the tracking server locally for MLflow

Now we want to manage the entire lifecycle of our ML model. In this step, you'll need to launch a tracking server. This way we will also have access to the model registry.

In the case of MLflow, you need to:

* launch the tracking server on your local machine,
* select a SQLite db for the backend store and a folder called artifacts for the artifacts store.

You should keep the tracking server running to work on the next three exercises that use the server.

Run the following commands in the terminal to launch the MLflow UI with a SQLite backend store.

```bash
cd ~/github/mlops-zoomcamp-2023/notebooks
mlflow ui --backend-store-uri sqlite:///mlflow.db
```
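Before running the next exercises, it can help to confirm that the server is reachable. The sketch below uses only the standard library; the function name `server_is_up` is hypothetical, and the address `http://127.0.0.1:5000` is the one printed in the `mlflow ui` log earlier, so adjust it if the server listens elsewhere.

```python
from urllib.request import urlopen


def server_is_up(url="http://127.0.0.1:5000", timeout=2.0):
    """Return True if an HTTP GET on the given URL answers with status 200."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


if server_is_up():
    print("Tracking server is reachable.")
else:
    print("Tracking server is not reachable; start it with the command above.")
```

Once the server is up, scripts can be pointed at it with `mlflow.set_tracking_uri("http://127.0.0.1:5000")`.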

Install Optuna.

In [45]:
!pip install optuna

Collecting optuna
  Downloading optuna-3.2.0-py3-none-any.whl (390 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m390.6/390.6 kB[0m [31m5.2 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
Collecting cmaes>=0.9.1
  Using cached cmaes-0.9.1-py3-none-any.whl (21 kB)
Collecting colorlog
  Using cached colorlog-6.7.0-py2.py3-none-any.whl (11 kB)
Installing collected packages: colorlog, cmaes, optuna
Successfully installed cmaes-0.9.1 colorlog-6.7.0 optuna-3.2.0


## Q4. Tune model hyperparameters

In [47]:
!python hpo.py --data_path ./output

2023/05/30 15:48:09 INFO mlflow.tracking.fluent: Experiment with name 'random-forest-hyperopt' does not exist. Creating a new experiment.
[I 2023-05-30 15:48:09,181] A new study created in memory with name: no-name-bb14162d-fe0d-468a-899a-66f7c3ae0746
[I 2023-05-30 15:48:10,005] Trial 0 finished with value: 2.451379690825458 and parameters: {'n_estimators': 25, 'max_depth': 20, 'min_samples_split': 8, 'min_samples_leaf': 3}. Best is trial 0 with value: 2.451379690825458.
[I 2023-05-30 15:48:10,073] Trial 1 finished with value: 2.4667366020368333 and parameters: {'n_estimators': 16, 'max_depth': 4, 'min_samples_split': 2, 'min_samples_leaf': 4}. Best is trial 1 with value: 2.451379690825458.
[I 2023-05-30 15:48:10,470] Trial 2 finished with value: 2.449827329704216 and parameters: {'n_estimators': 34, 'max_depth': 15, 'min_samples_split': 2, 'min_samples_leaf': 4}. Best is trial 2 with value: 2.449827329704216.
[I 2023-05-30 15:48

What's the best validation RMSE that you got? Lower RMSE values indicate a better fit.
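To make the metric concrete, here is a sketch of how the validation RMSE could be computed and used as an Optuna objective. It is not the actual `hpo.py`: the names `rmse` and `make_objective` are hypothetical and the search ranges are assumptions. The four tuned parameters match the trial log above, and `random_state=42` and `n_jobs=-1` match the run parameters printed in Q5.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def make_objective(X_train, y_train, X_val, y_val):
    """Build an Optuna objective that returns the validation RMSE."""
    # Imported here so rmse stays usable without scikit-learn installed.
    from sklearn.ensemble import RandomForestRegressor

    def objective(trial):
        params = {
            # Search ranges below are assumptions, not taken from hpo.py.
            "n_estimators": trial.suggest_int("n_estimators", 10, 50),
            "max_depth": trial.suggest_int("max_depth", 1, 20),
            "min_samples_split": trial.suggest_int("min_samples_split", 2, 10),
            "min_samples_leaf": trial.suggest_int("min_samples_leaf", 1, 4),
            "random_state": 42,
            "n_jobs": -1,
        }
        model = RandomForestRegressor(**params)
        model.fit(X_train, y_train)
        return rmse(list(y_val), list(model.predict(X_val)))

    return objective


# Usage, assuming the data is already loaded:
# import optuna
# study = optuna.create_study(direction="minimize")
# study.optimize(make_objective(X_train, y_train, X_val, y_val), n_trials=10)
# print(study.best_value)
```

Because the study minimizes the objective, `study.best_value` is the lowest validation RMSE seen across all trials.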

## Q5. Promote the best model to the model registry
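The promotion step can be sketched as follows: find the run with the lowest test RMSE in an experiment, then register its model. This is an illustration rather than the actual `register_model.py`. The function names, the experiment and model names passed in, and the `model` artifact path are assumptions; the `test_rmse` metric name appears in the run output below.

```python
def model_uri_for(run_id, artifact_path="model"):
    """Build the runs:/ URI of a model logged under artifact_path (assumed)."""
    return f"runs:/{run_id}/{artifact_path}"


def register_best_run(experiment_name, model_name, metric="test_rmse"):
    """Register the model of the run with the lowest value of the metric."""
    # Imported here so model_uri_for stays usable without MLflow installed.
    import mlflow
    from mlflow.entities import ViewType
    from mlflow.tracking import MlflowClient

    client = MlflowClient()
    experiment = client.get_experiment_by_name(experiment_name)
    if experiment is None:
        raise ValueError(f"Experiment {experiment_name!r} not found")

    # Sort ascending so the first result has the lowest metric value.
    runs = client.search_runs(
        experiment_ids=[experiment.experiment_id],
        run_view_type=ViewType.ACTIVE_ONLY,
        max_results=1,
        order_by=[f"metrics.{metric} ASC"],
    )
    if not runs:
        raise ValueError(f"No runs found in experiment {experiment_name!r}")

    best_run = runs[0]
    return mlflow.register_model(
        model_uri=model_uri_for(best_run.info.run_id), name=model_name
    )
```

Registering a model requires a tracking server with a database backend, which is why the SQLite-backed server from the previous section has to be running.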

In [55]:
!python register_model.py --data_path ./output 

best_runs=[<Run: data=<RunData: metrics={'test_rmse': 2.2854691906481364,
 'training_mean_absolute_error': 1.4410764513945242,
 'training_mean_squared_error': 3.948112551107436,
 'training_r2_score': 0.26013767483835504,
 'training_root_mean_squared_error': 1.9869857953964936,
 'training_score': 0.26013767483835504,
 'val_rmse': 2.449827329704216}, params={'bootstrap': 'True',
 'ccp_alpha': '0.0',
 'criterion': 'squared_error',
 'max_depth': '15',
 'max_features': '1.0',
 'max_leaf_nodes': 'None',
 'max_samples': 'None',
 'min_impurity_decrease': '0.0',
 'min_samples_leaf': '4',
 'min_samples_split': '2',
 'min_weight_fraction_leaf': '0.0',
 'n_estimators': '34',
 'n_jobs': '-1',
 'oob_score': 'False',
 'random_state': '42',
 'verbose': '0',
 'warm_start': 'False'}, tags={'estimator_class': 'sklearn.ensemble._forest.RandomForestRegressor',
 'estimator_name': 'RandomForestRegressor',
 'mlflow.log-model.history': '[{"run_id": "2261c8d3cebf49009616b49a36c38b2c", '
                        

## Q6. Model metadata

