# Q1. Install MLflow

In [1]:
!mlflow --version

mlflow, version 3.1.1


# Q2. Download and preprocess the data

In [2]:
!mkdir taxi_data
!wget -P taxi_data/ https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2023-01.parquet
!wget -P taxi_data/ https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2023-02.parquet
!wget -P taxi_data/ https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2023-03.parquet

mkdir: cannot create directory ‘taxi_data’: File exists
--2025-07-02 10:54:56--  https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2023-01.parquet
Resolving d37ci6vzurychx.cloudfront.net (d37ci6vzurychx.cloudfront.net)... 18.239.38.147, 18.239.38.181, 18.239.38.83, ...
Connecting to d37ci6vzurychx.cloudfront.net (d37ci6vzurychx.cloudfront.net)|18.239.38.147|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1427002 (1.4M) [binary/octet-stream]
Saving to: ‘taxi_data/green_tripdata_2023-01.parquet.1’


2025-07-02 10:54:56 (125 MB/s) - ‘taxi_data/green_tripdata_2023-01.parquet.1’ saved [1427002/1427002]

--2025-07-02 10:54:56--  https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2023-02.parquet
Resolving d37ci6vzurychx.cloudfront.net (d37ci6vzurychx.cloudfront.net)... 18.239.38.163, 18.239.38.181, 18.239.38.147, ...
Connecting to d37ci6vzurychx.cloudfront.net (d37ci6vzurychx.cloudfront.net)|18.239.38.163|:443... connected.
HTTP request sent,
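
On a rerun, the cell above reports that `taxi_data` already exists, and `wget` saves a second copy with a `.1` suffix. A minimal sketch of a rerun-safe download, using only the Python standard library; the helper names `monthly_files` and `download_missing` are illustrative and not part of the homework scripts:

```python
from pathlib import Path
from urllib.request import urlretrieve

BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"


def monthly_files(year: int = 2023, months=(1, 2, 3)) -> list[str]:
    """Build the green-taxi parquet file names for the given months."""
    return [f"green_tripdata_{year}-{m:02d}.parquet" for m in months]


def download_missing(dest: Path, names: list[str]) -> list[Path]:
    """Download each file into dest unless it is already present."""
    dest.mkdir(parents=True, exist_ok=True)  # no error if the folder exists
    fetched = []
    for name in names:
        target = dest / name
        if target.exists():
            continue  # skip instead of writing a ".1" duplicate
        urlretrieve(f"{BASE_URL}/{name}", target)
        fetched.append(target)
    return fetched
```

Calling `download_missing(Path("taxi_data"), monthly_files())` a second time should return an empty list, because every file is already in place.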

In [3]:
!pwd
!python preprocess_data.py --raw_data_path taxi_data/ --dest_path ./output

/workspaces/mlops-zoomcamp/02-experiment-tracking/homework


In [4]:
!ls output/
!ls -1 output/ | wc -l

dv.pkl	test.pkl  train.pkl  val.pkl
4


# Q3. Train a model with autolog

In [5]:
!python train.py

2025/07/02 10:55:07 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.


- Launch the MLflow UI from the homework directory: `mlflow ui`
- Open the run in the experiment and look up the `min_samples_split` parameter
- `min_samples_split`: 2

# Q4. Launch the tracking server locally

Start the server with a SQLite backend store and a local artifact directory: `mlflow server --backend-store-uri sqlite:///my.db --default-artifact-root artifacts/`
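
Before running the later scripts it can help to confirm that the server is reachable. A small sketch using only the standard library, assuming the server listens on `http://127.0.0.1:5000` (the address that appears in the run URLs in this notebook) and answers on MLflow's `/health` endpoint; `server_is_up` is a hypothetical helper name:

```python
from urllib.request import urlopen


def server_is_up(base_url: str = "http://127.0.0.1:5000", timeout: float = 2.0) -> bool:
    """Return True if the tracking server answers its /health endpoint with 200."""
    try:
        with urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError, timeouts and refused connections are all OSError subclasses
        return False
```

Once the check passes, scripts can be pointed at the server with `mlflow.set_tracking_uri("http://127.0.0.1:5000")`.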

# Q5. Tune model hyperparameters

- Add the following logging code to the objective function in `hpo.py`:
```
with mlflow.start_run():
    mlflow.log_params(params)
    # [...]
    mlflow.log_metric("val_rmse", rmse)
```

In [9]:
!python hpo.py

  import pkg_resources
2025/07/02 11:02:54 INFO mlflow.tracking.fluent: Experiment with name 'random-forest-hyperopt' does not exist. Creating a new experiment.
🏃 View run selective-bat-982 at: http://127.0.0.1:5000/#/experiments/1/runs/9a2b011a012640b89063d6cbe2501326

🧪 View experiment at: http://127.0.0.1:5000/#/experiments/1                    

🏃 View run brawny-snail-976 at: http://127.0.0.1:5000/#/experiments/1/runs/336b514043f847a78982e91ca81e952d

🧪 View experiment at: http://127.0.0.1:5000/#/experiments/1                    

🏃 View run redolent-newt-167 at: http://127.0.0.1:5000/#/experiments/1/runs/fe613dd8e26446b5a7827b920708a0d8

🧪 View experiment at: http://127.0.0.1:5000/#/experiments/1                    

🏃 View run shivering-carp-128 at: http://127.0.0.1:5000/#/experiments/1/runs/4c89d125ccaf49849dd65fe67114f451

🧪 View experiment at: http://127.0.0.1:5000/#/experiments/1                    

🏃 View run likeable-moth-260 at: http://127.0.0.1:5000/#/experiments/1/runs

- best validation RMSE: 5.335419588556921
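
The best value can also be read from the tracking server instead of the UI. A hedged sketch: `best_run_metric` is a hypothetical helper, `client` is expected to behave like `mlflow.tracking.MlflowClient`, and the metric key `val_rmse` and experiment name `random-forest-hyperopt` are taken from the snippet and log output above:

```python
def best_run_metric(client, experiment_name: str, metric: str = "val_rmse"):
    """Return (run_id, value) for the run with the lowest value of `metric`.

    Returns None if the experiment does not exist or contains no runs.
    """
    experiment = client.get_experiment_by_name(experiment_name)
    if experiment is None:
        return None
    runs = client.search_runs(
        experiment_ids=[experiment.experiment_id],
        order_by=[f"metrics.{metric} ASC"],  # smallest value first
        max_results=1,
    )
    if not runs:
        return None
    best = runs[0]
    return best.info.run_id, best.data.metrics.get(metric)
```

With the server from Q4 running, a call might look like `best_run_metric(MlflowClient("http://127.0.0.1:5000"), "random-forest-hyperopt")`, with `MlflowClient` imported from `mlflow.tracking`.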

# Q6. Promote the best model to the model registry

- Modify the code in `register_model.py` as follows:
```
# Select the model with the lowest test RMSE
experiment = client.get_experiment_by_name(EXPERIMENT_NAME)
best_run = client.search_runs(
    experiment_ids=[experiment.experiment_id],  # public attribute, passed as a list
    run_view_type=ViewType.ACTIVE_ONLY,
    max_results=1,
    # Assumes the metric was logged under the key "test_rmse"; adjust if yours differs
    order_by=["metrics.test_rmse ASC"],
)[0]

# Register the best model
mlflow.register_model(
    model_uri=f"runs:/{best_run.info.run_id}/model",
    name="duration_estimator",
)
```
- Note: when running this in a GitHub Codespace, the first trial completed, but in later trials the script was "Terminated" inside the `fit()` call. This is most likely because the Codespace ran out of memory.

In [1]:
!python3 register_model.py

{'max_depth': '20', 'min_samples_leaf': '1', 'min_samples_split': '9', 'n_estimators': '19', 'random_state': '42'}
{'max_depth': 20, 'n_estimators': 19, 'min_samples_split': 9, 'min_samples_leaf': 1, 'random_state': 42}
🏃 View run silent-conch-870 at: http://127.0.0.1:5000/#/experiments/2/runs/dfbb7175407c4895acd8a95067b1d3f5
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/2
{'max_depth': '18', 'min_samples_leaf': '1', 'min_samples_split': '6', 'n_estimators': '13', 'random_state': '42'}
{'max_depth': 18, 'n_estimators': 13, 'min_samples_split': 6, 'min_samples_leaf': 1, 'random_state': 42}
🏃 View run peaceful-duck-3 at: http://127.0.0.1:5000/#/experiments/2/runs/ed12accd623443e0bb70a80aa9665836
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/2
{'max_depth': '5', 'min_samples_leaf': '3', 'min_samples_split': '8', 'n_estimators': '21', 'random_state': '42'}
{'max_depth': 5, 'n_estimators': 21, 'min_samples_split': 8, 'min_samples_leaf': 3, 'random_state': 42}
🏃 View 

In [2]:
!python3 register_model.py

{'max_depth': '20', 'min_samples_leaf': '1', 'min_samples_split': '9', 'n_estimators': '19', 'random_state': '42'}
{'max_depth': 20, 'n_estimators': 19, 'min_samples_split': 9, 'min_samples_leaf': 1, 'random_state': 42}
🏃 View run resilient-sloth-802 at: http://127.0.0.1:5000/#/experiments/2/runs/bfd3c49e4b834de390523ce7e285f487
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/2
{'max_depth': '18', 'min_samples_leaf': '1', 'min_samples_split': '6', 'n_estimators': '13', 'random_state': '42'}
{'max_depth': 18, 'n_estimators': 13, 'min_samples_split': 6, 'min_samples_leaf': 1, 'random_state': 42}
🏃 View run selective-fowl-900 at: http://127.0.0.1:5000/#/experiments/2/runs/1e8a5af9c41a4f88b6b39fddd827c82c
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/2
{'max_depth': '5', 'min_samples_leaf': '3', 'min_samples_split': '8', 'n_estimators': '21', 'random_state': '42'}
{'max_depth': 5, 'n_estimators': 21, 'min_samples_split': 8, 'min_samples_leaf': 3, 'random_state': 42}
🏃

- best test RMSE: 5.567408012462019
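
Because `register_model.py` was run twice above, the registry likely holds more than one version of the model. A small sketch for listing them, assuming `client` behaves like `mlflow.tracking.MlflowClient`; `registered_versions` is a hypothetical helper name:

```python
def registered_versions(client, name: str = "duration_estimator"):
    """Return sorted (version, run_id) pairs for every version of a registered model."""
    versions = client.search_model_versions(f"name='{name}'")
    # Cast the version to int so the pairs sort numerically
    return sorted((int(v.version), v.run_id) for v in versions)
```

Each pair links a registry version to the run that produced it, which makes it easy to check that the registered run is the one with the best test RMSE.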

# Deleting experiments
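- A deleted experiment is only soft-deleted (its `lifecycle_stage` becomes `deleted`), so a new experiment with the same name cannot be created until the old one is removed from the "trash".
- If that proves difficult, the row can be removed directly from the backend database with `sqlite3`, entering the commands one line at a time: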
``` 
sqlite3 mlflow.db  # open the backend database (the file passed to --backend-store-uri)
SELECT experiment_id, name, lifecycle_stage FROM experiments WHERE name = 'random-forest-best-models';
DELETE FROM experiments WHERE name = 'random-forest-best-models';
.quit
```
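
The same cleanup can be scripted with Python's built-in `sqlite3` module. This is a sketch under assumptions, not the method used above: `purge_deleted_experiment` is a hypothetical helper, and unlike the manual commands it only removes rows whose `lifecycle_stage` is `deleted`, so an active experiment is never dropped by accident. Like the manual commands, it touches only the `experiments` table, so backing up the database file first is sensible.

```python
import sqlite3


def purge_deleted_experiment(db_path: str, name: str) -> int:
    """Remove a soft-deleted experiment row by name; return the number of rows removed."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            rows = conn.execute(
                "SELECT experiment_id, name, lifecycle_stage "
                "FROM experiments WHERE name = ?",
                (name,),
            ).fetchall()
            print(rows)  # show what matched before deleting anything
            cur = conn.execute(
                "DELETE FROM experiments "
                "WHERE name = ? AND lifecycle_stage = 'deleted'",
                (name,),
            )
            return cur.rowcount
    finally:
        conn.close()
```

For the case above, the call would be `purge_deleted_experiment("mlflow.db", "random-forest-best-models")`.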

In [3]:
#import mlflow
#from mlflow.entities import ViewType
#from mlflow.tracking import MlflowClient

#mlflow.set_tracking_uri("http://127.0.0.1:5000")

#experiments = mlflow.search_experiments(view_type=ViewType.ALL)
#print([e.name for e in experiments])
#client = MlflowClient()
#experiment = client.get_experiment_by_name("random-forest-best-models")
#print(experiment)

#if experiment.lifecycle_stage == "deleted":
#    client.delete_experiment(experiment.experiment_id)

In [16]:
!pwd

/workspaces/mlops-zoomcamp/02-experiment-tracking/homework
