# Homework with Weights & Biases

The goal of this homework is to get familiar with Weights & Biases for experiment tracking, model management, hyperparameter optimization, and more.

## Q1. Install the Package

Once you have installed the package, run the command wandb --version and check the output.  
What's the version that you have?

In [1]:
!wandb --version

wandb, version 0.15.4


Answer: The version is: wandb, version 0.15.4

## Q2. Download and preprocess the data
We'll use the Green Taxi Trip Records dataset to predict the amount of tips for each trip.  
Download the data for January, February and March 2022 in parquet format.  
  
Once you navigate to the Files tab of your artifact on your Weights & Biases page, what's the size of the saved DictVectorizer file?

* 54 kB
* 154 kB
* 54 MB
* 154 MB

In [3]:
!python preprocess_data.py \
  --wandb_project mlops-wandb_experiment-tracking \
  --wandb_entity chweber \
  --raw_data_path data/ \
  --dest_path ./output

[34m[1mwandb[0m: Currently logged in as: [33mchweber[0m. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Tracking run with wandb version 0.15.4
[34m[1mwandb[0m: Run data is saved locally in [35m[1m/Users/toph/Documents/Private/GitHub/MLOps-Zoomcamp/wandb/run-20230609_002304-j60xrglx[0m
[34m[1mwandb[0m: Run [1m`wandb offline`[0m to turn off syncing.
[34m[1mwandb[0m: Syncing run [33mpeachy-spaceship-2[0m
[34m[1mwandb[0m: ⭐️ View project at [34m[4mhttps://wandb.ai/chweber/mlops-wandb_experiment-tracking[0m
[34m[1mwandb[0m: 🚀 View run at [34m[4mhttps://wandb.ai/chweber/mlops-wandb_experiment-tracking/runs/j60xrglx[0m
[34m[1mwandb[0m: Adding directory to artifact (./output)... Done. 0.0s
[34m[1mwandb[0m: Waiting for W&B process to finish... [32m(success).[0m
[34m[1mwandb[0m: 🚀 View run [33mpeachy-spaceship-2[0m at: [34m[4mhttps://wandb.ai/chweber/mlops-wandb_experiment-tracking/runs/j60xrglx[0m
[34m[1mwandb[0m: Syn

In [5]:
!ls -l output/

total 17320
-rw-r--r--  1 toph  staff   153660  9 Jun 00:23 dv.pkl
-rw-r--r--  1 toph  staff  2632817  9 Jun 00:23 test.pkl
-rw-r--r--  1 toph  staff  2146163  9 Jun 00:23 train.pkl
-rw-r--r--  1 toph  staff  2336393  9 Jun 00:23 val.pkl


Answer: The DictVectorizer file (dv.pkl) is around 154 kB (153,660 bytes).
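
As a quick check of the unit conversion, the byte count from the `ls -l` listing can be turned into a human-readable size. This is a minimal sketch, assuming decimal (SI) units where 1 kB = 1000 bytes; the helper name `human_size` is hypothetical and not part of the homework scripts.

```python
def human_size(num_bytes: int) -> str:
    """Format a byte count using decimal (SI) units, where 1 kB = 1000 bytes."""
    units = ["B", "kB", "MB", "GB"]
    size = float(num_bytes)
    for unit in units:
        # Stop at the first unit in which the value is below 1000,
        # or at the largest unit we know about.
        if size < 1000 or unit == units[-1]:
            return f"{size:.0f} {unit}"
        size /= 1000


# 153660 is the size of dv.pkl reported by `ls -l output/`.
print(human_size(153660))
```

With binary units (1 KiB = 1024 bytes) the same file would come out at roughly 150 KiB, so the answer option only matches under the decimal convention assumed here.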

## Q3. Train a model with Weights & Biases logging
We will train a RandomForestRegressor (from Scikit-Learn) on the taxi dataset.  
  
Once you have successfully run the script, navigate to the Overview section of the run in the Weights & Biases UI and scroll down to the Configs. What is the value of the max_depth parameter?

* 4
* 6
* 8
* 10

In [7]:
!python train.py \
  --wandb_project mlops-wandb_experiment-tracking \
  --wandb_entity chweber \
  --data_artifact "chweber/mlops-wandb_experiment-tracking/NYC-Taxi:v0"

[34m[1mwandb[0m: Currently logged in as: [33mchweber[0m. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Tracking run with wandb version 0.15.4
[34m[1mwandb[0m: Run data is saved locally in [35m[1m/Users/toph/Documents/Private/GitHub/MLOps-Zoomcamp/wandb/run-20230609_011851-av13eibc[0m
[34m[1mwandb[0m: Run [1m`wandb offline`[0m to turn off syncing.
[34m[1mwandb[0m: Syncing run [33mlucky-glitter-4[0m
[34m[1mwandb[0m: ⭐️ View project at [34m[4mhttps://wandb.ai/chweber/mlops-wandb_experiment-tracking[0m
[34m[1mwandb[0m: 🚀 View run at [34m[4mhttps://wandb.ai/chweber/mlops-wandb_experiment-tracking/runs/av13eibc[0m
[34m[1mwandb[0m:   4 of 4 files downloaded.  
[34m[1mwandb[0m: Waiting for W&B process to finish... [32m(success).[0m
[34m[1mwandb[0m: / 0.007 MB of 0.007 MB uploaded (0.000 MB deduped)
[34m[1mwandb[0m: Run history:
[34m[1mwandb[0m: MSE ▁
[34m[1mwandb[0m: 
[34m[1mwandb[0m: Run summary:
[34m[1mwandb

Answer: The value of the max_depth parameter is 10.
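
The `--data_artifact` value passed to train.py appears to follow an `entity/project/name:alias` pattern. The sketch below splits such a reference into its parts; the `ArtifactRef` type and `parse_artifact_ref` helper are hypothetical names for illustration, and falling back to `latest` when no alias is given is an assumption rather than something shown in this notebook.

```python
from typing import NamedTuple


class ArtifactRef(NamedTuple):
    entity: str   # W&B user or team name
    project: str  # project the artifact was logged to
    name: str     # artifact name
    alias: str    # version or alias, e.g. "v0"


def parse_artifact_ref(ref: str) -> ArtifactRef:
    """Split 'entity/project/name:alias' into its four components."""
    path, _, alias = ref.partition(":")
    entity, project, name = path.split("/")
    # Assumption: an omitted alias is treated as "latest".
    return ArtifactRef(entity, project, name, alias or "latest")


ref = parse_artifact_ref("chweber/mlops-wandb_experiment-tracking/NYC-Taxi:v0")
print(ref)
```

Here `v0` identifies the first version of the `NYC-Taxi` artifact, which is the one produced by the preprocessing run in Q2.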

## Q4. Tune model hyperparameters
Now let's try to reduce the validation error by tuning the hyperparameters of the RandomForestRegressor using Weights & Biases Sweeps.  
  
Open the sweep on your Weights & Biases dashboard and use the Parameter Importance Panel and the Parallel Coordinates Plot to determine which hyperparameter is the most important:

* max_depth
* n_estimators
* min_samples_split
* min_samples_leaf

In [8]:
!python sweep.py \
  --wandb_project mlops-wandb_experiment-tracking \
  --wandb_entity chweber \
  --data_artifact "chweber/mlops-wandb_experiment-tracking/NYC-Taxi:v0"

Create sweep with ID: me93ahh7
Sweep URL: https://wandb.ai/chweber/mlops-wandb_experiment-tracking/sweeps/me93ahh7
[34m[1mwandb[0m: Agent Starting Run: v5dm6vos with config:
[34m[1mwandb[0m: 	max_depth: 20
[34m[1mwandb[0m: 	min_samples_leaf: 4
[34m[1mwandb[0m: 	min_samples_split: 10
[34m[1mwandb[0m: 	n_estimators: 50
[34m[1mwandb[0m: Currently logged in as: [33mchweber[0m. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Tracking run with wandb version 0.15.4
[34m[1mwandb[0m: Run data is saved locally in [35m[1m/Users/toph/Documents/Private/GitHub/MLOps-Zoomcamp/wandb/run-20230609_014441-v5dm6vos[0m
[34m[1mwandb[0m: Run [1m`wandb offline`[0m to turn off syncing.
[34m[1mwandb[0m: Syncing run [33msnowy-sweep-1[0m
[34m[1mwandb[0m: ⭐️ View project at [34m[4mhttps://wandb.ai/chweber/mlops-wandb_experiment-tracking[0m
[34m[1mwandb[0m: 🧹 View sweep at [34m[4mhttps://wandb.ai/chweber/mlops-wandb_experiment-tracking/sweeps/m

Answer: The most important hyperparameter is **max_depth**.
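
For orientation, a sweep over these four hyperparameters could be described by a configuration dictionary like the one below. This is only a sketch: the search method and all ranges are assumptions chosen to be consistent with the agent log above, not values taken from sweep.py. The metric name `MSE` matches the metric shown in the Q3 run history.

```python
# Hypothetical sweep configuration. The method and the min/max ranges are
# assumptions for illustration, not the contents of the homework's sweep.py.
SWEEP_CONFIG = {
    "method": "bayes",
    "metric": {"name": "MSE", "goal": "minimize"},
    "parameters": {
        "max_depth": {"distribution": "int_uniform", "min": 1, "max": 20},
        "n_estimators": {"distribution": "int_uniform", "min": 10, "max": 50},
        "min_samples_split": {"distribution": "int_uniform", "min": 2, "max": 10},
        "min_samples_leaf": {"distribution": "int_uniform", "min": 1, "max": 4},
    },
}

# Hyperparameters of the first sweep run, copied from the agent log above.
first_run = {
    "max_depth": 20,
    "min_samples_leaf": 4,
    "min_samples_split": 10,
    "n_estimators": 50,
}


def in_search_space(config: dict, sweep_config: dict) -> bool:
    """Check that every sampled value lies inside its declared range."""
    params = sweep_config["parameters"]
    return all(params[k]["min"] <= v <= params[k]["max"] for k, v in config.items())


print(in_search_space(first_run, SWEEP_CONFIG))
```

A dictionary of this shape is what would typically be handed to `wandb.sweep` to create the sweep, with `wandb.agent` then launching the individual runs.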

## Q5. Link the best model to the model registry

Now that we have obtained the optimal set of hyperparameters and trained the best model, we can assume that we are ready to test some of these models in production. In this exercise, you'll create a model registry and link the best model from the Sweep to the model registry.  

Now that the model artifact is linked to the Registered Model, which of these pieces of information do we see on the Registered Model UI?

* Versioning
* Metadata
* Aliases
* Metric (MSE)
* Source run
* All of these
* None of these

Answer: On the Registered Model UI we see 1) Versioning, 2) Aliases, 3) Metric (MSE), and 4) Source run.  
Metadata can be accessed via ***View version details*** and the ***Metadata tab***.