# Homework with Weights & Biases

The goal of this homework is to get familiar with Weights & Biases for experiment tracking, model management, hyperparameter optimization, and more.

See [homework](https://github.com/DataTalksClub/mlops-zoomcamp/blob/main/cohorts/2023/02-experiment-tracking/wandb.md).

## Q1. Install the package

On your local terminal, run the following commands to create a new conda environment.

```bash
cd ~/github/mlops-zoomcamp-2023/notebooks  # Change this for your folder.
conda create -n mlops-zoomcamp-env python=3.9
conda activate mlops-zoomcamp-env
pip install mlflow jupyter scikit-learn pandas seaborn hyperopt xgboost fastparquet boto3
pip install wandb matplotlib pyarrow
```

From the same folder, open VS Code from the terminal with this command.

```bash
code .
```

Then, in the VS Code notebook, select the `mlops-zoomcamp-env` kernel.


In [3]:
import wandb

print(wandb.__version__)

0.15.3


## Q2. Download and preprocess the data

Download the data for January, February and March 2022 in parquet format from [here](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page).

In [None]:
!wget -P ~/data https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2022-01.parquet
!wget -P ~/data https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2022-02.parquet
!wget -P ~/data https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2022-03.parquet
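If `wget` is not available, the same three files can be fetched from Python. The following is only a sketch: the helper names `trip_data_urls` and `download_trip_data` are made up for this example, and the URL pattern is simply inferred from the three `wget` commands above.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Base URL taken from the wget commands above.
BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"


def trip_data_urls(year=2022, months=(1, 2, 3), color="green"):
    """Build the parquet URLs for the requested months (hypothetical helper)."""
    return [
        f"{BASE_URL}/{color}_tripdata_{year}-{month:02d}.parquet"
        for month in months
    ]


def download_trip_data(dest_folder="~/data", urls=None):
    """Download each file into dest_folder, skipping files already present."""
    dest = Path(dest_folder).expanduser()
    dest.mkdir(parents=True, exist_ok=True)
    paths = []
    for url in urls or trip_data_urls():
        target = dest / url.rsplit("/", 1)[-1]
        if not target.exists():
            urlretrieve(url, str(target))
        paths.append(target)
    return paths
```

Calling `download_trip_data()` would place the files in `~/data`, the same folder used by the `wget -P ~/data` commands.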

In [13]:
!ls ~/data

green_tripdata_2022-01.parquet green_tripdata_2022-03.parquet
green_tripdata_2022-02.parquet


The next script will:

* initialize a Weights & Biases run,
* load the data from the folder `TAXI_DATA_FOLDER` (the folder where you have downloaded the data),
* fit a DictVectorizer on the training set (January 2022 data),
* save the preprocessed datasets and the DictVectorizer to your Weights & Biases dashboard as an artifact of type preprocessed_dataset.
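The last step of the list above can be sketched roughly as follows. This is not the content of `preprocess_data.py`, only a minimal illustration under stated assumptions: the function name `log_preprocessed_dataset`, the `wandb_module` parameter, and the `job_type` value are hypothetical, while the artifact name `NYC-Taxi` is taken from the `--data_artifact` argument used later in this homework.

```python
def log_preprocessed_dataset(dest_path, project, entity, wandb_module=None):
    """Log every file under dest_path as one `preprocessed_dataset` artifact."""
    if wandb_module is None:
        # Imported lazily so the sketch can be loaded without wandb installed.
        import wandb as wandb_module

    # job_type is an assumption made for this example.
    run = wandb_module.init(project=project, entity=entity, job_type="preprocess")

    artifact = wandb_module.Artifact("NYC-Taxi", type="preprocessed_dataset")
    artifact.add_dir(dest_path)  # e.g. dv.pkl and the preprocessed datasets
    run.log_artifact(artifact)
    run.finish()
    return artifact
```

With the variables defined in the next cell, a call would look like `log_preprocessed_dataset(DEST_PATH, WANDB_PROJECT_NAME, WANDB_USERNAME)`.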

In [None]:
import os 

WANDB_PROJECT_NAME="homework-wandb"
WANDB_USERNAME="boisalai"
TAXI_DATA_FOLDER="~/data"
DEST_PATH="output"

# First, set your key in the terminal with `export WANDB_KEY=XXXXXXXXXXXXXXXXX`.
# See https://docs.wandb.ai/guides/track/environment-variables
# Never hard-code secret API keys in a notebook or other shared code; read them from the environment instead.
WANDB_KEY = os.environ.get('WANDB_KEY')
%env WANDB_API_KEY=$WANDB_KEY
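Outside a notebook the `%env` magic is not available. A plain-Python equivalent might look like the sketch below; the helper name `export_wandb_api_key` is hypothetical, and `WANDB_KEY` is just the custom variable name chosen in this notebook, whereas `WANDB_API_KEY` is the variable that wandb itself reads.

```python
import os


def export_wandb_api_key(source_var="WANDB_KEY", env=None):
    """Copy the key from `source_var` into WANDB_API_KEY, failing early if unset."""
    env = os.environ if env is None else env
    key = env.get(source_var)
    if not key:
        raise RuntimeError(
            f"Set {source_var} in your shell before starting Python."
        )
    env["WANDB_API_KEY"] = key  # never print or log the key itself
```

Failing early here gives a clearer error than letting `wandb.login()` prompt for a key in a non-interactive session.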

In [3]:
%mkdir output

In [6]:
import wandb

wandb.login()

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: boisalai. Use `wandb login --relogin` to force relogin


True

In [7]:
!python preprocess_data.py \
  --wandb_project $WANDB_PROJECT_NAME \
  --wandb_entity $WANDB_USERNAME \
  --raw_data_path $TAXI_DATA_FOLDER \
  --dest_path $DEST_PATH

wandb: Currently logged in as: boisalai. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.15.3
wandb: Run data is saved locally in /Users/boisalai/GitHub/mlops-zoomcamp-2023/notebooks/wandb/wandb/run-20230601_111716-e69nhslm
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run logical-silence-4
wandb: ⭐️ View project at https://wandb.ai/boisalai/homework-wandb
wandb: 🚀 View run at https://wandb.ai/boisalai/homework-wandb/runs/e69nhslm
wandb: Adding directory to artifact (./output)... Done. 0.0s


What's the size of the saved `DictVectorizer` file?

In [8]:
import os

file_name = "./output/dv.pkl"
file_stats = os.stat(file_name)

print(file_stats)
print(f"File size is {file_stats.st_size} bytes, {file_stats.st_size / 1024:.1f} kB, {file_stats.st_size / (1024 * 1024):.3f} MB.")

os.stat_result(st_mode=33188, st_ino=60494609, st_dev=16777231, st_nlink=1, st_uid=501, st_gid=20, st_size=153660, st_atime=1685632639, st_mtime=1685632639, st_ctime=1685632639)
File size is 153660 bytes, 150.1 kB, 0.147 MB.


## Q3. Train a model with Weights & Biases logging

We will train a `RandomForestRegressor` (from Scikit-Learn) on the taxi dataset.

We have prepared the training script `train.py` for this exercise, which can also be found in the folder `homework-wandb`.

The script will:

* initialize a Weights & Biases run,
* load the preprocessed datasets by fetching them from the Weights & Biases artifact previously created,
* train the model on the training set,
* calculate the MSE score on the validation set and log it to Weights & Biases,
* save the trained model and log it to Weights & Biases as a model artifact.
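The metric step in the list above can be illustrated with a small sketch. It is not the actual `train.py`: the helper names are hypothetical, the MSE is computed in plain Python on ordinary sequences instead of with Scikit-Learn, and `run` is assumed to be the object returned by `wandb.init()` (any object with a `log` method works here). The key `MSE` matches the name shown in the run summary.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def log_validation_mse(run, y_true, y_pred):
    """Compute the validation MSE and log it to the given run."""
    mse = mean_squared_error(y_true, y_pred)
    run.log({"MSE": mse})  # the run summary reports the metric under this key
    return mse
```

For example, targets `[1.0, 2.0, 3.0]` and predictions `[1.0, 2.0, 5.0]` differ only in the last element, so the MSE is `(0 + 0 + 4) / 3`.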

Your task is to modify the script to add Weights & Biases logging, execute the script, and then check the Weights & Biases run UI to confirm that the experiment run was properly tracked.

See the modified `train.py` script.

Execute the script with the following command.

In [11]:
!python train.py \
    --wandb_project $WANDB_PROJECT_NAME \
    --wandb_entity $WANDB_USERNAME \
    --data_artifact "$WANDB_USERNAME/$WANDB_PROJECT_NAME/NYC-Taxi:v0"


wandb: Currently logged in as: boisalai. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.15.3
wandb: Run data is saved locally in /Users/boisalai/GitHub/mlops-zoomcamp-2023/notebooks/wandb/wandb/run-20230601_112801-a1c8npr7
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run azure-valley-6
wandb: ⭐️ View project at https://wandb.ai/boisalai/homework-wandb
wandb: 🚀 View run at https://wandb.ai/boisalai/homework-wandb/runs/a1c8npr7
wandb:   4 of 4 files downloaded.  
wandb: Waiting for W&B process to finish... (success).
wandb: 
wandb: Run history:
wandb: MSE ▁
wandb: 
wandb: Run summary:
wandb: MSE 2.45398
wandb: 
wandb: 🚀 View run

<img src="images/screen1.png">

## Q4. Tune model hyperparameters

Now let's try to reduce the validation error by tuning the hyperparameters of the `RandomForestRegressor` using [Weights & Biases Sweeps](https://docs.wandb.ai/guides/sweeps). We have prepared the script `sweep.py` for this exercise in the `homework-wandb` directory.

Your task is to modify `sweep.py` to pass the parameters `n_estimators`, `min_samples_split` and `min_samples_leaf` from `config` to `RandomForestRegressor` inside the `run_train()` function. Then we will run the sweep, not only to find the best set of hyperparameters for training our model but also to analyze the trends across the different hyperparameters. We can run the sweep using:

```bash
python sweep.py \
  --wandb_project <WANDB_PROJECT_NAME> \
  --wandb_entity <WANDB_USERNAME> \
  --data_artifact "<WANDB_USERNAME>/<WANDB_PROJECT_NAME>/NYC-Taxi:v0"
```

This command will run the sweep for 5 iterations using the **Bayesian Optimization and HyperBand** method proposed by the paper [BOHB: Robust and Efficient Hyperparameter Optimization at Scale](https://arxiv.org/abs/1807.01774). You can follow the sweep on your Weights & Biases dashboard and use the **Parameter Importance Panel** and the **Parallel Coordinates Plot** to determine which hyperparameter is the most important:

* `max_depth`
* `n_estimators`
* `min_samples_split`
* `min_samples_leaf`
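The change requested in `run_train()` amounts to reading these four values from the sweep's `config` and forwarding them to the model. The sketch below shows one way to do that; it is not the actual `sweep.py`, and the helper name `rf_params_from_config` is made up for this example. It only relies on `config` supporting key access, which holds for a plain `dict` and for `wandb.config`.

```python
# The four hyperparameters listed above.
RF_PARAM_NAMES = ("max_depth", "n_estimators", "min_samples_split", "min_samples_leaf")


def rf_params_from_config(config):
    """Collect the RandomForestRegressor keyword arguments from a sweep config."""
    return {name: int(config[name]) for name in RF_PARAM_NAMES}


# Inside run_train(), the result could then be forwarded to the model, e.g.:
#   rf = RandomForestRegressor(**rf_params_from_config(wandb.config))
```

With the configuration of the first sweep run in the output below (`max_depth: 3`, `min_samples_leaf: 2`, `min_samples_split: 4`, `n_estimators: 24`), the helper returns exactly those four key/value pairs as a dictionary.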

In [12]:
!python sweep.py \
    --wandb_project $WANDB_PROJECT_NAME \
    --wandb_entity $WANDB_USERNAME \
    --data_artifact "$WANDB_USERNAME/$WANDB_PROJECT_NAME/NYC-Taxi:v0"

Create sweep with ID: bnzckj1c
Sweep URL: https://wandb.ai/boisalai/homework-wandb/sweeps/bnzckj1c
wandb: Agent Starting Run: drey5az0 with config:
wandb: 	max_depth: 3
wandb: 	min_samples_leaf: 2
wandb: 	min_samples_split: 4
wandb: 	n_estimators: 24


wandb: Waiting for W&B process to finish... (success).
wandb: Waiting for W&B process to finish... (success).


wandb: Currently logged in as: boisalai. Use `wandb login --relogin` to force relogin


wandb: - 0.001 MB of 0.001 MB uploaded (0.000 MB deduped)

wandb: Tracking run with wandb version 0.15.3
wandb: Run data is saved locally in /Users/boisalai/GitHub/mlops-zoomcamp-2023/notebooks/wandb/wandb/run-20230601_113848-drey5az0
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run distinctive-sweep-1
wandb: ⭐️ View project at https://wandb.ai/boisalai/homework-wandb
wandb: 🧹 View sweep at https://wandb.ai/boisalai/homework-wandb/sweeps/bnzckj1c
wandb: 🚀 View run at https://wandb.ai/boisalai/homework-wandb/runs/drey5az0


wandb: \ 0.001 MB of 0.005 MB uploaded (0.000 MB deduped)

wandb:   4 of 4 files downloaded.  
wandb: Waiting for W&B process to finish... (success).


wandb:                                                                                
wandb: 🚀 View run logical-silence-4 at: https://wandb.ai/boisalai/homework-wandb/runs/e69nhslm
wandb: Synced 6 W&B file(s), 0 media file(s), 2 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20230601_111716-e69nhslm/logs


wandb: - 0.029 MB of 0.029 MB uploaded (0.000 MB deduped)

wandb: 
wandb: Run history:
wandb: MSE ▁
wandb: 
wandb: Run summary:
wandb: MSE 2.45398
wandb: 
wandb: 🚀 View run toasty-aardvark-5 at: https://wandb.ai/boisalai/homework-wandb/runs/w73wtjzr
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20230601_112630-w73wtjzr/logs


wandb: / 0.029 MB of 0.029 MB uploaded (0.000 MB deduped)
wandb: Run history:
wandb: MSE ▁
wandb: 
wandb: Run summary:
wandb: MSE 2.47292
wandb: 
wandb: 🚀 View run distinctive-sweep-1 at: https://wandb.ai/boisalai/homework-wandb/runs/drey5az0
wandb: Synced 6 W&B file(s), 0 media file(s), 1 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20230601_113848-drey5az0/logs
wandb: Sweep Agent: Waiting for job.
wandb: Job received.
wandb: Agent Starting Run: zpc73dwe with config:
wandb: 	max_depth: 2
wandb: 	min_samples_leaf: 3
wandb: 	min_samples_split: 7
wandb: 	n_estimators: 32
wandb: Tracking run with wandb version 0.15.3
wandb: Run data is saved locally in /Users/boisalai/GitHub/mlo

<img src="images/screen2.png">

## Q5. Link the best model to the model registry

Now that we have obtained the optimal set of hyperparameters and trained the best model, we can assume that we are ready to test some of these models in production. In this exercise, you'll create a model registry and link the best model from the Sweep to the model registry.

First, you will need to create a Registered Model to hold all the candidate models for your particular modeling task. You can refer to [this section](https://docs.wandb.ai/guides/models/walkthrough#1-create-a-new-registered-model) of the official docs to learn how to create a registered model using the Weights & Biases UI.

Once you have created the Registered Model successfully, you can navigate to the best run of your sweep, navigate to the model artifact created by the particular run, and click on the Link to Registry option from the UI. This would link the model artifact to the Registered Model. You can choose to add some suitable aliases for the Registered Model, such as `production`, `best`, etc.
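The homework does this step through the UI, but linking can probably also be scripted. The sketch below is an assumption-heavy illustration: `link_best_model` is a hypothetical helper, `run` is assumed to be an active run returned by `wandb.init()`, `model_artifact` an artifact already logged by the best sweep run, and the `entity/model-registry/<name>` target path reflects the registry layout as I understand it for the wandb version used here (0.15.3). Check the current documentation before relying on it.

```python
def link_best_model(run, model_artifact, entity, registered_model_name, aliases=("best",)):
    """Link a logged model artifact to a Registered Model and return the target path."""
    # Assumed target path layout; newer wandb releases may use a different one.
    target_path = f"{entity}/model-registry/{registered_model_name}"
    run.link_artifact(model_artifact, target_path, aliases=list(aliases))
    return target_path
```

The aliases passed here play the same role as the ones you can add in the UI, such as `production` or `best`.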

Now that the model artifact is linked to the Registered Model, which of the following pieces of information do we see on the Registered Model UI?

* Versioning
* Metadata
* Aliases
* Metric (MSE)
* Source run
* All of these
* None of these

### Answer

<table> 
    <tr>
        <td>
            <img src="images/screen3.png">
        </td>
        <td>
            <img src="images/screen4.png">
        </td>
    </tr>
    <tr>
        <td>
            <img src="images/screen5.png">
        </td>
        <td>
            <img src="images/screen6.png">
        </td>
    </tr>
</table> 
