### Q1. Install the Package

Once you have installed the package, run the command `wandb --version` and check the output.

**What's the version that you have?**

In [1]:
! wandb --version

wandb, version 0.15.3
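
The same check can be made from inside Python, which is useful if the notebook kernel and the shell might point at different environments. This is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the wandb version visible to this interpreter, or None if it is missing.
print(installed_version("wandb"))
```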


### Q2. Download and preprocess the data

We'll use the Green Taxi Trip Records dataset to predict the amount of tips for each trip.

Download the data for January, February and March 2022 in parquet format from [here](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page).

Use the script `preprocess_data.py` located in the folder `homework-wandb` to preprocess the data.

**Once you navigate to the Files tab of your artifact on your Weights & Biases page, what's the size of the saved DictVectorizer file?**

In [None]:
! mkdir data
! cd data && curl -O https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2022-01.parquet
! cd data && curl -O https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2022-02.parquet
! cd data && curl -O https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2022-03.parquet  
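
The three downloads above share one URL pattern, so the same step could also be done from Python. The sketch below uses only the standard library; the helper names `monthly_urls` and `download_all` are hypothetical, and the base URL is the one used in the `curl` commands.

```python
import os
import urllib.request

# Base URL taken from the curl commands above.
BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"


def monthly_urls(year=2022, months=(1, 2, 3)):
    """Build the parquet URLs for the given months of Green Taxi data."""
    return [
        f"{BASE_URL}/green_tripdata_{year}-{month:02d}.parquet"
        for month in months
    ]


def download_all(dest_dir="data"):
    """Download every monthly file into dest_dir, creating it if needed."""
    os.makedirs(dest_dir, exist_ok=True)
    for url in monthly_urls():
        target = os.path.join(dest_dir, url.rsplit("/", 1)[-1])
        urllib.request.urlretrieve(url, target)


# download_all()  # uncomment to fetch the three files into ./data
```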

In [20]:
! python "./scripts/preprocess_data.py" \
  --wandb_project NYC_TAXI_MODULE_2 \
  --wandb_entity  afk-legacy \
  --raw_data_path "data" \
  --dest_path "./output"

wandb: Currently logged in as: afk-legacy. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.15.3
wandb: Run data is saved locally in d:\Dev\projects\MLOpsZoomcamp\Module2\homework-wandb\wandb\run-20230606_215546-86lbhwih
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run dandy-sun-6
wandb:  View project at https://wandb.ai/afk-legacy/NYC_TAXI_MODULE_2
wandb:  View run at https://wandb.ai/afk-legacy/NYC_TAXI_MODULE_2/runs/86lbhwih
wandb: Adding directory to artifact (.\output)... Done. 0.0s
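
To cross-check the size shown in the Files tab, the file can also be measured locally. This sketch assumes the preprocessing script writes the pickled DictVectorizer as `dv.pkl` inside the destination folder; that file name is an assumption, so adjust it to whatever appears in `./output`. The helper `size_report` is hypothetical, and the W&B UI may round or use a different unit than the ones printed here.

```python
import os


def size_report(path):
    """Return the file size in bytes plus decimal (kB) and binary (KiB) conversions."""
    size = os.path.getsize(path)
    return {"bytes": size, "kB": size / 1000, "KiB": size / 1024}


# Assumed location of the pickled DictVectorizer; adjust to the actual file name.
dv_path = os.path.join("output", "dv.pkl")
if os.path.exists(dv_path):
    print(size_report(dv_path))
```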


### Q3. Train a model with Weights & Biases logging

We will train a `RandomForestRegressor` (from scikit-learn) on the taxi dataset.

**What is the value of the `max_depth` parameter?**

In [27]:
! python "./scripts/train.py" \
  --wandb_project NYC_TAXI_MODULE_2 \
  --wandb_entity afk-legacy \
  --data_artifact "afk-legacy/NYC_TAXI_MODULE_2/NYC-Taxi:v0"

wandb: Currently logged in as: afk-legacy. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.15.3
wandb: Run data is saved locally in d:\Dev\projects\MLOpsZoomcamp\Module2\homework-wandb\wandb\run-20230606_221618-ha3casx8
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run sandy-violet-13
wandb:  View project at https://wandb.ai/afk-legacy/NYC_TAXI_MODULE_2
wandb:  View run at https://wandb.ai/afk-legacy/NYC_TAXI_MODULE_2/runs/ha3casx8
wandb:   4 of 4 files downloaded.  


### Q4. Tune model hyperparameters

Now let's try to reduce the validation error by tuning the hyperparameters of the `RandomForestRegressor` using Weights & Biases Sweeps. We have prepared the script `sweep.py` for this exercise in the `homework-wandb` directory.

Your task is to modify `sweep.py` to pass the parameters `n_estimators`, `min_samples_split` and `min_samples_leaf` from `config` to `RandomForestRegressor` inside the `run_train()` function. Then we will run the sweep, not only to find the best set of hyperparameters for training our model, but also to analyze trends across the different hyperparameters.
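
The change to `run_train()` amounts to reading each hyperparameter from the sweep config and forwarding it to the model constructor. The sketch below shows only that extraction step, with a plain dict standing in for the W&B config object; the names `RF_KEYS` and `rf_params_from_config` are hypothetical and not part of the provided script.

```python
# Hyperparameters the sweep controls, named as RandomForestRegressor expects them.
RF_KEYS = ("max_depth", "n_estimators", "min_samples_split", "min_samples_leaf")


def rf_params_from_config(config):
    """Pick the RandomForestRegressor arguments out of a sweep config mapping."""
    return {key: config[key] for key in RF_KEYS}


# Values copied from the first sweep run logged in this notebook.
example_config = {
    "max_depth": 5,
    "min_samples_leaf": 4,
    "min_samples_split": 5,
    "n_estimators": 48,
}
params = rf_params_from_config(example_config)
print(params)

# Inside run_train(), the result would be forwarded to the model, for example:
#   rf = RandomForestRegressor(**params)
```

Collecting the keys in one tuple means a missing parameter raises a `KeyError` immediately instead of silently falling back to the scikit-learn default.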

In [28]:
! python "./scripts/sweep.py" \
  --wandb_project NYC_TAXI_MODULE_2 \
  --wandb_entity afk-legacy \
  --data_artifact "afk-legacy/NYC_TAXI_MODULE_2/NYC-Taxi:v0"

Create sweep with ID: bp78umeo
Sweep URL: https://wandb.ai/afk-legacy/NYC_TAXI_MODULE_2/sweeps/bp78umeo


wandb: Agent Starting Run: 9i5r3czn with config:
wandb: 	max_depth: 5
wandb: 	min_samples_leaf: 4
wandb: 	min_samples_split: 5
wandb: 	n_estimators: 48
wandb: Currently logged in as: afk-legacy. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.15.3
wandb: Run data is saved locally in d:\Dev\projects\MLOpsZoomcamp\Module2\homework-wandb\wandb\run-20230606_222116-9i5r3czn
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run stellar-sweep-1
wandb:  View project at https://wandb.ai/afk-legacy/NYC_TAXI_MODULE_2
wandb:  View sweep at https://wandb.ai/afk-legacy/NYC_TAXI_MODULE_2/sweeps/bp78umeo
wandb:  View run at https://wandb.ai/afk-legacy/NYC_TAXI_MODULE_2/runs/9i5r3czn
wandb:   4 of 4 files downloaded.  
wandb: Waiting for W&B process to finish... (success).
wandb: 
wandb: Run history:
wandb: MSE ▁
wandb: 
wandb: Run summary:
wandb: MSE 2.4607
wandb: 
wandb:  View run stellar-sweep-1 at: https://wandb.ai/afk-legacy/NYC_TAXI_MODULE_2/runs
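
The `MSE` value in the run summary is the mean squared error on the validation set, which is the quantity the sweep tries to reduce. As a reminder of what that number measures, here is a minimal pure-Python sketch; the homework scripts presumably rely on scikit-learn's implementation, and the sample values below are made up for illustration.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Illustrative tip amounts (not taken from the dataset).
actual_tips = [1.0, 2.0, 3.0]
predicted_tips = [1.0, 2.0, 5.0]
print(mean_squared_error(actual_tips, predicted_tips))
```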

### Q5. Link the best model to the model registry

Now that we have obtained the optimal set of hyperparameters and trained the best model, we can assume that we are ready to test some of these models in production. In this exercise, you'll create a model registry and link the best model from the Sweep to the model registry.

First, you will need to create a Registered Model to hold all the candidate models for your particular modeling task. You can refer to this section of the official docs to learn how to create a registered model using the Weights & Biases UI.

Once you have created the Registered Model, navigate to the best run of your sweep, open the model artifact created by that run, and click the Link to Registry option in the UI. This links the model artifact to the Registered Model. You can also add suitable aliases for the Registered Model, such as production, best, etc.

**Now that the model artifact is linked to the Registered Model, which of the following do we see on the Registered Model UI?**

+ Versioning
+ Metadata
+ Aliases
+ Metric (MSE)
+ Source run
+ All of these
+ None of these
