# Q1. Install the package

To get started with Weights & Biases you'll need to install the appropriate Python package.

For this, we recommend creating a separate Python environment (for example, with [conda environments](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html#managing-envs))
and then installing the packages there with `pip` or `conda`.

You need to install the following libraries:

* `pandas`
* `matplotlib`
* `scikit-learn`
* `pyarrow`
* `wandb`

Once you have installed the package, run the command `wandb --version` and check the output.

What's the version that you have?
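
The same check can be done from Python for all of the libraries listed above. The snippet below is a minimal sketch using the standard-library `importlib.metadata`; the helper name `installed_versions` is made up for this example, and a value of `None` means the package was not found in the current environment.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as they appear in the list above.
REQUIRED = ["pandas", "matplotlib", "scikit-learn", "pyarrow", "wandb"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(REQUIRED).items():
        print(f"{name}: {ver or 'not installed'}")
```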

In [2]:
!wandb --version

wandb, version 0.15.3


# Q2. Download and preprocess the data

We'll use the Green Taxi Trip Records dataset to predict the amount of tips for each trip. 

Download the data for January, February and March 2022 in parquet format from [here](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page).

**Tip:** In case you're on [GitHub Codespaces](https://github.com/features/codespaces) or [gitpod.io](https://gitpod.io), you can open up the terminal and run the following commands to download the data:

```shell
wget https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2022-01.parquet
wget https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2022-02.parquet
wget https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2022-03.parquet
```
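
The three URLs differ only in the month, so they can also be generated programmatically. The following is a rough sketch using only the standard library; the helper names `build_urls` and `download_all` are made up for this example, and the URL pattern is inferred from the three links above rather than from any official specification.

```python
import os
import urllib.request

BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"


def build_urls(year, months, color="green"):
    """Return one parquet URL per month, following the pattern of the links above."""
    return [
        f"{BASE_URL}/{color}_tripdata_{year}-{month:02d}.parquet"
        for month in months
    ]


def download_all(urls, dest_dir="data"):
    """Download each URL into dest_dir, skipping files that already exist."""
    os.makedirs(dest_dir, exist_ok=True)
    for url in urls:
        target = os.path.join(dest_dir, url.rsplit("/", 1)[-1])
        if not os.path.exists(target):  # do not overwrite existing files
            urllib.request.urlretrieve(url, target)


urls = build_urls(2022, [1, 2, 3])
# download_all(urls)  # uncomment to fetch the files (requires network access)
```

Keeping the download call commented out makes it possible to inspect `urls` first and only fetch the files once the list looks right.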


In [3]:
%%capture
!wget -nc https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2022-01.parquet -P data
!wget -nc https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2022-02.parquet -P data
!wget -nc https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2022-03.parquet -P data


Use the script `preprocess_data.py` located in the folder [`homework-wandb`](homework-wandb) to preprocess the data.

The script will:

* initialize a Weights & Biases run,
* load the data from the folder `<TAXI_DATA_FOLDER>` (the folder where you have downloaded the data),
* fit a `DictVectorizer` on the training set (January 2022 data),
* save the preprocessed datasets and the `DictVectorizer` to your Weights & Biases dashboard as an artifact of type `preprocessed_dataset`.

Your task is to download the datasets and then execute this command:

```bash
python preprocess_data.py \
  --wandb_project <WANDB_PROJECT_NAME> \
  --wandb_entity <WANDB_USERNAME> \
  --raw_data_path <TAXI_DATA_FOLDER> \
  --dest_path ./output
```

**Tip:** go to the `02-experiment-tracking/homework-wandb/` folder before executing the command, and replace `<WANDB_PROJECT_NAME>` with the name of your Weights & Biases project, `<WANDB_USERNAME>` with your Weights & Biases username, and `<TAXI_DATA_FOLDER>` with the location where you saved the data.

Once you navigate to the `Files` tab of your artifact on your Weights & Biases page, what's the size of the saved `DictVectorizer` file?

* 54 kB
* 154 kB
* 54 MB
* 154 MB
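
The size shown in the `Files` tab can be cross-checked against the local copy of the file. The sketch below assumes the script wrote the vectorizer to `./output/dv.pkl` (a hypothetical file name, adjust it to whatever appears in your output folder) and uses decimal units, which may differ slightly from how the dashboard rounds.

```python
import os


def human_size(num_bytes):
    """Format a byte count using decimal units (1 kB = 1000 bytes)."""
    size = float(num_bytes)
    for unit in ["B", "kB", "MB", "GB"]:
        if size < 1000 or unit == "GB":
            return f"{size:.1f} {unit}"
        size /= 1000


# Hypothetical path: assumes the script saved a file named dv.pkl in ./output.
dv_path = os.path.join("output", "dv.pkl")
if os.path.exists(dv_path):
    print(human_size(os.path.getsize(dv_path)))
```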


In [4]:
%%capture
!wget -nc https://github.com/DataTalksClub/mlops-zoomcamp/raw/main/cohorts/2023/02-experiment-tracking/homework-wandb/preprocess_data.py -P scripts_wb

In [5]:
!python scripts_wb/preprocess_data.py \
  --wandb_project 'mlops-zoomcamp' \
  --raw_data_path ./data \
  --dest_path ./output

wandb: Currently logged in as: aaalex-lit. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.15.3
wandb: Run data is saved locally in /workspaces/codespaces-blank/wandb/run-20230604_062850-jafs25q7
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run light-violet-1
wandb: ⭐️ View project at https://wandb.ai/aaalex-lit/mlops-zoomcamp
wandb: 🚀 View run at https://wandb.ai/aaalex-lit/mlops-zoomcamp/runs/jafs25q7
wandb: Adding directory to artifact (./output)... Done. 0.0s
wandb: Waiting for W&B process to finish... (success).
wandb: 🚀 View run light-violet-1 at: https://wandb.ai/aaalex-lit/mlops-zoomcamp/runs/jafs25q7
wandb: Synced 6 W&B file(s), 0 media file(s), 6 artifact file(s) and 0 other fil