# Setup

In [28]:
import requests
import os

from IPython.display import Image, display

## Setup wandb

In [21]:
# "!export" only affects a throwaway subshell; %env sets the variable for the kernel itself
%env WANDB_NOTEBOOK_NAME=wandb-homework

In [24]:
# Log in to your W&B account
import wandb
wandb.login()

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit:[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /Users/deltasmith/.netrc


True

# Q1. Install the Package
To get started with Weights & Biases you'll need to install the appropriate Python package.

We recommend creating a separate Python environment for this (for example, a conda environment) and installing the packages there with pip or conda.

You need to install the following libraries:
```bash
pandas
matplotlib
scikit-learn
pyarrow
wandb
```
Once you have installed the packages, run the command `wandb --version` and check the output.

**What's the version that you have?**

In [None]:
!pip install -r requirements.txt

In [4]:
!wandb --version

wandb, version 0.15.3
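
The same check can also be done from Python for every library in the list above. This is a minimal sketch using only the standard library (`importlib.metadata`, Python 3.8+); the helper names `installed_version` and `report` are illustrative and not part of the homework.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as listed in the requirements above.
REQUIRED = ["pandas", "matplotlib", "scikit-learn", "pyarrow", "wandb"]


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def report(packages):
    """Map each package name to its installed version (or None)."""
    return {name: installed_version(name) for name in packages}


for name, ver in report(REQUIRED).items():
    print(f"{name}: {ver or 'not installed'}")
```

Any package reported as `not installed` points to an environment that does not match `requirements.txt`.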


# Q2. Download and preprocess the data

We'll use the Green Taxi Trip Records dataset to predict the amount of tips for each trip.

Download the data for January, February and March 2022 in parquet format from [here](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page).

**Tip:** In case you're on [GitHub Codespaces](https://github.com/features/codespaces) or [gitpod.io](https://gitpod.io), you can open up the terminal and run the following commands to download the data:

```shell
wget https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2022-01.parquet
wget https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2022-02.parquet
wget https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2022-03.parquet
```
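
If `wget` is not available, the three files can probably be fetched from Python instead. The sketch below assumes the same CloudFront URLs as the commands above and uses only the standard library; `parquet_name` and `download_months` are hypothetical helper names, and the server may reject requests from some clients, in which case `wget` remains the fallback.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Base URL taken from the wget commands above.
BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"


def parquet_name(year, month):
    """Build the file name for one month of Green Taxi data."""
    return f"green_tripdata_{year}-{month:02d}.parquet"


def download_months(year, months, dest="."):
    """Download the given months into dest, skipping files that already exist."""
    dest_dir = Path(dest)
    dest_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for month in months:
        name = parquet_name(year, month)
        target = dest_dir / name
        if not target.exists():
            urlretrieve(f"{BASE_URL}/{name}", str(target))
        saved.append(target)
    return saved
```

Calling `download_months(2022, [1, 2, 3])` would place the January, February and March files in the current folder.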

Use the script `preprocess_data.py` located in the folder [`homework-wandb`](homework-wandb) to preprocess the data.

The script will:

* initialize a Weights & Biases run,
* load the data from the folder `<TAXI_DATA_FOLDER>` (the folder where you have downloaded the data),
* fit a `DictVectorizer` on the training set (January 2022 data),
* save the preprocessed datasets and the `DictVectorizer` to your Weights & Biases dashboard as an artifact of type `preprocessed_dataset`.
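
The key point in the steps above is that the `DictVectorizer` is fitted on the training set only and then reused, unchanged, on the other months. The following is a minimal sketch of that idea with toy records; the feature names `PU_DO` and `trip_distance` and all values are assumptions made for illustration, not necessarily what `preprocess_data.py` uses.

```python
from sklearn.feature_extraction import DictVectorizer

# Toy records standing in for the January (train) and February (validation) data.
# Feature names and values are illustrative assumptions.
train_dicts = [
    {"PU_DO": "74_75", "trip_distance": 1.2},
    {"PU_DO": "41_42", "trip_distance": 3.4},
]
val_dicts = [
    {"PU_DO": "74_75", "trip_distance": 0.8},
    {"PU_DO": "7_8", "trip_distance": 2.0},  # category never seen during fit
]

dv = DictVectorizer()
X_train = dv.fit_transform(train_dicts)  # learn the feature space on train only
X_val = dv.transform(val_dicts)          # reuse it; unseen categories are dropped

print(dv.feature_names_)
print(X_train.shape, X_val.shape)
```

Both matrices end up with the same columns, which is why the fitted vectorizer has to be saved alongside the datasets.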

Your task is to download the datasets and then execute this command:

```bash
python preprocess_data.py \
  --wandb_project <WANDB_PROJECT_NAME> \
  --wandb_entity <WANDB_USERNAME> \
  --raw_data_path <TAXI_DATA_FOLDER> \
  --dest_path ./output
```

**Tip:** Go to the `02-experiment-tracking/homework-wandb/` folder before executing the command. Replace `<WANDB_PROJECT_NAME>` with the name of your Weights & Biases project, `<WANDB_USERNAME>` with your Weights & Biases username, and `<TAXI_DATA_FOLDER>` with the location where you saved the data.



In [31]:
!wget https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2022-01.parquet
!wget https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2022-02.parquet
!wget https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2022-03.parquet

Will not apply HSTS. The HSTS database must be a regular and non-world-writable file.
ERROR: could not open HSTS store at '/Users/deltasmith/.wget-hsts'. HSTS will be disabled.
--2023-06-06 14:03:23--  https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2022-01.parquet
Resolving d37ci6vzurychx.cloudfront.net (d37ci6vzurychx.cloudfront.net)... 2600:9000:200c:a00:b:20a5:b140:21, 2600:9000:200c:a600:b:20a5:b140:21, 2600:9000:200c:2600:b:20a5:b140:21, ...
Connecting to d37ci6vzurychx.cloudfront.net (d37ci6vzurychx.cloudfront.net)[2600:9000:200c:a00:b:20a5:b140:21]:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1254291 (1.2M) [binary/octet-stream]
Saving to: 'green_tripdata_2022-01.parquet'


2023-06-06 14:03:24 (4.41 MB/s) - 'green_tripdata_2022-01.parquet' saved [1254291/1254291]

Will not apply HSTS. The HSTS database must be a regular and non-world-writable file.
ERROR: could not open HSTS store at '/Users/deltasmith/.wget-hsts'. HSTS 

In [38]:
TAXI_DATA_FOLDER = "./"
WANDB_PROJECT_NAME = "wandb-homework"
WANDB_USERNAME = "carloslme"

In [39]:
!python preprocess_data.py \
  --wandb_project {WANDB_PROJECT_NAME} \
  --wandb_entity {WANDB_USERNAME} \
  --raw_data_path {TAXI_DATA_FOLDER} \
  --dest_path ./output

[34m[1mwandb[0m: Currently logged in as: [33mcarloslme[0m. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Tracking run with wandb version 0.15.3
[34m[1mwandb[0m: Run data is saved locally in [35m[1m/Users/deltasmith/mlops-zoomcamp/02-experiment-tracking/wandb/wandb/run-20230606_141203-hhew05nw[0m
[34m[1mwandb[0m: Run [1m`wandb offline`[0m to turn off syncing.
[34m[1mwandb[0m: Syncing run [33mdeft-night-1[0m
[34m[1mwandb[0m: ⭐️ View project at [34m[4mhttps://wandb.ai/carloslme/wandb-homework[0m
[34m[1mwandb[0m: 🚀 View run at [34m[4mhttps://wandb.ai/carloslme/wandb-homework/runs/hhew05nw[0m
[34m[1mwandb[0m: Adding directory to artifact (./output)... Done. 0.0s


Once you navigate to the `Files` tab of your artifact on your Weights & Biases page, **what's the size of the saved `DictVectorizer` file?**

In [40]:
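def get_file_size(file_path):
    """Return the size of file_path in kilobytes (1 KB = 1000 bytes)."""
    size_in_bytes = os.path.getsize(file_path)
    return size_in_bytes / 1000

# Usage: measure the pickled DictVectorizer written by preprocess_data.py
file_path = './output/dv.pkl'
file_size_in_kb = get_file_size(file_path)
print(f"File size: {file_size_in_kb} KB")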

File size: 153.66 KB
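
The number reported depends on which definition of "kilobyte" is used. The sketch below converts a byte count both ways; 153,660 bytes is the count implied by the 153.66 output above, and `describe_size` is a hypothetical helper. Which convention the W&B `Files` tab uses is not stated here, so treat the comparison as a sanity check rather than a definitive answer.

```python
def describe_size(num_bytes):
    """Return the same byte count in decimal and binary kilobytes."""
    return {
        "kB": round(num_bytes / 1000, 2),   # decimal: 1 kB = 1000 bytes
        "KiB": round(num_bytes / 1024, 2),  # binary: 1 KiB = 1024 bytes
    }


print(describe_size(153_660))
# → {'kB': 153.66, 'KiB': 150.06}
```

A gap of a few kilobytes between the local result and the dashboard is therefore most likely a units difference, not a different file.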
