## Homework

The goal of this homework is to familiarize users with monitoring for ML batch services, using a PostgreSQL database to store metrics and Grafana to visualize them.



## Q1. Prepare the dataset

Start with `baseline_model_nyc_taxi_data.ipynb`. Download the March 2024 Green Taxi data. We will use this data to simulate production usage of a taxi trip duration prediction service.

What is the shape of the downloaded data? How many rows are there?

* 72044
* 78537 
* 57457
* 54396

```python
import os

import pandas as pd
import requests
from tqdm import tqdm

# Download March 2024 Green Taxi data
file = "green_tripdata_2024-03.parquet"
url = f"https://d37ci6vzurychx.cloudfront.net/trip-data/{file}"
save_path = f"./data/{file}"
os.makedirs("./data", exist_ok=True)  # make sure the target folder exists

resp = requests.get(url, stream=True)
resp.raise_for_status()  # stop early if the download failed

# iter_content() yields one byte per iteration by default,
# so the progress bar total equals Content-Length.
with open(save_path, "wb") as handle:
    for data in tqdm(resp.iter_content(),
                     desc=f"{file}",
                     postfix=f"save to {save_path}",
                     total=int(resp.headers["Content-Length"])):
        handle.write(data)

# Load the data
mar_data = pd.read_parquet(save_path)
print(mar_data.shape)
```


green_tripdata_2024-03.parquet: 100%|██████████| 1372372/1372372 [00:06<00:00, 223299.51it/s, save to ./data/green_tripdata_2024-03.parquet]

(57457, 20)






## Q2. Metric

Let's expand the number of data quality metrics we’d like to monitor! Please add one metric of your choice and a quantile value for the `"fare_amount"` column (`quantile=0.5`).

Hint: explore the Evidently metric `ColumnQuantileMetric` (`from evidently.metrics import ColumnQuantileMetric`).

What metric did you choose?

```python
from evidently.report import Report
from evidently.metrics import (
    ColumnDriftMetric,
    ColumnQuantileMetric,
    DatasetDriftMetric,
    DatasetMissingValuesMetric,
)

# Median (0.5 quantile) of the fare amount
new_metric = ColumnQuantileMetric(column_name="fare_amount", quantile=0.5)

report = Report(metrics=[
    ColumnDriftMetric(column_name="prediction"),
    DatasetDriftMetric(),
    DatasetMissingValuesMetric(),
    new_metric,  # the new metric
])

# train_data, val_data and column_mapping are expected to be defined earlier in the notebook
report.run(reference_data=train_data, current_data=val_data, column_mapping=column_mapping)

result = report.as_dict()
print(result)
```
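
As a quick sanity check on what `quantile=0.5` means, the sketch below computes the same quantile with plain pandas. The fare values are made up for illustration and are not taken from the taxi data; the point is only that the 0.5 quantile is the median of the column.

```python
import pandas as pd

# Toy fares (made-up values, not from the taxi dataset).
fares = pd.Series([5.0, 7.5, 10.0, 12.5, 40.0], name="fare_amount")

# The 0.5 quantile is the median: half of the values lie at or below it.
q50 = fares.quantile(0.5)
print(q50, fares.median())
```

Note that the single large fare (40.0) does not move the median, which is one reason a quantile can be a more stable monitoring signal than the mean.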




## Q3. Monitoring

Let’s start monitoring. Run the expanded monitoring for a new batch of data (March 2024).

What is the maximum value of metric `quantile = 0.5` on the `"fare_amount"` column during March 2024 (calculated daily)?

* 10
* 12.5
* 14.2
* 14.8
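
One way to cross-check the daily values reported by the monitoring run is to compute the daily median directly with pandas. The sketch below is only an illustration under assumptions: it uses a small made-up DataFrame instead of the real March 2024 file, and it assumes trips are assigned to days by the `lpep_pickup_datetime` column. It does not replace the Evidently run.

```python
import pandas as pd

# Toy trips (made-up values); the real data would be loaded from the parquet file.
trips = pd.DataFrame({
    "lpep_pickup_datetime": pd.to_datetime([
        "2024-03-01 08:00", "2024-03-01 09:30", "2024-03-01 18:15",
        "2024-03-02 07:45", "2024-03-02 12:00",
        "2024-03-03 23:10",
    ]),
    "fare_amount": [8.0, 12.0, 10.0, 14.0, 9.0, 13.5],
})

# Group by calendar day of pickup and take the 0.5 quantile of the fare per day.
daily_median = (
    trips.groupby(trips["lpep_pickup_datetime"].dt.date)["fare_amount"]
    .quantile(0.5)
)

# The quantity asked for is the largest of the daily values.
max_daily_median = daily_median.max()
print(daily_median)
print(max_daily_median)
```

On the real data, the same grouping would be applied to the DataFrame loaded from the March 2024 file, after restricting it to pickups that fall within March.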

## Q4. Dashboard


Finally, let’s add panels with the newly added metrics to the dashboard. After we customize the dashboard, let's save the dashboard config so that we can access it later. Hint: click on “Save dashboard” to access the JSON configuration of the dashboard. This configuration should be saved locally.

Where should the dashboard config file be placed?

* `project_folder` (05-monitoring)
* `project_folder/config`  (05-monitoring/config)
* `project_folder/dashboards`  (05-monitoring/dashboards)
* `project_folder/data`  (05-monitoring/data)
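
Whichever folder is used, it can help to confirm that the exported configuration is valid JSON before committing it. The following is a minimal sketch: the `dashboard_json` string is a toy stand-in for the text copied from Grafana, and the temporary directory and the file name `dashboard.json` are hypothetical placeholders, not the location the question asks about.

```python
import json
import tempfile
from pathlib import Path

# Toy stand-in for the JSON copied from Grafana; a real export has many more fields.
dashboard_json = '{"title": "Taxi monitoring", "panels": []}'

# Hypothetical location: a temp directory is used here only for illustration.
target = Path(tempfile.mkdtemp()) / "dashboard.json"

# Parsing first means invalid JSON raises json.JSONDecodeError before anything is written.
config = json.loads(dashboard_json)
target.write_text(json.dumps(config, indent=2), encoding="utf-8")

# Read the file back to confirm it round-trips.
reloaded = json.loads(target.read_text(encoding="utf-8"))
print(reloaded["title"])
```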

