## Homework

The goal of this homework is to familiarize users with monitoring for ML batch services, using a PostgreSQL database to store metrics and Grafana to visualize them.

## Q1. Prepare the dataset

Start with `baseline_model_nyc_taxi_data.ipynb`. Download the March 2023 Green Taxi data. We will use this data to simulate production usage of a taxi trip duration prediction service.

What is the shape of the downloaded data? How many rows are there?

* 72044
* 78537 
* 62495
* 54396

```
import pandas as pd

df = pd.read_parquet('../../data/green_tripdata_2023-03.parquet')
print('Answer:', df.shape[0])
```

Answer: 72044
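
The download step itself is not shown above. A minimal sketch of one way to fetch the file follows, assuming the data is served from the NYC TLC trip-data bucket under the usual `<color>_tripdata_<year>-<month>.parquet` naming. The base URL, the helper names `trip_data_url` and `download_trip_data`, and the `data` target folder are assumptions for illustration, not part of the homework code.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Assumed location of the public NYC TLC trip data; adjust if it has moved.
BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"


def trip_data_url(color: str, year: int, month: int) -> str:
    """Build the download URL for one month of trip data (hypothetical helper)."""
    return f"{BASE_URL}/{color}_tripdata_{year}-{month:02d}.parquet"


def download_trip_data(color: str, year: int, month: int, data_dir: str = "data") -> Path:
    """Download the file into data_dir unless it is already there."""
    target = Path(data_dir) / f"{color}_tripdata_{year}-{month:02d}.parquet"
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():
        urlretrieve(trip_data_url(color, year, month), target)
    return target


# Only the URL is built here; call download_trip_data("green", 2023, 3) to fetch the file.
print(trip_data_url("green", 2023, 3))
```

Once the file is downloaded, `pd.read_parquet` can be pointed at the returned path.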


## Q2. Metric

Let's expand the number of data quality metrics we’d like to monitor! Please add one metric of your choice and a quantile value for the `"fare_amount"` column (`quantile=0.5`).

Hint: explore the Evidently metric `ColumnQuantileMetric` (`from evidently.metrics import ColumnQuantileMetric`).

What metric did you choose?

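```
# add the following code to homework6_code/evidently_metrics_calculation.py
# (Report and DatasetMissingValuesMetric may already be imported in the script)
from evidently.report import Report
from evidently.metrics import (
    ColumnCorrelationsMetric,
    ColumnQuantileMetric,
    DatasetMissingValuesMetric,
)

report = Report(metrics=[
    DatasetMissingValuesMetric(),
    ColumnCorrelationsMetric(column_name='prediction'),
    ColumnQuantileMetric(column_name='fare_amount', quantile=0.5),
])
```

Then create the script `homework6_code/baseline_model.py`, which is run in Q4 to generate the baseline model.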

**Answer**: ColumnCorrelationsMetric
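
To make the meaning of `quantile=0.5` concrete, here is a small standard-library sketch of a quantile with linear interpolation, which is also the default interpolation of `pandas.Series.quantile`. It is an illustration of the statistic only, not Evidently's implementation; the `quantile` helper and the fare values are made up for the example.

```python
def quantile(values, q):
    """Quantile with linear interpolation between the two nearest order statistics."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    ordered = sorted(values)
    position = q * (len(ordered) - 1)   # fractional index into the sorted list
    lower = int(position)               # index at or just below the position
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


# Made-up fare amounts; with q=0.5 the result is the median.
fares = [7.2, 12.8, 10.0, 35.5, 14.9]
median_fare = quantile(fares, 0.5)
print(median_fare)
```

With an odd number of values the 0.5 quantile is the middle value of the sorted list, so a single large fare such as `35.5` does not move it the way it would move the mean.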

## Q3. Prefect flow 

Let’s update the Prefect tasks by giving them meaningful names and specifying the number of retries and the delay between them.

Hint: use the `evidently_metrics_calculation.py` script as a starting point to implement your solution. Check the Prefect docs for the available task parameters.

What is the correct way of doing that?

* `@task(retries_num=2, retry_seconds=5, task_name="calculate metrics")`
* `@task(retries_num=2, retry_delay_seconds=5, name="calculate metrics")`
* `@task(retries=2, retry_seconds=5, task_name="calculate metrics")`
* `@task(retries=2, retry_delay_seconds=5, name="calculate metrics")`

**Answer**: `@task(retries=2, retry_delay_seconds=5, name="calculate metrics")`
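
To illustrate what `retries=2` with a fixed delay means in practice, here is a plain-Python sketch of the same retry semantics: one initial attempt plus up to two retries, with a pause between attempts. This is not Prefect's implementation; `with_retries` and `flaky_metric_calculation` are hypothetical names, and the delay is set to `0` so the example runs instantly.

```python
import functools
import time


def with_retries(retries, retry_delay_seconds, sleep=time.sleep):
    """Re-run the wrapped function up to `retries` extra times after a failure."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(retries + 1):  # initial attempt + retries
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == retries:      # no retries left: re-raise
                        raise
                    sleep(retry_delay_seconds)  # wait before the next attempt
        return wrapper
    return decorator


calls = []


@with_retries(retries=2, retry_delay_seconds=0)
def flaky_metric_calculation():
    """Fails on the first two calls, succeeds on the third."""
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("transient failure")
    return "ok"


result = flaky_metric_calculation()
print(result, len(calls))
```

With `retries=2` the function may run three times in total, so a task that fails twice in a row can still succeed.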

## Q4. Monitoring

Let’s start monitoring. Run expanded monitoring for a new batch of data (March 2023). 

What is the maximum value of metric `quantile = 0.5` on the `"fare_amount"` column during March 2023 (calculated daily)?

* 10
* 12.5
* 14
* 14.8

Start all the required services:

```
docker compose up
```

In another terminal, run the `baseline_model.py` script to generate the Linear Regression model:

```
python homework6_code/baseline_model.py
```

To calculate the Evidently metrics with Prefect and send them to the database, run:

```
python homework6_code/evidently_metrics_calculation.py
```

Next, open `localhost:3000` in your browser. The default username and password are both `admin`.

Create a new dashboard and add a panel with the following query:

```
SELECT
  "timestamp",
  MAX(column_quantile_metric)
FROM
  dummy_metrics
GROUP BY
  "timestamp"
LIMIT
  50
```

The panel should look like this:

![Grafana Panel.png](homework6_images/grafana_panel.png)

**Answer**: 14
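
The same "daily quantile, then maximum over days" logic can be sketched outside Grafana. The snippet below groups made-up `(day, fare_amount)` rows by day, takes the median of each day, and returns the day with the largest median. The rows and the `max_daily_median` helper are illustrative assumptions, so the printed value is not the homework answer.

```python
from collections import defaultdict
from statistics import median

# Toy rows: (day, fare_amount). Values are invented for the example.
rows = [
    ("2023-03-01", 10.0), ("2023-03-01", 12.0), ("2023-03-01", 14.0),
    ("2023-03-02", 9.0), ("2023-03-02", 13.5),
    ("2023-03-03", 13.0), ("2023-03-03", 8.0), ("2023-03-03", 20.0),
]


def max_daily_median(rows):
    """Return (day, median) for the day with the highest median fare."""
    fares_by_day = defaultdict(list)
    for day, fare in rows:
        fares_by_day[day].append(fare)
    daily_median = {day: median(fares) for day, fares in fares_by_day.items()}
    best_day = max(daily_median, key=daily_median.get)
    return best_day, daily_median[best_day]


best_day, best_value = max_daily_median(rows)
print(best_day, best_value)
```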

## Q5. Dashboard


Finally, let’s add panels with the newly added metrics to the dashboard. After customizing the dashboard, let’s save its config so that we can access it later. Hint: click on “Save dashboard” to access the JSON configuration of the dashboard. This configuration should be saved locally.

Where should the dashboard config file be placed?

* `project_folder` (05-monitoring)
* `project_folder/config`  (05-monitoring/config)
* `project_folder/dashboards`  (05-monitoring/dashboards)
* `project_folder/data`  (05-monitoring/data)

The dashboard looks like this:
![Grafana Dashboard.png](homework6_images/grafana_dashboard.png)

Go to the dashboard settings and save the JSON Model in the `dashboards` folder. Name it `metrics.json`.

![Grafana Dashboard Settings.png](homework6_images/grafana_dashboard_settings.png)

**Answer**: project_folder/dashboards  (05-monitoring/dashboards)
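
As a sanity check on the file location, the sketch below writes a dashboard model to `dashboards/metrics.json` under a project folder and reads it back to confirm it is valid JSON. The `save_dashboard` helper and the placeholder model are assumptions for illustration; in the homework the JSON is copied from Grafana, and a temporary directory stands in for `05-monitoring`.

```python
import json
import tempfile
from pathlib import Path


def save_dashboard(model: dict, project_folder: str) -> Path:
    """Write the dashboard JSON to <project_folder>/dashboards/metrics.json."""
    path = Path(project_folder) / "dashboards" / "metrics.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(model, indent=2))
    return path


# A temporary directory stands in for the real project folder.
with tempfile.TemporaryDirectory() as project_folder:
    model = {"title": "Metrics", "panels": []}  # placeholder, not a real export
    saved = save_dashboard(model, project_folder)
    reloaded = json.loads(saved.read_text())    # fails if the file is not valid JSON
    relative = saved.relative_to(project_folder).as_posix()

print(relative)
```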

## Submit the results

* Submit your results here: https://forms.gle/PJaYeWsnWShAEBF79
* You can submit your solution multiple times. In this case, only the last submission will be used
* If your answer doesn't match options exactly, select the closest one

## Answers
* **Q1**: 72044
* **Q2**: ColumnCorrelationsMetric
* **Q3**: `@task(retries=2, retry_delay_seconds=5, name="calculate metrics")`
* **Q4**: 14
* **Q5**: project_folder/dashboards  (05-monitoring/dashboards)

## Deadline

The deadline for submitting is 7 July (Friday), 23:00 CEST (Berlin time). 

After that, the form will be closed.
