# Station Air Pollution Estimation

For each station, find a good model to predict the individual pollutants.

In [None]:
%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
import seaborn as sns


from utils import *

datasets_folder = './datasets'
verbosity = 1

## Input Data

We load the datasets with the techniques used in the corresponding notebooks.

- Air pollution

In [None]:
giardini_margherita_pollution_dict, san_felice_pollution_dict, chiarini_pollution_dict = read_and_preprocess_dataset(datasets_folder, 'pollution', v=verbosity)

In [3]:
display(giardini_margherita_pollution_dict['NO2'].iloc[:2])
display(giardini_margherita_pollution_dict['NO2'].iloc[-2:])

Unnamed: 0,Date,Agent_value
0,2019-01-01 00:00:00,29.0
1,2019-01-01 01:00:00,17.142019


Unnamed: 0,Date,Agent_value
52606,2024-12-31 22:00:00,22.0
52607,2024-12-31 23:00:00,21.0


- Traffic

In [None]:
giardini_margherita_traffic_df, san_felice_traffic_df, chiarini_traffic_df = read_and_preprocess_dataset(datasets_folder, 'traffic', v=verbosity)

In [5]:
display(giardini_margherita_traffic_df.iloc[:2])
display(giardini_margherita_traffic_df.iloc[-2:])

Unnamed: 0,Date,Traffic_value
0,2019-01-01 00:00:00,10501.0
1,2019-01-01 01:00:00,16863.0


Unnamed: 0,Date,Traffic_value
52606,2024-12-31 22:00:00,4162.0
52607,2024-12-31 23:00:00,3765.0


- Weather

In [None]:
weather_df = read_and_preprocess_dataset(datasets_folder, 'weather', v=verbosity)

In [7]:
display(weather_df.iloc[:2])
display(weather_df.iloc[-2:])

Unnamed: 0,Date,TAVG,PREC,RHAVG,RAD,W_SCAL_INT,W_VEC_DIR,W_VEC_INT,LEAFW,ET0
0,2019-01-01 00:00:00,1.0,0.0,92.3,0.0,0.5,208.7,0.3,0.0,0.0
1,2019-01-01 01:00:00,0.3,0.0,93.6,0.0,0.5,280.0,0.2,0.0,0.0


Unnamed: 0,Date,TAVG,PREC,RHAVG,RAD,W_SCAL_INT,W_VEC_DIR,W_VEC_INT,LEAFW,ET0
52583,2024-12-30 23:00:00,5.1,0.0,76.1,0.0,2.8,256.7,2.7,0.0,0.0
52584,2024-12-31 00:00:00,5.1,0.0,75.0,0.0,2.8,258.3,2.7,0.0,0.0


**NOTE:** The weather data ends at 2024-12-31 00:00, so the very last day (31-12-2024) is missing apart from its first hour.
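A gap like this can be checked programmatically. The sketch below is only an illustration: `missing_hours` is a hypothetical helper (not part of `utils`), and the toy frame stands in for `weather_df`, assuming an hourly `Date` column.

```python
import pandas as pd

def missing_hours(df, start, end, date_col="Date"):
    """Return the hourly timestamps in [start, end] that are absent from df (hypothetical helper)."""
    expected = pd.date_range(start=start, end=end, freq=pd.Timedelta(hours=1))
    return expected.difference(pd.DatetimeIndex(df[date_col]))

# Toy stand-in for the tail of weather_df (illustrative timestamps only)
toy_weather = pd.DataFrame(
    {"Date": pd.to_datetime(["2024-12-30 23:00", "2024-12-31 00:00"])}
)
gaps = missing_hours(toy_weather, "2024-12-30 23:00", "2024-12-31 02:00")
print(gaps)
```

On the real data, the same call with the pollution date range as `start` and `end` would list the hours that the weather dataset does not cover.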

## Merge the datasets

We will merge the datasets on the `Date` column.

*The following is applied to a single station and a single agent, just to show the process.*

In [8]:
merged_giardini_margherita = {}
merged_giardini_margherita['NO2'] = merge_datasets(
    giardini_margherita_pollution_dict['NO2'],
    giardini_margherita_traffic_df,
    weather_df,
    on='Date',
    dropna=True # drop the last day (31-12-2024)
    )

In [9]:
merged_giardini_margherita['NO2'].head(3)

Unnamed: 0,Date,Agent_value,Traffic_value,TAVG,PREC,RHAVG,RAD,W_SCAL_INT,W_VEC_DIR,W_VEC_INT,LEAFW,ET0
0,2019-01-01 00:00:00,29.0,10501.0,1.0,0.0,92.3,0.0,0.5,208.7,0.3,0.0,0.0
1,2019-01-01 01:00:00,17.142019,16863.0,0.3,0.0,93.6,0.0,0.5,280.0,0.2,0.0,0.0
2,2019-01-01 02:00:00,23.0,15248.0,0.7,0.0,91.7,0.0,1.1,158.1,1.0,0.0,0.0
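The implementation of `merge_datasets` lives in `utils` and is not shown here. As a rough sketch of what a merge on `Date` followed by `dropna` might look like, assuming plain pandas joins (the function name `merge_on_date` and the toy values are hypothetical):

```python
import pandas as pd

def merge_on_date(pollution_df, traffic_df, weather_df, on="Date", dropna=True):
    """Hypothetical stand-in for utils.merge_datasets: join three frames on a shared column."""
    merged = pollution_df.merge(traffic_df, on=on, how="left")
    merged = merged.merge(weather_df, on=on, how="left")
    if dropna:
        # Rows without a match in one of the sources contain NaN and are removed
        merged = merged.dropna().reset_index(drop=True)
    return merged

# Toy data (made-up values): the weather frame lacks the last timestamp
dates = pd.to_datetime(
    ["2024-12-30 22:00", "2024-12-30 23:00", "2024-12-31 00:00", "2024-12-31 01:00"]
)
pollution = pd.DataFrame({"Date": dates, "Agent_value": [25.0, 22.0, 21.0, 20.0]})
traffic = pd.DataFrame({"Date": dates, "Traffic_value": [4000.0, 3900.0, 3800.0, 3700.0]})
weather = pd.DataFrame({"Date": dates[:3], "TAVG": [5.2, 5.1, 5.1]})

merged = merge_on_date(pollution, traffic, weather)
print(merged)
```

With `dropna=True`, the timestamp that has no weather record is dropped, which mirrors how the last day disappears from the merged dataset.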


### Normalize the columns

In [10]:
normalized_giardini_margherita = {}
normalized_giardini_margherita['NO2'] = normalize_columns(merged_giardini_margherita['NO2'], skip=['Date', 'Agent_value'])

In [11]:
normalized_giardini_margherita['NO2'].head(3)

Unnamed: 0,Date,Agent_value,Traffic_value,TAVG,PREC,RHAVG,RAD,W_SCAL_INT,W_VEC_DIR,W_VEC_INT,LEAFW,ET0
0,2019-01-01 00:00:00,29.0,0.062321,0.096698,0.0,0.911899,0.0,0.034247,0.579722,0.02069,0.0,0.0
1,2019-01-01 01:00:00,17.142019,0.100078,0.080189,0.0,0.926773,0.0,0.034247,0.777778,0.013793,0.0,0.0
2,2019-01-01 02:00:00,23.0,0.090494,0.089623,0.0,0.905034,0.0,0.075342,0.439167,0.068966,0.0,0.0
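The normalized values fall in the range [0, 1], which is consistent with min-max scaling, although the actual `normalize_columns` code is not shown. A minimal sketch under that assumption (the name `min_max_normalize` and the toy values are hypothetical):

```python
import pandas as pd

def min_max_normalize(df, skip=()):
    """Scale every column not listed in `skip` to [0, 1] (assumed behaviour, not the utils code)."""
    out = df.copy()
    for col in out.columns:
        if col in skip:
            continue
        lo, hi = out[col].min(), out[col].max()
        # Constant columns are mapped to 0.0 to avoid dividing by zero
        out[col] = 0.0 if hi == lo else (out[col] - lo) / (hi - lo)
    return out

# Toy data (made-up values)
toy = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2019-01-01 00:00", "2019-01-01 01:00", "2019-01-01 02:00"]),
        "Agent_value": [29.0, 17.0, 23.0],
        "W_VEC_DIR": [0.0, 180.0, 360.0],
    }
)
scaled = min_max_normalize(toy, skip=["Date", "Agent_value"])
print(scaled)
```

When the data is later split into training and test sets, the minimum and maximum should be computed on the training portion only, so that no information from the test period leaks into the features.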


## Encode date and time information

We need to encode date and hour information to help the models learn that traffic is low at night and on weekends, and so on. We could also add a feature for holidays, if needed.

The options, ordered by the number of columns they require, are:
- one-hot encoding of the hour/day/month (does not account for day 31 being close to day 1)
- radial basis functions (more accurate; the source applies them only to months)
- sine/cosine (2 features for the months, 2 for the days, and so on)

Do we need to keep the year? It is not cyclical, so it should carry no useful information.

We might encode the months with radial basis functions and the days with sine/cosine, but it is not clear which combination works best. We start with sine/cosine because it is the simplest to apply.

*Source: [here](https://developer.nvidia.com/blog/three-approaches-to-encoding-time-information-as-features-for-ml-models/)*
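For comparison, the radial basis function option can be sketched in plain NumPy. This is an illustration only: `periodic_rbf`, the choice of one center per month, and the `width` value are assumptions, not what the notebook or the source article uses.

```python
import numpy as np

def periodic_rbf(values, period, n_centers, width=1.0):
    """Gaussian bumps placed evenly on a circle of length `period` (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    centers = np.arange(n_centers) * period / n_centers
    diff = np.abs(values[:, None] - centers[None, :]) % period
    # Circular distance: month 12 is one step away from month 1
    dist = np.minimum(diff, period - diff)
    return np.exp(-(dist ** 2) / (2 * width ** 2))

months = np.array([1, 6, 12])
features = periodic_rbf(months, period=12, n_centers=12)
print(features.shape)
```

Each month becomes one row with `n_centers` columns, so this encoding needs more columns than sine/cosine but fewer assumptions about the shape of the seasonal effect.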

In [13]:
encoded_giardini_margherita = {}
encoded_giardini_margherita['NO2'] = encode_date(normalized_giardini_margherita['NO2'])

In [14]:
encoded_giardini_margherita['NO2']

Unnamed: 0,Agent_value,Traffic_value,TAVG,PREC,RHAVG,RAD,W_SCAL_INT,W_VEC_DIR,W_VEC_INT,LEAFW,ET0,hour_sin,hour_cos,day_sin,day_cos,month_sin,month_cos
0,29.000000,0.062321,0.096698,0.0,0.911899,0.0,0.034247,0.579722,0.020690,0.0,0.0,0.000000,1.000000,2.012985e-01,0.97953,5.000000e-01,0.866025
1,17.142019,0.100078,0.080189,0.0,0.926773,0.0,0.034247,0.777778,0.013793,0.0,0.0,0.258819,0.965926,2.012985e-01,0.97953,5.000000e-01,0.866025
2,23.000000,0.090494,0.089623,0.0,0.905034,0.0,0.075342,0.439167,0.068966,0.0,0.0,0.500000,0.866025,2.012985e-01,0.97953,5.000000e-01,0.866025
3,29.000000,0.058422,0.082547,0.0,0.902746,0.0,0.047945,0.526111,0.027586,0.0,0.0,0.707107,0.707107,2.012985e-01,0.97953,5.000000e-01,0.866025
4,26.000000,0.036808,0.096698,0.0,0.843249,0.0,0.047945,0.480278,0.041379,0.0,0.0,0.866025,0.500000,2.012985e-01,0.97953,5.000000e-01,0.866025
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52580,33.000000,0.065271,0.219340,0.0,0.680778,0.0,0.136986,0.746944,0.110345,0.0,0.0,-0.866025,0.500000,-2.012985e-01,0.97953,-2.449294e-16,1.000000
52581,31.000000,0.035051,0.205189,0.0,0.726545,0.0,0.171233,0.716667,0.165517,0.0,0.0,-0.707107,0.707107,-2.012985e-01,0.97953,-2.449294e-16,1.000000
52582,25.000000,0.039110,0.207547,0.0,0.717391,0.0,0.191781,0.705833,0.186207,0.0,0.0,-0.500000,0.866025,-2.012985e-01,0.97953,-2.449294e-16,1.000000
52583,22.000000,0.062897,0.193396,0.0,0.726545,0.0,0.191781,0.713056,0.186207,0.0,0.0,-0.258819,0.965926,-2.012985e-01,0.97953,-2.449294e-16,1.000000
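The displayed columns are consistent with a sine/cosine encoding that uses periods of 24 (hours), 31 (days) and 12 (months). For example, `hour_sin` at 01:00 equals $\sin(2\pi \cdot 1/24) \approx 0.258819$. The sketch below reproduces those values under that assumption; `encode_date_sin_cos` is a hypothetical name and may differ from the real `encode_date` in `utils`.

```python
import numpy as np
import pandas as pd

def encode_date_sin_cos(df, date_col="Date"):
    """Replace the date column with sine/cosine pairs (assumed periods: 24, 31, 12)."""
    out = df.copy()
    dates = pd.to_datetime(out[date_col])
    for name, values, period in [
        ("hour", dates.dt.hour, 24),
        ("day", dates.dt.day, 31),
        ("month", dates.dt.month, 12),
    ]:
        angle = 2 * np.pi * values / period
        out[f"{name}_sin"] = np.sin(angle)
        out[f"{name}_cos"] = np.cos(angle)
    # The raw timestamp is no longer needed once it has been encoded
    return out.drop(columns=[date_col])

toy = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2019-01-01 00:00", "2019-01-01 01:00"]),
        "Agent_value": [29.0, 17.142019],
    }
)
encoded = encode_date_sin_cos(toy)
print(encoded)
```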


`Agent_value` is the target of the models, i.e. $y_{true}$; all the other columns are input features.
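Separating the target from the features could look like the sketch below. `split_features_target` is a hypothetical helper and the toy frame uses made-up values with only a few of the columns.

```python
import pandas as pd

def split_features_target(df, target="Agent_value"):
    """Split a frame into the feature matrix X and the target vector y (hypothetical helper)."""
    X = df.drop(columns=[target])
    y = df[target]
    return X, y

# Toy frame (made-up values)
toy = pd.DataFrame(
    {
        "Agent_value": [29.0, 17.0, 23.0],
        "Traffic_value": [0.06, 0.10, 0.09],
        "hour_sin": [0.0, 0.258819, 0.5],
    }
)
X, y = split_features_target(toy)
print(X.columns.tolist())
```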