# **Datasets Merging and Normalization**

For each station, find a good model to predict the individual pollutants.

In [None]:
%load_ext autoreload
%autoreload 2

from utils import *

datasets_folder = './datasets'
verbosity=2

## Input Data

We load the datasets with the techniques used in the corresponding notebooks.

- Air pollution

In [2]:
pollution_data = read_and_preprocess_dataset(datasets_folder, 'pollution', v=verbosity)

Stations found: GIARDINI MARGHERITA, PORTA SAN FELICE, VIA CHIARINI
Splitting station "GIARDINI MARGHERITA"...
Splitting station "PORTA SAN FELICE"...
Splitting station "VIA CHIARINI"...


In [3]:
display(pollution_data['GIARDINI MARGHERITA']['NO2'].iloc[:2])
display(pollution_data['GIARDINI MARGHERITA']['NO2'].iloc[-2:])

Date,Agent_value
2019-01-01 00:00:00,29.0
2019-01-01 02:00:00,23.0


Date,Agent_value
2024-12-31 22:00:00,22.0
2024-12-31 23:00:00,21.0


- Traffic

In [4]:
traffic_data = read_and_preprocess_dataset(datasets_folder, 'traffic', v=verbosity)

Merging readings files...
Merged 6 CSV files
Merging accuracies files...
Merged 6 CSV files
Location GIARDINI MARGHERITA: 44.482671138769533,11.35406170088398
 > Filtering close traffic data...
 > Summing up hour data...
Location PORTA SAN FELICE: 44.499059983334519,11.327526717440112
 > Filtering close traffic data...
 > Summing up hour data...
Location VIA CHIARINI: 44.499134335170289,11.285089594971216
 > Filtering close traffic data...
 > Summing up hour data...


In [5]:
display(traffic_data['GIARDINI MARGHERITA'].iloc[:2])
display(traffic_data['GIARDINI MARGHERITA'].iloc[-2:])

Date,Traffic_value
2019-01-01 00:00:00,10501.0
2019-01-01 01:00:00,16863.0


Date,Traffic_value
2024-12-31 22:00:00,4162.0
2024-12-31 23:00:00,3765.0


- Weather

In [6]:
weather_data = read_and_preprocess_dataset(datasets_folder, 'weather', v=verbosity)

Merging weather files...
Merged 6 CSV files


In [7]:
display(weather_data.iloc[:2])
display(weather_data.iloc[-2:])

Date,TAVG,PREC,RHAVG,RAD,W_SCAL_INT,W_VEC_DIR,LEAFW
2019-01-01 00:00:00,1.0,0.0,92.3,0.0,0.5,208.7,0.0
2019-01-01 01:00:00,0.3,0.0,93.6,0.0,0.5,280.0,0.0


Date,TAVG,PREC,RHAVG,RAD,W_SCAL_INT,W_VEC_DIR,LEAFW
2024-12-30 23:00:00,5.1,0.0,76.1,0.0,2.8,256.7,0.0
2024-12-31 00:00:00,5.1,0.0,75.0,0.0,2.8,258.3,0.0


**NOTE:** The last day of weather data (2024-12-31) is incomplete: the series ends at 2024-12-31 00:00:00.
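
Gaps like this one can be located by comparing the index against a complete hourly range. The following is a minimal sketch, assuming the frames carry a `DatetimeIndex` as in the outputs above; `missing_hours` is a hypothetical helper and is not part of `utils`.

```python
import pandas as pd

def missing_hours(df, start, end):
    """Return the hourly timestamps in [start, end] that are absent from df.index."""
    expected = pd.date_range(start=start, end=end, freq=pd.Timedelta(hours=1))
    return expected.difference(df.index)

# Toy frame mimicking the tail of the weather data (values copied from the output above).
toy_weather = pd.DataFrame(
    {"TAVG": [5.1, 5.1]},
    index=pd.to_datetime(["2024-12-30 23:00:00", "2024-12-31 00:00:00"]),
)

gaps = missing_hours(toy_weather, "2024-12-30 23:00:00", "2024-12-31 23:00:00")
print(gaps[0], len(gaps))
```

The same check, run on the pollution series, would also reveal isolated missing hours such as 2019-01-01 01:00:00.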

## Merge the datasets

We will merge the datasets on the `Date` column.

If the data to merge is hourly we can simply join the datasets on the indexes.
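
As a rough illustration of an index join, here is a sketch of what a helper like `join_datasets` might do. The real implementation lives in `utils` and is not shown here, so the left-join-then-`dropna` behaviour, the name `join_on_index` and the toy frames are all assumptions.

```python
import pandas as pd

def join_on_index(first, *others, dropna=False):
    """Left-join the other frames onto `first` by index; optionally drop incomplete rows."""
    merged = first
    for other in others:
        merged = merged.join(other, how="left")
    return merged.dropna() if dropna else merged

# Toy hourly frames: the pollution series lacks 01:00, the weather series lacks 02:00.
pollution = pd.DataFrame(
    {"Agent_value": [29.0, 23.0]},
    index=pd.to_datetime(["2019-01-01 00:00", "2019-01-01 02:00"]),
)
traffic = pd.DataFrame(
    {"Traffic_value": [10501.0, 16863.0, 15248.0]},
    index=pd.to_datetime(["2019-01-01 00:00", "2019-01-01 01:00", "2019-01-01 02:00"]),
)
weather = pd.DataFrame(
    {"TAVG": [1.0, 0.3]},
    index=pd.to_datetime(["2019-01-01 00:00", "2019-01-01 01:00"]),
)

with_gaps = join_on_index(pollution, traffic, weather)            # keeps the row with a NaN
complete = join_on_index(pollution, traffic, weather, dropna=True)  # drops it
print(complete)
```

A left join keeps only the timestamps of the target series, which matches the merged output below, where 01:00 is absent because the pollution series has no reading for it.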

In [8]:
merged_giardini_margherita = {}
merged_giardini_margherita['NO2'] = join_datasets(
    pollution_data['GIARDINI MARGHERITA']['NO2'],
    traffic_data['GIARDINI MARGHERITA'],
    weather_data,
    dropna=True # drop the last day (31-12-2024)
    )

In [9]:
merged_giardini_margherita['NO2'].head(3)

Date,Agent_value,Traffic_value,TAVG,PREC,RHAVG,RAD,W_SCAL_INT,W_VEC_DIR,LEAFW
2019-01-01 00:00:00,29.0,10501.0,1.0,0.0,92.3,0.0,0.5,208.7,0.0
2019-01-01 02:00:00,23.0,15248.0,0.7,0.0,91.7,0.0,1.1,158.1,0.0
2019-01-01 03:00:00,29.0,9844.0,0.4,0.0,91.5,0.0,0.7,189.4,0.0


As introduced in the [weather notebook](./3-weather_preprocessing.ipynb), for a daily agent such as *PM10* we should first aggregate the traffic and weather datasets to daily resolution.
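
One possible way to do the hourly-to-daily conversion is `resample`. This is only a sketch on toy data: the aggregation chosen for each column (sum for traffic and precipitation, mean for temperature) is an assumption, and the weather notebook may use different rules.

```python
import numpy as np
import pandas as pd

# Two days of toy hourly data.
hours = pd.date_range("2019-01-01", periods=48, freq=pd.Timedelta(hours=1))
traffic = pd.DataFrame({"Traffic_value": np.full(48, 100.0)}, index=hours)
weather = pd.DataFrame(
    {"TAVG": np.arange(48, dtype=float), "PREC": np.full(48, 0.5)},
    index=hours,
)

# Traffic counts add up over the day.
daily_traffic = traffic.resample("D").sum()

# Assumed per-column rules: average the temperature, accumulate the precipitation.
daily_weather = weather.resample("D").agg({"TAVG": "mean", "PREC": "sum"})

print(daily_traffic)
print(daily_weather)
```

Once everything is daily, the index join works exactly as in the hourly case.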

### Normalize the columns

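TODO: the wind direction (`W_VEC_DIR`) is an angle, so linear scaling puts nearly identical directions (e.g. 1° and 359°) far apart. Consider converting it to radians and encoding it with sine/cosine.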
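
A possible way to treat the wind direction as a cyclical quantity is sketched below. It assumes `W_VEC_DIR` is expressed in degrees (the values in the weather output suggest so); `encode_wind_direction` is a hypothetical helper, not part of `utils`.

```python
import numpy as np
import pandas as pd

def encode_wind_direction(df, column="W_VEC_DIR"):
    """Replace an angle in degrees with its sine and cosine components."""
    radians = np.deg2rad(df[column])
    out = df.drop(columns=[column])
    out[f"{column}_sin"] = np.sin(radians)
    out[f"{column}_cos"] = np.cos(radians)
    return out

toy = pd.DataFrame({"W_VEC_DIR": [0.0, 90.0, 360.0]})
encoded_wind = encode_wind_direction(toy)
print(encoded_wind.round(3))
```

With this encoding, 0° and 360° map to the same point, and both resulting columns already lie in [-1, 1].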

In [10]:
normalized_giardini_margherita={}
normalized_giardini_margherita['NO2'] = normalize_columns(merged_giardini_margherita['NO2'], skip=['Date', 'Agent_value'])

In [11]:
normalized_giardini_margherita['NO2'].head(3)

Date,Agent_value,Traffic_value,TAVG,PREC,RHAVG,RAD,W_SCAL_INT,W_VEC_DIR,LEAFW
2019-01-01 00:00:00,29.0,0.062321,0.096698,0.0,0.911899,0.0,0.034247,0.579722,0.0
2019-01-01 02:00:00,23.0,0.090494,0.089623,0.0,0.905034,0.0,0.075342,0.439167,0.0
2019-01-01 03:00:00,29.0,0.058422,0.082547,0.0,0.902746,0.0,0.047945,0.526111,0.0
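
The values above all lie in [0, 1], and for instance `W_VEC_DIR` 208.7 becomes 0.579722, which equals 208.7 / 360. This is consistent with min-max scaling, but `normalize_columns` is defined in `utils` and not shown, so the following is only a sketch under that assumption (`min_max_normalize` is a hypothetical name).

```python
import pandas as pd

def min_max_normalize(df, skip=()):
    """Scale every column not listed in `skip` to the [0, 1] range."""
    out = df.copy()
    for col in out.columns:
        if col in skip:
            continue
        lo, hi = out[col].min(), out[col].max()
        # A constant column has no range; map it to 0 to avoid dividing by zero.
        out[col] = 0.0 if hi == lo else (out[col] - lo) / (hi - lo)
    return out

toy = pd.DataFrame(
    {"Agent_value": [29.0, 23.0, 29.0], "W_VEC_DIR": [0.0, 208.7, 360.0]}
)
scaled = min_max_normalize(toy, skip=["Agent_value"])
print(scaled)
```

The target column is skipped so that predictions stay in the original pollutant units.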


## Encode date and time information

We need to encode the date and hour so that the models can learn, for example, that traffic is very low at night or on weekends. We could also add a holiday feature if needed.
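
Weekend and holiday flags could be derived directly from the index. This is a minimal sketch: `add_calendar_flags` is a hypothetical helper, and the holiday dates must be supplied by the caller (the list used here is only illustrative).

```python
import pandas as pd

def add_calendar_flags(df, holidays=()):
    """Add 0/1 columns marking weekends and user-supplied holiday dates."""
    out = df.copy()
    # dayofweek: Monday=0 ... Sunday=6
    out["is_weekend"] = (out.index.dayofweek >= 5).astype(int)
    holiday_dates = pd.to_datetime(list(holidays)).normalize()
    out["is_holiday"] = out.index.normalize().isin(holiday_dates).astype(int)
    return out

toy = pd.DataFrame(
    {"Traffic_value": [10501.0, 8000.0, 12000.0]},
    index=pd.to_datetime(
        ["2019-01-01 00:00", "2019-01-05 12:00", "2019-01-07 08:00"]
    ),
)
flagged = add_calendar_flags(toy, holidays=["2019-01-01"])
print(flagged)
```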

The options, in decreasing order of the number of columns required:
- one-hot encoding of the hour/day/month (does not capture that day 31 is close to day 1)
- radial basis functions (more accurate; applied only to months)
- sine/cosine (2 features for the months, 2 for the days, and so on)

Do we need to keep the year? It is not cyclical, so it should carry no useful information.

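We could encode the months with radial basis functions and the days with sine/cosine. I started with sine/cosine because it is easier; the method selected below, judging by its name, combines radial basis functions for months with sine/cosine for days and hours.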

*Source: [here](https://developer.nvidia.com/blog/three-approaches-to-encoding-time-information-as-features-for-ml-models/)*
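
The sine/cosine option can be sketched as follows. `sin_cos_encode` is a hypothetical helper; the actual `encode_date_index` lives in `utils` and is not shown, so treat this as an illustration of the technique rather than of that function.

```python
import numpy as np
import pandas as pd

def sin_cos_encode(values, period):
    """Map a cyclical quantity onto the unit circle (two features)."""
    angle = 2 * np.pi * np.asarray(values) / period
    return np.sin(angle), np.cos(angle)

index = pd.to_datetime(
    ["2019-01-01 00:00", "2019-01-01 02:00", "2019-01-01 03:00", "2024-12-30 20:00"]
)
hour_sin, hour_cos = sin_cos_encode(index.hour, period=24)
print(np.round(hour_sin, 6))
```

With a period of 24, hours 2, 3 and 20 give sines of 0.5, 0.707107 and -0.866025. The two features together place hour 23 next to hour 0, which a single integer column cannot do.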

In [None]:
method = 'radial_months-sin-cos_days_hours'

encoded_giardini_margherita = {}
encoded_giardini_margherita['NO2'] = encode_date_index(normalized_giardini_margherita['NO2'],method=method)

In [13]:
encoded_giardini_margherita['NO2']

Date,Agent_value,Traffic_value,TAVG,PREC,RHAVG,RAD,W_SCAL_INT,W_VEC_DIR,LEAFW,hour_sin,...,month_rbf_3,month_rbf_4,month_rbf_5,month_rbf_6,month_rbf_7,month_rbf_8,month_rbf_9,month_rbf_10,month_rbf_11,month_rbf_12
2019-01-01 00:00:00,29.0,0.062321,0.096698,0.0,0.911899,0.0,0.034247,0.579722,0.0,0.000000,...,3.354626e-04,1.522998e-08,1.266417e-14,1.928750e-22,5.380186e-32,1.928750e-22,1.266417e-14,1.522998e-08,0.000335,0.135335
2019-01-01 02:00:00,23.0,0.090494,0.089623,0.0,0.905034,0.0,0.075342,0.439167,0.0,0.500000,...,3.354626e-04,1.522998e-08,1.266417e-14,1.928750e-22,5.380186e-32,1.928750e-22,1.266417e-14,1.522998e-08,0.000335,0.135335
2019-01-01 03:00:00,29.0,0.058422,0.082547,0.0,0.902746,0.0,0.047945,0.526111,0.0,0.707107,...,3.354626e-04,1.522998e-08,1.266417e-14,1.928750e-22,5.380186e-32,1.928750e-22,1.266417e-14,1.522998e-08,0.000335,0.135335
2019-01-01 04:00:00,26.0,0.036808,0.096698,0.0,0.843249,0.0,0.047945,0.480278,0.0,0.866025,...,3.354626e-04,1.522998e-08,1.266417e-14,1.928750e-22,5.380186e-32,1.928750e-22,1.266417e-14,1.522998e-08,0.000335,0.135335
2019-01-01 05:00:00,24.0,0.028297,0.127358,0.0,0.767735,0.0,0.123288,0.419444,0.0,0.965926,...,3.354626e-04,1.522998e-08,1.266417e-14,1.928750e-22,5.380186e-32,1.928750e-22,1.266417e-14,1.522998e-08,0.000335,0.135335
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2024-12-30 20:00:00,33.0,0.065271,0.219340,0.0,0.680778,0.0,0.136986,0.746944,0.0,-0.866025,...,1.522998e-08,1.266417e-14,1.928750e-22,5.380186e-32,1.928750e-22,1.266417e-14,1.522998e-08,3.354626e-04,0.135335,1.000000
2024-12-30 21:00:00,31.0,0.035051,0.205189,0.0,0.726545,0.0,0.171233,0.716667,0.0,-0.707107,...,1.522998e-08,1.266417e-14,1.928750e-22,5.380186e-32,1.928750e-22,1.266417e-14,1.522998e-08,3.354626e-04,0.135335,1.000000
2024-12-30 22:00:00,25.0,0.039110,0.207547,0.0,0.717391,0.0,0.191781,0.705833,0.0,-0.500000,...,1.522998e-08,1.266417e-14,1.928750e-22,5.380186e-32,1.928750e-22,1.266417e-14,1.522998e-08,3.354626e-04,0.135335,1.000000
2024-12-30 23:00:00,22.0,0.062897,0.193396,0.0,0.726545,0.0,0.191781,0.713056,0.0,-0.258819,...,1.522998e-08,1.266417e-14,1.928750e-22,5.380186e-32,1.928750e-22,1.266417e-14,1.522998e-08,3.354626e-04,0.135335,1.000000
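
The `month_rbf_*` values above are consistent with a Gaussian kernel on the cyclic distance between months, `exp(-2 * d**2)`: a January row gives 0.135335 for month 12 (d = 1) and 3.354626e-04 for month 3 (d = 2). The sketch below reproduces those numbers. The kernel width and the name `month_rbf` are inferred from the output, not taken from `utils`.

```python
import numpy as np

def month_rbf(month, n_months=12, alpha=2.0):
    """Gaussian RBF features for a month (1-12), one per month centre, with wrap-around."""
    centers = np.arange(1, n_months + 1)
    diff = np.abs(month - centers)
    # Cyclic distance: December and January are one month apart, not eleven.
    cyclic = np.minimum(diff, n_months - diff)
    return np.exp(-alpha * cyclic ** 2)

january = month_rbf(1)
december = month_rbf(12)
print(january[11], january[2])
```

Each row thus gets 12 features that peak at 1.0 for its own month and decay quickly for distant ones.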


`Agent_value` is the target and must be treated as $y_{true}$.
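
Separating the target from the features could look like the sketch below. `split_features_target` is a hypothetical helper and the frame is toy data with values copied from the output above.

```python
import pandas as pd

def split_features_target(df, target="Agent_value"):
    """Return (X, y): every column except the target, and the target itself."""
    return df.drop(columns=[target]), df[target]

toy = pd.DataFrame(
    {
        "Agent_value": [29.0, 23.0],
        "Traffic_value": [0.062321, 0.090494],
        "hour_sin": [0.0, 0.5],
    },
    index=pd.to_datetime(["2019-01-01 00:00", "2019-01-01 02:00"]),
)
X, y_true = split_features_target(toy)
print(X.columns.tolist(), y_true.tolist())
```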