### Time Series Workshop 
# 3. Air Pollutants Forecasting

In this notebook, we will analyse time series data on pollutant concentration.

## Dataset synopsis

We will work with the Air Quality Dataset from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Air+Quality).

- The dataset contains hourly recordings of air pollutant concentrations in an Italian city.

- For the sake of simplicity, we will limit our work to the variables **relative humidity** (humidity) and measured **carbon monoxide concentration** (co_sensor) in mg/m^3.

- This dataset is a bit challenging because:
  - Timestamps are not equidistant.
  - Entire days of recordings are missing, probably due to data collection failure.
  - There are also invalid readings (recorded as negative values) wherever the sensors did not manage to obtain a measurement of humidity or CO concentration.
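
Before processing, it can help to quantify these issues. The following is a minimal sketch on synthetic data (the toy timestamps and the `-200.0` placeholder are illustrative assumptions, not rows from the real file): it measures the spacing between consecutive timestamps and counts negative readings.

```python
import pandas as pd

# Toy index with one missing hour and a two-day gap (synthetic, not the real dataset)
idx = pd.to_datetime(
    [
        "2004-04-04 00:00",
        "2004-04-04 01:00",
        "2004-04-04 03:00",  # 02:00 is missing
        "2004-04-06 03:00",  # two full days later
    ]
)
toy = pd.DataFrame({"co_sensor": [1224.0, 1215.0, 1124.0, -200.0]}, index=idx)

# Time elapsed between consecutive rows; equidistant data would show a single value
gaps = toy.index.to_series().diff()
gap_counts = gaps.value_counts()

# Rows that follow a gap longer than the nominal one-hour sampling interval
irregular = gaps[gaps > pd.Timedelta(hours=1)]

# Negative readings are treated as failed measurements
n_invalid = int((toy["co_sensor"] < 0).sum())

print(gap_counts)
print(irregular)
print(n_invalid)
```

Running the same checks on the loaded data frame should indicate how many gaps and invalid readings the later steps have to cope with.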

In [1]:
%config InlineBackend.figure_format='retina'
%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path
from timeseries.data import load_air_quality

DATA_DIR = Path("..") / Path("data")

## Load and process data

In [2]:
FILE_PATH = DATA_DIR / "air_quality.csv"

df_in = load_air_quality(FILE_PATH)
df_in.head()

variables = ["co_sensor", "humidity"]
df_in = df_in[variables].copy()

# Keep only rows where both variables are non-negative;
# negative values mark failed measurements
for var in variables:
    df_in = df_in[df_in[var] >= 0]

df_in.head()

| date_time | co_sensor | humidity |
|---|---|---|
| 2004-04-04 00:00:00 | 1224.0 | 56.5 |
| 2004-04-04 01:00:00 | 1215.0 | 59.2 |
| 2004-04-04 02:00:00 | 1115.0 | 62.4 |
| 2004-04-04 03:00:00 | 1124.0 | 65.0 |
| 2004-04-04 04:00:00 | 1028.0 | 65.3 |


## Time-related features

In [8]:
df = df_in.copy()

df["month"] = df.index.month
df["week"] = df.index.isocalendar().week
df["day"] = df.index.day
df["day_of_week"] = df.index.day_of_week
df["hour"] = df.index.hour
df["is_weekend"] = np.where(df["day_of_week"]>4, 1, 0)
df.head()

| date_time | co_sensor | humidity | month | week | day | day_of_week | hour | is_weekend |
|---|---|---|---|---|---|---|---|---|
| 2004-04-04 00:00:00 | 1224.0 | 56.5 | 4 | 14 | 4 | 6 | 0 | 1 |
| 2004-04-04 01:00:00 | 1215.0 | 59.2 | 4 | 14 | 4 | 6 | 1 | 1 |
| 2004-04-04 02:00:00 | 1115.0 | 62.4 | 4 | 14 | 4 | 6 | 2 | 1 |
| 2004-04-04 03:00:00 | 1124.0 | 65.0 | 4 | 14 | 4 | 6 | 3 | 1 |
| 2004-04-04 04:00:00 | 1028.0 | 65.3 | 4 | 14 | 4 | 6 | 4 | 1 |
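
The `is_weekend` flag relies on pandas numbering the days of the week from Monday (0) to Sunday (6), so values above 4 are Saturday and Sunday. A small sketch on a toy one-week index (the `demo` frame is hypothetical and only serves to illustrate the convention):

```python
import numpy as np
import pandas as pd

# One week of daily timestamps starting on Monday, 2004-04-05 (toy data)
idx = pd.date_range("2004-04-05", periods=7, freq="D")
demo = pd.DataFrame(index=idx)

demo["day_name"] = demo.index.day_name()
demo["day_of_week"] = demo.index.day_of_week  # Monday=0, ..., Sunday=6
demo["is_weekend"] = np.where(demo["day_of_week"] > 4, 1, 0)

print(demo)
```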


## Lag features
Lag features are past values of the variable that we can use to predict future values.

Here, we will use the following lag features to predict the next hour's pollutant concentration:
- The pollutant concentration for the previous three hours (t-1, t-2, t-3).
- The pollutant concentration for the same hour on the previous day (t-24).

The reasoning behind this is that pollutant concentrations do not change quickly and, as previously demonstrated, have a 24-hour seasonality.
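
Because the timestamps are not equidistant, it matters whether a lag is taken by row position or by elapsed time. The sketch below uses a synthetic three-row series (values and timestamps are made up for illustration) to contrast the two: a row-based shift silently uses a two-hour-old value after a gap, while a time-based shift followed by a left merge yields a missing value instead.

```python
import pandas as pd

# Toy series with the 02:00 reading missing (synthetic values)
idx = pd.to_datetime(["2004-04-04 00:00", "2004-04-04 01:00", "2004-04-04 03:00"])
s = pd.DataFrame({"co_sensor": [10.0, 20.0, 40.0]}, index=idx)

# Row-based lag: takes the previous row, however long ago it was recorded
lag_by_row = s["co_sensor"].shift(1)

# Time-based lag: move the timestamps 1 hour forward, then align on the original index
shifted = s[["co_sensor"]].shift(freq=pd.Timedelta(hours=1))
shifted.columns = ["co_sensor_lag_1"]
lag_by_time = s.merge(shifted, left_index=True, right_index=True, how="left")

print(lag_by_row)
print(lag_by_time)
```

At 03:00 the row-based lag reports the 01:00 value, whereas the time-based lag is NaN because no reading exists for 02:00.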

In [13]:
df_lags = df.copy()

for var in variables:
    for h in [1, 2, 3, 24]:
        tmp = df_lags[[var]].shift(freq=f"{h}H")
        tmp.columns = [f"{var}_lag_{h}"]
        df_lags = df_lags.merge(tmp, left_index=True, right_index=True, how="left")


df_lags.head()

| date_time | co_sensor | humidity | month | week | day | day_of_week | hour | is_weekend | co_sensor_lag_1 | co_sensor_lag_2 | co_sensor_lag_3 | co_sensor_lag_24 | humidity_lag_1 | humidity_lag_2 | humidity_lag_3 | humidity_lag_24 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2004-04-04 00:00:00 | 1224.0 | 56.5 | 4 | 14 | 4 | 6 | 0 | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2004-04-04 01:00:00 | 1215.0 | 59.2 | 4 | 14 | 4 | 6 | 1 | 1 | 1224.0 | NaN | NaN | NaN | 56.5 | NaN | NaN | NaN |
| 2004-04-04 02:00:00 | 1115.0 | 62.4 | 4 | 14 | 4 | 6 | 2 | 1 | 1215.0 | 1224.0 | NaN | NaN | 59.2 | 56.5 | NaN | NaN |
| 2004-04-04 03:00:00 | 1124.0 | 65.0 | 4 | 14 | 4 | 6 | 3 | 1 | 1115.0 | 1215.0 | 1224.0 | NaN | 62.4 | 59.2 | 56.5 | NaN |
| 2004-04-04 04:00:00 | 1028.0 | 65.3 | 4 | 14 | 4 | 6 | 4 | 1 | 1124.0 | 1115.0 | 1215.0 | NaN | 65.0 | 62.4 | 59.2 | NaN |


In [19]:
# Sanity check for the first 3 hour lags:
df_lags[["co_sensor","co_sensor_lag_1","co_sensor_lag_2","co_sensor_lag_3"]].head()

| date_time | co_sensor | co_sensor_lag_1 | co_sensor_lag_2 | co_sensor_lag_3 |
|---|---|---|---|---|
| 2004-04-04 00:00:00 | 1224.0 | NaN | NaN | NaN |
| 2004-04-04 01:00:00 | 1215.0 | 1224.0 | NaN | NaN |
| 2004-04-04 02:00:00 | 1115.0 | 1215.0 | 1224.0 | NaN |
| 2004-04-04 03:00:00 | 1124.0 | 1115.0 | 1215.0 | 1224.0 |
| 2004-04-04 04:00:00 | 1028.0 | 1124.0 | 1115.0 | 1215.0 |


In [18]:
# Sanity check for the 24 hour lag:
df_lags[["co_sensor", "co_sensor_lag_24"]].head(26)

| date_time | co_sensor | co_sensor_lag_24 |
|---|---|---|
| 2004-04-04 00:00:00 | 1224.0 | NaN |
| 2004-04-04 01:00:00 | 1215.0 | NaN |
| 2004-04-04 02:00:00 | 1115.0 | NaN |
| 2004-04-04 03:00:00 | 1124.0 | NaN |
| 2004-04-04 04:00:00 | 1028.0 | NaN |
| 2004-04-04 05:00:00 | 1010.0 | NaN |
| 2004-04-04 06:00:00 | 1074.0 | NaN |
| 2004-04-04 07:00:00 | 1034.0 | NaN |
| 2004-04-04 08:00:00 | 1130.0 | NaN |
| 2004-04-04 09:00:00 | 1275.0 | NaN |


## Window features
Window features aggregate a variable's values over a pre-defined time window and use the result as a predictor for the current value.

Here, we will:
- Use a rolling window of 5 hours.
- Compute the mean, min, max, and standard deviation of our variables within this window.
- Shift the result one hour forward so that it serves as a predictor for the next hour.
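
The effect of the shift is easiest to see on a tiny example. This sketch uses six synthetic hourly values (not taken from the dataset) and assumes pandas' default behaviour for time-based windows, where the window ends at, and includes, the current row. Without the shift, the current value would leak into its own predictor.

```python
import pandas as pd

# Six consecutive hourly readings with values 1..6 (synthetic)
idx = pd.date_range("2004-04-04 00:00", periods=6, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({"co_sensor": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}, index=idx)

# 5-hour time-based window; by default it includes the current row
window = toy.rolling(window=pd.Timedelta(hours=5)).agg(["mean", "max"])

# Move the aggregates 1 hour forward so each row only sees earlier readings
window_shifted = window.shift(freq=pd.Timedelta(hours=1))

ts = pd.Timestamp("2004-04-04 05:00")
print(window.loc[ts])          # uses 01:00-05:00, including the 05:00 value
print(window_shifted.loc[ts])  # uses 00:00-04:00 only
```

Passing `pd.Timedelta` objects rather than alias strings is a choice made here to avoid depending on how a given pandas version spells the hourly alias.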

In [25]:
tmp = (
    df_lags[variables]
    .rolling(window="5H")
    .agg(["mean", "min", "max", "std"])  # Aggregate functions over the span of the window
    .shift(freq="1H")  # Move the aggregates 1 hour forward
)

# Rename the columns
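# TODO: Doesn't work yet: after .agg() the columns are a two-level
# (variable, statistic) MultiIndex with 8 entries, not one per variable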
# TODO: Cleanup data load for airline passengers and sunspots data
#tmp.columns = [v + "_window" for v in variables]

In [27]:
tmp.tail()

| date_time | co_sensor (mean) | co_sensor (min) | co_sensor (max) | co_sensor (std) | humidity (mean) | humidity (min) | humidity (max) | humidity (std) |
|---|---|---|---|---|---|---|---|---|
| 2005-04-04 11:00:00 | 1294.4 | 1031.0 | 1446.0 | 158.730274 | 47.9 | 29.3 | 63.1 | 15.071165 |
| 2005-04-04 12:00:00 | 1320.8 | 1163.0 | 1446.0 | 106.281231 | 40.02 | 23.7 | 61.9 | 15.432822 |
| 2005-04-04 13:00:00 | 1272.4 | 1142.0 | 1446.0 | 123.940712 | 31.3 | 18.3 | 48.9 | 11.890332 |
| 2005-04-04 14:00:00 | 1183.8 | 1003.0 | 1314.0 | 127.116875 | 24.22 | 13.5 | 36.3 | 8.971733 |
| 2005-04-04 15:00:00 | 1138.6 | 1003.0 | 1314.0 | 116.543125 | 19.58 | 13.1 | 29.3 | 6.929069 |
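
The aggregated frame has two-level column labels (variable, statistic), which is why assigning one name per variable fails. One possible way to flatten them is sketched below on synthetic data; the `<variable>_window_<statistic>` naming scheme is an assumption for illustration, not something the notebook has settled on.

```python
import pandas as pd

# Three synthetic hourly readings for both variables
idx = pd.date_range("2005-04-04 11:00", periods=3, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame(
    {"co_sensor": [1.0, 2.0, 3.0], "humidity": [40.0, 50.0, 60.0]}, index=idx
)

agg = (
    toy.rolling(window=pd.Timedelta(hours=5))
    .agg(["mean", "max"])
    .shift(freq=pd.Timedelta(hours=1))
)

# Each column label is a (variable, statistic) tuple; join the parts into one name
flat = agg.copy()
flat.columns = [f"{var}_window_{stat}" for var, stat in agg.columns]

# With flat names, the window features can be merged like the lag features
joined = toy.merge(flat, left_index=True, right_index=True, how="left")

print(joined.columns.tolist())
```

The first row of `joined` has no window features, since no earlier readings exist to aggregate.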
