In [1]:
%load_ext nb_black


Welcome to lab session 4 on **time series modeling for air pollution monitoring, with a focus on the
calibration of low-cost sensors.**

This lab session is based on the data and methods provided in the study by [Ellen M. Considine et al.](https://www.sciencedirect.com/science/article/pii/S0269749120365222)


In this notebook, we will focus on improving our modeling pipeline by adding cross validation.

The question we intend to answer here is: how can we improve the experiment pipeline presented in the LESSON 3 notebook?

To this end, we present leave-one-location-out (LOLO) cross validation. It helps us understand how well our model generalises to new locations over the same time period covered by our training data.

The idea is to split our training data into training and validation by location.

**Step-by-step process**

- Iterate over the monitor locations
- For each location,
    - Select data for that location as validation data and deselect these data from training.
    - Fit your model on the resulting training data and predict over the validation location.
    - Check the model error on validation data
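
The steps above can be sketched without any modeling library. This is a minimal illustration on toy labels, not the notebook's actual pipeline: the helper name `leave_one_location_out` and the `labels` list are hypothetical, and the function only produces the row indices for each split.

```python
from collections import defaultdict


def leave_one_location_out(location_labels):
    """Yield (held_out, train_idx, val_idx) once per distinct location.

    Hypothetical helper for illustration: val_idx holds the rows of the
    held-out location, train_idx holds every other row.
    """
    rows_by_location = defaultdict(list)
    for idx, loc in enumerate(location_labels):
        rows_by_location[loc].append(idx)

    for held_out, val_idx in rows_by_location.items():
        train_idx = [i for i, loc in enumerate(location_labels) if loc != held_out]
        yield held_out, train_idx, val_idx


# Toy location labels, one per row of an imagined data set
labels = ["NJH", "NJH", "la_casa", "i25_glo_1", "la_casa"]

splits = list(leave_one_location_out(labels))
for held_out, train_idx, val_idx in splits:
    print(held_out, "train rows:", train_idx, "validation rows:", val_idx)
```

Each location appears as the validation set exactly once, and its rows never appear in the matching training set. scikit-learn offers the same splitting logic as `sklearn.model_selection.LeaveOneGroupOut`, which could be used instead of a hand-written loop.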

# First, let's import the libraries we will be using

In [1]:
import math

from tqdm.auto import tqdm

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

In [2]:
import warnings

warnings.filterwarnings("ignore")

# Load the data

In [3]:
data_root = "./data/"
training_data_path = data_root + "cleaned_training.csv"
test_data_path = data_root + "cleaned_test.csv"

In [4]:
training_data = pd.read_csv(training_data_path)
test_data = pd.read_csv(test_data_path)

In [5]:
training_data.head()

Unnamed: 0,airnow_sensor,longitude,latitude,a_road_500,pm_cs,temp,humidity,pm_airnow,date_time,cs_sensor,time,month,weekend,sin_time,cos_time,sin_month,cos_month
0,NJH,-104.939925,39.738578,1.995,33.6167,60.038,65.085,22.1,2018-08-20 01:00:00,NJH,1,8,0,0.220297,0.141451,0.518027,-0.076184
1,NJH,-104.939925,39.738578,1.995,38.2333,58.517,67.115,27.1,2018-08-20 02:00:00,NJH,2,8,0,0.238054,-0.108947,0.518027,-0.076184
2,NJH,-104.939925,39.738578,1.995,43.35,57.383,71.1717,28.85,2018-08-20 03:00:00,NJH,3,8,0,0.036945,-0.259179,0.518027,-0.076184
3,NJH,-104.939925,39.738578,1.995,48.7,56.546,74.7317,34.6,2018-08-20 04:00:00,NJH,4,8,0,-0.19813,-0.171123,0.518027,-0.076184
4,NJH,-104.939925,39.738578,1.995,39.25,55.682,79.2033,31.5,2018-08-20 05:00:00,NJH,5,8,0,-0.251046,0.074263,0.518027,-0.076184


# Utility functions

In [6]:
# More evaluation metrics can be added to the function
def evaluate_model(y, y_hat):
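    # Take the square root explicitly: the `squared` argument of
    # mean_squared_error is deprecated in recent scikit-learn releases
    return {"RMSE": round(math.sqrt(mean_squared_error(y, y_hat)), 2)}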
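
As the comment in the cell above suggests, more metrics can be added. The sketch below is one possible extension, not part of the original pipeline: the function name `evaluate_model_extended` and the choice of MAE and mean bias are assumptions made for illustration, and the metrics are computed directly with NumPy.

```python
import numpy as np


def evaluate_model_extended(y, y_hat):
    """Return RMSE, MAE and mean bias, each rounded to 2 decimals.

    Hypothetical extension of evaluate_model, for illustration only.
    """
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    residuals = y_hat - y  # positive values mean over-prediction

    return {
        "RMSE": round(float(np.sqrt(np.mean(residuals ** 2))), 2),
        "MAE": round(float(np.mean(np.abs(residuals))), 2),
        "bias": round(float(np.mean(residuals)), 2),
    }


# Toy reference values and predictions
metrics = evaluate_model_extended([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(metrics)
```

RMSE penalises large errors more heavily than MAE, while the bias shows whether the model tends to over- or under-predict on the held-out location.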


Now we need to build the validation data. We start by listing the model features.

In [7]:
features = [
    "pm_cs",
    "temp",
    "humidity",
    "a_road_500",
    "sin_time",
    "cos_time",
    "sin_month",
    "cos_month",
]

# This is tagged model_4 in our last notebook

In [8]:
lolo_validation_errors = {}
locations = training_data["cs_sensor"].unique()

for leave_sensor in tqdm(locations, total=len(locations)):

    train = training_data[training_data["cs_sensor"] != leave_sensor]
    validation = training_data[training_data["cs_sensor"] == leave_sensor]

    model = RandomForestRegressor()

    x_train, y_train = train[features], train["pm_airnow"]
    x_val, y_val = validation[features], validation["pm_airnow"]

    model.fit(x_train, y_train)

    y_hat_val = model.predict(x_val)

    error = evaluate_model(y_val, y_hat_val)
    lolo_validation_errors[leave_sensor] = error["RMSE"]





The location names in the table below indicate the location that was left out of training and used only to obtain the validation error.

| Location | Baseline RMSE | CV random forest RMSE |
|---|---|---|
| **Train** | --- | --- |
| NJH | 4.36 | 2.26 |
| i25_glo_1 | 6.67 | 3.13 |
| i25_glo_2 | 4.55 | 2.72 |
| i25_glo_3 | 5.41 | 2.37 |
| la_casa | 6.06 | 2.4 |

  

Leaving the I-25 Globeville data out increases our validation error: this monitor location accounts for three CS sensors, so removing it excludes a large share of samples relative to the full size of our data.

Ideally, LOLO cross validation should be applied at the model evaluation and selection step of our previous notebook.

We can represent our training performance in terms of the mean and standard deviation of all the cross validation errors as shown below.

In [9]:
print("error mean: ", round(np.mean(list(lolo_validation_errors.values())), 2))
print("error std: ", round(np.std(list(lolo_validation_errors.values())), 2))

error mean:  2.58
error std:  0.32
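
The summary above can be reproduced by hand from the per-location RMSE values reported in the table. The sketch below uses only the standard library and copies those values into a hypothetical `fold_rmse` dictionary. It also shows the sample standard deviation for comparison, since `np.std` defaults to the population form (`ddof=0`).

```python
import statistics

# Per-location validation RMSE, copied from the table above
fold_rmse = {
    "NJH": 2.26,
    "i25_glo_1": 3.13,
    "i25_glo_2": 2.72,
    "i25_glo_3": 2.37,
    "la_casa": 2.4,
}
values = list(fold_rmse.values())

mean_rmse = round(statistics.mean(values), 2)
population_std = round(statistics.pstdev(values), 2)  # matches np.std default
sample_std = round(statistics.stdev(values), 2)  # divides by n - 1 instead of n

print("error mean:", mean_rmse)
print("error std (population):", population_std)
print("error std (sample):", sample_std)
```

With only five folds, the two standard deviations differ noticeably, so it is worth stating which one is reported.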


This tells us that our training RMSE of `0.85` when we use all the locations in training is too optimistic, especially for the case of generalizing to new locations over the same time period of our training data.

A simple way to combine these cross validated models for test/inference would be to average their outputs:


$\text{final prediction} = \frac{\text{prediction}_1 + \text{prediction}_2 + \text{prediction}_3 + \dots + \text{prediction}_n}{n}$
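
The averaging described above can be sketched with NumPy. This is an illustration under stated assumptions: the helper `average_predictions` is hypothetical, and the three arrays stand in for the outputs of models fitted on different cross validation folds rather than for real model predictions.

```python
import numpy as np


def average_predictions(fold_predictions):
    """Average per-model predictions element-wise.

    fold_predictions: list of 1-D arrays, one per model, all of the
    same length (one value per test sample).
    """
    stacked = np.vstack(fold_predictions)  # shape: (n_models, n_samples)
    return stacked.mean(axis=0)


# Stand-in predictions from three fold models on three test samples
fold_predictions = [
    np.array([10.0, 20.0, 30.0]),
    np.array([12.0, 18.0, 33.0]),
    np.array([14.0, 22.0, 27.0]),
]

final_prediction = average_predictions(fold_predictions)
print(final_prediction)
```

In the notebook's setting, this would require keeping each fitted model from the cross validation loop (for example in a list) and collecting the result of calling `predict` on the test features for each one before averaging.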