# Cross-Validating across Time using Multiple Linear Regression

A major challenge in modeling data collected over time is that consecutive entries are likely to be correlated. This is a problem because fitting a linear regression model assumes that the observations are independent of one another. To work around this, we can instead evaluate the model by fitting it to the first subset of observations (ordered by time), forecasting the next observation, and recording the mean squared error. We can repeat this process for successive subsets and sum the mean squared errors, which gives us a metric by which we can "cross-validate" the model!
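The procedure described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not necessarily the implementation used here: the helper names `fit_slr` and `expanding_window_error`, the `first_train_size` parameter, and the toy data are all hypothetical, and the sketch uses a closed-form one-predictor least-squares fit rather than scikit-learn. Because each step forecasts a single observation, the error recorded at each step is simply the squared error of that forecast.

```python
from statistics import mean


def fit_slr(x, y):
    """Closed-form least-squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def expanding_window_error(x, y, first_train_size):
    """Sum of squared one-step-ahead forecast errors over a growing training set."""
    total = 0.0
    for t in range(first_train_size, len(y)):
        b0, b1 = fit_slr(x[:t], y[:t])   # train on the first t observations
        forecast = b0 + b1 * x[t]        # forecast the next observation
        total += (y[t] - forecast) ** 2  # record its squared error
    return total


# Made-up, time-ordered toy data (not taken from the air quality dataset)
x_toy = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_toy = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]
print(expanding_window_error(x_toy, y_toy, first_train_size=3))
```

Note that the model never sees an observation from the future when it makes a forecast, which is what distinguishes this scheme from ordinary shuffled k-fold cross-validation.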

We will write an algorithm to demonstrate this process using Python. The data for this example is the Air Quality dataset from the UC Irvine Machine Learning Repository. The observations were collected between March 2004 and February 2005 using metal oxide chemical sensors embedded in a device located in a significantly polluted area in Italy. More information on the dataset is available [here](https://archive.ics.uci.edu/dataset/360/air+quality).

We'll go ahead and read in the data now:

In [49]:
#Import necessary modules
import pandas as pd
from sklearn import linear_model

# Read in the dataset from the UCI Machine Learning Repository library
!pip install ucimlrepo
import ucimlrepo as uci

air_quality = uci.fetch_ucirepo(id=360)
aq = air_quality.data.features
aq.head()



| | Date | Time | CO(GT) | PT08.S1(CO) | NMHC(GT) | C6H6(GT) | PT08.S2(NMHC) | NOx(GT) | PT08.S3(NOx) | NO2(GT) | PT08.S4(NO2) | PT08.S5(O3) | T | RH | AH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3/10/2004 | 18:00:00 | 2.6 | 1360 | 150 | 11.9 | 1046 | 166 | 1056 | 113 | 1692 | 1268 | 13.6 | 48.9 | 0.7578 |
| 1 | 3/10/2004 | 19:00:00 | 2.0 | 1292 | 112 | 9.4 | 955 | 103 | 1174 | 92 | 1559 | 972 | 13.3 | 47.7 | 0.7255 |
| 2 | 3/10/2004 | 20:00:00 | 2.2 | 1402 | 88 | 9.0 | 939 | 131 | 1140 | 114 | 1555 | 1074 | 11.9 | 54.0 | 0.7502 |
| 3 | 3/10/2004 | 21:00:00 | 2.2 | 1376 | 80 | 9.2 | 948 | 172 | 1092 | 122 | 1584 | 1203 | 11.0 | 60.0 | 0.7867 |
| 4 | 3/10/2004 | 22:00:00 | 1.6 | 1272 | 51 | 6.5 | 836 | 131 | 1205 | 116 | 1490 | 1110 | 11.2 | 59.6 | 0.7888 |


Before the models can be evaluated, we will need to clean up this data a little bit. First, there are missing values that we need to get rid of in some of the parameters we will be working with: `C6H6(GT)`, `CO(GT)`, `T`, `RH`, and `AH`.

In [47]:
#Clear out missing values (corresponding to -200 in this dataset)
#.copy() makes the result an independent DataFrame, so adding a column below does not raise SettingWithCopyWarning
aq_no_missing = aq[(aq['C6H6(GT)'] != -200) & (aq['CO(GT)'] != -200) & (aq['T'] != -200) & (aq['RH'] != -200) & (aq['AH'] != -200)].copy()

#Convert the Date column to datetime
#This ensures the final data is in chronological order
#https://pandas.pydata.org/docs/user_guide/timeseries.html
aq_no_missing['Date_DT'] = pd.to_datetime(aq_no_missing['Date'])
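When several columns share the same missing-value sentinel, the row filter can also be written over a list of column names. The sketch below is one possible approach on a made-up frame: the names `toy`, `cols`, `keep`, and `toy_clean` are hypothetical, and only two of the five columns are included to keep it short. Taking an explicit `.copy()` of the filtered rows means that a column added afterwards does not trigger pandas' `SettingWithCopyWarning`.

```python
import pandas as pd

# Hypothetical toy frame; -200 marks a missing reading, as in the dataset
toy = pd.DataFrame({
    "Date": ["3/10/2004", "3/10/2004", "3/11/2004", "3/11/2004"],
    "CO(GT)": [2.6, -200.0, 2.2, 1.6],
    "T": [13.6, 13.3, -200.0, 11.2],
})
cols = ["CO(GT)", "T"]

# Keep only rows where none of the selected columns equals the sentinel
keep = (toy[cols] != -200).all(axis=1)
toy_clean = toy.loc[keep].copy()  # independent frame, safe to modify

# Adding a column to the copy leaves the original frame untouched
toy_clean["Date_DT"] = pd.to_datetime(toy_clean["Date"], format="%m/%d/%Y")
print(toy_clean)
```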


Next, the values we wish to work with are the daily averages of each variable of interest, so we will group the data by date (`Date_DT`) and return the mean of each of the five aforementioned variables for every day. Note that we expect 347 observations after completing this operation.

In [48]:
#Group data by date, return the average across each day
aq_avg = aq_no_missing.groupby('Date_DT')[['C6H6(GT)', 'CO(GT)','T','RH','AH']].mean()

#Append a day-number column so the data is a bit easier to iterate over
aq_avg['Day'] = range(1, len(aq_avg) + 1)
aq_avg

| Date_DT | C6H6(GT) | CO(GT) | T | RH | AH | Day |
|---|---|---|---|---|---|---|
| 2004-03-10 | 8.450000 | 1.966667 | 12.033333 | 54.900000 | 0.765633 | 1 |
| 2004-03-11 | 8.269565 | 2.239130 | 9.826087 | 64.230435 | 0.777039 | 2 |
| 2004-03-12 | 12.177273 | 2.804545 | 11.618182 | 50.190909 | 0.665164 | 3 |
| 2004-03-13 | 11.121739 | 2.695652 | 13.121739 | 50.682609 | 0.733013 | 4 |
| 2004-03-14 | 9.830435 | 2.469565 | 16.182609 | 48.317391 | 0.849209 | 5 |
| ... | ... | ... | ... | ... | ... | ... |
| 2005-03-31 | 5.220833 | 1.387500 | 17.550000 | 50.083333 | 0.951917 | 343 |
| 2005-04-01 | 3.526087 | 1.108696 | 16.026087 | 35.404348 | 0.631135 | 344 |
| 2005-04-02 | 2.529167 | 0.854167 | 15.483333 | 32.225000 | 0.546167 | 345 |
| 2005-04-03 | 4.316667 | 1.141667 | 18.383333 | 33.695833 | 0.617583 | 346 |


Now, we are ready to start modeling!

## The Models

In this context, we are interested in predicting the concentration of benzene (`C6H6(GT)`). We have two models in mind:

- A simple linear regression (SLR) model using the concentration of CO (`CO(GT)`) as the only predictor.
- A multiple linear regression (MLR) model using the concentration of CO (`CO(GT)`) as well as the temperature, relative humidity, and absolute humidity (`T`, `RH`, `AH`) as predictors.
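
Written out, the two candidate models might look as follows. The notation is an assumption introduced here for illustration: the subscript t indexes days, the beta terms are the regression coefficients (with beta_0 the intercept), and epsilon_t is the error term.

```latex
\begin{align*}
\text{SLR:}\quad \mathrm{C6H6}_t &= \beta_0 + \beta_1\,\mathrm{CO}_t + \varepsilon_t \\
\text{MLR:}\quad \mathrm{C6H6}_t &= \beta_0 + \beta_1\,\mathrm{CO}_t + \beta_2\,\mathrm{T}_t + \beta_3\,\mathrm{RH}_t + \beta_4\,\mathrm{AH}_t + \varepsilon_t
\end{align*}
```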

Let's fit the basic models now:

In [54]:
#Fit the SLR model
reg = linear_model.LinearRegression()
reg.fit(X = aq_avg['CO(GT)'].values.reshape(-1,1), y = aq_avg['C6H6(GT)'])

string1 = "Intercept: {}"
string2 = "Coefficient(s): {}"
print(string1.format(reg.intercept_))
print(string2.format(reg.coef_))

Intercept: 0.644770948344199
Coefficient(s): [4.56536116]


In [57]:
#Fit the MLR model
reg.fit(X = aq_avg[['CO(GT)','T','RH','AH']], y = aq_avg['C6H6(GT)'])

string1 = "Intercept: {}"
string2 = "Coefficient(s): {}"
print(string1.format(reg.intercept_))
print(string2.format(reg.coef_))

Intercept: -1.8377694729981755
Coefficient(s): [ 4.77080433  0.11973259 -0.01620259  0.68866811]


Keep in mind that these models are "naive" in a sense, since they do not take the date into account and the assumption that observations are independent is currently in question. But now that we know how these models are going to be fit, we can write the functions that will evaluate them sequentially.
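
One informal way to probe the independence concern is to look at the lag-1 autocorrelation of a model's residuals in time order. The sketch below is only an illustration: the function name `lag1_autocorrelation` and both residual sequences are made up, and this is a rough diagnostic rather than a formal test. Values near zero are consistent with independent errors, while values far from zero suggest that consecutive errors are related.

```python
from statistics import mean


def lag1_autocorrelation(values):
    """Sample lag-1 autocorrelation of a time-ordered sequence."""
    m = mean(values)
    centered = [v - m for v in values]
    denom = sum(c * c for c in centered)
    # Products of each centered value with its immediate successor
    numer = sum(a * b for a, b in zip(centered[:-1], centered[1:]))
    return numer / denom


# Made-up residuals for illustration only
smooth = [1.0, 0.8, 0.5, 0.1, -0.3, -0.6, -0.9, -0.6]  # drifts slowly over time
jumpy = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]   # flips sign every step

print(lag1_autocorrelation(smooth))  # positive: neighbors resemble each other
print(lag1_autocorrelation(jumpy))   # negative: neighbors oppose each other
```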

## Creating Functions to Sequentially Validate Linear Regression Models
