# Cross-Validating across Time using Multiple Linear Regression

A major challenge of modeling data collected over time is that consecutive entries are likely to be correlated. This poses a problem because a key assumption of linear regression is that the observations (more precisely, their errors) are independent of one another, so randomly shuffled cross-validation folds can give a misleading picture of forecast accuracy. To work around this, we can instead evaluate the model by fitting it to the first subset of observations (ordered by time), forecasting the next observation, and recording the mean squared error of that forecast. Repeating this process over successive subsets and summing the mean squared errors gives us a metric by which we can "cross-validate" the model!
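
As a rough illustration of the procedure just described, here is a minimal sketch using NumPy's least-squares solver. The function name `expanding_window_cv` and the `min_train` parameter are hypothetical, and the sketch assumes the training window grows by one observation at each step; the models later in this notebook may be set up differently.

```python
import numpy as np

def expanding_window_cv(X, y, min_train):
    """Sum of squared one-step-ahead forecast errors for a linear regression.

    X: (n, p) array of predictors, with rows in chronological order.
    y: (n,) array of responses.
    min_train: size of the first training window (hypothetical parameter).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)

    # Prepend a column of ones so the model has an intercept
    X1 = np.column_stack([np.ones(len(X)), X])

    total = 0.0
    for t in range(min_train, len(y)):
        # Fit on observations 0..t-1 only, so no future data leaks into the fit
        beta, *_ = np.linalg.lstsq(X1[:t], y[:t], rcond=None)

        # Forecast observation t and accumulate its squared error
        pred = X1[t] @ beta
        total += (y[t] - pred) ** 2
    return total
```

With a single held-out point per step, the mean squared error of each forecast is just its squared error, so the returned value is the sum described above. A smaller value suggests a model that forecasts the next observation better.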

We will write an algorithm to demonstrate this process using Python. The data for this example is air quality data from the UC Irvine Machine Learning Repository. The observations were collected between March 2004 and February 2005 using metal oxide chemical sensors embedded in a device located in a significantly polluted area of Italy. More information on the dataset is available [here](https://archive.ics.uci.edu/dataset/360/air+quality).

We'll go ahead and read in the data now:

In [6]:
# Read in the dataset from the UCI Machine Learning Repository library
!pip install ucimlrepo
import ucimlrepo as uci

air_quality = uci.fetch_ucirepo(id=360)
aq = air_quality.data.features
aq.head()



| | Date | Time | CO(GT) | PT08.S1(CO) | NMHC(GT) | C6H6(GT) | PT08.S2(NMHC) | NOx(GT) | PT08.S3(NOx) | NO2(GT) | PT08.S4(NO2) | PT08.S5(O3) | T | RH | AH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3/10/2004 | 18:00:00 | 2.6 | 1360 | 150 | 11.9 | 1046 | 166 | 1056 | 113 | 1692 | 1268 | 13.6 | 48.9 | 0.7578 |
| 1 | 3/10/2004 | 19:00:00 | 2.0 | 1292 | 112 | 9.4 | 955 | 103 | 1174 | 92 | 1559 | 972 | 13.3 | 47.7 | 0.7255 |
| 2 | 3/10/2004 | 20:00:00 | 2.2 | 1402 | 88 | 9.0 | 939 | 131 | 1140 | 114 | 1555 | 1074 | 11.9 | 54.0 | 0.7502 |
| 3 | 3/10/2004 | 21:00:00 | 2.2 | 1376 | 80 | 9.2 | 948 | 172 | 1092 | 122 | 1584 | 1203 | 11.0 | 60.0 | 0.7867 |
| 4 | 3/10/2004 | 22:00:00 | 1.6 | 1272 | 51 | 6.5 | 836 | 131 | 1205 | 116 | 1490 | 1110 | 11.2 | 59.6 | 0.7888 |


Before the models can be evaluated, we need to clean up the data a little. First, we drop the rows that have missing values in any of the variables we will be working with: `C6H6(GT)`, `CO(GT)`, `T`, `RH`, and `AH`.

In [7]:
#Drop rows with missing values (encoded as -200 in this dataset)
cols = ['C6H6(GT)', 'CO(GT)', 'T', 'RH', 'AH']
aq_no_missing = aq[(aq[cols] != -200).all(axis=1)]
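
An equivalent way to express the same filter is to turn the -200 sentinel into `NaN` and let pandas drop the incomplete rows. The sketch below uses a small made-up frame (`toy`) rather than the real data, so the values are purely illustrative.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for a few rows of the air quality data
toy = pd.DataFrame({
    'CO(GT)': [2.6, -200, 2.2],
    'T': [13.6, 13.3, -200],
    'RH': [48.9, 47.7, 54.0],
})

# Columns in which -200 should be treated as "missing"
cols = ['CO(GT)', 'T']

cleaned = toy.copy()
# Swap the sentinel for NaN, then drop any row that is missing one of these columns
cleaned[cols] = cleaned[cols].replace(-200, np.nan)
cleaned = cleaned.dropna(subset=cols)
print(cleaned)
```

One possible advantage of this form is that the missing values become visible to the rest of pandas (for example `isna()` and `count()`), which can help when checking how much data each variable loses.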

Next, the values we wish to work with are daily averages, so we will group the data by `Date` and compute the mean of each of the five variables above for each day. Note that we expect 347 observations after completing this operation.

In [13]:
#Group data by date, return the average across each day
aq_avg = aq_no_missing.groupby('Date')[['C6H6(GT)', 'CO(GT)','T','RH','AH']].mean()

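#Append a sequential day index so the data is a bit easier to iterate over
#Note: Date is still a string here, so groupby orders the rows lexicographically (1/1/2005, 1/10/2005, ...), not chronologically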
aq_avg['Day'] = range(1, len(aq_avg) + 1)
aq_avg

| Date | C6H6(GT) | CO(GT) | T | RH | AH | Day |
|---|---|---|---|---|---|---|
| 1/1/2005 | 7.313043 | 2.134783 | 6.813043 | 51.260870 | 0.501643 | 1 |
| 1/10/2005 | 13.463636 | 2.127273 | 13.377273 | 68.413636 | 1.044486 | 2 |
| 1/11/2005 | 13.779167 | 2.812500 | 12.779167 | 64.104167 | 0.941413 | 3 |
| 1/12/2005 | 15.817391 | 3.273913 | 12.021739 | 65.443478 | 0.904865 | 4 |
| 1/13/2005 | 12.495833 | 2.679167 | 9.991667 | 69.566667 | 0.847225 | 5 |
| ... | ... | ... | ... | ... | ... | ... |
| 9/5/2004 | 5.712500 | 1.304167 | 29.204167 | 27.237500 | 1.057429 | 343 |
| 9/6/2004 | 6.278261 | 1.421739 | 26.752174 | 35.934783 | 1.219478 | 344 |
| 9/7/2004 | 9.700000 | 1.890909 | 27.618182 | 33.318182 | 1.154682 | 345 |
| 9/8/2004 | 17.016667 | 3.316667 | 27.966667 | 26.433333 | 0.965850 | 346 |
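
Note that the daily averages above come out in string order (1/1/2005 appears before 9/8/2004), so `Day` does not follow the calendar. Since the evaluation relies on observations being ordered by time, one possible fix is to parse `Date` before grouping. The sketch below shows the idea on a small made-up frame (`toy`), assuming the dates are month/day/year strings as they appear in the output.

```python
import pandas as pd

# Made-up stand-in for the hourly data; the values are for illustration only
toy = pd.DataFrame({
    'Date': ['3/10/2004', '3/10/2004', '1/1/2005', '9/8/2004'],
    'T': [13.6, 13.3, 6.8, 28.0],
})

# Parse the month/day/year strings into real dates so sorting is chronological
toy['Date'] = pd.to_datetime(toy['Date'], format='%m/%d/%Y')

# groupby sorts its keys, which are now datetimes rather than strings
toy_avg = toy.groupby('Date')[['T']].mean()

# The day index now counts forward in time
toy_avg['Day'] = range(1, len(toy_avg) + 1)
print(toy_avg)
```

Applying the same conversion to the real `Date` column before the `groupby` call should make `Day` 1 correspond to the earliest date in the data.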


Now, we are ready to start modeling!

## The Models

