The data cover multiple years and multiple months. Make a prediction for the very last month of the last year. Don't do a random train/test split.
Pick the training and test sets by hand, in time order. The idea of a baseline is to build the simplest possible model that still gives useful information. Split the data according to that plan.

Full dataset. Training set is everything except the last month. 
Training set = Jan 2008 to November 2015. Test set = December 2015.

Then build a linear regression on all the variables, and fit L1- and L2-regularized versions as well. Always compare training performance against test performance: compute R-squared and RMSE for both. When that's done, have a meeting to review the results and decide what to do next.

L1 = Lasso
L2 = Ridge

Compare training performance vs test performance. R-squared and RMSE for both.
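
As a reminder of what the two metrics measure, here is a minimal sketch that computes them from their definitions. The function names and the toy numbers are made up for illustration and are not from the ozone data. It shows that two sets of predictions with the same RMSE can have very different R-squared values when the spread of the target differs, which may be worth keeping in mind when one month is compared against eight years.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Toy targets (hypothetical): same prediction errors, different spread
errors = np.array([0.01, -0.01, 0.01, -0.01])
wide_target = np.array([0.00, 0.02, 0.04, 0.06])
narrow_target = np.array([0.02, 0.03, 0.04, 0.05])

for name, y in [("wide", wide_target), ("narrow", narrow_target)]:
    y_hat = y - errors  # predictions that miss by exactly `errors`
    print("{:6s} RMSE={:.4f}  R^2={:.3f}".format(name, rmse(y, y_hat), r_squared(y, y_hat)))
```

RMSE is in the units of the target (ppm here), while R-squared is relative to how much the target varies within the set being scored, so it helps to read the two together.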

Also, do the residual analysis. Compute the residuals (actual minus predicted) on the test set and plot a histogram of them, one histogram each for the L1 and L2 models. The histogram shows how often each residual value occurs; ideally it looks roughly normal and centered on zero. Also compare actual vs predicted values in the test set.
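
The residual analysis could be sketched roughly as follows. This uses synthetic stand-in data rather than the ozone dataset, and the alpha values are arbitrary placeholders, so treat it as an outline of the steps and not as the final analysis.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
# Synthetic stand-in data (NOT the ozone dataset): 5 features, linear signal plus noise
X = rng.normal(size=(500, 5))
y = X @ np.array([1.0, -0.5, 0.0, 0.25, 0.0]) + rng.normal(scale=0.3, size=500)

# Hand-picked split in row order: the last 50 rows play the role of the last month
X_train, X_test = X[:450], X[450:]
y_train, y_test = y[:450], y[450:]

# Alpha values are placeholders, not tuned
models = {"Lasso (L1)": Lasso(alpha=0.01), "Ridge (L2)": Ridge(alpha=1.0)}
residuals = {}

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, (name, model) in zip(axes, models.items()):
    model.fit(X_train, y_train)                         # fit on training rows only
    residuals[name] = y_test - model.predict(X_test)    # actual minus predicted
    ax.hist(residuals[name], bins=20)
    ax.axvline(0.0, color="k", linestyle="--")          # reference line at zero error
    ax.set_title(name + " test residuals")
    ax.set_xlabel("actual - predicted")
axes[0].set_ylabel("frequency")
fig.tight_layout()
```

A histogram that is shifted away from zero suggests the model is biased on the test month; a long tail on one side suggests it misses peaks or troughs.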

If going deeper, compute a margin of error for the residuals.
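
One way to read "margin of error" is a confidence interval for the mean residual. The sketch below takes that reading, which is an assumption; the helper name and the simulated residuals are hypothetical. It uses a t-interval, which treats the residuals as independent. Hourly readings are usually autocorrelated, so the real interval is probably wider than this formula suggests.

```python
import numpy as np
from scipy import stats

def residual_margin_of_error(residuals, confidence=0.95):
    """Return (mean residual, margin of error) from a two-sided t-interval."""
    residuals = np.asarray(residuals, dtype=float)
    n = residuals.size
    sem = residuals.std(ddof=1) / np.sqrt(n)                 # standard error of the mean
    t_crit = stats.t.ppf(0.5 + confidence / 2.0, df=n - 1)   # two-sided critical value
    return residuals.mean(), t_crit * sem

# Simulated residuals (made up): 742 values, the size of the December 2015 test set
rng = np.random.RandomState(1)
fake_residuals = rng.normal(loc=0.0, scale=0.009, size=742)

mean_resid, moe = residual_margin_of_error(fake_residuals)
print("mean residual: {:.5f} +/- {:.5f}".format(mean_resid, moe))
```

If zero lies outside mean ± margin, the model is systematically over- or under-predicting on the test month.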

In [1]:
%matplotlib inline 

import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import sklearn
import datetime
import pickle
import seaborn as sns

In [2]:
df = pd.read_pickle('weather_ozone')
df.head()

                     ozone_ppm  SPD   VSB  TEMP  DEWP     STP
datetime
2008-01-01 00:00:00      0.019    8  10.0    41    39  1018.1
2008-01-01 01:00:00      0.020    8  10.0    41    41  1017.2
2008-01-01 02:00:00      0.020    8  10.0    42    41  1016.5
2008-01-01 03:00:00      0.019   12  10.0    43    41  1015.0
2008-01-01 04:00:00      0.019   13  10.0    43    42  1014.2


In [3]:
# Split into training and test sets by date (the index is a DatetimeIndex)
train = df.loc['2008-01-01':'2015-11-30']
test  = df.loc['2015-12-01':]
print('Train Dataset:',train.shape)
print('Test Dataset:',test.shape)

('Train Dataset:', (67579, 6))
('Test Dataset:', (742, 6))


In [4]:
# Imports for the baseline model and its regression metrics
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

# Features: every column except the target; target: ozone_ppm
X_train = train.drop('ozone_ppm', axis=1).values
y_train = train.ozone_ppm.values

In [5]:
#Create regressor
lm = LinearRegression()
# Fit the regressor to the training data
lm.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [6]:
#Training set
print("Training set accuracy scores:")
print('')
y_pred_train = lm.predict(X_train)
print("R^2: {}".format(lm.score(X_train, y_train)))
rmse = np.sqrt(mean_squared_error(y_train, y_pred_train))
print("Root Mean Squared Error: {}".format(rmse))
mae = mean_absolute_error(y_train, y_pred_train)
print("Mean Absolute Error: {}".format(mae))

Training set accuracy scores:

R^2: 0.582197679433
Root Mean Squared Error: 0.0113865390885
Mean Absolute Error: 0.00903829888978


In [7]:
#Define variables for testing set
X_test = test.drop('ozone_ppm', axis=1).values
y_test = test.ozone_ppm.values

In [8]:
# Predict on the test data
y_pred_test = lm.predict(X_test)

In [9]:
#Test set
print("Test set accuracy scores:")
print('')
print("R^2: {}".format(lm.score(X_test, y_test)))
rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))
print("Root Mean Squared Error: {}".format(rmse))
mae = mean_absolute_error(y_test, y_pred_test)
print("Mean Absolute Error: {}".format(mae))
# TODO: tune the Lasso/Ridge alpha with GridSearchCV


Test set accuracy scores:

R^2: 0.0432025252226
Root Mean Squared Error: 0.00903532790804
Mean Absolute Error: 0.00742406213061
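
The cell above leaves GridSearchCV for Lasso/Ridge as a to-do. A minimal sketch of that step is below, under several assumptions: the data are synthetic stand-ins, the alpha grid is arbitrary, and `TimeSeriesSplit` is swapped in for the default shuffled folds so that each validation fold comes after its training fold. `StandardScaler` in a pipeline is also a substitution; it does not scale features the same way as `normalize=True`, so alpha values are not comparable between the two.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
# Synthetic stand-in for the training data (NOT the ozone dataset), in time order
X_train = rng.normal(size=(400, 5))
y_train = X_train @ np.array([1.0, -0.5, 0.0, 0.25, 0.0]) + rng.normal(scale=0.3, size=400)

alphas = [1e-4, 1e-3, 1e-2, 1e-1, 1.0, 10.0]   # arbitrary illustrative grid
cv = TimeSeriesSplit(n_splits=5)               # folds respect row order

best = {}
for name, estimator in [("lasso", Lasso(max_iter=10000)), ("ridge", Ridge())]:
    # make_pipeline names each step after its lowercased class name
    pipe = make_pipeline(StandardScaler(), estimator)
    grid = GridSearchCV(pipe, {name + "__alpha": alphas},
                        cv=cv, scoring="neg_mean_squared_error")
    grid.fit(X_train, y_train)                 # only training data is used for tuning
    best[name] = grid.best_params_[name + "__alpha"]
    print(name, "best alpha:", best[name])
```

The held-out month stays untouched during tuning, so it can still serve as the final check.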


In [10]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

print("Lasso regularization scores for training set:")
print('')
alpha_L = [1e-15, 1e-10, 1e-8, 1e-5, 1e-4, 1e-3, 1e-2, 1, 5, 10, 100]
for alpha in alpha_L:
    # normalize=True was deprecated in scikit-learn 1.0 and removed in 1.2,
    # so this cell needs an older release
    lasso = Lasso(alpha=alpha, normalize=True)
    lasso.fit(X_train, y_train)
    score = lasso.score(X_train, y_train)
    print("R^2 is %f when alpha = %s" % (score, alpha))

Lasso regularization scores for training set:

R^2 is 0.582198 when alpha =  1e-15
R^2 is 0.582198 when alpha =  1e-10
R^2 is 0.582197 when alpha =  1e-08
R^2 is 0.426375 when alpha =  1e-05
R^2 is 0.000000 when alpha =  0.0001
R^2 is 0.000000 when alpha =  0.001
R^2 is 0.000000 when alpha =  0.01
R^2 is 0.000000 when alpha =  1
R^2 is 0.000000 when alpha =  5
R^2 is 0.000000 when alpha =  10
R^2 is 0.000000 when alpha =  100


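In [11]:
print("Lasso regularization scores for test set:")
print('')
alpha_L = [1e-15, 1e-10, 1e-8, 1e-5, 1e-4, 1e-3, 1e-2, 1, 5, 10, 100]
for alpha in alpha_L:
    lasso = Lasso(alpha=alpha, normalize=True)
    # NOTE: the model is refit on the test month here, so these are in-sample
    # scores for December 2015, not held-out scores. A held-out score would come
    # from fitting on (X_train, y_train) and calling lasso.score(X_test, y_test).
    lasso.fit(X_test, y_test)
    score = lasso.score(X_test, y_test)
    print("R^2 is %f when alpha = %s" % (score, alpha))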

Lasso regularization scores for test set:

R^2 is 0.584030 when alpha =  1e-15
R^2 is 0.584030 when alpha =  1e-10
R^2 is 0.584029 when alpha =  1e-08
R^2 is 0.566451 when alpha =  1e-05
R^2 is 0.397157 when alpha =  0.0001
R^2 is 0.000000 when alpha =  0.001
R^2 is 0.000000 when alpha =  0.01
R^2 is 0.000000 when alpha =  1
R^2 is 0.000000 when alpha =  5
R^2 is 0.000000 when alpha =  10
R^2 is 0.000000 when alpha =  100


In [12]:
from sklearn.linear_model import Ridge

print("Ridge regularization scores for training set:")
print('')
alpha_L = [1e-15, 1e-10, 1e-8, 1e-5, 1e-4, 1e-3, 1e-2, 1, 5, 10, 100]
for alpha in alpha_L:
    # normalize=True was deprecated in scikit-learn 1.0 and removed in 1.2,
    # so this cell needs an older release
    ridge = Ridge(alpha=alpha, normalize=True)
    ridge.fit(X_train, y_train)
    score = ridge.score(X_train, y_train)
    print("R^2 is %f when alpha = %s" % (score, alpha))

Ridge regularization scores for training set:

R^2 is 0.582198 when alpha =  1e-15
R^2 is 0.582198 when alpha =  1e-10
R^2 is 0.582198 when alpha =  1e-08
R^2 is 0.582198 when alpha =  1e-05
R^2 is 0.582197 when alpha =  0.0001
R^2 is 0.582157 when alpha =  0.001
R^2 is 0.579106 when alpha =  0.01
R^2 is 0.359387 when alpha =  1
R^2 is 0.162587 when alpha =  5
R^2 is 0.096527 when alpha =  10
R^2 is 0.011614 when alpha =  100


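In [13]:
print("Ridge regularization scores for test set:")
print('')
alpha_L = [1e-15, 1e-10, 1e-8, 1e-5, 1e-4, 1e-3, 1e-2, 1, 5, 10, 100]
for alpha in alpha_L:
    ridge = Ridge(alpha=alpha, normalize=True)
    # NOTE: the model is refit on the test month here, so these are in-sample
    # scores for December 2015, not held-out scores. A held-out score would come
    # from fitting on (X_train, y_train) and calling ridge.score(X_test, y_test).
    ridge.fit(X_test, y_test)
    score = ridge.score(X_test, y_test)
    print("R^2 is %f when alpha = %s" % (score, alpha))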

Ridge regularization scores for test set:

R^2 is 0.584030 when alpha =  1e-15
R^2 is 0.584030 when alpha =  1e-10
R^2 is 0.584030 when alpha =  1e-08
R^2 is 0.584030 when alpha =  1e-05
R^2 is 0.584029 when alpha =  0.0001
R^2 is 0.584022 when alpha =  0.001
R^2 is 0.583441 when alpha =  0.01
R^2 is 0.422163 when alpha =  1
R^2 is 0.190086 when alpha =  5
R^2 is 0.112188 when alpha =  10
R^2 is 0.013367 when alpha =  100
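
The two "test set" loops above call fit on X_test and y_test before scoring, so their R^2 values describe how well a model trained on December alone fits December. To get the training vs test comparison from the plan, each model would be fit once on the training rows and then scored on both sets. A minimal sketch of that pattern follows, using synthetic stand-in data and an arbitrary alpha grid; the `evaluate` helper is hypothetical, and `normalize` is left out so the code runs on current scikit-learn.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
# Synthetic stand-in data (NOT the ozone dataset), split in row order
X = rng.normal(size=(600, 5))
y = X @ np.array([1.0, -0.5, 0.0, 0.25, 0.0]) + rng.normal(scale=0.3, size=600)
X_train, y_train = X[:500], y[:500]
X_test, y_test = X[500:], y[500:]

def evaluate(model_cls, alphas):
    """Fit on the training rows only, then score on both train and test."""
    rows = []
    for alpha in alphas:
        model = model_cls(alpha=alpha).fit(X_train, y_train)
        rmse_test = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
        rows.append((alpha,
                     model.score(X_train, y_train),   # in-sample R^2
                     model.score(X_test, y_test),     # held-out R^2
                     rmse_test))
    return rows

for cls in (Lasso, Ridge):
    print(cls.__name__)
    for alpha, r2_train, r2_test, rmse_test in evaluate(cls, [1e-3, 1e-2, 1e-1, 1.0]):
        print("  alpha={:<6g} train R^2={:.3f}  test R^2={:.3f}  test RMSE={:.3f}".format(
            alpha, r2_train, r2_test, rmse_test))
```

A held-out R^2 can come out negative when the model does worse than predicting the test-set mean, which an in-sample score with an intercept never shows.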
