# Best Predictive Model

In this notebook we compare several candidate models and select the best one, following [this OpenClassrooms course chapter](https://openclassrooms.com/en/courses/5873596-design-effective-statistical-models-to-understand-your-data/6233071-select-the-best-predictive-model).

Let's start with the ozone data set.

In [45]:
import pandas as pd
df=pd.read_csv("data/ozone.csv")
df.head()

Unnamed: 0,Ozone,Solar.R,Wind,Temp,Month,Day
0,41.0,190.0,7.4,67,5,1
1,36.0,118.0,8.0,72,5,2
2,12.0,149.0,12.6,74,5,3
3,18.0,313.0,11.5,62,5,4
4,,,14.3,56,5,5


In [47]:
import numpy as np
df = df.dropna().rename(columns = {'Solar.R': 'Solar'})
df['Wind2'] = np.square(df['Wind'])
df['Temp2'] = np.square(df['Temp'])
df.head()

Unnamed: 0,Ozone,Solar,Wind,Temp,Month,Day,Wind2,Temp2
0,41.0,190.0,7.4,67,5,1,54.76,4489
1,36.0,118.0,8.0,72,5,2,64.0,5184
2,12.0,149.0,12.6,74,5,3,158.76,5476
3,18.0,313.0,11.5,62,5,4,132.25,3844
6,23.0,299.0,8.6,65,5,7,73.96,4225


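* Let's create the RMSE function, which computes the score from the residuals

In [48]:
def RMSE(resid):
    # Square root of the sum of squared residuals, divided by the number of residuals.
    # This equals the textbook RMSE, sqrt(mean(resid**2)), divided by sqrt(len(resid)),
    # so scores are only directly comparable between samples of the same size.
    return np.sqrt(np.square(resid).sum()) / len(resid)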
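
The `RMSE` function above divides the square root of the sum of squares by `n`, whereas the textbook definition takes the square root of the mean of squares. The sketch below compares the two on made-up residuals; `rmse_notebook` and `rmse_standard` are hypothetical names introduced here, not part of the notebook.

```python
import numpy as np

def rmse_notebook(resid):
    # Same computation as the notebook's RMSE: sqrt(sum of squares) / n
    resid = np.asarray(resid, dtype=float)
    return np.sqrt(np.square(resid).sum()) / len(resid)

def rmse_standard(resid):
    # Textbook RMSE: square root of the mean of the squared residuals
    resid = np.asarray(resid, dtype=float)
    return np.sqrt(np.mean(np.square(resid)))

resid = np.array([3.0, -4.0, 0.0, 0.0])  # made-up residuals, n = 4
n = len(resid)

print(rmse_notebook(resid))   # sqrt(25) / 4 = 1.25
print(rmse_standard(resid))   # sqrt(25 / 4) = 2.5
# The two differ by a factor of sqrt(n)
print(np.isclose(rmse_standard(resid), rmse_notebook(resid) * np.sqrt(n)))
```

For a fixed sample size the two definitions rank models identically, so the rankings in the tables of this notebook are unaffected. The factor only matters when comparing scores computed on samples of different sizes, such as a training set and a smaller test set.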

* Now let's build models 


In [61]:
import statsmodels.formula.api as smf
formulas = ['Ozone ~ Temp',
'Ozone ~ Temp + Temp2',
'Ozone ~ Wind',
'Ozone ~ Wind + Wind2',
'Ozone ~ Temp + Wind + Solar',
'Ozone ~ Temp + Temp2 + Wind + Wind2 + Solar'
]

scores = []
for formula in formulas:
    results = smf.ols(formula, df).fit()
    scores.append( { 'model': formula,
        'RMSE':RMSE(results.resid),
        'R-squared': results.rsquared} 
    )

scores = pd.DataFrame(scores)
scores

Unnamed: 0,model,RMSE,R-squared
0,Ozone ~ Temp,2.249862,0.48796
1,Ozone ~ Temp + Temp2,2.126373,0.542627
2,Ozone ~ Wind,2.485371,0.375152
3,Ozone ~ Wind + Wind2,2.211034,0.505481
4,Ozone ~ Temp + Wind + Solar,1.973832,0.605895
5,Ozone ~ Temp + Temp2 + Wind + Wind2 + Solar,1.68643,0.712308


In [62]:
scores.sort_values(by="RMSE").reset_index(drop=True)

Unnamed: 0,model,RMSE,R-squared
0,Ozone ~ Temp + Temp2 + Wind + Wind2 + Solar,1.68643,0.712308
1,Ozone ~ Temp + Wind + Solar,1.973832,0.605895
2,Ozone ~ Temp + Temp2,2.126373,0.542627
3,Ozone ~ Wind + Wind2,2.211034,0.505481
4,Ozone ~ Temp,2.249862,0.48796
5,Ozone ~ Wind,2.485371,0.375152


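### Interpretation
The RMSE ranking agrees with the R-squared ranking: the most complex model has both
the lowest RMSE and the highest R-squared. Both metrics are computed on the same data
used to fit the models, so they do not yet tell us how well each model generalizes.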
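
To see why in-sample metrics favor complex models, here is a minimal sketch on synthetic data (not the ozone data). It fits two nested least-squares models with NumPy and shows that adding terms, even a pure-noise feature, does not lower the in-sample R-squared. The data and the helper `r_squared` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0, 10, size=n)
noise_feature = rng.normal(size=n)            # unrelated to y by construction
y = 2.0 * x + rng.normal(scale=3.0, size=n)   # true relation is linear in x

def r_squared(X, y):
    # Ordinary least squares via lstsq, then R^2 = 1 - SS_res / SS_tot
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

ones = np.ones(n)
X_small = np.column_stack([ones, x])                         # y ~ x
X_large = np.column_stack([ones, x, x ** 2, noise_feature])  # y ~ x + x^2 + noise

r2_small = r_squared(X_small, y)
r2_large = r_squared(X_large, y)
print(r2_small, r2_large)  # the larger model never scores lower in-sample
```

Because the smaller model is a special case of the larger one, least squares can always match its fit by setting the extra coefficients to zero. This is why a held-out test set is needed to compare models fairly.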

# Train and Test dataset
We have no additional data, so we split our dataset 70/30 into training and testing sets.

In [63]:
train_index=df.sample(frac=0.7).index
train=df.loc[df.index.isin(train_index)]
test=df.loc[~df.index.isin(train_index)]
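
The split above draws a different random sample on every run. A minimal sketch of a reproducible variant follows, assuming a fixed `random_state` is acceptable; the `toy` DataFrame is made up for illustration and stands in for `df`.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the ozone DataFrame
toy = pd.DataFrame({"Ozone": np.arange(20.0), "Temp": np.arange(60, 80)})

# random_state fixes the sample, so the same rows are selected on every run
train = toy.sample(frac=0.7, random_state=0)
# Every row that was not sampled goes to the test set
test = toy.drop(train.index)

print(len(train), len(test))
```

Using `drop` on the sampled index gives the same partition as the `isin` masks above, with the training rows in sampled order rather than original order.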

Great, now let's train and test. We also calculate the RMSE on the test set to find the best model.

In [64]:
scores = []
for formula in formulas:
    results = smf.ols(formula, train).fit()
    yhat = results.predict(test)
    resid_test = yhat - test.Ozone
    scores.append( { 'model': formula,
        'RMSE_test':RMSE(resid_test),
        'RMSE_train':RMSE(results.resid),
        'R-squared': results.rsquared} 
    )

scores = pd.DataFrame(scores)
scores

Unnamed: 0,model,RMSE_test,RMSE_train,R-squared
0,Ozone ~ Temp,3.403745,2.862013,0.434874
1,Ozone ~ Temp + Temp2,3.034991,2.743244,0.480805
2,Ozone ~ Wind,4.243184,3.049452,0.358428
3,Ozone ~ Wind + Wind2,3.766661,2.719036,0.489927
4,Ozone ~ Temp + Wind + Solar,3.382505,2.443109,0.588199
5,Ozone ~ Temp + Temp2 + Wind + Wind2 + Solar,2.62987,2.13455,0.685649


As you can see, the best model (the one with the lowest test RMSE) is still the most complex one: Ozone ~ Temp + Temp2 + Wind + Wind2 + Solar. But now Ozone ~ Temp + Temp2 performs better on the test set than Ozone ~ Temp + Wind + Solar, which was not the case in our previous iteration.

<img src="overfitting.PNG">

##### Let's add higher-degree terms so we can build more complex models and see whether some of them start overfitting

In [65]:
df['Wind3'] = df['Wind']**3
df['Wind4'] = df['Wind']**4
df['Temp3'] = df['Temp']**3
df['Temp4'] = df['Temp']**4
df['Solar3'] = df['Solar']**3
df['Solar2'] = df['Solar']**2
df

Unnamed: 0,Ozone,Solar,Wind,Temp,Month,Day,Wind2,Temp2,Wind3,Wind4,Temp3,Temp4,Solar3,Solar2
63,32.0,236.0,9.2,81,7,3,84.64,6561,778.688,7163.9296,531441,43046721,13144256.0,55696.0
141,24.0,238.0,10.3,68,9,19,106.09,4624,1092.727,11255.0881,314432,21381376,13481272.0,56644.0
91,59.0,254.0,9.2,81,7,31,84.64,6561,778.688,7163.9296,531441,43046721,16387064.0,64516.0
88,82.0,213.0,7.4,88,7,28,54.76,7744,405.224,2998.6576,681472,59969536,9663597.0,45369.0
47,37.0,284.0,20.7,72,6,17,428.49,5184,8869.743,183603.6801,373248,26873856,22906304.0,80656.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
131,21.0,230.0,10.9,75,9,9,118.81,5625,1295.029,14115.8161,421875,31640625,12167000.0,52900.0
7,19.0,99.0,13.8,59,5,8,190.44,3481,2628.072,36267.3936,205379,12117361,970299.0,9801.0
146,7.0,49.0,10.3,69,9,24,106.09,4761,1092.727,11255.0881,328509,22667121,117649.0,2401.0
125,73.0,183.0,2.8,93,9,3,7.84,8649,21.952,61.4656,804357,74805201,6128487.0,33489.0


In [66]:
scores = []
for formula in formulas:
    results = smf.ols(formula, train).fit()
    yhat = results.predict(test)
    resid_test = yhat - test.Ozone
    scores.append( { 'model': formula,
        'RMSE_test':RMSE(resid_test),
        'RMSE_train':RMSE(results.resid),
        'R-squared': results.rsquared} 
    )

scores = pd.DataFrame(scores)
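
The loop above still iterates over the original six `formulas`, and `train` and `test` were split before the new columns were added, so the higher-degree terms are not evaluated yet. A minimal sketch of how the candidate list could be extended follows. The names `extra_formulas`, `all_formulas`, and `missing_terms` are hypothetical, and the extra formulas are illustrative choices rather than models fitted in this notebook.

```python
# The six formulas already used in this notebook
formulas = [
    'Ozone ~ Temp',
    'Ozone ~ Temp + Temp2',
    'Ozone ~ Wind',
    'Ozone ~ Wind + Wind2',
    'Ozone ~ Temp + Wind + Solar',
    'Ozone ~ Temp + Temp2 + Wind + Wind2 + Solar',
]

# Hypothetical extra candidates built from the new polynomial columns
extra_formulas = [
    'Ozone ~ Temp + Temp2 + Wind + Wind2 + Solar + Solar2',
    'Ozone ~ Temp + Temp2 + Temp3 + Wind + Wind2 + Wind3 + Solar + Solar2 + Solar3',
    'Ozone ~ Temp + Temp2 + Temp3 + Temp4 + Wind + Wind2 + Wind3 + Wind4 + Solar + Solar2 + Solar3',
]

all_formulas = formulas + extra_formulas

# Columns available in df after the higher-degree terms have been added
columns = {'Ozone', 'Solar', 'Wind', 'Temp', 'Month', 'Day', 'Wind2', 'Temp2',
           'Wind3', 'Wind4', 'Temp3', 'Temp4', 'Solar3', 'Solar2'}

def missing_terms(formula, columns):
    # Return the right-hand-side terms that are not columns of the data
    rhs = formula.split('~')[1]
    return [term.strip() for term in rhs.split('+') if term.strip() not in columns]

for formula in all_formulas:
    print(formula, '->', missing_terms(formula, columns) or 'ok')
```

To score these candidates, the data would need to be split again after the columns are added, so that `train` and `test` contain the new terms, and the loop would iterate over `all_formulas` instead of `formulas`.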

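* Now let's use K-fold cross-validation (K = 4): each fold serves once as the test set, and the scores are averaged over the folds

In [68]:
K = 4
np.random.seed(8)
# Shuffle the rows, then cut the shuffled index into K folds of (almost) equal size
df = df.sample(frac = 1)
indexes = np.array_split(list(df.index),K)

scores = []

for formula in formulas:
    for i in range(K):
        # Fold i is the test set; the remaining K-1 folds form the training set
        test_index  = indexes[i]
        train_index = [idx for idx in list(df.index) if idx not in test_index]
        train = df.loc[df.index.isin(train_index)]
        test  = df.loc[~df.index.isin(train_index)]
    
        results = smf.ols(formula, train).fit()
        yhat    = results.predict(test)
        resid_test = yhat - test.Ozone
    
        scores.append( {
            'model': formula,
            'RMSE_test':RMSE(resid_test),
            'RMSE_train':RMSE(results.resid)
        })

# Average the K per-fold scores for each model
scores = pd.DataFrame(scores)
scores = scores.groupby(by = 'model').mean().reset_index()
scores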

Unnamed: 0,model,RMSE_test,RMSE_train
0,Ozone ~ Temp,4.461479,2.588634
1,Ozone ~ Temp + Temp2,4.168964,2.446016
2,Ozone ~ Temp + Temp2 + Wind + Wind2 + Solar,3.543969,1.93046
3,Ozone ~ Temp + Wind + Solar,3.964809,2.269495
4,Ozone ~ Wind,4.985956,2.864774
5,Ozone ~ Wind + Wind2,4.530476,2.54379


##### The best model is (again) Ozone ~ Temp + Temp2 + Wind + Wind2 + Solar, which has the lowest mean test RMSE across the 4 folds