# Best Predictive Model

In this notebook we compare several candidate models and select the best one, following the OpenClassrooms course chapter [Select the best predictive model](https://openclassrooms.com/en/courses/5873596-design-effective-statistical-models-to-understand-your-data/6233071-select-the-best-predictive-model).

We use the ozone data set.

In [45]:
import pandas as pd
df=pd.read_csv("data/ozone.csv")
df.head()

Unnamed: 0,Ozone,Solar.R,Wind,Temp,Month,Day
0,41.0,190.0,7.4,67,5,1
1,36.0,118.0,8.0,72,5,2
2,12.0,149.0,12.6,74,5,3
3,18.0,313.0,11.5,62,5,4
4,,,14.3,56,5,5


In [47]:
import numpy as np
df = df.dropna().rename(columns = {'Solar.R': 'Solar'})
df['Wind2'] = np.square(df['Wind'])
df['Temp2'] = np.square(df['Temp'])
df.head()

Unnamed: 0,Ozone,Solar,Wind,Temp,Month,Day,Wind2,Temp2
0,41.0,190.0,7.4,67,5,1,54.76,4489
1,36.0,118.0,8.0,72,5,2,64.0,5184
2,12.0,149.0,12.6,74,5,3,158.76,5476
3,18.0,313.0,11.5,62,5,4,132.25,3844
6,23.0,299.0,8.6,65,5,7,73.96,4225


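* Let's create an RMSE function that computes the root mean squared error from the residuals. The tables shown in this notebook were produced with `np.sqrt(np.square(resid).sum()) / len(resid)`, which is this RMSE divided by the square root of the number of residuals, so the displayed values are on that smaller scale.

In [48]:
def RMSE(resid):
    # Root of the mean squared residual: sqrt(sum(resid**2) / n)
    return np.sqrt(np.square(resid).mean())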
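For reference, with residuals $e_i$ over $n$ observations, the root mean squared error is conventionally defined as

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} e_i^{2}}
$$

Dividing the square root of the sum by $n$ instead gives a different quantity:

$$
\frac{1}{n}\sqrt{\sum_{i=1}^{n} e_i^{2}} = \frac{\mathrm{RMSE}}{\sqrt{n}}
$$

Within a single data set this rescaling does not change how models rank, but it makes values computed on sets of different sizes (such as a training and a test set) not directly comparable.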

* Now let's build models 


In [49]:
import statsmodels.formula.api as smf
formulas = ['Ozone ~ Temp',
'Ozone ~ Temp + Temp2',
'Ozone ~ Wind',
'Ozone ~ Wind + Wind2',
'Ozone ~ Temp + Wind + Solar',
'Ozone ~ Temp + Temp2 + Wind + Wind2 + Solar'
]

scores = []
for formula in formulas:
    results = smf.ols(formula, df).fit()
    scores.append( { 'model': formula,
        'RMSE':RMSE(results.resid),
        'R-squared': results.rsquared} 
    )

scores = pd.DataFrame(scores)
scores

Unnamed: 0,model,RMSE,R-squared
0,Ozone ~ Temp,2.249862,0.48796
1,Ozone ~ Temp + Temp2,2.126373,0.542627
2,Ozone ~ Wind,2.485371,0.375152
3,Ozone ~ Wind + Wind2,2.211034,0.505481
4,Ozone ~ Temp + Wind + Solar,1.973832,0.605895
5,Ozone ~ Temp + Temp2 + Wind + Wind2 + Solar,1.68643,0.712308


In [50]:
scores.sort_values(by="RMSE").reset_index(drop=True)

Unnamed: 0,model,RMSE,R-squared
0,Ozone ~ Temp + Temp2 + Wind + Wind2 + Solar,1.68643,0.712308
1,Ozone ~ Temp + Wind + Solar,1.973832,0.605895
2,Ozone ~ Temp + Temp2,2.126373,0.542627
3,Ozone ~ Wind + Wind2,2.211034,0.505481
4,Ozone ~ Temp,2.249862,0.48796
5,Ozone ~ Wind,2.485371,0.375152


### Interpretation
The RMSE aligns with the R-squared metric. The best model is the most complex one
with the lowest RMSE and highest R-squared.
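This outcome is expected for in-sample metrics: when one OLS model nests another, the larger model can never have a higher training RMSE, even if the extra predictor is pure noise. A minimal sketch on synthetic data (not the ozone set), using `numpy.linalg.lstsq` in place of `smf.ols`; the variable names and the data-generating process are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
noise_feature = rng.normal(size=n)        # unrelated to y by construction
y = 2.0 + 3.0 * x + rng.normal(size=n)

def train_rmse(X, y):
    """Fit OLS by least squares and return the in-sample RMSE."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return np.sqrt(np.mean(resid ** 2))

intercept = np.ones(n)
X_small = np.column_stack([intercept, x])                 # y ~ x
X_large = np.column_stack([intercept, x, noise_feature])  # y ~ x + noise

rmse_small = train_rmse(X_small, y)
rmse_large = train_rmse(X_large, y)
print(rmse_small, rmse_large)
```

The larger model wins on training data by construction, which is why a lower in-sample RMSE alone does not show that a more complex model predicts better.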

# Train and Test dataset
We do not have any additional data, so we split our dataset 70/30 into training and testing sets.

In [51]:
train_index=df.sample(frac=0.7).index
train=df.loc[df.index.isin(train_index)]
test=df.loc[~df.index.isin(train_index)]
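Because `df.sample` draws a different random subset on every run, the scores in the following cells change between executions. A sketch of a reproducible split, assuming a fixed seed is acceptable; the `random_state=42` value and the small synthetic frame are illustrative and not part of the original notebook:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the ozone frame, with gaps in the index
# like the ones df has after dropna()
demo = pd.DataFrame(
    {"Ozone": np.arange(10.0), "Temp": np.arange(60, 70)},
    index=[0, 1, 2, 3, 6, 7, 8, 11, 12, 13],
)

# random_state fixes the draw; drop() keeps exactly the rows not sampled
train_demo = demo.sample(frac=0.7, random_state=42)
test_demo = demo.drop(train_demo.index)

print(len(train_demo), len(test_demo))
```

Using `drop` on the sampled index is equivalent to the `isin` filtering above, as long as the index labels are unique.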

Great, now let's fit each model on the training set and evaluate it on the test set. We also calculate the RMSE on the test set to find the best model.

In [52]:
scores = []
for formula in formulas:
    results = smf.ols(formula, train).fit()
    yhat = results.predict(test)
    resid_test = yhat - test.Ozone
    scores.append( { 'model': formula,
        'RMSE_test':RMSE(resid_test),
        'RMSE_train':RMSE(results.resid),
        'R-squared': results.rsquared} 
    )

scores = pd.DataFrame(scores)
scores

Unnamed: 0,model,RMSE_test,RMSE_train,R-squared
0,Ozone ~ Temp,3.706594,2.79902,0.501236
1,Ozone ~ Temp + Temp2,3.538796,2.632961,0.558662
2,Ozone ~ Wind,4.146131,3.083249,0.394799
3,Ozone ~ Wind + Wind2,3.36957,2.805828,0.498807
4,Ozone ~ Temp + Wind + Solar,3.205538,2.468212,0.612164
5,Ozone ~ Temp + Temp2 + Wind + Wind2 + Solar,2.558541,2.146186,0.706764


As you can see, the best model (the one with the lowest test RMSE) is still the most complex one: Ozone ~ Temp + Temp2 + Wind + Wind2 + Solar. On this random split, however, Ozone ~ Wind + Wind2 performs better on the test set than Ozone ~ Temp + Temp2, which was not the case in our previous iteration.

<img src="overfitting.PNG">

##### Let's add extra complex models to our growing collection and see if some of them start overfitting

In [53]:
df['Wind3'] = df['Wind']**3
df['Wind4'] = df['Wind']**4
df['Temp3'] = df['Temp']**3
df['Temp4'] = df['Temp']**4
df['Solar3'] = df['Solar']**3
df['Solar2'] = df['Solar']**2
df

Unnamed: 0,Ozone,Solar,Wind,Temp,Month,Day,Wind2,Temp2,Wind3,Wind4,Temp3,Temp4,Solar3,Solar2
0,41.0,190.0,7.4,67,5,1,54.76,4489,405.224,2998.6576,300763,20151121,6859000.0,36100.0
1,36.0,118.0,8.0,72,5,2,64.00,5184,512.000,4096.0000,373248,26873856,1643032.0,13924.0
2,12.0,149.0,12.6,74,5,3,158.76,5476,2000.376,25204.7376,405224,29986576,3307949.0,22201.0
3,18.0,313.0,11.5,62,5,4,132.25,3844,1520.875,17490.0625,238328,14776336,30664297.0,97969.0
6,23.0,299.0,8.6,65,5,7,73.96,4225,636.056,5470.0816,274625,17850625,26730899.0,89401.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
147,14.0,20.0,16.6,63,9,25,275.56,3969,4574.296,75933.3136,250047,15752961,8000.0,400.0
148,30.0,193.0,6.9,70,9,26,47.61,4900,328.509,2266.7121,343000,24010000,7189057.0,37249.0
150,14.0,191.0,14.3,75,9,28,204.49,5625,2924.207,41816.1601,421875,31640625,6967871.0,36481.0
151,18.0,131.0,8.0,76,9,29,64.00,5776,512.000,4096.0000,438976,33362176,2248091.0,17161.0
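The power columns can also be generated in a loop, which makes it harder to miss a term: the formulas in the next cell use `Solar4`, which the cell above does not create. A sketch on a small synthetic frame; the helper name `add_powers` is hypothetical, and the column naming (`Wind2`, `Wind3`, ...) follows the notebook's convention:

```python
import pandas as pd

def add_powers(frame, columns, max_degree):
    """Return a copy of frame with <col>2 ... <col><max_degree> power columns added."""
    out = frame.copy()
    for col in columns:
        for degree in range(2, max_degree + 1):
            out[f"{col}{degree}"] = out[col] ** degree
    return out

# Two rows with the same Wind, Temp, and Solar values as the first rows shown above
demo = pd.DataFrame({"Wind": [7.4, 8.0], "Temp": [67, 72], "Solar": [190.0, 118.0]})
demo = add_powers(demo, ["Wind", "Temp", "Solar"], max_degree=4)
print(demo.columns.tolist())
```

If the columns are added after the train/test split, `train` and `test` need to be rebuilt from `df` so that they contain the new columns too.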


In [54]:
formulas = [
'Ozone ~ Solar',
'Ozone ~ Solar + Solar2',
'Ozone ~ Wind',
'Ozone ~ Wind + Wind2',
'Ozone ~ Temp',
'Ozone ~ Temp + Temp2',
'Ozone ~ Temp + Wind + Solar',
'Ozone ~ Temp + Temp2 + Wind + Wind2 + Solar + Solar2',
'Ozone ~ Temp + Temp2 + Temp3 + Wind + Wind2 + Wind3 + Solar + Solar2 + Solar3',
'Ozone ~ Temp + Temp2 + Temp3 + Temp4 + Wind + Wind2 + Wind3 + Wind4 + Solar + Solar2 + Solar3 + Solar4',
'Ozone ~ Temp + Temp2 + Temp3 + Temp4 + Wind + Wind2 + Wind3 + Wind4 + Solar + Solar2 + Solar3 + Solar4 + Temp * Wind + Temp * Solar + Wind * Solar',
'Ozone ~ Temp * Wind + Temp * Solar + Wind * Solar'
]
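The polynomial formulas above follow a regular pattern, so they could be assembled programmatically rather than typed by hand. A sketch in plain Python; `poly_terms` and `poly_formula` are hypothetical helpers, not part of statsmodels, and they assume the power columns are named as in this notebook:

```python
def poly_terms(var, degree):
    """Return 'Temp + Temp2 + ...' up to the given degree, using the notebook's column names."""
    return " + ".join([var] + [f"{var}{d}" for d in range(2, degree + 1)])

def poly_formula(target, variables, degree):
    """Build a formula string with polynomial terms for each variable."""
    return f"{target} ~ " + " + ".join(poly_terms(v, degree) for v in variables)

generated = [poly_formula("Ozone", ["Temp", "Wind", "Solar"], d) for d in (1, 2, 3, 4)]
for formula in generated:
    print(formula)
```

Every column named in a generated formula still has to exist in the data frame passed to `smf.ols`.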


In [56]:
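# The polynomial columns were added to df after the split, so rebuild
# train and test from the same train_index before fitting.
# Solar4 is used by two of the formulas but was not created above.
df['Solar4'] = df['Solar']**4
train = df.loc[df.index.isin(train_index)]
test = df.loc[~df.index.isin(train_index)]

scores = []
for formula in formulas:
    results = smf.ols(formula, train).fit()
    yhat = results.predict(test)
    resid_test = yhat - test.Ozone
    scores.append( { 'model': formula,
        'RMSE_test':RMSE(resid_test),
        'RMSE_train':RMSE(results.resid),
        'R-squared': results.rsquared} 
    )

scores = pd.DataFrame(scores)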

0      36100.0
1      13924.0
2      22201.0
3      97969.0
6      89401.0
        ...   
147      400.0
148    37249.0
150    36481.0
151    17161.0
152    49729.0
Name: Solar2, Length: 111, dtype: float64