# Cross-validation
Model selection is an important topic in statistical modeling. We can include as many features as we like, but not all of them improve model accuracy, so it is important to know which features actually help. There are two main approaches:

- Information-theoretic criteria such as AIC, BIC, and Mallows's Cp.
- Prediction-accuracy approaches: leave-one-out and k-fold cross-validation.

# Load file
Two libraries are commonly used to load CSV files:

- NumPy functions `np.loadtxt` and `np.genfromtxt`
- pandas function `pd.read_csv`

Here we use pandas.

In [None]:
import pandas as pd
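path = 'data/'
filename = path + 'Auto.csv'
# '?' marks missing values in this file; read it as NaN
auto = pd.read_csv(filename, na_values=['?'], na_filter=True)
# drop every row that contains a missing value
auto = auto.dropna()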
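
To see what `na_values=['?']` and `dropna()` do in the cell above, here is a minimal sketch on a small in-memory CSV. The three rows are hypothetical and are not taken from `Auto.csv`; `raw` and `clean` are illustrative names.

```python
import io
import pandas as pd

# Hypothetical three-row sample, not taken from Auto.csv
csv_text = "mpg,horsepower\n18.0,130\n25.0,?\n15.0,165\n"

# Without na_values, the '?' is kept as text and the column is not numeric
raw = pd.read_csv(io.StringIO(csv_text))

# With na_values, '?' becomes NaN, the column is parsed as float,
# and dropna() removes the incomplete row
clean = pd.read_csv(io.StringIO(csv_text), na_values=["?"]).dropna()

print(raw["horsepower"].dtype)
print(clean["horsepower"].dtype)
print(len(raw), len(clean))
```

The row containing `?` is the only one dropped, so `clean` has one row fewer than `raw`.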

In [None]:
auto.head()

In [None]:
auto.info()

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(auto['horsepower'], auto['mpg'], 'r+', mfc='none');

The seaborn package builds on matplotlib, adding better default styling and more plot functions. Let's try seaborn's `regplot` function, for instance.


In [None]:
import seaborn as sns
# seaborn sets up styles and gives us more plotting options
# order=2 fits a quadratic; ci=None turns off the confidence band
sns.regplot(x="horsepower", y="mpg", data=auto, ci=None,
    scatter_kws={"color":"r", "alpha":0.3, "s":100},
    line_kws={"color":"b", "alpha":0.75, "lw":4}, marker="o", order=2)

# Quadratic model
It appears that a quadratic model makes sense. Let's check whether the data support this guess.

In [None]:
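import numpy as np
import statsmodels.formula.api as smf
# Linear model: mpg as a function of horsepower
model = smf.ols(formula='mpg ~ horsepower', data = auto)

lr1 = model.fit()
# Only the last expression in a cell is displayed, so print the summary explicitly
print(lr1.summary2())
lr1.aic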

In [None]:
model = smf.ols(formula='mpg ~ horsepower +\
                np.power(horsepower,2)', data = auto)
lr2 = model.fit()
lr2.aic

In [None]:
model = smf.ols(formula='mpg ~ horsepower +\
np.power(horsepower,2)+ np.power(horsepower,3)', data = auto)
lr3 = model.fit()
lr3.aic

In [None]:
model = smf.ols(formula='mpg ~ horsepower +\
np.power(horsepower,2)+ np.power(horsepower,3)+\
np.power(horsepower,4)', data = auto)
lr4 = model.fit()
lr4.aic
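
The `.aic` values above can be reproduced by hand, which makes clear what is being compared. The sketch below uses only NumPy and synthetic data (not the Auto data set); `ols_aic_bic`, `x_demo`, and `y_demo` are hypothetical names. It assumes Gaussian errors and counts only the regression coefficients (including the intercept) as parameters, which I believe is the convention statsmodels uses for OLS, but treat that as an assumption.

```python
import numpy as np

def ols_aic_bic(X, y):
    """Return (AIC, BIC) for a Gaussian OLS fit; X must include the intercept column."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    # Maximised Gaussian log-likelihood with sigma^2 estimated as RSS / n
    loglik = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)
    aic = 2 * k - 2 * loglik
    bic = k * np.log(n) - 2 * loglik
    return aic, bic

# Synthetic data with a true quadratic relationship (illustration only)
rng = np.random.default_rng(0)
x_demo = rng.uniform(-2, 2, size=200)
y_demo = 1.0 + 0.5 * x_demo - 2.0 * x_demo**2 + rng.normal(scale=0.5, size=200)

for degree in range(1, 5):
    # Columns 1, x, ..., x^degree
    X_poly = np.vander(x_demo, degree + 1, increasing=True)
    aic, bic = ols_aic_bic(X_poly, y_demo)
    print(degree, round(aic, 1), round(bic, 1))
```

Lower values are better. Both criteria reward a smaller residual sum of squares and penalise each extra coefficient, with BIC penalising more heavily once `log(n) > 2`.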

# Leave-one-out
One of the most common validation methods is leave-one-out cross-validation, which is n-fold cross-validation with n equal to the number of observations.

In [None]:
from sklearn.model_selection import LeaveOneOut
from sklearn.linear_model import LinearRegression


loo = LeaveOneOut()
loo.get_n_splits(auto)

sklearn estimators are built around NumPy arrays. In many cases they accept pandas DataFrames too, but passing NumPy arrays avoids surprises, so we convert explicitly here.
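
A small sketch of the conversion and the shapes it produces. The DataFrame `df_demo` is a hypothetical stand-in for `auto`, and `.to_numpy()` is used here as the method pandas recommends in place of `.values`; both return a NumPy array.

```python
import pandas as pd

# Hypothetical stand-in for the auto DataFrame
df_demo = pd.DataFrame({"horsepower": [130.0, 165.0, 150.0],
                        "mpg": [18.0, 15.0, 16.0]})

# Double brackets select a DataFrame, so the result is 2-D: shape (3, 1)
X_demo = df_demo[["horsepower"]].to_numpy()
# Single brackets select a Series, so the result is 1-D: shape (3,)
y_demo = df_demo["mpg"].to_numpy()

print(X_demo.shape, y_demo.shape)
```

sklearn expects the feature matrix to be 2-D (samples by features), which is why the features are selected with double brackets even when there is only one column.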

In [None]:
X = auto[['horsepower']].values
y = auto['mpg'].values

rss = np.zeros(auto.shape[0])
i = 0
for train_i, test_i in loo.split(auto):
    lr = LinearRegression()
    lr = lr.fit(X[train_i], y[train_i])
    # each test set holds one sample, so take the single element of each array
    rss[i] = (lr.predict(X[test_i])[0] - y[test_i][0])**2
    i = i + 1
# rss holds the squared prediction error for each left-out sample.
np.sum(rss)
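
For ordinary least squares, the leave-one-out errors computed by the loop above can also be obtained from a single fit: the left-out residual for sample i equals the ordinary residual divided by `1 - h_ii`, where `h_ii` is the i-th diagonal entry of the hat matrix. The sketch below checks this identity with NumPy on synthetic data; every name in it (`X_demo`, `y_demo`, `loop_sq_err`, `shortcut_sq_err`) is hypothetical and it does not use the Auto data.

```python
import numpy as np

# Synthetic linear data (illustration only)
rng = np.random.default_rng(1)
n_demo = 50
x_demo = rng.uniform(0, 10, size=n_demo)
y_demo = 3.0 - 0.7 * x_demo + rng.normal(size=n_demo)
X_demo = np.column_stack([np.ones(n_demo), x_demo])  # design matrix with intercept

# Explicit leave-one-out loop: refit n times, predict the held-out sample
loop_sq_err = np.empty(n_demo)
for i in range(n_demo):
    keep = np.arange(n_demo) != i
    beta_i, *_ = np.linalg.lstsq(X_demo[keep], y_demo[keep], rcond=None)
    loop_sq_err[i] = (y_demo[i] - X_demo[i] @ beta_i) ** 2

# Shortcut: one fit on all data, then rescale residuals by the leverages
beta, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)
hat = X_demo @ np.linalg.pinv(X_demo)   # hat matrix H = X (X^T X)^{-1} X^T
leverage = np.diag(hat)
shortcut_sq_err = ((y_demo - X_demo @ beta) / (1 - leverage)) ** 2

print(np.allclose(loop_sq_err, shortcut_sq_err))
```

The two arrays should agree up to floating-point error. The shortcut only applies to linear least squares; for other models the explicit loop is still needed.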

In [None]:
# Same procedure with two predictors; y is reused from the previous cell
X = auto[['horsepower', 'displacement']].values
rss = np.zeros(auto.shape[0])
i = 0
for train_i, test_i in loo.split(auto):
    lr = LinearRegression()
    lr = lr.fit(X[train_i], y[train_i])
    rss[i] = (lr.predict(X[test_i])[0] - y[test_i][0])**2
    # you may write i = i + 1 as follows
    i += 1
np.sum(rss)

In [None]:
from sklearn.model_selection import KFold
X = auto[['horsepower', 'displacement']].values
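k = 5
rss = np.zeros(k)
# shuffle=True randomizes fold membership; pass random_state for reproducible folds
kf = KFold(n_splits=k, shuffle=True)
i = 0
for train_i, test_i in kf.split(auto):
    lr = LinearRegression()
    lr = lr.fit(X[train_i], y[train_i])
    # sum of squared prediction errors over the held-out fold
    rss[i] = np.sum((lr.predict(X[test_i]) - y[test_i])**2)
    i += 1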
rss


In [None]:
np.sum(rss)
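
To make the mechanics of k-fold cross-validation explicit, here is a minimal NumPy-only sketch that shuffles the indices once, splits them into k nearly equal folds, and accumulates the squared prediction error. It imitates what the `KFold` loop above does but is not sklearn's implementation; `kfold_rss`, `X_demo`, and `y_demo` are hypothetical names and the data are synthetic.

```python
import numpy as np

def kfold_rss(X, y, k, seed=0):
    """Total squared prediction error over k folds for OLS with an intercept."""
    n = len(y)
    order = np.random.default_rng(seed).permutation(n)  # shuffle once
    total = 0.0
    for test_i in np.array_split(order, k):              # k nearly equal folds
        train_mask = np.ones(n, dtype=bool)
        train_mask[test_i] = False
        A_train = np.column_stack([np.ones(train_mask.sum()), X[train_mask]])
        A_test = np.column_stack([np.ones(len(test_i)), X[test_i]])
        beta, *_ = np.linalg.lstsq(A_train, y[train_mask], rcond=None)
        total += np.sum((A_test @ beta - y[test_i]) ** 2)
    return total

# Synthetic data with two predictors (illustration only)
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(60, 2))
y_demo = 1.0 + X_demo @ np.array([2.0, -1.0]) + rng.normal(scale=0.3, size=60)

print(kfold_rss(X_demo, y_demo, k=5) / len(y_demo))            # 5-fold estimate of test MSE
print(kfold_rss(X_demo, y_demo, k=len(y_demo)) / len(y_demo))  # k = n is leave-one-out
```

Dividing the total by the number of samples gives a mean squared error that is comparable across different values of k. With `k = n` every fold holds a single sample, so the shuffle no longer affects the result and the procedure reduces to leave-one-out.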