## KFold Cross Validation

We will perform K-fold cross-validation on the diabetes dataset that ships with scikit-learn.

In [123]:
import pandas as pd
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split, cross_val_score

#### Importing the data

In [46]:
diabetes = datasets.load_diabetes()

In [104]:
columns = 'age sex bmi map tc ldl hdl tch ltg glu'.split()
predictor = pd.DataFrame(diabetes.data, columns=columns)

target = pd.DataFrame(diabetes.target, columns=['progression'])

In [105]:
predictor.sample(5)

| | age | sex | bmi | map | tc | ldl | hdl | tch | ltg | glu |
|---|---|---|---|---|---|---|---|---|---|---|
| 403 | -0.020045 | -0.044642 | 0.097264 | -0.005671 | -0.005697 | -0.023861 | -0.021311 | -0.002592 | 0.061686 | 0.040343 |
| 179 | -0.023677 | -0.044642 | -0.015906 | -0.012556 | 0.020446 | 0.041274 | -0.043401 | 0.034309 | 0.014072 | -0.009362 |
| 293 | -0.0709 | -0.044642 | 0.092953 | 0.012691 | 0.020446 | 0.042527 | 0.000779 | 0.00036 | -0.054544 | -0.001078 |
| 348 | 0.030811 | -0.044642 | -0.020218 | -0.005671 | -0.004321 | -0.029497 | 0.078093 | -0.039493 | -0.010904 | -0.001078 |
| 260 | 0.041708 | -0.044642 | -0.008362 | -0.057314 | 0.008063 | -0.031376 | 0.151726 | -0.076395 | -0.080237 | -0.017646 |


In [106]:
target.sample(5)

| | progression |
|---|---|
| 345 | 139.0 |
| 255 | 153.0 |
| 30 | 129.0 |
| 347 | 88.0 |
| 398 | 242.0 |


#### Preparing the model

Next, we split the data in an 80:20 ratio: 80% for training and 20% for testing.

In [107]:
train_test_split?

In [109]:
x_train, x_test, y_train, y_test = train_test_split(predictor, diabetes.target, test_size=0.2, random_state=0)

In [110]:
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)

(353, 10) (353,)
(89, 10) (89,)


We now have our training and test sets. Next, we fit a linear regression model to the training data.

In [115]:
lm = linear_model.LinearRegression()
model = lm.fit(x_train, y_train)

In [118]:
model.score(x_test, y_test)

0.33222203269065154
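
For a regressor such as `LinearRegression`, `score` returns the coefficient of determination R², not an accuracy. The following is a minimal sketch of that relationship; it reloads the data so that it runs on its own, and the names `ss_res`, `ss_tot`, and `r2_manual` are illustrative, not part of the notebook. It compares `score` with `sklearn.metrics.r2_score` and with the definition R² = 1 - SS_res / SS_tot.

```python
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Same data and split settings as in the cells above
X, y = datasets.load_diabetes(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = linear_model.LinearRegression().fit(x_train, y_train)
y_pred = model.predict(x_test)

# R^2 from its definition: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# All three values should agree
print(model.score(x_test, y_test), r2_score(y_test, y_pred), r2_manual)
```

An R² of 1 means perfect predictions, 0 means the model does no better than always predicting the mean of `y_test`, and negative values are possible for models that do worse than that.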

#### Applying KFold


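We use K = 10 folds (`cv=10`). Note that `cross_val_score` fits a fresh copy of the estimator on each fold, so the earlier fit on `x_train` is not reused here.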

In [125]:
scores = cross_val_score(model, predictor, diabetes.target, cv=10)

In [126]:
print(scores)

[0.55614411 0.23056092 0.35357777 0.62190498 0.26587602 0.61819338
 0.41815916 0.43515232 0.43436983 0.68568514]
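
To make the procedure explicit, here is a hedged sketch of what `cross_val_score` does for this case. It assumes that, for a regressor and an integer `cv`, scikit-learn uses an unshuffled `KFold` splitter, which is how its documentation describes the default. The names `estimator`, `manual_scores`, and `library_scores` are illustrative.

```python
import numpy as np
from sklearn import datasets, linear_model
from sklearn.base import clone
from sklearn.model_selection import KFold, cross_val_score

X, y = datasets.load_diabetes(return_X_y=True)
estimator = linear_model.LinearRegression()

# 10 contiguous folds, no shuffling (assumed to match cv=10 for a regressor)
kf = KFold(n_splits=10)

manual_scores = []
for train_idx, test_idx in kf.split(X):
    # Fit a fresh copy on 9 folds, then score (R^2) on the held-out fold
    fold_model = clone(estimator).fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))
manual_scores = np.array(manual_scores)

# Compare with the one-line helper used in the notebook
library_scores = cross_val_score(estimator, X, y, cv=10)
print(np.allclose(manual_scores, library_scores))
```

Each sample lands in the held-out fold exactly once, so every score is computed on data the corresponding model did not see during fitting.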


In [128]:
print(scores.mean())

0.4619623619583371


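Hence, the mean score across the 10 folds is about 0.46. For a regressor this score is the coefficient of determination (R²), not a classification accuracy, so it should not be read as "46% accurate".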
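
The per-fold scores printed earlier range from about 0.23 to 0.69, so the mean alone hides a lot of variation. A small sketch that reports the spread alongside the mean, reusing those printed values (the choice of the sample standard deviation, `ddof=1`, is an assumption made here, not something the notebook specifies):

```python
import numpy as np

# Per-fold R^2 scores copied from the cross_val_score output above
scores = np.array([
    0.55614411, 0.23056092, 0.35357777, 0.62190498, 0.26587602,
    0.61819338, 0.41815916, 0.43515232, 0.43436983, 0.68568514,
])

mean = scores.mean()
std = scores.std(ddof=1)  # sample standard deviation across folds

print(f"R^2 = {mean:.3f} +/- {std:.3f} "
      f"(min {scores.min():.3f}, max {scores.max():.3f})")
```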