# Splitting dataset
- Generally, k-fold cross-validation is the gold standard for evaluating the performance of a machine learning algorithm on unseen data, with k set to 3, 5, or 10.
- A train/test split is good for speed when using a slow algorithm, and it produces performance estimates with lower bias when using large datasets.
- Techniques like leave-one-out cross-validation and repeated random splits can be useful intermediates when trying to balance variance in the estimated performance, model training speed, and dataset size. The best advice is to experiment and find a technique for your problem that is fast and produces reasonable performance estimates that you can use to make decisions.
- If in doubt, use 10-fold cross-validation.
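
To make the k-fold idea concrete, here is a minimal sketch of how `KFold` partitions a dataset. The ten-element array and the choice of `n_splits=5` are illustrative assumptions, not part of the notebook's data.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples, identified by their index (hypothetical data).
samples = np.arange(10)

# Without shuffling, KFold cuts the data into k consecutive blocks.
kfold_demo = KFold(n_splits=5)

test_counts = np.zeros(len(samples), dtype=int)
for fold, (train_idx, test_idx) in enumerate(kfold_demo.split(samples)):
    test_counts[test_idx] += 1
    print(f"fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")

# Count how often each sample was held out across the k folds.
print(test_counts.tolist())
```

Each sample lands in the test fold exactly once and in the training folds k - 1 times, which is why every observation contributes to the final estimate.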

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

### Read data

In [2]:
filename = "pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pd.read_csv(filename, names=names)

data.head(10)

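|   | preg | plas | pres | skin | test | mass | pedi | age | class |
|---|------|------|------|------|------|------|-------|-----|-------|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| 5 | 5 | 116 | 74 | 0 | 0 | 25.6 | 0.201 | 30 | 0 |
| 6 | 3 | 78 | 50 | 32 | 88 | 31.0 | 0.248 | 26 | 1 |
| 7 | 10 | 115 | 0 | 0 | 0 | 35.3 | 0.134 | 29 | 0 |
| 8 | 2 | 197 | 70 | 45 | 543 | 30.5 | 0.158 | 53 | 1 |
| 9 | 8 | 125 | 96 | 0 | 0 | 0.0 | 0.232 | 54 | 1 |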


In [3]:
data.shape

(768, 9)

In [4]:
data.dtypes

preg       int64
plas       int64
pres       int64
skin       int64
test       int64
mass     float64
pedi     float64
age        int64
class      int64
dtype: object

### Separate dataset

In [5]:
data = data.values

In [6]:
X = data[:, 0:8]
Y = data[:, 8]

### Split into Train and Test Sets

In [7]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=7)
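
As a quick check on what `test_size=0.33` and `random_state=7` do, the sketch below runs the same call on placeholder arrays. `X_demo` and `Y_demo` are hypothetical stand-ins with the same number of rows as the dataset (768), not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays shaped like the dataset above: 768 rows, 8 features.
X_demo = np.arange(768 * 8).reshape(768, 8)
Y_demo = np.arange(768)

X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_demo, Y_demo, test_size=0.33, random_state=7
)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state yields the same split.
_, _, _, Y_te_again = train_test_split(
    X_demo, Y_demo, test_size=0.33, random_state=7
)
print(np.array_equal(Y_te, Y_te_again))
```

The test set size is rounded up, so 33% of 768 rows gives 254 test rows and 514 training rows. Fixing `random_state` makes the split reproducible, which helps when comparing algorithms on identical data.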

In [8]:
from sklearn.linear_model import LogisticRegression

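# liblinear is the solver shown in the fitted model's output below;
# newer scikit-learn versions default to lbfgs.
model = LogisticRegression(solver='liblinear')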
model.fit(X_train, Y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [9]:
result = model.score(X_test, Y_test)
print("Accuracy: %.3f%%" % (result*100.0))

Accuracy: 75.591%


### K-fold Cross-Validation

In [10]:
from sklearn.model_selection import KFold

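# random_state only has an effect when shuffle=True; recent scikit-learn
# versions raise an error if it is set without shuffling.
kfold = KFold(n_splits=10)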

In [11]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver='liblinear')  # solver shown in the earlier fit output

In [12]:
from sklearn.model_selection import cross_val_score

results = cross_val_score(model, X, Y, cv=kfold)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))

Accuracy: 76.951% (4.841%)
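
For intuition about what `cross_val_score` computes, here is a rough sketch of the equivalent manual loop. It uses synthetic data from `make_classification` as a stand-in (an assumption, since the CSV file may not be at hand), so its scores are not comparable to the figures above.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (hypothetical; not the diabetes dataset).
X_demo, Y_demo = make_classification(n_samples=200, n_features=8, random_state=7)

model_demo = LogisticRegression(solver="liblinear")
kfold_demo = KFold(n_splits=10)

# Fit a fresh copy of the model on each training fold and score it
# on the matching held-out fold.
manual_scores = []
for train_idx, test_idx in kfold_demo.split(X_demo):
    fold_model = clone(model_demo)
    fold_model.fit(X_demo[train_idx], Y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[test_idx], Y_demo[test_idx]))
manual_scores = np.array(manual_scores)

auto_scores = cross_val_score(model_demo, X_demo, Y_demo, cv=kfold_demo)
print(np.allclose(manual_scores, auto_scores))
print("Accuracy: %.3f%% (%.3f%%)" % (manual_scores.mean() * 100.0, manual_scores.std() * 100.0))
```

The reported pair is the mean and standard deviation of the per-fold accuracies, so the second number describes how much the estimate varies from fold to fold.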


### Leave One Out Cross-Validation
- k-fold cross-validation in which each test fold contains exactly one sample, so k equals the number of samples

In [13]:
from sklearn.model_selection import LeaveOneOut

loocv = LeaveOneOut()

In [14]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver='liblinear')  # solver shown in the earlier fit output

In [15]:
from sklearn.model_selection import cross_val_score

results = cross_val_score(model, X, Y, cv=loocv)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))

Accuracy: 76.953% (42.113%)
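
The large standard deviation here is expected rather than alarming. Each leave-one-out fold scores a single sample, so every per-fold accuracy is either 0 or 1, and the spread of such values is sqrt(p * (1 - p)) for mean accuracy p. The sketch below checks this arithmetic; the count of 591 correct predictions is an assumption inferred from 76.953% of 768, not a value printed by the notebook.

```python
import numpy as np

n_samples = 768
# Assumption: 591 correct predictions, inferred from 76.953% of 768 samples.
n_correct = 591

# One score per leave-one-out fold: 1.0 for a hit, 0.0 for a miss.
scores = np.array([1.0] * n_correct + [0.0] * (n_samples - n_correct))

p = scores.mean()
print("Accuracy: %.3f%% (%.3f%%)" % (p * 100.0, scores.std() * 100.0))
print("sqrt(p * (1 - p)): %.3f%%" % (np.sqrt(p * (1 - p)) * 100.0))
```

Both lines give a spread of about 42.113%, matching the output above. The figure therefore reflects the 0/1 nature of the per-fold scores and says little about how stable the mean estimate is.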


### Repeated Random Test-Train Splits
- Repeats the process of splitting the data and evaluating the algorithm multiple times.
- Combines the speed of a train/test split with the reduced variance in estimated performance of k-fold cross-validation.
- The process can be repeated as many times as needed to make the performance estimate more precise.
- A downside is that repetitions may include much of the same data in the train or test split from run to run, introducing redundancy into the evaluation.
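
The redundancy mentioned in the last point can be seen by counting how often each sample ends up in a test set. This is a small sketch on 100 hypothetical sample indices, using the same `ShuffleSplit` settings as the cells that follow.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

n_samples = 100  # hypothetical dataset size for illustration
splitter = ShuffleSplit(n_splits=10, test_size=0.33, random_state=7)

# Tally how many of the 10 random test sets each sample appears in.
test_counts = np.zeros(n_samples, dtype=int)
for train_idx, test_idx in splitter.split(np.arange(n_samples)):
    test_counts[test_idx] += 1

print("times in a test set (min, max):", test_counts.min(), test_counts.max())
print("samples never tested:", int((test_counts == 0).sum()))
```

Unlike k-fold, where every sample is tested exactly once, the random splits are drawn independently, so some samples are tested several times and others may never be tested at all.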

In [16]:
from sklearn.model_selection import ShuffleSplit

kfold = ShuffleSplit(n_splits=10, test_size=0.33, random_state=7)

In [17]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver='liblinear')  # solver shown in the earlier fit output

In [18]:
from sklearn.model_selection import cross_val_score

results = cross_val_score(model, X, Y, cv=kfold)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))

Accuracy: 76.496% (1.698%)
