> Reference:
+ [machinelearningmastery: data resampling](http://machinelearningmastery.com/evaluate-performance-machine-learning-algorithms-python-using-resampling/)

We can't train a machine learning algorithm on a dataset and then evaluate it using predictions on that same dataset: an overfit model would score well on data it has already seen, and the estimate would be overly optimistic.

We must evaluate our machine learning algorithms on data that is not used to train the algorithm.
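
To illustrate the point, consider a deliberately extreme model that only memorizes its training data. This is a minimal sketch, not part of the referenced tutorial: the `Memorizer` class, the synthetic coin-flip labels, and the 670/330 split are all assumptions made for this example.

```python
import random
from collections import Counter


class Memorizer:
    """Hypothetical model: remembers every training label, guesses otherwise."""

    def fit(self, X, y):
        self.table = dict(zip(X, y))
        # Fallback for unseen inputs: the most common training label.
        self.default = Counter(y).most_common(1)[0][0]
        return self

    def score(self, X, y):
        predictions = [self.table.get(x, self.default) for x in X]
        return sum(p == t for p, t in zip(predictions, y)) / len(y)


rng = random.Random(7)
X = list(range(1000))               # unique ids as stand-in features
y = [rng.randint(0, 1) for _ in X]  # labels carry no learnable signal

X_train, y_train = X[:670], y[:670]
X_test, y_test = X[670:], y[670:]

model = Memorizer().fit(X_train, y_train)
train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print("Accuracy on training data: {:.1%}".format(train_accuracy))
print("Accuracy on held-out data: {:.1%}".format(test_accuracy))
```

The training score is perfect by construction, while the held-out score should stay close to chance. That gap is what an evaluation on unseen data is meant to expose.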

The evaluation is an estimate that we can use to talk about how well we think the algorithm may actually do in practice. It is not a guarantee of performance.

Once we estimate the performance of our algorithm, we can then re-train the final algorithm on the entire training dataset and get it ready for operational use.

# Tips #
+ Generally, k-fold cross validation with k set to 3, 5, or 10 is the gold standard for evaluating the performance of a machine learning algorithm on unseen data.
+ Using a train/test split is good for speed when using a slow algorithm and produces performance estimates with lower bias when using large datasets.
+ Techniques like leave-one-out cross validation and repeated random splits can be useful intermediates when trying to balance variance in the estimated performance, model training speed and dataset size.

# Train and Test Sets #

**Pluses**
+ Fast
+ Good for large datasets

**Minuses**
+ High variance (differences in train and test datasets can result in differences in accuracy estimation)

In addition to specifying the size of the split, we also specify the random seed. Because the split of the data is random, we want to ensure that the results are reproducible. 
This is important if we want to compare this result to the estimated accuracy of another machine learning algorithm or the same algorithm with a different configuration.

Example (logistic regression on the Pima Indians diabetes dataset):

    # Evaluate using a train and a test set (67% / 33%)
    import pandas
    # sklearn.cross_validation was removed; its helpers now live in sklearn.model_selection
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
    names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
    dataframe = pandas.read_csv(url, names=names)
    array = dataframe.values
    X = array[:, 0:8]  # eight input features
    Y = array[:, 8]    # class label
    test_size = 0.33
    seed = 7  # fixed seed so the random split is reproducible
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
    # solver="liblinear" matches the default of older scikit-learn releases
    model = LogisticRegression(solver="liblinear")
    model.fit(X_train, Y_train)
    result = model.score(X_test, Y_test)
    print("Accuracy: {:.3%}".format(result))

Accuracy: 75.591%
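
The role of the random seed can be illustrated without scikit-learn. The sketch below is a simplified stand-in for a seeded split, not the library's implementation: `split_indices` is a hypothetical helper, the sample count of 768 is an assumption about the dataset size, and rounding the test set up is a choice made for this example.

```python
import math
import random


def split_indices(n_samples, test_size, seed):
    """Return (train, test) index lists from a seeded shuffle."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * n_samples)  # assumption: round the test set up
    return indices[n_test:], indices[:n_test]


train_a, test_a = split_indices(768, 0.33, seed=7)
train_b, test_b = split_indices(768, 0.33, seed=7)
train_c, test_c = split_indices(768, 0.33, seed=8)

print("train/test sizes:", len(train_a), len(test_a))
print("same seed gives same split:", test_a == test_b)
print("different seed gives same split:", test_a == test_c)
```

With the seed fixed, two algorithms (or two configurations of one algorithm) are scored on exactly the same held-out rows, so a difference in accuracy is not an artifact of a different split.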


# K-fold Cross Validation #

For modestly sized datasets, in the thousands or tens of thousands of records, k values of 3, 5, and 10 are common.

**Pluses**
+ Less variance than single train-test set split.

**Minuses**
+ Computationally more expensive than single train-test set split, especially for big datasets.

Example (10-fold cross validation of logistic regression on the same dataset):

    # Evaluate using Cross Validation
    import pandas
    # sklearn.cross_validation was removed; its helpers now live in sklearn.model_selection
    from sklearn.model_selection import KFold, cross_val_score
    from sklearn.linear_model import LogisticRegression
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
    names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
    dataframe = pandas.read_csv(url, names=names)
    array = dataframe.values
    X = array[:, 0:8]  # eight input features
    Y = array[:, 8]    # class label
    num_folds = 10
    # shuffle defaults to False, so the folds are contiguous and no seed is needed
    kfold = KFold(n_splits=num_folds)
    # solver="liblinear" matches the default of older scikit-learn releases
    model = LogisticRegression(solver="liblinear")
    results = cross_val_score(model, X, Y, cv=kfold)  # one accuracy score per fold
    print("Accuracy: mean={:.3%}, std={:.3%}".format(results.mean(), results.std()))

Accuracy: mean=76.951%, std=4.841%
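
What the splitter does can be sketched in plain Python. This is an illustrative reimplementation of unshuffled k-fold index generation, not scikit-learn's code: `kfold_indices` is a hypothetical helper, and the sample count of 768 is an assumption about the dataset size.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train, test) index lists for contiguous, unshuffled folds."""
    # Spread the remainder so the first n_samples % n_folds folds get one extra row.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        stop = start + base + (1 if fold < extra else 0)
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop


folds = list(kfold_indices(768, 10))
fold_sizes = [len(test) for _, test in folds]
print("test fold sizes:", fold_sizes)
```

Every row lands in exactly one test fold, so each observation is used for evaluation once and for training k - 1 times. The reported mean and standard deviation are then taken over the k per-fold scores.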


# Leave One Out Cross Validation #

Leave-one-out is k-fold cross validation with k set to the number of observations in the dataset, so each test fold contains exactly one observation.

**Pluses**
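+ Produces a large number of performance measures (one per observation) that can be summarized into a more reasonable estimate of accuracy for the model on unseen data.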

**Minuses**
+ Computationally more expensive than k-fold cross validation.
+ More variance in the scores than k-fold cross validation, since each score comes from a single observation.

Example (leave-one-out cross validation of logistic regression on the same dataset):

    # Evaluate using Leave One Out Cross Validation
    import pandas
    # sklearn.cross_validation was removed; its helpers now live in sklearn.model_selection
    from sklearn.model_selection import LeaveOneOut, cross_val_score
    from sklearn.linear_model import LogisticRegression
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
    names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
    dataframe = pandas.read_csv(url, names=names)
    array = dataframe.values
    X = array[:, 0:8]  # eight input features
    Y = array[:, 8]    # class label
    # the number of splits is inferred from X, so no sample count is passed
    loocv = LeaveOneOut()
    # solver="liblinear" matches the default of older scikit-learn releases
    model = LogisticRegression(solver="liblinear")
    results = cross_val_score(model, X, Y, cv=loocv)  # one 0/1 score per observation
    print("Accuracy: mean={:.3%}, std={:.3%}".format(results.mean(), results.std()))

Accuracy: mean=76.823%, std=42.196%
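
The large standard deviation follows directly from scoring single observations: each score is either 0 or 1, so the spread is that of a 0/1 variable, sqrt(p * (1 - p)). The sketch below reproduces the reported figures under two assumptions that are not stated in the original: the dataset has 768 rows, and 590 of them were predicted correctly (the count implied by a mean of 76.823%).

```python
import math
import statistics

n_samples = 768  # assumed number of rows in the dataset
n_correct = 590  # inferred from the reported mean accuracy, not taken from the source

# One score per held-out observation: 1.0 if predicted correctly, else 0.0.
scores = [1.0] * n_correct + [0.0] * (n_samples - n_correct)

mean = sum(scores) / len(scores)
std = statistics.pstdev(scores)             # population std, the same convention as numpy's default
closed_form = math.sqrt(mean * (1 - mean))  # std of a 0/1 variable with success rate `mean`

print("Accuracy: mean={:.3%}, std={:.3%}".format(mean, std))
print("sqrt(p * (1 - p)) = {:.3%}".format(closed_form))
```

Read this way, the standard deviation here says little about how stable the estimate is; it is fixed by the mean accuracy alone.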


# Repeated Random Test-Train Splits #

Create a random split of the data like the train/test split, but repeat the process of splitting and evaluation of the algorithm multiple times, like cross validation.

**Pluses**
+ Each repetition is as fast as a single train/test split.
+ Reduces the variance of the estimate, as k-fold cross validation does.

**Minuses**
+ Repetitions may include much of the same data in the train or the test split from run to run, introducing redundancy into the evaluation.

Example (10 repeated 67% / 33% splits of the same dataset):

    # Evaluate using Shuffle Split Cross Validation
    # data split = 67% / 33%; repetition = 10 times
    import pandas
    # sklearn.cross_validation was removed; its helpers now live in sklearn.model_selection
    from sklearn.model_selection import ShuffleSplit, cross_val_score
    from sklearn.linear_model import LogisticRegression
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
    names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
    dataframe = pandas.read_csv(url, names=names)
    array = dataframe.values
    X = array[:, 0:8]  # eight input features
    Y = array[:, 8]    # class label
    num_samples = 10   # number of repeated random splits
    test_size = 0.33
    seed = 7  # fixed seed so the random splits are reproducible
    shuffle_split = ShuffleSplit(n_splits=num_samples, test_size=test_size, random_state=seed)
    # solver="liblinear" matches the default of older scikit-learn releases
    model = LogisticRegression(solver="liblinear")
    results = cross_val_score(model, X, Y, cv=shuffle_split)  # one accuracy score per split
    print("Accuracy: mean={:.3%}, std={:.3%}".format(results.mean(), results.std()))

Accuracy: mean=76.535%, std=1.672%
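
The redundancy mentioned under Minuses can be made visible by counting how often each row ends up in a test set. This is a simplified sketch rather than scikit-learn's implementation: `shuffle_split_indices` is a hypothetical helper, the sample count of 768 is an assumption about the dataset size, and rounding the test set up is a choice made for this example.

```python
import math
import random


def shuffle_split_indices(n_samples, n_splits, test_size, seed):
    """Yield (train, test) index lists from independent random shuffles."""
    rng = random.Random(seed)
    n_test = math.ceil(test_size * n_samples)  # assumption: round the test set up
    for _ in range(n_splits):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        yield indices[n_test:], indices[:n_test]


n_samples = 768  # assumed number of rows in the dataset
splits = list(shuffle_split_indices(n_samples, n_splits=10, test_size=0.33, seed=7))

# Count how often each row lands in a test set across the repetitions.
test_counts = [0] * n_samples
for _, test in splits:
    for index in test:
        test_counts[index] += 1

never_tested = sum(1 for count in test_counts if count == 0)
tested_repeatedly = sum(1 for count in test_counts if count > 1)
print("rows never in a test set:", never_tested)
print("rows in more than one test set:", tested_repeatedly)
```

Unlike k-fold cross validation, where every row is tested exactly once, independent random splits test some rows several times and may leave others untested.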
