# Chapter 03 - Algorithm Evaluation Methods

In [1]:
from random import seed, randrange

### Train and Test Split

This method splits the dataset into two parts:

* Training dataset
* Test dataset

The training dataset is used by the machine learning algorithm to train the model. The test dataset is used to evaluate the performance of the model. 

The rows assigned to each dataset are randomly selected.

If multiple algorithms or multiple configurations of the same algorithm are compared, the same train and test split of the dataset should be used.

In [2]:
# Split a dataset into a train and test set
def train_test_split(dataset, split=0.6):
    train = list()
    # Calculates how many rows the training set 
    # requires from the provided dataset.
    train_size = split * len(dataset)
    # Make a copy of the original dataset
    dataset_copy = list(dataset)
    # Random rows are selected and removed from
    # the copied dataset and added to the train
    # dataset.
    while len(train) < train_size:
        index = randrange(len(dataset_copy))
        train.append(dataset_copy.pop(index))
    return train, dataset_copy

In [3]:
# Test train/test split
seed(1)
dataset = [[1], [2], [3], [4], [5], 
[6], [7], [8], [9], [10]]
train, test = train_test_split(dataset)
print(train)
print(test)

[[3], [2], [7], [1], [8], [9]]
[[4], [5], [6], [10]]
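
To compare algorithms on the same split, as recommended above, the split has to be reproducible. The following is a minimal sketch of one way to do that, not the notebook's own method: `split_with_seed` and its `seed_value` parameter are hypothetical names, and it shuffles and slices with a private `random.Random` generator instead of popping rows using the global generator. It also truncates the training size with `int`, which can differ by one row from `train_test_split` when `split * len(dataset)` is not a whole number.

```python
from random import Random


def split_with_seed(dataset, split=0.6, seed_value=1):
    """Hypothetical helper: return a (train, test) split that depends only on seed_value."""
    # A private generator keeps the split independent of the global random state
    rng = Random(seed_value)
    # Shuffle a copy so the caller's dataset is left untouched
    rows = list(dataset)
    rng.shuffle(rows)
    # Assumption: truncate to a whole number of training rows
    train_size = int(split * len(rows))
    return rows[:train_size], rows[train_size:]


dataset = [[i] for i in range(1, 11)]
split_a = split_with_seed(dataset, seed_value=1)
split_b = split_with_seed(dataset, seed_value=1)
# The same seed gives the same rows in train and test
print(split_a == split_b)  # True
```

Each algorithm or configuration under comparison can then be given the same `train` and `test` lists, so differences in score come from the algorithm and not from the split.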


### K-fold Cross-Validation Split

The K-Fold cross-validation method (also called just cross-validation) is a resampling method that usually gives a more reliable estimate of algorithm performance than a single train and test split.

The data is divided into k groups (folds). The algorithm is trained and evaluated k times, and the reported performance is the mean of the k scores.

In each round, the algorithm is trained on k-1 groups and evaluated on the remaining group. This is repeated so that each of the k groups is held out once and used as the test set.

In [7]:
# Split a dataset into k folds
def cross_validation_split(dataset, folds=3):
    dataset_split = list()
    dataset_copy = list(dataset)
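    # Calculate the size of each fold. The result is truncated,
    # so any leftover rows (len(dataset) % folds) are not
    # assigned to a fold.
    fold_size = int(len(dataset)/folds)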
    for _ in range(folds):
        fold = list()
        # Same process as in train_test_split:
        # remove a random row from dataset_copy
        # and add it to the current fold.
        while len(fold) < fold_size:
            index = randrange(len(dataset_copy))
            fold.append(dataset_copy.pop(index))
        dataset_split.append(fold)
    return dataset_split

In [8]:
# Test cross validation split
seed(1)
dataset = [[1], [2], [3], [4], [5], 
[6], [7], [8], [9], [10]]
folds = cross_validation_split(dataset)
print(folds)

[[[3], [2], [7]], [[1], [8], [9]], [[10], [6], [5]]]
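
The train-and-evaluate loop described at the start of this section could be sketched as follows. This is an illustration under stated assumptions, not part of the original notebook: `evaluate_with_folds` is a hypothetical name, the "algorithm" is a placeholder that predicts the mean of the training values, and the score is the mean absolute error on the held-out fold. The folds are the ones printed above, written out literally so the example runs on its own.

```python
# Folds copied from the output above
folds = [[[3], [2], [7]], [[1], [8], [9]], [[10], [6], [5]]]


def evaluate_with_folds(folds):
    """Hypothetical helper: return one score per held-out fold."""
    scores = []
    for i, test_fold in enumerate(folds):
        # Train on the k-1 folds that are not held out in this round
        train_rows = [row for j, fold in enumerate(folds) if j != i
                      for row in fold]
        # Placeholder "model": always predict the mean training value
        prediction = sum(row[0] for row in train_rows) / len(train_rows)
        # Score on the held-out fold: mean absolute error (lower is better)
        errors = [abs(row[0] - prediction) for row in test_fold]
        scores.append(sum(errors) / len(errors))
    return scores


scores = evaluate_with_folds(folds)
# The reported performance is the mean of the k scores
mean_score = sum(scores) / len(scores)
print(scores)
print(mean_score)
```

Any real algorithm and metric can replace the placeholder model and error calculation; the loop structure (hold out one fold, train on the rest, average the scores) stays the same.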
