# How To Create an Algorithm Test Harness From Scratch With Python

By Jason Brownlee on December 11, 2019. [Here](https://machinelearningmastery.com/create-algorithm-test-harness-scratch-python/) in [Code Algorithms From Scratch](https://machinelearningmastery.com/category/algorithms-from-scratch/). [Dataset File](https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv)

We cannot know which algorithm will be best for a given problem.

Therefore, we need to design a test harness that we can use to evaluate different machine learning algorithms.

After completing this tutorial, you will know:

- How to implement a train-test algorithm test harness.
- How to implement a k-fold cross-validation algorithm test harness.

A __test harness__ provides a `consistent way to evaluate machine learning algorithms` on a dataset.

It involves 3 elements:

1. The __resampling__ method to split-up the dataset.
2. The __machine learning algorithm__ to evaluate.
3. The __performance measure__ by which to evaluate predictions.

__The loading and preparation of a dataset is a prerequisite step that must have been completed prior to using the test harness__.

The test harness must allow for different machine learning algorithms to be evaluated, whilst the dataset, resampling method and performance measures are kept constant.

The __Zero Rule algorithm__ will be evaluated as part of the tutorial. The Zero Rule algorithm `always predicts the class that has the most observations in the training dataset`.

## Tutorial
This tutorial is broken down into three main sections:

1. Train-Test Algorithm Test Harness.
2. Cross-Validation Algorithm Test Harness.
3. Extensions

## 1. Train-Test Algorithm Test Harness.
The __train-test split__ is a simple `resampling method` that can be used to evaluate a machine learning algorithm.

We need a function that can take a dataset and an algorithm and return one or more performance scores.

Below is a function named __evaluate_algorithm()__ that achieves this. It takes __3 fixed arguments__: the `dataset`, the `algorithm` function and the `split percentage` for the train-test split, given as a fraction such as 0.6. Any further arguments are passed through to the algorithm unchanged.

1. The dataset is split into `train` and `test` elements.
2. A `copy of the test` set is made.
3. In the copy, each `output value is cleared` by setting it to the __None__ value, to prevent the algorithm from cheating accidentally.

The algorithm function is expected to return a __list of predictions__, one for each row in the test dataset. These are compared to the actual output values from the unmodified test dataset by the __accuracy_metric()__ function, and by the __mae_metric()__ and __rmse_metric()__ functions defined alongside it.
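
The copy-and-clear step can be illustrated on its own. The snippet below is a minimal sketch on three made-up rows (the values are purely illustrative and not taken from the diabetes dataset); it shows that the copies lose their labels while the original rows keep them for scoring.

```python
# Toy rows: two input values followed by the class label in the last column.
# These values are invented for illustration only.
test = [[1.0, 2.0, 0], [3.0, 4.0, 1], [5.0, 6.0, 1]]

# Copy each row and blank out the label so an algorithm cannot read it.
test_set = []
for row in test:
    row_copy = list(row)   # shallow copy; the original row is left untouched
    row_copy[-1] = None
    test_set.append(row_copy)

# The true labels are still available from the unmodified rows.
actual = [row[-1] for row in test]

print(test_set)  # [[1.0, 2.0, None], [3.0, 4.0, None], [5.0, 6.0, None]]
print(actual)    # [0, 1, 1]
```

Because `list(row)` creates a new list for each row, setting the last element of the copy to `None` does not affect the original test rows.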

In [1]:
# Train-Test Test Harness
from random import seed
from random import randrange
from csv import reader
from math import sqrt

In [2]:
# Load a CSV file
def load_csv(filename):
    dataset = list()
    with open(filename, 'r') as file:
        csv_reader = reader(file)
        for row in csv_reader:
            if not row:
                continue
            dataset.append(row)
    return dataset

In [3]:
# Convert string column to float
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())

In [4]:
# Split a dataset into a train and test set
def train_test_split(dataset, split):
    train = list()
    train_size = split * len(dataset)
    dataset_copy = list(dataset)
    while len(train) < train_size:
        index = randrange(len(dataset_copy))
        train.append(dataset_copy.pop(index))
    return train, dataset_copy

In [5]:
# Calculate accuracy percentage
def accuracy_metric(actual, predicted):
    correct = 0
    for i in range(len(actual)):
        if actual[i] == predicted[i]:
            correct += 1
    return correct / float(len(actual)) * 100.0

In [6]:
# Calculate mean absolute error
def mae_metric(actual, predicted):
    sum_error = 0.0
    for i in range(len(actual)):
        sum_error += abs(predicted[i] - actual[i])
    return sum_error / float(len(actual))

In [7]:
# Calculate root mean squared error
def rmse_metric(actual, predicted):
    sum_error = 0.0
    for i in range(len(actual)):
        prediction_error = predicted[i] - actual[i]
        sum_error += (prediction_error ** 2)
    mean_error = sum_error / float(len(actual))
    return sqrt(mean_error)
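
To make the three measures concrete, here is a small worked example on made-up labels. The functions `accuracy`, `mae` and `rmse` are compact restatements written for this sketch, not the tutorial's own functions, and the label lists are invented. One thing the sketch suggests: when labels are 0 and 1, the mean absolute error is simply the fraction of wrong predictions, and the RMSE is its square root. That relationship should not be expected to hold for regression outputs.

```python
from math import sqrt

def accuracy(actual, predicted):
    # Percentage of predictions that match the actual label exactly.
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual) * 100.0

def mae(actual, predicted):
    # Mean of the absolute differences.
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    # Square root of the mean squared difference.
    return sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual))

actual = [0, 1, 1, 0, 1]
predicted = [0, 0, 0, 0, 0]   # e.g. an algorithm that always predicts class 0

acc = accuracy(actual, predicted)   # 2 of 5 predictions are correct
err = mae(actual, predicted)        # 3 of 5 predictions are off by exactly 1
root = rmse(actual, predicted)      # each squared error is also 0 or 1

print(acc, err, root)
```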

In [8]:
# Evaluate an algorithm using a train/test split
def evaluate_algorithm(dataset, algorithm, split, *args):
    train, test = train_test_split(dataset, split)
    test_set = list()
    for row in test:
        row_copy = list(row)
        row_copy[-1] = None
        test_set.append(row_copy)
    predicted = algorithm(train, test_set, *args)
    actual = [row[-1] for row in test]

    accuracy = accuracy_metric(actual, predicted)
    mae = mae_metric(actual, predicted)
    rmse = rmse_metric(actual, predicted)
    
    return accuracy, mae, rmse

In [9]:
# zero rule algorithm for classification
def zero_rule_algorithm_classification(train, test):
    output_values = [row[-1] for row in train]
    prediction = max(set(output_values), key=output_values.count)
    predicted = [prediction for i in range(len(test))]
    return predicted

In [10]:
# Test the zero rule algorithm on the diabetes dataset
seed(1)

# load and prepare data
filename = '..\\..\\..\\data\\pima-indians-diabetes.csv'
dataset = load_csv(filename)
for i in range(len(dataset[0])):
    str_column_to_float(dataset, i)

# evaluate algorithm
split = 0.6
accuracy, mae, rmse = evaluate_algorithm(dataset, zero_rule_algorithm_classification, split)

print('Accuracy: %.3f%%' % (accuracy))
print('Mean Absolute Error (MAE): %.3f' % (mae))
print('Root Mean Squared Error (RMSE): %.3f' % (rmse))

Accuracy: 67.427%
Mean Absolute Error (MAE): 0.326
Root Mean Squared Error (RMSE): 0.571


The dataset was split into 60% for training the model and 40% for evaluating it.
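
The exact row counts behind that split can be worked out from the loop in train_test_split(). The sketch below uses a hypothetical helper, `split_sizes`, that mimics the loop's stopping rule without touching any data. It assumes the diabetes CSV holds 768 rows, a figure not stated in this tutorial.

```python
def split_sizes(n_rows, split):
    """Count the rows that would land in the train and test sets."""
    train_size = split * n_rows
    n_train = 0
    # Same stopping rule as train_test_split(): keep adding rows
    # while the train set is smaller than split * len(dataset).
    while n_train < train_size:
        n_train += 1
    return n_train, n_rows - n_train

print(split_sizes(768, 0.6))  # (461, 307)
print(split_sizes(10, 0.5))   # (5, 5)
```

Under that assumption the test set has 307 rows, and 207 correct predictions out of 307 would give the 67.427% accuracy reported above. Note that a fractional target such as 460.8 is effectively rounded up, so the train set receives the extra row.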

Notice how the name of the Zero Rule algorithm, __zero_rule_algorithm_classification__, was passed as an argument to the __evaluate_algorithm()__ function. You can see how this test harness may be used again and again with different algorithms.
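
As a rough illustration of that reuse, the sketch below passes two different algorithms to the same evaluation function. To stay short it is deliberately simplified: `evaluate` takes a fixed train and test set instead of splitting randomly, the rows are made up, and `random_guess` is a hypothetical baseline introduced only for this example.

```python
from random import Random

def zero_rule(train, test):
    # Predict the most common label in the training rows for every test row.
    labels = [row[-1] for row in train]
    majority = max(set(labels), key=labels.count)
    return [majority for _ in test]

def random_guess(train, test, rng_seed=1):
    # Hypothetical baseline: draw each prediction from the training labels.
    rng = Random(rng_seed)
    labels = [row[-1] for row in train]
    return [rng.choice(labels) for _ in test]

def evaluate(train, test, algorithm, *args):
    # Hide the labels, ask the algorithm for predictions, then score accuracy.
    test_set = [row[:-1] + [None] for row in test]
    predicted = algorithm(train, test_set, *args)
    actual = [row[-1] for row in test]
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual) * 100.0

# Made-up rows: one input value followed by the class label.
train = [[0.1, 0], [0.2, 0], [0.3, 0], [0.9, 1]]
test = [[0.15, 0], [0.8, 1]]

# The harness stays the same; only the algorithm argument changes.
for algorithm in (zero_rule, random_guess):
    print(algorithm.__name__, evaluate(train, test, algorithm))

# Extra positional arguments are forwarded to the algorithm.
print(evaluate(train, test, random_guess, 7))
```

Only the function passed as `algorithm` differs between the calls, which is what keeps the comparison between algorithms fair.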

## 2. Cross-Validation Algorithm Test Harness.

__Cross-validation__ is a `resampling technique` that provides more reliable estimates of algorithm performance on unseen data.

It requires the creation and evaluation of k models on different subsets of your data, and as such is more computationally expensive. Nevertheless, it is the gold standard for evaluating machine learning algorithms.

The algorithm must be evaluated on different subsets of the dataset many times. This means we need additional loops within our __evaluate_algorithm()__ function.

Below is a function that implements algorithm evaluation with cross-validation.

1. The dataset is __split into n_folds groups__ called folds.
2. A loop gives each fold an opportunity to be held out of training and used to evaluate the algorithm.
3. A copy of the list of folds is created and the held-out fold is removed from this list.
4. The list of folds is then flattened into one long list of rows to match the algorithm's expectation of a training dataset. This is done using the sum() function.
5. Once the training dataset is prepared, the rest of the function within this loop is as above. A copy of the test dataset (the fold) is made and the output values are cleared to avoid accidental cheating by algorithms.

The algorithm is prepared on the train dataset and makes predictions on the test dataset. The predictions are evaluated and the resulting scores are stored in lists, one per performance measure.
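
The fold handling described in the steps above can be tried on a toy dataset before looking at the full harness. This is a minimal sketch: the ten rows are made up, and the split function is a condensed version of the same idea rather than a replacement for the tutorial's code.

```python
from random import seed, randrange

def cross_validation_split(dataset, n_folds):
    # Randomly deal rows into n_folds groups of equal size.
    dataset_copy = list(dataset)
    fold_size = int(len(dataset) / n_folds)
    folds = []
    for _ in range(n_folds):
        fold = []
        while len(fold) < fold_size:
            fold.append(dataset_copy.pop(randrange(len(dataset_copy))))
        folds.append(fold)
    return folds

seed(1)
dataset = [[i, i % 2] for i in range(10)]   # 10 made-up rows
folds = cross_validation_split(dataset, 3)

for held_out in folds:
    train_folds = list(folds)          # copy the list of folds
    train_folds.remove(held_out)       # drop the held-out fold
    train_set = sum(train_folds, [])   # flatten folds into one list of rows
    print(len(train_set), len(held_out))   # 6 3 on every iteration

# Rows that did not fit into equal-sized folds are not used at all.
unused = len(dataset) - sum(len(fold) for fold in folds)
print(unused)   # 1
```

Because the fold size is truncated to an integer, any remainder rows are left out of every fold. With 10 rows and 3 folds, one row is never used for training or testing.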

In [11]:
# Split a dataset into k folds
def cross_validation_split(dataset, n_folds):
    dataset_split = list()
    dataset_copy = list(dataset)
    fold_size = int(len(dataset) / n_folds)
    for i in range(n_folds):
        fold = list()
        while len(fold) < fold_size:
            index = randrange(len(dataset_copy))
            fold.append(dataset_copy.pop(index))
        dataset_split.append(fold)
    return dataset_split

In [12]:
# Evaluate an algorithm using a cross validation split
def evaluate_algorithm(dataset, algorithm, n_folds, *args):
    folds = cross_validation_split(dataset, n_folds)
    scores = list()
    maes = list()
    rmses = list()
    for fold in folds:
        train_set = list(folds)
        train_set.remove(fold)
        train_set = sum(train_set, [])
        test_set = list()
        for row in fold:
            row_copy = list(row)
            test_set.append(row_copy)
            row_copy[-1] = None
        predicted = algorithm(train_set, test_set, *args)
        actual = [row[-1] for row in fold]

        accuracy = accuracy_metric(actual, predicted)
        mae = mae_metric(actual, predicted)
        rmse = rmse_metric(actual, predicted)
        
        scores.append(accuracy)
        maes.append(mae)
        rmses.append(rmse)
    return scores, maes, rmses

In [13]:
# Test the zero rule algorithm on the diabetes dataset
seed(1)

# load and prepare data
filename = '..\\..\\..\\data\\pima-indians-diabetes.csv'
dataset = load_csv(filename)
for i in range(len(dataset[0])):
    str_column_to_float(dataset, i)

# evaluate algorithm
n_folds = 5
scores, maes, rmses = evaluate_algorithm(dataset, zero_rule_algorithm_classification, n_folds)

print('Scores: %s' % scores)
print('Mean Accuracy: %.3f%%' % (sum(scores)/len(scores)))

print('\nMAE: %s' % maes)
print('Mean Absolute Error (MAE): %.3f' % (sum(maes)/len(maes)))


print('\nRMSE: %s' % rmses)
print('Root Mean Squared Error (RMSE): %.3f' % (sum(rmses)/len(rmses)))

Scores: [62.091503267973856, 64.70588235294117, 64.70588235294117, 64.70588235294117, 69.28104575163398]
Mean Accuracy: 65.098%

MAE: [0.3790849673202614, 0.35294117647058826, 0.35294117647058826, 0.35294117647058826, 0.30718954248366015]
Mean Absolute Error (MAE): 0.349

RMSE: [0.6156987634551992, 0.5940885257860046, 0.5940885257860046, 0.5940885257860046, 0.5542468245138262]
Root Mean Squared Error (RMSE): 0.590


## 3. Extensions
This section lists extensions to this tutorial that you may wish to consider.

- __Parameterized Evaluation__. Pass in the function used to evaluate predictions, allowing you to seamlessly work with regression problems.
- __Parameterized Resampling__. Pass in the function used to calculate resampling splits, allowing you to easily switch between the train-test and cross-validation methods.
- __Standard Deviation Scores__. Calculate the standard deviation to get an idea of the spread of scores when evaluating algorithms using cross-validation.
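
One possible starting point for the first and third extensions is sketched below. It is an illustration under stated assumptions, not a definitive design: `evaluate_folds` is a hypothetical variant of the harness that receives the metric as an argument and expects the caller to supply ready-made folds (which also leaves the choice of resampling to the caller), and the rows are made up.

```python
from math import sqrt
from statistics import mean, stdev

def accuracy_metric(actual, predicted):
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual) * 100.0

def rmse_metric(actual, predicted):
    return sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def zero_rule(train, test):
    # Predict the most common training label for every test row.
    labels = [row[-1] for row in train]
    majority = max(set(labels), key=labels.count)
    return [majority for _ in test]

def evaluate_folds(folds, algorithm, metric, *args):
    """Hypothetical harness variant: the metric function is a parameter."""
    scores = []
    for i, fold in enumerate(folds):
        # Train on every row that is not in the held-out fold.
        train_set = [row for j, other in enumerate(folds) if j != i for row in other]
        test_set = [row[:-1] + [None] for row in fold]
        predicted = algorithm(train_set, test_set, *args)
        actual = [row[-1] for row in fold]
        scores.append(metric(actual, predicted))
    return scores

# Made-up folds: one input value followed by a 0/1 label.
folds = [
    [[1.0, 0], [2.0, 0]],
    [[3.0, 0], [4.0, 1]],
    [[5.0, 0], [6.0, 0]],
]

acc_scores = evaluate_folds(folds, zero_rule, accuracy_metric)
rmse_scores = evaluate_folds(folds, zero_rule, rmse_metric)

print('Accuracy: mean %.3f, stdev %.3f' % (mean(acc_scores), stdev(acc_scores)))
print('RMSE: mean %.3f, stdev %.3f' % (mean(rmse_scores), stdev(rmse_scores)))
```

`statistics.stdev` computes the sample standard deviation. With only a handful of folds, the spread it reports is a rough guide rather than a precise estimate.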