Statistical Comparisons of Classifiers in Python

This repository summarises the currently suggested statistical tests for comparing classification algorithms. Everything is implemented in Python.

A summary of the current literature is given in the slides below.

Just tell me what to use (TLDR)

Statistical comparisons of 2 classifiers on a single dataset

The suggestion in Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms (1998) is to perform a 5x2cv paired t-test: five repetitions of 2-fold cross-validation, with both classifiers trained and evaluated on the same splits. This was extended in Combined 5 × 2 cv F Test for Comparing Supervised Classification Learning Algorithms to the combined 5x2cv F-test, which is also recommended by the original authors above as the new standard.

These tests are implemented in mlxtend:

5x2 t-test (Note: prefer extension below)

5x2 F-test
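For readers who want to see what the two statistics compute, here is a minimal from-scratch sketch using NumPy and SciPy rather than mlxtend itself. It assumes you already have a 5x2 array of score differences (classifier A minus classifier B) from five repetitions of 2-fold cross-validation. The function name `five_by_two_cv_tests` and the numbers in `diffs` are illustrative and not taken from this repository. In practice, mlxtend's `paired_ttest_5x2cv` and `combined_ftest_5x2cv` (in `mlxtend.evaluate`) run the cross-validation and the test in one call.

```python
import numpy as np
from scipy import stats


def five_by_two_cv_tests(diffs):
    """Return (t, p_t, f, p_f) for a 5x2 array of score differences.

    diffs[i, j] is the score of classifier A minus the score of
    classifier B on fold j of repetition i (hypothetical input layout).
    """
    diffs = np.asarray(diffs, dtype=float)
    if diffs.shape != (5, 2):
        raise ValueError("expected a 5x2 array of score differences")

    # Per-repetition variance estimate: s_i^2 = sum_j (p_ij - mean_i)^2
    means = diffs.mean(axis=1, keepdims=True)
    variances = ((diffs - means) ** 2).sum(axis=1)
    if variances.sum() == 0:
        raise ValueError("all variance estimates are zero; statistics undefined")

    # 5x2cv paired t-test: first difference over the pooled std, 5 d.o.f.
    t_stat = diffs[0, 0] / np.sqrt(variances.mean())
    p_t = 2 * stats.t.sf(abs(t_stat), df=5)

    # Combined 5x2cv F-test: uses all ten differences, F(10, 5) under H0.
    f_stat = (diffs ** 2).sum() / (2 * variances.sum())
    p_f = stats.f.sf(f_stat, 10, 5)
    return t_stat, p_t, f_stat, p_f


# Illustrative (made-up) accuracy differences, not real results.
diffs = [[0.02, 0.04], [0.03, 0.01], [0.05, 0.03], [0.02, 0.02], [0.04, 0.06]]
t_stat, p_t, f_stat, p_f = five_by_two_cv_tests(diffs)
print(f"t = {t_stat:.3f} (p = {p_t:.3f}), F = {f_stat:.3f} (p = {p_f:.3f})")
```

The t statistic uses only the first of the ten differences in its numerator, while the F statistic pools all ten, which is one reason the F-test is the preferred extension.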

Alternatively, if computational resources only allow each method to be trained a single time (unlike the 10 fits required above), the only acceptable statistical test appears to be McNemar's test, which is also implemented in mlxtend.
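As a rough illustration of what McNemar's test looks at, the sketch below computes the continuity-corrected chi-squared form using only the standard library. It assumes both classifiers were evaluated on the same held-out test set. The function name `mcnemar_test` and the toy labels are hypothetical. mlxtend provides `mcnemar_table` and `mcnemar` in `mlxtend.evaluate` for real use, and for small disagreement counts an exact binomial version is usually preferred over the chi-squared approximation shown here.

```python
import math


def mcnemar_test(y_true, pred_a, pred_b):
    """McNemar's test (chi-squared form with continuity correction).

    Returns (chi2, p_value). Only the cases where exactly one of the two
    classifiers is correct carry information about their difference.
    """
    # b: A correct and B wrong; c: A wrong and B correct.
    b = sum(pa == t and pb != t for t, pa, pb in zip(y_true, pred_a, pred_b))
    c = sum(pa != t and pb == t for t, pa, pb in zip(y_true, pred_a, pred_b))
    if b + c == 0:
        return 0.0, 1.0  # the classifiers never disagree on correctness

    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-squared with 1 d.o.f.: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Toy (made-up) predictions on 20 test examples, all with true label 1.
y_true = [1] * 20
pred_a = [1] * 10 + [0] * 2 + [1] * 8  # A is wrong on 2 examples
pred_b = [0] * 10 + [1] * 2 + [1] * 8  # B is wrong on 10 examples
chi2, p_value = mcnemar_test(y_true, pred_a, pred_b)
print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}")
```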

Statistical comparisons of 2 classifiers on multiple datasets

If we are comparing 2 classifiers across several datasets, the suggestion in Statistical Comparisons of Classifiers over Multiple Data Sets (2006) is to perform a Wilcoxon signed-rank test on the paired per-dataset scores of the two classifiers.
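A minimal sketch of this comparison with `scipy.stats.wilcoxon`, assuming one summary score (for example, cross-validated accuracy) per classifier per dataset. The accuracies below are made up for illustration and are not results from this repository.

```python
from scipy import stats

# Illustrative (made-up) accuracies of two classifiers on the same 10 datasets.
acc_a = [0.85, 0.78, 0.905, 0.70, 0.665, 0.815, 0.738, 0.94, 0.595, 0.772]
acc_b = [0.84, 0.76, 0.910, 0.67, 0.650, 0.790, 0.750, 0.90, 0.560, 0.750]

# Paired, two-sided test on the per-dataset differences acc_a - acc_b.
result = stats.wilcoxon(acc_a, acc_b)
print(f"W = {result.statistic}, p = {result.pvalue:.4f}")

if result.pvalue < 0.05:
    print("The two classifiers differ significantly at the 5% level.")
else:
    print("No significant difference detected at the 5% level.")
```

The test ranks the absolute differences and compares the rank sums of the positive and negative differences, so it only assumes the scores are paired by dataset, not that the differences are normally distributed.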

This is shown in:

Statistical comparisons of multiple classifiers over multiple data sets

If we are comparing multiple methods across multiple datasets, the suggestions in Statistical Comparisons of Classifiers over Multiple Data Sets (2006) are to perform a Friedman test, followed by either Nemenyi post hoc analysis (when comparing all methods with each other) or a family-wise error rate (FWER) correction when comparing against a control classifier (i.e. comparing several methods to one baseline method, rather than making all pairwise comparisons).
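The all-pairs route can be sketched roughly as follows, assuming a score matrix with one row per dataset and one column per classifier. The scores and classifier names are made up, and `q_alpha = 2.343` is the Nemenyi critical value for 3 classifiers at alpha = 0.05 as tabulated in the 2006 paper; a different number of classifiers needs a different value. The post hoc step is only meaningful when the Friedman test rejects the null hypothesis.

```python
import itertools

import numpy as np
from scipy import stats

# Illustrative (made-up) accuracies: rows are datasets, columns are classifiers.
names = ["A", "B", "C"]
scores = np.array([
    [0.90, 0.85, 0.80],
    [0.88, 0.86, 0.79],
    [0.75, 0.70, 0.72],
    [0.93, 0.91, 0.85],
    [0.81, 0.83, 0.76],
    [0.69, 0.66, 0.60],
    [0.95, 0.92, 0.90],
    [0.84, 0.80, 0.77],
])
n_datasets, k = scores.shape

# Friedman test: do the classifiers' rankings differ at all?
friedman_stat, friedman_p = stats.friedmanchisquare(*scores.T)

# Rank classifiers within each dataset (rank 1 = highest score), then average.
ranks = np.array([stats.rankdata(-row) for row in scores])
avg_ranks = ranks.mean(axis=0)

# Nemenyi critical difference for k = 3 classifiers at alpha = 0.05.
q_alpha = 2.343
critical_difference = q_alpha * np.sqrt(k * (k + 1) / (6 * n_datasets))

# Two classifiers differ significantly if their average ranks are
# further apart than the critical difference.
significant_pairs = [
    (names[i], names[j])
    for i, j in itertools.combinations(range(k), 2)
    if abs(avg_ranks[i] - avg_ranks[j]) > critical_difference
]

print(f"Friedman chi2 = {friedman_stat:.2f}, p = {friedman_p:.4f}")
print("Average ranks:", dict(zip(names, avg_ranks)))
print(f"Critical difference = {critical_difference:.4f}")
print("Significantly different pairs:", significant_pairs)
```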

Both tests are shown in the following Jupyter notebook.

All credit to the original authors of the above articles.
