Chapter 23
# Rank Data

A large portion of the field of statistics and statistical methods is dedicated to data where the distribution is known. Samples of data where we already know (or can easily identify) the distribution are called parametric data.

Data in which the distribution is unknown (or cannot be easily identified) is called nonparametric.  In this case, special nonparametric statistical methods can be used that discard all information about the distribution (thus these methods are often referred to as distribution-free methods).

# Parametric Data
This is a sample of data drawn from a known data distribution. The term is often used as shorthand for real-valued data drawn from a Gaussian distribution, although this usage is not strictly accurate.

If we continue with the shorthand of parametric meaning Gaussian, then parametric data lets us harness the entire suite of statistical methods developed for data assuming a Gaussian distribution, for example:
- Summary statistics
- Correlation between variables
- Significance tests for comparing means

Often we will use data preparation methods that make data parametric, such as data transforms, so that we can harness these well-understood statistical methods.
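
As a minimal sketch of such a transform, the example below draws a right-skewed sample and applies a log transform. The log-normal test data and the use of `skew()` from `scipy.stats` to measure asymmetry are assumptions chosen for illustration; a log transform only suits positive, right-skewed data.

```python
# sketch: a log transform that makes a skewed sample closer to Gaussian
from numpy import exp, log
from numpy.random import randn, seed
from scipy.stats import skew

# seed random number generator
seed(1)

# generate a right-skewed (log-normal) sample, assumed here for illustration
data = exp(randn(1000))

# apply the log transform
transformed = log(data)

# skewness near zero suggests a more symmetric, Gaussian-like shape
print('skew before: %.3f' % skew(data))
print('skew after: %.3f' % skew(transformed))
```

A skewness close to zero after the transform indicates a more symmetric sample, but it does not by itself confirm that the data is Gaussian.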

# Nonparametric Data
This is data that does not fit a known or well-understood distribution. Data could be nonparametric for many reasons, for example:
- data is not real-valued, but is instead ordinals, intervals or some other form
- data is real-valued but does not fit a well understood shape
- data is almost parametric but contains outliers, multiple peaks, a shift or some other feature

Most parametric methods have an equivalent nonparametric version. In general, however, nonparametric methods have less statistical power than their parametric counterparts when the parametric assumptions hold, because they must be general enough to work for all types of data and they discard information about the distribution.

# Ranking Data
Before many nonparametric statistical methods can be applied, the data must be converted into a rank format. Statistical methods that expect data in rank format are sometimes called rank statistics, and include rank correlation and rank-based statistical hypothesis tests.

The procedure to rank data:
- sort all data in the sample in ascending order
- assign an integer rank from 1 to N for each unique value in the data sample
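
The two steps above can be sketched directly with NumPy. The helper name `simple_rank()` and the small sample are hypothetical, introduced only for illustration; tied values are ranked in order of appearance here, which is just one of several possible conventions.

```python
# sketch: rank a sample by sorting and assigning ranks 1..N
from numpy import arange, argsort, array, empty

def simple_rank(values):
    """Return the 1-based rank of each value (ties ranked by order of appearance)."""
    values = array(values)
    # indices that would sort the sample in ascending order
    order = argsort(values, kind='stable')
    # the observation at sorted position k receives rank k + 1
    ranks = empty(len(values), dtype=int)
    ranks[order] = arange(1, len(values) + 1)
    return ranks

data = [0.42, 0.72, 0.0001, 0.30, 0.15]
print(simple_rank(data))  # prints [4 5 1 3 2]
```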

There are variations on this procedure for special circumstances such as:
- handling ties
- using a reverse ranking
- using a fractional rank score

The SciPy library provides the rankdata() function to rank numerical data, and supports a number of variations.

In [1]:
# example of ranking real-valued observations
from numpy.random import rand
from numpy.random import seed
from scipy.stats import rankdata

# seed random number generator
seed(1)

# generate dataset from a uniform distribution
data = rand(1000)

# review first 10 samples
print(data[:10])

# rank data
ranked = rankdata(data)

# review first 10 ranked samples
print(ranked[:10])

[4.17022005e-01 7.20324493e-01 1.14374817e-04 3.02332573e-01
 1.46755891e-01 9.23385948e-02 1.86260211e-01 3.45560727e-01
 3.96767474e-01 5.38816734e-01]
[408. 721.   1. 300. 151.  93. 186. 342. 385. 535.]
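
The reverse-ranking variation mentioned earlier can be sketched with the same function. Negating the data before ranking is one common approach rather than a dedicated `rankdata()` option; it gives the largest observation rank 1.

```python
# sketch: reverse ranking, where the largest value receives rank 1
from numpy.random import rand
from numpy.random import seed
from scipy.stats import rankdata

# seed random number generator
seed(1)

# generate dataset from a uniform distribution
data = rand(1000)

# standard ascending ranks
ranked = rankdata(data)

# ranking the negated data reverses the order
reverse_ranked = rankdata(-data)

# review first 10 reverse-ranked samples
print(reverse_ranked[:10])
```

Each reverse rank equals `len(data) + 1` minus the corresponding ascending rank, so either form can be recovered from the other.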


# Working with Ranked Data
There are statistical tools you can use to check if your sample fits a given distribution.  For example, you can use statistical methods that quantify how Gaussian a sample of data is, and use nonparametric methods if the data fails those tests.

Three examples of statistical methods for normality testing are:
- Shapiro-Wilk Test
- D'Agostino's K^2 Test
- Anderson-Darling Test
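
As a rough sketch, all three tests are available in `scipy.stats` as `shapiro()`, `normaltest()` and `anderson()`. The uniform sample and the 5% significance level below are assumptions chosen for illustration.

```python
# sketch: normality tests applied to a sample that is not Gaussian
from numpy.random import rand
from numpy.random import seed
from scipy.stats import anderson, normaltest, shapiro

# seed random number generator
seed(1)

# generate dataset from a uniform distribution
data = rand(1000)
alpha = 0.05

# Shapiro-Wilk test
shapiro_stat, shapiro_p = shapiro(data)
print('Shapiro-Wilk: stat=%.3f, p=%.3g' % (shapiro_stat, shapiro_p))

# D'Agostino's K^2 test
k2_stat, k2_p = normaltest(data)
print("D'Agostino K^2: stat=%.3f, p=%.3g" % (k2_stat, k2_p))

# Anderson-Darling test: compare the statistic to each critical value
result = anderson(data, dist='norm')
print('Anderson-Darling: stat=%.3f' % result.statistic)
for level, critical in zip(result.significance_level, result.critical_values):
    verdict = 'looks Gaussian' if result.statistic < critical else 'does not look Gaussian'
    print('  %.1f%%: critical=%.3f, %s' % (level, critical, verdict))

# a small p-value is evidence against the sample being Gaussian
if shapiro_p < alpha or k2_p < alpha:
    print('Sample does not look Gaussian: consider nonparametric methods')
```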

Once you have decided to use nonparametric statistics, you must then rank your data.  Most of the tools used for inference will automatically perform the ranking of sample data.  Nevertheless, it is important to understand how your sample data is being transformed prior to performing the tests.

There are two main types of questions you may have about your data that you can address with nonparametric statistical methods:
- Relationship Between Variables
- Compare Sample Means

# Relationship Between Variables
Methods for quantifying the dependency between variables are called correlation methods.  Nonparametric statistical correlation methods include:
- Spearman's Rank Correlation
- Kendall's Rank Correlation
- Goodman and Kruskal's Rank Correlation
- Somers' Rank Correlation
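
A minimal sketch of the first two methods, using `spearmanr()` and `kendalltau()` from `scipy.stats`. The synthetic variables `x` and `y` are assumptions for illustration. The last step shows what the automatic ranking amounts to: Spearman's coefficient matches Pearson's correlation computed on the ranked data.

```python
# sketch: rank correlation between two related variables
from numpy.random import rand
from numpy.random import seed
from scipy.stats import kendalltau, pearsonr, rankdata, spearmanr

# seed random number generator
seed(1)

# two variables with a positive relationship (assumed for illustration)
x = rand(1000)
y = x + rand(1000)

# Spearman's rank correlation
rho, rho_p = spearmanr(x, y)
print('Spearman: rho=%.3f, p=%.3g' % (rho, rho_p))

# Kendall's rank correlation
tau, tau_p = kendalltau(x, y)
print('Kendall: tau=%.3f, p=%.3g' % (tau, tau_p))

# Pearson's correlation on explicitly ranked data
r_on_ranks, _ = pearsonr(rankdata(x), rankdata(y))
print('Pearson on ranks: r=%.3f' % r_on_ranks)
```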

# Compare Sample Means
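Methods for quantifying whether two or more samples differ significantly are called statistical significance tests. The nonparametric versions compare the distributions of the samples (often interpreted as a difference in medians) rather than the means directly, and include: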
- Mann-Whitney U Test
- Wilcoxon Signed-Rank Test
- Kruskal-Wallis H Test
- Friedman Test
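
The four tests can be sketched with their `scipy.stats` implementations. The three shifted uniform samples are assumptions for illustration. Note that `wilcoxon()` and `friedmanchisquare()` treat the samples as paired measurements, whereas `mannwhitneyu()` and `kruskal()` treat them as independent.

```python
# sketch: nonparametric significance tests on shifted samples
from numpy.random import rand
from numpy.random import seed
from scipy.stats import friedmanchisquare, kruskal, mannwhitneyu, wilcoxon

# seed random number generator
seed(1)

# three samples with different locations (assumed for illustration)
a = rand(100)
b = rand(100) + 0.5
c = rand(100) + 1.0

# Mann-Whitney U test: two independent samples
u_stat, u_p = mannwhitneyu(a, b, alternative='two-sided')
print('Mann-Whitney U: stat=%.3f, p=%.3g' % (u_stat, u_p))

# Wilcoxon signed-rank test: two paired samples
w_stat, w_p = wilcoxon(a, b)
print('Wilcoxon: stat=%.3f, p=%.3g' % (w_stat, w_p))

# Kruskal-Wallis H test: two or more independent samples
h_stat, h_p = kruskal(a, b, c)
print('Kruskal-Wallis H: stat=%.3f, p=%.3g' % (h_stat, h_p))

# Friedman test: three or more paired samples
f_stat, f_p = friedmanchisquare(a, b, c)
print('Friedman: stat=%.3f, p=%.3g' % (f_stat, f_p))
```

A p-value below the chosen significance level (for example 0.05) suggests that the samples were drawn from different distributions.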

# Extensions

In [8]:
# develop your own example to demonstrate the capabilities of the rankdata() function
from scipy.stats import rankdata

print('default:', rankdata([0, 2, 3, 2]))
print('min:', rankdata([0, 2, 3, 2], method='min'))
print('max:', rankdata([0, 2, 3, 2], method='max'))
print('dense:', rankdata([0, 2, 3, 2], method='dense'))
print('ordinal:', rankdata([0, 2, 3, 2], method='ordinal'))

default: [1.  2.5 4.  2.5]
min: [1 2 4 2]
max: [1 3 4 3]
dense: [1 2 3 2]
ordinal: [1 2 4 3]
