In [46]:
import pandas as pd
import numpy as np
from scipy.stats import chi2
from scipy.stats import f
from scipy.stats import friedmanchisquare as friedman


# Friedman Test

This notebook shows the implementation of the Friedman test as explained in this [blog post](LINK).
We will use a subset of the classification accuracies of multiple classifiers on the UCR datasets, provided by the [UEA & UCR Time Series Classification Repository](http://www.timeseriesclassification.com/index.php).

In [47]:
# Open the csv file with the accuracies from 5 time-series classifiers on 12 datasets
df = pd.read_csv("./data/ts_accs.csv", index_col="TESTACC")
df.index.rename("", inplace=True)
df

Unnamed: 0,TS-CHIEF,ROCKET,BOSS,WEASEL,Catch22
,,,,,
Beef,0.632222,0.76,0.612222,0.74,0.473333
BME,0.996444,0.997333,0.865778,0.947778,0.904889
Car,0.878889,0.911667,0.848333,0.834444,0.746111
CBF,0.998444,0.995926,0.998926,0.979778,0.953667
Crop,0.762101,0.751685,0.685688,0.723825,0.653141
Fish,0.981524,0.974095,0.969714,0.950857,0.772571
Ham,0.805079,0.855238,0.83746,0.82127,0.693968
Meat,0.984444,0.988889,0.980556,0.976667,0.942778
Rock,0.832,0.804667,0.802667,0.854667,0.705333


**Step 1:** Rank the algorithms on each dataset. The best algorithm on a dataset gets the lowest rank (1) and the worst gets the highest.

In [48]:
r = df.rank(axis=1, ascending=False)
r

Unnamed: 0,TS-CHIEF,ROCKET,BOSS,WEASEL,Catch22
,,,,,
Beef,3.0,1.0,4.0,2.0,5.0
BME,2.0,1.0,5.0,3.0,4.0
Car,2.0,1.0,3.0,4.0,5.0
CBF,2.0,3.0,1.0,4.0,5.0
Crop,1.0,2.0,4.0,3.0,5.0
Fish,1.0,2.0,3.0,4.0,5.0
Ham,4.0,1.0,2.0,3.0,5.0
Meat,2.0,1.0,3.0,4.0,5.0
Rock,2.0,3.0,4.0,1.0,5.0
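The datasets shown above contain no ties, but it helps to know how `DataFrame.rank` treats them. The sketch below uses made-up accuracies (the frame `toy` and its column names are hypothetical, not part of the UCR data) and shows that, with the default `method="average"`, tied classifiers share the mean of the ranks they would otherwise occupy.

```python
import pandas as pd

# Hypothetical accuracies: classifiers A and B tie on the first dataset
toy = pd.DataFrame(
    {"A": [0.90, 0.80], "B": [0.90, 0.70], "C": [0.85, 0.75]},
    index=["dataset_1", "dataset_2"],
)

# Higher accuracy -> lower (better) rank; tied values share the average rank
toy_ranks = toy.rank(axis=1, ascending=False, method="average")
print(toy_ranks)

# Each row of ranks still sums to M*(M+1)/2, here 3*4/2 = 6
print(toy_ranks.sum(axis=1))
```

When ties are present, the uncorrected formula used in this notebook may differ slightly from SciPy's result, which (to our understanding) includes a tie correction.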


**Step 2:** Calculate the average rank of each method, i.e. the mean of each column in the table above.

In [49]:
R = r.mean(axis=0)
R

TS-CHIEF    2.250000
ROCKET      1.666667
BOSS        3.166667
WEASEL      3.000000
Catch22     4.916667
dtype: float64

**Step 3:** Let's calculate the Friedman statistic and the Iman-Davenport correction:

In [50]:
# Friedman statistic
M = len(df.columns)
N = len(df)

aux1 = M*(M+1)
aux2 = (R**2).sum()
chi2F = (12 * N / aux1) * (aux2 - aux1*(M+1)/4.0)

# p-value
p_chi2F = chi2.sf(chi2F, M-1)
print('Friedman statistic:', chi2F, 'pvalue:', p_chi2F)

Friedman statistic: 29.00000000000002 pvalue: 7.817388769802184e-06


In [51]:
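# Iman-Davenport statistic (N = number of datasets, M = number of classifiers)
FF = (N-1)*chi2F / (N*(M-1) - chi2F)

# p-value: F distribution with M-1 and (M-1)*(N-1) degrees of freedom
p_FF = f.sf(FF, M-1, (M-1)*(N-1))
print('Iman-Davenport statistic:', FF, 'pvalue:', p_FF)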

Iman-Davenport statistic: 16.789473684210556 pvalue: 1.996893379071801e-08
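For reference, the two cells above correspond to the following formulas, written with $N$ datasets, $M$ algorithms and $R_j$ the average rank of algorithm $j$ (the notation is chosen here to match the code variables `N`, `M` and `R`):

$$\chi_F^2 = \frac{12N}{M(M+1)}\left[\sum_{j=1}^{M} R_j^2 - \frac{M(M+1)^2}{4}\right]$$

$$F_F = \frac{(N-1)\,\chi_F^2}{N(M-1) - \chi_F^2}$$

Under the null hypothesis, $\chi_F^2$ is approximately $\chi^2$-distributed with $M-1$ degrees of freedom, and $F_F$ follows an $F$ distribution with $M-1$ and $(M-1)(N-1)$ degrees of freedom. Plugging in $N=12$, $M=5$ and $\chi_F^2 = 29$ gives

$$F_F = \frac{11 \cdot 29}{12 \cdot 4 - 29} = \frac{319}{19} \approx 16.79,$$

which agrees with the printed value.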


**Step 4:** Both p-values are well below the significance level $\alpha=0.05$, so we can reject the null hypothesis that all classifiers perform equivalently. As a cross-check, let's look at the critical value of the $\chi^2$ distribution with $5-1=4$ degrees of freedom and the critical value of the $F$ distribution with $5-1=4$ and $(5-1)\cdot(12-1) = 44$ degrees of freedom, both for $\alpha=0.05$. We reject the null hypothesis when a statistic exceeds its critical value.

In [58]:
cc = chi2.ppf(0.95, M-1)
cf = f.ppf(0.95, M-1, (M-1)*(N-1))

print('The critical value of chi2(4):', cc, 'The critical value of F(4,44):', cf)

The critical value of chi2(4): 9.487729036781154 The critical value of F(4,44): 2.583667426803002
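The comparison against the critical values can also be made explicit in code. The snippet below is a minimal sketch that hard-codes the values reported in this notebook ($N=12$, $M=5$, $\chi_F^2=29$); the names `alpha`, `reject_chi2` and `reject_F` are introduced here only for illustration.

```python
from scipy.stats import chi2, f

# Values taken from the notebook outputs above
M, N = 5, 12           # number of classifiers, number of datasets
chi2F = 29.0           # Friedman statistic
FF = (N - 1) * chi2F / (N * (M - 1) - chi2F)  # Iman-Davenport statistic

alpha = 0.05           # significance level (assumed, as in Step 4)

# Reject the null hypothesis when the statistic exceeds the critical value
reject_chi2 = chi2F > chi2.ppf(1 - alpha, M - 1)
reject_F = FF > f.ppf(1 - alpha, M - 1, (M - 1) * (N - 1))

print('Reject H0 (Friedman):', reject_chi2)
print('Reject H0 (Iman-Davenport):', reject_F)
```

Both comparisons lead to the same decision as the p-values: the null hypothesis is rejected.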


We can also compute the Friedman test directly with SciPy, passing one array of accuracies per classifier:

In [55]:
from scipy.stats import friedmanchisquare as friedman
friedman(*df.T.values)

FriedmanchisquareResult(statistic=29.0, pvalue=7.817388769802272e-06)

😎 it's the same result as we calculated!
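To reuse the steps above on other result tables, they can be collected into a single function. This is only a sketch: the helper name `friedman_iman_davenport` and the random `scores` table are hypothetical, it assumes rows are datasets, columns are algorithms and higher scores are better, and it applies no tie correction.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2, f, friedmanchisquare


def friedman_iman_davenport(scores: pd.DataFrame) -> dict:
    """Hypothetical helper: Friedman and Iman-Davenport tests from a score table."""
    n_datasets, n_algos = scores.shape

    # Steps 1 and 2: rank within each dataset, then average per algorithm
    avg_ranks = scores.rank(axis=1, ascending=False).mean(axis=0)

    # Step 3: Friedman statistic (no tie correction)
    chi2_f = (12 * n_datasets / (n_algos * (n_algos + 1))) * (
        (avg_ranks ** 2).sum() - n_algos * (n_algos + 1) ** 2 / 4.0
    )
    # Iman-Davenport correction; undefined if chi2_f equals N*(M-1),
    # which happens when every dataset yields the same ranking
    f_f = (n_datasets - 1) * chi2_f / (n_datasets * (n_algos - 1) - chi2_f)

    return {
        "chi2_F": chi2_f,
        "p_chi2_F": chi2.sf(chi2_f, n_algos - 1),
        "F_F": f_f,
        "p_F_F": f.sf(f_f, n_algos - 1, (n_algos - 1) * (n_datasets - 1)),
    }


# Usage on synthetic scores (random values, so ties are practically impossible)
rng = np.random.default_rng(0)
scores = pd.DataFrame(rng.random((12, 5)), columns=list("ABCDE"))
result = friedman_iman_davenport(scores)
print(result)

# Cross-check the Friedman statistic against SciPy
print(friedmanchisquare(*scores.T.values))
```

Returning a plain dictionary keeps the sketch short; in the absence of ties, the `chi2_F` entry should agree with SciPy's statistic.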