# Statistical comparison of formulas and models

This notebook statistically compares readability formulas and machine learning models.

First, add the parent directory to the module search path so that the project modules can be imported.

In [1]:
import sys
sys.path.insert(0,'..')

The metric used for comparison is the absolute value of Spearman's rank correlation.

In [2]:
from scipy.stats import spearmanr

metric = lambda predA, predB: abs(spearmanr(predA, predB)[0])
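To make the behaviour of this metric concrete, here is a small sketch on made-up numbers (the arrays `levels`, `difficulty` and `ease` are illustrative assumptions, not data from this notebook). It shows that Spearman's correlation depends only on the ordering of the values, and that taking the absolute value treats a score that rises with the level and a score that falls with the level the same way.

```python
import numpy as np
from scipy.stats import spearmanr


def abs_spearman(a, b):
    # Absolute value of Spearman's rank correlation (first element of the result).
    return abs(spearmanr(a, b)[0])


# Made-up toy data: five texts with increasing reading level.
levels = np.array([1, 2, 3, 4, 5])
# A score that grows with the level, but not linearly.
difficulty = np.array([10.0, 20.0, 35.0, 70.0, 200.0])
# A score that shrinks as the level grows (an "ease"-style score).
ease = -difficulty

same_order = abs_spearman(levels, difficulty)
reverse_order = abs_spearman(levels, ease)
signed_reverse = spearmanr(levels, ease)[0]
```

Both `same_order` and `reverse_order` equal 1 here because the rankings agree perfectly up to direction, while `signed_reverse` is negative. This is why the absolute value is convenient when some formulas score easier texts higher and others score harder texts higher.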

The statistical comparison is performed with bootstrap significance testing.

In [4]:
from comparison.bootstrap import bootstrap_significance_testing

In [27]:
# the number of times to perform bootstrap resampling
n = int(1e3)
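For readers unfamiliar with the technique, the following is a minimal sketch of one common form of paired bootstrap significance test. It is an assumption about the general idea, not the code in `comparison.bootstrap`: the helper name `paired_bootstrap_p_value`, its `seed` parameter, and the synthetic data are all hypothetical. The sketch resamples the rows with replacement, scores both systems on the same resample, and reports the share of resamples in which system A fails to beat system B.

```python
import numpy as np
from scipy.stats import spearmanr


def abs_spearman(y_true, y_pred):
    # Same idea as the notebook's metric: absolute Spearman correlation.
    return abs(spearmanr(y_true, y_pred)[0])


def paired_bootstrap_p_value(y_true, pred_a, pred_b, metric, n=1000, seed=0):
    # Hypothetical helper, not the project's bootstrap_significance_testing.
    y_true, pred_a, pred_b = (np.asarray(v) for v in (y_true, pred_a, pred_b))
    rng = np.random.default_rng(seed)
    size = len(y_true)
    not_better = 0
    for _ in range(n):
        # Draw one resample of row indices with replacement; use the same
        # rows for both systems so that the comparison stays paired.
        idx = rng.integers(0, size, size=size)
        score_a = metric(y_true[idx], pred_a[idx])
        score_b = metric(y_true[idx], pred_b[idx])
        if score_a <= score_b:
            not_better += 1
    # Share of resamples in which A failed to beat B.
    return not_better / n


# Synthetic data (made up for illustration): A tracks the labels, B is noise.
rng = np.random.default_rng(42)
labels = rng.integers(1, 6, size=200)
pred_a = labels + rng.normal(scale=0.5, size=200)
pred_b = rng.normal(size=200)

p_value = paired_bootstrap_p_value(labels, pred_a, pred_b, abs_spearman, n=200)
```

In this sketch the result is always a multiple of `1 / n`, so a returned value of 0.0 only means that A beat B in every resample, not that the true p-value is exactly zero.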

## 1. Comparison of formulas

In [7]:
import pandas as pd

X_train = pd.read_csv("../features/weebit_train_with_features.csv", index_col=0)
X_test = pd.read_csv("../features/weebit_test_with_features.csv", index_col=0)

# get Y
y_train = X_train["Level"]
y_test = X_test["Level"]

# remove Y and Text columns 
X_train.drop(columns=['Text', 'Level'], inplace=True)
X_test.drop(columns=['Text', 'Level'], inplace=True)

# whole set
X = pd.concat([X_train, X_test]).reset_index(drop=True)
y = pd.concat([y_train, y_test]).reset_index(drop=True)

In [8]:
from formulas.readability_formulas import flesch, dale_chall, gunning_fog

X = flesch(X)
X = dale_chall(X)
X = gunning_fog(X)

### 1.1 Flesch vs Dale-Chall

In [13]:
metric(y, X["Dale_Chall"])

0.35448675874199415

In [14]:
metric(y, X["Flesch"])

0.3600500729354314

Flesch has a slightly higher correlation. But is the difference statistically significant?

In [18]:
bootstrap_significance_testing(y, X['Flesch'], X['Dale_Chall'], metric, n=n)

0.37423

The p-value (about 0.37) is well above 0.05, so we fail to reject the null hypothesis: the difference between the Flesch and Dale-Chall formulas is not statistically significant.

### 1.2 Gunning fog vs Flesch

In [19]:
metric(y, X["Flesch"])

0.3600500729354314

In [20]:
metric(y, X["Gunning_fog"])

0.43317970094386554

Gunning fog has a higher correlation. Is the difference statistically significant?

In [22]:
bootstrap_significance_testing(y, X['Gunning_fog'], X['Flesch'], metric, n=n)

0.0

The p-value is reported as 0.0, far below 0.05. We reject the null hypothesis: the Gunning fog formula performs significantly better than the Flesch formula.

### 1.3 Gunning fog vs Dale-Chall

In [23]:
metric(y, X["Dale_Chall"])

0.35448675874199415

In [24]:
metric(y, X["Gunning_fog"])

0.43317970094386554

Gunning fog has a higher correlation. Is the difference statistically significant?

In [28]:
bootstrap_significance_testing(y, X['Gunning_fog'], X['Dale_Chall'], metric, n=n)

0.0

The p-value is reported as 0.0, far below 0.05. We reject the null hypothesis: the Gunning fog formula performs significantly better than the Dale-Chall formula.

### 1.4 Conclusions

Based on our tests, there is __no statistically significant difference between the Dale-Chall and Flesch formulas__.

__The Gunning fog index performs significantly better than both.__

## 2. Comparison of machine learning models