# Statistical comparison of formulas and models

In this notebook a statistical comparison of models is performed.

First, add the parent directory to the module search path so that the project's modules can be imported.

In [1]:
import sys
sys.path.insert(0,'..')

The metric used for comparison is the absolute value of Spearman's rank correlation between the true levels and the predicted scores.

In [2]:
from scipy.stats import spearmanr

metric = lambda predA, predB: abs(spearmanr(predA, predB)[0])
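The absolute value matters because some formulas score in the opposite direction from the labels: Flesch reading ease, for instance, is higher for easier texts, so its rank correlation with the difficulty level is negative. A minimal sketch with made-up toy values (the two lists are illustrative and not taken from the dataset):

```python
from scipy.stats import spearmanr

metric = lambda predA, predB: abs(spearmanr(predA, predB)[0])

levels = [1, 2, 3, 4, 5]            # toy difficulty labels
ease_scores = [90, 75, 60, 45, 30]  # toy scores that fall as difficulty rises

raw_rho = spearmanr(levels, ease_scores)[0]  # signed rank correlation
abs_rho = metric(levels, ease_scores)        # direction-agnostic strength
print(raw_rho, abs_rho)
```

With the sign removed, formulas that rise with difficulty and formulas that fall with it can be compared on the same scale.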

The statistical comparison is performed with bootstrap significance testing.

In [3]:
from comparison.bootstrap import bootstrap_significance_testing
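The implementation of `bootstrap_significance_testing` is not shown in this notebook. As a rough illustration of the general idea only, the sketch below resamples the test items with replacement and estimates the p-value as the share of resamples in which system A does not score higher than system B. The function name `paired_bootstrap_p_value`, its parameters, and the synthetic data are all assumptions made for this example; the project's own function may compute the p-value differently.

```python
import numpy as np
from scipy.stats import spearmanr


def paired_bootstrap_p_value(y_true, pred_a, pred_b, metric, n=1000, seed=0):
    """Hypothetical sketch: share of resamples in which A does not beat B."""
    y_true, pred_a, pred_b = map(np.asarray, (y_true, pred_a, pred_b))
    rng = np.random.default_rng(seed)
    size = len(y_true)
    not_better = 0
    for _ in range(n):
        # Draw the same row indices for the labels and both systems,
        # so every resample keeps the predictions paired.
        idx = rng.integers(0, size, size=size)
        score_a = metric(y_true[idx], pred_a[idx])
        score_b = metric(y_true[idx], pred_b[idx])
        if score_a <= score_b:
            not_better += 1
    return not_better / n


abs_spearman = lambda a, b: abs(spearmanr(a, b)[0])

# Synthetic data for illustration only: A tracks the labels closely, B is noisier.
rng = np.random.default_rng(42)
labels = rng.integers(0, 5, size=200)
system_a = labels + rng.normal(0, 0.5, size=200)
system_b = labels + rng.normal(0, 3.0, size=200)

p_value = paired_bootstrap_p_value(labels, system_a, system_b, abs_spearman, n=500)
print("Estimated p-value:", p_value)
```

Resampling indices rather than values is the important design choice: it keeps each text's label and both predictions together, so the test compares the two systems on exactly the same items.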

In [4]:
# the number of times to perform bootstrap resampling
n = int(1e5)

We will use a __significance level of 0.05.__

## 1. Comparison of formulas

In [5]:
import pandas as pd

X_train = pd.read_csv("../features/weebit_train_with_features.csv", index_col=0)
X_test = pd.read_csv("../features/weebit_test_with_features.csv", index_col=0)

# get Y
y_train = X_train["Level"]
y_test = X_test["Level"]

# remove Y and Text columns 
X_train.drop(columns=['Text', 'Level'], inplace=True)
X_test.drop(columns=['Text', 'Level'], inplace=True)

# whole set
X = pd.concat([X_train, X_test]).reset_index(drop=True)
y = pd.concat([y_train, y_test]).reset_index(drop=True)

In [6]:
from formulas.readability_formulas import flesch, dale_chall, gunning_fog

X = flesch(X)
X = dale_chall(X)
X = gunning_fog(X)

### 1.1 Flesch vs Dale-Chall

In [7]:
metric(y, X["Dale_Chall"])

0.35448675874199415

In [8]:
metric(y, X["Flesch"])

0.3600500729354314

Flesch has a slightly higher correlation. But is the difference statistically significant?

In [9]:
p_value = bootstrap_significance_testing(y, X['Flesch'], X['Dale_Chall'], metric, n=n)
print("Estimated p-value: " + str(p_value))

Estimated p-value: 0.37526


As we can see, the p-value is well above 0.05. We fail to reject the null hypothesis: the difference between the Flesch and Dale-Chall formulas is not statistically significant.

### 1.2 Gunning fog vs Flesch

In [10]:
metric(y, X["Flesch"])

0.3600500729354314

In [11]:
metric(y, X["Gunning_fog"])

0.43317970094386554

Gunning fog has a higher correlation. Is the difference statistically significant?

In [12]:
p_value = bootstrap_significance_testing(y, X['Gunning_fog'], X['Flesch'], metric, n=n)
print("Estimated p-value: " + str(p_value))

Estimated p-value: 0.0


The estimated p-value is very small (0.0 at the precision of 100,000 resamples). We can say that the Gunning fog formula performs significantly better than the Flesch formula.
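An estimate of exactly 0.0 reflects the finite number of resamples, not a true p-value of zero. One common convention, which is an assumption here and not necessarily what `bootstrap_significance_testing` does, is to add one to both the count and the number of resamples so that the estimate can never be exactly zero. The helper name `smoothed_p_value` is hypothetical:

```python
def smoothed_p_value(count, n_resamples):
    """Add-one estimate: (count + 1) / (n_resamples + 1), never exactly zero."""
    return (count + 1) / (n_resamples + 1)


n_resamples = int(1e5)
# count = 0 corresponds to a reported estimate of 0.0
p_floor = smoothed_p_value(0, n_resamples)
print(p_floor)  # just under 1e-05
```

Under this convention the result above would be reported as roughly 1e-05 instead of 0.0, which is still far below the significance level.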

### 1.3 Gunning fog vs Dale-Chall

In [13]:
metric(y, X["Dale_Chall"])

0.35448675874199415

In [14]:
metric(y, X["Gunning_fog"])

0.43317970094386554

Gunning fog has a higher correlation. Is the difference statistically significant?

In [15]:
p_value = bootstrap_significance_testing(y, X['Gunning_fog'], X['Dale_Chall'], metric, n=n)
print("Estimated p-value: " + str(p_value))

Estimated p-value: 1e-05


The p-value is very small. We can say that the Gunning fog formula performs significantly better than the Dale-Chall formula.

### 1.4 Conclusions

Based on our tests, there is __no statistically significant difference between the Dale-Chall and Flesch formulas__.

__The Gunning fog index performs significantly better than both.__

## 2. Comparison of machine learning models

In the ML model evaluation (done in `ml_models/model_evaluation.ipynb`), XGBoost and the Multilayer Perceptron (MLP) performed best, with XGBoost slightly ahead of MLP. In this section we test whether the difference between those two models and the rest is statistically significant, and whether XGBoost is significantly better than MLP.

In [16]:
from ml_models.models.random_forest import RandomForest
from ml_models.models.xgboost import XGBoost
from ml_models.models.support_vector_machine import SupportVectorMachine
from ml_models.models.multilayer_perceptron import MultilayerPerceptron

Using TensorFlow backend.


Get predictions for all models.

In [17]:
rf = RandomForest(use_saved_model=True, model_path='../ml_models/models/saved_models/rf.pickle')
y_pred_rf = rf.predict(X_test)

xgboost = XGBoost(use_saved_model=True, model_path='../ml_models/models/saved_models/xgboost.pickle')
y_pred_xgboost = xgboost.predict(X_test)

svm = SupportVectorMachine(use_saved_model=True, model_path='../ml_models/models/saved_models/svm.pickle')
y_pred_svm = svm.predict(X_test)

mlp = MultilayerPerceptron(input_dim=X_train.shape[1], use_saved_model=True, verbose=0, model_path='../ml_models/models/saved_models/mlp.h5')
y_pred_mlp = mlp.predict(X_test)

### 2.1 MLP vs RandomForest 

In [18]:
p_value = bootstrap_significance_testing(y_test, y_pred_mlp, y_pred_rf, metric, n=n)
print("Estimated p-value: " + str(p_value))

Estimated p-value: 0.12076


The estimated p-value is larger than our significance level (0.05). We fail to reject the null hypothesis. __The difference between the MLP and RandomForest models is not statistically significant.__

### 2.2 MLP vs SVC

In [19]:
p_value = bootstrap_significance_testing(y_test, y_pred_mlp, y_pred_svm, metric, n=n)
print("Estimated p-value: " + str(p_value))

Estimated p-value: 0.43394


The estimated p-value is larger than our significance level (0.05). We fail to reject the null hypothesis. __The difference between the MLP and SVC models is not statistically significant.__

### 2.3 XGBoost vs RandomForest

In [20]:
p_value = bootstrap_significance_testing(y_test, y_pred_xgboost, y_pred_rf, metric, n=n)
print("Estimated p-value: " + str(p_value))

Estimated p-value: 0.02543


The p-value is small (<0.05). __We can say that XGBoost performs significantly better than the RandomForest model.__

### 2.4 XGBoost vs SVC

In [21]:
p_value = bootstrap_significance_testing(y_test, y_pred_xgboost, y_pred_svm, metric, n=n)
print("Estimated p-value: " + str(p_value))

Estimated p-value: 0.30053


The estimated p-value is larger than our significance level (0.05). We fail to reject the null hypothesis. __The difference between the XGBoost and SVC models is not statistically significant.__

### 2.5 XGBoost vs MLP

In [22]:
p_value = bootstrap_significance_testing(y_test, y_pred_xgboost, y_pred_mlp, metric, n=n)
print("Estimated p-value: " + str(p_value))

Estimated p-value: 0.34068


The estimated p-value is larger than our significance level (0.05). We fail to reject the null hypothesis. __The difference between the MLP and XGBoost models is not statistically significant.__

### 2.6 Conclusions

Based on our tests, there is __almost no statistically significant difference between the ML models.__

The only statistically significant result is that __XGBoost outperforms the RandomForest model.__

This lends some support to the claim that XGBoost is the best ML model we have, although the differences are small.
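The five pairwise results from sections 2.1 to 2.5 can be checked against the significance level in one pass. The dictionary below only restates the p-values printed above; the variable names are illustrative.

```python
alpha = 0.05

# p-values copied from the outputs of sections 2.1 to 2.5
p_values = {
    ("MLP", "RandomForest"): 0.12076,
    ("MLP", "SVC"): 0.43394,
    ("XGBoost", "RandomForest"): 0.02543,
    ("XGBoost", "SVC"): 0.30053,
    ("XGBoost", "MLP"): 0.34068,
}

# Keep only the comparisons whose p-value falls below the significance level.
significant = [pair for pair, p in p_values.items() if p < alpha]

for (model_a, model_b), p in p_values.items():
    verdict = "significant" if p < alpha else "not significant"
    print(f"{model_a} vs {model_b}: p = {p} ({verdict})")
```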

## 3. Comparison of formulas vs ML models

Since the Gunning fog formula performed best among the formulas and XGBoost performed best among the models, we will compare those two. We test whether the XGBoost model is significantly better than the Gunning fog formula.

The null hypothesis is that there is no difference between XGBoost and the Gunning fog formula.

In [23]:
X_test = gunning_fog(X_test)

In [24]:
p_value = bootstrap_significance_testing(y_test, y_pred_xgboost, X_test['Gunning_fog'], metric, n=n)
print("Estimated p-value: " + str(p_value))

Estimated p-value: 0.0


The estimated p-value is very small (0.0 at the precision of 100,000 resamples). We reject the null hypothesis, which gives evidence that __the XGBoost model is better than the Gunning fog formula.__

__Our conclusion is that the best ML model performs significantly better than the best traditional formula.__ Considering that ML models use many more features and can learn from them, this comes as no surprise.