# Ablation studies to demonstrate the importance of the factors analyzed in the trained model

## Introduction

The variable-importance analysis of the model trained in the 'pvalue_prediction_model_training.ipynb' notebook showed that all the features analyzed were relevant for predicting low and high p-values. However, strong interaction effects between variables might have influenced those results.

To confirm that every feature improves the p-value class prediction, ablation studies were performed. An ablation study removes one feature at a time and retrains the model to evaluate whether its performance worsens.
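
As a minimal illustration of the idea, the sketch below runs a leave-one-feature-out loop on simulated data. Everything in it is hypothetical (the toy data frame, the `holdout_accuracy` helper, and the use of a plain logistic regression instead of XGBoost); it only shows the shape of an ablation loop, not the models trained in this notebook.

```r
# Toy data: two informative features and one pure-noise feature (all hypothetical).
set.seed(1)
n <- 400
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
toy$y <- factor(ifelse(toy$x1 + 0.5 * toy$x2 + rnorm(n) > 0, "high", "low"))

# Fit on the first half of the rows and return the accuracy on the second half.
holdout_accuracy <- function(data, features) {
  train_rows <- seq_len(nrow(data)) <= nrow(data) / 2
  fit <- glm(reformulate(features, response = "y"),
             data = data[train_rows, ], family = binomial)
  prob <- predict(fit, newdata = data[!train_rows, ], type = "response")
  # glm models the probability of the second factor level
  pred <- ifelse(prob > 0.5, levels(data$y)[2], levels(data$y)[1])
  mean(pred == data$y[!train_rows])
}

features <- c("x1", "x2", "noise")
full <- holdout_accuracy(toy, features)

# Ablation: drop one feature at a time and record the change in accuracy.
ablation <- sapply(features, function(f) {
  holdout_accuracy(toy, setdiff(features, f)) - full
})
print(round(ablation, 3))
```

A clearly negative value means the model got worse without that feature; values close to zero suggest the feature adds little on its own.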

In [22]:
setwd("C:\\Users\\dani5\\Documents\\Projects\\PhD\\p_value_dist\\data")

In [23]:
library(caret)
library(dplyr)
library(xgboost)
library(DALEX)
library(MLmetrics)
library(onehot)
library(pROC)

First, we load the p-value dataset obtained in the previous notebook.

In [24]:
load("notebook_06_01_2019.RData")

## Training of models with feature removal

To show that adding a feature improves or worsens a model, the analyzed metric must be compared with and without that variable through hypothesis testing.

According to several sources (e.g. https://machinelearningmastery.com/statistical-significance-tests-for-comparing-machine-learning-algorithms/), 5x2 repeated cross-validation (5 repeats of 2 folds, each fold holding 50% of the data) is the most robust approach for hypothesis testing (by paired t-tests) when comparing models. Creating the same folds for every model is vital so that paired t-tests can be performed.
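
One way to guarantee identical folds is to build the resampling indices once and reuse them for every model. The sketch below does this for a 5x2 scheme in base R. The helper name `make_5x2_index` and the plain (non-stratified) random halves are assumptions made for illustration; `caret::createMultiFolds()` should offer a stratified equivalent.

```r
# Build the training-row indices for 5 repeats of 2-fold cross-validation.
# Each list element holds the rows used for training in one resample;
# the remaining rows are the held-out half.
make_5x2_index <- function(n_rows, repeats = 5, seed = 1) {
  set.seed(seed)
  index <- list()
  half <- floor(n_rows / 2)
  for (r in seq_len(repeats)) {
    shuffled <- sample(n_rows)
    index[[sprintf("Fold1.Rep%d", r)]] <- sort(shuffled[seq_len(half)])
    index[[sprintf("Fold2.Rep%d", r)]] <- sort(shuffled[-seq_len(half)])
  }
  index
}

folds <- make_5x2_index(n_rows = 100)
str(folds, list.len = 4)

# With caret loaded, the same list could be handed to every model, e.g.
# trainControl(method = "repeatedcv", number = 2, repeats = 5, index = folds, ...)
```

Passing the indices explicitly makes the pairing of resamples visible in the code instead of relying only on the random seed.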

These requirements are taken into account to update the settings and the hyperparameters of the model training:

In [25]:
train_control <- trainControl(method = "cv",
                              number = 10,
                              classProbs = TRUE,
                              savePredictions = "final")

In [26]:
xgb_grid <- expand.grid(nrounds = 300,
                        max_depth = 9,
                        min_child_weight = 3,
                        subsample = 0.8,
                        gamma = 0,
                        colsample_bytree = 0.8,
                        eta = 0.3)

As the settings have been changed, the model with all the features needs to be trained again:

The AUC obtained with this kind of cross-validation (0.7010) is considerably lower than the one (0.7416) obtained earlier with the .632 bootstrap. This decrease is caused by the much smaller share of rows (50% of the original dataset) used in each repeat of the 2-fold cross-validation.

Next, one model is trained for each of the six factors studied, removing that factor each time:

In [43]:
# The seed is reset before every call so that caret draws the same folds for each model.
set.seed(1)
xgboost_model_no_country <- train(pvalue ~ .,
                                  p_value_final %>% select(-Country),
                                  method = "xgbTree",
                                  metric = "Kappa",
                                  tuneGrid = xgb_grid,
                                  trControl = train_control)
set.seed(1)
xgboost_model_no_species <- train(pvalue ~ .,
                                  p_value_final %>% select(-Species),
                                  method = "xgbTree",
                                  metric = "Kappa",
                                  tuneGrid = xgb_grid,
                                  trControl = train_control)
set.seed(1)
xgboost_model_no_field <- train(pvalue ~ .,
                                p_value_final %>% select(-Field),
                                method = "xgbTree",
                                metric = "Kappa",
                                tuneGrid = xgb_grid,
                                trControl = train_control)
set.seed(1)
xgboost_model_no_dataset <- train(pvalue ~ .,
                                  p_value_final %>% select(-dataset),
                                  method = "xgbTree",
                                  metric = "Kappa",
                                  tuneGrid = xgb_grid,
                                  trControl = train_control)
set.seed(1)
xgboost_model_no_citation <- train(pvalue ~ .,
                                   p_value_final %>% select(-Citation),
                                   method = "xgbTree",
                                   metric = "Kappa",
                                   tuneGrid = xgb_grid,
                                   trControl = train_control)
set.seed(1)
xgboost_model_no_year <- train(pvalue ~ .,
                               p_value_final %>% select(-year),
                               method = "xgbTree",
                               metric = "Kappa",
                               tuneGrid = xgb_grid,
                               trControl = train_control)
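
The six calls differ only in the column that is dropped, so they could be wrapped in a small helper. The function below is a hypothetical refactoring (the name `train_without` and its arguments are not part of the original notebook); it assumes the same `train()` arguments used in the cell above.

```r
# Hypothetical helper: train the same xgbTree model after removing some columns.
# `remove = NULL` keeps every feature, i.e. it trains the full model.
train_without <- function(data, tune_grid, ctrl, remove = NULL, seed = 1) {
  stopifnot(all(remove %in% names(data)), !("pvalue" %in% remove))
  kept <- data[, setdiff(names(data), remove), drop = FALSE]
  set.seed(seed)  # same seed, so the same folds are drawn for every model
  caret::train(pvalue ~ ., data = kept,
               method = "xgbTree", metric = "Kappa",
               tuneGrid = tune_grid, trControl = ctrl)
}

# Possible usage inside the notebook (objects defined in earlier cells):
# xgboost_model <- train_without(p_value_final, xgb_grid, train_control)
# ablated <- lapply(c("Country", "Species", "Field", "dataset", "Citation", "year"),
#                   function(f) train_without(p_value_final, xgb_grid, train_control, remove = f))
```

The commented usage lines would also give one way to retrain the model with all features under the new settings.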


## Results of hypothesis testing

The next steps are to:
* evaluate how the predictions worsen after removing each variable during model training.
* prepare McNemar's tests to evaluate whether the difference in the predictions when removing the variable is significant.
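
A McNemar's test between two models compares, row by row, whether each model classified the same observation correctly. The sketch below uses simulated labels and predictions (all vectors are made up for illustration) and base R's `mcnemar.test()`.

```r
# Toy labels and predictions from two hypothetical models on the same 200 rows.
set.seed(1)
obs <- factor(sample(c("low", "high"), 200, replace = TRUE))
swap <- function(x) factor(ifelse(x == "low", "high", "low"), levels = levels(x))

pred_a <- obs
flip_a <- sample(200, 50)   # model A gets 50 rows wrong
pred_a[flip_a] <- swap(obs[flip_a])

pred_b <- obs
flip_b <- sample(200, 80)   # model B gets 80 rows wrong
pred_b[flip_b] <- swap(obs[flip_b])

# Cross-tabulate whether each model is correct on each row.
correct_a <- factor(pred_a == obs, levels = c(TRUE, FALSE))
correct_b <- factor(pred_b == obs, levels = c(TRUE, FALSE))
agreement <- table(model_a = correct_a, model_b = correct_b)
print(agreement)

# McNemar's test only uses the off-diagonal (discordant) cells.
result <- mcnemar.test(agreement)
print(result$p.value)
```

In this notebook the predictions saved by caret (`$pred`) should carry a `rowIndex` column, which would allow aligning two models on the same rows before building such a table. Note that the "Mcnemar's Test P-Value" line printed by `confusionMatrix()` compares predictions against the reference labels of a single model, not two models against each other.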

First, let's recall the metrics of the model trained with all variables.

In [44]:
print(confusionMatrix(xgboost_model$pred$pred,xgboost_model$pred$obs,positive = "X.0.01.0.05.",mode="everything"))

Confusion Matrix and Statistics

              Reference
Prediction     X.0.0.01. X.0.01.0.05.
  X.0.0.01.        37975        19548
  X.0.01.0.05.     15952        29618
                                          
               Accuracy : 0.6557          
                 95% CI : (0.6527, 0.6586)
    No Information Rate : 0.5231          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.3076          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.6024          
            Specificity : 0.7042          
         Pos Pred Value : 0.6499          
         Neg Pred Value : 0.6602          
              Precision : 0.6499          
                 Recall : 0.6024          
                     F1 : 0.6253          
             Prevalence : 0.4769          
         Detection Rate : 0.2873          
   Detection Prevalence : 0.4420          
      Balanc
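
For reference, the main statistics in this output can be recomputed by hand from the four counts of the confusion matrix. The sketch below copies those counts and derives accuracy, sensitivity, specificity, balanced accuracy and Cohen's kappa; the variable names are chosen here for illustration.

```r
# Counts copied from the confusion matrix above (rows = prediction, columns = reference).
cm <- matrix(c(37975, 15952, 19548, 29618), nrow = 2,
             dimnames = list(prediction = c("X.0.0.01.", "X.0.01.0.05."),
                             reference  = c("X.0.0.01.", "X.0.01.0.05.")))

positive <- "X.0.01.0.05."
negative <- "X.0.0.01."
n <- sum(cm)

accuracy    <- sum(diag(cm)) / n
sensitivity <- cm[positive, positive] / sum(cm[, positive])
specificity <- cm[negative, negative] / sum(cm[, negative])
balanced    <- (sensitivity + specificity) / 2

# Cohen's kappa: observed agreement corrected for the agreement expected by chance.
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, balanced = balanced,
              kappa = cohen_kappa), 4))
```

The same arithmetic applies to the confusion matrices of the ablated models.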

Citation count:

In [45]:
print(confusionMatrix(xgboost_model_no_citation$pred$pred,xgboost_model_no_citation$pred$obs,positive = "X.0.01.0.05.",mode="everything"))

Confusion Matrix and Statistics

              Reference
Prediction     X.0.0.01. X.0.01.0.05.
  X.0.0.01.        37209        26059
  X.0.01.0.05.     16718        23107
                                         
               Accuracy : 0.5851         
                 95% CI : (0.582, 0.5881)
    No Information Rate : 0.5231         
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.1613         
 Mcnemar's Test P-Value : < 2.2e-16      
                                         
            Sensitivity : 0.4700         
            Specificity : 0.6900         
         Pos Pred Value : 0.5802         
         Neg Pred Value : 0.5881         
              Precision : 0.5802         
                 Recall : 0.4700         
                     F1 : 0.5193         
             Prevalence : 0.4769         
         Detection Rate : 0.2241         
   Detection Prevalence : 0.3863         
      Balanced Accuracy : 0.580

Country of affiliation:

In [47]:
print(confusionMatrix(xgboost_model_no_country$pred$pred,xgboost_model_no_country$pred$obs,positive = "X.0.01.0.05.",mode="everything"))

Confusion Matrix and Statistics

              Reference
Prediction     X.0.0.01. X.0.01.0.05.
  X.0.0.01.        36172        20629
  X.0.01.0.05.     17755        28537
                                          
               Accuracy : 0.6277          
                 95% CI : (0.6247, 0.6306)
    No Information Rate : 0.5231          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.2518          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.5804          
            Specificity : 0.6708          
         Pos Pred Value : 0.6165          
         Neg Pred Value : 0.6368          
              Precision : 0.6165          
                 Recall : 0.5804          
                     F1 : 0.5979          
             Prevalence : 0.4769          
         Detection Rate : 0.2768          
   Detection Prevalence : 0.4490          
      Balanc

Year of publication:

In [49]:
print(confusionMatrix(xgboost_model_no_year$pred$pred,xgboost_model_no_year$pred$obs,positive = "X.0.01.0.05.",mode="everything"))

Confusion Matrix and Statistics

              Reference
Prediction     X.0.0.01. X.0.01.0.05.
  X.0.0.01.        37467        21592
  X.0.01.0.05.     16460        27574
                                          
               Accuracy : 0.6309          
                 95% CI : (0.6279, 0.6338)
    No Information Rate : 0.5231          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.2568          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.5608          
            Specificity : 0.6948          
         Pos Pred Value : 0.6262          
         Neg Pred Value : 0.6344          
              Precision : 0.6262          
                 Recall : 0.5608          
                     F1 : 0.5917          
             Prevalence : 0.4769          
         Detection Rate : 0.2675          
   Detection Prevalence : 0.4271          
      Balanc

p-value dataset source:

In [51]:
print(confusionMatrix(xgboost_model_no_dataset$pred$pred,xgboost_model_no_dataset$pred$obs,positive = "X.0.01.0.05.",mode="everything"))

Confusion Matrix and Statistics

              Reference
Prediction     X.0.0.01. X.0.01.0.05.
  X.0.0.01.        37854        21016
  X.0.01.0.05.     16073        28150
                                          
               Accuracy : 0.6402          
                 95% CI : (0.6373, 0.6432)
    No Information Rate : 0.5231          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.2757          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.5726          
            Specificity : 0.7019          
         Pos Pred Value : 0.6365          
         Neg Pred Value : 0.6430          
              Precision : 0.6365          
                 Recall : 0.5726          
                     F1 : 0.6029          
             Prevalence : 0.4769          
         Detection Rate : 0.2731          
   Detection Prevalence : 0.4290          
      Balanc

Species/kingdom analyzed:

In [53]:
print(confusionMatrix(xgboost_model_no_species$pred$pred,xgboost_model_no_species$pred$obs,positive = "X.0.01.0.05.",mode="everything"))

Confusion Matrix and Statistics

              Reference
Prediction     X.0.0.01. X.0.01.0.05.
  X.0.0.01.        37541        19775
  X.0.01.0.05.     16386        29391
                                          
               Accuracy : 0.6492          
                 95% CI : (0.6463, 0.6522)
    No Information Rate : 0.5231          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.2948          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.5978          
            Specificity : 0.6961          
         Pos Pred Value : 0.6420          
         Neg Pred Value : 0.6550          
              Precision : 0.6420          
                 Recall : 0.5978          
                     F1 : 0.6191          
             Prevalence : 0.4769          
         Detection Rate : 0.2851          
   Detection Prevalence : 0.4440          
      Balanc

-Omics field:

In [55]:
print(confusionMatrix(xgboost_model_no_field$pred$pred,xgboost_model_no_field$pred$obs,positive = "X.0.01.0.05.",mode="everything"))

Confusion Matrix and Statistics

              Reference
Prediction     X.0.0.01. X.0.01.0.05.
  X.0.0.01.        36834        20519
  X.0.01.0.05.     17093        28647
                                          
               Accuracy : 0.6352          
                 95% CI : (0.6322, 0.6381)
    No Information Rate : 0.5231          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.2665          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.5827          
            Specificity : 0.6830          
         Pos Pred Value : 0.6263          
         Neg Pred Value : 0.6422          
              Precision : 0.6263          
                 Recall : 0.5827          
                     F1 : 0.6037          
             Prevalence : 0.4769          
         Detection Rate : 0.2779          
   Detection Prevalence : 0.4437          
      Balanc

In [66]:
# Join the per-fold Accuracy and Kappa of the full model with those of each ablated model.
metric_comp <- xgboost_model$resample %>%
  left_join(xgboost_model_no_field$resample, suffix = c("", ".no_field"), by = "Resample") %>%
  left_join(xgboost_model_no_citation$resample, suffix = c("", ".no_citation"), by = "Resample") %>%
  left_join(xgboost_model_no_dataset$resample, suffix = c("", ".no_dataset"), by = "Resample") %>%
  left_join(xgboost_model_no_country$resample, suffix = c("", ".no_country"), by = "Resample") %>%
  left_join(xgboost_model_no_year$resample, suffix = c("", ".no_year"), by = "Resample") %>%
  left_join(xgboost_model_no_species$resample, suffix = c("", ".no_species"), by = "Resample")

# Show the columns in alphabetical order.
metric_comp %>% select(order(colnames(metric_comp)))

Accuracy,Accuracy.no_citation,Accuracy.no_country,Accuracy.no_dataset,Accuracy.no_field,Accuracy.no_species,Accuracy.no_year,Kappa,Kappa.no_citation,Kappa.no_country,Kappa.no_dataset,Kappa.no_field,Kappa.no_species,Kappa.no_year,Resample
0.6505966,0.5814337,0.6265399,0.6332331,0.628092,0.6413813,0.6243089,0.2975326,0.1540778,0.2497901,0.2616844,0.2521257,0.2790901,0.2438698,Fold01
0.6512756,0.5886119,0.6258609,0.6393443,0.6330391,0.6485595,0.6260549,0.2997256,0.1688564,0.2487715,0.2744649,0.2630915,0.2941951,0.2479682,Fold02
0.6642739,0.5841498,0.630808,0.6493355,0.6410903,0.6552527,0.6437094,0.3254334,0.1595813,0.2586638,0.2945945,0.2786502,0.3077735,0.2829028,Fold03
0.6620757,0.5899127,0.6263822,0.6470417,0.6393792,0.6503395,0.6331717,0.3202428,0.1705395,0.2489166,0.2887645,0.274888,0.2961846,0.2603149,Fold04
0.6543797,0.5846348,0.6224658,0.6407023,0.6345911,0.6536036,0.629256,0.3048436,0.1604178,0.2410312,0.2765044,0.2652413,0.3031192,0.253739,Fold05
0.6490105,0.5811021,0.6346527,0.6304812,0.6331975,0.6427047,0.6276678,0.2938707,0.1540334,0.265462,0.2559962,0.2623437,0.2811651,0.2503385,Fold06
0.6532493,0.5820563,0.6300679,0.6380213,0.6383123,0.6499515,0.628419,0.3023937,0.1554366,0.2567141,0.2709102,0.2723954,0.2961776,0.2520349,Fold07
0.6507906,0.5867688,0.6240178,0.6421573,0.631293,0.6429334,0.6314871,0.2975657,0.1648668,0.2445681,0.2792214,0.2582464,0.2822836,0.2581297,Fold08
0.6601358,0.5881668,0.6304559,0.6407371,0.6311348,0.6521823,0.6322987,0.3165623,0.1677319,0.2573199,0.2772148,0.2584375,0.300984,0.2589762,Fold09
0.6607177,0.5838021,0.6255092,0.6413191,0.6415131,0.6554801,0.6325897,0.3177571,0.1576688,0.24706,0.2778652,0.2796974,0.3073374,0.2596495,Fold10
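
Because every model was resampled on the same folds, the per-fold metrics in this table can be compared with paired t-tests. The sketch below copies the Kappa columns from the table and tests the full model against each ablated model. The Holm correction for the six comparisons is an addition made here, and the p-values should be read with caution, since the training sets of cross-validation folds overlap and the folds are therefore not independent.

```r
# Per-fold Kappa values copied from the resample table above (Fold01 to Fold10).
kappa_by_fold <- data.frame(
  full        = c(0.2975326, 0.2997256, 0.3254334, 0.3202428, 0.3048436,
                  0.2938707, 0.3023937, 0.2975657, 0.3165623, 0.3177571),
  no_citation = c(0.1540778, 0.1688564, 0.1595813, 0.1705395, 0.1604178,
                  0.1540334, 0.1554366, 0.1648668, 0.1677319, 0.1576688),
  no_country  = c(0.2497901, 0.2487715, 0.2586638, 0.2489166, 0.2410312,
                  0.2654620, 0.2567141, 0.2445681, 0.2573199, 0.2470600),
  no_dataset  = c(0.2616844, 0.2744649, 0.2945945, 0.2887645, 0.2765044,
                  0.2559962, 0.2709102, 0.2792214, 0.2772148, 0.2778652),
  no_field    = c(0.2521257, 0.2630915, 0.2786502, 0.2748880, 0.2652413,
                  0.2623437, 0.2723954, 0.2582464, 0.2584375, 0.2796974),
  no_species  = c(0.2790901, 0.2941951, 0.3077735, 0.2961846, 0.3031192,
                  0.2811651, 0.2961776, 0.2822836, 0.3009840, 0.3073374),
  no_year     = c(0.2438698, 0.2479682, 0.2829028, 0.2603149, 0.2537390,
                  0.2503385, 0.2520349, 0.2581297, 0.2589762, 0.2596495)
)

ablated <- setdiff(names(kappa_by_fold), "full")
summary_table <- data.frame(
  removed   = ablated,
  # Mean loss of Kappa when the feature is removed
  mean_drop = sapply(ablated, function(v) mean(kappa_by_fold$full - kappa_by_fold[[v]])),
  # Paired t-test: the same fold is compared with and without the feature
  p_value   = sapply(ablated, function(v)
    t.test(kappa_by_fold$full, kappa_by_fold[[v]], paired = TRUE)$p.value),
  row.names = NULL
)
summary_table$p_adjusted <- p.adjust(summary_table$p_value, method = "holm")
print(summary_table[order(-summary_table$mean_drop), ])
```

Sorting by `mean_drop` ranks the factors by how much Kappa is lost when each one is removed.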


In [67]:
save.image("notebook_06_01_2019_ablation.RData")
