# ASReview - Testing Different Feature Extraction Methods

The goal is to compare classical feature extraction methods to transformer (state-of-the-art) feature extraction models, in the context of systematic reviews. To study the effect of transformer feature extractors, three new feature extraction models are implemented using the ASReview software. The new feature extraction models used for the simulations are:

1. **RoBERTa-base** (stsb-roberta-base-v2) → *transformer/state-of-the-art (SOTA) model*
2. **DistilRoBERTa** (all-distilroberta-v1) → *transformer/state-of-the-art (SOTA) model*
3. **SPECTER** (allenai-specter) → *transformer/state-of-the-art (SOTA) model*

The new models are compared with three feature extraction models previously implemented in ASReview:
4. **MPNET** (all-mpnet-base-v2), ASReview default transformer model → *transformer/state-of-the-art (SOTA) model*
5. **Tf-idf** → *classical model*
6. **Doc2Vec** → *classical model*

The transformer feature extractors are implemented using the [sentence-transformers](https://huggingface.co/sentence-transformers) library, with the models hosted on Hugging Face.

The Python API documentation for ASReview (used for the simulations below) can be found at: https://asreview.readthedocs.io/en/latest/reference.html. Please note that version 0.19.3 was used for this study, while the newer version of ASReview is 1.0, so the latest documentation may not match the code below exactly. The full ASReview code can be found at: https://github.com/asreview/asreview

In [1]:
%%capture
# to record runtime
!pip install ipython-autotime
%load_ext autotime

time: 3.81 ms (started: 2022-07-01 10:51:23 +00:00)


In [2]:
%%capture
# install ASReview software
!pip install asreview=="0.19.3"

time: 11.9 s (started: 2022-07-01 10:51:23 +00:00)


In [3]:
# import libraries
import numpy as np
import pandas as pd

time: 327 ms (started: 2022-07-01 10:51:35 +00:00)


In [4]:
# import libraries from asreview
import asreview
from asreview.models import *
from asreview.query_strategies import *
from asreview.balance_strategies import *
from asreview.feature_extraction import *

time: 3.36 s (started: 2022-07-01 10:51:35 +00:00)


In [5]:
%%capture
# install statistics (contains metrics) and visualization libraries from asreview
!pip install asreview-statistics asreview-visualization

time: 4.64 s (started: 2022-07-01 10:51:39 +00:00)


In [6]:
# to check if I'm using a high-ram runtime
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

Your runtime has 27.3 gigabytes of available RAM

You are using a high-RAM runtime!
time: 4.17 ms (started: 2022-07-01 10:51:43 +00:00)


## Import Data: 

The PTSD Trajectories benchmark dataset by van de Schoot et al. was chosen. It consists of articles about longitudinal studies examining posttraumatic stress after trauma (van de Schoot et al., 2017; van de Schoot et al., 2018). The dataset contains 6,189 studies extracted from PubMed, Embase, PsycINFO, and Scopus. Because it is a labeled dataset, it is known that 43 of those studies are “relevant” (included in the systematic review).


- Note: All of the benchmark datasets can be found on GitHub: https://github.com/asreview/systematic-review-datasets
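To make the class imbalance described above concrete, the short calculation below works out the share of relevant records from the two counts given in the dataset description. It is only a worked example of the arithmetic; the variable names are illustrative.

```python
# Counts taken from the dataset description above
n_records = 6189   # total number of studies in the dataset
n_relevant = 43    # studies labelled relevant (included in the review)

n_irrelevant = n_records - n_relevant
prevalence = n_relevant / n_records

print(f"Irrelevant records: {n_irrelevant}")
print(f"Share of relevant records: {prevalence:.2%}")
```

With fewer than 1 in 100 records being relevant, the order in which records are screened matters a great deal, which is what the simulations below measure.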

In [7]:
# import van_de_Schoot dataset (Method 1 - Github link)
url = "https://raw.githubusercontent.com/asreview/systematic-review-datasets/master/datasets/van_de_Schoot_2017/output/van_de_Schoot_2017.csv"
data = pd.read_csv(url)

time: 1.62 s (started: 2022-07-01 10:51:44 +00:00)


In [8]:
# import van_de_Schoot dataset (Method 2 - Import local file)
#data = pd.read_csv("/content/van_de_Schoot_2017.csv")

time: 999 µs (started: 2022-07-01 10:51:45 +00:00)


In [9]:
# view data
data.head()

Unnamed: 0,record_id,title,abstract,keywords,authors,year,date,doi,label_included,label_abstract_screening,duplicate_record_id
0,1,Manual for ASEBA School-Age Forms & Profiles,,,"Achenbach, T. M., Rescorla, L. A.",2001.0,2001.0,,0,0,
1,2,Queensland Trauma Registry: A summary of paedi...,,,"Dallow, N., Lang, J., Bellamy, N.",2007.0,2007.0,,0,0,
2,3,Posttraumatic Stress Disorder: Scientific and ...,This comprehensive overview of research and cl...,,"Ford, J. D., Grasso, D. J., Elhai, J. D., Cour...",2015.0,,,0,0,
3,4,SOCIAL CLASS AND MENTAL ILLNESS,,,"Hollingshead, A. B., Redlich, F. C.",1958.0,,,0,0,
4,5,Computerised test generation for cross-nationa...,“‘Computerised Test Generation for Cross-Natio...,,"Irvine, S. H.",2014.0,,,0,0,


time: 24.3 ms (started: 2022-07-01 10:51:45 +00:00)


In [10]:
# have to ensure the columns are string type for the simulations to be run
data['abstract'] = data['abstract'].astype(str)
data['keywords'] = data['keywords'].astype(str)
data['title'] = data['title'].astype(str)
data['authors'] = data['authors'].astype(str)

time: 12.2 ms (started: 2022-07-01 10:51:45 +00:00)
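One side effect of the cast above is worth keeping in mind: pandas reads empty CSV cells as NaN, and `astype(str)` turns NaN into the literal string `"nan"`. The sketch below shows the same behaviour in plain Python and a hypothetical helper, `clean_text`, that maps missing values to an empty string instead. With pandas, calling `fillna("")` before `astype(str)` should have the same effect. Whether this changes the simulation results was not tested here.

```python
import math

def clean_text(value):
    """Return '' for missing values (None or NaN), otherwise the value as a string.

    Hypothetical helper for illustration; it is not part of ASReview or pandas.
    """
    if value is None:
        return ""
    if isinstance(value, float) and math.isnan(value):
        return ""
    return str(value)

missing = float("nan")
print(repr(str(missing)))         # plain str() keeps the placeholder text
print(repr(clean_text(missing)))  # the helper returns an empty string
print(repr(clean_text("Posttraumatic stress after trauma")))
```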


## Hugging Face *sentence transformers* library:

All sentence transformer models in this study were implemented using the [sentence-transformers library](https://huggingface.co/sentence-transformers) on Hugging Face. Hugging Face (https://huggingface.co/) hosts open-source machine learning models that users can upload and download for free, including a wide variety of sentence transformer models.

A few things to note about Hugging Face model selection:
- Model version and the number of downloads were both examined when selecting which model to implement from the sentence-transformers page 
- The “base” (stsb-roberta-base-v2) version of the RoBERTa model was selected 
- The “distil” (all-distilroberta-v1) version of the RoBERTa model was also selected. The “distilled” version is a lighter and faster version of the "base" model, with half the number of layers and a little over half of the parameters
- MPNET and SPECTER were also implemented using the sentence transformers library

In [11]:
%%capture
!pip install -U sentence-transformers

time: 12 s (started: 2022-07-01 10:51:51 +00:00)


In [12]:
from sentence_transformers import SentenceTransformer

time: 3.11 s (started: 2022-07-01 10:52:03 +00:00)


#### Side-note: Code for working with the Sentence Transformers library
- If needed, the feature matrix can be extracted directly with the sentence transformers library

In [9]:
# Example of how to use this library to extract embeddings (kept as a string so it is not executed)

'''
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
embeddings = model.encode(sentences)
print(embeddings)
'''


time: 8.67 ms (started: 2022-06-30 15:34:03 +00:00)


Below is an example of how features can be extracted using the dataset of this study: 

In [13]:
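# prepare the abstracts from the van de Schoot dataset for the SBERT model
abstracts = data[['abstract']]

# turn abstracts into a list (one single-element list per record)
abstracts = abstracts.astype(str)
abstracts_list = abstracts.values.tolist()

# flat list of abstract strings; this is what is passed to the model below
data['abstract'] = data['abstract'].astype(str)
corpus = data.abstract.tolist()

type(corpus)  # a list
len(corpus)   # number of abstracts (only this last expression is displayed)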

6189

time: 26.4 ms (started: 2022-07-01 10:52:38 +00:00)


In [14]:
# Run Hugging Face model and extract features
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
embeddings_schoot = model.encode(corpus)
print(embeddings_schoot)

[[-0.02318714  0.05149743 -0.00239223 ...  0.02563844 -0.07033018
  -0.00896618]
 [-0.02318714  0.05149743 -0.00239223 ...  0.02563844 -0.07033018
  -0.00896618]
 [ 0.04867791 -0.01425323  0.0234943  ... -0.03346902  0.00484266
   0.0180548 ]
 ...
 [ 0.03718977 -0.02467943 -0.0255736  ... -0.01454282 -0.01708022
  -0.01819532]
 [-0.02318714  0.05149743 -0.00239223 ...  0.02563844 -0.07033018
  -0.00896618]
 [-0.02318714  0.05149743 -0.00239223 ...  0.02563844 -0.07033018
  -0.00896618]]
time: 1min 41s (started: 2022-07-01 10:52:42 +00:00)
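Several rows of the printed embedding matrix are identical, which suggests that the corresponding input texts are identical too (for example, records without an abstract). A quick way to check this is to count repeated strings in the corpus before encoding. The sketch below uses a small made-up list rather than the real `corpus`; the helper name `repeated_texts` is illustrative.

```python
from collections import Counter

def repeated_texts(corpus, top=3):
    """Return the most frequent texts that occur more than once, with their counts."""
    counts = Counter(text.strip().lower() for text in corpus)
    return [(text, n) for text, n in counts.most_common(top) if n > 1]

# Made-up example: three records share the same placeholder text
demo_corpus = ["nan", "nan", "A study of PTSD trajectories.", "nan"]
print(repeated_texts(demo_corpus))
```

Running `repeated_texts(corpus)` on the real data would show how many records share the same text and therefore receive the same embedding.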


## Simulation Mode: from ASReview Documentation
- The ASReview code for all feature extraction models can be found on Github: https://github.com/asreview/asreview/tree/master/asreview/models/feature_extraction

In [14]:
# set seed
import random
random.seed(10)

time: 1.36 ms (started: 2022-06-30 15:34:04 +00:00)
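Note that `random.seed` only seeds Python's built-in generator. Libraries built on NumPy draw from NumPy's generator or from their own `random_state` arguments, so a slightly broader seeding helper can be useful. The function below is a minimal sketch; `seed_everything` is a hypothetical name, and it does not make GPU or multi-threaded computations deterministic.

```python
import random

def seed_everything(seed):
    """Seed Python's RNG and, if NumPy is installed, NumPy's global RNG as well."""
    random.seed(seed)
    try:
        import numpy as np
    except ImportError:
        return  # NumPy is not available; only Python's RNG was seeded
    np.random.seed(seed)

seed_everything(10)
first = random.random()
seed_everything(10)
second = random.random()
print(first == second)  # the same seed gives the same draw
```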



## **Running the Simulations:** 

Most of the simulations were run in a loop that uses the Python API to produce the state file and then the command line to calculate the metrics using the [ASReview *statistics* library](https://pypi.org/project/asreview-statistics/). For convenience, all simulations using the NN2 classifier were run separately because they take much longer to run. The NN2 classifier could easily be added to the simulation loop with the other classifiers if preferred.
- Note: `init_seed` has been set in the simulation settings to reduce randomness and bias when comparing models. Results may still differ slightly between runs, especially with the transformer feature extractors, which are not deterministic.
- For further explanations about the metrics used, please refer to: https://asreview.nl/blog/simulation-mode-class-101/
- The following default settings have been used for the simulations: 
  - Query strategy: max query
  - Balance strategy: double balance
  - Number of instances: 10
  - Initial seed: 10 (can be any fixed number)
  - Prior knowledge: 1 relevant, 1 irrelevant
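To give a feel for the metrics mentioned above, the sketch below computes three of them from a list of labels in screening order (1 = relevant, 0 = irrelevant). The definitions follow common usage in the screening literature: RRF as the share of relevant records found after screening a given fraction, WSS as the work saved over random screening at a given recall, and ATD as the average position of the relevant records relative to the dataset size. The exact conventions used by `asreview stat` (for example rounding or the handling of prior records) may differ, so treat this as an illustration only.

```python
import math

def _check(labels):
    if sum(labels) == 0:
        raise ValueError("labels contain no relevant records")

def rrf(labels, fraction=0.10):
    """Share of relevant records found after screening `fraction` of all records."""
    _check(labels)
    n_screened = math.ceil(fraction * len(labels))
    return sum(labels[:n_screened]) / sum(labels)

def wss(labels, recall=0.95):
    """Work saved over random screening once `recall` of the relevant records is found."""
    _check(labels)
    target = math.ceil(recall * sum(labels))
    found = 0
    for n_screened, label in enumerate(labels, start=1):
        found += label
        if found >= target:
            return (len(labels) - n_screened) / len(labels) - (1 - recall)

def atd(labels):
    """Average time to discovery, expressed as a fraction of the dataset size."""
    _check(labels)
    positions = [i for i, label in enumerate(labels, start=1) if label == 1]
    return sum(positions) / len(positions) / len(labels)

# Toy screening order: relevant records are found at positions 1, 3 and 9 of 10
order = [1, 0, 1, 0, 0, 0, 0, 0, 1, 0]
print(f"RRF@10: {rrf(order):.3f}")
print(f"WSS@95: {wss(order):.3f}")
print(f"ATD:    {atd(order):.3f}")
```

Lower ATD and higher WSS and RRF indicate that relevant records were found earlier in the screening process.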


# (1) Feature Extractor: RoBERTa-base


The model is: "stsb-roberta-base-v2"

https://huggingface.co/sentence-transformers/stsb-roberta-base-v2


Running the RoBERTa-base model with the SVM, Logistic Regression, Random Forest, and NN2 classifiers: 

In [19]:
# list of state file names
state_file_list = ["schoot_robertabase_SVM.h5", "schoot_robertabase_LR.h5","schoot_robertabase_RF.h5"]
# list of classifiers
classifier_list = [SVMClassifier(), LogisticClassifier(), RandomForestClassifier()]

time: 2.08 ms (started: 2022-06-30 20:58:29 +00:00)
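The two lists above have to stay in the same order, because the simulation loop pairs them by index. One way to reduce the risk of a mismatch is to generate the state file names from short classifier tags. The helper below is a hypothetical sketch that reproduces the naming pattern used in this notebook.

```python
def state_file_names(dataset, extractor, classifier_tags):
    """Map each classifier tag to a state file name such as 'schoot_robertabase_SVM.h5'."""
    return {tag: f"{dataset}_{extractor}_{tag}.h5" for tag in classifier_tags}

names = state_file_names("schoot", "robertabase", ["SVM", "LR", "RF"])
for tag, file_name in names.items():
    print(tag, "->", file_name)
```

The resulting file names can then be iterated together with the classifier objects, for example with `zip(names.values(), classifier_list)`.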


In [20]:
# RoBERTa-base - run on all classifiers but NN2
import random 
random.seed(10)

for i in range(0, len(state_file_list)):
  # Load data
  as_data = asreview.data.ASReviewData(df = data)
  # Settings
  train_model = classifier_list[i]
  query_model = MaxQuery()
  balance_model = DoubleBalance()
  feature_model = asreview.models.feature_extraction.SBERT(transformer_model= "stsb-roberta-base-v2")

  # Start the review process
  reviewer_full_schoot = asreview.ReviewSimulate(
      as_data,
      model=train_model,
      query_model=query_model,
      balance_model=balance_model,
      feature_model=feature_model,
      n_instances=10,
      init_seed=10,
      n_prior_included=1,
      n_prior_excluded=1,
      state_file=state_file_list[i]
  )
  reviewer_full_schoot.review()


time: 35min 52s (started: 2022-06-30 20:58:31 +00:00)


In [21]:
# metrics for the RoBERTa-base feature extractor (for all classifiers but NN2)
!asreview stat "/content/schoot_robertabase_SVM.h5"
!asreview stat "/content/schoot_robertabase_LR.h5"
!asreview stat "/content/schoot_robertabase_RF.h5"

************  schoot_robertabase_SVM.h5  *

-----------  general  -----------
Number of runs            : 1
Number of papers          : 6189
Number of included papers : 43
Number of excluded papers : 6146
Number of unlabeled papers: 0
Number of queries         : 620

-----------  settings  -----------
data_name         : empty
model             : svm
query_strategy    : max
balance_strategy  : double
feature_extraction: sbert
n_instances       : 10
n_prior_included  : 1
n_prior_excluded  : 1
mode              : simulate
model_param       : {'gamma': 'auto', 'class_weight': 0.249, 'C': 15.4, 'kernel': 'linear'}
query_param       : {}
feature_param     : {'transformer_model': 'stsb-roberta-base-v2', 'split_ta': 0, 'use_keywords': 0}
balance_param     : {'a': 2.155, 'alpha': 0.94, 'b': 0.789, 'beta': 1.0}

-----------    ATD    -----------
 0.137

Time to discovery:

    row   : value
    2616  : 14.0
    5244  : 31.0
    4011  : 59.0
    719   : 65.0
    720   : 66.0
    675   : 105.0
  

In [22]:
# RoBERTa-base - run on NN2
import random 
random.seed(10)

# Load data
as_data = asreview.data.ASReviewData(df = data)
# Settings
train_model = NN2LayerClassifier()
query_model = MaxQuery()
balance_model = DoubleBalance()
feature_model = asreview.models.feature_extraction.SBERT(transformer_model= "stsb-roberta-base-v2")

# Start the review process
reviewer_full_schoot = asreview.ReviewSimulate(
    as_data,
    model=train_model,
    query_model=query_model,
    balance_model=balance_model,
    feature_model=feature_model,
    n_instances=10,
    init_seed=10,
    n_prior_included=1,
    n_prior_excluded=1,
    state_file="/content/schoot_robertabase_NN2.h5"
)
reviewer_full_schoot.review()


time: 1h 47min 18s (started: 2022-06-30 21:57:51 +00:00)


In [23]:
# metrics for the RoBERTa-base feature extractor with NN2
!asreview stat "/content/schoot_robertabase_NN2.h5"

************  schoot_robertabase_NN2.h5  *

-----------  general  -----------
Number of runs            : 1
Number of papers          : 6189
Number of included papers : 43
Number of excluded papers : 6146
Number of unlabeled papers: 0
Number of queries         : 620

-----------  settings  -----------
data_name         : empty
model             : nn-2-layer
query_strategy    : max
balance_strategy  : double
feature_extraction: sbert
n_instances       : 10
n_prior_included  : 1
n_prior_excluded  : 1
mode              : simulate
model_param       : {'dense_width': 128, 'optimizer': 'rmsprop', 'learn_rate': 1.0, 'regularization': 0.01, 'verbose': 0, 'epochs': 35, 'batch_size': 32, 'shuffle': False, 'class_weight': 30.0}
query_param       : {}
feature_param     : {'transformer_model': 'stsb-roberta-base-v2', 'split_ta': 0, 'use_keywords': 0}
balance_param     : {'a': 2.155, 'alpha': 0.94, 'b': 0.789, 'beta': 1.0}

-----------    ATD    -----------
 0.157

Time to discovery:

    row   : va

# (2) Feature Extractor: All-distilroberta-v1
- The documentation for this model can be found on Hugging Face: https://huggingface.co/sentence-transformers/all-distilroberta-v1


Running the DistilRoBERTa model with the SVM, Logistic Regression, Random Forest, and NN2 classifiers: 

In [None]:
# list of state file names
state_file_list = ["schoot_distilroberta_SVM.h5", "schoot_distilroberta_LR.h5","schoot_distilroberta_RF.h5"]
# list of classifiers
classifier_list = [SVMClassifier(), LogisticClassifier(), RandomForestClassifier()]

time: 1.86 ms (started: 2022-06-29 15:55:14 +00:00)


In [None]:
# All-distilroberta-v1 - run on all classifiers but NN2
import random 
random.seed(10)

for i in range(0, len(state_file_list)):
  # Load data
  as_data = asreview.data.ASReviewData(df = data)
  # Settings
  train_model = classifier_list[i]
  query_model = MaxQuery()
  balance_model = DoubleBalance()
  feature_model = asreview.models.feature_extraction.SBERT(transformer_model= "all-distilroberta-v1")

  # Start the review process
  reviewer_full_schoot = asreview.ReviewSimulate(
      as_data,
      model=train_model,
      query_model=query_model,
      balance_model=balance_model,
      feature_model=feature_model,
      n_instances=10,
      init_seed=10,
      n_prior_included=1,
      n_prior_excluded=1,
      state_file=state_file_list[i]
  )
  reviewer_full_schoot.review()


time: 35min 16s (started: 2022-06-29 15:55:16 +00:00)


In [None]:
# metrics for distilroberta feature extractor (for all classifiers but NN2) 
!asreview stat "/content/schoot_distilroberta_SVM.h5"
!asreview stat "/content/schoot_distilroberta_LR.h5"
!asreview stat "/content/schoot_distilroberta_RF.h5"

************  schoot_distilroberta_SVM.h5  

-----------  general  -----------
Number of runs            : 1
Number of papers          : 6189
Number of included papers : 43
Number of excluded papers : 6146
Number of unlabeled papers: 0
Number of queries         : 619

-----------  settings  -----------
data_name         : empty
model             : svm
query_strategy    : max
balance_strategy  : double
feature_extraction: sbert
n_instances       : 10
n_prior_included  : 1
n_prior_excluded  : 1
mode              : simulate
model_param       : {'gamma': 'auto', 'class_weight': 0.249, 'C': 15.4, 'kernel': 'linear'}
query_param       : {}
feature_param     : {'transformer_model': 'all-distilroberta-v1', 'split_ta': 0, 'use_keywords': 0}
balance_param     : {'a': 2.155, 'alpha': 0.94, 'b': 0.789, 'beta': 1.0}

-----------    ATD    -----------
 0.0437

Time to discovery:

    row   : value
    5244  : 21.0
    1922  : 22.0
    1933  : 50.0
    5284  : 53.0
    896   : 62.0
    3898  : 63.0
 

In [16]:
# All-distilroberta-v1 - run with NN2
import random 
random.seed(10)

# Load data
as_data = asreview.data.ASReviewData(df = data)
# Settings
train_model = NN2LayerClassifier()
query_model = MaxQuery()
balance_model = DoubleBalance()
feature_model = asreview.models.feature_extraction.SBERT(transformer_model= "all-distilroberta-v1")

# Start the review process
reviewer_full_schoot = asreview.ReviewSimulate(
    as_data,
    model=train_model,
    query_model=query_model,
    balance_model=balance_model,
    feature_model=feature_model,
    n_instances=10,
    init_seed=10,
    n_prior_included=1,
    n_prior_excluded=1,
    state_file="/content/schoot_distilroberta_NN2.h5"
)
reviewer_full_schoot.review()

time: 1h 46min 59s (started: 2022-06-30 17:39:54 +00:00)


In [18]:
# metrics for distilroberta feature extractor with NN2 
!asreview stat "/content/schoot_distilroberta_NN2.h5"

************  schoot_distilroberta_NN2.h5  

-----------  general  -----------
Number of runs            : 1
Number of papers          : 6189
Number of included papers : 43
Number of excluded papers : 6146
Number of unlabeled papers: 0
Number of queries         : 620

-----------  settings  -----------
data_name         : empty
model             : nn-2-layer
query_strategy    : max
balance_strategy  : double
feature_extraction: sbert
n_instances       : 10
n_prior_included  : 1
n_prior_excluded  : 1
mode              : simulate
model_param       : {'dense_width': 128, 'optimizer': 'rmsprop', 'learn_rate': 1.0, 'regularization': 0.01, 'verbose': 0, 'epochs': 35, 'batch_size': 32, 'shuffle': False, 'class_weight': 30.0}
query_param       : {}
feature_param     : {'transformer_model': 'all-distilroberta-v1', 'split_ta': 0, 'use_keywords': 0}
balance_param     : {'a': 2.155, 'alpha': 0.94, 'b': 0.789, 'beta': 1.0}

-----------    ATD    -----------
 0.0358

Time to discovery:

    row   : 

# (3) Feature Extractor: Tf-idf

Implementation of Tf-idf from ASReview (uses the sklearn library): https://github.com/asreview/asreview/blob/master/asreview/models/feature_extraction/tfidf.py

Running the tf-idf model with the SVM, Logistic Regression, Random Forest, NN2, and Naive Bayes classifiers: 

- Note: Tf-idf is the only feature extractor that Naive Bayes can be run with 
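A likely reason for the note above (an assumption, not stated in the source) is that multinomial Naive Bayes, as implemented in scikit-learn, requires non-negative feature values. Tf-idf weights are non-negative by construction, whereas the transformer embeddings printed earlier contain negative entries. The sketch below computes tf-idf weights by hand with a smoothed idf term to illustrate this; it is not the ASReview or scikit-learn implementation.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed idf, no normalisation)."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # document frequency: the number of documents in which each term occurs
    df = Counter(term for tokens in tokenised for term in set(tokens))
    weights = []
    for tokens in tokenised:
        tf = Counter(tokens)
        weights.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return weights

docs = ["trauma stress trajectories", "stress after trauma"]
vectors = tfidf(docs)
all_nonnegative = all(w >= 0 for vec in vectors for w in vec.values())
print(all_nonnegative)
```

Because a term's document frequency can never exceed the number of documents, the logarithm is never negative, so every weight is positive.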

In [None]:
# list of state file names
state_file_list = ["schoot_tfidf_SVM.h5", "schoot_tfidf_LR.h5","schoot_tfidf_RF.h5", "schoot_tfidf_NB.h5"]
# list of classifiers
classifier_list = [SVMClassifier(), LogisticClassifier(), RandomForestClassifier(), NaiveBayesClassifier()]

time: 2.13 ms (started: 2022-06-29 13:02:50 +00:00)


In [None]:
# TF-IDF - run on all classifiers but NN2 (including Naive Bayes)
import random 
random.seed(10)

for i in range(0, len(state_file_list)):
  # Load data
  as_data = asreview.data.ASReviewData(df = data)
  # Settings
  train_model = classifier_list[i]
  query_model = MaxQuery()
  balance_model = DoubleBalance()
  feature_model = Tfidf()

  # Start the review process
  reviewer_full_schoot = asreview.ReviewSimulate(
      as_data, 
      model=train_model,
      query_model=query_model,
      balance_model=balance_model,
      feature_model=feature_model,
      n_instances=10,
      init_seed=10,
      n_prior_included=1,
      n_prior_excluded=1,
      state_file=state_file_list[i]
  )
  reviewer_full_schoot.review()


time: 48min 13s (started: 2022-06-29 13:02:51 +00:00)


In [None]:
# TF-IDF - run on NN2
import random 
random.seed(10)

# Load data
as_data = asreview.data.ASReviewData(df = data)
# Settings
train_model = NN2LayerClassifier()
query_model = MaxQuery()
balance_model = DoubleBalance()
feature_model = Tfidf()

# Start the review process
reviewer_full_schoot = asreview.ReviewSimulate(
    as_data, 
    model=train_model,
    query_model=query_model,
    balance_model=balance_model,
    feature_model=feature_model,
    n_instances=10,
    init_seed=10,
    n_prior_included=1,
    n_prior_excluded=1,
    state_file="schoot_tfidf_NN2.h5"
)
reviewer_full_schoot.review()


time: 3h 6min 41s (started: 2022-06-29 09:23:59 +00:00)


In [None]:
# metrics for tf-idf feature extractor (for all classifiers but NN2) 
!asreview stat "/content/schoot_tfidf_SVM.h5"
!asreview stat "/content/schoot_tfidf_LR.h5"
!asreview stat "/content/schoot_tfidf_RF.h5"
!asreview stat "/content/schoot_tfidf_NB.h5"

************  schoot_tfidf_SVM.h5  *******

-----------  general  -----------
Number of runs            : 1
Number of papers          : 6189
Number of included papers : 43
Number of excluded papers : 6146
Number of unlabeled papers: 0
Number of queries         : 620

-----------  settings  -----------
data_name         : empty
model             : svm
query_strategy    : max
balance_strategy  : double
feature_extraction: tfidf
n_instances       : 10
n_prior_included  : 1
n_prior_excluded  : 1
mode              : simulate
model_param       : {'gamma': 'auto', 'class_weight': 0.249, 'C': 15.4, 'kernel': 'linear'}
query_param       : {}
feature_param     : {'ngram_max': 1, 'stop_words': 'english', 'split_ta': 0, 'use_keywords': 0}
balance_param     : {'a': 2.155, 'alpha': 0.94, 'b': 0.789, 'beta': 1.0}

-----------    ATD    -----------
 0.0258

Time to discovery:

    row   : value
    5054  : 26.0
    3053  : 28.0
    4938  : 36.0
    1425  : 38.0
    2472  : 39.0
    4939  : 42.0
    43

In [None]:
# metrics for tf-idf with NN2
!asreview stat "/content/schoot_tfidf_NN2.h5"

************  schoot_tfidf_NN2.h5  *******

-----------  general  -----------
Number of runs            : 1
Number of papers          : 6189
Number of included papers : 43
Number of excluded papers : 6146
Number of unlabeled papers: 0
Number of queries         : 620

-----------  settings  -----------
data_name         : empty
model             : nn-2-layer
query_strategy    : max
balance_strategy  : double
feature_extraction: tfidf
n_instances       : 10
n_prior_included  : 1
n_prior_excluded  : 1
mode              : simulate
model_param       : {'dense_width': 128, 'optimizer': 'rmsprop', 'learn_rate': 1.0, 'regularization': 0.01, 'verbose': 0, 'epochs': 35, 'batch_size': 32, 'shuffle': False, 'class_weight': 30.0}
query_param       : {}
feature_param     : {'ngram_max': 1, 'stop_words': 'english', 'split_ta': 0, 'use_keywords': 0}
balance_param     : {'a': 2.155, 'alpha': 0.94, 'b': 0.789, 'beta': 1.0}

-----------    ATD    -----------
 0.0262

Time to discovery:

    row   : value

# (4) Feature Extractor: Doc2Vec
The ASReview implementation of Doc2Vec (uses the gensim library): https://github.com/asreview/asreview/blob/master/asreview/models/feature_extraction/doc2vec.py

Running the Doc2Vec model with the SVM, Logistic Regression, Random Forest, and NN2 classifiers: 
- Note: Doc2Vec extends Word2Vec from word-level embeddings to embeddings of whole documents

In [None]:
# list of state file names
state_file_list = ["schoot_doc2vec_SVM.h5", "schoot_doc2vec_LR.h5","schoot_doc2vec_RF.h5"]
# list of classifiers
classifier_list = [SVMClassifier(), LogisticClassifier(), RandomForestClassifier()]

time: 1.92 ms (started: 2022-06-29 20:40:37 +00:00)


In [None]:
# Doc2Vec - run on all classifiers (except NN2)
import random 
random.seed(10)

for i in range(0, len(state_file_list)):
  # Load data
  as_data = asreview.data.ASReviewData(df = data)
  # Settings
  train_model = classifier_list[i]
  query_model = MaxQuery()
  balance_model = DoubleBalance()
  feature_model = Doc2Vec()

  # Start the review process
  reviewer_full_schoot = asreview.ReviewSimulate(
      as_data, 
      model=train_model,
      query_model=query_model,
      balance_model=balance_model,
      feature_model=feature_model,
      n_instances=10,
      init_seed=10,
      n_prior_included=1,
      n_prior_excluded=1,
      state_file=state_file_list[i]
  )
  reviewer_full_schoot.review()


time: 1h 45min 32s (started: 2022-06-29 20:40:39 +00:00)


In [None]:
# metrics for doc2vec feature extractor (for all but NN2 classifiers) 
!asreview stat "/content/schoot_doc2vec_SVM.h5"
!asreview stat "/content/schoot_doc2vec_LR.h5"
!asreview stat "/content/schoot_doc2vec_RF.h5"

************  schoot_doc2vec_SVM.h5  *****

-----------  general  -----------
Number of runs            : 1
Number of papers          : 6189
Number of included papers : 43
Number of excluded papers : 6146
Number of unlabeled papers: 0
Number of queries         : 620

-----------  settings  -----------
data_name         : empty
model             : svm
query_strategy    : max
balance_strategy  : double
feature_extraction: doc2vec
n_instances       : 10
n_prior_included  : 1
n_prior_excluded  : 1
mode              : simulate
model_param       : {'gamma': 'auto', 'class_weight': 0.249, 'C': 15.4, 'kernel': 'linear'}
query_param       : {}
feature_param     : {'vector_size': 40, 'epochs': 33, 'min_count': 1, 'n_jobs': 1, 'window': 7, 'dm_concat': 0, 'dm': 2, 'dbow_words': 0, 'split_ta': 0, 'use_keywords': 0}
balance_param     : {'a': 2.155, 'alpha': 0.94, 'b': 0.789, 'beta': 1.0}

-----------    ATD    -----------
 0.0416

Time to discovery:

    row   : value
    675   : 47.0
    284   : 4

In [15]:
# Doc2Vec - run on NN2 classifier
import random 
random.seed(10)  

# Load data
as_data = asreview.data.ASReviewData(df = data)
# Settings
train_model = NN2LayerClassifier()
query_model = MaxQuery()
balance_model = DoubleBalance()
feature_model = Doc2Vec()

# Start the review process
reviewer_full_schoot = asreview.ReviewSimulate(
    as_data, 
    model=train_model,
    query_model=query_model,
    balance_model=balance_model,
    feature_model=feature_model,
    n_instances=10,
    init_seed=10,
    n_prior_included=1,
    n_prior_excluded=1,
    state_file="schoot_doc2vec_NN2.h5"
)
reviewer_full_schoot.review()


time: 1h 45min 40s (started: 2022-06-30 13:25:42 +00:00)
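The runtimes reported by autotime throughout this notebook are easier to compare once they are converted to a common unit. The sketch below parses the duration strings recorded above and lists the runs from fastest to slowest. The helper `autotime_to_seconds` is hypothetical and only handles the `h`/`min`/`s` format shown here. Keep in mind that the loop runs cover several classifiers each and that some runs include model download time, so the figures are only a rough guide.

```python
import re

def autotime_to_seconds(text):
    """Convert strings such as '1h 47min 18s' or '35min 52s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)h\s*)?(?:(\d+)min\s*)?(?:(\d+)s)?", text.strip())
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognised duration: {text!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds

# Durations copied from the cell outputs in this notebook
runs = {
    "RoBERTa-base, SVM+LR+RF": "35min 52s",
    "RoBERTa-base, NN2": "1h 47min 18s",
    "DistilRoBERTa, SVM+LR+RF": "35min 16s",
    "DistilRoBERTa, NN2": "1h 46min 59s",
    "Tf-idf, SVM+LR+RF+NB": "48min 13s",
    "Tf-idf, NN2": "3h 6min 41s",
    "Doc2Vec, SVM+LR+RF": "1h 45min 32s",
    "Doc2Vec, NN2": "1h 45min 40s",
}

for name, text in sorted(runs.items(), key=lambda kv: autotime_to_seconds(kv[1])):
    print(f"{name:28s} {autotime_to_seconds(text) / 60:6.1f} min")
```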


In [16]:
# metrics for NN2 classifier
!asreview stat "/content/schoot_doc2vec_NN2.h5"

************  schoot_doc2vec_NN2.h5  *****

-----------  general  -----------
Number of runs            : 1
Number of papers          : 6189
Number of included papers : 43
Number of excluded papers : 6146
Number of unlabeled papers: 0
Number of queries         : 620

-----------  settings  -----------
data_name         : empty
model             : nn-2-layer
query_strategy    : max
balance_strategy  : double
feature_extraction: doc2vec
n_instances       : 10
n_prior_included  : 1
n_prior_excluded  : 1
mode              : simulate
model_param       : {'dense_width': 128, 'optimizer': 'rmsprop', 'learn_rate': 1.0, 'regularization': 0.01, 'verbose': 0, 'epochs': 35, 'batch_size': 32, 'shuffle': False, 'class_weight': 30.0}
query_param       : {}
feature_param     : {'vector_size': 40, 'epochs': 33, 'min_count': 1, 'n_jobs': 1, 'window': 7, 'dm_concat': 0, 'dm': 2, 'dbow_words': 0, 'split_ta': 0, 'use_keywords': 0}
balance_param     : {'a': 2.155, 'alpha': 0.94, 'b': 0.789, 'beta': 1.0}

--

# (5) Feature Extractor: Allenai-specter

- Scientific texts were part of the corpus the model was trained on
- The research paper for the SPECTER model: https://arxiv.org/abs/2004.07180
- Implemented using the Hugging Face sentence transformers library; the full model name on Hugging Face is "sentence-transformers/allenai-specter": https://huggingface.co/sentence-transformers/allenai-specter
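
As a rough illustration of what a sentence-transformer feature extractor does with the records, the sketch below builds one text per record from title and abstract and passes the list to an encoder. The helper names (`build_texts`, `embed_records`, `load_specter`) and the way title and abstract are joined are assumptions for illustration, not ASReview's own implementation; `SentenceTransformer(...).encode(...)` is the standard sentence-transformers call.

```python
def build_texts(titles, abstracts):
    """Join title and abstract into one text per record (assumed layout)."""
    return [f"{title} {abstract}".strip()
            for title, abstract in zip(titles, abstracts)]


def embed_records(titles, abstracts, encoder):
    """Return one embedding per record.

    `encoder` is any object with an `encode(list_of_str)` method,
    for example a SentenceTransformer instance.
    """
    return encoder.encode(build_texts(titles, abstracts))


def load_specter():
    """Load SPECTER lazily; requires `pip install sentence-transformers`."""
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer("sentence-transformers/allenai-specter")
```

With the library installed, `embed_records(titles, abstracts, load_specter())` should return one embedding vector per record, which is the kind of feature matrix the classifiers are trained on.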


Running the allenai-specter model with the SVM, Logistic Regression, and Random Forest classifiers first; the NN2 classifier is run in a separate cell:

In [None]:
# list of state file names
state_file_list = ["schoot_specter_SVM.h5", "schoot_specter_LR.h5","schoot_specter_RF.h5"]
# list of classifiers
classifier_list = [SVMClassifier(), LogisticClassifier(), RandomForestClassifier()]

time: 3.6 ms (started: 2022-06-29 16:42:15 +00:00)


In [None]:
# Allenai-specter - run on all classifiers except NN2
import random
random.seed(10)

# Pair each state file with its classifier and run one simulation per pair
for state_file, classifier in zip(state_file_list, classifier_list):
  # Load data
  as_data = asreview.data.ASReviewData(df=data)
  # Settings
  train_model = classifier
  query_model = MaxQuery()
  balance_model = DoubleBalance()
  feature_model = asreview.models.feature_extraction.SBERT(transformer_model="sentence-transformers/allenai-specter")

  # Start the review process
  reviewer_full_schoot = asreview.ReviewSimulate(
      as_data,
      model=train_model,
      query_model=query_model,
      balance_model=balance_model,
      feature_model=feature_model,
      n_instances=10,
      init_seed=10,
      n_prior_included=1,
      n_prior_excluded=1,
      state_file=state_file
  )
  reviewer_full_schoot.review()


time: 33min 19s (started: 2022-06-29 16:42:18 +00:00)


In [None]:
# metrics for SPECTER feature extractor (for all classifiers but NN2) 
!asreview stat "/content/schoot_specter_SVM.h5"
!asreview stat "/content/schoot_specter_LR.h5"
!asreview stat "/content/schoot_specter_RF.h5"

************  schoot_specter_SVM.h5  *****

-----------  general  -----------
Number of runs            : 1
Number of papers          : 6189
Number of included papers : 43
Number of excluded papers : 6146
Number of unlabeled papers: 0
Number of queries         : 619

-----------  settings  -----------
data_name         : empty
model             : svm
query_strategy    : max
balance_strategy  : double
feature_extraction: sbert
n_instances       : 10
n_prior_included  : 1
n_prior_excluded  : 1
mode              : simulate
model_param       : {'gamma': 'auto', 'class_weight': 0.249, 'C': 15.4, 'kernel': 'linear'}
query_param       : {}
feature_param     : {'transformer_model': 'sentence-transformers/allenai-specter', 'split_ta': 0, 'use_keywords': 0}
balance_param     : {'a': 2.155, 'alpha': 0.94, 'b': 0.789, 'beta': 1.0}

-----------    ATD    -----------
 0.0317

Time to discovery:

    row   : value
    896   : 24.0
    284   : 44.0
    5054  : 45.0
    719   : 46.0
    720   : 49.0
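
Since the comparison relies on reading the ATD off each `asreview stat` report, a small parser can collect those values in one place. The sketch below works on captured report text in the layout printed above; the function name `parse_stat_report` and the idea of capturing the report as a string are assumptions for illustration, not part of ASReview.

```python
import re


def parse_stat_report(text):
    """Map each state file name in a stat report to its model, feature
    extractor, and ATD (when present)."""
    results = {}
    current = None
    expect_atd = False
    for line in text.splitlines():
        stripped = line.strip()
        header = re.match(r"^\*+\s+(\S+\.h5)\s+\*+$", stripped)
        if header:
            # A new "****  file.h5  ****" banner starts a new report
            current = results.setdefault(header.group(1), {})
            expect_atd = False
        elif current is None or not stripped:
            continue
        elif re.match(r"^-+\s+ATD\s+-+$", stripped):
            # The ATD value is printed on the next non-empty line
            expect_atd = True
        elif expect_atd:
            expect_atd = False
            try:
                current["ATD"] = float(stripped)
            except ValueError:
                pass  # truncated report: leave ATD unset
        else:
            key, sep, value = stripped.partition(":")
            if sep and key.strip() in ("model", "feature_extraction"):
                current[key.strip()] = value.strip()
    return results
```

In a notebook, the report text can usually be captured with `lines = !asreview stat "file.h5"` and joined with `"\n".join(lines)` before passing it to the parser.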
  

In [15]:
# Allenai-specter - run on NN2
import random 
random.seed(10)

# Load data
as_data = asreview.data.ASReviewData(df = data)
# Settings
train_model = NN2LayerClassifier()
query_model = MaxQuery()
balance_model = DoubleBalance()
feature_model = asreview.models.feature_extraction.SBERT(transformer_model= "sentence-transformers/allenai-specter")

# Start the review process
reviewer_full_schoot = asreview.ReviewSimulate(
    as_data, 
    model=train_model,
    query_model=query_model,
    balance_model=balance_model,
    feature_model=feature_model,
    n_instances=10,
    init_seed=10,
    n_prior_included=1,
    n_prior_excluded=1,
    state_file="/content/schoot_specter_NN2.h5"
)
reviewer_full_schoot.review()

time: 1h 49min 3s (started: 2022-06-30 15:34:24 +00:00)


In [17]:
# metrics for SPECTER feature extractor with NN2
!asreview stat "/content/schoot_specter_NN2.h5"

************  schoot_specter_NN2.h5  *****

-----------  general  -----------
Number of runs            : 1
Number of papers          : 6189
Number of included papers : 43
Number of excluded papers : 6146
Number of unlabeled papers: 0
Number of queries         : 619

-----------  settings  -----------
data_name         : empty
model             : nn-2-layer
query_strategy    : max
balance_strategy  : double
feature_extraction: sbert
n_instances       : 10
n_prior_included  : 1
n_prior_excluded  : 1
mode              : simulate
model_param       : {'dense_width': 128, 'optimizer': 'rmsprop', 'learn_rate': 1.0, 'regularization': 0.01, 'verbose': 0, 'epochs': 35, 'batch_size': 32, 'shuffle': False, 'class_weight': 30.0}
query_param       : {}
feature_param     : {'transformer_model': 'sentence-transformers/allenai-specter', 'split_ta': 0, 'use_keywords': 0}
balance_param     : {'a': 2.155, 'alpha': 0.94, 'b': 0.789, 'beta': 1.0}

-----------    ATD    -----------
 0.0297

Time to discover

# (6) Feature Extractor: All-mpnet-base-v2 (ASReview default model) 
- The default SBERT-based model that ASReview currently uses: https://github.com/asreview/asreview/blob/master/asreview/models/feature_extraction/sbert.py
- Also implemented using the sentence transformers library on Hugging Face: https://huggingface.co/sentence-transformers/all-mpnet-base-v2

Running the all-mpnet-base-v2 model with the SVM, Logistic Regression, Random Forest, and NN2 classifiers:

In [None]:
# list of state file names
state_file_list = ["schoot_mpnet_SVM.h5", "schoot_mpnet_LR.h5","schoot_mpnet_RF.h5", "schoot_mpnet_NN2.h5"]
# list of classifiers
classifier_list = [SVMClassifier(), LogisticClassifier(), RandomForestClassifier(), NN2LayerClassifier()]

time: 1.93 ms (started: 2022-06-29 17:27:45 +00:00)


In [None]:
# MPNET - run on all classifiers
import random 
random.seed(10)

# Pair each state file with its classifier and run one simulation per pair
for state_file, classifier in zip(state_file_list, classifier_list):
  # Load data
  as_data = asreview.data.ASReviewData(df=data)
  # Settings
  train_model = classifier
  query_model = MaxQuery()
  balance_model = DoubleBalance()
  feature_model = asreview.models.feature_extraction.SBERT(transformer_model="sentence-transformers/all-mpnet-base-v2")

  # Start the review process
  reviewer_full_schoot = asreview.ReviewSimulate(
      as_data,
      model=train_model,
      query_model=query_model,
      balance_model=balance_model,
      feature_model=feature_model,
      n_instances=10,
      init_seed=10,
      n_prior_included=1,
      n_prior_excluded=1,
      state_file=state_file
  )
  reviewer_full_schoot.review()


time: 2h 28min 33s (started: 2022-06-29 17:27:54 +00:00)


In [None]:
# metrics for MPNET feature extractor (for all classifiers) 
!asreview stat "/content/schoot_mpnet_SVM.h5"
!asreview stat "/content/schoot_mpnet_LR.h5"
!asreview stat "/content/schoot_mpnet_RF.h5"
!asreview stat "/content/schoot_mpnet_NN2.h5"

************  schoot_mpnet_SVM.h5  *******

-----------  general  -----------
Number of runs            : 1
Number of papers          : 6189
Number of included papers : 43
Number of excluded papers : 6146
Number of unlabeled papers: 0
Number of queries         : 620

-----------  settings  -----------
data_name         : empty
model             : svm
query_strategy    : max
balance_strategy  : double
feature_extraction: sbert
n_instances       : 10
n_prior_included  : 1
n_prior_excluded  : 1
mode              : simulate
model_param       : {'gamma': 'auto', 'class_weight': 0.249, 'C': 15.4, 'kernel': 'linear'}
query_param       : {}
feature_param     : {'transformer_model': 'sentence-transformers/all-mpnet-base-v2', 'split_ta': 0, 'use_keywords': 0}
balance_param     : {'a': 2.155, 'alpha': 0.94, 'b': 0.789, 'beta': 1.0}

-----------    ATD    -----------
 0.0342

Time to discovery:

    row   : value
    5244  : 21.0
    4011  : 25.0
    1425  : 30.0
    3898  : 34.0
    675   : 36.0
