# Automated Systematic Review


## Getting Started

Create a pickle file with the data, labels and embedding layer with the
following shell command:  

``` bash
python src/data_prep.py --dataset=[ptsd]
```

### Passive learning
``` bash
python src/systematic_review_passive.py --training_size=[500] --init_included_papers=[10] --dataset=[ptsd]
```

Parameters:

    --training_size: Size of training dataset

    --init_included_papers: The number of initially included papers

    --dataset: Name of dataset
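
As an illustration of how these three flags could be parsed, here is a minimal sketch using `argparse`. It is not taken from `src/systematic_review_passive.py`; the function name `build_parser` is hypothetical, and the defaults are assumptions copied from the example values in the command above.

```python
import argparse


def build_parser():
    """Build a parser mirroring the three documented flags (hypothetical sketch)."""
    parser = argparse.ArgumentParser(description="Passive learning simulation (sketch)")
    # Defaults are assumptions taken from the example command, not from the script.
    parser.add_argument("--training_size", type=int, default=500,
                        help="Size of training dataset")
    parser.add_argument("--init_included_papers", type=int, default=10,
                        help="The number of initially included papers")
    parser.add_argument("--dataset", type=str, default="ptsd",
                        help="Name of dataset")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["--training_size=500", "--dataset=ptsd"])
print(args.training_size, args.init_included_papers, args.dataset)
```

Passing an explicit list to `parse_args` keeps the sketch runnable in a notebook or test; a real script would call `parse_args()` without arguments to read `sys.argv`.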


### Active learning 

- Define the prelabeled data points (papers). This step gives all query strategies the same starting point, which makes the comparison between them fair.

``` bash
select_init_indices.py --dataset=[ptsd] --init_included_papers=[10]
```

Parameters:

    --dataset: Name of dataset

    --init_included_papers: The number of initially included papers

- Run active learning script
``` bash
python src/systematic_review_active.py --dataset=[ptsd] --quota=[10] --init_included_papers=[10] \
    --batch_size=[20] --query_strategy=[lc]
```

Parameters:

     --dataset: Name of dataset
     
     --quota: The number of queries
     
     --init_included_papers: The number of initially included papers
     
     --batch_size: Batch size

     --query_strategy: The query strategy name. Currently random, lc, lcb and lcbmc are implemented.

     See: https://www.sciencedirect.com/science/article/pii/S1532046411001912
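
The two steps above (fixing a shared starting point, then querying in batches) can be sketched as follows. This is a simplified illustration, not the code in this repository: it assumes that `lc` stands for least-confidence sampling, the helper names `select_init_indices` and `least_confidence_query` are hypothetical, the `n_excluded` argument is an assumption, and the probabilities are made-up stand-ins for a trained model's output.

```python
import random


def select_init_indices(labels, n_included, n_excluded, seed=0):
    """Pick a reproducible set of prelabeled indices (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed: every query strategy starts from the same papers
    included = [i for i, y in enumerate(labels) if y == 1]
    excluded = [i for i, y in enumerate(labels) if y == 0]
    return sorted(rng.sample(included, n_included) + rng.sample(excluded, n_excluded))


def least_confidence_query(probabilities, pool_indices, batch_size):
    """Return the pool indices the model is least confident about."""
    # For a binary classifier, confidence is max(p, 1 - p);
    # papers with the smallest confidence are queried first.
    ranked = sorted(pool_indices,
                    key=lambda i: max(probabilities[i], 1 - probabilities[i]))
    return ranked[:batch_size]


labels = [1, 0, 0, 1, 0, 0, 1, 0]
# Made-up inclusion probabilities standing in for model predictions.
probabilities = [0.9, 0.1, 0.45, 0.8, 0.55, 0.05, 0.6, 0.3]

init = select_init_indices(labels, n_included=1, n_excluded=1, seed=42)
pool = [i for i in range(len(labels)) if i not in init]
batch = least_confidence_query(probabilities, pool, batch_size=2)
print(init, batch)
```

In a full simulation, the queried batch would be labeled, added to the training set, and the model retrained, repeating for `--quota` rounds.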


## Run simulations on HPC


### STEP 0: Set up the SurfSara configuration

This section describes how to run a supervised learning simulation
on the SurfSara HPC infrastructure. Start by loading the required modules:


``` bash
module load eb
module load R
module load python/3.5.0-intel
```

### STEP 1: generate batch files

To generate the batch files for active learning, use the following command:

``` bash
Rscript hpc/make_sr_lstm_batch_active.R [DATASET_NAME] 
```

For passive learning, use the following command:

``` bash
Rscript hpc/make_sr_lstm_batch_passive.R [DATASET_NAME] 
```

Working example: 

``` bash
Rscript hpc/make_sr_lstm_batch_active.R ptsd 
```


### STEP 2: prepare datasets [Locally]

To speed up the computations on the HPC, several Python objects are generated
beforehand and stored in a pickle file. Each core on the HPC cluster can then
load the objects directly from this file instead of rebuilding them.

Create a pickle file with the data, labels and embedding layer with the
following shell command:  

``` bash
python src/data_prep.py --dataset=ptsd
```

For active learning, also select the prelabeled papers:

``` bash
select_init_indices.py --dataset=[ptsd] --init_included_papers=[10]
```

Run these commands locally so that you do not have to upload the entire word
embedding (`wiki.vec`) to the HPC cluster. Afterwards, upload the `data_tmp` folder to the cluster.
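
The idea of preparing the objects once and loading them on every core can be sketched with the standard `pickle` module. This is a minimal illustration and not the contents of `src/data_prep.py`: the helper names, the dictionary keys, the file name `ptsd.pkl`, and the toy data are all assumptions.

```python
import os
import pickle
import tempfile


def save_prepared(path, data, labels, embedding_layer):
    """Store the prepared objects together in a single pickle file."""
    with open(path, "wb") as f:
        pickle.dump(
            {"data": data, "labels": labels, "embedding_layer": embedding_layer},
            f,
            protocol=pickle.HIGHEST_PROTOCOL,
        )


def load_prepared(path):
    """Load the prepared objects back, as each worker on the cluster would."""
    with open(path, "rb") as f:
        obj = pickle.load(f)
    return obj["data"], obj["labels"], obj["embedding_layer"]


# Toy stand-ins for the real tokenized texts, labels and embedding matrix.
data = [[1, 2, 3], [4, 5, 0]]
labels = [1, 0]
embedding_layer = [[0.0, 0.0], [0.1, 0.2]]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "ptsd.pkl")  # hypothetical file name
    save_prepared(path, data, labels, embedding_layer)
    restored = load_prepared(path)

print(restored[1])
```

Because the expensive step (reading the full word embedding) happens only when the file is written, the cluster only needs the much smaller prepared file.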


### STEP 3: start simulation

Submit the jobs with: 

Passive learning:

```bash
source batch_files/passive/[dataset_name]/submit_[dataset_name].sh
```

Active learning:

```bash
source batch_files/active_learning/[dataset_name]/submit_[dataset_name].sh
```


Check the status of the job:

```bash
squeue -j [JOB_ID]
```
