# Automated Systematic Review

## Introduction
### Problem statement and project goals
The aim of this project is to develop and evaluate models that reduce the time, and therefore the cost, of conducting systematic reviews. Various tools already automate parts of the systematic review process. While these tools mainly apply classical models, our project explores deep learning models.

Another goal of this project is to apply active learning methods and compare them with the classical random approach. Active learning is an iterative form of supervised learning that, over multiple iterations, selects the most informative data points to be labeled by an expert. In scenarios where little labeled data is available and manual labeling is expensive, active learning is expected to perform better than the random approach.

We would like to validate our model on several datasets of different sizes and with different rates of included papers, rather than relying on a single dataset.

### Approach
Our approach is based on the following:
- Apply title and abstract screening.
- Build models with word embeddings pre-trained on Wikipedia.
- Start training models with a defined number of initially included papers.
- Run models on High Performance Computing (HPC) systems.


### Dataset
Our model is evaluated on the following datasets:
- Systematic review on post-traumatic stress disorder (van de Schoot et al., 2018)
- Seven systematic drug class reviews (Cohen et al., 2006)
- Systematic review on depression (Cuijpers et al., 2018)

### Modeling
In this project we developed models using deep learning algorithms and then compared them to classical algorithms based on the following evaluation criteria.    

#### Evaluation criteria
- Minimize the number of relevant papers that are missed, i.e. false negatives ≤ 5.
- Minimize the total number of papers to read, including both seen and selected papers.
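
One way to combine these two criteria into a single number is to rank the papers by predicted relevance and count how many must be read from the top of the ranking before at most five relevant papers remain unread. The sketch below illustrates that reading of the criteria. The function name `papers_to_read` and the toy scores are assumptions made for illustration and are not part of the project code.

```python
def papers_to_read(scores, labels, max_missed=5):
    """Return how many top-ranked papers must be read so that at most
    `max_missed` relevant papers (label 1) remain unread."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")

    total_relevant = sum(labels)
    if total_relevant <= max_missed:
        return 0  # the criterion is already met without reading anything

    # Rank paper indices from most to least likely relevant.
    ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    found = 0
    for n_read, idx in enumerate(ranking, start=1):
        found += labels[idx]
        if total_relevant - found <= max_missed:
            return n_read
    return len(ranking)


# Toy example (hypothetical scores): three relevant papers out of six.
scores = [0.9, 0.1, 0.8, 0.3, 0.7, 0.2]
labels = [1, 0, 1, 1, 0, 0]
print(papers_to_read(scores, labels, max_missed=1))  # 2
print(papers_to_read(scores, labels, max_missed=0))  # 4
```

A lower value means that fewer papers have to be screened by hand for the same tolerance of missed papers.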

#### Deep Learning algorithms
In this project we explored two types of deep learning algorithms which are commonly applied to various NLP tasks.

1. Long Short-Term Memory (LSTM), forward and backward
2. Convolutional Neural Network (CNN)

These models were built with word embeddings pre-trained on Wikipedia and tested with various hyperparameter settings, such as different numbers of epochs and batches, dropout rates, and optimization algorithms. We evaluated the models against the criteria above. Of these algorithms, the backward LSTM gave the best results.

#### Classical algorithms
We compared our deep learning model with Rayyan, a web and mobile app for systematic reviews that uses a Support Vector Machine (SVM) as its classifier.

### Paper selection methods
There are different ways of selecting papers to present to researchers for labeling. The following are two approaches that we applied.

#### Passive Learning
Papers are selected at random from the dataset.
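
As a rough illustration of random selection, the sketch below draws a training set of a given size that contains a fixed number of included papers, mirroring the `training_size` and `init_included_papers` arguments that appear in the code overview. The function `random_split` is hypothetical and is not the project's `split_data` implementation; it assumes labels are encoded as 1 (included) and 0 (excluded).

```python
import random


def random_split(labels, training_size, init_included, seed=0):
    """Randomly pick `training_size` paper indices for training, exactly
    `init_included` of which are included papers (label 1).

    Returns (train_indices, remaining_indices)."""
    rng = random.Random(seed)
    included = [i for i, y in enumerate(labels) if y == 1]
    excluded = [i for i, y in enumerate(labels) if y == 0]

    n_excluded = training_size - init_included
    if (init_included < 0 or n_excluded < 0
            or init_included > len(included) or n_excluded > len(excluded)):
        raise ValueError("not enough papers for the requested split")

    # Sample included and excluded papers separately, without replacement.
    train = rng.sample(included, init_included) + rng.sample(excluded, n_excluded)
    train_set = set(train)
    rest = [i for i in range(len(labels)) if i not in train_set]
    return sorted(train), rest


labels = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
train_idx, rest_idx = random_split(labels, training_size=5, init_included=2, seed=42)
print(len(train_idx), len(rest_idx))  # 5 5
```

Fixing the seed makes the split reproducible, which helps when comparing the random approach against active learning over repeated runs.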

#### Active Learning
Papers are selected for labeling based on a query strategy. A query strategy chooses the most informative papers, so that the model gains as much as possible from each new label when little labeled data is available.
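
One common query strategy is least-confidence uncertainty sampling, which corresponds to the `'lc'` option that appears in the code overview. The sketch below shows the idea in plain Python. The function `least_confidence_query` and the example probabilities are hypothetical and do not reproduce the query strategy classes used in the project.

```python
def least_confidence_query(probabilities, unlabeled_indices, batch_size):
    """Return the `batch_size` unlabeled papers whose most likely class has
    the lowest predicted probability (least-confidence sampling)."""
    # Confidence = probability assigned to the predicted (most likely) class.
    confidence = {i: max(probabilities[i]) for i in unlabeled_indices}
    # Least confident first; ties are broken by paper index for determinism.
    ranked = sorted(unlabeled_indices, key=lambda i: (confidence[i], i))
    return ranked[:batch_size]


# Hypothetical class probabilities [excluded, included] for five papers.
probabilities = [
    [0.95, 0.05],
    [0.55, 0.45],
    [0.20, 0.80],
    [0.50, 0.50],
    [0.70, 0.30],
]
print(least_confidence_query(probabilities, [0, 1, 2, 3, 4], batch_size=2))  # [3, 1]
```

Papers 3 and 1 are chosen because the model is closest to undecided about them, so their labels are expected to be the most informative.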


## Code Overview

### Step 1: Load the dataset

    texts, labels = load_data(dataset_name)

### Step 2: Preprocess the texts

    # Tokenize the texts.
    data, word_index = textManager.make_sequences(texts)

    # Create the embedding layer from pre-trained word vectors (transfer learning).
    embedding = Word2VecEmbedding(word_index, max_num_words, max_sequence_length)
    embedding.load_word2vec_data(GLOVE_PATH)
    embedding_layer = embedding.build_embedding()
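
The transfer-learning step can be pictured as filling an embedding matrix row by row: row `i` holds the pre-trained vector of the word with index `i`, and words without a pre-trained vector keep a zero row. The sketch below is an assumption about what a class such as `Word2VecEmbedding` might do internally, not its actual code. The helpers `parse_vectors` and `build_embedding_matrix` are hypothetical, and the input format (one word followed by its vector components per line) is assumed. Plain lists are used to keep the example dependency-free; a NumPy array would be the usual choice in practice.

```python
def parse_vectors(lines):
    """Parse lines of the form 'word v1 v2 ... vd' into {word: [floats]}."""
    vectors = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors


def build_embedding_matrix(word_index, vectors, max_num_words, dim):
    """Return the embedding matrix as a list of rows; row i is the vector of
    the word with index i. Index 0 is assumed to be reserved for padding."""
    n_rows = min(max_num_words, len(word_index) + 1)
    matrix = [[0.0] * dim for _ in range(n_rows)]
    for word, idx in word_index.items():
        vector = vectors.get(word)
        # Keep the zero row for unknown words and for out-of-range indices.
        if vector is not None and idx < n_rows and len(vector) == dim:
            matrix[idx] = list(vector)
    return matrix


vectors = parse_vectors(["cat 0.1 0.2", "dog 0.3 0.4"])
word_index = {"cat": 1, "bird": 2, "dog": 3}
matrix = build_embedding_matrix(word_index, vectors, max_num_words=10, dim=2)
print(matrix)  # [[0.0, 0.0], [0.1, 0.2], [0.0, 0.0], [0.3, 0.4]]
```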

### Step 3: Build the model

    from keras.layers import Dense, Input, LSTM
    from keras.models import Model

    sequence_input = Input(shape=(max_sequence_length,), dtype='int32')
    embedded_sequences = embedding_layer(sequence_input)

    # A single LSTM layer; go_backwards=True reads each sequence in reverse.
    x = LSTM(10, go_backwards=backwards, dropout=dropout)(embedded_sequences)
    x = Dense(128, activation='relu')(x)
    # Softmax output over the two classes.
    output = Dense(2, activation='softmax')(x)

    model_lstm = Model(inputs=sequence_input, outputs=output)
    model_lstm.compile(
        loss='binary_crossentropy', optimizer=optimizer, metrics=['acc'])
    model_lstm.summary()

### Step 4: Split the dataset into training and test sets
### Step 5: Train the model

Passive learning, where papers are selected randomly:

    x_train, x_val, y_train, y_val = split_data(
        data, labels, training_size, init_included_papers)

    model = LSTM_Model(...)

    # Train the model and predict on the held-out papers.
    model.train(x_train, y_train)
    pred = model.prediction(x_val)

Or active learning, where papers are selected by a query strategy:

    prelabeled_index = select_prelabeled(labels, init_included_papers)
    pool = make_pool(data, labels, prelabeled=prelabeled_index)

    model = LSTM_Model(...)
    init_weights = model._model.get_weights()

    query_i = 0  # query-round counter (assumed initialization)
    while query_i <= args.quota:
        # Train the model on the pool.
        model.train(pool)

        # Predict the labels of the unlabeled entries in the pool.
        idx, features = pool.get_unlabeled_entries()
        pred = model.predict(features)

        # Build the query strategy.
        if args.query_strategy == 'lc':
            qs = UncertaintySampling(pool, method='lc', model=model)
        elif args.query_strategy == 'random':
            qs = RandomSampling(pool)

        # Query a batch of papers and add their labels to the pool.
        ask_id = qs.make_query(n=args.batch_size)
        for i in ask_id:
            lb = int(labels[i][1])
            pool.update(i, lb)

        # Reset the model to its initial weights before retraining.
        model._model.set_weights(init_weights)

        query_i += 1  # assumed: one increment per query round
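
The active learning loop above depends on a pool object that keeps track of which papers have been labeled so far. The sketch below shows one way such a pool could behave. `SimplePool` is a hypothetical stand-in written for illustration; it only mirrors the `update` and `get_unlabeled_entries` calls used in the loop and is not the object returned by `make_pool`.

```python
class SimplePool:
    """Minimal pool of papers that tracks which ones are labeled."""

    def __init__(self, features, prelabeled=None):
        self.features = list(features)
        # None marks a paper that has not been labeled yet.
        self.labels = [None] * len(self.features)
        for idx, label in (prelabeled or {}).items():
            self.labels[idx] = label

    def update(self, idx, label):
        """Store the label that the reviewer assigned to paper `idx`."""
        self.labels[idx] = label

    def get_unlabeled_entries(self):
        """Return (indices, features) of all papers without a label."""
        idx = [i for i, y in enumerate(self.labels) if y is None]
        return idx, [self.features[i] for i in idx]

    def get_labeled_entries(self):
        """Return (features, labels) of all papers that have a label."""
        idx = [i for i, y in enumerate(self.labels) if y is not None]
        return [self.features[i] for i in idx], [self.labels[i] for i in idx]


pool = SimplePool(["paper a", "paper b", "paper c"], prelabeled={0: 1})
print(pool.get_unlabeled_entries())  # ([1, 2], ['paper b', 'paper c'])
pool.update(2, 0)
print(pool.get_unlabeled_entries())  # ([1], ['paper b'])
```

Each query round shrinks the unlabeled part of the pool by the batch size, so the loop ends either when the quota is reached or when no unlabeled papers remain.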


### Step 6: Store the results

    import os

    # Save the results to a CSV file, creating the output directory if needed.
    os.makedirs(output_dir, exist_ok=True)
    export_path = os.path.join(
        output_dir, 'dataset_{}_systematic_review_active{}.csv'.format(
            args.dataset, args.T))

    result_df.to_csv(export_path)
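
For readers who want to try the export step without pandas, the following sketch writes one row per query round using only the standard library. The function `save_results`, its arguments, and the column names in the example are hypothetical. Only the file name pattern is taken from the snippet above.

```python
import csv
import os


def save_results(rows, output_dir, dataset, trial):
    """Write a list of dicts to a CSV file and return the file path."""
    if not rows:
        raise ValueError("rows must contain at least one result")

    # Create the output directory if it does not exist yet.
    os.makedirs(output_dir, exist_ok=True)
    export_path = os.path.join(
        output_dir,
        'dataset_{}_systematic_review_active{}.csv'.format(dataset, trial))

    with open(export_path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return export_path
```

A call such as `save_results([{"query": 1, "recall": 0.5}], "output", "ptsd", 1)` would create `output/dataset_ptsd_systematic_review_active1.csv`; the dataset name and values here are placeholders.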
