<a href="https://colab.research.google.com/github/esnue/ThesisAllocationSystem/blob/main/model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Model**

# Introduction

This notebook documents the steps taken to develop a multi-label text classification model based on Transformers and a pre-trained language model.

The goal of the model is to match students to supervisors by comparing the content of the students' thesis proposals with the content of the supervisors' academic papers.

Using academic papers as training data follows the approach of the similar [Toronto Paper Matching System](https://www.cs.toronto.edu/~zemel/documents/tpms.pdf). Instead of matching students to professors, the Toronto Paper Matching System assigns peer reviewers to submitted papers based on the reviewers' own academic papers.

While thesis proposals tend to share a coherent structure, academic papers vary widely in structure, depending on the layout that the publisher demands. Restricting ourselves to one or a few journals could have made the cleaning process, and possibly the model training, easier, but it would also have considerably reduced the size of the training data.

Our multi-label text classification method is based on Transformers and pre-trained language models. The pre-trained model is [DistilBERT](https://arxiv.org/abs/1910.01108). The [Transformers](https://huggingface.co/transformers/) library is pinned to version 3.0.2 by the install cell in the Requirements section.

We fine-tune a pre-trained DistilBERT model for multi-label text classification, a common task in which a given document can be assigned to one or more categories. This mirrors our use case, where a thesis proposal may well be allocated to more than one research area, given the interdisciplinary nature and overlap of research areas between chairs.
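
To make the multi-label setting concrete, the sketch below encodes a proposal that belongs to two research areas as a multi-hot vector, which is the target format of the `labels` column used later in this notebook. The area names and the helper `to_multi_hot` are hypothetical placeholders, not the 29 categories of the actual dataset.

```python
# Hypothetical research areas, for illustration only
AREAS = ["governance", "economics", "data_science", "law"]

def to_multi_hot(assigned, areas=AREAS):
    """Return a 0/1 list with a 1 at the position of every assigned area."""
    assigned = set(assigned)
    return [1 if area in assigned else 0 for area in areas]

# A proposal that sits between two areas gets two active labels
target = to_multi_hot(["economics", "data_science"])
print(target)  # [0, 1, 1, 0]
```

A single-label classifier would have to pick one of the two areas; the multi-hot target keeps both.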

Before you run the notebook, make sure that the required modules are installed and the dataset is available.

## **Requirements**

In [18]:
! pip install transformers==3.0.2



In [19]:
# Import requirements
import warnings
warnings.simplefilter('ignore')
import numpy as np
import pandas as pd
from tqdm import tqdm
from sklearn import metrics
import transformers
import torch
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler
from transformers import DistilBertTokenizer, DistilBertModel
import logging
logging.basicConfig(level=logging.ERROR)

In [20]:
# Setting up the device for GPU usage

from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'

In [21]:
def hamming_score(y_true, y_pred, normalize=True, sample_weight=None):
    """Mean per-sample Jaccard overlap between true and predicted label sets.

    `normalize` and `sample_weight` are currently not used.
    """
    acc_list = []
    for i in range(y_true.shape[0]):
        # Indices of the labels that are switched on in row i
        set_true = set(np.where(y_true[i])[0])
        set_pred = set(np.where(y_pred[i])[0])
        if len(set_true) == 0 and len(set_pred) == 0:
            # Both sets empty: count as a perfect match
            tmp_a = 1
        else:
            tmp_a = len(set_true & set_pred) / float(len(set_true | set_pred))
        acc_list.append(tmp_a)
    return np.mean(acc_list)

<a id='section02'></a>
### Importing and Pre-Processing the domain data

The required data set to run this script is `train-papers-label.csv`. It can be downloaded [here](https://drive.google.com/file/d/1-12x2qro_m9HqWUEwZU_l4njMJ6y1LoX/view?usp=sharing). Please make sure you've stored it on your GDrive or on your computer.

This section of the notebook performs the following tasks:

* Load the dataset `train-papers-label.csv`.
* Prepare the dataset for the DataLoader.

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Make sure that the list items in the `labels` column are read as integers, not as strings. Saving a data frame as CSV stores each list as its string representation, so the column has to be parsed again when the file is loaded.

In [22]:
# Load df; ast.literal_eval parses the stringified label lists safely
import ast
new_df = pd.read_csv('/content/drive/MyDrive/ThesisAllocationSystem/data_final/train-papers-label.csv', converters={'labels': ast.literal_eval})

# Rename df col
new_df.rename(columns={"content": "text"}, inplace = True)

# Check content & type
print(new_df.sample(10))
print(type(new_df['labels'].iloc[1]))
print(type(new_df['labels'].iloc[1][1]))

                                                  text                                             labels
298  b'LEQSPaper76\n\n\n \n\n \n\nLSE Europe in Que...  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
270  b"Microsoft Word - Social Origins an Overview-...  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
40   b'Scaling Policy Preferences from Coded Politi...  [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
0    b'1 \n \n\nCurry, D., Hammerschmid, G., Jilke,...  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
109  b'Emotionally Driven Robot Control Architectur...  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
768  b'Special Issue on the Economics of Crime:\nEd...  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
399  b'RSCAS 2017/26 From market integration to cor...  [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ...
134  b'ICON (2018), Vol. 16 No. 1, 128135 doi:10.10...  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ...
189  b'40 years of global environmental assess

In [23]:
# Check number of labels
print(len(new_df['labels'].iloc[1]))

29
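
Why the label lists need to be parsed at load time can be seen in a small round trip. The sketch below uses a tiny in-memory CSV with made-up rows (not the real dataset) and the standard `csv` module instead of pandas to stay self-contained; it parses with `ast.literal_eval`, which only accepts Python literals. All of these choices are assumptions made for illustration.

```python
import ast
import csv
import io

# Write two made-up rows; each list is stored as its string representation
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["text", "labels"])
writer.writerow(["first paper", [0, 1, 0]])
writer.writerow(["second paper", [1, 0, 1]])

# Read the rows back: every cell comes back as a plain string
buffer.seek(0)
rows = list(csv.DictReader(buffer))
raw = rows[0]["labels"]
print(type(raw), raw)        # <class 'str'> [0, 1, 0]

# Parse the string into a list of integers again
parsed = ast.literal_eval(raw)
print(type(parsed), parsed)  # <class 'list'> [0, 1, 0]
```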


<a id='section03'></a>
### Preparing the Dataset and Dataloader

First, we define some key variables used for training and fine-tuning later on. Next, we create a `MultiLabelDataset` class, which specifies how the text is pre-processed before it is sent to the neural network, and a DataLoader, which controls how many samples are sent to the network per batch.

Dataset and DataLoader are PyTorch constructs for defining and controlling the data pre-processing and its passage to the neural network. For further reading on Dataset and DataLoader, see the [docs at PyTorch](https://pytorch.org/docs/stable/data.html).

#### *MultiLabelDataset* Dataset Class
- This class accepts the `dataframe`, the `tokenizer` and `max_len` as input and generates the tokenized output and targets that the DistilBERT model uses for training.
- We are using the DistilBERT tokenizer to tokenize the data in the `text` column of the dataframe.
- The tokenizer's `encode_plus` method performs the tokenization and generates the necessary outputs, namely `ids`, `attention_mask` and `token_type_ids`.
- To read further into the tokenizer, [refer to this document](https://huggingface.co/transformers/model_doc/distilbert.html#distilberttokenizer).
- `targets` is the list of categories labeled as `0` or `1` in the dataframe.
- The *MultiLabelDataset* class is used to create 2 datasets, for training and for validation.
- *Training Dataset* is used to fine-tune the model: **currently 80% of the original data; at a later stage we need to include test data from a different source**.
- *Validation Dataset* is used to evaluate the performance of the model. The model has not seen this data during training.

#### Dataloader
- The DataLoader is used to create the training and validation loaders that feed data to the neural network in a defined manner. This is needed because all the data from the dataset cannot be loaded into memory at once, so the amount of data loaded into memory and then passed to the neural network needs to be controlled.
- This control is achieved with `batch_size`, which sets the number of samples per batch, together with `max_len`, which the dataset uses to cap the number of tokens per sample.
- The training and validation DataLoaders are used in the training and validation parts of the flow respectively.
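
As a rough illustration of what `batch_size` controls, the sketch below splits a range of sample indices into consecutive batches in plain Python. It is not how PyTorch's DataLoader is implemented (no shuffling, no tensors), and the helper `iter_batches` is a hypothetical name.

```python
import math

def iter_batches(n_samples, batch_size):
    """Yield lists of sample indices, batch_size at a time."""
    for start in range(0, n_samples, batch_size):
        yield list(range(start, min(start + batch_size, n_samples)))

batches = list(iter_batches(10, 4))
print(batches)       # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(len(batches))  # 3, the same as math.ceil(10 / 4)

# Number of batches for 649 samples with a batch size of 4
print(math.ceil(649 / 4))  # 163
```

The last batch is smaller whenever the number of samples is not a multiple of the batch size.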

In [24]:
# Configurations

# Defining key var
MAX_LEN = 128
TRAIN_BATCH_SIZE = 4
VALID_BATCH_SIZE = 4
EPOCHS = 3
LEARNING_RATE = 1e-05

# Import DistilBert Tokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased', truncation=True, do_lower_case=True)

In [25]:
class MultiLabelDataset(Dataset):

    def __init__(self, dataframe, tokenizer, max_len):
        self.tokenizer = tokenizer
        self.data = dataframe
        self.text = dataframe.text
        self.targets = self.data.labels
        self.max_len = max_len

    def __len__(self):
        return len(self.text)

    def __getitem__(self, index):
        text = str(self.text[index])
        text = " ".join(text.split())

        inputs = self.tokenizer.encode_plus(
            text,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            truncation=True,        # cut texts longer than max_len
            padding='max_length',   # pad shorter texts up to max_len
            return_token_type_ids=True
        )
        ids = inputs['input_ids']
        mask = inputs['attention_mask']
        token_type_ids = inputs["token_type_ids"]


        return {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'token_type_ids': torch.tensor(token_type_ids, dtype=torch.long),
            'targets': torch.tensor(self.targets[index], dtype=torch.float)
        }

In [26]:
# Creating the train dataset and dataloader for the neural network
train_size = 0.8
train_data=new_df.sample(frac=train_size,random_state=200)
test_data=new_df.drop(train_data.index).reset_index(drop=True)
train_data = train_data.reset_index(drop=True)

print(type(train_data['labels'].iloc[1][1]))

print("FULL Dataset: {}".format(new_df.shape))
print("TRAIN Dataset: {}".format(train_data.shape))
print("TEST Dataset: {}".format(test_data.shape))

training_set = MultiLabelDataset(train_data, tokenizer, MAX_LEN)
testing_set = MultiLabelDataset(test_data, tokenizer, MAX_LEN)

<class 'int'>
FULL Dataset: (811, 2)
TRAIN Dataset: (649, 2)
TEST Dataset: (162, 2)


In [27]:
train_params = {'batch_size': TRAIN_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

test_params = {'batch_size': VALID_BATCH_SIZE,
                'shuffle': False,  # sample order does not matter for evaluation
                'num_workers': 0
                }

training_loader = DataLoader(training_set, **train_params)
testing_loader = DataLoader(testing_set, **test_params)

<a id='section04'></a>
### Creating the Neural Network for Fine Tuning

#### Neural Network
 - We create a neural network with the `DistilBERTClass`.
 - This network consists of the pre-trained `DistilBERT` model, followed by a `Linear` pre-classifier layer with a `Tanh` activation, a `Dropout` layer and a final `Linear` layer. The dropout and the final linear layer are added for **regularization** and **classification** respectively.
 - In the forward pass, the first element of the `DistilBERT` output, `output_1[0]`, is the last hidden state. DistilBERT does not return a pooled output, so the hidden state of the first token is used as the representation of the whole sequence.
 - This representation is passed through the pre-classifier, the `Tanh` activation and the `Dropout` layer, and the result is given to the final `Linear` layer.
 - Note that the output dimension of the final `Linear` layer is **29**, because that is the total number of categories into which we classify the documents.
 - The data is fed to the `DistilBERTClass` as defined in the dataset.
 - The output of the final layer is used to calculate the loss and to determine the accuracy of the model's predictions.
 - We create an instance of the network called `model`. This instance is used for training and is then saved for future inference.
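
The shape flow through the layers on top of DistilBERT can be checked without downloading the pre-trained weights. The sketch below replaces the DistilBERT output with a random tensor of the same hidden size (768) and applies a head with the layer sizes used in the model cell of this notebook. `ClassificationHead` is a hypothetical name, and the random input is only a stand-in for real hidden states.

```python
import torch

class ClassificationHead(torch.nn.Module):
    """Linear -> Tanh -> Dropout -> Linear, applied to the first token."""

    def __init__(self, hidden_size=768, inner_size=256, num_labels=29):
        super().__init__()
        self.pre_classifier = torch.nn.Linear(hidden_size, inner_size)
        self.dropout = torch.nn.Dropout(0.1)
        self.classifier = torch.nn.Linear(inner_size, num_labels)

    def forward(self, hidden_state):
        first_token = hidden_state[:, 0]   # (batch, hidden_size)
        x = torch.tanh(self.pre_classifier(first_token))
        x = self.dropout(x)
        return self.classifier(x)          # (batch, num_labels) logits

# Stand-in for the DistilBERT output: batch of 4, 128 tokens, 768 features
fake_hidden_state = torch.randn(4, 128, 768)
head = ClassificationHead()
logits = head(fake_hidden_state)
print(logits.shape)  # torch.Size([4, 29])
```

The head returns raw logits, one per category; no sigmoid is applied at this stage.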
 
#### Loss Function and Optimizer
 - The loss is defined in one of the following cells as `loss_fn`.
 - The loss function is Binary Cross Entropy combined with a sigmoid layer, which is implemented as [BCEWithLogitsLoss](https://pytorch.org/docs/stable/nn.html#bcewithlogitsloss) in PyTorch.
 - The `optimizer` is defined in one of the following cells as well.
 - The `optimizer` is used to update the weights of the neural network to improve its performance.
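
For a single label, the quantity computed by a sigmoid followed by binary cross entropy can be written out directly. The sketch below is a plain-Python version for one logit `x` and one target `y`, using the standard numerically stable rearrangement. The function names are hypothetical, and the code is meant as an illustration, not as a replacement for the PyTorch loss.

```python
import math

def bce_with_logits(x, y):
    """Binary cross entropy of sigmoid(x) against target y (0 or 1).

    Equivalent to -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(x),
    rearranged as max(x, 0) - x*y + log(1 + exp(-|x|)) to avoid overflow.
    """
    return max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))

def mean_bce_with_logits(logits, targets):
    """Average the per-label loss over all labels of one sample."""
    losses = [bce_with_logits(x, y) for x, y in zip(logits, targets)]
    return sum(losses) / len(losses)

# Logits of 0 (probability 0.5 for every label) give a loss of log(2)
print(round(bce_with_logits(0.0, 1.0), 4))  # 0.6931
```

A model whose logits are all close to zero therefore starts with a loss of roughly 0.69, whatever the targets are.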

In [28]:
# Creating the customized model, by adding a drop out and a dense layer on top of distil bert to get the final output for the model. 

class DistilBERTClass(torch.nn.Module):
    def __init__(self):
        super(DistilBERTClass, self).__init__()
        self.l1 = DistilBertModel.from_pretrained("distilbert-base-uncased")
        self.pre_classifier = torch.nn.Linear(768, 256)
        self.dropout = torch.nn.Dropout(0.1)
        self.classifier = torch.nn.Linear(256, 29)

    def forward(self, input_ids, attention_mask, token_type_ids):
        output_1 = self.l1(input_ids=input_ids, attention_mask=attention_mask)
        hidden_state = output_1[0]
        pooler = hidden_state[:, 0]
        pooler = self.pre_classifier(pooler)
        pooler = torch.nn.Tanh()(pooler)
        pooler = self.dropout(pooler)
        output = self.classifier(pooler)
        return output

model = DistilBERTClass()
model.to(device)

DistilBERTClass(
  (l1): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): Linear(in_featu

In [29]:
def loss_fn(outputs, targets):
    return torch.nn.BCEWithLogitsLoss()(outputs, targets)

In [30]:
optimizer = torch.optim.Adam(params =  model.parameters(), lr=LEARNING_RATE)

In [31]:
def train(epoch):
    model.train()
    for step, data in tqdm(enumerate(training_loader, 0)):
        ids = data['ids'].to(device, dtype=torch.long)
        mask = data['mask'].to(device, dtype=torch.long)
        token_type_ids = data['token_type_ids'].to(device, dtype=torch.long)
        targets = data['targets'].to(device, dtype=torch.float)

        outputs = model(ids, mask, token_type_ids)
        loss = loss_fn(outputs, targets)

        # An epoch has far fewer than 5000 batches, so only step 0 is logged
        if step % 5000 == 0:
            print(f'Epoch: {epoch}, Loss:  {loss.item()}')

        # Reset the gradients, backpropagate and update the weights
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

In [32]:
for epoch in range(EPOCHS):
    train(epoch)


Epoch: 0, Loss:  0.6889526844024658



Epoch: 1, Loss:  0.2861529588699341



Epoch: 2, Loss:  0.2279970943927765




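As we can see, the loss decreases from epoch to epoch but is still fairly high. Note that each printed value is the loss of the first batch of the epoch only, not an average over the epoch. This is something we will need to work on during the next weeks.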

<a id='section06'></a>
### Validating the Model

During the validation stage we pass the unseen data (the testing dataset) to the model. This step determines how well the model performs on unseen data.

**TO BE REVISED**
This unseen data is the 20% of `train-papers-label.csv` which was separated during the dataset creation stage.
During the validation stage the weights of the model are not updated. Only the final output is compared to the actual value. This comparison is then used to calculate the accuracy of the model.
**TO BE REVISED**

To get a measure of our model's performance, we use the following metrics:
- Hamming Score
- Hamming Loss
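
The difference between the two metrics is easiest to see on a toy example. The sketch below computes both for three made-up samples with four labels each, in plain Python. The helper names are hypothetical; the score follows the logic of the `hamming_score` function defined earlier, and the loss follows the usual definition of Hamming loss as the fraction of wrong label decisions.

```python
def hamming_loss_py(y_true, y_pred):
    """Fraction of individual label decisions that are wrong."""
    wrong = sum(t != p for row_t, row_p in zip(y_true, y_pred)
                for t, p in zip(row_t, row_p))
    total = sum(len(row) for row in y_true)
    return wrong / total

def hamming_score_py(y_true, y_pred):
    """Mean Jaccard overlap between true and predicted label sets."""
    scores = []
    for row_t, row_p in zip(y_true, y_pred):
        true_set = {i for i, v in enumerate(row_t) if v}
        pred_set = {i for i, v in enumerate(row_p) if v}
        if not true_set and not pred_set:
            scores.append(1.0)  # both empty counts as a perfect match
        else:
            scores.append(len(true_set & pred_set) / len(true_set | pred_set))
    return sum(scores) / len(scores)

y_true = [[1, 0, 0, 0], [0, 1, 1, 0], [0, 0, 0, 1]]
y_pred = [[0, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0]]  # mostly "no label"

print(hamming_loss_py(y_true, y_pred))   # 0.25
print(hamming_score_py(y_true, y_pred))  # 0.16666666666666666
```

A model that predicts very few labels can thus get most of the individual 0/1 decisions right and still have a low Hamming score, because the score only rewards overlap between the predicted and the true label sets.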

In [33]:
def validation(testing_loader):
    model.eval()
    fin_targets=[]
    fin_outputs=[]
    with torch.no_grad():
        for _, data in tqdm(enumerate(testing_loader, 0)):
            ids = data['ids'].to(device, dtype = torch.long)
            mask = data['mask'].to(device, dtype = torch.long)
            token_type_ids = data['token_type_ids'].to(device, dtype = torch.long)
            targets = data['targets'].to(device, dtype = torch.float)
            outputs = model(ids, mask, token_type_ids)
            fin_targets.extend(targets.cpu().detach().numpy().tolist())
            fin_outputs.extend(torch.sigmoid(outputs).cpu().detach().numpy().tolist())
    return fin_outputs, fin_targets

In [34]:
outputs, targets = validation(testing_loader)

final_outputs = np.array(outputs) >=0.5



In [35]:
val_hamming_loss = metrics.hamming_loss(targets, final_outputs)
val_hamming_score = hamming_score(np.array(targets), np.array(final_outputs))

print(f"Hamming Score = {val_hamming_score}")
print(f"Hamming Loss = {val_hamming_loss}")

Hamming Score = 0.08024691358024691
Hamming Loss = 0.031715623669646656


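Hamming Loss: about 3% of the individual label decisions are incorrect. Because the label vectors appear to consist mostly of zeros, this figure alone says little; the low Hamming Score of about 0.08 suggests that the predicted label sets rarely overlap with the true ones.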

<a id='section07'></a>
### Saving the trained model for inference

This is the final step in the process of fine tuning the model. 

The model and its vocabulary are saved locally. These files are then used to make inferences on new inputs of student research proposals.

BA LINH: I don't know how to save the vocab.

In [36]:
# Saving the files for inference

#output_model_file = 'pytorch_distilbert_papers.bin'
#output_vocab_file = 'vocab_distilbert_papers.bin'

path = F"/content/drive/MyDrive/ThesisAllocationSystem/models/pytorch_distilbert_papers_3.bin" 
torch.save(model.state_dict(), path)

#path2 = F"/content/drive/MyDrive/ThesisAllocationSystem/models/pytorch_distilbert_papers.bin" 
#tokenizer.save_vocabulary(output_vocab_file.state_dict(), path)

#torch.save(model, output_model_file)
#tokenizer.save_vocabulary(output_vocab_file)

print('Saved')

Saved
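
One way to store the vocabulary alongside the weights is the tokenizer's `save_pretrained` method, which writes the vocabulary file and the tokenizer configuration into a directory that `from_pretrained` can read back. The sketch below is self-contained and therefore uses a tiny made-up vocabulary and a temporary directory, both of which are assumptions for illustration. In this notebook one would presumably call `save_pretrained` on the existing `tokenizer` with a Drive folder of one's choice instead.

```python
import os
import tempfile
from transformers import DistilBertTokenizer

with tempfile.TemporaryDirectory() as tmp_dir:
    # Tiny made-up vocabulary, one token per line (stand-in for the real one)
    vocab_path = os.path.join(tmp_dir, "toy_vocab.txt")
    tokens = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "thesis", "proposal"]
    with open(vocab_path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(tokens) + "\n")

    toy_tokenizer = DistilBertTokenizer(vocab_file=vocab_path)

    # Save the vocabulary and tokenizer config into a directory ...
    save_dir = os.path.join(tmp_dir, "saved_tokenizer")
    os.makedirs(save_dir, exist_ok=True)
    toy_tokenizer.save_pretrained(save_dir)
    saved_files = sorted(os.listdir(save_dir))

    # ... and load the tokenizer back from that directory
    reloaded = DistilBertTokenizer.from_pretrained(save_dir)
    reloaded_tokens = reloaded.tokenize("thesis proposal")

print(saved_files)
print(reloaded_tokens)
```

Keeping the tokenizer files next to the saved `state_dict` means that inference can later rebuild exactly the same text-to-id mapping that was used during fine-tuning.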
