<a href="https://colab.research.google.com/github/fenchri/ai4h-nactem-handson/blob/main/AI4H_NaCTeM_hands_on.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Welcome to the AI4H Winter School NaCTeM session ! 🎉 

The goal of this session is to give you a practical overview of how to build an end-to-end pipeline model for Event Extraction. The code used in the following sections is packaged in a repository, which you can use to experiment on your own.
Once you have a grasp of the main components of the system, you can try changing the model architecture to further improve performance, using the methods we discussed during the morning session.

This notebook is divided into the following sections:  
1. Set up environment
2. Checking out the data
3. Named Entity Recognition
4. Relation Extraction
5. Event Extraction
6. Pipeline Performance



# Step 1: Import the code directory & install dependencies

Replace the *\<insert_token_here\>* part in the next cell with the gitlab token that will be provided to you during the session.

In [None]:
!git clone https://ai4h-nactem-participants:<insert_token_here>@gitlab.com/fenchri/ai4health-nactem.git
#!git pull

In [None]:
import os
directory_path = "/content/ai4health-nactem/"
os.chdir(directory_path)

In [None]:
# Installing requirements may take a few minutes ...
!pip install -r requirements.txt

# Evaluation Metrics

To evaluate the three networks of interest, we will use the standard evaluation metrics of Precision (P), Recall (R) and F1-score. 
To compute these metrics we first need to define the following:
- **TP** (True Positives): Number of instances correctly identified by the model
- **FP** (False Positives): Number of instances falsely identified by the model
- **FN** (False Negatives): Number of instances missed by the model

Then, we can calculate each of these metrics as follows:

$ P = \frac{TP}{TP + FP} $, $ R = \frac{TP}{TP + FN} $, $ F1 = \frac{2 P R }{P + R} $

> Note: The sum of $TP + FP$ essentially corresponds to all *predicted* instances, while the sum of $TP + FN$ corresponds to all *true* instances.

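We can then estimate micro- and macro-averaged versions of these metrics over the categories $c$, shown here for Precision:

$ P_\text{micro} = \frac{\sum_c TP_c}{\sum_c (TP_c + FP_c)} $, 
$ P_\text{macro} = \frac{1}{|C|} \sum_c P_c $

where $|C|$ is the number of categories and $P_c$ is the Precision of category $c$.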

The micro-average measures the total performance across all instances, while the macro-average measures the performance for each category separately and takes the average values across categories.
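
The difference between the two averages is easiest to see on a small example. The following is a minimal sketch, not the repository's evaluation code: the category names and counts are made up for illustration, and the macro-averaged F1 is taken here as the mean of the per-category F1 scores (one common convention).

```python
from collections import namedtuple

# Hypothetical per-category counts, chosen only for illustration.
Counts = namedtuple("Counts", ["tp", "fp", "fn"])


def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts."""
    p = safe_div(tp, tp + fp)
    r = safe_div(tp, tp + fn)
    f1 = safe_div(2 * p * r, p + r)
    return p, r, f1


def micro_macro(per_class):
    """per_class maps a category name to its Counts."""
    # Micro: pool the counts of all categories, then compute the metrics once.
    tp = sum(c.tp for c in per_class.values())
    fp = sum(c.fp for c in per_class.values())
    fn = sum(c.fn for c in per_class.values())
    micro = prf(tp, fp, fn)

    # Macro: compute the metrics per category, then average them.
    scores = [prf(*c) for c in per_class.values()]
    macro = tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))
    return micro, macro


toy_counts = {
    "Frequent": Counts(tp=90, fp=10, fn=10),
    "Rare": Counts(tp=1, fp=3, fn=4),
}
micro, macro = micro_macro(toy_counts)
print("micro P/R/F1:", [round(x, 3) for x in micro])
print("macro P/R/F1:", [round(x, 3) for x in macro])
```

With these toy counts the micro-averaged Precision is 0.875, dominated by the frequent category, while the macro-averaged Precision drops to 0.575 because the rare category counts just as much as the frequent one.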

# Step 2: Checking out the data

In this step we will take an initial look at the dataset we are going to use. 
In particular, we will use the [MLEE (Multi-level Event Extraction)](http://nactem.ac.uk/MLEE/) corpus by [Pyysalo et al. (2012)](https://academic-oup-com.manchester.idm.oclc.org/bioinformatics/article/28/18/i575/249872) to perform Event Extraction.

## Step 2a: Inspecting the data

The dataset can be browsed with the [brat](https://brat.nlplab.org/) annotation tool. At the following link you can navigate through the entire collection of documents in MLEE using the arrows.

http://www.nactem.ac.uk/eccb2012/index.xhtml#/10417401

## Step 2b: Data Statistics 

Let's delve into the dataset statistics in terms of named entities, events, etc. Run the following cells:

In [None]:
%cd /content/ai4health-nactem/src

from statistics import calc_statistics
from matplotlib import pyplot as plt
import numpy as np


def get_keys(report,keys,main_value, sec_value):
  values = []
  for key in keys:
    if key in report[main_value][sec_value]:
      val = report[main_value][sec_value][key]['count']
      #print(val)
      values.append(val)
    else:
      values.append(0)
  return(values)


report_train = calc_statistics('train')  # get statistics
report_val = calc_statistics('val')
report_test = calc_statistics('test')

main_value = 'Entities'
sec_value = 'Types'
keys = list(report_test[main_value][sec_value].keys())

x_train = get_keys(report_train, keys, main_value, sec_value)
x_val = get_keys(report_val, keys, main_value, sec_value)
x_test = get_keys(report_test, keys, main_value, sec_value)
y = np.arange(len(keys))
#print(y)
ax = plt.subplot(111)
train = ax.bar(y-0.3, x_train, width=0.3, color='tab:blue', align='center')
val = ax.bar(y+0.0, x_val, width=0.3, color='tab:green', align='center')
test = ax.bar(y+0.3, x_test, width=0.3, color='tab:orange', align='center')
ax.legend( [train, val, test], ['train', 'dev', 'test'] )
plt.rcParams['figure.figsize'] = [10, 6]
plt.rcParams['figure.dpi'] = 90 
plt.xticks(y, keys)
plt.xticks(rotation=90)
plt.title('Entity Counts')
plt.ylabel('Frequency')

plt.show()

print('Total Entities:', (report_train['Entities']['Total']['count']+report_val['Entities']['Total']['count']+report_test['Entities']['Total']['count']))

In [None]:

main_value = 'Triggers'
sec_value = 'Types'
keys = list(report_test[main_value][sec_value].keys())

x_train = get_keys(report_train, keys, main_value, sec_value)
x_val = get_keys(report_val, keys, main_value, sec_value)
x_test = get_keys(report_test, keys, main_value, sec_value)


y = np.arange(len(keys))
#print(y)
ax = plt.subplot(111)
train = ax.bar(y-0.3, x_train, width=0.3, color='tab:blue', align='center')
val = ax.bar(y+0.0, x_val, width=0.3, color='tab:green', align='center')
test = ax.bar(y+0.3, x_test, width=0.3, color='tab:orange', align='center')
ax.legend( [train, val, test], ['train', 'dev', 'test'] )
plt.rcParams['figure.figsize'] = [10, 6]
plt.rcParams['figure.dpi'] = 90 
plt.xticks(y, keys)
plt.xticks(rotation=90)
plt.title('Trigger Counts')
plt.ylabel('Frequency')

plt.show()

print('Total Triggers:', (report_train[main_value]['Total']['count']+report_val[main_value]['Total']['count']+report_test[main_value]['Total']['count']))


In [None]:

main_value = 'Events'
sec_value = 'Roles'
keys = list(report_test[main_value][sec_value].keys())

x_train = get_keys(report_train, keys, main_value, sec_value)
x_val = get_keys(report_val, keys, main_value, sec_value)
x_test = get_keys(report_test, keys, main_value, sec_value)


y = np.arange(len(keys))
#print(y)
ax = plt.subplot(111)
train = ax.bar(y-0.3, x_train, width=0.3, color='tab:blue', align='center')
val = ax.bar(y+0.0, x_val, width=0.3, color='tab:green', align='center')
test = ax.bar(y+0.3, x_test, width=0.3, color='tab:orange', align='center')
ax.legend( [train, val, test], ['train', 'dev', 'test'] )
plt.rcParams['figure.figsize'] = [10, 6]
plt.rcParams['figure.dpi'] = 90 
plt.xticks(y, keys)
plt.xticks(rotation=90)
plt.title('Event Role Counts')
plt.ylabel('Frequency')

plt.show()


total = (report_train[main_value]['Total']['count']+report_val[main_value]['Total']['count']+report_test[main_value]['Total']['count'])
total_nested = (report_train[main_value]['Nested']['count']+report_val[main_value]['Nested']['count']+report_test[main_value]['Nested']['count'])
print('Total Events: ', total)
print('Nested Events: {} ({:.2f} %)'.format(total_nested, 100 * total_nested / total))


In [None]:
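# `report` is never defined above; we assume the training-split report was intended
print('Avg sentence length:', report_train['sent_len']['avg'])
print('Max sentence length:', report_train['sent_len']['max'])
print('Min sentence length:', report_train['sent_len']['min'])

report_train['sent_ents']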

In [None]:
report_train

## Step 2c: Sample Sentence

We will use the following sentence as our sample throughout the next modules.

![Example Sentence](/content/ai4health-nactem/images/example_sentence.png)


In [None]:
from IPython.display import Image #, display
#from google.colab.patches import cv2_imshow
#import cv2

example_image = '/content/ai4health-nactem/images/example_sentence.png'
#img = cv2.imread(example_image, cv2.IMREAD_UNCHANGED)
#cv2_imshow(img)
Image(example_image, width='1200')

# Step 3: Named Entity Recognition

The first task we will investigate is Named Entity Recognition (NER). 
In this task, we will identify *named entities* and *triggers* in a given sentence simultaneously. 
We will treat this problem as a **token classification** task, i.e. we predict one label for each word in the sentence.
We will incorporate a pre-trained Language Model and finetune it on the MLEE dataset.

## Step 3a: Creating NER instances

An instance in this task is a sentence.
Since we treat NER as a token classification task, we need to associate each word with a label. We will therefore use the BIO tagging scheme that we saw earlier.
As a reminder:
- **B-CLASS** tag is assigned to the first token of an entity or trigger
- **I-CLASS** tag is assigned to its intermediate and last tokens
- **O** tag is assigned to all other tokens (not belonging to an entity or trigger)

The BIO tags are already provided by our dataset, so each sentence comes with its own BIO sequence, as seen in the example below.

In [None]:
import json

with open('/content/ai4health-nactem/data/MLEE_train.json', 'r') as f:
    example = json.loads(f.readline())

example_sentence = example['sentence']
for w, l in zip(example_sentence, example['bio']):
  print('{:<10}\t{:<10}'.format(w, l))
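
To make the BIO scheme concrete, the sketch below converts entity spans into a BIO sequence. It is only an illustration under stated assumptions: the helper `spans_to_bio`, the toy sentence and the type names `Protein` and `Tissue` are hypothetical and are not taken from the dataset or the repository.

```python
def spans_to_bio(n_tokens, spans):
    """Build a BIO tag sequence.

    spans is a list of (start, end, type) triples with inclusive token indices.
    """
    tags = ["O"] * n_tokens
    for start, end, etype in spans:
        tags[start] = "B-" + etype          # first token of the span
        for i in range(start + 1, end + 1):
            tags[i] = "I-" + etype          # remaining tokens of the span
    return tags


# Toy sentence and spans (made up for illustration).
toy_sentence = ["VEGF", "induces", "blood", "vessel", "growth"]
toy_spans = [(0, 0, "Protein"), (2, 3, "Tissue")]
toy_tags = spans_to_bio(len(toy_sentence), toy_spans)

for word, tag in zip(toy_sentence, toy_tags):
    print("{:<10}\t{:<10}".format(word, tag))
```

The single-token span receives only a `B-` tag, while the two-token span receives `B-` followed by `I-`; every other token is tagged `O`.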

## Step 3b: Pre-processing sentences

In order to feed each sentence into the model we first need to apply a few basic pre-processing steps:
- Tokenization
- Subword segmentation
- Conversion of elements to identifiers

The dataset is already in a tokenized format, i.e. each word is separated by a space as we saw in step 3a. 
The most important (and trickiest) step here is subword segmentation.
As we saw during the lecture, state-of-the-art language models use subword segmentation on the input sequence, before feeding it into the network.
This means, that each word will need to be segmented based on the algorithm that each model uses (e.g. BPE, WordPiece, etc).

We first need to initialize a tokenizer as follows:

In [None]:
import transformers
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
            'roberta-base',
            add_prefix_space=True,
)

We can then pass the sentence into the tokenizer and inspect the output:

In [None]:
segmented_sentence = [tokenizer.tokenize(w) for w in example_sentence]
segmented_sentence = [item for sublist in segmented_sentence for item in sublist]

print('Original sentence:\n{}\n'.format(' | '.join(example_sentence)))
print('Segmented sentence:\n{}\n'.format(' | '.join(segmented_sentence)))

print(f'Original sentence length:  {len(example_sentence)}')
print(f'Segmented sentence length: {len(segmented_sentence)}')

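As we can see, the segmented sentence is longer than the original one, because some words are split into several subwords. Schematically (in WordPiece-style notation; RoBERTa's byte-level BPE marks word-initial pieces with `Ġ` instead of marking continuations with `##`, so the printed pieces look different):
```
extracellular --> extra ##cellular
```

This means we need to re-adjust the labels so that both sequences have the same length. We do that with a "trick": we append an "X" label for each newly introduced subword.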

The following code shows the sentence in column 1 and the expected (extended) output labels in column 2.

In [None]:
new_label_seq = []
for w, l in zip(example_sentence, example['bio']):
    w_ids = tokenizer.tokenize(w)
    n_subwords = len(w_ids)  # augment labels based on number of subwords (X label)
    new_label_seq += [l]
    new_label_seq.extend(['X'] * (n_subwords - 1))

for w, l in zip(segmented_sentence, new_label_seq):
  print('{:<10}\t{:<10}'.format(w, l))

Now we simply need to convert each token and its label into a unique id, in order to feed them into the network.

In [None]:
from models.ner_dataset import NERdataset
import yaml
import torch
from utils import *
import yamlordereddictloader
from torch.utils.data import DataLoader

with open('/content/ai4health-nactem/src/config.yaml', 'r') as f:
  config = yaml.load(f, Loader=yamlordereddictloader.Loader)

config = dict(config)
device = torch.device("cuda:{}".format(config['gpu']) if config['gpu'] != -1 else "cpu")
config['device'] = device
config['task'] = 'ner'

config['labels'] = ner_labels(config)
config['labels'].trigger_types = ner_triggers(config)
config['unique_labels'] = list(set([l.split('-')[1] 
                                    for l in config['labels'].ent2id.keys()
                                    if '-' in l]))

train_data = NERdataset(config, tokenizer, 'train')
train_loader = DataLoader(dataset=train_data,
                          batch_size=config['batch_size'],
                          shuffle=True,
                          collate_fn=train_data.collate)

for trd in train_loader:
  for i, id_ in enumerate(trd['ids']):
    if id_ == 'PMID-16407289-s0':
      print(' === Sentence ===')
      print(tokenizer.convert_ids_to_tokens(trd['input_ids'][i]))
      print(trd['input_ids'][i])

      print('\n === Labels ===')
      print([config['labels'].id2ent[l] for l in trd['labels'][i].tolist()])
      print(trd['labels'][i])
      break

## Step 3c: NER Module

We will use a very simple and straightforward NER component, as shown in the figure below.





In [None]:
ner_arch = '/content/ai4health-nactem/images/ner-arch.png'
Image(ner_arch, width=500)

The NER module, written in [PyTorch](https://pytorch.org/), is shown below.
It essentially consists of an encoder (a pretrained Transformer-based LM) and a linear classification layer, often called a "softmax" layer.

The linear layer is already included in `AutoModelForTokenClassification`, stacked on top of the encoder's output.

During the forward computation, the outputs of the encoder are fed into the linear layer, whose output dimensionality equals the number of target labels. 
The resulting scores are normalised via a softmax function and the predicted label is the one with the highest probability.


```python
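import torch
from torch import nn
from transformers import AutoConfig, AutoModelForTokenClassification


class NERmodel(nn.Module):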
    def __init__(self, config):
        super(NERmodel, self).__init__()

        self.config = config
        self.num_labels = config['labels'].n_ent
        self.model_config = AutoConfig.from_pretrained(
            config['model_name'],
            num_labels=self.num_labels
        )
        self.encoder = AutoModelForTokenClassification.from_pretrained(config['model_name'],
                                                                       config=self.model_config)
        self.loss_fct = nn.CrossEntropyLoss(ignore_index=config['labels'].X_id)  # ignore masked subwords

    def forward(self, seqs):
        outputs = self.encoder(seqs['input_ids'],
                               attention_mask=seqs['attention_mask'])

        logits = outputs[0]

        mask = torch.ne(seqs['input_ids'], self.model_config.pad_token_id)  # remove padding
        active_logits = logits[mask]
        active_labels = seqs['labels'][mask]

        loss = self.loss_fct(active_logits, active_labels)

        return logits, loss
```

### Loss Function

It is important to note the loss function that we use: Cross Entropy.
This function measures the error of the model when predicting a certain label for a certain token. It uses the true probability distribution of the labels and the predicted label distribution in order to measure how close they are. This loss forces the model to give the highest probability to the correct category.

Another important aspect is the following `ignore_index=config['labels'].X_id`, which means that we want to ignore the predictions of the additional subwords - we only care about the prediction of the first token.
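
To see what ignoring the "X" positions means numerically, here is a small pure-Python sketch of softmax, cross entropy and label prediction. It is an illustration only, not the PyTorch implementation: the id `TOY_X_ID` and the scores are hypothetical, and the loss is averaged over the tokens that are not ignored.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(all_logits, labels, ignore_index):
    """Mean negative log-probability of the gold label, skipping ignored tokens."""
    losses = []
    for logits, gold in zip(all_logits, labels):
        if gold == ignore_index:          # e.g. the extra subwords labelled "X"
            continue
        losses.append(-math.log(softmax(logits)[gold]))
    return sum(losses) / len(losses) if losses else 0.0


def argmax(values):
    """Index of the highest score, i.e. the predicted label id."""
    return max(range(len(values)), key=lambda i: values[i])


TOY_X_ID = -1                             # hypothetical id of the "X" label
toy_logits = [[2.0, 0.0, 0.0],            # first subword of a word, gold label 0
              [0.0, 0.0, 0.0],            # extra subword, labelled "X"
              [0.0, 3.0, 0.0]]            # next word, gold label 1
toy_labels = [0, TOY_X_ID, 1]

toy_loss = cross_entropy(toy_logits, toy_labels, TOY_X_ID)
toy_preds = [argmax(scores) for scores in toy_logits]
print("loss:", round(toy_loss, 4))
print("predictions:", toy_preds)
```

Changing the scores of the second token has no effect on the loss, which is exactly the behaviour we want for the additional subwords.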

## Step 3d: Training

To train the model to predict the correct label for each token in a sentence, we will employ a trainer.
The trainer works in epochs: after each full pass over the training set, it evaluates the model on the validation set.

We will use the micro- and macro-averaged Precision, Recall and F1-score, as our primary metrics, in a **strict** evaluation setting, i.e. we consider an entity as **correct** if and only if both its span and semantic type are predicted correctly.
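
The strict setting can be sketched in a few lines: extract the (start, end, type) spans from the gold and predicted BIO sequences and count only exact matches as true positives. This is a simplified illustration, not the repository's `compute_ner_metrics`; the helper names and the toy tags are hypothetical, and an `I-` tag without a matching preceding `B-` tag is simply dropped here.

```python
def bio_to_spans(tags):
    """Return the set of (start, end, type) spans, with inclusive end index."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):   # the sentinel closes the last span
        if tag.startswith("I-") and start is not None and tag[2:] == etype:
            continue                               # the current span goes on
        if start is not None:                      # the current span ends before i
            spans.add((start, i - 1, etype))
            start, etype = None, None
        if tag.startswith("B-"):                   # a new span starts at i
            start, etype = i, tag[2:]
    return spans


def strict_prf(gold_tags, pred_tags):
    """Strict P/R/F1: a prediction counts only if span AND type match exactly."""
    gold, pred = bio_to_spans(gold_tags), bio_to_spans(pred_tags)
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Toy example: the second entity is predicted with a too-short span.
toy_gold = ["B-Protein", "O", "B-Tissue", "I-Tissue", "O"]
toy_pred = ["B-Protein", "O", "B-Tissue", "O", "O"]
toy_scores = strict_prf(toy_gold, toy_pred)
print("strict P/R/F1:", toy_scores)
```

In the toy example only one of the two predicted spans matches a gold span exactly, so Precision, Recall and F1 are all 0.5, even though the second prediction overlaps with the gold entity.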


```python
def train_epoch(self, epoch):
        self.model = self.model.train()
        total_loss = 0
        total_preds, total_truth, masks = [], [], []

        iterations = len(self.loaders['train'])
        loop = tqdm(enumerate(self.loaders['train']), total=iterations, leave=False)
        for batch_idx, batch in loop:
            step = ((epoch - 1) * iterations) + batch_idx

            self.model.zero_grad()
            for b in batch:
                if b != 'ids':
                    batch[b] = batch[b].to(self.config['device'])

            logits, loss = self.model(batch)
            preds = torch.argmax(logits, dim=2)

            total_preds.extend(preds.to('cpu').data.numpy().tolist())
            total_truth.extend(batch['labels'].to('cpu').data.numpy().tolist())
            masks.extend(batch['attention_mask'].to('cpu').data.numpy().tolist())

            # Backward & Updates
            total_loss += loss.item()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(parameters=self.model.parameters(),
                                           max_norm=self.config['max_grad_norm'])
            self.optimizer.step()
            self.scheduler.step()

        scores, scores_trig, scores_ent = \
                performance.compute_ner_metrics(total_preds, total_truth, masks, self.config['labels'])
        scores["loss"] = total_loss / iterations

```


```python
def eval(self, mode, per_class=False, write_preds=False):
        self.model = self.model.eval()  # main change
        total_loss = 0
        total_preds, total_truth, masks, ids = [], [], [], []

        iterations = len(self.loaders[mode])
        with torch.no_grad():   # main change
            for batch_idx, batch in enumerate(self.loaders[mode]):
                for b in batch:
                    if b != 'ids':
                        batch[b] = batch[b].to(self.config['device'])
                ...
```

In order to train the NER module, you can use the following command. This will start training the model and you can observe the performance on the training and validation sets after each epoch. 

It is important to check here that performance goes up and the loss goes down after each epoch!


In [None]:
%cd /content/ai4health-nactem/src/
# !python main.py --config config.yaml --mode train --task ner

## Step 3e: Performance & Error Analysis



Since training can take some time, we have already trained the model and we can simply load it for evaluation on the validation and test sets.

In addition, we evaluate the performance on triggers and entities separately.


In [None]:
%cd /content/ai4health-nactem/src/

import pandas as pd
pd.set_option('display.max_columns', None)

!python main.py --config config.yaml --mode test --task ner

As we can observe, the model's overall performance is quite good on both the detection of named entities and triggers. 

However, there are several categories with very low scores. This is attributed to the small number of instances they contain in the training set - as can be confirmed from the data statistics in Step 2b.

We can observe more closely the errors that the model makes using the following command.

In [None]:
%cd /content/ai4health-nactem/src/
!python error_analysis.py --gold ../data/MLEE_val.json --pred ../saved/ner-roberta-base-val_preds.json --task ner --config config.yaml

We identify four main categories of errors in the model:
- **Type error**: The predicted span is correct but the semantic type is wrong
- **Span error**: The predicted span is wrong but the semantic type is correct (there should be some overlap between the gold entity and the predicted one)
- **Wrong span, wrong type**: Both the span and the semantic type are wrong
- **False Positive**: The model has predicted a totally new entity which does not belong to the gold set

As can be observed, the largest category of errors is the False Positive, i.e. the model falsely predicts entities that do not exist in the annotated data. 
The second largest category is the type error, followed by the span error.
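
These categories can be expressed as a small decision procedure over spans. The sketch below is only an assumption about how such a classification could work, not the logic of `error_analysis.py`; in particular, the way a prediction that overlaps several gold entities is resolved is a guess, and the toy spans are made up.

```python
def overlaps(a, b):
    """True if two (start, end, type) spans share at least one token."""
    return a[0] <= b[1] and b[0] <= a[1]


def classify_prediction(pred, gold_spans):
    """Assign a predicted (start, end, type) span to one of the categories above."""
    if pred in gold_spans:
        return "correct"
    overlapping = [g for g in gold_spans if overlaps(pred, g)]
    if not overlapping:
        return "false_positive"            # no gold entity anywhere near
    if any(g[:2] == pred[:2] for g in overlapping):
        return "type_error"                # same span, different type
    if any(g[2] == pred[2] for g in overlapping):
        return "span_error"                # same type, different (overlapping) span
    return "wrong_span_wrong_type"


# Toy gold spans and predictions (inclusive token indices).
toy_gold_spans = [(0, 0, "Protein"), (2, 3, "Tissue")]
toy_predictions = [(0, 0, "Cell"), (2, 2, "Tissue"), (3, 4, "Cell"), (6, 6, "Protein")]
toy_errors = [classify_prediction(p, toy_gold_spans) for p in toy_predictions]

for pred, err in zip(toy_predictions, toy_errors):
    print(pred, "->", err)
```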

### Error Examples

Below we show a few examples of cases where the model fails to identify correctly the target entities/triggers.



## Step 3f: Visualising network internals

We can visualise the attention heads of the trained model using the bertviz package.

In [None]:
import sys
!test -d bertviz_repo && echo "FYI: bertviz_repo directory already exists, to pull latest version uncomment this line: !rm -r bertviz_repo"
# !rm -r bertviz_repo # Uncomment if you need a clean pull from repo
!test -d bertviz_repo || git clone https://github.com/jessevig/bertviz bertviz_repo
if not 'bertviz_repo' in sys.path:
  sys.path += ['bertviz_repo']
!pip install boto3

In [None]:
def call_html():
  import IPython
  display(IPython.core.display.HTML('''
        <script src="/static/components/requirejs/require.js"></script>
        <script>
          requirejs.config({
            paths: {
              base: '/static/base',
              "d3": "https://cdnjs.cloudflare.com/ajax/libs/d3/5.7.0/d3.min",
              jquery: '//ajax.googleapis.com/ajax/libs/jquery/2.0.0/jquery.min',
            },
          });
        </script>
        '''))

In [None]:
%cd /content/ai4health-nactem/src/

from bertviz import head_view
from bertviz.transformers_attn import RobertaForSequenceClassification
from models.ner_trainer import NERtrainer
from models.ner import NERmodel
from models.ner_dataset import NERdataset
import yaml
import torch
from utils import *
import yamlordereddictloader
from torch.utils.data import DataLoader

with open('/content/ai4health-nactem/src/config.yaml', 'r') as f:
  config = yaml.load(f, Loader=yamlordereddictloader.Loader)

config = dict(config)
device = torch.device("cuda:{}".format(config['gpu']) if config['gpu'] != -1 else "cpu")
config['device'] = device
config['task'] = 'ner'

config['labels'] = ner_labels(config)
config['labels'].trigger_types = ner_triggers(config)
config['unique_labels'] = list(set([l.split('-')[1] 
                                    for l in config['labels'].ent2id.keys()
                                    if '-' in l]))

train_data = NERdataset(config, tokenizer, 'train')
train_loader = DataLoader(dataset=train_data,
                          batch_size=config['batch_size'],
                          shuffle=True,
                          collate_fn=train_data.collate)

model = NERmodel(config)
trainer = NERtrainer(config, model, None, None, {'train': train_loader})
trainer.load_model()

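example_text = ' '.join(example_sentence)  # keep the original token list intact
inputs = tokenizer.encode_plus(example_text, return_tensors='pt', add_special_tokens=False)

# move the inputs to wherever the loaded model lives
model_device = next(trainer.model.parameters()).device
inputs = {k: v.to(model_device) for k, v in inputs.items()}

input_ids = inputs['input_ids']
# attention weights are only returned when explicitly requested
attention = trainer.model.encoder(input_ids,
                                  attention_mask=inputs['attention_mask'],
                                  output_attentions=True).attentions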
input_id_list = input_ids[0].tolist() # Batch index 0
tokens = tokenizer.convert_ids_to_tokens(input_id_list)

call_html()
head_view(attention, tokens)

# Step 4: Relation Extraction

The second task we will investigate is predicting relations between elements in a sentence, in our case entities and triggers. We will treat this problem as a **pair classification task** and again use a pre-trained Language Model, finetuned on the MLEE dataset.

## Step 4a: Creating relation instances

In the RE task it is quite common to replicate a sentence a number of times, equal to the number of pairs contained in it. Even if this produces a lot of training instances, it is helpful for the model to focus only on a certain pair.

When creating relation instances, the following should be kept in mind:
- We need to create **negative** instances for the model to learn to predict when a pair does not share a relation
- We need to take into account the **directionality** of the pair, so that we know which argument should come first and which should come second

For our particular task, we will break-down all existing events into *relation pairs*. In particular, the *role* of an argument will serve as the relation label between a trigger-argument pair. 
Although we know that the first argument should always be a trigger, we generate entity-entity pairs as well and give them the negative relation category: "NA". In essence, we let the model learn that those pairs should not share any relations.

Regarding *directionality*, we use the following label format:
- 1:Type:2 --> in case the relation arrow goes from left-to-right
- 2:Type:1 --> in case the relation should be inverse (from right-to-left)





The following piece of code takes care of:
- creation of positive and negative pairs
- incorporation of directionality into the relation labels

In [None]:
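from itertools import combinations

errors = 0  # counts event arguments that refer to an unknown event id
example_instances = []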
with open(config['train_data'], 'r') as infile:
    for line in infile:
        data = json.loads(line)

        sentence = data['sentence']
        print(' === Sentence === ')
        print(' '.join(sentence)+'\n')
        bin_relations = {}
        event_map = {}
        ent_dict = {}

        for ent in data['entities']:
          ent_dict[ent['id']] = ent

        print(' === Events === ')
        for ev in data['events']:
            print('{} ({}) -> {}'.format(
                ent_dict[ev['trigger']]['surface'], 
                ev['event_type'], 
                [(a['role'], ent_dict[a['argument']]['surface']) for a in ev['arguments']]))
            event_map[ev['id']] = ev['trigger']

        # split into binary relations
        for e in data['events']:
            for arg in e['arguments']:
                role = arg['role'].replace('1', '').replace('2', '').replace('3', '').replace('4', '')
                if arg['argument'].startswith('E'):
                    if arg['argument'] in event_map:
                        bin_relations[(e['trigger'], event_map[arg['argument']])] = role
                    else:
                        errors += 1
                else:
                    bin_relations[(e['trigger'], arg['argument'])] = role

        # find related things
        print('\n === Relation pairs ===')
        for arg1, arg2 in combinations(data['entities'], 2):

            if arg1['tokens'][-1] < arg2['tokens'][0]:  # arg1 is 1st
                forward = (arg1['id'], arg2['id'])
                backward = (arg2['id'], arg1['id'])

                if forward in bin_relations:
                    pair_label = '1:' + bin_relations[forward] + ':2'
                elif backward in bin_relations:
                    pair_label = '2:' + bin_relations[backward] + ':1'
                else:
                    pair_label = 'NA'

                example_instances += [(sentence, arg1, arg2, pair_label)]
                
                if pair_label.startswith('2'):
                  print('{} <- {} <- {}'.format(
                      arg1['surface'], pair_label, arg2['surface']))
                else:
                  print('{} -> {} -> {}'.format(
                      arg1['surface'], pair_label, arg2['surface']))
                
            else:  # arg1 is 2nd
                forward = (arg2['id'], arg1['id'])
                backward = (arg1['id'], arg2['id'])

                if forward in bin_relations:
                    pair_label = '1:' + bin_relations[forward] + ':2'
                elif backward in bin_relations:
                    pair_label = '2:' + bin_relations[backward] + ':1'
                else:
                    pair_label = 'NA'

                example_instances += [(sentence, arg2, arg1, pair_label)]

                if pair_label.startswith('2'):
                  print('{} <- {} <- {}'.format(
                      arg2['surface'], pair_label, arg1['surface']))
                else:
                  print('{} -> {} -> {}'.format(
                      arg2['surface'], pair_label, arg1['surface']))
        break
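
The pairing logic of the cell above can also be written more compactly. The following self-contained sketch orders each pair by position in the sentence and derives the directional label; the helper `label_pairs`, the mention ids and the toy relations are hypothetical and only mirror the conventions described above.

```python
from itertools import combinations


def label_pairs(mentions, bin_relations):
    """Return (left_id, right_id, label) for every pair of mentions.

    mentions: dicts with an 'id' and a 'tokens' list of token indices.
    bin_relations: maps (trigger_id, argument_id) to a role name.
    """
    instances = []
    for a, b in combinations(mentions, 2):
        # the mention that ends before the other one starts comes first
        left, right = (a, b) if a["tokens"][-1] < b["tokens"][0] else (b, a)
        forward, backward = (left["id"], right["id"]), (right["id"], left["id"])

        if forward in bin_relations:        # arrow goes left-to-right
            label = "1:" + bin_relations[forward] + ":2"
        elif backward in bin_relations:     # arrow goes right-to-left
            label = "2:" + bin_relations[backward] + ":1"
        else:                               # negative instance
            label = "NA"
        instances.append((left["id"], right["id"], label))
    return instances


# Toy input: T2 is a trigger, T1 and T3 are its arguments (made up for illustration).
toy_mentions = [
    {"id": "T1", "tokens": [0]},
    {"id": "T2", "tokens": [1]},
    {"id": "T3", "tokens": [2, 3]},
]
toy_relations = {("T2", "T1"): "Cause", ("T2", "T3"): "Theme"}
toy_pairs = label_pairs(toy_mentions, toy_relations)

for pair in toy_pairs:
    print(pair)
```

Here the pair (T1, T2) is labelled `2:Cause:1` because the trigger appears after its argument, (T2, T3) is labelled `1:Theme:2`, and the entity-entity pair (T1, T3) becomes a negative `NA` instance.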

## Step 4b: Using semantic Types

In order to take into account the semantic types of the entities/triggers in a pair, we replace their surface forms in the text with a special semantic type token, as shown below.

The benefit of this technique is to avoid overfitting and memorisation of the words in the training set.
This way, instead, we are able to learn patterns of relations!

> Note: Again, we need to adjust the token offsets due to the subword segmentation!

First we need to make sure that these tags are added as special tokens, so that they are not affected by subword segmentation! 
This is done via the `additional_special_tokens` argument.

In [None]:
config['labels'], config['new_tokens'] = re_labels(config)

tokenizer = AutoTokenizer.from_pretrained(
    config['model_name'],
    add_prefix_space=True,
    additional_special_tokens=config['new_tokens']
)

In [None]:
def insert_special_entities(sentence, arg1, arg2):
    new_arg1 = '@' + arg1['type'] + '$'
    new_arg2 = '@' + arg2['type'] + '$'

    # insert special tokens into a sentence and update token ids
    new_sentence = sentence[0:arg1['tokens'][0]] + \
                    [new_arg1] + \
                    sentence[(arg1['tokens'][-1] + 1):arg2['tokens'][0]] + \
                    [new_arg2] + \
                    sentence[(arg2['tokens'][-1] + 1):]

    new_tokens = [len(sentence[0:arg1['tokens'][0]]),
                  len(sentence[0:arg1['tokens'][0]]) + 1 +
                  len(sentence[(arg1['tokens'][-1] + 1):arg2['tokens'][0]])]

    # fix token ids based on BPE tokenization
    final_tokens = [0, 0]
    tmp_seq = []
    for id_, w in enumerate(new_sentence):  # until 1st arg
        w_ids = tokenizer.tokenize(w)
        tmp_seq += w_ids
        n_subwords = len(w_ids)  # augment id based on number of subwords

        if id_ < new_tokens[0]:
            final_tokens[0] += n_subwords

        if id_ < new_tokens[1]:
            final_tokens[1] += n_subwords

    assert tmp_seq[final_tokens[0]] == new_arg1, '{} <> {}'.format(tmp_seq[final_tokens[0]], new_arg1)
    assert tmp_seq[final_tokens[1]] == new_arg2

    print('\n'+' '.join(new_sentence))

for instance in example_instances:
  insert_special_entities(instance[0], instance[1], instance[2])

Again, the final step is to convert the input sequence and its corresponding labels into a set of ids before feeding it into the model.

In [None]:
from models.re_dataset import REdataset

train_data = REdataset(config, tokenizer, 'train')
train_loader = DataLoader(dataset=train_data,
                          batch_size=config['batch_size'],
                          shuffle=True,
                          collate_fn=train_data.collate)

for trd in train_loader:
  for i, id_ in enumerate(trd['ids']):
    if id_[0] == 'PMID-16407289-s0':
      print('\n === Sentence ===')
      print(tokenizer.convert_ids_to_tokens(trd['input_ids'][i]))
      print(trd['input_ids'][i])

      print('\n === Labels ===')
      print(config['labels'].id2rel[trd['labels'][i].item()], trd['labels'][i])

## Step 4c: Relation Module

Finally, we are ready to construct our relation extraction model architecture, which is shown in the next figure.

Similarly to NER, we feed our input sequence into the model.
> *Note*: Here our sequence has the target arguments replaced by their semantic types!

Then we take the representations from the last hidden layer of the encoder that correspond to each of the two arguments.
The concatenation of the two vectors is given as input to a linear classification layer, which chooses the label with the highest probability score.

> *Note*: Due to our label convention we will also know the direction of the relation!

In [None]:
re_arch = '/content/ai4health-nactem/images/re-arch.png'
Image(re_arch, width='500')

```python
import torch
from torch import nn
from transformers import AutoConfig, AutoModel


class REmodelPair(nn.Module):
    """
    Relation Extraction Model
    Each pair in a sentence is replaced by special tokens expressing the entity type
    The concatenation of the special token embeddings (after the encoder) is used for classification
    """
    def __init__(self, config):
        super(REmodelPair, self).__init__()

        self.config = config
        self.num_labels = config['labels'].n_rel
        configuration = AutoConfig.from_pretrained(
            config['model_name'],
            output_hidden_states=True)

        self.encoder = AutoModel.from_pretrained(
            config['model_name'],
            config=configuration)

        self.encoder.resize_token_embeddings(
            configuration.vocab_size + len(config['new_tokens']))

        self.classifier = nn.Linear(2*configuration.hidden_size, self.num_labels)
        self.loss_fct = nn.CrossEntropyLoss()

    def forward(self, seqs):
        outputs = self.encoder(
            seqs['input_ids'],
            attention_mask=seqs['attention_mask']
        )

        rows = torch.arange(seqs['input_ids'].size(0)).long().to(self.config['device'])

        first_arguments = outputs.last_hidden_state[rows, seqs['tokens'][:, 0]]
        second_arguments = outputs.last_hidden_state[rows, seqs['tokens'][:, 1]]

        pair_repr = torch.cat([first_arguments, second_arguments], dim=1)
        logits = self.classifier(pair_repr)

        loss = self.loss_fct(logits, seqs['labels'])

        return logits, loss
```
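Since the predicted label also encodes the direction of the relation, it can be turned back into a directed triple. The sketch below assumes a hypothetical `1:ROLE:2` / `2:ROLE:1` label scheme; the actual label convention is the one defined in the repository, so treat the format and the function name `decode_direction` as illustrative assumptions.

```python
def decode_direction(label, first_arg, second_arg):
    """Turn a directional label into a (head, role, tail) triple.

    Assumed (hypothetical) scheme: '1:ROLE:2' means the first argument is the
    head, '2:ROLE:1' means the second argument is the head.
    """
    left, role, right = label.split(':')
    if (left, right) == ('1', '2'):
        return first_arg, role, second_arg
    if (left, right) == ('2', '1'):
        return second_arg, role, first_arg
    raise ValueError('Unexpected label format: {}'.format(label))


# Both calls describe the same directed relation
print(decode_direction('1:Theme:2', 'Regulation', 'extracellular matrix'))
print(decode_direction('2:Theme:1', 'extracellular matrix', 'Regulation'))
```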

## Step 4d: Performance & Error Analysis

Training the model takes some time, so instead we will evaluate an already fine-tuned model on the validation set.

In [None]:
!python evaluation.py --config config.yaml --task re --gold ../data/MLEE_val.json --pred ../saved/re-roberta-base-val_preds.json

As we can see, the model performs well on the most frequent relation categories (e.g. Theme, Participant). Some categories have no gold labels, and for some with very few training instances (e.g. ToLoc) the model performs poorly.

Moving on to the error analysis, we identify 4 types of potential errors in the model:
- **Type Error**: Where the detected direction is correct, but the relation category is wrong
- **Direction Error**: Where the detected category is correct, but the direction is wrong
- **Entity-Entity Connection**: Where the model detects two named entities as sharing a relation
> *Note*: Here we wanted to let the model learn that these pairs shouldn't be related in our task
- **False Positive**: Where the model detects a relation between a new pair (not sharing a relation in the gold data)
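The four categories above can be expressed as a small decision function. This is only a sketch of the idea, not the logic of `error_analysis.py`: the `(direction, category)` tuple format, the `is_entity_pair` flag and the extra `'other'` bucket are assumptions made for illustration.

```python
def classify_error(gold, pred, is_entity_pair):
    """Assign an error type to a predicted relation.

    gold / pred: (direction, category) tuples, or None when there is no relation.
    is_entity_pair: True when both arguments are named entities (no trigger).
    Returns None when the prediction is correct or when nothing was predicted.
    """
    if pred is None or pred == gold:
        return None
    if gold is None:
        # A relation was predicted for a pair that has none in the gold data
        return 'entity-entity connection' if is_entity_pair else 'false positive'
    (gold_dir, gold_cat), (pred_dir, pred_cat) = gold, pred
    if gold_dir == pred_dir:
        return 'type error'       # same direction, so the category must differ
    if gold_cat == pred_cat:
        return 'direction error'  # same category, wrong direction
    return 'other'                # both differ (not one of the four types)


print(classify_error(('1->2', 'Theme'), ('1->2', 'Cause'), False))
print(classify_error(None, ('1->2', 'Theme'), True))
```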

In [None]:
!python error_analysis.py --config config.yaml --task re --gold ../data/MLEE_val.json --pred ../saved/re-roberta-base-val_preds.json

As we can observe, the most common cause of errors is false positives.
The second most common error is the type error, while the model learns quite well to distinguish the direction of the relation and that entity-entity pairs shouldn't be related.

### Error Examples

We show a few examples of model errors on the validation set.

## Step 4e: Visualising Network internals



In [None]:
%cd /content/ai4health-nactem/src/

import torch
from transformers import AutoTokenizer
from utils import *
import yaml
import yamlordereddictloader
from bertviz import head_view
from models.re_trainer import REtrainer
from models.re import REmodelPair
from models.re_dataset import REdataset
from torch.utils.data import DataLoader

with open('/content/ai4health-nactem/src/config.yaml', 'r') as f:
  config = yaml.load(f, Loader=yamlordereddictloader.Loader)

config = dict(config)
config['labels'], config['new_tokens'] = re_labels(config)
device = torch.device("cuda:{}".format(config['gpu']) if config['gpu'] != -1 else "cpu")
config['device'] = device
config['task'] = 're'

tokenizer = AutoTokenizer.from_pretrained(
    config['model_name'],
    add_prefix_space=True,
    additional_special_tokens=config['new_tokens']
)

train_data = REdataset(config, tokenizer, 'train')
train_loader = DataLoader(dataset=train_data,
                          batch_size=config['batch_size'],
                          shuffle=True,
                          collate_fn=train_data.collate)

model = REmodelPair(config)
trainer = REtrainer(config, model, None, None, {'train': train_loader})
trainer.load_model()

example_sentence = "Regulation of the composition of the @Cellular_component$ by @Gene_or_gene_product$ : activities based on regulation of mRNA expression ."
inputs = tokenizer.encode_plus(example_sentence, example_sentence, return_tensors='pt', add_special_tokens=False)
input_ids = inputs['input_ids']
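# An encoder-only model has no cross-attention; request the self-attention weights instead
attention = trainer.model.encoder(inputs['input_ids'], attention_mask=inputs['attention_mask'], output_attentions=True).attentions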
input_id_list = input_ids[0].tolist() # Batch index 0
tokens = tokenizer.convert_ids_to_tokens(input_id_list)

call_html()
head_view(attention, tokens)

# Step 5: Event Extraction

The final task (for this session) is to learn how to combine binary relations, such as the relation pairs identified in the previous section, into **event structures**.

We treat this as a binary classification task: For each generated event structure the model predicts whether it is valid (label = 1) or invalid (label = 0).

## Step 5a: Event Structure Representation

For the scope of this session, we consider an event to be a set of relation (argument) tuples linking NEs to the same trigger entity:

> Event = {ARG_1, ARG_2, ..., ARG_n}
>> ARG_n = (\<trigger\>, \<argument_n\>, \<ROLE\>)
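One way to mirror this set-based definition in code is a `frozenset` of tuples, which makes the representation independent of argument order. This is a minimal sketch; the helper `make_event` and the entity ids are hypothetical.

```python
def make_event(trigger, arguments):
    """Represent an event as a set of (trigger, argument, role) tuples."""
    return frozenset((trigger, arg, role) for arg, role in arguments)


event_a = make_event('T1|Regulation', [('T2|Cellular_component', 'Theme'),
                                       ('T3|Gene_or_gene_product', 'Cause')])
event_b = make_event('T1|Regulation', [('T3|Gene_or_gene_product', 'Cause'),
                                       ('T2|Cellular_component', 'Theme')])

# The order in which the arguments are listed does not matter
print(event_a == event_b)
```

Note that an event without arguments would become an empty set under this sketch, so the trigger would have to be stored separately in that case.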

Revisiting our sample sentence:









In [None]:
import json
with open('../data/MLEE_train.json', 'r') as f:
    example = json.loads(f.readline())

example_sentence = example['sentence']
entity_dict={}
for entity in example['entities']:
  eid = entity['id']
  entity_dict[eid] = entity
example_event = example['events'][0]
for arg in example_event['arguments']:
  arg['type']=entity_dict[arg['argument']]['type']
  arg['id']=arg['argument']

trigger = example_event['trigger']+"|"+example_event['event_type']
args = []
for arg in example_event['arguments']:
  arg_rep = arg['id']+"|"+arg['type']
  args.append(tuple([trigger,arg_rep, arg['role']]))

img = cv2.imread(example_image, cv2.IMREAD_UNCHANGED)
cv2_imshow(img)
print()
print('\033[1m' + str(example_event['id']+'-->'+str(args)) + '\033[0m')

### Nested events:
Flat events are easy to represent directly as described above.

However, that representation ignores the sub-structure when an argument is itself another event trigger.
For a nested event such as the following, we would have:


Full event representation:




In [None]:
example_image_n = '/content/ai4health-nactem/images/example_nested.png'
img = cv2.imread(example_image_n, cv2.IMREAD_UNCHANGED)
cv2_imshow(img)

Our simplified event representation however would be:





In [None]:
example_image_nf = '/content/ai4health-nactem/images/example_nested_flat.png'
img = cv2.imread(example_image_nf, cv2.IMREAD_UNCHANGED)
cv2_imshow(img)
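The simplification can be sketched in a few lines: whenever an argument points to another event, it is replaced by that event's trigger. The field names follow the JSON used in this notebook (`id`, `trigger`, `arguments`, `argument`, `role`), while the helper `flatten_arguments` and the ids are hypothetical.

```python
def flatten_arguments(event, event_triggers):
    """Replace event-valued arguments by the trigger of the nested event."""
    flat = []
    for arg in event['arguments']:
        # Entity arguments are kept; event ids are mapped to their trigger
        target = event_triggers.get(arg['argument'], arg['argument'])
        flat.append((event['trigger'], target, arg['role']))
    return flat


events = [
    {'id': 'E1', 'trigger': 'T1',
     'arguments': [{'argument': 'T2', 'role': 'Theme'}]},
    {'id': 'E2', 'trigger': 'T3',
     'arguments': [{'argument': 'E1', 'role': 'Theme'},
                   {'argument': 'T4', 'role': 'Cause'}]},
]
event_triggers = {ev['id']: ev['trigger'] for ev in events}
print(flatten_arguments(events[1], event_triggers))
```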

As such, the following structures would be equivalent:

In [None]:
example_image_np = '/content/ai4health-nactem/images/example_nested_perm.png'
img = cv2.imread(example_image_np, cv2.IMREAD_UNCHANGED)
cv2_imshow(img)

## Step 5b: Event Structure Generation

How do we generate training instances? Some concerns:


*   In different sentences the same event type may have a **varying number of arguments**
*   Some valid events have **no arguments**
*   The same event type may have **different role-entity** argument pairs

**Solution 1:** Exhaustive \\
(Argument/role agnostic) \\
Generate exhaustively all possible (Trigger, Entity, Role) triplets for each sentence

Training set: 500K instances / 0.1% positive
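To see why the exhaustive strategy produces so many (mostly negative) instances, here is a minimal sketch that enumerates every (Trigger, Entity, Role) triplet for one sentence. The ids, the role list and the rule that a trigger cannot be its own argument are assumptions for illustration.

```python
from itertools import product


def exhaustive_triplets(triggers, entities, roles):
    """Enumerate all (trigger, entity, role) triplets for one sentence."""
    return [(t, e, r) for t, e, r in product(triggers, entities, roles)
            if t != e]  # assumption: a trigger is never its own argument


triggers = ['T1', 'T3']
entities = ['T1', 'T2', 'T3', 'T4']   # triggers can also act as arguments
roles = ['Theme', 'Cause', 'ToLoc']

triplets = exhaustive_triplets(triggers, entities, roles)
print(len(triplets))
```

Even this tiny sentence yields 18 triplets, and the count grows with the product of the three lists, which is consistent with the very low positive rate reported above.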





**Hint:** Some event types never take certain role | entity type combinations as arguments

**Solution 2:** Valid template generator \\
(Argument aware) \\
For each event type, generate all possible argument combinations that have appeared in the dataset at least once.

Training set: 194K instances / 2% positive



```python
import json
import logging
from itertools import permutations
from operator import itemgetter


def event_templates(config):
    """
    Construct templates for events
    """
    templates = {}
    unique_templates = {}
    total_templates = 0
    total_unique_templates = 0
    non_existent_arg = 0
    for mode in ['train', 'val', 'test']:
        with open(config[mode + '_data'], 'r') as infile:
            for line in infile:
                data = json.loads(line)
                e_type = {e['id']: e['type'] for e in data['entities']}
                event_trig_map = {ev['id']: ev['trigger'] for ev in data['events']}

                for e in data['events']:
                    structure = []

                    if e['event_type'] not in templates:
                        templates[e['event_type']] = []
                        unique_templates[e['event_type']] = []

                    for arg in e['arguments']:
                        role = arg['role'].replace('1', '').replace('2', '').replace('3', '').replace('4', '')

                        if arg['argument'].startswith('T'):
                            structure.append((role, e_type[arg['argument']]))
                        else:
                            if arg['argument'] in event_trig_map:
                                structure.append((role, e_type[event_trig_map[arg['argument']]]))
                            else:
                                structure = []  # exclude event
                                non_existent_arg += 1

                    if structure:
                        perm_structure = list(permutations(structure, len(structure)))
                        templates[e['event_type']] += perm_structure
                        unique_templates[e['event_type']].append(tuple(structure))

    # remove duplicates
    for type_ in templates:
        templates[type_] = list(set(templates[type_]))
        # unique_templates[type_] = list(set(unique_templates[type_]))

        total_templates += len(templates[type_])

        unique_templates[type_].sort(key=itemgetter(0))
        unique_templates[type_] = list(set(unique_templates[type_]))
        total_unique_templates += len(unique_templates[type_])

    logging.info('Total Templates: {}'.format(total_templates))
    logging.info('Total Unique Templates: {}'.format(total_unique_templates))

    return unique_templates
```
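Once the templates are available, a candidate structure can be kept only if its (role, entity type) pairs match a template of the same event type. The following is a hedged usage sketch with a hand-written `templates` dictionary in the same shape as the return value above; the helper `matches_template` is hypothetical and not part of the repository.

```python
def matches_template(event_type, structure, templates):
    """Check whether a candidate's (role, entity type) pairs match a known template."""
    candidate = sorted(structure)
    # Compare order-insensitively against every template of this event type
    return any(sorted(t) == candidate for t in templates.get(event_type, []))


templates = {
    'Regulation': [(('Theme', 'Cellular_component'),
                    ('Cause', 'Gene_or_gene_product'))],
}

ok = matches_template('Regulation',
                      [('Cause', 'Gene_or_gene_product'),
                       ('Theme', 'Cellular_component')],
                      templates)
bad = matches_template('Regulation', [('ToLoc', 'Cell')], templates)
print(ok, bad)
```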



Sub-case: sample randomly from the generated negative instances






Training set: 25K instances / 16% positive
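Random subsampling of the negatives could look like the sketch below. The instance format (a dict with a `label` field), the function name and the fixed seed are assumptions; the repository may sample differently.

```python
import random


def subsample_negatives(instances, num_negatives, seed=0):
    """Keep all positive instances and a random subset of the negative ones."""
    positives = [x for x in instances if x['label'] == 1]
    negatives = [x for x in instances if x['label'] == 0]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    kept = rng.sample(negatives, min(num_negatives, len(negatives)))
    return positives + kept


instances = [{'id': i, 'label': 1 if i < 2 else 0} for i in range(100)]
sampled = subsample_negatives(instances, num_negatives=10)
print(len(sampled), sum(x['label'] for x in sampled))
```

In this toy setting the positive rate rises from 2 in 100 to 2 in 12, which is the same effect that raises the positive rate of the training set.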

**Hint:** Assuming a perfect RE model, the EE model would only have to choose among partial events

**Solution 3:** Partial event generator \\
(Event aware) \\
For each event, generate all possible partial sub events (including a no-argument instance). 

Training set: 11K instances / 30% positive
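Generating the partial sub-events of a gold event amounts to enumerating the subsets of its arguments. A minimal sketch, assuming arguments are given as `(argument, role)` pairs; how each candidate is labelled is left out, since that depends on the gold events of the sentence.

```python
from itertools import combinations


def partial_events(trigger, arguments):
    """All sub-events of an event, from no arguments up to the full argument set."""
    candidates = []
    for size in range(len(arguments) + 1):
        for combo in combinations(arguments, size):
            candidates.append((trigger, combo))
    return candidates


arguments = [('T2', 'Theme'), ('T3', 'Cause')]
for candidate in partial_events('T1', arguments):
    print(candidate)
```

An event with n arguments gives 2^n candidates, far fewer than the exhaustive or template-based strategies produce for the same sentence.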


## Step 5c: Event Structure Encoding




*   Combine entity and role information in event encoding:
*   Identify entities (including triggers) in sentence and replace with dedicated \<type\> tokens.
*   Maintain sentence splitting
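The replacement step can be sketched as follows. The span offsets (token-level, end-exclusive, non-overlapping) and the helper `replace_with_types` are assumptions for illustration; the sentence is the running example of this notebook.

```python
def replace_with_types(tokens, spans):
    """Replace each (start, end, type) token span by a single '[type]' token."""
    starts = {start: (end, type_) for start, end, type_ in spans}
    out, i = [], 0
    while i < len(tokens):
        if i in starts:
            end, type_ = starts[i]
            out.append('[{}]'.format(type_))
            i = end  # skip the tokens covered by the span
        else:
            out.append(tokens[i])
            i += 1
    return out


sentence = ('Regulation of the composition of the extracellular matrix by '
            'low density lipoprotein receptor - related protein - 1 : '
            'activities based on regulation of mRNA expression .').split()
spans = [(0, 1, 'Regulation'),
         (6, 8, 'Cellular_component'),
         (9, 18, 'Gene_or_gene_product')]
print(replace_with_types(sentence, spans))
```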

Original sentence:




In [None]:

sentence_after = ['[Regulation]', 'of', 'the', 'composition', 'of', 'the', '[Cellular_component]', 'by', '[Gene_or_gene_product]', ':', 'activities', 'based', 'on', 'regulation', 'of', 'mRNA', 'expression', '.']
sentence_before = ['Regulation', 'of', 'the', 'composition', 'of', 'the', 'extracellular', 'matrix', 'by', 'low', 'density', 'lipoprotein', 'receptor', '-', 'related', 'protein', '-', '1', ':', 'activities', 'based', 'on', 'regulation', 'of', 'mRNA', 'expression', '.']
print('Word-split sentence:')
print(sentence_before)
print('Triggers | Entities replaced:')
print(sentence_after)

In [None]:
config['new_event_tokens'], config['new_tokens'] = event_specific_tokens(config)
tokenizer = AutoTokenizer.from_pretrained(
            config['model_name'],
            add_prefix_space=True,
            additional_special_tokens=config['new_tokens']
        )
tokenized_sentence = tokenizer.tokenize(
                              sentence_after, 
                              is_split_into_words=True, 
                              add_special_tokens=True)

### EE module architecture

## Step 5d: EE Module





```python
class EEmodel(nn.Module):
    def __init__(self, config):
        super(EEmodel, self).__init__()
        self.config = config
        self.num_args = config['num_args']
        self.num_labels = 2
        configuration = AutoConfig.from_pretrained(config['model_name'],
                                                   num_labels=self.num_labels)

        self.role_embed = nn.Embedding(num_embeddings=len(self.config['roles']),
                                       embedding_dim=self.config['embedding_dim'])

        self.encoder = AutoModel.from_pretrained(config['model_name'],
                                                 config=configuration)
        self.encoder.resize_token_embeddings(
            configuration.vocab_size + len(config['new_tokens']))

        self.event_classifier = nn.Linear(configuration.hidden_size, 2)
        self.arg_layer = nn.Linear(2*configuration.hidden_size+config['embedding_dim'], configuration.hidden_size)
        self.loss_fct = nn.CrossEntropyLoss()


    def forward(self, seqs):
        outputs = self.encoder(seqs['input_ids'],
            attention_mask=seqs['attention_mask'])
       
        labels = seqs['labels']
        arglen = seqs['arglen'][0]
        rows = torch.arange(seqs['input_ids'].size(0)).long().to(self.config['device'])
        trigger = outputs.last_hidden_state[rows, seqs['tokens'][:, 1]]
        event_rep = trigger
     
        # Build one representation per argument from (trigger, argument, role)
        arg_repi = []
        for i in range(arglen):
            argument = outputs.last_hidden_state[rows, seqs['tokens'][:, i+2]]
            role_embedding = self.role_embed(seqs['roles'][:, i])
            arg_l = torch.cat([trigger, argument, role_embedding], dim=1)
            arg_repi.append(self.arg_layer(arg_l).unsqueeze(1))

        # Average the argument representations; a no-argument event keeps the trigger alone
        if arg_repi:
            event_rep = torch.mean(torch.cat(arg_repi, dim=1), dim=1)
        
        logits = self.event_classifier(event_rep)
        loss = self.loss_fct(logits, labels)
        
        return logits, loss
```




## Step 5e: Performance

## Step 5f: Error Analysis

# Step 6: Pipeline Performance

Now that all 3 components have been completed and trained, we can use them in a pipeline scenario in order to do end-to-end Event Extraction!

The NER component will generate predictions for named entities and triggers in each sentence, which we can feed into the relation component.

The following script will evaluate the RE component when using **predicted** named entities.

What do you notice?

In [None]:
!python evaluation.py --config config.yaml --task re --gold ../data/MLEE_val.json --pred ../saved/re-roberta-base-ner_val_preds.json 

The performance of the model drops significantly, by almost 27\%!
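One likely contributor to such a drop is error propagation: a relation can only be predicted if NER found both of its arguments. The sketch below, on made-up ids, computes the resulting ceiling on RE recall; it is an illustration of the mechanism, not a measurement on MLEE.

```python
def reachable_relations(gold_relations, predicted_entities):
    """Gold relations whose two arguments were both found by NER."""
    found = set(predicted_entities)
    return [r for r in gold_relations if r[0] in found and r[1] in found]


gold_relations = [('T1', 'T2', 'Theme'),
                  ('T1', 'T3', 'Cause'),
                  ('T4', 'T5', 'Theme')]
predicted_entities = ['T1', 'T2', 'T4', 'T5']  # NER missed T3

reachable = reachable_relations(gold_relations, predicted_entities)
ceiling = len(reachable) / len(gold_relations)
print('RE recall ceiling: {:.2f}'.format(ceiling))
```

Here a single missed entity already caps recall at two out of three gold relations, before the RE model makes any mistake of its own.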