# Entity Relation Extraction using R-BERT

> Indented block





In this notebook, entity relations are extracted from medical data about diseases.
Bert is customized for the task, using methods from the following paper:

Enriching Pre-trained Language Model with Entity Information for Relation Classification https://arxiv.org/abs/1905.08284.

The code for the implementation was borrowed from: https://github.com/wang-h/bert-relation-classification

This custom R-Bert model is then fine tuned for our data and used to predict what kind of relationships hold between entities in our test set.


Data: https://www.kaggle.com/kmader/figure-eight-medical-sentence-summary







Table of contents

1. Install dependencies, import modules and load helper functions.
2. Read the data as features.
3. Convert features to tensors.
4. Change the bert class for relation extraction
5. Prepare and load the model
6. Train the model!
7. Load the trained model.
8. Evaluate!

In [98]:
# Connect to google drive (where the data is, to access it):
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


# 1. Install dependencies, import modules and load helper functions


In [8]:
! pip install pytorch-transformers #

Collecting pytorch-transformers
[?25l  Downloading https://files.pythonhosted.org/packages/a3/b7/d3d18008a67e0b968d1ab93ad444fc05699403fa662f634b2f2c318a508b/pytorch_transformers-1.2.0-py3-none-any.whl (176kB)
[K     |█▉                              | 10kB 26.7MB/s eta 0:00:01[K     |███▊                            | 20kB 2.1MB/s eta 0:00:01[K     |█████▋                          | 30kB 2.8MB/s eta 0:00:01[K     |███████▍                        | 40kB 2.0MB/s eta 0:00:01[K     |█████████▎                      | 51kB 2.3MB/s eta 0:00:01[K     |███████████▏                    | 61kB 2.7MB/s eta 0:00:01[K     |█████████████                   | 71kB 2.9MB/s eta 0:00:01[K     |██████████████▉                 | 81kB 3.1MB/s eta 0:00:01[K     |████████████████▊               | 92kB 3.5MB/s eta 0:00:01[K     |██████████████████▋             | 102kB 3.3MB/s eta 0:00:01[K     |████████████████████▍           | 112kB 3.3MB/s eta 0:00:01[K     |██████████████████████▎     

In [0]:
# Classes for storing individual sentences:

class InputExample(object):
    """A single training/test example for simple sequence classification."""

    def __init__(self, guid, text_a, text_b=None, label=None):
        """Constructs a InputExample.

        Args:
            guid: Unique id for the example.
            text_a: string. The untokenized text of the first sequence. For single
            sequence tasks, only this sequence must be specified.
            text_b: (Optional) string. The untokenized text of the second sequence.
            Only must be specified for sequence pair tasks.
            label: (Optional) string. The label of the example. This should be
            specified for train and dev examples, but not for test examples.
        """
        self.guid = guid
        self.text_a = text_a
        self.text_b = text_b
        self.label = label

class InputFeatures(object):
    """A single set of features of data."""

    def __init__(self,
                 input_ids,
                 input_mask,
                 e11_p, e12_p, e21_p, e22_p,
                 e1_mask, e2_mask,
                 segment_ids,
                 label_id):
        self.input_ids = input_ids
        self.input_mask = input_mask
        self.segment_ids = segment_ids
        self.label_id = label_id

        #add enitity position and entity mask for BERT
        self.e11_p = e11_p
        self.e12_p = e12_p
        self.e21_p = e21_p
        self.e22_p = e22_p
        self.e1_mask = e1_mask
        self.e2_mask = e2_mask
        
    def print_contents(self):
        print(self.input_ids,self.input_mask,self.segment_ids, self.label_id,
        self.e11_p,self.e12_p,self.e21_p,
        self.e22_p,self.e1_mask, self.e2_mask)

In [0]:
# Functions for reading in the data:

import csv
import sys 
import logging

logger = logging.getLogger(__name__)

def read_tsv(input_file, quotechar=None):
    """Reads a tab separated value file."""
    with open(input_file, "r", encoding="utf-8-sig") as f:
        reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
        lines = []
        for line in reader:
            if sys.version_info[0] == 2:
                line = list(cell for cell in line)
            lines.append(line)
        return lines
      
def create_examples(lines, set_type):
    """Creates examples for the training and test sets.
  
    $AZATHIOPRINE$ is an immunosuppressive drug that is used to treat #RHEUMATOID ARTHRITIS#	8	treats2	treats1	2
    
    $ denotes first entity, # denotes second entitiy, 8 denotes type of relation and 2 denotes direction
    """
    examples = []
    for (i, line) in enumerate(lines):

        guid = "%s-%s" % (set_type, i)
        logger.info(line)
        text_a = line[1]
        text_b = None
        label = line[2]
        examples.append(
            InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
    return examples

def get_train_examples(data_dir):
    logger.info("LOOKING AT {}".format(
        os.path.join(data_dir, "train.tsv")))
    return create_examples(
        read_tsv(os.path.join(data_dir, "train.tsv")), "train")
    

def get_test_examples(data_dir):
    return create_examples(
        read_tsv(os.path.join(data_dir, "test.tsv")), "test")

# 2. Read in the data and convert to features

In [31]:
from pytorch_transformers import WEIGHTS_NAME, BertConfig, BertTokenizer

# Configuration parameters:
use_entity_indicator=True
max_seq_len=176

tokenizer = BertTokenizer.from_pretrained(
        'bert-base-uncased', do_lower_case=True)

n_labels = 18
labels = [str(i) for i in range(n_labels)]


100%|██████████| 231508/231508 [00:00<00:00, 2746269.35B/s]


In [0]:
# BERT Class for converting the input to features according to the required input form
def convert_examples_to_features(examples, label_list, max_seq_len,
                                 tokenizer,
                                 cls_token='[CLS]',
                                 cls_token_segment_id=1,
                                 sep_token='[SEP]',
                                 pad_token=0,
                                 pad_token_segment_id=0,
                                 sequence_a_segment_id=0,
                                 sequence_b_segment_id=1,
                                 mask_padding_with_zero=True):
    ''' In: sentences with entities marked by $$ and ## around them
      Out: sentence represented as object of the InputFeature class '''

    label_map = {label: i for i, label in enumerate(label_list)}

    features = []
    for (ex_index, example) in enumerate(examples):
        if ex_index % 10000 == 0:
            logger.info("Writing example %d of %d" % (ex_index, len(examples)))

        tokens_a = tokenizer.tokenize(example.text_a)
        
        #convert the entity information to features as well
        l = len(tokens_a)
        
        # the start position of entity1:
        e11_p = tokens_a.index("#") + 1  
        # the end position of entity1
        e12_p = l - tokens_a[::-1].index("#") + 1  
        # the start position of entity2
        e21_p = tokens_a.index("$") + 1  
        # the end position of entity2
        e22_p = l - tokens_a[::-1].index("$") + 1 

        tokens_b = None

        if example.text_b:
            tokens_b = tokenizer.tokenize(example.text_b)
            # Modifies `tokens_a` and `tokens_b` in place so that the total
            # length is less than the specified length.
            # Account for [CLS], [SEP], [SEP] with "- 3".
            special_tokens_count = 3
            _truncate_seq_pair(tokens_a, tokens_b,
                               max_seq_len - special_tokens_count)
        else:
            # Account for [CLS] and [SEP] with "- 2" and with "
            special_tokens_count = 2
            if len(tokens_a) > max_seq_len - special_tokens_count:
                tokens_a = tokens_a[:(max_seq_len - special_tokens_count)]

        # The convention in BERT is:
        # (a) For sequence pairs:
        #  tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
        #  type_ids:   0   0  0    0    0     0       0   0   1  1  1  1   1   1
        # (b) For single sequences:
        #  tokens:   [CLS] the dog is hairy . [SEP]
        #  type_ids:   0   0   0   0  0     0   0
        #
        # Where "type_ids" are used to indicate whether this is the first
        # sequence or the second sequence. The embedding vectors for `type=0` and
        # `type=1` were learned during pre-training and are added to the wordpiece
        # embedding vector (and position vector). This is not *strictly* necessary
        # since the [SEP] token unambiguously separates the sequences, but it makes
        # it easier for the model to learn the concept of sequences.
        #
        # For classification tasks, the first vector (corresponding to [CLS]) is
        # used as as the "sentence vector". Note that this only makes sense because
        # the entire model is fine-tuned.
        tokens = tokens_a + [sep_token]
        segment_ids = [sequence_a_segment_id] * len(tokens)

        if tokens_b:
            tokens += tokens_b + [sep_token]
            segment_ids += [sequence_b_segment_id] * (len(tokens_b) + 1)

        tokens = [cls_token] + tokens
        segment_ids = [cls_token_segment_id] + segment_ids

        input_ids = tokenizer.convert_tokens_to_ids(tokens)

        # The mask has 1 for real tokens and 0 for padding tokens. Only real
        # tokens are attended to.
        input_mask = [1 if mask_padding_with_zero else 0] * len(input_ids)

        # Zero-pad up to the sequence length.
        padding_length = max_seq_len - len(input_ids)
        input_ids = input_ids + ([pad_token] * padding_length)
        input_mask = input_mask + \
                     ([0 if mask_padding_with_zero else 1] * padding_length)
        segment_ids = segment_ids + \
                      ([pad_token_segment_id] * padding_length)

        #add attention mask for entities as well
        e1_mask = [0 for i in range(len(input_mask))]

        e2_mask = [0 for i in range(len(input_mask))]

        for i in range(e11_p, e12_p):
            e1_mask[i] = 1
        for i in range(e21_p, e22_p):
            e2_mask[i] = 1

        assert len(input_ids) == max_seq_len
        assert len(input_mask) == max_seq_len
        assert len(segment_ids) == max_seq_len

        label_id = int(example.label)

        if ex_index < 5:
            logger.info("*** Example ***")
            logger.info("guid: %s" % (example.guid))
            logger.info("tokens: %s" % " ".join(
                [str(x) for x in tokens]))
            logger.info("input_ids: %s" %
                        " ".join([str(x) for x in input_ids]))
            logger.info("input_mask: %s" %
                        " ".join([str(x) for x in input_mask]))
            if use_entity_indicator:
                logger.info("e11_p: %s" % e11_p)
                logger.info("e12_p: %s" % e12_p)
                logger.info("e21_p: %s" % e21_p)
                logger.info("e22_p: %s" % e22_p)
                logger.info("e1_mask: %s" %
                            " ".join([str(x) for x in e1_mask]))
                logger.info("e2_mask: %s" %
                            " ".join([str(x) for x in e2_mask]))
            logger.info("segment_ids: %s" %
                        " ".join([str(x) for x in segment_ids]))
            logger.info("label: %s (id = %d)" % (example.label, label_id))

        features.append( InputFeatures(input_ids=input_ids,input_mask=input_mask,e11_p=e11_p,e12_p=e12_p, e21_p=e21_p, e22_p=e22_p,
                          e1_mask=e1_mask,e2_mask=e2_mask, segment_ids=segment_ids,label_id=label_id))
    return features

In [0]:
import os

# Get the training data from the data folder, hosted on google drive:
data_folder = '/content/gdrive/My Drive/Colab Notebooks/data/'
examples = get_train_examples(data_folder)
features = convert_examples_to_features(
    examples, labels, max_seq_len, tokenizer)

*Convert* the features to tensors and make a tensor data set

In [0]:
import torch 
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler,TensorDataset

all_input_ids = torch.tensor(
        [f.input_ids for f in features], dtype=torch.long)
all_input_mask = torch.tensor(
    [f.input_mask for f in features], dtype=torch.long)
all_segment_ids = torch.tensor(
    [f.segment_ids for f in features], dtype=torch.long)

#also for entities
all_e1_mask = torch.tensor(
    [f.e1_mask for f in features], dtype=torch.long)
all_e2_mask = torch.tensor(
    [f.e2_mask for f in features], dtype=torch.long) 

all_label_ids = torch.tensor(
        [f.label_id for f in features], dtype=torch.long)

dataset = TensorDataset(all_input_ids, all_input_mask,
                            all_segment_ids, all_label_ids, all_e1_mask, all_e2_mask)

# 3. Preparing the model

In [0]:
# Configuration parameters:

# batch size (low to save memory):
per_gpu_train_batch_size = 4
n_gpu = torch.cuda.device_count()

# the base BERT model (smaller, to save memory)
pretrained_model_name='bert-base-uncased'

# parameters for gradient descent:
max_steps=-1
gradient_accumulation_steps=1 

# Number of training epochs:
num_train_epochs=5.0

# Name of task for Bert:
task_name = 'semeval'

# hyperparameter for regularization
l2_reg_lambda=5e-3
local_rank=-1
no_cuda=False

train_batch_size = per_gpu_train_batch_size * \
        max(1, n_gpu)

# For sampling during the training:
train_sampler = RandomSampler(dataset)
train_dataloader = DataLoader(
        dataset, sampler=train_sampler, batch_size=train_batch_size)

# total number of steps for training:
t_total = len(train_dataloader) // gradient_accumulation_steps * num_train_epochs

# 4. Load the Bert customized for relation extraction (R-Bert)

In [0]:
import torch.nn as nn
import torch.nn.functional as F
from pytorch_transformers import (BertModel, BertPreTrainedModel, BertTokenizer)
from torch.nn import MSELoss, CrossEntropyLoss

def l2_loss(parameters):
  '''Calculates L2 loss (euclidian length) of 'parameters' vector.'''
  return torch.sum(   torch.tensor([torch.sum(p ** 2) / 2 for p in parameters if p.requires_grad ]))


# Huggingface Transformers Class for BERT Sequence Classification
class BertForSequenceClassification(BertPreTrainedModel):
    """
        **labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``:
            Labels for computing the sequence classification/regression loss.
            Indices should be in ``[0, ..., config.num_labels - 1]``.
            If ``config.num_labels == 1`` a regression loss is computed (Mean-Square loss),
            If ``config.num_labels > 1`` a classification loss is computed (Cross-Entropy).

    Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
        **loss**: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:
            Classification (or regression if config.num_labels==1) loss.
        **logits**: ``torch.FloatTensor`` of shape ``(batch_size, config.num_labels)``
            Classification (or regression if config.num_labels==1) scores (before SoftMax).
        **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
            list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
            of shape ``(batch_size, sequence_length, hidden_size)``:
            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
        **attentions**: (`optional`, returned when ``config.output_attentions=True``)
            list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

    Examples::

        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
        model = BertForSequenceClassification.from_pretrained(
            'bert-base-uncased')
        input_ids = torch.tensor(tokenizer.encode(
            "Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
        labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
        outputs = model(input_ids, labels=labels)
        loss, logits = outputs[:2]

    """

    def __init__(self, config):
        super(BertForSequenceClassification, self).__init__(config)
        self.num_labels = config.num_labels
        self.l2_reg_lambda = config.l2_reg_lambda
        self.bert = BertModel(config)
        self.latent_entity_typing = config.latent_entity_typing
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        classifier_size = config.hidden_size*3
        self.classifier = nn.Linear(
            classifier_size, self.config.num_labels)
        self.latent_size = config.hidden_size
        self.latent_type = nn.Parameter(torch.FloatTensor(
            3, config.hidden_size), requires_grad=True)

        self.init_weights()

    # Customized forward step, for relation extraction
    # Does the extra steps required, as described in the paper.
    # Enriching Pre-trained Language Model with Entity Information for Relation Classification https://arxiv.org/abs/1905.08284.

    def forward(self, input_ids, token_type_ids=None, attention_mask=None, e1_mask=None, e2_mask=None, labels=None,
                position_ids=None, head_mask=None):

        outputs = self.bert(input_ids, position_ids=position_ids, token_type_ids=token_type_ids,
                            attention_mask=attention_mask, head_mask=head_mask)
        pooled_output = outputs[1]
        sequence_output = outputs[0]

        def extract_entity(sequence_output, e_mask):
            extended_e_mask = e_mask.unsqueeze(1)
            extended_e_mask = torch.bmm(
                extended_e_mask.float(), sequence_output).squeeze(1)
            return extended_e_mask.float()

        e1_h = extract_entity(sequence_output, e1_mask)
        e2_h = extract_entity(sequence_output, e2_mask)
        context = self.dropout(pooled_output)
        pooled_output = torch.cat([context, e1_h, e2_h], dim=-1)

        # Extra logit layer on top of BERT,  in order to do relation extraction:
        logits = self.classifier(pooled_output)

        # add hidden states and attention
        outputs = (logits,) + outputs[2:]

        device = logits.get_device()
        l2 = l2_loss(self.parameters())

        if device >= 0:
            l2 = l2.to(device)
        loss = l2 * self.l2_reg_lambda
        if labels is not None:

            # transform to plausible probabilities,  between 0 and 1:            
            probabilities = F.softmax(logits, dim=-1)
            log_probs = F.log_softmax(logits, dim=-1)

            # Do one hot encoding:
            one_hot_labels = F.one_hot(labels, num_classes=self.num_labels)
            if device >= 0:
                one_hot_labels = one_hot_labels.to(device)

            # Calculate loss:
            dist = one_hot_labels[:, 1:].float() * log_probs[:, 1:]
            example_loss_except_other, _ = dist.min(dim=-1)
            per_example_loss = - example_loss_except_other.mean()

            rc_probabilities = probabilities - probabilities * one_hot_labels.float()
            second_pre,  _ = rc_probabilities[:, 1:].max(dim=-1)
            rc_loss = - (1 - second_pre).log().mean()

            loss += per_example_loss + 5 * rc_loss

            outputs = (loss,) + outputs

        return outputs  # (loss), logits, (hidden_states), (attentions)

In [41]:
# Make config variable for the model:
bertconfig = BertConfig.from_pretrained(
        pretrained_model_name, num_labels=n_labels, finetuning_task=task_name)

bertconfig.l2_reg_lambda = l2_reg_lambda
bertconfig.latent_entity_typing = False
bertconfig.num_classes = n_labels

# Load the model:
model = BertForSequenceClassification.from_pretrained(
        pretrained_model_name, config=bertconfig)

100%|██████████| 361/361 [00:00<00:00, 101158.72B/s]
100%|██████████| 440473133/440473133 [00:06<00:00, 70873575.69B/s]


# 5. Get ready for training

In [0]:
# Prepare optimizer and schedule (linear warmup and decay)

from pytorch_transformers import AdamW, WarmupLinearSchedule

# Hyperparameters for the optimizer:
max_grad_norm = 1.0
learning_rate=2e-5
adam_epsilon=1e-8
warmup_steps=0
weight_decay=0.9


no_decay = ['bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in model.named_parameters()
                if not any(nd in n for nd in no_decay)], 'weight_decay': weight_decay},
    {'params': [p for n, p in model.named_parameters()
                if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]

# Load optimizer and scheduler:
optimizer = AdamW(optimizer_grouped_parameters,
                  lr=learning_rate, eps=adam_epsilon)
scheduler = WarmupLinearSchedule(
    optimizer, warmup_steps=warmup_steps, t_total=t_total)

# Parallelize in case we have multiple GPUs:
if n_gpu > 1:
    model = torch.nn.DataParallel(model)

In [48]:
# Prepare for trainig:
from tqdm import tqdm, trange
import random
import numpy as np

#  Random seed for reproducability
def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

global_step = 0
tr_loss, logging_loss = 0.0, 0.0
model.zero_grad()
train_iterator = trange(int(num_train_epochs),
                        desc="Epoch", disable=local_rank not in [-1, 0])



Epoch:   0%|          | 0/5 [00:00<?, ?it/s]

In [0]:
# put the model to the device
device = torch.device("cuda" if torch.cuda.is_available() and not no_cuda else "cpu")
model.to(device)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

# 6. Train!

In [0]:
# Loops through the training set for a few epochs and backpropagate

# Collect the loss values:
loss_values = []

seed = 123456
set_seed(seed)

for _ in train_iterator:
    epoch_iterator = tqdm(train_dataloader, desc="Iteration",
                          disable=local_rank not in [-1, 0])
    
    # For each epoch,  split into batches and train!

    for step, batch in enumerate(epoch_iterator):
        model.train()
        batch = tuple(t.to(device) for t in batch)
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'token_type_ids': batch[2],
                  'labels':      batch[3],
                  'e1_mask': batch[4],
                  'e2_mask': batch[5],
                  }

        outputs = model(**inputs)
        # model outputs are always tuple in transformers
        
        loss = outputs[0]

        # Collect the loss:
        loss_values.append(loss)
        
        if n_gpu > 1:
            loss = loss.mean()  
            # mean() to average on multi-gpu parallel training
        if gradient_accumulation_steps > 1:
            loss = loss / gradient_accumulation_steps
        
        # Back propagate
        loss.backward()
        torch.nn.utils.clip_grad_norm_(
            model.parameters(), max_grad_norm)

        tr_loss += loss.item()
        if (step + 1) % gradient_accumulation_steps == 0:

            # Take a step! 
            optimizer.step()
            scheduler.step()              
            # Update learning rate schedule
            model.zero_grad()
            global_step += 1

        if max_steps > 0 and global_step > max_steps:
            # We're done!
            epoch_iterator.close()
            break
    if max_steps > 0 and global_step > max_steps:
        # We're done!
        train_iterator.close()
        break


Iteration:   0%|          | 0/403 [00:00<?, ?it/s][A
Iteration:   0%|          | 1/403 [00:00<04:52,  1.37it/s][A
Iteration:   0%|          | 2/403 [00:01<03:58,  1.68it/s][A
Iteration:   1%|          | 3/403 [00:01<03:18,  2.01it/s][A
Iteration:   1%|          | 4/403 [00:01<02:51,  2.33it/s][A
Iteration:   1%|          | 5/403 [00:01<02:32,  2.61it/s][A
Iteration:   1%|▏         | 6/403 [00:02<02:18,  2.86it/s][A
Iteration:   2%|▏         | 7/403 [00:02<02:08,  3.07it/s][A
Iteration:   2%|▏         | 8/403 [00:02<02:01,  3.25it/s][A
Iteration:   2%|▏         | 9/403 [00:02<01:56,  3.39it/s][A
Iteration:   2%|▏         | 10/403 [00:03<01:52,  3.49it/s][A
Iteration:   3%|▎         | 11/403 [00:03<01:50,  3.56it/s][A
Iteration:   3%|▎         | 12/403 [00:03<01:48,  3.62it/s][A
Iteration:   3%|▎         | 13/403 [00:03<01:46,  3.66it/s][A
Iteration:   3%|▎         | 14/403 [00:04<01:45,  3.68it/s][A
Iteration:   4%|▎         | 15/403 [00:04<01:44,  3.71it/s][A
Iteration

# 7. Save / Load model

In [0]:
# Save the trained model:
torch.save(model.state_dict(), '/content/gdrive/My Drive/Colab Notebooks/data/das_model_train2')

In [0]:
# Load the model, which was made on 8 GPUs (so the state_dict has a different format)
state_dict = torch.load('/content/gdrive/My Drive/Colab Notebooks/data/das_model_train')

# Fix the format on the state_dict:

# create new OrderedDict that does not contain `module.`
from collections import OrderedDict
new_state_dict = OrderedDict()
for k, v in state_dict.items():
    name = k[7:] # remove `module.`
    new_state_dict[name] = v

In [44]:
device = torch.device("cuda" if torch.cuda.is_available() and not no_cuda else "cpu")


# Load the saved model from the state dict: 
model = BertForSequenceClassification.from_pretrained(pretrained_model_name, config=bertconfig)
model.load_state_dict(new_state_dict)
model.to(device)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

# 8. Evaluate!

In [0]:
# Metrics for evaluation (accuracy, f1 score),  from the official script for SemEval task-8
def acc_and_f1(preds, labels, average='macro'):
    acc = simple_accuracy(preds, labels)
    f1 = f1_score(y_true=labels, y_pred=preds, average=average)
    return {"acc": acc,
        "f1": f1,
        "acc_and_f1": (acc + f1) / 2}
    
def compute_metrics(task_name, preds, labels):
    assert len(preds) == len(labels)
    return acc_and_f1(preds, labels)

def simple_accuracy(preds, labels):
    return (preds == labels).mean()

In [0]:
# Evaluation

def evaluate(model, tokenizer, prefix=""):
    '''
    Reads the test set, makes predictions on it, saves the predictions
    returns the predictions / truth and accuracy+f1 score.
    '''
    # Loop to handle MNLI double evaluation (matched, mis-matched)

    # What kind of task it was, for BERT:
    eval_task = task_name

    # Save the evaluation metrics into results:
    results = {}

    # Load the test set and convert to features and to tensors:
    examples = get_test_examples('/content/gdrive/My Drive/Colab Notebooks/data/')
    features = convert_examples_to_features(
        examples, labels, max_seq_len, tokenizer, "classification", use_entity_indicator)

    all_input_ids = torch.tensor(
            [f.input_ids for f in features], dtype=torch.long)
    all_input_mask = torch.tensor(
        [f.input_mask for f in features], dtype=torch.long)
    all_segment_ids = torch.tensor(
        [f.segment_ids for f in features], dtype=torch.long)
    all_e1_mask = torch.tensor(
        [f.e1_mask for f in features], dtype=torch.long)  # add e1 mask
    all_e2_mask = torch.tensor(
        [f.e2_mask for f in features], dtype=torch.long)  # add e2 mask

    all_label_ids = torch.tensor(
        [f.label_id for f in features], dtype=torch.long)

    eval_dataset = TensorDataset(all_input_ids, all_input_mask,all_segment_ids, all_label_ids, all_e1_mask, all_e2_mask)

    # Size of batch per GPU:
    eval_batch_size = per_gpu_eval_batch_size * \
        max(1, n_gpu)

    # Sample and load data:
    eval_sampler = SequentialSampler(
        eval_dataset) 
    eval_dataloader = DataLoader(
        eval_dataset, sampler=eval_sampler, batch_size=eval_batch_size)

  # Eval!
    logger.info("***** Running evaluation {} *****".format(prefix))
    logger.info("  Num examples = %d", len(eval_dataset))
    logger.info("  Batch size = %d", eval_batch_size)
    eval_loss = 0.0
    nb_eval_steps = 0
    preds = None
    out_label_ids = None

    # Loop through the test set, batch by batch:

    for batch in tqdm(eval_dataloader, desc="Evaluating"):
        model.eval()
        batch = tuple(t.to(device) for t in batch)

        with torch.no_grad():
            inputs = {'input_ids':      batch[0],
                      'attention_mask': batch[1],
                      'token_type_ids': batch[2],
                      'labels':      batch[3],
                      'e1_mask': batch[4],
                      'e2_mask': batch[5],
                      }
            outputs = model(**inputs)
            tmp_eval_loss, logits = outputs[:2]

            eval_loss += tmp_eval_loss.mean().item()
        nb_eval_steps += 1

        # Extract the predictions from the model's output:
        if preds is None:
            preds = logits.detach().cpu().numpy()
            out_label_ids = inputs['labels'].detach().cpu().numpy()
        else:
            preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
            out_label_ids = np.append(
                out_label_ids, inputs['labels'].detach().cpu().numpy(), axis=0)
            
    # Get the loss, prediction and results:
    eval_loss = eval_loss / nb_eval_steps
    preds = np.argmax(preds, axis=1)


    result = compute_metrics(eval_task, preds, out_label_ids)
    results.update(result)

    logger.info("***** Eval results {} *****".format(prefix))
    for key in sorted(result.keys()):
        logger.info("  %s = %s", key, str(result[key]))
    
    # Write results to file:
    output_eval_file = "/content/gdrive/My Drive/Colab Notebooks/data/eval/results2.txt"
    with open(output_eval_file, "w") as writer:
        for key in range(len(preds)):
            writer.write("%d\t%s\n" %  (key+8001, str(RELATION_LABELS[preds[key]])))
                
    return result, preds, out_label_ids

In [96]:
import numpy as np 
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import matthews_corrcoef, f1_score

RELATION_LABELS = ['causes1-causes2(e1,e2)',
'causes2-causes1(e2,e1)',
'contraindicates1-contraindicates2(e1,e2)',
'contraindicates2-contraindicates1(e2,e1)',
'location1-location2(e1,e2)',
'location2-location1(e2,e1)',
'treats1-treats2(e1,e2)',
'treats2-treats1(e2,e1)',
'diagnosed by1-diagnosed by2(e1,e2)',
'diagnosed by2-diagnosed by1(e2,e1)']

per_gpu_eval_batch_size=4

result = evaluate(model, tokenizer)
result


Evaluating:   0%|          | 0/147 [00:00<?, ?it/s][A
Evaluating:   1%|▏         | 2/147 [00:00<00:12, 11.62it/s][A
Evaluating:   3%|▎         | 4/147 [00:00<00:11, 12.66it/s][A
Evaluating:   4%|▍         | 6/147 [00:00<00:10, 14.06it/s][A
Evaluating:   5%|▌         | 8/147 [00:00<00:09, 15.06it/s][A
Evaluating:   7%|▋         | 10/147 [00:00<00:08, 15.92it/s][A
Evaluating:   8%|▊         | 12/147 [00:00<00:08, 16.66it/s][A
Evaluating:  10%|▉         | 14/147 [00:00<00:07, 17.27it/s][A
Evaluating:  11%|█         | 16/147 [00:00<00:07, 17.69it/s][A
Evaluating:  12%|█▏        | 18/147 [00:01<00:07, 18.02it/s][A
Evaluating:  14%|█▎        | 20/147 [00:01<00:06, 18.32it/s][A
Evaluating:  15%|█▍        | 22/147 [00:01<00:06, 18.52it/s][A
Evaluating:  16%|█▋        | 24/147 [00:01<00:06, 18.57it/s][A
Evaluating:  18%|█▊        | 26/147 [00:01<00:06, 18.62it/s][A
Evaluating:  19%|█▉        | 28/147 [00:01<00:06, 18.67it/s][A
Evaluating:  20%|██        | 30/147 [00:01<00:06, 18

({'acc': 0.5316239316239316,
  'acc_and_f1': 0.3817510035286296,
  'f1': 0.23187807543332759},
 array([2, 0, 0, 2, 7, 8, 8, 2, 0, 8, 0, 1, 0, 8, 2, 1, 2, 7, 1, 2, 8, 0,
        1, 2, 6, 7, 2, 1, 8, 0, 0, 7, 0, 2, 2, 0, 7, 0, 0, 8, 0, 1, 0, 1,
        8, 0, 8, 2, 0, 0, 7, 8, 2, 8, 0, 8, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1,
        0, 1, 8, 7, 1, 0, 1, 8, 8, 0, 8, 7, 0, 7, 0, 0, 0, 2, 8, 2, 1, 0,
        1, 1, 1, 0, 0, 0, 8, 1, 0, 1, 0, 0, 8, 1, 1, 0, 0, 7, 0, 1, 1, 1,
        0, 1, 0, 2, 7, 0, 7, 0, 5, 7, 7, 2, 1, 8, 8, 2, 0, 8, 0, 1, 7, 8,
        0, 0, 7, 8, 0, 1, 0, 7, 8, 0, 0, 7, 0, 0, 1, 0, 0, 5, 0, 8, 8, 8,
        1, 0, 7, 1, 8, 8, 0, 0, 8, 7, 2, 1, 1, 0, 1, 9, 5, 2, 8, 2, 0, 8,
        1, 2, 1, 1, 1, 2, 1, 0, 2, 8, 8, 7, 7, 8, 1, 0, 0, 2, 1, 8, 6, 0,
        7, 7, 0, 8, 8, 8, 2, 2, 0, 2, 7, 7, 2, 0, 2, 7, 2, 0, 0, 0, 8, 1,
        1, 1, 0, 0, 2, 2, 2, 1, 1, 1, 1, 8, 2, 0, 2, 0, 0, 5, 2, 1, 2, 1,
        1, 1, 8, 8, 0, 8, 7, 7, 7, 7, 7, 8, 8, 8, 8, 1, 0, 8, 2, 7, 8, 7,
        0, 2, 7, 

# Turns out that the model predicts over half of the classes correctly!


Results of the evaluation:


accuracy: 0.532

f1-score (macro average): 0.232

Check what the predictions were, by running through the test file sentence by sentence:

In [0]:
# dict that relates the relation name and how it appears in the text:

RELATIONZ = {'causes1-causes2(e1,e2)' : '1	causes1	causes2	1',
'causes2-causes1(e2,e1)' : '2	causes2	causes1	2',
'contraindicates1-contraindicates2(e1,e2)' : '3	contraindicates1	contraindicates2	1',
'contraindicates2-contraindicates1(e2,e1)' : '4	contraindicates2	contraindicates1	2',
'location1-location2(e1,e2)' : '5	location1	location2	1',
'location2-location1(e2,e1)' : '6	location2	location1	2',
'treats1-treats2(e1,e2)' : '7	treats1	treats2	1',
'treats2-treats1(e2,e1)' : '8	treats2	treats1	2', 
'diagnosed by1-diagnosed by2(e1,e2)': '9	diagnosed by1	diagnosed by2	1',
'diagnosed by2-diagnosed by1(e2,e1)' : '10	diagnosed by2	diagnosed by1	2'}

In [0]:
predictions = []
with open('/content/gdrive/My Drive/Colab Notebooks/data/eval/results2.txt') as f:
  for l in f.readlines():
    predictions.append(l.split('	')[1].strip())

In [13]:
# Check which predictions were correctly done by the model:

with open('/content/gdrive/My Drive/Colab Notebooks/data/test.tsv') as f:
  correct = set() 
  i = 0 
  for l in f.readlines():
    if RELATIONZ[predictions[i]] in l[-30:]:
        print(predictions[i])
        print(l[6:])
        correct.add((l,predictions[i]))
    i+=1

causes1-causes2(e1,e2)
therapeutic results of Lp TAE (transcatheter arterial embolization in the presence or absence of Gelfoam particles preceded by the infusion of a mixture of lipiodol and an anticancer drug via the proper hepatic artery) or DSM TAE (transcatheter arterial embolization with degradable starch microspheres and the arterial injection of anticancer drugs via the hepatic artery) combined with $HYPERTHERMIA$ were evaluated in 30 patients with #HEPATOCELLULAR CARCINOMA# (HCC), 5 subjects with hepatic cholangiocarcinoma, and 22 patients with metastatic liver carcinoma.	1	causes1	causes2	1

causes1-causes2(e1,e2)
1 yr old woman with gallbladder stones, diabetes, weight loss, $DIARRHEA$ and steatorrhea, #IMMUNOHISTOCHEMICAL DIAGNOSIS OF SOMATOSTATINOMA# (liver biopsy) and high plasma values of somatostatin was studied by somatostatin receptor scintigraphy.	1	causes1	causes2	1

causes2-causes1(e2,e1)
ANTAVIRUS# PULMONARY SYNDROME$ (HPS) is a viral infection from a new strain o

It seems that the model only catches the causal relationships!

In [14]:
from collections import Counter

Counter([x[1] for x in correct])

Counter({'causes1-causes2(e1,e2)': 56,
         'causes2-causes1(e2,e1)': 16,
         'location2-location1(e2,e1)': 1,
         'treats2-treats1(e2,e1)': 2})

What was the distribution of relationships in the training data?



In [15]:
train_data = read_tsv('/content/gdrive/My Drive/Colab Notebooks/data/train.tsv')

Counter([(x[3],x[4]) for x in train_data])

Counter({('causes1', 'causes2'): 419,
         ('causes2', 'causes1'): 432,
         ('contraindicates1', 'contraindicates2'): 4,
         ('contraindicates2', 'contraindicates1'): 3,
         ('diagnosed by1', 'diagnosed by2'): 29,
         ('diagnosed by2', 'diagnosed by1'): 34,
         ('location1', 'location2'): 36,
         ('location2', 'location1'): 31,
         ('treats1', 'treats2'): 215,
         ('treats2', 'treats1'): 409})

What about the distribution of relationships in the test data?

In [16]:
test_data = read_tsv('/content/gdrive/My Drive/Colab Notebooks/data/test.tsv')

Counter([(x[3],x[4]) for x in test_data])

Counter({('causes1', 'causes2'): 157,
         ('causes2', 'causes1'): 133,
         ('contraindicates1', 'contraindicates2'): 2,
         ('contraindicates2', 'contraindicates1'): 2,
         ('diagnosed by1', 'diagnosed by2'): 15,
         ('diagnosed by2', 'diagnosed by1'): 19,
         ('location1', 'location2'): 12,
         ('location2', 'location1'): 20,
         ('treats1', 'treats2'): 75,
         ('treats2', 'treats1'): 150})

The model only identifies 'treats' twice correctly, even though it is almost as abundant as 'causes'...

What are 'treats' cases classified as?

In [17]:
treats = []
with open('/content/gdrive/My Drive/Colab Notebooks/data/test.tsv') as f:
  
  for i,l in enumerate(f.readlines()):
    if 'treats' in l[-30:]:
      # it is a "treats" relation 
        treats.append(predictions[i])
        if predictions[i][:7] == 'treats2' or predictions[i][:7] == 'treats1':
          # it is predicted to be a treats relation
          print(l)
          
Counter(treats)

4	Clonidine, oxymetazoline, tetrahydozoline, brimonidine, tizanidine; barbiturates; opioids; benzodiazepines  Give naloxone for suspected $OPIOID OVERDOSE$; consider #FLUMAZENIL# for benzodiazepine overdose Cholinergic (pinpoint pupils; variable HR; sweaty skin; abdominal cramps and diarrhea)  Organophosphate and carbamate insecticides; chemical warfare nerve agents  Give atropine and pralidoxime; obtain measurements of serum and RBC cholinesterase activity Anticholinergic (agitation; delirium; dilated pupils; tachycardia; decreased peristalsis; dry, flushed skin)  Atropine and related drugs; antihistamines; carbamazepine; phenothiazines; tricyclic antidepressants  Obtain immediate ECG.	7	treats1	treats2	1

17	Woscoff A, Carabeli S. Treatment of $TINEA PEDIS$ with #SULCONAZOLE NITRATE 1% CREAM# or miconazole nitrate 2% cream.	7	treats1	treats2	1

37	The study demonstrated that vasopressin is similar to epinephrine for OOH CA due to $VENTRICULAR FIBRILLATION$ or pulseless electrical act

Counter({'causes1-causes2(e1,e2)': 24,
         'causes2-causes1(e2,e1)': 5,
         'contraindicates1-contraindicates2(e1,e2)': 1,
         'diagnosed by1-diagnosed by2(e1,e2)': 144,
         'treats2-treats1(e2,e1)': 51})

The 'treats' examples are misclassified as belonging to the relatively rare class 'diagnosed by'.

Most of the treatments are misclassified in the wrong direction - e.g. that a pain treats a painkiller. The model might be better at non-directed relations.


# Confusion matrices, precision and recall for the 10 relations

In [0]:
from sklearn.metrics import multilabel_confusion_matrix

# Grab the true predictions:

truths = []
with open('/content/gdrive/My Drive/Colab Notebooks/data/test.tsv') as f:
    for l in f.readlines():
      found = False 
      for k,v in RELATIONZ.items():
        if v in l:
          truths.append(k) 

In [19]:
confusion = multilabel_confusion_matrix(truths,predictions)

for i,c in enumerate(confusion):
  print(sorted(list(RELATIONZ.keys()))[i])
  print(c)

causes1-causes2(e1,e2)
[[342  86]
 [101  56]]
causes2-causes1(e2,e1)
[[371  81]
 [117  16]]
contraindicates1-contraindicates2(e1,e2)
[[494  89]
 [  2   0]]
contraindicates2-contraindicates1(e2,e1)
[[583   0]
 [  2   0]]
diagnosed by1-diagnosed by2(e1,e2)
[[397 173]
 [ 15   0]]
diagnosed by2-diagnosed by1(e2,e1)
[[566   0]
 [ 18   1]]
location1-location2(e1,e2)
[[573   0]
 [ 12   0]]
location2-location1(e2,e1)
[[560   5]
 [ 19   1]]
treats1-treats2(e1,e2)
[[508   2]
 [ 75   0]]
treats2-treats1(e2,e1)
[[362  73]
 [148   2]]


The confusion matrices show that the accuracy of classification for each individual relation is pretty bad. 

What are the precision and recall for the 10 classes/relations?

In [26]:
from sklearn.metrics import precision_recall_fscore_support

precision,recall,_,_ = precision_recall_fscore_support(truths,predictions)

  _warn_prf(average, modifier, msg_start, len(result))


In [95]:

print('Precision    ', 'Recall')
print()
for i,p in enumerate(precision):
  print(sorted(list(RELATIONZ.keys()))[i])
  print('%.4f' % p, '      ', '%.4f' % recall[i])
  print()

Precision     Recall

causes1-causes2(e1,e2)
0.3944        0.3567

causes2-causes1(e2,e1)
0.1649        0.1203

contraindicates1-contraindicates2(e1,e2)
0.0000        0.0000

contraindicates2-contraindicates1(e2,e1)
0.0000        0.0000

diagnosed by1-diagnosed by2(e1,e2)
0.0000        0.0000

diagnosed by2-diagnosed by1(e2,e1)
1.0000        0.0526

location1-location2(e1,e2)
0.0000        0.0000

location2-location1(e2,e1)
0.1667        0.0500

treats1-treats2(e1,e2)
0.0000        0.0000

treats2-treats1(e2,e1)
0.0267        0.0133



## Each relation has bad precision / recall,   but overall the accuracy is 53%.

The not so good results make sense, given that it is a complicated task and the data set was small and obscure.

# References

J.  Devlin,  M.-W.  Chang,  K.  Lee,  and  K.  Toutanova,  “Bert.”https://github.com/google-research/bert, 2018.  


T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac,T. Rault, R. Louf, M. Funtowicz, and J. Brew, “Huggingface's transformers.”https://github.com/huggingface/transformers, 2019.


H.   Wang,    “bert-relation-classification.”https://github.com/wang-h/bert-relation-classification, 2019.