# Triple and Perspective Extraction and Scoring with Normalization

In this notebook you will fine-tune a language model to perform triple argument (subject, predicate, object) extraction and candidate triple scoring. For the predicates, you will create various categories; the aim of the model is to find the predicate token span as well as the most likely category for the predicate. You will use the PyTorch implementation of `albert-base-v2` provided by the Hugging Face `transformers` library and fine-tune this model on PersonaChat, DailyDialog and Circa data annotated with ground-truth triples. You will also try to adapt the code to allow for other models and train those models to compare model performance on the normalized predicates.

## Overview 
Adopting a two-stage setup allows maximum flexibility of the triple extraction while making efficient use of the annotated data. The two stages include:

1. A sequence labeling (BIO-tagging) model which extracts lists of subjects, predicates and objects from the input dialogue.

2. A model which takes combinations of subjects, predicates and objects found and scores these combinations (i.e. all candidate triples) to decide whether the triple can indeed be entailed from the dialogue and what its polarity is.

By using this two-stage approach, arbitrary numbers of triples can be extracted and linguistic phenomena such as ellipsis can be accounted for.

The first part is most relevant for the extraction of abstract predicates, whereas the second part is relevant for the evaluation.

## Getting the Data

To obtain a dataset of ground-truth triples, a small development set was created. The data is stored in Google Drive for easy access.

In [None]:
from google.colab import drive

drive.mount('/content/gdrive')

root_dir = '/content/gdrive/MyDrive/Communicative Robotics' 


Mounted at /content/gdrive


In [None]:
import glob
import json
import random

def load_annotations(path, remove_unk=True, keep_skipped=False):
    """ Reads all annotation files from path. By default, it filters skipped
        files and removes the [unk] tokens appended at the end of each turn.

        params:
        str path:           name of directory containing annotations
        bool remove_unk:    whether to remove [unk] tokens (default: True)
        bool keep_skipped:  whether to keep skipped annotations (default: False)

        returns:    list of annotations dicts
    """
    annotations = []
    for fname in glob.glob(path + '/*.json'):
        with open(fname, 'r', encoding='utf-8') as file:
            data = json.load(file)

            if data['skipped'] and not keep_skipped:
                continue

            if remove_unk:
                data['tokens'] = [[t for t in turn if t != '[unk]'] for turn in data['tokens']]

            annotations.append(data)

    return annotations

annotations = load_annotations(root_dir + '/annotated_data/trainval') #include in new dir
annotations[0]

{'tokens': [['that',
   "'s",
   'nice',
   'i',
   'wish',
   'i',
   'was',
   'more',
   'creative'],
  ['what', 'kind', 'of', 'music', 'do', 'you', 'sing', '?'],
  ['pop',
   'music',
   'but',
   'i',
   'get',
   'nervous',
   'and',
   'do',
   "n't",
   'preform']],
 'annotations': [[[[0, 3]],
   [[0, 4]],
   [[0, 5], [0, 6], [0, 7], [0, 8]],
   [],
   []],
  [[[1, 5]], [[1, 6]], [[2, 0], [2, 1]], [], []],
  [[[2, 3]], [[2, 4]], [[2, 5]], [], []],
  [[[2, 3]], [[2, 7]], [[2, 9]], [[2, 8]], []],
  [[], [], [], [], []],
  [[], [], [], [], []],
  [[], [], [], [], []],
  [[], [], [], [], []],
  [[], [], [], [], []],
  [[], [], [], [], []],
  [[], [], [], [], []]],
 'skipped': False}

In [None]:
print('#dialogs:', len(annotations))
print('#triples:', sum([sum([any(t) for t in d['annotations']]) for d in annotations]))

#dialogs: 1117
#triples: 4786


In [None]:
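def get_predicate_tokens(annotation, triple):
    """ Returns the predicate of a triple as a space-joined string, or None
        if the triple has no predicate.
    """
    if triple[1]:
        turn = triple[1][0][0]  # turn index of the first predicate token
        pred_tokens = []
        for n in range(0, len(triple[1])):
            token_index = triple[1][n][1]
            token = annotation['tokens'][turn][token_index]
            pred_tokens.append(token)
        return ' '.join(pred_tokens)
    else:
        return None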



## Who are 'You'?: Disambiguating You and I

In the text, the speakers are referred to by the tokens *you* and *I*. Because the meaning of these words depends on who utters them, we replace them with `SPEAKER1` and `SPEAKER2` according to the speaker (e.g. when speaker 2 says *you*, the token becomes `SPEAKER1`).

In [None]:
SPEAKER1 = 'SPEAKER1'
SPEAKER2 = 'SPEAKER2'

def disambiguate_pronouns(token, turn_idx):
    # Even turns -> speaker1
    if turn_idx % 2 == 0:
        if token in ['i', 'me', 'myself', 'we', 'ourselves']:
            return SPEAKER1
        elif token in ['my', 'mine', 'our', 'ours']:
            return SPEAKER1 + "'s"
        elif token in ['you', 'yourself', 'yourselves']:
            return SPEAKER2
        elif token in ['your', 'yours']:
            return SPEAKER2 + "'s"
    else:
        if token in ['i', 'me', 'myself', 'we', 'ourselves']:
            return SPEAKER2
        elif token in ['my', 'mine', 'our', 'ours']:
            return SPEAKER2 + "'s"
        elif token in ['you', 'yourself', 'yourselves']:
            return SPEAKER1
        elif token in ['your', 'yours']:
            return SPEAKER1 + "'s"
    return token

In [None]:
for annotation in annotations:
    annotation['tokens'] = [[disambiguate_pronouns(token, i % 2) for token in turn] for i, turn in enumerate(annotation['tokens'])]
annotations[0]

{'tokens': [['that',
   "'s",
   'nice',
   'SPEAKER1',
   'wish',
   'SPEAKER1',
   'was',
   'more',
   'creative'],
  ['what', 'kind', 'of', 'music', 'do', 'SPEAKER1', 'sing', '?'],
  ['pop',
   'music',
   'but',
   'SPEAKER1',
   'get',
   'nervous',
   'and',
   'do',
   "n't",
   'preform']],
 'annotations': [[[[0, 3]],
   [[0, 4]],
   [[0, 5], [0, 6], [0, 7], [0, 8]],
   [],
   []],
  [[[1, 5]], [[1, 6]], [[2, 0], [2, 1]], [], []],
  [[[2, 3]], [[2, 4]], [[2, 5]], [], []],
  [[[2, 3]], [[2, 7]], [[2, 9]], [[2, 8]], []],
  [[], [], [], [], []],
  [[], [], [], [], []],
  [[], [], [], [], []],
  [[], [], [], [], []],
  [[], [], [], [], []],
  [[], [], [], [], []],
  [[], [], [], [], []]],
 'skipped': False}

## Converting formats

Triple arguments are stored as lists of indices (e.g. [[0, 1], [0, 2]] indicating the second and third token of the first turn). Instead, we use a BIO tagging scheme that represents each argument as a vector of labels (one label for each token in the dialogue).

Moreover, we flatten the dialogue turns into one flat dialogue using `<eos>` as a separator token.
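
To make the index arithmetic concrete, here is a minimal sketch of the conversion on a toy dialogue. The helper name `span_to_bio` and the example turns are illustrative assumptions, not part of the notebook's code; the tag values (O=0, B=1, I=2) and the `<eos>` separator follow the description above.

```python
# Toy dialogue with two turns; indices follow the [turn, token] convention.
turns = [["i", "like", "pop", "music"], ["me", "too"]]
span = [[0, 2], [0, 3]]  # "pop music" in the first turn


def span_to_bio(turns, span, sep="<eos>"):
    """Return (flat_tokens, tags) with O=0, B=1, I=2 for a single span."""
    # Flatten the turns, closing each turn with the separator token
    flat = [tok for turn in turns for tok in turn + [sep]]
    tags = [0] * len(flat)
    for j, (turn_id, token_id) in enumerate(span):
        # Offset = number of tokens (plus one separator) in all preceding turns
        offset = sum(len(t) + 1 for t in turns[:turn_id])
        tags[offset + token_id] = 1 if j == 0 else 2
    return flat, tags


flat, tags = span_to_bio(turns, span)
for token, tag in zip(flat, tags):
    print(token, tag)
```

The first token of the span receives the B tag and every following token the I tag; all other positions, including the separators, stay O.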

In [None]:
import numpy as np
import pandas as pd
import re


def triple_to_bio_tags(annotation, arg):
    """ Converts the token indices of the annotations to a vector of BIO labels
        for an argument.

        params:
        dict annotation:    loaded annotation file (see load_annotations)
        int arg:            argument to create tag sequence for (subj=0, pred=1, obj=2)

        returns:    ndarray with BIO labels (I=2, B=1, O=0)
    """ 
    # Determine length of dialogue
    turns = annotation['tokens']
    triples = annotation['annotations']
    num_tokens = sum([len(turn) + 1 for turn in turns])  # +1 for <eos>

    # Create vector same size as dialogue
    mask_ = np.zeros(num_tokens, dtype=np.uint8)

    # Label annotated arguments as BIO tags
    for triple in triples:
        # if arg == 1:
        #     pred = get_predicate_tokens(annotation, triple)
        #     if pred is not None:
        #         try:
        #           pred = pred.strip()
        #           pred = pred.strip("'")
        #           pred = pred.strip()
        #           B_tag, I_tag = lookup[pred]
        #         except:
        #           B_tag, I_tag = (287,288)

        #         for j, (turn_id, token_id) in enumerate(triple[arg]):
        #             #print(j, triple[arg])
        #             test = sum([len(t) + 1 for t in turns[:turn_id]]) + token_id  # k = index of token in dialogue
        #             print(test)
        #             if j == 0:
        #               mask_[test] = B_tag
        #               print(f"mask k with is: {mask_[test]}, with B-tag {B_tag}")
        #             else:
        #               mask_[test] = I_tag
        #               print(f"mask k is: {mask_[test]}, with I-tag {I_tag}")
        # else:
          for j, (turn_id, token_id) in enumerate(triple[arg]):
              k = sum([len(t) + 1 for t in turns[:turn_id]]) + token_id  # k = index of token in dialogue
              mask_[k] = 1 if j == 0 else 2
            
    return mask_

In [None]:
tokens, labels = [], []
for ann in annotations:
    # Map triple arguments to BIO tagged masks
    labels.append((triple_to_bio_tags(ann, 0),
                   triple_to_bio_tags(ann, 1),
                   triple_to_bio_tags(ann, 2)))
    
    # Flatten turn sequence
    tokens.append([t for ts in ann['tokens'] for t in ts + ['<eos>']])
    
# Show as BIO scheme
i = random.randint(0, len(tokens) - 1)
pd.DataFrame(labels[i], columns=tokens[i], index=['subj', 'pred', 'obj'])

      a  pet  might  help  .  SPEAKER1's  dog  calms  SPEAKER1  down  ...  last  year  but  SPEAKER1  still  have  2  cats  .  <eos>
subj  1    2      0     0  0           1    2      0         0     0  ...     0     0    0         1      0     0  0     0  0      0
pred  0    0      1     0  0           0    0      1         0     2  ...     0     0    0         0      0     1  0     0  0      0
obj   0    0      0     1  0           0    0      0         1     0  ...     0     0    0         0      0     0  1     2  0      0


In [None]:
i = random.randint(0, len(tokens) - 1)

print(tokens[i])
print()
print("token\tsubj\tpred\tobj")
for j, token in enumerate(tokens[i]):
  print(token, "\t", labels[i][0][j], "\t", labels[i][1][j], "\t", labels[i][2][j])

['same', 'here', '.', '.', '.', 'lights', 'are', 'out', 'in', "SPEAKER1's", 'basement', '.', '.', '.', 'need', 'to', 'call', "SPEAKER1's", 'father', '<eos>', 'uh', 'oh', '.', '.', 'maybe', 'SPEAKER1', 'should', 'light', 'some', 'candles', '?', '<eos>', 'SPEAKER1', "'d", 'but', 'SPEAKER1', 'do', "n't", 'want', 'to', 'summon', 'ghosts', '<eos>']

token	subj	pred	obj
same 	 0 	 0 	 0
here 	 0 	 0 	 0
. 	 0 	 0 	 0
. 	 0 	 0 	 0
. 	 0 	 0 	 0
lights 	 1 	 0 	 0
are 	 0 	 1 	 0
out 	 0 	 0 	 1
in 	 0 	 0 	 0
SPEAKER1's 	 0 	 0 	 0
basement 	 0 	 0 	 0
. 	 0 	 0 	 0
. 	 0 	 0 	 0
. 	 0 	 0 	 0
need 	 0 	 1 	 0
to 	 0 	 2 	 0
call 	 0 	 0 	 1
SPEAKER1's 	 0 	 0 	 2
father 	 0 	 0 	 2
<eos> 	 0 	 0 	 0
uh 	 0 	 0 	 0
oh 	 0 	 0 	 0
. 	 0 	 0 	 0
. 	 0 	 0 	 0
maybe 	 0 	 0 	 0
SPEAKER1 	 1 	 0 	 0
should 	 0 	 1 	 0
light 	 0 	 2 	 0
some 	 0 	 0 	 1
candles 	 0 	 0 	 2
? 	 0 	 0 	 0
<eos> 	 0 	 0 	 0
SPEAKER1 	 0 	 0 	 0
'd 	 0 	 0 	 0
but 	 0 	 0 	 0
SPEAKER1 	 1 	 0 	 0
do 	 0 	 0 	 0
n't 	 

In [None]:
import re

def bio_tags_to_tokens(tokens, mask, predicate=False, one_hot=False):
    """ Converts a vector of BIO-tags into spans of tokens. If BIO-tags are one-hot encoded,
        one_hot=True will first perform an argmax to obtain the BIO labels.

        params:
        list tokens:    list of subwords or tokens (as tokenized by Albert/AutoTokenizer)
        ndarray mask:   list of bio labels (one for each subword or token in 'tokens')
        bool predicate: whether to keep only the B-tagged token of each span (I-tagged tokens are skipped)
        bool one_hot:   whether to interpret mask as a one-hot encoded sequence of shape |sequence|x3

        returns:    set of extracted spans (strings)
    """
    out = []
    span = []
    for i, token in enumerate(tokens):
        pred = mask[i]

        # Reverse one-hot encoding (optional)
        if one_hot:
            pred = np.argmax(pred)
          

        if pred % 2 == 1:  # B
            # if predicate:
            #     span = bio_lookup[pred]

            # else:
            # Close the previous span (if any) and start a new one
            span = re.sub(r"[^\w\d\-']+", ' ', ''.join(span)).strip()
            out.append(span)
            span = [token]

        elif pred != 0 and pred %2 == 0:  # I
            if predicate:
                continue
            else:
                span.append(token)

    if span:
        # Close the last open span
        span = re.sub(r"[^\w\d\-']+", ' ', ''.join(span)).strip()
        out.append(span)

    # Remove empty strings and duplicates
    return set([span for span in out if span.strip()])

In [None]:
i = random.randint(0, len(labels) - 1)  # randint is inclusive on both ends
print(' '.join(tokens[i]) + '\n')

print('Subjects:')
print(bio_tags_to_tokens(['+' + t for t in tokens[i]], labels[i][0]))

print('\nPredicates:')
print(bio_tags_to_tokens(['+' + t for t in tokens[i]], labels[i][1], predicate=True))

print('\nObjects:')
print(bio_tags_to_tokens(['+' + t for t in tokens[i]], labels[i][2]))

not at all . SPEAKER1 got a 3 . 8 . <eos> well , what did SPEAKER1 get in english 101 last year ? <eos> SPEAKER1 got a 4 . 0 in that class . <eos>

Subjects:
{'SPEAKER1'}

Predicates:
{'get', 'got'}

Objects:
{'a 4 0', 'a 3 8'}


## Setting up ALBERT for Argument Extraction

Now we set up ALBERT with a token classification head for each of the arguments. To this end we will use PyTorch to create a small linear classifier for each argument which we can slide over the output of ALBERT to make a prediction for each token. To train other models, you should adapt this code so that it works with your model of choice.
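
Because the classifier predicts one label per subword rather than per annotated token, the token-level BIO labels have to be expanded whenever the tokenizer splits a token into several pieces. The sketch below illustrates that expansion in plain Python; the function name `expand_labels` and the subword counts are hypothetical and only serve the example (no tokenizer is called).

```python
def expand_labels(labels, pieces_per_token):
    """Expand token-level BIO labels (O=0, B=1, I=2) to subword level."""
    expanded = []
    for label, n_pieces in zip(labels, pieces_per_token):
        if label == 1:
            # B stays on the first subword, the remaining subwords become I
            expanded += [1] + [2] * (n_pieces - 1)
        else:
            # O and I are simply repeated for every subword
            expanded += [label] * n_pieces
    return expanded


# Hypothetical subword counts: the second token is split into two pieces,
# the fourth into three.
labels = [0, 1, 2, 0]
pieces = [1, 2, 1, 3]
print(expand_labels(labels, pieces))
```

Keeping B only on the first subword means a span still has exactly one beginning after re-tokenization, so decoding the subword labels yields the same number of spans as the original annotation.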

In [None]:
%%capture 
!pip install transformers

import torch
from transformers import AutoTokenizer, AutoModel, AutoConfig
from tqdm import tqdm
import numpy as np
import random
from datetime import date

In [None]:
class ArgumentExtraction(torch.nn.Module):
    def __init__(self, base_model='albert-base-v2', path=None, output_dim=3, sep='<eos>'): # You need to change the output_dim to the total number of BIO-tags (including 0,1,2)
        """ Init model with multi-span extraction heads for SPO arguments.

            params:
            str base_model: Transformer architecture to use (default: albert-base-v2)
            str path:       Path to pretrained model
        """
        super().__init__()
        print('loading %s for argument extraction' % base_model)
        self._model = AutoModel.from_pretrained(base_model)
        self._base = base_model
        self._sep = sep

        # Load and extend tokenizer with special SPEAKER tokens
        self._tokenizer = AutoTokenizer.from_pretrained(base_model)
        self._tokenizer.add_tokens(['SPEAKER1', 'SPEAKER2'], special_tokens=True)
        self._model.resize_token_embeddings(len(self._tokenizer))

        # Add token classification heads
        hidden_size = AutoConfig.from_pretrained(base_model).hidden_size
        self._subj_head = torch.nn.Linear(hidden_size, output_dim)
        self._pred_head = torch.nn.Linear(hidden_size, output_dim)
        self._obj_head = torch.nn.Linear(hidden_size, output_dim)
        self._output_dim = output_dim

        self._relu = torch.nn.ReLU()
        self._softmax = torch.nn.Softmax(dim=-1)

        # Set GPU if available
        self._device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
        self.to(self._device)

        # Load model / tokenizer if pretrained model is given
        if path:
            print('\t- Loading pretrained')
            model_path = path + '/argument_extraction_' + base_model
            self.load_state_dict(torch.load(model_path, map_location=self._device))

    def forward(self, input_ids, speaker_ids):
        """ Computes BIO label probabilities for each token
        """
        # Feed dialog through transformer
        y = self._model(input_ids=input_ids, token_type_ids=speaker_ids)
        h = self._relu(y.last_hidden_state)

        # Predict spans
        y_subj = self._softmax(self._subj_head(h))
        y_pred = self._softmax(self._pred_head(h))
        y_obj_ = self._softmax(self._obj_head(h))

        # Permute output as tensor of shape (N, |C|, seq_len)
        y_subj = y_subj.permute(0, 2, 1)
        y_pred = y_pred.permute(0, 2, 1)
        y_obj_ = y_obj_.permute(0, 2, 1)
        return y_subj, y_pred, y_obj_

    def _retokenize_tokens(self, tokens):
        # Tokenize each token individually (keeping track of subwords)
        input_ids = [[self._tokenizer.cls_token_id]]
        for t in tokens:
            if t != '<eos>':
                input_ids.append(self._tokenizer.encode(t, add_special_tokens=False))
            else:
                input_ids.append([self._tokenizer.eos_token_id])

        # Flatten input_ids
        f_input_ids = torch.LongTensor([[i for ids in input_ids for i in ids]]).to(self._device)

        # Determine how often we need to repeat the labels
        repeats = [len(ids) for ids in input_ids]

        # Set speaker IDs
        speaker_ids = [0] + [tokens[:i + 1].count(self._sep) % 2 for i in range(len(tokens))][:-1]  # TODO: make pretty
        speaker_ids = self._repeat_speaker_ids(speaker_ids, repeats)

        return f_input_ids, speaker_ids, repeats

    def _repeat_speaker_ids(self, speaker_ids, repeats):
        """ Repeats speaker IDs for tokens that are split into multiple subwords.
        """
        rep_speaker_ids = np.repeat([0] + list(speaker_ids), repeats=repeats)
        return torch.LongTensor([rep_speaker_ids]).to(self._device)

    def _repeat_labels(self, labels, repeats):
        """ Repeats BIO labels for tokens that are split into multiple subwords.
            Ensures B-labeled tokens are repeated as B-I-I etc.
        """
        # Repeat each label by the number of subwords per token
        rep_labels = []
        for label, rep in zip([0] + list(labels), repeats):
            # Outside
            if label == 0:
                rep_labels += [label] * rep
            # Beginning: the first subword keeps B, the remaining subwords become I
            elif label == 1:
                rep_labels += [label] + ([label + 1] * (rep - 1))
            # Inside: keep the same label for every subword
            else:
                rep_labels += [label] + ([label] * (rep - 1))
        return torch.LongTensor([rep_labels]).to(self._device)

    def fit(self, tokens, labels, epochs=2, lr=1e-5, weight=3):
        """ Fits the model to the annotations
        """
        #print("labels before repeating")
        #print(labels)
        # Re-tokenize to obtain input_ids and associated labels
        X = []
        for token_seq, (subj_labels, pred_labels, _obj_labels) in zip(tokens, labels):
            input_ids, speaker_ids, repeats = self._retokenize_tokens(token_seq)
            subj_labels = self._repeat_labels(subj_labels, repeats)  # repeat when split into subwords
            pred_labels = self._repeat_labels(pred_labels, repeats)
            _obj_labels = self._repeat_labels(_obj_labels, repeats)
            X.append((input_ids, speaker_ids, subj_labels, pred_labels, _obj_labels))

        # Set up optimizer
        optim = torch.optim.Adam(self.parameters(), lr=lr)

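        # Higher weight for B- and I-tags to account for class imbalance.
        # Note: CrossEntropyLoss applies log-softmax internally and expects raw
        # logits, whereas forward() already returns softmax probabilities.
        class_weights = torch.Tensor([1] + [weight] * (self._output_dim - 1)).to(self._device)
        criterion = torch.nn.CrossEntropyLoss(weight=class_weights)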

        print('Training!')
        for epoch in range(epochs):
            losses = []
            random.shuffle(X)
            for input_ids, speaker_ids, subj_y, pred_y, obj_y in tqdm(X):
                # Forward pass
                subj_y_hat, pred_y_hat, obj_y_hat = self(input_ids, speaker_ids)
                #print(subj_y, pred_y, obj_y, subj_y_hat, pred_y_hat, obj_y_hat)
                # Compute loss
                loss = criterion(subj_y_hat, subj_y)
                loss += criterion(pred_y_hat, pred_y)
                loss += criterion(obj_y_hat, obj_y)
                losses.append(loss.item())

                optim.zero_grad()
                loss.backward()
                optim.step()

            print("mean loss =", np.mean(losses))

        # Save model to file
        torch.save(self.state_dict(), 'argument_extraction_%s' % self._base)

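    def predict(self, token_seq):
        """ Predicts BIO label probabilities for a flattened token sequence.

            returns:    subject, predicate and object score arrays of shape
                        (output_dim, num_subwords) and the list of subwords
        """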
        # Retokenize token sequence
        input_ids, speaker_ids, _ = self._retokenize_tokens(token_seq)

        # Invert tokenization for viewing
        subwords = self._tokenizer.convert_ids_to_tokens(input_ids[0])

        # Forward-pass
        predictions = self(input_ids, speaker_ids)
        subjs = predictions[0].cpu().detach().numpy()[0]
        preds = predictions[1].cpu().detach().numpy()[0]
        objs = predictions[2].cpu().detach().numpy()[0]

        return subjs, preds, objs, subwords
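
The speaker-ID expression in `_retokenize_tokens` (marked `TODO: make pretty`) assigns each token the parity of the number of `<eos>` separators that precede it. A minimal sketch of an equivalent single-pass version is given below; the function names are illustrative and not part of the class, and the equivalence is only claimed for non-empty token sequences.

```python
def speaker_ids_original(tokens, sep="<eos>"):
    # Same expression as used in _retokenize_tokens
    return [0] + [tokens[:i + 1].count(sep) % 2 for i in range(len(tokens))][:-1]


def speaker_ids_single_pass(tokens, sep="<eos>"):
    """Assign 0/1 per token, switching speaker after each separator."""
    ids, turn = [], 0
    for tok in tokens:
        ids.append(turn % 2)  # the separator itself still belongs to the turn it closes
        if tok == sep:
            turn += 1
    return ids


example = ["hi", "<eos>", "hello", "there", "<eos>", "bye"]
print(speaker_ids_original(example))
print(speaker_ids_single_pass(example))
```

Both functions return the same list for this example; the single-pass variant avoids re-counting separators for every position.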

In [None]:
#CUDA_LAUNCH_BLOCKING=1
model = ArgumentExtraction(output_dim=3)  
model.fit(tokens, labels, epochs=6)

loading albert-base-v2 for argument extraction




Training!


100%|██████████| 1117/1117 [00:46<00:00, 24.18it/s]


mean loss = 2.0148611808314105


100%|██████████| 1117/1117 [00:35<00:00, 31.04it/s]


mean loss = 1.8831506366695585


100%|██████████| 1117/1117 [00:36<00:00, 30.43it/s]


mean loss = 1.8629916325785159


100%|██████████| 1117/1117 [00:35<00:00, 31.16it/s]


mean loss = 1.844103985421873


100%|██████████| 1117/1117 [00:38<00:00, 29.31it/s]


mean loss = 1.8267154453477372


100%|██████████| 1117/1117 [00:36<00:00, 30.74it/s]

mean loss = 1.8201741939585658
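
One thing worth checking when reading these numbers: `torch.nn.CrossEntropyLoss` applies a log-softmax to its input, while `forward()` above already returns softmax probabilities. When the inputs are confined to [0, 1], the loss per token cannot fall below a fixed floor. The plain-Python sketch below computes that floor for three classes; whether this is what causes the loss to level off around 1.82 here is an assumption that has not been verified against the training run.

```python
import math


def cross_entropy(scores, target):
    """Cross-entropy from raw scores: -log(softmax(scores)[target])."""
    log_norm = math.log(sum(math.exp(s) for s in scores))
    return log_norm - scores[target]


# Best possible case when the scores are already probabilities:
# all mass on the correct class.
floor = cross_entropy([1.0, 0.0, 0.0], target=0)

# With unbounded logits the loss can get arbitrarily close to zero.
near_zero = cross_entropy([10.0, 0.0, 0.0], target=0)

print("per-head floor:", round(floor, 4))
print("floor for three summed heads:", round(3 * floor, 4))
print("loss with large logit:", near_zero)
```

A class-weighted mean of per-token losses cannot be lower than the smallest per-token loss, so the same floor applies with the class weights used in `fit`. Returning raw logits from `forward()` during training and applying the softmax only in `predict()` would remove the floor.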





In [None]:
# Load an already trained model here (used as `argex_model` in the cells below)
argex_model = ArgumentExtraction(path=root_dir+'/models/baseline2022-12-09')

loading albert-base-v2 for argument extraction
	- Loading pretrained


In [None]:
import os, shutil

# out_dir = root_dir + '/models/' + str(date.today())
out_dir = root_dir + '/models/' + 'baseline' + str(date.today())
if not os.path.exists(out_dir):
    os.mkdir(out_dir)

shutil.copy('argument_extraction_albert-base-v2', out_dir)

'/content/gdrive/MyDrive/Communicative Robotics/models/baseline2022-12-14/argument_extraction_albert-base-v2'

## Putting It All Together

Below you can see how tokens are assigned to SPO arguments under the BIO scheme.

In [None]:
# inputs_example = 'What car do SPEAKER1 drive <eos> a big red truck <eos>'.split()
# print(inputs_example)
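
The next cell calls `load_evaluation_data`, which is not defined anywhere in this part of the notebook. The sketch below is a hypothetical stand-in, assuming the evaluation file is plain UTF-8 text with one dialogue per line and turns separated by `<eos>`; the original helper may behave differently.

```python
import os
import tempfile


def load_evaluation_data(path):
    """Hypothetical reader: returns the non-empty lines of a text file."""
    with open(path, 'r', encoding='utf-8') as f:
        return [line.strip() for line in f if line.strip()]


# Usage on a throwaway file (the file content is made up for the example)
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False, encoding='utf-8') as tmp:
    tmp.write("good morning.<eos>what's the problem?\n\n")
    tmp_path = tmp.name

rows = load_evaluation_data(tmp_path)
os.remove(tmp_path)
print(rows)
```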

In [None]:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')

file = load_evaluation_data(root_dir + '/annotated_data/evaluatiedata/test_declarative_statements_abstract_val_l1.txt')

inputs = []
for row in file:
  if '<eos>' in row:
    all_tokens = []
    speaker_utterances = row.split('<eos>')  # list of strings, one string per speaker turn
    for utt in speaker_utterances:
      # Lowercase and tokenize each turn, then close it with the separator
      all_tokens.extend(token.lower() for token in word_tokenize(utt))
      all_tokens.append('<eos>')
    all_tokens.pop()  # drop the <eos> added after the last turn
    inputs.append(all_tokens)

print("first utterance in tokens:", inputs[0])

#speaker disambiguation
disambig_inputs = []
for utterance in inputs:
  disambig_utterance = []
  turn_counter = 0
  for token in utterance:
    if token == '<eos>':
      turn_counter +=1
    disambig_utterance.append(disambiguate_pronouns(token, turn_counter))
  disambig_inputs.append(disambig_utterance)

print("first utterance in tokens with speaker disambiguation:", disambig_inputs[0])

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


first utterance in tokens: ['good', 'morning', '.', '<eos>', 'what', "'s", 'the', 'problem', '?', '<eos>', 'i', "'m", 'running', 'a', 'high', 'fever', 'and', 'feeling', 'terribly', 'bad', '.']
first utterance in tokens with speaker disambiguation: ['good', 'morning', '.', '<eos>', 'what', "'s", 'the', 'problem', '?', '<eos>', 'SPEAKER1', "'m", 'running', 'a', 'high', 'fever', 'and', 'feeling', 'terribly', 'bad', '.']


In [None]:
for utterance in disambig_inputs[:10]:
  print(utterance)


['good', 'morning', '.', '<eos>', 'what', "'s", 'the', 'problem', '?', '<eos>', 'SPEAKER1', "'m", 'running', 'a', 'high', 'fever', 'and', 'feeling', 'terribly', 'bad', '.']
['haha', 'yeah', ',', 'SPEAKER1', 'like', 'football', 'and', 'country', 'music', '<eos>', 'sounds', 'like', 'SPEAKER1', 'are', 'from', 'the', 'south', ',', 'what', 'is', "SPEAKER1's", 'favorite', 'team', '?', '<eos>', 'the', 'colts', ',', 'from', "SPEAKER1's", 'home', 'state']
['yes', ',', 'it', "'s", 'SPEAKER1', '.', '<eos>', 'do', 'SPEAKER1', 'have', 'a', 'cold', '?', '<eos>', 'no', '.', 'worse', 'than', 'that', '.', 'SPEAKER1', 'have', 'a', 'flu', '.', 'SPEAKER1', "'m", 'in', 'bed', 'with', 'a', 'fever', '.']
['alot', 'really', ',', 'SPEAKER1', 'enjoy', 'going', 'out', 'to', 'eat', 'with', 'family', ',', 'going', 'to', 'the', 'movies', '<eos>', 'have', 'SPEAKER1', 'been', 'to', 'the', 'movies', 'lately', '?', '<eos>', 'yes', ',', 'SPEAKER1', 'go', 'a', 'couple', 'of', 'times', 'a', 'month', '.']
['what', 'is', "S

## Predictions and Validation

In [None]:
y_subj, y_pred, y_obj, subwords = argex_model.predict(disambig_inputs[0])

# show results
for arg, y in [('Subject', y_subj), ('Predicate', y_pred), ('Object', y_obj)]:
    print('\n', arg)
    print('\t\t0\t1\t2\t3\t4\t5\t6')
    for score, token in zip(y.T, subwords):
        score_str = '\t'.join(["[" + str(s)[:5] + "]" if s == max(score) else " " + str(round(s, 4))[:5] + " " for s in score])
        token_str = token.replace('▁', '')
        print(token_str, '\t\t', score_str)




 Subject
		0	1	2	3	4	5	6
[CLS] 		 [0.998]	 0.000 	 0.000 
good 		 [0.999]	 0.000 	 0.000 
morning 		 [0.999]	 0.000 	 0.000 
 		 [0.999]	 1e-04 	 0.000 
. 		 [0.999]	 0.000 	 0.000 
[SEP] 		 [0.998]	 0.001 	 0.001 
what 		 [0.999]	 0.000 	 0.000 
 		 [0.997]	 0.000 	 0.001 
' 		 [0.995]	 0.001 	 0.002 
s 		 [0.994]	 0.002 	 0.002 
the 		  0.001 	[0.997]	 0.001 
problem 		  0.008 	 0.007 	[0.983]
 		 [0.999]	 1e-04 	 1e-04 
? 		 [0.999]	 0.000 	 0.000 
[SEP] 		 [0.998]	 0.001 	 0.001 
SPEAKER1 		  0.000 	[0.999]	 0.000 
 		 [0.997]	 0.000 	 0.002 
' 		 [0.997]	 0.001 	 0.001 
m 		 [0.998]	 0.000 	 0.000 
running 		 [0.997]	 0.000 	 0.001 
a 		 [0.990]	 0.008 	 0.001 
high 		 [0.998]	 0.000 	 0.001 
fever 		 [0.998]	 0.000 	 0.001 
and 		 [0.999]	 0.000 	 0.000 
feeling 		 [0.997]	 0.000 	 0.001 
terribly 		 [0.995]	 0.003 	 0.001 
bad 		 [0.998]	 0.000 	 0.000 
 		 [0.999]	 0.000 	 0.000 
. 		 [0.999]	 0.000 	 0.000 

 Predicate
		0	1	2	3	4	5	6
[CLS] 		 [0.998]	 0.000 	 0.000 
good 		 

In [None]:
for token_seq in disambig_inputs:  # avoid shadowing the built-in `input`
  y_subj, y_pred, y_obj, subwords = argex_model.predict(token_seq)

  print(' '.join(subwords).replace('▁', '') + '\n')
  print('Subjects:  ', bio_tags_to_tokens(subwords, y_subj.T, one_hot=True))
  print('Predicates:', bio_tags_to_tokens(subwords, y_pred.T, predicate=True, one_hot=True))
  print('Objects:   ', bio_tags_to_tokens(subwords, y_obj.T, one_hot=True))
  print()

[CLS] good morning  . [SEP] what  ' s the problem  ? [SEP] SPEAKER1  ' m running a high fever and feeling terribly bad  .

Subjects:   {'the problem', 'SPEAKER1'}
Predicates: {'feeling', 'running'}
Objects:    {'a high fever', 'terribly bad'}

[CLS] ha ha yeah  , SPEAKER1 like football and country music [SEP] sounds like SPEAKER1 are from the south  , what is SPEAKER1  ' s favorite team  ? [SEP] the colts  , from SPEAKER1  ' s home state

Subjects:   {'SPEAKER1', "SPEAKER1 's favorite"}
Predicates: {'like', 'is', 'are'}
Objects:    {'country music', 'the south team', "SPEAKER1 's home state", 'football', 'the colts'}

[CLS] yes  , it  ' s SPEAKER1  . [SEP] do SPEAKER1 have a cold  ? [SEP] no  . worse than that  . SPEAKER1 have a flu  . SPEAKER1  ' m in bed with a fever  .

Subjects:   {'SPEAKER1', 'it'}
Predicates: {'worse', 'have'}
Objects:    {'a cold', 'that', 'SPEAKER1', 'a fever', 'a flu'}

[CLS] a lot really  , SPEAKER1 enjoy going out to eat with family  , going to the movies [S

Save model

In [None]:
import os, shutil

# out_dir = root_dir + '/models/' + str(date.today())
out_dir = root_dir + '/models/' + 'baseline' + str(date.today())
if not os.path.exists(out_dir):
    os.mkdir(out_dir)

shutil.copy('argument_extraction_albert-base-v2', out_dir)


'/content/gdrive/MyDrive/Communicative Robotics/models/baseline2022-12-09/argument_extraction_albert-base-v2'

In [None]:
y_subj, y_pred, y_obj, subwords = argex_model.predict(['Good', 'morning', '.', '<eos>', "What's", 'the', 'problem', '?', '<eos>', "SPEAKER2", "'m", 'running', 'a', 'high', 'fever', 'and', 'feeling', 'terribly', 'bad', '.'])

print(y_subj, y_pred, y_obj)

# Ranking the triples

Now we are able to extract the candidate arguments, but how do we combine them?

We compute all combinations of the subjects, predicates and objects and train a model to distinguish between those triples that are entailed (not considering negation here) and those that are not.

For this, we extract a number of negative examples from possible triples, i.e. those combinations of subjects, predicates and objects that were not annotated.
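
As a small illustration of the idea, the sketch below enumerates every subject-predicate-object combination for a toy dialogue and keeps the ones that were not annotated as candidates for negative examples. The argument sets and the annotated triple are made up for the example; the notebook's own sampling in `extract_triples` draws random combinations rather than enumerating all of them.

```python
from itertools import product

# Toy argument sets, as they might come out of the extraction stage
subjects = {"SPEAKER1", "SPEAKER2"}
predicates = {"vote for"}
objects = {"him"}

# Triples that were annotated as entailed by the dialogue
annotated = {("SPEAKER2", "vote for", "him")}

# All candidate triples = Cartesian product of the three argument sets
candidates = list(product(sorted(subjects), sorted(predicates), sorted(objects)))

# Candidates that were not annotated can serve as negative examples
negatives = [c for c in candidates if c not in annotated]

print("candidates:", candidates)
print("negatives: ", negatives)
```

The number of candidates grows with the product of the three set sizes, which is one reason to sample a limited number of negatives per dialogue.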

## Converting format

In [None]:
from collections import defaultdict
from copy import deepcopy


def extract_triples(annotation, neg_oversampling=7, contr_oversampling=0.7, ellipsis_oversampling=3):
    """ Extracts plain-text triples from an annotation file and samples 'negative' examples by
        crossover. By default, the function will over-extract triples with negative polarity and
        elliptical constructions to counter class imbalance.

        params:
        dict annotation:            loaded annotation file (see load_annotations)
        int neg_oversampling:       how much to over-sample triples with negative polarity
        float contr_oversampling:   how much to sample contrast/invalid triples relative to true triples
        int ellipsis_oversampling:  how much to over-sample elliptical triples

        returns:    (turns, triples, labels) where each label is 0 (not entailed),
                    1 (entailed, positive polarity) or 2 (entailed, negative polarity)
    """
    turns = annotation['tokens']
    triple_ids = [t[:4] for t in annotation['annotations']]

    arguments = defaultdict(list)
    triples = []
    labels = []

    # Oversampling of elliptical triples
    for triple in deepcopy(triple_ids):
        subj_obj_turns = set([i for i, _ in triple[0] + triple[2]])
        if len(subj_obj_turns) > 1:
            triple_ids += [triple] * int(ellipsis_oversampling)

    # Extract 'True' triples
    for subj, pred, obj, polar in triple_ids:

        subj = ' '.join(turns[i][j] for i, j in subj) if subj else ''
        pred = ' '.join(turns[i][j] for i, j in pred) if pred else ''
        obj = ' '.join(turns[i][j] for i, j in obj) if obj else ''

        if subj or pred or obj:

            if not polar:
                triples += [(subj, pred, obj)]
                labels += [1]
            else:
                triples += [(subj, pred, obj)] * neg_oversampling  # Oversampling negative polarities
                labels += [2] * neg_oversampling

            arguments['subjs'].append(subj)
            arguments['preds'].append(pred)
            arguments['objs'].append(obj)

    # Skip if the annotation file was blank
    if not triples:
        return [], [], []

    # Sample fake contrast examples (invalid extractions)
    n = int(len(triples) * contr_oversampling)
    for _ in range(50):  # at most 50 sampling attempts per dialogue
        s = random.choice(arguments['subjs'])
        p = random.choice(arguments['preds'])
        o = random.choice(arguments['objs'])

        # Ensure samples are new (and not actually valid!)
        if (s, p, o) not in triples and s and p and o:
            triples += [(s, p, o)]
            labels += [0]
            n -= 1

        # Stop once the requested number of contrast examples has been sampled
        if n == 0:
            break

    return turns, triples, labels


In [None]:
tokens, triples, labels = [], [], []
for ann in annotations:
    ann_tokens, ann_triples, triple_labels = extract_triples(ann)
    triples.append(ann_triples)
    labels.append(triple_labels)
    tokens.append([t for ts in ann_tokens for t in ts + ['<eos>']])

j = random.choice(range(len(tokens)))
print('tokens: ', tokens[j])
print('triples:', triples[j])
print('labels: ', labels[j])

tokens:  ['did', 'SPEAKER2', 'vote', 'for', 'him', '?', '<eos>', 'SPEAKER2', 'sure', 'did', '.', 'how', 'about', 'SPEAKER1', '?', '<eos>', 'SPEAKER1', 'voted', 'for', 'him', '.', '<eos>']
triples: [('SPEAKER2', 'vote for', 'him'), ('SPEAKER1', 'voted for', 'him'), ('SPEAKER1', 'vote for', 'him')]
labels:  [1, 1, 0]


In [None]:
print('Class (im)balance:')
print('not entailed  ', sum([np.sum(np.array(t) == 0) for t in labels]))
print('entailed (pos)', sum([np.sum(np.array(t) == 1) for t in labels]))
print('entailed (neg)', sum([np.sum(np.array(t) == 2) for t in labels]))

Class (im)balance:
not entailed   6431
entailed (pos) 6690
entailed (neg) 5978


## Fine-tuning ALBERT for Triple Candidate Scoring

In [None]:
class TripleScoring(torch.nn.Module):
    def __init__(self, base_model='albert-base-v2', path=None, max_len=80, sep='<eos>'):
        super().__init__()
        # Load the pretrained base model
        print('loading %s for triple scoring' % base_model)
        self._model = AutoModel.from_pretrained(base_model)
        self._max_len = max_len
        self._base = base_model
        self._sep = sep

        # Load and extend tokenizer with SPEAKERS
        self._tokenizer = AutoTokenizer.from_pretrained(base_model)
        self._tokenizer.add_tokens(['SPEAKER1', 'SPEAKER2'], special_tokens=True)
        self._model.resize_token_embeddings(len(self._tokenizer))

        # SPO candidate scoring head with 3 classes:
        # 0 = not entailed, 1 = entailed (positive), 2 = entailed (negative)
        hidden_size = self._model.config.hidden_size
        self._head = torch.nn.Linear(hidden_size, 3)
        self._relu = torch.nn.ReLU()
        self._softmax = torch.nn.Softmax(dim=-1)

        # GPU support
        self._device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
        self.to(self._device)

        # Load model / tokenizer if pretrained model is given
        if path:
            print('\t- Loading pretrained')
            model_path = glob.glob(path + '/candidate_scorer_' + base_model)[0]
            self.load_state_dict(torch.load(model_path, map_location=self._device))

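    def forward(self, input_ids, speaker_ids, attn_mask):
        """ Computes the forward pass through the model and returns class
            probabilities (not entailed, entailed positively, entailed negatively)
        """
        out = self._model(input_ids=input_ids, token_type_ids=speaker_ids, attention_mask=attn_mask)
        # Classify from the hidden state of the first ([CLS]) token
        h = self._relu(out.last_hidden_state[:, 0])
        return self._softmax(self._head(h))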

    def _retokenize_dialogue(self, tokens, speaker=1):
        # Tokenize the dialogue turn by turn, flipping the speaker id after each turn
        f_input_ids = [self._tokenizer.cls_token_id]
        speaker_ids = [speaker]
        for turn in ' '.join(tokens).split(self._sep):
            token_ids = self._tokenizer.encode(turn, add_special_tokens=True)[1:]  # strip [CLS]
            f_input_ids += token_ids
            speaker_ids += [speaker] * len(token_ids)
            speaker = 1 - speaker

        return f_input_ids, speaker_ids

    def _retokenize_triple(self, triple):
        # Tokenize the triple as one space-joined string; its tokens get speaker id 0
        f_input_ids = self._tokenizer.encode(' '.join(triple), add_special_tokens=False)
        speaker_ids = [0] * len(f_input_ids)
        return f_input_ids, speaker_ids

    def _add_padding(self, sequence, pad_token):
        # If sequence is too long, cut off end
        sequence = sequence[:self._max_len]

        # Pad remainder to max_len
        padding = self._max_len - len(sequence)
        new_sequence = sequence + [pad_token] * padding

        # Mask out [PAD] tokens
        attn_mask = [1] * len(sequence) + [0] * padding
        return new_sequence, attn_mask

    def fit(self, tokens, triples, labels, epochs=2, lr=1e-6):
        """ Fits the model to the annotations, using one training example
            per (dialogue, candidate triple) pair
        """
        X = []
        for dialog_tokens, triple_lst, triple_labels in zip(tokens, triples, labels):

            # Tokenize dialogue
            dialog_input_ids, dialog_speakers = self._retokenize_dialogue(dialog_tokens)

            for triple, label in zip(triple_lst, triple_labels):
                # Tokenize triple
                triple_input_ids, triple_speakers = self._retokenize_triple(triple)

                # Concatenate dialogue (without its last token) + [UNK] + triple
                input_ids = dialog_input_ids[:-1] + [self._tokenizer.unk_token_id] + triple_input_ids
                speakers = dialog_speakers[:-1] + [0] + triple_speakers

                # Pad sequence with [PAD] to max_len
                input_ids, _ = self._add_padding(input_ids, self._tokenizer.pad_token_id)
                speakers, attn_mask = self._add_padding(speakers, 0)

                # Push Tensor to GPU
                input_ids = torch.LongTensor([input_ids]).to(self._device)
                speakers = torch.LongTensor([speakers]).to(self._device)
                attn_mask = torch.FloatTensor([attn_mask]).to(self._device)
                label_ids = torch.LongTensor([label]).to(self._device)

                X.append((input_ids, speakers, attn_mask, label_ids))

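        # Set up optimizer and objective
        optimizer = torch.optim.Adam(self.parameters(), lr=lr)
        # Note: CrossEntropyLoss expects raw logits, whereas forward() returns
        # softmax probabilities, so a softmax is effectively applied twice
        criterion = torch.nn.CrossEntropyLoss()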

        for epoch in range(epochs):
            random.shuffle(X)

            losses = []
            for input_ids, speaker_ids, attn_mask, y in tqdm(X):
                # Was the triple entailed? Positively? Negatively?
                y_hat = self(input_ids, speaker_ids, attn_mask)
                loss = criterion(y_hat, y)
                losses.append(loss.item())

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            print("mean loss =", np.mean(losses))

        # Save model to file
        torch.save(self.state_dict(), 'candidate_scorer_%s' % self._base)

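    def predict(self, tokens, triples):
        """ Scores all candidate triples against the dialogue and returns an
            array of shape (len(triples), 3) with the class probabilities
        """
        # Tokenize dialogue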
        dialog_input_ids, dialog_speakers = self._retokenize_dialogue(tokens)

        batch_input_ids = []
        batch_speakers = []
        batch_attn_mask = []

        for triple in triples:
            # Tokenize triple
            triple_input_ids, triple_speakers = self._retokenize_triple(triple)

            # Concatenate dialogue (without its last token, as in fit) + [UNK] + triple
            input_ids = dialog_input_ids[:-1] + [self._tokenizer.unk_token_id] + triple_input_ids
            speakers = dialog_speakers[:-1] + [0] + triple_speakers

            # Pad sequence with [PAD] to max_len
            input_ids, _ = self._add_padding(input_ids, self._tokenizer.pad_token_id)
            speakers, attn_mask = self._add_padding(speakers, 0)

            batch_input_ids.append(input_ids)
            batch_speakers.append(speakers)
            batch_attn_mask.append(attn_mask)

        # Push batches to GPU
        batch_input_ids = torch.LongTensor(batch_input_ids).to(self._device)
        batch_speakers = torch.LongTensor(batch_speakers).to(self._device)
        batch_attn_mask = torch.FloatTensor(batch_attn_mask).to(self._device)

        # Score all candidates in one batch; no gradients are needed at inference
        with torch.no_grad():
            label = self(batch_input_ids, batch_speakers, batch_attn_mask)
        label = label.cpu().numpy()
        return label
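
To make the input layout used by `fit` and `predict` concrete, here is a minimal sketch with toy integer ids instead of real tokenizer output. `build_input` and all id values are hypothetical and chosen for illustration only; the real class uses the tokenizer's `[UNK]` id as separator and `[PAD]` id for padding.

```python
def build_input(dialog_ids, dialog_speakers, triple_ids, sep_id, pad_id, max_len):
    """Concatenate dialogue + separator + triple, then truncate and pad to max_len."""
    input_ids = dialog_ids + [sep_id] + triple_ids
    # Separator and triple tokens all get speaker id 0
    speaker_ids = dialog_speakers + [0] * (1 + len(triple_ids))

    # Too long sequences are cut off at the end
    input_ids = input_ids[:max_len]
    speaker_ids = speaker_ids[:max_len]

    # Pad the remainder and mask out the padding positions
    n_pad = max_len - len(input_ids)
    attn_mask = [1] * len(input_ids) + [0] * n_pad
    return input_ids + [pad_id] * n_pad, speaker_ids + [0] * n_pad, attn_mask


# Toy ids (assumed values, not real ALBERT vocabulary ids)
dialog_ids = [2, 11, 12, 3, 13, 14, 3]    # [CLS] turn 1 [SEP] turn 2 [SEP]
dialog_speakers = [1, 1, 1, 1, 0, 0, 0]   # speaker id flips after each turn
triple_ids = [21, 22, 23]

ids, speakers, mask = build_input(dialog_ids, dialog_speakers, triple_ids,
                                  sep_id=1, pad_id=0, max_len=14)
print(ids)
print(speakers)
print(mask)
```

Since truncation removes tokens from the end, a dialogue longer than `max_len` pushes the candidate triple out of the input entirely, so `max_len` should be chosen with the longest expected dialogue in mind.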

In [None]:
# Load pretrained model
triplescorer_model = TripleScoring(path=root_dir+'/models/baseline2022-12-09')

loading albert-base-v2 for triple scoring
	- Loading pretrained


In [None]:
# Uncomment to fine-tune from scratch instead of loading the pretrained model
# scorer = TripleScoring()
# scorer.fit(tokens, triples, labels, epochs=7)

loading albert-base-v2 for triple scoring


100%|██████████| 19087/19087 [12:31<00:00, 25.38it/s]


mean loss = 0.8256498628235502


100%|██████████| 19087/19087 [11:46<00:00, 27.00it/s]


mean loss = 0.6630780672386243


100%|██████████| 19087/19087 [11:37<00:00, 27.37it/s]


mean loss = 0.6284048839612711


100%|██████████| 19087/19087 [11:28<00:00, 27.74it/s]


mean loss = 0.611301173039365


100%|██████████| 19087/19087 [11:33<00:00, 27.51it/s]


mean loss = 0.6033324699320163


100%|██████████| 19087/19087 [11:42<00:00, 27.17it/s]


mean loss = 0.5961232537333148


100%|██████████| 19087/19087 [11:34<00:00, 27.48it/s]


mean loss = 0.5922655518242205
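
The mean loss levels off just below 0.6. One likely contributing factor is that `forward` returns softmax probabilities while `torch.nn.CrossEntropyLoss` applies its own log-softmax to whatever it receives. The pure-Python sketch below (the helper `cross_entropy` is written here for illustration and only mimics that computation for a single example) shows that even a perfect one-hot probability vector then cannot reach a loss of zero.

```python
import math

def cross_entropy(inputs, target):
    """Cross-entropy with an internal log-softmax over `inputs` (single example)."""
    m = max(inputs)
    log_norm = m + math.log(sum(math.exp(x - m) for x in inputs))
    return log_norm - inputs[target]

# Best case when probabilities are fed in: a one-hot vector for the correct class
floor = cross_entropy([1.0, 0.0, 0.0], target=0)
print(round(floor, 4))  # 0.5514, i.e. log(1 + 2/e)

# The same confident prediction expressed as raw logits gives a loss near zero
print(cross_entropy([20.0, 0.0, 0.0], target=0))
```

Under this assumption the reported losses should be read relative to a floor of roughly 0.55 rather than 0.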


In [None]:
# inputs = 'staying here is fine though . SPEAKER1\'s two dogs keep me company <eos> SPEAKER2 do not love them ! What car do SPEAKER1 drive ? <eos> a toyota . but SPEAKER1 like nissans . <eos>'.split()
# triple_examples = [['SPEAKER1', 'drive', 'nissans'],
#                    ['SPEAKER1', 'like', 'nissans'], 
#                    ['SPEAKER2', 'like', 'nissans'], 
#                    ['SPEAKER2', 'love', 'two dogs'], 
#                    ['SPEAKER1', 'drive', 'a toyota']]

# inputs = '<eos> Do SPEAKER1 work in Amsterdam ? <eos> No , in London . <eos>'.split()
# triple_examples = [['SPEAKER1', 'work in', 'Amsterdam']]

inputs = 'SPEAKER1 adore unicorns but not photography <eos> What do SPEAKER1 like ? <eos> dogs and gaming, but not cats or elephants . <eos>'.split()
triple_examples = [['SPEAKER1', 'adore', 'cats'], 
                   ['SPEAKER1', 'like', 'dogs'],
                   ['SPEAKER1', 'like', 'gaming'],
                   ['SPEAKER1', 'adore', 'photography'],
                   ['SPEAKER1', 'like', 'cats'],
                   ['SPEAKER1', 'like', 'elephants'],
                   ['SPEAKER1', 'adore', 'elephants'],
                   ['SPEAKER1', 'like', 'photography'],
                   ['SPEAKER1', 'like', 'unicorns']]

np.round(triplescorer_model.predict(inputs, triple_examples), 3)

array([[0.998, 0.   , 0.001],
       [0.003, 0.945, 0.052],
       [0.003, 0.952, 0.045],
       [0.001, 0.001, 0.998],
       [0.   , 0.   , 1.   ],
       [0.   , 0.   , 1.   ],
       [0.999, 0.   , 0.001],
       [0.   , 0.   , 1.   ],
       [0.362, 0.009, 0.629]], dtype=float32)
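
Each row holds the probabilities for the classes 0 (not entailed), 1 (entailed, positive polarity) and 2 (entailed, negative polarity). As a small sketch, the rows can be turned into readable labels by taking the most probable class; `decode_scores` and `LABELS` are hypothetical helpers introduced here, and the scores are the first four rows copied from the output above.

```python
LABELS = ['not entailed', 'entailed (pos)', 'entailed (neg)']

def decode_scores(triples, scores, labels=LABELS):
    """Pair each candidate triple with its most probable class and that probability."""
    decoded = []
    for triple, row in zip(triples, scores):
        best = max(range(len(row)), key=lambda k: row[k])
        decoded.append((tuple(triple), labels[best], row[best]))
    return decoded


# First four candidates and their scores, copied from the output above
candidates = [['SPEAKER1', 'adore', 'cats'],
              ['SPEAKER1', 'like', 'dogs'],
              ['SPEAKER1', 'like', 'gaming'],
              ['SPEAKER1', 'adore', 'photography']]
scores = [[0.998, 0.000, 0.001],
          [0.003, 0.945, 0.052],
          [0.003, 0.952, 0.045],
          [0.001, 0.001, 0.998]]

decoded = decode_scores(candidates, scores)
for triple, label, prob in decoded:
    print(triple, label, prob)
```

The helper only indexes into each row, so it should also accept the array returned by `predict` directly.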

We copy the resulting model to Drive:

In [None]:
import os
import shutil
from datetime import date

# Create a dated output folder on Drive (no error if it already exists)
out_dir = root_dir + '/models/baseline' + str(date.today())
os.makedirs(out_dir, exist_ok=True)

shutil.copy('candidate_scorer_albert-base-v2', out_dir)

'/content/gdrive/MyDrive/Communicative Robotics/models/baseline2022-12-09/candidate_scorer_albert-base-v2'