# Triple and Perspective Extraction and Scoring with Normalization

In this notebook you will fine-tune a language model to perform triple argument (Subject, Predicate, Object) extraction and candidate triple scoring. For the predicates, you will define a set of categories; the aim of the model is to find the predicate token span as well as the most likely category for that predicate. You will use the PyTorch implementation of `albert-base-v2` provided by the Hugging Face `transformers` library and fine-tune this model on PersonaChat, DailyDialog and Circa data annotated with ground-truth triples. You will also try to adapt the code to allow for other models, and train those models to compare model performance on the normalized predicates.

## Overview 
Adopting a two-stage setup allows maximum flexibility of the triple extraction while making efficient use of the annotated data. The two stages include:

1. A sequence labeling (BIO-tagging) model which extracts lists of subjects, predicates and objects from the input dialogue.

2. A model which takes combinations of subjects, predicates and objects found and scores these combinations (i.e. all candidate triples) to decide whether the triple can indeed be entailed from the dialogue and what its polarity is.

By using this two-stage approach, arbitrary numbers of triples can be extracted and linguistic phenomena such as ellipsis can be accounted for.

The first part is most relevant for the extraction of abstract predicates, whereas the second part is relevant for the evaluation.
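
As a rough illustration of what the second stage receives as input, the candidate triples can be enumerated as the Cartesian product of the argument lists found by the first stage. This is a minimal sketch of the idea, not the notebook's implementation; the helper `candidate_triples` and the example spans are hypothetical.

```python
from itertools import product

def candidate_triples(subjects, predicates, objects):
    """Enumerate every (subject, predicate, object) combination."""
    return list(product(subjects, predicates, objects))

# Made-up stage-one output for a short dialogue
subjects = ["SPEAKER1", "SPEAKER2"]
predicates = ["like"]
objects = ["pizza", "cats"]

candidates = candidate_triples(subjects, predicates, objects)
print(len(candidates))  # 2 * 1 * 2 = 4
```

The scoring model then has to decide, for each of these candidates, whether it is entailed by the dialogue and with which polarity.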

## Getting the Data

To obtain a dataset of ground-truth triples, a small development set was created. This data is stored on Google Drive for easy access.

In [None]:
from google.colab import drive

drive.mount('/content/gdrive')
root_dir = '/content/gdrive/MyDrive/Communicative Robotics' 


In [None]:
import glob
import json
import random

def load_annotations(path, remove_unk=True, keep_skipped=False):
    """ Reads all annotation files from path. By default, it filters skipped
        files and removes the [unk] tokens appended at the end of each turn.

        params:
        str path:           name of directory containing annotations
        bool remove_unk:    whether to remove [unk] tokens (default: True)
        bool keep_skipped:  whether to keep skipped annotations (default: False)

        returns:    list of annotations dicts
    """
    annotations = []
    for fname in glob.glob(path + '/*.json'):
        with open(fname, 'r', encoding='utf-8') as file:
            data = json.load(file)

            if data['skipped'] and not keep_skipped:
                continue

            if remove_unk:
                data['tokens'] = [[t for t in turn if t != '[unk]'] for turn in data['tokens']]

            annotations.append(data)

    return annotations

annotations = load_annotations(root_dir + '/annotated_data/trainval') #include in new dir
annotations[0]

In [None]:
print('#dialogs:', len(annotations))
print('#triples:', sum([sum([any(t) for t in d['annotations']]) for d in annotations]))

In [None]:
def get_predicate_tokens(annotation, triple):
    """ Returns the tokens annotated as predicate in a triple as one string,
        or None if the triple has no predicate.
    """
    # triple[1] is the second element of the triple, i.e. the predicate. It is
    # a list of [turn, token] index pairs, so triple[1][0][0] is the
    # conversation turn and triple[1][0][1] the token index of the first
    # predicate token.
    #
    # An earlier version took all tokens from the first to the last predicate
    # index, which sometimes also picked up a subject or object token. This
    # version only returns the tokens that are annotated as predicate (some
    # subject/object tokens still slip in, but fewer than before).
    if triple[1]:
        turn = triple[1][0][0]
        pred_tokens = []
        for _, token_index in triple[1]:
            token = annotation['tokens'][turn][token_index]
            pred_tokens.append(token)
        return ' '.join(pred_tokens)
    else:
        return None


In [None]:
# for ann in annotations:
#   for triple in ann['annotations']:
#     print('predicate is: ', get_predicate_tokens(ann, triple))

## Who are 'You'?: Disambiguating You and I

In the dialogues, the speakers are referred to with the pronouns *you* and *I*. These words are ambiguous because their referent depends on the speaker who utters them, so we replace them with `SPEAKER1` and `SPEAKER2` depending on the speaker (e.g. when speaker 2 says *you*, the token becomes `SPEAKER1`).
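
To make the replacement rule concrete, here is a small worked example. It is a simplified sketch that assumes speakers alternate per turn and covers only first- and second-person pronouns (no possessives); the helper name `resolve` and the example dialogue are made up for illustration.

```python
FIRST_PERSON = {"i", "me", "myself", "we", "ourselves"}
SECOND_PERSON = {"you", "yourself", "yourselves"}

def resolve(token, turn_idx):
    """Map a first- or second-person pronoun to the speaker it refers to."""
    # Even turns are spoken by SPEAKER1, odd turns by SPEAKER2
    speaker = "SPEAKER1" if turn_idx % 2 == 0 else "SPEAKER2"
    listener = "SPEAKER2" if turn_idx % 2 == 0 else "SPEAKER1"
    if token in FIRST_PERSON:
        return speaker
    if token in SECOND_PERSON:
        return listener
    return token

dialogue = [["i", "like", "cats"], ["do", "you", "?"]]
resolved = [[resolve(tok, i) for tok in turn] for i, turn in enumerate(dialogue)]
print(resolved)
# [['SPEAKER1', 'like', 'cats'], ['do', 'SPEAKER1', '?']]
```

Both the *i* in the first turn and the *you* in the second turn refer to the same person, which is exactly what the replacement makes explicit.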

In [None]:
SPEAKER1 = 'SPEAKER1'
SPEAKER2 = 'SPEAKER2'

def disambiguate_pronouns(token, turn_idx):
    # Even turns -> speaker1
    if turn_idx % 2 == 0:
        if token in ['i', 'me', 'myself', 'we', 'ourselves']:
            return SPEAKER1
        elif token in ['my', 'mine', 'our', 'ours']:
            return SPEAKER1 + "'s"
        elif token in ['you', 'yourself', 'yourselves']:
            return SPEAKER2
        elif token in ['your', 'yours']:
            return SPEAKER2 + "'s"
    else:
        if token in ['i', 'me', 'myself', 'we', 'ourselves']:
            return SPEAKER2
        elif token in ['my', 'mine', 'our', 'ours']:
            return SPEAKER2 + "'s"
        elif token in ['you', 'yourself', 'yourselves']:
            return SPEAKER1
        elif token in ['your', 'yours']:
            return SPEAKER1 + "'s"
    return token

In [None]:
for annotation in annotations:
    annotation['tokens'] = [[disambiguate_pronouns(token, i % 2) for token in turn] for i, turn in enumerate(annotation['tokens'])]
annotations[0]

# Creating sets of abstract predicates

First, we create a set consisting of all unique token spans annotated as 'predicate'. This is the set from which you need to create a set of abstract predicates. Notice that a lot of 'unique predicates' are also annotation errors, where the subject or object has been included in the predicate annotation. 

For each of the abstract predicates you define, you will need to create a numerical B- and I-tag for the BIO annotation that we will use. Start from (3, 4) for the first predicate (e.g. B-like = 3, I-like = 4). The tag 0 is reserved for 'O', and 1 and 2 are reserved for the B- and I-tags of subjects and objects. Predicate B-tags are therefore always odd and predicate I-tags always even.

O-subject: 0   
B-subject: 1   
I-subject: 2   

O-object: 0   
B-object: 1   
I-object: 2    

O-predicate: 0   

B-predicate1: 3   
I-predicate1: 4   
B-predicate2: 5   
I-predicate2: 6   
B-predicate3: 7   
I-predicate3: 8   
B-predicate4: 9   
I-predicate4: 10   
etc.

We will define these predicates below.

Once you've created a set of abstract predicates and their corresponding B- and I-tags, you will need to create two lookup dictionaries for the BIO tagging that we will do later on. The dictionary `lookup` is used to get the correct BIO tag from the predicate token span and is used for converting triples to BIO-tags. The dictionary `bio_lookup` is used to get the abstract predicate from the BIO tag, and is used to convert BIO tags to tokens.
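
The sketch below shows how the two dictionaries relate to each other on a toy example. The abstract predicates and surface forms are invented for illustration; in the notebook they are read from JSON files.

```python
# Hypothetical abstract predicates, each with the surface forms it covers
abstract_predicates = {
    "like": ["like", "love", "am fond of"],
    "have": ["have", "own"],
}

lookup = {}      # predicate token span -> (B-tag, I-tag)
bio_lookup = {}  # B-tag -> abstract predicate

next_tag = 3  # 0 is 'O'; 1 and 2 are the subject/object B- and I-tags
for abstract, surface_forms in abstract_predicates.items():
    b_tag, i_tag = next_tag, next_tag + 1
    bio_lookup[b_tag] = abstract
    for form in surface_forms:
        lookup[form] = (b_tag, i_tag)
    next_tag += 2

print(lookup["love"])  # (3, 4)
print(bio_lookup[5])   # have
```

Going from a token span to its abstract predicate is then a two-step lookup: `bio_lookup[lookup[span][0]]`.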

In [None]:
unique_predicates = set()
for ann in annotations:
  for triple in ann['annotations']:
    pred = get_predicate_tokens(ann, triple)
    if pred is not None:
      unique_predicates.add(pred)

In [None]:
print(unique_predicates)

## Read training dictionaries from JSON files

In [None]:
with open(root_dir+'/conversion_dict_level1.json', 'r', encoding='utf-8') as fin:
  train_dict_level1 = json.load(fin)

In [None]:
with open(root_dir+'/conversion_dict_level2.json', 'r', encoding='utf-8') as fin:
  train_dict_level2 = json.load(fin)

## Continue with BIO tagging

In [None]:
## Annotate the training data with BIO tags
## Read the JSON file and map to abstract predicates

## BIO tags level 1

counter = 3
bio_lookup_l1 = {}
bio_dict_l1 = {}
lookup_l1 = {}

for abstract, predicates in train_dict_level1.items():
  bio_lookup_l1[counter] = abstract
  bio_dict_l1[(counter,counter+1)] = predicates
  counter += 2

for key, value in bio_dict_l1.items():
    for pred in value:
        lookup_l1[pred] = key

print(lookup_l1)
print(bio_lookup_l1)

In [None]:
## Annotate the training data with BIO tags
## Read the JSON file and map to abstract predicates

## BIO tags level 2

counter = 3
bio_lookup_l2 = {}
bio_dict_l2 = {}
lookup_l2 = {}

for abstract, predicates in train_dict_level2.items():
  bio_lookup_l2[counter] = abstract
  bio_dict_l2[(counter,counter+1)] = predicates
  counter += 2

for key, value in bio_dict_l2.items():
    for pred in value:
        lookup_l2[pred] = key

print(lookup_l2)
print(bio_lookup_l2)

## Abstract predicates Evaluation data

In [None]:
# Replaces the predicates in the evaluation data with abstract predicates

def load_evaluation_data(path):
  """
  Reads a text file.

  Returns a list with the lines of the file.
  """
  with open(path, 'r', encoding='utf-8') as file:
    data = file.readlines()
    return data


def abstract_predicates(file, abstract_dict, lookup, bio_lookup, root_dir):
  dict_values = []
  abstract_content = ""
  for values in abstract_dict.values():
    for value in values:
      dict_values.append(value)

  # Output file: keep the line that matches the evaluation file being converted
  filename = root_dir + '/annotated_data/evaluatiedata/test_declarative_statements_abstract_val_l1.txt'
  # filename = root_dir + '/annotated_data/evaluatiedata/test_single_utterances_abstract_val_l1.txt'
  # filename = root_dir + '/annotated_data/evaluatiedata/test_coreference_abstract_val_l1.txt'

  for row in file:
    if len(row.split(",")) != 4:
      # Not a triple row: copy it to the output file unchanged
      append_abstract_content(filename, row)
    else:
      triple = row.split(",")
      predicate = triple[1]
      if predicate in dict_values:
        key = lookup[predicate]
        abstract_pred = bio_lookup[key[0]]
        row = row.replace(predicate, abstract_pred)
      append_abstract_content(filename, row)
  # The rows are written to `filename`; abstract_content itself stays empty
  return abstract_content

def append_abstract_content(filename, content):
  with open(filename, "a") as outfile:
    outfile.write(content)

# file = load_evaluation_data(root_dir + '/annotated_data/evaluatiedata/test_declarative_statements_val.txt')
# file = load_evaluation_data(root_dir + '/annotated_data/evaluatiedata/test_single_utterances_val.txt')
# file = load_evaluation_data(root_dir + '/annotated_data/evaluatiedata/test_coreference_val.txt')

# abstract_pedicate = abstract_predicates(file, bio_dict_l1, lookup_l1, bio_lookup_l1, root_dir)
# abstract_pedicate = abstract_predicates(file, bio_dict_l2, lookup_l2, bio_lookup_l2, root_dir)



## Converting formats

Triple arguments are stored as lists of indices (e.g. [[0, 1], [0, 2]] indicating the second and third token of the first turn). Instead, we use a BIO tagging scheme to represent these arguments as a vector of labels (one label for each token in the dialogue).

Moreover, we flatten the dialogue turns into one flat dialogue using `<eos>` as a separator token.
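
As a worked example of this conversion, the sketch below tags a single subject/object-style span (O=0, B=1, I=2) in a flattened two-turn dialogue. The helper `to_bio` and the dialogue are hypothetical and only cover one span; the notebook's own function also handles predicate tags.

```python
def to_bio(turns, span):
    """Return one BIO label per token of the flattened dialogue (O=0, B=1, I=2)."""
    flat_len = sum(len(turn) + 1 for turn in turns)  # +1 per turn for <eos>
    labels = [0] * flat_len
    for j, (turn_id, token_id) in enumerate(span):
        # Offset of the turn within the flattened dialogue
        offset = sum(len(t) + 1 for t in turns[:turn_id])
        labels[offset + token_id] = 1 if j == 0 else 2
    return labels

turns = [["i", "like", "ice", "cream"], ["me", "too"]]
flat = [tok for turn in turns for tok in turn + ["<eos>"]]
labels = to_bio(turns, [[0, 2], [0, 3]])  # the span "ice cream"

print(list(zip(flat, labels)))
# [('i', 0), ('like', 0), ('ice', 1), ('cream', 2), ('<eos>', 0),
#  ('me', 0), ('too', 0), ('<eos>', 0)]
```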

In [None]:
import numpy as np
import pandas as pd
import re


def triple_to_bio_tags(annotation, arg, lookup):
    """ Converts the token indices of the annotations to a vector of BIO labels
        for an argument.

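        params:
        dict annotation:    loaded annotation file (see load_annotations)
        int arg:            argument to create tag sequence for (subj=0, pred=1, obj=2)
        dict lookup:        maps a predicate token span to its (B-tag, I-tag) pair

        returns:    ndarray with BIO labels (O=0, B=1, I=2 for subjects and objects;
                    the tags from lookup for predicates)
    """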
    # Determine length of dialogue
    turns = annotation['tokens']
    triples = annotation['annotations']
    num_tokens = sum([len(turn) + 1 for turn in turns])  # +1 for <eos>

    # Create vector same size as dialogue
    mask = np.zeros(num_tokens, dtype=np.int64)  # predicate tags exceed 255, so uint8 would overflow

    # Label annotated arguments as BIO tags
    for triple in triples:
        if arg == 1:
            pred = get_predicate_tokens(annotation, triple)
            if pred is not None:
                try:
                  pred = pred.strip().strip("'").strip()
                  B_tag, I_tag = lookup[pred]
                except KeyError:
                  # Span not found in lookup: fall back to the hard-coded tag pair
                  B_tag, I_tag = (287,288)

                for j, (turn_id, token_id) in enumerate(triple[arg]):
                    k = sum([len(t) + 1 for t in turns[:turn_id]]) + token_id  # k = index of token in dialogue
                    if j == 0:
                      mask[k] = B_tag
                    else:
                      mask[k] = I_tag
        else:
            for j, (turn_id, token_id) in enumerate(triple[arg]):
                k = sum([len(t) + 1 for t in turns[:turn_id]]) + token_id  # k = index of token in dialogue
                mask[k] = 1 if j == 0 else 2
        
    return mask

In [None]:
## LEVEL 1
tokens, labels = [], []
for ann in annotations:
    # Map triple arguments to BIO tagged masks
    labels.append((triple_to_bio_tags(ann, 0, lookup_l1),
                   triple_to_bio_tags(ann, 1, lookup_l1),
                   triple_to_bio_tags(ann, 2, lookup_l1)))
    
    # Flatten turn sequence
    tokens.append([t for ts in ann['tokens'] for t in ts + ['<eos>']])
    
# Show as BIO scheme
i = random.randint(0, len(tokens) - 1)
pd.DataFrame(labels[i], columns=tokens[i], index=['subj', 'pred', 'obj'])

tokens_l1 = tokens
labels_l1 = labels

In [None]:
## LEVEL 2
tokens, labels = [], []
for ann in annotations:
    # Map triple arguments to BIO tagged masks
    labels.append((triple_to_bio_tags(ann, 0, lookup_l2),
                   triple_to_bio_tags(ann, 1, lookup_l2),
                   triple_to_bio_tags(ann, 2, lookup_l2)))
    
    # Flatten turn sequence
    tokens.append([t for ts in ann['tokens'] for t in ts + ['<eos>']])
    
# Show as BIO scheme
i = random.randint(0, len(tokens) - 1)
pd.DataFrame(labels[i], columns=tokens[i], index=['subj', 'pred', 'obj'])

tokens_l2 = tokens
labels_l2 = labels

In [None]:
import re

def bio_tags_to_tokens(tokens, mask, bio_lookup, predicate=False, one_hot=False):
    """ Converts a vector of BIO-tags into spans of tokens. If BIO-tags are one-hot encoded,
        one_hot=True will first perform an argmax to obtain the BIO labels.

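        params:
        list tokens:      list of subwords or tokens (as tokenized by Albert/AutoTokenizer)
        ndarray mask:     list of BIO labels (one for each subword or token in 'tokens')
        dict bio_lookup:  maps a predicate B-tag to its abstract predicate
        bool predicate:   whether mask holds predicate tags; if True, the abstract
                          predicate of each B-tag is returned instead of the token span
        bool one_hot:     whether to interpret mask as per-class scores of shape
                          |sequence| x |C|, to which an argmax is applied

        returns:    set of extracted spans (or abstract predicates)
    """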
    out = []
    span = []
    for i, token in enumerate(tokens):
        pred = mask[i]

        # Reverse one-hot encoding (optional)
        if one_hot:
            pred = np.argmax(pred)
          

        if pred %2 == 1:  # B
            if predicate:
                span = bio_lookup[pred]
                out.append(span)

            else:
                span = re.sub(r"[^\w\d\-']+", ' ', ''.join(span)).strip()
                out.append(span)
                span = [token]

        elif pred != 0 and pred %2 == 0:  # I
            if predicate:
                continue
            else:
                span.append(token)

    if span:
        span = re.sub(r"[^\w\d\-']+", ' ', ''.join(span)).strip()
        out.append(span)

    # Remove empty strings and duplicates
    return set([span for span in out if span.strip()])

In [None]:
## LEVEL 1 

## ONLY PRINTS ONE PREDICATE (DOES PRINT MULTIPLE SUBJECTS AND OBJECTS)
i = random.randint(0, len(labels_l1) - 1)  # randint includes both endpoints
print(' '.join(tokens_l1[i]) + '\n')

print('Subjects:')
print(bio_tags_to_tokens(['+' + t for t in tokens_l1[i]], labels_l1[i][0], bio_lookup_l1))

print('\nPredicates:')
print(bio_tags_to_tokens(['+' + t for t in tokens_l1[i]], labels_l1[i][1], bio_lookup_l1, predicate=True))

print('\nObjects:')
print(bio_tags_to_tokens(['+' + t for t in tokens_l1[i]], labels_l1[i][2], bio_lookup_l1))

In [None]:
## LEVEL 2

## ONLY PRINTS ONE PREDICATE (DOES PRINT MULTIPLE SUBJECTS AND OBJECTS)
i = random.randint(0, len(labels_l2) - 1)  # randint includes both endpoints
print(' '.join(tokens_l2[i]) + '\n')

print('Subjects:')
print(bio_tags_to_tokens(['+' + t for t in tokens_l2[i]], labels_l2[i][0], bio_lookup_l2))

print('\nPredicates:')
print(bio_tags_to_tokens(['+' + t for t in tokens_l2[i]], labels_l2[i][1], bio_lookup_l2, predicate=True))

print('\nObjects:')
print(bio_tags_to_tokens(['+' + t for t in tokens_l2[i]], labels_l2[i][2], bio_lookup_l2))

## Setting up ALBERT for Argument Extraction

Now we set up ALBERT with a token classification head for each of the arguments. To this end we will use PyTorch to create a small linear classifier for each argument which we can slide over the output of ALBERT to make a prediction for each token. To train other models, you should adapt this code so that it works with your model of choice.
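
To illustrate what "sliding a linear classifier over the output" means, the sketch below applies one small linear layer plus softmax to every token vector separately. It is written in plain Python with made-up numbers to stay dependency-free; the notebook itself uses `torch.nn.Linear` on the hidden states of ALBERT.

```python
import math

def linear(vector, weights, bias):
    """Compute weights @ vector + bias for one token (weights: |C| x hidden)."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy encoder output: 3 tokens, hidden size 2 (made-up numbers)
hidden_states = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

# Toy classification head for 3 classes (O=0, B=1, I=2)
weights = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
bias = [0.5, 0.0, 0.0]

# The same head is applied to each token independently
probs = [softmax(linear(h, weights, bias)) for h in hidden_states]
tags = [max(range(len(p)), key=p.__getitem__) for p in probs]
print(tags)  # [1, 2, 0]
```

Because the head is shared across positions, the number of parameters does not depend on the dialogue length; only the number of classes (the output dimension) matters.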

In [None]:
%%capture 
!pip install transformers

import torch
from transformers import AutoTokenizer, AutoModel, AutoConfig
from tqdm import tqdm
import numpy as np
import random
from datetime import date

In [None]:
print(bio_dict_l1) #504 + 1

In [None]:
print(bio_dict_l2) #340 + 1

In [None]:
class ArgumentExtraction(torch.nn.Module):
    def __init__(self, base_model='albert-base-v2', path=None, output_dim=505, sep='<eos>'): # You need to change the output_dim to the total number of BIO-tags (including 0,1,2)
        """ Init model with multi-span extraction heads for SPO arguments.

            params:
            str base_model: Transformer architecture to use (default: albert-base-v2)
            str path:       Path to pretrained model
        """
        super().__init__()
        print('loading %s for argument extraction' % base_model)
        self._model = AutoModel.from_pretrained(base_model)
        self._base = base_model
        self._sep = sep

        # Load and extend tokenizer with special SPEAKER tokens
        self._tokenizer = AutoTokenizer.from_pretrained(base_model)
        self._tokenizer.add_tokens(['SPEAKER1', 'SPEAKER2'], special_tokens=True)
        self._model.resize_token_embeddings(len(self._tokenizer))

        # Add token classification heads
        hidden_size = AutoConfig.from_pretrained(base_model).hidden_size
        self._subj_head = torch.nn.Linear(hidden_size, output_dim)
        self._pred_head = torch.nn.Linear(hidden_size, output_dim)
        self._obj_head = torch.nn.Linear(hidden_size, output_dim)
        self._output_dim = output_dim

        self._relu = torch.nn.ReLU()
        self._softmax = torch.nn.Softmax(dim=-1)

        # Set GPU if available
        self._device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
        self.to(self._device)

        # Load model / tokenizer if pretrained model is given
        if path:
            print('\t- Loading pretrained')
            model_path = path + '/argument_extraction_' + base_model
            self.load_state_dict(torch.load(model_path, map_location=self._device))

    def forward(self, input_ids, speaker_ids):
        """ Computes BIO label probabilities for each token
        """
        # Feed dialog through transformer
        y = self._model(input_ids=input_ids, token_type_ids=speaker_ids)
        h = self._relu(y.last_hidden_state)

        # Predict spans
        y_subj = self._softmax(self._subj_head(h))
        y_pred = self._softmax(self._pred_head(h))
        y_obj_ = self._softmax(self._obj_head(h))

        # Permute output as tensor of shape (N, |C|, seq_len)
        y_subj = y_subj.permute(0, 2, 1)
        y_pred = y_pred.permute(0, 2, 1)
        y_obj_ = y_obj_.permute(0, 2, 1)
        return y_subj, y_pred, y_obj_

    def _retokenize_tokens(self, tokens):
        # Tokenize each token individually (keeping track of subwords)
        input_ids = [[self._tokenizer.cls_token_id]]
        for t in tokens:
            if t != '<eos>':
                input_ids.append(self._tokenizer.encode(t, add_special_tokens=False))
            else:
                input_ids.append([self._tokenizer.eos_token_id])

        # Flatten input_ids
        f_input_ids = torch.LongTensor([[i for ids in input_ids for i in ids]]).to(self._device)

        # Determine how often we need to repeat the labels
        repeats = [len(ids) for ids in input_ids]

        # Set speaker IDs
        speaker_ids = [0] + [tokens[:i + 1].count(self._sep) % 2 for i in range(len(tokens))][:-1]  # TODO: make pretty
        speaker_ids = self._repeat_speaker_ids(speaker_ids, repeats)

        return f_input_ids, speaker_ids, repeats

    def _repeat_speaker_ids(self, speaker_ids, repeats):
        """ Repeats speaker IDs for tokens that are split into multiple subwords.
        """
        rep_speaker_ids = np.repeat([0] + list(speaker_ids), repeats=repeats)
        return torch.LongTensor([rep_speaker_ids]).to(self._device)

    def _repeat_labels(self, labels, repeats):
        """ Repeats BIO labels for tokens that are split into multiple subwords.
            Ensures B-labeled tokens are repeated as B-I-I etc.
        """
        # Repeat each label by the number of subwords per token
        rep_labels = []
        for label, rep in zip([0] + list(labels), repeats):
            # Outside
            if label == 0:
                rep_labels += [label] * rep
            # Beginning + Inside
            elif (label % 2) == 1:
                rep_labels += [label] + ([label+1] * (rep - 1))  # If label = B -> because all B-tags are odd numbers
            else:
               rep_labels += [label] + ([label] * (rep - 1))  # If label = I -> do not add 1, but keep the same 
        return torch.LongTensor([rep_labels]).to(self._device)

    def fit(self, tokens, labels, epochs=2, lr=1e-5, weight=3):
        """ Fits the model to the annotations
        """
        # Re-tokenize to obtain input_ids and associated labels
        X = []
        for token_seq, (subj_labels, pred_labels, _obj_labels) in zip(tokens, labels):
            input_ids, speaker_ids, repeats = self._retokenize_tokens(token_seq)
            subj_labels = self._repeat_labels(subj_labels, repeats)  # repeat when split into subwords
            pred_labels = self._repeat_labels(pred_labels, repeats)
            _obj_labels = self._repeat_labels(_obj_labels, repeats)
            X.append((input_ids, speaker_ids, subj_labels, pred_labels, _obj_labels))

        # Set up optimizer
        optim = torch.optim.Adam(self.parameters(), lr=lr)

        # Higher weight for B- and I-tags to account for class imbalance
        class_weights = torch.Tensor([1] + [weight] * (self._output_dim - 1)).to(self._device)
        criterion = torch.nn.CrossEntropyLoss(weight=class_weights)

        print('Training!')
        for epoch in range(epochs):
            losses = []
            random.shuffle(X)
            for input_ids, speaker_ids, subj_y, pred_y, obj_y in tqdm(X):
                # Forward pass
                subj_y_hat, pred_y_hat, obj_y_hat = self(input_ids, speaker_ids)

                # Compute loss
                loss = criterion(subj_y_hat, subj_y)
                loss += criterion(pred_y_hat, pred_y)
                loss += criterion(obj_y_hat, obj_y)
                losses.append(loss.item())

                optim.zero_grad()
                loss.backward()
                optim.step()

            print("mean loss =", np.mean(losses))

        # Save model to file
        torch.save(self.state_dict(), 'argument_extraction_%s' % self._base)

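    def predict(self, token_seq):
        """ Predicts BIO label probabilities for a token sequence. Returns the
            subject, predicate and object predictions together with the subwords.
        """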
        # Retokenize token sequence
        input_ids, speaker_ids, _ = self._retokenize_tokens(token_seq)

        # Invert tokenization for viewing
        subwords = self._tokenizer.convert_ids_to_tokens(input_ids[0])

        # Forward-pass
        predictions = self(input_ids, speaker_ids)
        subjs = predictions[0].cpu().detach().numpy()[0]
        preds = predictions[1].cpu().detach().numpy()[0]
        objs = predictions[2].cpu().detach().numpy()[0]

        return subjs, preds, objs, subwords

In [None]:
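def save_model(level, epochs, weight):
  import os, shutil
  
  out_dir = root_dir + '/models/' + level + '_' + str(epochs) + '_' + str(weight) + '_' + str(date.today())
  # makedirs also creates the parent 'models' directory if it does not exist yet
  os.makedirs(out_dir, exist_ok=True)

  # fit() saves the weights as 'argument_extraction_<base_model>' in the working directory
  shutil.copy('argument_extraction_albert-base-v2', out_dir)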

In [None]:
model_l1 = ArgumentExtraction(output_dim = 505)  
epochs = 4
weight = 60
model_l1.fit(tokens_l1, labels_l1, epochs=epochs, weight=weight)
save_model('l1',epochs,weight)

In [None]:
model_l2 = ArgumentExtraction(output_dim = 341)  
epochs = 6
weight = 40
model_l2.fit(tokens_l2, labels_l2, epochs=epochs, weight=weight)
save_model('l2',epochs,weight)

## Putting It All Together

Below you can see how the BIO scheme assigns tokens to the SPO arguments.
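
As a small illustration of reading spans off a tag sequence, the sketch below groups B/I-tagged subwords into text spans. It is a simplified variant under stated assumptions: the helper `spans_from_tags` is hypothetical, it only handles the subject/object tags (O=0, B=1, I=2), and unlike `bio_tags_to_tokens` it closes a span as soon as an O tag is seen.

```python
def spans_from_tags(subwords, tags):
    """Group B/I-tagged subwords into text spans (O=0, B=1, I=2)."""
    spans, current = [], []
    for piece, tag in zip(subwords, tags):
        if tag == 1:                # B: close the open span and start a new one
            if current:
                spans.append(current)
            current = [piece]
        elif tag == 2 and current:  # I: extend the open span
            current.append(piece)
        else:                       # O (or a stray I): close the open span
            if current:
                spans.append(current)
            current = []
    if current:
        spans.append(current)
    # SentencePiece marks word starts with '▁'; turn it back into a space
    return ["".join(s).replace("▁", " ").strip() for s in spans]

subwords = ["▁i", "▁like", "▁ice", "▁cream", "▁sand", "wich", "es"]
tags = [1, 0, 1, 2, 2, 2, 2]
print(spans_from_tags(subwords, tags))
# ['i', 'ice cream sandwiches']
```

Joining the subwords before restoring the spaces is what glues `sand`, `wich` and `es` back into a single word.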

In [None]:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')

def transform_data_speaker_disambiguation(file):
  string = '<eos>'
  space = ' '
  inputs = []
  for row in file:
    if row.find('<eos>') >= 0:
      row = row.replace(string, space + string + space)
      all_tokens = []
      speaker_utterances = row.split('<eos>')  # list of strings, each string is the utterance of one speaker
      for utt in speaker_utterances:
        tokens = [token.lower() for token in word_tokenize(utt)]
        tokens.append('<eos>')
        for token in tokens:
          all_tokens.append(token)
      #tokens = [token.lower() for token in word_tokenize(row)]
      #words = row.split()
      all_tokens.pop()
      inputs.append(all_tokens)

  print("first utterance in tokens:", inputs[0])

  #speaker disambiguation
  disambig_inputs = []
  for utterance in inputs:
    disambig_utterance = []
    turn_counter = 0
    for token in utterance:
      if token == '<eos>':
        turn_counter +=1
      disambig_utterance.append(disambiguate_pronouns(token, turn_counter))
    disambig_inputs.append(disambig_utterance)

  print("first utterance in tokens with speaker disambiguation:", disambig_inputs[0])

  return inputs, disambig_inputs


In [None]:
file = load_evaluation_data(root_dir + '/annotated_data/evaluatiedata/test_declarative_statements_abstract_val_l1.txt')
# file = load_evaluation_data(root_dir + '/annotated_data/evaluatiedata/test_coreference_abstract_val_l1.txt')
# file = load_evaluation_data(root_dir + '/annotated_data/evaluatiedata/test_single_utterances_abstract_val_l1.txt')
inputs, disambig_inputs = transform_data_speaker_disambiguation(file)


## Level 1 predictions and validation

In [None]:
tags_count_l1 = len(bio_dict_l1)*2+2
tags_count_l1 = list(range(0, tags_count_l1 + 1))

In [None]:
y_subj, y_pred, y_obj, subwords = model_l1.predict(disambig_inputs[0])

# show results
for arg, y in [('Subject', y_subj), ('Predicate', y_pred), ('Object', y_obj)]:
    print('\n', arg)
    print(''.ljust(15) + '\t'.join(map(str, tags_count_l1)))
    for score, token in zip(y.T, subwords):
        score_str = '\t'.join(["[" + str(s)[:5] + "]" if s == max(score) else " " + str(round(s, 4))[:5] + " " for s in score])
        token_str = token.replace('▁', '')
        print(token_str.ljust(15) + score_str)

In [None]:
for token_seq in disambig_inputs:  # avoid shadowing the built-in input()
  y_subj, y_pred, y_obj, subwords = model_l1.predict(token_seq)

  print(' '.join(subwords).replace('▁', '') + '\n')
  print('Subjects:  ', bio_tags_to_tokens(subwords, y_subj.T, bio_lookup_l1, one_hot=True))
  print('Predicates:', bio_tags_to_tokens(subwords, y_pred.T, bio_lookup_l1, predicate=True, one_hot=True))
  print('Objects:   ', bio_tags_to_tokens(subwords, y_obj.T, bio_lookup_l1, one_hot=True))
  print()

## Level 2 predictions and validation

In [None]:
file = load_evaluation_data(root_dir + '/annotated_data/evaluatiedata/test_declarative_statements_abstract_val_l2.txt')
# file = load_evaluation_data(root_dir + '/annotated_data/evaluatiedata/test_coreference_abstract_val_l2.txt')
# file = load_evaluation_data(root_dir + '/annotated_data/evaluatiedata/test_single_utterances_abstract_val_l2.txt')
inputs, disambig_inputs = transform_data_speaker_disambiguation(file)

In [None]:
tags_count_l2 = len(bio_dict_l2)*2+2
tags_count_l2 = list(range(0, tags_count_l2 + 1))

In [None]:
y_subj, y_pred, y_obj, subwords = model_l2.predict(disambig_inputs[0])

# show results
for arg, y in [('Subject', y_subj), ('Predicate', y_pred), ('Object', y_obj)]:
    print('\n', arg)
    print(''.ljust(15) + '\t'.join(map(str, tags_count_l2)))
    for score, token in zip(y.T, subwords):
        score_str = '\t'.join(["[" + str(s)[:5] + "]" if s == max(score) else " " + str(round(s, 4))[:5] + " " for s in score])
        token_str = token.replace('▁', '')
        print(token_str.ljust(15) + score_str)

In [None]:
for token_seq in disambig_inputs:  # avoid shadowing the built-in input()
  y_subj, y_pred, y_obj, subwords = model_l2.predict(token_seq)

  print(' '.join(subwords).replace('▁', '') + '\n')
  print('Subjects:  ', bio_tags_to_tokens(subwords, y_subj.T, bio_lookup_l2, one_hot=True))
  print('Predicates:', bio_tags_to_tokens(subwords, y_pred.T, bio_lookup_l2, predicate=True, one_hot=True))
  print('Objects:   ', bio_tags_to_tokens(subwords, y_obj.T, bio_lookup_l2, one_hot=True))
  print()

# Ranking the triples

Now we are able to extract the candidate arguments, but how do we combine them?

We compute all combinations of the subjects, predicates and objects and train a model to distinguish between those triples that are entailed (not considering negation here) and those that are not.

For this, we extract a number of negative examples from possible triples, i.e. those combinations of subjects, predicates and objects that were not annotated.
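
The crossover idea can be sketched as follows: recombine the subjects, predicates and objects of the annotated triples at random and keep only combinations that were not annotated. This is an illustrative sketch with hypothetical names (`sample_negatives`, `max_tries`, `seed`) and made-up triples, not the notebook's exact sampling procedure.

```python
import random

def sample_negatives(true_triples, n, max_tries=50, seed=0):
    """Recombine arguments of true triples into up to n unannotated triples."""
    rng = random.Random(seed)
    subjs = [s for s, _, _ in true_triples]
    preds = [p for _, p, _ in true_triples]
    objs = [o for _, _, o in true_triples]

    known = set(true_triples)  # combinations that may not be sampled (again)
    negatives = []
    for _ in range(max_tries):
        if len(negatives) >= n:
            break
        candidate = (rng.choice(subjs), rng.choice(preds), rng.choice(objs))
        if candidate not in known:
            negatives.append(candidate)
            known.add(candidate)
    return negatives

true_triples = [("SPEAKER1", "like", "cats"), ("SPEAKER2", "have", "a dog")]
negatives = sample_negatives(true_triples, n=2)
print(negatives)
```

Capping the number of attempts matters because a dialogue with very few triples may not offer enough new combinations to sample from.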

## Converting format

In [None]:
from collections import defaultdict
from copy import deepcopy


def extract_triples(annotation, neg_oversampling=7, contr_oversampling=0.7, ellipsis_oversampling=3):
    """ Extracts plain-text triples from an annotation file and samples 'negative' examples by
        crossover. By default, the function will over-extract triples with negative polarity and
        elliptical constructions to counter class imbalance.

        params:
        dict annotation:            loaded annotation file (see load_annotations)
        int neg_oversampling:       how much to over-sample triples with negative polarity
        float contr_oversampling:   how much to sample contrast/invalid triples relative to true triples
        int ellipsis_oversampling:  how much to over-sample elliptical triples
    """
    turns = annotation['tokens']
    triple_ids = [t[:4] for t in annotation['annotations']]

    arguments = defaultdict(list)
    triples = []
    labels = []

    # Oversampling of elliptical triples
    for triple in deepcopy(triple_ids):
        subj_obj_turns = set([i for i, _ in triple[0] + triple[2]])
        if len(subj_obj_turns) > 1:
            triple_ids += [triple] * int(ellipsis_oversampling)

    # Extract 'True' triples
    for subj, pred, obj, polar in triple_ids:

        subj = ' '.join(turns[i][j] for i, j in subj) if subj else ''
        pred = ' '.join(turns[i][j] for i, j in pred) if pred else ''
        obj = ' '.join(turns[i][j] for i, j in obj) if obj else ''

        #print(pred)

        if subj or pred or obj:

            if not polar:
                triples += [(subj, pred, obj)]
                labels += [1]
            else:
                triples += [(subj, pred, obj)] * neg_oversampling  # Oversampling negative polarities
                labels += [2] * neg_oversampling

            arguments['subjs'].append(subj)
            arguments['preds'].append(pred)
            arguments['objs'].append(obj)

    # Skip if the annotation file was blank
    if not triples:
        return [], [], []

    # Sample fake contrast examples (invalid extractions)
    n = int(len(triples) * contr_oversampling)
    for i in range(50):
        # Stop once enough contrast examples have been sampled
        # (n = contr_oversampling * number of extracted triples)
        if n <= 0:
            break

        s = random.choice(arguments['subjs'])
        p = random.choice(arguments['preds'])
        o = random.choice(arguments['objs'])

        # Ensure samples are new (and not actually valid!)
        if (s, p, o) not in triples and s and p and o:
            triples += [(s, p, o)]
            labels += [0]
            n -= 1

    return turns, triples, labels
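
The crossover sampling used above can be tried out on toy data. The sketch below is a simplified stand-in rather than the function itself: `sample_contrast_triples`, `max_attempts` and `seed` are hypothetical names introduced for illustration, and the arguments are assumed to be plain strings.

```python
import random

def sample_contrast_triples(true_triples, n, max_attempts=50, seed=0):
    """Recombine subjects, predicates and objects of the true triples into
    at most n triples that do not occur among the true ones."""
    rng = random.Random(seed)
    subjs = [s for s, _, _ in true_triples]
    preds = [p for _, p, _ in true_triples]
    objs = [o for _, _, o in true_triples]

    known = set(true_triples)
    contrast = []
    for _ in range(max_attempts):
        if len(contrast) >= n:
            break
        candidate = (rng.choice(subjs), rng.choice(preds), rng.choice(objs))
        # Keep only complete combinations that were not seen before
        if all(candidate) and candidate not in known:
            contrast.append(candidate)
            known.add(candidate)
    return contrast

true_triples = [('SPEAKER1', 'like', 'dogs'), ('SPEAKER2', 'drive', 'a toyota')]
contrast = sample_contrast_triples(true_triples, n=2)
print(contrast)
```

Capping the number of attempts keeps the loop finite when a dialogue has too few arguments to form new combinations, at the cost of sometimes returning fewer contrast triples than requested.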


In [None]:
tokens, triples, labels = [], [], []
for ann in annotations:
    ann_tokens, ann_triples, triple_labels = extract_triples(ann)
    triples.append(ann_triples)
    labels.append(triple_labels)
    tokens.append([t for ts in ann_tokens for t in ts + ['<eos>']])

j = random.choice(range(len(tokens)))
print('tokens: ', tokens[j])
print('triples:', triples[j])
print('labels: ', labels[j])

In [None]:
def abstract_predicates_triples(triples, abstract_dict, lookup, bio_lookup):
    """ Replaces every predicate listed in abstract_dict by its abstract
        category; all other triples are kept unchanged.
    """
    # Collect all surface predicates that have an abstract category
    known_preds = [value for values in abstract_dict.values() for value in values]

    abstract_rows = []
    for row in triples:
        abstract_triples = []
        for subj, pred, obj in row:
            if pred in known_preds:
                key = lookup[pred]
                pred = bio_lookup[key[0]]
            abstract_triples.append((subj, pred, obj))
        abstract_rows.append(abstract_triples)
    return abstract_rows

triples_l1 = abstract_predicates_triples(triples, bio_dict_l1, lookup_l1, bio_lookup_l1)
triples_l2 = abstract_predicates_triples(triples, bio_dict_l2, lookup_l2, bio_lookup_l2)

# print(triples)
# print(triples_l1)
# print(triples_l2)
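
The effect of predicate abstraction can be seen on a toy example. This is a minimal sketch under assumptions: the categories `LIKE` and `OWN`, the single `pred_to_category` mapping and the helper `normalize_predicates` are hypothetical, and the notebook's own dictionaries (`bio_dict_l1`, `lookup_l1`, `bio_lookup_l1`) may be structured differently.

```python
def normalize_predicates(triples, pred_to_category):
    """Replace each known surface predicate by its category label;
    unknown predicates are left unchanged."""
    return [(s, pred_to_category.get(p, p), o) for s, p, o in triples]

# Hypothetical categories, for illustration only
categories = {'LIKE': ['like', 'love', 'adore'], 'OWN': ['have', 'own']}
pred_to_category = {p: cat for cat, preds in categories.items() for p in preds}

triples = [('SPEAKER1', 'adore', 'unicorns'), ('SPEAKER1', 'drive', 'a toyota')]
print(normalize_predicates(triples, pred_to_category))
# [('SPEAKER1', 'LIKE', 'unicorns'), ('SPEAKER1', 'drive', 'a toyota')]
```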

In [None]:
print('Class (im)balance:')
print('not entailed  ', sum([np.sum(np.array(t) == 0) for t in labels]))
print('entailed (pos)', sum([np.sum(np.array(t) == 1) for t in labels]))
print('entailed (neg)', sum([np.sum(np.array(t) == 2) for t in labels]))

## Fine-tuning ALBERT for Triple Candidate Scoring

In [None]:
class TripleScoring(torch.nn.Module):
    def __init__(self, base_model='albert-base-v2', path=None, max_len=80, sep='<eos>'):
        super().__init__()
        # Base model
        print('loading %s for triple scoring' % base_model)
        # Load base model
        self._model = AutoModel.from_pretrained(base_model)
        self._max_len = max_len
        self._base = base_model
        self._sep = sep

        # Load and extend tokenizer with SPEAKERS
        self._tokenizer = AutoTokenizer.from_pretrained(base_model)
        self._tokenizer.add_tokens(['SPEAKER1', 'SPEAKER2'], special_tokens=True)
        self._model.resize_token_embeddings(len(self._tokenizer))

        # SPO candidate scoring head
        hidden_size = AutoConfig.from_pretrained(base_model).hidden_size
        self._head = torch.nn.Linear(hidden_size, 3)
        self._relu = torch.nn.ReLU()
        self._softmax = torch.nn.Softmax(dim=-1)

        # GPU support
        self._device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
        self.to(self._device)

        # Load model / tokenizer if pretrained model is given
        if path:
            print('\t- Loading pretrained')
            model_path = glob.glob(path + '/candidate_scorer_' + base_model)[0]
            self.load_state_dict(torch.load(model_path, map_location=self._device))

    def forward(self, input_ids, speaker_ids, attn_mask):
        """ Computes the forward pass through the model
        """
        out = self._model(input_ids=input_ids, token_type_ids=speaker_ids, attention_mask=attn_mask)
        h = self._relu(out.last_hidden_state[:, 0])
        return self._softmax(self._head(h))

    def _retokenize_dialogue(self, tokens, speaker=1):
        # Tokenize each token individually (keeping track of subwords)
        f_input_ids = [self._tokenizer.cls_token_id]
        speaker_ids = [speaker]
        for turn in ' '.join(tokens).split(self._sep):
            token_ids = self._tokenizer.encode(turn, add_special_tokens=True)[1:]  # strip [CLS]
            f_input_ids += token_ids
            speaker_ids += [speaker] * len(token_ids)
            speaker = 1 - speaker

        return f_input_ids, speaker_ids

    def _retokenize_triple(self, triple):
        # Append triple
        f_input_ids = self._tokenizer.encode(' '.join(triple), add_special_tokens=False)
        speaker_ids = [0] * len(f_input_ids)
        return f_input_ids, speaker_ids

    def _add_padding(self, sequence, pad_token):
        # If sequence is too long, cut off end
        sequence = sequence[:self._max_len]

        # Pad remainder to max_len
        padding = self._max_len - len(sequence)
        new_sequence = sequence + [pad_token] * padding

        # Mask out [PAD] tokens
        attn_mask = [1] * len(sequence) + [0] * padding
        return new_sequence, attn_mask

    def fit(self, tokens, triples, labels, epochs=2, lr=1e-6):
        """ Fits the model to the annotations
        """
        X = []
        for dialog_tokens, triple_lst, triple_labels in zip(tokens, triples, labels):

            # Tokenize dialogue
            dialog_input_ids, dialog_speakers = self._retokenize_dialogue(dialog_tokens)

            for triple, label in zip(triple_lst, triple_labels):
                # Tokenize triple
                triple_input_ids, triple_speakers = self._retokenize_triple(triple)

                # Concatenate dialogue + [UNK] + triple
                input_ids = dialog_input_ids[:-1] + [self._tokenizer.unk_token_id] + triple_input_ids
                speakers = dialog_speakers[:-1] + [0] + triple_speakers

                # Pad sequence with [PAD] to max_len
                input_ids, _ = self._add_padding(input_ids, self._tokenizer.pad_token_id)
                speakers, attn_mask = self._add_padding(speakers, 0)

                # Push Tensor to GPU
                input_ids = torch.LongTensor([input_ids]).to(self._device)
                speakers = torch.LongTensor([speakers]).to(self._device)
                attn_mask = torch.FloatTensor([attn_mask]).to(self._device)
                label_ids = torch.LongTensor([label]).to(self._device)

                X.append((input_ids, speakers, attn_mask, label_ids))

        # Set up optimizer and objective
        optimizer = torch.optim.Adam(self.parameters(), lr=lr)
        # forward() already returns softmax probabilities, so take their log and
        # use NLLLoss; CrossEntropyLoss would apply a second (log-)softmax
        criterion = torch.nn.NLLLoss()

        for epoch in range(epochs):
            random.shuffle(X)

            losses = []
            for input_ids, speaker_ids, attn_mask, y in tqdm(X):
                # Was the triple entailed? Positively? Negatively?
                y_hat = self(input_ids, speaker_ids, attn_mask)
                loss = criterion(torch.log(y_hat + 1e-9), y)
                losses.append(loss.item())

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            print("mean loss =", np.mean(losses))

        # Save model to file
        torch.save(self.state_dict(), 'candidate_scorer_%s' % self._base)

    def predict(self, tokens, triples):
        # Tokenize dialogue
        dialog_input_ids, dialog_speakers = self._retokenize_dialogue(tokens)

        batch_input_ids = []
        batch_speakers = []
        batch_attn_mask = []

        for triple in triples:
            # Tokenize triple
            triple_input_ids, triple_speakers = self._retokenize_triple(triple)

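            # Concatenate dialogue + [UNK] + triple (dropping the dialogue's final
            # token, as in fit, so inference inputs match the training format)
            input_ids = dialog_input_ids[:-1] + [self._tokenizer.unk_token_id] + triple_input_ids
            speakers = dialog_speakers[:-1] + [0] + triple_speakers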

            # Pad sequence with [PAD] to max_len
            input_ids, _ = self._add_padding(input_ids, self._tokenizer.pad_token_id)
            speakers, attn_mask = self._add_padding(speakers, 0)

            batch_input_ids.append(input_ids)
            batch_speakers.append(speakers)
            batch_attn_mask.append(attn_mask)

        # Push batches to GPU
        batch_input_ids = torch.LongTensor(batch_input_ids).to(self._device)
        batch_speakers = torch.LongTensor(batch_speakers).to(self._device)
        batch_attn_mask = torch.FloatTensor(batch_attn_mask).to(self._device)

        # Inference only, so no gradients need to be tracked
        with torch.no_grad():
            label = self(batch_input_ids, batch_speakers, batch_attn_mask)
        label = label.cpu().numpy()
        return label
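
The input layout built in `fit` and `predict` (dialogue, then a separator, then the candidate triple, padded to a fixed length) can be sketched without a tokenizer. The helper `build_input` and all ids below are made up for illustration; real ids come from the ALBERT tokenizer.

```python
def build_input(dialog_ids, triple_ids, sep_id, pad_id, max_len):
    """Concatenate dialogue ids, a separator id and triple ids, then
    truncate/pad to max_len and build the matching attention mask."""
    ids = (dialog_ids + [sep_id] + triple_ids)[:max_len]
    padding = max_len - len(ids)
    attn_mask = [1] * len(ids) + [0] * padding
    return ids + [pad_id] * padding, attn_mask

# Toy ids, not real tokenizer output
input_ids, attn_mask = build_input([2, 11, 12, 3], [21, 22, 23],
                                   sep_id=1, pad_id=0, max_len=10)
print(input_ids)  # [2, 11, 12, 3, 1, 21, 22, 23, 0, 0]
print(attn_mask)  # [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]

# A dialogue longer than max_len pushes the triple out of the input
long_ids, long_mask = build_input(list(range(100, 112)), [21, 22, 23],
                                  sep_id=1, pad_id=0, max_len=10)
print(21 in long_ids)  # False
```

Because truncation removes the end of the sequence, `max_len` should be chosen large enough to hold the dialogue plus the triple. Otherwise the scorer never sees the candidate it is supposed to judge.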

In [None]:
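import os
import shutil
from datetime import date

def save_triple_scorer(level, base_model='albert-base-v2'):
    """ Copies the locally saved scorer weights to a dated folder on Drive
    """
    out_dir = root_dir + '/models/candidate_scorer_' + level + '_' + str(date.today())
    os.makedirs(out_dir, exist_ok=True)

    shutil.copy('candidate_scorer_' + base_model, out_dir)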


In [None]:
scorer_l1 = TripleScoring()
scorer_l1.fit(tokens, triples_l1, labels, epochs=7)
save_triple_scorer('l1')

In [None]:
scorer_l2 = TripleScoring()
scorer_l2.fit(tokens, triples_l2, labels, epochs=7)
save_triple_scorer('l2')

In [None]:
# inputs = 'staying here is fine though . SPEAKER1\'s two dogs keep me company <eos> SPEAKER2 do not love them ! What car do SPEAKER1 drive ? <eos> a toyota . but SPEAKER1 like nissans . <eos>'.split()
# triple_examples = [['SPEAKER1', 'drive', 'nissans'],
#                    ['SPEAKER1', 'like', 'nissans'], 
#                    ['SPEAKER2', 'like', 'nissans'], 
#                    ['SPEAKER2', 'love', 'two dogs'], 
#                    ['SPEAKER1', 'drive', 'a toyota']]

# inputs = '<eos> Do SPEAKER1 work in Amsterdam ? <eos> No , in London . <eos>'.split()
# triple_examples = [['SPEAKER1', 'work in', 'Amsterdam']]

inputs = 'SPEAKER1 adore unicorns but not photography <eos> What do SPEAKER1 like ? <eos> dogs and gaming, but not cats or elephants . <eos>'.split()
triple_examples = [['SPEAKER1', 'adore', 'unicorns'],
                   ['SPEAKER1', 'like', 'dogs'],
                   ['SPEAKER1', 'like', 'gaming'],
                   ['SPEAKER1', 'adore', 'photography'],
                   ['SPEAKER1', 'like', 'cats'],
                   ['SPEAKER1', 'like', 'elephants'],
                   ['SPEAKER1', 'adore', 'elephants'],
                   ['SPEAKER1', 'like', 'photography'],
                   ['SPEAKER1', 'like', 'unicorns']]

np.round(scorer_l1.predict(inputs, triple_examples), 3)

In [None]:
np.round(scorer_l2.predict(inputs, triple_examples), 3)
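
Each row returned by `predict` holds three probabilities, in the order of the label ids used during training (0 = not entailed, 1 = entailed with positive polarity, 2 = entailed with negative polarity). A small sketch of how such rows could be turned into readable decisions follows; `decode_scores` is a hypothetical helper and the scores are invented, not actual model output.

```python
LABELS = ['not entailed', 'entailed (pos)', 'entailed (neg)']

def decode_scores(triples, scores):
    """Pair each candidate triple with its most probable class and that probability."""
    decoded = []
    for triple, row in zip(triples, scores):
        best = max(range(len(row)), key=lambda k: row[k])
        decoded.append((tuple(triple), LABELS[best], row[best]))
    return decoded

# Made-up scores for illustration only
toy_triples = [['SPEAKER1', 'like', 'dogs'], ['SPEAKER1', 'like', 'cats']]
toy_scores = [[0.1, 0.8, 0.1], [0.2, 0.1, 0.7]]
for row in decode_scores(toy_triples, toy_scores):
    print(row)
```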

# Test Argument Extraction on EVAL data

### Level 1

In [None]:
# Load trained models
argex_model_l1 = ArgumentExtraction(path=root_dir+'/models/l1_6_40_2022-12-10')
scorer_model_l1 = TripleScoring(path=root_dir+'/models/candidate_scorer_l1_2022_12_13')

In [None]:
# Load and pre-process the evaluation data
eval_file = load_evaluation_data(root_dir + '/annotated_data/evaluatiedata/test_declarative_statements_level1_eval.txt')
input, disambig_inputs = transform_data_speaker_disambiguation(eval_file)

In [None]:
# Extract arguments
n_tags_l1 = len(bio_dict_l1) * 2 + 2
tags_count_l1 = list(range(n_tags_l1 + 1))  # tag indices, printed as the column header

with open(root_dir+"/output-l1-single_utt.csv", 'w') as outf:

  for input in disambig_inputs:
    y_subj, y_pred, y_obj, subwords = argex_model_l1.predict(input)

    
    print('\n')
    print(''.ljust(15) + '\t'.join(map(str, tags_count_l1)))
    for score, token in zip(y_pred.T, subwords):
        score_str = '\t'.join(["[" + str(s)[:5] + "]" if s == max(score) else " " + str(round(s, 4))[:5] + " " for s in score])
        token_str = token.replace('▁', '')
        print(token_str.ljust(15) + score_str)