# Triple and Perspective Extraction and Scoring with Normalization

In this notebook you will fine-tune a language model to perform triple argument (Subject, Predicate, Object) extraction and candidate triple scoring. For the predicates, you will define a set of categories; the model should find the predicate token span as well as the most likely category for that predicate. You will use the PyTorch implementation of `albert-base` provided by the Hugging Face `transformers` library and fine-tune it on PersonaChat, DailyDialog and Circa data annotated with ground-truth triples. You will also try to adapt the code to other models and train them to compare model performance on the normalized predicates.

## Overview 
Adopting a two-stage setup allows maximum flexibility of the triple extraction while making efficient use of the annotated data. The two stages include:

1. A sequence labeling (BIO-tagging) model which extracts lists of subjects, predicates and objects from the input dialogue.

2. A model which takes combinations of subjects, predicates and objects found and scores these combinations (i.e. all candidate triples) to decide whether the triple can indeed be entailed from the dialogue and what its polarity is.

By using this two-stage approach, arbitrary numbers of triples can be extracted and linguistic phenomena such as ellipsis can be accounted for.

The first part is most relevant for the extraction of abstract predicates, whereas the second part is relevant for the evaluation.
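
The candidate generation in stage 2 can be pictured as a Cartesian product over the three argument lists produced by stage 1. The snippet below is a minimal sketch of that idea only: the function name `candidate_triples` and the toy argument lists are assumptions made for illustration, and the scoring model itself is not shown.

```python
from itertools import product


def candidate_triples(subjects, predicates, objects):
    """Return every (subject, predicate, object) combination as a candidate triple."""
    return list(product(subjects, predicates, objects))


# Hypothetical stage-1 output for a short dialogue (illustrative values only)
subjects = ["i"]
predicates = ["love"]
objects = ["baked goods", "them"]

candidates = candidate_triples(subjects, predicates, objects)
for triple in candidates:
    print(triple)
```

Since every combination is generated, arguments found in different turns can end up in the same candidate, and the scoring stage decides which candidates are actually entailed by the dialogue.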

## Getting the Data

A small development set was annotated to obtain ground-truth triples. This data is stored on Google Drive for easy access.

In [None]:
from google.colab import drive

drive.mount('/content/gdrive')
root_dir = '/content/gdrive/MyDrive/Combots Triple Extraction and Normalization' # change to your own directory

Mounted at /content/gdrive


In [None]:
import glob
import json
import random

def load_annotations(path, remove_unk=True, keep_skipped=False):
    """ Reads all annotation files from path. By default, it filters skipped
        files and removes the [unk] tokens appended at the end of each turn.

        params:
        str path:           name of directory containing annotations
        bool remove_unk:    whether to remove [unk] tokens (default: True)
        bool keep_skipped:  whether to keep skipped annotations (default: False)

        returns:    list of annotations dicts
    """
    annotations = []
    for fname in glob.glob(path + '/*.json'):
        with open(fname, 'r', encoding='utf-8') as file:
            data = json.load(file)

            if data['skipped'] and not keep_skipped:
                continue

            if remove_unk:
                data['tokens'] = [[t for t in turn if t != '[unk]'] for turn in data['tokens']]

            annotations.append(data)

    return annotations

annotations = load_annotations(root_dir + '/annotated_data/trainval') #include in new dir
annotations[998]

{'tokens': [['i', 'can', 'not', 'pick', '.', 'i', 'love', 'them', 'both'],
  ['fair',
   'enough',
   ',',
   'have',
   'you',
   'ever',
   'been',
   'to',
   'a',
   'bake',
   'sale',
   '?'],
  ['yeah', ',', 'i', 'love', 'baked', 'goods', '!']],
 'annotations': [[[[0, 0]], [[0, 1]], [[0, 3]], [[0, 2]], []],
  [[[0, 5]], [[0, 6]], [[0, 7]], [], []],
  [[[1, 4]], [[1, 6], [1, 7]], [[1, 8], [1, 9], [1, 10]], [], []],
  [[[2, 2]], [[2, 3]], [[2, 4], [2, 5]], [], []],
  [[], [], [], [], []],
  [[], [], [], [], []],
  [[], [], [], [], []],
  [[], [], [], [], []],
  [[], [], [], [], []],
  [[], [], [], [], []],
  [[], [], [], [], []]],
 'skipped': False}

In [None]:
print('#dialogs:', len(annotations))
print('#triples:', sum([sum([any(t) for t in d['annotations']]) for d in annotations]))

#dialogs: 1117
#triples: 4786


In [None]:
def get_predicate_tokens(annotation, triple):
    if triple[1]:
        turn = triple[1][0][0]
        start = triple[1][0][1]
        end = triple[1][-1][1]
        # print("turn, start, end", turn, start, end)
        # print("whole sentence:", ' '.join(annotation['tokens'][turn]))
        # ending = end + 1
        # print("the predicate:", ' '.join(annotation['tokens'][turn][start:end + 1]))
        return ' '.join(annotation['tokens'][turn][start:end + 1])
    else:
        return None

In [None]:
print(annotations[3])
get_predicate_tokens(annotations[1], annotations[1]['annotations'][0])

{'tokens': [['yes', ',', 'i', "'m", 'pretty', 'sure', '.'], ['do', 'you', 'have', 'her', 'address', '?'], ['yes', ',', 'i', 'do', '.', 'it', "'s", '109', 'locks', 'ave', ',', 'l8v', '4n9', '.']], 'annotations': [[[[0, 2]], [[0, 3]], [[0, 4], [0, 5]], [], []], [[[1, 1]], [[1, 2]], [[1, 3], [1, 4]], [], []], [[[2, 2]], [[1, 2]], [[1, 3], [1, 4]], [], []], [[[2, 5]], [[2, 6]], [[2, 7], [2, 8], [2, 9], [2, 10], [2, 11], [2, 12]], [], []], [[[1, 3], [1, 4]], [[2, 6]], [[2, 7], [2, 8], [2, 9], [2, 10], [2, 11], [2, 12]], [], []], [[], [], [], [], []], [[], [], [], [], []], [[], [], [], [], []], [[], [], [], [], []], [[], [], [], [], []], [[], [], [], [], []]], 'skipped': False}


'need'
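
Each annotated span is a list of `[turn, token]` index pairs, so a whole annotation row can be turned back into text in the same way as the predicate. The sketch below assumes that the first three slots of a row hold the subject, predicate and object spans (the notebook itself only confirms slot 1 as the predicate); `span_text` and `triple_text` are hypothetical helper names. Unlike `get_predicate_tokens`, it joins exactly the annotated indices rather than slicing from the first to the last one.

```python
def span_text(tokens, span):
    """Join the tokens addressed by a list of [turn, token] index pairs."""
    return " ".join(tokens[turn][idx] for turn, idx in span)


def triple_text(tokens, triple):
    """Return (subject, predicate, object) strings, or None for an empty row."""
    if not any(triple):
        return None
    subj, pred, obj = triple[:3]  # assumption: slots 0-2 are subject, predicate, object
    return span_text(tokens, subj), span_text(tokens, pred), span_text(tokens, obj)


# Toy dialogue in the same layout as the annotation files
tokens = [
    ["have", "you", "ever", "been", "to", "a", "bake", "sale", "?"],
    ["yeah", ",", "i", "love", "baked", "goods", "!"],
]
triple = [[[1, 2]], [[1, 3]], [[1, 4], [1, 5]], [], []]
empty_row = [[], [], [], [], []]

print(triple_text(tokens, triple))     # ('i', 'love', 'baked goods')
print(triple_text(tokens, empty_row))  # None
```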

## Who are 'You'?: Disambiguating You and I

In the dialogues, the speakers are referred to by the tokens *you* and *I*. These tokens are ambiguous because their referent depends on who utters them, so we replace them with `SPEAKER1` and `SPEAKER2` depending on the speaker (e.g. *you* uttered by speaker 2 becomes `SPEAKER1`).

In [None]:
SPEAKER1 = 'SPEAKER1'
SPEAKER2 = 'SPEAKER2'

def disambiguate_pronouns(token, turn_idx):
    # Even turns -> speaker1
    if turn_idx % 2 == 0:
        if token in ['i', 'me', 'myself', 'we', 'ourselves']:
            return SPEAKER1
        elif token in ['my', 'mine', 'our', 'ours']:
            return SPEAKER1 + "'s"
        elif token in ['you', 'yourself', 'yourselves']:
            return SPEAKER2
        elif token in ['your', 'yours']:
            return SPEAKER2 + "'s"
    else:
        if token in ['i', 'me', 'myself', 'we', 'ourselves']:
            return SPEAKER2
        elif token in ['my', 'mine', 'our', 'ours']:
            return SPEAKER2 + "'s"
        elif token in ['you', 'yourself', 'yourselves']:
            return SPEAKER1
        elif token in ['your', 'yours']:
            return SPEAKER1 + "'s"
    return token

In [None]:
for annotation in annotations:
    # disambiguate_pronouns takes the turn index and checks its parity itself
    annotation['tokens'] = [[disambiguate_pronouns(token, i) for token in turn]
                            for i, turn in enumerate(annotation['tokens'])]
annotations[0]

{'tokens': [['what', 'about', 'this', '?'],
  ['let',
   'SPEAKER2',
   'try',
   'it',
   'on',
   ',',
   'it',
   "'s",
   'too',
   'small',
   'for',
   'SPEAKER2',
   ',',
   'have',
   "n't",
   'SPEAKER1',
   'got',
   'any',
   'larger',
   'ones',
   '?'],
  ['yes', ',', 'try', 'this', 'one', 'please', '.']],
 'annotations': [[[[1, 6]], [[1, 7]], [[1, 8], [1, 9]], [], []],
  [[[1, 15]], [[1, 16]], [[1, 18], [1, 19]], [], []],
  [[], [], [], [], []],
  [[], [], [], [], []],
  [[], [], [], [], []],
  [[], [], [], [], []],
  [[], [], [], [], []],
  [[], [], [], [], []],
  [[], [], [], [], []],
  [[], [], [], [], []],
  [[], [], [], [], []]],
 'skipped': False}

## Creating sets of abstract predicates

First, we create a set consisting of all unique token spans annotated as 'predicate'. This is the set from which you will derive the abstract predicates. Note that many of these 'unique predicates' are in fact annotation errors, where the subject or object has been included in the predicate annotation.

For each of the abstract predicates you define, you will need to create a numerical B- and I-tag for the BIO tag annotation that we will use. Start from (3,4) for the first tag (e.g. B-like, I-like). The tag 0 is reserved for 'O', and 1 and 2 are reserved for the B- and I-tags for subjects and objects. 

Once you've created a set of abstract predicates and their corresponding B- and I-tags, you will need to create two lookup dictionaries for the BIO tagging that we will do later on. The dictionary `lookup` is used to get the correct BIO tag from the predicate token span and is used for converting triples to BIO-tags. The dictionary `bio_lookup` is used to get the abstract predicate from the BIO tag, and is used to convert BIO tags to tokens.
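
To make the tag numbering concrete, the sketch below builds both dictionaries from a cluster dictionary of the form `{abstract predicate: [member predicates]}`. It is only one possible layout, under the assumptions that `lookup` is keyed by the cleaned predicate string and stores a `(B, I)` tag pair, and that abstract predicates are numbered in alphabetical order; `build_lookups`, `predicate_tags` and the toy clusters are hypothetical.

```python
def build_lookups(abstract_predicates, first_tag=3):
    """Assign a (B, I) tag pair to each abstract predicate, starting at first_tag."""
    lookup = {}      # predicate span (str) -> (B-tag, I-tag)
    bio_lookup = {}  # BIO tag (int) -> abstract predicate (str)
    for n, abstract in enumerate(sorted(abstract_predicates)):
        b_tag = first_tag + 2 * n
        i_tag = b_tag + 1
        bio_lookup[b_tag] = abstract
        bio_lookup[i_tag] = abstract
        for member in abstract_predicates[abstract]:
            lookup[member] = (b_tag, i_tag)
    return lookup, bio_lookup


def predicate_tags(predicate, lookup):
    """BIO tags for a predicate span: one B-tag followed by I-tags."""
    b_tag, i_tag = lookup[predicate]
    n_tokens = len(predicate.split())
    return [b_tag] + [i_tag] * (n_tokens - 1)


# Toy clusters (illustrative values only)
clusters = {"like": ["like", "love", "enjoy"], "live in": ["live in", "stay in"]}
lookup, bio_lookup = build_lookups(clusters)

print(lookup["love"])                     # (3, 4)
print(predicate_tags("stay in", lookup))  # [5, 6]
print(bio_lookup[5])                      # live in
```

Sorting the abstract predicates before numbering keeps the tag assignment stable between runs, which matters once a model has been trained on those tag ids.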

In [None]:
unique_predicates = set()
for ann in annotations:
  for triple in ann['annotations']:
    pred = get_predicate_tokens(ann, triple)
    if pred is not None:
      unique_predicates.add(pred)

In [None]:
print(len(unique_predicates))
print(unique_predicates)
print(type(unique_predicates))

1231
{'', 'was still in', 'do for a job', 'think about', 'like us to', 'read it often', 'should go to buy', 'ai', 'exfoliating', 'in need of', 'called', 'worked with', 'look for', 'like to drink', 'help SPEAKER1 in this city', 'find out', 'getting', 'could supply', 'seems', 'will have', 'go', 'made', 'dropped', 'love to go to', 'presented', 'practiced a lot as a kid', 'calms SPEAKER1 down', 'fell on', "'m scared of", 'live in', 'moved from', 'daydream about', 'listening to', 'have anything to', 'only eat', 'sort', 'keen on', 'ready to', 'are SPEAKER1 up to', 'volunteer with', 'meet', 'followed', 'spend', 'made in', 'would', 'get along with', 'like to try', 'is', 'suitable for', 'ask bill if', 'ran', 'would be', 'expect to', 'go to college for', 'get to', 'want to be', 'like doing', "ca n't wait for", "could n't stand", 'comes by', "'m a huge fan of", 'trying to be', 'needed to', "help SPEAKER1's mom out", 'be here for', 'love watching birds from indoors', "'ll just put SPEAKER2's coat 

### Lemmatizing and cleaning the unique predicates

In [None]:
import spacy

nlp = spacy.load('en_core_web_sm')

In [None]:
def cleaning(list_predicates):
  """Lemmatize each predicate, drop punctuation and speaker tokens, and
  expand contracted forms of 'to be' and 'to have'.

  Uses the spaCy pipeline `nlp` loaded in the previous cell.
  Returns the set of cleaned predicate strings.
  """
  cleaned = set()

  for predicate in list_predicates:
    subset = []

    for token in nlp(predicate):
      # don't include punctuation marks
      if token.pos_ == "PUNCT":
        continue

      subset.append(token.lemma_)

      # merge a split contraction ("'" followed by "m" or "ve") into its full lemma
      if len(subset) >= 2 and subset[-2] == "'":
        if subset[-1] == "m":
          subset[-2:] = ['be']
        elif subset[-1] == "ve":
          subset[-2:] = ['have']

      # map remaining contracted forms of 'to be' to 'be'
      if subset[-1] in ("'s", "m", "re"):
        subset[-1] = 'be'

      # take out speaker tokens since these belong to the subject or object
      if subset[-1] in ("SPEAKER1", "SPEAKER2", "speaker1"):
        subset.pop()

    cleaned.add(' '.join(subset))
  return cleaned

unique_predicates_clean = cleaning(unique_predicates)
print(len(unique_predicates_clean), len(unique_predicates))
print(unique_predicates_clean)

1000 1231
{'', 'catch on', 'do for a job', 'will suit well', 'think about', 'cook', 'want they all in fifty', 'read it often', 'should go to buy', 'ai', 'can', 'dislike', 'have seat for', 'like', 'like to purchase', 'read biography to', 'live', 'quit', 'go on', 'in need of', 'on', 'like to depart', 'be ready to', 'live here for', 'wass', 'look for', 'be a serious disturbance in', 'like to meet for a drink', 'like to drink', 'love to go to the beach with', 'relax with', 'find out', 'can relax', 'binge', 'mind go on', 'talk about', 'eat at', 'need to', 'move in', 'migrate to', 'save for', 'be try to get', 'could supply', 'get more', 'dress be yorkie as a lion', 'share a one bedroom with', 'relax', 'will have', 'go', 'obsess with', 'feel connected to', 'be just relax at', 'arrange', 'be one at', 'travel here for', 'love listen to', 'be in the chair', 'hold', 'well than', 'make of', 'love to go to', 'exchange', 'can t seem to make', 'like read', 'know', 'portray woman as weak', 'will take'

### Calculating similarity and clustering 

In [None]:
!python -m spacy download en_core_web_md
!pip install distance 

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-md==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.4.1/en_core_web_md-3.4.1-py3-none-any.whl (42.8 MB)
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.4.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting distance
  Downloading Distance-0.1.3.tar.gz (180 kB)
Building wheels for collected packages: distance
  Building wheel for distance (setup.py) ... done
  Created wheel for distance: filename=Distance-0.1.3-py3-none-any.whl size=16275 sha256=30ab863e86b0e6ef4f30dc4f4c47a

#### Affinity propagation


In [None]:
import numpy as np
from sklearn.cluster import AffinityPropagation
import distance
nlp_2 = spacy.load("en_core_web_md") # the md model includes word vectors, which similarity relies on

words = list(unique_predicates_clean)

# Parse every predicate once instead of re-parsing it inside the nested loop
docs = [nlp_2(word) for word in words]

# Pairwise similarity between all predicates
array = np.empty([len(words), len(words)])
for w1, doc1 in enumerate(docs):
  for w2, doc2 in enumerate(docs):
    array[w1, w2] = doc1.similarity(doc2)

affprop = AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(array) # the precomputed matrix holds similarities
words = np.asarray(words) # so that indexing with an index array works

abstract_predicates = {}
for cluster_id in np.unique(affprop.labels_):
    print(cluster_id)
    # use the cluster exemplar as the abstract predicate
    exemplar = words[affprop.cluster_centers_indices_[cluster_id]]
    cluster = np.unique(words[np.nonzero(affprop.labels_==cluster_id)])
    cluster_str = ", ".join(cluster)
    print(" - *%s:* %s" % (exemplar, cluster_str))
    abstract_predicates[str(exemplar)] = list(cluster)

print(abstract_predicates)

# Save as JSON so the clustering does not have to be run again
with open('abstract_predicates.txt', 'w') as convert_file:
     convert_file.write(json.dumps(abstract_predicates))


  


0
 - *:* 
1
 - *think:* forget, glad, hope, imagine, kid, know, mean, remember, say, scare, sort, spell anything wrong, think
2
 - *from:* call from, different from, exhausted from, from, graduate from, import from, migrate from, retire from, switch from, tired from, visit from
3
 - *should have:* have, have any room available, have catch, have experience, have family, have no problem with that, like have people around, may have, should have, should sms over, will have, will never have
4
 - *fast than:* fast than, hard than, make more than, well than
5
 - *take advantage of:* afraid of, deserve of, get enough of, have dream of go, hear of, in need of, inherit, involve, its, lose track of, make of, present, speak of, take advantage of, take care of, will spend most of the weekend with
6
 - *make it:* bring it over, could make it as, could try it on, it, like it cut, make it, make it a secret between we, read it often, want it, wear it much, wear it that way, will make it work
7
 - *brin
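
The nested loop above calls `similarity` once for every pair of predicates. spaCy documents the default `Doc.similarity` as the cosine similarity of averaged word vectors, so the same kind of matrix can be sketched as a single matrix product over stacked vectors. The example below is a rough illustration under that assumption: the toy vectors stand in for `nlp_2(word).vector`, `cosine_similarity_matrix` is a hypothetical helper, and edge cases such as empty or out-of-vocabulary predicates may be handled differently by spaCy.

```python
import numpy as np


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity; rows with zero norm get similarity 0."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    safe_norms = np.where(norms == 0, 1.0, norms)  # avoid division by zero
    unit = vectors / safe_norms
    return unit @ unit.T


# Toy vectors standing in for the document vectors of four predicates
toy_vectors = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0], [0.0, 0.0]]
sim = cosine_similarity_matrix(toy_vectors)
print(np.round(sim, 3))
```

In the notebook, the input could be built with something like `np.stack([nlp_2(w).vector for w in words])`, which parses each predicate only once.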

#### With just verbs

In [None]:
import numpy as np
from sklearn.cluster import AffinityPropagation
import distance
nlp_2 = spacy.load("en_core_web_md") # the md model includes word vectors, which similarity relies on

words = list(unique_predicates_clean)

def verbs_only(doc):
  """Re-parse only the verbs and auxiliaries of a predicate."""
  verbs = [tok.text_with_ws for tok in doc if tok.pos_ in ("VERB", "AUX")]
  return nlp_2(''.join(verbs))

# Reduce every predicate to its verbs once, outside the nested loop
verb_docs = [verbs_only(nlp_2(word)) for word in words]

array = np.empty([len(words), len(words)])
for w1, doc1 in enumerate(verb_docs):
  for w2, doc2 in enumerate(verb_docs):
    array[w1, w2] = doc1.similarity(doc2)

affprop = AffinityPropagation(affinity="precomputed", damping=0.5)
# affinity="precomputed" expects similarities (higher = more alike), so the matrix is not negated
affprop.fit(array)
words = np.asarray(words) # so that indexing with an index array works

abstract_predicates_VERBS = {}
for cluster_id in np.unique(affprop.labels_):
    print(cluster_id)
    exemplar = words[affprop.cluster_centers_indices_[cluster_id]]
    cluster = np.unique(words[np.nonzero(affprop.labels_==cluster_id)])
    cluster_str = ", ".join(cluster)
    print(" - *%s:* %s" % (exemplar, cluster_str))
    abstract_predicates_VERBS[exemplar] = cluster_str

# Save as JSON so the clustering does not have to be run again
with open('abstract_predicates_VERBS.txt', 'w') as convert_file:
     convert_file.write(json.dumps(abstract_predicates_VERBS))



## Abstract predicates
Run only the code in this section to restore the clustering output from above without recomputing it.
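
As an alternative to pasting the dictionary literal, the clusters written to disk by the clustering cell can be read back with `json`. The round trip below is a minimal sketch with toy clusters; `save_clusters` and `load_clusters` are hypothetical helper names, and the temporary directory stands in for the notebook's working directory, where the file is called `abstract_predicates.txt`.

```python
import json
import os
import tempfile


def save_clusters(clusters, path):
    """Write the cluster dictionary to disk as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(clusters, f)


def load_clusters(path):
    """Read a cluster dictionary back from a JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Toy clusters (illustrative values only)
clusters = {"think": ["hope", "know", "think"], "from": ["from", "move from"]}

path = os.path.join(tempfile.mkdtemp(), "abstract_predicates.txt")
save_clusters(clusters, path)
restored = load_clusters(path)
print(restored == clusters)
```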

#### Old

In [None]:
# abstract_predicates_R = {"": [""], "break up with": ["break up with", "can not put up with", "catch up on", "fill up", "finish up", "grow up as", "grow up in", "keep up all night", "like to meet up with", "make up", "meet up for", "meet up with", "open up", "show up for", "stand up at", "take up", "wake up", "wake up at"], "could not recognize": ["accept", "care if", "cause", "could", "could not recognize", "could not stand", "have not make", "have not pick", "have not time for", "never", "please that", "would", "would like", "would love", "would not change", "would not eat", "would not mind", "would rather"], "work as": ["portray woman as weak", "read those as a child", "roleplay as", "start as", "work", "work as", "work with vet as"], "come in": ["bring in", "come back in", "come in", "find in", "fit in", "get in because", "get in english 101", "get interested in", "go first thing in", "keep in", "live here in", "make in", "move in", "play the piano in", "shut down in", "spend the day in", "stay in", "stay last night in", "wait for in"], "save": ["rather save", "save", "spend"], "should buy": ["buy", "purchase", "sell", "shop", "should buy", "should sell"], "wash": ["dye", "exfoliate", "paint", "wash", "washing", "wear"], "ride": ["chase", "drive", "hike", "ride", "swim", "tour", "train"], "take advantage of": ["afraid of", "deserve of", "get enough of", "have dream of go", "hear of", "in need of", "inherit", "involve", "its", "lose track of", "make of", "present", "speak of", "take advantage of", "take care of", "will spend most of the weekend with"], "divorce": ["divorce", "marry"], "want to be": ["be able to", "be able to take the test again", "be excited to", "be go to", "be go to make be life easy", "be hope to visit", "be look forward to", "be ready to", "be require to", "be sneak away to", "be someone come to get", "be take the kid to", "be to", "be to leave nothing for a tip", "be to play", "be try to get", "be up to", "come here to be", "do be job 
remotely", "go to be", "go to be take", "have to be", "like to be", "like to buy a keepsake for be girlfriend", "need to be", "need to get be dance gear ready", "send be car to be fix", "try to be", "use to be", "want the money to be transfer", "want to be", "want to be here tomorrow", "will just put be coat away", "would like to be"], "code": ["code", "geocache"], "be still on": ["be a camera on", "be a copy machine on", "be good at work on", "be on", "be plan on go", "be set on", "be still on", "must not sit on", "spend all be money on shopping", "try they on be travel"], "move back here from": ["close", "come home from", "drive under", "get back from", "get home from", "land", "like cash back", "move", "move back here from", "move from", "move here when", "move here with", "send somebody over", "walk around"], "switch": ["dial", "drop", "switch"], "hear clearly": ["describe", "hear", "hear clearly", "sound", "speak"], "be here for": ["be always a need for", "be around for", "be come here for", "be currently look for", "be for", "be here for", "be it for", "be look for", "be not for", "be not look good for", "be not much room for", "be off for", "be out for", "be plenty for", "be quiet for", "be serious business for", "be shop for", "be take mom out", "be there before", "be there good hiking spot near", "be this for", "do all be shopping through", "drive be big truck around", "get be key out", "live there be whole life", "should take be shoe off", "spend be holiday from"], "on": ["catch on", "church on", "dance on", "depend on", "fall on", "keen on", "live on", "love work on", "on", "plan on", "plan on make", "plane on watch", "play on", "play soccer on", "put on", "scald on", "show on", "sit on", "work on"], "still at": ["arrive at", "available at", "bad at", "eat at", "go at", "good at", "stay at", "still at", "take lot at the bar", "teach at", "turn right at", "volunteer at", "work at", "work night at", "work tonight at"], "ai": ["ai"], "need to go back to": 
["allow people to check out", "get ready to", "get ready to take", "get to", "go out to", "go to", "go to across", "go to college for", "go to far from", "go to town on", "go to transfer", "go to watch it out of curiosity", "have plan to go", "interested in go to", "like go out to", "like go to", "like to go", "like to go far in", "like to go for", "like to go to", "love to go to", "love to go to the beach with", "love to sit around", "move back to", "need to go back to", "plan to go", "should go to buy", "want to go out to", "want to take"], "want something to": ["can listen to", "can not afford to", "can t seem to make", "charge it to", "feel connected to", "give he to we", "give time to do", "have anything to", "have no reason to distrust", "know how to use", "like to do", "like to do for fun", "like to try", "like we to", "love listen to", "love to", "love to own", "love to sing", "mind if walk to", "need this weekend to", "own", "prefer not to", "prefer not to bother", "send it to", "take they to her grave", "try not to", "want something to", "want to", "want to do", "want to extend", "want to pay", "want to sign", "want to watch a movie together", "willing to do", "would ever learn to", "would love to"], "keep": ["keep", "keep from", "keep safe", "would please keep"], "into": ["back into", "bump into", "get into", "into", "transfer into"], "expect to": ["able to", "able to find", "about to", "about to fix", "allow to", "close to", "come to", "cook to", "could speak to", "devote more time to", "excited to", "expect to", "forget to", "glad to", "happen to", "hard to", "have to", "have to give", "hope to", "inspire to", "intend to", "listen to", "look forward to", "look in to", "may speak to", "mean to", "meditate to", "migrate to", "move to", "need to", "need to make per year", "new to", "open to", "pay attention to", "pay more attention to", "plan to", "prefer to", "propose to", "read biography to", "ready to", "respond to", "ride to", "scared to", "seem to", 
"similar to", "sound appeal to", "speak to", "stick to", "take to", "talk to", "tell to", "to", "to withdraw", "try to", "will have to take", "will learn to deal with", "will request hr to", "write to"], "like to purchase": ["build house upon request", "have to give 3 shoot", "like listen to", "like some magic to take away", "like to", "like to buy", "like to cook for", "like to depart", "like to drink", "like to eat at", "like to find", "like to have", "like to leave on", "like to live in", "like to make", "like to meet for a drink", "like to pay", "like to purchase", "like to read", "like to return", "like to see in beijing", "like to send", "like to sing", "like to stay", "like to use", "love to live on", "love to look at", "order", "use to", "use to buy", "use to have", "would like to", "would tike to buy"], "start": ["begin", "finish", "start"], "add": ["add", "could add", "mix"], "wass": ["wass"], "get a good look at": ["get a good look at", "get dog at the clinic", "get real good at", "good", "have breakfast at a great restaurant", "throw a milk carton at"], "be one of": ["be a fan of", "be a huge fan of", "be a type of", "be an employee of", "be just off of", "be one of", "be part of", "be scared of", "become", "need all of be supply right away"], "from": ["call from", "different from", "exhausted from", "from", "graduate from", "import from", "migrate from", "retire from", "switch from", "tired from", "visit from"], "name": ["name"], "like watch": ["enjoy", "like", "like watch", "look like", "love", "love watch bird from indoor", "sound like", "taste like", "watch"], "make it": ["bring it over", "could make it as", "could try it on", "it", "like it cut", "make it", "make it a secret between we", "read it often", "want it", "wear it much", "wear it that way", "will make it work"], "go there": ["can go", "could go", "fall", "go", "go for", "go great with", "go hike often", "go just for", "go next", "go there", "go there with", "go thru", "hang", "leave", 
"pass", "run", "rush", "s it go"], "like do": ["can do", "do", "do for", "do for fun", "do well than", "fancy", "got do", "hate", "let do", "like do"], "s": ["s"], "retire": ["quit", "retire"], "recycle": ["recycle", "scrap", "waste"], "go on": ["go on", "go out on", "mind go on", "spend too much on"], "ace": ["ace"], "kill": ["arrest", "die", "hurt", "kill", "rob", "shoot"], "be still in": ["be a serious disturbance in", "be a story about breakdance in", "be go in", "be in", "be in attendance at", "be in between", "be in the chair", "be in to", "be it in", "be more in to", "be still in", "be they in"], "see tomorrow": ["call", "miss", "see", "see tomorrow", "should see", "tell", "wait"], "live here for a long time": ["'s lunch time for", "be fake straight for a long time", "do for a job", "do for a living", "draw they all the time all over", "have a crush on", "have a seat available", "have a small team under", "know a country song about that", "live here for a long time", "live here for a month", "make a recommendation for", "need a full time job for health insurance", "practice a lot as a kid", "race the car for a living", "share a one bedroom with", "should leave for a tip", "work a lot during", "work here long time"], "frustrate with": ["align with", "breakdance", "busy with", "come with", "could start with", "display", "double cross", "enjoy work with", "frustrate with", "have direct contact with", "help with", "hike with", "interview with", "obsess with", "relax with", "set here with", "share with", "sleep with", "volunteer with", "with", "work with"], "sing": ["dance", "sing", "singe"], "eat": ["bake", "binge", "cook", "drink", "eat", "only eat"], "live there": ["could live", "live", "live above", "live around", "live there", "live there for", "starve", "survive"], "smoke": ["blast", "rain", "smell", "smoke", "smoking"], "make look": ["fit right", "look", "make", "make bank", "make look", "would make"], "come by": ["be lead by", "come by", "know by face", 
"live by", "pay by", "send it by airmail"], "re": ["re"], "mow": ["mow"], "should have": ["have", "have any room available", "have catch", "have experience", "have family", "have no problem with that", "like have people around", "may have", "should have", "should sms over", "will have", "will never have"], "should consider": ["change", "consider", "count", "decide", "exercise", "interest", "may consider", "might", "plan", "recommend", "rule", "should", "should appreciate", "should call", "should choose from", "should consider", "should drink more", "should leave", "should light", "should really limit"], "need help": ["allow", "assist", "can help contact", "guide", "help", "manage", "need", "need help", "seek"], "for": ["cook for", "distinguish for", "fine for", "for", "good for", "hard for", "have seat for", "leak for", "market for", "marry for", "meet for", "nervous for", "pray for", "ready for", "retire for", "save for", "shop for", "stand for", "stop for", "suitable for", "vote for", "work for"], "be naturally": ["be", "be 30 pound over weight", "be a block away", "be about", "be also from", "be always with", "be arrive", "be available", "be content", "be decide about switch", "be employ", "be far away from", "be from", "be get", "be glad", "be glad the week be over", "be have", "be into", "be just below", "be just walk", "be learn", "be like", "be live", "be mixed with", "be naturally", "be not familar with", "be not feel", "be not into", "be not more expensive than", "be not really a fan", "be obsess with", "be plan", "be really into", "be see", "be something wrong with", "be still into", "be sure", "be that consider", "be the novel about", "be they from", "be together", "be two pound over", "be usually out", "burn be house down", "can be", "do be body good", "dress be yorkie as a lion", "dye be hair red", "have be", "help be mom out", "know it be", "like be", "love be with", "must be", "purchase be from ikea", "will all be", "will be", "will be expect", "will 
there be", "would be"], "play with": ["lead", "play", "play hoop with", "play the cd loud", "play with"], "have experience in": ["always work in", "can really get in touch with", "get a 4 . 0 in that class", "grow well only in", "have always work in", "have any experience in", "have c complication from surgery", "have experience in", "have family in the area", "have soap in they", "help in this city", "help while in", "in touch with", "read the classic in college", "study", "want they all in fifty", "would never stab in the back"], "m also into": ["m", "m also into", "m in"], "be just relax at": ["be a run at", "be at", "be be family at", "be good at", "be just relax at", "be just relax at home in", "be one at", "be very sociable at the weekend", "have so much fun at", "like be at", "will be at"], "SPEAKER1": ["SPEAKER1"], "teach": ["graduate", "learn", "teach", "teach during", "tutor"], "get off": ["fight people off", "get off", "get off on", "play off", "take friday off", "tear off", "throw it off"], "get": ["catch", "could get", "dig", "find", "fix", "get", "get along", "get more", "get ready", "get to go get", "get we", "give", "have get", "lose", "send", "should get"], "bring": ["bring", "build", "come", "grow", "hold", "raise", "reach", "share", "shed", "turn"], "can not handle": ["can always play", "can always use", "can call it in", "can not", "can not afford", "can not deal with", "can not do", "can not drink", "can not eat", "can not endure", "can not find", "can not handle", "can not imagine", "can not kick", "can not stand", "can not wait for", "can talk about", "could supply", "must water", "type", "will not pull"], "will take": ["can take", "can take the test again", "could take", "may take", "take", "will", "will bring back", "will come", "will get", "will hold", "will open", "will take", "would give"], "fast than": ["fast than", "hard than", "make more than", "well than"], "match": ["beat", "draw", "match", "race", "replay", "win"], "should pay": 
["ask bill if", "can pay with", "cancel", "charge", "cost", "deposit", "guarantee", "must pay", "owe", "pay", "pay for", "should pay", "would pay"], "wreck": ["wreck"], "think about": ["about", "concern about", "daydream about", "feel about", "have idea about", "hear about", "know a lot about", "know about", "know much about", "know nothing about", "like talk about", "open about", "passionate about", "talk about", "think about", "wanna talk about", "worry about"], "can accommodate": ["can", "can accommodate", "can alphabetize", "can also try", "can catch", "can get", "can give", "can handle", "can have", "can have weekend free", "can help", "can hold", "can make", "can only spend", "can phone", "can play", "can recommend", "can refund", "can relax", "can see", "can show", "can speak", "can spend", "can stay", "can stay there for hour", "can try", "can use", "may borrow", "will suit well"], "provide": ["arrange", "include", "locate", "provide", "require", "serve", "support", "use"], "should join": ["adopt", "follow", "join", "meet", "should join", "should meet", "sign", "volunteer"], "sleep like": ["feel like", "pet sitting", "relax", "sleep", "sleep like", "tired"], "read": ["book", "copy", "like read", "note", "post", "read", "write"], "stay with": ["get along with", "hang with", "have stay with", "live with", "rather stay", "stay", "stay home with", "stay with"], "pull out": ["burn down", "calm down", "check out", "could fill out", "cut out", "find out", "help out with", "lock out", "pass out", "pull out", "try out", "wanna try out", "work out"], "travel here for": ["cook 100 meal for", "exchange", "get ready for", "here for", "live here for", "look for", "make some for", "must leave for", "reserve", "travel", "travel here for", "travel more for", "visit"], "in": ["bear in", "believe in", "check in", "dance in", "fall in", "follow in", "in", "in between", "interested in", "live in", "locate in", "major in", "park in", "permit in", "play in", "play piano in", 
"proficiency in", "serve in", "set in", "sing in", "start in", "teach in", "work in"], "want": ["let", "prefer", "should try", "try", "want", "wish"], "think": ["dislike", "feel", "forget", "glad", "hope", "imagine", "kid", "know", "mean", "remember", "say", "scare", "seem", "sort", "spell anything wrong", "think"]}

#### New

In [None]:
abstract_predicates_R = {"": [""], "think": ["forget", "glad", "hope", "imagine", "kid", "know", "mean", "remember", "say", "scare", "sort", "spell anything wrong", "think"], "from": ["call from", "different from", "exhausted from", "from", "graduate from", "import from", "migrate from", "retire from", "switch from", "tired from", "visit from"], "should have": ["have", "have any room available", "have catch", "have experience", "have family", "have no problem with that", "like have people around", "may have", "should have", "should sms over", "will have", "will never have"], "fast than": ["fast than", "hard than", "make more than", "well than"], "take advantage of": ["afraid of", "deserve of", "get enough of", "have dream of go", "hear of", "in need of", "inherit", "involve", "its", "lose track of", "make of", "present", "speak of", "take advantage of", "take care of", "will spend most of the weekend with"], "make it": ["bring it over", "could make it as", "could try it on", "it", "like it cut", "make it", "make it a secret between we", "read it often", "want it", "wear it much", "wear it that way", "will make it work"], "bring": ["bring", "build", "come", "grow", "hold", "raise", "reach", "share", "shed", "turn"], "be still on": ["be a camera on", "be a copy machine on", "be good at work on", "be on", "be plan on go", "be set on", "be still on", "must not sit on", "spend all be money on shopping", "try they on be travel"], "get off": ["fight people off", "get off", "get off on", "play off", "take friday off", "tear off", "throw it off"], "recycle": ["recycle", "scrap", "waste"], "geocache": ["code", "geocache"], "ride": ["chase", "drive", "hike", "ride", "swim", "tour", "train"], "keep": ["keep", "keep from", "keep safe", "would please keep"], "match": ["draw", "match", "race", "replay"], "play with": ["lead", "play", "play hoop with", "play the cd loud", "play with"], "stay with": ["get along with", "hang with", "have stay with", "live with", "rather stay", 
"stay", "stay home with", "stay with"], "hear clearly": ["describe", "hear", "hear clearly", "sound", "speak"], "want": ["let", "prefer", "should try", "try", "want", "wish"], "wass": ["wass"], "get": ["catch", "could get", "dig", "find", "fix", "get", "get along", "get more", "get ready", "get to go get", "get we", "give", "have get", "lose", "send", "should get"], "still at": ["arrive at", "available at", "bad at", "eat at", "go at", "good at", "stay at", "still at", "take lot at the bar", "teach at", "turn right at", "volunteer at", "work at", "work night at", "work tonight at"], "beat": ["beat", "win"], "read": ["book", "copy", "like read", "note", "post", "read", "write"], "travel here for": ["cook 100 meal for", "exchange", "get ready for", "here for", "live here for", "look for", "make some for", "must leave for", "reserve", "travel", "travel here for", "travel more for", "visit"], "go there": ["can go", "could go", "fall", "go", "go for", "go great with", "go hike often", "go just for", "go next", "go there", "go there with", "go thru", "hang", "leave", "pass", "run", "rush", "s it go"], "for": ["cook for", "distinguish for", "fine for", "for", "good for", "hard for", "have seat for", "leak for", "market for", "marry for", "meet for", "nervous for", "pray for", "ready for", "retire for", "save for", "shop for", "stand for", "stop for", "suitable for", "vote for", "work for"], "see tomorrow": ["call", "miss", "see", "see tomorrow", "should see", "tell", "wait", "watch"], "add": ["add", "could add", "mix"], "will take": ["can take", "can take the test again", "could take", "may take", "take", "will", "will bring back", "will come", "will get", "will hold", "will open", "will take", "would give"], "can accommodate": ["can", "can accommodate", "can alphabetize", "can also try", "can catch", "can get", "can give", "can handle", "can have", "can have weekend free", "can help", "can hold", "can make", "can only spend", "can phone", "can play", "can recommend", 
"can refund", "can relax", "can see", "can show", "can speak", "can spend", "can stay", "can stay there for hour", "can try", "can use", "may borrow", "will suit well"], "frustrate with": ["align with", "breakdance", "busy with", "come with", "could start with", "display", "double cross", "enjoy work with", "frustrate with", "have direct contact with", "help with", "hike with", "interview with", "obsess with", "relax with", "set here with", "share with", "sleep with", "volunteer with", "with", "work with"], "think about": ["about", "concern about", "daydream about", "feel about", "have idea about", "hear about", "know a lot about", "know about", "know much about", "know nothing about", "like talk about", "open about", "passionate about", "talk about", "think about", "wanna talk about", "worry about"], "start": ["begin", "finish", "start"], "into": ["back into", "bump into", "get into", "into", "transfer into"], "like to see in beijing": ["can call it in", "can really get in touch with", "interested in go to", "like to go far in", "like to live in", "like to see in beijing", "look in to", "want they all in fifty", "would never stab in the back"], "teach": ["graduate", "learn", "teach", "teach during", "tutor"], "smoke": ["blast", "rain", "smell", "smoke", "smoking"], "want something to": ["can listen to", "can not afford to", "can t seem to make", "charge it to", "feel connected to", "give he to we", "give time to do", "have anything to", "have no reason to distrust", "know how to use", "like to do", "like to do for fun", "like to try", "like we to", "love listen to", "love to", "love to own", "love to sing", "mind if walk to", "need this weekend to", "own", "prefer not to", "prefer not to bother", "send it to", "take they to her grave", "try not to", "want something to", "want to", "want to do", "want to extend", "want to pay", "want to sign", "want to watch a movie together", "willing to do", "would ever learn to", "would love to"], "have experience in": ["always 
work in", "get a 4 . 0 in that class", "grow well only in", "have always work in", "have any experience in", "have c complication from surgery", "have experience in", "have family in the area", "have soap in they", "help in this city", "help while in", "in touch with", "read the classic in college", "study"], "move back here from": ["close", "come home from", "drive under", "get back from", "get home from", "land", "like cash back", "love watch bird from indoor", "move", "move back here from", "move from", "move here when", "move here with", "send somebody over", "walk around"], "get a good look at": ["get a good look at", "get dog at the clinic", "get real good at", "good", "have breakfast at a great restaurant", "throw a milk carton at"], "expect to": ["able to", "able to find", "about to", "about to fix", "allow to", "close to", "come to", "cook to", "could speak to", "devote more time to", "excited to", "expect to", "forget to", "glad to", "happen to", "hard to", "have to", "have to give", "hope to", "inspire to", "intend to", "listen to", "look forward to", "may speak to", "mean to", "meditate to", "migrate to", "move to", "need to", "need to make per year", "new to", "open to", "pay attention to", "pay more attention to", "plan to", "prefer to", "propose to", "read biography to", "ready to", "respond to", "ride to", "scared to", "seem to", "similar to", "sound appeal to", "speak to", "stick to", "take to", "talk to", "tell to", "to", "to withdraw", "try to", "will have to take", "will learn to deal with", "will request hr to", "write to"], "could not recognize": ["accept", "care if", "cause", "could", "could not recognize", "could not stand", "have not make", "have not pick", "have not time for", "never", "please that", "would", "would like", "would love", "would not change", "would not eat", "would not mind", "would rather"], "eat": ["bake", "binge", "cook", "drink", "eat", "only eat"], "sleep like": ["like", "like watch", "pet sitting", "relax", "sleep", 
"sleep like", "tired"], "wash": ["dye", "exfoliate", "paint", "wash", "washing", "wear"], "divorce": ["divorce", "marry"], "switch": ["dial", "drop", "switch"], "should buy": ["buy", "purchase", "sell", "shop", "should buy", "should sell"], "need help": ["allow", "assist", "can help contact", "guide", "help", "manage", "need", "need help", "seek"], "in": ["bear in", "believe in", "check in", "dance in", "fall in", "follow in", "in", "in between", "interested in", "live in", "locate in", "major in", "park in", "permit in", "play in", "play piano in", "proficiency in", "serve in", "set in", "sing in", "start in", "teach in", "work in"], "should join": ["adopt", "follow", "join", "meet", "should join", "should meet", "sign", "volunteer"], "ace": ["ace"], "come in": ["bring in", "come back in", "come in", "find in", "fit in", "get in because", "get in english 101", "get interested in", "go first thing in", "keep in", "live here in", "make in", "move in", "play the piano in", "shut down in", "spend the day in", "stay in", "stay last night in", "wait for in"], "need to go back to": ["allow people to check out", "get ready to", "get ready to take", "get to", "go out to", "go to", "go to across", "go to college for", "go to far from", "go to town on", "go to transfer", "go to watch it out of curiosity", "have plan to go", "like go out to", "like go to", "like to go", "like to go for", "like to go to", "love to go to", "love to go to the beach with", "love to sit around", "move back to", "need to go back to", "plan to go", "should go to buy", "want to go out to", "want to take"], "be still in": ["be a serious disturbance in", "be a story about breakdance in", "be go in", "be in", "be in attendance at", "be in between", "be in the chair", "be in to", "be it in", "be more in to", "be still in", "be they in"], "make look": ["fit right", "look", "make", "make bank", "make look", "would make"], "pull out": ["burn down", "calm down", "check out", "could fill out", "cut out", 
"find out", "help out with", "lock out", "pass out", "pull out", "try out", "wanna try out", "work out"], "should pay": ["ask bill if", "can pay with", "cancel", "charge", "cost", "deposit", "guarantee", "must pay", "owe", "pay", "pay for", "should pay", "would pay"], "live there": ["could live", "live", "live above", "live around", "live there", "live there for", "starve", "survive"], "live here for a long time": ["be fake straight for a long time", "do for a job", "do for a living", "draw they all the time all over", "have a crush on", "have a seat available", "have a small team under", "know a country song about that", "live here for a long time", "live here for a month", "make a recommendation for", "need a full time job for health insurance", "practice a lot as a kid", "race the car for a living", "share a one bedroom with", "should leave for a tip", "work a lot during", "work here long time"], "work as": ["portray woman as weak", "read those as a child", "roleplay as", "start as", "work", "work as", "work with vet as"], "wreck": ["wreck"], "want to be": ["be able to", "be able to take the test again", "be excited to", "be go to", "be go to make be life easy", "be hope to visit", "be look forward to", "be ready to", "be require to", "be sneak away to", "be someone come to get", "be take the kid to", "be to", "be to leave nothing for a tip", "be to play", "be try to get", "be up to", "come here to be", "do be job remotely", "go to be", "go to be take", "have to be", "like to be", "like to buy a keepsake for be girlfriend", "need to be", "need to get be dance gear ready", "send be car to be fix", "try to be", "use to be", "want the money to be transfer", "want to be", "want to be here tomorrow", "will just put be coat away", "would like to be"], "should consider": ["change", "consider", "count", "decide", "exercise", "interest", "may consider", "might", "plan", "recommend", "rule", "should", "should appreciate", "should call", "should choose from", "should 
consider", "should drink more", "should leave", "should light", "should really limit"], "mow": ["mow"], "ai": ["ai"], "like do": ["can do", "do", "do for", "do for fun", "do well than", "fancy", "got do", "hate", "let do", "like do"], "come by": ["be lead by", "come by", "know by face", "live by", "pay by", "send it by airmail"], "be here for": ["be always a need for", "be around for", "be come here for", "be currently look for", "be for", "be here for", "be it for", "be look for", "be lunch time for", "be not for", "be not look good for", "be not much room for", "be off for", "be out for", "be plenty for", "be quiet for", "be serious business for", "be shop for", "be take mom out", "be there before", "be there good hiking spot near", "be this for", "do all be shopping through", "drive be big truck around", "get be key out", "live there be whole life", "should take be shoe off", "spend be holiday from"], "be naturally": ["be", "be 30 pound over weight", "be a block away", "be about", "be also from", "be also into", "be always with", "be arrive", "be available", "be content", "be decide about switch", "be employ", "be far away from", "be from", "be get", "be glad", "be glad the week be over", "be have", "be into", "be just below", "be just walk", "be learn", "be like", "be live", "be mixed with", "be naturally", "be not familar with", "be not feel", "be not into", "be not more expensive than", "be not really a fan", "be obsess with", "be plan", "be really into", "be see", "be something wrong with", "be still into", "be sure", "be that consider", "be the novel about", "be they from", "be together", "be two pound over", "be usually out", "burn be house down", "can be", "do be body good", "dress be yorkie as a lion", "dye be hair red", "have be", "help be mom out", "know it be", "like be", "love be with", "must be", "purchase be from ikea", "will all be", "will be", "will be expect", "will there be", "would be"], "save": ["rather save", "save", "spend"], "kill": 
["arrest", "die", "hurt", "kill", "rob", "shoot"], "sing": ["dance", "sing", "singe"], "like to purchase": ["build house upon request", "have to give 3 shoot", "like listen to", "like some magic to take away", "like to", "like to buy", "like to cook for", "like to depart", "like to drink", "like to eat at", "like to find", "like to have", "like to leave on", "like to make", "like to meet for a drink", "like to pay", "like to purchase", "like to read", "like to return", "like to send", "like to sing", "like to stay", "like to use", "love to live on", "love to look at", "order", "use to", "use to buy", "use to have", "would like to", "would tike to buy"], "be one of": ["be a fan of", "be a huge fan of", "be a type of", "be an employee of", "be just off of", "be one of", "be part of", "be scared of", "become", "need all of be supply right away"], "s": ["s"], "break up with": ["break up with", "can not put up with", "catch up on", "fill up", "finish up", "grow up as", "grow up in", "keep up all night", "like to meet up with", "make up", "meet up for", "meet up with", "open up", "show up for", "stand up at", "take up", "wake up", "wake up at"], "retire": ["quit", "retire"], "can not handle": ["can always play", "can always use", "can not", "can not afford", "can not deal with", "can not do", "can not drink", "can not eat", "can not endure", "can not find", "can not handle", "can not imagine", "can not kick", "can not stand", "can not wait for", "can talk about", "could supply", "must water", "type", "will not pull"], "be just relax at": ["be a run at", "be at", "be be family at", "be good at", "be just relax at", "be just relax at home in", "be one at", "be very sociable at the weekend", "have so much fun at", "like be at", "will be at"], "name": ["name"], "feel like": ["dislike", "enjoy", "feel", "feel like", "look like", "love", "seem", "sound like", "taste like"], "go on": ["go on", "go out on", "mind go on", "spend too much on"], "provide": ["arrange", "include", 
"locate", "provide", "require", "serve", "support", "use"], "on": ["catch on", "church on", "dance on", "depend on", "fall on", "keen on", "live on", "love work on", "on", "plan on", "plan on make", "plane on watch", "play on", "play soccer on", "put on", "scald on", "show on", "sit on", "work on"]}

In [None]:
## Make new dict containing the changed abstract predicates 
abstract_predicates = {}
new_keys = ["", "think",	"be from",	"have",	"be greater than",	"take advantage of",	"make ",	"bring",	"be",	"get off",	"recycle",	"geocache",	"ride",	"keep from ",	"gameplay",	"play (game)",	"stay",	"hear",	"want",	"be ",	"get",	"be at",	"winning/losing",	"literature",	"travel",	"go",	"do .. For",	"future actions",	"add",	"take",	"can",	"be with",	"think (about)",	"start",	"get into",	"intention ",	"teach",	"smoke",	"like ",	"have something in ",	"come back",	"look",	"try to",	"cannot ",	"food actions ",	"relax",	"personal hygiene",	"marriage",	"stop",	"buy",	"help",	"be in ",	"join ",	"ace",	"come in ",	"need",	"be",	"make ",	"try out",	"pay ",	"live",	"live",	"do ",	"stop ",	"want",	"consider",	"mow",	"ai",	"do ",	"be by ",	"be for",	"be about",	"save (money)",	"hurt",	"sing",	"like",	"be of","be",	"break up with",	"quit",	"cannot ",	"not be able ",	"name",	"like",	"go on",	"use",	"be on "]

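# Rename the i-th old group to new_keys[i]; groups that share a new key are merged.
# Keys are compared verbatim, so e.g. "be" and "be " (trailing space) stay separate groups.
for i, (k, v) in enumerate(abstract_predicates_R.items()):
  if new_keys[i] not in abstract_predicates:
    abstract_predicates[new_keys[i]] = []
  abstract_predicates[new_keys[i]] += v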

print(abstract_predicates)


{'': [''], 'think': ['forget', 'glad', 'hope', 'imagine', 'kid', 'know', 'mean', 'remember', 'say', 'scare', 'sort', 'spell anything wrong', 'think'], 'be from': ['call from', 'different from', 'exhausted from', 'from', 'graduate from', 'import from', 'migrate from', 'retire from', 'switch from', 'tired from', 'visit from'], 'have': ['have', 'have any room available', 'have catch', 'have experience', 'have family', 'have no problem with that', 'like have people around', 'may have', 'should have', 'should sms over', 'will have', 'will never have'], 'be greater than': ['fast than', 'hard than', 'make more than', 'well than'], 'take advantage of': ['afraid of', 'deserve of', 'get enough of', 'have dream of go', 'hear of', 'in need of', 'inherit', 'involve', 'its', 'lose track of', 'make of', 'present', 'speak of', 'take advantage of', 'take care of', 'will spend most of the weekend with'], 'make ': ['bring it over', 'could make it as', 'could try it on', 'it', 'like it cut', 'make it', 'm

#### Create lookup and bio-lookup 

In [None]:
unique_list = list(unique_predicates_clean)
first_half = unique_list[:int(len(unique_list)/2)] # this is just for explanatory purposes! You should create a new categorization of the predicates.
second_half = unique_list[int(len(unique_list)/2):]
bio_dict = {(3,4): first_half, (5,6): second_half} # create a dictionary that contains tuple (B-tag, I-tag) as keys (int, int) and a list of all specific predicates belonging to that group as values
lookup = {}
for key, value in bio_dict.items():
    for pred in value:
        lookup[pred] = key
bio_lookup = {3: 'like', 5: 'do'} # create a dictionary that contains the B-tag (int) for each abstract predicate as key and the abstract predicate (string) as value

In [None]:
print(unique_list)
print(unique_predicates_clean)

['', 'go for', 'dance on', 'go to watch it out of curiosity', 'have idea about', 'can recommend', 'scald on', 'join', 'go to across', 'stay', 'can have', 'plan', 'never', 'lose', 'tutor', 'be ready to', 'talk about', 'need this weekend to', 'to withdraw', 'SPEAKER1', 'could add', 'stay at', 'arrest', 'code', 'should go to buy', 'involve', 'stay home with', 'can play', 'from', 'should meet', 'into', 'have a seat available', 'count', 'be take the kid to', 'propose to', 'hike', 'smoke', 'look forward to', 'kid', 'pet sitting', 'hard for', 'stay with', 'binge', 'write', 'describe', 'allow', 'obsess with', 'get ready', 'have experience', 're', 'rule', 'able to', 'will spend most of the weekend with', 'burn be house down', 'be currently look for', 'leave', 'taste like', 'cook', 'read biography to', 'break up with', 'make it a secret between we', 'dye be hair red', 'volunteer with', 'go at', 'new to', 'dance', 'use to have', 'be just off of', 'be serious business for', 'be fake straight for a

In [None]:
print(lookup)

{'': (3, 4), 'go for': (3, 4), 'dance on': (3, 4), 'go to watch it out of curiosity': (3, 4), 'have idea about': (3, 4), 'can recommend': (3, 4), 'scald on': (3, 4), 'join': (3, 4), 'go to across': (3, 4), 'stay': (3, 4), 'can have': (3, 4), 'plan': (3, 4), 'never': (3, 4), 'lose': (3, 4), 'tutor': (3, 4), 'be ready to': (3, 4), 'talk about': (3, 4), 'need this weekend to': (3, 4), 'to withdraw': (3, 4), 'SPEAKER1': (3, 4), 'could add': (3, 4), 'stay at': (3, 4), 'arrest': (3, 4), 'code': (3, 4), 'should go to buy': (3, 4), 'involve': (3, 4), 'stay home with': (3, 4), 'can play': (3, 4), 'from': (3, 4), 'should meet': (3, 4), 'into': (3, 4), 'have a seat available': (3, 4), 'count': (3, 4), 'be take the kid to': (3, 4), 'propose to': (3, 4), 'hike': (3, 4), 'smoke': (3, 4), 'look forward to': (3, 4), 'kid': (3, 4), 'pet sitting': (3, 4), 'hard for': (3, 4), 'stay with': (3, 4), 'binge': (3, 4), 'write': (3, 4), 'describe': (3, 4), 'allow': (3, 4), 'obsess with': (3, 4), 'get ready': (3

### Construct Bio-lookup - OUR implementation

First, we create a set consisting of all unique token spans annotated as 'predicate'. This is the set from which you need to create a set of abstract predicates. Note that many of these 'unique predicates' are in fact annotation errors, where the subject or object has been included in the predicate annotation.

For each abstract predicate you define, you need a numerical B- and I-tag for the BIO annotation used below. Start from (3,4) for the first abstract predicate (e.g. B-like, I-like) and give each following one the next pair: (5,6), (7,8), and so on. Tag 0 is reserved for 'O', and tags 1 and 2 are reserved for the B- and I-tags of subjects and objects.

Once you've created a set of abstract predicates and their corresponding B- and I-tags, you will need to create two lookup dictionaries for the BIO tagging that we will do later on. The dictionary `lookup` is used to get the correct BIO tag from the predicate token span and is used for converting triples to BIO-tags. The dictionary `bio_lookup` is used to get the abstract predicate from the BIO tag, and is used to convert BIO tags to tokens.
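
As a minimal sketch of the numbering scheme and the two dictionaries described above, the toy example below assigns tag pairs to two made-up abstract predicates. The predicate groups and all `toy_` names are illustrative assumptions, not the categories used in this notebook.

```python
# Toy abstract predicates (hypothetical) mapped to the surface predicates they cover
toy_abstract = {
    "like": ["like", "love", "enjoy"],
    "be from": ["from", "come from"],
}

toy_lookup = {}      # surface predicate -> (B-tag, I-tag)
toy_bio_lookup = {}  # B-tag -> abstract predicate

# Tags 0, 1 and 2 are reserved (O, and B/I for subjects and objects),
# so the first predicate pair is (3, 4), the next (5, 6), and so on.
for n, (abstract, surface_forms) in enumerate(toy_abstract.items()):
    b_tag = 3 + 2 * n
    i_tag = b_tag + 1
    toy_bio_lookup[b_tag] = abstract
    for form in surface_forms:
        toy_lookup[form] = (b_tag, i_tag)

# Round trip: surface predicate -> tag pair -> abstract predicate
toy_b, toy_i = toy_lookup["come from"]
print(toy_b, toy_i, toy_bio_lookup[toy_b])

# Total number of distinct tags: 3 reserved tags plus one (B, I) pair per abstract predicate
toy_output_dim = 3 + 2 * len(toy_abstract)
print(toy_output_dim)
```

The last value is the total number of BIO-tags (including 0, 1 and 2), which is what the classification heads later in the notebook expect as their output dimension.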

In [None]:
# Init
bio_dict = dict()
btag_idx = 3 #initial predicate b tag 3
bio_lookup = dict()

####TODO: build bio_lookup with abstract predicate as value####
for abstract, unique_list in abstract_predicates.items():
  # bio_dict - maps the (B-tag, I-tag) pair to all unique predicates belonging to one abstract predicate
  bio_dict[(btag_idx, btag_idx + 1)] = unique_list

  # bio_lookup - maps the B-tag to the abstract predicate
  bio_lookup[btag_idx] = abstract  #.split()[0]

  # each abstract predicate consumes two tags (B and I)
  btag_idx += 2

print("bio_dict: ", bio_dict)
print("bio_lookup: ", bio_lookup)

bio_dict:  {(3, 4): [''], (5, 6): ['forget', 'glad', 'hope', 'imagine', 'kid', 'know', 'mean', 'remember', 'say', 'scare', 'sort', 'spell anything wrong', 'think'], (7, 8): ['call from', 'different from', 'exhausted from', 'from', 'graduate from', 'import from', 'migrate from', 'retire from', 'switch from', 'tired from', 'visit from'], (9, 10): ['have', 'have any room available', 'have catch', 'have experience', 'have family', 'have no problem with that', 'like have people around', 'may have', 'should have', 'should sms over', 'will have', 'will never have'], (11, 12): ['fast than', 'hard than', 'make more than', 'well than'], (13, 14): ['afraid of', 'deserve of', 'get enough of', 'have dream of go', 'hear of', 'in need of', 'inherit', 'involve', 'its', 'lose track of', 'make of', 'present', 'speak of', 'take advantage of', 'take care of', 'will spend most of the weekend with'], (15, 16): ['bring it over', 'could make it as', 'could try it on', 'it', 'like it cut', 'make it', 'make it 

In [None]:
##### For Jaap, export abstract predicate list

# import pandas as pd

# df = pd.DataFrame.from_dict(bio_lookup, orient = "Index")
# df = df.transpose()
# df.to_csv("abstract_predicate.csv", index = False)

In [None]:
lookup = dict()

for key, value in bio_dict.items():
  for predicates in value:
    lookup[predicates] = key
print(lookup)

{'': (3, 4), 'forget': (5, 6), 'glad': (5, 6), 'hope': (5, 6), 'imagine': (5, 6), 'kid': (5, 6), 'know': (5, 6), 'mean': (5, 6), 'remember': (5, 6), 'say': (5, 6), 'scare': (5, 6), 'sort': (5, 6), 'spell anything wrong': (5, 6), 'think': (5, 6), 'call from': (7, 8), 'different from': (7, 8), 'exhausted from': (7, 8), 'from': (7, 8), 'graduate from': (7, 8), 'import from': (7, 8), 'migrate from': (7, 8), 'retire from': (7, 8), 'switch from': (7, 8), 'tired from': (7, 8), 'visit from': (7, 8), 'have': (9, 10), 'have any room available': (9, 10), 'have catch': (9, 10), 'have experience': (9, 10), 'have family': (9, 10), 'have no problem with that': (9, 10), 'like have people around': (9, 10), 'may have': (9, 10), 'should have': (9, 10), 'should sms over': (9, 10), 'will have': (9, 10), 'will never have': (9, 10), 'fast than': (11, 12), 'hard than': (11, 12), 'make more than': (11, 12), 'well than': (11, 12), 'afraid of': (13, 14), 'deserve of': (13, 14), 'get enough of': (13, 14), 'have d

In [None]:
def unique_to_abstract(pred, abstract_dict):
  """ Maps the predicate to its abstract version
      returns the abstracted predicate
      raises KeyError if the predicate is not covered by any abstract predicate
  """
  for abstract, predicates in abstract_dict.items():
    if pred == abstract or pred in predicates:
      return abstract
  raise KeyError("no abstract predicate found for %r" % pred)

## Converting formats

Triple arguments are stored as lists of indices (e.g. [[0, 1], [0, 2]] indicating the second and third token of the first turn). Instead, we use a BIO tagging scheme to represent these arguments as a vector of labels (one label for each token in the dialogue).

Moreover, we flatten the dialogue turns into one flat dialogue using `<eos>` as a separator token.
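
The index arithmetic behind this conversion can be sketched on a toy dialogue. This is an illustration only: the dialogue is made up, and `toy_turns`, `toy_flat`, `toy_labels` and the helper `flat_index` are hypothetical names that do not appear elsewhere in the notebook.

```python
# Two tokenized turns (hypothetical example data)
toy_turns = [["i", "like", "tea"], ["me", "too"]]

# Flatten the turns, closing each turn with the <eos> separator
toy_flat = [tok for turn in toy_turns for tok in turn + ["<eos>"]]

def flat_index(turns, turn_id, token_id):
    """Position of token `token_id` of turn `turn_id` in the flattened dialogue."""
    # Every earlier turn contributes its tokens plus one <eos>
    return sum(len(t) + 1 for t in turns[:turn_id]) + token_id

# An argument annotated as [[1, 0], [1, 1]]: first and second token of the second turn
toy_span = [[1, 0], [1, 1]]
toy_labels = [0] * len(toy_flat)
for j, (turn_id, token_id) in enumerate(toy_span):
    # First token of the span gets B (1), every following token gets I (2)
    toy_labels[flat_index(toy_turns, turn_id, token_id)] = 1 if j == 0 else 2

print(toy_flat)
print(toy_labels)
```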

In [None]:
import numpy as np
import pandas as pd


def triple_to_bio_tags(annotation, arg, lookup):
    """ Converts the token indices of the annotations to a vector of BIO labels
        for an argument.

        params:
        dict annotation:    loaded annotation file (see load_annotations)
        int arg:            argument to create tag sequence for (subj=0, pred=1, obj=2)
        dict lookup:        maps a cleaned predicate span to its (B-tag, I-tag) pair

        returns:    ndarray with BIO labels (O=0, B=1, I=2 for subjects and objects;
                    the tag pair from lookup for predicates)
    """
    # Determine length of dialogue
    turns = annotation['tokens']
    triples = annotation['annotations']
    num_tokens = sum([len(turn) + 1 for turn in turns])  # +1 for <eos>

    # Create vector same size as dialogue
    mask = np.zeros(num_tokens, dtype=np.uint8)

    # Label annotated arguments as BIO tags
    for triple in triples:
        if arg == 1:
            pred = get_predicate_tokens(annotation, triple)
            
            if pred is not None:     #and pred is not ""
                #This use of predicate should also go through cleaning and lemmatization
                pred = cleaning([pred])
                pred = list(pred)[0]
                
                B_tag, I_tag = lookup[pred]
                # print(B_tag, I_tag, pred)
                
                for j, (turn_id, token_id) in enumerate(triple[arg]):
                    k = sum([len(t) + 1 for t in turns[:turn_id]]) + token_id  # k = index of token in dialogue
                    mask[k] = B_tag if j == 0 else I_tag # first token of the predicate span gets the B-tag, all following tokens the I-tag
        else:
            for j, (turn_id, token_id) in enumerate(triple[arg]):
                k = sum([len(t) + 1 for t in turns[:turn_id]]) + token_id  # k = index of token in dialogue
                mask[k] = 1 if j == 0 else 2 # first token of a subject/object span gets B (1), all following tokens get I (2)
    return mask

In [None]:
tokens, labels = [], []

for ann in annotations:
    # Map triple arguments to BIO tagged masks
    labels.append((triple_to_bio_tags(ann, 0, lookup),
                   triple_to_bio_tags(ann, 1, lookup),
                   triple_to_bio_tags(ann, 2, lookup)))
    
    # Flatten turn sequence
    tokens.append([t for ts in ann['tokens'] for t in ts + ['<eos>']])
    
# Show as BIO scheme
i = random.randint(0, len(tokens) - 1)
print(i) #572, 1075
i = 1075
display(pd.DataFrame(labels[i], columns=tokens[i], index=['subj', 'pred', 'obj']))
print(labels[i])

366


Unnamed: 0,the,newest,show,that,just,got,leaked,<eos>,what,another,...,no,not,at,all,it,'s,a,fantasy,show.1,<eos>.1
subj,1,2,2,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
pred,0,0,0,0,0,43,0,0,0,0,...,0,0,0,0,0,-121,0,0,0,0
obj,0,0,0,0,0,0,1,0,0,1,...,0,0,0,0,0,0,1,2,2,0


(array([1, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0], dtype=uint8), array([  0,   0,   0,   0,   0,  43,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0, 135,   0,   0,   0,   0],
      dtype=uint8), array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       2, 2, 0], dtype=uint8))


In [None]:
# Collect every tag that occurs in the subject, predicate and object masks
tagging = []
for label in labels:
  for mask in label:
    tagging.extend(mask)

unique_tags = sorted(set(tagging))
print(len(unique_tags), unique_tags)

In [None]:
import re

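def bio_tags_to_tokens(tokens, mask, bio_lookup, predicate=False, one_hot=False):
    """ Converts a vector of BIO-tags into spans of tokens. If BIO-tags are one-hot encoded,
        one_hot=True will first perform an argmax to obtain the BIO labels.

        params:
        list tokens:      list of subwords or tokens (as tokenized by Albert/AutoTokenizer)
        ndarray mask:     list of bio labels (one for each subword or token in 'tokens')
        dict bio_lookup:  maps a predicate B-tag to its abstract predicate
        bool predicate:   whether mask holds predicate tags; if True, abstract predicates are returned instead of token spans
        bool one_hot:     whether to interpret mask as a one-hot encoded sequence of shape |sequence|x|tags|

        returns:    set of extracted token spans (or abstract predicates if predicate=True)
    """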
    out = []
    span = []
    for i, token in enumerate(tokens):
        pred = mask[i]

        # Reverse one-hot encoding (optional)
        if one_hot:
            pred = np.argmax(pred)

        if pred %2 == 1:  # B 

            if predicate: # B
                abstr_pred = bio_lookup[pred]
                out.append(abstr_pred)
            else:  
                span = re.sub(r"[^\w\d\-']+", ' ', ''.join(span)).strip()
                out.append(span)
                span = [token]

        elif pred != 0 and pred % 2 == 0:  # I
            if predicate:
                continue
            else:  
                span.append(token)

    if span:
        span = re.sub(r"[^\w\d\-']+", ' ', ''.join(span)).strip()
        out.append(span)

    # Remove empty strings and duplicates
    return set([span for span in out if span.strip()])

In [None]:
i = random.randint(0, len(labels) - 1)  # randint is inclusive on both ends
i = 1075
print(i)
print(' '.join(tokens[i]) + '\n')

print('Subjects:')
print(bio_tags_to_tokens(['+' + t for t in tokens[i]], labels[i][0], bio_lookup))

print('\nPredicates:')
print(bio_tags_to_tokens(['+' + t for t in tokens[i]], labels[i][1], bio_lookup, predicate=True))

print('\nObjects:')
print(bio_tags_to_tokens(['+' + t for t in tokens[i]], labels[i][2], bio_lookup))

print(labels[i][0], "\n",
      labels[i][1], "\n",
      labels[i][2])

1075
the newest show that just got leaked <eos> what another celebrity sex tape ? <eos> no not at all it 's a fantasy show <eos>

Subjects:
{'the newest show', 'it'}

Predicates:
{'be about', 'get'}

Objects:
{'a fantasy show', 'another celebrity sex tape', 'leaked'}
[1 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0] 
 [  0   0   0   0   0  43   0   0   0   0   0   0   0   0   0   0   0   0
   0   0 135   0   0   0   0] 
 [0 0 0 0 0 0 1 0 0 1 2 2 2 0 0 0 0 0 0 0 0 1 2 2 0]
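
The label arrays printed above follow an extended BIO layout: 0 marks a token outside any span, an odd value marks the beginning (B) of a span and the next even value marks its continuation (I). A minimal sketch of that convention is given below. The helper name `describe_label` and the idea of numbering the B/I pairs as `(label - 1) // 2` are assumptions made for illustration; the notebook itself looks up predicate categories through `bio_lookup`.

```python
def describe_label(label):
    """Return the BIO tag type and the index of the B/I pair a label belongs to."""
    if label == 0:
        return ('O', None)
    tag = 'B' if label % 2 == 1 else 'I'
    return (tag, (label - 1) // 2)


# Subject and object labels only use the first pair (1 = B, 2 = I)
print([describe_label(l) for l in [1, 2, 2, 0]])

# Predicate labels such as 43 and 135 are B-tags of higher-numbered pairs
print(describe_label(43), describe_label(135))
```

Under this reading, the predicate labels 43 and 135 in the example are both B-tags, each selecting a different abstract predicate category.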


# ALBERT

## Setting up ALBERT for Argument Extraction - Subj, Pred, Obj

Now we set up ALBERT with a token classification head for each of the arguments. To this end we will use PyTorch to create a small linear classifier for each argument which we can slide over the output of ALBERT to make a prediction for each token. To train other models, you should adapt this code so that it works with your model of choice.

In [None]:
%%capture 
!pip install transformers

import torch
from transformers import AutoTokenizer, AutoModel, AutoConfig
from tqdm import tqdm
import numpy as np
import random
from datetime import date

In [None]:
class ArgumentExtraction(torch.nn.Module):
    # NOTE: output_dim must equal the total number of BIO tags (including 0, 1 and 2); adjust the default to your label set
    def __init__(self, base_model='albert-base-v2', path=None, output_dim=161, sep='<eos>'):
        """ Init model with multi-span extraction heads for SPO arguments.

            params:
            str base_model: Transformer architecture to use (default: albert-base-v2)
            str path:       Path to pretrained model
            int output_dim: Number of BIO labels predicted by each head
            str sep:        Token that separates dialogue turns (default: <eos>)
        """
        super().__init__()
        print('loading %s for argument extraction' % base_model)
        self._model = AutoModel.from_pretrained(base_model)
        self._base = base_model
        self._sep = sep

        # Load and extend tokenizer with special SPEAKER tokens
        self._tokenizer = AutoTokenizer.from_pretrained(base_model)
        self._tokenizer.add_tokens(['SPEAKER1', 'SPEAKER2'], special_tokens=True)
        print(len(self._tokenizer))
        self._model.resize_token_embeddings(len(self._tokenizer))

        # Add token classification heads
        hidden_size = AutoConfig.from_pretrained(base_model).hidden_size
        self._subj_head = torch.nn.Linear(hidden_size, output_dim)
        self._pred_head = torch.nn.Linear(hidden_size, output_dim)
        self._obj_head = torch.nn.Linear(hidden_size, output_dim)
        self._output_dim = output_dim

        self._relu = torch.nn.ReLU()
        self._softmax = torch.nn.Softmax(dim=-1)

        # Set GPU if available
        self._device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
        self.to(self._device)

        # Load model / tokenizer if pretrained model is given
        if path:
            print('\t- Loading pretrained')
            model_path = glob.glob(path + '/argument_extraction_' + base_model)[0]
            self.load_state_dict(torch.load(model_path, map_location=self._device))

    def forward(self, input_ids, speaker_ids):
        """ Computes BIO label probabilities for each token
        """
        # Feed dialog through transformer
        y = self._model(input_ids=input_ids, token_type_ids=speaker_ids)
        h = self._relu(y.last_hidden_state)

        # Predict spans
        y_subj = self._softmax(self._subj_head(h))
        y_pred = self._softmax(self._pred_head(h))
        y_obj_ = self._softmax(self._obj_head(h))

        # Permute output as tensor of shape (N, |C|, seq_len)
        y_subj = y_subj.permute(0, 2, 1)
        y_pred = y_pred.permute(0, 2, 1)
        y_obj_ = y_obj_.permute(0, 2, 1)
        # print(y_subj)
        return y_subj, y_pred, y_obj_

    def _retokenize_tokens(self, tokens):
        # Tokenize each token individually (keeping track of subwords)
        input_ids = [[self._tokenizer.cls_token_id]]
        for t in tokens:
            if t != self._sep:
                input_ids.append(self._tokenizer.encode(t, add_special_tokens=False))
            else:
                input_ids.append([self._tokenizer.eos_token_id])

        # Flatten input_ids
        f_input_ids = torch.LongTensor([[i for ids in input_ids for i in ids]]).to(self._device)

        # Determine how often we need to repeat the labels
        repeats = [len(ids) for ids in input_ids]

        # Set speaker IDs
        speaker_ids = [0] + [tokens[:i + 1].count(self._sep) % 2 for i in range(len(tokens))][:-1]  # TODO: make pretty
        speaker_ids = self._repeat_speaker_ids(speaker_ids, repeats)

        return f_input_ids, speaker_ids, repeats

    def _repeat_speaker_ids(self, speaker_ids, repeats):
        """ Repeats speaker IDs for tokens that were split into multiple subwords.
        """
        rep_speaker_ids = np.repeat([0] + list(speaker_ids), repeats=repeats)
        return torch.LongTensor([rep_speaker_ids]).to(self._device)

    def _repeat_labels(self, labels, repeats):
        """ Repeats BIO labels for tokens that were split into multiple subwords.
            Ensures B-labeled tokens are repeated as B-I-I etc.
        """
        # Repeat each label by the number of subwords per token
        rep_labels = []
        for label, rep in zip([0] + list(labels), repeats):
            # Outside
            if label == 0:
                rep_labels += [label] * rep
            # Beginning + Inside
            elif label % 2 == 1:    # odd labels are B-tags
                rep_labels += [label] + ([label+1] * (rep - 1))  # If label = B -> B-I-I-I...
            else:
                rep_labels += [label] + ([label] * (rep - 1))  # If label = I, repeat it unchanged
        return torch.LongTensor([rep_labels]).to(self._device)

    def fit(self, tokens, labels, epochs=2, lr=1e-5, weight=800):
        """ Fits the model to the annotations
        """
        # Re-tokenize to obtain input_ids and associated labels
        X = []
        for token_seq, (subj_labels, pred_labels, _obj_labels) in zip(tokens, labels):
            input_ids, speaker_ids, repeats = self._retokenize_tokens(token_seq)
            subj_labels = self._repeat_labels(subj_labels, repeats)  # repeat when split into subwords
            pred_labels = self._repeat_labels(pred_labels, repeats)
            _obj_labels = self._repeat_labels(_obj_labels, repeats)
            X.append((input_ids, speaker_ids, subj_labels, pred_labels, _obj_labels))

        # Set up optimizer
        optim = torch.optim.AdamW(self.parameters(), lr=lr)

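        # Higher weight for B- and I-tags to account for class imbalance.
        # NOTE: CrossEntropyLoss applies log-softmax itself and expects raw logits,
        # whereas forward() already returns softmax probabilities.
        class_weights = torch.Tensor([1] + [weight] * (self._output_dim - 1)).to(self._device)
        criterion = torch.nn.CrossEntropyLoss(weight=class_weights)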

        print('Training!')
        for epoch in range(epochs):
            losses = []
            random.shuffle(X)
            for input_ids, speaker_ids, subj_y, pred_y, obj_y in tqdm(X):
                # Forward pass
                subj_y_hat, pred_y_hat, obj_y_hat = self(input_ids, speaker_ids)

                # Compute loss
                loss = criterion(subj_y_hat, subj_y)
                loss += criterion(pred_y_hat, pred_y)
                loss += criterion(obj_y_hat, obj_y)
                losses.append(loss.item())

                optim.zero_grad()
                loss.backward()
                optim.step()

            print("mean loss =", np.mean(losses))

        # Save model to file
        torch.save(self.state_dict(), 'argument_extraction_%s' % self._base)

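    def predict(self, token_seq):
        """ Predicts BIO label scores for a single token sequence.

            returns: subject, predicate and object scores (each of shape
                     |labels| x |subwords|) and the list of subwords
        """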
        # Retokenize token sequence
        input_ids, speaker_ids, _ = self._retokenize_tokens(token_seq)

        # Invert tokenization for viewing
        subwords = self._tokenizer.convert_ids_to_tokens(input_ids[0])

        # Forward-pass
        predictions = self(input_ids, speaker_ids)
        subjs = predictions[0].cpu().detach().numpy()[0]
        preds = predictions[1].cpu().detach().numpy()[0]
        objs = predictions[2].cpu().detach().numpy()[0]

        return subjs, preds, objs, subwords
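
The subword handling in `_repeat_labels` is easy to check in isolation. The sketch below mirrors its logic on plain Python lists, assuming every token yields at least one subword; the standalone function name `repeat_labels` and the example labels are made up for illustration.

```python
def repeat_labels(labels, repeats):
    """Expand token-level BIO labels to subword level.

    labels:  one label per original token (without [CLS])
    repeats: number of subwords per position, starting with [CLS]
    """
    expanded = []
    for label, rep in zip([0] + list(labels), repeats):
        if label != 0 and label % 2 == 1:
            # B-tag: the first subword keeps B, the remaining subwords get the matching I-tag
            expanded += [label] + [label + 1] * (rep - 1)
        else:
            # O- and I-tags are simply repeated
            expanded += [label] * rep
    return expanded


# Three tokens split into 1, 3 and 2 subwords; [CLS] always counts as one
print(repeat_labels([0, 43, 44], repeats=[1, 1, 3, 2]))
```

The B-label 43 is expanded to 43, 44, 44, so a multi-subword token never produces more than one span start.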

In [None]:
# !kill -9 -1

In [None]:
model_albert_base = ArgumentExtraction()  
model_albert_base.fit(tokens, labels, epochs=7)

loading albert-base-v2 for argument extraction
30002
Training!


100%|██████████| 1117/1117 [00:45<00:00, 24.48it/s]


mean loss = 13.635485337478515


100%|██████████| 1117/1117 [00:42<00:00, 26.51it/s]


mean loss = 13.152230007475577


100%|██████████| 1117/1117 [00:34<00:00, 32.19it/s]


mean loss = 13.0836987089876


100%|██████████| 1117/1117 [00:36<00:00, 30.49it/s]


mean loss = 13.046468474337749


100%|██████████| 1117/1117 [00:35<00:00, 31.68it/s]


mean loss = 13.03992255713754


100%|██████████| 1117/1117 [00:35<00:00, 31.66it/s]


mean loss = 13.028891769176951


100%|██████████| 1117/1117 [00:35<00:00, 31.86it/s]


mean loss = 13.014455479388804


In [None]:
model_save_name = 'classifier_141222.pt'

torch.save(model_albert_base, f"{root_dir}/models/{model_save_name}")  # reuse root_dir instead of a hard-coded path

In [None]:
# model_albert_base = torch.load("/content/gdrive/MyDrive/Combots Triple Extraction and Normalization/models/classifier_141222.pt")

## Putting It All Together

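Below you can see the score each subword receives for every BIO label, shown separately for the subject, predicate and object heads. The highest score per subword is marked with brackets.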

In [None]:
inputs = 'SPEAKER1 enjoy watching american football but don\'t like to make homework <eos> what does Mike want to do? <eos> gaming, but SPEAKER1 hate cats <eos>'.split()
#inputs = 'What car do SPEAKER1 drive <eos> a big red truck <eos>'.split()

y_subj, y_pred, y_obj, subwords = model_albert_base.predict(inputs)

# show results
for arg, y in [('Subject', y_subj), ('Predicate', y_pred), ('Object', y_obj)]:
    print('\n', arg)
    print([str(num) for num in range(y.shape[0])])  # one column per BIO label
    for score, token in zip(y.T, subwords):
        score_str = '\t'.join(["[" + str(s)[:5] + "]" if s == max(score) else " " + str(round(s, 4))[:5] + " " for s in score])
        token_str = token.replace('▁', '')
        print(score_str, token_str)


 Subject
['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72', '73', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99', '100', '101', '102', '103', '104', '105', '106', '107', '108', '109', '110', '111', '112', '113', '114', '115', '116', '117', '118', '119', '120', '121', '122', '123', '124', '125', '126', '127', '128', '129', '130', '131', '132', '133', '134', '135', '136', '137', '138', '139', '140', '141', '142', '143', '144', '145', '146', '147', '148', '149', '150', '151', '152', '153', '154', '155', '156', 

In [None]:
y_subj, y_pred, y_obj, subwords = model_albert_base.predict(inputs)

print(' '.join(subwords).replace('▁', '') + '\n')
print('Subjects:  ', bio_tags_to_tokens(subwords, y_subj.T, bio_lookup, one_hot=True))
print('Predicates:', bio_tags_to_tokens(subwords, y_pred.T, bio_lookup, predicate=True, one_hot=True))
print('Objects:   ', bio_tags_to_tokens(subwords, y_obj.T, bio_lookup, one_hot=True))

[CLS] SPEAKER1 enjoy watching american football but don ' t like to make homework [SEP] what does mike want to do ? [SEP] gaming , but SPEAKER1 hate cats [SEP]

Subjects:   {'gaming', 'what does', 'hate cats', 'enjoy', 'watching american football', 'want do', 'like', 'homework', 'mike', 'make', 'SPEAKER1', "don'"}
Predicates: {'do ', 'like'}
Objects:    {'CLS', "don't", 'what does', 'SEP', 'enjoy', 'homework SEP', 'like to make', 'watching american football but', 'mike', 'want to do', 'gaming but', 'hate cats SEP', 'SPEAKER1'}


In [None]:
print(str(date.today()))

2022-12-14


In [None]:
import os, shutil

out_dir = os.path.join(root_dir, 'models', str(date.today()))
os.makedirs(out_dir, exist_ok=True)  # also creates 'models' if it does not exist yet

shutil.copy('argument_extraction_albert-base-v2', out_dir)


'/content/gdrive/MyDrive/Combots Triple Extraction and Normalization/models/2022-12-14/argument_extraction_albert-base-v2'

# Ranking the triples

Now we are able to extract the candidate arguments, but how do we combine them?

We compute all combinations of the extracted subjects, predicates and objects and train a model to distinguish triples that are entailed by the dialogue, with either positive or negative polarity, from those that are not.

For this, we extract a number of negative examples from possible triples, i.e. those combinations of subjects, predicates and objects that were not annotated.
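
The crossover idea can be sketched with the standard library: collect every subject, predicate and object seen in the annotated triples, form all combinations, and keep the ones that were not annotated as negative candidates. The example triples and the helper name `negative_candidates` are made up for illustration; the function in the next cell samples combinations at random rather than enumerating them.

```python
from itertools import product


def negative_candidates(true_triples):
    """Return all subject/predicate/object combinations that were not annotated."""
    subjs = sorted({s for s, _, _ in true_triples})
    preds = sorted({p for _, p, _ in true_triples})
    objs = sorted({o for _, _, o in true_triples})
    annotated = set(true_triples)
    return [t for t in product(subjs, preds, objs) if t not in annotated]


true_triples = [('SPEAKER1', 'like', 'dogs'), ('SPEAKER2', 'drive', 'a truck')]
for triple in negative_candidates(true_triples):
    print(triple)
```

Two annotated triples with distinct arguments give 2 x 2 x 2 = 8 combinations, of which 6 are unannotated and can serve as contrast examples.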

## Converting format

In [None]:
from collections import defaultdict
from copy import deepcopy


def extract_triples(annotation, neg_oversampling=7, contr_oversampling=0.7, ellipsis_oversampling=3):
    """ Extracts plain-text triples from an annotation file and samples 'negative' examples by
        crossover. By default, the function will over-extract triples with negative polarity and
        elliptical constructions to counter class imbalance.

        params:
        dict annotation:            loaded annotation file (see load_annotations)
        int neg_oversampling:       how much to over-sample triples with negative polarity
        float contr_oversampling:   how much to sample contrast/invalid triples relative to true triples
        int ellipsis_oversampling:  how much to over-sample elliptical triples
    """
    turns = annotation['tokens']
    triple_ids = [t[:4] for t in annotation['annotations']]

    arguments = defaultdict(list)
    triples = []
    labels = []

    # Oversampling of elliptical triples
    for triple in deepcopy(triple_ids):
        subj_obj_turns = set([i for i, _ in triple[0] + triple[2]])
        if len(subj_obj_turns) > 1:
            triple_ids += [triple] * int(ellipsis_oversampling)

    # Extract 'True' triples
    for subj, pred, obj, polar in triple_ids:

        subj = ' '.join(turns[i][j] for i, j in subj) if subj else ''
        pred = ' '.join(turns[i][j] for i, j in pred) if pred else ''
        obj = ' '.join(turns[i][j] for i, j in obj) if obj else ''

        if subj or pred or obj:

            if not polar:
                triples += [(subj, pred, obj)]
                labels += [1]
            else:
                triples += [(subj, pred, obj)] * neg_oversampling  # Oversampling negative polarities
                labels += [2] * neg_oversampling

            arguments['subjs'].append(subj)
            arguments['preds'].append(pred)
            arguments['objs'].append(obj)

    # Skip if the annotation file was blank
    if not triples:
        return [], [], []

    # Sample fake contrast examples (invalid extractions)
    n = int(len(triples) * contr_oversampling)
    for _ in range(50):  # at most 50 sampling attempts
        # Stop once n contrast examples have been created (also covers n == 0 from the start)
        if n <= 0:
            break

        s = random.choice(arguments['subjs'])
        p = random.choice(arguments['preds'])
        o = random.choice(arguments['objs'])

        # Ensure samples are new (and not actually valid!)
        if (s, p, o) not in triples and s and p and o:
            triples += [(s, p, o)]
            labels += [0]
            n -= 1

    return turns, triples, labels


In [None]:
tokens, triples, labels = [], [], []
for ann in annotations:
    ann_tokens, ann_triples, triple_labels = extract_triples(ann)
    triples.append(ann_triples)
    labels.append(triple_labels)
    tokens.append([t for ts in ann_tokens for t in ts + ['<eos>']])

j = random.choice(range(len(tokens)))
print('tokens: ', tokens[j])
print('triples:', triples[j])
print('labels: ', labels[j])

tokens:  ['are', 'SPEAKER2', 'a', 'hiker', 'too', '?', '<eos>', 'yes', 'SPEAKER2', 'hike', ',', 'what', 'is', 'skittles', '?', 'like', 'the', 'candy', '?', '<eos>', 'no', ',', 'SPEAKER1', "'ve", 'actually', 'never', 'eaten', 'candy', '.', 'SPEAKER1', 'mean', 'the', 'game', '.', '<eos>']
triples: [('SPEAKER2', 'are', 'a hiker'), ('SPEAKER2', 'hike', ''), ('SPEAKER1', 'like', 'the candy'), ('SPEAKER1', 'like', 'the candy'), ('SPEAKER1', 'like', 'the candy'), ('SPEAKER1', 'like', 'the candy'), ('SPEAKER1', 'like', 'the candy'), ('SPEAKER1', 'like', 'the candy'), ('SPEAKER1', 'like', 'the candy'), ('SPEAKER1', 'eaten', 'candy'), ('SPEAKER1', 'eaten', 'candy'), ('SPEAKER1', 'eaten', 'candy'), ('SPEAKER1', 'eaten', 'candy'), ('SPEAKER1', 'eaten', 'candy'), ('SPEAKER1', 'eaten', 'candy'), ('SPEAKER1', 'eaten', 'candy'), ('SPEAKER1', 'mean', 'the game'), ('SPEAKER1', 'like', 'the candy'), ('SPEAKER1', 'like', 'the candy'), ('SPEAKER1', 'like', 'the candy'), ('SPEAKER1', 'like', 'the candy'), (

In [None]:
print('Class (im)balance:')
print('not entailed  ', sum([np.sum(np.array(t) == 0) for t in labels]))
print('entailed (pos)', sum([np.sum(np.array(t) == 1) for t in labels]))
print('entailed (neg)', sum([np.sum(np.array(t) == 2) for t in labels]))

Class (im)balance:
not entailed   6414
entailed (pos) 6690
entailed (neg) 5978


## Fine-tuning ALBERT for Triple Candidate Scoring

In [None]:
class TripleScoring(torch.nn.Module):
    def __init__(self, base_model='albert-base-v2', path=None, max_len=80, sep='<eos>'):
        super().__init__()
        # Base model
        print('loading %s for triple scoring' % base_model)
        # Load base model
        self._model = AutoModel.from_pretrained(base_model)
        self._max_len = max_len
        self._base = base_model
        self._sep = sep

        # Load and extend tokenizer with SPEAKERS
        self._tokenizer = AutoTokenizer.from_pretrained(base_model)
        self._tokenizer.add_tokens(['SPEAKER1', 'SPEAKER2'], special_tokens=True)
        self._model.resize_token_embeddings(len(self._tokenizer))

        # SPO candidate scoring head
        hidden_size = AutoConfig.from_pretrained(base_model).hidden_size
        self._head = torch.nn.Linear(hidden_size, 3)
        self._relu = torch.nn.ReLU()
        self._softmax = torch.nn.Softmax(dim=-1)

        # GPU support
        self._device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
        self.to(self._device)

        # Load model / tokenizer if pretrained model is given
        if path:
            print('\t- Loading pretrained')
            model_path = glob.glob(path + '/candidate_scorer_' + base_model)[0]
            self.load_state_dict(torch.load(model_path, map_location=self._device))

    def forward(self, input_ids, speaker_ids, attn_mask):
        """ Computes the forward pass through the model
        """
        out = self._model(input_ids=input_ids, token_type_ids=speaker_ids, attention_mask=attn_mask)
        h = self._relu(out.last_hidden_state[:, 0])
        return self._softmax(self._head(h))

    def _retokenize_dialogue(self, tokens, speaker=1):
        # Tokenize each token individually (keeping track of subwords)
        f_input_ids = [self._tokenizer.cls_token_id]
        speaker_ids = [speaker]
        for turn in ' '.join(tokens).split(self._sep):
            token_ids = self._tokenizer.encode(turn, add_special_tokens=True)[1:]  # strip [CLS]
            f_input_ids += token_ids
            speaker_ids += [speaker] * len(token_ids)
            speaker = 1 - speaker

        return f_input_ids, speaker_ids

    def _retokenize_triple(self, triple):
        # Append triple
        f_input_ids = self._tokenizer.encode(' '.join(triple), add_special_tokens=False)
        speaker_ids = [0] * len(f_input_ids)
        return f_input_ids, speaker_ids

    def _add_padding(self, sequence, pad_token):
        # If sequence is too long, cut off end
        sequence = sequence[:self._max_len]

        # Pad remainder to max_len
        padding = self._max_len - len(sequence)
        new_sequence = sequence + [pad_token] * padding

        # Mask out [PAD] tokens
        attn_mask = [1] * len(sequence) + [0] * padding
        return new_sequence, attn_mask

    def fit(self, tokens, triples, labels, epochs=2, lr=1e-6):
        """ Fits the model to the annotations
        """
        X = []
        for token_seq, triple_lst, triple_labels in zip(tokens, triples, labels):

            # Tokenize dialogue
            dialog_input_ids, dialog_speakers = self._retokenize_dialogue(token_seq)

            for triple, label in zip(triple_lst, triple_labels):
                # Tokenize triple
                triple_input_ids, triple_speakers = self._retokenize_triple(triple)

                # Concatenate dialogue + [UNK] + triple
                input_ids = dialog_input_ids[:-1] + [self._tokenizer.unk_token_id] + triple_input_ids
                speakers = dialog_speakers[:-1] + [0] + triple_speakers

                # Pad sequence with [PAD] to max_len
                input_ids, _ = self._add_padding(input_ids, self._tokenizer.pad_token_id)
                speakers, attn_mask = self._add_padding(speakers, 0)

                # Push Tensor to GPU
                input_ids = torch.LongTensor([input_ids]).to(self._device)
                speakers = torch.LongTensor([speakers]).to(self._device)
                attn_mask = torch.FloatTensor([attn_mask]).to(self._device)
                label_ids = torch.LongTensor([label]).to(self._device)

                X.append((input_ids, speakers, attn_mask, label_ids))

        # Set up optimizer and objective
        optimizer = torch.optim.Adam(self.parameters(), lr=lr)
        criterion = torch.nn.CrossEntropyLoss()

        for epoch in range(epochs):
            random.shuffle(X)

            losses = []
            for input_ids, speaker_ids, attn_mask, y in tqdm(X):
                # Was the triple entailed? Positively? Negatively?
                y_hat = self(input_ids, speaker_ids, attn_mask)
                loss = criterion(y_hat, y)
                losses.append(loss.item())

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            print("mean loss =", np.mean(losses))

        # Save model to file
        torch.save(self.state_dict(), 'candidate_scorer_%s' % self._base)

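    def predict(self, tokens, triples):
        """ Scores candidate triples against a dialogue.

            returns: ndarray of shape |triples| x 3 with the probabilities for
                     (not entailed, entailed positive, entailed negative)
        """
        # Tokenize dialogue
        dialog_input_ids, dialog_speakers = self._retokenize_dialogue(tokens)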

        batch_input_ids = []
        batch_speakers = []
        batch_attn_mask = []

        for triple in triples:
            # Tokenize triple
            triple_input_ids, triple_speakers = self._retokenize_triple(triple)

            # Concatenate dialogue + [UNK] + triple
            # (note: fit() drops the last dialogue id via [:-1], whereas predict() keeps it)
            input_ids = dialog_input_ids + [self._tokenizer.unk_token_id] + triple_input_ids
            speakers = dialog_speakers + [0] + triple_speakers

            # Pad sequence with [PAD] to max_len
            input_ids, _ = self._add_padding(input_ids, self._tokenizer.pad_token_id)
            speakers, attn_mask = self._add_padding(speakers, 0)

            batch_input_ids.append(input_ids)
            batch_speakers.append(speakers)
            batch_attn_mask.append(attn_mask)

        # Push batches to GPU
        batch_input_ids = torch.LongTensor(batch_input_ids).to(self._device)
        batch_speakers = torch.LongTensor(batch_speakers).to(self._device)
        batch_attn_mask = torch.FloatTensor(batch_attn_mask).to(self._device)

        label = self(batch_input_ids, batch_speakers, batch_attn_mask)
        label = label.cpu().detach().numpy()
        return label
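
The padding step used by `_add_padding` can be examined on its own. The sketch below reproduces it as a standalone function on plain lists; the function name `add_padding` with an explicit `max_len` argument and the example ids are assumptions for illustration.

```python
def add_padding(sequence, pad_token, max_len):
    """Truncate or pad a list of ids to max_len and build the attention mask."""
    sequence = sequence[:max_len]
    padding = max_len - len(sequence)
    attn_mask = [1] * len(sequence) + [0] * padding
    return sequence + [pad_token] * padding, attn_mask


# Short input: padded on the right, padding is masked out
print(add_padding([5, 6, 7], pad_token=0, max_len=5))

# Long input: ids beyond max_len are cut off at the end
print(add_padding([5, 6, 7, 8, 9, 10], pad_token=0, max_len=5))
```

Because truncation removes ids from the end, a dialogue longer than `max_len` would lose part or all of the triple that is appended after it, so it may be worth checking sequence lengths before training.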

In [None]:
scorer = TripleScoring()
scorer.fit(tokens, triples, labels, epochs=7)

loading albert-base-v2 for triple scoring


100%|██████████| 19082/19082 [09:43<00:00, 32.69it/s]


mean loss = 0.8315123576386246


100%|██████████| 19082/19082 [09:39<00:00, 32.96it/s]


mean loss = 0.6663190892951917


100%|██████████| 19082/19082 [10:31<00:00, 30.22it/s]


mean loss = 0.629196528471462


100%|██████████| 19082/19082 [09:56<00:00, 32.00it/s]


mean loss = 0.6078200756335281


100%|██████████| 19082/19082 [09:43<00:00, 32.69it/s]


mean loss = 0.599275085726395


100%|██████████| 19082/19082 [09:37<00:00, 33.01it/s]


mean loss = 0.5935076415420091


100%|██████████| 19082/19082 [09:49<00:00, 32.35it/s]


mean loss = 0.5892000036623952


In [None]:
model_save_name = 'scorer_albert_abstr_081222.pt'

torch.save(scorer, f"{root_dir}/models/{model_save_name}")  # reuse root_dir instead of a hard-coded path

In [None]:
# inputs = 'staying here is fine though . SPEAKER1\'s two dogs keep me company <eos> SPEAKER2 do not love them ! What car do SPEAKER1 drive ? <eos> a toyota . but SPEAKER1 like nissans . <eos>'.split()
# triple_examples = [['SPEAKER1', 'drive', 'nissans'],
#                    ['SPEAKER1', 'like', 'nissans'], 
#                    ['SPEAKER2', 'like', 'nissans'], 
#                    ['SPEAKER2', 'love', 'two dogs'], 
#                    ['SPEAKER1', 'drive', 'a toyota']]

# inputs = '<eos> Do SPEAKER1 work in Amsterdam ? <eos> No , in London . <eos>'.split()
# triple_examples = [['SPEAKER1', 'work in', 'Amsterdam']]

inputs = 'SPEAKER1 adore unicorns but not photography <eos> What do SPEAKER1 like ? <eos> dogs and gaming, but not cats or elephants . <eos>'.split()
triple_examples = [['SPEAKER1', 'adore', 'unicorns'],
                   ['SPEAKER1', 'like', 'dogs'],
                   ['SPEAKER1', 'like', 'gaming'],
                   ['SPEAKER1', 'adore', 'photography'],
                   ['SPEAKER1', 'like', 'cats'],
                   ['SPEAKER1', 'like', 'elephants'],
                   ['SPEAKER1', 'adore', 'elephants'],
                   ['SPEAKER1', 'like', 'photography'],
                   ['SPEAKER1', 'like', 'unicorns']]

np.round(scorer.predict(inputs, triple_examples), 3)

array([[0.001, 0.001, 0.999],
       [0.015, 0.975, 0.01 ],
       [0.02 , 0.64 , 0.34 ],
       [0.001, 0.   , 0.999],
       [0.   , 0.   , 1.   ],
       [0.001, 0.   , 0.999],
       [0.999, 0.001, 0.   ],
       [0.002, 0.   , 0.998],
       [0.217, 0.001, 0.782]], dtype=float32)
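
Each row of the output above holds three class probabilities. Following the training labels (0 = not entailed, 1 = entailed with positive polarity, 2 = entailed with negative polarity), a row can be turned into a readable decision as sketched below. The helper name `interpret` and the `LABELS` list are hypothetical and only serve this illustration.

```python
LABELS = ['not entailed', 'entailed (pos)', 'entailed (neg)']


def interpret(scores):
    """Map one row of class probabilities to its most likely label."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return LABELS[best], scores[best]


# Two rows copied from the output above
print(interpret([0.015, 0.975, 0.01]))   # ('SPEAKER1', 'like', 'dogs')
print(interpret([0.999, 0.001, 0.0]))    # ('SPEAKER1', 'adore', 'elephants')
```

Keeping the winning probability next to the label makes it easy to apply a confidence threshold later on.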

We move the resulting model to Drive:

In [None]:
import os, shutil

out_dir = os.path.join(root_dir, 'models', str(date.today()))
os.makedirs(out_dir, exist_ok=True)  # also creates 'models' if it does not exist yet

shutil.copy('candidate_scorer_albert-base-v2', out_dir)

'/content/gdrive/MyDrive/Combots Triple Extraction and Normalization/models/2022-12-14/candidate_scorer_albert-base-v2'

# BERT

## Setting up BERT for Argument Extraction 


In [None]:
%%capture 
!pip install transformers

import torch
from transformers import AutoTokenizer, AutoModel, AutoConfig
from tqdm import tqdm
import numpy as np
import random
from datetime import date

In [None]:
class BertArgumentExtraction(torch.nn.Module):
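    # NOTE: output_dim must equal the total number of BIO tags (including 0, 1 and 2); adjust the default to your label set
    def __init__(self, base_model='bert-base-uncased', path=root_dir, output_dim=179, sep='<eos>'):
        """ Init model with multi-span extraction heads for SPO arguments.

            params:
            str base_model: Transformer architecture to use (default: bert-base-uncased)
            str path:       Path to pretrained model
            int output_dim: Number of BIO labels predicted by each head
            str sep:        Token that separates dialogue turns (default: <eos>)
        """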
        super().__init__()
        print('loading %s for argument extraction' % base_model)
        self._model = AutoModel.from_pretrained(base_model)
        self._base = base_model
        self._sep = sep

        # Load and extend tokenizer with special SPEAKER tokens
        self._tokenizer = AutoTokenizer.from_pretrained(base_model)
        self._tokenizer.add_tokens(['SPEAKER1', 'SPEAKER2'], special_tokens=True)
        self._model.resize_token_embeddings(len(self._tokenizer))

        # Add token classification heads
        hidden_size = AutoConfig.from_pretrained(base_model).hidden_size
        self._subj_head = torch.nn.Linear(hidden_size, output_dim)
        self._pred_head = torch.nn.Linear(hidden_size, output_dim)
        self._obj_head = torch.nn.Linear(hidden_size, output_dim)
        self._output_dim = output_dim

        self._relu = torch.nn.ReLU()
        self._softmax = torch.nn.Softmax(dim=-1)

        # Set GPU if available
        self._device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
        self.to(self._device)

        # Load model / tokenizer if pretrained model is given
        if path:
            print('\t- Loading pretrained')
            model_path = glob.glob(path + '/argument_extraction_' + base_model)[0]
            self.load_state_dict(torch.load(model_path, map_location=self._device))

    def forward(self, input_ids, speaker_ids):
        """ Computes BIO label probabilities for each token
        """
        # Feed dialog through transformer
        y = self._model(input_ids=input_ids, token_type_ids=speaker_ids)
        h = self._relu(y.last_hidden_state)

        # Predict spans
        y_subj = self._softmax(self._subj_head(h))
        y_pred = self._softmax(self._pred_head(h))
        y_obj_ = self._softmax(self._obj_head(h))

        # Permute output as tensor of shape (N, |C|, seq_len)
        y_subj = y_subj.permute(0, 2, 1)
        y_pred = y_pred.permute(0, 2, 1)
        y_obj_ = y_obj_.permute(0, 2, 1)
        return y_subj, y_pred, y_obj_

    def _retokenize_tokens(self, tokens):
        # Tokenize each token individually (keeping track of subwords)
        input_ids = [[self._tokenizer.cls_token_id]]
        for t in tokens:
            if t != self._sep:
                tok = self._tokenizer(t, add_special_tokens=False)
                input_ids.append(tok['input_ids'])
            else:
                input_ids.append([self._tokenizer.sep_token_id])

        # Flatten input_ids
        f_input_ids = torch.LongTensor([[i for ids in input_ids for i in ids]]).to(self._device)

        # Determine how often we need to repeat the labels
        repeats = [len(ids) for ids in input_ids]

        # Set speaker IDs
        speaker_ids = [0] + [tokens[:i + 1].count(self._sep) % 2 for i in range(len(tokens))][:-1]  # TODO: make pretty
        speaker_ids = self._repeat_speaker_ids(speaker_ids, repeats)

        return f_input_ids, speaker_ids, repeats

    def _repeat_speaker_ids(self, speaker_ids, repeats):
        """ Repeats speaker IDs for tokens that were split into multiple subwords.
        """
        rep_speaker_ids = np.repeat([0] + list(speaker_ids), repeats=repeats)
        return torch.LongTensor([rep_speaker_ids]).to(self._device)

    def _repeat_labels(self, labels, repeats):
        """ Repeats BIO labels for tokens that were split into multiple subwords.
            Ensures B-labeled tokens are repeated as B-I-I etc.
        """
        # Repeat each label by the number of subwords per token
        rep_labels = []
        for label, rep in zip([0] + list(labels), repeats):
            # Outside
            if label == 0:
                rep_labels += [label] * rep
            # Beginning + Inside
            elif label % 2 == 1:    # odd labels are B-tags
                rep_labels += [label] + ([label+1] * (rep - 1))  # If label = B -> B-I-I-I...
            else:
                rep_labels += [label] + ([label] * (rep - 1))  # If label = I, repeat it unchanged
        return torch.LongTensor([rep_labels]).to(self._device)

    def fit(self, tokens, labels, epochs=2, lr=1e-5, weight=800):
        """ Fits the model to the annotations
        """
        # Re-tokenize to obtain input_ids and associated labels
        X = []
        for token_seq, (subj_labels, pred_labels, _obj_labels) in zip(tokens, labels):
            input_ids, speaker_ids, repeats = self._retokenize_tokens(token_seq)
            subj_labels = self._repeat_labels(subj_labels, repeats)  # repeat when split into subwords
            pred_labels = self._repeat_labels(pred_labels, repeats)
            _obj_labels = self._repeat_labels(_obj_labels, repeats)
            X.append((input_ids, speaker_ids, subj_labels, pred_labels, _obj_labels))

#             X.append((input_ids, speaker_ids, pred_labels))

        # Set up optimizer
        optim = torch.optim.Adam(self.parameters(), lr=lr)

        # Higher weight for B- and I-tags to account for class imbalance
        class_weights = torch.Tensor([1] + [weight] * (self._output_dim - 1)).to(self._device)
        criterion = torch.nn.CrossEntropyLoss(weight=class_weights)

        print('Training!')
        for epoch in range(epochs):
            losses = []
            random.shuffle(X)
            for input_ids, speaker_ids, subj_y, pred_y, obj_y in tqdm(X):
#             for input_ids, speaker_ids, pred_y in tqdm(X):
                # Forward pass
                subj_y_hat, pred_y_hat, obj_y_hat = self.forward(input_ids, speaker_ids)
#                 pred_y_hat = self.forward(input_ids, speaker_ids)
                # Compute loss
                loss = criterion(subj_y_hat, subj_y)
                loss += criterion(pred_y_hat, pred_y)
                loss += criterion(obj_y_hat, obj_y)
                losses.append(loss.item())

                optim.zero_grad()
                loss.backward()
                optim.step()

            print("mean loss =", np.mean(losses))

        # Save model to file
        torch.save(self.state_dict(), 'argument_extraction_%s' % self._base)

    def predict(self, token_seq):
        """ Predicts per-subword BIO-tag scores; returns (preds, subwords, subjs, objs) """
        # Retokenize token sequence
        input_ids, speaker_ids, _ = self._retokenize_tokens(token_seq)

        # Invert tokenization for viewing
        subwords = self._tokenizer.convert_ids_to_tokens(input_ids[0])

        # Forward-pass
        predictions = self.forward(input_ids, speaker_ids)
        subjs = predictions[0].cpu().detach().numpy()[0]
        preds = predictions[1].cpu().detach().numpy()[0]
        objs = predictions[2].cpu().detach().numpy()[0]

        return  preds, subwords, subjs, objs
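
The label alignment done by `_repeat_labels` can be tried out in isolation. The sketch below assumes the label encoding implied by the code above (0 = outside, an odd number = B-tag, the following even number = the matching I-tag) and works on plain Python lists instead of tensors. `repeat_bio_labels` is a hypothetical stand-alone name, and the example sentence and its subword counts are made up.

```python
def repeat_bio_labels(labels, repeats):
    """Expand word-level BIO labels to subword level.

    Assumed encoding: 0 = O, odd = B-tag, the next even number = matching I-tag.
    `repeats` holds the number of subwords per position, starting with [CLS].
    """
    expanded = []
    # Position 0 is [CLS], which is always labeled as outside (0)
    for label, rep in zip([0] + list(labels), repeats):
        if label % 2 == 1:
            # B-tag: the first subword keeps B, the remaining subwords become I
            expanded += [label] + [label + 1] * (rep - 1)
        else:
            # O- and I-tags are simply repeated
            expanded += [label] * rep
    return expanded


# Hypothetical example: "SPEAKER1 enjoys snowboarding", where the tokenizer
# is assumed to split "snowboarding" into three subwords
word_labels = [0, 0, 1]          # only "snowboarding" is tagged (B)
subword_counts = [1, 1, 1, 3]    # [CLS], SPEAKER1, enjoys, snowboarding
print(repeat_bio_labels(word_labels, subword_counts))
```

For this input the B-tag is kept on the first subword only, so the printed list is `[0, 0, 0, 1, 2, 2]`.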

In [None]:
bert_model = BertArgumentExtraction()
# bert_model.fit(bert_tokens, bert_labels, epochs=2)  # skip fitting when the weights are loaded from a saved model

loading albert-base-v2 for argument extraction
30002
Training!


100%|██████████| 1117/1117 [00:45<00:00, 24.48it/s]


mean loss = 13.635485337478515


100%|██████████| 1117/1117 [00:42<00:00, 26.51it/s]


mean loss = 13.152230007475577


100%|██████████| 1117/1117 [00:34<00:00, 32.19it/s]


mean loss = 13.0836987089876


100%|██████████| 1117/1117 [00:36<00:00, 30.49it/s]


mean loss = 13.046468474337749


100%|██████████| 1117/1117 [00:35<00:00, 31.68it/s]


mean loss = 13.03992255713754


100%|██████████| 1117/1117 [00:35<00:00, 31.66it/s]


mean loss = 13.028891769176951


100%|██████████| 1117/1117 [00:35<00:00, 31.86it/s]


mean loss = 13.014455479388804


## Putting It All Together

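The cell below prints, for every subword, the score the model assigns to each BIO tag, separately for the subject, predicate and object (SPO) arguments. The highest score per subword is shown in brackets.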

In [None]:
inputs = 'SPEAKER1 enjoy watching american football but don\'t like to make homework <eos> what does Mike want to do? <eos> gaming, but SPEAKER1 hate cats <eos>'.split()
#inputs = 'What car do SPEAKER1 drive <eos> a big red truck <eos>'.split()

y_pred, subwords, y_subj, y_obj = bert_model.predict(inputs)

# Show the per-subword tag scores for each argument type
for arg, y in [('Subject', y_subj), ('Predicate', y_pred), ('Object', y_obj)]:
    print('\n', arg)
    print('\t'.join(str(num) for num in range(y.shape[0])))  # header: one column per tag index
    for score, token in zip(y.T, subwords):
        score_str = '\t'.join(["[" + str(s)[:5] + "]" if s == max(score) else " " + str(round(s, 4))[:5] + " " for s in score])
        token_str = token.replace('▁', '')
        print(score_str, token_str)


 Subject
['0', '1', '2', '3', '4', '5', ..., '155', '156', ...

In [None]:
y_subj, y_pred, y_obj, subwords = model_albert_base.predict(inputs)

print(' '.join(subwords).replace('▁', '') + '\n')
print('Subjects:  ', bio_tags_to_tokens(subwords, y_subj.T, bio_lookup, one_hot=True))
print('Predicates:', bio_tags_to_tokens(subwords, y_pred.T, bio_lookup, predicate=True, one_hot=True))
print('Objects:   ', bio_tags_to_tokens(subwords, y_obj.T, bio_lookup, one_hot=True))

[CLS] SPEAKER1 enjoy watching american football but don ' t like to make homework [SEP] what does mike want to do ? [SEP] gaming , but SPEAKER1 hate cats [SEP]

Subjects:   {'gaming', 'what does', 'hate cats', 'enjoy', 'watching american football', 'want do', 'like', 'homework', 'mike', 'make', 'SPEAKER1', "don'"}
Predicates: {'do ', 'like'}
Objects:    {'CLS', "don't", 'what does', 'SEP', 'enjoy', 'homework SEP', 'like to make', 'watching american football but', 'mike', 'want to do', 'gaming but', 'hate cats SEP', 'SPEAKER1'}


In [None]:
from datetime import date

print(str(date.today()))

2022-12-14


In [None]:
import os, shutil

out_dir = root_dir + '/models/' + str(date.today())
os.makedirs(out_dir, exist_ok=True)  # also creates the 'models' folder if it is missing

shutil.copy('argument_extraction_albert-base-v2', out_dir)


'/content/gdrive/MyDrive/Combots Triple Extraction and Normalization/models/2022-12-14/argument_extraction_albert-base-v2'

In [None]:
# Install "pytorch-transformers", the legacy name of the Huggingface "transformers" package
! pip install pytorch-transformers


# Ranking the triples

We can now extract the candidate arguments, but how do we combine them?

We compute all combinations of the extracted subjects, predicates and objects and train a model to decide which of these candidate triples are entailed by the dialogue. The code in this section uses three labels: 0 (not entailed), 1 (entailed, positive polarity) and 2 (entailed, negative polarity).

For this, we extract a number of negative examples from possible triples, i.e. those combinations of subjects, predicates and objects that were not annotated.
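
As a minimal sketch of this idea, the snippet below enumerates every subject-predicate-object combination and keeps the ones that were not annotated as negative examples. The arguments and annotated triples are toy data, and `candidate_triples` and `negative_examples` are hypothetical helper names. The `extract_triples` function in this notebook samples such combinations at random instead of enumerating them all.

```python
from itertools import product


def candidate_triples(subjects, predicates, objects):
    """All (subject, predicate, object) combinations of the extracted arguments."""
    return list(product(subjects, predicates, objects))


def negative_examples(candidates, annotated):
    """Candidates that were never annotated serve as 'not entailed' examples."""
    annotated = set(annotated)
    return [triple for triple in candidates if triple not in annotated]


# Toy arguments and annotations (made up for illustration)
subjects = ['SPEAKER1']
predicates = ['like', 'hate']
objects = ['dogs', 'cats']
annotated = [('SPEAKER1', 'like', 'dogs'), ('SPEAKER1', 'hate', 'cats')]

candidates = candidate_triples(subjects, predicates, objects)
negatives = negative_examples(candidates, annotated)
print(len(candidates), 'candidates')
print('negatives:', negatives)
```

With one subject, two predicates and two objects there are four candidates, of which `('SPEAKER1', 'like', 'cats')` and `('SPEAKER1', 'hate', 'dogs')` remain as negatives.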

## Converting format

In [None]:
from collections import defaultdict
from copy import deepcopy


def extract_triples(annotation, neg_oversampling=7, contr_oversampling=0.7, ellipsis_oversampling=3):
    """ Extracts plain-text triples from an annotation file and samples 'negative' examples by
        crossover. By default, the function will over-extract triples with negative polarity and
        elliptical constructions to counter class imbalance.

        params:
        dict annotation:            loaded annotation file (see load_annotations)
        int neg_oversampling:       how much to over-sample triples with negative polarity
        float contr_oversampling:   how much to sample contrast/invalid triples relative to true triples
        int ellipsis_oversampling:  how much to over-sample elliptical triples
    """
    turns = annotation['tokens']
    triple_ids = [t[:4] for t in annotation['annotations']]

    arguments = defaultdict(list)
    triples = []
    labels = []

    # Oversampling of elliptical triples
    for triple in deepcopy(triple_ids):
        subj_obj_turns = set([i for i, _ in triple[0] + triple[2]])
        if len(subj_obj_turns) > 1:
            triple_ids += [triple] * int(ellipsis_oversampling)

    # Extract 'True' triples
    for subj, pred, obj, polar in triple_ids:

        subj = ' '.join(turns[i][j] for i, j in subj) if subj else ''
        pred = ' '.join(turns[i][j] for i, j in pred) if pred else ''
        obj = ' '.join(turns[i][j] for i, j in obj) if obj else ''

        if subj or pred or obj:

            if not polar:
                triples += [(subj, pred, obj)]
                labels += [1]
            else:
                triples += [(subj, pred, obj)] * neg_oversampling  # Oversampling negative polarities
                labels += [2] * neg_oversampling

            arguments['subjs'].append(subj)
            arguments['preds'].append(pred)
            arguments['objs'].append(obj)

    # Skip if the annotation file was blank
    if not triples:
        return [], [], []

    # Sample fake contrast examples (invalid extractions)
    n = int(len(triples) * contr_oversampling)
    for _ in range(50):  # give up after 50 attempts
        # Stop once the requested number of contrast examples has been sampled
        if n <= 0:
            break

        s = random.choice(arguments['subjs'])
        p = random.choice(arguments['preds'])
        o = random.choice(arguments['objs'])

        # Ensure samples are new (and not actually valid!)
        if (s, p, o) not in triples and s and p and o:
            triples += [(s, p, o)]
            labels += [0]
            n -= 1

    return turns, triples, labels
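
To make the expected input concrete, the sketch below builds a toy example in the layout that `extract_triples` reads: `tokens` is a list of turns, and subject, predicate and object are each given as a list of `(turn, token)` index pairs. The dialogue and indices are invented, and `span_to_text` and `is_elliptical` are hypothetical helpers that mirror the corresponding lines of the function above.

```python
def span_to_text(turns, span):
    """Join the tokens addressed by a list of (turn, token) index pairs."""
    return ' '.join(turns[i][j] for i, j in span) if span else ''


def is_elliptical(subj, obj):
    """A triple counts as elliptical if subject and object come from different turns."""
    return len({i for i, _ in subj + obj}) > 1


# Invented two-turn dialogue
turns = [['what', 'do', 'SPEAKER2', 'like', '?'],
         ['dogs', 'and', 'gaming']]

# Subject and predicate come from turn 0, the object from turn 1
subj, pred, obj = [[0, 2]], [[0, 3]], [[1, 0]]

triple = tuple(span_to_text(turns, span) for span in (subj, pred, obj))
print(triple)
print('elliptical:', is_elliptical(subj, obj))
```

Here the triple `('SPEAKER2', 'like', 'dogs')` spans two turns, so it would be over-sampled by the `ellipsis_oversampling` factor.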


In [None]:
tokens, triples, labels = [], [], []
for ann in annotations:
    ann_tokens, ann_triples, triple_labels = extract_triples(ann)
    triples.append(ann_triples)
    labels.append(triple_labels)
    tokens.append([t for ts in ann_tokens for t in ts + ['<eos>']])

j = random.choice(range(len(tokens)))
print('tokens: ', tokens[j])
print('triples:', triples[j])
print('labels: ', labels[j])

tokens:  ['are', 'SPEAKER2', 'a', 'hiker', 'too', '?', '<eos>', 'yes', 'SPEAKER2', 'hike', ',', 'what', 'is', 'skittles', '?', 'like', 'the', 'candy', '?', '<eos>', 'no', ',', 'SPEAKER1', "'ve", 'actually', 'never', 'eaten', 'candy', '.', 'SPEAKER1', 'mean', 'the', 'game', '.', '<eos>']
triples: [('SPEAKER2', 'are', 'a hiker'), ('SPEAKER2', 'hike', ''), ('SPEAKER1', 'like', 'the candy'), ('SPEAKER1', 'like', 'the candy'), ('SPEAKER1', 'like', 'the candy'), ('SPEAKER1', 'like', 'the candy'), ('SPEAKER1', 'like', 'the candy'), ('SPEAKER1', 'like', 'the candy'), ('SPEAKER1', 'like', 'the candy'), ('SPEAKER1', 'eaten', 'candy'), ('SPEAKER1', 'eaten', 'candy'), ('SPEAKER1', 'eaten', 'candy'), ('SPEAKER1', 'eaten', 'candy'), ('SPEAKER1', 'eaten', 'candy'), ('SPEAKER1', 'eaten', 'candy'), ('SPEAKER1', 'eaten', 'candy'), ('SPEAKER1', 'mean', 'the game'), ('SPEAKER1', 'like', 'the candy'), ('SPEAKER1', 'like', 'the candy'), ('SPEAKER1', 'like', 'the candy'), ('SPEAKER1', 'like', 'the candy'), (

In [None]:
print('Class (im)balance:')
print('not entailed  ', sum([np.sum(np.array(t) == 0) for t in labels]))
print('entailed (pos)', sum([np.sum(np.array(t) == 1) for t in labels]))
print('entailed (neg)', sum([np.sum(np.array(t) == 2) for t in labels]))

Class (im)balance:
not entailed   6414
entailed (pos) 6690
entailed (neg) 5978


## Fine-tuning BERT for Triple Candidate Scoring

In [None]:
class BertTripleScoring(torch.nn.Module):
    def __init__(self, base_model='bert-base-uncased', path=root_dir, max_len=80, sep='<eos>'):
        super().__init__()
        # Base model
        print('loading %s for triple scoring' % base_model)
        # Load base model
        self._model = AutoModel.from_pretrained(base_model)
        self._max_len = max_len
        self._base = base_model
        self._sep = sep

        # Load and extend tokenizer with SPEAKERS
        self._tokenizer = AutoTokenizer.from_pretrained(base_model)
        self._tokenizer.add_tokens(['SPEAKER1', 'SPEAKER2'], special_tokens=True)
        self._model.resize_token_embeddings(len(self._tokenizer))

        # SPO candidate scoring head
        hidden_size = AutoConfig.from_pretrained(base_model).hidden_size
        self._head = torch.nn.Linear(hidden_size, 3)
        self._relu = torch.nn.ReLU()
        self._softmax = torch.nn.Softmax(dim=-1)

        # GPU support
        self._device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
        self.to(self._device)

        # Load fine-tuned weights if a path to a pretrained model is given
        if path:
            print('\t- Loading pretrained')
            # Google Drive does not seem to store files without an extension, hence '.pt'
            model_path = path + '/scorer_' + base_model + '.pt'
            self.load_state_dict(torch.load(model_path, map_location=self._device))

    def forward(self, input_ids, speaker_ids, attn_mask):
        """ Computes the forward pass through the model
        """
        out = self._model(input_ids=input_ids, token_type_ids=speaker_ids, attention_mask=attn_mask)
        h = self._relu(out.last_hidden_state[:, 0])
        return self._softmax(self._head(h))

    def _retokenize_dialogue(self, tokens, speaker=1):
        # Tokenize each token individually (keeping track of subwords)
        f_input_ids = [self._tokenizer.cls_token_id]
        speaker_ids = [speaker]
        for turn in ' '.join(tokens).split(self._sep):
            token_ids = self._tokenizer.encode(turn, add_special_tokens=True)[1:]  # strip [CLS]
            f_input_ids += token_ids
            speaker_ids += [speaker] * len(token_ids)
            speaker = 1 - speaker

        return f_input_ids, speaker_ids

    def _retokenize_triple(self, triple):
        # Append triple
        f_input_ids = self._tokenizer.encode(' '.join(triple), add_special_tokens=False)
        speaker_ids = [0] * len(f_input_ids)
        return f_input_ids, speaker_ids

    def _add_padding(self, sequence, pad_token):
        # If sequence is too long, cut off end
        sequence = sequence[:self._max_len]

        # Pad remainder to max_len
        padding = self._max_len - len(sequence)
        new_sequence = sequence + [pad_token] * padding

        # Mask out [PAD] tokens
        attn_mask = [1] * len(sequence) + [0] * padding
        return new_sequence, attn_mask

    def fit(self, tokens, triples, labels, epochs=2, lr=1e-6):
        """ Fits the model to the annotations
        """
        X = []
        # Use a separate loop variable so the `tokens` argument is not shadowed
        for dialogue_tokens, triple_lst, triple_labels in zip(tokens, triples, labels):

            # Tokenize dialogue
            dialog_input_ids, dialog_speakers = self._retokenize_dialogue(dialogue_tokens)

            for triple, label in zip(triple_lst, triple_labels):
                # Tokenize triple
                triple_input_ids, triple_speakers = self._retokenize_triple(triple)

                # Concatenate dialogue + [UNK] + triple
                input_ids = dialog_input_ids[:-1] + [self._tokenizer.unk_token_id] + triple_input_ids
                speakers = dialog_speakers[:-1] + [0] + triple_speakers

                # Pad sequence with [PAD] to max_len
                input_ids, _ = self._add_padding(input_ids, self._tokenizer.pad_token_id)
                speakers, attn_mask = self._add_padding(speakers, 0)

                # Push Tensor to GPU
                input_ids = torch.LongTensor([input_ids]).to(self._device)
                speakers = torch.LongTensor([speakers]).to(self._device)
                attn_mask = torch.FloatTensor([attn_mask]).to(self._device)
                label_ids = torch.LongTensor([label]).to(self._device)

                X.append((input_ids, speakers, attn_mask, label_ids))

        # Set up optimizer and objective. forward() already returns softmax
        # probabilities, so the loss is NLLLoss on their logarithm (CrossEntropyLoss
        # expects raw logits and would apply a second softmax).
        optimizer = torch.optim.Adam(self.parameters(), lr=lr)
        criterion = torch.nn.NLLLoss()

        for epoch in range(epochs):
            random.shuffle(X)

            losses = []
            for input_ids, speaker_ids, attn_mask, y in tqdm(X):
                # Was the triple entailed? Positively? Negatively?
                y_hat = self(input_ids, speaker_ids, attn_mask)
                loss = criterion(torch.log(y_hat + 1e-8), y)
                losses.append(loss.item())

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            print("mean loss =", np.mean(losses))

        # Save model to file
        torch.save(self.state_dict(), 'candidate_scorer_%s.pt' % self._base) #if saving to gdrive seems to need file extension

    def predict(self, tokens, triples):
        # Tokenize dialogue
        dialog_input_ids, dialog_speakers = self._retokenize_dialogue(tokens)

        batch_input_ids = []
        batch_speakers = []
        batch_attn_mask = []

        for triple in triples:
            # Tokenize triple
            triple_input_ids, triple_speakers = self._retokenize_triple(triple)

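            # Concatenate dialogue + [UNK] + triple (same layout as in fit(), which
            # drops the last dialogue token before appending the triple)
            input_ids = dialog_input_ids[:-1] + [self._tokenizer.unk_token_id] + triple_input_ids
            speakers = dialog_speakers[:-1] + [0] + triple_speakers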

            # Pad sequence with [PAD] to max_len
            input_ids, _ = self._add_padding(input_ids, self._tokenizer.pad_token_id)
            speakers, attn_mask = self._add_padding(speakers, 0)

            batch_input_ids.append(input_ids)
            batch_speakers.append(speakers)
            batch_attn_mask.append(attn_mask)

        # Push batches to GPU
        batch_input_ids = torch.LongTensor(batch_input_ids).to(self._device)
        batch_speakers = torch.LongTensor(batch_speakers).to(self._device)
        batch_attn_mask = torch.FloatTensor(batch_attn_mask).to(self._device)

        label = self(batch_input_ids, batch_speakers, batch_attn_mask)
        label = label.cpu().detach().numpy()
        return label
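
The input layout used by `fit` and `predict` can be illustrated without loading a model. The sketch below follows the logic of `_add_padding` and concatenates a dialogue, a separator ID and a triple. All token IDs are placeholders rather than real vocabulary entries, and `pad_to_length` is a hypothetical stand-alone name.

```python
def pad_to_length(sequence, pad_token, max_len):
    """Truncate or pad a sequence to max_len and build the matching attention mask."""
    sequence = sequence[:max_len]
    padding = max_len - len(sequence)
    attn_mask = [1] * len(sequence) + [0] * padding  # 0 masks out the padding
    return sequence + [pad_token] * padding, attn_mask


# Placeholder IDs (assumptions for illustration only)
CLS, SEP, UNK, PAD = 2, 3, 1, 0
dialogue_ids = [CLS, 11, 12, 13, SEP]
triple_ids = [11, 12, 14]

# Dialogue and triple are joined by the [UNK] ID, as in the class above
input_ids = dialogue_ids + [UNK] + triple_ids
padded_ids, attn_mask = pad_to_length(input_ids, PAD, max_len=12)
print(padded_ids)
print(attn_mask)
```

The nine real positions get a mask value of 1 and the three padded positions a value of 0. Sequences longer than `max_len` are cut off at the end, which means a long dialogue can push the triple out of the input entirely.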

In [None]:
bert_scorer = BertTripleScoring(base_model='albert-base-v2')
# bert_scorer.fit(tokens, triples, labels, epochs=7)  # don't train if there is a model to load

loading albert-base-v2 for triple scoring


100%|██████████| 19082/19082 [09:43<00:00, 32.69it/s]


mean loss = 0.8315123576386246


100%|██████████| 19082/19082 [09:39<00:00, 32.96it/s]


mean loss = 0.6663190892951917


100%|██████████| 19082/19082 [10:31<00:00, 30.22it/s]


mean loss = 0.629196528471462


100%|██████████| 19082/19082 [09:56<00:00, 32.00it/s]


mean loss = 0.6078200756335281


100%|██████████| 19082/19082 [09:43<00:00, 32.69it/s]


mean loss = 0.599275085726395


100%|██████████| 19082/19082 [09:37<00:00, 33.01it/s]


mean loss = 0.5935076415420091


100%|██████████| 19082/19082 [09:49<00:00, 32.35it/s]


mean loss = 0.5892000036623952


In [None]:
inputs = 'SPEAKER1 adore unicorns but not photography <eos> What do SPEAKER1 like ? <eos> dogs and gaming, but not cats or elephants . <eos>'.split()
triple_examples = [['SPEAKER1', 'adore', 'unicorns'],
                   ['SPEAKER1', 'like', 'dogs'],
                   ['SPEAKER1', 'like', 'gaming'],
                   ['SPEAKER1', 'adore', 'photography'],
                   ['SPEAKER1', 'like', 'cats'],
                   ['SPEAKER1', 'like', 'elephants'],
                   ['SPEAKER1', 'adore', 'elephants'],
                   ['SPEAKER1', 'like', 'photography'],
                   ['SPEAKER1', 'like', 'unicorns']]

np.round(bert_scorer.predict(inputs, triple_examples), 3)

array([[0.001, 0.001, 0.999],
       [0.015, 0.975, 0.01 ],
       [0.02 , 0.64 , 0.34 ],
       [0.001, 0.   , 0.999],
       [0.   , 0.   , 1.   ],
       [0.001, 0.   , 0.999],
       [0.999, 0.001, 0.   ],
       [0.002, 0.   , 0.998],
       [0.217, 0.001, 0.782]], dtype=float32)
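
Each row of this output holds the class probabilities for one candidate triple. Assuming the column order matches the labels assigned in `extract_triples` (0 = not entailed, 1 = entailed with positive polarity, 2 = entailed with negative polarity), the most likely class can be read off with an argmax. The sketch below reuses three rows of the output above; `LABELS` and `decode_scores` are hypothetical names.

```python
LABELS = ['not entailed', 'entailed (pos)', 'entailed (neg)']


def decode_scores(triples, scores):
    """Pair each triple with its most probable class and that class's score."""
    decoded = []
    for triple, row in zip(triples, scores):
        best = max(range(len(row)), key=lambda k: row[k])  # argmax over classes
        decoded.append((tuple(triple), LABELS[best], row[best]))
    return decoded


# Three of the candidate triples above with their rounded scores
triples = [['SPEAKER1', 'like', 'dogs'],
           ['SPEAKER1', 'like', 'cats'],
           ['SPEAKER1', 'adore', 'elephants']]
scores = [[0.015, 0.975, 0.01],
          [0.0, 0.0, 1.0],
          [0.999, 0.001, 0.0]]

for triple, label, score in decode_scores(triples, scores):
    print(triple, label, score)
```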

We move the resulting model to Drive:

In [None]:
import os, shutil

out_dir = root_dir + '/models/' + str(date.today())
os.makedirs(out_dir, exist_ok=True)  # also creates the 'models' folder if it is missing

# fit() saves the weights with a '.pt' extension
shutil.copy('candidate_scorer_albert-base-v2.pt', out_dir)

'/content/gdrive/MyDrive/Combots Triple Extraction and Normalization/models/2022-12-14/candidate_scorer_albert-base-v2'