## Module submission header
### Submission preparation instructions 
_Completion of this header is mandatory, subject to a 2-point deduction to the assignment._ Only add plain text in the designated areas, i.e., replacing the relevant 'NA's. You must fill out all group member Names and Drexel email addresses in the below markdown list, under header __Module submission group__. It is required to fill out descriptive notes pertaining to any tutoring support received in the completion of this submission under the __Additional submission comments__ section at the bottom of the header. If no tutoring support was received, leave NA in place. You may as well list other optional comments pertaining to the submission at bottom. _Any distruption of this header's formatting will make your group liable to the 2-point deduction._

### Module submission group
- Group member 1
    - Name: Xi Chen
    - Email: xc98@drexel.edu
- Group member 2
    - Name: Tai Nguyen
    - Email: tdn47@drexel.edu
- Group member 3
    - Name: Tien Nguyen
    - Email: thn44@drexel.edu
- Group member 4
    - Name: Raymond Yung
    - Email: raymond.yung@drexel.edu

### Additional submission comments
- Tutoring support received: NA
- Other (other): NA

# DSCI 691: Natural language processing with deep learning <br> Assignment 5: Open Information Extraction (OIE)

## Overview
So what can RNN's do besides language modeling? Well, as sequence-based models, they're especially adept at sequential tasks, e.g., such as tagging. This includes NER, which we saw many variants of in __Chapter 2__, including that defined in the paper: [Question-Answer Driven Semantic Role Labeling: Using Natural Language to Annotate Natural Language](https://homes.cs.washington.edu/~lsz/papers/hlz-emnlp15.pdf)

However, that task has since been updated quite a bit to accept/predict more generalized Q/A structures based on predicates as discussed in their follow up on [Crowdsourcing Question-Answer Meaning Representations](https://www.aclweb.org/anthology/N18-2089.pdf) (QAMRs), which produced--at scale--a noisy-but-informative set of generalized Q/A extending the QA-SRL task, whose code can be [found here](https://github.com/uwnlp/qamr). SOTA for this task was ultimately assumed by a more-modern (and computationally expensive!) transformer-based architecture (which we'll define in future chapters) but at the time (in 2018) biLSTMs were still in vogue (and still are quite high-performing) and the authors likewise defined an interesting on based on the concept of [Supervised Open Information Extraction](https://www.aclweb.org/anthology/N18-1081.pdf), and even made their (now depricated) code (based in tensorflow/keras) available
- https://github.com/gabrielStanovsky/supervised-oie/blob/master/src/rnn/model.py

Once reviewing this paper, we'll see that highly-general Q/A systems can be trained via the BIO NER sequence labeling format by carefully pre-processing data and specifying a relatively-simple tag set.

### Review: Semantic Role Labeling (SRL)
Recall the Q/A _semantic role labeling_, where the goal was to identify token sequences&mdash;like named entities&mdash;that _answer_ specific, role-based questions about _entities_ contained within text, like who, what, when, where, and why? We discussed how this task clearly connects to NER by answering summary-level questions about sentences with specific points of justification for those answers:
```
|     UCD     |finished|the| 2006  |championship|as|Dublin|champions|,|
|B-who1/B-who2|   Q1   | 0 |B-what1|  I-what1   |0 |B-as1 |  I-as1  |0|
 ---------------------------------------------------------------------
|  by  | beating |       St       | Vincents  |in|  the  | final |.| 
|B-how1|Q2/I-how1|    B-whom2     |  I-whom2  |0 |B-when2|I-when2|0|
```

The excerpt was taken from [the QA-SRL dataset](https://huggingface.co/datasets/qa_srl) that we explored in __Chapter 2__ and is available from the `datasets` module and its [documentation is here](https://homes.cs.washington.edu/~lsz/papers/hlz-emnlp15.pdf). 



### Updated Task: Open Information Extraction
Instead of QA-SRL, in this Assignment (worth __78 points__) we'll explore a continuation of that work to a more generalized _Open Information Extraction_ task and biLSTM model who's [documentation can be found here](https://www.aclweb.org/anthology/N18-1081.pdf), and [experimental data can be found in this repository here](https://github.com/gabrielStanovsky/supervised-oie).

So long as we conform to the following `datasets` schema from our NER experiments:
```
DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})
```


## Utilities

In [1]:
import json
import os
import re

import torch

import pandas as pd

from tqdm.auto import tqdm

In [2]:
# Chapter 5 utility file
exec(open('./05-utilities.py').read())
# from utilities import *

# Random seed
RAND_SEED = 691

Additionally, let's go ahead and load the pre-trained vectors we'll be using for this assignment from gensim:

In [3]:
# %pip install gensim

In [4]:
import gensim.downloader as api

word_dim = 50
model = api.load(f"glove-wiki-gigaword-{word_dim}")

## Experiment

### 1. (5 pts) Data loading

As is typical, our first goal will be to pre-process the raw data format provided for this task into something that we can use to setup key objects we'll need like a `Vocab` and `torch.utils.data.Dataset`. 
We should be able to utilize our vocabulary loader (and other most code) from __Chapter 5__ to build out these objects. But first, we need to load the raw data into memory. 

Note: these data have no POS tags, and the paper we're following does! So we'll have to do a bit of featurization ahead.

Your first task will be to familiarize yourself with the raw data format so that you may complete the definition of `load_data` below. Specifically, you will complete the extraction of a single sample within the nested loop that is iterating over data splits and data examples. You will want to populate the following key-value pairs in the allocated `sample` dictionary:
- `tokens`: The tokens in the sample
- `ner_tags`: The BIO tags which we will be predicting
- `p_ix`: The integer index of the predicate term
- `id`: The run ID
- `pos_tags`: A blank list (we'll featurize this a bit later...)

Note: You should lowercase your tokens and NER labels.

In [5]:
# A1:Function(5/5)

def load_data():
    ds = {}
    for fold in ['train', 'train.noisy', 'test', 'validation']:
        ds[fold] = []

        # Read raw data as Pandas DataFrame
        df = pd.read_csv(f'./data/{fold}.oie.conll', sep='\t', header=0, keep_default_na = False)
        
        # Split into samples
        df_samples = [df[df.run_id == run_id] for run_id in sorted(set(df.run_id.values))]

        # Iterate over each sample
        for record in df_samples:
            sample = {}
            
            # --- Your code starts here ---
            sample['tokens'] = list(map(lambda s: s.lower(), record.word.tolist()))
            sample['ner_tags'] = list(map(lambda s: s.lower(), record.label.tolist()))
            sample['p_ix'] = record.head_pred_id.iloc[0]
            sample['id'] = record.run_id.iloc[0]
            sample['pos_tags'] = []
            # --- Your code ends here ---

            if sample and len(sample['tokens']) and len(sample['tokens']) == len(sample['ner_tags']):
                ds[fold].append(sample)
          
        print(fold, ": ", len(ds[fold]), " has sample instances")
    
    # merge the noisy training set with the normal training set
    ds['train'] = ds['train'] + ds['train.noisy']
    del ds['train.noisy']

    return ds

As a sanity check, you should expect to see:

```
rain :  5078  has sample instances
train.noisy :  18030  has sample instances
test :  1730  has sample instances
validation :  1673  has sample instances
{'id': 0, 'tokens': ['courtaulds', "'", 'spinoff', 'reflects', 'pressure', 'on', 'british', 'industry', 'to', 'boost', 'share', 'prices', 'beyond', 'the', 'reach', 'of', 'corporate', 'raiders', '.'], 'pos_tags': [], 'ner_tags': ['a0-b', 'a0-i', 'a0-i', 'p-b', 'a1-b', 'a1-i', 'a1-i', 'a1-i', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o'], 'p_ix': 3}
```

In [6]:
# A1:SanityCheck

ds = load_data()
print(ds['train'][0])

train :  5078  has sample instances
train.noisy :  18030  has sample instances
test :  1730  has sample instances
validation :  1673  has sample instances
{'tokens': ['courtaulds', "'", 'spinoff', 'reflects', 'pressure', 'on', 'british', 'industry', 'to', 'boost', 'share', 'prices', 'beyond', 'the', 'reach', 'of', 'corporate', 'raiders', '.'], 'ner_tags': ['a0-b', 'a0-i', 'a0-i', 'p-b', 'a1-b', 'a1-i', 'a1-i', 'a1-i', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o'], 'p_ix': 3, 'id': 0, 'pos_tags': []}


### 2. (3 pts) Vocabulary Construction

Next, we should setup our `Vocab` for this task. Since the `Vocab` class (in our utility file) is nicely abstracted, we can reuse it here on the `train` fold of our data. 

Additionally, let's use the Vocab class to encode our target variable, i.e., the NER tag space (our classification classes). Since there shoudn't be any unknowns in the target class, we'll enforce this by training the vocab to cover all target class labels across all folds of the data set.

Your task will be to complete the `init_vocab` function below, which is given the `ds` splits of our data and must return two vocabulary objects, `token_vocab` and `ner_vocab`. 

In [7]:
# A2:Function(3/3)

def init_vocab(ds):
    # Initialize Vocab objects for tokens and NER tags
    token_vocab = Vocab()
    ner_vocab = Vocab(stream = 'ner_tags', target = True)

    # --- Your code starts here ---
    token_vocab.train(ds["train"])
    ner_vocab.train(ds["train"])
    # --- Your code ends here ---

    return token_vocab, ner_vocab

As a sanity check, you should expect the following:

```
Token vocab length: 17606
Encoded `the` token: 15
NER vocab length: 15
NER vocab: {'a0-b': 0, 'a0-i': 1, 'p-b': 2, 'a1-b': 3, 'a1-i': 4, 'o': 5, 'a2-b': 6, 'a2-i': 7, 'p-i': 8, 'a3-b': 9, 'a3-i': 10, 'a5-b': 11, 'a4-b': 12, 'a4-i': 13, 'a5-i': 14}
Encoded `p-b` NER label: 2
```

In [8]:
# A2:SanityCheck

token_vocab, ner_vocab = init_vocab(ds)

print('Token vocab length:', len(token_vocab._word2idx))
print('Encoded `the` token:', token_vocab.encode('the'))
print('NER vocab length:', len(ner_vocab._word2idx))
print('NER vocab:', ner_vocab._word2idx)
print('Encoded `p-b` NER label:', ner_vocab.encode('p-b'))

Token vocab length: 17606
Encoded `the` token: 15
NER vocab length: 15
NER vocab: {'a0-b': 0, 'a0-i': 1, 'p-b': 2, 'a1-b': 3, 'a1-i': 4, 'o': 5, 'a2-b': 6, 'a2-i': 7, 'p-i': 8, 'a3-b': 9, 'a3-i': 10, 'a5-b': 11, 'a4-b': 12, 'a4-i': 13, 'a5-i': 14}
Encoded `p-b` NER label: 2


It looks like there could be up to 6 arguments and a predicate per sentence in this data set, so the number of target classes will have to include `'aX-b'` and `'aX-i'` (for `X in range(6)`). So, including these for each of the six and predicate (`'p-b'` and `'p-i'`), in additon to the null, `'o'` class, it appears the target space will result in a 15-dimensional system output (final layer) requirement.

### 3. (8 pts) Pre-processing for POS features
Now that we've got our vocab set up, we're almost ready to construct our data loader, but there's unfortunately a catch&mdash;since the data doesn't have a POS (the authors leverage POS features) column, we'll need to implement a tagger to fill these data in. As a quick solution, we'll use nltk's POS tagging capabilities to set us up with some solid tags.

In [9]:
# !pip install nltk
import nltk
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Tai Nguyen\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

Your task will be to utilize nltk below to complete the definition of `add_pos_tags` by:
- For each `sample['tokens']`, extract [POS tags](https://www.nltk.org/api/nltk.tag.html), which should be stored in the sample dictionary under the key `pos_tags`
- Training a `pos_vocab` object on the `train` split of `ds`

Hint: To get the tags of the tokens, use `nltk`'s `pos_tag` over the tokens of the sample!

In [10]:
# A3:Function(8/8)

from nltk import pos_tag

def get_pos_tags(ds, retag=False):
    # Initialize Vocab object for POS tags
    pos_vocab = Vocab(stream = 'pos_tags')

    # If available, re-load data
    if os.path.exists('./data/data.json') and not retag:
        ds = json.load(open('./data/data.json'))
    else:

        for fold in ds:
            for sample in tqdm(ds[fold]):
                
                # --- Your code starts here ---
                sample["pos_tags"] = [i.lower() for i in list(zip(*pos_tag(sample["tokens"])))[1]]
                # --- Your code ends here ---
        
        with open('./data/data.json', 'w+') as f:
            f.write(json.dumps(ds))

    # --- Your code starts here ---
    pos_vocab.train(ds["train"])
    # --- Your code ends here --- 

    return ds, pos_vocab

As a sanity check, you should expect:

```
100%|██████████| 23108/23108 [00:35<00:00, 659.25it/s]
100%|██████████| 1730/1730 [00:01<00:00, 875.83it/s]
100%|██████████| 1673/1673 [00:01<00:00, 885.10it/s]

Train example: {'id': 0, 'tokens': ['courtaulds', "'", 'spinoff', 'reflects', 'pressure', 'on', 'british', 'industry', 'to', 'boost', 'share', 'prices', 'beyond', 'the', 'reach', 'of', 'corporate', 'raiders', '.'], 'pos_tags': ['NNS', 'POS', 'NN', 'VBZ', 'NN', 'IN', 'JJ', 'NN', 'TO', 'VB', 'NN', 'NNS', 'IN', 'DT', 'NN', 'IN', 'JJ', 'NNS', '.'], 'ner_tags': ['a0-b', 'a0-i', 'a0-i', 'p-b', 'a1-b', 'a1-i', 'a1-i', 'a1-i', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o'], 'p_ix': 3}
POS vocab length: 46
Encoded POS tag `nn`: 4
POS tags: ['__UNK__', '', 'nns', 'pos', 'nn', 'vbz', 'in', 'jj', 'to', 'vb', 'dt', '.', 'vbd', 'vbg', 'cd', 'md', 'rb', 'wrb', 'fw', 'prp$', 'cc', ',', 'vbp', 'vbn', ':', '$', '``', "''", 'prp', 'ex', 'wdt', 'rp', 'rbr', 'jjs', 'sym', 'jjr', '(', ')', 'wp', 'nnp', 'rbs', 'pdt', 'nnps', '#', 'uh', 'wp$']
```

In [11]:
# A3:SanityCheck

retag = False  # Set to True to re-build if you make a change!
ds, pos_vocab = get_pos_tags(ds, retag=retag)

print('Train example:', ds['train'][0])
print('POS vocab length:', len(pos_vocab._word2idx))
print('Encoded POS tag `nn`:', pos_vocab.encode('nn'))
print('POS tags:', list(pos_vocab._word2idx.keys()))

Train example: {'id': 0, 'tokens': ['courtaulds', "'", 'spinoff', 'reflects', 'pressure', 'on', 'british', 'industry', 'to', 'boost', 'share', 'prices', 'beyond', 'the', 'reach', 'of', 'corporate', 'raiders', '.'], 'pos_tags': ['NNS', 'POS', 'NN', 'VBZ', 'NN', 'IN', 'JJ', 'NN', 'TO', 'VB', 'NN', 'NNS', 'IN', 'DT', 'NN', 'IN', 'JJ', 'NNS', '.'], 'ner_tags': ['a0-b', 'a0-i', 'a0-i', 'p-b', 'a1-b', 'a1-i', 'a1-i', 'a1-i', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o'], 'p_ix': 3}
POS vocab length: 46
Encoded POS tag `nn`: 4
POS tags: ['__UNK__', '', 'nns', 'pos', 'nn', 'vbz', 'in', 'jj', 'to', 'vb', 'dt', '.', 'vbd', 'vbg', 'cd', 'md', 'rb', 'wrb', 'fw', 'prp$', 'cc', ',', 'vbp', 'vbn', ':', '$', '``', "''", 'prp', 'ex', 'wdt', 'rp', 'rbr', 'jjs', 'sym', 'jjr', '(', ')', 'wp', 'nnp', 'rbs', 'pdt', 'nnps', '#', 'uh', 'wp$']


### 4. (11 pts) Forming a multi-modal vector representation
Following the authors' approach, we'll model an input instance via the triple: $(\vec{t}, \vec{s}, p)$, where $\vec{t}$ is the input sentence's sequence of tokens, $\vec{s}$ is the input sentence's sequence of part of speech (POS) tags, and $p$ is the word index of the predicate's syntactic head. Supposing we have embedding matrices for the token and POS vocabularies, $V_W$ and $V_S$, the model will extract a joint feature with the predicate and position in the text stream, $v(i,p)$, as:
$$
v(i, p) = \left[v_{W}(t_i), v_{S}(s_i), v_{W}(t_p), v_{S}(s_p)\right]
$$
where the brackets, $[\cdot]$, indicate, again, a concatenation operation of all the contained vectors, into a single (flat) feature vector. Like the authors, we'll allow $5$-dimensional POS-embeddings and utilize the same-as-before 50-dimensional GloVe pre-trained vectors, placing each $v(i, p) \in \mathbb{R}^{110}$, i.e., making our input size a $110$-dimensional type-POS dense representation. Note: perhaps the most challenging aspect of this build (and __Chapter 6's__ RNN LM) is again the data loader's padding requirement, i.e., that _all_ sentences should be padded to the same length. This is handled in a pre-processing pass on the data.

For this question, your task will be to complete the PyTorch Datset class `PyOIE` by:
- Completing its `__len__` definition
- Completing its `__getitem__` definition
- Implementing the `_featurize` function, which takes a sample and adds a new set of PyTorch tensors to the data containers
  - For this step, be sure to pad all input using `self.pad_length` as the desired sequence length. Additionally, use the null string `''` as the pad token and pad POS tag (both will map to 1, per our `Vocab` class). For NER tags (aka `label`), use the `o` tag for padding.

In [12]:
# A4:Class(11/11)

class PyOIE(torch.utils.data.Dataset):

    """
    A PyTorch Dataset wrapper for the OIE dataset!

    This wrapper is instantiated by providing:
    * ds_split: the specific split from OIE
    * vocab, pos_vocab, ner_vocab: the vocabs needed for OIE
    """

    def __init__(self, ds_split, vocab, pos_vocab, ner_vocab, pad_len=32):
        # cache parameters
        self._vocab = vocab
        self._pos_vocab = pos_vocab
        self._ner_vocab = ner_vocab
        self._raw = ds_split

        # set up our data containers 
        # note: we will be creating a list of tensors
        self._x_tok = []; self._x_pos = []
        self._p_tok = []; self._p_pos = []
        self._label = []

        # setting pad to max sequence length
        max_len = max([len(r['tokens']) for r in self._raw]) 
        self.pad_length = min(pad_len, max_len)
        
        # pre-process our samples into structured features
        for r in tqdm(self._raw):
            self._featurize(r)

    def __len__(self):
        return len(self._raw)

    def __getitem__(self, item):
        # when calling a specific example,
        # we return both the input and the 
        # gold label!
        out = {}

        # --- Your code starts here ---
        out["x_tok"] = self._x_tok[item]
        out["x_pos"] = self._x_pos[item]
        out["p_tok"] = self._p_tok[item]
        out["p_pos"] = self._p_pos[item]
        out["label"] = self._label[item]
        # --- Your code ends here ---

        return out

    def _featurize(self, sample):
        # --- Your code starts here ---
        len_toks = len(sample["tokens"])
        npads = self.pad_length - len_toks
        tokens = (sample["tokens"] + [''] * (npads))[:self.pad_length]
        pos_tags = (sample["pos_tags"] + [''] * (npads))[:self.pad_length]
        pos_tags = list(map(lambda s: s.lower(), pos_tags))
        ner_tags = (sample["ner_tags"] + ['o'] * (npads))[:self.pad_length]
        ner_tags = list(map(lambda s: s.lower(), ner_tags))

        p_tok = ([ sample["tokens"][sample["p_ix"]] ] * len_toks 
                + [''] * (npads))[:self.pad_length]
        p_pos = ([ sample["pos_tags"][sample["p_ix"]].lower() ] * len_toks 
                + [''] * (npads))[:self.pad_length]

        enc_tokens = torch.LongTensor(list(map(self._vocab.encode, tokens)))
        enc_pos_tags = torch.LongTensor(list(map(self._pos_vocab.encode, pos_tags)))
        enc_ner_tags = torch.LongTensor(list(map(self._ner_vocab.encode, ner_tags)))
        enc_p_tok = torch.LongTensor(list(map(self._vocab.encode, p_tok)))
        enc_p_pos = torch.LongTensor(list(map(self._pos_vocab.encode, p_pos)))

        self._x_tok.append(enc_tokens)
        self._x_pos.append(enc_pos_tags)
        self._p_tok.append(enc_p_tok)
        self._p_pos.append(enc_p_pos)
        self._label.append(enc_ner_tags)
        # --- Your code ends here ---
        
        # Note: This function does not return anything.
        # You should be appending pre-processed samples to the data containers
        # initalized in the __init__ funciton

As a sanity check, you should expect:

```
100%|██████████| 23108/23108 [00:01<00:00, 12107.84it/s]
100%|██████████| 1673/1673 [00:00<00:00, 11230.22it/s]
100%|██████████| 1730/1730 [00:00<00:00, 10895.23it/s]


x_tok tensor([ 2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
        20,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1])
x_pos tensor([ 2,  3,  4,  5,  4,  6,  7,  4,  8,  9,  4,  2,  6, 10,  4,  6,  7,  2,
        11,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1])
p_tok tensor([5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1])
p_pos tensor([5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1])
label tensor([0, 1, 1, 2, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
        5, 5, 5, 5, 5, 5, 5, 5])

```

In [13]:
# A4:SanityCheck 

train = PyOIE(ds['train'], token_vocab, pos_vocab, ner_vocab)
val = PyOIE(ds['validation'], token_vocab, pos_vocab, ner_vocab)
test = PyOIE(ds['test'], token_vocab, pos_vocab, ner_vocab)

print()
for k, v in train[0].items():
  print(k, v)

  0%|          | 0/23108 [00:00<?, ?it/s]

  0%|          | 0/1673 [00:00<?, ?it/s]

  0%|          | 0/1730 [00:00<?, ?it/s]


x_tok tensor([ 2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
        20,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1])
x_pos tensor([ 2,  3,  4,  5,  4,  6,  7,  4,  8,  9,  4,  2,  6, 10,  4,  6,  7,  2,
        11,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1])
p_tok tensor([5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1])
p_pos tensor([5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1])
label tensor([0, 1, 1, 2, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
        5, 5, 5, 5, 5, 5, 5, 5])


### 5. (5 pts) Filling out the dense represenation

Now that we've set up our vocabs, encoded our sequences, and created a full `PyOIE` dataset object, let's construct the initial representation matrix with a set of pre-trained GloVe vectors. 
We'll randomly initialize normally-distributed random vectors for the tokens that are in our vocab that lack a pre-trained vector. 
<!-- As we will unfreeze this embedding layer, this shouldn't be a problem since we'll backpropagate and update the embedding to learn reasonable vectors for these randomly initialized ones. -->

Now, your task is to finish the `get_embeddings` function per our description. Specifically, you should:
- Load the pre-trained vector, if it's in the model
- Else, randomly initialize one using a standard normal distribution

In [14]:
# A5:Function(5/5)

def get_embeddings(vocab, model, vector_dim=300):
    torch.manual_seed(RAND_SEED)

    vocab_size = len(vocab._word2idx)

    wvs = torch.zeros((vocab_size, vector_dim))

    # --- Your code starts here ---
    for k in vocab._word2idx:
        try:
            pretrained_idx = model.get_index(k)
            wvs[vocab._word2idx[k]] = torch.tensor(model[pretrained_idx])
        except KeyError:
            wvs[vocab._word2idx[k]] = torch.randn((50, ))    
    # --- Your code ends here ---

    return wvs

As a sanity check, you should expect:

```
Word embedding shape: torch.Size([17606, 50])
Representation for `the`:
	 tensor([ 4.1800e-01,  2.4968e-01, -4.1242e-01,  1.2170e-01,  3.4527e-01,
        -4.4457e-02, -4.9688e-01, -1.7862e-01, -6.6023e-04, -6.5660e-01,
         2.7843e-01, -1.4767e-01, -5.5677e-01,  1.4658e-01, -9.5095e-03,
         1.1658e-02,  1.0204e-01, -1.2792e-01, -8.4430e-01, -1.2181e-01,
        -1.6801e-02, -3.3279e-01, -1.5520e-01, -2.3131e-01, -1.9181e-01,
        -1.8823e+00, -7.6746e-01,  9.9051e-02, -4.2125e-01, -1.9526e-01,
         4.0071e+00, -1.8594e-01, -5.2287e-01, -3.1681e-01,  5.9213e-04,
         7.4449e-03,  1.7778e-01, -1.5897e-01,  1.2041e-02, -5.4223e-02,
        -2.9871e-01, -1.5749e-01, -3.4758e-01, -4.5637e-02, -4.4251e-01,
         1.8785e-01,  2.7849e-03, -1.8411e-01, -1.1514e-01, -7.8581e-01])
Representation for `euromarket`:
	 tensor([ 0.8074, -0.6810,  1.0453, -0.4567,  0.7270, -0.5824, -0.2911,  0.3334,
         0.4072, -1.6747, -0.8621, -0.6207, -2.5802,  1.2279, -0.0549,  0.5757,
        -0.4442,  0.0988,  0.0253,  0.2364, -1.5998, -0.8065,  1.4552,  1.0078,
         1.0012,  0.2039, -0.6355,  0.9761, -2.5255,  2.1219, -0.0597, -0.3632,
        -2.4620,  1.8938, -0.8431, -0.4609, -0.4363, -0.7455,  3.3058, -0.1539,
         1.0984, -0.4961,  0.2065, -1.6130, -0.5355,  0.5481,  0.1270,  0.9623,
         0.6976,  0.8415])
```

In [15]:
# A5:SanityCheck

wvs = get_embeddings(token_vocab, model, vector_dim=word_dim)

print('Word embedding shape:', wvs.shape)
print('Representation for `the`:\n\t', wvs[token_vocab.encode('the')])
print('Representation for `euromarket`:\n\t', wvs[token_vocab.encode('euromarket')])

Word embedding shape: torch.Size([17606, 50])
Representation for `the`:
	 tensor([ 4.1800e-01,  2.4968e-01, -4.1242e-01,  1.2170e-01,  3.4527e-01,
        -4.4457e-02, -4.9688e-01, -1.7862e-01, -6.6023e-04, -6.5660e-01,
         2.7843e-01, -1.4767e-01, -5.5677e-01,  1.4658e-01, -9.5095e-03,
         1.1658e-02,  1.0204e-01, -1.2792e-01, -8.4430e-01, -1.2181e-01,
        -1.6801e-02, -3.3279e-01, -1.5520e-01, -2.3131e-01, -1.9181e-01,
        -1.8823e+00, -7.6746e-01,  9.9051e-02, -4.2125e-01, -1.9526e-01,
         4.0071e+00, -1.8594e-01, -5.2287e-01, -3.1681e-01,  5.9213e-04,
         7.4449e-03,  1.7778e-01, -1.5897e-01,  1.2041e-02, -5.4223e-02,
        -2.9871e-01, -1.5749e-01, -3.4758e-01, -4.5637e-02, -4.4251e-01,
         1.8785e-01,  2.7849e-03, -1.8411e-01, -1.1514e-01, -7.8581e-01])
Representation for `euromarket`:
	 tensor([-0.9060, -1.0927, -1.3634,  0.8262, -0.2185, -0.2157,  1.8471,  0.2207,
        -0.4699, -0.3421,  1.4126,  0.5451,  1.2909,  2.2761,  2.3379, -0.8562,


### 6. (15 pts) Defining the neural LSTM tagger

Now that we've loaded our initial token embeddings, we can define our model.
Specifically, we'll setup a biLSTM (like was proposed in the paper) that has the following layers:
- A token embedding, initialized from our `word_vectors` 
  * Set `freeze=True`
  * Also, set `padding_idx=1`. If done, this tells PyTorch that an encoded 1 represents a pad token and PyTorch will not compute gradients for these tokens (thus saving us computation time!)
- A token embedding dropout layer that drops out a word representation with probability `word_dropout`
- A POS embedding, ranomly initialized (i.e., initialized normally, not with any pre-trained representations)
  * Also set `padding_idx=1` here as well!
- An LSTM layer
  * For this, we recommend setting `batch_first=True` so that sequences are expected to be of shape `(batch_size, sequence_length, d)` instead of the default expectation of `(sequence_length, batch_size, d)`
- A ReLU non-linearity
- A prediction dropout that drops out (prior to the last layer) with probability `pred_dropout`
- A prediction layer that maps from the biLSTM's output to the target, NER class space
  * For this layer, set `bias=False`

After setting these layers up in `__init__`, you should complete the model definition by writing the `forward` function.

In [16]:
# A6:Class(15/15)

class biLSTM(torch.nn.Module):

    def __init__(self, 
                 in_dim, hidden_dim, out_dim, 
                 lstm_layers, word_vectors, 
                 pos_vocab_size, pos_dim=5,
                 word_dropout=0.1,
                 pred_dropout=0.1,
                 bidirectional = True):
        # remember to start __init__ with a call to super!
        super(biLSTM, self).__init__()

        torch.manual_seed(RAND_SEED)

        self.hidden_dim = hidden_dim
        self.in_dim = in_dim
        self.out_dim = out_dim
        self.lstm_layers = lstm_layers
        self.num_directions = 2 if bidirectional else 1

        # --- Your code starts here
        self.token_embed = torch.nn.Embedding.from_pretrained(word_vectors, padding_idx=1)
        self.pos_embed = torch.nn.Embedding(pos_vocab_size, pos_dim)

        self.token_embed_dropout = torch.nn.Dropout(word_dropout)
        self.predict_dropout = torch.nn.Dropout(pred_dropout)

        self.lstm = torch.nn.LSTM(
            in_dim, 
            hidden_dim, 
            lstm_layers, 
            batch_first=True, 
            bidirectional=bidirectional,
        )

        self.relu = torch.nn.ReLU()
        self.predict_layer = torch.nn.Linear(hidden_dim * 2, out_dim, bias=False)
        # --- Your code ends here ---

    def forward(self, x_tok, x_pos, p_tok, p_pos): 

        # --- Your code starts here ---
        tok_emb = self.token_embed(x_tok)
        tok_emb = self.token_embed_dropout(tok_emb)
        predicate_tok_emb = self.token_embed(p_tok)
        pos_emb = self.pos_embed(x_pos)
        pos_emb = self.token_embed_dropout(pos_emb)
        predicate_pos_emb = self.pos_embed(p_pos)

        emb = torch.concat([tok_emb, pos_emb, predicate_tok_emb, predicate_pos_emb], dim=2)
        emb, _ = self.lstm(emb)
        emb = self.relu(emb)
        emb = self.predict_dropout(emb)
        z_hat = self.predict_layer(emb)
        # --- Your code ends here ---

        return z_hat

As a sanity check, you should expect the following:

```
biLSTM(
  (_embed): Embedding(17606, 50, padding_idx=1)
  (_pos_embed): Embedding(50, 5, padding_idx=1)
  (_embed_dropout): Dropout(p=0.1, inplace=False)
  (_lstm): LSTM(110, 256, num_layers=3, batch_first=True, bidirectional=True)
  (_ReLU): ReLU()
  (_pred_dropout): Dropout(p=0.1, inplace=False)
  (_pred): Linear(in_features=512, out_features=15, bias=False)
)
Number of parameters: 4795814
Number of (trainable) parameters: 3915514
Test Prediction shape: torch.Size([1, 32, 15])
```

In [17]:
# A6:SanityCheck

pos_dim = 5
pos_vocab_size = len(pos_vocab._word2idx)

in_dim = 2 * word_dim + 2 * pos_dim

hidden_dim = 256

lstm_layers = 3
out_dim = len(ner_vocab._word2idx)

biLSTM_net = to_gpu(biLSTM(in_dim, hidden_dim, out_dim, lstm_layers, wvs, pos_vocab_size, pos_dim=pos_dim))

print(biLSTM_net)
print('Number of parameters:', sum(p.numel() for p in biLSTM_net.parameters()))
print('Number of (trainable) parameters:', sum(p.numel() for p in biLSTM_net.parameters() if p.requires_grad))

# Testing that input properly flows through the model:
x = train[0]
x = {k: to_gpu(v.unsqueeze(0)) for k, v in x.items()}  # create batch size of 1
out = biLSTM_net(x['x_tok'], x['x_pos'], x['p_tok'], x['p_pos'])
print('Test Prediction shape:', out.shape)

biLSTM(
  (token_embed): Embedding(17606, 50, padding_idx=1)
  (pos_embed): Embedding(46, 5)
  (token_embed_dropout): Dropout(p=0.1, inplace=False)
  (predict_dropout): Dropout(p=0.1, inplace=False)
  (lstm): LSTM(110, 256, num_layers=3, batch_first=True, bidirectional=True)
  (relu): ReLU()
  (predict_layer): Linear(in_features=512, out_features=15, bias=False)
)
Number of parameters: 4795794
Number of (trainable) parameters: 3915494
Test Prediction shape: torch.Size([1, 32, 15])


### 7. (15 pts) Training the neural tagger

Your next task will be to complete the definition of the `train` function below. 
In this train function, you will need to:
* Configure an optimizer
  * We'll use Adagrad and will set the learning rate equal to `lr`
* Configure a loss function
  * This should be `CrossEntropyLoss` with `reduction='sum'`
* Configure two DataLoaders, one for the `train` data and one for the `val` data
  * Be sure to enable shuffling and set the proper batch size!

Next, we'll want to configure a training loop that uses early stopping. 
Specifically, `train_model` receives input parameters `max_epochs` and `patience`. `max_epochs` tells us the _maximum_ number of epochs we'll let our model train for and `patience` tells us how many epochs we are willing to wait where our model does __not__ improve on the validation set before exiting. 
Within this loop, you should complete the two missing code chunks for the `train` and `val` iterations, respectively. Be sure to enable `.train()` and `.eval()` when appropriate.

Note: We could use gradient clipping here, but since our model is not super complicated nor that deep (in time and depth), we will opt to not clip our gradients. 


In [18]:
# A7:Function(15/15)

def train_model(model: torch.nn.Module, train_ds, val_ds, lr=1e-2, batch_size=50, max_epochs=100, patience=5, gpu=True):
    torch.manual_seed(RAND_SEED)
    train_dl = torch.utils.data.DataLoader(train_ds, batch_size=batch_size, shuffle=True, pin_memory=True)
    val_dl = torch.utils.data.DataLoader(val_ds, batch_size=128, pin_memory=True)

    # --- Your code starts here ---
    loss_fn = torch.nn.CrossEntropyLoss(reduction="sum")
    model = model.cuda()
    opt = torch.optim.Adagrad(model.parameters(), lr=lr)
    # --- Your code ends here --- 

    epoch = 0
    no_imp = 0
    best_val_loss = None 
    while epoch < max_epochs and no_imp < patience:
        # print('=' * 50)
        # print('Begin epoch', epoch + 1)
        # print('-' * 50)
        
        train_loss = 0
        # --- Your code starts here ---
        train_pbar = tqdm(total=len(train_dl))
        train_pbar.set_description_str(f"Epoch {epoch}: ")
        for i, batch in enumerate(train_dl):
            batch = {k: to_gpu(v) for k, v in batch.items()}
            opt.zero_grad()
            y_pred = model(batch['x_tok'], batch['x_pos'], batch['p_tok'], batch['p_pos'])
            y_pred = y_pred.flatten(0, 1)
            y_true = batch['label'].flatten()
            loss = loss_fn(y_pred, y_true)
            loss.backward()
            opt.step()

            train_loss += loss.cpu().detach().item()
            avg_train_loss = train_loss / ((i + 1) * len(y_true))

            train_pbar.set_postfix({"train_loss": avg_train_loss})
            train_pbar.update()
        # --- Your code ends here ---

        val_loss = 0
        # --- Your code starts here ---
        with torch.no_grad():
            val_pbar = tqdm(total=len(val_dl))
            val_pbar.set_description_str(f"Validating {epoch}: ")
            for i, batch in enumerate(val_dl):
                batch = {k: to_gpu(v) for k, v in batch.items()}
                y_pred = model(batch['x_tok'], batch['x_pos'], batch['p_tok'], batch['p_pos'])
                y_pred = y_pred.flatten(0, 1)
                y_true = batch['label'].flatten()
                loss = loss_fn(y_pred, y_true)

                val_loss += loss.cpu().detach().item()
                avg_val_loss = val_loss / ((i + 1) * len(y_true))

                val_pbar.set_postfix({"val_loss": avg_val_loss})
                val_pbar.update()
        # --- Your code ends here ---

        if best_val_loss is None or val_loss < best_val_loss:
            best_val_loss = val_loss
            no_imp = 0
            torch.save(model, './data/PyOIE_bilstm.pt')
        else:
            no_imp += 1

        print(f'Best validation loss: {best_val_loss:.2f} (Epochs without improvement: {no_imp})')

        epoch += 1
        # print('=' * 50)

    # Reload and return the best model
    return torch.load('./data/PyOIE_bilstm.pt')

As a sanity check, you should anticipate that your model can achieve a validation loss below `70_000`.

In [19]:
# A7:SanityCheck

lr = 1e-2
batch_size = 32
biLSTM_net = biLSTM(in_dim, hidden_dim, out_dim, lstm_layers, wvs, pos_vocab_size, pos_dim=pos_dim)

best_bilstm = train_model(biLSTM_net, train, val, lr=lr, batch_size=batch_size, patience=1)

  0%|          | 0/723 [00:00<?, ?it/s]

  0%|          | 0/14 [00:00<?, ?it/s]

Best validation loss: 54181.60 (Epochs without improvement: 0)


  0%|          | 0/723 [00:00<?, ?it/s]

  0%|          | 0/14 [00:00<?, ?it/s]

Best validation loss: 49163.67 (Epochs without improvement: 0)


  0%|          | 0/723 [00:00<?, ?it/s]

  0%|          | 0/14 [00:00<?, ?it/s]

Best validation loss: 47834.65 (Epochs without improvement: 0)


  0%|          | 0/723 [00:00<?, ?it/s]

  0%|          | 0/14 [00:00<?, ?it/s]

Best validation loss: 47834.65 (Epochs without improvement: 1)


### 8. (8 pts) Evaluating the neural tagger

Now that we've trained up a model, let's evaluate it on each split of the data. 
Specifically, you will complete the `eval_tagger` function which will take a split of the data and will return the sklearn `classification_report` for that split.

In [20]:
torch.argmax(torch.randn((32, 15)), dim=1)

tensor([ 3, 13, 14,  0, 10,  2,  0, 12,  0,  1, 13, 10, 12, 14, 12,  9,  5,  0,
        10, 14, 10,  9,  9, 12,  9,  7,  9,  6,  5, 13,  6,  1])

In [21]:
# A8:Function(8/8)

from sklearn.metrics import classification_report

def eval_tagger(model, data, batch_size=50, label_names=None):
    batch_loader = torch.utils.data.DataLoader(data, batch_size=batch_size)

    model = model.eval()

    preds = []; golds = []
    with torch.no_grad():
        val_pbar = tqdm(total=len(batch_loader))
        val_pbar.set_description_str(f"Testing: ")
        val_loss = 0
        for i, batch in enumerate(batch_loader):
            batch = {k: to_gpu(v) for k, v in batch.items()}
            y_pred = model(batch['x_tok'], batch['x_pos'], batch['p_tok'], batch['p_pos'])
            y_pred = y_pred.flatten(0, 1)
            y_pred = torch.argmax(y_pred, dim=1)
            y_true = batch['label'].flatten()

            preds += y_pred.detach().cpu().tolist()
            golds += y_true.detach().cpu().tolist()

            val_pbar.update()
        # --- Your code ends here ---

    return classification_report(golds, preds, target_names=label_names)

As a sanity check, your model should be able to attain a training weighted average F1-score above 0.70 and a validation weighted average F1-score above 50.

In [22]:
# A8:SanityCheck

for fold, dx in [('train', train), ('val', val), ('test', test)]:
    print(f'Evaluating {fold}')
    print(eval_tagger(best_bilstm, dx, label_names=[ner_vocab.decode(x) for x in range(15)]))

Evaluating train


  0%|          | 0/463 [00:00<?, ?it/s]

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


              precision    recall  f1-score   support

        a0-b       0.60      0.28      0.38     19850
        a0-i       0.53      0.24      0.33     37848
         p-b       0.94      0.96      0.95     21028
        a1-b       0.43      0.21      0.29     17042
        a1-i       0.49      0.34      0.40     49657
           o       0.84      0.97      0.90    563306
        a2-b       0.00      0.00      0.00      6135
        a2-i       0.46      0.01      0.02     17475
         p-i       0.65      0.14      0.24      1475
        a3-b       0.00      0.00      0.00      1256
        a3-i       0.00      0.00      0.00      3545
        a5-b       0.00      0.00      0.00        20
        a4-b       0.00      0.00      0.00       189
        a4-i       0.00      0.00      0.00       598
        a5-i       0.00      0.00      0.00        32

    accuracy                           0.81    739456
   macro avg       0.33      0.21      0.23    739456
weighted avg       0.77   

  0%|          | 0/34 [00:00<?, ?it/s]

              precision    recall  f1-score   support

        a0-b       0.61      0.28      0.38      1569
        a0-i       0.56      0.19      0.28      3918
         p-b       0.90      0.94      0.92      1566
        a1-b       0.53      0.22      0.31      1424
        a1-i       0.57      0.21      0.31      6822
           o       0.72      0.98      0.83     34546
        a2-b       0.00      0.00      0.00       620
        a2-i       0.20      0.00      0.00      2151
         p-i       0.61      0.21      0.31       110
        a3-b       0.00      0.00      0.00       173
        a3-i       0.00      0.00      0.00       538
        a5-b       0.00      0.00      0.00         1
        a4-b       0.00      0.00      0.00        25
        a4-i       0.00      0.00      0.00        65
        a5-i       0.00      0.00      0.00         8

    accuracy                           0.71     53536
   macro avg       0.31      0.20      0.22     53536
weighted avg       0.65   

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


  0%|          | 0/35 [00:00<?, ?it/s]

              precision    recall  f1-score   support

        a0-b       0.61      0.25      0.36      1637
        a0-i       0.52      0.17      0.26      3910
         p-b       0.89      0.93      0.91      1612
        a1-b       0.50      0.18      0.26      1485
        a1-i       0.59      0.21      0.31      6575
           o       0.72      0.98      0.83     36174
        a2-b       0.00      0.00      0.00       654
        a2-i       0.38      0.00      0.01      2460
         p-i       0.48      0.11      0.17       104
        a3-b       0.00      0.00      0.00       153
        a3-i       0.00      0.00      0.00       496
        a5-b       0.00      0.00      0.00         1
        a4-b       0.00      0.00      0.00        25
        a4-i       0.00      0.00      0.00        72
        a5-i       0.00      0.00      0.00         2

    accuracy                           0.71     55360
   macro avg       0.31      0.19      0.21     55360
weighted avg       0.65   

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


### 9. (8 pts) Inference: How does the tagger perform?

In this final question, you will complete the `inference` function, which will attempt to apply the model to new data. 

While we've setup the POS tagging of the raw text, you should:
* Grab the predicate and its POS
* Properly form the input (make input into a batch size of 1 with `.unsqueeze`)
* Obtain raw predictions
* Use `torch.argmax` to obtain the most likely classes and store them in the variable `preds`

After this implementation, you'll notice that we do a number of additional post-processing steps. Quoting from the [original paper](https://www.aclweb.org/anthology/N18-1081.pdf):

> "At inference time, we first identify all verbs and nominal predicates in the sentence as candidate predicate heads. We use a Part Of Speech (POS) tagger to identify verbs, and Catvar’s subcategorization frames (Habash and Dorr, 2003) for nominalizations, identifying nouns which share the same frame with a verbal equivalent (e.g., acquisition with acquire). We then generate an input instance for each candidate predicate head. For each instance, we tag each word with its most likely BIO label under the model, and reconstruct Open IE tuples from the resulting sequence according to the method described in Section 3, with the exception that we ignore malformed spans (i.e., if an A0-I label is not preceded by A0-I or A0-B, we treat it as O)."

While, perhaps, not quite as involved, you'll notice that we replicate several of these steps below, with the additional grounding that, if we're supplied the start of the predicate, we should probably use that!

In [23]:
import nltk
nltk.download('punkt')
from nltk import pos_tag, word_tokenize

[nltk_data] Downloading package punkt to C:\Users\Tai
[nltk_data]     Nguyen\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [24]:
# A9:Function(8/8)
# import nltk
# nltk.download('punkt')
# from nltk import pos_tag, word_tokenize

def inference(model, text, pred_idx, token_vocab, pos_vocab, ner_vocab, max_length=32):
    
    # Grab our tokens
    tokens = word_tokenize(text.lower())
    
    # Grab our POS tags
    pos_tags = [t[1].lower() for t in pos_tag(tokens)]

    # --- Your code starts here ---
    model = model.eval()
    len_toks = len(tokens)
    npads = max_length - len_toks
    tokens = (tokens + [''] * (npads))[:max_length]
    pos_tags = (pos_tags + [''] * (npads))[:max_length]
    pos_tags = list(map(lambda s: s.lower(), pos_tags))

    p_tok = ([ tokens[pred_idx] ] * len_toks + [''] * (npads))[:max_length]
    p_pos = ([ pos_tags[pred_idx].lower() ] * len_toks + [''] * (npads))[:max_length]

    enc_tokens = torch.LongTensor(list(map(token_vocab.encode, tokens))).unsqueeze(0).cuda()
    enc_pos_tags = torch.LongTensor(list(map(pos_vocab.encode, pos_tags))).unsqueeze(0).cuda()
    enc_p_tok = torch.LongTensor(list(map(token_vocab.encode, p_tok))).unsqueeze(0).cuda()
    enc_p_pos = torch.LongTensor(list(map(pos_vocab.encode, p_pos))).unsqueeze(0).cuda()

    with torch.no_grad():
        preds = model(enc_tokens, enc_pos_tags, enc_p_tok, enc_p_pos)
        preds = preds.squeeze()
        preds = torch.argmax(preds, dim=1)
    
    # --- Your code ends here --- 
    
    preds = preds.squeeze()
    preds = preds.detach().cpu().tolist()

    # Transform predictions into their string labels
    preds = [ner_vocab.decode(p) for p in preds]

    # Did we actually predict the predicate as the predicate?
    if preds[pred_idx] != 'p-b':
        preds[pred_idx] = 'p-b'

    # Filter completely un-reasonable predictions (aX-i when not proceeded by aX-b/i)
    for ix, p in enumerate(preds):
        if '-i' in p:
            # Gets the core of the prediction;
            # This will be p, a0, a1, ...
            core = p.split('-')[0]
            if not ix:
                # The first prediction cannot be an -i tag
                preds[ix] = 'o'
            elif core not in preds[ix - 1]:
                # The previous tag does not match, so sequence in incompatible
                preds[ix] = 'o'

    # Returning two formats for our inspection!
    zip_out = list(zip(tokens, preds))
    tuple_out = []
    prev = ''
    for ix, (t, p) in enumerate(zip_out):
        # Properly segmenting portions of the tuple form with ';
        if ix and p != prev and tuple_out[-1] != ';':
            tuple_out.append(';')
        
        if p != 'o':
            tuple_out.append(t)
        
        # Update previous for segmenting with ';'
        prev = p
    tuple_out = tuple(tuple_out)

    return zip_out, tuple_out


As a sanity check, you should anticipate something like so:

```
Test 0
([('the', 'a0-b'), ('assignment', 'a0-b'), ('was', 'p-b'), ('the', 'a4-b'), ('last', 'o'), ('one', 'o')], ('the', 'assignment', ';', 'was', ';', 'the', ';'))

Test 1
([('the', 'a0-b'), ('students', 'a0-b'), ('completed', 'p-b'), ('the', 'a4-b'), ('assignment', 'a5-b')], ('the', 'students', ';', 'completed', ';', 'the', ';', 'assignment'))
```

which, while not perfect, is actually not too shabby!

Additionally, we've built out a smaller model than the exact one in the original paper with word vectors of 50 dimensions (the original paper uses 300 dimensional vectors). However, we've gotten within 10 points of the previous model and all without any hyperparameter tuning nor fancy additions like attention!

In [25]:
# A9:SanityCheck

tests = [
         ('The assignment was the last one', 2),
         ('The students completed the assignment', 2)
]

for ix, (test_str, pred_idx) in enumerate(tests):
  print('Test', ix)
  result = inference(best_bilstm, test_str, pred_idx, token_vocab, pos_vocab, ner_vocab)
  print(result)
  print()

Test 0
([('the', 'a0-b'), ('assignment', 'a0-i'), ('was', 'p-b'), ('the', 'a1-b'), ('last', 'a1-i'), ('one', 'o'), ('', 'o'), ('', 'o'), ('', 'o'), ('', 'o'), ('', 'o'), ('', 'o'), ('', 'o'), ('', 'o'), ('', 'o'), ('', 'o'), ('', 'o'), ('', 'o'), ('', 'o'), ('', 'o'), ('', 'o'), ('', 'o'), ('', 'o'), ('', 'o'), ('', 'o'), ('', 'o'), ('', 'o'), ('', 'o'), ('', 'o'), ('', 'o'), ('', 'o'), ('', 'o')], ('the', ';', 'assignment', ';', 'was', ';', 'the', ';', 'last', ';'))

Test 1
([('the', 'a0-b'), ('students', 'a0-i'), ('completed', 'p-b'), ('the', 'a1-b'), ('assignment', 'o'), ('', 'o'), ('', 'o'), ('', 'o'), ('', 'o'), ('', 'o'), ('', 'o'), ('', 'o'), ('', 'o'), ('', 'o'), ('', 'o'), ('', 'o'), ('', 'o'), ('', 'o'), ('', 'o'), ('', 'o'), ('', 'o'), ('', 'o'), ('', 'o'), ('', 'o'), ('', 'o'), ('', 'o'), ('', 'o'), ('', 'o'), ('', 'o'), ('', 'o'), ('', 'o'), ('', 'o')], ('the', ';', 'students', ';', 'completed', ';', 'the', ';'))



As a final remark, notice that in this final inferential step, we applied our knowledge about the grammar of the task to refine the potential prediction our network would produce. For models and tasks like this, especially with limited data (~10,000 samples versus ~1,000,000+ samples for "big" datasets), it's important to observe that we gain by informing ourselves about the expected grammar of the task. 

Furthermore, if you review the original paper, you'll notice that they further improve their model by developing a notion of "confidence" in prediction and using that to further filter results. 
Additionally, there are ways like [Conditional Random Fields (CRFs)](https://homepages.inf.ed.ac.uk/csutton/publications/crftut-fnt.pdf) that may be used to enforce constraints like grammars on final outputs. 
This is all beyond the scope of this current assignment, however, they're worth mentioning because they can be vital tricks that takes a model like we've built here from one that works kinda-alright into a model that works decently well in a practical application.

### Beyond this Assignment: Hyperparameter Tuning

One thing that we should make note of is the fact that our model has a lot of hyperparameters. These are our set of parameters that aren't adjusted during training, rather they set the structure for how training is allowed to proceed.
In all likelihood, _much better_ hyperparameter selections exist.

Architecturally, our model has hyperparameters like the number of layers, the hidden dimension, the word and POS embedding dimensions, the dropout rates, and the usage of bidirectionality (or not).

When it comes to learning, there are another segment of hyperparameters that we introduce like the learning rate and the batch size. In a way, even our selection of optimizer introduces another hyperparameter. 

Finally, consider how our pre-processing of our data also has hyperparameters associated with it: pad length and lowercasing. Though the latter of the two is necessary to be compatible with the set of pre-trained vectors we've used, it is good to acknowledge that this kind of pre-processing, inevitably, will affect downstream performance (as we've seen this term!).

How does one decide which hyperparameters are best? What are the approaches used to try different combinations of hyperparameters? 

A typical flow will have a model train on the train set, while the quality of the hyperparameters should be evaluated relative to the validation set (with the final test set as our true test that we should __never__ touch until final evaluation). Some of the hyperparameter tuning approaches you may come across:
* __Grid Search__: For each hyperparameter, enumerate any values you consider to be appropriate. Then, train a model for every combination of parameters. Notice that the "every combination" part can result in a very large set of combinations to consider. 
* __Random Walk__: Fix a budget for the number of models you are willing to consider and train that many models. For each new model, randomly generate the hyperparameters (within a specified set of options or range for each hyperparamter).
* __Bayesian Optimization__: Use the past experiments to inform the selection of hyperparameters for new experiments. Typically, this type of approach will have an engine that generates the parameters for you based on information you supply about a model's performance. 

There are many packages and applications that facilitate hyperparameter tuning. One such application that is extremely useful for training and debugging deep learning models is [Weights & Biases](https://wandb.ai/site), also known as `wandb`.

wandb is excellent for monitoring experiments, visualizing loss curves, tuning hyperparameters, and more! Once creating an account with wandb, it's merely a matter of installing its Python package and using basic commands like `wb.init(...)` to specify information about your experiment and `wb.log(...)` to push performance information to wandb. For more information, check out [this guide](https://docs.wandb.ai/quickstart) for rapidly getting started with wandb.

You are __highly encouraged__ to make use of such packages, either in your final project or in deep learning projects you take on beyond this course. 
There's nothing required for this portion, just merely a reflection on best practices and where one would be recommended to proceed beyond this point!