This tutorial explains some of the data structures used in Assignment 3 and clarifies a few related concepts.

In [1]:
!pip install --upgrade spacy==2.1.0 allennlp==0.9.0
# AllenNLP is a library built on top of PyTorch designed to help build deep NLP models with ease.
!pip install seqeval
# seqeval evaluates sequence labeling output (NER, POS tagging, etc.) and understands IOB-style tag notation, so it can be used for evaluation here


In [2]:
import spacy
import torch
import torch.optim as optim
import numpy as np

import seqeval
import copy

from typing import Iterator, List, Dict
from allennlp.data import Instance
from allennlp.data.fields import TextField, SequenceLabelField
from allennlp.data.dataset_readers import DatasetReader
from allennlp.common.file_utils import cached_path
from allennlp.data.token_indexers import TokenIndexer, SingleIdTokenIndexer
from allennlp.data.tokenizers import Token
from allennlp.data.vocabulary import Vocabulary
from allennlp.models import Model
from allennlp.modules.text_field_embedders import TextFieldEmbedder, BasicTextFieldEmbedder
from allennlp.modules.token_embedders import Embedding
from allennlp.modules.seq2seq_encoders import Seq2SeqEncoder, PytorchSeq2SeqWrapper
from allennlp.nn.util import get_text_field_mask, sequence_cross_entropy_with_logits
from allennlp.training.metrics import CategoricalAccuracy
from allennlp.data.iterators import BucketIterator
from allennlp.training.trainer import Trainer
from allennlp.predictors import SentenceTaggerPredictor
from allennlp.data.dataset_readers import conll2003

from seqeval.metrics import recall_score
from seqeval.metrics import precision_score
from seqeval.metrics import classification_report
from seqeval.metrics import f1_score

torch.manual_seed(1)

<torch._C.Generator at 0x7f9a3b026b30>

The LSTM tagger is a basic model: it takes the embedded representation of each token in a sentence and passes the sequence through a sequence-to-sequence encoder (here an LSTM), which produces one output vector per token.

In the last assignment you represented a word by its count in the document. It is also possible to map a word onto a continuous domain, producing a word embedding that captures either its context or its interactions with other words in a higher-dimensional space. For instance, consider the sentence 'The dog is running after a car'. Here, *running* can be represented by the words around it, namely *is* and *after*. This gives us an understanding of *running* in the context of its occurrence in the sentence. There are many possible ways to create embeddings, and this is a separate field of research in itself.
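
As a rough illustration of what "mapping a word onto a continuous domain" means, the sketch below builds a toy lookup table with one dense vector per word. The names `word_to_id`, `embedding_table` and `embed` are made up for this example, and the vectors are random; in the model used later, the table is learned during training.

```python
import random

random.seed(1)

sentence = "The dog is running after a car".split()

# Toy vocabulary: map each distinct word to an integer id.
word_to_id = {word: idx for idx, word in enumerate(sorted(set(sentence)))}

EMBEDDING_DIM = 6  # same toy size the notebook uses later

# One dense vector per word id. Random here; learned during training in practice.
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(EMBEDDING_DIM)]
    for _ in range(len(word_to_id))
]

def embed(words):
    """Look up the dense vector for each word of a sentence."""
    return [embedding_table[word_to_id[w]] for w in words]

embedded = embed(sentence)
print(len(embedded), len(embedded[0]))  # number of words, numbers per word
```

Each word is now a point in a 6-dimensional space instead of a single count, which is the form of input the LSTM encoder consumes.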

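Finally, the `forward` function implements the model's forward computation. A linear layer (`hidden2tag`) maps each encoder output to one score per tag and, when gold tags are supplied, the function also computes a cross-entropy loss, since our labels are categorical. `tag_logits` are unnormalized scores in the log domain; applying a softmax to them gives a probability for each tag.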
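
To make the relationship between logits, probabilities and the loss concrete, here is a minimal pure-Python sketch for a single sentence. It is not AllenNLP's `sequence_cross_entropy_with_logits` (which also handles batches); the helper names `softmax` and `masked_cross_entropy` and the numbers are assumptions made for illustration.

```python
import math

def softmax(logits):
    """Turn unnormalized log-domain scores into probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def masked_cross_entropy(tag_logits, tags, mask):
    """Average negative log-probability of the gold tag over unmasked tokens."""
    total, count = 0.0, 0
    for logits, gold, keep in zip(tag_logits, tags, mask):
        if not keep:
            continue  # padding positions do not contribute to the loss
        total -= math.log(softmax(logits)[gold])
        count += 1
    return total / max(count, 1)

# Three token positions, three tags; the last position is padding.
tag_logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0], [0.0, 0.0, 0.0]]
tags = [0, 2, 1]
mask = [1, 1, 0]
loss = masked_cross_entropy(tag_logits, tags, mask)
print(loss)
```

The mask plays the same role as the one returned by `get_text_field_mask`: padded positions are skipped so that they do not affect the loss.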

In [3]:
class LstmTagger(Model):
  def __init__(self,
               word_embeddings: TextFieldEmbedder,
               encoder: Seq2SeqEncoder,
               vocab: Vocabulary) -> None:
    super().__init__(vocab)
    self.word_embeddings = word_embeddings
    self.encoder = encoder
    self.hidden2tag = torch.nn.Linear(in_features=encoder.get_output_dim(),
                                      out_features=vocab.get_vocab_size('labels'))
    self.accuracy = CategoricalAccuracy()

  def forward(self,
              tokens: Dict[str, torch.Tensor],
              metadata,
              tags: torch.Tensor = None) -> Dict[str, torch.Tensor]:
    mask = get_text_field_mask(tokens)
    embeddings = self.word_embeddings(tokens)
    encoder_out = self.encoder(embeddings, mask)
    tag_logits = self.hidden2tag(encoder_out)
    output = {"tag_logits": tag_logits}
    if tags is not None:
      self.accuracy(tag_logits, tags, mask)
      output["loss"] = sequence_cross_entropy_with_logits(tag_logits, tags, mask)

    return output

  def get_metrics(self, reset: bool = False) -> Dict[str, float]:
    return {"accuracy": self.accuracy.get_metric(reset)}

CoNLL 2003 is a standard NER dataset that also includes POS tags, and AllenNLP provides a reader for its file format. A DatasetReader in AllenNLP converts a dataset file into instances, which are objects with multiple fields. For example, a single instance of the training or validation data here contains three fields: the sentence tokens, metadata, and the associated tags. These classes are used to create a standard pipeline for the flow of data.

In [4]:
reader = conll2003.Conll2003DatasetReader()
train_dataset = reader.read(cached_path('http://www.ccs.neu.edu/home/dasmith/onto.train.ner.sample'))
validation_dataset = reader.read(cached_path('http://www.ccs.neu.edu/home/dasmith/onto.development.ner.sample'))
print('\nFirst sample instance from training dataset:\n', train_dataset[0])

from itertools import chain
vocab = Vocabulary.from_instances(chain(train_dataset, validation_dataset))

159377B [00:00, 1806026.94B/s]
562it [00:00, 18005.70it/s]
8366B [00:00, 4611584.61B/s]
23it [00:00, 8101.87it/s]
585it [00:00, 31610.06it/s]


First sample instance from training dataset:
 Instance with fields:
 	 tokens: TextField of length 5 with text: 
 		[What, kind, of, memory, ?]
 		and TokenIndexers : {'tokens': 'SingleIdTokenIndexer'} 
 	 metadata: MetadataField (print field.metadata to see specific information). 
 	 tags: SequenceLabelField of length 5 with labels:
 		['O', 'O', 'O', 'O', 'O']
 		in namespace: 'labels'. 






In [5]:
EMBEDDING_DIM = 6
HIDDEN_DIM = 6
token_embedding = Embedding(num_embeddings=vocab.get_vocab_size('tokens'),
                            embedding_dim=EMBEDDING_DIM)
word_embeddings = BasicTextFieldEmbedder({"tokens": token_embedding})
lstm = PytorchSeq2SeqWrapper(torch.nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM, bidirectional=False, batch_first=True))
model = LstmTagger(word_embeddings, lstm, vocab)
if torch.cuda.is_available():
    cuda_device = 0
    model = model.cuda(cuda_device)
else:
    cuda_device = -1
# optimizer = optim.AdamW(model.parameters(), lr=1e-4, eps=1e-8)
optimizer = optim.SGD(model.parameters(), lr=0.1)
iterator = BucketIterator(batch_size=2, sorting_keys=[("tokens", "num_tokens")])
iterator.index_with(vocab)
trainer = Trainer(model=model,
                  optimizer=optimizer,
                  iterator=iterator,
                  train_dataset=train_dataset,
                  validation_dataset=validation_dataset,
                  patience=10,
                  num_epochs=100,
                  cuda_device=cuda_device)
trainer.train()

accuracy: 0.8442, loss: 0.9096 ||: 100%|██████████| 281/281 [00:01<00:00, 226.88it/s]
accuracy: 0.7878, loss: 1.1957 ||: 100%|██████████| 12/12 [00:00<00:00, 440.61it/s]
accuracy: 0.8442, loss: 0.7290 ||: 100%|██████████| 281/281 [00:01<00:00, 278.56it/s]
accuracy: 0.7878, loss: 1.1993 ||: 100%|██████████| 12/12 [00:00<00:00, 452.86it/s]
accuracy: 0.8442, loss: 0.7162 ||: 100%|██████████| 281/281 [00:01<00:00, 275.62it/s]
accuracy: 0.7878, loss: 1.1955 ||: 100%|██████████| 12/12 [00:00<00:00, 467.62it/s]
accuracy: 0.8442, loss: 0.7065 ||: 100%|██████████| 281/281 [00:01<00:00, 278.34it/s]
accuracy: 0.7878, loss: 1.1663 ||: 100%|██████████| 12/12 [00:00<00:00, 466.28it/s]
accuracy: 0.8442, loss: 0.6982 ||: 100%|██████████| 281/281 [00:01<00:00, 270.51it/s]
accuracy: 0.7878, loss: 1.1688 ||: 100%|██████████| 12/12 [00:00<00:00, 473.18it/s]
accuracy: 0.8442, loss: 0.6904 ||: 100%|██████████| 281/281 [00:01<00:00, 279.65it/s]
accuracy: 0.7878, loss: 1.1509 ||: 100%|██████████| 12/12 [00:00

{'best_epoch': 99,
 'best_validation_accuracy': 0.8755102040816326,
 'best_validation_loss': 0.44149550279447186,
 'epoch': 99,
 'peak_cpu_memory_MB': 2575.904,
 'peak_gpu_0_memory_MB': 903,
 'training_accuracy': 0.924266497193094,
 'training_cpu_memory_MB': 2575.904,
 'training_duration': '0:01:50.754801',
 'training_epochs': 99,
 'training_gpu_0_memory_MB': 903,
 'training_loss': 0.21472710906341505,
 'training_start_epoch': 0,
 'validation_accuracy': 0.8755102040816326,
 'validation_loss': 0.44149550279447186}

`tag_sentence` returns, for each word in the sentence instance `s`, the token, its correct (gold) tag, and the tag predicted by the model. `model.forward_on_instance(s)['tag_logits']` is a 2D matrix of shape n_words x n_tags; taking the argmax over the last axis gives the highest-scoring tag for each word.
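
The argmax step can be seen in isolation on a small made-up matrix. The tag set in `id_to_tag` and the scores below are hypothetical, chosen only to show how one row of scores per word becomes one predicted tag per word.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=lambda i: row[i])

# Hypothetical label vocabulary: tag id -> tag string.
id_to_tag = {0: "O", 1: "B-PERSON", 2: "I-PERSON"}

# One row per word, one column per tag (n_words x n_tags).
tag_logits = [
    [0.1, 2.3, -0.5],
    [-1.0, 0.2, 1.7],
    [3.0, 0.0, 0.1],
]

predicted = [id_to_tag[argmax(row)] for row in tag_logits]
print(predicted)  # ['B-PERSON', 'I-PERSON', 'O']
```

Because each word is decided independently, nothing in this step prevents an invalid sequence such as `O` followed by `I-PERSON`.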



In [10]:
def tag_sentence(s):
  tag_ids = np.argmax(model.forward_on_instance(s)['tag_logits'], axis=-1)
  fields = zip(s['tokens'], s['tags'], [model.vocab.get_token_from_index(i, 'labels') for i in tag_ids])
  return list(fields)

baseline_output = [tag_sentence(i) for i in validation_dataset]

## Show the first example of an instance passed to tag_sentence
print(validation_dataset[0])

## Show the first example of output
print('\n Corresponding output:\n')
baseline_output[0]

Instance with fields:
 	 tokens: TextField of length 27 with text: 
 		[With, a, wave, of, his, hand, ,, Peng, Dehuai, said, that, despite, being, over, 100, regiments, ,,
		let, 's, call, this, campaign, the, Hundred, Regiments, Offensive, .]
 		and TokenIndexers : {'tokens': 'SingleIdTokenIndexer'} 
 	 metadata: MetadataField (print field.metadata to see specific information). 
 	 tags: SequenceLabelField of length 27 with labels:
 		['O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-PERSON', 'I-PERSON', 'O', 'O', 'O', 'O', 'O', 'B-CARDINAL',
		'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-EVENT', 'I-EVENT', 'I-EVENT', 'I-EVENT', 'O']
 		in namespace: 'labels'. 


 Corresponding output:



[(With, 'O', 'O'),
 (a, 'O', 'O'),
 (wave, 'O', 'O'),
 (of, 'O', 'O'),
 (his, 'O', 'O'),
 (hand, 'O', 'O'),
 (,, 'O', 'O'),
 (Peng, 'B-PERSON', 'B-PERSON'),
 (Dehuai, 'I-PERSON', 'I-EVENT'),
 (said, 'O', 'O'),
 (that, 'O', 'O'),
 (despite, 'O', 'O'),
 (being, 'O', 'O'),
 (over, 'O', 'O'),
 (100, 'B-CARDINAL', 'B-CARDINAL'),
 (regiments, 'O', 'O'),
 (,, 'O', 'O'),
 (let, 'O', 'O'),
 ('s, 'O', 'O'),
 (call, 'O', 'O'),
 (this, 'O', 'O'),
 (campaign, 'O', 'O'),
 (the, 'B-EVENT', 'O'),
 (Hundred, 'I-EVENT', 'I-EVENT'),
 (Regiments, 'I-EVENT', 'I-EVENT'),
 (Offensive, 'I-EVENT', 'I-EVENT'),
 (., 'O', 'O')]

A violation is an invalid transition between the tags of two adjacent words in a sentence (or an invalid tag on the first word).
Sample violations:

1.   B-PER I-LOC
2.   I-PER I-LOC
3.   O I-PER
4.   A sentence beginning with I-TYPE
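
The sample violations above can be expressed as a check on a pair of adjacent tags. This is only a sketch under the BIO rules listed in this tutorial: the function names, and the convention of passing `None` as the previous tag at the start of a sentence, are assumptions made for illustration, and the input is a plain list of tag strings rather than the tagger output.

```python
def is_a_violation(prev_tag, tag):
    """True if `tag` may not follow `prev_tag` under the BIO scheme.

    `prev_tag` is None at the start of a sentence (an assumption of this sketch).
    """
    if not tag.startswith("I-"):
        return False  # O and B-TYPE are valid anywhere
    if prev_tag is None or prev_tag == "O":
        return True  # I-TYPE needs a preceding B-TYPE or I-TYPE
    return prev_tag[2:] != tag[2:]  # the types must match

def count_violations(tags):
    """Count invalid transitions in one sentence's list of tags."""
    count, prev = 0, None
    for tag in tags:
        if is_a_violation(prev, tag):
            count += 1
        prev = tag
    return count

predicted = ["O", "B-PERSON", "I-EVENT", "O", "O", "I-EVENT", "I-EVENT"]
print(count_violations(predicted))  # 2: B-PERSON -> I-EVENT and O -> I-EVENT
```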

Evaluation scores are based on valid spans. Words tagged outside of any span do not contribute to the scores.
Examples of valid spans for the corresponding words:

1.   Peng Dehuai: [B-PERSON, I-PERSON]
2.   100 : [B-CARDINAL]

Precision = Number of spans identified correctly by model/Number of spans predicted by model

Recall = Number of spans identified correctly by model/Number of labelled spans

If you use seqeval, valid spans will be automatically extracted by the library for a given list of predictions and labels.
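
For intuition about what a span-level score measures, the computation can be sketched by hand for one sentence. The helpers `extract_spans` and `span_prf` are hypothetical names, and the strict rule used here (a span is a B tag followed by zero or more I tags of the same type) may differ from how seqeval's default extraction treats ill-formed sequences, so it is worth comparing the two on a small example.

```python
def extract_spans(tags):
    """Return the set of valid spans as (start, end_exclusive, type)."""
    spans = set()
    start, span_type = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and tag[2:] == span_type
        if start is not None and not continues:
            spans.add((start, i, span_type))
            start, span_type = None, None
        if tag.startswith("B-"):
            start, span_type = i, tag[2:]
    return spans

def span_prf(gold_tags, pred_tags):
    """Span-level precision, recall and F1 for one sentence."""
    gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
    correct = len(gold & pred)  # a span counts only if boundaries and type match
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1

gold = ["B-PERSON", "I-PERSON", "O", "B-CARDINAL", "O", "B-EVENT", "I-EVENT"]
pred = ["B-PERSON", "I-PERSON", "O", "B-CARDINAL", "O", "O", "I-EVENT"]
precision, recall, f1 = span_prf(gold, pred)
print(precision, recall, f1)
```

Here the model predicts two spans, both correct, but misses the EVENT span, so precision is 2/2 and recall is 2/3.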



In [None]:
# TODO: count the number of NER label violations,
# such as O followed by I-TYPE or B-TYPE followed by
# I-OTHER_TYPE
# Take tagger output as input

# Valid NER tag sequences:
# B-TYPE: always valid
# O: always valid
# I-TYPE:
#       a) If it is the first I-TYPE of a span, the previous tag should be B-TYPE of the same type
#       b) If it is a subsequent I-TYPE, the previous tag should be I-TYPE of the same type
def violations(tagged):
    n_violations = 0
    # Write code here
    return n_violations

# TODO: return the span-level precision, recall, and F1
# Only count valid spans that start with a B tag,
# followed by zero or more I tags of the same type.
# This is harsher than the token-level metric that the
# LSTM was trained to optimize, but it is the standard way
# of evaluating NER systems.
# Take tagger output as input
def span_stats(tagged):
    # Write code here
    return {'precision': precision,
            'recall': recall,
            'f1': f1}

**Viterbi Decoding**

You can use either a graph or dynamic programming to implement the Viterbi decoder. The pseudocode below is for dynamic programming, but you are free to implement a graph-based procedure instead.

In [None]:
# Pseudocode as a guide for implementation

# For each sentence s in validation dataset:

  # Create a k x k transition matrix, where k is the number of tags [O, B-PER, etc.]
  # (it is the same for every sentence, so it can also be built once before the loop)
  # transition_matrix = np.zeros((n_tags, n_tags))
  # Fill the matrix, in the log domain, based on possible transitions from a tag to another tag
  # for tag_index_1 in range(len(tags)):
  #   for tag_index_2 in range(len(tags)):
  #     if is_a_violation(tags[tag_index_1], tags[tag_index_2]):
  #       transition_matrix[tag_index_1][tag_index_2] = -np.inf  # log(0); np.log(0) gives the same value but raises a warning
  #     else:
  #       transition_matrix[tag_index_1][tag_index_2] = 0.0      # log(1)

  # Get emission scores from the model for an instance s [instance as a class has been discussed above]
  # emission_scores = model.forward_on_instance(s)['tag_logits']
  # n_words = emission_scores.shape[0]  # len(s) would count the instance's fields, not its words
  # Construct a 2D DP table and a 2D backtracking table, both of shape n_words x n_tags
  # dp_table[0] = emission_scores[0], with -np.inf for tags that cannot start a sentence (I-TYPE)
  # for word_index in range(1, n_words):
  #   for tag_index in range(len(tags)):
  #     values = dp_table[word_index - 1] + transition_matrix[:, tag_index]  # one value per previous tag
  #     dp_table[word_index][tag_index] = max(values) + emission_scores[word_index][tag_index]
  #     backtracking_table[word_index][tag_index] = argmax(values)
  #
  # Start from the argmax of the last row of dp_table and follow backtracking_table back to the first word to get the list of tags for s.
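
The dynamic-programming procedure can be tried out on a toy example before wiring it to the model. The sketch below is pure Python and is not the reference solution: the tag set and emission scores are made up, `is_a_violation` is a hypothetical helper following the BIO rules described earlier, and `None` as the previous tag stands for the start of the sentence.

```python
NEG_INF = float("-inf")

def is_a_violation(prev_tag, tag):
    """True if `tag` may not follow `prev_tag` (None = start of sentence)."""
    if not tag.startswith("I-"):
        return False
    if prev_tag is None or prev_tag == "O":
        return True
    return prev_tag[2:] != tag[2:]

def viterbi(emission_scores, tags):
    """Best-scoring tag sequence that contains no violations."""
    n_words, n_tags = len(emission_scores), len(tags)
    if n_words == 0:
        return []
    # Log-domain transitions: 0.0 (log 1) if allowed, -inf (log 0) if not.
    transition = [[NEG_INF if is_a_violation(p, t) else 0.0 for t in tags]
                  for p in tags]
    dp = [[NEG_INF] * n_tags for _ in range(n_words)]
    back = [[0] * n_tags for _ in range(n_words)]
    for t in range(n_tags):  # first word: only tags that may start a sentence
        if not is_a_violation(None, tags[t]):
            dp[0][t] = emission_scores[0][t]
    for w in range(1, n_words):
        for t in range(n_tags):
            values = [dp[w - 1][p] + transition[p][t] for p in range(n_tags)]
            best_prev = max(range(n_tags), key=lambda p: values[p])
            dp[w][t] = values[best_prev] + emission_scores[w][t]
            back[w][t] = best_prev
    # Trace back from the best final tag.
    best = max(range(n_tags), key=lambda t: dp[-1][t])
    path = [best]
    for w in range(n_words - 1, 0, -1):
        best = back[w][best]
        path.append(best)
    path.reverse()
    return [tags[t] for t in path]

tags = ["O", "B-PER", "I-PER", "I-LOC"]
emission_scores = [
    [0.1, 2.0, 0.0, 0.0],
    [0.2, 0.0, 1.0, 1.5],  # greedy argmax would pick I-LOC here
    [1.0, 0.0, 0.3, 0.2],
]
print(viterbi(emission_scores, tags))  # ['B-PER', 'I-PER', 'O']
```

A per-word argmax on the same scores gives `B-PER, I-LOC, O`, which contains a violation; the decoder instead picks the highest-scoring sequence among the valid ones.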