# Lab 4: Recurrent models

This lab gives you some initial practice with neural models in NLP.

**This is the complete Lab 4, in two parts.** The purpose of the first part of the lab is to get you started with using neural models. The second part of the lab contains exercises on ELMo embeddings, applying them to the task of word sense disambiguation following the approach from the original paper by Peters et al.


## Part 1 (50 points)

In the first part of Lab 4, we will play with training a recurrent model for part-of-speech tagging. As an easy exercise, you will observe what happens when you plug pretrained word embeddings into a neural NLP model and will experiment with different sizes of training data.

## Exercise 1: prepare the data (5 points)

Linguistic data come in a variety of formats. You already had a chance to play with POS-annotated corpus data in Lab 1.

In the first exercise, you will access POS-annotated data in one format (NLTK) and save it on the disk in a text format. Start with the tagged sentences from the Brown corpus, which can be retrieved as below:

In [1]:
import sys 
print(sys.version)

3.6.10 |Anaconda, Inc.| (default, May  7 2020, 23:06:31) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]


In [2]:
import random
import nltk
nltk.corpus.brown.tagged_sents()


[[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'), ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'), ('no', 'AT'), ('evidence', 'NN'), ("''", "''"), ('that', 'CS'), ('any', 'DTI'), ('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN'), ('.', '.')], [('The', 'AT'), ('jury', 'NN'), ('further', 'RBR'), ('said', 'VBD'), ('in', 'IN'), ('term-end', 'NN'), ('presentments', 'NNS'), ('that', 'CS'), ('the', 'AT'), ('City', 'NN-TL'), ('Executive', 'JJ-TL'), ('Committee', 'NN-TL'), (',', ','), ('which', 'WDT'), ('had', 'HVD'), ('over-all', 'JJ'), ('charge', 'NN'), ('of', 'IN'), ('the', 'AT'), ('election', 'NN'), (',', ','), ('``', '``'), ('deserves', 'VBZ'), ('the', 'AT'), ('praise', 'NN'), ('and', 'CC'), ('thanks', 'NNS'), ('of', 'IN'), ('the', 'AT'), ('City', 'NN-TL'), ('of', 'IN-TL'), ('Atlant

In [2]:

len(nltk.corpus.brown.tagged_sents())
#randomsents = random.shuffle(nltk.corpus.brown.sents())




57340

Now randomize the order of all sentences in the corpus using the <code>random.shuffle()</code> function and split them into 50K sentences for training, 5K for validation, and the rest for testing.

In [3]:
#Write your code here
tagged_sents = list(nltk.corpus.brown.tagged_sents())
random.shuffle(tagged_sents)
training_brown= tagged_sents[:50000]
validation_brown=tagged_sents[50000:55000]
testing_brown=tagged_sents[55000:]
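
Because <code>random.shuffle()</code> gives a different order on every run, the partitions above change each time the notebook is executed. The sketch below shows one possible way to make the split reproducible. The helper name <code>shuffle_and_split</code> and the <code>seed</code> parameter are illustrative assumptions, not part of the lab code, and the example uses a toy list instead of the Brown corpus.

```python
import random

def shuffle_and_split(items, n_train, n_valid, seed=0):
    """Shuffle a copy of `items` reproducibly and cut it into three partitions."""
    shuffled = list(items)                 # copy, so the caller's sequence is untouched
    random.Random(seed).shuffle(shuffled)  # shuffles in place and returns None
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]    # whatever is left over
    return train, valid, test

# Toy stand-in for the corpus: 20 "sentences" split 14 / 4 / 2.
train, valid, test = shuffle_and_split(range(20), n_train=14, n_valid=4)
print(len(train), len(valid), len(test))
```

With the real data, the same call would take <code>nltk.corpus.brown.tagged_sents()</code> with <code>n_train=50000</code> and <code>n_valid=5000</code>.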

Define a function for saving your datasets to a text file in the following format:
* one sentence per line
* tokens separated by spaces
* POS tag separated from the token by "###", for example <code>said###VBD</code>.

In [4]:
def write_posdata(sentences,outfile):
    # one sentence per line, each token written as word###tag followed by a space
    with open(outfile, "w") as f:
        for sentence in sentences:
            for word, tag in sentence:
                f.write(word+"###"+tag+" ")
            f.write("\n")

write_posdata(training_brown[:50],"train_brown_50.txt")  

# read the file back to check the format
with open("train_brown_50.txt", "r") as f:
    print(f.read())

He###PPS had###HVD to###TO write###VB very###QL small###JJ to###TO get###VB it###PPO on###IN the###AT bottom###NN of###IN the###AT scrap###NN of###IN paper###NN .###. 
but###CC ,###, as###CS he###PPS had###HVD a###AT special###JJ fondness###NN for###IN magic###NN and###CC divination###NN ,###, he###PPS ordered###VBD that###CS books###NNS on###IN these###DTS subjects###NNS should###MD be###BE spared###VBN .###. 
For###IN large###JJ letters###NNS ,###, e.g.###RB thermoformed###VBN of###IN acrylic###NN or###CC butyrate###NN ,###, there###EX are###BER other###AP techniques###NNS .###. 
Congress###NP reacted###VBD with###IN a###AT series###NN of###IN measures###NNS modifying###VBG in###IN various###AP ways###NNS what###WDT it###PPS had###HVD granted###VBN in###IN 1875###CD .###. 
Fixed###VBN monthly###JJ allowances###NNS are###BER reimbursements###NNS for###IN the###AT same###AP purpose###NN except###IN on###IN a###AT non-itemized###JJ basis###NN .###. 
Education###NN must###MD not###* be##
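
To check that the saved format can be read back without loss, a line can be split into (token, tag) pairs again. This is only a minimal sketch: the function name <code>parse_posdata_line</code> is a hypothetical helper, and the actual dataset reader used later in the lab may work differently.

```python
def parse_posdata_line(line, separator="###"):
    """Turn 'He###PPS had###HVD' back into [('He', 'PPS'), ('had', 'HVD')]."""
    pairs = []
    for item in line.split():
        # rsplit with maxsplit=1 cuts at the last separator only, so a token
        # that itself contains '###' would not be broken apart
        word, tag = item.rsplit(separator, 1)
        pairs.append((word, tag))
    return pairs

example = "He###PPS had###HVD to###TO write###VB .###. "
print(parse_posdata_line(example))
```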

Now save your data partitions in different sizes. We will start with small data samples since training on a large dataset may be very slow depending on your machine.

In [5]:
write_posdata(training_brown,"train_brown.txt")
write_posdata(testing_brown,"test_brown.txt")
write_posdata(validation_brown,"validation_brown.txt")
write_posdata(training_brown[:50],"train_brown_50.txt")
write_posdata(validation_brown[:50],"validation_brown_50.txt")
write_posdata(training_brown[:500],"train_brown_500.txt")
write_posdata(validation_brown[:500],"validation_brown_500.txt")
write_posdata(training_brown[:5000],"train_brown_5000.txt")

Congratulations, you have now saved the POS tagged data for model training purposes!

## Exercise 2: train neural POS tagger models (35 points)

We will now play with a neural model. First of all, install <code>allennlp</code>. The LSTM model we will train follows the AllenNLP tutorial https://allennlp.org/tutorials, which contains ample explanations of the underlying code. Let us start by loading the model code and data, beginning with a tiny sample for demonstration purposes:


In [1]:
from lstm_tutorial import *

train_dataset_tiny = reader.read("train_brown_50.txt")
validation_dataset_tiny = reader.read("validation_brown_50.txt")

50it [00:00, 9074.65it/s]
50it [00:00, 13997.81it/s]


First of all, we need to initialize the vocabulary and define an embedding (vector) for each token. We set the embedding size to 300, a common choice in realistic applications. By default, the embeddings are initialized randomly and updated during training (this can be changed, but we start with a standard configuration). We also need to specify the <code>HIDDEN_DIM</code> parameter: the dimensionality of the hidden vector representations in the LSTM cell.

In [2]:
vocab_tiny = Vocabulary.from_instances(train_dataset_tiny + validation_dataset_tiny)

EMBEDDING_DIM = 300
HIDDEN_DIM = 20

token_embedding_tiny = Embedding(num_embeddings=vocab_tiny.get_vocab_size('tokens'),
                            embedding_dim=EMBEDDING_DIM)

100%|██████████| 100/100 [00:00<00:00, 33681.07it/s]


Download the smallest pretrained word vector model from https://nlp.stanford.edu/projects/glove/, unzip it, and place the relevant file <code>'glove.6B.300d.txt'</code> in your working directory.
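
For orientation, each line of the GloVe text file holds a word followed by its vector components, separated by spaces. The sketch below reads that layout for a given vocabulary. The function name <code>load_glove_vectors</code> and the toy lines are illustrative assumptions; in the lab itself the loading is done by AllenNLP, not by this code.

```python
def load_glove_vectors(lines, vocabulary):
    """Read GloVe-style text lines and keep the vectors of words in `vocabulary`."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if word in vocabulary:
            vectors[word] = [float(v) for v in values]
    return vectors

# Two made-up 3-dimensional entries in the same layout as the GloVe file.
toy_lines = ["the 0.1 -0.2 0.3\n", "jury 0.5 0.0 -1.5\n"]
toy_vectors = load_glove_vectors(toy_lines, vocabulary={"jury", "election"})
print(toy_vectors)
```

Here "election" is in the vocabulary but not in the toy file, so it gets no pretrained vector. With the real file, the same function could be given an open file handle instead of a list of lines.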

In [3]:
glove_token_embedding_tiny = Embedding.from_params(vocab=vocab_tiny,
                            params=Params({'pretrained_file':'glove.6B.300d.txt',
                                           'embedding_dim' : EMBEDDING_DIM}))

400000it [00:01, 304314.64it/s]


Now from embedding a single word with <code>token_embedding_tiny</code> we can proceed to mapping a word sequence into a sequence of vectors:

In [4]:
word_embeddings_tiny = BasicTextFieldEmbedder({"tokens": token_embedding_tiny})

The following initializes the parameters of an LSTM model that uses the <code>word_embeddings_tiny</code> input encoding:

In [5]:
lstm = PytorchSeq2SeqWrapper(torch.nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM, batch_first=True))

model_tiny = LstmTagger(word_embeddings_tiny, lstm, vocab_tiny)

Now define an LSTM model called <code>glove_model_tiny</code> that uses <code>glove_token_embedding_tiny</code>:

In [6]:
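#write your code here
glove_word_embeddings_tiny = BasicTextFieldEmbedder({"tokens": glove_token_embedding_tiny})
# use a fresh LSTM so that the two models do not share (and co-train) the same weights
glove_lstm = PytorchSeq2SeqWrapper(torch.nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM, batch_first=True))
glove_model_tiny = LstmTagger(glove_word_embeddings_tiny, glove_lstm, vocab_tiny)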

Train the basic model for the tiny dataset:

In [None]:
basic_trainer_tiny=initialize_trainer(model_tiny,vocab_tiny,train_dataset_tiny,validation_dataset_tiny,batch_size=50)
basic_trainer_tiny.train()

You have trained an LSTM POS tagger for the basic model. Now train the <code>glove_model_tiny</code>.

In [None]:
glove_trainer_tiny=initialize_trainer(glove_model_tiny,vocab_tiny,train_dataset_tiny,validation_dataset_tiny,batch_size=50)
glove_trainer_tiny.train()

## Exercise 3: Explore training parameters (10 points)

Create separate models on the basis of bigger datasets: first 500 training and 500 validation sentences, then 5000 training and 5000 validation sentences. Using the full training set (50K sentences) is optional (your machine might be too slow). Initialize and train the basic model on the 500-sentence training and 500-sentence validation data:

In [None]:
train_dataset_500 = reader.read("train_brown_500.txt")
validation_dataset_500 = reader.read("validation_brown_500.txt")
vocab_500 = Vocabulary.from_instances(train_dataset_500 + validation_dataset_500)
EMBEDDING_DIM = 300
HIDDEN_DIM = 20
lstm = PytorchSeq2SeqWrapper(torch.nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM, batch_first=True))
token_embedding_500 = Embedding(num_embeddings=vocab_500.get_vocab_size('tokens'),
                            embedding_dim=EMBEDDING_DIM)
word_embeddings_500 = BasicTextFieldEmbedder({"tokens": token_embedding_500})
model_500 = LstmTagger(word_embeddings_500, lstm, vocab_500)
basic_trainer_500 = initialize_trainer(model_500, vocab_500, train_dataset_500, validation_dataset_500, batch_size=50)
basic_trainer_500.train()

Now do the same training (500 sentence training and 500 sentence validation sets) with GloVe embeddings:

In [None]:
train_dataset_500 = reader.read("train_brown_500.txt")
validation_dataset_500 = reader.read("validation_brown_500.txt")
vocab_500 = Vocabulary.from_instances(train_dataset_500 + validation_dataset_500)
EMBEDDING_DIM = 300
HIDDEN_DIM = 20
lstm = PytorchSeq2SeqWrapper(torch.nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM, batch_first=True))

glove_token_embedding_500 = Embedding.from_params(vocab=vocab_500,
                            params=Params({'pretrained_file':'glove.6B.300d.txt',
                                           'embedding_dim' : EMBEDDING_DIM}))

glove_word_embeddings_500 = BasicTextFieldEmbedder({"tokens": glove_token_embedding_500})
glove_model_500 = LstmTagger(glove_word_embeddings_500, lstm, vocab_500)
glove_trainer_500 = initialize_trainer(glove_model_500, vocab_500, train_dataset_500, validation_dataset_500, batch_size=50)
glove_trainer_500.train()


Use a bigger training set now with 5K sentence training and 5K sentence validation sets and random initial embeddings:

In [None]:
train_dataset_5k = reader.read("train_brown_5000.txt")
validation_dataset_5k = reader.read("validation_brown.txt")
vocab_5k = Vocabulary.from_instances(train_dataset_5k + validation_dataset_5k)
EMBEDDING_DIM = 300
HIDDEN_DIM = 20
lstm = PytorchSeq2SeqWrapper(torch.nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM, batch_first=True))
token_embedding_5k = Embedding(num_embeddings=vocab_5k.get_vocab_size('tokens'),
                            embedding_dim=EMBEDDING_DIM)
word_embeddings_5k = BasicTextFieldEmbedder({"tokens": token_embedding_5k})
model_5k = LstmTagger(word_embeddings_5k, lstm, vocab_5k)
basic_trainer_5k = initialize_trainer(model_5k, vocab_5k, train_dataset_5k, validation_dataset_5k, batch_size=50)
basic_trainer_5k.train()

Now do the same training (5K sentence training and 5K sentence validation sets) with GloVe embeddings:

In [None]:
train_dataset_5k = reader.read("train_brown_5000.txt")
validation_dataset_5k = reader.read("validation_brown.txt")
vocab_5k = Vocabulary.from_instances(train_dataset_5k + validation_dataset_5k)
EMBEDDING_DIM = 300
HIDDEN_DIM = 20
lstm = PytorchSeq2SeqWrapper(torch.nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM, batch_first=True))

glove_token_embedding_5k = Embedding.from_params(vocab=vocab_5k,
                            params=Params({'pretrained_file':'glove.6B.300d.txt',
                                           'embedding_dim' : EMBEDDING_DIM}))

glove_word_embeddings_5k = BasicTextFieldEmbedder({"tokens": glove_token_embedding_5k})
glove_model_5k = LstmTagger(glove_word_embeddings_5k, lstm, vocab_5k)
glove_trainer_5k = initialize_trainer(glove_model_5k, vocab_5k, train_dataset_5k, validation_dataset_5k, batch_size=50)
glove_trainer_5k.train()


For each trained model, record validation accuracy and training duration (they are returned along with other training stats after training a model) and accuracy on the training set. Fill in the numbers in the table below:

| model | validation accuracy | training accuracy | training duration |
|-------|---------------------|-------------------|-------------------|
| basic model on 50 sentences | 0.38638454461821525 | 0.41870350690754515 | 0:01:28.96586 |
| glove model on 50 sentences | 0.5087396504139834 | 0.6801275239107333 | 0:01:31.767727 |
| basic model on 500 sentences | 0.7383440514469454 | 0.9234220135628586 | 0:07:46.176068 |
| glove model on 500 sentences | 0.7901929260450161 | 0.9209181011997913 | 0:08:14.402893 |
| basic model on 5000 sentences | 0.8557469889128181 | 0.856524686926697 | 0:28:55.007167 |
| glove model on 5000 sentences | 0.871809444045625 | 0.9148496445064127 | 0:35:22.978479 |
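
Copying the numbers by hand is error-prone, so a table row can also be produced from the statistics returned after training. The sketch below assumes that these statistics come as a dictionary with the keys <code>validation_accuracy</code>, <code>training_accuracy</code> and <code>training_duration</code>; check the actual output of your trainer, since the key names here are an assumption. <code>metrics_to_row</code> is a hypothetical helper.

```python
def metrics_to_row(label, metrics):
    """Format one table row from a dictionary of training statistics."""
    return "| {} | {:.4f} | {:.4f} | {} |".format(
        label,
        metrics["validation_accuracy"],
        metrics["training_accuracy"],
        metrics["training_duration"],
    )

# Values copied by hand from the first row of the table above, not a live result.
example_metrics = {
    "validation_accuracy": 0.38638454461821525,
    "training_accuracy": 0.41870350690754515,
    "training_duration": "0:01:28.96586",
}
print(metrics_to_row("basic model on 50 sentences", example_metrics))
```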

**Question.** What do you conclude from these comparisons? When can it be especially beneficial to initialize a model with pretrained embeddings?

**Answer.** 

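Pretrained word embeddings are trained on large datasets, so they already capture syntactic and semantic properties of words before the tagger sees any POS-annotated data. They can boost the performance of the LSTM model: in every configuration above, the model initialized with GloVe embeddings reaches a higher validation accuracy than the basic model. The gain shrinks as the training set grows, from about 12 points with 50 sentences (0.39 vs. 0.51) to about 5 points with 500 sentences and under 2 points with 5000 sentences. Initializing a model with pretrained embeddings is therefore especially beneficial when little annotated training data is available.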
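
The comparison can be made explicit by computing, for each training set size, the difference in validation accuracy between the GloVe model and the basic model. The numbers below are copied from the table above; the dictionary layout and the name <code>glove_gain</code> are illustrative.

```python
# Validation accuracies copied from the results table, keyed by training set size.
validation_accuracy = {
    50:   {"basic": 0.38638454461821525, "glove": 0.5087396504139834},
    500:  {"basic": 0.7383440514469454,  "glove": 0.7901929260450161},
    5000: {"basic": 0.8557469889128181,  "glove": 0.871809444045625},
}

def glove_gain(scores):
    """Absolute gain in validation accuracy from pretrained embeddings."""
    return scores["glove"] - scores["basic"]

for size, scores in validation_accuracy.items():
    print("{:>5} sentences: +{:.3f}".format(size, glove_gain(scores)))
```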

During training, data is processed in batches so that the model performs computation for multiple examples simultaneously. How does batching affect model training? Modify the training to use different batch sizes: batches of 5 and of 500 instead of 50. How does this affect the results?
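
One way to think about batch size is to count how often the model is updated in one pass over the data. The sketch below assumes the usual mini-batch setup, with one parameter update per batch; <code>updates_per_epoch</code> is a hypothetical helper, not part of the lab code.

```python
import math

def updates_per_epoch(num_sentences, batch_size):
    """Number of batches (and hence parameter updates) in one pass over the data."""
    return math.ceil(num_sentences / batch_size)

for num_sentences in (50, 500):
    for batch_size in (5, 50, 500):
        print("{} sentences, batches of {}: {} update(s) per epoch".format(
            num_sentences, batch_size, updates_per_epoch(num_sentences, batch_size)))
```

Note that with batches of 500 both datasets fit into a single batch, so the model is updated only once per epoch.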

In [None]:
#Define your trainers with alternative batching here: batches of 5, 50 sentences

train_dataset_50 = reader.read("train_brown_50.txt")
validation_dataset_50 = reader.read("validation_brown_50.txt")
vocab_50 = Vocabulary.from_instances(train_dataset_50 + validation_dataset_50)
EMBEDDING_DIM = 300
HIDDEN_DIM = 20
lstm = PytorchSeq2SeqWrapper(torch.nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM, batch_first=True))
token_embedding_50 = Embedding(num_embeddings=vocab_50.get_vocab_size('tokens'),
                            embedding_dim=EMBEDDING_DIM)
word_embeddings_50 = BasicTextFieldEmbedder({"tokens": token_embedding_50})
model_50 = LstmTagger(word_embeddings_50, lstm, vocab_50)
basic_trainer_50_b5 = initialize_trainer(model_50, vocab_50, train_dataset_50, validation_dataset_50, batch_size=5)
basic_trainer_50_b5.train()

In [None]:
# batches of 5, 500 sentences

train_dataset_500 = reader.read("train_brown_500.txt")
validation_dataset_500 = reader.read("validation_brown_500.txt")
vocab_500 = Vocabulary.from_instances(train_dataset_500 + validation_dataset_500)
EMBEDDING_DIM = 300
HIDDEN_DIM = 20
lstm = PytorchSeq2SeqWrapper(torch.nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM, batch_first=True))
token_embedding_500 = Embedding(num_embeddings=vocab_500.get_vocab_size('tokens'),
                            embedding_dim=EMBEDDING_DIM)
word_embeddings_500 = BasicTextFieldEmbedder({"tokens": token_embedding_500})
model_500 = LstmTagger(word_embeddings_500, lstm, vocab_500)
basic_trainer_500_b5 = initialize_trainer(model_500, vocab_500, train_dataset_500, validation_dataset_500, batch_size=5)
basic_trainer_500_b5.train()

In [None]:
#batches of 500, 50 sentences

train_dataset_50 = reader.read("train_brown_50.txt")
validation_dataset_50 = reader.read("validation_brown_50.txt")
vocab_50 = Vocabulary.from_instances(train_dataset_50 + validation_dataset_50)
EMBEDDING_DIM = 300
HIDDEN_DIM = 20
lstm = PytorchSeq2SeqWrapper(torch.nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM, batch_first=True))
token_embedding_50 = Embedding(num_embeddings=vocab_50.get_vocab_size('tokens'),
                            embedding_dim=EMBEDDING_DIM)
word_embeddings_50 = BasicTextFieldEmbedder({"tokens": token_embedding_50})
model_50 = LstmTagger(word_embeddings_50, lstm, vocab_50)
basic_trainer_50_b500 = initialize_trainer(model_50, vocab_50, train_dataset_50, validation_dataset_50, batch_size=500)
basic_trainer_50_b500.train()

In [None]:
#batches of 500, 500 sentences

train_dataset_500 = reader.read("train_brown_500.txt")
validation_dataset_500 = reader.read("validation_brown_500.txt")
vocab_500 = Vocabulary.from_instances(train_dataset_500 + validation_dataset_500)
EMBEDDING_DIM = 300
HIDDEN_DIM = 20
lstm = PytorchSeq2SeqWrapper(torch.nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM, batch_first=True))
token_embedding_500 = Embedding(num_embeddings=vocab_500.get_vocab_size('tokens'),
                            embedding_dim=EMBEDDING_DIM)
word_embeddings_500 = BasicTextFieldEmbedder({"tokens": token_embedding_500})
model_500 = LstmTagger(word_embeddings_500, lstm, vocab_500)
basic_trainer_500_b500 = initialize_trainer(model_500, vocab_500, train_dataset_500, validation_dataset_500, batch_size=500)
basic_trainer_500_b500.train()

Report your results below:

**batches of 5**:

| model | validation accuracy | training accuracy | training duration |
|-------|---------------------|-------------------|-------------------|
| basic model on 50 sentences | 0.5915363385464582 | 0.926673751328374 | 0:00:56.758152 |
| basic model on 500 sentences | 0.7405546623794212 | 0.9348982785602504 | 0:03:01.415810 |

**batches of 500**:

| model | validation accuracy | training accuracy | training duration |
|-------|---------------------|-------------------|-------------------|
| basic model on 50 sentences | 0.40754369825206993 | 0.4357066950053135 | 0:01:39.887773 |
| basic model on 500 sentences | 0.37932073954983925 | 0.38935837245696403 | 0:13:36.935827 |

**Question.** What do these results tell you?

**Answer.**

The network parameters are updated once after each batch has been passed through the model, so the batch size determines how many updates the model receives per pass over the data.

With a batch size of 5, the network weights are updated after every 5 training examples. This can have the effect of faster learning: on 50 sentences, validation accuracy rose from 0.39 (batches of 50) to 0.59, and training took less time. It also adds instability to the learning process, because each update is based on only 5 sentences and the weights can vary widely from one batch to the next.

With a batch size of 500, all 50 or all 500 training sentences fit into a single batch, so the model is updated only once per epoch. Learning is then very slow: the model trained on 500 sentences reached a validation accuracy of only 0.38, compared to 0.74 with batches of 50. Therefore, a batch size that is too large for the dataset leads to poor results.




## Comment 
In this lab we used pretrained GloVe embeddings in a model for part-of-speech tagging. GloVe is itself a learned word embedding model, but it was trained on a completely different objective. GloVe vectors were optimised on word cooccurrence matrix decomposition, i.e. on the task of predicting which words tend to occur with which other words. Part of speech certainly plays a role in determining the statistical cooccurrence of words, but this role is indirect, and explicit part-of-speech information was not used in training GloVe.

This makes our application an example of **transfer learning**, whereby a learned model trained on one objective (e.g. word cooccurrence) can benefit a different application (e.g. POS tagging), because some information is shared between them. 

## Part 2 - ELMo vectors (50 points)

In the second part of this lab we will reproduce the word sense disambiguation strategy that the authors of the ELMo vectors explored. The strategy consists in the following:

- create ELMo embeddings for all tokens in a sense-annotated corpus
- calculate mean sense vectors for each word sense in the training partition of the corpus
- for each sense-annotated token in the test partition of the corpus, assign it to the sense of the word to which its ELMo vector is the closest according to the cosine distance metric
- as a backup strategy, use the 1st sense of the word by default.
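
The last two steps of the strategy can be sketched with plain Python lists as follows. This is a toy illustration under stated assumptions, not the code of the original paper: the function names, the two-dimensional vectors and the sense labels are made up, and the candidate senses are assumed to be ordered so that the first one is the default sense.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def predict_sense(token_vector, candidate_senses, mean_sense_vectors):
    """Pick the candidate sense whose mean vector is closest to the token vector."""
    seen = [s for s in candidate_senses if s in mean_sense_vectors]
    if not seen:
        # backup strategy: no candidate sense was observed in training
        return candidate_senses[0]
    # smallest cosine distance is the same as largest cosine similarity
    return max(seen, key=lambda s: cosine_similarity(token_vector, mean_sense_vectors[s]))

# Made-up mean sense vectors, as if computed from a training partition.
toy_means = {"bank.n.01": [1.0, 0.0], "bank.n.02": [0.0, 1.0]}

print(predict_sense([0.2, 0.9], ["bank.n.01", "bank.n.02"], toy_means))
print(predict_sense([0.2, 0.9], ["plant.n.01", "plant.n.02"], toy_means))
```

The first call returns the sense with the nearest mean vector; the second falls back to the first listed sense because neither candidate has a mean vector.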

As a sense annotated corpus, we can use SemCor, conveniently available within NLTK. <code>semcor.sents()</code> iterates over all sentences represented as lists of tokens, while <code>semcor.tagged_sents()</code> iterates over the same sentences with additional annotation including WordNet lemma identifiers (lemmas in WordNet stand for a word taken in a specific sense).

In [78]:
import nltk
import random
from nltk.corpus import semcor
from nltk.stem import WordNetLemmatizer
from nltk.tree import Tree
from nltk.corpus import wordnet as wn
from collections import defaultdict
import numpy
import torch

semcor.sents()
semcor.tagged_sents(tag="sem")

[[['The'], Tree(Lemma('group.n.01.group'), [Tree('NE', ['Fulton', 'County', 'Grand', 'Jury'])]), Tree(Lemma('state.v.01.say'), ['said']), Tree(Lemma('friday.n.01.Friday'), ['Friday']), ['an'], Tree(Lemma('probe.n.01.investigation'), ['investigation']), ['of'], Tree(Lemma('atlanta.n.01.Atlanta'), ['Atlanta']), ["'s"], Tree(Lemma('late.s.03.recent'), ['recent']), Tree(Lemma('primary.n.01.primary_election'), ['primary', 'election']), Tree(Lemma('produce.v.04.produce'), ['produced']), ['``'], ['no'], Tree(Lemma('evidence.n.01.evidence'), ['evidence']), ["''"], ['that'], ['any'], Tree(Lemma('abnormality.n.04.irregularity'), ['irregularities']), Tree(Lemma('happen.v.01.take_place'), ['took', 'place']), ['.']], [['The'], Tree(Lemma('jury.n.01.jury'), ['jury']), Tree(Lemma('far.r.02.far'), ['further']), Tree(Lemma('state.v.01.say'), ['said']), ['in'], Tree(Lemma('term.n.02.term'), ['term']), Tree(Lemma('end.n.02.end'), ['end']), Tree(Lemma('presentment.n.01.presentment'), ['presentments']), ['

## Exercise 1. Extract relevant data from SemCor (5 points)

First, split all the sentences in SemCor randomly into 90% training and 10% testing partitions:

In [2]:
semcor_tagged_sents = list(nltk.corpus.semcor.tagged_sents(tag="sem"))
random.shuffle(semcor_tagged_sents)

semcor_train= semcor_tagged_sents[:int(0.9 * len(semcor_tagged_sents))]
semcor_test= semcor_tagged_sents[int(0.9 * len(semcor_tagged_sents)):]

print("semcor_tagged_sents:", len(semcor_tagged_sents))
print("semcor_train:", len(semcor_train))
print("semcor_test:", len(semcor_test))

semcor_tagged_sents: 37176
semcor_train: 33458
semcor_test: 3718


Create a function that takes as input a sentence from SemCor and extracts a list which contains, for each token of the sentence, either the corresponding WordNet Lemma (e.g. <code>Lemma('friday.n.01.Friday')</code>) or <code>None</code>. <code>None</code> corresponds to tokens that are either 1) not annotated for word senses (e.g. articles) or 2) marked up as (part of) a named entity (e.g. "City of Atlanta" or the placename "Fulton", annotated as <code>Tree(Lemma('location.n.01.location'), [Tree('NE', ['Fulton'])])</code>).

In [3]:
def get_lemmas(input_tagged_sentence):
    # Note: the placeholder is the string "None", and annotated tokens are kept
    # as whole trees whose label is the WordNet Lemma.
    lemmas = []

    for index, chunk in enumerate(input_tagged_sentence):
        if not isinstance(chunk, nltk.tree.Tree):
            # token without sense annotation (e.g. an article)
            lemmas.append("None")
        elif isinstance(chunk[0], str):
            # sense-annotated token(s): the children are plain strings
            lemmas.append(chunk)
        elif chunk[0].pos()[0][1] == 'NE':
            # named entity: the annotated tree wraps an 'NE' subtree
            lemmas.append("None")
        else:
            print("Error at index:\n", index, "\n")

    return lemmas

In [4]:
# test function 
semcor_tagged_sents = semcor.tagged_sents(tag="sem")
print(get_lemmas(semcor_tagged_sents[0]))



['None', 'None', Tree(Lemma('state.v.01.say'), ['said']), Tree(Lemma('friday.n.01.Friday'), ['Friday']), 'None', Tree(Lemma('probe.n.01.investigation'), ['investigation']), 'None', Tree(Lemma('atlanta.n.01.Atlanta'), ['Atlanta']), 'None', Tree(Lemma('late.s.03.recent'), ['recent']), Tree(Lemma('primary.n.01.primary_election'), ['primary', 'election']), Tree(Lemma('produce.v.04.produce'), ['produced']), 'None', 'None', Tree(Lemma('evidence.n.01.evidence'), ['evidence']), 'None', 'None', 'None', Tree(Lemma('abnormality.n.04.irregularity'), ['irregularities']), Tree(Lemma('happen.v.01.take_place'), ['took', 'place']), 'None']


You are now able to extract word senses (instantiated by WordNet lemmas) from the corpus. The next step is to associate senses with ELMo vectors. Create a dictionary of contextualized token embeddings from the training corpus grouped by the WordNet sense:

In [5]:
Train_embeddings=defaultdict(list) 

Now let's create contextualized ELMo word embeddings for the tokens in this corpus. We can load the pretrained ELMo model and define a function <code>sentences_to_elmo()</code> that receives a list of tokenized sentences as input and produces their ELMo vectors.

In [6]:
from allennlp.modules.elmo import Elmo, batch_to_ids

options_file = "https://allennlp.s3.amazonaws.com/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_options.json"
weight_file = "https://allennlp.s3.amazonaws.com/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5"
elmo = Elmo(options_file, weight_file, 1, dropout=0)

def sentences_to_elmo(sentences): 
    character_ids = batch_to_ids(sentences)
    embeddings = elmo(character_ids)
    return embeddings 

Now you can process the corpus sentences and produce their ELMo vectors. It is recommended to pass the input to the ELMo encoder in batches. A suggested batch size is 50 sentences. For example, the code below processes the first 50 sentences from the corpus:

In [7]:
sentences=semcor.sents()[:50]

embeddings=sentences_to_elmo(sentences)

The <code>embeddings</code> that we obtained is a dictionary that contains a list of ELMo embeddings and a list of masks. The mask tells us which embeddings correspond to tokens in the original input sentences and which correspond to the padding (introduced to give all sentences in the batch the same length).
In principle all embeddings are stored in PyTorch tensors so that they can be used in bigger neural models, but we are not going to do it now. For our purposes, PyTorch tensors can be converted to numpy arrays:

In [8]:
embeddings['elmo_representations'][0].detach().numpy() 

array([[[-6.46188855e-03,  6.02140278e-03, -3.55983436e-01, ...,
         -1.17147304e-02,  7.04263002e-02, -4.18728709e-01],
        [-3.77808213e-01,  2.81414628e-01, -2.58361459e-01, ...,
         -4.85472798e-01,  2.55084008e-01,  3.63812745e-02],
        [ 9.11907077e-01,  1.17794800e+00, -8.48333716e-01, ...,
          9.84723091e-01,  3.36747169e-01,  1.61717817e-01],
        ...,
        [ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        [ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        [ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00]],

       [[-6.46188855e-03,  6.02140278e-03, -3.55983436e-01, ...,
         -4.48764861e-02,  1.13127291e-01, -9.96282995e-02],
        [ 1.37208998e-01, -2.00027555e-01, -1.30738422e-01, ...,
          5.94822645e-01,  9.33864832e

We can check the size of the embeddings we got. The tensor has three dimensions: 1) the number of sentences; 2) the number of tokens (the length of the longest original sentence in the batch; shorter ones were padded); 3) the size of each ELMo vector (1024).

In [9]:
embeddings['elmo_representations'][0].detach().size() # 3D: num sents, num tokens, embedding size

torch.Size([50, 59, 1024])

The <code>embeddings</code> dictionary also contains the mask, a tensor encoding which token vectors correspond to original tokens and which are padding:

In [10]:
embeddings['mask'][1]

tensor([ True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False])
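
To see where padding and masks come from, the toy sketch below pads a batch of sentences of different lengths and then uses the mask to undo the padding. It works on plain lists of strings rather than tensors, and <code>pad_batch</code>, <code>unpad</code> and the <code>"&lt;pad&gt;"</code> token are illustrative assumptions (in the ELMo output above, padding positions hold zero vectors).

```python
def pad_batch(sentences, pad_token="<pad>"):
    """Pad all sentences to the length of the longest one and build a mask."""
    max_len = max(len(s) for s in sentences)
    padded, mask = [], []
    for s in sentences:
        n_pad = max_len - len(s)
        padded.append(list(s) + [pad_token] * n_pad)
        mask.append([True] * len(s) + [False] * n_pad)  # True marks a real token
    return padded, mask

def unpad(padded, mask):
    """Keep only the positions where the mask is True."""
    return [[item for item, keep in zip(row, row_mask) if keep]
            for row, row_mask in zip(padded, mask)]

toy_batch = [["The", "jury", "said", "."], ["Yes", "."]]
toy_padded, toy_mask = pad_batch(toy_batch)
print(toy_padded)
print(toy_mask)
print(unpad(toy_padded, toy_mask))
```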

## Exercise 2. Extract ELMo encoding of sentences using a mask (5 points)  

Now define a function <code>get_masked_vectors(embeddings)</code> that takes embeddings as input and returns a list of ELMo sentence encodings to which the mask has been applied, i.e. where the padding vectors have been removed so the representation of each sentence contains as many vectors as there were tokens in the original sentence.

In [11]:
def get_masked_vectors(embeddings): # dict with 'elmo_representations' and 'mask'
    representations = embeddings['elmo_representations'][0]
    mask = embeddings['mask']
    masked_sentences = []

    for sentence_vectors, sentence_mask in zip(representations, mask):
        # keep a vector only if its mask entry is True, i.e. it is a real token
        tokens = [vector for vector, keep in zip(sentence_vectors, sentence_mask) if keep]
        masked_sentences.append(tokens)

    return masked_sentences # list of ELMo sentence encodings, where paddings are removed

In [11]:
# test the function on the first 50 sentences
sentences=semcor.sents()[:50] 
embeddings=sentences_to_elmo(sentences)
new_embeddings = get_masked_vectors(embeddings)


# check that the 50 sentences have different lengths once the padding is removed
for s in range(0, 50):
    print("len sent {}: {}".format(s+1, len(new_embeddings[s])))

len sent 1: 26
len sent 2: 44
len sent 3: 36
len sent 4: 37
len sent 5: 25
len sent 6: 24
len sent 7: 43
len sent 8: 26
len sent 9: 25
len sent 10: 14
len sent 11: 15
len sent 12: 28
len sent 13: 25
len sent 14: 59
len sent 15: 23
len sent 16: 25
len sent 17: 17
len sent 18: 35
len sent 19: 33
len sent 20: 34
len sent 21: 35
len sent 22: 33
len sent 23: 9
len sent 24: 31
len sent 25: 28
len sent 26: 35
len sent 27: 22
len sent 28: 6
len sent 29: 9
len sent 30: 20
len sent 31: 15
len sent 32: 17
len sent 33: 17
len sent 34: 20
len sent 35: 11
len sent 36: 14
len sent 37: 17
len sent 38: 14
len sent 39: 11
len sent 40: 30
len sent 41: 23
len sent 42: 23
len sent 43: 37
len sent 44: 34
len sent 45: 20
len sent 46: 28
len sent 47: 31
len sent 48: 52
len sent 49: 22
len sent 50: 10


## Exercise 3. Collect ELMo vectors from the training corpus (15 points)

Process the corpus, updating your train word sense vectors. Iterate over all the training sentences in the corpus, and retrieve for each lemma-annotated token (where the lemma is not <code>None</code>) the corresponding ELMo vector. Store the ELMo sense embeddings that correspond to each lemma in the dictionary <code>Train_embeddings</code>.
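
The grouping step can be illustrated on toy data. The sketch below collects vectors per sense in a <code>defaultdict(list)</code> and then averages them; the sense labels, the two-dimensional vectors and the helper <code>mean_vector</code> are made up for illustration and stand in for WordNet lemmas and 1024-dimensional ELMo vectors.

```python
from collections import defaultdict

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# (sense, vector) pairs as they might be collected token by token.
toy_pairs = [
    ("bank.n.01", [1.0, 0.0]),
    ("bank.n.01", [3.0, 2.0]),
    ("bank.n.02", [0.0, 4.0]),
]

toy_train_embeddings = defaultdict(list)
for sense, vector in toy_pairs:
    # defaultdict creates the empty list the first time a sense is seen,
    # so no membership check is needed before appending
    toy_train_embeddings[sense].append(vector)

toy_means = {sense: mean_vector(vectors) for sense, vectors in toy_train_embeddings.items()}
print(toy_means)
```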

In [33]:
def get_lemma_annotated_tokens(sentence): 
    # keep only the sense-annotated tokens, dropping the 'None' placeholders
    return [item for item in get_lemmas(sentence) if item != 'None']

In [430]:
from collections import defaultdict
import numpy as np

def store_training_set(training_set):
    lemma_annotated_token_sents = []

    # create list of lists of lemma-annotated tokens
    for i, sentence in enumerate(training_set):
        print("sentence", i+1, ': \n', sentence, "\n")

        sent = get_lemma_annotated_tokens(sentence)
        lemma_annotated_token_sents.append(sent)
        print(len(sent), "lemma-annotated tokens :\n", sent, "\n\n")

    # get embeddings for each sentence
    for sent in lemma_annotated_token_sents:
        senses_of_sent = [tree.label() for tree in sent]
        tokens_of_sent = [tree[:] for tree in sent]

        # embed the annotated tokens and remove paddings
        embeddings_elmo = sentences_to_elmo(tokens_of_sent)
        embeddings_without_paddings = get_masked_vectors(embeddings_elmo)

        # store embeddings in Train_embeddings grouped by WordNet sense
        for sense, embeddings in zip(senses_of_sent, embeddings_without_paddings):

            # ignore multiword expressions
            if len(embeddings) != 1:
                continue

            # Use the vector paired with this sense directly: a separate index
            # falls out of step as soon as a multiword expression is skipped.
            if sense in Train_embeddings:
                # later occurrences are appended as one-element lists
                Train_embeddings[sense].append(embeddings)
            else:
                # first occurrence: the one-element list becomes the container
                Train_embeddings[sense] = embeddings

    return Train_embeddings

In [432]:
# test function: check multiple vectors per sense
new_data = semcor_train[:50] + semcor_train[:10]

store_training_set(new_data)


# check total senses in dict
print('Total senses in dict:', len(Train_embeddings))


# check total vectors in dict
for key, value in Train_embeddings.items():
    print('Total items of', key, ':', len(value))
    

sentence 1 : 
 [['A'], ['new'], ['radial'], ['drill'], ['press'], ['with'], ['a'], ['16'], ['inch'], ['capacity'], Tree(Lemma('have.v.02.have'), ['has']), ['a'], Tree(Lemma('lean.v.01.tilt'), ['tilting']), ['head'], ['that'], Tree(Lemma('let.v.01.allow'), ['allows']), ['drilling'], ['to'], ['be'], Tree(Lemma('make.v.01.do'), ['done']), ['at'], ['any'], ['angle'], ['.']] 

4 lemma-annotated tokens :
 [Tree(Lemma('have.v.02.have'), ['has']), Tree(Lemma('lean.v.01.tilt'), ['tilting']), Tree(Lemma('let.v.01.allow'), ['allows']), Tree(Lemma('make.v.01.do'), ['done'])] 


sentence 2 : 
 [['I'], ['used', 'to'], Tree(Lemma('love.v.01.love'), ['love']), ['this'], Tree(Lemma('country.n.02.country'), ['country']), ['and'], Tree(Lemma('believe.v.03.believe'), ['believe']), ['that'], Tree(Lemma('someday.r.01.someday'), ['someday']), ['we'], ["'d"], Tree(Lemma('win.v.01.win'), ['win']), ['our'], Tree(Lemma('struggle.n.01.battle'), ['battle']), ['for'], Tree(Lemma('equality.n.01.equality'), ['equalit

In [433]:
# check keys
Train_embeddings.keys()

dict_keys([Lemma('have.v.02.have'), Lemma('lean.v.01.tilt'), Lemma('let.v.01.allow'), Lemma('make.v.01.do'), Lemma('love.v.01.love'), Lemma('country.n.02.country'), Lemma('believe.v.03.believe'), Lemma('someday.r.01.someday'), Lemma('win.v.01.win'), Lemma('struggle.n.01.battle'), Lemma('equality.n.01.equality'), Lemma('state.v.01.say'), Lemma('be.v.01.be'), Lemma('happen.v.01.happen'), Lemma('wonder.v.02.wonder'), Lemma('applicability.n.01.applicability'), Lemma('people.n.01.people'), Lemma('make.v.01.make'), Lemma('transport.v.02.carry'), 'think.v.1;2', Lemma('visualize.v.01.see'), Lemma('ask.v.04.expect'), Lemma('come.v.03.come'), Lemma('leave.v.06.leave'), Lemma('unmask.v.01.unmask'), Lemma('uncover.v.01.reveal'), Lemma('never.r.01.never'), Lemma('bishop.n.01.bishop'), Lemma('anabaptist.n.01.Anabaptist'), Lemma('afraid.a.01.afraid'), Lemma('state.v.01.state'), Lemma('religion.n.01.faith'), Lemma('know.v.01.know'), Lemma('publish.v.03.write'), Lemma('book.n.01.book'), 'belief.n.00', 

## Exercise 4. Vector averaging (5 points)

Now you can calculate the average ELMo vector for each word sense in the training corpus:
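One way to think about the averaging step is as a running sum plus a count per sense, which avoids keeping every occurrence vector in memory. The sketch below is only an illustration under simplifying assumptions: plain Python lists stand in for the 1024-dimensional ELMo tensors, the sense keys are toy strings, and the names `sense_sums`, `sense_counts`, `add_vector` and `mean_vectors` are hypothetical helpers that are not part of the lab code.

```python
from collections import defaultdict

# Hypothetical helpers for illustration; plain Python lists stand in for
# the 1024-dimensional ELMo tensors used in the lab.
sense_sums = {}                   # sense -> element-wise sum of its vectors
sense_counts = defaultdict(int)   # sense -> number of vectors seen so far


def add_vector(sense, vector):
    """Add one occurrence vector to the running sum for a sense."""
    if sense in sense_sums:
        sense_sums[sense] = [a + b for a, b in zip(sense_sums[sense], vector)]
    else:
        sense_sums[sense] = list(vector)
    sense_counts[sense] += 1


def mean_vectors():
    """Return a dict mapping each sense to its average vector."""
    return {sense: [x / sense_counts[sense] for x in total]
            for sense, total in sense_sums.items()}


# Toy data: two occurrences of one sense, one occurrence of another.
add_vector("run.v.01.run", [1.0, 2.0])
add_vector("run.v.01.run", [3.0, 4.0])
add_vector("run.n.01.run", [0.0, 1.0])

averages = mean_vectors()
print(averages["run.v.01.run"])  # [2.0, 3.0]
```

With tensors, the same idea would amount to adding each new vector to a stored sum and dividing by the count at the end; storing all vectors and averaging afterwards gives the same result at a higher memory cost.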

In [None]:
# process complete train dataset
print("total sentences:", len(semcor_train))


In [None]:
# reset dicts for testing
Train_embeddings=defaultdict(list) 
Train_counter=defaultdict(list) 

In [414]:
# check addition of multiple vectors
new_data = semcor_train[:2] + semcor_train[:2]
len(new_data)

store_training_set(new_data)

sentence 0 : 
 [['A'], ['new'], ['radial'], ['drill'], ['press'], ['with'], ['a'], ['16'], ['inch'], ['capacity'], Tree(Lemma('have.v.02.have'), ['has']), ['a'], Tree(Lemma('lean.v.01.tilt'), ['tilting']), ['head'], ['that'], Tree(Lemma('let.v.01.allow'), ['allows']), ['drilling'], ['to'], ['be'], Tree(Lemma('make.v.01.do'), ['done']), ['at'], ['any'], ['angle'], ['.']] 

4 lemma-annotated tokens :
 [Tree(Lemma('have.v.02.have'), ['has']), Tree(Lemma('lean.v.01.tilt'), ['tilting']), Tree(Lemma('let.v.01.allow'), ['allows']), Tree(Lemma('make.v.01.do'), ['done'])] 


sentence 1 : 
 [['I'], ['used', 'to'], Tree(Lemma('love.v.01.love'), ['love']), ['this'], Tree(Lemma('country.n.02.country'), ['country']), ['and'], Tree(Lemma('believe.v.03.believe'), ['believe']), ['that'], Tree(Lemma('someday.r.01.someday'), ['someday']), ['we'], ["'d"], Tree(Lemma('win.v.01.win'), ['win']), ['our'], Tree(Lemma('struggle.n.01.battle'), ['battle']), ['for'], Tree(Lemma('equality.n.01.equality'), ['equalit

defaultdict(list,
            {Lemma('have.v.02.have'): [tensor([ 0.0837,  0.3644, -0.0595,  ...,  0.3183,  0.1400,  0.7171],
                     grad_fn=<SelectBackward>),
              [tensor([ 0.0837,  0.3644, -0.0595,  ...,  0.3183,  0.1400,  0.7171],
                      grad_fn=<SelectBackward>)],
              [tensor([ 0.0837,  0.3644, -0.0595,  ...,  0.3183,  0.1400,  0.7171],
                      grad_fn=<SelectBackward>)],
              [tensor([ 0.0837,  0.3644, -0.0595,  ...,  0.3183,  0.1400,  0.7171],
                      grad_fn=<SelectBackward>)]],
             Lemma('lean.v.01.tilt'): [tensor([-0.0495,  0.4396,  0.4416,  ...,  0.4333,  0.3886,  0.3021],
                     grad_fn=<SelectBackward>),
              [tensor([-0.0495,  0.4396,  0.4416,  ...,  0.4333,  0.3886,  0.3021],
                      grad_fn=<SelectBackward>)],
              [tensor([-0.0495,  0.4396,  0.4416,  ...,  0.4333,  0.3886,  0.3021],
                      grad_fn=<SelectBackward>)],

In [213]:
list_keys = list(Train_embeddings.keys())
print(list_keys)

for sense in range(0, len(list_keys)-1):
    print(list_keys[sense])

[Lemma('have.v.02.have'), Lemma('lean.v.01.tilt'), Lemma('let.v.01.allow'), Lemma('make.v.01.do'), Lemma('love.v.01.love'), Lemma('country.n.02.country'), Lemma('believe.v.03.believe'), Lemma('someday.r.01.someday'), Lemma('win.v.01.win'), Lemma('struggle.n.01.battle'), Lemma('equality.n.01.equality'), 0]
Lemma('have.v.02.have')
Lemma('lean.v.01.tilt')
Lemma('let.v.01.allow')
Lemma('make.v.01.do')
Lemma('love.v.01.love')
Lemma('country.n.02.country')
Lemma('believe.v.03.believe')
Lemma('someday.r.01.someday')
Lemma('win.v.01.win')
Lemma('struggle.n.01.battle')
Lemma('equality.n.01.equality')


In [1]:
# for each sense, calculate average
import torch
from collections import defaultdict


Train_avg = defaultdict(list)

def averaging_vectors(Train_embeddings):
    for sense, items in Train_embeddings.items():

        # entries may be tensors or one-element lists holding a tensor
        vectors = [item[0] if isinstance(item, (list, tuple)) else item
                   for item in items]

        # skip keys that hold no vectors
        if not vectors:
            continue

        # stack into shape (n, 1024) and average over the n occurrences
        stacked = torch.stack(vectors).detach()
        Train_avg[sense] = torch.mean(stacked, dim=0)

    return Train_avg

averaging_vectors(Train_embeddings)
    

NameError: name 'Train_embeddings' is not defined

In [434]:
# check dict

print("Keys Train_counter:\n")
for sense, count in Train_counter.items():
    print(sense, count)
print("\n")
print("Keys Train_embeddings:\n")
for sense, embeddings in Train_embeddings.items():
    print(sense, embeddings)

Keys Train_counter:



Keys Train_embeddings:

Lemma('have.v.02.have') [tensor([ 0.0837,  0.3644, -0.0595,  ...,  0.3183,  0.1400,  0.7171],
       grad_fn=<SelectBackward>), [tensor([ 0.5255,  0.0992, -0.4613,  ...,  0.3177, -0.0881,  0.3643],
       grad_fn=<SelectBackward>)], [tensor([ 0.0888,  0.3886, -0.0399,  ...,  0.3210,  0.1381,  0.7103],
       grad_fn=<SelectBackward>)], [tensor([ 0.0964,  0.3704, -0.0487,  ...,  0.3199,  0.1402,  0.7152],
       grad_fn=<SelectBackward>)], [tensor([ 0.5319,  0.0918, -0.4677,  ...,  0.3177, -0.0881,  0.3643],
       grad_fn=<SelectBackward>)], [tensor([ 0.0888,  0.3886, -0.0399,  ...,  0.3210,  0.1381,  0.7103],
       grad_fn=<SelectBackward>)]]
Lemma('lean.v.01.tilt') [tensor([-0.0495,  0.4396,  0.4416,  ...,  0.4333,  0.3886,  0.3021],
       grad_fn=<SelectBackward>), [tensor([-0.0506,  0.4356,  0.4227,  ...,  0.4280,  0.3890,  0.3081],
       grad_fn=<SelectBackward>)], [tensor([-0.0605,  0.4861,  0.4689,  ...,  0.4351,  0.3883,  0.3006

## Exercise 5. Testing the sense vectors (20 points)

Test your sense embeddings on your test data, which is a subset of the SemCor corpus. Use the strategy outlined above, with 1st WordNet sense as a fallback: 

- rely on mean sense vectors for each word sense in the training partition of the corpus
- for each sense-annotated token <i>t</i> (e.g. the verb "run") in the test partition of the corpus, assign it the sense of the word "Lemma(*.v*.run)" whose mean vector is closest to the ELMo vector of <i>t</i> according to the cosine distance metric
- as a backup strategy, use the 1st sense of the word (e.g. <code>Lemma(run.v.01.run)</code>) by default.

Report WSD accuracy in percentage points on your test data.
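The strategy above can be sketched on toy data as follows. This is a minimal illustration, not a reference solution: it assumes that the mean sense vectors are already available in a dictionary, uses 2-dimensional Python lists instead of ELMo tensors, and introduces the hypothetical helpers `cosine_similarity`, `predict_sense` and `accuracy` together with made-up sense keys. The candidate senses and the first sense are passed in explicitly here; in the lab they would come from WordNet.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def predict_sense(vector, candidate_senses, mean_vectors, first_sense):
    """Pick the candidate sense whose mean vector is closest to `vector`.

    Falls back to `first_sense` when no candidate was seen in training.
    """
    seen = [s for s in candidate_senses if s in mean_vectors]
    if not seen:
        return first_sense
    # the smallest cosine distance is the largest cosine similarity
    return max(seen, key=lambda s: cosine_similarity(vector, mean_vectors[s]))


def accuracy(predictions, gold):
    """Percentage of predictions that match the gold senses."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)


# Toy data: (token vector, candidate senses, first sense, gold sense).
toy_means = {"run.v.01": [1.0, 0.0], "run.v.02": [0.0, 1.0]}
test_items = [
    ([0.9, 0.1], ["run.v.01", "run.v.02"], "run.v.01", "run.v.01"),
    ([0.2, 0.8], ["run.v.01", "run.v.02"], "run.v.01", "run.v.02"),
    ([0.5, 0.5], ["walk.v.01", "walk.v.02"], "walk.v.01", "walk.v.02"),
]

predictions = [predict_sense(vec, cands, toy_means, first)
               for vec, cands, first, _ in test_items]
gold = [g for _, _, _, g in test_items]

print(predictions)
print(accuracy(predictions, gold))
```

In this toy run the third token has no candidate sense in the training dictionary, so the first-sense fallback is used and happens to be wrong, which is why the reported accuracy is two out of three.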

In [None]:
# X

## The end
Congratulations! This is the end of Lab 4.

**Acknowledgements** Tejaswini Deoskar has given valuable comments that helped improve this lab assignment. Timothee Mickus helped to test this assignment and gave extensive feedback on the instructions. Many thanks to both.