# Word Sense Disambiguation using Neural Networks
Adam Ek

----------------------------------

The lab is an exploration and learning exercise to be done in a group and also in discussion with the teachers and other students.

Before starting, please read the instructions on [how to work on group assignments](https://github.com/sdobnik/computational-semantics/blob/master/README.md).

Write all your answers and the code in the appropriate boxes below.

----------------------------------

A problem with static distributional vectors is the difficulty of distinguishing between different *word senses*. We will continue our exploration of word vectors by considering *trainable vectors* or *word embeddings* for Word Sense Disambiguation (WSD).

The goal of word sense disambiguation is to train a model that identifies which sense of a word form is intended in a given context (a single word form can have several senses, as with homonyms). For example, the word "bank" can mean "sloping land" or "financial institution". 

(a) "I deposited my money in the **bank**" (financial institution)

(b) "I swam from the river **bank**" (sloping land)

In (a) and (b) we can determine the meaning of "bank" from the *context*. To use context in a semantic model we need *contextualized word representations*. Previously we worked with *static word representations*, i.e. representations that do not depend on the context. In sentences (a) and (b), the word **bank** would have the same static representation, which makes it difficult to predict its sense. What we want instead are representations that depend on the context, i.e. *contextualized embeddings*. 
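
To make the difference concrete, here is a minimal sketch using made-up two-dimensional vectors. The function `toy_contextual_vector` is a hypothetical stand-in that simply averages the target word with its known neighbours; it is not how an RNN computes its representations, it only illustrates that the result depends on the sentence.

```python
# Toy static embeddings: one fixed, made-up vector per word form.
static = {
    "money": [1.0, 0.0],
    "river": [0.0, 1.0],
    "bank":  [0.5, 0.5],
}


def static_vector(tokens, index):
    """A static representation looks only at the word itself."""
    return static[tokens[index]]


def toy_contextual_vector(tokens, index):
    """Toy stand-in for a contextual encoder: average the target word's
    vector with the vectors of the other known words in the sentence."""
    target = static[tokens[index]]
    context = [static[tok] for i, tok in enumerate(tokens)
               if i != index and tok in static]
    mixed = [target] + context
    return [sum(dim) / len(mixed) for dim in zip(*mixed)]


sent_a = "i deposited my money in the bank".split()
sent_b = "i swam from the river bank".split()
idx_a, idx_b = len(sent_a) - 1, len(sent_b) - 1   # position of "bank"

static_a, static_b = static_vector(sent_a, idx_a), static_vector(sent_b, idx_b)
ctx_a, ctx_b = toy_contextual_vector(sent_a, idx_a), toy_contextual_vector(sent_b, idx_b)

print(static_a == static_b)  # True: same vector in both sentences
print(ctx_a == ctx_b)        # False: the vector now depends on the context
```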

We will create contextualized embeddings with Recurrent Neural Networks. You can read more about recurrent neural networks [here](https://colah.github.io/posts/2015-08-Understanding-LSTMs/). Your overall task in this lab is to create a neural network model that can disambiguate the word sense of 30 different words. 

In [2]:
pip install pandas



In [1]:
# first we import some packages that we need
import random

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
import torchtext
# Field, TabularDataset and BucketIterator are used by dataloader() below.
# They live in torchtext.legacy only in torchtext 0.9-0.11; in older versions
# import them from torchtext.data instead.
from torchtext.legacy.data import Field, BucketIterator, TabularDataset

# our hyperparameters (add more when/if you need them)
# use the GPU when one is available, otherwise fall back to the CPU
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

hyperparameters = {
    'epochs':3,
    'batch_size':16,
    'learning_rate':0.001,
    'embedding_dim':32,
    'output_dim':32
}

# 1. Working with data

A central part of any machine learning system is the data we're working with. In this section we will split the data (the dataset is located here: ``wsd-data/wsd_data.txt``) into a training set and a test set. We will also create a baseline to compare our model against. Finally, we will use TorchText to transform our data (raw text) into a convenient format that our neural network can work with.

## Data

The dataset we will use contains the different word senses of 30 different words. The data is organized as follows (values separated by tabs): 
- Column 1: word-sense
- Column 2: word-form
- Column 3: index of word
- Column 4: white-space tokenized context

### Splitting the data

Your first task is to separate the data into a *training set* and a *test set*. The training set should contain 80% of the examples and the test set the remaining 20%. The examples for the test/training set should be selected **randomly**. Save each dataset into a .csv file for loading later. **[2 marks]**

In [7]:
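path = "wsd-data/wsd_data.txt"
import os
import random


def data_split(path_to_dataset, out_dir="wsd-data", train_fraction=0.8):
    # read all examples (one tab-separated example per line)
    with open(path_to_dataset, "r") as f:
        list_of_lines = f.readlines()

    # shuffle so that the examples in each set are selected randomly
    random.shuffle(list_of_lines)

    # a single split point keeps the two sets disjoint and uses every example
    split_point = int(len(list_of_lines) * train_fraction)
    training = list_of_lines[:split_point]
    testing = list_of_lines[split_point:]

    # write the files next to the dataset, where dataloader() looks for them
    with open(os.path.join(out_dir, "train.csv"), "w") as f:
        f.writelines(training)

    with open(os.path.join(out_dir, "test.csv"), "w") as f:
        f.writelines(testing)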


data_split(path)
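
A random split is easier to debug when it is reproducible. The following is a minimal sketch, assuming a hypothetical helper `split_examples` and made-up example strings; it uses a seeded local generator and checks that the two sets are disjoint and together cover all examples.

```python
import random


def split_examples(examples, train_fraction=0.8, seed=0):
    """Return (train, test) as a reproducible random split of `examples`."""
    rng = random.Random(seed)      # local generator: the global random state is untouched
    shuffled = list(examples)      # copy so the caller's list keeps its order
    rng.shuffle(shuffled)
    split_point = int(len(shuffled) * train_fraction)
    return shuffled[:split_point], shuffled[split_point:]


toy_examples = [f"example-{i}" for i in range(10)]
train, test = split_examples(toy_examples)

print(len(train), len(test))        # 8 2
print(set(train).isdisjoint(test))  # True
```

Calling `split_examples` again with the same seed returns the same split, so results from different runs stay comparable.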



### Creating a baseline

Your second task is to create a *baseline* for the task. A baseline is a "reality check" for a model: given a very simple heuristic or algorithmic solution to the problem, can our neural network perform better than this?
The baseline you are to create is the "most common sense" (MCS) baseline. For each word form, find the sense most commonly assigned to it, and label every occurrence of that word form with that sense. **[2 marks]**

E.g. in a fictional dataset, "bank" has two senses: "financial institution", which occurs 5 times, and "side of river", which occurs 3 times. Thus, all 8 occurrences of "bank" are labeled "financial institution", which yields an MCS accuracy of 5/8 = 62.5%. If a model obtains a higher score than this, we can conclude that the model is *at least* better than always selecting the most frequent word sense.
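
The fictional "bank" example can be reproduced in a few lines. This is a minimal sketch with made-up data; the helper name `mcs_accuracy` is hypothetical and not part of the lab code.

```python
from collections import Counter, defaultdict

# Fictional (sense, word form) pairs matching the "bank" example:
# 5 x "financial institution" and 3 x "side of river".
toy_data = [("financial institution", "bank")] * 5 + [("side of river", "bank")] * 3


def mcs_accuracy(pairs):
    """Map each word form to (most common sense, MCS accuracy)."""
    counts = defaultdict(Counter)
    for sense, word_form in pairs:
        counts[word_form][sense] += 1
    result = {}
    for word_form, sense_counts in counts.items():
        sense, freq = sense_counts.most_common(1)[0]
        result[word_form] = (sense, freq / sum(sense_counts.values()))
    return result


toy_result = mcs_accuracy(toy_data)
print(toy_result)  # {'bank': ('financial institution', 0.625)}
```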

In [109]:
from collections import Counter, defaultdict

def mcs_baseline(data):
    # count how often each sense (column 1) occurs for each word form (column 2)
    word_forms_dict = defaultdict(Counter)
    with open(data) as f:
        for line in f:
            row = line.strip('\n').split('\t')
            word_forms_dict[row[1]][row[0]] += 1

    word_sense_counts = {}   # word form -> number of distinct senses
    baseline_dict = {}       # word form -> MCS accuracy as a percentage string
    for word_form, word_sense_dict in word_forms_dict.items():
        total = list(word_sense_dict.values())
        word_sense_counts[word_form] = len(word_sense_dict)
        # MCS accuracy: share of the examples that carry the most frequent sense
        baseline_dict[word_form] = str(np.round((max(total)/sum(total)) * 100,2))+' %'

    return baseline_dict, word_sense_counts

baseline_dict, word_sense_counts = mcs_baseline('wsd-data/wsd_data.txt')
print(baseline_dict)
print(word_sense_counts)
    
    

{'serve.v': '15.51 %', 'place.n': '24.29 %', 'professional.a': '21.76 %', 'major.a': '30.33 %', 'force.n': '16.27 %', 'bring.v': '21.17 %', 'bad.a': '60.74 %', 'time.n': '27.86 %', 'life.n': '22.47 %', 'critical.a': '27.44 %', 'hold.v': '15.2 %', 'case.n': '20.37 %', 'build.v': '21.2 %', 'find.v': '23.18 %', 'common.a': '25.06 %', 'positive.a': '35.44 %', 'point.n': '35.55 %', 'national.a': '20.46 %', 'extend.v': '18.03 %', 'follow.v': '14.59 %', 'keep.v': '39.2 %', 'lead.v': '17.97 %', 'active.a': '32.04 %', 'line.n': '85.12 %', 'see.v': '62.76 %', 'order.n': '21.96 %', 'security.n': '20.33 %', 'regular.a': '21.73 %', 'physical.a': '23.64 %', 'position.n': '20.17 %'}
{'serve.v': 9, 'place.n': 7, 'professional.a': 5, 'major.a': 4, 'force.n': 8, 'bring.v': 8, 'bad.a': 4, 'time.n': 5, 'life.n': 9, 'critical.a': 5, 'hold.v': 11, 'case.n': 8, 'build.v': 10, 'find.v': 10, 'common.a': 4, 'positive.a': 5, 'point.n': 8, 'national.a': 6, 'extend.v': 7, 'follow.v': 11, 'keep.v': 11, 'lead.v': 8,

### Creating data iterators

To train a neural network, we first need to prepare the data. This involves converting words (and labels) to a number, and organizing the data into batches. We also want the ability to shuffle the examples such that they appear in a random order.  

To do all of this we will use the torchtext library (https://torchtext.readthedocs.io/en/latest/index.html). In addition to converting our data into numerical form and creating batches, it will generate a word and label vocabulary, and data iterators that can sort and shuffle the examples. 

Your task is to create a dataloader for the training and test set you created previously. So, how do we go about doing this?

1) First we create a ``Field`` for each of our columns. A field is an object that tokenizes the input, keeps a dictionary of word-to-number mappings, and handles padding. So, we need four fields: one for the word sense, one for the position, one for the lemma and one for the context. 

2) After we have our fields, we need to process the data. For this we use the ``TabularDataset`` class. We pass the name and path of the training and test files we created previously, then we assign which field to use in each column. The result is that each column will be processed by the field indicated. So, the context column will be tokenized and processed by the context field and so on. 

3) After we have processed the dataset we need to build the vocabulary. For this we call the function ``build_vocab()`` on the different ``Fields`` with the output from ``TabularDataset`` as input. This looks at our dataset and creates the necessary vocabularies (word-to-number mappings). 

4) Finally, we pass the data objects given by the ``TabularDataset`` to the ``BucketIterator`` class. This class will organize our examples into batches and shuffle them (such that for each epoch the model observes the examples in a different order). When we are done with this we can let our function return the data iterators and vocabularies, then we are ready to train and test our model!

Implement the dataloader. [**2 marks**]

*hint: for TabularDataset and BucketIterator use the class function splits()* 
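
To see what these steps amount to, here is a plain-Python sketch of building a vocabulary, converting tokens to numbers and padding a batch. It does not use torchtext; the helper names and the choice of ids for `<pad>` and `<unk>` are assumptions made for illustration and need not match what torchtext does internally.

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"


def build_vocab(tokenized_sentences):
    """Word-to-number mapping, most frequent words first (after <pad>, <unk>)."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    itos = [PAD, UNK] + [tok for tok, _ in counts.most_common()]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return stoi, itos


def numericalize(batch, stoi):
    """Convert a batch of token lists to equally long lists of ids."""
    max_len = max(len(sent) for sent in batch)
    return [
        [stoi.get(tok, stoi[UNK]) for tok in sent] + [stoi[PAD]] * (max_len - len(sent))
        for sent in batch
    ]


sentences = [s.lower().split(" ") for s in
             ["I deposited my money in the bank", "I swam from the river bank"]]
stoi, itos = build_vocab(sentences)
ids = numericalize(sentences, stoi)
print(ids)
```

The second sentence is one token shorter, so it ends in the padding id; words that are not in the vocabulary map to the `<unk>` id.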

In [11]:
path = "wsd-data/"


def dataloader(path):
    whitespacer = lambda x: x.split(' ')
    
    WORDSENSE = Field(tokenize    = whitespacer,
                   lower       = True,
                   batch_first = True) 
    
    POSITION = Field(batch_first = True) 
    
    LEMMA = Field(tokenize    = whitespacer,
                   lower       = True,
                   batch_first = True) 

    CONTEXT = Field(tokenize    = whitespacer,
                   lower        = True,
                   batch_first = True)

    # read the tab-separated files written by data_split (they have no header row)
    train, test = TabularDataset.splits(path   = path,
                                        train  = 'train.csv',
                                        test   = 'test.csv',
                                        format = 'csv',
                                        fields = [('wordsense', WORDSENSE),
                                                  ("lemma", LEMMA),
                                                  ('position', POSITION),
                                                  ("context", CONTEXT)],
                                        skip_header       = False,
                                        csv_reader_params = {'delimiter':'\t',
                                                             'quotechar':'½'})

    # build vocabularies based on what our csv files contained and create word2id mapping
    WORDSENSE.build_vocab(train)
    LEMMA.build_vocab(train)
    POSITION.build_vocab(train)
    CONTEXT.build_vocab(train)

    # create batches from our data, and shuffle them for each epoch
    train_iter, test_iter = BucketIterator.splits((train, test),
                                                  batch_size        = 8,
                                                  sort_within_batch = True,
                                                  sort_key          = lambda x: len(x.context),
                                                  shuffle           = True,
                                                  device            = device)

    return train_iter, test_iter, CONTEXT, WORDSENSE, POSITION, LEMMA

# 2.1 Creating and running a Neural Network for WSD

In this section we will create and run a neural network to predict word senses based on *contextualized representations*.

### Model

We will use a bidirectional Long Short-Term Memory (LSTM) network to create a representation for the sentences and a linear classifier to predict the sense of each word.

When we initialize the model, we need a few things:

    1) An embedding layer: a dictionary from which we can obtain word embeddings
    2) An LSTM module to obtain contextual representations
    3) A classifier that computes scores for each word sense given *some* input


The general procedure is the following:

    1) For each word in the sentence, obtain word embeddings
    2) Run the embedded sentences through the RNN
    3) Select the appropriate hidden state
    4) Predict the word-sense 
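
Steps 1 to 3 can be tried out on random data to check tensor shapes before building the full model. The sketch below is an assumption-laden toy (made-up sizes, random token ids, no padding) that shows two ways of selecting a hidden state from a bidirectional LSTM: at a given token position, or the final states of both directions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes, chosen only for illustration.
batch_size, seq_len, vocab_size, emb_dim, hidden_dim = 3, 5, 20, 4, 6

embeddings = nn.Embedding(vocab_size, emb_dim)
lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)

token_ids = torch.randint(0, vocab_size, (batch_size, seq_len))  # fake sentences
target_index = torch.tensor([0, 2, 4])                           # fake target positions

# 1) embed the tokens, 2) run the RNN
outputs, (h_n, c_n) = lstm(embeddings(token_ids))
# outputs: (batch, seq_len, 2 * hidden_dim); h_n: (2, batch, hidden_dim)

# 3a) hidden state at the target word of each sentence
at_target = outputs[torch.arange(batch_size), target_index]

# 3b) final hidden states of the forward and backward directions
final_state = torch.cat((h_n[0], h_n[1]), dim=1)

print(at_target.shape, final_state.shape)  # both torch.Size([3, 12])
```

Either tensor has `2 * hidden_dim` features per sentence, which is the input size a classifier on top would need.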

**Suggestion for efficiency:**  *Use a low dimensionality (32) for word embeddings and the LSTM when developing and testing the code, then scale up when running the full training/tests*
    
Your task is to create two different models, described below. Both follow the two outlines above.

In the first approach to WSD, you are to select the index of our target word (column 3 in the dataset) and predict the word sense. **[5 marks]**


In [12]:
class WSDModel_approach1(nn.Module):
    def __init__(self, vocab_size, wordsense_size,  emb_dim, out_dim):
        super(WSDModel_approach1, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, emb_dim)
        self.LSTM = nn.LSTM(emb_dim, out_dim, batch_first = True, bidirectional = True)          
        self.classifier = nn.Linear(out_dim*2, wordsense_size)

    def forward(self, batch, word_index):                     
        embedded_batch = self.embeddings(batch)   
        contextualized_embedding, (hn, cn) = self.LSTM(embedded_batch)   

        output = contextualized_embedding[torch.arange(contextualized_embedding.size(0)), word_index]
        output = self.classifier(output)
        return output


In the second approach to WSD, you are to predict the word sense based on the final hidden state given by the RNN. **[5 marks]**

In [13]:

class WSDModel_approach2(nn.Module):
    def __init__(self, vocab_size, wordsense_size,  emb_dim, out_dim):              
        super(WSDModel_approach2, self).__init__()                 
        self.embeddings = nn.Embedding(vocab_size, emb_dim)                 
        self.LSTM = nn.LSTM(emb_dim, out_dim, batch_first = True, bidirectional = True)       
        self.classifier = nn.Linear(out_dim*2, wordsense_size)              # out_dim*2 because the LSTM is bidirectional (e.g. 32*2 = 64)
        
    def forward(self, batch):                
        embedded_batch = self.embeddings(batch)  
        output, (hn, cn) = self.LSTM(embedded_batch)    
        forward = hn[0,:,:]                            # final hidden state, left-to-right direction
        backward = hn[1,:,:]                           # final hidden state, right-to-left direction
        out = torch.cat((forward, backward), 1)        # concatenate along dim 1 (the feature dimension)
        output = self.classifier(out)             
        

        return output

### Training and testing the model

Now we are ready to train and test our model. What we need now is a loss function, an optimizer, and our data. 

- First, create the loss function and the optimizer.
- Next, we iterate over the number of epochs (i.e. how many times we let the model see our data). 
- For each epoch, iterate over the dataset (``train_iter``) to obtain batches. Use the batch as input to the model, and let the model output scores for the different word senses.
- For each model output, calculate the loss (and print the loss) on the output and update the model parameters.
- Reset the gradients and repeat.
- After all epochs are done, test your trained model on the test set (``test_iter``) and calculate the total and per-word-form accuracy of your model.

Implement the training and testing of the model **[4 marks]**

**Suggestion for efficiency:** *when developing your model, try training and testing the model on one or two batches (for each epoch) of data to make sure everything works! It's very annoying if you train for N epochs to find out that something went wrong when testing the model, or to find that something goes wrong when moving from epoch 0 to epoch 1.*
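
The total and per-word-form accuracy can be computed from three parallel lists: predicted senses, gold senses and word forms. A minimal sketch, assuming a hypothetical helper `accuracy_report` and made-up labels:

```python
from collections import defaultdict


def accuracy_report(predicted, gold, word_forms):
    """Return (total accuracy, {word form: accuracy}) for parallel lists."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, true, form in zip(predicted, gold, word_forms):
        total[form] += 1
        correct[form] += int(pred == true)
    per_form = {form: correct[form] / total[form] for form in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_form


# Made-up predictions for two word forms.
gold       = ["bank-1", "bank-2", "bank-1", "line-1", "line-2"]
predicted  = ["bank-1", "bank-1", "bank-1", "line-1", "line-1"]
word_forms = ["bank.n", "bank.n", "bank.n", "line.n", "line.n"]

overall, per_form = accuracy_report(predicted, gold, word_forms)
print(overall, per_form)
```

In the test loop, the three lists would be extended batch by batch and the report computed once at the end.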

In [14]:
train_iter, test_iter, vocab, labels, POSITION, LEMMA = dataloader("wsd-data/")


model1 = WSDModel_approach1(
        vocab_size = len(vocab.vocab), 
        wordsense_size = len(labels.vocab),
        emb_dim = hyperparameters["embedding_dim"],
        out_dim = hyperparameters["output_dim"]).to(device)   # same device as the batches


loss_function = nn.CrossEntropyLoss()
optimizer = optim.Adam(
    model1.parameters(),
    lr=hyperparameters['learning_rate']
)


# total_loss is not reset between epochs, so the value printed below is the
# loss summed over all epochs so far divided by the batch index of the current epoch
total_loss = 0
for epoch in range(hyperparameters["epochs"]):
    for i, batch in enumerate(train_iter):

        # index of the target word in each sentence (stored as strings in the vocabulary)
        batch_position = torch.tensor([int(POSITION.vocab.itos[x]) for x in batch.position])

        # gold word senses
        target_word = batch.wordsense

        output = model1(batch.context, batch_position)

        loss = loss_function(output, target_word.reshape(-1))
        total_loss += loss.item()

        print(total_loss/(i+1), end='\r') 

        # compute gradients
        loss.backward()

        # update parameters
        optimizer.step()

        # reset gradients
        optimizer.zero_grad()      

# test model after all epochs are completed


In [None]:
torch.save(model1, "model1")

In [114]:
train_iter, test_iter, vocab, labels, POSITION, LEMMA = dataloader("wsd-data/")


model2 = WSDModel_approach2(
        vocab_size = len(vocab.vocab), 
        wordsense_size = len(labels.vocab),
        emb_dim = hyperparameters["embedding_dim"],
        out_dim = hyperparameters["output_dim"]).to(device)   # same device as the batches


loss_function = nn.CrossEntropyLoss()
optimizer = optim.Adam(
    model2.parameters(),
    lr=hyperparameters['learning_rate']
)


# total_loss is not reset between epochs, so the value printed below is the
# loss summed over all epochs so far divided by the batch index of the current epoch
total_loss = 0
for epoch in range(hyperparameters["epochs"]):
    for i, batch in enumerate(train_iter):

        # gold word senses
        target_word = batch.wordsense

        # this model uses the final hidden states, so it needs no target position
        output = model2(batch.context)
        loss = loss_function(output, target_word.reshape(-1))
        total_loss += loss.item()

        print(total_loss/(i+1), end='\r') 

        # compute gradients
        loss.backward()

        # update parameters
        optimizer.step()

        # reset gradients
        optimizer.zero_grad()      

# test model after all epochs are completed

12.724319196531132

In [115]:
torch.save(model2, "model2")

WSDModel_approach1(
  (embeddings): Embedding(70324, 32)
  (LSTM): LSTM(32, 32, batch_first=True, bidirectional=True)
  (classifier): Linear(in_features=64, out_features=224, bias=True)
)

In [101]:
### TEST MODEL 1
model1 = torch.load('model1')

test_loss = 0
model1.eval()
total = 0
correct = 0



per_word_form_total = dict.fromkeys(LEMMA.vocab.itos, 0)
per_word_form_correct = dict.fromkeys(LEMMA.vocab.itos, 0)


for i, batch in enumerate(test_iter):

    try:
        context = batch.context
        word_sense = batch.wordsense
        lemma = batch.lemma
        position = torch.tensor([int(POSITION.vocab.itos[x]) for x in batch.position])

        with torch.no_grad(): 
            output = model1(context,position)

            # Calculate total accuracy
            total += word_sense.size(0)
            predicted = [labels.vocab.itos[x] for x in torch.max(output, 1)[1]]
            labels2 = [labels.vocab.itos[x] for x in word_sense]
            word_forms = [LEMMA.vocab.itos[x] for x in lemma]
            
            
            
            for n in range(len(predicted)):
                if predicted[n] == labels2[n]:
                    correct += 1

            # Accumulate the test loss
            loss = loss_function(output, word_sense.reshape(-1))
            test_loss += loss.item()
            print('>', np.round(test_loss/(i+1), 4), end='\r')

            # Calculate per-word-form accuracy
            for n,word_form in enumerate(word_forms):
                if predicted[n] == labels2[n]:
                    per_word_form_correct[word_form] += 1
                per_word_form_total[word_form] += 1

    # batches that raise a ValueError are skipped
    except ValueError:
        pass

accuracy1_model1 = correct / total
print(f'Total accuracy model 1: {np.round(accuracy1_model1 * 100, 2)} %')

accuracy2_model1 = {word_form : (per_word_form_correct[word_form] / per_word_form_total[word_form]) for word_form in baseline_dict.keys() if per_word_form_total[word_form] > 0}


print('Per-word-form accuracy model 1:')
for k,v in accuracy2_model1.items():
    print(f'{k} : {np.round(v * 100, 2)} %')

Total accuracy model 1: 66.7 %
Per-word-form accuracy model 1:
see.v : 78.43 %
line.n : 93.47 %
keep.v : 73.64 %
follow.v : 62.13 %
hold.v : 50.96 %
serve.v : 54.93 %
force.n : 80.25 %
lead.v : 44.55 %
bring.v : 51.19 %
build.v : 42.24 %
extend.v : 53.57 %
find.v : 65.73 %
case.n : 47.7 %
position.n : 47.61 %
time.n : 70.5 %
security.n : 81.47 %
national.a : 73.62 %
life.n : 70.26 %
order.n : 65.45 %
professional.a : 76.38 %
regular.a : 59.38 %
physical.a : 61.76 %
point.n : 74.55 %
place.n : 69.23 %
common.a : 62.94 %
bad.a : 77.51 %
critical.a : 66.36 %
major.a : 50.85 %
active.a : 75.86 %
positive.a : 60.5 %


In [116]:
model2 = torch.load('model2')

test_loss = 0
model2.eval()

total = 0
correct = 0

per_word_form_total = dict.fromkeys(LEMMA.vocab.itos, 0)
per_word_form_correct = dict.fromkeys(LEMMA.vocab.itos, 0)


#vocab, labels, POSITION, LEMMA



for i, batch in enumerate(test_iter):
    
    word_sense = batch.wordsense
    lemma = batch.lemma

    with torch.no_grad(): 
        output = model2(batch.context)
        loss = loss_function(output, word_sense.reshape(-1))
        test_loss += loss.item()

        # Calculate total accuracy
        total += word_sense.size(0)

        predicted = [labels.vocab.itos[x] for x in torch.max(output, 1)[1]]
        labels2 = [labels.vocab.itos[x] for x in word_sense]
        word_forms = [LEMMA.vocab.itos[x] for x in lemma]

        for n in range(len(predicted)):
            if predicted[n] == labels2[n]:
                correct += 1

        # Calculate per-word-form accuracy
        for n,word_form in enumerate(word_forms):
            if predicted[n] == labels2[n]:
                per_word_form_correct[word_form] += 1
            per_word_form_total[word_form] += 1


        print('>', np.round(test_loss/(i+1), 4), end='\r')

accuracy_model2 = correct / total
accuracy2_model2 = {word_form : (per_word_form_correct[word_form] / per_word_form_total[word_form]) for word_form in baseline_dict.keys() if per_word_form_total[word_form] > 0}
print(f'Total accuracy model 2: {np.round(accuracy_model2 * 100, 2)} %')

print('Per-word-form accuracy model 2:')
for k,v in accuracy2_model2.items():
    print(f'{k} : {np.round(v * 100, 2)} %')

Total accuracy model 2: 28.95 %
Per-word-form accuracy model 2:
lead.v : 12.28 %
line.n : 83.17 %
order.n : 30.05 %
extend.v : 24.58 %
security.n : 14.96 %
critical.a : 7.17 %
build.v : 8.16 %
active.a : 9.54 %
bring.v : 10.56 %
see.v : 62.97 %
find.v : 19.91 %
professional.a : 23.25 %
serve.v : 5.76 %
physical.a : 25.5 %
keep.v : 60.49 %
force.n : 14.85 %
position.n : 20.5 %
follow.v : 11.09 %
common.a : 27.06 %
hold.v : 8.29 %
point.n : 39.22 %
place.n : 20.43 %
case.n : 10.94 %
time.n : 12.24 %
regular.a : 25.72 %
positive.a : 13.81 %
life.n : 27.91 %
bad.a : 49.09 %
major.a : 1.71 %
national.a : 0.91 %


In [61]:
a = [159, 151, 115,  76, 121,  89, 128, 108]
b = [58, 57, 98, 68, 49, 69, 52, 73]

print([vocab.vocab.itos[x] for x in c])

['council', 'women', 'our', 'general', 'would', 'including', 'rights', 'his']


# 2.2 Running a transformer for WSD

In this section of the lab you'll try out the transformer, specifically the BERT model. For this we'll use the huggingface library (https://huggingface.co/).

You can find the documentation for the BERT model here (https://huggingface.co/transformers/model_doc/bert.html) and a general usage guide here (https://huggingface.co/transformers/quickstart.html).

What we're going to do is *fine-tune* the BERT model, i.e. update the weights of a pre-trained model. That is, we have a model that is trained on language modeling, but now we apply it to word sense disambiguation with the word representations it learnt from language modeling.

We'll use the same data splits for training and testing as before, but this time you won't use a torchtext dataloader. Instead, create an iterator that collects N sentences (where N is the batch size) and then uses the BertTokenizer to transform the sentences into integers. For your dataloader, remember to:
* Shuffle the data in each batch
* Make sure you get a new iterator for each *epoch*
* Create a vocabulary of *sense-labels* so you can calculate accuracy 
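
The three points above can be sketched in plain Python. The helper names `build_label_vocab` and `iter_batches` and the toy examples are assumptions for illustration; in the lab, the contexts of each batch would additionally be passed through the BertTokenizer.

```python
import random


def build_label_vocab(examples):
    """Map every sense label to an integer id (sorted for a stable order)."""
    return {sense: i for i, sense in enumerate(sorted({ex[0] for ex in examples}))}


def iter_batches(examples, batch_size, rng):
    """Yield batches of `batch_size` examples in a freshly shuffled order."""
    order = list(range(len(examples)))
    rng.shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [examples[i] for i in order[start:start + batch_size]]


# Made-up (sense, context) pairs standing in for rows of the dataset.
examples = [(f"sense-{i % 3}", f"context {i}") for i in range(10)]
label_vocab = build_label_vocab(examples)
rng = random.Random(0)

for epoch in range(2):
    # calling iter_batches again gives a new iterator with a new order
    for batch in iter_batches(examples, batch_size=4, rng=rng):
        contexts = [context for _, context in batch]            # input for the tokenizer
        label_ids = [label_vocab[sense] for sense, _ in batch]  # targets for the loss
```

Because the generator is created anew inside the epoch loop, every epoch sees all examples once, in a different order.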

We then pass this batch into the BERT model and train as before. The BERT model will encode the sentence, then we send this encoded sentence into a prediction layer (you can use either the sentence representation from BERT or the representation of the ambiguous word) like before and collect sense predictions.

About the hyperparameters and training:
* For BERT, a lower learning rate usually works best, between 0.0001 and 0.000001.
* BERT takes a lot of resources; running it on a CPU will take ages, so utilize the GPUs :)
* Since BERT takes a lot of resources, use a small batch size (4-8)
* When computing the BERT representation, make sure you pass the attention mask
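
The mask tells the model which positions hold real tokens and which hold padding. The tokenizer can return it for you; the sketch below only illustrates what it encodes, using a hypothetical helper `pad_with_mask`, made-up token ids and an assumed padding id of 0.

```python
def pad_with_mask(id_sequences, pad_id=0):
    """Pad id lists to equal length and build the matching attention mask."""
    max_len = max(len(seq) for seq in id_sequences)
    input_ids, attention_mask = [], []
    for seq in id_sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Made-up token ids for two sentences of different length.
input_ids, attention_mask = pad_with_mask([[101, 7, 8, 9, 102], [101, 7, 102]])
print(input_ids)       # [[101, 7, 8, 9, 102], [101, 7, 102, 0, 0]]
print(attention_mask)  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```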

**[10 marks]**

In [2]:
import torch
import transformers
from transformers import BertTokenizer,BertForSequenceClassification
import pandas as pd

In [61]:


# Load pre-trained model tokenizer (vocabulary)
import torch
import random
from transformers import AutoModel, AutoTokenizer, BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')




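# splits the dataframe into batches of chunk_size rows, then splits the batches 80/20
def split_dataframe(df, chunk_size): 
    chunks = list()
    num_chunks = len(df) // chunk_size + 1
    for i in range(num_chunks):
        chunks.append(df[i*chunk_size:(i+1)*chunk_size])

    # drop the last chunk, which is either empty or smaller than chunk_size
    chunks = chunks[:-1]
    random.shuffle(chunks)
    # a single split point keeps the training and test batches disjoint
    split_point = int(len(chunks)*0.8)
    training = chunks[:split_point]
    testing = chunks[split_point:]

    return training, testing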

def vectorize_batches(df_batches):
    batches_vectorized = list()
    for batch in df_batches:
        # shuffle the order of the examples within the batch
        random_index = list(range(len(batch.index)))
        random.shuffle(random_index)
        batch_senses = list()
        batch_contexts = list()
        batch_lemmas = list()
        for i in random_index:
            # column 0: word sense, column 3: context, column 1: word form (lemma)
            batch_senses.append(batch.iloc[i][0])
            batch_contexts.append(batch.iloc[i][3])
            batch_lemmas.append(batch.iloc[i][1])

        # encoding turns each word piece into a number from the BERT vocabulary
        batches_vectorized.append([tokenizer.batch_encode_plus(batch_contexts,
                                                               max_length = 128,                # maximum sequence length
                                                               add_special_tokens = True,       # add [CLS] and [SEP]
                                                               truncation = True,               # cut sequences longer than max_length
                                                               padding = True,                  # pad to the longest sequence in the batch
                                                               return_attention_mask = True,    # attention mask
                                                               return_tensors = "pt",           # return PyTorch tensors
                                                               return_token_type_ids = False    # not needed, single sentence
                                                               ).to(device),
                                   batch_senses,
                                   batch_lemmas])
    return batches_vectorized


def dataloader_for_bert(path_to_file, batch_size):
    df = pd.read_csv(path_to_file, sep="\t", header=None)
    sense_labels = set(df[0].tolist())
    lemmas = set(df[1].tolist())

    df_batches, df_batches_test = split_dataframe(df, batch_size)

    # returned in the order: sense labels, test batches, training batches, lemmas
    return sense_labels, vectorize_batches(df_batches_test), vectorize_batches(df_batches), lemmas

path_to_file = "wsd-data/wsd_data.txt"
batch_size = 8
sense_labels, batches_test, batches, lemmas = dataloader_for_bert(path_to_file, batch_size)


#into forward, input_ids and attention_mask

In [62]:
sense_mapping = dict()
i = 0
for word in sense_labels:
    sense_mapping[word] = i
    i += 1
    
print(sense_mapping)

{'common%5:00:00:familiar:02': 0, 'extend%2:30:06::': 1, 'bring%2:40:02::': 2, 'hold%2:35:03::': 3, 'physical%3:01:00::': 4, 'build%2:41:00::': 5, 'line%1:06:01::': 6, 'order%1:14:00::': 7, 'point%1:15:00::': 8, 'serve%2:33:00::': 9, 'bring%2:30:00::': 10, 'major%3:00:01::': 11, 'find%2:40:02::': 12, 'case%1:04:00::': 13, 'force%1:04:01::': 14, 'place%1:04:00::': 15, 'serve%2:41:00::': 16, 'regular%5:00:00:steady:00': 17, 'see%2:31:00::': 18, 'force%1:07:01::': 19, 'see%2:39:02::': 20, 'security%1:26:00::': 21, 'keep%2:41:03::': 22, 'regular%5:00:00:symmetrical:00': 23, 'find%2:31:09::': 24, 'point%1:09:00::': 25, 'serve%2:35:00::': 26, 'major%3:00:07::': 27, 'serve%2:41:02::': 28, 'find%2:40:00::': 29, 'keep%2:35:10::': 30, 'keep%2:40:13::': 31, 'lead%2:42:04::': 32, 'bring%2:36:01::': 33, 'bad%5:00:00:inferior:02': 34, 'regular%5:00:00:usual:00': 35, 'life%1:28:02::': 36, 'life%1:26:02::': 37, 'regular%5:00:00:frequent:00': 38, 'national%5:00:00:domestic:00': 39, 'regular%5:00:00:nor

In [63]:
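def get_key(val):
    """Return the sense label for an integer index, or "UNK" if not found."""
    val = int(val)  # also accepts a 0-dim tensor, e.g. an element of torch.max(...)[1]
    for key, value in sense_mapping.items():
        if val == value:
            return key

    return "UNK"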
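
`get_key` scans the whole dictionary on every call. Since `sense_mapping` assigns each sense a unique integer, an inverse dictionary would give the same lookup in constant time. A minimal sketch, using three entries copied from the mapping printed above; `index_to_sense` and `lookup_sense` are hypothetical names, not part of the lab code:

```python
# Toy mapping with three entries taken from the printed sense_mapping.
sense_mapping = {
    "common%5:00:00:familiar:02": 0,
    "extend%2:30:06::": 1,
    "bring%2:40:02::": 2,
}

# Invert once; this is valid because every sense has a unique index.
index_to_sense = {index: sense for sense, index in sense_mapping.items()}


def lookup_sense(index):
    """Return the sense label for an index, or "UNK" if it is unknown."""
    # int() also accepts a 0-dim torch tensor such as an argmax result.
    return index_to_sense.get(int(index), "UNK")


print(lookup_sense(1))
print(lookup_sense(99))
```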
 


In [64]:

bert = AutoModel.from_pretrained('bert-base-uncased')



Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [65]:
class BERT_WSD(nn.Module):
    def __init__(self, wordsense_size, bert):
        super(BERT_WSD, self).__init__()   
        # your code goes here
        self.bert = bert
        self.relu =  nn.ReLU()
        self.fc1 = nn.Linear(768,512)
        self.fc2 = nn.Linear(512, wordsense_size)

    def forward(self, input_ids, attention_mask):
        # your code goes here
        # pooler_output has shape (batch_size, 768): the final hidden state of the
        # [CLS] token (the first token in the sequence) passed through BERT's pooling layer
        last_hidden_state, pooler_output = self.bert(input_ids, attention_mask=attention_mask, return_dict=False)
        x = self.fc1(pooler_output)
        x = self.relu(x)
        x = self.fc2(x)
        return x

In [66]:




model = BERT_WSD(bert = bert, wordsense_size = len(sense_labels))
model = model.to(device)

loss_function = nn.CrossEntropyLoss()
optimizer = optim.Adam(
    model.parameters(),
    lr=0.00001
)


total_loss = 0
for _ in range(2):
    for i, batch in enumerate(batches):
        wordsense = batch[1]

        # turn target senses into integers
        wordsense = [sense_mapping[x] for x in wordsense]

        # turn integers into a tensor
        target_words = torch.tensor(wordsense).to(device)

        attention_mask = batch[0]["attention_mask"]
        input_ids = batch[0]["input_ids"]

        output = model(input_ids, attention_mask)

        loss = loss_function(output, target_words)
        total_loss += loss.item()

        print(total_loss/(i+1), end='\r')

        # compute gradients
        loss.backward()

        # update parameters
        optimizer.step()

        # reset gradients
        optimizer.zero_grad()

        # save a checkpoint every 100 batches
        if (i + 1) % 100 == 0:
            torch.save(model.state_dict(), "model_bert")
    
# test model after all epochs are completed


4.5330621343035485
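
In the loop above, `total_loss` is initialised once, before the epoch loop, while `i` restarts at 0 in every epoch. The value printed during the second epoch therefore divides the loss accumulated over both epochs by the batch count of the current epoch only. A minimal sketch of per-epoch averaging, with made-up loss values standing in for `loss.item()` (`epoch_mean_losses` is a hypothetical helper):

```python
def epoch_mean_losses(losses_per_epoch):
    """Return the mean batch loss of each epoch separately."""
    means = []
    for batch_losses in losses_per_epoch:
        total_loss = 0.0  # reset at the start of every epoch
        for loss in batch_losses:
            total_loss += loss
        means.append(total_loss / len(batch_losses))
    return means


# Illustrative values only, not results from the lab.
toy_losses = [[4.0, 3.0, 2.0], [2.0, 1.5, 1.0]]
print(epoch_mean_losses(toy_losses))
```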

In [67]:
#torch.save(model.state_dict(), "model_bert")

In [102]:
model = BERT_WSD(bert = bert, wordsense_size = len(sense_labels))
model.load_state_dict(torch.load("model_bert"))

model = model.to(device)


test_loss = 0
model.eval()

total = 0
correct = 0


##create equivalent from data_loader_from_bert
per_word_form_total = dict.fromkeys(lemmas, 0)
per_word_form_correct = dict.fromkeys(lemmas, 0)

#batch, sense, lemmas

for i, batch in enumerate(batches_test):
    word_sense_batch = batch[1]
    lemma = batch[2]

    with torch.no_grad():
        # turn target senses into integers
        wordsense_numbers = [sense_mapping[x] for x in word_sense_batch]
        # turn integers into a tensor
        target_words = torch.tensor(wordsense_numbers).to(device)

        attention_mask = batch[0]["attention_mask"]
        input_ids = batch[0]["input_ids"]

        output = model(input_ids, attention_mask)

        loss = loss_function(output, target_words.reshape(-1))
        test_loss += loss.item()

        # Calculate total accuracy
        total += len(wordsense_numbers)

        # predicted sense labels (argmax over the output scores)
        predicted = [get_key(x) for x in torch.max(output, 1)[1]]

        # correct sense labels
        labels2 = [get_key(x) for x in wordsense_numbers]

        # lemma of each instance in the batch
        word_forms = list(lemma)

        for n in range(len(predicted)):
            if predicted[n] == labels2[n]:
                correct += 1

        # Calculate per-word-form accuracy
        for n, word_form in enumerate(word_forms):
            if predicted[n] == labels2[n]:
                per_word_form_correct[word_form] += 1
            per_word_form_total[word_form] += 1

        print('>', np.round(test_loss/(i+1), 4), end='\r')

accuracy_model_bert = correct / total
accuracy2_model_bert = {word_form : (per_word_form_correct[word_form] / per_word_form_total[word_form]) for word_form in baseline_dict.keys() if per_word_form_total[word_form] > 0}
print(f'Total accuracy model 2: {np.round(accuracy_model_bert * 100, 2)} %')

print('Per-word-form accuracy model 2:')
for k,v in accuracy2_model_bert.items():
    print(f'{k} : {np.round(v * 100, 2)} %')

Total accuracy model 2: 70.45 %
Per-word-form accuracy model 2:
serve.v : 72.1 %
critical.a : 73.86 %
national.a : 56.5 %
case.n : 47.89 %
active.a : 76.26 %
follow.v : 65.67 %
see.v : 83.48 %
positive.a : 69.57 %
force.n : 74.28 %
position.n : 72.57 %
lead.v : 57.79 %
point.n : 70.66 %
life.n : 75.67 %
physical.a : 60.39 %
order.n : 75.25 %
find.v : 68.97 %
security.n : 77.52 %
major.a : 66.45 %
place.n : 72.24 %
line.n : 92.16 %
hold.v : 67.54 %
bring.v : 59.33 %
regular.a : 52.51 %
professional.a : 77.38 %
bad.a : 76.25 %
extend.v : 66.34 %
common.a : 60.58 %
keep.v : 78.58 %
time.n : 68.08 %
build.v : 38.18 %
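
The total accuracy above is computed over all test instances, so frequent word forms weigh more than rare ones, and the unweighted mean of the per-word-form accuracies can differ from it. A sketch of both averages with hypothetical counts (the function name and the numbers are illustrative, not lab results):

```python
def micro_and_macro_accuracy(correct_counts, total_counts):
    """Return (micro, macro) accuracy from per-word-form counts."""
    # Micro: pool all instances, as in `correct / total` above.
    micro = sum(correct_counts.values()) / sum(total_counts.values())
    # Macro: average the per-word-form accuracies with equal weight.
    per_form = [
        correct_counts[form] / total_counts[form]
        for form in total_counts
        if total_counts[form] > 0
    ]
    macro = sum(per_form) / len(per_form)
    return micro, macro


# Hypothetical counts chosen so that the two averages differ.
toy_correct = {"line.n": 90, "build.v": 2}
toy_total = {"line.n": 100, "build.v": 10}
micro, macro = micro_and_macro_accuracy(toy_correct, toy_total)
print(f"micro={micro:.3f} macro={macro:.3f}")
```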


In [104]:
import pickle  # part of the Python standard library, so no pip install is needed


# 3. Evaluation

Explain the difference between the first and second approach. What kind of representations are the different approaches using to predict word-senses? **[4 marks]**

In [None]:
The first model uses the representation of the specific target word (at its index in the sentence).
The second model uses a representation of the entire sentence.

Evaluate your model with per-word-form *accuracy* and comment on the results you get. How does the model perform in comparison to the baseline, and how do the models compare to each other?

Expand on the evaluation by sorting the word-forms by the number of senses they have. Are word-forms with fewer senses easier to predict? Give a short explanation of the results you get based on the number of senses per word.

**[6 marks]**

In [118]:

model1 = {
"see.v" : "78.43 %",
"line.n" : "93.47 %",
"keep.v" : "73.64 %",
"follow.v" : "62.13 %",
"hold.v" : "50.96 %",
"serve.v" : "54.93 %",
"force.n" : "80.25 %",
"lead.v" : "44.55 %",
"bring.v" : "51.19 %",
"build.v" : "42.24 %",
"extend.v" : "53.57 %",
"find.v" : "65.73 %",
"case.n" : "47.7 %",
"position.n" : "47.61 %",
"time.n" : "70.5 %",
"security.n" : "81.47 %",
"national.a" : "73.62 %",
"life.n" : "70.26 %",
"order.n" : "65.45 %",
"professional.a" : "76.38 %",
"regular.a" : "59.38 %",
"physical.a" : "61.76 %",
"point.n" : "74.55 %",
"place.n" : "69.23 %",
"common.a" : "62.94 %",
"bad.a" : "77.51 %",
"critical.a" : "66.36 %",
"major.a" : "50.85 %",
"active.a" : "75.86 %",
"positive.a" : "60.5 %"
}

model2 = {
"lead.v" : "12.28 %",
"line.n" : "83.17 %",
"order.n" : "30.05 %",
"extend.v" : "24.58 %",
"security.n" : "14.96 %",
"critical.a" : "7.17 %",
"build.v" : "8.16 %",
"active.a" : "9.54 %",
"bring.v" : "10.56 %",
"see.v" : "62.97 %",
"find.v" : "19.91 %",
"professional.a" : "23.25 %",
"serve.v" : "5.76 %",
"physical.a" : "25.5 %",
"keep.v" : "60.49 %",
"force.n" : "14.85 %",
"position.n" : "20.5 %",
"follow.v" : "11.09 %",
"common.a" : "27.06 %",
"hold.v" : "8.29 %",
"point.n" : "39.22 %",
"place.n" : "20.43 %",
"case.n" : "10.94 %",
"time.n" : "12.24 %",
"regular.a" : "25.72 %",
"positive.a" : "13.81 %",
"life.n" : "27.91 %",
"bad.a" : "49.09 %",
"major.a" : "1.71 %",
"national.a" : "0.91 %",
}
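
The accuracies of Model1 and Model2 are stored as strings such as `"78.43 %"`, which cannot be sorted or averaged numerically. A small sketch of converting them to floats; `parse_percent` is a hypothetical helper, and the three entries are copied from `model1` above:

```python
def parse_percent(text):
    """Convert a string like "78.43 %" to the float 78.43."""
    return float(text.replace("%", "").strip())


model1_subset = {"see.v": "78.43 %", "line.n": "93.47 %", "case.n": "47.7 %"}
model1_numeric = {form: parse_percent(acc) for form, acc in model1_subset.items()}

# Numeric values can now be sorted from best to worst.
ranked = sorted(model1_numeric, key=model1_numeric.get, reverse=True)
print(ranked)
```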

In [140]:
df = pd.DataFrame([accuracy2_model_bert, model1, model2, baseline_dict, word_sense_counts])
results = df.transpose()
results.columns = ["BERT", "Model1", "Model2", "MSC", "Number of senses"]
results = results.sort_values('Number of senses', ascending=False)

results.to_csv("comparison_pd.csv")

results = pd.read_csv("comparison_pd.csv", index_col=0)
results

| Word form | BERT | Model1 | Model2 | MSC | Number of senses |
|---|---|---|---|---|---|
| see.v | 0.834802 | 78.43 % | 62.97 % | 62.76 % | 11 |
| keep.v | 0.785846 | 73.64 % | 60.49 % | 39.2 % | 11 |
| hold.v | 0.675367 | 50.96 % | 8.29 % | 15.2 % | 11 |
| line.n | 0.921604 | 93.47 % | 83.17 % | 85.12 % | 11 |
| follow.v | 0.656716 | 62.13 % | 11.09 % | 14.59 % | 11 |
| find.v | 0.689727 | 65.73 % | 19.91 % | 23.18 % | 10 |
| build.v | 0.381818 | 42.24 % | 8.16 % | 21.2 % | 10 |
| serve.v | 0.720971 | 54.93 % | 5.76 % | 15.51 % | 9 |
| life.n | 0.756691 | 70.26 % | 27.91 % | 22.47 % | 9 |
| force.n | 0.742754 | 80.25 % | 14.85 % | 16.27 % | 8 |
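
To look at whether word forms with fewer senses are easier to predict, one option is to average the accuracy per sense count. The sketch below does this for the ten rows shown above only (BERT column), so it illustrates the procedure rather than a conclusion about all 30 word forms; the variable names are assumptions:

```python
from collections import defaultdict
from statistics import mean

# (word form, BERT accuracy, number of senses), copied from the rows above.
rows = [
    ("see.v", 0.834802, 11),
    ("keep.v", 0.785846, 11),
    ("hold.v", 0.675367, 11),
    ("line.n", 0.921604, 11),
    ("follow.v", 0.656716, 11),
    ("find.v", 0.689727, 10),
    ("build.v", 0.381818, 10),
    ("serve.v", 0.720971, 9),
    ("life.n", 0.756691, 9),
    ("force.n", 0.742754, 8),
]

# Collect the accuracies of all word forms that share a sense count.
by_sense_count = defaultdict(list)
for _form, accuracy, n_senses in rows:
    by_sense_count[n_senses].append(accuracy)

# Mean accuracy per sense count, ordered from fewest to most senses.
mean_by_sense_count = {n: mean(accs) for n, accs in sorted(by_sense_count.items())}
for n_senses, avg in mean_by_sense_count.items():
    print(f"{n_senses} senses: {avg:.3f} (n={len(by_sense_count[n_senses])})")
```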


How do the LSTMs perform in comparison to BERT? What's the difference between representations obtained by the LSTMs and BERT? **[2 marks]**

BERT outperforms the LSTMs on average, with an accuracy of 70.45 % compared to Model1 (67 %) and Model2 (23 %). Model1 still performs better than BERT for some lemmas, for example force.n (80.25 % vs. 74.28 %) and national.a (73.62 % vs. 56.5 %).

With BERT we used the same approach as for Model2, in that we looked at the entire sentence.
BERT starts from pre-trained embeddings that we reuse during training, while Model2 has to learn its embeddings from scratch.
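
The per-lemma comparison between BERT and Model1 can be checked directly from the reported numbers. A sketch using five word forms, with BERT values taken from the per-word-form output and Model1 values from the `model1` dictionary (`model1_wins` is a hypothetical name):

```python
# Accuracies in percent, copied from the results reported above.
bert_acc = {"line.n": 92.16, "force.n": 74.28, "national.a": 56.5,
            "see.v": 83.48, "serve.v": 72.1}
model1_acc = {"line.n": 93.47, "force.n": 80.25, "national.a": 73.62,
              "see.v": 78.43, "serve.v": 54.93}

# Word forms where Model1 is more accurate than BERT.
model1_wins = sorted(form for form in bert_acc if model1_acc[form] > bert_acc[form])
print(model1_wins)
```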


What could we do to improve our LSTM word sense disambiguation models and our BERT model? **[4 marks]**

LSTMs: More data, larger dimensionality, a dynamic learning rate, more LSTM layers, more epochs and a larger batch size, and adding regularizers and/or dropout.

BERT: We didn't use fine-tuning, but doing so would probably increase accuracy. Other potential improvements are giving the model the index of the word to predict, as in Model1, and using a dynamic learning rate.
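
One common form of the dynamic learning rate mentioned above is linear warm-up followed by linear decay. The sketch below computes the rate for a given step in plain Python; the function name, the warm-up length, the step counts, and the base rate are all assumptions for illustration and were not used in the lab:

```python
def linear_warmup_decay(step, total_steps, warmup_steps, base_lr):
    """Learning rate at `step`: linear warm-up, then linear decay.

    Assumes 0 < warmup_steps < total_steps.
    """
    if step < warmup_steps:
        # Ramp up from base_lr / warmup_steps to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Decay linearly towards zero over the remaining steps.
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / (total_steps - warmup_steps)


schedule = [
    linear_warmup_decay(s, total_steps=10, warmup_steps=2, base_lr=1e-3)
    for s in range(10)
]
print(schedule)
```

In a training loop, the returned value would be written to the optimizer's learning rate before each `optimizer.step()`.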

# Readings:

[1] Kågebäck, M., & Salomonsson, H. (2016). Word Sense Disambiguation using a Bidirectional LSTM. arXiv preprint arXiv:1606.03568.

[2] https://cl.lingfil.uu.se/~nivre/master/NLP-LexSem.pdf