# Lab 3: Word Embeddings and Language Modelling

Adam Ek

In this lab we'll explore constructing *static* word embeddings (i.e. word2vec) and building language models. We'll also evaluate these systems on intermediate tasks, namely word similarity and identifying "good" and "bad" sentences.

* For this we'll use pytorch. Some basic operations that will be useful can be found here: https://jhui.github.io/2018/02/09/PyTorch-Basic-operations
* In general: we are not interested in getting state-of-the-art performance :) focus on the implementation and not results of your model. For this reason, you can use a subset of the dataset: the first 5000-10 000 sentences or so, on linux/mac: ```head -n 10000 inputfile > outputfile```. 
* If possible, use the MLTGpu, it will make everything faster :)

In [208]:
# MAKE A SMALLER TXT CORPUS
# inputfile, outputfile, headn = 'wiki-corpus.txt', 'wiki-subset.txt', 10000
# with open(inputfile, encoding='utf8') as f:
#     lines_head = [l for l in f][:headn]
# with open(outputfile, 'w', encoding='utf8') as f:
#     for l in lines_head:
#         f.write(l)

In [20]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import pandas as pd

# for gpu, replace "cpu" with "cuda:n" where n is the index of the GPU
hardware = 'cuda:0' if torch.cuda.is_available() else 'cpu'
device = torch.device(hardware)

# Word2Vec embeddings

In this first part we'll construct a word2vec model which will give us *static* word embeddings (that is, they are fixed after training).

After we've trained our model we will evaluate the embeddings obtained on a word similarity task.

## Formatting data


First we need to load some data. You can download the file on canvas under files/03-lab-data/wiki-corpus.txt. The file contains 50 000 sentences randomly selected from the complete Wikipedia. Each line in the file contains one sentence. The sentences are whitespace tokenized.

Your first task is to create a dataset suitable for word2vec. That is, we define some ```window_size``` then iterate over all sentences in the dataset, putting the center word in one field and the context words in another (separate the fields with ```tab```).

For example, the sentence "this is a lab" with ```window_size = 4``` will be formatted as:
```
center, context
---------------------
this    is a lab
is      this a lab
a       this is lab
lab     this is a
```

These will be our training examples when training the word2vec model.
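The formatting step can be sketched in a few lines of plain Python. This is a minimal sketch, not the required solution: the helper name `center_context_pairs` is hypothetical, and it assumes `window_size` is split into `window_size // 2` words on each side of the center word, which is only one possible reading of the window. Under that assumption, words at the edges of a sentence get fewer context words than the table above lists.

```python
def center_context_pairs(tokens, window_size=4):
    """Return (center, context) pairs, taking window_size // 2 words on each side.

    Assumption: the window is symmetric around the center word.
    """
    k = window_size // 2
    pairs = []
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - k):i]      # up to k words before the center
        right = tokens[i + 1:i + 1 + k]     # up to k words after the center
        context = left + right
        if context:                         # skip one-word sentences
            pairs.append((center, context))
    return pairs


pairs = center_context_pairs("this is a lab".split(), window_size=4)
for center, context in pairs:
    # one training example per line: center<TAB>context words
    print(center, " ".join(context), sep="\t")
```

Slicing with `max(0, i - k)` avoids negative indexes wrapping around to the end of the sentence, so no `try`/`except` is needed at the sentence boundaries.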

[3 marks]

In [49]:
data_path = 'wiki-corpus.txt'
# data_path = 'wiki-subset.txt'
WINDOW_SIZE = 4  # Hyperparameter of context size

from string import punctuation
import pandas as pd

def corpus_reader(data_path):
    with open(data_path, encoding='utf8') as f:
      lines = [l.lower().split() for l in f] # tokenized sentences list
      center_and_context = []
      for l in lines:
        l = [token for token in l if token not in punctuation]
        
#         ngrammize = lambda input_list,n : list(zip(*[input_list[i:] for i in range(n)])) # Turn a tokens list to n-grams list
#         ngrams = ngrammize( l, WINDOW_SIZE )
#         for gram in ngrams:
#           idx_range = range(WINDOW_SIZE)
#           for i in idx_range:
#             center = gram[i] # One of the words in the ngram
#             context = [ gram[j] for j in idx_range if j!=i ] # The rest of the ngram words that isn't the center word
#             center_and_context.append(f"{center}\t{' '.join(context)}")

        k = int(WINDOW_SIZE/2) # window size to the left or right
#         l = ['<START>']*k + l + ['<END>']*k  # start/end tokens to make each context the same length
        for i in range( len(l) ):
            center, context = l[i], []
            context_idx = [i+j for j in range(-k,0)] + [i+j for j in range(1,k+1)] # indexes before and after
            for idx in context_idx:
                if idx>=0: # If index isn't negative, add as many context (before+after) words as allowed 
                    try:
                        context.append(l[idx])
                    except IndexError:
                        pass
#             if center not in ['<START>','<END>']:
            if len(context)!=0:
                center_and_context.append( [center, f"{' '.join(context)}"] )


#     # Create new CSV/TSV file where each line is Center<tab>Context
    df = pd.DataFrame(center_and_context, columns=['center','context'])
    df.to_csv(data_path.replace('.txt','-formatted.csv'), index=False, sep='\t')


corpus_reader(data_path) # Format txt and save as csv

In [50]:
# The saved CSV/TSV looks like this:
csvdf = pd.read_csv('wiki-corpus-formatted.csv', sep='\t')
csvdf

Unnamed: 0,center,context
0,anarchist,historian george
1,historian,anarchist george woodcock
2,george,anarchist historian woodcock reports
3,woodcock,historian george reports that
4,reports,george woodcock that the
...,...,...
1096648,split,race was into eight
1096649,into,was split eight stages
1096650,eight,split into stages covering
1096651,stages,into eight covering


We sampled 50 000 sentences completely at random from the *whole* Wikipedia for our training data. Give some reasons why this is good, and why it might be bad. (*note*: We'll have a few questions like these, one or two reasons for and against is sufficient)

[2 marks]

**Pros:** This method embeds each word within a given window size, which gives us the freedom to adjust the context size to the purpose, for example whether we want the context features to capture syntactic or semantic information. Also, each token is represented by only one center+context pair, so training is faster.
The input from Wikipedia is randomly sampled, so it represents a diverse set of data and suits a wider variety of tasks.
(Compare to Note below.)

**Cons:**
The number of context words varies. E.g. if a word is near the start/end of a sentence, its context lacks words before/after it and hence carries less information (this may be solved by adding start/end tokens on either end of each sentence). Also, the tokens are inflected or derived word forms, so a word's different forms are treated as different vocabulary entries; depending on how we want to represent the meanings, it may be a good idea to lemmatize them.
Also, relating to the random Wikipedia input above, it may not suit tasks that require a specific domain, since it is trained on general text covering a bit of everything. Moreover, Wikipedia is crowd-sourced, so its content does not necessarily reflect the truth.

**Note:** The original way of formatting (before Adam updated the notebook on May 2; as implemented by the commented lines above) results in N center+context pairs for each token, where N=WINDOW_SIZE. For example, if WINDOW_SIZE=4, then a token is represented by 4 different combinations, namely: 

    center + [center-3, center-2, center-1]

    center + [center-2, center-1, center+1]

    center + [center-1, center+1, center+2]

    center + [center+1, center+2, center+3]
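The four combinations listed in the note can be reproduced with a small sliding n-gram sketch. It follows the idea of the commented-out lines in `corpus_reader` but is a hypothetical rewrite (the name `ngram_pairs` is ours), assuming every n-gram of length `WINDOW_SIZE` contributes one pair per position.

```python
def ngram_pairs(tokens, n=4):
    """For every n-gram, use each word once as center and the rest as context."""
    pairs = []
    for start in range(len(tokens) - n + 1):
        gram = tokens[start:start + n]
        for i, center in enumerate(gram):
            context = gram[:i] + gram[i + 1:]   # the n-gram without its center word
            pairs.append((center, context))
    return pairs


tokens = list("abcdefg")                 # toy sentence, one letter per token
all_pairs = ngram_pairs(tokens, n=4)
contexts_of_d = [context for center, context in all_pairs if center == "d"]
for context in contexts_of_d:
    print("d", " ".join(context), sep="\t")
```

A token in the middle of a long enough sentence, like `d` here, ends up in `n` different pairs, which is why this formatting produces a much larger training file than the one-pair-per-token version.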



### Loading the data

We now need to load the data in an appropriate format for torchtext (https://torchtext.readthedocs.io/en/latest/). We'll use PyText for this and it'll follow the same structure as I showed you in the lecture (remember to lower-case all tokens). Create a function which returns a (bucket)iterator of the training data, and the vocabulary object (```Field```). 

(*hint1*: you can format the data such that the center word always is first, then you only need to use one field)

(*hint2*: the code I showed you during the lecture is available in /files/pytorch_tutorial/ on canvas)

[4 marks]

In [2]:
from torchtext.data import Field, BucketIterator, TabularDataset, Iterator

In [3]:
def get_data(datafile):

    # "fields" that process the columns in TSV files
    Tokens = Field(tokenize = lambda x:x.split(), lower=True, batch_first=True)
    fields = [('center', Tokens),('context', Tokens)]
    
    # read the TSV files and create a dataset generator, for example:
    #     data[0].center # list of 1 item, eg, ['anarchist']
    #     data[4].context # list of context tokens, eg, ['george', 'woodcock', 'that', 'the']
    data = TabularDataset(path=datafile, format='csv', fields=fields,
                          skip_header=True, csv_reader_params = {'delimiter':'\t','quotechar':'、'})

    # build vocabularies based on what TSV files contained and create word2id mapping
    #   len(Tokens.vocab) == vocabsize
    #   Tokens.vocab[wordstr]==encoding, eg Tokens.vocab['anarchist']==3334 ; Tokens.vocab[UnknownWord]==0
    Tokens.build_vocab(data, min_freq=1)


    # create batches from our data, and shuffle them for each epoch
    # TODO Do we need sort_within_batch = True, sort_key= lambda x: len(x.context), 
    #      since center+context (1+4) are almost same length?
    dataset_iter = BucketIterator( data, batch_size= 8, shuffle= True, device= device)

    return dataset_iter, Tokens.vocab

In [4]:
# dataset, vocab = get_data('wiki-subset-formatted.csv') # From smaller corpus
dataset, vocab = get_data('wiki-corpus-formatted.csv') # From whole corpus


In [6]:
for i, batch in enumerate(dataset): # Generated results are random
    center = batch.center  
    context = batch.context
    context = torch.sum(context,dim=1)  # Expects (B,S)=>(B,1)|(B)
    print(i,center.shape, context.shape) # Context tensor, shape= 8,S  ie B=batchsize, S=winsize(range 2~4)
    
    if i==2:
        break

0 torch.Size([8, 1]) torch.Size([8])
1 torch.Size([8, 1]) torch.Size([8])
2 torch.Size([8, 1]) torch.Size([8])


We lower-cased all tokens above; give some reasons why this is a good idea, and why it may be harmful to our embeddings.

[2 marks]

Some words have different meanings when capitalized vs. uncapitalized, so lower-casing them results in both being represented by the same embedding. For example, Turkey (country) and turkey (bird).

## Word Embeddings Model

We will implement the CBOW model for constructing word embedding models.

In [7]:
import torch.optim as optim

In the CBOW model we try to predict the center word based on the context. That is, we take as input ```n``` context words, encode them as vectors, then combine them by summation. This will give us one embedding. We then use this embedding to predict *which* word in our vocabulary is the most likely center word. 
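To make the data flow concrete before writing the PyTorch module, here is a toy forward pass in plain Python. All numbers are hypothetical values picked by hand (nothing is trained), and the names `sum_context` and `score_vocabulary` are ours; the sketch only illustrates "sum the context embeddings, then score every vocabulary word".

```python
# Hypothetical 2-dimensional embeddings and output weights, chosen by hand.
embeddings = {
    "this": [1.0, 0.0],
    "is": [0.0, 1.0],
    "a": [1.0, 1.0],
    "lab": [-1.0, 0.5],
}
output_weights = {
    "this": [1.0, 0.0],
    "is": [0.0, 1.0],
    "a": [0.0, 2.0],
    "lab": [-1.0, -1.0],
}


def sum_context(context):
    """Combine the context embeddings by summation: S vectors of size D -> one vector of size D."""
    vectors = [embeddings[word] for word in context]
    return [sum(dims) for dims in zip(*vectors)]


def score_vocabulary(summed):
    """One score per vocabulary word: dot product of its output row with the summed embedding."""
    return {
        word: sum(w * x for w, x in zip(row, summed))
        for word, row in output_weights.items()
    }


context = ["this", "is", "lab"]           # context words of the center word "a"
summed = sum_context(context)
scores = score_vocabulary(summed)
predicted = max(scores, key=scores.get)   # highest-scoring center word
print(summed, scores, predicted)
```

In the real model the embedding table and the output weights are learned parameters, and the scores are turned into probabilities by a softmax.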

Implement this model 

[7 marks]

In [5]:
class CBOWModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim): # Args: vocab_size:int, embed_dim:int
        super(CBOWModel, self).__init__()
        
        #out: 1 x V
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)  # Matrix of V*D => 128
        self.linear = nn.Linear(embedding_dim, vocab_size)  # 128 => V

    
    def forward(self, context):
        embedded_context = self.embeddings(context)
        embedded_context = self.projection_function(embedded_context) # [B,S,D] => [B,D]
        out = self.linear(embedded_context) # linear projection to vocabulary scores: [B,D] => [B,V]
        log_probs = F.log_softmax(out, dim=1) # softmax log-prob
        return log_probs
    

    def projection_function(self, xs):
        """
        This function will take as input a tensor of size (B, S, D)
        where B is the batch_size, S the window size, and D the dimensionality of embeddings
        this function should compute the sum over the embedding dimensions of the input, 
        that is, we transform (B, S, D) to (B, 1, D) or (B, D) 
        """
        xs_sum = torch.sum(xs,dim=1)
        return xs_sum

Now we need to train the models. First we define which hyperparameters to use. (You can change these, for example when *developing* your model you can use a batch size of 2 and a very low dimensionality (say 10), just to speed things up). When actually training your model *fo real*, you can use a batch size of [8,16,32,64], and embedding dimensionality of [128,256].

In [6]:
# you can change these numbers to suit your needs :)
word_embeddings_hyperparameters = {'epochs':3,
                                   'batch_size':16,
                                   'embedding_size':128,
                                   'learning_rate':0.001,
                                   'embedding_dim':128}

Train your model. Iterate over the dataset, get outputs from your model, calculate loss and backpropagate.

We mentioned in the lecture that we use Negative Log Likelihood (https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html) loss to train Word2Vec model. In this lab we'll take a shortcut when *training* and use Cross Entropy Loss (https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html), basically it combines ```log_softmax``` and ```NLLLoss```. So what your model should output is a *score* for each word in our vocabulary. The ```CrossEntropyLoss``` will then assign probabilities and calculate the negative log likelihood loss.
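The relationship between the two losses can be checked by hand. The sketch below re-implements `log_softmax`, the negative log likelihood and cross entropy for a single example in plain Python (the scores are made-up numbers, and these helper functions are illustrations, not the PyTorch implementations).

```python
import math


def log_softmax(scores):
    """Log-probabilities from raw scores, computed in a numerically stable way."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def nll_loss(log_probs, target):
    """Negative log likelihood of the target class."""
    return -log_probs[target]


def cross_entropy(scores, target):
    """Cross entropy on raw scores = log_softmax followed by NLL."""
    return nll_loss(log_softmax(scores), target)


scores = [2.0, 0.5, -1.0]    # hypothetical scores for a 3-word vocabulary
target = 0                   # index of the true center word
loss = cross_entropy(scores, target)
probs = [math.exp(lp) for lp in log_softmax(scores)]
twice = log_softmax(log_softmax(scores))
print(loss, probs)
```

One consequence worth noting: applying `log_softmax` to values that are already log-probabilities leaves them unchanged (`twice` equals `log_softmax(scores)` up to rounding), so a model that returns log-probabilities and is trained with `CrossEntropyLoss` gets the same loss as one that returns raw scores; the extra `log_softmax` is just redundant.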

[3 marks]

In [63]:
# load data
# dataset, vocab = get_data( 'wiki-corpus-formatted.csv' )

# build model and construct loss/optimizer
cbow_model = CBOWModel(len(vocab), word_embeddings_hyperparameters['embedding_dim'])
cbow_model.to(device)
loss_fn = nn.CrossEntropyLoss() #Alt, loss_fn = nn.NLLLoss()
optimizer = optim.Adam(cbow_model.parameters(), lr=word_embeddings_hyperparameters['learning_rate'])

# start training loop
from tqdm import tqdm

total_loss, batch_count = 0, 0 # Keeps accumulating after each epoch; current batch_count is the denominator

for epoch in tqdm( range(word_embeddings_hyperparameters['epochs']) ):
    epoch_loss = 0 # Reset for every epoch
    
    for i, batch in enumerate(dataset):
        
        context = batch.context # tensor of size 8,4 (B,winsize)
        center = batch.center # tensor of size 8,1
        
        # send your batch of sentences to the model
        output = cbow_model(context)  # output: tensor of size 8,vocsize (B,vocsize)
        
#         print(context.shape, center.shape, output.shape, sep='\n'); break  # [8,4], [8,1], [8, 31195]
#         print(center.view(-1).shape) ;break  # [8]

        # compute the loss, you'll need to reshape the input
        # you can read more about this is the documentation for
        # CrossEntropyLoss
        loss = loss_fn(output, center.view(-1))  #in:(B,vocsize), target:(B)
        total_loss += loss.item()
        epoch_loss += loss.item()
        
        # print average loss for the epoch
#         print(f'Averge loss of epoch {epoch+1}: {epoch_loss/(i+1)}', end='\r')
        print(f'Averge total loss: {total_loss/(batch_count+1)}', end='\r'); batch_count+=1
        
        # compute gradients; # update parameters; # reset gradients
        loss.backward();     optimizer.step();    optimizer.zero_grad()
    
    print()
        

  0%|          | 0/3 [00:00<?, ?it/s]

Averge total loss: 8.0635953482679446

 33%|███▎      | 1/3 [39:26<1:18:52, 2366.43s/it]


Averge total loss: 7.9400545899065575

 67%|██████▋   | 2/3 [1:18:42<39:23, 2363.45s/it]

Averge total loss: 7.940056337650774
Averge total loss: 7.8440138761463035

100%|██████████| 3/3 [1:57:44<00:00, 2354.71s/it]







In [7]:
# Save:
# torch.save(cbow_model.state_dict(), 'cbow_model.pt')

# Load:
cbow_model = CBOWModel(len(vocab), word_embeddings_hyperparameters['embedding_dim'])
cbow_model.load_state_dict(torch.load('cbow_model.pt'))
cbow_model.eval()

# Save whole model:
# torch.save(cbow_model, 'cbow_model_whole.pt')

# Load whole model:
# # Model class must be defined somewhere
# # class CBOWModel(nn.Module): ... ^
# acbowmodel = torch.load('cbow_model_whole.pt')
# acbowmodel.eval()


CBOWModel(
  (embeddings): Embedding(80673, 128)
  (linear): Linear(in_features=128, out_features=80673, bias=True)
)

## Evaluating the model

We will evaluate the model on a dataset of word similarities, WordSim353 (http://alfonseca.org/eng/research/wordsim353.html , also available on canvas under files/03-l). The first thing we need to do is read the dataset and translate it to integers. What we'll do is to reuse the ```Field``` that records word indexes (the second output of ```get_data()```) and use it to parse the file.

The wordsim data is structured as follows:

```
word1 word2 score
...
```


The ```Field``` we got from ```get_data()``` has two built-in functions, ```stoi``` which maps a string to an integer and ```itos``` which maps an integer to a string. 

What our datareader needs to do is: 

```
for line in file:
    word1, word2, score = line.split()
    # encode word1 and word2 as integers
    word1_idx = vocab.vocab.stoi[word1]
    word2_idx = vocab.vocab.stoi[word2]
```
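A runnable version of the pseudocode, using a toy vocabulary instead of the real one, might look as follows. The `stoi` dictionary and the two sample lines are invented for illustration; the fallback to index 0 for unknown words mirrors the behaviour noted for the vocabulary earlier in this notebook.

```python
import io

# Hypothetical string-to-index mapping; index 0 is reserved for unknown words.
stoi = {"<unk>": 0, "<pad>": 1, "tiger": 2, "cat": 3}


def read_pairs(handle, stoi):
    """Read 'word1 word2 score' lines and encode both words as integers."""
    rows = []
    for line in handle:
        if not line.strip():
            continue                              # skip empty lines
        word1, word2, score = line.split()
        # lower-case first, since the vocabulary was built from lower-cased tokens
        word1_idx = stoi.get(word1.lower(), 0)
        word2_idx = stoi.get(word2.lower(), 0)
        rows.append((word1_idx, word2_idx, float(score)))
    return rows


sample = io.StringIO("tiger\tcat\t7.35\nTiger\tjaguar\t8.00\n")
rows = read_pairs(sample, stoi)
print(rows)
```

Pairs in which either word maps to index 0 share the unknown-word embedding, so it can be worth counting how many such pairs there are before interpreting the correlation.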

When we have the integers for ```word1``` and ```word2``` we'll compute the similarity between their word embeddings with *cosine similarity*. We can obtain the embeddings by querying the embedding layer of the model.

We calculate the cosine similarity for each word pair in the dataset, then compute the Pearson correlation between the similarities we obtained and the scores given in the dataset. 
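Both measures are short enough to write out by hand, which can help when checking the PyTorch and NumPy results. The sketch below uses plain Python lists; the toy embeddings and gold scores are made up for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical 2-dimensional embeddings and gold similarity scores.
toy_embeddings = {"tiger": [1.0, 2.0], "cat": [2.0, 4.0], "car": [2.0, -1.0]}
word_pairs = [("tiger", "cat"), ("tiger", "car"), ("cat", "car")]
gold_scores = [7.0, 2.0, 3.0]

model_sims = [
    cosine_similarity(toy_embeddings[w1], toy_embeddings[w2])
    for w1, w2 in word_pairs
]
correlation = pearson(gold_scores, model_sims)
print(model_sims, correlation)
```

Note that cosine similarity ignores vector length: `tiger` and `cat` point in the same direction here, so their similarity is 1 even though the vectors differ.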

[4 marks]

In [8]:
vocab.itos[3334]  # i to string => anarchist
vocab.stoi['anarchist']  # string to i => 3334
lookup_tensor = lambda idx,embed : embed( torch.LongTensor([idx]) )  # or torch.LongTensor([idx]).to(device)
# cbow_model.embeddings(torch.LongTensor([6301]))
lookup_tensor(6301,cbow_model.embeddings) # tensor of the word of the index

6614

In [10]:
# your code goes here

def read_wordsim(path, vocab, embeddings):
    word_pairs = []
    dataset_sims = []
    model_sims = []
    with open(path) as f:
        for line in f:
            word1, word2, score = line.split()
            word_pairs.append((word1,word2))
            
            score = float(score)
            dataset_sims.append(score)
            
            # get the index for the word; lower word so idx won't be 0 
            word1_idx,word2_idx = vocab.stoi[word1.lower()],vocab.stoi[word2.lower()]
            
            # get the embedding of the word
            lookup_tensor = lambda idx:embeddings(torch.LongTensor([idx])) # or torch.LongTensor([idx]).to(device)
            word1_emb,word2_emb = lookup_tensor(word1_idx), lookup_tensor(word2_idx)
            
            # compute cosine similarity, we'll use the version included in pytorch functional
            # https://pytorch.org/docs/master/generated/torch.nn.functional.cosine_similarity.html
            cosine_similarity = F.cosine_similarity( word1_emb,word2_emb ) # Compares two tensors
            
            model_sims.append(cosine_similarity.item())
    
    return dataset_sims, model_sims, word_pairs  # list of golds VS list of embeds cos-sims

In [11]:
path = 'wordsim_similarity_goldstandard.txt'
data, model, pairs = read_wordsim( path, vocab, cbow_model.embeddings ) 
pearson_correlation = np.corrcoef(data, model)

# from scipy import stats 
# pearson_correlation = stats.pearsonr(data, model)
# the non-diagonals give the pearson correlation,
print(pearson_correlation)

[[1.         0.16006021]
 [0.16006021 1.        ]]


Do you think the model performs well or badly? Why?

[3 marks]

We think the performance was bad. The Pearson correlation is about 0.16, and a value between 0.1 and 0.3 indicates only a weak positive correlation.

Select the 10 best and 10 worst performing word pairs, can you see any patterns that explain why *these* are the best and worst word pairs?

[3 marks]

The best-10 and worst-10 pairs are computed below. Note that they don't necessarily mean 'most similar vs most dissimilar', but rather pairs whose cosine similarities are closest to / furthest from the WordSim353 scores.

The gold scores were judged by humans based on semantic similarity, while the cosine similarities from the model are derived from the contexts the words occur in. So even if two words are deemed similar by humans, they don't get a high cosine similarity unless they occurred in similar contexts in the training data.




In [28]:
scaleto10 = lambda x, minval, maxval : round( 10*(x-minval)/(maxval-minval),2 )
{pair: (gold, scaleto10(cos_sim,-1,1)) for pair,gold,cos_sim in list(zip(pairs, data, model)) }

{('tiger', 'cat'): (7.35, 4.44),
 ('tiger', 'tiger'): (10.0, 10.0),
 ('plane', 'car'): (5.77, 5.98),
 ('train', 'car'): (6.31, 5.75),
 ('television', 'radio'): (6.77, 7.72),
 ('media', 'radio'): (7.42, 5.35),
 ('bread', 'butter'): (6.19, 4.95),
 ('cucumber', 'potato'): (5.92, 4.91),
 ('doctor', 'nurse'): (7.0, 5.09),
 ('professor', 'doctor'): (6.62, 5.65),
 ('student', 'professor'): (6.81, 6.83),
 ('smart', 'stupid'): (5.81, 5.21),
 ('wood', 'forest'): (7.73, 5.25),
 ('money', 'cash'): (9.15, 5.14),
 ('king', 'queen'): (8.58, 6.77),
 ('king', 'rook'): (5.92, 4.98),
 ('bishop', 'rabbi'): (6.69, 5.54),
 ('fuck', 'sex'): (9.44, 4.84),
 ('football', 'soccer'): (9.03, 6.17),
 ('football', 'basketball'): (6.81, 7.24),
 ('football', 'tennis'): (6.63, 5.36),
 ('Arafat', 'Jackson'): (2.5, 4.44),
 ('physics', 'chemistry'): (7.35, 6.46),
 ('vodka', 'gin'): (8.46, 5.07),
 ('vodka', 'brandy'): (8.13, 6.01),
 ('drink', 'eat'): (6.87, 6.22),
 ('car', 'automobile'): (8.94, 5.37),
 ('gem', 'jewel'): (8

In [29]:
# Normalize to range 0~1, where 0/1 is lowest/highest value of all
normalize = lambda x, minval, maxval : (x-minval)/(maxval-minval) 
# The difference between normalized gold vs normalized cos-sim
goldmin,goldmax, cosmin,cosmax = min(data),max(data), min(model),max(model)
# Normalize to scale of min~max or absolute min~max
deviate_relative = lambda gold, cos_sim : abs( normalize(gold, goldmin,goldmax) - normalize(cos_sim, cosmin,cosmax) ) 
deviate_absolute = lambda gold, cos_sim : abs( normalize(gold, 0,10) - normalize(cos_sim, -1,1) ) 

print('Normalize to the scale of the respective min & max in data and model:')
pair_diff = { pair: deviate_relative(gold,cos_sim) for pair,gold,cos_sim in list(zip(pairs, data, model)) }
best2worst = sorted( pair_diff.items(), key= lambda item:item[1]) # From min difference to max difference
best10, worst10 = best2worst[:10], best2worst[-10:]
print([p[0] for p in best10])  # word pairs whose cos-sims are closest to gold scores
print([p[0] for p in worst10]) # word pairs whose cos-sims deviate the most from gold scores

print('\nNormalize to 0~10 for data and -1~1 for model:')
pair_diff = { pair: deviate_absolute(gold,cos_sim) for pair,gold,cos_sim in list(zip(pairs, data, model)) }
best2worst = sorted( pair_diff.items(), key= lambda item:item[1]) # From min difference to max difference
best10, worst10 = best2worst[:10], best2worst[-10:]
print([p[0] for p in best10])  # word pairs whose cos-sims are closest to gold scores
print([p[0] for p in worst10]) # word pairs whose cos-sims deviate the most from gold scores


Normalize to the scale of the respective min & max in data and model:
[('tiger', 'tiger'), ('announcement', 'production'), ('school', 'center'), ('problem', 'airport'), ('drink', 'mother'), ('precedent', 'cognition'), ('precedent', 'group'), ('population', 'development'), ('start', 'match'), ('atmosphere', 'landscape')]
[('boy', 'lad'), ('cell', 'phone'), ('street', 'avenue'), ('magician', 'wizard'), ('money', 'cash'), ('psychology', 'psychiatry'), ('dollar', 'buck'), ('tiger', 'jaguar'), ('fuck', 'sex'), ('gem', 'jewel')]

Normalize to 0~10 for data and -1~1 for model:
[('tiger', 'tiger'), ('student', 'professor'), ('cup', 'food'), ('doctor', 'personnel'), ('car', 'flight'), ('street', 'children'), ('plane', 'car'), ('food', 'rooster'), ('tiger', 'organism'), ('precedent', 'example')]
[('sugar', 'approach'), ('drink', 'ear'), ('king', 'cabbage'), ('fuck', 'sex'), ('gem', 'jewel'), ('rooster', 'voyage'), ('direction', 'combination'), ('noon', 'string'), ('monk', 'slave'), ('chord', 'sm

Suggest some ways of improving the model we apply to WordSim353.

[3 marks]

We may use other data for training. As mentioned before, the Wikipedia corpus is too diverse and too general. Alternatively, a single corpus file may simply not be enough, and we need a larger amount of training data.

Consider a scenario where we use these embeddings in a downstream task, for example sentiment analysis (roughly: determining whether a sentence is positive or negative). 

Give some examples of why the sentiment analysis model would benefit from our embeddings, and one example of why our embeddings could hurt the performance of the sentiment model.

[3 marks]

**Pros for sentiment analysis:**

Since the model regards words in similar contexts as semantically similar, it can give the same sentiment ratings (or classifications) for text with synonymous modifiers, eg, great/awesome/terrific.

**Cons for sentiment analysis:**

From our evaluation above we saw that the model didn't capture word similarities the way humans see them. Also, it cannot disambiguate words with multiple meanings as well as humans do, and may therefore produce the wrong classification in sentiment analysis. For example, there are contronyms, which are single words with opposite meanings, e.g. `fast: quick vs stuck; sanction: approve vs boycott`. 

Also, it looks like the Wikipedia corpus mixes many languages, not just English, so a word with the same spelling may have a different meaning in another language and appear in different contexts, which 'pollutes' its representation.


# Language modeling

In this second part we'll build a simple LSTM language model. Your task is to construct a model which takes a sentence as input and predict the next word for each word in the sentence. For this you'll use the ```LSTM``` class provided by PyTorch (https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html). You can read more about the LSTM here: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

NOTE!!!: Use the same dataset (wiki-corpus.txt) as before.

Our setup is similar to before, we first encode the words as distributed representations then pass these to the LSTM and for each output we predict the next word.

For this we'll build a new dataloader with torchtext, the file we pass to the dataloader should contain one sentence per line, with words separated by whitespace.

```
word_1, ..., word_n
word_1, ..., word_k
...
```

In this dataloader you want to make sure that each sentence begins with a ```<start>``` token and ends with an ```<end>``` token; there is a keyword argument in ```Field``` for this :). Other than that, as before, you read the dataset and output an iterator over the dataset and a vocabulary. 

Implement the dataloader, language model and the training loop (the training loop will basically be the same as for word2vec).
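The core of the language-modelling setup is that inputs and targets are the same sentence shifted by one position. A minimal sketch on plain token lists (the helper name `make_lm_example` is hypothetical, and no padding or batching is handled here):

```python
def make_lm_example(tokens):
    """Wrap a sentence in <start>/<end> and build next-word prediction pairs.

    inputs:  <start>, w0, ..., wn     (everything except the last token)
    targets: w0, ..., wn, <end>       (everything except the first token)
    """
    seq = ["<start>"] + [t.lower() for t in tokens] + ["<end>"]
    return seq[:-1], seq[1:]


inputs, targets = make_lm_example("This is a lab".split())
for given, expected in zip(inputs, targets):
    print(f"{given:>8} -> {expected}")
```

On a batched tensor of shape (B, S) the same shift would typically be written with slicing along the sequence dimension, for example `sentence[:, :-1]` for the inputs and `sentence[:, 1:]` for the targets; how padding positions should be treated in the loss is a separate decision.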

[12 marks]

In [18]:
# you can change these numbers to suit your needs as before :)
lm_hyperparameters = {'epochs':3,
                      'batch_size':16,
                      'learning_rate':0.001,
                      'embedding_dim':128,
                      'output_dim':128}

In [21]:
# data_path = 'wiki-corpus.txt'
data_path = 'wiki-corpus.txt'

def get_data_lm(filename):
    # your code here, roughly the same as for the word2vec dataloader
    csvfile = filename.replace('.txt', '-lm.csv')
    with open(filename, encoding='utf8') as f:
        lines = [l for l in f]
        pd.DataFrame(lines).to_csv(csvfile, index=False)
    Sentence = Field(lower=True, tokenize=lambda x:x.split(), init_token='<start>', eos_token='<end>', batch_first=True)
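    # skip_header=True: to_csv writes a header row ("0") that should not be read as a sentence
    Examples = TabularDataset(path=csvfile, format='csv', fields=[('sentence', Sentence)], skip_header=True)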
    
    Sentence.build_vocab(Examples)
    
    dataset_iter = BucketIterator( Examples, batch_size= 8, shuffle= True, device= device)
        
    return dataset_iter, Sentence.vocab

lm_data, lm_vocab = get_data_lm(data_path)

In [26]:
# lm_vocab.stoi['<end>']
nums = [    2,    17,     7,     4,   120,    67,     5,     4, 18812,     5,
          79571,  8281,    36,    10,    43,     7, 36790,     6,     3,     1,
              1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
              1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
              1]
[lm_vocab.itos[n] for n in nums]

['<start>',
 'as',
 'of',
 'the',
 '2000',
 'census',
 ',',
 'the',
 'wheeling',
 ',',
 'wv',
 'msa',
 'had',
 'a',
 'population',
 'of',
 '153,172',
 '.',
 '<end>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>']

In [24]:
# len(lm_data)
# len(lm_vocab)-len(vocab)
[b.sentence for i,b in enumerate(lm_data)][:1]

[tensor([[    2,    17,     7,     4,   120,    67,     5,     4, 18812,     5,
          79571,  8281,    36,    10,    43,     7, 36790,     6,     3,     1,
              1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
              1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
              1],
         [    2,  1492,   174,     7,  1076, 19623,  2555,    66,   362,   102,
              4,   269, 31933,     5,     4,  6350,     7,   112,   147,     4,
          31932,     5,    33,   521,     4, 18840,    66,     8,   703,  3543,
              6,     3,     1,     1,     1,     1,     1,     1,     1,     1,
              1],
         [    2, 11423,    25,    55,   674,    36,    64,   316,     9,  2936,
              5,     8,    10,  5321,  7530,    13,  1462,     9,  2262,     6,
              3,     1,     1,     1,     1,     1,     1,     1,     1,     1,
              1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
    

In [109]:
class LM_withLSTM(nn.Module):
    def __init__(self, vocabsize, embed_dim, output_dim, num_layers=1, bidirectional=False):
        super(LM_withLSTM, self).__init__()
        self.embeddings = nn.Embedding(vocabsize-1, embed_dim)
        self.LSTM = nn.LSTM(embed_dim, output_dim,
                            num_layers=num_layers,bidirectional=bidirectional, batch_first=True
                    )
        self.predict_word = nn.Linear(output_dim, vocabsize-1)
        self.sigmoid = nn.Sigmoid()
    
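    def forward(self, seq):
        embedded_seq = self.embeddings(seq)  # (B, S) => (B, S, E)
        # one hidden state per timestep: (B, S, H); h_n only holds the final state
        timestep_representation, (h_n, c_n) = self.LSTM(embedded_seq)
        # score every vocabulary item at every timestep: (B, S, V)
        outputs = self.predict_word(timestep_representation)
        # return raw scores: CrossEntropyLoss applies log_softmax itself,
        # so self.sigmoid is left unused here
        return outputs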
    
#         out, (h_n, c_n) = self.LSTM(seq, None)
#         outputs = self.predict_word(h_n.squeeze(0))

#         return self.sigmoid(outputs)

In [113]:
# load data
# lm_dataset, lm_vocab = get_data('wiki-corpus.txt')

# build model and construct loss/optimizer
lm_model = LM_withLSTM(len(vocab), 
                       lm_hyperparameters['embedding_dim'],
                       lm_hyperparameters['output_dim'])
lm_model.to(device)

loss_func = nn.CrossEntropyLoss()
lm_optimizer = optim.Adam(lm_model.parameters(), lr=lm_hyperparameters['learning_rate'])

# start training loop
# lm_total_loss, lm_batch_count = 0, 0
# for epoch in range(lm_hyperparameters['epochs']):
#     for i, batch in enumerate(lm_data):
        
#         # the structure for each BATCH is:
#         # <start>, w0, ..., wn, <end>
#         sentence = batch.sentence
        
#         # when training the model, at each input we predict the *NEXT* token
#         # consequently there is nothing to predict when we give the model 
#         # <end> as input. 
#         # thus, we do not want to give <end> as input to the model, select 
#         # from each batch all tokens except the last. 
#         # tip: use pytorch indexing/slicing (same as numpy) 
#         # (https://pytorch.org/tutorials/beginner/basics/tensorqs_tutorial.html#operations-on-tensors)
#         # (https://jhui.github.io/2018/02/09/PyTorch-Basic-operations/)
#         input_sentence = ...
        
#         # send your batch of sentences to the model
#         output = lm_model(input_sentence)
        
#         # for each output, the model predict the NEXT token, so we have to reshape 
#         # our dataset again. On timestep t, we evaluate on token t+1. That is,
#         # we never predict the <start> token ;) so this time, we select all but the first 
#         # token from sentences (that is, all the tokens that we predict)
#         gold_data = ...
        
#         # the shape of the output and sentence variable need to be changed,
#         # for the loss function. Details are in the documentation.
#         # You can use .view(...,...) to reshape the tensors  
#         loss = loss_func(...)
#         lm_total_loss += loss.item()
#         lm_batch_count += 1
        
#         # print the running average loss over all batches seen so far
#         print(lm_total_loss/lm_batch_count, end='\r') 
        
#         # compute gradients
#         loss.backward()
#         # update parameters
#         lm_optimizer.step()
#         # reset gradients
#         lm_optimizer.zero_grad()
        
#     print()

lm_model

LM_withLSTM(
  (embeddings): Embedding(80672, 128)
  (LSTM): LSTM(128, 128, batch_first=True)
  (predict_word): Linear(in_features=128, out_features=80672, bias=True)
  (sigmoid): Sigmoid()
)
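The shifting and reshaping described in the training-loop comments can be sketched on toy tensors. This is a minimal illustration, not the reference solution: the sizes are made up, the batch is assumed to be laid out as `(B, S)` (batch first), and the random `logits` tensor stands in for the model output.

```python
import torch
import torch.nn as nn

# Hypothetical toy sizes, chosen only for illustration.
batch_size, seq_len, vocab_size = 2, 5, 11

# A fake batch of token ids shaped (B, S): <start>, w0, ..., wn, <end>
sentence = torch.randint(0, vocab_size, (batch_size, seq_len))

# Inputs: every token except the last; targets: every token except the first.
input_sentence = sentence[:, :-1]   # (B, S-1)
gold_data = sentence[:, 1:]         # (B, S-1)

# Stand-in for the model output: one row of logits per input timestep.
logits = torch.randn(batch_size, seq_len - 1, vocab_size)  # (B, S-1, V)

# nn.CrossEntropyLoss expects (N, C) logits and (N,) class indices,
# so the batch and time dimensions are flattened together.
loss_func = nn.CrossEntropyLoss()
loss = loss_func(logits.reshape(-1, vocab_size), gold_data.reshape(-1))
```

If your batches are laid out as `(S, B)` instead, the slicing has to happen on the first dimension rather than the second.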

### Evaluating the language model

We'll evaluate our model using the BLiMP dataset (https://github.com/alexwarstadt/blimp). The BLiMP dataset contains sets of linguistic minimal pairs for various syntactic and semantic phenomena. We'll evaluate our model on *existential quantifiers* (link: https://github.com/alexwarstadt/blimp/blob/master/data/existential_there_quantifiers_1.jsonl). This data, as the name suggests, tests whether language models assign a higher probability to the *correct* usage of there-quantifiers.

An example entry in the dataset is: 

```
{"sentence_good": "There was a documentary about music irritating Allison.", "sentence_bad": "There was each documentary about music irritating Allison.", "field": "semantics", "linguistics_term": "quantifiers", "UID": "existential_there_quantifiers_1", "simple_LM_method": true, "one_prefix_method": false, "two_prefix_method": false, "lexically_identical": false, "pairID": "0"}
```

Download the dataset and build a datareader (similar to what you did for word2vec). The dataset structure you should aim for is (you don't need to worry about the other keys for this assignment):

```
good_sentence_1, bad_sentence_1
...
```
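One possible shape for such a datareader is sketched below. The function name `read_blimp_pairs` is hypothetical; the only things taken from the dataset are the `sentence_good` and `sentence_bad` keys shown in the example entry.

```python
import json

def read_blimp_pairs(path):
    """Return a list of (good_sentence, bad_sentence) tuples from a .jsonl file."""
    pairs = []
    with open(path, encoding='utf8') as f:
        for line in f:
            line = line.strip()
            if not line:  # skip empty lines
                continue
            entry = json.loads(line)
            # keep only the two sentences, ignore the other keys
            pairs.append((entry['sentence_good'], entry['sentence_bad']))
    return pairs
```

Calling `read_blimp_pairs('existential_there_quantifiers_1.jsonl')` would then give one tuple per line of the downloaded file.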

Your task now is to compare the probability assigned to the good sentence with the probability assigned to the bad sentence. To compute the probability of a sentence, we take the product of the probabilities assigned to the *gold* tokens. Remember, at timestep ```t``` we're predicting which token comes *next*, i.e. ```t+1``` (basically, you do the same thing as you did when training).

In rough pseudo code what your code should do is:

```
accuracy = []
for good_sentence, bad_sentence in dataset:
    gs_lm_output = LanguageModel(good_sentence)
    gs_token_probabilities = softmax(gs_lm_output)
    gs_sentence_probability = product(gs_token_probabilities[GOLD_TOKENS])

    bs_lm_output = LanguageModel(bad_sentence)
    bs_token_probabilities = softmax(bs_lm_output)
    bs_sentence_probability = product(bs_token_probabilities[GOLD_TOKENS])

    # int(True) = 1 and int(False) = 0
    is_correct = int(gs_sentence_probability > bs_sentence_probability)
    accuracy.append(is_correct)

print(numpy.mean(accuracy))
    
```
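A product of many small probabilities quickly underflows to zero, so in practice it is common to sum log-probabilities instead; the comparison between the good and the bad sentence gives the same answer. Below is a minimal sketch of that idea, assuming the model output for one sentence has shape `(S, V)` and that row `t` scores the token following position `t`. The helper name `sentence_log_prob` is hypothetical.

```python
import torch
import torch.nn.functional as F

def sentence_log_prob(logits, token_ids):
    """Sum of the log-probabilities assigned to the gold next tokens.

    logits:    (S, V) tensor, row t scores the token following position t
    token_ids: (S,) tensor of integer token ids for the sentence
    """
    log_probs = F.log_softmax(logits, dim=-1)   # (S, V)
    gold = token_ids[1:]                        # the tokens being predicted, (S-1,)
    # Row t predicts token t+1, so the last row has no gold token and is dropped.
    gold_log_probs = log_probs[:-1].gather(1, gold.unsqueeze(1)).squeeze(1)
    return gold_log_probs.sum()

# Toy example with a vocabulary of 3 tokens and a sentence of 3 tokens.
toy_logits = torch.tensor([[2.0, 0.0, 0.0],
                           [0.0, 3.0, 0.0],
                           [0.0, 0.0, 1.0]])
toy_ids = torch.tensor([0, 1, 2])
toy_score = sentence_log_prob(toy_logits, toy_ids)
```

Applying `.exp()` to the returned value recovers the product of probabilities from the pseudo code, should you need the probability itself.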

[6 marks]

In [None]:
# your code goes here
import json

def evaluate_model(path, vocab, model):
    
    accuracy = []
    with open(path) as f:
        # iterate over one pair of sentences at a time
        for line in f:
            # load the data
            data = json.loads(line)
            good_s = data['sentence_good']
            bad_s = data['sentence_bad']
            
            # the data is whitespace tokenized
            tok_good_s = ...
            tok_bad_s = ...
            
            # encode your words as integers using the vocab from the dataloader, size is (S)
            # we use unsqueeze to create the batch dimension 
            # in this case our input is only ONE batch, so the size of the tensor becomes: 
            # (S) -> (1, S) as the model expects batches
            enc_good_s = torch.tensor([_ for x in tok_good_s], device=device).unsqueeze(0)
            enc_bad_s = torch.tensor([_ for x in tok_bad_s], device=device).unsqueeze(0)
            
            # pass your encoded sentences to the model and predict the next tokens
            # (call the model instance passed to this function, not the LM_withLSTM class)
            good_s = model(enc_good_s)
            bad_s = model(enc_bad_s)
            
            # get probabilities with softmax
            gs_probs = F.softmax(...)
            bs_probs = F.softmax(...)
            
            # select the probability of the gold tokens
            gs_sent_prob = find_token_probs(gs_probs, enc_good_s)
            bs_sent_prob = find_token_probs(bs_probs, enc_bad_s)
            
            accuracy.append(int(gs_sent_prob>bs_sent_prob))
            
    return accuracy
            
def find_token_probs(model_probs, encoded_sentence):
    probs = []

    # iterate over the tokens in your encoded sentence
    # squeeze(0) drops the batch dimension: (1, S) -> (S)
    for t, gold_token in enumerate(encoded_sentence.squeeze(0)):
        # select the probability of the gold tokens and save
        # hint: pytorch indexing is helpful here ;)
        prob = ...
        probs.append(prob)
    sentence_prob = ...
    return sentence_prob

path = 'existential_there_quantifiers_1.jsonl'
accuracy = evaluate_model(path, ..., ...)

print('Final accuracy:')
print(np.round(np.mean(accuracy), 3))


### Analysis

Our model gets some score, say, 55% correct predictions. Is this good? Suggest some *baseline* (i.e. a stupid "model" we hope ours is better than) that we can compare the model against.

[3 marks]

Suggest some improvements you could make to your language model.

[3 marks]

Suggest some other metrics we can use to evaluate our system.

[2 marks]

# Literature


Neural architectures:
* Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin. A neural probabilistic language model. Journal of Machine Learning Research, 3(6):1137–1155, 2003. (Sections 3 and 4 are less relevant today and hence you can glance through them quickly. Instead, look at the Mikolov papers where they describe training word embeddings with the current neural network architectures.)
* T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
* T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
    


Total marks: 63