# Vector Semantics

## Objectives
- Understanding: 
    - different methods of representing words as vectors
    - vectors and similarity between vectors
    - evaluation of word embeddings
    
- Learning how to:
    - train word embeddings with gensim
    - use pre-trained word embeddings for similarity computation


### Recommended Reading
- Dan Jurafsky and James H. Martin. [__Speech and Language Processing__ (SLP)](https://web.stanford.edu/~jurafsky/slp3/) (3rd ed. draft)

### Covered Material
- SLP
    - [Chapter 6: Vector Semantics and Embeddings](https://web.stanford.edu/~jurafsky/slp3/6.pdf) 

### Requirements
- [spaCy](https://spacy.io/)
- [gensim](https://radimrehurek.com/gensim/)
- [pytorch](https://pytorch.org/get-started/locally/)
- tqdm
- matplotlib
    

*Recommended Reading*:
- Dan Jurafsky and James H. Martin. [__Speech and Language Processing__ (SLP)](https://web.stanford.edu/~jurafsky/slp3/) (3rd ed. draft)

*Notebook Covers Material of*:
- [SLP](https://web.stanford.edu/~jurafsky/slp3/6.pdf) Chapter 6: Vector Semantics and Embeddings


## 1. Words as Vectors (Embeddings)

In natural language processing (NLP), [**word embedding**](https://en.wikipedia.org/wiki/Word_embedding) is a term used for the representation of words for text analysis, typically in the form of a real-valued vector that encodes the meaning of the word such that the words that are closer in the vector space are expected to be similar in meaning. Word embeddings can be obtained using a set of language modeling and feature learning techniques where words or phrases from the vocabulary are mapped to vectors of real numbers. Conceptually it involves the mathematical embedding from space with many dimensions per word to a continuous vector space with a much lower dimension.

- Word embeddings is the process by which words are transformed into vectors of (real) numbers.
- Definition of meaning by distributional similarity / usage: similar words are close in "space"

### 1.1. One-Hot Encoding
- sparse vectors
- most basic way to turn a token into a vector
- method
    - associate a unique integer index with every word in a vocabulary of size $V$
    - turn this integer index $i$ into a binary vector of size $V$ (i.e. the size of the vocabulary)
    - the vector has all values `0` except for the $i$th entry, which is `1`

## 2. Co-Occurence Matrices and Word as Vectors

### 2.1. Term-Document Matrix
- could be used to represent words, where dimension are documents

### 2.2. TF-IDF
- sparse vectors
- generally used to represent documents, where dimensions are words

#### TF: Term Frequency
$$\text{tf}_{t,d} = \text{count}(t,d)$$
$$\text{tf}_{t,d} = \log_{10}(\text{count}(t,d) + 1)$$

`+1` is because log of 0 is undefined.

Alternatively:

$$\text{tf}_{t,d} = 
\begin{cases}
1 + \log_{10}(\text{count}(t,d)), & \text{if count}(t,d) > 0\\
0, & \text{otherwise}
\end{cases}$$

#### IDF: Inverse Document Frequency

$$\text{idf}_t = \frac{N}{\text{df}_t}$$

Usually in log space, like term frequency.

$$\text{idf}_t = \log_{10}(\frac{N}{\text{df}_t})$$

- $\text{df}_t$ is the number of documents in which term $t$ occurs
- $N$ is the total number of documents in the collection.


The __tf-idf__ weighted value $w_{t,d}$ for word $t$ in document $d$ is the combination of $\text{tf}_{t,d}$ and $\text{idf}_t$:

$$w_{t,d} = \text{tf}_{t,d} \times \text{idf}_t$$


### 2.3. Term-Term Matrix
- a.k.a. "word-word" or "word-context" matrix
- words are represented by a function of the counts of nearby words 
- size $|V| \times |V|$, where $V$ is the vocabulary size
    - usually context is taken to be a document or words in a window around the target word

### 2.4. Pointwise Mutual Information (PMI) and Positive Pointwise Mutual Information (PPMI)
- used for term-term matrices
- "the best way to weigh the association between two words is to ask how much more the two words co-occur in our corpus than we would have a priori expected them to appear by chance."

#### 2.4.1. Pointwise Mutual Information (PMI)
- a measure of how often two events $x$ and $y$ occur, compared with what we would expect if they were independent:

$$I(x, y) = \log_2 \frac{P(x, y)}{P(x)P(y)}$$


The pointwise mutual information between a target word $w$ and a context word $c$ is defined as:

$$\text{PMI}(w, c) = \log_2 \frac{P(w, c)}{P(w)P(c)}$$


#### 2.4.2. Positive Pointwise Mutual Information (PMI)
- PMI values range from negative to positive infinity.
- negative PMI values (which imply things are co-occurring less often than we would expect by chance) tend to be unreliable
- it is more common to use Positive PMI (called PPMI) which replaces all negative PMI values with zero

$$\text{PPMI}(w, c) = \max(\log_2 \frac{P(w, c)}{P(w)P(c)}, 0)$$


#### 2.4.3. PPMI Matrix
To get a PPMI matrix from a co-occurrence matrix $F$, where $W$ rows are words and $C$ columns are contexts, and $f_{ij}$ is the number of times word $w_i$ appears in context $c_j$ (i.e. value of the cell).

$$P(w,c) = \frac{f_{ij}}{\sum_{i=1}^W \sum_{j=1}^C f_{ij}}$$

$$P(w) = \frac{\sum_{j=1}^C f_{ij}}{\sum_{i=1}^W \sum_{j=1}^C f_{ij}}$$

$$P(c) = \frac{\sum_{i=1}^W f_{ij}}{\sum_{i=1}^W \sum_{j=1}^C f_{ij}}$$


- PMI has the problem of being biased toward infrequent events: very rare words tend to have very high PMI values.
- Thus, $P(c)$ is computed as $P_{\alpha}(c)$ that raises the probability of the context word to the power of $\alpha$ (e.g. $0.75$)
    - Alternative is Laplace smoothing

$$\text{PPMI}_{\alpha}(w, c) = \max(\log_2 \frac{P(w, c)}{P(w)P_{\alpha}(c)}, 0)$$

$$P_{\alpha}(c) = \frac{\text{count}(c)^{\alpha}}{\sum_{c}\text{count}(c)^{\alpha}}$$

## 3. Training Word Embeddings with `gensim`

### 3.1. Word2Vec
- dense vectors
- representation is created by training a classifier to distinguish nearby and far-away words
- Variants
    - SKIP-GRAM
        - given the target predict the context i.e. $P(w_0, w_1, w_{n-1}|w_n)$
    - CBOW (Continuous Bag of Words)
        - it's the opposite of skip-gram, given the context predict the target i.e. $P(w_n | w_0, w_1, w_{n-1})$
- Refer to [documentation](https://radimrehurek.com/gensim/models/word2vec.html) for details
- [Tutorial](https://rare-technologies.com/word2vec-tutorial/)

In [1]:
!pip install python-Levenshtein
!pip install gensim









In [2]:
# training the model
from gensim.models import Word2Vec
data = ['Iceland is faraway from Padova', 'Rome is the capital of Italy', 'Paris is a big city']
model = Word2Vec(sentences=[d.split() for d in data], vector_size=10, window=5, min_count=1, workers=4)
model.save("word2vec.model")

In [2]:
# loading the model
model = Word2Vec.load("word2vec.model")
print(model)

Word2Vec<vocab=14, vector_size=10, alpha=0.025>


In [5]:
# getting word vectors
print(model.wv['Rome'])
# getting most similar
print(model.wv.most_similar('Rome', topn=3))

[ 0.01631476  0.00189917  0.03473637  0.00217777  0.09618826  0.05060603
 -0.0891739  -0.0704156   0.00901456  0.06392534]
[('faraway', 0.5111488103866577), ('Italy', 0.2914133667945862), ('Iceland', 0.07346687465906143)]


## 4. Vector Similarity
- two words are similar in meaning if their context __vectors__ are similar
- __Cosine similarity__ measures the similarity between two vectors of an __inner product space__. It is measured by the cosine of the angle between two vectors and determines whether two vectors are pointing in roughly the same direction.

### 4.1. Dot Product

- dot product (inner product)

$$\vec{v}\cdot\vec{w} = \sum^N_{i=1}v_i w_i = v_1 w_1 + v_2 w_2 + ... + v_N w_N$$

- vector length (L2 norm $||v||_2$)

$$|\vec{v}| = \sqrt{\sum^N_{i=1} v_i^2}$$ 

$$ |\vec{v}| = \sqrt{\vec{v}\cdot\vec{v}} = \sqrt{\sum^N_{i=1} v_i v_i} = \sqrt{\sum^N_{i=1} v_1 v_1 + v_2 v_2 + ... + v_N v_N}$$

### 4.2. Cosine Similarity

- L2 normalized dot product of 2 vectors
    - $\theta$ is the angle between $\vec{v}$ and $\vec{w}$

$$\vec{v}\cdot\vec{w} = |\vec{v}||\vec{w}|\cos\theta$$

$$\cos\theta = \frac{\vec{v}\cdot\vec{w}}{|\vec{v}||\vec{w}|}$$

$$\text{CosSim}(\vec{v},\vec{w}) = \frac{\vec{v}\cdot\vec{w}}{|\vec{v}||\vec{w}|} = \frac{\sum^N_{i=1}v_i w_i}{\sqrt{\sum^N_{i=1} v_i^2} \sqrt{\sum^N_{i=1} w_i^2}}$$

#### Cosine Distance
$$\text{Cosine Distance}(\vec{v}, \vec{w}) = 1 - \text{Cosine Similarity}(\vec{v}, \vec{w})$$

### Exercises
- Implement a function to compute __cosine similarity__ using `numpy` methods
    - `np.dot`
    - `norm`
- Using the defined functions
    - compute cosine similarity between two word embeddings for instance `Rome` and `city` or `Paris` and `Rome`
    - compare similarity values to the cosine similarity using the output of (`scipy.spatial.distance.cosine`)
        - i.e. use *distance* to compute *similarity*


In [3]:
import numpy as np
from numpy.linalg import norm
from scipy.spatial.distance import cosine

def cosine_similarity(v, w):
    return np.dot(v, w) / (norm(v) * norm(w))

rome = model.wv['Rome']
paris = model.wv['Paris']
print(cosine_similarity(rome, paris))
# print cosine similarity using distance
print(cosine(rome, paris))

0.04265024
0.9573497511446476


## 5. Pre-Trained Embeddings
- Training embeddings is computationally expensive
- Many pre-trained models are available

In [29]:
import gensim.downloader
# Show all available models in gensim-data
print(list(gensim.downloader.info()['models'].keys()))
# Download the 'word2vec-google-news-300' embeddings
w2v = gensim.downloader.load('word2vec-google-news-300')

['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis']


In [30]:
w2v['Rome']

array([ 0.23535156,  0.18652344, -0.0390625 ,  0.31445312, -0.01019287,
        0.09375   , -0.3203125 , -0.01635742, -0.06347656,  0.22167969,
       -0.17382812,  0.04492188,  0.10595703,  0.06298828, -0.08300781,
       -0.03808594, -0.06982422, -0.05395508, -0.00891113,  0.14160156,
        0.08984375,  0.0703125 ,  0.2890625 , -0.06079102,  0.3515625 ,
        0.01855469,  0.03833008,  0.34375   , -0.24511719, -0.00958252,
        0.12060547, -0.04248047, -0.31445312,  0.109375  , -0.15039062,
       -0.31054688, -0.01452637,  0.16015625, -0.04711914,  0.14453125,
        0.13183594,  0.05541992,  0.34570312,  0.19921875,  0.12695312,
        0.0378418 ,  0.07519531,  0.38085938, -0.0135498 ,  0.24414062,
        0.01635742,  0.22851562, -0.04638672, -0.1953125 , -0.22949219,
        0.18554688, -0.16601562, -0.11914062, -0.19726562, -0.04199219,
        0.0859375 ,  0.09765625,  0.02624512, -0.07226562, -0.01055908,
       -0.10839844, -0.24804688, -0.03808594,  0.15722656, -0.17

In [None]:
w2v.most_similar('Rome', topn=3)

In [None]:
w2v.most_similar('Paris', topn=3)

### 5.1. Word Embeddings in spaCy

> To make them compact and fast, spaCy's small pipeline packages (all packages that end in `sm`) don't ship with word vectors, and only include context-sensitive tensors. This means you can still use the `similarity()` methods to compare documents, spans and tokens -- but the result won't be as good, and individual tokens won't have any vectors assigned. So in order to use real word vectors, you need to download a larger pipeline package:

> `python -m spacy download en_core_web_lg`

> Pipeline packages that come with built-in word vectors make them available as the `Token.vector` attribute. `Doc.vector` and `Span.vector` will default to an __average of their token vectors__. You can also check if a token has a vector assigned, and get the L2 norm, which can be used to normalize vectors.

> Each `Doc`, `Span`, `Token` and `Lexeme` comes with a `.similarity` method that lets you compare it with another object, and determine the similarity. 

In [None]:
import spacy
spacy.cli.download('en_core_web_lg')

#### 5.1.1. Accessing Embedding Vectors

In [4]:
import spacy
import numpy as np

nlp = spacy.load('en_core_web_lg')

txt = 'Rome is the capital of Italy'
doc = nlp(txt)

tok = doc[0]  # let's take Rome

print("string:", tok.text)

print("vector dimension:", len(tok.vector))
print("spacy vector norm:", tok.vector_norm)
print("numpy vector norm:", np.sqrt(np.dot(tok.vector, tok.vector)))
print("numpy linalg norm:", np.linalg.norm(tok.vector))

string: Rome
vector dimension: 300
spacy vector norm: 54.82853
numpy vector norm: 54.82853
numpy linalg norm: 54.82853


In [5]:
from scipy.spatial.distance import cosine

# let's get Paris & compare its vector to rome
paris = nlp('Paris')[0]
print(paris.text)

print("spacy CosSim({}, {}):".format(tok.text, paris.text), tok.similarity(paris))
print("scipy CosSim({}, {}):".format(tok.text, paris.text), 1 - cosine(tok.vector, paris.vector))

Paris
spacy CosSim(Rome, Paris): 0.6117807626724243
scipy CosSim(Rome, Paris): 0.6117808222770691


# 6 Train your own Word Embeddings
One way to train word embeddings is to use a language model. We have already seen language models in Lab 3, but now we are going to develop a language model using a neural architecture.


## 6.1 Task definition
To model the probaiblity distribution over a sequence, we are going to use the Chain Rule as we have seen in LAB 3:
$$P(w_{1}^{n}) = P(w_1) P(w_2|w_1) P(w_3|w_1^2) ... P(w_n|w_{1}^{n-1}) = \prod_{i=1}^{n}{P(w_i|w_{1}^{i-1})}$$

However, at that time we have used ngram to trucate the previous context ($N-1$), in order to compute meaningfull probabilities. While using neural models, we will let the model to decide by itself how to manage the previous contex and thus which are the tokens relevant for the prediction. 

## 6.2 RNNs are the most suitable architacture
One of most suitable neural architecture for the Language Model task is the Recurrent Neural Network. The architecture is composed of a RNN layer (vanilla, LSTM, GRU) and a softmax that outputs the probability over the dictionary. Indeed the size of the output vector is equal to the size of the dictionary, i.e. the model cannot predict tokens that are not present in vocabularly. <br>
> LM task in RNN can be tackled as a sequence labelling task (i.e. len of input and output sequence are always the same) in which the input sequence is $ input = \{w_1, w_2, w_{n-1}\}$ and the output is $ output = \{w_2, w_2, w_{n}\}$
>
> **Example** our sentence is ***"I go to Miami"*** the input sequence would be ***"I go to"*** and the output is ***"go to Miami"***. 
>
> Notice: 
> - To proper model the sequence probabilities we need add boundary markers \<s\> and \</s\>.
> - However in LM RNN only the end of sentence token \</s\>  is usually used unless we need for some reason (e.g. in ASR) to compute the probability distribution of the first token of a sentence. 

<img src="https://i.postimg.cc/zGH99MFY/rnn-lm.png" alt="drawing" width="300"/>

In the image below you can see a working example of a language model with RNN. 

<img src="https://i.postimg.cc/fydQNrYP/LM-RNN.png" alt="drawing" width="300"/>

## 6.3 Model architecture

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import math
import numpy as np

# RNN Elman version
# We are not going to use this since for efficienty purposes it's better to use the RNN layer provided by pytorch  

class RNN_cell(nn.Module):
    def __init__(self,  hidden_size, input_size, output_size, vocab_size, dropout=0.1):
        super(RNN, self).__init__()
        
        self.W = nn.Linear(input_size, hidden_size, bias=False)
        self.U = nn.Linear(hidden_size, hidden_size)
        self.V = nn.Linear(hidden_size, hidden_size)
        self.vocab_size = vocab_size
        self.sigmoid = nn.Sigmoid()
    
    def forward(self, prev_hidden, word):
        input_emb = self.W(word)
        prev_hidden_rep = self.U(prev_hidden)
        # ht = σ(Wx + Uht-1 + b)
        hidden_state = self.sigmoid(x + prev_hidden_rep)
        # yt = σ(Vht + b)
        output = self.output(hidden_state)
        return hidden_state, output
    

In [6]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import math
import numpy as np


class LM_RNN(nn.Module):
    def __init__(self, emb_size, hidden_size, output_size, pad_index=0, out_dropout=0.1,
                 emb_dropout=0.1, n_layers=1):
        super(LM_RNN, self).__init__()
        # Token ids to vectors, we will better see this in the next lab 
        self.embedding = nn.Embedding(output_size, emb_size, padding_idx=pad_index)
        # Pytorch's RNN layer: https://pytorch.org/docs/stable/generated/torch.nn.RNN.html
        self.rnn = nn.RNN(emb_size, hidden_size, n_layers, bidirectional=False)    
        self.pad_token = pad_index
        # Linear layer to project the hidden layer to our output space 
        self.output = nn.Linear(hidden_size, output_size)
        
    def forward(self, input_sequence):
        emb = self.embedding(input_sequence)
        rnn_out, _  = self.rnn(emb)
        output = self.output(rnn_out).permute(0,2,1)
        return output
    def get_word_embedding(self, token):
        return self.embedding(token).squeeze(0).detach().cpu().numpy()
    
    def get_most_similar(self, vector, top_k=10):
        embs = self.embedding.weight.detach().cpu().numpy()
        #Our function that we used before
        scores = []
        for i, x in enumerate(embs):
            if i != self.pad_token:
                scores.append(cosine_similarity(x, vector))
        # Take ids of the most similar tokens 
        scores = np.asarray(scores)
        indexes = np.argsort(scores)[::-1][:top_k]  
        top_scores = scores[indexes]
        return (indexes, top_scores)

## 6.4 Data loading 
For sake of time we are going to see this part in detail in the next lab.

In [7]:
def read_file(path, eos_token="<eos>"):
    output = []
    with open(path, "r") as f:
        for line in f.readlines():
            output.append(line + eos_token)
    return output

def get_vocab(corpus, special_tokens=[]):
    output = {}
    i = 0 
    for st in special_tokens:
        output[st] = i
        i += 1
    for sentence in corpus:
        for w in sentence.split():
            if w not in output:
                output[w] = i
                i += 1
    return output

In [8]:
train_raw = read_file("dataset/ptb.train.txt")
dev_raw = read_file("dataset/ptb.valid.txt")
test_raw = read_file("dataset/ptb.test.txt")

In [9]:
# Vocab is computed only on training set 
# However you can compute it for dev and test just for statistics about OOV 
vocab = get_vocab(train_raw, ["<pad>", "<eos>"])

In [10]:
len(vocab)

10001

In [11]:
class Lang():
    def __init__(self, corpus, special_tokens=[]):
        self.word2id = self.get_vocab(corpus, special_tokens)
        self.id2word = {v:k for k, v in self.word2id.items()}
        
    def get_vocab(self, corpus, special_tokens=[]):
        output = {}
        i = 0 
        for st in special_tokens:
            output[st] = i
            i += 1
        for sentence in corpus:
            for w in sentence.split():
                if w not in output:
                    output[w] = i
                    i += 1
        return output
    

In [12]:
lang = Lang(train_raw, ["<pad>", "<eos>"])

In [13]:
import torch
import torch.utils.data as data

class PennTreeBank (data.Dataset):
    # Mandatory methods are __init__, __len__ and __getitem__
    def __init__(self, corpus, lang):
        self.source = []
        self.target = []
        
        for sentence in corpus:
            self.source.append(sentence.split()[0:-1]) # We get from the first token till the second-last token
            self.target.append(sentence.split()[1:]) # We get from the second token till the last token
            # See example in section 6.2
        
        self.source_ids = self.mapping_seq(self.source, lang)
        self.target_ids = self.mapping_seq(self.target, lang)

    def __len__(self):
        return len(self.source)

    def __getitem__(self, idx):
        src= torch.LongTensor(self.source_ids[idx])
        trg = torch.LongTensor(self.target_ids[idx])
        sample = {'source': src, 'target': trg}
        return sample
    
    # Auxiliary methods
    
    def mapping_seq(self, data, lang): # Map sequences to number
        res = []
        for seq in data:
            tmp_seq = []
            for x in seq:
                if x in lang.word2id:
                    tmp_seq.append(lang.word2id[x])
                else:
                    print('OOV found!')
                    print('You have to deal with that') # PennTreeBank doesn't have OOV but "Trust is good, control is better!"
                    break
            res.append(tmp_seq)
        return res

In [14]:
train_dataset = PennTreeBank(train_raw, lang)
dev_dataset = PennTreeBank(dev_raw, lang)
test_dataset = PennTreeBank(test_raw, lang)

In [15]:
from functools import partial
from torch.utils.data import DataLoader
def collate_fn(data, pad_token):
    def merge(sequences):
        '''
        merge from batch * sent_len to batch * max_len 
        '''
        lengths = [len(seq) for seq in sequences]
        max_len = 1 if max(lengths)==0 else max(lengths)
        # Pad token is zero in our case
        # So we create a matrix full of PAD_TOKEN (i.e. 0) with the shape 
        # batch_size X maximum length of a sequence
        padded_seqs = torch.LongTensor(len(sequences),max_len).fill_(pad_token)
        for i, seq in enumerate(sequences):
            end = lengths[i]
            padded_seqs[i, :end] = seq # We copy each sequence into the matrix
        padded_seqs = padded_seqs.detach()  # We remove these tensors from the computational graph
        return padded_seqs, lengths
    # Sort data by seq lengths

    data.sort(key=lambda x: len(x["source"]), reverse=True) 
    new_item = {}
    for key in data[0].keys():
        new_item[key] = [d[key] for d in data]

    source, _ = merge(new_item["source"])
    target, lengths = merge(new_item["target"])
    
    new_item["source"] = source.to(device)
    new_item["target"] = target.to(device)
    new_item["number_tokens"] = sum(lengths)
    return new_item

# Dataloader instantiation
train_loader = DataLoader(train_dataset, batch_size=256, collate_fn=partial(collate_fn, pad_token=lang.word2id["<pad>"]),  shuffle=True)
dev_loader = DataLoader(dev_dataset, batch_size=256, collate_fn=partial(collate_fn, pad_token=lang.word2id["<pad>"]))
test_loader = DataLoader(test_dataset, batch_size=256, collate_fn=partial(collate_fn, pad_token=lang.word2id["<pad>"]))

## 6.5 Train and validate the model 

In [16]:
import math
def train_loop(data, optimizer, criterion, model, clip=5):
    model.train()
    loss_array = []
    number_of_tokens = []
    
    for sample in data:
        optimizer.zero_grad() # Zeroing the gradient
        output = model(sample['source'])
        loss = criterion(output, sample['target'])
        loss_array.append(loss.item() * sample["number_tokens"])
        number_of_tokens.append(sample["number_tokens"])
        loss.backward() # Compute the gradient, deleting the computational graph
        # clip the gradient to avoid explosioning gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)  
        optimizer.step() # Update the weights
        
    return sum(loss_array)/sum(number_of_tokens)

def eval_loop(data, eval_criterion, model):
    model.eval()
    loss_to_return = []
    loss_array = []
    number_of_tokens = []
    # softmax = nn.Softmax(dim=1) # Use Softmax if you need the actual probability
    with torch.no_grad(): # It used to avoid the creation of computational graph
        for sample in data:
            output = model(sample['source'])
            loss = eval_criterion(output, sample['target'])
            loss_array.append(loss.item())
            number_of_tokens.append(sample["number_tokens"])
            
    ppl = math.exp(sum(loss_array) / sum(number_of_tokens))
    loss_to_return = sum(loss_array) / sum(number_of_tokens)
    return ppl, loss_to_return

def init_weights(mat):
    for m in mat.modules():
        if type(m) in [nn.GRU, nn.LSTM, nn.RNN]:
            for name, param in m.named_parameters():
                if 'weight_ih' in name:
                    for idx in range(4):
                        mul = param.shape[0]//4
                        torch.nn.init.xavier_uniform_(param[idx*mul:(idx+1)*mul])
                elif 'weight_hh' in name:
                    for idx in range(4):
                        mul = param.shape[0]//4
                        torch.nn.init.orthogonal_(param[idx*mul:(idx+1)*mul])
                elif 'bias' in name:
                    param.data.fill_(0)
        else:
            if type(m) in [nn.Linear]:
                torch.nn.init.uniform_(m.weight, -0.01, 0.01)
                if m.bias != None:
                    m.bias.data.fill_(0.01)


In [17]:
import torch.optim as optim
# Experiment also with a smaller or bigger model by changing hid and emb sizes 
# A large model tends to overfit
hid_size = 100
emb_size = 150

# With SGD try with an higer learning rate
lr = 0.1 # This is definitely not good for SGD
clip = 5 # Clip the gradient
device = 'cuda:0'

vocab_len = len(lang.word2id)

model = LM_RNN(emb_size, hid_size, vocab_len, pad_index=lang.word2id["<pad>"]).to(device)
model.apply(init_weights)

optimizer = optim.SGD(model.parameters(), lr=lr)
criterion_train = nn.CrossEntropyLoss(ignore_index=lang.word2id["<pad>"])
criterion_eval = nn.CrossEntropyLoss(ignore_index=lang.word2id["<pad>"], reduction='sum')

In [15]:
import matplotlib.pyplot as plt
from tqdm import tqdm
import copy
import gc


# Set max_split_size_mb to avoid fragmentation of the GPU memory
torch.backends.cuda.max_split_size_mb = 512  # Adjust the value according to your needs
torch.cuda.empty_cache()
torch.cuda.max_memory_allocated(device=device) / 1024 ** 2
gc.collect()

n_epochs = 100
patience = 3
losses_train = []
losses_dev = []
sampled_epochs = []
best_ppl = math.inf
best_model = None
pbar = tqdm(range(1,n_epochs))
#If the PPL is too high try to change the learning rate
for epoch in pbar:
    loss = train_loop(train_loader, optimizer, criterion_train, model, clip)    
    
    if epoch % 1 == 0:
        sampled_epochs.append(epoch)
        losses_train.append(np.asarray(loss).mean())
        ppl_dev, loss_dev = eval_loop(dev_loader, criterion_eval, model)
        losses_dev.append(np.asarray(loss_dev).mean())
        pbar.set_description("PPL: %f" % ppl_dev)
        if  ppl_dev < best_ppl: # the lower, the better
            best_ppl = ppl_dev
            best_model = copy.deepcopy(model).to('cpu')
            patience = 3
        else:
            patience -= 1
            
        if patience <= 0: # Early stopping with patience
            break # Not nice but it keeps the code clean
                          
best_model.to(device)
final_ppl,  _ = eval_loop(test_loader, criterion_eval, best_model)    
print('Test ppl: ', final_ppl)

PPL: 331.672248: 100%|██████████| 99/99 [34:10<00:00, 20.72s/it]


Test ppl:  316.9268679431873


If your model makes you happy and you want to reuse it, you have [to save it and load it](https://pytorch.org/tutorials/beginner/saving_loading_models.html). 
In pytorch this is super straightforward.

In [3]:
# # To save the model
path = 'model_bin/model_name.pt'
# torch.save(model.state_dict(), path)
# To load the model you need to initialize it
model = LM_RNN(emb_size, hid_size, vocab_len, pad_index=lang.word2id["<pad>"]).to(device)
# Then you load it
model.load_state_dict(torch.load(path))

NameError: name 'LM_RNN' is not defined

### 7 Evaluation: Analogy Task
In the word analogy task, we complete the sentence of the form

"$w_1$ is to $w_2$ as $w_3$ is to $w4$", where $w_4$ is a blank. 

For instance:

"*man* is to *woman* as *king* is to **__**", and our goal is to guess the missing word (*queen*)

The task is approached using cosine similarity between vector differences: 

$$\vec{w_2} - \vec{w_1} \approx \vec{w_4} - \vec{w_3}$$

$$\vec{w_4} \approx = \vec{w_3} + \vec{w_2} - \vec{w_1}$$

$$w = \arg\max_{w \in V}(\vec{w} \cdot (\vec{w_3} + \vec{w_2} - \vec{w_1}))$$


$$w = \arg\max_{w \in V}\text{CosSim}(\vec{w_2} - \vec{w_1}, \vec{w} - \vec{w_3})$$

#### Analogy using Most Similar
> For each of the given vectors, find the `n` most similar entries to it by cosine. 
Queries are by vector. Results are returned as a (`keys`, `best_rows`, `scores`)

In [24]:
def analogy_spacy(w1, w2, w3):
    v1 = nlp.vocab[w1].vector
    v2 = nlp.vocab[w2].vector
    v3 = nlp.vocab[w3].vector
    
    # relation vector
    rv = v3 + v2 - v1
   
    # n=1 & sorted by default
    ms = nlp.vocab.vectors.most_similar(np.asarray([rv]), n=10)
    
    # getting words & scores
    for i, key in enumerate(ms[0][0]):
        print(nlp.vocab.strings[key], ms[2][0][i])

In [1]:
print(analogy_spacy('man', 'woman', 'king'))

NameError: name 'analogy_spacy' is not defined

## Exercise

- Write a function that computes the analogy with our RNN based model
- Compare Spacy and our RNN based model (just try a couple of examples)


In [19]:
def analogy_our_model(w1, w2, w3, model, lang):
    model.eval().to('cpu')
    
    # Suggest: make use of torch.LongTensor and check if the word is in the vocab
    # Get word ids
    temp_w1 = lang.word2id[w1]
    temp_w2 = lang.word2id[w2]
    temp_w3 = lang.word2id[w3]

    # Get word vectors
    v1 = model.get_word_embedding(torch.LongTensor([temp_w1]))
    v2 = model.get_word_embedding(torch.LongTensor([temp_w2]))
    v3 = model.get_word_embedding(torch.LongTensor([temp_w3]))

    # relation vector
    rv = v3 + v2 - v1

    # Get the most similar word
    ms = model.get_most_similar(rv, top_k=10)

    # getting words & scores
    for i, key in enumerate(ms[0]):
        print(lang.id2word[key], ms[1][i])

In [20]:
# Our model is trained on WSJ news queen and king should be OOV or very rare tokens
# Try with different words
analogy_our_model('man', 'woman', 'u.s.', model, lang)

stakes 0.6451224
argue 0.41147023
has 0.30575985
efforts 0.2905369
novelist 0.27574766
teeth 0.27525675
hidden 0.26221195
chips 0.25811046
disadvantage 0.25252596
motion 0.25185156


In [23]:
# Our model is trained on WSJ news queen and king should be OOV or very rare tokens
analogy_our_model('a', 'woman', 'queen', model, lang)

valley 0.6443743
stakes 0.63430893
woolworth 0.30361828
shamir 0.28658432
interested 0.26614782
gillett 0.26429635
agnelli 0.262242
users 0.25788736
basir 0.25620177
breaker 0.24996667


## Exercise 1 (2 points)
Modify the baseline LM_RNN (the idea is to add a set of improvements and see how these affect the performance). Furthremore, you have to play with the hyperparameters to minimise the PPL and thus print the results achieved with the best configuration. Here are the links to the state-of-the-art papers which uses vanilla RNN [paper1](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5947611), [paper2](https://www.fit.vutbr.cz/research/groups/speech/publi/2010/mikolov_interspeech2010_IS100722.pdf). 
- Replace RNN with LSTM (output the PPL)
- Add two dropout layers: (output the PPL)
    - one on embeddings, 
    - one on the output
- Replace SGD with AdamW (output the PPL)

### Import Libraries

In [1]:
import copy
import gc
import math
from functools import partial

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data
from numpy.linalg import norm
from torch.utils.data import DataLoader
from tqdm import tqdm

In [2]:
def cosine_similarity(v, w):
    return np.dot(v, w) / (norm(v) * norm(w))

### Implementing the RNN-LSTM Model

In [3]:
class LM_RNN(nn.Module):
    def __init__(self, emb_size, hidden_size, output_size, pad_index=0, out_dropout=0.1,
                 emb_dropout=0.1, n_layers=1):
        super(LM_RNN, self).__init__()

        # Token ids to vectors, we will better see this in the next lab
        self.embedding = nn.Embedding(output_size, emb_size, padding_idx=pad_index)
        self.emb_dropout = nn.Dropout(emb_dropout)  # Added Dropout layer after embedding
        # Pytorch LSTM layer
        self.rnn = nn.LSTM(emb_size, hidden_size, n_layers, bidirectional=False)  # Replaced RNN with LSTM
        self.out_dropout = nn.Dropout(out_dropout)  # Added Dropout layer before output
        self.pad_token = pad_index
        # Linear layer to project the hidden layer to our output space
        self.output = nn.Linear(hidden_size, output_size)

    def forward(self, input_sequence):
        emb = self.embedding(input_sequence)
        emb = self.emb_dropout(emb)  # Applied Dropout after embedding
        rnn_out, _ = self.rnn(emb)
        rnn_out = self.out_dropout(rnn_out)  # Applied Dropout before output
        output = self.output(rnn_out).permute(0, 2, 1)
        return output

    def get_word_embedding(self, token):
        return self.embedding(token).squeeze(0).detach().cpu().numpy()

    def get_most_similar(self, vector, top_k=10):
        embs = self.embedding.weight.detach().cpu().numpy()
        #Our function that we used before
        scores = []
        for i, x in enumerate(embs):
            if i != self.pad_token:
                scores.append(cosine_similarity(x, vector))
        # Take ids of the most similar tokens
        scores = np.asarray(scores)
        indexes = np.argsort(scores)[::-1][:top_k]
        top_scores = scores[indexes]

        return (indexes, top_scores)

### Data loading

In [4]:
def read_file(path, eos_token="<eos>"):
    output = []
    with open(path, "r") as f:
        for line in f.readlines():
            output.append(line + eos_token)
    return output

def get_vocab(corpus, special_tokens=[]):
    output = {}
    i = 0
    for st in special_tokens:
        output[st] = i
        i += 1
    for sentence in corpus:
        for w in sentence.split():
            if w not in output:
                output[w] = i
                i += 1
    return output

In [5]:
train_raw = read_file("dataset/ptb.train.txt")
dev_raw = read_file("dataset/ptb.valid.txt")
test_raw = read_file("dataset/ptb.test.txt")

In [6]:
# Vocab is computed only on training set
# However you can compute it for dev and test just for statistics about OOV
vocab = get_vocab(train_raw, ["<pad>", "<eos>"])

In [7]:
len(vocab)

10001

In [8]:
class PennTreeBank(data.Dataset):
    # Mandatory methods are __init__, __len__ and __getitem__
    def __init__(self, corpus, lang):
        self.source = []
        self.target = []

        for sentence in corpus:
            self.source.append(sentence.split()[0:-1]) # We get from the first token till the second-last token
            self.target.append(sentence.split()[1:]) # We get from the second token till the last token
            # See example in section 6.2

        self.source_ids = self.mapping_seq(self.source, lang)
        self.target_ids = self.mapping_seq(self.target, lang)

    def __len__(self):
        return len(self.source)

    def __getitem__(self, idx):
        src= torch.LongTensor(self.source_ids[idx])
        trg = torch.LongTensor(self.target_ids[idx])
        sample = {'source': src, 'target': trg}
        return sample

    # Auxiliary methods
    def mapping_seq(self, data, lang): # Map sequences to number
        res = []
        for seq in data:
            tmp_seq = []
            for x in seq:
                if x in lang.word2id:
                    tmp_seq.append(lang.word2id[x])
                else:
                    print('OOV found!')
                    print('You have to deal with that') # PennTreeBank doesn't have OOV but "Trust is good, control is better!"
                    break
            res.append(tmp_seq)
        return res

In [9]:
class Lang:
    """Simple vocabulary wrapper."""
    def __init__(self, corpus, special_tokens=[]):
        """
        :param corpus:
        :param special_tokens:
        """
        self.word2id = self.get_vocab(corpus, special_tokens)
        self.id2word = {v:k for k, v in self.word2id.items()}

    def get_vocab(self, corpus, special_tokens=[]):
        """
        description: function to get the vocabulary of the corpus
        :param corpus:
        :param special_tokens:
        :return: dict of word to id
        """
        output = {}
        i = 0
        for st in special_tokens:
            output[st] = i
            i += 1
        for sentence in corpus:
            for w in sentence.split():
                if w not in output:
                    output[w] = i
                    i += 1
        return output


In [10]:
lang = Lang(train_raw, ["<pad>", "<eos>"])

In [11]:
train_dataset = PennTreeBank(train_raw, lang)
dev_dataset = PennTreeBank(dev_raw, lang)
test_dataset = PennTreeBank(test_raw, lang)

In [12]:
def collate_fn(data, pad_token):
    def merge(sequences):
        '''
        merge from batch * sent_len to batch * max_len
        '''
        lengths = [len(seq) for seq in sequences]
        max_len = 1 if max(lengths)==0 else max(lengths)
        # Pad token is zero in our case
        # So we create a matrix full of PAD_TOKEN (i.e. 0) with the shape
        # batch_size X maximum length of a sequence
        padded_seqs = torch.LongTensor(len(sequences),max_len).fill_(pad_token)
        for i, seq in enumerate(sequences):
            end = lengths[i]
            padded_seqs[i, :end] = seq # We copy each sequence into the matrix
        padded_seqs = padded_seqs.detach()  # We remove these tensors from the computational graph
        return padded_seqs, lengths
    # Sort data by seq lengths

    data.sort(key=lambda x: len(x["source"]), reverse=True)
    new_item = {}
    for key in data[0].keys():
        new_item[key] = [d[key] for d in data]

    source, _ = merge(new_item["source"])
    target, lengths = merge(new_item["target"])

    new_item["source"] = source.to(device)
    new_item["target"] = target.to(device)
    new_item["number_tokens"] = sum(lengths)
    return new_item

# Dataloader instantiation
train_loader = DataLoader(train_dataset, batch_size=256, collate_fn=partial(collate_fn, pad_token=lang.word2id["<pad>"]),  shuffle=True)
dev_loader = DataLoader(dev_dataset, batch_size=256, collate_fn=partial(collate_fn, pad_token=lang.word2id["<pad>"]))
test_loader = DataLoader(test_dataset, batch_size=256, collate_fn=partial(collate_fn, pad_token=lang.word2id["<pad>"]))

### Train and validate the model

In [13]:
def train_loop(data, optimizer, criterion, model, clip=5):
    model.train()
    loss_array = []
    number_of_tokens = []

    for sample in data:
        optimizer.zero_grad() # Zeroing the gradient
        output = model(sample['source'])
        loss = criterion(output, sample['target'])
        loss_array.append(loss.item() * sample["number_tokens"])
        number_of_tokens.append(sample["number_tokens"])
        loss.backward() # Compute the gradient, deleting the computational graph
        # clip the gradient to avoid explosioning gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step() # Update the weights

    return sum(loss_array)/sum(number_of_tokens)

def eval_loop(data, eval_criterion, model):
    model.eval()
    loss_to_return = []
    loss_array = []
    number_of_tokens = []
    # softmax = nn.Softmax(dim=1) # Use Softmax if you need the actual probability
    with torch.no_grad(): # It used to avoid the creation of computational graph
        for sample in data:
            output = model(sample['source'])
            loss = eval_criterion(output, sample['target'])
            loss_array.append(loss.item())
            number_of_tokens.append(sample["number_tokens"])

    ppl = math.exp(sum(loss_array) / sum(number_of_tokens))
    loss_to_return = sum(loss_array) / sum(number_of_tokens)
    return ppl, loss_to_return

def init_weights(mat):
    for m in mat.modules():
        if type(m) in [nn.GRU, nn.LSTM, nn.RNN]:
            for name, param in m.named_parameters():
                if 'weight_ih' in name:
                    for idx in range(4):
                        mul = param.shape[0]//4
                        torch.nn.init.xavier_uniform_(param[idx*mul:(idx+1)*mul])
                elif 'weight_hh' in name:
                    for idx in range(4):
                        mul = param.shape[0]//4
                        torch.nn.init.orthogonal_(param[idx*mul:(idx+1)*mul])
                elif 'bias' in name:
                    param.data.fill_(0)
        else:
            if type(m) in [nn.Linear]:
                torch.nn.init.uniform_(m.weight, -0.01, 0.01)
                if m.bias != None:
                    m.bias.data.fill_(0.01)


In [20]:
# Experiment also with a smaller or bigger model by changing hid and emb sizes
# A large model tends to overfit
hid_size = 50
emb_size = 100

# With SGD try with an higer learning rate
lr = 0.1 # This is definitely not good for SGD
clip = 5 # Clip the gradient
device = 'cuda:0'

vocab_len = len(lang.word2id)

model = LM_RNN(emb_size, hid_size, vocab_len, pad_index=lang.word2id["<pad>"]).to(device)
model.apply(init_weights)

optimizer = optim.SGD(model.parameters(), lr=lr)
criterion_train = nn.CrossEntropyLoss(ignore_index=lang.word2id["<pad>"])
criterion_eval = nn.CrossEntropyLoss(ignore_index=lang.word2id["<pad>"], reduction='sum')

In [21]:
import matplotlib.pyplot as plt
from tqdm import tqdm
import copy
import gc


# Set max_split_size_mb to avoid fragmentation of the GPU memory
torch.backends.cuda.max_split_size_mb = 512  # Adjust the value according to your needs
torch.cuda.empty_cache()
torch.cuda.max_memory_allocated(device=device) / 1024 ** 2
gc.collect()

n_epochs = 100
patience = 3
losses_train = []
losses_dev = []
sampled_epochs = []
best_ppl = math.inf
best_model = None
pbar = tqdm(range(1,n_epochs))

#If the PPL is too high try to change the learning rate
for epoch in pbar:
    loss = train_loop(train_loader, optimizer, criterion_train, model, clip)

    if epoch % 1 == 0:
        sampled_epochs.append(epoch)
        losses_train.append(np.asarray(loss).mean())
        ppl_dev, loss_dev = eval_loop(dev_loader, criterion_eval, model)
        losses_dev.append(np.asarray(loss_dev).mean())
        pbar.set_description("PPL: %f" % ppl_dev)
        if  ppl_dev < best_ppl: # the lower, the better
            best_ppl = ppl_dev
            best_model = copy.deepcopy(model).to('cpu')
            patience = 3
        else:
            patience -= 1

        if patience <= 0: # Early stopping with patience
            break # Not nice but it keeps the code clean

best_model.to(device)
final_ppl,  _ = eval_loop(test_loader, criterion_eval, best_model)
print('Test ppl: ', final_ppl)

PPL: 454.034543: 100%|██████████| 99/99 [28:53<00:00, 17.51s/it]


Test ppl:  437.30882551999696


If your model makes you happy and you want to reuse it, you have [to save it and load it](https://pytorch.org/tutorials/beginner/saving_loading_models.html).
In pytorch this is super straightforward.

In [19]:
# # To save the model
path = 'model_bin/lab-9_model_rnn_lstm-lr-0.1_hs-100_es-150.pt'
torch.save(model.state_dict(), path)
# To load the model you need to initialize it
model = LM_RNN(emb_size, hid_size, vocab_len, pad_index=lang.word2id["<pad>"]).to(device)
# Then you load it
model.load_state_dict(torch.load(path))

<All keys matched successfully>

## Exercise 2 (4 points)
Add to best model of Exercise 1 the following regularizations described in [this paper](https://openreview.net/pdf?id=SyyGPP0TZ):
- Weight Tying (PPL)
- Variational Dropout (PPL)
- Non-monotonically Triggered AvSGD (PPL)

In [1]:
import copy
import gc
import math
from functools import partial

import numpy as np
import torch
import torch.nn as nn
import torch.utils.data as data
from numpy.linalg import norm
from torch.utils.data import DataLoader
from tqdm import tqdm
from torch.optim.lr_scheduler import ReduceLROnPlateau

In [2]:
def cosine_similarity(v, w):
    return np.dot(v, w) / (norm(v) * norm(w))

In [3]:
class VariationalDropout(nn.Module):
    def __init__(self, p=0.5, batch_first=False):
        super().__init__()
        self.p = p
        self.batch_first = batch_first

    def forward(self, x):
        if not self.training or not self.p:
            return x
        m = x.data.new(x.size(0), 1, x.size(2) if self.batch_first else 1).bernoulli_(1 - self.p)
        mask = m.div_(1 - self.p)
        mask = mask.expand_as(x)
        return mask * x

In [4]:
class LM_RNN(nn.Module):
    def __init__(self, emb_size, hidden_size, output_size, pad_index=0, out_dropout=0.1,
                 emb_dropout=0.1, n_layers=1, tied_weights=False):
        super(LM_RNN, self).__init__()

        # Token ids to vectors, we will better see this in the next lab
        self.embedding = nn.Embedding(output_size, emb_size, padding_idx=pad_index)
        # self.emb_dropout = nn.Dropout(emb_dropout)  # Added Dropout layer after embedding
        # added variational dropout
        self.emb_dropout = VariationalDropout(emb_dropout)
        # Pytorch LSTM layer
        self.rnn = nn.LSTM(emb_size, hidden_size, n_layers, bidirectional=False)  # Replaced RNN with LSTM

        if tied_weights:
            # Linear layer to project the hidden layer to our output space
            self.output = nn.Linear(hidden_size, output_size, bias=False)  # no bias if weights are tied
            self.output.weight = self.embedding.weight  # tie weights
        else:
            # Linear layer to project the hidden layer to our output space
            self.output = nn.Linear(hidden_size, output_size)

        # self.out_dropout = nn.Dropout(out_dropout)  # Added Dropout layer before output
        # added variational dropout to output
        self.out_dropout = VariationalDropout(out_dropout)
        self.pad_token = pad_index

    def forward(self, input_sequence):
        emb = self.embedding(input_sequence)
        emb = self.emb_dropout(emb)  # Applied Dropout after embedding
        rnn_out, _ = self.rnn(emb)
        rnn_out = self.out_dropout(rnn_out)  # Applied Dropout before output
        output = self.output(rnn_out).permute(0, 2, 1)
        return output

    def get_word_embedding(self, token):
        return self.embedding(token).squeeze(0).detach().cpu().numpy()

    def get_most_similar(self, vector, top_k=10):
        embs = self.embedding.weight.detach().cpu().numpy()
        # Our function that we used before
        scores = []
        for i, x in enumerate(embs):
            if i != self.pad_token:
                scores.append(cosine_similarity(x, vector))
        # Take ids of the most similar tokens
        scores = np.asarray(scores)
        indexes = np.argsort(scores)[::-1][:top_k]
        top_scores = scores[indexes]
        return (indexes, top_scores)

In [5]:
def read_file(path, eos_token="<eos>"):
    output = []
    with open(path, "r") as f:
        for line in f.readlines():
            output.append(line + eos_token)
    return output


def get_vocab(corpus, special_tokens=[]):
    output = {}
    i = 0
    for st in special_tokens:
        output[st] = i
        i += 1
    for sentence in corpus:
        for w in sentence.split():
            if w not in output:
                output[w] = i
                i += 1
    return output


In [6]:
class Lang():
    def __init__(self, corpus, special_tokens=[]):
        self.word2id = self.get_vocab(corpus, special_tokens)
        self.id2word = {v: k for k, v in self.word2id.items()}

    def get_vocab(self, corpus, special_tokens=[]):
        output = {}
        i = 0
        for st in special_tokens:
            output[st] = i
            i += 1
        for sentence in corpus:
            for w in sentence.split():
                if w not in output:
                    output[w] = i
                    i += 1
        return output


In [7]:
class PennTreeBank(data.Dataset):
    # Mandatory methods are __init__, __len__ and __getitem__
    def __init__(self, corpus, lang):
        self.source = []
        self.target = []

        for sentence in corpus:
            self.source.append(sentence.split()[0:-1])  # We get from the first token till the second-last token
            self.target.append(sentence.split()[1:])  # We get from the second token till the last token
            # See example in section 6.2

        self.source_ids = self.mapping_seq(self.source, lang)
        self.target_ids = self.mapping_seq(self.target, lang)

    def __len__(self):
        return len(self.source)

    def __getitem__(self, idx):
        src = torch.LongTensor(self.source_ids[idx])
        trg = torch.LongTensor(self.target_ids[idx])
        sample = {'source': src, 'target': trg}
        return sample

    # Auxiliary methods
    def mapping_seq(self, data, lang):  # Map sequences to number
        res = []
        for seq in data:
            tmp_seq = []
            for x in seq:
                if x in lang.word2id:
                    tmp_seq.append(lang.word2id[x])
                else:
                    print('OOV found!')
                    print(
                        'You have to deal with that')  # PennTreeBank doesn't have OOV but "Trust is good, control is better!"
                    break
            res.append(tmp_seq)
        return res


In [8]:
def collate_fn(data, pad_token):
    def merge(sequences):
        '''
        merge from batch * sent_len to batch * max_len
        '''
        lengths = [len(seq) for seq in sequences]
        max_len = 1 if max(lengths) == 0 else max(lengths)
        # Pad token is zero in our case
        # So we create a matrix full of PAD_TOKEN (i.e. 0) with the shape
        # batch_size X maximum length of a sequence
        padded_seqs = torch.LongTensor(len(sequences), max_len).fill_(pad_token)
        for i, seq in enumerate(sequences):
            end = lengths[i]
            padded_seqs[i, :end] = seq  # We copy each sequence into the matrix
        padded_seqs = padded_seqs.detach()  # We remove these tensors from the computational graph
        return padded_seqs, lengths

    # Sort data by seq lengths

    data.sort(key=lambda x: len(x["source"]), reverse=True)
    new_item = {}
    for key in data[0].keys():
        new_item[key] = [d[key] for d in data]

    source, _ = merge(new_item["source"])
    target, lengths = merge(new_item["target"])

    new_item["source"] = source.to(device)
    new_item["target"] = target.to(device)
    new_item["number_tokens"] = sum(lengths)
    return new_item


In [9]:
# 6.5 Train and validate the model
def train_loop(data, optimizer, criterion, model, average_model, clip=5):
    model.train()
    loss_array = []
    number_of_tokens = []

    for sample in data:
        optimizer.zero_grad()  # Zeroing the gradient
        output = model(sample['source'])
        loss = criterion(output, sample['target'])
        loss_array.append(loss.item() * sample["number_tokens"])
        number_of_tokens.append(sample["number_tokens"])
        loss.backward()  # Compute the gradient, deleting the computational graph
        # clip the gradient to avoid explosion gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()  # Update the weights

        # Update the average model
        # for param, avg_param in zip(model.parameters(), average_model.parameters()):
        #     avg_param.data.mul_(0.999).add_(0.001, param.data)

        #  Update the average model.
        for param, avg_param in zip(model.parameters(), average_model.parameters()):
            avg_param.data.mul_(0.9).add_(param.data, alpha=0.1)

    return sum(loss_array) / sum(number_of_tokens)

In [10]:
def eval_loop(data, eval_criterion, model):
    model.eval()
    loss_to_return = []
    loss_array = []
    number_of_tokens = []
    # softmax = nn.Softmax(dim=1) # Use Softmax if you need the actual probability
    with torch.no_grad():  # It used to avoid the creation of computational graph
        for sample in data:
            output = model(sample['source'])
            loss = eval_criterion(output, sample['target'])
            loss_array.append(loss.item())
            number_of_tokens.append(sample["number_tokens"])

    ppl = math.exp(sum(loss_array) / sum(number_of_tokens))
    loss_to_return = sum(loss_array) / sum(number_of_tokens)
    return ppl, loss_to_return


In [11]:
def init_weights(mat):
    for m in mat.modules():
        if type(m) in [nn.GRU, nn.LSTM, nn.RNN]:
            for name, param in m.named_parameters():
                if 'weight_ih' in name:
                    for idx in range(4):
                        mul = param.shape[0] // 4
                        torch.nn.init.xavier_uniform_(param[idx * mul:(idx + 1) * mul])
                elif 'weight_hh' in name:
                    for idx in range(4):
                        mul = param.shape[0] // 4
                        torch.nn.init.orthogonal_(param[idx * mul:(idx + 1) * mul])
                elif 'bias' in name:
                    param.data.fill_(0)
        else:
            if type(m) in [nn.Linear]:
                torch.nn.init.uniform_(m.weight, -0.01, 0.01)
                if m.bias != None:
                    m.bias.data.fill_(0.01)


In [14]:

# Load the dataset
train_raw = read_file("dataset/ptb.train.txt")
dev_raw = read_file("dataset/ptb.valid.txt")
test_raw = read_file("dataset/ptb.test.txt")

# Create the vocabulary
# Vocab is computed only on training set
# However you can compute it for dev and test just for statistics about OOV
vocab = get_vocab(train_raw, ["<pad>", "<eos>"])
lang = Lang(train_raw, ["<pad>", "<eos>"])

# PennTreeBank dataset instantiation
train_dataset = PennTreeBank(train_raw, lang)
dev_dataset = PennTreeBank(dev_raw, lang)
test_dataset = PennTreeBank(test_raw, lang)

# Dataloader instantiation
train_loader = DataLoader(train_dataset, batch_size=256,
                          collate_fn=partial(collate_fn, pad_token=lang.word2id["<pad>"]), shuffle=True)
dev_loader = DataLoader(dev_dataset, batch_size=256, collate_fn=partial(collate_fn, pad_token=lang.word2id["<pad>"]))
test_loader = DataLoader(test_dataset, batch_size=256, collate_fn=partial(collate_fn, pad_token=lang.word2id["<pad>"]))

# Experiment also with a smaller or bigger model by changing hid and emb sizes
# A large model tends to overfit
hid_size = 100
emb_size = 150

# With SGD try with an higer learning rate
lr = 0.1  # This is definitely not good for SGD
clip = 5  # Clip the gradient
device = 'cuda:0'

vocab_len = len(lang.word2id)
model = LM_RNN(emb_size, hid_size, vocab_len, pad_index=lang.word2id["<pad>"], tied_weights=True).to(device)
model.apply(init_weights)

# Optimizer
# optimizer = optim.AdamW(model.parameters(), lr=lr)  # Replaced SGD with AdamW
# Changing optimizer to ASGD, which is the basis for NT-ASGD (Non-monotonically Triggered AvSGD)
optimizer = torch.optim.SGD(model.parameters(), lr=lr)
# make copy of model
average_model = copy.deepcopy(model)
check_interval = 5  # Set check interval for NT-AvSGD
non_monotonic_trigger = 2  # Set trigger for NT-AvSGD
last_losses = []  # Store losses of last check_interval epochs

criterion_train = nn.CrossEntropyLoss(ignore_index=lang.word2id["<pad>"])
criterion_eval = nn.CrossEntropyLoss(ignore_index=lang.word2id["<pad>"], reduction='sum')

In [15]:
# Training loop
# Set max_split_size_mb to avoid fragmentation of the GPU memory
torch.backends.cuda.max_split_size_mb = 512  # Adjust the value according to your needs
torch.cuda.empty_cache()
torch.cuda.max_memory_allocated(device=device) / 1024 ** 2
gc.collect()

n_epochs = 100
patience = 3
losses_train = []
losses_dev = []
sampled_epochs = []
best_ppl = math.inf
best_model = None
pbar = tqdm(range(1, n_epochs))

# LR scheduler that reduces LR when a metric has stopped improving (patience=3 means reducing LR after 3 epochs)
scheduler = ReduceLROnPlateau(optimizer, patience=patience, factor=0.1, verbose=True)

# If the PPL is too high try to change the learning rate
for epoch in pbar:
    loss = train_loop(train_loader, optimizer, criterion_train, model, average_model, clip)

    if epoch % 1 == 0:
        sampled_epochs.append(epoch)
        losses_train.append(np.asarray(loss).mean())
        ppl_dev, loss_dev = eval_loop(dev_loader, criterion_eval, model)
        losses_dev.append(np.asarray(loss_dev).mean())
        pbar.set_description("PPL: %f" % ppl_dev)
        if ppl_dev < best_ppl:  # the lower, the better
            best_ppl = ppl_dev
            best_model = copy.deepcopy(model).to('cpu')
            patience = 3
        else:
            patience -= 1

        # NT-AvSGD
        # Add logic for non-monotonic triggering
        last_losses.append(loss_dev)
        if len(last_losses) > check_interval:
            last_losses.pop(0)
            # Check if the last check_interval losses are not monotonically decreasing (trigger)
            # and switch to average model
            if sum(x > y for x, y in zip(last_losses[1:], last_losses[:-1])) >= non_monotonic_trigger:
                model.load_state_dict(average_model.state_dict())  # Switch to average model

        if patience <= 0:  # Early stopping with patience
            break  # Not nice but it keeps the code clean

best_model.to(device)
final_ppl, _ = eval_loop(test_loader, criterion_eval, best_model)
print('Test ppl: ', final_ppl)


  0%|          | 0/99 [00:00<?, ?it/s]


RuntimeError: mat1 and mat2 shapes cannot be multiplied (19712x100 and 150x10001)

In [15]:
# # To save the model
path = 'model_bin/ex2-model_name-2.pt'
torch.save(model.state_dict(), path)
# To load the model you need to initialize it
model = LM_RNN(emb_size, hid_size, vocab_len, pad_index=lang.word2id["<pad>"]).to(device)
# Then you load it
model.load_state_dict(torch.load(path))

<All keys matched successfully>

In [16]:
def analogy_our_model(w1, w2, w3, model, lang):
    model.eval().to('cpu')

    # Suggest: make use of torch.LongTensor and check if the word is in the vocab
    # Get word ids
    temp_w1 = lang.word2id[w1]
    temp_w2 = lang.word2id[w2]
    temp_w3 = lang.word2id[w3]

    # Get word vectors
    v1 = model.get_word_embedding(torch.LongTensor([temp_w1]))
    v2 = model.get_word_embedding(torch.LongTensor([temp_w2]))
    v3 = model.get_word_embedding(torch.LongTensor([temp_w3]))

    # relation vector
    rv = v3 + v2 - v1

    # Get the most similar word
    ms = model.get_most_similar(rv, top_k=10)

    # getting words & scores
    for i, key in enumerate(ms[0]):
        print(lang.id2word[key], ms[1][i])

In [17]:
# Our model is trained on WSJ news queen and king should be OOV or very rare tokens
# Try with different words
analogy_our_model('man', 'woman', 'u.s.', model, lang)

stakes 0.60711277
argue 0.5819302
banks 0.3156696
exercises 0.31024888
entrepreneurial 0.27440998
recover 0.27205944
alternatively 0.26840913
whiskey 0.26704118
aware 0.26165247
mile 0.25744873


In [18]:
# Our model is trained on WSJ news queen and king should be OOV or very rare tokens
analogy_our_model('a', 'woman', 'queen', model, lang)

stakes 0.5844376
valley 0.5793578
travel 0.30261374
timing 0.2850653
cocoa 0.2767574
strip 0.27127886
mills 0.26965898
procedural 0.2546013
locations 0.25051957
volatile 0.24943632
