# Week 3 - NLP and Deep Learning

---

# Lecture 5 - Language Identification with a Feedforward Neural Network


In this exercise, you will implement the forward step of a feedforward neural network (FFNN) from scratch and compare your solution to PyTorch on a small toy example to predict the language for a given word.

It is very important that you understand the basic building blocks (input/output: how to encode your instances, the labels; the model: what the neural network consists of, how to learn its weights, how to do a forward pass for prediction). 

##  1. Representing the data

We are assuming multi-class classification tasks for the assignments of this week. The labels are: $$ y \in \{da,nl,en\}$$

We will use the same data as in week 2, from:
* English [Wookipedia](https://starwars.fandom.com/wiki/Main_Page)  
* Danish [Kraftens Arkiver](https://starwars.fandom.com/da/wiki) 
* Dutch [Yodapedia](https://starwars.fandom.com/da/wiki)


In [209]:
import torch
import numpy as np
def load_langid(path):
    text = []
    labels = []
    for line in open(path, encoding="utf-8"):
        tok = line.strip().split('\t')
        labels.append(tok[0])
        text.append(tok[1])
    return text, labels

wooki_train_text, wooki_train_labels = load_langid('langid-data/wookipedia_langid.train.tok.txt')



* a) Convert the training data into n-hot format, where each feature represents whether a **single character** is present or not. Similarly, convert the labels into numeric format. For simplicity, you can assume a closed vocabulary (only the characters in `wooki_train_text`, no unknown-character handling). Keep the original casing, and assign the character indices in order of first appearance in the training data.

  * What is the vocabulary size?
  
**Hint:** It is easier for the rest of the assignment if you directly use a torch tensor to save the features ([tutorial](https://pytorch.org/tutorials/beginner/blitz/tensor_tutorial.html#sphx-glr-beginner-blitz-tensor-tutorial-py)). A 2D torch tensor filled with zeros can be created with `torch.zeros(dim1, dim2, dtype=float)`. Note the use of `float` instead of `int` here; this is only because `torch.mm` requires float tensors as input.

In [210]:
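# Collect every character that occurs in the training text
tokens = []
for text in wooki_train_text:
    tokens += list(text)

# NOTE: set() removes duplicates but does not keep the order of first appearance,
# so these indices will not match the char2idx mapping of the pre-trained model used later
tokens = list(set(tokens))
print(len(tokens))
print(tokens)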


131
['0', 'f', 'n', 'ï', ']', '|', 'ö', 'É', '°', 'ː', 'O', '–', 'θ', '“', 'u', 'k', '»', 'Å', ')', 'T', '4', '(', 'V', '[', 'à', '²', 'W', '1', 'e', '<', 'E', '.', ';', 'y', 'j', 'æ', '=', 'ə', 'g', '>', 'p', '8', '^', 'ō', 'm', '∑', 'a', '9', '«', 'B', 'è', 'ó', 'ɹ', 'q', '’', 'D', 'H', 'ü', 'd', 'J', ':', '6', '―', '…', '$', '#', 'Z', 'ś', 'b', '?', 'P', 'á', '™', 'Y', 'I', 'å', 'X', 'L', '´', 'A', 'o', "'", 'R', ',', 'Q', 'G', '&', 'K', '\u200b', '5', 'N', 'i', 'ɑ', 't', 'w', 'ë', '½', '!', '%', '`', 'Ø', 'Æ', 'v', '7', '‘', 'ʊ', '-', 'C', 'é', 'l', 'x', 'Θ', ' ', '”', 'M', 'ń', '—', 's', '/', 'U', 'r', 'ø', '+', 'h', 'c', '2', 'F', 'z', 'S', '3', 'ñ']


In [211]:
# n-hot matrix: one row per training instance, one column per character
BOL_matrix = torch.zeros(len(wooki_train_text), len(tokens), dtype=torch.float)

for s, sentence in enumerate(wooki_train_text):
    for i, char in enumerate(tokens):
        if char in sentence:
            BOL_matrix[s, i] = 1


In [212]:
BOL_matrix[1]

tensor([0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 1., 0., 0., 1., 0.,
        0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
        0., 1., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 1., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
        0., 1., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 0.,
        0., 0., 0., 0., 0.])
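
The exercise asks for character indices in order of first appearance. Below is a minimal sketch of that vocabulary construction and of the n-hot encoding on a made-up three-word corpus. The `toy_text` data and the helper names `build_vocab` and `n_hot` are illustrative assumptions, not part of the course code. Plain lists are used here, but the same loop could fill a `torch.zeros` tensor.

```python
def build_vocab(texts):
    """Map each character to an index, in order of first appearance."""
    char2idx = {}
    for text in texts:
        for ch in text:
            if ch not in char2idx:
                char2idx[ch] = len(char2idx)
    return char2idx


def n_hot(text, char2idx):
    """Return a 0/1 vector marking which vocabulary characters occur in text."""
    vec = [0.0] * len(char2idx)
    for ch in set(text):
        if ch in char2idx:  # closed vocabulary: unseen characters are skipped
            vec[char2idx[ch]] = 1.0
    return vec


toy_text = ["hej", "hello", "hallo"]  # made-up example data
toy_vocab = build_vocab(toy_text)
toy_features = [n_hot(t, toy_vocab) for t in toy_text]

print(toy_vocab)
print(toy_features)
```

Because a dictionary keeps insertion order, the indices are reproducible across runs, which matters once pre-trained weights expect a fixed character order.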

##  2: Forward pass (from scratch)

### Feedforward Neural Networks (FNNs) or MLPs

Feedforward Neural Networks (FNNs) are also called Multilayer Perceptrons (MLPs). These are the most basic types of neural networks. They are called feedforward because information flows in one direction, from the input nodes through the network to the output nodes.

It is essential to understand that a neural network is a non-linear classification model which is based upon function application. Each layer in a neural network is an application of a function.

Summary (by J.Frellsen):
<img src="pics/fnn_jf.png">

You are going to implement the forward step manually on a small dataset. You will create a network following the design in the figure below (note that the input should be the same size as the number of characters found in the previous assignment, instead of 4):

<img src="pics/nn.svg">

a) How many neurons do hidden layer 1 and hidden layer 2 have? Note: the bias node is not shown in the figure, and you do not have to count it for this assignment.

b) How many neurons does the output layer have? And the input layer? (Note: the figure shows only 4 input nodes, in this example your input size is defined in the previous assignment - what is the input layer size?)

c) Specify the size of layers of the feedforward neural network:

In [213]:
## input and output dimensions of each layer
input_dim = len(tokens)
hidden_dim1 = 15
hidden_dim2 = 20
output_dim = 3

d) Now initialize the layers themselves as torch tensors (do not use a `torch.nn.Module` here!). You can define the bias and the weights in separate tensors. The weights should be initialized randomly (`torch.randn((dim1, dim2), dtype=torch.float)`, see also [torch.randn](https://pytorch.org/docs/stable/generated/torch.randn.html)) and the biases can be set to 1 (`torch.ones(dim1, dtype=torch.float)`, see also [torch.ones](https://pytorch.org/docs/stable/generated/torch.ones.html)). Confirm that their sizes match your answers to `a)` and `b)` by printing the `.shape` of the tensors.


In [214]:
## define all parameters of this NN

# Initialize weights and biases for each layer
# Layer 1: input -> hidden1
weights1 = torch.randn((input_dim, hidden_dim1), dtype=torch.float)
bias1 = torch.ones(hidden_dim1, dtype=torch.float)

# Layer 2: hidden1 -> hidden2
weights2 = torch.randn((hidden_dim1, hidden_dim2), dtype=torch.float)
bias2 = torch.ones(hidden_dim2, dtype=torch.float)

# Layer 3: hidden2 -> output
weights3 = torch.randn((hidden_dim2, output_dim), dtype=torch.float)
bias3 = torch.ones(output_dim, dtype=torch.float)

# Print the shapes of the weights and biases to confirm dimensions
print("Weights1 shape:", weights1.shape)
print("Bias1 shape:", bias1.shape)

print("Weights2 shape:", weights2.shape)
print("Bias2 shape:", bias2.shape)

print("Weights3 shape:", weights3.shape)
print("Bias3 shape:", bias3.shape)

Weights1 shape: torch.Size([131, 15])
Bias1 shape: torch.Size([15])
Weights2 shape: torch.Size([15, 20])
Bias2 shape: torch.Size([20])
Weights3 shape: torch.Size([20, 3])
Bias3 shape: torch.Size([3])


Now that we have defined the shape of all parameters, we are ready to "connect the dots" and build the network. 

It is instructive to break the computation of each layer down into two steps: the scores $a1$ are obtained by a linear function, and the activation function $\sigma$ is then applied to them to obtain the representation $z1$, as in:

$$ a1 = xW_1 + b_1$$
$$ z1 = \sigma(a1)$$
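
Written out for all three layers, the same pattern reads as follows. This is a sketch that assumes a batch of $N$ instances, the layer sizes used in this exercise (131 input characters, hidden sizes 15 and 20, 3 output classes), $\sigma = \tanh$ for the hidden layers, and a softmax on the output scores. The symbol $\hat{y}$ is introduced here for the predicted distribution.

$$
\begin{aligned}
a1 &= x W_1 + b_1, & z1 &= \tanh(a1), & x &\in \mathbb{R}^{N \times 131},\ W_1 \in \mathbb{R}^{131 \times 15} \\
a2 &= z1\, W_2 + b_2, & z2 &= \tanh(a2), & W_2 &\in \mathbb{R}^{15 \times 20} \\
a3 &= z2\, W_3 + b_3, & \hat{y} &= \mathrm{softmax}(a3), & W_3 &\in \mathbb{R}^{20 \times 3}
\end{aligned}
$$

Each bias $b_k$ has as many entries as the layer has output units and is added to every row of the batch.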

e) Specify the entire network up to the output layer $z3$, up to but not including the final application of the softmax (the last activation function), which is provided. For the matrix multiplication, [torch.mm](https://pytorch.org/docs/stable/generated/torch.mm.html) can be used. Use a tanh activation function: [torch.tanh](https://pytorch.org/docs/stable/generated/torch.tanh.html).

The exact implementation of the softmax might differ from toolkit to toolkit (due to variations in implementation details in order to obtain numerical stability). Therefore, we will use the PyTorch implementation for the softmax calculation ([torch.nn.Softmax](https://pytorch.org/docs/stable/generated/torch.nn.Softmax.html)).

In [215]:

# Forward pass function to handle batch inputs
def forward_pass(x):
    # Layer 1: input -> hidden1
    a1 = torch.mm(x, weights1) + bias1  # Linear function 
    z1 = torch.tanh(a1)  # Apply tanh activation
    
    # Layer 2: hidden1 -> hidden2
    a2 = torch.mm(z1, weights2) + bias2
    z2 = torch.tanh(a2)  # Apply tanh activation
    
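    # Layer 3: hidden2 -> output
    a3 = torch.mm(z2, weights3) + bias3
    # NOTE: this extra tanh squashes the scores into [-1, 1]; the exercise asks for the
    # scores before the final activation (softmax), which would be a3 itself
    z3 = torch.tanh(a3)

    return z3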


# Perform the forward pass on the entire training set
output = forward_pass(BOL_matrix)

print("Output z3 (before softmax) for all samples in the training set:", output)


Output z3 (before softmax) for all samples in the training set: tensor([[-0.9990,  0.7394,  0.8762],
        [-0.9823, -0.9970,  1.0000],
        [-0.9954, -0.9959, -0.9569],
        ...,
        [-0.8334, -0.9985,  0.9997],
        [-1.0000,  0.8612,  1.0000],
        [-1.0000,  0.2778,  0.9845]])


We can check that all predictions sum up to approximately 1 (hint: use `torch.sum` with `axis=1`)



In [216]:
torch.sum(output, axis=1)
# the rows do not sum to 1 because softmax has not been applied to `output` yet

tensor([ 0.6166, -0.9793, -2.9482,  ..., -0.8322,  0.8613,  0.2623])
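
The row sums above are not 1 because the values are raw layer outputs rather than probabilities: softmax has not been applied yet. Below is a small pure-Python sketch of a numerically stable softmax. The helper name `softmax` is chosen here for illustration; in the notebook itself, `torch.softmax(output, dim=1)` would do the same for the whole tensor.

```python
import math


def softmax(scores):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


row = [-0.9990, 0.7394, 0.8762]  # first row of the printed output above
probs = softmax(row)

print(probs)
print(sum(row))    # raw scores: no reason for these to sum to 1
print(sum(probs))  # after softmax: sums to 1 up to rounding
```

Subtracting the maximum does not change the result, but it keeps `math.exp` from overflowing on large scores.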


Congrats! You have made it through the manual construction of the forward pass. Note that these weights are still random, so performance is not expected to be good. Now let's compare your implementation to a set of pre-determined weights.

##  3. Where do the weights come from?  Loading existing weights

So far, the model used randomly initialized weights. In this step we will load pre-trained model weights and do the forward pass with those weights, in order to check your implementation against model predictions computed by the toolkit.

Now we are going to:
* load pretrained weights for all parameters
* apply the weights to the evaluation data
* check that your manual softmax scores match the ones obtained by the pre-trained model `model` that we will load
* convert the output to labels and calculate the accuracy score

First, let's load the pre-trained model:

In [217]:
import torch
import torch.nn as nn

# use the character indexing from assignment 1
idx2char = ['H', 'e', ' ', 'v', 'n', 'w', 't', 's', 'o', 'f', 'a', 'r', 'u', 'g', 'h', ',', 'i', 'c', 'y', 'd', 'b', 'm', 'p', 'l', 'k', '.', 'D', 'E', 'C', 'j', 'R', 'S', 'U', '1', "'", 'æ', 'å', 'q', '`', 'I', '(', ')', 'M', 'F', '-', 'x', 'K', '9', '5', 'B', 'W', 'z', 'G', 'P', 'L', '/', 'O', '6', 'T', '7', 'Z', '2', '0', 'J', 'V', 'A', 'ø', 'X', '–', 'N', 'ë', ':', '&', '3', 'Y', 'é', '4', '[', ']', '’', ';', '8', 'É', 'Æ', 'Q', '!', '—', 'ï', '°', 'ō', '\u200b', '‘', 'ń', '“', '”', '?', 'Å', '<', '>', '#', '%', '+', 'ʊ', 'ɹ', 'ə', 'ɑ', 'ö', 'à', 'á', 'è', '=', 'ü', 'Ø', '∑', '^', 'ś', 'ñ', '|', '½', '$', '«', '™', 'ó', '´', '…', '―', '»', 'ː', 'θ', '²', 'Θ']
char2idx = {'H': 0, 'e': 1, ' ': 2, 'v': 3, 'n': 4, 'w': 5, 't': 6, 's': 7, 'o': 8, 'f': 9, 'a': 10, 'r': 11, 'u': 12, 'g': 13, 'h': 14, ',': 15, 'i': 16, 'c': 17, 'y': 18, 'd': 19, 'b': 20, 'm': 21, 'p': 22, 'l': 23, 'k': 24, '.': 25, 'D': 26, 'E': 27, 'C': 28, 'j': 29, 'R': 30, 'S': 31, 'U': 32, '1': 33, "'": 34, 'æ': 35, 'å': 36, 'q': 37, '`': 38, 'I': 39, '(': 40, ')': 41, 'M': 42, 'F': 43, '-': 44, 'x': 45, 'K': 46, '9': 47, '5': 48, 'B': 49, 'W': 50, 'z': 51, 'G': 52, 'P': 53, 'L': 54, '/': 55, 'O': 56, '6': 57, 'T': 58, '7': 59, 'Z': 60, '2': 61, '0': 62, 'J': 63, 'V': 64, 'A': 65, 'ø': 66, 'X': 67, '–': 68, 'N': 69, 'ë': 70, ':': 71, '&': 72, '3': 73, 'Y': 74, 'é': 75, '4': 76, '[': 77, ']': 78, '’': 79, ';': 80, '8': 81, 'É': 82, 'Æ': 83, 'Q': 84, '!': 85, '—': 86, 'ï': 87, '°': 88, 'ō': 89, '\u200b': 90, '‘': 91, 'ń': 92, '“': 93, '”': 94, '?': 95, 'Å': 96, '<': 97, '>': 98, '#': 99, '%': 100, '+': 101, 'ʊ': 102, 'ɹ': 103, 'ə': 104, 'ɑ': 105, 'ö': 106, 'à': 107, 'á': 108, 'è': 109, '=': 110, 'ü': 111, 'Ø': 112, '∑': 113, '^': 114, 'ś': 115, 'ñ': 116, '|': 117, '½': 118, '$': 119, '«': 120, '™': 121, 'ó': 122, '´': 123, '…': 124, '―': 125, '»': 126, 'ː': 127, 'θ': 128, '²': 129, 'Θ': 130}

# the label indexes that were used during training
label2idx = {'da':0, 'nl':1, 'en':2}
idx2label = ['da', 'nl', 'en']

# This is the definition of an FNN model in PyTorch, and can mostly be ignored for now.
# We will focus on how to create Torch models in lecture 6
class LangId(nn.Module):
    def __init__(self, vocab_size):
        super(LangId, self).__init__()
        self.input = nn.Linear(vocab_size, 15)
        self.hidden1 = nn.Linear(15, 20)
        self.hidden2 = nn.Linear(20, 3)

    def forward(self, x):
        x = torch.tanh(self.input(x))
        x = torch.tanh(self.hidden1(x))
        x = self.hidden2(x)
        return x

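# model.pt stores the whole pickled model, so recent PyTorch versions need weights_only=False; only load files you trust
lang_classifier = torch.load('model.pt', weights_only=False)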




Inspect the weights you just loaded using the `state_dict()` function of the model: 

In [218]:
print(lang_classifier)
lang_classifier.state_dict()


LangId(
  (input): Linear(in_features=131, out_features=15, bias=True)
  (hidden1): Linear(in_features=15, out_features=20, bias=True)
  (hidden2): Linear(in_features=20, out_features=3, bias=True)
)


OrderedDict([('input.weight',
              tensor([[ 0.1274,  0.2723,  0.4691,  ...,  0.0754,  0.0201,  0.0813],
                      [-0.1876,  0.3465,  0.4979,  ..., -0.0436, -0.0362, -0.0866],
                      [ 0.1779,  0.3311,  0.3578,  ..., -0.0705,  0.0656,  0.0415],
                      ...,
                      [-0.0264,  0.2019,  0.1753,  ...,  0.0335,  0.0764,  0.0222],
                      [-0.0810, -0.3535, -0.1255,  ..., -0.0645,  0.0299,  0.0438],
                      [ 0.0740, -0.1535,  0.1290,  ..., -0.0464, -0.0612,  0.0650]])),
             ('input.bias',
              tensor([ 0.4091,  0.8057,  0.4696,  0.3282,  0.4459, -0.3094, -0.7575, -0.3531,
                      -0.3175,  0.2946,  0.7420,  0.1358,  0.1037, -0.2193, -0.3283])),
             ('hidden1.weight',
              tensor([[ 0.2575,  0.6953,  0.5631,  0.2704, -0.3716, -0.9438, -0.4709, -0.9932,
                       -0.7564, -0.0925, -2.2822,  0.2297,  0.2956,  0.0241, -1.9843],
            

* a) Convert the following dev data into the input format for the neural network above. 

**Hint:** The indices of the characters are based on their order in the training data and must be the same for the development data. The code above provides the `idx2char` and `char2idx` mappings that were used to train the model.

In [219]:
wooki_dev_text, wooki_dev_labels = load_langid('langid-data/wookipedia_langid.dev.tok.txt')

# n-hot matrix for the dev data, using the character order the model was trained with
BOL_dev_matrix = torch.zeros(len(wooki_dev_text), len(idx2char), dtype=torch.float)

for s, sentence in enumerate(wooki_dev_text):
    for i, char in enumerate(idx2char):
        if char in sentence:
            BOL_dev_matrix[s, i] = 1




* b) Run a forward pass on the dev data with `lang_classifier`, using the `forward()` function.


In [220]:
Output_1=lang_classifier.forward(BOL_dev_matrix)
Output_1

tensor([[-2.3604,  0.7146,  0.3900],
        [-2.2557,  1.4831, -0.5462],
        [-2.0805,  1.3734, -0.5807],
        ...,
        [ 2.0050, -1.2380, -1.6113],
        [ 2.2639, -1.3041, -1.8159],
        [-1.7680, -1.5234,  2.6722]], grad_fn=<AddmmBackward0>)

* c) Apply your manual implementation of the forward pass to the evaluation data by using the parameters (weights) you just loaded with `state_dict()`. This allows you to check if you get the same results back as the model implemented in Torch. If the outputs match, you implemented the forward pass correctly, congratulations!



In [221]:
# Extract the parameters from the state_dict.
# nn.Linear stores weights as (out_features, in_features), so transpose them;
# the 1-D bias vectors need no transpose.
state = lang_classifier.state_dict()
weights1 = state['input.weight'].T
bias1 = state['input.bias']
weights2 = state['hidden1.weight'].T
bias2 = state['hidden1.bias']
weights3 = state['hidden2.weight'].T
bias3 = state['hidden2.bias']

# forward_pass reads these module-level variables, so it now uses the loaded weights


# Perform the forward pass on the entire dev set
Output_2 = forward_pass(BOL_dev_matrix)

print("Output z3 (before softmax) for all samples in the dev set:", Output_2)



Output z3 (before softmax) for all samples in the dev set: tensor([[-0.9823,  0.6135,  0.3713],
        [-0.9783,  0.9021, -0.4977],
        [-0.9693,  0.8795, -0.5232],
        ...,
        [ 0.9644, -0.8449, -0.9233],
        [ 0.9786, -0.8628, -0.9484],
        [-0.9434, -0.9093,  0.9905]])


**Hint**: internally, the torch model stores each weight matrix in transposed form for efficiency reasons. This means that W1 will have the shape (15, 131). To use your previous implementation, you have to call the transpose function in PyTorch ([`.t()`](https://pytorch.org/docs/stable/generated/torch.t.html)), which converts the shape to (131, 15).
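
A quick way to see the transposed storage is sketched below on a freshly created `nn.Linear` layer rather than on the loaded model. The variable names and the random input batch are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(131, 15)   # same sizes as the first layer of the model
x = torch.rand(4, 131)       # a made-up batch of 4 instances

print(layer.weight.shape)    # stored as (out_features, in_features)

# Manual linear step with the transposed weight matrix
manual = torch.mm(x, layer.weight.t()) + layer.bias
reference = layer(x)

# True if the manual computation matches the layer's own output
print(torch.allclose(manual, reference, atol=1e-5))
```

The bias is a 1-D tensor, so it needs no transpose; broadcasting adds it to every row of the batch.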

* d) Now apply softmax to the resulting outputs and convert them to label predictions.

In [222]:
import torch.nn.functional as F

softmax_result1 = F.softmax(Output_1, dim=1)
print("softmax 1 \n", softmax_result1)

softmax_result2 = F.softmax(Output_2, dim=1)
print("softmax 2 \n" ,softmax_result2)



softmax 1 
 tensor([[0.0261, 0.5653, 0.4086],
        [0.0206, 0.8657, 0.1138],
        [0.0270, 0.8523, 0.1208],
        ...,
        [0.9381, 0.0366, 0.0252],
        [0.9568, 0.0270, 0.0162],
        [0.0115, 0.0147, 0.9738]], grad_fn=<SoftmaxBackward0>)
softmax 2 
 tensor([[0.1020, 0.5031, 0.3949],
        [0.1090, 0.7147, 0.1763],
        [0.1122, 0.7126, 0.1753],
        ...,
        [0.7603, 0.1245, 0.1151],
        [0.7668, 0.1216, 0.1116],
        [0.1117, 0.1156, 0.7727]])


### Not quite the same, but close: `forward_pass` applies a final `tanh`, which `LangId.forward` does not
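
A pure-Python check on the first printed row supports the idea that the manual output is `tanh` of the model's logits, since the final `tanh` in `forward_pass` has no counterpart in `LangId.forward`. The helper `softmax` is defined here for illustration, and the numbers are copied from the printed tensors above. Because `tanh` is monotonic, the predicted label is the same either way, even though the probabilities differ.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


idx2label = ['da', 'nl', 'en']             # label order given in the notebook
model_logits = [-2.3604, 0.7146, 0.3900]   # first row of Output_1
manual_out = [math.tanh(v) for v in model_logits]

print([round(v, 4) for v in manual_out])              # compare with the first row of Output_2
print([round(p, 4) for p in softmax(model_logits)])   # compare with "softmax 1"
print([round(p, 4) for p in softmax(manual_out)])     # compare with "softmax 2"

# The argmax is unaffected by the monotonic tanh
pred_model = max(range(3), key=lambda i: model_logits[i])
pred_manual = max(range(3), key=lambda i: manual_out[i])
print(idx2label[pred_model], idx2label[pred_manual])
```

Returning `a3` directly from the manual forward pass should therefore reproduce the model's logits.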

# Lecture 6: Word2vec and PyTorch
In the following exercises, you are going to explore what is represented with word embeddings. You are going to make use of the Python `gensim` package and two sets of pre-trained embeddings. The embeddings can be downloaded from:

* http://itu.dk/people/robv/data/embeds/twitter.bin.tar.gz
* http://itu.dk/people/robv/data/embeds/GoogleNews-50k.bin.tar.gz

The first embeddings are skip-gram embeddings trained on a collection of 2 billion words from English tweets collected during 2012 and 2018 with the default settings of word2vec. The second embeddings are trained on 100 billion words from Google News. They have both been truncated to the most frequent 500,000 words. Note that loading each of these embeddings requires approximately 2 GB of RAM.

The embeddings can be loaded in gensim as follows:

In [223]:
twitter_data=r"C:\Users\elias\Desktop\_ITU\4.semester\NLP\twitter.bin"
google_data=r"C:\Users\elias\Desktop\_ITU\4.semester\NLP\GoogleNews-50k.bin"

In [224]:
import gensim.models

twitEmbs = gensim.models.KeyedVectors.load_word2vec_format(
                                twitter_data, binary=True)
print('loading finished')

loading finished


You can now use the index operator ``[]`` or the function ``get_vector()`` to access the individual word embeddings.

In [225]:
#twitEmbs['cat']

## 4. Word similarities
Cosine similarity can be used to measure how close two word vectors are. It is defined as:
\begin{equation}
\cos(\vec{a},\vec{b}) = \frac{\vec{a} \cdot \vec{b}}{|\vec{a}| |\vec{b}|} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \sqrt{\sum_{i=1}^{n} b_i^2}}
\end{equation}

* a) Implement the cosine similarity using pure python (using only the ``math`` package, no other libraries). Note that `similarity == 1-distance`.

You can compare your scores to the gensim implementation to check whether it is correct. The following code should print the same value twice, up to rounding:

In [226]:
import math

def cosine(vec1, vec2):
    # Returns the cosine *distance* (1 - similarity), to match twitEmbs.distance()
    dot = sum(vec1 * vec2)
    len1 = math.sqrt(sum(vec1 * vec1))
    len2 = math.sqrt(sum(vec2 * vec2))
    return 1 - dot / (len1 * len2)

print(twitEmbs.distance('cat', 'dog'))
print(cosine(twitEmbs['cat'], twitEmbs['dog']))


0.10446518659591675
0.10446513714193606
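
The solution above multiplies NumPy arrays elementwise, so it is not strictly pure Python. Here is a sketch that works on any sequence of numbers using only `math`; the function names `cosine_similarity` and `cosine_distance` are chosen here for illustration.

```python
import math


def cosine_similarity(vec1, vec2):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    return dot / (norm1 * norm2)


def cosine_distance(vec1, vec2):
    return 1.0 - cosine_similarity(vec1, vec2)


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # orthogonal vectors
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # parallel vectors
print(cosine_distance([1.0, 1.0], [1.0, 0.0]))              # 45 degrees apart
```

Keeping similarity and distance as separate functions avoids mixing them up when comparing against `twitEmbs.similarity` and `twitEmbs.distance`.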


In WordNet, the distance between two senses can be based on their distance in the taxonomy. The most common metric for this is:

* Wu-Palmer Similarity: denotes how similar two word senses are, based on the depth of the two senses in the taxonomy and of their Least Common Subsumer (most specific ancestor node).

It can be obtained in Python like this:

In [227]:
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet

first_word = wordnet.synsets('cat')[0] #0 means: most common sense
second_word = wordnet.synsets('dog')[0]
print('WordNet similarity: ' + str(first_word.wup_similarity(second_word)))

print('Twitter similarity: ' + str(twitEmbs.similarity('cat', 'dog')))


WordNet similarity: 0.8571428571428571
Twitter similarity: 0.8955348
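
Wu-Palmer similarity is commonly defined as $2 \cdot depth(lcs) / (depth(s_1) + depth(s_2))$, where $lcs$ is the Least Common Subsumer. Below is a toy computation with made-up depths. The numbers are hypothetical and not read from WordNet, and NLTK's own depth counting may differ in detail.

```python
def wu_palmer(depth_lcs, depth_a, depth_b):
    """Wu-Palmer similarity from node depths in a taxonomy."""
    return 2 * depth_lcs / (depth_a + depth_b)


# Hypothetical depths, chosen only to illustrate the formula
print(wu_palmer(depth_lcs=12, depth_a=14, depth_b=14))  # deep shared ancestor: high score
print(wu_palmer(depth_lcs=1, depth_a=6, depth_b=7))     # ancestor near the root: low score
```

Note that `wordnet.synsets(word)[0]` picks the first listed sense, which may not be the sense you have in mind. A low score can therefore reflect the choice of sense rather than the words themselves.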






* b) Think of 5 word pairs which have a high similarity according to you. Estimate the difference between these pairs in WordNet as well as in the Twitter embeddings and the Google News embeddings. Which method is closest to your own intuition? (You can use the gensim implementation of cosine similarity here.)


In [228]:
fword = ["knight","steak","bench","dick","doctor","bike", "cheese"]
sword = ["warrior","beef","squat","cock","nurse","bicycle", "milk"]


for w1, w2 in zip(fword, sword):
    # [0] selects the first sense WordNet lists for each word
    first_word = wordnet.synsets(w1)[0]
    second_word = wordnet.synsets(w2)[0]

    print(w1, " and ", w2, " : ")

    print('WordNet similarity: ' + str(first_word.wup_similarity(second_word)))
    print('Twitter similarity: ' + str(twitEmbs.similarity(w1, w2)))




knight  and  warrior  : 
WordNet similarity: 0.631578947368421
Twitter similarity: 0.59290004
steak  and  beef  : 
WordNet similarity: 0.15384615384615385
Twitter similarity: 0.68714267
bench  and  squat  : 
WordNet similarity: 0.09090909090909091
Twitter similarity: 0.5594138
dick  and  cock  : 
WordNet similarity: 0.21052631578947367
Twitter similarity: 0.8291812
doctor  and  nurse  : 
WordNet similarity: 0.8695652173913043
Twitter similarity: 0.8612271
bike  and  bicycle  : 
WordNet similarity: 0.7272727272727273
Twitter similarity: 0.8499112
cheese  and  milk  : 
WordNet similarity: 0.875
Twitter similarity: 0.7525767


## 5. Training a Language Identification model using PyTorch

#### **Objective**  
Implement a PyTorch-based model for **language identification**. The goal is to train a model that can classify input text into three different languages.

* a) **Prepare Training Data.** 

Prepare the training data (texts) in an n-hot format, and the labels in a separate tensor. The training data will be of shape (15000, 131) for 15000 instances and 131 features (i.e. characters). The training labels will be of shape (15000) and contain the numeric labels representing the languages (e.g. 0, 1, 2).

In [229]:
wooki_train_text, wooki_train_labels = load_langid('langid-data/wookipedia_langid.train.tok.txt')

BOL_train_matrix=torch.zeros(len(wooki_train_text), len(idx2char), dtype=torch.float)

for s, sentence in enumerate(wooki_train_text):
    for i in range(len(idx2char)):
        if idx2char[i] in sentence:
            BOL_train_matrix[s][i]=1


print(BOL_train_matrix.shape)   # (number of instances, vocabulary size)
print(len(wooki_train_labels))  # number of labels, should equal the number of instances



torch.Size([15000, 131])
15000


In [230]:
train_data =BOL_train_matrix
train_labels = torch.tensor([label2idx[label] for label in wooki_train_labels])

In [231]:
train_labels.shape

torch.Size([15000])

* b) **Prepare dataset and dataloader**

You can use torch.utils.data.DataLoader and torch.utils.data.TensorDataset

In [232]:
from torch.utils.data import DataLoader, TensorDataset
batch_size = 32
train_dataset = TensorDataset(BOL_train_matrix, train_labels)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

* c) **Initialize Model, Loss Function, and Optimizer**

You can use the model as defined in assignment 3.

In [233]:
import torch.nn as nn
import torch.optim as optim

# Initialize langClassifier using LangId with vocab size as an argument
vocab_size = len(idx2char)
langClassifier = LangId(vocab_size)

# CrossEntropyLoss function
criterion = nn.CrossEntropyLoss()

# SGD optimizer with learning rate 0.00001 and momentum 0.9
optimizer = optim.SGD(langClassifier.parameters(), lr=0.00001, momentum=0.9)
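
For intuition about what the optimizer does with `lr` and `momentum`, here is a pure-Python sketch of the momentum update on a single scalar parameter. It follows the rule described in the PyTorch SGD documentation with default dampening: the velocity becomes `momentum * velocity + gradient`, and the parameter moves by `lr * velocity`. The function name and all numbers are made up for illustration.

```python
def sgd_momentum_step(param, grad, velocity, lr, momentum):
    """One SGD-with-momentum update for a single scalar parameter."""
    velocity = momentum * velocity + grad
    param = param - lr * velocity
    return param, velocity


param, velocity = 1.0, 0.0
for step in range(2):
    grad = 2.0  # pretend the gradient stays constant
    param, velocity = sgd_momentum_step(param, grad, velocity, lr=0.1, momentum=0.9)
    print(step, param, velocity)
```

The step size scales linearly with `lr`, so a very small learning rate such as 0.00001 changes the weights only slightly per batch.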


* d) **Implement the training Loop**

In [235]:

# Number of epochs
num_epochs = 20

def train(model, train_loader, criterion, optimizer, num_epochs):
    for epoch in range(num_epochs):
        model.train()  # Set the model to training mode
        running_loss = 0.0

        for batch_data, batch_labels in train_loader:
            # Zero the parameter gradients
            optimizer.zero_grad()

            # Run the forward pass on langClassifier model with batch_data and collect the outputs
            outputs = model(batch_data)

            # Calculate the loss of the outputs with respect to batch_labels
            loss = criterion(outputs, batch_labels)

            # Perform backpropagation of the loss
            loss.backward()

            # Update model weights using the optimizer
            optimizer.step()

            # Update running loss
            running_loss += loss.item()
        # Print the average loss at the end of each epoch
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {running_loss/len(train_loader)}')

train(langClassifier, train_loader, criterion, optimizer, num_epochs)


Epoch [1/20], Loss: 1.073342069125633
Epoch [2/20], Loss: 1.0723590652571557
Epoch [3/20], Loss: 1.0713857114950478
Epoch [4/20], Loss: 1.0703879546509114
Epoch [5/20], Loss: 1.0693813336175133
Epoch [6/20], Loss: 1.0683769140162194
Epoch [7/20], Loss: 1.067360745818376
Epoch [8/20], Loss: 1.0663304509384546
Epoch [9/20], Loss: 1.0652965464825823
Epoch [10/20], Loss: 1.0642437657821915
Epoch [11/20], Loss: 1.0631836299448887
Epoch [12/20], Loss: 1.0621023999094201
Epoch [13/20], Loss: 1.0610235527888545
Epoch [14/20], Loss: 1.0599193822092086
Epoch [15/20], Loss: 1.0587940765088046
Epoch [16/20], Loss: 1.057648436601228
Epoch [17/20], Loss: 1.0564977339844206
Epoch [18/20], Loss: 1.0553157456648121
Epoch [19/20], Loss: 1.0541284602842351
Epoch [20/20], Loss: 1.0528840591658408
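
A useful reference point for these numbers: with three classes, a model that assigns equal probability to every class has a cross-entropy of $\ln 3 \approx 1.0986$. A loss around 1.05 to 1.07 therefore means the model is still close to chance. Below is a pure-Python sketch of the per-instance loss, i.e. softmax followed by the negative log-likelihood, which is what `nn.CrossEntropyLoss` computes from raw logits before averaging over the batch. The helper name `cross_entropy` is illustrative.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of the target class under softmax(logits)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]


print(cross_entropy([0.0, 0.0, 0.0], target=0))    # uniform prediction: ln(3)
print(cross_entropy([-2.0, 3.0, -1.0], target=1))  # confident and correct: small loss
print(cross_entropy([-2.0, 3.0, -1.0], target=0))  # confident and wrong: large loss
```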


* e) **Save the Model**

Save the trained langClassifier model using torch.save()

In [236]:
# Save the trained model
torch.save(langClassifier.state_dict(), 'langClassifier_model.pth')
print("Model saved as langClassifier_model.pth")

Model saved as langClassifier_model.pth


* f) **Load and Prepare Development Set**

In [237]:
wooki_dev_text, wooki_dev_labels = load_langid('langid-data/wookipedia_langid.dev.tok.txt')

# Convert development data to n-hot format
BOL_dev_matrix = torch.zeros(len(wooki_dev_text), len(idx2char), dtype=torch.float)
for s, sentence in enumerate(wooki_dev_text):
    for i in range(len(idx2char)):
        if idx2char[i] in sentence:
            BOL_dev_matrix[s][i] = 1

# Convert labels to numeric format
dev_labels_numeric = torch.tensor([label2idx[label] for label in wooki_dev_labels])

dev_dataset = TensorDataset(BOL_dev_matrix, dev_labels_numeric)
dev_loader = DataLoader(dev_dataset, batch_size=batch_size, shuffle=False)


* g) **Evaluate the Model**
Evaluate the results on dev data:

In [238]:
# Load the saved model
langClassifier.load_state_dict(torch.load('langClassifier_model.pth'))
langClassifier.eval()

# Forward pass on dev_data with langClassifier model
logits = langClassifier(BOL_dev_matrix)

# Obtain the prediction with the highest logit
predicted = torch.argmax(logits, dim=1)

# Count the correct labels
correct = (predicted == dev_labels_numeric).sum().item()

# Compute and print accuracy
accuracy = correct / len(dev_labels_numeric)
print(f'Accuracy: {accuracy * 100:.2f}%')


Accuracy: 68.93%
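
Overall accuracy hides which languages are confused with each other. Below is a pure-Python sketch of a confusion count on made-up predictions. The lists `gold` and `pred` are hypothetical; with the notebook's tensors, one could pass `dev_labels_numeric.tolist()` and `predicted.tolist()` instead.

```python
from collections import Counter

idx2label = ['da', 'nl', 'en']


def confusion_counts(gold, pred):
    """Count how often each (gold, predicted) label pair occurs."""
    return Counter(zip(gold, pred))


gold = [0, 0, 1, 1, 2, 2]  # made-up gold labels
pred = [0, 1, 1, 1, 2, 0]  # made-up predictions

counts = confusion_counts(gold, pred)
accuracy = sum(n for (g, p), n in counts.items() if g == p) / len(gold)

for (g, p), n in sorted(counts.items()):
    print(f'gold={idx2label[g]} pred={idx2label[p]}: {n}')
print(f'Accuracy: {accuracy * 100:.2f}%')
```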




* h) Explore the tuning of one part of the model to improve performance. For example, you can tune the learning rate, change the number of layers or the dimensions of the layers, or prune the input space.

In [239]:
import torch.nn as nn
import torch.optim as optim

def train_and_evaluate(model, train_loader, num_epochs=20, learning_rate=0.00001, momentum=0.9):
    # Define the loss function and optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=momentum)

    # Training loop (use the model passed in, not the global langClassifier)
    train(model, train_loader, criterion, optimizer, num_epochs)

    # Evaluation on development set
    model.eval()
    with torch.no_grad():
        logits = model(BOL_dev_matrix)
        predicted = torch.argmax(logits, dim=1)
        correct = (predicted == dev_labels_numeric).sum().item()
        accuracy = correct / len(dev_labels_numeric)
        print(f'Accuracy: {accuracy * 100:.2f}%')

# Example usage:
# Initialize the model with new parameters if needed
vocab_size = len(idx2char)
langClassifier = LangId(vocab_size)

# Retrain and evaluate the model with new parameters
train_and_evaluate(langClassifier, train_loader, num_epochs=20, learning_rate=0.001, momentum=0.9)


Epoch [1/20], Loss: 1.0304788420957798
Epoch [2/20], Loss: 0.7167769280959294
Epoch [3/20], Loss: 0.47726635053467903
Epoch [4/20], Loss: 0.4150668075407492
Epoch [5/20], Loss: 0.39715405755332794
Epoch [6/20], Loss: 0.38949856432134916
Epoch [7/20], Loss: 0.38486340586374057
Epoch [8/20], Loss: 0.3826969844032961
Epoch [9/20], Loss: 0.380083731306133
Epoch [10/20], Loss: 0.37808633986503076
Epoch [11/20], Loss: 0.37775072843027013
Epoch [12/20], Loss: 0.3761742014302882
Epoch [13/20], Loss: 0.3744962166494398
Epoch [14/20], Loss: 0.3738096256309481
Epoch [15/20], Loss: 0.37315579996243725
Epoch [16/20], Loss: 0.37216841526377176
Epoch [17/20], Loss: 0.37206720112801106
Epoch [18/20], Loss: 0.37124445643633414
Epoch [19/20], Loss: 0.3697653824904326
Epoch [20/20], Loss: 0.36974122748573196
Accuracy: 85.90%
