# ELMo: Embeddings from Language Models 

In this assignment you will implement ELMo, a deep LSTM-based model for contextualized word embeddings. Your tasks are as follows:

- Preprocessing (20 points)
- Implementation of ELMo model (30 points)
  - 2-layer BiLSTM (15 points)
  - Highway layers (5 points) [link](https://paperswithcode.com/method/highway-layer) [paper](https://arxiv.org/pdf/1507.06228.pdf) [code](https://github.com/allenai/allennlp/blob/9f879b0964e035db711e018e8099863128b4a46f/allennlp/modules/highway.py#L11)
  - CharCNN embeddings (5 points) [paper](https://arxiv.org/pdf/1509.01626.pdf)
  - Handle out-of-vocabulary words (5 points)
- Report metrics and loss using TensorBoard, Comet, or another tool (10 points)
- Evaluate on the movie review dataset (20 points)
- Compare the performance with a BERT model (10 points)
- Clean and documented code (10 points)


Remarks: 

*   Use PyTorch
*   Cheating will result in 0 points


ELMo paper: https://arxiv.org/pdf/1802.05365.pdf

Possible datasets:
- [WikiText-103](https://blog.salesforceairesearch.com/the-wikitext-long-term-dependency-language-modeling-dataset/)
- Any monolingual dataset from [WMT](https://statmt.org/wmt22/translation-task.html)

## Data loading and preprocessing
Preprocess the English monolingual data (20 points):
- clean
- split into train and validation sets
- tokenize
- create a vocabulary and convert words to numbers ([vocab](https://pytorch.org/text/stable/vocab.html#id1))
- pad sequences

Use these tutorials, [one](https://pytorch.org/tutorials/beginner/torchtext_translation_tutorial.html) and [two](https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html), as a reference.
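The preprocessing steps listed above can be sketched end to end without any library, which may help when checking the torchtext-based cells below. This is a minimal illustration, not the reference solution: the helper names (`clean`, `tokenize`, `split_train_val`, `build_vocab`, `encode`, `pad`), the regex tokenizer, and the toy corpus are all assumptions made for this example.

```python
import random
import re
from collections import Counter

SPECIALS = ["<unk>", "<pad>", "<bos>", "<eos>"]

def clean(text):
    """Lowercase and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def tokenize(text):
    """Split into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def split_train_val(sentences, val_fraction=0.2, seed=42):
    """Shuffle a copy of the sentences and split it into train/validation."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]

def build_vocab(tokenized, min_freq=1):
    """Map each token to an integer id; special tokens take the first ids."""
    counts = Counter(tok for sent in tokenized for tok in sent)
    words = sorted(w for w, c in counts.items() if c >= min_freq)
    return {tok: i for i, tok in enumerate(SPECIALS + words)}

def encode(tokens, vocab):
    """Wrap in <bos>/<eos>; tokens missing from the vocabulary map to <unk>."""
    unk = vocab["<unk>"]
    return [vocab["<bos>"]] + [vocab.get(t, unk) for t in tokens] + [vocab["<eos>"]]

def pad(batch, pad_id):
    """Right-pad every sequence to the longest length in the batch."""
    width = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (width - len(seq)) for seq in batch]

# Toy corpus (hypothetical); the vocabulary is built from the training split only.
corpus = ["The cat sat.", "A dog   barked loudly!", "Cats sleep.", "The dog sat.", "Birds sing."]
train, val = split_train_val([clean(s) for s in corpus])
toy_vocab = build_vocab([tokenize(s) for s in train])
batch = pad([encode(tokenize(s), toy_vocab) for s in val], toy_vocab["<pad>"])
```

Building the vocabulary from the training split alone means validation words can be unseen, which is exactly the out-of-vocabulary case the model has to handle.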

![](https://miro.medium.com/max/720/1*UPirqwpBWnNmcwoUjfZZIA.png)

In [10]:
import torch
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt

In [21]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

### Read Sentences

In [74]:
data_dir = 'eng-simple_wikipedia_2021_10K'
data_filename = "eng-simple_wikipedia_2021_10K-sentences.txt"
data_full_filename = os.path.join(data_dir, data_filename)

# read data_full_filename into a pandas DataFrame; column 1 holds the sentences
sents = pd.read_csv(data_full_filename, sep='\t', header=None, index_col=False)[1]
sents

0       ' 1979 standup tour. citation In 1998, he rele...
1       A 100-year-old woman named Rose DeWitt Bukater...
2       A1 is the name of a major road in some countries.
3       A 2002 report by American Sports Data found th...
4                   A 268-page booklet available on-line.
                              ...                        
9555    Zones are the places where buildings can develop.
9556    Zoological Journal of the Linnean Society, 71,...
9557    Zou Tribe is one of the Schedule Tribes of Man...
9558    Zubeyr was killed in a U.S. drone airstrike on...
9559    Քաշաթաղի մելիքություն) - Armenian melikdom(pri...
Name: 1, Length: 9560, dtype: object

### Filter ASCII-only Sentences

In [75]:
ascii_sent_indices = np.array(list(map(lambda x: x.isascii(), sents)))
ascii_sents = sents[ascii_sent_indices]
ascii_sents

0       ' 1979 standup tour. citation In 1998, he rele...
1       A 100-year-old woman named Rose DeWitt Bukater...
2       A1 is the name of a major road in some countries.
3       A 2002 report by American Sports Data found th...
4                   A 268-page booklet available on-line.
                              ...                        
9554                                  Z is not used much.
9555    Zones are the places where buildings can develop.
9556    Zoological Journal of the Linnean Society, 71,...
9557    Zou Tribe is one of the Schedule Tribes of Man...
9558    Zubeyr was killed in a U.S. drone airstrike on...
Name: 1, Length: 8969, dtype: object

### Train-Test Indices

In [13]:
# split sentences into train and test sets with numpy
np.random.seed(42)
train_indices = np.random.choice(ascii_sents.index, size=int(0.8*len(ascii_sents)), replace=False)
test_indices = ascii_sents.index.difference(train_indices)
# train_sents is needed below to build the word vocabulary
train_sents = ascii_sents.loc[train_indices]
test_sents = ascii_sents.loc[test_indices]

### Tokenizer

In [76]:
from torchtext.data.utils import get_tokenizer

# create pytorch tokenizer
tokenizer = get_tokenizer('basic_english')

### Word Vocab

In [15]:
# create vocabulary of training words
from torchtext.vocab import build_vocab_from_iterator

vocab = build_vocab_from_iterator(
    [tokenizer(sent) for sent in train_sents],
    specials=['<unk>', '<pad>', '<bos>', '<eos>']
)

vocab.set_default_index(vocab['<unk>'])

### Char Vocab

In [16]:
# create vocabulary of ascii symbols
ascii_symbols = list(map(chr, range(127)))

symbols_vocab = build_vocab_from_iterator(
    [ascii_symbols],
    specials=['<unk>', '<pad>', '<bos>', '<eos>']
)

symbols_vocab.set_default_index(symbols_vocab['<unk>'])
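The two vocabularies treat unseen words differently, which is the basis for out-of-vocabulary handling. The sketch below imitates them with plain dictionaries to make the contrast visible. The toy word list and the helper names `word_id` and `char_ids` are assumptions for illustration; the real objects in this notebook are `vocab` and `symbols_vocab`.

```python
SPECIALS = ["<unk>", "<pad>", "<bos>", "<eos>"]

# Toy word vocabulary (hypothetical); the notebook builds the real one from train_sents.
word_to_id = {tok: i for i, tok in enumerate(SPECIALS + ["the", "movie", "was", "great"])}
# Character vocabulary over ASCII with the specials first, mirroring symbols_vocab.
char_to_id = {tok: i for i, tok in enumerate(SPECIALS + [chr(c) for c in range(127)])}

def word_id(word):
    """Every unseen word collapses to the single <unk> id."""
    return word_to_id.get(word, word_to_id["<unk>"])

def char_ids(word, max_len):
    """Character ids, truncated or right-padded with <pad> to max_len."""
    ids = [char_to_id.get(ch, char_to_id["<unk>"]) for ch in word[:max_len]]
    return ids + [char_to_id["<pad>"]] * (max_len - len(ids))

for word in ["great", "greatest"]:
    print(word, word_id(word), char_ids(word, max_len=10))
```

At the word level "greatest" is indistinguishable from any other unknown word, while its character ids still share a prefix with "great". A CharCNN input therefore gives the model something to work with for words it never saw in training.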

### Tokenized Sents

In [None]:
# tokenize ascii_sents
tokenized_sents = list(map(tokenizer, ascii_sents))

### Max Words

In [None]:
# get max number of words in tokenized_sents
max_num_words = max(map(lambda x: len(x), tokenized_sents))
max_num_words

### Max Letters

In [None]:
# get max number of letters in the words in tokenized_sents
max_num_letters = max(map(lambda x: max(map(lambda y: len(y), x)), tokenized_sents))
max_num_letters

### Padded Word Ids

In [227]:
# padded_word_ids = torch.full(
#     size=(len(ascii_sents), max_num_words),
#     fill_value=vocab['<pad>']
# )
# padded_word_ids.shape

torch.Size([8969, 571])

In [None]:
# for sent_num, sent in enumerate(tokenized_sents):
#     for word_num, word in enumerate(sent):
#         padded_word_ids[sent_num, word_num] = vocab[word]

In [235]:
def sents_to_word_ids(sents):
    """Convert sentences to integer id tensors wrapped in <bos>/<eos>."""
    word_ids = []
    for sent in sents:
        sent_word_ids = [vocab['<bos>']] + [vocab[token] for token in tokenizer(sent)] + [vocab['<eos>']]
        # ids are used as indices and class targets, so store them as integers (long), not floats
        word_ids.append(torch.tensor(sent_word_ids, dtype=torch.long))
    return word_ids

In [236]:
word_ids = sents_to_word_ids(ascii_sents)
#print(word_ids[:2])
padded_word_ids = torch.nn.utils.rnn.pad_sequence(word_ids, padding_value=vocab['<pad>'], batch_first=True)
padded_word_ids

tensor([[    2,    18,  1187,  ...,     1,     1,     1],
        [    2,    10,  6603,  ...,     1,     1,     1],
        [    2,  7135,    12,  ...,     1,     1,     1],
        ...,
        [    2,     0,  1763,  ...,     1,     1,     1],
        [    2,  6560,  3163,  ...,     1,     1,     1],
        [    2, 16338,    13,  ...,     1,     1,     1]])

In [22]:
# BATCH_SIZE = 128

# from torch.nn.utils.rnn import pad_sequence
# from torch.utils.data import DataLoader

# def pad_batch(data_batch):
#     return pad_sequence(data_batch, padding_value=vocab['<pad>'])

# train_iter = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True, collate_fn=pad_batch)
# test_iter = DataLoader(test_ds, batch_size=BATCH_SIZE, shuffle=True, collate_fn=pad_batch)

### Padded Char Ids

In [127]:
padded_char_ids = torch.full(
    size=(len(ascii_sents), max_num_words, max_num_letters),
    fill_value=symbols_vocab['<pad>']
)
padded_char_ids.shape

torch.Size([8969, 571, 35])

In [130]:
for sent_num, sent in enumerate(ascii_sents):
    for word_num, word in enumerate(tokenizer(sent)):
        for letter_num, letter in enumerate(word):
            padded_char_ids[sent_num, word_num, letter_num] = symbols_vocab[letter]

In [132]:
padded_char_ids

tensor([[[ 43,   1,   1,  ...,   1,   1,   1],
         [ 53,  61,  59,  ...,   1,   1,   1],
         [119, 120, 101,  ...,   1,   1,   1],
         ...,
         [  1,   1,   1,  ...,   1,   1,   1],
         [  1,   1,   1,  ...,   1,   1,   1],
         [  1,   1,   1,  ...,   1,   1,   1]],

        [[101,   1,   1,  ...,   1,   1,   1],
         [ 53,  52,  52,  ...,   1,   1,   1],
         [123, 115, 113,  ...,   1,   1,   1],
         ...,
         [  1,   1,   1,  ...,   1,   1,   1],
         [  1,   1,   1,  ...,   1,   1,   1],
         [  1,   1,   1,  ...,   1,   1,   1]],

        [[101,  53,   1,  ...,   1,   1,   1],
         [109, 119,   1,  ...,   1,   1,   1],
         [120, 108, 105,  ...,   1,   1,   1],
         ...,
         [  1,   1,   1,  ...,   1,   1,   1],
         [  1,   1,   1,  ...,   1,   1,   1],
         [  1,   1,   1,  ...,   1,   1,   1]],

        ...,

        [[126, 115, 115,  ...,   1,   1,   1],
         [110, 115, 121,  ...,   1,   1,   1]

In [None]:
# from torch.utils.data import TensorDataset, DataLoader
# from torchtext.vocab import vocab

## Model - learning embeddings
Read Section 3 of the [paper](https://arxiv.org/pdf/1802.05365.pdf).

Implement the model with:
- 2 BiLSTM layers,
- CharCNN embeddings,
- Highway layers,
- out-of-vocabulary words handling
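
For reference, the gating used in the linked AllenNLP highway module, and in the `forward` method of the `ELMo` class in this notebook, can be written as below. Here $x$ is the layer input, $f$ is the nonlinearity (ReLU in this notebook) and $\sigma$ is the sigmoid; the symbols $W_H, b_H, W_T, b_T$ are notation chosen for this note rather than names taken from the code.

$$
g = \sigma(W_T x + b_T), \qquad y = g \odot x + (1 - g) \odot f(W_H x + b_H)
$$

In the code a single `nn.Linear(134, 134 * 2)` produces both pre-activations at once, and `chunk(2, dim=-1)` splits them into the nonlinear part and the gate. The highway paper writes the same layer with the gate on the transformed branch, $y = H(x) \odot T(x) + x \odot (1 - T(x))$, which is equivalent up to relabeling $g$ as $1 - T(x)$.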

Plot the training and validation losses over the epochs (or iterations).

Use the [implementation](https://github.com/allenai/allennlp/blob/main/allennlp/modules/elmo.py) as a reference.
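
Section 3.1 of the paper trains the biLM by jointly maximizing the log likelihood of the forward and backward directions (parameters omitted here):

$$
\sum_{k=1}^{N} \left( \log p(t_k \mid t_1, \ldots, t_{k-1}) + \log p(t_k \mid t_{k+1}, \ldots, t_N) \right)
$$

In practice this means the forward direction is scored against the next token and the backward direction against the previous one. A minimal sketch of building such shifted targets from a padded id batch follows; the function name `bilm_targets` and the toy ids are assumptions for illustration.

```python
import torch

def bilm_targets(token_ids, pad_id):
    """Return (forward_targets, backward_targets) for a padded id batch.

    forward_targets[b, k]  = token_ids[b, k + 1]  (next token)
    backward_targets[b, k] = token_ids[b, k - 1]  (previous token)
    Positions without a neighbour are filled with pad_id.
    """
    forward = torch.full_like(token_ids, pad_id)
    backward = torch.full_like(token_ids, pad_id)
    forward[:, :-1] = token_ids[:, 1:]
    backward[:, 1:] = token_ids[:, :-1]
    # Padding positions should never receive a real target.
    backward[token_ids == pad_id] = pad_id
    return forward, backward

PAD = 1  # same id as '<pad>' in the vocabularies of this notebook
ids = torch.tensor([[2, 10, 11, 3],
                    [2, 12, 3, PAD]])
fwd, bwd = bilm_targets(ids, PAD)
```

Passing `ignore_index=pad_id` to `nn.CrossEntropyLoss` keeps the padded positions out of the loss. Note also that a stacked `nn.LSTM(bidirectional=True)` feeds the concatenated outputs of both directions into its second layer, so each direction can see the token it is supposed to predict; running two separate unidirectional stacks is one way to avoid that.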

![](https://miro.medium.com/max/720/1*3_wsDpyNG-TylsRACF48yA.png)

![](https://miro.medium.com/max/720/1*8pG54o28pbD2L0dv5THL-A.png)

In [220]:
from torch import nn

class ELMo(nn.Module):
    
    def __init__(
        self,
        vocab_size,
        n_tokens,
        n_chars=50,
        embedding_dim=4,
        #projection_dim=512,
        lstm_units=4096,
        elmo_output_size=512
    ):
        super(ELMo, self).__init__()
        
        self.vocab_size = vocab_size
        self.n_tokens = n_tokens
        self.n_chars = n_chars
        self.embedding_dim = embedding_dim
        #self.projection_dim = projection_dim
        self.lstm_units = lstm_units
        self.elmo_output_size = elmo_output_size
        

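        # character embedding table; it is indexed by character ids, so it only
        # needs len(symbols_vocab) rows (vocab_size rows is larger than necessary but works)
        self.embedding_matrix = nn.Embedding(vocab_size, embedding_dim)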

        filters = [[1,4], [2,8], [3,26], [4,32], [5,64]]
        self.conv_layers = nn.ModuleList([
            nn.Conv1d(
                in_channels=embedding_dim,  # must match the character embedding size
                out_channels=num,
                kernel_size=width,
                bias=True 
            )
            for (width, num) in filters
        ])
        self.conv_activation = nn.ReLU()

        self.highway_layers = nn.ModuleList([
            nn.Linear(134, 134 * 2)
            for _ in range(2)
        ])
        self.highway_activation = nn.ReLU()
        self.highway_projection = nn.Linear(134, elmo_output_size, bias=True)
        
        self.lstm = nn.LSTM(
            input_size=elmo_output_size,
            hidden_size=lstm_units,
            bidirectional=True,
            batch_first=True,
            num_layers=2,
            proj_size=elmo_output_size
        )

        self.linear = nn.Linear(2 * elmo_output_size, vocab_size, bias=True)

        # self.lstm2 = nn.LSTM(
        #     input_size=elmo_output_size,
        #     hidden_size=lstm_units,
        #     bidirectional=True,
        #     batch_first=True,
        #     proj_size=elmo_output_size
        # )


    def forward(self, x):

        print(x.shape)

        # embed the characters of every token
        # in shape: (batch_size, n_tokens, n_chars)
        # out shape: (batch_size * n_tokens, n_chars, embedding_dim)
        embedded = self.embedding_matrix(x.view(-1, self.n_chars))

        print(embedded.shape)

        # CharCNN
        # in shape: (batch_size * n_tokens, n_chars, embedding_dim)
        # out shape: (batch_size * n_tokens, 134), the concatenated max-pooled filter outputs

        # Conv1d expects channels first: (batch_size * n_tokens, embedding_dim, n_chars)
        embedded = torch.transpose(embedded, 1, 2)

        # pass the embedded input through the convolutional layers
        conv_outputs = []
        for conv_layer in self.conv_layers:
            conv_output = conv_layer(embedded)
            conv_output, _ = torch.max(conv_output, dim=-1)
            conv_output = self.conv_activation(conv_output)
            conv_outputs.append(conv_output)

        # concatenate the conv outputs
        token_embedding = torch.cat(conv_outputs, dim=-1)

        print(token_embedding.shape)

        # pass the conv output through the highway layers
        highway_output = token_embedding
        for highway_layer in self.highway_layers:
            projected_input = highway_layer(highway_output)
            linear_part = highway_output

            nonlinear_part, gate = projected_input.chunk(2, dim=-1)
            nonlinear_part = self.highway_activation(nonlinear_part)
            gate = torch.sigmoid(gate)

            highway_output = gate * linear_part + (1 - gate) * nonlinear_part
        
        token_embedding = self.highway_projection(highway_output) 

        print(token_embedding.shape)

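        # pass the token embedding through the BiLSTM
        # restore the batch dimension first: a 2-D input would be treated as one
        # unbatched sequence running across all sentences
        # shape: (batch_size, n_tokens, elmo_output_size)
        token_embedding = token_embedding.view(x.shape[0], -1, self.elmo_output_size)
        lstm_output, _ = self.lstm(token_embedding)

        # project every position onto the word vocabulary
        # out shape: (batch_size * n_tokens, vocab_size)
        out = self.linear(lstm_output.reshape(-1, lstm_output.shape[-1]))

        return out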


In [221]:
model = ELMo(vocab_size=len(vocab), n_tokens=max_num_words, n_chars=max_num_letters)
pred = model(padded_char_ids[0:3, :, :])

torch.Size([3, 571, 35])
torch.Size([1713, 35, 4])
torch.Size([1713, 134])
torch.Size([1713, 512])


In [223]:
pred

tensor([[-0.0021,  0.0262, -0.0112,  ...,  0.0107, -0.0050, -0.0005],
        [-0.0018,  0.0261, -0.0108,  ...,  0.0100, -0.0050, -0.0012],
        [-0.0016,  0.0261, -0.0107,  ...,  0.0097, -0.0049, -0.0016],
        ...,
        [-0.0011,  0.0263, -0.0102,  ...,  0.0096, -0.0051, -0.0018],
        [-0.0010,  0.0264, -0.0104,  ...,  0.0095, -0.0052, -0.0016],
        [-0.0006,  0.0268, -0.0106,  ...,  0.0093, -0.0055, -0.0011]],
       grad_fn=<AddmmBackward0>)

In [224]:
pred.shape

torch.Size([1713, 16340])

In [225]:
len(vocab)

16340

In [None]:
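# train the model

from torch import optim

BATCH_SIZE = 32
LOG_EVERY = 32  # print the running loss every LOG_EVERY mini-batches

# padded_word_ids includes <bos>/<eos> but padded_char_ids does not, so drop the
# <bos> column and trim to the character tensor's length. After this shift <eos>
# lines up with a padding position. Class targets must be long.
labels_all = padded_word_ids[:, 1:1 + padded_char_ids.shape[1]].long()

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

for epoch in range(10):
    running_loss = 0.0
    for step, i in enumerate(range(0, len(padded_char_ids), BATCH_SIZE)):
        # get the inputs
        inputs = padded_char_ids[i:i+BATCH_SIZE, :, :]
        labels = labels_all[i:i+BATCH_SIZE, :]

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = model(inputs)
        loss = criterion(outputs.view(-1, len(vocab)), labels.reshape(-1))
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if step % LOG_EVERY == LOG_EVERY - 1:    # counts mini-batches, not sample offsets
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, step + 1, running_loss / LOG_EVERY))
            running_loss = 0.0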
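
The assignment asks for training and validation losses, but the loop above only tracks the training loss. One possible way to obtain the validation curve is a small helper that averages the loss over held-out data without touching the weights. The name `mean_loss`, the `history` dictionary, and the toy linear model are assumptions for this sketch, not part of the original notebook.

```python
import torch
from torch import nn

@torch.no_grad()
def mean_loss(model, inputs, labels, criterion, batch_size=32):
    """Average per-batch loss over a dataset without updating the weights."""
    was_training = model.training
    model.eval()
    total, n_batches = 0.0, 0
    for i in range(0, len(inputs), batch_size):
        outputs = model(inputs[i:i + batch_size])
        loss = criterion(outputs.reshape(-1, outputs.shape[-1]),
                         labels[i:i + batch_size].reshape(-1))
        total += loss.item()
        n_batches += 1
    model.train(was_training)  # restore the previous train/eval mode
    return total / max(n_batches, 1)

# Toy stand-in for the real model and data (hypothetical shapes).
torch.manual_seed(0)
toy_model = nn.Linear(8, 5)
toy_inputs = torch.randn(10, 8)
toy_labels = torch.randint(0, 5, (10,))

history = {"train": [], "val": []}
history["val"].append(
    mean_loss(toy_model, toy_inputs, toy_labels, nn.CrossEntropyLoss(), batch_size=4)
)
```

Calling the helper once per epoch on the validation tensors and appending to `history` gives two lists that can be drawn with `plt.plot`, or logged step by step with `SummaryWriter.add_scalar` from `torch.utils.tensorboard`.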


## Evaluate your embeddings model on IMDB movie reviews dataset (sentiment analysis) 
[Dataset](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews)

Preprocess data

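Freeze the ELMo weights. With two BiLSTM layers the model yields 5 representations for each word (the token embedding plus the forward and backward hidden states of both layers). Combine them for the downstream task with the trainable scalar $\gamma^{task}$ and the layer weights $s^{task}_j$.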
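
In the paper the task-specific embedding is $\gamma^{task} \sum_{j=0}^{L} s^{task}_j h_{k,j}$, where the $s^{task}_j$ are softmax-normalized and each $h_{k,j}$ concatenates the forward and backward states of layer $j$. A minimal sketch of such a mixing module is given below. The class name `ScalarMix`, the tensor layout, and the random stand-in for the layer outputs are assumptions for illustration; the token layer would need to be brought to the same width as the BiLSTM layers first, for example by duplicating it.

```python
import torch
from torch import nn

class ScalarMix(nn.Module):
    """Task-specific weighted sum of frozen biLM layer representations."""

    def __init__(self, num_layers):
        super().__init__()
        self.s = nn.Parameter(torch.zeros(num_layers))  # softmax-normalized in forward
        self.gamma = nn.Parameter(torch.ones(1))        # global scale

    def forward(self, layers):
        # layers: (num_layers, batch_size, seq_len, dim)
        weights = torch.softmax(self.s, dim=0).view(-1, 1, 1, 1)
        return self.gamma * (weights * layers).sum(dim=0)

# Random stand-in for 3 stacked layer outputs (token layer + 2 BiLSTM layers).
layers = torch.randn(3, 2, 7, 1024)
mix = ScalarMix(num_layers=3)
mixed = mix(layers)  # shape: (2, 7, 1024)
```

With the zero initialization the mix starts as a plain average of the layers. To keep ELMo itself fixed, its parameters can be excluded from training with `requires_grad_(False)`, so that only the mixing weights and the classifier are updated.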

Don't forget the metric plots.

## Compare the results with BERT embeddings
You can choose another BERT model.

In [None]:
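import torch
from transformers import AutoTokenizer, AutoModel

# separate names so the ELMo `model` and the torchtext `tokenizer` defined above are not overwritten
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert_model = AutoModel.from_pretrained("bert-base-uncased")
bert_model.eval()

inputs = bert_tokenizer("Hello world!", return_tensors="pt")
# inference only, so skip gradient tracking
with torch.no_grad():
    outputs = bert_model(**inputs)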
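
The BERT output holds one vector per token in `outputs.last_hidden_state`, while a sentiment classifier needs one vector per review. One common option, shown here as a sketch rather than the required approach, is to average the token vectors while ignoring padding. The function name `masked_mean_pool` and the toy tensors are assumptions for this example.

```python
import torch

def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per sentence, ignoring padding positions."""
    # (batch_size, seq_len) -> (batch_size, seq_len, 1), same dtype as the hidden states
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)  # avoid division by zero
    return summed / counts

# Toy batch: one sentence with two real tokens and one padding position.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])
pooled = masked_mean_pool(hidden, mask)
```

With the cell above this would be called as `masked_mean_pool(outputs.last_hidden_state, inputs["attention_mask"])`. Applying the same pooling to the ELMo embeddings keeps the comparison between the two models like for like.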