# ELMo: Embeddings from Language Models 

In this assignment you will implement ELMo, a deep LSTM-based model for contextualized word embeddings. Your tasks are as follows:

- Preprocessing (20 points)
- Implementation of ELMo model (30 points)
  - 2-layer BiLSTM (15 points)
  - Highway layers (5 points) [link](https://paperswithcode.com/method/highway-layer) [paper](https://arxiv.org/pdf/1507.06228.pdf) [code](https://github.com/allenai/allennlp/blob/9f879b0964e035db711e018e8099863128b4a46f/allennlp/modules/highway.py#L11)
  - CharCNN embeddings (5 points) [paper](https://arxiv.org/pdf/1509.01626.pdf)
  - Handle out-of-vocabulary words (5 points)
- Report metrics and loss using TensorBoard, Comet, or another tool (10 points)
- Evaluate on the movie review dataset (20 points)
- Compare the performance with a BERT model (10 points)
- Clean and documented code (10 points)


Remarks: 

*   Use PyTorch
*   Cheating will result in 0 points


ELMo paper: https://arxiv.org/pdf/1802.05365.pdf

Possible datasets:
- [WikiText-103](https://blog.salesforceairesearch.com/the-wikitext-long-term-dependency-language-modeling-dataset/)
- Any monolingual dataset from [WMT](https://statmt.org/wmt22/translation-task.html)

## Data loading and preprocessing
Preprocess the English monolingual data (20 points):
- clean the text
- split it into training and validation sets
- tokenize
- create a vocabulary and convert words to integer ids ([vocab](https://pytorch.org/text/stable/vocab.html#id1))
- pad sequences to a common length

Use these tutorials ([one](https://pytorch.org/tutorials/beginner/torchtext_translation_tutorial.html) and [two](https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html)) as a reference.
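
The tokenize, vocabulary, and padding steps can be sketched in plain Python as follows. This is only a minimal illustration, not the required solution: the helper names (`tokenize`, `build_vocab`, `encode`, `pad_batch`), the regex tokenizer, and the `<pad>`/`<unk>` special tokens are assumptions made for this example, and you are free to use `torchtext` instead.

```python
import re
from collections import Counter

PAD, UNK = "<pad>", "<unk>"  # assumed special tokens: padding and unknown word


def tokenize(sentence):
    """Lowercase the sentence and split it into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence.lower())


def build_vocab(tokenized_sents, min_freq=1):
    """Map every token seen at least min_freq times to an integer id."""
    counts = Counter(tok for sent in tokenized_sents for tok in sent)
    itos = [PAD, UNK] + sorted(t for t, c in counts.items() if c >= min_freq)
    return {tok: idx for idx, tok in enumerate(itos)}


def encode(tokens, stoi):
    """Convert tokens to ids; unseen words fall back to the <unk> id."""
    return [stoi.get(tok, stoi[UNK]) for tok in tokens]


def pad_batch(encoded, pad_id=0):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in encoded)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in encoded]


sample = ["Z is not used much.", "A1 is the name of a major road."]
tokens = [tokenize(s) for s in sample]
stoi = build_vocab(tokens)
batch = pad_batch([encode(t, stoi) for t in tokens], pad_id=stoi[PAD])
```

The padded lists can then be wrapped with `torch.tensor(batch)` and fed to a `DataLoader`. Build the vocabulary from the training split only, so that validation words missing from it exercise the `<unk>` path.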

![](https://miro.medium.com/max/720/1*UPirqwpBWnNmcwoUjfZZIA.png)

In [60]:
import pandas as pd
import numpy as np
import os 

data_dir = 'eng-simple_wikipedia_2021_10K'
data_filename = "eng-simple_wikipedia_2021_10K-sentences.txt"
data_full_filename = os.path.join(data_dir, data_filename)

# read data_full_filename into a pandas dataframe without index
sents = pd.read_csv(data_full_filename, sep='\t', header=None, index_col=False)[1]
sents

0       ' 1979 standup tour. citation In 1998, he rele...
1       A 100-year-old woman named Rose DeWitt Bukater...
2       A1 is the name of a major road in some countries.
3       A 2002 report by American Sports Data found th...
4                   A 268-page booklet available on-line.
                              ...                        
9555    Zones are the places where buildings can develop.
9556    Zoological Journal of the Linnean Society, 71,...
9557    Zou Tribe is one of the Schedule Tribes of Man...
9558    Zubeyr was killed in a U.S. drone airstrike on...
9559    Քաշաթաղի մելիքություն) - Armenian melikdom(pri...
Name: 1, Length: 9560, dtype: object

In [61]:
ascii_sent_indices = np.array(list(map(lambda x: x.isascii(), sents)))
ascii_sents = sents[ascii_sent_indices]
ascii_sents

0       ' 1979 standup tour. citation In 1998, he rele...
1       A 100-year-old woman named Rose DeWitt Bukater...
2       A1 is the name of a major road in some countries.
3       A 2002 report by American Sports Data found th...
4                   A 268-page booklet available on-line.
                              ...                        
9554                                  Z is not used much.
9555    Zones are the places where buildings can develop.
9556    Zoological Journal of the Linnean Society, 71,...
9557    Zou Tribe is one of the Schedule Tribes of Man...
9558    Zubeyr was killed in a U.S. drone airstrike on...
Name: 1, Length: 8969, dtype: object

In [63]:
# split sentences into train and test sets with numpy
np.random.seed(42)
train_indices = np.random.choice(ascii_sents.index, size=int(0.8 * len(ascii_sents)), replace=False)
test_indices = ascii_sents.index.difference(train_indices)
train_sents = ascii_sents.loc[train_indices]
test_sents = ascii_sents.loc[test_indices]

In [None]:
from torch.utils.data import TensorDataset, DataLoader
from torchtext.vocab import vocab

## Model - learning embeddings
Read Section 3 of the [paper](https://arxiv.org/pdf/1802.05365.pdf).

Implement this model with 
- 2 BiLSTM layers,
- CharCNN embeddings,
- Highway layers,
- out-of-vocabulary words handling
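
As a rough sketch of how the CharCNN and highway components in the list above could fit together, the block below embeds the characters of each word, applies 1D convolutions with max-over-time pooling, and passes the result through highway layers. All sizes (`char_dim`, the `filters` tuple, `out_dim`) and the class names are illustrative assumptions, not values from the paper or the AllenNLP code. The highway formula follows the convention $y = g \odot \mathrm{ReLU}(W_h x) + (1 - g) \odot x$; some implementations swap the roles of $g$ and $1 - g$.

```python
import torch
from torch import nn


class Highway(nn.Module):
    """Highway layers: y = g * relu(W_h x) + (1 - g) * x, g = sigmoid(W_g x)."""

    def __init__(self, dim, num_layers=2):
        super().__init__()
        self.transforms = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])
        self.gates = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])

    def forward(self, x):
        for transform, gate in zip(self.transforms, self.gates):
            g = torch.sigmoid(gate(x))
            x = g * torch.relu(transform(x)) + (1 - g) * x
        return x


class CharCNNEmbedding(nn.Module):
    """Builds a word vector from the characters of the word."""

    def __init__(self, num_chars, char_dim=16,
                 filters=((1, 32), (2, 32), (3, 64)), out_dim=128, pad_id=0):
        super().__init__()
        self.char_emb = nn.Embedding(num_chars, char_dim, padding_idx=pad_id)
        # One convolution per (kernel width, number of filters) pair.
        self.convs = nn.ModuleList(
            [nn.Conv1d(char_dim, n, kernel_size=w) for w, n in filters]
        )
        total = sum(n for _, n in filters)
        self.highway = Highway(total)
        self.proj = nn.Linear(total, out_dim)

    def forward(self, char_ids):
        # char_ids: (batch, seq_len, max_word_len), integer character ids
        b, t, w = char_ids.shape
        x = self.char_emb(char_ids.reshape(b * t, w)).transpose(1, 2)
        # Max-over-time pooling: one feature per filter for each word.
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        x = self.highway(torch.cat(pooled, dim=1))
        return self.proj(x).reshape(b, t, -1)


char_cnn = CharCNNEmbedding(num_chars=50)
word_vectors = char_cnn(torch.randint(0, 50, (2, 5, 10)))  # (2, 5, 128)
```

Because word vectors are computed from characters, a word that never appeared in training still receives an embedding, which is one way to handle out-of-vocabulary words. Note that `max_word_len` must be at least as large as the widest kernel.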

Plot the training and validation losses over the epochs (or iterations).

Use the AllenNLP [implementation](https://github.com/allenai/allennlp/blob/main/allennlp/modules/elmo.py) as a reference.
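
The bidirectional language model itself might be sketched as below: two separate 2-layer LSTM stacks, one reading left to right and one reading right to left, with a shared embedding and a shared softmax layer as described in the paper. This is a simplified sketch under several assumptions: a plain `nn.Embedding` stands in for the CharCNN, `emb_dim` equals `hidden_dim` so that all layers have the same width, and the backward direction is produced with `torch.flip`, which ignores padding (a fuller version would reverse each sequence by its true length). The class name `BiLM` and all sizes are hypothetical.

```python
import torch
from torch import nn
import torch.nn.functional as F


class BiLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=64, num_layers=2, pad_id=0):
        super().__init__()
        self.pad_id = pad_id
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=pad_id)
        dims = [emb_dim] + [hidden_dim] * num_layers
        # Separate LSTM parameters for each direction.
        self.fwd = nn.ModuleList(
            [nn.LSTM(dims[i], hidden_dim, batch_first=True) for i in range(num_layers)]
        )
        self.bwd = nn.ModuleList(
            [nn.LSTM(dims[i], hidden_dim, batch_first=True) for i in range(num_layers)]
        )
        # Softmax layer shared by both directions.
        self.decoder = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):
        x = self.embed(token_ids)                     # (batch, seq, emb_dim)
        f, b = x, torch.flip(x, dims=[1])             # backward LM reads reversed input
        layers = [torch.cat([x, x], dim=-1)]          # layer 0: token representation
        for f_lstm, b_lstm in zip(self.fwd, self.bwd):
            f, _ = f_lstm(f)
            b, _ = b_lstm(b)
            layers.append(torch.cat([f, torch.flip(b, dims=[1])], dim=-1))
        return layers, f, torch.flip(b, dims=[1])

    def loss(self, token_ids):
        _, f, b = self(token_ids)
        # Forward state at position k predicts token k+1;
        # backward state at position k predicts token k-1.
        f_logits = self.decoder(f[:, :-1])
        b_logits = self.decoder(b[:, 1:])
        vocab = f_logits.size(-1)
        f_loss = F.cross_entropy(f_logits.reshape(-1, vocab),
                                 token_ids[:, 1:].reshape(-1),
                                 ignore_index=self.pad_id)
        b_loss = F.cross_entropy(b_logits.reshape(-1, vocab),
                                 token_ids[:, :-1].reshape(-1),
                                 ignore_index=self.pad_id)
        return f_loss + b_loss


bilm = BiLM(vocab_size=20)
ids = torch.randint(1, 20, (3, 7))
layers, fwd_top, bwd_top = bilm(ids)
lm_loss = bilm.loss(ids)
```

Logging `lm_loss.item()` for both the training and validation batches at each step gives the curves requested above.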

![](https://miro.medium.com/max/720/1*3_wsDpyNG-TylsRACF48yA.png)

![](https://miro.medium.com/max/720/1*8pG54o28pbD2L0dv5THL-A.png)

In [None]:
from torch import nn

## Evaluate your embeddings model on the IMDB movie reviews dataset (sentiment analysis)
[Dataset](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews)

Preprocess data

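Freeze the ELMo weights (disable training). With 2 BiLSTM layers the model produces 5 embeddings for each word: one from the token layer plus a forward and a backward hidden state from each LSTM layer. Add the trainable parameters $\gamma^{task}$ and $s^{task}_j$ that combine them into a single task-specific vector.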
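
In the paper the task-specific representation is $\mathrm{ELMo}^{task}_k = \gamma^{task} \sum_j s^{task}_j h_{k,j}$, where the $s^{task}_j$ are softmax-normalized weights. A minimal sketch of this weighting and of freezing a pretrained module is shown below. The names `ScalarMix` and `freeze` are chosen for this example only. The paper concatenates the forward and backward states of each layer first, so it mixes $L + 1 = 3$ vectors; the module here accepts however many layer tensors you pass, as long as they share one shape.

```python
import torch
from torch import nn


class ScalarMix(nn.Module):
    """Computes gamma * sum_j softmax(s)_j * h_j over a list of layer tensors."""

    def __init__(self, num_layers):
        super().__init__()
        self.scalars = nn.Parameter(torch.zeros(num_layers))  # s_j before softmax
        self.gamma = nn.Parameter(torch.ones(1))               # global scale

    def forward(self, layers):
        weights = torch.softmax(self.scalars, dim=0)
        stacked = torch.stack(layers, dim=0)        # (num_layers, batch, seq, dim)
        mixed = (weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)
        return self.gamma * mixed


def freeze(module):
    """Disable gradient updates for every parameter of a module."""
    for param in module.parameters():
        param.requires_grad = False
    module.eval()
    return module


mix = ScalarMix(num_layers=3)
dummy_layers = [torch.full((2, 4, 8), float(i)) for i in range(3)]
mixed = mix(dummy_layers)  # uniform weights at initialization: average of layers
```

Only `mix.scalars`, `mix.gamma`, and the classifier placed on top of `mixed` should receive gradients during sentiment training.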

Don't forget metric plots

## Compare the results with BERT embeddings
You can choose another BERT model.

In [None]:
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Hello world!", return_tensors="pt")
outputs = model(**inputs)
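
To feed BERT outputs to the same sentiment classifier, the token vectors need to be reduced to one vector per review. One common option, shown here only as a sketch, is to average the last hidden states over non-padding tokens. The function name `mean_pool` is an assumption for this example, and the dummy tensors stand in for real model outputs so the block runs without downloading a model.

```python
import torch


def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors, ignoring positions where the mask is 0."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)   # avoid division by zero
    return summed / counts


# Dummy batch: 1 sequence, 3 tokens (the last one is padding), hidden size 2.
hidden = torch.tensor([[[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])
pooled = mean_pool(hidden, mask)
```

With the objects from the cell above, the call would be `mean_pool(outputs.last_hidden_state, inputs["attention_mask"])`. Using the `[CLS]` vector (`outputs.last_hidden_state[:, 0]`) is another reasonable choice to compare.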