# ELMo: Embeddings from Language Models 
<!-- ![](https://get.whotrades.com/u4/photoDE6C/20647654315-0/blogpost.jpeg) -->

In this assignment you will implement a deep lstm-based model for contextualized word embeddings - ELMo. Your tasks are as following: 

- Preprocessing (20 points)
- Implementation of ELMo model (30 points)
  - 2-layer BiLSTM (15 points)
  - Highway layers (5 points) [link](https://paperswithcode.com/method/highway-layer) [paper](https://arxiv.org/pdf/1507.06228.pdf) [code](https://github.com/allenai/allennlp/blob/9f879b0964e035db711e018e8099863128b4a46f/allennlp/modules/highway.py#L11)
  - CharCNN embeddings (5 points) [paper](https://arxiv.org/pdf/1509.01626.pdf)
  - Handle out-of-vocabulary words (5 points)
- Report metrics and loss using tensorbord/comet or other tool.  (10 points)
- Evaluate on movie review dataset (20 pts)
- Compare the performance with BERT model (10 pts)
- Clean and documented code (10 points)


Remarks: 

*   Use Pytorch
*   Cheating will result in 0 points


ELMo paper: https://arxiv.org/pdf/1802.05365.pdf

Possible datasets:
- [WikiText-103](https://blog.salesforceairesearch.com/the-wikitext-long-term-dependency-language-modeling-dataset/)
- Any monolingual dataset from [WMT](https://statmt.org/wmt22/translation-task.html)

## Data loading and preprocessing
Preprocess the english monolingual data (20 points):
- clean
- split to train and validation
- tokenize
- create vocabulary, convert words to numbers. [vocab](https://pytorch.org/text/stable/vocab.html#id1)
- pad sequences

Use these tutorials [one](https://pytorch.org/tutorials/beginner/torchtext_translation_tutorial.html) and [two](https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html) as a reference

![](https://miro.medium.com/max/720/1*UPirqwpBWnNmcwoUjfZZIA.png)

In [1]:
import torch
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt

  from .autonotebook import tqdm as notebook_tqdm


In [2]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
#device = torch.device('cpu')

In [3]:
from torch.utils.data import DataLoader

### Read Sentences

In [4]:
data_dir = 'eng-simple_wikipedia_2021_10K'
data_filename = "eng-simple_wikipedia_2021_10K-sentences.txt"
data_full_filename = os.path.join(data_dir, data_filename)

# read data_ful_filename into a pandas dataframe without index
sents = pd.read_csv(data_full_filename, sep='\t', header=None, index_col=False)[1]
sents

0       ' 1979 standup tour. citation In 1998, he rele...
1       A 100-year-old woman named Rose DeWitt Bukater...
2       A1 is the name of a major road in some countries.
3       A 2002 report by American Sports Data found th...
4                   A 268-page booklet available on-line.
                              ...                        
9555    Zones are the places where buildings can develop.
9556    Zoological Journal of the Linnean Society, 71,...
9557    Zou Tribe is one of the Schedule Tribes of Man...
9558    Zubeyr was killed in a U.S. drone airstrike on...
9559    Քաշաթաղի մելիքություն) - Armenian melikdom(pri...
Name: 1, Length: 9560, dtype: object

### Filter ASCII-only Sentences

In [5]:
ascii_sent_indices = np.array(list(map(lambda x: x.isascii(), sents)))
ascii_sents = sents[ascii_sent_indices]
ascii_sents = ascii_sents.reset_index(drop=True)
ascii_sents

0       ' 1979 standup tour. citation In 1998, he rele...
1       A 100-year-old woman named Rose DeWitt Bukater...
2       A1 is the name of a major road in some countries.
3       A 2002 report by American Sports Data found th...
4                   A 268-page booklet available on-line.
                              ...                        
8964                                  Z is not used much.
8965    Zones are the places where buildings can develop.
8966    Zoological Journal of the Linnean Society, 71,...
8967    Zou Tribe is one of the Schedule Tribes of Man...
8968    Zubeyr was killed in a U.S. drone airstrike on...
Name: 1, Length: 8969, dtype: object

### Train-Test Indices

In [6]:
# split sentences into train and test sets with numpy
np.random.seed(42)
train_indices = np.random.choice(ascii_sents.index, size=int(0.8*len(ascii_sents)), replace=False)
test_indices = ascii_sents.index.difference(train_indices)
train_sents = ascii_sents.loc[train_indices]
test_sents = ascii_sents.loc[test_indices]

### Tokenizer

In [7]:
from torchtext.data.utils import get_tokenizer

# create pytorch tokenizer
tokenizer = get_tokenizer('basic_english')

### Word Vocab

In [8]:
# create vocabulary of training words
from torchtext.vocab import build_vocab_from_iterator

vocab = build_vocab_from_iterator(
    [tokenizer(sent) for sent in train_sents],
    specials=['<unk>', '<pad>', '<bos>', '<eos>']
)

vocab.set_default_index(vocab['<unk>'])

### Char Vocab

In [9]:
# create vocabulary of ascii symbols
ascii_symbols = list(map(chr, range(127)))

symbols_vocab = build_vocab_from_iterator(
    [ascii_symbols],
    specials=['<unk>', '<pad>', '<bow>', '<eow>']
)

symbols_vocab.set_default_index(symbols_vocab['<unk>'])

### Tokenized Sents

In [10]:
# tokenize ascii_sents
tokenized_sents = list(map(tokenizer, ascii_sents))

### Max Words

In [11]:
# get max number of words in tokenized_sents
max_num_words = max(map(lambda x: len(x), tokenized_sents))
max_num_words

571

### Max Letters

In [12]:
# get max number of letters in the words in tokenized_sents
max_num_letters = max(map(lambda x: max(map(lambda y: len(y), x)), tokenized_sents))
max_num_letters

35

### Padded Word Ids

In [13]:
def sents_to_word_ids(sents):
    word_ids = []
    for sent in sents:
        sent_word_ids = torch.tensor([vocab['<bos>']] + [vocab[token] for token in tokenizer(sent)] + [vocab['<eos>']])
        word_ids.append(sent_word_ids)
    return word_ids

In [14]:
word_ids = sents_to_word_ids(ascii_sents)
word_ids[:2]


[tensor([    2,    18,  1187, 14913,   979,     4,   104,     8,   689,     6,
            15,   103,  4593,  1159,     6,    10,  2233,   181,     4,     3]),
 tensor([    2,    10,  6603,   388,   178,  1826,  9339,  8270,  1067,    10,
           302,    63,    49, 15982,    19,     5,   307,  1166, 15466,     4,
             3])]

In [15]:
padded_word_ids = torch.nn.utils.rnn.pad_sequence(word_ids, padding_value=vocab['<pad>'], batch_first=True)
padded_word_ids

tensor([[    2,    18,  1187,  ...,     1,     1,     1],
        [    2,    10,  6603,  ...,     1,     1,     1],
        [    2,  7135,    12,  ...,     1,     1,     1],
        ...,
        [    2,     0,  1763,  ...,     1,     1,     1],
        [    2,  6560,  3163,  ...,     1,     1,     1],
        [    2, 16338,    13,  ...,     1,     1,     1]])

In [16]:
padded_word_ids.shape

torch.Size([8969, 573])

In [17]:
# padded_word_ids = padded_word_ids.to(device)
# padded_word_ids.get_device()

In [18]:
# BATCH_SIZE = 128

# from torch.nn.utils.rnn import pad_sequence
# from torch.utils.data import DataLoader

# def pad_batch(data_batch):
#     return pad_sequence(data_batch, padding_value=vocab['<pad>'])

# train_iter = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True, collate_fn=pad_batch)
# test_iter = DataLoader(test_ds, batch_size=BATCH_SIZE, shuffle=True, collate_fn=pad_batch)

### Padded Char Ids

In [19]:
padded_char_ids = torch.full(
    size=(len(ascii_sents), max_num_words + 2, max_num_letters),
    fill_value=symbols_vocab['<pad>']
)
padded_char_ids.shape

torch.Size([8969, 573, 35])

In [20]:
for sent_num, sent in enumerate(ascii_sents):
    tokenized = tokenizer(sent)
    for i in range(0, len(tokenized) + 2):
        if i == 0:
            padded_char_ids[sent_num, i, 0] = symbols_vocab['<bow>']
            padded_char_ids[sent_num, i, 1] = symbols_vocab['<bos>']
            padded_char_ids[sent_num, i, 2] = symbols_vocab['<eow>']
            continue
        elif i == len(tokenized) + 1:
            padded_char_ids[sent_num, i, 0] = symbols_vocab['<bow>']
            padded_char_ids[sent_num, i, 1] = symbols_vocab['<eos>']
            padded_char_ids[sent_num, i, 2] = symbols_vocab['<eow>']
            continue
        word = tokenized[i - 1]
        for letter_num, letter in enumerate(word):
            padded_char_ids[sent_num, i, letter_num] = symbols_vocab[letter]

In [21]:
symbols_vocab['.']

50

In [22]:
padded_char_ids

tensor([[[  2,   0,   3,  ...,   1,   1,   1],
         [ 43,   1,   1,  ...,   1,   1,   1],
         [ 53,  61,  59,  ...,   1,   1,   1],
         ...,
         [  1,   1,   1,  ...,   1,   1,   1],
         [  1,   1,   1,  ...,   1,   1,   1],
         [  1,   1,   1,  ...,   1,   1,   1]],

        [[  2,   0,   3,  ...,   1,   1,   1],
         [101,   1,   1,  ...,   1,   1,   1],
         [ 53,  52,  52,  ...,   1,   1,   1],
         ...,
         [  1,   1,   1,  ...,   1,   1,   1],
         [  1,   1,   1,  ...,   1,   1,   1],
         [  1,   1,   1,  ...,   1,   1,   1]],

        [[  2,   0,   3,  ...,   1,   1,   1],
         [101,  53,   1,  ...,   1,   1,   1],
         [109, 119,   1,  ...,   1,   1,   1],
         ...,
         [  1,   1,   1,  ...,   1,   1,   1],
         [  1,   1,   1,  ...,   1,   1,   1],
         [  1,   1,   1,  ...,   1,   1,   1]],

        ...,

        [[  2,   0,   3,  ...,   1,   1,   1],
         [126, 115, 115,  ...,   1,   1,   1]

In [23]:
padded_char_ids.shape

torch.Size([8969, 573, 35])

In [24]:
# padded_char_ids = padded_char_ids.to(device)
# padded_char_ids.get_device()

In [25]:
# from torch.utils.data import TensorDataset, DataLoader
# from torchtext.vocab import vocab

### Dataset

In [26]:
# create dataset from padded_word_ids and padded_char_ids

# import TensorDataset
from torch.utils.data import TensorDataset

train_ds = TensorDataset(padded_word_ids[train_indices], padded_char_ids[train_indices])
test_ds = TensorDataset(padded_word_ids[test_indices], padded_char_ids[test_indices])

### Dataloader

In [27]:
# create dataloaders
from torch.utils.data import DataLoader

BATCH_SIZE = 16

train_iter = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True)
test_iter = DataLoader(test_ds, batch_size=BATCH_SIZE, shuffle=True)

## Model - learning embeddings
Read chapter 3 from the [paper](https://arxiv.org/pdf/1802.05365.pdf)

Implement this model with 
- 2 BiLSTM layers,
- CharCNN embeddings,
- Highway layers,
- out-of-vocabulary words handling

Plot the training and validation losses over the epochs (iterations)

Use the [implementation](https://github.com/allenai/allennlp/blob/main/allennlp/modules/elmo.py) as a reference

![](https://miro.medium.com/max/720/1*3_wsDpyNG-TylsRACF48yA.png)

![](https://miro.medium.com/max/720/1*8pG54o28pbD2L0dv5THL-A.png)

In [28]:
from torch import nn

class ELMo(nn.Module):
    
    def __init__(
        self,
        vocab_size,
        n_tokens,
        n_chars=50,
        embedding_dim=4,
        lstm_units=256,
        elmo_output_size=32
    ):
        super(ELMo, self).__init__()
        
        self.vocab_size = vocab_size
        self.n_tokens = n_tokens
        self.n_chars = n_chars
        self.embedding_dim = embedding_dim
        self.lstm_units = lstm_units
        self.elmo_output_size = elmo_output_size
        

        self.embedding_matrix = nn.Embedding(vocab_size, embedding_dim)

        filters = [[1,4], [2,8], [3,26], [4,32], [5,64]]
        self.conv_layers = nn.ModuleList([
            nn.Conv1d(
                in_channels=4,
                out_channels=num,
                kernel_size=width,
                bias=True 
            )
            for (width, num) in filters
        ])
        self.conv_activation = nn.ReLU()

        self.highway_layers = nn.ModuleList([
            nn.Linear(134, 134 * 2)
            for _ in range(2)
        ])
        self.highway_activation = nn.ReLU()
        self.highway_projection = nn.Linear(134, elmo_output_size, bias=True)
        
        self.lstm1 = nn.LSTM(
            input_size=elmo_output_size,
            hidden_size=lstm_units,
            bidirectional=True,
            batch_first=True,
            proj_size=elmo_output_size
        )

        self.lstm2 = nn.LSTM(
            input_size=2*elmo_output_size,
            hidden_size=lstm_units,
            bidirectional=True,
            batch_first=True,
            proj_size=elmo_output_size
        )

        self.linear = nn.Linear(2 * elmo_output_size, vocab_size, bias=True)

    def embed_input(self, x):
        return self.embedding_matrix(x.view(-1, self.n_chars))

    def charCNN(self, x):
        embedded = torch.transpose(x, 1, 2)

        # pass the embedded input through the convolutional layers
        conv_outputs = []
        for conv_layer in self.conv_layers:
            conv_output = conv_layer(embedded)
            conv_output, _ = torch.max(conv_output, dim=-1)
            conv_output = self.conv_activation(conv_output)
            conv_outputs.append(conv_output)

        # concatenate the conv outputs
        token_embedding = torch.cat(conv_outputs, dim=-1)

        # pass the conv output through the highway layers
        highway_output = token_embedding
        for highway_layer in self.highway_layers:
            projected_input = highway_layer(highway_output)
            linear_part = highway_output

            nonlinear_part, gate = projected_input.chunk(2, dim=-1)
            nonlinear_part = self.highway_activation(nonlinear_part)
            gate = torch.sigmoid(gate)

            highway_output = gate * linear_part + (1 - gate) * nonlinear_part
        
        token_embedding = self.highway_projection(highway_output) 

        return token_embedding

    def forward(self, x):

        #batch_size = x.size(0)

        # embed the input
        # in shape: (batch_size, n_tokens, n_chars)
        # out shape: (batch_size, n_tokens, n_chars, embedding_dim)
        #embedded = self.embedding_matrix(x.view(-1, self.n_chars))
        embedded = self.embed_input(x)

        # CharCNN
        # in shape: (n_tokens, n_chars, embedding_dim)
        # out shape: (n_tokens, projection_dim)
        token_embedding = self.charCNN(embedded)

        # pass the token embedding through the BiLSTM
        lstm_output1, (h1_n, c1_n) = self.lstm1(token_embedding)
        lstm_output2, (h2_n, c2_n) = self.lstm2(lstm_output1)

        out = self.linear(lstm_output2)
        
        return out.view(-1, self.n_tokens+2, self.vocab_size)


In [29]:
model = ELMo(vocab_size=len(vocab), n_tokens=max_num_words, n_chars=max_num_letters).to(device)

In [30]:
# train the model and evalutate it with tensorboard

from torch import optim

optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter()

def train(model, iterator, optimizer, criterion):
    
    model.train()
    
    epoch_loss = 0
    epoch_acc = 0

    for i, batch in enumerate(iterator):
        optimizer.zero_grad()

        word_ids, char_ids = batch
        word_ids = word_ids.to(device)
        char_ids = char_ids.to(device)

        logits = model(char_ids)

        loss = criterion(logits.view(-1, len(vocab)), word_ids.view(-1))
        loss.backward()
        
        optimizer.step()
        
        epoch_loss += loss.item()
        
        # calculate accuracy
        preds = torch.argmax(logits, dim=2)
        epoch_acc += torch.sum(preds == word_ids).item() / len(word_ids) 

        
        writer.add_scalar('Loss/train', epoch_loss / (i+1), i)
        writer.add_scalar('Accuracy/train', epoch_acc / (i+1), i)

        if i % 10 == 0:
            print(f'Training loss: {epoch_loss / (i+1)}, accuracy: {epoch_acc}')

    return epoch_loss / len(iterator), epoch_acc / len(iterator)

def evaluate(model, iterator, criterion):
    
    model.eval()
    
    epoch_loss = 0
    epoch_acc = 0

    with torch.no_grad():
        for i, batch in enumerate(iterator):
            word_ids, char_ids = batch
            
            word_ids = word_ids.to(device)
            char_ids = char_ids.to(device)

            logits = model(char_ids)

            logits = logits.view(-1, len(vocab))
            word_ids = word_ids.view(-1)

            loss = criterion(logits, word_ids)
            
            epoch_loss += loss.item()
            
            # calculate accuracy
            preds = torch.argmax(logits, dim=2)
            acc = torch.sum(preds == word_ids).item() / len(word_ids)
            epoch_acc += acc

            writer.add_scalar('Loss/valid', epoch_loss / (i+1), i)
            writer.add_scalar('Accuracy/valid', epoch_acc / (i+1), i)

            if i % 10 == 0:
                print(f'Validation loss: {epoch_loss / (i+1)}, accuracy: {epoch_acc / (i+1)}')

    return epoch_loss / len(iterator)

N_EPOCHS = 2

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
        
        print(f'=== EPOCH {epoch} ===')

        train_loss = train(model, train_iter, optimizer, criterion)
        valid_loss = evaluate(model, test_iter, criterion)
        
        if valid_loss < best_valid_loss:
            best_valid_loss = valid_loss
            torch.save(model.state_dict(), 'model.pt')
        
        # writer.add_scalar('Loss/train', train_loss, epoch)
        # writer.add_scalar('Loss/valid', valid_loss, epoch)
    
        # print(f'Epoch: {epoch+1:02}')
        # print(f'\tTrain Loss: {train_loss:.3f}')
        # print(f'\t Val. Loss: {valid_loss:.3f}')

writer.close()


=== EPOCH 0 ===
Training loss: 9.611769676208496, accuracy: 0.0
Training loss: 9.180383942343973, accuracy: 4991.1875


KeyboardInterrupt: 

In [None]:
# train the model
from torch import optim

from tqdm import tqdm

crossentropy = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

for epoch in range(2):
    
    running_loss = 0.0
    
    for i, batch in enumerate(train_iter):
        
        optimizer.zero_grad()
        
        word_ids, char_ids = batch
        word_ids = word_ids.to(device)
        char_ids = char_ids.to(device)
        
        logits = model(char_ids)
        
        loss = crossentropy(logits.view(-1, len(vocab)), word_ids.view(-1))
        loss.backward()
        
        optimizer.step()

        running_loss += loss.item()
        if i % BATCH_SIZE == BATCH_SIZE-1:
            print(f'[{epoch + 1}, {i + 1:5d}] loss: {running_loss}')
            running_loss = 0.0

In [38]:
torch.save(model.state_dict(), 'elmo.pt')

In [30]:
model = ELMo(vocab_size=len(vocab), n_tokens=max_num_words, n_chars=max_num_letters).to(device)
model.load_state_dict(torch.load('elmo.pt'))
model.eval()

ELMo(
  (embedding_matrix): Embedding(16340, 4)
  (conv_layers): ModuleList(
    (0): Conv1d(4, 4, kernel_size=(1,), stride=(1,))
    (1): Conv1d(4, 8, kernel_size=(2,), stride=(1,))
    (2): Conv1d(4, 26, kernel_size=(3,), stride=(1,))
    (3): Conv1d(4, 32, kernel_size=(4,), stride=(1,))
    (4): Conv1d(4, 64, kernel_size=(5,), stride=(1,))
  )
  (conv_activation): ReLU()
  (highway_layers): ModuleList(
    (0): Linear(in_features=134, out_features=268, bias=True)
    (1): Linear(in_features=134, out_features=268, bias=True)
  )
  (highway_activation): ReLU()
  (highway_projection): Linear(in_features=134, out_features=32, bias=True)
  (lstm): LSTM(32, 256, proj_size=32, num_layers=2, batch_first=True, bidirectional=True)
  (linear): Linear(in_features=64, out_features=16340, bias=True)
)

## Evaluate your embeddings model on IMDB movie reviews dataset (sentiment analysis) 
[Dataset](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews)

Preprocess data

Disable training for ELMo, it will produce 5 embeddings for each word, add trainable parameters $\gamma^{task}$ and $s^{task}_j$

Don't forget metric plots

### Read Data

In [44]:
# read IMDB dataset
import pandas as pd

imdb = pd.read_csv('IMDB Dataset.csv')
imdb

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


### Filter non-ASCII Strings

In [58]:
reviews = imdb['review']
labels = imdb['sentiment']

ascii_review_indices = np.array(list(map(lambda x: x.isascii(), reviews)))
ascii_reviews = reviews[ascii_review_indices]
ascii_reviews = ascii_reviews.reset_index(drop=True)
ascii_reviews

0        One of the other reviewers has mentioned that ...
1        A wonderful little production. <br /><br />The...
2        I thought this was a wonderful way to spend ti...
3        Basically there's a family where a little boy ...
4        Petter Mattei's "Love in the Time of Money" is...
                               ...                        
45335    I thought this movie did a down right good job...
45336    Bad plot, bad dialogue, bad acting, idiotic di...
45337    I am a Catholic taught in parochial elementary...
45338    I'm going to have to disagree with the previou...
45339    No one expects the Star Trek movies to be high...
Name: review, Length: 45340, dtype: object

### Padded Char Ids

In [62]:
padded_review_char_ids = torch.full(
    size=(len(ascii_reviews), max_num_words + 2, max_num_letters),
    fill_value=symbols_vocab['<pad>']
)
padded_review_char_ids.shape

torch.Size([45340, 573, 35])

In [65]:
for sent_num, sent in enumerate(ascii_reviews):
    tokenized = tokenizer(sent)
    for i in range(0, min(len(tokenized), max_num_words) + 2):
        if i == 0:
            padded_review_char_ids[sent_num, i, 0] = symbols_vocab['<bow>']
            padded_review_char_ids[sent_num, i, 1] = symbols_vocab['<bos>']
            padded_review_char_ids[sent_num, i, 2] = symbols_vocab['<eow>']
            continue
        elif i == len(tokenized) + 1:
            padded_review_char_ids[sent_num, i, 0] = symbols_vocab['<bow>']
            padded_review_char_ids[sent_num, i, 1] = symbols_vocab['<eos>']
            padded_review_char_ids[sent_num, i, 2] = symbols_vocab['<eow>']
            continue
        word = tokenized[i - 1]
        for letter_num, letter in enumerate(word):
            if letter_num >= max_num_letters:
                break
            padded_review_char_ids[sent_num, i, letter_num] = symbols_vocab[letter]

### Labels

In [None]:
labels = labels[ascii_review_indices]
labels = labels.reset_index(drop=True)
labels

In [158]:
# encode labels with pandas
labels = labels.map({'positive': 1, 'negative': 0})
labels

0        1
1        1
2        1
3        0
4        1
        ..
45335    1
45336    0
45337    0
45338    0
45339    0
Name: sentiment, Length: 45340, dtype: int64

In [160]:
labels_tensor = torch.tensor(labels.values)
labels_tensor

tensor([1, 1, 1,  ..., 0, 0, 0])

In [161]:
labels_tensor.shape

torch.Size([45340])

In [162]:
# reshape labels_tensor to (batch_size, 1)
labels_tensor = labels_tensor.view(-1, 1)
labels_tensor.shape

torch.Size([45340, 1])

In [163]:
labels_tensor

tensor([[1],
        [1],
        [1],
        ...,
        [0],
        [0],
        [0]])

### Dataset

In [166]:
# create dataset from padded_review_char_ids and labels
from torch.utils.data import TensorDataset

imdb_ds = TensorDataset(padded_review_char_ids, labels_tensor)
imdb_ds

<torch.utils.data.dataset.TensorDataset at 0x2024579c2e0>

### Loader

In [275]:
# create loader for imdb_ds
from torch.utils.data import DataLoader

imdb_dl = DataLoader(imdb_ds, batch_size=16, shuffle=True)
imdb_dl

<torch.utils.data.dataloader.DataLoader at 0x201f4206520>

### ELMo IMDB

In [296]:
from unicodedata import bidirectional
from torch import nn

class ELMoIMDB(ELMo):
    
    def __init__(self, vocab_size=len(vocab), n_tokens=max_num_words, n_chars=max_num_letters):
        super(ELMoIMDB, self).__init__(vocab_size, n_tokens, n_chars)

        #self.elmo = ELMo(vocab_size=len(vocab), n_tokens=max_num_words, n_chars=max_num_letters).to(device)
        #self.elmo.load_state_dict(torch.load('elmo.pt'))

        # freeze model parameters
        for param in self.parameters():
            param.requires_grad = False

        self.gamma = nn.Parameter(torch.randn(1), requires_grad=True)
        self.s_0 = nn.Parameter(torch.randn(1), requires_grad=True)
        self.s_1 = nn.Parameter(torch.randn(1), requires_grad=True)
        self.s_2 = nn.Parameter(torch.randn(1), requires_grad=True)

        # create linear layer with 64 output features and relu activation
        self.linear = nn.Linear(64, 64)
        self.relu = nn.ReLU()

        # create linear layer for classification
        self.classifier = nn.Linear(64, 1)


    def forward(self, x):
        
        embedded = model.embed_input(x)
        token_embedding = model.charCNN(embedded)
        lstm_output1, (h1_n, c1_n) = model.lstm1(token_embedding)
        lstm_output2, (h2_n, c2_n) = model.lstm2(lstm_output1)

        h_0 = torch.cat((token_embedding, token_embedding), dim=1)

        r = self.gamma * (self.s_0 * h_0 + self.s_1 * lstm_output1 + self.s_2 * lstm_output2)

        r = r.view(-1, self.n_tokens + 2, 2 * self.elmo_output_size)

        out = torch.sum(r, dim=1)

        # pass out through linear layer with relu activation
        out = self.linear(out)
        out = self.relu(out)

        # pass out through classifier
        out = self.classifier(out)

        return out 


In [297]:
# create elmoimdb model
elmoimdb = ELMoIMDB().to(device)

In [298]:
# train elmoimdb model on imdb dataloader

# define loss function
loss_fn = nn.BCEWithLogitsLoss()

# define optimizer
optimizer = torch.optim.Adam(elmoimdb.parameters(), lr=0.001)

# train model
for epoch in range(1):

    # set model to train mode
    elmoimdb.train()

    # # initialize loss
    # epoch_loss = 0

    running_loss = 0.0

    for i, (x, y) in enumerate(imdb_dl):

        # move data to device
        x = x.to(device)
        y = y.to(device)

        # zero out gradients
        optimizer.zero_grad()

        # pass data through model
        y_hat = elmoimdb(x)

        # calculate loss
        loss = loss_fn(y_hat, y.float())

        # backpropagate loss
        loss.backward()

        # update model parameters
        optimizer.step()

        # # add loss to epoch loss
        # epoch_loss += loss.item()

        running_loss += loss.item()
        if i % 16 == 15:
            print(f'[{epoch + 1}, {i + 1:5d}] loss: {running_loss}')
            running_loss = 0.0

    # # calculate average epoch loss
    # epoch_loss /= len(imdb_dl)

    # # print epoch loss
    # print(f'Epoch {epoch} loss: {epoch_loss}')

[1,    16] loss: 12.102257370948792
[1,    32] loss: 11.29083788394928
[1,    48] loss: 11.188578069210052
[1,    64] loss: 11.170762479305267
[1,    80] loss: 11.105431199073792
[1,    96] loss: 11.249241352081299
[1,   112] loss: 11.259655594825745
[1,   128] loss: 11.243346691131592
[1,   144] loss: 11.39393413066864
[1,   160] loss: 11.189066648483276
[1,   176] loss: 11.161707699298859
[1,   192] loss: 11.071907043457031


KeyboardInterrupt: 

In [208]:
crossentropy = nn.CrossEntropyLoss()
optimizer = optim.Adam([elmoimdb.gamma, elmoimdb.s_0, elmoimdb.s_1, elmoimdb.s_2], lr=0.001)

AttributeError: 'ELMoIMDB' object has no attribute 'gamma'

In [204]:
# train the model
from torch import optim
from tqdm import tqdm

model = ELMo(vocab_size=len(vocab), n_tokens=max_num_words, n_chars=max_num_letters).to(device)

# freeze model parameters
for param in model.parameters():
    param.requires_grad = False

gamma = nn.Parameter(torch.randn(1).to(device))
s_0 = nn.Parameter(torch.randn(1).to(device))
s_1 = nn.Parameter(torch.randn(1).to(device))
s_2 = nn.Parameter(torch.randn(1).to(device))

# create linear layer for classification
linear = nn.Linear(2 * model.elmo_output_size, 1).to(device)

crossentropy = nn.CrossEntropyLoss()
optimizer = optim.Adam([gamma, s_0, s_1, s_2], lr=0.001)

for epoch in range(2):
    
    running_loss = 0.0
    
    for i, batch in enumerate(imdb_dl):
        
        optimizer.zero_grad()
        
        char_ids, labels = batch
        char_ids = char_ids.to(device)
        labels = labels.to(device)
        
        embedded = model.embed_input(x)
        token_embedding = model.charCNN(embedded)
        lstm_output1, (h1_n, c1_n) = model.lstm1(token_embedding)
        lstm_output2, (h2_n, c2_n) = model.lstm2(lstm_output1)
        
        h_0 = torch.cat((token_embedding, token_embedding), dim=1)

        r = gamma * (s_0 * h_0 + s_1 * lstm_output1 + s_2 * lstm_output2)

        logits = linear(r)

        #logits = model(char_ids)
        
        #print(r.shape, labels.shape)

        loss = crossentropy(logits, labels)
        loss.backward()
        
        optimizer.step()

        running_loss += loss.item()
        if i % BATCH_SIZE == BATCH_SIZE-1:
            print(f'[{epoch + 1}, {i + 1:5d}] loss: {running_loss}')
            running_loss = 0.0

ValueError: Expected input batch_size (1146) to match target batch_size (1).

## Compare the results with BERT embeddings
you can choose other bert model

In [None]:
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Hello world!", return_tensors="pt")
outputs = model(**inputs)