# [Character Aware Language Model](https://arxiv.org/pdf/1508.06615.pdf)

We have character Aware Language Model Implemented here from scratch.
Since the training is big we used validation data to train the model and test data as our validation data.

Note- Since validation data is quite smaller that's why we see perplexity not improving much

![](images/char_rnn.png)
![](images/char_rnn_1.png)

In [1]:
!mkdir data
%cd data
!wget https://raw.githubusercontent.com/wojzaremba/lstm/master/data/ptb.train.txt
!wget https://raw.githubusercontent.com/wojzaremba/lstm/master/data/ptb.valid.txt
!wget https://raw.githubusercontent.com/wojzaremba/lstm/master/data/ptb.test.txt
%cd ..

/content/data
--2020-08-24 18:57:37--  https://raw.githubusercontent.com/wojzaremba/lstm/master/data/ptb.train.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5101618 (4.9M) [text/plain]
Saving to: ‘ptb.train.txt’


2020-08-24 18:57:37 (32.6 MB/s) - ‘ptb.train.txt’ saved [5101618/5101618]

--2020-08-24 18:57:37--  https://raw.githubusercontent.com/wojzaremba/lstm/master/data/ptb.valid.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 399782 (390K) [text/plain]
Saving to: ‘ptb.valid.txt’


2020-08-24 18:57:38 (9.01 MB/s) - ‘ptb.valid.t

In [2]:
import json
import os
import sys
import math

from tqdm import tqdm

from torchtext import data
from torchtext import datasets

import torch
import torch.nn as nn
import torch.nn.functional as F

import torch.optim as optim
import torch.optim.lr_scheduler as lr_scheduler

BPTT = 35
DATA_DIR = 'data'

N_EPOCHS = 10
INIT_LR = 0.5
BATCH_SIZE = 32
SCHEDULER_PATIENCE = 0
SCHEDULER_FACTOR = 0.1
SCHEDULER_THRESHOLD = 1e-1
CLIP = 5.0

EMBEDDING_DIM = 20
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Preparing data for LM

In [3]:
def process_ptb(in_filename, out_filename):
    with open(f'{in_filename}', 'r') as r:
        data = r.read()
        data = data.replace('\n', '<eos>')
        data = data.split()

    examples = []

    for i, _ in enumerate(data[:-BPTT-1]):
        examples.append({'words':data[i:i+BPTT], 'targets': data[i+1:i+BPTT+1]})

    with open(f'{out_filename}', 'w') as w:
        for example in examples:
            json.dump(example, w)
            w.write('\n')

process_ptb(os.path.join(DATA_DIR, 'ptb.test.txt'), os.path.join(DATA_DIR, 'ptb.test.jsonl'))
process_ptb(os.path.join(DATA_DIR, 'ptb.valid.txt'), os.path.join(DATA_DIR, 'ptb.valid.jsonl'))

In [4]:
CHAR_NESTING = data.Field(batch_first=True, tokenize=list, init_token='<sos>', eos_token='<eos>')
CHARS = data.NestedField(CHAR_NESTING)
TARGETS = data.Field(batch_first=True)

fields = {'words': ('chars', CHARS), 'targets': ('targets', TARGETS)}
#get data from csv
train, valid,  = data.TabularDataset.splits(
                path = 'data',
                train = 'ptb.valid.jsonl',
                validation = 'ptb.test.jsonl',
                format = 'json',
                fields = fields
)

TARGETS.build_vocab(train)
CHARS.build_vocab(train)

print(f'{len(CHARS.vocab)} characters in character vocab')
print(f'char vocab = {CHARS.vocab.itos}')

print(f'{len(TARGETS.vocab)} words in target vocab')
print(f'most common words = {TARGETS.vocab.freqs.most_common(10)}')

train_iter, valid_iter, = data.Iterator.splits((train, valid,),
                                             batch_size=BATCH_SIZE,
                                             sort=False,
                                             repeat=False,
                                             device=device)

52 characters in character vocab
char vocab = ['<unk>', '<pad>', '<sos>', '<eos>', 'e', 't', 'o', 'a', 's', 'n', 'i', 'r', 'h', 'l', 'd', 'u', 'c', 'm', '<', '>', 'f', 'p', 'k', 'g', 'y', 'b', 'w', 'v', 'N', "'", '.', 'x', '$', 'j', '-', 'q', 'z', '&', '0', '1', '9', '3', '5', '#', '8', '2', '\\', '*', '4', '6', '7', '/']
6023 words in target vocab
most common words = [('the', 144224), ('<unk>', 121928), ('<eos>', 117894), ('N', 90966), ('of', 64120), ('to', 61193), ('a', 60789), ('and', 48656), ('in', 48641), ("'s", 30364)]


In [5]:
next(iter(train_iter))


[torchtext.data.batch.Batch of size 32]
	[.chars]:[torch.cuda.LongTensor of size 32x35x18 (GPU 0)]
	[.targets]:[torch.cuda.LongTensor of size 32x35 (GPU 0)]

In [6]:
class TimeDistributed(nn.Module):
    def __init__(self, modules):
        super().__init__()
        self.mod =  modules
    
    def forward(self, inputs):
        # input shape [batch_size, seq_len, chars_vocab, emb_dim]
        input_size = inputs.size()
        output_shape = [-1]+[i for i in input_size[2:]]
        # output_shape [batch_size*seq_len , chars_vocab, emb_dim]
        inputs = inputs.view(*output_shape)
        inputs = self.mod(inputs)
        # inputs [batch_size*seq_len, chars_vocab, hid_dim]
        output_shape = [input_size[0], input_size[1]] + [x for x in inputs.size()[1:]]
        # output_shape [batch_size, seq_len, chars_vocab, hid_dim]
        reshaped_input = inputs.view(*output_shape).contiguous()
        return reshaped_input

In [7]:
class ConvLayer(nn.Module):
    def __init__(self,emb_dim,conv_layer,mul_factor,dropout=0.2):
        super().__init__()
        self.convModule = nn.ModuleList(
            [TimeDistributed(nn.Conv1d(emb_dim, i*mul_factor , i)) for i in conv_layer]
        )
    def forward(self,input):
        # input [batch_size, seq, nchars, emb]
        convs = [ F.gelu(conv(input.transpose(2,3))) for conv in self.convModule]
        # convs = [batch_size, seq, conv_layer[i]*25 , nchars-conv_layer[i]+1]
        pool_out = [F.max_pool2d(conv, (1,conv.shape[3])).squeeze(-1) for conv in convs]
        # pool_out [batch_size,seq,conv_layer[i]*25]
        pool_cat = torch.cat(pool_out, dim = 2)
        return pool_cat

In [8]:
class HighWayNetwork(nn.Module):
    def __init__(self,in_dim):
        super().__init__()
        self.hx = nn.Linear(in_dim, in_dim)
        self.tx = nn.Linear(in_dim, in_dim)
    def forward(self,input):
        #input [batch_size, seq_len, in_dim]
        t = torch.sigmoid(self.tx(input))
        # t [batch_size, seq_len, out_dim]
        h = F.relu(self.hx(input))
        # h [batch_size, seq_len, out_dim]
        return h*t + input*(1-t)

In [9]:
class CharLM(nn.Module):
    def __init__(self, nchars, output_dim, emb_dim, dropout=0.2):
        super().__init__()
        conv_layer = [2,3,4,5,6]
        mul_factor = 20
        self.emb = TimeDistributed(nn.Embedding(nchars, emb_dim,))
        self.conv_layer = ConvLayer(emb_dim,conv_layer, mul_factor,dropout)
        self.in_dim = sum(conv_layer)*mul_factor
        self.highway_layer = TimeDistributed(HighWayNetwork(self.in_dim))
        self.lstm = nn.LSTM(self.in_dim, self.in_dim//2 ,2,False,True,dropout, )
        self.classify = nn.Linear(self.in_dim//2 ,output_dim,bias=False)

    def forward(self, input, hidden):
        embed = self.emb(input)
        conv_output = self.conv_layer(embed)
        highway_output = self.highway_layer(conv_output)
        output ,_ = self.lstm(highway_output , hidden)
        output = output.contiguous().view(-1, output.shape[-1])
        return self.classify(output) ,hidden

    def init_hidden(self, batch_size):
        hidden =(
            torch.zeros(2, batch_size, self.in_dim//2).to(device),
            torch.zeros(2, batch_size, self.in_dim//2).to(device),
        )
        return hidden

In [10]:
model = CharLM(len(CHARS.vocab) , len(TARGETS.vocab), EMBEDDING_DIM)
model = model.to(device)
model

CharLM(
  (emb): TimeDistributed(
    (mod): Embedding(52, 20)
  )
  (conv_layer): ConvLayer(
    (convModule): ModuleList(
      (0): TimeDistributed(
        (mod): Conv1d(20, 40, kernel_size=(2,), stride=(1,))
      )
      (1): TimeDistributed(
        (mod): Conv1d(20, 60, kernel_size=(3,), stride=(1,))
      )
      (2): TimeDistributed(
        (mod): Conv1d(20, 80, kernel_size=(4,), stride=(1,))
      )
      (3): TimeDistributed(
        (mod): Conv1d(20, 100, kernel_size=(5,), stride=(1,))
      )
      (4): TimeDistributed(
        (mod): Conv1d(20, 120, kernel_size=(6,), stride=(1,))
      )
    )
  )
  (highway_layer): TimeDistributed(
    (mod): HighWayNetwork(
      (hx): Linear(in_features=400, out_features=400, bias=True)
      (tx): Linear(in_features=400, out_features=400, bias=True)
    )
  )
  (lstm): LSTM(400, 200, num_layers=2, bias=False, batch_first=True, dropout=0.2)
  (classify): Linear(in_features=200, out_features=6023, bias=False)
)

In [11]:
print(f"Number of params {sum([p.numel() for p in model.parameters() if p.requires_grad])}")

Number of params 2362840


In [None]:
def repackage_hidden(h):
    """Wraps hidden states in new Tensors, to detach them from their history."""
    if isinstance(h, torch.Tensor):
        return h.detach().clone()
    else:
        return tuple(repackage_hidden(v) for v in h)

In [12]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=INIT_LR,momentum=0.9,nesterov=True, weight_decay=1e-5)
scheduler = lr_scheduler.ReduceLROnPlateau(optimizer, threshold=SCHEDULER_THRESHOLD, threshold_mode='abs', factor=SCHEDULER_FACTOR, patience=SCHEDULER_PATIENCE, verbose=True)


best_valid_loss = float('inf')

for epoch in range(1, N_EPOCHS+1):

    epoch_loss = 0
    epoch_acc = 0

    model.train()
    h = model.init_hidden(BATCH_SIZE)
    for batch in tqdm(train_iter, desc='Train'):

        optimizer.zero_grad()
        if batch.chars.size(0) != h[0].shape[1]:
            h = model.init_hidden(batch.chars.size(0))
        h = repackage_hidden(h) # detach the tensor otherwise it will backpropagate till the entire dataset.
        predictions, h = model(batch.chars, h)

        loss = criterion(predictions, batch.targets.view(-1,))

        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), CLIP)

        optimizer.step()

        epoch_loss += loss.item()

    #calculate metrics averaged across whole epoch
    train_loss = epoch_loss / len(train_iter)

    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    val_h = model.init_hidden(BATCH_SIZE)
    for batch in tqdm(valid_iter, desc='Valid'):

        with torch.no_grad():
            if batch.chars.size(0) != val_h[0].shape[1]:
                val_h = model.init_hidden(batch.chars.size(0))
            predictions, val_h = model(batch.chars,val_h)

            loss = criterion(predictions, batch.targets.view(-1,))
            
            epoch_loss += loss.item()

    #calculate metrics averaged across whole epoch
    valid_loss = epoch_loss / len(valid_iter)

    #update learning rate
    scheduler.step(math.exp(valid_loss))

    #print metrics
    print(f'\nEpoch: {epoch}') 
    print(f'Train Loss: {train_loss:.3f}, Train PPL: {math.exp(train_loss):.2f}')
    print(f'Valid Loss: {valid_loss:.3f}, Valid PPL: {math.exp(valid_loss):.2f}')

    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'model.pt')

Train: 100%|██████████| 2304/2304 [00:50<00:00, 45.98it/s]
Valid: 100%|██████████| 2575/2575 [00:23<00:00, 108.66it/s]
Train:   0%|          | 1/2304 [00:00<04:07,  9.31it/s]


Epoch: 1
Train Loss: 5.399, Train PPL: 221.09
Valid Loss: 5.255, Valid PPL: 191.45


Train: 100%|██████████| 2304/2304 [00:51<00:00, 44.97it/s]
Valid: 100%|██████████| 2575/2575 [00:24<00:00, 105.46it/s]
Train:   0%|          | 1/2304 [00:00<04:14,  9.05it/s]

Epoch     2: reducing learning rate of group 0 to 5.0000e-02.

Epoch: 2
Train Loss: 4.146, Train PPL: 63.20
Valid Loss: 5.445, Valid PPL: 231.71


Train: 100%|██████████| 2304/2304 [00:51<00:00, 44.88it/s]
Valid: 100%|██████████| 2575/2575 [00:23<00:00, 110.54it/s]
Train:   0%|          | 1/2304 [00:00<04:18,  8.91it/s]

Epoch     3: reducing learning rate of group 0 to 5.0000e-03.

Epoch: 3
Train Loss: 3.446, Train PPL: 31.38
Valid Loss: 5.471, Valid PPL: 237.78


Train: 100%|██████████| 2304/2304 [00:51<00:00, 44.54it/s]
Valid: 100%|██████████| 2575/2575 [00:23<00:00, 109.48it/s]
Train:   0%|          | 1/2304 [00:00<04:13,  9.08it/s]

Epoch     4: reducing learning rate of group 0 to 5.0000e-04.

Epoch: 4
Train Loss: 3.326, Train PPL: 27.82
Valid Loss: 5.484, Valid PPL: 240.86


Train: 100%|██████████| 2304/2304 [00:52<00:00, 44.17it/s]
Valid: 100%|██████████| 2575/2575 [00:23<00:00, 111.08it/s]
Train:   0%|          | 1/2304 [00:00<04:13,  9.07it/s]

Epoch     5: reducing learning rate of group 0 to 5.0000e-05.

Epoch: 5
Train Loss: 3.312, Train PPL: 27.43
Valid Loss: 5.484, Valid PPL: 240.78


Train: 100%|██████████| 2304/2304 [00:51<00:00, 44.91it/s]
Valid: 100%|██████████| 2575/2575 [00:22<00:00, 112.18it/s]
Train:   0%|          | 1/2304 [00:00<04:06,  9.33it/s]

Epoch     6: reducing learning rate of group 0 to 5.0000e-06.

Epoch: 6
Train Loss: 3.310, Train PPL: 27.40
Valid Loss: 5.484, Valid PPL: 240.73


Train: 100%|██████████| 2304/2304 [00:51<00:00, 45.06it/s]
Valid: 100%|██████████| 2575/2575 [00:24<00:00, 106.74it/s]
Train:   0%|          | 1/2304 [00:00<04:05,  9.39it/s]

Epoch     7: reducing learning rate of group 0 to 5.0000e-07.

Epoch: 7
Train Loss: 3.310, Train PPL: 27.38
Valid Loss: 5.484, Valid PPL: 240.73


Train: 100%|██████████| 2304/2304 [00:51<00:00, 45.15it/s]
Valid: 100%|██████████| 2575/2575 [00:23<00:00, 109.33it/s]
Train:   0%|          | 1/2304 [00:00<04:20,  8.83it/s]

Epoch     8: reducing learning rate of group 0 to 5.0000e-08.

Epoch: 8
Train Loss: 3.310, Train PPL: 27.39
Valid Loss: 5.484, Valid PPL: 240.73


Train: 100%|██████████| 2304/2304 [00:51<00:00, 44.57it/s]
Valid: 100%|██████████| 2575/2575 [00:23<00:00, 111.81it/s]
Train:   0%|          | 1/2304 [00:00<04:17,  8.93it/s]

Epoch     9: reducing learning rate of group 0 to 5.0000e-09.

Epoch: 9
Train Loss: 3.310, Train PPL: 27.38
Valid Loss: 5.484, Valid PPL: 240.73


Train: 100%|██████████| 2304/2304 [00:51<00:00, 45.10it/s]
Valid: 100%|██████████| 2575/2575 [00:22<00:00, 112.27it/s]


Epoch: 10
Train Loss: 3.310, Train PPL: 27.38
Valid Loss: 5.484, Valid PPL: 240.73





# [ELMo (Embeddings from Language Model)](https://arxiv.org/pdf/1802.05365.pdf)

The point where word2vec, Glove and fasttext failed is, The meaning, semantics of a words changes with respect to the words it is surrounded by in a sentence and these models wheren't able to represent that.

For example:- 
"
1. Apple is in profit.
2. Apple is tasty.

Apple in first sentence refers to the company whereas apple in second sentence refers to fruit.

In word2vec model this one word with two meaning will be represented by same vector.

To solve this issue ELMo uses stacked bi-LSTM to generate embeddings for each words which is context dependent.

so In the above example Apple in first and second case will have two different vector representation.

## Understanding ELMO

Read Section 3.1 to understand ELMO implementation for details. In short, It does forward and Backward Language modelling

The Multi-layer Bi-LSTM take independent words as input and the hidden state of the LSTM is the output embedding representing the word.

But Before passing the words as input to BI-LSTM the words are passed throught a char-CNN model to generate a fixed representation of a the word.

Since we are using bidirectional LSTM so the forward and backward output of LSTM is returned as the embeddings for that particular word.

output = [forward_representaion,backward_representation]

we can average the two representation to get one representation of the word.


# ELMO Application

Since ELMo embeddings are trained in a unsupervised way. These embeddings can be used for downstream task in supervised way.

To add ELMo in a supervised task. First Freeze the bi-LSTM layer and generate ELMo embeddings concatenate it with x and pass it to the the task RNN.

output = [x,ELMo]

where x is the normal context-independent token representation.