<a href="https://colab.research.google.com/github/dexter11235813/END_1.0/blob/main/assignment_13/END_assignment13.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Downloading requisite files / packages**

In [1]:
!wget "http://www.cs.cornell.edu/~cristian/data/cornell_movie_dialogs_corpus.zip"


--2021-02-25 18:55:45--  http://www.cs.cornell.edu/~cristian/data/cornell_movie_dialogs_corpus.zip
Resolving www.cs.cornell.edu (www.cs.cornell.edu)... 132.236.207.36
Connecting to www.cs.cornell.edu (www.cs.cornell.edu)|132.236.207.36|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9916637 (9.5M) [application/zip]
Saving to: ‘cornell_movie_dialogs_corpus.zip.1’


2021-02-25 18:55:46 (15.2 MB/s) - ‘cornell_movie_dialogs_corpus.zip.1’ saved [9916637/9916637]



In [2]:
!unzip "cornell_movie_dialogs_corpus.zip"

Archive:  cornell_movie_dialogs_corpus.zip
replace cornell movie-dialogs corpus/.DS_Store? [y]es, [n]o, [A]ll, [N]one, [r]ename: N


In [3]:
!python -m spacy download en


[38;5;2m✔ Download and installation successful[0m
You can now load the model via spacy.load('en_core_web_sm')
[38;5;2m✔ Linking successful[0m
/usr/local/lib/python3.7/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.7/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')


In [335]:

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torchtext
from torchtext.datasets import Multi30k
from torchtext.data import Field, BucketIterator, Iterator
from torchtext import data

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

import spacy
import numpy as np

import random
import math
import time

In [639]:

SEED = 42
device = 'cuda'
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True



In [640]:
import pandas as pd
df=pd.read_csv("/content/cornell movie-dialogs corpus/movie_lines.txt", sep='\+\+\+\$\+\+\+', header=None)
df.columns = ['line_id','char_id', 'movie_id', 'char_name', 'dialog' ]
df.head()

  


Unnamed: 0,line_id,char_id,movie_id,char_name,dialog
0,L1045,u0,m0,BIANCA,They do not!
1,L1044,u2,m0,CAMERON,They do to!
2,L985,u0,m0,BIANCA,I hope so.
3,L984,u2,m0,CAMERON,She okay?
4,L925,u0,m0,BIANCA,Let's go.


In [641]:
df['line_id'] = df.line_id.apply(lambda x: x[1:]).astype(int)
df = df.sort_values('line_id')
df.head()

Unnamed: 0,line_id,char_id,movie_id,char_name,dialog
86,49,u0,m0,BIANCA,Did you change your hair?
85,50,u3,m0,CHASTITY,No.
84,51,u0,m0,BIANCA,You might wanna think about it
648,59,u9,m0,PATRICK,I missed you.
647,60,u8,m0,MISS PERKY,It says here you exposed yourself to a group ...


In [642]:
df['reply'] = df['dialog'].shift(-1)
df.head()

Unnamed: 0,line_id,char_id,movie_id,char_name,dialog,reply
86,49,u0,m0,BIANCA,Did you change your hair?,No.
85,50,u3,m0,CHASTITY,No.,You might wanna think about it
84,51,u0,m0,BIANCA,You might wanna think about it,I missed you.
648,59,u9,m0,PATRICK,I missed you.,It says here you exposed yourself to a group ...
647,60,u8,m0,MISS PERKY,It says here you exposed yourself to a group ...,It was a bratwurst. I was eating lunch.


In [646]:
max_length = 22
df = df[(df['dialog'].str.len() <= max_length) & (df['reply'].str.len() <= max_length)].drop(columns=['char_id', 'movie_id', 'char_name'])


In [647]:
df['dialog'] = df['dialog'].astype(str)
df['reply'] = df['reply'].astype(str)
df['line_id'] = df['line_id'].astype(int)

In [648]:
df.head()

Unnamed: 0,line_id,dialog,reply
266,63,You the new guy?,So they tell me...
258,71,Thirty-two.,Get out!
385,127,He always look so,Block E?
382,130,Just a little.,What's this?
381,131,What's this?,An attempted slit.


In [649]:
dialog = data.Field(tokenize='spacy', init_token='<sos>', eos_token='<eos>', lower=True)
reply = data.Field(tokenize='spacy', init_token='<sos>', eos_token='<eos>', lower=True)
line_id = data.Field(sequential=False, use_vocab=False)
fields = [('dialog', dialog), ('reply', reply), ('line_id', line_id)]
example = [data.Example.fromlist([df.dialog.iloc[i], df.reply.iloc[i], df.line_id.iloc[i]], fields) for i in range(df.shape[0])]

In [650]:
train_dset = data.Dataset(example[:int(len(df) * 0.6)], fields)
test_dset = data.Dataset(example[int(len(df) * 0.6): ], fields)

In [651]:
dialog.build_vocab(train_dset, min_freq = 3)
reply.build_vocab(test_dset, min_freq = 3)

In [652]:
BATCH_SIZE = 256

train_iterator = BucketIterator(train_dset, batch_size=BATCH_SIZE, sort=True,
                           sort_key = lambda x: -x.line_id,
                           sort_within_batch=True, device = device)
test_iterator = BucketIterator(test_dset, batch_size=BATCH_SIZE, sort=True,
                           sort_key = lambda x: -x.line_id,
                           sort_within_batch=True, device = device)

In [653]:
class Encoder(nn.Module):
    def __init__(self, 
                 input_dim, 
                 hid_dim, 
                 n_layers, 
                 n_heads, 
                 pf_dim,
                 dropout, 
                 device,
                 max_length = 120):
        super().__init__()

        self.device = device
        
        self.tok_embedding = nn.Embedding(input_dim, hid_dim)
        self.pos_embedding = nn.Embedding(max_length, hid_dim)
        
        self.layers = nn.ModuleList([EncoderLayer(hid_dim, 
                                                  n_heads, 
                                                  pf_dim,
                                                  dropout, 
                                                  device) 
                                     for _ in range(n_layers)])
        
        self.dropout = nn.Dropout(dropout)
        
        self.scale = torch.sqrt(torch.FloatTensor([hid_dim])).to(device)
        
    def forward(self, src, src_mask):
        
        #src = [batch size, src len]
        #src_mask = [batch size, 1, 1, src len]
        
        batch_size = src.shape[0]
        src_len = src.shape[1]
        
        pos = torch.arange(0, src_len).unsqueeze(0).repeat(batch_size, 1).to(self.device)
        
        #pos = [batch size, src len]

        src = self.dropout((self.tok_embedding(src) * self.scale) + self.pos_embedding(pos))
        
        #src = [batch size, src len, hid dim]
        
        for layer in self.layers:
            src = layer(src, src_mask)
            
        #src = [batch size, src len, hid dim]
            
        return src


class EncoderLayer(nn.Module):
    def __init__(self, 
                 hid_dim, 
                 n_heads, 
                 pf_dim,  
                 dropout, 
                 device):
        super().__init__()
        
        self.self_attn_layer_norm = nn.LayerNorm(hid_dim)
        self.ff_layer_norm = nn.LayerNorm(hid_dim)
        self.self_attention = MultiHeadAttentionLayer(hid_dim, n_heads, dropout, device)
        self.positionwise_feedforward = PositionwiseFeedforwardLayer(hid_dim, 
                                                                     pf_dim, 
                                                                     dropout)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src, src_mask):
        
        #src = [batch size, src len, hid dim]
        #src_mask = [batch size, 1, 1, src len] 
                
        #self attention
        _src, _ = self.self_attention(src, src, src, src_mask)
        
        #dropout, residual connection and layer norm
        src = self.self_attn_layer_norm(src + self.dropout(_src))
        
        #src = [batch size, src len, hid dim]
        
        #positionwise feedforward
        _src = self.positionwise_feedforward(src)
        
        #dropout, residual and layer norm
        src = self.ff_layer_norm(src + self.dropout(_src))
        
        #src = [batch size, src len, hid dim]
        
        return src

In [654]:
class MultiHeadAttentionLayer(nn.Module):
    def __init__(self, hid_dim, n_heads, dropout, device):
        super().__init__()
        
        assert hid_dim % n_heads == 0
        
        self.hid_dim = hid_dim
        self.n_heads = n_heads
        self.head_dim = hid_dim // n_heads
        
        self.fc_q = nn.Linear(hid_dim, hid_dim)
        self.fc_k = nn.Linear(hid_dim, hid_dim)
        self.fc_v = nn.Linear(hid_dim, hid_dim)
        
        self.fc_o = nn.Linear(hid_dim, hid_dim)
        
        self.dropout = nn.Dropout(dropout)
        
        self.scale = torch.sqrt(torch.FloatTensor([self.head_dim])).to(device)
        
    def forward(self, query, key, value, mask = None):
        
        batch_size = query.shape[0]
        
        #query = [batch size, query len, hid dim]
        #key = [batch size, key len, hid dim]
        #value = [batch size, value len, hid dim]
                
        Q = self.fc_q(query)
        K = self.fc_k(key)
        V = self.fc_v(value)
        
        #Q = [batch size, query len, hid dim]
        #K = [batch size, key len, hid dim]
        #V = [batch size, value len, hid dim]
                
        Q = Q.view(batch_size, -1, self.n_heads, self.head_dim).permute(0, 2, 1, 3)
        K = K.view(batch_size, -1, self.n_heads, self.head_dim).permute(0, 2, 1, 3)
        V = V.view(batch_size, -1, self.n_heads, self.head_dim).permute(0, 2, 1, 3)
        
        #Q = [batch size, n heads, query len, head dim]
        #K = [batch size, n heads, key len, head dim]
        #V = [batch size, n heads, value len, head dim]
                
        energy = torch.matmul(Q, K.permute(0, 1, 3, 2)) / self.scale
        
        #energy = [batch size, n heads, query len, key len]
        
        if mask is not None:
            energy = energy.masked_fill(mask == 0, -1e10)
        
        attention = torch.softmax(energy, dim = -1)
                
        #attention = [batch size, n heads, query len, key len]
                
        x = torch.matmul(self.dropout(attention), V)
        
        #x = [batch size, n heads, query len, head dim]
        
        x = x.permute(0, 2, 1, 3).contiguous()
        
        #x = [batch size, query len, n heads, head dim]
        
        x = x.view(batch_size, -1, self.hid_dim)
        
        #x = [batch size, query len, hid dim]
        
        x = self.fc_o(x)
        
        #x = [batch size, query len, hid dim]
        
        return x, attention

class PositionwiseFeedforwardLayer(nn.Module):
    def __init__(self, hid_dim, pf_dim, dropout):
        super().__init__()
        
        self.fc_1 = nn.Linear(hid_dim, pf_dim)
        self.fc_2 = nn.Linear(pf_dim, hid_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x):
        
        #x = [batch size, seq len, hid dim]
        
        x = self.dropout(torch.relu(self.fc_1(x)))
        
        #x = [batch size, seq len, pf dim]
        
        x = self.fc_2(x)
        
        #x = [batch size, seq len, hid dim]
        
        return x

In [655]:
class Decoder(nn.Module):
    def __init__(self, 
                 output_dim, 
                 hid_dim, 
                 n_layers, 
                 n_heads, 
                 pf_dim, 
                 dropout, 
                 device,
                 max_length = 120):
        super().__init__()
        
        self.device = device
        
        self.tok_embedding = nn.Embedding(output_dim, hid_dim)
        self.pos_embedding = nn.Embedding(max_length, hid_dim)
        
        self.layers = nn.ModuleList([DecoderLayer(hid_dim, 
                                                  n_heads, 
                                                  pf_dim, 
                                                  dropout, 
                                                  device)
                                     for _ in range(n_layers)])
        
        self.fc_out = nn.Linear(hid_dim, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
        self.scale = torch.sqrt(torch.FloatTensor([hid_dim])).to(device)
        
    def forward(self, trg, enc_src, trg_mask, src_mask):
        
        #trg = [batch size, trg len]
        #enc_src = [batch size, src len, hid dim]
        #trg_mask = [batch size, 1, trg len, trg len]
        #src_mask = [batch size, 1, 1, src len]
                
        batch_size = trg.shape[0]
        trg_len = trg.shape[1]
        
        pos = torch.arange(0, trg_len).unsqueeze(0).repeat(batch_size, 1).to(self.device)
                            
        #pos = [batch size, trg len]

        trg = self.dropout((self.tok_embedding(trg) * self.scale) + self.pos_embedding(pos))
                
        #trg = [batch size, trg len, hid dim]
        
        for layer in self.layers:
            trg, attention = layer(trg, enc_src, trg_mask, src_mask)
        
        #trg = [batch size, trg len, hid dim]
        #attention = [batch size, n heads, trg len, src len]
        
        output = self.fc_out(trg)
        
        #output = [batch size, trg len, output dim]
            
        return output, attention

class DecoderLayer(nn.Module):
    def __init__(self, 
                 hid_dim, 
                 n_heads, 
                 pf_dim, 
                 dropout, 
                 device):
        super().__init__()
        
        self.self_attn_layer_norm = nn.LayerNorm(hid_dim)
        self.enc_attn_layer_norm = nn.LayerNorm(hid_dim)
        self.ff_layer_norm = nn.LayerNorm(hid_dim)
        self.self_attention = MultiHeadAttentionLayer(hid_dim, n_heads, dropout, device)
        self.encoder_attention = MultiHeadAttentionLayer(hid_dim, n_heads, dropout, device)
        self.positionwise_feedforward = PositionwiseFeedforwardLayer(hid_dim, 
                                                                     pf_dim, 
                                                                     dropout)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, trg, enc_src, trg_mask, src_mask):
        
        #trg = [batch size, trg len, hid dim]
        #enc_src = [batch size, src len, hid dim]
        #trg_mask = [batch size, 1, trg len, trg len]
        #src_mask = [batch size, 1, 1, src len]
        
        #self attention
        _trg, _ = self.self_attention(trg, trg, trg, trg_mask)
        
        #dropout, residual connection and layer norm
        trg = self.self_attn_layer_norm(trg + self.dropout(_trg))
            
        #trg = [batch size, trg len, hid dim]
            
        #encoder attention
        _trg, attention = self.encoder_attention(trg, enc_src, enc_src, src_mask)
        # query, key, value
        
        #dropout, residual connection and layer norm
        trg = self.enc_attn_layer_norm(trg + self.dropout(_trg))
                    
        #trg = [batch size, trg len, hid dim]
        
        #positionwise feedforward
        _trg = self.positionwise_feedforward(trg)
        
        #dropout, residual and layer norm
        trg = self.ff_layer_norm(trg + self.dropout(_trg))
        
        #trg = [batch size, trg len, hid dim]
        #attention = [batch size, n heads, trg len, src len]
        
        return trg, attention

In [656]:
class Seq2Seq(nn.Module):
    def __init__(self, 
                 encoder, 
                 decoder, 
                 src_pad_idx, 
                 trg_pad_idx, 
                 device):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        self.src_pad_idx = src_pad_idx
        self.trg_pad_idx = trg_pad_idx
        self.device = device
        
    def make_src_mask(self, src):
        
        #src = [batch size, src len]
        
        src_mask = (src != self.src_pad_idx).unsqueeze(1).unsqueeze(2)

        #src_mask = [batch size, 1, 1, src len]

        return src_mask
    
    def make_trg_mask(self, trg):
        
        #trg = [batch size, trg len]
        
        trg_pad_mask = (trg != self.trg_pad_idx).unsqueeze(1).unsqueeze(2)
        
        #trg_pad_mask = [batch size, 1, 1, trg len]
        
        trg_len = trg.shape[1]
        
        trg_sub_mask = torch.tril(torch.ones((trg_len, trg_len), device = self.device)).bool()
        
        #trg_sub_mask = [trg len, trg len]
            
        trg_mask = trg_pad_mask & trg_sub_mask
        
        #trg_mask = [batch size, 1, trg len, trg len]
        
        return trg_mask

    def forward(self, src, trg):
        
        #src = [batch size, src len]
        #trg = [batch size, trg len]
                
        src_mask = self.make_src_mask(src)
        trg_mask = self.make_trg_mask(trg)
        
        #src_mask = [batch size, 1, 1, src len]
        #trg_mask = [batch size, 1, trg len, trg len]
        
        enc_src = self.encoder(src, src_mask)
        
        #enc_src = [batch size, src len, hid dim]
                
        output, attention = self.decoder(trg, enc_src, trg_mask, src_mask)
        
        #output = [batch size, trg len, output dim]
        #attention = [batch size, n heads, trg len, src len]
        #return output, attention, mask_
        return output, attention


In [657]:
INPUT_DIM = len(dialog.vocab)
OUTPUT_DIM = len(reply.vocab)
HID_DIM = 256
ENC_LAYERS = 3
DEC_LAYERS = 3
ENC_HEADS = 8
DEC_HEADS = 8
ENC_PF_DIM = 512
DEC_PF_DIM = 512
ENC_DROPOUT = 0.10
DEC_DROPOUT = 0.10

enc = Encoder(INPUT_DIM, 
              HID_DIM, 
              ENC_LAYERS, 
              ENC_HEADS, 
              ENC_PF_DIM, 
              ENC_DROPOUT, 
              device)

dec = Decoder(OUTPUT_DIM, 
              HID_DIM, 
              DEC_LAYERS, 
              DEC_HEADS, 
              DEC_PF_DIM, 
              DEC_DROPOUT,
              device)

In [658]:
SRC_PAD_IDX = dialog.vocab.stoi[dialog.pad_token]
TRG_PAD_IDX = reply.vocab.stoi[reply.pad_token]

model = Seq2Seq(enc, dec, SRC_PAD_IDX, TRG_PAD_IDX, device).to(device)

In [659]:
def initialize_weights(m):
    if hasattr(m, 'weight') and m.weight.dim() > 1:
        nn.init.xavier_uniform_(m.weight.data)
model.apply(initialize_weights);

In [660]:
LEARNING_RATE = 0.0001

optimizer = torch.optim.Adam(model.parameters(), lr = LEARNING_RATE, weight_decay=1e-6)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer,factor=0.5, patience=5)


In [661]:
# def maskNLLLoss(inp, target, mask):
#     nTotal = mask.sum()
#     #print(f'target shape : {target.shape}, mask shape : {mask.shape}')
#     crossEntropy = -torch.log(torch.gather(inp, 1, target.view(-1, 1)).squeeze(1))
#     loss = crossEntropy.masked_select(mask.reshape(-1)).mean()
#     #loss = loss_[~torch.isnan(loss_)].mean()
#     loss = loss.to(device)
#     return loss, nTotal.item()

def maskNLLLoss(inp, target, mask):
    # print(inp.shape, target.shape, mask.sum())
    nTotal = mask.sum()
    crossEntropy = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX)
    loss = crossEntropy(inp, target)
    loss = loss.to(device)
    return loss, nTotal.item()

In [662]:
criterion = maskNLLLoss


In [663]:
from tqdm import tqdm
from functools import partial 

tqdm = partial(tqdm, position=0, leave=True)
def make_trg_mask(trg):
        
        #trg = [batch size, trg len]
        
        trg_pad_mask = (trg != TRG_PAD_IDX).unsqueeze(1).unsqueeze(2)
        
        #trg_pad_mask = [batch size, 1, 1, trg len]
        
        trg_len = trg.shape[1]
        
        trg_sub_mask = torch.tril(torch.ones((trg_len, trg_len), device = device)).bool()
        
        #trg_sub_mask = [trg len, trg len]
            
        trg_mask = trg_pad_mask & trg_sub_mask
        
        #trg_mask = [batch size, 1, trg len, trg len]
        
        return trg_mask

def train(model, iterator, optimizer, criterion, clip):
    
    model.train()
    
    n_totals = 0
    print_losses = []
    for i, batch in tqdm(enumerate(iterator), total=len(iterator)):
        # print(batch)
        
        loss = 0
        src = batch.dialog.permute(1, 0)
        trg = batch.reply.permute(1, 0)
        trg_mask = make_trg_mask(trg)
        optimizer.zero_grad()
        
        output, _ = model(src, trg[:,:-1])
                
        #output = [batch size, trg len - 1, output dim]
        #trg = [batch size, trg len]
            
        output_dim = output.shape[-1]
            
        output = output.contiguous().view(-1, output_dim)
        trg = trg[:,1:].contiguous().view(-1)
                
        #output = [batch size * trg len - 1, output dim]
        #trg = [batch size * trg len - 1]
        mask_loss, nTotal = criterion(output, trg, trg_mask)
        
        mask_loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        optimizer.step()
       
          
        print_losses.append(mask_loss.item() * nTotal)
        n_totals += nTotal


        
    return sum(print_losses) / n_totals

def evaluate(model, iterator, criterion):
    
    model.eval()
    
    n_totals = 0
    print_losses = []
    
    with torch.no_grad():
    
        for i, batch in tqdm(enumerate(iterator), total=len(iterator)):

            src = batch.dialog.permute(1, 0)
            trg = batch.reply.permute(1, 0)
            trg_mask = make_trg_mask(trg)

            output, _ = model(src, trg[:,:-1])
            
            #output = [batch size, trg len - 1, output dim]
            #trg = [batch size, trg len]
            
            output_dim = output.shape[-1]
            
            output = output.contiguous().view(-1, output_dim)
            trg = trg[:,1:].contiguous().view(-1)
            
            #output = [batch size * trg len - 1, output dim]
            #trg = [batch size * trg len - 1]
            
            mask_loss, nTotal = criterion(output, trg, trg_mask)

            print_losses.append(mask_loss.item() * nTotal)
            n_totals += nTotal


        
    return sum(print_losses) / n_totals

In [664]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [665]:
N_EPOCHS = 25
CLIP = 1
best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    print(f'LR before train() :- {optimizer.param_groups[0]["lr"]}\n')
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    print(f'LR after train():- {optimizer.param_groups[0]["lr"]}\n')
    
    valid_loss = evaluate(model, test_iterator, criterion)
    if scheduler:
      scheduler.step(valid_loss)
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    # if valid_loss < best_valid_loss:
    #     best_valid_loss = valid_loss
    #     torch.save(model.state_dict(), 'tut6-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

  1%|▏         | 1/68 [00:00<00:08,  8.35it/s]

LR before train() :- 0.0001



100%|██████████| 68/68 [00:04<00:00, 14.96it/s]
 15%|█▌        | 7/46 [00:00<00:00, 61.60it/s]

LR after train():- 0.0001



100%|██████████| 46/46 [00:00<00:00, 63.34it/s]
  3%|▎         | 2/68 [00:00<00:04, 14.14it/s]

Epoch: 01 | Time: 0m 5s
	Train Loss: 4.044 | Train PPL:  57.033
	 Val. Loss: 3.109 |  Val. PPL:  22.389
LR before train() :- 0.0001



100%|██████████| 68/68 [00:04<00:00, 15.53it/s]
 15%|█▌        | 7/46 [00:00<00:00, 62.79it/s]

LR after train():- 0.0001



100%|██████████| 46/46 [00:00<00:00, 63.58it/s]
  3%|▎         | 2/68 [00:00<00:04, 14.11it/s]

Epoch: 02 | Time: 0m 5s
	Train Loss: 2.857 | Train PPL:  17.408
	 Val. Loss: 2.783 |  Val. PPL:  16.175
LR before train() :- 0.0001



100%|██████████| 68/68 [00:04<00:00, 15.56it/s]
 15%|█▌        | 7/46 [00:00<00:00, 62.87it/s]

LR after train():- 0.0001



100%|██████████| 46/46 [00:00<00:00, 63.77it/s]
  3%|▎         | 2/68 [00:00<00:04, 14.07it/s]

Epoch: 03 | Time: 0m 5s
	Train Loss: 2.633 | Train PPL:  13.909
	 Val. Loss: 2.666 |  Val. PPL:  14.381
LR before train() :- 0.0001



100%|██████████| 68/68 [00:04<00:00, 15.58it/s]
 15%|█▌        | 7/46 [00:00<00:00, 62.22it/s]

LR after train():- 0.0001



100%|██████████| 46/46 [00:00<00:00, 62.71it/s]
  3%|▎         | 2/68 [00:00<00:04, 13.57it/s]

Epoch: 04 | Time: 0m 5s
	Train Loss: 2.525 | Train PPL:  12.494
	 Val. Loss: 2.603 |  Val. PPL:  13.501
LR before train() :- 0.0001



100%|██████████| 68/68 [00:04<00:00, 15.50it/s]
 15%|█▌        | 7/46 [00:00<00:00, 62.57it/s]

LR after train():- 0.0001



100%|██████████| 46/46 [00:00<00:00, 62.91it/s]
  3%|▎         | 2/68 [00:00<00:04, 14.34it/s]

Epoch: 05 | Time: 0m 5s
	Train Loss: 2.454 | Train PPL:  11.630
	 Val. Loss: 2.554 |  Val. PPL:  12.859
LR before train() :- 0.0001



100%|██████████| 68/68 [00:04<00:00, 15.57it/s]
 13%|█▎        | 6/46 [00:00<00:00, 58.45it/s]

LR after train():- 0.0001



100%|██████████| 46/46 [00:00<00:00, 62.14it/s]
  3%|▎         | 2/68 [00:00<00:04, 14.15it/s]

Epoch: 06 | Time: 0m 5s
	Train Loss: 2.398 | Train PPL:  10.999
	 Val. Loss: 2.516 |  Val. PPL:  12.383
LR before train() :- 0.0001



100%|██████████| 68/68 [00:04<00:00, 15.56it/s]
 15%|█▌        | 7/46 [00:00<00:00, 62.41it/s]

LR after train():- 0.0001



100%|██████████| 46/46 [00:00<00:00, 62.29it/s]
  3%|▎         | 2/68 [00:00<00:04, 14.17it/s]

Epoch: 07 | Time: 0m 5s
	Train Loss: 2.350 | Train PPL:  10.484
	 Val. Loss: 2.488 |  Val. PPL:  12.036
LR before train() :- 0.0001



100%|██████████| 68/68 [00:04<00:00, 15.62it/s]
 15%|█▌        | 7/46 [00:00<00:00, 61.79it/s]

LR after train():- 0.0001



100%|██████████| 46/46 [00:00<00:00, 62.20it/s]
  3%|▎         | 2/68 [00:00<00:04, 13.48it/s]

Epoch: 08 | Time: 0m 5s
	Train Loss: 2.309 | Train PPL:  10.061
	 Val. Loss: 2.465 |  Val. PPL:  11.764
LR before train() :- 0.0001



100%|██████████| 68/68 [00:04<00:00, 15.56it/s]
 13%|█▎        | 6/46 [00:00<00:00, 58.81it/s]

LR after train():- 0.0001



100%|██████████| 46/46 [00:00<00:00, 61.48it/s]
  3%|▎         | 2/68 [00:00<00:04, 13.81it/s]

Epoch: 09 | Time: 0m 5s
	Train Loss: 2.269 | Train PPL:   9.674
	 Val. Loss: 2.448 |  Val. PPL:  11.560
LR before train() :- 0.0001



100%|██████████| 68/68 [00:04<00:00, 15.54it/s]
 15%|█▌        | 7/46 [00:00<00:00, 62.25it/s]

LR after train():- 0.0001



100%|██████████| 46/46 [00:00<00:00, 62.01it/s]
  3%|▎         | 2/68 [00:00<00:04, 14.17it/s]

Epoch: 10 | Time: 0m 5s
	Train Loss: 2.236 | Train PPL:   9.359
	 Val. Loss: 2.432 |  Val. PPL:  11.378
LR before train() :- 0.0001



100%|██████████| 68/68 [00:04<00:00, 15.54it/s]
 15%|█▌        | 7/46 [00:00<00:00, 61.30it/s]

LR after train():- 0.0001



100%|██████████| 46/46 [00:00<00:00, 62.08it/s]
  3%|▎         | 2/68 [00:00<00:04, 14.36it/s]

Epoch: 11 | Time: 0m 5s
	Train Loss: 2.206 | Train PPL:   9.079
	 Val. Loss: 2.422 |  Val. PPL:  11.271
LR before train() :- 0.0001



100%|██████████| 68/68 [00:04<00:00, 15.51it/s]
 15%|█▌        | 7/46 [00:00<00:00, 61.93it/s]

LR after train():- 0.0001



100%|██████████| 46/46 [00:00<00:00, 62.67it/s]
  3%|▎         | 2/68 [00:00<00:04, 14.01it/s]

Epoch: 12 | Time: 0m 5s
	Train Loss: 2.177 | Train PPL:   8.820
	 Val. Loss: 2.416 |  Val. PPL:  11.198
LR before train() :- 0.0001



100%|██████████| 68/68 [00:04<00:00, 15.42it/s]
 13%|█▎        | 6/46 [00:00<00:00, 59.74it/s]

LR after train():- 0.0001



100%|██████████| 46/46 [00:00<00:00, 61.73it/s]
  3%|▎         | 2/68 [00:00<00:04, 13.64it/s]

Epoch: 13 | Time: 0m 5s
	Train Loss: 2.145 | Train PPL:   8.546
	 Val. Loss: 2.409 |  Val. PPL:  11.124
LR before train() :- 0.0001



100%|██████████| 68/68 [00:04<00:00, 15.35it/s]
 13%|█▎        | 6/46 [00:00<00:00, 59.89it/s]

LR after train():- 0.0001



100%|██████████| 46/46 [00:00<00:00, 61.64it/s]
  3%|▎         | 2/68 [00:00<00:04, 13.87it/s]

Epoch: 14 | Time: 0m 5s
	Train Loss: 2.121 | Train PPL:   8.341
	 Val. Loss: 2.405 |  Val. PPL:  11.080
LR before train() :- 0.0001



100%|██████████| 68/68 [00:04<00:00, 15.46it/s]
 13%|█▎        | 6/46 [00:00<00:00, 58.92it/s]

LR after train():- 0.0001



100%|██████████| 46/46 [00:00<00:00, 61.80it/s]
  3%|▎         | 2/68 [00:00<00:04, 14.05it/s]

Epoch: 15 | Time: 0m 5s
	Train Loss: 2.098 | Train PPL:   8.153
	 Val. Loss: 2.400 |  Val. PPL:  11.019
LR before train() :- 0.0001



100%|██████████| 68/68 [00:04<00:00, 15.32it/s]
 13%|█▎        | 6/46 [00:00<00:00, 58.65it/s]

LR after train():- 0.0001



100%|██████████| 46/46 [00:00<00:00, 61.80it/s]
  3%|▎         | 2/68 [00:00<00:04, 14.07it/s]

Epoch: 16 | Time: 0m 5s
	Train Loss: 2.074 | Train PPL:   7.953
	 Val. Loss: 2.401 |  Val. PPL:  11.033
LR before train() :- 0.0001



100%|██████████| 68/68 [00:04<00:00, 15.34it/s]
 15%|█▌        | 7/46 [00:00<00:00, 60.95it/s]

LR after train():- 0.0001



100%|██████████| 46/46 [00:00<00:00, 61.69it/s]
  3%|▎         | 2/68 [00:00<00:04, 13.55it/s]

Epoch: 17 | Time: 0m 5s
	Train Loss: 2.049 | Train PPL:   7.760
	 Val. Loss: 2.397 |  Val. PPL:  10.992
LR before train() :- 0.0001



100%|██████████| 68/68 [00:04<00:00, 15.19it/s]
 13%|█▎        | 6/46 [00:00<00:00, 58.99it/s]

LR after train():- 0.0001



100%|██████████| 46/46 [00:00<00:00, 61.40it/s]
  3%|▎         | 2/68 [00:00<00:04, 14.21it/s]

Epoch: 18 | Time: 0m 5s
	Train Loss: 2.027 | Train PPL:   7.594
	 Val. Loss: 2.396 |  Val. PPL:  10.985
LR before train() :- 0.0001



100%|██████████| 68/68 [00:04<00:00, 15.43it/s]
 15%|█▌        | 7/46 [00:00<00:00, 61.03it/s]

LR after train():- 0.0001



100%|██████████| 46/46 [00:00<00:00, 61.52it/s]
  3%|▎         | 2/68 [00:00<00:04, 13.74it/s]

Epoch: 19 | Time: 0m 5s
	Train Loss: 2.005 | Train PPL:   7.428
	 Val. Loss: 2.400 |  Val. PPL:  11.020
LR before train() :- 0.0001



100%|██████████| 68/68 [00:04<00:00, 15.29it/s]
 13%|█▎        | 6/46 [00:00<00:00, 59.43it/s]

LR after train():- 0.0001



100%|██████████| 46/46 [00:00<00:00, 61.68it/s]
  3%|▎         | 2/68 [00:00<00:04, 14.14it/s]

Epoch: 20 | Time: 0m 5s
	Train Loss: 1.986 | Train PPL:   7.286
	 Val. Loss: 2.404 |  Val. PPL:  11.072
LR before train() :- 0.0001



100%|██████████| 68/68 [00:04<00:00, 15.30it/s]
 13%|█▎        | 6/46 [00:00<00:00, 59.54it/s]

LR after train():- 0.0001



100%|██████████| 46/46 [00:00<00:00, 61.22it/s]
  3%|▎         | 2/68 [00:00<00:04, 13.79it/s]

Epoch: 21 | Time: 0m 5s
	Train Loss: 1.965 | Train PPL:   7.132
	 Val. Loss: 2.405 |  Val. PPL:  11.077
LR before train() :- 0.0001



100%|██████████| 68/68 [00:04<00:00, 15.24it/s]
 13%|█▎        | 6/46 [00:00<00:00, 55.94it/s]

LR after train():- 0.0001



100%|██████████| 46/46 [00:00<00:00, 60.43it/s]
  3%|▎         | 2/68 [00:00<00:04, 13.79it/s]

Epoch: 22 | Time: 0m 5s
	Train Loss: 1.946 | Train PPL:   6.998
	 Val. Loss: 2.407 |  Val. PPL:  11.104
LR before train() :- 0.0001



100%|██████████| 68/68 [00:04<00:00, 15.23it/s]
 13%|█▎        | 6/46 [00:00<00:00, 59.19it/s]

LR after train():- 0.0001



100%|██████████| 46/46 [00:00<00:00, 61.13it/s]
  3%|▎         | 2/68 [00:00<00:04, 13.99it/s]

Epoch: 23 | Time: 0m 5s
	Train Loss: 1.926 | Train PPL:   6.862
	 Val. Loss: 2.414 |  Val. PPL:  11.176
LR before train() :- 0.0001



100%|██████████| 68/68 [00:04<00:00, 15.14it/s]
 13%|█▎        | 6/46 [00:00<00:00, 59.62it/s]

LR after train():- 0.0001



100%|██████████| 46/46 [00:00<00:00, 60.45it/s]
  3%|▎         | 2/68 [00:00<00:04, 13.86it/s]

Epoch: 24 | Time: 0m 5s
	Train Loss: 1.907 | Train PPL:   6.731
	 Val. Loss: 2.420 |  Val. PPL:  11.250
LR before train() :- 5e-05



100%|██████████| 68/68 [00:04<00:00, 15.14it/s]
 15%|█▌        | 7/46 [00:00<00:00, 61.05it/s]

LR after train():- 5e-05



100%|██████████| 46/46 [00:00<00:00, 61.51it/s]

Epoch: 25 | Time: 0m 5s
	Train Loss: 1.884 | Train PPL:   6.577
	 Val. Loss: 2.422 |  Val. PPL:  11.263



