<a href="https://colab.research.google.com/github/akashe/NLP/blob/main/Attention_is_all_you_need.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[The paper that changed NLP.](https://arxiv.org/pdf/1706.03762.pdf)

Before the Transformer, language models relied on LSTMs and GRUs. Both set many state-of-the-art results, but Transformers took language models to another level. Transformers have a few advantages over traditional RNNs:

1. Longer dependencies: LSTMs struggle with very long sequences. [Explanation](https://akashe.io/blog/2020/12/03/rnn-lstm-gru-and-attention/#how-LSTM-solves-vanishing-exploding-gradients). Transformers don't unroll in time; every token can attend directly to every other token, so the reach of a dependency is limited only by the max_len of the sequence. Max_len can be 100 or 10000.
2. Faster computation: Transformers are highly parallelizable. LSTMs have to unroll along the sequence dimension, which limits how fast you can train an LSTM.

What transformers lack:

1. Notion of position: Transformers have no built-in notion of sequence order, so they are given positional embeddings as an additional input to convey the sequential nature of the data.
2. They don't have an internal state. [Paper](https://arxiv.org/pdf/2002.09402.pdf)


The task:

We will build and train a transformer architecture to translate from German to English.

Note: The code is heavily inspired by other sources.

Import libs

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim

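# Note: Field and BucketIterator belong to the legacy torchtext API
# (moved to torchtext.legacy in torchtext 0.9 and removed in 0.12).
import torchtext
from torchtext.datasets import Multi30k
from torchtext.data import Field, BucketIterator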

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

import spacy
import numpy as np

import random
import math
import time

Set Seed

In [2]:
SEED = 1007

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

Download spacy English and German models

In [3]:
!python -m spacy download en
!python -m spacy download de

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.6/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.6/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')
✔ Download and installation successful
You can now load the model via spacy.load('de_core_news_sm')
✔ Linking successful
/usr/local/lib/python3.6/dist-packages/de_core_news_sm -->
/usr/local/lib/python3.6/dist-packages/spacy/data/de
You can now load the model via spacy.load('de')


Load the models

In [4]:
spacy_de = spacy.load('de')
spacy_en = spacy.load('en')

Define tokenizers

In [5]:
def tokenize_de(text):
    """
    Tokenizes German text from a string into a list of strings
    """
    return [tok.text for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
    """
    Tokenizes English text from a string into a list of strings
    """
    return [tok.text for tok in spacy_en.tokenizer(text)]

Define source and target fields

In [6]:
SRC = Field(tokenize = tokenize_de, 
            init_token = '<sos>', 
            eos_token = '<eos>', 
            lower = True, 
            batch_first = True)

TRG = Field(tokenize = tokenize_en, 
            init_token = '<sos>', 
            eos_token = '<eos>', 
            lower = True, 
            batch_first = True)

Download the data

In [7]:
train_data, valid_data, test_data = Multi30k.splits(exts = ('.de', '.en'), 
                                                    fields = (SRC, TRG))

Build vocab

In [8]:
SRC.build_vocab(train_data, min_freq = 2)
TRG.build_vocab(train_data, min_freq = 2)

Set device

In [9]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

Build iterators

In [10]:
BATCH_SIZE = 128

train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data), 
     batch_size = BATCH_SIZE,
     device = device)

## Main Architecture:

We need:

1. Encoder
2. Decoder
3. Seq2seq
4. EncoderLayer
5. DecoderLayer
6. MultiHeadedAttentionComponent (with mask)
7. FeedForwardComponent
8. PositionalEncodingComponent

![](https://raw.githubusercontent.com/bentrevett/pytorch-seq2seq/9479fcb532214ad26fd4bda9fcf081a05e1aaf4e/assets/transformer1.png)

#### Positional Encoding

Positional information is given using sine and cosine functions of different frequencies.

$PE_{(pos,2i)} = \sin(pos / 10000^{2i/d_{\text{model}}})$

$PE_{(pos,2i+1)} = \cos(pos / 10000^{2i/d_{\text{model}}})$

where $i$ runs from 0 to $d_{\text{model}}/2 - 1$ (d_model is hid_dim in the code) and pos is the position of the token in the sequence.
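As a sanity check on the formula, here is a minimal sketch in plain Python (no PyTorch) that builds the table for a tiny model. The function name `positional_encoding` and the sizes are illustrative assumptions, not part of the notebook's code. It also checks that computing the frequency in log space gives the same value as the direct formula.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    assert d_model % 2 == 0, "d_model must be even so sin/cos columns pair up"
    table = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(d_model // 2):
            # Frequency of the i-th sin/cos pair: 1 / 10000^(2i / d_model)
            freq = 1.0 / (10000 ** (2 * i / d_model))
            table[pos][2 * i] = math.sin(pos * freq)      # even columns
            table[pos][2 * i + 1] = math.cos(pos * freq)  # odd columns
    return table

pe = positional_encoding(max_len=4, d_model=4)

# The same frequency computed in log space: exp(-2i * ln(10000) / d_model)
log_space_freq = math.exp(-2 * math.log(10000.0) / 4)  # i = 1, d_model = 4
direct_freq = 1.0 / (10000 ** (2 / 4))
```

Position 0 always encodes to alternating 0 and 1 (sin(0) and cos(0)), and each later column pair oscillates more slowly than the one before it.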


In [11]:
class PositionalEncodingComponent(nn.Module):
  '''
  Class to encode positional information to tokens.
  

  '''
  def __init__(self,hid_dim,device,dropout=0.2,max_len=5000):
    super().__init__()

    assert hid_dim%2==0 # If not, it will result error in allocation to positional_encodings[:,1::2] later

    self.dropout = nn.Dropout(dropout)

    self.positional_encodings = torch.zeros(max_len,hid_dim)

    pos = torch.arange(0,max_len).unsqueeze(1) # pos : [max_len,1]
    div_term  = torch.exp(-torch.arange(0,hid_dim,2)*math.log(10000.0)/hid_dim) # Calculating value of 1/(10000^(2i/hid_dim)) in log space and then exponentiating it
    # div_term: [hid_dim//2]

    self.positional_encodings[:,0::2] = torch.sin(pos*div_term) # pos*div_term [max_len,hid_dim//2]
    self.positional_encodings[:,1::2] = torch.cos(pos*div_term) 

    self.positional_encodings = self.positional_encodings.unsqueeze(0) # To account for batch_size in inputs

    self.device = device

  def forward(self,x):
    x = x + self.positional_encodings[:,:x.size(1)].detach().to(self.device)
    return self.dropout(x)


#### Pointwise Feed Forward:
$\mathrm{FFN}(x)=\max(0, xW_1 + b_1) W_2 + b_2$
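To make "pointwise" concrete, the following sketch applies the formula by hand in plain Python. The helper names (`relu`, `linear`, `ffn`) and the toy weights are assumptions chosen for illustration; the point is that the same weights are applied independently to every position in the sequence.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def linear(x, weight, bias):
    """x: [in_dim]; weight: [in_dim][out_dim]; bias: [out_dim]."""
    return [sum(x[i] * weight[i][j] for i in range(len(x))) + bias[j]
            for j in range(len(bias))]

def ffn(x, w1, b1, w2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, for a single position's vector
    return linear(relu(linear(x, w1, b1)), w2, b2)

# Toy sizes (assumed): hid_dim = 2, pf_dim = 3
w1 = [[1.0, -1.0, 0.5],
      [0.0,  2.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0]]
b2 = [0.5, -0.5]

sequence = [[1.0, 1.0], [2.0, 0.0]]                    # two positions
outputs = [ffn(x, w1, b1, w2, b2) for x in sequence]   # same weights at each position
# outputs == [[1.5, 0.5], [3.5, 0.5]]
```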

In [12]:
class FeedForwardComponent(nn.Module):
  '''
  Class for pointwise feed forward connections
  '''
  def __init__(self,hid_dim,pf_dim,dropout):
    super().__init__()

    self.dropout = nn.Dropout(dropout)

    self.fc1 = nn.Linear(hid_dim,pf_dim)
    self.fc2 = nn.Linear(pf_dim,hid_dim)

  def forward(self,x):

    # x : [batch_size,seq_len,hid_dim]
    x = self.dropout(torch.relu(self.fc1(x)))

    # x : [batch_size,seq_len,pf_dim]
    x = self.fc2(x)

    # x : [batch_size,seq_len,hid_dim]
    return x

#### Attention
In transformers, we use self-attention, i.e. the sequence attends to itself to learn which of its own parts are more important. [Remember for attention](https://akashe.io/blog/2020/12/03/rnn-lstm-gru-and-attention/#Attention), to get attention of $x$ over $y$ we find the relative importance $\alpha_{x,y}$ using a score function and later use $\alpha_{x,y}$ to pull the relevant parts from $y$.


In self-attention, $x$ and $y$ are the same. Now some nomenclature:

1. Query: What we compute attention for; it decides where to look. Similar to $x$ above.
2. Key: What the query is scored against to get $\alpha_{x,y}$. Similar to $y$ above.
3. Value: What we combine into the final vector using the attention weights. Similar to $y$ in the expression $\sum \alpha_{x,y}y$.


In self-attention, Query, Key and Value all come from the same input vectors. In the implementation, the Q, K and V representations are learned with separate linear transformations, but the input to all three transformations is the same.

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(\frac{QK^T}{\sqrt{d_k}})V$
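The formula can be traced by hand with a small plain-Python sketch. The function names and the toy matrices below are illustrative assumptions; there are no learned projections, masks, or batches here, only the score, softmax, and weighted-sum steps.

```python
import math

def softmax(row):
    top = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Q: [q_len][d_k], K: [k_len][d_k], V: [k_len][d_v] as nested lists."""
    d_k = len(K[0])
    out, weights = [], []
    for q in Q:
        # Score this query against every key, scaled by sqrt(d_k)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        alpha = softmax(scores)
        weights.append(alpha)
        # Weighted sum of the value vectors
        out.append([sum(a * v[j] for a, v in zip(alpha, V))
                    for j in range(len(V[0]))])
    return out, weights

# One query that matches the first key far better than the second
Q = [[10.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out, alpha = attention(Q, K, V)
```

Because the query lines up with the first key, almost all of the weight lands on the first value vector, so `out[0]` is close to `[1.0, 2.0]`.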



#### Multiheaded Attention
Instead of calculating attention once over the full src/trg vector, we divide it into multiple smaller heads. We transform them into separate vectors using learnable matrices ($W_i^Q,W_i^K,W_i^V$), perform attention over these transformations, and concatenate the results. A final transformation ($W^O$) gives the output.

 $\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head_1}, ...,
\mathrm{head_h})W^O    \\
    \text{where}~\mathrm{head_i} = \mathrm{Attention}(QW^Q_i, KW^K_i, VW^V_i)$
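The splitting and concatenation steps are pure reshaping. The sketch below shows them in plain Python for a single sequence; `split_heads` and `merge_heads` are hypothetical helper names, and the learned projections are left out. Each head receives a contiguous slice of the feature axis, which is the same grouping a reshape of `hid_dim` into `(n_heads, head_dim)` produces.

```python
def split_heads(x, n_heads):
    """x: [seq_len][hid_dim] -> [n_heads][seq_len][head_dim]."""
    hid_dim = len(x[0])
    assert hid_dim % n_heads == 0, "hid_dim must be divisible by n_heads"
    head_dim = hid_dim // n_heads
    return [[tok[h * head_dim:(h + 1) * head_dim] for tok in x]
            for h in range(n_heads)]

def merge_heads(heads):
    """Inverse of split_heads: concatenate heads back along the feature axis."""
    seq_len = len(heads[0])
    return [[v for head in heads for v in head[t]] for t in range(seq_len)]

# Two tokens with hid_dim = 4, split into 2 heads of head_dim = 2
x = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
heads = split_heads(x, n_heads=2)
# heads == [[[1, 2], [5, 6]], [[3, 4], [7, 8]]]
```

Attention then runs separately inside each head, and `merge_heads` restores the original `[seq_len][hid_dim]` layout before the final output transformation.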


![](https://raw.githubusercontent.com/bentrevett/pytorch-seq2seq/9479fcb532214ad26fd4bda9fcf081a05e1aaf4e/assets/transformer-attention.png)

In [13]:
class MultiHeadedAttentionComponent(nn.Module):
  '''
  Multiheaded attention component. This implementation also supports a mask.
  The mask is needed because, in the Decoder, we don't want the attention
  mechanism to get information from future tokens.
  '''
  def __init__(self,hid_dim, n_heads, dropout, device):
    super().__init__()

    assert hid_dim % n_heads == 0 # Since we split hid_dims into n_heads

    self.hid_dim = hid_dim
    self.n_heads = n_heads # no of heads in 'multiheaded' attention
    self.head_dim = hid_dim//n_heads # dims of each head

    # Transformation from source vector to query vector
    self.fc_q = nn.Linear(hid_dim,hid_dim)

    # Transformation from source vector to key vector
    self.fc_k = nn.Linear(hid_dim,hid_dim)

    # Transformation from source vector to value vector
    self.fc_v = nn.Linear(hid_dim,hid_dim)

    self.fc_o = nn.Linear(hid_dim,hid_dim)

    self.dropout = nn.Dropout(dropout)

    # Used in self attention for smoother gradients
    self.scale = torch.sqrt(torch.FloatTensor([self.head_dim])).to(device)

  def forward(self,query,key,value,mask=None):

    #query : [batch_size, query_len, hid_dim]
    #key : [batch_size, key_len, hid_dim]
    #value : [batch_size, value_len, hid_dim]

    batch_size = query.shape[0]

    # Transforming query, key, values
    Q = self.fc_q(query)
    K = self.fc_k(key)
    V = self.fc_v(value)

    #Q : [batch_size, query_len, hid_dim]
    #K : [batch_size, key_len, hid_dim]
    #V : [batch_size, value_len,hid_dim]

    # Changing shapes to accommodate n_heads information
    Q = Q.view(batch_size, -1, self.n_heads, self.head_dim).permute(0, 2, 1, 3)
    K = K.view(batch_size, -1, self.n_heads, self.head_dim).permute(0, 2, 1, 3)
    V = V.view(batch_size, -1, self.n_heads, self.head_dim).permute(0, 2, 1, 3)

    #Q : [batch_size, n_heads, query_len, head_dim]
    #K : [batch_size, n_heads, key_len, head_dim]
    #V : [batch_size, n_heads, value_len, head_dim]

    # Calculating alpha
    score = torch.matmul(Q,K.permute(0,1,3,2))/self.scale
    # score : [batch_size, n_heads, query_len, key_len]

    if mask is not None:
      score = score.masked_fill(mask==0,-1e10)

    alpha = torch.softmax(score,dim=-1)
    # alpha : [batch_size, n_heads, query_len, key_len]

    # Get the final self-attention  vector
    x = torch.matmul(self.dropout(alpha),V)
    # x : [batch_size, n_heads, query_len, head_dim]

    # Reshaping self attention vector to concatenate
    x = x.permute(0,2,1,3).contiguous()
    # x : [batch_size, query_len, n_heads, head_dim]

    x = x.view(batch_size,-1,self.hid_dim)
    # x: [batch_size, query_len, hid_dim]

    # Transforming concatenated outputs 
    x = self.fc_o(x)
    #x : [batch_size, query_len, hid_dim] 

    return x, alpha
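The effect of `masked_fill(mask==0, -1e10)` before the softmax can be checked on a single row of scores. This is a plain-Python sketch with an assumed helper name (`masked_softmax`); it mirrors the masking step in the class above but is not the notebook's implementation.

```python
import math

def masked_softmax(scores, mask, fill=-1e10):
    """scores: list of floats; mask: list of 0/1 flags (0 = hide this position)."""
    filled = [s if m else fill for s, m in zip(scores, mask)]
    top = max(filled)
    exps = [math.exp(v - top) for v in filled]  # exp of a huge negative underflows to 0.0
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 3.0]
weights = masked_softmax(scores, mask=[1, 1, 0])  # third key is masked out
```

Even though the third score is the largest, its weight becomes 0 and the remaining weights are renormalised over the visible positions, so the masked position contributes nothing to the attention output.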

In [14]:
class EncoderLayer(nn.Module):
  '''
  Operations of a single layer in an Encoder. An Encoder employs multiple such layers. Each layer contains:
  1) multihead attention, followed by
  2) LayerNorm of addition of multihead attention output and input to the layer, followed by
  3) FeedForward connections, followed by
  4) LayerNorm of addition of FeedForward outputs and output of previous layerNorm.
  '''
  def __init__(self, hid_dim,n_heads,pf_dim,dropout,device):
    super().__init__()
    
    self.self_attn_layer_norm = nn.LayerNorm(hid_dim) # Layer norm after self-attention
    self.ff_layer_norm = nn.LayerNorm(hid_dim) # Layer norm after FeedForward component

    self.self_attention = MultiHeadedAttentionComponent(hid_dim,n_heads,dropout,device)
    self.feed_forward = FeedForwardComponent(hid_dim,pf_dim,dropout)

    self.dropout = nn.Dropout(dropout)
    
  def forward(self,src,src_mask):
    
    # src : [batch_size, src_len, hid_dim]
    # src_mask : [batch_size, 1, 1, src_len]

    # get self-attention
    _src, _ = self.self_attention(src,src,src,src_mask)

    # LayerNorm after dropout
    src = self.self_attn_layer_norm(src + self.dropout(_src))
    # src : [batch_size, src_len, hid_dim]

    # FeedForward
    _src = self.feed_forward(src)

    # layerNorm after dropout
    src = self.ff_layer_norm(src + self.dropout(_src))
    # src: [batch_size, src_len, hid_dim]

    return src
    

In [15]:
class DecoderLayer(nn.Module):
  '''
  Operations of a single layer in a Decoder. A Decoder employs multiple such layers. Each layer contains:
  1) masked decoder self attention, followed by
  2) LayerNorm of addition of previous attention output and input to the layer, followed by
  3) encoder attention (attention over the encoder outputs), followed by
  4) LayerNorm of addition of result of encoder attention and its input, followed by
  5) FeedForward connections, followed by
  6) LayerNorm of addition of Feedforward results and its input.
  '''
  def __init__(self,hid_dim,n_heads,pf_dim,dropout,device):
    super().__init__()

    self.self_attn_layer_norm = nn.LayerNorm(hid_dim)
    self.enc_attn_layer_norm = nn.LayerNorm(hid_dim)
    self.ff_layer_norm = nn.LayerNorm(hid_dim)

    # decoder self attention
    self.self_attention = MultiHeadedAttentionComponent(hid_dim,n_heads,dropout,device)

    # encoder attention
    self.encoder_attention = MultiHeadedAttentionComponent(hid_dim,n_heads,dropout,device)

    # FeedForward
    self.feed_forward = FeedForwardComponent(hid_dim,pf_dim,dropout)

    self.dropout = nn.Dropout(dropout)

  def forward(self,trg, enc_src,trg_mask,src_mask):

    #trg : [batch_size, trg_len, hid_dim]
    #enc_src : [batch_size, src_len, hid_dim]
    #trg_mask : [batch_size, 1, trg_len, trg_len]
    #src_mask : [batch_size, 1, 1, src_len]

    '''
    Decoder self-attention
    trg_mask is to force decoder to look only into past tokens and not get information from future tokens.
    Since we apply mask before doing softmax, the final self attention vector gets no information from future tokens.
    '''
    _trg, _ = self.self_attention(trg,trg,trg,trg_mask)

    # LayerNorm and dropout with residual connection
    trg = self.self_attn_layer_norm(trg + self.dropout(_trg))
    # trg : [batch_size, trg_len, hid_dim]

    '''
    Encoder attention:
    Query: trg
    key: enc_src
    Value : enc_src
    Why? 
    The idea here is to extract information from encoder outputs. So we use the decoder self-attention output as a query to find important values from enc_src,
    and that is why we use src_mask: to avoid getting information from enc_src positions where the source token is equal to pad-id.
    After we get the necessary information from encoder outputs, we add it back to the decoder self-attention output.
    '''
    _trg, encoder_attn_alpha = self.encoder_attention(trg,enc_src,enc_src,src_mask)

    # LayerNorm , residual connection and dropout
    trg = self.enc_attn_layer_norm(trg + self.dropout(_trg))
    # trg : [ batch_size, trg_len, hid_dim]

    # Feed Forward
    _trg = self.feed_forward(trg)

    # LayerNorm, residual connection and dropout
    trg = self.ff_layer_norm(trg + self.dropout(_trg))

    return trg, encoder_attn_alpha
    

In [16]:
class Encoder(nn.Module):
  '''
  An encoder, creates token embeddings and position embeddings and passes them through multiple encoder layers
  '''
  def __init__(self,input_dim,hid_dim,n_layers,n_heads,pf_dim,dropout,device,max_length = 5000):
    super().__init__()
    self.device = device

    self.tok_embedding = nn.Embedding(input_dim,hid_dim)
    self.pos_embedding = PositionalEncodingComponent(hid_dim,device,dropout,max_length)

    # encoder layers
    self.layers = nn.ModuleList([EncoderLayer(hid_dim,n_heads,pf_dim,dropout,device) for _ in range(n_layers)])

    self.dropout = nn.Dropout(dropout)

    self.scale = torch.sqrt(torch.FloatTensor([hid_dim])).to(device)

  def forward(self,src,src_mask):

    # src : [batch_size, src_len]
    # src_mask : [batch_size,1,1,src_len]

    batch_size = src.shape[0]
    src_len = src.shape[1]

    tok_embeddings = self.tok_embedding(src)*self.scale

    # token plus position embeddings
    src  = self.pos_embedding(tok_embeddings)

    for layer in self.layers:
      src = layer(src,src_mask)
    # src : [batch_size, src_len, hid_dim]

    return src

In [17]:
class Decoder(nn.Module):
  '''
  A decoder: creates token embeddings and position embeddings and passes them through multiple decoder layers
  '''
  def __init__(self,output_dim,hid_dim,n_layers,n_heads,pf_dim,dropout,device,max_length= 5000):
    super().__init__()

    self.device = device

    self.tok_embedding = nn.Embedding(output_dim,hid_dim)
    self.pos_embedding = PositionalEncodingComponent(hid_dim,device,dropout,max_length)

    # decoder layers
    self.layers = nn.ModuleList([DecoderLayer(hid_dim,n_heads,pf_dim,dropout,device) for _ in range(n_layers)])

    # convert decoder outputs to real outputs
    self.fc_out = nn.Linear(hid_dim,output_dim)

    self.dropout = nn.Dropout(dropout)

    self.scale = torch.sqrt(torch.FloatTensor([hid_dim])).to(device)

  def forward(self, trg, enc_src,trg_mask,src_mask):
    
    #trg : [batch_size, trg_len]
    #enc_src : [batch_size, src_len, hid_dim]
    #trg_mask : [batch_size, 1, trg_len, trg_len]
    #src_mask : [batch_size, 1, 1, src_len]

    batch_size = trg.shape[0]
    trg_len = trg.shape[1]

    tok_embeddings = self.tok_embedding(trg)*self.scale

    # token plus pos embeddings
    trg = self.pos_embedding(tok_embeddings)
    # trg : [batch_size, trg_len, hid_dim]

    # Pass trg through decoder layers
    for layer in self.layers:
      trg, encoder_attention = layer(trg,enc_src,trg_mask,src_mask)
    
    # trg : [batch_size,trg_len,hid_dim]
    # encoder_attention :  [batch_size, n_head,trg_len, src_len]

    # Convert to outputs
    output = self.fc_out(trg)
    # output : [batch_size, trg_len, output_dim]
    
    return output, encoder_attention

In [18]:
class Seq2Seq(nn.Module):
  def __init__(self, encoder, decoder, src_pad_idx, trg_pad_idx, device):
    super().__init__()
    self.encoder = encoder
    self.decoder = decoder
    self.src_pad_idx = src_pad_idx
    self.trg_pad_idx = trg_pad_idx
    self.device = device

  def make_src_mask(self,src):
    # src : [batch_size, src_len]

    # Masking pad values
    src_mask = (src != self.src_pad_idx).unsqueeze(1).unsqueeze(2)
    # src_mask : [batch_size,1,1,src_len]

    return src_mask

  def make_trg_mask(self,trg):
    # trg : [batch_size, trg_len]

    # Masking pad values
    trg_pad_mask = (trg != self.trg_pad_idx).unsqueeze(1).unsqueeze(2)
    # trg_pad_mask : [batch_size,1,1, trg_len]

    # Masking future values
    trg_len = trg.shape[1]
    trg_sub_mask = torch.tril(torch.ones((trg_len,trg_len),device= self.device)).bool()
    # trg_sub_mask : [trg_len, trg_len]

    # combine both masks
    trg_mask = trg_pad_mask & trg_sub_mask
    # trg_mask = [batch_size,1,trg_len,trg_len]

    return trg_mask

  def forward(self,src,trg):

    # src : [batch_size, src_len]
    # trg : [batch_size, trg_len]

    src_mask = self.make_src_mask(src)
    trg_mask = self.make_trg_mask(trg)

    # src_mask : [ batch_size, 1,1,src_len]
    # trg_mask : [batch_size, 1, trg_len, trg_len]

    enc_src = self.encoder(src,src_mask)
    #enc_src : [batch_size, src_len, hid_dim]

    output, encoder_decoder_attention = self.decoder(trg,enc_src,trg_mask,src_mask)
    # output : [batch_size, trg_len, output_dim]
    # encoder_decoder_attention : [batch_size, n_heads, trg_len, src_len]

    return output, encoder_decoder_attention
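To see what the two masks look like, here is a plain-Python sketch for a single unbatched sequence. The pad index of 1 and the token ids are assumptions for illustration, and the functions reuse the method names above only to make the correspondence obvious.

```python
PAD = 1  # assumed pad index for this illustration

def make_src_mask(src, pad_idx=PAD):
    """src: [src_len] token ids -> [src_len] booleans, False at padding."""
    return [tok != pad_idx for tok in src]

def make_trg_mask(trg, pad_idx=PAD):
    """trg: [trg_len] token ids -> [trg_len][trg_len] booleans.

    Entry [i][j] is True when query position i may attend to key position j:
    j must not lie in the future (j <= i) and must not be padding.
    """
    n = len(trg)
    pad_mask = [tok != pad_idx for tok in trg]
    return [[(j <= i) and pad_mask[j] for j in range(n)] for i in range(n)]

src_mask = make_src_mask([2, 5, 3, PAD, PAD])
trg = [2, 7, 9, PAD]           # start token, two words, one pad token
trg_mask = make_trg_mask(trg)
# trg_mask rows:
# [True, False, False, False]
# [True, True,  False, False]
# [True, True,  True,  False]
# [True, True,  True,  False]
```

The lower-triangular pattern blocks future positions and the all-False last column blocks the pad token. In the class above the same combination comes from broadcasting a `[batch_size, 1, 1, trg_len]` pad mask against a `[trg_len, trg_len]` triangular mask.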

Initializing network

In [19]:
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
HID_DIM = 256
ENC_LAYERS = 3
DEC_LAYERS = 3
ENC_HEADS = 8
DEC_HEADS = 8
ENC_PF_DIM = 512
DEC_PF_DIM = 512
ENC_DROPOUT = 0.1
DEC_DROPOUT = 0.1

enc = Encoder(INPUT_DIM, 
              HID_DIM, 
              ENC_LAYERS, 
              ENC_HEADS, 
              ENC_PF_DIM, 
              ENC_DROPOUT, 
              device)

dec = Decoder(OUTPUT_DIM, 
              HID_DIM, 
              DEC_LAYERS, 
              DEC_HEADS, 
              DEC_PF_DIM, 
              DEC_DROPOUT, 
              device)

SRC_PAD_IDX = SRC.vocab.stoi[SRC.pad_token]
TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]

model = Seq2Seq(enc, dec, SRC_PAD_IDX, TRG_PAD_IDX, device).to(device)

Initialize weights

In [20]:
def initialize_weights(m):
    if hasattr(m, 'weight') and m.weight.dim() > 1:
        nn.init.xavier_uniform_(m.weight.data)

model.apply(initialize_weights);
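For reference, Xavier (Glorot) uniform initialisation samples each weight from U(-a, a) with a = sqrt(6 / (fan_in + fan_out)) at the default gain of 1. The sketch below reproduces that rule in plain Python; `xavier_uniform` is a hypothetical helper written for illustration, not the PyTorch function.

```python
import math
import random

def xavier_uniform(fan_in, fan_out, rng):
    """Sample a fan_out x fan_in matrix from U(-a, a), a = sqrt(6 / (fan_in + fan_out))."""
    bound = math.sqrt(6.0 / (fan_in + fan_out))
    weights = [[rng.uniform(-bound, bound) for _ in range(fan_in)]
               for _ in range(fan_out)]
    return weights, bound

rng = random.Random(0)  # fixed seed so the sketch is repeatable
weights, bound = xavier_uniform(fan_in=256, fan_out=256, rng=rng)
```

For a 256 x 256 layer the bound is sqrt(3)/16, roughly 0.108. Because `initialize_weights` only touches parameters with more than one dimension, it covers the linear and embedding weight matrices and leaves biases and LayerNorm weights at their defaults.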

Total model params

In [21]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 8,987,653 trainable parameters


Learning rate, criterion and optimizer

In [22]:
LEARNING_RATE = 0.0005

optimizer = torch.optim.Adam(model.parameters(), lr = LEARNING_RATE)

criterion = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX)
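What `ignore_index` does, and how the loss relates to the perplexity reported during training, can be sketched in plain Python. The `cross_entropy` helper, the pad index, and the toy logits are assumptions for illustration: padded targets are skipped and the mean is taken over the remaining positions only.

```python
import math

def cross_entropy(logits, targets, ignore_index):
    """Mean negative log-likelihood over targets that are not ignore_index.

    logits: [n][vocab] raw scores; targets: [n] class ids.
    """
    total, count = 0.0, 0
    for row, t in zip(logits, targets):
        if t == ignore_index:
            continue  # padding positions contribute nothing to the loss
        top = max(row)
        log_sum_exp = top + math.log(sum(math.exp(v - top) for v in row))
        total += log_sum_exp - row[t]
        count += 1
    return total / count

PAD = 1  # assumed pad index, for illustration only
logits = [[0.0, 0.0, 0.0, 0.0],   # uniform guess over 4 classes
          [0.0, 0.0, 0.0, 0.0],
          [9.0, 0.0, 0.0, 0.0]]   # ignored, because its target is PAD
targets = [2, 3, PAD]

loss = cross_entropy(logits, targets, ignore_index=PAD)
perplexity = math.exp(loss)
```

A uniform guess over 4 classes gives a loss of ln 4 and a perplexity of 4, which is why perplexity is often read as the effective number of choices the model is hesitating between.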

Train Loop

In [23]:
def train(model, iterator, optimizer, criterion, clip):
    
    model.train()
    
    epoch_loss = 0
    
    for i, batch in enumerate(iterator):
        
        src = batch.src
        trg = batch.trg
        
        optimizer.zero_grad()
        
        output, _ = model(src, trg[:,:-1])
                
        #output = [batch size, trg len - 1, output dim]
        #trg = [batch size, trg len]
            
        output_dim = output.shape[-1]
            
        output = output.contiguous().view(-1, output_dim)
        trg = trg[:,1:].contiguous().view(-1)
                
        #output = [batch size * (trg len - 1), output dim]
        #trg = [batch size * (trg len - 1)]
            
        loss = criterion(output, trg)
        
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        optimizer.step()
        
        epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)
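The `trg[:,:-1]` and `trg[:,1:]` slices implement teacher forcing: the decoder is fed the target shifted right and is asked to predict the next token at every position. A plain-Python sketch on one sentence (the function name and the example tokens are made up for illustration):

```python
def shift_for_teacher_forcing(trg):
    """Split one target sequence into decoder input and prediction labels."""
    decoder_input = trg[:-1]   # corresponds to trg[:, :-1] in the loop above
    labels = trg[1:]           # corresponds to trg[:, 1:]
    return decoder_input, labels

trg = ["<sos>", "a", "dog", "runs", "<eos>"]
decoder_input, labels = shift_for_teacher_forcing(trg)

# At step t the decoder may see decoder_input[:t + 1] and must predict labels[t]
pairs = [(decoder_input[:t + 1], labels[t]) for t in range(len(labels))]
```

The first pair asks the model to predict "a" from `<sos>` alone, and the last asks for `<eos>` after seeing the whole sentence. The target mask is what stops the model from peeking at later tokens while all steps are computed in parallel.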

Evaluate Loop

In [24]:
def evaluate(model, iterator, criterion):
    
    model.eval()
    
    epoch_loss = 0
    
    with torch.no_grad():
    
        for i, batch in enumerate(iterator):

            src = batch.src
            trg = batch.trg

            output, _ = model(src, trg[:,:-1])
            
            #output = [batch size, trg len - 1, output dim]
            #trg = [batch size, trg len]
            
            output_dim = output.shape[-1]
            
            output = output.contiguous().view(-1, output_dim)
            trg = trg[:,1:].contiguous().view(-1)
            
            #output = [batch size * (trg len - 1), output dim]
            #trg = [batch size * (trg len - 1)]
            
            loss = criterion(output, trg)

            epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

Time per epoch

In [25]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

Runner Loop

In [26]:
N_EPOCHS = 10
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut6-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Epoch: 01 | Time: 0m 18s
	Train Loss: 4.484 | Train PPL:  88.607
	 Val. Loss: 3.339 |  Val. PPL:  28.185
Epoch: 02 | Time: 0m 18s
	Train Loss: 3.150 | Train PPL:  23.338
	 Val. Loss: 2.557 |  Val. PPL:  12.901
Epoch: 03 | Time: 0m 19s
	Train Loss: 2.460 | Train PPL:  11.709
	 Val. Loss: 2.093 |  Val. PPL:   8.109
Epoch: 04 | Time: 0m 19s
	Train Loss: 2.036 | Train PPL:   7.663
	 Val. Loss: 1.833 |  Val. PPL:   6.253
Epoch: 05 | Time: 0m 19s
	Train Loss: 1.754 | Train PPL:   5.778
	 Val. Loss: 1.694 |  Val. PPL:   5.443
Epoch: 06 | Time: 0m 20s
	Train Loss: 1.554 | Train PPL:   4.730
	 Val. Loss: 1.645 |  Val. PPL:   5.179
Epoch: 07 | Time: 0m 20s
	Train Loss: 1.398 | Train PPL:   4.046
	 Val. Loss: 1.591 |  Val. PPL:   4.908
Epoch: 08 | Time: 0m 20s
	Train Loss: 1.271 | Train PPL:   3.565
	 Val. Loss: 1.557 |  Val. PPL:   4.746
Epoch: 09 | Time: 0m 19s
	Train Loss: 1.168 | Train PPL:   3.214
	 Val. Loss: 1.563 |  Val. PPL:   4.773
Epoch: 10 | Time: 0m 19s
	Train Loss: 1.076 | Train PPL