# Encoder-Only LLM Using PyTorch

### ➨ Python Class (opening notes)

Python classes offer several advantages.

1. Encapsulation: bundle related functions and data into a single unit.

2. Abstraction: abstract away implementation details and expose only the necessary interfaces to the outside world. 

3. Code Reusability: once you've defined a class, you can create multiple instances of that class (objects) with different data. This allows you to reuse the same code in different parts of your program or even in different programs altogether.

4. Inheritance: classes support inheritance, which allows you to create new classes that inherit attributes and methods from existing classes. This promotes code reuse and helps you build on existing functionality without having to reinvent the wheel.

A class is declared, for example, as
```python
class class_name(parent_1, ..., parent_n):
```
where `class_name` is the name of the class and `parent_1, ..., parent_n` are the parent classes it inherits from.

#### Important PyTorch note:
In PyTorch, a new class (e.g., a new network model) inherits from the class `nn.Module`, so we write
```python
class net(nn.Module):
```
Calling an instance of an `nn.Module` subclass invokes its method named `forward`, so we need to define this method. It represents the forward pass, and the backward pass for backpropagation is then derived automatically by autograd. We first create the model with `model = net(parameters)` (where the required `parameters` are those we specify in the `__init__` method of our new class `net`). After that, it is enough to call the instance, e.g. `output = model(input)`, to run the forward pass, unlike other classes, whose methods must be called by name.
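
To make the distinction between constructing a model and running its forward pass concrete, here is a minimal sketch. The toy class `TinyNet` and its layer sizes are hypothetical and are not part of the LLM built in this notebook.

```python
import torch
from torch import nn


class TinyNet(nn.Module):  # hypothetical toy model, for illustration only
    def __init__(self, in_features: int, out_features: int):
        super().__init__()  # lets nn.Module register the layer defined below
        self.linear = nn.Linear(in_features, out_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.linear(x))


model = TinyNet(4, 2)   # runs __init__ only; no forward pass yet
x = torch.randn(3, 4)   # a batch of 3 inputs with 4 features each
out = model(x)          # calling the instance dispatches to forward()
out.sum().backward()    # autograd derives the backward pass
print(out.shape, model.linear.weight.grad.shape)
```

Note that `forward` is never called by name: `model(x)` goes through `nn.Module.__call__`, which also runs any registered hooks.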

### Constructor (`__init__`), `self` and `super()`

##### Constructor
A *constructor method*, also known as a constructor, is a method in OOP languages that is automatically called when an instance of a class is created. It initializes the newly created object. In Python, the constructor is defined with `__init__`.
The constructor is often used to initialize instance variables or perform setup when creating an object. It lets you customize how objects are initialized when they are created from a class.

##### Self
In Python, `self` is a conventionally used name for the first parameter of instance methods in a class. When you create an instance of a class and call a method on that instance, Python automatically passes the instance as the first argument to the method. It's a way for the method to refer to the *specific* instance it's operating on.

##### Super
In Python, `super()` is a built-in function that is typically used to call methods defined in the superclass (parent class) within a subclass (child class).
Using `super()` allows for more maintainable and extensible code by ensuring that changes in inheritance hierarchy propagate properly.
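
The three notions above can be seen together in a small plain-Python sketch. The classes `Animal` and `Dog` are hypothetical names chosen only for this illustration.

```python
class Animal:  # hypothetical parent class
    def __init__(self, name: str):
        self.name = name  # instance variable, specific to each object

    def describe(self) -> str:
        return f"{self.name} is an animal"


class Dog(Animal):  # child class inheriting from Animal
    def __init__(self, name: str, tricks: int = 0):
        super().__init__(name)  # run the parent constructor first
        self.tricks = tricks

    def describe(self) -> str:
        # extend, rather than replace, the parent behaviour
        return super().describe() + f" that knows {self.tricks} tricks"


rex = Dog("Rex", tricks=2)  # __init__ is called automatically
print(rex.describe())       # prints: Rex is an animal that knows 2 tricks
print(Dog.describe(rex))    # same call, with `self` passed explicitly
```

The last two lines show what `self` stands for: `rex.describe()` is shorthand for `Dog.describe(rex)`.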

In [68]:
import math
import os
from tempfile import TemporaryDirectory
from typing import Tuple

import torch
from torch import nn, Tensor
from torch.nn import TransformerEncoder, TransformerEncoderLayer
from torch.utils.data import dataset

# ➨ The Transformer Blocks

In [69]:
class TransformerModel(nn.Module):
# defines a new class named TransformerModel, which inherits from nn.Module. 
# this indicates that TransformerModel is a PyTorch NN module and can make use of various features provided by PyTorch.
# see the notes above for information about __init__, self, and super()


    def __init__(self, ntoken: int, d_model: int, nhead: int, d_hid: int, nlayers: int, dropout: float = 0.5):
        # for dropout we provide a float parameter with default 0.5 if not provided
        
        super().__init__()  # call the constructor of the superclass nn.Module for inheritance
        self.model_type = 'Transformer'
        
        # initializes pos_encoder with an instance of the PositionalEncoding class 
        # (the class is expected to be defined elsewhere in the code)
        self.pos_encoder = PositionalEncoding(d_model, dropout)  
        
        # initializes an instance of the TransformerEncoderLayer class (one encoder layer) with the provided parameters
        encoder_layers = TransformerEncoderLayer(d_model, nhead, d_hid, dropout) 
        
        # initializes an instance of the TransformerEncoder class (one encoder) with the provided one encoder layer
        # (see below for more information) 
        self.transformer_encoder = TransformerEncoder(encoder_layers, nlayers)  
        
        # an embedding layer that maps each token (indexed by integers) to a vector of dimension d_model
        self.embedding = nn.Embedding(ntoken, d_model) 
        
        self.d_model = d_model
        
        # maps the output of the Transformer model to a vector of size ntoken, 
        # representing the logits (scores) for each token in the vocabulary
        self.linear = nn.Linear(d_model, ntoken)

        # calls init_weights (see function below) to initialize the weights of the embedding layer and the linear layer
        self.init_weights()  

        
    # a method for weight initialization that takes no arguments and returns None
    def init_weights(self) -> None:  
        initrange = 0.1
        self.embedding.weight.data.uniform_(-initrange, initrange)  # initializes the weights of the embedding layer
        self.linear.bias.data.zero_()  # initializes the bias of the linear layer to zeros
        self.linear.weight.data.uniform_(-initrange, initrange)  # initializes the weights of the linear layer

    
    # defines the forward pass of the model
    def forward(self, src: Tensor, src_mask: Tensor = None) -> Tensor:      
        """Arguments:
        src: Tensor of shape [seq_len, batch_size] (this is batch of data)
        src_mask: Tensor of shape [seq_len, seq_len] (a square matrix representing tokens to mask; 
                                                      if the upper right triangle is masked, then this is future masking 
                                                      which is also the default masking of this code)
        Returns: output Tensor of shape [seq_len, batch_size, ntoken] (logits for each column in the batch)"""
        
        # embeds the input tokens (src) and scales them by the square root of the d_model;
        # this scaling factor is used to prevent the gradients from becoming too small or too large during training
        src = self.embedding(src) * math.sqrt(self.d_model)
        
        # applies positional encoding to the embedded input tokens
        src = self.pos_encoder(src) 
        
        if src_mask is None:
            """Generate a square causal mask for the sequence. The masked positions are filled with float('-inf').
            Unmasked positions are filled with float(0.0) (causal masking is when we mask the future, 
                                                           that is when we train the model such that it will choose the next 
                                                           token conditioned on previous choices; 
                                                           so this default masks the upper right triangle)"""
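            # build the mask on the same device as the input, so that forward() does not rely on a global `device`
            src_mask = nn.Transformer.generate_square_subsequent_mask(len(src)).to(src.device)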
        
        # passes the embedded and encoded input tokens through the transformer encoder (transformer_encoder) 
        # to obtain the output representations
        output = self.transformer_encoder(src, src_mask) 
        
        # applies a linear transformation to the output representations to obtain the logits for each token in the vocabulary
        output = self.linear(output)
        return output

### The TransformerEncoderLayer Class

    CLASS torch.nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=2048, dropout=0.1, activation=<function relu>, layer_norm_eps=1e-05, batch_first=False, norm_first=False, bias=True, device=None, dtype=None)


`TransformerEncoderLayer` is made up of a self-attention block and a feedforward (FF) network. This standard encoder layer is based on the paper “Attention Is All You Need”, where:
1. d_model (int) – the number of expected features in the input (required).

2. nhead (int) – the number of heads in the multiheadattention models (required).

3. dim_feedforward (int) – the dimension of the feedforward network model (default=2048).

4. dropout (float) – the dropout value (default=0.1).

5. activation (Union[str, Callable[[Tensor], Tensor]]) – the activation function of the intermediate layer, can be a string (“relu” or “gelu”) or a unary callable. Default: relu

6. layer_norm_eps (float) – the eps value in layer normalization components (default=1e-5).

7. batch_first (bool) – If True, then the input and output tensors are provided as (batch, seq, feature). Default: False (seq, batch, feature).

8. norm_first (bool) – if True, layer norm is done prior to attention and feedforward operations, respectively. Otherwise it’s done after. Default: False (after).

9. bias (bool) – If set to False, Linear and LayerNorm layers will not learn an additive bias. Default: True.


This class has the following method: 

    forward(src, mask=None, src_key_padding_mask=None, is_causal=None)

where:
1. src (Tensor) – the sequence to the encoder (required).
2. mask (Optional[Tensor]) – the mask for the src sequence (optional).
3. src_key_padding_mask (Optional[Tensor]) – the mask for the src keys per batch (optional).
4. is_causal (Optional[bool]) – If specified, applies a causal mask as mask. Default: None; try to detect a causal mask. Warning: is_causal provides a hint that mask is the causal mask. Providing incorrect hints can result in incorrect execution, including forward and backward compatibility.
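
To illustrate what a causal `mask` looks like and how it acts on attention weights, here is a minimal sketch. It builds the mask by hand with `torch.triu` instead of a library helper, and the all-zero attention scores are dummy values chosen for illustration.

```python
import torch

seq_len = 4
# -inf strictly above the diagonal (future positions), 0.0 elsewhere
mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

scores = torch.zeros(seq_len, seq_len)          # dummy attention scores
weights = torch.softmax(scores + mask, dim=-1)  # softmax over each row

print(mask)
print(weights)
```

Row `t` of `weights` is zero at every position after `t`, so token `t` can attend only to itself and to earlier tokens.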

Example:
> encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)<br>
src = torch.rand(10, 32, 512)<br>
out = encoder_layer(src)

Alternatively, when batch_first is True:
> encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)<br>
src = torch.rand(32, 10, 512)<br>
out = encoder_layer(src)

### The TransformerEncoder Class

    CLASS torch.nn.TransformerEncoder(encoder_layer, num_layers, norm=None, enable_nested_tensor=True, mask_check=True)

Where:
1. encoder_layer – an instance of the TransformerEncoderLayer() class (required).
2. num_layers – the number of sub-encoder-layers in the encoder (required).
3. norm – the layer normalization component (optional).
4. enable_nested_tensor – if True, input will automatically convert to nested tensor (and convert back on output). This will improve the overall performance of TransformerEncoder when padding rate is high. Default: True (enabled).

This class has the same function `forward` as in the TransformerEncoderLayer class.

Example:
> encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)<br>
transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)<br>
src = torch.rand(10, 32, 512)<br>
out = transformer_encoder(src)

### The Embedding Layer
For example, if the vocabulary size is 10, then `emb = nn.Embedding(10, 3)` creates 10 vectors (tensors) of size 3, each representing a different word. These parameters are learnable and, by default, are randomly initialized from the standard normal distribution. In this LLM code, they are instead initialized uniformly by the dedicated method `init_weights`. `nn.Embedding` has further options (e.g. `padding_idx` and `max_norm`) that are not used here.

The layer `emb` takes indices as input, where each index represents a word. For example:

    >>> embedding = nn.Embedding(10, 3)
    >>> seq = torch.LongTensor([[1, 2, 4, 5], [4, 3, 2, 9]])  # a batch of 2 sequences with 4 tokens each
    >>> embedding(seq)
    tensor([[[-0.0251, -1.6902,  0.7172],
             [-0.6431,  0.0748,  0.6969],
             [ 1.4970,  1.3448, -0.9685],
             [-0.3677, -2.7265, -0.1685]],
    
            [[ 1.4970,  1.3448, -0.9685],
             [ 0.4362, -0.4004,  0.9400],
             [-0.6431,  0.0748,  0.6969],
             [ 0.9124, -2.3616,  1.1151]]])
Notice that, for example, the row (embedding) corresponding to the token indexed 4 appears in the correct place in each sequence (tensor). Equivalently, each token index can be viewed as a one-hot vector that selects the corresponding row of the embedding matrix.
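
The one-hot view can be checked directly. The sketch below compares the lookup with an explicit product of one-hot vectors and the embedding matrix; the weights are randomly initialized, so only the equality is meaningful, not the values.

```python
import torch
from torch import nn
import torch.nn.functional as F

embedding = nn.Embedding(10, 3)
seq = torch.tensor([[1, 2, 4, 5], [4, 3, 2, 9]])

looked_up = embedding(seq)                        # shape [2, 4, 3]
one_hot = F.one_hot(seq, num_classes=10).float()  # shape [2, 4, 10]
via_matmul = one_hot @ embedding.weight           # shape [2, 4, 3]

print(torch.allclose(looked_up, via_matmul))
print(torch.equal(looked_up[0, 2], looked_up[1, 0]))  # both positions hold token 4
```

In practice the lookup is preferred, since it avoids materializing one-hot vectors whose size is the whole vocabulary.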

# ➨ Positional Encoding

The `PositionalEncoding` module injects some information about the relative or absolute position of the tokens in the sequence. 
The positional encodings have the same dimension as the embeddings so that the two can be summed. 
Here, we use sine and cosine functions of different frequencies.
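
As a minimal sketch of the sine and cosine scheme, the plain-Python function below computes the encoding of a single position: for each even index $i$, entry $i$ is $\sin(pos / 10000^{i/d_{model}})$ and entry $i+1$ is the cosine of the same angle. The helper name `positional_encoding` is hypothetical, and `d_model` is assumed to be even.

```python
import math


def positional_encoding(pos: int, d_model: int) -> list:
    """Sinusoidal encoding of one position (d_model assumed even)."""
    pe = [0.0] * d_model
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe[i] = math.sin(angle)      # even index: sine
        pe[i + 1] = math.cos(angle)  # odd index: cosine
    return pe


pe0 = positional_encoding(0, 4)
pe1 = positional_encoding(1, 4)
print(pe0)  # position 0 gives sin(0) = 0 and cos(0) = 1 in every pair
print([round(v, 4) for v in pe1])
```

Low indices oscillate quickly with the position while high indices change slowly, so each position receives a distinct pattern.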

In [70]:
class PositionalEncoding(nn.Module):
# defines a new class named PositionalEncoding, which inherits from nn.Module. 
# this indicates that PositionalEncoding is a PyTorch NN module and can make use of various features provided by PyTorch.


    def __init__(self, d_model: int, dropout: float = 0.1, max_len: int = 5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        
        # creates a tensor containing integers from 0 to max_len-1, and unsqueezes it to add a new dimension at index 1
        position = torch.arange(max_len).unsqueeze(1)
        
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, 1, d_model)
        pe[:, 0, 0::2] = torch.sin(position * div_term)
        pe[:, 0, 1::2] = torch.cos(position * div_term)
        
        # registers the pe tensor as a buffer of the module.
        # buffers are tensors that are not updated by gradients during training,
        # and they are typically used for parameters that are not trainable, such as fixed positional encodings
        self.register_buffer('pe', pe)

        
    def forward(self, x: Tensor) -> Tensor:
        """Arguments: x Tensor of shape [seq_len, batch_size, embedding_dim]"""
        # adds the positional encodings (pe) to the input tensor x
        # the positional encodings are added to the first seq_len elements of the input tensor x, 
        # where seq_len is the length of the input sequence
        x = x + self.pe[:x.size(0)]  
        return self.dropout(x)

# ➨ Vocabulary and Datasets
The `vocab` object is built from the training dataset and is used to numericalize tokens into tensors. The `PennTreebank` dataset represents rare tokens as the `<unk>` token.

Given a 1D vector of sequential data, the `batchify()` function arranges the data into `batch_size` columns. If the data does not divide evenly into `batch_size` columns, then the data is trimmed to fit. Batching enables more parallel processing; however, the model treats each column independently, so dependencies between columns cannot be learned.

By 'sequential data' or 'sequential tokens' we mean tokens kept in their original order, so that they form coherent sentences.
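
The column arrangement is easiest to see on a toy sequence. The sketch below uses a plain list in place of a tensor. The helper `batchify_list` is a hypothetical stand-in for `batchify()`, and the alphabet plays the role of the token stream.

```python
def batchify_list(data: list, bsz: int) -> list:
    """List-based stand-in for batchify(); returns rows of length bsz."""
    seq_len = len(data) // bsz
    data = data[:seq_len * bsz]  # trim the remainder that does not fit
    columns = [data[c * seq_len:(c + 1) * seq_len] for c in range(bsz)]
    # transpose so that each column holds one contiguous chunk of the data
    return [list(row) for row in zip(*columns)]


letters = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
rows = batchify_list(letters, 4)
for row in rows:
    print(" ".join(row))
```

With 26 letters and 4 columns, the last two letters are trimmed. Reading down the first column gives `A` to `F`, the second column continues with `G` to `L`, and so on. The model never sees that `G` follows `F`.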

In [102]:
# the torchtext library provides utilities for working with text data
from torchtext.datasets import PennTreebank
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator


############ Generate the vocabulary ############

# loads the training split of the PennTreebank dataset using the PennTreebank class from the torchtext.datasets module.
# using the training split, we create the vocabulary (as well as the training dataset)
train_iter = PennTreebank(split='train')
"""train_iter is a PennTreebank iterable object containing 42068 training sentences.
   To read all sentences, use the code: [print(i) for i in train_iter]"""

# generates a tokenizing function that converts string sentences into string tokens using the 'basic_english' tokenizer
tokenizer = get_tokenizer('basic_english')  

# map(tokenizer, train_iter) applies the function tokenizer element-wise to each sentence in train_iter, 
# producing an iterable object of tokens, where each sentence is converted into a list of string tokens
token_iter = map(tokenizer, train_iter)
"""To read all lists of tokens, use the code: [print(i) for i in map(tokenizer, train_iter)]"""

# build_vocab_from_iterator builds a vocabulary from an iterable object of tokens by assigning an index to each token, 
# and adds the given special tokens (if they do not already exist)
vocab = build_vocab_from_iterator(token_iter, specials=['<unk>'])
"""To read tokens in indices 1,2, use the code: print(vocab.lookup_tokens([1,2]))"""

# set_default_index(vocab['<unk>']) sets the default index for out-of-vocabulary tokens to the index of <unk>
vocab.set_default_index(vocab['<unk>'])

# NOTICE that the vocabulary object `vocab` is defined using the training set #


############ Generate the datasets ############

# data_process takes an iterable object of raw text, and converts it into a flat tensor of corresponding token indices:
# each item (sentence) in raw_text_iter is tokenized using the tokenizer function,
# mapped to token indices using the vocabulary, and then converted to a tensor of type torch.long
def data_process(raw_text_iter: dataset.IterableDataset) -> Tensor:
    data = [torch.tensor(vocab(tokenizer(item)), dtype=torch.long) for item in raw_text_iter]  # list of tensors
    """tokenizer(item) is a list (sentence) of string tokens, and vocab(tokenizer(item)) are their indices in the vocabulary"""
    
    # filter(lambda t: t.numel() > 0, data) drops all empty tensors, which occur for empty lines
    # (out-of-vocabulary tokens are not dropped; they are mapped to <unk> through the default index).
    # the remaining tensors are packed into an (ordered) tuple of tensors  
    return torch.cat(tuple(filter(lambda t: t.numel() > 0, data)))  # concatenates the tuple of tensors into a single tensor


# load the training, validation, and test splits of the PennTreebank dataset, and process them using data_process
train_iter, val_iter, test_iter = PennTreebank()  # re-create the iterators (train_iter was already used to build the vocab)
train_data = data_process(train_iter)
val_data = data_process(val_iter)
test_data = data_process(test_iter)


# checks if a CUDA-capable GPU is available and selects the device accordingly (cuda if available, otherwise cpu)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# takes a flat tensor of token indices `data` and a batch size `bsz`, and converts it into a batched tensor for training: 
# it reshapes the data into `bsz` separate sequences (columns), discarding any extra elements that wouldn't fit,
# and the resulting tensor has shape [N//bsz, bsz], where N is the length of the original flat data tensor.
def batchify(data: Tensor, bsz: int) -> Tensor:
    seq_len = data.size(0) // bsz
    data = data[:seq_len * bsz]
    data = data.view(bsz, seq_len).t().contiguous()
    return data.to(device)

# NOTICE that after applying batchify(), each column is a sequence of consecutive tokens, and the columns allow parallelization #
# NOTICE that since the columns are processed in parallel, no relation between columns can be learned #


# batchify the training, validation, and test data using batchify
batch_size, eval_batch_size = (20, 10)  # number of columns of sequential tokens for learning and evaluation
train_data = batchify(train_data, batch_size)
"""To read the sequential tokens that form the first column of the training data, use the code:
   print(vocab.lookup_tokens([*train_data[:,0]]))"""
val_data = batchify(val_data, eval_batch_size)
test_data = batchify(test_data, eval_batch_size)

### The `get_tokenizer` Function

`torchtext.data.utils.get_tokenizer(tokenizer, language='en')` generates a tokenizer function for a string sentence, where:

1. tokenizer – the name of the tokenizer function. If `None`, it returns the `split()` function, which splits the string sentence by spaces. If `basic_english`, it returns the `_basic_english_normalize()` function, which first normalizes the string and then splits it by spaces. If it is a callable, that callable is returned. If it is the name of a tokenizer library (e.g. spacy, moses, toktok, revtok, subword), the corresponding library tokenizer is returned.
2. language – Default `en`.

For example,

    >>> tokenizer = get_tokenizer("basic_english")
    >>> tokens = tokenizer("You can now install TorchText using pip!")
    >>> tokens
    ['you', 'can', 'now', 'install', 'torchtext', 'using', 'pip', '!']
    
### The `build_vocab_from_iterator` Function

    torchtext.vocab.build_vocab_from_iterator(iterator: Iterable, min_freq: int = 1, specials: Optional[List[str]] = None, special_first: bool = True, max_tokens: Optional[int] = None) → Vocab
    
Builds a `Vocab` object from an iterator (refer to the documentation of the `Vocab` class for more information), where:

1. iterator – Iterator used to build Vocab. Must yield list or iterator of tokens.
2. min_freq – The minimum frequency needed to include a token in the vocabulary.
3. specials – Special symbols to add. The order of supplied tokens will be preserved.
4. special_first – Indicates whether to insert symbols at the beginning or at the end.
5. max_tokens – If provided, creates the vocab from the `max_tokens - len(specials)` most frequent tokens.
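
To show roughly what such a vocabulary does, here is a simplified plain-Python sketch. The helper `build_vocab` is hypothetical and is not torchtext's implementation. In particular, breaking ties alphabetically is an assumption made here, and `special_first` and `max_tokens` are not modelled.

```python
from collections import Counter


def build_vocab(token_lists, specials=("<unk>",), min_freq=1):
    """Simplified, hypothetical stand-in for a torchtext Vocab."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    for s in specials:
        counts.pop(s, None)  # specials are placed first, not ranked by count
    # most frequent first; ties broken alphabetically (an assumption here)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [t for t, c in ordered if c >= min_freq]
    return {tok: idx for idx, tok in enumerate(itos)}


sentences = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "end"]]
stoi = build_vocab(sentences)
unk = stoi["<unk>"]  # plays the role of the default index
ids = [stoi.get(tok, unk) for tok in ["the", "cat", "flew"]]
print(stoi)
print(ids)
```

The unseen token `flew` falls back to the index of `<unk>`, which is the role of `set_default_index` in the notebook code.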

# ➨ Batching the Data
`get_batch()` generates a pair of input-target sequences for the transformer model. It subdivides the source data into chunks of length `bptt` (size of context window). In other words, `get_batch()` generates the `(input, label)` pairs where:
1. `input` is a sequence of sequential tokens of length `bptt`.
2. `label` is the target sequence of sequential tokens of length `bptt`, which is the same as `input` but shifted by one position (as in encoder-only auto-regressive architectures).

We also refer to a single pair of `(input, label)` as a mini-batch. The `get_batch()` function works across all columns in parallel (that is, it works column-wise). This function takes a data tensor of size `[N//batch_size, batch_size]`, where N is the length of the original flat data tensor, and an integer `i` specifying the current position of the context window. The function returns a tuple of tensors `(input, label)`, where `input` is of size `[seq_len, batch_size]`, with `seq_len` at most the size of the context window `bptt` (it is shorter only for the last chunk, when fewer than `bptt` rows remain), and `label` is flattened to size `[seq_len * batch_size]` (suitable for the CE loss).
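
The shift by one position and the shorter last chunk can be traced on a toy example. The sketch below uses nested lists in place of tensors. The helper `get_batch_list`, the window size of 3 and the toy data are all hypothetical choices for illustration.

```python
bptt = 3  # small context window, for illustration only


def get_batch_list(source: list, i: int):
    """List-based stand-in for get_batch(); source is a list of rows."""
    seq_len = min(bptt, len(source) - 1 - i)
    data = source[i:i + seq_len]
    # targets are the rows shifted by one position, flattened row by row
    target = [tok for row in source[i + 1:i + 1 + seq_len] for tok in row]
    return data, target


# 5 rows and 2 columns: column 0 holds 0..4, column 1 holds 10..14
source = [[r, 10 + r] for r in range(5)]
for i in range(0, len(source) - 1, bptt):
    data, target = get_batch_list(source, i)
    print(i, data, target)
```

The first chunk has three rows, and the second has only one, because the last row can serve only as a target.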

In [107]:
bptt = 35  # the maximal seq_len, that is the size of the context window (only the last chunk of the source can be shorter)
def get_batch(source: Tensor, i: int) -> Tuple[Tensor, Tensor]:
    seq_len = min(bptt, len(source) - 1 - i)  # ensures that seq_len does not exceed bptt or go beyond the end of the source
    data = source[i:i+seq_len]  # extracts the input sequence from the source (column-wise)
    target = source[i+1:i+1+seq_len].reshape(-1)  # extracts the target from the source
    return data, target

# ➨ Defining the Complete Model

In [108]:
# Initiate an instance of the network

ntokens = len(vocab)  # size of vocabulary
emsize = 200  # embedding dimension
d_hid = 200  # dimension of the feedforward network model in nn.TransformerEncoder
nlayers = 2  # number of nn.TransformerEncoderLayer in nn.TransformerEncoder
nhead = 2  # number of heads in nn.MultiheadAttention
dropout = 0.2  # dropout probability
model = TransformerModel(ntokens, emsize, nhead, d_hid, nlayers, dropout).to(device)  # an instance of the network

Once we have an instance of the network (with feed-forward and back-propagation pass functions established), we build the entire model, which includes the loss, training process and evaluation process.

We use CrossEntropyLoss with the SGD optimizer. The learning rate is initially set to 5.0 and follows a StepLR schedule (a step size that decays by a factor of $\gamma$ every specified number of epochs).
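
The step decay can be written in closed form: with initial rate $\eta_0$, the rate after $e$ completed epochs is $\eta_0 \gamma^{\lfloor e / s \rfloor}$, where $s$ is the step size. Here is a minimal sketch of this arithmetic. The helper `step_lr` is hypothetical and reproduces only the formula, not the PyTorch scheduler object.

```python
def step_lr(initial_lr: float, gamma: float, step_size: int, epoch: int) -> float:
    """Learning rate in effect after `epoch` completed epochs under step decay."""
    return initial_lr * gamma ** (epoch // step_size)


for epoch in range(3):
    # epoch + 1 is the 1-based epoch number used in the training log
    print(epoch + 1, round(step_lr(5.0, 0.95, 1, epoch), 4))
```

With an initial rate of 5.0, $\gamma = 0.95$ and a step size of 1, the first three epochs use the rates 5.0, 4.75 and 4.5125.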

In [120]:
import time

criterion = nn.CrossEntropyLoss()
lr = 5.0  # initial learning rate
optimizer = torch.optim.SGD(model.parameters(), lr=lr)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.95)  # lr decay by gamma=0.95 every step_size=1 epochs


def train(model: nn.Module) -> None:
    # turn on training mode (some layers like dropout and batch norm behave differently during training and evaluation)
    model.train()
    
    num_batches = len(train_data) // bptt  # the number of (input,label) pairs in the training data
    # NOTICE that by `num_batches` we mean the number of mini-batches, which are pairs of (input,label) sequences, that are
    # fed to the model. Recall that in each mini-batch the input is of size [seq_len, batch_size] and the label is
    # flattened to size [seq_len * batch_size]. #
    
    log_interval = 200  # interval of mini-batches for logging training progress
    total_loss = 0.  # running sum of the losses over the last `log_interval` mini-batches
    start_time = time.time()  # start time of the current `log_interval` mini-batches (used for the average time per mini-batch)
    
      
    # the enumerate function generates pairs of indices (batch, i), where `i` starts at 0 and stops before
    # len(train_data) - 1 with steps of size `bptt`; hence `i` is the row index of the training data tensor at which
    # the current mini-batch starts.
    # the index `batch` is the index of the current mini-batch, hence it starts at 0 and counts the mini-batches
    for batch, i in enumerate(range(0, train_data.size(0) - 1, bptt)):  
        data, targets = get_batch(train_data, i)  # (input,label) pair: data is [seq_len, batch_size], targets is [seq_len * batch_size]
        
        output = model(data)  # the output is of size [seq_len, batch_size, vocab_size]
        # NOTICE that for each element of the `batch_size` columns (each column is a sequence of tokens), 
        # we have a vector of logits of the size of the vocabulary #
        
        output_flat = output.view(-1, ntokens)  # reshapes the output to [batch_size * seq_len, ntokens] for CE loss
        loss = criterion(output_flat, targets)

        optimizer.zero_grad()
        loss.backward()
        
        # during training, we use nn.utils.clip_grad_norm_ to prevent gradients from exploding; if the norm of the
        # gradients exceeds 0.5, this function rescales them so that their norm is 0.5 (smaller gradients are left
        # unchanged); the norm is computed over all gradients together, as if they were concatenated into a single vector.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
        
        optimizer.step()

        
        total_loss += loss.item()  # update the total loss by summing over the last `log_interval` mini-batches
        if batch % log_interval == 0 and batch > 0:  # checks if it is time to log training progress
            lr = scheduler.get_last_lr()[0]
            
            # average time (in ms), loss and perplexity of the last `log_interval` mini-batches of the epoch
            ms_per_batch = (time.time() - start_time) * 1000 / log_interval  
            cur_loss = total_loss / log_interval
            ppl = math.exp(cur_loss)
            
            print(f'| epoch {epoch:3d} | {batch:5d}/{num_batches:5d} batches | '
                  f'lr {lr:02.2f} | ms/batch {ms_per_batch:5.2f} | '
                  f'loss {cur_loss:5.2f} | ppl {ppl:8.2f}')
            
            # reinitialize loss and time for the next `log_interval` mini-batches
            total_loss = 0  
            start_time = time.time()

            
def evaluate(model: nn.Module, eval_data: Tensor) -> float:
    model.eval() 
    total_loss = 0.
    
    # no_grad() disables gradient calculation; useful for inference when we do not call backward().
    # reduces memory consumption for computations that would otherwise have `requires_grad=True`.
    with torch.no_grad():
        for i in range(0, eval_data.size(0) - 1, bptt):  # same as in training
            data, targets = get_batch(eval_data, i)
            seq_len = data.size(0)  # seq_len of the current evaluation mini-batch (up to `bptt`)
            output = model(data)
            output_flat = output.view(-1, ntokens)
            
            # unlike training, during evaluation we calculate the loss over the entire epoch (all mini-batches);
            # therefore, we multiply each loss by the size of the current mini-batch (which is `seq_len`),
            # and last we divide the sum by the length of the evaluation/validation data
            total_loss += seq_len * criterion(output_flat, targets).item()  
    return total_loss / (len(eval_data) - 1)

# ➨ Training and Model Evaluation
During training, we evaluate the model on the validation set after each epoch, and keep the parameters with the lowest validation loss so far as the output model.

Notice that the mini-batches for SGD are not taken randomly, but rather in a sequential order of `(input,label)` pairs.
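
The validation loss used to select the best parameters is an average over all target positions, and the reported perplexity is its exponential. Since the last mini-batch can be shorter than the others, each mini-batch loss is weighted by its `seq_len`. A minimal sketch of this arithmetic follows. The helper `weighted_mean_loss` and the loss values are hypothetical and are not taken from the run below.

```python
import math


def weighted_mean_loss(batch_losses, batch_lengths):
    """Average per-mini-batch mean losses, weighting each by its seq_len."""
    total = sum(loss * n for loss, n in zip(batch_losses, batch_lengths))
    return total / sum(batch_lengths)


# hypothetical mean losses of three mini-batches; the last one is shorter
losses = [5.0, 5.2, 4.0]
lengths = [35, 35, 10]
val_loss = weighted_mean_loss(losses, lengths)
print(round(val_loss, 4), round(math.exp(val_loss), 2))

# a model that spreads probability uniformly over V tokens has perplexity V
V = 1000
uniform_loss = -math.log(1.0 / V)
print(round(math.exp(uniform_loss), 6))
```

Perplexity can therefore be read as the effective number of equally likely tokens the model is choosing among at each position.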

In [77]:
best_val_loss = float('inf')  # for tracking the best validation loss encountered during training
epochs = 3


# create a temporary directory using the TemporaryDirectory; used to store the parameters of the best model during training
with TemporaryDirectory() as tempdir:  
    best_model_params_path = os.path.join(tempdir, "best_model_params.pt")  # creates a path to store the parameters

    
    for epoch in range(1, epochs + 1):
        epoch_start_time = time.time()  # track the overall time of each epoch (s)
        train(model)  # train the model for a single epoch
        
        # calculate the loss and perplexity over the validation set using the trained model (a single epoch)
        val_loss = evaluate(model, val_data)  
        val_ppl = math.exp(val_loss)
        elapsed = time.time() - epoch_start_time  # overall time of current epoch
        
        print('-' * 89)
        print(f'| end of epoch {epoch:3d} | time: {elapsed:5.2f}s | '
            f'valid loss {val_loss:5.2f} | valid ppl {val_ppl:8.2f}')
        print('-' * 89)
        
        
        # checks if the current validation loss is better than the previous best validation loss `best_val_loss`.
        # if so, update `best_val_loss` with the current validation loss and save the parameters of the current best model
        # to the specified file path.
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            torch.save(model.state_dict(), best_model_params_path)

            
        scheduler.step()  # adjusts the learning rate scheduler after each epoch
        
        
    # after training is completed (all epochs are done), load the parameters of the best model for evaluation over the test set
    model.load_state_dict(torch.load(best_model_params_path)) 

| epoch   1 |   200/ 1320 batches | lr 5.00 | ms/batch 110.09 | loss  6.88 | ppl   975.11
| epoch   1 |   400/ 1320 batches | lr 5.00 | ms/batch 103.65 | loss  6.06 | ppl   426.62
| epoch   1 |   600/ 1320 batches | lr 5.00 | ms/batch 109.26 | loss  5.83 | ppl   340.41
| epoch   1 |   800/ 1320 batches | lr 5.00 | ms/batch 116.12 | loss  5.66 | ppl   286.47
| epoch   1 |  1000/ 1320 batches | lr 5.00 | ms/batch 129.05 | loss  5.58 | ppl   265.86
| epoch   1 |  1200/ 1320 batches | lr 5.00 | ms/batch 131.20 | loss  5.48 | ppl   238.93
-----------------------------------------------------------------------------------------
| end of epoch   1 | time: 162.26s | valid loss  5.46 | valid ppl   234.68
-----------------------------------------------------------------------------------------
| epoch   2 |   200/ 1320 batches | lr 4.75 | ms/batch 132.02 | loss  5.41 | ppl   223.23
| epoch   2 |   400/ 1320 batches | lr 4.75 | ms/batch 135.60 | loss  5.35 | ppl   211.22
| epoch   2 |   600/ 1320

In [121]:
# evaluate the best model on the test dataset

test_loss = evaluate(model, test_data)  # overall loss over the test set
test_ppl = math.exp(test_loss)
print('=' * 89)
print(f'| End of training | test loss {test_loss:5.2f} | '
      f'test ppl {test_ppl:8.2f}')
print('=' * 89)

| End of training | test loss  6.98 | test ppl  1071.73
