<a href="https://colab.research.google.com/github/Zahra-FallahMMA/DeepLearning-Sharif/blob/main/HW3_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transformer Using PyTorch (Score: 250)

Transformers have revolutionized the field of Natural Language Processing (NLP) by using attention to capture dependencies within sequences. Let’s break the architecture down and implement it from scratch using PyTorch.


The implementation is based on the paper: [*Attention Is All You Need*](https://arxiv.org/abs/1706.03762)

<img src="https://miro.medium.com/v2/resize:fit:828/format:webp/1*BHzGVskWGS_3jEcYYi6miQ.png" width="500"/>

Your task in this homework is to complete the **TO DO** sections.



--- TO DO ---

FULLNAME: Zahra Fallah MirMousavi

STUDENT NUMBER: 401207192

In [None]:
import torch
import torch.nn as nn
import math

## Input Embedding (20)

The input embedding converts each token of the original sentence into a vector of fixed dimension (`d_model` in our case).

In [None]:
class InputEmbeddings(nn.Module):

    def __init__(self, d_model: int, vocab_size: int) -> None:
        super().__init__()
        ####################################
        ##              To Do             ##
        ####################################
        self.d_model = d_model
        self.embedding = nn.Embedding(vocab_size, d_model)
        ####################################
        ####################################

    def forward(self, x):
        # (batch, seq_len) --> (batch, seq_len, d_model)
        # Multiply by sqrt(d_model) to scale the embeddings according to the paper
        ####################################
        ##              To Do             ##
        ####################################
        scaled_embedding = self.embedding(x) * math.sqrt(self.d_model)
        return scaled_embedding

        ####################################
        ####################################
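
As a quick sanity check of the lookup-and-scale step, here is a minimal sketch that uses a bare `nn.Embedding` rather than the class above. The toy sizes (`vocab_size = 10`, `d_model = 4`) and the token ids are assumptions chosen only for illustration.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, d_model = 10, 4          # toy sizes chosen only for illustration
embedding = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([[1, 5, 2], [7, 0, 3]])      # (batch=2, seq_len=3)
scaled = embedding(token_ids) * math.sqrt(d_model)    # (2, 3, 4)

print(scaled.shape)  # torch.Size([2, 3, 4])
# With d_model = 4 the scale factor is sqrt(4) = 2.
print(torch.allclose(scaled, 2.0 * embedding(token_ids)))  # True
```

Each integer id becomes one row of the embedding matrix, so the output gains a trailing `d_model` axis.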

## PositionalEncoding Class (20)

Positional encoding is a crucial component in transformer models, which helps the model understand the position of each word in a sentence.

**Mathematical Formulation:**

For a given position *pos* and embedding dimension index *i*:

$PE_{(pos,2i)}=\sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$

$PE_{(pos,2i+1)}=\cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$

where:

- $PE_{(pos,2i)}$ is the value of the positional encoding at position *pos* for the even dimension 2i.
- $PE_{(pos,2i+1)}$ is the value of the positional encoding at position *pos* for the odd dimension 2i + 1.
- $d_{model}$ is the dimension of the embedding (e.g. 512).

> **Note**: Be aware that positional embeddings should remain fixed at all times and should not be learned.
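
The formulas above can be sketched as a standalone, vectorized function. The helper name `sinusoidal_table` is hypothetical, and the sketch assumes an even `d_model`. It uses the identity $1/10000^{2i/d_{model}} = \exp(-2i \cdot \ln(10000)/d_{model})$ to avoid computing large powers directly.

```python
import math
import torch

def sinusoidal_table(seq_len: int, d_model: int) -> torch.Tensor:
    """Return a (seq_len, d_model) table of fixed sinusoidal encodings."""
    pos = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)      # (seq_len, 1)
    even_idx = torch.arange(0, d_model, 2, dtype=torch.float)        # the values 2i
    # exp(-2i * ln(10000) / d_model) == 1 / 10000^(2i / d_model)
    inv_freq = torch.exp(even_idx * (-math.log(10000.0) / d_model))  # (d_model/2,)
    table = torch.zeros(seq_len, d_model)
    table[:, 0::2] = torch.sin(pos * inv_freq)   # even dimensions
    table[:, 1::2] = torch.cos(pos * inv_freq)   # odd dimensions
    return table

pe = sinusoidal_table(seq_len=6, d_model=8)

# Position 0 gives sin(0) = 0 on even dims and cos(0) = 1 on odd dims.
print(pe[0])

# Spot-check one entry against the formula written out directly.
pos, i, d_model = 3, 2, 8
expected = math.sin(pos / (10000 ** (2 * i / d_model)))
print(abs(pe[pos, 2 * i].item() - expected) < 1e-5)
```

Because the table depends only on `seq_len` and `d_model`, it can be computed once and stored without gradients.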

In [None]:
class PositionalEncoding(nn.Module):

    def __init__(self, d_model: int, seq_len: int, dropout: float) -> None:
        super().__init__()
        ####################################
        ##              To Do             ##
        ####################################
        self.d_model = d_model
        self.seq_len = seq_len
        self.dropout = nn.Dropout(dropout)


        position_encoding = torch.zeros(seq_len, d_model) # shape : (seq_len, d_model)


        pos = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1) # shape : (seq_len, 1)

        factor = -math.log(10000.0) / d_model
        # Using a list comprehension to build the array
        div_term_list = [math.exp(i * factor) for i in range(0, d_model, 2)]
        # Converting the list to a tensor
        div_term = torch.tensor(div_term_list, dtype=torch.float)

        position_encoding[:, 0::2] = torch.sin(pos * div_term) # sin(pos / 10000 ** (2i / d_model)) for even indices

        position_encoding[:, 1::2] = torch.cos(pos * div_term) # cos(pos / 10000 ** (2i / d_model)) for odd indices

        position_encoding = position_encoding.unsqueeze(0) # (1, seq_len, d_model)

        self.register_buffer('position_encoding', position_encoding)


        ####################################
        ####################################

    def forward(self, x):
        ####################################
        ##              To Do             ##
        ####################################
        x = x + (self.position_encoding[:, :x.shape[1], :]).requires_grad_(False) # (batch, seq_len, d_model)
        return self.dropout(x)

        ####################################
        ####################################

## FeedForwardBlock Class

The feed-forward block is a position-wise fully connected network that the transformer uses in both the encoder and the decoder. It consists of two linear transformations with a ReLU activation in between. This adds non-linearity to the model, allowing it to learn more complex patterns.

In [None]:
class FeedForwardBlock(nn.Module):

    def __init__(self, d_model: int, d_ff: int, dropout: float) -> None:
        super().__init__()
        self.linear_1 = nn.Linear(d_model, d_ff) # w1 and b1
        self.dropout = nn.Dropout(dropout)
        self.linear_2 = nn.Linear(d_ff, d_model) # w2 and b2

    def forward(self, x):
        # (batch, seq_len, d_model) --> (batch, seq_len, d_ff) --> (batch, seq_len, d_model)
        return self.linear_2(self.dropout(torch.relu(self.linear_1(x))))

## MultiHeadAttentionBlock Class (50)

Multi-head attention is a core component of the transformer architecture, enabling the model to focus on different parts of the input sequence simultaneously. Let’s break down how multi-head attention works and why it is essential.
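
Before the full class, it may help to see scaled dot-product attention on tiny tensors. The following is a minimal sketch, not the graded implementation: the function name `scaled_dot_product`, the toy shapes, and the padding mask are all assumptions for illustration, and dropout is left out so the result is deterministic.

```python
import math
import torch

def scaled_dot_product(q, k, v, mask=None):
    """Compute softmax(q @ k^T / sqrt(d_k)) @ v with an optional 0/1 mask."""
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (..., seq_len, seq_len)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)    # hide masked keys
    weights = scores.softmax(dim=-1)
    return weights @ v, weights

torch.manual_seed(0)
batch, h, seq_len, d_k = 1, 2, 4, 3
q = torch.randn(batch, h, seq_len, d_k)
k = torch.randn(batch, h, seq_len, d_k)
v = torch.randn(batch, h, seq_len, d_k)

# Pretend the last position is padding: 1 = keep, 0 = hide.
mask = torch.tensor([1, 1, 1, 0]).view(1, 1, 1, seq_len)

out, weights = scaled_dot_product(q, k, v, mask)
print(out.shape)               # torch.Size([1, 2, 4, 3])
print(weights.sum(dim=-1))     # every row of weights sums to 1
print(weights[..., -1].max())  # weight placed on the masked key is ~0
```

The mask broadcasts over heads and query positions, so a single `(batch, 1, 1, seq_len)` tensor hides the padded key for every head.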

In [None]:
class MultiHeadAttentionBlock(nn.Module):

    def __init__(self, d_model: int, h: int, dropout: float) -> None:
        super().__init__()
        self.d_model = d_model # Embedding vector size
        self.h = h # Number of heads
        # Make sure d_model is divisible by h
        assert d_model % h == 0, "d_model is not divisible by h"

        self.d_k = d_model // h # Dimension of vector seen by each head
        self.w_q = nn.Linear(d_model, d_model, bias=False) # Wq
        self.w_k = nn.Linear(d_model, d_model, bias=False) # Wk
        self.w_v = nn.Linear(d_model, d_model, bias=False) # Wv
        self.w_o = nn.Linear(d_model, d_model, bias=False) # Wo
        self.dropout = nn.Dropout(dropout)

    @staticmethod
    def attention(query, key, value, mask, dropout: nn.Dropout):
        """
            Scaled Dot-Product Attention
            query: (batch, h, seq_len, d_k)
            key: (batch, h, seq_len, d_k)
            value: (batch, h, seq_len, d_k)
            mask: broadcastable to (batch, h, seq_len, seq_len), e.g. (batch, 1, 1, seq_len)

            Returns:
            output: (batch, h, seq_len, d_k)
            attention_scores: (batch, h, seq_len, seq_len)
        """
        ####################################
        ##              To Do             ##
        ####################################

        ## Compute attention scores (Just apply the formula from the paper)
        ## (batch, h, seq_len, d_k) --> (batch, h, seq_len, seq_len)

        # Get the dimension of the query (d_k)
        d_k = query.shape[-1]

        # Compute the dot product between query and key
        attention_scores = torch.matmul(query, key.transpose(-2, -1))

        # Scale the attention scores by the square root of d_k
        attention_scores = attention_scores / math.sqrt(d_k)
        if mask is not None:
            ## Write a very low value (indicating -inf) to the positions where mask == 0
            attention_scores.masked_fill_(mask == 0, -1e9)

        ## Apply softmax
        attention_scores = attention_scores.softmax(dim=-1)

        if dropout is not None:
            ## Apply dropout on attention scores
            attention_scores = dropout(attention_scores)

        output = torch.matmul(attention_scores, value)

        ## (batch, h, seq_len, seq_len) --> (batch, h, seq_len, d_k)
        return output, attention_scores
        ####################################
        ####################################

    def forward(self, x_q, x_k, x_v, mask):
        ####################################
        ##              To Do             ##
        ####################################
        ## Calculate query, key and value
        query = self.w_q(x_q) # (batch, seq_len, d_model) --> (batch, seq_len, d_model)
        key = self.w_k(x_k) # (batch, seq_len, d_model) --> (batch, seq_len, d_model)
        value = self.w_v(x_v) # (batch, seq_len, d_model) --> (batch, seq_len, d_model)

        ## Separate all heads
        ## (batch, seq_len, d_model) --> (batch, seq_len, h, d_k) --> (batch, h, seq_len, d_k)
        query = query.view(query.shape[0], query.shape[1], self.h, self.d_k).transpose(1, 2)
        key = key.view(key.shape[0], key.shape[1], self.h, self.d_k).transpose(1, 2)
        value = value.view(value.shape[0], value.shape[1], self.h, self.d_k).transpose(1, 2)


        ## Get attention outputs

        x, self.attention_scores = MultiHeadAttentionBlock.attention(query, key, value, mask, self.dropout)

        ## Combine all the heads together
        ## (batch, h, seq_len, d_k) --> (batch, seq_len, h, d_k) --> (batch, seq_len, d_model)
        x = x.transpose(1, 2).contiguous().view(x.shape[0], -1, self.h * self.d_k)

        ## Multiply by Wo
        ## (batch, seq_len, d_model) --> (batch, seq_len, d_model)
        return self.w_o(x)

        ####################################
        ####################################
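
The head split and merge in `forward` are pure reshapes, which a small round-trip check can illustrate. This is a sketch with assumed toy sizes (`h = 4`, `d_k = 3`), independent of the class above.

```python
import torch

batch, seq_len, h, d_k = 2, 5, 4, 3
d_model = h * d_k

# Use distinct values so any misplaced element would be detected.
x = torch.arange(batch * seq_len * d_model, dtype=torch.float).view(batch, seq_len, d_model)

# Split: (batch, seq_len, d_model) -> (batch, seq_len, h, d_k) -> (batch, h, seq_len, d_k)
heads = x.view(batch, seq_len, h, d_k).transpose(1, 2)

# Merge: (batch, h, seq_len, d_k) -> (batch, seq_len, h, d_k) -> (batch, seq_len, d_model)
merged = heads.transpose(1, 2).contiguous().view(batch, seq_len, d_model)

print(heads.shape)                              # torch.Size([2, 4, 5, 3])
print(torch.equal(merged, x))                   # True
# Head 0 sees the first d_k features of every token.
print(torch.equal(heads[:, 0], x[..., :d_k]))   # True
```

The `.contiguous()` call is needed because `transpose` returns a non-contiguous view, and `view` requires a compatible memory layout.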

## ResidualConnection Class

Residual connections, or skip connections, are used to help with the training of deep neural networks by allowing gradients to flow more easily through the network.

In [None]:
class ResidualConnection(nn.Module):
    def __init__(self, d_model: int, dropout: float) -> None:
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, Y):
        return x + self.dropout(Y(self.norm(x)))
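
Note that the `ResidualConnection` class above normalizes the input before the sublayer (often called pre-norm), whereas the paper describes `LayerNorm(x + Sublayer(x))` (post-norm). The sketch below contrasts the two orderings. The `nn.Linear` stand-in for the sublayer and the toy sizes are assumptions, and dropout is omitted to keep the output deterministic.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 8
norm = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)   # stand-in for attention or feed-forward
x = torch.randn(2, 5, d_model)           # (batch, seq_len, d_model)

# Pre-norm: normalize the input, apply the sublayer, then add the skip path.
pre_norm_out = x + sublayer(norm(x))

# Post-norm: add the skip path first, then normalize the sum.
post_norm_out = norm(x + sublayer(x))

print(pre_norm_out.shape, post_norm_out.shape)
# Post-norm outputs are normalized per token (mean ~0); pre-norm outputs are not.
print(post_norm_out.mean(dim=-1).abs().max())
```

With pre-norm, the residual stream itself is never normalized inside a block, which is one reason the `Encoder` and `Decoder` classes apply a final `LayerNorm` after the last layer.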

## EncoderBlock Class (30)

Now we will create the encoder block, which contains one multi-head attention block, one feed-forward layer, and two Add & Norm (`ResidualConnection`) wrappers.

In [None]:
class EncoderBlock(nn.Module):
    def __init__(self, d_model: int, self_attention_block: MultiHeadAttentionBlock, feed_forward_block: FeedForwardBlock, dropout: float) -> None:
        super().__init__()
        ####################################
        ##              To Do             ##
        ####################################

        self.self_attention_block = self_attention_block
        self.feed_forward_block = feed_forward_block
        self.residual_connections = nn.ModuleList([ResidualConnection(d_model, dropout) for _ in range(2)])

        ####################################
        ####################################

    def forward(self, x, src_mask):
        ####################################
        ##              To Do             ##
        ####################################

        x = self.residual_connections[0](x, lambda x: self.self_attention_block(x, x, x, src_mask))
        x = self.residual_connections[1](x, self.feed_forward_block)
        return x
        ####################################
        ####################################

In [None]:
class Encoder(nn.Module):
    def __init__(self, d_model: int, layers: nn.ModuleList) -> None:
        super().__init__()
        self.layers = layers
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, mask):
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)

## DecoderBlock Class (30)

The `DecoderBlock` class represents a single block of the Transformer decoder. Each decoder block contains a self-attention mechanism, a cross-attention mechanism (attending to the encoder's output), and a feed-forward network, all surrounded by residual connections and layer normalization.

In [None]:
class DecoderBlock(nn.Module):
    def __init__(self, d_model: int, self_attention_block: MultiHeadAttentionBlock, cross_attention_block: MultiHeadAttentionBlock, feed_forward_block: FeedForwardBlock, dropout:float)->None:
        super().__init__()
        ####################################
        ##              To Do             ##
        ####################################

        self.self_attention_block = self_attention_block
        self.cross_attention_block = cross_attention_block
        self.feed_forward_block = feed_forward_block
        self.residual_connections = nn.ModuleList([ResidualConnection(d_model, dropout) for _ in range(3)])

        ####################################
        ####################################

    def forward(self, x, encoder_output,
                src_mask, # apply src mask on cross_attention_block
                tgt_mask  # apply tgt mask on self_attention_block
                ):
        ####################################
        ##              To Do             ##
        ####################################

        x = self.residual_connections[0](x, lambda x: self.self_attention_block(x, x, x, tgt_mask))
        x = self.residual_connections[1](x, lambda x: self.cross_attention_block(x, encoder_output, encoder_output, src_mask))
        x = self.residual_connections[2](x, self.feed_forward_block)
        return x
        ####################################
        ####################################

In [None]:
class Decoder(nn.Module):

    def __init__(self, features: int, layers: nn.ModuleList) -> None:
        super().__init__()
        self.layers = layers
        self.norm = nn.LayerNorm(features)

    def forward(self, x, encoder_output, src_mask, tgt_mask):
        for layer in self.layers:
            x = layer(x, encoder_output, src_mask, tgt_mask)
        return self.norm(x)

## ProjectionLayer Class

The `ProjectionLayer` class is used to convert the high-dimensional vectors (output of the decoder) into logits over the vocabulary. This projection is typically the last layer in the decoder of a transformer model.

In [None]:
class ProjectionLayer(nn.Module):

    def __init__(self, d_model, vocab_size) -> None:
        super().__init__()
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, x) -> torch.Tensor:
        # (batch, seq_len, d_model) --> (batch, seq_len, vocab_size)
        return self.proj(x)
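
The projection returns raw logits rather than probabilities, which is what `CrossEntropyLoss` expects. The following sketch shows how such logits could be turned into log-probabilities and a greedy next-token choice. The toy sizes and the random stand-in for the decoder output are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, vocab_size = 8, 11                    # toy sizes for illustration
proj = nn.Linear(d_model, vocab_size)

decoder_out = torch.randn(2, 5, d_model)       # (batch, seq_len, d_model)
logits = proj(decoder_out)                     # (batch, seq_len, vocab_size)
log_probs = torch.log_softmax(logits, dim=-1)  # normalized over the vocabulary
next_tokens = log_probs[:, -1].argmax(dim=-1)  # greedy pick at the last position

print(logits.shape)       # torch.Size([2, 5, 11])
print(next_tokens.shape)  # torch.Size([2])
```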

## Transformer Class (50)

The `Transformer` class encapsulates the entire transformer model, integrating both the encoder and decoder components along with embedding layers and positional encodings.

In [None]:
class Transformer(nn.Module):

    def __init__(self, encoder: Encoder, decoder: Decoder, src_embed: InputEmbeddings, tgt_embed: InputEmbeddings,
                 src_pos: PositionalEncoding, tgt_pos: PositionalEncoding, projection_layer: ProjectionLayer) -> None:
        super().__init__()
        ####################################
        ##              To Do             ##
        ####################################
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = src_embed
        self.tgt_embed = tgt_embed
        self.src_pos = src_pos
        self.tgt_pos = tgt_pos
        self.projection_layer = projection_layer


        ####################################
        ####################################

    def encode(self, src, src_mask):
        ####################################
        ##              To Do             ##
        ####################################

        src = self.src_pos(self.src_embed(src))
        return self.encoder(src, src_mask)


        ####################################
        ####################################

    def decode(self, encoder_output: torch.Tensor, src_mask: torch.Tensor, tgt: torch.Tensor, tgt_mask: torch.Tensor):
        ####################################
        ##              To Do             ##
        ####################################
        tgt = self.tgt_embed(tgt)
        tgt = self.tgt_pos(tgt)
        return self.decoder(tgt, encoder_output, src_mask, tgt_mask)


        ####################################
        ####################################

    def project(self, x):
        # (batch, seq_len, vocab_size)
        return self.projection_layer(x)

## Build Transformer Function (50)

`build_transformer` constructs a full Transformer model by putting together its various components, such as embedding layers, positional encoding, encoder and decoder blocks, and a final projection layer.

In [None]:
def build_transformer(src_vocab_size: int, tgt_vocab_size: int, src_seq_len: int, tgt_seq_len: int, d_model: int=512, N: int=6, h: int=8, dropout: float=0.1, d_ff: int=2048) -> Transformer:
    ####################################
    ##              To Do             ##
    ####################################
    ## Create the embedding layers
    src_embedding = InputEmbeddings(d_model, src_vocab_size)
    tgt_embedding = InputEmbeddings(d_model, tgt_vocab_size)


    ## Create the positional encoding layers
    src_pos = PositionalEncoding(d_model, src_seq_len, dropout)
    tgt_pos = PositionalEncoding(d_model, tgt_seq_len, dropout)

    ## Create the encoder blocks
    def create_encoder_block(_):
        encoder_self_attention_block = MultiHeadAttentionBlock(d_model, h, dropout)
        feed_forward_block = FeedForwardBlock(d_model, d_ff, dropout)
        return EncoderBlock(d_model, encoder_self_attention_block, feed_forward_block, dropout)

    encoder_blocks = list(map(create_encoder_block, range(N)))

    ## Create the decoder blocks
    def create_decoder_block(_):
        decoder_self_attention_block = MultiHeadAttentionBlock(d_model, h, dropout)
        decoder_cross_attention_block = MultiHeadAttentionBlock(d_model, h, dropout)
        feed_forward_block = FeedForwardBlock(d_model, d_ff, dropout)
        return DecoderBlock(d_model, decoder_self_attention_block, decoder_cross_attention_block, feed_forward_block, dropout)

    decoder_blocks = list(map(create_decoder_block, range(N)))



    ## Create the encoder and decoder

    encoder = Encoder(d_model, nn.ModuleList(encoder_blocks))
    decoder = Decoder(d_model, nn.ModuleList(decoder_blocks))

    ## Create the projection layer

    projection_layer = ProjectionLayer(d_model, tgt_vocab_size)

    ## Create the transformer
    transformer = Transformer(encoder, decoder, src_embedding, tgt_embedding, src_pos, tgt_pos, projection_layer)


    ## Initialize the parameters
    for p in transformer.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)
    return transformer
    ####################################
    ####################################

## Testing the model

Here is a simple test to verify whether you have implemented the transformer correctly. Run the code below and ensure that both the training and validation losses decrease steadily.



In [None]:
!pip install datasets sentencepiece transformers


In [None]:
import torch
from datasets import load_dataset
from torch.optim import Adam
from torch.nn import CrossEntropyLoss
from transformers import BertTokenizer
from tqdm.notebook import tqdm

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load WMT14 English-German Translation Dataset (test split is enough for our purpose)
dataset = load_dataset('wmt14', 'de-en', split='test')

# Initialize Tokenizer (use a pretrained tokenizer for simplicity)
src_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')  # English tokenizer
tgt_tokenizer = BertTokenizer.from_pretrained('bert-base-german-cased')  # German tokenizer

# Preprocess data (Tokenization and Padding)
def tokenize_data(batch):
    src = src_tokenizer(batch['translation']['en'], padding="max_length", truncation=True, max_length=32)
    tgt = tgt_tokenizer(batch['translation']['de'], padding="max_length", truncation=True, max_length=32)
    return {'src_input_ids': src['input_ids'], 'tgt_input_ids': tgt['input_ids']}

dataset = dataset.map(tokenize_data)


In [None]:
# Set vocab sizes
src_vocab_size = src_tokenizer.vocab_size
tgt_vocab_size = tgt_tokenizer.vocab_size

# Define model parameters
src_seq_len = 32  # Max length of source sequences
tgt_seq_len = 32  # Max length of target sequences
d_model = 512
N = 6  # Number of layers
h = 8  # Number of heads
dropout = 0.1
d_ff = 2048

# Build Transformer Model
transformer = build_transformer(src_vocab_size, tgt_vocab_size, src_seq_len, tgt_seq_len, d_model, N, h, dropout, d_ff).to(device)

# Loss function and optimizer
criterion = CrossEntropyLoss(ignore_index=0)  # Ignore padding index
optimizer = Adam(transformer.parameters(), lr=2e-5)

def create_src_mask(src_input, pad_idx=0):
    """Create a mask for the source to hide padding tokens."""
    src_mask = (src_input != pad_idx).unsqueeze(1).unsqueeze(2)  # (batch_size, 1, 1, seq_len)
    return src_mask

def create_tgt_mask(tgt_input, pad_idx=0):
    """Create a target mask to hide future tokens (causal mask) and padding tokens."""
    batch_size, tgt_len = tgt_input.shape
    # Causal mask to prevent looking ahead
    causal_mask = torch.tril(torch.ones(tgt_len, tgt_len)).bool().to(tgt_input.device).unsqueeze(0)  # (1, tgt_len, tgt_len)
    # Padding mask
    pad_mask = (tgt_input != pad_idx).unsqueeze(1).unsqueeze(2)  # (batch_size, 1, 1, tgt_len)

    # Combine the causal mask and padding mask
    tgt_mask = causal_mask & pad_mask.squeeze(1)  # (batch_size, tgt_len, tgt_len)
    return tgt_mask.unsqueeze(1)  # (batch_size, 1, tgt_len, tgt_len)

# Training Loop
for epoch in range(10):
    transformer.train()
    train_loss = 0
    val_loss = 0

    # Training
    for i in tqdm(range(0, 2000, 32)):
        src_input = torch.tensor(dataset[i:i+32]['src_input_ids']).to(device)
        tgt_input = torch.tensor(dataset[i:i+32]['tgt_input_ids']).to(device)

        # Create masks
        src_mask = create_src_mask(src_input).to(device)
        tgt_mask = create_tgt_mask(tgt_input[:, :-1]).to(device)  # Apply mask only on the decoder input sequence

        # Forward pass
        optimizer.zero_grad()
        encoder_output = transformer.encode(src_input, src_mask)
        decoder_output = transformer.decode(encoder_output, src_mask, tgt_input[:, :-1], tgt_mask)
        output = transformer.project(decoder_output)

        # Calculate loss
        loss = criterion(output.view(-1, tgt_vocab_size), tgt_input[:, 1:].reshape(-1))
        train_loss += loss.item()

        # Backpropagation
        loss.backward()
        optimizer.step()

    transformer.eval()
    # Evaluation
    for i in tqdm(range(2000, len(dataset), 32)):
        with torch.no_grad():
            src_input = torch.tensor(dataset[i:i+32]['src_input_ids']).to(device)
            tgt_input = torch.tensor(dataset[i:i+32]['tgt_input_ids']).to(device)

            # Create masks
            src_mask = create_src_mask(src_input).to(device)
            tgt_mask = create_tgt_mask(tgt_input[:, :-1]).to(device)  # Apply mask only on the decoder input sequence

            # Forward pass
            encoder_output = transformer.encode(src_input, src_mask)
            decoder_output = transformer.decode(encoder_output, src_mask, tgt_input[:, :-1], tgt_mask)
            output = transformer.project(decoder_output)

            # Calculate loss
            loss = criterion(output.view(-1, tgt_vocab_size), tgt_input[:, 1:].reshape(-1))
            val_loss += loss.item()

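    # Note: each `loss` is already a per-batch mean, so dividing the summed losses by the
    # number of examples reports roughly 1/32 (one over the batch size) of the mean per-token loss.
    print(f'Epoch {epoch+1}, Train loss: {train_loss/2000}, Val loss: {val_loss/(len(dataset) - 2000)}')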


Epoch 1, Train loss: 0.3097009410858154, Val loss: 0.303187516726858

Epoch 2, Train loss: 0.28880433177948, Val loss: 0.28661931951642633

Epoch 3, Train loss: 0.2713646116256714, Val loss: 0.27321653565761456

Epoch 4, Train loss: 0.2573965926170349, Val loss: 0.26257027277085976

Epoch 5, Train loss: 0.24658015656471252, Val loss: 0.2551207803895443

Epoch 6, Train loss: 0.23900739359855652, Val loss: 0.25025670953904644

Epoch 7, Train loss: 0.233940425157547, Val loss: 0.2473042522327731

Epoch 8, Train loss: 0.23054414439201354, Val loss: 0.24549232190057027

Epoch 9, Train loss: 0.22822919511795045, Val loss: 0.24435564955828315

Epoch 10, Train loss: 0.22647426247596741, Val loss: 0.24404503983015552
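
The target mask built by `create_tgt_mask` in the training cell combines a causal mask with a padding mask. Here is a minimal sketch of the same combination on one short sequence, so the resulting pattern can be inspected by eye. The token ids are made up for illustration, with `0` as the padding index as in the training code.

```python
import torch

pad_idx = 0
tgt = torch.tensor([[5, 7, 9, 0]])   # one sequence; the last token is padding

tgt_len = tgt.shape[1]
# Lower-triangular matrix: query position r may attend to key positions <= r.
causal = torch.tril(torch.ones(tgt_len, tgt_len)).bool().unsqueeze(0)  # (1, L, L)
# Padding mask over key positions.
pad = (tgt != pad_idx).unsqueeze(1)                                    # (batch, 1, L)
tgt_mask = (causal & pad).unsqueeze(1)                                 # (batch, 1, L, L)

print(tgt_mask.shape)        # torch.Size([1, 1, 4, 4])
print(tgt_mask[0, 0].int())
# tensor([[1, 0, 0, 0],
#         [1, 1, 0, 0],
#         [1, 1, 1, 0],
#         [1, 1, 1, 0]], dtype=torch.int32)
```

The last column is all zeros because no query may attend to the padded key. The padded query row still attends to real tokens, but its prediction is excluded from the loss by `ignore_index=0`.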
