# Building a Transformer Model from Scratch with PyTorch and HuggingFace

In this notebook, we'll walk through the process of building a Transformer model from scratch using PyTorch and HuggingFace. Transformers have revolutionized the field of natural language processing by enabling models to understand context and relationships in sequential data more effectively.

**What you'll do:**

- Implement each component of the Transformer architecture step by step.
- Fill in the `TODO` sections to complete the implementation.
- Use test cells provided after each section to verify your implementation.

Let's get started!

---

In [1]:
# Import necessary libraries
import torch
import torch.nn as nn
import math

## Transformer Architecture Overview

A Transformer model primarily consists of an Encoder and a Decoder. Each of these is made up of multiple layers that include mechanisms like self-attention and feed-forward neural networks.

In this notebook, we'll focus on implementing the **Encoder** part of the Transformer, which is commonly used for tasks like text classification or encoding sequences for downstream tasks.

A single encoder layer is structured as follows:

<img src="../docs/encoder_layer.png" alt="Alt Text" width="500"/>

---

## Implementing the Transformer Encoder Layer

**In this section:**

- You'll implement the `TransformerEncoderLayer` class.
- You'll define the feedforward neural network components.
- You'll complete the `forward` method to process the input through the self-attention and feedforward layers.

---

In [2]:
# Define the Transformer Encoder Layer
class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model, nhead, dim_feedforward, dropout):
        super(TransformerEncoderLayer, self).__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)

        ###### TODO: Define the feedforward neural network layers ######
        # Hint: You need two linear layers with a ReLU activation in between and dropout.
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(dim_feedforward, d_model)
        ######

        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, src, src_mask=None, src_key_padding_mask=None):
        # Self-attention layer
        src2 = self.self_attn(
            src, src, src, attn_mask=src_mask, key_padding_mask=src_key_padding_mask
        )[0]
        src = src + self.dropout1(src2)
        src = self.norm1(src)

        ####### TODO: Implement the feedforward neural network part######
        # Hint: Apply the two linear layers with a ReLU activation and dropout in between.
        # linear1 -> ReLU -> dropout -> linear2, matching the hint and the decoder layer
        src2 = self.linear2(self.dropout(torch.relu(self.linear1(src))))
        src = src + self.dropout2(src2)
        src = self.norm2(src)
        ######

        return src

### Testing the Transformer Encoder Layer

Let's create an instance of `TransformerEncoderLayer` and pass some dummy data through it to ensure it's working correctly.

**What you'll do:**

- Create dummy input data.
- Initialize the `TransformerEncoderLayer`.
- Pass the data through the layer and observe the output shape.

---

In [3]:
# Test the TransformerEncoderLayer
# Parameters
d_model = 512
nhead = 8
dim_feedforward = 2048
dropout = 0.1

# Create an instance of the TransformerEncoderLayer
encoder_layer = TransformerEncoderLayer(d_model, nhead, dim_feedforward, dropout)

# Dummy input data (sequence length, batch size, embedding size)
seq_length = 10
batch_size = 2
dummy_input = torch.rand(seq_length, batch_size, d_model)

# Forward pass
output = encoder_layer(dummy_input)

print(f"Output shape: {output.shape}")

Output shape: torch.Size([10, 2, 512])


**Expected Output:**

The output shape should be `(seq_length, batch_size, d_model)`, the same as the input shape: an encoder layer transforms the values but preserves the dimensions.
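
The test above does not exercise the `src_key_padding_mask` argument that `forward` accepts. As a minimal sketch of what that argument does, the example below calls `nn.MultiheadAttention` directly with a boolean padding mask. The small sizes and the choice of which positions count as padding are assumptions made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

seq_length, batch_size, d_model, nhead = 5, 2, 16, 4

# Same default layout as the notebook: (sequence length, batch size, embedding size).
x = torch.rand(seq_length, batch_size, d_model)

# Boolean mask of shape (batch size, sequence length); True marks a padded position.
# Here the last two positions of the first sequence are treated as padding.
key_padding_mask = torch.zeros(batch_size, seq_length, dtype=torch.bool)
key_padding_mask[0, -2:] = True

attn = nn.MultiheadAttention(d_model, nhead, dropout=0.0)
attn.eval()

with torch.no_grad():
    out, weights = attn(x, x, x, key_padding_mask=key_padding_mask)

print(out.shape)           # torch.Size([5, 2, 16])
print(weights.shape)       # torch.Size([2, 5, 5]), averaged over heads
print(weights[0, :, -2:])  # attention paid to the padded keys of sequence 0
```

The attention weights on the padded keys are zero, so those positions contribute nothing to the other positions' outputs.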

---

## Building the Full Transformer Encoder

**In this section:**

- You'll implement the `TransformerEncoder` class.
- You'll stack multiple `TransformerEncoderLayer` instances.
- You'll complete the `forward` method to process the input through all the layers.

---

In [4]:
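# Define the Transformer Encoder
import copy


class TransformerEncoder(nn.Module):
    def __init__(self, encoder_layer, num_layers):
        super(TransformerEncoder, self).__init__()
        # Deep-copy the layer so each of the num_layers layers has its own parameters;
        # reusing the same instance would share one set of weights across the stack.
        self.layers = nn.ModuleList(
            [copy.deepcopy(encoder_layer) for _ in range(num_layers)]
        )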
        self.num_layers = num_layers

    def forward(self, src, mask=None, src_key_padding_mask=None):
        output = src

        for mod in self.layers:
            output = mod(
                output, src_mask=mask, src_key_padding_mask=src_key_padding_mask
            )

        return output

### Testing the Transformer Encoder

Let's create an instance of `TransformerEncoder` and pass some dummy data through it.

**What you'll do:**

- Use the previously defined `TransformerEncoderLayer` as the building block.
- Initialize the `TransformerEncoder` with multiple layers.
- Pass the dummy data through the encoder.

---

In [5]:
# Test the TransformerEncoder
num_layers = 6

# Initialize the TransformerEncoder
transformer_encoder = TransformerEncoder(encoder_layer, num_layers)

# Dummy input data remains the same
# Forward pass
output = transformer_encoder(dummy_input)

print(f"Output shape after TransformerEncoder: {output.shape}")

Output shape after TransformerEncoder: torch.Size([10, 2, 512])


**Expected Output:**

The output shape should still be `(seq_length, batch_size, d_model)`. Stacking layers does not change the shape; each layer only transforms the representation further.
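
A shape check cannot tell whether the stacked layers each have their own weights. The sketch below uses `nn.Linear` as a stand-in for an encoder layer (an assumption made to keep the example small) and compares a `nn.ModuleList` built by repeating one instance with one built from deep copies.

```python
import copy
import torch.nn as nn

# Stand-in for an encoder layer: 4 * 4 weights + 4 biases = 20 parameters.
layer = nn.Linear(4, 4)
num_layers = 3

# Repeating the same instance registers one module three times.
shared = nn.ModuleList([layer for _ in range(num_layers)])
# Deep copies give three modules with separate parameter tensors.
independent = nn.ModuleList([copy.deepcopy(layer) for _ in range(num_layers)])


def count_parameters(module):
    # parameters() yields each distinct parameter tensor once.
    return sum(p.numel() for p in module.parameters())


print(count_parameters(shared))          # 20
print(count_parameters(independent))     # 60
print(shared[0] is shared[1])            # True
print(independent[0] is independent[1])  # False
```

With the repeated instance, a gradient step on one "layer" updates all of them, so the stack behaves like a single layer applied several times.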

---

## Implementing Positional Encoding

**In this section:**

- You'll implement the `PositionalEncoding` class.
- You'll initialize the positional encoding matrix using sine and cosine functions.
- You'll complete the `forward` method to add positional encodings to the input embeddings.
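
For reference, the sinusoidal encoding is commonly written as below. This is the standard formulation from the original Transformer paper; that the notebook's figure shows the same formula is an assumption. Here $pos$ is the position in the sequence (starting at 0), $i$ indexes pairs of embedding dimensions, and $d_{\text{model}}$ is the embedding size.

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right),
\qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
$$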

---


<img src="../docs/positional_encoding.png" alt="Alt Text" width="700"/>

In [6]:
# Positional Encoding
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        ###### TODO: Initialize the positional encoding matrix #####
        # Hint: Use sine and cosine functions to create a matrix of shape (max_len, d_model).
        # max_len is the maximum supported input sequence length.
        pe = torch.zeros(max_len, d_model)
        # Positions 0 .. max_len - 1 as a column vector of shape (max_len, 1)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        # 1 / 10000^(2i / d_model) for each even dimension index 2i
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        ######

        pe = pe.unsqueeze(0).transpose(0, 1)
        self.register_buffer("pe", pe)

    def forward(self, x):
        x = x + self.pe[: x.size(0), :]
        return self.dropout(x)

### Testing Positional Encoding

Let's verify that our positional encoding works by passing some dummy data through it.

**What you'll do:**

- Create dummy input embeddings.
- Initialize the `PositionalEncoding` class.
- Pass the embeddings through the positional encoder.

---

In [7]:
# Test the PositionalEncoding
pos_encoder = PositionalEncoding(d_model, dropout)

# Dummy input embeddings (sequence length, batch size, embedding size)
dummy_embeddings = torch.zeros(seq_length, batch_size, d_model)

# Forward pass
pos_encoded_embeddings = pos_encoder(dummy_embeddings)

print(f"Positional Encoded Embeddings shape: {pos_encoded_embeddings.shape}")

Positional Encoded Embeddings shape: torch.Size([10, 2, 512])


**Expected Output:**

The output shape should be `(seq_length, batch_size, d_model)`, and the embeddings should now contain positional information.
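
A shape check alone does not show that the encoding values are sensible. The sketch below builds a standalone sinusoidal table and checks a few properties that follow from the formula. It assumes the standard formulation (positions starting at 0, base 10000); `sinusoidal_table` is a hypothetical helper, not part of the notebook.

```python
import math
import torch


def sinusoidal_table(max_len, d_model):
    """Return a (max_len, d_model) table of sinusoidal position encodings."""
    position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe


pe = sinusoidal_table(max_len=50, d_model=16)

print(pe.shape)                   # torch.Size([50, 16])
print(pe[0, 0::2])                # all zeros, since sin(0) = 0
print(pe[0, 1::2])                # all ones, since cos(0) = 1
print(torch.equal(pe[3], pe[7]))  # False: different positions get different rows
```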

---

## Assembling the Complete Transformer Model

**In this section:**

- You'll implement the `TransformerModel` class.
- You'll combine the embedding layer, positional encoding, transformer encoder, and a final linear output layer (named `decoder` in the code, although it is not a Transformer decoder).
- You'll initialize the weights of the model.

---

In [8]:
# Complete Transformer Model
class TransformerModel(nn.Module):
    def __init__(
        self, ntoken, d_model, nhead, dim_feedforward, num_layers, dropout=0.5
    ):
        super(TransformerModel, self).__init__()
        self.model_type = "Transformer"
        self.src_mask = None

        self.pos_encoder = PositionalEncoding(d_model, dropout)
        encoder_layers = TransformerEncoderLayer(
            d_model, nhead, dim_feedforward, dropout
        )
        self.transformer_encoder = TransformerEncoder(encoder_layers, num_layers)
        self.encoder = nn.Embedding(ntoken, d_model)
        self.d_model = d_model
        self.decoder = nn.Linear(d_model, ntoken)

        self.init_weights()

    def init_weights(self):
        initrange = 0.1
        ##### TODO: Initialize the weights of the model #####
        # Hint: Initialize the embedding and decoder weights uniformly, and set decoder biases to zero.
        nn.init.uniform_(self.encoder.weight, -initrange, initrange)
        nn.init.uniform_(self.decoder.weight, -initrange, initrange)
        nn.init.zeros_(self.decoder.bias)
        #####

    def forward(self, src, src_mask=None):
        src = self.encoder(src) * math.sqrt(self.d_model)
        src = self.pos_encoder(src)
        output = self.transformer_encoder(src, src_mask)
        output = self.decoder(output)
        return output

### Testing the Complete Transformer Model

Let's test the entire model by passing some dummy data through it.

**What you'll do:**

- Define the model parameters.
- Create an instance of `TransformerModel`.
- Generate dummy input data.
- Perform a forward pass and observe the output shape.

---

In [9]:
# Example usage
ntokens = 1000  # Size of vocabulary
d_model = 512  # Embedding size
nhead = 8  # Number of attention heads
dim_feedforward = 2048  # Feedforward network hidden layer size
num_layers = 6  # Number of encoder layers
dropout = 0.2  # Dropout rate

model = TransformerModel(ntokens, d_model, nhead, dim_feedforward, num_layers, dropout)

# Dummy input data (sequence length, batch size)
batch_size = 32
seq_length = 35
input_data = torch.randint(0, ntokens, (seq_length, batch_size))

# Forward pass
output = model(input_data)
print(f"Output shape: {output.shape}")  # Should be [seq_length, batch_size, ntokens]

Output shape: torch.Size([35, 32, 1000])


**Expected Output:**

The output shape should be `(seq_length, batch_size, ntokens)`, indicating that the model outputs a score for each token in the vocabulary at each position in the sequence.
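
The scores are unnormalized logits. As a rough sketch of how they might be turned into probabilities and predicted token ids, the example below applies a softmax and an argmax over the vocabulary dimension. The random tensor is a stand-in for real model output, so the predicted ids themselves are meaningless.

```python
import torch

torch.manual_seed(0)
seq_length, batch_size, ntokens = 35, 32, 1000

# Stand-in for the model output; random values replace real logits here.
logits = torch.randn(seq_length, batch_size, ntokens)

# Normalize over the vocabulary dimension so each position gets a distribution.
probs = torch.softmax(logits, dim=-1)
# Pick the highest-probability token id at each position.
predicted_ids = probs.argmax(dim=-1)

print(probs.shape)          # torch.Size([35, 32, 1000])
print(predicted_ids.shape)  # torch.Size([35, 32])
```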

---

## Conclusion

We've successfully built a Transformer Encoder model from scratch using PyTorch. This implementation covers the key components of the Transformer architecture, including multi-head attention, positional encoding, and feedforward neural networks.

Understanding how each part works and how they fit together is crucial for leveraging Transformers in your own projects.

**Next Steps:**

- Extend the model to include the Decoder part of the Transformer.

---

# Extending the Transformer Model with a Decoder

In the previous sections, we implemented the **Encoder** part of the Transformer model. Now, we'll extend our implementation to include the **Decoder** part, completing the Transformer architecture.

**What you'll do:**

- Implement the `TransformerDecoderLayer` and `TransformerDecoder` classes.
- Integrate the decoder into the `TransformerModel`.
- Test each component with provided test cells.

---

## Understanding the Transformer Decoder

The **Decoder** is responsible for generating the output sequence, one element at a time, while attending to the encoder's output. It consists of:

- **Self-Attention Layer**: Allows the decoder to attend to previous positions in the output sequence.
- **Cross-Attention Layer**: Allows the decoder to attend to the encoder's output.
- **Feedforward Neural Network**: Processes the attention outputs.
- **Layer Normalization and Residual Connections**: Applied after each sub-layer.

---

## Implementing the Transformer Decoder Layer

**In this section:**

- You'll implement the `TransformerDecoderLayer` class.
- You'll define the self-attention, cross-attention, and feedforward components.
- You'll complete the `forward` method to process the input.

---

In [10]:
# Define the Transformer Decoder Layer
class TransformerDecoderLayer(nn.Module):
    def __init__(self, d_model, nhead, dim_feedforward, dropout):
        super(TransformerDecoderLayer, self).__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
        self.multihead_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)

        # TODO: Define the feedforward neural network layers
        # Hint: Similar to the encoder layer, use two linear layers with a ReLU activation and dropout in between.
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(dim_feedforward, d_model)

        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
        self.dropout3 = nn.Dropout(dropout)

    def forward(
        self,
        tgt,
        memory,
        tgt_mask=None,
        memory_mask=None,
        tgt_key_padding_mask=None,
        memory_key_padding_mask=None,
    ):
        # Self-attention layer
        tgt2 = self.self_attn(
            tgt, tgt, tgt, attn_mask=tgt_mask, key_padding_mask=tgt_key_padding_mask
        )[0]
        tgt = tgt + self.dropout1(tgt2)
        tgt = self.norm1(tgt)

        # TODO: Implement the cross-attention layer
        # Hint: Use multihead attention where the query is the decoder input and the key and value are the encoder output.
        tgt2 = self.multihead_attn(
            query=tgt,
            key=memory,
            value=memory,
            key_padding_mask=memory_key_padding_mask,
            attn_mask=memory_mask,
        )[0]
        tgt = tgt + self.dropout2(tgt2)
        tgt = self.norm2(tgt)

        # Feedforward layer
        tgt2 = self.linear2(self.dropout(torch.relu(self.linear1(tgt))))
        tgt = tgt + self.dropout3(tgt2)
        tgt = self.norm3(tgt)

        return tgt

### Testing the Transformer Decoder Layer

Let's create an instance of `TransformerDecoderLayer` and pass some dummy data through it.

**What you'll do:**

- Create dummy input data for the decoder and encoder outputs.
- Initialize the `TransformerDecoderLayer`.
- Pass the data through the layer and observe the output shape.

---

In [11]:
# Test the TransformerDecoderLayer
# Parameters
d_model = 512
nhead = 8
dim_feedforward = 2048
dropout = 0.1

# Create an instance of the TransformerDecoderLayer
decoder_layer = TransformerDecoderLayer(d_model, nhead, dim_feedforward, dropout)

# Dummy input data for the decoder (target sequence length, batch size, embedding size)
tgt_seq_length = 12
batch_size = 2
dummy_tgt = torch.rand(tgt_seq_length, batch_size, d_model)

# Dummy memory from the encoder (source sequence length, batch size, embedding size).
# seq_length was set in an earlier cell; it does not need to equal tgt_seq_length.
memory = torch.rand(seq_length, batch_size, d_model)

# Forward pass
output = decoder_layer(dummy_tgt, memory)

print(f"Output shape: {output.shape}")

Output shape: torch.Size([12, 2, 512])


**Expected Output:**

The output shape should be `(tgt_seq_length, batch_size, d_model)`, confirming that the decoder layer processes the input correctly.
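
The output follows the target length because, in cross-attention, the queries come from the decoder side while the keys and values come from the encoder output. A small sketch with `nn.MultiheadAttention` illustrates the shapes; the sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

d_model, nhead = 16, 4
src_len, tgt_len, batch_size = 7, 5, 2

cross_attn = nn.MultiheadAttention(d_model, nhead)

tgt = torch.rand(tgt_len, batch_size, d_model)     # decoder-side queries
memory = torch.rand(src_len, batch_size, d_model)  # encoder output: keys and values

out, weights = cross_attn(query=tgt, key=memory, value=memory)

print(out.shape)      # torch.Size([5, 2, 16])  (target length, batch, d_model)
print(weights.shape)  # torch.Size([2, 5, 7])   (batch, target length, source length)
```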

---

## Building the Full Transformer Decoder

**In this section:**

- You'll implement the `TransformerDecoder` class.
- You'll stack multiple `TransformerDecoderLayer` instances.
- You'll complete the `forward` method to process the input through all the layers.

---

In [12]:
# Define the Transformer Decoder
import copy


class TransformerDecoder(nn.Module):
    def __init__(self, decoder_layer, num_layers):
        super(TransformerDecoder, self).__init__()
        # Deep-copy the layer so that each layer in the stack has its own parameters.
        self.layers = nn.ModuleList(
            [copy.deepcopy(decoder_layer) for _ in range(num_layers)]
        )
        self.num_layers = num_layers

    def forward(
        self,
        tgt,
        memory,
        tgt_mask=None,
        memory_mask=None,
        tgt_key_padding_mask=None,
        memory_key_padding_mask=None,
    ):
        output = tgt

        for mod in self.layers:
            output = mod(
                output,
                memory,
                tgt_mask=tgt_mask,
                memory_mask=memory_mask,
                tgt_key_padding_mask=tgt_key_padding_mask,
                memory_key_padding_mask=memory_key_padding_mask,
            )

        return output

### Testing the Transformer Decoder

Let's create an instance of `TransformerDecoder` and pass some dummy data through it.

**What you'll do:**

- Use the previously defined `TransformerDecoderLayer` as the building block.
- Initialize the `TransformerDecoder` with multiple layers.
- Pass the dummy data through the decoder.

---

In [13]:
# Test the TransformerDecoder
num_layers = 6

# Initialize the TransformerDecoder
transformer_decoder = TransformerDecoder(decoder_layer, num_layers)

# Dummy input data remains the same
# Forward pass
output = transformer_decoder(dummy_tgt, memory)

print(f"Output shape after TransformerDecoder: {output.shape}")

Output shape after TransformerDecoder: torch.Size([12, 2, 512])


**Expected Output:**

The output shape should still be `(tgt_seq_length, batch_size, d_model)`. Stacking layers does not change the shape; each layer only transforms the representation further.

---

## Updating the Transformer Model with Decoder

**In this section:**

- You'll update the `TransformerModel` class to include the decoder.
- You'll adjust the `forward` method to handle both source and target inputs.
- You'll ensure that masks are properly handled.

---

In [14]:
# Complete Transformer Model with Encoder and Decoder
class TransformerModel(nn.Module):
    def __init__(
        self,
        src_ntoken,
        tgt_ntoken,
        d_model,
        nhead,
        dim_feedforward,
        num_layers,
        dropout=0.5,
    ):
        super(TransformerModel, self).__init__()
        self.model_type = "Transformer"
        self.src_mask = None
        self.tgt_mask = None

        self.pos_encoder = PositionalEncoding(d_model, dropout)
        self.pos_decoder = PositionalEncoding(d_model, dropout)

        encoder_layers = TransformerEncoderLayer(
            d_model, nhead, dim_feedforward, dropout
        )
        self.transformer_encoder = TransformerEncoder(encoder_layers, num_layers)

        decoder_layers = TransformerDecoderLayer(
            d_model, nhead, dim_feedforward, dropout
        )
        self.transformer_decoder = TransformerDecoder(decoder_layers, num_layers)

        self.src_encoder = nn.Embedding(src_ntoken, d_model)
        self.tgt_encoder = nn.Embedding(tgt_ntoken, d_model)
        self.d_model = d_model
        self.decoder = nn.Linear(d_model, tgt_ntoken)

        self.init_weights()

    def init_weights(self):
        # Initialize the embedding and output-projection weights uniformly; zero the output bias
        initrange = 0.1
        nn.init.uniform_(self.src_encoder.weight, -initrange, initrange)
        nn.init.uniform_(self.tgt_encoder.weight, -initrange, initrange)
        nn.init.zeros_(self.decoder.bias)
        nn.init.uniform_(self.decoder.weight, -initrange, initrange)

    def generate_square_subsequent_mask(self, sz):
        """Generate a square mask for the sequence. Mask out future positions."""
        mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
        mask = (
            mask.float()
            .masked_fill(mask == 0, float("-inf"))
            .masked_fill(mask == 1, float(0.0))
        )
        return mask

    def forward(self, src, tgt, src_mask=None, tgt_mask=None, memory_mask=None):
        src_emb = self.src_encoder(src) * math.sqrt(self.d_model)
        src_emb = self.pos_encoder(src_emb)
        memory = self.transformer_encoder(src_emb, src_mask)

        tgt_emb = self.tgt_encoder(tgt) * math.sqrt(self.d_model)
        tgt_emb = self.pos_decoder(tgt_emb)
        output = self.transformer_decoder(
            tgt_emb, memory, tgt_mask=tgt_mask, memory_mask=memory_mask
        )
        output = self.decoder(output)
        return output

### Handling Masks

In sequence-to-sequence models, masks are crucial for:

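- **Source Mask (`src_mask`)**: An attention mask of shape `(src_seq_length, src_seq_length)` that restricts which source positions can attend to each other in the encoder. It is often left as `None`.
- **Target Mask (`tgt_mask`)**: A causal mask of shape `(tgt_seq_length, tgt_seq_length)` that prevents the decoder from peeking ahead in the target sequence during training.
- **Memory Mask (`memory_mask`)**: An attention mask of shape `(tgt_seq_length, src_seq_length)` that restricts which encoder outputs each target position can attend to in cross-attention.

Padding is handled separately, through the `src_key_padding_mask`, `tgt_key_padding_mask`, and `memory_key_padding_mask` arguments of the encoder and decoder classes. The `TransformerModel.forward` method above does not expose these arguments, so as written it assumes batches without padding.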
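
To make the two kinds of mask concrete, here is a minimal sketch that builds a causal attention mask and a key padding mask. `causal_mask`, `key_padding_mask`, and `pad_idx` are hypothetical names introduced for this example; the notebook's model does not define a padding token.

```python
import torch


def causal_mask(size):
    """Additive (size, size) mask: 0.0 where attention is allowed, -inf above the diagonal."""
    return torch.triu(torch.full((size, size), float("-inf")), diagonal=1)


def key_padding_mask(token_ids, pad_idx):
    """Boolean (batch, seq) mask, True at padded positions; token_ids is (seq, batch)."""
    return (token_ids == pad_idx).transpose(0, 1)


print(causal_mask(3))
# rows: [0, -inf, -inf], [0, 0, -inf], [0, 0, 0]

pad_idx = 0  # hypothetical padding token id
tokens = torch.tensor([[5, 7],
                       [3, 0],
                       [0, 0]])  # (seq=3, batch=2)
print(key_padding_mask(tokens, pad_idx))
# rows: [False, False, True], [False, True, True]
```

The causal mask is added to the attention scores before the softmax, so `-inf` entries receive zero attention weight. The padding mask is passed per batch element and marks which keys to ignore.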

---

### Testing the Complete Transformer Model with Decoder

Let's test the updated model by passing some dummy data through it.

**What you'll do:**

- Define separate vocab sizes for source and target languages.
- Create an instance of the updated `TransformerModel`.
- Generate dummy input data for both source and target.
- Create appropriate masks.
- Perform a forward pass and observe the output shape.

---

In [15]:
# Example usage
src_ntokens = 1000  # Size of source vocabulary
tgt_ntokens = 1000  # Size of target vocabulary
d_model = 512  # Embedding size
nhead = 8  # Number of attention heads
dim_feedforward = 2048  # Feedforward network hidden layer size
num_layers = 6  # Number of encoder and decoder layers
dropout = 0.2  # Dropout rate

model = TransformerModel(
    src_ntokens, tgt_ntokens, d_model, nhead, dim_feedforward, num_layers, dropout
)

# Dummy input data
batch_size = 32
src_seq_length = 35
tgt_seq_length = 30
src_input = torch.randint(0, src_ntokens, (src_seq_length, batch_size))
tgt_input = torch.randint(0, tgt_ntokens, (tgt_seq_length, batch_size))

# Generate masks
# TODO: Generate the target mask to prevent the decoder from attending to future positions
# Hint: Use the generate_square_subsequent_mask method provided in the model
tgt_mask = model.generate_square_subsequent_mask(tgt_seq_length)

# Forward pass
output = model(src_input, tgt_input, tgt_mask=tgt_mask)
print(
    f"Output shape: {output.shape}"
)  # Should be [tgt_seq_length, batch_size, tgt_ntokens]

Output shape: torch.Size([30, 32, 1000])


**Expected Output:**

The output shape should be `(tgt_seq_length, batch_size, tgt_ntokens)`, indicating that the model outputs a score for each token in the target vocabulary at each position in the target sequence.
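
During training, these scores would typically be compared against the gold target tokens with a cross-entropy loss. The sketch below shows one way to reshape the tensors for `torch.nn.functional.cross_entropy`. The all-zero logits and random targets are stand-ins chosen so that the expected loss is known; they are not output from the model above.

```python
import math
import torch
import torch.nn.functional as F

tgt_seq_length, batch_size, tgt_ntokens = 30, 32, 1000

# Stand-in for the model output. All-zero logits mean a uniform prediction over the vocabulary.
logits = torch.zeros(tgt_seq_length, batch_size, tgt_ntokens)
# Hypothetical gold token ids, in the same (sequence, batch) layout as the model input.
targets = torch.randint(0, tgt_ntokens, (tgt_seq_length, batch_size))

# cross_entropy expects (N, C) logits and (N,) class indices, so flatten the first two dimensions.
loss = F.cross_entropy(logits.reshape(-1, tgt_ntokens), targets.reshape(-1))

print(loss.item())            # close to log(1000) for a uniform prediction
print(math.log(tgt_ntokens))
```

A freshly initialized model usually starts near this value, which makes `log(vocabulary size)` a handy sanity check for the first few training steps.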

---

## Conclusion

We've successfully extended our Transformer model to include the **Decoder** part, completing the architecture. This comprehensive implementation covers:

- Multi-head self-attention and cross-attention mechanisms.
- Positional encoding for both encoder and decoder inputs.
- Causal masking of the target sequence, with mask arguments available for restricting attention further.

**Next Steps:**

- **Training the Model**: Use a dataset to train the model for tasks like machine translation or text summarization.
- **Implementing Beam Search**: Enhance the decoding process by implementing beam search to generate more coherent sequences.
- **Exploring Pre-trained Models**: Utilize HuggingFace's Transformers library to leverage pre-trained models for your tasks.

---

# Help

- `torch.zeros`: Creates a tensor filled with zeros.
  - **Example Usage:**
    ```python
    pe = torch.zeros(max_len, d_model)
    ```
- `torch.arange`: Returns a 1-D tensor of size `ceil((end - start) / step)` with values from `start` (inclusive) to `end` (exclusive), spaced by `step`.
  - **Example Usage:**
    ```python
    position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
    ```
- `torch.exp`: Returns a new tensor with the exponential of the elements of the input tensor.
  - **Example Usage:**
    ```python
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    ```
- `torch.sin` and `torch.cos`: Compute the sine and cosine of each element in the input tensor.
  - **Example Usage:**
    ```python
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    ```
- `torch.unsqueeze`: Adds a dimension to the tensor at the specified position.
  - **Example Usage:**
    ```python
    pe = pe.unsqueeze(0)
    ```
- `torch.transpose`: Swaps two dimensions of the tensor.
  - **Example Usage:**
    ```python
    pe = pe.transpose(0, 1)
    ```

- `nn.MultiheadAttention`: Allows the model to jointly attend to information from different representation subspaces.
  - **Example Usage:**
    ```python
    mha = nn.MultiheadAttention(embed_dim=512, num_heads=8)
    attn_output, attn_output_weights = mha(query, key, value)
    ```

- `torch.rand`: Generates random numbers between 0 and 1.
  
- `nn.Embedding`: Turns indices into dense vectors of fixed size.
  - **Example Usage:**
    ```python
    embedding = nn.Embedding(num_embeddings=1000, embedding_dim=512)
    embedded = embedding(input_indices)
    ```
- `torch.triu`: Returns the upper triangular part of a matrix.
  - **Example Usage:**
    ```python
    mask = torch.triu(torch.ones(seq_length, seq_length))
    ```
- `torch.ones`: Creates a tensor filled with ones.

- `torch.randint`: Returns a tensor filled with random integers.
  - **Example Usage:**
    ```python
    src_input = torch.randint(0, src_ntokens, (src_seq_length, batch_size))
    ```

- `nn.Linear`: Applies a linear transformation to the incoming data.
  - **Example Usage:**
    ```python
    linear_layer = nn.Linear(in_features=512, out_features=2048)
    output = linear_layer(input_tensor)
    ```
- `torch.relu`: Applies the ReLU activation function element-wise.
  - **Example Usage:**
    ```python
    activated_output = torch.relu(input_tensor)
    ```
- `nn.Dropout`: Randomly zeroes some of the elements of the input tensor with probability `p`.
  - **Example Usage:**
    ```python
    dropout_layer = nn.Dropout(p=0.1)
    output = dropout_layer(input_tensor)
    ```
- `nn.LayerNorm`: Applies Layer Normalization over a mini-batch of inputs.
  - **Example Usage:**
    ```python
    layer_norm = nn.LayerNorm(normalized_shape=512)
    normalized_output = layer_norm(input_tensor)
    ```
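
As a quick check of what `nn.LayerNorm` does in this notebook's setting, the sketch below normalizes a tensor over its last dimension and inspects the per-position statistics. The input scale and offset are arbitrary assumptions chosen to make the effect visible.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# (sequence length, batch size, d_model) input with a non-zero mean and large variance.
x = torch.randn(10, 2, 512) * 3.0 + 5.0

layer_norm = nn.LayerNorm(512)
y = layer_norm(x)

# Statistics are computed over the last dimension, separately for every position.
max_abs_mean = y.mean(dim=-1).abs().max().item()
mean_var = y.var(dim=-1, unbiased=False).mean().item()

print(max_abs_mean)  # near 0
print(mean_var)      # near 1
```

Each position's embedding vector is normalized independently, so the result does not depend on the other sequences in the batch.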
    