In [1]:
from transformers import BertTokenizer, BertModel, AutoConfig

In [2]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
config = AutoConfig.from_pretrained('bert-base-uncased')

In [3]:
sample_text = "Hugging Face makes it easy to use Transformers."
tokens = tokenizer(sample_text)
print("Token IDs:", tokens['input_ids'])

Token IDs: [101, 17662, 2227, 3084, 2009, 3733, 2000, 2224, 19081, 1012, 102]
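
To make the IDs above less opaque: `101` and `102` are the `[CLS]` and `[SEP]` special tokens that `bert-base-uncased` wraps around every sequence, and the nine IDs in between correspond to the nine lowercased words and the final period. Words that are missing from the vocabulary are split into sub-word pieces by greedy longest-match-first lookup. The sketch below illustrates that lookup under stated assumptions: `toy_vocab` is a made-up vocabulary and `wordpiece_split` is an illustrative helper, not a function from `transformers`.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercased word (illustrative helper)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, NOT the real 30522-entry BERT vocabulary.
toy_vocab = {"trans", "##form", "##ers", "easy"}

print(wordpiece_split("transformers", toy_vocab))  # ['trans', '##form', '##ers']
print(wordpiece_split("xyz", toy_vocab))           # ['[UNK]']
```

In the real vocabulary every word of the sample sentence is present as a whole token, which is why 9 words plus 2 special tokens give exactly 11 IDs.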


In [4]:
for name, module in model.named_modules():
    print(name)


embeddings
embeddings.word_embeddings
embeddings.position_embeddings
embeddings.token_type_embeddings
embeddings.LayerNorm
embeddings.dropout
encoder
encoder.layer
encoder.layer.0
encoder.layer.0.attention
encoder.layer.0.attention.self
encoder.layer.0.attention.self.query
encoder.layer.0.attention.self.key
encoder.layer.0.attention.self.value
encoder.layer.0.attention.self.dropout
encoder.layer.0.attention.output
encoder.layer.0.attention.output.dense
encoder.layer.0.attention.output.LayerNorm
encoder.layer.0.attention.output.dropout
encoder.layer.0.intermediate
encoder.layer.0.intermediate.dense
encoder.layer.0.intermediate.intermediate_act_fn
encoder.layer.0.output
encoder.layer.0.output.dense
encoder.layer.0.output.LayerNorm
encoder.layer.0.output.dropout
encoder.layer.1
encoder.layer.1.attention
encoder.layer.1.attention.self
encoder.layer.1.attention.self.query
encoder.layer.1.attention.self.key
encoder.layer.1.attention.self.value
encoder.layer.1.attention.self.dropout
encoder.

In [5]:
print("Hidden Size:", config.hidden_size)
print("Number of Layers:", config.num_hidden_layers)
print("Vocab Size:", tokenizer.vocab_size)

Hidden Size: 768
Number of Layers: 12
Vocab Size: 30522


# Task
Visualize the encoder-decoder stack of a transformer model, explain the working of each layer using code, load a pre-trained transformer model, extract layer information, illustrate the embedding layer, explain and illustrate the attention mechanism, explain and illustrate the feed-forward network, iterate through layers, explain the role of skip connections and layer normalization, and summarize the forward pass.

Load a suitable pre-trained transformer model (e.g., BERT or GPT-2) using the `transformers` library.


The earlier cells already loaded the tokenizer, model, and configuration, printed the hidden size, number of layers, and vocabulary size, and listed the model's named modules. The one remaining item is the configuration object itself, which the next cell prints.



In [6]:
print("Configuration Object:")
print(config)

Configuration Object:
BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.55.4",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
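
The configuration values are enough to estimate the model's size by hand. The following is a back-of-the-envelope sketch, not output from the notebook: it assumes the standard BERT layout (biases on every dense layer, two parameters per LayerNorm feature) and that the default pooler is present. `linear_params` is a helper name introduced here.

```python
# Values copied from the printed BertConfig above.
hidden = 768
layers = 12
vocab = 30522
intermediate = 3072
max_positions = 512
type_vocab = 2


def linear_params(n_in, n_out):
    """Weights plus biases of one dense layer."""
    return n_in * n_out + n_out


layer_norm_params = 2 * hidden  # learned scale and shift

# Word, position, and token-type tables, followed by embeddings.LayerNorm
embedding_params = (vocab + max_positions + type_vocab) * hidden + layer_norm_params

per_layer_params = (
    3 * linear_params(hidden, hidden)      # attention.self.query / key / value
    + linear_params(hidden, hidden)        # attention.output.dense
    + layer_norm_params                    # attention.output.LayerNorm
    + linear_params(hidden, intermediate)  # intermediate.dense
    + linear_params(intermediate, hidden)  # output.dense
    + layer_norm_params                    # output.LayerNorm
)

pooler_params = linear_params(hidden, hidden)  # assumption: default pooler is present

total_params = embedding_params + layers * per_layer_params + pooler_params
print(f"{total_params:,}")  # 109,482,240
```

If the assumptions hold, this figure should agree with `sum(p.numel() for p in model.parameters())` on the loaded model. Roughly 78% of the parameters sit in the 12 encoder layers and most of the rest in the word embedding table.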



## Illustrate embedding layer

### Subtask:
Show how the input text is converted into numerical embeddings using the model's embedding layer.


**Reasoning**:
Use the previously created tokens and the model's embedding layer to get the numerical embeddings and then print their shape and a small portion of the data.



In [7]:
import torch

# Convert token IDs to a PyTorch tensor
input_ids = torch.tensor(tokens['input_ids']).unsqueeze(0)

# Get embeddings from the model's embedding layer
embeddings = model.embeddings(input_ids)

# Print the shape of the embeddings
print("Shape of Embeddings:", embeddings.shape)

# Print embeddings for the first token
print("\nEmbeddings for the first token (CLS token):")
print(embeddings[0, 0, :10]) # Print the first 10 dimensions of the embedding

Shape of Embeddings: torch.Size([1, 11, 768])

Embeddings for the first token (CLS token):
tensor([ 0.1686, -0.2858, -0.3261, -0.1122,  0.0343, -0.2689, -0.0302, -0.0390,
         0.0157, -0.2828], grad_fn=<SliceBackward0>)
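
The module list printed earlier shows what `model.embeddings` does internally: three lookup tables (`word_embeddings`, `position_embeddings`, `token_type_embeddings`) followed by `LayerNorm` and `dropout`. A minimal plain-Python sketch of that computation follows. All table values are invented toy numbers with a hidden size of 4, the learned LayerNorm scale and shift are assumed to be 1 and 0, and dropout is omitted as it is inactive at inference time.

```python
import math


def layer_norm(vec, eps=1e-12):
    """Normalize one token vector (scale=1 and shift=0 assumed for brevity)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


# Hypothetical toy tables (hidden size 4), standing in for the learned weights.
word_table = {
    101: [0.1, -0.2, 0.3, 0.0],
    7: [0.5, 0.1, -0.4, 0.2],
    102: [-0.3, 0.2, 0.1, 0.4],
}
position_table = [
    [0.01, 0.02, 0.03, 0.04],
    [0.05, -0.01, 0.00, 0.02],
    [-0.02, 0.03, 0.01, -0.04],
]
token_type_table = [[0.0, 0.1, 0.0, -0.1], [0.1, 0.0, -0.1, 0.0]]


def embed(input_ids, token_type_ids=None):
    """Sum the three embeddings per position, then apply layer normalization."""
    if token_type_ids is None:
        token_type_ids = [0] * len(input_ids)  # single-segment input
    rows = []
    for position, (token_id, segment) in enumerate(zip(input_ids, token_type_ids)):
        summed = [
            w + p + s
            for w, p, s in zip(
                word_table[token_id], position_table[position], token_type_table[segment]
            )
        ]
        rows.append(layer_norm(summed))
    return rows


toy_embeddings = embed([101, 7, 102])
print(len(toy_embeddings), len(toy_embeddings[0]))  # 3 4
```

The result has one vector per token, mirroring the `[1, 11, 768]` shape above without the batch dimension.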


# Self-Attention Mechanism

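- **Query (Q):** A learned linear projection of the current token's representation. It is compared against every Key to decide where to attend.
- **Key (K):** A learned linear projection of each token's representation, including the current token's own. Its dot product with the Query measures relevance.
- **Value (V):** A learned linear projection of each token's representation. The Values are averaged, weighted by the attention weights, to produce the output.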

## Mathematical Steps

1. **Calculate Attention Scores:**  
   Scores = Q × K^T  
   (Dot product of Query and Key)

2. **Obtain Attention Weights:**  
   Weights = Softmax(Scores / √dₖ)  
   (Apply softmax to scaled scores, where dₖ is the dimension of the key vectors)

3. **Compute Weighted Sum of Values:**  
   Output = Weights × V  
   (Weighted sum of Values)
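
The three steps above can be sketched in plain Python so that every multiplication is visible. This is an illustration with made-up 2-dimensional vectors for three tokens, not the library's implementation; `matmul`, `transpose`, `softmax`, and `attention` are helper names introduced here.

```python
import math


def matmul(a, b):
    """Multiply an n x m matrix by an m x p matrix (lists of rows)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    d_k = len(k[0])
    scores = matmul(q, transpose(k))                       # Step 1: Q x K^T
    weights = [softmax([s / math.sqrt(d_k) for s in row])  # Step 2: softmax(scores / sqrt(d_k))
               for row in scores]
    return matmul(weights, v), weights                     # Step 3: weights x V


# Toy inputs: 3 tokens, d_k = 2 (values chosen arbitrarily).
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

output, weights = attention(q, k, v)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1, so every output vector is a weighted average of the Value vectors.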

In [8]:
# Select one of the encoder layers (here, the first)
first_encoder_layer = model.encoder.layer[0]

# Extract its self-attention module
self_attention_module = first_encoder_layer.attention.self

# The self-attention module expects an attention mask. With a single unpadded sequence,
# a mask of ones is enough; with padding, you would build a proper mask.
attention_mask = torch.ones(input_ids.shape, dtype=torch.long)
extended_attention_mask = model.get_extended_attention_mask(attention_mask, input_ids.shape, input_ids.device)

# head_mask is optional, so None is fine for this demonstration.
attention_output = self_attention_module(embeddings, extended_attention_mask, head_mask=None)

# The module returns a tuple; its first element is the attention output
attention_output_tensor = attention_output[0]

# Print the shape of the output
print("\nShape of Self-Attention Output:", attention_output_tensor.shape)


Shape of Self-Attention Output: torch.Size([1, 11, 768])




The output shape `[Batch Size, Sequence Length, Hidden Size]` indicates that for each token in the input sequence (length 11) and for each item in the batch (size 1), the self-attention mechanism has produced a new representation vector of size 768 (the hidden size of the model). This new representation incorporates information from all other tokens in the sequence based on their calculated relevance.
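
The 768-wide output is the concatenation of 12 attention heads (`num_attention_heads` in the configuration), each working on a 64-dimensional slice, so the dₖ used for scaling is 64 rather than 768. The sketch below shows only that bookkeeping on a dummy vector; `split_heads` and `merge_heads` are illustrative helpers, not library functions.

```python
import math

hidden_size = 768          # config.hidden_size
num_attention_heads = 12   # config.num_attention_heads

head_dim = hidden_size // num_attention_heads  # 64 dimensions per head
scale = math.sqrt(head_dim)                    # scores are divided by 8.0


def split_heads(vector, num_heads):
    """Cut one token vector into equally sized per-head slices."""
    size = len(vector) // num_heads
    return [vector[i * size:(i + 1) * size] for i in range(num_heads)]


def merge_heads(heads):
    """Concatenate the per-head slices back into one vector."""
    return [value for head in heads for value in head]


token_vector = [float(i) for i in range(hidden_size)]  # dummy values
heads = split_heads(token_vector, num_attention_heads)

print(len(heads), len(heads[0]), scale)  # 12 64 8.0
```

Each head runs the attention computation on its own slice, and merging the 12 results restores the `[Batch Size, Sequence Length, 768]` shape seen above.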

## Explain and illustrate feed-forward network


# Feed-Forward Network (FFN)

The FFN is a crucial component in each transformer layer, applied independently to each position.  
It typically consists of two linear transformations separated by a non-linear activation function.  
Its role is to further process the information after the attention mechanism, allowing the model to capture more complex patterns.

## Structure of the FFN

1. **First Linear Layer (Intermediate Dense):**  
   Projects the input to a higher dimension.

2. **Non-linear Activation (e.g., GELU):**  
   Introduces non-linearity.

3. **Second Linear Layer (Output Dense):**  
   Projects the result back to the original hidden dimension.
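
The three-step structure above fits in a few lines of plain Python. This is a sketch with invented toy weights (hidden size 2, intermediate size 4) rather than BERT's 768 and 3072; `gelu` is the exact erf-based form, and `linear` and `feed_forward` are helper names introduced here.

```python
import math


def gelu(x):
    """Exact GELU: x times the standard normal CDF of x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(vec, weight, bias):
    """Dense layer; weight is stored as [out_features][in_features]."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]


def feed_forward(vec, w1, b1, w2, b2):
    hidden = linear(vec, w1, b1)            # 1. project up to the intermediate size
    activated = [gelu(h) for h in hidden]   # 2. non-linearity
    return linear(activated, w2, b2)        # 3. project back to the hidden size


# Hypothetical toy weights: hidden size 2, intermediate size 4.
w1 = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2], [0.3, 0.3]]
b1 = [0.0, 0.1, 0.0, -0.1]
w2 = [[0.2, -0.1, 0.4, 0.3], [-0.3, 0.2, 0.1, 0.5]]
b2 = [0.0, 0.0]

tokens_in = [[1.0, 2.0], [0.5, -1.0]]
# The same weights are applied to each position separately ("position-wise").
tokens_out = [feed_forward(t, w1, b1, w2, b2) for t in tokens_in]
print(tokens_out)
```

Because each token is processed on its own, changing one input token leaves the FFN output of every other token untouched; mixing across tokens happens only in the attention sub-layer.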

In [9]:
# Select the FFN pieces from the first encoder layer
intermediate_dense = first_encoder_layer.intermediate.dense
intermediate_activation = first_encoder_layer.intermediate.intermediate_act_fn
output_dense = first_encoder_layer.output.dense

# Inside BertLayer the FFN does not see the raw self-attention output: it first passes through
# attention.output (dense projection, then residual add with the layer input and LayerNorm).
ffn_input = first_encoder_layer.attention.output(attention_output_tensor, embeddings)

# First linear layer: project from hidden_size (768) to intermediate_size (3072)
intermediate_output = intermediate_dense(ffn_input)

# Apply the activation function (GELU for this model)
activated_intermediate_output = intermediate_activation(intermediate_output)

# Second linear layer: project back to hidden_size (768)
feed_forward_output = output_dense(activated_intermediate_output)

# Print the shape of the final output
print("\nShape of Feed-Forward Network Output:", feed_forward_output.shape)


Shape of Feed-Forward Network Output: torch.Size([1, 11, 768])


The output shape `[Batch Size, Sequence Length, Hidden Size]` is the same as the input shape to the FFN.  
This indicates that the FFN processes each token's representation independently and outputs a new representation of the same dimension.  
This output will then be subject to skip connections and layer normalization before being passed to the next layer or the final output.

## Iterate through layers


In [10]:
# Initialize the input tensor for the first encoder layer with the embeddings
layer_input = embeddings
print(f"Input shape to the first encoder layer: {layer_input.shape}")

# Prepare the attention mask
attention_mask = torch.ones(input_ids.shape, dtype=torch.long)
extended_attention_mask = model.get_extended_attention_mask(attention_mask, input_ids.shape, input_ids.device)

# Iterate through each encoder layer
for i, layer in enumerate(model.encoder.layer):
    print(f"\nProcessing Encoder Layer {i}")
    # Pass the input through the current layer
    # The output of a BertLayer is a tuple, the first element is the hidden states
    layer_output = layer(layer_input, extended_attention_mask, head_mask=None)[0]

    # Update the input for the next layer
    layer_input = layer_output

    # Print the shape of the output after the layer
    print(f"Output shape after Encoder Layer {i}: {layer_input.shape}")


Input shape to the first encoder layer: torch.Size([1, 11, 768])

Processing Encoder Layer 0
Output shape after Encoder Layer 0: torch.Size([1, 11, 768])

Processing Encoder Layer 1
Output shape after Encoder Layer 1: torch.Size([1, 11, 768])

Processing Encoder Layer 2
Output shape after Encoder Layer 2: torch.Size([1, 11, 768])

Processing Encoder Layer 3
Output shape after Encoder Layer 3: torch.Size([1, 11, 768])

Processing Encoder Layer 4
Output shape after Encoder Layer 4: torch.Size([1, 11, 768])

Processing Encoder Layer 5
Output shape after Encoder Layer 5: torch.Size([1, 11, 768])

Processing Encoder Layer 6
Output shape after Encoder Layer 6: torch.Size([1, 11, 768])

Processing Encoder Layer 7
Output shape after Encoder Layer 7: torch.Size([1, 11, 768])

Processing Encoder Layer 8
Output shape after Encoder Layer 8: torch.Size([1, 11, 768])

Processing Encoder Layer 9
Output shape after Encoder Layer 9: torch.Size([1, 11, 768])

Processing Encoder Layer 10
Output shape aft

## Explain the role of skip connections and layer normalization


# Skip Connections and Layer Normalization in Transformers

## 1. Skip Connections (Residual Connections)

**Purpose:**  
Help combat the vanishing gradient problem during training of deep networks.

**How they work:**  
They add the input of a sub-layer to its output. If the input to a sub-layer is $X$ and the output is $F(X)$, the result after the skip connection is:
$$
X + F(X)
$$

**Benefit:**  
Allows gradients to flow directly through the network via shortcut paths, making it easier to train deep architectures like Transformers.
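
Why the shortcut helps can be made concrete with one line of calculus. This is a sketch that writes $\mathcal{L}$ for the training loss and $Y$ for the sub-layer result (both symbols introduced here) and ignores the layer normalization that BERT applies after the sum:

$$
Y = X + F(X)
\quad\Longrightarrow\quad
\frac{\partial \mathcal{L}}{\partial X}
= \frac{\partial \mathcal{L}}{\partial Y}\left(I + \frac{\partial F}{\partial X}\right)
= \frac{\partial \mathcal{L}}{\partial Y} + \frac{\partial \mathcal{L}}{\partial Y}\,\frac{\partial F}{\partial X}
$$

The first term reaches $X$ unchanged even when $\partial F / \partial X$ is close to zero, which is the shortcut path for gradients.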

---

## 2. Layer Normalization

**Purpose:**  
Stabilize the training process by normalizing the activations across the hidden dimension, independently for each token of each sample.

**How it works:**  
- Computes the mean and variance across the feature (hidden) dimensions for each token and each batch item.
- Normalizes the activations using:
  $$
  \hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}
  $$
- Applies learned scale ($\gamma$) and shift ($\beta$) parameters:
  $$
  y = \gamma \hat{x} + \beta
  $$

**Benefit:**  
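Keeps the scale of activations consistent from layer to layer, independent of batch size and sequence length (often described as reducing internal covariate shift), which makes training faster and more stable, especially in deep models.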
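
The two formulas above translate directly into code. Below is a minimal plain-Python sketch for a single token vector; the default `eps` mirrors `layer_norm_eps` from the printed configuration, and the input values are arbitrary.

```python
import math


def layer_norm(x, gamma, beta, eps=1e-12):
    """Normalize one token vector across its features, then scale and shift."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)  # biased variance, as in LayerNorm
    return [g * (v - mu) / math.sqrt(var + eps) + b for v, g, b in zip(x, gamma, beta)]


x = [2.0, 4.0, 6.0, 8.0]
y = layer_norm(x, gamma=[1.0] * 4, beta=[0.0] * 4)
print([round(v, 4) for v in y])  # [-1.3416, -0.4472, 0.4472, 1.3416]
```

With `gamma` set to 1 and `beta` set to 0 the output has zero mean and unit variance; the learned parameters let the model undo or rescale the normalization per feature.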

---

## 3. Placement within a Transformer Layer

In a standard Transformer layer, skip connections and layer normalization are applied **twice**:

1. **After the Multi-Head Self-Attention sub-layer:**
   $$
   \text{Output}_1 = \text{LayerNorm}(X + \text{Attention}(X))
   $$

2. **After the Feed-Forward Network (FFN) sub-layer:**
   $$
   \text{Output}_2 = \text{LayerNorm}(\text{Output}_1 + \text{FFN}(\text{Output}_1))
   $$

This pattern is often referred to as the *post-norm* architecture.
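
The post-norm pattern can be expressed as one small wrapper applied twice. The sketch below works on a single token vector and uses deliberately trivial stand-ins for the two sub-layers (`toy_attention` and `toy_ffn` are hypothetical placeholders; real attention mixes information across tokens), so only the placement of the residual add and the normalization is illustrated.

```python
import math


def layer_norm(x, eps=1e-12):
    """Layer normalization with scale=1 and shift=0 assumed for brevity."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]


def post_norm(x, sublayer):
    """LayerNorm(x + sublayer(x)): residual add first, normalization second."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def toy_attention(x):
    # Hypothetical stand-in for the self-attention sub-layer.
    return [0.5 * v for v in x]


def toy_ffn(x):
    # Hypothetical stand-in for the feed-forward sub-layer.
    return [max(0.0, v) for v in x]


def encoder_layer(x):
    output_1 = post_norm(x, toy_attention)    # Output_1 = LayerNorm(X + Attention(X))
    output_2 = post_norm(output_1, toy_ffn)   # Output_2 = LayerNorm(Output_1 + FFN(Output_1))
    return output_2


x = [1.0, -2.0, 0.5, 3.0]
result = encoder_layer(x)
print(len(result))  # 4
```

A pre-norm layer would instead compute `x + sublayer(layer_norm(x))`; the formulas above, and BERT, use the post-norm ordering.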

---

## 4. Referring to the BERT Model Structure

Looking at the module names printed earlier by `model.named_modules()` (cell `In [4]`):

- Each `encoder.layer.X` block contains:
  - `attention.output.LayerNorm` → applied after attention with a skip connection.
  - `output.LayerNorm` → applied after the FFN with another skip connection.

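- The skip connections are **implicitly handled** inside the `attention.output` (`BertSelfOutput`) and `output` (`BertOutput`) sub-modules rather than written out in `BertLayer` itself. Simplified, both follow the same pattern:
  ```python
  # attention.output: hidden_states is the self-attention result,
  # input_tensor is the input to the layer
  hidden_states = self.dense(hidden_states)
  hidden_states = self.dropout(hidden_states)
  hidden_states = self.LayerNorm(hidden_states + input_tensor)

  # output: hidden_states is the activated intermediate (FFN) result,
  # input_tensor is the output of the attention block
  hidden_states = self.dense(hidden_states)
  hidden_states = self.dropout(hidden_states)
  hidden_states = self.LayerNorm(hidden_states + input_tensor)
  ```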

# Forward Pass Summary

## 1. Input Processing and Embedding

- Raw text is first tokenized into a sequence of tokens (words or subwords).
- These tokens are converted into numerical IDs.
- The model's embedding layer converts these IDs into dense vector representations (embeddings).
- For BERT-like models, embeddings are a sum of:
  - **Token embeddings**: Represent the identity of each token.
  - **Position embeddings**: Encode the order of tokens in the sequence.
  - **Token type (segment) embeddings**: Distinguish between different segments (e.g., question and answer in QA tasks).
- Example: Input token IDs are converted to embeddings of shape  
  `[Batch Size, Sequence Length, Hidden Size]` (e.g., `torch.Size([1, 11, 768])`).

---

## 2. Encoder Layer Processing

- The initial embeddings are passed sequentially through multiple identical encoder layers.
- Each encoder layer contains two main sub-layers:
  1. **Multi-Head Self-Attention**:
     - Allows each token to attend to all other tokens in the sequence.
     - Produces context-aware representations by weighting relevant tokens.
  2. **Feed-Forward Network (FFN)**:
     - A position-wise fully connected network that further transforms each token’s representation.
- Around each sub-layer:
  - **Skip connections (residual connections)** add the input to the output of the sub-layer.
  - **Layer normalization** stabilizes activations.
- The output of one encoder layer serves as input to the next.
- Example: Data flows through all `config.num_hidden_layers` encoder layers (12 in BERT-base), maintaining the shape  
  `[Batch Size, Sequence Length, Hidden Size]`.

---

## 3. Final Encoder Output

- The output of the last encoder layer is a sequence of contextualized representations, one vector per input token.
- These vectors encode rich semantic and syntactic information based on the full context.
- Example: Final encoder output shape is  
  `[Batch Size, Sequence Length, Hidden Size]` (e.g., `torch.Size([1, 11, 768])`).

---

## 4. Pooling Layer (Optional – for BERT-like Models)

*Only applicable for models with a pooler (e.g., BERT).*

- For sequence-level tasks (e.g., classification), the representation of the **[CLS]** token (first token) is often used.
- The **pooling layer** applies a transformation to this vector:
  - Typically: linear layer + activation (e.g., Tanh).
- Outputs a fixed-size vector summarizing the entire input sequence.
- Example: Transforms the [CLS] token representation into a vector of shape  
  `[Batch Size, Hidden Size]` (e.g., `torch.Size([1, 768])`).

> *Note: If the model has no pooler (for example, a `BertModel` created with `add_pooling_layer=False`), this step is skipped and only the per-token hidden states are returned.*
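
The pooling step described above is a single dense layer followed by Tanh, applied to the first token's vector. Here is a minimal plain-Python sketch with invented toy weights and a hidden size of 3; `pooler` is an illustrative helper, not the library class.

```python
import math


def pooler(cls_vector, weight, bias):
    """Dense layer plus tanh on the [CLS] vector; weight is [out_features][in_features]."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, cls_vector)) + b)
        for row, b in zip(weight, bias)
    ]


# Toy final hidden states: 2 tokens, hidden size 3 (values chosen arbitrarily).
hidden_states = [[0.2, -0.5, 1.0], [0.7, 0.1, -0.3]]

# Hypothetical pooler weights, square so the output keeps the hidden size.
weight = [[0.1, 0.2, -0.1], [0.4, -0.3, 0.2], [-0.2, 0.1, 0.5]]
bias = [0.0, 0.1, -0.1]

pooled = pooler(hidden_states[0], weight, bias)  # only the first ([CLS]) token is used
print(len(pooled))  # 3
```

Whatever the sequence length, the pooled output is one vector per sequence with every component in the open interval (-1, 1).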

---

## 5. Conclusion

- The forward pass transforms raw text into powerful, context-aware numerical representations.
- This process involves:
  - Embedding lookup
  - Multi-layer contextual encoding via self-attention and FFNs
  - Optional pooling for sequence-level tasks
- The resulting outputs (per-token hidden states or pooled representation) serve as input for downstream NLP tasks such as:
  - Text classification
  - Named entity recognition (NER)
  - Question answering
  - Sentiment analysis

---

## Summary:

### Data Analysis Key Findings

*   A pre-trained BERT model (`bert-base-uncased`) was successfully loaded, including its configuration, revealing 12 hidden layers and a hidden size of 768.
*   Input text was tokenized and converted into numerical embeddings with a shape of `[1, 11, 768]`, demonstrating the output of the model's embedding layer.
*   The self-attention mechanism's operation was illustrated, showing how input embeddings are processed to produce contextually aware representations, maintaining the shape `[1, 11, 768]`.
*   The feed-forward network's operation was demonstrated, confirming that it processes the attention output and maintains the shape `[1, 11, 768]`.
*   The sequential flow of data through the 12 encoder layers was shown, with the tensor shape consistently remaining `[1, 11, 768]` after each layer.
*   The roles of skip connections (aiding gradient flow) and layer normalization (stabilizing training) were explained, noting their placement after the attention and feed-forward sub-layers.
*   A comprehensive summary of the forward pass was provided, detailing the steps from tokenization and embedding through the encoder layers and optional pooling.

### Insights or Next Steps

*   The consistent shape `[1, 11, 768]` throughout the encoder layers shows that the transformer keeps exactly one vector per token from input to output: self-attention refines each vector using context from the other tokens, while the FFN then transforms each position separately.
*   Understanding the individual components (embeddings, attention, FFN, skip connections, normalization) and their sequential application is fundamental to grasping how transformers build complex language representations.
