# Required Assignment 20.1: Refining a Simple Transformer in PyTorch

In this assignment, you will explore the key steps involved in building a simple transformer model for text summarization. The Transformer architecture is one of the most powerful models in natural language processing (NLP), widely used in translation, summarization, and question answering.

The goal of this exercise is not to train a state-of-the-art model but to understand and implement the core building blocks of a transformer in PyTorch. You will learn how raw text is converted into tokens, how positional information is added, how attention mechanisms work, and how the model generates a summary through greedy decoding.

In this assignment, you will complete the following steps:

- Tokenization – Convert raw text into numerical indices using a predefined vocabulary.

- Positional Encoding – Add position information to word embeddings so the model can understand word order.

- Multi-Head Attention – Explore how the model learns relationships between words through the attention mechanism.

- Greedy Decoding – Use the transformer model to generate a simple summary from the input text.

In [20]:
# Import the necessary libraries.
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
import copy

The `TokenEmbedding` class defines a simple embedding layer for converting token indices into dense vector representations.

In [21]:
# --- Transformer components as previously defined ---
class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size, embed_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
    def forward(self, x):
        return self.embedding(x)
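
To make the lookup concrete, here is a minimal sketch of what an embedding layer does with a batch of token indices. The sizes (`vocab_size = 12`, `embed_size = 8`) and the token IDs are illustrative assumptions, not values required by the assignment.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # only to make the demo repeatable

# Hypothetical sizes chosen for illustration.
vocab_size, embed_size = 12, 8
embedding = nn.Embedding(vocab_size, embed_size)

# A batch of 2 sequences, each holding 4 token indices in [0, vocab_size).
token_ids = torch.tensor([[1, 3, 4, 2],
                          [1, 8, 9, 2]])
vectors = embedding(token_ids)

print(vectors.shape)  # (batch, seq_len, embed_size)

# The same index always maps to the same row of the embedding matrix.
same_row = torch.equal(vectors[0, 0], vectors[1, 0])
print(same_row)
```

Each index simply selects one row of a learnable `(vocab_size, embed_size)` matrix, which is why the output gains a trailing `embed_size` dimension.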


The `PositionalEncoding` class implements the positional encoding mechanism used in Transformer models to inject information about the relative or absolute position of tokens in a sequence. Since Transformer architectures do not have recurrence or convolution, positional encodings provide a way to give the model a sense of token order.

In [22]:
class PositionalEncoding(nn.Module):
    def __init__(self, embed_size, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, embed_size)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, embed_size, 2).float() * (-math.log(10000.0) / embed_size))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)
    def forward(self, x):
        return x + self.pe[:, :x.size(1)]
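
The sinusoidal pattern stored in `pe` can be reproduced by hand for a few positions. The sketch below uses only the standard library and assumes a tiny `embed_size` of 4; the helper name `sinusoidal_row` is hypothetical and is not part of the assignment code.

```python
import math

def sinusoidal_row(pos, embed_size):
    """Return the positional-encoding vector for a single position."""
    row = []
    for i in range(0, embed_size, 2):
        # Same frequency as div_term in the class: 1 / 10000^(i / embed_size)
        angle = pos / (10000 ** (i / embed_size))
        row.append(math.sin(angle))  # even index
        row.append(math.cos(angle))  # odd index
    return row[:embed_size]

pe_table = [sinusoidal_row(pos, 4) for pos in range(3)]
for pos, row in enumerate(pe_table):
    print(pos, [round(v, 4) for v in row])
```

Position 0 always encodes to alternating 0 and 1 (sine and cosine of zero), and later dimensions change more slowly with position than earlier ones.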

### Question 1: Positional Encoding Check
Create a tensor of random values with shape `(1, 5, embed_size)` and pass it through the `PositionalEncoding` class.  
Print the shape of the output and confirm it matches the input shape.

In [23]:
### GRADED CELL
def test_positional_encoding(embed_size):
    # YOUR CODE HERE
    #raise NotImplementedError()

    # create input tensor: (batch_size=1, seq_len=5, embed_size)
    x = torch.randn(1, 5, embed_size)

    # instantiate positional encoding
    pe = PositionalEncoding(embed_size, max_len=5)

    # pass through positional encoding
    out = pe(x)

    # return the shape so the visible test can print it
    return out.shape
    
# Visible test
print(test_positional_encoding(16))  # Expected: torch.Size([1, 5, 16])


torch.Size([1, 5, 16])


In [24]:
class MultiHeadAttention(nn.Module):
    def __init__(self, embed_size, heads):
        super().__init__()
        self.embed_size = embed_size
        self.heads = heads
        self.head_dim = embed_size // heads
        assert self.head_dim * heads == embed_size, "Embed size must be divisible by heads"
        self.query_linear = nn.Linear(embed_size, embed_size)
        self.key_linear = nn.Linear(embed_size, embed_size)
        self.value_linear = nn.Linear(embed_size, embed_size)
        self.fc_out = nn.Linear(embed_size, embed_size)
    def forward(self, query, key, value, mask=None):
        N = query.shape[0]
        query_len, key_len, value_len = query.shape[1], key.shape[1], value.shape[1]
        queries = self.query_linear(query).view(N, query_len, self.heads, self.head_dim)
        keys = self.key_linear(key).view(N, key_len, self.heads, self.head_dim)
        values = self.value_linear(value).view(N, value_len, self.heads, self.head_dim)
        queries = queries.permute(0, 2, 1, 3)
        keys = keys.permute(0, 2, 1, 3)
        values = values.permute(0, 2, 1, 3)
        energy = torch.matmul(queries, keys.transpose(-2, -1)) / math.sqrt(self.head_dim)
        if mask is not None:
            energy = energy.masked_fill(mask == 0, float("-inf"))
        attention = torch.softmax(energy, dim=-1)
        out = torch.matmul(attention, values)
        out = out.permute(0, 2, 1, 3).contiguous()
        out = out.view(N, query_len, self.embed_size)
        out = self.fc_out(out)
        return out
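
The core computation inside `MultiHeadAttention.forward` is scaled dot-product attention. The following single-head sketch leaves out the learned linear projections and the head split, and returns the attention weights as well so they can be inspected. The function name `single_head_attention` is an illustrative assumption, not part of the assignment code.

```python
import math
import torch

def single_head_attention(q, k, v, mask=None):
    """Scaled dot-product attention; q, k, v have shape (batch, seq_len, dim)."""
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(q.size(-1))
    if mask is not None:
        # Positions where mask == 0 receive zero attention weight after softmax.
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return torch.matmul(weights, v), weights

torch.manual_seed(0)
x = torch.randn(1, 5, 16)
out, weights = single_head_attention(x, x, x)

print(out.shape)            # same shape as the input
print(weights.shape)        # one weight per (query, key) pair
print(weights.sum(dim=-1))  # each row sums to 1, up to float rounding
```

The class above runs this computation once per head on `head_dim`-sized slices and then concatenates the results before `fc_out`.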

### Question 2: Exploration of Attention Weights
Create a random input tensor of shape `(1, 5, embed_size)` and pass it through the `MultiHeadAttention` class.  
Print the shape of the attention output and confirm it matches the input shape.

In [25]:
### GRADED CELL
def test_attention(embed_size, heads):
    # YOUR CODE HERE
    #raise NotImplementedError()

    # random input: (batch_size=1, seq_len=5, embed_size)
    x = torch.randn(1, 5, embed_size)

    # instantiate multi-head attention
    attn = MultiHeadAttention(embed_size, heads)

    # self-attention: query, key, and value are the same input
    out = attn(x, x, x)

    # return the shape to confirm it matches the input shape
    return out.shape

# Visible test
print(test_attention(16, 4))  # Expected: torch.Size([1, 5, 16])


torch.Size([1, 5, 16])


The `FeedForward` class implements a simple two-layer fully connected feedforward network, commonly used within transformer architectures.

In [9]:
class FeedForward(nn.Module):
    def __init__(self, embed_size, forward_expansion):
        super().__init__()
        self.fc1 = nn.Linear(embed_size, forward_expansion * embed_size)
        self.fc2 = nn.Linear(forward_expansion * embed_size, embed_size)
    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

The `EncoderLayer` class represents a single layer of a transformer encoder, combining multi-head self-attention and a feedforward network with residual connections and normalization.

In [10]:
class EncoderLayer(nn.Module):
    def __init__(self, embed_size, heads, forward_expansion, dropout):
        super().__init__()
        self.mha = MultiHeadAttention(embed_size, heads)
        self.ff = FeedForward(embed_size, forward_expansion)
        self.norm1 = nn.LayerNorm(embed_size)
        self.norm2 = nn.LayerNorm(embed_size)
        self.dropout = nn.Dropout(dropout)
    def forward(self, x, mask):
        attn_out = self.mha(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_out))
        ff_out = self.ff(x)
        x = self.norm2(x + self.dropout(ff_out))
        return x
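
Each sublayer in `EncoderLayer` follows the same pattern: add the sublayer output to its input, then apply layer normalization. The sketch below isolates that step, using random tensors as stand-ins for real activations and omitting dropout; all values are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_size = 16
norm = nn.LayerNorm(embed_size)

x = torch.randn(1, 5, embed_size)             # sublayer input
sublayer_out = torch.randn(1, 5, embed_size)  # stand-in for attention or feedforward output

# Residual add, then normalize over the last (embedding) dimension.
y = norm(x + sublayer_out)

print(y.shape)
print(y.mean(dim=-1))  # close to 0 for every token with freshly initialized LayerNorm
```

The residual path keeps the original signal available to later layers, while normalization keeps each token vector on a comparable scale.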


The `DecoderLayer` class represents a single layer of a transformer decoder, integrating self-attention, encoder-decoder cross-attention, and a feedforward network with residual connections and normalization.

In [11]:
class DecoderLayer(nn.Module):
    def __init__(self, embed_size, heads, forward_expansion, dropout):
        super().__init__()
        self.self_attn = MultiHeadAttention(embed_size, heads)
        self.cross_attn = MultiHeadAttention(embed_size, heads)
        self.ff = FeedForward(embed_size, forward_expansion)
        self.norm1 = nn.LayerNorm(embed_size)
        self.norm2 = nn.LayerNorm(embed_size)
        self.norm3 = nn.LayerNorm(embed_size)
        self.dropout = nn.Dropout(dropout)
    def forward(self, x, enc_out, src_mask, tgt_mask):
        self_attn_out = self.self_attn(x, x, x, tgt_mask)
        x = self.norm1(x + self.dropout(self_attn_out))
        cross_attn_out = self.cross_attn(x, enc_out, enc_out, src_mask)
        x = self.norm2(x + self.dropout(cross_attn_out))
        ff_out = self.ff(x)
        x = self.norm3(x + self.dropout(ff_out))
        return x


The `TransformerSummarizer` class implements a complete sequence-to-sequence transformer model for text summarization. It combines token embeddings, positional encodings, stacked encoder and decoder layers, and a final linear layer to generate output tokens.

In [12]:
class TransformerSummarizer(nn.Module):
    def __init__(self, vocab_size, embed_size=256, num_layers=2, heads=4, forward_expansion=4, dropout=0.1, max_len=100):
        super().__init__()
        self.token_emb = TokenEmbedding(vocab_size, embed_size)
        self.pos_enc = PositionalEncoding(embed_size, max_len)
        self.encoder_layers = nn.ModuleList([EncoderLayer(embed_size, heads, forward_expansion, dropout) for _ in range(num_layers)])
        self.decoder_layers = nn.ModuleList([DecoderLayer(embed_size, heads, forward_expansion, dropout) for _ in range(num_layers)])
        self.fc_out = nn.Linear(embed_size, vocab_size)
        self.dropout = nn.Dropout(dropout)

    def make_src_mask(self, src):
        return (src != 0).unsqueeze(1).unsqueeze(2)

    def make_tgt_mask(self, tgt):
        N, tgt_len = tgt.shape
        tgt_mask = (tgt != 0).unsqueeze(1).unsqueeze(2)
        subsequent_mask = torch.tril(torch.ones((tgt_len, tgt_len), device=tgt.device)).bool()
        tgt_mask = tgt_mask & subsequent_mask
        return tgt_mask

    def forward(self, src, tgt):
        src_mask = self.make_src_mask(src)
        tgt_mask = self.make_tgt_mask(tgt)

        enc_out = self.token_emb(src)
        enc_out = self.pos_enc(enc_out)
        enc_out = self.dropout(enc_out)

        for layer in self.encoder_layers:
            enc_out = layer(enc_out, src_mask)

        dec_out = self.token_emb(tgt)
        dec_out = self.pos_enc(dec_out)
        dec_out = self.dropout(dec_out)

        for layer in self.decoder_layers:
            dec_out = layer(dec_out, enc_out, src_mask, tgt_mask)

        final_out = self.fc_out(dec_out)
        return final_out
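
To see what the two masks look like, the sketch below mirrors the logic of `make_src_mask` and `make_tgt_mask` in a standalone function. The name `build_masks` and the example token IDs are assumptions made for illustration.

```python
import torch

def build_masks(src, tgt, pad_idx=0):
    """Return (src_mask, tgt_mask) as boolean tensors; True means 'may attend'."""
    src_mask = (src != pad_idx).unsqueeze(1).unsqueeze(2)  # (N, 1, 1, src_len)
    tgt_pad = (tgt != pad_idx).unsqueeze(1).unsqueeze(2)   # (N, 1, 1, tgt_len)
    tgt_len = tgt.shape[1]
    # Lower-triangular matrix: position i may look at positions 0..i only.
    causal = torch.tril(torch.ones((tgt_len, tgt_len), device=tgt.device)).bool()
    return src_mask, tgt_pad & causal                      # (N, 1, tgt_len, tgt_len)

src = torch.tensor([[3, 4, 5, 0]])  # last position is padding
tgt = torch.tensor([[1, 3, 4]])     # [SOS] followed by two tokens
src_mask, tgt_mask = build_masks(src, tgt)

print(src_mask.shape, tgt_mask.shape)
print(tgt_mask[0, 0].int())
```

The singleton dimensions let one mask broadcast across all heads, and across all query positions in the case of the source mask.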


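The next cell defines a small demonstration vocabulary that maps each token to an integer index (`word2idx`) and back (`idx2word`). Sequence-to-sequence models rely on such a mapping to convert text into numerical representations that can be fed into neural networks.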

In [13]:
# -- Simple Tokenizer and Vocabulary --
# Tokens: [PAD]=0, [SOS]=1, [EOS]=2, and some words for demo
word2idx = {
    '[PAD]': 0, '[SOS]': 1, '[EOS]': 2,
    'the': 3, 'cat': 4, 'sat': 5, 'on': 6, 'mat': 7,
    'a': 8, 'dog': 9, 'is': 10, 'here': 11,
}
idx2word = {v:k for k,v in word2idx.items()}

### Question 3: Tokenization Practice
Write a function `my_tokenize` that converts a given sentence into a list of indices using the provided `word2idx` dictionary.  
Test it on the sentence: `"The cat sat on the mat"`.

In [15]:
### GRADED CELL
def my_tokenize(text):
    # YOUR CODE HERE
    #raise NotImplementedError()

    tokens = []
    for word in text.split():
        tokens.append(word2idx.get(word.lower(), word2idx['[PAD]']))  # default to PAD if not found
    return tokens
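
Because the function above splits on whitespace only, a word with attached punctuation such as `mat.` would not match `mat` and would fall back to index 0. One possible variant, sketched here under two assumptions that are not part of the assignment (a dedicated `[UNK]` entry at index 12 and a regular expression that keeps alphabetic runs only), handles both cases:

```python
import re

# Hypothetical extension of the demo vocabulary: a dedicated [UNK] index.
demo_vocab = {
    '[PAD]': 0, '[SOS]': 1, '[EOS]': 2,
    'the': 3, 'cat': 4, 'sat': 5, 'on': 6, 'mat': 7,
    'a': 8, 'dog': 9, 'is': 10, 'here': 11,
    '[UNK]': 12,
}

def tokenize_with_unk(text, vocab=demo_vocab):
    """Lowercase, keep only alphabetic runs, and map unseen words to [UNK]."""
    words = re.findall(r"[a-z]+", text.lower())
    return [vocab.get(word, vocab['[UNK]']) for word in words]

print(tokenize_with_unk("The cat sat on the mat."))
print(tokenize_with_unk("A bird is here!"))
```

Keeping `[UNK]` separate from `[PAD]` means unknown words are no longer hidden by the padding mask, at the cost of one extra vocabulary entry.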

This workflow covers text preprocessing and generating summaries using a Transformer model:

- Tokenization: Converts sentences into sequences of integer token IDs using a predefined vocabulary. Words not in the vocabulary are mapped to index 0, which is also the `[PAD]` index, so the source mask hides them from attention.

- Detokenization: Converts sequences of token IDs back into readable text, stopping at the end-of-sequence token.

- Greedy Decoding: Generates summaries by iteratively selecting the most probable next token until the end-of-sequence token is produced or the maximum length is reached.

This approach allows models to transform raw text into numerical representations, process it with neural networks, and produce human-readable summaries.

In [16]:
def tokenize(text):
    return [word2idx.get(word, 0) for word in text.lower().split()]

def detokenize(indices):
    words = []
    for idx in indices:
        if idx == word2idx['[EOS]']:
            break
        words.append(idx2word.get(idx, '[UNK]'))
    return ' '.join(words)

# Greedy decoding for summary generation
def generate_summary(model, src_sentence, max_len=10, device='cpu'):
    model.eval()
    src_tokens = torch.tensor([tokenize(src_sentence)], dtype=torch.long, device=device)
    # Assume [PAD]=0, no padding needed here for single sentence
    tgt_tokens = torch.tensor([[word2idx['[SOS]']]], dtype=torch.long, device=device)

    for _ in range(max_len):
        with torch.no_grad():
            output = model(src_tokens, tgt_tokens)
        next_token_logits = output[:, -1, :]
        next_token = next_token_logits.argmax(dim=-1).unsqueeze(1)
        tgt_tokens = torch.cat((tgt_tokens, next_token), dim=1)
        # Stop if EOS is predicted; otherwise decoding ends at max_len.
        if next_token.item() == word2idx['[EOS]']:
            break
    summary = tgt_tokens[0,1:].cpu().tolist()  # remove [SOS]
    return detokenize(summary)

# Initialize model with small vocab for demo
vocab_size = len(word2idx)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = TransformerSummarizer(vocab_size, embed_size=64, num_layers=2, heads=4)
model = model.to(device)

# For demo, load dummy trained weights or keep random (won't generate meaningful summary)
# Normally you would train the model here on a summarization dataset.
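
The control flow of greedy decoding can be followed without a neural network. In the sketch below, a hypothetical score table (`toy_scores`) stands in for the model, and it conditions only on the previous token, whereas the transformer above conditions on the whole prefix and the source sentence. The loop structure is otherwise the same as in `generate_summary`.

```python
SOS, EOS = 1, 2

# Hypothetical scores: toy_scores[prev][candidate] is the score of the next token.
toy_scores = {
    1: {3: 0.9, 4: 0.1, 2: 0.0},  # after [SOS], prefer token 3
    3: {4: 0.7, 2: 0.2, 3: 0.1},  # after 3, prefer token 4
    4: {2: 0.8, 3: 0.2, 4: 0.0},  # after 4, prefer [EOS]
}

def greedy_decode(score_table, max_len=10):
    """Pick the highest-scoring next token at each step, starting from [SOS]."""
    sequence = [SOS]
    for _ in range(max_len):
        candidates = score_table[sequence[-1]]
        next_token = max(candidates, key=candidates.get)  # argmax over candidates
        sequence.append(next_token)
        if next_token == EOS:
            break
    return sequence[1:]  # drop [SOS]

print(greedy_decode(toy_scores))             # stops when [EOS] is chosen
print(greedy_decode(toy_scores, max_len=1))  # stops at the length limit
```

Greedy decoding commits to the single best token at every step, so it is fast but can miss sequences whose overall score is higher.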



### Question 4: Greedy Decoding Demonstration
Use the provided `generate_summary` function to generate a summary for the input sentence:  

`"My name is Anushka. I am working on a project in AI."`  

Return the generated summary as a string.


In [17]:
### GRADED CELL
def run_summary():
    # YOUR CODE HERE
    #raise NotImplementedError()

    input_sentence = "My name is Anushka. I am working on a project in AI."
    summary = generate_summary(model, input_sentence, max_len=5, device=device)
    return summary

# Visible test
print(run_summary())  # Output will be random (model is untrained), but should be a string


on


In [18]:
# Example input sentence
input_sentence = "My name is Anushka. I am working on a project in AI. I love working on this project and am happy to see the wonders happening because of AI. AI helps solve real-world problems and makes processes faster and smarter. Through this project, I am learning how machines can analyze data and make predictions. It excites me to see how AI can be applied in healthcare, education, and business. I enjoy experimenting with different models and improving their accuracy. This journey motivates me to explore deeper into the field of AI."
summary = generate_summary(model, input_sentence, max_len=5, device=device)
input_word_count = len(input_sentence.split())
summary_word_count = len(summary.split())


print("Input:", input_sentence)
print("Number of words in input:", input_word_count)
print("Generated summary:", summary)
print("Number of words in summary:", summary_word_count)

Input: My name is Anushka. I am working on a project in AI. I love working on this project and am happy to see the wonders happening because of AI. AI helps solve real-world problems and makes processes faster and smarter. Through this project, I am learning how machines can analyze data and make predictions. It excites me to see how AI can be applied in healthcare, education, and business. I enjoy experimenting with different models and improving their accuracy. This journey motivates me to explore deeper into the field of AI.
Number of words in input: 91
Generated summary: on
Number of words in summary: 1


This workflow demonstrates how to perform abstractive text summarization using the pretrained T5 (Text-to-Text Transfer Transformer) model from Hugging Face. The T5 model treats every NLP task as a text-to-text problem, making it highly flexible for tasks like summarization.

In [19]:
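# Install required libraries if not already installed.
# T5Tokenizer depends on sentencepiece; the ImportError in this cell's output shows protobuf is required as well.
# !pip install transformers torch sentencepiece protobuf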

from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load pretrained T5 tokenizer and model
model_name = "t5-small"
tokenizer = T5Tokenizer.from_pretrained(model_name, legacy=False)
# Use a distinct name so the TransformerSummarizer stored in `model` above is not overwritten
t5_model = T5ForConditionalGeneration.from_pretrained(model_name)

def summarize(text):
    # Prepend the task prefix for T5
    input_text = "summarize: " + text
    # Tokenize input text and generate input IDs
    inputs = tokenizer.encode(input_text, return_tensors="pt", max_length=512, truncation=True)
    # Generate summary ids using beam search for better quality
    summary_ids = t5_model.generate(
        inputs,
        max_length=150,
        min_length=40,
        length_penalty=2.0,
        num_beams=4,
        early_stopping=True
    )
    # Decode the generated summary ids to string
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

# Example input text
input_text = (
    "My name is Anushka. I am working on a project in AI. I love working on this project and am happy to see the wonders happening because of AI. AI helps solve real-world problems and makes processes faster and smarter. Through this project, I am learning how machines can analyze data and make predictions. It excites me to see how AI can be applied in healthcare, education, and business. I enjoy experimenting with different models and improving their accuracy. This journey motivates me to explore deeper into the field of AI. I am also learning how important data quality is for building reliable AI systems. Collaborating with my peers on this project gives me new ideas and perspectives. I hope to apply the knowledge I gain here to create solutions that benefit society. AI has a bright future, and being part of this field makes me feel inspired. My long-term goal is to keep researching and contributing to meaningful innovations in AI."
)

summary = summarize(input_text)
# Count words in input and summary
input_word_count = len(input_text.split())
summary_word_count = len(summary.split())

print("Input text:", input_text)
print("Number of words in input:", input_word_count)
print("\nGenerated summary:", summary)
print("Number of words in summary:", summary_word_count)


ImportError: 
 requires the protobuf library but it was not found in your environment. Check out the instructions on the
installation page of its repo: https://github.com/protocolbuffers/protobuf/tree/master/python#installation and follow the ones
that match your environment. Please note that you may need to restart your runtime after installation.
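
One way to catch a missing dependency like this before loading the tokenizer is to check whether the relevant packages can be found. The sketch below uses only the standard library. The helper name `missing_modules` is hypothetical, and the import names `sentencepiece` and `google.protobuf` are assumptions about how those packages are imported.

```python
import importlib.util

def missing_modules(module_names):
    """Return the names from module_names that cannot be found for import."""
    missing = []
    for name in module_names:
        try:
            found = importlib.util.find_spec(name) is not None
        except ModuleNotFoundError:
            # Raised when a parent package (e.g. 'google') is itself absent.
            found = False
        if not found:
            missing.append(name)
    return missing

# Import names assumed here: 'sentencepiece' and 'google.protobuf'.
print(missing_modules(["sentencepiece", "google.protobuf"]))
```

An empty list suggests both packages are importable. Any name that is printed would need to be installed, and the runtime restarted, before rerunning the cell.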
