# 🧠 **Week 2 Lab: Text Summarization using T5**

**Course:** AI & Machine Learning Fundamentals
**Topic:** Natural Language Processing (NLP), Sequence Models & Transformers

## 🎯 **Learning Objectives**

Upon successful completion of this laboratory session, participants will be equipped to:

1.  Comprehend foundational text preprocessing methodologies within NLP.
2.  Articulate the principles of Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTMs) for effective sequence modeling.
3.  Detail the operational mechanics of the Transformer architecture and the Self-Attention mechanism.
4.  Develop a functional Text Summarization application leveraging a pre-trained T5 Transformer model.

***

## 🧩 **Section 1: NLP Fundamentals & Text Preprocessing**

Natural Language Processing (NLP) is the computational discipline dedicated to enabling computer systems to analyze, interpret, and generate human language.

### **Text Preprocessing Necessity**

Before training any NLP model, raw text must be cleaned and structured. This process, called **text preprocessing**, improves data quality and reduces noise.

### ⚙️ **Common Preprocessing Steps**

| Step | Description |
| :--- | :--- |
| **Tokenization** | Breaking text into smaller parts like words or subwords. |
| **Stopword Removal** | Removing frequently used words (like *the*, *is*, and *and*) that add little value. |
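| **Stemming/Lemmatization** | Reducing words to a base form. Stemming strips suffixes heuristically and may produce non-words (e.g., *language* → *languag*); lemmatization maps each word to its dictionary form (e.g., *running* → *run*). |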


<div class="alert alert-block alert-info">

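<b>💡 Good Practice:</b> For classical pipelines (for example, bag-of-words classifiers), careful preprocessing reduces noise and vocabulary size, which typically speeds up training and improves results. Pre-trained Transformers such as T5 ship with their own subword tokenizer and expect raw text, so stopword removal and stemming are not applied to their input.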
</div>
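
To make the three steps in the table concrete, here is a minimal sketch in plain Python with no external dependencies. The stopword set `TOY_STOPWORDS` and the suffix rules in `crude_stem` are illustrative assumptions, far simpler than NLTK's stopword list and the Porter algorithm.

```python
import re

# Assumption: a tiny illustrative stopword set, not NLTK's full English list.
TOY_STOPWORDS = {"the", "is", "are", "and", "to", "from", "it", "a", "an", "of"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stopwords(tokens):
    """Drop tokens that appear in the toy stopword set."""
    return [t for t in tokens if t not in TOY_STOPWORDS]


def crude_stem(token):
    """Strip one common suffix; a toy rule set, not the Porter algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stopwords, then stem each remaining token."""
    return [crude_stem(t) for t in remove_stopwords(tokenize(text))]


print(preprocess("The computers are processing the texts and deriving meaning."))
```

As with a real stemmer, some results (such as `deriv`) are not dictionary words; they only need to be consistent across word variants.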

***

### **Step 1.1: Import Libraries and Download Necessary Data**

In [12]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

# Download required resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

### **Code Explanation**

The code block above imports the required components and downloads the data they depend on:

| Import / Function | Purpose in NLP |
| :--- | :--- |
| `import nltk` | Imports the foundational Natural Language Toolkit library. |
| `from nltk.corpus import stopwords` | Imports the list of common stop words (e.g., 'a', 'an', 'the') that often need to be removed to focus on meaningful content. |
| `from nltk.tokenize import word_tokenize` | Imports the function used to break raw text into individual words or tokens. |
| `from nltk.stem import PorterStemmer` | Imports the specific algorithm for **Stemming**, which reduces words to their root or base form. |
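| `nltk.download('punkt')` | Downloads the `punkt` resource, which contains pre-trained Punkt sentence-tokenizer models for several languages. `word_tokenize` uses them to split text into sentences before splitting into words. |
| `nltk.download('stopwords')` | Downloads the corpus containing the standard English **stopword list**. |
| `nltk.download('punkt_tab')` | Downloads `punkt_tab`, the Punkt data in the tabular format that recent NLTK releases load in place of `punkt`. |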


***

### **Step 1.2: Preprocess a Sample Text**

In [14]:
# Sample text
text = "Natural Language Processing allows computers to understand human language and derive meaning from it."

# Tokenization
tokens = word_tokenize(text)
print("Tokens:", tokens)

# Stopword Removal
stops = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stops]
print("\nAfter Removing Stopwords:", filtered_tokens)

# Stemming
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
print("\nAfter Stemming:", stemmed_tokens)

Tokens: ['Natural', 'Language', 'Processing', 'allows', 'computers', 'to', 'understand', 'human', 'language', 'and', 'derive', 'meaning', 'from', 'it', '.']

After Removing Stopwords: ['Natural', 'Language', 'Processing', 'allows', 'computers', 'understand', 'human', 'language', 'derive', 'meaning', '.']

After Stemming: ['natur', 'languag', 'process', 'allow', 'comput', 'understand', 'human', 'languag', 'deriv', 'mean', '.']


### **Reflection Questions**

Take a moment to consider the theoretical importance of the preprocessing steps you just performed. Document your answers to the following questions:

1.  **Stopword Rationale (🧠):**
    Explain the primary reasons for removing stopwords before training an NLP model. Focus on the benefits related to **computational efficiency** and the **quality/signal** of the input data.

2.  **Normalization Impact (🔍):**
    Discuss the potential consequences for an NLP model (like a text summarizer) if we do *not* normalize words using techniques like Stemming or Lemmatization. How does failing to normalize impact the model's understanding of word relationships?

---

## 🌀 **Section 2: Sequence Models – RNNs & LSTMs**

Before the advent of the Transformer architecture, the standard method for modeling text sequences relied on Recurrent Neural Networks (RNNs) and their advanced variant, Long Short-Term Memory (LSTM) networks.

---

### 🧠 Operational Principles

* **Recurrent Neural Networks (RNNs):** RNNs process sequential data element by element, using a hidden state to carry information forward. The output from the processing of the current word becomes part of the input for the next.
    * **Limitation:** RNNs are prone to the **vanishing gradient problem**, which causes them to struggle with retaining relevant information from earlier parts of long sequences (the "long-term dependency" issue).
* **Long Short-Term Memory (LSTMs):** LSTMs were specifically designed as a solution to the vanishing gradient problem. They utilize specialized **memory gates** (Input, Forget, and Output gates) that regulate the flow of information, allowing the network to selectively store or discard data across long time steps.
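
The vanishing gradient problem described above can be observed numerically. The sketch below assumes a one-unit RNN, `h_t = tanh(w * h_{t-1} + x_t)`, with a hypothetical recurrent weight `W_RECURRENT = 0.9`, and multiplies the per-step derivatives to obtain the sensitivity of the last hidden state to the first one.

```python
import math


def gradient_through_time(w, inputs, h0=0.0):
    """Return |d h_T / d h_0| for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h = h0
    grad = 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        # Chain rule: d h_t / d h_{t-1} = w * (1 - h_t^2)
        grad *= w * (1.0 - h * h)
    return abs(grad)


W_RECURRENT = 0.9  # hypothetical weight chosen for illustration

for steps in (5, 20, 50):
    g = gradient_through_time(W_RECURRENT, [0.5] * steps)
    print(f"{steps:>2} steps: |dh_T/dh_0| = {g:.3e}")
```

Each factor in the product is smaller than 1 here, so the gradient shrinks roughly exponentially with sequence length, and early tokens have almost no influence on the weight updates.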

---

### 📊 Conceptual Flow Example: Sentiment Analysis

- Consider the sentence: *"The food was not bad."*

- To correctly determine the sentiment, the model must maintain the context of the negating word *"not"* when it processes the adjective *"bad."*

* A **basic RNN** can lose this context when many words separate the negation from the word it modifies, because the influence of *"not"* on the hidden state fades with every additional token. In a sentence this short the gap is tiny, but the same dependency can span dozens of tokens in real text.
* An **LSTM** handles such cases better because its cell state and gates can carry the presence of *"not"* forward until it is needed, allowing for an accurate interpretation.
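
The difference in memory can be sketched with a one-unit LSTM cell written in plain Python. All values in `PARAMS` are hand-picked assumptions (a forget gate held near 1 and an input gate that opens only when the input is 1), not trained weights; the plain RNN uses a hypothetical recurrent weight of 0.5.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell with forget, input, and output gates."""
    f = sigmoid(p["w_f"] * x + p["u_f"] * h_prev + p["b_f"])    # forget gate
    i = sigmoid(p["w_i"] * x + p["u_i"] * h_prev + p["b_i"])    # input gate
    o = sigmoid(p["w_o"] * x + p["u_o"] * h_prev + p["b_o"])    # output gate
    g = math.tanh(p["w_g"] * x + p["u_g"] * h_prev + p["b_g"])  # candidate value
    c = f * c_prev + i * g   # keep part of the old memory, add part of the new
    h = o * math.tanh(c)     # expose part of the memory as the hidden state
    return h, c


# Hand-picked, hypothetical parameters (not learned).
PARAMS = {
    "w_f": 0.0, "u_f": 0.0, "b_f": 6.0,    # forget gate stays close to 1
    "w_i": 12.0, "u_i": 0.0, "b_i": -6.0,  # input gate opens only when x = 1
    "w_o": 0.0, "u_o": 0.0, "b_o": 6.0,    # output gate stays close to 1
    "w_g": 4.0, "u_g": 0.0, "b_g": 0.0,
}


def run_lstm(inputs):
    h, c = 0.0, 0.0
    for x in inputs:
        h, c = lstm_step(x, h, c, PARAMS)
    return h, c


def run_plain_rnn(inputs, w=0.5):
    h = 0.0
    for x in inputs:
        h = math.tanh(w * h + x)
    return h


# One informative input (think of it as the word "not") followed by 20 neutral steps.
signal = [1.0] + [0.0] * 20
h_lstm, c_lstm = run_lstm(signal)
h_rnn = run_plain_rnn(signal)
print(f"LSTM cell state after 20 neutral steps: {c_lstm:.3f}")
print(f"Plain RNN hidden state after 20 neutral steps: {h_rnn:.2e}")
```

With these settings the LSTM cell state stays close to its initial value, while the plain RNN's hidden state decays toward zero.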

---

### **Step 2.1: Demo – How an RNN Processes Data**

In [15]:
import torch
import torch.nn as nn

# Define a simple RNN layer
rnn = nn.RNN(input_size=5, hidden_size=3, batch_first=True)

# Example input: 1 sequence, 4 timesteps, each with 5 features
inputs = torch.randn(1, 4, 5)
print("Input shape:", inputs.shape)

# Initial hidden state
h0 = torch.zeros(1, 1, 3)

# Forward pass
output, hn = rnn(inputs, h0)

print("\nOutput shape:", output.shape)
print("Hidden state shape:", hn.shape)

Input shape: torch.Size([1, 4, 5])

Output shape: torch.Size([1, 4, 3])
Hidden state shape: torch.Size([1, 1, 3])


### **Code Explanation**

The demonstration illustrates how an RNN processes a sequence and produces an output and a final hidden state.

| Variable | Description | Shape / Value |
| :--- | :--- | :--- |
| `input_size=5` | Defines the dimension of the input vector at each timestep (e.g., the size of a word embedding). | $5$ |
| `hidden_size=3` | Defines the size of the RNN's internal "memory" or context vector. | $3$ |
| `inputs` | The sequence fed into the network. | $\mathbf{(1, 4, 5)}$: (Batch Size, Sequence Length, Features) |
| `h0` | The initial hidden state, which is typically set to zero. | $\mathbf{(1, 1, 3)}$: (Layers, Batch Size, Hidden Size) |
| `output` | The hidden state output from *every* timestep in the sequence. | $\mathbf{(1, 4, 3)}$: (Batch Size, Sequence Length, Hidden Size) |
| `hn` | The final hidden state produced after processing the *last* timestep. This is the final memory of the network for the entire sequence. | $\mathbf{(1, 1, 3)}$: (Layers, Batch Size, Hidden Size) |

---

<div class="alert alert-block alert-info">


<b>Key Takeaway:</b>
Notice that **output** has a Sequence Length dimension of $4$, because it provides a context vector (of size 3) for every input timestep. The **hidden state ($\text{hn}$)** has no sequence dimension at all: its leading $1$ is the number of layers, and it holds only the final state summarizing the whole input. For this single-layer RNN, $\text{hn}$ equals the last timestep of **output**.
</div>
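
The relationship between `output` and `hn` can be checked with a hand-written recurrence. This is a minimal sketch in plain Python that mirrors the demo's sizes (5 input features, 3 hidden units, 4 timesteps) but omits the batch dimension and uses a single bias vector and random weights as simplifying assumptions; it is not PyTorch's implementation.

```python
import math
import random


def rnn_forward(inputs, w_ih, w_hh, bias, h0):
    """Elman RNN: h_t = tanh(W_ih x_t + W_hh h_{t-1} + b). Returns (outputs, h_n)."""
    h = h0
    outputs = []
    for x in inputs:
        h = [
            math.tanh(
                sum(w * xi for w, xi in zip(w_ih[j], x))
                + sum(w * hi for w, hi in zip(w_hh[j], h))
                + bias[j]
            )
            for j in range(len(h0))
        ]
        outputs.append(h)  # one hidden vector per timestep
    return outputs, h      # h is the state after the last timestep


random.seed(0)
INPUT_SIZE, HIDDEN_SIZE, SEQ_LEN = 5, 3, 4


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


w_ih = rand_matrix(HIDDEN_SIZE, INPUT_SIZE)
w_hh = rand_matrix(HIDDEN_SIZE, HIDDEN_SIZE)
bias = [0.0] * HIDDEN_SIZE
inputs = rand_matrix(SEQ_LEN, INPUT_SIZE)

outputs, h_n = rnn_forward(inputs, w_ih, w_hh, bias, [0.0] * HIDDEN_SIZE)
print("Timesteps in outputs:", len(outputs), "| hidden size:", len(outputs[0]))
print("Last output equals final hidden state:", outputs[-1] == h_n)
```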

---

### 💬 Reflection

Answer the following questions to consolidate your understanding of sequence models:

1.  **RNN Long Sequence Challenge (💭):**
    Explain the core reason why traditional Recurrent Neural Networks (RNNs) struggle to maintain relevant context when processing extremely long sentences. What is the specific technical phenomenon that causes this "forgetting"?

2.  **LSTM Mechanism (🧩):**
    Briefly describe the mechanism by which Long Short-Term Memory (LSTM) networks overcome the limitations of basic RNNs, allowing them to effectively learn and retain long-range dependencies in sequence data.

---

## ⚡ **Section 3: Transformer Architecture & Self-Attention**

The introduction of the Transformer architecture in 2017 (in the paper *Attention Is All You Need*) fundamentally revolutionized Natural Language Processing by entirely replacing recurrent layers.

Instead of processing sequences word by word (recurrence), Transformers analyze all input elements *simultaneously* using the **Self-Attention** mechanism.

### 🔑 **Core Concept: Self-Attention**

The Self-Attention mechanism is the key innovation, allowing the model to weigh the importance of all other words in a sentence relative to the word currently being processed. This enables robust context understanding.

* **Example:** In the sentence *"The animal didn’t cross because it was too tired,"* the model uses self-attention to determine that the pronoun *"it"* should be strongly linked to *"animal."*
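
The weighting idea can be sketched as scaled dot-product attention in plain Python. As simplifying assumptions, the queries, keys, and values are all the raw token vectors (a real Transformer applies learned projections and several attention heads), and the 2-dimensional embeddings in `EMBEDDINGS` are hand-picked so that *it* lies close to *animal*.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with Q = K = V = vectors."""
    d = len(vectors[0])
    outputs, weights = [], []
    for query in vectors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in vectors
        ]
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted average of all value vectors.
        outputs.append(
            [sum(w[j] * vectors[j][c] for j in range(len(vectors))) for c in range(d)]
        )
    return outputs, weights


# Hypothetical toy embeddings, not taken from any trained model.
TOKENS = ["animal", "tired", "it"]
EMBEDDINGS = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]

outputs, weights = self_attention(EMBEDDINGS)
for token, w in zip(TOKENS, weights[TOKENS.index("it")]):
    print(f"attention from 'it' to '{token}': {w:.3f}")
```

With these vectors, *it* places more weight on *animal* than on *tired*, which is the behavior the example sentence describes.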

### ✳️ **Positional Encoding**

Since Transformers process all words in parallel, they lose the inherent sequential order that RNNs maintain. To fix this, **Positional Encoding** is added to the input embeddings.

* This encoding provides the model with information about the relative or absolute position of each token in the sequence.
* **Crucial Role:** Without positional encoding, the model would fail to distinguish between sentences like *"dog bites man"* and *"man bites dog."*
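
One common scheme is the fixed sinusoidal encoding from the original Transformer paper, sketched below in plain Python. This is an illustration of the general idea only: T5 itself uses learned relative position biases rather than this formula, and the one-hot vectors in `TOY_EMBEDDINGS` are hypothetical.

```python
import math


def positional_encoding(seq_len, d_model):
    """Sinusoidal encoding: sin on even dimensions, cos on odd dimensions."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


# Hypothetical one-hot "embeddings" for three words.
TOY_EMBEDDINGS = {
    "dog": [1.0, 0.0, 0.0, 0.0],
    "bites": [0.0, 1.0, 0.0, 0.0],
    "man": [0.0, 0.0, 1.0, 0.0],
}


def encode(tokens):
    """Add the positional encoding to each token's embedding."""
    pe = positional_encoding(len(tokens), 4)
    return [
        [e + p for e, p in zip(TOY_EMBEDDINGS[tok], pe[k])]
        for k, tok in enumerate(tokens)
    ]


a = encode(["dog", "bites", "man"])
b = encode(["man", "bites", "dog"])
print("Same vectors regardless of order:", sorted(a) == sorted(b))
```

Without the added encoding, both sentences would present the model with the same three vectors; with it, the inputs differ because each word's vector depends on where the word appears.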

### 💡 **Why Transformers Are Powerful**

* **Parallel Processing:** All positions in a sequence are processed at once, which makes training much faster than with step-by-step RNNs. Text generation still produces output tokens one at a time.
* **Long-Range Dependencies:** The self-attention mechanism is highly effective at capturing relationships between words regardless of the distance between them.
* **Foundation for Modern NLP:** Transformers are the core technology behind state-of-the-art models, including **GPT, BERT, and T5**.

***

# **Building a Text Summarizer with T5**

## 🤖 Section 4: Text Summarization using T5

**T5 (Text-to-Text Transfer Transformer)** is a powerful model that simplifies all NLP tasks by reframing them as a unified text-to-text problem. T5 can translate, classify, summarize, or answer questions—all by simply changing a task prefix.

### 🧠 **How Summarization Works**

T5 performs text summarization by being explicitly instructed what to do via the input prefix:

| Step | Description |
| :--- | :--- |
| **Input** | The user provides the model with the task instruction and the text: `"summarize: <long text>"`. |
| **Model** | The Transformer's Encoder processes the input text, and the Decoder then generates the output sequence step-by-step. |
| **Output** | The model generates a concise summary that retains the key ideas and core meaning of the original document. |
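
The **Input** step amounts to plain string construction. The helper below is a hypothetical sketch: `build_t5_input` and the `TASK_PREFIXES` mapping are names introduced here for illustration, the prefixes are those described in the T5 paper, and the whitespace normalization is an added convenience rather than something T5 requires.

```python
# Task prefixes as described in the T5 paper; the dictionary keys are our own labels.
TASK_PREFIXES = {
    "summarization": "summarize",
    "en_to_de": "translate English to German",
    "acceptability": "cola sentence",
}


def build_t5_input(task, text):
    """Prepend the task prefix and collapse runs of whitespace in the text."""
    prefix = TASK_PREFIXES[task]
    return f"{prefix}: {' '.join(text.split())}"


print(build_t5_input("summarization", "NLP lets computers\n   work with human language."))
print(build_t5_input("en_to_de", "The house is small."))
```

The same model weights handle every task; only the prefix in the input string changes.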

### **Step 4.1: Install and Import Dependencies**

In [16]:
!pip install transformers sentencepiece --quiet

from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch

### **Code Explanation**

| Component | Purpose |
| :--- | :--- |
| `!pip install transformers...` | Installs the Hugging Face Transformers library, which provides the T5 model implementation. |
| `sentencepiece` | A dependency often used for tokenizer models, like the one used by T5. |
| `T5Tokenizer` | The tool used to convert our input text (the long document and the "summarize:" prefix) into numerical IDs (tokens) that the T5 model can understand. |
| `T5ForConditionalGeneration` | The PyTorch class that contains the pre-trained T5 model architecture, ready for sequence-to-sequence tasks like summarization. |
| `import torch` | Imports the PyTorch library, the underlying framework used by the T5 model. |

### **Step 4.2: Load the Pre-Trained T5 Model**

In [18]:
# Load the small pre-trained T5 model
model_name = "t5-small"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

print(" Model and tokenizer loaded successfully!")


 Model and tokenizer loaded successfully!


### **Step 4.3: Define a Summarization Function**

In [19]:
def summarize(text, max_length=150, min_length=30):
    """
    Generates a summary using the pre-trained T5 model.
    """
    input_text = "summarize: " + text

    # Encode input text
    input_ids = tokenizer.encode(input_text, return_tensors="pt", max_length=512, truncation=True)

    # Generate summary
    summary_ids = model.generate(
        input_ids,
        max_length=max_length,
        min_length=min_length,
        length_penalty=2.0,
        num_beams=4,
        early_stopping=True
    )

    # Decode and return summary
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

#### **Function Explanation**

| Code Line / Parameter | Purpose in Summarization |
| :--- | :--- |
| `input_text = "summarize: " + text` | **T5 Task Prefix:** Explicitly tells the T5 model that the required task is summarization. |
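| `tokenizer.encode(...)` | Converts the text into numerical input IDs (`input_ids`) that the model requires. |
| `return_tensors="pt"` | Returns the IDs as a PyTorch tensor. |
| `max_length=512, truncation=True` | Truncates the encoded input to at most 512 tokens; anything beyond that limit is dropped before the model sees it. |
| `max_length=150` | Sets the maximum number of tokens (subword units, not whole words) the output summary can have. |
| `min_length=30` | Sets the minimum length for the generated summary, preventing overly short or trivial output. |
| `num_beams=4` | **Search Strategy:** Uses **Beam Search** with 4 beams. Instead of committing to the single most likely next token, the decoder keeps the 4 best partial sequences at each step, which often yields better text than greedy decoding. |
| `length_penalty=2.0` | An exponent applied to the sequence length when beam scores are normalized. Because the scores are negative log-likelihoods, values above 0 favor longer sequences and values below 0 favor shorter ones. |
| `early_stopping=True` | Ends beam search as soon as `num_beams` complete candidate summaries have been found. |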
| `tokenizer.decode(...)` | Converts the model's numerical output (`summary_ids`) back into a natural language string. |
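
To illustrate what `num_beams` and `length_penalty` do, here is a toy sketch in plain Python. `TOY_MODEL` is a hypothetical table of next-token probabilities standing in for a real decoder, and both functions are simplified illustrations of the idea, not the Hugging Face implementation.

```python
import math

# Hypothetical toy "model": next-token probabilities given only the previous token.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.55, "dog": 0.45},
    "a": {"cat": 0.1, "dog": 0.9},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def beam_search(model, num_beams, max_steps=5):
    """Keep the num_beams most probable partial sequences at every step."""
    beams = [(["<s>"], 0.0)]  # (tokens, summed log-probability)
    finished = []
    for _ in range(max_steps):
        candidates = [
            (tokens + [nxt], logp + math.log(p))
            for tokens, logp in beams
            for nxt, p in model[tokens[-1]].items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, logp in candidates[:num_beams]:
            if tokens[-1] == "</s>":
                finished.append((tokens, logp))
            else:
                beams.append((tokens, logp))
        if not beams:
            break
    best_tokens, best_logp = max(finished, key=lambda c: c[1])
    return best_tokens[1:-1], math.exp(best_logp)


def length_normalized(sum_logprob, length, length_penalty):
    """Beam score: summed log-probability divided by length ** length_penalty."""
    return sum_logprob / (length ** length_penalty)


greedy = beam_search(TOY_MODEL, num_beams=1)  # one beam is greedy decoding
beam = beam_search(TOY_MODEL, num_beams=2)
print("num_beams=1:", greedy)
print("num_beams=2:", beam)

# A short and a long candidate with made-up log-probabilities.
for penalty in (0.0, 2.0):
    short = length_normalized(-2.0, 4, penalty)
    long_ = length_normalized(-3.0, 8, penalty)
    print(f"length_penalty={penalty}: long candidate preferred = {long_ > short}")
```

Greedy decoding commits to *the* because it is the most likely first word, while two beams also keep *a* alive and find the more probable sequence overall. With a penalty of 0 the shorter candidate scores higher; with 2.0 the longer one does.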

### **Step 4.4: Test the Summarizer**

In [20]:
sample_text = """
Natural Language Processing (NLP) is a field of Artificial Intelligence that focuses on the
interaction between computers and humans through language. The goal is to make computers
understand, interpret, and generate human languages in a way that is both valuable and meaningful.
Applications include chatbots, translation, and summarization.
"""

summary = summarize(sample_text)
print("📝 Original Text:\n", sample_text)
print("\n🔍 Generated Summary:\n", summary)

📝 Original Text:
 
Natural Language Processing (NLP) is a field of Artificial Intelligence that focuses on the 
interaction between computers and humans through language. The goal is to make computers 
understand, interpret, and generate human languages in a way that is both valuable and meaningful.
Applications include chatbots, translation, and summarization.


🔍 Generated Summary:
 natural language processing (NLP) is a field of Artificial Intelligence that focuses on the interaction between computers and humans through language. applications include chatbots, translation, and summarization.


### **Step 4.5: Try Your Own Text!**

In [23]:
user_text = input("Enter any paragraph you'd like to summarize:\n")
summary = summarize(user_text)
print("\n🧾 Generated Summary:\n", summary)

Enter any paragraph you'd like to summarize:
The moment the front door clicked shut, Spence, the notoriously naughty Japanese Spitz, initiated his daily, high-stakes reconnaissance mission across the living room carpet. With his snow-white fur and a perpetually delighted curl to his lip, he presented a picture of pure, angelic innocence, which was, of course, a deliberate deception. His current objective was a freshly delivered box of tissues, which he was certain contained some vital, shredded information that needed immediate extraction and distribution across the floor as fluffy, celebratory confetti. His ears were constantly swiveling, not for commands, but for the slightest chance to dart under the dining table—his designated fortress of mischief—from which he could launch surprise attacks on stray socks or, if he felt particularly bold, steal the remote and bury it beneath a mountain of plush pillows.

🧾 Generated Summary:
 the notoriously naughty Japanese Spitz initiated his dai

## 📘 **Section 5: Recap & Key Takeaways**

This lab introduced you to the progression of sequence models and culminated in building a practical Text Summarization tool using a modern Transformer architecture.

### **Core Concepts Summary**

| Concept | Description |
| :--- | :--- |
| **NLP Preprocessing** | Essential steps like **Tokenization**, **Stopword Removal**, and **Stemming/Lemmatization** used to clean and structure text for models. |
| **Sequence Models** | **RNNs** and **LSTMs** capture sequential context, but LSTMs use gates to overcome the vanishing gradient problem in long sequences. |
| **Transformers** | Replaced recurrence with **Self-Attention** to process all words simultaneously, capturing deep, long-range contextual relationships. |
| **T5 Model** | A unified framework that treats every NLP problem (translation, classification, summarization) as a **text-to-text** task via prefixes. |
| **Summarization** | The application of a sequence model (like T5) to produce concise, coherent summaries from long, unstructured passages. |

***

### 🧭 **What You Learned**

By completing this lab, you have successfully:

* ✅ Cleaned and preprocessed text data using the foundational NLTK library.
* ✅ Understood the architectural limitations of RNNs and how LSTMs address them.
* ✅ Grasped the core mechanics of the Transformer and its vital Self-Attention mechanism.
* ✅ Built a functional text summarizer by loading and utilizing the pre-trained T5 model from the Hugging Face Transformers library.