## **Comparing BERT and T5: Understanding Tokenization and Summarization**

### **Table of Contents**
1. [Introduction](#1-introduction)
2. [Setup and Installation](#2-setup-and-installation)
3. [BERT: Tokenization and Masked Language Modeling](#3-bert-tokenization-and-masked-language-modeling)
4. [T5: Summarization](#4-t5-summarization)
5. [Comparative Analysis](#5-comparative-analysis)
6. [Conclusion](#6-conclusion)


### **1. Introduction**

In Natural Language Processing (NLP), **BERT** and **T5** are two influential transformer-based models developed by Google. Both build on the transformer architecture, but they are designed for different jobs: BERT is an encoder-only model aimed at understanding text, while T5 is an encoder-decoder model that generates text.

- **BERT (Bidirectional Encoder Representations from Transformers):**
  - **Purpose:** Designed for understanding and interpreting text.
  - **Functionality:** Pre-trained with Masked Language Modeling (MLM), where it predicts missing tokens in a sentence, and typically fine-tuned for tasks like question answering and sentiment analysis.
  - **Strength:** Deep contextual understanding of language.

- **T5 (Text-To-Text Transfer Transformer):**
  - **Purpose:** Crafted as a unified framework to handle a variety of NLP tasks by converting them into a text-to-text format.
  - **Functionality:** Capable of performing tasks such as translation, summarization, paraphrasing, and more.
  - **Strength:** Versatile text generation capabilities, particularly in summarization.

This notebook aims to demonstrate the distinct strengths of **BERT** and **T5** by showcasing **BERT's** tokenizer and MLM capabilities, and **T5's** prowess in summarizing text.


### **2. Setup and Installation**

Before delving into the models, make sure the required libraries are installed, for example with `pip install transformers torch sentencepiece` (`T5Tokenizer` depends on `sentencepiece`). We'll use Hugging Face's `transformers` library, which provides access to pre-trained models.


In [1]:
# Import necessary libraries
from transformers import BertTokenizer, BertForMaskedLM
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch

### **3. BERT: Tokenization and Masked Language Modeling**

**BERT** is renowned for its ability to understand the context of words within a sentence. It achieves this through **Masked Language Modeling (MLM)**, where it predicts missing tokens in a given text. Additionally, BERT's tokenizer plays a crucial role in preparing text data for the model.

#### **3.1. BERT Tokenizer Demonstration**

Let's start by exploring **BERT's** tokenizer. The tokenizer breaks down sentences into tokens (subwords) that BERT can process.


In [3]:
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Example sentence
bert_example_sentence = "Nearly all men can stand adversity, but if you want to test a man's character, give him power."

# Tokenize the sentence
bert_tokens = bert_tokenizer.tokenize(bert_example_sentence)
print(f"BERT Tokenized Sentence:\n{bert_tokens}")

BERT Tokenized Sentence:
['nearly', 'all', 'men', 'can', 'stand', 'ad', '##vers', '##ity', ',', 'but', 'if', 'you', 'want', 'to', 'test', 'a', 'man', "'", 's', 'character', ',', 'give', 'him', 'power', '.']
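
The split of "adversity" into `ad`, `##vers`, `##ity` comes from WordPiece's greedy longest-match-first rule: at each position the tokenizer takes the longest piece found in its vocabulary, and pieces that continue a word carry a `##` prefix. The sketch below illustrates that rule in plain Python. The function name and the five-entry `toy_vocab` are hypothetical stand-ins; the real `bert-base-uncased` vocabulary has roughly 30,000 entries and the library also handles lowercasing and punctuation splitting first.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one lowercase word into WordPiece-style subwords.

    Greedy longest-match-first: at each position take the longest
    vocabulary entry; non-initial pieces carry a '##' prefix.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches at this position: the whole word is unknown
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only
toy_vocab = {"stand", "ad", "##vers", "##ity", "power"}

print(wordpiece_tokenize("adversity", toy_vocab))
print(wordpiece_tokenize("stand", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

With this toy vocabulary, "adversity" is split into three pieces, "stand" stays whole, and "xyz" falls back to the unknown token.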


#### **3.2. BERT Masked Language Modeling**

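Now, let's demonstrate **BERT's** ability to predict masked tokens in a sentence. We'll replace parts of the sentence with `[MASK]` tokens and let BERT predict the most probable replacement for each. Each `[MASK]` stands for exactly one token, so the first mask below, which replaces the two words "stand adversity", can only be filled with a single token.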

**Original Sentence:**
"Nearly all men can stand adversity, but if you want to test a man's character, give him power."

**Masked Sentence:**
"Nearly all men can [MASK], but if you want to test a man's [MASK], give him power."

##### **3.2.1. BERT Model Implementation**


In [4]:
bert_model = BertForMaskedLM.from_pretrained('bert-base-uncased')

# Masked sentence
bert_masked_sentence = "Nearly all men can [MASK], but if you want to test a man's [MASK], give him power."

# Tokenize the input sentence
bert_input_ids = bert_tokenizer.encode(bert_masked_sentence, return_tensors='pt')

# Identify the [MASK] token indices
mask_token_indices = torch.where(bert_input_ids == bert_tokenizer.mask_token_id)[1]

# Initialize a list to store predicted words
bert_predicted_words = []

# Run a single forward pass; the logits cover every position, including all [MASK]s
with torch.no_grad():
    logits = bert_model(bert_input_ids).logits

# For each [MASK] position, pick the highest-scoring token
for mask_index in mask_token_indices:
    mask_logits = logits[0, mask_index, :]
    predicted_token_id = torch.argmax(mask_logits).item()
    predicted_word = bert_tokenizer.decode([predicted_token_id]).strip()
    bert_predicted_words.append(predicted_word)

# Replace [MASK] tokens with the predicted words
filled_bert_sentence = bert_masked_sentence
for word in bert_predicted_words:
    filled_bert_sentence = filled_bert_sentence.replace("[MASK]", word, 1)

print(f"BERT Filled Sentence: {filled_bert_sentence}")


**Output:**
```
BERT Filled Sentence: Nearly all men can fight, but if you want to test a man's strength, give him power.
```

##### **3.2.2. Explanation**

1. **Tokenization:** The masked sentence is tokenized, and the positions of `[MASK]` tokens are identified.
2. **Prediction:** For each `[MASK]`, BERT predicts the most probable word based on the surrounding context.
3. **Replacement:** The predicted words replace the `[MASK]` tokens to form the complete, filled sentence.
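
The prediction step can be sketched without the model: given one row of logits per `[MASK]` position, apply a softmax and keep the top candidates. The example below is a minimal pure-Python illustration. The five-word vocabulary and all logit values are made up for the sketch and are not BERT's actual outputs.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities (numerically stable)."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_tokens(logits_row, vocab, k=3):
    """Return the k most probable (token, probability) pairs for one position."""
    probs = softmax(logits_row)
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical vocabulary and logits, for illustration only
toy_vocab = ["fight", "stand", "strength", "character", "power"]
toy_logits = {
    "first [MASK]": [4.0, 3.5, 0.1, 0.2, 1.0],
    "second [MASK]": [0.3, 0.1, 3.8, 3.6, 2.0],
}

for position, row in toy_logits.items():
    candidates = top_k_tokens(row, toy_vocab, k=2)
    print(position, [(token, round(prob, 3)) for token, prob in candidates])
```

Looking at the top few candidates rather than only the argmax is a cheap way to see how confident the model is at each masked position.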

### **4. T5: Summarization**

**T5** adopts a different approach by treating every NLP task as a text generation problem. This design allows it to handle tasks like translation, summarization, and paraphrasing with remarkable flexibility. Here, we'll focus on **T5's** summarization capability.
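
To make the text-to-text idea concrete, the sketch below shows how different tasks become plain strings that differ only in their prefix. The helper name and dictionary keys are hypothetical; the prefix strings follow those used with the original T5 checkpoints.

```python
# Task prefixes in the style of the original T5 checkpoints.
# The dictionary keys and the helper below are hypothetical names.
TASK_PREFIXES = {
    "summarize": "summarize: ",
    "translate_en_de": "translate English to German: ",
    "cola": "cola sentence: ",
}


def to_text_to_text(task, text):
    """Turn a (task, text) pair into the single input string T5 expects."""
    return TASK_PREFIXES[task] + text.strip()


sentence = "  Give him power.  "
for task in TASK_PREFIXES:
    print(repr(to_text_to_text(task, sentence)))
```

The model, tokenizer, and `generate` call stay the same across tasks; only the input string changes.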

#### **4.1. T5 Summarization Implementation**

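Let's use **T5** to summarize a short passage. For continuity with the BERT example, the input here is a single sentence (a variant of the same quote) rather than a full paragraph, so the summary mostly extracts part of the input.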

##### **4.1.1. Summarization Example**

In [7]:
t5_tokenizer = T5Tokenizer.from_pretrained('t5-small')
t5_model = T5ForConditionalGeneration.from_pretrained('t5-small')

# Example text to summarize (the leading "Paraphrase:" label is part of the input text)
t5_input_text = """
Paraphrase: Nearly all men can stand adversity, but if you want to test a man's integrity, give him power.
"""

# Prepare the input with the 'summarize:' prefix
t5_input_text_prefixed = "summarize: " + t5_input_text.strip()

# Tokenize the input
t5_input_ids = t5_tokenizer.encode(t5_input_text_prefixed, return_tensors='pt')

# Generate the summary using beam search for better results
t5_output_ids = t5_model.generate(
    t5_input_ids, 
    max_length=50,
    num_beams=4,
    early_stopping=True
)

# Decode the generated output, skipping special tokens
t5_summary = t5_tokenizer.decode(t5_output_ids[0], skip_special_tokens=True)

print(f"T5 Generated Summary:\n{t5_summary}")


**Output:**
```
T5 Generated Summary:
if you want to test a man's integrity, give him power.
```

##### **4.1.2. Explanation**

1. **Input Preparation:** The text to be summarized is prefixed with `"summarize: "` to inform **T5** of the desired task.
2. **Tokenization:** The prefixed text is tokenized into input IDs that **T5** can process.
3. **Generation:** **T5** generates a summary using beam search (`num_beams=4`), which keeps the four highest-scoring partial outputs at each step instead of committing to a single one. `max_length=50` caps the generated sequence at 50 tokens, and `early_stopping=True` ends the search as soon as `num_beams` complete candidates have been found rather than continuing to look for better ones.
4. **Decoding:** The generated summary is decoded from token IDs back into readable text.
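
The generation step can be illustrated with a deliberately simplified beam search over a toy next-token table. This is a sketch of the idea, not Hugging Face's implementation: the probability table is hypothetical, there is no length penalty, and the library's bookkeeping differs in detail.

```python
import math

# Hypothetical toy "model": next-token probabilities given the previous token
NEXT_TOKEN_PROBS = {
    "<s>": {"test": 0.6, "give": 0.4},
    "test": {"him": 0.5, "them": 0.5},
    "give": {"power": 0.9, "</s>": 0.1},
    "him": {"</s>": 1.0},
    "them": {"</s>": 1.0},
    "power": {"</s>": 1.0},
}


def beam_search(num_beams=2, max_length=5, eos="</s>"):
    """Return the highest-scoring token sequence found by beam search."""
    beams = [(0.0, ["<s>"])]  # (log-probability, tokens)
    finished = []
    for _ in range(max_length):
        # Expand every live beam by every possible next token
        candidates = []
        for score, tokens in beams:
            for token, prob in NEXT_TOKEN_PROBS[tokens[-1]].items():
                candidates.append((score + math.log(prob), tokens + [token]))
        candidates.sort(key=lambda c: c[0], reverse=True)

        # Keep the best num_beams candidates; set finished ones aside
        beams = []
        for score, tokens in candidates[:num_beams]:
            if tokens[-1] == eos:
                finished.append((score, tokens))
            else:
                beams.append((score, tokens))

        # Early stopping: quit once num_beams hypotheses have finished
        if len(finished) >= num_beams or not beams:
            break

    finished.extend(beams)  # fall back to unfinished beams if max_length was hit
    return max(finished, key=lambda c: c[0])[1]


print(beam_search(num_beams=1))  # greedy decoding
print(beam_search(num_beams=2))
```

With one beam the search commits to the locally best first token ("test") and ends with total probability 0.3. With two beams it keeps "give" alive and finds the sequence ending in "power", whose total probability is 0.36.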


### **5. Comparative Analysis**

Having implemented both **BERT** and **T5** for their respective strengths, let's compare their functionalities and use cases.

#### **5.1. Side-by-Side Comparison**

| **Aspect**                       | **BERT**                                                                 | **T5**                                                          |
|----------------------------------|--------------------------------------------------------------------------|-----------------------------------------------------------------|
| **Primary Function**             | Masked token prediction                                                  | Text-to-text generation (e.g., summarization, translation)      |
| **Handling Multiple Masks**      | Scores every `[MASK]` in one forward pass, but each prediction is independent of the others | Fills several masked spans (marked by sentinel tokens such as `<extra_id_0>`) in one generated sequence |
| **Use Cases**                    | Understanding tasks (e.g., sentiment analysis, question answering)       | Generation tasks (e.g., summarization, translation, paraphrasing) |
| **Flexibility**                  | Focused on comprehension and token prediction                           | Highly versatile across various NLP tasks                       |
| **Output Nature**                | Predicts discrete tokens                                                | Generates coherent and concise text spans                        |


#### **5.2. Performance Insights**

- **Efficiency:**
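  - **BERT:** Excels at tasks requiring deep understanding and token-level predictions. A single forward pass scores every position, so multiple `[MASK]` tokens can be predicted at once, although each prediction is made independently of the others.
  - **T5:** Suited to generation tasks such as summarization. Its encoder reads the input once, but its decoder produces the output autoregressively, one token per step, so generation cost grows with output length.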

- **Flexibility:**
  - **BERT:** Best suited for tasks that involve analyzing and understanding existing text.
  - **T5:** Offers a unified approach to various tasks, making it suitable for generating new text based on given inputs.

- **Contextual Understanding:**
  - **BERT:** Provides rich contextual embeddings, making it ideal for tasks that rely on understanding the intricacies of language.
  - **T5:** Utilizes its generative capabilities to produce contextually appropriate and coherent summaries or translations.


### **6. Conclusion**

In this notebook, we've delved into the distinct functionalities of **BERT** and **T5**:

- **BERT** is a formidable model for understanding and predicting missing tokens within a sentence. Its tokenizer efficiently breaks down text into manageable tokens, and its **Masked Language Modeling** capability allows it to predict and fill in gaps based on contextual clues. This makes **BERT** exceptionally suited for tasks that require deep comprehension and analysis of text.

- **T5**, on the other hand, offers a versatile framework for a wide array of NLP tasks by treating them as text-to-text problems. Its strength lies in its ability to generate coherent and concise summaries, making it invaluable for tasks like summarization, translation, and paraphrasing. **T5's** generative nature allows it to produce new text based on given inputs, showcasing its flexibility and power in handling diverse language tasks.

**Key Takeaways:**

- **BERT's** tokenizer and **Masked Language Modeling** are ideal for tasks that require understanding and predicting specific parts of the text.
- **T5's** design as a text-to-text model makes it highly effective for generating new content, such as summaries, translations, and paraphrases.
- Choosing between **BERT** and **T5** depends on the specific requirements of your NLP project—whether it leans more towards understanding existing text or generating new text based on inputs.

By leveraging the strengths of both models, you can build robust NLP applications that harness the power of deep language understanding and sophisticated text generation.
