<a href="https://colab.research.google.com/github/arquansa/PSTB-exercises/blob/main/Week07/Day3/EX3/W7D3EX.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Exercises XP
Last Updated: July 10th, 2025

👩‍🏫 👩🏿‍🏫 What You’ll learn
How to tokenize text using BERT and understand the role of special tokens.
How to use a pre-trained BERT model for sentiment analysis.
How to build a custom sentiment analyzer with BERT.
How BERT can be used for Named Entity Recognition (NER) tasks.
The differences between BERT and GPT models.
How BERT is used in Retrieval-Augmented Generation (RAG) systems.


🛠️ What you will create
A tokenized output of a sample sentence using BERT.
A sentiment analysis pipeline using a pre-trained BERT model.
A custom sentiment analyzer with direct control over components.
A named entity recognizer using BERT.
A comparison table between BERT and GPT models.
An explanation of how BERT is used in RAG systems.

🌟 Exercise 1: Tokenization with BERT
Objective: Learn how BERT tokenizes text and adds special tokens, preparing it for model input.

Why this matters:
Before any language model can process text, it needs to convert it into tokens and numerical IDs. BERT adds special tokens to every input: [CLS] at the start of the sequence and [SEP] at the end of each sentence (segment). This exercise helps you understand how BERT prepares raw text for analysis.

Instructions:

Install the transformers and torch libraries.
Load the BERT tokenizer (bert-base-uncased).
Choose a sample sentence.
Tokenize the sentence and view how BERT breaks it down.
Prepare the sentence with special tokens, padding, and truncation for model input.
Review the token IDs and tokens, identifying the special tokens BERT adds.
Outcome: You will have a fully tokenized sentence, see the special tokens BERT adds, and understand how text becomes input for BERT.



🌟 Exercise 2: Sentiment Analysis with BERT Pipeline
Objective: Use a pre-trained BERT-family model (DistilBERT fine-tuned on SST-2) to perform sentiment analysis.

Why this matters:

Pre-trained models like BERT can quickly classify text, such as determining if a sentence is positive or negative. Pipelines simplify this process, allowing you to focus on the task without managing low-level details.

Instructions:

Import the pipeline function from transformers.
Create a sentiment analysis pipeline using the distilbert-base-uncased-finetuned-sst-2-english model.
Provide a sample sentence.
Use the pipeline to predict the sentiment.
Review the predicted label and confidence score.
Outcome: You will have a working sentiment analysis pipeline that can classify text as positive or negative.



🌟 Exercise 3: Building a Custom Sentiment Analyzer
Objective: Build a sentiment analyzer with direct control over the tokenizer, model, and processing pipeline.

Why this matters:

Using pipelines is convenient, but building a custom analyzer helps you understand how models process inputs and generate outputs. You gain full control over preprocessing, model handling, and post-processing.

Instructions:

1. Import AutoTokenizer and AutoModelForSequenceClassification.
2. Create a class BERTSentimentAnalyzer with methods for:

Initializing the tokenizer and model.
Preprocessing input text (cleaning, tokenizing, preparing tensors).
Predicting sentiment and returning results.
3. Test your analyzer with various sample texts.

Outcome: You will have a custom sentiment analyzer and understand each component’s role in the pipeline.



🌟 Exercise 4: Understanding BERT for Named Entity Recognition (NER)
Objective: Explore how BERT identifies entities in text using the NER task.

Why this matters:

NER helps extract important information like names, locations, and organizations from text. BERT can be fine-tuned for NER using models trained with the B-I-O tagging scheme (Begin, Inside, Outside).

Instructions:

1. Import AutoTokenizer and AutoModelForTokenClassification.
2. Create a class BERTNamedEntityRecognizer with methods for:

Initializing the tokenizer and model.
Recognizing entities in a given text and mapping token predictions to labels.
3. Test your recognizer with sample text containing entities.

Outcome: You will build an NER system that identifies entities like names, places, and more using BERT.



🌟 Exercise 5: Comparing BERT and GPT
Objective: Understand the architectural and functional differences between BERT and GPT models.

Why this matters:

BERT and GPT are foundational models in NLP but serve different purposes. Knowing their strengths, weaknesses, and use cases helps you choose the right model for your task.

Instructions:

1. Research the architectures and applications of BERT and GPT.
2. Create a comparison table based on:

Architecture (encoder, decoder, or both).
Primary purpose (understanding vs. generation).
Common use cases.
Strengths and weaknesses.
3. Reflect on the differences and similarities.

Outcome: You will have a clear comparison of BERT and GPT, helping you understand when to use each model.



🌟 Exercise 6: Exploring BERT Applications in Retrieval-Augmented Generation (RAG)
Objective: Learn how BERT is used in RAG systems to enhance information retrieval.

Why this matters:

RAG systems combine retrieval and generation, allowing language models to access external knowledge. BERT plays a key role in retrieving relevant information, improving the quality of generated responses.

Instructions:

Research the concept of Retrieval-Augmented Generation (RAG).
Explain BERT’s role in the retrieval component.
Describe how BERT generates embeddings for documents and queries.
Discuss how a vector database is used to match queries with relevant documents.
Provide an example of how BERT and a generative model like GPT work together in a RAG system.
Outcome: You will understand BERT’s role in RAG systems and how it enhances retrieval for generation tasks.

# Exercise 1: Tokenization with BERT
Objective: Learn how BERT tokenizes text and adds special tokens, preparing it for model input.

Why this matters: Before any language model can process text, it needs to convert it into tokens and numerical IDs. BERT adds special tokens to every input: [CLS] at the start of the sequence and [SEP] at the end of each sentence (segment). This exercise helps you understand how BERT prepares raw text for analysis.

Instructions:

- Install the transformers and torch libraries.
- Load the BERT tokenizer (bert-base-uncased).
- Choose a sample sentence.
- Tokenize the sentence and view how BERT breaks it down.
- Prepare the sentence with special tokens, padding, and truncation for model input.
- Review the token IDs and tokens, identifying the special tokens BERT adds.

Outcome: You will have a fully tokenized sentence, see the special tokens BERT adds, and understand how text becomes input for BERT.

In [None]:
!pip install transformers torch


In [None]:
from transformers import BertTokenizer

# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Choose a sample sentence
sentence = "BERT is a powerful language model."
print(f"Sample sentence: {sentence}")

Sample sentence: BERT is a powerful language model.


In [None]:
# Tokenize the sentence
tokens = tokenizer.tokenize(sentence)
print(f"Tokens: {tokens}")

# Prepare the sentence with special tokens, padding, and truncation for model input
# This returns a dictionary with input_ids, token_type_ids, and attention_mask
encoded_inputs = tokenizer(
    sentence,
    padding='max_length',  # Pad to the maximum sequence length
    truncation=True,       # Truncate if the sentence is longer than the max length
    max_length=128,        # Specify the maximum sequence length
    return_tensors='pt'    # Return PyTorch tensors
)

print("\nEncoded Inputs (PyTorch tensors):")
print(encoded_inputs)

# Review the token IDs and tokens
input_ids = encoded_inputs['input_ids'][0]
tokens_from_ids = tokenizer.convert_ids_to_tokens(input_ids)

print("\nToken IDs and corresponding tokens:")
for token_id, token in zip(input_ids, tokens_from_ids):
    print(f"ID: {token_id.item()}, Token: {token}")

# Identify special tokens
special_tokens = tokenizer.special_tokens_map.values()
print(f"\nSpecial tokens used by BERT: {list(special_tokens)}")

Tokens: ['bert', 'is', 'a', 'powerful', 'language', 'model', '.']

Encoded Inputs (PyTorch tensors):
{'input_ids': tensor([[  101, 14324,  2003,  1037,  3928,  2653,  2944,  1012,   102,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0, 
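
The layout visible in the output above (101 first, 102 right after the last word piece, then zeros) can be reproduced by hand. The sketch below is a minimal illustration, assuming the special-token IDs shown in that output (101 for [CLS], 102 for [SEP], 0 for [PAD]); `build_model_input` is a hypothetical helper written for this note, not a function from the transformers library.

```python
# Token IDs copied from the notebook output above; the helper itself is a
# hypothetical illustration, not part of the transformers library.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def build_model_input(word_piece_ids, max_length):
    """Wrap word-piece IDs in [CLS]/[SEP], then truncate and pad to max_length."""
    # Reserve two positions for the special tokens before truncating.
    body = word_piece_ids[: max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    # The attention mask is 1 for real tokens and 0 for padding.
    attention_mask = [1] * len(input_ids)
    padding = max_length - len(input_ids)
    input_ids += [PAD_ID] * padding
    attention_mask += [0] * padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# IDs for: bert, is, a, powerful, language, model, .
sentence_ids = [14324, 2003, 1037, 3928, 2653, 2944, 1012]
encoded = build_model_input(sentence_ids, max_length=12)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

With `max_length=128`, as in the cell above, the same sentence ends up with 9 real positions and 119 padding positions, which is why the printed tensor is mostly zeros.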

# Exercise 2: Sentiment Analysis with BERT Pipeline

Objective: Use a pre-trained BERT-family model (DistilBERT fine-tuned on SST-2) to perform sentiment analysis.

Why this matters:

Pre-trained models like BERT can quickly classify text, such as determining if a sentence is positive or negative. Pipelines simplify this process, allowing you to focus on the task without managing low-level details.

Instructions:

- Import the pipeline function from transformers.
- Create a sentiment analysis pipeline using the distilbert-base-uncased-finetuned-sst-2-english model.
- Provide a sample sentence.
- Use the pipeline to predict the sentiment.
- Review the predicted label and confidence score.

Outcome: You will have a working sentiment analysis pipeline that can classify text as positive or negative.

In [None]:
from transformers import pipeline

# Create a sentiment analysis pipeline
# We use the 'distilbert-base-uncased-finetuned-sst-2-english' model, which is fine-tuned for sentiment analysis
sentiment_analyzer = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")

# Provide a sample sentence
sample_sentence = "I love using transformers library for NLP tasks!"

# Use the pipeline to predict the sentiment
result = sentiment_analyzer(sample_sentence)

# Review the predicted label and confidence score
print(f"Sample Sentence: {sample_sentence}")
print(f"Sentiment Prediction: {result}")

Device set to use cpu


Sample Sentence: I love using transformers library for NLP tasks!
Sentiment Prediction: [{'label': 'POSITIVE', 'score': 0.9962906837463379}]
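
The `score` in the result above is typically a softmax probability computed from the classifier's raw outputs (logits). The sketch below shows that step in plain Python. The logit values are made up for illustration, and `softmax` and `to_prediction` are hypothetical helpers, not pipeline internals.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    # Subtract the maximum for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_prediction(logits, id2label):
    """Pick the most probable class and report it with its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


# Hypothetical logits; real values come from the model's classification head.
example_logits = [-2.5, 3.1]
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
prediction = to_prediction(example_logits, id2label)
print(prediction)
```

A score close to 1.0 means the gap between the two logits is large. It is a measure of the model's confidence, not a guarantee that the label is correct.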


# Exercise 3: Building a Custom Sentiment Analyzer

Objective: Build a sentiment analyzer with direct control over the tokenizer, model, and processing pipeline.

Why this matters:

Using pipelines is convenient, but building a custom analyzer helps you understand how models process inputs and generate outputs. You gain full control over preprocessing, model handling, and post-processing.

Instructions:

1. Import AutoTokenizer and AutoModelForSequenceClassification.
2. Create a class BERTSentimentAnalyzer with methods for:
   - Initializing the tokenizer and model.
   - Preprocessing input text (cleaning, tokenizing, preparing tensors).
   - Predicting sentiment and returning results.
3. Test your analyzer with various sample texts.

Outcome: You will have a custom sentiment analyzer and understand each component’s role in the pipeline.

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

class BERTSentimentAnalyzer:
    def __init__(self, model_name="bert-base-uncased"):
        # Initialize the tokenizer and model
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)

    def preprocess(self, text):
        # Preprocess input text (cleaning, tokenizing, preparing tensors)
        # You can add more text cleaning steps here if needed
        encoded_inputs = self.tokenizer(
            text,
            padding='max_length',
            truncation=True,
            max_length=128,
            return_tensors='pt'
        )
        return encoded_inputs

    def predict(self, text):
        # Predict sentiment and return results
        encoded_inputs = self.preprocess(text)
        with torch.no_grad():
            outputs = self.model(**encoded_inputs)
        logits = outputs.logits
        probabilities = torch.softmax(logits, dim=-1)[0]
        predicted_index = int(torch.argmax(probabilities))

        # Read the index-to-label mapping from the model config instead of hard-coding it.
        # A fine-tuned checkpoint such as 'distilbert-base-uncased-finetuned-sst-2-english'
        # maps 0 to NEGATIVE and 1 to POSITIVE. Plain 'bert-base-uncased' has no trained
        # sentiment head, so it reports generic names such as LABEL_0 and LABEL_1 and its
        # predictions are not meaningful.
        predicted_label = self.model.config.id2label.get(predicted_index, "UNKNOWN")

        return {"label": predicted_label, "score": probabilities[predicted_index].item()}

# Test the analyzer with a sample text
# Note: Using 'bert-base-uncased' directly for sentiment won't give meaningful results
# as it's not fine-tuned for this task. For actual sentiment analysis,
# you would use a fine-tuned model name here (like in Exercise 2).
# This example demonstrates the structure of the custom analyzer.

# analyzer = BERTSentimentAnalyzer(model_name="distilbert-base-uncased-finetuned-sst-2-english") # Use a fine-tuned model for real sentiment analysis
# sample_text = "This is a great movie!"
# result = analyzer.predict(sample_text)
# print(f"\nSample Text: {sample_text}")
# print(f"Sentiment Prediction (Custom Analyzer): {result}")

print("BERTSentimentAnalyzer class defined. Instantiate with a fine-tuned model for sentiment analysis.")

BERTSentimentAnalyzer class defined. Instantiate with a fine-tuned model for sentiment analysis.
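
The `preprocess` method above leaves the "cleaning" step open. One possible shape for it is sketched below using only the standard library. `clean_text` is a hypothetical helper, and which rules are appropriate depends on the data, so treat this as an assumption rather than a recommended recipe.

```python
import html
import re


def clean_text(text):
    """Light cleanup before tokenization (hypothetical helper, not a transformers API)."""
    text = html.unescape(text)            # decode entities, e.g. "&amp;" becomes "&"
    text = re.sub(r"<[^>]+>", " ", text)  # replace HTML tags with a space
    text = re.sub(r"\s+", " ", text)      # collapse runs of whitespace
    return text.strip()


print(clean_text("  I <b>loved</b> it &amp; would watch again!\n\n"))
```

Lowercasing is left out on purpose: uncased tokenizers such as bert-base-uncased lowercase the text themselves, and cased models need the original capitalization.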


# Exercise 4: Understanding BERT for Named Entity Recognition (NER)

Objective: Explore how BERT identifies entities in text using the NER task.

Why this matters:

NER helps extract important information like names, locations, and organizations from text.

BERT can be fine-tuned for NER using models trained with the B-I-O tagging scheme (Begin, Inside, Outside).

Instructions:

1. Import AutoTokenizer and AutoModelForTokenClassification.
2. Create a class BERTNamedEntityRecognizer with methods for:
   - Initializing the tokenizer and model.
   - Recognizing entities in a given text and mapping token predictions to labels.
3. Test your recognizer with sample text containing entities.

Outcome: You will build an NER system that identifies entities like names, places, and more using BERT.

In [None]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

class BERTNamedEntityRecognizer:
    def __init__(self, model_name="dbmdz/bert-large-cased-finetuned-conll03-english"):
        # Initialize the tokenizer and model for NER
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForTokenClassification.from_pretrained(model_name)

    def recognize_entities(self, text):
        # Tokenize the input text and get offset mappings
        encoded_inputs = self.tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
        offsets = encoded_inputs.pop("offset_mapping")[0]

        # Get model predictions
        with torch.no_grad():
            outputs = self.model(**encoded_inputs)
        predictions = torch.argmax(outputs.logits, dim=2)[0].tolist()

        # Map token predictions to labels
        id_to_label = self.model.config.id2label
        predicted_labels = [id_to_label[p] for p in predictions]

        # Align tokens with original words and their predicted labels using offset mapping
        entities = []
        current_entity = None
        current_label = None

        for i, (offset, label) in enumerate(zip(offsets, predicted_labels)):
            # Skip special tokens [CLS] and [SEP] and padding [PAD]
            if offset[0] is None or offset[1] is None or label in ['O', 'PAD']:
                if current_entity:
                    entities.append({"entity": text[current_entity[0]:current_entity[1]], "label": current_label})
                    current_entity = None
                    current_label = None
                continue

            # Get the span of the current token in the original text
            start, end = offset.tolist()

            if label.startswith("B-"):
                if current_entity:
                    entities.append({"entity": text[current_entity[0]:current_entity[1]], "label": current_label})
                current_entity = [start, end]
                current_label = label[2:]
            elif label.startswith("I-") and current_entity and label[2:] == current_label:
                current_entity[1] = end
            else:
                if current_entity:
                    entities.append({"entity": text[current_entity[0]:current_entity[1]], "label": current_label})
                current_entity = None
                current_label = None

        # Add the last entity if exists
        if current_entity:
             entities.append({"entity": text[current_entity[0]:current_entity[1]], "label": current_label})

        return entities

# Note: Instantiating the model here for demonstration.
# You would typically instantiate the class and then call the method.
# recognizer = BERTNamedEntityRecognizer()
# sample_text = "Paris is the capital of France."
# recognized_entities = recognizer.recognize_entities(sample_text)
# print(f"\nSample Text: {sample_text}")
# print(f"Recognized Entities: {recognized_entities}")

print("BERTNamedEntityRecognizer class defined with improved entity recognition logic. Instantiate to perform NER.")

BERTNamedEntityRecognizer class defined with improved entity recognition logic. Instantiate to perform NER.


In [None]:
# Instantiate the NER recognizer
recognizer = BERTNamedEntityRecognizer()

# Test with a sample text containing entities
sample_text = "Paris is the capital of France. John Doe works at Google in California."
recognized_entities = recognizer.recognize_entities(sample_text)
print(f"\nSample Text: {sample_text}")
print(f"Recognized Entities: {recognized_entities}")

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Sample Text: Paris is the capital of France. John Doe works at Google in California.
Recognized Entities: []
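
One plausible explanation for the empty list: the loop in `recognize_entities` only opens an entity when it sees a `B-` tag, while CoNLL-03 checkpoints like this one often tag entity tokens with `I-` labels throughout (IOB1 style), so no entity is ever opened. That is an assumption about the checkpoint's raw tags, which the output above does not show. The sketch below is a grouping routine that also accepts an `I-` tag as the start of an entity. `group_entities` is a hypothetical helper, and the offsets and labels are hand-written for illustration, not model output.

```python
def group_entities(text, offsets, labels):
    """Merge per-token tags into entity spans, accepting I- tags as entity starts."""
    entities = []
    current = None  # [start, end, entity_type] of the span being built

    def flush():
        nonlocal current
        if current is not None:
            start, end, entity_type = current
            entities.append({"entity": text[start:end], "label": entity_type})
            current = None

    for (start, end), label in zip(offsets, labels):
        # Special tokens such as [CLS] and [SEP] carry an empty (0, 0) span.
        if start == end or label == "O":
            flush()
            continue
        prefix, entity_type = label.split("-", 1)
        if prefix == "I" and current is not None and current[2] == entity_type:
            current[1] = end  # extend the open span
        else:
            flush()  # a B- tag, a new type, or an I- tag with no open span
            current = [start, end, entity_type]
    flush()
    return entities


# Hand-written offsets and tags for illustration (assumed, not model output).
text = "John Doe works at Google."
offsets = [(0, 0), (0, 4), (5, 8), (9, 14), (15, 17), (18, 24), (24, 25), (0, 0)]
labels = ["O", "I-PER", "I-PER", "O", "O", "I-ORG", "O", "O"]
print(group_entities(text, offsets, labels))
```

To try this with the class above, the offset tensor would first need converting with `offsets.tolist()`, and the labels would come from `predicted_labels`.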


In [None]:
from transformers import pipeline

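# Create an NER pipeline
# Using the same fine-tuned model as in the custom class.
# aggregation_strategy="simple" merges word pieces into whole entities; it replaces the deprecated grouped_entities=True.
ner_pipeline = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english", aggregation_strategy="simple")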

# Test with the sample text
sample_text = "Paris is the capital of France. John Doe works at Google in California."
recognized_entities_pipeline = ner_pipeline(sample_text)

print(f"\nSample Text: {sample_text}")
print(f"Recognized Entities (Pipeline): {recognized_entities_pipeline}")

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu



Sample Text: Paris is the capital of France. John Doe works at Google in California.
Recognized Entities (Pipeline): [{'entity_group': 'LOC', 'score': np.float32(0.99970144), 'word': 'Paris', 'start': 0, 'end': 5}, {'entity_group': 'LOC', 'score': np.float32(0.99977046), 'word': 'France', 'start': 24, 'end': 30}, {'entity_group': 'PER', 'score': np.float32(0.99459696), 'word': 'John Doe', 'start': 32, 'end': 40}, {'entity_group': 'ORG', 'score': np.float32(0.9987465), 'word': 'Google', 'start': 50, 'end': 56}, {'entity_group': 'LOC', 'score': np.float32(0.9994609), 'word': 'California', 'start': 60, 'end': 70}]


# Exercise 5: Comparing BERT and GPT

Objective: Understand the architectural and functional differences between BERT and GPT models.

Why this matters:

BERT and GPT are foundational models in NLP but serve different purposes. Knowing their strengths, weaknesses, and use cases helps you choose the right model for your task.

Instructions:

1. Research the architectures and applications of BERT and GPT.
2. Create a comparison table based on:
   - Architecture (encoder, decoder, or both).
   - Primary purpose (understanding vs. generation).
   - Common use cases.
   - Strengths and weaknesses.
3. Reflect on the differences and similarities.

Outcome: You will have a clear comparison of BERT and GPT, helping you understand when to use each model.

#### Exercise 5: Comparing BERT and GPT in a table

The table below compares BERT and GPT by architecture, training objective, use cases, strengths, and weaknesses:

| Feature                   | **BERT**                                                                                                 | **GPT**                                                                                                                |
| ------------------------- | -------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------- |
| **Architecture**          | Encoder-only bidirectional attention                              | Decoder-only autoregressive, causal attention         |
| **Training Objective**    | Masked Language Modeling + Next Sentence Prediction                                      | Causal (next-token) language modeling                  |
| **Contextual Processing** | Both left & right context simultaneously                                                                 | Only left-to-right (past context)                                                                                      |
| **Primary Use Cases**     | Understanding tasks: classification, NER, QA, sentiment | Generative tasks: text generation, code completion, chatbots           |
| **Strengths**             | Deep text comprehension, strong for discriminative tasks                                                 | Fluent generation, creative and conversational outputs                                                                 |
| **Weaknesses**            | Not designed for free-form text generation                                                               | Each token sees only the preceding context, which can limit tasks that depend on the whole sentence                    |
| **Computational Needs**   | Moderate—efficient fine-tuning                                                                           | High—especially for larger versions like GPT-3/4                                                                       |
| **Notable Variants**      | RoBERTa, DistilBERT, ALBERT, etc.                                                                        | Scaled through GPT‑2, GPT‑3, GPT‑4, GPT‑4o                                                                             |
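
The "Architecture" and "Contextual Processing" rows come down to which positions each token is allowed to attend to. The sketch below builds both attention patterns as plain 0/1 matrices. It is a simplified illustration (real models apply such masks inside every attention layer), and the function names are hypothetical.

```python
def bidirectional_mask(n):
    """BERT-style: every position may attend to every other position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """GPT-style: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


tokens = ["the", "cat", "sat"]
for name, mask in [("bidirectional", bidirectional_mask(3)), ("causal", causal_mask(3))]:
    print(name)
    for token, row in zip(tokens, mask):
        print(f"  {token:>3} -> {row}")
```

In the causal mask the row for "the" can see only itself, which is what lets GPT generate text one token at a time. In the bidirectional mask every row sees the whole sentence, which suits classification and tagging.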



# Exercise 6: Exploring BERT Applications in Retrieval-Augmented Generation (RAG)

Objective: Learn how BERT is used in RAG systems to enhance information retrieval.

Why this matters:

RAG systems combine retrieval and generation, allowing language models to access external knowledge.

BERT plays a key role in retrieving relevant information, improving the quality of generated responses.

Instructions:

Research the concept of Retrieval-Augmented Generation (RAG).

1.  **Explain BERT's role in the retrieval component of a RAG system.** How does it help in finding relevant information?
2.  **Describe how BERT generates embeddings for documents and queries in a RAG setup.** Why are these embeddings important?
3.  **Discuss how a vector database is used in conjunction with BERT embeddings to match queries with relevant documents.**
4.  **Provide an example of how BERT (for retrieval) and a generative model like GPT (for generation) work together in a RAG system** to answer a user query based on external knowledge.


How BERT Fits into RAG:

Embedding Creation
- A BERT-based encoder maps queries (user questions) and documents to dense vector representations, known as embeddings. In practice this is usually a BERT model fine-tuned for retrieval, such as DPR or Sentence-BERT, rather than raw pre-trained BERT.

Similarity Matching
- A vector database stores the document embeddings and runs a nearest-neighbor search (typically by cosine similarity or dot product) to find the documents closest in meaning to the query.

Grounding the Response
- Instead of relying solely on what it memorized during training, the system can reference and quote the retrieved documents, which improves accuracy and reduces hallucinations.

Generative Step
- A generative model (e.g., GPT) then writes the response from the query plus the retrieved content, which improves factual alignment.
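
The similarity-matching step can be sketched without any model at all. The example below ranks documents by cosine similarity to a query vector. The three-dimensional vectors and document names are invented stand-ins for encoder output (BERT-base produces 768-dimensional vectors), and a real vector database would use an approximate nearest-neighbor index rather than this exhaustive loop.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, index, k=2):
    """Return the k (doc_id, score) pairs most similar to the query."""
    scored = [(doc_id, cosine_similarity(query_vector, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors standing in for document embeddings (hypothetical values).
index = {
    "deposition_guide": [0.9, 0.1, 0.0],
    "court_etiquette": [0.7, 0.3, 0.1],
    "pasta_recipe": [0.0, 0.2, 0.9],
}
query_vector = [1.0, 0.2, 0.0]  # stand-in for the embedded user question
results = top_k(query_vector, index, k=2)
print([doc_id for doc_id, _ in results])
```

The unrelated document is ranked last because its vector points in a different direction, regardless of any words it might share with the query.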

Example Workflow:
- User Query: “How should I prepare for legal depositions?”

Retrieval Phase:
- The query is transformed into a BERT-generated embedding.
- The query embedding is compared against the BERT-encoded document embeddings to find the top matches.

Generation Phase:
- GPT uses the retrieved passages together with the original query to generate a well-informed answer.
- Citing and grounding the answer in real texts reduces hallucinations.
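
The hand-off from retrieval to generation is usually just prompt construction. The sketch below assembles a prompt from the user query and retrieved passages. The passages and the prompt wording are hypothetical, and the call to the generative model is left out because it depends on the provider.

```python
def build_prompt(question, passages):
    """Combine retrieved passages and the user question into one prompt string."""
    # Number each passage so the generator can cite it as [1], [2], ...
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer the question using only the sources below and cite them by number.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical retrieved passages; a real system would take these from the vector search.
passages = [
    "Review the case file and prior statements before the deposition.",
    "Listen to each question fully and answer only what was asked.",
]
prompt = build_prompt("How should I prepare for legal depositions?", passages)
print(prompt)
```

Numbering the sources gives the generator something concrete to cite, which is one simple way to make the grounding visible in the answer.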

Summary: Why It Matters

- BERT-based retrievers find semantically relevant matches because they encode the meaning of both query and document, not just shared keywords.
- GPT-like models excel at generating coherent, well-formatted responses from the retrieved information.
- RAG combines reliable retrieval with expressive generation, which is essential for factual accuracy in applications ranging from chatbots to research assistants.