## Question 3: Question-Answering System (25 points)


### Actions Required:
1. **Document chunking and vectorization**
2. **Semantic search implementation**
3. **Span-based QA with multiple models**
4. **Answer 3 specific questions about BERT**

### Approach:
**Objective**: Build a QA system that combines semantic search with span extraction

**Algorithm**:
1. Load BERT article text
2. Sentence tokenization using spaCy
3. Create 2-sentence chunks
4. Encode chunks using sentence-transformers model
5. For each question:
   - Encode question with same model
   - Find 3 most similar chunks (cosine similarity)
   - Check similarity threshold (>= 0.3)
   - If threshold met, use chunks as context for QA models
   - Apply both QA models and compare results
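
The retrieval step in item 5 can be sketched on toy vectors without loading any model. This is a minimal illustration, assuming made-up 2-D "embeddings"; the helper name `top_k_above_threshold` is hypothetical and not part of the notebook.

```python
import numpy as np

def top_k_above_threshold(query_vec, chunk_vecs, top_k=3, threshold=0.3):
    """Return (index, score) pairs for the top_k most similar rows that meet the threshold."""
    # Normalize so that the dot product equals cosine similarity
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = c @ q
    # Rank all chunks by descending similarity and keep the first top_k
    order = np.argsort(scores)[::-1][:top_k]
    # Drop any of the top_k that fall below the threshold
    return [(int(i), float(scores[i])) for i in order if scores[i] >= threshold]

# Toy 2-D vectors, invented for illustration only
toy_chunk_vecs = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]])
toy_query_vec = np.array([1.0, 0.0])

hits = top_k_above_threshold(toy_query_vec, toy_chunk_vecs)
print(hits)  # two of the top three candidates pass the threshold
```

An empty result list is the signal to skip the QA models and return the fallback answer.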

**Libraries/Dependencies**:
- `spacy` (en_core_web_sm) - sentence tokenization
- `sentence-transformers` - text encoding
- `transformers` - QA pipeline
- `numpy` - similarity calculations
- `torch` - model backend

**Models Required**:
- `sentence-transformers/multi-qa-mpnet-base-cos-v1` - encoding
- `distilbert/distilbert-base-cased-distilled-squad` - QA
- `deepset/tinyroberta-squad2` - QA

In [None]:
# Install required packages
!pip install -U sentence-transformers spacy transformers scikit-learn gdown
!python -m spacy download en_core_web_sm

import numpy as np
import spacy
from sentence_transformers import SentenceTransformer
from transformers import pipeline
from sklearn.metrics.pairwise import cosine_similarity

### Section 2: Load and Preprocess Text


In [None]:
import gdown
import spacy

# Google Drive file ID from the shared link
file_id = '1ZjqDwRUIMg3C40kId95Z4F4Gh-51QdGl' # Extracted from the URL

bert_filename = 'bert.txt'
bert_path = f'/content/{bert_filename}'

# Download the file from Google Drive
try:
    print(f"Downloading file with ID: {file_id}")
    gdown.download(id=file_id, output=bert_path, quiet=False)
    print(f"File downloaded to: {bert_path}")
except Exception as e:
    print(f"Error downloading file: {e}")
    # Chain the original exception so the root cause stays visible in the traceback
    raise FileNotFoundError(f"Could not download file with ID '{file_id}'. Please check the link and permissions.") from e


# Load BERT article
try:
    with open(bert_path, "r", encoding="utf-8") as f:
        bert_text = f.read()
except FileNotFoundError as e:
    raise FileNotFoundError(f"Could not find '{bert_filename}' at '{bert_path}' after download. The download may have failed silently.") from e


# Initialize spaCy for sentence splitting.
# Reuse `nlp` if an earlier cell already loaded it; otherwise load it here.
if 'nlp' not in globals():
    print("Initializing spaCy model. Make sure 'en_core_web_sm' is downloaded.")
    nlp = spacy.load("en_core_web_sm")


doc = nlp(bert_text)
sentences = [sent.text for sent in doc.sents]

# Create 2-sentence chunks
chunks = [' '.join(sentences[i:i+2]) for i in range(0, len(sentences), 2)]
print(f"Created {len(chunks)} text chunks")
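
The chunking step can be tried in isolation on a plain list of strings. The sketch below is an illustration rather than the cell's exact behavior: it adds a whitespace-stripping step that the cell above does not perform, and the helper name `make_chunks` is hypothetical.

```python
def make_chunks(sentences, size=2):
    """Group consecutive sentences into non-overlapping chunks of `size` sentences."""
    # Assumption: strip surrounding whitespace and drop empty strings first
    cleaned = [s.strip() for s in sentences if s.strip()]
    # A trailing chunk holds fewer sentences when the count is not a multiple of `size`
    return [" ".join(cleaned[i:i + size]) for i in range(0, len(cleaned), size)]

demo_sentences = [
    "BERT is a language model.",
    "It uses Transformers.",
    "  ",
    "It is pre-trained.\n",
]
demo_chunks = make_chunks(demo_sentences)
print(demo_chunks)
```

With three non-empty sentences, the last chunk contains a single sentence, which is the same edge case the list comprehension above produces for an odd sentence count.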

### Section 3: Create Vector Store


In [None]:
# Ensure compatible versions and clean install to avoid import errors
#!pip install --quiet --force-reinstall "transformers==4.39.3" "sentence-transformers>=2.2.2"

from sentence_transformers import SentenceTransformer

# Load sentence transformer model
encoder = SentenceTransformer('multi-qa-mpnet-base-cos-v1')

# Encode all chunks
chunk_embeddings = encoder.encode(chunks, show_progress_bar=True)
print(f"Vector store shape: {chunk_embeddings.shape}")

### Section 4: Initialize QA Models
Load the two QA models specified in the requirements

In [8]:
# Initialize QA pipelines with the two specified models
print("Loading QA models...")

# Model 1: DistilBERT
qa_model_1 = pipeline(
    "question-answering",
    model="distilbert/distilbert-base-cased-distilled-squad",
    tokenizer="distilbert/distilbert-base-cased-distilled-squad"
)

# Model 2: TinyRoBERTa
qa_model_2 = pipeline(
    "question-answering",
    model="deepset/tinyroberta-squad2",
    tokenizer="deepset/tinyroberta-squad2"
)

print("QA models loaded successfully!")

Loading QA models...


Device set to use cpu

Device set to use cpu


QA models loaded successfully!


### Section 5: Semantic Search Implementation
Implement the semantic search functionality to find relevant chunks

In [9]:
def semantic_search(question, top_k=3, threshold=0.3):
    """
    Find the most relevant chunks for a given question using semantic search

    Args:
        question (str): The question to search for
        top_k (int): Number of top chunks to return
        threshold (float): Minimum similarity score threshold

    Returns:
        list: List of tuples (chunk_text, similarity_score, chunk_index)
    """
    # Encode the question
    question_embedding = encoder.encode([question])

    # Calculate cosine similarity between question and all chunks
    similarities = cosine_similarity(question_embedding, chunk_embeddings)[0]

    # Get top k most similar chunks
    top_indices = np.argsort(similarities)[::-1][:top_k]

    # Filter by threshold and return results
    relevant_chunks = []
    for idx in top_indices:
        similarity_score = similarities[idx]
        if similarity_score >= threshold:
            relevant_chunks.append((chunks[idx], similarity_score, idx))

    return relevant_chunks

# Test the semantic search function
test_question = "What is BERT?"
test_results = semantic_search(test_question)
print(f"\nTest search for '{test_question}':")
for i, (chunk, score, idx) in enumerate(test_results):
    print(f"\nChunk {i+1} (similarity: {score:.3f}):")
    print(f"{chunk[:150]}...")


Test search for 'What is BERT?':

Chunk 1 (similarity: 0.608):
< g r a p h i c s >



Overall pre-training and fine-tuning procedures for BERT....

Chunk 2 (similarity: 0.591):

We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent...

Chunk 3 (similarity: 0.563):
The question-answering example in Figure <ref> will serve as a running example for this section.

 A distinctive feature of BERT is its unified archit...
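
The top-ranked chunk in this test starts with conversion residue (`< g r a p h i c s >`), and another contains a bare `<ref>` placeholder. One possible pre-processing step, sketched here as an assumption and not tested against the full article, is to strip those placeholders before sentence splitting. The function name `clean_article_text` is hypothetical, and the file may contain other residue that these two patterns do not cover.

```python
import re

# Placeholders seen in the printed chunks; other residue may exist in the file
GRAPHICS_RE = re.compile(r"<\s*g\s*r\s*a\s*p\s*h\s*i\s*c\s*s\s*>")
REF_RE = re.compile(r"<ref>")

def clean_article_text(text):
    """Remove figure and reference placeholders, then collapse whitespace."""
    text = GRAPHICS_RE.sub(" ", text)
    text = REF_RE.sub("", text)
    # Collapse runs of whitespace (including newlines) into single spaces
    return re.sub(r"\s+", " ", text).strip()

sample = "< g r a p h i c s >\n\n\nOverall pre-training and fine-tuning procedures for BERT."
cleaned_sample = clean_article_text(sample)
print(cleaned_sample)
```

Collapsing newlines may change where spaCy places sentence boundaries, so the chunk count and the similarity scores would need to be rechecked after applying a step like this.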


### Section 6: QA System Implementation
Combine semantic search with span-based QA

In [11]:
def answer_question(question, verbose=True):
    """
    Answer a question using semantic search + span-based QA

    Args:
        question (str): The question to answer
        verbose (bool): Whether to print detailed information

    Returns:
        dict: Results from both QA models
    """
    if verbose:
        print(f"\n{'='*60}")
        print(f"QUESTION: {question}")
        print(f"{'='*60}")

    # Step 1: Semantic search to find relevant chunks
    relevant_chunks = semantic_search(question, top_k=3, threshold=0.3)

    if not relevant_chunks:
        if verbose:
            print("No relevant chunks found (similarity below threshold)")
        return {
            'question': question,
            'relevant_chunks_found': False,
            'distilbert_answer': "I can't answer this question",
            'tinyroberta_answer': "I can't answer this question"
        }

    if verbose:
        print(f"\nFound {len(relevant_chunks)} relevant chunks:")
        for i, (chunk, score, idx) in enumerate(relevant_chunks):
            print(f"\nChunk {i+1} (similarity: {score:.3f}):")
            print(f"{chunk[:200]}...")

    # Step 2: Combine top chunks as context
    context = ' '.join([chunk for chunk, _, _ in relevant_chunks])

    # Step 3: Apply both QA models
    try:
        # DistilBERT model
        distilbert_result = qa_model_1(question=question, context=context)
        distilbert_answer = distilbert_result['answer']
        distilbert_score = distilbert_result['score']

        # TinyRoBERTa model
        tinyroberta_result = qa_model_2(question=question, context=context)
        tinyroberta_answer = tinyroberta_result['answer']
        tinyroberta_score = tinyroberta_result['score']

        if verbose:
            print(f"\n{'='*40}")
            print("ANSWERS:")
            print(f"{'='*40}")
            print(f"\nDistilBERT Answer (confidence: {distilbert_score:.3f}):")
            print(f"{distilbert_answer}")
            print(f"\nTinyRoBERTa Answer (confidence: {tinyroberta_score:.3f}):")
            print(f"{tinyroberta_answer}")

        return {
            'question': question,
            'relevant_chunks_found': True,
            'num_chunks_used': len(relevant_chunks),
            'distilbert_answer': distilbert_answer,
            'distilbert_score': distilbert_score,
            'tinyroberta_answer': tinyroberta_answer,
            'tinyroberta_score': tinyroberta_score,
            'context_preview': context[:300] + "..."
        }

    except Exception as e:
        if verbose:
            print(f"Error in QA processing: {e}")
        return {
            'question': question,
            'relevant_chunks_found': True,
            'error': str(e),
            'distilbert_answer': "Error in processing",
            'tinyroberta_answer': "Error in processing"
        }
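
When the two models disagree, one simple way to reduce their outputs to a single answer is to take the higher-confidence span and fall back when neither is usable. The sketch below works on plain dictionaries shaped like the pipeline results; the helper name `pick_answer` and the `min_score` cutoff of 0.1 are hypothetical choices, not values from the assignment.

```python
FALLBACK = "I can't answer this question"

def pick_answer(candidates, min_score=0.1):
    """Pick the highest-scoring non-empty answer.

    candidates: dict mapping a model name to {'answer': str, 'score': float}.
    Returns (model_name, answer), or (None, FALLBACK) if nothing qualifies.
    """
    usable = {
        name: r for name, r in candidates.items()
        if r["answer"].strip() and r["score"] >= min_score
    }
    if not usable:
        return None, FALLBACK
    best = max(usable, key=lambda name: usable[name]["score"])
    return best, usable[best]["answer"]

# Scores and answers copied from the B1 run reported in this notebook
b1 = {
    "distilbert": {"answer": "cross-attention", "score": 0.241},
    "tinyroberta": {"answer": "bidirectional", "score": 0.012},
}
print(pick_answer(b1))
```

Confidence scores from different models are not calibrated against each other, so this comparison is a heuristic at best.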

### Section 7: Answer the Required Questions
Process the three specific questions from the assignment

In [12]:
# Define the three questions from the assignment
questions = {
    'B1': "What kind of attention does BERT use?",
    'B2': "Is it difficult to fine-tune?",
    'B3': "How much chocolate does it need?"
}

# Expected answers for reference
expected_answers = {
    'B1': "cross-attention, self-attention used within a single sequence and cross-attention",
    'B2': "fine-tuning is relatively inexpensive",
    'B3': "I can't answer this question"
}

# Store results
results = {}

# Process each question
for question_id, question in questions.items():
    print(f"\n\n{'#'*80}")
    print(f"PROCESSING QUESTION {question_id}")
    print(f"{'#'*80}")

    result = answer_question(question, verbose=True)
    results[question_id] = result

    print(f"\nExpected Answer: {expected_answers[question_id]}")
    print(f"{'='*60}")



################################################################################
PROCESSING QUESTION B1
################################################################################

QUESTION: What kind of attention does BERT use?

Found 3 relevant chunks:

Chunk 1 (similarity: 0.602):
< g r a p h i c s >



Overall pre-training and fine-tuning procedures for BERT....

Chunk 2 (similarity: 0.602):
Cross-attention
BERT is designed such that it does not distinguish between self-attention used within a single sequence and cross-attention used between multiple sequences. Cross-attention between
que...

Chunk 3 (similarity: 0.598):
BERT instead leverages the self-attention mechanism in the Transformer to unify these two stages. Encoding with self-attention is performed jointly with iterative and bidirectional cross attention.
...

ANSWERS:

DistilBERT Answer (confidence: 0.241):
cross-attention

TinyRoBERTa Answer (confidence: 0.012):
bidirectional

Expected Answer: cross-attention, se

### Section 8: Results Summary and Analysis
Provide a comprehensive summary of all results

In [13]:
# Create a comprehensive summary
print("\n" + "="*80)
print("FINAL RESULTS SUMMARY")
print("="*80)

# Summary table
print(f"\n{'Question ID':<12} {'Question':<35} {'Chunks Found':<13} {'DistilBERT Score':<15} {'TinyRoBERTa Score':<15}")
print("-" * 90)

for question_id, result in results.items():
    question_short = result['question'][:30] + "..." if len(result['question']) > 30 else result['question']
    chunks_found = "Yes" if result['relevant_chunks_found'] else "No"
    distilbert_score = f"{result.get('distilbert_score', 0):.3f}" if 'distilbert_score' in result else "N/A"
    tinyroberta_score = f"{result.get('tinyroberta_score', 0):.3f}" if 'tinyroberta_score' in result else "N/A"

    print(f"{question_id:<12} {question_short:<35} {chunks_found:<13} {distilbert_score:<15} {tinyroberta_score:<15}")

print("\n" + "="*80)
print("DETAILED ANSWERS")
print("="*80)

for question_id, result in results.items():
    print(f"\n{question_id}. {result['question']}")
    print(f"   Expected: {expected_answers[question_id]}")
    print(f"   DistilBERT: {result['distilbert_answer']}")
    print(f"   TinyRoBERTa: {result['tinyroberta_answer']}")

    if result['relevant_chunks_found']:
        print(f"   Relevant chunks found: {result.get('num_chunks_used', 'N/A')}")
    else:
        print(f"   No relevant chunks found (similarity below threshold)")

print("\n" + "="*80)
print("SYSTEM PERFORMANCE ANALYSIS")
print("="*80)

# Analyze performance
chunks_found_count = sum(1 for r in results.values() if r['relevant_chunks_found'])
total_questions = len(results)

print(f"\nSemantic Search Performance:")
print(f"- Questions with relevant chunks found: {chunks_found_count}/{total_questions}")
print(f"- Success rate: {(chunks_found_count/total_questions)*100:.1f}%")

print(f"\nModel Comparison:")
valid_results = [r for r in results.values() if r['relevant_chunks_found'] and 'distilbert_score' in r]
if valid_results:
    avg_distilbert = np.mean([r['distilbert_score'] for r in valid_results])
    avg_tinyroberta = np.mean([r['tinyroberta_score'] for r in valid_results])
    print(f"- Average DistilBERT confidence: {avg_distilbert:.3f}")
    print(f"- Average TinyRoBERTa confidence: {avg_tinyroberta:.3f}")
else:
    print("- No valid results for model comparison")

print("\nKey Observations:")
print("- Semantic search returned three chunks above the threshold for both BERT-related questions (B1, B2)")
print("- On B2 both models extracted the expected span or a sub-span of it; on B1 DistilBERT matched part of the expected answer and TinyRoBERTa did not")
print("- The off-topic chocolate question (B3) returned no chunks above the threshold")
print("- The 0.3 similarity threshold separated the on-topic questions from the off-topic one in this three-question test")


FINAL RESULTS SUMMARY

Question ID  Question                            Chunks Found  DistilBERT Score TinyRoBERTa Score
------------------------------------------------------------------------------------------
B1           What kind of attention does BE...   Yes           0.241           0.012          
B2           Is it difficult to fine-tune?       Yes           0.214           0.148          
B3           How much chocolate does it nee...   No            N/A             N/A            

DETAILED ANSWERS

B1. What kind of attention does BERT use?
   Expected: cross-attention, self-attention used within a single sequence and cross-attention
   DistilBERT: cross-attention
   TinyRoBERTa: bidirectional
   Relevant chunks found: 3

B2. Is it difficult to fine-tune?
   Expected: fine-tuning is relatively inexpensive
   DistilBERT: fine-tuning is relatively inexpensive
   TinyRoBERTa: relatively inexpensive
   Relevant chunks found: 3

B3. How much chocolate does it need?
   Expected: 
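
The comparison against the expected answers above is done by eye. A token-overlap F1 in the style of the SQuAD evaluation script gives a rough numeric score instead. This is a minimal sketch; the function names `normalize` and `token_f1` are hypothetical and the normalization rules are a simplified assumption.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and English articles, and split into tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall between two answer strings."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# Answers and expected strings taken from the B2 and B1 results in this notebook
f1_b2_tiny = token_f1("relatively inexpensive", "fine-tuning is relatively inexpensive")
f1_b1_tiny = token_f1(
    "bidirectional",
    "cross-attention, self-attention used within a single sequence and cross-attention",
)
print(f"B2 TinyRoBERTa F1: {f1_b2_tiny:.3f}")
print(f"B1 TinyRoBERTa F1: {f1_b1_tiny:.3f}")
```

A score like this rewards partial spans, which an exact-match check would count as wrong.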

### Section 9: Additional Testing and Validation (Extra Step)
Test the system with additional questions to validate robustness

In [14]:
# Test with additional questions to validate system robustness
additional_questions = [
    "What are the main components of BERT?",
    "How does BERT handle bidirectional context?",
    "What is the training process of BERT?",
    "How many parameters does BERT have?"
]

print("\n" + "#"*80)
print("ADDITIONAL TESTING FOR SYSTEM VALIDATION")
print("#"*80)

for i, question in enumerate(additional_questions, 1):
    print(f"\n\nTEST {i}: {question}")
    print("-" * 60)

    result = answer_question(question, verbose=False)

    if result['relevant_chunks_found']:
        print(f"✓ Relevant chunks found: {result['num_chunks_used']}")
        print(f"DistilBERT: {result['distilbert_answer']}")
        print(f"TinyRoBERTa: {result['tinyroberta_answer']}")
    else:
        print("✗ No relevant chunks found")

print("\n" + "="*80)
print("ASSIGNMENT COMPLETION STATUS")
print("="*80)
print("✓ Document chunking and vectorization - COMPLETED")
print("✓ Semantic search implementation - COMPLETED")
print("✓ Span-based QA with multiple models - COMPLETED")
print("✓ Answer 3 specific questions about BERT - COMPLETED")
print("\nAll requirements have been successfully implemented!")


################################################################################
ADDITIONAL TESTING FOR SYSTEM VALIDATION
################################################################################


TEST 1: What are the main components of BERT?
------------------------------------------------------------
✓ Relevant chunks found: 3
DistilBERT: segmentation embeddings and the position embeddings
TinyRoBERTa: input embeddings


TEST 2: How does BERT handle bidirectional context?
------------------------------------------------------------
✓ Relevant chunks found: 3
DistilBERT: fine-tune all the parameters end-to-end
TinyRoBERTa: jointly conditioning on both left and right context in all layers


TEST 3: What is the training process of BERT?
------------------------------------------------------------
✓ Relevant chunks found: 3
DistilBERT: fine-tuning procedures
TinyRoBERTa: pre-train BERT using two unsupervised tasks


TEST 4: How many parameters does BERT have?
-------------------