# Unit 1 Assignment: The Model Benchmark Challenge

**Objective**: This assignment evaluates the architectural differences between **BERT**, **RoBERTa**, and **BART** by forcing these models to perform tasks they might not be designed for. We will observe why architecture matters and how different model designs affect performance.

## Models Under Test:
1. **BERT** (`bert-base-uncased`): An **Encoder-only** model designed for understanding, not generation
2. **RoBERTa** (`roberta-base`): An **Encoder-only** model that keeps BERT's architecture but uses an optimized pretraining recipe
3. **BART** (`facebook/bart-base`): An **Encoder-Decoder** model designed for seq2seq tasks


## Setup Section

First, let's import the necessary libraries and check versions.

In [15]:
# Import required libraries
from transformers import pipeline, __version__ as transformers_version
import torch
import warnings

# Suppress warnings for cleaner output
warnings.filterwarnings("ignore")

print("Setup Complete!")
print(f"Transformers version: {transformers_version}")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

Setup Complete!
Transformers version: 5.0.0
PyTorch version: 2.10.0+cpu
CUDA available: False


### Experiment 1: Text Generation
**Task**: Try to generate text using the prompt: `"The future of Artificial Intelligence is"`
*   **Code Hint**: `pipeline('text-generation', model='...')`
*   **Hypothesis**: Which models will fail? Why? (Hint: Can an Encoder *generate* new tokens easily?)
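
The hint turns on which positions a token is allowed to attend to. As a minimal sketch (plain Python, no model involved; the function names are illustrative and not part of `transformers`), the two attention patterns can be written out as masks. Encoders such as BERT are pretrained with the bidirectional pattern, while left-to-right generation assumes the causal one.

```python
def bidirectional_mask(n):
    """Every position may attend to every other position (encoder-style)."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Position i may attend only to positions 0..i (decoder-style)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def show(mask):
    """Render a mask as rows of 1s and 0s for quick inspection."""
    return "\n".join(" ".join(str(v) for v in row) for row in mask)


n = 4  # toy sequence length
print("Bidirectional (encoder-only pretraining):")
print(show(bidirectional_mask(n)))
print("Causal (autoregressive generation):")
print(show(causal_mask(n)))
```

A model pretrained only under the first pattern has never had to predict a token from its left context alone, which is one reason to expect weak results from the encoder-only models in this experiment.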

In [19]:
# Test text generation with all three models
models_to_test = [
    "bert-base-uncased",
    "roberta-base", 
    "facebook/bart-base"
]

prompt = "The future of Artificial Intelligence is"

print("=== EXPERIMENT 1: TEXT GENERATION ===\n")

for model_name in models_to_test:
    print(f"Testing {model_name}:")
    print("-" * 40)
    
    try:
        # Create text generation pipeline
        generator = pipeline('text-generation', model=model_name, max_length=50, num_return_sequences=1)
        
        # Generate text
        result = generator(prompt)
        
        generated_text = result[0]['generated_text']
        
        print("DONE!")
        print(f"Generated text: {generated_text}")
        
    except Exception as e:
        print(f"FAILURE")
        print(f"Error: {str(e)}")
    
    print("\n" + "="*60 + "\n")

=== EXPERIMENT 1: TEXT GENERATION ===

Testing bert-base-uncased:
----------------------------------------


If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`
Loading weights: 100%|██████████| 202/202 [00:00<00:00, 721.49it/s, Materializing param=cls.predictions.transform.dense.weight]                 
BertLMHeadModel LOAD REPORT from: bert-base-uncased
Key                         | Status     |  | 
----------------------------+------------+--+-
bert.pooler.dense.weight    | UNEXPECTED |  | 
bert.pooler.dense.bias      | UNEXPECTED |  | 
cls.seq_relationship.bias   | UNEXPECTED |  | 
cls.seq_relationship.weight | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


DONE!
Generated text: The future of Artificial Intelligence is. the within my many so the and and in the ring ( " ( ) dot it or so everyone onwards 2 to over jay however he and and general in the around each " ( " ( " ".


Testing roberta-base:
----------------------------------------


If you want to use `RobertaLMHeadModel` as a standalone, add `is_decoder=True.`
Loading weights: 100%|██████████| 202/202 [00:00<00:00, 881.61it/s, Materializing param=roberta.encoder.layer.11.output.dense.weight]              
RobertaForCausalLM LOAD REPORT from: roberta-base
Key                             | Status     |  | 
--------------------------------+------------+--+-
roberta.embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


DONE!
Generated text: The future of Artificial Intelligence is


Testing facebook/bart-base:
----------------------------------------


Loading weights: 100%|██████████| 159/159 [00:00<00:00, 791.04it/s, Materializing param=model.decoder.layers.5.self_attn_layer_norm.weight]   
This checkpoint seem corrupted. The tied weights mapping for this model specifies to tie model.decoder.embed_tokens.weight to lm_head.weight, but both are absent from the checkpoint, and we could not find another related tied weight for those keys
BartForCausalLM LOAD REPORT from: facebook/bart-base
Key                                                           | Status     | 
--------------------------------------------------------------+------------+-
encoder.layers.{0, 1, 2, 3, 4, 5}.self_attn.out_proj.weight   | UNEXPECTED | 
encoder.layers.{0, 1, 2, 3, 4, 5}.self_attn_layer_norm.bias   | UNEXPECTED | 
encoder.layers.{0, 1, 2, 3, 4, 5}.fc1.bias                    | UNEXPECTED | 
encoder.layers.{0, 1, 2, 3, 4, 5}.final_layer_norm.weight     | UNEXPECTED | 
encoder.layers.{0, 1, 2, 3, 4, 5}.fc2.bias                    | UNEXPECTED | 
encoder.la

DONE!
Generated text: The future of Artificial Intelligence is analyses analyses R R Wid samurai samurai samurai analyses answering diver R nonviolent R answering R rentedPT devotion devotion workers Indies workers answeringanny workers draft██ devotion devotion turmoil answering R██together██together Colourtogethertogethertogether




- BERT and BART produced new tokens, while RoBERTa returned only the prompt. None of the three produced meaningful text.
- Outputs (truncated):
    * BERT: The future of Artificial Intelligence is. the within my many so the and and in the ring ...
    * RoBERTa: The future of Artificial Intelligence is
    * BART: The future of Artificial Intelligence is analyses analyses R R Wid samurai samurai ...
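
To tell "no new output" apart from "nonsense output" programmatically, the continuation can be separated from the prompt. The sketch below assumes the pipeline returns the prompt followed by the continuation, as in the recorded outputs; `describe_generation` is a hypothetical helper, and the example strings are shortened excerpts of those outputs.

```python
def describe_generation(prompt, generated_text):
    """Split off the newly generated part and label it.

    Assumes the generated text starts with the prompt; if it does not,
    the whole text is treated as the continuation.
    """
    if generated_text.startswith(prompt):
        continuation = generated_text[len(prompt):].strip()
    else:
        continuation = generated_text.strip()
    label = "no new tokens" if not continuation else "new tokens produced"
    return continuation, label


prompt = "The future of Artificial Intelligence is"

# Shortened excerpts of the recorded outputs, for illustration only.
recorded = {
    "bert-base-uncased": prompt + ". the within my many so the and and",
    "roberta-base": prompt,
    "facebook/bart-base": prompt + " analyses analyses R R Wid samurai",
}

for model_name, text in recorded.items():
    continuation, label = describe_generation(prompt, text)
    print(f"{model_name}: {label} -> {continuation!r}")
```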


### Experiment 2: Masked Language Modeling (Missing Word)
**Task**: Predict the missing word in: `"The goal of Generative AI is to [MASK] new content."`
*   **Code Hint**: `pipeline('fill-mask', model='...')`
*   **Hypothesis**: This is how BERT/RoBERTa were trained. They should perform well.
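
Each checkpoint expects its own mask token, so a single template is easier to keep in sync than three hand-written sentences. The sketch below hardcodes the tokens used in this notebook (`[MASK]` for BERT, `<mask>` for RoBERTa and BART); with a loaded tokenizer, its `mask_token` attribute would be the more robust source. `MASK_TOKENS`, `TEMPLATE`, and `build_masked_sentence` are illustrative names.

```python
# Mask tokens as used in this notebook; with a loaded tokenizer,
# `tokenizer.mask_token` would give the value without hardcoding it.
MASK_TOKENS = {
    "bert-base-uncased": "[MASK]",
    "roberta-base": "<mask>",
    "facebook/bart-base": "<mask>",
}

TEMPLATE = "The goal of Generative AI is to {mask} new content."


def build_masked_sentence(model_name, template=TEMPLATE):
    """Fill the template with the mask token expected by `model_name`."""
    try:
        mask = MASK_TOKENS[model_name]
    except KeyError:
        raise ValueError(f"No mask token recorded for {model_name!r}") from None
    return template.format(mask=mask)


for name in MASK_TOKENS:
    print(f"{name}: {build_masked_sentence(name)}")
```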

In [17]:
# Test fill-mask with all three models
print("=== EXPERIMENT 2: FILL-MASK (MASKED LANGUAGE MODELING) ===\n")

# Define the sentence with proper mask tokens for each model
sentences = {
    "bert-base-uncased": "The goal of Generative AI is to [MASK] new content.",
    "roberta-base": "The goal of Generative AI is to <mask> new content.",  # RoBERTa uses <mask>
    "facebook/bart-base": "The goal of Generative AI is to <mask> new content."  # BART uses <mask>
}

for model_name in models_to_test:
    print(f"Testing {model_name}:")
    print("-" * 40)
    
    try:
        # Create fill-mask pipeline
        fill_mask = pipeline('fill-mask', model=model_name, top_k=5)
        
        # Get the appropriate sentence for this model
        sentence = sentences[model_name]
        print(f"Input: {sentence}")
        
        # Predict the masked word
        results = fill_mask(sentence)
        
        print(f"DONE - Top 5 predictions:")
        for i, result in enumerate(results, 1):
            print(f"  {i}. {result['token_str']} (confidence: {result['score']:.4f})")
            
    except Exception as e:
        print(f"FAILURE")
        print(f"Error: {str(e)}")
    
    print("\n" + "="*60 + "\n")

=== EXPERIMENT 2: FILL-MASK (MASKED LANGUAGE MODELING) ===

Testing bert-base-uncased:
----------------------------------------


Loading weights: 100%|██████████| 202/202 [00:00<00:00, 731.46it/s, Materializing param=cls.predictions.transform.dense.weight]                 
BertForMaskedLM LOAD REPORT from: bert-base-uncased
Key                         | Status     |  | 
----------------------------+------------+--+-
bert.pooler.dense.weight    | UNEXPECTED |  | 
bert.pooler.dense.bias      | UNEXPECTED |  | 
cls.seq_relationship.bias   | UNEXPECTED |  | 
cls.seq_relationship.weight | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


Input: The goal of Generative AI is to [MASK] new content.
DONE - Top 5 predictions:
  1. create (confidence: 0.5397)
  2. generate (confidence: 0.1558)
  3. produce (confidence: 0.0541)
  4. develop (confidence: 0.0445)
  5. add (confidence: 0.0176)


Testing roberta-base:
----------------------------------------


Loading weights: 100%|██████████| 202/202 [00:00<00:00, 776.37it/s, Materializing param=roberta.encoder.layer.11.output.dense.weight]              
RobertaForMaskedLM LOAD REPORT from: roberta-base
Key                             | Status     |  | 
--------------------------------+------------+--+-
roberta.embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


Input: The goal of Generative AI is to <mask> new content.
DONE - Top 5 predictions:
  1.  generate (confidence: 0.3711)
  2.  create (confidence: 0.3677)
  3.  discover (confidence: 0.0835)
  4.  find (confidence: 0.0213)
  5.  provide (confidence: 0.0165)


Testing facebook/bart-base:
----------------------------------------


Loading weights: 100%|██████████| 259/259 [00:00<00:00, 881.24it/s, Materializing param=model.shared.weight]                                  


Input: The goal of Generative AI is to <mask> new content.
DONE - Top 5 predictions:
  1.  create (confidence: 0.0746)
  2.  help (confidence: 0.0657)
  3.  provide (confidence: 0.0609)
  4.  enable (confidence: 0.0359)
  5.  improve (confidence: 0.0332)




* BERT and RoBERTa both excel here and outperform BART.
    - BERT: Highest top-1 confidence (54.0% for "create")
    - RoBERTa: Comparably strong, with two near-tied candidates (37.1% "generate", 36.8% "create")
    - BART: Much lower confidence (7.5% for "create"), but still plausible predictions
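
Top-1 confidence alone hides how much probability each model assigns to its whole shortlist. Summing the five recorded scores per model gives a rough picture; the numbers below are copied from the outputs above, and `top5_summary` is an illustrative helper.

```python
# Top-5 fill-mask scores copied from the recorded outputs above.
recorded_scores = {
    "bert-base-uncased": [0.5397, 0.1558, 0.0541, 0.0445, 0.0176],
    "roberta-base": [0.3711, 0.3677, 0.0835, 0.0213, 0.0165],
    "facebook/bart-base": [0.0746, 0.0657, 0.0609, 0.0359, 0.0332],
}


def top5_summary(scores):
    """Return the top-1 score, the top-1/top-2 margin, and the top-5 total."""
    ordered = sorted(scores, reverse=True)
    return {
        "top1": ordered[0],
        "margin": ordered[0] - ordered[1],
        "top5_mass": sum(ordered),
    }


for model_name, scores in recorded_scores.items():
    s = top5_summary(scores)
    print(
        f"{model_name}: top1={s['top1']:.4f} "
        f"margin={s['margin']:.4f} top5_mass={s['top5_mass']:.4f}"
    )
```

On these numbers the five candidates cover roughly 81% (BERT), 86% (RoBERTa), and 27% (BART) of the probability mass, so BART's lower top-1 score reflects a flatter distribution rather than a different first choice.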

### Experiment 3: Question Answering
**Task**: Answer the question `"What are the risks?"` based on the context: `"Generative AI poses significant risks such as hallucinations, bias, and deepfakes."`
*   **Code Hint**: `pipeline('question-answering', model='...')`
*   **Note**: Using a "base" model (not fine-tuned for SQuAD) might yield random or poor results. Observe this behavior.
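
An extractive QA head scores every token as a possible answer start and answer end, and the answer is the best-scoring span. The toy sketch below shows why untrained scores yield an arbitrary span. It is plain Python: random numbers stand in for an untrained head, the whitespace tokenization is a simplification, and `best_span` is an illustrative helper, not the pipeline's actual decoding code.

```python
import random


def best_span(start_scores, end_scores, max_len=15):
    """Return (start, end) maximizing start + end score with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            score = s + end_scores[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best


tokens = ("Generative AI poses significant risks such as "
          "hallucinations , bias , and deepfakes .").split()

for seed in range(3):
    rng = random.Random(seed)
    # Random scores play the role of a randomly initialized QA head.
    start = [rng.gauss(0, 1) for _ in tokens]
    end = [rng.gauss(0, 1) for _ in tokens]
    i, j = best_span(start, end)
    print(f"seed {seed}: {' '.join(tokens[i:j + 1])!r}")
```

With trained weights the start and end scores would peak around the answer; with random ones, the selected span depends only on the seed.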

In [18]:
# Test question answering with all three models
print("=== EXPERIMENT 3: QUESTION ANSWERING ===\n")

context = "Generative AI poses significant risks such as hallucinations, bias, and deepfakes."
question = "What are the risks?"

print(f"Context: {context}")
print(f"Question: {question}\n")

for model_name in models_to_test:
    print(f"Testing {model_name}:")
    print("-" * 40)
    
    try:
        # Create question-answering pipeline with base model
        qa_pipeline = pipeline('question-answering', model=model_name)
        
        # Get answer
        result = qa_pipeline(question=question, context=context)
        
        print(f"DONE!")
        print(f"Answer: '{result['answer']}'")
        print(f"Confidence: {result['score']:.4f}")
        print(f"Start position: {result['start']}")
        print(f"End position: {result['end']}")
        
    except Exception as e:
        print(f"FAILURE")
        print(f"Error: {str(e)}")
    
    print("\n" + "="*60 + "\n")

=== EXPERIMENT 3: QUESTION ANSWERING ===

Context: Generative AI poses significant risks such as hallucinations, bias, and deepfakes.
Question: What are the risks?

Testing bert-base-uncased:
----------------------------------------


Loading weights: 100%|██████████| 197/197 [00:00<00:00, 793.66it/s, Materializing param=bert.encoder.layer.11.output.dense.weight]              
BertForQuestionAnswering LOAD REPORT from: bert-base-uncased
Key                                        | Status     | 
-------------------------------------------+------------+-
bert.pooler.dense.weight                   | UNEXPECTED | 
cls.predictions.transform.dense.weight     | UNEXPECTED | 
cls.predictions.transform.LayerNorm.bias   | UNEXPECTED | 
bert.pooler.dense.bias                     | UNEXPECTED | 
cls.predictions.transform.dense.bias       | UNEXPECTED | 
cls.seq_relationship.bias                  | UNEXPECTED | 
cls.predictions.transform.LayerNorm.weight | UNEXPECTED | 
cls.seq_relationship.weight                | UNEXPECTED | 
cls.predictions.bias                       | UNEXPECTED | 
qa_outputs.weight                          | MISSING    | 
qa_outputs.bias                            | MISSING    | 

Notes:
- UNEXPECTED	:can b

DONE!
Answer: 'risks such as hallucinations'
Confidence: 0.0087
Start position: 32
End position: 60


Testing roberta-base:
----------------------------------------


Loading weights: 100%|██████████| 197/197 [00:00<00:00, 875.25it/s, Materializing param=roberta.encoder.layer.11.output.dense.weight]              
RobertaForQuestionAnswering LOAD REPORT from: roberta-base
Key                             | Status     | 
--------------------------------+------------+-
lm_head.layer_norm.weight       | UNEXPECTED | 
lm_head.dense.bias              | UNEXPECTED | 
lm_head.dense.weight            | UNEXPECTED | 
lm_head.bias                    | UNEXPECTED | 
roberta.embeddings.position_ids | UNEXPECTED | 
lm_head.layer_norm.bias         | UNEXPECTED | 
qa_outputs.weight               | MISSING    | 
qa_outputs.bias                 | MISSING    | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING	:those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.


DONE!
Answer: 'Generative AI'
Confidence: 0.0078
Start position: 0
End position: 13


Testing facebook/bart-base:
----------------------------------------


Loading weights: 100%|██████████| 259/259 [00:00<00:00, 865.30it/s, Materializing param=model.shared.weight]                                  
BartForQuestionAnswering LOAD REPORT from: facebook/bart-base
Key               | Status  | 
------------------+---------+-
qa_outputs.weight | MISSING | 
qa_outputs.bias   | MISSING | 

Notes:
- MISSING	:those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.


DONE!
Answer: 'as hallucinations, bias, and deepfakes.'
Confidence: 0.0327
Start position: 43
End position: 82




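- In the question-answering experiment, all three models perform poorly, though their answers differ:
    * BERT: Partially correct span with very low confidence (0.87%)
    * RoBERTa: Wrong answer with very low confidence (0.78%)
    * BART: Most complete span with the highest, though still low, confidence (3.27%)
- The load reports mark `qa_outputs.weight` and `qa_outputs.bias` as MISSING for all three models, meaning the QA head was newly initialized. The differences above are therefore likely down to chance rather than architecture.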
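
The reported character offsets can be checked by slicing the context, and the three answers can be compared with a simple token-overlap F1. The reference answer below is an assumption made for illustration (the assignment does not define a gold answer), and `token_f1` is a simplified stand-in for SQuAD-style scoring.

```python
import re
from collections import Counter

context = ("Generative AI poses significant risks such as "
           "hallucinations, bias, and deepfakes.")

# (answer, start, end) as reported in the recorded outputs above.
recorded = {
    "bert-base-uncased": ("risks such as hallucinations", 32, 60),
    "roberta-base": ("Generative AI", 0, 13),
    "facebook/bart-base": ("as hallucinations, bias, and deepfakes.", 43, 82),
}

# Assumed reference answer, chosen for illustration only.
reference = "hallucinations, bias, and deepfakes"


def tokenize(text):
    """Lowercase and keep alphanumeric word tokens only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(prediction, gold):
    """Token-overlap F1 between a predicted and a gold answer."""
    pred, ref = tokenize(prediction), tokenize(gold)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


for model_name, (answer, start, end) in recorded.items():
    assert context[start:end] == answer  # offsets match the answer text
    print(f"{model_name}: F1 vs assumed reference = {token_f1(answer, reference):.2f}")
```

Under this assumed reference, BART's span scores highest (F1 of about 0.89), BERT's scores 0.25, and RoBERTa's scores 0.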

## Observation Table

Based on the experimental results, here is the updated observation table:

| Task | Model | Classification | Observation (What actually happened?) | Why did this happen? (Architectural Reason) |
|------|-------|----------------------------------|-------------|---------------------|
| **Text Generation** | BERT | **Failure** | Generated nonsense: ". the within my many so the and and in the ring ..." | BERT is encoder-only and pretrained with bidirectional MLM; it never learned to predict the next token from left context alone, so loading it as a causal LM produces gibberish |
| | RoBERTa | **Failure** | Returned the prompt unchanged, with no new tokens | RoBERTa is also encoder-only and pretrained with MLM rather than next-token prediction; generation stopped without adding any tokens |
| | BART | **Failure** | Generated unrelated tokens: "analyses analyses R R Wid samurai samurai samurai ..." | The `text-generation` pipeline loaded only BART's decoder as `BartForCausalLM`: the load report lists the encoder weights as UNEXPECTED and the tied `lm_head` / `embed_tokens` weights as absent from the checkpoint. BART is designed for seq2seq use, so the decoder likely ran without its encoder and with an uninitialized output layer |
| **Fill-Mask** | BERT | **Success** | Excellent predictions: "create" (54.0%), "generate" (15.6%), "produce" (5.4%), "develop" (4.5%) | BERT is pretrained on Masked Language Modeling (MLM), so fill-mask matches its training objective directly |
| | RoBERTa | **Success** | Very good predictions: "generate" (37.1%), "create" (36.8%), "discover" (8.4%) | RoBERTa is pretrained with an optimized MLM recipe, so fill-mask also matches its training objective |
| | BART | **Partial Success** | Reasonable but much lower-confidence predictions: "create" (7.5%), "help" (6.6%), "provide" (6.1%) | BART is pretrained with a denoising objective (text infilling) in which one `<mask>` can stand for a span of several tokens, which plausibly spreads probability across more candidates than single-token MLM |
| **Question Answering** | BERT | **Poor Performance** | Found the partial answer "risks such as hallucinations" with very low confidence (0.87%) | Base BERT has no QA fine-tuning; the `qa_outputs` head is reported as MISSING and was randomly initialized, so the selected span is close to arbitrary |
| | RoBERTa | **Poor Performance** | Returned the wrong answer "Generative AI" with very low confidence (0.78%) | Same issue as BERT: a base model with a randomly initialized QA head |
| | BART | **Poor Performance** | Returned the most complete answer, "as hallucinations, bias, and deepfakes.", with higher but still low confidence (3.27%) | The QA head is also randomly initialized here, so the better span is more likely chance than an advantage of the encoder-decoder architecture; fine-tuning is still required |
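
The "randomly initialized QA head" explanation can be read directly off the load reports printed during the runs: keys marked `MISSING` were newly initialized. A minimal sketch that extracts those keys from report text is shown below. The parsing assumes the `Key | Status |` table layout seen in this notebook's output, and `missing_keys` is a hypothetical helper rather than a `transformers` API.

```python
def missing_keys(report_text):
    """Return keys whose status column reads MISSING in a load report."""
    keys = []
    for line in report_text.splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        if len(cells) >= 2 and cells[1] == "MISSING":
            keys.append(cells[0])
    return keys


# Excerpt of the BartForQuestionAnswering report recorded above.
bart_qa_report = """\
Key               | Status  |
------------------+---------+-
qa_outputs.weight | MISSING |
qa_outputs.bias   | MISSING |
"""

# Excerpt of the BertForMaskedLM report recorded above.
bert_mlm_report = """\
Key                         | Status     |  |
----------------------------+------------+--+-
bert.pooler.dense.weight    | UNEXPECTED |  |
cls.seq_relationship.bias   | UNEXPECTED |  |
"""

print("QA run, newly initialized keys:", missing_keys(bart_qa_report))
print("Fill-mask run, newly initialized keys:", missing_keys(bert_mlm_report))
```

An empty list for the fill-mask run and a non-empty one for the QA run matches the pattern in the table: the pretrained head was available for fill-mask but not for question answering.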