# 1. Fill-Mask: Predict Missing Words

**Estimated Time**: ~2 hours

**Prerequisites**: None (this is the first notebook)

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. **Explain** what masked language modeling (MLM) is and why it matters for NLP
2. **Use** the Hugging Face `fill-mask` pipeline to predict missing words
3. **Interpret** confidence scores and understand probability distributions
4. **Compare** predictions from different BERT-family models
5. **Identify** appropriate use cases for fill-mask models

## Setup

Run this cell first to import required libraries. The first time you run fill-mask, it will download a model (~400MB) which is then cached for future use.

In [None]:
# Core imports
from transformers import pipeline, AutoTokenizer, AutoModelForMaskedLM
import torch

# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

print("Setup complete!")

---

# Part 1: Conceptual Foundation

## What is Masked Language Modeling?

**In plain English**: Masked Language Modeling (MLM) is like a "fill-in-the-blank" test for AI. You give the model a sentence with a word hidden (masked), and it predicts what word should go there.

**Technical definition**: MLM is a self-supervised pre-training objective where the model learns to predict randomly masked tokens based on their surrounding context.
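
One common way to write this objective is sketched below. The notation is introduced here for illustration: $x$ is the original token sequence, $M$ is the set of masked positions, $\tilde{x}$ is the corrupted input the model actually sees, and $\theta$ are the model parameters. Training minimizes the negative log-likelihood of the original tokens at the masked positions only:

$$
\mathcal{L}_{\text{MLM}}(\theta) = -\sum_{i \in M} \log p_\theta\left(x_i \mid \tilde{x}\right)
$$

Positions outside $M$ contribute nothing to the loss, which is why the model learns from only a fraction of each sentence per pass.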

### The "Cloze Test" Analogy

You've probably done fill-in-the-blank exercises in school:

> "The cat sat on the ____."

Your brain uses context ("cat", "sat", "on", "the") to predict likely words: "mat", "floor", "couch", etc.

MLM models do the same thing, but they've read billions of sentences and learned patterns about how words relate to each other.

### How BERT Was Trained

BERT (Bidirectional Encoder Representations from Transformers) was trained on a massive corpus using this approach:

1. **Take a sentence**: "The cat sat on the mat."
2. **Randomly mask ~15% of tokens**: "The cat [MASK] on the mat."
3. **Train the model to predict the masked word**: Model learns that "sat" fits here
4. **Repeat billions of times** across Wikipedia and books
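
The steps above can be sketched in plain Python. This is a toy, word-level illustration rather than BERT's actual data pipeline: the function name `mask_for_mlm`, the tiny `vocab`, and the `seed` parameter are assumptions made for the example. It also follows the detail from the BERT paper that a selected token is replaced by `[MASK]` about 80% of the time, by a random token about 10% of the time, and left unchanged otherwise.

```python
import random

MASK = "[MASK]"

def mask_for_mlm(tokens, vocab, select_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) for one toy MLM training example."""
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)  # None = position is not a prediction target
    for i, token in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # most positions are left untouched
        labels[i] = token  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK  # usual case: hide the token
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # sometimes: swap in a random token
        # otherwise: keep the original token in place
    return corrupted, labels

tokens = "the cat sat on the mat".split()
vocab = ["dog", "ran", "under", "a", "rug"]  # hypothetical toy vocabulary

# select_prob is raised to 0.5 so a six-word sentence shows some masking
corrupted, labels = mask_for_mlm(tokens, vocab, select_prob=0.5, seed=1)
print(corrupted)
print(labels)
```

The `labels` list is what the training loss is computed against: only positions with a non-`None` label count.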

### Why "Bidirectional" Matters

```
Left context:  "The cat"     →  [MASK]  ←  "on the mat"  :Right context
```

Unlike older models that only read left-to-right (like GPT), BERT looks at words on **both sides** of the mask. This bidirectional context helps it make much better predictions.

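**Example**: In "The bank was flooded after the [MASK] broke", the left context ("flooded") and the right context ("broke") together point to "dam" (not "robber"). A left-to-right model would never see "broke" when making this prediction.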
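
To make the difference concrete, here is a toy sketch of which words each kind of model can use when predicting the masked position. The helper `visible_context` is hypothetical and works on whole words; real models operate on subword tokens and enforce this with attention masks.

```python
def visible_context(tokens, mask_index, bidirectional):
    """Return the tokens a model may use to predict tokens[mask_index]."""
    left = tokens[:mask_index]
    # A left-to-right model cannot look past the position it is predicting
    right = tokens[mask_index + 1:] if bidirectional else []
    return left + right

tokens = "The bank was flooded after the [MASK] broke".split()
mask_index = tokens.index("[MASK]")

left_to_right = visible_context(tokens, mask_index, bidirectional=False)
both_sides = visible_context(tokens, mask_index, bidirectional=True)

print("Left-to-right sees:", left_to_right)
print("Bidirectional sees:", both_sides)
```

Only the bidirectional view includes "broke", the word that follows the mask.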

### Real-World Applications

Fill-mask models are the foundation for many NLP applications:

- **Autocomplete**: Suggesting words as you type
- **Grammar checking**: Identifying words that don't fit the context
- **Data augmentation**: Generating variations of training data
- **Transfer learning**: Pre-trained knowledge is fine-tuned for specific tasks

### Key Terminology

| Term | Definition |
|------|------------|
| **Token** | A piece of text (usually a word or subword) that the model processes |
| **[MASK]** | The special token that tells BERT "predict what goes here" |
| **Logits** | Raw model output scores (before converting to probabilities) |
| **Softmax** | Function that converts logits to probabilities (0-1, summing to 1) |
| **Top-k** | The k most likely predictions |
| **Confidence score** | Probability assigned to a prediction (higher = more confident) |
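
The terms in the table fit together in a short chain: logits go through softmax to become probabilities, and top-k picks the most probable entries. The sketch below uses plain Python and four invented logits (not real model output) so you can trace the arithmetic by hand.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits.values())  # subtracting the max keeps exp() numerically stable
    exps = {token: math.exp(score - peak) for token, score in logits.items()}
    total = sum(exps.values())
    return {token: value / total for token, value in exps.items()}

def top_k(probs, k):
    """Return the k (token, probability) pairs with the highest probability."""
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]

# Invented logits for "The capital of France is [MASK]."
toy_logits = {"paris": 6.0, "lyon": 3.5, "london": 2.0, "banana": -4.0}

probs = softmax(toy_logits)
for token, confidence in top_k(probs, k=2):
    print(f"{token:10} {confidence:.2%}")
```

A real model does the same thing over its whole vocabulary (tens of thousands of tokens) instead of four.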

### Check Your Understanding

Before moving on, try to answer these questions (answers at the end of the notebook):

1. Why is BERT called "bidirectional"?
   - A) It can translate in both directions
   - B) It looks at context on both sides of a masked word
   - C) It was trained on two datasets

2. What percentage of tokens were masked during BERT's training?
   - A) 50%
   - B) 15%
   - C) 5%

3. What does a higher confidence score mean?
   - A) The model is more certain about its prediction
   - B) The word is longer
   - C) The prediction is always correct

4. Which is NOT a use case for fill-mask models?
   - A) Autocomplete
   - B) Grammar checking
   - C) Real-time translation

---

# Part 2: Basic Implementation

## Your First Fill-Mask Pipeline

Hugging Face provides a simple `pipeline` abstraction that handles all the complexity for you. Let's start with the most basic usage:

In [None]:
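# Create a fill-mask pipeline with an explicitly named model.
# Without model=..., the library falls back to its default checkpoint
# (a distilled RoBERTa model at the time of writing), which expects <mask> rather than [MASK].
fill_mask = pipeline("fill-mask", model="bert-base-uncased")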

# Run prediction on a simple sentence
# Note: The [MASK] token tells the model which word to predict
result = fill_mask("The capital of France is [MASK].")

# Print the results
print("Top predictions for: 'The capital of France is [MASK].'\n")
for prediction in result:
    print(f"  {prediction['token_str']:12} → {prediction['score']:.2%} confidence")

### Understanding the Output

The pipeline returns a list of dictionaries, each containing:

- `token_str`: The predicted word
- `score`: Confidence score (probability between 0 and 1)
- `token`: The token ID (internal representation)
- `sequence`: The full sentence with the prediction filled in
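
Because the result is an ordinary list of dictionaries, you can post-process it with plain Python. The sketch below filters predictions by a minimum score. The helper `confident_predictions` is hypothetical, and `sample_result` is a hand-written stand-in with invented scores and placeholder token IDs, not real model output.

```python
def confident_predictions(predictions, min_score=0.05):
    """Return (word, score) pairs for predictions scoring at least min_score."""
    return [(p["token_str"], p["score"]) for p in predictions if p["score"] >= min_score]

# Hand-written stand-in for pipeline output (invented values for illustration)
sample_result = [
    {"token_str": "paris", "score": 0.40, "token": 1, "sequence": "the capital of france is paris."},
    {"token_str": "lyon", "score": 0.06, "token": 2, "sequence": "the capital of france is lyon."},
    {"token_str": "lille", "score": 0.03, "token": 3, "sequence": "the capital of france is lille."},
]

for word, score in confident_predictions(sample_result):
    print(f"{word:10} {score:.2%}")
```

The same function should work unchanged on the `result` returned by `fill_mask(...)` for a sentence with a single mask.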

Let's examine one prediction in detail:

In [None]:
# Look at the first (highest confidence) prediction in detail
top_prediction = result[0]

print("Detailed view of top prediction:")
print(f"  Predicted word: '{top_prediction['token_str']}'")
print(f"  Confidence:     {top_prediction['score']:.4f} ({top_prediction['score']:.2%})")
print(f"  Token ID:       {top_prediction['token']}")
print(f"  Full sentence:  '{top_prediction['sequence']}'")

### How Context Changes Predictions

The same [MASK] position can have completely different predictions based on surrounding words. Let's see this in action:

In [None]:
# Compare predictions for similar sentences with different contexts
sentences = [
    "The [MASK] is a popular pet.",
    "The [MASK] is a wild animal.",
    "The [MASK] is a farm animal.",
    "The [MASK] is an endangered species."
]

print("Same position, different contexts:\n")
for sentence in sentences:
    predictions = fill_mask(sentence)
    # Get top 3 predictions
    top_3 = [f"{p['token_str']}({p['score']:.0%})" for p in predictions[:3]]
    print(f"'{sentence}'")
    print(f"  → {', '.join(top_3)}\n")

Notice how the predictions change based on context:
- "popular pet" → likely dogs, cats
- "wild animal" → likely lions, wolves
- "farm animal" → likely cows, pigs
- "endangered species" → likely tigers, rhinos

This shows that the model conditions its predictions on the surrounding words rather than relying on overall word frequency alone.

### Controlling the Number of Predictions

By default, the pipeline returns 5 predictions. You can change this with the `top_k` parameter:

In [None]:
# Get more predictions
sentence = "I love to eat [MASK] for breakfast."

# Top 10 predictions
results = fill_mask(sentence, top_k=10)

print(f"Top 10 predictions for: '{sentence}'\n")
for i, pred in enumerate(results, 1):
    print(f"  {i:2}. {pred['token_str']:15} {pred['score']:.2%}")

---

## Exercise 1: Context Experiments (Guided)

**Difficulty**: Basic | **Time**: 10-15 minutes

**Your task**: Explore how different contexts affect predictions for the same masked position.

### Step 1: Run the code below and observe the predictions

In [None]:
# The word "bank" has multiple meanings. Let's see if the model can distinguish them.

# Context 1: Financial institution
financial = fill_mask("I went to the bank to deposit my [MASK].")
print("Financial context: 'I went to the bank to deposit my [MASK].'")
print(f"  Top prediction: {financial[0]['token_str']} ({financial[0]['score']:.2%})\n")

# Context 2: River bank
river = fill_mask("The children played by the river [MASK].")
print("River context: 'The children played by the river [MASK].'")
print(f"  Top prediction: {river[0]['token_str']} ({river[0]['score']:.2%})")

### Step 2: Try your own experiments

Create 3 sentences with [MASK] in the same position, but with different contexts that should lead to different predictions.

**Example theme**: Professions
- "The [MASK] performed surgery on the patient." (expect: doctor, surgeon)
- "The [MASK] argued the case in court." (expect: lawyer, attorney)
- "The [MASK] designed the new building." (expect: architect)

In [None]:
# YOUR CODE HERE
# Create 3 sentences with different contexts

my_sentences = [
    # Replace these with your own sentences
    "The [MASK] performed surgery on the patient.",
    "The [MASK] argued the case in court.",
    "The [MASK] designed the new building."
]

# Run predictions
for sentence in my_sentences:
    result = fill_mask(sentence)
    print(f"'{sentence}'")
    print(f"  → {result[0]['token_str']} ({result[0]['score']:.2%})\n")

### Step 3: Reflection Questions

After running your experiments, consider:
1. Were the predictions what you expected?
2. Did any predictions surprise you?
3. Can you find a sentence where the model makes a clearly wrong prediction?

---

# Part 3: Intermediate Exploration

## Comparing Different Models

BERT isn't the only masked language model. Let's compare predictions from different models to see how they differ.

### Important: Different models use different mask tokens!

| Model | Mask Token | Case-sensitive? |
|-------|------------|----------------|
| BERT (base-uncased) | [MASK] | No (lowercased) |
| BERT (base-cased) | [MASK] | Yes |
| RoBERTa | `<mask>` | Yes |
| DistilBERT | [MASK] | No (uncased version) |
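
One way to avoid mask-token mistakes is to write sentences with a neutral placeholder and substitute the right token per model. The sketch below is a minimal illustration: the `{mask}` placeholder, the `MASK_TOKENS` lookup, and the `fill_template` helper are all assumptions made for this example. With a loaded pipeline you could instead read the token from `pipe.tokenizer.mask_token`.

```python
# Mask tokens taken from the table above, keyed by checkpoint name
MASK_TOKENS = {
    "bert-base-uncased": "[MASK]",
    "bert-base-cased": "[MASK]",
    "distilbert-base-uncased": "[MASK]",
    "roberta-base": "<mask>",
}

def fill_template(template, model_name):
    """Replace the {mask} placeholder with the mask token the model expects."""
    if "{mask}" not in template:
        raise ValueError("template must contain a {mask} placeholder")
    return template.replace("{mask}", MASK_TOKENS[model_name])

template = "The capital of France is {mask}."
for name in ("bert-base-uncased", "roberta-base"):
    print(f"{name:20} {fill_template(template, name)}")
```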

In [None]:
# Create pipelines for different models
# Note: First run will download each model (~250-500MB each)

print("Loading models (this may take a moment on first run)...\n")

# BERT base (uncased - converts everything to lowercase)
bert_uncased = pipeline("fill-mask", model="bert-base-uncased")

# DistilBERT (smaller and faster; its authors report about 97% of BERT's language-understanding performance)
distilbert = pipeline("fill-mask", model="distilbert-base-uncased")

print("Models loaded!")

In [None]:
# Compare predictions across models
test_sentence = "The scientist made an important [MASK] in the laboratory."

print(f"Sentence: '{test_sentence}'\n")

models = [
    ("BERT (base-uncased)", bert_uncased),
    ("DistilBERT", distilbert),
]

for model_name, model_pipeline in models:
    predictions = model_pipeline(test_sentence)
    top_3 = [(p['token_str'], p['score']) for p in predictions[:3]]
    print(f"{model_name}:")
    for word, score in top_3:
        print(f"  {word:15} {score:.2%}")
    print()

### Understanding Model Differences

Different models may give different predictions because:

1. **Training data**: Models are trained on different text corpora
2. **Model size**: Larger models generally capture more nuance
3. **Architecture**: Different designs lead to different representations
4. **Vocabulary**: Each model has its own tokenizer and vocabulary
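
If you want a single number for how much two models agree, one simple option is the overlap between their top-k word sets. The sketch below computes a Jaccard overlap; the helper name `top_k_overlap` and both prediction lists are invented for illustration and are not real model output.

```python
def top_k_overlap(preds_a, preds_b):
    """Jaccard overlap between the predicted words of two models (0.0 to 1.0)."""
    words_a = {p["token_str"].strip().lower() for p in preds_a}
    words_b = {p["token_str"].strip().lower() for p in preds_b}
    union = words_a | words_b
    if not union:
        return 1.0  # two empty lists agree trivially
    return len(words_a & words_b) / len(union)

# Invented top-3 lists for two hypothetical models
model_a = [
    {"token_str": "discovery", "score": 0.50},
    {"token_str": "breakthrough", "score": 0.20},
    {"token_str": "contribution", "score": 0.05},
]
model_b = [
    {"token_str": "discovery", "score": 0.45},
    {"token_str": "experiment", "score": 0.15},
    {"token_str": "breakthrough", "score": 0.10},
]

print(f"Top-3 overlap: {top_k_overlap(model_a, model_b):.2f}")
```

An overlap of 1.0 means both models proposed the same words (possibly in a different order); 0.0 means they share none.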

### Confidence Score Analysis

Let's look at how confidence scores are distributed:

In [None]:
# Get more predictions to see confidence distribution
sentence = "Python is a popular [MASK] language."
predictions = fill_mask(sentence, top_k=15)

print(f"Confidence distribution for: '{sentence}'\n")
print("Word            Confidence    Visual")
print("-" * 50)

for pred in predictions:
    # Create a visual bar
    bar_length = int(pred['score'] * 50)
    bar = '█' * bar_length
    print(f"{pred['token_str']:15} {pred['score']:6.2%}    {bar}")

Notice how confidence scores typically follow a **long-tail distribution**:
- Top 1-2 predictions have high confidence
- Confidence drops sharply for subsequent predictions
- Many plausible words have very low confidence
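
One way to quantify this shape is to count how many predictions you need before most of the probability is covered. The sketch below is an illustration with invented scores; `predictions_needed` is a hypothetical helper, and it measures coverage relative to the scores you pass in (the returned top-k), not the full vocabulary.

```python
def predictions_needed(scores, target_mass=0.9):
    """Count how many of the highest scores are needed to cover target_mass of their total."""
    ordered = sorted(scores, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, score in enumerate(ordered, start=1):
        running += score
        if running >= target_mass * total:
            return count
    return len(ordered)

# Invented score lists: one peaked, one nearly flat
peaked = [0.85, 0.05, 0.03, 0.02, 0.01]
flat = [0.12, 0.11, 0.10, 0.10, 0.09]

print("Peaked distribution:", predictions_needed(peaked))
print("Flat distribution:  ", predictions_needed(flat))
```

A small count signals a steep, long-tailed distribution; a count close to the list length signals a flat one.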

## Handling Ambiguous Contexts

Some sentences are genuinely ambiguous - multiple words would work equally well:

In [None]:
# Examples of ambiguous vs. clear contexts

# Ambiguous: Many words could fit
ambiguous = "I like to [MASK]."
amb_results = fill_mask(ambiguous, top_k=5)

# Clear: Strong contextual constraints
clear = "Water freezes at zero degrees [MASK]."
clear_results = fill_mask(clear, top_k=5)

print("AMBIGUOUS CONTEXT:")
print(f"'{ambiguous}'\n")
for p in amb_results:
    print(f"  {p['token_str']:10} {p['score']:.2%}")

print(f"\nConfidence gap (1st vs 2nd): {amb_results[0]['score'] - amb_results[1]['score']:.2%}")

print("\n" + "="*50 + "\n")

print("CLEAR CONTEXT:")
print(f"'{clear}'\n")
for p in clear_results:
    print(f"  {p['token_str']:10} {p['score']:.2%}")

print(f"\nConfidence gap (1st vs 2nd): {clear_results[0]['score'] - clear_results[1]['score']:.2%}")

The **confidence gap** between 1st and 2nd predictions indicates how certain the model is:
- **Large gap**: Model is confident in top prediction
- **Small gap**: Multiple words are equally plausible
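
The gap only compares the top two predictions. If you want a measure that uses all returned scores, normalized entropy is one option, sketched below. This is an addition for illustration, not something the pipeline reports, and the score lists are invented. Values near 0 mean one word dominates; values near 1 mean the listed words are about equally likely.

```python
import math

def normalized_entropy(scores):
    """Entropy of the renormalized scores, scaled to the range 0..1."""
    total = sum(scores)
    probs = [s / total for s in scores if s > 0]
    if len(probs) < 2:
        return 0.0  # a single option carries no uncertainty
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(probs))

clear_scores = [0.97, 0.01, 0.01, 0.01]      # invented: one dominant word
ambiguous_scores = [0.25, 0.25, 0.25, 0.25]  # invented: four equally likely words

print(f"Clear context:     {normalized_entropy(clear_scores):.2f}")
print(f"Ambiguous context: {normalized_entropy(ambiguous_scores):.2f}")
```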

---

## Exercise 2: Model Comparison (Semi-guided)

**Difficulty**: Intermediate | **Time**: 15-20 minutes

**Your task**: Find sentences where BERT and DistilBERT give notably different predictions.

**Hints**:
1. Try sentences with technical or specialized vocabulary
2. Try sentences with cultural references
3. Compare confidence scores, not just the top prediction

**Expected output**: At least 2 sentences where the models disagree on the top prediction

In [None]:
# YOUR CODE HERE
# Create a function to compare models on a sentence

def compare_models(sentence):
    """Compare BERT and DistilBERT predictions for a sentence."""
    print(f"Sentence: '{sentence}'\n")
    
    bert_pred = bert_uncased(sentence, top_k=3)
    distil_pred = distilbert(sentence, top_k=3)
    
    print("BERT predictions:")
    for p in bert_pred:
        print(f"  {p['token_str']:15} {p['score']:.2%}")
    
    print("\nDistilBERT predictions:")
    for p in distil_pred:
        print(f"  {p['token_str']:15} {p['score']:.2%}")
    
    # Check if top predictions differ
    if bert_pred[0]['token_str'] != distil_pred[0]['token_str']:
        print("\n⚡ Models DISAGREE on top prediction!")
    else:
        print("\n✓ Models agree on top prediction")
    print("="*50 + "\n")

# Test with your own sentences
test_sentences = [
    # Add your own sentences here
    "The quarterback threw the [MASK] for a touchdown.",
    "The programmer fixed the [MASK] in the code.",
]

for sentence in test_sentences:
    compare_models(sentence)

---

# Part 4: Advanced Topics

## Under the Hood: What the Pipeline Actually Does

The `pipeline` function is a convenience wrapper. Let's see what happens inside:

In [None]:
# Load tokenizer and model separately
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

print(f"Model: {model_name}")
print(f"Vocabulary size: {tokenizer.vocab_size:,} tokens")

In [None]:
# Step-by-step: What happens when you call fill_mask()

sentence = "The capital of France is [MASK]."
print(f"Input: '{sentence}'\n")

# STEP 1: Tokenization
# Convert text to token IDs
inputs = tokenizer(sentence, return_tensors="pt")
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])

print("STEP 1 - Tokenization:")
print(f"  Tokens: {tokens}")
print(f"  Token IDs: {inputs['input_ids'][0].tolist()}")

# Find the mask position
mask_token_id = tokenizer.mask_token_id
mask_position = (inputs['input_ids'][0] == mask_token_id).nonzero(as_tuple=True)[0].item()
print(f"  [MASK] is at position: {mask_position}")

In [None]:
# STEP 2: Model Inference
# Pass tokens through the model
print("STEP 2 - Model Inference:")

with torch.no_grad():  # Disable gradient computation (faster, less memory)
    outputs = model(**inputs)

# outputs.logits shape: [batch_size, sequence_length, vocab_size]
print(f"  Output shape: {outputs.logits.shape}")
print(f"  This means: {outputs.logits.shape[1]} positions × {outputs.logits.shape[2]:,} possible tokens")

# Get logits for the masked position
mask_logits = outputs.logits[0, mask_position, :]
print(f"  Logits for [MASK] position: {mask_logits.shape} (one score per vocabulary token)")

In [None]:
# STEP 3: Convert to Probabilities
# Apply softmax to get probabilities
print("STEP 3 - Softmax (logits → probabilities):")

probabilities = torch.softmax(mask_logits, dim=0)

print(f"  Sum of all probabilities: {probabilities.sum():.4f} (should be ~1.0)")
print(f"  Min probability: {probabilities.min():.2e}")
print(f"  Max probability: {probabilities.max():.4f}")

In [None]:
# STEP 4: Get Top Predictions
print("STEP 4 - Top-k Selection:")

# Get top 5 predictions
top_k = 5
top_probs, top_indices = torch.topk(probabilities, top_k)

print(f"\nTop {top_k} predictions:")
# Convert tensors to plain Python numbers before decoding and formatting
for prob, idx in zip(top_probs.tolist(), top_indices.tolist()):
    token = tokenizer.decode([idx])
    print(f"  {token:15} {prob:.4f} ({prob:.2%})")

### Performance Considerations

When using fill-mask models in production:

| Consideration | Recommendation |
|---------------|----------------|
| **Batch processing** | Process multiple sentences at once for efficiency |
| **Model size** | Use DistilBERT for speed (about 40% smaller and 60% faster than BERT base) |
| **GPU** | Use CUDA if available: `pipeline("fill-mask", device=0)` |
| **Caching** | Models are cached locally after first download |
| **Memory** | Each model uses ~400MB-1GB RAM |
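
For large inputs you may not want to hand every sentence to the model in one call. A minimal sketch of splitting work into fixed-size chunks is shown below; `chunked` is a hypothetical helper written in plain Python. Pipelines also accept a `batch_size` argument when called, which may be the simpler choice in practice.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Seven toy sentences split into batches of three
many_sentences = [f"Sentence number {i} is [MASK]." for i in range(7)]
batches = list(chunked(many_sentences, batch_size=3))

print([len(batch) for batch in batches])
```

Each batch could then be passed to `fill_mask(batch)` in turn, keeping memory use bounded.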

In [None]:
# Example: Batch processing multiple sentences
sentences = [
    "The [MASK] jumped over the lazy dog.",
    "I enjoy reading [MASK] in my free time.",
    "The weather today is [MASK]."
]

# Process all sentences at once (more efficient than one at a time)
results = fill_mask(sentences)

for sentence, prediction in zip(sentences, results):
    top = prediction[0]  # Results are nested: list of lists
    print(f"'{sentence}'")
    print(f"  → {top['token_str']} ({top['score']:.2%})\n")

### Limitations of Fill-Mask Models

Fill-mask models have important limitations to be aware of:

1. **Single mask prediction**: Most models predict one mask at a time. Multiple [MASK] tokens are predicted independently, not jointly.

2. **Vocabulary constraints**: Can only predict words in the vocabulary. Rare words or new terms may not be predicted.

3. **Biases**: Models inherit biases from training data (gender, racial, cultural biases).

4. **No reasoning**: Models match patterns, not logic. They might predict grammatically correct but factually wrong answers.
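
The first limitation is easy to reproduce with a toy example. The sketch below picks the most probable word for each mask separately, using invented per-mask probabilities (not real model output) and a hypothetical helper `fill_independently`. Because neither choice knows about the other, the same word can win both slots.

```python
def fill_independently(distributions):
    """Pick the most probable word for each mask without looking at the other masks."""
    return [max(dist, key=dist.get) for dist in distributions]

# Invented probabilities for the two masks in "The [MASK] and the [MASK] are friends."
first_mask = {"dog": 0.30, "cat": 0.25, "boy": 0.10}
second_mask = {"dog": 0.28, "cat": 0.27, "girl": 0.12}

choices = fill_independently([first_mask, second_mask])
print(choices)  # ['dog', 'dog']
```

A joint approach would score word pairs together, which is something a plain fill-mask call does not do.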

Let's see some of these limitations:

In [None]:
# Limitation 1: Factual errors (pattern matching, not reasoning)
print("LIMITATION: Pattern matching vs. reasoning\n")

# The model might predict based on common patterns, not facts
factual_test = "The largest planet in our solar system is [MASK]."
result = fill_mask(factual_test)

print(f"'{factual_test}'")
print(f"Top 3 predictions:")
for p in result[:3]:
    correct = "✓" if p['token_str'].lower() == "jupiter" else "✗"
    print(f"  {correct} {p['token_str']} ({p['score']:.2%})")

In [None]:
# Limitation 2: Multiple masks are independent
print("LIMITATION: Multiple masks are predicted independently\n")

# Each [MASK] is filled without considering the other
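multi_mask = "The [MASK] and the [MASK] are friends."
# With several masks in one sentence, the result is a list of lists: one list per mask
result = fill_mask(multi_mask)

print(f"'{multi_mask}'\n")
for position, mask_predictions in enumerate(result, 1):
    top = mask_predictions[0]
    print(f"  Mask {position}: {top['token_str']} ({top['score']:.2%})")

print("\nNote: The model fills each [MASK] independently.")
print("You might get 'dog' and 'dog' instead of two different animals.")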

---

## Exercise 3: Word Rank Checker (Independent)

**Difficulty**: Advanced | **Time**: 15-20 minutes

**Your task**: Create a function that checks how "natural" a specific word is in a given sentence context.

**Requirements**:
1. Take a sentence and a word to check
2. Mask the word's position
3. Get predictions and find the rank of the target word
4. Return the rank and probability

**Example**:
- Input: "The cat sat on the mat", word="mat"
- Expected: The word "mat" ranks in the top 10 (or not, which would be interesting!)

In [None]:
# YOUR CODE HERE

def check_word_naturalness(sentence, target_word, top_k=100):
    """
    Check how natural a word is in a sentence context.
    
    Args:
        sentence: A sentence containing the target word
        target_word: The word to check
        top_k: How many predictions to consider
    
    Returns:
        dict with rank, probability, and whether it's in top predictions
    """
    # TODO: Implement this function
    # 1. Replace target_word with [MASK] in the sentence
    # 2. Get predictions from fill_mask
    # 3. Find the rank of target_word in predictions
    # 4. Return results
    
    pass  # Remove this and add your code


# Test your function
test_cases = [
    ("The cat sat on the mat.", "mat"),
    ("The cat sat on the elephant.", "elephant"),  # Should rank poorly
    ("Water freezes at zero degrees Celsius.", "Celsius"),
]

# Uncomment to test after implementing:
# for sentence, word in test_cases:
#     result = check_word_naturalness(sentence, word)
#     print(f"'{word}' in '{sentence}'")
#     print(f"  Rank: {result}")

---

# Part 5: Mini-Project

## Project: Word Fitness Scorer

**Scenario**: You're building a writing assistant that helps users choose better words. When a user highlights a word in their text, the tool should tell them if the word fits well and suggest alternatives.

**Your goal**: Build a `WordFitnessScorer` class that:
1. Takes a sentence and a highlighted word
2. Scores how well the word fits (based on its prediction rank)
3. Suggests alternatives if the word doesn't rank highly
4. Provides a human-readable assessment

In [None]:
# MINI-PROJECT: Word Fitness Scorer
# =================================

class WordFitnessScorer:
    """
    A tool that evaluates how well a word fits in a sentence
    and suggests alternatives.
    """
    
    def __init__(self, model_name="bert-base-uncased"):
        """Initialize the scorer with a fill-mask pipeline."""
        # TODO 1: Create a fill-mask pipeline
        self.fill_mask = pipeline("fill-mask", model=model_name)
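        # Read the mask token from the tokenizer so other models (e.g. RoBERTa's <mask>) also work
        self.mask_token = self.fill_mask.tokenizer.mask_token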
    
    def _create_masked_sentence(self, sentence, target_word):
        """
        Replace the target word with [MASK].
        Handles case-insensitive matching.
        """
        # TODO 2: Replace target_word with [MASK]
        # Hint: Handle case sensitivity carefully
        words = sentence.split()
        masked_words = []
        for word in words:
            # Remove punctuation for comparison
            clean_word = word.strip('.,!?;:')
            if clean_word.lower() == target_word.lower():
                # Preserve punctuation after the mask
                punct = word[len(clean_word):]
                masked_words.append(self.mask_token + punct)
            else:
                masked_words.append(word)
        return ' '.join(masked_words)
    
    def score_word(self, sentence, target_word, num_alternatives=5):
        """
        Score how well a word fits in the sentence.
        
        Args:
            sentence: The complete sentence
            target_word: The word to evaluate
            num_alternatives: How many alternatives to suggest
            
        Returns:
            dict with fitness score, rank, and alternatives
        """
        # TODO 3: Create the masked sentence
        masked = self._create_masked_sentence(sentence, target_word)
        
        # TODO 4: Get predictions (get more than we need to find the rank)
        predictions = self.fill_mask(masked, top_k=50)
        
        # TODO 5: Find the rank of the target word
        target_lower = target_word.lower()
        rank = None
        probability = 0.0
        
        for i, pred in enumerate(predictions):
            if pred['token_str'].lower().strip() == target_lower:
                rank = i + 1  # 1-indexed rank
                probability = pred['score']
                break
        
        # TODO 6: Calculate fitness score (simple approach: inverse of rank)
        if rank:
            fitness = 1.0 / rank
        else:
            fitness = 0.0  # Word not in top 50
            rank = ">50"
        
        # TODO 7: Get top alternatives
        alternatives = [
            {"word": p['token_str'], "confidence": p['score']}
            for p in predictions[:num_alternatives]
        ]
        
        return {
            "target_word": target_word,
            "rank": rank,
            "probability": probability,
            "fitness_score": fitness,
            "alternatives": alternatives,
            "masked_sentence": masked
        }
    
    def get_assessment(self, sentence, target_word):
        """
        Get a human-readable assessment of word fitness.
        
        Returns:
            str: A formatted assessment message
        """
        result = self.score_word(sentence, target_word)
        
        # TODO 8: Create human-readable assessment
        assessment = []
        assessment.append(f"Word Fitness Assessment")
        assessment.append(f"=" * 50)
        assessment.append(f"Sentence: '{sentence}'")
        assessment.append(f"Target word: '{target_word}'")
        assessment.append(f"")
        
        # Interpret the rank
        rank = result['rank']
        if isinstance(rank, int):
            if rank == 1:
                verdict = "Excellent! This is the top predicted word."
                emoji = "🟢"
            elif rank <= 5:
                verdict = "Good fit. The word is highly probable in this context."
                emoji = "🟢"
            elif rank <= 15:
                verdict = "Acceptable. The word works but alternatives might be stronger."
                emoji = "🟡"
            else:
                verdict = "Unusual choice. Consider the alternatives below."
                emoji = "🟠"
        else:
            verdict = "This word is unexpected in this context. Strongly consider alternatives."
            emoji = "🔴"
        
        assessment.append(f"Rank: #{rank}")
        assessment.append(f"Probability: {result['probability']:.2%}")
        assessment.append(f"")
        assessment.append(f"{emoji} {verdict}")
        assessment.append(f"")
        assessment.append(f"Top alternatives:")
        
        for i, alt in enumerate(result['alternatives'], 1):
            assessment.append(f"  {i}. {alt['word']} ({alt['confidence']:.2%})")
        
        return '\n'.join(assessment)


# Test the Word Fitness Scorer
print("Creating Word Fitness Scorer...\n")
scorer = WordFitnessScorer()

In [None]:
# Test with various sentences
test_cases = [
    # Each sentence must contain the word being scored, otherwise nothing gets masked
    ("The cat sat on the mat.", "mat"),
    ("The cat sat on the elephant.", "elephant"),
    ("She decided to pursue a career in medicine.", "medicine"),
    ("She decided to pursue a career in dancing.", "dancing"),
]

for sentence, word in test_cases:
    print(scorer.get_assessment(sentence, word))
    print("\n" + "="*60 + "\n")

### Extension Ideas

If you want to extend this project further:

1. **Batch processing**: Evaluate multiple words in a document
2. **Contextual suggestions**: Suggest words that match the document's tone
3. **Comparison mode**: Compare multiple candidate words
4. **Synonym detection**: Check if the word has synonyms that rank higher
5. **Writing style analysis**: Build a profile of a user's word choices
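
As a starting point for the comparison-mode idea, the sketch below ranks several candidate words against one list of predictions. The helper `rank_candidates` and the sample predictions are invented for illustration. With a real pipeline you might also look at the `targets` argument of the fill-mask call, which restricts scoring to words you supply.

```python
def rank_candidates(predictions, candidates):
    """Return (word, rank, score) per candidate, best rank first; rank is None if absent."""
    lookup = {}
    for position, pred in enumerate(predictions, start=1):
        # Keep the first (highest-ranked) occurrence of each word
        lookup.setdefault(pred["token_str"].strip().lower(), (position, pred["score"]))
    rows = []
    for word in candidates:
        rank, score = lookup.get(word.lower(), (None, 0.0))
        rows.append((word, rank, score))
    # Words found in the predictions come first, ordered by rank; missing words go last
    return sorted(rows, key=lambda row: (row[1] is None, row[1] or 0))

# Invented predictions for "The cat sat on the [MASK]."
sample_predictions = [
    {"token_str": "floor", "score": 0.30},
    {"token_str": "bed", "score": 0.20},
    {"token_str": "mat", "score": 0.05},
]

for word, rank, score in rank_candidates(sample_predictions, ["mat", "elephant", "floor"]):
    print(f"{word:10} rank={rank} score={score:.2%}")
```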

---

# Part 6: Wrap-Up

## Key Takeaways

1. **Masked Language Modeling** is a pre-training objective where models learn to predict hidden words based on context

2. **Bidirectional context** (looking at words on both sides) is what makes BERT different from earlier models

3. **Confidence scores** tell you how certain the model is - but high confidence doesn't mean the prediction is factually correct

4. **Different models give different results** - model choice matters based on your use case

5. **The pipeline hides complexity** - under the hood, it tokenizes text, runs inference, and processes outputs

## Common Mistakes to Avoid

| Mistake | Why It's a Problem |
|---------|-------------------|
| Using wrong mask token | RoBERTa uses `<mask>`, BERT uses `[MASK]` |
| Trusting high confidence blindly | Models can be confidently wrong |
| Ignoring case sensitivity | Uncased models convert everything to lowercase |
| Processing one sentence at a time | Batching is much more efficient |

## What's Next?

In **Notebook 2: Named Entity Recognition**, you'll learn:
- How to extract people, organizations, and locations from text
- The same BERT architecture used here, but for a different task
- How models can label entire sequences, not just single positions

The concepts you learned about tokenization, model inference, and confidence scores will directly apply!

---

## Solutions

### Check Your Understanding (Quiz Answers)

1. **B) It looks at context on both sides of a masked word**
2. **B) 15%**
3. **A) The model is more certain about its prediction**
4. **C) Real-time translation** (Translation uses encoder-decoder models, not fill-mask)

### Exercise 3: Word Rank Checker (Sample Solution)

In [None]:
# Sample solution for Exercise 3

def check_word_naturalness_solution(sentence, target_word, top_k=100):
    """
    Check how natural a word is in a sentence context.
    """
    # Create masked sentence
    words = sentence.split()
    masked_words = []
    target_lower = target_word.lower()
    
    for word in words:
        clean = word.strip('.,!?;:')
        if clean.lower() == target_lower:
            punct = word[len(clean):]
            masked_words.append('[MASK]' + punct)
        else:
            masked_words.append(word)
    
    masked_sentence = ' '.join(masked_words)
    
    # Get predictions
    predictions = fill_mask(masked_sentence, top_k=top_k)
    
    # Find rank of target word
    rank = None
    probability = 0.0
    
    for i, pred in enumerate(predictions):
        if pred['token_str'].lower().strip() == target_lower:
            rank = i + 1
            probability = pred['score']
            break
    
    return {
        'word': target_word,
        'rank': rank if rank else f'>{top_k}',
        'probability': probability,
        'masked_sentence': masked_sentence,
        'in_top_10': rank is not None and rank <= 10
    }

# Test the solution
print("Testing solution:\n")
test_cases = [
    ("The cat sat on the mat.", "mat"),
    ("The cat sat on the elephant.", "elephant"),
]

for sentence, word in test_cases:
    result = check_word_naturalness_solution(sentence, word)
    print(f"'{word}' in '{sentence}'")
    print(f"  Rank: {result['rank']}, Probability: {result['probability']:.2%}")
    print(f"  In top 10: {result['in_top_10']}\n")

---

## Additional Resources

- [BERT Paper](https://arxiv.org/abs/1810.04805): Original BERT research paper
- [Hugging Face fill-mask docs](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.FillMaskPipeline)
- [BERT Model Card](https://huggingface.co/bert-base-uncased): Model details and limitations
- [DistilBERT](https://huggingface.co/distilbert-base-uncased): Smaller, faster alternative