# Assignment: Understanding Transformer LLM Internals and Token-Level Analysis

**Background:** This assignment explores the **internal mechanisms** of transformer-based large language models (LLMs) through hands-on analysis of tokenization, probability distributions, attention mechanisms, and generation strategies. You will investigate how LLMs process and generate text at the token level, examining the mathematical foundations behind their predictions.

## Instructions and Point Breakdown

### 1. **Model Architecture Analysis and Setup (2 points)**

- Load a transformer model using the Hugging Face `transformers` library (e.g., Phi-3, GPT-2, or a Llama model).
- **Questions:**
  - Examine the model architecture by printing `model`. What are the key components and their dimensions?
  - What is the vocabulary size of your chosen model? How does this compare to the hidden dimension?
  - Why do transformer models use layer normalization (or a variant such as RMSNorm) instead of batch normalization?
- **Implementation:** Load the model, tokenizer, and display the architecture summary.
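
As a starting point for the setup step, the sketch below builds a tiny, randomly initialized GPT-2 from a config so that it runs without downloading any weights. The configuration values are illustrative assumptions, not the settings of any released checkpoint. For the assignment itself you would load pretrained weights instead, for example with `AutoModelForCausalLM.from_pretrained(...)` and `AutoTokenizer.from_pretrained(...)`.

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Illustrative (assumed) sizes, chosen only to keep the model tiny.
config = GPT2Config(vocab_size=1000, n_positions=64, n_embd=32, n_layer=2, n_head=2)
model = GPT2LMHeadModel(config)  # random weights, nothing is downloaded

print(model)  # nested module tree: embeddings, transformer blocks, LM head

embedding_shape = tuple(model.get_input_embeddings().weight.shape)
num_params = sum(p.numel() for p in model.parameters())

print("vocabulary size :", model.config.vocab_size)
print("hidden dimension:", model.config.n_embd)
print("embedding matrix:", embedding_shape)  # (vocab_size, hidden_dim)
print("parameter count :", num_params)
```

Attribute names differ between architectures (GPT-2 calls the hidden dimension `n_embd`, while many other models call it `hidden_size`), so check `model.config` for your chosen model.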

### 2. **Tokenization and Vocabulary Investigation (3 points)**

- **Deep Dive into Tokenization:**
  - Choose 5-7 diverse text samples (include technical terms, different languages, numbers, punctuation combinations).
  - For each sample, show the tokenization breakdown: `input → token IDs → decoded tokens`.
  - **Questions:**
    - How does the tokenizer handle words that are not in its vocabulary, as well as numbers and special characters?
    - Why might "The" and "the" tokenize differently? Test this hypothesis.
    - What happens when you tokenize the same word in different contexts (for example, at the start of a sentence vs. after a space)?
  - **Comparative Analysis:** Tokenize the same text with your model's tokenizer and with a different tokenizer (e.g., word-level vs. subword tokenization), and compare the results.
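
To make the `input → token IDs → decoded tokens` breakdown concrete, here is a minimal toy sketch. It uses a hand-written, hypothetical vocabulary and a greedy longest-match rule, which is a simplification and not the algorithm of any real tokenizer (BPE and similar methods learn their merges from data). The word-level baseline is likewise an assumption made for comparison.

```python
# Hypothetical toy vocabulary; real tokenizers learn tens of thousands of entries.
SUBWORD_VOCAB = ["<unk>", "The", "the", " the", " token", "token", "izer", "ization",
                 " ", "s", "t", "h", "e"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(SUBWORD_VOCAB)}


def subword_encode(text):
    """Greedy longest-match: at each position take the longest vocab entry."""
    ids, pos = [], 0
    while pos < len(text):
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in TOKEN_TO_ID:
                ids.append(TOKEN_TO_ID[piece])
                pos = end
                break
        else:
            ids.append(TOKEN_TO_ID["<unk>"])  # no entry covers this character
            pos += 1
    return ids


def subword_decode(ids):
    """Map token IDs back to their string pieces."""
    return [SUBWORD_VOCAB[i] for i in ids]


def word_encode(text, vocab):
    """Word-level baseline: whole words only, anything else becomes <unk>."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.split()]


WORD_VOCAB = {"<unk>": 0, "the": 1, "tokenizer": 2}

for sample in ["The tokenizer", "the tokenization"]:
    ids = subword_encode(sample)
    print(repr(sample), "->", ids, "->", subword_decode(ids))
    print("  word-level:", word_encode(sample, WORD_VOCAB))
```

In this toy setup "The" and "the" are separate vocabulary entries and therefore get different IDs, and the word-level baseline falls back to `<unk>` for any word it has not seen. Whether your model's tokenizer behaves the same way is exactly what the experiments above should establish.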

### 3. **Probability Distribution and Next-Token Prediction (2 points)**

- **Mathematical Understanding:**
  - Take a simple prompt like "The capital of France is" and examine the model's logits and probability distributions.
  - Extract and analyze the top 10 most probable next tokens with their probabilities.
  - **Questions:**
    - Why do models use softmax for converting logits to probabilities? What would happen with other normalization methods?
    - How does temperature affect the probability distribution? Test with temperatures 0.1, 1.0, and 2.0.
    - What is the relationship between the model's confidence (the probability mass on the top token) and whether its predictions are actually correct?
  - **Analysis:** Create visualizations showing how probability distributions change with different prompts and temperature settings.
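
The temperature experiment can be prototyped without a model. The sketch below applies a temperature-scaled softmax, in which each logit is divided by the temperature before exponentiation and normalization, to a handful of hypothetical logits. The token strings and logit values are made up for illustration and are not output from any model.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(tokens, probs, k):
    """Return the k most probable (token, probability) pairs."""
    return sorted(zip(tokens, probs), key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical logits for the prompt "The capital of France is"; not model output.
tokens = [" Paris", " Lyon", " a", " the", " located"]
logits = [6.0, 3.0, 2.5, 2.0, 1.0]

for temperature in (0.1, 1.0, 2.0):
    probs = softmax(logits, temperature)
    best_token, best_prob = top_k(tokens, probs, 1)[0]
    print(f"T={temperature}: top token {best_token!r} with p={best_prob:.3f}")
```

With a real model you would replace `logits` with the final-position logits from the forward pass and `tokens` with the decoded vocabulary entries. The same two helpers then give the top-10 table and the data for the plots.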

### 4. **Generation Strategies and Computational Efficiency (2 points)**

- **Performance Analysis:**
  - Implement and time text generation with and without key-value caching.
  - **Questions:**
    - Why does caching provide such a dramatic speedup? What is being cached and why?
    - How does the computational complexity of attention scale with sequence length?
    - What are the memory trade-offs between caching and recomputation?
  - **Sampling Methods Investigation:**
    - Compare greedy decoding, top-k sampling, and nucleus (top-p) sampling on the same prompt.
    - Analyze how different sampling methods affect output diversity and quality.
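
Before timing real generation, it can help to reason about the caching question with a rough cost model. The sketch below counts how many token positions pass through the model with and without a key-value cache, and how many values the cache has to hold. It is a simplified model under stated assumptions (one key and one value vector of size `hidden_dim` per layer and position, illustrative sizes), not a measurement.

```python
def tokens_processed(prompt_len, new_tokens, use_cache):
    """Count token positions pushed through the model during generation."""
    if use_cache:
        # One pass over the prompt, then only the newest token at each step.
        return prompt_len + (new_tokens - 1)
    # Without a cache every step re-runs the entire sequence so far.
    return sum(prompt_len + step for step in range(new_tokens))


def kv_cache_values(seq_len, n_layers, hidden_dim):
    """Cached values: one key and one value vector per layer and position."""
    return 2 * n_layers * seq_len * hidden_dim


# Illustrative (assumed) sizes, not measurements from any model.
for new_tokens in (10, 100, 1000):
    without = tokens_processed(20, new_tokens, use_cache=False)
    with_cache = tokens_processed(20, new_tokens, use_cache=True)
    print(f"{new_tokens:>5} new tokens: {without:>7} vs {with_cache:>5} positions processed")

print("cached values for 1020 positions:", kv_cache_values(1020, n_layers=12, hidden_dim=768))
```

The count ignores the fact that, even with a cache, each new token still attends over all cached positions. For the actual timing comparison, `model.generate(...)` in `transformers` accepts a `use_cache` argument that you can set to `True` or `False`.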
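
The three decoding strategies differ only in how they restrict the next-token distribution before choosing a token. The following sketch shows that step in isolation on a hypothetical five-token distribution. The probabilities, the helper names, and the fixed random seed are assumptions made for illustration.

```python
import random


def top_k_filter(probs, k):
    """Keep the k most probable entries and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose mass reaches p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return {i: probs[i] / mass for i in keep}


def sample(filtered, rng):
    """Draw one index from a {index: probability} mapping."""
    indices = list(filtered)
    return rng.choices(indices, weights=[filtered[i] for i in indices], k=1)[0]


# Hypothetical next-token distribution over a five-token vocabulary.
probs = [0.5, 0.25, 0.15, 0.07, 0.03]
rng = random.Random(0)  # fixed seed so runs are repeatable

greedy = max(range(len(probs)), key=lambda i: probs[i])
print("greedy   :", greedy)
print("top-k=2  :", top_k_filter(probs, 2), "sample:", sample(top_k_filter(probs, 2), rng))
print("top-p=0.8:", top_p_filter(probs, 0.8), "sample:", sample(top_p_filter(probs, 0.8), rng))
```

Greedy decoding always returns the same token, while the two filters leave a renormalized distribution to sample from. Top-k keeps a fixed number of candidates, whereas top-p keeps however many are needed to reach the probability threshold.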

### 5. **Model Behavior and Limitations Analysis (1 point)**

- **Critical Thinking Questions:**
  - Test the model with deliberately ambiguous or trick questions. How does it handle uncertainty?
  - Investigate tokenization artifacts: How might unusual tokenization affect model performance?
  - **Philosophical Questions:**
    - Does the model "understand" language or is it performing sophisticated pattern matching? Support your argument with evidence from your experiments.
    - How do the model's internal representations (hidden states) relate to human understanding of language?
    - What are the implications of the model's next-token prediction objective for its capabilities and limitations?

## Submission Requirements

- **Jupyter Notebook** containing:
  - All code implementations and outputs from Sections 1-4
  - Clear visualizations of probability distributions and tokenization analysis
  - Thoughtful written responses to all critical thinking questions
  - Evidence-based arguments supported by experimental results
- **Written Analysis:** For each critical thinking question, provide 2-3 paragraphs of analysis supported by your experimental observations.
- Use Python libraries: `transformers`, `torch`, `matplotlib`, `numpy`, and `pandas` as needed.

## Advanced Extensions (Optional)

- **Attention Visualization:** Extract and visualize attention weights for specific examples.
- **Layer-wise Analysis:** Examine how representations change through different transformer layers.
- **Cross-model Comparison:** Compare the internal behaviors of models with different architectures or sizes.
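
For the attention extension, it may help to see what an attention-weight matrix looks like before extracting one from a real model. The sketch below computes causal scaled dot-product attention weights for a single head in plain Python. The query and key vectors are hypothetical, and a real model would produce one such matrix per head and per layer.

```python
import math


def causal_attention_weights(queries, keys):
    """Row t holds softmax(q_t . k_s / sqrt(d)) over positions s <= t."""
    dim = len(queries[0])
    weights = []
    for t, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys[: t + 1]]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        # Future positions are masked out, so they receive zero weight.
        weights.append([e / total for e in exps] + [0.0] * (len(keys) - t - 1))
    return weights


# Hypothetical 2-dimensional query/key vectors for a three-token sequence.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

for row in causal_attention_weights(queries, keys):
    print(" ".join(f"{w:.2f}" for w in row))
```

The result is a square matrix with one row and one column per token, which is also why attention cost grows with the square of the sequence length. With `transformers`, passing `output_attentions=True` to the model's forward call is the usual way to obtain the corresponding tensors for each layer.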

**Grading Rubric:**

| Section                              | Points |
|:-------------------------------------|:------:|
| Architecture analysis & setup        | 2      |
| Tokenization investigation           | 3      |
| Probability & prediction analysis    | 2      |
| Generation & efficiency analysis     | 2      |
| Critical thinking & reflection       | 1      |
| **Total**                            | **10** |

**Evaluation Criteria:**
- **Technical Implementation (40%):** Correct and insightful use of the code examples and extensions
- **Critical Analysis (40%):** Depth of understanding demonstrated through written responses
- **Experimental Evidence (20%):** Quality of supporting evidence and experimental design