# Notebook 01: Natural Language Processing - Text Generation

**Learning Objectives:**
- Understand text generation using transformer models
- Load and use pre-trained language models from HuggingFace
- Generate coherent text continuations
- Experiment with generation parameters

## Prerequisites

### Hardware Requirements

| Model Option | Model Name | Parameters (download size) | Min RAM | Recommended Setup | Notes |
|--------------|------------|----------------------------|---------|-------------------|-------|
| **CPU (Small)** | distilgpt2 | 82M (~350MB) | 2GB | 4GB RAM, CPU | Fast, educational |
| **GPU (Medium)** | gpt2-medium | 355M (~1.5GB) | 4GB | GPU with 8GB+ VRAM (e.g., RTX 4080) | Better quality |

### Software Requirements
- Python 3.8+
- Libraries: `transformers`, `torch`
- See `requirements.txt` for full list

## Overview

**Text Generation** is the task of producing coherent text based on a given prompt. It's one of the most popular applications of transformer models.

**Use Cases:**
- Creative writing assistance
- Code completion
- Chatbots and conversational AI
- Content generation

**How it works:**
1. You provide a text prompt (e.g., "Once upon a time")
2. The model predicts the next token (word/subword)
3. The process repeats to generate a sequence
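4. Decoding strategies (greedy search, temperature sampling, top-k, top-p) decide which token is chosen at each step, shaping the quality and variety of the output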
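
To make the loop concrete, here is a minimal sketch in plain Python. It assumes a hypothetical hand-written table, `TOY_NEXT_TOKEN_PROBS`, in place of a real model (a real model such as distilgpt2 conditions on the whole prefix and scores tens of thousands of subword tokens), and it uses greedy decoding, the simplest strategy: always pick the most likely next token.

```python
# Hypothetical toy "model": maps the last token to a probability
# distribution over next tokens. Real models condition on the full prefix.
TOY_NEXT_TOKEN_PROBS = {
    "Once": {"upon": 0.9, "more": 0.1},
    "upon": {"a": 0.95, "the": 0.05},
    "a": {"time": 0.8, "hill": 0.2},
    "time": {"<eos>": 0.6, "there": 0.4},
    "there": {"was": 1.0},
    "was": {"<eos>": 1.0},
}


def greedy_generate(prompt_tokens, max_new_tokens=10):
    """Autoregressive loop: predict one token, append it, repeat."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        # Unknown tokens fall back to an end-of-sequence marker.
        probs = TOY_NEXT_TOKEN_PROBS.get(tokens[-1], {"<eos>": 1.0})
        next_token = max(probs, key=probs.get)  # greedy: most likely token
        if next_token == "<eos>":
            break  # the model signalled the end of the sequence
        tokens.append(next_token)
    return tokens


print(" ".join(greedy_generate(["Once"])))
```

Greedy decoding is deterministic: the same prompt always yields the same continuation, which is one reason the notebook enables sampling with `do_sample=True`.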

## Expected Behaviors

When you run this notebook, here's what you should see:

### First Time Running
- **Model Download**: First run will download the model (~350MB for distilgpt2, ~1.5GB for gpt2-medium)
  - Downloads go to `~/.cache/huggingface/hub/`
  - Progress bar shows download status
  - Subsequent runs use cached model (much faster!)

### Setup Cell Output
```
PyTorch version: 2.x.x
CUDA available: True/False
GPU: NVIDIA GeForce RTX 4080 (if you have GPU)
```

### Model Loading
```
Loading distilgpt2...
Model loaded successfully!
```
- **CPU**: Takes 5-10 seconds
- **GPU**: Takes 2-5 seconds

### Text Generation Examples
- Generated text should be **grammatically coherent** but may not always be factually accurate
- **Temperature 0.3**: More repetitive, focused text
- **Temperature 0.7**: Balanced creativity
- **Temperature 1.2**: More random, creative text
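
Temperature rescales the model's raw scores (logits) before they are turned into probabilities. The sketch below uses the standard formulation, softmax(logits / T); the three logits are made-up numbers for illustration. Lower T concentrates probability on the top token (more repetitive output), while higher T flattens the distribution (more random output).

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
for t in (0.3, 0.7, 1.2):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```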

### Common Outputs
- Text continues naturally from your prompt
- May include unexpected topics or tangents (this is normal!)
- Quality improves with larger models (gpt2-medium > distilgpt2)

### Performance
- **CPU (distilgpt2)**: ~2-5 seconds per generation
- **GPU (distilgpt2)**: ~0.5-1 second per generation
- **GPU (gpt2-medium)**: ~1-2 seconds per generation

### Troubleshooting
- **"CUDA out of memory"**: Switch to the CPU model option, or restart the kernel to free GPU memory
- **Slow generation**: First run downloads model; subsequent runs are faster
- **Repetitive text**: Try increasing temperature or adjusting top_k/top_p

## Setup and Installation

In [None]:
# Import required libraries
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, set_seed
import warnings
warnings.filterwarnings('ignore')

# Set seed for reproducibility
set_seed(1103)

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

## Model Selection

Choose one of the following models based on your hardware:

In [None]:
# CHOOSE YOUR MODEL:

# Option 1: CPU-friendly (recommended for beginners)
MODEL_NAME = "distilgpt2"  # 82M parameters (~350MB download), fast on CPU

# Option 2: GPU-optimized (uncomment if you have RTX 4080 or similar)
# MODEL_NAME = "gpt2-medium"  # 355M parameters (~1.5GB), better quality, best run on a GPU

print(f"Selected model: {MODEL_NAME}")

## Method 1: Using Pipeline (Simplest)

The `pipeline` API is the easiest way to use HuggingFace models.

In [None]:
# Create a text generation pipeline
print(f"Loading {MODEL_NAME}...")
generator = pipeline(
    "text-generation",
    model=MODEL_NAME,
    device=0 if torch.cuda.is_available() else -1  # 0 for GPU, -1 for CPU
)
print("Model loaded successfully!")

### Basic Text Generation

In [None]:
# Generate text from a prompt
prompt = "Once upon a time in a distant galaxy"

result = generator(
    prompt,
    max_length=50,        # Maximum total length in tokens (prompt + generated text)
    num_return_sequences=1,  # Number of different outputs
    temperature=0.7,      # Creativity (0.1=conservative, 2.0=creative)
    do_sample=True        # Enable random sampling
)

print("Generated text:")
print(result[0]['generated_text'])

### Generating Multiple Variations

In [None]:
# Generate 3 different continuations
prompt = "The future of artificial intelligence is"

results = generator(
    prompt,
    max_length=40,
    num_return_sequences=3,
    temperature=0.8,
    do_sample=True
)

print("\n=== Generated Variations ===")
for i, result in enumerate(results, 1):
    print(f"\n{i}. {result['generated_text']}")
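
### Sketch: Sampling vs. Greedy Choice

With `do_sample=True`, each returned sequence draws its tokens at random from the model's probability distribution, so `num_return_sequences=3` usually yields three different texts, while `set_seed` makes a run repeatable. The toy sketch below shows the same idea for a single next token, using a hypothetical distribution `NEXT_TOKEN_PROBS` and Python's `random` module (transformers samples with PyTorch internally, so this is an illustration, not its implementation).

```python
import random

# Hypothetical next-token distribution after "The future of artificial intelligence is"
NEXT_TOKEN_PROBS = {"bright": 0.5, "uncertain": 0.3, "here": 0.2}


def sample_continuations(probs, num_return_sequences, seed):
    """Draw one next token per returned sequence, weighted by probability."""
    rng = random.Random(seed)  # a fixed seed makes the draws reproducible
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=num_return_sequences)


# Greedy decoding would always return the single most likely token.
greedy = max(NEXT_TOKEN_PROBS, key=NEXT_TOKEN_PROBS.get)
print("greedy: ", greedy)
print("sampled:", sample_continuations(NEXT_TOKEN_PROBS, 3, seed=1103))
```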

## Method 2: Using Model and Tokenizer Directly (Advanced)

For more control, load the model and tokenizer separately.

In [None]:
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Move model to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

print(f"Model loaded on: {device}")

In [None]:
# Generate text with more control
prompt = "In the year 2050, technology will"

# Tokenize input
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Generate
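outputs = model.generate(
    **inputs,           # Passes input_ids and attention_mask
    max_length=60,      # Total length in tokens, prompt included
    temperature=0.7,
    top_k=50,           # Consider only the 50 most likely tokens
    top_p=0.95,         # Nucleus sampling threshold
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id  # GPT-2 has no pad token; reuse EOS
)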

# Decode output
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"\nGenerated: {generated_text}")
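
### Sketch: How `top_k` and `top_p` Filter Tokens

`top_k=50` keeps only the 50 most likely candidates, and `top_p=0.95` keeps the smallest set of most likely candidates whose probabilities add up to at least 0.95; sampling then happens over what remains. Below is a minimal pure-Python sketch of both filters over a made-up four-token distribution. transformers applies these filters to logits tensors, so details such as tie handling may differ.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept = dict(ranked[:k])
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}


def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches p (nucleus sampling), then renormalize."""
    kept = {}
    cumulative = 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(kept.values())
    return {tok: q / total for tok, q in kept.items()}


# Hypothetical next-token probabilities after "In the year 2050, technology will"
probs = {"be": 0.5, "change": 0.3, "transform": 0.15, "banana": 0.05}
print(top_k_filter(probs, 2))
print(top_p_filter(probs, 0.9))
```

Both filters cut off the unlikely tail (here, "banana"), which is why they help avoid incoherent tokens without making the output fully deterministic.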

## Understanding Generation Parameters

Let's experiment with different parameters to see their effects:

In [None]:
def compare_temperatures(prompt, temperatures=(0.3, 0.7, 1.2)):
    """
    Compare text generation with different temperature values.
    Lower temperature = more conservative, higher = more creative
    """
    print(f"Prompt: '{prompt}'\n")
    
    for temp in temperatures:
        result = generator(
            prompt,
            max_length=40,
            temperature=temp,
            do_sample=True,
            num_return_sequences=1
        )
        print(f"Temperature {temp}:")
        print(f"{result[0]['generated_text']}\n")

# Test it
compare_temperatures("The secret to happiness is")

## Practical Applications

### Example 1: Story Beginning Generator

In [None]:
story_prompts = [
    "Once upon a time in a small village,",
    "The detective looked at the evidence and realized",
    "On the first day of summer vacation,"
]

for prompt in story_prompts:
    result = generator(prompt, max_length=50, temperature=0.8, do_sample=True)
    print(f"{result[0]['generated_text']}\n")

### Example 2: Code Comment Generator

In [None]:
code_prompt = "This function calculates the"

result = generator(
    code_prompt,
    max_length=30,
    temperature=0.5,  # Lower temperature for more focused output
    do_sample=True
)

print(result[0]['generated_text'])

## Performance Comparison

Let's measure generation speed:

In [None]:
import time

prompt = "Artificial intelligence is"

# Measure wall-clock time (perf_counter is the recommended clock for intervals)
start_time = time.perf_counter()
result = generator(prompt, max_length=50, do_sample=True)
end_time = time.perf_counter()

print(f"Generated text: {result[0]['generated_text']}")
print(f"\nTime taken: {end_time - start_time:.2f} seconds")
print(f"Device: {'GPU' if torch.cuda.is_available() else 'CPU'}")
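
### Sketch: More Reliable Timing

A single measurement can be misleading, because the first call often carries one-off costs (for example CUDA initialization) and sampled outputs vary in length from run to run. The helper below, a hypothetical `time_call` that is not part of transformers, discards warm-up calls and reports the median of several runs using `time.perf_counter`.

```python
import statistics
import time


def time_call(fn, repeats=5, warmup=1):
    """Return the median wall-clock time of fn() in seconds.

    Warm-up calls run first and are discarded, since the first call often
    includes one-off setup costs.
    """
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)
```

For example, `time_call(lambda: generator(prompt, max_length=50, do_sample=True))` would time the pipeline call above.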

## Exercises

Try these challenges to deepen your understanding:

1. **Experiment with parameters**: Generate text with different `temperature`, `top_k`, and `top_p` values. What happens?

2. **Longer generation**: Try generating longer sequences (e.g., `max_length=100`). Does quality degrade?

3. **Domain-specific prompts**: Test the model with prompts from different domains (technical, creative, conversational). How does it perform?

4. **Compare models**: If you have GPU access, compare `distilgpt2` with `gpt2-medium`. What differences do you notice?

5. **Batch generation**: Generate text for multiple prompts at once using a list of prompts.

In [None]:
# Your code here for exercises


## State-of-the-Art Open Models (Not Covered)

While this notebook focuses on GPT-2 for educational purposes, here are **state-of-the-art open-source text generation models** you should know about:

### Large Language Models (7B+ parameters)

**🦙 Llama 2 & Llama 3** (Meta)
- Sizes: 7B, 13B, 70B (Llama 2); 8B, 70B (Llama 3)
- Best for: General-purpose text generation, chat, instruction following
- [Model Card](https://huggingface.co/meta-llama) | [Paper](https://arxiv.org/abs/2307.09288)
- Note: Requires 16GB+ GPU for 7B model

**🌊 Mistral & Mixtral** (Mistral AI)
- Mistral 7B: Efficient, outperforms Llama 2 13B
- Mixtral 8x7B: Mixture of Experts, exceptional performance
- [Model Card](https://huggingface.co/mistralai) | [Paper](https://arxiv.org/abs/2401.04088)

**🎯 Qwen 2** (Alibaba)
- Sizes: 0.5B to 72B parameters
- Strong multilingual capabilities
- [Model Card](https://huggingface.co/Qwen) | [Blog](https://qwenlm.github.io/)

**💎 Gemma** (Google)
- Sizes: 2B, 7B parameters
- Excellent efficiency and safety features
- [Model Card](https://huggingface.co/google/gemma-7b) | [Blog](https://blog.google/technology/developers/gemma-open-models/)

**🔬 Phi-3** (Microsoft)
- Sizes: 3.8B (mini), 7B (small), 14B (medium)
- Exceptional performance for size, optimized for edge devices
- [Model Card](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct)

### Specialized Models

**💻 CodeLlama** (Meta)
- Specialized for code generation
- [Model Card](https://huggingface.co/codellama/CodeLlama-7b-hf)

**🗣️ Zephyr** (HuggingFace)
- Fine-tuned from Mistral 7B (using DPO) to act as a helpful chat assistant
- [Model Card](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta)

### Why Not Covered Here?

These models require:
- **Large GPU memory**: 16GB+ VRAM for 7B models
- **Longer download times**: 10-40GB model files
- **More compute**: Slower on consumer hardware
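
A rough rule of thumb explains these memory figures: the weights alone need (number of parameters) × (bytes per parameter), that is 4 bytes in fp32 or 2 bytes in fp16/bf16. The sketch below applies it. It ignores activations, the KV cache, and framework overhead, which is why practical requirements (such as the 16GB+ quoted above for 7B models) sit above the raw estimate. The parameter counts used are approximate.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Approximate memory for model weights alone, in gigabytes (1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Approximate parameter counts
models = [("distilgpt2", 82e6), ("gpt2-medium", 355e6), ("7B model", 7e9)]
for name, params in models:
    print(f"{name}: fp32 ≈ {weight_memory_gb(params, 4):.2f} GB, "
          f"fp16 ≈ {weight_memory_gb(params, 2):.2f} GB")
```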

**Learning Path**:
1. ✅ Start with GPT-2 (this notebook) to learn fundamentals
2. Move to fine-tuning (Notebook 13) with LoRA for efficient training
3. Graduate to 7B+ models when you have GPU resources

### Where to Find More

- [Open LLM Leaderboard (HuggingFace)](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
- [Papers with Code - Text Generation](https://paperswithcode.com/task/text-generation)
- [Ollama](https://ollama.ai/) - Run LLMs locally (see Notebook 10)

## Key Takeaways

✅ **Pipeline API** is the easiest way to use HuggingFace models

✅ **Temperature** controls creativity: lower = safer, higher = more random

✅ **Max length** caps the total length in tokens, prompt included

✅ Models are downloaded once and cached locally

✅ GPU acceleration significantly speeds up generation

## Next Steps

- Try **Notebook 02**: Text Classification for sentiment analysis
- Explore larger models like `gpt2-large` if you have more resources
- Check out [HuggingFace Model Hub](https://huggingface.co/models?pipeline_tag=text-generation) for more text generation models

## Resources

- [Transformers Documentation - Text Generation](https://huggingface.co/docs/transformers/main_classes/text_generation)
- [How to Generate Text](https://huggingface.co/blog/how-to-generate)
- [GPT-2 Model Card](https://huggingface.co/gpt2)