# Notebook 8: The Broader Ecosystem - Hugging Face & The Fine-Tuning Workflow

We've built models from scratch. Now, let's learn the standard, practical workflow that professionals use. This involves leveraging the vast Hugging Face ecosystem to download pre-trained models and adapt them to new tasks.

Throughout this notebook, we'll discover how the concepts we've learned from scratch (tokenization, model architectures, inference) map directly to the industry-standard tools that power most real-world LLM applications today.


## The Hugging Face Hub: A "GitHub" for AI

The Hugging Face Hub is the central place where the machine learning community shares models, datasets, and tokenizers. Think of it as GitHub, but specifically designed for AI models and datasets.

Just like GitHub repositories have unique identifiers (e.g., `username/repo-name`), every model, dataset, and tokenizer on the Hub has a unique ID:
- `openai-community/gpt2` - GPT-2 model
- `meta-llama/Meta-Llama-3-8B` - Meta's Llama 3 8B model
- `distilbert-base-uncased-finetuned-sst-2-english` - A sentiment analysis model
- `bert-base-uncased` - BERT base model

This ID system allows you to programmatically download and use any model with a single line of code. The Hub hosts hundreds of thousands of pre-trained models, making it one of the largest collections of open-source AI models in the world.
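
To make the ID scheme concrete, here is a minimal sketch of how a model ID maps onto a Hub URL. The helper names `split_repo_id` and `file_url` are hypothetical, written only for this illustration, and the sketch assumes the Hub's `resolve` URL layout for model files. In practice the `huggingface_hub` library builds these URLs for you.

```python
HUB = "https://huggingface.co"

def split_repo_id(repo_id):
    """Split 'namespace/name' into its parts; legacy IDs have no namespace."""
    namespace, _, name = repo_id.rpartition("/")
    return (namespace or None, name)

def file_url(repo_id, filename, revision="main"):
    """Build the download URL for one file in a model repository."""
    return f"{HUB}/{repo_id}/resolve/{revision}/{filename}"

for repo_id in ["openai-community/gpt2", "bert-base-uncased"]:
    namespace, name = split_repo_id(repo_id)
    print(f"{repo_id}: namespace={namespace}, name={name}")
    print("  ", file_url(repo_id, "config.json"))
```

Because the ID alone determines where every file lives, a loader only needs that one string to find the configuration, tokenizer files, and weights.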


In [None]:
# First, install the huggingface_hub library if needed
# !pip install huggingface_hub

from huggingface_hub import list_models, list_datasets

# List popular text generation models
print("="*60)
print("Popular Text Generation Models:")
print("="*60)
text_gen_models = list(list_models(
    task="text-generation",
    sort="downloads",
    direction=-1,
    limit=10
))
for i, model in enumerate(text_gen_models, 1):
    print(f"{i}. {model.id}")

print("\n" + "="*60)
print("Popular Datasets:")
print("="*60)
datasets = list(list_datasets(
    sort="downloads",
    direction=-1,
    limit=10
))
for i, dataset in enumerate(datasets, 1):
    print(f"{i}. {dataset.id}")

print("\n" + "="*60)
print("Explore more at: https://huggingface.co/models")
print("="*60)


## The Easy Button: The pipeline API

The `pipeline` API is the highest-level, easiest way to use a model for a specific task. It abstracts away all the complexity we've been working with manually:

1. **Loading the model** - Downloads and loads the pre-trained weights
2. **Tokenization** - Converts text into tokens the model understands
3. **Inference** - Runs the model forward pass
4. **Decoding** - Converts the model's output back into human-readable text

With `pipeline`, you can go from zero to running inference in literally one line of code. It's perfect for quick experiments and prototyping.
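
The four stages listed above can be sketched with a toy stand-in. Everything here is hypothetical and deliberately tiny: `ToyTokenizer`, `ToyModel`, and `toy_pipeline` are not part of any library, the vocabulary has five words, and the "model" is just a lookup table. The point is only to show the order of operations that a real `pipeline` call hides.

```python
class ToyTokenizer:
    """Whitespace tokenizer over a tiny fixed vocabulary (illustration only)."""

    def __init__(self, vocab):
        self.token_to_id = {tok: i for i, tok in enumerate(vocab)}
        self.id_to_token = {i: tok for i, tok in enumerate(vocab)}

    def encode(self, text):
        return [self.token_to_id[tok] for tok in text.lower().split()]

    def decode(self, ids):
        return " ".join(self.id_to_token[i] for i in ids)


class ToyModel:
    """Stand-in for a language model: a lookup table from token ID to next ID."""

    def __init__(self, next_id):
        self.next_id = next_id

    def predict_next(self, ids):
        return self.next_id[ids[-1]]


def toy_pipeline(text, max_new_tokens=3):
    # 1. Loading: build the tokenizer and "model" (a real pipeline downloads weights here)
    tokenizer = ToyTokenizer(["the", "cat", "sat", "on", "mat"])
    model = ToyModel({0: 1, 1: 2, 2: 3, 3: 0, 4: 0})
    # 2. Tokenization: text -> token IDs
    ids = tokenizer.encode(text)
    # 3. Inference: repeatedly predict the next token and append it
    for _ in range(max_new_tokens):
        ids.append(model.predict_next(ids))
    # 4. Decoding: token IDs -> text
    return tokenizer.decode(ids)


print(toy_pipeline("The cat"))  # prints "the cat sat on the"
```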


In [None]:
# Install transformers if needed
# !pip install transformers torch

from transformers import pipeline
import torch

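# Set device: prefer CUDA, then Apple's MPS, then fall back to CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")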
print(f"Using device: {device}\n")

# Example 1: Text Generation with GPT-2
print("="*60)
print("1. Text Generation Pipeline (GPT-2)")
print("="*60)
generator = pipeline('text-generation', model='gpt2', device=device)
result = generator("The future of artificial intelligence", max_length=50, num_return_sequences=1)
print(result[0]['generated_text'])
print()

# Example 2: Sentiment Analysis
print("="*60)
print("2. Sentiment Analysis Pipeline")
print("="*60)
classifier = pipeline('sentiment-analysis', model='distilbert-base-uncased-finetuned-sst-2-english', device=device)
result = classifier("I love this product! It's amazing.")
print(result)
print()

# Example 3: Fill-Mask (BERT-style)
print("="*60)
print("3. Fill-Mask Pipeline (BERT)")
print("="*60)
unmasker = pipeline('fill-mask', model='bert-base-uncased', device=device)
result = unmasker("The capital of France is [MASK].")
for i, prediction in enumerate(result[:3], 1):
    print(f"{i}. {prediction['sequence']} (confidence: {prediction['score']:.4f})")
print()

print("="*60)
print("That's it! One line of code per task.")
print("="*60)


## The Standard Workflow: AutoModel and AutoTokenizer

For more control, we drop down a level from `pipeline` to `AutoModel` and `AutoTokenizer`. These classes automatically select the correct model architecture and tokenizer based on the model ID you provide.

**AutoTokenizer.from_pretrained(model_id):** Loads the correct tokenizer for any model. It knows which tokenizer to use based on the model's configuration.

**AutoModelForCausalLM.from_pretrained(model_id):** Loads the model architecture and its pre-trained weights. The `ForCausalLM` suffix indicates this is a causal language model (like GPT) that generates text left-to-right.

Now we'll manually replicate the steps that `pipeline` does automatically. This gives us full control over each step and helps us understand exactly what's happening under the hood.
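
One way to picture the automatic selection is as a registry lookup keyed on the `model_type` field of a checkpoint's `config.json`. The sketch below is a simplified illustration under that assumption, not the library's actual code: the registry dictionary and the `resolve_class_name` helper are hypothetical, and the config is a hard-coded string rather than a downloaded file.

```python
import json

# Hypothetical registry: model_type -> name of the causal-LM class to build.
CAUSAL_LM_REGISTRY = {
    "gpt2": "GPT2LMHeadModel",
    "llama": "LlamaForCausalLM",
}

def resolve_class_name(config_json):
    """Pick a class name from the 'model_type' field of a config file."""
    config = json.loads(config_json)
    model_type = config["model_type"]
    if model_type not in CAUSAL_LM_REGISTRY:
        raise ValueError(f"No causal LM registered for model_type={model_type!r}")
    return CAUSAL_LM_REGISTRY[model_type]

# A trimmed-down stand-in for GPT-2's config.json
gpt2_config = '{"model_type": "gpt2", "n_layer": 12, "vocab_size": 50257}'
print(resolve_class_name(gpt2_config))  # prints "GPT2LMHeadModel"
```

This is why the same two `from_pretrained` calls work unchanged when you swap `gpt2` for a different model ID: the checkpoint itself says which architecture it needs.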


In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

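# Set device: prefer CUDA, then Apple's MPS, then fall back to CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")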
print(f"Using device: {device}\n")

# Step 1: Load the tokenizer and model for GPT-2
print("="*60)
print("Step 1: Loading Model and Tokenizer")
print("="*60)
model_id = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
print(f"✓ Loaded {model_id}")
print(f"✓ Vocabulary size: {tokenizer.vocab_size}")
print()

# Step 2: Define a prompt
print("="*60)
print("Step 2: Define Prompt")
print("="*60)
prompt = "The future of artificial intelligence"
print(f"Prompt: '{prompt}'")
print()

# Step 3: Tokenize
print("="*60)
print("Step 3: Tokenization")
print("="*60)
inputs = tokenizer(prompt, return_tensors="pt")
print("Tokenized inputs:")
print(f"  inputs: {inputs}")
print(f"  input_ids shape: {inputs['input_ids'].shape}")
print(f"  attention_mask shape: {inputs['attention_mask'].shape}")
print()
print("Explanation:")
print("  - input_ids: The token IDs that represent our text")
print("  - attention_mask: A mask indicating which tokens are real (1) vs padding (0)")
print(f"  - Token IDs: {inputs['input_ids'][0].tolist()}")
print(f"  - Decoded back: {tokenizer.decode(inputs['input_ids'][0])}")
print()

# Step 4: Move inputs to device and run inference
print("="*60)
print("Step 4: Inference")
print("="*60)
inputs = {k: v.to(device) for k, v in inputs.items()}
model.eval()
with torch.no_grad():
    outputs = model(**inputs)
print("✓ Model forward pass complete")
print()

# Step 5: Inspect Logits
print("="*60)
print("Step 5: Inspect Logits")
print("="*60)
logits = outputs.logits
print(f"Logits shape: {logits.shape}")
print(f"  - Batch size: {logits.shape[0]}")
print(f"  - Sequence length: {logits.shape[1]}")
print(f"  - Vocabulary size: {logits.shape[2]}")
print()
print("Explanation:")
print("  - For each position in the sequence, the model outputs scores for every token in the vocabulary")
print("  - Shape (batch_size, sequence_length, vocab_size) means:")
print("    * We have 1 example in the batch")
print("    * We have predictions for each token in our input sequence")
print("    * For each position, we have scores for all possible next tokens")
print()

# Step 6: Get the next token prediction (last position)
print("="*60)
print("Step 6: Get Next Token Prediction")
print("="*60)
next_token_logits = logits[0, -1, :]  # Get logits for the last position
next_token_probs = torch.softmax(next_token_logits, dim=-1)
top_k = 5
top_k_probs, top_k_indices = torch.topk(next_token_probs, top_k)
print(f"Top {top_k} most likely next tokens:")
for i, (prob, idx) in enumerate(zip(top_k_probs, top_k_indices), 1):
    token = tokenizer.decode([idx.item()])
    print(f"  {i}. '{token}' (probability: {prob.item():.4f})")
print()

# Step 7: Decode the full output
print("="*60)
print("Step 7: Decode Full Output")
print("="*60)
# For demonstration, let's generate a few tokens
generated_ids = model.generate(
    **inputs,  # pass input_ids together with the attention_mask
    max_new_tokens=30,
    do_sample=True,
    temperature=0.7,
    pad_token_id=tokenizer.eos_token_id
)
generated_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(f"Generated text: {generated_text}")
print()
print("="*60)
print("This is exactly what pipeline does; we just did it step by step!")
print("="*60)
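
The `temperature=0.7` argument passed to `generate` in the cell above controls how sharply the sampling distribution is peaked. As a rough sketch of the idea (not the library's implementation), the logits are divided by the temperature before the softmax, so values below 1 concentrate probability on the top tokens. The three logits below are made-up numbers, and `softmax` and `sample` are helper names chosen for this illustration.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample(logits, temperature=1.0, rng=random):
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]  # made-up scores for a three-token vocabulary
for t in (1.0, 0.7, 0.2):
    probs = softmax(logits, t)
    print(f"T={t}: " + ", ".join(f"{p:.3f}" for p in probs))

rng = random.Random(0)  # seeded so the draw is repeatable
print("sampled index:", sample(logits, temperature=0.7, rng=rng))
```

Lower temperatures push the output toward the single most likely token, while a temperature of 1.0 samples from the model's unmodified distribution.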


## Fine-Tuning: Adapting a Genius for a New Job

Why spend millions training a model from scratch on all of Wikipedia when you can take a pre-trained "genius" model like Llama 3 and simply show it a few thousand examples of your specific task (e.g., answering customer support questions)? 

This process of adapting a pre-trained model to a new task is called **fine-tuning**. Instead of training from scratch (which requires massive computational resources and datasets), fine-tuning leverages the general knowledge already encoded in the pre-trained model and specializes it for your specific use case.

Think of it like this: A pre-trained model is a brilliant generalist who has read everything on the internet. Fine-tuning is like giving this generalist specialized training for a specific job: teaching them your company's terminology, your product details, or your writing style. It's far more efficient than training a new expert from scratch!


## The Professional Way: The Trainer API

The `Trainer` class from Hugging Face Transformers is the high-level tool for fine-tuning. It handles all the training loop boilerplate we wrote manually in earlier notebooks:

- **Epoch/step loops** - Iterating through your dataset
- **Moving data to device** - Handling GPU/CPU transfers automatically
- **Calling loss.backward()** - Computing gradients
- **optimizer.step()** - Updating model weights
- **Evaluation** - Running validation loops
- **Logging** - Tracking metrics and losses
- **Saving checkpoints** - Automatic model checkpointing

Instead of writing dozens of lines of training loop code, you define your model, dataset, and training arguments, then simply call `trainer.train()`. The Trainer handles everything else, making fine-tuning accessible and reliable.
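
As a reminder of what is being automated, here is a toy version of that loop in plain Python. It is a sketch under heavy simplification, not how `Trainer` is implemented: the "model" is a single weight `w` fitting `y = w * x`, the gradient is derived by hand in place of `loss.backward()`, and the `train` function and its checkpoint file name are hypothetical.

```python
import json
import os
import tempfile

def train(train_data, eval_data, epochs=20, lr=0.05, log_every=10, out_dir=None):
    """Fit y = w * x with plain gradient descent on squared error."""
    w = 0.0  # the single trainable parameter
    step = 0
    for epoch in range(epochs):                    # epoch loop
        for x, y in train_data:                    # step loop
            error = w * x - y                      # forward pass
            grad = 2 * error * x                   # stands in for loss.backward()
            w -= lr * grad                         # stands in for optimizer.step()
            step += 1
            if step % log_every == 0:              # logging
                print(f"step {step}: train loss {error ** 2:.6f}")
        # evaluation on held-out data after every epoch
        eval_loss = sum((w * x - y) ** 2 for x, y in eval_data) / len(eval_data)
        if out_dir is not None:                    # checkpointing
            with open(os.path.join(out_dir, "checkpoint.json"), "w") as f:
                json.dump({"epoch": epoch, "w": w, "eval_loss": eval_loss}, f)
    return w, eval_loss

train_data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]  # samples of y = 3x
eval_data = [(4.0, 12.0)]
with tempfile.TemporaryDirectory() as out_dir:
    w, eval_loss = train(train_data, eval_data, out_dir=out_dir)
print(f"learned w = {w:.4f}, eval loss = {eval_loss:.6f}")
```

Each commented line corresponds to one bullet in the list above; `Trainer` packages the same responsibilities behind `TrainingArguments`.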


In [None]:
"""
Fine-Tuning Template
====================

This is a complete template showing the structure of a fine-tuning script.
Note: This is a template - actual datasets and model IDs may vary.

For a runnable example, you would need:
1. A dataset in the correct format
2. Proper data preprocessing
3. Sufficient GPU memory for the model you choose
"""

from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling
)
from datasets import load_dataset

# Step 1: Load model and tokenizer
model_id = "gpt2"  # or "meta-llama/Meta-Llama-3-8B" for a larger model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Add padding token if it doesn't exist
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Step 2: Load your dataset
# This could be from Hugging Face Hub or a local file
dataset = load_dataset("your_dataset_name", split="train")

# Step 3: Preprocess the dataset
def tokenize_function(examples):
    # Tokenize the text
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=512,
        padding="max_length"
    )

tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=dataset.column_names
)

# Hold out 10% of the examples so the Trainer has data to evaluate on
split_dataset = tokenized_dataset.train_test_split(test_size=0.1, seed=42)

# Step 4: Define training arguments
training_args = TrainingArguments(
    output_dir="./results",              # Where to save checkpoints
    num_train_epochs=3,                 # Number of training epochs
    per_device_train_batch_size=4,      # Batch size per GPU
    per_device_eval_batch_size=4,       # Evaluation batch size
    warmup_steps=100,                   # Warmup steps for learning rate
    learning_rate=5e-5,                 # Learning rate
    logging_dir="./logs",               # Directory for logs
    logging_steps=10,                   # Log every N steps
    save_steps=500,                     # Save checkpoint every N steps
    eval_strategy="steps",              # Evaluate during training (named evaluation_strategy in older releases)
    eval_steps=500,                     # Evaluate every N steps
    save_total_limit=2,                 # Keep only last 2 checkpoints
    load_best_model_at_end=True,        # Load best model at end
    report_to="tensorboard",            # Logging tool
)

# Step 5: Create data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,  # Set to False for causal LM (GPT-style)
)

# Step 6: Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=split_dataset["train"],
    eval_dataset=split_dataset["test"],  # Needed because evaluation is enabled above
    data_collator=data_collator,
)

# Step 7: Train!
trainer.train()

# Step 8: Save the fine-tuned model
trainer.save_model("./fine-tuned-model")
tokenizer.save_pretrained("./fine-tuned-model")

print("\n" + "="*60)
print("Fine-tuning complete! That's all it takes.")
print("="*60)


## Making Fine-Tuning Feasible: QLoRA and Unsloth

Even fine-tuning can be too much for consumer hardware. Modern techniques have revolutionized what's possible on a single GPU.

### QLoRA: Fine-Tuning Large Models on Consumer Hardware

QLoRA (Quantized Low-Rank Adaptation) is a technique that makes fine-tuning very large models feasible on consumer hardware. Here's how it works:

1. **Quantization**: First, it loads the base model in a memory-saving 4-bit quantized format instead of the usual 16-bit or 32-bit floating point. This dramatically reduces memory usage: the weights of a 7B-parameter model take about 28 GB at 32-bit (14 GB at 16-bit) but only about 3.5 GB at 4-bit, before counting activations and other overhead.

2. **Freezing**: The quantized base model is "frozen": its weights are not updated during training.

3. **LoRA Adapters**: Tiny, trainable "adapter" layers (called LoRA - Low-Rank Adaptation) are inserted into the model. These adapters contain only a small fraction of the model's parameters.

4. **Training Only Adapters**: During fine-tuning, you only train these tiny adapters, not the entire model. This is dramatically faster and uses far less memory than training the whole model.
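
The memory figures in the quantization step follow from simple arithmetic on bits per parameter. The helper below is a back-of-the-envelope sketch (the name `weight_memory_gb` is made up for this example) and counts only the weights, not activations, quantization metadata, adapters, or optimizer state.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

num_params = 7e9  # a 7B-parameter model
for bits in (32, 16, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(num_params, bits):.1f} GB")
# 32-bit weights: 28.0 GB
# 16-bit weights: 14.0 GB
#  4-bit weights: 3.5 GB
```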

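The `peft` (Parameter-Efficient Fine-Tuning) library from Hugging Face implements LoRA and other efficient fine-tuning techniques; adding adapters is as simple as wrapping your model with a `get_peft_model()` call. For QLoRA, the 4-bit loading of the base model is typically handled by the `bitsandbytes` library through `transformers`.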
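
The adapter idea itself needs no library to illustrate. In the sketch below, a frozen weight matrix `W` is augmented with two small matrices `A` and `B` of rank `r`, and the layer output becomes `W x + (alpha / r) * B (A x)`, following the standard LoRA formulation. The function names and the tiny matrices are made up for this example; this is not how `peft` is implemented internally.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """y = W x + (alpha / r) * B (A x); W is frozen, only A and B would train."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

def lora_param_count(d_in, d_out, r):
    """Trainable parameters in one adapter: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)

# Tiny example: a 2x3 frozen weight with a rank-1 adapter.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
A = [[0.5, 0.5, 0.0]]   # r x d_in
B = [[0.0], [0.0]]      # d_out x r, zero-initialized so training starts at the base model
x = [1.0, 2.0, 3.0]
print(lora_forward(W, A, B, x, alpha=2, r=1))  # [7.0, 2.0], identical to W x

# Parameter savings for one 4096 x 4096 projection with rank 8.
full = 4096 * 4096
adapter = lora_param_count(4096, 4096, r=8)
print(f"full: {full:,}  adapter: {adapter:,}  ratio: {adapter / full:.2%}")
# full: 16,777,216  adapter: 65,536  ratio: 0.39%
```

With `B` initialized to zero, the adapted layer starts out identical to the frozen one, and training moves only the small `A` and `B` matrices.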

### Unsloth: Making Fine-Tuning Faster

Unsloth is a performance library that accelerates fine-tuning. You add two lines of code to your fine-tuning script, and it intelligently replaces standard PyTorch modules with its own hand-written, hyper-optimized kernels.

**Benefits:**
- **2x faster** training speed (the project's reported figure; actual gains depend on model and hardware)
- **Significantly less memory** usage
- **Especially effective** for QLoRA fine-tuning
- **Drop-in replacement** - no changes to your training code structure

Unsloth optimizes the critical operations that happen billions of times during training: matrix multiplications, attention computations, and gradient updates. It's like switching from a regular car to a Formula 1 race car: same destination, much faster journey.

Together, QLoRA and Unsloth make it possible to fine-tune models like Llama 3 8B on a single consumer GPU in just a few hours, opening up advanced LLM capabilities to a much broader audience.


## Conclusion: Your Journey So Far

You started with a single tensor. You built MLPs, CNNs, and ResNets. You architected a GPT from its fundamental components using Flash Attention. You mastered the logic of inference and sampling. And now, you've seen how the professional ecosystem abstracts these concepts into powerful, reusable tools.

**What you've accomplished:**

1. **From Scratch to Production**: You understand the fundamental building blocks: tensors, neural networks, attention mechanisms, and inference loops. You've built these from scratch.

2. **Industry Standards**: You know how to use the Hugging Face ecosystem: pipelines, AutoModel, AutoTokenizer, and the Trainer API.

3. **Real-World Techniques**: You understand modern fine-tuning methods like QLoRA and performance optimizations like Unsloth.

4. **Full-Stack Intuition**: You have the code-first intuition to tackle any challenge in the world of LLMs, from understanding research papers to implementing production systems.

**The path forward:**

You now have the foundation to:
- Fine-tune models for your specific use cases
- Experiment with different architectures and techniques
- Understand and contribute to the latest research
- Build production applications with LLMs

The world of LLMs is vast and rapidly evolving. But with your deep understanding of the fundamentals and familiarity with the tools, you're equipped to navigate it confidently. Keep building, keep experimenting, and keep learning!

**Happy coding! 🚀**
