# Week 9: Datasets on Hugging Face
## The Fuel for AI Models

**Today's Goals:**
1. Understand why datasets matter for AI
2. Navigate the Hugging Face Datasets Hub
3. Load and explore datasets in Python
4. Prepare for fine-tuning (next steps!)

---

## Part 1: Why Datasets Matter

**AI models learn from data.**

Think of it like this:
- A **model** is like a student
- A **dataset** is like a textbook
- **Training** is like studying

The quality and type of data determine what the model learns!

```
Dataset (examples) → Training → Model (learned patterns)
```

### Real Examples:

| Dataset | What model learns |
|---------|------------------|
| Movie reviews (positive/negative) | Sentiment analysis |
| Images of cats and dogs | Image classification |
| Question-answer pairs | Question answering |
| English-French sentence pairs | Translation |
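
To make the dataset → training → model idea concrete, here is a minimal sketch in plain Python. The four reviews are made up for illustration, and the word-counting "model" is a deliberately simple stand-in, not how real sentiment models are trained.

```python
from collections import Counter

# Toy dataset (made up for illustration): (text, label) pairs
# 1 = positive, 0 = negative
toy_dataset = [
    ("great fun movie", 1),
    ("loved the great acting", 1),
    ("boring and awful plot", 0),
    ("awful boring acting", 0),
]

def train(examples):
    """'Training': count how often each word appears under each label."""
    counts = {0: Counter(), 1: Counter()}
    for text, label in examples:
        counts[label].update(text.split())
    return counts

def predict(model, text):
    """Pick the label whose training words overlap more with the text.

    Ties (including texts with only unseen words) fall back to 0.
    """
    words = text.split()
    pos_score = sum(model[1][w] for w in words)
    neg_score = sum(model[0][w] for w in words)
    return 1 if pos_score > neg_score else 0

model = train(toy_dataset)
print(predict(model, "great movie"))
print(predict(model, "boring plot"))
```

Change the examples in `toy_dataset` and the predictions change with them, which is the whole point: the model only knows what its data showed it.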

---
## Setup

In [None]:
# Install the datasets library
!pip install datasets -q
!pip install pandas -q

print("Libraries installed!")

In [None]:
from datasets import load_dataset
import pandas as pd

print("Ready to explore datasets!")

---
## Part 2: The Hugging Face Datasets Hub

**Browse datasets at:** [huggingface.co/datasets](https://huggingface.co/datasets)

There are 100,000+ datasets covering:
- Text (sentiment, Q&A, translation)
- Images (classification, object detection)
- Audio (speech recognition, music)
- And more!

### How to Find Datasets:
1. Use the **Task** filter (what do you want to do?)
2. Sort by **Downloads** (popular datasets are usually well tested, but still read the dataset card)
3. Check the **Size** (smaller = faster to load)
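
The three steps above boil down to filter, sort, and check. Here is a rough sketch of that logic on a hypothetical in-memory catalog. Every name and number below is invented for illustration; the real Hub does this for you through its web filters.

```python
# Hypothetical catalog entries (names and numbers are made up)
catalog = [
    {"name": "toy-reviews-small", "task": "text-classification", "downloads": 5_000, "size_mb": 4},
    {"name": "toy-reviews-large", "task": "text-classification", "downloads": 90_000, "size_mb": 800},
    {"name": "toy-speech", "task": "automatic-speech-recognition", "downloads": 70_000, "size_mb": 12_000},
    {"name": "toy-news", "task": "text-classification", "downloads": 40_000, "size_mb": 30},
]

def find_datasets(entries, task, max_size_mb):
    """Keep entries for one task under a size limit, most downloaded first."""
    matches = [
        e for e in entries
        if e["task"] == task and e["size_mb"] <= max_size_mb
    ]
    return sorted(matches, key=lambda e: e["downloads"], reverse=True)

results = find_datasets(catalog, "text-classification", max_size_mb=100)
for entry in results:
    print(f"{entry['name']}: {entry['downloads']:,} downloads, {entry['size_mb']} MB")
```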

---
## Part 3: Loading Your First Dataset

Let's load the famous **IMDB dataset** - 50,000 movie reviews labeled as positive or negative.

In [None]:
# Load the IMDB dataset
imdb = load_dataset("imdb")

print("Dataset loaded!")
print(imdb)

In [None]:
# See the structure
print("Dataset structure:")
print(f"  Training examples: {len(imdb['train'])}")
print(f"  Test examples: {len(imdb['test'])}")
print(f"\nColumns: {imdb['train'].column_names}")

In [None]:
# Look at one example
example = imdb['train'][0]

print("First training example:")
print(f"\nLabel: {example['label']} ({'Positive' if example['label'] == 1 else 'Negative'})")
print(f"\nReview (first 500 chars):")
print(example['text'][:500] + "...")

### Understanding the Labels:
- `0` = Negative review
- `1` = Positive review
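
Rather than repeating `if label == 1` everywhere, one option is to keep the mapping in a small lookup table. This is a sketch; the names `ID2LABEL`, `LABEL2ID`, and `label_name` are our own and not part of the `datasets` library.

```python
# Lookup tables for the two IMDB labels described above
ID2LABEL = {0: "Negative", 1: "Positive"}
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}

def label_name(label_id):
    """Translate a numeric label into a readable name."""
    return ID2LABEL.get(label_id, f"Unknown ({label_id})")

print(label_name(1))
print(label_name(0))
print(LABEL2ID["Positive"])
```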

In [None]:
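# Look at a few more examples
# IMDB's training split is grouped by label, so shuffle first to likely see a mix
print("Sample reviews:\n")
sample = imdb['train'].shuffle(seed=42).select(range(3))
for i, ex in enumerate(sample):
    sentiment = "POSITIVE" if ex['label'] == 1 else "NEGATIVE"
    print(f"--- Example {i+1} ({sentiment}) ---")
    print(ex['text'][:200] + "...")
    print()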

---
## Part 4: Exploring Other Datasets

### Emotion Dataset

Tweets labeled with 6 emotions!

In [None]:
# Load emotion dataset
emotion = load_dataset("emotion")

print("Emotion dataset loaded!")
print(f"Training examples: {len(emotion['train'])}")
print(f"Columns: {emotion['train'].column_names}")

In [None]:
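# The emotion labels, read from the dataset so they always match the label IDs
# (expected order: sadness, joy, love, anger, fear, surprise)
emotion_labels = emotion['train'].features['label'].names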

# Look at examples of each emotion
print("Examples of each emotion:\n")
for i, label in enumerate(emotion_labels):
    # Find an example with this label
    for ex in emotion['train']:
        if ex['label'] == i:
            print(f"{label.upper()}: \"{ex['text']}\"")
            break
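
The loop above restarts from the top of the dataset for every label. A possible alternative is to walk through the rows once and remember the first text seen for each label. The sketch below uses a few made-up rows in place of the real dataset; `first_example_per_label` is a hypothetical helper name.

```python
def first_example_per_label(rows, num_labels):
    """Scan rows once and keep the first text seen for each label ID."""
    found = {}
    for row in rows:
        label = row["label"]
        if label not in found:
            found[label] = row["text"]
            if len(found) == num_labels:
                break  # every label has an example, stop early
    return found

# Made-up rows standing in for the emotion dataset
toy_rows = [
    {"text": "i feel so down today", "label": 0},
    {"text": "i am thrilled about the trip", "label": 1},
    {"text": "i feel low again", "label": 0},
    {"text": "i am furious at the delay", "label": 3},
]
toy_labels = ["sadness", "joy", "love", "anger", "fear", "surprise"]

found = first_example_per_label(toy_rows, num_labels=len(toy_labels))
for label_id in sorted(found):
    print(f"{toy_labels[label_id].upper()}: {found[label_id]}")
```

Labels that never appear in the rows are simply missing from the result, so the toy data here prints three emotions instead of six.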

### Rotten Tomatoes Dataset

A smaller sentiment dataset - good for quick experiments!

In [None]:
# Load Rotten Tomatoes (smaller, faster)
rotten = load_dataset("rotten_tomatoes")

print("Rotten Tomatoes dataset:")
print(f"  Training: {len(rotten['train'])} examples")
print(f"  Validation: {len(rotten['validation'])} examples")
print(f"  Test: {len(rotten['test'])} examples")

In [None]:
# Compare sizes
print("Dataset size comparison:\n")
print(f"IMDB: {len(imdb['train']):,} training examples")
print(f"Rotten Tomatoes: {len(rotten['train']):,} training examples")
print(f"Emotion: {len(emotion['train']):,} training examples")

print("\nSmaller datasets are faster to train on!")

---
## Part 5: Working with Datasets

### Selecting Specific Examples

In [None]:
# Get specific examples
# Slicing returns a dict of columns (each a list), so count items via a column
first_10 = imdb['train'][:10]
print(f"First 10 examples: {len(first_10['text'])} items")

# Get a range
middle = imdb['train'][100:105]
print(f"Examples 100-104: {len(middle['text'])} items")
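
A slice like the ones above comes back as a dict of columns, not a list of rows. That is why the code counts `len(first_10['text'])` instead of `len(first_10)`. The sketch below uses a hand-written dict with the same shape (the reviews are invented) and shows one way to turn it back into rows.

```python
# Toy stand-in for the shape of a slice: one list per column
batch = {
    "text": ["loved it", "too long", "a classic"],
    "label": [1, 0, 1],
}

def columns_to_rows(columns):
    """Turn {'col': [v1, v2, ...]} into [{'col': v1}, {'col': v2}, ...]."""
    keys = list(columns)
    return [dict(zip(keys, values)) for values in zip(*columns.values())]

rows = columns_to_rows(batch)

print(len(batch))          # number of columns, not number of examples
print(len(batch["text"]))  # number of examples
print(rows[0])
```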

### Filtering Data

In [None]:
# Filter to only positive reviews
positive_reviews = imdb['train'].filter(lambda x: x['label'] == 1)
print(f"Positive reviews: {len(positive_reviews)}")

# Filter to only negative reviews
negative_reviews = imdb['train'].filter(lambda x: x['label'] == 0)
print(f"Negative reviews: {len(negative_reviews)}")
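
`filter` calls your function once per example and keeps the examples where it returns `True`. The same idea in plain Python, on a made-up list of reviews, looks like the sketch below. It also combines two conditions, which works the same way inside a `filter` function.

```python
# Made-up reviews for illustration
toy_reviews = [
    {"text": "loved it", "label": 1},
    {"text": "a wonderful, moving, beautifully shot film", "label": 1},
    {"text": "too long", "label": 0},
    {"text": "dull", "label": 0},
    {"text": "great cast", "label": 1},
]

def is_short_positive(example):
    """Keep only positive reviews shorter than 20 characters."""
    return example["label"] == 1 and len(example["text"]) < 20

short_positive = [ex for ex in toy_reviews if is_short_positive(ex)]
positive = [ex for ex in toy_reviews if ex["label"] == 1]
negative = [ex for ex in toy_reviews if ex["label"] == 0]

print([ex["text"] for ex in short_positive])
# Sanity check: the two label filters together cover every example
print(len(positive) + len(negative) == len(toy_reviews))
```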

### Converting to Pandas (for easier exploration)

In [None]:
# Convert to pandas DataFrame (to_pandas() is the library's built-in converter)
df = rotten['train'].to_pandas()

print("As a pandas DataFrame:")
print(df.head())

In [None]:
# Basic statistics
print("\nLabel distribution:")
print(df['label'].value_counts())

print("\nAverage text length:")
df['text_length'] = df['text'].apply(len)
print(f"  {df['text_length'].mean():.0f} characters")
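
An average can hide a lot: one very long review pulls the mean up. As a small sketch using only the standard library (the three texts are invented), compare the mean with the median and count words as well as characters.

```python
import statistics

# Made-up texts for illustration
toy_texts = ["loved it", "too long and far too slow", "a classic"]

char_lengths = [len(t) for t in toy_texts]
word_counts = [len(t.split()) for t in toy_texts]

print(f"Characters: {char_lengths}")
print(f"  mean:   {statistics.mean(char_lengths)}")
print(f"  median: {statistics.median(char_lengths)}")
print(f"Words: {word_counts}")
print(f"  longest text has {max(word_counts)} words")
```

With pandas, `df['text_length'].median()` and `df['text_length'].describe()` give the same kind of summary.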

---
## Part 6: Train/Test Split - Why It Matters

Most datasets have splits:

- **Train**: Data the model learns from
- **Validation**: Data to check progress during training (not every dataset ships one; IMDB does not)
- **Test**: Data to evaluate final performance

**Important:** The test data should NEVER be seen during training!
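
When a dataset has no ready-made splits, you can make your own. The sketch below shows the idea with plain Python indices; `split_indices` and its default fractions are our own choices for illustration. The `datasets` library also has a `train_test_split` method that does the two-way version of this for you.

```python
import random

def split_indices(n, val_fraction=0.1, test_fraction=0.1, seed=42):
    """Shuffle the indices 0..n-1 and cut them into train/validation/test."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # fixed seed: same split every run

    n_test = round(n * test_fraction)
    n_val = round(n * val_fraction)

    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test

train_idx, val_idx, test_idx = split_indices(100)
print(len(train_idx), len(val_idx), len(test_idx))

# No example may appear in both train and test
print(set(train_idx).isdisjoint(test_idx))
```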

In [None]:
# See the splits (IMDB also includes an unlabeled 'unsupervised' split)
print("IMDB splits:")
for split in imdb.keys():
    print(f"  {split}: {len(imdb[split])} examples")

In [None]:
# Create a small sample for quick experiments
small_train = imdb['train'].shuffle(seed=42).select(range(1000))
small_test = imdb['test'].shuffle(seed=42).select(range(200))

print(f"Small training set: {len(small_train)} examples")
print(f"Small test set: {len(small_test)} examples")
print("\nSmaller datasets = faster experiments!")
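
Why pass `seed=42`? A seed makes the "random" choice repeatable, so you and a teammate get the same sample. Here is a minimal sketch of that behaviour with Python's own `random` module; `sample_indices` is a hypothetical helper, not the library's implementation.

```python
import random

def sample_indices(n_total, n_sample, seed):
    """Pick n_sample distinct positions out of n_total, repeatably."""
    return random.Random(seed).sample(range(n_total), n_sample)

first_run = sample_indices(25_000, 5, seed=42)
second_run = sample_indices(25_000, 5, seed=42)
other_seed = sample_indices(25_000, 5, seed=7)

print(first_run == second_run)  # same seed, same picks
print(first_run == other_seed)  # a different seed will almost surely differ
```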

---
## Part 7: What Makes a Good Dataset?

When choosing or creating a dataset, consider:

### 1. Size
- More data usually = better model
- But smaller datasets are faster to train

### 2. Quality
- Accurate labels matter!
- Clean, consistent data

### 3. Balance
- Roughly equal numbers of examples in each category
- Imbalanced data can bias the model

### 4. Relevance
- Does it match your actual use case?
- A model trained only on tweets will likely struggle with legal documents
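
The quality point is easy to check in code. As a rough sketch, the hypothetical helper below counts empty texts and repeated texts in a made-up list of rows. Real cleaning usually needs more than this, but it is a quick first look.

```python
from collections import Counter

def quality_report(rows):
    """Count empty texts and repeated texts (ignoring case and outer spaces)."""
    texts = [row["text"].strip() for row in rows]
    empty = sum(1 for t in texts if not t)
    counts = Counter(t.lower() for t in texts if t)
    # Each text seen c times contributes c - 1 extra copies
    duplicates = sum(c - 1 for c in counts.values() if c > 1)
    return {"total": len(rows), "empty": empty, "duplicates": duplicates}

# Made-up rows for illustration
toy_rows = [
    {"text": "Great film", "label": 1},
    {"text": "great film", "label": 1},
    {"text": "", "label": 0},
    {"text": "Too slow", "label": 0},
    {"text": "   ", "label": 1},
]

report = quality_report(toy_rows)
print(report)
```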

In [None]:
# Check if IMDB is balanced
train_df = pd.DataFrame(imdb['train'])

print("IMDB label balance:")
counts = train_df['label'].value_counts()
print(f"  Negative (0): {counts[0]} ({counts[0]/len(train_df):.1%})")
print(f"  Positive (1): {counts[1]} ({counts[1]/len(train_df):.1%})")
print("\nPerfectly balanced! (This is good)")
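
The same check can be wrapped in a helper that works on any list of labels. This is a sketch with invented label lists; `label_balance` is our own name. It also shows why imbalance matters: if 90% of the labels are `0`, a model that always answers `0` is right 90% of the time without learning anything.

```python
from collections import Counter

def label_balance(labels):
    """Return each label's share and the largest-to-smallest class ratio."""
    counts = Counter(labels)
    total = len(labels)
    shares = {label: count / total for label, count in counts.items()}
    imbalance_ratio = max(counts.values()) / min(counts.values())
    return shares, imbalance_ratio

balanced = [0, 1] * 50           # 50 of each
imbalanced = [0] * 90 + [1] * 10  # 90 vs 10

for name, labels in [("balanced", balanced), ("imbalanced", imbalanced)]:
    shares, ratio = label_balance(labels)
    print(f"{name}: shares={shares}, ratio={ratio:.1f}")
```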

---
## Part 8: Challenge - Explore a New Dataset

Your turn! Find and explore a dataset that interests you.

**Ideas:**
- `ag_news` - News articles (4 categories)
- `squad` - Question-answering
- `tweet_eval` - Tweet classification (needs a config name, e.g. `load_dataset("tweet_eval", "sentiment")`)
- `yelp_review_full` - 5-star reviews

In [None]:
# Try loading a new dataset!
# Uncomment and modify:

# my_dataset = load_dataset("ag_news")
# print(my_dataset)
# print(my_dataset['train'][0])

In [None]:
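def explore_dataset(dataset_name, config_name=None):
    """Helper function to explore any dataset.

    Pass config_name for datasets with several configurations (e.g. tweet_eval).
    """
    print(f"Loading {dataset_name}...")
    ds = load_dataset(dataset_name, config_name)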
    
    print(f"\nDataset: {dataset_name}")
    print("=" * 50)
    
    # Splits
    print("\nSplits:")
    for split in ds.keys():
        print(f"  {split}: {len(ds[split])} examples")
    
    # Columns
    first_split = list(ds.keys())[0]
    print(f"\nColumns: {ds[first_split].column_names}")
    
    # Sample
    print(f"\nFirst example:")
    for key, value in ds[first_split][0].items():
        if isinstance(value, str) and len(value) > 100:
            print(f"  {key}: {value[:100]}...")
        else:
            print(f"  {key}: {value}")
    
    return ds

# Try it!
# explore_dataset("ag_news")

---
## Quick Reference

```python
from datasets import load_dataset

# Load a dataset
dataset = load_dataset("dataset_name")

# Access splits
train = dataset['train']
test = dataset['test']

# Get examples
example = train[0]           # First example
batch = train[:10]           # First 10
shuffled = train.shuffle(seed=42)   # Random order (seed makes it repeatable)

# Filter
filtered = train.filter(lambda x: x['label'] == 1)

# Convert to pandas
df = train.to_pandas()
```

---
## Checklist: What You Learned Today

- [ ] Why datasets are essential for AI (data → training → model)
- [ ] How to browse the Hugging Face Datasets Hub
- [ ] Loading datasets with `load_dataset()`
- [ ] Exploring dataset structure and examples
- [ ] Train/test splits and why they matter
- [ ] What makes a good dataset (size, quality, balance, relevance)

---

## Looking Ahead: Next Week

Next week we'll **plan our portfolio projects**:
- Brainstorm project ideas
- Choose a dataset to work with
- Plan the fine-tuning approach

**Homework (optional):**
- Explore 2-3 datasets that interest you
- Think about what you'd like to build
- Save your exploration to GitHub!

---

*Youth Horizons AI Researcher Program - Level 2*