# Notebook 02: Natural Language Processing - Text Classification

**Learning Objectives:**
- Understand text classification and sentiment analysis
- Use pre-trained models for classification tasks
- Classify text into predefined categories
- Apply models to real-world use cases

## Prerequisites

### Hardware Requirements

| Model Option | Model Name | Size | Min RAM | Recommended Setup | Notes |
|--------------|------------|------|---------|-------------------|-------|
| **CPU (Small)** | distilbert-base-uncased-finetuned-sst-2-english | 268MB | 2GB | 4GB RAM, CPU | Fast, accurate |
| **GPU (Medium)** | bert-base-uncased | 440MB | 4GB | GPU with 6GB+ VRAM (e.g., RTX 4080) | Base model; needs fine-tuning before it can classify |

### Software Requirements
- Python 3.8+
- Libraries: `transformers`, `torch`, `datasets` (used in the benchmarking section)
- See `requirements.txt` for full list

## Overview

**Text Classification** assigns predefined categories or labels to text.

**Use Cases:**
- **Sentiment Analysis**: Positive, negative, neutral
- **Topic Classification**: Sports, politics, technology
- **Spam Detection**: Spam or not spam
- **Intent Recognition**: For chatbots and virtual assistants

**How it works:**
1. Text is tokenized and encoded
2. Model processes the text through transformer layers
3. Output layer produces probability scores for each class
4. Highest probability determines the predicted class
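
Steps 3 and 4 above can be sketched in plain Python. This is a minimal illustration, not the model's actual code: the logits are made-up numbers, and `id2label` simply mirrors the POSITIVE/NEGATIVE labels used later in this notebook.

```python
import math

def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    # Subtracting the largest logit keeps exp() numerically stable
    # without changing the resulting probabilities.
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical values for illustration; a real model produces the logits.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
logits = [-2.0, 3.0]

# Step 3: probability score for each class
probabilities = softmax(logits)

# Step 4: the class with the highest probability is the prediction
predicted_id = max(range(len(probabilities)), key=probabilities.__getitem__)

print(f"{id2label[predicted_id]} ({probabilities[predicted_id]:.4f})")
```

The `score` returned by the pipeline later in the notebook is the probability of the predicted class, computed in this way.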

## Expected Behaviors

### First Time Running
- **Model Download**: ~268MB for distilbert (1-3 minutes depending on internet speed)
- Models cached in `~/.cache/huggingface/hub/`
- Subsequent runs load from the local cache with no re-download
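
To check what is already cached, a small standard-library sketch like the one below can list the cached models and their size on disk. It assumes the default cache layout (one `models--<org>--<name>` folder per model) and that the `HF_HUB_CACHE` or `HF_HOME` environment variables, if set, override the default location. The helper names `hub_cache_dir` and `cached_models` are made up for this example.

```python
import os
from pathlib import Path

def hub_cache_dir():
    """Return the Hugging Face hub cache directory as a Path."""
    # Assumption: HF_HUB_CACHE takes priority, then HF_HOME, then the default.
    if "HF_HUB_CACHE" in os.environ:
        return Path(os.environ["HF_HUB_CACHE"])
    hf_home = os.environ.get("HF_HOME", str(Path.home() / ".cache" / "huggingface"))
    return Path(hf_home) / "hub"

def cached_models(cache_dir):
    """Map each cached model folder name to its size on disk in MB."""
    sizes = {}
    if not cache_dir.is_dir():
        return sizes
    for repo in sorted(cache_dir.glob("models--*")):
        # Skip symlinks so files linked from snapshot folders are not counted twice.
        total_bytes = sum(
            f.stat().st_size
            for f in repo.rglob("*")
            if f.is_file() and not f.is_symlink()
        )
        sizes[repo.name] = total_bytes / 1e6
    return sizes

for name, size_mb in cached_models(hub_cache_dir()).items():
    print(f"{name}: {size_mb:.0f} MB")
```

If nothing is printed, no models have been downloaded yet and the first run will trigger the download.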

### Setup Cell Output
```
PyTorch version: 2.x.x
CUDA available: True/False
```

### Model Loading
```
Loading distilbert-base-uncased-finetuned-sst-2-english...
Model loaded successfully!
```
- Takes 2-5 seconds on CPU, faster on GPU

### Classification Results Format
```python
[{'label': 'POSITIVE', 'score': 0.9998}]
```
- **label**: Either 'POSITIVE' or 'NEGATIVE' for sentiment analysis
- **score**: Confidence between 0 and 1 (higher = more confident)

### Expected Confidence
- **Clear sentiment** (e.g., "I love this!"): 95-99% confidence
- **Neutral sentiment** (e.g., "It's okay"): lower confidence, often 50-80% (may vary); the model has no neutral label, so it still returns POSITIVE or NEGATIVE
- **Mixed sentiment**: The model picks the dominant sentiment
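
Because the model only outputs POSITIVE or NEGATIVE, one way to handle neutral or mixed text is to treat low-confidence predictions as uncertain. The sketch below assumes results in the pipeline format shown above; the function name and the 0.8 threshold are illustrative choices, not part of the library.

```python
def label_with_uncertainty(result, threshold=0.8):
    """Return the predicted label, or 'UNCERTAIN' if the score is below the threshold."""
    return result["label"] if result["score"] >= threshold else "UNCERTAIN"

# Hand-written examples in the pipeline's output format (not real model output).
sample_results = [
    {"label": "POSITIVE", "score": 0.9998},
    {"label": "NEGATIVE", "score": 0.62},
]

for r in sample_results:
    print(label_with_uncertainty(r))
```

The right threshold depends on the application, so it is worth checking a few borderline texts before settling on a value.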

### Batch Processing
- Processing 30 texts should take:
  - **CPU**: 2-5 seconds
  - **GPU**: 0.5-1 second
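
To compare these estimates with your own machine, you can time a batch with `time.perf_counter`. The sketch below uses a stand-in function so it runs without downloading a model; `time_batch` and `fake_classifier` are hypothetical names, and in the notebook you would pass the `classifier` pipeline created in Method 1 instead.

```python
import time

def time_batch(classify, texts):
    """Run classify(texts) once and return (results, elapsed_seconds)."""
    start = time.perf_counter()
    results = classify(texts)
    elapsed = time.perf_counter() - start
    return results, elapsed

# Stand-in for the real pipeline so this sketch is self-contained.
def fake_classifier(texts):
    return [{"label": "POSITIVE", "score": 1.0} for _ in texts]

texts = ["example text"] * 30
results, elapsed = time_batch(fake_classifier, texts)
print(f"Classified {len(results)} texts in {elapsed:.4f} s")
```

The first call to a real model is usually slower than later calls, so it helps to run one warm-up batch before timing.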

### Zero-Shot Classification
- Downloads larger model (~1.6GB for bart-large-mnli)
- Can classify into any categories you provide
- No additional training needed!

### Common Observations
- Very positive/negative texts get 95%+ confidence
- Neutral texts often get lower confidence, roughly 50-80% (acceptable, since the model has no neutral label)
- Emojis and exclamation marks influence predictions
- Model handles typos reasonably well

## Setup and Installation

In [None]:
# Import required libraries
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline, set_seed
import warnings
warnings.filterwarnings('ignore')

# Set seed for reproducibility
set_seed(1103)

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

## Model Selection

Choose one of the following models based on your hardware:

In [None]:
# CHOOSE YOUR MODEL:

# Option 1: CPU-friendly (recommended for beginners)
MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"  # 268MB, sentiment analysis

# Option 2: GPU-optimized (uncomment if you have RTX 4080 or similar)
# MODEL_NAME = "bert-base-uncased"  # 440MB, needs fine-tuning for specific tasks
# Note: bert-base-uncased is a base model; for direct classification, use fine-tuned variants

print(f"Selected model: {MODEL_NAME}")

## Method 1: Using Pipeline (Simplest)

The `pipeline` API provides an easy interface for sentiment analysis.

In [None]:
# Create a sentiment analysis pipeline
print(f"Loading {MODEL_NAME}...")
classifier = pipeline(
    "sentiment-analysis",
    model=MODEL_NAME,
    device=0 if torch.cuda.is_available() else -1  # GPU 0 if available, otherwise CPU
)
print("Model loaded successfully!")

### Basic Sentiment Analysis

In [None]:
# Classify a single text
text = "I absolutely love this product! It exceeded all my expectations."

result = classifier(text)

print(f"Text: {text}")
print(f"Prediction: {result[0]['label']}")
print(f"Confidence: {result[0]['score']:.4f}")

### Batch Classification

In [None]:
# Classify multiple texts at once (more efficient)
texts = [
    "This is the worst experience I've ever had.",
    "The movie was okay, nothing special.",
    "Absolutely fantastic! Highly recommend!",
    "I'm not sure how I feel about this.",
    "Terrible service and poor quality."
]

results = classifier(texts)

print("\n=== Batch Sentiment Analysis ===")
for text, result in zip(texts, results):
    print(f"\n{result['label']} ({result['score']:.4f})")
    print(f"   Text: {text}")

## Method 2: Using Model and Tokenizer Directly (Advanced)

For more control and understanding, load components separately.

In [None]:
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

# Move to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

print(f"Model loaded on: {device}")
print(f"Number of labels: {model.config.num_labels}")
print(f"Label mapping: {model.config.id2label}")

In [None]:
# Classify with detailed output
import torch.nn.functional as F

text = "The customer support was incredibly helpful and responsive."

# Tokenize
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(device)

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    probabilities = F.softmax(logits, dim=-1)

# Get predicted class
predicted_class = torch.argmax(probabilities, dim=-1).item()
confidence = probabilities[0][predicted_class].item()

print(f"Text: {text}")
print(f"\nPredicted class: {model.config.id2label[predicted_class]}")
print(f"Confidence: {confidence:.4f}")
print(f"\nAll probabilities:")
for idx, prob in enumerate(probabilities[0]):
    print(f"  {model.config.id2label[idx]}: {prob:.4f}")

## Practical Applications

### Example 1: Product Review Analysis

In [None]:
# Analyze product reviews
reviews = [
    "Great product! Works exactly as described.",
    "Disappointed with the quality. Would not recommend.",
    "Decent for the price, but could be better.",
    "Exceeded my expectations! Will buy again.",
    "Arrived damaged and customer service was unhelpful."
]

results = classifier(reviews)

# Calculate statistics
positive_count = sum(1 for r in results if r['label'] == 'POSITIVE')
negative_count = len(results) - positive_count

print("=== Product Review Analysis ===")
print(f"\nTotal reviews: {len(reviews)}")
print(f"Positive: {positive_count} ({positive_count/len(reviews)*100:.1f}%)")
print(f"Negative: {negative_count} ({negative_count/len(reviews)*100:.1f}%)")

print("\n=== Detailed Results ===")
for review, result in zip(reviews, results):
    print(f"\n{result['label']} ({result['score']:.3f}): {review}")
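
The counts above can be extended with an average confidence per label, which helps show whether one class is being predicted with less certainty than the other. This is a sketch that works on any list in the pipeline's output format; `summarize` is a hypothetical helper and the sample scores are hand-written, not real model output.

```python
from collections import Counter, defaultdict

def summarize(results):
    """Return count, share, and mean score for each predicted label."""
    counts = Counter(r["label"] for r in results)
    score_sums = defaultdict(float)
    for r in results:
        score_sums[r["label"]] += r["score"]
    return {
        label: {
            "count": n,
            "share": n / len(results),
            "mean_score": score_sums[label] / n,
        }
        for label, n in counts.items()
    }

# Hand-written results for illustration only.
sample_results = [
    {"label": "POSITIVE", "score": 0.99},
    {"label": "NEGATIVE", "score": 0.98},
    {"label": "POSITIVE", "score": 0.70},
    {"label": "POSITIVE", "score": 0.95},
    {"label": "NEGATIVE", "score": 0.90},
]

for label, stats in summarize(sample_results).items():
    print(f"{label}: {stats['count']} ({stats['share']:.0%}), mean score {stats['mean_score']:.2f}")
```

In the notebook, calling `summarize(results)` right after the cell above would apply it to the real review predictions.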

### Example 2: Social Media Monitoring

In [None]:
# Analyze social media posts
posts = [
    "Just launched our new feature! So excited to share this with everyone! 🚀",
    "Another day, another bug. This is getting frustrating.",
    "Thanks for the amazing support team! Issue resolved quickly.",
    "Can't believe how slow the app has become lately.",
    "Love the new update! Everything runs so smoothly now."
]

results = classifier(posts)

print("=== Social Media Sentiment ===")
for post, result in zip(posts, results):
    print(f"\n[{result['score']:.2f}] {result['label']}: {post}")

### Example 3: Interactive Sentiment Checker

In [None]:
def analyze_sentiment(text):
    """
    Analyze sentiment with detailed feedback.
    """
    result = classifier(text)[0]
    
    # Interpret confidence
    confidence = result['score']
    if confidence > 0.9:
        strength = "Very confident"
    elif confidence > 0.7:
        strength = "Confident"
    else:
        strength = "Uncertain"
    
    print(f"\nText: {text}")
    print(f"Sentiment: {result['label']}")
    print(f"Confidence: {confidence:.4f} ({strength})")
    
    return result

# Test with different texts
test_texts = [
    "I'm having the best day ever!",
    "This is completely unacceptable.",
    "It's fine, I guess."
]

for text in test_texts:
    analyze_sentiment(text)

## Exploring Other Classification Tasks

Hugging Face hosts models for many classification tasks beyond sentiment analysis. Zero-shot classification is one example:

In [None]:
# Zero-shot classification (classify without training on specific labels)
from transformers import pipeline

zero_shot_classifier = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli",
    device=0 if torch.cuda.is_available() else -1
)

text = "I love playing basketball and watching NBA games."
candidate_labels = ["sports", "technology", "politics", "entertainment"]

result = zero_shot_classifier(text, candidate_labels)

print(f"Text: {text}\n")
print("Classification results:")
for label, score in zip(result['labels'], result['scores']):
    print(f"  {label}: {score:.4f}")
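
The zero-shot pipeline returns a dictionary with `labels` and `scores` sorted from most to least likely. A small helper can keep only the labels above a cutoff. The sketch below runs on a hand-written dictionary in that format (the scores are invented for illustration), and `labels_above` with its 0.5 default is a hypothetical choice.

```python
def labels_above(result, min_score=0.5):
    """Return (label, score) pairs whose score is at least min_score."""
    return [
        (label, score)
        for label, score in zip(result["labels"], result["scores"])
        if score >= min_score
    ]

# Hand-written example in the zero-shot output format (not real model output).
sample = {
    "sequence": "I love playing basketball and watching NBA games.",
    "labels": ["sports", "entertainment", "technology", "politics"],
    "scores": [0.90, 0.07, 0.02, 0.01],
}

print(labels_above(sample))
```

By default the scores are normalized across the candidate labels, so they sum to 1. Passing `multi_label=True` when calling the pipeline scores each label independently, which suits texts that can belong to several categories at once.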

## Performance Benchmarking

In [None]:
# Using the SST-2 dataset (Stanford Sentiment Treebank)
# Dataset size: ~7MB, 67k training examples, 872 validation examples
from datasets import load_dataset

print("Loading SST-2 dataset...")
dataset = load_dataset("sst2", split="validation")

# Test on a few examples from the dataset
# Slicing the dataset first returns a plain Python list for each column
sample = dataset[:5]
sample_texts = sample['sentence']
sample_labels = sample['label']  # 0=negative, 1=positive

print(f"Loaded {len(dataset)} validation examples\n")

results = classifier(sample_texts)

print("=== SST-2 Dataset Classification ===")
for i, (text, true_label, pred) in enumerate(zip(sample_texts, sample_labels, results)):
    true_sentiment = "POSITIVE" if true_label == 1 else "NEGATIVE"
    match = "✓" if pred['label'] == true_sentiment else "✗"
    
    print(f"\n{match} Example {i+1}:")
    print(f"   Text: {text}")
    print(f"   True label: {true_sentiment}")
    print(f"   Predicted: {pred['label']} ({pred['score']:.4f})")
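
The cell above checks individual examples by eye. To turn the comparison into a single number, accuracy can be computed as the share of predictions that match the true labels. This sketch assumes the SST-2 convention noted above (0=negative, 1=positive); `accuracy` is a hypothetical helper and the sample data is hand-written.

```python
def accuracy(predictions, true_labels, label2id=None):
    """Return the fraction of predictions whose label matches the true label id."""
    # Assumed mapping, following the SST-2 convention used in this notebook.
    label2id = label2id or {"NEGATIVE": 0, "POSITIVE": 1}
    correct = sum(
        label2id[pred["label"]] == true
        for pred, true in zip(predictions, true_labels)
    )
    return correct / len(true_labels)

# Hand-written predictions and labels for illustration only.
sample_predictions = [
    {"label": "POSITIVE", "score": 0.99},
    {"label": "NEGATIVE", "score": 0.97},
    {"label": "POSITIVE", "score": 0.61},
    {"label": "NEGATIVE", "score": 0.88},
]
sample_true = [1, 0, 0, 0]

print(f"Accuracy: {accuracy(sample_predictions, sample_true):.2%}")
```

In the notebook, `accuracy(results, sample_labels)` would score the five examples above. Running the classifier over all 872 validation examples gives a more meaningful figure but takes longer.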

## Exercises

Try these challenges to deepen your understanding:

1. **Custom Dataset**: Create your own list of texts and analyze their sentiment. Calculate the percentage of positive vs negative.

2. **Confidence Threshold**: Filter results to only show predictions with confidence > 0.8. How many are left?

3. **Multi-class Classification**: Try using a different model like `cardiffnlp/twitter-roberta-base-emotion` for emotion detection (joy, sadness, anger, etc.)

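4. **Comparison**: Compare results from `distilbert` vs `bert-base-uncased` (if you have GPU). Are there differences? Keep in mind that `bert-base-uncased` has no fine-tuned classification head, so its predictions are close to random until it is fine-tuned.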

5. **Real Data**: If you have access to real reviews or tweets, analyze them with the model.

In [None]:
# Your code here for exercises


## State-of-the-Art Open Models (Not Covered)

While this notebook uses DistilBERT and BERT for educational purposes, here are **state-of-the-art open-source classification models** you should know about:

### Large Classification Models

**🤖 RoBERTa** (Facebook/Meta)
- Robustly Optimized BERT approach
- Outperforms BERT on most benchmarks
- Sizes: base (125M), large (355M)
- [Model Card](https://huggingface.co/roberta-base) | [Paper](https://arxiv.org/abs/1907.11692)
- Note: Larger and slower than DistilBERT

**🎯 DeBERTa** (Microsoft)
- Decoding-enhanced BERT with disentangled attention
- State-of-the-art on many NLU benchmarks
- Sizes: base (140M), large (350M), XLarge (900M), V3-large (434M)
- [Model Card](https://huggingface.co/microsoft/deberta-v3-base) | [Paper](https://arxiv.org/abs/2006.03654)
- Excellent for: High-accuracy classification tasks

**⚡ ELECTRA** (Google)
- More efficient pre-training method
- Better performance with less compute
- [Model Card](https://huggingface.co/google/electra-base-discriminator) | [Paper](https://arxiv.org/abs/2003.10555)

### Specialized Classification Models

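**🐦 Twitter-RoBERTa** (Cardiff NLP)
- Pre-trained on tweets (~58M for the original models, ~124M for the `-latest` variant linked below), then fine-tuned on TweetEval
- Excellent for social media sentiment
- A separate variant, `cardiffnlp/twitter-roberta-base-emotion`, supports emotion detection (joy, sadness, anger, optimism)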
- [Model Card](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest)

**📰 FinBERT** (ProsusAI)
- Specialized for financial sentiment analysis
- Trained on financial news and reports
- [Model Card](https://huggingface.co/ProsusAI/finbert)

**🏥 BioBERT** (DMIS Lab)
- Pre-trained on biomedical literature
- Best for medical/scientific text classification
- [Model Card](https://huggingface.co/dmis-lab/biobert-v1.1)

**⚖️ Legal-BERT** (AUEB NLP Group)
- Specialized for legal document analysis
- Trained on legal corpora
- [Model Card](https://huggingface.co/nlpaueb/legal-bert-base-uncased)

### Modern Instruction-Following Models

**🦙 Llama-based Classifiers**
- Fine-tuned Llama 2/3 for classification
- Can handle complex, nuanced classification tasks
- Example: [Llama-2-7b-chat-hf with classification adapters](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)
- Note: Requires 16GB+ GPU

**🌊 Mistral-based Classifiers**
- Efficient 7B parameter classifiers
- Excellent instruction-following capabilities
- Can perform zero-shot classification via prompting
- [Model Card](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)

### Why Not Covered Here?

These models require:
- **More compute**: RoBERTa-large needs 8GB+ VRAM
- **Slower inference**: Larger models take longer
- **Domain-specific data**: Some need fine-tuning for your use case
- **Advanced techniques**: May require knowledge of fine-tuning
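
A rough way to see where the compute requirements come from is to estimate the memory needed just to hold the weights: parameter count times bytes per parameter. The sketch below uses the parameter counts listed in this section and assumes 4 bytes per parameter for fp32 and 2 bytes for fp16. It is a lower bound only, since activations, batch size, and framework overhead add more on top.

```python
def weight_memory_gb(num_params, bytes_per_param=4):
    """Estimate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

# Parameter counts as listed in this section; 7e9 stands for a 7B model.
models = [
    ("RoBERTa-base", 125e6),
    ("RoBERTa-large", 355e6),
    ("7B instruction model", 7e9),
]

for name, params in models:
    fp32 = weight_memory_gb(params, bytes_per_param=4)
    fp16 = weight_memory_gb(params, bytes_per_param=2)
    print(f"{name}: ~{fp32:.2f} GB in fp32, ~{fp16:.2f} GB in fp16")
```

For a 7B model this gives about 14 GB in fp16 for the weights alone, which is in line with the 16GB+ GPU note above.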

**Learning Path**:
1. ✅ Start with DistilBERT (this notebook) - fast, accurate baseline
2. Try RoBERTa or DeBERTa for better accuracy (if you have GPU)
3. Use domain-specific models (Twitter-RoBERTa, FinBERT) for specialized tasks
4. Fine-tune your own classifier (see Notebook 13) for custom categories

### Benchmarks & Leaderboards

- [GLUE Benchmark](https://gluebenchmark.com/leaderboard) - General Language Understanding
- [SuperGLUE](https://super.gluebenchmark.com/) - More challenging NLU tasks
- [Papers with Code - Text Classification](https://paperswithcode.com/task/text-classification)

### Practical Recommendations

| Use Case | Recommended Model | Why |
|----------|------------------|-----|
| General sentiment | DistilBERT (this notebook) | Fast, accurate, easy |
| High accuracy needed | DeBERTa-V3 | Best performance |
| Social media analysis | Twitter-RoBERTa | Domain-specific |
| Financial text | FinBERT | Specialized vocabulary |
| Limited resources | DistilBERT, ELECTRA-small | Efficient |
| Production deployment | RoBERTa-base, DeBERTa-base | Good balance |

## Key Takeaways

✅ **Text classification** assigns predefined labels to text

✅ **Pre-trained models** work well out-of-the-box for common tasks

✅ **Batch processing** is more efficient than processing one at a time

✅ **Confidence scores** indicate model certainty

✅ **Zero-shot classification** works without task-specific training

## Next Steps

- Try **Notebook 03**: Text Summarization
- Explore [HuggingFace Models](https://huggingface.co/models?pipeline_tag=text-classification) for more classification models
- Learn about fine-tuning models on custom datasets

## Resources

- [Text Classification Guide](https://huggingface.co/docs/transformers/tasks/sequence_classification)
- [Sentiment Analysis Tutorial](https://huggingface.co/blog/sentiment-analysis-python)
- [Zero-Shot Classification](https://huggingface.co/tasks/zero-shot-classification)