# 🤖 Advanced Transformers: BERT, RoBERTa & Beyond

**Author**: Data Science Master System  
**Difficulty**: ⭐⭐⭐ Advanced  
**Time**: 60 minutes  
**Prerequisites**: 16_nlp_sentiment_analysis

## Learning Objectives
- Understand the transformer architecture
- Fine-tune BERT for classification
- Compare BERT variants
- Weigh speed and accuracy trade-offs for production deployment

In [None]:
import torch
import numpy as np

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Device: {device}")

## 1. Load Pretrained BERT

In [None]:
try:
    from transformers import AutoModelForSequenceClassification, AutoTokenizer
    
    model_name = 'bert-base-uncased'
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
    
    print(f"✅ {model_name} loaded")
    print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
    
except ImportError:
    print("Install: pip install transformers")
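
A sequence-classification model like the one loaded above returns one raw score (logit) per label rather than a probability. The sketch below shows the usual softmax step that turns those scores into probabilities. The logit values and the `[negative, positive]` label order are hypothetical, chosen only for illustration; note also that the classification head of a freshly loaded `bert-base-uncased` is randomly initialized, so its outputs are only meaningful after fine-tuning.

```python
import math


def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one input, in the assumed label order [negative, positive].
example_logits = [-1.2, 2.3]
probs = softmax(example_logits)

# The predicted label is the index of the largest probability.
predicted_label = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted_label)
```

With real model output, the same step is typically done on the tensor directly, for example with `torch.softmax(outputs.logits, dim=-1)`.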

## 2. Tokenization

In [None]:
text = "Transformers have revolutionized NLP!"

tokens = tokenizer(text, padding=True, truncation=True, return_tensors='pt')

print("üî§ Tokenization:")
print(f"  Input: {text}")
print(f"  Token IDs: {tokens['input_ids'][0][:10].tolist()}...")
print(f"  Tokens: {tokenizer.convert_ids_to_tokens(tokens['input_ids'][0][:10])}")
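
BERT's tokenizer splits words it does not know into subword pieces, marking continuation pieces with `##`. The sketch below illustrates the greedy longest-match-first idea behind WordPiece on a tiny, made-up vocabulary. The `wordpiece` function and `toy_vocab` are hypothetical helpers for illustration, not the library's implementation, and the real `bert-base-uncased` vocabulary (about 30,000 entries) may split these words differently.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into subwords."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (an assumption for this example only).
toy_vocab = {"transform", "##ers", "##er", "##s", "nl", "##p", "have"}

print(wordpiece("transformers", toy_vocab))
print(wordpiece("nlp", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```

Because the match is greedy, `##ers` wins over `##er` followed by `##s` whenever the longer piece is in the vocabulary.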

## 3. Fine-tuning

In [None]:
training_code = '''
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,
    warmup_steps=500,
    weight_decay=0.01,
    evaluation_strategy='epoch'  # renamed to eval_strategy in transformers >= 4.41
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)

trainer.train()
'''
print("üìã Fine-tuning Code:")
print(training_code)
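
To make the `learning_rate=2e-5` and `warmup_steps=500` settings concrete, here is a minimal sketch of a linear warmup followed by linear decay, which is the shape of the `Trainer`'s default `linear` scheduler. The function name `linear_warmup_lr` and the `total_steps=3000` figure are assumptions for illustration (for instance, 16,000 training examples at batch size 16 for 3 epochs); the actual step count depends on your dataset.

```python
def linear_warmup_lr(step, peak_lr=2e-5, warmup_steps=500, total_steps=3000):
    """Learning rate at a given optimizer step: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        # Ramp up from 0 to peak_lr over the warmup period.
        return peak_lr * step / max(1, warmup_steps)
    # Decay from peak_lr at the end of warmup down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 250, 500, 1750, 3000):
    print(step, linear_warmup_lr(step))
```

One practical consequence: if the whole run has fewer steps than `warmup_steps`, the learning rate never reaches its peak, so on small datasets a shorter warmup may be worth trying.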

## 4. Model Comparison

In [None]:
import pandas as pd

comparison = pd.DataFrame({
    'Model': ['BERT-base', 'BERT-large', 'RoBERTa', 'DistilBERT', 'ALBERT', 'DeBERTa'],
    'Params': ['110M', '340M', '125M', '66M', '12M', '134M'],
    'Speed': ['1x', '0.3x', '1x', '2x', '1.5x', '0.8x'],
    'GLUE': ['82.1', '85.2', '86.4', '79.0', '84.1', '88.8']
})

print("üìä Transformer Models:")
display(comparison)
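
The table above can also drive a simple selection rule. The sketch below picks the highest-GLUE model that meets a minimum relative speed, using the same figures as the table. The helper names `parse_speed` and `best_model` are hypothetical, and the rows are repeated as plain dictionaries so the example runs on its own.

```python
def parse_speed(speed):
    """Turn a relative-speed string such as '1.5x' into a float."""
    return float(speed.rstrip("x"))


def best_model(rows, min_speed):
    """Return the name of the highest-GLUE model at least `min_speed` times as fast as BERT-base."""
    eligible = [r for r in rows if parse_speed(r["Speed"]) >= min_speed]
    if not eligible:
        return None  # nothing in the table is fast enough
    return max(eligible, key=lambda r: float(r["GLUE"]))["Model"]


# Same values as the comparison table above.
rows = [
    {"Model": "BERT-base", "Speed": "1x", "GLUE": "82.1"},
    {"Model": "BERT-large", "Speed": "0.3x", "GLUE": "85.2"},
    {"Model": "RoBERTa", "Speed": "1x", "GLUE": "86.4"},
    {"Model": "DistilBERT", "Speed": "2x", "GLUE": "79.0"},
    {"Model": "ALBERT", "Speed": "1.5x", "GLUE": "84.1"},
    {"Model": "DeBERTa", "Speed": "0.8x", "GLUE": "88.8"},
]

for min_speed in (0.0, 1.0, 1.5, 2.0):
    print(min_speed, best_model(rows, min_speed))
```

Inside the notebook, the same rows can be obtained from the existing DataFrame with `comparison.to_dict('records')`.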

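## 🎯 Key Takeaways
1. Use DistilBERT when speed matters: about 2x faster than BERT-base at 66M parameters.
2. Use RoBERTa or DeBERTa when accuracy matters most.
3. Start fine-tuning with a learning rate around 2e-5.
4. 2-4 epochs are usually sufficient.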

**Next**: 18_nlp_text_generation.ipynb