# 🗞️ Fine-tuning RoBERTa on AG News Dataset

---

## 📚 What You'll Learn

In this notebook, we'll fine-tune **FacebookAI/roberta-base** on the **fancyzhx/ag_news** dataset for news topic classification. By the end of this notebook, you'll understand:

1. **RoBERTa vs BERT** - Understanding the key differences
2. **AG News Dataset** - A 4-class news topic classification dataset
3. **Full Dataset Training** - Training on the complete dataset (120,000 samples)
4. **Multiclass Classification** - Classifying into World, Sports, Business, and Sci/Tech
5. **Inference Pipeline** - Using the fine-tuned model for predictions

---

## 🤖 About RoBERTa

**RoBERTa** (Robustly Optimized BERT Pretraining Approach) is an improved version of BERT developed by Facebook AI. Key improvements include:

| Aspect | BERT | RoBERTa |
|--------|------|----------|
| Training Data | 16GB | 160GB |
| Training Steps | 1M | Up to 500K (with far larger batches) |
| Next Sentence Prediction | ✅ Used | ❌ Removed |
| Masking | Static (fixed at preprocessing) | Dynamic (resampled each time a sequence is fed) |
| Batch Size | 256 | 8K |

In the original paper, RoBERTa outperformed BERT on GLUE, SQuAD, and RACE.
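To make the static versus dynamic masking distinction concrete, here is a minimal pure-Python sketch. It is a simplification for illustration only: `mask_tokens` is a hypothetical helper, the sentence is made up, and real pretraining also applies a keep/replace scheme that is omitted here.

```python
import random

MASK = "<mask>"

def mask_tokens(tokens, rng, mask_prob=0.15):
    """Replace roughly mask_prob of the tokens with <mask> (at least one)."""
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]

# Hypothetical sentence, split on whitespace for simplicity.
tokens = "stocks rallied after the central bank held interest rates steady on Tuesday".split()

# Static masking: the mask is sampled once during preprocessing and reused.
static_example = mask_tokens(tokens, random.Random(0))
static_epochs = [static_example for _ in range(3)]

# Dynamic masking: a fresh mask is sampled every time the sequence is served.
rng = random.Random(0)
dynamic_epochs = [mask_tokens(tokens, rng) for _ in range(3)]

for epoch, (s, d) in enumerate(zip(static_epochs, dynamic_epochs), start=1):
    print(f"epoch {epoch} static : {' '.join(s)}")
    print(f"epoch {epoch} dynamic: {' '.join(d)}")
```

With static masking the model sees the same masked positions on every pass, while dynamic masking lets it see different positions across passes over the same text.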

---

## 📰 About AG News Dataset

The AG News dataset is a collection of news articles for topic classification:

- **4 Classes**: World (0), Sports (1), Business (2), Sci/Tech (3)
- **Training Samples**: 120,000
- **Test Samples**: 7,600
- **Balanced**: 30,000 samples per class in training set

---

## 🛠️ Setup & Installation

Let's start by installing and importing the necessary libraries.

In [1]:
# Install required packages (uncomment if needed)
# !pip install transformers datasets torch accelerate evaluate -q

In [2]:
# Import essential libraries
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
    DataCollatorWithPadding
)
import evaluate
import numpy as np
import torch
import warnings
warnings.filterwarnings('ignore')

print("✅ All libraries imported successfully!")

✅ All libraries imported successfully!


In [3]:
# Check for GPU availability
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"🚀 Using CUDA GPU: {torch.cuda.get_device_name(0)}")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
    print("🍎 Using Apple Silicon MPS")
else:
    device = torch.device("cpu")
    print("💻 Using CPU (training will be slower)")

print(f"Device selected: {device}")

🚀 Using CUDA GPU: 
Device selected: cuda


---

## 📊 Part 1: Loading the AG News Dataset

We'll load the full AG News dataset from the `fancyzhx/ag_news` repository on Hugging Face.

In [4]:
# Load the AG News dataset
print("📦 Loading AG News dataset...")
ag_news_dataset = load_dataset("fancyzhx/ag_news")

print("\n📊 AG News Dataset Structure:")
print(ag_news_dataset)

📦 Loading AG News dataset...


📊 AG News Dataset Structure:
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 7600
    })
})


In [5]:
# Define the label names for AG News
label_names = {
    0: "🌍 World",
    1: "⚽ Sports",
    2: "💼 Business",
    3: "🔬 Sci/Tech"
}

# Examine a sample from each class
print("📰 Sample News Articles:")
print("=" * 70)

for label_id in range(4):
    # Find a sample with this label
    for sample in ag_news_dataset['train']:
        if sample['label'] == label_id:
            print(f"\n{label_names[label_id]}:")
            print(f"   {sample['text'][:150]}...")
            break

📰 Sample News Articles:

🌍 World:
   Venezuelans Vote Early in Referendum on Chavez Rule (Reuters) Reuters - Venezuelans turned out early\and in large numbers on Sunday to vote in a histo...

⚽ Sports:
   Phelps, Thorpe Advance in 200 Freestyle (AP) AP - Michael Phelps took care of qualifying for the Olympic 200-meter freestyle semifinals Sunday, and th...

💼 Business:
   Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again....

🔬 Sci/Tech:
   'Madden,' 'ESPN' Football Score in Different Ways (Reuters) Reuters - Was absenteeism a little high\on Tuesday among the guys at the office? EA Sports...


In [6]:
# Check the label distribution
from collections import Counter

train_labels = ag_news_dataset['train']['label']
label_counts = Counter(train_labels)

print("📈 Label Distribution in Training Set:")
print("=" * 50)
for label in sorted(label_counts.keys()):
    count = label_counts[label]
    bar = "█" * (count // 1000)
    print(f"   {label_names[label]}: {count:,} {bar}")

print(f"\n   Total training samples: {len(train_labels):,}")
print(f"   Total test samples: {len(ag_news_dataset['test']):,}")

📈 Label Distribution in Training Set:
   🌍 World: 30,000 ██████████████████████████████
   ⚽ Sports: 30,000 ██████████████████████████████
   💼 Business: 30,000 ██████████████████████████████
   🔬 Sci/Tech: 30,000 ██████████████████████████████

   Total training samples: 120,000
   Total test samples: 7,600


---

## 🔤 Part 2: Tokenization

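Now we'll tokenize the dataset using RoBERTa's tokenizer. RoBERTa uses **byte-level Byte-Pair Encoding (BPE)**, the same scheme as GPT-2, so any input string can be split into known subword tokens without falling back to an unknown token.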

In [7]:
# Define the model checkpoint
MODEL_CHECKPOINT = "FacebookAI/roberta-base"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)

print(f"✅ Loaded tokenizer for: {MODEL_CHECKPOINT}")
print(f"   Vocabulary size: {tokenizer.vocab_size:,} tokens")
print(f"   Model max length: {tokenizer.model_max_length}")

✅ Loaded tokenizer for: FacebookAI/roberta-base
   Vocabulary size: 50,265 tokens
   Model max length: 512


In [8]:
# Let's see tokenization in action
sample_text = "Tech giant Apple unveils new iPhone with AI-powered features at annual conference."

# Tokenize the sample
tokens = tokenizer.tokenize(sample_text)
token_ids = tokenizer.encode(sample_text)

print("🔤 Tokenization Example:")
print(f"   Original: {sample_text}")
print(f"\n   Tokens: {tokens}")
print(f"\n   Token IDs: {token_ids}")
print(f"\n   Number of tokens: {len(tokens)}")

🔤 Tokenization Example:
   Original: Tech giant Apple unveils new iPhone with AI-powered features at annual conference.

   Tokens: ['Tech', 'Ġgiant', 'ĠApple', 'Ġunve', 'ils', 'Ġnew', 'ĠiPhone', 'Ġwith', 'ĠAI', '-', 'powered', 'Ġfeatures', 'Ġat', 'Ġannual', 'Ġconference', '.']

   Token IDs: [0, 14396, 3065, 1257, 36685, 5290, 92, 2733, 19, 4687, 12, 10711, 1575, 23, 1013, 1019, 4, 2]

   Number of tokens: 16
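The output lists 16 tokens but 18 token IDs. A short sketch, using only the values printed above, shows where the difference comes from and what the `Ġ` space marker encodes. The simple string replacement below is an assumption that holds for this ASCII sentence; it is not how the tokenizer decodes arbitrary text.

```python
# Tokens and IDs copied from the output above.
# "Ġ" marks a token that begins with a space in the original text.
tokens = ['Tech', 'Ġgiant', 'ĠApple', 'Ġunve', 'ils', 'Ġnew', 'ĠiPhone', 'Ġwith',
          'ĠAI', '-', 'powered', 'Ġfeatures', 'Ġat', 'Ġannual', 'Ġconference', '.']
token_ids = [0, 14396, 3065, 1257, 36685, 5290, 92, 2733, 19, 4687, 12, 10711,
             1575, 23, 1013, 1019, 4, 2]

# Undo the space marker to recover the original string.
text = "".join(tokens).replace("Ġ", " ")
print(text)

# encode() wraps the sequence in <s> (ID 0) and </s> (ID 2),
# which accounts for the two extra IDs.
print(len(tokens), len(token_ids))
print(token_ids[0], token_ids[-1])
```

In practice `tokenizer.decode(token_ids, skip_special_tokens=True)` is the supported way to map IDs back to text.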


In [9]:
# Define the tokenization function
def tokenize_function(examples):
    """
    Tokenizes the text with truncation.
    
    - truncation=True: Cuts longer texts to max_length
    - max_length=256: Maximum sequence length
    
    Note: We'll use DataCollatorWithPadding for dynamic padding
          which is more efficient than padding='max_length'
    """
    return tokenizer(
        examples['text'],
        truncation=True,
        max_length=256
    )

# Apply tokenization to the entire dataset
print("⏳ Tokenizing dataset... (this may take a minute)")
tokenized_dataset = ag_news_dataset.map(
    tokenize_function, 
    batched=True,
    remove_columns=['text']  # Remove original text to save memory
)
print("✅ Tokenization complete!")

# View the new structure
print("\n📊 Tokenized Dataset Structure:")
print(tokenized_dataset)

⏳ Tokenizing dataset... (this may take a minute)


✅ Tokenization complete!

📊 Tokenized Dataset Structure:
DatasetDict({
    train: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 7600
    })
})
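The tokenized dataset stores unpadded sequences, which is what makes dynamic padding possible. As a rough illustration of why that is cheaper than `padding='max_length'`, the sketch below pads a small batch both ways. `pad_batch` is a hypothetical helper and the token IDs are made up; the real work is done by `DataCollatorWithPadding`.

```python
def pad_batch(batch, pad_id=1, target_length=None):
    """Right-pad every sequence to target_length (default: longest in the batch)."""
    length = target_length or max(len(seq) for seq in batch)
    return [seq + [pad_id] * (length - len(seq)) for seq in batch]

# Hypothetical token-ID sequences of different lengths.
batch = [[0, 11, 12, 2], [0, 21, 22, 23, 24, 25, 2], [0, 31, 2]]

dynamic = pad_batch(batch)                   # pad to the longest sequence (7)
fixed = pad_batch(batch, target_length=256)  # pad everything to max_length

dynamic_slots = sum(len(seq) for seq in dynamic)
fixed_slots = sum(len(seq) for seq in fixed)
print(dynamic_slots, fixed_slots)  # 21 768
```

Since the first training example is only 39 tokens long, most AG News batches are likely to sit well below 256 tokens, so the saving can be substantial.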


In [10]:
# Examine a tokenized example
example = tokenized_dataset['train'][0]

print("🔍 Tokenized Example:")
print(f"   Keys: {example.keys()}")
print(f"   Input IDs length: {len(example['input_ids'])}")
print(f"   Attention mask length: {len(example['attention_mask'])}")
print(f"   Label: {example['label']} ({label_names[example['label']]})")

🔍 Tokenized Example:
   Keys: dict_keys(['label', 'input_ids', 'attention_mask'])
   Input IDs length: 39
   Attention mask length: 39
   Label: 2 (💼 Business)


---

## 🧠 Part 3: Setting Up the Model

Now we load the pretrained RoBERTa model and configure it for our 4-class classification task.

In [11]:
# Define id2label and label2id mappings so the saved config and model card use readable label names
id2label = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}
label2id = {"World": 0, "Sports": 1, "Business": 2, "Sci/Tech": 3}

# Load the model for sequence classification
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_CHECKPOINT,
    num_labels=4,  # 4-class classification
    id2label=id2label,
    label2id=label2id
)

# Move model to the appropriate device
model.to(device)

print(f"✅ Model loaded and moved to {device}")
print(f"   Model type: {type(model).__name__}")
print(f"   Number of parameters: {model.num_parameters():,}")
print(f"   Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at FacebookAI/roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✅ Model loaded and moved to cuda
   Model type: RobertaForSequenceClassification
   Number of parameters: 124,648,708
   Trainable parameters: 124,648,708


---

## 📐 Part 4: Setting Up Evaluation Metrics

We'll use accuracy as the primary evaluation metric and also report the weighted F1 score. Because AG News is class-balanced, the two values should track each other closely.
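To make the two metrics concrete, here is a small pure-Python sketch of what accuracy and weighted F1 measure. The helper functions and the toy labels are assumptions for illustration; the notebook itself relies on the `evaluate` library.

```python
from collections import Counter

def accuracy(preds, refs):
    """Fraction of predictions that match the reference labels."""
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)

def weighted_f1(preds, refs):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(refs)
    total = 0.0
    for cls, n in support.items():
        tp = sum(p == cls and r == cls for p, r in zip(preds, refs))
        fp = sum(p == cls and r != cls for p, r in zip(preds, refs))
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n
    return total / len(refs)

# Toy example: one World article (class 0) is mistaken for Sports (class 1).
refs = [0, 0, 1, 1, 2, 3]
preds = [0, 1, 1, 1, 2, 3]

acc = accuracy(preds, refs)
f1 = weighted_f1(preds, refs)
print(f"accuracy={acc:.4f}  weighted_f1={f1:.4f}")
```

A single mistake lowers the F1 of both the true class (a missed example) and the predicted class (a false positive), which is why F1 can differ from accuracy even on small samples.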

In [14]:
# !pip install scikit-learn -q

In [15]:
# Load evaluation metrics
accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")

def compute_metrics(eval_pred):
    """
    Computes accuracy and F1 score from predictions.
    """
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)
    f1 = f1_metric.compute(predictions=predictions, references=labels, average='weighted')
    
    return {
        'accuracy': accuracy['accuracy'],
        'f1': f1['f1']
    }

print("✅ Evaluation metrics configured!")

✅ Evaluation metrics configured!


---

## ⚙️ Part 5: Training Configuration

Let's set up our training parameters for 3 epochs on the full dataset.

In [18]:
# Create data collator for dynamic padding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Define training arguments
training_args = TrainingArguments(
    # Output settings
    output_dir="./roberta_ag_news_model",
    
    # Training hyperparameters
    learning_rate=2e-5,              # Standard LR for fine-tuning transformers
    num_train_epochs=3,              # 3 epochs 
    per_device_train_batch_size=16,  # Adjust based on GPU memory
    per_device_eval_batch_size=32,   # Larger batch for evaluation (no gradients)
    weight_decay=0.01,               # Regularization
    warmup_ratio=0.1,                # 10% warmup steps
    
    # Evaluation strategy
    eval_strategy="epoch",           # Evaluate after each epoch
    save_strategy="epoch",           # Save checkpoint after each epoch
    load_best_model_at_end=True,     # Load the best model when training ends
    metric_for_best_model="accuracy", # Use accuracy to select best model
    
    # Logging
    logging_dir="./logs",
    logging_steps=500,               # Log every 500 steps
    
    # Performance optimizations
    fp16=torch.cuda.is_available(),  # Use mixed precision if GPU available
    gradient_accumulation_steps=1,   # Adjust if batch size needs to be larger
    
    # Other settings
    seed=42,                         # For reproducibility
    report_to="none",                # Disable wandb/tensorboard reporting
)

print("✅ Training arguments configured!")
print(f"   Epochs: {training_args.num_train_epochs}")
print(f"   Batch size: {training_args.per_device_train_batch_size}")
print(f"   Learning rate: {training_args.learning_rate}")
print(f"   Total training samples: {len(tokenized_dataset['train']):,}")
print(f"   Steps per epoch: {len(tokenized_dataset['train']) // training_args.per_device_train_batch_size:,}")

✅ Training arguments configured!
   Epochs: 3
   Batch size: 16
   Learning rate: 2e-05
   Total training samples: 120,000
   Steps per epoch: 7,500
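The steps-per-epoch figure printed above assumes a single device and no gradient accumulation. A rough sketch of the more general arithmetic follows. `steps_per_epoch` is a hypothetical helper, not a Trainer API, and the Trainer's exact rounding can differ slightly when the dataset size is not a multiple of the effective batch.

```python
import math

def steps_per_epoch(num_samples, per_device_batch_size, num_devices=1, grad_accum_steps=1):
    """Approximate optimizer steps per epoch and the effective batch size."""
    effective_batch = per_device_batch_size * num_devices * grad_accum_steps
    return math.ceil(num_samples / effective_batch), effective_batch

# This notebook's setup: 120,000 samples, batch size 16, one GPU, no accumulation.
steps, effective = steps_per_epoch(120_000, 16)
print(steps, effective, steps * 3)  # 7500 16 22500

# Hypothetical alternative: two GPUs with 2 accumulation steps.
alt_steps, alt_effective = steps_per_epoch(120_000, 16, num_devices=2, grad_accum_steps=2)
print(alt_steps, alt_effective)  # 1875 64
```

The 22,500 total matches the step count the Trainer reports after three epochs.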


### 📖 Understanding Training Hyperparameters

| Parameter | Description | Our Value |
|-----------|-------------|------------|
| `learning_rate` | Peak step size for weight updates (reached after warmup) | 2e-5 |
| `num_train_epochs` | Complete passes through training data | 3 |
| `per_device_train_batch_size` | Samples processed per device before updating weights | 16 |
| `weight_decay` | Regularization to prevent overfitting | 0.01 |
| `warmup_ratio` | Fraction of steps for learning rate warmup | 0.1 |
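To see how `learning_rate` and `warmup_ratio` interact, the sketch below computes the learning rate at a given step, assuming the Trainer's default linear schedule (linear warmup followed by linear decay to zero) and this notebook's 22,500 total steps. `linear_warmup_decay_lr` is a hypothetical helper written for illustration, not a library function.

```python
def linear_warmup_decay_lr(step, total_steps=22_500, warmup_ratio=0.1, peak_lr=2e-5):
    """Learning rate at `step` under linear warmup followed by linear decay."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp up linearly from 0 to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Decay linearly from peak_lr down to 0 at the final step.
    remaining = total_steps - step
    return peak_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))

for step in (0, 1_125, 2_250, 12_375, 22_500):
    print(f"step {step:>6,}: lr = {linear_warmup_decay_lr(step):.2e}")
```

With a ratio of 0.1, the first 2,250 steps are spent ramping up, so the peak of 2e-5 is only reached about a third of the way through the first epoch.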

In [19]:
# Create the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
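    eval_dataset=tokenized_dataset['test'],  # the test split doubles as the validation set here
    tokenizer=tokenizer,  # recent transformers releases rename this argument to processing_class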
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

print("✅ Trainer initialized!")
print(f"   Training samples: {len(tokenized_dataset['train']):,}")
print(f"   Evaluation samples: {len(tokenized_dataset['test']):,}")

✅ Trainer initialized!
   Training samples: 120,000
   Evaluation samples: 7,600
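Because the test split is also used to pick the best checkpoint, the reported test accuracy can be slightly optimistic. One common alternative is to hold out part of the training data for validation. The sketch below shows a stratified index split in pure Python; `stratified_split_indices` is a hypothetical helper, and the label list is a stand-in for the balanced AG News training labels.

```python
import random
from collections import defaultdict

def stratified_split_indices(labels, val_fraction=0.1, seed=42):
    """Split indices into train/validation while preserving class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Stand-in for the 120,000 balanced training labels (30,000 per class).
labels = [i % 4 for i in range(120_000)]
train_idx, val_idx = stratified_split_indices(labels)
print(len(train_idx), len(val_idx))  # 108000 12000
```

In the notebook, the resulting indices could be passed to `tokenized_dataset['train'].select(...)` to build the two subsets, and the `datasets` library also offers `Dataset.train_test_split` for the same purpose. The test split would then be evaluated only once, after training.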


---

## 🚀 Part 6: Training the Model

Now we start the fine-tuning process! This will train on the full 120,000 samples for 3 epochs.

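> ⚠️ **Note**: Training time depends heavily on hardware. The run recorded in this notebook finished in about 10 minutes (571 seconds) on its CUDA GPU; slower GPUs may need 30-60 minutes, and CPU training can take several hours.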

In [20]:
# Start training!
print("🚀 Starting fine-tuning RoBERTa on AG News dataset...")
print("   This will train for 3 epochs on 120,000 samples.")
print("=" * 70)

train_result = trainer.train()

print("\n" + "=" * 70)
print("✅ Training complete!")

🚀 Starting fine-tuning RoBERTa on AG News dataset...
   This will train for 3 epochs on 120,000 samples.


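| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|-------|---------------|-----------------|----------|----|
| 1 | 0.2148 | 0.185017 | 0.942632 | 0.94253 |
| 2 | 0.1646 | 0.177822 | 0.950132 | 0.950147 |
| 3 | 0.1072 | 0.202116 | 0.954079 | 0.954104 |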
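The log shows validation loss rising in epoch 3 while accuracy keeps improving, so the "best" checkpoint depends on which metric is used for selection. A small sketch using the logged values makes this explicit (the dictionary keys are names chosen for this example):

```python
# Rows copied from the training log above.
log = [
    {"epoch": 1, "train_loss": 0.2148, "val_loss": 0.185017, "accuracy": 0.942632},
    {"epoch": 2, "train_loss": 0.1646, "val_loss": 0.177822, "accuracy": 0.950132},
    {"epoch": 3, "train_loss": 0.1072, "val_loss": 0.202116, "accuracy": 0.954079},
]

best_by_accuracy = max(log, key=lambda row: row["accuracy"])["epoch"]
best_by_loss = min(log, key=lambda row: row["val_loss"])["epoch"]

print(f"Best epoch by accuracy:        {best_by_accuracy}")  # 3
print(f"Best epoch by validation loss: {best_by_loss}")      # 2
```

With `metric_for_best_model="accuracy"`, `load_best_model_at_end=True` restores the epoch 3 checkpoint. Selecting on validation loss would have kept epoch 2 instead, at the cost of roughly 0.4 percentage points of accuracy in this run.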



✅ Training complete!


In [21]:
# Display training metrics
print("📊 Training Metrics:")
print(f"   Total steps: {train_result.global_step:,}")
print(f"   Training loss: {train_result.training_loss:.4f}")
print(f"   Training runtime: {train_result.metrics['train_runtime']:.2f} seconds")
print(f"   Samples per second: {train_result.metrics['train_samples_per_second']:.2f}")

📊 Training Metrics:
   Total steps: 22,500
   Training loss: 0.1910
   Training runtime: 571.41 seconds
   Samples per second: 630.01


In [22]:
# Evaluate the model on test set
print("📊 Evaluating the model on test set...")
eval_results = trainer.evaluate()

print("\n📈 Final Evaluation Results:")
print("=" * 50)
print(f"   Loss: {eval_results['eval_loss']:.4f}")
print(f"   Accuracy: {eval_results['eval_accuracy']:.4f} ({eval_results['eval_accuracy']*100:.2f}%)")
print(f"   F1 Score: {eval_results['eval_f1']:.4f}")
print(f"   Runtime: {eval_results['eval_runtime']:.2f} seconds")
print(f"   Samples/second: {eval_results['eval_samples_per_second']:.2f}")

📊 Evaluating the model on test set...



📈 Final Evaluation Results:
   Loss: 0.2021
   Accuracy: 0.9541 (95.41%)
   F1 Score: 0.9541
   Runtime: 2.21 seconds
   Samples/second: 3433.05


---

## 💾 Part 7: Saving the Fine-tuned Model

Let's save our model so we can use it later without retraining.

In [23]:
# Save the model and tokenizer
MODEL_SAVE_PATH = "./roberta_ag_news_model/final"

trainer.save_model(MODEL_SAVE_PATH)
tokenizer.save_pretrained(MODEL_SAVE_PATH)

print(f"✅ Model saved to: {MODEL_SAVE_PATH}")

✅ Model saved to: ./roberta_ag_news_model/final


---

## 🔮 Part 8: Inference with the Fine-tuned Model

Now let's test our model on some new news headlines!

In [24]:
# Load the fine-tuned model for inference
inference_model = AutoModelForSequenceClassification.from_pretrained(MODEL_SAVE_PATH)
inference_tokenizer = AutoTokenizer.from_pretrained(MODEL_SAVE_PATH)

inference_model.to(device)
inference_model.eval()  # Set to evaluation mode

print("✅ Model loaded for inference!")

✅ Model loaded for inference!


In [25]:
def predict_news_topic(text):
    """
    Predicts the topic category for a news article.
    
    Args:
        text: The news article or headline text
        
    Returns:
        A dictionary with predicted topic and confidence
    """
    # Tokenize the input
    inputs = inference_tokenizer(
        text,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=256
    )
    
    # Move inputs to device
    inputs = {key: value.to(device) for key, value in inputs.items()}
    
    # Get predictions
    with torch.no_grad():
        outputs = inference_model(**inputs)
        probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
        predicted_class = torch.argmax(probabilities, dim=-1).item()
        confidence = probabilities[0][predicted_class].item()
    
    return {
        "topic": label_names[predicted_class],
        "topic_id": predicted_class,
        "confidence": confidence,
        "all_probabilities": {
            label_names[i]: f"{prob:.2%}" 
            for i, prob in enumerate(probabilities[0].cpu().numpy())
        }
    }
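The function above turns raw logits into probabilities with softmax and reports the largest one as the confidence. A minimal pure-Python sketch of that step follows; the logit values are made up for illustration and do not come from the model.

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for [World, Sports, Business, Sci/Tech].
logits = [1.2, -0.8, 4.5, 2.0]
probs = softmax(logits)

predicted_class = max(range(len(probs)), key=probs.__getitem__)
confidence = probs[predicted_class]

print([f"{p:.2%}" for p in probs])
print(f"predicted class: {predicted_class}, confidence: {confidence:.2%}")
```

Softmax always produces values that sum to 1, so a high confidence only says the model prefers one class over the others; it does not guarantee the prediction is correct.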

In [27]:
# Test headlines - one from each category
test_headlines = [
    # World news
    "UN Security Council meets to discuss Middle East peace negotiations amid rising tensions between neighboring nations.",
    
    # Sports news
    "Manchester United defeats Liverpool 3-2 in thrilling Premier League match as Rashford scores winning goal in injury time.",
    
    # Business news
    "Stock markets rally as Federal Reserve signals potential interest rate cuts in upcoming quarterly review meeting.",
    
    # Science/Tech news
    "NASA's James Webb telescope discovers water vapor on distant exoplanet, raising hopes for potential extraterrestrial life.",
    
    # Additional mixed examples
    "Apple announces record quarterly earnings driven by strong iPhone 15 sales in Asian markets.",
    "World leaders gather in Paris for annual climate summit to address global warming concerns.",
    "OpenAI releases GPT-5 with unprecedented language understanding capabilities.",
    "LeBron James becomes NBA's all-time leading scorer with spectacular performance."
]

print("🗞️ News Topic Classification Results")
print("=" * 70)

for i, headline in enumerate(test_headlines, 1):
    result = predict_news_topic(headline)
    print(f"\n📰 Article {i}:")
    print(f"   \"{headline[:80]}...\"" if len(headline) > 80 else f"   \"{headline}\"")
    print(f"   → Predicted Topic: {result['topic']}")
    print(f"   → Confidence: {result['confidence']:.2%}")

🗞️ News Topic Classification Results

📰 Article 1:
   "UN Security Council meets to discuss Middle East peace negotiations amid rising ..."
   → Predicted Topic: 🌍 World
   → Confidence: 99.93%

📰 Article 2:
   "Manchester United defeats Liverpool 3-2 in thrilling Premier League match as Ras..."
   → Predicted Topic: 🌍 World
   → Confidence: 99.89%

📰 Article 3:
   "Stock markets rally as Federal Reserve signals potential interest rate cuts in u..."
   → Predicted Topic: 💼 Business
   → Confidence: 99.00%

📰 Article 4:
   "NASA's James Webb telescope discovers water vapor on distant exoplanet, raising ..."
   → Predicted Topic: 🔬 Sci/Tech
   → Confidence: 94.04%

📰 Article 5:
   "Apple announces record quarterly earnings driven by strong iPhone 15 sales in As..."
   → Predicted Topic: 🔬 Sci/Tech
   → Confidence: 98.13%

📰 Article 6:
   "World leaders gather in Paris for annual climate summit to address global warmin..."
   → P

In [28]:
# Detailed prediction with all probabilities
sample_text = "Tesla stock surges 15% after announcing breakthrough in battery technology for electric vehicles."

result = predict_news_topic(sample_text)

print("🔍 Detailed Prediction Analysis")
print("=" * 50)
print(f"\nText: {sample_text}")
print(f"\n📊 Predicted Topic: {result['topic']}")
print(f"\n📈 Probability Distribution:")
for topic, prob in result['all_probabilities'].items():
    bar_length = int(float(prob.strip('%')) / 5)
    bar = "█" * bar_length
    print(f"   {topic}: {prob} {bar}")

🔍 Detailed Prediction Analysis

Text: Tesla stock surges 15% after announcing breakthrough in battery technology for electric vehicles.

📊 Predicted Topic: 💼 Business

📈 Probability Distribution:
   🌍 World: 0.91% 
   ⚽ Sports: 0.01% 
   💼 Business: 97.56% ███████████████████
   🔬 Sci/Tech: 1.52% 


---

## 🤗 Part 9: Using Pipeline for Easy Inference

Hugging Face provides a convenient `pipeline` API for even simpler inference.

In [29]:
from transformers import pipeline

# Create a text classification pipeline
classifier = pipeline(
    "text-classification", 
    model=MODEL_SAVE_PATH,
    device=0 if torch.cuda.is_available() else -1,
    top_k=None  # Return all classes with probabilities
)

print("✅ Pipeline created!")

Device set to use cuda:0


✅ Pipeline created!


In [30]:
# Quick classification with pipeline
news_articles = [
    "Scientists develop new AI algorithm that can predict weather patterns with 99% accuracy.",
    "Brazil wins World Cup after penalty shootout against Argentina in historic final.",
    "Amazon acquires competitor in $50 billion deal, largest tech merger this year."
]

print("🚀 Quick Classification with Pipeline")
print("=" * 50)

for article in news_articles:
    results = classifier(article)
    top_prediction = max(results[0], key=lambda x: x['score'])
    
    print(f"\n📰 {article[:70]}...")
    print(f"   → {top_prediction['label']} ({top_prediction['score']:.2%})")

🚀 Quick Classification with Pipeline

📰 Scientists develop new AI algorithm that can predict weather patterns ...
   → Sci/Tech (98.43%)

📰 Brazil wins World Cup after penalty shootout against Argentina in hist...
   → World (99.89%)

📰 Amazon acquires competitor in $50 billion deal, largest tech merger th...
   → Sci/Tech (97.99%)
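With `top_k=None`, the loop above treats `results[0]` as a list of `{'label', 'score'}` dictionaries, one per class. The sketch below mimics that structure with made-up scores to show how the ranking and the margin over the runner-up can be read off. The exact nesting of pipeline output may differ between library versions, so treat the shape here as an assumption.

```python
# Hypothetical pipeline output for one article; the scores are invented.
results = [[
    {"label": "Sci/Tech", "score": 0.91},
    {"label": "Business", "score": 0.07},
    {"label": "World", "score": 0.015},
    {"label": "Sports", "score": 0.005},
]]

# Rank all classes from most to least likely.
ranked = sorted(results[0], key=lambda item: item["score"], reverse=True)
top, runner_up = ranked[0], ranked[1]
margin = top["score"] - runner_up["score"]

print(f"Top: {top['label']} ({top['score']:.2%})")
print(f"Runner-up: {runner_up['label']} ({runner_up['score']:.2%})")
print(f"Margin: {margin:.2%}")
```

A small margin between the top two classes is a useful signal for articles that straddle categories, such as technology companies reporting business news.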


---

## 📤 Part 10: Pushing to Hugging Face Hub (Optional)

Share your fine-tuned model with the world!

In [55]:
# Login to Hugging Face Hub
# Uncomment and run if you want to push to Hub

# from huggingface_hub import notebook_login
# notebook_login()

In [56]:
# # Push to Hub
# # Uncomment and modify to push your model

# from huggingface_hub import create_repo

# # trainer.push_to_hub() names the repo after output_dir unless hub_model_id is set in TrainingArguments
# repo_id = "your-username/roberta_ag_news_model"

# # Step 1: Create the repository under the name the trainer will push to
# print(f"📦 Creating repository: {repo_id}")
# create_repo(repo_id, repo_type="model", exist_ok=True)
# print(f"✅ Repository created: {repo_id}")

# # Step 2: Push using trainer (includes training metrics in model card)
# print(f"\n🚀 Pushing model to: {repo_id}")
# trainer.push_to_hub()

# # Step 3: Push tokenizer
# tokenizer.push_to_hub(repo_id)

# print(f"\n✅ Model successfully pushed with training info!")
# print(f"🔗 View your model at: https://huggingface.co/{repo_id}")

---

## 🎓 Key Takeaways

### What We Accomplished:

1. **Fine-tuned RoBERTa-base** on the complete AG News dataset (120,000 training samples)
2. **Trained for 3 epochs** with proper evaluation metrics (accuracy and F1 score)
3. **Built inference pipeline** for easy prediction on new articles

### Model Performance:

In the run recorded in this notebook, RoBERTa-base reached **95.41% accuracy** (weighted F1 of 0.9541) on the AG News test set after 3 epochs of training.

### Next Steps:

- Try different learning rates (1e-5 to 5e-5)
- Experiment with more epochs (4-5)
- Compare with other models (BERT, ALBERT, XLNet)
- Add more sophisticated evaluation (confusion matrix, per-class metrics)
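
As a starting point for the last item, here is a minimal pure-Python sketch of a confusion matrix with per-class recall. The labels and predictions are made up for illustration, and both helpers are hypothetical names written for this example.

```python
def confusion_matrix(refs, preds, num_classes):
    """matrix[true][predicted] counts how often each pairing occurs."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for r, p in zip(refs, preds):
        matrix[r][p] += 1
    return matrix

def per_class_recall(matrix):
    """Fraction of each true class that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]

names = ["World", "Sports", "Business", "Sci/Tech"]

# Hypothetical references and predictions (two examples per class).
refs = [0, 0, 1, 1, 2, 2, 3, 3]
preds = [0, 0, 0, 1, 2, 3, 3, 3]

matrix = confusion_matrix(refs, preds, num_classes=4)
recall = per_class_recall(matrix)

for name, row, rec in zip(names, matrix, recall):
    print(f"{name:<9} {row}  recall={rec:.2f}")
```

In the notebook, `trainer.predict(tokenized_dataset['test'])` returns the logits and `label_ids` needed to build the real matrix. That would show, for example, whether Sports articles are systematically labelled World, as happened with the football headlines in Parts 8 and 9.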

---

## 📚 References & Resources

- [RoBERTa Paper](https://arxiv.org/abs/1907.11692) - "RoBERTa: A Robustly Optimized BERT Pretraining Approach"
- [AG News Dataset](https://huggingface.co/datasets/fancyzhx/ag_news) - Dataset on Hugging Face
- [Hugging Face Transformers Documentation](https://huggingface.co/docs/transformers)
- [Fine-tuning Guide](https://huggingface.co/docs/transformers/training)
- [FacebookAI/roberta-base](https://huggingface.co/FacebookAI/roberta-base) - Model Card

---