# Zero-Shot Topic Classification with DeBERTa

This notebook applies zero-shot classification to label news articles with topic categories.

**Model**: MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli  
**Dataset**: LRC_articles.parquet (124,988 articles)  
**Topics**: 11 topic labels  
**Runtime**: Google Colab Pro with GPU

In [None]:
# Cell 1: Install dependencies and import libraries
# Install transformers for the DeBERTa model and datasets for efficient data handling

!pip install -q transformers torch pandas pyarrow tqdm datasets

import pandas as pd
import torch
import gc
from datetime import datetime
from transformers import pipeline
from tqdm import tqdm
from google.colab import drive
from datasets import Dataset

In [2]:
# Cell 2: Mount Google Drive
# First, upload LRC_articles.parquet to your Google Drive (in the root MyDrive folder)
# Then run this cell to mount your Drive and access the file

print("Mounting Google Drive...")
drive.mount('/content/drive')

print("\nGoogle Drive mounted successfully!")
print("Make sure LRC_articles.parquet is in your Google Drive root folder (MyDrive)")

Mounting Google Drive...
Mounted at /content/drive

Google Drive mounted successfully!
Make sure LRC_articles.parquet is in your Google Drive root folder (MyDrive)


In [3]:
# Cell 3: Load dataset
# Load the LRC articles parquet file from Google Drive

print("Loading dataset from Google Drive...")
df = pd.read_parquet('/content/drive/MyDrive/LRC_articles.parquet')

print(f"Dataset loaded: {len(df):,} articles")
print(f"Columns: {list(df.columns)}")
print(f"Date range: {df['date'].min()} to {df['date'].max()}")
print(f"\nOutlet distribution:")
print(df['outlet_name'].value_counts())
print(f"\nBias distribution:")
print(df['bias'].value_counts())

Loading dataset from Google Drive...
Dataset loaded: 124,988 articles
Columns: ['uuid', 'url', 'outlet_name', 'bias', 'date', 'content', 'content_preprocessed']
Date range: 2015-01-01 to 2021-12-31

Outlet distribution:
outlet_name
Newsweek               32224
Breitbart News         22145
HuffPost               15969
New York Post          12968
CNBC                   11051
Reuters                10385
Washington Times       10359
BBC News                5198
The Washington Post     4689
Name: count, dtype: int64

Bias distribution:
bias
Left          48193
Center        26634
Lean Right    23327
Right         22145
Lean Left      4689
Name: count, dtype: int64


In [None]:
# Cell 4: Setup model and topic labels
# Initialize zero-shot classification pipeline with DeBERTa-v3-base
# Define 11 topic categories designed to be distinct from the 14 framing dimensions

TOPICS = [
    "immigration",
    "elections and politics",
    "China",
    "Middle East",
    "Russia",
    "technology",
    "pandemic",
    "healthcare",
    "entertainment",
    "finance",
    "crime",
]

print(f"Topic labels defined: {len(TOPICS)} topics\n")
for i, topic in enumerate(TOPICS, 1):
    print(f"{i:2d}. {topic}")

print("\nLoading DeBERTa-v3-base zero-shot model...")
print("This may take a few minutes on first run...")

device = 0 if torch.cuda.is_available() else -1
classifier = pipeline(
    "zero-shot-classification",
    model="MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli",
    device=device,
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32
)

# Configure tokenizer for optimal truncation
classifier.tokenizer.model_max_length = 512

classifier.hypothesis_template = "This text is about {}."

print(f"Model loaded successfully on device: {'GPU' if device == 0 else 'CPU'}")
print(f"Using precision: {'fp16' if torch.cuda.is_available() else 'fp32'}")
print(f"Max token length: {classifier.tokenizer.model_max_length}")
print(f"Using hypothesis template: '{classifier.hypothesis_template}'")
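
To make the role of the hypothesis template concrete, here is a minimal sketch of how zero-shot classification frames topic labeling as natural language inference. Each candidate label is slotted into the template to form a hypothesis, the model scores every (article, hypothesis) pair, and with `multi_label=False` the entailment scores are normalized across labels with a softmax. The logits below are made-up numbers for illustration only, and `build_hypotheses` and `softmax` are hypothetical helper names, not part of transformers.

```python
import math

HYPOTHESIS_TEMPLATE = "This text is about {}."
EXAMPLE_TOPICS = ["immigration", "technology", "crime"]


def build_hypotheses(labels, template=HYPOTHESIS_TEMPLATE):
    """Slot each candidate label into the template to form one hypothesis per label."""
    return [template.format(label) for label in labels]


def softmax(logits):
    """Normalize raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


hypotheses = build_hypotheses(EXAMPLE_TOPICS)

# Made-up entailment logits, one per hypothesis (illustration only).
fake_entailment_logits = [0.2, 2.5, 0.9]
scores = softmax(fake_entailment_logits)

# Sort labels by descending score, mirroring the order of the pipeline output.
ranked = sorted(zip(EXAMPLE_TOPICS, scores), key=lambda pair: pair[1], reverse=True)
for label, score in ranked:
    print(f"{label}: {score:.3f}")
```

Because the scores compete across labels in single-label mode, adding or removing a candidate label changes every score, which is worth keeping in mind when choosing a confidence threshold.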

In [5]:
# Cell 5: Test on sample articles
# Run classification on 10 random articles to verify model is working

print("Testing model on 10 sample articles...\n")

sample_df = df.sample(n=10, random_state=42)

for idx, row in sample_df.iterrows():
    # Keep only the first 512 characters (not tokens) for a quick sanity check
    text = row['content'][:512]

    result = classifier(text, TOPICS, multi_label=False)

    top_topic = result['labels'][0]
    top_score = result['scores'][0]

    print(f"Outlet: {row['outlet_name']} ({row['bias']}) | Date: {row['date']}")
    print(f"Text preview: {text[:150]}...")
    print(f"Predicted topic: {top_topic} (confidence: {top_score:.3f})")
    print("-" * 80)

print("\nSample test complete. Model is working correctly.")

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Testing model on 10 sample articles...

Outlet: New York Post (Lean Right) | Date: 2021-11-29
Text preview: David Gulpilil, the legendary indigenous Australian actor who earned international acclaim in Paul Hogan’s “Crocodile Dundee” and Rolf de Heer’s “Char...
Predicted topic: entertainment (confidence: 0.616)
--------------------------------------------------------------------------------
Outlet: CNBC (Center) | Date: 2019-01-30
Text preview: The Anthem Inc. Anthem Anywhere application is seen in the App Store on an Apple Inc. iPhone displayed for a photograph in Washington, D.C., U.S., on ...
Predicted topic: technology (confidence: 0.508)
--------------------------------------------------------------------------------
Outlet: Reuters (Center) | Date: 2016-12-19
Text preview: NEW YORK - Most delinquent youth achieve few positive milestones in the years after their detention, especially if they are boys, Hispanic, or African...
Predicted topic: crime (confidence: 0.542)
-------------

In [None]:
# Cell 5.5: Test classification on 2000-article sample
# Analyze confidence patterns by topic before running full dataset
# This helps us understand if filtering will work for our research needs
# OPTIMIZED: Uses Dataset format for efficient GPU processing

print("Running classification test on 2,000 article sample...")

# Sample 2000 articles
SAMPLE_SIZE = 2000
BATCH_SIZE = 64  # Increased from 32 for L4 GPU
df_sample = df.sample(n=SAMPLE_SIZE, random_state=42)

print(f"Processing {SAMPLE_SIZE:,} articles with batch size {BATCH_SIZE}...")

# Convert to HuggingFace Dataset for efficient GPU processing
sample_dataset = Dataset.from_dict({
    'text': df_sample['content'].tolist()
})

# Classify in batches of BATCH_SIZE (faster than one article at a time)
# Each item yielded below is the result for a single article, not a whole batch
sample_results = []
for batch in tqdm(
    classifier(sample_dataset['text'], TOPICS, multi_label=False, batch_size=BATCH_SIZE),
    total=len(sample_dataset),
    desc="Processing sample"
):
    sample_results.append({
        'topic_top1': batch['labels'][0],
        'topic_top1_score': batch['scores'][0],
    })

# Analyze results
sample_results_df = pd.DataFrame(sample_results)
df_sample_labeled = pd.concat([df_sample.reset_index(drop=True), sample_results_df], axis=1)

print("\nCONFIDENCE ANALYSIS BY TOPIC")

# Show confidence stats by topic
topic_confidence = df_sample_labeled.groupby('topic_top1')['topic_top1_score'].agg([
    ('count', 'count'),
    ('mean_conf', 'mean'),
    ('median_conf', 'median'),
    ('high_conf_pct', lambda x: (x >= 0.6).sum() / len(x) * 100)
]).sort_values('count', ascending=False)

print("\nConfidence by Topic (sorted by article count):")
print(topic_confidence.round(3))

print("\nFILTERING IMPACT ANALYSIS")

# Show what happens at different confidence thresholds
for threshold in [0.5, 0.6, 0.7]:
    kept = (df_sample_labeled['topic_top1_score'] >= threshold).sum()
    pct = kept / len(df_sample_labeled) * 100
    print(f"\nThreshold >= {threshold}: Keep {kept:,}/{SAMPLE_SIZE:,} articles ({pct:.1f}%)")

    # Show topic distribution at this threshold
    filtered_topics = df_sample_labeled[df_sample_labeled['topic_top1_score'] >= threshold]['topic_top1'].value_counts()
    print("Topics with >= 50 articles at this threshold:")
    print(filtered_topics[filtered_topics >= 50])
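
The per-topic summary and the threshold counts above can be checked by hand on a tiny table. The sketch below uses made-up predictions, not rows from the dataset, and expresses the same aggregation with keyword-style named aggregation, which is an alternative spelling rather than the notebook's own code.

```python
import pandas as pd

# Made-up predictions for illustration; not taken from the dataset.
toy = pd.DataFrame({
    "topic_top1": ["crime", "crime", "crime", "finance", "finance"],
    "topic_top1_score": [0.90, 0.55, 0.65, 0.40, 0.70],
})

# Per-topic summary: article count, mean confidence, and the share of
# predictions at or above the 0.6 cutoff.
summary = toy.groupby("topic_top1")["topic_top1_score"].agg(
    count="count",
    mean_conf="mean",
    high_conf_pct=lambda s: (s >= 0.6).mean() * 100,
)
print(summary.round(1))

# Number of rows retained at each candidate threshold.
retained = {t: int((toy["topic_top1_score"] >= t).sum()) for t in (0.5, 0.6, 0.7)}
print(retained)
```

Here two of the three crime predictions and one of the two finance predictions clear 0.6, so a single cutoff can thin some topics much more than others.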

In [None]:
# Cell 6: Process full dataset
# Classify all articles in batches to manage memory
# Store top topic and top 3 topics with scores for each article
# OPTIMIZED: Uses Dataset format, fp16, increased batch size, tokenizer truncation

BATCH_SIZE = 64  # Increased from 32 for L4 GPU

print(f"Starting classification of {len(df):,} articles...")
print(f"Batch size: {BATCH_SIZE}")
print(f"Max token length: {classifier.tokenizer.model_max_length}")
print(f"Start time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("\n" + "="*80 + "\n")

# Convert to HuggingFace Dataset for efficient GPU processing
full_dataset = Dataset.from_dict({
    'text': df['content'].tolist()
})

# Classify in batches of BATCH_SIZE
# Each item yielded below is the result for a single article, not a whole batch
results = []
for batch in tqdm(
    classifier(full_dataset['text'], TOPICS, multi_label=False, batch_size=BATCH_SIZE),
    total=len(full_dataset),
    desc="Classifying articles"
):
    top_topic = batch['labels'][0]
    top_score = batch['scores'][0]

    top3_topics = batch['labels'][:3]
    top3_scores = batch['scores'][:3]

    results.append({
        'topic_top1': top_topic,
        'topic_top1_score': top_score,
        'topic_top2': top3_topics[1] if len(top3_topics) > 1 else None,
        'topic_top2_score': top3_scores[1] if len(top3_scores) > 1 else None,
        'topic_top3': top3_topics[2] if len(top3_topics) > 2 else None,
        'topic_top3_score': top3_scores[2] if len(top3_scores) > 2 else None,
    })

    # Periodic garbage collection and progress report (printed on CPU and GPU alike)
    if len(results) % 16000 == 0:
        gc.collect()
        print(f"\nProgress: {len(results):,}/{len(full_dataset):,} articles ({len(results)/len(full_dataset)*100:.1f}%)")
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            print(f"GPU memory: {torch.cuda.memory_allocated(0) / 1e9:.2f} GB")

print(f"\n\nClassification complete!")
print(f"End time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"Total articles processed: {len(results):,}")
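
As far as I can tell, when the pipeline is handed a plain Python list it returns its results only after every input has been classified, so a progress bar wrapped around that call advances only at the very end. One possible alternative, sketched here under that assumption, is to feed the texts in chunks so that progress reporting and memory cleanup happen between chunks. `classify_in_chunks`, `flatten_top_k`, and `fake_classifier` are hypothetical names; the stub stands in for a call such as `classifier(chunk, TOPICS, multi_label=False, batch_size=BATCH_SIZE)` and only mimics the `labels`/`scores` shape of its output.

```python
def flatten_top_k(result, k=3):
    """Turn one result dict (labels/scores sorted by descending score) into flat columns."""
    row = {}
    for rank in range(k):
        has_rank = rank < len(result["labels"])
        row[f"topic_top{rank + 1}"] = result["labels"][rank] if has_rank else None
        row[f"topic_top{rank + 1}_score"] = result["scores"][rank] if has_rank else None
    return row


def classify_in_chunks(texts, classify_fn, chunk_size=1000, on_chunk_done=None):
    """Classify texts chunk by chunk so work can be reported or cleaned up in between."""
    rows = []
    for start in range(0, len(texts), chunk_size):
        chunk = texts[start:start + chunk_size]
        rows.extend(flatten_top_k(result) for result in classify_fn(chunk))
        if on_chunk_done is not None:
            on_chunk_done(len(rows), len(texts))
    return rows


def fake_classifier(chunk):
    """Stub returning a fixed two-label ranking for every text (illustration only)."""
    return [{"labels": ["crime", "finance"], "scores": [0.8, 0.2]} for _ in chunk]


progress_log = []
rows = classify_in_chunks(
    ["article one", "article two", "article three"],
    fake_classifier,
    chunk_size=2,
    on_chunk_done=lambda done, total: progress_log.append((done, total)),
)
print(rows[0])
print(progress_log)
```

In the notebook, the `on_chunk_done` callback could host the `gc.collect()` and `torch.cuda.empty_cache()` calls from the loop above.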

In [None]:
# Cell 7: Add results to dataframe, filter, and save
# Combine original data with topic predictions
# Filter to only keep high-confidence predictions (>= 0.6)
# Save to new parquet file for analysis

print("Adding topic labels to dataframe...")

results_df = pd.DataFrame(results)
df_labeled = pd.concat([df.reset_index(drop=True), results_df], axis=1)

print(f"Labeled dataframe shape: {df_labeled.shape}")
print(f"Total articles: {len(df_labeled):,}")

# Filter to high-confidence predictions only
CONFIDENCE_THRESHOLD = 0.6
df_filtered = df_labeled[df_labeled['topic_top1_score'] >= CONFIDENCE_THRESHOLD].copy()

print(f"\nFiltering to confidence >= {CONFIDENCE_THRESHOLD}:")
print(f"Articles kept: {len(df_filtered):,} ({len(df_filtered)/len(df_labeled)*100:.1f}%)")
print(f"Articles removed: {len(df_labeled) - len(df_filtered):,} ({(len(df_labeled) - len(df_filtered))/len(df_labeled)*100:.1f}%)")

print(f"\nConfidence score distribution in filtered data:")
print(df_filtered['topic_top1_score'].describe())

output_filename = f"LRC_articles_with_topics_filtered_{datetime.now().strftime('%Y%m%d_%H%M%S')}.parquet"

print(f"\nSaving filtered data to {output_filename}...")
df_filtered.to_parquet(output_filename, index=False)

print("Save complete!")
print(f"\nTo download from Colab: from google.colab import files; files.download('{output_filename}')")

In [None]:
# Cell 8: Analyze topic distribution
# Quick validation of topic classification results
# Check distribution across outlets and political bias

print("TOPIC DISTRIBUTION ANALYSIS (HIGH-CONFIDENCE PREDICTIONS ONLY)")
print(f"Analyzing {len(df_filtered):,} articles with confidence >= {CONFIDENCE_THRESHOLD}")
print("="*80)

print("\n1. Overall topic distribution:")
topic_counts = df_filtered['topic_top1'].value_counts()
print(topic_counts)
print(f"\nPercentages:")
print((topic_counts / len(df_filtered) * 100).round(2))

print("\n" + "="*80)
print("\n2. Average confidence scores by topic:")
confidence_by_topic = df_filtered.groupby('topic_top1')['topic_top1_score'].agg(['mean', 'std', 'count'])
confidence_by_topic = confidence_by_topic.sort_values('mean', ascending=False)
print(confidence_by_topic.round(3))

print("\n" + "="*80)
print("\n3. Topic distribution by political bias:")
bias_topic_crosstab = pd.crosstab(df_filtered['bias'], df_filtered['topic_top1'], normalize='index') * 100
print(bias_topic_crosstab.round(1))

print("\n" + "="*80)
print("\n4. Topic distribution by outlet (top 5 topics per outlet):")
for outlet in df_filtered['outlet_name'].unique():
    outlet_data = df_filtered[df_filtered['outlet_name'] == outlet]
    top5 = outlet_data['topic_top1'].value_counts().head(5)
    print(f"\n{outlet}:")
    print((top5 / len(outlet_data) * 100).round(1))

print("\n" + "="*80)
print("\n5. Sample articles for the first 3 topics (up to 5 random articles each):")
for topic in TOPICS[:3]:
    print(f"\n--- {topic.upper()} ---")
    topic_articles = df_filtered[df_filtered['topic_top1'] == topic]
    if len(topic_articles) > 0:
        sample = topic_articles.sample(n=min(5, len(topic_articles)), random_state=42)
        for _, row in sample.iterrows():
            print(f"  {row['outlet_name']} ({row['date']}): {row['content'][:100]}...")
    else:
        print("  No articles classified under this topic")

print("\n" + "="*80)
print("\nAnalysis complete!")
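
The bias-by-topic table relies on `normalize='index'`, which divides each row of the crosstab by its row total. A small sketch on made-up labels (not rows from the dataset) shows the effect:

```python
import pandas as pd

# Made-up labels for illustration; not taken from the dataset.
toy = pd.DataFrame({
    "bias": ["Left", "Left", "Left", "Left", "Right", "Right"],
    "topic_top1": ["crime", "crime", "finance", "pandemic", "crime", "finance"],
})

# normalize='index' divides each row by its row total,
# so the topic shares within each bias group sum to 100%.
shares = pd.crosstab(toy["bias"], toy["topic_top1"], normalize="index") * 100
print(shares.round(1))
```

Row normalization makes groups of very different sizes comparable, but a large share in a small group can still rest on only a few articles, so it is safer to read the percentages alongside the raw counts.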