# AI News Summarizer: NewsDataHub + OpenAI

This notebook demonstrates how to build an AI-powered news summarization pipeline using:
- **NewsDataHub API**: fetches news articles with comprehensive metadata
- **OpenAI GPT-4o-mini**: generates concise, abstractive summaries

**⚠️ Important Note About AI Accuracy:** AI-generated summaries may occasionally contain inaccuracies, omit important details, or misinterpret nuanced information. Always review AI outputs for critical applications.

---

## Setup

Install required packages (run once):
```bash
pip install requests openai
```

In [None]:
# Import required libraries
import requests
import json
import os
from openai import OpenAI

## Configuration

Set your API keys and parameters:

In [None]:
# API Keys
NDH_API_KEY = ""  # NewsDataHub API key (leave empty to use sample data)
OPENAI_API_KEY = "your_openai_api_key_here"  # Required for summarization

# Configuration
MIN_CONTENT_LENGTH = 300  # Minimum characters for article content
NUM_ARTICLES_TO_PROCESS = 5  # Number of articles to summarize
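
Hard-coding keys in a notebook makes them easy to leak when the file is shared or committed. A minimal sketch of an alternative, assuming the keys are exported as environment variables named `NDH_API_KEY` and `OPENAI_API_KEY`. The variable names and the `load_key` helper are illustrative choices, not part of either API.

```python
import os

def load_key(env_name, default=""):
    """Return the stripped value of an environment variable, or `default` if unset or blank."""
    value = os.environ.get(env_name, "").strip()
    return value or default

# Hypothetical variable names; adjust them to match your own environment.
NDH_API_KEY = load_key("NDH_API_KEY")        # empty string -> the sample data branch runs
OPENAI_API_KEY = load_key("OPENAI_API_KEY")  # empty string -> summarization calls will fail
```

The `OpenAI()` client also reads `OPENAI_API_KEY` from the environment when no `api_key` argument is passed, so the explicit lookup is mainly useful for the NewsDataHub key.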

## Step 1: Fetch News Articles from NewsDataHub

We'll fetch English news articles from NewsDataHub. If no API key is provided, the code automatically downloads sample data from GitHub.

In [None]:
# Check if NewsDataHub API key is provided
if NDH_API_KEY and NDH_API_KEY != "your_ndh_api_key_here":
    print("Using live NewsDataHub API data...")

    url = "https://api.newsdatahub.com/v1/news"
    headers = {"x-api-key": NDH_API_KEY}

    # Fetch up to 100 English articles (single page, no pagination)
    params = {
        "per_page": 100,
        "language": "en",  # English articles only
        "country": "US,GB,CA,AU",  # English-speaking countries
        "source_type": "mainstream_news,digital_native"  # Quality sources
    }

    # A timeout keeps the cell from hanging indefinitely on a stalled connection
    response = requests.get(url, headers=headers, params=params, timeout=30)
    response.raise_for_status()
    data = response.json()

    articles = data.get("data", [])
    print(f"Fetched {len(articles)} English articles from NewsDataHub API")

else:
    print("No NewsDataHub API key provided. Loading sample data...")

    # Download sample data if not already present
    sample_file = "sample-news-data.json"

    if not os.path.exists(sample_file):
        print("Downloading sample data from GitHub...")
        sample_url = "https://raw.githubusercontent.com/newsdatahub/newsdatahub-ai-news-summarizer/refs/heads/main/data/sample-news-data.json"
        response = requests.get(sample_url, timeout=30)
        response.raise_for_status()
        with open(sample_file, "w", encoding="utf-8") as f:
            json.dump(response.json(), f)
        print(f"Sample data saved to {sample_file}")

    # Load sample data (explicit UTF-8 so the read does not depend on the OS default encoding)
    with open(sample_file, "r", encoding="utf-8") as f:
        data = json.load(f)

    # Handle both formats: raw array or API response with 'data' key
    if isinstance(data, dict) and "data" in data:
        articles = data["data"]
    elif isinstance(data, list):
        articles = data
    else:
        raise ValueError("Unexpected sample data format")

    print(f"Loaded {len(articles)} articles from sample data")
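
Both branches end by pulling a list of articles out of a JSON payload. That normalization can be factored into a small helper. `extract_articles` is a hypothetical name used here for illustration, and it mirrors the two shapes the sample-data branch accepts; the payloads are made-up values.

```python
def extract_articles(payload):
    """Return the article list from either a raw list or a dict with a 'data' key."""
    if isinstance(payload, dict) and "data" in payload:
        return payload["data"]
    if isinstance(payload, list):
        return payload
    raise ValueError("Unexpected data format: expected a list or a dict with a 'data' key")

# Toy payloads showing both accepted shapes
api_style = {"data": [{"title": "A"}, {"title": "B"}]}
raw_list = [{"title": "C"}]

print(len(extract_articles(api_style)))  # 2
print(len(extract_articles(raw_list)))   # 1
```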

## Step 2: Filter Articles with Sufficient Content

Drop articles whose `content` field is missing or shorter than `MIN_CONTENT_LENGTH` characters (photo galleries, breaking-news alerts, and similar stubs).

In [None]:
# Filter articles with sufficient content
filtered_articles = [
    article for article in articles
    if article.get("content") and len(article.get("content", "")) >= MIN_CONTENT_LENGTH
]

print(f"Kept {len(filtered_articles)} articles with content >= {MIN_CONTENT_LENGTH} characters")
print(f"Removed {len(articles) - len(filtered_articles)} articles with insufficient content")
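
To see the filter's effect in isolation, here is the same rule applied to three made-up records. The `content` values are invented for illustration, and `filter_by_length` is a hypothetical wrapper; only the field name and the 300-character threshold come from the cells above.

```python
def filter_by_length(articles, min_length=300):
    """Keep articles whose 'content' is present and at least min_length characters long."""
    return [a for a in articles if len(a.get("content") or "") >= min_length]

sample = [
    {"title": "Full story", "content": "x" * 450},    # long enough: kept
    {"title": "Photo gallery", "content": "x" * 40},  # too short: dropped
    {"title": "Alert", "content": None},              # missing content: dropped
]

kept = filter_by_length(sample)
print([a["title"] for a in kept])  # ['Full story']
```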

## Step 3: Initialize OpenAI Client and Summarization Function

Create a reusable function for generating AI summaries using GPT-4o-mini.

In [None]:
# Initialize OpenAI client
client = OpenAI(api_key=OPENAI_API_KEY)

def summarize_article(content, title):
    """
    Generate an abstractive summary of a news article using OpenAI GPT-4o-mini.

    Args:
        content (str): The full article content
        title (str): The article title (provides context to the AI)

    Returns:
        str: A 2-3 sentence summary, or error message if summarization fails
    """
    try:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": "You are a professional news summarizer. Create concise, accurate 2-3 sentence summaries that capture the key information and main points of articles."
                },
                {
                    "role": "user",
                    "content": f"Summarize this news article in 2-3 sentences:\n\nTitle: {title}\n\nContent: {content}"
                }
            ],
            max_tokens=150,  # Caps the reply at 150 tokens (very roughly 100 English words)
            temperature=0.3  # Lower temperature for consistent, focused summaries
        )

        # message.content is optional in the response type, so guard against None
        summary = (response.choices[0].message.content or "").strip()
        return summary

    except Exception as e:
        return f"Error generating summary: {str(e)}"

print("✓ Summarization function ready")
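
Very long articles increase input-token cost and can exceed the model's context window. One hedged option is to cap the article text before building the prompt. The `build_messages` helper and the 12,000-character cap below are illustrative assumptions, not values from the OpenAI documentation; the message structure matches the one used in `summarize_article`.

```python
def build_messages(title, content, max_chars=12_000):
    """Build chat messages, trimming article text to roughly max_chars plus a short marker."""
    if len(content) > max_chars:
        # Cut at the last space before the cap so a word is not split in half
        cut = content.rfind(" ", 0, max_chars)
        content = content[: cut if cut > 0 else max_chars].rstrip() + " [truncated]"
    return [
        {
            "role": "system",
            "content": "You are a professional news summarizer. Create concise, accurate 2-3 sentence summaries that capture the key information and main points of articles.",
        },
        {
            "role": "user",
            "content": f"Summarize this news article in 2-3 sentences:\n\nTitle: {title}\n\nContent: {content}",
        },
    ]

# Made-up article body, long enough to trigger truncation
demo_messages = build_messages("Demo title", "word " * 5000)
print(demo_messages[1]["content"].endswith("[truncated]"))  # True
```

The returned list can be passed as the `messages` argument of `client.chat.completions.create`. Truncating by characters is a blunt approximation of a token limit; a tokenizer-based cap would be more precise.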

## Step 4: Test on a Single Article

Let's test the summarization function on one article to see how it works.

In [None]:
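# Test on the first filtered article (fail with a clear message if the filter removed everything)
if not filtered_articles:
    raise ValueError("No articles passed the content-length filter; try lowering MIN_CONTENT_LENGTH")
test_article = filtered_articles[0]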

print("="*80)
print("TEST ARTICLE")
print("="*80)
print(f"\nTitle: {test_article.get('title', 'N/A')}")
print(f"Source: {test_article.get('source_title', 'N/A')}")
print(f"Published: {test_article.get('pub_date', 'N/A')}")
print(f"Content length: {len(test_article.get('content', ''))} characters")

# Generate summary
print("\nGenerating AI summary...")
summary = summarize_article(
    content=test_article.get("content", ""),
    title=test_article.get("title", "")
)

print(f"\n{'='*80}")
print("AI SUMMARY")
print("="*80)
print(summary)

## Step 5: Create Structured Output Function

Combine NewsDataHub metadata with AI summaries in a structured format.

In [None]:
def create_summary_output(article, summary):
    """
    Combine NewsDataHub article metadata with AI-generated summary.

    Args:
        article (dict): Original article from NewsDataHub
        summary (str): AI-generated summary

    Returns:
        dict: Structured output with metadata and summary
    """
    return {
        "id": article.get("id"),
        "title": article.get("title"),
        "source": article.get("source_title"),
        "published": article.get("pub_date"),
        "url": article.get("article_link"),
        "language": article.get("language"),
        "topics": article.get("topics") or [],  # normalize missing or null topics to []
        "original_content_length": len(article.get("content") or ""),  # tolerate null content
        "ai_summary": summary
    }

# Create structured output for test article
output = create_summary_output(test_article, summary)

print("="*80)
print("STRUCTURED OUTPUT (JSON)")
print("="*80)
print(json.dumps(output, indent=2))

## Step 6: Process Multiple Articles in Batch

Now process the first `NUM_ARTICLES_TO_PROCESS` filtered articles (5 by default) in a loop, as a small-scale version of a batch pipeline.

In [None]:
print("="*80)
print(f"PROCESSING {NUM_ARTICLES_TO_PROCESS} ARTICLES")
print("="*80)

summarized_articles = []

# The batch may be shorter than requested if few articles passed the filter
batch = filtered_articles[:NUM_ARTICLES_TO_PROCESS]

for i, article in enumerate(batch, 1):
    title = article.get("title") or "N/A"  # guard against a null title
    print(f"\n[{i}/{len(batch)}] Processing: {title[:60]}...")

    # Generate summary
    summary = summarize_article(
        content=article.get("content", ""),
        title=article.get("title") or ""
    )

    # Create structured output
    output = create_summary_output(article, summary)
    summarized_articles.append(output)

    print(f"    ✓ Summary generated ({len(summary)} characters)")

print(f"\n✓ Processed {len(summarized_articles)} articles")
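
Because `summarize_article` returns an error string instead of raising, a failed call still lands in `summarized_articles`. A minimal sketch for separating the two cases, relying on the `"Error generating summary:"` prefix that function produces. The `split_results` name and the sample records are made up for illustration.

```python
ERROR_PREFIX = "Error generating summary:"

def split_results(records):
    """Partition output records into (succeeded, failed) by the summary's error prefix."""
    succeeded, failed = [], []
    for record in records:
        summary = record.get("ai_summary") or ""
        if summary.startswith(ERROR_PREFIX):
            failed.append(record)
        else:
            succeeded.append(record)
    return succeeded, failed

demo = [
    {"id": 1, "ai_summary": "Lawmakers passed the bill on Tuesday."},
    {"id": 2, "ai_summary": "Error generating summary: timeout"},
]
ok, bad = split_results(demo)
print(len(ok), len(bad))  # 1 1
```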

## Step 7: Save Results to JSON File

In [None]:
# Save to JSON file
output_file = "summarized_articles.json"

with open(output_file, "w") as f:
    json.dump(summarized_articles, f, indent=2)

print(f"✓ Results saved to {output_file}")
print(f"  Total articles: {len(summarized_articles)}")
print(f"  File size: {os.path.getsize(output_file):,} bytes")
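
A quick way to confirm the saved file is valid JSON is to load it back. The sketch below writes to a temporary directory so it does not touch `summarized_articles.json`; the record is made up, and `ensure_ascii=False` is an optional choice that keeps accented characters readable in the file.

```python
import json
import os
import tempfile

records = [{"id": "demo-1", "title": "Café reopens", "ai_summary": "A short summary."}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "summaries.json")
    # Explicit UTF-8 makes writing non-ASCII text safe regardless of the OS default
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)
    # Read the file back to confirm it round-trips unchanged
    with open(path, "r", encoding="utf-8") as f:
        loaded = json.load(f)

print(loaded == records)  # True
```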

## Step 8: Display Summary Report

Let's create a clean, readable summary report.

In [None]:
print("\n" + "="*80)
print("SUMMARY REPORT")
print("="*80)

for i, article in enumerate(summarized_articles, 1):
    print(f"\n📰 Article {i}")
    print(f"   Title: {article['title']}")
    # Fall back to 'N/A' when the publication date is missing, so slicing cannot fail
    print(f"   Source: {article['source']} | Published: {(article['published'] or 'N/A')[:10]}")
    print(f"   Topics: {', '.join(article['topics']) if article['topics'] else 'N/A'}")
    print(f"\n   📝 AI Summary:")
    print(f"   {article['ai_summary']}\n")
    print(f"   🔗 Read full article: {article['url']}")
    print(f"   {'-'*76}")

print(f"\n✅ Generated {len(summarized_articles)} AI summaries using NewsDataHub + OpenAI")
print("\n⚠️  Reminder: AI-generated summaries may occasionally contain inaccuracies.")
print("   Always review outputs for critical applications.")

## Cost Estimation

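Let's estimate the OpenAI API cost for this batch. The prices are hard-coded in the cell, so check them against OpenAI's current pricing page before relying on them.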

In [None]:
print("💰 OpenAI API Cost Estimation")
print("="*80)
print("GPT-4o-mini pricing:")
print("  Input tokens:  ~$0.15 per 1M tokens")
print("  Output tokens: ~$0.60 per 1M tokens")
print(f"\nFor {len(summarized_articles)} articles:")
print("  Approximate cost: < $0.01 (one cent)")
print("\nSummarization with GPT-4o-mini is extremely affordable!")
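
The "less than one cent" figure can be checked with simple arithmetic. The sketch below uses the per-million-token prices printed above and made-up token counts (about 1,000 input and 100 output tokens per article); `estimate_cost` is a hypothetical helper. For real counts, each chat completion response exposes `response.usage.prompt_tokens` and `response.usage.completion_tokens`.

```python
# Prices in USD per 1M tokens, copied from the cell above (they may change)
INPUT_PRICE_PER_M = 0.15
OUTPUT_PRICE_PER_M = 0.60

def estimate_cost(prompt_tokens, completion_tokens):
    """Return the estimated USD cost for the given token counts."""
    total = prompt_tokens * INPUT_PRICE_PER_M + completion_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000

# Assumed workload: 5 articles, ~1,000 input and ~100 output tokens each
batch_cost = estimate_cost(prompt_tokens=5 * 1000, completion_tokens=5 * 100)
print(f"${batch_cost:.5f}")  # $0.00105
```

Under these assumptions the batch costs about a tenth of a cent, which is consistent with the estimate above.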

## Next Steps

**Expand the pipeline:**
- Process more articles (change `NUM_ARTICLES_TO_PROCESS`)
- Filter by specific topics or countries
- Adjust summary length (`max_tokens` parameter)
- Add retry logic for API failures
- Implement caching to avoid redundant API calls

**Learn more:**
- [Full Tutorial](https://newsdatahub.com/learning-center/article/ai-summarization-pipeline)
- [NewsDataHub API Docs](https://newsdatahub.com/docs)
- [OpenAI API Docs](https://platform.openai.com/docs)