# News API Data Ingestion Pipeline

This notebook collects finance-related news articles using NewsAPI.org.

**Features:**
- Search articles by keywords
- Filter by sources (Bloomberg, Reuters, WSJ, etc.)
- Clean and normalize article text
- Export to JSON format

**Note:** This is an exploration notebook. For production use, see `backend/app/pipelines/ingest_news.py`.

## 1. Setup and Imports

In [23]:
import os
import re
import json
from datetime import datetime, timedelta
from pathlib import Path
from typing import Dict, List, Any, Optional

# Load environment variables
try:
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    print("⚠️  python-dotenv not installed. Install with: pip install python-dotenv")

# NewsAPI library
try:
    from newsapi import NewsApiClient
    print("✓ newsapi-python installed")
except ImportError:
    print("❌ newsapi-python not installed. Install with: pip install newsapi-python")

print("✓ Imports complete")

✓ newsapi-python installed
✓ Imports complete


## 2. Configuration

In [24]:
# API Credentials (from .env file)
NEWS_API_KEY = os.getenv('NEWS_API_KEY')

# Search configuration
# Note: Free tier = 100 requests/day
KEYWORDS = [
    'stock market',
    'stocks',
    'earnings',
    'federal reserve',
    'interest rates',
    'inflation',
    'NVDA',
    'TSLA',
    'AAPL',
    'market crash',
    'bull market',
    'bear market'
]

# Financial news sources (NewsAPI source IDs)
SOURCES = [
    'bloomberg',
    'reuters',
    'the-wall-street-journal',
    'financial-times',
    'business-insider',
    'fortune',
    'cnbc'
]

MAX_ARTICLES = 100  # Per request (API limit)
LANGUAGE = 'en'
DAYS_BACK = 7  # Search last 7 days

# Output configuration
OUTPUT_DIR = Path('../data/processed/news')
RUN_ID = datetime.utcnow().strftime('%Y-%m-%d')

print(f"Output directory: {OUTPUT_DIR / RUN_ID}")
print(f"Keywords: {', '.join(KEYWORDS[:5])}... ({len(KEYWORDS)} total)")
print(f"Sources: {', '.join(SOURCES[:3])}... ({len(SOURCES)} total)")
print(f"Max articles: {MAX_ARTICLES}")
print(f"Date range: Last {DAYS_BACK} days")

Output directory: ..\data\processed\news\2025-11-02
Keywords: stock market, stocks, earnings, federal reserve, interest rates... (12 total)
Sources: bloomberg, reuters, the-wall-street-journal... (7 total)
Max articles: 100
Date range: Last 7 days
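
`datetime.utcnow()` is deprecated as of Python 3.12 and returns a naive datetime. A minimal sketch of timezone-aware equivalents for the run ID and the search window follows. The helper names `date_window` and `make_run_id` are hypothetical and not part of the notebook or the production script.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


def date_window(days_back: int, now: Optional[datetime] = None) -> Tuple[str, str]:
    """Return (from_date, to_date) as YYYY-MM-DD strings in UTC."""
    # `now` is injectable so a given window can be reproduced later
    now = now or datetime.now(timezone.utc)
    start = now - timedelta(days=days_back)
    return start.strftime('%Y-%m-%d'), now.strftime('%Y-%m-%d')


def make_run_id(now: Optional[datetime] = None) -> str:
    """Build a date-based run ID in the same format as RUN_ID."""
    now = now or datetime.now(timezone.utc)
    return now.strftime('%Y-%m-%d')
```

Passing a fixed `now` keeps the run ID and the date range consistent when a run crosses midnight UTC.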


## 3. Helper Functions

In [25]:
# URL pattern for cleaning
_URL_RE = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')

def clean_text(txt: Optional[str]) -> str:
    """
    Remove HTML tags, URLs, and normalize whitespace.
    
    Args:
        txt: Input text string
        
    Returns:
        Cleaned text
    """
    if not txt:
        return ''
    
    # Remove HTML tags
    txt = re.sub(r'<[^>]+>', '', txt)
    
    # Remove URLs
    txt = _URL_RE.sub('', txt)
    
    # Remove [+XXX chars] artifacts from NewsAPI
    txt = re.sub(r'\[\+\d+ chars\]', '', txt)
    
    # Remove [Removed] markers
    txt = re.sub(r'\[Removed\]', '', txt, flags=re.IGNORECASE)
    
    # Normalize whitespace
    txt = re.sub(r'\s+', ' ', txt)
    
    return txt.strip()

# Test the function
test_article = "<p>NVDA earnings beat! <a href='http://example.com'>Read more</a> [+1234 chars]</p>"
print(f"Original: {test_article}")
print(f"Cleaned:  {clean_text(test_article)}")

Original: <p>NVDA earnings beat! <a href='http://example.com'>Read more</a> [+1234 chars]</p>
Cleaned:  NVDA earnings beat! Read more
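
`clean_text` strips tags and URLs but leaves HTML entities such as `&amp;` untouched. If entities turn up in the collected text, one option is to decode them with the standard library's `html.unescape` after the tags are removed. The sketch below assumes that ordering. `strip_then_unescape` is a hypothetical helper, not part of the notebook.

```python
import html
import re


def strip_then_unescape(txt: str) -> str:
    """Drop HTML tags, decode entities, and collapse whitespace."""
    # Remove tags first so that decoded '<' and '>' characters are not
    # mistaken for markup by the tag regex
    txt = re.sub(r'<[^>]+>', '', txt)
    txt = html.unescape(txt)  # e.g. &amp; -> &, &#39; -> '
    return re.sub(r'\s+', ' ', txt).strip()
```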


In [26]:
def normalize_article(article: Dict[str, Any]) -> Dict[str, Any]:
    """
    Extract and normalize fields from a NewsAPI article object.
    
    Args:
        article: NewsAPI article dictionary
        
    Returns:
        Dictionary with normalized article data
    """
    # NewsAPI can return explicit nulls, so fall back with `or`
    # instead of relying on the .get() default
    source = article.get('source') or {}
    
    return {
        'source_id': source.get('id'),
        'source_name': source.get('name'),
        'author': article.get('author'),
        'title': article.get('title') or '',
        'description': article.get('description') or '',
        'url': article.get('url'),
        'url_to_image': article.get('urlToImage'),
        'published_at': article.get('publishedAt'),
        'content': article.get('content') or '',
        'clean_title': clean_text(article.get('title')),
        'clean_description': clean_text(article.get('description')),
        'clean_content': clean_text(article.get('content')),
    }

print("✓ Helper functions defined")

✓ Helper functions defined
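
`normalize_article` keeps `published_at` as the raw string from the API. The timestamps in this notebook's output look like `2025-11-01T11:08:01Z`. A small sketch for turning them into timezone-aware datetimes follows, assuming that format holds for every article. `parse_published_at` is a hypothetical helper.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_published_at(value: Optional[str]) -> Optional[datetime]:
    """Parse a UTC timestamp such as '2025-11-01T11:08:01Z'."""
    if not value:
        return None
    try:
        # strptime is used because datetime.fromisoformat only accepts a
        # trailing 'Z' from Python 3.11 onward
        parsed = datetime.strptime(value, '%Y-%m-%dT%H:%M:%SZ')
    except ValueError:
        return None  # Unexpected format: let the caller decide what to do
    return parsed.replace(tzinfo=timezone.utc)
```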


In [27]:
def build_query(keywords: List[str]) -> str:
    """
    Build a NewsAPI search query from keywords.
    
    Args:
        keywords: List of keywords to search for
        
    Returns:
        Query string
    """
    # Quote multi-word phrases, leave single words unquoted
    terms = [f'"{k}"' if ' ' in k else k for k in keywords]
    
    # Join with OR for broad coverage
    query = ' OR '.join(terms)
    
    return query

# Test the query builder
test_query = build_query(KEYWORDS[:5])
print(f"Query: {test_query}")

Query: "stock market" OR stocks OR earnings OR "federal reserve" OR "interest rates"
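
NewsAPI documents a maximum length for the `q` parameter. The 500-character default below reflects the documentation at the time of writing and should be treated as an assumption to verify. If the keyword list grows, the query can be split into several shorter ones, as in this sketch. `build_queries` is a hypothetical variant of `build_query`.

```python
from typing import List


def build_queries(keywords: List[str], max_len: int = 500) -> List[str]:
    """Split keywords into OR-joined queries no longer than max_len."""
    queries: List[str] = []
    current: List[str] = []
    for k in keywords:
        # Same quoting rule as build_query: quote multi-word phrases only
        term = f'"{k}"' if ' ' in k else k
        candidate = ' OR '.join(current + [term])
        if current and len(candidate) > max_len:
            # Adding this term would exceed the limit: start a new query
            queries.append(' OR '.join(current))
            current = [term]
        else:
            current.append(term)
    if current:
        queries.append(' OR '.join(current))
    return queries
```

Each extra query costs one request, which matters on a plan limited to 100 requests per day.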


In [28]:
def filter_quality_articles(articles: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """
    Remove low-quality, paywalled, or removed articles.
    
    Args:
        articles: List of article dictionaries
        
    Returns:
        Filtered list of quality articles
    """
    filtered = []
    
    for article in articles:
        # Skip if title is missing or too short
        if not article['clean_title'] or len(article['clean_title']) < 10:
            continue
        
        # Skip if marked as [Removed]
        if '[removed]' in (article.get('content') or '').lower():
            continue
        
        # Skip if content is too short (likely paywalled)
        if article['clean_content'] and len(article['clean_content']) < 100:
            continue
        
        # Skip if no description or content
        if not article['clean_description'] and not article['clean_content']:
            continue
        
        filtered.append(article)
    
    return filtered


def deduplicate_articles(articles: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """
    Remove duplicate articles by URL and by exact (case-insensitive) title match.
    
    Args:
        articles: List of article dictionaries
        
    Returns:
        Deduplicated list of articles
    """
    seen_urls = set()
    seen_titles = set()
    deduped = []
    
    for article in articles:
        url = article['url']
        title = article['clean_title'].lower()
        
        # Skip if we've seen this URL
        if url and url in seen_urls:
            continue
        
        # Skip if we've seen this exact title (case-insensitive)
        if title in seen_titles:
            continue
        
        seen_urls.add(url)
        seen_titles.add(title)
        deduped.append(article)
    
    return deduped


print("✓ Quality filter functions defined")

✓ Quality filter functions defined
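
`deduplicate_articles` compares titles by exact match after lowercasing, so near-identical headlines, such as the same wire story with different punctuation, pass through. One possible extension is fuzzy matching with the standard library's `difflib.SequenceMatcher`. The function name and the 0.9 threshold below are illustrative assumptions, not tuned values.

```python
from difflib import SequenceMatcher
from typing import Any, Dict, List


def dedupe_similar_titles(
    articles: List[Dict[str, Any]],
    threshold: float = 0.9,
) -> List[Dict[str, Any]]:
    """Keep the first article of each group of near-identical titles."""
    kept: List[Dict[str, Any]] = []
    kept_titles: List[str] = []
    for article in articles:
        title = (article.get('clean_title') or '').lower()
        # Compare against every title kept so far (quadratic, which is
        # acceptable for a few hundred articles)
        if any(SequenceMatcher(None, title, seen).ratio() >= threshold
               for seen in kept_titles):
            continue
        kept_titles.append(title)
        kept.append(article)
    return kept
```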


## 4. Initialize NewsAPI Client

In [29]:
def initialize_news_client() -> NewsApiClient:
    """
    Initialize and authenticate NewsAPI client.
    
    Returns:
        Authenticated NewsApiClient
    """
    if not NEWS_API_KEY:
        raise ValueError(
            'Missing NewsAPI credentials. Set NEWS_API_KEY '
            'environment variable or create a .env file.'
        )
    
    client = NewsApiClient(api_key=NEWS_API_KEY)
    
    print("✓ NewsAPI client initialized")
    return client

# Initialize client
try:
    newsapi = initialize_news_client()
except Exception as e:
    print(f"❌ Error: {e}")
    newsapi = None

✓ NewsAPI client initialized


## 5. Fetch Articles

In [30]:
def fetch_articles(
    newsapi: NewsApiClient,
    keywords: List[str],
    sources: Optional[List[str]] = None,
    days_back: int = 7,
    max_results: int = 100,
    language: str = 'en'
) -> List[Dict[str, Any]]:
    """
    Fetch articles matching keywords with quality filtering.
    
    Args:
        newsapi: Authenticated NewsApiClient
        keywords: List of keywords to search for
        sources: Optional list of source IDs to filter by
        days_back: Number of days to search back
        max_results: Maximum number of articles to fetch
        language: Language code
        
    Returns:
        List of normalized article dictionaries
    """
    query = build_query(keywords)
    print(f"Search query: {query[:100]}..." if len(query) > 100 else f"Search query: {query}")
    print(f"Fetching up to {max_results} articles...")
    
    # Calculate date range (sent as YYYY-MM-DD, one of the ISO 8601 forms NewsAPI accepts)
    to_date = datetime.utcnow()
    from_date = to_date - timedelta(days=days_back)
    
    print(f"Date range: {from_date.date()} to {to_date.date()}")
    
    try:
        # Call NewsAPI everything endpoint
        response = newsapi.get_everything(
            q=query,
            sources=','.join(sources) if sources else None,
            from_param=from_date.strftime('%Y-%m-%d'),
            to=to_date.strftime('%Y-%m-%d'),
            language=language,
            sort_by='publishedAt',
            page_size=min(max_results, 100)  # API limit is 100 per page
        )
        
        if response['status'] != 'ok':
            print(f"⚠️  API returned status: {response['status']}")
            return []
        
        articles = response.get('articles', [])
        print(f"✓ Fetched {len(articles)} raw articles")
        
        # Normalize articles
        normalized = [normalize_article(a) for a in articles]
        
        # Apply quality filters
        print("Applying quality filters...")
        filtered = filter_quality_articles(normalized)
        print(f"  After quality filter: {len(filtered)} articles")
        
        deduped = deduplicate_articles(filtered)
        print(f"  After deduplication: {len(deduped)} articles")
        
        return deduped
        
    except Exception as e:
        print(f"❌ Error fetching articles: {e}")
        return []

# Fetch articles with quality filtering
if newsapi:
    articles = fetch_articles(
        newsapi, 
        KEYWORDS, 
        SOURCES,
        DAYS_BACK, 
        MAX_ARTICLES, 
        LANGUAGE
    )
    print(f"\n✓ Total quality articles collected: {len(articles)}")
else:
    print("❌ Client not initialized. Cannot fetch articles.")
    articles = []


Search query: "stock market" OR stocks OR earnings OR "federal reserve" OR "interest rates" OR inflation OR NVDA O...
Fetching up to 100 articles...
Date range: 2025-10-26 to 2025-11-02
✓ Fetched 100 raw articles
Applying quality filters...
  After quality filter: 100 articles
  After deduplication: 100 articles

✓ Total quality articles collected: 100
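
`fetch_articles` requests a single page, so it never returns more than 100 articles. `get_everything` also accepts a `page` argument, which makes a paging loop possible. The following is a rough sketch only: `fetch_pages` is a hypothetical helper, `client` is assumed to behave like `NewsApiClient`, and some NewsAPI plans may cap the total number of results regardless of paging.

```python
from typing import Any, Dict, List


def fetch_pages(
    client: Any,
    max_results: int,
    page_size: int = 100,
    **params: Any,
) -> List[Dict[str, Any]]:
    """Collect up to max_results raw articles across several pages."""
    collected: List[Dict[str, Any]] = []
    page = 1
    while len(collected) < max_results:
        response = client.get_everything(page=page, page_size=page_size, **params)
        batch = response.get('articles') or []
        collected.extend(batch)
        # Stop on a short page or once everything the API reports is fetched
        if len(batch) < page_size or len(collected) >= response.get('totalResults', 0):
            break
        page += 1
    return collected[:max_results]
```

Every page is a separate request, so a 300-article fetch uses three of the day's requests.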


## 6. Preview Data

In [31]:
# Display first few articles
if articles:
    print(f"\nFirst 3 articles:\n")
    for i, article in enumerate(articles[:3], 1):
        print(f"{i}. {article['clean_title'][:80]}...")
        print(f"   Source: {article['source_name']}")
        print(f"   Published: {article['published_at']}")
        print(f"   Description: {article['clean_description'][:100]}...")
        print()


First 3 articles:

1. How an 'accidental banker' is turning this LA-based investment bank into one of ...
   Source: Business Insider
   Published: 2025-11-01T11:08:01Z
   Description: Business Insider spoke with Houlihan Lokey's CEO about how he plans to ride the M&A rebound and the ...

2. More than half of Gen Z says they only use cash as ‘a last resort’ and doing so ...
   Source: Fortune
   Published: 2025-11-01T10:02:00Z
   Description: Some Gen Zers are so against using cash they’ll forgo shopping from stores that are cash only....

3. How Amazon flipped the script on a challenging week...
   Source: Business Insider
   Published: 2025-11-01T09:24:02Z
   Description: Wall Street loved what it saw from Amazon this week. Yes, despite the layoffs....



## 7. Export to JSON

In [32]:
def export_to_json(
    articles: List[Dict[str, Any]],
    output_dir: Path,
    run_id: str
) -> str:
    """
    Export articles to JSON file.
    
    Args:
        articles: List of article dictionaries
        output_dir: Output directory path
        run_id: Run identifier
        
    Returns:
        Path to output JSON file
    """
    if not articles:
        print("⚠️  No articles to export")
        return ""
    
    # Create output directory
    run_dir = output_dir / run_id
    run_dir.mkdir(parents=True, exist_ok=True)
    
    # Output file path
    output_file = run_dir / f'news_finance_{run_id}.json'
    
    # Write to JSON
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(articles, f, ensure_ascii=False, indent=2)
    
    print(f"✓ Exported {len(articles)} articles to {output_file}")
    
    # Also save metadata
    meta_file = run_dir / f'news_finance_{run_id}_meta.json'
    metadata = {
        'run_id': run_id,
        'timestamp': datetime.utcnow().isoformat(),
        'total_articles': len(articles),
        'sources': list(set(a['source_name'] for a in articles if a['source_name'])),
        'date_range_days': DAYS_BACK,
    }
    with open(meta_file, 'w', encoding='utf-8') as f:
        json.dump(metadata, f, ensure_ascii=False, indent=2)
    
    print(f"✓ Saved metadata to {meta_file}")
    
    return str(output_file)

# Export data
if articles:
    output_path = export_to_json(articles, OUTPUT_DIR, RUN_ID)
    print(f"\n✓ Pipeline complete!")
else:
    print("\n⚠️  No data to export")

✓ Exported 100 articles to ..\data\processed\news\2025-11-02\news_finance_2025-11-02.json
✓ Saved metadata to ..\data\processed\news\2025-11-02\news_finance_2025-11-02_meta.json

✓ Pipeline complete!
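
Before handing the export to a downstream step, it can help to read the file back and confirm that each record has the expected fields. A minimal sketch follows. `load_articles` and the choice of `REQUIRED_KEYS` are assumptions for illustration, based on the keys `normalize_article` produces.

```python
import json
from pathlib import Path
from typing import Any, Dict, List

# Assumed minimum set of fields a downstream consumer needs
REQUIRED_KEYS = {'url', 'published_at', 'clean_title'}


def load_articles(path: Path) -> List[Dict[str, Any]]:
    """Load an exported JSON file and check each record's keys."""
    with open(path, encoding='utf-8') as f:
        articles = json.load(f)
    for i, article in enumerate(articles):
        missing = REQUIRED_KEYS - set(article)
        if missing:
            raise ValueError(f'Article {i} is missing keys: {sorted(missing)}')
    return articles
```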


## 8. Data Summary

In [33]:
# Display statistics
if articles:
    sources_count = {}
    for article in articles:
        source = article['source_name']
        if source:
            sources_count[source] = sources_count.get(source, 0) + 1
    
    print("\n📊 Summary Statistics:")
    print(f"Total articles: {len(articles)}")
    print(f"\nArticles by source:")
    for source, count in sorted(sources_count.items(), key=lambda x: x[1], reverse=True):
        print(f"  {source}: {count}")
    
    # Most recent article
    most_recent = max(articles, key=lambda a: a['published_at'] or '')
    print(f"\nMost recent article:")
    print(f"  {most_recent['clean_title'][:100]}...")
    print(f"  Published: {most_recent['published_at']}")
    print(f"  Source: {most_recent['source_name']}")


📊 Summary Statistics:
Total articles: 100

Articles by source:
  Business Insider: 59
  Fortune: 33
  Bloomberg: 4
  The Wall Street Journal: 4

Most recent article:
  How an 'accidental banker' is turning this LA-based investment bank into one of the biggest deal mac...
  Published: 2025-11-01T11:08:01Z
  Source: Business Insider
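
The per-source tally above can also be written with `collections.Counter`, which handles both the counting and the descending sort. This is an equivalent sketch rather than a change to the notebook's logic. `count_by_source` is a hypothetical name.

```python
from collections import Counter
from typing import Any, Dict, List, Tuple


def count_by_source(articles: List[Dict[str, Any]]) -> List[Tuple[str, int]]:
    """Return (source_name, count) pairs, most frequent first."""
    counts = Counter(
        a['source_name'] for a in articles if a.get('source_name')
    )
    # most_common() sorts by count in descending order
    return counts.most_common()
```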


## Next Steps

1. **Review the collected data** in the JSON file
2. **Adjust keywords or sources** if needed for better coverage
3. **Convert to production script** once satisfied with results
4. **Add to pipeline** alongside Reddit and Twitter ingestion

See `backend/app/pipelines/ingest_news.py` for the production version.