# S&P 500 Financial News Collector

## Overview
This Python tool retrieves historical financial news for all S&P 500 companies using the EODHD API. It first obtains the complete list of S&P 500 companies directly from EODHD, then systematically collects news articles for each company while carefully managing API rate limits.

## Requirements
- Python 3.6+
- Third-party package (install with `pip`):
  - `eodhd`: EODHD API client
- Standard-library modules (no installation needed):
  - `json`: For data handling and storage
  - `time`: For timing operations and delays
  - `datetime`: For timestamp generation
  - `os`: For file and directory operations

## Installation
```bash
pip install eodhd
```

## Configuration
The script uses the following configurable parameters:

```python
# API Configuration
KEY = "your_eodhd_api_key"  # Replace with your actual API key

# Data Collection Settings
FROM_DATE = "2019-01-01"     # Start date for collecting news 
LIMIT = 1000                 # Maximum results per API call (API limit is 1000)
MAX_ITERATIONS_PER_COMPANY = 25  # Maximum pagination iterations per company

# API Limits
CALLS_PER_MINUTE_LIMIT = 1000  # API rate limit per minute
CALLS_PER_DAY_LIMIT = 100000   # API rate limit per day (the script listing in this document uses a lower cap of 80000)

# Output Settings
OUTPUT_DIR = "sp500_news_data"  # Directory for storing collected data
```
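
Because the key sits in the same file as the rest of the configuration, it is easy to commit by accident. One possible alternative, sketched here under assumptions, is to read it from an environment variable. The variable name `EODHD_API_KEY` and the helper `load_api_key` are hypothetical and not part of the original script.

```python
import os

def load_api_key(env_var="EODHD_API_KEY"):
    """Return the API key stored in the given environment variable.

    The variable name is an assumption for illustration; use whatever
    name fits your deployment.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your EODHD API key"
        )
    return key
```

With this in place, the configuration line could become `KEY = load_api_key()`.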

## Functionality

### 1. S&P 500 Ticker Retrieval
The script retrieves S&P 500 constituent companies through the EODHD API using the `get_fundamentals_data` method with the S&P 500 index symbol (`GSPC.INDX`). The retrieved data contains a 'Components' section listing all current S&P 500 companies.
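
The shape of the 'Components' section can be illustrated without calling the API. The sketch below is a simplified variant of the extraction step and runs on a hand-written sample; the sample dictionary and the `extract_tickers` name are assumptions for illustration, not real API output.

```python
def extract_tickers(components):
    """Collect ticker codes from a 'Components' payload (dict or list)."""
    if isinstance(components, dict):
        entries = list(components.values())
    elif isinstance(components, list):
        entries = components
    else:
        return []

    tickers = []
    for entry in entries:
        if isinstance(entry, dict) and "Code" in entry:
            tickers.append(entry["Code"])
        elif isinstance(entry, str):
            tickers.append(entry)
    return tickers

# Hypothetical sample shaped like a 'Components' mapping; not real API output.
sample = {
    "0": {"Code": "AAPL", "Exchange": "US"},
    "1": {"Code": "MSFT", "Exchange": "US"},
}
print(extract_tickers(sample))
```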

### 2. News Collection Process
For each company in the S&P 500:
- Retrieves news articles in batches of 1000 (maximum allowed by API)
- Uses pagination with incremental offset values to retrieve all available articles
- Collects news starting from the configured start date
- Stops collection for a company when either:
  - No more news articles are available
  - The maximum number of iterations is reached
  - API rate limits are approaching
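
The pagination logic described above can be sketched independently of the API. In this minimal example, `fetch_page` is a stand-in for the real news call and `fake_fetch` serves slices of an in-memory list; both names are hypothetical, and the rate-limit stop condition is left out for brevity.

```python
def collect_all(fetch_page, limit=1000, max_iterations=25):
    """Page through results with an increasing offset until the data runs out."""
    items = []
    for i in range(max_iterations):
        batch = fetch_page(offset=i * limit, limit=limit)
        if not batch:
            break  # No more articles available
        items.extend(batch)
        if len(batch) < limit:
            break  # A short page means this was the last one
    return items

# Stand-in for the API: 2,350 fake articles served by slicing a list.
fake_articles = [{"id": n} for n in range(2350)]

def fake_fetch(offset, limit):
    return fake_articles[offset:offset + limit]

print(len(collect_all(fake_fetch)))  # 2350, fetched as pages of 1000, 1000 and 350
```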

### 3. Rate Limit Management
The script manages rate limits as follows:
- Tracks API calls per minute and total calls
- Pauses execution when approaching the per-minute limit
- Stops execution when approaching the daily limit
- Waits 0.5 seconds between requests and 1 second between companies

### 4. Data Storage
News articles are stored in the following structure:
- A dedicated output directory (`sp500_news_data` by default)
- Individual JSON files for each company, named `<TICKER>_news.json` with any `.` in the ticker replaced by `_` (e.g., `AAPL_US_news.json` for `AAPL.US`)
- A summary JSON file containing information about all processed companies
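
As a minimal sketch of this layout, the helpers below derive a file name from a ticker and write a list of articles to it. The function names `news_filename` and `save_company_news` are hypothetical; the naming rule and JSON settings follow the script.

```python
import json
import os

def news_filename(ticker):
    """Build the per-company file name; dots in the ticker become underscores."""
    return f"{ticker.replace('.', '_')}_news.json"

def save_company_news(output_dir, ticker, articles):
    """Write one company's articles as a JSON array and return the file path."""
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, news_filename(ticker))
    with open(path, "w", encoding="utf-8") as f:
        json.dump(articles, f, ensure_ascii=False, indent=4)
    return path
```

For example, `save_company_news("sp500_news_data", "AAPL.US", articles)` would write to `sp500_news_data/AAPL_US_news.json`.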

## Usage

1. Set your API key and desired configuration parameters at the top of the script
2. Run the script:
```bash
python sp500_news_collector.py
```

3. Monitor the console output for progress updates
4. Check the output directory for collected data

## Output Files

### Company News Files
Each company's news is stored in a separate JSON file named after its ticker symbol, with any `.` replaced by `_` (so `AAPL.US` becomes `AAPL_US_news.json`):
```
sp500_news_data/AAPL_US_news.json
sp500_news_data/MSFT_US_news.json
sp500_news_data/AMZN_US_news.json
...
```

Each news file contains an array of news article objects with fields like:
- `date`: Publication timestamp
- `title`: Article title
- `content`: Full article text
- `link`: URL to the original article
- `symbols`: Related stock symbols
- `sentiment`: Sentiment analysis data (if available)
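
A stored news file can be inspected with a few lines of Python. The sketch below counts articles per publication year; it assumes that `date` is a string beginning with a four-digit year (as in an ISO 8601 timestamp), which should be checked against your own data. The helper name `articles_per_year` is hypothetical.

```python
import json
from collections import Counter

def articles_per_year(path):
    """Count articles by the year prefix of their 'date' field."""
    with open(path, encoding="utf-8") as f:
        articles = json.load(f)
    # Assumption: 'date' starts with YYYY; entries without a date are skipped.
    return Counter(str(a["date"])[:4] for a in articles if a.get("date"))
```

Calling `articles_per_year("sp500_news_data/AAPL_US_news.json")` returns a `Counter` keyed by year strings.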

### Summary File
A single summary file is written to a fixed path and overwritten after each company:
```
sp500_news_data/sp500_news_summary.json
```

The summary file contains:
- Timestamp of the collection
- Number of companies processed
- Total companies in the S&P 500
- Total news items collected
- Total API calls made
- Company-specific information:
  - Symbol
  - News count
  - Output filename
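
To show how the company-specific entries might be used, the sketch below ranks companies by news count. The key names (`company_summaries`, `symbol`, `news_count`, `file`) are the ones the script writes; the helper `top_companies` is hypothetical, and the sample values are taken from the console output shown at the end of this document.

```python
def top_companies(summary, n=5):
    """Return the n (symbol, news_count) pairs with the most articles."""
    entries = summary.get("company_summaries", [])
    ranked = sorted(entries, key=lambda e: e["news_count"], reverse=True)
    return [(e["symbol"], e["news_count"]) for e in ranked[:n]]

# Sample summary built from the first two companies in the console output.
sample_summary = {
    "companies_processed": 2,
    "company_summaries": [
        {"symbol": "AIZ", "news_count": 743, "file": "AIZ_news.json"},
        {"symbol": "MNST", "news_count": 1095, "file": "MNST_news.json"},
    ],
}
print(top_companies(sample_summary, n=1))  # [('MNST', 1095)]
```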

## Error Handling

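The script handles errors as follows:

1. **API Connection Issues**: Catches and prints API errors, then moves on to the next batch after a 5-second delay
2. **Rate Limit Handling**: Treats any exception whose message contains "rate" or "limit" as a rate-limit error and pauses for 60 seconds
3. **Data Structure Validation**: Checks that the index response contains a 'Components' section before extracting tickers
4. **Ticker Retrieval Failure**: Prints the error and returns an empty ticker list if S&P 500 retrieval fails, so no companies are processed (the script listing in this document does not include a fallback ticker list)
5. **Progress Preservation**: Saves summary after each company to preserve progress in case of interruption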
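
The saved summary could also be used to resume an interrupted run, although the script as listed always starts from the first ticker. The following is a sketch of that idea under stated assumptions: `remaining_tickers` is a hypothetical helper, and it relies only on the `company_summaries` and `symbol` keys that the script writes.

```python
import json
import os

def remaining_tickers(all_tickers, summary_path):
    """Return the tickers that do not yet appear in the saved summary."""
    if not os.path.exists(summary_path):
        return list(all_tickers)  # Nothing saved yet: process everything
    with open(summary_path, encoding="utf-8") as f:
        summary = json.load(f)
    done = {entry["symbol"] for entry in summary.get("company_summaries", [])}
    return [t for t in all_tickers if t not in done]
```

A resumed run would also need to load the existing `company_summaries` before appending to it, because the script rewrites the summary file from its in-memory list.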

## Troubleshooting

### Common Issues:

1. **API Key Invalid**
   - Verify your API key is correct and has appropriate permissions

2. **Rate Limit Exceeded**
   - The script should automatically manage rate limits
   - If you still encounter issues, consider decreasing `MAX_ITERATIONS_PER_COMPANY`

3. **Memory Issues With Large Datasets**
   - If processing all S&P 500 companies causes memory problems, consider running the script in batches
   - Modify the script to process a subset of companies at a time

4. **No News Found for Certain Companies**
   - Some companies may have limited or no news coverage
   - Verify the ticker symbol format is correct (for example, whether the API expects the `.US` exchange suffix for that symbol)
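
Running in batches, as suggested for memory issues, amounts to splitting the ticker list into fixed-size chunks. A minimal sketch follows; the helper name `chunked` and the batch size of 100 are assumptions for illustration.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Placeholder symbols standing in for the 503 tickers returned by the API.
tickers = [f"T{n}" for n in range(503)]
batches = list(chunked(tickers, 100))
print(len(batches), len(batches[-1]))  # 6 3
```

Each batch could then be passed to a separate run of the collection loop.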

## API Rate Limit Management

The script implements two levels of rate limit protection:

1. **Per-Minute Limit (1,000 calls/minute)**
   - Counts calls in consecutive 60-second windows, resetting the counter when a window elapses
   - Pauses until the window ends once the count is within 5 calls of the limit

2. **Daily Limit (100,000 calls/day)**
   - Tracks total API calls made during the run
   - Stops execution when the total is within 1,000 calls of `CALLS_PER_DAY_LIMIT`

This approach ensures responsible API usage while maximizing data collection efficiency.
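
The two levels can be sketched as a small class with an injectable clock, which makes the behaviour testable without waiting. This is an illustrative sketch, not the script's own code: `FixedWindowLimiter` is a hypothetical name, and unlike the script it applies no safety buffer below the limits.

```python
import time

class FixedWindowLimiter:
    """Per-minute limit over consecutive 60-second windows, plus a total budget."""

    def __init__(self, per_minute, per_day, clock=time.time, sleep=time.sleep):
        self.per_minute = per_minute
        self.per_day = per_day
        self.clock = clock
        self.sleep = sleep
        self.window_start = clock()
        self.window_calls = 0
        self.total_calls = 0

    def acquire(self):
        """Return False once the total budget is spent; otherwise wait if
        the current window is full, then count the call and return True."""
        if self.total_calls >= self.per_day:
            return False
        now = self.clock()
        if now - self.window_start >= 60:
            # The previous window has elapsed: start a new one.
            self.window_start = now
            self.window_calls = 0
        if self.window_calls >= self.per_minute:
            # Window is full: sleep until it ends, then start a new one.
            self.sleep(max(0.0, 60 - (now - self.window_start)))
            self.window_start = self.clock()
            self.window_calls = 0
        self.window_calls += 1
        self.total_calls += 1
        return True
```

A caller would check `limiter.acquire()` before each request and stop the run when it returns `False`.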

## Example Collection Statistics

For a complete run of all S&P 500 companies with default settings:

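- **Total API Calls**: at most 12,576 counted calls with the defaults (503 companies × 25 iterations, plus one call for the constituent list); the actual number depends on news volume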
- **Total Runtime**: 3-5 hours (depends on news volume and rate limit pauses)
- **Total Data Collected**: ~5-15GB (depends on news volume)
- **Average News per Company**: ~100-300 articles (highly variable)

## Notes and Best Practices

1. **Run During Off-Hours**: For uninterrupted execution, run the script during low-activity periods
2. **Incremental Collection**: Consider implementing date-based incremental collection for regular updates
3. **Data Backup**: Regularly back up collected data to prevent loss
4. **Monitoring**: Monitor the console output to track progress and identify any issues
5. **API Key Security**: Never share your API key or commit it to public repositories
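
For the incremental collection mentioned in item 2, one possible approach is to restart from the date of the newest article already stored. The sketch below assumes each article's `date` string begins with `YYYY-MM-DD`; the helper `next_from_date` is hypothetical and not part of the script.

```python
from datetime import datetime

def next_from_date(articles, default="2019-01-01"):
    """Return the YYYY-MM-DD date of the newest stored article, or the default."""
    dates = []
    for article in articles:
        raw = article.get("date")
        if not raw:
            continue
        try:
            # Assumption: the first 10 characters are an ISO calendar date.
            dates.append(datetime.strptime(raw[:10], "%Y-%m-%d").date())
        except (ValueError, TypeError):
            continue  # Skip entries whose date cannot be parsed
    return max(dates).isoformat() if dates else default

stored = [
    {"date": "2023-12-31T08:00:00+00:00"},
    {"date": "2024-03-01T10:00:00+00:00"},
]
print(next_from_date(stored))  # 2024-03-01
```

Because the returned date is inclusive, articles from that day would be fetched again, so a follow-up step that removes duplicates (for example by `link`) would be needed.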

## Script Source
from eodhd import APIClient
import json
import time
from datetime import datetime
import os

# Initialize the API client with your API key
KEY = "your_eodhd_api_key"  # Replace with your actual API key; never commit a real key
api = APIClient(KEY)

# Configuration
FROM_DATE = "2019-01-01"
LIMIT = 1000  # Maximum allowed per request
MAX_ITERATIONS_PER_COMPANY = 25
CALLS_PER_MINUTE_LIMIT = 1000
CALLS_PER_DAY_LIMIT = 80000
OUTPUT_DIR = "sp500_news_data"  # Directory to store output files

# Create output directory if it doesn't exist
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Initialize tracking variables
api_calls = 0
start_time = time.time()
minute_start_time = start_time
minute_calls = 0
processed_companies = 0
total_news_items = 0

# Function to enforce rate limits
def check_rate_limits():
    global minute_calls, minute_start_time, api_calls
    
    current_time = time.time()
    elapsed_minute = current_time - minute_start_time
    
    # Reset minute counter if a minute has passed
    if elapsed_minute >= 60:
        minute_calls = 0
        minute_start_time = current_time
    
    # If we're close to the per-minute limit, wait until the next minute
    if minute_calls >= CALLS_PER_MINUTE_LIMIT - 5:  # Buffer of 5 calls
        wait_time = 60 - elapsed_minute
        print(f"Approaching per-minute rate limit. Waiting {wait_time:.2f} seconds...")
        time.sleep(wait_time + 1)  # Add 1 second buffer
        minute_calls = 0
        minute_start_time = time.time()
    
    # Check if we're approaching the daily limit
    if api_calls >= CALLS_PER_DAY_LIMIT - 1000:  # Buffer of 1000 calls
        print("WARNING: Approaching daily API call limit!")
        return False
    
    return True

# Function to fetch news for a single company
def fetch_company_news(symbol):
    global api_calls, minute_calls, total_news_items
    
    company_news = []
    
    print(f"\nStarting data collection for {symbol} from {FROM_DATE}")
    
    # Main loop for pagination
    for i in range(MAX_ITERATIONS_PER_COMPANY):
        offset = i * LIMIT
        
        # Check rate limits
        if not check_rate_limits():
            print(f"Stopping further requests for {symbol} due to API limits.")
            break
        
        try:
            print(f"Company {symbol} - Batch {i+1}/{MAX_ITERATIONS_PER_COMPANY}: Fetching news with offset {offset}...")
            
            # Make the API call
            news_batch = api.financial_news(
                s=symbol, 
                from_date=FROM_DATE, 
                offset=str(offset), 
                limit=str(LIMIT)
            )
            
            # Update counters
            api_calls += 1
            minute_calls += 1
            
            # Check if we got any results
            if not news_batch:
                print(f"No more news items found for {symbol} after offset {offset}.")
                break
            
            # Add the batch to our collection
            company_news.extend(news_batch)
            total_news_items += len(news_batch)
            
            # Display progress
            print(f"Retrieved {len(news_batch)} news items for {symbol}. Total for this company: {len(company_news)}")
            
            # If we got fewer items than requested, we've reached the end
            if len(news_batch) < LIMIT:
                print(f"Reached the end of available news for {symbol} at offset {offset}.")
                break
                
        except Exception as e:
            print(f"Error for {symbol} at offset {offset}: {str(e)}")
            
            # If the error might be rate-limit related, pause
            if "rate" in str(e).lower() or "limit" in str(e).lower():
                print("Possible rate limit reached. Pausing for 60 seconds...")
                time.sleep(60)
                minute_calls = 0
                minute_start_time = time.time()
            else:
                # For other errors, add a small delay before continuing
                print("Continuing to next batch after a short delay...")
                time.sleep(5)
        
        # Add a small delay between requests to be courteous
        time.sleep(0.5)
    
    return company_news

# Function to get S&P 500 tickers using get_fundamentals_data
def get_sp500_tickers():
    global api_calls, minute_calls
    
    try:
        print("Fetching S&P 500 constituent companies via EODHD API...")
        
        # Get S&P 500 fundamentals data which includes Components
        sp500_data = api.get_fundamentals_data('GSPC.INDX')
        
        # Count this request toward the rate limits
        api_calls += 1
        minute_calls += 1
        
        # Check if we got valid data with Components
        if not sp500_data or 'Components' not in sp500_data:
            raise Exception("Components data not found in API response")
        
        components = sp500_data['Components']
        tickers = []
        
        # Process Components based on its type
        if isinstance(components, list):
            for component in components:
                if isinstance(component, dict) and 'Code' in component:
                    tickers.append(component['Code'])
                elif isinstance(component, str):
                    tickers.append(component)
        elif isinstance(components, dict):
            for key, value in components.items():
                if isinstance(value, dict) and 'Code' in value:
                    tickers.append(value['Code'])
                else:
                    tickers.append(key)
        
        print(f"Successfully retrieved {len(tickers)} S&P 500 tickers")
        return tickers
        
    except Exception as e:
        print(f"Error fetching S&P 500 tickers: {str(e)}")
        return []

# Main execution
try:
    # Step 1: Get the list of S&P 500 company tickers
    sp500_tickers = get_sp500_tickers()
    total_companies = len(sp500_tickers)
    
    # Create a summary log file
    summary_file = os.path.join(OUTPUT_DIR, "sp500_news_summary.json")
    company_summaries = []
    
    # Step 2: Process each company ticker
    for ticker in sp500_tickers:
        # Check if we're approaching API limits before starting a new company
        if not check_rate_limits():
            print(f"Approaching API limits. Stopping after processing {processed_companies} companies.")
            break
        
        print(f"\n{'='*80}")
        print(f"Processing company {processed_companies+1}/{total_companies}: {ticker}")
        print(f"{'='*80}")
        
        # Fetch news for this company
        company_news = fetch_company_news(ticker)
        
        # If we got news, save it to a file
        if company_news:
            company_filename = f"{ticker.replace('.', '_')}_news.json"
            company_filepath = os.path.join(OUTPUT_DIR, company_filename)
            
            with open(company_filepath, 'w', encoding='utf-8') as f:
                json.dump(company_news, f, ensure_ascii=False, indent=4)
            
            print(f"Saved {len(company_news)} news items for {ticker} to {company_filepath}")
            
            # Add to summary
            company_summaries.append({
                'symbol': ticker,
                'news_count': len(company_news),
                'file': company_filename
            })
        else:
            print(f"No news found for {ticker}. Skipping file creation.")
            
            # Add to summary
            company_summaries.append({
                'symbol': ticker,
                'news_count': 0,
                'file': None
            })
        
        processed_companies += 1
        
        # Save the summary after each company to keep track of progress
        with open(summary_file, 'w', encoding='utf-8') as f:
            summary_data = {
                'timestamp': datetime.now().isoformat(),  # Time of this summary write
                'companies_processed': processed_companies,
                'total_companies': total_companies,
                'total_news_items': total_news_items,
                'api_calls': api_calls,
                'company_summaries': company_summaries
            }
            json.dump(summary_data, f, ensure_ascii=False, indent=4)
        
        # Brief pause between companies
        time.sleep(1)

except Exception as e:
    print(f"Critical error in main execution: {str(e)}")

finally:
    # Final summary
    total_time = time.time() - start_time
    print(f"\n{'='*80}")
    print("Data collection complete:")
    print(f"- Made {api_calls} API calls")
    print(f"- Processed {processed_companies} of {total_companies} companies")
    print(f"- Collected {total_news_items} total news items")
    print(f"- Total runtime: {total_time:.2f} seconds ({total_time/3600:.2f} hours)")
    print(f"- Summary saved to {summary_file}")
    print(f"{'='*80}")

Fetching S&P 500 constituent companies via EODHD API...
Successfully retrieved 503 S&P 500 tickers

Processing company 1/503: AIZ

Starting data collection for AIZ from 2019-01-01
Company AIZ - Batch 1/25: Fetching news with offset 0...
Retrieved 743 news items for AIZ. Total for this company: 743
Reached the end of available news for AIZ at offset 0.
Saved 743 news items for AIZ to sp500_news_data/AIZ_news.json

Processing company 2/503: MNST

Starting data collection for MNST from 2019-01-01
Company MNST - Batch 1/25: Fetching news with offset 0...
Retrieved 1000 news items for MNST. Total for this company: 1000
Company MNST - Batch 2/25: Fetching news with offset 1000...
Retrieved 95 news items for MNST. Total for this company: 1095
Reached the end of available news for MNST at offset 1000.
Saved 1095 news items for MNST to sp500_news_data/MNST_news.json

Processing company 3/503: MTCH

Starting data collection for MTCH from 2019-01-01
Company MTCH - Batch 1/25: Fetching news with o