# 01 - Data Collection

This notebook runs the data collection scripts in order to download and prepare all raw data.

**Prerequisites:**
- SEC EDGAR bulk data downloaded to `data/raw/kaggle/sec_edgar/companyfacts/`
- `FRED_API_KEY` set in the `.env` file (keys are free at https://fred.stlouisfed.org/)
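
Both prerequisites can be checked before any script runs. The following is a minimal sketch, not part of the project's scripts: `read_env_file` and `check_prerequisites` are hypothetical helpers, the `.env` file is assumed to contain plain `KEY=VALUE` lines, and its location (the project root, one level above this notebook) is also an assumption.

```python
import os

def read_env_file(env_path):
    """Parse simple KEY=VALUE lines from a .env file into a dict (hypothetical helper)."""
    values = {}
    if not os.path.isfile(env_path):
        return values
    with open(env_path, encoding='utf-8') as f:
        for raw in f:
            line = raw.strip()
            # Skip blank lines, comments, and anything that is not KEY=VALUE
            if not line or line.startswith('#') or '=' not in line:
                continue
            key, _, value = line.partition('=')
            values[key.strip()] = value.strip().strip('"\'')
    return values

def check_prerequisites(sec_dir, env_path):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    # The SEC directory must exist and contain at least one file
    if not os.path.isdir(sec_dir) or not os.listdir(sec_dir):
        problems.append(f'SEC EDGAR data not found in {sec_dir}')
    # The API key must be present and non-empty in the .env file
    if not read_env_file(env_path).get('FRED_API_KEY'):
        problems.append(f'FRED_API_KEY not set in {env_path}')
    return problems

if __name__ == '__main__':
    # Paths are relative to the notebook directory; the .env location is assumed
    sec_dir = os.path.join('..', 'data', 'raw', 'kaggle', 'sec_edgar', 'companyfacts')
    for problem in check_prerequisites(sec_dir, os.path.join('..', '.env')):
        print('MISSING:', problem)
```

Returning a list of problems instead of raising on the first one reports both missing prerequisites in a single pass.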

**Scripts (in order):**
1. `download_prices.py` - S&P 500 daily prices via yfinance (~5 min)
2. `download_spy.py` - SPY benchmark prices via yfinance
3. `download_sectors.py` - Sector classifications from Wikipedia
4. `download_macro.py` - Macroeconomic indicators from FRED
5. `extract_financials.py` - Parse SEC EDGAR quarterly filings

In [1]:
import subprocess
import sys
import os

SCRIPTS_DIR = os.path.join('..', 'scripts')

def run_script(name):
    """Run a script and stream its output line by line."""
    path = os.path.join(SCRIPTS_DIR, name)
    print(f'\n{"=" * 60}')
    print(f'Running {name}...')
    print(f'{"=" * 60}\n', flush=True)
    # '-u' disables output buffering in the child so progress lines appear as they are printed
    process = subprocess.Popen(
        [sys.executable, '-u', path],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout so errors show inline
        text=True,
    )
    for line in process.stdout:
        print(line, end='', flush=True)
    returncode = process.wait()
    if returncode != 0:
        raise RuntimeError(f'{name} failed with exit code {returncode}')

In [2]:
# Step 1: Download S&P 500 stock prices (~5 min)
run_script('download_prices.py')


Running download_prices.py...

Fetching S&P 500 ticker list from Wikipedia...
Found 503 tickers.

Downloading 503 tickers...

  Progress: 25/503 (25 successful, 0 failed)
  Progress: 50/503 (50 successful, 0 failed)
  Progress: 75/503 (75 successful, 0 failed)
  Progress: 100/503 (100 successful, 0 failed)
  Progress: 125/503 (125 successful, 0 failed)
  Progress: 150/503 (150 successful, 0 failed)
  Progress: 175/503 (175 successful, 0 failed)
  Progress: 200/503 (200 successful, 0 failed)
  Progress: 225/503 (225 successful, 0 failed)
  Progress: 250/503 (250 successful, 0 failed)
  Progress: 275/503 (275 successful, 0 failed)
  Progress: 300/503 (300 successful, 0 failed)
  Progress: 325/503 (325 successful, 0 failed)
  Progress: 350/503 (350 successful, 0 failed)
  Progress: 375/503 (375 successful, 0 failed)
  Progress: 400/503 (399 successful, 1 failed)
  Progress: 425/503 (424 successful, 1 failed)
  Progress: 450/503 (449 successful, 1 failed)
  Progress: 475/503 (474 successf

In [3]:
# Step 2: Download SPY benchmark prices
run_script('download_spy.py')


Running download_spy.py...

Downloading SPY benchmark data...
Downloaded 5319 days of SPY data
Date range: 2005-01-03 to 2026-02-24
Saved to c:\Users\chris\stock-prediction-ml\scripts\..\data\spy_prices.pkl


In [4]:
# Step 3: Download sector classifications
run_script('download_sectors.py')


Running download_sectors.py...

Fetching S&P 500 sector data from Wikipedia...
Total companies: 503
Unique sectors: 11

Price data tickers: 502
Matched with sector data: 502

Saved sector_data.pkl: 505 entries
File size: 22.1 KB

Sector Distribution:
  Industrials: 79
  Financials: 76
  Information Technology: 71
  Health Care: 60
  Consumer Discretionary: 48
  Consumer Staples: 36
  Utilities: 31
  Real Estate: 31
  Materials: 26
  Communication Services: 23
  Energy: 22

Done!


In [5]:
# Step 4: Download macroeconomic data
run_script('download_macro.py')


Running download_macro.py...

FRED API connected successfully.
Download range: 2005-01-01 to present

Indicators to download:
  GS10: 10-Year Treasury Constant Maturity Rate (daily)
  VIXCLS: CBOE Volatility Index (daily)
  UNRATE: Unemployment Rate (monthly)
  GDP: Gross Domestic Product (quarterly)
  CPIAUCSL: Consumer Price Index (monthly)

=== Downloading ===
  GS10: 253 observations (2005-01-01 to 2026-01-01)
  VIXCLS: 5347 observations (2005-01-03 to 2026-02-23)
  UNRATE: 252 observations (2005-01-01 to 2026-01-01)
  GDP: 84 observations (2005-01-01 to 2025-10-01)
  CPIAUCSL: 252 observations (2005-01-01 to 2026-01-01)

Downloaded 5/5 indicators

=== Resampling to daily frequency ===
  GS10: 253 raw -> 7671 daily observations
  VIXCLS: 5347 raw -> 7722 daily observations
  UNRATE: 252 raw -> 7671 daily observations
  GDP: 84 raw -> 7579 daily observations
  CPIAUCSL: 252 raw -> 7671 daily observations

Saved macro_data.pkl
File size: 900.6 KB
Indicators: ['GS10', 'VIXCLS', 'UNRA
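
The resampling counts in the log above are consistent with one value per calendar day between each series' first and last observation: 2005-01-01 to 2026-01-01 spans 7671 days inclusive. The log does not show how the gaps are filled, so the sketch below assumes forward-fill (each day takes the most recent observation). `resample_daily_ffill` is a hypothetical helper written with the standard library only, not the code in `download_macro.py`.

```python
from datetime import date, timedelta

def resample_daily_ffill(observations):
    """Expand sparse {date: value} observations to one value per calendar day.

    Assumption: gaps are forward-filled with the most recent observation.
    """
    if not observations:
        return {}
    start, end = min(observations), max(observations)
    filled = {}
    last_value = None
    current = start
    while current <= end:
        # Pick up a new observation when one exists for this day
        if current in observations:
            last_value = observations[current]
        filled[current] = last_value
        current += timedelta(days=1)
    return filled

# Synthetic monthly series: first of each month, 2005-01-01 through 2026-01-01
monthly = {date(2005 + m // 12, m % 12 + 1, 1): float(m) for m in range(253)}
daily = resample_daily_ffill(monthly)
print(f'{len(monthly)} raw -> {len(daily)} daily observations')
```

With pandas, the same idea is usually written as `series.resample('D').ffill()` on a series with a datetime index.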

In [6]:
# Step 5: Extract financial data from SEC EDGAR
run_script('extract_financials.py')


Running extract_financials.py...

SEC data dir: c:\Users\chris\stock-prediction-ml\data\raw\kaggle\sec_edgar\companyfacts
SEC files found: 18848
Price data: 502 companies

Fetching S&P 500 CIK mapping from Wikipedia...
S&P 500 tickers with SEC EDGAR + price data: 450

=== Parsing 450 companies ===
(quality thresholds: income_metrics >= 4, quarters >= 8)

  [50/450] BRK-B: 6 income + 6 balance, 52 quarters
  [100/450] CVS: 9 income + 11 balance, 53 quarters
  [150/450] FCX: 9 income + 9 balance, 54 quarters
  [200/450] IBM: 10 income + 10 balance, 53 quarters
  [250/450] LYB: 10 income + 11 balance, 45 quarters
  [300/450] NVR: 6 income + 6 balance, 50 quarters
  [350/450] RMD: 11 income + 11 balance, 50 quarters
  [400/450] TSN: 9 income + 8 balance, 50 quarters
  [450/450] ZTS: 8 income + 11 balance, 42 quarters

--- Results ---
Loaded: 443 companies
Skipped: 7 (insufficient data)
Errors: 0

Skipped: ['AZO (9 income, 1 quarters)', 'BLK (6 income, 7 quarters)', 'COST (9 income, 2 quar
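
The quality thresholds reported in the log (`income_metrics >= 4, quarters >= 8`) can be read as a simple filter. The sketch below is an illustration under that reading, not the code in `extract_financials.py`: `passes_quality` is a hypothetical helper, and the counts are copied from the log lines above.

```python
MIN_INCOME_METRICS = 4  # threshold taken from the log line above
MIN_QUARTERS = 8        # threshold taken from the log line above

def passes_quality(income_metrics, quarters,
                   min_income=MIN_INCOME_METRICS, min_quarters=MIN_QUARTERS):
    """Return True when a company meets both thresholds (hypothetical helper)."""
    return income_metrics >= min_income and quarters >= min_quarters

# (income metric count, quarter count) as reported in the log
candidates = {
    'BRK-B': (6, 52),
    'AZO': (9, 1),
    'BLK': (6, 7),
}

kept = [t for t, (inc, q) in candidates.items() if passes_quality(inc, q)]
skipped = [t for t, (inc, q) in candidates.items() if not passes_quality(inc, q)]
print('Kept:', kept)
print('Skipped:', skipped)
```

Under this reading, AZO and BLK are skipped for having too few quarters, not too few income metrics.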

In [7]:
# Verify all output files exist
import pickle

DATA_DIR = os.path.join('..', 'data')

expected_files = {
    'price_data.pkl': 'Stock prices',
    'spy_prices.pkl': 'SPY benchmark',
    'sector_data.pkl': 'Sector classifications',
    'macro_data.pkl': 'Macro indicators',
    'financial_data.pkl': 'Financial statements',
    'tickers.pkl': 'Ticker list',
}

print('=== Data Collection Summary ===')
all_ok = True
for filename, desc in expected_files.items():
    path = os.path.join(DATA_DIR, filename)
    if os.path.exists(path):
        size = os.path.getsize(path) / 1024 / 1024
        with open(path, 'rb') as f:
            data = pickle.load(f)
        # len() gives the item count for dicts, lists, and other sized containers
        count = len(data)
        print(f'  {filename:<25s} {desc:<25s} {count:>5} items  ({size:.1f} MB)')
    else:
        print(f'  {filename:<25s} MISSING')
        all_ok = False

if all_ok:
    print('\nAll data files present. Ready for notebook 02.')
else:
    print('\nSome files missing - check script output above.')

=== Data Collection Summary ===
  price_data.pkl            Stock prices                502 items  (130.3 MB)
  spy_prices.pkl            SPY benchmark              5319 items  (0.4 MB)
  sector_data.pkl           Sector classifications      505 items  (0.0 MB)
  macro_data.pkl            Macro indicators              5 items  (0.9 MB)
  financial_data.pkl        Financial statements        443 items  (4.1 MB)
  tickers.pkl               Ticker list                 443 items  (0.0 MB)

All data files present. Ready for notebook 02.
