# Raw Data Fetching and Exploration

This notebook demonstrates how to:
- Fetch minute-level data for SPY using the Polygon.io API
- Fetch daily data for VIX using Yahoo Finance
- Load and validate CSV data
- Analyze calendar coverage and identify missing trading days

### 1. Set Up the Environment

Add `src/` to the import path, load `DataLoader`, and define the two-year date window used by the fetch steps.


In [1]:
import sys
import os
from pathlib import Path
import importlib

# Get the project root directory (parent of notebooks/)
project_root = Path.cwd().parent if Path.cwd().name == 'notebooks' else Path.cwd()

# Add src directory to Python path
sys.path.insert(0, str(project_root / 'src'))

# Import the loader module and reload it so that edits made under src/
# since the kernel started are picked up. On a fresh kernel the reload
# simply re-runs the module once more.
import classes.data.loader as loader_module
importlib.reload(loader_module)

from classes.data.loader import DataLoader
from datetime import datetime, timedelta

# Initialize DataLoader
loader = DataLoader()

# Date range: roughly the last two years, ending yesterday
end_date = (datetime.now() - timedelta(days=1)).strftime('%Y-%m-%d')
start_date = (datetime.now() - timedelta(days=732)).strftime('%Y-%m-%d')

print(f"\n{'='*60}")
print(f"Fetching data from {start_date} to {end_date}")
print(f"Project root: {project_root}")
print(f"{'='*60}\n")



Fetching data from 2023-11-03 to 2025-11-03
Project root: c:\Users\simo0\Documents\GitHub\intraday-momentum
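
The import-and-reload step in the setup cell can be factored into a small helper. The sketch below is an illustration rather than part of the project: `import_fresh` is a hypothetical name, and it relies only on the standard `importlib` and `sys` modules.

```python
import importlib
import sys


def import_fresh(module_name):
    """Import a module, reloading it if this session has already imported it."""
    if module_name in sys.modules:
        # Re-execute the module's source so recent edits take effect.
        return importlib.reload(sys.modules[module_name])
    return importlib.import_module(module_name)


# Demonstrated on a standard-library module so the snippet runs anywhere;
# in this notebook the argument would be "classes.data.loader".
json_module = import_fresh("json")
print(json_module.dumps({"reloaded": True}))
```

Names bound with `from module import X` before a reload keep pointing at the old objects, which is why the setup cell imports `DataLoader` only after the reload.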



### 2. Fetch SPY Minute Data

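Fetch 1-minute intraday bars for SPY from the Polygon.io API over the last two years. The loader returns the bars as a DataFrame together with a metadata dictionary describing the request.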


In [2]:
spy_df, spy_metadata = loader.fetch_polygon_data('SPY', start_date, end_date, 'minute')
print(f"\nSPY DataFrame shape: {spy_df.shape}")
print(f"Columns: {spy_df.columns.tolist()}")
print(f"\nMetadata: {spy_metadata}")
spy_df.head()


INFO [2025-11-04 18:05:47] Fetching SPY from Polygon: 2023-11-03 → 2025-11-03 (minute)
INFO [2025-11-04 18:05:53] Fetched 50000 entries
INFO [2025-11-04 18:06:22] Fetched 50000 entries
INFO [2025-11-04 18:06:29] Fetched 50000 entries
INFO [2025-11-04 18:06:36] Fetched 50000 entries
INFO [2025-11-04 18:06:41] Fetched 50000 entries
INFO [2025-11-04 18:06:45] Rate limit reached. Waiting 2.29 seconds...
INFO [2025-11-04 18:06:51] Fetched 50000 entries
INFO [2025-11-04 18:07:03] Fetched 50000 entries
INFO [2025-11-04 18:07:10] Fetched 50000 entries
INFO [2025-11-04 18:07:15] Fetched 1199 entries
INFO [2025-11-04 18:07:16] 194,105 entries fetched across 500 trading days (elapsed: 97.46s)
INFO [2025-11-04 18:07:24] Saved raw data to data\raw\SPY_1min_20231103_20251103.csv.gz



SPY DataFrame shape: (194105, 6)
Columns: ['volume', 'open', 'high', 'low', 'close', 'caldt']

Metadata: {'ticker': 'SPY', 'start_date': '2023-11-03', 'end_date': '2025-11-03', 'period': 'minute', 'source': 'polygon', 'entries': 194105, 'trading_days': 500, 'fetched_at': '2025-11-04 18:07:24', 'elapsed_seconds': 97.46, 'filepath': 'C:\\Users\\simo0\\Documents\\GitHub\\intraday-momentum\\data\\raw\\SPY_1min_20231103_20251103.csv.gz'}


| | volume | open | high | low | close | caldt |
|---|---|---|---|---|---|---|
| 0 | 960246.0 | 435.47 | 436.14 | 435.46 | 436.125 | 2023-11-06 09:30:00 |
| 1 | 361691.0 | 436.13 | 436.1498 | 435.95 | 436.03 | 2023-11-06 09:31:00 |
| 2 | 193663.0 | 436.02 | 436.07 | 435.73 | 435.74 | 2023-11-06 09:32:00 |
| 3 | 203217.0 | 435.75 | 435.79 | 435.59 | 435.74 | 2023-11-06 09:33:00 |
| 4 | 172073.0 | 435.74 | 435.79 | 435.65 | 435.75 | 2023-11-06 09:34:00 |
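
The repeated `Fetched 50000 entries` log lines suggest that the loader pages through the API. A minimal sketch of that pattern follows, assuming Polygon's aggregates response format: a `results` list of bars keyed `v`, `o`, `h`, `l`, `c`, `t`, plus a `next_url` field while more pages remain. `collect_bars` and the injected `fetch_page` callable are hypothetical names, not the `DataLoader` internals, and the demo uses canned pages instead of real HTTP requests.

```python
from datetime import datetime, timezone

# Polygon aggregate bars use single-letter keys; map them to the column names
# shown in the DataFrame above.
FIELD_MAP = {"v": "volume", "o": "open", "h": "high", "l": "low", "c": "close"}


def collect_bars(fetch_page, first_url):
    """Follow `next_url` links until the response no longer contains one.

    `fetch_page` is a hypothetical callable that takes a URL and returns the
    decoded JSON payload as a dict (for example, a thin wrapper around an
    HTTP client that also adds the API key).
    """
    rows = []
    url = first_url
    while url:
        payload = fetch_page(url)
        for bar in payload.get("results", []):
            row = {name: bar[key] for key, name in FIELD_MAP.items()}
            # `t` is the bar's start time as a Unix timestamp in milliseconds (UTC).
            row["caldt"] = datetime.fromtimestamp(bar["t"] / 1000, tz=timezone.utc)
            rows.append(row)
        url = payload.get("next_url")
    return rows


# Offline demo: two canned pages stand in for real API responses.
first_bar_ms = int(datetime(2023, 11, 6, 14, 30, tzinfo=timezone.utc).timestamp() * 1000)
fake_pages = {
    "page-1": {
        "results": [{"v": 960246.0, "o": 435.47, "h": 436.14, "l": 435.46,
                     "c": 436.125, "t": first_bar_ms}],
        "next_url": "page-2",
    },
    "page-2": {
        "results": [{"v": 361691.0, "o": 436.13, "h": 436.1498, "l": 435.95,
                     "c": 436.03, "t": first_bar_ms + 60_000}],
    },
}
bars = collect_bars(fake_pages.__getitem__, "page-1")
print(len(bars), bars[0]["caldt"].isoformat())
```

Timestamps in this sketch stay in UTC. 14:30 UTC on 2023-11-06 corresponds to 09:30 US Eastern time, which matches the first `caldt` value above, so the real loader presumably converts to exchange-local time before saving.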


### 3. Fetch VIX Daily Data

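Fetch daily OHLC data for VIX from Yahoo Finance over the same two-year window. Volume is reported as 0 in the rows below, as expected for an index that does not trade directly.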


In [3]:
vix_df, vix_metadata = loader.fetch_yahoo_data('^VIX', start_date, end_date, 'day')
print(f"\nVIX DataFrame shape: {vix_df.shape}")
print(f"Columns: {vix_df.columns.tolist()}")
print(f"\nMetadata: {vix_metadata}")
vix_df.head()


INFO [2025-11-04 18:12:25] Fetching ^VIX from Yahoo Finance: 2023-11-03 → 2025-11-03 (day)
INFO [2025-11-04 18:12:26] 500 entries fetched across 500 trading days (elapsed: 1.05s)
INFO [2025-11-04 18:12:26] Saved raw data to data\raw\^VIX_1day_20231103_20251103.csv



VIX DataFrame shape: (500, 6)
Columns: ['volume', 'open', 'high', 'low', 'close', 'caldt']

Metadata: {'ticker': '^VIX', 'start_date': '2023-11-03', 'end_date': '2025-11-03', 'period': 'day', 'source': 'yahoo', 'entries': 500, 'trading_days': 500, 'fetched_at': '2025-11-04 18:12:26', 'elapsed_seconds': 1.05, 'filepath': 'C:\\Users\\simo0\\Documents\\GitHub\\intraday-momentum\\data\\raw\\^VIX_1day_20231103_20251103.csv'}


| | volume | open | high | low | close | caldt |
|---|---|---|---|---|---|---|
| 0 | 0 | 15.7 | 15.83 | 14.91 | 14.91 | 2023-11-03 00:00:00-04:00 |
| 1 | 0 | 15.39 | 15.58 | 14.84 | 14.89 | 2023-11-06 00:00:00-05:00 |
| 2 | 0 | 15.1 | 15.17 | 14.71 | 14.81 | 2023-11-07 00:00:00-05:00 |
| 3 | 0 | 14.91 | 15.09 | 14.3 | 14.45 | 2023-11-08 00:00:00-05:00 |
| 4 | 0 | 14.61 | 15.57 | 14.13 | 15.29 | 2023-11-09 00:00:00-05:00 |
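
The two frames store `caldt` differently: SPY bars carry naive timestamps such as `2023-11-06 09:30:00`, while VIX rows carry a UTC offset such as `2023-11-06 00:00:00-05:00`. If the two series are later aligned by trading day, both can be reduced to a plain date first. The sketch below is one way to do that with the standard library; `to_trading_date` is a hypothetical helper, and it assumes each timestamp is already expressed in exchange-local time.

```python
from datetime import date, datetime


def to_trading_date(caldt):
    """Parse a `caldt` string, with or without a UTC offset, and return its calendar date."""
    # fromisoformat keeps the wall-clock fields as written, so the result is
    # the local date regardless of the offset attached to the string.
    return datetime.fromisoformat(caldt).date()


spy_key = to_trading_date("2023-11-06 09:30:00")        # naive minute bar
vix_key = to_trading_date("2023-11-06 00:00:00-05:00")  # offset-aware daily row
print(spy_key, vix_key, spy_key == vix_key)
```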


### 4. Load CSV Data

Load previously saved CSV files to verify the data was saved correctly.


In [5]:
# Load the SPY and VIX files written by the fetch steps above
# Note: minute-level data is saved gzip-compressed (.csv.gz); daily data is plain .csv
spy_file_path = project_root / 'data/raw/SPY_1min_20231103_20251103.csv.gz'
vix_file_path = project_root / 'data/raw/^VIX_1day_20231103_20251103.csv'
spy_df = loader.load_csv(spy_file_path)
vix_df = loader.load_csv(vix_file_path)
print(f"Loaded CSV with shape: {spy_df.shape}")
spy_df.head()


INFO [2025-11-04 18:13:32] Loaded 194,105 rows from data\raw\SPY_1min_20231103_20251103.csv.gz
INFO [2025-11-04 18:13:32] Loaded 500 rows from data\raw\^VIX_1day_20231103_20251103.csv


Loaded CSV with shape: (194105, 6)


| | volume | open | high | low | close | caldt |
|---|---|---|---|---|---|---|
| 0 | 960246.0 | 435.47 | 436.14 | 435.46 | 436.125 | 2023-11-06 09:30:00 |
| 1 | 361691.0 | 436.13 | 436.1498 | 435.95 | 436.03 | 2023-11-06 09:31:00 |
| 2 | 193663.0 | 436.02 | 436.07 | 435.73 | 435.74 | 2023-11-06 09:32:00 |
| 3 | 203217.0 | 435.75 | 435.79 | 435.59 | 435.74 | 2023-11-06 09:33:00 |
| 4 | 172073.0 | 435.74 | 435.79 | 435.65 | 435.75 | 2023-11-06 09:34:00 |
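
Matching shapes confirm that the files round-trip, but a few structural checks can catch subtler problems. The sketch below is a dependency-free illustration, not the behavior of `loader.load_csv`: it reads a plain or gzip-compressed CSV, verifies that the six expected columns are present, and flags bars whose open or close falls outside the low-high range. The function names are hypothetical. With pandas, `read_csv` infers gzip compression from the `.gz` extension by default, so no special handling is needed there.

```python
import csv
import gzip
import tempfile
from pathlib import Path

REQUIRED_COLUMNS = {"volume", "open", "high", "low", "close", "caldt"}


def read_rows(path):
    """Read a plain or gzip-compressed CSV and check that the expected columns exist."""
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"{path.name} is missing columns: {sorted(missing)}")
        return list(reader)


def find_inconsistent_bars(rows):
    """Return the positions of rows whose open/close fall outside the low-high range."""
    bad = []
    for position, row in enumerate(rows):
        open_, high, low, close = (float(row[k]) for k in ("open", "high", "low", "close"))
        if not (low <= min(open_, close) and max(open_, close) <= high):
            bad.append(position)
    return bad


# Round-trip demo on a throwaway file: the second row has high < low on purpose.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample_1min.csv.gz"
    with gzip.open(sample, "wt", newline="", encoding="utf-8") as handle:
        handle.write("volume,open,high,low,close,caldt\n")
        handle.write("960246.0,435.47,436.14,435.46,436.125,2023-11-06 09:30:00\n")
        handle.write("100.0,436.13,435.00,435.95,436.03,2023-11-06 09:31:00\n")
    rows = read_rows(sample)
    bad_positions = find_inconsistent_bars(rows)

print(len(rows), bad_positions)
```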


### 5. Validate Calendar Coverage

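Analyze calendar coverage to identify missing trading days and compute coverage statistics.
The function builds the expected calendar from business days (weekdays) only, since markets do not trade on weekends. It does not account for exchange holidays, so dates such as Thanksgiving or Christmas are reported as missing, and coverage can fall below 100% even when no trading data is absent.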


In [9]:
# Validate SPY calendar coverage
spy_calendar = loader.validate_calendar(spy_df)
print("SPY Calendar Coverage:")
if 'frequency' in spy_calendar:
    print(f"Frequency: {spy_calendar['frequency']}")
print(f"Total days: {spy_calendar['total_days']}")
if spy_calendar.get('expected_days') is not None:
    print(f"Expected days: {spy_calendar['expected_days']}")
    print(f"Coverage: {spy_calendar['coverage_percentage']}%")
print(f"Date range: {spy_calendar['date_range']}")
if spy_calendar.get('missing_dates_count', 0) > 0:
    print(f"Missing dates: {spy_calendar['missing_dates_count']}")
print(f"Weekdays: {spy_calendar['weekday_count']}, Weekends: {spy_calendar['weekend_count']}")
if spy_calendar.get('missing_dates'):
    print(f"First 10 missing: {spy_calendar['missing_dates'][:10]}")


SPY Calendar Coverage:
Total days: 500
Expected days: 521
Coverage: 95.97%
Date range: (datetime.date(2023, 11, 6), datetime.date(2025, 11, 3))
Missing dates: 21
Weekdays: 500, Weekends: 0
First 10 missing: [datetime.date(2023, 11, 23), datetime.date(2023, 12, 25), datetime.date(2024, 1, 1), datetime.date(2024, 1, 15), datetime.date(2024, 2, 19), datetime.date(2024, 3, 29), datetime.date(2024, 5, 27), datetime.date(2024, 6, 19), datetime.date(2024, 7, 4), datetime.date(2024, 9, 2)]
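
The figures above are consistent with a plain weekday count: the SPY date range contains 521 weekdays, and 500 / 521 rounds to 95.97%. The sketch below reproduces that arithmetic with the standard library. It illustrates the idea and is not the body of `validate_calendar`; the function names and returned keys are chosen here to mirror the printed output.

```python
from datetime import date, timedelta


def weekdays_between(start, end):
    """Return every Monday-to-Friday date from start to end, inclusive."""
    span = (end - start).days + 1
    all_days = (start + timedelta(days=offset) for offset in range(span))
    return [day for day in all_days if day.weekday() < 5]


def calendar_coverage(observed_dates, start, end):
    """Compare observed trading dates against the weekday calendar."""
    observed = set(observed_dates)
    expected = weekdays_between(start, end)
    missing = [day for day in expected if day not in observed]
    present = len(expected) - len(missing)
    return {
        "expected_days": len(expected),
        "missing_dates": missing,
        "coverage_percentage": round(100 * present / len(expected), 2),
    }


# Weekday count for the SPY date range reported above.
expected = weekdays_between(date(2023, 11, 6), date(2025, 11, 3))
print(len(expected))

# Toy check: one full week with Wednesday absent.
week = weekdays_between(date(2023, 11, 6), date(2023, 11, 10))
report = calendar_coverage([d for d in week if d != date(2023, 11, 8)],
                           date(2023, 11, 6), date(2023, 11, 10))
print(report["coverage_percentage"], report["missing_dates"])
```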


In [10]:
# Validate VIX calendar coverage
vix_calendar = loader.validate_calendar(vix_df)
print("VIX Calendar Coverage:")
if 'frequency' in vix_calendar:
    print(f"Frequency: {vix_calendar['frequency']}")
print(f"Total days: {vix_calendar['total_days']}")
if vix_calendar.get('expected_days') is not None:
    print(f"Expected days: {vix_calendar['expected_days']}")
    print(f"Coverage: {vix_calendar['coverage_percentage']}%")
print(f"Date range: {vix_calendar['date_range']}")
if vix_calendar.get('missing_dates_count', 0) > 0:
    print(f"Missing dates: {vix_calendar['missing_dates_count']}")
print(f"Weekdays: {vix_calendar['weekday_count']}, Weekends: {vix_calendar['weekend_count']}")
if vix_calendar.get('missing_dates'):
    print(f"First 10 missing: {vix_calendar['missing_dates'][:10]}")


VIX Calendar Coverage:
Total days: 500
Expected days: 521
Coverage: 95.97%
Date range: (datetime.date(2023, 11, 3), datetime.date(2025, 10, 31))
Missing dates: 21
Weekdays: 500, Weekends: 0
First 10 missing: [datetime.date(2023, 11, 23), datetime.date(2023, 12, 25), datetime.date(2024, 1, 1), datetime.date(2024, 1, 15), datetime.date(2024, 2, 19), datetime.date(2024, 3, 29), datetime.date(2024, 5, 27), datetime.date(2024, 6, 19), datetime.date(2024, 7, 4), datetime.date(2024, 9, 2)]
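
Each of the ten dates printed above is a U.S. stock market holiday, so none of them indicates a data problem. One way to separate such closures from genuine gaps is to compare the missing dates against a holiday table. The sketch below assumes a hand-maintained table (`KNOWN_HOLIDAYS`, a hypothetical name covering only the ten dates shown); a maintained exchange calendar would be a more robust source.

```python
from datetime import date

# Hypothetical, hand-maintained table limited to the ten dates printed above.
KNOWN_HOLIDAYS = {
    date(2023, 11, 23): "Thanksgiving Day",
    date(2023, 12, 25): "Christmas Day",
    date(2024, 1, 1): "New Year's Day",
    date(2024, 1, 15): "Martin Luther King Jr. Day",
    date(2024, 2, 19): "Presidents' Day",
    date(2024, 3, 29): "Good Friday",
    date(2024, 5, 27): "Memorial Day",
    date(2024, 6, 19): "Juneteenth",
    date(2024, 7, 4): "Independence Day",
    date(2024, 9, 2): "Labor Day",
}


def split_missing(missing_dates, holidays):
    """Separate missing dates explained by a holiday from those that are not."""
    explained = {day: holidays[day] for day in missing_dates if day in holidays}
    unexplained = [day for day in missing_dates if day not in holidays]
    return explained, unexplained


# The first ten missing dates from the output, plus one made-up gap for contrast.
missing = list(KNOWN_HOLIDAYS) + [date(2024, 8, 14)]
explained, unexplained = split_missing(missing, KNOWN_HOLIDAYS)
print(len(explained), unexplained)
```

Any date that lands in `unexplained` is a weekday with no data and no listed closure, which is the case worth investigating.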
