# ML-Enhanced European Option Pricing: Data Fetching and Processing

A complete data pipeline, from fetching raw options market data to preparing it for machine learning modeling.

## Pipeline Overview
1. **Data Fetching**: Options data from Yahoo Finance (real-time market data)
2. **Data Cleaning**: Remove invalid entries and apply quality filters
3. **Feature Engineering**: Create additional features for modeling
4. **Black-Scholes Calculation**: Compute theoretical prices
5. **Target Variable**: Calculate pricing errors (Market Price - BS Price)
6. **Data Quality Assessment**: Check for outliers and data quality issues
7. **Normalization**: Standardize features for modeling
8. **Data Export**: Save processed data for modeling
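
Steps 4 and 5 can be sketched in a few lines. The block below is a minimal illustration, not the project's `BlackScholesCalculator` (whose interface is not shown in this notebook): the function names `norm_cdf`, `bs_price`, and `pricing_error` are hypothetical, and the formula assumes a European option on an asset that pays no dividends.

```python
import math


def norm_cdf(x: float) -> float:
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bs_price(S: float, K: float, T: float, r: float, sigma: float,
             contract_type: str = "call") -> float:
    """Black-Scholes price of a European option (no dividends).

    S: underlying price, K: strike, T: time to expiry in years,
    r: continuously compounded risk-free rate, sigma: annualized volatility.
    """
    if T <= 0 or sigma <= 0:
        raise ValueError("T and sigma must be positive")
    d1 = (math.log(S / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    if contract_type == "call":
        return S * norm_cdf(d1) - K * math.exp(-r * T) * norm_cdf(d2)
    if contract_type == "put":
        return K * math.exp(-r * T) * norm_cdf(-d2) - S * norm_cdf(-d1)
    raise ValueError(f"Unknown contract type: {contract_type!r}")


def pricing_error(market_price: float, S: float, K: float, T: float,
                  r: float, sigma: float, contract_type: str = "call") -> float:
    """Target variable from step 5: Market Price - BS Price."""
    return market_price - bs_price(S, K, T, r, sigma, contract_type)
```

A positive pricing error means the market quotes the contract above its Black-Scholes value, and a negative one means below.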


In [1]:
import sys
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

sys.path.append('..')

from config import Config
from src.data_fetcher import OptionsDataFetcher
from src.black_scholes import BlackScholesCalculator
from src.data_processor import OptionsDataProcessor

plt.style.use('default')
sns.set_palette("husl")

print("Setup complete")


Setup complete


## Step 1: Configuration and Setup

In [2]:
try:
    Config.validate_config()
    print(f"Target: {Config.TARGET_SYMBOL}")
except ValueError as e:
    print(f"Error: {e}")
    raise

fetcher = OptionsDataFetcher(alpha_vantage_api_key=Config.ALPHA_VANTAGE_API_KEY)
processor = OptionsDataProcessor()
bs_calculator = BlackScholesCalculator()

print("Components ready")

INFO:src.data_fetcher:OptionsDataFetcher initialized successfully


Target: SPY
Components ready
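
The `config` module itself is not shown in this notebook. As a rough sketch of the pattern the cell above relies on, a class holding settings as attributes with a `validate_config` method that raises `ValueError` might look like the following. The attribute values are the ones printed elsewhere in this notebook; the validation rules are assumptions, and settings whose values never appear in the output (such as `ALPHA_VANTAGE_API_KEY` and `RISK_FREE_RATE_TICKER`) are left out.

```python
class Config:
    """Hypothetical stand-in for the project's Config class."""

    TARGET_SYMBOL = "SPY"
    MIN_VOLUME = 0
    MIN_OPEN_INTEREST = 0
    MIN_TIME_TO_EXPIRY = 0.001  # in years
    DATA_RAW_PATH = "data/raw/"

    @classmethod
    def validate_config(cls) -> None:
        """Raise ValueError if a setting is missing or out of range (assumed rules)."""
        if not cls.TARGET_SYMBOL:
            raise ValueError("TARGET_SYMBOL must be set")
        for name in ("MIN_VOLUME", "MIN_OPEN_INTEREST", "MIN_TIME_TO_EXPIRY"):
            if getattr(cls, name) < 0:
                raise ValueError(f"{name} must be non-negative")
```

Validating once at the top of the notebook makes a bad setting fail immediately rather than partway through the fetch.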


## Step 2: Data Fetching

In [3]:
# Fetch market data
print("📈 Fetching market data...")

risk_free_rate = fetcher.get_risk_free_rate(Config.RISK_FREE_RATE_TICKER)
stock_price = fetcher.get_current_stock_price(Config.TARGET_SYMBOL)

print(f"✅ Risk-free rate: {risk_free_rate:.4f} ({risk_free_rate*100:.2f}%)")
print(f"✅ Current {Config.TARGET_SYMBOL} price: ${stock_price:.2f}")

# Fetch options data from Yahoo Finance
print(f"\n📊 Fetching options data for {Config.TARGET_SYMBOL}...")
options_list = fetcher.fetch_options_data(Config.TARGET_SYMBOL)

if not options_list:
    print("❌ Failed to fetch options data")
    raise RuntimeError("Data fetching failed")

print(f"✅ Successfully fetched {len(options_list)} options contracts")


📈 Fetching market data...


INFO:src.data_fetcher:Risk-free rate fetched: 0.0413
INFO:src.data_fetcher:Current SPY price: $635.55
INFO:src.data_fetcher:Fetching options data from Yahoo Finance for SPY...
INFO:src.data_fetcher:Fetching options data for SPY from Yahoo Finance...
INFO:src.data_fetcher:Fetching options chain for expiration: 2025-08-22


✅ Risk-free rate: 0.0413 (4.13%)
✅ Current SPY price: $635.55

📊 Fetching options data for SPY...


INFO:src.data_fetcher:Fetching options chain for expiration: 2025-08-25
INFO:src.data_fetcher:Fetching options chain for expiration: 2025-08-26
INFO:src.data_fetcher:Fetching options chain for expiration: 2025-08-27
INFO:src.data_fetcher:Successfully fetched 60 options from Yahoo Finance


✅ Successfully fetched 60 options contracts
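
The expirations logged above are only one to six days out, so time to expiry, expressed in years for Black-Scholes, is very small for every contract. The sketch below shows one way to compute it. The helper `time_to_expiry_years` is hypothetical, and it assumes a 365-day year and an expiry timestamp at midnight at the start of the expiration date; the project's own convention may differ. The `as_of` value is an illustrative timestamp on 2025-08-21.

```python
from datetime import datetime

SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def time_to_expiry_years(expiration_date: str, as_of: datetime) -> float:
    """Years between as_of and midnight at the start of expiration_date.

    Returns 0.0 for contracts that have already expired.
    """
    expiry = datetime.strptime(expiration_date, "%Y-%m-%d")
    seconds_left = (expiry - as_of).total_seconds()
    return max(seconds_left, 0.0) / SECONDS_PER_YEAR


as_of = datetime(2025, 8, 21, 17, 43, 35)  # illustrative fetch time
for expiration in ("2025-08-22", "2025-08-25", "2025-08-26", "2025-08-27"):
    print(expiration, f"{time_to_expiry_years(expiration, as_of):.6f}")
```

Under these assumptions the nearest expiration is less than 0.001 years away, while the later ones are several times that.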


In [4]:
# Process options data
print("🔄 Processing options data...")
raw_df = fetcher.process_options_data(options_list, stock_price, risk_free_rate)

print("\n📊 Raw Data Summary:")
print(f"Shape: {raw_df.shape}")
print(f"Columns: {list(raw_df.columns)}")
print("\nFirst few rows:")
print(raw_df.head())

# Save raw data (create the target directory before building the file path)
os.makedirs(Config.DATA_RAW_PATH, exist_ok=True)
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
raw_data_path = os.path.join(Config.DATA_RAW_PATH, f"options_data_{timestamp}.csv")
fetcher.save_raw_data(raw_df, raw_data_path)
print(f"\n💾 Raw data saved to: {raw_data_path}")


INFO:src.data_fetcher:Processed 60 option contracts into DataFrame
INFO:src.data_fetcher:Raw data saved to data/raw/options_data_20250821_174335.csv


🔄 Processing options data...

📊 Raw Data Summary:
Shape: (60, 15)
Columns: ['symbol', 'underlying_symbol', 'strike', 'expiration_date', 'contract_type', 'implied_volatility', 'open_interest', 'volume', 'bid', 'ask', 'last_price', 'market_price', 'underlying_price', 'risk_free_rate', 'fetch_timestamp']

First few rows:
                 symbol underlying_symbol  strike expiration_date  \
0  SPY20250822C00400000               SPY   400.0      2025-08-22   
1  SPY20250822C00410000               SPY   410.0      2025-08-22   
2  SPY20250822C00420000               SPY   420.0      2025-08-22   
3  SPY20250822C00445000               SPY   445.0      2025-08-22   
4  SPY20250822C00450000               SPY   450.0      2025-08-22   

  contract_type  implied_volatility  open_interest  volume     bid     ask  \
0          call            2.791019           21.0    20.0  234.60  237.40   
1          call            2.660160            4.0     4.0  224.60  227.41   
2          call            2.53
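
The `symbol` column in the rows above appears to encode the other contract fields: underlying, an eight-digit expiration date, `C` or `P`, and the strike multiplied by 1000 and zero-padded to eight digits. The sketch below parses that layout. The layout is inferred from the printed rows only, and `parse_option_symbol` is a hypothetical helper, not part of the project. Standard OCC symbols use a six-digit date, so this pattern applies only to the format shown here.

```python
import re
from datetime import datetime

# Layout inferred from the printed rows: ROOT + YYYYMMDD + C/P + strike * 1000 (8 digits)
SYMBOL_PATTERN = re.compile(r"^([A-Z]+)(\d{8})([CP])(\d{8})$")


def parse_option_symbol(symbol: str) -> dict:
    """Split a contract symbol into underlying, expiry, type, and strike."""
    match = SYMBOL_PATTERN.match(symbol)
    if match is None:
        raise ValueError(f"Unrecognized option symbol: {symbol!r}")
    root, yyyymmdd, flag, strike_digits = match.groups()
    return {
        "underlying_symbol": root,
        "expiration_date": datetime.strptime(yyyymmdd, "%Y%m%d").date(),
        "contract_type": "call" if flag == "C" else "put",
        "strike": int(strike_digits) / 1000.0,
    }


print(parse_option_symbol("SPY20250822C00400000"))
```

A parser like this can serve as a consistency check: the parsed fields should agree with the `strike`, `expiration_date`, and `contract_type` columns of the same row.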

## Step 3: Data Cleaning and Quality Assessment

In [5]:
# Create configuration for data cleaning
cleaning_config = {
    'MIN_VOLUME': Config.MIN_VOLUME,
    'MIN_OPEN_INTEREST': Config.MIN_OPEN_INTEREST,
    'MIN_TIME_TO_EXPIRY': Config.MIN_TIME_TO_EXPIRY
}

print("🧹 Starting data cleaning process...")
print(f"Cleaning configuration: {cleaning_config}")

# Clean the data
cleaned_df = processor.clean_raw_data(raw_df, cleaning_config)

# Display cleaning results (guard against an empty raw DataFrame)
rows_removed = len(raw_df) - len(cleaned_df)
pct_removed = rows_removed / len(raw_df) * 100 if len(raw_df) else 0.0
print("\n📈 Data Cleaning Results:")
print(f"Original shape: {raw_df.shape}")
print(f"Cleaned shape:  {cleaned_df.shape}")
print(f"Rows removed:   {rows_removed} ({pct_removed:.1f}%)")

# Check for missing values
print(f"\n🔍 Missing Values Check:")
missing_summary = cleaned_df.isnull().sum()
if missing_summary.sum() > 0:
    print(missing_summary[missing_summary > 0])
else:
    print("✅ No missing values found!")
    
print(f"\n📊 Cleaned Data Sample:")
print(cleaned_df.head())


INFO:src.data_processor:Data cleaning complete. Final shape: (45, 17)
INFO:src.data_processor:Removed 15 rows during cleaning


🧹 Starting data cleaning process...
Cleaning configuration: {'MIN_VOLUME': 0, 'MIN_OPEN_INTEREST': 0, 'MIN_TIME_TO_EXPIRY': 0.001}

📈 Data Cleaning Results:
Original shape: (60, 15)
Cleaned shape:  (45, 17)
Rows removed:   15 (25.0%)

🔍 Missing Values Check:
volume    6
dtype: int64

📊 Cleaned Data Sample:
                  symbol underlying_symbol  strike expiration_date  \
15  SPY20250825C00530000               SPY   530.0      2025-08-25   
16  SPY20250825C00550000               SPY   550.0      2025-08-25   
17  SPY20250825C00570000               SPY   570.0      2025-08-25   
18  SPY20250825C00580000               SPY   580.0      2025-08-25   
19  SPY20250825C00590000               SPY   590.0      2025-08-25   

   contract_type  implied_volatility  open_interest  volume     bid     ask  \
15          call            0.809084           11.0     NaN  104.73  107.53   
16          call            0.669437            2.0     NaN   84.74   87.54   
17          call            0.5322
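
The internals of `processor.clean_raw_data` are not shown, so the following is only a sketch of how threshold-based cleaning of this kind could be written with pandas. The function `clean_options` and the `time_to_expiry` column are hypothetical. Unlike the output above, where six `NaN` volumes remain after cleaning, this sketch treats a missing volume as zero trades, which is an assumption and not the project's behavior.

```python
import pandas as pd


def clean_options(df: pd.DataFrame, config: dict) -> pd.DataFrame:
    """Keep rows with usable quotes that pass the configured thresholds."""
    out = df.copy()
    # Assumption: a missing volume means no trades were reported
    out["volume"] = out["volume"].fillna(0)

    # A usable quote has a positive bid and an ask at or above the bid
    valid_quotes = (out["bid"] > 0) & (out["ask"] >= out["bid"])
    liquid = (
        (out["volume"] >= config["MIN_VOLUME"])
        & (out["open_interest"] >= config["MIN_OPEN_INTEREST"])
    )
    # time_to_expiry is a hypothetical column, in years
    not_expiring = out["time_to_expiry"] >= config["MIN_TIME_TO_EXPIRY"]

    return out[valid_quotes & liquid & not_expiring]


sample = pd.DataFrame({
    "bid": [104.73, 0.0, 2.0, 3.0],
    "ask": [107.53, 0.5, 1.5, 3.5],
    "volume": [float("nan"), 5.0, 5.0, 10.0],
    "open_interest": [11.0, 1.0, 1.0, 1.0],
    "time_to_expiry": [0.009, 0.009, 0.009, 0.0005],
})
thresholds = {"MIN_VOLUME": 0, "MIN_OPEN_INTEREST": 0, "MIN_TIME_TO_EXPIRY": 0.001}
print(clean_options(sample, thresholds))
```

Filling missing volume before filtering matters because a comparison such as `volume >= 0` evaluates to `False` for `NaN`, which would otherwise drop those rows silently.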