#### CARRON Auriane, CHAUFOUR Chloé, DESTAILLEUR Mathilde - IF2
# Predicting Apple Option Prices Using Machine Learning

## Project Overview

### Context 

Options are financial derivatives that give the buyer the right, but not the obligation, to buy or sell an underlying asset at a predetermined price (the strike) on or before a specific date (the expiration). Pricing these instruments accurately is crucial for traders, market makers, and risk managers.

Traditional option pricing relies on theoretical models like Black-Scholes, which make assumptions about market behavior (constant volatility, no transaction costs, log-normal price distributions). However, real market prices often deviate from these theoretical values due to factors like supply/demand imbalances, liquidity constraints, and changing market conditions.
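For reference, the Black-Scholes benchmark mentioned above can be sketched as follows. This is a minimal illustration for European options on a non-dividend-paying stock. The function names `norm_cdf` and `black_scholes_price` and the example inputs are assumptions introduced here for illustration, not part of the project code.

```python
import math


def norm_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def black_scholes_price(S, K, T, r, sigma, option_type="call"):
    """Black-Scholes price of a European option on a non-dividend-paying stock.

    S: underlying price, K: strike, T: time to maturity in years,
    r: continuously compounded risk-free rate, sigma: annualized volatility.
    """
    if T <= 0 or sigma <= 0:
        raise ValueError("T and sigma must be positive")
    d1 = (math.log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    if option_type == "call":
        return S * norm_cdf(d1) - K * math.exp(-r * T) * norm_cdf(d2)
    if option_type == "put":
        return K * math.exp(-r * T) * norm_cdf(-d2) - S * norm_cdf(-d1)
    raise ValueError("option_type must be 'call' or 'put'")


# Illustrative inputs only (not taken from the dataset)
call = black_scholes_price(100.0, 100.0, 1.0, 0.05, 0.2, "call")
put = black_scholes_price(100.0, 100.0, 1.0, 0.05, 0.2, "put")
print(round(call, 4), round(put, 4))
```

Listed AAPL options are American-style and the stock pays dividends, so this closed form is only an approximation for the contracts studied here, which is part of the motivation for a data-driven approach.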

**This project explores whether machine learning can learn option pricing directly from market data, potentially capturing patterns and relationships that theoretical models miss.**


### Dataset
The dataset contains approximately **2 million option observations** for AAPL spanning 2016 to 2020, including:

- **Market data:** Bid/ask prices, last traded price, trading volume
- **Option characteristics:** Strike price, days to expiration, option type (call/put)
- **Greeks:** Delta, Gamma, Vega, Theta, Rho (risk measures)
- **Implied volatility:** Market's expectation of future price movement

Together, these fields describe how AAPL options were priced across different market conditions.

### Methodology

The project follows a systematic machine learning workflow:

**Part 1: Data Exploration and Baseline Model**

1. **Descriptive Analysis**
   - Data Loading and Initial Inspection
   - Initial Data Quality Assessment
   - Data Preparation and Structuring
   - Comprehensive Data Quality Analysis
   - Distribution Analysis of Key Variables
   - Correlation Analysis

2. **Data Preprocessing**
   - Feature Engineering

3. **Problem Formalization**
   - Define prediction task, features, and evaluation metrics

4. **Baseline Model - Linear Regression**
   - Data Splitting and Scaling
   - Model Training and Evaluation
   - Visualization of predictions

**Part 2: Model Optimization and Ensemble Learning**

1. **Grid Search for Model Optimization**
   - Find optimal hyperparameters for Random Forest
   - Results interpretation

2. **Evaluate the Best Random Forest Model**
   - Performance assessment on test data
   - Feature Importance Analysis

3. **Ensemble Model - Voting Regressor**
   - Combine Linear Regression, Random Forest, and Gradient Boosting

4. **Comparison of All Models**
   - Performance metrics comparison
   - Visual comparison of RMSE and R²

5. **Discussion on Obstacles and Solutions**
   - Main challenges encountered and solutions implemented

6. **Final Conclusions**
   - Summary of results
   - Business applications and future work

7. **References**
   - Scientific papers and resources

### Key Concepts 

To understand option pricing, several key terms are essential:

**Option Basics:**
- **Call option:** Right to BUY the underlying at the strike price
- **Put option:** Right to SELL the underlying at the strike price
- **Strike price (K):** The predetermined price at which the option can be exercised
- **Underlying price (S):** Current market price of the stock (AAPL)
- **Expiration date:** When the option contract expires
- **Time to maturity:** Time remaining until expiration

**Moneyness (option position):**
- **ITM (In-The-Money):** Option has intrinsic value if exercised now
  - Call ITM: S > K (stock price above strike)
  - Put ITM: S < K (stock price below strike)
- **ATM (At-The-Money):** Strike approximately equals stock price (S ≈ K)
- **OTM (Out-of-The-Money):** Option has no intrinsic value, only time value
  - Call OTM: S < K (stock price below strike)
  - Put OTM: S > K (stock price above strike)

**Pricing Components:**
- **Intrinsic value:** Profit if the option is exercised immediately (max(S-K, 0) for calls, max(K-S, 0) for puts)
- **Time value:** Additional premium due to the possibility of a favorable price movement before expiration
- **Mid-price:** Average of bid and ask prices, used as an estimate of the fair market price

**Market Measures:**
- **Implied Volatility:** Market's expectation of future price volatility, derived from option prices
- **Bid:** Price at which market makers will BUY the option
- **Ask:** Price at which market makers will SELL the option
- **Greeks:** Risk measures (Delta = price sensitivity, Gamma = delta sensitivity, Vega = volatility sensitivity, Theta = time decay, Rho = interest-rate sensitivity)
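The pricing components and moneyness definitions above can be made concrete with a short sketch. The helper names, the `atm_band` tolerance used to decide what counts as "approximately" at-the-money, and the bid/ask quotes are all hypothetical choices made for illustration.

```python
def intrinsic_value(S: float, K: float, option_type: str) -> float:
    """Payoff if the option were exercised immediately."""
    if option_type == "call":
        return max(S - K, 0.0)
    if option_type == "put":
        return max(K - S, 0.0)
    raise ValueError("option_type must be 'call' or 'put'")


def mid_price(bid: float, ask: float) -> float:
    """Average of bid and ask."""
    return (bid + ask) / 2.0


def time_value(mid: float, S: float, K: float, option_type: str) -> float:
    """Premium paid on top of the intrinsic value."""
    return mid - intrinsic_value(S, K, option_type)


def moneyness_label(S: float, K: float, option_type: str, atm_band: float = 0.01) -> str:
    """Classify an option as ITM, ATM or OTM.

    atm_band is a hypothetical tolerance: strikes within 1% of S count as ATM.
    """
    if abs(S - K) / S <= atm_band:
        return "ATM"
    if option_type == "call":
        return "ITM" if S > K else "OTM"
    return "ITM" if S < K else "OTM"


# Hypothetical quote: stock at 157.92, strike 150, call quoted 8.40 / 8.60
S, K = 157.92, 150.0
mid = mid_price(8.40, 8.60)
print(moneyness_label(S, K, "call"), intrinsic_value(S, K, "call"), time_value(mid, S, K, "call"))
print(moneyness_label(S, K, "put"), intrinsic_value(S, K, "put"))
```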

# I. PART 1 - Data Exploration and Baseline Model

In this first part, our objective is to prepare the data and establish a baseline performance benchmark. 

**Data Loading and Restructuring**

We begin by loading the raw Apple options dataset, which spans 2016-2020. Each of its roughly 1 million rows holds both a call quote and a put quote for the same strike and expiration, which gives approximately 2 million option observations. Because the call and put fields sit in separate column sets (for example `[C_BID]` and ` [P_BID]`), the raw layout is problematic for modeling: we want a single unified model that learns pricing patterns for both option types simultaneously. To achieve this, we restructure the data by creating two separate dataframes for calls and puts with standardized column names, then combine them into `df_all` with an added `option_type` identifier. This unified structure simplifies subsequent analysis and ensures identical processing for both option types.
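The restructuring step described above could look roughly like the sketch below. It runs on a tiny made-up sample; the `[C_ASK]`/`[P_ASK]` columns, the stripped column names, and the helper `extract_side` are assumptions for illustration, and only `df_all` and `option_type` are names taken from the text.

```python
import pandas as pd

# Tiny made-up sample in the wide layout: one row carries a call and a put quote
raw = pd.DataFrame({
    "[QUOTE_DATE]": ["2019-01-02", "2019-01-02"],
    "[STRIKE]": [150.0, 160.0],
    "[C_BID]": [8.40, 1.95],
    "[C_ASK]": [8.60, 2.05],
    "[P_BID]": [0.50, 3.90],
    "[P_ASK]": [0.54, 4.10],
})

SHARED = ["[QUOTE_DATE]", "[STRIKE]"]  # columns common to calls and puts


def extract_side(df: pd.DataFrame, prefix: str, option_type: str) -> pd.DataFrame:
    """Keep shared columns plus one side (C or P) under standardized names."""
    side_cols = [c for c in df.columns if c.startswith(f"[{prefix}_")]
    out = df[SHARED + side_cols].copy()
    rename = {c: c.strip("[]").lower() for c in SHARED}
    # "[C_BID]" -> "bid", "[P_ASK]" -> "ask"
    rename.update({c: c[len(prefix) + 2:-1].lower() for c in side_cols})
    out = out.rename(columns=rename)
    out["option_type"] = option_type
    return out


# Stack calls and puts into one long table
df_all = pd.concat(
    [extract_side(raw, "C", "call"), extract_side(raw, "P", "put")],
    ignore_index=True,
)
print(df_all)
```

Each wide row becomes two long rows, which is why the number of option observations is about twice the number of raw rows.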

**Data Quality Analysis and Cleaning**

Once the data is properly structured, we check it for missing values, duplicates, and aberrant observations, because poor data quality compromises model performance. We pay particular attention to implied volatility, which is the market's expectation of future price movement and the only input of the Black-Scholes formula that cannot be observed directly. A missing implied volatility typically indicates an illiquid option for which market makers do not provide quotes, so we remove these rows to ensure our models train only on high-quality, tradeable options.
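A minimal sketch of these checks, on a made-up four-row table, is shown below. The column names (`iv`, `bid`, `ask`, `strike`) are assumed standardized names, not necessarily those used in the project.

```python
import numpy as np
import pandas as pd

# Made-up sample: row 1 duplicates row 0, row 3 has no implied volatility
df_all = pd.DataFrame({
    "strike": [150.0, 150.0, 160.0, 170.0],
    "bid": [8.40, 8.40, 1.95, 0.00],
    "ask": [8.60, 8.60, 2.05, 0.05],
    "iv": [0.25, 0.25, 0.22, np.nan],
    "option_type": ["call", "call", "call", "call"],
})

# Quality report: missing values per column and number of duplicated rows
missing_per_column = df_all.isna().sum()
n_duplicates = int(df_all.duplicated().sum())
print(missing_per_column)
print("duplicated rows:", n_duplicates)

# Cleaning: drop exact duplicates, then rows without implied volatility
df_clean = (
    df_all.drop_duplicates()
    .dropna(subset=["iv"])
    .reset_index(drop=True)
)
print(f"kept {len(df_clean)} of {len(df_all)} rows")
```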

**Exploratory Data Analysis**

Once the data is cleaned, we visualize the distributions of key variables (such as underlying price, strike, time to maturity, implied volatility) to validate that our cleaning worked and that the data reflects realistic market patterns. We also examine correlations between features to identify the strongest predictors of option prices and check for multicollinearity issues that might affect model performance. This analysis builds intuition about the data's characteristics.
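As an illustration of the correlation check, the sketch below ranks features by their correlation with the target and flags highly correlated feature pairs. The data is synthetic, and the column names and the 0.9 threshold are arbitrary assumptions, so the numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in for the cleaned table (not real market data)
S = rng.uniform(100.0, 200.0, n)
K = S * rng.uniform(0.9, 1.1, n)          # strikes cluster around the spot
dte = rng.integers(1, 365, n).astype(float)
iv = rng.uniform(0.15, 0.60, n)
mid = np.maximum(S - K, 0.0) + 10.0 * iv * np.sqrt(dte / 365.0)  # made-up price

df = pd.DataFrame({
    "underlying_price": S, "strike": K, "dte": dte, "iv": iv, "mid_price": mid,
})
features = ["underlying_price", "strike", "dte", "iv"]

corr = df[features + ["mid_price"]].corr()

# Features ranked by absolute correlation with the target
target_corr = corr["mid_price"].drop("mid_price").abs().sort_values(ascending=False)
print(target_corr)

# Feature pairs whose correlation exceeds the threshold (possible multicollinearity)
threshold = 0.9
high_pairs = [
    (a, b, round(float(corr.loc[a, b]), 3))
    for i, a in enumerate(features)
    for b in features[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(high_pairs)
```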

**Feature Engineering**

Beyond the raw variables, we create derived features that better capture option pricing dynamics. What matters for option value is not the absolute strike or stock price but their ratio, which determines whether the option is In-The-Money (ITM), At-The-Money (ATM), or Out-of-The-Money (OTM). We create `moneyness` (K/S) and `log_moneyness` to capture this relationship explicitly.
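The two engineered features can be computed in a couple of lines. In this sketch the column names `strike` and `underlying_price` are assumptions, and the three rows are made up to show a strike below, at, and above the stock price.

```python
import numpy as np
import pandas as pd

df_all = pd.DataFrame({
    "underlying_price": [157.92, 157.92, 157.92],
    "strike": [100.0, 157.92, 200.0],
})

# moneyness = K / S; equal to 1 when the option is exactly at-the-money
df_all["moneyness"] = df_all["strike"] / df_all["underlying_price"]

# log_moneyness is 0 at-the-money and symmetric around it in ratio terms
df_all["log_moneyness"] = np.log(df_all["moneyness"])

print(df_all)
```

With the K/S convention, a value below 1 corresponds to an ITM call (and an OTM put), and a value above 1 to an OTM call (and an ITM put).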

**Problem Formalization and Baseline Model**

After feature engineering, we formally define our prediction task: predict `mid_price` (the average of bid and ask) using seven features, including our engineered variables. We split the data 80/20 for training and testing, then train a Linear Regression baseline model. The purpose of the baseline is not to achieve strong performance: we expect it to struggle because option prices depend non-linearly on their inputs. The baseline's limitations will motivate the need for more advanced algorithms in Part 2.
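The baseline workflow (80/20 split, scaling, Linear Regression, RMSE and R²) can be sketched with scikit-learn as below. The data is synthetic with a made-up non-linear target, and it uses five illustrative features rather than the project's seven, so the printed scores are not the project's results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 2000

# Synthetic inputs: spot, strike, time to maturity in years (not real data)
S = rng.uniform(100.0, 200.0, n)
K = S * rng.uniform(0.8, 1.2, n)
T = rng.uniform(0.02, 1.0, n)
X = np.column_stack([S, K, T, K / S, np.log(K / S)])

# Made-up non-linear target standing in for mid_price
y = np.maximum(S - K, 0.0) + 5.0 * np.sqrt(T)

# 80/20 split; the scaler is fitted on the training set only to avoid leakage
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler().fit(X_train)
model = LinearRegression().fit(scaler.transform(X_train), y_train)

y_pred = model.predict(scaler.transform(X_test))
rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))
r2 = float(r2_score(y_test, y_pred))
print(f"RMSE: {rmse:.3f}  R2: {r2:.3f}")
```

Scaling does not change the fit of an ordinary least-squares model, but it makes coefficients comparable across features and keeps the pipeline reusable for models that are sensitive to scale.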

## 1. Descriptive analysis

In this section, we explore and understand the dataset. The objective is to analyze its structure, identify the main variables, and visualize how the data is distributed.

### 1.1 Data Loading and Initial Inspection

We begin by loading the dataset and examining its structure to understand the available features and data quality.

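In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display  # built into notebooks; explicit import for scripts

# Load the raw AAPL options file; low_memory=False parses it in a single pass
# so that pandas infers one dtype per column
df = pd.read_csv("data/aapl_2016_2020.csv", low_memory=False)

# First rows, column types with non-null counts, and summary statistics
print("\nDataset view:")
display(df.head())
print("\nDataset info:")
df.info()
print("\nDataset statistics:")
display(df.describe())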

Dataset view:


| | [QUOTE_UNIXTIME] | [QUOTE_READTIME] | [QUOTE_DATE] | [QUOTE_TIME_HOURS] | [UNDERLYING_LAST] | [EXPIRE_DATE] | [EXPIRE_UNIX] | [DTE] | [C_DELTA] | [C_GAMMA] | ... | [P_LAST] | [P_DELTA] | [P_GAMMA] | [P_VEGA] | [P_THETA] | [P_RHO] | [P_IV] | [P_VOLUME] | [STRIKE_DISTANCE] | [STRIKE_DISTANCE_PCT] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1546462800 | 2019-01-02 16:00 | 2019-01-02 | 16.0 | 157.92 | 2019-01-04 | 1546635600 | 2.0 | 0.90886 | 0.00019 | ... | 0.01 | -0.00034 | 0.00011 | 0.00079 | -0.00509 | -0.00041 | 1.62555 | 0.0 | 57.9 | 0.367 |
| 1 | 1546462800 | 2019-01-02 16:00 | 2019-01-02 | 16.0 | 157.92 | 2019-01-04 | 1546635600 | 2.0 | 1.0 | 0.0 | ... | 0.01 | -0.00069 | 0.0001 | 0.00039 | -0.00518 | -0.0001 | 1.4619 | 200.0 | 52.9 | 0.335 |
| 2 | 1546462800 | 2019-01-02 16:00 | 2019-01-02 | 16.0 | 157.92 | 2019-01-04 | 1546635600 | 2.0 | 1.0 | 0.0 | ... | 0.04 | -0.00066 | 0.0002 | 0.0 | -0.00425 | -9e-05 | 1.30549 | 706.0 | 47.9 | 0.303 |
| 3 | 1546462800 | 2019-01-02 16:00 | 2019-01-02 | 16.0 | 157.92 | 2019-01-04 | 1546635600 | 2.0 | 1.0 | 0.0 | ... | 0.01 | -0.0012 | 0.00021 | 0.00089 | -0.00434 | -5e-05 | 1.15513 | 0.0 | 42.9 | 0.272 |
| 4 | 1546462800 | 2019-01-02 16:00 | 2019-01-02 | 16.0 | 157.92 | 2019-01-04 | 1546635600 | 2.0 | 1.0 | 0.0 | ... | 0.01 | -0.00109 | 0.00024 | 0.00045 | -0.00429 | -0.0002 | 1.01062 | 0.0 | 37.9 | 0.24 |



Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1015352 entries, 0 to 1015351
Data columns (total 33 columns):
 #   Column                  Non-Null Count    Dtype  
---  ------                  --------------    -----  
 0   [QUOTE_UNIXTIME]        1015352 non-null  int64  
 1    [QUOTE_READTIME]       1015352 non-null  object 
 2    [QUOTE_DATE]           1015352 non-null  object 
 3    [QUOTE_TIME_HOURS]     1015352 non-null  float64
 4    [UNDERLYING_LAST]      1015352 non-null  float64
 5    [EXPIRE_DATE]          1015352 non-null  object 
 6    [EXPIRE_UNIX]          1015352 non-null  int64  
 7    [DTE]                  1015352 non-null  float64
 8    [C_DELTA]              1015352 non-null  object 
 9    [C_GAMMA]              1015352 non-null  object 
 10   [C_VEGA]               1015352 non-null  object 
 11   [C_THETA]              1015352 non-null  object 
 12   [C_RHO]                1015352 non-null  object 
 13   [C_IV]                 1015352 non-null  

| | [QUOTE_UNIXTIME] | [QUOTE_TIME_HOURS] | [UNDERLYING_LAST] | [EXPIRE_UNIX] | [DTE] | [STRIKE] | [STRIKE_DISTANCE] | [STRIKE_DISTANCE_PCT] |
|---|---|---|---|---|---|---|---|---|
| count | 1015352.0 | 1015352.0 | 1015352.0 | 1015352.0 | 1015352.0 | 1015352.0 | 1015352.0 | 1015352.0 |
| mean | 1539378000.0 | 16.0 | 193.9912 | 1553661000.0 | 165.307 | 180.9003 | 57.59385 | 0.2971443 |
| std | 46295910.0 | 0.0 | 85.86036 | 51928480.0 | 210.3252 | 102.4306 | 54.89469 | 0.2447493 |
| min | 1451941000.0 | 16.0 | 90.34 | 1452287000.0 | 0.0 | 2.5 | 0.0 | 0.0 |
| 25% | 1499285000.0 | 16.0 | 124.4 | 1510952000.0 | 22.0 | 110.0 | 16.6 | 0.101 |
| 50% | 1541797000.0 | 16.0 | 173.0 | 1554494000.0 | 64.04 | 160.0 | 40.9 | 0.228 |
| 75% | 1582837000.0 | 16.0 | 223.86 | 1596226000.0 | 224.0 | 230.0 | 82.2 | 0.445 |
| max | 1609448000.0 | 16.0 | 506.19 | 1679083000.0 | 890.96 | 1000.0 | 500.8 | 1.991 |
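The `info()` output above shows that most column names carry a leading space and square brackets, and that the Greeks and implied volatility columns are read as `object` rather than numeric. One possible way to handle both is sketched below on a made-up two-row frame. That the non-numeric entries are blank strings is an assumption, and the lowercase target names are an illustrative choice.

```python
import pandas as pd

# Made-up sample mimicking the raw header style and text-typed numeric columns
raw = pd.DataFrame({
    "[QUOTE_UNIXTIME]": [1546462800, 1546462800],
    " [UNDERLYING_LAST]": [157.92, 157.92],
    " [C_DELTA]": ["0.90886", " "],   # assumed: blanks where no value is quoted
    " [C_IV]": ["0.25", ""],
})

df = raw.copy()

# " [C_DELTA]" -> "c_delta": drop surrounding spaces and brackets, lowercase
df.columns = df.columns.str.strip().str.strip("[]").str.lower()

# Convert text columns to floats; anything unparseable becomes NaN
for col in ["c_delta", "c_iv"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

print(df.dtypes)
print(df)
```

Coercing to NaN rather than raising keeps the row count intact, so missing values can be counted and handled explicitly during the data quality step.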
