# Cell 1: Introduction to Change Point Modeling

**Purpose**: This notebook implements a Bayesian change point model using PyMC to detect structural breaks in Brent oil price log returns and associate them with major events for the 10 Academy Week 10 Challenge (Task 2).

**Objectives**:
- Load and prepare preprocessed Brent oil price data.
- Build and run a Bayesian change point model on log returns.
- Interpret results to identify change points and quantify impacts.
- Associate change points with events from `major_events.csv`.

**Input**: `data/processed/cleaned_oil_data.csv`, `data/events/major_events.csv`
**Output**: Plots in `results/figures/`, model results in `results/models/`, and interpretation dictionary.
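
For orientation, a single change point model of the kind described above can be written as follows. This is a sketch inferred from the variable names the sampler reports later in the notebook (`tau_idx`, `mu_1`, `mu_2`, `sigma`); the actual priors live in `src/models/changepoint_model.py` and are not shown here, so they are left unspecified. The symbols $r_t$ (log return at step $t$) and $n$ (number of observations) are notation introduced for this sketch.

```latex
\begin{aligned}
\tau &\in \{0, 1, \dots, n-1\} && \text{(change point index)} \\
r_t \mid \tau, \mu_1, \mu_2, \sigma &\sim
\begin{cases}
\mathcal{N}(\mu_1, \sigma^2), & t < \tau \\
\mathcal{N}(\mu_2, \sigma^2), & t \ge \tau
\end{cases}
\end{aligned}
```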

In [1]:
# Cell 2: Import Required Libraries
# Description: Import Python libraries for modeling, data handling, and visualization.

import pandas as pd
import numpy as np
import pymc as pm
import arviz as az
import matplotlib.pyplot as plt
import csv

# Add the project root (the parent of the notebook's working directory) to sys.path
# so that the `src` package can be imported
import sys
import os
sys.path.append(os.path.join(os.getcwd(), '..'))
from src.models.changepoint_model import (
    build_changepoint_model,
    run_mcmc,
    plot_changepoint_results,
    interpret_changepoint,
)

# Ensure plots are displayed inline
%matplotlib inline



In [2]:
# Cell 3: Load Preprocessed Data
# Description: Load preprocessed Brent oil price data and major events dataset.
# Input: Preprocessed CSV file and events CSV file
# Output: DataFrames for prices and events

# Define file paths
PRICE_PATH = '../data/processed/cleaned_oil_data.csv'
EVENTS_PATH = '../data/events/major_events.csv'

# Load price data
price_df = pd.read_csv(PRICE_PATH, parse_dates=['Date'])

# Load events data; text fields are wrapped in double quotes so they may contain commas
events_df = pd.read_csv(EVENTS_PATH, quotechar='"', quoting=csv.QUOTE_NONNUMERIC)

# Display first few rows to verify
print("Price Data Head:")
print(price_df.head())
print("\nEvents Data Head:")
print(events_df.head())

Price Data Head:
        Date  Price  Log_Price  Log_Returns
0 1987-05-20  18.63   2.924773          NaN
1 1987-05-21  18.45   2.915064    -0.009709
2 1987-05-22  18.55   2.920470     0.005405
3 1987-05-25  18.60   2.923162     0.002692
4 1987-05-26  18.63   2.924773     0.001612

Events Data Head:
                Event Name  Start Date         Category  \
0          Gulf War Begins  1990-08-02         Conflict   
1   Asian Financial Crisis  1997-07-02   Economic Shock   
2             9/11 Attacks  2001-09-11  Political Shock   
3          Iraq War Begins  2003-03-20         Conflict   
4  Global Financial Crisis  2008-09-15   Economic Shock   

                                         Description  \
0  Iraq invades Kuwait; major geopolitical shock ...   
1  Currency collapse across Asia affects global d...   
2  Terror attacks create market instability and e...   
3  U.S. invasion leads to long-term shifts in Mid...   
4  Lehman collapse triggers global recession and ...   
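
The quoting behaviour relied on above can be checked in isolation. The sketch below uses a hypothetical two-row sample (the description text is made up for illustration, not taken from `major_events.csv`) and pandas' default quoting, which already keeps commas inside double-quoted fields intact. It also parses `Start Date` as a datetime, which is convenient for later date arithmetic.

```python
import io

import pandas as pd

# Hypothetical sample mimicking the column layout of the events file.
sample = io.StringIO(
    "Event Name,Start Date,Category,Description\n"
    'Gulf War Begins,1990-08-02,Conflict,"Iraq invades Kuwait, prices spike"\n'
    '9/11 Attacks,2001-09-11,Political Shock,"Terror attacks, market instability"\n'
)

# Default quoting (csv.QUOTE_MINIMAL) treats a quoted field as one value,
# even when it contains the delimiter.
events = pd.read_csv(sample, parse_dates=["Start Date"])

print(events.dtypes)
print(events.loc[0, "Description"])
```

If the real file mixes quoted and unquoted columns, the default setting is usually the safer starting point.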

        

In [3]:
# Cell 4: Prepare Data for Modeling
# Description: Downsample daily log returns to weekly means, then extract the returns and dates for change point modeling.
# Input: Preprocessed DataFrame
# Output: NumPy array of weekly mean log returns and Series of week-ending dates

# Drop rows with NaN log returns (first row due to diff)
model_df = price_df.dropna(subset=['Log_Returns'])

# Downsample to weekly data: mean of the daily values in each week ending on Sunday
model_df = model_df.resample('W', on='Date').mean().reset_index()

# Extract log returns as numpy array
log_returns = model_df['Log_Returns'].values

# Extract corresponding dates
dates = model_df['Date']

# Number of observations (weeks, despite the variable name)
n_days = len(log_returns)

# Print data summary
print(f"Number of weeks: {n_days}")
print(f"Log returns range: {log_returns.min():.4f} to {log_returns.max():.4f}")

Number of weeks: 1853
Log returns range: -0.0647 to 0.0676
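
Note that averaging daily log returns within a week yields the mean daily log return for that week, not the weekly log return; summing them gives the latter. The sketch below illustrates the difference on a small synthetic price series (the prices are made up for illustration). Which aggregate is appropriate depends on how the results are meant to be read.

```python
import numpy as np
import pandas as pd

# Synthetic daily prices on ten business days (illustrative values only).
days = pd.date_range("2020-01-06", periods=10, freq="B")
prices = pd.Series(
    [50.0, 51.0, 50.5, 52.0, 51.5, 53.0, 52.0, 54.0, 53.5, 55.0], index=days
)

# Daily log returns; the first value is NaN and is dropped.
daily = np.log(prices).diff().dropna()

# Mean of daily log returns per week (what the cell above computes).
weekly_mean = daily.resample("W").mean()

# Sum of daily log returns per week, i.e. the log return over the week.
weekly_sum = daily.resample("W").sum()

print(pd.DataFrame({"mean": weekly_mean, "sum": weekly_sum}))
```

For a full five-day week the two differ by a factor of five, so the scale of `mu_1`, `mu_2`, and `sigma` depends on this choice.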


In [4]:
# Cell 5: Build and Run Change Point Model
# Description: Build the Bayesian model and run MCMC sampling.
# Input: Log returns array and number of weeks
# Output: PyMC model and InferenceData trace

# Build model
model = build_changepoint_model(log_returns, n_days)

# Run MCMC sampling (250 draws after 5,000 tuning steps per chain)
trace = run_mcmc(model, n_days, log_returns, draws=250, tune=5000)

# Summarize the posterior; the r_hat and ess_* columns are the convergence diagnostics
print("MCMC Summary:")
print(az.summary(trace, round_to=4))

Multiprocess sampling (4 chains in 4 jobs)
CompoundStep
>Metropolis: [tau_idx]
>NUTS: [mu_1, mu_2, sigma]


Sampling 4 chains for 5_000 tune and 250 draw iterations (20_000 + 1_000 draws total) took 4794 seconds.
There were 229 divergences after tuning. Increase `target_accept` or reparameterize.
The rhat statistic is larger than 1.01 for some parameters. This indicates problems during sampling. See https://arxiv.org/abs/1903.08008 for details
The effective sample size per chain is smaller than 100 for some parameters.  A higher number is needed for reliable rhat and ess computation. See https://arxiv.org/abs/1903.08008 for details


MCMC Summary:
             mean        sd  hdi_3%    hdi_97%  mcse_mean  mcse_sd  ess_bulk  \
tau_idx   49.5040   31.9015  0.0000    89.0000    10.0447   2.2037    9.9592   
mu_1       0.0002    0.0012 -0.0015     0.0025     0.0001   0.0002  167.8062   
mu_2      -0.0002    0.0007 -0.0014     0.0007     0.0003   0.0001    7.2100   
sigma      0.0107    0.0002  0.0105     0.0111     0.0000   0.0000  107.6897   
tau      990.0800  638.0300  0.0000  1780.0000   200.8947  44.0732    9.9592   

         ess_tail   r_hat  
tau_idx   36.5863  1.3428  
mu_1      68.9762  1.1029  
mu_2      22.1524  1.5439  
sigma    165.1581  1.0440  
tau       36.5863  1.3428  
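
The diagnostics in this table can be screened programmatically. The sketch below copies the `ess_bulk` and `r_hat` columns from the summary above into a DataFrame and flags parameters that fail two thresholds taken from the sampler warnings: `r_hat` above 1.01, and a bulk effective sample size below 100 per chain (400 for 4 chains). The helper name `flag_unconverged` and its defaults are assumptions for illustration; in the notebook, `az.summary(trace)` could be passed in directly.

```python
import pandas as pd

# Diagnostic columns copied from the az.summary output above.
summary = pd.DataFrame(
    {
        "ess_bulk": [9.9592, 167.8062, 7.2100, 107.6897, 9.9592],
        "r_hat": [1.3428, 1.1029, 1.5439, 1.0440, 1.3428],
    },
    index=["tau_idx", "mu_1", "mu_2", "sigma", "tau"],
)


def flag_unconverged(summary, rhat_max=1.01, min_ess=400):
    """Return the rows whose r_hat or bulk ESS fails the given thresholds."""
    bad = (summary["r_hat"] > rhat_max) | (summary["ess_bulk"] < min_ess)
    return summary.loc[bad]


flagged = flag_unconverged(summary)
print(flagged.index.tolist())
```

Every parameter is flagged here, which suggests the posterior summaries from this run should be treated as provisional until sampling is improved.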


In [5]:
# Cell 6: Visualize Model Results
# Description: Plot log returns with estimated change point and posterior distribution of tau.
# Input: MCMC trace, log returns data, and dates
# Output: Plot saved to 'results/figures/changepoint_results.png'

# Define output path
OUTPUT_PATH = '../results/figures/changepoint_results.png'

# Plot results
plot_changepoint_results(trace, log_returns, dates, OUTPUT_PATH)

# Display plot (already saved)
plt.show()
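
As a rough cross-check on the sampled change point, one option is to evaluate a profile log-likelihood for a single mean shift at every candidate index. The sketch below is a plug-in maximum-likelihood calculation, not the notebook's Bayesian model; the function name `changepoint_log_likelihood` is hypothetical and the data are synthetic. In the notebook, `log_returns` could be passed in place of `y`.

```python
import numpy as np


def changepoint_log_likelihood(y):
    """Profile Gaussian log-likelihood of one mean shift at each candidate index.

    For each tau, the two segment means and the shared variance are replaced
    by their maximum-likelihood estimates.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    taus = np.arange(1, n)  # keep at least one point in each segment
    csum = np.cumsum(y)
    total, total_sq = csum[-1], np.sum(y**2)
    n1, n2 = taus, n - taus
    s1 = csum[taus - 1]  # sum of the first segment y[:tau]
    s2 = total - s1      # sum of the second segment y[tau:]
    # Residual sum of squares around the two segment means.
    rss = total_sq - s1**2 / n1 - s2**2 / n2
    sigma2 = rss / n
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return taus, loglik


# Synthetic series with a clear mean shift at index 100.
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(3.0, 1.0, 100)])

taus, loglik = changepoint_log_likelihood(y)
best_tau = int(taus[np.argmax(loglik)])
print(best_tau)
```

A sharply peaked profile points to a well-identified change point, while a nearly flat one suggests the data carry little information about a shift in the mean.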

In [6]:
# Cell 7: Interpret Change Point Results
# Description: Associate change point with events and quantify impact on log returns.
# Input: MCMC trace, dates, and events DataFrame
# Output: Dictionary with interpretation results

# Interpret results
results = interpret_changepoint(trace, dates, events_df)

# Print interpretation
print("Change Point Interpretation:")
for key, value in results.items():
    print(f"{key}: {value}")

Change Point Interpretation:
change_point_date: 2006-05-14
associated_event: Global Financial Crisis
event_date: 2008-09-15
event_description: Lehman collapse triggers global recession and a dramatic oil demand drop
log_return_change: -0.0003
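
The event association can be made more explicit by reporting how far the matched event is from the change point. The sketch below uses the five events and dates shown in the events table earlier and a nearest-date rule; the helper `nearest_event` and the 90-day window are assumptions for illustration and may differ from what `interpret_changepoint` does internally.

```python
import pandas as pd

# Event names and start dates as shown in the events table earlier.
events = pd.DataFrame(
    {
        "Event Name": [
            "Gulf War Begins",
            "Asian Financial Crisis",
            "9/11 Attacks",
            "Iraq War Begins",
            "Global Financial Crisis",
        ],
        "Start Date": pd.to_datetime(
            ["1990-08-02", "1997-07-02", "2001-09-11", "2003-03-20", "2008-09-15"]
        ),
    }
)


def nearest_event(change_point, events, window_days=90):
    """Find the event closest in time and report the gap in days."""
    gaps = (events["Start Date"] - change_point).abs()
    idx = gaps.idxmin()
    gap_days = int(gaps.loc[idx].days)
    return {
        "event": events.loc[idx, "Event Name"],
        "gap_days": gap_days,
        "within_window": gap_days <= window_days,  # hypothetical tolerance
    }


match = nearest_event(pd.Timestamp("2006-05-14"), events)
print(match)
```

Reporting the gap alongside the event name makes it easier to judge whether a match is plausible or merely the closest entry in a sparse list.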
