# Market Regime Detection and Risk Analysis

## Research Question

**Which model (Gaussian Mixture Model, Hidden Markov Model, or K-Means clustering) provides the most accurate market regime detection and the best risk assessment for portfolio management?**

This analysis compares three machine learning approaches to identify distinct market regimes (bull, bear, sideways markets) in financial time series data. We evaluate their performance in detecting regime transitions and providing actionable risk metrics for portfolio management.


## 1. Introduction

### What are Market Regimes?

Market regimes are distinct states or phases that financial markets go through, characterized by different statistical properties:
- **Bull markets**: Rising prices, low volatility, positive returns
- **Bear markets**: Falling prices, high volatility, negative returns  
- **Sideways markets**: Stable prices, moderate volatility, mixed returns

### Why is Regime Detection Important?

Identifying the current market regime helps:
- **Risk management**: Adjust portfolio allocation based on regime-specific risks
- **Strategy selection**: Use different strategies for different regimes
- **Early warning**: Detect transitions before they fully materialize

### Our Approach

We compare three models:
1. **GMM (Gaussian Mixture Model)**: Probabilistic clustering without temporal structure
2. **HMM (Hidden Markov Model)**: Probabilistic model with temporal dependencies (can model regime transitions)
3. **K-Means**: Baseline non-probabilistic clustering (included to demonstrate limitations)

**Note on K-Means**: K-Means is included as a baseline to illustrate why hard-assignment clustering, which has neither a probabilistic model nor temporal structure, is poorly suited to financial time series. We expect it to show limitations such as an unbalanced regime distribution and unrealistic risk metrics.


In [None]:
# Cell 1: Imports
%matplotlib inline

import sys
import os

# Add project root to path to import src modules
# This allows importing from src/ when running notebook from notebooks/ directory
current_dir = os.getcwd()
if current_dir.endswith('notebooks'):
    project_root = os.path.dirname(current_dir)
else:
    project_root = current_dir

if project_root not in sys.path:
    sys.path.insert(0, project_root)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Matplotlib configuration
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Import custom modules
from src import (
    load_market_data,
    prepare_features,
    clean_market_data,
    ModelComparison,
    calculate_regime_risk_metrics
)
from src.visualization import (
    plot_regime_detection,
    plot_regime_statistics,
    plot_model_comparison,
    plot_transition_matrix,
    plot_risk_metrics
)
from src.comparison import (
    calculate_regime_statistics_from_regimes
)

print("✓ Libraries imported successfully!")


## 2. Data

### What data are we using? Where from?

**Data Source**: Yahoo Finance  
**Ticker**: `^GSPC` (S&P 500 Index)  
**Period**: 2005-01-01 to 2025-10-31

### Why ^GSPC instead of SPY?

- **Pure market representation**: The index directly reflects market dynamics without ETF effects
- **No premium/discount**: ETFs can trade at premiums or discounts to NAV
- **No management fees**: Index data is not affected by ETF expense ratios
- **Better for regime detection**: More accurate representation of underlying market regimes

### Why this time period?

The selected period includes multiple distinct market regimes:
- **Pre-crisis bull market** (2005-2006): Normal market conditions
- **Financial crisis** (2008-2009): Major bear market and crash
- **Post-crisis recovery** (2010-2012): Recovery period
- **Extended bull market** (2013-2019): Long period of growth
- **COVID-19 crash** (2020): Pandemic-induced volatility
- **Recent period** (2024-2025): Current market conditions

This allows us to test each model's ability to detect major regime transitions.


In [None]:
# Cell 2: Load data
TICKER = "^GSPC"  # S&P 500 Index
START_DATE = "2005-01-01"
END_DATE = "2025-10-31"

print(f"Loading data for {TICKER}...")
print(f"Period: {START_DATE} to {END_DATE}")

market_data_raw = load_market_data(
    ticker=TICKER,
    start_date=START_DATE,
    end_date=END_DATE,
    raw_data_dir=os.path.join(project_root, "data", "raw")
)

print(f"\n✓ Raw data loaded: {len(market_data_raw):,} observations")
print(f"  Date range: {market_data_raw.index[0].date()} to {market_data_raw.index[-1].date()}")
print(f"  Columns: {list(market_data_raw.columns)}")

# Show what we loaded
print("\nFirst few rows:")
market_data_raw.head()


### Data Cleaning

We need to ensure data quality before analysis. Our cleaning process:
1. **OHLC validation**: Check that High ≥ Low, High ≥ Close, Low ≤ Close
2. **Duplicate detection**: Remove duplicate dates
3. **Temporal continuity**: Identify missing trading days
4. **Zero volume detection**: Flag days with zero volume
5. **Unrealistic returns**: Flag extreme returns (>20%) for review

**Important**: We do NOT remove return outliers, as they represent important regime changes (e.g., market crashes, rallies).
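
The checks listed above can be sketched with pandas as follows. This is a minimal illustration under stated assumptions, not the actual `clean_market_data` implementation: the helper name `flag_data_issues`, the column names, and the choice to apply the 20% threshold to the absolute close-to-close return are all hypothetical. The temporal continuity check is left out because it needs an exchange calendar.

```python
import pandas as pd


def flag_data_issues(df: pd.DataFrame, max_abs_return: float = 0.20) -> dict:
    """Count rows that fail basic quality checks (hypothetical helper)."""
    # OHLC consistency: High must be the largest value, Low the smallest
    bad_ohlc = (
        (df["High"] < df["Low"])
        | (df["High"] < df["Close"])
        | (df["Low"] > df["Close"])
    )
    # Repeated timestamps in the index (first occurrence is not counted)
    duplicated = df.index.duplicated(keep="first")
    # Zero-volume days, only checked when a Volume column is present
    if "Volume" in df.columns:
        zero_volume = df["Volume"] == 0
    else:
        zero_volume = pd.Series(False, index=df.index)
    # Extreme close-to-close moves are flagged for review, never removed
    extreme = df["Close"].pct_change().abs() > max_abs_return
    return {
        "invalid_ohlc": int(bad_ohlc.sum()),
        "duplicate_dates": int(duplicated.sum()),
        "zero_volume": int(zero_volume.sum()),
        "extreme_returns": int(extreme.sum()),
    }


# Tiny synthetic example (made-up numbers, not market data)
sample = pd.DataFrame(
    {
        "High": [101.0, 99.0, 130.0],
        "Low": [99.0, 100.0, 120.0],
        "Close": [100.0, 99.5, 125.0],
        "Volume": [1000, 0, 1200],
    },
    index=pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-03"]),
)
issues = flag_data_issues(sample)
print(issues)
```

Returning counts rather than a filtered frame keeps flagging separate from removal, which matches the decision to keep extreme returns in the data.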


In [None]:
# Cell 3: Data cleaning
print("Cleaning data...")
print("Note: We keep return outliers as they indicate regime changes (crashes, rallies)")

market_data, cleaning_stats = clean_market_data(
    market_data_raw,
    fill_missing=True,
    validate_ohlc=True,
    verbose=True
)

print(f"\n✓ Cleaned data: {len(market_data):,} observations")
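row_delta = len(market_data) - len(market_data_raw)
# With fill_missing=True rows can be added as well as dropped, so report the net change
print(f"  Net row change vs. raw data: {row_delta:+,}")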

# Remove Volume column (not used in models, and for indices it's just a proxy)
if 'Volume' in market_data.columns:
    market_data = market_data.drop(columns=['Volume'])
    print("Removed 'Volume' column (not used in analysis)")

# Show cleaning statistics
if cleaning_stats:
    print("\nCleaning statistics:")
    for key, value in cleaning_stats.items():
        if isinstance(value, dict):
            print(f"  {key}:")
            for k, v in value.items():
                print(f"    {k}: {v}")
        else:
            print(f"  {key}: {value}")

market_data.head()


## 3. Exploration

### What do we see in the data?

Let's explore the raw price data and returns to understand the market behavior over our time period.


In [None]:
# Cell 4: Calculate returns and explore data
# Use prepare_features to get returns (same as main.py)
features_temp, _ = prepare_features(
    market_data,
    include_volatility=False,
    verbose=False
)
returns = features_temp['returns']

print("Returns statistics:")
print(f"  Mean daily return: {returns.mean():.4f} ({returns.mean()*100:.2f}%)")
print(f"  Std deviation: {returns.std():.4f} ({returns.std()*100:.2f}%)")
print(f"  Min return: {returns.min():.4f} ({returns.min()*100:.2f}%)")
print(f"  Max return: {returns.max():.4f} ({returns.max()*100:.2f}%)")
print(f"  Skewness: {returns.skew():.2f}")
# pandas' kurtosis() uses Fisher's definition (a normal distribution scores 0)
print(f"  Excess kurtosis: {returns.kurtosis():.2f}")

# Visualize price and returns
fig, axes = plt.subplots(2, 1, figsize=(14, 8))

# Price plot
axes[0].plot(market_data.index, market_data['Close'], linewidth=1.5, color='steelblue')
axes[0].set_title('S&P 500 Index Level (2005-2025)', fontsize=14, fontweight='bold')
axes[0].set_ylabel('Index level (points)', fontsize=12)
axes[0].grid(True, alpha=0.3)
axes[0].axvline(pd.Timestamp('2008-09-15'), color='red', linestyle='--', alpha=0.7, label='Lehman Brothers')
axes[0].axvline(pd.Timestamp('2020-03-15'), color='orange', linestyle='--', alpha=0.7, label='COVID-19')
axes[0].legend()

# Returns plot
axes[1].plot(returns.index, returns.values, linewidth=0.5, color='darkgreen', alpha=0.6)
axes[1].axhline(0, color='black', linestyle='-', linewidth=0.5)
axes[1].set_title('Daily Returns', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Date', fontsize=12)
axes[1].set_ylabel('Daily Return', fontsize=12)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n✓ Data exploration complete")


### Feature Engineering

For regime detection, we need features that capture market characteristics:
- **Returns**: Direct measure of price movement
- **Volatility**: Rolling standard deviation of returns (captures market uncertainty)
- **Volume ratios**: Trading volume patterns (optional; not used here, because the `Volume` column is dropped in Cell 3)

These features help distinguish between different market regimes.
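
A minimal sketch of the two features used in this notebook is shown below. It assumes simple (percentage) daily returns and an unannualized rolling standard deviation. The helper `build_features` is hypothetical, and `prepare_features` in `src` may differ, for example by using log returns.

```python
import numpy as np
import pandas as pd


def build_features(close: pd.Series, window: int = 20) -> pd.DataFrame:
    """Return daily simple returns and their rolling standard deviation."""
    returns = close.pct_change()
    volatility = returns.rolling(window).std()
    features = pd.DataFrame({"returns": returns, "volatility": volatility})
    # The first `window` rows lack a full window of returns and are dropped
    return features.dropna()


# Synthetic price path for illustration only (not S&P 500 data)
rng = np.random.default_rng(0)
prices = pd.Series(
    100 * np.exp(np.cumsum(rng.normal(0, 0.01, 250))),
    index=pd.bdate_range("2020-01-01", periods=250),
)
demo_features = build_features(prices, window=20)
print(demo_features.shape)
```

A longer window gives a smoother volatility series but reacts more slowly to sudden regime changes.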


In [None]:
# Cell 5: Prepare features
print("Preparing features for regime detection...")

features, feature_stats = prepare_features(
    market_data,
    include_volatility=True,
    volatility_window=20,  # 20-day rolling volatility
    verbose=True
)

print(f"\n✓ Features prepared: {len(features):,} observations")
print(f"  Feature columns: {list(features.columns)}")

# Select features for analysis
feature_columns = ['returns', 'volatility']
available_features = [col for col in feature_columns if col in features.columns]
X = features[available_features].copy()
returns = features['returns'].copy()

print(f"\nSelected features for modeling: {available_features}")
print(f"Data shape: {X.shape}")

# Show feature statistics
print("\nFeature statistics:")
print(X.describe())

# Visualize feature distributions
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].hist(X['returns'], bins=100, alpha=0.7, color='steelblue', edgecolor='black')
axes[0].set_title('Distribution of Returns', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Daily Return', fontsize=11)
axes[0].set_ylabel('Frequency', fontsize=11)
axes[0].grid(True, alpha=0.3)

axes[1].hist(X['volatility'], bins=100, alpha=0.7, color='darkgreen', edgecolor='black')
axes[1].set_title('Distribution of Volatility (20-day rolling)', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Volatility', fontsize=11)
axes[1].set_ylabel('Frequency', fontsize=11)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()


## 4. Analysis

### What methods did we apply?

We compare three models for regime detection:

1. **GMM (Gaussian Mixture Model)**: Probabilistic clustering that identifies regimes based on return and volatility distributions
2. **HMM (Hidden Markov Model)**: Similar to GMM, but also models temporal dependencies, so it can estimate regime transition probabilities
3. **K-Means**: Baseline clustering method (included to demonstrate limitations)

### Number of Regimes

We use **3 regimes by convention** for comparability between models, interpretability (bull/bear/sideways), and academic precedent. This choice aligns with common practice in regime detection literature and provides clear, actionable regime classifications.

**Regime definitions:**
- **Regime 0**: Low volatility (typically bull market)
- **Regime 1**: Medium volatility (sideways market)
- **Regime 2**: High volatility (typically bear market)

**Note:** Regimes are automatically reorganized by volatility (0=low, 1=medium, 2=high) during training to ensure consistent interpretation across all models.
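
Clustering algorithms assign arbitrary label numbers, so the relabeling step matters for comparing models. The sketch below shows one way it could work, assuming regimes are ranked by their mean rolling volatility. The function `relabel_by_volatility` is hypothetical and is not taken from `ModelComparison`.

```python
import numpy as np


def relabel_by_volatility(labels: np.ndarray, volatility: np.ndarray) -> np.ndarray:
    """Map raw cluster labels to 0..k-1 in order of increasing mean volatility."""
    raw_ids = np.unique(labels)
    mean_vol = np.array([volatility[labels == r].mean() for r in raw_ids])
    # Raw labels ordered from calmest to most volatile
    ordered = raw_ids[np.argsort(mean_vol)]
    mapping = {int(raw): new for new, raw in enumerate(ordered)}
    return np.array([mapping[int(label)] for label in labels])


# Made-up example: raw label 2 is the calmest cluster, raw label 0 the most volatile
raw_labels = np.array([2, 2, 0, 0, 1, 1])
vol = np.array([0.01, 0.01, 0.03, 0.03, 0.02, 0.02])
sorted_labels = relabel_by_volatility(raw_labels, vol)
print(sorted_labels)
```

For an HMM, the same permutation would also have to be applied to the rows and columns of the transition matrix so that it stays consistent with the new labels.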


### Step 1: Prepare Features

We prepare features for regime detection using returns and rolling volatility (see Cell 5 above). These features will be used to train all three models.


### Step 2: Train Models

We train all three models using the same 3-regime convention described above to enable fair comparison.


In [None]:
# Cell 6: Train all models
N_REGIMES = 3  # Fixed to 3 for fair comparison

print(f"Training models with {N_REGIMES} regimes...")
print("=" * 60)

# Use ModelComparison class to train all models
# Regimes are automatically reorganized by volatility (0=low, 1=medium, 2=high) during fit()
comparison = ModelComparison(n_components=N_REGIMES, random_state=42)
comparison.fit(X, returns=returns, sort_by="volatility")

# Get models and regimes (already reorganized by volatility)
gmm_model = comparison.gmm_model
hmm_model = comparison.hmm_model
kmeans_model = comparison.kmeans_model

gmm_regimes = comparison.gmm_regimes
hmm_regimes = comparison.hmm_regimes
kmeans_regimes = comparison.kmeans_regimes

print(f"\n✓ GMM trained. Regimes detected: {np.unique(gmm_regimes)}")
print(f"✓ HMM trained. Regimes detected: {np.unique(hmm_regimes)}")
print(f"✓ K-Means trained. Regimes detected: {np.unique(kmeans_regimes)}")
print("✓ Regimes automatically sorted: 0=Low volatility, 1=Medium, 2=High volatility")


### Step 3: Visualize Detected Regimes

Let's see how each model identifies regimes over time. This helps us understand:
- Do models agree on regime classifications?
- Do they detect major market events (2008 crisis, COVID-19)?
- How do regime transitions differ between models?


In [None]:
# Cell 7: Visualize regime detection
print("Visualizing detected regimes...")

# GMM
plot_regime_detection(
    market_data,
    returns,
    gmm_regimes,
    title="Regime Detection - GMM Model"
)

# HMM
plot_regime_detection(
    market_data,
    returns,
    hmm_regimes,
    title="Regime Detection - HMM Model"
)


# K-Means
plot_regime_detection(
    market_data,
    returns,
    kmeans_regimes,
    title="Regime Detection - K-Means Model"
)

print("✓ Regime visualizations complete")


## 5. Results

### What did we find?

Let's analyze the statistics for each detected regime and compare model performance.


In [None]:
# Cell 8: Calculate regime statistics
print("Calculating regime statistics...")

# Calculate statistics for each model
gmm_stats = calculate_regime_statistics_from_regimes(gmm_regimes, returns)
hmm_stats = calculate_regime_statistics_from_regimes(hmm_regimes, returns)
kmeans_stats = calculate_regime_statistics_from_regimes(kmeans_regimes, returns)

print("\n=== GMM Regime Statistics ===")
print(gmm_stats)

print("\n=== HMM Regime Statistics ===")
print(hmm_stats)

print("\n=== K-Means Regime Statistics ===")
print(kmeans_stats)

# Visualize statistics
plot_regime_statistics(gmm_stats, title="Regime Statistics - GMM Model")
plt.show()
plot_regime_statistics(hmm_stats, title="Regime Statistics - HMM Model")
plt.show()
plot_regime_statistics(kmeans_stats, title="Regime Statistics - K-Means Model")
plt.show()


### Model Comparison

Now let's compare all three models side-by-side to see which performs best.
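
Cell 9 reports BIC and AIC for the two likelihood-based models. Both scores penalize the log-likelihood by the number of free parameters, and lower values are better. The sketch below uses the standard formulas. The parameter counts assume full covariance matrices and Gaussian emissions, which may not match the configuration inside `ModelComparison`, and the helper names are hypothetical. K-Means has no likelihood, which is why it is scored with inertia and silhouette instead.

```python
import math


def n_params_gmm(k: int, d: int) -> int:
    """Free parameters of a k-component, d-dimensional full-covariance GMM."""
    weights = k - 1
    means = k * d
    covariances = k * d * (d + 1) // 2
    return weights + means + covariances


def n_params_hmm(k: int, d: int) -> int:
    """Gaussian HMM: initial and transition probabilities plus emission parameters."""
    initial = k - 1
    transitions = k * (k - 1)
    emissions = k * d + k * d * (d + 1) // 2
    return initial + transitions + emissions


def aic(log_likelihood: float, n_params: int) -> float:
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Three regimes, two features (returns and volatility)
print(n_params_gmm(3, 2), n_params_hmm(3, 2))
```

Because the HMM carries extra transition parameters, it needs a sufficiently higher log-likelihood than the GMM to achieve a lower BIC.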


In [None]:
# Cell 9: Compare models
print("Comparing all models...")
comparison_results = comparison.compare_models(X, returns)

# Extract metrics
gmm_metrics = comparison_results['gmm']['metrics']
hmm_metrics = comparison_results['hmm']['metrics']
kmeans_metrics = comparison_results['kmeans']['metrics']

print("\n=== MODEL PERFORMANCE METRICS ===")
print("GMM:")
print(f"  BIC: {gmm_metrics['BIC']:.2f}")
print(f"  AIC: {gmm_metrics['AIC']:.2f}")

print("\nHMM:")
print(f"  BIC: {hmm_metrics['BIC']:.2f}")
print(f"  AIC: {hmm_metrics['AIC']:.2f}")
print(f"  Log-Likelihood: {hmm_metrics['log_likelihood']:.2f}")

print("\nK-Means:")
print(f"  Inertia: {kmeans_metrics['inertia']:.2f}")
print(f"  Silhouette Score: {kmeans_metrics['silhouette_score']:.3f}")

# Get best model
best_model, justification = comparison.get_best_model(X, returns)
print("\n=== BEST MODEL ===")
print(f"Recommended: {best_model}")
print(f"Score: {justification['score']:.4f}")
print(f"Reason: {justification['reason']}")

# Visual comparison
plot_model_comparison(gmm_stats, hmm_stats, kmeans_stats)
plt.show()


### HMM Transition Matrix

One unique advantage of HMM is its ability to model regime transitions. The transition matrix shows:
- **Diagonal values**: Probability of staying in the same regime (persistence)
- **Off-diagonal values**: Probability of transitioning to another regime

This matters for risk management: the matrix gives the estimated probability of each regime change from one day to the next, which can inform how quickly to adjust exposure.
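
Two quantities follow directly from a transition matrix: the expected time spent in a regime before leaving it, and the regime probabilities several days ahead. The sketch below uses a made-up matrix purely for illustration. The numbers are not results from this analysis.

```python
import numpy as np

# Hypothetical transition matrix chosen for illustration (each row sums to 1)
P = np.array([
    [0.98, 0.02, 0.00],
    [0.03, 0.95, 0.02],
    [0.00, 0.05, 0.95],
])

# Expected number of consecutive days spent in each regime: 1 / (1 - p_ii)
expected_duration = 1.0 / (1.0 - np.diag(P))

# Regime probabilities 5 days ahead, given that today is regime 0
today = np.array([1.0, 0.0, 0.0])
five_day_forecast = today @ np.linalg.matrix_power(P, 5)

print(expected_duration)
print(five_day_forecast)
```

The duration formula is the mean of a geometric distribution, which is what a first-order Markov chain implies for the length of each stay in a regime.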


In [None]:
# Cell 10: Analyze HMM transition matrix
# Get transition matrix from HMM model (already reorganized by volatility)
transition_matrix = hmm_model.get_transition_matrix()

print("=== HMM TRANSITION MATRIX ===")
print("(Regimes: 0=Low volatility, 1=Medium, 2=High volatility)")
print("Rows = Current regime, Columns = Next regime")
print("\nTransition probabilities:")

# Format DataFrame to display probabilities as decimals (0-1 range)
df_transition = pd.DataFrame(
    transition_matrix,
    index=[f'Regime {i}' for i in range(N_REGIMES)],
    columns=[f'Regime {i}' for i in range(N_REGIMES)]
)

# Format each cell to show probability as decimal (e.g., 0.980 instead of 9.803620e-01)
# Use 6 decimal places for readability, or scientific notation for very small values
def format_prob(x):
    if x < 1e-6:
        return f'{x:.2e}'
    else:
        return f'{x:.6f}'

# Apply formatting to each cell
for col in df_transition.columns:
    df_transition[col] = df_transition[col].apply(format_prob)

print(df_transition)

# Visualize
plot_transition_matrix(transition_matrix, title="HMM Transition Matrix")
plt.show()

# Analyze transitions
print("\n=== REGIME TRANSITION ANALYSIS ===")
for regime in range(N_REGIMES):
    stability = transition_matrix[regime, regime]
    # Mask the diagonal: with persistent regimes, argmax over the full row
    # would simply return the current regime
    switch_probs = np.array(transition_matrix[regime, :], dtype=float)
    switch_probs[regime] = -np.inf
    most_likely_switch = int(np.argmax(switch_probs))
    print(f"\nCurrent regime: {regime}")
    print(f"  Stability (staying in same regime): {stability:.3f} ({stability*100:.1f}%)")
    print(f"  Most likely regime after a switch: {most_likely_switch}")
    print("  Transition probabilities:")
    for next_regime in range(N_REGIMES):
        prob = transition_matrix[regime, next_regime]
        print(f"    To regime {next_regime}: {prob:.3f} ({prob*100:.1f}%)")


### Risk Analysis

For portfolio management, we need to understand the risk characteristics of each regime. We calculate risk metrics:
- **VaR (Value at Risk)**: Loss threshold that daily losses are not expected to exceed with 95% confidence (the 5th percentile of the return distribution)
- **CVaR (Conditional VaR)**: Expected loss given that loss exceeds VaR
- **Maximum Drawdown**: Largest peak-to-trough decline

Note: The Sharpe ratio is a performance metric (risk-adjusted return) and is included in regime statistics, not in risk metrics.
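
The three metrics can be sketched with NumPy as below. This is an illustration that assumes historical (empirical) estimation and reports VaR and CVaR as positive loss fractions. `calculate_regime_risk_metrics` may use a different estimator or sign convention, and the function names here are hypothetical.

```python
import numpy as np


def var_cvar(returns: np.ndarray, level: float = 0.95) -> tuple:
    """Historical VaR and CVaR, reported as positive loss fractions."""
    cutoff = np.quantile(returns, 1.0 - level)  # e.g. 5th percentile of returns
    tail = returns[returns <= cutoff]           # outcomes at or beyond the cutoff
    return float(-cutoff), float(-tail.mean())


def max_drawdown(returns: np.ndarray) -> float:
    """Largest peak-to-trough decline of the compounded return path (<= 0)."""
    # Start the wealth path at 1.0 so a loss on the first day is counted
    wealth = np.concatenate([[1.0], np.cumprod(1.0 + returns)])
    running_peak = np.maximum.accumulate(wealth)
    return float((wealth / running_peak - 1.0).min())


# Made-up inputs for illustration
sample_returns = np.linspace(-0.10, 0.09, 20)
var_95, cvar_95 = var_cvar(sample_returns)
mdd = max_drawdown(np.array([0.10, -0.50, 0.20]))
print(var_95, cvar_95, mdd)
```

When these functions are applied to one regime at a time, the days in that regime are usually not contiguous, so a per-regime drawdown describes a stitched-together path rather than a period an investor actually experienced.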


In [None]:
# Cell 11: Calculate risk metrics
print("Calculating risk metrics for each regime...")

gmm_risk = calculate_regime_risk_metrics(returns, gmm_regimes)
hmm_risk = calculate_regime_risk_metrics(returns, hmm_regimes)
kmeans_risk = calculate_regime_risk_metrics(returns, kmeans_regimes)

print("\n=== GMM Risk Metrics ===")
print(gmm_risk[['regime', 'VaR_95', 'CVaR_95', 'max_drawdown']])

print("\n=== HMM Risk Metrics ===")
print(hmm_risk[['regime', 'VaR_95', 'CVaR_95', 'max_drawdown']])

print("\n=== K-Means Risk Metrics ===")
print(kmeans_risk[['regime', 'VaR_95', 'CVaR_95', 'max_drawdown']])

# Visualize
plot_risk_metrics(gmm_risk, title="Risk Metrics by Regime - GMM")
plt.show()
plot_risk_metrics(hmm_risk, title="Risk Metrics by Regime - HMM")
plt.show()
plot_risk_metrics(kmeans_risk, title="Risk Metrics by Regime - K-Means")
plt.show()


## 6. Conclusion

### What does it mean? Next steps?

Let's summarize our findings and discuss implications for portfolio management.


In [None]:
# Cell 12: Summary and conclusions
print("=" * 60)
print("SUMMARY AND CONCLUSIONS")
print("=" * 60)

print("\n1. REGIME DISTRIBUTION:")
print("   GMM:")
for _, row in gmm_stats.iterrows():
    print(f"     Regime {int(row['regime'])}: {row['percentage']:.1f}% of time")
print("   HMM:")
for _, row in hmm_stats.iterrows():
    print(f"     Regime {int(row['regime'])}: {row['percentage']:.1f}% of time")
print("   K-Means:")
for _, row in kmeans_stats.iterrows():
    print(f"     Regime {int(row['regime'])}: {row['percentage']:.1f}% of time")

print("\n2. MODEL PERFORMANCE:")
print(f"   Best model: {best_model}")
print(f"   Justification: {justification['reason']}")

print("\n3. KEY INSIGHTS:")
print("   - GMM and HMM show similar regime detection patterns")
print("   - HMM provides additional value through transition matrix")
print("   - K-Means shows limitations (poor regime distribution, no temporal structure)")

print("\n4. IMPLICATIONS FOR PORTFOLIO MANAGEMENT:")
print("   - Regime-aware strategies can adjust allocation based on detected regime")
print("   - HMM transition probabilities help predict regime changes")
print("   - Risk metrics vary significantly across regimes - adjust exposure accordingly")

print("\n5. LIMITATIONS:")
print("   - Models are backward-looking (use historical data)")
print("   - Regime detection is probabilistic, not deterministic")
print("   - K-Means is not suitable for time series regime detection")

print("\n" + "=" * 60)


### Key Takeaways

1. **HMM is the best model** for regime detection because:
   - It models temporal dependencies (regime persistence)
   - Provides transition probabilities for risk management
   - Shows similar or better performance than GMM

2. **GMM is a solid alternative** when:
   - Temporal structure is less important
   - Computational efficiency is a concern
   - You only need regime classification (not transitions)

3. **K-Means is not suitable** for financial time series because:
   - No temporal structure (treats each day independently)
   - Poor regime distribution (often one dominant cluster)
   - Extreme/unrealistic risk metrics
   - Cannot model regime transitions

### Next Steps

- **Real-time application**: Implement regime detection on streaming data
- **Portfolio optimization**: Use regime-specific allocations
- **Early warning system**: Monitor transition probabilities for regime changes
- **Multi-asset analysis**: Extend to multiple assets/indices
