# Regime Detection - HMM V1

This notebook implements a Hidden Markov Model (HMM) to detect four macro market regimes:
- **RISK_ON**: Low volatility, positive equity returns, tight credit spreads
- **RISK_OFF**: High volatility, negative equity returns, wide credit spreads
- **INFLATION_SHOCK**: High inflation expectations, commodity strength, negative real rates
- **DISINFLATION**: Falling inflation, bond rally, flattening commodities

## Outputs for Portfolio Optimization

The model produces two key outputs, both stored in the database:

1. **Daily Regime Probabilities** (`REGIMES_GLOBAL` table)
   - Probability distribution across all 4 regimes for each date
   - Used for blending regime-conditional portfolios

2. **Regime Metadata** (`REGIME_METADATA` table)
   - Dominant regime with confidence level
   - Entropy (uncertainty measure) for risk scaling
   - Regime change flags for rebalancing triggers
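
As a rough illustration of how these two outputs relate, the sketch below derives the metadata fields from a small, made-up probability table and blends hypothetical regime-conditional portfolios. The function name `regime_metadata`, the column names, and the example weights are assumptions for illustration; they are not the schema of `REGIMES_GLOBAL` or `REGIME_METADATA`.

```python
import numpy as np
import pandas as pd

REGIMES = ["RISK_ON", "RISK_OFF", "INFLATION_SHOCK", "DISINFLATION"]

def regime_metadata(probs: pd.DataFrame) -> pd.DataFrame:
    """Derive dominant regime, confidence, entropy, and change flags.

    `probs` has one row per date and one column per regime; rows sum to 1.
    Column names of the result are illustrative, not the real table schema.
    """
    p = probs.to_numpy(dtype=float)
    # Shannon entropy in nats; clipping avoids log(0) for zero probabilities.
    entropy = -(p * np.log(np.clip(p, 1e-12, 1.0))).sum(axis=1)
    dominant = probs.idxmax(axis=1)
    meta = pd.DataFrame(
        {
            "dominant_regime": dominant,
            "confidence": probs.max(axis=1),
            # Normalise by log(K): 0 = fully certain, 1 = uniform over K regimes.
            "entropy_norm": entropy / np.log(p.shape[1]),
        },
        index=probs.index,
    )
    # Flag dates where the dominant regime differs from the previous date.
    changed = dominant.ne(dominant.shift())
    changed.iloc[0] = False  # the first date has nothing to compare against
    meta["regime_change"] = changed
    return meta

# Made-up probabilities for three dates.
probs = pd.DataFrame(
    [[0.70, 0.10, 0.10, 0.10],
     [0.25, 0.25, 0.25, 0.25],
     [0.05, 0.85, 0.05, 0.05]],
    index=pd.to_datetime(["2025-01-02", "2025-01-03", "2025-01-06"]),
    columns=REGIMES,
)
meta = regime_metadata(probs)

# Hypothetical regime-conditional portfolios: rows = regimes, columns = assets.
regime_weights = pd.DataFrame(
    [[0.8, 0.2], [0.2, 0.8], [0.5, 0.5], [0.4, 0.6]],
    index=REGIMES,
    columns=["EQUITY", "BONDS"],
)
# Probability-weighted blend: one row of asset weights per date.
blended = probs @ regime_weights
```

Because each probability row sums to 1 and each regime portfolio is fully invested, every blended row is fully invested as well. A high normalised entropy (the second date here) signals that no regime dominates, which is the case the risk-scaling use is meant to catch.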


## Setup


In [1]:
import sys
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import getpass

# Add src directory to path
sys.path.append(os.path.join(os.getcwd(), '..', 'src'))

# Import database connection
from ingestion.database import get_connection

# Import regime detection modules
from regimes.features import get_regime_features, validate_feature_data, print_validation_report, REGIME_FEATURES

# Set plotting style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

print("✅ Imports successful")


✅ Imports successful


In [4]:
# Database connection
db_password = getpass.getpass("Enter database password: ")
conn = get_connection(password=db_password)
print("✅ Connected to database")


✅ Connected to database


## Step 1: Feature Selection and Preparation

We select 14 features that together cover four dimensions relevant to the regimes:
- Risk appetite (VIX, credit spreads, equity returns, style factors)
- Growth expectations (yield curve, leading indicators)
- Inflation dynamics (breakeven inflation, commodities, real rates)
- Cross-asset flows (AUD/JPY, gold/oil, EM vs US)
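
Several feature names carry `_RET_20D` or `_CHG_20D` suffixes. Assuming these denote a 20-observation simple return and a 20-observation level change respectively (the actual definitions live upstream in the `FEATURES` table and are not shown here), a minimal sketch of both transforms on synthetic data looks like this. The helper names are hypothetical.

```python
import numpy as np
import pandas as pd

def ret_over_window(prices: pd.Series, window: int = 20) -> pd.Series:
    """Simple return over `window` observations: P_t / P_{t-window} - 1."""
    return prices / prices.shift(window) - 1.0

def chg_over_window(levels: pd.Series, window: int = 20) -> pd.Series:
    """Level change over `window` observations: L_t - L_{t-window}."""
    return levels - levels.shift(window)

# Synthetic inputs: a price growing 1% per day and a spread falling 1bp per day.
dates = pd.bdate_range("2024-01-01", periods=30)
prices = pd.Series(100.0 * 1.01 ** np.arange(30), index=dates)
spread = pd.Series(4.0 - 0.01 * np.arange(30), index=dates)

ret_20d = ret_over_window(prices)
chg_20d = chg_over_window(spread)
```

Both transforms leave the first `window` observations undefined, so any windowed feature starts with a run of missing values.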


In [2]:
# Display selected features
print("Selected Regime Features:")
print("=" * 60)
for i, feature in enumerate(REGIME_FEATURES, 1):
    print(f"{i:2d}. {feature}")
print(f"\nTotal: {len(REGIME_FEATURES)} features")


Selected Regime Features:
 1. VOL_VIX_LEVEL
 2. CREDIT_HY_CHG_20D
 3. ^GSPC_RET_20D
 4. STYLE_CYCLICAL_VS_DEFENSIVE
 5. STYLE_GROWTH_VS_VALUE
 6. GLOBAL_YIELD_CURVE_SLOPE
 7. USSLIND_CHG_MOM
 8. GLOBAL_INFLATION_EXPECTATIONS
 9. CL=F_RET_20D
10. REAL_RATE_10Y
11. FX_AUD_JPY
12. COMMODITY_GOLD_OIL
13. EQUITY_EM_VS_US
14. GLOBAL_FINANCIAL_CONDITIONS

Total: 14 features


In [6]:
# Load features in wide format (dates x features)
df_features = get_regime_features(
    conn,
    version='V1_ML',
    start_date='2010-01-01'
)

print(f"\n✅ Loaded feature matrix: {df_features.shape}")


Loading 14 regime features...
Version: V1_ML, Start date: 2010-01-01
Loaded 1,534,807 rows from FEATURES table
✅ All 14 features present

Pivoted to wide format:
  Shape: (4163, 14) (dates x features)
  Date range: 2010-01-01 to 2025-12-16
  Total days: 4163

  - CL=F_RET_20D: 22 (0.5%)
  - COMMODITY_GOLD_OIL: 1 (0.0%)
  - CREDIT_HY_CHG_20D: 16 (0.4%)
  - EQUITY_EM_VS_US: 1 (0.0%)
  - GLOBAL_INFLATION_EXPECTATIONS: 1 (0.0%)
  - GLOBAL_YIELD_CURVE_SLOPE: 1 (0.0%)
  - REAL_RATE_10Y: 1 (0.0%)
  - STYLE_CYCLICAL_VS_DEFENSIVE: 1 (0.0%)
  - STYLE_GROWTH_VS_VALUE: 1 (0.0%)
  - USSLIND_CHG_MOM: 21 (0.5%)
  - VOL_VIX_LEVEL: 1 (0.0%)
  - ^GSPC_RET_20D: 22 (0.5%)

✅ Loaded feature matrix: (4163, 14)
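
The log above mentions pivoting long `FEATURES` rows into a dates x features matrix. The internals of `get_regime_features` are not shown, so the following is only a sketch of that kind of reshape, assuming a long table with hypothetical columns `dt`, `feature`, and `value`.

```python
import pandas as pd

# Long format: one row per (date, feature) pair. Column names are assumed.
long_df = pd.DataFrame(
    {
        "dt": pd.to_datetime(
            ["2010-01-04", "2010-01-04", "2010-01-05", "2010-01-05", "2010-01-06"]
        ),
        "feature": [
            "VOL_VIX_LEVEL", "REAL_RATE_10Y",
            "VOL_VIX_LEVEL", "REAL_RATE_10Y",
            "VOL_VIX_LEVEL",
        ],
        "value": [20.04, 1.47, 19.35, 1.43, 19.16],
    }
)

# Wide format: dates as the index, one column per feature.
# (date, feature) pairs absent from the long table become NaN.
wide = long_df.pivot(index="dt", columns="feature", values="value").sort_index()

# Per-feature missing counts and percentages, similar to the report above.
missing = wide.isna().sum()
missing_pct = 100.0 * missing / len(wide)
```

`DataFrame.pivot` raises on duplicate `(dt, feature)` pairs, which is a useful guard here; `pivot_table` would silently aggregate them instead.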


In [7]:
# Display sample of feature data
print("\nFirst 5 rows:")
display(df_features.head())

print("\nLast 5 rows:")
display(df_features.tail())



First 5 rows:


dt,CL=F_RET_20D,COMMODITY_GOLD_OIL,CREDIT_HY_CHG_20D,EQUITY_EM_VS_US,FX_AUD_JPY,GLOBAL_FINANCIAL_CONDITIONS,GLOBAL_INFLATION_EXPECTATIONS,GLOBAL_YIELD_CURVE_SLOPE,REAL_RATE_10Y,STYLE_CYCLICAL_VS_DEFENSIVE,STYLE_GROWTH_VS_VALUE,USSLIND_CHG_MOM,VOL_VIX_LEVEL,^GSPC_RET_20D
2010-01-01,,,,,0.009669,-0.17856,,,,,,,,
2010-01-04,,13.712427,,0.027172,0.009865,-0.17856,2.255,2.76,1.47,1.0005,0.318806,,20.040001,
2010-01-05,,13.673719,,0.027284,0.009956,-0.17856,2.23,2.76,1.43,1.031165,0.317314,,19.35,
2010-01-06,,13.655927,,0.027327,0.009967,-0.17856,2.265,2.84,1.48,1.027212,0.316554,,19.16,
2010-01-07,,13.707959,,0.02706,0.009827,-0.17856,2.3,2.82,1.44,1.053917,0.313623,,19.059999,



Last 5 rows:


dt,CL=F_RET_20D,COMMODITY_GOLD_OIL,CREDIT_HY_CHG_20D,EQUITY_EM_VS_US,FX_AUD_JPY,GLOBAL_FINANCIAL_CONDITIONS,GLOBAL_INFLATION_EXPECTATIONS,GLOBAL_YIELD_CURVE_SLOPE,REAL_RATE_10Y,STYLE_CYCLICAL_VS_DEFENSIVE,STYLE_GROWTH_VS_VALUE,USSLIND_CHG_MOM,VOL_VIX_LEVEL,^GSPC_RET_20D
2025-12-10,-0.042267,71.782415,-0.26,0.008007,0.004234,-0.54587,2.285,0.59,1.88,1.261175,0.584053,0.15,15.77,0.005853
2025-12-11,-0.015216,74.401044,-0.31,0.007952,0.004284,-0.54587,2.285,0.62,1.89,1.274861,0.580128,0.15,14.85,0.00731
2025-12-12,-0.021298,74.862469,-0.28,0.007944,0.004284,-0.54587,2.29,0.67,1.93,1.282979,0.572853,0.15,15.74,0.013346
2025-12-15,-0.056915,76.495503,-0.28,0.007934,0.004264,-0.54587,2.27,0.67,1.93,1.272328,0.571161,0.15,16.5,0.012236
2025-12-16,-0.056915,76.495503,-0.28,0.007934,0.004284,-0.54587,2.27,0.67,1.93,1.272328,0.571161,0.15,16.5,0.012236


In [8]:
# Summary statistics
print("\nSummary Statistics:")
display(df_features.describe())



Summary Statistics:


feature,CL=F_RET_20D,COMMODITY_GOLD_OIL,CREDIT_HY_CHG_20D,EQUITY_EM_VS_US,FX_AUD_JPY,GLOBAL_FINANCIAL_CONDITIONS,GLOBAL_INFLATION_EXPECTATIONS,GLOBAL_YIELD_CURVE_SLOPE,REAL_RATE_10Y,STYLE_CYCLICAL_VS_DEFENSIVE,STYLE_GROWTH_VS_VALUE,USSLIND_CHG_MOM,VOL_VIX_LEVEL,^GSPC_RET_20D
count,4141.0,4162.0,4147.0,4162.0,4163.0,4163.0,4162.0,4162.0,4162.0,4162.0,4162.0,4142.0,4162.0,4141.0
mean,0.004112,25.316141,-0.009986,0.014759,0.007518,-0.467011,2.000429,0.985617,0.536961,1.036054,0.404827,0.061647,18.392261,0.00999
std,0.130752,12.298211,0.479953,0.005864,0.00272,0.179197,0.395797,0.951587,0.881111,0.138715,0.079809,0.167251,6.848446,0.041743
min,-2.677664,11.710478,-2.85,0.006907,0.003898,-0.798,0.32,-1.08,-1.19,0.691172,0.305158,-0.52,9.14,-0.309439
25%,-0.055845,17.133887,-0.24,0.009946,0.005892,-0.58241,1.725,0.25,0.03,0.945607,0.336081,-0.03,13.74,-0.009379
50%,0.006976,22.991857,-0.05,0.013313,0.006703,-0.50175,2.03,0.93,0.5,1.028017,0.374962,0.15,16.66,0.015543
75%,0.061923,27.952703,0.17,0.017723,0.009148,-0.36227,2.295,1.68,1.1,1.134782,0.468493,0.15,21.08,0.034265
max,2.246753,138.614261,6.03,0.028917,0.014269,0.31666,3.27,2.91,2.52,1.355222,0.602855,0.61,82.690002,0.224841
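
The summary shows features on very different scales (for example, `COMMODITY_GOLD_OIL` in the tens versus `FX_AUD_JPY` in the thousandths). One common preprocessing step before fitting a Gaussian model is to z-score each column. Whether this notebook's pipeline does so is not shown, so treat the sketch below, on synthetic data with a hypothetical `standardize` helper, as an assumption rather than the author's method.

```python
import numpy as np
import pandas as pd

def standardize(df: pd.DataFrame) -> pd.DataFrame:
    """Z-score each column with its own mean and population std (NaNs skipped)."""
    return (df - df.mean()) / df.std(ddof=0)

# Synthetic columns that mimic two very different scales.
rng = np.random.default_rng(0)
raw = pd.DataFrame(
    {
        "VOL_VIX_LEVEL": rng.normal(18.0, 7.0, size=1000),
        "FX_AUD_JPY": rng.normal(0.0075, 0.0027, size=1000),
    }
)
z = standardize(raw)
```

Full-sample means and standard deviations use information from the whole history. For a live or backtested signal, an expanding or rolling window would avoid that look-ahead.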


## Data Quality Validation


In [9]:
# Run validation checks
validation_report = validate_feature_data(df_features)
print_validation_report(validation_report)


FEATURE DATA VALIDATION REPORT

❌ FAILED - Data quality issues detected

Checks:
  ❌ Missing Values
  ✅ Date Continuity
  ✅ Sufficient Data
  ✅ Feature Variance
  ❌ Extreme Outliers

  - 2 features have extreme outliers (>10σ)

❌ Errors (1):
  - 89 missing values found across features

Summary:
  Dates: 4163 (2010-01-01 to 2025-12-16)
  Features: 14
  Missing values: 89
  Max gap: 3 days
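
The report covers missing values, date gaps, and outliers beyond 10σ. The actual logic of `validate_feature_data` is not shown in this notebook; the sketch below only illustrates how checks of this kind could be computed, using a hypothetical `sketch_validation` helper and synthetic data.

```python
import numpy as np
import pandas as pd

def sketch_validation(df: pd.DataFrame, sigma: float = 10.0) -> dict:
    """Illustrative checks only; not the real `validate_feature_data` logic."""
    # Total number of NaN cells across all features.
    missing_total = int(df.isna().sum().sum())
    # Largest calendar gap between consecutive index dates, in days.
    max_gap_days = int(df.index.to_series().diff().dt.days.max())
    # Features with any value more than `sigma` standard deviations from the mean.
    z = (df - df.mean()) / df.std()
    outlier_features = [c for c in df.columns if (z[c].abs() > sigma).any()]
    return {
        "missing_total": missing_total,
        "max_gap_days": max_gap_days,
        "outlier_features": outlier_features,
    }

# Synthetic business-day data with two missing cells and one extreme value.
rng = np.random.default_rng(0)
dates = pd.bdate_range("2024-01-01", periods=500)
demo = pd.DataFrame(
    {"A": rng.normal(size=500), "B": rng.normal(size=500)}, index=dates
)
demo.iloc[0, 0] = np.nan
demo.iloc[1, 0] = np.nan
demo.iloc[250, 1] = 1000.0

report = sketch_validation(demo)
```

On a business-day index the largest gap is the three calendar days spanning a weekend, which matches the "Max gap: 3 days" line in the report above.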



In [10]:
# Check for any missing values in detail
missing_summary = df_features.isnull().sum()
if missing_summary.any():
    print("Missing Values by Feature:")
    print(missing_summary[missing_summary > 0])
else:
    print("✅ No missing values detected")

Missing Values by Feature:
feature
CL=F_RET_20D                     22
COMMODITY_GOLD_OIL                1
CREDIT_HY_CHG_20D                16
EQUITY_EM_VS_US                   1
GLOBAL_INFLATION_EXPECTATIONS     1
GLOBAL_YIELD_CURVE_SLOPE          1
REAL_RATE_10Y                     1
STYLE_CYCLICAL_VS_DEFENSIVE       1
STYLE_GROWTH_VS_VALUE             1
USSLIND_CHG_MOM                  21
VOL_VIX_LEVEL                     1
^GSPC_RET_20D                    22
dtype: int64
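
An HMM fit generally needs a complete matrix, so these gaps have to be handled before modeling. One possible approach, offered here as an assumption and not as the notebook's chosen method, is to forward-fill short interior gaps and then drop any rows that are still incomplete (typically the leading rows where windowed features are undefined). The helper name `fill_and_trim` and the fill limit are hypothetical.

```python
import numpy as np
import pandas as pd

def fill_and_trim(df: pd.DataFrame, max_ffill: int = 5) -> pd.DataFrame:
    """Forward-fill up to `max_ffill` consecutive NaNs, then drop incomplete rows."""
    filled = df.ffill(limit=max_ffill)
    # Leading NaNs have no earlier value to carry forward, so those rows are dropped.
    return filled.dropna(how="any")

dates = pd.bdate_range("2024-01-01", periods=6)
demo = pd.DataFrame(
    {
        "A": [np.nan, np.nan, 1.0, 2.0, 3.0, 4.0],   # leading gap (windowed feature)
        "B": [10.0, 11.0, 12.0, np.nan, 14.0, 15.0],  # single interior gap
    },
    index=dates,
)
clean = fill_and_trim(demo)
```

Forward-filling only carries past values forward, so it does not introduce look-ahead; the cost is that filled observations repeat stale data.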


## Feature Visualization


In [None]:
# Plot feature correlation matrix
fig, ax = plt.subplots(figsize=(12, 10))
corr_matrix = df_features.corr()
sns.heatmap(corr_matrix, annot=False, cmap='coolwarm', center=0, ax=ax, 
            cbar_kws={'label': 'Correlation'})
ax.set_title('Feature Correlation Matrix', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("\nHighly correlated feature pairs (|r| > 0.7):")
for i in range(len(corr_matrix.columns)):
    for j in range(i+1, len(corr_matrix.columns)):
        if abs(corr_matrix.iloc[i, j]) > 0.7:
            print(f"  {corr_matrix.columns[i]} <-> {corr_matrix.columns[j]}: {corr_matrix.iloc[i, j]:.3f}")


In [None]:
# Plot time series of key features
key_features = [
    'VOL_VIX_LEVEL',
    '^GSPC_RET_20D',
    'GLOBAL_YIELD_CURVE_SLOPE',
    'GLOBAL_INFLATION_EXPECTATIONS'
]

fig, axes = plt.subplots(len(key_features), 1, figsize=(14, 10))
fig.suptitle('Key Regime Features Over Time', fontsize=16, fontweight='bold')

for i, feature in enumerate(key_features):
    if feature in df_features.columns:
        axes[i].plot(df_features.index, df_features[feature], linewidth=0.8)
        axes[i].set_ylabel(feature, fontsize=9)
        axes[i].grid(True, alpha=0.3)
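        # Draw a zero reference line only for series that cross zero, so level
        # series such as VIX are not rescaled to include y=0
        if df_features[feature].min() < 0 < df_features[feature].max():
            axes[i].axhline(y=0, color='red', linestyle='--', linewidth=0.5, alpha=0.5)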

axes[-1].set_xlabel('Date', fontsize=10)
plt.tight_layout()
plt.show()


In [None]:
# Feature distributions
n_features = len(df_features.columns)
n_cols = 4
n_rows = (n_features + n_cols - 1) // n_cols

fig, axes = plt.subplots(n_rows, n_cols, figsize=(16, n_rows * 3))
axes = axes.flatten()

for i, feature in enumerate(df_features.columns):
    axes[i].hist(df_features[feature].dropna(), bins=50, alpha=0.7, edgecolor='black')
    axes[i].set_title(feature, fontsize=8)
    axes[i].set_ylabel('Frequency', fontsize=7)
    axes[i].tick_params(labelsize=7)
    axes[i].grid(True, alpha=0.3)

# Hide unused subplots
for i in range(n_features, len(axes)):
    axes[i].axis('off')

fig.suptitle('Feature Distributions', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()


## Progress Summary

**Completed:**
- ✅ Module setup (`src/regimes/`)
- ✅ Feature selection (14 regime-relevant features)
- ✅ Data loading and pivoting to wide format
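- ✅ Data quality validation (checks run; 89 missing values and 2 features with extreme outliers were flagged and still need handling)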
- ✅ Feature visualization

**Next Steps:**
- HMM model training (`hmm.py`)
- State probability extraction
- Regime labeling and interpretation
- Database storage
- Output validation
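
To make the "state probability extraction" step concrete, here is a minimal NumPy sketch of the scaled forward-backward pass for an HMM with diagonal Gaussian emissions. It assumes the parameters are already known; in practice they would be estimated during training (for example by EM), and the planned `hmm.py` may do this differently. The two-state, one-feature example and all parameter values are made up for illustration; the model in this notebook targets four regimes and 14 features.

```python
import numpy as np

def gaussian_loglik(X, means, variances):
    """Log density of each row of X under each state's diagonal Gaussian.

    X: (T, D); means, variances: (K, D). Returns an array of shape (T, K).
    """
    diff = X[:, None, :] - means[None, :, :]
    terms = np.log(2.0 * np.pi * variances)[None, :, :] + diff ** 2 / variances[None, :, :]
    return -0.5 * terms.sum(axis=2)

def state_posteriors(X, start, trans, means, variances):
    """Scaled forward-backward: P(state_t = k | x_1..x_T), shape (T, K).

    trans[i, j] is the probability of moving from state i to state j.
    """
    log_b = gaussian_loglik(X, means, variances)
    # Subtract the per-row max before exponentiating for numerical stability;
    # the per-row constant cancels when the posteriors are normalised.
    b = np.exp(log_b - log_b.max(axis=1, keepdims=True))
    T, K = b.shape
    alpha = np.zeros((T, K))
    beta = np.ones((T, K))
    # Forward pass, renormalised at every step to avoid underflow.
    alpha[0] = start * b[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans) * b[t]
        alpha[t] /= alpha[t].sum()
    # Backward pass, also renormalised at every step.
    for t in range(T - 2, -1, -1):
        beta[t] = trans @ (b[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

# Made-up example: a calm state (mean 15) and a stressed state (mean 35).
X = np.array([14, 16, 15, 13, 17, 36, 34, 35, 33, 37], dtype=float).reshape(-1, 1)
start = np.array([0.5, 0.5])
trans = np.array([[0.95, 0.05], [0.05, 0.95]])
means = np.array([[15.0], [35.0]])
variances = np.array([[16.0], [16.0]])

post = state_posteriors(X, start, trans, means, variances)
```

Each row of `post` is a probability distribution over states for one date, which is the shape of output the daily regime probabilities call for. Mapping numeric states to names such as RISK_ON remains a separate labeling step.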


In [None]:
# Confirm the feature matrix is ready for the next steps (nothing is written to disk here)
print(f"Feature matrix ready for modeling: {df_features.shape}")
print(f"Date range: {df_features.index.min()} to {df_features.index.max()}")
print(f"Total observations: {len(df_features):,}")
