#### Part 3: Trend Analysis (Linear Regression)


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import gc

from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score

import seaborn as sns

from sqlalchemy import create_engine

import statsmodels.api as sm

# Set seeds for reproducibility
np.random.seed(42)

from dotenv import load_dotenv

In [2]:
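# Load environment variables from the .env file
load_dotenv()

# Retrieve the database URL from the environment
database_url = os.getenv("DATABASE_URL")
if database_url is None:
    # Fail early with a clear message instead of a confusing SQLAlchemy error
    raise RuntimeError("DATABASE_URL is not set; add it to .env or the environment")

# --- PostgreSQL connection ---
# Create the engine using the environment variable
engine = create_engine(database_url)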

In [3]:
# --- Load Processed Data ---
query = "SELECT date, ticker, close FROM financial_data"
data = pd.read_sql(query, engine)
data['date'] = pd.to_datetime(data['date'])

In [4]:
# --- Pivot prices ---
prices = data.pivot(index='date', columns='ticker', values='close').sort_index()
prices = prices.dropna(axis=1)  # Drop tickers with missing data

# --- Compute log returns ---
log_returns = np.log(prices / prices.shift(1)).dropna()

# --- Compute rolling volatility (10-day) ---
volatility = log_returns.rolling(window=10).std().dropna()
log_returns = log_returns.loc[volatility.index]

# --- Stack into long format ---
returns_long = log_returns.stack().reset_index()
returns_long.columns = ['date', 'ticker', 'log_return']

vol_long = volatility.stack().reset_index()
vol_long.columns = ['date', 'ticker', 'volatility']

# --- Merge into a single feature set ---
features = pd.merge(returns_long, vol_long, on=['date', 'ticker'])

print("✅ Features loaded:")
print(features.head())

✅ Features loaded:
        date ticker  log_return  volatility
0 2020-01-16   AAPL    0.012449    0.012707
1 2020-01-16   ADBE    0.007090    0.008181
2 2020-01-16    ADI    0.013692    0.014322
3 2020-01-16    ADP    0.011907    0.007988
4 2020-01-16    AEP    0.007323    0.005477
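
The two transformations above can be sketched on a small made-up price table. The tickers `AAA` and `BBB` and their prices are hypothetical, not taken from `financial_data`. The sketch shows why the feature table starts several days after the first price date: one row is lost to the return calculation and nine more to incomplete 10-day windows.

```python
import numpy as np
import pandas as pd

# Hypothetical prices for two made-up tickers over 15 business days
dates = pd.bdate_range("2020-01-02", periods=15)
rng = np.random.default_rng(0)
toy_prices = pd.DataFrame(
    100 * np.exp(np.cumsum(rng.normal(0, 0.01, size=(15, 2)), axis=0)),
    index=dates,
    columns=["AAA", "BBB"],
)

# Log return: r_t = ln(P_t / P_{t-1}); the first row has no predecessor
toy_returns = np.log(toy_prices / toy_prices.shift(1)).dropna()

# 10-day rolling standard deviation; the first 9 return rows are incomplete windows
toy_vol = toy_returns.rolling(window=10).std().dropna()

print(len(toy_prices), len(toy_returns), len(toy_vol))  # 15 14 5
```

With 15 price rows, 14 return rows and 5 volatility rows remain, so the first usable feature date is the tenth return date.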


Apply K-Means Regime Clustering

In [5]:
# --- Scale features ---
X = features[['log_return', 'volatility']].values
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# --- Apply KMeans ---
k = 3
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
features['regime'] = kmeans.fit_predict(X_scaled)

print(" Regime labels assigned:")
print(features.head())

 Regime labels assigned:
        date ticker  log_return  volatility  regime
0 2020-01-16   AAPL    0.012449    0.012707       0
1 2020-01-16   ADBE    0.007090    0.008181       0
2 2020-01-16    ADI    0.013692    0.014322       0
3 2020-01-16    ADP    0.011907    0.007988       0
4 2020-01-16    AEP    0.007323    0.005477       0
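
K-Means label numbers carry no meaning of their own: the same clusters can come back with the numbers permuted on another run or library version. One possible way to keep labels interpretable is to renumber them by mean return, so that 0 is always the lowest-return cluster. The helper `relabel_by_mean_return` below is hypothetical and not part of the original notebook; it is a sketch using NumPy only.

```python
import numpy as np

def relabel_by_mean_return(labels, returns):
    """Renumber cluster labels so that 0 has the lowest mean return.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    labels = np.asarray(labels)
    returns = np.asarray(returns, dtype=float)
    unique = np.unique(labels)
    # Mean return of each original cluster
    means = np.array([returns[labels == u].mean() for u in unique])
    # Original labels sorted from lowest to highest mean return
    order = unique[np.argsort(means)]
    mapping = {old: new for new, old in enumerate(order)}
    return np.array([mapping[label] for label in labels])

# Toy example: cluster 2 has the lowest returns, cluster 0 the highest
raw_labels = [0, 0, 1, 1, 2, 2]
sample_returns = [0.03, 0.04, 0.001, 0.002, -0.03, -0.02]
print(relabel_by_mean_return(raw_labels, sample_returns))  # [2 2 1 1 0 0]
```

In this notebook the same idea could be applied as `features['regime'] = relabel_by_mean_return(features['regime'], features['log_return'])`, assuming mean return is the ordering you want.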


Preparing Regime-Level Trend Data

In [6]:
# Use 'features' DataFrame with log_return, volatility, regime, and date

# --- Group by date and regime to calculate daily average behavior ---
regime_daily_avg = features.groupby(['date', 'regime'])[['log_return', 'volatility']].mean().reset_index()

# --- Pivot to create time series for each regime ---
pivot_returns = regime_daily_avg.pivot(index='date', columns='regime', values='log_return')
pivot_volatility = regime_daily_avg.pivot(index='date', columns='regime', values='volatility')

# --- Rename columns for clarity ---
pivot_returns.columns = [f'return_regime_{i}' for i in pivot_returns.columns]
pivot_volatility.columns = [f'volatility_regime_{i}' for i in pivot_volatility.columns]

# --- Merge returns and volatility into one DataFrame ---
regime_trend_df = pd.concat([pivot_returns, pivot_volatility], axis=1).dropna()

# --- Preview ---
print(" Regime-level trend data sample:")
print(regime_trend_df.head())

 Regime-level trend data sample:
            return_regime_0  return_regime_1  return_regime_2  \
date                                                            
2020-01-16         0.008390        -0.009709         0.036293   
2020-01-17         0.002672        -0.026933         0.002095   
2020-01-21        -0.002844        -0.038366         0.036633   
2020-01-22         0.003357        -0.028795         0.032368   
2020-01-23         0.002143        -0.025636         0.037258   

            volatility_regime_0  volatility_regime_1  volatility_regime_2  
date                                                                       
2020-01-16             0.011823             0.038181             0.033068  
2020-01-17             0.012112             0.030093             0.046636  
2020-01-21             0.011367             0.029628             0.036044  
2020-01-22             0.011575             0.029689             0.034807  
2020-01-23             0.011816             0.032927   
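
One side effect of the cell above is worth seeing in isolation: the final `.dropna()` removes every date on which at least one regime has no member tickers, because the pivot leaves a NaN in that regime's column. The rows below are made up for illustration.

```python
import pandas as pd

# Hypothetical long-format features: regime 2 has no tickers on the second date
toy = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-16"] * 3 + ["2020-01-17"] * 2),
    "regime": [0, 1, 2, 0, 1],
    "log_return": [0.01, -0.02, 0.03, 0.00, -0.01],
})

# Same pattern as the notebook: average per (date, regime), then pivot wide
daily = toy.groupby(["date", "regime"])["log_return"].mean().reset_index()
wide = daily.pivot(index="date", columns="regime", values="log_return")

print(wide)                            # regime 2 is NaN on 2020-01-17
print(len(wide), len(wide.dropna()))   # 2 1
```

If many dates are dropped this way, the regression sample shrinks, so comparing `len(...)` before and after `dropna()` is a cheap check.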

Linear Trend Modeling using statsmodels

In [7]:
# --- Convert date index to numeric time index ---
regime_trend_df = regime_trend_df.copy()
regime_trend_df['time'] = (regime_trend_df.index - regime_trend_df.index[0]).days

# --- Function to fit OLS and print summary ---
def fit_trend(series, time, label):
    X = sm.add_constant(time)  # Add intercept
    y = series
    model = sm.OLS(y, X).fit()
    print(f"\n OLS Trend for {label}")
    print(model.summary())
    return model

# --- Fit OLS models for each regime return series ---
ols_models = {}
for col in regime_trend_df.columns:
    if col.startswith("return_regime_"):
        regime_num = col.split("_")[-1]
        ols_models[regime_num] = fit_trend(regime_trend_df[col], regime_trend_df['time'], col)


 OLS Trend for return_regime_0
                            OLS Regression Results                            
Dep. Variable:        return_regime_0   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                 -0.001
Method:                 Least Squares   F-statistic:                   0.03722
Date:                Thu, 08 May 2025   Prob (F-statistic):              0.847
Time:                        15:13:45   Log-Likelihood:                 4178.2
No. Observations:                1199   AIC:                            -8352.
Df Residuals:                    1197   BIC:                            -8342.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.001

### 📈 Interpretation of Linear Trend Analysis (OLS Regression on Regime Returns)

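In this section, we fit an Ordinary Least Squares (OLS) linear regression to the daily average log return of each regime, with time (calendar days since the first observation) as the only independent variable. The goal is to detect long-term trends in the return behavior of the different market regimes. K-Means label numbers are arbitrary and can change between runs, so match the regime numbers below to the fitted series by their coefficients: in the output shown above, `return_regime_0` has R² = 0.000 and an F-test p-value of 0.847, which is the flat profile rather than the bearish one.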

---

#### 🔹 Regime 0 — Likely Bearish / High-Volatility Regime

- **Intercept (const):** -0.0301  
  The fitted daily return at time = 0. This regime starts with a strongly **negative return**.
- **Time Coefficient:** -1.905e-06  
  The trend is **negative** and **statistically significant** at the 5% level (p = 0.029).
- **R² = 0.004:** time explains under 1% of the variation in returns, so the downward drift is detectable but very small next to day-to-day noise.
- ✅ **Interpretation:**  
  Returns in this regime deteriorate slightly over time. This could represent a **bear market regime** in which losses deepen gradually.
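
To put the time coefficient in context, it helps to multiply it by the number of days in the sample. The sketch below uses the intercept and slope quoted above. The sample span is not stated in this section, so `span_days = 1700` is an assumed value, chosen only because 1,199 trading days cover roughly that many calendar days.

```python
# Coefficients quoted in the text for this regime
intercept = -0.0301      # fitted daily log return at time = 0
slope = -1.905e-06       # change in daily log return per calendar day

# ASSUMPTION: sample span in calendar days; the text does not state it
span_days = 1700

total_drift = slope * span_days
fitted_end = intercept + total_drift

print(f"drift over sample: {total_drift:.6f}")
print(f"fitted return at end: {fitted_end:.6f}")
print(f"drift as share of intercept: {total_drift / intercept:.1%}")
```

Under that assumed span, the fitted daily return moves by about 0.003 over the whole sample, roughly a tenth of the intercept, which is in line with the very low R².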

---

#### 🔸 Regime 1 — Likely Stable / Sideways Regime

- **Intercept (const):** +0.0013  
  Small positive average return at time = 0.
- **Time Coefficient:** -7.743e-08  
  Very close to zero, and **not statistically significant** (p = 0.854).
- **R² ≈ 0** — time explains almost none of the variation in returns.
- ✅ **Interpretation:**  
  This regime is **statistically flat** — returns do not trend up or down. It likely represents a **sideways or mean-reverting market phase**.

---

#### 🔺 Regime 2 — Likely Bullish / Risk-On Regime

- **Intercept (const):** +0.0342  
  Strong positive return to start.
- **Time Coefficient:** +6.37e-08  
  Very small and **not statistically significant** (p = 0.948).
- **R² ≈ 0** — again, no meaningful trend detected.
- ✅ **Interpretation:**  
  Although the average return is **high**, there's **no evidence of a consistent trend** over time. This regime may reflect **sudden risk-on bursts or short-lived bullish rallies** rather than a linear upward drift.

---

### 🧠 Overall Takeaways

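- Only **Regime 0** shows a statistically significant time trend (p = 0.029), and it is **negative**, consistent with a slow deterioration or prolonged drawdown. With R² = 0.004 the effect is small, so it is better read as a weak drift than as a strong signal.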
- **Regimes 1 and 2** show **stable behavior over time**, with high returns in Regime 2 and flat performance in Regime 1.
- These insights can be used to model **regime-specific drifts** in simulation, or as **flags** for adjusting risk exposure when a regime shift is detected.
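
As a rough illustration of the simulation idea in the last point, the sketch below draws daily log returns whose mean depends on the active regime. The drifts are the intercepts quoted above; the per-regime volatilities and the regime path are made-up assumptions, not estimates from the data.

```python
import numpy as np

rng = np.random.default_rng(42)

# Intercepts quoted in the text, used as constant per-regime daily drifts
drift = {0: -0.0301, 1: 0.0013, 2: 0.0342}

# ASSUMPTIONS: per-regime daily volatility and the regime path are made up
vol = {0: 0.03, 1: 0.01, 2: 0.03}
regime_path = np.array([1] * 20 + [0] * 5 + [1] * 20 + [2] * 5)

# Draw one log return per day from the active regime's normal distribution
mu = np.array([drift[r] for r in regime_path])
sigma = np.array([vol[r] for r in regime_path])
sim_returns = rng.normal(mu, sigma)

# Cumulative log return, converted to a price index relative to a start value of 1.0
price_index = np.exp(np.cumsum(sim_returns))
print(price_index[-1])
```

Because the time coefficients are negligible, the sketch treats each drift as constant. The regime averages are also conditional on cluster membership, which itself depends on the day's return, so they describe regimes after the fact and are not forecasts.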
