# Current Idea (open to other ideas of course!):

## Micro-Regime Autoencoders (MRAE)

### Problem Statement

Traditional factor models can struggle to capture short-horizon "micro-regimes" in stock price behavior. The goal of this project is to use a small autoencoder to learn regimes from 20-day *(I just picked a number; we can change this, obviously)* sliding windows of price/volume features, and then study:

- What the regimes look like
- How they transition from one to another
- Whether transitions have predictive power for near-term returns or volatility
- Potential comparisons to simple baselines (momentum, volatility)
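
As a rough illustration of the "20-day sliding windows of price/volume features" idea, here is a minimal sketch of how the autoencoder input could be built for a single ticker. Everything in it is an assumption for discussion, not a settled design: the function name `make_windows`, the two features (`log_ret`, `log_vol_chg`), and the synthetic price series are all placeholders, and the autoencoder itself is not shown.

```python
import numpy as np
import pandas as pd

WINDOW = 20  # assumed window length; the text above treats this as a placeholder


def make_windows(prices: pd.DataFrame, window: int = WINDOW) -> np.ndarray:
    """Build overlapping windows of daily features for ONE ticker.

    Expects 'close' and 'volume' columns, already sorted by date.
    Returns an array of shape (n_windows, window, n_features).
    """
    # Hypothetical feature set: daily log return and change in log(1 + volume)
    feats = pd.DataFrame({
        "log_ret": np.log(prices["close"]).diff(),
        "log_vol_chg": np.log1p(prices["volume"]).diff(),
    }).dropna()
    values = feats.to_numpy()
    if len(values) < window:
        return np.empty((0, window, values.shape[1]))
    # sliding_window_view appends the window axis last: (n_windows, n_features, window)
    views = np.lib.stride_tricks.sliding_window_view(values, window, axis=0)
    # Reorder to (n_windows, window, n_features) and copy into a writable array
    return np.ascontiguousarray(views.transpose(0, 2, 1))


# Synthetic 30-day series, only to show the resulting shapes
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "close": 100 * np.exp(np.cumsum(rng.normal(0, 0.01, 30))),
    "volume": rng.integers(100_000, 1_000_000, size=30).astype(float),
})
X = make_windows(demo)
print(X.shape)  # 30 days -> 29 feature rows -> 10 windows of 20 days x 2 features
```

On the real data this would be applied per ticker (for example inside a `groupby('ticker')` loop) so that no window spans two different stocks.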

### Optional Download (you shouldn't need to do this)

Please use the link in the README to get the dataset.

In [1]:
# uncomment to download data. 
# Not needed if you have the .parquet file in the data/historical folder

# !python qualifier\utils\download_stock_data.py

### Load Dataset from file

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings

warnings.filterwarnings("ignore")

# Set style
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (12, 6)

# Set random seeds for reproducibility
np.random.seed(42)

# Data path
PARQUET_PATH = Path("data/historical/all_stocks_historical.parquet")

print(f"Loading data from: {PARQUET_PATH}")
df = pd.read_parquet(PARQUET_PATH)
print(f"\nDataFrame shape: {df.shape}")
print(f"\nColumn dtypes:")
print(df.dtypes)

Loading data from: data\historical\all_stocks_historical.parquet

DataFrame shape: (89859344, 7)

Column dtypes:
ticker            object
date      datetime64[us]
open             float64
high             float64
low              float64
close            float64
volume           float64
dtype: object


In [3]:
# Standardize and prepare data
print("Available columns:", df.columns.tolist())

# Ensure date is datetime
df['date'] = pd.to_datetime(df['date'])

# Sort by ticker and date
df = df.sort_values(['ticker', 'date']).reset_index(drop=True)

print("\n" + "=" * 60)
print("Data Summary:")
print("=" * 60)
print(f"Total rows: {len(df):,}")
print(f"Unique tickers: {df['ticker'].nunique():,}")
print(f"Date range: {df['date'].min().date()} to {df['date'].max().date()}")
print(f"Days covered: {(df['date'].max() - df['date'].min()).days:,} days")
print(f"Date column type: {df['date'].dtype}")

Available columns: ['ticker', 'date', 'open', 'high', 'low', 'close', 'volume']

Data Summary:
Total rows: 89,859,344
Unique tickers: 7,144
Date range: 1962-01-02 to 2025-11-13
Days covered: 23,326 days
Date column type: datetime64[us]


In [6]:
# Original data, including rows that are all NaN
print("\n" + "=" * 60)
print("Original Data:")
print("=" * 60)
display(df.head())

# Rows with non-null close and volume
print("\n" + "=" * 60)
print("valid (non-null) data:")
print("=" * 60)

# remove rows with nans
df_valid = df.dropna(subset=['close', 'volume'])
display(df_valid[['ticker', 'date', 'open', 'high', 'low', 'close', 'volume']])


Original Data:


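|   | ticker | date | open | high | low | close | volume |
|---|--------|------|------|------|-----|-------|--------|
| 0 | A | 1962-01-02 | NaN | NaN | NaN | NaN | NaN |
| 1 | A | 1962-01-03 | NaN | NaN | NaN | NaN | NaN |
| 2 | A | 1962-01-04 | NaN | NaN | NaN | NaN | NaN |
| 3 | A | 1962-01-05 | NaN | NaN | NaN | NaN | NaN |
| 4 | A | 1962-01-08 | NaN | NaN | NaN | NaN | NaN |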



valid (non-null) data:


|   | ticker | date | open | high | low | close | volume |
|---|--------|------|------|------|-----|-------|--------|
| 9539 | A | 1999-11-18 | 27.245214 | 29.939798 | 23.951837 | 26.347021 | 62546380.0 |
| 9540 | A | 1999-11-19 | 25.710805 | 25.748231 | 23.839568 | 24.176392 | 15234146.0 |
| 9541 | A | 1999-11-22 | 24.737756 | 26.347021 | 23.989261 | 26.347021 | 6577870.0 |
| 9542 | A | 1999-11-23 | 25.448832 | 26.122478 | 23.951841 | 23.951841 | 5975611.0 |
| 9543 | A | 1999-11-24 | 24.026691 | 25.112009 | 23.951841 | 24.588062 | 4843231.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 89859339 | ZYXI | 2025-11-07 | 1.030000 | 1.100000 | 1.030000 | 1.040000 | 133400.0 |
| 89859340 | ZYXI | 2025-11-10 | 1.030000 | 1.190000 | 1.030000 | 1.180000 | 185000.0 |
| 89859341 | ZYXI | 2025-11-11 | 1.160000 | 1.190000 | 1.100000 | 1.160000 | 67100.0 |
| 89859342 | ZYXI | 2025-11-12 | 1.160000 | 1.200000 | 1.100000 | 1.100000 | 73900.0 |


### Trimming Data

In [None]:
# trim the df_valid to start at the year 2000
trimmed_df_valid = df_valid[df_valid['date'] >= '2000-01-01'].copy()
print("\n" + "=" * 60)
print("Filtered data (2000 onwards):")
print("=" * 60)
print(f"Total rows: {len(trimmed_df_valid):,}")
print(f"Date range: {trimmed_df_valid['date'].min().date()} to {trimmed_df_valid['date'].max().date()}")
display(trimmed_df_valid[['ticker', 'date', 'open', 'high', 'low', 'close', 'volume']].head())


Filtered data (2000 onwards):
Total rows: 21,521,135
Date range: 2000-01-03 to 2025-11-13


|   | ticker | date | open | high | low | close | volume |
|---|--------|------|------|------|-----|-------|--------|
| 9569 | A | 2000-01-03 | 47.155195 | 47.26747 | 40.343889 | 43.113323 | 4674353.0 |
| 9570 | A | 2000-01-04 | 40.792992 | 41.242092 | 38.772056 | 39.81995 | 4765083.0 |
| 9571 | A | 2000-01-05 | 39.670253 | 39.819954 | 36.114903 | 37.349918 | 5758642.0 |
| 9572 | A | 2000-01-06 | 36.900802 | 37.125352 | 34.805015 | 35.927757 | 2534434.0 |
| 9573 | A | 2000-01-07 | 35.366397 | 39.483121 | 35.328974 | 38.921749 | 2819626.0 |


In [11]:
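# 30-day rolling average volume per ticker, plus each ticker's overall average volume (2000-2025).
# Note: the ranking below uses the overall mean, not the volume_30d_avg column.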
trimmed_df_valid = trimmed_df_valid.sort_values(['ticker', 'date']).reset_index(drop=True)

trimmed_df_valid['volume_30d_avg'] = trimmed_df_valid.groupby('ticker')['volume'].rolling(
    window=30, 
    min_periods=1
).mean().reset_index(0, drop=True)

avg_volume_by_ticker = trimmed_df_valid.groupby('ticker')['volume'].mean().sort_values(ascending=False)

print("\n" + "=" * 60)
print("Average Volume by Stock (2000-2025)")
print("=" * 60)
print(f"Date range: {trimmed_df_valid['date'].min().date()} to {trimmed_df_valid['date'].max().date()}")
print(f"\nStocks sorted by highest average volume:")
print("-" * 60)
display(avg_volume_by_ticker.to_frame(name='avg_volume').head(50))


Average Volume by Stock (2000-2025)
Date range: 2000-01-03 to 2025-11-13

Stocks sorted by highest average volume:
------------------------------------------------------------


| ticker | avg_volume |
|--------|------------|
| NVDA | 600617200.0 |
| AAPL | 374911000.0 |
| AMZN | 116037600.0 |
| GOOGL | 112346500.0 |
| GOOG | 110036500.0 |
| TSLA | 96896490.0 |
| TQQQ | 84554880.0 |
| BAC | 77327170.0 |
| QQQ | 66639310.0 |
| NIO | 61121490.0 |
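
To sanity-check what the per-ticker rolling average in the cell above is doing, here is a small sketch on a toy frame. The toy data, the `volume_avg` column name, and the window of 2 (instead of 30) are illustrative assumptions, and `transform` is used here as an alternative formulation of the same per-ticker rolling mean rather than the notebook's exact call.

```python
import pandas as pd

# Toy data: two tickers, already sorted by ticker and date
toy = pd.DataFrame({
    "ticker": ["A", "A", "A", "B", "B"],
    "volume": [10.0, 20.0, 30.0, 100.0, 300.0],
})

# window=2 keeps the example readable; the notebook uses window=30.
# min_periods=1 means the first rows of each ticker average over fewer days.
toy["volume_avg"] = toy.groupby("ticker")["volume"].transform(
    lambda s: s.rolling(window=2, min_periods=1).mean()
)
print(toy)
```

Two things to notice: the window restarts at each ticker boundary, so B's first value is not mixed with A's last one, and with `min_periods=1` the earliest rows of every ticker are averages over a shorter history than the nominal window.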
