# Current Idea (open to other ideas of course!):

## Micro-Regime Autoencoders (MRAE)

### Problem Statement

Traditional factor models can struggle to capture short-horizon "micro-regimes" in stock price behavior. The goal of this project is to use a small autoencoder to learn regimes from 20-day *(I just picked a number; we can change this, obviously)* sliding windows of price/volume features, and then study:

- What the regimes look like
- How they transition from one to another
- Whether transitions have predictive power for near-term returns or volatility
- Potential comparisons to simple baselines (momentum, volatility)
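
To make the setup above concrete, here is a minimal sketch of how the 20-day windows could be built and how regime transitions could be tabulated once regimes exist. Everything in it is an assumption for illustration rather than the project's settled design: the two features (log return, log volume change), the per-window z-scoring, the helper names `make_windows` and `transition_matrix`, and the idea that each window ends up with an integer regime label (for example from clustering the autoencoder's latent codes). It runs on synthetic data, not the real dataset.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

WINDOW = 20  # days per window; same placeholder value as in the text


def make_windows(prices: pd.DataFrame, window: int = WINDOW) -> np.ndarray:
    """Return an array of shape (n_windows, window, 2) for ONE ticker.

    Assumes `prices` is sorted by date and has strictly positive
    'close' and 'volume' columns (log of zero volume would give -inf).
    """
    feats = pd.DataFrame({
        "log_ret": np.log(prices["close"]).diff(),
        "log_vol_chg": np.log(prices["volume"]).diff(),
    }).dropna()
    values = feats.to_numpy()
    if len(values) < window:
        return np.empty((0, window, values.shape[1]))
    # sliding_window_view appends the window axis last:
    # (n_windows, n_features, window)
    win = sliding_window_view(values, window, axis=0)
    win = win.transpose(0, 2, 1)  # -> (n_windows, window, n_features)
    # z-score each window per feature so the model sees shape, not level
    mu = win.mean(axis=1, keepdims=True)
    sd = win.std(axis=1, keepdims=True)
    return (win - mu) / np.where(sd > 0, sd, 1.0)


def transition_matrix(labels: np.ndarray, n_regimes: int) -> np.ndarray:
    """Row-normalized counts of label[t] -> label[t+1] transitions."""
    counts = np.zeros((n_regimes, n_regimes))
    np.add.at(counts, (labels[:-1], labels[1:]), 1)
    row_sums = counts.sum(axis=1, keepdims=True)
    # rows with no observations stay all-zero instead of dividing by zero
    return np.divide(counts, row_sums, out=np.zeros_like(counts),
                     where=row_sums > 0)


# Synthetic single-ticker series, for demonstration only
rng = np.random.default_rng(42)
n_days = 60
demo = pd.DataFrame({
    "close": 100 * np.exp(np.cumsum(rng.normal(0, 0.01, n_days))),
    "volume": rng.integers(1_000, 10_000, n_days).astype(float),
})
X = make_windows(demo)
print(X.shape)  # (40, 20, 2): 59 daily changes -> 40 overlapping windows

# Hypothetical regime labels, one per consecutive window
demo_labels = np.array([0, 0, 1, 1, 0])
P = transition_matrix(demo_labels, n_regimes=2)
print(P)
```

With real data, `make_windows` would be applied per ticker (for example inside a `groupby('ticker')`), so that no window straddles two tickers. Overlapping windows are highly correlated, which is worth keeping in mind when splitting into train and test sets.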

### Optional Download (You shouldn't need to do this. Please use the link in the README to get the dataset.)

In [5]:
# uncomment to download data. 
# Not needed if you have the .parquet file in the data/historical folder

# !python qualifier/utils/download_stock_data.py

### Load Dataset from file

In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings

warnings.filterwarnings("ignore")

# Set style
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (12, 6)

# Set random seeds for reproducibility
np.random.seed(42)

# Data path
PARQUET_PATH = Path("data/historical/all_stocks_historical.parquet")

print(f"Loading data from: {PARQUET_PATH}")
df = pd.read_parquet(PARQUET_PATH)
print(f"\nDataFrame shape: {df.shape}")
print("\nColumn dtypes:")
print(df.dtypes)

Loading data from: data\historical\all_stocks_historical.parquet

DataFrame shape: (25345, 7)

Column dtypes:
ticker            object
date      datetime64[us]
open             float64
high             float64
low              float64
close            float64
volume           float64
dtype: object


In [7]:
# Standardize and prepare data
print("Available columns:", df.columns.tolist())

# Ensure date is datetime
df['date'] = pd.to_datetime(df['date'])

# Sort by ticker and date
df = df.sort_values(['ticker', 'date']).reset_index(drop=True)

print("\n" + "=" * 60)
print("Data Summary:")
print("=" * 60)
print(f"Total rows: {len(df):,}")
print(f"Unique tickers: {df['ticker'].nunique():,}")
print(f"Date range: {df['date'].min().date()} to {df['date'].max().date()}")
print(f"Days covered: {(df['date'].max() - df['date'].min()).days:,} days")
print(f"Date column type: {df['date'].dtype}")

Available columns: ['ticker', 'date', 'open', 'high', 'low', 'close', 'volume']

Data Summary:
Total rows: 25,345
Unique tickers: 5
Date range: 2005-09-27 to 2025-11-18
Days covered: 7,357 days
Date column type: datetime64[us]


In [8]:
# original data with nans
print("\n" + "=" * 60)
print("Original Data:")
print("=" * 60)
display(df.head())

# no nan 
print("\n" + "=" * 60)
print("valid (non-null) data:")
print("=" * 60)

# remove rows with nans (keep the full filtered frame; preview only the first rows)
df_valid = df.dropna(subset=['close', 'volume'])
display(df_valid[['ticker', 'date', 'open', 'high', 'low', 'close', 'volume']].head())


Original Data:


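|   | ticker | date       | open | high | low | close | volume |
|---|--------|------------|------|------|-----|-------|--------|
| 0 | AACB   | 2005-09-27 | NaN  | NaN  | NaN | NaN   | NaN    |
| 1 | AACB   | 2005-09-28 | NaN  | NaN  | NaN | NaN   | NaN    |
| 2 | AACB   | 2005-09-29 | NaN  | NaN  | NaN | NaN   | NaN    |
| 3 | AACB   | 2005-09-30 | NaN  | NaN  | NaN | NaN   | NaN    |
| 4 | AACB   | 2005-10-03 | NaN  | NaN  | NaN | NaN   | NaN    |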



valid (non-null) data:


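|      | ticker | date       | open | high | low  | close | volume   |
|------|--------|------------|------|------|------|-------|----------|
| 4912 | AACB   | 2025-04-07 | 9.88 | 10.1 | 9.88 | 9.88  | 8400.0   |
| 4913 | AACB   | 2025-04-08 | 9.95 | 9.95 | 9.95 | 9.95  | 100.0    |
| 4914 | AACB   | 2025-04-09 | 9.9  | 9.9  | 9.9  | 9.9   | 125100.0 |
| 4915 | AACB   | 2025-04-10 | 9.85 | 9.9  | 9.85 | 9.89  | 101000.0 |
| 4916 | AACB   | 2025-04-11 | 9.88 | 9.89 | 9.88 | 9.885 | 641700.0 |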
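
The preview above shows that AACB has no prices until 2025-04-07, even though its rows start in 2005. Since window construction needs contiguous valid history, it may help to check how much usable data each ticker has. The sketch below is one way to do that. The helper name `valid_coverage` and the toy tickers `AAA` and `BBB` are made up for illustration; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd


def valid_coverage(df: pd.DataFrame, cols=("close", "volume")) -> pd.DataFrame:
    """Per-ticker first/last valid date and share of rows with data.

    Tickers with no valid rows at all do not appear in the result.
    """
    valid = df.dropna(subset=list(cols))
    cov = valid.groupby("ticker")["date"].agg(
        first_valid="min", last_valid="max", n_valid="count"
    )
    # total rows per ticker (including NaN rows), aligned on the ticker index
    cov["n_total"] = df.groupby("ticker").size()
    cov["valid_frac"] = cov["n_valid"] / cov["n_total"]
    return cov


# Toy frame with the same columns as the real data (values are invented)
toy = pd.DataFrame({
    "ticker": ["AAA"] * 4 + ["BBB"] * 3,
    "date": pd.to_datetime([
        "2025-01-01", "2025-01-02", "2025-01-03", "2025-01-06",
        "2025-01-01", "2025-01-02", "2025-01-03",
    ]),
    "close": [np.nan, np.nan, 10.0, 10.1, 5.0, 5.1, 5.2],
    "volume": [np.nan, np.nan, 100.0, 200.0, 50.0, 60.0, 70.0],
})
coverage = valid_coverage(toy)
print(coverage)
```

On the real frame this would be called as `valid_coverage(df)`. A ticker with a very low `valid_frac` or a late `first_valid` contributes few windows, which matters when only five tickers are available.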
