Precursor is a financial machine learning pipeline built on Databricks that ingests S&P 500 market data, macroeconomic indicators, and SEC insider trading filings, engineers predictive features, trains two complementary models, and surfaces signals through a Streamlit dashboard.
The pipeline follows a medallion architecture (Bronze → Silver → Gold) and is divided into five stages:
| Stage | Notebooks | Description |
|---|---|---|
| Extraction | bootstrap_universe, 00_bootstrap_fred, 00_bootstrap_sec, 01_bootstrap_alpaca |
One-time historical data ingestion from FRED, SEC EDGAR, and Alpaca Markets |
| Transformation | 02_bronze, 03_silver, 04_gold |
Cleaning, joining, and feature engineering |
| Training | 05_train |
XGBoost and Temporal Fusion Transformer (TFT) model training |
| Evaluation | 06_evaluate |
Checkpoint evaluation and optional training resumption |
| Prediction | 07_predict_v3 |
Daily inference for all S&P 500 tickers |
A Streamlit dashboard (Dashboard/app.py) visualises the outputs across four pages: Insider Trades, Market Insights, Stock Explorer, and Today's Picks.
- Alpaca Markets - 5 years of daily OHLCV price history for every S&P 500 constituent
- FRED (Federal Reserve) - Macroeconomic indicators (Fed Funds Rate, CPI, VIX, yield curves, etc.)
- SEC EDGAR - Form 4 insider trading filings for all tickers in the universe
bootstrap_universe → builds precursor.bronze.universe + trading calendar
00_bootstrap_fred → writes precursor.bronze.fred_raw
00_bootstrap_sec → writes precursor.bronze.sec_raw
01_bootstrap_alpaca → writes precursor.bronze.alpaca_raw
Important: Run
bootstrap_universefirst. All other bootstrap notebooks depend onprecursor.bronze.universe.
After the initial bootstrap, daily updates are handled by separate ingestion jobs (not included in this repo).
02_bronze → cleans raw tables (type fixes, null handling, deduplication)
03_silver → joins all three clean tables into precursor.silver.joined (one row per ticker/date)
04_gold → engineers ~40 features + target variables → precursor.gold.features
The gold notebook enforces strict look-ahead bias prevention: all feature windows use rowsBetween(-n, -1) and only the target variable uses lead().
05_train trains two models:
| Model | Target | AUC |
|---|---|---|
| XGBoost | target_21d (21-day price direction) |
~0.5316–0.5333 |
| Temporal Fusion Transformer (PyTorch) | target_1d (next-day direction) |
- |
Models are tracked with MLflow and saved to /Volumes/precursor/models/artifacts/.
06_evaluate loads the TFT checkpoint, runs inference on the test set (2024-01-01 onward), logs metrics to MLflow, and optionally resumes training.
07_predict_v3 generates daily predictions for all tickers using both models and writes to:
precursor.gold.predictions- per-model directional predictions with horizon (1d/21d)precursor.gold.agreement- combined signal (both UP / both DOWN / disagree)
Start the dashboard locally:
pip install -r requirements.txt
streamlit run Dashboard/app.pyThe dashboard connects to Databricks SQL and provides four pages:
- The Insider Trades - SEC Form 4 activity across the S&P 500
- Market Insights - Macro signals, sector predictability rankings, and backtested strategy performance
- Stock Explorer - Deep-dive into any S&P 500 stock (price history, features, model predictions)
- Today's Picks - Top-ranked stocks by model confidence, filterable by model and horizon
The model selector in the sidebar switches between XGBoost (21-day) and TFT (1-day) views.
plotly==5.19.0
databricks-sql-connector==3.0.1
pandas==2.0.3
numpy==1.26.4
Additional notebook-level dependencies (installed via %pip install within each notebook):
fredapi- FRED data extractionalpaca-py- Alpaca Markets APIxgboost,scikit-learn- baseline modelpytorch-forecasting,pytorch-lightning,torch- TFT modelmlflow- experiment trackingexchange-calendars,pandas-market-calendars- trading calendar
Note: All notebooks pin
numpy<2.0to maintain ABI compatibility with pandas compiled C extensions. Do not upgrade numpy without testing.
precursor/
├── bronze/
│ ├── universe # S&P 500 constituents + trading calendar
│ ├── alpaca_raw # Raw OHLCV data
│ ├── fred_raw # Raw macro indicators
│ ├── sec_raw # Raw Form 4 filings
│ ├── alpaca_clean # Cleaned price data
│ ├── fred_clean # Cleaned macro data
│ └── sec_clean # Cleaned insider trades
├── silver/
│ └── joined # Unified ticker/date table (~1.4M rows)
└── gold/
├── features # Engineered features + targets
├── predictions # Model predictions
├── backtest # Backtest results
├── findings # Analytical findings
└── agreement # Dual-model agreement signal
- Configure your Databricks workspace and Unity Catalog.
- Run extraction notebooks in order (see Extraction).
- Run transformation notebooks:
02_bronze→03_silver→04_gold. - Train models:
05_train. - Evaluate:
06_evaluate. - Generate predictions:
07_predict_v3. - Launch the dashboard:
streamlit run Dashboard/app.py.
After initial setup, steps 4–7 are designed to run on a daily schedule.