cockles98/titanium-alpha


Titanium Alpha

Python 3.10+ | Tests: 1002 | License: MIT | CI

An agentic multi-strategy hedge fund system that uses AI agents to debate investment decisions the way a real trading desk operates. Four specialized agents -- a Technical Analyst, a Fundamentalist, a Devil's Advocate, and a Portfolio Manager -- analyse deep learning forecasts, financial news, and market data, then argue their positions before committing capital. The system validates every strategy through CPCV-OOS parameter optimization with Deflated Sharpe Ratio, a walk-forward backtest across 52 S&P 500 constituents with triweekly rebalancing, volatility targeting, and risk-free cash carry -- then allocates risk using Hierarchical Risk Parity with Ledoit-Wolf shrinkage. A three-tier fine-tuning grid search (547 configs) with resume capability, PatchTST model caching, and top-N ticker selection enables systematic parameter optimization at scale.

Figure: War Room live agent debate replay. The four-agent debate streams live for each ticker and ends with the Portfolio Manager's verdict.


Why This Matters

Traditional quantitative trading systems rely on a single model making a single prediction. When that model is wrong, there is no safety net.

Titanium Alpha takes a fundamentally different approach:

  • Multiple perspectives reduce blind spots. A technical analyst may see a bullish RSI divergence while the bear agent identifies an earnings risk. The portfolio manager weighs both views before deciding -- mimicking how the best hedge fund teams actually operate.

  • Deep learning captures patterns that rules cannot. PatchTST (a transformer architecture purpose-built for time series) forecasts 5-day returns with quantile uncertainty bands. CDF interpolation produces continuous P(up) probabilities per ticker rather than crude discrete counts.

  • Memory matters. A RAG pipeline embeds financial news into ChromaDB, giving agents access to recent events -- earnings surprises, macro shifts, sector rotations -- so decisions are grounded in reality, not just price charts.

  • Every strategy parameter is validated before deployment. A three-tier fine-tuning grid search (547 configs across 3 tiers, ~18h per tier) with CPCV-OOS (Combinatorial Purged Cross-Validation Out-of-Sample) and Deflated Sharpe Ratio (Bailey & Lopez de Prado, 2014) tests configurations across 15 non-overlapping paths. The champion is then validated on a temporal holdout period (2 years, unseen by the grid search) with n_trials=1 -- eliminating the multiple-testing penalty. Resume capability, incremental saves (crash-safe), ETA estimation, and cross-tier deduplication ensure no wasted compute. Only parameters with pct_positive >= 66.7% and DSR p-value > 0.95 are accepted.

  • Walk-forward benchmark with rigorous temporal discipline. A full temporal simulation across 52 US large-cap stocks with triweekly rebalancing (15 trading days), semi-annual PatchTST retraining, 10% annualized volatility targeting, and zero look-ahead bias (decisions use only data up to t-1, returns are earned at t). Optional top-N ticker selection filters to the highest-conviction positions at each rebalance. Cash earns the risk-free rate pro-rata; margin (when leveraged) costs rf + spread. Transaction costs (slippage + commission) applied on every position change. PatchTST model caching avoids redundant retraining across runs with identical training windows. 16 portfolio-vs-benchmark metrics computed automatically.

  • Risk allocation is mathematically principled. Hierarchical Risk Parity (Lopez de Prado, 2016) with Ledoit-Wolf covariance shrinkage, confidence tilt from model signals (cap=0.20), and dynamic weight caps (min(6%, 2/N)) that scale automatically with the number of assets.
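As a rough illustration of the dynamic cap in the last bullet, the sketch below computes min(6%, 2/N) and applies one simple waterfilling pass that clips over-cap weights and redistributes the excess pro rata to the remaining names. The function names (`dynamic_max_weight`, `waterfill`) and the redistribution rule are assumptions made for illustration; they are not taken from src/portfolio/hrp.py.

```python
def dynamic_max_weight(n_assets: int, hard_cap: float = 0.06) -> float:
    """Per-asset cap min(hard_cap, 2/N), as described in the text."""
    if n_assets <= 0:
        raise ValueError("n_assets must be positive")
    return min(hard_cap, 2.0 / n_assets)


def waterfill(weights: dict[str, float], cap: float) -> dict[str, float]:
    """Clip weights at `cap` and hand the excess to uncapped names pro rata.

    Hypothetical helper: if every name ends up capped, the leftover is simply
    not allocated (it would remain as cash).
    """
    w = dict(weights)
    capped: set[str] = set()
    for _ in range(len(w)):
        excess = sum(max(v - cap, 0.0) for v in w.values())
        if excess <= 1e-12:
            break
        for name, value in w.items():
            if value >= cap:
                w[name] = cap
                capped.add(name)
        free = [name for name in w if name not in capped]
        free_total = sum(w[name] for name in free)
        if not free or free_total <= 0.0:
            break
        for name in free:
            w[name] += excess * w[name] / free_total
    return w


cap_52 = dynamic_max_weight(52)   # 2/52, about 3.85%
capped = waterfill({"A": 0.5, "B": 0.3, "C": 0.2}, cap=0.4)
```

With 52 assets the 2/N term binds, which matches the 3.85% cap quoted in the results note; with fewer than 34 assets the 6% hard cap binds instead.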

The result is an end-to-end system where every component -- from data ingestion to portfolio allocation -- is production-grade, fully tested, and designed to make better decisions under uncertainty.


Key Results (Walk-Forward Benchmark, 52 Tickers + SPY, 10 Years 2016--2026)

Figure: Benchmark tab. 10 years of walk-forward equity vs SPY, drawdown panel, rolling Sharpe, and weight heatmap.

| Metric | Fine-Tuned (10y OOS) | SPY Buy-and-Hold |
|---|---|---|
| Sharpe Ratio | 0.766 | 0.592 |
| CAGR | 13.68% | 14.89% |
| Total Return | 259.3% | 299.5% |
| Max Drawdown | -21.94% | -33.72% |
| Max DD Duration | 361 days | 488 days |
| Sortino Ratio | 1.058 | 0.826 |
| Calmar Ratio | 0.624 | 0.442 |
| Alpha (CAPM) | +2.57% | -- |
| Beta | 0.566 | 1.0 |
| Ann. Volatility | 11.19% | 17.93% |
| Tracking Error | 9.11% | -- |
| Avg Positions | 52.0 | -- |
| Avg Annual Turnover | 230% | -- |

Note: Walk-forward backtest over 10 years of out-of-sample equity (2016-04-19 → 2026-04-17, 2514 trading days, 15 years of raw OHLCV with 3 years consumed as covariance warmup). Fine-tuned config (rb=15, vol_target=10%, max_weight=min(6%, 2/N)=3.85%, Ward+Ledoit-Wolf) was identified by a 3-tier CPCV-OOS grid search (547 configs). Volatility targeting at 10% annualized is the single biggest Sharpe driver. The strategy carries ~57% of the market beta while generating +2.57% Jensen's alpha, producing a Sharpe of 0.766 vs SPY's 0.592 — lower absolute return (13.68% CAGR vs 14.89%) but materially lower risk (11.19% ann. vol vs 17.93%, -21.94% max DD vs -33.72%). SPY metrics computed on the same equity curve with the same formula (src/backtest/benchmark_metrics.py). Uses NaiveModelFactory on clean data (thread-safe yf.Ticker().history()) with 756-day covariance lookback and semi-annual retraining. The multi-agent debate layer has not yet been backtested. See docs/design_gap_backtest_vs_production.md for details.
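The headline figures can be cross-checked with a few lines of arithmetic. The sketch below recovers CAGR from total return over the 2514 trading days and the Calmar ratio from CAGR and max drawdown. It assumes a 252-trading-day year, which is a common convention but not something the README states; the exact formulas in src/backtest/benchmark_metrics.py may differ.

```python
TRADING_DAYS_PER_YEAR = 252  # assumed annualisation convention


def cagr_from_total_return(total_return: float, n_days: int) -> float:
    """Annualised growth rate implied by a cumulative return over n_days."""
    return (1.0 + total_return) ** (TRADING_DAYS_PER_YEAR / n_days) - 1.0


def calmar(cagr: float, max_drawdown: float) -> float:
    """Calmar ratio: CAGR divided by the magnitude of the max drawdown."""
    return cagr / abs(max_drawdown)


# Inputs copied from the results table above.
strategy_cagr = cagr_from_total_return(2.593, 2514)
spy_cagr = cagr_from_total_return(2.995, 2514)
strategy_calmar = calmar(strategy_cagr, -0.2194)
spy_calmar = calmar(spy_cagr, -0.3372)

print(f"strategy CAGR {strategy_cagr:.2%}, Calmar {strategy_calmar:.3f}")
print(f"SPY CAGR {spy_cagr:.2%}, Calmar {spy_calmar:.3f}")
```

Under that assumption the table is internally consistent: total return, CAGR, and Calmar agree with each other to rounding.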


Architecture

flowchart TB
    subgraph Data Layer
        YF[yfinance API<br/>52 tickers + SPY] -->|OHLCV parallel<br/>auto_adjust=True| PG[(PostgreSQL)]
        RSS[RSS Feeds] -->|News Articles| PG
        CFG[config/tickers.json<br/>52 US large caps] -.->|ticker list| YF
    end

    subgraph Feature Engineering
        PG -->|OHLCV| FE[Feature Engine<br/>RSI, Bollinger, Vol, VWAP, OBV]
        FE --> PTST[PatchTST<br/>5-day Quantile Forecasts]
        PTST --> PRED[predictions.parquet<br/>forecast.parquet]
    end

    subgraph RAG Pipeline
        PG -->|News| EMB[Sentence Transformers<br/>all-MiniLM-L6-v2]
        EMB --> CHROMA[(ChromaDB)]
        CHROMA -->|Semantic Retrieval| RAG[RAG Context<br/>top-k=5, 30-day window]
    end

    subgraph Agent Debate - LangGraph
        PRED --> LC[Load Context]
        RAG --> RR[RAG Retrieval]
        LC --> RR
        RR --> TA[Technical Analyst<br/>RSI, Bollinger, Volume]
        TA --> FA[Fundamentalist<br/>News, Macro, Earnings]
        FA --> BA[Bear Agent<br/>Devil's Advocate]
        BA --> PM[Portfolio Manager<br/>Final Synthesis]
        PM --> DEC{BUY / HOLD / SELL<br/>+ confidence}
    end

    subgraph Portfolio Allocation
        PG -->|OHLCV| RET[Log Returns<br/>756-day lookback]
        DEC -->|Confidences| HRP[HRP Optimizer<br/>Ledoit-Wolf + Ward]
        RET --> HRP
        HRP --> MERGE[Merge Actions + Weights<br/>BUY=HRP, HOLD=HRP*conf,<br/>SELL=0, cash implicit]
        MERGE --> JSON[decisions.json]
    end

    subgraph CPCV-OOS Validation
        PTST --> CPCV[CPCV Backtester<br/>15 paths, embargo, costs]
        CPCV --> DSR[Deflated Sharpe Ratio<br/>Bailey & Lopez de Prado 2014]
        DSR --> GRID[3-Tier Grid Search<br/>547 configs, resume/dedup]
        GRID --> VALID[Validated Parameters]
    end

    subgraph Walk-Forward Benchmark
        PG -->|OHLCV 52 tickers| WF[Walk-Forward Backtester<br/>triweekly rebalance, vol targeting, decision_date=t-1]
        VALID -.->|optimised config| WF
        WF -->|equity curve| BM[Benchmark Metrics<br/>16 metrics vs SPY]
        BM --> PDF2[Benchmark Report PDF<br/>6 pages]
        WF -->|weights history| PQT[benchmark_equity.parquet<br/>benchmark_weights.parquet<br/>benchmark_metrics.json]
    end

    subgraph Dashboard - Streamlit
        PQT --> TAB0[Benchmark Tab<br/>Equity, Drawdown, Rolling Sharpe, Heatmap]
        JSON --> TAB1[Performance Tab<br/>Weights, Decisions, Metrics]
        JSON --> TAB2[War Room Tab<br/>Agent Debate Replay]
        PRED --> TAB3[Microstructure Tab<br/>Fan Charts, CDF P-up]
    end

Key Features

Component Description
PatchTST Forecaster Transformer-based 5-day return forecasting with quantile uncertainty (10th, 25th, 50th, 75th, 90th percentiles). CDF interpolation with monotonicity rearrangement produces continuous P(up) probability per ticker. NaN/infinity guards prevent corrupt predictions. get_params() and cache-aware load() enable deterministic model caching across walk-forward runs. Supports both old and new NeuralForecast column naming conventions.
Multi-Agent Debate Four Gemini agents with distinct personas debate each ticker through a LangGraph pipeline. Structured output via Pydantic ensures machine-readable decisions. Live streaming mode with per-node callbacks. Anthropic Claude also supported via LLM_PROVIDER=anthropic.
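Financial RAG News articles embedded with sentence-transformers, stored in ChromaDB, retrieved by semantic similarity with date-aware reranking. Agents are prompted to cite the retrieved sources rather than invent news; when no article is retrieved for a ticker, the report carries sources_cited=[].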
CPCV-OOS Validation Three-tier fine-tuning grid search (547 configs, ~18h per tier). Tier 1: single-axis sweeps + key 2-way interactions (249 configs). Tier 2: factorial combinations on promising dimensions (149 configs). Tier 3: fine-grain refinement on champion sweet spot (149 configs). Resume capability skips already-computed configs on restart; incremental saves every 5 configs (crash-safe); rolling-average ETA; cross-tier deduplication. Holdout temporal validation: champion tested on unseen 2-year holdout period with n_trials=1 (no multiple-testing penalty). Deflated Sharpe Ratio (Bailey & Lopez de Prado, 2014) with empirical skewness/kurtosis on excess returns. Acceptance criteria: pct_positive >= 66.7% and DSR p-value > 0.95.
Walk-Forward Benchmark Full temporal simulation across 52 US large-cap stocks with triweekly rebalancing (15 trading days, CPCV-OOS validated), semi-annual PatchTST retraining (126 days), 756-day covariance lookback (~3 years), 10% annualized volatility targeting (63-day lookback, 0.5-1.0 leverage band), and zero look-ahead bias (decision_date = t-1). Portfolio starts 100% in cash (institutional initialization). Optional top-N ticker selection filters to highest-conviction positions at each rebalance with adaptive max_weight = min(6%, 2/N_selected). PatchTST model caching (hash of training window + hyperparameters) avoids redundant retraining. Cash earns rf pro-rata (geometric compounding); margin costs rf + spread. Ex-ante volatility targeting uses simulated portfolio returns (pre-allocation, not post-return). Bankruptcy safeguard terminates backtest if capital reaches zero. Compared against SPY buy-and-hold. Sigmoid-based momentum scoring in NaiveModelFactory replaces linear clamp for smoother confidence curves.
Benchmark Metrics 16 portfolio-vs-benchmark metrics: CAGR, Sharpe, Sortino, Information Ratio, Jensen's alpha (CAPM OLS), beta, max drawdown, max drawdown duration, Calmar ratio, tracking error, monthly hit rate, avg turnover, and more.
HRP Allocation Hierarchical Risk Parity with Ledoit-Wolf covariance shrinkage, sum-preserving confidence tilt (weighted-mean neutral, cap=0.20), waterfilling constraint optimizer with turnover latching (turnover_threshold=0.02). Dynamic weight caps (min(6%, 2/N)) that scale with the number of assets. Supports previous_weights for turnover-aware rebalancing.
Transaction Cost Model Slippage (5 bps), commission (10 bps), and liquidity-aware market impact (1/sqrt(relative_volume)) applied per position change. Turnover threshold (min_rebalance_delta=0.02) to skip dust rebalances.
Data Ingestion yf.Ticker().history() with auto_adjust=True ensures split- and dividend-adjusted OHLCV prices throughout. Thread-safe per-ticker objects prevent data corruption in parallel downloads (the older yf.download() API shares internal session/cache state across threads, causing adjacent tickers to receive identical data). 12 years of history (2014--2026) for 52 tickers + SPY benchmark.
Streamlit Dashboard Four-tab interface: benchmark (equity curve, drawdown, rolling Sharpe, weight heatmap, CPCV-OOS results), portfolio performance (donut + bar charts), war room (agent debate replay with live streaming), and microstructure (fan charts with quantile bands and last-close reference line).
Feature Engineering RSI, Bollinger Bands, realized volatility, VWAP, OBV, relative volume -- all implemented in Polars with zero look-ahead bias (verified by quant reviewer).

Dashboard

Four-tab Streamlit interface (make run, then open http://localhost:8501). Every asset below is rendered from the same live data the backtester consumes -- no mocks, no cherry-picking.

War Room -- live agent debate

Figure: War Room tab with agent reports and the Portfolio Manager final card.

Streams the Technical Analyst, Fundamentalist, Bear and Portfolio Manager reports in real time, or replays a past debate frame-by-frame. Each agent's structured output (thesis, confidence, catalysts, risks, cited sources) is rendered in its own card; the PM card carries the final action, confidence, and suggested weight.

Microstructure -- per-ticker forecast uncertainty

Figure: Microstructure tab with the PatchTST fan chart, quantile bands, and P(up) card.

Renders PatchTST's 5-quantile forecast as a fan chart anchored at the last-close reference line: an 80% band (10th to 90th percentile) and a 50% band (25th to 75th percentile) around the median. A continuous P(up) card is derived by CDF interpolation of the quantile outputs, not by a discrete count of positive quantiles.
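A minimal sketch of how a continuous P(up) can be read off five quantile forecasts: sort the quantile values (the monotone rearrangement), linearly interpolate the CDF at a return of zero, and take the complement. The function name `prob_up` and the tail handling (clamping to the outermost quantile levels when zero falls outside the forecast range) are assumptions; the project's own logic lives in src/models/patchtst_model.py and may treat tails differently.

```python
import math

QUANTILE_LEVELS = (0.10, 0.25, 0.50, 0.75, 0.90)


def prob_up(quantile_returns: list[float], levels: tuple[float, ...] = QUANTILE_LEVELS) -> float:
    """P(return > 0) by linear interpolation of the forecast CDF at zero."""
    if len(quantile_returns) != len(levels):
        raise ValueError("one forecast value per quantile level is required")
    if not all(math.isfinite(q) for q in quantile_returns):
        raise ValueError("non-finite quantile forecast")

    values = sorted(quantile_returns)  # rearrangement removes quantile crossing

    # Zero outside the forecast range: clamp to the outermost levels (assumed).
    if values[0] >= 0.0:
        return 1.0 - levels[0]
    if values[-1] <= 0.0:
        return 1.0 - levels[-1]

    points = list(zip(values, levels))
    for (x0, p0), (x1, p1) in zip(points, points[1:]):
        if x0 <= 0.0 <= x1:
            cdf_at_zero = p0 + (p1 - p0) * (0.0 - x0) / (x1 - x0)
            return 1.0 - cdf_at_zero
    raise RuntimeError("unreachable: zero lies inside the sorted range")


p = prob_up([-0.02, -0.005, 0.004, 0.012, 0.03])
```

In the example, zero sits between the 25th and 50th percentile forecasts, so the result lands between 0.50 and 0.75 instead of the coarse 3/5 that counting positive quantiles would give.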


Quick Start

# 1. Clone and install dependencies
git clone https://github.com/cockles98/titanium-alpha.git
cd titanium-alpha && poetry install --no-root

# 2. Start PostgreSQL and ChromaDB
docker compose -f docker/docker-compose.yml up -d

# 3. Configure environment variables
cp .env.example .env  # then fill in API keys

# 4. Ingest market data (52 US tickers + SPY, parallel download, auto_adjust=True)
make ingest

# 5. Run PatchTST predictions
make predict

# 6. Run walk-forward benchmark vs S&P 500
make benchmark          # PatchTST model (production, cached)
make benchmark-fast     # NaiveModelFactory (quick validation)
# Optional: top-N ticker selection
python -m src.backtest.run_benchmark --top-n=10

# 7. Fine-tune parameters with CPCV-OOS (three-tier grid search)
python -m src.backtest.run_validation --tier 1       # Day 1 (~165 configs, ~18h)
python -m src.backtest.run_validation --tier 2       # Day 2 (~165 configs, ~18h)
python -m src.backtest.run_validation --tier 3       # Day 3 (~170 configs, ~18h)
python -m src.backtest.run_validation --holdout      # Validate champion on 2-year holdout
python -m src.backtest.run_validation --holdout --holdout-years=3
python -m src.backtest.run_validation --dry-run      # Print grid, no execution
python -m src.backtest.run_validation --estimate     # Time 1 config
make validate           # Full validation (all tiers)
make validate-fast      # Tier 1 only

# 8. Run agent debate + portfolio allocation (requires an LLM API key in .env, e.g. ANTHROPIC_API_KEY when LLM_PROVIDER=anthropic)
make decide

# 9. Launch the dashboard
make run

Prerequisites: Python 3.10+, Poetry, Docker, and an .env file with database credentials and API keys. See .env.example for required variables.


Project Structure

titanium-alpha/
|-- src/
|   |-- config.py           Ticker configuration loader (52 US + SPY benchmark)
|   |-- data/               Data ingestion (yfinance OHLCV + RSS/NewsAPI, parallel download)
|   |-- models/             PatchTST forecaster, feature engineering, prediction pipeline
|   |-- agents/             LangGraph debate graph, personas, state management, RAG
|   |-- backtest/           CPCV, CPCV-OOS validator, walk-forward backtester, benchmark metrics
|   |-- portfolio/          HRP optimizer (Ledoit-Wolf, ward linkage), decision engine
|   |-- dashboard/          Streamlit app (4 tabs: Benchmark, Performance, War Room, Microstructure)
|   |-- utils/              Database connections (PostgreSQL, ChromaDB)
|-- config/                 tickers.json (52 US large caps, SPY benchmark)
|-- tests/                  1002 tests (pytest), fixtures in conftest.py
|-- docker/                 docker-compose.yml (PostgreSQL 15 + ChromaDB)
|-- docs/                   Architecture, backtest metrics, design gap analysis, research notes
|-- notebooks/              Exploration only (never imported by src/)
|-- data/outputs/           Pipeline artifacts (predictions, decisions, equity curves, reports)
|-- models/checkpoints/     Saved PatchTST model weights
|-- models/wf_cache/        PatchTST walk-forward cache (hash-based, auto-managed)
|-- Makefile                setup, ingest, predict, decide, benchmark, validate, test, lint, run
|-- pyproject.toml          Poetry config, ruff, mypy, pytest settings

How It Works

Decision Pipeline

The DecisionEngine orchestrates the live decision pipeline in ten steps:

from src.portfolio.decision_engine import DecisionEngine

engine = DecisionEngine()
output = engine.run()

# output.decisions -> per-ticker BUY/HOLD/SELL with HRP weights
# output.hrp_final_weights -> {"AAPL": 0.04, "NVDA": 0.03, ...}  (investable tickers)
# output.metadata -> invested_fraction, confidence_source, n_buy/n_hold/n_sell

Pipeline steps:

  1. Load OHLCV from PostgreSQL (52 US large caps from config/tickers.json)
  2. Compute log returns in wide format, trimmed to a 756-day lookback window
  3. Run agent debate -- four LLM agents (Gemini by default, Claude via LLM_PROVIDER=anthropic) analyse PatchTST forecasts and RAG-retrieved news, producing BUY/HOLD/SELL with confidence scores per ticker
  4. Load PatchTST predictions (predictions.parquet) as fallback confidence source
  5. Extract confidences from the debate; missing tickers use PatchTST prob_up fallback (or 0.5 if neither is available)
  6. Classify tickers into BUY, HOLD, and SELL groups (no debate = BUY)
  7. Filter to investable subset (BUY + HOLD only) for HRP
  8. Run HRP on the investable subset with Ledoit-Wolf shrinkage and confidence tilt (cap=0.20), dynamic max_weight = min(6%, 2/N)
  9. HOLD scaling -- HOLD tickers get weight * confidence (reduced but non-zero); max_weight enforced without renormalization; sum(weights) <= 1.0 with implicit cash
  10. Save decisions.json and debate_history.json for dashboard consumption

Three-tier weight model: BUY tickers receive full HRP weight, HOLD tickers receive HRP weight scaled by confidence (< 0.3 by rule, so significantly reduced), SELL tickers get weight 0. The gap between sum(weights) and 1.0 is implicit cash.
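The three-tier rule can be sketched in a few lines. This is an illustration under stated assumptions, not the DecisionEngine code: `merge_actions` is a hypothetical name, a missing confidence defaults to 0.5, and the cap is applied by simple clipping with no renormalization, so any clipped weight becomes cash.

```python
def merge_actions(
    hrp_weights: dict[str, float],
    actions: dict[str, str],
    confidences: dict[str, float],
    max_weight: float,
) -> tuple[dict[str, float], float]:
    """Return (final_weights, implicit_cash) under the BUY/HOLD/SELL rule."""
    final: dict[str, float] = {}
    for ticker, action in actions.items():
        base = hrp_weights.get(ticker, 0.0)
        if action == "BUY":
            weight = base                                     # full HRP weight
        elif action == "HOLD":
            weight = base * confidences.get(ticker, 0.5)      # scaled by confidence
        else:
            weight = 0.0                                      # SELL: no position
        final[ticker] = min(weight, max_weight)               # clip, do not renormalize
    cash = 1.0 - sum(final.values())
    return final, cash


weights, cash = merge_actions(
    hrp_weights={"AAPL": 0.6, "NVDA": 0.4},
    actions={"AAPL": "BUY", "NVDA": "HOLD", "TSLA": "SELL"},
    confidences={"NVDA": 0.25},
    max_weight=0.5,
)
```

Because nothing is renormalized, the weights sum to at most 1.0 and the shortfall is the implicit cash position described above.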

Graceful degradation is built in: if the agent debate fails (no API key, network error), the pipeline loads PatchTST prob_up from predictions.parquet and applies percentile-based classification (bottom 15% SELL, 15-40% HOLD, top 60% BUY; requires minimum 5 tickers). If predictions are also unavailable, all tickers default to BUY with uniform confidence (0.5). See design gap analysis for details.
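The percentile-based fallback can be illustrated as a rank cut on prob_up. The sketch assumes ties are broken by sort order and that a universe below the 5-ticker minimum defaults to all BUY; the README does not say what happens in that case, so treat both as assumptions, along with the name `classify_by_percentile`.

```python
def classify_by_percentile(prob_up: dict[str, float], min_tickers: int = 5) -> dict[str, str]:
    """Bottom 15% SELL, next 25% HOLD, top 60% BUY, ranked by prob_up."""
    n = len(prob_up)
    if n < min_tickers:
        return {ticker: "BUY" for ticker in prob_up}  # assumed behaviour for tiny universes

    ranked = sorted(prob_up, key=prob_up.get)  # ascending: weakest signal first
    actions: dict[str, str] = {}
    for rank, ticker in enumerate(ranked):
        # Integer comparison avoids float edge cases at the 15% and 40% cuts.
        if rank * 100 < 15 * n:
            actions[ticker] = "SELL"
        elif rank * 100 < 40 * n:
            actions[ticker] = "HOLD"
        else:
            actions[ticker] = "BUY"
    return actions


probs = {f"T{i:02d}": 0.30 + 0.02 * i for i in range(20)}
actions = classify_by_percentile(probs)
```

With 20 tickers this yields 3 SELL, 5 HOLD, and 12 BUY, matching the 15/25/60 split.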

Walk-Forward Benchmark

The WalkForwardBacktester simulates the strategy historically against SPY buy-and-hold:

from src.backtest.run_benchmark import run_us_benchmark

result = run_us_benchmark(use_patchtst=True)  # PatchTST path; make benchmark-fast runs the NaiveModelFactory variant

# result.equity_curve -> daily portfolio_value vs benchmark_value
# result.metrics -> 16 metrics (Sharpe, Sortino, alpha, beta, max DD, ...)
# result.rebalance_history -> every rebalance with weights, turnover, costs

Benchmark configuration (baseline):

| Parameter | Value | Rationale |
|---|---|---|
| Universe | 52 US large caps + SPY | S&P 500 constituents across 9 sectors |
| Rebalance | Triweekly (15 trading days) | CPCV-OOS validated (547 configs); balances alpha capture vs turnover cost |
| Retrain PatchTST | Semi-annual (126 trading days) | Separates slow/fast cycles |
| Lookback | 756 days (~3 years) | CPCV-OOS validated; stable covariance estimation |
| Costs | 5 bps slippage + 10 bps commission | Conservative for US large caps |
| Cash carry | rf pro-rata (5% annual, geometric) | Positive cash earns risk-free rate |
| Margin cost | rf + 150 bps spread (geometric) | Negative cash (leverage) incurs borrow cost |
| Top-N | None (all tickers) | Optional; filters to highest-conviction positions per rebalance |
| Vol targeting | 10% annualized, 63-day lookback, 0.5--1.0 leverage | CPCV-OOS validated; single biggest Sharpe driver (+0.035) |
| HRP | Ledoit-Wolf shrinkage, Ward linkage, sum-preserving tilt cap=0.20, max_weight=min(6%, 2/N), turnover_threshold=0.02 | CPCV-OOS validated; conservative tilt, diversified allocation, turnover-aware |
| Min rebalance delta | 2% turnover | Skip dust rebalances |
| Decision cutoff | t-1 (previous close) | Zero look-ahead bias |
| Capital | $1,000,000 | Institutional standard |
| OOS period | 10 years (configurable) | Leap-year safe date filtering |
| PatchTST cache | models/wf_cache/<hash>/ | Avoids redundant retraining across runs |

Outputs: benchmark_equity.parquet, benchmark_metrics.json, benchmark_weights.parquet, and a 6-page PDF report (equity curve, drawdown, metrics table, rolling Sharpe, weight heatmap, turnover chart).
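Since volatility targeting is described as the largest Sharpe driver in the configuration above, a small sketch of the leverage rule may help. It scales exposure by target volatility over realized volatility from the last 63 daily returns and clamps the result to the 0.5 to 1.0 band. The name `vol_target_leverage`, the 252-day annualisation, and the use of the sample standard deviation are assumptions; the backtester's ex-ante estimate is built from simulated portfolio returns and may differ in detail.

```python
import math
from statistics import stdev

TRADING_DAYS_PER_YEAR = 252  # assumed annualisation convention


def vol_target_leverage(
    daily_returns: list[float],
    target_vol: float = 0.10,
    lookback: int = 63,
    min_leverage: float = 0.5,
    max_leverage: float = 1.0,
) -> float:
    """Exposure multiplier; daily_returns must only contain data up to t-1."""
    window = daily_returns[-lookback:]
    if len(window) < 2:
        raise ValueError("need at least two returns to estimate volatility")

    realized_vol = stdev(window) * math.sqrt(TRADING_DAYS_PER_YEAR)
    if realized_vol == 0.0:
        return max_leverage
    raw = target_vol / realized_vol
    return min(max(raw, min_leverage), max_leverage)


def alternating(size: float, n: int = 63) -> list[float]:
    """Toy return series that flips sign every day."""
    return [size if i % 2 == 0 else -size for i in range(n)]


calm = vol_target_leverage(alternating(0.001))     # very low vol: capped at 1.0
normal = vol_target_leverage(alternating(0.01))    # about 16% vol: scaled down
stressed = vol_target_leverage(alternating(0.03))  # high vol: floored at 0.5
```

The upper bound of 1.0 means the rule only ever de-risks; it never levers up in quiet markets, which is consistent with the 0.5--1.0 band in the table.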

CPCV-OOS Parameter Validation

Before deploying any configuration, parameters are validated through Combinatorial Purged Cross-Validation:

# Three-tier systematic grid search
python -m src.backtest.run_validation --tier 1        # Day 1: single-axis sweeps (~165 configs)
python -m src.backtest.run_validation --tier 2        # Day 2: factorial combos (~165 configs)
python -m src.backtest.run_validation --tier 3        # Day 3: champion refinement (~170 configs)
python -m src.backtest.run_validation --tier all      # Run all sequentially

# Holdout temporal validation (champion tested on unseen period, n_trials=1)
python -m src.backtest.run_validation --holdout                    # 2-year holdout (default)
python -m src.backtest.run_validation --holdout --holdout-years=3  # 3-year holdout
python -m src.backtest.run_validation --tier all --holdout         # Grid search + holdout

# Utilities
python -m src.backtest.run_validation --estimate      # Time 1 config
python -m src.backtest.run_validation --dry-run       # Print grid, no execution

The validator:

  1. Divides the OOS period into 6 temporal blocks
  2. Generates C(6,2) = 15 combinatorial paths with purge + embargo
  3. Runs the walk-forward backtester on each path independently
  4. Computes Deflated Sharpe Ratio (corrects for multiple testing)
  5. Accepts only configurations with pct_positive >= 66.7% AND DSR p-value > 0.95
  6. Holdout validation: champion is tested on a reserved temporal period (default 2 years) that the grid search never sees, with n_trials=1 -- eliminating the multiple-testing penalty that makes DSR acceptance nearly impossible with 547 configs
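Steps 1 and 2 above can be sketched with `itertools.combinations`: split the sample into 6 contiguous blocks, pick every pair of blocks as the test set, and drop training observations that sit too close to a test block. The sketch models purge and embargo as one symmetric buffer of `embargo` observations on each side of a test block, which is a simplification chosen for illustration; `cpcv_splits` is a hypothetical name, not the API of src/backtest/cpcv.py.

```python
from itertools import combinations
from typing import Iterator


def cpcv_splits(
    n_obs: int,
    n_blocks: int = 6,
    n_test_blocks: int = 2,
    embargo: int = 5,
) -> Iterator[tuple[list[int], list[int]]]:
    """Yield (train_indices, test_indices) for every block combination."""
    bounds = [
        (i * n_obs // n_blocks, (i + 1) * n_obs // n_blocks) for i in range(n_blocks)
    ]
    for test_ids in combinations(range(n_blocks), n_test_blocks):
        test: set[int] = set()
        excluded: set[int] = set()
        for block in test_ids:
            start, end = bounds[block]
            test.update(range(start, end))
            # Buffer on both sides of the test block (simplified purge + embargo).
            excluded.update(range(max(0, start - embargo), min(n_obs, end + embargo)))
        train = [i for i in range(n_obs) if i not in excluded]
        yield train, sorted(test)


splits = list(cpcv_splits(n_obs=600))
first_train, first_test = splits[0]
```

C(6,2) gives the 15 train/test combinations quoted in step 2; each one is then fed to the walk-forward backtester independently.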

Operational features: Resume capability (skips already-computed configs on restart), incremental saves every 5 configs (crash-safe), rolling-average ETA estimation, and cross-tier deduplication (same parameter fingerprint never runs twice). Time budget: ~375s per config (25s x 15 CPCV-OOS paths), ~172 configs per 18h tier.
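Resume and cross-tier deduplication both reduce to giving each configuration a stable fingerprint. A minimal sketch, assuming the fingerprint is a hash of the key-sorted JSON of the parameters (the actual scheme in the project is not shown in this README, and `fingerprint` and `pending_configs` are hypothetical names), together with the time-budget arithmetic quoted above:

```python
import hashlib
import json


def fingerprint(params: dict) -> str:
    """Stable ID for a config: identical parameters give an identical hash."""
    payload = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def pending_configs(grid: list[dict], done: set[str]) -> list[dict]:
    """Configs still to run: skip finished ones and duplicates within the grid."""
    seen = set(done)
    todo = []
    for params in grid:
        key = fingerprint(params)
        if key not in seen:
            seen.add(key)
            todo.append(params)
    return todo


# Time budget from the text: 25 s per path, 15 paths, 18 h per tier.
seconds_per_config = 25 * 15
configs_per_tier = (18 * 3600) // seconds_per_config

grid = [
    {"rb": 15, "vol_target": 0.10},
    {"vol_target": 0.10, "rb": 15},   # same config, different key order
    {"rb": 10, "vol_target": 0.10},
]
already_done = {fingerprint({"rb": 10, "vol_target": 0.10})}
todo = pending_configs(grid, already_done)
```

Sorting keys before hashing is what makes the fingerprint insensitive to dictionary ordering, so a config rebuilt in a later tier maps to the same ID.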


Testing

The test suite covers every module with 1002 tests:

make test
# or directly:
poetry run pytest tests/ -v --tb=short

Test coverage highlights:

| Area | Tests | What is validated |
|---|---|---|
| Data ingestion | 35 | Download, schema, retry, upsert, parallel, partial failure, thread-safe yf.Ticker().history() |
| News ingestion | 43 | RSS parsing, dedup, embedding status, ticker matching |
| Feature engineering | 31 | RSI, Bollinger, volatility, volume, no look-ahead bias |
| PatchTST model | 49 | Init, prepare, build, fit, predict, CDF monotonicity rearrangement, NaN guards, get_params(), cache-aware load |
| Prediction pipeline | 13 | Load, metrics (MAE/RMSE), Parquet roundtrip |
| Ticker config | 13 | Config loading, fallback, validation, real config |
| Agent state + personas | 58 | TypedDicts, Pydantic models, validation, registry |
| LangGraph graph | 36 | Node execution, RAG integration, full pipeline |
| RAG (ChromaDB) | 40 | Embedding, retrieval, reranking, edge cases |
| CPCV backtest | 94 | Splits, purge, embargo, Sharpe, drawdown, costs |
| CPCV-OOS validator | 66 | DSR math, purged factory, grid search, integration |
| CPCV report | 15 | PDF generation, plots, edge cases |
| Walk-forward backtest | 84 | Config, returns, costs, drift, vol targeting, killswitch, top-N selection, sigmoid momentum, bankruptcy safeguard, look-ahead bias |
| Benchmark metrics | 40 | Sharpe, Sortino, alpha, beta, drawdown, hit rate, edge cases |
| Benchmark report | 19 | PDF 6-page generation, helpers, edge cases |
| Run benchmark | 15 | Filter, model factory, PatchTST caching, save, e2e integration |
| Run validation | 73 | Three-tier grid builders, resume/dedup, holdout temporal validation, champion identification, output savers, HRP integration, pipeline mocked |
| HRP optimizer | 80 | Covariance, clustering, bisection, tilt, Ledoit-Wolf shrinkage, Ward linkage, waterfilling |
| Decision engine | 74 | Returns, merge, classify, percentile-based fallback, HOLD scaling, metadata, debate, JSON output |
| Dashboard | 81 | Loaders, charts, agent styles, streaming, benchmark tab, fan chart sort, schema v1.2 |
| DB utilities | 15 | Connection pooling, env vars, overrides |

All tests use mocks for external dependencies (APIs, databases, LLMs). No real API calls are made during testing.


Development Workflow -- Claude Code Subagents

The repository ships five versioned Claude Code subagents under .claude/agents/ that enforce the same review pipeline for anyone who forks the project. They are not runtime components: they are reviewers and generators that Claude delegates to during development, each tuned against concrete incidents from the session history (look-ahead bias, overfitted champions, thread-unsafe yfinance, stale docs, missing tests).

| Agent | Invoked when |
|---|---|
| architect | Before any new module, class, or cross-system integration is written. Validates structure, proposes public signatures first, guards the sacred folder layout. |
| quant-reviewer | Mandatory after any change to model, backtest, or feature engineering code. Checks for look-ahead bias, proper DSR application, geometric rf conversion, transaction-cost integrity. |
| security-data | After any data-pipeline script. Enforces yf.Ticker().history() (not the thread-unsafe yf.download()), explicit SPY inclusion, no real keys in .env diffs, OHLCV schema validation. |
| test-writer | After any new function or class in src/. Writes pytest with Polars fixtures, zero real API calls, target coverage >=80% (>=70% for LangGraph-heavy modules). |
| docs-writer | At the end of each module or phase. Enforces metric hygiene (never the old look-ahead Sharpe ~2.7; always report Sharpe + CAGR + MaxDD + Beta together). |

See .claude/README.md for the full workflow diagram and a fork-adaptation guide.


Tech Stack

| Layer | Technology | Purpose |
|---|---|---|
| Language | Python 3.10+ | Type hints, modern syntax |
| DataFrames | Polars | Fast columnar processing (no Pandas) |
| Deep Learning | NeuralForecast (Nixtla) | PatchTST with MQLoss (5 quantiles) |
| Agents | LangGraph + LangChain | Multi-agent orchestration |
| LLM | Gemini 3.1-flash-lite (Google AI) | Structured output for agent reasoning (Anthropic Claude also supported via LLM_PROVIDER=anthropic) |
| Embeddings | sentence-transformers | all-MiniLM-L6-v2 for news embedding |
| Vector Store | ChromaDB | Semantic search over financial news |
| Database | PostgreSQL 15 | OHLCV + news persistence |
| Portfolio | scipy + numpy + scikit-learn | HRP clustering, Ledoit-Wolf shrinkage |
| Backtesting | Custom CPCV + CPCV-OOS + Walk-Forward | Cross-validation + three-tier grid search + temporal benchmark |
| Reporting | matplotlib + seaborn | 6-page benchmark PDF (equity, drawdown, metrics, heatmap) |
| Dashboard | Streamlit + Plotly | Interactive 4-tab interface |
| Logging | loguru | Structured logging (never print()) |
| Config | python-dotenv + JSON | Environment variables + ticker configuration |
| Packaging | Poetry | Dependency management |
| Containers | Docker Compose | PostgreSQL + ChromaDB services |
| Linting | ruff + mypy | Style enforcement + strict type checking |
| CI | GitHub Actions | Automated test + lint on push/PR |
| Testing | pytest | 1002 tests, all mocked |

Known Limitations

  • Backtest-production gap: The walk-forward benchmark validates NaiveModelFactory (sigmoid momentum) and PatchTST (with model caching). The multi-agent debate pipeline has not yet been backtested with the corrected temporal discipline. See design gap analysis.
  • Agent fallback: When LangGraph agents are unavailable, the decision engine falls back to PatchTST prob_up from predictions.parquet with percentile-based classification (bottom 15% SELL, 15-40% HOLD, top 60% BUY). If predictions are also unavailable, all tickers default to BUY with uniform confidence (0.5).
  • CPCV-OOS acceptance: Grid search across 547 configs (3 tiers) found 0 DSR-accepted configurations when penalised for multiple testing. A holdout temporal validation approach (--holdout) addresses this: the champion is tested on a reserved 2-year period the grid search never sees, with n_trials=1 (no multiple-testing penalty). The current champion (rb=15, vol_target=10%, max_weight=min(6%, 2/N), Ward+Ledoit-Wolf) achieved Sharpe 0.766 over 10 years of walk-forward equity. See data/outputs/validation_3tier_analysis.md for the complete analysis.
  • Stochastic debate: make decide is non-deterministic. Analyst agents run at temperature=0.2 and the Portfolio Manager at temperature=0.1, so borderline tickers (confidence near the MIN_CONFIDENCE_FOR_ACTION=0.3 gate) can oscillate between BUY and HOLD across reruns. A given decisions.json is a single snapshot of one run, not an average -- tickers with strong signal (e.g. clear uptrend + aligned news) reproduce consistently, borderline tickers do not. For stable production use, either run the debate multiple times and average, or lower the temperature to 0. The Live Debate panel in the War Room is intended for inspecting robustness on specific tickers.
  • RAG grounding depends on news coverage: The Fundamentalist cites articles it finds in ChromaDB. With sparse ticker-level coverage (e.g. 1-3 articles on a given name), the agent will produce ungrounded reports -- sources_cited=[]. The grounding rate in the current run (14/52 = 26.9%) is bounded by the 172-article ChromaDB snapshot used to bootstrap the demo; a production RAG refreshed nightly with make ingest targets ~5 articles per ticker to keep grounded coverage above 70%.
  • No short selling: The system only goes long or flat (no short positions).
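The CPCV-OOS acceptance limitation above turns on how the Deflated Sharpe Ratio penalises the number of trials. The sketch below follows the formulas in Bailey & Lopez de Prado (2014): the expected maximum Sharpe under the null grows with the number of trials, and the observed Sharpe is tested against that benchmark. Sharpe ratios here are per-period (not annualized), kurtosis is the non-excess value (3 for a normal), and treating n_trials=1 as a zero benchmark is an assumption matching the holdout description. The input numbers are hypothetical and are not results from this project.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329
_STD_NORMAL = NormalDist()


def expected_max_sharpe(n_trials: int, sharpe_std: float) -> float:
    """Expected best Sharpe among n_trials skill-less strategies."""
    if n_trials <= 1:
        return 0.0  # single trial: no selection bias to correct for
    a = _STD_NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
    b = _STD_NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    return sharpe_std * ((1.0 - EULER_GAMMA) * a + EULER_GAMMA * b)


def deflated_sharpe(
    sharpe: float,
    n_obs: int,
    n_trials: int,
    sharpe_std: float,
    skew: float = 0.0,
    kurtosis: float = 3.0,
) -> float:
    """Probability that the true Sharpe exceeds the multiple-testing benchmark."""
    benchmark = expected_max_sharpe(n_trials, sharpe_std)
    denom = math.sqrt(1.0 - skew * sharpe + (kurtosis - 1.0) / 4.0 * sharpe**2)
    z = (sharpe - benchmark) * math.sqrt(n_obs - 1) / denom
    return _STD_NORMAL.cdf(z)


# Hypothetical inputs: daily Sharpe 0.05, 2500 observations,
# cross-trial standard deviation of Sharpe estimates 0.02.
single_trial = deflated_sharpe(0.05, 2500, n_trials=1, sharpe_std=0.02)
grid_search = deflated_sharpe(0.05, 2500, n_trials=547, sharpe_std=0.02)
```

With these illustrative inputs the same Sharpe clears a 0.95 threshold as a single trial but falls well short of it once it is treated as the best of 547, which is the effect the holdout validation is designed to sidestep.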

Citations

The system rests on the following peer-reviewed papers and libraries. If you find this project useful for academic work, please cite both Titanium Alpha and the underlying methods it implements.

  • Hierarchical Risk Parity -- Lopez de Prado, M. (2016). Building Diversified Portfolios that Outperform Out of Sample. Journal of Portfolio Management, 42(4), 59-69. Implemented in src/portfolio/hrp.py.
  • Combinatorial Purged Cross-Validation -- Lopez de Prado, M. (2018). Advances in Financial Machine Learning. Wiley. Chapters 7 and 12. Implemented in src/backtest/cpcv.py.
  • Deflated Sharpe Ratio -- Bailey, D. H., & Lopez de Prado, M. (2014). The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality. Journal of Portfolio Management, 40(5), 94-107. Implemented in src/backtest/cpcv_oos.py.
  • PatchTST -- Nie, Y., Nguyen, N. H., Sinthong, P., & Kalagnanam, J. (2023). A Time Series is Worth 64 Words: Long-term Forecasting with Transformers. ICLR 2023. Used via NeuralForecast; wrapped in src/models/patchtst_model.py.
  • Ledoit-Wolf Covariance Shrinkage -- Ledoit, O., & Wolf, M. (2004). A Well-Conditioned Estimator for Large-Dimensional Covariance Matrices. Journal of Multivariate Analysis, 88(2), 365-411. Used via sklearn.covariance.LedoitWolf inside src/portfolio/hrp.py.
  • CDF Rearrangement for monotonic quantile forecasts -- Chernozhukov, V., Fernandez-Val, I., & Galichon, A. (2010). Quantile and Probability Curves Without Crossing. Econometrica, 78(3), 1093-1125. Implemented in src/models/patchtst_model.py::_rearrange_quantiles.
  • Sentence Embeddings for financial RAG -- Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP 2019. Model all-MiniLM-L6-v2 used in src/agents/rag.py.

License

MIT License. See LICENSE for details.

About

Agentic multi-strategy hedge fund: PatchTST forecasts, 4-agent LangGraph debate, CPCV-OOS + DSR validation, HRP with Ledoit-Wolf shrinkage. 10-year OOS Sharpe=0.766.
