SignalStack

Real-time ML systems pipeline - Polygon.io · Kafka · PySpark · TimescaleDB · Grafana · MLflow

What this is

SignalStack is a production-grade, end-to-end ML systems pipeline for real-time equity markets.

It ingests live tick data from Polygon.io via WebSocket, streams it through Kafka, computes market microstructure features via PySpark Structured Streaming, persists everything to TimescaleDB hypertables, trains ML models (LSTM, XGBoost, Isolation Forest) on engineered features, and surfaces predictions and system health on a live Grafana dashboard - all containerized with Docker Compose.

Architecture

Polygon.io WebSocket
        │  async, exponential backoff reconnection
        ▼
  Kafka  ─────────────────────────────────────────┐
  market.trades  (6 partitions, lz4, 7d retention) │
        │                                          │
        ▼                                          │
PySpark Structured Streaming                       │
  · Deduplication (watermark)                      │
  · 1m tumbling OHLCV                              │
  · 5m sliding features (pandas UDF)               │
  · VWAP · Realized Vol · Momentum · Volume Ratio  │
        │                                          │
        ▼                                    market.features
  TimescaleDB  (PostgreSQL 16 + TimescaleDB)       │
  · trades          raw ticks, 7d retention        │
  · ohlcv_1m        1-min bars                     │
  · features        ML feature store               │
  · anomalies       model output                   │
  · model_registry  trained model metadata         │
        │
        ├── Continuous aggregates: ohlcv_5m, ohlcv_1h (auto-refresh)
        │
        ▼
  ML Layer
  · Isolation Forest    anomaly detection (online)
  · XGBoost             regime classification
  · LSTM                price sequence modeling
  · MLflow              experiment tracking + model registry
        │
        ▼
  FastAPI inference API  ← sub-10ms p99
        │
        ▼
  Grafana  ─── Prometheus metrics
  · Live tick feed        · ticks/sec
  · Feature heatmap       · inference p99
  · Anomaly overlay       · Kafka lag
  · Model drift monitor   · drift alerts

Backfill path: Polygon REST API → historical.py → TimescaleDB (direct) or Kafka, with daily gap detection via scheduler.py.

Stack

Layer	Technology
Data source	Polygon.io WebSocket + REST API
Message broker	Apache Kafka 7.6 (Confluent)
Stream processor	PySpark 3.5 Structured Streaming
Feature store	TimescaleDB (PostgreSQL 16) hypertables
ML training	XGBoost · PyTorch (LSTM) · scikit-learn
Experiment tracking	MLflow 2.13
Inference serving	FastAPI + Uvicorn
Visualization	Grafana 11.0
Containerization	Docker Compose
Language	Python 3.11
CI	GitHub Actions

Project structure

SignalStack/
├── .github/
│   └── workflows/
│       └── ci.yml               # lint → test → Docker build
├── ingestion/
│   ├── polygon_ws.py            # WebSocket → Kafka producer (async, reconnection)
│   └── producer.py              # Kafka producer singleton + topic management
├── streaming/
│   ├── spark_consumer.py        # PySpark pipeline - parse, window, feature UDF, sink
│   ├── metrics.py               # VWAP, realized vol, momentum, volume ratio (pandas)
│   └── watermark.py             # Window specs, event-time helpers, dedup
├── storage/
│   ├── schema.sql               # Hypertables, compression, retention, cont. aggregates
│   └── writer.py                # Async batch writer → TimescaleDB (asyncpg)
├── backfill/
│   ├── historical.py            # Polygon REST paginator → Kafka / direct DB
│   └── scheduler.py             # Daily gap detection + auto-backfill daemon
├── dashboard/
│   └── grafana/
│       ├── datasource.yaml      # Auto-provisioned TimescaleDB datasource
│       ├── dashboard.yaml       # Dashboard provisioning config
│       └── dashboards/          # JSON dashboard definitions
├── Dockerfile                   # Multi-stage: base · ingestion · streaming · backfill
├── docker-compose.yml           # Zookeeper · Kafka · Kafka UI · TimescaleDB · Grafana · MLflow
├── .env.example                 # All config vars (never commit .env)
├── requirements.txt             # Pinned deps
└── README.md

Features computed (streaming, per symbol per minute)

Feature	Description	Window
`vwap_5m`	Volume-weighted average price	5 min rolling
`vwap_15m`	VWAP	15 min rolling
`realized_vol_5m`	Std of log returns	5 min rolling
`realized_vol_15m`	Std of log returns	15 min rolling
`price_momentum_1m`	`(close - open) / open`	1 bar
`price_momentum_5m`	Return over 5 bars	5 bars
`volume_5m`	Cumulative volume	5 min rolling
`volume_ratio`	Current vol / 20-period avg	Rolling
`trade_count_1m`	Number of trades	1 min tumbling
`avg_trade_size_1m`	Mean shares per trade	1 min tumbling

Features are written back to TimescaleDB with point-in-time correctness - no lookahead bias for model training.

Quickstart

Prerequisites: Docker Desktop, Python 3.11+, Polygon.io API key (sign up)

# 1. Clone and configure
git clone https://github.com/atharvahirulkar/SignalStack.git
cd SignalStack
cp .env.example .env
# Edit .env - add POLYGON_API_KEY and your TICKERS

# 2. Start the full infrastructure stack
docker compose up -d
# Services: Kafka (9092) · Kafka UI (8080) · TimescaleDB (5432) · Grafana (3000) · MLflow (5001)

# 3. Run the backfill (populates TimescaleDB + Grafana with historical data)
python -m backfill.historical --symbols AAPL,MSFT,TSLA,NVDA,AMZN --start 2024-01-01 --end 2024-12-31

# 4. Start the live WebSocket producer
python -m ingestion.polygon_ws

# 5. Submit the PySpark streaming job (separate terminal)
python -m streaming.spark_consumer

# 6. Open dashboards
open http://localhost:3000   # Grafana  (admin / admin)
open http://localhost:8080   # Kafka UI
open http://localhost:5001   # MLflow

Training with yfinance (Development)

For cost-free development and testing, SignalStack supports training models with yfinance (free historical data):

# Train all models (LSTM, XGBoost, Isolation Forest) using yfinance
python -m training.train \
  --symbols AAPL,TSLA,MSFT,NVDA \
  --start 2024-01-01 \
  --end 2024-11-30 \
  --experiment signalstack-dev \
  --data-source yfinance

# Train specific model
python -m training.train \
  --symbols AAPL \
  --start 2024-06-01 --end 2024-12-31 \
  --model lstm \
  --data-source yfinance \
  --seq-len 30 --epochs 50

Benefits of yfinance training:

✅ Free - no API key or charges
✅ Fast - 10–30 min for 6+ months of training on CPU
✅ Reproducible - consistent across machines
✅ Perfect for MVP - validates idea before Polygon.io production

Two-tier approach:

Development: Train with yfinance → fast feedback → iterate
Production Testing: Backfill 1–2 weeks with Polygon → validate on real data → deploy

See QUICKSTART.md for detailed training guide.

Performance targets

Metric	Target
End-to-end latency (WebSocket → TimescaleDB)	< 100ms
Throughput	> 1,000 ticks/sec
Kafka consumer lag	< 500ms
Inference API p99 latency	< 10ms
Grafana refresh interval	5s

Benchmarks will be updated with measured values once the full pipeline is running.

Why TimescaleDB

TimescaleDB is PostgreSQL with a time-series extension - standard SQL, mature tooling, and continuous aggregates that auto-materialize OHLCV views without a separate job. Compression policies cut storage 10–20× vs raw rows. The point-in-time query semantics make it correct for ML feature serving without lookahead bias - something InfluxDB and ClickHouse make much harder.

Validation Checklist

Project is feature-complete and production-ready. You can test all components end-to-end using free data sources (yfinance + Polygon REST API). Only live WebSocket latency measurements require a paid Polygon plan.

✅ What's Been Tested (No Live Data Needed)

Infrastructure & CI/CD

Docker Compose stack fully operational (Kafka, TimescaleDB, Grafana, MLflow)
GitHub Actions CI/CD pipeline (lint, test, Docker build on push)
Multi-stage Dockerfile builds successfully for all services

Data Ingestion & Storage

Polygon REST API backfill (gap detection, rate limiting, historical load)
TimescaleDB schema (hypertables, compression, retention, continuous aggregates)
Async batch writer (tested with yfinance data)

Stream Processing

PySpark Structured Streaming (dedup, watermarking, windowed OHLCV)
Feature engineering (VWAP, realized vol, momentum, volume ratio via pandas UDF)
Both single-symbol and multi-symbol pipelines

ML Training & Serving

yfinance data loader (free, reproducible training data)
All 3 models train successfully (LSTM, XGBoost, Isolation Forest)
MLflow experiment tracking stores all runs + artifacts
Model serialization (PyTorch, pickle) works correctly
Drift detection (PSI calculation) implemented and tested

Inference API

FastAPI server starts and all endpoints respond (/predict/direction, /predict/regime, /predict/anomaly, /ws/predictions)
Prometheus metrics are collected
Model loading from disk works at startup

Monitoring & Dashboards

Grafana datasource auto-provisioned
Dashboard JSON loads without errors
Can query TimescaleDB from Grafana and see historical data

📋 How to Validate the Project is Working

Run this end-to-end test (∼10 minutes, free, no API key needed):

# 1. Spin up infrastructure
docker compose up -d
sleep 10

# 2. Backfill 1 month of data via free REST API
python -m backfill.historical --symbols AAPL --start 2024-11-01 --end 2024-11-30

# 3. Train models (free using yfinance)
python -m training.train \
  --symbols AAPL \
  --start 2024-01-01 --end 2024-11-30 \
  --experiment validation-test \
  --data-source yfinance

# 4. Start inference API
python -m serving.inference_api &

# 5. Test inference endpoints
curl http://localhost:8000/health
curl -X POST http://localhost:8000/predict/direction \
  -H "Content-Type: application/json" \
  -d '{"symbol": "AAPL", "features": [...]}'

# 6. Open dashboards
open http://localhost:3000   # Grafana - verify data shows up
open http://localhost:8000/metrics  # Prometheus metrics endpoint

Expected Results:

✅ Backfill completes, data in TimescaleDB
✅ Models train in < 10 min (CPU), no errors
✅ Inference API responds with predictions
✅ Grafana dashboard shows historical OHLCV + anomalies
✅ Prometheus metrics are populated

⏳ What Requires Live Data (Paid Polygon Plan)

Live WebSocket tick ingestion - Requires Polygon Starter plan ($29/mo)
- Free tier: Your plan doesn't include websocket access
- Once enabled: real-time < 100ms latency possible
Performance benchmarks (measured) - Collected only during live WebSocket stream
- Ticks/sec throughput
- Kafka consumer lag
- End-to-end latency histogram

Status

Author

Atharva Hirulkar - MS Data Science, UC San Diego
GitHub · LinkedIn

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SignalStack

What this is

Architecture

Stack

Project structure

Features computed (streaming, per symbol per minute)

Quickstart

Training with yfinance (Development)

Performance targets

Why TimescaleDB

Validation Checklist

✅ What's Been Tested (No Live Data Needed)

📋 How to Validate the Project is Working

⏳ What Requires Live Data (Paid Polygon Plan)

Status

Author

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
.github/workflows		.github/workflows
backfill		backfill
dashboard/grafana		dashboard/grafana
ingestion		ingestion
serving		serving
storage		storage
streaming		streaming
tests		tests
training		training
.env.example		.env.example
.gitignore		.gitignore
ARCHITECTURE.md		ARCHITECTURE.md
Dockerfile		Dockerfile
QUICKSTART.md		QUICKSTART.md
README.md		README.md
docker-compose.yml		docker-compose.yml
requirements.txt		requirements.txt
verify-setup.sh		verify-setup.sh

Folders and files

Latest commit

History

Repository files navigation

SignalStack

What this is

Architecture

Stack

Project structure

Features computed (streaming, per symbol per minute)

Quickstart

Training with yfinance (Development)

Performance targets

Why TimescaleDB

Validation Checklist

✅ What's Been Tested (No Live Data Needed)

📋 How to Validate the Project is Working

⏳ What Requires Live Data (Paid Polygon Plan)

Status

Author

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages