Quantarded is a personal research project I ran to test a simple hypothesis: public, messy, unstructured data sources leak enough signal to systematically beat a broad index — if you process them without the emotional bias humans bring to markets.
This repository is the data pipeline that powers the experiment. It ingests three signal sources in parallel — Reddit (r/wallstreetbets), US congressional trade disclosures (via Quiver Quant), and contrarian indicators — normalizes them into a unified event schema, classifies the unstructured content with an LLM, and streams everything into Tinybird for analytics.
The companion site at quantarded.com publishes a weekly trading basket derived from the data, and tracks live performance against the NASDAQ.
Important: This is an educational research project. Results below are hypothetical (paper trading on public signals), not investment advice, and past performance does not predict future returns.
The experiment has been running continuously since week 51 of 2025 (late December 2025), publishing one signal-driven trading basket per week. As of early May 2026:
| Metric | Value |
|---|---|
| Cumulative return | +32.84% |
| Edge vs NASDAQ | +23.59 pp |
| Max drawdown | −9.34% |
| Sharpe ratio (annualized) | 1.77 |
| Weeks running | 20+ |
Live numbers, position history, and the weekly newsletter are at quantarded.com. The intent of publishing in the open was to make the experiment falsifiable: every signal, every entry, and every loss is timestamped and public.
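For readers unfamiliar with the metrics in the table, the sketch below shows one common way to compute them from a series of weekly returns. It is not the code behind quantarded.com: the function names, the zero risk-free rate, and the 52-week annualization factor are all assumptions made for illustration.

```typescript
// Hypothetical helpers. Inputs are simple weekly returns (0.02 means +2%).

/** Compound weekly returns into a single cumulative return. */
function cumulativeReturn(weekly: number[]): number {
  return weekly.reduce((equity, r) => equity * (1 + r), 1) - 1;
}

/** Largest peak-to-trough decline of the equity curve, as a negative number. */
function maxDrawdown(weekly: number[]): number {
  let equity = 1;
  let peak = 1;
  let worst = 0;
  for (const r of weekly) {
    equity *= 1 + r;
    peak = Math.max(peak, equity);
    worst = Math.min(worst, equity / peak - 1);
  }
  return worst;
}

/** Annualized Sharpe ratio, assuming a zero risk-free rate and 52 weeks per year. */
function annualizedSharpe(weekly: number[]): number {
  const n = weekly.length;
  if (n < 2) return NaN; // sample standard deviation needs at least two points
  const mean = weekly.reduce((sum, r) => sum + r, 0) / n;
  const variance = weekly.reduce((sum, r) => sum + (r - mean) ** 2, 0) / (n - 1);
  return (mean / Math.sqrt(variance)) * Math.sqrt(52);
}
```

Under these definitions, "edge vs NASDAQ" in percentage points would simply be the basket's cumulative return minus the index's cumulative return over the same weeks.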
A few things became obvious only by running this end-to-end for several months:
- Breadth beats depth. Baskets where many independent signals pointed the same direction were structurally more stable than baskets with one large conviction trade. Concentrated bets won the biggest weeks and lost the biggest weeks; broad consensus was less spectacular but compounded.
- Visibility is not conviction. Tickers like TSLA and NVDA dominated raw mention counts on WSB, but sentiment was so divided that they rarely cleared the imbalance threshold. The signals worth trading were almost never the loudest.
- Congressional trades are slow, not useless. Periodic transaction reports are stale by the time they're public (the STOCK Act allows up to 45 days between a trade and its disclosure), but clustered, repeat purchases by the same representative over multiple weeks did indicate position-building worth tracking on a longer horizon.
- LLMs are cheap precision filters. A naive regex over r/wallstreetbets produces thousands of false positives (every "FOR", "ALL", "ON" gets flagged as a ticker). A constrained prompt with high-precision rules cuts that to a usable signal at fractions of a cent per request.
- Ship to a real warehouse from day one. Writing every event to Tinybird from the first commit meant I could answer "what was the signal on April 3rd?" months later without re-running anything. The instinct to write to JSONL files and "figure out storage later" would have killed the project.
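The false-positive problem with a naive ticker regex is easy to reproduce. The sketch below is illustrative only: the pattern, the `naiveTickers` helper, and the sample sentence are assumptions, not code from this repository.

```typescript
// Naive approach: treat every standalone run of 2-5 uppercase letters as a ticker.
const NAIVE_TICKER = /\b[A-Z]{2,5}\b/g;

function naiveTickers(text: string): string[] {
  return text.match(NAIVE_TICKER) ?? [];
}

const comment = "ALL my savings ON TSLA calls FOR earnings";
console.log(naiveTickers(comment));
// → [ 'ALL', 'ON', 'TSLA', 'FOR' ]
// One intended ticker (TSLA) and three ordinary words flagged alongside it.
```

Checking matches against a list of listed symbols does not fully solve this, because strings such as ALL and ON are themselves valid symbols. Telling the word from the ticker takes context, which is the part delegated to the LLM.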
```text
                     ┌──────────────────────────────────┐
                     │             SOURCES              │
                     ├──────────────────────────────────┤
  ┌──────────────────┤    Reddit · Quiver Quant API     │
  │                  └──────────────────────────────────┘
  │
  │   ┌─────────────┐    ┌──────────────┐    ┌──────────────────┐    ┌───────────┐
  └──▶│   Fetch     │───▶│  Normalize   │───▶│  LLM classify    │───▶│ Tinybird  │
      │ (paginate,  │    │ (event       │    │ (Reddit only:    │    │ (events + │
      │  rate-lim,  │    │  schema,     │    │  ticker +        │    │  job_runs │
      │  proxy)     │    │  dedupe)     │    │  sentiment)      │    │  tables)  │
      └─────────────┘    └──────────────┘    └──────────────────┘    └───────────┘
```
Two scrapers, one container, shared event schema:
| Scraper | Source | Schedule | Why LLM? |
|---|---|---|---|
| `reddit-scraper` | r/wallstreetbets submissions + comments | Every 5 min, 15-min window | Yes: content is unstructured prose |
| `quiver-scraper` | US House congressional trades | Cron (default 6h) | No: tickers are structured fields |
Both scrapers emit NormalizedEvent records to the same Tinybird events_landing data source, so downstream analytics queries the same table regardless of source. Job execution metrics (success, duration, counts, errors) land in a separate job_runs data source for observability.
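As a rough illustration of the job-run records described above, the sketch below wraps a scraper run and captures success, duration, counts, and errors. The `JobRun` field names and the `runJob` helper are hypothetical; the actual `job_runs` schema is defined by the datasource file in `tinybird/datasources/` and may differ.

```typescript
// Hypothetical record shape for one scraper execution.
interface JobRun {
  job: string;
  started_at: string; // ISO 8601
  duration_ms: number;
  success: boolean;
  event_count: number;
  error: string | null;
}

/**
 * Run `task` (which resolves to the number of events it emitted) and
 * return a JobRun record whether it succeeds or throws.
 */
async function runJob(job: string, task: () => Promise<number>): Promise<JobRun> {
  const started = Date.now();
  let eventCount = 0;
  let error: string | null = null;
  try {
    eventCount = await task();
  } catch (err) {
    // Record the failure instead of rethrowing, so the run is still observable.
    error = err instanceof Error ? err.message : String(err);
  }
  return {
    job,
    started_at: new Date(started).toISOString(),
    duration_ms: Date.now() - started,
    success: error === null,
    event_count: eventCount,
    error,
  };
}
```

Capturing failures as records rather than letting them crash the process means a failed run still shows up in the observability table.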
- Single normalized event schema means new sources can be added without touching the warehouse layer. The `payload` field is intentionally untyped JSON so each source can preserve its native fields verbatim.
- Deterministic event IDs (SHA-256 of the natural key) make ingestion idempotent: re-running the scraper on the same window produces identical event IDs, so duplicates are dropped at the warehouse.
- Parallel LLM batches with bounded concurrency keep latency low without thrashing rate limits. Empirically, 3 concurrent batches of 50 items each was the sweet spot for `gpt-4o-mini`.
- No queue, no broker, no orchestrator. It's a single Node process running in a container with a shell loop. It is the simplest thing that could work, and for ~€5/month on Hetzner, it does.
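The batching pattern from the list above can be sketched as follows. This is an assumption about the general shape of such code, not the contents of `classify.ts`: `chunk` and `mapWithConcurrency` are hypothetical names.

```typescript
/** Split items into batches of at most `size`. */
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

/**
 * Apply `fn` to every input with at most `limit` calls in flight.
 * Results are returned in input order.
 */
async function mapWithConcurrency<T, R>(
  inputs: T[],
  limit: number,
  fn: (input: T) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(inputs.length);
  let next = 0;

  // Each worker pulls the next unclaimed index until none remain.
  async function worker(): Promise<void> {
    while (next < inputs.length) {
      const index = next++; // safe: no await between the check and the increment
      results[index] = await fn(inputs[index]);
    }
  }

  const workers = Array.from({ length: Math.min(limit, inputs.length) }, () => worker());
  await Promise.all(workers);
  return results;
}
```

With the numbers quoted above, a call might look like `mapWithConcurrency(chunk(items, 50), 3, classifyBatch)`, where `classifyBatch` stands in for whatever function sends one batch to the LLM.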
The scoring algorithms that turn raw events into a weekly basket live in doc/algorithm/.
| Key | Where to get it |
|---|---|
| `LLM_API_KEY` | OpenAI (platform.openai.com/api-keys) |
| `TINYBIRD_TOKEN` | Tinybird (tinybird.co) |
| `QUIVER_API_KEY` | Quiver Quant, Hobbyist+ (api.quiverquant.com) |
```bash
tb push tinybird/datasources/events_landing__v0.datasource
tb push tinybird/datasources/job_runs__v0.datasource
```

```bash
npm install
cp .env.example .env     # fill in API keys
npm run reddit:scrape    # one-shot Reddit scrape
npm run quiver:scrape    # one-shot Quiver scrape
```

In development (default NODE_ENV), events are also written to tmp/*.jsonl for inspection. In production, only Tinybird receives them.
```bash
cp docker-compose.yml.example docker-compose.yml
# edit env vars
docker compose up --build
```

The container runs both scrapers on their own schedules (Reddit every 5 minutes, Quiver on a cron expression) and restarts automatically on failure.
All configuration is environment-variable driven. See .env.example for the full list with defaults; the most important ones:
| Variable | Default | Notes |
|---|---|---|
| `REDDIT_SCRAPER_ENABLED` | `true` | Toggle the Reddit scraper |
| `TIME_WINDOW_MINUTES` | `15` | How far back each Reddit run looks |
| `SCRAPER_INTERVAL_MINUTES` | `5` | How often Reddit runs (Docker) |
| `CLASSIFY_BATCH_SIZE` | `50` | Items per LLM call |
| `CLASSIFY_CONCURRENCY` | `3` | Parallel LLM batches |
| `MIN_CONTENT_LENGTH` | `10` | Skip items shorter than this (saves tokens) |
| `MAX_CONTENT_LENGTH` | `2000` | Truncate longer items (caps tokens) |
| `LLM_MODEL` | `gpt-4o-mini` | Any OpenAI-compatible chat model |
| `QUIVER_SCRAPER_ENABLED` | `false` | Toggle the Quiver scraper |
| `QUIVER_SCRAPER_CRON` | `0 */6 * * *` | Standard 5-field cron expression |
Reddit's API caps listings at 1,000 items per endpoint. A 15-minute window with a 5-minute interval comfortably fits inside that limit during peak WSB hours.
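A minimal sketch of env-driven configuration using the defaults listed above, assuming a hypothetical `loadConfig` helper (the real `config.ts` may be structured differently). It also makes the window overlap explicit: with a 15-minute window and a 5-minute interval, each item is normally covered by about three consecutive runs, which is harmless because event IDs are deterministic.

```typescript
// Hypothetical config shape covering a subset of the documented variables.
interface ScraperConfig {
  redditEnabled: boolean;
  timeWindowMinutes: number;
  scraperIntervalMinutes: number;
  classifyBatchSize: number;
  classifyConcurrency: number;
  llmModel: string;
}

type Env = Record<string, string | undefined>;

/** Parse an integer variable, falling back to the default when unset or blank. */
function intFromEnv(env: Env, key: string, fallback: number): number {
  const raw = env[key];
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number.parseInt(raw, 10);
  if (Number.isNaN(value)) {
    throw new Error(`${key} must be an integer, got "${raw}"`);
  }
  return value;
}

/** Treat only the string "true" (any case) as true; unset uses the default. */
function boolFromEnv(env: Env, key: string, fallback: boolean): boolean {
  const raw = env[key];
  return raw === undefined ? fallback : raw.toLowerCase() === "true";
}

function loadConfig(env: Env): ScraperConfig {
  return {
    redditEnabled: boolFromEnv(env, "REDDIT_SCRAPER_ENABLED", true),
    timeWindowMinutes: intFromEnv(env, "TIME_WINDOW_MINUTES", 15),
    scraperIntervalMinutes: intFromEnv(env, "SCRAPER_INTERVAL_MINUTES", 5),
    classifyBatchSize: intFromEnv(env, "CLASSIFY_BATCH_SIZE", 50),
    classifyConcurrency: intFromEnv(env, "CLASSIFY_CONCURRENCY", 3),
    llmModel: env["LLM_MODEL"] ?? "gpt-4o-mini",
  };
}

/** Number of consecutive runs whose look-back window covers the same item. */
function runsPerItem(config: ScraperConfig): number {
  return Math.ceil(config.timeWindowMinutes / config.scraperIntervalMinutes);
}
```

In a Node process this would be called as `loadConfig(process.env)`. Failing fast on a malformed integer is a design choice of the sketch, not a documented behavior of the project.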
A single shape covers every source. payload is intentionally loose so each source preserves its native fields.
For congressional trades, payload carries the full Quiver row verbatim plus an ingested_at timestamp, so any new field Quiver adds is captured without a schema change.
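In TypeScript terms, the event shape and the deterministic ID could look roughly like the sketch below. The field names follow the example record in this README; the `makeEventId` and `normalizeTrade` helpers, and the choice of natural key (event type plus a source-specific identifier), are assumptions, since the actual key is not documented here.

```typescript
import { createHash } from "node:crypto";

type EventType = "reddit_comment" | "reddit_submission" | "congressional_trade";

interface NormalizedEvent {
  event_type: EventType;
  event_id: string; // SHA-256 hex digest of the natural key
  source: string; // e.g. "wsb" or "quiver-daily"
  timestamp: string; // ISO 8601, UTC
  version: string;
  payload: Record<string, unknown>; // untyped: each source keeps its native fields
}

/** Same inputs always give the same ID, which is what makes re-ingestion idempotent. */
function makeEventId(eventType: EventType, naturalKey: string): string {
  return createHash("sha256").update(`${eventType}:${naturalKey}`).digest("hex");
}

/** Wrap a raw trade row verbatim, adding only an ingestion timestamp. */
function normalizeTrade(
  row: Record<string, unknown>,
  naturalKey: string, // hypothetical: whatever uniquely identifies the trade
  tradedAt: string,
): NormalizedEvent {
  return {
    event_type: "congressional_trade",
    event_id: makeEventId("congressional_trade", naturalKey),
    source: "quiver-daily",
    timestamp: tradedAt,
    version: "1",
    payload: { ...row, ingested_at: new Date().toISOString() },
  };
}
```

Note that `ingested_at` sits inside the payload and is deliberately excluded from the hash input; otherwise every re-run would produce a new ID and deduplication would stop working.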
```text
src/
├── lib/                      # Domain modules
│   ├── reddit.ts             # Reddit API client (paginated)
│   ├── quiver.ts             # Quiver API client (paginated, rate-limited)
│   ├── normalize.ts          # Reddit → NormalizedEvent
│   ├── normalize-quiver.ts   # Quiver trade → NormalizedEvent
│   ├── classify.ts           # LLM ticker + sentiment extraction
│   ├── tinybird.ts           # Tinybird ingestion client
│   └── job-runner.ts         # Shared job lifecycle utilities
├── scripts/                  # CLI entry points (one per scraper)
├── utils/                    # HTTP, hashing, date helpers
├── config.ts                 # Env-driven configuration
└── types.ts                  # Shared TypeScript types
tinybird/                     # Tinybird datasources & pipes
doc/algorithm/                # Algorithm design docs (versioned)
infra/                        # Terraform + Hetzner deploy scripts
bin/docker-entrypoint.sh      # Container scheduler (both scrapers)
.github/workflows/            # CI (lint + typecheck) + CD (GHCR + deploy)
```
```bash
npm run lint            # ESLint
npm run lint:fix        # ESLint with --fix
npm run format          # Prettier write
npm run format:check    # Prettier check (used in CI)
```

A Husky pre-commit hook runs Prettier, ESLint --fix, and tsc --noEmit on staged files. CI runs the same checks on every push.
The infra/ directory contains a Terraform setup that provisions a single Hetzner Cloud VM and a GitHub Actions pipeline that builds the Docker image, pushes it to GHCR, and deploys on every tagged release. See infra/README.md for details.
The whole production setup costs ~€5/month. The point isn't that this is the right way to deploy a serious system — it's the minimum viable infrastructure that runs the experiment reliably enough to publish weekly results.
MIT — use it, fork it, learn from it.
This is a personal research project for educational purposes. Nothing in this repository or on quantarded.com constitutes financial advice. The published returns are hypothetical and based on paper-trading public signals; they should not be interpreted as a recommendation or as evidence of future performance. Trade your own money at your own risk.


Example `NormalizedEvent` record:

```jsonc
{
  "event_type": "reddit_comment",   // or reddit_submission, congressional_trade
  "event_id": "<sha256 of natural key>",
  "source": "wsb",                  // or quiver-daily
  "timestamp": "2025-12-16T17:58:35Z",
  "version": "1",
  "payload": {
    "reddit_link": "...",
    "content": "...",
    "tickers": [
      { "ticker": "TSLA", "sentiment": "sell", "confidence": 0.85 }
    ]
  }
}
```