Predict your Hacker News virality before you post. A pixel-faithful HN clone with a calibrated LightGBM predictor, a Gemini-grounded comment simulator, an auto-improving title rewriter, and a live calibration ledger that scores HN's actual front page every 10 minutes.
Live at hackernews.foresyn.ai · /predictions for the calibration ledger · /about for the model card
The site's /about page reads these from src/lib/hackernews/model/v1_metrics.json at build time — they cannot drift from training.
| Metric | Value | Reference baseline |
|---|---|---|
| Spearman ρ on log-score (holdout) | 0.33 | — |
| MAE on log-score (holdout) | 1.65 | ≈ 5.2× off in raw points (typical) |
| AUC for "score ≥ 100" (holdout) | 0.67 | ontology2 2014 LR: 0.77 · Dubach 2024-25 BERT: 0.65-0.69 |
| Precision@30 (holdout) | 0.83 | — |
| Training corpus | 148,400 stories | Algolia HN search API, chronological split |
| Inference latency (p50, warm) | ~280 ms | Gemini embedding dominates |
ρ caps around 0.4-0.6 in this domain. Early-vote stochasticity bounds how much of HN's actual scoring is predictable from title + URL + time. Anyone claiming much above that is either testing on a leaky split or sampling from a non-random slice.
- Submit a draft — paste a title + optional URL + optional body.
- Get a calibrated score — virality 0-99, raw HN points estimate, p10/p90 interval, front-page probability.
- See evidence — top-5 cosine-nearest historical hits the predictor used as comparables.
- See the takedowns — five Gemini-simulated comments grounded in five high-scoring kNN siblings (the-skeptic, the-pedant, the-tangent, the-supportive, the-correction).
- Auto-improve — title rewriter generates variants, scores each, keeps the climbers; you watch the hill-climb in real time.
- Check live calibration — /predictions re-scores HN's actual top 30 every 10 minutes and publishes the predicted-vs-actual delta. No quiet inflation — you can audit any wrong call.
- Drill into one story — /predictions/story/[id] shows the per-snapshot timeline (predicted line vs actual line, plus the verdict).
Create a project (free tier is fine). pgvector ≥ 0.7 (for halfvec and binary quantization) and pg_trgm need to be available — both are in default Supabase Postgres 15+.
Apply the schema:
# Option A: Supabase CLI
supabase link --project-ref YOUR-PROJECT-REF
supabase db push
# Option B: copy-paste supabase/migrations/0001_init.sql into the SQL editor.

The migration creates 8 tables (hn_items, hn_item_embeddings, hn_user_submissions, hn_frontpage_snapshots, hn_comments, hn_comments_sim_cache, hn_rewrites_cache, hn_predictions_audit) and 2 RPCs (hn_search_items_by_embedding, hn_user_crossovers).
cp .env.example .env.local

Fill in:
| Variable | What it's for |
|---|---|
| SUPABASE_URL | Project URL (Supabase dashboard → Project Settings → API) |
| SUPABASE_SERVICE_KEY | Service-role key (same page, server-side only — never ship to browser) |
| GEMINI_API_KEY | https://aistudio.google.com/app/apikey — embeddings + rewriter + comments |
| UPSTASH_REDIS_* | Optional. Distributed rate limit. If blank, falls back to in-memory. |
| NEXT_PUBLIC_POSTHOG_KEY | Optional. Funnel analytics. Page works fine without it. |
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt # or: supabase requests numpy pandas aiohttp
python scripts/ingest_algolia.py --min-score 5 --limit 50000

The script is idempotent on (id) so you can stop / resume / cron it. ~50K rows fits comfortably in the Supabase free tier. Push for 148K+ if you have the budget.
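The resumable paging-and-upsert loop might look like the sketch below. The search_by_date endpoint, tags, numericFilters, and hitsPerPage parameters are the real Algolia HN Search API; upsert_items is a hypothetical stand-in for the Supabase write, and the cursor walk is one plausible way to make resumption idempotent.

```python
"""Sketch of a resumable Algolia ingest (the real logic lives in
scripts/ingest_algolia.py; upsert_items is a hypothetical stand-in)."""
import json
import urllib.parse
import urllib.request

API = "https://hn.algolia.com/api/v1/search_by_date"

def page_url(min_score: int, before_ts: int) -> str:
    """One page of stories older than before_ts with >= min_score points."""
    params = {
        "tags": "story",
        "numericFilters": f"points>={min_score},created_at_i<{before_ts}",
        "hitsPerPage": 1000,
    }
    return f"{API}?{urllib.parse.urlencode(params)}"

def ingest(min_score, limit, upsert_items, now_ts):
    """Walk backwards in time until `limit` rows are ingested."""
    cursor, total = now_ts, 0
    while total < limit:
        with urllib.request.urlopen(page_url(min_score, cursor)) as resp:
            hits = json.load(resp)["hits"]
        if not hits:
            break
        upsert_items(hits)  # e.g. ON CONFLICT (id) DO UPDATE -> idempotent
        cursor = min(h["created_at_i"] for h in hits)  # move cursor back
        total += len(hits)
    return total
```

Because the cursor is derived from the data itself, killing and restarting the script only re-fetches (and harmlessly re-upserts) the last page.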
python scripts/embed_items.py --batch-size 64

Costs ~$1 per 10K rows on Gemini text-embedding-004. Stored as halfvec(3072) (half-precision pgvector) plus a bit(3072) binary quantization for the HNSW prefilter.
pip install lightgbm scikit-learn m2cgen
python scripts/train_predictor_v2.py --output-dir src/lib/hackernews/model

Trains four heads (median regressor, p10/p90 quantile regressors, front-page binary classifier), exports them to pure JS via m2cgen, and writes v1_metrics.json (holdout ρ + MAE + AUC + P@30) and v1_features.json (the feature schema). The site's /about page reads these at build time and shows the numbers verbatim — you cannot quietly inflate them.
Note: the repo ships with a working reference build of the model JS files already in src/lib/hackernews/model/. You can skip step 5 if you just want to play with the engine end-to-end before training your own.
npm install
npm run dev
# → http://localhost:3000 (redirects to /hackernews)

/predictions and /predictions/story/[id] stay useful only if something is writing to hn_frontpage_snapshots on a cadence. Two options:
- Vercel Cron Job: add a POST /api/hackernews/score-frontpage route (~30 lines: pull HN's top 30 via the Firebase API, call the same predictor, insert into hn_frontpage_snapshots), wire it into vercel.json with */10 * * * *. Vercel Hobby allows daily crons only; bump to Pro or use Fleet for a 10-minute cadence.
- systemd timer on any VM: same script, scheduled with a .timer unit. Free, reliable.
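A minimal sketch of that snapshot job, suitable for either scheduling option. The Firebase URLs are HN's real public API; the predict callable and the row shape are stand-ins for this repo's predict endpoint and hn_frontpage_snapshots schema.

```python
"""Sketch of the ~30-line frontpage snapshot job. The Firebase endpoints
are HN's public API; `predict` stands in for POST /api/hackernews/predict
and the row dict approximates the hn_frontpage_snapshots schema."""
import json
import urllib.request

HN = "https://hacker-news.firebaseio.com/v0"

def fetch_json(url):
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)

def snapshot_row(item, predicted):
    """One snapshot row: predicted vs actual score at this instant."""
    actual = item.get("score", 0)
    return {
        "item_id": item["id"],
        "title": item["title"],
        "actual_score": actual,
        "predicted_score": predicted,
        "delta": predicted - actual,
    }

def score_frontpage(predict):
    """predict(title, url) -> points estimate. Returns 30 snapshot rows."""
    top30 = fetch_json(f"{HN}/topstories.json")[:30]
    rows = []
    for item_id in top30:
        item = fetch_json(f"{HN}/item/{item_id}.json")
        rows.append(snapshot_row(item, predict(item["title"], item.get("url", ""))))
    return rows  # insert these into hn_frontpage_snapshots
```

Run it from a cron or .timer unit every 10 minutes and the /predictions ledger stays live.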
src/
├── pages/hackernews/ # Next.js Pages-router routes
│ ├── index.tsx # front page + submit form
│ ├── submit.tsx # bare submit page
│ ├── item/[id].tsx # one draft: score + rewrites + comment sim + edit
│ ├── news.tsx + news/[page].tsx # chronological feed
│ ├── leaderboard.tsx # top drafts by predicted virality
│ ├── predictions.tsx # live HN front page scored every 10 min
│ ├── predictions/story/[item_id] # per-story predicted-vs-actual timeline
│ └── about.tsx # model card + holdout metrics
├── pages/api/hackernews/ # API routes
│ ├── predict.ts # POST title+url → prediction (the core endpoint)
│ ├── rewrites.ts # POST title → 3 LLM rewrites scored back through predict
│ ├── auto-improve.ts # SSE: multi-iteration hill-climb on rewrites
│ ├── comments-simulator.ts # SSE: 5 archetypal comments grounded in kNN siblings
│ ├── submissions/[id].ts # PATCH submission (edit & rescore)
│ ├── items/[id].ts # GET item detail
│ ├── og.tsx # Open Graph card renderer (@vercel/og)
│ └── {robots,sitemap,llms}.ts # SEO surface
├── components/hackernews/ # 16 React components (HN-faithful Verdana 10pt aesthetic)
├── lib/hackernews/
│ ├── server/ # Server-only: Supabase, predictLgbm, kNN, rate limit, gemini
│ ├── client/ # Client-only: analytics, share helpers, time-ago
│ └── model/ # LightGBM heads as plain JS + feature schema + metrics
└── styles/hackernews.module.css
scripts/ # Python — corpus ingest, embeddings, training
├── ingest_algolia.py # pull HN stories from public Algolia API → hn_items
├── embed_items.py # embed titles+bodies via Gemini → hn_item_embeddings
├── embed_service.py # optional sentence-transformers sidecar (FastAPI)
├── deploy_embed_service.sh
├── train_predictor.py # v1 (deprecated, kept for reference)
├── train_predictor_v2.py # v2: chunked numpy kNN over full corpus, time-causal
└── train_when_ready.sh # one-shot orchestrator: poll → ingest → train → commit
supabase/migrations/
└── 0001_init.sql # all tables + RPCs + indexes in one file
docs/
└── ARCHITECTURE.md # request flow, training flow, design rationale
Predictor. LightGBM gradient-boosted regressor + α=0.1/0.9 quantile heads + binary "score ≥ 100" classifier. All four converted to plain JavaScript via m2cgen so inference runs in the Vercel function with zero ML runtime dependency. ~10 MB bundle, sub-ms inference.
Feature vector (31-dim):
- 16 kNN-derived — neighbor score p10/p50/p90, max, mean log, frontpage rate, mean cosine, recency, etc. Computed from a top-50 cosine-neighbor lookup at query time over the halfvec HNSW index.
- 7 title craft — length in chars + words, has-question, has-show-prefix, has-ask-prefix, has-colon, digit ratio.
- 4 time — hour-of-day sin/cos, day-of-week, is-weekend.
- 4 domain priors — one-hot of top-N domains, log frequency, target-encoded mean log score.
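The title-craft and time slices are simple enough to sketch in full. A minimal illustration follows (function names are hypothetical; the authoritative schema is v1_features.json):

```python
import math
from datetime import datetime, timezone

def title_craft_features(title: str) -> list[float]:
    """The 7 hand features derived from the title string alone (sketch)."""
    words = title.split()
    digits = sum(ch.isdigit() for ch in title)
    return [
        float(len(title)),                    # length in chars
        float(len(words)),                    # length in words
        float(title.rstrip().endswith("?")),  # has-question
        float(title.startswith("Show HN:")),  # has-show-prefix
        float(title.startswith("Ask HN:")),   # has-ask-prefix
        float(":" in title),                  # has-colon
        digits / max(len(title), 1),          # digit ratio
    ]

def time_features(posted_at: datetime) -> list[float]:
    """The 4 cyclical/calendar features from the submission timestamp."""
    hour = posted_at.hour + posted_at.minute / 60
    return [
        math.sin(2 * math.pi * hour / 24),    # hour-of-day sin
        math.cos(2 * math.pi * hour / 24),    # hour-of-day cos
        float(posted_at.weekday()),           # day-of-week (0 = Monday)
        float(posted_at.weekday() >= 5),      # is-weekend
    ]
```

Encoding the hour as sin/cos keeps 23:59 and 00:01 adjacent in feature space, which a raw 0-23 integer would not.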
kNN. halfvec(3072) embeddings via pgvector; HNSW index over a binary-quantized prefilter then cosine rescore on the half-precision vectors. The hybrid pattern keeps queries fast at corpus sizes where naive halfvec HNSW crawls. See hn_search_items_by_embedding in the migration.
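The two-stage pattern can be illustrated in pure Python at toy dimensionality (in production this is the hn_search_items_by_embedding RPC over pgvector, not application code):

```python
import math

def quantize(vec) -> int:
    """Binary-quantize a float vector: one sign bit per dimension."""
    return sum(1 << i for i, x in enumerate(vec) if x > 0)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def search(query, corpus, prefilter_k=100, k=50):
    """Stage 1: cheap Hamming prefilter on the bit codes.
    Stage 2: exact cosine rescore on the full-precision vectors."""
    qbits = quantize(query)
    coarse = sorted(corpus, key=lambda it: hamming(qbits, it["bits"]))[:prefilter_k]
    return sorted(coarse, key=lambda it: -cosine(query, it["vec"]))[:k]
```

The prefilter touches only integer bit codes, so the expensive cosine math runs on prefilter_k candidates instead of the whole corpus.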
Time-causal training. All kNN features at training time are computed over neighbor.time < candidate.time — same constraint as production. This is the part most "I trained a model on HN" attempts get wrong. Chronological train/val split alone isn't enough if the neighbor lookup peeks at the future.
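The constraint amounts to one filter before any neighbor statistic is computed. An illustrative sketch (field names and the two summary stats are simplified stand-ins for the 16 real kNN features):

```python
def causal_knn_features(candidate_time, neighbors, k=50):
    """Compute kNN score features using ONLY neighbors posted strictly
    before the candidate -- the same constraint production inference has."""
    past = [n for n in neighbors if n["time"] < candidate_time]
    past.sort(key=lambda n: -n["cosine"])   # closest first
    top = past[:k]
    if not top:
        return {"p50": 0.0, "frontpage_rate": 0.0}
    scores = sorted(n["score"] for n in top)
    return {
        "p50": scores[len(scores) // 2],
        "frontpage_rate": sum(n["score"] >= 100 for n in top) / len(top),
    }
```

Without the time filter, a training example could borrow the score of a near-duplicate posted *later*, and holdout metrics would look far better than production ever will.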
Comment simulator. Top-5 high-scoring kNN neighbors → Gemini Flash with a 5-shot prompt → 5 archetypal comment outputs (the-skeptic, the-pedant, the-tangent, the-supportive, the-correction). Each comment cites the neighbor that motivated it — grounded, not riffed.
Rewriter. Same neighbor-grounded prompt; generates 3-5 title variants, scores each through the live predictor, keeps the ones that beat the base score. Auto-improve runs N rounds of this and visualizes the climb.
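The loop is a plain greedy hill-climb. A schematic version, where generate_variants and score are stand-ins for the Gemini rewriter and the live predictor:

```python
def auto_improve(title, score, generate_variants, rounds=3):
    """Greedy hill-climb: each round, rescore the variants and keep the
    best one only if it beats the current champion."""
    best_title, best_score = title, score(title)
    history = [(best_title, best_score)]
    for _ in range(rounds):
        variants = generate_variants(best_title)   # e.g. 3-5 LLM rewrites
        scored = [(v, score(v)) for v in variants]
        challenger, s = max(scored, key=lambda pair: pair[1])
        if s > best_score:                         # keep climbers only
            best_title, best_score = challenger, s
            history.append((best_title, best_score))
    return best_title, best_score, history
```

Returning the full history is what lets the UI stream and visualize the climb round by round.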
Calibration ledger. A cron hits the predict endpoint every 10 min, pulls HN's top 30 via Firebase, scores each through the same model, writes to hn_frontpage_snapshots. The /predictions page renders the predicted-vs-actual delta in real time.
→ Full request flow + training flow diagrams in docs/ARCHITECTURE.md.
This repo uses Next.js Pages Router. Routes live under src/pages/hackernews/* rather than at the root — that's the layout the engine ships in, because in production a middleware rewrite serves hackernews.foresyn.ai/<path> from /hackernews/<path>.
You have two options:
A. Keep the /hackernews/ prefix (zero refactor). Pages live at localhost:3000/hackernews, localhost:3000/hackernews/news, etc. The included src/pages/index.tsx redirects / → /hackernews so the root URL works.
B. Mount at the root. Move every file from src/pages/hackernews/<x> to src/pages/<x> (and src/pages/api/hackernews/<x> to src/pages/api/<x>), then search-and-replace /hackernews/ → / in JSX. Components and lib don't change.
For subdomain deployment (hn.example.com), keep option A + add a host-conditional rewrite:
// next.config.js
async rewrites() {
return [{
source: '/:path*',
destination: '/hackernews/:path*',
has: [{ type: 'host', value: 'hn.example.com' }]
}];
}

- Not a replacement for posting good content. The predictor adds maybe ~0.3 ρ of signal over time-of-day-and-domain heuristics. The remaining variance is the actual content + early-vote luck.
- Not a paper-grade benchmark. Holdout ρ ≈ 0.33 is in line with the published HN-prediction literature (see the metrics table). Don't expect to reach ρ > 0.6 honestly; if you reproduce above that, double-check your splits.
- Not corpus-complete. The Algolia-ingested corpus is biased toward stories that already cleared HN's "show me content" minimums. The bottom 80% of submissions (the /newest dropouts) aren't in the training data, so predictions for low-traction drafts are weakest.
- Not literal comment prediction. The comment simulator is a riff on what kinds of comments similar posts attracted — directionally honest, not literally predictive.
- Not affiliated with Hacker News or Y Combinator. The clone is a loving homage, not a jab.
PRs welcome. A few high-leverage directions:
- Better features. Author features (karma, prior post mean log-score) would lift ρ meaningfully if you have the author data ingested. Comment-payload features (early-comment sentiment, top-commenter karma) too.
- A second corpus source. Algolia has score ceilings; bigquery-public-data.hacker_news.full has full historical fidelity. A scripts/ingest_bigquery.py is missing and would be welcome.
- Per-K precision in the calibration trend. Spearman ρ is fine, but Precision@K=30 is what HN's audience actually cares about (the front page is 30 slots). Surface it on /predictions.
- Author handle. The current submit flow is anonymous. Threading the post through a captcha + email handle would unlock attribution and the "Foresyn-scored draft → real HN crossover" loop.
File an issue first for anything substantial — the architecture has opinionated choices (m2cgen-compiled JS over ONNX, neighbor-grounded comments over fine-tuning, time-causal kNN, halfvec hybrid) that are worth discussing before refactoring.
- Paul Houle — 2014 title-only logistic regression baseline that everyone's been trying to beat for a decade.
- Philipp Dubach — 2024-25 BERT fine-tuning work + the honest accuracy numbers that grounded our target band.
- Marc-André Sollami — 2018 NN baseline.
- m2cgen — the only reason this ships LightGBM-grade accuracy without an ML runtime.
- pgvector — halfvec + binary-quantize support is the entire reason the kNN is fast.
- Hacker News and Y Combinator — for the platform, the trade dress, and the patience for clones.
MIT.
Built by Artemii Novoselov. x.com/earthml1 · LinkedIn.