
crimeacs/foresyn-hackernews


Foresyn HN

Predict your Hacker News virality before you post. A pixel-faithful HN clone with a calibrated LightGBM predictor, a Gemini-grounded comment simulator, an auto-improving title rewriter, and a live calibration ledger that scores HN's actual front page every 10 minutes.


Live at hackernews.foresyn.ai · /predictions for the calibration ledger · /about for the model card


Honest holdout numbers

The site's /about page reads these from src/lib/hackernews/model/v1_metrics.json at build time — they cannot drift from training.

| Metric | Value | Reference baseline |
|---|---|---|
| Spearman ρ on log-score (holdout) | 0.33 | |
| MAE on log-score (holdout) | 1.65 | ≈ 5.2× off in raw points (typical) |
| AUC for "score ≥ 100" (holdout) | 0.67 | ontology2 2014 LR: 0.77 · Dubach 2024-25 BERT: 0.65-0.69 |
| Precision@30 (holdout) | 0.83 | |
| Training corpus | 148,400 stories | Algolia HN search API, chronological split |
| Inference latency (p50, warm) | ~280 ms | Gemini embedding dominates |

ρ caps around 0.4-0.6 in this domain. Early-vote stochasticity bounds how much of HN's actual scoring is predictable from title + URL + time. Anyone claiming much above that is either testing on a leaky split or sampling from a non-random slice.
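For reference, the holdout ρ is an ordinary rank correlation on log-scores. A minimal pure-Python sketch (a real pipeline would use scipy.stats.spearmanr; names here are illustrative):

```python
import math

def ranks(xs):
    """Average ranks (1-based), handling ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of tied positions, 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(pred, actual):
    """Spearman rho = Pearson correlation of the rank vectors."""
    rp, ra = ranks(pred), ranks(actual)
    n = len(pred)
    mp, ma = sum(rp) / n, sum(ra) / n
    cov = sum((a - mp) * (b - ma) for a, b in zip(rp, ra))
    sp = math.sqrt(sum((a - mp) ** 2 for a in rp))
    sa = math.sqrt(sum((b - ma) ** 2 for b in ra))
    return cov / (sp * sa)

# Toy holdout: log1p of predicted vs actual HN points
log_pred   = [math.log1p(s) for s in [3, 40, 7, 120, 15]]
log_actual = [math.log1p(s) for s in [5, 22, 9, 300, 11]]
rho = spearman(log_pred, log_actual)
```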


What it does (the user-visible loop)

  1. Submit a draft — paste a title + optional URL + optional body.
  2. Get a calibrated score — virality 0-99, raw HN points estimate, p10/p90 interval, front-page probability.
  3. See evidence — top-5 cosine-nearest historical hits the predictor used as comparables.
  4. See the takedowns — five Gemini-simulated comments grounded in five high-scoring kNN siblings (the-skeptic, the-pedant, the-tangent, the-supportive, the-correction).
  5. Auto-improve — title rewriter generates variants, scores each, keeps the climbers; you watch the hill-climb in real time.
  6. Check live calibration — /predictions re-scores HN's actual top 30 every 10 minutes and publishes the predicted-vs-actual delta. No quiet inflation: you can audit any wrong call.
  7. Drill into one story — /predictions/story/[id] shows the per-snapshot timeline (predicted line vs. actual line, plus the verdict).


Quick start

1. Supabase

Create a project (free tier is fine). pgvector ≥ 0.5 (for halfvec) and pg_trgm need to be available — both are in default Supabase Postgres 15+.
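If either extension isn't enabled on your project yet, a one-time statement in the SQL editor takes care of it (note pgvector's extension name is vector):

```sql
-- Enable once per database; both ship with Supabase Postgres 15+.
CREATE EXTENSION IF NOT EXISTS vector;   -- pgvector >= 0.5, needed for halfvec
CREATE EXTENSION IF NOT EXISTS pg_trgm;  -- trigram matching
```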

Apply the schema:

# Option A: Supabase CLI
supabase link --project-ref YOUR-PROJECT-REF
supabase db push

# Option B: copy-paste supabase/migrations/0001_init.sql into the SQL editor.

The migration creates 8 tables (hn_items, hn_item_embeddings, hn_user_submissions, hn_frontpage_snapshots, hn_comments, hn_comments_sim_cache, hn_rewrites_cache, hn_predictions_audit) and 2 RPCs (hn_search_items_by_embedding, hn_user_crossovers).

2. Environment

cp .env.example .env.local

Fill in:

| Variable | What it's for |
|---|---|
| SUPABASE_URL | Project URL (Supabase dashboard → Project Settings → API) |
| SUPABASE_SERVICE_KEY | Service-role key (same page); server-side only, never ship it to the browser |
| GEMINI_API_KEY | https://aistudio.google.com/app/apikey — embeddings + rewriter + comments |
| UPSTASH_REDIS_* | Optional. Distributed rate limit; if blank, falls back to in-memory |
| NEXT_PUBLIC_POSTHOG_KEY | Optional. Funnel analytics; the page works fine without it |

3. Ingest the corpus

python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt  # or: supabase requests numpy pandas aiohttp
python scripts/ingest_algolia.py --min-score 5 --limit 50000

The script is idempotent on (id) so you can stop / resume / cron it. ~50K rows fits comfortably in the Supabase free tier. Push for 148K+ if you have the budget.
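The idempotency contract is simple to state without the network: every write is keyed on the story id, so re-running the same batch after a crash or from cron never duplicates rows. A toy in-memory stand-in for the Supabase upsert:

```python
# Toy stand-in for an upsert keyed on the story id: re-running the same
# batch (after a crash, or from a cron resume) never duplicates rows.
def upsert_batch(table: dict, rows: list) -> int:
    """Insert-or-replace each row by its 'id'; returns rows written."""
    for row in rows:
        table[row["id"]] = row  # same id -> overwrite, not duplicate
    return len(rows)

hn_items = {}
batch = [{"id": 1, "title": "Show HN: Foo"}, {"id": 2, "title": "Ask HN: Bar"}]
upsert_batch(hn_items, batch)
upsert_batch(hn_items, batch)  # resumed run: still 2 rows, not 4
```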

4. Embed the corpus

python scripts/embed_items.py --batch-size 64

Costs ~$1 per 10K rows on Gemini text-embedding-004. Stored as halfvec(3072) (half-precision pgvector) plus a bit(3072) binary-quantization for the HNSW prefilter.
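Binary quantization here means keeping only the sign bit of each dimension, which is what makes the prefilter key so cheap. A hedged pure-Python sketch of the idea behind the bit(3072) column (pgvector does this in SQL):

```python
# Sketch: binary-quantize an embedding (one sign bit per dimension) so the
# prefilter can compare vectors by Hamming distance instead of cosine.
def quantize(vec):
    """Pack sign bits into an int: bit i is set iff vec[i] > 0."""
    bits = 0
    for i, x in enumerate(vec):
        if x > 0:
            bits |= 1 << i
    return bits

def hamming(a: int, b: int) -> int:
    """Number of differing sign bits between two quantized vectors."""
    return bin(a ^ b).count("1")

v1 = [0.12, -0.5, 0.3, -0.01]
v2 = [0.10, -0.4, -0.2, 0.02]
d = hamming(quantize(v1), quantize(v2))  # cheap prefilter distance
```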

5. Train the predictor

pip install lightgbm scikit-learn m2cgen
python scripts/train_predictor_v2.py --output-dir src/lib/hackernews/model

Trains four heads (median regressor, p10 / p90 quantile regressors, frontpage binary classifier), exports them to pure JS via m2cgen, writes v1_metrics.json (holdout ρ + MAE + AUC + P@30) and v1_features.json (the feature schema). The site's /about page reads these at build time and shows the numbers verbatim — you cannot quietly inflate them.

Note: the repo ships with a working reference build of the model JS files already in src/lib/hackernews/model/. You can skip step 5 if you just want to play with the engine end-to-end before training your own.

6. Run the dev server

npm install
npm run dev
# → http://localhost:3000 (redirects to /hackernews)

7. (Optional) Calibration cron

/predictions and /predictions/story/[id] stay useful only if something is writing to hn_frontpage_snapshots on a cadence. Two options:

  • Vercel Cron Job: add a POST /api/hackernews/score-frontpage route (~30 lines: pull HN's top 30 via the Firebase API, run the same predictor, insert into hn_frontpage_snapshots), then wire it up in vercel.json with */10 * * * *. Vercel Hobby only allows daily crons; bump to Pro or use Fleet for a 10-minute cadence.
  • systemd timer on any VM: same script, scheduled with a .timer unit. Free, reliable.
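For the Vercel route, the cron wiring in vercel.json is a few lines (the route path is the hypothetical one from the bullet above; the 10-minute schedule needs a paid plan):

```json
{
  "crons": [
    { "path": "/api/hackernews/score-frontpage", "schedule": "*/10 * * * *" }
  ]
}
```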

Repo tour

src/
├── pages/hackernews/                # Next.js Pages-router routes
│   ├── index.tsx                    #   front page + submit form
│   ├── submit.tsx                   #   bare submit page
│   ├── item/[id].tsx                #   one draft: score + rewrites + comment sim + edit
│   ├── news.tsx + news/[page].tsx   #   chronological feed
│   ├── leaderboard.tsx              #   top drafts by predicted virality
│   ├── predictions.tsx              #   live HN front page scored every 10 min
│   ├── predictions/story/[item_id]  #   per-story predicted-vs-actual timeline
│   └── about.tsx                    #   model card + holdout metrics
├── pages/api/hackernews/            # API routes
│   ├── predict.ts                   #   POST title+url → prediction (the core endpoint)
│   ├── rewrites.ts                  #   POST title → 3 LLM rewrites scored back through predict
│   ├── auto-improve.ts              #   SSE: multi-iteration hill-climb on rewrites
│   ├── comments-simulator.ts        #   SSE: 5 archetypal comments grounded in kNN siblings
│   ├── submissions/[id].ts          #   PATCH submission (edit & rescore)
│   ├── items/[id].ts                #   GET item detail
│   ├── og.tsx                       #   Open Graph card renderer (@vercel/og)
│   └── {robots,sitemap,llms}.ts     #   SEO surface
├── components/hackernews/           # 16 React components (HN-faithful Verdana 10pt aesthetic)
├── lib/hackernews/
│   ├── server/                      # Server-only: Supabase, predictLgbm, kNN, rate limit, gemini
│   ├── client/                      # Client-only: analytics, share helpers, time-ago
│   └── model/                       # LightGBM heads as plain JS + feature schema + metrics
└── styles/hackernews.module.css

scripts/                             # Python — corpus ingest, embeddings, training
├── ingest_algolia.py                # pull HN stories from public Algolia API → hn_items
├── embed_items.py                   # embed titles+bodies via Gemini → hn_item_embeddings
├── embed_service.py                 # optional sentence-transformers sidecar (FastAPI)
├── deploy_embed_service.sh
├── train_predictor.py               # v1 (deprecated, kept for reference)
├── train_predictor_v2.py            # v2: chunked numpy kNN over full corpus, time-causal
└── train_when_ready.sh              # one-shot orchestrator: poll → ingest → train → commit

supabase/migrations/
└── 0001_init.sql                    # all tables + RPCs + indexes in one file

docs/
└── ARCHITECTURE.md                  # request flow, training flow, design rationale

Architecture

Predictor. LightGBM gradient-boosted regressor + α=0.1/0.9 quantile heads + binary "score ≥ 100" classifier. All four converted to plain JavaScript via m2cgen so inference runs in the Vercel function with zero ML runtime dependency. ~10 MB bundle, sub-ms inference.

Feature vector (31-dim):

  • 16 kNN-derived — neighbor score p10/p50/p90, max, mean log, frontpage rate, mean cosine, recency, etc. Computed from a top-50 cosine-neighbor lookup at query time over the halfvec HNSW index.
  • 7 title craft — length in chars + words, has-question, has-show-prefix, has-ask-prefix, has-colon, digit ratio.
  • 4 time — hour-of-day sin/cos, day-of-week, is-weekend.
  • 4 domain priors — one-hot of top-N domains, log frequency, target-encoded mean log score.
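The non-kNN slices are cheap to compute at query time. An illustrative sketch of the title-craft and time features (field names here are made up for readability; the real schema lives in v1_features.json):

```python
import math
from datetime import datetime, timezone

def title_time_features(title: str, ts: datetime) -> dict:
    """Illustrative subset of the 31-dim vector: title craft + cyclic time."""
    digits = sum(c.isdigit() for c in title)
    hour_angle = 2 * math.pi * ts.hour / 24  # encode hour on a circle
    return {
        "len_chars": len(title),
        "len_words": len(title.split()),
        "has_question": int("?" in title),
        "has_show_prefix": int(title.startswith("Show HN:")),
        "has_ask_prefix": int(title.startswith("Ask HN:")),
        "has_colon": int(":" in title),
        "digit_ratio": digits / max(len(title), 1),
        "hour_sin": math.sin(hour_angle),
        "hour_cos": math.cos(hour_angle),
        "day_of_week": ts.weekday(),
        "is_weekend": int(ts.weekday() >= 5),
    }

f = title_time_features("Show HN: Foresyn predicts HN scores",
                        datetime(2025, 1, 6, 14, tzinfo=timezone.utc))
```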

kNN. halfvec(3072) embeddings via pgvector; HNSW index over a binary-quantized prefilter then cosine rescore on the half-precision vectors. The hybrid pattern keeps queries fast at corpus sizes where naive halfvec HNSW crawls. See hn_search_items_by_embedding in the migration.
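The hybrid lookup is two stages: a cheap Hamming-distance pass over the binary codes to shortlist candidates, then an exact cosine rescore on the full vectors for the survivors. A toy in-memory sketch of the shape of what hn_search_items_by_embedding does in SQL:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def binarize(v):
    return tuple(x > 0 for x in v)

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def hybrid_knn(query, corpus, prefilter=100, k=5):
    """Stage 1: Hamming prefilter on sign bits (cheap, over everything).
    Stage 2: exact cosine rescore on the few survivors (expensive)."""
    qbits = binarize(query)
    coarse = sorted(corpus, key=lambda it: hamming(qbits, binarize(it["vec"])))
    fine = sorted(coarse[:prefilter], key=lambda it: -cosine(query, it["vec"]))
    return fine[:k]

corpus = [
    {"id": "a", "vec": [1.0, 1.0, -1.0]},
    {"id": "b", "vec": [-1.0, -1.0, 1.0]},
    {"id": "c", "vec": [1.0, -1.0, -1.0]},
]
top = hybrid_knn([0.9, 1.1, -0.8], corpus, prefilter=3, k=2)
```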

Time-causal training. All kNN features at training time are computed over neighbor.time < candidate.time — same constraint as production. This is the part most "I trained a model on HN" attempts get wrong. Chronological train/val split alone isn't enough if the neighbor lookup peeks at the future.
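The constraint itself is a one-line filter, but it has to run before the neighbor ranking, not after. A sketch (recency stands in for cosine order; the real lookup ranks by similarity):

```python
# Sketch: time-causal kNN. Neighbors are restricted to strictly-earlier
# stories BEFORE ranking, mirroring what production inference can see.
def causal_neighbors(candidate: dict, corpus: list, k: int = 50) -> list:
    past_only = [it for it in corpus if it["time"] < candidate["time"]]
    # rank by similarity here; sorting by recency stands in for cosine order
    past_only.sort(key=lambda it: it["time"], reverse=True)
    return past_only[:k]

corpus = [{"id": i, "time": i} for i in range(10)]
cand = {"id": 99, "time": 5}
nbrs = causal_neighbors(cand, corpus, k=3)
assert all(n["time"] < cand["time"] for n in nbrs)  # no future leakage
```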

Comment simulator. Top-5 high-scoring kNN neighbors → Gemini Flash with a 5-shot prompt → 5 archetypal comment outputs (the-skeptic, the-pedant, the-tangent, the-supportive, the-correction). Each comment cites the neighbor that motivated it — grounded, not riffed.

Rewriter. Same neighbor-grounded prompt; generates 3-5 title variants, scores each through the live predictor, keeps the ones that beat the base score. Auto-improve runs N rounds of this and visualizes the climb.
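The auto-improve loop is plain hill-climbing. A hedged sketch with toy stand-ins — a fake scorer in place of the LightGBM predictor and a fake generator in place of Gemini:

```python
def hill_climb(title, generate, score, rounds=3):
    """Each round: generate variants, score them, keep the best only if it
    beats the current champion (the 'keep the climbers' rule)."""
    best, best_s = title, score(title)
    for _ in range(rounds):
        variants = generate(best)
        top = max(variants, key=score)
        if score(top) > best_s:  # only keep climbers
            best, best_s = top, score(top)
    return best, best_s

# Toy stand-ins (the real system calls Gemini + the compiled predictor):
score = lambda t: len(t)                       # pretend longer titles score higher
generate = lambda t: [t + "!", t + " (2025)"]  # pretend LLM rewrites
best, s = hill_climb("Show HN: Foo", generate, score)
```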

Calibration ledger. A cron hits the predict endpoint every 10 min, pulls HN's top 30 via Firebase, scores each through the same model, writes to hn_frontpage_snapshots. The /predictions page renders the predicted-vs-actual delta in real time.
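The delta the ledger publishes is simple arithmetic on log scores. A sketch of a per-snapshot summary (field names illustrative, not the actual column names):

```python
import math

def snapshot_delta(rows):
    """Per-snapshot calibration summary on log1p(points), the same scale
    the model trains on: signed mean = bias, absolute mean = magnitude."""
    errs = [math.log1p(r["predicted"]) - math.log1p(r["actual"]) for r in rows]
    n = len(errs)
    return {
        "mean_delta": sum(errs) / n,                      # + means overshooting
        "mean_abs_delta": sum(abs(e) for e in errs) / n,  # typical miss size
    }

rows = [{"predicted": 120, "actual": 95}, {"predicted": 30, "actual": 60}]
d = snapshot_delta(rows)
```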

→ Full request flow + training flow diagrams in docs/ARCHITECTURE.md.


Routing

This repo uses Next.js Pages Router. Routes live under src/pages/hackernews/* rather than at the root — that's the layout the engine ships in, because in production a middleware rewrite serves hackernews.foresyn.ai/<path> from /hackernews/<path>.

You have two options:

A. Keep the /hackernews/ prefix (zero refactor). Pages live at localhost:3000/hackernews, localhost:3000/hackernews/news, etc. The included src/pages/index.tsx redirects / to /hackernews so the root URL works.

B. Mount at the root. Move every file from src/pages/hackernews/<x> to src/pages/<x> (and src/pages/api/hackernews/<x> to src/pages/api/<x>), then search-and-replace /hackernews/ with / in JSX. Components and lib don't change.

For subdomain deployment (hn.example.com), keep option A + add a host-conditional rewrite:

// next.config.js
module.exports = {
  async rewrites() {
    return [{
      source: '/:path*',
      destination: '/hackernews/:path*',
      has: [{ type: 'host', value: 'hn.example.com' }],
    }];
  },
};

What this isn't

  • Not a replacement for posting good content. The predictor adds maybe ~0.3 ρ of signal over time-of-day-and-domain heuristics. The remaining variance is the actual content + early-vote luck.
  • Not a paper-grade benchmark. Holdout ρ ≈ 0.33 is in line with the published HN-prediction literature (see the metrics table). Don't expect ρ > 0.6 from an honest setup; if you reproduce above that, double-check your splits.
  • Not corpus-complete. The Algolia-ingested corpus is biased toward stories that already cleared HN's "show me content" minimums. The bottom 80% of submissions (the /newest dropouts) aren't in the training data. Predictions for low-traction drafts are weakest.
  • Not a comment oracle. The simulator is a riff on what kinds of comments similar posts attracted: directionally honest, not literally predictive.
  • Not affiliated with Hacker News or Y Combinator. The clone is loving, not a mark.

Contributing

PRs welcome. A few high-leverage directions:

  • Better features. Author features (karma, prior post mean log-score) would lift ρ meaningfully if you have the author data ingested. Comment-payload features (early-comment sentiment, top-commenter karma) too.
  • A second corpus source. Algolia has score ceilings. bigquery-public-data.hacker_news.full has full historical fidelity. A scripts/ingest_bigquery.py is missing and would be welcome.
  • Per-K precision in the calibration trend. Spearman ρ is fine but Precision@K=30 is what HN's audience actually cares about (the front page is 30 slots). Surface it on /predictions.
  • Author handle. The current submit flow is anonymous. Threading the post through a captcha + email-handle would unlock attribution and the "Foresyn-scored draft → real HN crossover" loop.

File an issue first for anything substantial — the architecture has opinionated choices (m2cgen-compiled JS over ONNX, neighbor-grounded comments over fine-tuning, time-causal kNN, halfvec hybrid) that are worth discussing before refactoring.


Acknowledgments

  • Paul Houle — 2014 title-only logistic regression baseline that everyone's been trying to beat for a decade.
  • Philipp Dubach — 2024-25 BERT fine-tuning work + the honest accuracy numbers that grounded our target band.
  • Marc-André Sollami — 2018 NN baseline.
  • m2cgen — the only reason this ships LightGBM-grade accuracy without an ML runtime.
  • pgvector — halfvec + binary-quantize support is the entire reason the kNN is fast.
  • Hacker News and Y Combinator — for the platform, the trade dress, and the patience for clones.

License

MIT.

Built by Artemii Novoselov. x.com/earthml1 · LinkedIn.
