
crimeacs/foresyn-hackernews


Foresyn HN

Predict your Hacker News virality before you post. A pixel-faithful HN clone with a calibrated LightGBM predictor, a Gemini-grounded comment simulator, an auto-improving title rewriter, and a live calibration ledger that scores HN's actual front page every 10 minutes.


Live at hackernews.foresyn.ai · /predictions for the calibration ledger · /about for the model card


Honest holdout numbers

The site's /about page reads these from src/lib/hackernews/model/v1_metrics.json at build time — they cannot drift from training.

| Metric | Value | Reference baseline |
|---|---|---|
| Spearman ρ on log-score (holdout) | 0.33 | |
| MAE on log-score (holdout) | 1.65 | ≈ 5.2× off in raw points (typical) |
| AUC for "score ≥ 100" (holdout) | 0.67 | ontology2 2014 LR: 0.77 · Dubach 2024-25 BERT: 0.65-0.69 |
| Precision@30 (holdout) | 0.83 | |
| Training corpus | 148,400 stories | Algolia HN search API, chronological split |
| Inference latency (p50, warm) | ~280 ms | Gemini embedding dominates |

ρ caps around 0.4-0.6 in this domain. Early-vote stochasticity bounds how much of HN's actual scoring is predictable from title + URL + time. Anyone claiming much above that is either testing on a leaky split or sampling from a non-random slice.
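For reference, the holdout ρ is an ordinary rank correlation on log-scores. A minimal pure-Python sketch (a real pipeline would use scipy.stats.spearmanr; names here are illustrative):

```python
import math

def ranks(xs):
    """Average ranks (1-based), handling ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of tied positions, 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(pred, actual):
    """Spearman rho = Pearson correlation of the rank vectors."""
    rp, ra = ranks(pred), ranks(actual)
    n = len(pred)
    mp, ma = sum(rp) / n, sum(ra) / n
    cov = sum((a - mp) * (b - ma) for a, b in zip(rp, ra))
    sp = math.sqrt(sum((a - mp) ** 2 for a in rp))
    sa = math.sqrt(sum((b - ma) ** 2 for b in ra))
    return cov / (sp * sa)

# Toy holdout: log1p of predicted vs actual HN points
log_pred   = [math.log1p(s) for s in [3, 40, 7, 120, 15]]
log_actual = [math.log1p(s) for s in [5, 22, 9, 300, 11]]
rho = spearman(log_pred, log_actual)
```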


What it does (the user-visible loop)

  1. Submit a draft — paste a title + optional URL + optional body.
  2. Get a calibrated score — virality 0-99, raw HN points estimate, p10/p90 interval, front-page probability.
  3. See evidence — top-5 cosine-nearest historical hits the predictor used as comparables.
  4. See the takedowns — five Gemini-simulated comments grounded in five high-scoring kNN siblings (the-skeptic, the-pedant, the-tangent, the-supportive, the-correction).
  5. Auto-improve — title rewriter generates variants, scores each, keeps the climbers; you watch the hill-climb in real time.
  6. Check live calibration — /predictions re-scores HN's actual top 30 every 10 minutes and publishes the predicted-vs-actual delta. No quiet inflation: you can audit any wrong call.
  7. Drill into one story — /predictions/story/[id] shows the per-snapshot timeline (predicted line vs. actual line, plus the verdict).


Quick start

1. Supabase

Create a project (free tier is fine). pgvector ≥ 0.5 (for halfvec) and pg_trgm need to be available — both are in default Supabase Postgres 15+.
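If either extension isn't enabled on your project yet, a one-time statement in the SQL editor takes care of it (note pgvector's extension name is vector):

```sql
-- Enable once per database; both ship with Supabase Postgres 15+.
CREATE EXTENSION IF NOT EXISTS vector;   -- pgvector >= 0.5, needed for halfvec
CREATE EXTENSION IF NOT EXISTS pg_trgm;  -- trigram matching
```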

Apply the schema:

# Option A: Supabase CLI
supabase link --project-ref YOUR-PROJECT-REF
supabase db push

# Option B: copy-paste supabase/migrations/0001_init.sql into the SQL editor.

The migration creates 8 tables (hn_items, hn_item_embeddings, hn_user_submissions, hn_frontpage_snapshots, hn_comments, hn_comments_sim_cache, hn_rewrites_cache, hn_predictions_audit) and 2 RPCs (hn_search_items_by_embedding, hn_user_crossovers).

2. Environment

cp .env.example .env.local

Fill in:

| Variable | What it's for |
|---|---|
| SUPABASE_URL | Project URL (Supabase dashboard → Project Settings → API) |
| SUPABASE_SERVICE_KEY | Service-role key (same page); server-side only, never ship it to the browser |
| GEMINI_API_KEY | https://aistudio.google.com/app/apikey — embeddings + rewriter + comments |
| UPSTASH_REDIS_* | Optional. Distributed rate limit; if blank, falls back to in-memory |
| NEXT_PUBLIC_POSTHOG_KEY | Optional. Funnel analytics; the page works fine without it |

3. Ingest the corpus

python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt  # or: supabase requests numpy pandas aiohttp
python scripts/ingest_algolia.py --min-score 5 --limit 50000

The script is idempotent on (id) so you can stop / resume / cron it. ~50K rows fits comfortably in the Supabase free tier. Push for 148K+ if you have the budget.
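The idempotency contract is simple to state without the network: every write is keyed on the story id, so re-running the same batch after a crash or from cron never duplicates rows. A toy in-memory stand-in for the Supabase upsert:

```python
# Toy stand-in for an upsert keyed on the story id: re-running the same
# batch (after a crash, or from a cron resume) never duplicates rows.
def upsert_batch(table: dict, rows: list) -> int:
    """Insert-or-replace each row by its 'id'; returns rows written."""
    for row in rows:
        table[row["id"]] = row  # same id -> overwrite, not duplicate
    return len(rows)

hn_items = {}
batch = [{"id": 1, "title": "Show HN: Foo"}, {"id": 2, "title": "Ask HN: Bar"}]
upsert_batch(hn_items, batch)
upsert_batch(hn_items, batch)  # resumed run: still 2 rows, not 4
```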

4. Embed the corpus

python scripts/embed_items.py --batch-size 64

Costs ~$1 per 10K rows on Gemini text-embedding-004. Stored as halfvec(3072) (half-precision pgvector) plus a bit(3072) binary-quantization for the HNSW prefilter.
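Binary quantization here means keeping only the sign bit of each dimension, which is what makes the prefilter key so cheap. A hedged pure-Python sketch of the idea behind the bit(3072) column (pgvector does this in SQL):

```python
# Sketch: binary-quantize an embedding (one sign bit per dimension) so the
# prefilter can compare vectors by Hamming distance instead of cosine.
def quantize(vec):
    """Pack sign bits into an int: bit i is set iff vec[i] > 0."""
    bits = 0
    for i, x in enumerate(vec):
        if x > 0:
            bits |= 1 << i
    return bits

def hamming(a: int, b: int) -> int:
    """Number of differing sign bits between two quantized vectors."""
    return bin(a ^ b).count("1")

v1 = [0.12, -0.5, 0.3, -0.01]
v2 = [0.10, -0.4, -0.2, 0.02]
d = hamming(quantize(v1), quantize(v2))  # cheap prefilter distance
```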

5. Train the predictor

pip install lightgbm scikit-learn m2cgen
python scripts/train_predictor_v2.py --output-dir src/lib/hackernews/model

Trains four heads (median regressor, p10 / p90 quantile regressors, frontpage binary classifier), exports them to pure JS via m2cgen, writes v1_metrics.json (holdout ρ + MAE + AUC + P@30) and v1_features.json (the feature schema). The site's /about page reads these at build time and shows the numbers verbatim — you cannot quietly inflate them.

Note: the repo ships with a working reference build of the model JS files already in src/lib/hackernews/model/. You can skip step 5 if you just want to play with the engine end-to-end before training your own.

6. Run the dev server

npm install
npm run dev
# → http://localhost:3000 (redirects to /hackernews)

7. (Optional) Calibration cron

/predictions and /predictions/story/[id] stay useful only if something is writing to hn_frontpage_snapshots on a cadence. Two options:

  • Vercel Cron Job: add a POST /api/hackernews/score-frontpage route (~30 lines: pull HN's top 30 via the Firebase API, run the same predictor, insert into hn_frontpage_snapshots), then wire it up in vercel.json with */10 * * * *. Vercel Hobby only allows daily crons; bump to Pro or use Fleet for a 10-minute cadence.
  • systemd timer on any VM: same script, scheduled with a .timer unit. Free, reliable.
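For the Vercel route, the cron wiring in vercel.json is a few lines (the route path is the hypothetical one from the bullet above; the 10-minute schedule needs a paid plan):

```json
{
  "crons": [
    { "path": "/api/hackernews/score-frontpage", "schedule": "*/10 * * * *" }
  ]
}
```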

Repo tour

src/
├── pages/hackernews/                # Next.js Pages-router routes
│   ├── index.tsx                    #   front page + submit form
│   ├── submit.tsx                   #   bare submit page
│   ├── item/[id].tsx                #   one draft: score + rewrites + comment sim + edit
│   ├── news.tsx + news/[page].tsx   #   chronological feed
│   ├── leaderboard.tsx              #   top drafts by predicted virality
│   ├── predictions.tsx              #   live HN front page scored every 10 min
│   ├── predictions/story/[item_id]  #   per-story predicted-vs-actual timeline
│   └── about.tsx                    #   model card + holdout metrics
├── pages/api/hackernews/            # API routes
│   ├── predict.ts                   #   POST title+url → prediction (the core endpoint)
│   ├── rewrites.ts                  #   POST title → 3 LLM rewrites scored back through predict
│   ├── auto-improve.ts              #   SSE: multi-iteration hill-climb on rewrites
│   ├── comments-simulator.ts        #   SSE: 5 archetypal comments grounded in kNN siblings
│   ├── submissions/[id].ts          #   PATCH submission (edit & rescore)
│   ├── items/[id].ts                #   GET item detail
│   ├── og.tsx                       #   Open Graph card renderer (@vercel/og)
│   └── {robots,sitemap,llms}.ts     #   SEO surface
├── components/hackernews/           # 16 React components (HN-faithful Verdana 10pt aesthetic)
├── lib/hackernews/
│   ├── server/                      # Server-only: Supabase, predictLgbm, kNN, rate limit, gemini
│   ├── client/                      # Client-only: analytics, share helpers, time-ago
│   └── model/                       # LightGBM heads as plain JS + feature schema + metrics
└── styles/hackernews.module.css

scripts/                             # Python — corpus ingest, embeddings, training
├── ingest_algolia.py                # pull HN stories from public Algolia API → hn_items
├── embed_items.py                   # embed titles+bodies via Gemini → hn_item_embeddings
├── embed_service.py                 # optional sentence-transformers sidecar (FastAPI)
├── deploy_embed_service.sh
├── train_predictor.py               # v1 (deprecated, kept for reference)
├── train_predictor_v2.py            # v2: chunked numpy kNN over full corpus, time-causal
└── train_when_ready.sh              # one-shot orchestrator: poll → ingest → train → commit

supabase/migrations/
└── 0001_init.sql                    # all tables + RPCs + indexes in one file

docs/
└── ARCHITECTURE.md                  # request flow, training flow, design rationale

Architecture

Predictor. LightGBM gradient-boosted regressor + α=0.1/0.9 quantile heads + binary "score ≥ 100" classifier. All four converted to plain JavaScript via m2cgen so inference runs in the Vercel function with zero ML runtime dependency. ~10 MB bundle, sub-ms inference.

Feature vector (31-dim):

  • 16 kNN-derived — neighbor score p10/p50/p90, max, mean log, frontpage rate, mean cosine, recency, etc. Computed from a top-50 cosine-neighbor lookup at query time over the halfvec HNSW index.
  • 7 title craft — length in chars + words, has-question, has-show-prefix, has-ask-prefix, has-colon, digit ratio.
  • 4 time — hour-of-day sin/cos, day-of-week, is-weekend.
  • 4 domain priors — one-hot of top-N domains, log frequency, target-encoded mean log score.
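The non-kNN slices are cheap to compute at query time. An illustrative sketch of the title-craft and time features (field names here are made up for readability; the real schema lives in v1_features.json):

```python
import math
from datetime import datetime, timezone

def title_time_features(title: str, ts: datetime) -> dict:
    """Illustrative subset of the 31-dim vector: title craft + cyclic time."""
    digits = sum(c.isdigit() for c in title)
    hour_angle = 2 * math.pi * ts.hour / 24  # encode hour on a circle
    return {
        "len_chars": len(title),
        "len_words": len(title.split()),
        "has_question": int("?" in title),
        "has_show_prefix": int(title.startswith("Show HN:")),
        "has_ask_prefix": int(title.startswith("Ask HN:")),
        "has_colon": int(":" in title),
        "digit_ratio": digits / max(len(title), 1),
        "hour_sin": math.sin(hour_angle),
        "hour_cos": math.cos(hour_angle),
        "day_of_week": ts.weekday(),
        "is_weekend": int(ts.weekday() >= 5),
    }

f = title_time_features("Show HN: Foresyn predicts HN scores",
                        datetime(2025, 1, 6, 14, tzinfo=timezone.utc))
```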

kNN. halfvec(3072) embeddings via pgvector; HNSW index over a binary-quantized prefilter then cosine rescore on the half-precision vectors. The hybrid pattern keeps queries fast at corpus sizes where naive halfvec HNSW crawls. See hn_search_items_by_embedding in the migration.
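The hybrid lookup is two stages: a cheap Hamming-distance pass over the binary codes to shortlist candidates, then an exact cosine rescore on the full vectors for the survivors. A toy in-memory sketch of the shape of what hn_search_items_by_embedding does in SQL:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def binarize(v):
    return tuple(x > 0 for x in v)

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def hybrid_knn(query, corpus, prefilter=100, k=5):
    """Stage 1: Hamming prefilter on sign bits (cheap, over everything).
    Stage 2: exact cosine rescore on the few survivors (expensive)."""
    qbits = binarize(query)
    coarse = sorted(corpus, key=lambda it: hamming(qbits, binarize(it["vec"])))
    fine = sorted(coarse[:prefilter], key=lambda it: -cosine(query, it["vec"]))
    return fine[:k]

corpus = [
    {"id": "a", "vec": [1.0, 1.0, -1.0]},
    {"id": "b", "vec": [-1.0, -1.0, 1.0]},
    {"id": "c", "vec": [1.0, -1.0, -1.0]},
]
top = hybrid_knn([0.9, 1.1, -0.8], corpus, prefilter=3, k=2)
```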

Time-causal training. All kNN features at training time are computed over neighbor.time < candidate.time — same constraint as production. This is the part most "I trained a model on HN" attempts get wrong. Chronological train/val split alone isn't enough if the neighbor lookup peeks at the future.
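The constraint itself is a one-line filter, but it has to run before the neighbor ranking, not after. A sketch (recency stands in for cosine order; the real lookup ranks by similarity):

```python
# Sketch: time-causal kNN. Neighbors are restricted to strictly-earlier
# stories BEFORE ranking, mirroring what production inference can see.
def causal_neighbors(candidate: dict, corpus: list, k: int = 50) -> list:
    past_only = [it for it in corpus if it["time"] < candidate["time"]]
    # rank by similarity here; sorting by recency stands in for cosine order
    past_only.sort(key=lambda it: it["time"], reverse=True)
    return past_only[:k]

corpus = [{"id": i, "time": i} for i in range(10)]
cand = {"id": 99, "time": 5}
nbrs = causal_neighbors(cand, corpus, k=3)
assert all(n["time"] < cand["time"] for n in nbrs)  # no future leakage
```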

Comment simulator. Top-5 high-scoring kNN neighbors → Gemini Flash with a 5-shot prompt → 5 archetypal comment outputs (the-skeptic, the-pedant, the-tangent, the-supportive, the-correction). Each comment cites the neighbor that motivated it — grounded, not riffed.

Rewriter. Same neighbor-grounded prompt; generates 3-5 title variants, scores each through the live predictor, keeps the ones that beat the base score. Auto-improve runs N rounds of this and visualizes the climb.
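The auto-improve loop is plain hill-climbing. A hedged sketch with toy stand-ins — a fake scorer in place of the LightGBM predictor and a fake generator in place of Gemini:

```python
def hill_climb(title, generate, score, rounds=3):
    """Each round: generate variants, score them, keep the best only if it
    beats the current champion (the 'keep the climbers' rule)."""
    best, best_s = title, score(title)
    for _ in range(rounds):
        variants = generate(best)
        top = max(variants, key=score)
        if score(top) > best_s:  # only keep climbers
            best, best_s = top, score(top)
    return best, best_s

# Toy stand-ins (the real system calls Gemini + the compiled predictor):
score = lambda t: len(t)                       # pretend longer titles score higher
generate = lambda t: [t + "!", t + " (2025)"]  # pretend LLM rewrites
best, s = hill_climb("Show HN: Foo", generate, score)
```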

Calibration ledger. A cron hits the predict endpoint every 10 min, pulls HN's top 30 via Firebase, scores each through the same model, writes to hn_frontpage_snapshots. The /predictions page renders the predicted-vs-actual delta in real time.
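The delta the ledger publishes is simple arithmetic on log scores. A sketch of a per-snapshot summary (field names illustrative, not the actual column names):

```python
import math

def snapshot_delta(rows):
    """Per-snapshot calibration summary on log1p(points), the same scale
    the model trains on: signed mean = bias, absolute mean = magnitude."""
    errs = [math.log1p(r["predicted"]) - math.log1p(r["actual"]) for r in rows]
    n = len(errs)
    return {
        "mean_delta": sum(errs) / n,                      # + means overshooting
        "mean_abs_delta": sum(abs(e) for e in errs) / n,  # typical miss size
    }

rows = [{"predicted": 120, "actual": 95}, {"predicted": 30, "actual": 60}]
d = snapshot_delta(rows)
```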

→ Full request flow + training flow diagrams in docs/ARCHITECTURE.md.


Routing

This repo uses Next.js Pages Router. Routes live under src/pages/hackernews/* rather than at the root — that's the layout the engine ships in, because in production a middleware rewrite serves hackernews.foresyn.ai/<path> from /hackernews/<path>.

You have two options:

A. Keep the /hackernews/ prefix (zero refactor). Pages live at localhost:3000/hackernews, localhost:3000/hackernews/news, etc. The included src/pages/index.tsx redirects / to /hackernews so the root URL works.

B. Mount at the root. Move every file from src/pages/hackernews/<x> to src/pages/<x> (and src/pages/api/hackernews/<x> to src/pages/api/<x>), then search-and-replace /hackernews/ with / in JSX. Components and lib don't change.

For subdomain deployment (hn.example.com), keep option A + add a host-conditional rewrite:

// next.config.js
module.exports = {
  async rewrites() {
    return [{
      source: '/:path*',
      destination: '/hackernews/:path*',
      has: [{ type: 'host', value: 'hn.example.com' }],
    }];
  },
};

What this isn't

  • Not a replacement for posting good content. The predictor adds maybe ~0.3 ρ of signal over time-of-day-and-domain heuristics. The remaining variance is the actual content + early-vote luck.
  • Not a paper-grade benchmark. Holdout ρ ≈ 0.33 is in line with the published HN-prediction literature (see the metrics table). Don't expect ρ > 0.6 from an honest setup; if you reproduce above that, double-check your splits.
  • Not corpus-complete. The Algolia-ingested corpus is biased toward stories that already cleared HN's "show me content" minimums. The bottom 80% of submissions (the /newest dropouts) aren't in the training data. Predictions for low-traction drafts are weakest.
  • Not a comment oracle. The simulator is a riff on what kinds of comments similar posts attracted: directionally honest, not literally predictive.
  • Not affiliated with Hacker News or Y Combinator. The clone is loving, not a mark.

Contributing

PRs welcome. A few high-leverage directions:

  • Better features. Author features (karma, prior post mean log-score) would lift ρ meaningfully if you have the author data ingested. Comment-payload features (early-comment sentiment, top-commenter karma) too.
  • A second corpus source. Algolia has score ceilings. bigquery-public-data.hacker_news.full has full historical fidelity. A scripts/ingest_bigquery.py is missing and would be welcome.
  • Per-K precision in the calibration trend. Spearman ρ is fine but Precision@K=30 is what HN's audience actually cares about (the front page is 30 slots). Surface it on /predictions.
  • Author handle. The current submit flow is anonymous. Threading the post through a captcha + email-handle would unlock attribution and the "Foresyn-scored draft → real HN crossover" loop.

File an issue first for anything substantial — the architecture has opinionated choices (m2cgen-compiled JS over ONNX, neighbor-grounded comments over fine-tuning, time-causal kNN, halfvec hybrid) that are worth discussing before refactoring.


Acknowledgments

  • Paul Houle — 2014 title-only logistic regression baseline that everyone's been trying to beat for a decade.
  • Philipp Dubach — 2024-25 BERT fine-tuning work + the honest accuracy numbers that grounded our target band.
  • Marc-André Sollami — 2018 NN baseline.
  • m2cgen — the only reason this ships LightGBM-grade accuracy without an ML runtime.
  • pgvector — halfvec + binary-quantize support is the entire reason the kNN is fast.
  • Hacker News and Y Combinator — for the platform, the trade dress, and the patience for clones.

License

MIT.

Built by Artemii Novoselov. x.com/earthml1 · LinkedIn.
