gharibyan/agentrail

AgentRail

AgentRail is a Cloudflare edge layer that gives known AI agents deterministic Markdown responses from the same URLs humans already visit.

Browser or search crawler -> /pricing -> origin HTML
Known AI agent           -> /pricing -> generated Markdown if ready
Known AI agent           -> /pricing -> origin HTML if Markdown is unavailable

The crawler runs in the background. Request handling never waits for extraction, so cache misses fall through to the original site without adding generation latency. When a known AI agent requests a page that is not in KV yet, AgentRail returns the origin page and uses ctx.waitUntil to warm KV from that same origin response. A later AI-agent request can then receive the prepared Markdown.
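The miss path described above can be sketched roughly as follows. This is a minimal illustration, not AgentRail's actual Worker code: `KvLike`, `CtxLike`, `Served`, and `serveKnownAgent` are hypothetical names, the KV value is simplified to a plain Markdown string, and the origin fetch and extractor are passed in as functions so the sketch runs without the Workers runtime.

```typescript
// Minimal stand-ins for the KV binding and execution context (hypothetical shapes).
interface KvLike {
  get(key: string): Promise<string | null>;
  put(key: string, value: string): Promise<void>;
}
interface CtxLike {
  waitUntil(promise: Promise<unknown>): void;
}
interface Served {
  contentType: "text/markdown" | "text/html";
  body: string;
}

// Hypothetical helper: serve Markdown on a KV hit; on a miss, return the
// origin HTML right away and warm KV from that same HTML in the background.
async function serveKnownAgent(
  key: string,
  kv: KvLike,
  ctx: CtxLike,
  fetchOrigin: () => Promise<string>,
  extract: (html: string) => string,
): Promise<Served> {
  const markdown = await kv.get(key);
  if (markdown !== null) {
    return { contentType: "text/markdown", body: markdown };
  }

  const html = await fetchOrigin();
  // The caller does not await this promise: extraction and the KV write
  // are handed to waitUntil, so they add no latency to the response.
  ctx.waitUntil(Promise.resolve().then(() => kv.put(key, extract(html))));
  return { contentType: "text/html", body: html };
}
```

In this sketch the first call for a key returns HTML and the second returns Markdown once the promise passed to `waitUntil` has settled, which mirrors the "first miss falls through, next request is served from KV" behavior.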

E2E Flow

flowchart TD
  browser["Human browser"] --> worker["Cloudflare Worker route"]
  search["Search crawler"] --> worker
  ai["Known AI agent"] --> worker

  worker --> classify{"Classify request"}
  classify -->|"Browser, search crawler, unknown bot, asset, or non-GET/HEAD"| origin["Origin website HTML"]
  classify -->|"Known AI agent"| kvcheck{"KV record exists?"}

  kvcheck -->|"ready or fresh stale"| markdown["Return deterministic Markdown"]
  markdown --> headers["text/markdown + x-ai-response-layer"]

  kvcheck -->|"missing"| originfetch["Fetch origin HTML"]
  originfetch --> firstbot["Return origin HTML to first bot"]
  originfetch --> waituntil["ctx.waitUntil warmup"]
  waituntil --> extract["Extract deterministic Markdown"]
  extract --> store["Store page:<normalized-url> in AGENTRAIL_RESOURCES KV"]

  kvcheck -->|"pending, failed, skipped, or too stale"| origin
  cron["Cloudflare Cron Trigger"] --> sitemap["Fetch sitemap"]
  sitemap --> crawl["Crawl sitemap URLs"]
  crawl --> extract

  store --> nextbot["Next AI-agent request"]
  nextbot --> kvcheck

What It Includes

  • @agentrail/bot-detector: classifies AI agents, search crawlers, browsers, and unknown bots.
  • @agentrail/markdown-extractor: deterministic HTML to Markdown extraction.
  • @agentrail/crawler: sitemap parsing, link discovery, resource keys, and crawl processing.
  • @agentrail/worker: Cloudflare Worker runtime.
  • create-agentrail: scaffold generator for Cloudflare projects.
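As an illustration of the resource keys mentioned above, here is a minimal sketch of building a `page:<normalized-url>` key. The normalization rules shown (lowercased host via the URL parser, fragment dropped, trailing slash removed except on the root path) are assumptions for the example, and `resourceKey` is a hypothetical name; the actual rules in @agentrail/crawler may differ.

```typescript
// Hypothetical key builder; the normalization rules are assumptions.
function resourceKey(rawUrl: string): string {
  const url = new URL(rawUrl); // the URL parser lowercases the host
  // Drop a trailing slash on every path except the root.
  let path = url.pathname;
  if (path.length > 1 && path.endsWith("/")) {
    path = path.slice(0, -1);
  }
  // The fragment is ignored; the query string is kept as-is.
  return `page:${url.origin}${path}${url.search}`;
}
```

Under these assumed rules, `https://Example.com/pricing/#plans` and `https://example.com/pricing` map to the same key, so both requests would read the same KV record.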

Quick Test

AgentRail expects Node 22 or newer, which current Wrangler 4 releases also require.

npm test

The repository uses Node's built-in test runner and has no runtime test dependency.

Generate A Site Project

From this repository:

node --import tsx packages/create-agentrail/bin/create-agentrail.ts my-site \
  --origin=https://example.com \
  '--route=example.com/*' \
  --schedule="0 */6 * * *"

The CLI checks Cloudflare through Wrangler and reuses an existing AGENTRAIL_RESOURCES KV namespace if one is present, or creates one if it is missing. When that setup succeeds, the generated project contains a Wrangler-compatible Worker entrypoint and a config with the real KV namespace id already written into wrangler.jsonc. If automatic setup is skipped or fails, the config keeps a placeholder and the generated README explains the manual KV setup.

It also runs npm install inside the generated project by default, so the normal next step is to deploy:

cd my-site
npm run deploy

AgentRail includes a Cron Trigger for background crawling. On a fresh Cloudflare account, open the Cloudflare dashboard and visit Workers & Pages once before the first deploy. Cloudflare creates the required workers.dev subdomain there. If npm run deploy fails with Cloudflare code: 10063, do that dashboard step and rerun the deploy command.

If you want to generate files only:

node --import tsx packages/create-agentrail/bin/create-agentrail.ts my-site \
  --origin=https://example.com \
  '--route=example.com/*' \
  --skip-install

If you are offline, not logged into Wrangler, or want to wire Cloudflare later:

node --import tsx packages/create-agentrail/bin/create-agentrail.ts my-site \
  --origin=https://example.com \
  '--route=example.com/*' \
  --skip-cloudflare

The generated wrangler.jsonc will contain this placeholder until you add the real KV namespace id:

{
  "binding": "AGENTRAIL_RESOURCES",
  "id": "replace-with-agentrail-resources-kv-id"
}

If you already have a namespace id:

node --import tsx packages/create-agentrail/bin/create-agentrail.ts my-site \
  --origin=https://example.com \
  '--route=example.com/*' \
  --kv-id=your-kv-namespace-id

Manual KV Namespace Setup

Use this when automatic Cloudflare setup was skipped or failed.

First make sure Wrangler is logged in:

npx wrangler login

Check whether the namespace already exists:

npx wrangler kv namespace list --json

If the output includes a namespace with "title": "AGENTRAIL_RESOURCES", copy its "id".

If it does not exist, create it:

npx wrangler kv namespace create AGENTRAIL_RESOURCES

Wrangler prints an id. It may look like this:

id = "abc123..."

Paste that id into wrangler.jsonc:

{
  "kv_namespaces": [
    {
      "binding": "AGENTRAIL_RESOURCES",
      "id": "abc123..."
    }
  ]
}

Then deploy:

npm install
npm run deploy

Generated projects are local deployment workspaces. Keep them under projects/; that folder is ignored so your site-specific Cloudflare config does not get committed to the AgentRail source repo.

Deploy This Worker Directly

Copy the example config and edit the route and origin:

cp wrangler.example.jsonc wrangler.jsonc

Follow the manual KV setup above if AGENTRAIL_RESOURCES is not configured yet, then deploy:

npm install
npm run deploy

If this is the first Worker on the Cloudflare account, open Workers & Pages in the Cloudflare dashboard once before deploying so Cloudflare creates the required workers.dev subdomain for cron schedules.

Runtime Contract

AgentRail only returns Markdown when a stored resource is safe to serve:

  • ready: return Markdown.
  • stale: return Markdown only inside the configured stale window.
  • missing, pending, failed, skipped, or too stale: pass through to origin.

Humans, traditional search crawlers, unknown bots, assets, and non-GET/HEAD requests always pass through to origin. Known AI-agent GET requests with no KV record also schedule a background warmup from the origin response before passing through. That keeps the first miss fast and prepares the next bot request.
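The contract above can be read as a small decision function. The sketch below is illustrative only: `ResourceRecord`, its `staleSince` field, and `canServeMarkdown` are hypothetical names, and measuring the stale window from the moment a record became stale is an assumption.

```typescript
type ResourceStatus = "ready" | "stale" | "pending" | "failed" | "skipped";

// Hypothetical record shape; the real KV record may carry other fields.
interface ResourceRecord {
  status: ResourceStatus;
  markdown: string;
  staleSince?: number; // epoch milliseconds, set when the record became stale
}

// Returns true only when the stored Markdown is safe to serve.
function canServeMarkdown(
  record: ResourceRecord | null,
  now: number,
  staleWindowMs: number,
): boolean {
  if (record === null) return false; // missing: pass through to origin
  if (record.status === "ready") return true;
  if (record.status === "stale" && record.staleSince !== undefined) {
    // Serve stale Markdown only inside the configured window.
    return now - record.staleSince <= staleWindowMs;
  }
  return false; // pending, failed, skipped, or too stale
}
```

Any `false` result corresponds to the pass-through path, so an uncertain record state never blocks or alters the origin response.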

Default AI-Agent Bots

AgentRail treats these user agents as AI-agent traffic by default:

Applebot
GPTBot
ChatGPT-User
OAI-SearchBot
Google-CloudVertexBot
ClaudeBot
Claude-User
Claude-SearchBot
Anthropic-AI
PerplexityBot
Perplexity-User
YouBot
Cohere-AI
Amazonbot
Anchor Browser
Bytespider
Cloudflare Crawler
CCBot
DuckAssistBot
FacebookBot
Manus Bot
Meta-ExternalAgent
Meta-ExternalFetcher
MistralAI-User
Novellum AI Crawl
PetalBot
ProRataInc
TikTok Spider
Timpibot

Googlebot, Bingbot, DuckDuckBot, YandexBot, Baiduspider, archive.org_bot, Arquivo Web Crawler, Terracotta Bot, Slurp, and other traditional search crawlers stay on the origin path.
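To show how such a split might work, here is a minimal sketch that classifies a User-Agent header by case-insensitive substring match against a short excerpt of the lists above. The matching strategy, the `TrafficClass` type, and `classifyUserAgent` are assumptions for illustration; @agentrail/bot-detector may use different rules and also distinguishes browsers from unknown bots, which this sketch folds into "other".

```typescript
type TrafficClass = "ai-agent" | "search-crawler" | "other";

// Short excerpts of the lists above, not the full defaults.
const AI_AGENT_TOKENS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot", "CCBot"];
const SEARCH_CRAWLER_TOKENS = ["Googlebot", "Bingbot", "DuckDuckBot", "YandexBot", "Baiduspider"];

// Case-insensitive substring check against a token list.
function matchesAny(userAgent: string, tokens: string[]): boolean {
  const ua = userAgent.toLowerCase();
  return tokens.some((token) => ua.includes(token.toLowerCase()));
}

function classifyUserAgent(userAgent: string): TrafficClass {
  // Search crawlers are checked first so they always stay on the origin path.
  if (matchesAny(userAgent, SEARCH_CRAWLER_TOKENS)) return "search-crawler";
  if (matchesAny(userAgent, AI_AGENT_TOKENS)) return "ai-agent";
  return "other";
}
```

Only the "ai-agent" result would lead to a KV lookup; every other class goes straight to the origin.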

Basic Cloudflare Mode

The basic mode uses:

  • Worker routes for request switching.
  • Cron Trigger for sitemap crawling.
  • KV namespace named AGENTRAIL_RESOURCES for Markdown records.
  • Request-time warmup for AI-agent misses.
  • Persisted Worker logs through Cloudflare observability.

Cron can crawl sitemap pages directly into KV. A production deployment can add Queues and D1 later, but they are not required for the first useful version.
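The first step of that cron crawl, reading page URLs out of a sitemap, might look like the sketch below. It is a simplified illustration using a regular expression over `<loc>` elements rather than a full XML parser, and `extractSitemapUrls` is a hypothetical name, not the parser in @agentrail/crawler.

```typescript
// Hypothetical helper: collect the contents of every <loc> element.
function extractSitemapUrls(xml: string): string[] {
  const urls: string[] = [];
  const pattern = /<loc>\s*([^<\s]+)\s*<\/loc>/gi;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(xml)) !== null) {
    // Sitemap URLs are entity-escaped, so decode &amp; back to &.
    urls.push(match[1].replace(/&amp;/g, "&"));
  }
  return urls;
}
```

Each returned URL could then be fetched, extracted, and written to KV by the scheduled handler.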

Local Wrangler does not run Cron Triggers by itself. AgentRail's dev script uses --test-scheduled, so you can run npm run dev and trigger the crawler manually:

curl "http://localhost:8787/__scheduled?cron=0+*/6+*+*+*"

For deployed Workers, AgentRail enables persisted logs and invocation logs in wrangler.jsonc. Use npm run tail or the Cloudflare dashboard logs view to inspect requests while testing.

Generated Markdown

Each record stores Markdown with this shape:

# Page Title

Canonical URL: https://example.com/page
Last generated: 2026-06-03T00:00:00.000Z
Source: public HTML

## Description
Meta description or first meaningful paragraph.

## Content
Clean extracted page content.

The extractor preserves source ordering where practical and does not use LLM summarization.
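A minimal sketch of rendering that shape from already-extracted fields is shown here. The `PageExtract` interface and `renderMarkdown` function are hypothetical names used for illustration; only the output layout follows the template above.

```typescript
// Hypothetical input shape for one extracted page.
interface PageExtract {
  title: string;
  canonicalUrl: string;
  generatedAt: Date;
  description: string;
  content: string;
}

// Joins fixed lines in a fixed order, so the same input always
// produces byte-identical Markdown.
function renderMarkdown(page: PageExtract): string {
  return [
    `# ${page.title}`,
    "",
    `Canonical URL: ${page.canonicalUrl}`,
    `Last generated: ${page.generatedAt.toISOString()}`,
    "Source: public HTML",
    "",
    "## Description",
    page.description,
    "",
    "## Content",
    page.content,
  ].join("\n");
}
```

Because the output is plain string assembly with no model call, two runs over the same HTML and timestamp yield the same record.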

License

Apache-2.0. See LICENSE.
