Rust SDK and CLI for Cloudflare Browser Rendering crawl jobs.
This repository is organized as a Cargo workspace with two publishable crates:
- `cf_crawl`: SDK crate, for `cargo add cf_crawl`
- `cf_crawl_cli`: CLI package, for `cargo install cf_crawl_cli`, which installs the `cf_crawl` binary
They wrap Cloudflare's `/crawl` REST API with a typed, async client and a CLI that can:
- start crawl jobs;
- poll job status;
- wait for terminal completion;
- fetch paginated results;
- cancel running jobs;
- emit text, JSON, or NDJSON-friendly output.
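For example, a minimal end-to-end run (each command is covered in detail below; `wait` falls back to the last started job when the job ID is omitted):

```sh
cf_crawl start https://example.com --limit 50
cf_crawl wait --print-results --out records.ndjson
```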
Features:

- Typed request and response models with `serde`
- Async HTTP client built on `reqwest` + `tokio`
- Config loading from CLI flags, environment variables, and config file
- Retry/backoff for transient API failures
- TTY-aware wait spinner
- JSON and text output modes
- File export for result sets
- Shell completions
- Unit tests for config loading, history handling, and output formatting
Prerequisites:

- Rust toolchain with Cargo
- Cloudflare account ID
- Cloudflare API token with Browser Rendering access
Add the SDK to another Rust project:
```sh
cargo add cf_crawl
```

Install the CLI globally:

```sh
cargo install cf_crawl_cli
```

Build locally:

```sh
cargo build --release
```

The binary will be at:

```
target/release/cf_crawl
```
You can also run it directly with Cargo:
```sh
cargo run -p cf_crawl_cli -- <command>
```

Use the crate as a library when you want to integrate crawl jobs into another Rust service:

```rust
use cf_crawl::{
    ClientConfig, CloudflareClient, CrawlRequest, GetCrawlParams, RecordStatusFilter,
};
use secrecy::SecretString;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = CloudflareClient::new(ClientConfig::new(
        "your_account_id",
        SecretString::from("your_api_token".to_string()),
    ))?;

    let started = client
        .start_crawl(&CrawlRequest {
            url: "https://example.com".parse()?,
            limit: Some(100),
            depth: None,
            render: Some(true),
            source: None,
            formats: None,
            options: None,
            max_age: None,
            modified_since: None,
            user_agent: None,
            authenticate: None,
            set_extra_http_headers: None,
            goto_options: None,
            wait_for_selector: None,
            reject_resource_types: None,
            json_options: None,
        })
        .await?;

    let job = client
        .collect_records(
            &started.id,
            &GetCrawlParams::new().with_status(RecordStatusFilter::Completed),
        )
        .await?;

    println!("fetched {} records", job.records.len());
    Ok(())
}
```

Configuration precedence (highest to lowest):
- CLI flags
- Environment variables
- Config file
- Built-in defaults
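For example, a CLI flag overrides the corresponding environment variable, so `--limit 10` wins over `CF_CRAWL_DEFAULT_LIMIT=100` here:

```sh
CF_CRAWL_DEFAULT_LIMIT=100 cf_crawl start https://example.com --limit 10
```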
Supported environment variables:

- `CF_ACCOUNT_ID`
- `CF_API_TOKEN`
- `CF_CRAWL_CONFIG`
- `CF_CRAWL_DEFAULT_LIMIT`
- `CF_CRAWL_POLL_INTERVAL`
- `CF_CRAWL_PREFERRED_FORMAT`
Only commands that talk to Cloudflare require `CF_ACCOUNT_ID` and `CF_API_TOKEN`. Local-only commands such as `completions` and `start --dry-run` work without them.
Example:
```sh
export CF_ACCOUNT_ID=your_account_id
export CF_API_TOKEN=your_api_token
export CF_CRAWL_PREFERRED_FORMAT=markdown
export CF_CRAWL_DEFAULT_LIMIT=100
```

Default config file path on Unix-like systems:

```
~/.config/cf_crawl/config.toml
```
Example:
```toml
account_id = "your_account_id"
api_token = "your_api_token"
poll_interval = "5s"
preferred_format = "markdown"
default_limit = 100
```

`preferred_format` is used by `start` when no `--formats` value or payload-file format is provided. `default_limit` is used as the fallback limit for commands that support `--limit`, including `start`, `status`, `results`, and `wait`.
You can override the path with:
```sh
cf_crawl --config /path/to/config.toml <command>
```

Top-level help:
```sh
cf_crawl --help
```

Available commands:
`start`, `status`, `results`, `wait`, `cancel`, `completions`
Start a crawl:

```sh
cf_crawl start https://example.com --limit 100 --source sitemaps
```

Common options:
`--limit`, `--depth`, `--render`, `--no-render`, `--source`, `--formats`, `--include-pattern`, `--exclude-pattern`, `--include-subdomains`, `--include-external-links`, `--max-age`, `--modified-since`, `--user-agent`, `--wait-until`, `--goto-timeout-ms`, `--wait-for-selector`, `--wait-for-selector-timeout-ms`, `--wait-for-selector-visible`, `--reject-resource-type`, `--basic-auth-user`, `--basic-auth-password`, `--header`, `--payload-file`, `--dry-run`
Render the final JSON request without sending it:
```sh
cf_crawl start https://example.com --limit 25 --dry-run
```

`--dry-run` validates and prints the final request locally, so it does not require Cloudflare credentials.
Use a JSON payload file for advanced request authoring:
```sh
cf_crawl start --payload-file crawl.json
```

Values from explicit CLI flags override the payload file. The crawl URL can be supplied either positionally or inside the payload file.
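As a sketch, a minimal `crawl.json` might look like the following; the field names here are illustrative assumptions, and `start --dry-run` (shown above) prints the exact request JSON, so use it to confirm the expected keys:

```json
{
  "url": "https://example.com",
  "limit": 50,
  "render": true,
  "formats": ["markdown"]
}
```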
Check job status:

```sh
cf_crawl status <job_id>
```

If you omit `<job_id>`, the CLI falls back to the saved last started job and prints a note showing which job ID it resolved to.
Fetch a limited number of records along with the job:

```sh
cf_crawl status <job_id> --limit 5 --record-status completed
```

Wait for a job to reach a terminal state:

```sh
cf_crawl wait <job_id>
```

As with `status` and `results`, omitting `<job_id>` uses the saved last started job and prints a note in text mode.
With a custom interval:
```sh
cf_crawl wait <job_id> --interval 10s
```

Print or export records after completion:
```sh
cf_crawl wait <job_id> --print-results
cf_crawl wait <job_id> --print-results --out records.ndjson
```

When `--print-results` is enabled, `wait` fetches the full filtered result set after the job reaches its terminal state. `--limit` controls the page size used for those fetches. `--status` and `--out` are only valid together with `--print-results`.
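For example, exporting only completed records once the job finishes:

```sh
cf_crawl wait <job_id> --print-results --status completed --out completed.ndjson
```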
Fetch one page:
```sh
cf_crawl results <job_id> --limit 100
```

If `<job_id>` is omitted, `results` uses the saved last started job and prints the resolved job ID in text mode.
Fetch all pages:
```sh
cf_crawl results <job_id> --all
```

Filter by record status:
```sh
cf_crawl results <job_id> --status completed --all
```

Resume from a cursor:
```sh
cf_crawl results <job_id> --cursor <cursor>
```

Write results to disk:
```sh
cf_crawl results <job_id> --all --out records.ndjson
cf_crawl --json results <job_id> --all --out records.json
```

In `--json` mode, `results` and `wait --print-results` emit a single JSON object with `job` and `records` fields.
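The single-object shape makes JSON mode easy to post-process. For example, counting fetched records with `jq`:

```sh
cf_crawl --json results <job_id> --all | jq '.records | length'
```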
Cancel a running job:

```sh
cf_crawl cancel <job_id>
```

Use `cf_crawl cancel last` to explicitly cancel the saved last started job.
Generate shell completions:

```sh
cf_crawl completions zsh
cf_crawl completions bash
cf_crawl completions fish
```
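For example, zsh users can write the completion script to a directory on `fpath` (the `~/.zfunc` location below is a common convention, not something the CLI creates):

```sh
cf_crawl completions zsh > ~/.zfunc/_cf_crawl
```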
Default output is human-readable text. Enable JSON output globally:
```sh
cf_crawl --json status <job_id>
```

Behavior:
- text mode prints concise summaries or record lines
- JSON mode prints machine-readable JSON
- file output uses `.json` for pretty JSON and `.ndjson`/`.jsonl` for line-delimited records
- in text mode, result exports default to NDJSON unless the file extension is `.json`
Start a sitemap-driven crawl and wait for it:
```sh
job_id=$(cf_crawl --json start https://developers.cloudflare.com --source sitemaps | jq -r .id)
cf_crawl wait "$job_id"
```

Start a static crawl without rendering:
```sh
cf_crawl start https://example.com --no-render --formats markdown
```

Export only completed records:
```sh
cf_crawl results <job_id> --status completed --all --out completed.ndjson
```

Authenticated crawl with headers:
```sh
cf_crawl start https://internal.example.com \
  --header "Authorization=Bearer token" \
  --header "X-Team=docs"
```

Exit codes:

- `0`: success
- `2`: invalid input
- `3`: remote terminal/API error
- `4`: interrupted
- `5`: failure
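These codes make the CLI easy to script against; a minimal sketch:

```sh
cf_crawl wait "$job_id"
case $? in
  0) echo "crawl finished" ;;
  4) echo "wait interrupted" >&2; exit 4 ;;
  *) echo "crawl failed" >&2; exit 1 ;;
esac
```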
Format:
```sh
cargo fmt
```

Type-check:

```sh
cargo check
```

Lint:

```sh
cargo clippy --all-targets --all-features -- -D warnings
```

Test:
```sh
cargo test
```

Project layout:

```
src/
  api.rs
  cli.rs
  config.rs
  error.rs
  lib.rs
  main.rs
  model.rs
  output.rs
  tracing_init.rs
  cmd/
tests/
```
- This project targets the Cloudflare Browser Rendering crawl API, not general browser automation.
- The CLI models the common crawl request surface directly and allows additional request shaping through `--payload-file`, with explicit CLI flags taking precedence.
- Secrets are handled with `secrecy::SecretString`, but environment variables and local config files still need normal operational care.
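One way to keep the token out of source code while still using the typed client is to read it from the environment at startup. A minimal sketch using the same `secrecy` type the SDK example above expects (the helper name is illustrative, not part of the crate):

```rust
use secrecy::SecretString;

// Hypothetical helper: pull CF_API_TOKEN from the environment.
// SecretString redacts the value from Debug output, so it is safe to
// thread through logging-heavy code.
fn token_from_env() -> Option<SecretString> {
    std::env::var("CF_API_TOKEN").ok().map(SecretString::from)
}
```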