cf_crawl

Rust SDK and CLI for Cloudflare Browser Rendering crawl jobs.

This repository is organized as a Cargo workspace with two publishable crates:

  • cf_crawl: SDK crate for cargo add cf_crawl
  • cf_crawl_cli: CLI package for cargo install cf_crawl_cli, which installs the cf_crawl binary

Together, they wrap Cloudflare's /crawl REST API in a typed, async client and a CLI that can:

  • start crawl jobs;
  • poll job status;
  • wait for terminal completion;
  • fetch paginated results;
  • cancel running jobs;
  • emit text, JSON, or NDJSON-friendly output.

Features

  • Typed request and response models with serde
  • Async HTTP client built on reqwest + tokio
  • Config loading from CLI flags, environment variables, and config file
  • Retry/backoff for transient API failures
  • TTY-aware wait spinner
  • JSON and text output modes
  • File export for result sets
  • Shell completions
  • Unit tests for config loading, history handling, and output formatting

Requirements

  • Rust toolchain with Cargo
  • Cloudflare account ID
  • Cloudflare API token with Browser Rendering access

Installation

Add the SDK to another Rust project:

cargo add cf_crawl

Install the CLI globally:

cargo install cf_crawl_cli

Build locally:

cargo build --release

The binary will be at:

target/release/cf_crawl

You can also run it directly with Cargo:

cargo run -p cf_crawl_cli -- <command>

SDK Usage

Use the crate as a library when you want to integrate crawl jobs into another Rust service:

use cf_crawl::{
    ClientConfig, CloudflareClient, CrawlRequest, GetCrawlParams, RecordStatusFilter,
};
use secrecy::SecretString;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = CloudflareClient::new(
        ClientConfig::new("your_account_id", SecretString::from("your_api_token".to_string())),
    )?;

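    // Start the crawl; the response includes the job ID used below.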
    let started = client
        .start_crawl(&CrawlRequest {
            url: "https://example.com".parse()?,
            limit: Some(100),
            depth: None,
            render: Some(true),
            source: None,
            formats: None,
            options: None,
            max_age: None,
            modified_since: None,
            user_agent: None,
            authenticate: None,
            set_extra_http_headers: None,
            goto_options: None,
            wait_for_selector: None,
            reject_resource_types: None,
            json_options: None,
        })
        .await?;

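    // Collect the records that finished crawling for this job.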
    let job = client
        .collect_records(
            &started.id,
            &GetCrawlParams::new().with_status(RecordStatusFilter::Completed),
        )
        .await?;

    println!("fetched {} records", job.records.len());
    Ok(())
}

Configuration

Configuration precedence:

  1. CLI flags
  2. Environment variables
  3. Config file
  4. Built-in defaults
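
As a minimal sketch of this layering (the function, field names, and default value here are illustrative, not the crate's actual internals), each setting falls through the layers in order:

use std::env;

// Illustrative only: resolve one numeric setting through the documented layers.
fn resolve_limit(cli_flag: Option<u32>, file_value: Option<u32>) -> u32 {
    let env_value: Option<u32> = env::var("CF_CRAWL_DEFAULT_LIMIT")
        .ok()
        .and_then(|v| v.parse().ok());
    cli_flag              // 1. CLI flag wins
        .or(env_value)    // 2. then the environment variable
        .or(file_value)   // 3. then the config file
        .unwrap_or(10)    // 4. built-in default (illustrative value)
}

fn main() {
    // With no flag and CF_CRAWL_DEFAULT_LIMIT unset, the config-file value wins.
    println!("resolved limit: {}", resolve_limit(None, Some(100)));
}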

Environment variables

Supported:

  • CF_ACCOUNT_ID
  • CF_API_TOKEN
  • CF_CRAWL_CONFIG
  • CF_CRAWL_DEFAULT_LIMIT
  • CF_CRAWL_POLL_INTERVAL
  • CF_CRAWL_PREFERRED_FORMAT

Only commands that talk to Cloudflare require CF_ACCOUNT_ID and CF_API_TOKEN. Local-only commands such as completions and start --dry-run work without them.

Example:

export CF_ACCOUNT_ID=your_account_id
export CF_API_TOKEN=your_api_token
export CF_CRAWL_PREFERRED_FORMAT=markdown
export CF_CRAWL_DEFAULT_LIMIT=100

Config file

Default path on Unix-like systems:

~/.config/cf_crawl/config.toml

Example:

account_id = "your_account_id"
api_token = "your_api_token"
poll_interval = "5s"
preferred_format = "markdown"
default_limit = 100

preferred_format is used by start when no --formats value or payload-file format is provided. default_limit is used as the fallback limit for commands that support --limit, including start, status, results, and wait.
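
For reference, a file of this shape could be deserialized with a struct along these lines (a sketch; the struct, and the use of the humantime_serde crate for the "5s" duration syntax, are assumptions rather than the crate's actual types):

use serde::Deserialize;
use std::time::Duration;

#[derive(Debug, Deserialize)]
struct FileConfig {
    account_id: Option<String>,
    api_token: Option<String>,
    // Parses human-friendly durations such as "5s".
    #[serde(default, with = "humantime_serde::option")]
    poll_interval: Option<Duration>,
    preferred_format: Option<String>,
    default_limit: Option<u32>,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let text = std::fs::read_to_string("config.toml")?;
    let cfg: FileConfig = toml::from_str(&text)?;
    println!("poll interval: {:?}", cfg.poll_interval);
    Ok(())
}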

You can override the path with:

cf_crawl --config /path/to/config.toml <command>

Usage

Top-level help:

cf_crawl --help

Available commands:

  • start
  • status
  • results
  • wait
  • cancel
  • completions

Commands

Start a crawl

cf_crawl start https://example.com --limit 100 --source sitemaps

Common options:

  • --limit
  • --depth
  • --render
  • --no-render
  • --source
  • --formats
  • --include-pattern
  • --exclude-pattern
  • --include-subdomains
  • --include-external-links
  • --max-age
  • --modified-since
  • --user-agent
  • --wait-until
  • --goto-timeout-ms
  • --wait-for-selector
  • --wait-for-selector-timeout-ms
  • --wait-for-selector-visible
  • --reject-resource-type
  • --basic-auth-user
  • --basic-auth-password
  • --header
  • --payload-file
  • --dry-run

Render the final JSON request without sending it:

cf_crawl start https://example.com --limit 25 --dry-run

--dry-run validates and prints the final request locally, so it does not require Cloudflare credentials.

Use a JSON payload file for advanced request authoring:

cf_crawl start --payload-file crawl.json

Values from explicit CLI flags override the payload file. The crawl URL can be supplied either positionally or inside the payload file.
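
Because the payload mirrors the request model, one way to author crawl.json is to serialize a CrawlRequest from the SDK. This is a sketch only: it assumes CrawlRequest implements Serialize and Default, and that the CLI parses the payload with the same serde model; if there is no Default impl, spell out each field as in the SDK example above.

use cf_crawl::CrawlRequest;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Assumes a Default impl; otherwise initialize every field explicitly.
    let request = CrawlRequest {
        url: "https://example.com".parse()?,
        limit: Some(25),
        render: Some(true),
        ..Default::default()
    };
    std::fs::write("crawl.json", serde_json::to_string_pretty(&request)?)?;
    Ok(())
}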

Check job status

cf_crawl status <job_id>

If you omit <job_id>, the CLI falls back to the saved last started job and prints a note showing which job ID it resolved to.

Fetch a limited number of records along with the job status:

cf_crawl status <job_id> --limit 5 --record-status completed

Wait for completion

cf_crawl wait <job_id>

As with status and results, omitting <job_id> uses the saved last started job and prints a note in text mode.

With a custom interval:

cf_crawl wait <job_id> --interval 10s

Print or export records after completion:

cf_crawl wait <job_id> --print-results
cf_crawl wait <job_id> --print-results --out records.ndjson

When --print-results is enabled, wait fetches the full filtered result set after the job reaches its terminal state. --limit controls the page size used for those fetches.

--status and --out are only valid together with --print-results.

Fetch results

Fetch one page:

cf_crawl results <job_id> --limit 100

If <job_id> is omitted, results uses the saved last started job and prints the resolved job ID in text mode.

Fetch all pages:

cf_crawl results <job_id> --all

Filter by record status:

cf_crawl results <job_id> --status completed --all

Resume from a cursor:

cf_crawl results <job_id> --cursor <cursor>

Write results to disk:

cf_crawl results <job_id> --all --out records.ndjson
cf_crawl --json results <job_id> --all --out records.json

In --json mode, results and wait --print-results emit a single JSON object with job and records fields.
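
Downstream tooling can rely on that shape. A sketch of consuming it (the concrete record fields are not documented here, so serde_json::Value stands in):

use serde::Deserialize;
use serde_json::Value;

#[derive(Deserialize)]
struct ResultsOutput {
    job: Value,          // job summary, as emitted by the CLI
    records: Vec<Value>, // one entry per crawled record
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let text = std::fs::read_to_string("records.json")?;
    let output: ResultsOutput = serde_json::from_str(&text)?;
    println!("{} records", output.records.len());
    Ok(())
}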

Cancel a crawl

cf_crawl cancel <job_id>

Use cf_crawl cancel last to explicitly cancel the saved last started job.

Generate shell completions

cf_crawl completions zsh
cf_crawl completions bash
cf_crawl completions fish

Output modes

Default output is human-readable text.

Enable JSON output globally:

cf_crawl --json status <job_id>

Behavior:

  • text mode prints concise summaries or record lines
  • JSON mode prints machine-readable JSON
  • file output uses .json for pretty JSON and .ndjson/.jsonl for line-delimited records
  • in text mode, result exports default to NDJSON unless the file extension is .json
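
The extension rule above amounts to the following (a sketch with illustrative names, not the CLI's internals):

use std::path::Path;

#[derive(Debug, PartialEq)]
enum ExportFormat {
    PrettyJson, // .json: one pretty-printed JSON document
    Ndjson,     // .ndjson/.jsonl and, in text mode, everything else
}

fn export_format_for(path: &Path) -> ExportFormat {
    match path.extension().and_then(|e| e.to_str()) {
        Some("json") => ExportFormat::PrettyJson,
        _ => ExportFormat::Ndjson,
    }
}

fn main() {
    assert_eq!(export_format_for(Path::new("records.json")), ExportFormat::PrettyJson);
    assert_eq!(export_format_for(Path::new("records.ndjson")), ExportFormat::Ndjson);
}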

Examples

Start a sitemap-driven crawl and wait for it:

job_id=$(cf_crawl --json start https://developers.cloudflare.com --source sitemaps | jq -r .id)
cf_crawl wait "$job_id"

Start a static crawl without rendering:

cf_crawl start https://example.com --no-render --formats markdown

Export only completed records:

cf_crawl results <job_id> --status completed --all --out completed.ndjson

Authenticated crawl with headers:

cf_crawl start https://internal.example.com \
  --header "Authorization=Bearer token" \
  --header "X-Team=docs"

Exit codes

  • 0: success
  • 2: invalid input
  • 3: remote terminal/API error
  • 4: interrupted
  • 5: failure
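
These codes make the CLI straightforward to script against. For example, a Rust wrapper could branch on them like this (a sketch; the job ID is a placeholder):

use std::process::Command;

fn main() -> std::io::Result<()> {
    let status = Command::new("cf_crawl")
        .args(["wait", "your_job_id"])
        .status()?;
    match status.code() {
        Some(0) => println!("job finished"),
        Some(2) => eprintln!("invalid input"),
        Some(3) => eprintln!("remote terminal/API error"),
        Some(4) => eprintln!("interrupted"),
        _ => eprintln!("failure"),
    }
    Ok(())
}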

Development

Format:

cargo fmt

Type-check:

cargo check

Lint:

cargo clippy --all-targets --all-features -- -D warnings

Test:

cargo test

Project layout

src/
  api.rs
  cli.rs
  config.rs
  error.rs
  lib.rs
  main.rs
  model.rs
  output.rs
  tracing_init.rs
  cmd/
tests/

Notes

  • This project targets the Cloudflare Browser Rendering crawl API, not general browser automation.
  • The CLI models the common crawl request surface directly and allows additional request shaping through --payload-file, with explicit CLI flags taking precedence.
  • Secrets are handled with secrecy::SecretString, but environment variables and local config files still need normal operational care.
