cf_crawl

Rust SDK and CLI for Cloudflare Browser Rendering crawl jobs.

This repository is organized as a Cargo workspace with two publishable crates:

  • cf_crawl: SDK crate for cargo add cf_crawl
  • cf_crawl_cli: CLI package for cargo install cf_crawl_cli, which installs the cf_crawl binary

Together, they wrap Cloudflare's /crawl REST API in a typed, async client and a CLI that can:

  • start crawl jobs;
  • poll job status;
  • wait for terminal completion;
  • fetch paginated results;
  • cancel running jobs;
  • emit text, JSON, or NDJSON-friendly output.

Features

  • Typed request and response models with serde
  • Async HTTP client built on reqwest + tokio
  • Config loading from CLI flags, environment variables, and config file
  • Retry/backoff for transient API failures
  • TTY-aware wait spinner
  • JSON and text output modes
  • File export for result sets
  • Shell completions
  • Unit tests for config loading, history handling, and output formatting

Requirements

  • Rust toolchain with Cargo
  • Cloudflare account ID
  • Cloudflare API token with Browser Rendering access

Installation

Add the SDK to another Rust project:

cargo add cf_crawl

Install the CLI globally:

cargo install cf_crawl_cli

Build locally:

cargo build --release

The binary will be at:

target/release/cf_crawl

You can also run it directly with Cargo:

cargo run -p cf_crawl_cli -- <command>

SDK Usage

Use the crate as a library when you want to integrate crawl jobs into another Rust service:

use cf_crawl::{
    ClientConfig, CloudflareClient, CrawlRequest, GetCrawlParams, RecordStatusFilter,
};
use secrecy::SecretString;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = CloudflareClient::new(
        ClientConfig::new("your_account_id", SecretString::from("your_api_token".to_string())),
    )?;

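    // Start the crawl; the response includes the job ID used below.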
    let started = client
        .start_crawl(&CrawlRequest {
            url: "https://example.com".parse()?,
            limit: Some(100),
            depth: None,
            render: Some(true),
            source: None,
            formats: None,
            options: None,
            max_age: None,
            modified_since: None,
            user_agent: None,
            authenticate: None,
            set_extra_http_headers: None,
            goto_options: None,
            wait_for_selector: None,
            reject_resource_types: None,
            json_options: None,
        })
        .await?;

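    // Collect the records that finished crawling for this job.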
    let job = client
        .collect_records(
            &started.id,
            &GetCrawlParams::new().with_status(RecordStatusFilter::Completed),
        )
        .await?;

    println!("fetched {} records", job.records.len());
    Ok(())
}

Configuration

Configuration precedence:

  1. CLI flags
  2. Environment variables
  3. Config file
  4. Built-in defaults
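
As a minimal sketch of this layering (the function, field names, and default value here are illustrative, not the crate's actual internals), each setting falls through the layers in order:

use std::env;

// Illustrative only: resolve one numeric setting through the documented layers.
fn resolve_limit(cli_flag: Option<u32>, file_value: Option<u32>) -> u32 {
    let env_value: Option<u32> = env::var("CF_CRAWL_DEFAULT_LIMIT")
        .ok()
        .and_then(|v| v.parse().ok());
    cli_flag              // 1. CLI flag wins
        .or(env_value)    // 2. then the environment variable
        .or(file_value)   // 3. then the config file
        .unwrap_or(10)    // 4. built-in default (illustrative value)
}

fn main() {
    // With no flag and CF_CRAWL_DEFAULT_LIMIT unset, the config-file value wins.
    println!("resolved limit: {}", resolve_limit(None, Some(100)));
}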

Environment variables

Supported:

  • CF_ACCOUNT_ID
  • CF_API_TOKEN
  • CF_CRAWL_CONFIG
  • CF_CRAWL_DEFAULT_LIMIT
  • CF_CRAWL_POLL_INTERVAL
  • CF_CRAWL_PREFERRED_FORMAT

Only commands that talk to Cloudflare require CF_ACCOUNT_ID and CF_API_TOKEN. Local-only commands such as completions and start --dry-run work without them.

Example:

export CF_ACCOUNT_ID=your_account_id
export CF_API_TOKEN=your_api_token
export CF_CRAWL_PREFERRED_FORMAT=markdown
export CF_CRAWL_DEFAULT_LIMIT=100

Config file

Default path on Unix-like systems:

~/.config/cf_crawl/config.toml

Example:

account_id = "your_account_id"
api_token = "your_api_token"
poll_interval = "5s"
preferred_format = "markdown"
default_limit = 100

preferred_format is used by start when no --formats value or payload-file format is provided. default_limit is used as the fallback limit for commands that support --limit, including start, status, results, and wait.
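
For reference, a file of this shape could be deserialized with a struct along these lines (a sketch; the struct, and the use of the humantime_serde crate for the "5s" duration syntax, are assumptions rather than the crate's actual types):

use serde::Deserialize;
use std::time::Duration;

#[derive(Debug, Deserialize)]
struct FileConfig {
    account_id: Option<String>,
    api_token: Option<String>,
    // Parses human-friendly durations such as "5s".
    #[serde(default, with = "humantime_serde::option")]
    poll_interval: Option<Duration>,
    preferred_format: Option<String>,
    default_limit: Option<u32>,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let text = std::fs::read_to_string("config.toml")?;
    let cfg: FileConfig = toml::from_str(&text)?;
    println!("poll interval: {:?}", cfg.poll_interval);
    Ok(())
}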

You can override the path with:

cf_crawl --config /path/to/config.toml <command>

Usage

Top-level help:

cf_crawl --help

Available commands:

  • start
  • status
  • results
  • wait
  • cancel
  • completions

Commands

Start a crawl

cf_crawl start https://example.com --limit 100 --source sitemaps

Common options:

  • --limit
  • --depth
  • --render
  • --no-render
  • --source
  • --formats
  • --include-pattern
  • --exclude-pattern
  • --include-subdomains
  • --include-external-links
  • --max-age
  • --modified-since
  • --user-agent
  • --wait-until
  • --goto-timeout-ms
  • --wait-for-selector
  • --wait-for-selector-timeout-ms
  • --wait-for-selector-visible
  • --reject-resource-type
  • --basic-auth-user
  • --basic-auth-password
  • --header
  • --payload-file
  • --dry-run

Render the final JSON request without sending it:

cf_crawl start https://example.com --limit 25 --dry-run

--dry-run validates and prints the final request locally, so it does not require Cloudflare credentials.

Use a JSON payload file for advanced request authoring:

cf_crawl start --payload-file crawl.json

Values from explicit CLI flags override the payload file. The crawl URL can be supplied either positionally or inside the payload file.
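
Because the payload mirrors the request model, one way to author crawl.json is to serialize a CrawlRequest from the SDK. This is a sketch only: it assumes CrawlRequest implements Serialize and Default, and that the CLI parses the payload with the same serde model; if there is no Default impl, spell out each field as in the SDK example above.

use cf_crawl::CrawlRequest;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Assumes a Default impl; otherwise initialize every field explicitly.
    let request = CrawlRequest {
        url: "https://example.com".parse()?,
        limit: Some(25),
        render: Some(true),
        ..Default::default()
    };
    std::fs::write("crawl.json", serde_json::to_string_pretty(&request)?)?;
    Ok(())
}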

Check job status

cf_crawl status <job_id>

If you omit <job_id>, the CLI falls back to the saved last started job and prints a note showing which job ID it resolved to.

Fetch a limited number of records along with the job status:

cf_crawl status <job_id> --limit 5 --record-status completed

Wait for completion

cf_crawl wait <job_id>

As with status and results, omitting <job_id> uses the saved last started job and prints a note in text mode.

With a custom interval:

cf_crawl wait <job_id> --interval 10s

Print or export records after completion:

cf_crawl wait <job_id> --print-results
cf_crawl wait <job_id> --print-results --out records.ndjson

When --print-results is enabled, wait fetches the full filtered result set after the job reaches its terminal state. --limit controls the page size used for those fetches.

--status and --out are only valid together with --print-results.

Fetch results

Fetch one page:

cf_crawl results <job_id> --limit 100

If <job_id> is omitted, results uses the saved last started job and prints the resolved job ID in text mode.

Fetch all pages:

cf_crawl results <job_id> --all

Filter by record status:

cf_crawl results <job_id> --status completed --all

Resume from a cursor:

cf_crawl results <job_id> --cursor <cursor>

Write results to disk:

cf_crawl results <job_id> --all --out records.ndjson
cf_crawl --json results <job_id> --all --out records.json

In --json mode, results and wait --print-results emit a single JSON object with job and records fields.
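
Downstream tooling can rely on that shape. A sketch of consuming it (the concrete record fields are not documented here, so serde_json::Value stands in):

use serde::Deserialize;
use serde_json::Value;

#[derive(Deserialize)]
struct ResultsOutput {
    job: Value,          // job summary, as emitted by the CLI
    records: Vec<Value>, // one entry per crawled record
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let text = std::fs::read_to_string("records.json")?;
    let output: ResultsOutput = serde_json::from_str(&text)?;
    println!("{} records", output.records.len());
    Ok(())
}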

Cancel a crawl

cf_crawl cancel <job_id>

Use cf_crawl cancel last to explicitly cancel the saved last started job.

Generate shell completions

cf_crawl completions zsh
cf_crawl completions bash
cf_crawl completions fish

Output modes

Default output is human-readable text.

Enable JSON output globally:

cf_crawl --json status <job_id>

Behavior:

  • text mode prints concise summaries or record lines
  • JSON mode prints machine-readable JSON
  • file output uses .json for pretty JSON and .ndjson/.jsonl for line-delimited records
  • in text mode, result exports default to NDJSON unless the file extension is .json
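
The extension rule above amounts to the following (a sketch with illustrative names, not the CLI's internals):

use std::path::Path;

#[derive(Debug, PartialEq)]
enum ExportFormat {
    PrettyJson, // .json: one pretty-printed JSON document
    Ndjson,     // .ndjson/.jsonl and, in text mode, everything else
}

fn export_format_for(path: &Path) -> ExportFormat {
    match path.extension().and_then(|e| e.to_str()) {
        Some("json") => ExportFormat::PrettyJson,
        _ => ExportFormat::Ndjson,
    }
}

fn main() {
    assert_eq!(export_format_for(Path::new("records.json")), ExportFormat::PrettyJson);
    assert_eq!(export_format_for(Path::new("records.ndjson")), ExportFormat::Ndjson);
}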

Examples

Start a sitemap-driven crawl and wait for it:

job_id=$(cf_crawl --json start https://developers.cloudflare.com --source sitemaps | jq -r .id)
cf_crawl wait "$job_id"

Start a static crawl without rendering:

cf_crawl start https://example.com --no-render --formats markdown

Export only completed records:

cf_crawl results <job_id> --status completed --all --out completed.ndjson

Authenticated crawl with headers:

cf_crawl start https://internal.example.com \
  --header "Authorization=Bearer token" \
  --header "X-Team=docs"

Exit codes

  • 0: success
  • 2: invalid input
  • 3: remote terminal/API error
  • 4: interrupted
  • 5: failure
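
These codes make the CLI straightforward to script against. For example, a Rust wrapper could branch on them like this (a sketch; the job ID is a placeholder):

use std::process::Command;

fn main() -> std::io::Result<()> {
    let status = Command::new("cf_crawl")
        .args(["wait", "your_job_id"])
        .status()?;
    match status.code() {
        Some(0) => println!("job finished"),
        Some(2) => eprintln!("invalid input"),
        Some(3) => eprintln!("remote terminal/API error"),
        Some(4) => eprintln!("interrupted"),
        _ => eprintln!("failure"),
    }
    Ok(())
}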

Development

Format:

cargo fmt

Type-check:

cargo check

Lint:

cargo clippy --all-targets --all-features -- -D warnings

Test:

cargo test

Project layout

src/
  api.rs
  cli.rs
  config.rs
  error.rs
  lib.rs
  main.rs
  model.rs
  output.rs
  tracing_init.rs
  cmd/
tests/

Notes

  • This project targets the Cloudflare Browser Rendering crawl API, not general browser automation.
  • The CLI models the common crawl request surface directly and allows additional request shaping through --payload-file, with explicit CLI flags taking precedence.
  • Secrets are handled with secrecy::SecretString, but environment variables and local config files still need normal operational care.
