edsu/cloudflare-crawl


This is a simple Python-based command line utility that uses Cloudflare's Crawl API to crawl a website, and then fetches the results to the filesystem once the job is complete. It was created to help me test the Cloudflare service, not to provide access to all the options that the service offers.

You run it with uv, which will create the job, poll until it's complete, and then download the data:

uvx https://github.com/edsu/cloudflare-crawl/ crawl https://example.com

created job 36f80f5e-d112-4506-8457-89719a158ce2
waiting for 36f80f5e-d112-4506-8457-89719a158ce2 to complete: total=1520 finished=837 skipped=1285
waiting for 36f80f5e-d112-4506-8457-89719a158ce2 to complete: total=1537 finished=868 skipped=1514
...
wrote 36f80f5e-d112-4506-8457-89719a158ce2-001.json
wrote 36f80f5e-d112-4506-8457-89719a158ce2-002.json
wrote 36f80f5e-d112-4506-8457-89719a158ce2-003.json
wrote 36f80f5e-d112-4506-8457-89719a158ce2-004.json
wrote 36f80f5e-d112-4506-8457-89719a158ce2-005.json
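Under the hood this is a create, poll, download loop. Here is a minimal sketch of the polling step in Python; the `get_status` callable and the `status`/`total`/`finished` field names are assumptions based on the example output above, standing in for the real Cloudflare API call:

```python
import time

def wait_for_crawl(get_status, interval=0.0):
    """Poll a status source until the crawl job reports completion.

    get_status is any zero-argument callable returning a dict with
    "status", "total", and "finished" keys (names assumed from the
    example output; the real API response may differ).
    """
    while True:
        status = get_status()
        if status["status"] == "completed":
            return status
        print(f"waiting to complete: total={status['total']} finished={status['finished']}")
        time.sleep(interval)

# Stand-in for the real API call, so the loop can be demonstrated offline:
_updates = iter([
    {"status": "running", "total": 1520, "finished": 837},
    {"status": "running", "total": 1537, "finished": 868},
    {"status": "completed", "total": 1967, "finished": 1967},
])
final = wait_for_crawl(lambda: next(_updates), interval=0)
```

In the real tool the callable would wrap an authenticated HTTP request to the job's status endpoint.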

If you don't want to wait for it to complete, you can also check up on a crawl using its ID:

uvx https://github.com/edsu/cloudflare-crawl/ status 36f80f5e-d112-4506-8457-89719a158ce2

id: 36f80f5e-d112-4506-8457-89719a158ce2
status: completed
browserSecondsUsed: 1382.8220786132817
total: 1967
finished: 1967
skipped: 6862
cursor: 1

Similarly, you can initiate the download separately once the job is complete:

uvx https://github.com/edsu/cloudflare-crawl/ download 36f80f5e-d112-4506-8457-89719a158ce2

wrote 36f80f5e-d112-4506-8457-89719a158ce2-001.json
wrote 36f80f5e-d112-4506-8457-89719a158ce2-002.json
wrote 36f80f5e-d112-4506-8457-89719a158ce2-003.json
wrote 36f80f5e-d112-4506-8457-89719a158ce2-004.json
wrote 36f80f5e-d112-4506-8457-89719a158ce2-005.json
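Once the numbered JSON files are on disk you can load them with the standard library. A sketch, assuming each file holds a single JSON document (the schema of the records inside is whatever the service returns):

```python
import glob
import json
import os
import tempfile

def load_results(job_id, directory="."):
    """Load every numbered result file for a job, in filename order."""
    records = []
    for path in sorted(glob.glob(os.path.join(directory, f"{job_id}-*.json"))):
        with open(path) as fh:
            records.append(json.load(fh))
    return records

# Quick demo against a throwaway directory (hypothetical file contents):
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "job-001.json"), "w") as fh:
        json.dump({"pages": 3}, fh)
    results = load_results("job", tmp)
```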

If this proves useful to others I could put it on PyPI. But there are a lot of options in Cloudflare's API that would probably need command line equivalents first.

Note: you will need to set these in your environment or in a .env file for the program to work:

  • CLOUDFLARE_TOKEN
  • CLOUDFLARE_ACCOUNT_ID

To create a token, go to the Cloudflare dashboard and create one with the Browser Rendering:Edit permission.
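If you are scripting against the same API yourself, a small helper like this can fail loudly when a required variable is unset. The helper name and the demo variable names are made up for illustration; in practice you would pass the two variables listed above:

```python
import os

def require_env(*names):
    """Fetch required environment variables, exiting with a message if any is unset."""
    missing = [n for n in names if not os.environ.get(n)]
    if missing:
        raise SystemExit(f"missing environment variables: {', '.join(missing)}")
    return [os.environ[n] for n in names]

# Demo with throwaway names so it runs anywhere:
os.environ.setdefault("DEMO_TOKEN", "abc123")
os.environ.setdefault("DEMO_ACCOUNT_ID", "42")
token, account_id = require_env("DEMO_TOKEN", "DEMO_ACCOUNT_ID")
```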

The analysis directory contains a Marimo notebook and some data for evaluating the crawling behavior of the service. You can view it with:

cd analysis
uv run marimo edit analysis.py
