edsu/cloudflare-crawl


This is a simple Python-based command line utility that uses Cloudflare's Crawl API to crawl a website, and then fetches the results to the filesystem once the job is complete. It was created to help me test the Cloudflare service, not to provide access to all the options that the service offers.

You run it with uv, which will create the job, poll until it's complete, and then download the data:

uvx https://github.com/edsu/cloudflare-crawl/ crawl https://example.com

created job 36f80f5e-d112-4506-8457-89719a158ce2
waiting for 36f80f5e-d112-4506-8457-89719a158ce2 to complete: total=1520 finished=837 skipped=1285
waiting for 36f80f5e-d112-4506-8457-89719a158ce2 to complete: total=1537 finished=868 skipped=1514
...
wrote 36f80f5e-d112-4506-8457-89719a158ce2-001.json
wrote 36f80f5e-d112-4506-8457-89719a158ce2-002.json
wrote 36f80f5e-d112-4506-8457-89719a158ce2-003.json
wrote 36f80f5e-d112-4506-8457-89719a158ce2-004.json
wrote 36f80f5e-d112-4506-8457-89719a158ce2-005.json
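Under the hood this is a create, poll, download loop. Here is a minimal sketch of the polling step in Python; the `get_status` callable and the `status`/`total`/`finished` field names are assumptions based on the example output above, standing in for the real Cloudflare API call:

```python
import time

def wait_for_crawl(get_status, interval=0.0):
    """Poll a status source until the crawl job reports completion.

    get_status is any zero-argument callable returning a dict with
    "status", "total", and "finished" keys (names assumed from the
    example output; the real API response may differ).
    """
    while True:
        status = get_status()
        if status["status"] == "completed":
            return status
        print(f"waiting to complete: total={status['total']} finished={status['finished']}")
        time.sleep(interval)

# Stand-in for the real API call, so the loop can be demonstrated offline:
_updates = iter([
    {"status": "running", "total": 1520, "finished": 837},
    {"status": "running", "total": 1537, "finished": 868},
    {"status": "completed", "total": 1967, "finished": 1967},
])
final = wait_for_crawl(lambda: next(_updates), interval=0)
```

In the real tool the callable would wrap an authenticated HTTP request to the job's status endpoint.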

If you don't want to wait for it to complete, you can also check up on a crawl using its ID:

uvx https://github.com/edsu/cloudflare-crawl/ status 36f80f5e-d112-4506-8457-89719a158ce2

id: 36f80f5e-d112-4506-8457-89719a158ce2
status: completed
browserSecondsUsed: 1382.8220786132817
total: 1967
finished: 1967
skipped: 6862
cursor: 1

Similarly, you can initiate the download separately once the job is complete:

uvx https://github.com/edsu/cloudflare-crawl/ download 36f80f5e-d112-4506-8457-89719a158ce2

wrote 36f80f5e-d112-4506-8457-89719a158ce2-001.json
wrote 36f80f5e-d112-4506-8457-89719a158ce2-002.json
wrote 36f80f5e-d112-4506-8457-89719a158ce2-003.json
wrote 36f80f5e-d112-4506-8457-89719a158ce2-004.json
wrote 36f80f5e-d112-4506-8457-89719a158ce2-005.json
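Once the numbered JSON files are on disk you can load them with the standard library. A sketch, assuming each file holds a single JSON document (the schema of the records inside is whatever the service returns):

```python
import glob
import json
import os
import tempfile

def load_results(job_id, directory="."):
    """Load every numbered result file for a job, in filename order."""
    records = []
    for path in sorted(glob.glob(os.path.join(directory, f"{job_id}-*.json"))):
        with open(path) as fh:
            records.append(json.load(fh))
    return records

# Quick demo against a throwaway directory (hypothetical file contents):
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "job-001.json"), "w") as fh:
        json.dump({"pages": 3}, fh)
    results = load_results("job", tmp)
```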

If this proves useful to others I could put it on PyPI. But there are a lot of options in Cloudflare's API that would probably need command line equivalents first.

Note: you will need to set these in your environment or in a .env file for the program to work:

  • CLOUDFLARE_TOKEN
  • CLOUDFLARE_ACCOUNT_ID

To create a token, go to the Cloudflare dashboard and create one with the Browser Rendering:Edit permission.
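If you are scripting against the same API yourself, a small helper like this can fail loudly when a required variable is unset. The helper name and the demo variable names are made up for illustration; in practice you would pass the two variables listed above:

```python
import os

def require_env(*names):
    """Fetch required environment variables, exiting with a message if any is unset."""
    missing = [n for n in names if not os.environ.get(n)]
    if missing:
        raise SystemExit(f"missing environment variables: {', '.join(missing)}")
    return [os.environ[n] for n in names]

# Demo with throwaway names so it runs anywhere:
os.environ.setdefault("DEMO_TOKEN", "abc123")
os.environ.setdefault("DEMO_ACCOUNT_ID", "42")
token, account_id = require_env("DEMO_TOKEN", "DEMO_ACCOUNT_ID")
```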

The analysis directory contains a Marimo notebook and some data for evaluating the crawling behavior of the service. You can view it with:

cd analysis
uv run marimo edit analysis.py
