[Browser Rendering] Crawl endpoint #26191
ToriLindsay wants to merge 7 commits into `production` from `tori/crawl-endpoint-br`.
Commits (by ToriLindsay and kathayl):

- abd8379 [Browser Rendering] Crawl endpoint
- 7f9a35f First draft
- c24feba Update crawl-endpoint.mdx
- 49095bb Update src/content/docs/browser-rendering/rest-api/crawl-endpoint.mdx
- f610363 Update src/content/docs/browser-rendering/rest-api/crawl-endpoint.mdx
- c8ad009 Removed the // comments from code block
- 9e0f736 Added Type component
`src/content/docs/browser-rendering/rest-api/crawl-endpoint.mdx` (210 additions, 0 deletions):
---
pcx_content_type: how-to
title: /crawl - Crawl web content
sidebar:
  order: 11
---

import { Type, MetaInfo, Render } from "~/components";
The `/crawl` endpoint automates the process of scraping content from webpages, starting from a single URL and crawling linked pages up to a specified number of pages or link depth. The response can be returned as HTML, Markdown, or JSON.

The `/crawl` endpoint respects the directives of `robots.txt` files, including `crawl-delay` and [`content-signal`](https://contentsignals.org/). All URLs that `/crawl` is directed not to crawl are listed in the response with `"status": "disallowed"`.
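What "respecting `robots.txt`" means can be sketched client-side with Python's standard `urllib.robotparser`. This is only an illustration of the rules the endpoint applies on its side, using a made-up `robots.txt`:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that sets a crawl delay and disallows one path.
robots_txt = """
User-agent: *
Crawl-delay: 5
Disallow: /private/
""".strip().splitlines()

parser = RobotFileParser()
parser.parse(robots_txt)

# URLs under /private/ correspond to entries the endpoint would
# return with "status": "disallowed".
print(parser.can_fetch("*", "https://example.com/docs/"))      # True
print(parser.can_fetch("*", "https://example.com/private/x"))  # False
print(parser.crawl_delay("*"))                                 # 5
```

The `crawl-delay` value here would pace requests to the same host; disallowed paths are never fetched at all.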
## Endpoint

```txt
https://api.cloudflare.com/client/v4/accounts/<account_id>/browser-rendering/crawl
```
## Required fields

You must provide `url`:

- `url` (string)
## Common use cases

- Scraping online content to build a knowledge base of up-to-date information
- Converting online content into LLM-friendly formats to train [Retrieval-Augmented Generation (RAG) applications](/reference-architecture/diagrams/ai/ai-rag/) and other AI systems
## Basic usage

There are two separate steps to using the `/crawl` endpoint:

1. [Initiate the crawl job](/browser-rendering/rest-api/crawl-endpoint/#initiate-the-crawl-job) — A `POST` request where you initiate the crawl and receive a response with a job `id`.
2. [Request results of the crawl job](/browser-rendering/rest-api/crawl-endpoint/#request-results-of-the-crawl-job) — A `GET` request where you request the status or results of the crawl.

:::note[Free plan limitation]
If you are on a Workers Free plan, your crawl may fail if it hits the [limit of 10 minutes per day](/browser-rendering/platform/pricing/). To avoid this, you can either [upgrade to a Workers Paid plan](/workers/platform/pricing/) or [set limits on timeouts](/browser-rendering/reference/timeouts/) to get the most out of the 10 minutes for your crawl request.
:::
### Initiate the crawl job

Here are the basic parameters you can use to initiate your crawl job:

- `url` <Type text="String" /> <MetaInfo text="Required" />
  - Starts crawling from this URL
- `limit` <Type text="Number" /> <MetaInfo text="Optional" />
  - Maximum number of pages to crawl (default is 10, maximum is 100,000)
- `depth` <Type text="Number" /> <MetaInfo text="Optional" />
  - Maximum link depth to crawl from the starting URL
- `formats` <Type text="Array of strings" /> <MetaInfo text="Optional" />
  - Response format (default is HTML; other options are Markdown and JSON)
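As a quick sanity check before sending a request, the documented defaults and bounds can be encoded in a small local helper. The helper itself is illustrative and not part of the API; only the field names and limits come from the parameter list above:

```python
def build_crawl_payload(url, limit=10, depth=None, formats=None):
    """Build a /crawl request body, enforcing the documented bounds.

    Defaults follow the docs: limit defaults to 10 (maximum 100,000)
    and the response format defaults to HTML.
    """
    if not url:
        raise ValueError("url is required")
    if not 1 <= limit <= 100_000:
        raise ValueError("limit must be between 1 and 100,000")
    allowed = {"html", "markdown", "json"}
    formats = formats or ["html"]
    if not set(formats) <= allowed:
        raise ValueError(f"formats must be a subset of {sorted(allowed)}")

    payload = {"url": url, "limit": limit, "formats": formats}
    if depth is not None:
        payload["depth"] = depth
    return payload

# Mirrors the curl example below.
print(build_crawl_payload("https://developers.cloudflare.com/workers/",
                          limit=50, depth=2, formats=["markdown"]))
```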
Here is an example that uses the basic parameters:

```bash
curl -X POST 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl' \
  -H 'Authorization: Bearer <apiToken>' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://developers.cloudflare.com/workers/",
    "limit": 50,
    "depth": 2,
    "formats": ["markdown"]
  }'
```

The API will respond immediately with a job `id` you will use to retrieve the status and results of the crawl job.
Here is an example of the response, which includes a job `id`:

```json output
{
  "result": {
    "id": "c7f8s2d9-a8e7-4b6e-8e4d-3d4a1b2c3f4e"
  },
  "success": true
}
```

See the [advanced usage section below](/browser-rendering/rest-api/crawl-endpoint/#advanced-usage) for additional parameters.
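A typical client polls the result endpoint until the job finishes. The sketch below takes the HTTP call as a callable so the loop itself stays self-contained; the terminal status values (`"complete"`, `"cancelled"`) are assumed from the examples on this page and may not be exhaustive:

```python
import time

def wait_for_crawl(fetch_result, poll_seconds=5, max_polls=60):
    """Poll a crawl job until it reaches a terminal status.

    `fetch_result` is any callable that GETs
    /browser-rendering/crawl/result/<id> and returns the decoded
    "result" object from the response.
    """
    for _ in range(max_polls):
        result = fetch_result()
        if result["status"] in ("complete", "cancelled"):
            return result
        time.sleep(poll_seconds)
    raise TimeoutError("crawl did not finish in time")

# Demo with a stubbed fetcher instead of a real HTTP call.
responses = iter([
    {"status": "running", "completed": 20, "total": 50},
    {"status": "complete", "completed": 50, "total": 50},
])
final = wait_for_crawl(lambda: next(responses), poll_seconds=0)
print(final["status"])  # complete
```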
### Request results of the crawl job

Here is an example of how you would check the status or request the results of your crawl job with the job `id` you were provided:

```bash
curl -X GET 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl/result/c7f8s2d9-a8e7-4b6e-8e4d-3d4a1b2c3f4e' \
  -H 'Authorization: Bearer YOUR_API_TOKEN'
```
Here is an example response:

```json output
{
  "result": {
    "id": "c7f8s2d9-a8e7-4b6e-8e4d-3d4a1b2c3f4e",
    "status": "complete",
    "browserTimeSpent": 134.7,
    "total": 50,
    "completed": 50,
    "entries": [
      {
        "url": "https://developers.cloudflare.com/workers/",
        "status": "completed",
        "markdown": "# Cloudflare Workers\nBuild and deploy serverless applications...",
        "html": null,
        "metadata": {
          "title": "Cloudflare Workers · Cloudflare Workers docs",
          "language": "en-US"
        }
      },
      {
        "url": "https://developers.cloudflare.com/workers/get-started/quickstarts/",
        "status": "completed",
        "markdown": "## Quickstarts\nGet up and running with a simple 'Hello World'...",
        "html": null,
        "metadata": {
          "title": "Quickstarts · Cloudflare Workers docs",
          "language": "en-US"
        }
      }
      // ... 48 more entries omitted for brevity
    ]
  },
  "success": true
}
```
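Once the job is complete, the `entries` array can be post-processed. For example, here is a sketch that collects the Markdown of successfully crawled pages while skipping anything blocked by `robots.txt`; it is based only on the response shape shown above:

```python
def collect_markdown(result):
    """Return {url: markdown} for entries that completed successfully.

    Entries with "status": "disallowed" (blocked by robots.txt) or any
    other non-completed status are skipped, as is any entry whose
    Markdown content is missing.
    """
    pages = {}
    for entry in result.get("entries", []):
        if entry.get("status") == "completed" and entry.get("markdown"):
            pages[entry["url"]] = entry["markdown"]
    return pages

# Shape mirrors the example response above.
result = {
    "status": "complete",
    "entries": [
        {"url": "https://developers.cloudflare.com/workers/",
         "status": "completed",
         "markdown": "# Cloudflare Workers\n..."},
        {"url": "https://example.com/blocked/",
         "status": "disallowed",
         "markdown": None},
    ],
}
print(list(collect_markdown(result)))  # only the completed URL
```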
### Cancel a crawl job

If you need to cancel a job that is currently in progress, here is an example of how to cancel a crawl job with the job `id` you were provided:

```bash
curl -X DELETE 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl/result/c7f8s2d9-a8e7-4b6e-8e4d-3d4a1b2c3f4e' \
  -H 'Authorization: Bearer YOUR_API_TOKEN'
```

A successful cancellation returns a `200 OK` status code, and the job status is updated to `cancelled`.
## Advanced usage

The `/crawl` endpoint has many parameters you can use to customize your crawl. For the full list, check the [API docs](https://developers.cloudflare.com/api/resources/browser_rendering/).

Here is an example that uses the additional parameters that are currently available, in addition to the [basic parameters shown in the example above](/browser-rendering/rest-api/crawl-endpoint/#initiate-the-crawl-job) and the [render parameter below](/browser-rendering/rest-api/crawl-endpoint/#choose-when-to-render-javascript):
```bash
curl -X POST 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl' \
  -H 'Authorization: Bearer <apiToken>' \
  -H 'Content-Type: application/json' \
  -d '{
    // Required: The URL to start crawling from
    "url": "https://www.exampledocs.com/docs/",

    // Optional: The maximum age of a cached resource that can be returned (in seconds)
    "maxAge": 7200,

    "options": {
      // Optional: If true, follows links to external domains (default is false)
      "includeExternalLinks": true,

      // Optional: If true, follows links to subdomains of the starting URL (default is false)
      "includeSubdomains": true,

      // Optional: Only visits URLs that match one of these patterns
      "includePatterns": [".*/api/v1/.*"],

      // Optional: Does not visit URLs that match any of these patterns
      "excludePatterns": [".*/learning-paths/.*"]
    }
  }'
```
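The include and exclude patterns above look like regular expressions. Assuming regex semantics (the API defines the exact matching rules, so treat this as an approximation), the filtering could be modeled as:

```python
import re

def url_allowed(url, include_patterns=None, exclude_patterns=None):
    """Approximate the includePatterns/excludePatterns filtering.

    Assumptions: each pattern is a regex matched against the full URL,
    excludes take priority over includes, and an empty include list
    allows every URL.
    """
    if exclude_patterns and any(re.fullmatch(p, url) for p in exclude_patterns):
        return False
    if include_patterns:
        return any(re.fullmatch(p, url) for p in include_patterns)
    return True

print(url_allowed("https://www.exampledocs.com/docs/api/v1/users",
                  include_patterns=[".*/api/v1/.*"]))          # True
print(url_allowed("https://www.exampledocs.com/docs/learning-paths/x",
                  exclude_patterns=[".*/learning-paths/.*"]))  # False
```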
### Choose when to render JavaScript

Use the `render` parameter to control whether the `/crawl` endpoint spins up a headless browser and executes page JavaScript. The default is `render: true`. Set `render: false` to do a fast HTML fetch without executing JavaScript.

Use `render: true` when the page builds content in the browser. Use `render: false` when the content you need is already in the initial HTML response.

Crawls with `render: false` are billed under [Workers pricing](/workers/platform/pricing/). Crawls with `render: true` use a headless browser and are billed under typical Browser Rendering pricing.

Here is an example of a request that uses the `render` parameter:
```bash
curl -X POST 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl' \
  -H 'Authorization: Bearer <apiToken>' \
  -H 'Content-Type: application/json' \
  -d '{
    // Required: The URL to start crawling from
    "url": "https://developers.cloudflare.com/workers/",

    // Optional: If false, only does a simple HTML fetch crawl (default is true)
    "render": false
  }'
```

<Render file="setting-custom-user-agent" product="browser-rendering" />

<Render file="faq" product="browser-rendering" />