Skip to content

Expose a browser-impersonation toggle directly on HTTP crawlers #1923

@vdusek

Description

@vdusek

Solutioning follow-up from #1911.

What happened: a plain GET worked with requests but 404'd through HttpCrawler. Not a Crawlee bug — all HTTP clients inject an Accept-Language header by default (browser impersonation), and this particular API 404s whenever that header is present.

The friction: turning impersonation off today means importing a client, disabling it there, and passing it back to the crawler — and the opt-out is named differently per client:

from crawlee.http_clients import ImpitHttpClient

crawler = HttpCrawler(http_client=ImpitHttpClient(browser=None))
# or HttpxHttpClient(header_generator=None)
# or CurlImpersonateHttpClient(impersonate=None)

That's a lot of ceremony for a common need.

Proposal — expose it on the HTTP crawlers. Add a simple flag so it's a one-liner with no client import:

crawler = HttpCrawler(impersonate=False)

Applies to HttpCrawler, BeautifulSoupCrawler, ParselCrawler (the AbstractHttpCrawler family). It would configure the default HTTP client under the hood. If the user passes their own http_client, the flag doesn't apply — they configure impersonation on that client directly (we should document this clearly, or raise on conflict).

Also: a short docs guide on default browser-like headers (esp. Accept-Language) — when impersonation helps vs. hurts.

Related: #1683, #1685, #1911

Metadata

Metadata

Assignees

Labels

enhancementNew feature or request.t-toolingIssues with this label are in the ownership of the tooling team.

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions