fix: honor custom respectRobotsTxtFile userAgent in enqueueLinks #3578
barjin merged 4 commits into apify:master
Conversation
Custom `respectRobotsTxtFile.userAgent` support reached the crawler-level robots checks, but `enqueueLinks()` still evaluated rules as if the catch-all `*` agent was in effect. This threads the configured user-agent through the enqueue path and adds a regression test that covers the user-visible mismatch.
- Constraint: keep the fix narrow and preserve the existing default `*` behavior when no custom user-agent is configured.
- Rejected: a broader robots.txt refactor; unnecessary for a localized regression fix.
- Confidence: high. Scope-risk: narrow. Reversibility: clean.
- Directive: any future robots.txt changes should keep crawler admission and enqueueLinks evaluation on the same user-agent source of truth.
- Tested: `node node_modules/vitest/vitest.mjs run test/core/crawlers/basic_crawler.test.ts -t 'enqueueLinks should respect custom user-agent robots.txt rules'`
- Not tested: full test suite; browser-backed E2E suites.
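A minimal sketch of the mismatch being fixed, assuming `isAllowed(url, userAgent)` as the evaluation call; the helper and its parameters are illustrative, not the actual Crawlee source:

```ts
// Illustrative only; `filterByRobots` and its signature are hypothetical.
interface RobotsTxtLike {
    isAllowed(url: string, userAgent?: string): boolean;
}

function filterByRobots(
    urls: string[],
    robotsTxtFile: RobotsTxtLike,
    robotsTxtUserAgent?: string,
): string[] {
    // Before the fix, the enqueue path effectively evaluated with the catch-all '*'
    // agent; after the fix, the crawler's configured user-agent is threaded through.
    return urls.filter((url) => robotsTxtFile.isAllowed(url, robotsTxtUserAgent ?? '*'));
}
```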
Pull request overview
Fixes an inconsistency in robots.txt enforcement by ensuring `context.enqueueLinks()` evaluates robots rules using the same configured `respectRobotsTxtFile.userAgent` as crawler-level request admission.
Changes:
- Adds an optional `robotsTxtUserAgent` to `enqueueLinks()` options and uses it when calling `robotsTxtFile.isAllowed(...)`.
- Threads the configured `respectRobotsTxtFile.userAgent` through `BasicCrawler.enqueueLinksWithCrawlDepth()` into `enqueueLinks()`.
- Adds a regression test for `enqueueLinks()` honoring custom user-agent robots.txt rules.
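A hedged usage sketch of the new option in a direct `enqueueLinks()` call; the import path and the pre-built `requestQueue`/`robotsTxtFile` values are assumptions for illustration:

```ts
import { enqueueLinks } from '@crawlee/core';

// Sketch: omitting `robotsTxtUserAgent` preserves the default '*' evaluation.
await enqueueLinks({
    urls: ['http://example.com/yes', 'http://example.com/no'],
    requestQueue,
    robotsTxtFile,
    robotsTxtUserAgent: 'MyCrawler',
});
```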
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| test/core/crawlers/basic_crawler.test.ts | Adds regression coverage for enqueueLinks() with custom robots.txt user-agent rules. |
| packages/core/src/enqueue_links/enqueue_links.ts | Extends enqueueLinks() API to accept an optional robots.txt user-agent and applies it in filtering. |
| packages/basic-crawler/src/internals/basic-crawler.ts | Passes configured respectRobotsTxtFile.userAgent into enqueueLinks() calls made from the crawling context. |
```ts
})({
    requestQueue,
    respectRobotsTxtFile: { userAgent: 'MyCrawler' },
    requestHandler: async (context) => {
        visitedUrls.push(context.request.url);

        if (context.request.label) {
            return;
        }

        await context.enqueueLinks({
            urls: [
                'http://example.com/yes',
                'http://example.com/no',
                'http://example.com/no-globally',
                'http://example.com/my-crawler/anything',
            ],
            label: 'child',
        });
    },
});

await crawler.run(['http://example.com/start']);

expect(visitedUrls).toEqual([
    'http://example.com/start',
    'http://example.com/yes',
    'http://example.com/my-crawler/anything',
]);
```
This test asserts a strict visitation order via `toEqual([...])`, but `BasicCrawler` runs with the `AutoscaledPool` default `maxConcurrency` (200). That can make `visitedUrls` ordering nondeterministic and lead to flaky tests. Consider setting `maxConcurrency: 1` for this crawler instance, or asserting the visited URLs without relying on order (e.g., sort before comparing).
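As a sketch against the same test (setup elided; `requestHandler` stands in for the handler defined above), either of these addresses the flakiness:

```ts
// (a) Pin concurrency so visitation order is deterministic:
const crawler = new BasicCrawler({
    maxConcurrency: 1,
    respectRobotsTxtFile: { userAgent: 'MyCrawler' },
    requestHandler,
});

// (b) Or drop the order dependence and compare sorted copies:
expect([...visitedUrls].sort()).toEqual(
    [
        'http://example.com/my-crawler/anything',
        'http://example.com/start',
        'http://example.com/yes',
    ].sort(),
);
```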
Good catch. I pushed 32f7f9b to pin the regression test to `maxConcurrency: 1`, so the visit-order assertion is deterministic without changing the production fix.
Copilot correctly pointed out that the new regression test relied on the crawler's default autoscaled concurrency, which can make visitation order nondeterministic. This pins the test to a single worker so the assertion stays deterministic without weakening the signal of the underlying bug.
- Constraint: keep the production fix untouched and scope the follow-up strictly to test stability.
- Rejected: relaxing the assertion to an order-insensitive check; that would stop proving the exact traversal sequence covered by the regression.
- Confidence: high. Scope-risk: narrow. Reversibility: clean.
- Directive: regression tests that assert visit order should opt into deterministic concurrency instead of relying on AutoscaledPool defaults.
- Tested: `node node_modules/vitest/vitest.mjs run test/core/crawlers/basic_crawler.test.ts -t 'enqueueLinks should respect custom user-agent robots.txt rules'`
- Not tested: full test suite; browser-backed E2E suites.
barjin left a comment
Good catch, @shaun0927!
I don't see any issues with your changes, so I'm approving this now. I'd prefer to wait for a second opinion from another maintainer before merging.
Thank you for your contribution :)
```ts
 * User-agent name to use when evaluating {@apilink EnqueueLinksOptions.robotsTxtFile|`robotsTxtFile`} rules.
 * Defaults to `*` when not provided.
 */
robotsTxtUserAgent?: string;
```
Instead of changing the API like this, perhaps we could pass the same `respectRobotsTxtFile` type as the one used in `BasicCrawler` here? That way, extending the `respectRobotsTxtFile` type with additional options wouldn't force us to add another option to `enqueueLinks` as well.
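A sketch of that proposal; the type shape is assumed from how `BasicCrawlerOptions` is described in this PR and is illustrative only:

```ts
// Reuse the crawler-level option instead of a bespoke string option:
type RespectRobotsTxtFile = boolean | { userAgent?: string };

interface EnqueueLinksOptions {
    // ...existing options...
    respectRobotsTxtFile?: RespectRobotsTxtFile;
}
```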
Fair point, this would also allow crawler users to disable robots.txt limits where desired, something like:

```ts
new CheerioCrawler({
    respectRobotsTxtFile: true,
    async requestHandler({ enqueueLinks }) {
        if (xyz) await enqueueLinks({ respectRobotsTxtFile: false });
    },
});
```

This currently requires a hacky solution of passing an "allow all" `RobotsTxtFile` instance, which is not optimal.
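The workaround being referenced would look roughly like this; `RobotsTxtFile.from(url, content)` is assumed to be the factory exposed by `@crawlee/utils`, so verify the exact signature in your Crawlee version:

```ts
import { RobotsTxtFile } from '@crawlee/utils';

// An empty robots.txt disallows nothing, so this instance permits every URL.
const allowAll = RobotsTxtFile.from('http://example.com/robots.txt', '');

// Inside a requestHandler, this would then be passed as the `robotsTxtFile` option:
// await enqueueLinks({ urls, robotsTxtFile: allowAll });
```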
barjin left a comment
I took the liberty of finishing the last bits on this PR.
cc @janbuchar is it good to go by you?
```ts
return enqueueLinks({
    requestQueue,
    robotsTxtFile: await this.getRobotsTxtFileForUrl(request!.url),
    respectRobotsTxtFile: this.respectRobotsTxtFile,
```
Context-bound `enqueueLinks` now inherits the `BasicCrawlerOptions.respectRobotsTxtFile` value (e.g., the crawler-wide emulated user-agent string).
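In practice the crawler-wide setting flows into `context.enqueueLinks()` with no extra wiring; a minimal sketch mirroring the regression test (import path assumed to be `@crawlee/basic`):

```ts
import { BasicCrawler } from '@crawlee/basic';

const crawler = new BasicCrawler({
    respectRobotsTxtFile: { userAgent: 'MyCrawler' },
    async requestHandler({ request, enqueueLinks, log }) {
        log.info(`Visited ${request.url}`);
        // Candidate URLs are filtered against the 'MyCrawler' robots.txt group,
        // not the '*' fallback.
        await enqueueLinks({ urls: ['http://example.com/yes'] });
    },
});

await crawler.run(['http://example.com/start']);
```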
Closes #3576
Core issue: #3226 (feat: support custom `userAgent` with `respectRobotsTxtFile`) added custom `respectRobotsTxtFile.userAgent` support for crawler-level robots checks, but `enqueueLinks()` still evaluated child URLs with the default `*` agent. This made `addRequests()` and `enqueueLinks()` disagree on the same robots.txt rules when a custom user-agent was configured.

Fix:
- threads `robotsTxtUserAgent` through `enqueueLinks()` options
- passes it from `BasicCrawler.enqueueLinksWithCrawlDepth()`
- keeps default `*` behavior unchanged when no custom user-agent is configured
- adds a regression test covering `enqueueLinks()` with custom user-agent robots rules

This keeps the change narrow and focused on restoring consistent robots.txt behavior across Crawlee's enqueue paths.