fix: honor custom respectRobotsTxtFile userAgent in enqueueLinks#3578

Merged
barjin merged 4 commits into apify:master from shaun0927:fix/enqueue-links-robots-user-agent
Apr 23, 2026

Conversation

@shaun0927
Contributor

Closes #3576

  • Core issue: feat: support custom userAgent with respectRobotsTxtFile #3226 added custom `respectRobotsTxtFile.userAgent` support for crawler-level robots checks, but `enqueueLinks()` still evaluated child URLs with the default `*` agent.

  • This made `addRequests()` and `enqueueLinks()` disagree on the same robots.txt rules when a custom user-agent was configured.

  • Fix:

    • Thread an optional `robotsTxtUserAgent` through `enqueueLinks()` options.
    • Pass the configured user-agent from `BasicCrawler.enqueueLinksWithCrawlDepth()`.
    • Keep the default `*` behavior unchanged when no custom user-agent is configured.
    • Add a regression test covering `enqueueLinks()` with custom user-agent robots rules.

This keeps the change narrow and focused on restoring consistent robots.txt behavior across Crawlee's enqueue paths.
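For illustration, a minimal sketch of the filtering step this change affects. It assumes Crawlee's `RobotsTxtFile.isAllowed(url, userAgent)` helper from `@crawlee/utils`; the `filterByRobots` name is hypothetical, and the real `enqueueLinks()` internals may differ:

    import { RobotsTxtFile } from '@crawlee/utils';

    // Hypothetical helper mirroring the robots.txt filter inside enqueueLinks().
    function filterByRobots(
        urls: string[],
        robotsTxtFile?: RobotsTxtFile,
        robotsTxtUserAgent?: string,
    ): string[] {
        if (!robotsTxtFile) return urls;
        // Before this fix, the check effectively ran with the default '*' agent;
        // now the configured agent is threaded through.
        return urls.filter((url) => robotsTxtFile.isAllowed(url, robotsTxtUserAgent ?? '*'));
    }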

Custom `respectRobotsTxtFile.userAgent` support reached the crawler-level
robots checks, but `enqueueLinks()` still evaluated rules as if the
catch-all `*` agent were in effect. This threads the configured user-agent
through the enqueue path and adds a regression test that covers the
user-visible mismatch.

Constraint: Keep the fix narrow and preserve existing default `*` behavior when no custom user-agent is configured
Rejected: Broader robots.txt refactor | unnecessary for a localized regression fix
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Any future robots.txt changes should keep crawler admission and enqueueLinks evaluation on the same user-agent source of truth
Tested: `node node_modules/vitest/vitest.mjs run test/core/crawlers/basic_crawler.test.ts -t 'enqueueLinks should respect custom user-agent robots.txt rules'`
Not-tested: Full test suite; browser-backed E2E suites

Copilot AI left a comment


Pull request overview

Fixes an inconsistency in robots.txt enforcement by ensuring context.enqueueLinks() evaluates robots rules using the same configured respectRobotsTxtFile.userAgent as crawler-level request admission.

Changes:

  • Adds an optional robotsTxtUserAgent to enqueueLinks() options and uses it when calling robotsTxtFile.isAllowed(...).
  • Threads the configured respectRobotsTxtFile.userAgent through BasicCrawler.enqueueLinksWithCrawlDepth() into enqueueLinks().
  • Adds a regression test for enqueueLinks() honoring custom user-agent robots.txt rules.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

  • test/core/crawlers/basic_crawler.test.ts: Adds regression coverage for enqueueLinks() with custom robots.txt user-agent rules.
  • packages/core/src/enqueue_links/enqueue_links.ts: Extends the enqueueLinks() API to accept an optional robots.txt user-agent and applies it when filtering.
  • packages/basic-crawler/src/internals/basic-crawler.ts: Passes the configured respectRobotsTxtFile.userAgent into enqueueLinks() calls made from the crawling context.


Comment on lines +1743 to +1771
})({
    requestQueue,
    respectRobotsTxtFile: { userAgent: 'MyCrawler' },
    requestHandler: async (context) => {
        visitedUrls.push(context.request.url);

        if (context.request.label) {
            return;
        }

        await context.enqueueLinks({
            urls: [
                'http://example.com/yes',
                'http://example.com/no',
                'http://example.com/no-globally',
                'http://example.com/my-crawler/anything',
            ],
            label: 'child',
        });
    },
});

await crawler.run(['http://example.com/start']);

expect(visitedUrls).toEqual([
    'http://example.com/start',
    'http://example.com/yes',
    'http://example.com/my-crawler/anything',
]);

Copilot AI Apr 16, 2026


This test asserts a strict visitation order via toEqual([...]), but BasicCrawler runs with AutoscaledPool default maxConcurrency (200). That can make visitedUrls ordering nondeterministic and lead to flaky tests. Consider setting maxConcurrency: 1 for this crawler instance, or asserting the visited URLs without relying on order (e.g., sort before comparing).
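For reference, the order-insensitive variant the comment suggests could look like this (illustrative Vitest assertion, not code from the PR):

    // Sort both sides so the assertion no longer depends on visitation order.
    expect([...visitedUrls].sort()).toEqual(
        [
            'http://example.com/start',
            'http://example.com/yes',
            'http://example.com/my-crawler/anything',
        ].sort(),
    );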

@shaun0927 (Contributor, Author)


Good catch. I pushed 32f7f9b to pin the regression test to maxConcurrency: 1, so the visit-order assertion is deterministic without changing the production fix.

Copilot correctly pointed out that the new regression test relied on the
crawler's default autoscaled concurrency, which can make visitation order
nondeterministic. This pins the test to a single worker so the assertion
stays deterministic without weakening the signal of the underlying bug.
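A minimal sketch of that pinning, assuming a plain `BasicCrawler` construction (the real test builds the crawler differently; `requestQueue` and `visitedUrls` come from the surrounding test):

    import { BasicCrawler } from '@crawlee/basic';

    const crawler = new BasicCrawler({
        requestQueue,
        maxConcurrency: 1, // single worker, so visit order is deterministic
        respectRobotsTxtFile: { userAgent: 'MyCrawler' },
        requestHandler: async (context) => {
            visitedUrls.push(context.request.url);
            // ...enqueue child links as in the snippet above
        },
    });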

Constraint: Keep the production fix untouched and scope the follow-up strictly to test stability
Rejected: Relax the assertion to an order-insensitive check | that would stop proving the exact traversal sequence covered by the regression
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Regression tests that assert visit order should opt into deterministic concurrency instead of relying on AutoscaledPool defaults
Tested: `node node_modules/vitest/vitest.mjs run test/core/crawlers/basic_crawler.test.ts -t 'enqueueLinks should respect custom user-agent robots.txt rules'`
Not-tested: Full test suite; browser-backed E2E suites
barjin requested review from barjin and janbuchar on Apr 16, 2026, 08:58

barjin (Member) left a comment


Good catch, @shaun0927!

I don't see any issues with your changes, so I'm approving this now. I'd prefer to wait for a second opinion from another maintainer before merging.

Thank you for your contribution :)

* User-agent name to use when evaluating {@apilink EnqueueLinksOptions.robotsTxtFile|`robotsTxtFile`} rules.
* Defaults to `*` when not provided.
*/
robotsTxtUserAgent?: string;
Contributor


Instead of changing the API like this, perhaps we could pass the same respectRobotsTxtFile type as the one used in BasicCrawler here, so that extending that type with additional options doesn't force us to add another option to enqueueLinks as well?
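A sketch of the suggested shape (hypothetical; the option mirrors `BasicCrawlerOptions.respectRobotsTxtFile`):

    interface EnqueueLinksOptions {
        // ...existing options...

        // Reusing the crawler-level type means new robots-related settings
        // flow to enqueueLinks() without further API changes.
        respectRobotsTxtFile?: boolean | { userAgent?: string };
    }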

Member


Fair point, this would also allow crawler users to disable robots.txt limits where desired, something like

new CheerioCrawler({
    respectRobotsTxtFile: true,
    async requestHandler({ enqueueLinks }) {
        if (xyz) await enqueueLinks({ respectRobotsTxtFile: false });
    },
});

Right now this requires a hacky workaround: passing an "allow all" RobotsTxtFile instance, which is not optimal.
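That workaround might look like the following sketch, assuming `RobotsTxtFile.from(url, content)` from `@crawlee/utils`:

    import { RobotsTxtFile } from '@crawlee/utils';

    // An "allow all" robots file handed to enqueueLinks() to bypass the checks.
    const allowAll = RobotsTxtFile.from(
        'http://example.com/robots.txt',
        'User-agent: *\nAllow: /',
    );
    await enqueueLinks({ robotsTxtFile: allowAll });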

barjin (Member) left a comment


I took the liberty of finishing the last bits on this PR.

cc @janbuchar is it good to go by you?

return enqueueLinks({
    requestQueue,
    robotsTxtFile: await this.getRobotsTxtFileForUrl(request!.url),
    respectRobotsTxtFile: this.respectRobotsTxtFile,
Member


Context-bound `enqueueLinks` now inherits the `BasicCrawlerOptions.respectRobotsTxtFile` value (e.g., the crawler-wide emulated user-agent string).

barjin requested a review from janbuchar on Apr 22, 2026, 14:44
barjin merged commit 1d2528b into apify:master on Apr 23, 2026
9 checks passed


Development

Successfully merging this pull request may close these issues.

enqueueLinks() does not honor custom respectRobotsTxtFile.userAgent

6 participants