fix: honor custom respectRobotsTxtFile userAgent in enqueueLinks #3578
barjin merged 4 commits into apify:master
Conversation
Custom `respectRobotsTxtFile.userAgent` support reached the crawler-level robots checks, but `enqueueLinks()` still evaluated rules as if the catch-all `*` agent was in effect. This threads the configured user-agent through the enqueue path and adds a regression test that covers the user-visible mismatch.
- Constraint: keep the fix narrow and preserve the existing default `*` behavior when no custom user-agent is configured.
- Rejected: a broader robots.txt refactor; unnecessary for a localized regression fix.
- Confidence: high. Scope-risk: narrow. Reversibility: clean.
- Directive: any future robots.txt changes should keep crawler admission and enqueueLinks evaluation on the same user-agent source of truth.
- Tested: `node node_modules/vitest/vitest.mjs run test/core/crawlers/basic_crawler.test.ts -t 'enqueueLinks should respect custom user-agent robots.txt rules'`
- Not tested: full test suite; browser-backed E2E suites.
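A minimal sketch of the mismatch being fixed, assuming `isAllowed(url, userAgent)` as the evaluation call; the helper and its parameters are illustrative, not the actual Crawlee source:

```ts
// Illustrative only; `filterByRobots` and its signature are hypothetical.
interface RobotsTxtLike {
    isAllowed(url: string, userAgent?: string): boolean;
}

function filterByRobots(
    urls: string[],
    robotsTxtFile: RobotsTxtLike,
    robotsTxtUserAgent?: string,
): string[] {
    // Before the fix, the enqueue path effectively evaluated with the catch-all '*'
    // agent; after the fix, the crawler's configured user-agent is threaded through.
    return urls.filter((url) => robotsTxtFile.isAllowed(url, robotsTxtUserAgent ?? '*'));
}
```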
Pull request overview
Fixes an inconsistency in robots.txt enforcement by ensuring `context.enqueueLinks()` evaluates robots rules using the same configured `respectRobotsTxtFile.userAgent` as crawler-level request admission.
Changes:
- Adds an optional `robotsTxtUserAgent` to `enqueueLinks()` options and uses it when calling `robotsTxtFile.isAllowed(...)`.
- Threads the configured `respectRobotsTxtFile.userAgent` through `BasicCrawler.enqueueLinksWithCrawlDepth()` into `enqueueLinks()`.
- Adds a regression test for `enqueueLinks()` honoring custom user-agent robots.txt rules.
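A hedged usage sketch of the new option in a direct `enqueueLinks()` call; the import path and the pre-built `requestQueue`/`robotsTxtFile` values are assumptions for illustration:

```ts
import { enqueueLinks } from '@crawlee/core';

// Sketch: omitting `robotsTxtUserAgent` preserves the default '*' evaluation.
await enqueueLinks({
    urls: ['http://example.com/yes', 'http://example.com/no'],
    requestQueue,
    robotsTxtFile,
    robotsTxtUserAgent: 'MyCrawler',
});
```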
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| test/core/crawlers/basic_crawler.test.ts | Adds regression coverage for enqueueLinks() with custom robots.txt user-agent rules. |
| packages/core/src/enqueue_links/enqueue_links.ts | Extends enqueueLinks() API to accept an optional robots.txt user-agent and applies it in filtering. |
| packages/basic-crawler/src/internals/basic-crawler.ts | Passes configured respectRobotsTxtFile.userAgent into enqueueLinks() calls made from the crawling context. |
```ts
})({
    requestQueue,
    respectRobotsTxtFile: { userAgent: 'MyCrawler' },
    requestHandler: async (context) => {
        visitedUrls.push(context.request.url);

        if (context.request.label) {
            return;
        }

        await context.enqueueLinks({
            urls: [
                'http://example.com/yes',
                'http://example.com/no',
                'http://example.com/no-globally',
                'http://example.com/my-crawler/anything',
            ],
            label: 'child',
        });
    },
});

await crawler.run(['http://example.com/start']);

expect(visitedUrls).toEqual([
    'http://example.com/start',
    'http://example.com/yes',
    'http://example.com/my-crawler/anything',
]);
```
This test asserts a strict visitation order via `toEqual([...])`, but `BasicCrawler` runs with the `AutoscaledPool` default `maxConcurrency` (200). That can make `visitedUrls` ordering nondeterministic and lead to flaky tests. Consider setting `maxConcurrency: 1` for this crawler instance, or asserting the visited URLs without relying on order (e.g., sort before comparing).
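As a sketch against the same test (setup elided; `requestHandler` stands in for the handler defined above), either of these addresses the flakiness:

```ts
// (a) Pin concurrency so visitation order is deterministic:
const crawler = new BasicCrawler({
    maxConcurrency: 1,
    respectRobotsTxtFile: { userAgent: 'MyCrawler' },
    requestHandler,
});

// (b) Or drop the order dependence and compare sorted copies:
expect([...visitedUrls].sort()).toEqual(
    [
        'http://example.com/my-crawler/anything',
        'http://example.com/start',
        'http://example.com/yes',
    ].sort(),
);
```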
Good catch. I pushed 32f7f9b to pin the regression test to `maxConcurrency: 1`, so the visit-order assertion is deterministic without changing the production fix.
Copilot correctly pointed out that the new regression test relied on the crawler's default autoscaled concurrency, which can make visitation order nondeterministic. This pins the test to a single worker so the assertion stays deterministic without weakening the signal of the underlying bug.
- Constraint: keep the production fix untouched and scope the follow-up strictly to test stability.
- Rejected: relaxing the assertion to an order-insensitive check; that would stop proving the exact traversal sequence covered by the regression.
- Confidence: high. Scope-risk: narrow. Reversibility: clean.
- Directive: regression tests that assert visit order should opt into deterministic concurrency instead of relying on AutoscaledPool defaults.
- Tested: `node node_modules/vitest/vitest.mjs run test/core/crawlers/basic_crawler.test.ts -t 'enqueueLinks should respect custom user-agent robots.txt rules'`
- Not tested: full test suite; browser-backed E2E suites.
barjin left a comment
Good catch, @shaun0927!
I don't see any issues with your changes, so I'm approving this now. I'd prefer to wait for a second opinion from another maintainer before merging.
Thank you for your contribution :)
```ts
 * User-agent name to use when evaluating {@apilink EnqueueLinksOptions.robotsTxtFile|`robotsTxtFile`} rules.
 * Defaults to `*` when not provided.
 */
robotsTxtUserAgent?: string;
```
Instead of changing the API like this, perhaps we could pass the same `respectRobotsTxtFile` type as the one used in `BasicCrawler` here? That way, extending the `respectRobotsTxtFile` type with additional options wouldn't force us to add another option to `enqueueLinks` as well.
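A sketch of that proposal; the type shape is assumed from how `BasicCrawlerOptions` is described in this PR and is illustrative only:

```ts
// Reuse the crawler-level option instead of a bespoke string option:
type RespectRobotsTxtFile = boolean | { userAgent?: string };

interface EnqueueLinksOptions {
    // ...existing options...
    respectRobotsTxtFile?: RespectRobotsTxtFile;
}
```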
Fair point, this would also allow crawler users to disable robots.txt limits where desired, something like:

```ts
new CheerioCrawler({
    respectRobotsTxtFile: true,
    async requestHandler({ enqueueLinks }) {
        if (xyz) await enqueueLinks({ respectRobotsTxtFile: false });
    },
});
```

This currently requires a hacky solution of passing an "allow all" `RobotsTxtFile` instance, which is not optimal.
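The workaround being referenced would look roughly like this; `RobotsTxtFile.from(url, content)` is assumed to be the factory exposed by `@crawlee/utils`, so verify the exact signature in your Crawlee version:

```ts
import { RobotsTxtFile } from '@crawlee/utils';

// An empty robots.txt disallows nothing, so this instance permits every URL.
const allowAll = RobotsTxtFile.from('http://example.com/robots.txt', '');

// Inside a requestHandler, this would then be passed as the `robotsTxtFile` option:
// await enqueueLinks({ urls, robotsTxtFile: allowAll });
```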
barjin left a comment
I took the liberty of finishing the last bits on this PR.
cc @janbuchar is it good to go by you?
```ts
return enqueueLinks({
    requestQueue,
    robotsTxtFile: await this.getRobotsTxtFileForUrl(request!.url),
    respectRobotsTxtFile: this.respectRobotsTxtFile,
```
Context-bound `enqueueLinks` now inherits the `BasicCrawlerOptions.respectRobotsTxtFile` value (e.g., the crawler-wide emulated user-agent string).
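In practice the crawler-wide setting flows into `context.enqueueLinks()` with no extra wiring; a minimal sketch mirroring the regression test (import path assumed to be `@crawlee/basic`):

```ts
import { BasicCrawler } from '@crawlee/basic';

const crawler = new BasicCrawler({
    respectRobotsTxtFile: { userAgent: 'MyCrawler' },
    async requestHandler({ request, enqueueLinks, log }) {
        log.info(`Visited ${request.url}`);
        // Candidate URLs are filtered against the 'MyCrawler' robots.txt group,
        // not the '*' fallback.
        await enqueueLinks({ urls: ['http://example.com/yes'] });
    },
});

await crawler.run(['http://example.com/start']);
```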
Closes #3576
Core issue: #3226 (feat: support custom `userAgent` with `respectRobotsTxtFile`) added custom `respectRobotsTxtFile.userAgent` support for crawler-level robots checks, but `enqueueLinks()` still evaluated child URLs with the default `*` agent. This made `addRequests()` and `enqueueLinks()` disagree on the same robots.txt rules when a custom user-agent was configured.

Fix:
- threads `robotsTxtUserAgent` through `enqueueLinks()` options
- passes it from `BasicCrawler.enqueueLinksWithCrawlDepth()`
- keeps default `*` behavior unchanged when no custom user-agent is configured
- adds a regression test covering `enqueueLinks()` with custom user-agent robots rules

This keeps the change narrow and focused on restoring consistent robots.txt behavior across Crawlee's enqueue paths.