fix(core): stop draining async iterables after maxNewRequests is exhausted#3579

Open
shaun0927 wants to merge 2 commits into apify:master from shaun0927:fix/max-new-requests-async-iterables

Conversation

@shaun0927
Contributor

Closes #3577

  • Core issue: #3531 ("fix: Prevent accidental request dropping with maxRequestsPerCrawl") fixed duplicate-count budgeting, but exact `requestsOverLimit` reporting for async iterables still required draining the producer after the budget was already exhausted.

  • For stalled or long-running async producers, that made a bounded `maxNewRequests` enqueue behave like a blocking operation.

  • Fix:

    • keep exact `requestsOverLimit` reporting for materialized inputs
    • keep exact reporting for async iterables when callers explicitly opt into `waitForAllRequestsToBeAdded`
    • let the default async-iterable path return as soon as the `maxNewRequests` budget is exhausted instead of draining the producer further
    • add regression coverage for the non-blocking async-iterable case
    • keep existing coverage for exact leftover reporting by making the async-generator test opt into `waitForAllRequestsToBeAdded`

This keeps the change narrow while separating two concerns that currently conflict:

  1. exact leftover reporting
  2. liveness for streaming async producers

Exact `requestsOverLimit` reporting for async iterables currently requires
fully draining the producer after the budget is already exhausted. That
turns a bounded enqueue operation into a blocking one when the producer
stalls. This change keeps exact leftover reporting for materialized inputs
(and for callers that explicitly opt into waiting), while letting the
default async-iterable path return as soon as the budget is reached.
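The intended semantics can be sketched with a minimal stand-alone model (illustrative only, not the actual Crawlee implementation; `addWithBudget` and `stallingProducer` are invented names for this sketch):

```typescript
// Minimal sketch: consume an async iterable under a maxNewRequests budget.
interface AddResult {
    added: string[];
    requestsOverLimit: string[];
}

async function addWithBudget(
    requests: AsyncIterable<string>,
    maxNewRequests: number,
    waitForAllRequestsToBeAdded = false,
): Promise<AddResult> {
    const added: string[] = [];
    const requestsOverLimit: string[] = [];

    for await (const request of requests) {
        if (added.length >= maxNewRequests) {
            // Budget was exhausted before this item arrived.
            if (!waitForAllRequestsToBeAdded) break;
            requestsOverLimit.push(request); // drain only on explicit opt-in
            continue;
        }
        added.push(request);
        // Default path: stop pulling the moment the budget fills, so a
        // stalled producer cannot turn a bounded enqueue into a blocking one.
        if (added.length === maxNewRequests && !waitForAllRequestsToBeAdded) break;
    }

    return { added, requestsOverLimit };
}

// A producer that stalls forever after its second item.
async function* stallingProducer(): AsyncGenerator<string> {
    yield 'https://example.com/1';
    yield 'https://example.com/2';
    await new Promise<never>(() => {}); // never settles
    yield 'https://example.com/unreachable';
}
```

With the default, `addWithBudget(stallingProducer(), 2)` returns promptly with an empty `requestsOverLimit`; opting in restores the exact over-limit count for finite inputs at the cost of draining the producer.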

Constraint: Preserve exact leftover reporting for arrays and for callers that explicitly set waitForAllRequestsToBeAdded
Rejected: Remove requestsOverLimit entirely for every input type | too broad and breaks useful existing behavior for materialized sources
Confidence: medium
Scope-risk: moderate
Reversibility: clean
Directive: If the requestsOverLimit contract changes again, keep async-producer liveness and exact leftover reporting as separate concerns
Tested: `node node_modules/vitest/vitest.mjs run test/core/crawlers/basic_crawler.test.ts -t 'addRequestsBatched with maxNewRequests should not wait for an async iterable beyond the remaining budget'`; `node node_modules/vitest/vitest.mjs run test/core/crawlers/basic_crawler.test.ts -t 'addRequestsBatched with maxNewRequests should correctly report requestsOverLimit for generator input'`; `node node_modules/vitest/vitest.mjs run test/core/crawlers/basic_crawler.test.ts -t 'addRequestsBatched with maxNewRequests should correctly report requestsOverLimit for array input'`
Not-tested: Full test suite; behavior for custom iterable implementations beyond the covered array/async-generator cases

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1f0d67770c


Comment on lines +459 to +460
const shouldCollectExactRequestsOverLimit =
maxNewRequests !== undefined && (!isAsyncIterable(requests) || options.waitForAllRequestsToBeAdded);

P1: Preserve over-limit reporting for crawler async wrappers

This condition turns off requestsOverLimit collection for every async iterable unless waitForAllRequestsToBeAdded is explicitly true, but BasicCrawler.addRequests() always passes filteredRequests() (an async generator) together with maxNewRequests from maxRequestsPerCrawl. That means the default crawler path now gets an empty requestsOverLimit, so handleSkippedRequest({ reason: 'limit' }) is no longer called for skipped URLs, which drops onSkippedRequest limit callbacks and the related max-limit log behavior in that flow.
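The reviewer's point can be reproduced with a self-contained evaluation of the flagged condition (`isAsyncIterable` and `filteredRequests` below are local stand-ins mirroring the names in the diff, not the real Crawlee helpers):

```typescript
// Stand-in for the isAsyncIterable check used in the diff.
function isAsyncIterable(value: unknown): boolean {
    return typeof (value as { [Symbol.asyncIterator]?: unknown })?.[Symbol.asyncIterator] === 'function';
}

// Stand-in for the async generator that BasicCrawler.addRequests() passes.
async function* filteredRequests(): AsyncGenerator<string> {
    yield 'https://example.com';
}

// BasicCrawler.addRequests() derives maxNewRequests from maxRequestsPerCrawl
// and does not set waitForAllRequestsToBeAdded.
const maxNewRequests: number | undefined = 5;
const options: { waitForAllRequestsToBeAdded?: boolean } = {};

// Falsy on this default crawler path: requestsOverLimit stays empty, so the
// skipped-limit callbacks downstream of it are never invoked.
const shouldCollectExactRequestsOverLimit =
    maxNewRequests !== undefined && (!isAsyncIterable(filteredRequests()) || options.waitForAllRequestsToBeAdded);
```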


shaun0927 (Contributor, Author)


Good point. I pushed 440a790 to keep the non-blocking direct async-iterable path while making BasicCrawler.addRequests() opt into exact leftover reporting when maxRequestsPerCrawl relies on skipped-limit callbacks. I also added a regression test for that crawler-side path.

Codex correctly pointed out that the previous change stopped collecting
`requestsOverLimit` for every async iterable unless callers explicitly
waited for all batches. That fixed the direct RequestProvider liveness
issue, but it also removed `onSkippedRequest({ reason: 'limit' })`
reporting from the BasicCrawler path that powers maxRequestsPerCrawl
logging and callbacks.

This follow-up keeps the non-blocking default for direct async-producer
usage while making BasicCrawler opt into exact leftover reporting when it
needs to surface skipped-limit behavior.
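A hypothetical sketch of that split (invented names, not the actual Crawlee internals): the crawler wrapper opts into exact reporting whenever `maxRequestsPerCrawl` is in play, while direct async-producer callers keep the non-blocking default.

```typescript
interface AddRequestsOptions {
    maxNewRequests?: number;
    waitForAllRequestsToBeAdded?: boolean;
}

// Hypothetical crawler-side option builder.
function crawlerAddOptions(maxRequestsPerCrawl: number | undefined, remainingBudget: number): AddRequestsOptions {
    if (maxRequestsPerCrawl === undefined) return {}; // no limit: keep defaults
    return {
        maxNewRequests: remainingBudget,
        // Opt in: exact requestsOverLimit is required to fire
        // onSkippedRequest({ reason: 'limit' }) for every dropped URL.
        waitForAllRequestsToBeAdded: true,
    };
}
```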

Constraint: Keep direct RequestProvider async usage non-blocking while preserving existing crawler-level skipped-limit callbacks
Rejected: Revert to always draining every async iterable | reintroduces the blocking behavior reported in apify#3577
Confidence: medium
Scope-risk: moderate
Reversibility: clean
Directive: When changing maxNewRequests semantics, verify both direct RequestProvider usage and BasicCrawler skipped-request reporting paths
Tested: `node node_modules/vitest/vitest.mjs run test/core/crawlers/basic_crawler.test.ts -t 'addRequestsBatched with maxNewRequests should not wait for an async iterable beyond the remaining budget'`; `node node_modules/vitest/vitest.mjs run test/core/crawlers/basic_crawler.test.ts -t 'addRequests should still report skipped requests for maxRequestsPerCrawl when the source is async'`; `node node_modules/vitest/vitest.mjs run test/core/crawlers/basic_crawler.test.ts -t 'addRequestsBatched with maxNewRequests should correctly report requestsOverLimit for generator input'`
Not-tested: Full test suite; direct enqueueLinks limit reporting beyond existing coverage
@janbuchar janbuchar self-requested a review April 16, 2026 17:57


Development

Successfully merging this pull request may close these issues.

addRequestsBatched() keeps consuming async iterables after maxNewRequests is exhausted

3 participants