fix(core): stop draining async iterables after maxNewRequests is exhausted #3579
shaun0927 wants to merge 2 commits into apify:master
Conversation
Exact `requestsOverLimit` reporting for async iterables currently requires fully draining the producer after the budget is already exhausted. That turns a bounded enqueue operation into a blocking one when the producer stalls. This change keeps exact leftover reporting for materialized inputs (and for callers that explicitly opt into waiting), while letting the default async-iterable path return as soon as the budget is reached.
Constraint: Preserve exact leftover reporting for arrays and for callers that explicitly set `waitForAllRequestsToBeAdded`
Rejected: Remove `requestsOverLimit` entirely for every input type | too broad and breaks useful existing behavior for materialized sources
Confidence: medium
Scope-risk: moderate
Reversibility: clean
Directive: If the `requestsOverLimit` contract changes again, keep async-producer liveness and exact leftover reporting as separate concerns
Tested: `node node_modules/vitest/vitest.mjs run test/core/crawlers/basic_crawler.test.ts -t 'addRequestsBatched with maxNewRequests should not wait for an async iterable beyond the remaining budget'`; `node node_modules/vitest/vitest.mjs run test/core/crawlers/basic_crawler.test.ts -t 'addRequestsBatched with maxNewRequests should correctly report requestsOverLimit for generator input'`; `node node_modules/vitest/vitest.mjs run test/core/crawlers/basic_crawler.test.ts -t 'addRequestsBatched with maxNewRequests should correctly report requestsOverLimit for array input'`
Not-tested: Full test suite; behavior for custom iterable implementations beyond the covered array/async-generator cases
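To make the intended split concrete, here is a minimal standalone sketch of the behavior described above. It is illustrative only and not taken from crawlee's source; `takeWithinBudget`, `RequestLike`, and `collectExactOverLimit` are made-up names standing in for the real enqueue path, the `maxNewRequests` budget, and the opt-in condition.

```ts
// Illustrative sketch only: consume an async producer up to `maxNewRequests`,
// and only keep draining it to count leftovers when the caller opts in.
interface RequestLike {
    url: string;
}

async function takeWithinBudget(
    source: AsyncIterable<RequestLike>,
    maxNewRequests: number,
    collectExactOverLimit = false,
): Promise<{ accepted: RequestLike[]; requestsOverLimit: RequestLike[] }> {
    const accepted: RequestLike[] = [];
    const requestsOverLimit: RequestLike[] = [];

    for await (const request of source) {
        if (accepted.length < maxNewRequests) {
            accepted.push(request);
            continue;
        }

        if (!collectExactOverLimit) {
            // Default async-iterable path: stop pulling from the producer as soon as
            // the budget is exhausted, so a stalled producer cannot block the call.
            break;
        }

        // Opt-in path (materialized inputs or explicit waiting): drain the rest so
        // `requestsOverLimit` reports the exact leftovers.
        requestsOverLimit.push(request);
    }

    return { accepted, requestsOverLimit };
}
```

The `break` is the liveness guarantee: once the budget is hit, the default path never awaits another value from a producer that might stall.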
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1f0d67770c
```ts
const shouldCollectExactRequestsOverLimit =
    maxNewRequests !== undefined && (!isAsyncIterable(requests) || options.waitForAllRequestsToBeAdded);
```
Preserve over-limit reporting for crawler async wrappers
This condition turns off `requestsOverLimit` collection for every async iterable unless `waitForAllRequestsToBeAdded` is explicitly true. However, `BasicCrawler.addRequests()` always passes `filteredRequests()` (an async generator) together with a `maxNewRequests` value derived from `maxRequestsPerCrawl`. The default crawler path therefore gets an empty `requestsOverLimit`, so `handleSkippedRequest({ reason: 'limit' })` is no longer called for skipped URLs, which drops `onSkippedRequest` limit callbacks and the related max-limit log behavior in that flow.
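The reviewer's concern can be reproduced with a toy check. The `isAsyncIterable` helper below is a simplified stand-in rather than crawlee's actual implementation, and the array/generator pair only mirrors how `BasicCrawler.addRequests()` is described as wrapping its input in `filteredRequests()`.

```ts
// Simplified stand-in for the guard's iterable check; crawlee's own helper may differ.
function isAsyncIterable(value: unknown): value is AsyncIterable<unknown> {
    return typeof value === 'object' && value !== null && Symbol.asyncIterator in value;
}

const urls = [{ url: 'https://example.com/a' }, { url: 'https://example.com/b' }];

// A crawler-side wrapper turns even fully materialized input into an async iterable,
// so the guard above switches off exact over-limit collection for it by default.
async function* wrapped() {
    yield* urls;
}

console.log(isAsyncIterable(urls)); // false → exact requestsOverLimit kept
console.log(isAsyncIterable(wrapped())); // true → exact reporting skipped by default
```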
Good point. I pushed 440a790 to keep the non-blocking direct async-iterable path while making `BasicCrawler.addRequests()` opt into exact leftover reporting when `maxRequestsPerCrawl` relies on skipped-limit callbacks. I also added a regression test for that crawler-side path.
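Building on the hypothetical `takeWithinBudget` sketch earlier in this thread, the two call paths described in this reply would look roughly as follows. Commit 440a790 may implement the crawler-side opt-in differently, so treat this purely as an illustration; `producer` is a made-up stand-in for both a direct async source and the crawler's internal wrapper.

```ts
// Purely illustrative usage of the takeWithinBudget sketch from above.
async function* producer() {
    for (let i = 0; i < 25; i++) yield { url: `https://example.com/${i}` };
}

// Direct async-producer usage keeps the non-blocking default: stop at the budget,
// leftovers are neither drained nor counted exactly.
const direct = await takeWithinBudget(producer(), 10);
console.log(direct.requestsOverLimit.length); // 0 (not collected)

// Crawler-style usage opts into exact leftover reporting, so each request over the
// budget can still be surfaced as skipped with reason 'limit'.
const crawlerSide = await takeWithinBudget(producer(), 10, true);
console.log(crawlerSide.requestsOverLimit.length); // 15 (exact leftovers)
```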
Codex correctly pointed out that the previous change stopped collecting
`requestsOverLimit` for every async iterable unless callers explicitly
waited for all batches. That fixed the direct RequestProvider liveness
issue, but it also removed `onSkippedRequest({ reason: 'limit' })`
reporting from the BasicCrawler path that powers maxRequestsPerCrawl
logging and callbacks.
This follow-up keeps the non-blocking default for direct async-producer
usage while making BasicCrawler opt into exact leftover reporting when it
needs to surface skipped-limit behavior.
Constraint: Keep direct RequestProvider async usage non-blocking while preserving existing crawler-level skipped-limit callbacks
Rejected: Revert to always draining every async iterable | reintroduces the blocking behavior reported in apify#3577
Confidence: medium
Scope-risk: moderate
Reversibility: clean
Directive: When changing maxNewRequests semantics, verify both direct RequestProvider usage and BasicCrawler skipped-request reporting paths
Tested: `node node_modules/vitest/vitest.mjs run test/core/crawlers/basic_crawler.test.ts -t 'addRequestsBatched with maxNewRequests should not wait for an async iterable beyond the remaining budget'`; `node node_modules/vitest/vitest.mjs run test/core/crawlers/basic_crawler.test.ts -t 'addRequests should still report skipped requests for maxRequestsPerCrawl when the source is async'`; `node node_modules/vitest/vitest.mjs run test/core/crawlers/basic_crawler.test.ts -t 'addRequestsBatched with maxNewRequests should correctly report requestsOverLimit for generator input'`
Not-tested: Full test suite; direct enqueueLinks limit reporting beyond existing coverage
Closes #3577
Core issue: fix: Prevent accidental request dropping with `maxRequestsPerCrawl` #3531 fixed duplicate-count budgeting, but exact `requestsOverLimit` reporting for async iterables still required draining the producer after the budget was already exhausted. For stalled or long-running async producers, that made a bounded `maxNewRequests` enqueue behave like a blocking operation.
Fix:
- Keep exact `requestsOverLimit` reporting for materialized inputs and for callers that set `waitForAllRequestsToBeAdded`
- Return once the `maxNewRequests` budget is exhausted instead of draining the producer further when an async iterable is passed without `waitForAllRequestsToBeAdded`
This keeps the change narrow while separating two concerns that currently conflict: async-producer liveness and exact `requestsOverLimit` reporting.