fix: Simplified RequestQueueV2 implementation (#2775)
Conversation
drobnikj
left a comment
Nice 💪
I'll do some testing myself, but first: what about some unit tests, did you consider adding some? There are none -> https://github.com/apify/crawlee/blob/03951bdba8fb34f6bed00d1b68240ff7cd0bacbf/test/core/storages/request_queue.test.ts
Honestly, we have been dealing with various bugs over time and we still do not have any tests for these features.
The build did not finish, can you check @janbuchar?

I can, but only later this week - I have different stuff to finish first.

@drobnikj the unit tests are now passing, so you should be able to build. I'm still working on some e2e tests; if you have any ideas for scenarios to test (e2e, unit, doesn't matter), I'd love to hear them.
Looks good, I did not find any issue, even during testing.
I have a few more comments, can you check pls? @janbuchar
@@ -361,7 +430,8 @@ export class RequestQueue extends RequestProvider {
I cannot comment on it directly below, but during code review I noticed that we are removing locks one by one in `_clearPossibleLocks`. There is a 200 rps rate limit, so I would remove the locks in batches, maybe 10 at a time, to speed it up.
What do you mean? I don't think there is a batch unlock endpoint, and launching those requests in parallel surely won't help against rate limiting either.
I mean to unlock in batches, like:
```ts
protected async _clearPossibleLocks() {
    this.queuePausedForMigration = true;
    let requestId: string | null;
    const batchSize = 10;
    const deleteRequests: Promise<void>[] = [];

    // eslint-disable-next-line no-cond-assign
    while ((requestId = this.queueHeadIds.removeFirst()) !== null) {
        deleteRequests.push(
            this.client.deleteRequestLock(requestId).catch(() => {
                // We don't have the lock, or the request was never locked. Either way it's fine
            }),
        );

        if (deleteRequests.length >= batchSize) {
            // Process the current batch of unlock calls
            await Promise.all(deleteRequests);
            deleteRequests.length = 0; // Reset the array for the next batch
        }
    }

    // Process any remaining requests that didn't form a full batch
    if (deleteRequests.length > 0) {
        await Promise.all(deleteRequests);
    }
}
```
I see. However, I still doubt that there will be any measurable benefit - this code is only executed on migration and there shouldn't be more than ~25 requests in the queue head.
@barjin I gave the forefront handling a makeover. If you could check that out, I'd be super grateful.

Looking good to me 👍🏽 I remember reversing the forefront array somewhere already (likely
drobnikj
left a comment
There was a problem hiding this comment.
Looks like almost all my notes were addressed, and I commented on the rest.
Co-authored-by: Vlad Frangu <me@vladfrangu.dev>
vladfrangu
left a comment
lgtm once the format is fixed (woops, sorryy ;w;)
B4nan
left a comment
There was a problem hiding this comment.
A few comments from my end, nothing really blocking, so approving.
```diff
 // for request queue v2, we want to lock requests by the timeout that would also account for internals (plus 5 seconds padding), but
 // with a minimum of a minute
-this.requestQueue.requestLockSecs = Math.max(this.internalTimeoutMillis / 1000 + 5, 60);
+this.requestQueue.requestLockSecs = Math.max(this.requestHandlerTimeoutMillis / 1000 + 5, 60);
```
the comment still mentions the internal timeout
I'm honestly not sure what it was trying to say so I reworded it.
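For illustration only (this helper is not part of the PR), the lock duration formula from the diff above boils down to: the request handler timeout in seconds, plus 5 seconds of padding, with a floor of one minute.

```typescript
// Illustrative sketch of the lock-duration formula discussed above.
// Not part of the PR; the function name is made up for this example.
function requestLockSecs(requestHandlerTimeoutMillis: number): number {
    // Handler timeout in seconds + 5 s padding, but never less than 60 s.
    return Math.max(requestHandlerTimeoutMillis / 1000 + 5, 60);
}
```

So a 30-second handler timeout hits the one-minute floor, while a 120-second timeout yields a 125-second lock.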
```ts
this.inProgressRequestBatches.push(promise);
void promise.finally(() => {
    this.inProgressRequestBatches = this.inProgressRequestBatches.filter((it) => it !== promise);
});
```
How many items do we expect in that array in a high-concurrency run? This solution is not the best one, but if the size won't be large, we can keep it.
How is this different from a simple integer counter? That would be the most performant approach: just increment instead of push, and decrement in the finally block.
The integer counter was in fact the previous implementation. However, it could not work with multiple clients, and we cannot reliably detect that - the queueHadMultipleClients flag is set even if the other client was a pre-migration instance of the same run, if that makes sense.
You are right that each forefront request might make us lock 25 more requests, and that could unbalance parallel instances quite a bit. Maybe we should give up "excess" requests after we're done checking for forefront requests.
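To make the "give up excess requests" idea concrete, here is a hypothetical sketch (explicitly not the PR's implementation): after a forefront fetch locks a batch of requests, release the locks on the ones we won't consume so parallel clients can pick them up. The function name, the `client` shape, and the `needed` parameter are assumptions for illustration; `deleteRequestLock` mirrors the client call used elsewhere in this thread.

```typescript
// Hypothetical sketch: release locks on requests beyond what this client
// will actually process. Names and shapes are assumed for illustration.
async function releaseExcessLocks(
    client: { deleteRequestLock(id: string): Promise<void> },
    lockedIds: string[],
    needed: number,
): Promise<string[]> {
    const keep = lockedIds.slice(0, needed);
    const excess = lockedIds.slice(needed);

    await Promise.all(
        excess.map((id) =>
            client.deleteRequestLock(id).catch(() => {
                // The lock may already be gone; either way it's fine.
            }),
        ),
    );

    return keep; // ids this client still holds locks for
}
```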
Hmm, not sure I follow why the counter wouldn't be enough, how is this better? Each client will have its own local cache (this new var). You store values in an array and wipe them based on identity, but the promises are not really used anywhere. My suggestion is doing the same, just without the memory/perf overhead.
Just to be sure, this is what I meant, it still uses the promise.finally:
```ts
this.inProgressRequestBatches++;
void promise.finally(() => {
    this.inProgressRequestBatches--;
});
```
Oh damn, I'm sorry. I thought you were commenting on a different part of the code - the one that handles forefront requests. If it's any help, you pushed me to tie up a loose end that I forgot about.
Regarding the batches, you're probably right 😁
```ts
if (this.queueHeadIds.length() > 0) {
    return false;
}
```
I guess the duplication (the same check 5 lines later) is here for performance reasons?
Yup. If the queueHeadIds is non-empty, we return immediately, otherwise we try to fetch something from the upstream queue, which may take time. I'll add a comment.
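The fast-path/slow-path pattern being described can be sketched roughly like this (a minimal standalone model, not the actual crawlee code; the class and method names here are made up):

```typescript
// Minimal model of the double-check: a cheap local check avoids a network
// round trip, and the check is repeated after the (slow) upstream fetch,
// since the fetch may have refilled the local head cache.
class QueueSketch {
    constructor(private queueHeadIds: string[] = []) {}

    // Stand-in for the upstream queue fetch, which may populate the cache.
    private async fetchHeadFromUpstream(upstreamIds: string[]): Promise<void> {
        this.queueHeadIds.push(...upstreamIds);
    }

    async isFinished(upstreamIds: string[]): Promise<boolean> {
        // Fast path: anything cached locally means we're not finished.
        if (this.queueHeadIds.length > 0) {
            return false;
        }
        // Slow path: ask the upstream queue, which may take time.
        await this.fetchHeadFromUpstream(upstreamIds);
        // Re-check: the fetch may have populated the local cache.
        return this.queueHeadIds.length === 0;
    }
}
```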
This PR ports over the changes from apify/crawlee#2775. Key changes:
- tracking of "locked" or "in progress" requests was moved from `storages.RequestQueue` to the request storage client implementations
- the queue head cache gets invalidated after we enqueue a new forefront request (before that, it would only be processed after the current head cache was consumed)
- the `RequestQueue.is_finished` function has been rewritten to avoid race conditions
- I tried running the SDK integration tests with these changes and they passed
queueHasLockedRequests to simplify RequestQueue v2 #2767