Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Request queue v2 - locking #1365

Closed
drobnikj opened this issue Jun 6, 2022 · 5 comments · Fixed by #1975
Closed

Request queue v2 - locking #1365

drobnikj opened this issue Jun 6, 2022 · 5 comments · Fixed by #1975
Assignees
Labels
feature Issues that represent new features or improvements to existing features. product roadmap Issues contributing to product roadmap. sprint goal Issues contributing to sprint goal. t-tooling Issues with this label are in the ownership of the tooling team. tech roadmap Issues contributing to tech roadmap.

Comments

@drobnikj
Copy link
Member

drobnikj commented Jun 6, 2022

Request queue v2 context

There is a specification for the request queue v2 project in internal docs, as it is not updated and not available for all outside of Apify. I sum up the main changes about the rq v2 project here:

  • Persistence & reusability - The main goal is to make a request queue reusable.
  • Batch operations - Primary goal - enable user to enqueue 1M URLs quickly. Secondary goal - enable the user to also quickly delete requests from the queue as they will be able with no TTL (above) create over time and maintain a very large queue
  • Distributivity - The main goal is to move all logic about locking requests on the server. This makes request queue scalable and allow to scraper one column from several clients (actors).

Changes in crawlee

  • It should use listAndLockHead instead listHead method and lock the requests to process. The client will hold these requests locally in the process. This is the only client that will be able to lock and fetch these requests.
  • If the request will be successfully processed, it will be marked as handled with { handledAt: new Date() } attribute, it remove the lock under hood.
  • If request timeouts/fails, the request will still be held locally and possibly prolong lock with updating request with attribute { lockExpiresAt: Date.now() + lockSecs * 1000 } or using prolongRequestLock() method.
  • If the actor is migrating, it should unlock all with deleteRequestLock() method. If not the lock expires anyway.
  • Ensuring that the queue is empty should wait for the lockSecs after it is empty to potential request lock can be expired.

Change in memory storage

The locking mechanism can be based on the request orderNo attribute.

There are the following sets of orderNo values:

  • [-Date.now(), Date.now()] : The request is not locked.
  • (Date.now(), +infinity) : The request is locked.
  • (-infinity, -Date.now()) : The request is locked, and it was added to queue with forefront parameter.

Based on that, the local request queue will handle

  • listAndLockHead - will get sorted request with orderNo [-Date.now(), Date.now()] and return, pick first x and set orderNo to Date.now() + lockSecs * 1000 (if forefront * -1)
  • prolongRequestLock - set orderNo on |orderNo| + lockSecs * 1000 (if forefront * -1)
  • deleteRequestLock - set orderNo on Date.now()
@drobnikj drobnikj added the feature Issues that represent new features or improvements to existing features. label Jun 6, 2022
@metalwarrior665
Copy link
Member

I can see a potential issue if the client will lock the requests and then migration will clear its memory. It will then fetch the next batch but there might be no other requests to fetch, only the locked ones.

@mnmkng
Copy link
Member

mnmkng commented Aug 30, 2022

We just need to be careful not to finish the crawler when there are still locked requests in the queue. Even if we don't manage to call the unlock method before the migration, which is unlikely, the locks will time-out eventually.

@metalwarrior665
Copy link
Member

I missed the part that the crawler will pause & unlock on migration. That solves the problem

@drobnikj drobnikj changed the title [WIP] Request queue v2 - batch operations, locking, requests list Request queue v2 - locking Apr 5, 2023
@drobnikj
Copy link
Member Author

drobnikj commented Apr 5, 2023

I updated this old issue as most parts was done already, we are just missing the changes in crawlee and in local storage.

@vladfrangu
Copy link
Member

vladfrangu commented Jul 21, 2023

Remaining methods to be added for #1975

  • markRequestHandled
  • reclaimRequest
  • isEmpty
  • isFinished
  • drop
  • handledCount
  • getInfo
  • static open
  • addRequestsBatched

  • implement support for locks in storage-local
  • Make current request queue tests check both rqv1 and rqv2
  • Ensure rqv2 actually works as intended
  • add experiments option in crawlers
    • This object will remain present even when RQv2 becomes the RequestQueue that crawlers use by default, as it will allow us to add future experimental features without needing to respect semver on them

@B4nan B4nan added product roadmap Issues contributing to product roadmap. tech roadmap Issues contributing to tech roadmap. sprint goal Issues contributing to sprint goal. and removed product roadmap Issues contributing to product roadmap. labels Sep 11, 2023
@B4nan B4nan added this to the 72nd sprint - Tooling team milestone Sep 11, 2023
@mtrunkat mtrunkat added the t-tooling Issues with this label are in the ownership of the tooling team. label Sep 11, 2023
B4nan added a commit that referenced this issue Sep 21, 2023
Closes #1365

---------

Co-authored-by: Martin Adámek <banan23@gmail.com>
Co-authored-by: Jakub Drobník <drobnik.j@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Issues that represent new features or improvements to existing features. product roadmap Issues contributing to product roadmap. sprint goal Issues contributing to sprint goal. t-tooling Issues with this label are in the ownership of the tooling team. tech roadmap Issues contributing to tech roadmap.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants