refactor: Simplify request queue implementation #653

Merged
janbuchar merged 16 commits into master from simplify-request-queue
Mar 11, 2025

Conversation

janbuchar (Collaborator) commented Nov 5, 2024

This PR ports over the changes from apify/crawlee#2775.

Key changes:

  • tracking of "locked" or "in progress" requests was moved from storages.RequestQueue to request storage client implementations
  • the queue head cache is now invalidated after we enqueue a new forefront request (previously, such a request would only be processed after the current head cache had been consumed)
  • the RequestQueue.is_finished function has been rewritten to avoid race conditions
  • I tried running SDK integration tests with these changes and they passed
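
A minimal sketch of the first three changes, assuming a toy in-memory client. `ToyQueueClient` and all of its methods and attributes are hypothetical names invented for illustration; this is not the code from the PR or the actual storage client API. It shows the client itself tracking in-progress requests, dropping its head cache when a forefront request arrives, and answering `is_finished` from its own state.

```python
from __future__ import annotations

from collections import deque
from itertools import islice


class ToyQueueClient:
    """Hypothetical in-memory stand-in for a request storage client."""

    def __init__(self) -> None:
        self._pending: deque[str] = deque()     # requests not yet fetched
        self._in_progress: set[str] = set()     # fetched but not yet handled
        self._head_cache: deque[str] = deque()  # prefetched copy of the queue head

    def add_request(self, request_id: str, *, forefront: bool = False) -> None:
        if forefront:
            self._pending.appendleft(request_id)
            # The cached head no longer matches the real head, so drop it.
            self._head_cache.clear()
        else:
            self._pending.append(request_id)

    def fetch_next_request(self, cache_size: int = 2) -> str | None:
        if not self._head_cache:
            # Refill the cache from the current head of the queue.
            self._head_cache.extend(islice(self._pending, cache_size))
        if not self._head_cache:
            return None
        request_id = self._head_cache.popleft()
        self._pending.remove(request_id)
        self._in_progress.add(request_id)
        return request_id

    def mark_request_as_handled(self, request_id: str) -> None:
        self._in_progress.remove(request_id)

    def is_finished(self) -> bool:
        # Finished only when nothing is waiting and nothing is being processed.
        return not self._pending and not self._in_progress


queue = ToyQueueClient()
for request_id in ("a", "b", "c"):
    queue.add_request(request_id)

first = queue.fetch_next_request()         # "a"; the head cache still holds "b"
queue.add_request("x", forefront=True)     # invalidates the head cache
second = queue.fetch_next_request()        # "x", not the previously cached "b"
```

Without the `clear()` call in `add_request`, the second fetch in this toy model would return the cached `"b"` and the forefront request would wait until the cache ran dry, which is the behaviour the second bullet describes as the old one.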

@janbuchar janbuchar added the t-tooling Issues with this label are in the ownership of the tooling team. label Nov 5, 2024
@github-actions github-actions bot added this to the 102nd sprint - Tooling team milestone Nov 5, 2024

github-actions bot (Contributor) left a comment

⚠️ Pull Request Toolkit has failed!

Pull request is neither linked to an issue or epic nor labeled as adhoc!

@janbuchar janbuchar added the adhoc Ad-hoc unplanned task added during the sprint. label Nov 6, 2024

B4nan (Member) commented Nov 6, 2024

I was hoping we would first address this in the JS version, where it's actually costing us money. Do we have any reported problems on the Python side too?

janbuchar (Collaborator, Author) commented

> I was hoping we would first address this in the JS version, where it's actually costing us money. Do we have any reported problems on the Python side too?

I get your angle, but I wanted to try it on code where the implementation isn't scattered between RequestProvider and RequestQueue. Plus, yeah, actors also get stuck with the Python SDK and the request queue is a prime suspect.

B4nan (Member) commented Nov 6, 2024

If that abstraction is problematic, I am fine with removing it and leaving the base class to act mostly like an interface (we can't just make it one, as that would technically be breaking). Duplication is not a huge deal if we plan to drop the RQ v1 implementation anyway. But first we need to make sure v2 is working fine; we still get peeps from delivery moving back to v1 to deal with those weird issues...

janbuchar (Collaborator, Author) commented

> If that abstraction is problematic, I am fine with removing it and leaving the base class to act mostly like an interface (we can't just make it one, as that would technically be breaking). Duplication is not a huge deal if we plan to drop the RQ v1 implementation anyway. But first we need to make sure v2 is working fine; we still get peeps from delivery moving back to v1 to deal with those weird issues...

...and all of the above is why I chose to validate the points by Kuba in the Python version 😄

B4nan (Member) commented Nov 6, 2024

My issue is that not many people use the Python version, and the issues reported by delivery are quite random, so it's hard to confirm that something helps here. Maybe it will resolve the issue we have a repro for here, which would be great, but it might be a different one from those where we actually bleed money (we usually fully refund people with stuck runs).

I am fine doing it first here, but let's not wait for some confirmation before trying to fix it in the JS version.

@vdusek vdusek removed this from the 105th sprint - Tooling team milestone Mar 3, 2025
@github-actions github-actions bot added this to the 109th sprint - Tooling team milestone Mar 4, 2025
@github-actions github-actions bot added the tested Temporary label used only programmatically for some analytics. label Mar 7, 2025
@janbuchar janbuchar requested review from Pijukatel and vdusek March 7, 2025 12:50
@janbuchar janbuchar marked this pull request as ready for review March 7, 2025 12:50

vdusek (Collaborator) left a comment

The commit type - maybe fix would be better? I believe this deserves to be in the changelog.

Other than that, it looks much better, although I cannot say I understand everything.


next_request_id, _ = self._queue_head_dict.popitem(last=False) # ~removeFirst()

# This should never happen, but...
Collaborator

haha, great we're getting rid of this 😄

Comment on lines -36 to -58
class BoundedSet(Generic[T]):
    """A simple set datastructure that removes the least recently accessed item when it reaches `max_length`."""

    def __init__(self, max_length: int) -> None:
        self._max_length = max_length
        self._data = OrderedDict[T, object]()

    def __contains__(self, item: T) -> bool:
        found = item in self._data
        if found:
            self._data.move_to_end(item, last=True)
        return found

    def add(self, item: T) -> None:
        self._data[item] = True
        self._data.move_to_end(item)

        if len(self._data) > self._max_length:
            self._data.popitem(last=False)

    def clear(self) -> None:
        self._data.clear()

Collaborator

👍
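
For context, the eviction behaviour of the removed `BoundedSet` can be seen in a small self-contained run. The class body is the one quoted in the diff; the imports and the `T` type variable are assumptions about the original module, and the usage at the end is illustrative only.

```python
from collections import OrderedDict
from typing import Generic, TypeVar

T = TypeVar("T")  # assumed definition; not shown in the quoted diff


class BoundedSet(Generic[T]):
    """A simple set that evicts the least recently accessed item once it exceeds `max_length`."""

    def __init__(self, max_length: int) -> None:
        self._max_length = max_length
        self._data = OrderedDict[T, object]()

    def __contains__(self, item: T) -> bool:
        found = item in self._data
        if found:
            # A successful membership test counts as an access.
            self._data.move_to_end(item, last=True)
        return found

    def add(self, item: T) -> None:
        self._data[item] = True
        self._data.move_to_end(item)

        if len(self._data) > self._max_length:
            # Drop the entry that was accessed longest ago.
            self._data.popitem(last=False)

    def clear(self) -> None:
        self._data.clear()


recently_handled = BoundedSet[str](max_length=2)
recently_handled.add("a")
recently_handled.add("b")
touched = "a" in recently_handled   # True, and "a" becomes the most recent entry
recently_handled.add("c")           # over capacity: "b" is evicted, not "a"
```

Note that a plain `in` check changes which item gets evicted next, which is easy to overlook when reading code that uses the set.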

Comment on lines -103 to -107
_RECENTLY_HANDLED_CACHE_SIZE = 1000
"""Cache size for recently handled requests."""

_STORAGE_CONSISTENCY_DELAY = timedelta(seconds=3)
"""Expected delay for storage to achieve consistency, guiding the timing of subsequent read operations."""
Collaborator

👍

self.file_operation_lock = asyncio.Lock()
self._last_used_timestamp = Decimal(0)

self._in_progress = set[str]()
Collaborator

I saw # noqa: SLF001 several times when accessing this. Maybe there should be a public getter for it?
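
One way the suggested getter could look, as a sketch only: `ToyRequestQueueClient` and the `in_progress` property are hypothetical names, and nothing in this thread confirms that the PR adopted this approach. A read-only property would let callers inspect the set without touching the private attribute, so the `# noqa: SLF001` suppressions would no longer be needed at the call sites.

```python
from __future__ import annotations


class ToyRequestQueueClient:
    """Hypothetical client; only the in-progress bookkeeping is shown."""

    def __init__(self) -> None:
        self._in_progress = set[str]()

    @property
    def in_progress(self) -> frozenset[str]:
        """Read-only snapshot of the IDs of requests currently being processed."""
        return frozenset(self._in_progress)


client = ToyRequestQueueClient()
client._in_progress.add("request-1")  # stands in for the client's own bookkeeping
snapshot = client.in_progress         # callers read through the public property
```

Returning a `frozenset` copy keeps callers from mutating the client's internal state by accident, at the cost of one copy per access.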

@janbuchar janbuchar merged commit 06829d1 into master Mar 11, 2025
25 checks passed
@janbuchar janbuchar deleted the simplify-request-queue branch March 11, 2025 13:00