Move remote replies scraping to a rate-limited background worker #445

@dahlia

Description

persistPost() currently fetches the remote replies collection while handling the post that triggered the persistence. It then walks that collection and recursively calls persistPost() for each reply.

That makes inbox and lookup processing do extra remote work synchronously. A single post can cause a burst of HTTP requests to the origin instance, and recursive reply traversal can multiply that load. In #443, this showed up as repeated requests to the same remote /replies URL within milliseconds, followed by HTTP 429 responses from the origin.

Now that Hollo has worker nodes in development, remote replies scraping should move out of the immediate persistPost() path. Persisting a post should save the post itself and enqueue follow-up work for replies when appropriate. A worker can then fetch replies later, at a controlled pace.
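A minimal sketch of the intended split, assuming a hypothetical in-memory job queue; the names (`ReplyScrapeJob`, `jobQueue`, the `persistPost()` signature) are illustrative, not Hollo's actual API:

```typescript
// Illustrative only: persistPost() saves the post and enqueues follow-up
// work instead of fetching the replies collection inline.
interface ReplyScrapeJob {
  postIri: string;
  repliesIri: string;
}

const jobQueue: ReplyScrapeJob[] = [];

function persistPost(post: { iri: string; repliesIri?: string; local: boolean }): void {
  // ... save the post itself to the database (omitted) ...
  if (!post.local && post.repliesIri !== undefined) {
    // Defer: a worker fetches the collection later, at a controlled pace.
    jobQueue.push({ postIri: post.iri, repliesIri: post.repliesIri });
  }
}

persistPost({
  iri: "https://remote.example/posts/1",
  repliesIri: "https://remote.example/posts/1/replies",
  local: false,
});
```

The key property is that `persistPost()` returns without making any HTTP requests for replies; the burst of requests seen in #443 becomes a queue of jobs instead.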

This issue tracks the background worker and rate-limiting part of #443. It is separate from making duplicate Announce handling idempotent.

Expected behavior

When Hollo persists a remote post with a replies collection, it should not eagerly walk the whole collection in the same request or inbox-processing path. Instead, Hollo should enqueue a background job for scraping replies.

The worker should process those jobs slowly enough to avoid hammering the origin instance. It should also avoid fetching the same replies collection repeatedly when several activities refer to the same post around the same time.

Possible design

  • Add a background job type for remote replies scraping.
  • Enqueue the job from persistPost() only when the post is remote and has a replies collection worth fetching.
  • Deduplicate jobs by post IRI or replies collection IRI so concurrent inbox deliveries do not enqueue the same work repeatedly.
  • Process the jobs from worker nodes, not web-only nodes.
  • Apply a rate limit per origin host, with at most a small amount of concurrency.
  • Respect HTTP 429 responses by backing off before retrying that origin.
  • Bound the amount of work per job, for example by limiting depth, item count, or pages fetched.
  • Make recursive fetching explicit in the worker rather than an implicit side effect of every persistPost() call.
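The deduplication and per-origin rate-limiting points above could look roughly like this, using in-memory state; the dedup key, the one-second interval, and the page bound are all assumptions for illustration:

```typescript
// Sketch of worker-side policies: dedup by replies collection IRI and
// a minimum interval between fetches to the same origin host.
const pending = new Set<string>();               // dedup by replies collection IRI
const nextAllowedAt = new Map<string, number>(); // per-origin earliest next fetch
const MIN_INTERVAL_MS = 1000;                    // at most one fetch per origin per second
const MAX_PAGES = 5;                             // bound the work done per job

function tryEnqueue(repliesIri: string): boolean {
  // Concurrent inbox deliveries for the same post enqueue at most one job.
  if (pending.has(repliesIri)) return false;
  pending.add(repliesIri);
  return true;
}

function mayFetch(origin: string, nowMs: number): boolean {
  const at = nextAllowedAt.get(origin) ?? 0;
  if (nowMs < at) return false;                  // too soon: worker retries later
  nextAllowedAt.set(origin, nowMs + MIN_INTERVAL_MS);
  return true;
}
```

In a real deployment this state would live in the job queue backend rather than process memory, so that several worker nodes share the same dedup set and rate-limit clock.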

Open questions

  • Should replies scraping run for every remote post, or only for posts that are shown in local timelines?
  • Should the worker fetch only direct replies, or should it recursively fetch deeper replies with a depth limit?
  • What should the default per-origin rate limit be?
  • Should there be an environment variable or admin setting for enabling, disabling, or tuning this behavior?

Acceptance criteria

  • persistPost() no longer walks remote replies collections synchronously during normal inbox or lookup handling.
  • Remote replies scraping is performed by worker nodes.
  • Multiple events for the same post do not cause duplicate scraping jobs for the same replies collection.
  • The worker enforces a per-origin rate limit.
  • The worker backs off after HTTP 429 responses instead of immediately retrying the same origin.
  • The implementation has tests for job deduplication and bounded reply scraping.
  • The behavior is documented, including any new configuration knobs if they are added.
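One way to satisfy the 429 backoff criterion is exponential backoff per origin; the base delay and cap below are illustrative choices, not decided values (a real implementation should also honor any `Retry-After` header the origin sends):

```typescript
// Sketch: delay before the next fetch to an origin that returned HTTP 429,
// doubling with each consecutive 429 and capped.
const BASE_DELAY_MS = 60_000;   // first backoff: one minute (assumption)
const MAX_DELAY_MS = 3_600_000; // cap: one hour (assumption)

function backoffDelay(consecutive429s: number): number {
  // 1 -> 60s, 2 -> 120s, 3 -> 240s, ... capped at one hour
  const delay = BASE_DELAY_MS * 2 ** (consecutive429s - 1);
  return Math.min(delay, MAX_DELAY_MS);
}
```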

Labels

  • bug (Something isn't working)
  • dependencies (Pull requests that update a dependency file)
  • enhancement (New feature or request)
  • performance
