Move remote replies scraping to a rate-limited background worker #445

@dahlia

Description

persistPost() currently fetches the remote replies collection while handling the post that triggered the persistence. It then walks that collection and recursively calls persistPost() for each reply.

That makes inbox and lookup processing do extra remote work synchronously. A single post can cause a burst of HTTP requests to the origin instance, and recursive reply traversal can multiply that load. In #443, this showed up as repeated requests to the same remote /replies URL within milliseconds, followed by HTTP 429 responses from the origin.

Now that Hollo has worker nodes in development, remote replies scraping should move out of the immediate persistPost() path. Persisting a post should save the post itself and enqueue follow-up work for replies when appropriate. A worker can then fetch replies later, at a controlled pace.
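A minimal sketch of the intended split, assuming a hypothetical in-memory job queue; the names (`ReplyScrapeJob`, `jobQueue`, the `persistPost()` signature) are illustrative, not Hollo's actual API:

```typescript
// Illustrative only: persistPost() saves the post and enqueues follow-up
// work instead of fetching the replies collection inline.
interface ReplyScrapeJob {
  postIri: string;
  repliesIri: string;
}

const jobQueue: ReplyScrapeJob[] = [];

function persistPost(post: { iri: string; repliesIri?: string; local: boolean }): void {
  // ... save the post itself to the database (omitted) ...
  if (!post.local && post.repliesIri !== undefined) {
    // Defer: a worker fetches the collection later, at a controlled pace.
    jobQueue.push({ postIri: post.iri, repliesIri: post.repliesIri });
  }
}

persistPost({
  iri: "https://remote.example/posts/1",
  repliesIri: "https://remote.example/posts/1/replies",
  local: false,
});
```

The key property is that `persistPost()` returns without making any HTTP requests for replies; the burst of requests seen in #443 becomes a queue of jobs instead.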

This issue tracks the background worker and rate-limiting part of #443. It is separate from making duplicate Announce handling idempotent.

Expected behavior

When Hollo persists a remote post with a replies collection, it should not eagerly walk the whole collection in the same request or inbox-processing path. Instead, Hollo should enqueue a background job for scraping replies.

The worker should process those jobs slowly enough to avoid hammering the origin instance. It should also avoid fetching the same replies collection repeatedly when several activities refer to the same post around the same time.

Possible design

  • Add a background job type for remote replies scraping.
  • Enqueue the job from persistPost() only when the post is remote and has a replies collection worth fetching.
  • Deduplicate jobs by post IRI or replies collection IRI so concurrent inbox deliveries do not enqueue the same work repeatedly.
  • Process the jobs from worker nodes, not web-only nodes.
  • Apply a rate limit per origin host, with at most a small amount of concurrency.
  • Respect HTTP 429 responses by backing off before retrying that origin.
  • Bound the amount of work per job, for example by limiting depth, item count, or pages fetched.
  • Make recursive fetching explicit in the worker rather than an implicit side effect of every persistPost() call.
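The deduplication and per-origin rate-limiting points above could look roughly like this, using in-memory state; the dedup key, the one-second interval, and the page bound are all assumptions for illustration:

```typescript
// Sketch of worker-side policies: dedup by replies collection IRI and
// a minimum interval between fetches to the same origin host.
const pending = new Set<string>();               // dedup by replies collection IRI
const nextAllowedAt = new Map<string, number>(); // per-origin earliest next fetch
const MIN_INTERVAL_MS = 1000;                    // at most one fetch per origin per second
const MAX_PAGES = 5;                             // bound the work done per job

function tryEnqueue(repliesIri: string): boolean {
  // Concurrent inbox deliveries for the same post enqueue at most one job.
  if (pending.has(repliesIri)) return false;
  pending.add(repliesIri);
  return true;
}

function mayFetch(origin: string, nowMs: number): boolean {
  const at = nextAllowedAt.get(origin) ?? 0;
  if (nowMs < at) return false;                  // too soon: worker retries later
  nextAllowedAt.set(origin, nowMs + MIN_INTERVAL_MS);
  return true;
}
```

In a real deployment this state would live in the job queue backend rather than process memory, so that several worker nodes share the same dedup set and rate-limit clock.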

Open questions

  • Should replies scraping run for every remote post, or only for posts that are shown in local timelines?
  • Should the worker fetch only direct replies, or should it recursively fetch deeper replies with a depth limit?
  • What should the default per-origin rate limit be?
  • Should there be an environment variable or admin setting for enabling, disabling, or tuning this behavior?

Acceptance criteria

  • persistPost() no longer walks remote replies collections synchronously during normal inbox or lookup handling.
  • Remote replies scraping is performed by worker nodes.
  • Multiple events for the same post do not cause duplicate scraping jobs for the same replies collection.
  • The worker enforces a per-origin rate limit.
  • The worker backs off after HTTP 429 responses instead of immediately retrying the same origin.
  • The implementation has tests for job deduplication and bounded reply scraping.
  • The behavior is documented, including any new configuration knobs if they are added.
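One way to satisfy the 429 backoff criterion is exponential backoff per origin; the base delay and cap below are illustrative choices, not decided values (a real implementation should also honor any `Retry-After` header the origin sends):

```typescript
// Sketch: delay before the next fetch to an origin that returned HTTP 429,
// doubling with each consecutive 429 and capped.
const BASE_DELAY_MS = 60_000;   // first backoff: one minute (assumption)
const MAX_DELAY_MS = 3_600_000; // cap: one hour (assumption)

function backoffDelay(consecutive429s: number): number {
  // 1 -> 60s, 2 -> 120s, 3 -> 240s, ... capped at one hour
  const delay = BASE_DELAY_MS * 2 ** (consecutive429s - 1);
  return Math.min(delay, MAX_DELAY_MS);
}
```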

Labels

  • bug (Something isn't working)
  • dependencies (Pull requests that update a dependency file)
  • enhancement (New feature or request)
  • performance
