fix: aggregator restart missing events by dav1do · Pull Request #737 · ceramicnetwork/rust-ceramic

dav1do · 2025-08-30T16:51:38Z

Fix issue where the aggregator could restart and miss events, causing validation errors on valid streams.

This includes two changes:

First, we make the aggregator's process_conclusion_events_batch and join_event_states idempotent and exclude duplicate incoming events, preferring the on-disk events when they are duplicated in the batch. This means we can go backward and see events again without causing errors.
Second, we fix the issue where the memory cache seemed to be "going backwards": we'd flush events to disk while events with a lower conclusion_event_order were still in memory, so on restart we'd skip them, because we resumed from our previous on-disk max. The missing events (usually init) caused aggregator errors, and streams would be full of validation errors. The cause was that we write events from the batch to disk in phases (Models, then MIDs). Although our batch covered [X, Y], the two internal batches could have interleaving order, so if the first batch caused a cache flush, we'd leave the remaining events in memory. Now we only flush the cache once per conclusion events batch, so they are all written to disk together. This means our cache table can grow even further over its allotted size, but that is short-lived and worth it.
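
To make the first change concrete, here is a minimal sketch of idempotent duplicate exclusion. It is not the code from this PR: the `EventState` struct, its fields other than `conclusion_event_order`, and the `exclude_known_events` helper are hypothetical stand-ins, and plain vectors are used in place of the real tables.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified stand-in for a row in the event states table.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct EventState {
    /// Assumed unique identifier for the event (illustrative only).
    pub event_cid: String,
    pub conclusion_event_order: u64,
}

/// Keep only incoming events that are neither already on disk nor repeated
/// earlier in the same batch. The on-disk copy wins by construction, because
/// an incoming duplicate is dropped rather than merged.
pub fn exclude_known_events(
    on_disk: &[EventState],
    incoming: Vec<EventState>,
) -> Vec<EventState> {
    // Seed the set with everything already persisted.
    let mut seen: HashSet<String> = on_disk.iter().map(|e| e.event_cid.clone()).collect();
    incoming
        .into_iter()
        // `insert` returns false when the id was already present.
        .filter(|e| seen.insert(e.event_cid.clone()))
        .collect()
}
```

Feeding the same batch through this helper a second time, after the first result has been persisted, yields an empty vector, which is the property that makes replaying events after a restart harmless.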

With this, we shouldn't need the shutdown task to flush, but I don't see any harm in keeping it, so it's still there.

restart from the beginning of conclusion events to avoid skipping on restarts. need to figure out why memory table contains conclusion_event_order that is less than the data written to event_states still, and hopefully restart where we left off. for now, we just ignore things we've seen before.

… whole we split the batch into models/mids and were writing both pieces to disk separately. The entire batch is ordered by conclusion_event_order, but each group could overlap. So we could flush the cache for the first write and then we'd have a cache that was "going backward" in conclusion_event_order. On restart, those events were missed and events failed to aggregate correctly.

m0ar

I think this looks good, just some clarification requests and nitpicks ✨

m0ar · 2025-09-10T08:32:40Z

        let models = self.validate_models(models).context("validating models")?;
        let mut models = self
-            .store_event_states(models)
+            .store_event_states(models, false)


Could you elaborate on why we prevent flush here, but not for the mids below?

So this is the crux of the fix, which I tried to explain in the second bullet point (verbosely and imprecisely). Basically, we put the entire batch of conclusion events into memory (models here and MIDs below) and only flush once, so that the in-memory order can't end up behind the on-disk order.
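
The failure mode can be reproduced with a toy model. The sketch below is an illustration under stated assumptions, not the aggregator's implementation: `ToyStore`, its fields, and `lost_on_restart` are hypothetical, the store only tracks `conclusion_event_order` values, and a restart is assumed to resume strictly after the on-disk maximum.

```rust
/// Hypothetical toy store that tracks only conclusion_event_order values.
pub struct ToyStore {
    pub disk: Vec<u64>,
    pub cache: Vec<u64>,
    pub max_cached_rows: usize,
}

impl ToyStore {
    pub fn new(max_cached_rows: usize) -> Self {
        Self { disk: Vec::new(), cache: Vec::new(), max_cached_rows }
    }

    /// Add one phase of the batch (models or MIDs) to the in-memory cache,
    /// flushing only when the caller allows it and the cache is large enough.
    pub fn store_event_states(&mut self, orders: &[u64], allow_cache_flush: bool) {
        self.cache.extend_from_slice(orders);
        if allow_cache_flush && self.cache.len() >= self.max_cached_rows {
            self.flush_cache();
        }
    }

    /// Move everything from the cache to disk.
    pub fn flush_cache(&mut self) {
        self.disk.append(&mut self.cache);
    }

    /// Orders that would be skipped if the process died now and resumed
    /// after the highest order found on disk.
    pub fn lost_on_restart(&self) -> Vec<u64> {
        let resume_from = self.disk.iter().copied().max();
        self.cache
            .iter()
            .copied()
            // With nothing on disk, resume_from is None and nothing is skipped.
            .filter(|order| Some(*order) <= resume_from)
            .collect()
    }
}
```

With `max_cached_rows` set to 3, models `[2, 4, 5]` and MIDs `[1, 3]`, allowing a flush in both phases writes the models to disk and leaves `1` and `3` in memory, so both are skipped on restart. Passing `false` for the models phase and `true` for the MIDs phase flushes all five together, and nothing is lost.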

m0ar · 2025-09-10T08:35:28Z

            .with_column_renamed("event_height", "previous_height")?;

-        Ok(conclusion_events
+        let conclusion_events = conclusion_events


Did we lose a pattern-matching error check here?

No, I just shadowed the name while doing some more filtering (to be idempotent if we received the conclusion event a second time). Instead of returning Ok(select_things().await?), we now assign and then return Ok(conclusion_events) at the end.
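
A small synchronous sketch of that shape, to show that error propagation is unchanged. The `select_things` body, the `new_events` function, and the `String` error type are hypothetical; the real code is async and works on query results rather than vectors.

```rust
/// Hypothetical stand-in for the real query; fails on empty input.
pub fn select_things(input: &[u64]) -> Result<Vec<u64>, String> {
    if input.is_empty() {
        return Err("no conclusion events".to_string());
    }
    Ok(input.to_vec())
}

/// Assign, filter under the same (shadowed) name, then return at the end.
pub fn new_events(input: &[u64], already_seen: &[u64]) -> Result<Vec<u64>, String> {
    // `?` still propagates any error from the query, as before.
    let conclusion_events = select_things(input)?;
    // Shadow the binding with the filtered result so a repeated event is dropped.
    let conclusion_events: Vec<u64> = conclusion_events
        .into_iter()
        .filter(|order| !already_seen.contains(order))
        .collect();
    Ok(conclusion_events)
}
```

The only difference from returning `Ok(select_things(input)?)` directly is the extra filtering step between the assignment and the final `Ok`.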

m0ar · 2025-09-10T08:48:00Z

+        if allow_cache_flush {
            self.flush_cache().await?;
+            let count = self.count_cache().await?;
+            tracing::debug!(%count, will_flush = %count >= self.max_cached_rows, "counts for mem table");
+            // If we have enough data cached in memory write it out to persistent store
+            if count >= self.max_cached_rows {
+                self.flush_cache().await?;
+            }


Would be good to document why this check is important. I don't really understand it outside of testing use cases, in particular the block on l301 🤔

hmm, I think this comment is reasonable but open to improvements: If we have enough data cached in memory write it out to persistent store

I'm trying to figure out how to clarify. We used to flush the cache every time we got here if we had at least as many rows in the mem table as our max_cached_rows value. Now I've added an allow_cache_flush flag, because we call this twice while processing a single batch of conclusion events and we don't want to flush the cache in the middle of the batch; we now do it only once at the end.

If we only processed the conclusion events in order, this wouldn't be necessary, but the conclusion_event_order is the order of arrival on the node, which only deals with stream ordering (hence the guarantee that any conclusion event comes after all prior events in its stream, but not necessarily after its model, as we accept and persist events without the model present).
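
To illustrate why splitting an ordered batch into two groups can interleave, here is a minimal sketch. The `Kind` enum and both helper functions are hypothetical names for illustration, and events are reduced to a `(conclusion_event_order, kind)` pair.

```rust
/// Hypothetical tag for which phase an event belongs to.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Kind {
    Model,
    Mid,
}

/// Split a batch that is sorted by conclusion_event_order into the two groups
/// that are written in separate phases. Each group stays sorted on its own.
pub fn split_batch(batch: &[(u64, Kind)]) -> (Vec<u64>, Vec<u64>) {
    let models: Vec<u64> = batch
        .iter()
        .filter(|(_, kind)| *kind == Kind::Model)
        .map(|(order, _)| *order)
        .collect();
    let mids: Vec<u64> = batch
        .iter()
        .filter(|(_, kind)| *kind == Kind::Mid)
        .map(|(order, _)| *order)
        .collect();
    (models, mids)
}

/// True when some MID has a lower order than the highest model. Because models
/// are written first, flushing after the models phase would then put a higher
/// order on disk than one still waiting in memory.
pub fn groups_interleave(models: &[u64], mids: &[u64]) -> bool {
    match (models.iter().max(), mids.iter().min()) {
        (Some(max_model), Some(min_mid)) => min_mid < max_model,
        _ => false,
    }
}
```

For example, a batch where a MID event arrives at order 1, its model at order 2, and another MID at order 3 splits into models `[2]` and MIDs `[1, 3]`, which interleave. A batch with the model first and a single MID after it does not.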

m0ar · 2025-09-10T08:50:38Z

+        // splice the 3 events together making sure each vec isn't reordered but not all in a row
+        let events = &events;


Not sure if this comment refers to the right thing

whoops, yep, I modified this behavior... removed and moved the model into the middle to be more likely to replicate the real cause

dav1do added 2 commits August 30, 2025 08:36

chore: deal with rustc 1.89 mismatched_lifetime_syntaxes warnings

c64249c

dav1do temporarily deployed to tnet-prod-2024 August 30, 2025 17:13 — with GitHub Actions Inactive

dav1do marked this pull request as ready for review August 31, 2025 16:02

dav1do requested a review from a team as a code owner August 31, 2025 16:02

dav1do requested review from m0ar, smrz2001 and stephhuynh18 and removed request for a team August 31, 2025 16:02

dav1do temporarily deployed to tnet-prod-2024 August 31, 2025 16:15 — with GitHub Actions Inactive

dav1do force-pushed the fix/aggregator-restart branch from 519f887 to 492c7b3 Compare September 1, 2025 16:26

dav1do temporarily deployed to tnet-prod-2024 September 1, 2025 16:48 — with GitHub Actions Inactive

m0ar approved these changes Sep 10, 2025

test: fix comment and modify test to better replicate ordering bug

68f5c07

dav1do temporarily deployed to tnet-prod-2024 September 17, 2025 02:24 — with GitHub Actions Inactive

dav1do added this pull request to the merge queue Sep 30, 2025

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Sep 30, 2025

m0ar added this pull request to the merge queue Sep 30, 2025

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Sep 30, 2025

m0ar added this pull request to the merge queue Sep 30, 2025

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Sep 30, 2025

chore(ci): add fmt/clippy

f2b3fb5

smrz2001 enabled auto-merge September 30, 2025 19:30

smrz2001 had a problem deploying to tnet-prod-2024 September 30, 2025 19:51 — with GitHub Actions Failure

smrz2001 had a problem deploying to tnet-prod-2024 September 30, 2025 21:01 — with GitHub Actions Failure

smrz2001 had a problem deploying to tnet-prod-2024 September 30, 2025 21:28 — with GitHub Actions Failure

chore(ci): use newer jq image

9355aaa

smrz2001 had a problem deploying to tnet-prod-2024 September 30, 2025 23:29 — with GitHub Actions Failure

chore(ci): use alpine image for jq

c0255df

smrz2001 had a problem deploying to tnet-prod-2024 October 1, 2025 00:05 — with GitHub Actions Failure

smrz2001 had a problem deploying to tnet-prod-2024 October 1, 2025 00:58 — with GitHub Actions Failure

smrz2001 had a problem deploying to tnet-prod-2024 October 1, 2025 01:00 — with GitHub Actions Failure

chore(ci): install bash in init container

dae2fd8

smrz2001 temporarily deployed to tnet-prod-2024 October 1, 2025 01:26 — with GitHub Actions Inactive

smrz2001 added this pull request to the merge queue Oct 1, 2025

Merged via the queue into main with commit 6c11a03 Oct 1, 2025
20 checks passed

smrz2001 deleted the fix/aggregator-restart branch October 1, 2025 02:04

smrz2001 mentioned this pull request Oct 1, 2025

chore: version v0.56.1 #740

Merged

smrz2001 merged 8 commits into main from fix/aggregator-restart

3 participants