feat: Make sure we don't store duplicate actor events caused to reorgs in events.db #11015

fridrik01 · 2023-06-28T17:23:11Z

Context

When lotus is configured to enable indexing of actor events (when EnableEthRPC=true and DisableHistoricFilterAPI=false) then Lotus will observe for new tipset and store all the actor events in a sqlite table stored at sqlite/events.db.

However, users have been reporting that these events are duplicated and hard to understand (see filecoin-project/ref-fvm#1350).

To address this, I originally implemented a quick fix (#10897) which addressed this in the API level but the team decided that we instead want a less hacky fix where we do not store duplicate events in the database to start with.

This PR implements this on the DB level where depending on whether we are applying messages to revert or not we:

if revert: 
  mark all events events in this tipset as reverted
else
  for each event 
    if event already exists
       update event and set revert to false
    else 
       insert new event

I introduced a new events migration which adds an index on (height,tipset_key) which I confirmed is used by all relevant sql statements:

EXPLAIN QUERY PLAN SELECT MAX(id) FROM event WHERE height=? AND tipset_key=? AND tipset_key_cid=? AND emitter_addr=? AND event_index=? AND message_cid=? AND message_index=?
- SEARCH event USING INDEX height_tipset_key (height=? AND tipset_key=?)
EXPLAIN QUERY PLAN UPDATE event SET reverted=true WHERE height=? AND tipset_key=?
- SEARCH event USING INDEX height_tipset_key (height=? AND tipset_key=?)
EXPLAIN QUERY PLAN UPDATE event SET reverted=false WHERE height=? AND tipset_key=? AND tipset_key_cid=? AND emitter_addr=? AND event_index=? AND message_cid=? AND message_index=?
- SEARCH event USING INDEX height_tipset_key (height=? AND tipset_key=?)

Furthermore, this PR also include the migration of existing actor events to not contain duplicate events

Test plan

I started a node with the changes in this PR and let it run for 40min (epoch 2988646-2988730). Then I started another node from master branch and then compared the generated actor events within those two epoch ranges between the two instances:

SELECT * FROM (
  SELECT e.height, e.tipset_key, e.emitter_addr, e.event_index, e.message_cid, e.message_index, e.reverted, ee.indexed, ee.flags, ee.key, ee.codec, ee.value 
  FROM event e, event_entry ee
  WHERE ee.event_id = e.id AND e.height > 2988646 AND e.height < 2988730
  EXCEPT
  SELECT e.height, e.tipset_key, e.emitter_addr, e.event_index, e.message_cid, e.message_index, e.reverted, ee.indexed, ee.flags, ee.key, ee.codec, ee.value 
  FROM events2.event e, events2.event_entry ee
  WHERE ee.event_id = e.id AND e.height > 2988646 AND e.height < 2988730
) WHERE reverted = 0

Result: 0 rows returned in 29ms

As expected there were no differences.

Stebalien

First pass.

chain/events/filter/index.go

chain/events/filter/index_test.go

Stebalien

Couple of small things but they're up to you. Otherwise, LGTM!

chain/events/filter/index.go

Stebalien · 2023-06-29T21:54:00Z

chain/events/filter/index.go

+				}
+			} else {
+				// event already exists, lets mark it as not reverted
+				res, err := tx.Stmt(ei.stmtRestoreEvent).Exec(


Technically, due to the transaction semantics, if any events are recorded, all should be recorded. So we should be able to check:

Are any events recorded for this tipset. If so, unmark them as reverted and return.

Otherwise, insert them.

But, we can also keep the current logic unless it becomes a performance issue.

Stebalien · 2023-06-29T21:56:06Z

You'll have to tell the linter to "go away" by appending //nolint:errcheck to the deferred reverts.

arajasek · 2023-07-06T16:38:06Z

@fridrik01 @Stebalien So, the migration in this PR is...slow. On @TippyFlitsUK's machine, it seemed to be happening at a pace of less than 1 per second, which would take (at least) 400k seconds, or 5 days, for anyone with history going back to FEVM launch. Further, it's blocking the daemon's API coming alive, which is an issue for SPs, RPC service providers, etc. Can we:

Speed up the migration? :D
Run the migration in the background, without blocking the daemon's? That way you have the node up and running, even if the EventsAPI subsystem is down
Run the migration entirely offline?

Stebalien · 2023-07-06T16:53:13Z

One second per epoch is really strange. It should be milliseconds (we have a constant number of sql queries per epoch). I've filed an issue and will investigate: #11056.

Parallelism should be the easiest way to speed this up, but even then, I'm not sure how much parallelism we can get out of sqlite here.

fridrik01 force-pushed the fix-dedup branch 5 times, most recently from b5b6123 to 908c63d Compare June 29, 2023 15:45

fridrik01 marked this pull request as ready for review June 29, 2023 16:29

fridrik01 requested a review from a team as a code owner June 29, 2023 16:29

Stebalien reviewed Jun 29, 2023

View reviewed changes

fridrik01 force-pushed the fix-dedup branch 4 times, most recently from eeb07a7 to e754715 Compare June 29, 2023 21:31

Stebalien approved these changes Jun 29, 2023

View reviewed changes

fridrik01 force-pushed the fix-dedup branch from e754715 to a382139 Compare June 29, 2023 21:55

feat: Make sure we dont store duplidate actor events caused by reorgs

b80c082

fridrik01 force-pushed the fix-dedup branch from a382139 to b80c082 Compare June 30, 2023 11:21

fridrik01 merged commit 8577dcb into master Jun 30, 2023
93 checks passed

fridrik01 deleted the fix-dedup branch June 30, 2023 11:56

fridrik01 changed the title ~~feat: Make sure we don't store duplidate actor events caused to reorgs in events.db~~ feat: Make sure we don't store duplicate actor events caused to reorgs in events.db Jul 6, 2023

Fatman13 mentioned this pull request Jul 10, 2023

eth api: actor event dup fix / actor 事件重复修复 filecoin-project/venus#6061

Closed

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Make sure we don't store duplicate actor events caused to reorgs in events.db #11015

feat: Make sure we don't store duplicate actor events caused to reorgs in events.db #11015

fridrik01 commented Jun 28, 2023 •

edited

Stebalien left a comment

Stebalien left a comment

Stebalien Jun 29, 2023

Stebalien commented Jun 29, 2023

arajasek commented Jul 6, 2023

Stebalien commented Jul 6, 2023

feat: Make sure we don't store duplicate actor events caused to reorgs in events.db #11015

feat: Make sure we don't store duplicate actor events caused to reorgs in events.db #11015

Conversation

fridrik01 commented Jun 28, 2023 • edited

Context

Test plan

Stebalien left a comment

Choose a reason for hiding this comment

Stebalien left a comment

Choose a reason for hiding this comment

Stebalien Jun 29, 2023

Choose a reason for hiding this comment

Stebalien commented Jun 29, 2023

arajasek commented Jul 6, 2023

Stebalien commented Jul 6, 2023

fridrik01 commented Jun 28, 2023 •

edited