Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Make sure we don't store duplicate actor events caused to reorgs in events.db #11015

Merged
merged 1 commit into from Jun 30, 2023

Conversation

fridrik01
Copy link
Contributor

@fridrik01 fridrik01 commented Jun 28, 2023

Fixes: filecoin-project/ref-fvm#1350

Context

When lotus is configured to enable indexing of actor events (when EnableEthRPC=true and DisableHistoricFilterAPI=false) then Lotus will observe for new tipset and store all the actor events in a sqlite table stored at sqlite/events.db.

However, users have been reporting that these events are duplicated and hard to understand (see filecoin-project/ref-fvm#1350).

To address this, I originally implemented a quick fix (#10897) which addressed this in the API level but the team decided that we instead want a less hacky fix where we do not store duplicate events in the database to start with.

This PR implements this on the DB level where depending on whether we are applying messages to revert or not we:

if revert: 
  mark all events events in this tipset as reverted
else
  for each event 
    if event already exists
       update event and set revert to false
    else 
       insert new event   

I introduced a new events migration which adds an index on (height,tipset_key) which I confirmed is used by all relevant sql statements:

EXPLAIN QUERY PLAN SELECT MAX(id) FROM event WHERE height=? AND tipset_key=? AND tipset_key_cid=? AND emitter_addr=? AND event_index=? AND message_cid=? AND message_index=?
- SEARCH event USING INDEX height_tipset_key (height=? AND tipset_key=?)
EXPLAIN QUERY PLAN UPDATE event SET reverted=true WHERE height=? AND tipset_key=?
- SEARCH event USING INDEX height_tipset_key (height=? AND tipset_key=?)
EXPLAIN QUERY PLAN UPDATE event SET reverted=false WHERE height=? AND tipset_key=? AND tipset_key_cid=? AND emitter_addr=? AND event_index=? AND message_cid=? AND message_index=?
- SEARCH event USING INDEX height_tipset_key (height=? AND tipset_key=?)

Furthermore, this PR also include the migration of existing actor events to not contain duplicate events

Test plan

I started a node with the changes in this PR and let it run for 40min (epoch 2988646-2988730). Then I started another node from master branch and then compared the generated actor events within those two epoch ranges between the two instances:

SELECT * FROM (
  SELECT e.height, e.tipset_key, e.emitter_addr, e.event_index, e.message_cid, e.message_index, e.reverted, ee.indexed, ee.flags, ee.key, ee.codec, ee.value 
  FROM event e, event_entry ee
  WHERE ee.event_id = e.id AND e.height > 2988646 AND e.height < 2988730
  EXCEPT
  SELECT e.height, e.tipset_key, e.emitter_addr, e.event_index, e.message_cid, e.message_index, e.reverted, ee.indexed, ee.flags, ee.key, ee.codec, ee.value 
  FROM events2.event e, events2.event_entry ee
  WHERE ee.event_id = e.id AND e.height > 2988646 AND e.height < 2988730
) WHERE reverted = 0

Result: 0 rows returned in 29ms

As expected there were no differences.

@fridrik01 fridrik01 force-pushed the fix-dedup branch 5 times, most recently from b5b6123 to 908c63d Compare June 29, 2023 15:45
@fridrik01 fridrik01 marked this pull request as ready for review June 29, 2023 16:29
@fridrik01 fridrik01 requested a review from a team as a code owner June 29, 2023 16:29
Copy link
Member

@Stebalien Stebalien left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass.

chain/events/filter/index.go Show resolved Hide resolved
chain/events/filter/index.go Outdated Show resolved Hide resolved
chain/events/filter/index.go Outdated Show resolved Hide resolved
chain/events/filter/index.go Outdated Show resolved Hide resolved
chain/events/filter/index.go Outdated Show resolved Hide resolved
chain/events/filter/index_test.go Outdated Show resolved Hide resolved
@fridrik01 fridrik01 force-pushed the fix-dedup branch 4 times, most recently from eeb07a7 to e754715 Compare June 29, 2023 21:31
Copy link
Member

@Stebalien Stebalien left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of small things but they're up to you. Otherwise, LGTM!

chain/events/filter/index.go Outdated Show resolved Hide resolved
}
} else {
// event already exists, lets mark it as not reverted
res, err := tx.Stmt(ei.stmtRestoreEvent).Exec(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, due to the transaction semantics, if any events are recorded, all should be recorded. So we should be able to check:

  1. Are any events recorded for this tipset. If so, unmark them as reverted and return.
  2. Otherwise, insert them.

But, we can also keep the current logic unless it becomes a performance issue.

@Stebalien
Copy link
Member

You'll have to tell the linter to "go away" by appending //nolint:errcheck to the deferred reverts.

@fridrik01 fridrik01 merged commit 8577dcb into master Jun 30, 2023
93 checks passed
@fridrik01 fridrik01 deleted the fix-dedup branch June 30, 2023 11:56
@arajasek
Copy link
Contributor

arajasek commented Jul 6, 2023

@fridrik01 @Stebalien So, the migration in this PR is...slow. On @TippyFlitsUK's machine, it seemed to be happening at a pace of less than 1 per second, which would take (at least) 400k seconds, or 5 days, for anyone with history going back to FEVM launch. Further, it's blocking the daemon's API coming alive, which is an issue for SPs, RPC service providers, etc. Can we:

  • Speed up the migration? :D
  • Run the migration in the background, without blocking the daemon's? That way you have the node up and running, even if the EventsAPI subsystem is down
  • Run the migration entirely offline?

@Stebalien
Copy link
Member

One second per epoch is really strange. It should be milliseconds (we have a constant number of sql queries per epoch). I've filed an issue and will investigate: #11056.

Parallelism should be the easiest way to speed this up, but even then, I'm not sure how much parallelism we can get out of sqlite here.

@fridrik01 fridrik01 changed the title feat: Make sure we don't store duplidate actor events caused to reorgs in events.db feat: Make sure we don't store duplicate actor events caused to reorgs in events.db Jul 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Logs: Duplication + nondeterminism
3 participants