Replace "type" index with multi-column index on "id" and "type" #69

daniel-niknam-envato · 2021-10-27T14:03:53Z

Context

With the current design, every time a new event is stored in the database, every projector/reactor will receive a signal from PostgreSQL and eventually they will query the database for new events despite the fact that the new event could be irrelevant to them. As a result, queries like this:

SELECT * FROM "events" WHERE ((id >= ?) AND ("type" IN (?))) ORDER BY "id" LIMIT ?

will be executed multiple times. Fixing this issue won't be very simple and requires changes in both event sourcery and this gem. However, we could modify the existing index on the events.type column and make it a multi-column index on both events.id and events.type. Although this change will not fix the issue entirely, it will improve the event sourcery performance significantly.

Adding the above index will have impact on EventSourcery::Postgres::EventStore#get_next_from method performance.

Recent investigation

In the recent investigation, we've found out that having a multi-column index for both events.id and events.type columns on the events table could improve performance significantly. We processed around 3M events on the same AWS RDS instance multiple times, here is the result:

Process duration before adding the index: 264 minutes
Process duration after adding the index: 103 minutes

Questions

What needs to be done for existing apps?

We can use gemspec post-install message to notify the existing applications about the change. At the moment, the event-sourcery gem does not provide a "migration" command so it would be better to help them by providing a SQL query for modifying the index. The documentation regarding creating a new index exists in the CHANGELOG.md file and that could be enough too.

Would this change would solve all performance issues?

No. There are other parts of the system that should be fixed or improved. For example, I can think of the following ideas:

Better polling mechanism to trigger loading events only when the newly created event is watched by the event processor (reactors/projectors).
Time-based polling for applications that do not require real-time event processing
Configurable event poller that supports native Postgres and Redis pub/sub.

Why not just add another index?

I was not able to find a query for the events.type column alone so I don't see any benefit for adding a new index when the existing one is not used.

scottyp-env · 2021-10-27T20:58:07Z

It may not be used by the framework but I do tend to find querying on type useful for ad-hoc diagnostics. Is that enough to push to having multiple indexes or would you suggest we leave it up to the consumer of the gem as to how they index their DB tables in that case?

daniel-niknam-envato · 2021-10-28T03:35:03Z

It may not be used by the framework but I do tend to find querying on type useful for ad-hoc diagnostics. Is that enough to push to having multiple indexes or would you suggest we leave it up to the consumer of the gem as to how they index their DB tables in that case?

@scottyp-env do you use the events.type column only in those queries or is it with the conjunction of other columns?

I think it would be better if the gem will create indexes that it requires and leave the rest to the user. for example, I have found it to be useful to have an index for "id" and "created_at" for debugging events but I don't expect the gem to implement that.

scottyp-env

Your reasoning sounds good @daniel-niknam-envato!

CHANGELOG.md

Domon · 2021-10-29T00:58:34Z

lib/event_sourcery/postgres/schema.rb

@@ -38,7 +38,7 @@ def create_events(db: EventSourcery::Postgres.config.event_store_database,
          column :created_at,     :'timestamp without time zone', null: false, default: Sequel.lit("(now() at time zone 'utc')")
          index [:aggregate_id, :version], unique: true
          index :uuid, unique: true
-          index :type
+          index [:id, :type]


What will be affected by this change?

I did a quick search of the create_events and create_event_store methods in this repo but I only found their usage in the benchmarking scripts and test setup. 🤔

Adding the above index will have impact on EventSourcery::Postgres::EventStore#get_next_from method performance.

I understand that adding the index to each application's events table will improve the performance of the get_next_from method.

My questions is more about what is depending on this schema file and what will be benefited from this change.

In the changelog, it is mentioned that this change doesn't affect existing projects. That made me wonder what is the outcome of this change.

If someone creates a new Event Sourcery project, will the latest schema from this file be used somehow?

Domon · 2021-10-29T01:04:00Z

CHANGELOG.md

@@ -15,10 +16,27 @@ and this project adheres to [Semantic Versioning](http://semver.org/).
 - Use GitHub Actions for the CI build instead of Travis CI ([#66]).
 - This project now uses `main` as its default branch ([#68]).
  - Documentation updated to refer to `main` and links updated accordingly.
+- The `events` table `type` modified to include both `id` and `type` ([#69]). 


Instead of an item in the "Changed" section, how about splitting it into two items?

One item in the "Added" section for the addition of the new composite index

Another item in the "Removed" section for the removal of the type index

Good thinking. I thought it would be easier to follow when everything is in one place.

liamdawson · 2021-12-09T05:50:39Z

I'm investigating further, but I think the proposed index could be vastly improved by swapping the order of the columns. The PostgreSQL documentation on multi-column indexes notes:

the index is most efficient when there are constraints on the leading (leftmost) columns. The exact rule is that equality constraints on leading columns, plus any inequality constraints on the first column that does not have an equality constraint, will be used to limit the portion of the index that is scanned.

The context I'm investigating involves a database which has the ('id', 'type') index, and one event type that hasn't been "seen" for an excessively long duration:

=> SELECT * FROM projector_tracker ORDER BY last_processed_event_id ASC;
 id | name | last_processed_event_id
----+------------+-------------------------
 30 | a    |               716812009
 24 | b    |               977842924
...
 16 | c    |               988610648

Given that gap, something is running the following query, which utilizes an index scan:

=> EXPLAIN SELECT * FROM "events" WHERE ((id >= 716812010) AND ("type" IN ('a'))) ORDER BY "id" LIMIT 1000;
                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------
 Limit  (cost=0.70..20456.54 rows=1000 width=516)
   ->  Index Scan using events_id_type_index on events  (cost=0.70..11185312.84 rows=546803 width=516)
         Index Cond: ((id >= 716812010) AND ((type)::text = 'a'::text))
(3 rows)

In this context, \di+ events_id_type_index reports the events_id_type_index is 52 GB in size. Scaling that proportionally to the number of rows scanned suggests that PostgreSQL needs to scan ~27% of the index to determine which of the events match the appropriate type. On an m4.xlarge RDS instance with 2,000 provisioned IOPS, that took ~165s to return 0 rows, given a relatively cold cache.

In the next couple of days, I'm looking to verify that replacing the 'type' index with a ('type', 'id') index vastly improves this query's performance. Additionally, I believe that an index that begins with the id column isn't particularly distinct from doing a sequential scan of the table proper, unless all of the information used by the query is contained in the index.

liamdawson · 2021-12-09T11:18:33Z

I need to verify further (to make sure it wasn't influenced by the process of generating the indexes), but initial results look promising:

=> EXPLAIN (VERBOSE, ANALYZE, BUFFERS, TIMING) SELECT * FROM "events" WHERE ((id >= 716812010) AND ("type" IN ('a'))) ORDER BY "id" LIMIT 1000;
                                                                      QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.70..3878.35 rows=1000 width=516) (actual time=3.174..3.176 rows=0 loops=1)
   Output: id, uuid, aggregate_id, type, body, version, correlation_id, causation_id, created_at
   Buffers: shared hit=3 read=5
   ->  Index Scan using events_type_id_index on public.events  (cost=0.70..2122202.62 rows=547291 width=516) (actual time=3.171..3.172 rows=0 loops=1)
         Output: id, uuid, aggregate_id, type, body, version, correlation_id, causation_id, created_at
         Index Cond: (((events.type)::text = 'a'::text) AND (events.id >= 716812010))
         Buffers: shared hit=3 read=5
 Planning time: 0.157 ms
 Execution time: 3.202 ms
(9 rows)

liamdawson · 2021-12-10T00:15:38Z

I took a snapshot of the DB with the new index, restored it as a new DB, and re-ran the query. It was slower, but was still a significant improvement from the "id", "type" index version:

=> EXPLAIN (VERBOSE, ANALYZE, BUFFERS, TIMING) SELECT * FROM "events" WHERE ((id >= 716812010) AND ("type" IN ('a'))) ORDER BY "id" LIMIT 1000;
                                                                  QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.70..12.58 rows=2 width=512) (actual time=83.039..83.045 rows=0 loops=1)
   Output: id, uuid, aggregate_id, type, body, version, correlation_id, causation_id, created_at
   Buffers: shared hit=3 read=5
   ->  Index Scan using events_type_id_index on public.events  (cost=0.70..12.58 rows=2 width=512) (actual time=83.035..83.037 rows=0 loops=1)
         Output: id, uuid, aggregate_id, type, body, version, correlation_id, causation_id, created_at
         Index Cond: (((events.type)::text = 'a'::text) AND (events.id >= 716812010))
         Buffers: shared hit=3 read=5
 Planning time: 1515.274 ms
 Execution time: 83.102 ms
(9 rows)

For the database I'm investigating, I'll be proposing:

Create events_type_id_index

CREATE INDEX CONCURRENTLY events_type_id_index ON events ("type", "id");

Verify the index is in use for the given query
Drop the events_type_index
Monitor PostgreSQL index statistics, and drop the events_id_type_index unless it's still being used

This new index has the following properties:

It can continue to be used wherever the events_type_index was previously used with a minimal loss of efficiency, because
The events_type_id_index is only ~2% bigger in my tests (52GB vs. 51GB)

As noted by others in this PR, this is a form of stopgap measure, but it does help in scenarios where events aren't seen for significant periods of time.

daniel-niknam-envato self-assigned this Oct 27, 2021

scottyp-env approved these changes Oct 28, 2021

View reviewed changes

daniel-niknam-envato marked this pull request as ready for review October 28, 2021 15:30

orien approved these changes Oct 28, 2021

View reviewed changes

Domon reviewed Oct 29, 2021

View reviewed changes

CHANGELOG.md Show resolved Hide resolved

Domon reviewed Oct 29, 2021

View reviewed changes

CHANGELOG.md Outdated Show resolved Hide resolved

Domon reviewed Oct 29, 2021

View reviewed changes

Replace "type" index with multi-column index on "id" and "type"

cc78aba

daniel-niknam-envato force-pushed the dan/add-id-and-type-index branch from 4d405fd to cc78aba Compare December 1, 2021 03:06

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replace "type" index with multi-column index on "id" and "type" #69

Replace "type" index with multi-column index on "id" and "type" #69

daniel-niknam-envato commented Oct 27, 2021 •

edited

Loading

scottyp-env commented Oct 27, 2021

daniel-niknam-envato commented Oct 28, 2021

scottyp-env left a comment

Domon Oct 29, 2021 •

edited

Loading

daniel-niknam-envato Oct 29, 2021

Domon Nov 4, 2021 •

edited

Loading

Domon Oct 29, 2021 •

edited

Loading

daniel-niknam-envato Oct 29, 2021

liamdawson commented Dec 9, 2021 •

edited

Loading

liamdawson commented Dec 9, 2021

liamdawson commented Dec 10, 2021 •

edited

Loading

Replace "type" index with multi-column index on "id" and "type" #69

Are you sure you want to change the base?

Replace "type" index with multi-column index on "id" and "type" #69

Conversation

daniel-niknam-envato commented Oct 27, 2021 • edited Loading

Context

Recent investigation

Questions

What needs to be done for existing apps?

Would this change would solve all performance issues?

Why not just add another index?

scottyp-env commented Oct 27, 2021

daniel-niknam-envato commented Oct 28, 2021

scottyp-env left a comment

Choose a reason for hiding this comment

Domon Oct 29, 2021 • edited Loading

Choose a reason for hiding this comment

daniel-niknam-envato Oct 29, 2021

Choose a reason for hiding this comment

Domon Nov 4, 2021 • edited Loading

Choose a reason for hiding this comment

Domon Oct 29, 2021 • edited Loading

Choose a reason for hiding this comment

daniel-niknam-envato Oct 29, 2021

Choose a reason for hiding this comment

liamdawson commented Dec 9, 2021 • edited Loading

liamdawson commented Dec 9, 2021

liamdawson commented Dec 10, 2021 • edited Loading

daniel-niknam-envato commented Oct 27, 2021 •

edited

Loading

Domon Oct 29, 2021 •

edited

Loading

Domon Nov 4, 2021 •

edited

Loading

Domon Oct 29, 2021 •

edited

Loading

liamdawson commented Dec 9, 2021 •

edited

Loading

liamdawson commented Dec 10, 2021 •

edited

Loading