Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Akka.Persistence.Sql.Common: add separate tags table for indexing individual tags #5296

Closed
Aaronontheweb opened this issue Oct 1, 2021 · 18 comments
Labels

Comments

@Aaronontheweb
Copy link
Member

Is your feature request related to a problem? Please describe.
Right now all of our tags in Akka.Persistence.SqlServer and other related plugins are simply stored as comma-delimited text that is, effectively, not indexed:

https://github.com/akkadotnet/Akka.Persistence.SqlServer/blob/ccf62119cf99daf9ea77edd1f7596d74f53ea5a6/src/Akka.Persistence.SqlServer/Journal/SqlServerQueryExecutor.cs#L56-L78

When we perform a "read by tags" query, it performs a full table scan of all items in the Akka.Persistence.Journal:

https://github.com/akkadotnet/Akka.Persistence.SqlServer/blob/ccf62119cf99daf9ea77edd1f7596d74f53ea5a6/src/Akka.Persistence.SqlServer/Journal/SqlServerQueryExecutor.cs#L35-L41

For any reasonably large journal (i.e. millions of events or more) this is going to be non-performant.

Describe the solution you'd like
I'm trying to decide if it would be better to:

  1. Add a second "index" table for specific tags or
  2. Materialize the "events by tag" data into a second, denormalized table in order to avoid joins. Advantage of this approach: can be made to work in non-relational environments, especially if we denormalize the full set of sequence numbers and persistent entity ids too, which should allow for safe deletion from both collections.

The idea in either case - avoid full table scans and use an index to query events by tag.

Describe alternatives you've considered
One alternative is to simply leave as-is and maybe have users implement their own custom journals.

Additional context
One nasty thing about introducing this change is how we'd make it backwards compatible with current journal implementations - there would, essentially, need to be an ETL process or tool of some kind introduced.

@Aaronontheweb
Copy link
Member Author

cc @ismaelhamed @to11mtm

@malclear
Copy link
Contributor

malclear commented Oct 1, 2021

One thing to note about how tags are currently stored: they are unordered in the semicolon-delimited list. The tags "A", "B", and "C" will appear in any order in the Tags field:

  • ";A;B;C;"
  • ";A;C;B;"
  • ";B;C;A;"
  • etc.

@Aaronontheweb
Copy link
Member Author

@malclear and that works today because the SQL we use has a LIKE cause for filtering through the tags - which makes it even more non-performant, IMHO.

What I'd propose as part of an ETL process is parsing that field - each tag into its own row.

@malclear
Copy link
Contributor

malclear commented Oct 4, 2021

Would that mean duplicating the payload for each tag? If so, that's not insignificant. I think a many-to-many join table, or a tag-specific child table would be better for us.

@Aaronontheweb
Copy link
Member Author

@malclear yeah, after thinking about it - for SQL it has to be less expensive to just rely on an index rather than denormalization.

@ismaelhamed
Copy link
Member

This was eventually addressed in the akka-persistence-jdbc plugin, see #467 and original issue #194

IMO, we should put all the effort in the Akka.Persistence.Linq2Db plugin moving forward. After all, Akka.Persistence.Sql.Common has served its purpose just fine for many users for all these years (given its limitations).

@Aaronontheweb
Copy link
Member Author

IMO, we should put all the effort in the Akka.Persistence.Linq2Db plugin moving forward. After all, Akka.Persistence.Sql.Common has served its purpose just fine for many users for all these years (given its limitations).

Yes. We should.

@to11mtm
Copy link
Member

to11mtm commented Oct 7, 2021

I wish we had this thread a few months ago when this was fresher in my head...

The nice thing about doing this all in Persistence.Linq2Db is that we will only have to do it once for everyone.

However, there's some caveats that should be kept in mind:

  • Separate tables for Tags -will- lower write performance or require additional changes
    • We get a -lot- of our 'turbocharged' performance via Linq2Db's Bulk-copy (typically, a multi-insert batch statement). Unfortunately, there's not a good way to use that mechanism to get back the inserted primary key with that operation.
      • Possible work-arounds:
        • A: Tie the two together via one of the following:
          • 1: Use a GUID/ULID or similar structure (keeping in mind that we may need special mapping around SQL Server because of how it searches those fields) as the tie point between the inserted record and the tag records. This would add 16 bytes of overhead on each inserted tag for a record as well as the main record)
          • 2: Use the PersistenceId+SequenceNumber as the tie point between records. It would probably be more overhead 'per-tag' and possibly more fragile otherwise.
        • B: Re-write the batch writing logic so that we 'split' the items with tags and without tags, using BulkCopy on the rows that don't have tags, and fallback to a bunch of inserts getting back the identities and bulk copying the tags in (both tagged and untagged would still get handled in a single transaction so it's atomic success or failure).
        • C: Add bulk copy insert with identity support to linq2db. There's challenges in doing it -properly- because SQL Databases are tricky and ideally in Linq2Db we like to make sure features have as wide DB support as possible. SQL Server in particular is nasty because you have to do some weird sub-selects to handle the fact that SQL Server doesn't guarantee the returned IDs come back in the order you gave the rows :)

If we don't do any of the above and go the 'lazy' route (individual writes for everything but at least transacted) we'll see a huge perf drop on scale-out. Mind you I still think it would be faster than if done in Sql.Common, but we are probably talking a 50% or more drop in performance.

@Aaronontheweb
Copy link
Member Author

Will there be any performance hit if you don't use tags?

@Aaronontheweb
Copy link
Member Author

Unfortunately, there's not a good way to use that mechanism to get back the inserted primary key with that operation.

If we adopt a "blind write" approach and have the primary key be:

Tag, PersistentId, SeqNo

With a secondary index on Ordering (that can be a server-side timestamp)

Would that work? All of that should be predictable and sorted on the index.

@to11mtm
Copy link
Member

to11mtm commented Oct 7, 2021

Will there be any performance hit if you don't use tags?

Yes. how much depends on the route chosen. In addition to the write aspect I mentioned, there is also the aspect of Reads.

Adding the Tag table means we either need to do two DB queries and combine the results or do a one left-join DB query. Either are easy but if you are in a 'mixed mode' (some events with tags, some without) as the tagged events grow there could be a read side impact as well. I'd expect that if you are never using tags in a design, the impact on reads should be negligible in any of the cases.

As far as Writes go:

  • Option A:
    • Will likely provide good performance in -both- cases and likely be easier to maintain long-term as far as the code goes. Dealing with SQLServer Byteswapping is a one-time pain point. There -is- the 16ish byte per event additional overhead to keep in mind.
  • Option B:
    • In a 'best case' where all events in a batch do not have tags, we will pay an additional cost of looping through the set and possibly an additional array allocation.
  • Option C:
    • There's going to be some penalty in all cases; we don't need the ID's back when we aren't inserting tags, so the return/marshalling is going to hurt us on top of the more complex SQL-Side logic.

If we adopt a "blind write" approach and have the primary key be:

Tag, PersistentId, SeqNo

With a secondary index on Ordering (that can be a server-side timestamp)

'Work' yes. But then you'd need to include all three columns on every Tag row for a given record, and it's a harder join on read.

@Aaronontheweb
Copy link
Member Author

I'm going to be annoying here, but for your use case at work which of these three options would make you feel least concerned?

When the technical merits don't present a clear winner, I like to use a "regret minimization" framework :p

@to11mtm
Copy link
Member

to11mtm commented Oct 7, 2021

I've never used tags in any capacity with Akka.Persistence. So I'm trying to approach the question from a balance of 'what would make a DBA happy' vs 'what would be most efficient'.

I think Option A is going to provide the best balance of:

  • Long term performance: If the ULID tag field is nullable or we have a specific default value to denote no tags, it should be easy for us to try both join query and two-pass-query approaches and see what works better.
  • Migration : Going back to ULID tags, again nullable or default tags, the two-pass approach would possibly make some backwards compatibility scenarios easier.
  • Write performance: It's just 16 bytes per record and 16 bytes per tag.
  • Keeps the overall pipeline simple:
    • On single row writes we can check if there are any tags to see if we need a transaction (cheaper than a compare)
    • On multi-row writes we can do a bulk copy of the main records first, then any tags we project and bulk-copy in the same transaction
  • Utility value: The ULID could be used as a debugging type tag; if each event is getting it's own ulid (tags or not) there may be value in troubleshooting (might be reaching here.)

One thing to keep in mind with row-per-tag; if there is an identity primary key, there is a chance that workloads with a LOT of events with a LOT of tags could conceivably overflow a bigint. We can make a primary key on the Tags table optional but -that- is something a lot more contentious in DBA circles.

@to11mtm
Copy link
Member

to11mtm commented Oct 12, 2021

@Aaronontheweb I can write this up as an issue in Akka.Persistence.Linq2Db, but a couple questions/doublechecks came to mind:

  • Do we want this to be a fully 'opt-in' change? i.e. I think we could hide all the schema changes (including to the existing table) behind a config switch. Some refactoring of the Queries might be needed to minimize code duplication but thankfully linq2db is very composable. I think that to do this in a way that would allow a 'zero downtime' migration via rolling deploys, we would require 3 modes on this switch:
    • Off (Default, we bypass the join), Migrate (We do the join on reads, but keep writes on the old path), On (We do the join on reads -and- we just write to the join table)
    • For zero downtime, you would make your schema changes, deploy in Migrate mode, then deploy again in On mode.
  • We will still need to 'perfectly' match tags we get back (DB Collation may or may not be case sensitive.)
  • Are you OK with Akka.Persistence.Linq2Db taking on Cysharp.Ulid as a dependency? I am 100% ok with it (We already pull language-ext), it's simple; just a very efficient implementation of the ULID spec.
  • If we are going to start guiding users to Persistence.Linq2Db, can we get a second set of eyes on the Persistence.Query bits? That's the one part of the Library that I can't say is battle-tested.
  • If we are going to be helping users with Schema changes (I feel it would be good to do so), we should think about how to make it easy for users to migrate with their existing configs. I have a plan for this. My thought is a utility that pulls up the config, figures out what to generate, and either runs it against the specified DBs or spits out the SQL for a DBA to run in an environment. We can use FluentMigrator for this (to ease pain of multi DB support) and keep it separate from the main lib.

@Aaronontheweb
Copy link
Member Author

Do we want this to be a fully 'opt-in' change? i.e. I think we could hide all the schema changes (including to the existing table) behind a config switch.

Yep, that's always preferable - we don't want to treat a production users' database roughly without their consent. Feature flags that are opt-in are a low-friction way to implement this.

We will still need to 'perfectly' match tags we get back (DB Collation may or may not be case sensitive.)

Yes, we should do an exact text match.

Are you OK with Akka.Persistence.Linq2Db taking on Cysharp.Ulid as a dependency? I am 100% ok with it (We already pull language-ext), it's simple; just a very efficient implementation of the ULID spec.

Yep, fine with it.

If we are going to start guiding users to Persistence.Linq2Db, can we get a second set of eyes on the Persistence.Query bits? That's the one part of the Library that I can't say is battle-tested.

Can you open an issue with a checklist of specific things to look at or do you want an Akka.Persistence.Query veteran (i.e. @ismaelhamed or @Arkatufus ) to just give it a look over?

If we are going to be helping users with Schema changes (I feel it would be good to do so), we should think about how to make it easy for users to migrate with their existing configs. I have a plan for this. My thought is a utility that pulls up the config, figures out what to generate, and either runs it against the specified DBs or spits out the SQL for a DBA to run in an environment. We can use FluentMigrator for this (to ease pain of multi DB support) and keep it separate from the main lib.

I like the idea of spitting out a SQL script that the DBA can squint at. Happy to create a repo for a utility tool for this if you think it's wise to separate it. Will the migration path include current Akka.Persistence.SqlServer users too? I think that'd be a good idea to help jumpstart the migration over.

@to11mtm
Copy link
Member

to11mtm commented Oct 12, 2021

Will the migration path include current Akka.Persistence.SqlServer users too? I think that'd be a good idea to help jumpstart the migration over.

We already handle migration from Akka.Persistence.SqlServer in a way similar to what I describe above (In 'pure' mode we don't use Journal Metatdata for tracking sequence numbers on deletes.) This change shouldn't be an issue for the compatibility mode switches we already have in place. IOW Yes. There's even a migration guide

@to11mtm
Copy link
Member

to11mtm commented Oct 24, 2021

@Aaronontheweb the more I think about this, I think the right thing to do is NOT do the Ulid bit, and Take the performance hit on Tagged Writes until linq2db/linq2db#2960 is resolved. EventUlid may be of value on the row but I think we should keep our users 'safe' in conversations with DBAs. I am confident we would still be 'faster' than SQL.Common were it put into a similar scenario of writing to two tables.

We can give guidelines to users that use tags that a different configuration (with more parallel writers) may be wanted to help mitigate impact.

@Aaronontheweb
Copy link
Member Author

Done as part of Akka.Persistence.Sql in August of this year.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants