
Akka.Persistence.Sql SqlJournal caching all Persistence Ids in memory does not scale #4524

Closed
ondrejpialek opened this issue Jul 21, 2020 · 10 comments


@ondrejpialek
Contributor

ondrejpialek commented Jul 21, 2020

Hello,

I was diagnosing OutOfMemory exceptions we've been seeing recently and discovered that SqlJournal stores all PersistenceIds in memory. We have over 10GB of data in our event stream, with millions of unique PersistenceIds. At the time of writing these ids take about 700MB of memory, and I am not sure how long it takes to read them from the DB - it must have a noticeable impact on startup time...

I would argue that storing all this data is not scalable. Additionally, it seems that it is only there to support the (All/Live)PersistenceIds queries, which not everyone uses (we don't, for example). I wonder if this could somehow be improved - I have some ideas at the end.

Some background about our setup:

  • We use the latest Akka 1.3 and Azure SQL DB
  • We use Service Fabric and have a number of small VMs, each running up to 3 Akka Cluster nodes (all forming a single cluster)
  • These VMs have 3.5GB of memory
  • Given those numbers, PersistenceIds on each VM take over 2GB of memory, leaving only 1.5GB for the OS and our user code
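
The arithmetic behind those bullet points can be sketched out (all figures are from this issue - ~700MB of cached ids per journal instance, 3 nodes per 3.5GB VM):

```python
# Back-of-the-envelope check of the memory numbers above.
cache_per_node_mb = 700            # ~700 MB of cached PersistenceIds per node
nodes_per_vm = 3                   # up to 3 Akka Cluster nodes per VM
vm_memory_mb = 3.5 * 1024          # 3.5 GB VMs

cached_mb = cache_per_node_mb * nodes_per_vm
remaining_mb = vm_memory_mb - cached_mb
print(cached_mb)            # 2100 -> "over 2GB" spent on PersistenceIds
print(round(remaining_mb))  # 1484 -> roughly 1.5GB left for the OS and user code
```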

Soon we will be adding one more service type, leading to up to 4 nodes per VM. We cannot do this right now, as we are already out of memory. This new release will also create a new space of persistence IDs, with many thousands added, so it will likely increase the memory needed by at least another 100 MB.

Ideas for addressing this issue:

  • Toggle the feature on and off in HOCON - when off, queries for live PersistenceIds will not work, and queries for all PersistenceIds will read hot off the server
  • Enable the feature only on demand - the first time persistence IDs are requested, load them all and keep the list up to date; before then, do nothing
  • Remove the PersistenceId set from SqlJournal (it already feels like the wrong place for it) and have it ping a PersistenceIdProvider with every PersistenceId encountered, AND allow the user to replace the default PersistenceIdProvider (which would cache in memory) with one that is, for example, a Cluster Singleton
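
The first idea might look something like this in configuration (a sketch only - `cache-persistence-ids` is a hypothetical key that does not exist in the plugin at the time of writing):

```hocon
# Hypothetical setting sketched for idea 1 above - not an existing key.
akka.persistence.journal.sql-server {
  # When false, the journal would not preload all PersistenceIds into memory;
  # live PersistenceIds queries would be unavailable, and queries for all
  # PersistenceIds would read directly from the database.
  cache-persistence-ids = false
}
```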

I think that overall this functionality can be useful, but since it may not be used by everyone, I feel its cost right now is too high and it should therefore be made opt-in (or at least opt-out).

Is there anything I missed? Depending on your preferred approach, I might submit a PR for this.

Many thanks,
Ondrej.

@ismaelhamed
Member

SqlJournal is based on the LevelDB one in the JVM, so it definitely has its limitations. I wonder if, at this point, what you really need is a different journal.

@Aaronontheweb this one would be interesting to port in the long run.

@Aaronontheweb
Member

@Arkatufus and I have been discussing this recently since we think the current in-memory query architecture is a little.... weird.

https://github.com/akka/akka-persistence-jdbc/blob/afdcea24e946247f8ed8e3306ddd49e395418d25/core/src/main/scala/akka/persistence/jdbc/query/dao/ByteArrayReadJournalDao.scala#L37-L38

Looks like the way they do it in the JDBC implementation is to just run a live query and not store anything in memory at all. That makes sense, now that I think about it - the way the PersistenceId queries were implemented for LevelDB was essentially the same model as SQLite: all of the data stored by that journal is local to the node, since it all gets persisted on the file system.

That approach will not work for server-based database implementations with long uptimes. We should probably rewrite this to run queries against the database instead.

We should probably re-model these queries as such:

  1. CurrentPersistenceIds - a finite query that reads all persistence id values from the database.
  2. PersistenceIds - an infinite query that returns all existing persistence ids plus any new ones discovered after the fact. That will have to be implemented in SQL somehow - probably by using a RowNumber table of some kind.
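
The two query shapes above can be sketched against a generic journal table (SQLite and Python purely for illustration; the schema, table name, and polling helper below are assumptions, not the plugin's actual implementation):

```python
import sqlite3

# Minimal stand-in for an event journal table (illustrative schema only).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE event_journal (
        ordering INTEGER PRIMARY KEY AUTOINCREMENT,
        persistence_id TEXT NOT NULL,
        payload BLOB
    )""")
for pid in ["user-1", "user-2", "user-1"]:
    conn.execute("INSERT INTO event_journal (persistence_id) VALUES (?)", (pid,))

# 1. CurrentPersistenceIds: a finite DISTINCT query; nothing cached in memory.
current = [row[0] for row in conn.execute(
    "SELECT DISTINCT persistence_id FROM event_journal ORDER BY persistence_id")]
print(current)  # ['user-1', 'user-2']

# 2. PersistenceIds: an infinite query, approximated here by polling for rows
#    past the highest ordering value seen so far (the "RowNumber" idea). A
#    consumer would de-duplicate ids it has already emitted downstream.
def poll_new_ids(conn, last_seen):
    rows = conn.execute(
        """SELECT persistence_id, MAX(ordering) FROM event_journal
           WHERE ordering > ? GROUP BY persistence_id""",
        (last_seen,)).fetchall()
    new_last = max((o for _, o in rows), default=last_seen)
    return [pid for pid, _ in rows], new_last

ids, last_seen = poll_new_ids(conn, 0)          # first poll: all existing ids
conn.execute("INSERT INTO event_journal (persistence_id) VALUES ('user-3')")
new_ids, last_seen = poll_new_ids(conn, last_seen)
print(new_ids)  # ['user-3'] - only ids with events past the last offset
```

The key property is that memory use is bounded by a single integer offset per query stream, not by the total number of persistence ids.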

@Arkatufus what do you make of this?

@to11mtm
Member

to11mtm commented Jul 21, 2020

@ismaelhamed @Aaronontheweb I think the JDBC implementation solves at least one other issue: BatchingSqlJournal currently batches both deletes and writes together in the same batch. In SQL Server this leads to a heavy risk of deadlock contention that can do very unkind things to your persisted data. It looks like in the JDBC implementation writes are still batched at some level, but deletes are kept separate, which is a vast improvement.
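
The separation described above can be sketched as follows (a Python/SQLite stand-in for illustration only; the actual BatchingSqlJournal is C# against SQL Server, and the schema here is hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE event_journal (
        ordering INTEGER PRIMARY KEY AUTOINCREMENT,
        persistence_id TEXT NOT NULL,
        sequence_nr INTEGER NOT NULL
    )""")

pending_writes = [("user-1", 1), ("user-1", 2), ("user-2", 1)]
pending_deletes = [("user-9", 5)]  # (persistence_id, to_sequence_nr)

# Flush all writes in one transaction...
with conn:
    conn.executemany(
        "INSERT INTO event_journal (persistence_id, sequence_nr) VALUES (?, ?)",
        pending_writes)

# ...and flush deletes in a SEPARATE transaction, so the two workloads never
# hold conflicting locks inside the same batch.
with conn:
    conn.executemany(
        "DELETE FROM event_journal WHERE persistence_id = ? AND sequence_nr <= ?",
        pending_deletes)

count = conn.execute("SELECT COUNT(*) FROM event_journal").fetchone()[0]
print(count)  # 3 - the writes landed; the delete touched no matching rows
```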

@Aaronontheweb
Member

yeah, we should implement that change too @to11mtm

@Arkatufus
Contributor

I'll try to revamp how PersistenceIds() and CurrentPersistenceIds() work under the hood.

@Arkatufus
Contributor

Would appreciate all of your inputs on this implementation.

@Aaronontheweb
Member

Definitely take a look at #4531 as a fix for this. I'll be reviewing it today or tomorrow.

@Aaronontheweb
Member

Closed via #4531 - you should see a version of this in Akka.Persistence.SqlServer shortly after v1.4.10 is released.

@ondrejpialek
Contributor Author

Amazing, thank you @Aaronontheweb and especially @Arkatufus for such a quick turnaround on this not so trivial problem!

Our path to the release is now clear. I am a bit nervous about upgrading from v1.3 (especially with the persistence and cluster changes), but hopefully 1.4 is stable by now :)

Thanks again!

@Aaronontheweb
Member

@ondrejpialek happy to help! Akka.Persistence.SqlServer 1.4.10 will be released with these changes in a few moments: akkadotnet/Akka.Persistence.SqlServer#170
