
Akka.Persistence.Sql SqlJournal caching all Persistence Ids in memory does not scale #4524

Closed
ondrejpialek opened this issue Jul 21, 2020 · 10 comments


@ondrejpialek
Contributor

ondrejpialek commented Jul 21, 2020

Hello,

I was diagnosing OutOfMemory exceptions we've been seeing recently and discovered that SqlJournal stores all PersistenceIds in memory. We have over 10GB of data in our event stream, with millions of unique PersistenceIds. At the time of writing these ids take about 700MB of memory, and I am not sure how long it takes to read them from the DB - it must have a noticeable impact on startup time...

I would argue that storing all this data is not scalable. Additionally, it seems that it is only there to support the (All/Live)PersistenceIds queries, which not everyone uses (we don't, for example). I wonder if this could somehow be improved - I have some ideas at the end.

Some background about our setup:

  • We use the latest Akka 1.3 and Azure SQL DB
  • We use Service Fabric and have a number of small VMs, each running up to 3 Akka Cluster nodes (all forming a single cluster)
  • These VMs have 3.5GB of memory
  • Given those numbers, PersistenceIds on each VM take over 2GB of memory, leaving only 1.5GB for the OS and our user code
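
The arithmetic behind those bullet points can be sketched out (all figures are from this issue - ~700MB of cached ids per journal instance, 3 nodes per 3.5GB VM):

```python
# Back-of-the-envelope check of the memory numbers above.
cache_per_node_mb = 700            # ~700 MB of cached PersistenceIds per node
nodes_per_vm = 3                   # up to 3 Akka Cluster nodes per VM
vm_memory_mb = 3.5 * 1024          # 3.5 GB VMs

cached_mb = cache_per_node_mb * nodes_per_vm
remaining_mb = vm_memory_mb - cached_mb
print(cached_mb)            # 2100 -> "over 2GB" spent on PersistenceIds
print(round(remaining_mb))  # 1484 -> roughly 1.5GB left for the OS and user code
```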

Soon we will be adding one more service type, leading to up to 4 nodes per VM. We cannot do this right now, as we are already out of memory. This new release will also create a new space of persistence IDs, with many thousands added, so it will likely increase the memory needed by at least another 100 MB.

Ideas for addressing this issue:

  • Toggle the feature on and off in HOCON - when off, queries for live PersistenceIds will not work, and queries for all PersistenceIds will read hot off the server
  • Enable the feature only on demand - the first time persistence IDs are requested, load them all and keep the list up to date; before then, do nothing
  • Remove the PersistenceId set from SqlJournal (it already feels like the wrong place for it) and have it ping a PersistenceIdProvider with every PersistenceId encountered, AND allow the user to replace the default PersistenceIdProvider (which would cache in memory) with one that is, for example, a Cluster Singleton
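
The first idea might look something like this in configuration (a sketch only - `cache-persistence-ids` is a hypothetical key that does not exist in the plugin at the time of writing):

```hocon
# Hypothetical setting sketched for idea 1 above - not an existing key.
akka.persistence.journal.sql-server {
  # When false, the journal would not preload all PersistenceIds into memory;
  # live PersistenceIds queries would be unavailable, and queries for all
  # PersistenceIds would read directly from the database.
  cache-persistence-ids = false
}
```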

I think that overall this functionality can be useful, but since it may not be used by everyone, I feel its cost right now is too high and it should therefore be made opt-in (or at least opt-out).

Is there anything I missed? Depending on your preferred approach, I might submit a PR for this.

Many thanks,
Ondrej.

@ismaelhamed
Member

SqlJournal is based on the LevelDB one in the JVM, so it definitely has its limitations. I wonder if, at this point, what you really need is a different journal.

@Aaronontheweb this one would be interesting to port in the long run.

@Aaronontheweb
Member

@Arkatufus and I have been discussing this recently since we think the current in-memory query architecture is a little.... weird.

https://github.com/akka/akka-persistence-jdbc/blob/afdcea24e946247f8ed8e3306ddd49e395418d25/core/src/main/scala/akka/persistence/jdbc/query/dao/ByteArrayReadJournalDao.scala#L37-L38

Looks like the way they do it in the JDBC implementation is to just run a live query and not store anything in memory at all. That makes sense, now that I think about it - the way the PersistenceId queries were implemented for LevelDB was essentially the same model as SQLite: all of the data stored by that journal is local to the node, since it all gets persisted on the file system.

That approach will not work for server-based database implementations with long uptimes. We should probably rewrite this to run queries against the database instead.

We should probably re-model these queries as such:

  1. CurrentPersistenceIds - a finite query that reads all persistence id values from the database.
  2. PersistenceIds - an infinite query that returns all existing persistence ids plus any new ones discovered after the fact. That will have to be implemented in SQL somehow - probably by using a RowNumber table of some kind.
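
The two query shapes above can be sketched against a generic journal table (SQLite and Python purely for illustration; the schema, table name, and polling helper below are assumptions, not the plugin's actual implementation):

```python
import sqlite3

# Minimal stand-in for an event journal table (illustrative schema only).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE event_journal (
        ordering INTEGER PRIMARY KEY AUTOINCREMENT,
        persistence_id TEXT NOT NULL,
        payload BLOB
    )""")
for pid in ["user-1", "user-2", "user-1"]:
    conn.execute("INSERT INTO event_journal (persistence_id) VALUES (?)", (pid,))

# 1. CurrentPersistenceIds: a finite DISTINCT query; nothing cached in memory.
current = [row[0] for row in conn.execute(
    "SELECT DISTINCT persistence_id FROM event_journal ORDER BY persistence_id")]
print(current)  # ['user-1', 'user-2']

# 2. PersistenceIds: an infinite query, approximated here by polling for rows
#    past the highest ordering value seen so far (the "RowNumber" idea). A
#    consumer would de-duplicate ids it has already emitted downstream.
def poll_new_ids(conn, last_seen):
    rows = conn.execute(
        """SELECT persistence_id, MAX(ordering) FROM event_journal
           WHERE ordering > ? GROUP BY persistence_id""",
        (last_seen,)).fetchall()
    new_last = max((o for _, o in rows), default=last_seen)
    return [pid for pid, _ in rows], new_last

ids, last_seen = poll_new_ids(conn, 0)          # first poll: all existing ids
conn.execute("INSERT INTO event_journal (persistence_id) VALUES ('user-3')")
new_ids, last_seen = poll_new_ids(conn, last_seen)
print(new_ids)  # ['user-3'] - only ids with events past the last offset
```

The key property is that memory use is bounded by a single integer offset per query stream, not by the total number of persistence ids.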

@Arkatufus what do you make of this?

@to11mtm
Member

to11mtm commented Jul 21, 2020

@ismaelhamed @Aaronontheweb I think the JDBC implementation solves at least one other issue: BatchingSqlJournal currently batches both deletes and writes together in the same batch. In SQL Server this leads to a heavy risk of deadlock contention that can do very unkind things to your persisted data. It looks like in the JDBC implementation writes are still batched at some level, but deletes are kept separate, which is a vast improvement.
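
The separation described above can be sketched as follows (a Python/SQLite stand-in for illustration only; the actual BatchingSqlJournal is C# against SQL Server, and the schema here is hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE event_journal (
        ordering INTEGER PRIMARY KEY AUTOINCREMENT,
        persistence_id TEXT NOT NULL,
        sequence_nr INTEGER NOT NULL
    )""")

pending_writes = [("user-1", 1), ("user-1", 2), ("user-2", 1)]
pending_deletes = [("user-9", 5)]  # (persistence_id, to_sequence_nr)

# Flush all writes in one transaction...
with conn:
    conn.executemany(
        "INSERT INTO event_journal (persistence_id, sequence_nr) VALUES (?, ?)",
        pending_writes)

# ...and flush deletes in a SEPARATE transaction, so the two workloads never
# hold conflicting locks inside the same batch.
with conn:
    conn.executemany(
        "DELETE FROM event_journal WHERE persistence_id = ? AND sequence_nr <= ?",
        pending_deletes)

count = conn.execute("SELECT COUNT(*) FROM event_journal").fetchone()[0]
print(count)  # 3 - the writes landed; the delete touched no matching rows
```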

@Aaronontheweb
Member

yeah, we should implement that change too @to11mtm

@Arkatufus
Contributor

I'll try to revamp how PersistenceIds() and CurrentPersistenceIds() work under the hood.

@Arkatufus
Contributor

Would appreciate all of your inputs on this implementation.

@Aaronontheweb
Member

Definitely take a look at #4531 as a fix for this. I'll be reviewing it today or tomorrow.

@Aaronontheweb
Member

Closed via #4531 - you should see a version of this in Akka.Persistence.SqlServer shortly after v1.4.10 is released.

@ondrejpialek
Contributor Author

Amazing, thank you @Aaronontheweb and especially @Arkatufus for such a quick turnaround on this not so trivial problem!

Our path to the release is now clear. I am a bit nervous about upgrading from v1.3 (especially with the persistence and cluster changes), but hopefully 1.4 is stable by now :)

Thanks again!

@Aaronontheweb
Member

@ondrejpialek happy to help! Akka.Persistence.SqlServer 1.4.10 will be released with these changes in a few moments: akkadotnet/Akka.Persistence.SqlServer#170
