Synchronize DB access #8118
Conversation
Prevent concurrent access by synchronizing DB operations
This pull request introduces 1 alert when merging 4f0b2ac into bc0b2a8 - view on LGTM.com new alerts:
I'm not surprised we don't see much difference in performance. With biased locking, and given our assumption that mostly only one thread at a time accesses the same state, I wouldn't expect much difference. However, keep in mind that biased locking is disabled by default from Java 15 onward, so I wonder whether we would see a difference on Java 17. Though maybe I'm wrong, and biasing doesn't apply here anyway, since our actors are scheduled on different threads over their lifetime. Maybe in this case it's rather lock elision and lock coarsening that kick in?
I also wonder whether small changes like this, which individually show no statistically significant performance difference, eventually accumulate and result in death by a thousand cuts.
Out of curiosity, is there a reason we can't reuse the same mechanism for the DB as we did for the log storage appender (i.e. tying it to the term/logical clock)? Is it because we don't trust anything but synchronizing every access?
I am not aware of that mechanism. My only desire is that we have additional guarantees at runtime (and maybe even checks at compile time).
Thanks for mentioning me and asking for my thoughts. You're right that I have some problems with this approach, and I'll try to explain my reasoning.
Happy to hear other thoughts on this.
💭 I don't think that adding `synchronized` to all state methods is the right approach.
The state is designed to be used by a single thread, and this assumption still holds: it should only ever be used by a single thread.
Handling concurrent access by adding `synchronized` suggests that the state can or should be used by multiple threads.
Instead, I would prefer to avoid concurrent access on the caller side (i.e. in the actor framework or the bootstrapping/transition logic).
However, I'm open to discussions.
I think @Zelldon's point is the strongest: I don't think the problem is multiple threads, just that multiple stream processors are accessing the same DB. Whether it's thread-safe or not probably doesn't make a difference if you can still mutate the underlying data. If you wanted to continue with the lock approach, the idea would be that a stream processor, on creation, grabs and locks the state once, and only unlocks it when it's closed or stopped. That would prevent any other access to the state while the processor is processing.
Are you implying that we don't have transaction isolation? How can this have happened without concurrent access to the methods, from your point of view?
Not sure whether you know how RocksDB transactions work, but it is not like a relational database, where you lock the complete table or something similar. To be clear, we are using OptimisticTransactions, which means you can have multiple concurrent transactions, and only on commit is it checked that no other transaction has written to the same keys. In my opinion, this means we can update one column family in one processor and another column family in the other processor without causing any optimistic locking exceptions. That would still break our complete model, since our column families are related to each other, but we have no way to model those relations as a relational database does, e.g. with foreign keys. You can also read about RocksDB transactions here: https://github-wiki-see.page/m/facebook/rocksdb/wiki/Transactions
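To make the point about optimistic transactions concrete, here is a minimal sketch in plain Java. It is a toy model of commit-time conflict detection, not the RocksDB API: the class `OptimisticStoreSketch`, its `Tx` type, and the key names are all made up for illustration, and column families are only imitated by key prefixes.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of optimistic concurrency control; this is NOT the RocksDB API. */
final class OptimisticStoreSketch {
  // Committed values plus a per-key version counter used for conflict detection.
  private final Map<String, String> data = new HashMap<>();
  private final Map<String, Long> versions = new HashMap<>();

  Tx begin() {
    return new Tx();
  }

  final class Tx {
    private final Map<String, String> writes = new HashMap<>();
    private final Map<String, Long> seenVersions = new HashMap<>();

    void put(String key, String value) {
      // Remember which version of the key this transaction based its write on.
      seenVersions.putIfAbsent(key, versions.getOrDefault(key, 0L));
      writes.put(key, value);
    }

    /** Commits unless another transaction changed one of the written keys in the meantime. */
    boolean commit() {
      synchronized (OptimisticStoreSketch.this) {
        for (Map.Entry<String, Long> seen : seenVersions.entrySet()) {
          if (!versions.getOrDefault(seen.getKey(), 0L).equals(seen.getValue())) {
            return false; // conflict on a key this transaction wrote
          }
        }
        writes.forEach((key, value) -> {
          data.put(key, value);
          versions.merge(key, 1L, Long::sum);
        });
        return true;
      }
    }
  }

  /** Returns the commit results of four transactions, in commit order. */
  static boolean[] run() {
    OptimisticStoreSketch db = new OptimisticStoreSketch();

    // Two "processors" write to different (imitated) column families.
    Tx a = db.begin();
    Tx b = db.begin();
    a.put("jobs/1", "activated");
    b.put("instances/1", "completed");
    boolean aOk = a.commit();
    boolean bOk = b.commit(); // disjoint keys, so no conflict is detected

    // Two transactions write the same key: the later commit is rejected.
    Tx c = db.begin();
    Tx d = db.begin();
    c.put("jobs/1", "completed");
    d.put("jobs/1", "failed");
    boolean cOk = c.commit();
    boolean dOk = d.commit();

    return new boolean[] {aOk, bOk, cOk, dOk};
  }

  public static void main(String[] args) {
    System.out.println(java.util.Arrays.toString(run()));
  }
}
```

In this model, transactions `a` and `b` both commit even though the combined result (an activated job for a completed instance) may violate a relation between the two families. Only writes to the same key, as with `c` and `d`, are caught.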
Thanks @Zelldon for the link. I read most of the article. What is not clear to me is the scope of the transaction. Is a transaction scoped to one column family, or can it span multiple column families? If it's per column family, I agree with you. If it spans all column families, I have to think harder about whether this can happen without concurrent access. What we found were not inconsistencies between two column families that were written to independently by different processors, but inconsistencies where one processor wrote to two families. In some cases we know this precisely, because there is just one place in the whole code base where entries are removed from these two column families. And yet these two column families became inconsistent.
One thing Ole suggested, which wouldn't have such a big impact on performance (but may lead to deadlocks, which is still not as bad): how about a global lock on the DB? That is, a stream processor on start would try to grab exclusive ownership of the DB. If another stream processor already has it, then it won't be able to, and should fail (let's ignore recovery for now). If it can, then great, and it doesn't need to synchronize all further accesses. In that sense it's quite similar to the approach of using the term as a form of lease, and could be a step in that direction.
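A rough sketch of what such exclusive ownership could look like, assuming a simple in-process lease. The names `DbLease` and `Processor` are hypothetical and do not correspond to existing classes in the code base; recovery and deadlock handling are deliberately left out.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Sketch of "grab the DB once on start, release it on close"; all names are hypothetical. */
final class ExclusiveDbSketch {

  /** Tracks which object currently owns the DB; null means the DB is free. */
  static final class DbLease {
    private final AtomicReference<Object> owner = new AtomicReference<>();

    boolean tryAcquire(Object candidate) {
      // Succeeds only if nobody owns the DB right now.
      return owner.compareAndSet(null, candidate);
    }

    boolean release(Object holder) {
      // Only the current owner (compared by identity) can release.
      return owner.compareAndSet(holder, null);
    }
  }

  /** Stand-in for a stream processor that owns the DB for its whole lifetime. */
  static final class Processor implements AutoCloseable {
    private final DbLease lease;

    private Processor(DbLease lease) {
      this.lease = lease;
    }

    static Processor open(DbLease lease) {
      Processor processor = new Processor(lease);
      if (!lease.tryAcquire(processor)) {
        throw new IllegalStateException("DB is already owned by another processor");
      }
      return processor;
    }

    @Override
    public void close() {
      lease.release(this);
    }
  }

  /** A second processor must fail to start while the first one is still open. */
  static boolean secondProcessorRejected() {
    DbLease lease = new DbLease();
    try (Processor first = Processor.open(lease)) {
      try {
        Processor.open(lease);
        return false;
      } catch (IllegalStateException expected) {
        return true;
      }
    }
  }

  /** After the first processor is closed, a new one can take over. */
  static boolean reacquirableAfterClose() {
    DbLease lease = new DbLease();
    Processor.open(lease).close();
    try (Processor second = Processor.open(lease)) {
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println(secondProcessorRejected() + " " + reacquirableAfterClose());
  }
}
```

The ownership check happens once per processor lifetime, so individual state accesses need no further synchronization in this sketch.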
Did we end up doing anything here, or are we waiting on the task force? If the latter, then let's close this.
We are waiting for the task force, I guess.
Then I would propose to close this until we know what/where we want to go with this. Do reopen if you think otherwise, though.
Description
This PR adds the `synchronized` keyword to all state methods. The change was motivated by the recent bug which led to concurrent calls to these classes. The root cause was identified by @Zelldon and is being addressed in #8101.
The PR proposes an additional safety net by protecting the called methods from concurrent access. The benefit of this approach is that concurrent access is prevented at runtime.
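As an illustration of the idea only (not the actual diff), the sketch below shows a state class whose methods are all `synchronized`. The class `SynchronizedStateSketch` is hypothetical, and its two maps merely stand in for two column families that must stay consistent with each other.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical state class; the two maps stand in for two related column families. */
final class SynchronizedStateSketch {
  private final Map<Long, String> jobs = new HashMap<>();
  private final Map<Long, Long> deadlines = new HashMap<>();

  // Each method holds the instance monitor, so both maps are updated as one unit.
  synchronized void create(long key, String type, long deadline) {
    jobs.put(key, type);
    deadlines.put(key, deadline);
  }

  synchronized void delete(long key) {
    jobs.remove(key);
    deadlines.remove(key);
  }

  /** Returns the number of entries, or -1 if the two maps disagree on their keys. */
  synchronized int sizeIfConsistent() {
    return jobs.keySet().equals(deadlines.keySet()) ? jobs.size() : -1;
  }

  /** Two threads create 1000 entries each on disjoint keys and delete the even ones. */
  static int runConcurrently() {
    SynchronizedStateSketch state = new SynchronizedStateSketch();
    Thread[] writers = new Thread[2];
    for (int t = 0; t < writers.length; t++) {
      final long offset = t * 1000L;
      writers[t] = new Thread(() -> {
        for (long key = offset; key < offset + 1000; key++) {
          state.create(key, "job", key + 10);
          if (key % 2 == 0) {
            state.delete(key);
          }
        }
      });
      writers[t].start();
    }
    try {
      for (Thread writer : writers) {
        writer.join();
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return state.sizeIfConsistent();
  }

  public static void main(String[] args) {
    System.out.println("remaining entries: " + runConcurrently());
  }
}
```

Without the `synchronized` keyword, concurrent writes to the plain `HashMap` instances would be a data race. Note that this only serializes individual method calls; it does not stop two callers from interleaving whole sequences of calls.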
Review Hints
Related issues
related to #7988 and other related bugs
Definition of Done
Not all items need to be done depending on the issue and the pull request.
Code changes:
backport stable/0.25) to the PR, in case that fails you need to create backports manually.
Testing:
Documentation: