Synchronize DB access #8118

pihme · 2021-11-01T16:16:59Z

Description

This PR adds the synchronized keyword to all state methods. The change was motivated by the recent bug which led to concurrent calls to these classes. The root cause was identified by @Zelldon and is being addressed in #8101.

The PR proposes an additional safety net by protecting the called methods from synchronized access. The benefit of this approach is that concurrent access is prevented at runtime.

Review Hints

One common objection to using synchronized is the risk of dead and live locks. However, since we program with the assumption of single threaded access anyway, those should not occur. Also, if they occur one could argue that this is the better outcome than the danger of dirty reads and writes
Second objection often is performance impact. I ran a benchmark http://34.77.165.228/d/I4lo7_EZk/zeebe?orgId=1&refresh=10s&var-DS_PROMETHEUS=Prometheus&var-namespace=pihme-hardening-synchronized&var-pod=All&var-partition=All It ran for more than a week now and it is without the fix to the original bug. I couldn't see any performance impact. However, this might be due to the benchmark not reaching the performance limits.
Also note that I made the changes before Nico backported the standardization of Idea code formatting templates. So there is some reordering. Maybe I should repeat the changes with these standardized settings, but I wanted to test the waters first
@Zelldon I added you as reviewer, because I saw you squirm when I mentioned this option to you a week ago. So feel free to weigh in on this one.

Related issues

related to #7988 and other related bugs

Definition of Done

Not all items need to be done depending on the issue and the pull request.

Code changes:

The changes are backwards compatibility with previous versions
If it fixes a bug then PRs are created to backport the fix to the last two minor versions. You can trigger a backport by assigning labels (e.g. backport stable/0.25) to the PR, in case that fails you need to create backports manually.

Testing:

There are unit/integration tests that verify all acceptance criterias of the issue
New tests are written to ensure backwards compatibility with further versions
The behavior is tested manually
The change has been verified by a QA run
The impact of the changes is verified by a benchmark

Documentation:

The documentation is updated (e.g. BPMN reference, configuration, examples, get-started guides, etc.)
New content is added to the release announcement

Prevent concurrent access by synchronizing DB operations

lgtm-com · 2021-11-01T16:52:53Z

This pull request introduces 1 alert when merging 4f0b2ac into bc0b2a8 - view on LGTM.com

new alerts:

1 for Inconsistent synchronization of getter and setter

npepinpe · 2021-11-01T17:51:54Z

I'm not surprised we don't see much difference in performance, I expect with biased locks and our assumption that mostly only one thread at a time would access the same state you wouldn't see much difference. However keep in mind that biased locking is disabled with Java 15 and onward, so I wonder if on Java 17 we wouldn't see a difference? Though maybe I'm wrong, and the bias won't apply here since our actors are actually scheduled on different threads over their lifetime. Maybe in this case it's rather lock elision + lock coarsening that kicks in? I also wonder if sometimes small changes don't see a statistically significant performance difference, but eventually they accumulate and result in death-by-a-thousand-cuts.

Out of curiosity, is there a reason we can't reuse the same mechanism for the DB as we did for the log storage appender (i.e. tying them to the term/logical clock)? Is it because we don't have trust in anything but synchronizing every access?

pihme · 2021-11-01T18:27:30Z

Out of curiosity, is there a reason we can't reuse the same mechanism for the DB as we did for the log storage appender (i.e. tying them to the term/logical clock)? Is it because we don't have trust in anything but synchronizing every access?

I am not aware of that mechanism. My only desire is that we have additional guarantees at runtime (and maybe even checks at compile time).

Zelldon · 2021-11-02T06:53:40Z

Thanks for mentioning me, and asking for my thoughts. You're right that I have some problems with this approach and I try to add some reasoning to this.

For me personally this speaks totally against how we developed in the past, lock free structures, one thread access a resource etc. This might be the weakest point here.
I don't think this would solve any problems nor prevent other issues. If we had really concurrent threads accessing the state or processor synchronizing the state only, wouldn't be enough in my eyes, it is just a shoot in one direction.
I don't think that it is the right solution for our problem. It is a solution, but for a different problem. As you wrote we run with one thread. This is something we need to make sure and guarantee in our code base, otherwise the complete structure and how we built things is broken and a mess. Synchronizing the access to the state is for me totally waste of resources, especially if we know that we have only one thread. We introduced the DBContext for the purpose of multi access, I don't think that more is necessary.
In my opinion we should concentrate on things like guarantee one and only one thread or actor access a resource at a time.
I'm not sure whether it was really the case that concurrent calls to the classes/objects where made. For me it was more like concurrent existing Actors, which access an underlying structure (RocksDB) concurrently. What I mean is that I'm not sure whether JVM objects were shared or just the RocksDB which lies outside.
As you pointed out we do not test anymore our limits in the benchmarks. It also seems that your branch has not the latest changes to the latency metric updates, which makes it even harder to compare.
I like the idea of @npepinpe to add a term to the State, which might prevent for reusing the state.

Happy to hear other thoughts on this.

saig0

💭 I don't think that adding synchronized to all state methods is the right approach.

The state is designed to be used by a single thread. And this assumption is still true. It should be used only by a single thread.

Handling concurrent access by adding synchronize looks like that the state can/should be used by multiple threads.

Instead, I would prefer to avoid concurrent access on the caller side (i.e. in the actor framework or the bootstrapping/transition logic).

However, I'm open to discussions.

npepinpe · 2021-11-02T08:48:21Z

I think @Zelldon's point is the strongest - I don't think the problem is multiple threads, just that multiple stream processors are accessing the same DB. Whether it's thread safe or not probably doesn't make a difference if you can still mutate the underlying data. If you wanted to continue with the lock approach, the idea would be that a stream processor, on creation, would grab and lock the state once, and only unlock it when it's closed/stopped. That would prevent any other access to it while it's processing.

pihme · 2021-11-02T09:04:03Z

5\. I'm not sure whether it was really the case that concurrent calls to the classes/objects where made. For me it was more like concurrent existing Actors, which access an underlying structure (RocksDB) concurrently. What I mean is that I'm not sure whether JVM objects were shared or just the RocksDB which lies outside.

Are you implying that we don't have transaction isolation? How can this have happened without concurrent access to the methods, from your point of view?

Zelldon · 2021-11-02T09:13:22Z

5\. I'm not sure whether it was really the case that concurrent calls to the classes/objects where made. For me it was more like concurrent existing Actors, which access an underlying structure (RocksDB) concurrently. What I mean is that I'm not sure whether JVM objects were shared or just the RocksDB which lies outside.
Are you implying that we don't have transaction isolation? How can this have happened without concurrent access to the methods, from your point of view?

Not sure whether you know how the RocksDB transaction work, but it is not like a relational database where you lock the complete table or something.

In particular, to be clear we using OptimisticTransactions, which means you can have multiple concurrent transactions it will be detected on commit that multiple transaction do not write on the same key. This means that in my opinion it can happen that we update a column family in one processor and in the other processor another column family without causing any optimistic locking exceptions, but this would break still our complete model since we have relations to the column families. But we have no way to model it like it is done in relation databases with foreign key or something.

You can read about this rocksdb transactions also here https://github-wiki-see.page/m/facebook/rocksdb/wiki/Transactions

pihme · 2021-11-02T09:38:30Z

You can read about this rocksdb transactions also here https://github-wiki-see.page/m/facebook/rocksdb/wiki/Transactions

Thanks @Zelldon for the link. I read most of the article. What is not clear to me is the scope of the transaction. Is a transaction per column family or can it cross multiple column families?

If it's per column family, I agree with you.

If it is across all column families, I have to think harder about it ... whether it can happen without concurrent access.

It is not that there were inconsistencies between two column families that were written to independently by different processors. What we found were inconsistencies where one processor wrote to two families. We know that in some cases precisely, because there is just one place in the whole code base where entries are removed from these two column families. And then these two column families became inconsistent.

npepinpe · 2021-11-04T15:34:21Z

One thing Ole suggested and wouldn't have such a big impact on performance (but may lead to deadlocks - still, not as bad), how about a global lock on the DB? i.e. a stream processor on start would try to grab exclusive ownership of the DB. If another DB has it, then it won't be able to, and should fail (let's ignore recovery for now). If it can, then great, and it doesn't need to synchronize all further accesses. In that sense it's quite similar to the approach of using the term as a form of lease, and could be a step in that direction.

npepinpe · 2021-12-16T17:02:41Z

Did we end up doing anything here, or are we waiting on the task force? If the latter, then let's close this.

pihme · 2021-12-16T17:10:57Z

We are waiting for the task force, I guess.

npepinpe · 2021-12-17T09:51:06Z

Then I would propose to close this until we know what/where we want to go with this. Do reopen if you think otherwise, though.

pihme added 2 commits November 1, 2021 16:49

chore(engine): add synchronized keyword

3f6b22b

Prevent concurrent access by synchronizing DB operations

Do not merge: benchmark for changes

4f0b2ac

pihme requested review from Zelldon and saig0 November 1, 2021 16:32

saig0 reviewed Nov 2, 2021

View reviewed changes

Zelldon removed their request for review November 2, 2021 13:00

npepinpe closed this Dec 17, 2021

pihme deleted the pihme-hardening-synchronized branch April 7, 2022 14:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Synchronize DB access #8118

Synchronize DB access #8118

pihme commented Nov 1, 2021

lgtm-com bot commented Nov 1, 2021

npepinpe commented Nov 1, 2021 •

edited

pihme commented Nov 1, 2021

Zelldon commented Nov 2, 2021

saig0 left a comment

npepinpe commented Nov 2, 2021

pihme commented Nov 2, 2021

Zelldon commented Nov 2, 2021

pihme commented Nov 2, 2021

npepinpe commented Nov 4, 2021

npepinpe commented Dec 16, 2021

pihme commented Dec 16, 2021

npepinpe commented Dec 17, 2021

Synchronize DB access #8118

Synchronize DB access #8118

Conversation

pihme commented Nov 1, 2021

Description

Review Hints

Related issues

Definition of Done

lgtm-com bot commented Nov 1, 2021

npepinpe commented Nov 1, 2021 • edited

pihme commented Nov 1, 2021

Zelldon commented Nov 2, 2021

saig0 left a comment

Choose a reason for hiding this comment

npepinpe commented Nov 2, 2021

pihme commented Nov 2, 2021

Zelldon commented Nov 2, 2021

pihme commented Nov 2, 2021

npepinpe commented Nov 4, 2021

npepinpe commented Dec 16, 2021

pihme commented Dec 16, 2021

npepinpe commented Dec 17, 2021

npepinpe commented Nov 1, 2021 •

edited