Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Synchronize DB access #8118

Closed
wants to merge 2 commits into from
Closed

Synchronize DB access #8118

wants to merge 2 commits into from

Conversation

pihme
Copy link
Contributor

@pihme pihme commented Nov 1, 2021

Description

This PR adds the synchronized keyword to all state methods. The change was motivated by the recent bug which led to concurrent calls to these classes. The root cause was identified by @Zelldon and is being addressed in #8101.

The PR proposes an additional safety net by protecting the called methods from synchronized access. The benefit of this approach is that concurrent access is prevented at runtime.

Review Hints

  • One common objection to using synchronized is the risk of dead and live locks. However, since we program with the assumption of single threaded access anyway, those should not occur. Also, if they occur one could argue that this is the better outcome than the danger of dirty reads and writes
  • Second objection often is performance impact. I ran a benchmark http://34.77.165.228/d/I4lo7_EZk/zeebe?orgId=1&refresh=10s&var-DS_PROMETHEUS=Prometheus&var-namespace=pihme-hardening-synchronized&var-pod=All&var-partition=All It ran for more than a week now and it is without the fix to the original bug. I couldn't see any performance impact. However, this might be due to the benchmark not reaching the performance limits.
  • Also note that I made the changes before Nico backported the standardization of Idea code formatting templates. So there is some reordering. Maybe I should repeat the changes with these standardized settings, but I wanted to test the waters first
  • @Zelldon I added you as reviewer, because I saw you squirm when I mentioned this option to you a week ago. So feel free to weigh in on this one.

Related issues

related to #7988 and other related bugs

Definition of Done

Not all items need to be done depending on the issue and the pull request.

Code changes:

  • The changes are backwards compatibility with previous versions
  • If it fixes a bug then PRs are created to backport the fix to the last two minor versions. You can trigger a backport by assigning labels (e.g. backport stable/0.25) to the PR, in case that fails you need to create backports manually.

Testing:

  • There are unit/integration tests that verify all acceptance criterias of the issue
  • New tests are written to ensure backwards compatibility with further versions
  • The behavior is tested manually
  • The change has been verified by a QA run
  • The impact of the changes is verified by a benchmark

Documentation:

  • The documentation is updated (e.g. BPMN reference, configuration, examples, get-started guides, etc.)
  • New content is added to the release announcement

Prevent concurrent access by synchronizing DB operations
@pihme pihme requested review from Zelldon and saig0 November 1, 2021 16:32
@lgtm-com
Copy link

lgtm-com bot commented Nov 1, 2021

This pull request introduces 1 alert when merging 4f0b2ac into bc0b2a8 - view on LGTM.com

new alerts:

  • 1 for Inconsistent synchronization of getter and setter

@npepinpe
Copy link
Member

npepinpe commented Nov 1, 2021

I'm not surprised we don't see much difference in performance, I expect with biased locks and our assumption that mostly only one thread at a time would access the same state you wouldn't see much difference. However keep in mind that biased locking is disabled with Java 15 and onward, so I wonder if on Java 17 we wouldn't see a difference? Though maybe I'm wrong, and the bias won't apply here since our actors are actually scheduled on different threads over their lifetime. Maybe in this case it's rather lock elision + lock coarsening that kicks in? I also wonder if sometimes small changes don't see a statistically significant performance difference, but eventually they accumulate and result in death-by-a-thousand-cuts.

Out of curiosity, is there a reason we can't reuse the same mechanism for the DB as we did for the log storage appender (i.e. tying them to the term/logical clock)? Is it because we don't have trust in anything but synchronizing every access?

@pihme
Copy link
Contributor Author

pihme commented Nov 1, 2021

Out of curiosity, is there a reason we can't reuse the same mechanism for the DB as we did for the log storage appender (i.e. tying them to the term/logical clock)? Is it because we don't have trust in anything but synchronizing every access?

I am not aware of that mechanism. My only desire is that we have additional guarantees at runtime (and maybe even checks at compile time).

@Zelldon
Copy link
Member

Zelldon commented Nov 2, 2021

Thanks for mentioning me, and asking for my thoughts. You're right that I have some problems with this approach and I try to add some reasoning to this.

  1. For me personally this speaks totally against how we developed in the past, lock free structures, one thread access a resource etc. This might be the weakest point here.
  2. I don't think this would solve any problems nor prevent other issues. If we had really concurrent threads accessing the state or processor synchronizing the state only, wouldn't be enough in my eyes, it is just a shoot in one direction.
  3. I don't think that it is the right solution for our problem. It is a solution, but for a different problem. As you wrote we run with one thread. This is something we need to make sure and guarantee in our code base, otherwise the complete structure and how we built things is broken and a mess. Synchronizing the access to the state is for me totally waste of resources, especially if we know that we have only one thread. We introduced the DBContext for the purpose of multi access, I don't think that more is necessary.
  4. In my opinion we should concentrate on things like guarantee one and only one thread or actor access a resource at a time.
  5. I'm not sure whether it was really the case that concurrent calls to the classes/objects where made. For me it was more like concurrent existing Actors, which access an underlying structure (RocksDB) concurrently. What I mean is that I'm not sure whether JVM objects were shared or just the RocksDB which lies outside.
  6. As you pointed out we do not test anymore our limits in the benchmarks. It also seems that your branch has not the latest changes to the latency metric updates, which makes it even harder to compare.
  7. I like the idea of @npepinpe to add a term to the State, which might prevent for reusing the state.

Happy to hear other thoughts on this.

Copy link
Member

@saig0 saig0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💭 I don't think that adding synchronized to all state methods is the right approach.

The state is designed to be used by a single thread. And this assumption is still true. It should be used only by a single thread.

Handling concurrent access by adding synchronize looks like that the state can/should be used by multiple threads.

Instead, I would prefer to avoid concurrent access on the caller side (i.e. in the actor framework or the bootstrapping/transition logic).

However, I'm open to discussions.

@npepinpe
Copy link
Member

npepinpe commented Nov 2, 2021

I think @Zelldon's point is the strongest - I don't think the problem is multiple threads, just that multiple stream processors are accessing the same DB. Whether it's thread safe or not probably doesn't make a difference if you can still mutate the underlying data. If you wanted to continue with the lock approach, the idea would be that a stream processor, on creation, would grab and lock the state once, and only unlock it when it's closed/stopped. That would prevent any other access to it while it's processing.

@pihme
Copy link
Contributor Author

pihme commented Nov 2, 2021

5\. I'm not sure whether it was really the case that concurrent calls to the classes/objects where made. For me it was more like concurrent existing Actors, which access an underlying structure (RocksDB) concurrently. What I mean is that I'm not sure whether JVM objects were shared or just the RocksDB which lies outside.

Are you implying that we don't have transaction isolation? How can this have happened without concurrent access to the methods, from your point of view?

@Zelldon
Copy link
Member

Zelldon commented Nov 2, 2021

5\. I'm not sure whether it was really the case that concurrent calls to the classes/objects where made. For me it was more like concurrent existing Actors, which access an underlying structure (RocksDB) concurrently. What I mean is that I'm not sure whether JVM objects were shared or just the RocksDB which lies outside.

Are you implying that we don't have transaction isolation? How can this have happened without concurrent access to the methods, from your point of view?

Not sure whether you know how the RocksDB transaction work, but it is not like a relational database where you lock the complete table or something.

In particular, to be clear we using OptimisticTransactions, which means you can have multiple concurrent transactions it will be detected on commit that multiple transaction do not write on the same key. This means that in my opinion it can happen that we update a column family in one processor and in the other processor another column family without causing any optimistic locking exceptions, but this would break still our complete model since we have relations to the column families. But we have no way to model it like it is done in relation databases with foreign key or something.

You can read about this rocksdb transactions also here https://github-wiki-see.page/m/facebook/rocksdb/wiki/Transactions

@pihme
Copy link
Contributor Author

pihme commented Nov 2, 2021

You can read about this rocksdb transactions also here https://github-wiki-see.page/m/facebook/rocksdb/wiki/Transactions

Thanks @Zelldon for the link. I read most of the article. What is not clear to me is the scope of the transaction. Is a transaction per column family or can it cross multiple column families?

If it's per column family, I agree with you.

If it is across all column families, I have to think harder about it ... whether it can happen without concurrent access.

It is not that there were inconsistencies between two column families that were written to independently by different processors. What we found were inconsistencies where one processor wrote to two families. We know that in some cases precisely, because there is just one place in the whole code base where entries are removed from these two column families. And then these two column families became inconsistent.

@Zelldon Zelldon removed their request for review November 2, 2021 13:00
@npepinpe
Copy link
Member

npepinpe commented Nov 4, 2021

One thing Ole suggested and wouldn't have such a big impact on performance (but may lead to deadlocks - still, not as bad), how about a global lock on the DB? i.e. a stream processor on start would try to grab exclusive ownership of the DB. If another DB has it, then it won't be able to, and should fail (let's ignore recovery for now). If it can, then great, and it doesn't need to synchronize all further accesses. In that sense it's quite similar to the approach of using the term as a form of lease, and could be a step in that direction.

@npepinpe
Copy link
Member

Did we end up doing anything here, or are we waiting on the task force? If the latter, then let's close this.

@pihme
Copy link
Contributor Author

pihme commented Dec 16, 2021

We are waiting for the task force, I guess.

@npepinpe
Copy link
Member

Then I would propose to close this until we know what/where we want to go with this. Do reopen if you think otherwise, though.

@npepinpe npepinpe closed this Dec 17, 2021
@pihme pihme deleted the pihme-hardening-synchronized branch April 7, 2022 14:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants