
CallbackHandler async re-subscribe watcher can potentially miss events #331

Closed
Jackie-Jiang opened this issue Jul 15, 2019 · 4 comments
Assignees
Labels
bug Something isn't working

Comments

@Jackie-Jiang

In CallbackHandler, the re-subscription of the watcher for CALLBACK happens asynchronously. If the re-subscription happens after the path has already changed, we will miss that change until the next change happens.
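A minimal model of the miss described above (illustrative names only, not the real Helix/ZkClient API): ZK watches are one-shot, so a change that lands after the watch fires but before the handler re-subscribes produces no notification.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a one-shot watch, showing how a change that occurs
// before the (asynchronous) re-subscribe is silently missed.
public class WatchMissDemo {
    int version = 0;
    Runnable watcher;                          // one-shot: cleared when fired
    final List<Integer> delivered = new ArrayList<>();

    void subscribe() { watcher = () -> delivered.add(version); }

    void write() {                             // models a change of the watched path
        version++;
        Runnable w = watcher;
        watcher = null;                        // one-shot watch semantics
        if (w != null) w.run();
    }

    static List<Integer> run() {
        WatchMissDemo node = new WatchMissDemo();
        node.subscribe();
        node.write();      // change 1: watch fires, delivered = [1]
        node.write();      // change 2: arrives BEFORE the re-subscribe -> missed
        node.subscribe();  // the asynchronous re-subscribe finally lands
        node.write();      // change 3: watch fires again, delivered = [1, 3]
        return node.delivered;
    }

    public static void main(String[] args) {
        System.out.println(run());  // [1, 3] -- change 2 was never observed
    }
}
```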

@jiajunwang
Contributor

The re-subscribe does not rely on the async logic; it is done by default in ZkClient.
The real problem is a race condition when updating the Helix cache, which prevents the controller from re-calculating the new mapping. Will update with more detail next week.

@jiajunwang jiajunwang self-assigned this Jul 22, 2019
@jiajunwang
Contributor

The root cause of the issue is a race condition in the Helix cache refresh logic. Basically, Helix relies on the ZK notification path to determine 2 things:

  1. Whether the current assignment is still valid.
  2. Whether the data cache needs to be refreshed.

The race condition causes these 2 checks to produce different results: the assignment was not invalidated, but the cache was updated. As a result, the rebalancer refused to calculate a new mapping even though it knew the latest IdealState.

The impact is limited to the CUSTOM rebalance mode only, because the first check is done for the custom rebalancer only.
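The divergence between the two checks can be sketched as follows (an illustrative model, not the actual Helix code): the handler reads a shared change counter twice without a lock, and a notification that lands between the two reads makes the first check say "assignment still valid" while the second check refreshes the cache.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of the race: two lock-free reads of the same counter,
// with a notification arriving in between, yield inconsistent answers.
public class DoubleReadRace {
    static String handleRefresh(AtomicInteger changeCounter, int[] seen,
                                Runnable concurrentNotification) {
        int firstRead = changeCounter.get();      // check 1: is the assignment still valid?
        boolean assignmentInvalidated = firstRead > seen[0];

        concurrentNotification.run();             // a change lands between the two reads

        int secondRead = changeCounter.get();     // check 2: does the cache need a refresh?
        boolean cacheRefreshed = secondRead > seen[0];
        seen[0] = secondRead;                     // the cache is now up to date

        // Inconsistent outcome: the cache was refreshed, but the assignment was
        // never invalidated, so the rebalancer skips recalculating the mapping
        // even though it holds the latest IdealState.
        return "invalidated=" + assignmentInvalidated + ", refreshed=" + cacheRefreshed;
    }

    public static void main(String[] args) {
        AtomicInteger counter = new AtomicInteger(0);
        int[] seen = {0};
        System.out.println(handleRefresh(counter, seen, counter::incrementAndGet));
        // invalidated=false, refreshed=true
    }
}
```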

Jackie-Jiang added a commit to apache/pinot that referenced this issue Jul 25, 2019
@jiajunwang jiajunwang added the bug Something isn't working label Jul 26, 2019
jiajunwang added a commit to jiajunwang/helix that referenced this issue Jul 26, 2019
This change fixes issue apache#331.
The design ensures one read only, to avoid locking during the change notification. However, a later update introduced an additional read. Because the notification is lock-free, the two reads may return different results, leaving the cache in an inconsistent state. The impact is that the expected rebalance might not happen.
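The "one read only" design the commit message describes can be sketched like this (an assumed shape, not the actual patch): both checks are derived from a single snapshot of the change counter, so wherever a concurrent notification lands, the two checks can never disagree.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the single-read design: one snapshot feeds both the
// assignment-validity check and the cache-refresh check, so they always agree.
public class SingleReadFix {
    static String handleRefresh(AtomicInteger changeCounter, int[] seen,
                                Runnable concurrentNotification) {
        concurrentNotification.run();           // wherever the notification lands...
        int snapshot = changeCounter.get();     // ...there is exactly one read
        boolean assignmentInvalidated = snapshot > seen[0];
        boolean cacheRefreshed = snapshot > seen[0];  // same value -> consistent
        seen[0] = snapshot;
        return "invalidated=" + assignmentInvalidated + ", refreshed=" + cacheRefreshed;
    }

    public static void main(String[] args) {
        AtomicInteger counter = new AtomicInteger(0);
        int[] seen = {0};
        System.out.println(handleRefresh(counter, seen, counter::incrementAndGet));
        // invalidated=true, refreshed=true -- the checks always agree
    }
}
```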
@jiajunwang
Contributor

@Jackie-Jiang Could you please help to take a look at the proposed fix? Thanks.

jiajunwang added a commit to jiajunwang/helix that referenced this issue Jul 30, 2019
jiajunwang added a commit that referenced this issue Jul 30, 2019
* Fix the race condition while Helix refreshes the cluster status cache.
asfgit pushed a commit that referenced this issue Aug 1, 2019
@jiajunwang
Contributor

Closing this ticket since the fix has been merged to master. We will have a release next week with one more critical fix.

chenboat pushed a commit to chenboat/incubator-pinot that referenced this issue Nov 15, 2019
mgao0 pushed a commit to mgao0/helix that referenced this issue Mar 6, 2020