
CallbackHandler async re-subscribe watcher can potentially miss events #331

Closed
Jackie-Jiang opened this issue Jul 15, 2019 · 4 comments
Assignees
Labels
bug Something isn't working

Comments

@Jackie-Jiang

In CallbackHandler, the re-subscription of the watcher for CALLBACK happens asynchronously. If the re-subscription happens after the path has already changed, we will miss that change until the next change happens.
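A minimal model of the miss described above (illustrative names only, not the real Helix/ZkClient API): ZK watches are one-shot, so a change that lands after the watch fires but before the handler re-subscribes produces no notification.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a one-shot watch, showing how a change that occurs
// before the (asynchronous) re-subscribe is silently missed.
public class WatchMissDemo {
    int version = 0;
    Runnable watcher;                          // one-shot: cleared when fired
    final List<Integer> delivered = new ArrayList<>();

    void subscribe() { watcher = () -> delivered.add(version); }

    void write() {                             // models a change of the watched path
        version++;
        Runnable w = watcher;
        watcher = null;                        // one-shot watch semantics
        if (w != null) w.run();
    }

    static List<Integer> run() {
        WatchMissDemo node = new WatchMissDemo();
        node.subscribe();
        node.write();      // change 1: watch fires, delivered = [1]
        node.write();      // change 2: arrives BEFORE the re-subscribe -> missed
        node.subscribe();  // the asynchronous re-subscribe finally lands
        node.write();      // change 3: watch fires again, delivered = [1, 3]
        return node.delivered;
    }

    public static void main(String[] args) {
        System.out.println(run());  // [1, 3] -- change 2 was never observed
    }
}
```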

@jiajunwang
Contributor

The re-subscribe does not rely on the async logic; it is done by default in ZkClient.
The real problem is a race condition when updating the Helix cache, which prevents the controller from re-calculating the new mapping. Will update with more detail next week.

@jiajunwang jiajunwang self-assigned this Jul 22, 2019
@jiajunwang
Contributor

The root cause of the issue is a race condition in the Helix cache refresh logic. Basically, Helix relies on the ZK notification path to determine 2 things:

  1. Whether the current assignment is still valid.
  2. Whether the data cache needs to be refreshed.

The race condition causes these 2 checks to produce different results: the assignment was not invalidated, but the cache was updated. As a result, the rebalancer refused to calculate a new mapping even though it knew the latest IdealState.

The impact is limited to the CUSTOM rebalance mode only, because the first check is done for the custom rebalancer only.
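The divergence between the two checks can be sketched as follows (an illustrative model, not the actual Helix code): the handler reads a shared change counter twice without a lock, and a notification that lands between the two reads makes the first check say "assignment still valid" while the second check refreshes the cache.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of the race: two lock-free reads of the same counter,
// with a notification arriving in between, yield inconsistent answers.
public class DoubleReadRace {
    static String handleRefresh(AtomicInteger changeCounter, int[] seen,
                                Runnable concurrentNotification) {
        int firstRead = changeCounter.get();      // check 1: is the assignment still valid?
        boolean assignmentInvalidated = firstRead > seen[0];

        concurrentNotification.run();             // a change lands between the two reads

        int secondRead = changeCounter.get();     // check 2: does the cache need a refresh?
        boolean cacheRefreshed = secondRead > seen[0];
        seen[0] = secondRead;                     // the cache is now up to date

        // Inconsistent outcome: the cache was refreshed, but the assignment was
        // never invalidated, so the rebalancer skips recalculating the mapping
        // even though it holds the latest IdealState.
        return "invalidated=" + assignmentInvalidated + ", refreshed=" + cacheRefreshed;
    }

    public static void main(String[] args) {
        AtomicInteger counter = new AtomicInteger(0);
        int[] seen = {0};
        System.out.println(handleRefresh(counter, seen, counter::incrementAndGet));
        // invalidated=false, refreshed=true
    }
}
```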

Jackie-Jiang added a commit to apache/pinot that referenced this issue Jul 25, 2019
@jiajunwang jiajunwang added the bug Something isn't working label Jul 26, 2019
jiajunwang added a commit to jiajunwang/helix that referenced this issue Jul 26, 2019
This change fixes issue apache#331.
The design ensures one read only, to avoid locking during the change notification. However, a later update introduced an additional read. Because the notification is lock-free, the two reads may return different results, leaving the cache in an inconsistent state. The impact is that the expected rebalance might not happen.
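The "one read only" design the commit message describes can be sketched like this (an assumed shape, not the actual patch): both checks are derived from a single snapshot of the change counter, so wherever a concurrent notification lands, the two checks can never disagree.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the single-read design: one snapshot feeds both the
// assignment-validity check and the cache-refresh check, so they always agree.
public class SingleReadFix {
    static String handleRefresh(AtomicInteger changeCounter, int[] seen,
                                Runnable concurrentNotification) {
        concurrentNotification.run();           // wherever the notification lands...
        int snapshot = changeCounter.get();     // ...there is exactly one read
        boolean assignmentInvalidated = snapshot > seen[0];
        boolean cacheRefreshed = snapshot > seen[0];  // same value -> consistent
        seen[0] = snapshot;
        return "invalidated=" + assignmentInvalidated + ", refreshed=" + cacheRefreshed;
    }

    public static void main(String[] args) {
        AtomicInteger counter = new AtomicInteger(0);
        int[] seen = {0};
        System.out.println(handleRefresh(counter, seen, counter::incrementAndGet));
        // invalidated=true, refreshed=true -- the checks always agree
    }
}
```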
@jiajunwang
Contributor

@Jackie-Jiang Could you please help to take a look at the proposed fix? Thanks.

jiajunwang added a commit to jiajunwang/helix that referenced this issue Jul 30, 2019
jiajunwang added a commit that referenced this issue Jul 30, 2019
* Fix the race condition while Helix refreshes the cluster status cache.
asfgit pushed a commit that referenced this issue Aug 1, 2019
@jiajunwang
Contributor

Closing this ticket since the fix has been merged to master. We will have a release next week with one more critical fix.

chenboat pushed a commit to chenboat/incubator-pinot that referenced this issue Nov 15, 2019
mgao0 pushed a commit to mgao0/helix that referenced this issue Mar 6, 2020