Assignment Metadata Store #423

Closed
wants to merge 30 commits
Conversation

narendly
Contributor

No description provided.

jiajunwang and others added 30 commits August 2, 2019 21:25
This is the initial check-in for the future development of the WAGED rebalancer.
All the components are placeholders; they will be implemented gradually.
* Adding the configuration items of the WAGED rebalancer.

Including: Instance Capacity Keys, Rebalance Preferences, Instance Capacity Details, Partition Capacity (the weight) Details.
Also adding tests to cover the new configuration items.
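As a rough illustration, such items might be populated as below; the setter names shown (setInstanceCapacityKeys, setInstanceCapacityMap) are assumptions drawn from this description, not necessarily the exact API introduced by this commit:

```java
// Hedged sketch: populating the new WAGED configuration items.
// Setter names are assumptions based on the description above.
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.helix.model.ClusterConfig;
import org.apache.helix.model.InstanceConfig;

public class WagedConfigExample {
  public static void main(String[] args) {
    ClusterConfig clusterConfig = new ClusterConfig("myCluster");
    // Declare the capacity dimensions the rebalancer will weigh.
    clusterConfig.setInstanceCapacityKeys(Arrays.asList("CPU", "DISK"));

    InstanceConfig instanceConfig = new InstanceConfig("host_12345");
    // Per-instance capacity details, one entry per declared key.
    Map<String, Integer> capacity = new HashMap<>();
    capacity.put("CPU", 16);
    capacity.put("DISK", 1024);
    instanceConfig.setInstanceCapacityMap(capacity);
  }
}
```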
* Introduce the cluster model classes to support the WAGED rebalancer.

Implement the cluster model classes with the minimum necessary information to support rebalance.
Additional fields/logic may be added later once the detailed rebalance logic is implemented.

Also add related tests.
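As a rough sketch of the shape such model classes might take (the class and field names here are illustrative placeholders, not the committed Helix classes):

```java
// Hedged sketch of a minimal cluster model; class and field names are
// illustrative placeholders rather than the committed Helix classes.
import java.util.Map;
import java.util.Set;

// A node that can host replicas, tracking its remaining capacity per key.
class AssignableNode {
  String instanceName;
  Map<String, Integer> remainingCapacity;
  Set<AssignableReplica> assignedReplicas;
}

// A single partition replica waiting to be placed on a node.
class AssignableReplica {
  String resourceName;
  String partitionName;
  String replicaState; // e.g. MASTER or SLAVE
  Map<String, Integer> capacityUsage;
}
```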
…ead of IdealState. (apache#398)

ResourceAssignment fit the usage better. And there will be no unnecessary information to be recorded or read during the rebalance calculation.
…nment. (apache#399)

This is to avoid unnecessary information being recorded or read.
* Implement Cluster Model Provider.

The model provider is called in the WAGED rebalancer to generate a Cluster Model based on the current cluster status.
The major responsibility of the provider is to parse all the assignable replicas and identify which replicas need to be reassigned. Note that if the current best possible assignment is still valid, the rebalancer won't need to recalculate the partition assignment.

Also, add unit tests to verify the main logic.
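The provider's core decision, sketched with hypothetical names (findReplicasToAssign and its parameters are illustrative, not the committed signatures):

```java
// Hedged sketch of the cluster model provider's core decision:
// keep replicas whose current best-possible placement is still valid,
// and only mark the rest for reassignment. All names are illustrative.
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class ClusterModelProviderSketch {
  /**
   * @param allReplicas           every assignable replica in the cluster
   * @param bestPossiblePlacement replica -> instance from the persisted assignment
   * @param liveInstances         instances currently alive and enabled
   * @return the subset of replicas whose assignment must be recalculated
   */
  static Set<String> findReplicasToAssign(Set<String> allReplicas,
                                          Map<String, String> bestPossiblePlacement,
                                          Set<String> liveInstances) {
    Set<String> toBeAssigned = new HashSet<>();
    for (String replica : allReplicas) {
      String assignedInstance = bestPossiblePlacement.get(replica);
      // A replica needs (re)assignment if it has never been placed or
      // its previously chosen instance is no longer usable.
      if (assignedInstance == null || !liveInstances.contains(assignedInstance)) {
        toBeAssigned.add(replica);
      }
    }
    return toBeAssigned;
  }
}
```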
In order to react efficiently to changes happening to the cluster in the new WAGED rebalancer, a new component called ChangeDetector was added (a sketch of the interface follows the changelist below).

Changelist:

Add ChangeDetector interface
Implement ResourceChangeDetector
Add ResourceChangeCache, a wrapper for critical cluster metadata
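A minimal sketch of what such an interface might look like; the method names are assumptions for illustration, not necessarily the committed API:

```java
// Hedged sketch of a ChangeDetector-style interface; method names are
// illustrative assumptions, not the exact committed API.
import java.util.Collection;

interface ChangeDetectorSketch<T> {
  // Refresh the detector's internal snapshot from the latest cluster state.
  void updateSnapshots(T newClusterState);

  // The metadata types (e.g. ClusterConfig, InstanceConfig) that changed
  // between the previous snapshot and the current one.
  Collection<String> getChangeTypes();

  // Items of the given type that were added, removed, or modified.
  Collection<String> getAdditionsByType(String changeType);
  Collection<String> getRemovalsByType(String changeType);
  Collection<String> getChangesByType(String changeType);
}
```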
This config will be applied to an instance when there is no (or an empty) capacity configuration in its InstanceConfig.
Also add unit tests.
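The fallback this describes amounts to something like the following sketch (names are illustrative):

```java
// Hedged sketch of the fallback: use the instance's own capacity map
// when present, otherwise fall back to the cluster-level default.
import java.util.Map;

class CapacityResolverSketch {
  static Map<String, Integer> resolveCapacity(Map<String, Integer> instanceCapacity,
                                              Map<String, Integer> clusterDefaultCapacity) {
    if (instanceCapacity == null || instanceCapacity.isEmpty()) {
      return clusterDefaultCapacity;
    }
    return instanceCapacity;
  }
}
```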
This is a constant that is no longer used.
…apache#365)

Issue:

When updating its snapshot, CurrentStateCache would miss all existing partitions that had state changes.

The RoutingTableProvider callback runs on the main event thread, and its time is not accounted for in the log.

Description:
Fix the bug by updating the snapshot with the correct reload keys (a sketch of the reload-key computation follows below).

Enhance the log to account for the user callback code separately.

Tests:
mvn test passed.
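The reload-key computation the fix describes might look like this sketch (hypothetical names, not the actual CurrentStateCache internals):

```java
// Hedged sketch: when refreshing a cache snapshot, the keys to reload
// must include both newly appeared entries and existing entries whose
// version changed; missing the latter was the bug described above.
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class SnapshotRefreshSketch {
  static Set<String> computeReloadKeys(Map<String, Integer> cachedVersions,
                                       Map<String, Integer> currentVersions) {
    Set<String> reloadKeys = new HashSet<>();
    for (Map.Entry<String, Integer> entry : currentVersions.entrySet()) {
      Integer cached = cachedVersions.get(entry.getKey());
      // Reload keys that are new, or whose version advanced since the
      // last snapshot (the previously missed case).
      if (cached == null || !cached.equals(entry.getValue())) {
        reloadKeys.add(entry.getKey());
      }
    }
    return reloadKeys;
  }
}
```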
Previously, ClusterConfig would be read from ZK on every pipeline run. This PR makes it a selective read and also adds it to the set of all changed types, so that the cluster change detector can more easily tell whether ClusterConfig changed without having to store two copies of the ClusterConfig object.
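A sketch of the selective-read pattern described above (hypothetical names):

```java
// Hedged sketch of a selective read: only re-fetch ClusterConfig from ZK
// when the change notification says it changed, and record the change type
// so downstream detectors can consult a set instead of diffing two copies.
import java.util.HashSet;
import java.util.Set;
import java.util.function.Supplier;

class SelectiveReadSketch {
  private Object cachedClusterConfig;
  private final Set<String> changedTypes = new HashSet<>();

  Object getClusterConfig(boolean clusterConfigChanged, Supplier<Object> zkRead) {
    if (clusterConfigChanged || cachedClusterConfig == null) {
      cachedClusterConfig = zkRead.get(); // one ZK read, only when needed
      changedTypes.add("CLUSTER_CONFIG"); // flag for the change detector
    }
    return cachedClusterConfig;
  }
}
```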
Stabilize the REST tests with the following changes:
1. Remove the temporary cluster that impacts the ClusterAccessor test.
2. Add start/end messages to all tests for debugging purposes.
3. Disable the unstable monitoring test for default MBeans. Sometimes it can be queried and sometimes not, and it is not a critical test path; we can make it stable later.
Currently, the HealthReport read makes a single call for each participant. This improvement batches the calls to ZK to reduce the number of round trips.
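A sketch of the batching idea, with a hypothetical accessor interface standing in for the real ZK client calls:

```java
// Hedged sketch: replace N single reads with one batched read.
// readOne/readAll are hypothetical stand-ins for the ZK accessor calls.
import java.util.ArrayList;
import java.util.List;

class HealthReportReadSketch {
  interface ZkAccessor {
    String readOne(String path);
    List<String> readAll(List<String> paths); // one batched round trip
  }

  // Before: one ZK call per participant.
  static List<String> readPerParticipant(ZkAccessor zk, List<String> participants) {
    List<String> reports = new ArrayList<>();
    for (String p : participants) {
      reports.add(zk.readOne("/CLUSTER/HEALTHREPORT/" + p));
    }
    return reports;
  }

  // After: a single batched call covering all participants.
  static List<String> readBatched(ZkAccessor zk, List<String> participants) {
    List<String> paths = new ArrayList<>();
    for (String p : participants) {
      paths.add("/CLUSTER/HEALTHREPORT/" + p);
    }
    return zk.readAll(paths);
  }
}
```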
Upon a Participant disconnect, the Participant would carry over tasks from the last session. This copies all previous task states to the current session and sets their requested states to DROPPED (for INIT and RUNNING states).

It came to our attention that these Participants sometimes experience connection issues while the tasks happen to be in TASK_ERROR or COMPLETED states. These tasks would get stuck on the Participant and never be dropped. This change adds logic so that all tasks whose requested state is DROPPED get dropped immediately (see the sketch after the changelist below).
Changelist:
1. Make sure all tasks whose requested state is DROPPED get added to tasksToDrop
2. Add a unit test: TestDropTerminalTasksUponReset
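A minimal sketch of item 1 (tasksToDrop comes from the description; the surrounding types and method are illustrative):

```java
// Hedged sketch: collect every task whose requested state is DROPPED,
// regardless of its current state, so terminal tasks (TASK_ERROR,
// COMPLETED) no longer get stuck after a carry-over.
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class TaskDropSketch {
  static Set<String> collectTasksToDrop(Map<String, String> requestedStates) {
    Set<String> tasksToDrop = new HashSet<>();
    for (Map.Entry<String, String> e : requestedStates.entrySet()) {
      if ("DROPPED".equals(e.getValue())) {
        tasksToDrop.add(e.getKey()); // drop immediately, even if terminal
      }
    }
    return tasksToDrop;
  }
}
```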
…on (apache#395)

* Fix the CallbackHandler registration logic in DistributedLeaderElection that may leave a leader node with no callbacks registered.

Our current initialization logic assumes a strict sequence of leader acquire/relinquish events. However, due to ZK events possibly carried over from the previous ZK session, the controller node change events might be triggered in the following sequence:
1. CALLBACK (from the previous session): Create new leader node and add handlers.
2. FINALIZE (Handle the previous session expire): Clean up handlers.
3. INIT (For the new session establishment): Expect to add the handlers back again.
As a result, if the INIT event processing does not recover the handlers, the leader controller won't be able to manage anything. This fix ensures that every acquireLeadership call tries to initialize the leader controller's callback handlers (sketched below).

Also, add test logic in TestHandleNewSession to verify the fix.
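A sketch of the idempotent re-registration this implies (names are illustrative; the actual DistributedLeaderElection code differs):

```java
// Hedged sketch: acquireLeadership always (re)initializes the callback
// handlers, so the CALLBACK -> FINALIZE -> INIT ordering described above
// cannot leave the new leader without handlers. Names are illustrative.
class LeaderElectionSketch {
  interface HandlerRegistry {
    // Must be safe to call repeatedly: adds only the handlers missing.
    void registerControllerHandlersIfMissing();
  }

  void acquireLeadership(HandlerRegistry registry) {
    // Even if a previous CALLBACK already added handlers, a FINALIZE may
    // have removed them since, so every acquire re-registers.
    registry.registerControllerHandlersIfMissing();
  }
}
```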

* Improve the leader history update logic so that no duplicate entries are recorded.
Also fix the missing helix-agent snapshot update logic in the bump-up.command script.
@narendly closed this Aug 20, 2019