Assignment Metadata Store #423

Closed
wants to merge 30 commits
Conversation

narendly
Contributor

No description provided.

jiajunwang and others added 30 commits August 2, 2019 21:25
This is the initial check-in for the future development of the WAGED rebalancer.
All the components are placeholders; they will be implemented gradually.
* Adding the configuration items of the WAGED rebalancer.

Including: Instance Capacity Keys, Rebalance Preferences, Instance Capacity Details, Partition Capacity (the weight) Details.
Also adding tests to cover the new configuration items.
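As a rough illustration, such items might be populated as below; the setter names shown (setInstanceCapacityKeys, setInstanceCapacityMap) are assumptions drawn from this description, not necessarily the exact API introduced by this commit:

```java
// Hedged sketch: populating the new WAGED configuration items.
// Setter names are assumptions based on the description above.
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.helix.model.ClusterConfig;
import org.apache.helix.model.InstanceConfig;

public class WagedConfigExample {
  public static void main(String[] args) {
    ClusterConfig clusterConfig = new ClusterConfig("myCluster");
    // Declare the capacity dimensions the rebalancer will weigh.
    clusterConfig.setInstanceCapacityKeys(Arrays.asList("CPU", "DISK"));

    InstanceConfig instanceConfig = new InstanceConfig("host_12345");
    // Per-instance capacity details, one entry per declared key.
    Map<String, Integer> capacity = new HashMap<>();
    capacity.put("CPU", 16);
    capacity.put("DISK", 1024);
    instanceConfig.setInstanceCapacityMap(capacity);
  }
}
```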
* Introduce the cluster model classes to support the WAGED rebalancer.

Implement the cluster model classes with the minimum necessary information to support rebalance.
Additional fields/logic may be added later once the detailed rebalance logic is implemented.

Also add related tests.
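As a rough sketch of the shape such model classes might take (the class and field names here are illustrative placeholders, not the committed Helix classes):

```java
// Hedged sketch of a minimal cluster model; class and field names are
// illustrative placeholders rather than the committed Helix classes.
import java.util.Map;
import java.util.Set;

// A node that can host replicas, tracking its remaining capacity per key.
class AssignableNode {
  String instanceName;
  Map<String, Integer> remainingCapacity;
  Set<AssignableReplica> assignedReplicas;
}

// A single partition replica waiting to be placed on a node.
class AssignableReplica {
  String resourceName;
  String partitionName;
  String replicaState; // e.g. MASTER or SLAVE
  Map<String, Integer> capacityUsage;
}
```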
…ead of IdealState. (apache#398)

ResourceAssignment fit the usage better. And there will be no unnecessary information to be recorded or read during the rebalance calculation.
…nment. (apache#399)

This is to avoid unnecessary information being recorded or read.
* Implement Cluster Model Provider.

The model provider is called in the WAGED rebalancer to generate a Cluster Model based on the current cluster status.
The major responsibility of the provider is to parse all the assignable replicas and identify which replicas need to be reassigned. Note that if the current best possible assignment is still valid, the rebalancer won't need to recalculate the partition assignment.

Also, add unit tests to verify the main logic.
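The provider's core decision, sketched with hypothetical names (findReplicasToAssign and its parameters are illustrative, not the committed signatures):

```java
// Hedged sketch of the cluster model provider's core decision:
// keep replicas whose current best-possible placement is still valid,
// and only mark the rest for reassignment. All names are illustrative.
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class ClusterModelProviderSketch {
  /**
   * @param allReplicas           every assignable replica in the cluster
   * @param bestPossiblePlacement replica -> instance from the persisted assignment
   * @param liveInstances         instances currently alive and enabled
   * @return the subset of replicas whose assignment must be recalculated
   */
  static Set<String> findReplicasToAssign(Set<String> allReplicas,
                                          Map<String, String> bestPossiblePlacement,
                                          Set<String> liveInstances) {
    Set<String> toBeAssigned = new HashSet<>();
    for (String replica : allReplicas) {
      String assignedInstance = bestPossiblePlacement.get(replica);
      // A replica needs (re)assignment if it has never been placed or
      // its previously chosen instance is no longer usable.
      if (assignedInstance == null || !liveInstances.contains(assignedInstance)) {
        toBeAssigned.add(replica);
      }
    }
    return toBeAssigned;
  }
}
```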
In order to react efficiently to changes happening to the cluster in the new WAGED rebalancer, a new component called ChangeDetector was added (a sketch of the interface follows the changelist below).

Changelist:

Add ChangeDetector interface
Implement ResourceChangeDetector
Add ResourceChangeCache, a wrapper for critical cluster metadata
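A minimal sketch of what such an interface might look like; the method names are assumptions for illustration, not necessarily the committed API:

```java
// Hedged sketch of a ChangeDetector-style interface; method names are
// illustrative assumptions, not the exact committed API.
import java.util.Collection;

interface ChangeDetectorSketch<T> {
  // Refresh the detector's internal snapshot from the latest cluster state.
  void updateSnapshots(T newClusterState);

  // The metadata types (e.g. ClusterConfig, InstanceConfig) that changed
  // between the previous snapshot and the current one.
  Collection<String> getChangeTypes();

  // Items of the given type that were added, removed, or modified.
  Collection<String> getAdditionsByType(String changeType);
  Collection<String> getRemovalsByType(String changeType);
  Collection<String> getChangesByType(String changeType);
}
```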
This config will be applied to an instance when there is no (or an empty) capacity configuration in its InstanceConfig.
Also add unit tests.
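The fallback this describes amounts to something like the following sketch (names are illustrative):

```java
// Hedged sketch of the fallback: use the instance's own capacity map
// when present, otherwise fall back to the cluster-level default.
import java.util.Map;

class CapacityResolverSketch {
  static Map<String, Integer> resolveCapacity(Map<String, Integer> instanceCapacity,
                                              Map<String, Integer> clusterDefaultCapacity) {
    if (instanceCapacity == null || instanceCapacity.isEmpty()) {
      return clusterDefaultCapacity;
    }
    return instanceCapacity;
  }
}
```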
This is a constant that is no longer used.
…apache#365)

Issue:

When updating its snapshot, CurrentStateCache would miss all existing partitions that had state changes.

The RoutingTableProvider callback runs on the main event thread, and its time is not accounted for in the log.

Description:
Fix the bug by updating the snapshot with the correct reload keys (a sketch of the reload-key computation follows below).

Enhance the log to account for the user callback code separately.

Tests:
mvn test passed.
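The reload-key computation the fix describes might look like this sketch (hypothetical names, not the actual CurrentStateCache internals):

```java
// Hedged sketch: when refreshing a cache snapshot, the keys to reload
// must include both newly appeared entries and existing entries whose
// version changed; missing the latter was the bug described above.
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class SnapshotRefreshSketch {
  static Set<String> computeReloadKeys(Map<String, Integer> cachedVersions,
                                       Map<String, Integer> currentVersions) {
    Set<String> reloadKeys = new HashSet<>();
    for (Map.Entry<String, Integer> entry : currentVersions.entrySet()) {
      Integer cached = cachedVersions.get(entry.getKey());
      // Reload keys that are new, or whose version advanced since the
      // last snapshot (the previously missed case).
      if (cached == null || !cached.equals(entry.getValue())) {
        reloadKeys.add(entry.getKey());
      }
    }
    return reloadKeys;
  }
}
```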
Previously, ClusterConfig would be read from ZK on every pipeline run. This PR makes it a selective read and also adds it to the set of all changed types, so that the cluster change detector can more easily tell whether ClusterConfig changed without having to store two copies of the ClusterConfig object.
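A sketch of the selective-read pattern described above (hypothetical names):

```java
// Hedged sketch of a selective read: only re-fetch ClusterConfig from ZK
// when the change notification says it changed, and record the change type
// so downstream detectors can consult a set instead of diffing two copies.
import java.util.HashSet;
import java.util.Set;
import java.util.function.Supplier;

class SelectiveReadSketch {
  private Object cachedClusterConfig;
  private final Set<String> changedTypes = new HashSet<>();

  Object getClusterConfig(boolean clusterConfigChanged, Supplier<Object> zkRead) {
    if (clusterConfigChanged || cachedClusterConfig == null) {
      cachedClusterConfig = zkRead.get(); // one ZK read, only when needed
      changedTypes.add("CLUSTER_CONFIG"); // flag for the change detector
    }
    return cachedClusterConfig;
  }
}
```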
Stabilize the REST tests with the following changes:
1. Remove the temporary cluster that impacts the ClusterAccessor test.
2. Add start/end messages to all tests for debugging purposes.
3. Disable the unstable monitoring test for default MBeans. Sometimes it can be queried and sometimes not, and it is not a critical test path; we can make it stable later.
Currently, the HealthReport read makes a single call for each participant. This improvement batches the calls to ZK to reduce the number of round trips.
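A sketch of the batching idea, with a hypothetical accessor interface standing in for the real ZK client calls:

```java
// Hedged sketch: replace N single reads with one batched read.
// readOne/readAll are hypothetical stand-ins for the ZK accessor calls.
import java.util.ArrayList;
import java.util.List;

class HealthReportReadSketch {
  interface ZkAccessor {
    String readOne(String path);
    List<String> readAll(List<String> paths); // one batched round trip
  }

  // Before: one ZK call per participant.
  static List<String> readPerParticipant(ZkAccessor zk, List<String> participants) {
    List<String> reports = new ArrayList<>();
    for (String p : participants) {
      reports.add(zk.readOne("/CLUSTER/HEALTHREPORT/" + p));
    }
    return reports;
  }

  // After: a single batched call covering all participants.
  static List<String> readBatched(ZkAccessor zk, List<String> participants) {
    List<String> paths = new ArrayList<>();
    for (String p : participants) {
      paths.add("/CLUSTER/HEALTHREPORT/" + p);
    }
    return zk.readAll(paths);
  }
}
```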
Upon a Participant disconnect, the Participant would carry over tasks from the last session. This copies all previous task states to the current session and sets their requested states to DROPPED (for INIT and RUNNING states).

It came to our attention that these Participants sometimes experience connection issues while the tasks happen to be in TASK_ERROR or COMPLETED states. These tasks would get stuck on the Participant and never be dropped. This change adds logic so that all tasks whose requested state is DROPPED get dropped immediately (see the sketch after the changelist below).
Changelist:
1. Make sure all tasks whose requested state is DROPPED get added to tasksToDrop
2. Add a unit test: TestDropTerminalTasksUponReset
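A minimal sketch of item 1 (tasksToDrop comes from the description; the surrounding types and method are illustrative):

```java
// Hedged sketch: collect every task whose requested state is DROPPED,
// regardless of its current state, so terminal tasks (TASK_ERROR,
// COMPLETED) no longer get stuck after a carry-over.
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class TaskDropSketch {
  static Set<String> collectTasksToDrop(Map<String, String> requestedStates) {
    Set<String> tasksToDrop = new HashSet<>();
    for (Map.Entry<String, String> e : requestedStates.entrySet()) {
      if ("DROPPED".equals(e.getValue())) {
        tasksToDrop.add(e.getKey()); // drop immediately, even if terminal
      }
    }
    return tasksToDrop;
  }
}
```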
…on (apache#395)

* Fix the CallbackHandler registration logic in DistributedLeaderElection that may leave a leader node with no callbacks registered.

Our current initialization logic assumes a strict sequence of leader acquire/relinquish events. However, due to ZK events possibly carried over from the previous ZK session, the controller node change events might be triggered in the following sequence:
1. CALLBACK (from the previous session): Create new leader node and add handlers.
2. FINALIZE (Handle the previous session expire): Clean up handlers.
3. INIT (For the new session establishment): Expect to add the handlers back again.
As a result, if the INIT event processing does not recover the handlers, the leader controller won't be able to manage anything. This fix ensures that every acquireLeadership call tries to initialize the leader controller's callback handlers (sketched below).

Also, add test logic in TestHandleNewSession to verify the fix.
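A sketch of the idempotent re-registration this implies (names are illustrative; the actual DistributedLeaderElection code differs):

```java
// Hedged sketch: acquireLeadership always (re)initializes the callback
// handlers, so the CALLBACK -> FINALIZE -> INIT ordering described above
// cannot leave the new leader without handlers. Names are illustrative.
class LeaderElectionSketch {
  interface HandlerRegistry {
    // Must be safe to call repeatedly: adds only the handlers missing.
    void registerControllerHandlersIfMissing();
  }

  void acquireLeadership(HandlerRegistry registry) {
    // Even if a previous CALLBACK already added handlers, a FINALIZE may
    // have removed them since, so every acquire re-registers.
    registry.registerControllerHandlersIfMissing();
  }
}
```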

* Improve the leader history update logic so that no duplicate entries are recorded.
Also fix the missing helix-agent snapshot update logic in the bump-up.command script.
@narendly closed this Aug 20, 2019