
Added new metric to report real time missing top state for partition #2344

Closed
wants to merge 7 commits

Conversation


@rahulrane50 rahulrane50 commented Jan 19, 2023

Issues

#200

Description

Issue:
Currently, if the WAGED rebalancer is used, there can be a situation where multiple leader replicas reside on the same node. If that node goes down, for maintenance or any other reason, there is currently no metric that reports the missing top state until a rebalancer event is triggered.

Ask:
The ask here is to add a metric that satisfies the following conditions:

  1. The metric should be at the resource level and should be reported irrespective of whether the top-state replica has been recovered for any partition of that resource.
  2. The metric should start as soon as "any" partition of a resource has its top state missing, and it should be "reset" only when that resource has no partition with a missing top state.

Solution:
The solution proposed in this PR starts an async thread as soon as the first partition of any resource has its top state missing. The thread continuously reports the missing-top-state duration (at least once per sliding-window interval) for partitions at the resource level, as long as the resource has "at least one" partition with a missing top state. If no resource has any partition with a missing top state, the async thread sleeps.
The gauge reported by this thread is a histogram, which won't give exact values but will approximate the growing duration; it is reset when all top-state replicas for the resource are recovered.
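For illustration only, here is a minimal sketch of that reporting loop. The map layout follows the PR's resourceName -> (partitionName -> startTimeOfMissingTopState) structure quoted later in this review; the class name, reportDuration(), and the wait/notify wiring are hypothetical stand-ins, not the PR's actual code.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: loop while any resource has a partition with a missing
// top state, feeding elapsed durations into a resource-level histogram.
class AsyncMissingTopStateMonitorSketch extends Thread {
  private final Map<String, Map<String, Long>> _missingTopStateResourceMap =
      new ConcurrentHashMap<>();

  @Override
  public void run() {
    while (!isInterrupted()) {
      try {
        synchronized (_missingTopStateResourceMap) {
          // Sleep while no resource has a partition with a missing top state.
          while (_missingTopStateResourceMap.isEmpty()) {
            _missingTopStateResourceMap.wait();
          }
        }
        long now = System.currentTimeMillis();
        for (Map.Entry<String, Map<String, Long>> resource : _missingTopStateResourceMap.entrySet()) {
          for (long startTime : resource.getValue().values()) {
            // Approximate, growing duration for this partition's missing top state.
            reportDuration(resource.getKey(), now - startTime);
          }
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
    }
  }

  private void reportDuration(String resource, long durationMs) {
    // Placeholder: the real PR updates a resource-level histogram gauge here.
  }
}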


Tests

  • [x] The following tests are written for this issue:

Ran the CI pipeline and added new tests verifying the metrics.

mvn test -Dtest=TestTopStateHandoffMetrics
[WARNING] Tests run: 12, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 18.723 s - in org.apache.helix.monitoring.mbeans.TestTopStateHandoffMetrics
[INFO]
[INFO] Results:
[INFO]
[WARNING] Tests run: 12, Failures: 0, Errors: 0, Skipped: 1


  • The following is the result of the "mvn test" command on the appropriate module:


CI is failing on helix-rest tests. Verified locally that the tests pass.
Failing test results:

2023-01-19T10:12:27.2107893Z [info] ./metadata-store-directory-common/target/surefire-reports/TestSuite.txt: Tests run: 31, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.633 s - in TestSuite
2023-01-19T10:12:27.2108983Z [info] ./helix-core/target/surefire-reports/TestSuite.txt: Tests run: 1322, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 5,906.49 s - in TestSuite
2023-01-19T10:12:27.2110115Z [info] ./helix-view-aggregator/target/surefire-reports/TestSuite.txt: Tests run: 15, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 100.176 s - in TestSuite
2023-01-19T10:12:27.2110941Z [info] ./recipes/distributed-lock-manager/target/surefire-reports/TestSuite.txt: Tests run: 0, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.492 s - in TestSuite
2023-01-19T10:12:27.2111791Z [info] ./recipes/rabbitmq-consumer-group/target/surefire-reports/TestSuite.txt: Tests run: 0, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.692 s - in TestSuite
2023-01-19T10:12:27.2112591Z [info] ./recipes/task-execution/target/surefire-reports/TestSuite.txt: Tests run: 0, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.534 s - in TestSuite
2023-01-19T10:12:27.2113368Z [info] ./recipes/service-discovery/target/surefire-reports/TestSuite.txt: Tests run: 0, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.63 s - in TestSuite
2023-01-19T10:12:27.2114188Z [info] ./recipes/rsync-replicated-file-system/target/surefire-reports/TestSuite.txt: Tests run: 0, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.582 s - in TestSuite
2023-01-19T10:12:27.2114980Z [info] ./helix-lock/target/surefire-reports/TestSuite.txt: Tests run: 11, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 55.692 s - in TestSuite
2023-01-19T10:12:27.2115716Z [info] ./metrics-common/target/surefire-reports/TestSuite.txt: Tests run: 0, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.372 s - in TestSuite
2023-01-19T10:12:27.2116463Z [info] ./zookeeper-api/target/surefire-reports/TestSuite.txt: Tests run: 82, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 152.431 s - in TestSuite
2023-01-19T10:12:27.2117226Z [info] ./helix-rest/target/surefire-reports/TestSuite.txt: Tests run: 209, Failures: 1, Errors: 0, Skipped: 18, Time elapsed: 151.752 s <<< FAILURE! - in TestSuite
2023-01-19T10:12:27.2144940Z ##[error] Test failed: testActivateSuperCluster(org.apache.helix.rest.server.TestClusterAccessor)  Time elapsed: 8.741 s  <<< FAILURE!
2023-01-19T10:12:27.2146629Z [info] ./helix-common/target/surefire-reports/TestSuite.txt: Tests run: 0, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.264 s - in TestSuite

Verified locally:

[INFO] Tests run: 39, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 208.189 s - in org.apache.helix.rest.server.TestClusterAccessor
[INFO]
[INFO] Results:
[INFO]
[INFO] Tests run: 39, Failures: 0, Errors: 0, Skipped: 0
[INFO]

Changes that Break Backward Compatibility (Optional)

  • My PR contains changes that break backward compatibility or previous assumptions for certain methods or API. They include:


Documentation (Optional)

  • In case of new functionality, my PR adds documentation in the following wiki page:


Commits

  • My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Code Quality

  • My diff has been formatted using helix-style.xml
    (helix-style-intellij.xml if IntelliJ IDE is used)

@desaikomal
Contributor

Just a small nit: since we have an ask, you can create an Issue with all the details and then, in the Issues section, say 'Fixes #'.

@rahulrane50 rahulrane50 marked this pull request as ready for review January 19, 2023 23:31
Comment on lines 89 to 90
for (Long missingTopStateStartTime : resourcePartitionEntry.getValue().values()) {
if (_missingTopStateDurationThreshold < Long.MAX_VALUE && System.currentTimeMillis() - missingTopStateStartTime > _missingTopStateDurationThreshold) {
Contributor

You can use Java's parallel compute lambda feature.

Contributor Author

That's a good suggestion, but even if we parallelize this loop, all those threads would still be updating the same gauge value, so the updates would effectively be sequential, right?

Contributor

Yes. But it would save our sequential computation time, as this update is synchronized in the regular Helix pipelines.
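For context, a hedged sketch of that suggestion (the threshold check mirrors the quoted snippet; countAboveThreshold and the surrounding class are illustrative, not the PR's code): scan the start times in parallel, then perform the single synchronized gauge update once with the aggregate, so the shared write is no longer inside the loop.

import java.util.Map;

// Illustrative only: parallel scan of start times, single aggregate result.
class ParallelScanSketch {
  static long countAboveThreshold(Map<String, Long> missingTopStateStartTimes, long thresholdMs) {
    long now = System.currentTimeMillis();
    return missingTopStateStartTimes.values().parallelStream()
        .filter(start -> thresholdMs < Long.MAX_VALUE && now - start > thresholdMs)
        .count();
  }
}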

Contributor

@mgao0 mgao0 left a comment

Thanks for the PR. I have several general questions:

  1. Is there a reason why we need a long-running thread for AsyncMissingTopStateMonitor? Can we couple the update of ResourceMonitor with the update of updateMissingTopStateResourceMap in ClusterStatusMonitor, instead of checking the map periodically?
  2. Can we make this metric optional, one that can be turned on and off by config?

/**
* Missing top state resource map: resourceName-><PartitionName->startTimeOfMissingTopState>
*/
private final Map<String, Map<String, Long>> _missingTopStateResourceMap = new ConcurrentHashMap<>();
Contributor

Can this line and the next line be part of the ClusterStatusMonitor constructor? Just declare the variable here, but do the instantiation in the constructor.

Contributor Author

That's a good point! The reason for defining it (allocating memory) in the main process is that the main process (ClusterStatusMonitor) is the one updating this map, while the async thread just reads from it.

* This thread will keep iterating over resources and reporting missingTopStateDuration for them as long as at
* least one resource in the cluster has a missing top state.
*/
_asyncMissingTopStateMonitor.start();
Contributor

Do we need to join this thread eventually?

Contributor Author

Good question! I think it definitely has to be stopped/interrupted by ClusterStatusMonitor when it cleans up all state. But the main process (ClusterStatusMonitor) does not have to wait until the thread finishes: it is an async, long-running thread, and metrics reporting is a never-ending job, so its lifecycle can be tied to the main process. Let me know if that makes sense.

Contributor

Sorry, I did not follow. When ClusterStatusMonitor terminates, we need to join this thread, don't we?

Contributor Author

I checked the flow, and the async thread should be tied to the ClusterStatusMonitor thread. Currently, all beans are registered in the active() method and unregistered or reset in the reset() method. So whenever the Helix controller activates the cluster status monitor, it should start this async thread, and whenever the Helix controller stops/resets the cluster status monitor, it should stop this async thread. To reiterate: after the cluster status monitor is reset(), the caller has to activate it again, which ensures the async thread is restarted.

Contributor

Does this mean after reset() we should kill this metrics reporting thread?

Contributor Author

Yup, that's what is done here. In reset() we unregister everything, clear the map, and then stop/interrupt the thread. The async thread is started again in the active() call.
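A sketch of the lifecycle being described, under the assumption stated in this thread (active() starts the async thread, reset() stops it); all names and method bodies here are illustrative, not the PR's code.

// Illustrative lifecycle: active() starts the monitor, reset() interrupts and
// joins it so no stale metrics are emitted after cleanup.
class MonitorLifecycleSketch {
  private Thread _asyncMissingTopStateMonitor;

  synchronized void active() {
    if (_asyncMissingTopStateMonitor == null || !_asyncMissingTopStateMonitor.isAlive()) {
      _asyncMissingTopStateMonitor = new Thread(this::reportLoop, "AsyncMissingTopStateMonitor");
      _asyncMissingTopStateMonitor.setDaemon(true); // tie its lifetime to the main process
      _asyncMissingTopStateMonitor.start();
    }
  }

  synchronized void reset() throws InterruptedException {
    // Unregister beans and clear the map first (elided), then stop the thread.
    if (_asyncMissingTopStateMonitor != null) {
      _asyncMissingTopStateMonitor.interrupt();
      _asyncMissingTopStateMonitor.join();
      _asyncMissingTopStateMonitor = null;
    }
  }

  private void reportLoop() {
    while (!Thread.currentThread().isInterrupted()) {
      // ... report missing-top-state durations, or wait() when the map is empty ...
      try {
        Thread.sleep(1000); // stand-in for the real reporting cadence
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
  }
}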

@rahulrane50
Contributor Author

TFTR @mgao0, please find answers inline!

Thanks for the PR. I have several general question:

  1. Is there a reason why we need a long-running thread for AsyncMissingTopStateMonitor? Can we couple the update of ResourceMonitor with the update of updateMissingTopStateResourceMap in ClusterStatusMonitor, instead of checking the map periodically?
    --> The reason for decoupling metric reporting from the existing main process of ClusterStatusMonitor is to report the metric irrespective of whether any event is being handled. If we couple this with ClusterStatusMonitor, the thread will report the metric only when the resource map is updated, i.e., only when an event happens. As for the long-running thread: it sleeps whenever all resources have all partitions' top states recovered. To save some resources, I added a sleep in the thread.
  2. Can we make this metric optional, one that can be turned on and off by config?
    --> That's a good point! To be honest, I'm not sure. In my mind this is a very small feature, so guarding it behind a flag may not be that useful. In terms of resource usage or performance, the impact should be small, since it's a single thread that runs only when some resource has a partition with a missing top state. I'm open to suggestions though :)

@xyuanlu
Contributor

xyuanlu commented Jan 23, 2023

A general question here. To me it seems the new metric uses some non-zero, arbitrary number to indicate that at least one partition has no top state (and 0 means all partitions are good). I feel like we could have a more logically meaningful number.

@xyuanlu
Contributor

xyuanlu commented Jan 23, 2023

--> The reason of decoupling metric reporting with existing main process of clusterstatusmonitor is to report metric irrespective of any event is being handled or not.

Please correct me if I am wrong. In your implementation, I think the update happens in "TopStateHandoffReportStage", which is triggered by events (updates on current state, etc.). We didn't quite decouple these two...?

@rahulrane50
Contributor Author

rahulrane50 commented Jan 24, 2023

@mgao0 synced up offline. Thanks a lot for your inputs. I think you are correct that it won't make sense if we just have one counter with an increasing value.
@xyuanlu and @desaikomal, I have updated the PR to report a duration now instead of just an arbitrarily incrementing counter. I hope it makes sense now.

@mgao0
Contributor

mgao0 commented Jan 24, 2023

@mgao0 synced up offline. Thanks a lot for your inputs. I think you are correct that it won't make sense if we just have one counter with an increasing value. @xyuanlu and @desaikomal, I have updated the PR to report a duration now instead of just an arbitrarily incrementing counter. I hope it makes sense now.

Thanks @rahulrane50 for the update. To add more detail, the conclusion is: if we only want a count of partitions with missing top state, we don't need an async thread and can just couple it with ClusterStatusMonitor; but if we want a real-time measurement of how long the top state has been missing, then an async thread makes sense. Thanks for changing from counting to measuring duration, and from a gauge to a histogram, which shows the distribution of missing-top-state durations across partitions; I think it makes sense. I'll take another look at your updated PR.

@@ -85,6 +85,9 @@ private void updateTopStateStatus(ResourceControllerDataProvider cache,
if (cache.getClusterConfig() != null) {
durationThreshold = cache.getClusterConfig().getMissTopStateDurationThreshold();
}
if (clusterStatusMonitor != null) {
clusterStatusMonitor.updateMissingTopStateDurationThreshold(durationThreshold);
Contributor

If durationThreshold is Long.MAX_VALUE, our monitoring will never finish; the thread will keep spinning with no action.

Contributor Author

That's correct: if the threshold is not set, it's a no-op thread. But in that case most of the metrics related to top state are not reported either, so I'm assuming this config is commonly used.

Contributor

I am not sure we can assume that. Normally we have a default value, and the user can set -1 if they want to disable this.
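For illustration, the convention being suggested might look like this; only getMissTopStateDurationThreshold() appears in the PR diff, and everything else here is a hypothetical sketch.

// Illustrative: treat a non-positive value (e.g. -1) as "metric disabled" and an
// unset Long.MAX_VALUE threshold as "never fires", so the thread can skip work.
class ThresholdConfigSketch {
  static boolean monitoringEnabled(long thresholdMs) {
    return thresholdMs > 0 && thresholdMs < Long.MAX_VALUE;
  }
}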


}
}
sleep(50); // Instead of providing a stream of duration values to the histogram, the thread can sleep in between to save some CPU cycles.
Contributor

I am still concerned about this hard-coded number here; it would not be a good thing. Could you explain why adding this number helps?

Contributor Author

Had an offline discussion with @junkaixue. After giving it some thought, I came up with the solution below; please let me know if it looks okay. Summary:

  1. The original sleep here was to save a few computation cycles in this tight for loop, but it's difficult to justify a particular sleep duration.
  2. In the new solution, the thread reports a metric for a partition only if it has not been reported within the last sliding-window reset interval (see the sketch after this list). This has a few benefits: sleeping stops the thread from reporting metrics for all resources and all partitions, which may be wrong because a resource could gain new partitions with missing top state during that sleep. Instead, once the thread has reported the duration at least once for a partition, it can skip that partition until its sliding window has elapsed.
    @desaikomal hence I didn't add a sleep here for the sliding-window reset time, but used that value to decide whether the duration should be reported for each partition.
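A sketch of that per-partition sliding-window check, as described above; the field and method names are illustrative, not the PR's actual code.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative: report a partition's missing-top-state duration only if it has
// not been reported within the last sliding-window reset interval; no sleep needed.
class SlidingWindowReportSketch {
  private final Map<String, Long> _lastReportedAtMs = new ConcurrentHashMap<>();
  private final long _slidingWindowResetIntervalMs;

  SlidingWindowReportSketch(long slidingWindowResetIntervalMs) {
    _slidingWindowResetIntervalMs = slidingWindowResetIntervalMs;
  }

  void maybeReport(String partition, long missingTopStateStartTimeMs) {
    long now = System.currentTimeMillis();
    Long lastReported = _lastReportedAtMs.get(partition);
    // Skip the partition until its sliding window has elapsed.
    if (lastReported == null || now - lastReported >= _slidingWindowResetIntervalMs) {
      reportDuration(partition, now - missingTopStateStartTimeMs);
      _lastReportedAtMs.put(partition, now);
    }
  }

  private void reportDuration(String partition, long durationMs) {
    // Placeholder for the histogram update in the real PR.
  }
}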

@@ -55,6 +55,81 @@
import org.slf4j.LoggerFactory;

public class ClusterStatusMonitor implements ClusterStatusMonitorMBean {
private class AsyncMissingTopStateMonitor extends Thread {
Contributor

This will be meaningful only if our pipeline speed is very fast. Otherwise it will not help, as the cache's updated state is refreshed once per pipeline. I don't believe we need to build this thread; we can just rely on the pipeline call as we did before.

But we need to remove the constraint of doing only final reporting.

We can discuss it tomorrow f2f.

@rahulrane50
Contributor Author

After an internal sync-up, it makes sense not to have this real-time metric. Helix will emit a metric as part of the current default pipeline (which is triggered on any Helix event) to report the number of partitions whose top state has been missing beyond the set threshold value. I will close this PR and create a new one to address it.

Development

Successfully merging this pull request may close these issues.

Add new metric to report missing top state for partition
7 participants