
Added new metric to report real time missing top state for partition #2344

Closed
wants to merge 7 commits

Conversation


@rahulrane50 rahulrane50 commented Jan 19, 2023

Issues

#200

Description

Issue:
Currently, if the WAGED rebalancer is used, there can be a situation where multiple leader replicas reside on the same node. If that node goes down, for maintenance or any other reason, there is currently no metric that reports the missing top state until a rebalancer event is triggered.

Ask:
The ask here is to add a metric that satisfies the following conditions:

  1. The metric should be at the resource level and should be reported irrespective of whether the top-state replica has been recovered for any partition of that resource.
  2. The metric should start as soon as "any" partition of a resource has its top state missing, and it should be "reset" only when that resource has no partition with a missing top state.

Solution:
The solution proposed in this PR starts an async thread as soon as the first partition of any resource has its top state missing. The thread continuously reports the missing-top-state duration (at least once per sliding-window interval) for partitions at the resource level, as long as the resource has "at least one" partition with a missing top state. If no resource has any partition with a missing top state, the async thread sleeps.
The gauge reported by this thread is a histogram, which won't give exact values but will approximate the growing duration; it is reset when all top-state replicas for the resource are recovered.
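For illustration only, here is a minimal sketch of that reporting loop. The map layout follows the PR's resourceName -> (partitionName -> startTimeOfMissingTopState) structure quoted later in this review; the class name, reportDuration(), and the wait/notify wiring are hypothetical stand-ins, not the PR's actual code.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: loop while any resource has a partition with a missing
// top state, feeding elapsed durations into a resource-level histogram.
class AsyncMissingTopStateMonitorSketch extends Thread {
  private final Map<String, Map<String, Long>> _missingTopStateResourceMap =
      new ConcurrentHashMap<>();

  @Override
  public void run() {
    while (!isInterrupted()) {
      try {
        synchronized (_missingTopStateResourceMap) {
          // Sleep while no resource has a partition with a missing top state.
          while (_missingTopStateResourceMap.isEmpty()) {
            _missingTopStateResourceMap.wait();
          }
        }
        long now = System.currentTimeMillis();
        for (Map.Entry<String, Map<String, Long>> resource : _missingTopStateResourceMap.entrySet()) {
          for (long startTime : resource.getValue().values()) {
            // Approximate, growing duration for this partition's missing top state.
            reportDuration(resource.getKey(), now - startTime);
          }
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
    }
  }

  private void reportDuration(String resource, long durationMs) {
    // Placeholder: the real PR updates a resource-level histogram gauge here.
  }
}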


Tests

  • [x] The following tests are written for this issue:

Ran the CI pipeline and added new tests verifying the metrics.

mvn test -Dtest=TestTopStateHandoffMetrics
[WARNING] Tests run: 12, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 18.723 s - in org.apache.helix.monitoring.mbeans.TestTopStateHandoffMetrics
[INFO]
[INFO] Results:
[INFO]
[WARNING] Tests run: 12, Failures: 0, Errors: 0, Skipped: 1


  • The following is the result of the "mvn test" command on the appropriate module:


CI is failing on helix-rest tests. Verified locally that the tests pass.
Failing test results:

2023-01-19T10:12:27.2107893Z [info] ./metadata-store-directory-common/target/surefire-reports/TestSuite.txt: Tests run: 31, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.633 s - in TestSuite
2023-01-19T10:12:27.2108983Z [info] ./helix-core/target/surefire-reports/TestSuite.txt: Tests run: 1322, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 5,906.49 s - in TestSuite
2023-01-19T10:12:27.2110115Z [info] ./helix-view-aggregator/target/surefire-reports/TestSuite.txt: Tests run: 15, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 100.176 s - in TestSuite
2023-01-19T10:12:27.2110941Z [info] ./recipes/distributed-lock-manager/target/surefire-reports/TestSuite.txt: Tests run: 0, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.492 s - in TestSuite
2023-01-19T10:12:27.2111791Z [info] ./recipes/rabbitmq-consumer-group/target/surefire-reports/TestSuite.txt: Tests run: 0, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.692 s - in TestSuite
2023-01-19T10:12:27.2112591Z [info] ./recipes/task-execution/target/surefire-reports/TestSuite.txt: Tests run: 0, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.534 s - in TestSuite
2023-01-19T10:12:27.2113368Z [info] ./recipes/service-discovery/target/surefire-reports/TestSuite.txt: Tests run: 0, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.63 s - in TestSuite
2023-01-19T10:12:27.2114188Z [info] ./recipes/rsync-replicated-file-system/target/surefire-reports/TestSuite.txt: Tests run: 0, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.582 s - in TestSuite
2023-01-19T10:12:27.2114980Z [info] ./helix-lock/target/surefire-reports/TestSuite.txt: Tests run: 11, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 55.692 s - in TestSuite
2023-01-19T10:12:27.2115716Z [info] ./metrics-common/target/surefire-reports/TestSuite.txt: Tests run: 0, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.372 s - in TestSuite
2023-01-19T10:12:27.2116463Z [info] ./zookeeper-api/target/surefire-reports/TestSuite.txt: Tests run: 82, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 152.431 s - in TestSuite
2023-01-19T10:12:27.2117226Z [info] ./helix-rest/target/surefire-reports/TestSuite.txt: Tests run: 209, Failures: 1, Errors: 0, Skipped: 18, Time elapsed: 151.752 s <<< FAILURE! - in TestSuite
2023-01-19T10:12:27.2144940Z ##[error] Test failed: testActivateSuperCluster(org.apache.helix.rest.server.TestClusterAccessor)  Time elapsed: 8.741 s  <<< FAILURE!
2023-01-19T10:12:27.2146629Z [info] ./helix-common/target/surefire-reports/TestSuite.txt: Tests run: 0, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.264 s - in TestSuite

Verified locally:

[INFO] Tests run: 39, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 208.189 s - in org.apache.helix.rest.server.TestClusterAccessor
[INFO]
[INFO] Results:
[INFO]
[INFO] Tests run: 39, Failures: 0, Errors: 0, Skipped: 0
[INFO]

Changes that Break Backward Compatibility (Optional)

  • My PR contains changes that break backward compatibility or previous assumptions for certain methods or API. They include:


Documentation (Optional)

  • In case of new functionality, my PR adds documentation in the following wiki page:


Commits

  • My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Code Quality

  • My diff has been formatted using helix-style.xml
    (helix-style-intellij.xml if IntelliJ IDE is used)

@desaikomal
Contributor

Just a small nit: since we have an ask, you can create an Issue with all the details and then, in the Issues section, say 'Fixes #'.

@rahulrane50 rahulrane50 marked this pull request as ready for review January 19, 2023 23:31
Comment on lines 89 to 90
for (Long missingTopStateStartTime : resourcePartitionEntry.getValue().values()) {
if (_missingTopStateDurationThreshold < Long.MAX_VALUE && System.currentTimeMillis() - missingTopStateStartTime > _missingTopStateDurationThreshold) {
Contributor

You can use Java's parallel compute lambda feature.

Contributor Author

That's a good suggestion, but even if we parallelize this loop, all those threads would still be updating the same gauge value, so the updates would effectively be sequential, right?

Contributor

Yes. But it would save our sequential computation time, as this update is synchronized in the regular Helix pipelines.
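For context, a hedged sketch of that suggestion (the threshold check mirrors the quoted snippet; countAboveThreshold and the surrounding class are illustrative, not the PR's code): scan the start times in parallel, then perform the single synchronized gauge update once with the aggregate, so the shared write is no longer inside the loop.

import java.util.Map;

// Illustrative only: parallel scan of start times, single aggregate result.
class ParallelScanSketch {
  static long countAboveThreshold(Map<String, Long> missingTopStateStartTimes, long thresholdMs) {
    long now = System.currentTimeMillis();
    return missingTopStateStartTimes.values().parallelStream()
        .filter(start -> thresholdMs < Long.MAX_VALUE && now - start > thresholdMs)
        .count();
  }
}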

Contributor

@mgao0 mgao0 left a comment

Thanks for the PR. I have several general questions:

  1. Is there a reason why we need a long-running thread for AsyncMissingTopStateMonitor? Can we couple the update of ResourceMonitor with the update of updateMissingTopStateResourceMap in ClusterStatusMonitor, instead of checking the map periodically?
  2. Can we make this metric optional, one that can be turned on and off by config?

/**
* Missing top state resource map: resourceName-><PartitionName->startTimeOfMissingTopState>
*/
private final Map<String, Map<String, Long>> _missingTopStateResourceMap = new ConcurrentHashMap<>();
Contributor

Can this line and the next line be part of the ClusterStatusMonitor constructor? Just declare the variable here, but do the instantiation in the constructor.

Contributor Author

That's a good point! The reason for defining it (allocating memory) in the main process is that the main process (ClusterStatusMonitor) is the one updating this map, while the async thread just reads from it.

* This thread will keep iterating over resources and reporting missingTopStateDuration for them as long as at
* least one resource in the cluster has a missing top state.
*/
_asyncMissingTopStateMonitor.start();
Contributor

Do we need to join this thread eventually?

Contributor Author

Good question! I think it definitely has to be stopped/interrupted by ClusterStatusMonitor when it cleans up all state. But the main process (ClusterStatusMonitor) does not have to wait until the thread finishes: it is an async, long-running thread, and metrics reporting is a never-ending job, so its lifecycle can be tied to the main process. Let me know if that makes sense.

Contributor

Sorry, I did not follow. When ClusterStatusMonitor terminates, we need to join this thread, don't we?

Contributor Author

I checked the flow, and the async thread should be tied to the ClusterStatusMonitor thread. Currently, all beans are registered in the active() method and unregistered or reset in the reset() method. So whenever the Helix controller activates the cluster status monitor, it should start this async thread, and whenever the Helix controller stops/resets the cluster status monitor, it should stop this async thread. To reiterate: after the cluster status monitor is reset(), the caller has to activate it again, which ensures the async thread is restarted.

Contributor

Does this mean after reset() we should kill this metrics reporting thread?

Contributor Author

Yup, that's what is done here. In reset() we unregister everything, clear the map, and then stop/interrupt the thread. The async thread is started again in the active() call.
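A sketch of the lifecycle being described, under the assumption stated in this thread (active() starts the async thread, reset() stops it); all names and method bodies here are illustrative, not the PR's code.

// Illustrative lifecycle: active() starts the monitor, reset() interrupts and
// joins it so no stale metrics are emitted after cleanup.
class MonitorLifecycleSketch {
  private Thread _asyncMissingTopStateMonitor;

  synchronized void active() {
    if (_asyncMissingTopStateMonitor == null || !_asyncMissingTopStateMonitor.isAlive()) {
      _asyncMissingTopStateMonitor = new Thread(this::reportLoop, "AsyncMissingTopStateMonitor");
      _asyncMissingTopStateMonitor.setDaemon(true); // tie its lifetime to the main process
      _asyncMissingTopStateMonitor.start();
    }
  }

  synchronized void reset() throws InterruptedException {
    // Unregister beans and clear the map first (elided), then stop the thread.
    if (_asyncMissingTopStateMonitor != null) {
      _asyncMissingTopStateMonitor.interrupt();
      _asyncMissingTopStateMonitor.join();
      _asyncMissingTopStateMonitor = null;
    }
  }

  private void reportLoop() {
    while (!Thread.currentThread().isInterrupted()) {
      // ... report missing-top-state durations, or wait() when the map is empty ...
      try {
        Thread.sleep(1000); // stand-in for the real reporting cadence
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
  }
}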

@rahulrane50
Contributor Author

TFTR @mgao0, please find answers inline!

Thanks for the PR. I have several general question:

  1. Is there a reason why we need a long-running thread for AsyncMissingTopStateMonitor? Can we couple the update of ResourceMonitor with the update of updateMissingTopStateResourceMap in ClusterStatusMonitor, instead of checking the map periodically?
    --> The reason for decoupling metric reporting from the existing main process of ClusterStatusMonitor is to report the metric irrespective of whether any event is being handled. If we couple this with ClusterStatusMonitor, the thread will report the metric only when the resource map is updated, i.e., only when an event happens. As for the long-running thread: it sleeps whenever all resources have all partitions' top states recovered. To save some resources, I added a sleep in the thread.
  2. Can we make this metric optional, one that can be turned on and off by config?
    --> That's a good point! To be honest, I'm not sure. In my mind this is a very small feature, so guarding it behind a flag may not be that useful. In terms of resource usage or performance, the impact should be small, since it's a single thread that runs only when some resource has a partition with a missing top state. I'm open to suggestions though :)

@xyuanlu
Contributor

xyuanlu commented Jan 23, 2023

A general question here. To me it seems the new metric uses some non-zero, arbitrary number to indicate that at least one partition has no top state (and 0 means all partitions are good). I feel like we could have a more logically meaningful number.

@xyuanlu
Contributor

xyuanlu commented Jan 23, 2023

--> The reason of decoupling metric reporting with existing main process of clusterstatusmonitor is to report metric irrespective of any event is being handled or not.

Please correct me if I am wrong. In your implementation, I think the update happens in "TopStateHandoffReportStage", which is triggered by events (updates on current state, etc.). We didn't quite decouple these two...?

@rahulrane50
Contributor Author

rahulrane50 commented Jan 24, 2023

@mgao0 synced up offline. Thanks a lot for your inputs. I think you are correct that it won't make sense if we just have one counter with an increasing value.
@xyuanlu and @desaikomal, I have updated the PR to report a duration now instead of just an arbitrarily incrementing counter. I hope it makes sense now.

@mgao0
Contributor

mgao0 commented Jan 24, 2023

@mgao0 synced up offline. Thanks a lot for your inputs. I think you are correct that it won't make sense if we just have one counter with an increasing value. @xyuanlu and @desaikomal, I have updated the PR to report a duration now instead of just an arbitrarily incrementing counter. I hope it makes sense now.

Thanks @rahulrane50 for the update. To add more detail, the conclusion is: if we only want a count of partitions with missing top state, we don't need an async thread and can just couple it with ClusterStatusMonitor; but if we want a real-time measurement of how long the top state has been missing, then an async thread makes sense. Thanks for changing from counting to measuring duration, and from a gauge to a histogram, which shows the distribution of missing-top-state durations across partitions; I think it makes sense. I'll take another look at your updated PR.

@@ -85,6 +85,9 @@ private void updateTopStateStatus(ResourceControllerDataProvider cache,
if (cache.getClusterConfig() != null) {
durationThreshold = cache.getClusterConfig().getMissTopStateDurationThreshold();
}
if (clusterStatusMonitor != null) {
clusterStatusMonitor.updateMissingTopStateDurationThreshold(durationThreshold);
Contributor

If durationThreshold is Long.MAX_VALUE, our monitoring will never finish; the thread will keep spinning with no action.

Contributor Author

That's correct: if the threshold is not set, it's a no-op thread. But in that case most of the metrics related to top state are not reported either, so I'm assuming this config is commonly used.

Contributor

I am not sure we can assume that. Normally we have a default value, and the user can set -1 if they want to disable this.
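For illustration, the convention being suggested might look like this; only getMissTopStateDurationThreshold() appears in the PR diff, and everything else here is a hypothetical sketch.

// Illustrative: treat a non-positive value (e.g. -1) as "metric disabled" and an
// unset Long.MAX_VALUE threshold as "never fires", so the thread can skip work.
class ThresholdConfigSketch {
  static boolean monitoringEnabled(long thresholdMs) {
    return thresholdMs > 0 && thresholdMs < Long.MAX_VALUE;
  }
}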


}
}
sleep(50); // Instead of providing a stream of duration values to the histogram, the thread can sleep in between to save some CPU cycles.
Contributor

I am still concerned about this hard-coded number here; it would not be a good thing. Could you explain why adding this number helps?

Contributor Author

Had an offline discussion with @junkaixue. After giving it some thought, I came up with the solution below; please let me know if it looks okay. Summary:

  1. The original sleep here was to save a few computation cycles in this tight for loop, but it's difficult to justify a particular sleep duration.
  2. In the new solution, the thread reports a metric for a partition only if it has not been reported within the last sliding-window reset interval (see the sketch after this list). This has a few benefits: sleeping stops the thread from reporting metrics for all resources and all partitions, which may be wrong because a resource could gain new partitions with missing top state during that sleep. Instead, once the thread has reported the duration at least once for a partition, it can skip that partition until its sliding window has elapsed.
    @desaikomal hence I didn't add a sleep here for the sliding-window reset time, but used that value to decide whether the duration should be reported for each partition.
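A sketch of that per-partition sliding-window check, as described above; the field and method names are illustrative, not the PR's actual code.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative: report a partition's missing-top-state duration only if it has
// not been reported within the last sliding-window reset interval; no sleep needed.
class SlidingWindowReportSketch {
  private final Map<String, Long> _lastReportedAtMs = new ConcurrentHashMap<>();
  private final long _slidingWindowResetIntervalMs;

  SlidingWindowReportSketch(long slidingWindowResetIntervalMs) {
    _slidingWindowResetIntervalMs = slidingWindowResetIntervalMs;
  }

  void maybeReport(String partition, long missingTopStateStartTimeMs) {
    long now = System.currentTimeMillis();
    Long lastReported = _lastReportedAtMs.get(partition);
    // Skip the partition until its sliding window has elapsed.
    if (lastReported == null || now - lastReported >= _slidingWindowResetIntervalMs) {
      reportDuration(partition, now - missingTopStateStartTimeMs);
      _lastReportedAtMs.put(partition, now);
    }
  }

  private void reportDuration(String partition, long durationMs) {
    // Placeholder for the histogram update in the real PR.
  }
}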

@@ -55,6 +55,81 @@
import org.slf4j.LoggerFactory;

public class ClusterStatusMonitor implements ClusterStatusMonitorMBean {
private class AsyncMissingTopStateMonitor extends Thread {
Contributor

This will be meaningful only if our pipeline speed is very fast. Otherwise it will not help, as the cache's updated state is refreshed once per pipeline. I don't believe we need to build this thread; we can just rely on the pipeline call as we did before.

But we need to remove the constraint of doing only final reporting.

We can discuss it tomorrow f2f.

@rahulrane50
Contributor Author

After an internal sync-up, it makes sense not to have this real-time metric. Helix will emit a metric as part of the current default pipeline (which is triggered on any Helix event) to report the number of partitions whose top state has been missing beyond the set threshold value. I will close this PR and create a new one to address it.

Development

Successfully merging this pull request may close these issues.

Add new metric to report missing top state for partition
7 participants