Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added new metric to report real time missing top state for partition #2344
Added new metric to report real time missing top state for partition #2344
Changes from 2 commits
670287b
c7ed8d1
dafc47d
230be3e
23d8272
4f582d3
c3d823e
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if durationThreshold is Long.MAX-VALUE, our monitoring will never finish.. thread will remain spin with no-action..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's correct if threshold is not set then it's no-op thread. But in this case most of the metrics related to top state are not reported so i'm assuming that this config is used commonly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if we can assume that. Normally we have default value and user can set -1 if they want to disable this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be meaningful if our pipeline speed is very fast. Otherwise, it will not help as the cache updated state is refreshed per pipeline. I dont believe we need build this thread but just rely on pipeline call as we did before.
But we need to remove the constraint of doing only final reporting.
We can discuss about it tomorrow f2f.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use Java parallel compute lamda feature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good suggestion, but even if we parallelize this loop but still all those threads would be updating same guage value so it would be sequential only, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. But it will save our sequential computational time as this update is synchronized in Helix regular pipelines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this line and next line can be part of ClusterStatusMonitor constructor? Just define the type of variable here, but do the instantiation in the constructor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point! Reason of defining (allocating memory) in main process is that main process (ClusterStatusMonitor) will be the one who is updating this map and async thread will be just reading from it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to join this thread eventually?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question! I think it has to be stopped/interrupted for sure by ClusterStatusMonitor when it's cleaning up all state. But main process (ClusterStatusMonitor) do not have to wait until it's finished because it's async long running thread and metrics reporting is never ending job so it's lifecycle can be tied with main process. Let me know if that makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I did not follow. When ClusterStatusMonitor terminates, we need to join this thread dont we?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked the flow and async thread should be tied with ClusterStatusMonitor thread. That means currently, all beans are registered in active() method and unregistered or reset in reset() method. So whenever Helix controller will activate cluster status monitor it should start this async thread and whenever helix controller stops/resets cluster status monitor it should stop this async thread. So just to re-iterate whenever cluster status monitor is reset() it has to be activated first by caller which will make sure that this async thread will be started.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean after
reset()
we should kill this metrics reporting thread?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup. that's what is done here. In reset() we unregister all things, clear map and then stop/interrupt thread. Async thread will be started again in active() call.