-
Couldn't load subscription status.
- Fork 9.1k
HDFS-16678. RBF should supports disable getNodeUsage() in RBFMetrics #4606
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
💔 -1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the expensive part of this whole block? this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess is rpcServer.getDatanodeReport() actually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, rpcServer.getDatanodeReport() is expensive. As the number of DNs or downstream nameservices in the cluster increases, it will become more and more expensive. such as 1w+ DNs, 5w+ DNs, 20+ NSs, 50+ NSs.
...doop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/metrics/RBFMetrics.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apache commons math has a nice StandardDeviation utility.
It might be good to use this directly.
https://commons.apache.org/proper/commons-math/javadocs/api-3.6/org/apache/commons/math3/stat/descriptive/moment/StandardDeviation.html
...op-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/router/RBFConfigKeys.java
Outdated
Show resolved
Hide resolved
|
@goiri Sir, I have updated the patch and added some UTs, please help me review it again. Thanks |
|
🎊 +1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
|
@goiri Hi, master, can you help me merge it into the trunk? |
| dev = deviation.evaluate(usages); | ||
| } | ||
| } catch (IOException e) { | ||
| LOG.error("Cannot get the live nodes: {}", e.getMessage()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel it would be better this way.
LOG.error("Cannot get the live nodes.", e).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel it would be better this way.
LOG.error("Cannot get the live nodes.", e).
Do we want to have the full stack trace? I think it is pretty clear what the error is here without it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @slfan1989 @goiri for your review. I think e.getMessage() is enough. @slfan1989 Do you have some cases that need the full stack?
|
@goiri @slfan1989 Master, can help me merge this RP into trunk? Thanks. |
|
@slfan1989 please take another look |
LGTM. @ZanderXu thanks for your contribution! @goiri @goiri Thanks for helping review the code! |
|
@goiri @slfan1989 @ferhui @tomscut Maters, thank you very much for helping me review this patch. |
Description of PR
In our prod environment, we try to collect RBF metrics every 15s through jmx_exporter. And we found that collection task often failed.
After tracing and found that the collection task is blocked at getNodeUsage() in RBFMetrics, because it will collect all datanode's usage from downstream nameservices.
This is a very expensive and almost useless operation. Because in most scenarios, each downstream nameserivce contains almost the same DNs. We can get the data usage's from any one nameservices if need, not from RBF.
So I feel that RBF should supports disable getNodeUsage() in RBFMetrics.