-
Notifications
You must be signed in to change notification settings - Fork 4.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
STORM-2333: CGroup memory and CPU metrics #1934
Conversation
previousSystem = systemHz; | ||
long hz = getUserHZ(); | ||
HashMap<String, Long> ret = new HashMap<>(); | ||
ret.put("user-ms", user * 1000/hz); //Convert to millis |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a integer division during ms conversion. A float division would produce better resolution. It is surely not a problem with getUserHz() returning 100, but might cause issues if 100 is changed in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did this in part to match the https://github.com/apache/storm/blob/master/external/storm-metrics/src/main/java/org/apache/storm/metrics/sigar/CPUMetric.java format, which offers ms level metrics, not floating point, but just for the JVM, not for the entire container.
I'll try to add documentation to explain the level of granularity that this supports and some of it's limitations. I am missing documentation all together after all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, thanks for the explanation.
|
||
@Override | ||
public Long parseFileContents(String contents) { | ||
return Long.parseLong(contents.trim()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Many of these cgroup getting functions have been implemented in the cgroup package:
for example there is already a method of get memory.limit_in_bytes:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll look into possibly using it, but right now most of the code that would be replaced is the code in CGroupMetricsBase that slurps the file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have been looking into it and it would be a lot of change just to save a few lines of code. If you really think it is important to make the code common I will, but I would prefer to leave it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually it might be worth it, just to allow it to support different CGroup hierarchies better. I'll keep trying
} | ||
|
||
@Override | ||
public Long parseFileContents(String contents) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@revans2 this is a really cool feature and thanks for adding it. My only question is that I see you have implemented some parsers/getters for the cgroup metrics. There is already a set of them implemented in the cgroup package. Is there a reason why we shouldn't reuse them? |
@jerrypeng Please take another look. I refactored the code to reuse the CgroupManager and such. |
docs/cgroups_in_storm.md
Outdated
|
||
The perm section needs to be configured so that the user the supervisor is running as can modify the group. | ||
|
||
If run as user is enabled so the supervisor spawns other processes as the user that launched the topology make sure that the permissions are such that indavidual users have read access but not write access. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps more clear
if "run as user"is enabled so that the supervisor spawns other processes as the user that launched the topology**,** make sure that the permissions are such that individual users have read access but not write access.
@Override | ||
public Map<String, Long> getDataFrom(CgroupCore core) throws IOException { | ||
CpuacctCore cpu = (CpuacctCore) core; | ||
Map<StatType, Long> stat = cpu.getCpuStat(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a reason why we not doing a try catch of an IOException here similar for cpu.getCpuShares() below but just propagating the exception?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of the CGroupCore subclasses throw an IOException. I thought it would be better to have the parent class handle it rather then have the children each have the same code copied and pasted. On some I guess I forgot to clean it up. Will do.
} | ||
|
||
@Override | ||
public Long getDataFrom(CgroupCore core) throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question here but also this function throws an Exception except for IOException. Should we at least change it to IOException?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
^C^V
} | ||
|
||
@Override | ||
public Long getDataFrom(CgroupCore core) throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same question here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
^C^V
thanks @revans2 for taking the time to refactor some of the code! I reviewed it once more. If there is an example of how to user the metrics in the docs I think that would be extremely helpful to users. |
@jerrypeng I added in the extra documentation you wanted, and I spell checked the whole thing. I also updated the CGroupCpuGuarantee metric to let the parent handle the exception. |
Thanks @revans2 +1 |
@jerrypeng Actually just found a bug working on a fix now.... |
@jerrypeng Fixed the issue. We were only checking if the SybSystem was enabled, but not if it was mounted. |
Thanks @revans2 +1 |
No description provided.