Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

STORM-2333: CGroup memory and CPU metrics #1934

Merged
merged 6 commits into from
Feb 27, 2017
Merged

Conversation

revans2
Copy link
Contributor

@revans2 revans2 commented Feb 9, 2017

No description provided.

previousSystem = systemHz;
long hz = getUserHZ();
HashMap<String, Long> ret = new HashMap<>();
ret.put("user-ms", user * 1000/hz); //Convert to millis
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a integer division during ms conversion. A float division would produce better resolution. It is surely not a problem with getUserHz() returning 100, but might cause issues if 100 is changed in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did this in part to match the https://github.com/apache/storm/blob/master/external/storm-metrics/src/main/java/org/apache/storm/metrics/sigar/CPUMetric.java format, which offers ms level metrics, not floating point, but just for the JVM, not for the entire container.

I'll try to add documentation to explain the level of granularity that this supports and some of it's limitations. I am missing documentation all together after all.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, thanks for the explanation.


@Override
public Long parseFileContents(String contents) {
return Long.parseLong(contents.trim());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many of these cgroup getting functions have been implemented in the cgroup package:

https://github.com/apache/storm/tree/master/storm-core/src/jvm/org/apache/storm/container/cgroup/core

for example there is already a method of get memory.limit_in_bytes:

https://github.com/apache/storm/blob/master/storm-core/src/jvm/org/apache/storm/container/cgroup/core/MemoryCore.java#L133

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll look into possibly using it, but right now most of the code that would be replaced is the code in CGroupMetricsBase that slurps the file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have been looking into it and it would be a lot of change just to save a few lines of code. If you really think it is important to make the code common I will, but I would prefer to leave it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually it might be worth it, just to allow it to support different CGroup hierarchies better. I'll keep trying

}

@Override
public Long parseFileContents(String contents) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jerrypeng
Copy link
Contributor

@revans2 this is a really cool feature and thanks for adding it. My only question is that I see you have implemented some parsers/getters for the cgroup metrics. There is already a set of them implemented in the cgroup package. Is there a reason why we shouldn't reuse them?

@revans2
Copy link
Contributor Author

revans2 commented Feb 21, 2017

@jerrypeng Please take another look. I refactored the code to reuse the CgroupManager and such.


The perm section needs to be configured so that the user the supervisor is running as can modify the group.

If run as user is enabled so the supervisor spawns other processes as the user that launched the topology make sure that the permissions are such that indavidual users have read access but not write access.
Copy link
Contributor

@jerrypeng jerrypeng Feb 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps more clear

if "run as user"is enabled so that the supervisor spawns other processes as the user that launched the topology**,** make sure that the permissions are such that individual users have read access but not write access.

@Override
public Map<String, Long> getDataFrom(CgroupCore core) throws IOException {
CpuacctCore cpu = (CpuacctCore) core;
Map<StatType, Long> stat = cpu.getCpuStat();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason why we not doing a try catch of an IOException here similar for cpu.getCpuShares() below but just propagating the exception?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of the CGroupCore subclasses throw an IOException. I thought it would be better to have the parent class handle it rather then have the children each have the same code copied and pasted. On some I guess I forgot to clean it up. Will do.

}

@Override
public Long getDataFrom(CgroupCore core) throws Exception {
Copy link
Contributor

@jerrypeng jerrypeng Feb 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question here but also this function throws an Exception except for IOException. Should we at least change it to IOException?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^C^V

}

@Override
public Long getDataFrom(CgroupCore core) throws Exception {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same question here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^C^V

@jerrypeng
Copy link
Contributor

thanks @revans2 for taking the time to refactor some of the code! I reviewed it once more. If there is an example of how to user the metrics in the docs I think that would be extremely helpful to users.

@revans2
Copy link
Contributor Author

revans2 commented Feb 22, 2017

@jerrypeng I added in the extra documentation you wanted, and I spell checked the whole thing.

I also updated the CGroupCpuGuarantee metric to let the parent handle the exception.

@jerrypeng
Copy link
Contributor

Thanks @revans2 +1

@revans2
Copy link
Contributor Author

revans2 commented Feb 23, 2017

@jerrypeng Actually just found a bug working on a fix now....

@revans2
Copy link
Contributor Author

revans2 commented Feb 23, 2017

@jerrypeng Fixed the issue. We were only checking if the SybSystem was enabled, but not if it was mounted.

@jerrypeng
Copy link
Contributor

Thanks @revans2 +1

@asfgit asfgit merged commit c01b04d into apache:master Feb 27, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants