Bug: Fix massive values for cpu metricset for docker module #3682
update integration tests for cpu metricset
Jenkins standing by to test this. If you aren't a maintainer, you can ignore this comment. Someone with commit access, please review this and clear it for Jenkins to run.
@douaejeouit How can it happen in the first place that these values become negative?
@ruflin, actually the changes are only meant to protect the module from any potential bug caused by negative values. As you know, CPU usage values are accumulated by the container's processes. So to get the current CPU usage, we need to calculate it from both the CPU and PreCPU usage (current = new - old). A negative value can occur when the new usage value is lower than the old one.
What if we just change it from uint to int and report the actual negative values that were reported?
@ruflin, sorry for the delay in replying. Yes, that sounds good to me! But for the calculation, I think it would be better to cast it to float rather than int, don't you think?
I've just realised that I misspoke in my first comment! The idea was to keep the resulting value, not the resulting value of the current CPU usage. Sorry for that!
My thought process here:
Do not want to hold back this PR, just want to make sure I fully understand what it does and that we are not trying to fix something that potentially should not exist :-)
@douaejeouit Any thoughts on the above?
@ruflin sorry for the delay.
- func calculateLoad(value uint64) float64 {
-     return float64(value) / float64(1000000000)
+ func calculateLoad(newValue uint64, oldValue uint64) float64 {
+     value := float64(newValue) - float64(oldValue)
Let's add a comment here on why we do this.
This function is meant to calculate the % CPU time change between two successive readings. The "oldValue" represents the PreCPU, which refers to the CPU statistics of the last read.
Time here is expressed in seconds, not nanoseconds. (The main goal is to expose the percentage in the same way the Docker client displays it.)
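The explanation above could be carried into the source as a doc comment. The following is only a minimal sketch: the name `cpuTimeDelta` and the `nanosPerSecond` constant are hypothetical, and it assumes the new reading is never lower than the old one, so the negative case discussed in this thread is deliberately left out.

```go
package main

import "fmt"

// nanosPerSecond converts nanosecond CPU counters to seconds.
// Hypothetical constant, introduced only for this sketch.
const nanosPerSecond = 1e9

// cpuTimeDelta (hypothetical name) returns the CPU time, in seconds, consumed
// between two successive readings. oldValue is the PreCPU counter from the
// previous read and newValue is the counter from the current read. Both are
// cumulative and expressed in nanoseconds.
// Assumption: newValue >= oldValue; the negative case is not handled here.
func cpuTimeDelta(newValue, oldValue uint64) float64 {
	return (float64(newValue) - float64(oldValue)) / nanosPerSecond
}

func main() {
	// 3.5 s of cumulative CPU time now, 2.0 s at the previous read.
	fmt.Println(cpuTimeDelta(3500000000, 2000000000)) // 1.5
}
```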
Thanks. Can you add it as a comment to the source code?
To fix it properly we should really figure out what is happening in point 1. SGTM to have a temporary fix for it first, but we should add a note about it in the code (see my comments).
The part I worry about is that, if the problem is not on the Docker side but in our code, this change just circumvents the problem instead of finding and fixing it.
Ok for me to move forward with this PR.
+ func calculateLoad(newValue uint64, oldValue uint64) float64 {
+     value := float64(newValue) - float64(oldValue)
+     if value < 0 {
+         value = 0
Should we log an error here? Perhaps we should even set it to -1 to indicate something is up?
Also, we can return the value directly here, since 0 / x is 0 anyway.
Ok. I fully agree with you! We need to figure out what's happening. I'll be on it this weekend. I'll ping you ASAP.
Here are my thoughts:
- -1 SGTM. And log an error to the metricbeat log?
- For the PreCPU: I would not mix it into the discussion here, as we are already using it. In general, my opinion is that it's an ES job and not a Beats job, even though I can see that it can be quite useful for some people.
- Set the time change value between two readings to -1 if the value is negative
+ func calculateLoad(newValue uint64, oldValue uint64) float64 {
+     value := float64(newValue) - float64(oldValue)
+     if value < 0 {
+         logp.Err("time change calculation failed")
As these errors appear on a global level, some more details would be good, such as that it is the docker module and perhaps the old and new values, or something else that is helpful for debugging.
Ok. What about: "error calculating CPU time change for docker module: new stats value is lower than the old one"? Is that fine with you?
SGTM. I would add the two values at the end or as part of the log message, something like: "new stats value (%v) is lower than the old one (%v)".
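Putting the review suggestions together, the guarded calculation might look roughly like the sketch below. It is not the merged code: the function name `loadOrSentinel` is hypothetical, and the standard-library `log` package stands in for the Beats logger so that the example is self-contained.

```go
package main

import (
	"fmt"
	"log"
)

// loadOrSentinel (hypothetical name) returns the CPU time change in seconds
// between two readings, or -1 when the new reading is lower than the old one.
func loadOrSentinel(newValue, oldValue uint64) float64 {
	value := float64(newValue) - float64(oldValue)
	if value < 0 {
		// Assumption: standard-library logger used as a stand-in.
		// Both readings are included to help with debugging.
		log.Printf("error calculating CPU time change for docker module: "+
			"new stats value (%v) is lower than the old one (%v)",
			newValue, oldValue)
		return -1
	}
	return value / float64(1000000000)
}

func main() {
	fmt.Println(loadOrSentinel(2000000000, 1000000000)) // 1
	fmt.Println(loadOrSentinel(1000000000, 2000000000)) // -1, plus a log line
}
```

Returning -1 rather than 0 keeps the anomaly visible in the reported metric, which is the point raised earlier in the review.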
update integration tests for cpu metricset
- Set the time change value between two readings to -1 if the value is negative
- Update cpu_test
@douaejeouit Thanks for the changes. Can you rebase on master? There seem to be some conflicts :-(
@douaejeouit Merged. Thanks for going through all the back and forth with me :-)
Thank you, with pleasure!
Due to the use of unsigned integers, negative values were transformed into huge positive numbers.
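The wrap-around described above is easy to reproduce in isolation. The sketch below uses two hypothetical helpers, `unsignedDelta` and `signedDelta`, and made-up counter values. It only illustrates general Go `uint64` arithmetic and does not reproduce the module's original code.

```go
package main

import "fmt"

// unsignedDelta (hypothetical) subtracts in uint64, so a "negative" result
// wraps around modulo 2^64 and shows up as a huge positive number.
func unsignedDelta(newValue, oldValue uint64) uint64 {
	return newValue - oldValue
}

// signedDelta (hypothetical) converts each operand to float64 first,
// so the sign of the difference is preserved.
func signedDelta(newValue, oldValue uint64) float64 {
	return float64(newValue) - float64(oldValue)
}

func main() {
	// Made-up readings where the new counter is lower than the old one.
	fmt.Println(unsignedDelta(100, 200)) // 18446744073709551516
	fmt.Println(signedDelta(100, 200))   // -100
}
```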