Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: Fix massive values for cpu metricset for docker module #3682

Merged
merged 8 commits into from May 12, 2017

Conversation

douaejeouit
Copy link
Contributor

Due to the use of unsigned integers, negative values were transformed into huge positive numbers.

update integration tests for cpu metricset
@elasticmachine
Copy link
Collaborator

Jenkins standing by to test this. If you aren't a maintainer, you can ignore this comment. Someone with commit access, please review this and clear it for Jenkins to run.

@elasticmachine
Copy link
Collaborator

Jenkins standing by to test this. If you aren't a maintainer, you can ignore this comment. Someone with commit access, please review this and clear it for Jenkins to run.

@ruflin
Copy link
Member

ruflin commented Feb 27, 2017

@douaejeouit How can it happen in the first place that these values become negative?

@douaejeouit
Copy link
Contributor Author

douaejeouit commented Feb 27, 2017

@ruflin, Actually the changes made are only to prevent the module from any potential bug that may be caused by negative values. As you know, the CPU's usage values are accumulated by the processes of the container. So in order to get the current cpu usage value, we need to calculate it by using both cpu and Pre-cpu usage (current = new- old). The negative value might be occurring due to the fact that the new usage value is lower than the older value.
My suggestion to this is to multiply the value by a '-1' , if it's negative, instead of setting it to 0. we can keep the resulting value of the current cpu usage. what is you opinion about it ?

@ruflin
Copy link
Member

ruflin commented Feb 27, 2017

What if we just change it from uint to int and report the actual negative values that were reported?

@douaejeouit
Copy link
Contributor Author

@ruflin, sorry for the delay of replying. Yes, that sounds good to me! But (for the calcularion) I think it would be better to cast it to float rather than int, don't you think so?

@douaejeouit
Copy link
Contributor Author

douaejeouit commented Mar 5, 2017

I 've just realised that I misspoke in my first comment ><" ! The idea was to keep the resulting value and not the 'resulting value of the current cup usage.. Sorry for that !
I think it's judicious to report the value as it was reported (negative )

@ruflin
Copy link
Member

ruflin commented Mar 6, 2017

My thought process here:

  • We need to fix the massive values
  • Either docker values reported are wrong or our calculation is wrong
  • Do negative values make any sense? Probably not.
  • What is the reason we do any calculations on the usage? Could we just report the total values and let ES do the calculations on query time?

Do not want to hold back this PR, just want to make sure I fully understand what it does and if we are not trying to fix something, that potentially should not exist :-)

@ruflin
Copy link
Member

ruflin commented Apr 25, 2017

@douaejeouit Any thoughts on the above?

@douaejeouit
Copy link
Contributor Author

@ruflin sorry for the delay.
Yes, we need to fix it. But we also need to understand the reason of the occurrence of the massive values.

  • Why do we have negative values?
    Having a negative value doesn't make any sense. I'm not sure if it's due to the calculation or to the returned values from docker API itself.

  • Why do we have massive values?
    The calculation returns a delta of old CPU values and new ones. Those values are stored in a uint64 variable type ( because we're not expecting negative values). So using this type transform negative values into huge values instead of the real calculated value. That's why we get massive values

  • What should we do?
    I see two options, we can either change the variable type used to store the delta to keep and report the actual negative value. It's true that it represents the calculation result, but I'm afraid it doesn't make sense? Or we can set this last into 0 et report 0 instead of the negative value!
    I did open this PR to prevent the module to report huge numbers ( it's a rare occurrence) which don't make sense.

func calculateLoad(value uint64) float64 {
return float64(value) / float64(1000000000)
func calculateLoad(newValue uint64, oldValue uint64) float64 {
value := float64(newValue) - float64(oldValue)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets add a comment here on why we do this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is meant to calculate the % CPU time change between two successive readings. The "oldVlue" represents the preCpu which refers to the CPU statistics of the last read.
Time here is expressed by second and not by nanoseconde. ( The main goal is to expose the %, in the same way, it's displayed by docker Client)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Can you add it as a comment to the source code?

Copy link
Member

@ruflin ruflin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To fix it properly we should really figure out what is happening in point 1. SGTM to have a temporary fix for it first, but we should add a note about it in the code (see my comments).

The part I worry is that in case the problem is not on the docker side but in our code, that through this change we just circumvent the problem instead of finding and fixing hit.

Ok for me to move forward with this PR.

func calculateLoad(newValue uint64, oldValue uint64) float64 {
value := float64(newValue) - float64(oldValue)
if value < 0 {
value = 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we log an error here? Perhaps we should even set it to -1 to indicate something is up?

Also we can directly return here the value as 0 / x is 0 again.

@douaejeouit
Copy link
Contributor Author

Ok. I fully agree with you! We need to figure out what's happening. I'll be on it this weekend. I'll ping you ASAP.

@douaejeouit
Copy link
Contributor Author

Here are my thoughts:

  • The calculation isn't wrong because it concerns the time the CPU has been in use since boot ( user, system or total use). Therefore, those values can't decrease ( newValue can't be < oldValue). Thus, if we get the expected values from the API, everything should be ok!
    I couldn't reproduce this scenario.

  • I think it's better to report the calculated percentages rather than letting ES do the calculations in query time. The reason is simple: take advantage of the preCPU field provided by the API, ( precpu field is not the exact copy of cpu_stats according to the documentation ) & Since we don't expose any PreCPU field, the user can't have the right values.

  • Finally, I think that we still have to handle this just in case of! Setting the value to -1 LGTM.
    What do you think about it?

Copy link
Member

@ruflin ruflin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • -1 SGTM. And log an error to the metricbeat log?
  • For the PreCPU: I would not mix it with the discussion here as we are already using it. In general I have the opinion that it's a ES job and not a Beats job even though I can see that for some people it can be quite usefule.

func calculateLoad(value uint64) float64 {
return float64(value) / float64(1000000000)
func calculateLoad(newValue uint64, oldValue uint64) float64 {
value := float64(newValue) - float64(oldValue)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Can you add it as a comment to the source code?

- Set the time change value in between two reading to -1 if the value is negative
func calculateLoad(newValue uint64, oldValue uint64) float64 {
value := float64(newValue) - float64(oldValue)
if value < 0 {
logp.Err("time change calculation failed")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as these errors appear on a global level, some more details would be good like that it is a the docker module and perhaps the old and new value or something that is helpful for debugging.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. what about: "error calculating CPU time change for docker module: new stats value is lower than the old one". Is it fine for you?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM: I would add the two values at the end or as part of the log message something like new stats value (%v) is lower then the old one (%v)

douaejeouit and others added 5 commits May 9, 2017 18:21
update integration tests for cpu metricset
- Set the time change value in between two reading to -1 if the value is negative
- Update cpu_test
@ruflin
Copy link
Member

ruflin commented May 11, 2017

@douaejeouit Thanks for the changes. Can you rebase on master. There seem to be some conflicts :-(

@ruflin ruflin merged commit e48b5d4 into elastic:master May 12, 2017
@ruflin
Copy link
Member

ruflin commented May 12, 2017

@douaejeouit Merged. Thanks for going through all the back and forth with me :-)

@douaejeouit
Copy link
Contributor Author

Thank you, with pleasure!

@douaejeouit douaejeouit deleted the review branch May 12, 2017 14:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants