Bug: Fix massive values for cpu metricset for docker module #3682

douaejeouit · 2017-02-27T16:02:12Z

Due to the use of unsigned integers, negative values were transformed into huge positive numbers.

update integration tests for cpu metricset

elasticmachine · 2017-02-27T16:02:20Z

Jenkins standing by to test this. If you aren't a maintainer, you can ignore this comment. Someone with commit access, please review this and clear it for Jenkins to run.

ruflin · 2017-02-27T18:05:16Z

@douaejeouit How can it happen in the first place that these values become negative?

douaejeouit · 2017-02-27T19:06:42Z

@ruflin, the changes are only meant to protect the module from any potential bug caused by negative values. As you know, the CPU usage values are accumulated across the container's processes. So to get the current CPU usage, we need to calculate it from both the cpu and precpu usage values (current = new - old). A negative value can occur when the new usage value is lower than the old one.
My suggestion is to multiply the value by -1 if it's negative, instead of setting it to 0. That way we can keep the resulting value of the current cpu usage. What is your opinion?

ruflin · 2017-02-27T20:23:31Z

What if we just change it from uint to int and report the actual negative values that were reported?

douaejeouit · 2017-03-05T00:24:47Z

@ruflin, sorry for the late reply. Yes, that sounds good to me! But (for the calculation) I think it would be better to cast it to float rather than int, don't you think?

douaejeouit · 2017-03-05T00:35:21Z

I've just realised that I misspoke in my first comment! The idea was to keep the resulting value, not the "resulting value of the current cpu usage". Sorry for that!
I think it's sensible to report the value as it was calculated (negative).

ruflin · 2017-03-06T14:51:37Z

My thought process here:

We need to fix the massive values
Either docker values reported are wrong or our calculation is wrong
Do negative values make any sense? Probably not.
What is the reason we do any calculations on the usage? Could we just report the total values and let ES do the calculations on query time?

Do not want to hold back this PR, just want to make sure I fully understand what it does and if we are not trying to fix something, that potentially should not exist :-)

ruflin · 2017-04-25T09:11:42Z

@douaejeouit Any thoughts on the above?

douaejeouit · 2017-04-27T09:26:52Z

@ruflin sorry for the delay.
Yes, we need to fix it. But we also need to understand why the massive values occur.

1. Why do we have negative values?
Having a negative value doesn't make any sense. I'm not sure whether it comes from the calculation or from the values returned by the docker API itself.
2. Why do we have massive values?
The calculation returns the delta between the old CPU values and the new ones. Those values are stored in a uint64 variable (because we're not expecting negative values). With this type, a negative result is transformed into a huge value instead of the real calculated value. That's why we get massive values.
3. What should we do?
I see two options. We can change the type of the variable that stores the delta, so that we keep and report the actual negative value; it is the true result of the calculation, but I'm afraid it doesn't make sense. Or we can set the delta to 0 and report 0 instead of the negative value.
I opened this PR to prevent the module from reporting huge numbers (a rare occurrence) that don't make sense.

ruflin · 2017-04-28T13:52:25Z

metricbeat/module/docker/cpu/helper.go

-func calculateLoad(value uint64) float64 {
-	return float64(value) / float64(1000000000)
+func calculateLoad(newValue uint64, oldValue uint64) float64 {
+	value := float64(newValue) - float64(oldValue)


Let's add a comment here on why we do this.

This function is meant to calculate the % CPU time change between two successive readings. The "oldValue" represents the preCpu, which refers to the CPU statistics of the previous read.
Time here is expressed in seconds and not in nanoseconds. (The main goal is to expose the % in the same way it's displayed by the docker client.)

Thanks. Can you add it as a comment to the source code?

ruflin

To fix it properly we should really figure out what is happening in point 1. SGTM to have a temporary fix for it first, but we should add a note about it in the code (see my comments).

What worries me is that, if the problem is not on the docker side but in our code, this change just circumvents the problem instead of finding and fixing it.

Ok for me to move forward with this PR.

ruflin · 2017-04-28T13:53:10Z

metricbeat/module/docker/cpu/helper.go

+func calculateLoad(newValue uint64, oldValue uint64) float64 {
+	value := float64(newValue) - float64(oldValue)
+	if value < 0 {
+		value = 0


Should we log an error here? Perhaps we should even set it to -1 to indicate something is up?

Also, we can return the value directly here, since 0 / x is 0 again.

douaejeouit · 2017-04-28T14:52:15Z

Ok. I fully agree with you! We need to figure out what's happening. I'll be on it this weekend. I'll ping you ASAP.

douaejeouit · 2017-04-30T13:07:09Z

Here are my thoughts:

The calculation isn't wrong, because it concerns the time the CPU has been in use since boot (user, system or total use). Those values can't decrease (newValue can't be < oldValue). So if we get the expected values from the API, everything should be OK.
I couldn't reproduce this scenario.
I think it's better to report the calculated percentages rather than letting ES do the calculations at query time. The reason is simple: we take advantage of the precpu field provided by the API (the precpu field is not an exact copy of cpu_stats, according to the documentation), and since we don't expose any precpu field, the user couldn't compute the right values.
Finally, I think we still have to handle this case, just in case. Setting the value to -1 LGTM.
What do you think?

ruflin

-1 SGTM. And log an error to the metricbeat log?
For the PreCPU: I would not mix it with the discussion here, as we are already using it. In general, my opinion is that it's an ES job and not a Beats job, even though I can see that it can be quite useful for some people.

ruflin · 2017-05-03T10:33:19Z

metricbeat/module/docker/cpu/helper.go

-func calculateLoad(value uint64) float64 {
-	return float64(value) / float64(1000000000)
+func calculateLoad(newValue uint64, oldValue uint64) float64 {
+	value := float64(newValue) - float64(oldValue)


Thanks. Can you add it as a comment to the source code?

- Set the time change value in between two reading to -1 if the value is negative

ruflin · 2017-05-09T13:55:56Z

metricbeat/module/docker/cpu/helper.go

+func calculateLoad(newValue uint64, oldValue uint64) float64 {
+	value := float64(newValue) - float64(oldValue)
+	if value < 0 {
+		logp.Err("time change calculation failed")


As these errors appear on a global level, some more details would be good, such as that it comes from the docker module, and perhaps the old and new values or anything else that is helpful for debugging.

OK. What about: "error calculating CPU time change for docker module: new stats value is lower than the old one"? Is that fine with you?

SGTM. I would add the two values at the end or as part of the log message, something like: new stats value (%v) is lower than the old one (%v)

update integration tests for cpu metricset

- Set the time change value in between two reading to -1 if the value is negative

- Update cpu_test

ruflin · 2017-05-11T08:03:51Z

@douaejeouit Thanks for the changes. Can you rebase on master? There seem to be some conflicts :-(

ruflin · 2017-05-12T06:07:56Z

@douaejeouit Merged. Thanks for going through all the back and forth with me :-)

douaejeouit · 2017-05-12T07:53:54Z

Thank you, with pleasure!

Fix massive values for cpu metricset

fc6d30f

update integration tests for cpu metricset

update tests

6d1be2b

ruflin added Metricbeat Metricbeat review labels Feb 27, 2017

monicasarbu added the feedback needed label Apr 10, 2017

ruflin reviewed Apr 28, 2017

ruflin requested changes Apr 28, 2017

ruflin reviewed May 3, 2017

- Comment the calculateLoad Function

24bd8ba

- Set the time change value in between two reading to -1 if the value is negative

ruflin reviewed May 9, 2017

douaejeouit and others added 5 commits May 9, 2017 18:21

Fix massive values for cpu metricset

12a2985

update integration tests for cpu metricset

update tests

e8bdbec

- Comment the calculateLoad Function

2e29ee1

- Set the time change value in between two reading to -1 if the value is negative

Merge branch 'review' of github.com:douaejeouit/beats into review

1f77251

- Update log error message

4c2779b

- Update cpu_test

douaejeouit force-pushed the review branch from 5393bfb to 4c2779b Compare May 9, 2017 16:45

ruflin approved these changes May 11, 2017

ruflin removed the feedback needed label May 12, 2017

ruflin merged commit e48b5d4 into elastic:master May 12, 2017

douaejeouit deleted the review branch May 12, 2017 14:23
