pybind/mgr: fix divided by zero error #26270 (Closed)
23 changes: 3 additions & 20 deletions src/pybind/mgr/dashboard/services/ceph_service.py
@@ -6,6 +6,7 @@
 import rados

 from mgr_module import CommandResult
+from mgr_util import differentiate, latest_rate

 try:
     from more_itertools import pairwise
@@ -107,15 +108,10 @@ def get_pool_list_with_stats(cls, application=None):
             stats = pool_stats[pool['pool']]
             s = {}

-            def get_rate(series):
-                if len(series) >= 2:
-                    return differentiate(*list(series)[-2:])
-                return 0
-
             for stat_name, stat_series in stats.items():
                 s[stat_name] = {
                     'latest': stat_series[0][1],
-                    'rate': get_rate(stat_series),
+                    'rate': latest_rate(stat_series),
                     'series': [i for i in stat_series]
                 }
             pool['stats'] = s
@@ -184,10 +180,7 @@ def get_rates(cls, svc_type, svc_name, path):
     def get_rate(cls, svc_type, svc_name, path):
         """returns most recent rate"""
         data = mgr.get_counter(svc_type, svc_name, path)[path]
-
-        if data and len(data) > 1:
-            return differentiate(*data[-2:])
-        return 0.0
+        return latest_rate(data)

     @classmethod
     def get_client_perf(cls):
@@ -249,13 +242,3 @@ def get_pg_info(cls):
             'statuses': pg_summary['all'],
             'pgs_per_osd': pgs_per_osd,
         }
-
-
-def differentiate(data1, data2):
-    """
-    >>> times = [0, 2]
-    >>> values = [100, 101]
-    >>> differentiate(*zip(times, values))
-    0.5
-    """
-    return (data2[1] - data1[1]) / float(data2[0] - data1[0])
16 changes: 2 additions & 14 deletions src/pybind/mgr/diskprediction_cloud/common/clusterdata.py
@@ -7,6 +7,7 @@
 import json
 import rbd
 from mgr_module import CommandResult
+from mgr_util import latest_rate

 GB = 1024 * 1024 * 1024
@@ -24,16 +25,6 @@
 }


-def differentiate(data1, data2):
-    """
-    # >>> times = [0, 2]
-    # >>> values = [100, 101]
-    # >>> differentiate(*zip(times, values))
-    0.5
-    """
-    return (data2[1] - data1[1]) / float(data2[0] - data1[0])
-
-
 class ClusterAPI(object):

     def __init__(self, module_obj):
@@ -444,10 +435,7 @@ def get_configuration(self, key):
     def get_rate(self, svc_type, svc_name, path):
         """returns most recent rate"""
         data = self.module.get_counter(svc_type, svc_name, path)[path]
-
-        if data and len(data) > 1:
-            return differentiate(*data[-2:])
-        return 0.0
+        return latest_rate(data)

     def get_latest(self, daemon_type, daemon_name, counter):
         return self.module.get_latest(daemon_type, daemon_name, counter)
25 changes: 25 additions & 0 deletions src/pybind/mgr/mgr_util.py
@@ -66,3 +66,28 @@ def format_dimless(n, width, colored=True):

 def format_bytes(n, width, colored=True):
     return format_units(n, width, colored, decimal=False)
+
+
+def differentiate(data1, data2):
+    """
+    >>> times = [0, 2]
+    >>> values = [100, 101]
+    >>> differentiate(*zip(times, values))
+    0.5
+    """
+    try:
+        return (data2[1] - data1[1]) / float(data2[0] - data1[0])
+    except ZeroDivisionError:
+        return 0.0
Member:
Just to understand: when does this case happen? A possible alternative could be to return float('nan') (Not a Number), as silently assuming 0.0 can be misleading. nan is a perfectly valid Python float object, so you can still perform arithmetic operations with it:

>>> float('nan') + 1
nan
>>> float('nan') * 2
nan
>>> float('nan') / float('nan')
nan

Contributor Author:

JSON leaves NaN out of the specification, but json.dumps generates it anyway:

In [2]: json.dumps(float('nan'))
Out[2]: 'NaN'

No idea what clients will do with NaN values then. I'd like to keep it simple and not run into this complexity.

Member:

IMHO 0.0 is not good because it gives a wrong output.

It seems that, while a lot of JSON parsers are capable of dealing with NaN (it's a valid JS number value), it's not strict JSON:

> Numeric values that cannot be represented as sequences of digits (such as Infinity and NaN) are not permitted

(That also implies that we should always use json.dumps(..., allow_nan=False) if we want to comply with the ECMA JSON standard.)

In that case you can leverage the null value, as it's perfectly strict JSON. I would map NaN to null in the JSON dumping, while keeping nan for other output formats.
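A quick standard-library sketch of the serialization behavior under discussion: json.dumps emits the non-standard NaN token by default, allow_nan=False rejects it, and mapping NaN to None yields strictly valid null.

```python
import json
import math

# By default, json.dumps emits the JavaScript-style NaN token,
# which is not permitted by the ECMA JSON grammar.
print(json.dumps(float('nan')))  # NaN

# Strict mode raises instead of producing invalid JSON.
try:
    json.dumps(float('nan'), allow_nan=False)
except ValueError as e:
    print('rejected:', e)

# Mapping NaN to None produces the strict-JSON null token.
value = float('nan')
print(json.dumps(None if math.isnan(value) else value))  # null
```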

Contributor Author:

When returning something like [0.4, 0.23, 123.0, 123.5, null, 42, 42, 42, 0.5, 0.0], I wonder how many null-related exceptions we're going to generate in client code. I don't think many users will expect null values here.

Member:

I share that concern too, but the issue here is whether something that might not work:

[0.4, 0.23, 123.0, 123.5, **null**, 42, 42, 42, 0.5, 0.0]

is preferable to something that works but contains false data:

[0.4, 0.23, 123.0, 123.5, **0.0**, 42, 42, 42, 0.5, 0.0]

If someone draws a chart from the latter data, they will see the graph suddenly drop to zero. Most visualization engines (e.g. Grafana) handle null values properly, but 0.0 is a perfectly valid value.

If you use Python to parse that, you'll get a None (Java, PHP, etc. do the same), which will quickly trigger an exception if you try to do math on it:

>>> import json
>>> json.loads('[0.4, 0.23, 123.0, 123.5, null, 42, 42, 42, 0.5, 0.0]')
[0.4, 0.23, 123.0, 123.5, None, 42, 42, 42, 0.5, 0.0]

Contributor Author:

> @sebastian-philipp: Well, adding stray and surprising nulls leads to bugs like #26766

And I think that #26766 does the proper thing: handling all possible inputs and not mapping null or undefined to 0, which are semantically different values.

I think so, too. I plan to amend this PR this week.

> @sebastian-philipp: this is a function that affects all mgr modules, not just the dashboard.

I think that, besides the dashboard, there are just two other modules using it: diskprediction_cloud and status, so let's ping their top contributors for their feedback on the impact of this change: @hsiang41, @gmayyyha and @liewegas.

In case the dashboard becomes the official API, we should make sure we're not returning nasty surprises like stray null values. But yep, let's huddle if it makes sense.

Contributor:

Sorry for coming late to this discussion.
I'm going to step back a bit from the question of whether we should return 0.0 or None when we get a ZeroDivisionError, and analyze why we are getting a ZeroDivisionError in the first place when calculating the derivative of a sequence of data points.

If we are getting a ZeroDivisionError, it is because we have two consecutive data points with two (possibly different) values for the same timestamp. In mathematical terms, this is the same as saying that f(t) = a and f(t) = b at the same time, which is impossible. When calculating the derivative of a sequence of data points we assume the data points were generated by a continuous function, so the derivative function expects data2[0] - data1[0] to always be different from zero.
Because of timestamp resolution this is not the case: we might get two perf counter values for the same timestamp. So what we should do is filter consecutive data points that share a timestamp and keep just one of them.

Now we need to decide on the filtering algorithm for these special data points. Given a sequence of data points with the same timestamp, we can:
a) keep just the first data point of the sequence
b) keep just the last data point of the sequence
c) calculate the average of the data points

By doing this filtering we will never get a ZeroDivisionError again while calculating the derivative, which solves this problem.
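A minimal sketch of option (b) above — dedup_timestamps is a hypothetical helper for illustration, not code from this PR:

```python
def dedup_timestamps(series):
    """Keep only the last (time, value) sample for each timestamp,
    so the derivative's denominator is never zero."""
    filtered = []
    for t, v in series:
        if filtered and filtered[-1][0] == t:
            filtered[-1] = (t, v)  # option (b): keep the last sample
        else:
            filtered.append((t, v))
    return filtered

samples = [(0, 100), (2, 101), (2, 105), (3, 110)]
print(dedup_timestamps(samples))  # [(0, 100), (2, 105), (3, 110)]
```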

Comment:

IMO we should just calculate the difference (option c), but using the absolute value like I did in my draft PR #28603. Just calculate the time difference first and use 1 if it's 0:

    td = float(data2[0] - data1[0])
    return abs((data2[1] - data1[1]) / (td if td != 0 else 1))
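Expanded into a runnable sketch (differentiate_abs is a hypothetical name for illustration; the actual draft in #28603 may differ):

```python
def differentiate_abs(data1, data2):
    """Rate between two (time, value) samples; identical timestamps
    fall back to a divisor of 1 instead of raising ZeroDivisionError."""
    td = float(data2[0] - data1[0])
    return abs((data2[1] - data1[1]) / (td if td != 0 else 1))

print(differentiate_abs((0, 100), (2, 101)))  # 0.5
print(differentiate_abs((2, 100), (2, 105)))  # 5.0  (same timestamp)
```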

Contributor Author:

I agree here. Why do we get those broken perf values from C++ in the first place?

Contributor:

> I agree here. Why do we get those broken perf values from C++ in the first place?

It's not broken. The problem is that two perf values could have been separated by only a few nanoseconds, but the clock only had millisecond resolution.
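A sketch of how clock resolution produces the colliding timestamps described above, and how the fallback this PR adds to differentiate handles them:

```python
# Two counter samples taken nanoseconds apart can collapse to the
# same timestamp once truncated to a millisecond-resolution clock.
t1_ns, t2_ns = 1_000_000_100, 1_000_000_900  # ~800 ns apart
t1_ms = t1_ns // 1_000_000                   # both truncate to 1000 ms
t2_ms = t2_ns // 1_000_000
assert t1_ms == t2_ms == 1000

data1, data2 = (t1_ms, 100), (t2_ms, 105)
try:
    rate = (data2[1] - data1[1]) / float(data2[0] - data1[0])
except ZeroDivisionError:
    rate = 0.0  # the fallback this PR adds
print(rate)  # 0.0
```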



+def latest_rate(data):
+    if data and len(data) > 1:
+        try:
+            last_two_items = data[-2:]
+        except TypeError:
+            # assert isinstance(data, collections.deque)
+            # but I'm happy if list() just works.
+            last_two_items = list(data)[-2:]
+        return differentiate(*last_two_items)
+    return 0.0
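The TypeError branch in latest_rate exists because collections.deque supports indexing but not slicing; a quick sketch of that fallback:

```python
from collections import deque

series = deque([(0, 100), (2, 101), (4, 103)])
try:
    last_two = series[-2:]        # deque rejects slice objects
except TypeError:
    last_two = list(series)[-2:]  # converting to a list first works
print(last_two)  # [(2, 101), (4, 103)]
```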
7 changes: 1 addition & 6 deletions src/pybind/mgr/status/module.py
@@ -41,12 +41,7 @@ def get_latest(self, daemon_type, daemon_name, stat):

     def get_rate(self, daemon_type, daemon_name, stat):
         data = self.get_counter(daemon_type, daemon_name, stat)[stat]
-
-        #self.log.error("get_latest {0} data={1}".format(stat, data))
-        if data and len(data) > 1:
-            return (data[-1][1] - data[-2][1]) / float(data[-1][0] - data[-2][0])
-        else:
-            return 0
+        return mgr_util.latest_rate(data)

     def handle_fs_status(self, cmd):
         output = ""