Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add series aggregation DSL function `aggregate` #2294

Merged
merged 15 commits into from Aug 20, 2018

Conversation

Projects
None yet
2 participants
@hermanschaaf
Copy link
Contributor

hermanschaaf commented Aug 17, 2018

This PR adds an aggregate DSL function, which allows one to combine different series in a seriesSet using a specified aggregator (currently min, max, p50, avg).

This is particularly useful when comparing data across different weeks (using the over) function. In our case, for anomaly detection, we want to compare the current day's data with an aggregated view of the same day in previous weeks. In particular, we want to compare each point in the last day to the median of each point in the corresponding day for the last 3 weeks, so that any anomalies that occurred in a previous week are ignored. This way we compare with a hypothetical "perfect" day.

For example:

$weeks = over("avg:10m-avg-zero:os.cpu", "24h", "1w", 3)
$a = aggregate($weeks, "", "p50")
merge($a, $q)

Which looks like this:

screen shot 2018-08-17 at 4 51 27 pm

Or, if we wanted to combine series but maintain the region and color groups`, that query would look like this:

$weeks = over("avg:10m-avg-zero:os.cpu{region=*,color=*}", "24h", "1w", 3)
aggregate($weeks, "region,color", "p50")

which would result in one merged series for each unique region/color combination.

I am very happy to take suggestions for changes / improvements. With regards to naming the function, I would have probably chosen "merge", but since that is already taken, I went with the OpenTSDB terminology and used "aggregate".

@hermanschaaf hermanschaaf changed the title Add series aggregation DSL Add series aggregation DSL function `aggregate` Aug 17, 2018

@kylebrandt

This comment has been minimized.

Copy link
Member

kylebrandt commented Aug 17, 2018

Oh I really like this idea. These are operations I often feel that the TSDB should do, but often they don't or not with the order of operations you may want.

Some things to think about:

  • Line up timestamps?
  • What to do on operations that include NaN values etc

Looking at how r or pandas does things can show some of the possibilities. If we just want to start with a simple function with limited behavior (for example, timestamps must line up, nan values or ignored) etc that is fine. But we might not want to take the aggregate func name for that.

@kylebrandt

This comment has been minimized.

Copy link
Member

kylebrandt commented Aug 17, 2018

Another factor is aggregation and tags / dimensions, it will most like be very useful to be able to do a group by within a tagset.

So for example, says you have a series that is host=*,disk=*, you would pass an argument that would just have "host", you would get series in the set tag with each host, and the disks were aggregated across those host.

Edit: ... Oh, maybe you already do this...

@hermanschaaf

This comment has been minimized.

Copy link
Contributor Author

hermanschaaf commented Aug 17, 2018

Thanks for the feedback @kylebrandt!

  • Another factor is aggregation and tags / dimensions, it will most like be very useful to be able to do a group by within a tagset.

    Indeed, the existing code behaves exactly like you described. Only if you leave the groups argument empty, it aggregates across all series regardless of tags.

  • NaN values: Good point, I'll add some test cases to see how that behaves now. How do you think that should work? Say we have three points that are aligned by timestamp, 1, 3, NaN, should we make the resulting point NaN regardless of the aggregation function applied?

  • For lining up timestamps. At the moment, I look at all the points that do line up and aggregate them, and any that don't align are considered on their own. I think this would be the expected behavior from this aggregate function (or whatever we end up calling it). In OpenTSDB at least, it's possible to make sure all points line up by using a downsample when retrieving the points.

    Perhaps we can add an additional function to snap points in each series, similar to how OpenTSDB does downsampling, in a different PR?

  • Any suggestions for names? I chose aggregate because it matches OpenTSDB naming conventions. combine or combineSeries? groupby, maybe? That makes less sense if no group is specified.

@kylebrandt

This comment has been minimized.

Copy link
Member

kylebrandt commented Aug 17, 2018

NaN: Honestly I don't think we really ever thought that out well enough. If we are sticking with the simplest possible version, I would just drop them

Lining up timestamps: Agreed, building interpolation is no small request. Not only do you have to downsample but also upsample. There are all sorts of different ways to do this

Name: I wanted to reserve aggregation because I kind of image that to be a function with options about NaN, downsampling, etc in the future. Then simpler functions that do common behaviors ( basically aggregation with common options) could be added. But you are correct that it is a good name. So I would recommend we call it something temporary like ~ zagg() for now. With the assumption being that we can alias the function to aggregation and deprecate the old name if we end up not really developing this feature much more in say the next 6 months, but in a way that won't break people's configs. Or if this will become a name suitable for one of the various time series aggregation category of functions it could change in the same way.

"aggregate": {
Args: []models.FuncType{models.TypeSeriesSet, models.TypeString, models.TypeString},
Return: models.TypeSeriesSet,
Tags: tagFirst,

This comment has been minimized.

@kylebrandt

kylebrandt Aug 17, 2018

Member

This should be a function that extracts what the resulting tag keys will be, so it will come from groups arg to Aggregate arg[1]. The reason is that if you look in expr/parse, the nodes have a Tag method which makes sure at parse time that the resulting tags of function make sense in any sort of union operations.

This comment has been minimized.

@hermanschaaf

hermanschaaf Aug 17, 2018

Author Contributor

Ah, thanks, that makes sense now. I updated it, please have another look when you get a chance.

}
res.Group = opentsdb.TagSet{}
results.Results = append(results.Results, res)
} else {

This comment has been minimized.

@kylebrandt

kylebrandt Aug 17, 2018

Member

drop the else, return return &results, nil above in the if condition, and then the rest of the code does not need to be indented. Same idea under https://github.com/golang/go/wiki/CodeReviewComments#indent-error-flow in terms of keeping minimal indentation.

This comment has been minimized.

@hermanschaaf

hermanschaaf Aug 17, 2018

Author Contributor

Done 👍

vals := []string{}
for _, grp := range grps {
if val, ok := result.Group[grp]; ok {
vals = append(vals, val)

This comment has been minimized.

@kylebrandt

kylebrandt Aug 17, 2018

Member

No need for else, can `continue in the if statement, and then have the error after the if statement. Same idea of minimal indentaiton

This comment has been minimized.

@hermanschaaf

hermanschaaf Aug 17, 2018

Author Contributor

Done 👍

res := Result{}
newSeries := make(Series)

switch aggregator {

This comment has been minimized.

@kylebrandt

kylebrandt Aug 17, 2018

Member

Probably useful to mimic what we do in http://bosun.org/expressions#windowquery-string-duration-string-period-string-num-scalar-funcname-string-seriesset :

In addition to supporting Bosun’s reduction functions that take on argument, percentile operations may be be done by setting funcName to p followed by number that is between 0 and 1 (inclusively). For example, "p.25" will be the 25th percentile, "p.999" will be the 99.9th percentile. "p0" and "p1" are min and max respectively (However, in these cases it is recommended to use "min" and "max" for the sake of clarity.

This comment has been minimized.

@hermanschaaf

hermanschaaf Aug 17, 2018

Author Contributor

Nice, I wasn't aware that it's done that way in the window query. Will make it so 👍

Herman Schaaf added some commits Aug 17, 2018

@hermanschaaf

This comment has been minimized.

Copy link
Contributor Author

hermanschaaf commented Aug 17, 2018

For NaN, I added a test that just asserts that NaN-aligned values remain NaN - I'm not sure we should drop them here, since that's what the dropna function is there for? (It was awkard to test, since reflect.DeepEqual doesn't consider two math.NaN values to be equal)

Left to do:

  • update the name to zaggr (if you're happy with that?)
  • use flexible p values like in the window function
  • the build is failing with Failed to stop bosun cleanly. Exit code 0 (124=60s test timeout reached) - any idea why?
@kylebrandt

This comment has been minimized.

Copy link
Member

kylebrandt commented Aug 17, 2018

Just "aggr"? reasonable unlike a z prefix, and leaves agg and aggregate for future.

Herman Schaaf
@kylebrandt

This comment has been minimized.

Copy link
Member

kylebrandt commented Aug 17, 2018

I've never really been happy with dropna though, since it can result in an empty series which I think throws an error because we never really defined what to do about that.

For now I would maybe:

  • Just drop any NaN values from series
  • if one of the series in the set ends up empty because of this, drop the series
  • I guess error if you end up with an empty set, or return the empty set and see what bosun does with it

I say the above because I'm trying to think of what will make the function least error prone (annoying) in the context of some of the stuff that needs to be better defined in Bosun in general.

So in the bigger picture outside of your PR (I should read through existing tests, maybe some of this is covered):

  • What do we do with items in a set that have no value? Whatever we do, is there a reason for it - and is what we really want in the context of alerting (what most of the expression language is for, but at least mentally distinguishing between funcs that are meant just for graphing vs to be used in an alert expression might help lead to some sane path)
  • What do we do with a set that has no values in it (again is that what we want etc). How would an empty set behave as part of a union operation?
@kylebrandt

This comment has been minimized.

Copy link
Member

kylebrandt commented Aug 17, 2018

Regarding the timeout I doubt that is you, probably just travis being slow - we use a free tier.

@kylebrandt

This comment has been minimized.

Copy link
Member

kylebrandt commented Aug 17, 2018

Recent push to master will require an update, but should just be as simple as removing the T miniprofiler.Timer argument.

@hermanschaaf

This comment has been minimized.

Copy link
Contributor Author

hermanschaaf commented Aug 17, 2018

Ok, thanks! I'll make those changes (probably tomorrow or early next week, getting a bit late over here). I'll let you know when I've made them.

Just about the NaN values again: as it stands, nothing that the aggregation function does can result in a NaN, so the only way you can end up with it is if there were NaN values in the original series given as input. Maybe I'm missing the wider context, but wouldn't that be the expected behavior? If we drop NaN entries, I feel like we're doing things users might not expect from this function.

Herman Schaaf added some commits Aug 18, 2018

@hermanschaaf

This comment has been minimized.

Copy link
Contributor Author

hermanschaaf commented Aug 18, 2018

@kylebrandt Ok, I have updated the code to:

  • use flexible percentile values like in the window function,
  • use the aggr function name
  • resolve conflicts with master
  • test the percentile values
  • run goimports

NaN behavior is still to pass them along, to avoid the complexities and possibly unexpected behavior of removing them.

What do you think?

@@ -290,7 +290,7 @@ func (a *Results) Equal(b *Results) (bool, error) {
if a.IgnoreOtherUnjoined != b.IgnoreOtherUnjoined {
return false, fmt.Errorf("ignoreUnjoined flag does not match a: %v, b: %v", a.IgnoreOtherUnjoined, b.IgnoreOtherUnjoined)
}
if a.NaNValue != a.NaNValue {
if a.NaNValue != b.NaNValue {

This comment has been minimized.

@kylebrandt

kylebrandt Aug 18, 2018

Member

nice find!

@@ -385,6 +411,198 @@ var builtins = map[string]parse.Func{
},
}

// Aggr combines multiple series matching the specified groups using an aggregator function. If group
// is empty, all given series are combined, regardless of existing groups.
// Available aggregator functions include: avg (average), p50 (median), min (minimum) and max (maximum).

This comment has been minimized.

@kylebrandt

kylebrandt Aug 18, 2018

Member

update comment to reflect changes

This comment has been minimized.

@kylebrandt

kylebrandt Aug 18, 2018

Member

would be good to have sum as well

This comment has been minimized.

@hermanschaaf

hermanschaaf Aug 18, 2018

Author Contributor

Updated the comment and added sum. Also updated the documentation accordingly.

for _, result := range series {
for t, v := range result.Value.Value().(Series) {
newSeries[t] += v
counts[t] += 1

This comment has been minimized.

@kylebrandt

kylebrandt Aug 18, 2018

Member

counts[t]++

This comment has been minimized.

@hermanschaaf

hermanschaaf Aug 18, 2018

Author Contributor

That's it, no more Python for me 🐍

@kylebrandt

This comment has been minimized.

Copy link
Member

kylebrandt commented Aug 18, 2018

Few more comments but in general looks good to me - I'm content to go with your take on NaN.

I think maybe the number of aggrThing functions (or at least the lines of code in them) could probably be reduced by a function that takes another reduction function as an argument or something like this ... but I'm okay with the aggrThing funcs as is -- so don't worry about it.

This will be quite useful :-) +1

Herman Schaaf added some commits Aug 18, 2018

@hermanschaaf

This comment has been minimized.

Copy link
Contributor Author

hermanschaaf commented Aug 18, 2018

Thanks, I removed two of the aggrThing functions (aggrMin and aggrMax) by using aggrPercentile. But yeah I agree, it would be nice to generalize these in the future so it can take any function. Potential future improvement :)

Herman Schaaf
@hermanschaaf

This comment has been minimized.

Copy link
Contributor Author

hermanschaaf commented Aug 18, 2018

Ok, made some fixes to the documentation too. All done making changes now 👍

@kylebrandt

This comment has been minimized.

Copy link
Member

kylebrandt commented Aug 20, 2018

@hermanschaaf Looks good! In a new PR can you add a little more to the documentation in terms of behavior: for example, timestamps needing to line up, what happens with NaN etc. ?

@kylebrandt kylebrandt merged commit d7dfefd into bosun-monitor:master Aug 20, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
@hermanschaaf

This comment has been minimized.

Copy link
Contributor Author

hermanschaaf commented Aug 21, 2018

Thanks @kylebrandt - sure thing, I will open a new PR for that

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.