Aggregations: Add moving average aggregation #10024

Closed

polyfractal
Contributor

Adds an aggregation to calculate the moving average of sibling metrics in histogram-style data (histogram, date_histogram). Moving averages are useful when time series data is locally stationary and has a mean that changes slowly over time.

Seasonal data may need a different analysis, as may data that is bimodal, "bursty", or contains frequent extreme values (which are not necessarily outliers).

Request

GET /test/_search?search_type=count
{
   "aggs": {
      "my_date_histo": {
         "date_histogram": {
            "field": "@timestamp",
            "interval": "day"
         },
         "aggs": {
            "the_sum": {
               "sum": {
                  "field": "price"
               }
            },
            "the_movavg": {
               "movavg": {
                  "bucketsPath": "the_sum",
                  "window": 3,
                  "weighting" : "single_exp",
                  "gap_policy" : "ignore",
                  "settings" : {
                    "alpha" : 0.5
                  }
               }
            }
         }
      }
   }
}

bucketsPath (Required)

Path to buckets/metric values to calculate moving average over

window (Optional - default 5)

The user specifies the window size they wish to calculate a moving average for. E.g. a user may want a 30-day sliding window over a histogram of 90 days total.

Currently, if there is not enough data to "fill" the window, the moving average will be calculated with whatever is available. For example, if a user selects a 30-day window, days 1-29 will each calculate the moving average from between 1 and 29 days of data.

We could investigate adding more "edge policies", which determine how to handle gaps at the edge of the moving average.

weighting (Optional - default simple)

Currently, the agg supports four types of weighting:

  • simple: A simple (arithmetic) average. Default.
  • linear: A linearly weighted average, such that data becomes linearly less important as it gets "older" in the window
  • single_exp: Single exponentially weighted average (aka EWMA or Brown's Simple Exp Smoothing), such that data becomes exponentially less important as it gets "older".
  • double_exp: Double exponentially weighted average (Holt's linear trend method, sometimes loosely called Holt-Winters). Uses two exponential terms: it first smooths the data exponentially like single_exp, then applies a second corrective smoothing to account for a trend.
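
To make the difference between the first two weightings concrete, here is a minimal sketch of a simple and a linearly weighted average over a single window. It assumes the window is non-empty and ordered oldest first, and that the linear weights are 1..n; the class and method names are hypothetical and are not taken from this PR.

```java
/** Hypothetical helper, not the PR's code: averages one window, oldest value first. */
final class WindowAverages {

    /** Plain arithmetic mean of the window (assumes at least one value). */
    static double simple(double[] window) {
        double sum = 0;
        for (double v : window) {
            sum += v;
        }
        return sum / window.length;
    }

    /** Linear weights 1..n, so the newest value counts n times as much as the oldest. */
    static double linear(double[] window) {
        double weightedSum = 0;
        long totalWeight = 0;
        int weight = 1;
        for (double v : window) {
            weightedSum += v * weight;
            totalWeight += weight;
            weight++;
        }
        return weightedSum / totalWeight;
    }

    public static void main(String[] args) {
        double[] window = {2.0, 4.0, 6.0};
        System.out.println(simple(window)); // (2 + 4 + 6) / 3
        System.out.println(linear(window)); // (2*1 + 4*2 + 6*3) / (1 + 2 + 3)
    }
}
```

Because both functions divide by whatever is in the window, a partially filled window at the start of the histogram is handled the same way as a full one.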

gap_policy (Optional - default ignore)

Determines the policy for handling gaps in the data. Default is to ignore gaps.

settings (Optional)

Extra settings which apply to individual weighting types.

  • alpha can be set for single_exp. Defaults to 0.5
  • alpha and beta can be set for double_exp. Defaults to 0.5 and 0.5 respectively.
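
For readers unfamiliar with alpha and beta, the following sketch shows the textbook recurrences for single and double exponential smoothing. This is an illustration under assumptions, not necessarily how the PR computes its values: it seeds the level with the first value, seeds the trend with the first difference, and returns the last smoothed level. The class and method names are hypothetical.

```java
/** Hypothetical sketch of the textbook recurrences; values are ordered oldest first. */
final class ExpSmoothing {

    /** s_t = alpha * x_t + (1 - alpha) * s_{t-1}, seeded with the first value. */
    static double singleExp(double[] values, double alpha) {
        double s = values[0];
        for (int i = 1; i < values.length; i++) {
            s = alpha * values[i] + (1 - alpha) * s;
        }
        return s;
    }

    /**
     * Adds a trend term b that is itself smoothed with beta:
     *   s_t = alpha * x_t + (1 - alpha) * (s_{t-1} + b_{t-1})
     *   b_t = beta * (s_t - s_{t-1}) + (1 - beta) * b_{t-1}
     */
    static double doubleExp(double[] values, double alpha, double beta) {
        if (values.length < 2) {
            return values[0]; // not enough data to estimate a trend
        }
        double s = values[0];
        double b = values[1] - values[0];
        for (int i = 1; i < values.length; i++) {
            double previous = s;
            s = alpha * values[i] + (1 - alpha) * (previous + b);
            b = beta * (s - previous) + (1 - beta) * b;
        }
        return s;
    }

    public static void main(String[] args) {
        double[] rising = {1, 2, 3, 4};
        System.out.println(singleExp(rising, 0.5));      // lags behind the rising series
        System.out.println(doubleExp(rising, 0.5, 0.5)); // the trend term removes the lag
    }
}
```

On a steadily rising series the single exponential average trails the latest value, while the trend term in the double exponential version corrects for that, which is the "second corrective smoothing" described above.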

Response

{
   "took": 3,
   "timed_out": false,
   "aggregations": {
      "my_date_histo": {
         "buckets": [
            {
               "key_as_string": "2014-12-01T00:00:00.000Z",
               "key": 1417392000000,
               "doc_count": 1,
               "the_sum": {
                  "value": 1,
                  "value_as_string": "1.0"
               },
               "the_movavg": {
                  "value": 1
               }
            },
            {
               "key_as_string": "2014-12-02T00:00:00.000Z",
               "key": 1417478400000,
               "doc_count": 1,
               "the_sum": {
                  "value": 2,
                  "value_as_string": "2.0"
               },
               "the_movavg": {
                  "value": 1.5
               }
            },
            {
               "key_as_string": "2014-12-04T00:00:00.000Z",
               "key": 1417651200000,
               "doc_count": 1,
               "the_sum": {
                  "value": 4,
                  "value_as_string": "4.0"
               },
               "the_movavg": {
                  "value": 2.3333333333333335
               }
            },
            {
               "key_as_string": "2014-12-05T00:00:00.000Z",
               "key": 1417737600000,
               "doc_count": 1,
               "the_sum": {
                  "value": 5,
                  "value_as_string": "5.0"
               },
               "the_movavg": {
                  "value": 3.6666666666666665
               }
            },
            {
               "key_as_string": "2014-12-08T00:00:00.000Z",
               "key": 1417996800000,
               "doc_count": 1,
               "the_sum": {
                  "value": 8,
                  "value_as_string": "8.0"
               },
               "the_movavg": {
                  "value": 5.666666666666667
               }
            },
            {
               "key_as_string": "2014-12-09T00:00:00.000Z",
               "key": 1418083200000,
               "doc_count": 1,
               "the_sum": {
                  "value": 9,
                  "value_as_string": "9.0"
               },
               "the_movavg": {
                  "value": 7.333333333333333
               }
            }
         ]
      }
   }
}
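
As a rough cross-check of the sample response, the sketch below recomputes the_movavg from the the_sum values with a window of 3. It assumes a plain arithmetic mean over the most recent non-empty buckets, and it models the missing days (Dec 3, 6 and 7) as null entries that are skipped, which is only one possible reading of the ignore gap policy. The class and method names are made up for illustration. Under those assumptions the numbers agree with the response, so the sample looks like it was produced with simple weighting rather than the single_exp shown in the request.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical cross-check of the sample response; not code from this PR. */
final class ResponseCheck {

    /** A null input marks an empty bucket; it is skipped and gets no output value. */
    static Double[] movingAverage(Double[] values, int window) {
        Deque<Double> recent = new ArrayDeque<>();
        Double[] out = new Double[values.length];
        double sum = 0;
        for (int i = 0; i < values.length; i++) {
            if (values[i] == null) {
                continue; // assumed "ignore" behaviour: the gap does not enter the window
            }
            recent.addLast(values[i]);
            sum += values[i];
            if (recent.size() > window) {
                sum -= recent.removeFirst(); // slide the window forward
            }
            out[i] = sum / recent.size(); // partial windows use whatever is available
        }
        return out;
    }

    public static void main(String[] args) {
        // the_sum for Dec 1 through Dec 9, with null for days that have no bucket
        Double[] sums = {1.0, 2.0, null, 4.0, 5.0, null, null, 8.0, 9.0};
        for (Double avg : movingAverage(sums, 3)) {
            if (avg != null) {
                System.out.println(avg);
            }
        }
    }
}
```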

Closes #10002

Allows the user to calculate a Moving Average over a histogram
of buckets.  Provides four different moving averages:

 - Simple
 - Linear weighted
 - Single Exponentially weighted (aka EWMA)
 - Double Exponentially weighted (aka Holt-winters)

return this;
}

public MovAvgBuilder weightingType(Weighting weightingType) {
Contributor

Does it make sense to have the Enum and method name the same? I have no preference as to whether we call it Weighting or WeightingType

Contributor Author

Ah, cool...wasn't sure if that was kosher or not. Will change everything to Weighting!

@colings86
Contributor

@polyfractal looks good so far. Left some comments, mostly around making the Weighting use a common interface with each weighting as a sub-class which might help make it easier to plug in new weighting types

@polyfractal
Contributor Author

Awesome, thanks @colings86, all seemed reasonable but I had a question about the Weighting builder stuff. I'll fix up the other stuff right now.

Lots of extra boilerplate, but mostly around registering streams and parsers.  Will make
future implementations very simple to build, since they just need to subclass
MovAvgModel and provide some code for serialization and parsing.

Also allows each model to manage their own settings configuration, which is a lot cleaner.

@polyfractal
Contributor Author

@colings86 @clintongormley Ok, so I refactored everything as a compromise between generic API and specific internal structure. I think it looks good now.

API looks the same as before (settings hash, optional):

"movavg": {
  "bucketsPath": "the_sum",
  "weighting" : "double_exp",
  "window": 5,
  "settings": {
    "alpha": 0.8   
  }
}

Internally, it uses an architecture similar to the SignificantHeuristics: parsers are stored in a central map, the relevant parser is retrieved once we are done parsing the request, and a MovAvgModel object is constructed and given to the aggregation. Each model is responsible for its own construction, custom settings, serialization, etc.

The only difference is that we extract a Map<> for settings (if it exists), instead of using a more specialized XContent parser. I think this is a fair tradeoff.
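
The lookup described above can be sketched roughly as follows. This is an illustration of the pattern only, under the assumption that each parser receives the optional settings Map and returns a ready model. Model, ModelParser and ModelRegistrySketch are hypothetical stand-ins, not the classes in this PR, and serialization is left out.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a name-to-parser registry; none of these types are the PR's classes. */
final class ModelRegistrySketch {

    /** Stand-in for a model: turns one window (oldest first) into a smoothed value. */
    interface Model {
        double next(double[] window);
    }

    /** Stand-in for a per-model parser that reads its own keys from the settings map. */
    interface ModelParser {
        Model parse(Map<String, Object> settings);
    }

    private final Map<String, ModelParser> parsers = new HashMap<>();

    void register(String name, ModelParser parser) {
        parsers.put(name, parser);
    }

    /** Called once the request has been read: look up the parser, then build the model. */
    Model build(String weighting, Map<String, Object> settings) {
        ModelParser parser = parsers.get(weighting);
        if (parser == null) {
            throw new IllegalArgumentException("Unknown weighting [" + weighting + "]");
        }
        Map<String, Object> effective =
                settings != null ? settings : Collections.<String, Object>emptyMap();
        return parser.parse(effective);
    }

    static ModelRegistrySketch withDefaults() {
        ModelRegistrySketch registry = new ModelRegistrySketch();
        registry.register("simple", settings -> window -> {
            double sum = 0;
            for (double v : window) {
                sum += v;
            }
            return sum / window.length;
        });
        registry.register("single_exp", settings -> {
            Object raw = settings.get("alpha");
            double alpha = raw == null ? 0.5 : ((Number) raw).doubleValue(); // default from the PR text
            return window -> {
                double s = window[0];
                for (int i = 1; i < window.length; i++) {
                    s = alpha * window[i] + (1 - alpha) * s;
                }
                return s;
            };
        });
        return registry;
    }

    public static void main(String[] args) {
        Map<String, Object> settings = new HashMap<>();
        settings.put("alpha", 0.8);
        Model model = withDefaults().build("single_exp", settings);
        System.out.println(model.next(new double[] {1, 2, 4}));
    }
}
```

Adding a new model in this arrangement means registering one more parser; the code that reads the request does not need to change.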

bucketsPath needs fixing, but that's actually a problem with Reducer.Parser. I'll fix that separately.

@colings86
Contributor

@polyfractal I like the new changes, but I wonder if we should have a MovAvgModelBuilder where you can set the settings for a particular model, and then we call settings() to get the settings object to add to the movavg reducer? The builder should also get a type() method so we can use the builder to populate the weighting field too.

@polyfractal
Contributor Author

@colings86 Newbie java dev question: what does a builder buy us? It doesn't seem necessary from a REST point of view (still need to parse a Map), and in java-land users can just make a new SingleExpMovAvgModel(alpha), etc? Is it just because we have the builder convention elsewhere?

Just seems like even more boilerplate for such a minor thing :)

@colings86
Contributor

@polyfractal It allows someone using the Java API to have concrete methods describing the options available for each model, rather than having to know which keys are valid for the settings map. At the moment the MovAvgBuilder takes the weight and settings separately, and as far as I can see there is no way to pass a MovAvgModel into it? When #10217 is complete we should have MovAvgModel objects which have builder and parser methods as well as being POJOs that we can pass around in the Java API. To avoid creating the MovAvgModelBuilder classes you could pre-empt this by adding the builder methods to the MovAvgModel classes now.
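
A minimal sketch of what such a builder might look like, assuming the type() and settings() methods suggested earlier in this thread. The interface and class names here are hypothetical, and the defaults of 0.5 are the ones stated in the PR description.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of a per-model builder; not the classes added by this PR. */
final class BuilderSketch {

    interface ModelBuilder {
        /** Value for the "weighting" field of the request. */
        String type();

        /** Contents of the "settings" object of the request. */
        Map<String, Object> settings();
    }

    static final class DoubleExpBuilder implements ModelBuilder {
        private double alpha = 0.5;
        private double beta = 0.5;

        /** Concrete setter, so callers do not need to know the "alpha" key. */
        DoubleExpBuilder alpha(double alpha) {
            this.alpha = alpha;
            return this;
        }

        DoubleExpBuilder beta(double beta) {
            this.beta = beta;
            return this;
        }

        @Override
        public String type() {
            return "double_exp";
        }

        @Override
        public Map<String, Object> settings() {
            Map<String, Object> settings = new LinkedHashMap<>();
            settings.put("alpha", alpha);
            settings.put("beta", beta);
            return settings;
        }
    }

    public static void main(String[] args) {
        ModelBuilder model = new DoubleExpBuilder().alpha(0.8);
        System.out.println(model.type() + " " + model.settings());
    }
}
```

The point of the design is discoverability: a typo in a map key fails silently or at request time, whereas a typo in a method name fails at compile time.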

@polyfractal
Contributor Author

@colings86 Ah, that makes sense...thanks for the explanation. I'll fix it up on my flight home, will ping you when done :)

Better explains what is being done, and allows extensibility in the future (e.g. wavelets can be used
to smooth, not just moving average)

@polyfractal
Contributor Author

  • Renamed to smooth. I think this is better representative of what the agg does (it "smooths" your data). It also allows us to add other models in the future, like ARMA, which aren't moving averages
  • Merged upstream changes
  • Added builders. I'm not sure if the way I did it is kosher...will have to chat when you get back.

@rjernst
Member

rjernst commented Mar 26, 2015

> Renamed to smooth. I think this is better representative of what the agg does (it "smooths" your data). It also allows us to add other models in the future, like ARMA, which aren't moving averages

Won't the different models have different parameters potentially? The name smooth seems quite arbitrary to me, while moving_avg just "makes sense" and is easily found with a google search. I'm not saying we shouldn't use smooth, I just want to make sure we aren't choosing a name too generic, just so we can potentially add other models in the future, but perhaps at the cost of confusion early on.

@polyfractal
Contributor Author

> Won't the different models have different parameters potentially?

@rjernst Yeah, different models will likely have different parameters. The current setup is that you specify which model you want to use, then pass in a generic settings hash which is specific to that model. I know I would like to (near-to-medium term) add ARMA, ARIMA, Wavelet and Savitzky-Golay filters for smoothing. There are dozens of others we could add too.

> The name smooth seems quite arbitrary to me, while moving_avg just "makes sense" and is easily found with a google search. I'm not saying we shouldn't use smooth, I just want to make sure we aren't choosing a name too generic, just so we can potentially add other models in the future, but perhaps at the cost of confusion early on.

This is my current thinking...please feel free to poke holes in it. :)

I'm thinking that we should group functionality by "usage" whenever possible, since we shouldn't expect the average user to know how all the specific parts work. Obviously some will be experts, but I think most will be new to time series analysis.

If we clump functionality by usage, a user can sit down and say "I would like to smooth my data", pull out the smoothing agg, and start fiddling with different models and parameters. Similarly, they should be able to pull out an outlier agg and try different models/params to find anomalies, or a predict agg and start forecasting.

Importantly, an "outlier" agg and "prediction" agg will also support some of the same models as smoothing. There is a lot of overlap between the three...but not full overlap, which is really the problem.

The alternative is that users will need to navigate a large set of aggs that inconsistently support functionality. E.g. this table illustrates the problem:

Smooth Predict Outlier
movavg X sorta sorta
ARMA / ARIMA X X X
Wavelet X X
SG filter X X
High / low / bandpass filter X
Regression X X
Thresholding X
SVM, ANN, etc X X

The other concern is that if each agg supports several "modes", the output from that agg will depend on which options you have toggled. That might be irritating if toggling a new param changes the output (e.g. predict will start adding new buckets, outlier will add a new "was_outlier" field or something)

Basically, the goal was to prevent an explosion of small aggs that each do one highly technical thing...leaving a lot of users in the dust because they've never done time series analysis before.

There is also a bunch of extra stuff that can't be grouped easily, like autocorrelation, changepoint, statistical testing, etc.

(Oh goodness, that turned into an essay. Sorry :( )

@rjernst
Member

rjernst commented Mar 27, 2015

@polyfractal That totally makes sense. +1 for smooth :)

@polyfractal
Contributor Author

@rjernst Urgh, second guessing myself now...thoughts?

I agree there is concern about making things toooo generic. There is also something nice about having an agg serve a single purpose, with one well-defined set of params. Maybe it's a pain to have one agg with like 10 different modes?

Maybe we could do a hybrid approach? I was thinking about it last night and realized that most functionality has an innate purpose, but can be used for prediction or outlier based on that functionality. E.g. a moving average smooths the data, but can find outliers by comparing a value against the smoothed average.

So we could do:

  • Aggs are individual for their innate purpose: movavg, wavelet, regression, etc
  • Aggs are bundled for predict, outlier and changepoint, since those tend to use the features of the individual components plus some extra logic

I think I'm still leaning towards the prior option, but wanted to write this down. Not sure, just braindumping now :)
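
To illustrate the idea mentioned above of finding outliers by comparing a value against the smoothed average, here is a small sketch. It assumes a simple average of the preceding values and a fixed absolute threshold, both of which are arbitrary choices made for the example; nothing like this is part of the PR, and the names are hypothetical.

```java
/** Hypothetical sketch: flag values that stray too far from the trailing average. */
final class OutlierSketch {

    /**
     * flagged[i] is true when values[i] differs from the mean of up to `window`
     * preceding values by more than `threshold`. The first value is never flagged.
     */
    static boolean[] flag(double[] values, int window, double threshold) {
        boolean[] flagged = new boolean[values.length];
        for (int i = 1; i < values.length; i++) {
            int from = Math.max(0, i - window);
            double sum = 0;
            for (int j = from; j < i; j++) {
                sum += values[j];
            }
            double average = sum / (i - from);
            flagged[i] = Math.abs(values[i] - average) > threshold;
        }
        return flagged;
    }

    public static void main(String[] args) {
        double[] values = {10, 11, 10, 50, 11};
        boolean[] flagged = flag(values, 3, 15);
        for (int i = 0; i < values.length; i++) {
            System.out.println(values[i] + " outlier=" + flagged[i]);
        }
    }
}
```

Note that the spike itself enters the window for the following bucket and pulls the average up, so a lower threshold would also flag the ordinary value right after it. That kind of extra logic is part of why a dedicated outlier agg may need more than the plain moving average.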

@polyfractal
Contributor Author

So...after a lot of discussion we decided to revert the naming change and go back to moving_avg. We did this for a few reasons:

  • Simpler is better to start, since we aren't quite sure how users will be using the new functionality. Single-function aggs are simpler, and people are already accustomed to looking for a "moving average"...whereas a "smoothing" agg might confuse them
  • Since each model has very different settings, using a single generic settings hash is not ideal (even though it's validated).
  • We can re-expose these individual aggs through a "sugar" agg later if we see people needing some guidance when getting started. Similar to how the match query is essentially a smart wrapper for term, phrase, phrase_prefix, etc.
  • At least for prediction, the code complexity is not large on a per-agg basis. Outlier will likely need a bundled agg to start, since it will be considerably more complex.
  • Individual aggs also let us throw a lot of functionality at the wall, and see what sticks with users. We can prune or reorganize into sugar as required later

/cc @rjernst

import java.io.IOException;


public interface MovAvgModelBuilder {
Contributor

This should probably have some JavaDocs since it forms part of the Java API. Also, should this not implement toXContent?

@colings86
Contributor

@polyfractal left a comment but I think it's pretty close now

polyfractal added a commit that referenced this pull request Apr 8, 2015
Allows the user to calculate a Moving Average over a histogram  of buckets.  Provides four different
moving averages:
 - Simple
 - Linear weighted
 - Single Exponentially weighted (aka EWMA)
 - Double Exponentially weighted (aka Holt-winters)

Closes #10024

@polyfractal
Contributor Author

Closed via a squash-merge in a824184, because I'm terrible at Git and messed up the merge process :(
