Aggregations: Add moving average aggregation #10024

polyfractal · 2015-03-06T22:23:37Z

Adds an aggregation to calculate the moving average of sibling metrics in histogram-style data (histogram, date_histogram). Moving averages are useful when time series data is locally stationary and has a mean that changes slowly over time.

Seasonal data may need a different analysis, as may data that is bimodal, "bursty", or contains frequent extreme values (which are not necessarily outliers).

Request

GET /test/_search?search_type=count
{
   "aggs": {
      "my_date_histo": {
         "date_histogram": {
            "field": "@timestamp",
            "interval": "day"
         },
         "aggs": {
            "the_sum": {
               "sum": {
                  "field": "price"
               }
            },
            "the_movavg": {
               "movavg": {
                  "bucketsPath": "the_sum",
                  "window": 3,
                  "weighting" : "single_exp",
                  "gap_policy" : "ignore",
                  "settings" : {
                    "alpha" : 0.5
                  }
               }
            }
         }
      }
   }
}

`bucketsPath` (Required)

Path to the buckets/metric values to calculate the moving average over.

`window` (Optional - default 5)

The user specifies the window size they wish to calculate a moving average for. E.g. a user may want a 30-day sliding window over a histogram of 90 days total.

Currently, if there is not enough data to "fill" the window, the moving average will be calculated with whatever is available. For example, if a user selects a 30-day window, days 1-29 will calculate the moving average with between 1 and 29 days of data.

We could investigate adding more "edge policies", which determine how to handle gaps at the edge of the moving average.
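A minimal sketch of the partial-window behaviour described above, assuming a simple (unweighted) average and assuming each bucket's own value is included in its window. The class and method names are hypothetical and not taken from the PR. Fed the `the_sum` values from the Response section (1, 2, 4, 5, 8, 9) with a window of 3, it produces the same numbers as the `the_movavg` values listed there.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration only; not the reducer code from this PR.
class PartialWindowSketch {

    /**
     * Simple moving average over a sliding window. Until the window is
     * full, the average uses however many values are available so far.
     */
    static double[] simpleMovingAverage(double[] values, int window) {
        double[] out = new double[values.length];
        Deque<Double> buffer = new ArrayDeque<>();
        double sum = 0;
        for (int i = 0; i < values.length; i++) {
            buffer.addLast(values[i]);
            sum += values[i];
            if (buffer.size() > window) {
                // Window is over capacity: drop the oldest value.
                sum -= buffer.removeFirst();
            }
            out[i] = sum / buffer.size();
        }
        return out;
    }

    public static void main(String[] args) {
        double[] sums = {1, 2, 4, 5, 8, 9};
        for (double avg : simpleMovingAverage(sums, 3)) {
            System.out.println(avg);
        }
    }
}
```

The first two outputs are averages over one and two values respectively, which is the "whatever is available" case.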

`weighting` (Optional - default `simple`)

Currently, the agg supports four types of weighting:

simple: A simple (arithmetic) average. Default.
linear: A linearly weighted average, such that data becomes linearly less important as it gets "older" in the window
single_exp: Single exponentially weighted average (aka EWMA or Brown's Simple Exp Smoothing), such that data becomes exponentially less important as it gets "older".
double_exp: Double exponentially weighted average (aka Holt-Winters). Uses two exponential terms: first smooth data exponentially like single_exp, but then apply second corrective smoothing to account for a trend.
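For readers who want to see the four weightings side by side, here is a rough sketch using common textbook formulations. It is not the PR's implementation: the class name, the seeding choices (first value as the initial level, first difference as the initial trend) and the use of Holt's linear-trend form for `double_exp` are all assumptions made for illustration. Each method takes one window ordered oldest to newest and assumes it is non-empty.

```java
// Hypothetical illustration of the four weightings; not code from this PR.
class WeightingSketch {

    /** simple: arithmetic mean of the window. */
    static double simple(double[] w) {
        double sum = 0;
        for (double v : w) {
            sum += v;
        }
        return sum / w.length;
    }

    /** linear: the oldest value has weight 1, the newest has weight n. */
    static double linear(double[] w) {
        double weighted = 0;
        double totalWeight = 0;
        for (int i = 0; i < w.length; i++) {
            weighted += (i + 1) * w[i];
            totalWeight += i + 1;
        }
        return weighted / totalWeight;
    }

    /** single_exp: exponential smoothing, seeded with the oldest value (an assumption). */
    static double singleExp(double[] w, double alpha) {
        double s = w[0];
        for (int i = 1; i < w.length; i++) {
            s = alpha * w[i] + (1 - alpha) * s;
        }
        return s;
    }

    /** double_exp: level plus trend smoothing (Holt's linear form, assumed); returns the final level. */
    static double doubleExp(double[] w, double alpha, double beta) {
        if (w.length < 2) {
            return w[0]; // no trend can be estimated from one value
        }
        double level = w[0];
        double trend = w[1] - w[0];
        for (int i = 1; i < w.length; i++) {
            double previousLevel = level;
            level = alpha * w[i] + (1 - alpha) * (previousLevel + trend);
            trend = beta * (level - previousLevel) + (1 - beta) * trend;
        }
        return level;
    }

    public static void main(String[] args) {
        double[] window = {1, 2, 4};
        System.out.println("simple     = " + simple(window));
        System.out.println("linear     = " + linear(window));
        System.out.println("single_exp = " + singleExp(window, 0.5));
        System.out.println("double_exp = " + doubleExp(window, 0.5, 0.5));
    }
}
```

On a rising window the weighted variants sit closer to the newest value than the simple mean does, which is the "older data matters less" effect described in the list.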

`gap_policy` (Optional - default `ignore`)

Determines the policy for handling gaps in the data. Default is to ignore gaps.

`settings` (Optional)

Extra settings which apply to individual weighting types.

alpha can be set for single_exp. Defaults to 0.5.
alpha and beta can be set for double_exp. Both default to 0.5.

Response

{
   "took": 3,
   "timed_out": false,
   "aggregations": {
      "my_date_histo": {
         "buckets": [
            {
               "key_as_string": "2014-12-01T00:00:00.000Z",
               "key": 1417392000000,
               "doc_count": 1,
               "the_sum": {
                  "value": 1,
                  "value_as_string": "1.0"
               },
               "the_movavg": {
                  "value": 1
               }
            },
            {
               "key_as_string": "2014-12-02T00:00:00.000Z",
               "key": 1417478400000,
               "doc_count": 1,
               "the_sum": {
                  "value": 2,
                  "value_as_string": "2.0"
               },
               "the_movavg": {
                  "value": 1.5
               }
            },
            {
               "key_as_string": "2014-12-04T00:00:00.000Z",
               "key": 1417651200000,
               "doc_count": 1,
               "the_sum": {
                  "value": 4,
                  "value_as_string": "4.0"
               },
               "the_movavg": {
                  "value": 2.3333333333333335
               }
            },
            {
               "key_as_string": "2014-12-05T00:00:00.000Z",
               "key": 1417737600000,
               "doc_count": 1,
               "the_sum": {
                  "value": 5,
                  "value_as_string": "5.0"
               },
               "the_movavg": {
                  "value": 3.6666666666666665
               }
            },
            {
               "key_as_string": "2014-12-08T00:00:00.000Z",
               "key": 1417996800000,
               "doc_count": 1,
               "the_sum": {
                  "value": 8,
                  "value_as_string": "8.0"
               },
               "the_movavg": {
                  "value": 5.666666666666667
               }
            },
            {
               "key_as_string": "2014-12-09T00:00:00.000Z",
               "key": 1418083200000,
               "doc_count": 1,
               "the_sum": {
                  "value": 9,
                  "value_as_string": "9.0"
               },
               "the_movavg": {
                  "value": 7.333333333333333
               }
            }
         ]
      }
   }
}

Closes #10002

colings86 · 2015-03-17T19:58:47Z

src/main/java/org/elasticsearch/search/aggregations/reducers/movavg/MovAvgBuilder.java

+        return this;
+    }
+
+    public MovAvgBuilder weightingType(Weighting weightingType) {


Does it make sense to have the Enum and method name the same? I have no preference as to whether we call it Weighting or WeightingType

Ah, cool...wasn't sure if that was kosher or not. Will change everything to Weighting!

colings86 · 2015-03-17T20:11:01Z

@polyfractal looks good so far. Left some comments, mostly around making the Weighting use a common interface with each weighting as a sub-class which might help make it easier to plug in new weighting types

polyfractal · 2015-03-18T15:10:19Z

Awesome, thanks @colings86, all seemed reasonable but I had a question about the Weighting builder stuff. I'll fix up the other stuff right now.

Lots of extra boilerplate, but mostly around registering streams and parsers. Will make future implementations very simple to build, since they just need to subclass MovAvgModel and provide some code for serialization and parsing. Also allows each model to manage their own settings configuration, which is a lot cleaner.

polyfractal · 2015-03-20T16:58:04Z

@colings86 @clintongormley Ok, so I refactored everything as a compromise between generic API and specific internal structure. I think it looks good now.

API looks the same as before (settings hash, optional):

"movavg": {
  "bucketsPath": "the_sum",
  "weighting" : "double_exp",
  "window": 5,
  "settings": {
    "alpha": 0.8
  }
}

Internally, it uses a similar architecture to the SignificantHeuristics: parsers are stored in a central map, the relevant parser is retrieved after we are done parsing the request, and a MovAvgModel object is constructed and given to the aggregation. All models are responsible for their own construction, custom settings, serialization, etc.

The only difference is that we extract a Map<> for settings (if it exists), instead of using a more specialized XContent parser. I think this is a fair tradeoff.
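A rough sketch of the registry idea described above, under stated assumptions: the `Model` interface, the `PARSERS` map and the `lookup` helper are hypothetical stand-ins, not the actual MovAvgModel or parser classes from this PR. It only shows the shape of the flow, where the weighting name selects a parser from a central map and that parser reads what it needs from the generic settings map.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration of the "central map of parsers" layout.
class ModelRegistrySketch {

    /** Stand-in for a model: reduces one window of values to one average. */
    interface Model {
        double next(double[] window);
    }

    // Central map from the weighting name to a factory that builds a
    // model from the (possibly empty) settings map.
    static final Map<String, Function<Map<String, Object>, Model>> PARSERS = new HashMap<>();

    static {
        PARSERS.put("simple", settings -> window -> {
            double sum = 0;
            for (double v : window) {
                sum += v;
            }
            return sum / window.length;
        });
        PARSERS.put("single_exp", settings -> {
            // Each model pulls its own settings out of the map.
            final double alpha = number(settings, "alpha", 0.5);
            return window -> {
                double s = window[0];
                for (int i = 1; i < window.length; i++) {
                    s = alpha * window[i] + (1 - alpha) * s;
                }
                return s;
            };
        });
    }

    static double number(Map<String, Object> settings, String key, double defaultValue) {
        Object raw = settings.get(key);
        return raw == null ? defaultValue : ((Number) raw).doubleValue();
    }

    /** Retrieves the parser for the requested weighting and builds the model. */
    static Model lookup(String weighting, Map<String, Object> settings) {
        Function<Map<String, Object>, Model> parser = PARSERS.get(weighting);
        if (parser == null) {
            throw new IllegalArgumentException("Unknown weighting [" + weighting + "]");
        }
        Map<String, Object> safe = settings == null
                ? Collections.<String, Object>emptyMap()
                : settings;
        return parser.apply(safe);
    }

    public static void main(String[] args) {
        Map<String, Object> settings = new HashMap<>();
        settings.put("alpha", 0.8);
        Model model = lookup("single_exp", settings);
        System.out.println(model.next(new double[]{1, 2, 4}));
    }
}
```

The tradeoff mentioned above shows up in `number(...)`: working from a plain map keeps the parser small, but key names and value types are only checked when a model reads them.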

bucketsPath needs fixing, but that's actually a problem with Reducer.Parser. I'll fix that separately.

colings86 · 2015-03-23T06:51:24Z

@polyfractal I like the new changes, but I wonder if we should have a MovAvgModelBuilder where you can set the settings for a particular model, and then we call settings() to get the settings object to add to the movavg reducer? The builder should also get a type() method so we can use the builder to populate the weighting field too.

polyfractal · 2015-03-23T18:53:13Z

@colings86 Newbie java dev question: what does a builder buy us? It doesn't seem necessary from a REST point of view (still need to parse a Map), and in java-land users can just make a new SingleExpMovAvgModel(alpha), etc? Is it just because we have the builder convention elsewhere?

Just seems like even more boilerplate for such a minor thing :)

colings86 · 2015-03-24T09:57:17Z

@polyfractal It allows someone using the Java API to have concrete methods describing the options available for each model, rather than having to know what keys are valid for the settings map. At the moment the MovAvgBuilder takes the weight and settings separately, and as far as I can see there is no way to pass a MovAvgModel into it? When #10217 is complete we should have MovAvgModel objects which have builder and parser methods as well as being POJOs which we can pass around in the Java API. To avoid creating the MovAvgModelBuilder classes you could pre-empt this by adding the builder methods to the MovAvgModel classes now.
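To make the argument concrete, here is a small sketch of what such a builder could look like. The `type()` and `settings()` methods follow the suggestion made earlier in this thread. The interface and class names, the fluent `alpha`/`beta` setters and the use of a plain `Map` are assumptions for illustration, not the classes that were eventually committed.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of a per-model builder for the Java API.
class ModelBuilderSketch {

    /** What the movavg builder would need from any model builder. */
    interface ModelBuilder {
        String type();                    // value for the "weighting" field
        Map<String, Object> settings();   // value for the "settings" field
    }

    /** Concrete setters document which options double_exp accepts. */
    static class DoubleExpBuilder implements ModelBuilder {
        private double alpha = 0.5;
        private double beta = 0.5;

        DoubleExpBuilder alpha(double alpha) {
            this.alpha = alpha;
            return this;
        }

        DoubleExpBuilder beta(double beta) {
            this.beta = beta;
            return this;
        }

        @Override
        public String type() {
            return "double_exp";
        }

        @Override
        public Map<String, Object> settings() {
            Map<String, Object> settings = new LinkedHashMap<>();
            settings.put("alpha", alpha);
            settings.put("beta", beta);
            return settings;
        }
    }

    public static void main(String[] args) {
        ModelBuilder builder = new DoubleExpBuilder().alpha(0.8);
        System.out.println(builder.type() + " " + builder.settings());
    }
}
```

A Java caller never has to spell the settings keys by hand, so a typo such as `"alhpa"` becomes a compile error instead of a silently ignored setting.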

polyfractal · 2015-03-24T13:02:57Z

@colings86 Ah, that makes sense...thanks for the explanation. I'll fix it up on my flight home, will ping you when done :)

polyfractal · 2015-03-26T21:45:20Z

Renamed to smooth. I think this is better representative of what the agg does (it "smooths" your data). It also allows us to add other models in the future, like ARMA, which aren't moving averages
Merged upstream changes
Added builders. I'm not sure if the way I did it is kosher...will have to chat when you get back.

rjernst · 2015-03-26T23:02:49Z

> Renamed to smooth. I think this is better representative of what the agg does (it "smooths" your data). It also allows us to add other models in the future, like ARMA, which aren't moving averages

Won't the different models have different parameters potentially? The name smooth seems quite arbitrary to me, while moving_avg just "makes sense" and is easily found with a google search. I'm not saying we shouldn't use smooth, I just want to make sure we aren't choosing a name too generic, just so we can potentially add other models in the future, but perhaps at the cost of confusion early on.

polyfractal · 2015-03-27T01:52:11Z

> Won't the different models have different parameters potentially?

@rjernst Yeah, different models will have different parameters. The current setup is that you specify which model you want to use, then pass in a generic settings hash which is specific to that model. I know I would like to (near-to-medium term) add ARMA, ARIMA, Wavelet and Savitzky-Golay filters for smoothing. There are dozens of others we could add too.

> The name smooth seems quite arbitrary to me, while moving_avg just "makes sense" and is easily found with a google search. I'm not saying we shouldn't use smooth, I just want to make sure we aren't choosing a name too generic, just so we can potentially add other models in the future, but perhaps at the cost of confusion early on.

This is my current thinking...please feel free to poke holes in it. :)

I'm thinking that we should group functionality by "usage" whenever possible, since we shouldn't expect the average user to know how all the specific parts work. Obviously some will be experts, but I think most will be new to time series analysis.

If we clump functionality by usage, a user can sit down and say "I would like to smooth my data", pull out the smoothing agg, and start fiddling with different models and parameters. Similarly, they should be able to pull out an outlier agg and try different models/params to find anomalies, or a predict agg and start forecasting.

Importantly, an "outlier" agg and "prediction" agg will also support some of the same models as smoothing. There is a lot of overlap between the three...but not full overlap, which is really the problem.

The alternative is that users will need to navigate a large set of aggs that inconsistently support functionality. E.g. this table illustrates the problem:

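| | Smooth | Predict | Outlier |
|---|---|---|---|
| movavg | X | sorta | sorta |
| ARMA / ARIMA | X | X | X |
| Wavelet | X | | X |
| SG filter | X | | X |
| High / low / bandpass filter | X | | |
| Regression | | X | X |
| Thresholding | | | X |
| SVM, ANN, etc | | X | X |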

The other concern is that if each agg supports several "modes", the output from that agg will depend on which options you have toggled. That might be irritating if toggling a new param changes the output (e.g. predict will start adding new buckets, outlier will add a new "was_outlier" field or something)

Basically, the goal was to prevent an explosion of small aggs that each do one highly technical thing...leaving a lot of users in the dust because they've never done time series analysis before.

There is also a bunch of extra stuff that can't be grouped easily, like autocorrelation, changepoint, statistical testing, etc.

(Oh goodness, that turned into an essay. Sorry :( )

rjernst · 2015-03-27T02:45:12Z

@polyfractal That totally makes sense. +1 for smooth :)

polyfractal · 2015-03-27T16:04:36Z

@rjernst Urgh, second guessing myself now...thoughts?

I agree there is concern about making things toooo generic. There is also something nice about having an agg represent a single purpose, with one well-defined set of params. Maybe it's a pain to have one agg with like 10 different modes?

Maybe we could do a hybrid approach? I was thinking about it last night and realized that most functionality has an innate purpose, but can be used for prediction or outlier based on that functionality. E.g. a moving average smooths the data, but can find outliers by comparing a value against the smoothed average.

So we could do:

Aggs are individual for their innate purpose: movavg, wavelet, regression, etc
Aggs are bundled for predict, outlier and changepoint, since those tend to use the features of the individual components plus some extra logic

I think I'm still leaning towards the prior option, but wanted to write this down. Not sure, just braindumping now :)

polyfractal · 2015-03-30T15:22:02Z

So...after a lot of discussion we decided to revert the naming change and go back to moving_avg. We did this for a few reasons:

Simpler is better to start, since we aren't quite sure how users will be using the new functionality. Single-function aggs are simpler, and people are already accustomed to looking for a "moving average"...whereas a "smoothing" agg might confuse them
Since each model has very different settings, using a single generic settings hash is not ideal (even though it's validated).
We can re-expose these individual aggs through a "sugar" agg later if we see people needing some guidance when getting started. Similar to how the match query is essentially a smart wrapper for term, phrase, phrase_prefix, etc.
At least for prediction, the code complexity is not large on a per-agg basis. Outlier will likely need a bundled agg to start, since it will be considerably more complex.
Individual aggs also let us throw a lot of functionality at the wall, and see what sticks with users. We can prune or reorganize into sugar as required later

/cc @rjernst

colings86 · 2015-04-07T10:00:03Z

...in/java/org/elasticsearch/search/aggregations/reducers/movavg/models/MovAvgModelBuilder.java

+import java.io.IOException;
+
+
+public interface MovAvgModelBuilder {


This should probably have some JavaDocs since it forms part of the Java API. Also, should this not implement toXContent?

colings86 · 2015-04-07T10:13:14Z

@polyfractal left a comment but I think it's pretty close now

Allows the user to calculate a Moving Average over a histogram of buckets. Provides four different moving averages:
- Simple
- Linear weighted
- Single Exponentially weighted (aka EWMA)
- Double Exponentially weighted (aka Holt-winters)

Closes #10024

polyfractal · 2015-04-08T14:46:31Z

Closed via a squash-merge in a824184, because I'm terrible at Git and messed up the merge process :(

polyfractal added 7 commits March 5, 2015 15:09

Add MovAvg Reducer

29a96ce

Allows the user to calculate a Moving Average over a histogram of buckets. Provides four different moving averages:
- Simple
- Linear weighted
- Single Exponentially weighted (aka EWMA)
- Double Exponentially weighted (aka Holt-winters)

Handle null values when calculating moving avg

20e6353

Add some simple Tests

ccfa3fc

[TESTS] add empty buckets, randomize gap policy

a7132e8

Expose per-weight settings

2950780

Annotations and javadocs

8b190ca

Throw exception if unexpected JSON object found

57bba5a

polyfractal added >feature :Analytics/Aggregations Aggregations labels Mar 6, 2015

Add header to MovAvgModel

c4c3d05

polyfractal added the review label Mar 6, 2015

colings86 reviewed Mar 17, 2015

polyfractal added 4 commits March 18, 2015 11:22

Rename weightingType to weighting

679f9e0

Window should be an int

a86ae43

[TESTS] test metric values in addition to _counts

f941fcc

s1monw assigned colings86 Mar 24, 2015

polyfractal added 3 commits March 26, 2015 15:19

Merge branch 'feature/aggs_2_0' into feature/aggs_2_0_movavg

cd5a63d

Fix Reducer to comply with new factory changes, also add validation

7944f79

Rename MovAvg to Smooth

05226ea

Better explains what is being done, and allows extensibility in the future (e.g. wavelets can be used to smooth, not just moving average)

polyfractal added 2 commits March 26, 2015 17:18

Add static methods to help generate models in java-land

ab20e63

Just kidding...use real builders :)

c63cfd9

polyfractal added 2 commits March 27, 2015 13:37

Change weighting param to model

17f7979

Revert change to "smoothing". Stick with moving_avg for now.

4fd7157

colings86 reviewed Apr 7, 2015

Javadocs, headers, MovAvgModelBuilder should extend ToXContent

0fe0a23

polyfractal closed this Apr 8, 2015

polyfractal mentioned this pull request Apr 8, 2015

Aggregation to calculate the moving average on a histogram aggregation #10002

Closed

clintongormley removed the review label Aug 7, 2015

colings86 mentioned this pull request Aug 4, 2016

Should we remove/modify some of the experiment tags in the documentation #19798

Closed
