Aggregations: Add autocorrelation agg #10377
Conversation
colings86 and others added some commits on Mar 17, 2015
polyfractal added the >feature, v2.0.0-beta1, WIP, :Search/Aggregations labels on Apr 1, 2015
colings86 referenced this pull request on Apr 1, 2015
Closed: Add ability to perform computations on aggregations #9876
jpountz
Apr 1, 2015
Contributor
This is exciting. :) I am not familiar with the theory so please excuse me if my questions are silly.
window: size of time series to perform ACF on. If series is length n, the ACF will be performed on n - window .. n values. E.g. the most recent values. Optional, defaults to 5
Is there any particular reason to only apply the analysis to the last elements, could we apply it to all histogram buckets by default instead of the last 5 ones?
Unsure how much we can randomize these tests for that reason. Needs more thinking.
Not sure how applicable it is in your case but in such cases, I tend to like having both randomized tests and only check invariants in the output, and static tests when it comes to assessing that the algorithm actually works. Otherwise it can quickly become a nightmare to debug failures.
polyfractal
Apr 1, 2015
Member
Is there any particular reason to only apply the analysis to the last elements, could we apply it to all histogram buckets by default instead of the last 5 ones?
No particular reason, I was mostly just thinking about performance (e.g. if you accidentally ask for an autocorrelation of 100k points). Perhaps we should default it to everything, but provide window as an option if you don't want the complete autocorrelation history?
Practically speaking, ACF becomes less useful (I think) the farther back in time you go, and the higher-order lags accumulate more approximation error.
Not sure how applicable it is in your case but in such cases, I tend to like having both randomized tests and only check invariants in the output, and static tests when it comes to assessing that the algorithm actually works. Otherwise it can quickly become a nightmare to debug failures.
Ahh, this makes sense. I'll see what I can do to split the tests into those two categories.
jpountz
Apr 1, 2015
Contributor
Perhaps we should default it to everything, but provide window as an option if you don't want the complete autocorrelation history?
This makes sense to me. Since you mentioned performance, this got me curious: what is the runtime complexity of this reduction, and do you know, e.g., how much time it takes in practice to process N data points?
polyfractal
Apr 1, 2015
Member
This makes sense to me. Since you mentioned performance, this got me curious: what is the runtime complexity of this reduction and do you know eg. how much time does it take in practice to process N data points?
I have no idea :D Real benchmarks are at the top of my to-do list...I'm curious to see where this breaks.
Classical radix-2 FFTs have a complexity of O(n log n). I'm not sure what optimizations JTransforms is using; it may be better than that. JTransforms has some benchmark results which claim an FFT on 1m data points takes 10ms, and an FFT on 23m values takes 700ms. Timings are a bit slower if you include "construction" of the FFT plan (e.g. when you instantiate the object).
For the non-padded ACF: two FFTs, one O(n) loop over the data to compute magnitudes, and potentially an extra O(n) loop to normalize. Note the FFTs will be non-radix-2, so they may be slower.
For the padded ACF: four FFTs, two O(n) loops for magnitudes, and potentially an extra O(n) loop to normalize.
The brute-force, non-FFT ACF functions are O(n²).
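For intuition, the padded FFT-based ACF described here (zero-mean the data, zero-pad to a power of two at least 2n so the circular convolution becomes linear, forward FFT, power spectrum, inverse FFT, normalize by lag 0) can be sketched in a few lines. This is a hedged pure-Python illustration, not the JTransforms-backed implementation in the PR:

```python
import cmath

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT (len(x) must be a power of 2)."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    tw = [cmath.exp(-2j * cmath.pi * k / n) * odd[k] for k in range(n // 2)]
    return ([even[k] + tw[k] for k in range(n // 2)] +
            [even[k] - tw[k] for k in range(n // 2)])

def ifft(X):
    """Inverse FFT via the conjugation trick."""
    n = len(X)
    return [v.conjugate() / n for v in fft([c.conjugate() for c in X])]

def acf_fft(x):
    """Padded ACF: zero-mean, zero-pad to >= 2n, two FFT passes."""
    n = len(x)
    mean = sum(x) / n
    xc = [v - mean for v in x]
    m = 1
    while m < 2 * n:  # padding to >= 2n converts circular to linear convolution
        m *= 2
    X = fft([complex(v) for v in xc] + [0j] * (m - n))
    power = [v * v.conjugate() for v in X]          # |X|^2, the power spectrum
    r = ifft(power)
    return [r[k].real / r[0].real for k in range(n)]  # normalize by lag-0 variance

def acf_brute(x):
    """O(n^2) reference implementation for comparison."""
    n = len(x)
    mean = sum(x) / n
    xc = [v - mean for v in x]
    var = sum(v * v for v in xc)
    return [sum(xc[t] * xc[t + k] for t in range(n - k)) / var for k in range(n)]

data = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0, 3.0]
fast, slow = acf_fft(data), acf_brute(data)
assert all(abs(a - b) < 1e-9 for a, b in zip(fast, slow))
```

The assertion at the end checks that the padded FFT path reproduces the brute-force linear autocorrelation exactly (up to floating-point error), which is the point of the zero padding.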
s1monw assigned colings86 on Apr 2, 2015
colings86 closed this on May 19, 2015
clintongormley removed the v2.0.0-beta1 label on May 25, 2015
@colings86 why did you close this, was it merged?
colings86
May 25, 2015
Member
@clintongormley it auto-closed because the feature/aggs_2_0 branch got deleted (since it's no longer needed). @polyfractal said it's an old PR anyway and needs to be updated onto the current pipeline aggs, so it's probably okay for it to stay closed.
polyfractal
May 26, 2015
Member
Yep, this needs to be rebased against current master. I'll resubmit it soonish.
I'll pull tags from this PR so it doesn't confuse anyone.
polyfractal commented Apr 1, 2015
WIP, putting up for discussion.
Depends on the `SiblingReducer` functionality introduced in @colings86's "Max Aggregator" PR, so any changes in that PR will need to be reflected here. No need for a review yet; this is largely just to test the sibling functionality.
Autocorrelation
Autocorrelation shows the similarity between a time series and a "lagged" version of itself at different intervals of time. This can be used to determine if a signal has periodic elements hidden by noise. If there is a periodic element (repeating every `n` elements), there will be a peak in the autocorrelation every `n` lags. This is because the original time series will "line up" with the lagged version and display a high degree of similarity, even in the presence of noise.

As an example, this "Lemmings Population" is a very noisy sine wave with a 30-day period. If you squint hard enough, you can see the sine wave. The ACF of the series, however, clearly shows the periodic elements. The peaks are spaced ~27 days apart, which is very close to the actual 30-day period.
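The "peak every n lags" behavior can be demonstrated with a small, hedged sketch (plain Python, brute-force ACF; the series length, noise level, and random seed are made up for illustration):

```python
import math
import random

def acf(x):
    """Brute-force normalized autocorrelation (mean removed)."""
    n = len(x)
    mean = sum(x) / n
    xc = [v - mean for v in x]
    var = sum(v * v for v in xc)
    return [sum(xc[t] * xc[t + k] for t in range(n - k)) / var for k in range(n)]

# A noisy sine wave with a 30-sample period, like the "Lemmings Population" example.
random.seed(42)
period = 30
series = [math.sin(2 * math.pi * t / period) + random.gauss(0, 0.3)
          for t in range(240)]

r = acf(series)
# Scan lags 15..45 for the strongest correlation; it should land near lag 30,
# because the series "lines up" with itself after one full period.
peak_lag = max(range(15, 46), key=lambda k: r[k])
print(peak_lag)
```

Even with noise, the ACF peak sits close to the true period, which is exactly the property the aggregation is meant to expose.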
Request
ACF is a sibling reducer, which accepts `histogram` or `date_histogram` input.
```
GET /test/test/_search?search_type=count
{
  "aggs": {
    "my_date_histo": {
      "date_histogram": {
        "field": "timestamp",
        "interval": "day",
        "min_doc_count": 0
      },
      "aggs": {
        "the_sum": {
          "sum": { "field": "price" }
        }
      }
    },
    "the_acf": {
      "acf": {
        "bucketsPath": "my_date_histo.the_sum",
        "window": 50
      }
    }
  }
}
```

Parameters
- `bucketsPath`: required
- `window`: size of the time series to perform the ACF on. If the series is length `n`, the ACF will be performed on the `n - window .. n` values, e.g. the most recent values. Optional, defaults to `5`
- `zero_mean`: "centers" the ACF by removing the mean from the time series. Optional, defaults to `true`
- `zero_pad`: pads the input data with zeros, up to the nearest power of two. FFTs are faster on powers of 2, and padding converts the ACF from a circular convolution to a linear convolution. Linear convolutions are more useful for "real world" use cases. Optional, defaults to `true`
- `normalize`: divides all ACF values by the variance, which normalizes the ACF to roughly `-1..1`. Optional, defaults to `true`

Response
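As a rough sketch of these parameter semantics (the names mirror the request body; this is a hedged brute-force illustration, not the plugin's FFT implementation, and `zero_pad` is omitted since it only applies to the FFT path):

```python
def acf(series, window=None, zero_mean=True, normalize=True):
    """Brute-force sketch of the acf reducer's parameter semantics.

    window: operate on only the last `window` values (n - window .. n).
    zero_mean: subtract the series mean before correlating.
    normalize: divide by the lag-0 value so results fall roughly in -1..1.
    """
    x = list(series[-window:]) if window else list(series)
    n = len(x)
    if zero_mean:
        mean = sum(x) / n
        x = [v - mean for v in x]
    # Lag-k autocorrelation: sum of products of the series with itself shifted by k.
    r = [sum(x[t] * x[t + k] for t in range(n - k)) for k in range(n)]
    if normalize and r[0] != 0:
        r = [v / r[0] for v in r]
    return r

print(len(acf(list(range(20)), window=5)))  # 5: only the most recent values are used
```

With `normalize=true` the first value is always 1 (a series is perfectly correlated with itself at lag 0), which matches the leading `1` in the response below.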
```
{
  "took": 14,
  "timed_out": false,
  "_shards": { ... },
  "hits": { ... },
  "aggregations": {
    "my_date_histo": {
      "buckets": [ ... ]
    },
    "the_acf": {
      "values": [
        1,
        0.37343470483005364,
        -0.360763267740012,
        0.17441860465116257,
        0.5277280858676209
      ]
    }
  }
}
```

Todo
`AbstractAggregationBuilder`, and the need for `InternalAcfBuilder` which is registered as an aggregation (due to siblings potentially being "top level" aggs). Unsure if this was the correct approach?