Add ability to perform computations on aggregations #9876

Closed
colings86 opened this Issue Feb 25, 2015 · 37 comments

Member

colings86 commented Feb 25, 2015

There are many instances where it is useful to perform computations on the output of aggregations to calculate new aggregations. This meta issue aims to summarize the functionality we would like to add to the aggregations framework to allow different types of computation to be performed during the reduce phase of aggregations.

This set of new aggregations is the highest priority, given their utility in a wide range of scenarios:

  • #9293 Aggregation to calculate the derivative on a histogram aggregation
  • #10898 Derivative Aggregation x-axis units normalisation
  • #10002 Aggregation to calculate multiple types of moving averages on a histogram aggregation
  • #10000 Aggregation to calculate the bucket which has the maximum value in a given aggregation
  • #9999 Aggregation to calculate the bucket which has the minimum value in a given aggregation

At the moment, the remainder of the list is largely exploratory, to see which ideas and functionality make sense and have community interest. Feel free to suggest your own ideas/aggregations/algos!

  • Aggregation that uses scripts to perform arbitrary computations on aggregations
  • #11196 Aggregation to compute differences on a single series (e.g. first difference = Y_t - Y_{t-1})
  • #10377 Aggs for autocorrelation, acf graphs, correlograms
  • #11006 Aggregation to calculate the (mean) average value of the buckets in a given aggregation
  • #11007 Aggregation to calculate the sum of the values of the buckets in a given aggregation
  • #13128 Aggregation to calculate stats and extended_stats values of the buckets in a given aggregation
  • #11008 Aggregation to calculate the number of buckets in a given aggregation
  • #11009 Aggregation to calculate the cardinality of a metric in a given aggregation
  • #11029 Aggregation to allow users to perform simple arithmetic operations on histogram aggregations
  • #11825 Agg to calculate cumulative sum of a metric
  • #11941 Agg to filter buckets based on a script
  • #13186 Agg to calculate percentiles
  • Agg to detect changes in mean (cumulative-sum control chart, Kolmogorov-Smirnov)
  • Agg to detect periodicity, seasonality
  • #11196 Agg to subtract known seasonality (serial differencing)
  • Agg for regression
  • Agg for Savitzky-Golay Filters
  • Aggs for high-pass, low-pass, band-pass filters
  • Agg for generic FFT and inverse FFT
  • #14928 Agg for selecting the nth bucket, and/or selecting a range + truncating
  • Agg for building a sliding_histogram

jhenley45 commented Feb 26, 2015

+1

javadevmtl commented Feb 27, 2015

+1

Contributor

TheDeveloper commented Feb 28, 2015

+1

I want to have a histogram aggregation that can use the doc_count result field from a parent Terms aggregation as its field.

@clintongormley clintongormley referenced this issue Mar 3, 2015

Closed

Roadmap for 2.0 #9970

14 of 14 tasks complete

Contributor

tcucchietti commented Mar 5, 2015

Big +1 on this!

rtremaine commented Mar 5, 2015

+1

aschokking commented Mar 18, 2015

👍

aschokking commented Mar 18, 2015

Is it possible to do these awkwardly now using scripted aggregations? Is that something Kibana4 can take advantage of if they are there?

Member

polyfractal commented Mar 18, 2015

@aschokking Nope, there's no way to hack this right now...if you want this functionality, you currently have to build it client-side yourself.

This new functionality essentially adds one or more extra reduce phases to the aggregation framework. For example, currently you can get the average price per day for the last 30 days (a date_histogram bucket with an avg metric). But you can't get the sum of those averages, since the summation operates on the agg results and not on doc values. You would have to do that client-side right now by summing up the buckets yourself.

From a high level, it looks like this:

  1. Map: execute on each doc to collect its value.
  2. Combine: average all the collected prices together. This happens on each shard.
  3. Reduce: send all the shard results to the coordinating node and merge the shard averages together.

The new functionality introduces a fourth step:

  4. Execute another reduce phase, this time iterating over the aggregation buckets and summing the averages.
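
Editorial sketch: the four steps above can be mimicked client-side in a few lines of Python. This is only an illustration under stated assumptions, not Elasticsearch's implementation. The shard data, the function names, and the choice to carry a (sum, count) pair so that shard averages merge correctly are all hypothetical.

```python
from collections import defaultdict

# Hypothetical documents per shard: (day, price) pairs.
shards = [
    [("2015-03-01", 10.0), ("2015-03-01", 20.0), ("2015-03-02", 30.0)],
    [("2015-03-01", 30.0), ("2015-03-02", 50.0)],
]


def map_and_combine(docs):
    """Steps 1-2: collect prices on one shard as a per-day [sum, count]."""
    partial = defaultdict(lambda: [0.0, 0])
    for day, price in docs:
        partial[day][0] += price
        partial[day][1] += 1
    return partial


def reduce_shards(partials):
    """Step 3: merge the shard results into one average per day."""
    merged = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for day, (total, count) in partial.items():
            merged[day][0] += total
            merged[day][1] += count
    return {day: total / count for day, (total, count) in merged.items()}


def sum_of_buckets(daily_avg):
    """Step 4: a second reduce that iterates over bucket values, not docs."""
    return sum(daily_avg.values())


daily_avg = reduce_shards(map_and_combine(docs) for docs in shards)
total = sum_of_buckets(daily_avg)
```

The point of the fourth step is that `sum_of_buckets` only ever sees the already-reduced per-day averages, which is why it cannot be expressed as an ordinary metric over documents.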

We are keeping close communication with the Kibana team, since they want to use a lot of this functionality. And none of this will "break" existing aggregations; in fact, all the new aggs look just like the old aggs. So Kibana will be able to implement them as they arrive in Elasticsearch, no need for a new major version or anything.

aschokking commented Mar 19, 2015

Thanks for clarifying @polyfractal, that makes sense.

Contributor

lukas-vlcek commented Mar 25, 2015

very nice!

lewchuk commented Apr 8, 2015

+1 on adding the secondary reduces. Will it be limited to a two-level aggregation, or can more levels be possible?

I'd suggest a modification to "Aggregation to calculate the (mean) average value of the buckets in a given aggregation" to be "Aggregation to calculate any/all of the extended_stats values of the buckets in a given aggregation, e.g. after a terms aggregation". This allows each bucket to be given an equal weight regardless of the number of documents in the underlying buckets.
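
Editorial sketch: the "equal weight" point can be seen with two made-up buckets. The bucket contents and the `avg_price` field below are hypothetical; the snippet only contrasts a mean over bucket values with a mean weighted by document count.

```python
# Hypothetical terms-aggregation buckets, each with a per-bucket average.
buckets = [
    {"key": "a", "doc_count": 8, "avg_price": 10.0},
    {"key": "b", "doc_count": 2, "avg_price": 20.0},
]

# Mean of the bucket values: every bucket counts once.
bucket_mean = sum(b["avg_price"] for b in buckets) / len(buckets)

# Document-weighted mean: large buckets dominate.
doc_weighted_mean = (
    sum(b["avg_price"] * b["doc_count"] for b in buckets)
    / sum(b["doc_count"] for b in buckets)
)
```

Here the bucket mean is 15.0 while the document-weighted mean is 12.0, which is the difference the suggestion is about.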

Member

polyfractal commented Apr 8, 2015

@lewchuk The new functionality should be able to work in multi-level aggregations. E.g. you can embed these new aggs at multiple levels in the aggregation tree.

Depending on the agg, they may have certain requirements which must be satisfied (e.g. a derivative must be embedded inside a histogram or date_histo, since it expects numerical series of numerical data); you'll receive a validation error if you put it in the wrong place.

Most of these new aggs also support "chaining". For example, you could calculate acceleration by taking the derivative of a derivative of position. Or do something like take the moving average of the derivative of the position. Etc etc :)
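
Editorial sketch: chaining can be illustrated on a plain list of bucket values. The helpers below are hypothetical stand-ins (a first difference and a simple trailing mean), not the models shipped in Elasticsearch, and the position series is made up.

```python
def derivative(values):
    """First difference between consecutive buckets; gaps propagate as None."""
    if not values:
        return []
    out = [None]  # the first bucket has no predecessor
    for prev, cur in zip(values, values[1:]):
        out.append(None if prev is None or cur is None else cur - prev)
    return out


def moving_average(values, window):
    """Trailing mean over the last `window` buckets, skipping missing values."""
    out = []
    for i in range(len(values)):
        seen = [v for v in values[max(0, i - window + 1): i + 1] if v is not None]
        out.append(sum(seen) / len(seen) if seen else None)
    return out


position = [0, 1, 4, 9, 16]
velocity = derivative(position)            # derivative of position
acceleration = derivative(velocity)        # derivative of a derivative
smoothed = moving_average(velocity, 2)     # moving average of a derivative
```

Each helper takes a series and returns a series of the same length, which is what makes the outputs freely composable.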

> I'd suggest a modification to "Aggregation to calculate the (mean) average value of the buckets in a given aggregation" to be "Aggregation to calculate the any/all of the extended_stats values of the buckets in a given aggregation, e.g. after a terms aggregation".

I believe the plan is to support all the basic "arithmetic" functions, not just mean. So mean/min/max/sum/etc. Basically mirroring the existing set of metrics...but for agg values instead of document values.

lewchuk commented Apr 9, 2015

@polyfractal Thanks for the clarification! Will be very excited to unleash the power of these new aggregations.

Kallin commented Apr 22, 2015

periodicity/seasonality stuff sounds interesting. we would like to do detection of customer attrition, many of whom have seasonal behaviour based on the vertical of their industry. this feature sounds like it could help eliminate false positives.

colings86 added a commit that referenced this issue Apr 29, 2015

Aggregations: Ability to perform computations on aggregations
Adds a new type of aggregation called 'reducers' which act on the output of aggregations and compute extra information that they add to the aggregation tree. Reducers look much like any other aggregation in the request but have a buckets_path parameter which references the aggregation(s) to use.

Internally there are two types of reducer; the first is given the output of its parent aggregation and computes new aggregations to add to the buckets of its parent, and the second (a specialisation of the first) is given a sibling aggregation and outputs an aggregation to be a sibling at the same level as that aggregation.

This PR includes the framework for the reducers, the derivative reducer (#9293), the moving average reducer(#10002) and the maximum bucket reducer(#10000). These reducer implementations are not all yet fully complete.

Known work left to do (these points will be done once this PR is merged into the master branch):

Add x-axis normalisation to the derivative reducer
Add lots more JUnit tests for all reducers
Contributes to #9876
Closes #10002
Closes #9293
Closes #10000
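
Editorial sketch: a request using buckets_path might look roughly like the Python dict below. The aggregation type names `derivative` and `max_bucket` and the `>` path separator follow the pipeline aggregation syntax as later documented for Elasticsearch 2.0; the exact syntax at the time of this commit may have differed. The field names `date` and `price` and all aggregation names are hypothetical.

```python
import json

request_body = {
    "size": 0,
    "aggs": {
        "sales_per_month": {
            "date_histogram": {"field": "date", "interval": "month"},
            "aggs": {
                "sales": {"sum": {"field": "price"}},
                # First kind: placed inside its parent, it adds a value to
                # each of the parent's buckets.
                "sales_deriv": {"derivative": {"buckets_path": "sales"}},
            },
        },
        # Second kind: placed next to the aggregation it reads, it outputs
        # a new aggregation at that same level.
        "max_monthly_sales": {
            "max_bucket": {"buckets_path": "sales_per_month>sales"}
        },
    },
}

payload = json.dumps(request_body, indent=2)
```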

goupeng212 commented May 11, 2015

+1

acarstoiu commented Jun 18, 2015

I've had a close look at the documentation of the upcoming pipeline aggregations. Quite exciting stuff 😃

Yet there's a very important, I'd say capital, piece of functionality missing. The primary reason for using server-side post-aggregations is not laziness (at least not in my case) but performance: it can be a killer for your application to receive tons of data over the wire and then crunch it for a while just to spit out a few numbers.

All pipeline aggregations should have a supress_source parameter (or something similar) that would instruct the coordinating node to prune the buckets used as source data from the returned result. Certainly, an aggregation might be suppressed by several pipeline aggregations, but one would suffice to have its buckets removed from the reply.

acarstoiu commented Jun 18, 2015

In the meantime I found the filter_path parameter which seems a good work-around (although I expect it to be less performant).

roytmana commented Jul 8, 2015

Read your blog on pipeline aggregations (https://www.elastic.co/blog/out-of-this-world-aggregations). Really nice, thank you.

I would be interested in a few more pipeline aggregations (or rather transformations):

  1. Flatten aggregation tree: to provide compatibility with the huge number of pivot table/chart visualization packages that expect a flat dataset and can "pivot" on its columns. It would flatten a tree of nested aggregations into a list of records, capturing the keys recursively from the top down in each record, as well as all metrics.
  2. Overlay transformation that can overlay (with various strategies) aggregation buckets from sibling aggregations. For example, bringing a _missing aggregation into the terms aggregation buckets (I know that 2.0 will support it directly, but it is for illustration purposes), or overlaying results produced by aggregations on different fields. For example, in a case management system they want to count the number of cases being opened and closed per month, so Elasticsearch could collate aggs on two separate fields together based on the bucket key (fiscal year). An even more useful case is overlaying with a calculation (or script).
  3. Arithmetic/expressions to produce derived metrics based on agg metrics within each bucket. (#2 is probably a more generic and more complex case of overlay with calculation.)
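
Editorial sketch of idea (1): a flattening step can be approximated client-side as below. This assumes list-style `buckets` arrays and single-value metrics exposed under a `value` key; the function name, the sample response, and the decision to attach `doc_count` only to leaf records are hypothetical choices.

```python
def flatten(aggs, parent=None):
    """Turn nested bucket aggregations into one flat record per leaf bucket."""
    rows = []
    for name, agg in aggs.items():
        for bucket in agg.get("buckets", []):
            record = dict(parent or {})
            record[name] = bucket["key"]  # capture the key at this level
            children = {}
            for field, value in bucket.items():
                if isinstance(value, dict) and "buckets" in value:
                    children[field] = value          # nested bucket agg
                elif isinstance(value, dict) and "value" in value:
                    record[field] = value["value"]   # single-value metric
            if children:
                rows.extend(flatten(children, record))
            else:
                record["doc_count"] = bucket["doc_count"]
                rows.append(record)
    return rows


# Hypothetical response fragment: country -> city, with a "total" metric.
response_aggs = {
    "country": {"buckets": [
        {"key": "DE", "doc_count": 3, "city": {"buckets": [
            {"key": "Berlin", "doc_count": 2, "total": {"value": 5.0}},
            {"key": "Bonn", "doc_count": 1, "total": {"value": 1.5}},
        ]}},
        {"key": "FR", "doc_count": 1, "city": {"buckets": [
            {"key": "Paris", "doc_count": 1, "total": {"value": 2.0}},
        ]}},
    ]}
}

rows = flatten(response_aggs)
```

Each row carries every key on the path from the root, so the result can be handed directly to a pivot-style consumer.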

Member

clintongormley commented Jul 9, 2015

@roytmana I like the idea of (1) flattening aggs into columns

(2) and (3) sound like they could be achieved very easily with https://www.elastic.co/guide/en/elasticsearch/reference/master/search-aggregations-pipeline-bucket-script-aggregation.html

roytmana commented Jul 9, 2015

ok @clintongormley I will play with script pipeline when the beta is out. If you decide to go ahead with (1) I would be happy to provide some use cases.

Member

clintongormley commented Jul 10, 2015

@roytmana just chatted to @colings86 and apparently (2) isn't supported by the bucket_script agg yet. But we should definitely add support

Member

clintongormley commented Jul 10, 2015

Hmmmm actually rereading (2) I'm not entirely sure if I understood it correctly. The examples you provide are quite different, eg:

  • moving _missing into the list of buckets.. this seems like a fairly arbitrary transform which would require access to the whole agg tree and could potentially result in weird output
  • however the "open" vs "closed" example sounds like those should be two metrics in each bucket of a date histo, then you could use bucket_script to (eg) add a third metric calculated from open - closed

Am I missing something?

The bit that I said was unsupported by bucket_script was the ability to access two separate histograms

roytmana commented Jul 10, 2015

Let me try to elaborate a bit.
Moving _missing: currently (and I know it'll be in 2.0) the terms agg does not support a missing bucket, so one way to solve it is to declare a sibling "missing" aggregation (with the same sub-aggs as the terms agg) next to my terms aggregation and then move its result into the terms agg bucket array. This is just one example of overlaying.
Here is another one: imagine that you want to selectively drill down into sub-aggs (that is, not bring back sub-agg data for every bucket of a parent agg). For example, we aggregate by country and city, and we want to show a breakdown by country and, within countries, a breakdown by city, but only for Germany and France.
One way to do it is to have two sibling aggs, one on Country only and the other on Country and City with "include" restricting countries to Germany and France, then overlay the second (two-level) agg over the first, and you have selective sub-aggregation. It is very useful in a UI where the user can freely drill down different paths of nested aggregations (and does not wish to drill down into every bucket) and can then change some global filter or search criteria, at which point I need to reload the entire visible tree from the new query.

As for open/closed: I do not think it could be two metrics in one bucket, since they are two different fields to bucket on. Here is the requirement: I want to calculate the number of cases and the cost of cases opened and closed in each fiscal year and show them side by side. I have two fields, OpenFY and ClosedFY, which are pre-calculated. I want to show a chart with two data series, one for opened and one for closed (counts and cost). Open and closed are two independent fields. (It is even possible that there is a year in which no cases were closed at all, so there will not be a bucket for that FY in closed.)

I want to agg on the first and on the second and then merge the results by FY so each bucket gets the open and closed metrics together. I currently do it in post-processing, but I think result tree manipulation support directly in ES would be really useful!
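
Editorial sketch: the post-processing merge described here might look like the following. The bucket shape (`key`, `doc_count`, `cost`), the series names, and the sample numbers are hypothetical; a fiscal year with no bucket in one series is filled with None.

```python
def overlay(series_by_name):
    """Merge sibling bucket lists on bucket key; missing entries become None."""
    keys = sorted({b["key"] for buckets in series_by_name.values() for b in buckets})
    merged = {key: {name: None for name in series_by_name} for key in keys}
    for name, buckets in series_by_name.items():
        for b in buckets:
            merged[b["key"]][name] = {"count": b["doc_count"], "cost": b["cost"]}
    return [{"key": key, **metrics} for key, metrics in merged.items()]


# Hypothetical results of two separate aggregations, on OpenFY and ClosedFY.
opened = [
    {"key": 2013, "doc_count": 5, "cost": 100.0},
    {"key": 2014, "doc_count": 3, "cost": 60.0},
]
closed = [
    {"key": 2014, "doc_count": 4, "cost": 80.0},  # nothing was closed in 2013
]

by_fiscal_year = overlay({"opened": opened, "closed": closed})
```

Filling gaps with None rather than dropping the year keeps the two chart series aligned on the same x-axis.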

One more question I have is about nested and reverse_nested (same for parent) aggregations. They introduce an extra level in the result tree which I am not sure is necessary. Nesting only changes the calculation scope and should not alter the result tree depth. It is rather a headache to deal with in dynamic, metadata-driven systems where users do not care how data is laid out; they just pick how to aggregate and what to calculate, and I may have to cross nested boundaries back and forth to accommodate it. Right now, in post-processing, I have to transform my results by removing the extra nodes created by nested/reverse_nested (a royal headache in an entirely dynamic system) before passing them to the UI level. I was wondering if it would introduce any problem (name clash?) if nested/reverse_nested did not introduce a separate node and all its sub-aggs emitted their results into the agg owning the nested one.

roytmana commented Jul 10, 2015

I want to add that nested/reverse_nested introducing extra levels in the result tree is not a trivial matter.
Consider this example: a Case has Customers and Teams (of employees) who work on the case. I want to see a 3-level breakdown of cases by customer, by team, and by employee. While logically my result tree should have 3 levels, the actual result tree with all the nesting/un-nesting is a good deal more complex. I would greatly appreciate it if you gave it some thought and considered whether an option could be added to skip the extra nodes in the result tree for aggs that only change the calculation scope.

tmandry commented Aug 7, 2015

Lag or Timeshift Aggregation: Sort of a generalization of the serial differencing agg which only provides the lag functionality, allowing you to perform operations on values in different buckets (from the same or bucket aggregations.)

Use case: Cohort retention analysis, where I want to see what percentage of users come back the day after their first day. I could do this by bucketing by day and by filtering on both first_seen_days_ago:0 and first_seen_days_ago:1, using the lag aggregation to line up the second filter with the first, and finally dividing values from the same cohort.
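
Editorial sketch: the retention calculation can be illustrated on two daily series. The `lag` helper, the series names, and the numbers are hypothetical; here the first-day counts are shifted forward by one bucket so each cohort lines up with its next-day returns.

```python
def lag(values, periods=1):
    """Shift a series right by `periods` buckets, padding the front with None."""
    padded = [None] * periods + list(values)
    return padded[: len(values)]


def ratio(numerators, denominators):
    """Bucket-wise division; None where the denominator is missing or zero."""
    return [
        None if d is None or d == 0 else n / d
        for n, d in zip(numerators, denominators)
    ]


# Hypothetical per-day counts from the two filters.
first_day_users = [100, 80, 50]   # first_seen_days_ago:0
next_day_users = [0, 40, 20]      # first_seen_days_ago:1

cohort_size = lag(first_day_users, 1)
retention = ratio(next_day_users, cohort_size)
```

With these numbers, `retention` is `[None, 0.4, 0.25]`: 40 of the 100 users first seen on day 0 returned on day 1, and 20 of the 80 first seen on day 1 returned on day 2.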

tmandry Aug 7, 2015

@polyfractal For my use case, the serial_diff approach would work, but appending a new bucket would allow us to enrich the interface with raw user counts in addition to percentages. (At least, I think appending would be necessary.)

@clintongormley clintongormley added v2.0.0 and removed v2.0.0-beta1 labels Aug 13, 2015

polyfractal Sep 25, 2015

Member

Been working with pipelines more extensively on a demo project. A few observations about what is difficult:

  • A "sliding histogram" would be very useful. There are situations where you need to accumulate the results from a range t_i..t_{i+n} into a single value, then repeat the process for the next time range, t_{i+1}..t_{i+n+1}. After doing that, you want to treat the output of each time range as a point in a new series and perform metrics on that.

    Currently, the only way to do this is to execute a search per range and index the results, then run a follow-up agg. The main downside of this functionality is that it could produce a very large number of buckets, but I think the usefulness outweighs the downside.

  • An ability to pick out individual buckets from a series. E.g. a first, last, nth metric. For example, you could have a date_histo embedded in a terms, giving one time series per term. Then you want to calculate a moving avg and some other stuff for each series, and just want the "final" value from each series, so you could determine the largest "final" value. Currently there is no way to do that.

    Alternatively, pathing could be modified to allow last etc as special keywords, so you could do termsAgg>dateAgg[last].value. Would tie in nicely with the ability to ask for specific terms too (termsAgg['foo']>dateAgg.....)

  • Terms aggs tend to be an impenetrable wall. It is difficult to access values on either "side" of the terms agg since it is a dynamic multi-value bucket, and double terms aggs basically prevent access entirely.
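As a rough illustration of the "sliding histogram" idea in the first bullet, the following Python sketch collapses each run of consecutive bucket values into one value and returns the results as a new series. The function name `sliding_accumulate` and the choice of a window of `n` consecutive buckets are assumptions made for the example; this is a client-side sketch, not an existing aggregation.

```python
def sliding_accumulate(values, n, combine=sum):
    """Collapse every window values[i : i + n] into a single value.

    The results form a new series with one point per window position;
    only full windows are emitted.
    """
    if n < 1:
        raise ValueError("window size must be at least 1")
    return [combine(values[i:i + n]) for i in range(len(values) - n + 1)]


# Per-bucket values of some metric, e.g. one value per day.
series = [1, 2, 3, 4, 5]

# Each 3-bucket window becomes one point of the derived series.
windowed = sliding_accumulate(series, 3)

# The derived series can then feed further metrics, e.g. its maximum.
peak = max(windowed)
```

The bucket-count concern is visible here too: a series of length m yields m - n + 1 overlapping windows, each of which re-reads n source values.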

clintongormley Sep 27, 2015

Member

An ability to pick out individual buckets from a series. E.g. a first, last, nth metric.

could be: dateAgg[-1].value for last
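A small Python sketch of how such a path might be resolved against a nested response. The path grammar (`name[selector]` steps joined by `>`, with a trailing `.property`) is hypothetical and only mirrors the examples in this thread; the sketch also assumes each date bucket carries its `value` directly, which is a simplification of a real response.

```python
import re

# One path step: an aggregation name plus a bucket selector in brackets.
_STEP = re.compile(r"^(?P<name>\w+)\[(?P<sel>[^\]]+)\]$")


def resolve(aggs, path):
    """Resolve a path such as "termsAgg['foo']>dateAgg[-1].value"."""
    steps, _, prop = path.rpartition(".")
    node = aggs
    for step in steps.split(">"):
        match = _STEP.match(step)
        if match is None:
            raise ValueError(f"unsupported path step: {step!r}")
        buckets = node[match["name"]]["buckets"]
        selector = match["sel"]
        if selector[0] in "'\"":
            # Quoted selector: pick the bucket with this key.
            key = selector[1:-1]
            node = next(b for b in buckets if b["key"] == key)
        else:
            # Numeric selector: Python-style index, so -1 is the last bucket.
            node = buckets[int(selector)]
    return node[prop]


aggs = {"termsAgg": {"buckets": [
    {"key": "foo", "dateAgg": {"buckets": [
        {"key": 1, "value": 3.0}, {"key": 2, "value": 5.0}]}},
    {"key": "bar", "dateAgg": {"buckets": [
        {"key": 1, "value": 9.0}, {"key": 2, "value": 4.0}]}},
]}}

last_foo = resolve(aggs, "termsAgg['foo']>dateAgg[-1].value")
first_bar = resolve(aggs, "termsAgg['bar']>dateAgg[0].value")

# Largest "final" value across all terms.
largest_final = max(
    resolve(aggs, f"termsAgg['{b['key']}']>dateAgg[-1].value")
    for b in aggs["termsAgg"]["buckets"]
)
```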

@clintongormley clintongormley removed the v2.0.0 label Oct 6, 2015

@colings86 colings86 removed their assignment Dec 18, 2015

arivazhagan-jeganathan Dec 24, 2015

A query string with aggregation parameters works fine with the JEST client, but with TCP, is it always mandatory to build an AggregationBuilder to execute an aggregation? Why is a JSON aggregation query not supported over TCP? Is there any specific reason for this?

NathanZamecnik Jan 11, 2016

A "Moving Standard Deviation" pipeline aggregation would be useful. If we can calculate that on the server we could also create a "Relative Standard Deviation" aggregation which would use a "Moving Average" aggregation and the "Moving Standard Deviation" aggregation. This would be useful to calculate the +/- for various metrics.

For instance, with a Web server I may want to calculate volatility and I could use "Relative Standard Deviation" to see +/- how many client requests I have over time or +/- the sum of bytes served per window, etc. Possibly this could be used with the predictive aggregations to let me get an idea of how much capacity I'll need during various seasons, times of day, etc.
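For reference, the arithmetic behind the proposal can be sketched in a few lines of Python. This is a client-side illustration only: `moving_stats` is a hypothetical name, the window is a trailing window that is shorter at the start of the series, and the population standard deviation is an assumed choice (a sample standard deviation would be equally defensible).

```python
from statistics import mean, pstdev


def moving_stats(values, window):
    """For each position, compute the moving average, the moving standard
    deviation, and the relative standard deviation (std / |avg|) over the
    trailing `window` values."""
    stats = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        avg = mean(chunk)
        std = pstdev(chunk)  # population standard deviation (assumption)
        rsd = std / abs(avg) if avg else None  # undefined when avg is 0
        stats.append({"avg": avg, "std": std, "rsd": rsd})
    return stats


# E.g. requests per interval served by a web server.
requests = [2, 4, 4, 4, 5, 5, 7, 9]
stats = moving_stats(requests, window=8)
latest = stats[-1]
```

For the last point, the window covers all eight values: the average is 5, the standard deviation is 2, and the relative standard deviation is 0.4, i.e. the metric moves by roughly +/- 40% around its moving average.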

l8liu Sep 22, 2016

I agree a "Moving Standard Deviation" pipeline aggregation would be useful. I want to do statistical control on time-series count data. I can get the moving average of the daily count, but in order to compute the control limits I need a moving standard deviation of the count.

AlexKovalevich Oct 16, 2016

I don't see how to practically calculate, let's say, the average site visit duration.
Let's say I have something like this:
parent.subAggregation(AggregationBuilders.terms("visit_metrics_wrapper_agg").field("trackingSessionId").subAggregation(AggregationBuilders.avg("avg_time_per_visit_agg").field("trackingSessionLastUpdateDifference")));
parent.subAggregation(PipelineAggregatorBuilders.avgBucket("avg_page_view_time_avg_per_visit").setBucketsPaths("visit_metrics_wrapper_agg>avg_time_per_visit_agg"));

"avg_page_view_time_avg_per_visit" calculates the correct result, great!
But assume you do this for a one-month period on a site with a few million visits per month:
this will produce an enormous number of buckets (a few million), split by trackingSessionId.
I can't tell how slow it's going to be inside the server, but the JSON response will contain a few million buckets, which looks impossible to filter.

It would be great if this kind of structure could be configured to return just the final result, without the intermediate steps.

For example, in a relational DB it would be done in two selects: the inner one would compute the average time per visit, and the outer one would average those values across visits, returning only one row with the final result. You don't want your DB to return all the possible temporary results. Something similar would be nice to have in ES!
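The "two selects" computation can be sketched in Python as follows. This only illustrates the desired result shape (one number, no per-session buckets in the output); it says nothing about how Elasticsearch would execute it, and the function name `avg_of_session_avgs` is hypothetical. The field names are taken from the snippet above.

```python
from collections import defaultdict


def avg_of_session_avgs(docs,
                        key="trackingSessionId",
                        field="trackingSessionLastUpdateDifference"):
    """Average the per-session averages and return only the final value."""
    # "Inner select": running sum and count per session.
    totals = defaultdict(lambda: [0.0, 0])
    for doc in docs:
        acc = totals[doc[key]]
        acc[0] += doc[field]
        acc[1] += 1
    if not totals:
        return None
    # "Outer select": average of the per-session averages.
    return sum(total / count for total, count in totals.values()) / len(totals)


docs = [
    {"trackingSessionId": "s1", "trackingSessionLastUpdateDifference": 10},
    {"trackingSessionId": "s1", "trackingSessionLastUpdateDifference": 20},
    {"trackingSessionId": "s2", "trackingSessionLastUpdateDifference": 30},
]

# Session averages are 15 and 30, so the overall result is 22.5.
result = avg_of_session_avgs(docs)
```

Note that the intermediate per-session state still has to be held somewhere while computing; the sketch only shows that it need not appear in the response.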

clintongormley Nov 26, 2016

Member

@colings86 @polyfractal can this issue be closed now, or do you want to keep the unimplemented list around?

colings86 Nov 29, 2016

Member

@clintongormley yes, I think we can close this issue, as we have the core functionality this issue was created to address. New aggregations can be requested and added in separate issues/PRs; this way it will be easier to discuss them.

@colings86 colings86 closed this Nov 29, 2016

hienchu Sep 12, 2017

Is there any plan to implement the "Agg for building a sliding_histogram"?
I am keen to get this feature to calculate document appearance frequency, which is:

  1. count per bucket
  2. histogram on count of each bucket.

I am happy to contribute to this work, any consolidated doc / example will help.
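One possible reading of the two steps above, sketched in Python: first count documents per bucket key, then build a histogram over those counts. The name `frequency_histogram` and the interpretation itself are assumptions made for illustration.

```python
from collections import Counter


def frequency_histogram(keys):
    """Return {count: number of buckets that contain exactly `count` docs}."""
    per_bucket = Counter(keys)                # step 1: count per bucket
    return dict(Counter(per_bucket.values()))  # step 2: histogram of counts


# One entry per document, holding that document's bucket key.
doc_keys = ["a", "a", "b", "b", "c"]

# Buckets "a" and "b" hold 2 docs each and "c" holds 1.
histogram = frequency_histogram(doc_keys)
```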

colings86 Sep 12, 2017

Member

@hienchu my original intention for a sliding_histogram is a bit different, I think. I had intended it to be a histogram with an interval and a window, such that each output bucket spans the window period and the bucket bounds advance by the interval from one bucket to the next. For example, you might have an interval of 1 hour and a window of one day. In this case the output would be buckets for 2017-01-01-00:00:00.000 TO 2017-01-01-23:59:59.999, 2017-01-01-01:00:00.000 TO 2017-01-02-00:59:59.999, 2017-01-01-02:00:00.000 TO 2017-01-02-01:59:59.999, etc.

Does that fit into what you are thinking here? It might be a good idea if you raised a new ticket for this and then we can iterate on the idea there?
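The bucket layout described above can be sketched with a short Python generator. `sliding_buckets` is a hypothetical name, and the sketch uses half-open bounds (upper bound exclusive) instead of the inclusive `23:59:59.999` style shown in the example; it only emits windows that fit entirely before `end`.

```python
from datetime import datetime, timedelta


def sliding_buckets(start, end, interval, window):
    """Yield (lower, upper) bounds: each bucket spans `window`, and the
    bounds advance by `interval` from one bucket to the next."""
    lower = start
    while lower + window <= end:
        yield lower, lower + window
        lower += interval


bounds = list(sliding_buckets(
    start=datetime(2017, 1, 1, 0, 0),
    end=datetime(2017, 1, 2, 2, 0),
    interval=timedelta(hours=1),
    window=timedelta(days=1),
))
```

With a one-hour interval and a one-day window, consecutive buckets overlap by 23 hours, so each document falls into up to 24 buckets.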
