Cumulative distribution function #3905

Closed
Tracked by #179199
dagguh opened this issue May 21, 2015 · 10 comments
Labels: Feature:Aggregations, Feature:elasticsearch, release_note:enhancement, Team:Visualizations

Comments

@dagguh
dagguh commented May 21, 2015

Cumulative distribution function, e.g.:
[Image: cumulative distribution function example]
This function is invertible, e.g. you can swap the axes:
[Image: JMeter percentile distribution]

I am unable to visualise either function in Kibana build 6998, commit d029b34. I understand that the axes depend on each other, i.e. they must be about the same field and must be a pair of percentiles and percentile ranks. I'm aware that, for example, the Line Chart visualisation isolates the axes from each other.

This is why I propose a new visualisation type: Cumulative Distribution. This is very similar to #2704 which would also need a separate visualisation type. Maybe it can be generalised into a Distribution visualisation type.
Both of them only need a single field as an input. Both would benefit greatly from Split Lines and Split Charts.

If Elasticsearch doesn't provide such capabilities, please let me know and I'll raise an issue there.

@rashidkpc
Contributor

This is not something we want to introduce a new visualization for, rather this is a transformation on existing data as applied to a line chart. Can you explain some use cases? Give some examples on where you'd use this? Concrete questions it would solve?

@dagguh
Author

dagguh commented May 22, 2015

E.g. A/B testing. These are actual comparisons we did using JMeter:
[Image: cumulative-distribution-function-1]
[Image: cumulative-distribution-function-2]
We need to compare response times for different configs/versions. Quantiles are the most meaningful.
We care about the entire spectrum, so picking a single percentile is not enough. We might give up some of the completeness and only pick a subset, e.g. P1, P25, P50, P75, P90, P99, but for 4 splits (per config/version) it would result in 6×4 = 24 lines, which would be absolutely unreadable.

@tbragin tbragin added Feature:Aggregations Aggregation infrastructure (AggConfig, esaggs, ...) Feature:Visualizations Generic visualization features (in case no more specific feature label is available) Feature:elasticsearch labels Nov 15, 2016
@sachinzgupta

Is there any update, method, or plugin for plotting a cumulative distribution function (CDF) or probability density function (PDF) for a KPI?

@timroes timroes added Team:Visualizations Visualization editors, elastic-charts and infrastructure and removed Team:Visualizations Visualization editors, elastic-charts and infrastructure Feature:Visualizations Generic visualization features (in case no more specific feature label is available) labels Sep 16, 2018
@agirbal

agirbal commented Sep 12, 2019

+1. Much of the analysis we do is based on percentile distribution, exactly like @dagguh shows. Basically a histogram where the X-Axis are the bucketed percentiles (e.g. p25, p50, p75) of a field, and the Y-Axis uses some number function like average of that same field or median of some other field (counts would be equal between percentiles). This lets you answer questions like "what is the gain on the 25% of users who have the worst latency to our service." It'd be super powerful.

This is an older ticket; is there any chance this is now doable with pipeline aggs, and maybe Vega / Canvas visualizations?

@polyfractal
Contributor

I believe a CDF chart should be doable with the Percentiles aggregation in Elasticsearch. A CDF is just the "continuous" function describing percentiles at any arbitrary position.

So Kibana could ask the percentiles agg for the 0-100 percentiles in small increments (0, 5, 10, 15, ... 100), which will approximate the CDF. Smaller increments == better approximation. Asking for more percentiles is essentially free other than some minor computation and a larger response size. The percentile sketches collect all the information from the shards, and when we construct the response Elasticsearch interrogates the CDF of the sketches to generate specific percentiles. So asking for more percentiles just interrogates the sketch a bit more, which is mostly negligible (within reason) compared to building the sketch itself.
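As a rough illustration of the request described above, the following Python sketch builds a search body that asks the percentiles aggregation for evenly spaced percents. The function name `build_cdf_request` and the aggregation name `cdf` are hypothetical and chosen only for this example; the `percentiles` aggregation with its `field` and `percents` parameters is the one discussed in this thread.

```python
def build_cdf_request(field: str, step: int = 5) -> dict:
    """Build a search body requesting percentiles 0, step, ..., 100.

    `build_cdf_request` and the agg name "cdf" are illustrative names,
    not part of Elasticsearch or Kibana.
    """
    if step <= 0 or 100 % step != 0:
        raise ValueError("step must be a positive divisor of 100")
    percents = list(range(0, 101, step))
    return {
        "size": 0,  # only the aggregation is needed, not the hits
        "aggs": {
            "cdf": {"percentiles": {"field": field, "percents": percents}}
        },
    }


body = build_cdf_request("value", step=10)
print(body["aggs"]["cdf"]["percentiles"]["percents"])
```

A smaller `step` yields more points and therefore a closer approximation of the curve, at the cost of a larger response.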

It could also be done with PercentileRanks agg (which is basically the inverted chart), but that requires you to know the extents of data ahead of time. Would be easier to use Percentiles since you know the data is always 0-100, then invert the graph client-side if desired.
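For comparison, a sketch of the PercentileRanks variant might look like the following. It assumes the extents of the data (`lo` and `hi`) are already known, for example from a prior `stats` aggregation, which is exactly the extra requirement mentioned above. The function name `build_rank_request` and the agg name `cdf` are hypothetical.

```python
def build_rank_request(field: str, lo: float, hi: float, points: int = 11) -> dict:
    """Build a search body asking percentile_ranks for evenly spaced values.

    Assumes the caller already knows the data extents [lo, hi].
    `build_rank_request` is an illustrative name only.
    """
    if points < 2 or hi <= lo:
        raise ValueError("need at least two points and hi > lo")
    width = (hi - lo) / (points - 1)
    values = [lo + i * width for i in range(points)]
    return {
        "size": 0,
        "aggs": {
            "cdf": {"percentile_ranks": {"field": field, "values": values}}
        },
    }


body = build_rank_request("value", lo=1.0, hi=80.0)
print(len(body["aggs"]["cdf"]["percentile_ranks"]["values"]))
```

Here the sample points are spread over the value axis rather than over 0-100, so a long-tailed field would need a different spacing (or the Percentiles approach) to resolve the dense part of the distribution.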

I agree that a complete plot of the full CDF is very useful in many analyses.

@agirbal

agirbal commented Feb 20, 2020

@polyfractal cc @AlonaNadler the issue with the bare Percentiles agg is that Kibana would still need to make a two-step request to ES, since it applies to your X-axis value X. The Y-axis is just getting the average of the Y value for that percentile interval of X (however granular it is).

Would there be a way for Kibana to choose to do a "quantile histogram", give a few parameters like the granularity or specific percentile value to bucket at, and then ES would do the full aggregation in 1 go?

@polyfractal
Contributor

I'm not sure I follow? A request like this basically gives you the CDF:

GET /test/_search
{
  "size": 0,
  "aggs": {
    "cdf": {
      "percentiles": {
        "field": "value",
        "percents": [ 10, 20, 30, 40, 50, 60, 70, 80, 90, 100 ]
      }
    },
    "stats": {
      "stats": {
        "field": "value"
      }
    }
  }
}

{
  "aggregations" : {
    "cdf" : {
      "values" : {
        "10.0" : 2.6,
        "20.0" : 8.0,
        "30.0" : 15.0,
        "40.0" : 15.0,
        "50.0" : 15.0,
        "60.0" : 19.499999999999996,
        "70.0" : 20.8,
        "80.0" : 41.300000000000004,
        "90.0" : 67.99999999999999,
        "100.0" : 80.0
      }
    },
    "stats" : {
      "count" : 9,
      "min" : 1.0,
      "max" : 80.0,
      "avg" : 24.666666666666668,
      "sum" : 222.0
    }
  }
}

All Kibana needs to do is convert that into a line chart, e.g. a point at (10.0, 2.6), (20.0, 8.0), etc. It doesn't work with the current Kibana visualization setup because Visualization assumes you have to build the X axis out of bucketing aggs (which is accurate in most cases, just not here). I don't know the internal details about how hard that would be to adjust, but all the data is available in the percentiles response to build a CDF.
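The conversion described above can be sketched client-side in a few lines of Python. This is only an illustration, not how Kibana does or would do it: the function name `cdf_points` is hypothetical, and the input is assumed to be the keyed `values` object from a percentiles response like the one shown earlier. Skipping `None` entries is an assumption made here for percentiles that have no value.

```python
def cdf_points(values: dict) -> list:
    """Turn {"10.0": 2.6, ...} into sorted (value, cumulative fraction) pairs.

    `values` is assumed to be the keyed "values" object of a percentiles
    aggregation response. Entries without a value (None) are skipped.
    """
    pairs = [(float(pct) / 100.0, val) for pct, val in values.items() if val is not None]
    pairs.sort()  # order by cumulative fraction
    # Value on the x axis, fraction on the y axis gives the CDF;
    # swapping each pair gives the inverted (percentile) chart.
    return [(val, frac) for frac, val in pairs]


sample = {"20.0": 8.0, "10.0": 2.6, "100.0": 80.0, "50.0": 15.0}
for x, y in cdf_points(sample):
    print(x, y)
```

Returning plain tuples keeps the sketch independent of any charting library; each pair can be fed to whatever line renderer is at hand.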

We can't make a "bucket" version of percentiles because it's one of those operations where you don't know the real percentile values until all the shards have been merged together. And at that point it's too late to collect documents into buckets because we're merging on the coordinating node. If we had multi-pass aggs it would be theoretically possible, but it would still require two passes (they'd just happen in ES).

If "bucketed" percentiles are needed today, it could be done by Kibana with two passes: one to get the percentiles, and a second to set up a range agg on those returned percentiles. But that's no longer really a CDF imo :)
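A minimal sketch of the second pass, assuming the first pass has already returned a keyed percentiles `values` object: the returned percentile values become the boundaries of a `range` aggregation, with an `avg` sub-aggregation on another field. The function name `range_agg_from_percentiles`, the agg names `by_quantile` and `avg_metric`, and the choice of `avg` as the metric are all hypothetical. Because `range` buckets include `from` and exclude `to`, the last bucket is left open-ended here so the maximum value is not dropped.

```python
def range_agg_from_percentiles(field: str, values: dict, metric_field: str) -> dict:
    """Build the second-pass search body from first-pass percentile values.

    All names here are illustrative. Duplicate percentile values collapse
    into a single boundary, so fewer buckets than percentiles may result.
    """
    cuts = sorted({v for v in values.values() if v is not None})
    ranges = [{"from": lo, "to": hi} for lo, hi in zip(cuts, cuts[1:])]
    if ranges:
        ranges[-1].pop("to")  # open-ended last bucket keeps the max value
    return {
        "size": 0,
        "aggs": {
            "by_quantile": {
                "range": {"field": field, "ranges": ranges},
                "aggs": {"avg_metric": {"avg": {"field": metric_field}}},
            }
        },
    }


first_pass = {"0.0": 1.0, "50.0": 15.0, "100.0": 80.0}
body = range_agg_from_percentiles("value", first_pass, "latency")
print(body["aggs"]["by_quantile"]["range"]["ranges"])
```

Note that with heavily repeated values (such as the three percentiles equal to 15.0 in the response above) the buckets no longer hold equal shares of the documents, which is one more way this differs from a true CDF.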

@agirbal

agirbal commented Feb 24, 2020

@polyfractal right your last description is what I mean. Drawing a pure CDF is one thing and you are right that it would answer the original premise of this ticket. But I think it'd be very limiting in what you can do with it - I attempted to describe a more generic approach that would let you do more interesting things here elastic/elasticsearch#50386

You could draw the CDF 2 ways:
A) as you describe: get a whole bunch of percentile points for the value and interpolate them into a line. Your Y-axis would probably select "avg of field A" and then X-axis a new "percentile histogram" that does not select a field since it doesn't need one (just 0-100).
B) allow to select what you want on Y-axis, say "avg of field A" and then on X-axis "Histogram" with a new "percentile of field B" option (instead of typical range). With this solution you can achieve CDF too (by picking same field for both) but it's much more interesting because it lets you do any histogram as you would normally do, but with values that are not X-axis friendly due to their distribution (typical long tail prod system metrics).

@wylieconlon
Contributor

wylieconlon commented Jun 4, 2021

For anyone who is trying to get this type of chart in Kibana, I have a workaround using Vega. As mentioned earlier in this thread, Elasticsearch already supports the most basic level of fetching data that we can use to render a chart. Vega can do the calculation in your browser and render the chart. Here's my example.

[Image: screenshot of the example chart]

Full Vega-Lite spec
{
  $schema: https://vega.github.io/schema/vega-lite/v5.json
  title: "Cumulative distribution of bytes"
  data: {
    url: {
      %context%: true
      %timefield%: timestamp
      index: kibana_sample_data_logs
      body: {
        aggs: {
          "terms": {
            "terms": {
              "field": "geo.dest"
              size: 3
            },
            "aggs": {
              "cdf": {
                "percentiles": {
                  "field": "bytes",
                  "percents": [ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 95, 99, 100 ],
                  "keyed": true
                }
              }
            }
          }
        }
        size: 0
      }
    }
    format: {property: "aggregations.terms.buckets"}
  }
  
  transform: [
    {
      fold: [
        "cdf.values['0.0']"
        "cdf.values['10.0']"
        "cdf.values['20.0']"
        "cdf.values['30.0']"
        "cdf.values['40.0']"
        "cdf.values['50.0']"
        "cdf.values['60.0']"
        "cdf.values['70.0']"
        "cdf.values['80.0']"
        "cdf.values['90.0']"
        "cdf.values['95.0']"
        "cdf.values['99.0']"
        "cdf.values['100.0']"
      ]
      as: ["bytes", "value"]
    }
    {
      calculate: 'toNumber(substring(datum.bytes, 12, lastindexof(datum.bytes, "\'"))) / 100'
      as: percentile
    }
  ]

  mark: {
    type: line
    point: true
    tooltip: true
  }

  encoding: {
    x: {
      field: value
      type: quantitative
      axis: {
        title: false
      }
    }
    y: {
      field: percentile
      type: quantitative
      axis: {
        title: null
        format: "0%"
      }
    }
    color: {
      field: key
      type: nominal
      legend: {
        title: null
      }
    }
  }
}

@timductive
Member

Closing this because it's not planned to be resolved in the foreseeable future. It will be tracked in our Icebox and will be re-opened if our priorities change. Feel free to re-open if you think it should be melted sooner.

@timductive closed this as not planned Mar 21, 2024

9 participants