Scatterplot aggregation #55943

polyfractal · 2020-04-29T14:17:03Z

This issue proposes a new scatterplot aggregation. Scatterplots are surprisingly tricky to build in ES right now. An approximation of a scatterplot can be generated with two histograms, but this is closer to a heatmap/density plot than a real scatterplot. It is also difficult to get a good density plot because of search.max_buckets limits, and not knowing the dynamic bounds of the data.

One of the useful aspects of a scatterplot is seeing the actual, raw data plotted as points which today is not easily achieved in a scalable manner (search with large size is not recommended, scroll can be slower and more difficult interface, top-hits / top-metrics can help to a degree but you may want a random sample not the "top" values).

The proposed aggregation would return a sample of raw points as well as an approximation of the total density, allowing charts like this to be created:

Algorithm overview

Up to a user-defined threshold, raw data will be buffered as x/y pairs
After the threshold is breached:
- The buffer of points will be maintained with reservoir sampling. This means all points that are returned to the user had an equal probability of being selected, and should provide a reasonable distribution of raw points to overlay the density chart
- Two TDigest sketches will be allocated to track the x and y axis.
The user will receive the buffer of raw points, and if it was over-threshold, an approximate density map constructed from the two TDigest sketches

This should give the user a relatively flexible tool. Under the threshold they can receive a true scatterplot, and over the threshold they get a scalable density estimate and sampling of raw points to overlay.

Request syntax

{
  "scatterplot": {
    "x": {
      "field": "foo"
    },
    "y": {
      "field": "bar"
    },
    "compression": 100,
    "raw_samples": 1000,
    "density_intervals": 10,
    "density_shape": "square | rect"
  }
}

x/y defines the fields for each axis
compression defines the TDigest compression value
raw_samples controls the threshold where we switch over to reservoir sampling and density estimation
density_intervals defines how many "buckets" or "pixels" of density we should return in the response. Internally this translates to how many points of the CDF we sample
density_shape defines if we should normalize the axis before sampling the CDF.
- Square: to get "square" pixels, we need to make sure density estimates from both axis are equal in range. E.g. if all X values land in [1,10] but Y values land in [1,100], we need to make sure the CDF estimates we take from the X-axis are also in the range of [1,100]
- Rectangle: if we don't want square pixels, we will sample from each axis independently, meaning X will take density estimates from [1,10] and Y will take estimates from [1,100], leading to rectangular pixels being returned

All names are up for debate, open to suggestions :)

Response syntax

{
  "scatterplot": {
    "axis": {
      "x": {
        "min": -20,
        "max": 100,
        "interval": 12
      },
      "y": {
        "min": 55,
        "max": 900,
        "interval": 84.5
      }
    }
    "samples": [
      [1.0, 1.0],
      [10.3, 2.6]
      ...
      ...
    ],
    "density_plot": [
      [22.3, 60.4, 33.0, ...],
      [40.0, 81.2, 98.7, ...],
    ]
  }
}

axis returns the boundaries of each axis and the interval size, to help the client to more easily setup the scatterplot chart
samples contains the raw samples (complete, or sub-sampled)
density_plot contains the density estimate if the scatterplot crosses the requested raw_samples threshold. Will be omitted if the scatterplot is 100% complete with no sub-sampling.

Future improvements

Initial implementation aims to aggregate over two fields for x/y, but a followup improvement will add filtering so that a single field can be used (with x/y being differentiated by some criteria in the filter:

{
  "scatterplot": {
    "x": {
      "field": "foo",
      "filter": {...}
    },
    "y": {
      "field": "foo",
      "filter": {...}
    },
    ...
  }
}

Could be possible to introduce a third metric (such as an avg of a field or something) instead of just raw counts
There are some techniques we can use to record outliers so that they are more likely to show up in the raw samples

Edit: Tom noted that two one-dimensional quantile sketches/histograms may not be sufficient or correct to generate a scatterplot. To quote:

This tells you about the marginals, but not the correlation structure. The reason this doesn’t work is there can be variation in the density of y as a function of x. In particular, you have a function like f(y | x) for density conditioned on value x.

So we may need to investigate an alternate method which is a little more dense, like storing a quadtree, etc.

The text was updated successfully, but these errors were encountered:

elasticmachine · 2020-04-29T14:17:05Z

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

polyfractal · 2020-04-29T14:18:26Z

/cc @wylieconlon

tveasey · 2020-10-12T14:25:16Z

Just to give a concrete example of why using just f(x) and f(y) fails. Imagine clusters of points around (x, y) = (1, 1) and (x, y) = (5, 5). This has exactly the same marginals as clusters of points around (x, y) = (1, 5) and (x, y) = (5, 1). In both cases you'd get spikes in the density around 1 and around 5 for both x and y. This isn't to say that these aren't useful, it's often very useful to look at projections of data onto individual axes, but they can't reconstruct the full density.

If you want to eke out compression performance passing back a representation of the full density, I think a quadtree would be a great choice. The regular regions can be keyed efficiently, you could use just 2 bits per subdivision of the plane, so provided you know the full data bounding box (blc, trc) you could pass back a collection of (32 bit ids, count) pairs to get all the resolution it would be possible to handle in a chart. You'd reconstruct the bottom left corner of a grid cell in this list using something like:

(dx, dy) = (trc - blc) / 2
(x, y) = blc
for (i = id; i > 0; i = i / 4, d = d / 2) {
  cell = i % 4
  x = x + (cell % 2) * dx
  y = y + (cell / 2) * dy
}

This is also much better than a uniform grid for variable density because you get the resolution where the data is, and, since the regions are rectangular, if you want to select a grid point, to see all the raw data it contains, you'd just have to run the appropriate range query derivable from the overall bounding box and its id.

(This also give you a nice way of creating a spatially stratified sample, i.e. use reservoir sampling for each cell and select a number points based on the relative count of points it contains. This all works shard local, but probably would need an upfront pass to get the count of points in each cell at the shard so may not be appropriate to fit into the aggs framework.)

$@polyfractal$ polyfractal added >feature :Analytics/Aggregations Aggregations labels Apr 29, 2020

elasticmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Apr 29, 2020

$@polyfractal$ polyfractal self-assigned this Apr 29, 2020

wylieconlon mentioned this issue Sep 18, 2020

[Lens] Allow a single operation to create multiple user-facing operations elastic/kibana#77949

Closed

$@polyfractal$ polyfractal removed their assignment Mar 18, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Scatterplot aggregation #55943

Scatterplot aggregation #55943

polyfractal commented Apr 29, 2020 •

edited

Loading

elasticmachine commented Apr 29, 2020

polyfractal commented Apr 29, 2020

tveasey commented Oct 12, 2020 •

edited

Loading

Scatterplot aggregation #55943

Scatterplot aggregation #55943

Comments

polyfractal commented Apr 29, 2020 • edited Loading

Algorithm overview

Request syntax

Response syntax

Future improvements

elasticmachine commented Apr 29, 2020

polyfractal commented Apr 29, 2020

tveasey commented Oct 12, 2020 • edited Loading

polyfractal commented Apr 29, 2020 •

edited

Loading

tveasey commented Oct 12, 2020 •

edited

Loading