Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scatterplot aggregation #55943

Open
polyfractal opened this issue Apr 29, 2020 · 3 comments
Open

Scatterplot aggregation #55943

polyfractal opened this issue Apr 29, 2020 · 3 comments
Labels
:Analytics/Aggregations Aggregations >feature Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)

Comments

@polyfractal
Copy link
Contributor

polyfractal commented Apr 29, 2020

This issue proposes a new scatterplot aggregation. Scatterplots are surprisingly tricky to build in ES right now. An approximation of a scatterplot can be generated with two histograms, but this is closer to a heatmap/density plot than a real scatterplot. It is also difficult to get a good density plot because of search.max_buckets limits, and not knowing the dynamic bounds of the data.

One of the useful aspects of a scatterplot is seeing the actual, raw data plotted as points which today is not easily achieved in a scalable manner (search with large size is not recommended, scroll can be slower and more difficult interface, top-hits / top-metrics can help to a degree but you may want a random sample not the "top" values).

The proposed aggregation would return a sample of raw points as well as an approximation of the total density, allowing charts like this to be created:

image

Algorithm overview

  • Up to a user-defined threshold, raw data will be buffered as x/y pairs
  • After the threshold is breached:
    • The buffer of points will be maintained with reservoir sampling. This means all points that are returned to the user had an equal probability of being selected, and should provide a reasonable distribution of raw points to overlay the density chart
    • Two TDigest sketches will be allocated to track the x and y axis.
  • The user will receive the buffer of raw points, and if it was over-threshold, an approximate density map constructed from the two TDigest sketches

This should give the user a relatively flexible tool. Under the threshold they can receive a true scatterplot, and over the threshold they get a scalable density estimate and sampling of raw points to overlay.

Request syntax

{
  "scatterplot": {
    "x": {
      "field": "foo"
    },
    "y": {
      "field": "bar"
    },
    "compression": 100,
    "raw_samples": 1000,
    "density_intervals": 10,
    "density_shape": "square | rect"
  }
}
  • x/y defines the fields for each axis
  • compression defines the TDigest compression value
  • raw_samples controls the threshold where we switch over to reservoir sampling and density estimation
  • density_intervals defines how many "buckets" or "pixels" of density we should return in the response. Internally this translates to how many points of the CDF we sample
  • density_shape defines if we should normalize the axis before sampling the CDF.
    • Square: to get "square" pixels, we need to make sure density estimates from both axis are equal in range. E.g. if all X values land in [1,10] but Y values land in [1,100], we need to make sure the CDF estimates we take from the X-axis are also in the range of [1,100]
    • Rectangle: if we don't want square pixels, we will sample from each axis independently, meaning X will take density estimates from [1,10] and Y will take estimates from [1,100], leading to rectangular pixels being returned

All names are up for debate, open to suggestions :)

Response syntax

{
  "scatterplot": {
    "axis": {
      "x": {
        "min": -20,
        "max": 100,
        "interval": 12
      },
      "y": {
        "min": 55,
        "max": 900,
        "interval": 84.5
      }
    }
    "samples": [
      [1.0, 1.0],
      [10.3, 2.6]
      ...
      ...
    ],
    "density_plot": [
      [22.3, 60.4, 33.0, ...],
      [40.0, 81.2, 98.7, ...],
    ]
  }
}
  • axis returns the boundaries of each axis and the interval size, to help the client to more easily setup the scatterplot chart
  • samples contains the raw samples (complete, or sub-sampled)
  • density_plot contains the density estimate if the scatterplot crosses the requested raw_samples threshold. Will be omitted if the scatterplot is 100% complete with no sub-sampling.

Future improvements

  • Initial implementation aims to aggregate over two fields for x/y, but a followup improvement will add filtering so that a single field can be used (with x/y being differentiated by some criteria in the filter:
{
  "scatterplot": {
    "x": {
      "field": "foo",
      "filter": {...}
    },
    "y": {
      "field": "foo",
      "filter": {...}
    },
    ...
  }
}
  • Could be possible to introduce a third metric (such as an avg of a field or something) instead of just raw counts
  • There are some techniques we can use to record outliers so that they are more likely to show up in the raw samples

Edit: Tom noted that two one-dimensional quantile sketches/histograms may not be sufficient or correct to generate a scatterplot. To quote:

This tells you about the marginals, but not the correlation structure. The reason this doesn’t work is there can be variation in the density of y as a function of x. In particular, you have a function like f(y | x) for density conditioned on value x.

So we may need to investigate an alternate method which is a little more dense, like storing a quadtree, etc.

@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

@elasticmachine elasticmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Apr 29, 2020
@polyfractal
Copy link
Contributor Author

/cc @wylieconlon

@tveasey
Copy link
Contributor

tveasey commented Oct 12, 2020

Just to give a concrete example of why using just f(x) and f(y) fails. Imagine clusters of points around (x, y) = (1, 1) and (x, y) = (5, 5). This has exactly the same marginals as clusters of points around (x, y) = (1, 5) and (x, y) = (5, 1). In both cases you'd get spikes in the density around 1 and around 5 for both x and y. This isn't to say that these aren't useful, it's often very useful to look at projections of data onto individual axes, but they can't reconstruct the full density.

If you want to eke out compression performance passing back a representation of the full density, I think a quadtree would be a great choice. The regular regions can be keyed efficiently, you could use just 2 bits per subdivision of the plane, so provided you know the full data bounding box (blc, trc) you could pass back a collection of (32 bit ids, count) pairs to get all the resolution it would be possible to handle in a chart. You'd reconstruct the bottom left corner of a grid cell in this list using something like:

(dx, dy) = (trc - blc) / 2
(x, y) = blc
for (i = id; i > 0; i = i / 4, d = d / 2) {
  cell = i % 4
  x = x + (cell % 2) * dx
  y = y + (cell / 2) * dy
}

This is also much better than a uniform grid for variable density because you get the resolution where the data is, and, since the regions are rectangular, if you want to select a grid point, to see all the raw data it contains, you'd just have to run the appropriate range query derivable from the overall bounding box and its id.

(This also give you a nice way of creating a spatially stratified sample, i.e. use reservoir sampling for each cell and select a number points based on the relative count of points it contains. This all works shard local, but probably would need an upfront pass to get the count of points in each cell at the shard so may not be appropriate to fit into the aggs framework.)

@polyfractal polyfractal removed their assignment Mar 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/Aggregations Aggregations >feature Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)
Projects
None yet
Development

No branches or pull requests

3 participants