Scatterplot aggregation #55943
Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)
/cc @wylieconlon
Just to give a concrete example of why using just f(x) and f(y) fails. Imagine clusters of points around (x, y) = (1, 1) and (x, y) = (5, 5). This has exactly the same marginals as clusters of points around (x, y) = (1, 5) and (x, y) = (5, 1). In both cases you'd get spikes in the density around 1 and around 5 for both x and y. This isn't to say that these aren't useful, it's often very useful to look at projections of data onto individual axes, but they can't reconstruct the full density. If you want to eke out compression performance passing back a representation of the full density, I think a quadtree would be a great choice. The regular regions can be keyed efficiently, you could use just 2 bits per subdivision of the plane, so provided you know the full data bounding box (blc, trc) you could pass back a collection of (32 bit ids, count) pairs to get all the resolution it would be possible to handle in a chart. You'd reconstruct the bottom left corner of a grid cell in this list using something like:
This is also much better than a uniform grid for variable density because you get the resolution where the data is. And since the regions are rectangular, if you want to select a grid cell to see all the raw data it contains, you just have to run the appropriate range query, which is derivable from the overall bounding box and the cell's id. (This also gives you a nice way of creating a spatially stratified sample, i.e. use reservoir sampling for each cell and select a number of points based on the relative count of points it contains. This all works shard-local, but would probably need an upfront pass to get the count of points in each cell on the shard, so it may not be appropriate to fit into the aggs framework.)
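A minimal sketch of how a quadtree cell id could be turned back into cell bounds (and therefore into a range query on x and y). This is not the snippet from the comment; the bit layout is an assumption made for illustration: two bits per subdivision, most significant pair first, with the low bit of each pair selecting the right half and the high bit selecting the top half. The function name `cell_bounds` and the explicit `depth` argument are hypothetical.

```python
from typing import Tuple

Point = Tuple[float, float]


def cell_bounds(cell_id: int, depth: int, blc: Point, trc: Point) -> Tuple[Point, Point]:
    """Return (bottom-left, top-right) corners of a quadtree cell.

    Assumed layout (illustrative only): `cell_id` holds `depth` pairs of bits,
    most significant pair first. In each pair, bit 0 picks the right half of
    the current region and bit 1 picks the top half.
    """
    x0, y0 = blc
    x1, y1 = trc
    for level in range(depth):
        shift = 2 * (depth - 1 - level)
        quadrant = (cell_id >> shift) & 0b11
        x_mid = (x0 + x1) / 2
        y_mid = (y0 + y1) / 2
        if quadrant & 0b01:
            x0 = x_mid  # right half
        else:
            x1 = x_mid  # left half
        if quadrant & 0b10:
            y0 = y_mid  # top half
        else:
            y1 = y_mid  # bottom half
    return (x0, y0), (x1, y1)


# Two subdivisions of the box (0, 0)-(8, 8): top-right, then bottom-left.
bottom_left, top_right = cell_bounds(0b1100, 2, (0.0, 0.0), (8.0, 8.0))
print(bottom_left, top_right)
```

The returned corners are exactly the bounds a range query on the two fields would need in order to fetch the raw points inside the selected cell. With 32-bit ids this layout would allow up to 16 subdivisions, though a real encoding would also need some way to carry the depth.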
This issue proposes a new `scatterplot` aggregation. Scatterplots are surprisingly tricky to build in ES right now. An approximation of a scatterplot can be generated with two histograms, but this is closer to a heatmap/density plot than a real scatterplot. It is also difficult to get a good density plot because of `search.max_buckets` limits, and not knowing the dynamic bounds of the data.

One of the useful aspects of a scatterplot is seeing the actual, raw data plotted as points, which today is not easily achieved in a scalable manner (search with a large `size` is not recommended, scroll can be slower and a more difficult interface, top-hits / top-metrics can help to a degree but you may want a random sample, not the "top" values).

The proposed aggregation would return a sample of raw points as well as an approximation of the total density, allowing charts like this to be created:
Algorithm overview
This should give the user a relatively flexible tool. Under the threshold they can receive a true scatterplot, and over the threshold they get a scalable density estimate and sampling of raw points to overlay.
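The threshold behaviour described here can be sketched roughly as follows. This is only an illustration, not the proposed implementation: it keeps every point until `raw_samples` is reached and then falls back to standard reservoir sampling (Algorithm R). The class name `PointSampler` and its attributes are hypothetical.

```python
import random
from typing import List, Optional, Tuple

Point = Tuple[float, float]


class PointSampler:
    """Keep all points up to `raw_samples`, then a uniform random sample."""

    def __init__(self, raw_samples: int, seed: Optional[int] = None) -> None:
        self.raw_samples = raw_samples
        self.seen = 0
        self.samples: List[Point] = []
        self._rng = random.Random(seed)

    def add(self, point: Point) -> None:
        self.seen += 1
        if len(self.samples) < self.raw_samples:
            # Under the threshold: this is still a true scatterplot.
            self.samples.append(point)
            return
        # Over the threshold: Algorithm R keeps each point seen so far
        # with probability raw_samples / seen.
        j = self._rng.randrange(self.seen)
        if j < self.raw_samples:
            self.samples[j] = point

    @property
    def is_complete(self) -> bool:
        """True while no point has been dropped, i.e. no sub-sampling yet."""
        return self.seen <= self.raw_samples


sampler = PointSampler(raw_samples=1000, seed=42)
for i in range(5000):
    sampler.add((float(i), float(i % 7)))
print(len(sampler.samples), sampler.is_complete)
```

When `is_complete` is false, a client would want the density estimate alongside the samples, since the sampled points alone no longer show the full distribution.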
Request syntax
- `x`/`y` defines the fields for each axis
- `compression` defines the TDigest compression value
- `raw_samples` controls the threshold where we switch over to reservoir sampling and density estimation
- `density_intervals` defines how many "buckets" or "pixels" of density we should return in the response. Internally this translates to how many points of the CDF we sample
- `density_shape` defines if we should normalize the axis before sampling the CDF.
  - With normalization, if X values land in `[1,10]` but Y values land in `[1,100]`, we need to make sure the CDF estimates we take from the X-axis are also in the range of `[1,100]`
  - Otherwise, X will take estimates from `[1,10]` and Y will take estimates from `[1,100]`, leading to rectangular pixels being returned

All names are up for debate, open to suggestions :)
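To make the relationship between `density_intervals` and CDF sampling concrete, here is a rough per-axis sketch. It assumes an exact empirical CDF over a sorted list as a stand-in for a TDigest, and the function names `empirical_cdf` and `axis_density` are hypothetical. The CDF is sampled at `density_intervals + 1` evenly spaced edges, and adjacent samples are differenced to get the fraction of points in each bucket.

```python
from bisect import bisect_right
from typing import List, Sequence


def empirical_cdf(sorted_values: Sequence[float], x: float) -> float:
    """Fraction of values <= x (exact stand-in for a TDigest CDF estimate)."""
    return bisect_right(sorted_values, x) / len(sorted_values)


def axis_density(values: Sequence[float], density_intervals: int) -> List[float]:
    """Fraction of points falling in each of `density_intervals` equal-width buckets."""
    if not values or density_intervals < 1:
        raise ValueError("need at least one value and one interval")
    data = sorted(values)
    lo, hi = data[0], data[-1]
    width = (hi - lo) / density_intervals
    edges = [lo + i * width for i in range(density_intervals + 1)]
    edges[-1] = hi  # avoid floating point drift on the last edge
    cdf = [empirical_cdf(data, edge) for edge in edges]
    cdf[0] = 0.0  # count the minimum value in the first bucket
    return [cdf[i + 1] - cdf[i] for i in range(density_intervals)]


print(axis_density([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 3))
```

This only describes one axis at a time; as noted elsewhere in this issue, two per-axis estimates are not enough to recover the joint two-dimensional density.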
Response syntax
- `axis` returns the boundaries of each axis and the interval size, to help the client more easily set up the scatterplot chart
- `samples` contains the raw samples (complete, or sub-sampled)
- `density_plot` contains the density estimate if the scatterplot crosses the requested `raw_samples` threshold. Will be omitted if the scatterplot is 100% complete with no sub-sampling.

Future improvements
- … `avg` of a field or something) instead of just raw counts

Edit: Tom noted that two one-dimensional quantile sketches/histograms may not be sufficient or correct to generate a scatterplot. To quote:
So we may need to investigate an alternate method which is a little more dense, like storing a quadtree, etc.
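The limitation of per-axis sketches can be checked with a small example, using the two cluster layouts mentioned in the comments: points around (1, 1) and (5, 5) versus points around (1, 5) and (5, 1). The sketch below uses exact counts rather than quantile sketches, and the helper name `marginals` is hypothetical.

```python
from collections import Counter
from typing import List, Tuple

Point = Tuple[int, int]


def marginals(points: List[Point]) -> Tuple[Counter, Counter]:
    """Per-axis value counts, i.e. what two independent 1-D histograms see."""
    xs = Counter(x for x, _ in points)
    ys = Counter(y for _, y in points)
    return xs, ys


diagonal = [(1, 1)] * 50 + [(5, 5)] * 50
anti_diagonal = [(1, 5)] * 50 + [(5, 1)] * 50

# Both layouts produce identical spikes at 1 and 5 on each axis...
print(marginals(diagonal) == marginals(anti_diagonal))
# ...even though the joint 2-D distributions are different.
print(Counter(diagonal) == Counter(anti_diagonal))
```

Since the two layouts are indistinguishable from their marginals alone, any structure that is meant to reproduce the scatterplot density has to record joint (x, y) cell counts, which is what a quadtree or a 2-D grid would do.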