New Histogram field mapper that supports percentiles aggregations. #48580

Merged
merged 30 commits into elastic:master from histogramField on Nov 28, 2019

Conversation

Contributor

@iverase iverase commented Oct 28, 2019

This PR explores the addition of a new histogram field mapper that consists of a pre-aggregated representation of numerical data to be used in percentiles aggregations.

Mapper

The new field is defined in a mapping using the following structure:

PUT /example
{
    "mappings": {
        "properties": {
            "aggregated": {
                "type": "histogram"
            }
        }
    }
}

And it can be populated using the following structure:

POST /example/_doc
{
    "aggregated" : {
        "values" :[0.1, 0.2, 0.3, 0.4, 0.5],
        "counts" : [5, 3, 14, 6, 4]
    }
}

where `values` is an array of doubles and `counts` is an array of integers; the two arrays must have the same length. This format is up for discussion, but the reason for choosing it is that such arrays can be generated easily from existing histogram implementations.

For TDigest, they can be generated using the following code (see example):

import com.tdunning.math.stats.Centroid;

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

List<Double> values = new ArrayList<>();
List<Integer> counts = new ArrayList<>();
Collection<Centroid> centroids = histogram.centroids();
for (Centroid centroid : centroids) {
    values.add(centroid.mean());   // centroid position becomes the bucket value
    counts.add(centroid.count());  // number of samples assigned to the centroid
}

For HDR histograms, they can be generated using the following code (see example):

import org.HdrHistogram.DoubleHistogramIterationValue;

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

List<Double> values = new ArrayList<>();
List<Integer> counts = new ArrayList<>();
Iterator<DoubleHistogramIterationValue> iterator = histogram.recordedValues().iterator();
while (iterator.hasNext()) {
    DoubleHistogramIterationValue histValue = iterator.next();
    values.add(histValue.getValueIteratedTo());                         // upper bound of the bucket
    counts.add(Math.toIntExact(histValue.getCountAtValueIteratedTo())); // samples recorded at that value
}

This structure is stored as a binary doc value.
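For illustration, here is a minimal sketch of such an encoding, using plain java.io streams rather than Elasticsearch's internal StreamOutput, and a fixed-width int instead of the variable-length int the mapper actually writes (the helper name is hypothetical):

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.List;

// Hypothetical sketch: serialize the parallel values/counts arrays into a
// single byte[] that can be stored as one binary doc value per document.
static byte[] encodeHistogram(List<Double> values, List<Integer> counts) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    for (int i = 0; i < values.size(); i++) {
        out.writeDouble(values.get(i)); // 8-byte bucket value
        out.writeInt(counts.get(i));    // the real mapper writes a vint here
    }
    return bytes.toByteArray();
}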

Aggregations

This field can be used in standard percentiles and percentile_ranks aggregations, and those aggregations can still be used together with standard numeric fields. In order for these aggregations to support this new format, the following interfaces have been created:

  • IndexHistogramFieldData: A new specialisation of field data for histograms.
  • AtomicHistogramFieldData: A new specialisation of atomic field data for histograms.
  • HistogramValues: The doc values returned by the atomic field data.
  • HistogramValue: A doc value representing one of those histograms.

The aggregations have been updated accordingly so they understand this new field data.
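For example, once indexed, the field can be queried with a regular percentiles aggregation (a sketch using the example index above):

POST /example/_search
{
    "size": 0,
    "aggs": {
        "percentiles_of_aggregated": {
            "percentiles": {
                "field": "aggregated"
            }
        }
    }
}

And a rough sketch of the shape of the last two interfaces (illustrative only; the exact signatures are in the PR):

// Illustrative shapes, not the exact code in the PR.
public interface HistogramValues {
    // Advance to the given doc; returns false if the doc has no histogram.
    boolean advanceExact(int doc) throws IOException;
    // The histogram for the current doc.
    HistogramValue histogram() throws IOException;
}

public interface HistogramValue {
    // Move to the next (value, count) pair; returns false once exhausted.
    boolean next() throws IOException;
    double value();
    int count();
}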

Open questions

  • Accuracy and suitability of the format.

  • The field mapper does not accept an array of histograms as input; does it need to support that?

  • Overflow of aggregations? As we are adding pre-aggregated data, it can theoretically happen that the aggregation histogram overflows more easily (I think the maximum total count is Long.MAX_VALUE).

Relates #48578

@elasticmachine
Collaborator

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

Contributor

@jpountz jpountz left a comment

I like it a lot. My main concern is that the parsing of aggregations was made lenient for this to work by accepting any field in the parser. I think it's fine that we use generic objects internally like ValuesSource and do instanceof calls to see how these objects should be handled. However, I'd like the parser to be strict and fail if a field is provided that is neither a numeric field nor a histogram field?

Let's add docs?

Accuracy and suitability of the format.

It looks good to me.

The field mapper does not accept an array of histograms as input; does it need to support that?

I don't think it needs to. We could document that this field only supports single values, like we do for vector fields (you might want to borrow the logic that vectors used to make sure fields are actually single-valued).

Overflow of aggregations? As we are adding pre-aggregated data, it can theoretically happen that the aggregation histogram overflows more easily (I think the maximum total count is Long.MAX_VALUE).

I think we should ignore this problem on our end and push this problem to the upstream libraries, since this is a general problem that they have given that they allow collecting multiple instances of the same value at once. The fact that you require counts to be integers and reject longs sounds to me like the right thing to do on our end.

return new HDRPercentilesAggregator(name, valuesSource, searchContext, parent, percents, numberOfSignificantValueDigits, keyed,
    config.format(), pipelineAggregators, metaData);
Contributor

can you try to undo formatting changes in this file and a couple other ones to keep the diff readable?

@@ -80,7 +79,7 @@
     static {
         PARSER = new ConstructingObjectParser<>(PercentileRanksAggregationBuilder.NAME, false,
             (a, context) -> new PercentileRanksAggregationBuilder(context, (List) a[0]));
-        ValuesSourceParserHelper.declareNumericFields(PARSER, true, false, false);
+        ValuesSourceParserHelper.declareAnyFields(PARSER, true, true);
Contributor

Can we try to be less lenient and require a field that is either numeric or a histogram?

Contributor Author

Yes, that makes sense and that is the tricky part. @not-napoleon I might want to borrow a bit of your thoughts on how to do this in the best way possible.

Member

The parser doesn't currently support that, and right now the convention is to use ANY for aggregations that can accept more than one type of input. Cleaning that up is a goal of the ValuesSource refactor effort I'm currently working on (see #42949)

Contributor

Thanks @not-napoleon, let's keep this as a TODO for later then, @iverase?

Contributor Author

Agreed. I think that supporting this requires the effort @not-napoleon is working on (thanks for taking a look!).

return getHistogramValue(values.binaryValue());
} catch (IOException e) {
throw new IllegalStateException("Cannot load doc value", e);
}
Contributor

Let's make histogram() throw an IOException for consistency with BinaryDocValues#binaryValue?

} catch (IOException e) {
throw new IllegalStateException("Cannot load doc values", e);
}
}
Contributor

We should throw an UnsupportedOperationException in the two above methods.

@Override
public SortField sortField(Object missingValue, MultiValueMode sortMode,
XFieldComparatorSource.Nested nested, boolean reverse) {
return null;
Contributor

this should throw an UnsupportedOperationException too

private HistogramValue getHistogramValue(final BytesRef bytesRef) throws IOException {
final ByteBufferStreamInput streamInput = new ByteBufferStreamInput(
ByteBuffer.wrap(bytesRef.bytes, bytesRef.offset, bytesRef.length));
final int numValues = streamInput.readVInt();
Contributor

We could also avoid storing the length and consider the iterator exhausted when all bytes have been read.
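A sketch of what that could look like, assuming the stream can report how many bytes remain (available() stands in here for whatever the real StreamInput API offers):

// Sketch: the iterator is exhausted once the buffer has no bytes left,
// so no leading vint with the number of entries is needed.
boolean next() throws IOException {
    if (streamInput.available() > 0) { // assumes available() reports remaining bytes
        value = streamInput.readDouble();
        count = streamInput.readVInt();
        return true;
    }
    return false;
}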

context.path().add(simpleName());
try {
List<Double> values = null;
List<Integer> counts = null;
Contributor

let's use native variants, e.g. IntArrayList and DoubleArrayList?
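For reference, a usage sketch of those primitive-backed collections (assuming the com.carrotsearch.hppc dependency that Elasticsearch already ships):

import com.carrotsearch.hppc.DoubleArrayList;
import com.carrotsearch.hppc.IntArrayList;

// Primitive-backed lists avoid boxing a Double/Integer per histogram entry.
DoubleArrayList values = new DoubleArrayList();
IntArrayList counts = new IntArrayList();
values.add(0.1);
counts.add(5);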

if (values.size() == 0) {
throw new MapperParsingException("error parsing field ["
+ name() + "], arrays for values and counts cannot be empty");
}
Contributor

I wonder whether we should actually fail here or not.

Contributor

On the other hand, maybe we should require that values come in order and fail if there are duplicate values?

Contributor Author

Regarding empty arrays: yes, we can just ignore them instead of failing.

I like the idea of requiring values to be ordered and disallowing duplicates. Before implementing, I will wait for more input.

Contributor Author

I updated the code so that empty arrays are now ignored, as we do with null values.

In addition, we now require that values are provided in increasing order.
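A minimal sketch of that ordering check during parsing (variable names are hypothetical; equal consecutive values are tolerated, since t-digest centroids can legitimately repeat a mean):

// Sketch: reject values that are not provided in increasing order.
double previous = Double.NEGATIVE_INFINITY;
for (int i = 0; i < values.size(); i++) {
    double current = values.get(i);
    if (current < previous) {
        throw new MapperParsingException("error parsing field [" + name()
            + "], values must be provided in increasing order");
    }
    previous = current;
}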

@felixbarny
Member

Really looking forward to this 😍
The API also looks great to me!

Question: does the values array have to contain the same values for each doc? In other words, does the first doc determine the structure of the histogram?

Is it possible to extend the range in subsequent documents? Example:

POST /example/_doc
{
    "aggregated" : {
        "values" :[0.1, 0.2],
        "counts" : [1, 2]
    }
}

POST /example/_doc
{
    "aggregated" : {
        "values" :[0.1, 0.2, 0.3],
        "counts" : [1, 2, 3]
    }
}

Is it possible to use different buckets in subsequent documents? Example:

POST /example/_doc
{
    "aggregated" : {
        "values" :[0.1, 0.2],
        "counts" : [1, 2]
    }
}

POST /example/_doc
{
    "aggregated" : {
        "values" :[0.15, 0.2, 0.25],
        "counts" : [1, 2, 3]
    }
}

Is it possible to omit buckets with a count of zero? Example:

POST /example/_doc
{
    "aggregated" : {
        "values" :[0.1, 0.2, 0.3],
        "counts" : [1, 2, 3]
    }
}

POST /example/_doc
{
    "aggregated" : {
        "values" :[0.1, 0.3],
        "counts" : [1, 0, 3]
    }
}

These are not necessarily requirements we have in APM, just trying to get a feel for what's possible.

import java.util.List;


public class HistogramAggregationTests extends ESSingleNodeTestCase {
Contributor

This naming is a bit confusing since there is also a histogram aggregation, which is completely different from this new field. Maybe we should call this HistogramPercentileAggregationTests or something else?

Contributor Author

I originally created this test to check different approaches. I renamed it following your suggestion.

@iverase
Contributor Author

iverase commented Oct 29, 2019

@felixbarny

Is it possible to extend the range in subsequent documents?
Is it possible to use different buckets in subsequent documents?

Yes, that is possible: every histogram is independent of the others, so each can have different buckets and a different number of buckets.

Is it possible to omit buckets with a value of zero?

The count for a bucket can be zero but the bucket must be present. In general, the values and counts arrays must have the same length. We are still considering some other changes; in particular, we might require that buckets are ordered and that you cannot have the same bucket twice.

@colings86
Contributor

you cannot have the same bucket twice.

One thing to note is that it's possible, in TDigest particularly, to have multiple buckets with the same value. This can happen if the same value is inserted enough times to exceed the threshold for the centroid, which forces a new centroid with the same value to be created. If we move to rejecting histograms that contain the same bucket more than once, it will make the code for extracting the TDigest histogram (from your PR description) less simple, since the user will need to keep track of whether the value has changed when moving to the next centroid.

I don't see a problem with requiring ordered buckets though

@jpountz
Contributor

jpountz commented Oct 29, 2019

Good point Colin, I agree that rejecting duplicates would make it tricky to work with t-digests.

@iverase
Contributor Author

iverase commented Oct 30, 2019

If we allow duplicates, I don't think we need to require buckets to be ordered either. WDYT?

@jpountz
Contributor

jpountz commented Oct 30, 2019

I think we should at least make sure values are sorted in the doc-value representation. Then we could either sort buckets ourselves or require users to provide us with sorted buckets.
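A sketch of sorting the buckets ourselves before writing the doc value (a hypothetical helper, not code from the PR):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Sort parallel (value, count) pairs by value so the doc-value
// representation is always ordered, whatever order the user sent.
static void sortByValue(List<Double> values, List<Integer> counts) {
    Integer[] order = new Integer[values.size()];
    for (int i = 0; i < order.length; i++) {
        order[i] = i;
    }
    Arrays.sort(order, Comparator.comparingDouble(values::get));
    List<Double> sortedValues = new ArrayList<>(values.size());
    List<Integer> sortedCounts = new ArrayList<>(counts.size());
    for (int i : order) {
        sortedValues.add(values.get(i));
        sortedCounts.add(counts.get(i));
    }
    values.clear();
    values.addAll(sortedValues);
    counts.clear();
    counts.addAll(sortedCounts);
}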

@iverase iverase added the v7.6.0 label Nov 27, 2019
Contributor

@jpountz jpountz left a comment

I'd like the ignoreMalformed logic to be a bit more robust, but other than that it looks good to me.

-can be extracted either from specific numeric fields in the documents, or
-be generated by a provided script.
+over numeric values extracted from the aggregated documents. These values can be
+generated by a provided script or extracted from specific numeric or histogram
Contributor

add a link to the histogram field?

Map<String, Object> node, ParserContext parserContext)
throws MapperParsingException {
Builder builder = new HistogramFieldMapper.Builder(name);
parseField(builder, name, node, parserContext);
Contributor

Let's not call parseField, none of the properties that it parses are supported by this field?

Contributor Author

@iverase iverase Nov 28, 2019

I am supporting them:

  • the user can disable doc values
  • if the user sets index, index options, or stored values, an error will be thrown. Note that setting index or stored values to false is allowed.

I am not handling boost, similarity, and copy_to. Shall we throw an error if the user defines those fields?

} else if (count > 0) {
// we do not add elements with count == 0
streamOutput.writeDouble(values.get(i));
streamOutput.writeVInt(count);
Contributor

I'd suggest putting the count before the value; it might make it easier to compress better in the future by stealing bits from the count.

if (ignoreMalformed.value() == false) {
throw new MapperParsingException("failed to parse field [{}] of type [{}]",
ex, fieldType().name(), fieldType().typeName());
}
Contributor

This is what XContentSubParser has been designed for, see #35603. Maybe it would be more robust? By the way, looking at the latest version of GeoShapeFieldMapper, it looks like it no longer handles ignoreMalformed correctly, or am I misreading it? cc @imotov

Contributor

@polyfractal polyfractal left a comment

I think the docs look good. I just left a few minor comments/tweaks to make a certain aspect more explicit; otherwise I think they are fine 👍

I've been following the code changes from afar; I think Adrien has the code review aspect covered, so I'll defer to him there :)

histogram was constructed. It is important to consider the percentiles aggregation mode that will be used
to build it. Some possibilities include:

- For the <<search-aggregations-metrics-percentile-aggregation, T-Digest>> mode, histograms
Contributor

Hmm, trying to tweak this a little to make it more explicit, so the user knows what the value/count fields do.

  • For the <<search-aggregations-metrics-percentile-aggregation, T-Digest>> mode, the values array represents the mean centroid positions and the counts array represents the number of values that are attributed to each centroid. If the algorithm has already started to approximate the percentiles, this inaccuracy is carried over in the histogram.

WDYT?

can be built by using the mean value of the centroids and the centroid's count. If the algorithm has already
started to approximate the percentiles, this inaccuracy is carried over in the histogram.

- For the <<_hdr_histogram,High Dynamic Range (HDR)>> histogram mode, histograms
Contributor

Similarly,

  • For the <<_hdr_histogram,High Dynamic Range (HDR)>> histogram mode, the values array represents fixed upper limits of each bucket interval, and the counts array represents the number of values that are attributed to each interval. This implementation maintains a fixed worst-case percentage error (specified as a number of significant digits), therefore the value used when generating the histogram would be the maximum accuracy you can achieve at aggregation time.

??

can be created by using the recorded values and the count at that value. This implementation maintains a fixed worst-case
percentage error (specified as a number of significant digits), therefore the value used when generating the histogram
would be the maximum accuracy you can achieve at aggregation time.

Contributor

Perhaps another sentence/paragraph at the end?

The histogram field is "algorithm agnostic" and does not store data specific to either T-Digest or HDRHistogram. While this means the field can technically be aggregated with either algorithm, in practice the user should choose one algorithm and index data in that manner (e.g. centroids for T-Digest or intervals for HDRHistogram) to ensure best accuracy.

Or something similar... trying to convey to the user that how they index the data is important and that they should choose upfront.

@iverase
Contributor Author

iverase commented Nov 28, 2019

@jpountz, I changed the logic so that I am now using XContentSubParser to handle ignore_malformed (as it should be).

My only doubt is about parse fields. I am supporting doc values so a user can disable them. It might be useful if a user wants to store the histogram fields (already parsed) without doc values, so that at a later stage they can be reindexed with doc values enabled, WDYT?

Should we handle boost, similarity, and copy_to and throw an error if the user defines them?

@jpountz
Contributor

jpountz commented Nov 28, 2019

Given that it's always easier to add features than remove them, I'd be in favor of only supporting ignore_malformed for now?

@iverase
Contributor Author

iverase commented Nov 28, 2019

Great, that makes things simpler. I removed support for parse fields.

token = context.parser().nextToken();
if (subParser != null) {
while (token != null) {
token = subParser.nextToken();
}
Contributor

Do subParser.close() instead?
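That is, a sketch of the suggested replacement (XContentSubParser.close() consumes any remaining tokens of the wrapped object):

if (subParser != null) {
    // close() skips to the end of the sub-object, replacing the manual drain loop
    subParser.close();
}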

`histogram` fields are primarily intended for use with aggregations. To make it
more readily accessible for aggregations, `histogram` field data is stored as a
binary <<doc-values,doc values>> and not indexed. Its size in bytes is at most
`12 * numValues`, where `numValues` is the length of the provided arrays.
Contributor

I think it's actually 13, since vints can take up to 5 bytes (8 bytes for the double plus up to 5 bytes for the vint-encoded count).

@iverase
Contributor Author

iverase commented Nov 28, 2019

@elasticmachine run elasticsearch-ci/bwc
@elasticmachine run elasticsearch-ci/default-distro

@iverase
Contributor Author

iverase commented Nov 28, 2019

@elasticmachine update branch

@iverase iverase merged commit eade4f0 into elastic:master Nov 28, 2019
iverase added a commit that referenced this pull request Nov 28, 2019
…48580) (#49683)

This commit adds a new histogram field mapper that consists of a pre-aggregated format of numerical data to be used in percentiles aggregations.
@iverase iverase deleted the histogramField branch July 9, 2020 10:02