New Histogram field mapper that supports percentiles aggregations. #48580

Merged Nov 28, 2019 (30 commits). The changes shown below are from 22 commits.

Commits
c4bfdb7
Add HistogramField.
iverase Oct 28, 2019
550394c
checkStyle
iverase Oct 28, 2019
9d4f9c4
more checkStyle
iverase Oct 28, 2019
4e3eed7
Addressed part of the review
iverase Oct 29, 2019
a168d32
Extract the logic of creating a new histogram to a separate method
iverase Oct 29, 2019
038d429
Addressed more comments.
iverase Oct 29, 2019
edc2faf
formatting
iverase Oct 29, 2019
c527aec
extract logic for getting histogram in TDigest
iverase Oct 29, 2019
bd59238
remove unused imports
iverase Oct 29, 2019
71886a8
rename test class
iverase Oct 29, 2019
793a257
Detect in the constructor if we expect histogram value source
iverase Oct 29, 2019
579c05c
revert last change
iverase Oct 29, 2019
af1249f
Values must be provided in increasing order
iverase Oct 31, 2019
1cb8f53
Handling null value and do not fail if arrays are empty, trate it as a
iverase Oct 31, 2019
93229e5
Handle ignore malformed properly
iverase Oct 31, 2019
996f8fc
Merge branch 'master' into histogramField
iverase Oct 31, 2019
edec448
initial documentation for the new field
iverase Oct 31, 2019
adf12a4
initial documentation for the new field
iverase Oct 31, 2019
3c5892e
Addressed docs review
iverase Nov 1, 2019
19f15a2
Add HistogramFieldTypeTests
iverase Nov 1, 2019
1f6383d
address last review comments
iverase Nov 3, 2019
fe039ee
Merge branch 'master' into histogramField
iverase Nov 3, 2019
40f679d
Merge branch 'master' into histogramField
iverase Nov 15, 2019
79f7fd9
Merge branch 'master' into histogramField
iverase Nov 27, 2019
fbabf1c
Make sure that in ignore malformed we move to the end of the
iverase Nov 27, 2019
f1a1ead
address review comments
iverase Nov 28, 2019
c8a1f12
remove support for parsed fields
iverase Nov 28, 2019
0045a8b
Merge branch 'master' into histogramField
elasticmachine Nov 28, 2019
f8cf1a7
addressed last comments
iverase Nov 28, 2019
2e8649a
Merge branch 'histogramField' of github.com:iverase/elasticsearch int…
iverase Nov 28, 2019
@@ -2,9 +2,9 @@
=== Percentiles Aggregation

A `multi-value` metrics aggregation that calculates one or more percentiles
-over numeric values extracted from the aggregated documents. These values
-can be extracted either from specific numeric fields in the documents, or
-be generated by a provided script.
+over numeric values extracted from the aggregated documents. These values can be
+generated by a provided script or extracted from specific numeric or histogram

Review comment (Contributor):

add a link to the histogram field?

+fields in the documents.

Percentiles show the point at which a certain percentage of observed values
occur. For example, the 95th percentile is the value which is greater than 95%
@@ -2,9 +2,9 @@
=== Percentile Ranks Aggregation

A `multi-value` metrics aggregation that calculates one or more percentile ranks
-over numeric values extracted from the aggregated documents. These values
-can be extracted either from specific numeric fields in the documents, or
-be generated by a provided script.
+over numeric values extracted from the aggregated documents. These values can be
+generated by a provided script or extracted from specific numeric or histogram

Review comment (Contributor):

add a link to the histogram field?

+fields in the documents.

[NOTE]
==================================================
5 changes: 5 additions & 0 deletions docs/reference/mapping/types.asciidoc
@@ -32,6 +32,7 @@ string:: <<text,`text`>> and <<keyword,`keyword`>>
<<ip>>:: `ip` for IPv4 and IPv6 addresses
<<completion-suggester,Completion datatype>>::
`completion` to provide auto-complete suggestions

<<token-count>>:: `token_count` to count the number of tokens in a string
{plugins}/mapper-murmur3.html[`mapper-murmur3`]:: `murmur3` to compute hashes of values at index-time and store them in the index
{plugins}/mapper-annotated-text.html[`mapper-annotated-text`]:: `annotated-text` to index text containing special markup (typically used for identifying named entities)
@@ -56,6 +57,8 @@ string:: <<text,`text`>> and <<keyword,`keyword`>>

<<shape>>:: `shape` for arbitrary cartesian geometries.

<<histogram>>:: `histogram` for pre-aggregated numerical values for percentiles aggregations.

[float]
[[types-array-handling]]
=== Arrays
@@ -91,6 +94,8 @@ include::types/date_nanos.asciidoc[]

include::types/dense-vector.asciidoc[]

include::types/histogram.asciidoc[]

include::types/flattened.asciidoc[]

include::types/geo-point.asciidoc[]
116 changes: 116 additions & 0 deletions docs/reference/mapping/types/histogram.asciidoc
@@ -0,0 +1,116 @@
[role="xpack"]
[testenv="basic"]
[[histogram]]
=== Histogram datatype
++++
<titleabbrev>Histogram</titleabbrev>
++++

A field to store pre-aggregated numerical data representing a histogram.
This data is defined using two paired arrays:

* A `values` array of <<number, `double`>> numbers, representing the buckets for
the histogram. These values must be provided in ascending order.
* A corresponding `counts` array of <<number, `integer`>> numbers, representing how
many values fall into each bucket. These numbers must be positive or zero.

Because the elements in the `values` array correspond to the elements in the
same position of the `counts` array, these two arrays must have the same length.
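
The three constraints above (equal lengths, ascending `values`, non-negative `counts`) can be sketched as a small check. This is only an illustration: `HistogramInputCheck` and `validate` are hypothetical names, not part of this PR, and the sketch assumes that only a decreasing value is an error, since the text does not say whether two equal neighbouring values are accepted.

```java
/** Hypothetical helper (not part of this PR) that checks the documented constraints. */
final class HistogramInputCheck {

    /** Returns null when the input is acceptable, otherwise a short description of the problem. */
    static String validate(double[] values, int[] counts) {
        // The arrays are paired by position, so they must have the same length.
        if (values.length != counts.length) {
            return "values and counts must have the same length";
        }
        for (int i = 0; i < values.length; i++) {
            // Counts must be positive or zero.
            if (counts[i] < 0) {
                return "counts[" + i + "] is negative";
            }
            // Assumption: only a strictly smaller value than its predecessor is rejected.
            if (i > 0 && values[i] < values[i - 1]) {
                return "values[" + i + "] is smaller than the previous value";
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(validate(new double[] {0.1, 0.2, 0.3}, new int[] {3, 7, 23})); // null
        System.out.println(validate(new double[] {0.1, 0.2}, new int[] {3}));
        System.out.println(validate(new double[] {0.2, 0.1}, new int[] {3, 7}));
    }
}
```

Returning a message instead of throwing keeps the sketch neutral about how a real mapper would report malformed input.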

Review comment (Contributor):

Not strictly needed for MVP, but it might be nice to add some text for context and split up the index creation and document indexing snippets.

For example:

[[histogram-ex]]
==== Examples

The following <<indices-create-index, create index>> API request creates a new index with two field mappings:

* `my_histogram`, a `histogram` field used to store percentile data
* `my_text`, a `keyword` field used to store a title for the histogram

[ INSERT CREATE INDEX SNIPPET ]
...

The following <<docs-index_,index>> API requests store pre-aggregated data for two histograms: `histogram_1` and `histogram_2`.

[ INSERT DOC INDEX SNIPPET ]
...

Providing an example use case for the data may also be helpful. For example, the histograms could represent load time, similar to the percentile aggs docs. Not required for MVP though.

[IMPORTANT]
========
* A `histogram` field can only store a single pair of `values` and `counts` arrays
per document. Nested arrays are not supported.
* `histogram` fields do not support sorting.
========

[[histogram-uses]]
==== Uses

`histogram` fields are primarily intended for use with aggregations. To make it
more readily accessible for aggregations, `histogram` field data is stored as a
binary <<doc-values,doc values>> and not indexed. Its size in bytes is at most
`12 * numValues`, where `numValues` is the length of the provided arrays.
Review comment (Contributor):

I think it's actually 13 since vints can take up to 5 bytes.
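
The arithmetic behind this comment can be sketched as follows, assuming each bucket is stored as one 8-byte double plus one variable-length integer (vint) for the count. That layout is an assumption inferred from the 12 and 13 figures in this thread; `VIntSize` and `vIntBytes` are hypothetical names used only for illustration.

```java
/** Hypothetical sketch: byte length of a variable-length int that stores 7 payload bits per byte. */
final class VIntSize {

    /** Number of bytes needed to write a non-negative int 7 bits at a time. */
    static int vIntBytes(int value) {
        int bytes = 1;
        // Every byte carries 7 bits of payload; keep going while higher bits remain.
        while ((value & ~0x7F) != 0) {
            value >>>= 7;
            bytes++;
        }
        return bytes;
    }

    public static void main(String[] args) {
        System.out.println(vIntBytes(127));                              // 1
        System.out.println(vIntBytes(128));                              // 2
        System.out.println(vIntBytes(Integer.MAX_VALUE));                // 5
        // Worst case per bucket under the assumed layout: 8-byte double + 5-byte vint.
        System.out.println(Double.BYTES + vIntBytes(Integer.MAX_VALUE)); // 13
    }
}
```

A 31-bit count needs five 7-bit groups, which is where the extra byte over a fixed 4-byte integer comes from.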


Because the data is not indexed, you can only use `histogram` fields for the
following aggregations and queries:

* <<search-aggregations-metrics-percentile-aggregation,percentiles>> aggregation
* <<search-aggregations-metrics-percentile-rank-aggregation,percentile ranks>> aggregation
* <<query-dsl-exists-query,exists>> query

We recommend you define the buckets in the `values` array based on the type of aggregation you intend to use.

[[mapping-types-histogram-building-histogram]]
==== Building a histogram

When using a histogram as part of an aggregation, the accuracy of the results will depend on how the
histogram was constructed. It is important to consider the percentiles aggregation mode that will be used
to build it. Some possibilities include:

- For the <<search-aggregations-metrics-percentile-aggregation, T-Digest>> mode, histograms
Review comment (Contributor):

Hmm, trying to tweak this a little to make it more explicit, so the user knows what the value/count fields do.

  • For the <<search-aggregations-metrics-percentile-aggregation, T-Digest>> mode, the values array represents the mean centroid positions and the counts array represents the number of values that are attributed to each centroid. If the algorithm has already started to approximate the percentiles, this inaccuracy is carried over in the histogram.

WDYT?

can be built by using the mean value of the centroids and the centroid's count. If the algorithm has already
started to approximate the percentiles, this inaccuracy is carried over in the histogram.

- For the <<_hdr_histogram,High Dynamic Range (HDR)>> histogram mode, histograms
Review comment (Contributor):

Similarly,

  • For the <<_hdr_histogram,High Dynamic Range (HDR)>> histogram mode, the values array represents fixed upper limits of each bucket interval, and the counts array represents the number of values that are attributed to each interval. This implementation maintains a fixed worst-case percentage error (specified as a number of significant digits), therefore the value used when generating the histogram would be the maximum accuracy you can achieve at aggregation time.

??

can be created by using the recorded values and the count at that value. This implementation maintains a fixed worst-case
percentage error (specified as a number of significant digits), therefore the value used when generating the histogram
would be the maximum accuracy you can achieve at aggregation time.

Review comment (Contributor):

Perhaps another sentence/paragraph at the end?

The histogram field is "algorithm agnostic" and does not store data specific to either T-Digest or HDRHistogram. While this means the field can technically be aggregated with either algorithm, in practice the user should choose one algorithm and index data in that manner (e.g. centroids for T-Digest or intervals for HDRHistogram) to ensure best accuracy.

Or something similar... trying to convey to the user that how they index the data is important and they should choose upfront.

[[histogram-ex]]
==== Examples

The following <<indices-create-index, create index>> API request creates a new index with two field mappings:

* `my_histogram`, a `histogram` field used to store percentile data
* `my_text`, a `keyword` field used to store a title for the histogram

[source,console]
--------------------------------------------------
PUT my_index
{
  "mappings": {
    "properties": {
      "my_histogram": {
        "type" : "histogram"
      },
      "my_text" : {
        "type" : "keyword"
      }
    }
  }
}
--------------------------------------------------

The following <<docs-index_,index>> API requests store pre-aggregated data for
two histograms: `histogram_1` and `histogram_2`.

[source,console]
--------------------------------------------------
PUT my_index/_doc/1
{
  "my_text" : "histogram_1",
  "my_histogram" : {
    "values" : [0.1, 0.2, 0.3, 0.4, 0.5], <1>
    "counts" : [3, 7, 23, 12, 6] <2>
  }
}

PUT my_index/_doc/2
{
  "my_text" : "histogram_2",
  "my_histogram" : {
    "values" : [0.1, 0.25, 0.35, 0.4, 0.45, 0.5], <1>
    "counts" : [8, 17, 8, 7, 6, 2] <2>
  }
}
--------------------------------------------------
<1> Values for each bucket. Values in the array are treated as doubles and must be given in
increasing order. For <<search-aggregations-metrics-percentile-aggregation-approximation, T-Digest>>
histograms this value represents the mean value. In case of HDR histograms this represents the value iterated to.
<2> Count for each bucket. Values in the arrays are treated as integers and must be positive or zero.
Negative values will be rejected. The relation between a bucket and a count is given by the position in the array.
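
To make the pairing by position concrete, here is a minimal sketch that reads a percentile straight from a `values`/`counts` pair using a plain nearest-rank rule over cumulative counts. This is not how the percentiles aggregation works (it feeds the pairs into T-Digest or HDR histograms); the nearest-rank rule is swapped in for illustration only, and `NearestRankPercentile` is a hypothetical name.

```java
/** Hypothetical sketch: nearest-rank percentile over a pre-aggregated histogram. */
final class NearestRankPercentile {

    /** Returns the bucket value at the given percent (0 to 100), or NaN for an empty histogram. */
    static double percentile(double[] values, int[] counts, double percent) {
        long total = 0;
        for (int count : counts) {
            total += count;
        }
        if (total == 0) {
            return Double.NaN;
        }
        // Nearest-rank: the smallest rank r such that r / total >= percent / 100.
        long rank = Math.max(1, (long) Math.ceil(percent / 100.0 * total));
        long seen = 0;
        for (int i = 0; i < values.length; i++) {
            seen += counts[i]; // counts[i] belongs to values[i]
            if (seen >= rank) {
                return values[i];
            }
        }
        return values[values.length - 1];
    }

    public static void main(String[] args) {
        // histogram_1 from the example above: 51 observations in total.
        double[] values = {0.1, 0.2, 0.3, 0.4, 0.5};
        int[] counts = {3, 7, 23, 12, 6};
        System.out.println(percentile(values, counts, 50)); // 0.3
        System.out.println(percentile(values, counts, 95)); // 0.5
    }
}
```

For `histogram_1` the cumulative counts are 3, 10, 33, 45 and 51, so rank 26 (the median) falls in the `0.3` bucket and rank 49 falls in the `0.5` bucket.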



@@ -0,0 +1,34 @@
/*
* Licensed to Elasticsearch under one or more contributor
* license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright
* ownership. Elasticsearch licenses this file to you under
* the Apache License, Version 2.0 (the "License"); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.index.fielddata;


import java.io.IOException;

/**
 * {@link AtomicFieldData} specialization for histogram data.
 */
public interface AtomicHistogramFieldData extends AtomicFieldData {

    /**
     * Return Histogram values.
     */
    HistogramValues getHistogramValues() throws IOException;

}
@@ -0,0 +1,48 @@
/*
* Licensed to Elasticsearch under one or more contributor
* license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright
* ownership. Elasticsearch licenses this file to you under
* the Apache License, Version 2.0 (the "License"); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/

package org.elasticsearch.index.fielddata;

import java.io.IOException;

/**
 * Per-document histogram value. Every value of the histogram consists of
 * a value and a count.
 */
public abstract class HistogramValue {

    /**
     * Advance this instance to the next value of the histogram
     * @return true if there is a next value
     */
    public abstract boolean next() throws IOException;

    /**
     * The current value of the histogram
     * @return the current value of the histogram
     */
    public abstract double value();

    /**
     * The current count of the histogram
     * @return the current count of the histogram
     */
    public abstract int count();

}
@@ -0,0 +1,41 @@
/*
* Licensed to Elasticsearch under one or more contributor
* license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright
* ownership. Elasticsearch licenses this file to you under
* the Apache License, Version 2.0 (the "License"); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/

package org.elasticsearch.index.fielddata;

import java.io.IOException;

/**
 * Per-segment histogram values.
 */
public abstract class HistogramValues {

    /**
     * Advance this instance to the given document id
     * @return true if there is a value for this document
     */
    public abstract boolean advanceExact(int doc) throws IOException;

    /**
     * Get the {@link HistogramValue} associated with the current document.
     * The returned {@link HistogramValue} might be reused across calls.
     */
    public abstract HistogramValue histogram() throws IOException;

}
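
A consumer of these two classes would presumably call `advanceExact` first and then step through the returned `HistogramValue` with `next()`. The sketch below shows that pattern. The two abstract classes are repeated (signatures only) so that it compiles on its own; `ArrayHistogramValues` and `HistogramValuesDemo` are hypothetical in-memory stand-ins, not classes from this PR.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

// Signatures repeated from the PR so the sketch is self-contained.
abstract class HistogramValue {
    public abstract boolean next() throws IOException;
    public abstract double value();
    public abstract int count();
}

abstract class HistogramValues {
    public abstract boolean advanceExact(int doc) throws IOException;
    public abstract HistogramValue histogram() throws IOException;
}

/** Hypothetical in-memory implementation, for illustration only. */
final class ArrayHistogramValues extends HistogramValues {
    private final double[][] values; // values[doc] is null when the doc has no histogram
    private final int[][] counts;
    private int doc = -1;

    ArrayHistogramValues(double[][] values, int[][] counts) {
        this.values = values;
        this.counts = counts;
    }

    @Override
    public boolean advanceExact(int doc) {
        this.doc = doc;
        return doc >= 0 && doc < values.length && values[doc] != null;
    }

    @Override
    public HistogramValue histogram() {
        final double[] v = values[doc];
        final int[] c = counts[doc];
        return new HistogramValue() {
            private int i = -1; // positioned before the first bucket until next() is called

            @Override
            public boolean next() {
                return ++i < v.length;
            }

            @Override
            public double value() {
                return v[i];
            }

            @Override
            public int count() {
                return c[i];
            }
        };
    }
}

final class HistogramValuesDemo {

    /** Sums the counts of the histogram stored for the given document, or returns 0 if there is none. */
    static long totalCount(HistogramValues values, int doc) {
        try {
            if (values.advanceExact(doc) == false) {
                return 0;
            }
            HistogramValue histogram = values.histogram();
            long total = 0;
            while (histogram.next()) {
                total += histogram.count();
            }
            return total;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HistogramValues values = new ArrayHistogramValues(
            new double[][] {{0.1, 0.2, 0.3, 0.4, 0.5}, null},
            new int[][] {{3, 7, 23, 12, 6}, null});
        System.out.println(totalCount(values, 0)); // 51
        System.out.println(totalCount(values, 1)); // 0
    }
}
```

Because the returned `HistogramValue` might be reused across calls, the sketch consumes it fully before advancing to another document.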
@@ -0,0 +1,34 @@
/*
* Licensed to Elasticsearch under one or more contributor
* license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright
* ownership. Elasticsearch licenses this file to you under
* the Apache License, Version 2.0 (the "License"); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/

package org.elasticsearch.index.fielddata;


import org.elasticsearch.index.Index;
import org.elasticsearch.index.fielddata.plain.DocValuesIndexFieldData;

/**
 * Specialization of {@link IndexFieldData} for histograms.
 */
public abstract class IndexHistogramFieldData extends DocValuesIndexFieldData implements IndexFieldData<AtomicHistogramFieldData> {

    public IndexHistogramFieldData(Index index, String fieldName) {
        super(index, fieldName);
    }
}