[WIP] HLL Rollup field mapper plugin implementation and Cardinality Agg hacking #1

Open
amontalenti wants to merge 35 commits into base: 6.7

Conversation

@amontalenti (Owner) commented Apr 1, 2019

  • Write up basic spec of this project in Notion
  • Explore HyperLogLogPlusPlus class; read associated paper
  • Write new unit tests for serialization and deserialization of HLL
  • Record empirical results of those tests in Notion
  • Add a skeletal mapper-hll plugin based on the mapper-murmur3 plugin (see the registration sketch just after this list)
  • Hack around with making the serialized HLL go into binary doc values storage for the custom field type
  • Get remote debugging of ES and gradle integration tests with intellij working
  • Write a very basic end-to-end integration test for using the hll field and cardinality agg together
  • Hack around in CardinalityAggregator to make it pick up serialized HLL bytes
  • Make the initial integration test pass based on HLLFieldMapper and CardinalityAggregator changes!
  • Get answer on why initialBucket and otherBucket are always 0 right now
  • Work on making the HLLFieldMapper support arrays of values and a specified precision field; use the GeoShape field mapper for inspiration.
  • Explore how XContentParser works so that we can do streaming parsing of the items array.
  • Look into this comment in SortedDocValuesField: "This value can be at most 32766 bytes long."
  • Figure out whether BinaryDocValuesField is what HLLFieldMapper should really be using (given the above limit) but can't use yet, perhaps due to the current ES ValuesSource implementation. (Is this what Zach was talking about?)
  • Investigate whether ByteArrayOutputStream and ByteArrayInputStream are the right options for HLL serialization/deserialization.
  • See whether the stored field in the HLLFieldMapper is actually storing the byte[], and also what option we could possibly have for displaying it (in searches) and using it for re-indexing operations.
  • Explore whether there's a way to change what gets stored in _source for a given field, so that, e.g., we could store the serialized blob of the HLL in _source (note: we'd also need to support serialized HLLs in the HLLFieldMapper class) rather than being forced to store the raw items.
  • BytesRef stores the whole bytes array, plus its own offset and length, which means passing the underlying bytes array around is not the right move; the current impl might just be getting lucky.
  • ⭐ Figure out how big an HLL Rollup can be to fit into 32766 bytes, just in case. (Somewhere near 10K.)
  • ⭐ Make HLLFieldMapper support arrays of values via the streaming JSON parsing technique (tokens) rather than all-at-once parsing.
  • ⭐ Rename hll field type to hll-rollup
  • ⭐ Write up spec for abstracting the CardinalityAggregator changes into a new plugin with a new agg name, such as hll-uniq
  • ⭐ Split hacked CardinalityAggregator code changes into that hll-uniq plugin
  • ⭐ Build a real-world data set of URLs, timestamps, and UUIDs using BigQuery
  • ⭐ Do a real world performance benchmark
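
As a sketch of the skeletal plugin mentioned in that checklist item (hedged: MapperHLLPlugin and HLLFieldMapper.TypeParser are my working names, modeled directly on how mapper-murmur3 registers Murmur3FieldMapper), the registration could look roughly like this:

import java.util.Collections;
import java.util.Map;

import org.elasticsearch.index.mapper.Mapper;
import org.elasticsearch.plugins.MapperPlugin;
import org.elasticsearch.plugins.Plugin;

// Minimal sketch, modeled on MapperMurmur3Plugin: register the "hll" field type by
// mapping its content type name to a TypeParser. HLLFieldMapper.CONTENT_TYPE and
// HLLFieldMapper.TypeParser are assumed names in this prototype.
public class MapperHLLPlugin extends Plugin implements MapperPlugin {

    @Override
    public Map<String, Mapper.TypeParser> getMappers() {
        return Collections.singletonMap(HLLFieldMapper.CONTENT_TYPE, new HLLFieldMapper.TypeParser());
    }
}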

@amontalenti amontalenti self-assigned this Apr 1, 2019
@amontalenti (Owner, Author) commented Apr 1, 2019

Re: this just-completed task:

  • Hack around in CardinalityAggregator to make it pick up serialized HLL bytes

Happy to say I just got an integration test passing with this test input:

# Integration tests for Mapper HLL components
#

---
"Mapper HLL":

  - do:
      indices.create:
        index: test
        body:
          mappings:
            type1: { "properties": { "foo": { "type": "hll" } } }

  - do:
      index:
        index: test
        type: type1
        id: 0
        body: { "foo": "bar" }

  - do:
      indices.refresh: {}

  - do:
      search:
        body: { "aggs": { "foo_count": { "cardinality": { "field": "foo" } } } }

  - match: { aggregations.foo_count.value: 1 }

  - do:
      index:
        index: test
        type: type1
        id: 1
        body: { "foo": "bar" }

  - do:
      index:
        index: test
        type: type1
        id: 2
        body: { "foo": "baz" }

  - do:
      index:
        index: test
        type: type1
        id: 3
        body: { "foo": "bam" }

  - do:
      index:
        index: test
        type: type1
        id: 4
        body: { "foo": "bar" }

  - do:
      indices.refresh: {}

  - do:
      search:
        body: { "aggs": { "foo_count": { "cardinality": { "field": "foo" } } } }

  - match: { aggregations.foo_count.value: 3 }

... which means the mapper-hll plugin prototype is serializing HLLs as binary blobs into the binary doc values area of the index, and my hacked CardinalityAggregator is successfully deserializing those blobs and calculating HLL merges from them, rather than building HLLs "on the fly" from the raw values.

My debug screen in IntelliJ (captured here) showcases the code in action, caught during a live ES cardinality agg query.

There's still more to do (as described in the checklist above), but my, my, how very promising!
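
For reference, the core of that hacked path boils down to something like the following sketch (simplified, not the exact PR code; the helper name, the bucketOrd parameter, and the 6.x import locations are assumptions). It deserializes one doc's stored HLL blob and merges it into the aggregator's per-bucket counts:

import java.io.ByteArrayInputStream;
import java.io.IOException;

import org.apache.lucene.util.BytesRef;
import org.elasticsearch.common.io.stream.InputStreamStreamInput;
import org.elasticsearch.common.util.BigArrays;
import org.elasticsearch.search.aggregations.metrics.cardinality.HyperLogLogPlusPlus;

// Sketch: instead of hashing raw values on the fly, deserialize the HLL blob pulled
// from doc values and merge it into "counts", the aggregator's existing
// HyperLogLogPlusPlus, under the bucket being collected.
static void mergeSerializedHll(BytesRef bytes, HyperLogLogPlusPlus counts, long bucketOrd) throws IOException {
    byte[] hllBytes = BytesRef.deepCopyOf(bytes).bytes;
    ByteArrayInputStream bais = new ByteArrayInputStream(hllBytes);
    InputStreamStreamInput issi = new InputStreamStreamInput(bais);
    HyperLogLogPlusPlus rollup = HyperLogLogPlusPlus.readFrom(issi, BigArrays.NON_RECYCLING_INSTANCE);
    counts.merge(bucketOrd, rollup, 0);   // merge bucket 0 of the deserialized rollup into bucketOrd
}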

@@ -697,7 +697,7 @@ class ClusterFormationTasks {
// gradle task options are not processed until the end of the configuration phase
if (node.config.debug) {
println 'Running elasticsearch in debug mode, suspending until connected on port 8000'
- esJavaOpts.add('-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8000')
+ esJavaOpts.add('-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=*:8000')
amontalenti (Owner, Author):

This is for my own benefit for local debugging; won't be in the final PR. That said, might be nice to make this customizable.

# run integration tests
gradle integTestRunner

debug:
amontalenti (Owner, Author):

Likewise, this is for my own debugging.

counts.merge(0, rollup, 0);
long cardCounts2 = counts.cardinality(0);
long cardRollups2 = rollup.cardinality(0);
long maxBucket = counts.maxBucket();
amontalenti (Owner, Author):

The above long values are just for debugging and will come out once we think on this section a bit more.

100_000,
1_000_000, // 1 million
10_000_000 // 10 million
};
amontalenti (Owner, Author), Apr 1, 2019:

Due to the above, this test takes about 1.5s to run, whereas other tests in this suite only take 500-700ms. May want to think about a way to make this test faster by using fewer unique values.

This code also exercises the deserialization & serialization points.

assert hllBytes.length > 0 : "Decoded HLL had no bytes";
ByteArrayInputStream bais = new ByteArrayInputStream(hllBytes);
InputStreamStreamInput issi = new InputStreamStreamInput(bais);
HyperLogLogPlusPlus rollup = HyperLogLogPlusPlus.readFrom(issi, BigArrays.NON_RECYCLING_INSTANCE);
amontalenti (Owner, Author), Apr 2, 2019:

query-side deserialization point


// stored as binary DocValues field
//fields.add(new BinaryDocValuesField(fieldType().name(), hllBytesRef));
fields.add(new SortedDocValuesField(fieldType().name(), hllBytesRef));
amontalenti (Owner, Author):

index-side serialization point
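
For context, producing hllBytesRef is roughly the mirror image of the query-side snippet above. A minimal sketch of what that could look like (assuming the 6.x BytesStreamOutput API and that the parse path has already offered the incoming values to the HLL):

import java.io.IOException;

import org.apache.lucene.util.BytesRef;
import org.elasticsearch.common.io.stream.BytesStreamOutput;
import org.elasticsearch.search.aggregations.metrics.cardinality.HyperLogLogPlusPlus;

// Sketch: serialize the freshly built HLL (bucket 0 holds this document's rollup)
// into a BytesRef that can be handed to the doc values field above.
static BytesRef serializeHll(HyperLogLogPlusPlus hll) throws IOException {
    BytesStreamOutput out = new BytesStreamOutput();
    hll.writeTo(0, out);
    return out.bytes().toBytesRef();
}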

index: test
type: type1
id: 5
body: { "foo": {"items": ["1", "2", "3", "4"], "precision": 18} }
amontalenti (Owner, Author):

This introduces a failing test case until I can get HLLFieldMapper to fully support objects/arrays, but I'm already partway there as of d557530.

// try to parse it as a map
XContentParser.Token token = context.parser().currentToken();
if (token == XContentParser.Token.START_OBJECT) {
Map<String, Object> map = context.parser().map();
amontalenti (Owner, Author):

Should be aware that this probably allocates the whole Map into memory, and in practice we're expecting this to be a pretty large map. The ParseContext object seems flexible enough to support streaming parsing of JSON, so the right way to handle this would be to advance one token at a time, via calls to .parser().currentToken() and .parser().nextToken(), and offer the items to the in-memory HLL one string at a time as we come across the VALUE_STRING tokens inside the array.
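
A rough sketch of that token-at-a-time approach (the method and the offerToHll helper are hypothetical names of mine, not current PR code; it assumes the parser is positioned on the field's START_OBJECT):

import java.io.IOException;

import org.elasticsearch.common.xcontent.XContentParser;

// Sketch: walk the {"items": [...], "precision": N} object token by token and offer
// each string in "items" to the HLL as it streams by, instead of materializing a Map.
// offerToHll() is a hypothetical helper that hashes the string and adds it to the
// in-memory HLL being built for this document.
private void parseItemsStreaming(XContentParser parser) throws IOException {
    String currentName = null;
    XContentParser.Token token;
    while ((token = parser.nextToken()) != XContentParser.Token.END_OBJECT) {
        if (token == XContentParser.Token.FIELD_NAME) {
            currentName = parser.currentName();
        } else if (token == XContentParser.Token.START_ARRAY && "items".equals(currentName)) {
            while ((token = parser.nextToken()) != XContentParser.Token.END_ARRAY) {
                if (token == XContentParser.Token.VALUE_STRING) {
                    offerToHll(parser.text());
                }
            }
        } else if (token.isValue() && "precision".equals(currentName)) {
            int precision = parser.intValue();
            // precision would feed into how the HLL is constructed; elided in this sketch
        }
    }
}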

// The right way to do it: use the alternative constructor of ByteArrayInputStream:

//byte[] hllBytes = BytesRef.deepCopyOf(bytes).bytes;
byte[] hllBytesCopy = BytesRef.deepCopyOf(bytes).bytes;
amontalenti (Owner, Author):

In a couple of commits, I thought the right way to do this was to pass bytes.bytes, bytes.offset, bytes.length to the ByteArrayInputStream constructor, since I learned from Lucene talks that the buffer underneath a BytesRef is not something you can just pass around freely. But when I tried to use a copy or do that trick, I hit other issues, so I've tabled it for now.
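
For the record, the constructor I had in mind looks like this (kept here only as a sketch in case we revisit it; whether it plays nicely with the rest of the code is exactly the open question above):

// Sketch: read the HLL straight out of the BytesRef's backing array without a deep
// copy, by honoring the BytesRef's own offset and length via the alternative
// ByteArrayInputStream constructor mentioned in the comment above.
ByteArrayInputStream bais = new ByteArrayInputStream(bytes.bytes, bytes.offset, bytes.length);
InputStreamStreamInput issi = new InputStreamStreamInput(bais);
HyperLogLogPlusPlus rollup = HyperLogLogPlusPlus.readFrom(issi, BigArrays.NON_RECYCLING_INSTANCE);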

Repository owner deleted a comment from floatingdev Jan 31, 2024