[WIP] HLL Rollup field mapper plugin implementation and Cardinality Agg hacking #1

Open
amontalenti wants to merge 35 commits into base: 6.7

Conversation

@amontalenti (Owner) commented Apr 1, 2019

  • Write up basic spec of this project in Notion
  • Explore HyperLogLogPlusPlus class; read associated paper
  • Write new unit tests for serialization and deserialization of HLL
  • Record empirical results of those tests in Notion
  • Add a skeletal mapper-hll plugin based on the mapper-murmur3 plugin (see the registration sketch just after this list)
  • Hack around with making the serialized HLL go into binary doc values storage for the custom field type
  • Get remote debugging of ES and gradle integration tests with intellij working
  • Write a very basic end-to-end integration test for using the hll field and cardinality agg together
  • Hack around in CardinalityAggregator to make it pick up serialized HLL bytes
  • Make the initial integration test pass based on HLLFieldMapper and CardinalityAggregator changes!
  • Get answer on why initialBucket and otherBucket are always 0 right now
  • Work on making the HLLFieldMapper support arrays of values and a specified precision field; use the GeoShape field mapper for inspiration.
  • Explore how XContentParser works so that we can do streaming parsing of the items array.
  • Look into this comment in SortedDocValuesField: "This value can be at most 32766 bytes long."
  • Figure out whether BinaryDocValuesField is what HLLFieldMapper should really be using (given the above limit) but can't use yet, perhaps due to the current ES ValuesSource implementation. (Is this what Zach was talking about?)
  • Investigate whether ByteArrayOutputStream and ByteArrayInputStream are the right options for HLL serialization/deserialization.
  • See whether the stored field in the HLLFieldMapper is actually storing the byte[], and also what option we could possibly have for displaying it (in searches) and using it for re-indexing operations.
  • Explore whether there's a way to change what gets stored in _source for a given field, so that, e.g., we could store the serialized blob of the HLL in _source (note: we'd also need to support serialized HLLs in the HLLFieldMapper class) rather than being forced to store the raw items.
  • BytesRef stores the whole bytes array, plus its own offset and length, which means passing the underlying bytes array around is not the right move; the current impl might just be getting lucky.
  • ⭐ Figure out how big an HLL Rollup can be to fit into 32766 bytes, just in case. (Somewhere near 10K.)
  • ⭐ Make HLLFieldMapper support arrays of values via the streaming JSON parsing technique (tokens) rather than all-at-once parsing.
  • ⭐ Rename hll field type to hll-rollup
  • ⭐ Write up spec for abstracting the CardinalityAggregator changes into a new plugin with a new agg name, such as hll-uniq
  • ⭐ Split hacked CardinalityAggregator code changes into that hll-uniq plugin
  • ⭐ Build a real-world data set of URLs, timestamps, and UUIDs using BigQuery
  • ⭐ Do a real world performance benchmark
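
As a sketch of the skeletal plugin mentioned in that checklist item (hedged: MapperHLLPlugin and HLLFieldMapper.TypeParser are my working names, modeled directly on how mapper-murmur3 registers Murmur3FieldMapper), the registration could look roughly like this:

import java.util.Collections;
import java.util.Map;

import org.elasticsearch.index.mapper.Mapper;
import org.elasticsearch.plugins.MapperPlugin;
import org.elasticsearch.plugins.Plugin;

// Minimal sketch, modeled on MapperMurmur3Plugin: register the "hll" field type by
// mapping its content type name to a TypeParser. HLLFieldMapper.CONTENT_TYPE and
// HLLFieldMapper.TypeParser are assumed names in this prototype.
public class MapperHLLPlugin extends Plugin implements MapperPlugin {

    @Override
    public Map<String, Mapper.TypeParser> getMappers() {
        return Collections.singletonMap(HLLFieldMapper.CONTENT_TYPE, new HLLFieldMapper.TypeParser());
    }
}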

@amontalenti amontalenti self-assigned this Apr 1, 2019
@amontalenti (Owner, Author) commented Apr 1, 2019

Re: this just-completed task:

  • Hack around in CardinalityAggregator to make it pick up serialized HLL bytes

Happy to say I just got an integration test passing with this test input:

# Integration tests for Mapper HLL components
#

---
"Mapper HLL":

  - do:
      indices.create:
        index: test
        body:
          mappings:
            type1: { "properties": { "foo": { "type": "hll" } } }

  - do:
      index:
        index: test
        type: type1
        id: 0
        body: { "foo": "bar" }

  - do:
      indices.refresh: {}

  - do:
      search:
        body: { "aggs": { "foo_count": { "cardinality": { "field": "foo" } } } }

  - match: { aggregations.foo_count.value: 1 }

  - do:
      index:
        index: test
        type: type1
        id: 1
        body: { "foo": "bar" }

  - do:
      index:
        index: test
        type: type1
        id: 2
        body: { "foo": "baz" }

  - do:
      index:
        index: test
        type: type1
        id: 3
        body: { "foo": "bam" }

  - do:
      index:
        index: test
        type: type1
        id: 4
        body: { "foo": "bar" }

  - do:
      indices.refresh: {}

  - do:
      search:
        body: { "aggs": { "foo_count": { "cardinality": { "field": "foo" } } } }

  - match: { aggregations.foo_count.value: 3 }

... which means the mapper-hll plugin prototype is serializing HLLs as binary blobs into the binary doc values area of the index, and my hacked CardinalityAggregator is successfully deserializing those blobs and calculating HLL merges from them, rather than building HLLs "on the fly" from the raw values.

My debug screen in IntelliJ (captured here) showcases the code in action, caught during a live ES cardinality agg query.

There's still more to do (as described in the checklist above), but my, my, how very promising!
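
For reference, the core of that hacked path boils down to something like the following sketch (simplified, not the exact PR code; the helper name, the bucketOrd parameter, and the 6.x import locations are assumptions). It deserializes one doc's stored HLL blob and merges it into the aggregator's per-bucket counts:

import java.io.ByteArrayInputStream;
import java.io.IOException;

import org.apache.lucene.util.BytesRef;
import org.elasticsearch.common.io.stream.InputStreamStreamInput;
import org.elasticsearch.common.util.BigArrays;
import org.elasticsearch.search.aggregations.metrics.cardinality.HyperLogLogPlusPlus;

// Sketch: instead of hashing raw values on the fly, deserialize the HLL blob pulled
// from doc values and merge it into "counts", the aggregator's existing
// HyperLogLogPlusPlus, under the bucket being collected.
static void mergeSerializedHll(BytesRef bytes, HyperLogLogPlusPlus counts, long bucketOrd) throws IOException {
    byte[] hllBytes = BytesRef.deepCopyOf(bytes).bytes;
    ByteArrayInputStream bais = new ByteArrayInputStream(hllBytes);
    InputStreamStreamInput issi = new InputStreamStreamInput(bais);
    HyperLogLogPlusPlus rollup = HyperLogLogPlusPlus.readFrom(issi, BigArrays.NON_RECYCLING_INSTANCE);
    counts.merge(bucketOrd, rollup, 0);   // merge bucket 0 of the deserialized rollup into bucketOrd
}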

@@ -697,7 +697,7 @@ class ClusterFormationTasks {
// gradle task options are not processed until the end of the configuration phase
if (node.config.debug) {
println 'Running elasticsearch in debug mode, suspending until connected on port 8000'
- esJavaOpts.add('-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8000')
+ esJavaOpts.add('-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=*:8000')
amontalenti (Owner, Author):

This is for my own benefit for local debugging; won't be in the final PR. That said, might be nice to make this customizable.

# run integration tests
gradle integTestRunner

debug:
amontalenti (Owner, Author):

Likewise, this is for my own debugging.

counts.merge(0, rollup, 0);
long cardCounts2 = counts.cardinality(0);
long cardRollups2 = rollup.cardinality(0);
long maxBucket = counts.maxBucket();
amontalenti (Owner, Author):

The above long values are just for debugging and will come out once we think on this section a bit more.

100_000,
1_000_000, // 1 million
10_000_000 // 10 million
};
amontalenti (Owner, Author), Apr 1, 2019:

Due to the above, this test takes about 1.5s to run, whereas other tests in this suite only take 500-700ms. May want to think about a way to make this test faster by using fewer unique values.

This code also exercises the deserialization & serialization points.

assert hllBytes.length > 0 : "Decoded HLL had no bytes";
ByteArrayInputStream bais = new ByteArrayInputStream(hllBytes);
InputStreamStreamInput issi = new InputStreamStreamInput(bais);
HyperLogLogPlusPlus rollup = HyperLogLogPlusPlus.readFrom(issi, BigArrays.NON_RECYCLING_INSTANCE);
amontalenti (Owner, Author), Apr 2, 2019:

query-side deserialization point


// stored as binary DocValues field
//fields.add(new BinaryDocValuesField(fieldType().name(), hllBytesRef));
fields.add(new SortedDocValuesField(fieldType().name(), hllBytesRef));
amontalenti (Owner, Author):

index-side serialization point
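
For context, producing hllBytesRef is roughly the mirror image of the query-side snippet above. A minimal sketch of what that could look like (assuming the 6.x BytesStreamOutput API and that the parse path has already offered the incoming values to the HLL):

import java.io.IOException;

import org.apache.lucene.util.BytesRef;
import org.elasticsearch.common.io.stream.BytesStreamOutput;
import org.elasticsearch.search.aggregations.metrics.cardinality.HyperLogLogPlusPlus;

// Sketch: serialize the freshly built HLL (bucket 0 holds this document's rollup)
// into a BytesRef that can be handed to the doc values field above.
static BytesRef serializeHll(HyperLogLogPlusPlus hll) throws IOException {
    BytesStreamOutput out = new BytesStreamOutput();
    hll.writeTo(0, out);
    return out.bytes().toBytesRef();
}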

index: test
type: type1
id: 5
body: { "foo": {"items": ["1", "2", "3", "4"], "precision": 18} }
amontalenti (Owner, Author):

This introduces a failing test case until I can get HLLFieldMapper to fully support objects/arrays, but I'm already partway there as of d557530.

// try to parse it as a map
XContentParser.Token token = context.parser().currentToken();
if (token == XContentParser.Token.START_OBJECT) {
Map<String, Object> map = context.parser().map();
amontalenti (Owner, Author):

Should be aware that this probably allocates the whole Map into memory, and in practice we're expecting this to be a pretty large map. The ParseContext object seems flexible enough to support streaming parsing of JSON, so the right way to handle this would be to advance one token at a time, via calls to .parser().currentToken() and .parser().nextToken(), and offer the items to the in-memory HLL one string at a time as we come across the VALUE_STRING tokens inside the array.
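
A rough sketch of that token-at-a-time approach (the method and the offerToHll helper are hypothetical names of mine, not current PR code; it assumes the parser is positioned on the field's START_OBJECT):

import java.io.IOException;

import org.elasticsearch.common.xcontent.XContentParser;

// Sketch: walk the {"items": [...], "precision": N} object token by token and offer
// each string in "items" to the HLL as it streams by, instead of materializing a Map.
// offerToHll() is a hypothetical helper that hashes the string and adds it to the
// in-memory HLL being built for this document.
private void parseItemsStreaming(XContentParser parser) throws IOException {
    String currentName = null;
    XContentParser.Token token;
    while ((token = parser.nextToken()) != XContentParser.Token.END_OBJECT) {
        if (token == XContentParser.Token.FIELD_NAME) {
            currentName = parser.currentName();
        } else if (token == XContentParser.Token.START_ARRAY && "items".equals(currentName)) {
            while ((token = parser.nextToken()) != XContentParser.Token.END_ARRAY) {
                if (token == XContentParser.Token.VALUE_STRING) {
                    offerToHll(parser.text());
                }
            }
        } else if (token.isValue() && "precision".equals(currentName)) {
            int precision = parser.intValue();
            // precision would feed into how the HLL is constructed; elided in this sketch
        }
    }
}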

// The right way to do it: use the alternative constructor of ByteArrayInputStream:

//byte[] hllBytes = BytesRef.deepCopyOf(bytes).bytes;
byte[] hllBytesCopy = BytesRef.deepCopyOf(bytes).bytes;
amontalenti (Owner, Author):

In a couple of commits, I thought the right way to do this was to pass bytes.bytes, bytes.offset, bytes.length to the ByteArrayInputStream constructor, since I learned from Lucene talks that the buffer underneath a BytesRef is not something you can just pass around freely. But when I tried to use a copy or do that trick, I hit other issues, so I've tabled it for now.
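
For the record, the constructor I had in mind looks like this (kept here only as a sketch in case we revisit it; whether it plays nicely with the rest of the code is exactly the open question above):

// Sketch: read the HLL straight out of the BytesRef's backing array without a deep
// copy, by honoring the BytesRef's own offset and length via the alternative
// ByteArrayInputStream constructor mentioned in the comment above.
ByteArrayInputStream bais = new ByteArrayInputStream(bytes.bytes, bytes.offset, bytes.length);
InputStreamStreamInput issi = new InputStreamStreamInput(bais);
HyperLogLogPlusPlus rollup = HyperLogLogPlusPlus.readFrom(issi, BigArrays.NON_RECYCLING_INSTANCE);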

Repository owner deleted a comment from floatingdev Jan 31, 2024