Elasticsearch HyperloglogPlus Aggregator

Simple aggregation which allows the aggregation of HyperloglogPlus serialized objects which have been saved as a binary field in Elasticsearch.

Background

The HyperloglogPlus data structure allows us to compute the cardinality of a multiset with a small trade-off in accuracy. One of the interesting properties of Hyperloglog is that it allows to merge more than one Hyperloglog to compute total multiset cardinality.

This plugin uses a HyperloglogPlus implementation from stream-lib. We first considered Elasticsearch HLL implemenation, but decided to go with stream-lib as ES implementation may change and usage of HyperLogLogPlus directly is not supported.

Sample Use Case

Unique audience count

Given a data set of social media posts with a few billion visits and 10 million posts ( with hashtags) , we must count the distinct count of users who had seen a combination of hashtags associated with pages . One approach would be to index each user as a document with posts and their hashtags as properties in the document. This demands a huge index and the view is very 'user centric'.

Instead, the approach used here is to index each post as a document with a hyperloglog binary field representing unique visitors for that post. Then, using the hyperlogsum aggregation provided by this plugin, unique visitors can be computed by any post dimension ( such as hashtags here).

Computing Hyperloglog binary field to index can be done by one of these ways:

1 - As demonstrated in the integration tests, directly using stream-lib with same precision used by plugin internally (HyperUniqeSumAggregationBuilder.SERIALIZED_DENSE_PRECISION, HyperUniqeSumAggregationBuilder.SERIALIZED_SPARSE_PRECISION)

2 - If using spark to index data into elasticsearch, an example UDAF has been provided which internally uses the same precision.

3 - One could also build a custom UDAF for your stack, modeled on the spark version (feel free to contribute back!)

Usage

Elasticsearch Versions

Since this plugin might change with minor versions of ES , Source code is organized into ES version specific directory. Currently we support ES 5.4.3.

Please create an issue if a different ES-version-specific release needs to be made.

Setup

Switch to ES version specific directory .

cd es-5.4.3

In order to install this plugin, you need to create a zip distribution first by running

gradle clean assemble

This will produce a zip file in build/distributions.

After building the zip file, you can install it like this

bin/elasticsearch-plugin install file:///path/to/elasticsearch-hyperloglogplus/build/distribution/elasticsearch-hyperloglogplus.zip

Elasticsearch Mappings

Since we are using binary field in ES for storing HLL, doc_values should be set to true in the mapping declaration.

Indexing Example

hll field should be mapped to type = binary and doc_values = true, like using an index template in below example

curl -XPUT http://localhost:9200/_template/template_1 -d '
{
  "order": 0,
  "template": "*",
  "mappings": {
    "product": {
      "properties": {
        "hll": {
          "type": "binary",
          "doc_values": true
        },
        "desc": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword"
            }
          }
        }
      }
    }
  },
  "aliases": {}
}'

python sample code to populate index

import json
import requests

data = {
          "desc" : "bar",
          "hll" : "rO0ABXNyAFBjb20uY2xlYXJzcHJpbmcuYW5hbHl0aWNzLnN0cmVhbS5jYXJkaW5hbGl0eS5IeXBlckxvZ0xvZ1BsdXMkU2VyaWFsaXphdGlvbkhvbGRlct2sxediKNllDAAAeHB6AAACVf////4OGQHIAearF8aBOYiDAfAZlpgGrusBwO8M7ij6uSHKkgOA0RDW9Wuyig30qCvu9CTQ3yT62grKsRjumQeE9QG+vQ3c7ATc6TC0OsjSH/CaE6RY+PYD0I8WtvM1tMcBlswJ7vsQ0E/6/wLEvSWW/ibm4Q6YL9KADMyeCLq1JsDWKszxE7CzBIibCsrVCfS3CPCbG9TsA8TpVeaUCejZE9KlH4iPCNSNEczxB/pu+NUPutEC/KgVxpwBzKpI0poTwk6i6CO83APe/waKqRv+tg328CewmxLugwLqwA7GxwimiQWIswuC/i/WsxCo0Rrs2QbAywi2yATqkyz46SLO6wic/gXyGaTcAaiHMezAAvzfCrrwEsyUGbhHnKUDhLgj9IMSvogC7MMIrusBjPMD2L0FkvQPrKcfoJpM2r1K7rEGtP0S2rAr+tJM9oga3q8OivkJrPAF4uQojscWmMw62towntIEoN0DzowQjvAq7NYI4IQJ7sojsq0L5I8B+OoUptgohswMrNEa2Pw5iIMB5uEdurUmioM40M4CgrEj2OIKqIcPmosL7NkGvsMK5LIisOkH0PELuucEirAitKEH7NkG9JcBzPICnrAO8qEh0LkJkKoN0MIQ1pgLwu8XruUm3IQQ5rgVwt0RgJIwzK4DqMEw4IoZkO8KrqUP1uIIns4nyokO2kuorxPUr1OEvweWogL8qyDkpAyyig64shDqoS++zQvu4xbOuhL48grqohHg8QfSlQKO+wPmH6brQqSiCJSiAqSZVujUA9jDArqyC/T1Xng="
        }


url = 'http://localhost:9200/newindex/product/a'

for  i in range(1,100):
	requests.post(url + str(i),json = data)

Querying with custom aggregation

curl -XPOST 'localhost:9200/_search?pretty=true&size=0' -d '{ "query" : {"match_all" : {}}, "aggs" : { "desc.keyword" : { "terms" : { "field" : "desc.keyword"}  ,  "aggs" : {  "uniq" : {"hyperlogsum" : { "field" : "hll" } } } } } }'

{
  "took" : 36,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 99,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "desc.keyword" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "bar",
          "doc_count" : 99,
          "uniq" : {
            "value" : 200.0
          }
        }
      ]
    }
  }
}

Bugs & TODO

Use an encoding format for HLL ( Like REDIS )
Consider LogLog Beta
HLL Mapping Type

Acknowledgement

Thanks to David Pilato's gradle cookie cutter generator to make plugin projects easier.

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
es-5.4.3		es-5.4.3
es-5.5.0		es-5.5.0
spark		spark
.gitignore		.gitignore
LICENSE.txt		LICENSE.txt
OWNERS		OWNERS
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Elasticsearch HyperloglogPlus Aggregator

Background

Sample Use Case

Unique audience count

Usage

Elasticsearch Versions

Setup

Elasticsearch Mappings

Indexing Example

Querying with custom aggregation

Bugs & TODO

Acknowledgement

About

Releases

Packages

Contributors 2

Languages

License

bazaarvoice/elasticsearch-hyperloglog

Folders and files

Latest commit

History

Repository files navigation

Elasticsearch HyperloglogPlus Aggregator

Background

Sample Use Case

Unique audience count

Usage

Elasticsearch Versions

Setup

Elasticsearch Mappings

Indexing Example

Querying with custom aggregation

Bugs & TODO

Acknowledgement

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages