Probabilistic data structures and algorithms for hive
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Type Name Latest commit message Commit time
Failed to load latest commit information.


Probabilistic Hive (Alpha)

Probabilistic data structures and algorithms in hive. Currently only implemented hyperloglog.


-- estimate the cardinality of SELECT * FROM src GROUP BY col1, col2;
SELECT hll(col1, col2).cardinality from src;

-- create hyperloglog cache per hour
FROM input_table src
  INSERT OVERWRITE TABLE hll_cache PARTITION (d='2015-03-01',h='00')
  SELECT hll(col1,col2) WHERE d='2015-03-01' AND h='00'
  INSERT OVERWRITE TABLE hll_cache PARTITION (d='2015-03-01',h='01')
  SELECT hll(col1,col2) WHERE d='2015-03-01' AND h='01'

-- read the cache and calculate the cardinality of full day
SELECT hll(hll_col).cardinality from hll_cache WHERE d='2015-03-01;'

-- unpack hive hll struct and make it readable by postgres-hll, js-hll developed by Aggregate Knowledge, Inc.
SELECT hll_col.signature from hll_cache WHERE d='2015-03-01';


mvn package
scp target/hive-probabilistic-utils-0.0.2-jar-with-dependencies.jar your-hive-server

In your hive shell:

ADD JAR /path/to/your/uploaded/jar;
CREATE TEMPORARY FUNCTION hll AS 'org.idryman.hive.HyperLogLog';

-- Now the function is loaded, use the example in SYNOPSIS
DESC EXTENDED FUNCTION hll; -- show function doc

Road Map

  • Lossy count -> estimate top-k element
  • Count-sketch -> frequency estimate
  • Moment/Entropy estimate -> AMS or others


  • No test! BeeTest should be the way to go.
  • make function configurable (only available after hive 0.11.0)
  • Currently using hadoop seriailzied binary as the input (for performance purpose), but this may be incompatible to the downstream (postres-hll, js-hll). Value to hyperloglog mapping should be deterministic across all platform.



CopyRight 2015 Felix Chern ALv2