Probabilistic data structures and algorithms in hive. Currently only implemented hyperloglog.
-- estimate the cardinality of SELECT * FROM src GROUP BY col1, col2;
SELECT hll(col1, col2).cardinality from src;
-- create hyperloglog cache per hour
FROM input_table src
INSERT OVERWRITE TABLE hll_cache PARTITION (d='2015-03-01',h='00')
SELECT hll(col1,col2) WHERE d='2015-03-01' AND h='00'
INSERT OVERWRITE TABLE hll_cache PARTITION (d='2015-03-01',h='01')
SELECT hll(col1,col2) WHERE d='2015-03-01' AND h='01'
-- read the cache and calculate the cardinality of full day
SELECT hll(hll_col).cardinality from hll_cache WHERE d='2015-03-01;'
-- unpack hive hll struct and make it readable by postgres-hll, js-hll developed by Aggregate Knowledge, Inc.
SELECT hll_col.signature from hll_cache WHERE d='2015-03-01';
mvn package
scp target/hive-probabilistic-utils-0.0.2-jar-with-dependencies.jar your-hive-server
In your hive shell:
ADD JAR /path/to/your/uploaded/jar;
CREATE TEMPORARY FUNCTION hll AS 'org.idryman.hive.HyperLogLog';
-- Now the function is loaded, use the example in SYNOPSIS
DESC EXTENDED FUNCTION hll; -- show function doc
- Lossy count -> estimate top-k element
- Count-sketch -> frequency estimate
- Moment/Entropy estimate -> AMS or others
- No test! BeeTest should be the way to go.
- make function configurable (only available after hive 0.11.0)
- Currently using hadoop seriailzied binary as the input (for performance purpose), but this may be incompatible to the downstream (postres-hll, js-hll). Value to hyperloglog mapping should be deterministic across all platform.
CopyRight 2015 Felix Chern ALv2