Skip to content

dryman/hive-probabilistic-utils

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Probabilistic Hive (Alpha)

Probabilistic data structures and algorithms in hive. Currently only implemented hyperloglog.

SYNOPSIS

-- estimate the cardinality of SELECT * FROM src GROUP BY col1, col2;
SELECT hll(col1, col2).cardinality from src;

-- create hyperloglog cache per hour
FROM input_table src
  INSERT OVERWRITE TABLE hll_cache PARTITION (d='2015-03-01',h='00')
  SELECT hll(col1,col2) WHERE d='2015-03-01' AND h='00'
  INSERT OVERWRITE TABLE hll_cache PARTITION (d='2015-03-01',h='01')
  SELECT hll(col1,col2) WHERE d='2015-03-01' AND h='01'

-- read the cache and calculate the cardinality of full day
SELECT hll(hll_col).cardinality from hll_cache WHERE d='2015-03-01;'

-- unpack hive hll struct and make it readable by postgres-hll, js-hll developed by Aggregate Knowledge, Inc.
SELECT hll_col.signature from hll_cache WHERE d='2015-03-01';

INSTALL

mvn package
scp target/hive-probabilistic-utils-0.0.2-jar-with-dependencies.jar your-hive-server

In your hive shell:

ADD JAR /path/to/your/uploaded/jar;
CREATE TEMPORARY FUNCTION hll AS 'org.idryman.hive.HyperLogLog';

-- Now the function is loaded, use the example in SYNOPSIS
DESC EXTENDED FUNCTION hll; -- show function doc

Road Map

  • Lossy count -> estimate top-k element
  • Count-sketch -> frequency estimate
  • Moment/Entropy estimate -> AMS or others

TODO

  • No test! BeeTest should be the way to go.
  • make function configurable (only available after hive 0.11.0)
  • Currently using hadoop seriailzied binary as the input (for performance purpose), but this may be incompatible to the downstream (postres-hll, js-hll). Value to hyperloglog mapping should be deterministic across all platform.

Reference

LICENSE

CopyRight 2015 Felix Chern ALv2

About

Probabilistic data structures and algorithms for hive

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages