Skip to content
master
Switch branches/tags
Go to file
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
 
 
 
 
 
 
 
 

README.mkdn

Probabilistic Hive (Alpha)

Probabilistic data structures and algorithms in hive. Currently only implemented hyperloglog.

SYNOPSIS

-- estimate the cardinality of SELECT * FROM src GROUP BY col1, col2;
SELECT hll(col1, col2).cardinality from src;

-- create hyperloglog cache per hour
FROM input_table src
  INSERT OVERWRITE TABLE hll_cache PARTITION (d='2015-03-01',h='00')
  SELECT hll(col1,col2) WHERE d='2015-03-01' AND h='00'
  INSERT OVERWRITE TABLE hll_cache PARTITION (d='2015-03-01',h='01')
  SELECT hll(col1,col2) WHERE d='2015-03-01' AND h='01'

-- read the cache and calculate the cardinality of full day
SELECT hll(hll_col).cardinality from hll_cache WHERE d='2015-03-01;'

-- unpack hive hll struct and make it readable by postgres-hll, js-hll developed by Aggregate Knowledge, Inc.
SELECT hll_col.signature from hll_cache WHERE d='2015-03-01';

INSTALL

mvn package
scp target/hive-probabilistic-utils-0.0.2-jar-with-dependencies.jar your-hive-server

In your hive shell:

ADD JAR /path/to/your/uploaded/jar;
CREATE TEMPORARY FUNCTION hll AS 'org.idryman.hive.HyperLogLog';

-- Now the function is loaded, use the example in SYNOPSIS
DESC EXTENDED FUNCTION hll; -- show function doc

Road Map

  • Lossy count -> estimate top-k element
  • Count-sketch -> frequency estimate
  • Moment/Entropy estimate -> AMS or others

TODO

  • No test! BeeTest should be the way to go.
  • make function configurable (only available after hive 0.11.0)
  • Currently using hadoop seriailzied binary as the input (for performance purpose), but this may be incompatible to the downstream (postres-hll, js-hll). Value to hyperloglog mapping should be deterministic across all platform.

Reference

LICENSE

CopyRight 2015 Felix Chern ALv2

About

Probabilistic data structures and algorithms for hive

Resources

License

Releases

No releases published

Packages

No packages published

Languages