Skip to content
master
Switch branches/tags
Code

Latest commit

The Random sampling algorithm, often referred to as MRL, was published by Gurmeet Singh Manku, Sridhar Rajagopalan, and Bruce Lindsay in 1999 [1] and addressed the problem of the correct sampling and quantile estimation. It consists of the non-uniform sampling technique and deterministic quantile finding algorithm.

This implements a simpler version of the MRL algorithm that is proposed by Ge Luo, Lu Wang, Ke Yi, and Graham Cormode [2], [3], and has denoted in the original articles as Random.

References
---------------

[1] Manku, G., et al: Random sampling techniques for space efficient
online computation of order statistics of large datasets. Proceedings
of the 1999 ACM SIGMOD International conference on Management
of data, Philadelphia, Pennsylvania, USA - May 31–June 03, 1999,
pp. 251–262, ACM New York, NY (1999)
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.86.5750&rep=rep1&type=pdf
[2] Wang, L., et al: Quantiles over data streams: an experimental
study. Proceedings of the 2013 ACM SIGMOD International
conference on Management of data, New York, NY, USA - June
22–27, 2013, 2013, pp. 737–748, ACM New York, NY (2013)
http://dimacs.rutgers.edu/~graham/pubs/papers/nquantiles.pdf
[3] Luo, G., Wang, L., Yi, K. et al.: Quantiles over data streams:
experimental comparisons, new analyses, and further improvements.
The VLDB Journal. Vol. 25 (4), 449–472 (2016)
http://dimacs.rutgers.edu/~graham/pubs/papers/nquantvldbj.pdf
0e1c58e

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

PDSA: Probabilistic Data Structures and Algorithms in Python

Travis Build Status Current Release Version pypi Version Documentation Version Python versions

The Book

Everybody interested in learning more about probabilistic data structures and algorithms could be referred to our recently published book:

Probabilistic Data Structures and Algorithms for Big Data Applications by Andrii Gakhov

2019, ISBN: 978-3748190486 (paperback) ASIN: B07MYKTY8W (e-book)

Introduction

Probabilistic data structures is a common name of data structures based on different hashing techniques.

Unlike regular (or deterministic) data structures, they always provide approximated answers, but usually with reliable ways to estimate the error probability.

The potential losses or errors are fully compensated by extremely low memory requirements, constant query time and scaling.

GitHub repository: https://github.com/gakhov/pdsa

Dependencies

Documentation

The latest documentation can be found at http://pdsa.readthedocs.io/en/latest/

Membership problem

Cardinality problem

Frequency problem

Rank problem

License

MIT License

Source code

Authors

  • Maintainer: Andrii Gakhov <andrii.gakhov@gmail.com>

Install with pip

Installation requires a working build environment.

Using pip, PDSA releases are currently only available as source packages.

$ pip3 install -U pdsa

When using pip it is generally recommended to install packages in a virtualenv to avoid modifying system state:

$ virtualenv .env -p python3 --no-site-packages
$ source .env/bin/activate
$ pip3 install -U cython
$ pip3 install -U pdsa

Compile from source

The other way to install PDSA is to clone its GitHub repository and build it from source.

$ git clone https://github.com/gakhov/pdsa.git
$ cd pdsa

$ make install

$ bin/pip3 install -r requirements-dev.txt
$ make test