[CPAN] Memory-efficient approximate univariate distribution

DESCRIPTION

Statistics::Descriptive::LogScale - Memory-efficient approximate univariate statistical analysis module.

FEATURES

  • Efficiency - can handle very large samples without loading them into memory, at the cost of a small, predictable relative error (see DATA MODEL below);

  • Power - can report all kinds of values: mean, variance, percentiles, moments of any power, the probability of hitting a specific interval, and even the integral of an arbitrary function over the sample;

  • Persistence - can save, load, and combine data sets, as well as cut out the part of the sample within a given range;

  • Visualization - can output data in the form of a histogram array to draw pictures;

  • Compatibility - can be used in place of Statistics::Descriptive::Full whenever retaining exact initial data points isn't required.

SYNOPSIS

use strict;
use warnings;
use Statistics::Descriptive::LogScale;

my $stat = Statistics::Descriptive::LogScale->new (
    base => 1.01,
    linear_width => 0.001,
);

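# Read input line by line and add every number found on each line
while (<>) {
    $stat->add_data($_) for /(-?\d+(?:\.\d*)?)/g;
}
# add more data....

# Report the minimum and every 10th percentile
printf "   minimal value: %f\n", $stat->min;
printf "%3uth percentile: %f\n", $_*10, $stat->percentile($_*10)
    for 1..10;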

See the perldoc for more. See also the example directory:

  • example/summary.pl - short summary

  • example/compare-full.pl - side-by-side comparison with Statistics::Descriptive::Full

  • example/histogram.pl - text-based histogram

  • example/png-histogram.pl - PNG histogram, can load a JSON-encoded sample

  • example/save-load.pl - save/load sample to JSON files

  • example/gen-sample.pl - not really an example, but a clumsy generator of random distributions. Use it to fake test data if you don't have any at hand.

DATA MODEL

The data is divided into logarithmic intervals, or bins, i.e. such that the ratio of upper boundary to lower boundary is constant across all bins. This allows storing data that spans several orders of magnitude while maintaining a guaranteed relative precision.

For instance, the default bin ratio is 10^(1/232), which makes each bin approximately 1% wide and allows sorting numbers from 1 to 1000000 into about 1400 bins.
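
The arithmetic above can be checked with a few lines of core Perl. This is only a sketch of logarithmic binning in general, not the module's actual implementation: the helpers log_bin_index, log_bin_bounds and log_bin_count are hypothetical names made up for illustration, and the assumption that bin k covers [base^k, base^(k+1)) is ours.

```perl
use strict;
use warnings;
use POSIX qw(floor ceil);

# Default bin ratio quoted in the text: 10^(1/232), roughly 1.01.
my $base = 10 ** (1 / 232);

# Hypothetical helper (not part of the module's API):
# index k of the bin [base^k, base^(k+1)) that holds a positive value.
sub log_bin_index {
    my ($value, $base) = @_;
    die "value must be positive\n" unless $value > 0;
    return floor(log($value) / log($base));
}

# Hypothetical helper: lower and upper boundary of bin number k.
sub log_bin_bounds {
    my ($k, $base) = @_;
    return ($base ** $k, $base ** ($k + 1));
}

# Hypothetical helper: number of bins needed to cover [$lo, $hi].
sub log_bin_count {
    my ($lo, $hi, $base) = @_;
    return ceil(log($hi / $lo) / log($base));
}

my $k = log_bin_index(1234.5, $base);
my ($lower, $upper) = log_bin_bounds($k, $base);

printf "bin ratio:             %.5f\n", $base;
printf "1234.5 falls into bin  %d: [%.3f, %.3f)\n", $k, $lower, $upper;
printf "bins for 1..1000000:   %d\n", log_bin_count(1, 1_000_000, $base);
```

Six decades at 232 bins per decade give about 1392 bins, which is where the "about 1400" figure comes from. Any value inside a bin differs from the bin boundaries by at most about 1%, regardless of its magnitude.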

Additionally, linear approximation can be used around zero to save memory, since the incoming data is rarely absolutely precise anyway. The threshold under which linear approximation is used is roughly (precision of data)/(relative bin width), where the relative bin width is about 1% by default. By default, data is assumed to be precise, so linear approximation is not used.
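
As a rough illustration of how the two regions could fit together, here is a sketch in core Perl. It does not reproduce the module's code. It assumes that the threshold is the point where a logarithmic bin becomes as narrow as a linear one, that values are rounded to the nearest bin, and it uses a hypothetical helper named approximate. The numbers 1.01 and 0.001 are the ones passed to the constructor in the usage example above.

```perl
use strict;
use warnings;
use POSIX qw(floor);

my $base         = 1.01;     # bin ratio, so bins are about 1% wide
my $linear_width = 0.001;    # assumed absolute precision of the data

# Assumption: switch to linear bins where a logarithmic bin would be
# narrower than linear_width, i.e. where x * (base - 1) == linear_width.
my $threshold = $linear_width / ($base - 1);

# Hypothetical helper (not the module's API): the value a data point
# is replaced with once it has been sorted into a bin.
sub approximate {
    my ($x) = @_;
    if (abs($x) < $threshold) {
        # Linear region: round to the nearest multiple of linear_width.
        return $linear_width * floor($x / $linear_width + 0.5);
    }
    # Logarithmic region: round the logarithm, then restore the sign.
    my $k    = floor(log(abs $x) / log($base) + 0.5);
    my $sign = $x < 0 ? -1 : 1;
    return $sign * $base ** $k;
}

printf "threshold: %g\n", $threshold;
printf "%12g -> %g\n", $_, approximate($_)
    for 0.0004, 0.0123, -0.0123, 0.5, 500, -500;
```

Under these assumptions the threshold comes out at about 0.1: below it the absolute error stays within half of linear_width, above it the relative error stays within about half a percent.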

INSTALLATION

Most likely, you want to install the latest stable version from CPAN:

cpanm Statistics::Descriptive::LogScale

To install this very package from source instead, use the following:

perl Makefile.PL
make
make test
make install

WHY THIS MODULE

Initially, this module started out as a quick and dirty performance analysis tool. It turned out that in some cases average values do not tell that much, as in "your service responds in 0.1 +- 10 seconds".

Another use is in long-running and/or memory-limited applications. It's possible to save data samples and/or send them over the network, as well as gradually "forget" old data.

Ideally, it should become the tool for preliminary analysis and drawing funny pictures until one realises they need serious stuff like R.

BUGS AND LIMITATIONS

This software is still under development and has not experienced enough usage, so there may be bugs.

The mode calculation algorithm only works well for discrete distributions and needs improvement.

Data inspection methods are wanted.

C and JavaScript translations are wanted. XS would be a boon.

The error introduced by approximation has not been studied well enough yet. It may turn out that tweaking the model could win some precision.

SUPPORT AND DOCUMENTATION

The module itself is moderately well documented, so you can use

perldoc Statistics::Descriptive::LogScale

As of May 2015, you can find the latest and greatest version of this package at https://github.com/dallaylaen/perl-Statistics-Descriptive-LogScale

Please report bugs there if you can. Alternatively, CPAN RT is at your service.

ACKNOWLEDGEMENTS

This module was inspired by a talk that Andrew Aksyonoff, author of the Sphinx search software, gave at the HighLoad++ conference in Moscow in 2012.

Statistics::Descriptive::Full was and is used as a reference when in doubt. Several code snippets were shamelessly stolen from there.

linear_width and linear_threshold parameter names were suggested by CountZero from http://perlmonks.org

COPYRIGHT AND LICENSE

Copyright (C) 2013-2015 Konstantin S. Uvarin

This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.

See http://dev.perl.org/licenses/ for more information.