Scistats

High-Performance Descriptive Statistics and Hypothesis Tests in C++20

Statistics help us analyze and interpret data. High-performance statistical algorithms help us analyze and interpret a lot of data. Most environments provide convenient helper functions to calculate basic statistics. Scistats aims to provide high-performance statistical algorithms with an easy and familiar interface. All algorithms can run sequentially or in parallel, depending on how much data you have.


Quick Start

Scistats extends the numeric facilities of the standard library to include statistics that work with iterators and ranges. This means you can do things like:

std::vector<int> v{/*...*/};
float m = scistats::mean(v);

or

std::vector<int> v{/*...*/};
float s = scistats::stddev(v); // standard deviation

or

std::vector<int> v1{/*...*/};
std::vector<int> v2{/*...*/};
float c = scistats::cov(v1, v2); // covariance

or

std::vector<int> v{/*...*/};
float p = scistats::t_test(v); // student's t hypothesis test

All algorithms allow execution policies and iterators. So you can do

std::vector<int> v{/*...*/};
float m = scistats::mean(scistats::execution::par, v);

to calculate your average in parallel. Or

std::vector<int> v{/*...*/};
float m = scistats::mean(scistats::execution::seq, v);

to explicitly tell scistats you don't want that to be calculated in parallel. If no execution policy is provided, scistats will choose a policy according to the input size.

As usual, you can also work directly with iterators, so

std::vector<int> v{/*...*/};
float m = scistats::mean(v.begin(), v.end());

also works.

Note that integer inputs are promoted when needed: if a statistic requires a floating-point result, scistats promotes the integer input type to a floating-point type large enough to hold the result without losing precision.
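The promotion rule can be pictured with a small standalone sketch. It does not show how scistats implements it: `promoted_t` and `promoted_mean` are hypothetical names, and always promoting integers to `double` is an assumption made here for simplicity.

```cpp
#include <cassert>
#include <numeric>
#include <type_traits>
#include <vector>

// Hypothetical helper (not part of scistats): integer element types map to
// double, floating-point element types are left unchanged.
template <class T>
using promoted_t =
    typename std::conditional<std::is_integral<T>::value, double, T>::type;

// Illustrative mean: the accumulator already has the promoted type, so
// integer inputs are never divided with integer arithmetic.
template <class T>
promoted_t<T> promoted_mean(const std::vector<T>& x) {
    assert(!x.empty()); // the mean of an empty range is undefined
    using R = promoted_t<T>;
    const R total = std::accumulate(x.begin(), x.end(), R{0});
    return total / static_cast<R>(x.size());
}
```

With this sketch, `promoted_mean(std::vector<int>{1, 2})` yields `1.5` rather than the truncated `1` that integer division would produce.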

Descriptive statistics

Central Tendency

With ranges:

using namespace scistats;
// ...
mean(x); 

With iterators:

mean(x.begin(), x.end()); 

You can run any algorithm in parallel by changing the execution policy:

mean(execution::seq, x);
mean(execution::par, x);

If no execution policy is provided, scistats will infer the best execution policy according to the input data.

Other functions to measure central tendency are:

| Function | Description |
|---|---|
| mean(x) | Arithmetic mean |
| median(x) | Median |
| mode(x) | Mode |

Dispersion

To calculate the standard deviation of a data set:

stddev(x);

If you already know the mean m, you can make calculations faster with:

stddev(x,m);
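To see why a known mean saves work, here is a minimal standalone sketch. The names `mean_of`, `stddev_with_mean`, and `stddev_of` are hypothetical, and the sample formula (dividing by n - 1) is an assumption, since this README does not say which normalization scistats uses.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Arithmetic mean: one pass over the data.
double mean_of(const std::vector<double>& x) {
    assert(!x.empty());
    double total = 0.0;
    for (double v : x) total += v;
    return total / static_cast<double>(x.size());
}

// Sample standard deviation when the mean m is already known:
// only the pass over the squared deviations is needed.
double stddev_with_mean(const std::vector<double>& x, double m) {
    assert(x.size() > 1); // n - 1 in the denominator requires n >= 2
    double ss = 0.0;
    for (double v : x) ss += (v - m) * (v - m);
    return std::sqrt(ss / static_cast<double>(x.size() - 1));
}

// Without a known mean, an extra pass computes it first.
double stddev_of(const std::vector<double>& x) {
    return stddev_with_mean(x, mean_of(x));
}
```

Both functions return the same value; the two-argument form simply skips the pass that computes the mean.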

Other functions to measure dispersion are:

| Function | Description |
|---|---|
| var(x) | Variance |
| stddev(x) | Standard deviation |
| min(x) | Minimum value |
| max(x) | Maximum value |
| bounds(x) | Minimum and maximum values |
| percentile(x,p) | p-th percentile |

Multivariate Analysis

To calculate the covariance of two data sets:

cov(x,y);
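For reference, the quantity being computed can be sketched in plain C++. This is not the scistats implementation: `sample_cov` is a hypothetical name, and the sample normalization (dividing by n - 1) is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sample covariance of two equally sized data sets:
// sum((x_i - mean_x) * (y_i - mean_y)) / (n - 1).
double sample_cov(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size() && x.size() > 1);
    const double n = static_cast<double>(x.size());

    // First pass: the two means.
    double mx = 0.0, my = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        mx += x[i];
        my += y[i];
    }
    mx /= n;
    my /= n;

    // Second pass: accumulate the products of the deviations.
    double acc = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        acc += (x[i] - mx) * (y[i] - my);
    }
    return acc / (n - 1.0);
}
```

A positive result means the two data sets tend to move in the same direction, a negative result means they tend to move in opposite directions.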

Probability Distributions

To get the probability density at x in a normal distribution:

norm_pdf(x);

To get the cumulative probability of x in a normal distribution:

norm_cdf(x);

To get the value x that has a cumulative probability p in a normal distribution:

norm_inv(p);

| Probability | Cumulative | Inverse | Description |
|---|---|---|---|
| norm_pdf(x) | norm_cdf(x) | norm_inv(p) | Normal distribution |
| t_pdf(x,df) | t_cdf(x,df) | t_inv(p,df) | Student's T distribution |

where df is the degrees of freedom in the probability distribution.
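The relationship between the three kinds of functions can be sketched for the standard normal distribution (mean 0, variance 1) using only the C++ standard library. The function names are hypothetical, and the bisection-based inverse is a deliberately simple stand-in; nothing here is taken from the scistats sources.

```cpp
#include <cassert>
#include <cmath>

// Density of the standard normal distribution.
double std_norm_pdf(double x) {
    constexpr double inv_sqrt_2pi = 0.3989422804014327; // 1 / sqrt(2 * pi)
    return inv_sqrt_2pi * std::exp(-0.5 * x * x);
}

// Cumulative probability P(X <= x), written with the complementary
// error function from <cmath>.
double std_norm_cdf(double x) {
    constexpr double sqrt1_2 = 0.7071067811865476; // sqrt(1/2)
    return 0.5 * std::erfc(-x * sqrt1_2);
}

// Inverse CDF by bisection on the monotonic CDF. Simple and robust, but
// slower than the rational approximations usually used in practice.
double std_norm_inv(double p) {
    assert(p > 0.0 && p < 1.0);
    double lo = -40.0, hi = 40.0;
    for (int i = 0; i < 200; ++i) {
        const double mid = 0.5 * (lo + hi);
        if (std_norm_cdf(mid) < p) {
            lo = mid;
        } else {
            hi = mid;
        }
    }
    return 0.5 * (lo + hi);
}
```

The inverse undoes the cumulative function: `std_norm_inv(std_norm_cdf(x))` returns approximately `x` for moderate values of `x`.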

Hypothesis Testing

To test the hypothesis that the values in x come from a distribution whose mean is zero:

t_test(x);
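The statistic behind a one-sample test can be sketched as follows. This is only an illustration under stated assumptions: `t_statistic` and `one_sample_t` are hypothetical names, and the sketch stops at the statistic and its degrees of freedom. Turning them into a p-value requires the Student's T cumulative distribution, which is not reproduced here.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct t_statistic {
    double t;  // test statistic
    double df; // degrees of freedom, n - 1
};

// One-sample t statistic for the null hypothesis "the mean equals mu0":
// t = (mean - mu0) / (s / sqrt(n)), with s the sample standard deviation.
// If all values are identical, s is zero and t is not finite.
t_statistic one_sample_t(const std::vector<double>& x, double mu0 = 0.0) {
    assert(x.size() > 1);
    const double n = static_cast<double>(x.size());

    double mean = 0.0;
    for (double v : x) mean += v;
    mean /= n;

    double ss = 0.0;
    for (double v : x) ss += (v - mean) * (v - mean);
    const double s = std::sqrt(ss / (n - 1.0));

    return {(mean - mu0) / (s / std::sqrt(n)), n - 1.0};
}
```

A large absolute value of `t` relative to the T distribution with `df` degrees of freedom is evidence against the null hypothesis.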

To test the hypothesis that the values in x and y have the same mean:

t_test(x,y);

For a paired test:

t_test_paired(x,y);

To get a confidence interval for these tests:

t_test_interval(x);
t_test_interval(x,y);

Bayesian statistics

Given (i) the likelihood P(E|H) of the evidence E given the hypothesis H, (ii) the prior probability p_hypothesis of the hypothesis H, and (iii) the prior probability p_evidence of the evidence E, we can calculate the probability P(H|E) of the hypothesis H given the evidence E with:

bayes_theorem(likelihood, p_hypothesis, p_evidence)

Given P(E|H) and P(E|not H), we can calculate the Bayes factor:

bayes_factor(p_evidence_given_h, p_evidence_given_not_h)
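Both quantities are one-line formulas: P(H|E) = P(E|H) P(H) / P(E), and the Bayes factor is the ratio P(E|H) / P(E|not H). The standalone sketch below spells them out; `posterior` and `bayes_factor_sketch` are hypothetical names, and the numbers in the usage note are made up for illustration.

```cpp
#include <cassert>

// Bayes' theorem: P(H|E) = P(E|H) * P(H) / P(E).
double posterior(double likelihood, double p_hypothesis, double p_evidence) {
    assert(p_evidence > 0.0); // conditioning on impossible evidence is undefined
    return likelihood * p_hypothesis / p_evidence;
}

// Bayes factor: how much more probable the evidence is under H than under not-H.
double bayes_factor_sketch(double p_evidence_given_h,
                           double p_evidence_given_not_h) {
    assert(p_evidence_given_not_h > 0.0);
    return p_evidence_given_h / p_evidence_given_not_h;
}
```

As an example, take P(E|H) = 0.9, P(H) = 0.01, and P(E|not H) = 0.05. Then P(E) = 0.9 × 0.01 + 0.05 × 0.99 = 0.0585, the posterior is 0.009 / 0.0585 ≈ 0.154, and the Bayes factor is 0.9 / 0.05 = 18.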

Mathematics

Parallel Arithmetic

To sum the elements of a range in parallel:

sum(execution::par, x)

Or let scistats infer if it is worth doing it in parallel:

sum(x)

| Function | Description |
|---|---|
| sum | Summation |
| prod | Product |

Constants

The header scistats/math/constants.h defines a number of useful constants as constexpr functions:

| Function | Description | Approximate Value |
|---|---|---|
| pi | The constant pi | 3.14159 |
| epsilon(scale) | A tiny number for a given scale and type | epsilon(1.) = 2.22045e-16 |
| inf | The number representing infinity | inf |
| min | Smallest positive normalized number | 2.22507e-308 |
| max | Largest finite number | 1.79769e+308 |
| NaN | The value representing "not a number" | nan |
| e | Euler's number, the base of the natural logarithm | 2.71828 |
| euler | Euler–Mascheroni constant (Euler's gamma) | 0.577216 |
| log2_e | The base-2 logarithm of e | 1.4427 |
| log10_e | The base-10 logarithm of e | 0.434294 |
| sqrt2 | The square root of two | 1.41421 |
| sqrt1_2 | The square root of one-half | 0.707107 |
| sqrt3 | The square root of three | 1.73205 |
| pi_2 | Pi divided by two | 1.5708 |
| pi_4 | Pi divided by four | 0.785398 |
| sqrt_pi | The square root of pi | 1.77245 |
| two_sqrt_pi | Two divided by the square root of pi | 1.12838 |
| one_by_pi | The reciprocal of pi (1./pi) | 0.31831 |
| two_by_pi | Twice the reciprocal of pi | 0.63662 |
| ln10 | The natural logarithm of ten | 2.30259 |
| ln2 | The natural logarithm of two | 0.693147 |
| lnpi | The natural logarithm of pi | 1.14473 |
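To illustrate what "constants as constexpr functions" can look like, here is a minimal sketch. The names carry a `_sketch` suffix because they are hypothetical; the actual signatures in scistats/math/constants.h may differ, for instance in how the numeric type is selected.

```cpp
#include <limits>

// A constant exposed as a constexpr function template: the caller chooses
// the floating-point type and the value is available at compile time.
template <class T = double>
constexpr T pi_sketch() {
    return static_cast<T>(3.141592653589793238462643383279502884L);
}

// Derived constants can be built from the basic ones.
template <class T = double>
constexpr T pi_2_sketch() {
    return pi_sketch<T>() / static_cast<T>(2);
}

// Type-dependent values can be forwarded from std::numeric_limits.
template <class T = double>
constexpr T inf_sketch() {
    return std::numeric_limits<T>::infinity();
}

// Because the functions are constexpr, they can be checked at compile time.
static_assert(pi_sketch<float>() > 3.14f, "usable in constant expressions");
```

One reason to prefer functions over plain variables is that a single template covers `float`, `double`, and `long double` without a separate definition for each type.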

Functions

Some helper functions:

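| Category | Function | Description |
|---|---|---|
| Numeric | abs | Absolute value (for floating point and integers) |
| Numeric | almost_equal | Check if two numbers are almost the same |
| Numeric | is_odd | Check if an integer is odd |
| Numeric | is_even | Check if an integer is even |
| Trigonometric | acot | Inverse cotangent |
| Trigonometric | cot | Cotangent |
| Special | beta | Beta function |
| Special | beta_inc | Incomplete beta function |
| Special | beta_inc_inv | Inverse of the incomplete beta function |
| Special | beta_inc_inv_upper | Inverse of the upper incomplete beta function |
| Special | beta_inc_upper | Upper incomplete beta function |
| Special | betaln | Natural logarithm of the beta function |
| Special | erfinv | Inverse error function |
| Special | gammaln | Natural logarithm of the gamma function |
| Special | tgamma | Gamma function |
| Special | xinbta | xinbta |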

Measuring Time

To measure the time between two operations:

double t1 = tic();
// your operations
double t2 = toc();

To measure the time it takes to run a function:

double t = timeit([](){
    // Your function...
});

To create a mini-benchmark measuring the time it takes to run a function:

std::vector<double> t = minibench([](){
    // Your function...
});
std::cout << "Mean: " << mean(t) << std::endl;
std::cout << "Standard Deviation: " << stddev(t) << std::endl;
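The idea behind these helpers can be sketched with std::chrono. The names `timeit_sketch` and `minibench_sketch` are hypothetical, and reporting seconds as well as defaulting to 30 runs are assumptions; this README does not state the unit or the number of repetitions scistats uses.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Time a single call of f and return the elapsed time in seconds.
// steady_clock is used because it never jumps backwards.
template <class F>
double timeit_sketch(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Repeat the measurement several times and return every sample, so the
// caller can compute the mean and standard deviation of the timings.
template <class F>
std::vector<double> minibench_sketch(F&& f, std::size_t runs = 30) {
    assert(runs > 0);
    std::vector<double> times;
    times.reserve(runs);
    for (std::size_t i = 0; i < runs; ++i) {
        times.push_back(timeit_sketch(f));
    }
    return times;
}
```

Returning all samples rather than a single average lets the caller judge how noisy the measurement was.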

Random Number Generators

To generate a random integer between a and b with a reasonable random number generator:

randi(a,b)

To generate a random number from a normal distribution:

randn()

To generate a random number from a uniform distribution between a and b:

rand(a,b)
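Helpers of this kind are typically thin wrappers over the `<random>` header. The sketch below shows one way to write them, not the scistats implementation: all names are hypothetical, the choice of `std::mt19937` is an assumption, and so is treating the integer range as inclusive on both ends.

```cpp
#include <cassert>
#include <random>

// One generator per thread, seeded once from the system entropy source.
inline std::mt19937& engine_sketch() {
    thread_local std::mt19937 gen(std::random_device{}());
    return gen;
}

// Random integer in the closed interval [a, b].
int randi_sketch(int a, int b) {
    assert(a <= b);
    std::uniform_int_distribution<int> dist(a, b);
    return dist(engine_sketch());
}

// Random number from the standard normal distribution (mean 0, stddev 1).
double randn_sketch() {
    std::normal_distribution<double> dist(0.0, 1.0);
    return dist(engine_sketch());
}

// Random real number from a uniform distribution between a and b.
double rand_sketch(double a, double b) {
    assert(a < b);
    std::uniform_real_distribution<double> dist(a, b);
    return dist(engine_sketch());
}
```

Keeping a single seeded engine per thread avoids the common mistake of reseeding on every call, which is slow and can produce correlated values.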

Roadmap

Some functions we plan to implement are:

  • Math
    • Parallel Arithmetic
    • Constants
    • Mini-benchmarks
    • Random Number Generators
  • Descriptive statistics
    • Central tendency
    • Dispersion
    • Correlation
  • Hypothesis Tests
    • Probability distributions
    • Basic tests
    • Non-Parametric tests
    • Anova
    • Bayesian Statistics
  • Regression Models
    • Classification
    • Clustering
  • Data processing

Integration

Build from Source

Dependencies

  • C++20
  • CMake 3.14+

Instructions: Linux/Ubuntu/GCC

Check your GCC version:

g++ --version

The output should be something like:

g++-8 (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0

If you see a version before GCC-10, update it with

sudo apt update
sudo apt install gcc-10
sudo apt install g++-10

Once you have installed a newer version of GCC, you can register it with update-alternatives. For instance, if you have GCC-7 and GCC-10, you can register both with:

sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-7 7
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-7 7
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-10 10
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-10 10

You can now use update-alternatives to set your default gcc and g++ to a more recent version:

sudo update-alternatives --config g++
sudo update-alternatives --config gcc

Also check your CMake version:

cmake --version

If it's older than CMake 3.14, update it with

sudo apt upgrade cmake

or download the most recent version from cmake.org.

Later, when running CMake, make sure you are using GCC-10 or higher by appending the following options:

-DCMAKE_C_COMPILER=/usr/bin/gcc-10 -DCMAKE_CXX_COMPILER=/usr/bin/g++-10

Instructions: macOS/Clang

Check your Clang version:

clang --version

The output should have something like

Apple clang version 11.0.0

If you see a version before Clang 11, update LLVM+Clang:

curl --output clang.tar.xz -L https://github.com/llvm/llvm-project/releases/download/llvmorg-11.0.0/clang+llvm-11.0.0-x86_64-apple-darwin.tar.xz
mkdir clang
tar -xvJf clang.tar.xz -C clang
cd clang/clang+llvm-11.0.0-x86_64-apple-darwin
sudo cp -R * /usr/local/

Update CMake with

brew upgrade cmake

or download the most recent version from cmake.org.

If the brew command fails because you don't have Homebrew on your computer, you can install it with

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install.sh)"

or you can follow the instructions in https://brew.sh.

Instructions: Windows/MSVC

You can see the dependencies in source/CMakeLists.txt.

Build the Examples

This will build the examples in the build/examples directory:

mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS="-O2"
cmake --build . --parallel 2 --config Release

  • Replace --parallel 2 with --parallel <number of cores in your machine>
  • On Windows, replace -O2 with /O2
  • On Linux, you might need sudo for this last command

Installing Scistats from Source

This will install Scistats on your system:

mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS="-O2" -DBUILD_EXAMPLES=OFF -DBUILD_TESTS=OFF 
cmake --build . --parallel 2 --config Release
cmake --install .

  • Replace --parallel 2 with --parallel <number of cores in your machine>
  • On Windows, replace -O2 with /O2
  • On Linux, you might need sudo for the last command (cmake --install .)

CMake targets

Find it as a CMake Package

If you have the library installed, you can call

find_package(Scistats)

from your CMake build script.

When creating your executable, link the library to the targets you want:

add_executable(my_target main.cpp)
target_link_libraries(my_target PUBLIC scistats)

Add this header to your source files:

#include <scistats/scistats.h>

Use it as a CMake subdirectory

You can use Scistats directly in CMake projects without installing it. Check if you have CMake 3.14+ installed:

cmake --version

Clone the whole project

git clone https://github.com/alandefreitas/scistats/

and add the subdirectory to your CMake project:

add_subdirectory(scistats)

When creating your executable, link the library to the targets you want:

add_executable(my_target main.cpp)
target_link_libraries(my_target PUBLIC scistats)

You can now add the scistats headers to your source files.

However, it's always recommended to look for Scistats with find_package before including it as a subdirectory. Otherwise, we can get ODR errors in larger projects.

CMake with Automatic Download

Check if you have CMake 3.14+ installed:

cmake --version

Install CPM.cmake and then:

CPMAddPackage(
    NAME scistats
    GITHUB_REPOSITORY alandefreitas/scistats
    GIT_TAG origin/master # or whatever tag you want
)
# ...
target_link_libraries(my_target PUBLIC scistats)

You can now add the scistats headers to your source files.

However, it's always recommended to look for Scistats with find_package before including it as a subdirectory. You can use:

option(CPM_USE_LOCAL_PACKAGES "Try `find_package` before downloading dependencies" ON)

to let CPM.cmake do that for you. Otherwise, we can get ODR errors in larger projects.

Other build systems

If you want to use it in another build system, you can either install the library (Section Binary Packages or Section Installing Scistats from Source) or rewrite the build script yourself.

If you want to rewrite the build script, your project needs to 1) include the headers, and 2) link with the dependencies described in source/CMakeLists.txt.

Contributing

There are many ways in which you can contribute to this library:

  • Testing the library in new environments
  • Contributing with interesting examples
  • Contributing with new statistics
  • Finding problems in this documentation
  • Finding bugs in general
  • Whatever idea seems interesting to you

If contributing with code, please leave the pedantic mode ON (-DBUILD_WITH_PEDANTIC_WARNINGS=ON), and don't forget cppcheck and clang-format.

Example: CLion

CLion Settings with Pedantic Mode

If contributing to the documentation, please edit README.md directly, as the files in ./docs are automatically generated with mdsplit.

Contributors

  • Alan De Freitas (alandefreitas)
  • rcpsilva