Multiple doc, pack and format improvements
 * Documentation
 * Packaging and pypi
 * Corrected flake errors
 Improved formatting (thanks Petr Gladkikh, @PetrGlad)
akochepasov committed Dec 11, 2023
1 parent ec73621 commit d3d7e2d
Showing 12 changed files with 358 additions and 471 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/python-package.yml
@@ -43,5 +43,5 @@ jobs:
run: |
# stop the build if there are Python syntax errors or undefined names
flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
- # exit-zero treats all errors as warnings. Line is 160 chars wide
- flake8 . --count --exit-zero --max-complexity=10 --max-line-length=160 --statistics
+ # exit-zero treats all errors as warnings. Line is 120 chars wide
+ flake8 . --count --exit-zero --max-complexity=10 --max-line-length=120 --statistics
44 changes: 35 additions & 9 deletions README.md
@@ -2,15 +2,15 @@

Fast streaming univariate and bivariate moments and t-statistics.

- statmoments is a library for fast streaming one pass computation of univariate and bivariate moments and statistics for batch of multiple of waveforms or traces with thousands of sample points. Given the data sorting with classifier, it can compute Welch's t-test statistics of various orders for arbitrary data partitioning to allow finding relationships and statistical differences among many data splits, which are unknown beforehand. statmoments uses best of class BLAS implementation and preprocesses input data to take the most of computational power and perform computations as fast as possible.
+ statmoments is a library for the fast streaming one-pass computation of univariate and bivariate moments for batches of multiple waveforms or traces with thousands of sample points. Given the data sorting with classifiers, it can compute Welch's t-test statistics of various orders for arbitrary data partitioning, allowing it to find relationships and statistical differences among many data splits that are unknown beforehand. statmoments uses a best-of-class BLAS implementation and preprocesses input data to make the most of the available computational power and perform computations as fast as possible on Windows and Linux platforms.

## How is that different?

- When the difference in input data is really subtle, millions of waveforms should be processed to find the statistical significant difference. Therefore, computationally efficient algorithms for analysing such volumes of data are crucial for practical use. Due to their nature, computing high-order moments and statistics requires 2- or 3-pass computations. However, not all the data can fit into the available memory. In another case, after initial input has been processed, new data may appear, which requires starting the process over. When the input waveforms contains many thousands of sample points, it adds another dimension to an already complex problem.
+ When the difference in input data is really subtle, millions of waveforms should be processed to find the statistically significant difference. Therefore, computationally efficient algorithms for analyzing such volumes of data are crucial for practical use. Due to their nature, computing high-order moments and statistics requires 2- or 3-pass computations. However, not all the data can fit into the available memory. In another case, after initial input has been processed, new data may appear, which requires starting the process over. When the input waveforms contain many thousands of sample points, it adds another dimension to an already complex problem.

This can be overcome with a streaming approach. A streaming algorithm examines a sequence of inputs in a single pass, it processes data as it is collected, without waiting for it to be pre-collected and stored in a persistent storage. Streaming computation is often used to process data from real-time sources, such as oscilloscopes, sensors as well as financial markets. It can also be used to process large datasets that are too large to fit in memory or to process data that is constantly changing. Another name for such an approach is an online algorithm.
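The streaming update described above can be illustrated with Welford's classic one-pass algorithm for the mean and variance (a minimal generic sketch; statmoments' internal kernels are vectorized and not shown here):

```python
import numpy as np

def update(state, batch):
    # Welford's one-pass update: fold a batch of samples into (count, mean, M2)
    count, mean, M2 = state
    for x in batch:
        count += 1
        delta = x - mean
        mean += delta / count
        M2 += delta * (x - mean)  # M2 accumulates the sum of squared deviations
    return count, mean, M2

# Data arrives in chunks; no chunk is ever revisited
state = (0, 0.0, 0.0)
for chunk in (np.array([1.0, 2.0]), np.array([3.0, 4.0])):
    state = update(state, chunk)

count, mean, M2 = state
variance = M2 / count  # population variance of all samples seen so far
```

The same pattern generalizes to higher central moments and to co-moments, which is what makes single-pass, constant-memory processing of unbounded waveform streams possible.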

- This library implements streaming algorithms to avoid recomputation, stores intermediate data in the accumulator. The algorithm also takes in the account that covariance and higher matrices are represented by a symmetric matrix, decreasing the memory requirement two-fold. The data update is blazingly fast and at any moment the internal accumulator can be converted to to (co-)moments on demand and the moments in turn can be converted to t-statistics with Welch's t-test. Once more data collected, it can be iteratively processed, increasing the precision of the moments, and discarded. The computation is optimized to consume significant input streams, hundreds of megabytes per second of waveforms, which may contain thousands of points.
+ This library implements streaming algorithms to avoid recomputation and stores intermediate data in an accumulator. The algorithm also takes into account that covariance and higher-order matrices are symmetric, decreasing the memory requirement two-fold. The data update is blazingly fast, and at any moment the internal accumulator can be converted to (co-)moments on demand; the moments, in turn, can be converted to t-statistics with Welch's t-test. Once more data is collected, it can be iteratively processed, increasing the precision of the moments, and discarded. The computation is optimized to consume significant input streams, hundreds of megabytes per second of waveforms, which may contain thousands of points.
Yet another dimension is added when the data split is unknown, that is, when it is not known in advance which bucket each input waveform belongs to. This library solves this by pre-classifying the input data and computing moments for all the requested data splits.
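The pre-classification idea can be sketched as a toy (illustrative shapes only, not statmoments' API): each classifier assigns every waveform to bucket 0 or 1, and accumulators are kept per (classifier, bucket) so every candidate split is evaluated in the same single pass:

```python
import numpy as np

# 4 waveforms of 2 sample points; 2 candidate splits (classifiers)
waveforms = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
classifiers = np.array([[0, 1], [1, 0], [0, 1], [1, 1]])  # shape (n_waveforms, n_splits)

n_splits = classifiers.shape[1]
sums = np.zeros((n_splits, 2, waveforms.shape[1]))  # (classifier, bucket, sample point)
counts = np.zeros((n_splits, 2))
for wf, cls in zip(waveforms, classifiers):
    for c, bucket in enumerate(cls):
        sums[c, bucket] += wf
        counts[c, bucket] += 1

means = sums / counts[..., None]  # per-split, per-bucket mean waveform
```

Higher-order accumulators follow the same indexing, so the cost of testing many hypothetical splits is amortized over one pass through the data.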

Some of the benefits of streaming computation include:
@@ -25,11 +25,11 @@ Univariate statistics are used in various fields and contexts to analyze and des

- Descriptive Statistics: summarizing and describing the central tendency, dispersion, and shape of a dataset.
- Hypothesis Testing: testing hypotheses to determine if there are significant differences or relationships between groups or conditions.
- - Finance and Economics: Examining the performance of financial assets, track market trends, and assess risk in real time.
+ - Finance and Economics: Examining the performance of financial assets, tracking market trends, and assessing risk in real time.

In summary, univariate statistics are a fundamental tool in data analysis and are widely used across a range of fields to explore, summarize, and draw conclusions from single-variable data. They provide essential insights into the characteristics and behavior of individual variables, which can inform decision-making and further research.
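For reference, the Welch's t-test that the streamed moments feed into is the unequal-variance two-sample test; offline it is available in SciPy (a sketch with synthetic data, not statmoments' streaming path):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 1000)  # bucket 0: mean 0, std 1
b = rng.normal(1.0, 2.0, 1000)  # bucket 1: different mean and variance

# equal_var=False selects Welch's t-test (no equal-variance assumption)
t, p = stats.ttest_ind(a, b, equal_var=False)
```

A common screening rule is to flag a data split whenever |t| exceeds a threshold such as 5, which is the convention used in this project's examples.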

- Bivariate statistics is a tool for understanding the relationships between two variables. Researchers and practitioners use it in a wide range of fields to make informed decisions and improve outcomes. The can be used to answer a variety of questions, such as:
+ Bivariate statistics are a tool for understanding the relationship between two variables. Researchers and practitioners use them in a wide range of fields to make informed decisions and improve outcomes. They can be used to answer a variety of questions, such as:

- Is there a statistically significant relationship between variables?
- Which data points are related?
@@ -38,9 +38,13 @@ Bivariate statistics is a tool for understanding the relationships between two v

These statistical methods are used in medical and bioinformatics research, astrophysics, seismology, market predictions, and many more where the input data may be measured in hundreds of gigabytes.

## Numeric accuracy

The numeric accuracy of the results depends on the coefficient of variation (COV) of the sample points in the input waveforms. With a COV of about 5%, the computed (co-)kurtosis has about 10 correct significant digits for 10'000 waveforms, which is more than enough for the resulting t-test. Increasing the data volume by about 100x loses roughly one more significant digit.

## Examples

- Performing univariate data analysis
+ ### Performing univariate data analysis

```python
# Input data parameters
@@ -61,7 +65,7 @@ Performing univariate data analysis
mean = [cm.copy() for cm in uveng.moments(moments=1)]
skewness = [cm.copy() for cm in uveng.moments(moments=3)]

- # Detect statistical differences in the first order t-test
+ # Detect statistical differences in the first-order t-test
for i, tt in enumerate(statmoments.stattests.ttests(uveng, moment=1)):
if np.any(np.abs(tt) > 5):
print(f"Data split {i} has different means")
@@ -72,7 +76,7 @@ Performing univariate data analysis
# Get updated statistical moments and t-tests
```

- Performing bivariate data analysis
+ ### Performing bivariate data analysis

```python
# Input data parameters
@@ -108,8 +112,30 @@ Performing bivariate data analysis
# Get updated statistical moments and t-tests
```

### Performing data analysis from the command line

```shell
# Find univariate t-test statistics of skewness for
# the first 5000 waveform sample points
# taken from the HDF5 dataset
python -m statmoments.univar -i data.h5 -m 3 -r 0:5000

# Find bivariate t-test statistics of covariance for
# the first 1000 waveform sample points
# Taken from the HDF5 dataset
python -m statmoments.bivar -i data.h5 -r 0:1000
```

More examples can be found in the examples and tests directories.

## Implementation notes

Since the output data can exhaust the available RAM, the results are matrices of statistical moments for the requested region, produced one at a time for each input classifier. The output moment for each classifier has dimension 2 x M x L, where M is the index of the requested classifier and L is the region length. The t-test is represented by a 1D array for each classifier.
The **bivariate moments** are represented by the **upper triangle** of the symmetric matrix.
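A sketch of what upper-triangle packing looks like with NumPy (the row-major `np.triu_indices` layout here is an assumption for illustration; statmoments' own storage order may differ):

```python
import numpy as np

n = 4
rng = np.random.default_rng(1)
m = rng.normal(size=(n, n))
sym = m + m.T  # a symmetric matrix, e.g. a covariance/co-moment matrix

# Pack: keep only the upper triangle, diagonal included -> n*(n+1)/2 values
iu = np.triu_indices(n)
packed = sym[iu]

# Unpack: rebuild the full symmetric matrix from the packed values
full = np.zeros((n, n))
full[iu] = packed
full = full + full.T - np.diag(np.diag(full))  # mirror; subtract the doubled diagonal
```

Storing only the triangle is what halves the memory footprint of the bivariate accumulators.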


## Installation

pip install -e .
```shell
pip install statmoments
```
12 changes: 6 additions & 6 deletions examples/bivar.py
@@ -14,29 +14,29 @@ def bivar_ttest():
# Input data
traces0 = np.random.normal(0, 10, (tr_count, tr_len)).astype(np.int8)
# Insert correlation for sample points 0-1 for the full batch
- traces0[:, 0] = (2 * traces0[:, 1] + 10).astype(np.int8)
+ traces0[:, 0] = (2 * traces0[:, 1] + 10).astype(np.int8)
# and for interlaced waveforms, sample points 2-3
- traces0[::2, 2] = (3 * traces0[::2, 3]/2 + 20).astype(np.int8)
+ traces0[::2, 2] = (3 * traces0[::2, 3] / 2 + 20).astype(np.int8)

# Generate sorting classification (data partitioning hypotheses)
# 0: the input batch belongs to dataset 0
# 1: the input batch belongs to dataset 1
# 2: data interlaced from 0
# 3: data interlaced from 1
- cl0 = [[0, 1, i % 2, (i+1) % 2] for i in range(len(traces0))]
+ cl0 = [[0, 1, i % 2, (i + 1) % 2] for i in range(len(traces0))]

# Process input
bveng.update(traces0, cl0)

traces1 = np.random.normal(0, 10, (tr_count, tr_len)).astype(np.int8)
# Insert correlation for interlaces waveforms, sample points 2-3
- traces1[::2, 2] = (3 * traces1[::2, 3]/2 + 20).astype(np.int8)
+ traces1[::2, 2] = (3 * traces1[::2, 3] / 2 + 20).astype(np.int8)
# Generate sorting classification (data partitioning hypotheses)
# 0: the input batch belongs to dataset 1
# 1: the input batch belongs to dataset 0
# 2: data interlaced from 0
# 3: data interlaced from 1
- cl1 = [[1, 0, i % 2, (i+1) % 2] for i in range(len(traces1))]
+ cl1 = [[1, 0, i % 2, (i + 1) % 2] for i in range(len(traces1))]
bveng.update(traces1, cl1)

# All generator returned data must be copied out
@@ -69,7 +69,7 @@ def bivar_ttest():
print(f"Found stat diff in the split {i}")

# Second (covariances)
- for i, tt in enumerate(statmoments.stattests.ttests(bveng, moment=(1,1))):
+ for i, tt in enumerate(statmoments.stattests.ttests(bveng, moment=(1, 1))):
if np.any(np.abs(tt) > 5):
print(f"Found stat diff in the split {i}")

22 changes: 15 additions & 7 deletions setup.cfg
@@ -1,7 +1,17 @@
[metadata]
description = Fast streaming bivariate statistics and t-test
name = statmoments
author = Anton Kochepasov
author_email = akss@me.com
license = MIT
platforms = any
description = Fast streaming single-pass univariate/bivariate statistics and t-test
long_description = file: README.md
long_description_content_type = text/markdown
project_urls =
Source Code = https://github.com/akochepasov/statmoments/
classifiers =
Development Status :: 5 - Production/Stable
Environment :: Console
Intended Audience :: Developers
Intended Audience :: Science/Research
Intended Audience :: Financial and Insurance Industry
@@ -20,8 +30,8 @@ classifiers =
Topic :: Scientific/Engineering :: Information Analysis
Topic :: Scientific/Engineering :: Mathematics
Topic :: Scientific/Engineering :: Physics
keywords = data-science,univariate,bivariate,statistics,streaming,numpy,vectorization
platforms = any
keywords =
data-science,univariate,bivariate,statistics,streaming,numpy,vectorization

[options]
zip_safe = False
@@ -31,17 +41,15 @@ install_requires =
numpy
scipy
psutil

setup_requires =
cython

[flake8]
- ignore = E111,E114,E226,E231,E241,E272,E221
+ ignore = E111,E114,E221,E241,E272
per-file-ignores =
__init__.py: F401
_statmoments_impl.py: F401
_native_shim.py: E402,F401,F403
bench_bivar.py: E121,E122,E201,E262,E265
examples/bivar.py: F841
examples/univar.py: F841
exclude =
@@ -64,4 +72,4 @@ statistics = yes
# E262 Inline comment should start with #
# E111 Indentation is not a multiple of four
# E114 Indentation is not a multiple of four (comment)
# E226 Missing whitespace around arithmetic operations
# E226 Missing whitespace around arithmetic operations
6 changes: 1 addition & 5 deletions setup.py
@@ -11,6 +11,7 @@
# It should work even with 2.7, just never really tested
raise RuntimeError("statmoments requires Python 3.6 or later")

kwargs = {}
basedir = os.path.abspath(os.path.dirname(__file__))
USE_CYTHON = os.path.isfile(os.path.join(basedir, "statmoments/_native.pyx"))

@@ -62,17 +63,12 @@ def get_version():
# setuptools.setup(
# cmdclass = {'build_ext' : build_ext_cupy},

kwargs = {}


def main():
ext = '.pyx' if USE_CYTHON else '.c'
extensions = [make_ext("statmoments._native", 'statmoments/_native' + ext)]
extensions = cythonize('statmoments/_native' + ext) if USE_CYTHON else extensions
setuptools.setup(
name='statmoments',
author='Anton Kochepasov',
author_email='akss@me.com',
version=get_version(),
ext_modules=extensions,
**kwargs)