Multiple doc, pack and format improvements
 * Documentation
 * Packaging and pypi
 * Corrected flake errors
 Improved formatting (thanks Petr Gladkikh, @PetrGlad)
akochepasov committed Dec 11, 2023
1 parent ec73621 commit d3d7e2d
Showing 12 changed files with 358 additions and 471 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/python-package.yml
@@ -43,5 +43,5 @@ jobs:
run: |
# stop the build if there are Python syntax errors or undefined names
flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
- # exit-zero treats all errors as warnings. Line is 160 chars wide
- flake8 . --count --exit-zero --max-complexity=10 --max-line-length=160 --statistics
+ # exit-zero treats all errors as warnings. Line is 120 chars wide
+ flake8 . --count --exit-zero --max-complexity=10 --max-line-length=120 --statistics
44 changes: 35 additions & 9 deletions README.md
@@ -2,15 +2,15 @@

Fast streaming univariate and bivariate moments and t-statistics.

- statmoments is a library for fast streaming one pass computation of univariate and bivariate moments and statistics for batch of multiple of waveforms or traces with thousands of sample points. Given the data sorting with classifier, it can compute Welch's t-test statistics of various orders for arbitrary data partitioning to allow finding relationships and statistical differences among many data splits, which are unknown beforehand. statmoments uses best of class BLAS implementation and preprocesses input data to take the most of computational power and perform computations as fast as possible.
+ statmoments is a library for the fast streaming one-pass computation of univariate and bivariate moments for batches of multiple waveforms or traces with thousands of sample points. Given the data sorting with classifiers, it can compute Welch's t-test statistics of various orders for arbitrary data partitioning, allowing it to find relationships and statistical differences among many data splits that are unknown beforehand. statmoments uses a best-of-class BLAS implementation and preprocesses input data to make the most of the available computational power and perform computations as fast as possible on Windows and Linux platforms.

## How is that different?

- When the difference in input data is really subtle, millions of waveforms should be processed to find the statistical significant difference. Therefore, computationally efficient algorithms for analysing such volumes of data are crucial for practical use. Due to their nature, computing high-order moments and statistics requires 2- or 3-pass computations. However, not all the data can fit into the available memory. In another case, after initial input has been processed, new data may appear, which requires starting the process over. When the input waveforms contains many thousands of sample points, it adds another dimension to an already complex problem.
+ When the difference in input data is really subtle, millions of waveforms should be processed to find the statistically significant difference. Therefore, computationally efficient algorithms for analyzing such volumes of data are crucial for practical use. Due to their nature, computing high-order moments and statistics requires 2- or 3-pass computations. However, not all the data can fit into the available memory. In another case, after initial input has been processed, new data may appear, which requires starting the process over. When the input waveforms contain many thousands of sample points, it adds another dimension to an already complex problem.

This can be overcome with a streaming approach. A streaming algorithm examines a sequence of inputs in a single pass, it processes data as it is collected, without waiting for it to be pre-collected and stored in a persistent storage. Streaming computation is often used to process data from real-time sources, such as oscilloscopes, sensors as well as financial markets. It can also be used to process large datasets that are too large to fit in memory or to process data that is constantly changing. Another name for such an approach is an online algorithm.
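The streaming update described above can be illustrated with Welford's classic one-pass algorithm for the mean and variance (a minimal generic sketch; statmoments' internal kernels are vectorized and not shown here):

```python
import numpy as np

def update(state, batch):
    # Welford's one-pass update: fold a batch of samples into (count, mean, M2)
    count, mean, M2 = state
    for x in batch:
        count += 1
        delta = x - mean
        mean += delta / count
        M2 += delta * (x - mean)  # M2 accumulates the sum of squared deviations
    return count, mean, M2

# Data arrives in chunks; no chunk is ever revisited
state = (0, 0.0, 0.0)
for chunk in (np.array([1.0, 2.0]), np.array([3.0, 4.0])):
    state = update(state, chunk)

count, mean, M2 = state
variance = M2 / count  # population variance of all samples seen so far
```

The same pattern generalizes to higher central moments and to co-moments, which is what makes single-pass, constant-memory processing of unbounded waveform streams possible.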

- This library implements streaming algorithms to avoid recomputation, stores intermediate data in the accumulator. The algorithm also takes in the account that covariance and higher matrices are represented by a symmetric matrix, decreasing the memory requirement two-fold. The data update is blazingly fast and at any moment the internal accumulator can be converted to to (co-)moments on demand and the moments in turn can be converted to t-statistics with Welch's t-test. Once more data collected, it can be iteratively processed, increasing the precision of the moments, and discarded. The computation is optimized to consume significant input streams, hundreds of megabytes per second of waveforms, which may contain thousands of points.
+ This library implements streaming algorithms to avoid recomputation and stores intermediate data in an accumulator. The algorithm also takes into account that covariance and higher-order matrices are symmetric, decreasing the memory requirement two-fold. The data update is blazingly fast, and at any moment the internal accumulator can be converted to (co-)moments on demand; the moments, in turn, can be converted to t-statistics with Welch's t-test. Once more data is collected, it can be iteratively processed, increasing the precision of the moments, and discarded. The computation is optimized to consume significant input streams, hundreds of megabytes per second of waveforms, which may contain thousands of points.
Yet another dimension is added when the data split is unknown, that is, when it is not known in advance which bucket each input waveform belongs to. This library solves this by pre-classifying the input data and computing moments for all the requested data splits.
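The pre-classification idea can be sketched as a toy (illustrative shapes only, not statmoments' API): each classifier assigns every waveform to bucket 0 or 1, and accumulators are kept per (classifier, bucket) so every candidate split is evaluated in the same single pass:

```python
import numpy as np

# 4 waveforms of 2 sample points; 2 candidate splits (classifiers)
waveforms = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
classifiers = np.array([[0, 1], [1, 0], [0, 1], [1, 1]])  # shape (n_waveforms, n_splits)

n_splits = classifiers.shape[1]
sums = np.zeros((n_splits, 2, waveforms.shape[1]))  # (classifier, bucket, sample point)
counts = np.zeros((n_splits, 2))
for wf, cls in zip(waveforms, classifiers):
    for c, bucket in enumerate(cls):
        sums[c, bucket] += wf
        counts[c, bucket] += 1

means = sums / counts[..., None]  # per-split, per-bucket mean waveform
```

Higher-order accumulators follow the same indexing, so the cost of testing many hypothetical splits is amortized over one pass through the data.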

Some of the benefits of streaming computation include:
@@ -25,11 +25,11 @@ Univariate statistics are used in various fields and contexts to analyze and des

- Descriptive Statistics: summarizing and describing the central tendency, dispersion, and shape of a dataset.
- Hypothesis Testing: testing hypotheses to determine if there are significant differences or relationships between groups or conditions.
- - Finance and Economics: Examining the performance of financial assets, track market trends, and assess risk in real time.
+ - Finance and Economics: Examining the performance of financial assets, tracking market trends, and assessing risk in real time.

In summary, univariate statistics are a fundamental tool in data analysis and are widely used across a range of fields to explore, summarize, and draw conclusions from single-variable data. They provide essential insights into the characteristics and behavior of individual variables, which can inform decision-making and further research.
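For reference, the Welch's t-test that the streamed moments feed into is the unequal-variance two-sample test; offline it is available in SciPy (a sketch with synthetic data, not statmoments' streaming path):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 1000)  # bucket 0: mean 0, std 1
b = rng.normal(1.0, 2.0, 1000)  # bucket 1: different mean and variance

# equal_var=False selects Welch's t-test (no equal-variance assumption)
t, p = stats.ttest_ind(a, b, equal_var=False)
```

A common screening rule is to flag a data split whenever |t| exceeds a threshold such as 5, which is the convention used in this project's examples.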

- Bivariate statistics is a tool for understanding the relationships between two variables. Researchers and practitioners use it in a wide range of fields to make informed decisions and improve outcomes. The can be used to answer a variety of questions, such as:
+ Bivariate statistics are a tool for understanding the relationship between two variables. Researchers and practitioners use them in a wide range of fields to make informed decisions and improve outcomes. They can be used to answer a variety of questions, such as:

- Is there a statistically significant relationship between variables?
- Which data points are related?
@@ -38,9 +38,13 @@ Bivariate statistics is a tool for understanding the relationships between two v

These statistical methods are used in medical and bioinformatics research, astrophysics, seismology, market predictions, and many more where the input data may be measured in hundreds of gigabytes.

## Numeric accuracy

The numeric accuracy of the results depends on the coefficient of variation (COV) of the sample points in the input waveforms. With a COV of about 5%, the computed (co-)kurtosis has about 10 correct significant digits for 10'000 waveforms, which is more than enough for the resulting t-test. Increasing the data volume by about 100x loses roughly one more significant digit.

## Examples

- Performing univariate data analysis
+ ### Performing univariate data analysis

```python
# Input data parameters
@@ -61,7 +65,7 @@ Performing univariate data analysis
mean = [cm.copy() for cm in uveng.moments(moments=1)]
skewness = [cm.copy() for cm in uveng.moments(moments=3)]

- # Detect statistical differences in the first order t-test
+ # Detect statistical differences in the first-order t-test
for i, tt in enumerate(statmoments.stattests.ttests(uveng, moment=1)):
if np.any(np.abs(tt) > 5):
print(f"Data split {i} has different means")
@@ -72,7 +76,7 @@ Performing univariate data analysis
# Get updated statistical moments and t-tests
```

- Performing bivariate data analysis
+ ### Performing bivariate data analysis

```python
# Input data parameters
@@ -108,8 +112,30 @@ Performing bivariate data analysis
# Get updated statistical moments and t-tests
```

### Performing data analysis from the command line

```shell
# Find univariate t-test statistics of skewness for
# the first 5000 waveform sample points
# taken from the HDF5 dataset
python -m statmoments.univar -i data.h5 -m 3 -r 0:5000

# Find bivariate t-test statistics of covariance for
# the first 1000 waveform sample points
# Taken from the HDF5 dataset
python -m statmoments.bivar -i data.h5 -r 0:1000
```

More examples can be found in the examples and tests directories.

## Implementation notes

Since the output data can exhaust the available RAM, the results are matrices of statistical moments for the requested region, produced one at a time for each input classifier. The output moment for each classifier has dimension 2 x M x L, where M is the index of the requested classifier and L is the region length. The t-test is represented by a 1D array for each classifier.
The **bivariate moments** are represented by the **upper triangle** of the symmetric matrix.
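A sketch of what upper-triangle packing looks like with NumPy (the row-major `np.triu_indices` layout here is an assumption for illustration; statmoments' own storage order may differ):

```python
import numpy as np

n = 4
rng = np.random.default_rng(1)
m = rng.normal(size=(n, n))
sym = m + m.T  # a symmetric matrix, e.g. a covariance/co-moment matrix

# Pack: keep only the upper triangle, diagonal included -> n*(n+1)/2 values
iu = np.triu_indices(n)
packed = sym[iu]

# Unpack: rebuild the full symmetric matrix from the packed values
full = np.zeros((n, n))
full[iu] = packed
full = full + full.T - np.diag(np.diag(full))  # mirror; subtract the doubled diagonal
```

Storing only the triangle is what halves the memory footprint of the bivariate accumulators.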


## Installation

pip install -e .
```shell
pip install statmoments
```
12 changes: 6 additions & 6 deletions examples/bivar.py
@@ -14,29 +14,29 @@ def bivar_ttest():
# Input data
traces0 = np.random.normal(0, 10, (tr_count, tr_len)).astype(np.int8)
# Insert correlation for sample points 0-1 for the full batch
- traces0[:, 0] = (2 * traces0[:, 1] + 10).astype(np.int8)
+ traces0[:, 0] = (2 * traces0[:, 1] + 10).astype(np.int8)
# and for interlaced waveforms, sample points 2-3
- traces0[::2, 2] = (3 * traces0[::2, 3]/2 + 20).astype(np.int8)
+ traces0[::2, 2] = (3 * traces0[::2, 3] / 2 + 20).astype(np.int8)

# Generate sorting classification (data partitioning hypotheses)
# 0: the input batch belongs to dataset 0
# 1: the input batch belongs to dataset 1
# 2: data interlaced from 0
# 3: data interlaced from 1
- cl0 = [[0, 1, i % 2, (i+1) % 2] for i in range(len(traces0))]
+ cl0 = [[0, 1, i % 2, (i + 1) % 2] for i in range(len(traces0))]

# Process input
bveng.update(traces0, cl0)

traces1 = np.random.normal(0, 10, (tr_count, tr_len)).astype(np.int8)
# Insert correlation for interlaces waveforms, sample points 2-3
- traces1[::2, 2] = (3 * traces1[::2, 3]/2 + 20).astype(np.int8)
+ traces1[::2, 2] = (3 * traces1[::2, 3] / 2 + 20).astype(np.int8)
# Generate sorting classification (data partitioning hypotheses)
# 0: the input batch belongs to dataset 1
# 1: the input batch belongs to dataset 0
# 2: data interlaced from 0
# 3: data interlaced from 1
- cl1 = [[1, 0, i % 2, (i+1) % 2] for i in range(len(traces1))]
+ cl1 = [[1, 0, i % 2, (i + 1) % 2] for i in range(len(traces1))]
bveng.update(traces1, cl1)

# All generator returned data must be copied out
@@ -69,7 +69,7 @@ def bivar_ttest():
print(f"Found stat diff in the split {i}")

# Second (covariances)
- for i, tt in enumerate(statmoments.stattests.ttests(bveng, moment=(1,1))):
+ for i, tt in enumerate(statmoments.stattests.ttests(bveng, moment=(1, 1))):
if np.any(np.abs(tt) > 5):
print(f"Found stat diff in the split {i}")

22 changes: 15 additions & 7 deletions setup.cfg
@@ -1,7 +1,17 @@
[metadata]
description = Fast streaming bivariate statistics and t-test
name = statmoments
author = Anton Kochepasov
author_email = akss@me.com
license = MIT
platforms = any
description = Fast streaming single-pass univariate/bivariate statistics and t-test
long_description = file: README.md
long_description_content_type = text/markdown
project_urls =
Source Code = https://github.com/akochepasov/statmoments/
classifiers =
Development Status :: 5 - Production/Stable
Environment :: Console
Intended Audience :: Developers
Intended Audience :: Science/Research
Intended Audience :: Financial and Insurance Industry
@@ -20,8 +30,8 @@ classifiers =
Topic :: Scientific/Engineering :: Information Analysis
Topic :: Scientific/Engineering :: Mathematics
Topic :: Scientific/Engineering :: Physics
keywords = data-science,univariate,bivariate,statistics,streaming,numpy,vectorization
platforms = any
keywords =
data-science,univariate,bivariate,statistics,streaming,numpy,vectorization

[options]
zip_safe = False
@@ -31,17 +41,15 @@ install_requires =
numpy
scipy
psutil

setup_requires =
cython

[flake8]
- ignore = E111,E114,E226,E231,E241,E272,E221
+ ignore = E111,E114,E221,E241,E272
per-file-ignores =
__init__.py: F401
_statmoments_impl.py: F401
_native_shim.py: E402,F401,F403
bench_bivar.py: E121,E122,E201,E262,E265
examples/bivar.py: F841
examples/univar.py: F841
exclude =
@@ -64,4 +72,4 @@ statistics = yes
# E262 Inline comment should start with #
# E111 Indentation is not a multiple of four
# E114 Indentation is not a multiple of four (comment)
# E226 Missing whitespace around arithmetic operations
# E226 Missing whitespace around arithmetic operations
6 changes: 1 addition & 5 deletions setup.py
@@ -11,6 +11,7 @@
# It should work even with 2.7, just never really tested
raise RuntimeError("statmoments requires Python 3.6 or later")

kwargs = {}
basedir = os.path.abspath(os.path.dirname(__file__))
USE_CYTHON = os.path.isfile(os.path.join(basedir, "statmoments/_native.pyx"))

@@ -62,17 +63,12 @@ def get_version():
# setuptools.setup(
# cmdclass = {'build_ext' : build_ext_cupy},

kwargs = {}


def main():
ext = '.pyx' if USE_CYTHON else '.c'
extensions = [make_ext("statmoments._native", 'statmoments/_native' + ext)]
extensions = cythonize('statmoments/_native' + ext) if USE_CYTHON else extensions
setuptools.setup(
name='statmoments',
author='Anton Kochepasov',
author_email='akss@me.com',
version=get_version(),
ext_modules=extensions,
**kwargs)