# Statistical and Aggregation Functions

### Prerequisites
- [Probability and Statistics](https://www.mathsisfun.com/data/)

In [1]:
!pip install numpy --upgrade



In [2]:
import numpy as np

### sum
- Sum of elements
- **Syntax**: `numpy.sum(a, axis=None, dtype=None, out=None, keepdims=<no value>, initial=<no value>, where=<no value>)`

In [4]:
np.sum([0.5, 1.5])

np.float64(2.0)

In [5]:
np.sum([0.5, 0.7, 0.2, 1.5], dtype=np.int32) # each value is cast to int32 (truncated) before summing

np.int32(1)

In [7]:
x = np.array([[1,2,3],[4,5,6]])
np.sum(x, axis=1), np.sum(x, axis=0)

(array([ 6, 15]), array([5, 7, 9]))

In [8]:
np.sum([10], initial=5)

np.int64(15)
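
The `keepdims` and `where` parameters from the syntax line are not exercised above. The following is a minimal sketch of how they might be used; the variable names (`row_totals`, `row_shares`, `even_total`) are illustrative and not part of NumPy.

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]])

# keepdims=True keeps the reduced axis as a length-1 dimension,
# so the result still broadcasts against the original array.
row_totals = np.sum(x, axis=1, keepdims=True)   # shape (2, 1)
row_shares = x / row_totals                     # each row now sums to 1

# where= restricts the sum to the elements selected by a boolean mask.
even_total = np.sum(x, where=(x % 2 == 0))      # 2 + 4 + 6

print(row_totals.shape, even_total)
```

Keeping the reduced dimension is mostly useful for this kind of normalisation, where the totals have to line up with the rows they came from.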

### mean
- Mean of elements
- **Syntax**: `numpy.mean(a, axis=None, dtype=None, out=None, keepdims=<no value>, *, where=<no value>)`

In [9]:
a = np.array([[1, 2], [3, 4]])
np.mean(a)

np.float64(2.5)

In [10]:
np.mean(a, axis=0), np.mean(a, axis=1)

(array([2., 3.]), array([1.5, 3.5]))

In [13]:
a = np.array([[5, 9, 13], [14, 10, 12], [11, 15, 19]])
np.mean(a)

np.float64(12.0)

In [15]:
np.mean(a, where=[[True], [False], [False]]) # only the first row is used; the other two rows are ignored

np.float64(9.0)
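
The `where` mask is broadcast against `a`, which is why a mask of shape `(3, 1)` selects whole rows. As a rough sketch (the names `row_mask` and `col_mask` are illustrative), a one-dimensional mask of length 3 selects whole columns instead:

```python
import numpy as np

a = np.array([[5, 9, 13], [14, 10, 12], [11, 15, 19]])

row_mask = [[True], [False], [False]]   # shape (3, 1): broadcasts across columns
col_mask = [True, False, True]          # shape (3,): broadcasts across rows

first_row_mean = np.mean(a, where=row_mask)    # same elements as a[0]
outer_cols_mean = np.mean(a, where=col_mask)   # same elements as a[:, [0, 2]]

print(first_row_mean, outer_cols_mean)
```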

### median
- Median of elements
- **Syntax**: `numpy.median(a, axis=None, out=None, overwrite_input=False, keepdims=False)`

In [16]:
a = np.array([[10, 7, 4], [3, 2, 1]])
np.median(a)

np.float64(3.5)

In [17]:
np.median(a, axis=0), np.median(a, axis=1)

(array([6.5, 4.5, 2.5]), array([7., 2.]))
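
The value `3.5` above is not an element of `a`: with an even number of elements, the median is the average of the two middle values. A small sketch that reproduces it by hand (the helper names are illustrative):

```python
import numpy as np

a = np.array([[10, 7, 4], [3, 2, 1]])

flat_sorted = np.sort(a, axis=None)   # flatten, then sort: [1, 2, 3, 4, 7, 10]
mid = flat_sorted.size // 2           # index of the upper middle value
manual_median = (flat_sorted[mid - 1] + flat_sorted[mid]) / 2

print(manual_median, np.median(a))
```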

### std
- Standard deviation
- **Syntax**: `numpy.std(a, axis=None, dtype=None, out=None, ddof=0, keepdims=<no value>, *, where=<no value>, mean=<no value>, correction=<no value>)`
- Read about [Standard Deviation and Variance](https://www.mathsisfun.com/data/standard-deviation.html) before continuing.

In [31]:
a = np.array([[1, 2], [3, 4]])
np.std(a)
# may vary

np.float64(1.118033988749895)

In [32]:
a = np.zeros((2, 512*512), dtype=np.float32)
a[0, :] = 1.0
a[1, :] = 0.1
np.std(a)

np.float32(0.45000005)

In [33]:
np.std(a, dtype=np.float64) # Computing the standard deviation in float64 is more accurate
# may vary

np.float64(0.4499999992549418)

In [34]:
a = np.array([[14, 8, 11, 10], [7, 9, 10, 11], [10, 15, 5, 10]])
np.std(a)

np.float64(2.614064523559687)

In [36]:
np.std(a, where=[[True], [True], [False]]) # only the 3rd row is ignored

np.float64(2.0)
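
The `ddof` parameter in the syntax line controls the divisor, `N - ddof`. The default `ddof=0` gives the population standard deviation, while `ddof=1` gives the sample standard deviation. A minimal sketch, using made-up data chosen so the arithmetic is easy to follow:

```python
import numpy as np

# Illustrative data: mean is 5, squared deviations sum to 32.
data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

population_std = np.std(data)        # sqrt(32 / 8), divisor N
sample_std = np.std(data, ddof=1)    # sqrt(32 / 7), divisor N - 1

print(population_std, sample_std)
```

The sample version is always slightly larger, and the gap shrinks as the number of observations grows.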

### var
- Compute the variance along the specified axis.
- **Syntax**: `numpy.var(a, axis=None, dtype=None, out=None, ddof=0, keepdims=<no value>, *, where=<no value>, mean=<no value>, correction=<no value>)`
- Read about [Standard Deviation and Variance](https://www.mathsisfun.com/data/standard-deviation.html) before continuing.

In [18]:
a = np.array([[1, 2], [3, 4]])
np.var(a)

np.float64(1.25)

In [19]:
np.var(a, axis=0),np.var(a, axis=1)

(array([1., 1.]), array([0.25, 0.25]))

In [24]:
a = np.zeros((2, 512*512), dtype=np.float32)
a[0, :] = 1.0
a[1, :] = 0.1
np.var(a)

np.float32(0.20250003)

In [26]:
np.var(a, dtype=np.float64) # Computing the variance in float64 is more accurate
# may vary

np.float64(0.2024999993294476)

In [27]:
a = np.array([[14, 8, 11, 10], [7, 9, 10, 11], [10, 15, 5, 10]])
np.var(a)

np.float64(6.833333333333333)

In [29]:
np.var(a, where=[[True], [True], [False]]) # only the 3rd row is ignored

np.float64(4.0)
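
For reference, the variance is the mean of the squared deviations from the mean, and the standard deviation is its square root. The sketch below checks both relationships against `np.var` and `np.std` for the small array used earlier; `manual_var` is an illustrative name.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])

# Variance by definition: mean of squared deviations from the mean.
manual_var = np.mean(np.abs(a - a.mean()) ** 2)

print(manual_var, np.var(a))                  # both 1.25
print(np.isclose(np.sqrt(np.var(a)), np.std(a)))
```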

### corrcoef
- [Pearson correlation coefficient](https://www.geeksforgeeks.org/pearson-correlation-coefficient/)
- Return Pearson product-moment correlation coefficients.
- **Syntax**: `numpy.corrcoef(x, y=None, rowvar=True, *, dtype=None)`
  - rowvar - If rowvar is True (default), then each row represents a variable, with observations in the columns. Otherwise, the relationship is transposed: each column represents a variable, while the rows contain observations.
---
Please refer to the documentation for [cov](https://numpy.org/doc/stable/reference/generated/numpy.cov.html#numpy.cov) for more detail. The relationship between the correlation coefficient matrix, R, and the covariance matrix, C, is
\begin{align} R_{ij} = \frac{C_{ij}}{\sqrt{C_{ii}C_{jj}}} \end{align}
The values of R are between -1 and 1, inclusive.
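
The relationship between `R` and `C` can be checked numerically. The following is a sketch under arbitrary assumptions (the seed, the array shape, and the names `data`, `d`, and `R_manual` are chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # arbitrary seed for this sketch
data = rng.random((3, 5))             # 3 variables (rows), 5 observations each

C = np.cov(data)                      # covariance matrix, shape (3, 3)
d = np.sqrt(np.diag(C))               # sqrt(C_ii) for each variable
R_manual = C / np.outer(d, d)         # R_ij = C_ij / sqrt(C_ii * C_jj)

R = np.corrcoef(data)
print(np.allclose(R, R_manual))
```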



In this example we generate two random arrays, `xarr` and `yarr`, and compute the row-wise and column-wise Pearson correlation coefficients, `R`.

In [41]:
rng = np.random.default_rng(seed=42) # random number generator
xarr = rng.random((3, 3))
xarr

array([[0.77395605, 0.43887844, 0.85859792],
       [0.69736803, 0.09417735, 0.97562235],
       [0.7611397 , 0.78606431, 0.12811363]])

In [42]:
R1 = np.corrcoef(xarr)
R1

array([[ 1.        ,  0.99256089, -0.68080986],
       [ 0.99256089,  1.        , -0.76492172],
       [-0.68080986, -0.76492172,  1.        ]])

If we add another set of variables and observations `yarr`, we can compute the row-wise Pearson correlation coefficients between the variables in `xarr` and `yarr`.



In [43]:
yarr = rng.random((3, 3))
yarr

array([[0.45038594, 0.37079802, 0.92676499],
       [0.64386512, 0.82276161, 0.4434142 ],
       [0.22723872, 0.55458479, 0.06381726]])

In [44]:
R2 = np.corrcoef(xarr, yarr)
R2

array([[ 1.        ,  0.99256089, -0.68080986,  0.75008178, -0.934284  ,
        -0.99004057],
       [ 0.99256089,  1.        , -0.76492172,  0.82502011, -0.97074098,
        -0.99981569],
       [-0.68080986, -0.76492172,  1.        , -0.99507202,  0.89721355,
         0.77714685],
       [ 0.75008178,  0.82502011, -0.99507202,  1.        , -0.93657855,
        -0.83571711],
       [-0.934284  , -0.97074098,  0.89721355, -0.93657855,  1.        ,
         0.97517215],
       [-0.99004057, -0.99981569,  0.77714685, -0.83571711,  0.97517215,
         1.        ]])

Finally, with `rowvar=False`, the columns are treated as the variables, so we get the column-wise Pearson correlation coefficients between the variables in `xarr` and `yarr`.



In [45]:
R3 = np.corrcoef(xarr, yarr, rowvar=False)
R3

array([[ 1.        ,  0.77598074, -0.47458546, -0.75078643, -0.9665554 ,
         0.22423734],
       [ 0.77598074,  1.        , -0.92346708, -0.99923895, -0.58826587,
        -0.44069024],
       [-0.47458546, -0.92346708,  1.        ,  0.93773029,  0.23297648,
         0.75137473],
       [-0.75078643, -0.99923895,  0.93773029,  1.        ,  0.55627469,
         0.47536961],
       [-0.9665554 , -0.58826587,  0.23297648,  0.55627469,  1.        ,
        -0.46666491],
       [ 0.22423734, -0.44069024,  0.75137473,  0.47536961, -0.46666491,
         1.        ]])

### percentile
- Compute the q-th percentile of the data along the specified axis
- **Syntax**: `numpy.percentile(a, q, axis=None, out=None, overwrite_input=False, method='linear', keepdims=False, *, weights=None)`
  - a - Input array
  - q - Percentage or sequence of percentages for the percentiles to compute. Values must be between 0 and 100 inclusive.

In [47]:
a = np.array([[10, 7, 4], [3, 2, 1]])
np.percentile(a, 50)

np.float64(3.5)

In [48]:
np.percentile(a, 50, axis=0), np.percentile(a, 50, axis=1)

(array([6.5, 4.5, 2.5]), array([7., 2.]))
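
Since `q` may also be a sequence, several percentiles can be computed in one call. A short sketch with the same array (the names `quartiles` and `per_row` are illustrative); note that the percentiles are indexed along the first axis of the result:

```python
import numpy as np

a = np.array([[10, 7, 4], [3, 2, 1]])

# Quartiles of the flattened data, using the default 'linear' method.
quartiles = np.percentile(a, [25, 50, 75])      # [2.25, 3.5, 6.25]

# With an axis, the result has shape (len(q), ...): one row per percentile.
per_row = np.percentile(a, [25, 75], axis=1)    # shape (2, 2)

print(quartiles)
print(per_row)
```

The 50th percentile matches `np.median(a)`, which is why the outputs above mirror those in the median section.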

### histogram
- Compute histogram of an array
- **Syntax**: `numpy.histogram(a, bins=10, range=None, density=None, weights=None)`
  - a - Input data
  - bins - If bins is an int, it defines the number of equal-width bins in the given range (10, by default). If bins is a sequence, it defines a monotonically increasing array of bin edges, including the rightmost edge, allowing for non-uniform bin widths.
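  - range - The lower and upper range of the bins. Values outside the range are ignored. If not provided, the range is `(a.min(), a.max())`.
  - weights - An array of weights with the same shape as `a`. Each value in `a` contributes its weight, rather than 1, to the count of its bin.
  - density - If True, returns the value of the probability density function at each bin instead of raw counts, normalized so that the integral over the range is 1. The values sum to 1 only when every bin has unit width.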



In [49]:
np.histogram([1, 2, 1], bins=[0, 1, 2, 3])

(array([0, 2, 1]), array([0, 1, 2, 3]))

In [51]:
np.histogram(np.arange(4), bins=np.arange(5), density=True)

(array([0.25, 0.25, 0.25, 0.25]), array([0, 1, 2, 3, 4]))

In [52]:
np.histogram([[1, 2, 1], [1, 0, 1]], bins=[0,1,2,3])

(array([1, 4, 1]), array([0, 1, 2, 3]))
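
The examples above all use unit-width bins and no weights. The sketch below, with made-up data and illustrative variable names, shows what `density=True` returns when the bins have different widths, and how `weights` changes the counts:

```python
import numpy as np

data = np.array([0.5, 1.5, 1.5, 2.5, 3.5, 3.5, 3.5, 3.5])
edges = np.array([0, 1, 2, 4])          # the last bin is twice as wide
widths = np.diff(edges)

counts, _ = np.histogram(data, bins=edges)
density, _ = np.histogram(data, bins=edges, density=True)

# density = counts / (total count * bin width), so the area sums to 1
# even though the density values themselves do not.
print(counts, density, np.sum(density * widths))

# weights: each sample adds its weight, not 1, to its bin.
weighted, _ = np.histogram([1, 2, 1], bins=[0, 1, 2, 3], weights=[0.5, 2.0, 0.5])
print(weighted)
```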

### References
- https://numpy.org/doc/stable/reference/routines.statistics.html