# NumPy Statistical Functions 

NumPy provides a wide range of statistical functions to analyze data in arrays. 

## Mean

In NumPy, mean is a function that calculates the arithmetic mean (average) of the elements in an array or along a specified axis. It is a commonly used statistical operation for analyzing data.

### Syntax
```
numpy.mean(arr, axis=None, dtype=None, keepdims=False)
```

### Parameters

- `arr`: Input array (can be any array-like object).
- `axis` (optional): Axis or axes along which the mean is computed. If None, the mean of the flattened array is returned.
    - `axis=0`: Mean of each column (for 2D arrays).
    - `axis=1`: Mean of each row (for 2D arrays).
- `dtype` (optional): Data type used to compute the mean. For integer inputs the default is float64; for floating-point inputs it matches the input dtype.
- `keepdims` (optional): If True, retains reduced dimensions as size 1.

### Examples

#### Mean of a 1D Array

In [None]:
import numpy as np

arr = np.array([1, 2, 3, 4, 5])
mean_val = np.mean(arr)
print(mean_val)  # Output: 3.0 (since (1+2+3+4+5)/5 = 3.0)

#### Mean of a 2D Array (Along Rows or Columns)

In [None]:
import numpy as np

arr_2d = np.array([[1, 2, 3], [4, 5, 6]])

# Overall mean
print(np.mean(arr_2d))  # Output: 3.5

# Mean along columns (axis=0)
print(np.mean(arr_2d, axis=0))  # Output: [2.5 3.5 4.5]

# Mean along rows (axis=1)
print(np.mean(arr_2d, axis=1))  # Output: [2. 5.]

#### Specifying Data Type (dtype)

In [None]:
import numpy as np

arr = np.array([1, 2, 3, 4])
mean_float = np.mean(arr, dtype=np.float64)
print(mean_float)  # Output: 2.5 (np.mean already returns a float for integer input; dtype sets the type used in the computation)

#### Keeping Dimensions (keepdims=True)

In [None]:
import numpy as np

arr = np.array([[1, 2], [3, 4]])
mean_axis0 = np.mean(arr, axis=0, keepdims=True)
print(mean_axis0)  # Output: [[2. 3.]] (shape (1, 2) instead of (2,))
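
One common reason to keep the reduced dimension is broadcasting. The sketch below is an illustrative example (the array and variable names are not from the original): it centers each row by subtracting its own mean.

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

# With keepdims=True the row means have shape (2, 1),
# so they broadcast across the columns of arr.
row_means = np.mean(arr, axis=1, keepdims=True)
centered = arr - row_means

print(row_means.shape)  # (2, 1)
print(centered)         # each row becomes [-1, 0, 1]
```

Without `keepdims=True`, `row_means` would have shape `(2,)` and the subtraction would fail to broadcast against the `(2, 3)` array.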

## Median

In NumPy, the median is a statistical measure that represents the middle value of a dataset when it is sorted in ascending or descending order. If the dataset has an odd number of elements, the median is the middle value. If the dataset has an even number of elements, the median is the average of the two middle numbers.

### Syntax
```
numpy.median(a, axis=None, keepdims=False)
```

### Parameters

- `a`: Input array or object that can be converted to an array.
- `axis` (optional): Axis or axes along which the median is computed (default is None, which computes the median of the flattened array).
- `keepdims` (optional): If True, the reduced axes are kept as dimensions with size 1 (default is False).

### Examples

#### Median of a 1D Array

In [None]:
import numpy as np

arr = np.array([3, 1, 5, 2, 4])
med = np.median(arr)
print(med)  # Output: 3.0

#### Median of a 2D Array (Along an Axis)

In [None]:
import numpy as np

arr_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Median of the entire array
print(np.median(arr_2d))  # Output: 5.0

# Median along axis 0 (columns)
print(np.median(arr_2d, axis=0))  # Output: [4. 5. 6.]

# Median along axis 1 (rows)
print(np.median(arr_2d, axis=1))  # Output: [2. 5. 8.]

#### Median with Even Number of Elements

In [None]:
import numpy as np

arr_even = np.array([1, 3, 5, 7])
med_even = np.median(arr_even)
print(med_even)  # Output: 4.0 (average of 3 and 5)

### Key Notes:
- If the input array contains NaN (Not a Number), you can use `numpy.nanmedian()` to ignore `NaN` values.
- The median is a robust measure of central tendency and is less affected by outliers compared to the mean.
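
Both notes can be checked with a small sketch. The sample values here are made up for illustration.

```python
import numpy as np

# A single outlier (100.0) pulls the mean far more than the median
values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
print(np.mean(values))    # 22.0
print(np.median(values))  # 3.0

# np.median propagates NaN; np.nanmedian ignores it
with_nan = np.array([1.0, np.nan, 3.0, 5.0])
print(np.median(with_nan))     # nan
print(np.nanmedian(with_nan))  # 3.0
```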

## Variance

- Measures how far each number in a dataset is from the mean (spread).
- Formula (for a population): <br />
  ![variance_population](./images/variance_population.png "variance population")
- Formula (for a sample): <br />
  ![variance_sample](./images/variance_sample.png "variance sample")
- Example: For 1,2,3, the mean is 2, and the variance is <br />
  ![variance](./images/variance.png "variance")

In NumPy, variance is a measure of how spread out the values in an array are. It quantifies the dispersion of data points around the mean (average) value.

### Types of Variance in NumPy
1. np.var() (population variance by default):
    - Computes the variance of an array.
    - By default (ddof=0), it calculates the population variance (divides by n, where n is the number of elements).
    - If ddof=1 (Delta Degrees of Freedom), it computes the sample variance (divides by n - 1).

2. np.nanvar():
    - Computes variance while ignoring NaN (Not a Number) values.
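
To make the difference between the two divisors concrete, here is a minimal sketch that computes both variances by hand for the values 1, 2, 3 and compares them with `np.var()`. The variable names are illustrative.

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0])
n = data.size
squared_dev = (data - data.mean()) ** 2   # squared deviations: [1. 0. 1.]

pop_var = squared_dev.sum() / n           # population variance: divide by n
sample_var = squared_dev.sum() / (n - 1)  # sample variance: divide by n - 1

print(np.isclose(pop_var, np.var(data)))             # True
print(np.isclose(sample_var, np.var(data, ddof=1)))  # True

# np.nanvar skips the NaN entry, so this is the variance of [1, 2, 3]
nan_var = np.nanvar(np.array([1.0, np.nan, 2.0, 3.0]))
print(np.isclose(nan_var, pop_var))                  # True
```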

### Example

In [None]:
import numpy as np

data = np.array([1, 2, 3, 4, 5])

# Population Variance (default)
var_pop = np.var(data)  # divides by n
print(var_pop)  # Output: 2.0

# Sample Variance (ddof=1)
var_sample = np.var(data, ddof=1)  # divides by n-1
print(var_sample)  # Output: 2.5

# Variance along an axis (for 2D arrays)
data_2d = np.array([[1, 2], [3, 4]])
var_axis0 = np.var(data_2d, axis=0)  # Variance along columns
print(var_axis0)  # Output: [1. 1.]

## Standard Deviation (SD)

- The square root of variance, giving a measure of spread in the same units as the data.
- Formula: <br />
  ![sd](./images/sd.png "sd")
- Example: If variance is 4, then SD = 2

In NumPy, standard deviation is a measure of how spread out numbers are in an array or dataset. It quantifies the amount of variation or dispersion from the average (mean).
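
The relationship between the two measures can be verified directly. The dataset below was chosen for illustration so that the variance is exactly 4.

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

variance = np.var(data)  # population variance
std_dev = np.std(data)   # population standard deviation

print(variance)  # 4.0
print(std_dev)   # 2.0
print(np.isclose(std_dev, np.sqrt(variance)))  # True
```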

### Examples

#### Basic

In [None]:
import numpy as np

data = np.array([1, 2, 3, 4, 5])
std_dev = np.std(data)
print(std_dev)  # Output: 1.4142135623730951

#### ddof (Delta Degrees of Freedom)
- Default is 0 (population standard deviation)
- Set to 1 for sample standard deviation (Bessel's correction)

In [None]:
import numpy as np

data = np.array([1, 2, 3, 4, 5])

# Population standard deviation (default)
pop_std = np.std(data)  # ddof=0
print(pop_std)  # Output: 1.4142135623730951

# Sample standard deviation
sample_std = np.std(data, ddof=1)
print(sample_std)  # Output: approximately 1.5811

#### axis
- Calculate along rows or columns for multi-dimensional arrays

In [None]:
import numpy as np

arr_2d = np.array([[1, 2, 3], [4, 5, 6]])

# Standard deviation of each column
col_std = np.std(arr_2d, axis=0)
print(col_std)  # Output: [1.5 1.5 1.5]

# Standard deviation of each row
row_std = np.std(arr_2d, axis=1)
print(row_std)  # Output: [0.81649658 0.81649658]

## Minimum & Maximum

Minimum (Min)
- The smallest value in a dataset.
- Example: For 5,2,8, the minimum is 2

Maximum (Max)
- The largest value in a dataset.
- Example: For 5,2,8, the maximum is 8.

In [None]:
import numpy as np

# Minimum & Maximum
arr = np.array([3, 1, 5, 2, 4])
min_val = np.min(arr)  # 1
max_val = np.max(arr)  # 5
print(min_val)
print(max_val)
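
Like the other reductions in this section, `np.min()` and `np.max()` accept an `axis` argument. The 2D array below is an illustrative example, not part of the original notebook.

```python
import numpy as np

arr_2d = np.array([[5, 2, 8], [1, 9, 4]])

col_min = np.min(arr_2d, axis=0)  # smallest value in each column
row_max = np.max(arr_2d, axis=1)  # largest value in each row

print(col_min)  # [1 2 4]
print(row_max)  # [8 9]
```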

## Product

- The result of multiplying all values in a dataset.
- Formula: <br />
  ![product](./images/product.png "product")
- Example: For 2,3,4, the product is 2×3×4=24

In [None]:
import numpy as np

# Product of elements
arr = np.array([3, 1, 5, 2, 4])
product = np.prod(arr)  # 120
print(product)

## Percentiles 

In NumPy, percentiles are used to calculate the value below which a given percentage of observations in a dataset fall. The numpy.percentile() function computes the specified percentile(s) of the data along a given axis.

### Syntax
```
numpy.percentile(a, q, axis=None, method='linear')
```

- `a`: Input array or object that can be converted to an array.
- `q`: Percentile(s) to compute (between 0 and 100).
- `axis`: Axis along which to compute the percentile (None for flattened array).
- `method` (optional): Interpolation method ('linear', 'lower', 'higher', 'midpoint', 'nearest').

### Examples

#### Basic Percentile Calculation

In [None]:
import numpy as np

data = np.array([10, 20, 30, 40, 50])

# Compute the 50th percentile (median)
p50 = np.percentile(data, 50)
print(p50)  # Output: 30.0

#### Multiple Percentiles

In [None]:
import numpy as np

data = np.array([10, 20, 30, 40, 50])

percentiles = np.percentile(data, [25, 50, 75])
print(percentiles)  # Output: [20. 30. 40.]

#### Along a Specific Axis (for 2D arrays)

In [None]:
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

# Compute 50th percentile along axis=0 (columns)
p50_axis0 = np.percentile(arr, 50, axis=0)
print(p50_axis0)  # Output: [2.5 3.5 4.5]

# Compute 50th percentile along axis=1 (rows)
p50_axis1 = np.percentile(arr, 50, axis=1)
print(p50_axis1)  # Output: [2. 5.]

#### Different Interpolation Methods

In [None]:
import numpy as np

data = np.array([1, 2, 3, 4, 5])

# Different methods for interpolation
# (the 50th percentile lands exactly on 3 here, so every method returns the same value)
print(np.percentile(data, 50, method='linear'))  # 3.0 (default)
print(np.percentile(data, 50, method='lower'))   # 3 (closest lower value)
print(np.percentile(data, 50, method='higher'))  # 3 (closest higher value)
print(np.percentile(data, 50, method='midpoint')) # 3.0 (average of lower & higher)

#### Key Notes:
- `q` can be a single value or a list of percentiles (e.g., `[25, 50, 75]` for quartiles).
- Interpolation matters when the desired percentile lies between two data points.
- For datasets that contain `NaN` values, consider using `numpy.nanpercentile()` to ignore them.
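
The interpolation methods only differ when the percentile falls between two data points. The sketch below uses an even-length array chosen for illustration, and assumes a NumPy version that supports the `method` keyword used elsewhere in this section.

```python
import numpy as np

# With four values, the 50th percentile falls between 2 and 3
data = np.array([1, 2, 3, 4])

print(np.percentile(data, 50, method='linear'))    # 2.5
print(np.percentile(data, 50, method='lower'))     # 2
print(np.percentile(data, 50, method='higher'))    # 3
print(np.percentile(data, 50, method='midpoint'))  # 2.5

# nanpercentile ignores the NaN and works on [10, 30, 50]
with_nan = np.array([10.0, np.nan, 30.0, 50.0])
print(np.nanpercentile(with_nan, 50))  # 30.0
```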

## Quantiles 

In NumPy, quantiles are values that divide a dataset into equal-sized intervals. They are useful for understanding the distribution of data, identifying outliers, and summarizing datasets.

### Key Points

1. Definition:
    - The q-th quantile is the value below which q×100% of the data falls.
    - For example:
        - The 0.5 quantile (median) splits the data into two equal parts.
        - The 0.25 quantile (1st quartile) marks the 25th percentile.

2. NumPy Function:
    - numpy.quantile(arr, q) computes the specified quantile(s) of an array.
    - q can be a single value (e.g., 0.5) or a list of quantiles (e.g., [0.25, 0.5, 0.75]).

3. Parameters:
    - `arr`: Input array (or array-like data).
    - `q`: Quantile value(s) between 0 and 1.
    - `axis`: Axis along which to compute (default is None, meaning flattened array).
    - `method`: Interpolation method (e.g., 'linear', 'lower', 'higher', 'midpoint').

### Example

In [None]:
import numpy as np

data = np.array([10, 20, 30, 40, 50])

# Compute the median (0.5 quantile)
median = np.quantile(data, 0.5)
print(median)  # Output: 30.0

# Compute quartiles (0.25, 0.5, 0.75)
quartiles = np.quantile(data, [0.25, 0.5, 0.75])
print(quartiles)  # Output: [20. 30. 40.]

### Interpolation Methods
When the quantile falls between two data points, NumPy uses interpolation:
- `'linear'`: Default (weighted average of neighboring points).
- `'lower'`/`'higher'`: Nearest lower/higher value.
- `'midpoint'`: Average of the two closest points.

Equivalent Function:
- `numpy.percentile(arr, p)` is similar but uses percentages (e.g., `p=50` for median). Example: `np.percentile(data, 50)` is the same as `np.quantile(data, 0.5)`.
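
The equivalence can be checked with a short sketch. It also computes the interquartile range (IQR), a spread measure often used when looking for outliers; the variable names are illustrative.

```python
import numpy as np

data = np.array([10, 20, 30, 40, 50])
q = np.array([0.25, 0.5, 0.75])

by_quantile = np.quantile(data, q)
by_percentile = np.percentile(data, q * 100)  # same points, expressed as percentages

print(by_quantile)                              # [20. 30. 40.]
print(np.allclose(by_quantile, by_percentile))  # True

# Interquartile range: distance between the 1st and 3rd quartiles
iqr = by_quantile[2] - by_quantile[0]
print(iqr)  # 20.0
```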

## Correlation 

In NumPy, correlation refers to the statistical relationship between two arrays, measuring how they vary together. NumPy provides functions to compute different types of correlation, including the following:

### Pearson Correlation Coefficient (Linear Correlation)
- Measures the linear relationship between two arrays.
- Ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation), with 0 indicating no linear correlation.
- Computed using numpy.corrcoef().

In [None]:
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([5, 4, 3, 2, 1])

# Compute Pearson correlation coefficient
corr_matrix = np.corrcoef(x, y)
pearson_corr = corr_matrix[0, 1]  # Correlation between x and y

print(pearson_corr)  # Output: -1.0 (perfect negative correlation)

### Cross-Correlation
- Measures similarity between two signals as a function of a time lag.
- Computed using numpy.correlate() for 1D arrays. NumPy has no N-dimensional equivalent; for N-dimensional arrays, SciPy provides scipy.signal.correlate().

In [None]:
import numpy as np

a = np.array([1, 2, 3])
b = np.array([0, 1, 0.5])

cross_corr = np.correlate(a, b, mode='valid')
print(cross_corr)  # Output: [3.5]

### Auto-Correlation

- Measures how a signal correlates with a delayed version of itself.
- Can be computed using numpy.correlate(x, x, mode='full').

In [None]:
import numpy as np

x = np.array([1, 2, 3])
auto_corr = np.correlate(x, x, mode='full')
print(auto_corr)  # Output: [ 3  8 14  8  3]

### Key Notes:
- Pearson Correlation (corrcoef) is the most common for linear relationships.
- Cross-Correlation (correlate) is useful in signal processing.
- For statistical applications, you might also consider scipy.stats.pearsonr or pandas.DataFrame.corr() for more features.
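
The Pearson coefficient is the covariance of the two arrays divided by the product of their standard deviations. The sketch below rebuilds it by hand and compares it with `np.corrcoef()`; the sample arrays are made up for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# np.cov uses the sample convention (n - 1), so use ddof=1 for the standard deviations too
cov_xy = np.cov(x, y)[0, 1]
r_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))
r_numpy = np.corrcoef(x, y)[0, 1]

print(np.isclose(r_manual, r_numpy))  # True
```

The choice of `ddof` cancels out in the ratio, as long as the covariance and both standard deviations use the same convention.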

## Covariance

In NumPy, covariance refers to the measure of how much two random variables change together. The numpy.cov() function is used to compute the covariance matrix for a given dataset, which summarizes the covariances between all pairs of variables (features).

### Key Points about numpy.cov():
1. Input:
    - Accepts a 2D array where rows represent variables (features) and columns represent observations (samples).
    - Can also compute covariance between two 1D arrays (vectors).

2. Output:
    - Returns a covariance matrix (a square matrix) where:
        - Diagonal elements are variances of each variable.
        - Off-diagonal elements are covariances between pairs of variables.

### Example

In [None]:
import numpy as np

# Sample data: 3 variables (rows) with 5 observations (columns)
data = np.array([
    [1, 2, 3, 4, 5],      # Variable X
    [5, 4, 3, 2, 1],      # Variable Y
    [2, 2, 2, 2, 2]       # Variable Z (constant)
])

# Compute covariance matrix
cov_matrix = np.cov(data)
print(cov_matrix)
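
As a sanity check on the matrix structure described above, the following sketch confirms that the diagonal holds each variable's sample variance (by default, `np.cov()` divides by n - 1) and reads off the covariance between X and Y.

```python
import numpy as np

data = np.array([
    [1, 2, 3, 4, 5],      # Variable X
    [5, 4, 3, 2, 1],      # Variable Y
    [2, 2, 2, 2, 2],      # Variable Z (constant)
])
cov_matrix = np.cov(data)

# Diagonal entries equal the sample variance (ddof=1) of each row
diag_matches = np.allclose(np.diag(cov_matrix), np.var(data, axis=1, ddof=1))
print(diag_matches)      # True

# X and Y move in opposite directions, so their covariance is negative
print(cov_matrix[0, 1])  # -2.5

# Z is constant, so its variance and all of its covariances are 0
print(np.allclose(cov_matrix[2], 0.0))  # True
```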

## Histograms

NumPy provides functions to compute histograms, which summarize the distribution of numerical data. A histogram divides the data into bins (intervals) and counts how many values fall into each bin. NumPy returns the counts and bin edges as arrays; plotting them requires a library such as Matplotlib. Histograms are fundamental for data analysis and visualization, often used before more advanced statistical analysis.

### Parameters
Key parameters for numpy.histogram():
- `a`: Input data
- `bins`: Can be an integer (number of bins) or a sequence (bin edges)
- `range`: The lower and upper range of the bins
- `density`: If True, returns probability density (normalized)
- `weights`: Array of weights for each value in a

### numpy.histogram()
The main function for creating histograms in NumPy

In [None]:
import numpy as np

data = np.random.randn(1000)  # Sample data
hist, bin_edges = np.histogram(data, bins=10)

print("Counts per bin:", hist)
print("Bin edges:", bin_edges)
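
A few properties of the returned arrays are worth checking. This sketch uses a seeded generator (an assumption made here so the run is repeatable) and compares raw counts with the `density=True` output.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for repeatability
data = rng.normal(size=1000)

counts, edges = np.histogram(data, bins=10)
density, _ = np.histogram(data, bins=10, density=True)

print(counts.sum())  # 1000: with the default range, every value lands in a bin
print(len(edges))    # 11: there is always one more edge than bins

# With density=True, the bar areas (height * bin width) sum to 1
area = np.sum(density * np.diff(edges))
print(np.isclose(area, 1.0))  # True
```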

### numpy.histogram2d()
For creating 2D histograms from two datasets.

In [None]:
import numpy as np

x = np.random.randn(1000)
y = np.random.randn(1000)
hist, xedges, yedges = np.histogram2d(x, y, bins=10)

### Example with Customization

In [None]:
import numpy as np
import matplotlib.pyplot as plt

data = np.random.normal(170, 10, 250)  # Mean=170, std=10, 250 values

# Custom bins and range
hist, bins = np.histogram(data, bins=[140, 150, 160, 170, 180, 190, 200], 
                         density=True)

# Plotting (requires matplotlib)
plt.hist(data, bins=[140, 150, 160, 170, 180, 190, 200], density=True)
plt.show()

## Convolution

In NumPy, convolve is a function that performs convolution of two one-dimensional arrays (vectors). Convolution is a mathematical operation commonly used in signal processing, image processing, and deep learning (e.g., in convolutional neural networks).

`numpy.convolve(v1, v2, mode='full')`
- `v1`: First input array (1D).
- `v2`: Second input array (1D).
- `mode`: Specifies the size of the output:
    - `'full'` (default): Returns the convolution at every overlapping point (output size = len(v1) + len(v2) - 1).
    - `'same'`: Output has length max(len(v1), len(v2)), taken from the center of the `'full'` result.
    - `'valid'`: Returns only where the two arrays fully overlap (output size = max(len(v1), len(v2)) - min(len(v1), len(v2)) + 1).

In simple terms, it slides one array (v2, reversed) over the other (v1) and computes the sum of products at each position.

### Example

In [42]:
import numpy as np

a = np.array([1, 2, 3])
b = np.array([0, 1, 0.5])

# Convolution
result = np.convolve(a, b, mode='full')
print(result) # [0.  1.  2.5 4.  1.5]


Explanation:
- 0.0 = 1*0
- 1.0 = 1*1 + 2*0
- 2.5 = 1*0.5 + 2*1 + 3*0
- 4.0 = 2*0.5 + 3*1
- 1.5 = 3*0.5
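
The sum-of-products pattern above can be written as a direct double loop. The helper `convolve_full` below is a hypothetical, unoptimized illustration and is not how NumPy implements convolution; it is compared against `np.convolve()` to confirm that the results agree.

```python
import numpy as np

def convolve_full(v1, v2):
    """Direct 'full' convolution of two 1D sequences (illustrative only)."""
    v1 = np.asarray(v1, dtype=float)
    v2 = np.asarray(v2, dtype=float)
    out = np.zeros(len(v1) + len(v2) - 1)
    # Each pair (v1[i], v2[j]) contributes its product to output position i + j
    for i, x in enumerate(v1):
        for j, y in enumerate(v2):
            out[i + j] += x * y
    return out

a = np.array([1, 2, 3])
b = np.array([0, 1, 0.5])

manual = convolve_full(a, b)
print(manual)                                             # [0.  1.  2.5 4.  1.5]
print(np.allclose(manual, np.convolve(a, b, mode='full')))  # True

# Output length for each mode with two length-3 inputs
lengths = {mode: len(np.convolve(a, b, mode=mode)) for mode in ('full', 'same', 'valid')}
print(lengths)  # {'full': 5, 'same': 3, 'valid': 1}
```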

### Common Uses
- Signal Processing (e.g., smoothing, filtering).
- Image Processing (applying blur, edge detection).
- Neural Networks (CNNs use 2D convolution for feature extraction).

For multi-dimensional convolution, use:
- scipy.signal.convolve2d (2D arrays)
- scipy.ndimage.convolve (n-dimensional arrays)