# NumPy Aggregation and Statistics

**Author:** Dhanuja Wijerathne  
**Module:** NumPy Foundations  

**Objective:**  
Learn how to compute statistical summaries using NumPy and understand their importance in Machine Learning and data analysis.


## Why aggregation and statistics are important

In Machine Learning, before modeling, we often:
- Analyze data distributions
- Compute summary statistics
- Detect outliers
- Normalize features

NumPy provides vectorized aggregation functions for these tasks. They operate on whole arrays in compiled code, which is much faster than looping over elements in Python.


In [None]:
import numpy as np

In [None]:
# Simulated dataset (e.g., feature values)
data = np.array([12, 15, 18, 20, 22, 25, 30])

data

## Basic aggregation functions

Aggregation functions reduce an array to a single value that summarizes the data (or, when an axis is specified, to one value per row or column).


In [None]:
np.sum(data), np.min(data), np.max(data)

## Measures of central tendency and spread

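The mean and median describe the center of the data, while the standard deviation describes how widely the values spread around the mean. Note that np.std computes the population standard deviation (dividing by N) by default.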

In [None]:
np.mean(data), np.median(data), np.std(data)
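The difference between population and sample standard deviation can be sketched as follows. This is a minimal illustration reusing the `data` array from above; the variable names `pop_std` and `sample_std` are chosen here for illustration only.

```python
import numpy as np

data = np.array([12, 15, 18, 20, 22, 25, 30])

# Population standard deviation: divides by N (NumPy's default, ddof=0)
pop_std = np.std(data)

# Sample standard deviation: divides by N - 1 (ddof=1)
sample_std = np.std(data, ddof=1)

print(pop_std, sample_std)
```

The sample version is always slightly larger. Which one is appropriate depends on whether the array is treated as the full population or as a sample drawn from it.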

## Aggregation along axes

In ML datasets, data is often 2D:
- Rows → samples
- Columns → features

In [None]:
dataset = np.array([
    [10, 20, 30],
    [15, 25, 35],
    [20, 30, 40]
])

dataset

In [None]:
# Column-wise (feature-wise): axis=0 averages down the rows, giving one mean per column
np.mean(dataset, axis=0)

In [None]:
# Row-wise (sample-wise): axis=1 averages across the columns, giving one mean per row
np.mean(dataset, axis=1)
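To make the effect of the axis argument on output shape more concrete, here is a small sketch using the same `dataset`. The `keepdims=True` option is a standard NumPy parameter; the variable names are illustrative assumptions.

```python
import numpy as np

dataset = np.array([
    [10, 20, 30],
    [15, 25, 35],
    [20, 30, 40]
])

feature_means = np.mean(dataset, axis=0)  # shape (3,): one mean per column
sample_means = np.mean(dataset, axis=1)   # shape (3,): one mean per row
overall_mean = np.mean(dataset)           # scalar: mean of all nine values

# keepdims=True keeps the reduced axis with length 1, so the result
# has shape (3, 1) and broadcasts cleanly against the original array
row_means_2d = np.mean(dataset, axis=1, keepdims=True)
centered_rows = dataset - row_means_2d    # each row now has mean 0

print(feature_means.shape, sample_means.shape, row_means_2d.shape)
```

Keeping the reduced dimension is mostly useful for row-wise operations, where a plain shape `(3,)` result would otherwise broadcast along the wrong axis.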

## Aggregation in feature scaling

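Mean and standard deviation are used in standardization (z-score normalization), where each feature is shifted to zero mean and rescaled to unit standard deviation:

x_scaled = (x - mean) / std

This is an important preprocessing step for many ML algorithms, particularly those sensitive to feature scale, such as gradient-based and distance-based methods.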
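The formula can be applied feature-wise with axis-based aggregation. The following is a minimal sketch, assuming the small `dataset` from the previous section and that no column has zero standard deviation (a constant column would cause a division by zero and needs separate handling).

```python
import numpy as np

dataset = np.array([
    [10, 20, 30],
    [15, 25, 35],
    [20, 30, 40]
], dtype=float)

# Per-feature statistics (one value per column)
mean = dataset.mean(axis=0)
std = dataset.std(axis=0)

# Broadcasting applies each column's mean and std to every row
scaled = (dataset - mean) / std

print(scaled.mean(axis=0))  # approximately 0 for every feature
print(scaled.std(axis=0))   # approximately 1 for every feature
```

In practice the mean and standard deviation would typically be computed on the training data only and then reused to scale validation and test data.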


## Handling missing values

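Real-world datasets often contain missing values, represented in NumPy as NaN.
A standard aggregation such as np.mean returns NaN if any element is NaN. NumPy provides NaN-safe variants, such as np.nanmean, np.nansum, and np.nanstd, that ignore missing values.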


In [None]:
data_with_nan = np.array([10, 15, np.nan, 25, 30])

np.mean(data_with_nan), np.nanmean(data_with_nan)
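A slightly fuller sketch of the same idea, showing how to count missing values and use other NaN-safe aggregations on the array above. The variable names are illustrative.

```python
import numpy as np

data_with_nan = np.array([10, 15, np.nan, 25, 30])

# A regular aggregation propagates NaN
plain_mean = np.mean(data_with_nan)

# NaN-safe versions skip missing entries
safe_mean = np.nanmean(data_with_nan)  # mean of the four valid values
safe_sum = np.nansum(data_with_nan)    # sum of the four valid values
safe_std = np.nanstd(data_with_nan)    # population std of the valid values

# Count how many values are missing before deciding how to handle them
missing_count = np.isnan(data_with_nan).sum()

print(plain_mean, safe_mean, safe_sum, missing_count)
```

Checking `missing_count` first is a useful habit: if a large share of the values is missing, ignoring them may give a misleading summary.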

## Key Takeaways

- Aggregation functions summarize data efficiently
- Axis-based aggregation is essential for ML datasets
- Mean and standard deviation are foundational ML statistics
- NaN-safe functions such as np.nanmean keep missing values from silently turning results into NaN