# 📊 Section 4: Aggregation, Reduction, and Statistical Operations in NumPy

NumPy shines not just in elementwise computation but in **aggregating** and **summarizing** large datasets efficiently.

In this section, we'll explore how **reductions** like `sum()`, `mean()`, `min()`, and `std()` work internally, how to use them across axes, and how to chain them to compute advanced statistics efficiently.

---
## 🎯 Learning Objectives

By the end of this section, you'll:
- Understand the concept of **reductions** and **axes**.
- Perform **aggregations** across specific dimensions.
- Learn to use **built-in statistical functions** in NumPy.
- Optimize aggregations for **speed and memory efficiency**.
- Avoid common pitfalls like axis confusion or unintended data collapse.

---
## 🧠 1. What is a Reduction?

A *reduction* is any operation that **reduces the number of dimensions** by aggregating values along one or more axes.

For example, taking the mean of each row of a 2D array (`axis=1`) reduces it to 1D: one mean per row.

In [ ]:
import numpy as np

# Create a 2D array of simulated sensor readings
data = np.array([
    [23.1, 24.5, 22.8, 23.9],
    [25.4, 24.8, 26.1, 25.7],
    [22.5, 21.9, 23.0, 22.7]
])

print("Data:\n", data)

# Compute simple reductions
print("\nOverall mean:", np.mean(data))
print("Row-wise means:", np.mean(data, axis=1))
print("Column-wise means:", np.mean(data, axis=0))

Notice how the **axis** parameter controls *which dimension collapses*.

- `axis=0` → collapse rows → operate column-wise.
- `axis=1` → collapse columns → operate row-wise.
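
A quick way to check which axis collapsed is to inspect the shape of the result. The sketch below uses a hypothetical 3D array `cube` (not part of the examples above) to show that the reduced axis disappears from the shape, and that `axis` also accepts a tuple.

```python
import numpy as np

# Hypothetical 3D array: 2 blocks x 3 rows x 4 columns (illustrative only)
cube = np.arange(24).reshape(2, 3, 4)

shape_axis0 = cube.sum(axis=0).shape        # axis 0 removed
shape_axis1 = cube.sum(axis=1).shape        # axis 1 removed
shape_axis12 = cube.sum(axis=(1, 2)).shape  # axes 1 and 2 removed
total = cube.sum()                          # no axis: everything collapses to a scalar

print(shape_axis0)   # (3, 4)
print(shape_axis1)   # (2, 4)
print(shape_axis12)  # (2,)
print(total)         # 276
```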

---
## ⚙️ 2. Common Aggregation Functions

NumPy provides many efficient aggregation functions, including:

| Function | Description |
|-----------|--------------|
| `np.sum()` | Sum of elements |
| `np.mean()` | Arithmetic mean |
| `np.std()` | Standard deviation |
| `np.var()` | Variance |
| `np.min()` / `np.max()` | Minimum / Maximum |
| `np.prod()` | Product of all elements |
| `np.median()` | Median value |
| `np.percentile()` | Percentile statistics |

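All of these accept `axis` and `keepdims`. The arithmetic reductions (`np.sum()`, `np.mean()`, `np.std()`, `np.var()`, `np.prod()`) also accept `dtype` to set the accumulator and result type; `np.min()`, `np.max()`, `np.median()`, and `np.percentile()` do not.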

In [ ]:
# Example: axis and keepdims
sum_rows = np.sum(data, axis=1, keepdims=True)
print("Sum of each row (kept as 2D):\n", sum_rows)

# Normalizing by row sums
normalized = data / sum_rows
print("\nRow-normalized data:\n", normalized)

# Compute standard deviation for each sensor (column)
print("\nStandard deviation per column:", np.std(data, axis=0))

✅ `keepdims=True` keeps the reduced axis as a length-1 dimension, so the result broadcasts cleanly against the original array.
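
To see why this matters, the sketch below centers each row of a small array by subtracting its row mean. The array `readings` is a made-up stand-in for `data`. Without `keepdims=True` the row means have shape `(3,)`, which cannot broadcast against `(3, 4)`.

```python
import numpy as np

# Made-up 3x4 array standing in for `data` (illustrative values)
readings = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [10.0, 20.0, 30.0, 40.0],
    [5.0, 5.0, 5.0, 5.0],
])

row_means_flat = readings.mean(axis=1)                 # shape (3,)
row_means_kept = readings.mean(axis=1, keepdims=True)  # shape (3, 1)

try:
    readings - row_means_flat      # (3, 4) - (3,) does not broadcast
    broadcast_failed = False
except ValueError:
    broadcast_failed = True

centered = readings - row_means_kept   # (3, 4) - (3, 1) broadcasts per row

print(row_means_flat.shape, row_means_kept.shape)
print("Broadcast without keepdims failed:", broadcast_failed)
print(centered)
```

An equivalent alternative is `row_means_flat[:, np.newaxis]`; `keepdims=True` just saves the manual reshape.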

---
## 📈 3. Realistic Example: Daily Temperature Aggregation

Let's simulate daily temperature readings (in °C) for 7 days from 5 weather stations and perform aggregations.

In [ ]:
np.random.seed(42)

temps = np.random.normal(loc=25, scale=3, size=(7, 5))  # 7 days × 5 stations
print("Temperatures (°C):\n", np.round(temps, 2))

# Mean temperature per day
daily_mean = np.mean(temps, axis=1)
# Mean temperature per station
station_mean = np.mean(temps, axis=0)
# Overall stats
overall_min = np.min(temps)
overall_max = np.max(temps)
overall_std = np.std(temps)

print("\nDaily mean temperatures:", np.round(daily_mean, 2))
print("Station mean temperatures:", np.round(station_mean, 2))
print(f"\nOverall Min: {overall_min:.2f}, Max: {overall_max:.2f}, Std: {overall_std:.2f}")

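NumPy's aggregations run in **compiled C code** rather than in the Python interpreter, so simple reductions such as `sum()` or `mean()` over even a 1000×1000 matrix (one million elements) typically finish in a few milliseconds or less on current hardware.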
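
A rough way to check this on your own machine is to time `np.sum` against Python's built-in `sum` on the same data. The sketch below is only an informal benchmark; absolute timings depend on hardware and NumPy version, so no expected numbers are shown.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
matrix = rng.random((1000, 1000))     # one million float64 values
flat_list = matrix.ravel().tolist()   # same values as a plain Python list

start = time.perf_counter()
numpy_total = np.sum(matrix)
numpy_seconds = time.perf_counter() - start

start = time.perf_counter()
python_total = sum(flat_list)
python_seconds = time.perf_counter() - start

print(f"np.sum:     {numpy_seconds:.6f} s")
print(f"Python sum: {python_seconds:.6f} s")
print("Totals agree:", np.isclose(numpy_total, python_total))
```

For more reliable numbers, repeat each measurement several times (for example with the standard-library `timeit` module) rather than relying on a single run.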

---
## 🧩 4. Using `dtype` for Precision and Stability

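When aggregating integer arrays, the total can exceed the range of the accumulator type and silently wrap around. By default, `np.sum()` accumulates small integer types in the platform integer, which is 64 bits on most systems but only 32 bits on some platforms and NumPy versions. Passing `dtype` sets the accumulator explicitly.

In [ ]:
# One million int16 values: each fits in int16, but their total (3e10) does not
arr_int = np.full(1_000_000, 30_000, dtype=np.int16)
print("Sum forced to int16 (wraps around):", np.sum(arr_int, dtype=np.int16))
print("Default sum (platform integer):", np.sum(arr_int))
print("Safe sum with dtype=int64:", np.sum(arr_int, dtype=np.int64))

Set `dtype` explicitly when summing large integer arrays so the result does not depend on the platform's default integer width.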
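
The `dtype` argument matters for floating-point data too. The sketch below follows the pattern used in the NumPy documentation for `np.mean`: averaging a large `float32` array in single precision can drift slightly, while passing `dtype=np.float64` accumulates in double precision. The exact single-precision result varies between NumPy versions, so it is printed without an expected value.

```python
import numpy as np

# Large float32 array: half the values are 1.0, half are 0.1
values = np.zeros((2, 512 * 512), dtype=np.float32)
values[0, :] = 1.0
values[1, :] = 0.1

mean_single = np.mean(values)                    # computed in float32
mean_double = np.mean(values, dtype=np.float64)  # accumulated in float64

print("float32 accumulator:", mean_single)
print("float64 accumulator:", mean_double)
```

The double-precision result should sit very close to 0.55; the small remaining difference comes from 0.1 not being exactly representable in `float32`.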

---
## ⚡ 5. Advanced: Weighted and Percentile Aggregations

NumPy offers tools like `np.average()` (for weighted means) and `np.percentile()` for quantile-based statistics.

In [ ]:
weights = np.array([0.1, 0.3, 0.4, 0.2])
weighted_avg = np.average(data, axis=1, weights=weights)
percentile_75 = np.percentile(data, 75)

print("Weighted average per row:", np.round(weighted_avg, 2))
print("75th percentile of all readings:", np.round(percentile_75, 2))

Percentiles are widely used in data science, for instance to identify outliers or to set cutoff thresholds.
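
One common way to turn percentiles into an outlier rule is the interquartile range (IQR) fence: flag anything more than 1.5 × IQR below the 25th percentile or above the 75th. The sketch below applies it to a made-up array `samples`; the 1.5 multiplier is a convention, not something NumPy prescribes.

```python
import numpy as np

# Made-up readings with one obvious outlier (illustrative values)
samples = np.array([22.5, 23.1, 22.8, 23.9, 24.5, 23.0, 22.7, 35.0])

# Passing a list of percentiles returns one value per entry
q1, q3 = np.percentile(samples, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Boolean mask selects the values outside the fences
outliers = samples[(samples < lower_fence) | (samples > upper_fence)]
print("Q1, Q3:", q1, q3)
print("Outliers:", outliers)
```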

---
## 🔍 Under the Hood: How Reductions Work

Internally, NumPy implements reductions as **compiled C loops** that walk the array's memory buffer along the reduced axis.

For example, `np.sum()` is built on the `np.add.reduce` ufunc method, which steps through the array using its **strides** and accumulates in C, avoiding Python-level iteration.

The process:
1. Identify the memory layout and stride pattern.
2. Collapse elements along the target axis.
3. Accumulate the result using optimized C functions.

If arrays are **non-contiguous** (e.g., from slicing), NumPy still works, but it can be slower because the strided access pattern makes poorer use of the CPU cache.
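
These points can be observed from Python. In the sketch below, `np.add.reduce` is the ufunc method that `np.sum` is built on, and the `strides` and `flags` attributes show how a sliced view differs in memory layout from a contiguous copy. The array names are illustrative.

```python
import numpy as np

grid = np.arange(12, dtype=np.float64).reshape(3, 4)

# np.sum is built on the ufunc reduction np.add.reduce
via_sum = np.sum(grid, axis=0)
via_reduce = np.add.reduce(grid, axis=0)
print("Same result:", np.array_equal(via_sum, via_reduce))

# Strides: number of bytes to step to reach the next element along each axis
print("grid strides:", grid.strides)   # (32, 8) for a C-ordered 3x4 float64 array

# Taking every other column gives a non-contiguous view of the same buffer
view = grid[:, ::2]
print("view strides:", view.strides)   # (32, 16)
print("view is C-contiguous:", view.flags["C_CONTIGUOUS"])

# The reduction gives the same answer as on a contiguous copy
compact = np.ascontiguousarray(view)
same_rows = np.array_equal(view.sum(axis=1), compact.sum(axis=1))
print("View and copy agree:", same_rows)
```

If a non-contiguous array is reduced many times, copying it once with `np.ascontiguousarray` may pay off; for a single reduction the copy usually costs more than it saves.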

---
## 🧭 Best Practices and Pitfalls

✅ Use `axis` explicitly to avoid confusion about which dimension is reduced.

✅ Use `keepdims=True` when you'll broadcast results back into the original array.

✅ Specify `dtype` when summing large integer arrays or low-precision floats.

✅ Use NumPy's built-in statistical functions on arrays; they're much faster than Python's `sum()` or the `statistics` module, which iterate element by element.

⚠️ Beware of collapsing all axes accidentally when you forget to specify `axis`.

---
## 💪 Challenge Exercises

1. Create a 10×6 array of random integers (0–99). Compute:
   - The mean of each column.
   - The maximum of each row.
   - The overall standard deviation.
2. Compute the weighted average of a 3×4 array using different weights per column.
3. Find the 25th, 50th, and 90th percentiles of a random 1D array of 1,000 values.

---
## 🏁 Summary

- **Reductions** collapse dimensions by aggregating along axes.
- The **axis** parameter defines the direction of reduction.
- **Statistical functions** (`mean`, `std`, `percentile`) run in compiled C code.
- Use `keepdims` to keep shapes broadcast-compatible and `dtype` to avoid overflow and precision loss.

Next up: **Section 5: Boolean Masking and Conditional Filtering**, where you'll learn how to apply logic and selection directly to arrays using Boolean masks.