# Statistics Basics: Manual Implementation

## Part 1: NumPy Refresher & Data Generation

Before diving into statistics, let's build a small synthetic dataset with NumPy. We will use it to check our manual implementations against NumPy's built-in functions.

**Objectives:**
1.  Initialize the random number generator for reproducibility.
2.  Create a 1D array of random integers.
3.  Inspect the array's properties.

In [None]:
import numpy as np

# 1. Seed NumPy's global (legacy) random state so every run produces the same numbers
np.random.seed(42)

# 2. Generate 10 random integers between 0 (inclusive) and 100 (exclusive)
# np.random.randint(low, high, size)
data = np.random.randint(0, 100, 10)

# 3. Print the array and its attributes
print("Data array:", data)
print("Shape:", data.shape)
print("Data type:", data.dtype)

[51 92 14 71 60 20 82 86 74 74]
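The cell above seeds NumPy's global random state. As a hedged alternative, the sketch below uses the `Generator` API (`np.random.default_rng`), which keeps the random state in a local object. The name `data_rng` is an illustrative assumption, and because the generator uses a different algorithm, its values will generally differ from the array printed above.

```python
import numpy as np

# Create an independent, seeded generator instead of seeding the global state
rng = np.random.default_rng(42)

# integers(low, high, size): `high` is exclusive by default, like randint
data_rng = rng.integers(0, 100, size=10)

print("Data array:", data_rng)
print("Shape:", data_rng.shape)
print("Data type:", data_rng.dtype)
```

Re-creating the generator with the same seed reproduces the same array, so this approach is equally reproducible.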


## Part 2: Arithmetic Mean

The arithmetic mean represents the central tendency of the data. It is the sum of all values divided by the number of observations.

**Formula:**
$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$$

**Task:** Implement the mean calculation manually and compare the result with NumPy's built-in `np.mean`.

In [6]:
def calculate_mean(arr):
    """
    Calculates the arithmetic mean of a numpy array.
    """
    # Sum all elements
    total = np.sum(arr)
    # Count elements: arr.size counts every element, while len() only measures the first axis
    count = arr.size
    return total / count

# Calculate using both methods
np_mean = np.mean(data)
cal_mean = calculate_mean(data)

print(f"Numpy Mean: {np_mean}")
print(f"Manual Mean: {cal_mean}")

# Verification: assert that the values are close enough (floating point safety)
assert np.isclose(np_mean, cal_mean), "Mean calculation is incorrect!"

Numpy Mean: 62.4
Manual Mean: 62.4
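To make the formula concrete, here is a minimal sketch that applies it term by term with a plain Python loop. The list `values` is copied by hand from the array printed in Part 1, and the names `values`, `total`, and `loop_mean` are illustrative.

```python
# Values copied from the printed data array in Part 1
values = [51, 92, 14, 71, 60, 20, 82, 86, 74, 74]

# Accumulate the sum one element at a time (the "sum over i" in the formula)
total = 0
for x in values:
    total += x

# Divide by the number of observations N
loop_mean = total / len(values)

print(total, len(values), loop_mean)  # 624 10 62.4
```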


| Property/Function | Description | Return Type | Core Meaning |
|-------------------|-------------|-------------|--------------|
| `arr.shape` | Size of the array along each dimension | Tuple of integers | Describes the **layout** of the array's dimensions |
| `arr.size` | Total number of elements across all dimensions | Integer | Describes the total **count** of array elements |
| `len(arr)` | Length of the first dimension only | Integer | Describes the size of the **outermost** axis |
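The difference between these three only shows up for arrays with more than one dimension. A small sketch, using a hypothetical 3x4 array named `matrix`, illustrates why `arr.size` is the safer denominator for a mean:

```python
import numpy as np

# A 2D array of ones with 3 rows and 4 columns
matrix = np.ones((3, 4))

print(matrix.shape)  # (3, 4)
print(matrix.size)   # 12
print(len(matrix))   # 3

# Dividing by size gives the true mean; dividing by len() only counts rows
mean_by_size = np.sum(matrix) / matrix.size
mean_by_len = np.sum(matrix) / len(matrix)
print(mean_by_size, mean_by_len)  # 1.0 4.0
```

For the 1D `data` array used in this notebook, `arr.size` and `len(arr)` are equal, so either would give the same result.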

## Part 3: Variance

Variance measures the spread or dispersion of the data points around the mean.

**Key Concepts:**
1.  **Squared Deviations:** We square the difference between each point and the mean so negative and positive differences don't cancel each other out.
2.  **Bessel's Correction (ddof):**
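    * **Population Variance ($N$):** Used when we have data for the *entire* population.
    * **Sample Variance ($N-1$):** Used when we only have a *sample* of the data. Because deviations are measured from the sample mean rather than the true population mean, dividing by $N$ tends to underestimate the spread; dividing by $N-1$ gives an unbiased estimate of the population variance.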

**Formula:**
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N - \mathrm{ddof}}$$
* Where $\mathrm{ddof}=0$ for Population.
* Where $\mathrm{ddof}=1$ for Sample.
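Before implementing the general function, it may help to trace the formula on a tiny array where the arithmetic can be checked by hand. The array `sample` and the other names below are illustrative and not part of the notebook's dataset.

```python
import numpy as np

# Eight values chosen so the mean is a whole number
sample = np.array([2, 4, 4, 4, 5, 5, 7, 9])

mean = sample.mean()              # 40 / 8 = 5.0
deviations = sample - mean        # [-3, -1, -1, -1, 0, 0, 2, 4]

# Raw deviations cancel out, which is why they are squared first
print(deviations.sum())           # 0.0

sum_sq = (deviations ** 2).sum()  # 9 + 1 + 1 + 1 + 0 + 0 + 4 + 16 = 32.0

population_var = sum_sq / sample.size        # ddof = 0 -> 32 / 8 = 4.0
sample_var = sum_sq / (sample.size - 1)      # ddof = 1 -> 32 / 7

print(population_var, sample_var)
```

The sample variance is always the larger of the two, since the same sum is divided by a smaller number.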

In [12]:
def calculate_variance(arr, ddof = 0):
    """
    Calculates the variance of array arr.
    ddof = 0 for Population Variance.
    ddof = 1 for Sample Variance.
    """
    # 1. Calculate Mean
    arr_mean = np.mean(arr)
    # 2. Squared Deviations (Vectorized)
    square_arr = (arr - arr_mean) ** 2
    # 3. Sum of Squared Deviations
    total = np.sum(square_arr)
    # 4. Divide by N - ddof (N for population, N - 1 for sample)
    n = arr.size
    return total / (n - ddof)

# --- Verification ---

# Test 1: Population Variance
np_var = np.var(data)
cal_var = calculate_variance(data)

print(f"Numpy variance: {np_var}")
print(f"Manual variance: {cal_var}")

# Verification: assert that the values are close enough (floating point safety)
assert np.isclose(np_var, cal_var), "Variance calculation is incorrect!"

# Test 2: Sample Variance
np_var = np.var(data, ddof=1)
cal_var = calculate_variance(data, 1)

print(f"Numpy variance: {np_var}")
print(f"Manual variance: {cal_var}")

# Verification: assert that the values are close enough (floating point safety)
assert np.isclose(np_var, cal_var), "Variance calculation is incorrect!"

Numpy variance: 643.6400000000001
Manual variance: 643.6400000000001
Numpy variance: 715.1555555555557
Manual variance: 715.1555555555557


| ddof Value | Denominator | Variance Type | Core Description | Key Usage Rule |
|------------|-------------|---------------|------------------|----------------|
| 0          | n           | Population Variance | The exact dispersion of a **complete population**; nothing is being estimated | Use when the data covers the full population |
| 1          | n-1         | Sample Variance | Unbiased estimate of the population variance from a sample; the $n-1$ denominator corrects the underestimation that dividing by $n$ would cause | Use when the data is a sample drawn from a larger population |
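The claim that dividing by $n$ underestimates the population variance can be checked empirically. The following is a rough simulation sketch, not a proof: it assumes a normal population with a known variance of 4 and uses illustrative names (`samples`, `avg_var_ddof0`, `avg_var_ddof1`) and arbitrary trial counts.

```python
import numpy as np

rng = np.random.default_rng(0)

true_variance = 4.0                     # population: Normal(0, 2), so variance = 2 ** 2
n_trials, sample_size = 100_000, 5

# Each row is one independent sample of 5 observations
samples = rng.normal(loc=0.0, scale=2.0, size=(n_trials, sample_size))

# Compute the variance of every row, then average over all trials
avg_var_ddof0 = np.var(samples, axis=1, ddof=0).mean()
avg_var_ddof1 = np.var(samples, axis=1, ddof=1).mean()

print(f"True variance:        {true_variance}")
print(f"Average with ddof=0:  {avg_var_ddof0:.3f}")
print(f"Average with ddof=1:  {avg_var_ddof1:.3f}")
```

With samples of size 5, the `ddof=0` average should land near $4 \times \frac{4}{5} = 3.2$, while the `ddof=1` average should land near 4.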

## Summary

In this notebook, we implemented two foundational statistics, the mean and the variance, from scratch using Python and NumPy.

**Key Takeaways:**

1.  **NumPy Efficiency:** We learned how to generate synthetic data using `np.random` and why NumPy's vectorized operations are preferred over loops.
2.  **Arithmetic Mean:**
    * **Concept:** The central tendency of the data.
    * **Implementation:** Summing all elements and dividing by the count ($N$).
3.  **Variance:**
    * **Concept:** The measure of how spread out the data is.
    * **The Critical Distinction:** We implemented the `ddof` (Delta Degrees of Freedom) parameter to distinguish between:
        * **Population Variance (`ddof=0`):** Dividing by $N$. Used when we have the complete dataset.
        * **Sample Variance (`ddof=1`):** Dividing by $N-1$ (Bessel's Correction). Used to estimate population variance from a sample.

**Verification:**
We successfully verified that our manual implementations match `np.mean()` and `np.var()` results, confirming our understanding of the underlying mathematics.

---

## Bonus: Performance Benchmark

Just how much faster is NumPy? Let's compare a pure Python loop implementation against NumPy's vectorized implementation on a larger dataset (1,000,000 elements).

In [13]:
import time

# 1. Generate a large dataset
large_data = np.random.randn(1000000)

# 2. Pure Python Implementation (Simulating "Slow" Code)
def python_variance(arr):
    # Note: This is slow because it iterates element by element in Python
    n = len(arr)
    m = sum(arr) / n
    return sum((x - m)**2 for x in arr) / n

# 3. Benchmark Python (perf_counter is a monotonic, high-resolution timer)
start_time = time.perf_counter()
python_variance(large_data)
python_time = time.perf_counter() - start_time
print(f"Pure Python Time: {python_time:.5f} seconds")

# 4. Benchmark Numpy
start_time = time.perf_counter()
np.var(large_data)
numpy_time = time.perf_counter() - start_time
print(f"Numpy Time:       {numpy_time:.5f} seconds")

# 5. Result
print(f"\n>>> Numpy is {python_time / numpy_time:.1f}x faster!")

Pure Python Time: 0.24327 seconds
Numpy Time:       0.00242 seconds

>>> Numpy is 100.5x faster!
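A single timed run can be noisy, so the exact ratio will vary between machines and runs. As a hedged alternative, the sketch below uses the standard-library `timeit` module to repeat each measurement and keep the best time. The array size (100,000), the repeat counts, and the names `bench_data`, `python_best`, and `numpy_best` are arbitrary choices for illustration.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
bench_data = rng.standard_normal(100_000)

def python_variance(arr):
    # Element-by-element population variance in pure Python
    n = len(arr)
    m = sum(arr) / n
    return sum((x - m) ** 2 for x in arr) / n

# Take the best of several repeats to reduce the effect of background noise
python_best = min(timeit.repeat(lambda: python_variance(bench_data), number=1, repeat=3))

# The NumPy call is fast, so run it 10 times per repeat and average
numpy_best = min(timeit.repeat(lambda: np.var(bench_data), number=10, repeat=3)) / 10

print(f"Pure Python best: {python_best:.5f} seconds")
print(f"NumPy best:       {numpy_best:.5f} seconds")
print(f"Ratio:            {python_best / numpy_best:.1f}x")
```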


| Core Reason | Detailed Explanation | Key Advantage |
|-------------|----------------------|---------------|
| **Vectorized Operations** | NumPy replaces Python's element-by-element loop with a single operation applied to the **entire array**. The loop still happens, but inside compiled code, so the interpreter's per-iteration overhead disappears. | Eliminates Python loop overhead: one call processes all elements |
| **Implemented in a Low-Level Language** | NumPy's core calculation logic is written in **C/Fortran** (compiled languages) and runs as native machine code. Python loops are executed step by step by the interpreter, which adds overhead to every operation. | Compiled execution (C/Fortran) instead of interpreted execution, with much lower per-operation overhead |
| **Contiguous Memory Layout** | NumPy arrays typically store their data in a **contiguous block of memory**, which makes good use of the CPU cache. Python lists store references to objects that may be scattered across memory, so each access can require an extra lookup. | Better CPU cache utilization, less time spent on memory access |
| **Broadcasting** | For operations between arrays of different shapes (e.g., array - constant), NumPy treats the smaller operand as if it were expanded to the shape of the larger one **without copying data**, avoiding the memory overhead of building the expanded array by hand. | No redundant data copying in mixed-shape operations, saves memory and time |
| **Avoid Python Object Overhead** | NumPy arrays store **homogeneous primitive data types** (e.g., int32/float64), without the extra overhead of wrapping each value in a Python object (e.g., `int`/`float`). Python loops process these wrapped objects with high per-element overhead. | Lower per-element processing overhead, more compact data storage |
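Several of the properties in this table can be inspected directly. The sketch below is illustrative only: the names `arr` and `scalar_view` are assumptions, and the size reported by `sys.getsizeof` depends on the Python implementation (it is commonly 24 bytes for a float on 64-bit CPython).

```python
import sys

import numpy as np

arr = np.arange(1_000, dtype=np.float64)

# Homogeneous storage: every element takes the same fixed number of bytes
print(arr.dtype, arr.itemsize)     # float64 8
print(arr.nbytes)                  # 8000 bytes of raw data for 1,000 elements

# Contiguous layout: the elements sit next to each other in one block
print(arr.flags["C_CONTIGUOUS"])   # True

# A single Python float object is larger than one array element
print(sys.getsizeof(1.0))

# Broadcasting: a scalar viewed at the array's shape has stride 0,
# meaning the same value is reused rather than copied 1,000 times
scalar_view = np.broadcast_to(np.float64(5.0), arr.shape)
print(scalar_view.shape, scalar_view.strides)  # (1000,) (0,)
```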