# ⚡ Section 7: Performance Optimization and Memory Efficiency

NumPy is built for speed, but writing fast NumPy code means understanding how it stores data, manages memory, and leverages vectorization.

This section explores performance profiling, in-place operations, broadcasting efficiency, and practical ways to minimize unnecessary memory usage.

## 🧠 1. Why NumPy Is Fast

NumPy's performance comes from two main ideas:
- **Vectorization:** loops are pushed down to C, reducing Python overhead.
- **Contiguous memory layout:** arrays are stored in blocks of memory that CPU caches can process efficiently.

Let's compare a Python loop to a vectorized NumPy operation.

In [ ]:
import numpy as np
%timeit [i ** 2 for i in range(10_000_000)]  # Pure Python loop
%timeit np.arange(10_000_000) ** 2          # Vectorized NumPy array

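➡️ The vectorized version typically runs **tens of times faster or more** because the loop executes in compiled C code rather than in the Python interpreter.

The gap is most visible on large arrays. On very small arrays, NumPy's fixed per-call overhead can cancel out the gain.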
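Outside IPython, where `%timeit` is unavailable, the same comparison can be sketched with the standard-library `timeit` module. The array size `n` and the repeat counts below are arbitrary choices for illustration, and the measured ratio will vary by machine.

```python
import timeit

import numpy as np

n = 100_000  # smaller than in the cell above so the script finishes quickly

def python_squares():
    # Pure Python: one interpreter-level multiplication per element
    return [i ** 2 for i in range(n)]

def numpy_squares():
    # Vectorized: the loop runs in compiled code
    return np.arange(n, dtype=np.int64) ** 2

# Both approaches should produce the same values
same_values = np.array_equal(
    np.array(python_squares(), dtype=np.int64), numpy_squares()
)

# Taking the best of several runs is less sensitive to background noise
t_py = min(timeit.repeat(python_squares, number=10, repeat=3))
t_np = min(timeit.repeat(numpy_squares, number=10, repeat=3))
print(f"Same values: {same_values}")
print(f"Python loop: {t_py:.4f} s, NumPy: {t_np:.4f} s, ratio: {t_py / t_np:.1f}x")
```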

## 🧩 2. In-place Operations

Creating new arrays consumes both memory and CPU time. Instead, we can perform operations *in place* whenever possible.

Use operators like `+=`, `*=`, and slicing assignments to modify arrays without allocating new memory.

In [ ]:
# Example: in-place scaling
data = np.arange(1_000_000, dtype=np.float64)
print("Before scaling:", data[:5])

# Normal (allocates new array)
data_scaled = data * 1.2

# In-place operation (no new array created)
data *= 1.2
print("After in-place scaling:", data[:5])
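In-place arithmetic can also be requested explicitly through the `out=` argument that NumPy ufuncs accept. The sketch below uses illustrative variable names and compares the data-buffer address before and after the operation. It also shows a common surprise: an in-place operation cannot change the dtype, so scaling an integer array by a float raises an error instead of silently upcasting.

```python
import numpy as np

values = np.arange(5, dtype=np.float64)
buffer_address = values.__array_interface__["data"][0]  # address of the data buffer

# Equivalent to `values *= 1.2`: the result is written into the existing buffer
np.multiply(values, 1.2, out=values)
same_buffer = values.__array_interface__["data"][0] == buffer_address
print("Same buffer after in-place multiply:", same_buffer)

# In-place operations keep the dtype, so int *= float is rejected
counts = np.arange(5)  # integer dtype
try:
    counts *= 1.2
    raised = False
except TypeError as exc:
    raised = True
    print("In-place int *= float failed:", type(exc).__name__)
```

If the integer array really should become floating point, convert it once with `counts.astype(np.float64)` and apply in-place operations to the result.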

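You can check whether two arrays share memory with `np.shares_memory()`, which gives an exact answer but can be slow in some cases, or with `np.may_share_memory()`, which only compares memory bounds and can therefore report false positives.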

In [ ]:
a = np.arange(10)
b = a[::2]  # Every second element: a view, not a copy

print("Shares memory:", np.shares_memory(a, b))
print("May share memory:", np.may_share_memory(a, b))
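The two functions can disagree. In this sketch, two interleaved slices cover overlapping address ranges but never touch the same element, so the bounds-only check reports a possible overlap while the exact check does not. Indexing with an integer list is included for contrast, since it returns a copy.

```python
import numpy as np

a = np.arange(10)
evens = a[::2]   # view of elements 0, 2, 4, ...
odds = a[1::2]   # view of elements 1, 3, 5, ...

# Address ranges overlap, but no element is shared
print("may_share_memory:", np.may_share_memory(evens, odds))
print("shares_memory:   ", np.shares_memory(evens, odds))

# Fancy (integer-array) indexing copies the selected elements
picked = a[[0, 2, 4]]
print("Fancy index shares memory with a:", np.shares_memory(a, picked))

# Writing through a view changes the original; writing to the copy does not
evens[0] = 99
picked[1] = -1
print(a[:3])
```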

## 🔍 3. Measuring Performance and Memory with `%timeit` and `nbytes`

The `%timeit` magic command measures execution time by running a statement many times and reporting the mean and spread across runs.
To inspect memory use, check `arr.nbytes` or `sys.getsizeof()`, or use a third-party profiler such as `memory_profiler`.

In [ ]:
import sys

arr = np.arange(1_000_000)
print("Data buffer size (bytes):", arr.nbytes)
print("Total size incl. object header (bytes):", sys.getsizeof(arr))

# Compare slicing vs. copy
%timeit arr[1000:2000]         # View (fast, no copy)
%timeit arr[1000:2000].copy()  # Explicit copy (slower)
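`arr.nbytes` and `sys.getsizeof()` answer slightly different questions. As a hedged illustration: for an array that owns its data, `sys.getsizeof()` reports the data buffer plus a small object header, while for a view it reports only the header, because the buffer belongs to the base array.

```python
import sys

import numpy as np

arr = np.arange(1_000_000)
view = arr[1000:2000]          # basic slice: no data copied
copy = arr[1000:2000].copy()   # owns a fresh buffer

for name, x in [("arr", arr), ("view", view), ("copy", copy)]:
    print(f"{name:>4}: nbytes={x.nbytes:>9,}  getsizeof={sys.getsizeof(x):>9,}  "
          f"owns data={x.flags['OWNDATA']}")
```

For views, `nbytes` describes how much data the view spans, not how much extra memory it costs.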

## 💾 4. Memory Mapping Large Datasets

When dealing with large arrays that don't fit in RAM, **memory mapping** lets you work with data on disk as if it were in memory.

Use `np.memmap()` to load subsets of huge arrays efficiently.

In [ ]:
import os

# Create a large array and save to disk
filename = 'large_array.dat'
if not os.path.exists(filename):
    arr = np.arange(10_000_000, dtype=np.float32)
    arr.tofile(filename)

# Load using memory mapping (read-only mode)
mmap_arr = np.memmap(filename, dtype=np.float32, mode='r', shape=(10_000_000,))

print("First 5 elements:", mmap_arr[:5])
print("Memory-mapped array type:", type(mmap_arr))
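A memory-mapped array pays off when it is processed in pieces. The following sketch sums a file chunk by chunk so that only one chunk needs to be resident at a time. The helper `chunked_sum`, the file name, the array length, and `chunk_size` are all illustrative assumptions, and a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile

import numpy as np

def chunked_sum(path, n_items, chunk_size=100_000):
    """Sum a float32 file on disk without loading it all at once."""
    mm = np.memmap(path, dtype=np.float32, mode="r", shape=(n_items,))
    total = 0.0
    for start in range(0, n_items, chunk_size):
        chunk = mm[start:start + chunk_size]          # only this slice is read
        total += float(chunk.sum(dtype=np.float64))   # accumulate in float64
    return total

n_items = 1_000_000
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.dat")
    np.arange(n_items, dtype=np.float32).tofile(path)
    total = chunked_sum(path, n_items)

print("Chunked sum:", total)
```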

## 🧮 5. Vectorization vs. Python Loops: Practical Example

Suppose you need to compute the Euclidean distance between 100,000 random points and the origin.
Let's compare a loop vs. a vectorized solution.

In [ ]:
N = 100_000
rng = np.random.default_rng(42)   # seeded generator so the points are reproducible
points = rng.random((N, 3)) * 10  # 3D coordinates

# Python loop (slow)
def loop_distance(pts):
    out = []
    for p in pts:
        out.append(np.sqrt(p[0]**2 + p[1]**2 + p[2]**2))
    return np.array(out)

%timeit loop_distance(points)

# Vectorized (fast)
%timeit np.sqrt(np.sum(points**2, axis=1))

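➡️ The vectorized solution is often on the order of **100× faster** (the exact factor depends on hardware and array size), and it is also shorter and avoids building a Python list of intermediate results.

Whenever possible, push operations down to NumPy's C-level routines rather than writing Python loops.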
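The same distances can be written a few other ways. As a sketch, `np.linalg.norm` expresses the intent directly, and `np.einsum` computes the row-wise sum of squares without materializing the intermediate `points**2` array. The seed and array size are arbitrary, and which variant is fastest depends on the machine and NumPy build.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
points = rng.random((100_000, 3)) * 10

d_sum = np.sqrt(np.sum(points**2, axis=1))                  # version from the text
d_norm = np.linalg.norm(points, axis=1)                     # same result, clearer intent
d_einsum = np.sqrt(np.einsum("ij,ij->i", points, points))   # no points**2 temporary

print(np.allclose(d_sum, d_norm), np.allclose(d_sum, d_einsum))
```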

## ⚙️ Under the Hood: How NumPy Manages Memory

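- Newly created arrays are stored in a single **contiguous memory block**, in either C order (row-major, the default) or Fortran order (column-major).
- Basic slicing creates **views** (not copies) that reference the same buffer; indexing with integer or boolean arrays creates copies.
- Transposing returns a view with swapped strides. Operations that need contiguous data from a non-contiguous array (e.g., `np.ascontiguousarray()`, or `reshape()` when the new shape cannot be expressed with strides) allocate a new array.
- `strides` determine how many bytes to step in memory for each axis.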

Example:
```python
arr = np.arange(9).reshape(3, 3)
print(arr.strides)
```
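Each stride is the number of bytes to step to reach the next element along that axis. With the 8-byte integers that `np.arange` produces on most platforms, the output `(24, 8)` means 24 bytes to the next row and 8 bytes to the next column.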
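To see strides and views interact, the sketch below transposes an array and then forces a contiguous copy. An explicit `dtype=np.int64` is assumed so the stride values are the same on every platform.

```python
import numpy as np

arr = np.arange(9, dtype=np.int64).reshape(3, 3)
print(arr.strides)        # (24, 8): 3 items * 8 bytes per row, 8 bytes per column

# Transposing swaps the strides; no data is moved
t = arr.T
print(t.strides)          # (8, 24)
print(np.shares_memory(arr, t), t.flags["C_CONTIGUOUS"])

# Asking for C-contiguous data from the transposed view allocates a new buffer
t_copy = np.ascontiguousarray(t)
print(np.shares_memory(arr, t_copy), t_copy.strides)
```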

## ✅ Best Practices & Pitfalls

✔ Always prefer **vectorized** operations over Python loops.
✔ Use **in-place operations** to save memory when possible.
✔ Use **slicing** to create views instead of copies.
✔ Check **array flags** (`arr.flags`) to confirm memory layout.
✔ For massive datasets, use **`np.memmap()`** or chunking.

**Common pitfalls:**
- Forgetting that some operations (such as fancy indexing, or reshaping non-contiguous data) make copies, while others (such as transposing) return views.
- Repeatedly allocating large temporary arrays.
- Mixing different dtypes, which causes implicit upcasting and extra memory use.
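The dtype pitfall is easy to check. In this sketch, adding a `float64` array to a `float32` array yields a `float64` result that takes twice the memory, while multiplying by a plain Python float keeps `float32`. The array names are illustrative.

```python
import numpy as np

small = np.ones(1_000_000, dtype=np.float32)
wide = np.ones(1_000_000, dtype=np.float64)

mixed = small + wide      # float32 + float64 upcasts to float64
scaled = small * 1.5      # Python scalar: result stays float32

print(mixed.dtype, mixed.nbytes)
print(scaled.dtype, scaled.nbytes)

# Casting the wider operand first keeps the result in float32
kept = small + wide.astype(np.float32)
print(kept.dtype, kept.nbytes)
```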

## 💪 Challenge Exercise

**Task:**
Create a 1,000,000-element float array. Compute its z-score (standardization) using an *in-place* approach, i.e., modify the array without creating a new one.

*Hint:* z-score = (x - mean) / std

```python
# Your code here
```

# --- End of Section 7: Continue to Section 8 ---

Next, we'll explore **Integrating NumPy with Other Libraries**, where you'll learn how NumPy arrays interact seamlessly with pandas, scikit-learn, and Numba for high-performance analytics.