# Section 5: Performance Tuning and Vectorization Strategies

NumPy's biggest strength lies in its **vectorized operations**, which execute array computations at compiled C speed.  
However, writing *truly efficient* NumPy code requires understanding how **memory layout, data types, and vectorization** interact.

In this section, you'll learn:
- How to identify performance bottlenecks in NumPy code.
- Techniques to replace slow Python loops with vectorized alternatives.
- The impact of data types, memory layout, and broadcasting.
- Practical comparisons with Python and other optimization tools.

## 5.1 Why Vectorization Matters

Vectorization refers to writing operations that act on **entire arrays** instead of element-by-element loops.  
NumPy achieves this through C-level ufunc loops that are often **one to two orders of magnitude faster** than Python iteration.

Let's see how big the difference can be.

In [None]:
import numpy as np
import time

size = 10_000_000
x = np.random.rand(size)
y = np.random.rand(size)

# --- Python loop ---
# time.perf_counter() is a monotonic, high-resolution clock intended for timing.
start = time.perf_counter()
z_py = [a + b for a, b in zip(x, y)]
print(f"Python loop time: {time.perf_counter() - start:.4f} sec")

# --- NumPy vectorized ---
start = time.perf_counter()
z_np = x + y
print(f"NumPy vectorized time: {time.perf_counter() - start:.4f} sec")

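For element-wise arithmetic like this, NumPy vectorization often runs **50–100x faster** than a pure Python loop, thanks to compiled C code and SIMD (Single Instruction, Multiple Data) optimizations. The exact ratio depends on the hardware, the NumPy build, and the operation.  

The gap generally widens as arrays grow, because the fixed overhead of each NumPy call is spread over more elements.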
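
As a small illustration of replacing an element-by-element loop that contains a branch, the sketch below clamps negative values to zero in two ways. The function names `clamp_loop` and `clamp_vectorized` are made up for this example, and `np.where` is only one of several ways to express the selection.

```python
import numpy as np

def clamp_loop(values):
    """Element-by-element version: one Python-level branch per item."""
    out = []
    for v in values:
        out.append(v if v > 0 else 0.0)
    return np.array(out)

def clamp_vectorized(values):
    """Whole-array version: the comparison and the selection both run in C."""
    return np.where(values > 0, values, 0.0)

rng = np.random.default_rng(seed=0)
sample = rng.standard_normal(10_000)

loop_result = clamp_loop(sample)
vec_result = clamp_vectorized(sample)

# Both versions should agree element for element.
print(np.array_equal(loop_result, vec_result))
```

The vectorized form states the rule once for the whole array, which is usually both shorter and faster than the loop.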

## 5.2 Profiling and Identifying Bottlenecks

Before optimizing, always **profile** your code. NumPy performance issues often come from unnecessary copying, type conversions, or Python-level loops.

We'll use IPython's `%timeit` magic to compare implementations and identify inefficiencies.

In [None]:
%timeit x + y  # Fast vectorized addition

def slow_square(arr):
    out = []
    for val in arr:
        out.append(val ** 2)
    return np.array(out)

%timeit slow_square(x)

# Fast alternative using vectorization
%timeit x ** 2

Even for simple operations like squaring, the vectorized version is often *tens to hundreds of times faster* than the loop.  

Profiling helps confirm where Python loops or hidden type conversions are killing performance.
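
Outside IPython, where `%timeit` is unavailable, the standard-library `timeit` module can play the same role. The sketch below also shows two quick checks for the problems named above: `np.shares_memory` to spot an unexpected copy, and `.dtype` to spot an unexpected type conversion. The helper `best_time` and the array sizes are assumptions chosen for illustration.

```python
import timeit
import numpy as np

data = np.random.default_rng(0).random(100_000)

def best_time(func, number=20, repeat=3):
    """Return the best per-call time in seconds for a zero-argument callable."""
    runs = timeit.repeat(func, number=number, repeat=repeat)
    return min(runs) / number

t_vec = best_time(lambda: data ** 2)
t_loop = best_time(lambda: np.array([v ** 2 for v in data]), number=2)
print(f"vectorized: {t_vec:.2e} s per call, loop: {t_loop:.2e} s per call")

# Hidden copies: basic slicing returns a view, fancy indexing returns a copy.
view = data[::2]
copy = data[np.arange(0, data.size, 2)]
print("slice shares memory:", np.shares_memory(data, view))
print("fancy index shares memory:", np.shares_memory(data, copy))

# Hidden type conversion: mixing float32 with float64 upcasts the result.
small = data.astype(np.float32)
mixed = small + data
print("mixed dtype:", mixed.dtype)
```

Taking the minimum over several repeats is a common way to reduce noise from other processes, though it is not the only reasonable summary.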

## 5.3 Avoiding Temporary Arrays with `out=` Parameter

Most NumPy operations allocate new arrays for results. When working with large datasets, this can waste memory and slow down your program.  

Use the `out=` parameter to reuse existing memory and perform **in-place computations**.

In [None]:
a = np.arange(1e6)
b = np.arange(1e6)
res = np.empty_like(a)

# Standard operation (allocates new array)
%timeit c = a + b

# Write the result into the preallocated array `res` (no new allocation)
%timeit np.add(a, b, out=res)

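Using `out=` avoids allocating a new result array on every call, which can also reduce pressure on the CPU cache. The benefit is largest when the same operation runs repeatedly on large arrays. This pattern is common in high-performance numerical code.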
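
The same idea extends to compound expressions, where each intermediate step would otherwise allocate a temporary. Below is a minimal sketch that evaluates `a * b + c` using one reusable buffer; the names `buf` and `expected` are illustrative only.

```python
import numpy as np

n = 100_000
rng = np.random.default_rng(0)
a = rng.random(n)
b = rng.random(n)
c = rng.random(n)

# Naive form: `a * b` allocates a temporary, and adding `c` allocates the result.
expected = a * b + c

# Buffer-reusing form: both steps write into the same preallocated array.
buf = np.empty_like(a)
np.multiply(a, b, out=buf)   # buf now holds a * b
np.add(buf, c, out=buf)      # buf now holds a * b + c, updated in place

print(np.allclose(buf, expected))
```

Note that the `out` array must have a shape and dtype compatible with the result; NumPy raises an error rather than silently casting to an unsafe type.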

## 5.4 Efficient Data Types and Casting

Data type (`dtype`) selection affects both speed and memory. Smaller dtypes use less memory and can be faster, but may lose precision.  

Choose the smallest dtype that still provides acceptable accuracy.

In [None]:
x32 = np.random.rand(1_000_000).astype(np.float32)
x64 = np.random.rand(1_000_000).astype(np.float64)

%timeit x32 * 2.5
%timeit x64 * 2.5

print("Memory (float32):", x32.nbytes / 1e6, "MB")
print("Memory (float64):", x64.nbytes / 1e6, "MB")

Because `float32` halves the bytes moved per element, it can approach a 2x speedup in memory-bound operations. Beware of cumulative precision loss in scientific applications, though.
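
To make the precision and casting trade-offs concrete, the sketch below shows where `float32` stops resolving a `+1` and how a few common type mixes promote. The variable names are illustrative; the promotion results shown are what current NumPy releases produce for these specific combinations.

```python
import numpy as np

# float32 has a 24-bit significand, so above 2**24 not every integer is representable.
big32 = np.float32(2**24)
big64 = np.float64(2**24)
lost_in_32 = bool(big32 + np.float32(1) == big32)   # the +1 is rounded away
lost_in_64 = bool(big64 + np.float64(1) == big64)   # float64 still resolves it
print(lost_in_32, lost_in_64)

# Implicit upcasting: the result dtype depends on what gets mixed.
ints = np.arange(3, dtype=np.int32)
f32 = np.ones(3, dtype=np.float32)
f64 = np.ones(3, dtype=np.float64)

print((ints + 1.5).dtype)   # int32 array with a Python float
print((f32 * 2.5).dtype)    # float32 array with a Python float
print((f32 + f64).dtype)    # float32 array with a float64 array
```

A `float32` array multiplied by a Python scalar stays `float32`, but combining it with a `float64` array silently doubles the memory of the result.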

## 5.5 Leveraging Broadcasting and Preallocation

Repeatedly resizing or appending arrays is slow. Instead, **preallocate** memory and use broadcasting for efficient element-wise computation.

In [None]:
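n = 10_000
a = np.random.rand(n, 1)   # column vector, shape (n, 1)
b = np.random.rand(1, n)   # row vector, shape (1, n)

# Broadcasting builds the full (n, n) grid without an explicit loop.
# Each (n, n) float64 array needs about 800 MB, and this expression
# also creates intermediate temporaries, so reduce n if memory is tight.
# For scalar coordinates, sqrt((a - b)**2) equals np.abs(a - b).
distances = np.sqrt((a - b)**2)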
print(distances.shape)

Broadcasting automatically expands smaller arrays without explicit loops, but be careful: the resulting temporary arrays can still consume a lot of memory.  
Consider chunking or using libraries like **Dask** for extremely large arrays.
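
One way to apply the chunking idea is to process a block of rows at a time and keep only a reduced result. The sketch below computes, for each point, the distance to its nearest target without ever holding the full grid. The function `nearest_distance_chunked`, its `chunk_size` parameter, and the choice of a row-wise minimum are assumptions made for illustration.

```python
import numpy as np

def nearest_distance_chunked(points, targets, chunk_size=256):
    """For each 1-D point, return the distance to the closest 1-D target.

    Only a (chunk_size, len(targets)) block is materialized at any time.
    """
    result = np.empty(points.shape[0], dtype=np.float64)  # preallocated output
    for start in range(0, points.shape[0], chunk_size):
        stop = start + chunk_size  # slicing clips this at the array end
        block = np.abs(points[start:stop, None] - targets[None, :])
        result[start:stop] = block.min(axis=1)
    return result

rng = np.random.default_rng(0)
points = rng.random(2_000)
targets = rng.random(1_000)

chunked = nearest_distance_chunked(points, targets)

# Full-grid version, affordable here only because the arrays are small.
full = np.abs(points[:, None] - targets[None, :]).min(axis=1)
print(np.allclose(chunked, full))
```

Peak memory is then governed by `chunk_size` rather than by the number of points, at the cost of a short Python-level loop over blocks.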

## 5.6 Under the Hood: Vectorization Mechanics

Internally, NumPy vectorization works by:
1. Dispatching to precompiled **C-level inner loops**, one for each supported operation and dtype combination.
2. Using **SIMD instructions**, where the CPU and build support them, to process several elements per instruction.
3. Minimizing Python function calls and overhead.

This means the Python interpreter never touches individual elements. The heavy lifting happens entirely in C: SIMD-accelerated loops for element-wise ufuncs, and optimized BLAS/LAPACK libraries for linear algebra routines such as `np.dot`.
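
The per-dtype inner loops can be inspected from Python through a ufunc's attributes. The short sketch below looks at `np.add`; the type-signature strings use NumPy's one-character dtype codes (for example `f` for `float32` and `d` for `float64`).

```python
import numpy as np

# A ufunc records how many inputs and outputs it takes.
print(np.add.nin, np.add.nout)

# Each entry in .types names one registered inner loop, e.g. 'dd->d'.
print(np.add.ntypes, "loops registered; first few:", np.add.types[:6])

has_f32_loop = "ff->f" in np.add.types
has_f64_loop = "dd->d" in np.add.types
print(has_f32_loop, has_f64_loop)

# The loop that runs is chosen from the input dtypes.
a32 = np.ones(4, dtype=np.float32)
a64 = np.ones(4, dtype=np.float64)
print(np.add(a32, a32).dtype)   # stays float32
print(np.add(a32, a64).dtype)   # promoted to float64
```

Seeing which loop is selected can help explain why mixing dtypes sometimes changes both the speed and the memory footprint of an expression.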

## 5.7 Best Practices & Pitfalls

**✅ Best Practices:**
- Always profile before optimizing.
- Replace Python loops with ufuncs or broadcasting.
- Use `out=` for in-place updates.
- Match data types to task requirements.
- Reuse allocated arrays instead of repeated concatenation.

**⚠️ Pitfalls:**
- Avoid growing arrays dynamically with `np.append` inside loops.
- Beware of implicit dtype upcasting (e.g., mixing int and float).
- Don't assume broadcasting is free: it may create large temporary arrays.

In [None]:
# Demonstrating array growth pitfall
arr = np.array([])
for i in range(1000):
    arr = np.append(arr, i)  # Very slow: copies the whole array on every iteration

# Better: preallocate
arr_fast = np.empty(1000)
for i in range(1000):
    arr_fast[i] = i
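
Two further alternatives to growing an array are worth comparing with the preallocated loop: collecting values in a Python list and converting once, and removing the loop altogether. This is a sketch with illustrative variable names; which option fits best depends on whether the final size is known in advance and whether the values can be computed as a whole array.

```python
import numpy as np

n = 1000

# Preallocate and fill, as in the cell above.
arr_loop = np.empty(n)
for i in range(n):
    arr_loop[i] = i

# If the final size is unknown, append to a list and convert once at the end.
values = []
for i in range(n):
    values.append(float(i))
arr_from_list = np.array(values)

# If the values follow a pattern, skip the loop entirely.
arr_vec = np.arange(n, dtype=np.float64)

print(np.array_equal(arr_loop, arr_from_list), np.array_equal(arr_loop, arr_vec))
```

Appending to a list is cheap because lists over-allocate as they grow, whereas `np.append` builds a new array each time.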

## 🧩 Challenge Exercise

**Task:** Optimize a naive implementation of the Euclidean distance between two arrays.

1. Write a pure Python loop version.
2. Rewrite it using NumPy vectorization.
3. Add a version using `out=` for in-place updates.
4. Measure time and memory efficiency for all three versions.

_Hint: Use arrays of size ≥ 10⁶ to see meaningful performance differences._

---
# --- End of Section 5: Continue to Section 6 ---
In the next section, we'll explore **Memory Mapping and Large Dataset Handling**, learning how NumPy can work efficiently with data that doesn't fit in memory using `np.memmap` and related tools.