# Level 10: Performance & Memory Optimization

Writing correct NumPy code is the first step. Writing *efficient* NumPy code is the next. This notebook covers key concepts for optimizing your code's speed and memory usage, which is crucial when working with large datasets.

In [1]:
import numpy as np

## 10.1 Vectorization vs. Loops

The single most important rule for performance in NumPy is to **avoid iterating over array elements with Python `for` loops**. Look first for a vectorized solution built from ufuncs and broadcasting.

In [2]:
arr = np.arange(1_000_000)

# Bad: Python loop
def loop_sum():
    total = 0
    for x in arr:
        total += x
    return total

# Good: Vectorized ufunc
def vectorized_sum():
    return np.sum(arr)

print("Timing Python loop vs. vectorized sum:")
%timeit loop_sum()
%timeit vectorized_sum()

Timing Python loop vs. vectorized sum:
65.2 ms ± 3.21 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
685 μs ± 43.8 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


The vectorized version is about two orders of magnitude faster (roughly 95x in the run above) because the looping happens in pre-compiled C code, not interpreted Python.
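
The same idea applies beyond `np.sum`. The sketch below is illustrative only: the temperature data and the function names `to_fahrenheit_loop` and `to_fahrenheit_vectorized` are made up for this example. It shows an element-wise loop replaced by ufunc arithmetic, followed by a small broadcasting case.

```python
import numpy as np

# Hypothetical data: one million temperatures in Celsius.
celsius = np.linspace(-40.0, 60.0, 1_000_000)

def to_fahrenheit_loop(values):
    """Convert one element at a time in interpreted Python (slow)."""
    out = np.empty_like(values)
    for i in range(values.shape[0]):
        out[i] = values[i] * 9.0 / 5.0 + 32.0
    return out

def to_fahrenheit_vectorized(values):
    """Apply the arithmetic to the whole array at once via ufuncs."""
    return values * 9.0 / 5.0 + 32.0

# Both approaches agree; only the speed differs.
small = celsius[:1000]
assert np.allclose(to_fahrenheit_loop(small), to_fahrenheit_vectorized(small))

# Broadcasting: subtract each column's mean from a (4, 3) matrix.
matrix = np.arange(12, dtype=float).reshape(4, 3)
centered = matrix - matrix.mean(axis=0)  # shape (4, 3) minus shape (3,)
```

Timing the two functions with `%timeit` on the full `celsius` array should show a gap similar to the one measured above, though the exact ratio depends on the machine.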

## 10.2 Memory Layout

How arrays are stored in memory can impact performance. The two main memory layouts are:

- **Row-major order (C-style):** The default in NumPy. Elements of a row are contiguous in memory.
- **Column-major order (Fortran-style):** Elements of a column are contiguous in memory.

In [3]:
c_array = np.arange(12).reshape(3, 4, order='C')
f_array = np.arange(12).reshape(3, 4, order='F')

print("C-style (row-major) array:\n", c_array)
print("Fortran-style (col-major) array:\n", f_array)

C-style (row-major) array:
 [[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
Fortran-style (col-major) array:
 [[ 0  3  6  9]
 [ 1  4  7 10]
 [ 2  5  8 11]]


You can check the memory layout using the `.flags` attribute.

In [4]:
print(f"C-style contiguous: {c_array.flags['C_CONTIGUOUS']}")
print(f"F-style contiguous: {f_array.flags['F_CONTIGUOUS']}")

C-style contiguous: True
F-style contiguous: True
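
The `.strides` attribute makes the layout more concrete: it gives the number of bytes to step in memory to move one position along each axis. The sketch below fixes the dtype to `np.int64` (an assumption added here so the byte counts are the same on every platform).

```python
import numpy as np

# Fix the dtype so each element is exactly 8 bytes.
c_array = np.arange(12, dtype=np.int64).reshape(3, 4, order='C')
f_array = np.arange(12, dtype=np.int64).reshape(3, 4, order='F')

# strides = (bytes to the next row, bytes to the next column)
print(c_array.strides)  # (32, 8): neighbours within a row are adjacent
print(f_array.strides)  # (8, 24): neighbours within a column are adjacent

# Transposing swaps the strides without copying, so the transpose
# of a C-contiguous array is F-contiguous.
print(c_array.T.flags['F_CONTIGUOUS'])  # True
```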


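**Why does this matter?** Accessing elements in the order they are stored in memory is generally faster due to CPU caching. The usual expectation is that summing along rows (`axis=1`) favors C-ordered arrays and summing along columns (`axis=0`) favors F-ordered arrays. In practice the result depends on how NumPy implements the reduction and on your hardware: in the timings recorded below, the cases labeled "fast" were actually the slower ones. Treat the labels as the expectation, not the result, and measure on your own system.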

In [5]:
large_c_array = np.zeros((10000, 10000), order='C')
large_f_array = np.zeros((10000, 10000), order='F')

print("Summing C-array by rows (fast):")
%timeit large_c_array.sum(axis=1)
print("\nSumming C-array by cols (slower):")
%timeit large_c_array.sum(axis=0)

print("\nSumming F-array by rows (slower):")
%timeit large_f_array.sum(axis=1)
print("\nSumming F-array by cols (fast):")
%timeit large_f_array.sum(axis=0)

Summing C-array by rows (fast):
114 ms ± 9.24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Summing C-array by cols (slower):
55.7 ms ± 5.93 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Summing F-array by rows (slower):
56.4 ms ± 3.24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Summing F-array by cols (fast):
96.7 ms ± 4.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
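
Because these results vary between machines and NumPy versions, it helps to be able to repeat the measurement outside IPython, where `%timeit` is not available. The sketch below uses the standard-library `timeit` module. The helper `best_ms` and the array size are assumptions chosen for illustration. It also shows `np.asfortranarray`, which copies an array into column-major layout.

```python
import timeit
import numpy as np

def best_ms(func, repeat=5, number=3):
    """Return the best-of-`repeat` time per call, in milliseconds."""
    times = timeit.repeat(func, repeat=repeat, number=number)
    return min(times) / number * 1e3

c_data = np.ones((2000, 2000))       # row-major (the default)
f_data = np.asfortranarray(c_data)   # same values, copied into column-major layout

# The layout changes; the values do not.
assert f_data.flags['F_CONTIGUOUS'] and not f_data.flags['C_CONTIGUOUS']
assert np.array_equal(c_data, f_data)

# Time every layout/axis combination. No expected numbers are given
# here because they depend on the machine.
for name, data in [("C", c_data), ("F", f_data)]:
    for axis in (0, 1):
        elapsed = best_ms(lambda: data.sum(axis=axis))
        print(f"{name}-order, sum(axis={axis}): {elapsed:.2f} ms")
```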


## 10.3 In-Place Operations

When you write `arr = arr + 1`, NumPy allocates a brand new array for the result and then rebinds the name `arr` to it. For large arrays, the extra allocation costs both time and memory.

In-place operations modify the array's data directly without creating a new one.

In [6]:
arr = np.arange(10)

# Not in-place (creates a new array)
y = arr + 1

# In-place
arr += 1
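
The difference is easier to see with a second name bound to the same array. This is a minimal sketch; the names `alias` and `cast_failed` are introduced here for illustration. It also shows one caveat: an in-place operation cannot change the array's dtype.

```python
import numpy as np

arr = np.arange(10)
alias = arr            # a second name for the same array object

arr = arr + 1          # rebinding: `arr` now refers to a new array
print(alias is arr)    # False
print(alias[0])        # 0, the original data is untouched

arr = alias            # point `arr` back at the original array
arr += 1               # in place: the existing buffer is modified
print(alias is arr)    # True
print(alias[0])        # 1

# Caveat: adding a float in place to an integer array raises a
# TypeError, because the float result cannot be stored in the
# integer buffer under the default casting rule.
cast_failed = False
try:
    arr += 0.5
except TypeError:
    cast_failed = True
print(cast_failed)     # True
```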

### The `out` Parameter
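Ufuncs accept an `out` parameter (as do many other NumPy functions, such as `np.sum`) that lets you specify an existing array where the output should be stored. The `out` array must have a compatible shape and dtype. This is another way to avoid allocating a new array for the result.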

In [7]:
a = np.arange(5)
b = np.arange(5, 10)
result = np.zeros(5, dtype=int)

# Store the result of a + b in the 'result' array
np.add(a, b, out=result)
print(result)

[ 5  7  9 11 13]
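
The `out` parameter pays off most when an expression is built from several steps. As a rough sketch (the arrays and the name `buf` are made up for this example), `a * b + c` normally allocates one temporary for `a * b` and another array for the final result. Reusing one preallocated buffer avoids both allocations.

```python
import numpy as np

a = np.arange(5, dtype=float)
b = np.arange(5, 10, dtype=float)
c = np.full(5, 0.5)

# Plain expression: allocates a temporary for a * b, then the result.
expected = a * b + c

# Same computation using a single preallocated buffer.
buf = np.empty_like(a)
np.multiply(a, b, out=buf)   # buf = a * b
np.add(buf, c, out=buf)      # buf = buf + c, written back into buf

print(buf)                   # [ 0.5  6.5 14.5 24.5 36.5]
assert np.array_equal(buf, expected)
```

In a loop that repeats this computation many times, `buf` can be allocated once outside the loop and reused on every iteration.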


## 10.4 Views vs. Copies (Revisited)

Understanding when NumPy returns a view versus a copy is critical for both performance and correctness.

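- **View (No Copy):** Basic slicing, plus `reshape` and `ravel` when the memory layout allows it (otherwise they return a copy). Fast, but writing to a view also changes the original array.
- **Copy:** Boolean indexing, fancy indexing, `flatten`, most arithmetic operations, explicit `.copy()`.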

Unnecessary copying of large arrays can be a major performance bottleneck.

In [8]:
arr = np.arange(10)

slice_view = arr[2:5]
fancy_copy = arr[[2, 3, 4]]

# Use np.shares_memory() to check
print(f"Does slice share memory with arr? {np.shares_memory(arr, slice_view)}")
print(f"Does fancy index share memory with arr? {np.shares_memory(arr, fancy_copy)}")

Does slice share memory with arr? True
Does fancy index share memory with arr? False
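
The side effects of views, and the cases where `reshape` has to fall back to a copy, can be sketched as follows. The variable names are illustrative only.

```python
import numpy as np

arr = np.arange(10)

# Writing through a view changes the original array.
view = arr[2:5]
view[0] = 99
print(arr[2])    # 99

# Writing to an explicit copy does not.
safe = arr[2:5].copy()
safe[0] = -1
print(arr[2])    # 99 (unchanged by the write to `safe`)

m = np.arange(6).reshape(2, 3)

# Reshaping a contiguous array only reinterprets the same buffer.
print(np.shares_memory(m, m.reshape(3, 2)))   # True

# The transpose is not C-contiguous, so flattening it in C order
# cannot be expressed as a view and NumPy copies instead.
print(np.shares_memory(m, m.T.reshape(6)))    # False

# ravel returns a view when it can; flatten always copies.
print(np.shares_memory(m, m.ravel()))         # True
print(np.shares_memory(m, m.flatten()))       # False
```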


**Rule of Thumb:** If you need to change data without affecting the original array, work on an explicit `.copy()`. If you're working with large data and performance is critical, be mindful of which operations create copies.