# NumPy Benchmarking Tutorial
 
This notebook demonstrates best practices for benchmarking numerical and array operations in NumPy:
* Explains the pitfalls of inaccurate timing methods
* Explores high-resolution profiling techniques
* Compares the performance of different implementations and functions

# Setup & reproducibility

In [11]:
%load_ext line_profiler

import time
import numpy as np

rng = np.random.default_rng(42)

# A helper to create test data of controllable size
def make_data(n=20_000_000):
    return rng.standard_normal(n, dtype=np.float64)

# We'll also want 2D data for BLAS-backed ops:
def make_matrix(n=2000):
    return rng.standard_normal((n, n), dtype=np.float64)

print("Ready.")


The line_profiler extension is already loaded. To reload it, use:
  %reload_ext line_profiler
Ready.


##  WRONG vs RIGHT: measuring runtime


### Why use `time.perf_counter()` instead of `time.time()`?

* `time.time()`: returns the current wall-clock time as seconds since the epoch. Its resolution can be coarser than that of other clocks, and its value can jump forward or backward whenever the system clock is adjusted (e.g. manual clock corrections or NTP syncs).

* `time.perf_counter()`: uses the highest-resolution clock available for measuring short durations. It is monotonic (it never goes backward) and is unaffected by system clock updates, which makes it the right choice for benchmarking. Its reference point is undefined, so only the difference between two calls is meaningful.

* `%timeit`: an IPython/Jupyter magic that runs the statement multiple times and reports the mean and standard deviation. It uses `time.perf_counter()` under the hood.
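
The properties listed above can be inspected on your own machine. The following minimal sketch uses the standard library's `time.get_clock_info()`; the variable names `wall` and `perf` are illustrative only, and the reported resolution depends on the platform.

```python
import time

# Print the properties of the clocks behind time.time() and time.perf_counter().
for name in ("time", "perf_counter"):
    info = time.get_clock_info(name)
    print(
        f"{name:>12}: resolution={info.resolution:.1e}s, "
        f"monotonic={info.monotonic}, adjustable={info.adjustable}"
    )

# Keep both records around for a closer look (illustrative names).
wall = time.get_clock_info("time")
perf = time.get_clock_info("perf_counter")
```

An `adjustable` clock can be changed by the operating system or an administrator, which is why the wall clock is a poor basis for measuring intervals.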

In [15]:
def sum_of_squares(x):
    s = 0.0
    for v in x:
        s += v * v
    return s

x = make_data()

# WRONG: time.time() has lower resolution and is affected by system clock changes.
start = time.time()
_ = sum_of_squares(x)
elapsed = time.time() - start
print(f"[WRONG] time.time() measurement ~{elapsed:.3f}s")

# RIGHT: time.perf_counter() is the highest available resolution timer for benchmarking.
start = time.perf_counter()
_ = sum_of_squares(x)
elapsed = time.perf_counter() - start
print(f"[RIGHT] time.perf_counter() measurement ~{elapsed:.3f}s")

# NOTE: %timeit is an IPython/Jupyter magic; it runs multiple times and reports stats.
%timeit sum_of_squares(x)


[WRONG] time.time() measurement ~2.772s
[RIGHT] time.perf_counter() measurement ~2.812s
2.54 s ± 146 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
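
Outside IPython, where `%timeit` is not available, repeated measurements can be taken with the standard library's `timeit` module. The sketch below is one possible approach, not the notebook's own method: the helper `best_time`, the function names, and the reduced array size are assumptions made for illustration. It also compares the Python loop with a vectorized `np.dot` version.

```python
import timeit

import numpy as np

rng = np.random.default_rng(42)
x = rng.standard_normal(100_000)  # kept small so the loop version stays quick


def sum_of_squares_loop(values):
    """Pure-Python loop, as in the cell above."""
    total = 0.0
    for v in values:
        total += v * v
    return total


def sum_of_squares_dot(values):
    """For a 1-D array, the dot product with itself is the sum of squares."""
    return float(np.dot(values, values))


def best_time(func, arg, repeat=5, number=3):
    """Best per-call time in seconds over `repeat` batches of `number` calls."""
    batches = timeit.repeat(lambda: func(arg), repeat=repeat, number=number)
    return min(batches) / number


loop_t = best_time(sum_of_squares_loop, x)
dot_t = best_time(sum_of_squares_dot, x)
print(f"loop: {loop_t * 1e3:.3f} ms per call")
print(f"dot : {dot_t * 1e3:.3f} ms per call")
```

Taking the minimum over several batches is a common convention because slower runs usually reflect interference from other processes; `%timeit` reports the mean and standard deviation instead.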


#  `%prun` to find hotspots (profiling)

In [16]:
# Example function with multiple steps
def pipeline(n=1_000_000):
    a = rng.standard_normal(n)
    b = rng.standard_normal(n)
    # intentionally unvectorized part:
    total = 0.0
    for i in range(0, n, 5):
        total += (a[i] * b[i]) ** 2
    # vectorized part:
    y = (a + b) * (a - b)
    return total + y.sum()

%prun -s tottime pipeline(10_000_000)
# Tip: '-s tottime' sorts by internal time (used here); try '-s cumulative' to sort by cumulative time.


 

         7 function calls in 1.318 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.289    1.289    1.295    1.295 3341342332.py:2(pipeline)
        1    0.023    0.023    1.318    1.318 <string>:1(<module>)
        1    0.006    0.006    0.006    0.006 {method 'reduce' of 'numpy.ufunc' objects}
        1    0.000    0.000    0.006    0.006 _methods.py:49(_sum)
        1    0.000    0.000    1.318    1.318 {built-in method builtins.exec}
        1    0.000    0.000    0.006    0.006 {method 'sum' of 'numpy.ndarray' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
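
A report of the same shape can be produced in a plain Python script with the standard library's `cProfile` and `pstats` modules. This is a minimal sketch under a few assumptions: the module-level `rng` is recreated locally, `pipeline` is redefined with a smaller default `n` so it runs quickly, and the names `profiler`, `buffer` and `report` are illustrative.

```python
import cProfile
import io
import pstats

import numpy as np

rng = np.random.default_rng(42)


def pipeline(n=200_000):
    a = rng.standard_normal(n)
    b = rng.standard_normal(n)
    total = 0.0
    for i in range(0, n, 5):  # intentionally unvectorized part
        total += (a[i] * b[i]) ** 2
    y = (a + b) * (a - b)     # vectorized part
    return total + y.sum()


# Collect profiling data only while pipeline() runs.
profiler = cProfile.Profile()
profiler.enable()
pipeline()
profiler.disable()

# Render the report into a string, sorted by internal time like '-s tottime'.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("tottime").print_stats(5)  # show at most the top 5 rows
report = buffer.getvalue()
print(report)
```

As in the `%prun` output, nearly all of the internal time lands on `pipeline` itself, because a function-level profiler cannot see inside the Python loop. That is the gap a line-by-line profiler fills.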

In [17]:
%lprun -f pipeline pipeline(10_000_000)

Timer unit: 1e-09 s

Total time: 3.27197 s
File: /var/folders/2f/hd8mz33s7_q4_3ftljzc6rkm0000gn/T/ipykernel_44328/3341342332.py
Function: pipeline at line 2

Line #      Hits         Time  Per Hit   % Time  Line Contents
     2                                           def pipeline(n=1_000_000):
     3         1  135996000.0 1.36e+08      4.2      a = rng.standard_normal(n)
     4         1  111620000.0 1.12e+08      3.4      b = rng.standard_normal(n)
     5                                               # intentionally unvectorized part:
     6         1       2000.0   2000.0      0.0      total = 0.0
     7   2000001 1105769000.0    552.9     33.8      for i in range(0, n, 5):
     8   2000000 1865750000.0    932.9     57.0          total += (a[i] * b[i]) ** 2
     9                                               # vectorized part:
    10         1   46953000.0  4.7e+07      1.4      y = (a + b) * (a - b)
    11         1    5881000.0 5.88e+06      0.2      return total + y.sum()
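
The per-line timings above attribute roughly 90% of the runtime to the two lines of the Python loop. One way to remove that hotspot is to replace the loop with strided slicing, sketched below. The function names are illustrative, and unlike the original `pipeline`, both versions here take `a` and `b` as arguments so they can be checked against the same data.

```python
import numpy as np

rng = np.random.default_rng(42)


def pipeline_loop(a, b):
    """Loop version: visits every 5th element in Python."""
    total = 0.0
    for i in range(0, a.size, 5):
        total += (a[i] * b[i]) ** 2
    return total + ((a + b) * (a - b)).sum()


def pipeline_vectorized(a, b):
    """Vectorized version: a[::5] selects the same indices as range(0, n, 5)."""
    total = np.sum((a[::5] * b[::5]) ** 2)
    return total + ((a + b) * (a - b)).sum()


a = rng.standard_normal(100_000)
b = rng.standard_normal(100_000)
loop_result = pipeline_loop(a, b)
vec_result = pipeline_vectorized(a, b)

# The two results should agree up to floating-point rounding.
print(np.isclose(loop_result, vec_result))
```

The results are compared with `np.isclose` rather than `==` because the two versions add the terms in a different order, which can change the last few bits of the floating-point sum.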

### Change the `Timer unit` of the line profiler report

In [18]:
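from line_profiler import LineProfiler

# Use the LineProfiler API directly so the report unit can be chosen.
lp = LineProfiler()
lp.add_function(pipeline)  # register the function to profile line by line
lp.enable_by_count()
pipeline(1_000_000)
lp.disable_by_count()

# output_unit is given in seconds: 1e-3 reports times in milliseconds.
lp.print_stats(output_unit=1e-3)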


Timer unit: 0.001 s

Total time: 0.368647 s
File: /var/folders/2f/hd8mz33s7_q4_3ftljzc6rkm0000gn/T/ipykernel_44328/3341342332.py
Function: pipeline at line 2

Line #      Hits         Time  Per Hit   % Time  Line Contents
     2                                           def pipeline(n=1_000_000):
     3         1         19.5     19.5      5.3      a = rng.standard_normal(n)
     4         1         19.6     19.6      5.3      b = rng.standard_normal(n)
     5                                               # intentionally unvectorized part:
     6         1          0.0      0.0      0.0      total = 0.0
     7    200001        106.9      0.0     29.0      for i in range(0, n, 5):
     8    200000        182.9      0.0     49.6          total += (a[i] * b[i]) ** 2
     9                                               # vectorized part:
    10         1         39.1     39.1     10.6      y = (a + b) * (a - b)
    11         1          0.7      0.7      0.2      return total + y.sum()



#  BLAS check with `np.show_config`

In [20]:
# WRONG: not checking your NumPy build; you might be using a slow, non-threaded BLAS.

print("\n[RIGHT] NumPy build / BLAS info:")
np.show_config()  # look for MKL, OpenBLAS, Apple Accelerate, etc.

# Simple BLAS-backed operation (matrix multiply). Adjust size per your machine.
A = make_matrix(700)
B = make_matrix(700)

# %timeit A @ B  # '@' uses high-performance BLAS when available



[RIGHT] NumPy build / BLAS info:
Build Dependencies:
  blas:
    detection method: system
    found: true
    include directory: unknown
    lib directory: unknown
    name: accelerate
    openblas configuration: unknown
    pc file directory: unknown
    version: unknown
  lapack:
    detection method: system
    found: true
    include directory: unknown
    lib directory: unknown
    name: accelerate
    openblas configuration: unknown
    pc file directory: unknown
    version: unknown
Compilers:
  c:
    commands: cc
    linker: ld64
    name: clang
    version: 15.0.0
  c++:
    commands: c++
    linker: ld64
    name: clang
    version: 15.0.0
  cython:
    commands: cython
    linker: cython
    name: cython
    version: 3.1.3
Machine Information:
  build:
    cpu: x86_64
    endian: little
    family: x86_64
    system: darwin
  host:
    cpu: x86_64
    endian: little
    family: x86_64
    system: darwin
Python Information:
  path: /private/var/folders/vk/nx37ffx50hv5djclhl