![Numba Slide](img/03_Numba/Folie1.PNG)

![Numba Slide](img/03_Numba/Folie2.PNG)

In [1]:
import numba
import numpy as np
import time
from numba import jit, njit, vectorize, prange
import math

print(f"Numba version: {numba.__version__}")
print(f"NumPy version: {np.__version__}")

Numba version: 0.61.2
NumPy version: 2.2.6


## 1.1 Your First Numba Function

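The `@jit` decorator compiles a Python function to machine code the first time it is called with a given combination of argument types. The benchmark below compares a pure Python loop, the same loop under `@jit`, and NumPy: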

In [2]:
# Pure Python version
def python_sum_squares(arr):
    total = 0.0
    for i in range(len(arr)):
        total += arr[i] ** 2
    return total

# Numba JIT version - identical code, just add decorator
@jit
def numba_sum_squares(arr):
    total = 0.0
    for i in range(len(arr)):
        total += arr[i] ** 2
    return total

# Create test data
data = np.random.rand(1_000_000)

# Warm up Numba (compilation happens here)
_ = numba_sum_squares(data)

print("Performance comparison:")
print("\nPure Python:")
%timeit python_sum_squares(data)

print("\nNumba JIT:")
%timeit numba_sum_squares(data)

print("\nNumPy (for reference):")
%timeit np.sum(data**2)

# Verify results are the same
py_result = python_sum_squares(data[:1000])
numba_result = numba_sum_squares(data[:1000])
numpy_result = np.sum(data[:1000]**2)

print(f"\nResults match: {np.allclose([py_result, numba_result, numpy_result], numpy_result)}")

Performance comparison:

Pure Python:
186 ms ± 4.92 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Numba JIT:
1.04 ms ± 24.1 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

NumPy (for reference):
1.16 ms ± 27.4 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Results match: True


![Numba Slide](img/03_Numba/Folie3.PNG)

In [3]:
def bubblesort(X):
    N = len(X)
    for end in range(N, 1, -1):
        for i in range(end - 1):
            cur = X[i]
            if cur > X[i + 1]:
                tmp = X[i]
                X[i] = X[i + 1]
                X[i + 1] = tmp

# same function as above, just add @jit decorator
@jit
def bubblesort_jit(X):
    N = len(X)
    for end in range(N, 1, -1):
        for i in range(end - 1):
            cur = X[i]
            if cur > X[i + 1]:
                tmp = X[i]
                X[i] = X[i + 1]
                X[i + 1] = tmp

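# same function as above, with the @njit decorator (explicit nopython mode)
# note: since Numba 0.59, plain @jit also defaults to nopython mode, so both variants compile the same way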
@njit
def bubblesort_njit(X):
    N = len(X)
    for end in range(N, 1, -1):
        for i in range(end - 1):
            cur = X[i]
            if cur > X[i + 1]:
                tmp = X[i]
                X[i] = X[i + 1]
                X[i + 1] = tmp


original = np.arange(0.0, 10.0, 0.01, dtype='f4')
shuffled = original.copy()
np.random.shuffle(shuffled)

# sort a copy so that `shuffled` stays available as unsorted input
srt = shuffled.copy()
bubblesort(srt)
print(np.array_equal(srt, original))

# warmup - pre-compile the functions before time measurement
_ = bubblesort_jit(srt)
_ = bubblesort_njit(srt)

# each timed iteration first restores the unsorted input, then sorts it in place
# the copy is part of the measured time, but its cost is negligible (measured at the end)
srt[:] = shuffled[:]
print("\nSort time pure:")
%timeit srt[:] = shuffled[:]; bubblesort(srt)

print("\nSort time JIT:")
%timeit srt[:] = shuffled[:]; bubblesort_jit(srt)

print("\nSort time NJIT:")
%timeit srt[:] = shuffled[:]; bubblesort_njit(srt)

# just check how big the copy overhead is:
print("\nCopy overhead (negligible):")
%timeit srt[:] = shuffled[:]

True

Sort time pure:
154 ms ± 3.21 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Sort time JIT:
657 μs ± 4 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Sort time NJIT:
650 μs ± 2.38 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Copy overhead (negligible):
401 ns ± 5.45 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)


## 1.2 Numba Type System

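Numba infers types automatically and compiles a specialization on the first call for each new combination of argument types. You can also pass an explicit signature, which compiles the function eagerly at decoration time and restricts it to that signature; it does not generally make the compiled code faster: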

In [4]:
from numba import float64, int64

# Automatic type inference (recommended)
@njit
def auto_typed_function(x, y):
    return x * y + x**2

# Explicit type specification
@njit(float64(float64, float64))  # return_type(arg1_type, arg2_type)
def explicit_typed_function(x, y):
    return x * y + x**2

# Test both
result1 = auto_typed_function(3.14, 2.71)
result2 = explicit_typed_function(3.14, 2.71)

print(f"Auto-typed result: {result1}")
print(f"Explicit-typed result: {result2}")
print(f"Results match: {np.isclose(result1, result2)}")

# Show function signatures (after compilation)
print(f"\nAuto-typed signatures: {auto_typed_function.signatures}")
print(f"Explicit-typed signatures: {explicit_typed_function.signatures}")

Auto-typed result: 18.369
Explicit-typed result: 18.369
Results match: True

Auto-typed signatures: [(float64, float64)]
Explicit-typed signatures: [(float64, float64)]


### Exercise 1: Counting Primes

Implement a method that counts primes and test how it compares to the njit-optimized version:

In [5]:
# Exercise 1: Implement Prime Counting

# Pure Python version
def count_primes_py(n):
    count = 0
    for num in range(2, n):
        prime = True
        for factor in range(2, int(num**0.5) + 1):
            if num % factor == 0:
                prime = False
                break
        if prime:
            count += 1
    return count

# Numba-accelerated version
@njit
def count_primes_numba(n):
    count = 0
    for num in range(2, n):
        prime = True
        for factor in range(2, int(num**0.5) + 1):
            if num % factor == 0:
                prime = False
                break
        if prime:
            count += 1
    return count

N = 5_000_000  # Large enough to see the speedup; the pure Python run takes tens of seconds

# Warm up Numba
count_primes_numba(100)

# Benchmark
start = time.time()
print(f"Python primes: Found {count_primes_py(N)} primes in {time.time() - start:.3f} s")

start = time.time()
print(f"Numba primes:  Found {count_primes_numba(N)} primes in {time.time() - start:.3f} s")

Python primes: Found 348513 primes in 32.589 s
Numba primes:  Found 348513 primes in 1.876 s
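
To check the result independently, the count can be cross-validated with a different algorithm. The sketch below uses a NumPy Sieve of Eratosthenes, which is not the trial-division method from the exercise; the name `count_primes_sieve` is illustrative. Like the functions above, it counts primes strictly below `n`:

```python
import numpy as np

def count_primes_sieve(n):
    """Count primes strictly below n with a Sieve of Eratosthenes."""
    if n < 3:
        return 0
    is_prime = np.ones(n, dtype=np.bool_)
    is_prime[:2] = False  # 0 and 1 are not prime
    for p in range(2, int(n**0.5) + 1):
        if is_prime[p]:
            # Cross out all multiples of p, starting at p*p
            is_prime[p * p::p] = False
    return int(is_prime.sum())

print(count_primes_sieve(100))
print(count_primes_sieve(5_000_000))
```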


## 2.1 Parallel Computing with Numba

With `parallel=True`, Numba distributes the iterations of a `prange()` (parallel range) loop across threads. The iterations must be independent of each other:

In [6]:
# Serial version
@njit
def serial_computation(arr):
    result = np.empty_like(arr)
    for i in range(len(arr)):
        # Expensive computation per element
        x = arr[i]
        result[i] = math.sin(x) * math.cos(x**2) + math.exp(-x**2)
    return result

# Parallel version - just change `range` to `prange`
@njit(parallel=True)
def parallel_computation(arr):
    result = np.empty_like(arr)
    for i in prange(len(arr)):  # prange instead of range
        # Same expensive computation
        x = arr[i]
        result[i] = math.sin(x) * math.cos(x**2) + math.exp(-x**2)
    return result

# Test data
large_array = np.random.rand(1_000_000)

# Warm up
_ = serial_computation(large_array)
_ = parallel_computation(large_array)

print("Parallel vs Serial comparison:")
print("Serial version:")
%timeit serial_computation(large_array)

print("Parallel version:")
%timeit parallel_computation(large_array)

# Verify results match
serial_result = serial_computation(large_array[:1000])
parallel_result = parallel_computation(large_array[:1000])
print(f"\nResults match: {np.allclose(serial_result, parallel_result)}")

# Show number of threads
from numba import config
print(f"Using {config.NUMBA_NUM_THREADS} threads")

Parallel vs Serial comparison:
Serial version:
19.7 ms ± 327 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Parallel version:
6.82 ms ± 87.3 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Results match: True
Using 12 threads


### Parallel Reductions

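Numba also recognizes reductions inside `prange` loops: a scalar that is updated with an operator such as `+=` is accumulated separately per thread and combined at the end: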

In [7]:
# Parallel reduction example
@njit(parallel=True)
def parallel_sum_squares(arr):
    total = 0.0
    for i in prange(len(arr)):
        total += arr[i]**2  # Numba handles the reduction automatically
    return total

@njit(parallel=True)
def parallel_dot_product(a, b):
    result = 0.0
    for i in prange(len(a)):
        result += a[i] * b[i]
    return result

# Test parallel reductions
a = np.random.rand(5_000_000)
b = np.random.rand(5_000_000)

# Warm up
_ = parallel_sum_squares(a)
_ = parallel_dot_product(a, b)

print("Parallel reductions:")
print("Sum of squares:")
%timeit parallel_sum_squares(a)
%timeit np.sum(a**2)  # NumPy comparison

print("\nDot product:")
%timeit parallel_dot_product(a, b)
%timeit np.dot(a, b)  # NumPy comparison

# Verify correctness
print(f"\nSum squares match: {np.isclose(parallel_sum_squares(a), np.sum(a**2))}")
print(f"Dot product match: {np.isclose(parallel_dot_product(a, b), np.dot(a, b))}")

Parallel reductions:
Sum of squares:
1.8 ms ± 34.4 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
8.04 ms ± 47.7 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Dot product:
3.35 ms ± 77.5 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.69 ms ± 467 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Sum squares match: True
Dot product match: True
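
Conceptually, a parallel reduction splits the data into chunks, reduces each chunk to a partial result, and then combines the partial results. The NumPy sketch below imitates that idea serially; it is not Numba's actual implementation, and `chunked_sum_squares` and `n_chunks` are illustrative names. It also shows why the checks above use `np.isclose` rather than `==`: a different summation order can change the last few bits of a floating-point result.

```python
import numpy as np

def chunked_sum_squares(arr, n_chunks=4):
    # Each chunk stands in for the work of one thread
    partials = [np.sum(chunk**2) for chunk in np.array_split(arr, n_chunks)]
    # The partial sums are combined once at the end
    return sum(partials)

a = np.random.rand(1_000_000)
reference = np.sum(a**2)
chunked = chunked_sum_squares(a)

print(f"reference: {reference!r}")
print(f"chunked:   {chunked!r}")
print(f"close: {np.isclose(reference, chunked)}, difference: {abs(reference - chunked):.3e}")
```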


## 2.2 Universal Functions (ufuncs)

Create custom NumPy-style universal functions with `@vectorize`:

In [8]:
# Create a custom ufunc
@vectorize(['float64(float64, float64)'], target='parallel')
def custom_hypot(x, y):
    """Custom hypotenuse function: sqrt(x² + y²)"""
    return math.sqrt(x*x + y*y)

@vectorize(['float64(float64)'], target='parallel')
def sigmoid(x):
    """Sigmoid activation function"""
    return 1.0 / (1.0 + math.exp(-x))

# Test ufuncs
x = np.random.rand(1_000_000) * 10
y = np.random.rand(1_000_000) * 10

print("Custom ufunc performance:")
print("Custom hypot:")
%timeit custom_hypot(x, y)

print("NumPy hypot:")
%timeit np.hypot(x, y)

print("\nCustom sigmoid:")
%timeit sigmoid(x)

print("Manual sigmoid:")
%timeit 1.0 / (1.0 + np.exp(-x))

# Verify results
print(f"\nHypot results match: {np.allclose(custom_hypot(x[:100], y[:100]), np.hypot(x[:100], y[:100]))}")
print(f"Sigmoid vs manual: {np.allclose(sigmoid(x[:100]), 1.0 / (1.0 + np.exp(-x[:100])))}")

Custom ufunc performance:
Custom hypot:
981 μs ± 6.76 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
NumPy hypot:
16.3 ms ± 132 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Custom sigmoid:
1.26 ms ± 6.98 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Manual sigmoid:
9.54 ms ± 128 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Hypot results match: True
Sigmoid vs manual: True


### Exercise 2: Custom Distance Metrics

Create vectorized distance functions:

In [9]:
# Exercise 2: Implement custom distance metrics
@vectorize(['float64(float64, float64, float64, float64)'], target='parallel')
def manhattan_distance(x1, y1, x2, y2):
    # TODO: Implement Manhattan distance |x1-x2| + |y1-y2|
    return 0

@vectorize(['float64(float64, float64, float64, float64)'], target='parallel')
def euclidean_distance(x1, y1, x2, y2):
    # TODO: Implement Euclidean distance sqrt((x1-x2)² + (y1-y2)²)
    return 0

# Solution:
@vectorize(['float64(float64, float64, float64, float64)'], target='parallel')
def manhattan_distance_solution(x1, y1, x2, y2):
    return abs(x1 - x2) + abs(y1 - y2)

@vectorize(['float64(float64, float64, float64, float64)'], target='parallel')
def euclidean_distance_solution(x1, y1, x2, y2):
    dx = x1 - x2
    dy = y1 - y2
    return math.sqrt(dx*dx + dy*dy)

# Test with random points
n_points = 1_000_000
x1 = np.random.rand(n_points) * 100
y1 = np.random.rand(n_points) * 100
x2 = np.random.rand(n_points) * 100
y2 = np.random.rand(n_points) * 100

print("Distance computation performance:")
%timeit manhattan_distance_solution(x1, y1, x2, y2)
%timeit euclidean_distance_solution(x1, y1, x2, y2)

# Compare with manual NumPy implementation
print("\nNumPy comparison:")
%timeit np.abs(x1 - x2) + np.abs(y1 - y2)  # Manhattan
%timeit np.sqrt((x1 - x2)**2 + (y1 - y2)**2)  # Euclidean

Distance computation performance:
1.77 ms ± 124 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
1.66 ms ± 8.26 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

NumPy comparison:
7.07 ms ± 38.1 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
6.17 ms ± 74.6 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


## 3.1 Advanced Example: Monte Carlo Integration

Let's implement Monte Carlo integration to estimate the area under curves:

In [10]:
@njit
def monte_carlo_integrate(func, a, b, n_samples, seed=42):
    """
    Monte Carlo integration of func from a to b
    
    Estimates: ∫[a to b] func(x) dx ≈ (b-a) * mean(func(random_points))
    """
    np.random.seed(seed)
    
    # Generate random points in [a, b]
    total = 0.0
    for i in range(n_samples):
        x = a + (b - a) * np.random.random()
        total += func(x)
    
    # Estimate integral
    return (b - a) * total / n_samples

# Parallel version
@njit(parallel=True)
def monte_carlo_integrate_parallel(func, a, b, n_samples, seed=42):
    """Parallel Monte Carlo integration"""
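    # Caveat: Numba keeps separate random-generator state per thread, so seeding
    # here does not make the parallel result reproducible from run to run
    np.random.seed(seed)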
    
    total = 0.0
    for i in prange(n_samples):
        x = a + (b - a) * np.random.random()
        total += func(x)
    
    return (b - a) * total / n_samples

# Define test functions
@njit
def quadratic(x):
    return x**2

@njit
def sin_exp(x):
    return math.sin(x) * math.exp(-x**2)

# Test integration
n_samples = 10_000_000

print("Monte Carlo Integration Results:")

# Test 1: ∫[0 to 1] x² dx = 1/3
result1 = monte_carlo_integrate(quadratic, 0.0, 1.0, n_samples)
analytical1 = 1.0/3.0
print(f"∫₀¹ x² dx: {result1:.6f} (analytical: {analytical1:.6f}, error: {abs(result1-analytical1):.6f})")

# Test 2: More complex function
result2 = monte_carlo_integrate_parallel(sin_exp, 0.0, 2.0, n_samples)
print(f"∫₀² sin(x)e^(-x²) dx: {result2:.6f}")

# Performance comparison
print("\nPerformance comparison:")
print("Serial version:")
%timeit monte_carlo_integrate(quadratic, 0.0, 1.0, 1_000_000)

print("Parallel version:")
%timeit monte_carlo_integrate_parallel(quadratic, 0.0, 1.0, 1_000_000)

Monte Carlo Integration Results:
∫₀¹ x² dx: 0.333326 (analytical: 0.333333, error: 0.000008)
∫₀² sin(x)e^(-x²) dx: 0.421054

Performance comparison:
Serial version:
3.85 ms ± 16.8 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Parallel version:
834 μs ± 18.4 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
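
The error of a Monte Carlo estimate shrinks proportionally to 1/√N, and it can be estimated from the samples themselves as (b − a)·s/√N, where s is the sample standard deviation of the function values. The sketch below computes both numbers with plain NumPy; `mc_integrate_with_error` is an illustrative helper, not part of the notebook's Numba code, and it uses NumPy's `default_rng` rather than Numba's generator:

```python
import numpy as np

def mc_integrate_with_error(func, a, b, n_samples, seed=42):
    """Return (estimate, standard error) of the integral of func over [a, b]."""
    rng = np.random.default_rng(seed)
    values = func(rng.uniform(a, b, n_samples))
    estimate = (b - a) * values.mean()
    std_error = (b - a) * values.std(ddof=1) / np.sqrt(n_samples)
    return estimate, std_error

estimate, std_error = mc_integrate_with_error(lambda x: x**2, 0.0, 1.0, 1_000_000)
print(f"estimate: {estimate:.6f} ± {std_error:.6f} (analytical: {1/3:.6f})")
```

For x² on [0, 1] the standard deviation of the function values is √(4/45) ≈ 0.298, so halving the error requires four times as many samples.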


## 3.2 Memory Management and Optimization

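Compiled code does not remove the cost of allocating arrays inside a loop. The example below compares allocating a temporary array on every iteration with reusing a single buffer. In this particular benchmark both versions take about 4.9 ms, because random-number generation dominates the runtime and `temp_array**2` still creates a temporary array in both versions: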

In [11]:
# Demonstrate memory allocation overhead
@njit
def inefficient_function(n):
    """Creates many temporary arrays (inefficient)"""
    total = 0.0
    for i in range(100):
        temp_array = np.random.random(n)  # Memory allocation in loop!
        total += np.sum(temp_array**2)
    return total

@njit
def efficient_function(n):
    """Reuses memory (efficient)"""
    temp_array = np.empty(n)  # Allocate once
    total = 0.0
    for i in range(100):
        # Fill with random numbers (no allocation)
        for j in range(n):
            temp_array[j] = np.random.random()
        total += np.sum(temp_array**2)
    return total

# Warm up
n_size = 10000
_ = inefficient_function(n_size)
_ = efficient_function(n_size)

print("Memory allocation comparison:")
print("Inefficient (allocates in loop):")
%timeit inefficient_function(n_size)

print("Efficient (allocates once):")
%timeit efficient_function(n_size)

# Verify results are similar (statistical variation expected)
result1 = inefficient_function(1000)
result2 = efficient_function(1000)
print(f"\nResults in similar range: {abs(result1 - result2) / result1 < 0.1}")

Memory allocation comparison:
Inefficient (allocates in loop):
4.9 ms ± 22.6 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Efficient (allocates once):
4.93 ms ± 32.3 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Results in similar range: True


## 3.3 Numba Limitations and Workarounds

Numba doesn't support all Python features. Here are common limitations and solutions:

In [12]:
# Example limitations and workarounds

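# 1. List comprehensions compile in nopython mode, but they build a list;
#    for numeric results a pre-allocated NumPy array is usually the better fit
# LIST-BASED:
# @njit
# def list_function(arr):
#     return [x**2 for x in arr]  # Returns a list rather than an array

# ARRAY-BASED: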
@njit
def good_function(arr):
    result = np.empty_like(arr)
    for i in range(len(arr)):
        result[i] = arr[i]**2
    return result

# 2. Strings have limited support
@njit
def string_limited():
    # Some string operations work
    s = "hello"
    return len(s)  # This works
    # return s.upper()  # This might not work

# 3. Dynamic data structures (lists, dicts) have limitations
# Use NumPy arrays instead when possible

@njit
def simulate_dynamic_list(n):
    """Simulate growing list with pre-allocated array"""
    max_size = n * 2  # Over-allocate
    data = np.empty(max_size)
    size = 0
    
    for i in range(n):
        if size < max_size:
            data[size] = i**2
            size += 1
    
    return data[:size]  # Return only used portion

# Test workarounds
test_arr = np.array([1, 2, 3, 4, 5])
result = good_function(test_arr)
print(f"Array transformation: {test_arr} -> {result}")

str_len = string_limited()
print(f"String length: {str_len}")

dynamic_result = simulate_dynamic_list(10)
print(f"Simulated dynamic list: {dynamic_result}")

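# 4. Object mode for unsupported features
# Since Numba 0.59, plain @jit compiles in nopython mode and no longer falls back
# silently; object mode has to be requested explicitly with forceobj=True
@jit(forceobj=True)
def fallback_function(data):
    # Numba may still compile this loop on its own ("loop lifting")
    processed = np.empty_like(data)
    for i in range(len(data)):
        processed[i] = data[i] * 2
    
    # Code that Numba cannot compile would run through the Python interpreter here
    # (but we'll keep it simple)
    return processed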

fallback_result = fallback_function(test_arr)
print(f"Fallback function result: {fallback_result}")

Array transformation: [1 2 3 4 5] -> [ 1  4  9 16 25]
String length: 5
Simulated dynamic list: [ 0.  1.  4.  9. 16. 25. 36. 49. 64. 81.]
Fallback function result: [ 2  4  6  8 10]


## Summary: Numba Best Practices

### ✅ Do:
- Use `@njit` for maximum performance
- Replace Python loops with NumPy operations when possible
- Use `prange()` for embarrassingly parallel problems
- Pre-allocate arrays outside loops
- Profile before and after optimization
- Use explicit signatures when debugging type problems

### ❌ Don't:
- Build Python lists in hot loops when a pre-allocated NumPy array would do
- Allocate memory inside tight loops
- Mix object mode and nopython mode unnecessarily
- Expect every Python feature to compile (arbitrary classes, most third-party libraries, etc.)
- Forget compilation overhead for small functions

In [13]:
# Final performance showcase: Mandelbrot set computation
@njit(parallel=True)
def mandelbrot_set(height, width, max_iter=100):
    """Compute the Mandelbrot set"""
    result = np.zeros((height, width))
    
    for i in prange(height):
        for j in range(width):
            # Map pixel coordinates to complex plane
            c = complex(-2.0 + 3.0 * j / width, -1.5 + 3.0 * i / height)
            z = 0.0 + 0.0j
            
            # Iterate until divergence or max_iter
            for n in range(max_iter):
                if abs(z) > 2.0:
                    break
                z = z*z + c
            
            result[i, j] = n
    
    return result

print("Numba Performance Showcase: Mandelbrot Set")
print("=" * 50)

# Warm up
_ = mandelbrot_set(100, 100, 50)

# Compute a large Mandelbrot set
size = 800
iterations = 100

start_time = time.time()
mandelbrot = mandelbrot_set(size, size, iterations)
end_time = time.time()

total_pixels = size * size
total_iterations = np.sum(mandelbrot)
pixels_per_second = total_pixels / (end_time - start_time)
iterations_per_second = total_iterations / (end_time - start_time)

print(f"Computed {size}×{size} Mandelbrot set in {end_time-start_time:.2f} seconds")
print(f"Performance: {pixels_per_second/1e6:.1f} million pixels/second")
print(f"Iterations: {iterations_per_second/1e6:.1f} million iterations/second")
print(f"Peak value: {np.max(mandelbrot)}")

Numba Performance Showcase: Mandelbrot Set
Computed 800×800 Mandelbrot set in 0.03 seconds
Performance: 19.8 million pixels/second
Iterations: 413.3 million iterations/second
Peak value: 99.0
