# NumExpr Performance Demonstration

NumExpr is a high-performance numerical expression evaluator for NumPy arrays. It provides:

- **Multi-threaded execution** using all CPU cores
- **Reduced memory allocation** by evaluating expressions in chunks
- **Optimized operations** that minimize cache misses
- **Expression compilation** to bytecode that runs on NumExpr's own virtual machine

This notebook demonstrates the performance advantages of NumExpr over standard NumPy operations for large-scale numerical computations.
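The chunking idea behind the reduced memory allocation can be sketched in plain NumPy. This is a conceptual illustration only, not NumExpr's actual implementation: the helper name `chunked_eval` and the block size of 4096 elements are assumptions made for the example, and it handles 1-D arrays only.

```python
import numpy as np

def chunked_eval(x, y, chunk_size=4096):
    """Compute x**2 + y**2 block by block (conceptual sketch, 1-D arrays only)."""
    out = np.empty_like(x)
    for start in range(0, x.size, chunk_size):
        stop = start + chunk_size
        xs, ys = x[start:stop], y[start:stop]
        # Temporaries created here hold at most chunk_size elements,
        # so they never grow to the size of the full input.
        out[start:stop] = xs * xs + ys * ys
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)
y = rng.standard_normal(100_000)
chunked = chunked_eval(x, y)
```

The only full-size allocation is the output array; every intermediate value lives in a small block that is reused on the next iteration.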

In [None]:
import numpy as np
import numexpr as ne
import time

# Create large arrays for performance testing
# Using 50 million elements (~400MB each) to highlight performance differences
n = 50_000_000
a = np.random.randn(n)  # Random normal distribution
b = np.random.randn(n)
c = np.random.randn(n)
d = np.random.randn(n)
e = np.random.randn(n)

print(f"Array size: {n:,} elements ({a.nbytes/1024**2:.1f} MB each)")

# Traditional NumPy approach: multiple temporary arrays created
def numpy_calculation():
    """
    Standard NumPy calculation creating multiple intermediate arrays.
    Each operation creates a new array, consuming memory and time.
    """
    # Step 1: Complex polynomial (creates temporary array)
    step1 = a**3 + 2*b**2 - 3*c + d
    
    # Step 2: Trigonometric operations (creates more temporaries)
    step2 = np.sin(step1) * np.cos(b) + np.exp(-c/10)
    
    # Step 3: Statistical transformations (normalization)
    step3 = (step2 - step2.mean()) / step2.std()
    
    # Step 4: Final complex expression
    result = np.sqrt(np.abs(step3)) * np.log1p(np.abs(e)) + a*b/(c+1)
    return result

# NumExpr approach: optimized evaluation with minimal temporaries
def numexpr_calculation():
    """
    NumExpr calculation minimizing memory allocation and maximizing CPU utilization.
    Expressions are evaluated in chunks, reducing memory pressure.
    """
    # Each ne.evaluate call fuses one expression into a single pass;
    # only step1, step2, and step3 are materialized as full-size arrays
    step1 = ne.evaluate("a**3 + 2*b**2 - 3*c + d")
    step2 = ne.evaluate("sin(step1) * cos(b) + exp(-c/10)")
    
    # Calculate mean and std with NumPy (NumExpr offers reductions such as sum and prod, but no mean or std)
    mean_val = np.mean(step2)
    std_val = np.std(step2)
    
    # The scalars mean_val and std_val are read from the local scope and broadcast
    step3 = ne.evaluate("(step2 - mean_val) / std_val")
    result = ne.evaluate("sqrt(abs(step3)) * log1p(abs(e)) + a*b/(c+1)")
    return result

# Performance benchmarking
print("\nRunning NumPy calculation...")
start = time.perf_counter()
result_numpy = numpy_calculation()
numpy_time = time.perf_counter() - start

print("Running NumExpr calculation...")
start = time.perf_counter()
result_numexpr = numexpr_calculation()
numexpr_time = time.perf_counter() - start

# Verify numerical accuracy (small floating-point differences expected)
max_diff = np.max(np.abs(result_numpy - result_numexpr))
print(f"\nMax difference: {max_diff:.2e}")
print(f"Results match within tolerance: {np.allclose(result_numpy, result_numexpr)}")

# Performance comparison and analysis
speedup = numpy_time / numexpr_time
print(f"\nPerformance Results:")
print(f"NumPy time:    {numpy_time:.3f}s")
print(f"NumExpr time:  {numexpr_time:.3f}s")
print(f"Speedup:       {speedup:.1f}x")

# Memory efficiency explanation
print(f"\nMemory efficiency:")
print(f"NumPy allocates a full-size temporary ({a.nbytes/1024**2:.1f} MB) for most intermediate operations")
print(f"NumExpr processes data in chunks, minimizing memory overhead")

## Understanding the Performance Gains

The speedup comes from several factors:

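1. **Multi-threading**: NumExpr spreads the work across multiple CPU cores by default, and the thread count is configurable
2. **Memory efficiency**: Reduces temporary array allocation
3. **Cache optimization**: Chunks small enough to fit in the CPU cache keep intermediate values out of main memory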
4. **Expression optimization**: Combines operations to minimize passes through data

For small arrays (< 1000 elements), NumPy is often faster because each NumExpr call carries a fixed overhead for expression lookup and thread dispatch. NumExpr shines with large datasets where memory bandwidth becomes the bottleneck.
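Two of these factors can be controlled directly from user code. The sketch below uses `ne.set_num_threads`, which returns the previous thread count, and the `out` argument of `ne.evaluate`, which writes into a preallocated array. The array size and variable names are chosen only for illustration.

```python
import numpy as np
import numexpr as ne

x = np.random.randn(1_000_000)
y = np.random.randn(1_000_000)

# Temporarily restrict NumExpr to one thread; the call returns the previous setting
previous = ne.set_num_threads(1)
single_threaded = ne.evaluate("x**2 + y**2")
ne.set_num_threads(previous)  # restore the original thread count

# Reuse a preallocated buffer so that no new result array is allocated
out = np.empty_like(x)
ne.evaluate("x**2 + y**2", out=out)
```

Timing the expression with different thread counts is a simple way to check how much of the speedup on a given machine comes from threading rather than from the fused evaluation.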

In [None]:
# Demonstrate NumExpr configuration and threading
print("NumExpr Configuration:")
print(f"Number of threads: {ne.nthreads}")
print(f"VML available: {ne.use_vml}")
print(f"Detected cores: {ne.ncores}")

# Count the arrays each approach allocates, using a small example
print("\nArray allocation demonstration with smaller arrays:")
small_n = 1000
x = np.random.randn(small_n)
y = np.random.randn(small_n)

# NumPy creates temporary arrays
temp1 = x**2  # Creates temporary
temp2 = y**2  # Creates temporary  
numpy_result = temp1 + temp2  # Creates final result

# NumExpr evaluates the whole expression in one pass and allocates only the result array
numexpr_result = ne.evaluate("x**2 + y**2")

print(f"Results match: {np.allclose(numpy_result, numexpr_result)}")
print(f"NumPy created 3 arrays, NumExpr created 1")

## Practical Usage Guidelines

**When to use NumExpr:**
- Complex mathematical expressions with large arrays (> 10,000 elements)
- Memory-constrained environments
- CPU-bound numerical computations
- Operations that can be expressed as single mathematical expressions

**When to stick with NumPy:**
- Small arrays (< 1,000 elements) 
- Operations requiring array methods (e.g., `.sort()`, `.argmax()`)
- Complex control flow or branching logic (simple element-wise conditionals can still use NumExpr's `where`)
- When readability is more important than performance
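One way to apply these guidelines is to pick the backend by array size. The following is a minimal sketch: the helper `sum_of_squares` and the constant `SIZE_THRESHOLD` are hypothetical names, and the cut-off of 10,000 elements is taken from the rule of thumb above rather than from a measurement.

```python
import numpy as np
import numexpr as ne

# Hypothetical cut-off based on the guideline above; tune it by benchmarking
SIZE_THRESHOLD = 10_000

def sum_of_squares(x, y):
    """Element-wise x**2 + y**2, choosing the backend by array size."""
    if x.size < SIZE_THRESHOLD:
        return x**2 + y**2             # small input: plain NumPy avoids per-call overhead
    return ne.evaluate("x**2 + y**2")  # large input: one fused, multi-threaded pass

small = sum_of_squares(np.arange(5.0), np.ones(5))
large = sum_of_squares(np.ones(50_000), np.full(50_000, 2.0))
```

The right threshold depends on the expression and the machine, so it is worth measuring before hard-coding a value.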

In [None]:
# Example: When NumExpr syntax limitations require workarounds
print("NumExpr Syntax Examples and Limitations:")

# Supported operations
arr = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Basic math operations (supported)
result1 = ne.evaluate("arr**2 + sqrt(arr)")
print(f"Basic operations: {result1}")

# Complex numbers: complex128 arithmetic is supported
complex_arr = np.array([1+2j, 3+4j, 5+6j])
result_complex = ne.evaluate("complex_arr * 2")
print(f"Complex arithmetic: {result_complex}")

# Python scalars are resolved by name from the calling scope
a_val = 3.5
b_val = 2.1
result2 = ne.evaluate("a_val * arr + b_val")
print(f"With variables: {result2}")

# Broadcasting works naturally
matrix = np.random.randn(3, 4)
vector = np.random.randn(4)
result3 = ne.evaluate("matrix + vector")  # Broadcasting happens automatically
print(f"Broadcasting shape: {result3.shape}")

print(f"Available functions: sin, cos, tan, exp, log, sqrt, abs, etc.")
print(f"NOT available: advanced NumPy functions like argmax, sort, unique")
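Beyond basic arithmetic, NumExpr expressions can also contain element-wise conditionals via `where` and a small set of reductions such as `sum`. The short sketch below illustrates both; the array `values` is made up for the example.

```python
import numpy as np
import numexpr as ne

values = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

# Element-wise conditional: keep positive entries, replace the rest with 0.0
clipped = ne.evaluate("where(values > 0, values, 0.0)")

# Reduction over the whole expression; the result is a 0-d array
total = ne.evaluate("sum(values**2)")

print(clipped)
print(float(total))
```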

## Performance Scaling Analysis

Let's examine how the performance benefits scale with array size:

In [None]:
# Performance scaling with different array sizes
import matplotlib.pyplot as plt

sizes = [1000, 10000, 100000, 1000000, 10000000]
numpy_times = []
numexpr_times = []
speedups = []

print("Testing performance across different array sizes...")
print(f"{'Size':<10} {'NumPy(s)':<10} {'NumExpr(s)':<12} {'Speedup':<8}")
print("-" * 45)

for size in sizes:
    # Create test arrays
    test_a = np.random.randn(size)
    test_b = np.random.randn(size) 
    test_c = np.random.randn(size)
    
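    # Time NumPy (best of 3 runs to reduce timing noise)
    numpy_time = float("inf")
    for _ in range(3):
        start = time.perf_counter()
        np_result = test_a**2 + test_b**2 + np.sin(test_c)
        numpy_time = min(numpy_time, time.perf_counter() - start)
    
    # Time NumExpr (best of 3 runs, so the one-time expression compile cost is excluded)
    numexpr_time = float("inf")
    for _ in range(3):
        start = time.perf_counter()
        ne_result = ne.evaluate("test_a**2 + test_b**2 + sin(test_c)")
        numexpr_time = min(numexpr_time, time.perf_counter() - start)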
    
    speedup = numpy_time / numexpr_time if numexpr_time > 0 else 1
    
    numpy_times.append(numpy_time)
    numexpr_times.append(numexpr_time)
    speedups.append(speedup)
    
    print(f"{size:<10} {numpy_time:<10.4f} {numexpr_time:<12.4f} {speedup:<8.1f}x")

print("\nExpected trend: the NumExpr advantage grows with array size")
print(f"Smallest arrays: {speedups[0]:.1f}x relative to NumPy")
print(f"Largest arrays:  {speedups[-1]:.1f}x relative to NumPy")