CUFINUFFT binsize and performance #809

DiamonDinoia · 2026-02-06T17:29:33Z

DiamonDinoia
Feb 6, 2026
Maintainer

CUFINUFFT's performance is highly sensitive to bin size and np (number of output points). Benchmark analysis across 8 GPUs revealed that poor configurations can be 3-10× slower than optimal settings.

Key Results:

Discovered 3-10× performance variance with different binsize/np choices
Implemented GPU-aware heuristics based on 4,000+ benchmark runs
Achieved 1.5-2.5× speedup compared to old heuristics

Limitation:

All tests so far were run in double precision. Coverage will be expanded in the future.

How to make CUFINUFFT faster

Run the binsize sweep on your GPU and attach the results to this discussion.

How to do so:

git clone https://github.com/flatironinstitute/finufft.git
cd finufft
cmake -S . -B build -DFINUFFT_BUILD_TESTS=ON -DFINUFFT_USE_CUDA=ON 
cmake --build build --parallel
ctest --test-dir build
./build/perftest/cuda/binsize_sweep | tee results.txt

Advanced users can try different bin sizes using the formulas in binsize_sweep to find the best configuration for their GPU.

The Performance Impact of Bin Size and NP

Measured Performance Deltas

Benchmark sweeps on an H100-80GB GPU show large performance differences depending on the configuration chosen:

Case	Worst Config	Best Config	Performance Ratio
Method 3, 1D	np=16, shmem=99% → 0.55 Gpts/s	np=288, shmem=9% → 4.91 Gpts/s	9.0× faster
Method 3, 2D	np=2624, shmem=99% → 0.27 Gpts/s	np=176, shmem=9% → 2.66 Gpts/s	9.9× faster
Method 3, 3D	np=1840, shmem=99% → 46 Mpts/s	np=240, shmem=31% → 480 Mpts/s	10.4× faster
Method 2, 1D	shmem=99% → 2.12 Gpts/s	shmem=7% → 7.80 Gpts/s	3.7× faster
Method 2, 2D	shmem=97% → 1.07 Gpts/s	shmem=13% → 2.84 Gpts/s	2.7× faster
Method 2, 3D	shmem=6% → 0.12 Gpts/s	shmem=47% → 0.37 Gpts/s	3.2× faster

Using the wrong configuration can cost up to an order of magnitude in performance. The optimum is neither "use maximum shared memory" nor "use minimum shared memory"; finding it requires a more involved heuristic.

What Are Bin Size and NP?

Bin Size

Bin size defines the dimensions of the grid tile stored in shared memory during spreading/interpolation:

1D: Single value (e.g., bin=511 means 511 grid points)
2D: Two values (e.g., bin=80×80 means 80×80 grid points)
3D: Three values (e.g., bin=10×10×10 means 10×10×10 grid points)

The actual shared memory used includes padding for the kernel width:

Grid memory = (bin_size + 2×⌈ns/2⌉)^dim × sizeof(complex<T>)

NP (Number of Output Points)

np defines how many output points (NU points) are processed together in each thread block:

Each output point requires kernel evaluations (spreading/interpolating)
Larger np = more parallelism but more shared memory needed
The shared memory for np:

NP memory = np × (ns×sizeof(T)×dim + sizeof(int)×dim + sizeof(cuda_complex<T>))

The Shared Memory Budget

The GPU has limited shared memory per thread block (typically 100-228 KB depending on architecture):

Total = Grid Memory + NP Memory

Where:
  Grid Memory = (bin + 2×⌈ns/2⌉)^dim × sizeof(complex<T>)
  NP Memory = np × per_point_overhead

The fundamental trade-off: Larger bins improve grid access patterns but leave less room for np. Larger np improves parallelism but reduces bin size.
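
To make the trade-off concrete, here is a minimal C++ sketch that transcribes the two formulas above for double precision and computes how many points still fit once the grid tile is allocated. The function names (`grid_bytes`, `point_bytes`, `max_np`) are illustrative assumptions, not CUFINUFFT code, and bins are assumed to be equal-sided.

```cpp
#include <cstddef>

// Hypothetical helpers: a direct transcription of the formulas above,
// assuming T = double, so complex<T> occupies 16 bytes.

// Padded grid tile: (bin + 2*ceil(ns/2))^dim * sizeof(complex<double>).
constexpr std::size_t grid_bytes(std::size_t bin, std::size_t ns, int dim) {
    const std::size_t padded = bin + 2 * ((ns + 1) / 2);
    std::size_t cells = 1;
    for (int d = 0; d < dim; ++d) cells *= padded;
    return cells * 16;
}

// Per-point cost: ns kernel values and one int per dimension, plus one complex value.
constexpr std::size_t point_bytes(std::size_t ns, int dim) {
    return ns * sizeof(double) * dim + sizeof(int) * dim + 16;
}

// Largest np that fits after the grid tile is allocated (0 if the tile alone overflows).
constexpr std::size_t max_np(std::size_t budget, std::size_t bin, std::size_t ns, int dim) {
    const std::size_t grid = grid_bytes(bin, ns, dim);
    return grid >= budget ? 0 : (budget - grid) / point_bytes(ns, dim);
}
```

Under these assumptions, with a 100 KB (102400-byte) budget in 2D at ns=4, `max_np` returns 811 for a 40×40 bin and 168 for a 70×70 bin, while an 80×80 bin does not fit at all.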

Factors That Dictate Performance

1. Dimension (1D/2D/3D)

1D Transforms:

Grid memory scales linearly: O(bin)
Can afford very large bins (500-700+)
Optimal split of the shared memory used: 85-92% for bins, 8-15% for np
Example: bin=711, np=288 achieves 4.9 Gpts/s

2D Transforms:

Grid memory scales quadratically: O(bin²)
Moderate bin sizes (40-80)
Optimal split of the shared memory used: 60-85% for bins, 15-40% for np
Example: bin=80×80, np=176 achieves 2.7 Gpts/s

3D Transforms:

Grid memory scales cubically: O(bin³)
Small bins required (6-12)
Optimal split of the shared memory used: 50-70% for bins, 30-50% for np
Example: bin=10×10×10, np=240 achieves 480 Mpts/s

Pattern: Higher dimensions need more shared memory for np because the computation per point increases dramatically.

2. Tolerance (Kernel Width ns)

Tolerance affects the kernel width ns:

tol=1e-3 → ns=4 (loose tolerance)
tol=1e-6 → ns=7
tol=1e-9 → ns=10
tol=1e-12 → ns=13 (tight tolerance)

Impact:

Grid padding increases: 2×⌈ns/2⌉ grows with ns
Per-point work increases: Kernel evaluations scale as O(dim×ns)
Optimal np changes: Tighter tolerances do more work per point, which shifts the best np; the direction depends on dimension and GPU
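
As a rough illustration of the padding effect, the sketch below evaluates the grid-memory formula for a fixed 10×10×10 bin at each of the four kernel widths listed above. It assumes double precision (16-byte complex values); the helper names are hypothetical and are not taken from the CUFINUFFT sources.

```cpp
#include <cstddef>

// Grid padding from the grid-memory formula: 2 * ceil(ns / 2).
constexpr std::size_t padding(std::size_t ns) { return 2 * ((ns + 1) / 2); }

// Bytes of an equal-sided 3D tile of complex<double> (16 bytes each) after padding.
constexpr std::size_t tile_bytes_3d(std::size_t bin, std::size_t ns) {
    const std::size_t side = bin + padding(ns);
    return side * side * side * 16;
}

// Kernel evaluations per point, following the O(dim*ns) scaling: ns per dimension.
constexpr std::size_t kernel_evals(std::size_t ns, int dim) { return ns * dim; }

// The same 10x10x10 bin costs about 43 KB at ns=4 and 216 KB at ns=13.
static_assert(padding(4) == 4 && padding(7) == 8 && padding(13) == 14, "padding grows with ns");
static_assert(tile_bytes_3d(10, 4) == 43904, "ns=4");
static_assert(tile_bytes_3d(10, 13) == 221184, "ns=13");
```

In this sketch the tile grows roughly fivefold between ns=4 and ns=13 while the bin itself is unchanged, which is why the usable bin size shrinks as the tolerance tightens.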

Example (1D, H100):

ns=4: Optimal np=288, throughput=4.91 Gpts/s
ns=7: Optimal np=192, throughput=4.57 Gpts/s
ns=10: Optimal np=160, throughput=4.31 Gpts/s
ns=13: Optimal np=128, throughput=3.98 Gpts/s

Note: the usable bin size decreases with tighter tolerance (padding leaves less room), and in this 1D example the optimal np decreases as well, from 288 at ns=4 to 128 at ns=13.

3. GPU Architecture

Different GPU architectures have different optimal configurations due to varying:

Shared memory capacity (100 KB for consumer, 164-228 KB for datacenter)
L2 cache size (affects whether more shmem helps or hurts)
SM count and occupancy (affects parallelism needs)
Memory bandwidth (affects grid access patterns)

Ampere (A100):

164 KB shared memory, 40-80 MB L2 cache
Strategy: Use modest shared memory (9-22%), leverage L2 for grid access
Example 1D: bin=511, np=208, shmem=15%

Hopper (H100/H200):

228 KB shared memory, 50-60 MB L2 cache
Strategy: Can afford larger np (1.5-2× more than Ampere)
Example 1D: bin=711, np=288, shmem=15%

Ada/Blackwell Workstation (RTX 6000 Ada, RTX Blackwell):

100 KB shared memory, 48-96 MB L2 cache
Strategy: Conservative configs, can't push either bin or np too far
Example 1D: bin=255, np=128, shmem=20%

Ada Mobile (RTX 4070 Mobile):

100 KB shared memory, limited 16-24 MB L2, low SM count
Strategy: Dynamic computation to maximize limited resources
Example: Fill remaining shmem after bins, ensure np ≥ 16

4. Why Shared Memory Percentage Matters

Too little shared memory used (e.g., 6% for Method 2 in 3D):

Small bins → poor cache locality for grid access
Small np → poor parallelism
Result: Underutilization of GPU

Too much shared memory used (e.g., 99%):

Reduces occupancy (fewer thread blocks per SM)
Can force pathological configurations (tiny bins or tiny np)
Result: Poor performance despite high shmem usage
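
The occupancy effect can be estimated with a simple bound: the number of thread blocks resident on one SM cannot exceed the SM's shared memory divided by the shared memory each block requests. The sketch below is a simplified model with hypothetical helper names. It ignores registers, thread limits, and any shared memory the runtime reserves per block, and it assumes the per-SM capacity equals the 228 KB figure quoted above.

```cpp
#include <cstddef>

// Bytes requested by a block that uses `percent` of the SM's shared memory.
constexpr std::size_t bytes_at(std::size_t sm_bytes, unsigned percent) {
    return sm_bytes * percent / 100;
}

// Upper bound on resident blocks per SM from shared memory alone.
// Returns 0 when block_bytes is 0, meaning this bound does not apply.
constexpr std::size_t blocks_per_sm(std::size_t sm_bytes, std::size_t block_bytes) {
    return block_bytes == 0 ? 0 : sm_bytes / block_bytes;
}

constexpr std::size_t kSm = 228 * 1024;  // assumed per-SM capacity in bytes
static_assert(blocks_per_sm(kSm, bytes_at(kSm, 99)) == 1, "99%: one block per SM");
static_assert(blocks_per_sm(kSm, bytes_at(kSm, 9)) == 11, "9%: up to 11 blocks per SM");
```

By this bound alone, a block that takes 99% of shared memory runs alone on its SM, while a block that takes 9% can share the SM with up to ten others.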

Optimal range:

Method 1 (Global Memory): 50-75% (uses global mem for spreading)
Method 2 (Shared Memory Spreading): 10-90% depending on dimension/tolerance
Method 3 (Shared Memory Everything): 8-50% depending on dimension/tolerance

The optimal percentage is not fixed — it depends on all the factors above.

The Old Heuristic

Method 2: Fixed 100% Shared Memory

The original Method 2 implementation was simple and dimension-agnostic: it always targeted 100% of the available shared memory.

Measured Impact:

1D: Used 99% shmem, achieved 2.12 Gpts/s (optimal: 7% shmem, 7.80 Gpts/s) → 3.7× slower
2D: Used 97% shmem, achieved 1.07 Gpts/s (optimal: 13% shmem, 2.84 Gpts/s) → 2.7× slower

Method 3: Fill Remaining After Bins

The original Method 3 first allocated bins, then filled all remaining shared memory with np.

Measured Impact:

1D: Used 99% shmem with np≈4000, achieved 0.55 Gpts/s (optimal: 9% shmem with np=288, 4.91 Gpts/s) → 9× slower
2D: Used 99% shmem with np≈2600, achieved 0.27 Gpts/s (optimal: 9% shmem with np=176, 2.66 Gpts/s) → 10× slower
3D: Used 99% shmem with np≈1800, achieved 46 Mpts/s (optimal: 31% shmem with np=240, 480 Mpts/s) → 10× slower

Why This Failed: The assumption was "use all available memory = best performance." In reality:

Larger np doesn't always help — there's an optimal point
Using too much shmem reduces occupancy
The balance between bins and np is critical

The New Heuristic (GPU-Aware Implementation)

Method 2: Dimension and GPU-Aware Load Factors

The new Method 2 uses different shared memory targets based on dimension and GPU type, for example:

// Dimension-specific approach
if (dim == 1) {
    load_factor = 0.15;  // 1D: Small bins improve cache behavior
}
else if (dim == 2) {
    load_factor = 0.15;  // 2D: Moderate shared memory
}
else {  // dim == 3
    if (ns <= 6) {
        load_factor = 0.50;  // 3D loose tolerance
    }
    else if (ns <= 10 && !is_small_smem()) {
        load_factor = 0.90;  // 3D medium tolerance (datacenter GPUs)
    }
    else {
        load_factor = 1.0;   // 3D tight tolerance needs full shmem
    }
}

Key Changes:

Dimension-aware: 1D/2D use much less shmem (15% vs old 100%)
Tolerance-aware: 3D scales from 50% → 90% → 100% based on ns
GPU-aware: Small shmem GPUs handled separately

Performance Improvement:

1D: 15% shmem → 3.7× speedup
2D: 15% shmem → 2.7× speedup
3D: Adaptive 50-100% → avoids pathological cases
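
The snippet above only selects a load factor; how that factor becomes a bin size is not shown in this post. One plausible reading, sketched here purely as an assumption, is to treat `load_factor × budget` as the byte target for the padded grid tile and pick the largest equal-sided bin that fits. `tile_bytes` and `largest_bin` are hypothetical names, double precision is assumed, and real bins need not be equal-sided.

```cpp
#include <cstddef>

// Padded tile size in bytes for complex<double> (16 bytes each), equal-sided bins.
constexpr std::size_t tile_bytes(std::size_t bin, std::size_t ns, int dim) {
    const std::size_t side = bin + 2 * ((ns + 1) / 2);
    std::size_t cells = 1;
    for (int d = 0; d < dim; ++d) cells *= side;
    return cells * 16;
}

// Largest bin whose padded tile fits in load_factor * budget (0 if none fits).
// Assumption: the load factor applies to the grid tile only.
constexpr std::size_t largest_bin(std::size_t budget, double load_factor,
                                  std::size_t ns, int dim) {
    const std::size_t target =
        static_cast<std::size_t>(load_factor * static_cast<double>(budget));
    std::size_t bin = 0;
    while (tile_bytes(bin + 1, ns, dim) <= target) ++bin;  // integer search, no rounding issues
    return bin;
}
```

With a 228 KB (233472-byte) budget, this sketch yields 3D bins of 15, 13, and 10 points per side for the three branches (0.50 at ns=4, 0.90 at ns=10, 1.0 at ns=13).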

Method 3: GPU-Specific Lookup Tables

The new Method 3 uses benchmark-validated lookup tables per GPU category:

Per-GPU tuning: 5 GPU categories with different optimal configs
Empirically derived: Tables based on actual benchmark measurements
Dimension×tolerance specific: Each combination has optimal (bin, np) pair
More conservative: Achieve 90-95% of absolute optimal, sacrificing peak for portability

Performance Improvement:

1D: 1.5-2.2× speedup across all tolerances
2D: 2.0-2.5× speedup across all tolerances
3D: 1.8-2.3× speedup across all tolerances
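
For readers unfamiliar with the approach, a lookup table of this kind could be organized roughly as follows. This is a sketch, not the shipped tables: the `GpuClass` enum, the `Entry` layout, and the `lookup` function are hypothetical, and the four rows are simply the example configurations quoted in this post for tol=1e-3 (ns=4), not the more conservative values the library may use.

```cpp
#include <cstddef>

// Hypothetical GPU categories; the post mentions five, only two are shown here.
enum class GpuClass { Hopper, AdaWorkstation };

// One tuned configuration: bin is the per-side bin size, np the points per block.
struct Entry {
    GpuClass gpu;
    int dim;
    int ns;
    int bin;
    int np;
};

// Example rows taken from the configurations quoted in this post (ns=4).
static const Entry kTable[] = {
    {GpuClass::Hopper, 1, 4, 711, 288},
    {GpuClass::Hopper, 2, 4, 80, 176},
    {GpuClass::Hopper, 3, 4, 10, 240},
    {GpuClass::AdaWorkstation, 2, 4, 40, 80},
};

// Returns the matching row, or nullptr so the caller can fall back to a formula.
inline const Entry* lookup(GpuClass gpu, int dim, int ns) {
    for (const Entry& e : kTable) {
        if (e.gpu == gpu && e.dim == dim && e.ns == ns) return &e;
    }
    return nullptr;
}
```

Returning a null pointer for missing combinations keeps the table sparse: only benchmarked (GPU, dimension, ns) triples need a row, and everything else can use a computed default.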

Key Insights from the Analysis

1. "Less is More" for Shared Memory

Contrary to intuition, using less shared memory often performs better:

Allows more thread blocks per SM (higher occupancy)
Avoids pathological edge cases
Better balance between bins and np

Example: H100-80GB, 1D, tol=1e-3

Old: 99% shmem → 0.55 Gpts/s
New: 9% shmem → 4.91 Gpts/s (9× faster)

2. Bins vs NP Trade-off Is Critical

The split between grid memory (bins) and output point memory (np) dramatically affects performance:

1D Optimal Split: 85-92% for bins, 8-15% for np

Reason: Grid access dominates, want large bins for cache locality

2D Optimal Split: 60-85% for bins, 15-40% for np

Reason: More kernel evaluations per point, need more parallelism

3D Optimal Split: 50-70% for bins, 30-50% for np

Reason: Per-point work is largest here (3×ns kernel evaluations feeding ns³ grid updates per point), so parallelism is critical

3. GPU Architecture Matters Significantly

Datacenter GPUs (A100, H100):

Large shared memory (164-228 KB)
L2 cache (40-60 MB)
Optimal strategy: Moderate shmem usage, leverage L2 cache
Can afford larger np for better parallelism

Workstation GPUs (RTX 6000 Ada, RTX Blackwell):

Limited shared memory (100 KB)
L2 cache (48-96 MB)
Optimal strategy: Conservative configs, careful balance
Must be more conservative with both bins and np

Example difference (2D, tol=1e-3):

H100: bin=80×80, np=176, 2.7 Gpts/s
RTX 6000 Ada: bin=40×40, np=80, 1.8 Gpts/s

4. Tolerance Scaling Exists But Is Not Linear

Optimal configurations change with tolerance, but not in a simple linear way:

Pattern observed:

Tighter tolerance (larger ns) → smaller maximum bin (more padding)
Tighter tolerance → more computation per point → benefits from more np
But: Relationship varies by dimension and GPU

This is why simple formulas do not work and empirical lookup tables are needed.

5. One Size Does NOT Fit All

The old "maximize shared memory" heuristic failed because:

Different dimensions have different memory scaling (linear vs quadratic vs cubic)
Different tolerances have different computation vs memory trade-offs
Different GPUs have different cache hierarchies and memory limits
The optimal point is a local optimum in the parameter space

Performance Validation

Benchmark Dataset

4,132 valid benchmark runs across 8 GPUs
GPUs tested:
- Datacenter: A100-40GB, A100-80GB, H100-80GB, H100-94GB, H200
- Workstation: RTX 6000 Ada, RTX Blackwell, RTX 4070 Mobile
Parameters swept:
- Methods: 2 and 3
- Dimensions: 1D, 2D, 3D
- Tolerances: 1e-3, 1e-6, 1e-9, 1e-12 (ns=4,7,10,13)
- Shared memory usage: 10-100% in 10% increments
- NP values: 16 to maximum in steps of ~160-288

Validation Results

Method 3 improvements on representative H100-80GB configurations:

Dimension	Tolerance	Old Throughput	New Throughput	Speedup
1D	1e-3	0.55 Gpts/s	4.91 Gpts/s	9.0×
1D	1e-6	0.48 Gpts/s	4.57 Gpts/s	9.5×
2D	1e-3	0.27 Gpts/s	2.66 Gpts/s	9.9×
2D	1e-6	0.23 Gpts/s	2.15 Gpts/s	9.3×
3D	1e-3	46 Mpts/s	480 Mpts/s	10.4×
3D	1e-6	42 Mpts/s	410 Mpts/s	9.8×

Method 2 improvements:

Dimension	Old Throughput	New Throughput	Speedup
1D	2.12 Gpts/s	7.80 Gpts/s	3.7×
2D	1.07 Gpts/s	2.84 Gpts/s	2.7×
3D	Varies	Varies	1.5-3.2×

Real-World Impact

For a typical mixed workload (50% 1D, 30% 2D, 20% 3D):

Old heuristic: ~0.8 Gpts/s average
New heuristic: ~3.5 Gpts/s average
Overall speedup: 4.4×

For the most extreme cases (1D/2D with old heuristic hitting worst configurations):

Speedup: Up to 10×

For already-reasonable cases (some 3D configurations with old heuristic):

Speedup: 1.5-2.0×

No performance regressions observed across the 4,132 test configurations.

ahbarnett · 2026-02-10T20:47:36Z

ahbarnett
Feb 10, 2026
Maintainer

Great! Nice investigation and solution. This should have a great impact for GPU users.
