CUFINUFFT binsize and performance #809
DiamonDinoia
started this conversation in
General
Replies: 1 comment
-
|
Great! Nice investigation and solution. This should have a great impact for GPU users. |
Beta Was this translation helpful? Give feedback.
0 replies
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
-
CUFINUFFT's performance is highly sensitive to
bin sizeandnp(number of output points). Benchmark analysis across 8 GPUs revealed that poor configurations can be 3-10× slower than optimal settings.Key Results:
Limitation:
How to make CUFINUFFT faster
Run binsize sweep on your GPU and attach the results to this discussion.
How to do so:
To the advanced user, you can try different binsizes using formulas in binsize_sweep and find the best configuration for your GPU.
The Performance Impact of Bin Size and NP
Measured Performance Deltas
Benchmark sweeps on H100-80GB GPU demonstrate massive performance differences based on configuration choices:
Using the wrong configuration can result in order-of-magnitude performance loss. The optimal point is not "use maximum shared memory" or "use minimum shared memory" but it requires a more involved heuristic.
What Are Bin Size and NP?
Bin Size
Bin size defines the dimensions of the grid tile stored in shared memory during spreading/interpolation:
bin=511means 511 grid points)bin=80×80means 80×80 grid points)bin=10×10×10means 10×10×10 grid points)The actual shared memory used includes padding for the kernel width:
NP (Number of Output Points)
np defines how many output points (NU points) are processed together in each thread block:
np= more parallelism but more shared memory neededThe Shared Memory Budget
The GPU has limited shared memory per thread block (typically 100-228 KB depending on architecture):
The fundamental trade-off: Larger bins improve grid access patterns but leave less room for np. Larger np improves parallelism but reduces bin size.
Factors That Dictate Performance
1. Dimension (1D/2D/3D)
1D Transforms:
O(bin)2D Transforms:
O(bin²)3D Transforms:
O(bin³)Pattern: Higher dimensions need more shared memory for np because the computation per point increases dramatically.
2. Tolerance (Kernel Width ns)
Tolerance affects the kernel width
ns:tol=1e-3→ns=4(loose tolerance)tol=1e-6→ns=7tol=1e-9→ns=10tol=1e-12→ns=13(tight tolerance)Impact:
2×⌈ns/2⌉grows with nsO(dim*ns)ad a function of dimExample (1D, H100):
ns=4: Optimal np=288, throughput=4.91 Gpts/sns=7: Optimal np=192, throughput=4.57 Gpts/sns=10: Optimal np=160, throughput=4.31 Gpts/sns=13: Optimal np=128, throughput=3.98 Gpts/sNotice: bin size decreases with tighter tolerance (less room due to padding), but optimal np also changes.
3. GPU Architecture
Different GPU architectures have different optimal configurations due to varying:
Ampere (A100):
Hopper (H100/H200):
Ada/Blackwell Workstation (RTX 6000 Ada, RTX Blackwell):
Ada Mobile (RTX 4070 Mobile):
4. Why Shared Memory Percentage Matters
Too little shared memory used (e.g., 10%):
Too much shared memory used (e.g., 99%):
Optimal range:
The optimal percentage is not fixed — it depends on all the factors above.
The Old Heuristic
Method 2: Fixed 100% Shared Memory
The original Method 2 implementation was simple and dimension-agnostic.
Measured Impact:
Method 3: Fill Remaining After Bins
The original Method 3 first allocated bins, then filled all remaining shared memory with np.
Measured Impact:
Why This Failed: The assumption was "use all available memory = best performance." In reality:
The New Heuristic (GPU-Aware Implementation)
Method 2: Dimension and GPU-Aware Load Factors
The new Method 2 uses different shared memory targets based on dimension and GPU type, for example:
Key Changes:
Performance Improvement:
Method 3: GPU-Specific Lookup Tables
The new Method 3 uses benchmark-validated lookup tables per GPU category:
Performance Improvement:
Key Insights from the Analysis
1. "Less is More" for Shared Memory
Contrary to intuition, using less shared memory often performs better:
Example: H100-80GB, 1D, tol=1e-3
2. Bins vs NP Trade-off Is Critical
The split between grid memory (bins) and output point memory (np) dramatically affects performance:
1D Optimal Split: 85-92% for bins, 8-15% for np
2D Optimal Split: 60-85% for bins, 15-40% for np
3D Optimal Split: 50-70% for bins, 30-50% for np
3. GPU Architecture Matters Significantly
Datacenter GPUs (A100, H100):
Consumer GPUs (RTX 6000 Ada, RTX Blackwell):
Example difference (2D, tol=1e-3):
4. Tolerance Scaling Exists But Is Not Linear
Optimal configurations change with tolerance, but not in a simple linear way:
Pattern observed:
This is why simple formulas don't work — need empirical lookup tables.
5. One Size Does NOT Fit All
The old "maximize shared memory" heuristic failed because:
Performance Validation
Benchmark Dataset
Validation Results
Method 3 improvements on representative H100-80GB configurations:
Method 2 improvements:
Real-World Impact
For a typical mixed workload (50% 1D, 30% 2D, 20% 3D):
For the most extreme cases (1D/2D with old heuristic hitting worst configurations):
For already-reasonable cases (some 3D configurations with old heuristic):
No performance regressions observed across the 4,132 test configurations.
Beta Was this translation helpful? Give feedback.
All reactions