Adaptive Matrix Multiplication for Edge Computing
cuBLAS is amazing... but it's over-engineered for small problems.
Traditional GPU libraries like cuBLAS are optimized for massive workloads - think training giant neural networks with millions or billions of parameters. But what happens when you need to multiply small matrices? You get:
- Setup overhead that takes longer than the actual computation
- Memory allocation optimized for gigabytes, not kilobytes
- Complex dispatch logic that assumes large-scale parallelism
- Enterprise features you don't need for edge computing
Result: A Ferrari stuck in city traffic. 🏎️🚦
LEVI GPU Library fills the gap between "basic code" and "industrial-strength cuBLAS" with adaptive kernel selection that picks the right tool for each job.
| Matrix Size | cuBLAS | LEVI | Speedup | Use Case |
|---|---|---|---|---|
| 64×64 | 0.105ms | 0.043ms | 2.4x | Edge AI inference |
| 128×128 | 0.105ms | 0.061ms | 1.72x | Mobile computer vision |
| 256×256 | 0.106ms | 0.091ms | 1.2x | Embedded robotics |
| 512×512 | 0.143ms | 0.313ms | 0.5x | ← cuBLAS starts to work |
The Sweet Spot: LEVI dominates in the 64×64 to 256×256 range where edge computing lives.
LEVI uses proprietary Kernel Ritus technology to automatically select optimal algorithms:
def select_optimal_kernel(M, N, K):
    """Intelligent kernel selection based on workload characteristics"""
    total_elements = M * N  # elements in the output matrix
    if total_elements <= 65536:  # small matrices, up to 256×256
        return "simple"  # cache-friendly, minimal overhead
    else:
        return "tiled"  # shared-memory optimization

🏃♂️ Simple Kernel (small matrices)
- Minimal setup overhead
- Cache-optimized access patterns
- Loop unrolling for better IPC
- Perfect for edge devices
🏗️ Tiled Kernel (medium+ matrices)
- Shared memory utilization
- Bank conflict avoidance
- Optimized for throughput
- Competitive with cuBLAS
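The tiling idea behind the second kernel can be illustrated on the CPU with a blocked matrix multiply in NumPy. This is a sketch, not LEVI's implementation: the tile size of 16 is illustrative, and the point is only that each tile of C is built from small panels of A and B, mirroring how a GPU tiled kernel stages sub-blocks in shared memory.

```python
import numpy as np

TILE = 16  # illustrative tile size; LEVI's actual tiling is not documented here

def blocked_matmul(A, B, tile=TILE):
    """Blocked (tiled) matmul: accumulate each C tile from tile-sized
    panels of A and B, the CPU analogue of a shared-memory GPU kernel."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            for k in range(0, K, tile):
                # NumPy slicing clamps at the edges, so ragged sizes work too
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C
```

On a GPU the same loop structure appears with each panel loaded into `__shared__` memory once and reused by a whole thread block, which is where the bandwidth savings come from.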
Traditional Data Centers:
- Batch size: 1024+ samples
- Matrix size: 2048×2048+
- Memory: Abundant
- Power: Unlimited
Edge Computing:
- Batch size: 1-32 samples
- Matrix size: 64-512×64-512
- Memory: Limited
- Power: Battery-constrained
LEVI targets exactly this gap.
- GPU: NVIDIA GeForce RTX 3060
- Memory: 12GB GDDR6
- Compute Capability: 8.6
- Precision: FP32
- Iterations: 50 per test (median timing)
- Numerical accuracy: < 1e-5 error vs cuBLAS
- All tests pass correctness validation
- IEEE 754 compliant floating point
import cupy as cp
from levi_gpu import LEVILibrary

# Initialize LEVI
levi = LEVILibrary()

# Your matrices
A = cp.random.randn(128, 128, dtype=cp.float32)
B = cp.random.randn(128, 128, dtype=cp.float32)

# Automatic optimization
C = levi.gemm(A, B)  # 1.7x faster than cuBLAS at this size

That's it. No configuration, no tuning, no complexity.
- Do one thing well: Matrix multiplication for edge workloads
- Automatic selection: No manual tuning required
- Minimal dependencies: Just CuPy and NumPy
- Production ready: Full validation and error handling
Use LEVI when:
✅ Matrix size < 512×512
✅ Edge/mobile deployment
✅ Power/memory constraints
✅ Batch processing many small problems
Use cuBLAS when:
✅ Matrix size ≥ 512×512
✅ Data center deployment
✅ Maximum absolute throughput needed
✅ Deep learning training
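The decision lists above fold into a one-line dispatch helper. This is a sketch: the 512 threshold is read off the benchmark table, and the backend names are just labels for "call `levi.gemm`" versus "call `cp.matmul`".

```python
def pick_backend(M, N, K, threshold=512):
    """Suggest a GEMM backend for an M×K @ K×N multiply.

    Below the threshold, LEVI's low-overhead kernels win; at or
    above it, cuBLAS's throughput-oriented kernels take over
    (per the benchmark table above).
    """
    return "levi" if max(M, N, K) < threshold else "cublas"
```

A real deployment would measure the crossover once on the target GPU rather than hard-coding 512, since it shifts with hardware.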
- Improve edge GPU utilization by 2-5x
- Extend battery life in mobile devices
- Enable new AI applications previously too slow
- Reduce compute costs for small workloads
- Improve instance efficiency
- New service tiers for edge computing
- Drop-in performance improvement
- No code changes required
- Automatic optimization
// Simple kernel: Optimized for small data
for (int k = 0; k < K; k++) {
sum += A[row * K + k] * B[k * N + col]; // Sequential access
}
// Tiled kernel: Shared memory for larger data
__shared__ float As[TILE_SIZE][TILE_SIZE];
__shared__ float Bs[TILE_SIZE][TILE_SIZE];
    // ... tiled computation

The selection heuristic weighs:
- Matrix footprint analysis
- Cache size consideration
- Thread occupancy optimization
- Memory bandwidth utilization
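A footprint check like the one listed might look as follows. This is an assumption-laden sketch: the 48 KB budget matches a common per-SM shared-memory carve-out on NVIDIA GPUs, but LEVI's actual heuristic and budget are not documented here.

```python
def fits_in_budget(M, N, K, dtype_bytes=4, budget=48 * 1024):
    """Rough footprint test: do A (M×K), B (K×N) and C (M×N)
    together fit in a cache/shared-memory budget?

    48 KB mirrors a typical per-SM shared-memory size (FP32 assumed);
    the real selection logic may use different numbers entirely.
    """
    footprint = dtype_bytes * (M * K + K * N + M * N)
    return footprint <= budget
```

For 64×64 FP32 matrices the three operands total 48 KB, right at the budget; 128×128 already needs 192 KB, which is one reason a small-matrix kernel can keep everything close to the cores while larger problems must tile.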
# Every operation validated against cuBLAS
max_error = cp.max(cp.abs(C_levi - C_cublas))
assert max_error < 1e-5  # numerically identical

- 50 iterations per benchmark
- Median timing for stability
- Warmup cycles to avoid cold starts
- Full GPU synchronization
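The methodology above (warmup cycles, many iterations, median timing) can be sketched framework-agnostically. This CPU version is a sketch: on the GPU each timed call would additionally be bracketed by a device synchronize (e.g. CuPy's `cp.cuda.Stream.null.synchronize()`) so asynchronous kernel launches are fully measured.

```python
import time
import statistics

def bench(fn, *args, warmup=5, iters=50):
    """Median-of-N timing with warmup, as in the methodology above.

    Warmup runs absorb cold-start costs (caches, JIT, clock ramp-up);
    the median of the remaining runs resists outlier spikes better
    than the mean. Returns seconds per call.
    """
    for _ in range(warmup):
        fn(*args)
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn(*args)
        times.append(time.perf_counter() - t0)
    return statistics.median(times)
```

Median rather than mean is the key choice here: a single scheduler hiccup in 50 runs barely moves the reported number.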
1. Direct Replacement:

# Replace this:
C = cp.matmul(A, B)
# With this:
C = levi.gemm(A, B)

2. Conditional Usage:
if A.shape[0] < 400:
    C = levi.gemm(A, B)  # Fast path
else:
    C = cp.matmul(A, B)  # cuBLAS path

3. Framework Integration:
# PyTorch/TensorFlow backends
torch.backends.cuda.levi_enabled = True
- Right-sized optimization for edge computing
- Automatic kernel selection - no manual tuning
- Production-ready code quality
- Honest benchmarking - shows where cuBLAS wins
- Clear value proposition - not trying to solve everything
"The best optimization is the one that knows when not to optimize."
LEVI doesn't try to beat cuBLAS everywhere - it focuses on the specific niche where simpler approaches actually work better.
🏢 Company: Forgotten Forge
📧 Email: nfo@forgottenforge.xyz
🐕 Inspiration: Levi (the goodest optimization dog)
We're actively seeking partnerships for:
- Hardware optimization collaboration
- SDK integration opportunities
- Edge computing initiatives
- Developer ecosystem expansion
- maybe someday a mathematician
This project follows a dual-license model:

- For Personal & Research Use: CC BY-NC 4.0 → free for non-commercial use only.
- For Commercial Use: companies must obtain a commercial license (Elastic License 2.0).
📜 For details, see the LICENSE file.
Built with ❤️ for the edge computing revolution.
