Optimized CUDA kernels for H100 GPUs targeting the HuggingFace diffusers library, with a Claude Code skill for guided kernel development.
- CUDA Kernels: Optimized implementations for RMSNorm, RoPE (1D/3D), GEGLU, SwiGLU, and AdaLN
- Python API: Drop-in replacements for diffusers operations via `ltx_kernels` (see the sketch below)
- Claude Code Skill: Expert guidance for writing custom H100 kernels
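To illustrate the drop-in idea, here is a minimal sketch with an eager fallback; the `ltx_kernels.rms_norm` import and signature are assumptions for illustration, not the package's documented API:

```python
import torch

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Prefer the fused H100 kernel; fall back to eager PyTorch if unavailable."""
    try:
        import ltx_kernels  # assumed import name, for illustration only
        return ltx_kernels.rms_norm(x, weight, eps)
    except (ImportError, AttributeError):
        # Eager RMSNorm reference: x / sqrt(mean(x^2) + eps) * weight
        variance = x.float().pow(2).mean(-1, keepdim=True)
        return (x.float() * torch.rsqrt(variance + eps)).to(x.dtype) * weight
```

With this pattern, code that calls the wrapper keeps working on machines where the compiled kernels are not installed.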
```bash
# Install the package
pip install -e .

# With diffusers support
pip install -e ".[diffusers]"
```

```bash
# Using Docker (recommended)
docker run --rm --mount type=bind,source=$(pwd),target=/kernelcode \
  -w /kernelcode ghcr.io/huggingface/kernel-builder:main build

# Or with Nix
nix run .#build-and-copy --max-jobs 2 --cores 8 -L
```

This repository includes a Claude Code skill that provides expert guidance for developing optimized CUDA kernels targeting H100 GPUs.
The skill activates automatically when you ask Claude Code about:
- Writing CUDA kernels for diffusion models
- Optimizing attention, normalization, or activation layers
- Integrating custom kernels with diffusers pipelines
- H100-specific optimizations
- Writing a new kernel: "Write a fused RMSNorm + residual kernel optimized for H100"
- Optimizing existing code: "Help me optimize this attention kernel for H100's 228 KB shared memory"
- Integration questions: "How do I add a custom AdaLN kernel to the LTX-Video transformer?" (sketched below)
- Architecture guidance: "What block sizes should I use for flash attention on H100?"
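To make the integration question concrete, here is a minimal sketch of the kind of module one might swap into a diffusers transformer block; the class and the fused-kernel comment are illustrative assumptions, and the actual LTX-Video layer layout may differ:

```python
import torch
import torch.nn as nn

class FusedAdaLN(nn.Module):
    """Adaptive LayerNorm: normalize x, then shift/scale it with a conditioning vector."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.proj = nn.Linear(dim, 2 * dim)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); cond: (batch, dim)
        shift, scale = self.proj(cond).chunk(2, dim=-1)
        # Eager reference; a custom H100 kernel would fuse the norm and modulation.
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
```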
The skill provides:
- H100 Architecture Reference: SM count, shared memory, memory bandwidth, warp size
- Kernel Templates: Complete CUDA implementations for common operations
- Block Size Guidelines: Optimal configurations for different kernel types
- PyTorch Integration Patterns: C++ bindings and Python API examples (see the sketch after this list)
- Performance Profiling: Commands for nsys and ncu analysis
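For the integration-patterns entry, one common approach is JIT-compiling a kernel with PyTorch's `torch.utils.cpp_extension.load`. The sketch below assumes hypothetical source files and an exported `rms_norm` binding, neither of which is confirmed by this repository:

```python
from torch.utils.cpp_extension import load

# JIT-compile and load a CUDA extension. The source paths and the exported
# `rms_norm` symbol are illustrative, not this repository's actual layout.
ext = load(
    name="ltx_rms_norm",
    sources=["csrc/rms_norm.cu", "csrc/bindings.cpp"],
    extra_cuda_cflags=[
        "-O3",
        "--use_fast_math",
        "-gencode=arch=compute_90,code=sm_90",  # target H100 (compute capability 9.0)
    ],
)
# y = ext.rms_norm(x, weight, 1e-6)
```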
The skill documentation lives in `.claude/skills/h100-diffusers-kernels/`.