ac4k_kernel is a high-performance kernel library for AI applications, built for NVIDIA Blackwell GPUs. It provides optimized kernels for low-precision AI workloads, including NVFP4 dot scale, 8-bit quantized MHA, and BF16 quantization.
01/10/2026: ac4k_kernel 0.1.0 is now available! This release includes the following features:
- Optimized NVFP4 dot scale kernel for RTX 5090 (Blackwell) GPUs.
- Optimized 8-bit quantized MHA kernel for RTX 5090 (Blackwell) GPUs.
- BF16 to NVFP4, BF16 to FP8E4M3 and BF16 to INT8 quantize kernels for RTX 5090 (Blackwell) GPUs.
- Only CUDA 12.8 is supported.
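To make the BF16 to INT8 quantize kernel concrete, here is a minimal pure-Python reference of symmetric INT8 quantization. It is only a sketch: the function names `quantize_int8` and `dequantize_int8` are hypothetical, and the per-tensor scaling is an assumption, since this README does not state the scaling granularity or the API that ac4k_kernel actually uses.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor INT8 quantization (reference sketch, not the GPU kernel)."""
    max_abs = max((abs(v) for v in values), default=0.0)
    # Map the largest magnitude to 127; use a scale of 1.0 for all-zero input.
    scale = max_abs / 127.0 if max_abs > 0.0 else 1.0
    # Round to the nearest integer and clamp to the symmetric INT8 range.
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate floating-point values from INT8 codes."""
    return [q * scale for q in quantized]
```

Each dequantized value differs from its input by at most about half a scale step, which is the usual trade-off when storing activations or weights in 8 bits.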
Install build dependencies for fastest compilation:
pip install -r requirements.txt
# Recommended: ccache for caching (dramatically speeds up recompilation)
# Ubuntu/Debian
sudo apt install ccache

git clone git@github.com:ac4k/ac4k_kernel.git
cd ac4k_kernel
# Install (auto-detects GPU architecture)
pip install -e . --no-build-isolation
# Specify architecture explicitly (no GPU required at build time)
AC4K_CUDA_ARCH=sm120 pip install -e . --no-build-isolation

When modifying C++/CUDA code during development, use this faster command instead of re-running pip:
python setup.py build_ext --inplace

Note: Python code changes take effect immediately (editable install). Only C++/CUDA changes require rebuilding.
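The install commands above either auto-detect the GPU architecture or take it from `AC4K_CUDA_ARCH`. The sketch below shows one plausible way such a choice could be resolved. It is not taken from the project's `setup.py`: the function `resolve_cuda_arch` and its `detected_capability` argument are hypothetical, and only the `AC4K_CUDA_ARCH` variable and the `sm120` value come from this README.

```python
from typing import Mapping, Optional, Tuple


def resolve_cuda_arch(
    env: Mapping[str, str],
    detected_capability: Optional[Tuple[int, int]] = None,
) -> str:
    """Pick the build architecture: explicit override first, then detection."""
    # An explicit AC4K_CUDA_ARCH wins, so builds work on machines without a GPU.
    override = env.get("AC4K_CUDA_ARCH", "").strip()
    if override:
        return override
    # Otherwise fall back to the (major, minor) capability of a detected GPU.
    if detected_capability is not None:
        major, minor = detected_capability
        return f"sm{major}{minor}"
    raise RuntimeError(
        "No GPU detected; set AC4K_CUDA_ARCH (for example sm120) to build without one."
    )
```

For example, `resolve_cuda_arch({}, (12, 0))` yields `sm120`, the same value the explicit override passes in the command above.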
The build system automatically applies these optimizations when available:
| Optimization | Effect | How to enable |
|---|---|---|
| Ninja | Parallel file-level compilation | pip install ninja |
| ccache | Caches object files across rebuilds | apt install ccache |
| MAX_JOBS | Controls parallel compilation jobs | MAX_JOBS=N pip install ... (default: half CPU cores) |
| nvcc --threads | Intra-file parallelism for CUDA compilation | Automatic |
| Single-arch build | Only compiles target GPU architecture | Automatic via -arch=sm_XXXa |
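The `MAX_JOBS` row describes a default of half the CPU cores. A minimal sketch of that rule follows, assuming the value is read from the environment and floored at one job. The helper name `resolve_max_jobs` is hypothetical and the project's build script may compute the default differently.

```python
import os
from typing import Mapping


def resolve_max_jobs(env: Mapping[str, str]) -> int:
    """Return the number of parallel compile jobs to use."""
    raw = env.get("MAX_JOBS", "").strip()
    if raw:
        # An explicit MAX_JOBS=N takes precedence; never go below one job.
        return max(1, int(raw))
    # Default: half of the available CPU cores, with a floor of one.
    cpu_count = os.cpu_count() or 1
    return max(1, cpu_count // 2)
```

Lowering `MAX_JOBS` is mainly useful on machines where parallel nvcc processes run out of memory.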
To ensure code quality, style consistency, and commit integrity, we use pre-commit. Install the hooks before contributing:
# Install clang-format
pip install clang-format
# Navigate to the project root directory
cd ac4k_kernel
# Install pre-commit hooks (runs on every commit)
pre-commit install
# (Optional) Install pre-push hooks (runs additional checks before pushing to remote)
pre-commit install --hook-type pre-push
Explanation: Hooks automatically run code formatting (black, isort), linting (flake8, pylint), and syntax checks. Commits that fail validation are blocked; fix the issues and commit again.
- Fork the repository
- Create a feature branch (git checkout -b feature/your-feature-name)
- Make your changes (follow the code style guidelines)
- Run tests locally (see Testing)
- Commit your changes (pre-commit hooks will run automatically)
- Push to your forked repository
- Create a Pull Request (PR) to the dev branch of the original repository
For a consistent and reproducible development environment, use the following Docker image (includes CUDA 12.8 + cuDNN + Ubuntu 22.04):
# Pull the recommended image
docker pull nvidia/cuda:12.8.0-cudnn-devel-ubuntu22.04
# Run the container (mount project directory for local development)
docker run -it --gpus all -v "$(pwd)":/ac4k_kernel -w /ac4k_kernel nvidia/cuda:12.8.0-cudnn-devel-ubuntu22.04
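Because only CUDA 12.8 is supported, it can help to confirm the toolchain version inside the container before building. The following is a small sketch that parses the output of `nvcc --version`. It assumes Python 3 is installed in the container, and the helper names `parse_nvcc_release` and `installed_cuda_release` are hypothetical, not part of ac4k_kernel.

```python
import re
import shutil
import subprocess
from typing import Optional, Tuple


def parse_nvcc_release(version_output: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from text such as 'release 12.8, V12.8.61'."""
    match = re.search(r"release (\d+)\.(\d+)", version_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def installed_cuda_release() -> Optional[Tuple[int, int]]:
    """Return the CUDA release reported by nvcc, or None if nvcc is missing."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    result = subprocess.run(
        [nvcc, "--version"], capture_output=True, text=True, check=False
    )
    return parse_nvcc_release(result.stdout)


if __name__ == "__main__":
    release = installed_cuda_release()
    if release == (12, 8):
        print("CUDA 12.8 toolchain found")
    else:
        print(f"Expected CUDA 12.8, found: {release}")
```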
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.