ac4k_kernel

ac4k_kernel is a high-performance kernel library for AI workloads on NVIDIA Blackwell GPUs. It provides optimized NVFP4 dot scale, 8-bit quantized multi-head attention (MHA), and BF16 quantization kernels.

New Features

01/10/2026: ac4k_kernel 0.1.0 is now available! This release includes the following features:

Optimized NVFP4 dot scale kernel for RTX 5090 (Blackwell) GPUs.
Optimized 8-bit quantized MHA kernel for RTX 5090 (Blackwell) GPUs.
BF16 to NVFP4, BF16 to FP8E4M3, and BF16 to INT8 quantize kernels for RTX 5090 (Blackwell) GPUs.
Only CUDA 12.8 is supported.

Installation

Prerequisites

Install build dependencies for fastest compilation:

pip install -r requirements.txt

# Recommended: ccache for caching (dramatically speeds up recompilation)
# Ubuntu/Debian
sudo apt install ccache

Install from source

git clone git@github.com:ac4k/ac4k_kernel.git
cd ac4k_kernel

# Install (auto-detects GPU architecture)
pip install -e . --no-build-isolation

# Specify architecture explicitly (no GPU required at build time)
AC4K_CUDA_ARCH=sm120 pip install -e . --no-build-isolation
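The two install modes above (auto-detection versus an explicit `AC4K_CUDA_ARCH`) can be pictured with a small sketch. This is not the project's `setup.py`: the function names `normalize_arch` and `resolve_arch`, the accepted spellings, and the `sm_NNN` output form are assumptions made for illustration only.

```python
import re
from typing import Callable, Mapping, Optional, Tuple

# Accepts spellings such as "sm120", "sm_120", "12.0" or "sm_120a" (assumed formats).
_ARCH_RE = re.compile(r"^(?:sm_?)?(\d+)(?:\.(\d+))?([a-z]?)$")


def normalize_arch(value: str) -> str:
    """Normalize an architecture string to the 'sm_NNN' form."""
    match = _ARCH_RE.match(value.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized CUDA architecture: {value!r}")
    digits = match.group(1) + (match.group(2) or "")
    return f"sm_{digits}{match.group(3)}"


def resolve_arch(
    env: Mapping[str, str],
    detect: Callable[[], Optional[Tuple[int, int]]],
) -> str:
    """Prefer the explicit override; otherwise fall back to GPU detection."""
    override = env.get("AC4K_CUDA_ARCH")
    if override:
        # Explicit setting: no GPU is needed on the build machine.
        return normalize_arch(override)
    capability = detect()  # hypothetical callback returning (major, minor) or None
    if capability is None:
        raise RuntimeError("no GPU detected; set AC4K_CUDA_ARCH explicitly")
    major, minor = capability
    return f"sm_{major}{minor}"
```

In a real build script the `detect` callback could wrap something like `torch.cuda.get_device_capability()`, which returns a `(major, minor)` tuple; whether ac4k_kernel does this is not stated in this README.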

Rebuild after C++/CUDA changes

When modifying C++/CUDA code during development, use this faster command instead of re-running pip:

python setup.py build_ext --inplace

Note: Python code changes take effect immediately (editable install). Only C++/CUDA changes require rebuilding.

Build acceleration

The build system automatically applies these optimizations when available:

Optimization	Effect	How to enable
Ninja	Parallel file-level compilation	`pip install ninja`
ccache	Caches object files across rebuilds	`apt install ccache`
MAX_JOBS	Controls parallel compilation jobs	`MAX_JOBS=N pip install ...` (default: half CPU cores)
nvcc --threads	Intra-file parallelism for CUDA compilation	Automatic
Single-arch build	Only compiles target GPU architecture	Automatic via `-arch=sm_XXXa`
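As an illustration of the `MAX_JOBS` row, the following sketch shows one way a build script could pick the job count: honor the environment variable when set, otherwise use half of the CPU cores. The helper name `resolve_max_jobs` is hypothetical, and the floor of one job is an assumption; the actual logic in this repository may differ.

```python
import os
from typing import Mapping, Optional


def resolve_max_jobs(env: Mapping[str, str], cpu_count: Optional[int]) -> int:
    """Return the number of parallel compile jobs to use."""
    raw = env.get("MAX_JOBS")
    if raw:
        jobs = int(raw)
        if jobs < 1:
            raise ValueError("MAX_JOBS must be a positive integer")
        return jobs
    # Default described in the table: half of the CPU cores (at least one job).
    return max(1, (cpu_count or 1) // 2)


if __name__ == "__main__":
    print(resolve_max_jobs(os.environ, os.cpu_count()))
```

Lowering `MAX_JOBS` is mainly useful on machines where parallel nvcc processes exhaust memory.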

Contribution Guidelines

Pre-commit Hooks

To ensure code quality, style consistency, and commit integrity, we use pre-commit. Install the hooks before contributing:

# Install clang-format
pip install clang-format

# Navigate to the project root directory
cd ac4k_kernel

# Install pre-commit hooks (runs on every commit)
pre-commit install

# (Optional) Install pre-push hooks (runs additional checks before pushing to remote)
pre-commit install --hook-type pre-push

Explanation: The hooks automatically run code formatting (black, isort), linting (flake8, pylint), and syntax checks. A commit that fails validation is blocked; fix the reported issues, then commit again.

Contribution Workflow

1. Fork the repository.
2. Create a feature branch (`git checkout -b feature/your-feature-name`).
3. Make your changes (follow the code style guidelines).
4. Run the tests locally (see the test directory).
5. Commit your changes (pre-commit hooks will run automatically).
6. Push to your forked repository.
7. Create a Pull Request (PR) against the dev branch of the original repository.

Development Environment

Recommended Docker Image

For a consistent and reproducible development environment, use the following Docker image (includes CUDA 12.8 + cuDNN + Ubuntu 22.04):

# Pull the recommended image
docker pull nvidia/cuda:12.8.0-cudnn-devel-ubuntu22.04

# Run the container (mount project directory for local development)
docker run -it --gpus all -v "$(pwd)":/ac4k_kernel -w /ac4k_kernel nvidia/cuda:12.8.0-cudnn-devel-ubuntu22.04
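Since this release targets CUDA 12.8 only, it can help to confirm the toolkit version inside the container (or on any build host) before building. The sketch below parses the text printed by `nvcc --version`; the helper names are hypothetical and the check is not part of ac4k_kernel.

```python
import re
import subprocess
from typing import Optional, Tuple


def parse_nvcc_release(output: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from the 'release X.Y' text in nvcc's output."""
    match = re.search(r"release (\d+)\.(\d+)", output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def is_supported(output: str, required: Tuple[int, int] = (12, 8)) -> bool:
    """True when the reported toolkit release equals the required one."""
    return parse_nvcc_release(output) == required


def check_local_toolkit() -> bool:
    """Run nvcc (must be on PATH) and compare against CUDA 12.8."""
    result = subprocess.run(
        ["nvcc", "--version"], capture_output=True, text=True, check=True
    )
    return is_supported(result.stdout)
```

Calling `check_local_toolkit()` raises `FileNotFoundError` when `nvcc` is not installed, which is itself a useful signal that the CUDA toolkit is missing from the environment.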

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
