A pure CUDA implementation of the LLaMA model for high-performance inference and educational purposes. Supports LLaMA 1, 2, and 3 architectures.
This repository demonstrates how to run LLaMA inference using CUDA C++, making it ideal for learning GPU acceleration techniques and understanding transformer internals with minimal dependencies.
- Pure CUDA Implementation – Direct CUDA kernels for maximum performance without heavy ML frameworks
- Optimized Matrix Operations – Custom CUDA kernels for matrix multiplication and attention mechanisms (see the kernel sketch after this list)
- Educational – Clean, readable CUDA code with inline documentation for learning GPU programming
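To give a flavor of the custom-kernel approach, here is a minimal matrix-vector multiply kernel of the kind a LLaMA forward pass spends most of its time in. This is an illustrative sketch only: the kernel name, the one-thread-per-output-row layout, and the small test driver are assumptions, not code taken from this repository.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Naive matrix-vector multiply: out = W * x, with W stored row-major as d x n.
// Each thread computes one output row. Illustrative sketch only; not the
// repository's actual kernel.
__global__ void matmul_kernel(float* out, const float* x, const float* W,
                              int n, int d) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= d) return;
    float sum = 0.0f;
    for (int j = 0; j < n; j++) {
        sum += W[row * n + j] * x[j];
    }
    out[row] = sum;
}

int main() {
    const int n = 4, d = 2;
    float hW[d * n] = {1, 2, 3, 4, 5, 6, 7, 8};  // 2 x 4 weight matrix
    float hx[n] = {1, 1, 1, 1};
    float hout[d];

    float *dW, *dx, *dout;
    cudaMalloc(&dW, sizeof(hW));
    cudaMalloc(&dx, sizeof(hx));
    cudaMalloc(&dout, sizeof(hout));
    cudaMemcpy(dW, hW, sizeof(hW), cudaMemcpyHostToDevice);
    cudaMemcpy(dx, hx, sizeof(hx), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (d + threads - 1) / threads;  // one thread per output element
    matmul_kernel<<<blocks, threads>>>(dout, dx, dW, n, d);
    cudaMemcpy(hout, dout, sizeof(hout), cudaMemcpyDeviceToHost);

    printf("%.1f %.1f\n", hout[0], hout[1]);  // expected: 10.0 26.0
    cudaFree(dW); cudaFree(dx); cudaFree(dout);
    return 0;
}
```

A production kernel typically adds shared-memory tiling or vectorized loads on top of this basic structure, but the one-thread-per-output layout is the usual starting point.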
Build with `make` and run against a model checkpoint:

```
make
./llama stories15M.bin
```

The examples use small models trained by Andrej Karpathy for demonstration.
Building requires the NVIDIA CUDA Toolkit (11.0 or later):
```
make
```

Or using CMake:

```
mkdir build && cd build
cmake ..
make
```

Planned improvements:

- Implement FP16 (float16) version for better memory efficiency and performance (see the sketch after this list)
- Add Flash Attention for faster attention computation
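For the FP16 item, the sketch below shows one way it could look: weights stored as CUDA `__half`, with accumulation kept in FP32 for accuracy. This is roadmap speculation, not code that exists in the repository; the kernel names and signatures are assumptions.

```cuda
#include <cuda_fp16.h>

// Roadmap sketch only: nothing below exists in the repository yet.
// Convert FP32 weights to FP16 on the device to halve weight memory traffic.
__global__ void fp32_to_fp16_kernel(__half* dst, const float* src, int count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) {
        dst[i] = __float2half(src[i]);
    }
}

// Matrix-vector multiply reading __half weights while accumulating in FP32.
__global__ void matmul_fp16_kernel(float* out, const float* x,
                                   const __half* W, int n, int d) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= d) return;
    float sum = 0.0f;
    for (int j = 0; j < n; j++) {
        sum += __half2float(W[row * n + j]) * x[j];
    }
    out[row] = sum;
}
```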
If you're interested in LLaMA implementations in other languages:
Inspired by llama2.c, llama3.cuda and the broader LLaMA community. This project aims to provide a GPU-accelerated alternative for educational purposes.
MIT