BarraCUDA

An open-source CUDA C++ compiler written from scratch in C99 that takes .cu files and compiles them to AMD GPU machine code and Tenstorrent Tensix C++, with more architectures planned. No LLVM, no dependencies, and no permission asked.

This is what happens when you look at NVIDIA's walled garden and think "how hard can it be?" The answer is: quite hard, actually, but I did it anyway.

See Changelog for recent updates.

What It Does

Takes CUDA C source code, the same .cu files you'd feed to nvcc, and compiles them to AMD RDNA 2/3/4 binaries or Tenstorrent Tensix Metalium C++.

┌──────────────────────────────────────────────────────────────┐
│                      BarraCUDA Pipeline                      │
├──────────────────────────────────────────────────────────────┤
│  Source (.cu)                                                │
│       ↓                                                      │
│  Preprocessor → #include, #define, macros, conditionals      │
│       ↓                                                      │
│  Lexer → Tokens                                              │
│       ↓                                                      │
│  Parser (Recursive Descent) → AST                            │
│       ↓                                                      │
│  Semantic Analysis → Type checking, scope resolution         │
│       ↓                                                      │
│  BIR (BarraCUDA IR) → SSA form, typed instructions           │
│       ↓                                                      │
│  mem2reg → Promotes allocas to SSA registers                 │
│       ↓                                                      │
│  Instruction Selection                                       │
│       ├──────────────────────┬───────────────────────        │
│       ↓ AMD                  ↓ Tenstorrent                   │
│  VGPR/SGPR regalloc    Tensix SFPU isel                      │
│       ↓                      ↓                               │
│  GFX10/11/12 encoding  Metalium C++ emission                 │
│       ↓                      ↓                               │
│  .hsaco ELF            compute/reader/writer/host            │
│       ↓                      ↓                               │
│  Your kernel runs on silicon you own                         │
└──────────────────────────────────────────────────────────────┘

Building

# It's C99. It builds with gcc. There are no dependencies.
make

# That's it. No cmake. No autoconf. No 47-step build process.
# If this doesn't work, your gcc is broken, not the Makefile.

Requirements

  • A C99 compiler (gcc, clang, whatever you've got)
  • A will to live (optional but recommended)
  • LLVM is NOT required. BarraCUDA does its own instruction encoding like an adult.

Usage

# Compile to AMD GPU binary (RDNA 3, default)
./barracuda --amdgpu-bin kernel.cu -o kernel.hsaco

# Compile for RDNA 2
./barracuda --amdgpu-bin --gfx1030 kernel.cu -o kernel.hsaco

# Compile for RDNA 4
./barracuda --amdgpu-bin --gfx1200 kernel.cu -o kernel.hsaco

# Compile to Tenstorrent Metalium C++
./barracuda --tensix kernel.cu -o kernel_compute.cpp

# Dump the IR (for debugging or curiosity)
./barracuda --ir kernel.cu

# Just parse and dump the AST
./barracuda --ast kernel.cu

# Run semantic analysis
./barracuda --sema kernel.cu

# Error messages in te reo Māori (or any language with a translation file)
./barracuda --lang lang/mi.txt --amdgpu-bin kernel.cu -o kernel.hsaco

Runtime Launcher

BarraCUDA includes a minimal HSA runtime (src/runtime/) for dispatching compiled kernels on real AMD hardware. Zero compile-time dependency on ROCm — loads libhsa-runtime64.so at runtime via dlopen.

# Compile the runtime and example together
gcc -std=c99 -O2 -I src/runtime \
    examples/launch_saxpy.c src/runtime/bc_runtime.c \
    -ldl -lm -o launch_saxpy

# Compile a kernel and run it
./barracuda --amdgpu-bin -o test.hsaco tests/canonical.cu
./launch_saxpy test.hsaco

Requires Linux with ROCm installed. See examples/launch_saxpy.c for a complete example. Not yet tested on real hardware — if you have an AMD GPU, we'd love a test report (#39).

What Works

The following CUDA features compile to working GFX10/GFX11/GFX12 machine code and Tensix Metalium C++:

Core Language

  • __global__, __device__, __host__ function qualifiers
  • threadIdx, blockIdx, blockDim, gridDim builtins
  • Structs, enums, typedefs, namespaces
  • Pointers, arrays, pointer arithmetic
  • All C control flow: if/else, for, while, do-while, switch/case, goto/label
  • Short-circuit && and ||
  • Ternary operator
  • Templates (basic instantiation)
  • Multiple return paths, continue, break

CUDA Features

  • __shared__ memory (allocated from LDS, properly tracked)
  • __syncthreads() β†’ s_barrier
  • Atomic operations: atomicAdd, atomicSub, atomicMin, atomicMax, atomicExch, atomicCAS, atomicAnd, atomicOr, atomicXor
  • Warp intrinsics: __shfl_sync, __shfl_up_sync, __shfl_down_sync, __shfl_xor_sync
  • Warp votes: __ballot_sync, __any_sync, __all_sync
  • Vector types: float2, float3, float4, int2, int3, int4 with .x/.y/.z/.w access
  • Half precision: __half, __float2half(), __half2float(), __nv_bfloat16
  • __launch_bounds__ (parsed, propagated, enforces VGPR caps)
  • Cooperative groups: cooperative_groups::this_thread_block() with .sync(), .thread_rank(), .size()
  • Operator overloading
  • Math builtins: sqrtf, rsqrtf, expf, exp2f, logf, log2f, log10f, sinf, cosf, tanf, tanhf, powf, fabsf, floorf, ceilf, truncf, roundf, rintf, fmaxf, fminf, fmodf, copysignf
  • __constant__ memory, __device__ globals
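
As an illustration (this kernel is not from the repository's test suite), a reduction that exercises several of the features above at once, shared memory, __syncthreads(), warp shuffles, and atomics, might look like:

```cuda
// Hypothetical example: block-level sum reduction. Assumes blockDim.x is a
// multiple of 32 and at most 1024 (so at most 32 warps per block).
__global__ void block_sum(const float *in, float *out, int n)
{
    __shared__ float warp_sums[32];                 // one slot per warp
    int idx  = threadIdx.x + blockIdx.x * blockDim.x;
    float v  = (idx < n) ? in[idx] : 0.0f;

    // Reduce within each warp using shuffles.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);

    int lane = threadIdx.x & 31;
    int warp = threadIdx.x >> 5;
    if (lane == 0)
        warp_sums[warp] = v;
    __syncthreads();

    // First warp reduces the per-warp partials; one atomic per block.
    if (warp == 0) {
        v = (lane < (blockDim.x + 31) / 32) ? warp_sums[lane] : 0.0f;
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(0xffffffffu, v, offset);
        if (lane == 0)
            atomicAdd(out, v);
    }
}
```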

Compiler Features

  • Full C preprocessor: #include, #define/#undef, function-like macros, #ifdef/#ifndef/#if/#elif/#else/#endif, #pragma, #error, -I/-D flags
  • Error recovery (reports multiple errors without hanging)
  • Multilingual error messages (--lang <file>) with language-neutral E-codes
  • Source location tracking in IR dumps
  • Struct pass-by-value

Example

__global__ void vector_add(float *c, float *a, float *b, int n)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n)
        c[idx] = a[idx] + b[idx];
}
$ ./barracuda --amdgpu-bin vector_add.cu -o vector_add.hsaco
wrote vector_add.hsaco (528 bytes code, 1 kernels)

No LLVM required :-)

What Doesn't Work (Yet)

Being honest about limitations is important. Here's what's missing:

  • Parameter reassignment in __device__ functions (use local variables)
  • Textures and surfaces
  • Dynamic parallelism (device-side kernel launch)
  • Multiple translation units
  • Host code generation (only device code is compiled)

None of these are architectural blockers. They're all "haven't got round to it yet" items.

Test Suite

14 test files, 35+ kernels, ~1,700 BIR instructions, ~27,000 bytes of machine code:

  • vector_add.cu - The "hello world" of GPU computing
  • cuda_features.cu - Atomics, warp ops, barriers, gotos, switch, short-circuit
  • test_tier12.cu - Vectors, shared memory, operator overloading
  • notgpt.cu - AI-generated CUDA with extremely sarcastic comments (tiled SGEMM, reductions, histograms, prefix scan, stencils, half precision, cooperative groups, and the "kitchen sink" kernel)
  • stress.cu - N-body simulation, nested control flow, bit manipulation, struct pass-by-value, chained function calls
  • canonical.cu - Canonical patterns from NVIDIA samples adapted for the parser
  • test_errors.cu - Deliberate syntax errors to verify error recovery
  • test_launch_bounds.cu - __launch_bounds__ parsing and VGPR cap enforcement
  • test_coop_groups.cu - Cooperative groups lowering
  • mymathhomework.cu - Trig identities, exponential growth, Newton-Raphson, log laws, hyperbolic functions, floor/ceil/round, power rule, clamping
  • Plus preprocessor tests, template tests, unsigned integer tests

Roadmap

Near Term: Hardening

Fix the known gaps: integer literal suffixes, const, parameter reassignment. These are all small parser/lowerer changes. The goal is to compile real-world .cu files without modifications.

Medium Term: Optimisation

The generated code works but isn't winning any benchmarks. Done so far: instruction scheduling, constant folding, dead code elimination. Priorities:

  • Better register allocation (currently linear scan, consider graph colouring)
  • Loop-invariant code motion
  • Occupancy tuning based on register pressure

Long Term: More Architectures

The IR (BIR) is target-independent. The backend is cleanly separated. Adding a new target means writing a new isel + emit pair.

  • Tenstorrent Tensix - Done. Compiles CUDA to TT-Metalium C++ for Blackhole. --tensix
  • Intel Arc - Xe architecture. Would give BarraCUDA coverage across all three major GPU vendors.
  • RISC-V Vector Extension - For when GPUs are too mainstream and you want to run CUDA on a softcore.

Contributing

Issues and PRs in any language are welcome — just include an English translation alongside. See CONTRIBUTING.md for the full guide on style, naming, and where to help.

The HLASM-style short identifiers (ra_gc, mk_hash, enc_vop3) are culturally neutral by accident: there's nothing English about a 5-character label. If you've found a bug or have an idea, write it up in whatever language you think in.

Changelog

2026-03-09 — Post-isel verification pass (bc_vfy). The encoder used to trust isel to produce valid machine instructions. It shouldn't have. bc_vfy runs twice (post-isel, post-RA) and catches 5 classes of encoding violation before the binary leaves the compiler. Its first run immediately found 7 isel bugs across GFX10 and GFX942 — every one a silent miscompile that would fault on hardware with "Reason: Unknown." Fixed them all. Also: bc_abend runtime crash diagnostics, because if IBM could do post-mortem dumps in 1964, we can do it for GPUs in 2026.

2026-03-08 — Error localisation infrastructure. Every diagnostic now has a language-neutral ID (E001–E111). External translation files via --lang <file>. English reference at lang/en.txt, te reo Māori at lang/mi.txt. Unified error structs. Lowering errors now displayed.

2026-03-05 — CDNA 3 additions: GFX942 backend hardening, MFMA, Wave64 divergence, tinygrad compat. 8/8 tests passing on MI300X (PR#56).

2026-03-05 — Instruction scheduling (PR#52).

2026-03-03 — CDNA 2 support (--gfx90a, MI250). Tinygrad compatibility.

2026-02-28 — Tenstorrent Tensix backend (--tensix). Compiles CUDA to TT-Metalium C++ for Blackhole. Constant folding (PR#51). Dead code elimination (PR#48).

2026-02-25 — HSA runtime launcher (PR#40). RDNA 2 support (--gfx1030, PR#38). Test suite (PR#41).

2026-02-20 — RDNA 4 support (--gfx1200, PR#32).

2026-02-16 — Initial release. CUDA compiler targeting AMD RDNA 3 (gfx1100).

Contact

Found a bug? Want to discuss the finer points of AMDGPU instruction encoding? Need someone to commiserate with about the state of GPU computing?

zanehambly@gmail.com

Open an issue if there's anything you want to discuss. Or don't. I'm not your mum.

Based in New Zealand, where it's already tomorrow and the GPUs are just as confused as everywhere else.

License

Apache 2.0. Do whatever you want. If this compiler somehow ends up in production, I'd love to hear about it, mostly so I can update my LinkedIn with something more interesting than "wrote a CUDA compiler for fun".

Acknowledgements

  • Steven Muchnick for Advanced Compiler Design and Implementation. If this compiler does anything right, that book is why.
  • Low Level for the Zero to Hero C course and the YouTube channel. That's where I learnt C.
  • Abe Kornelis for being an amazing teacher. His work on the z390 Portable Mainframe Assembler project is well worth your time.
  • To the people who've sent messages of kindness and critique, thank you from a forever student and a happy hobbyist.
  • My Granny, Grandad, Nana and Baka. Love you x

He aha te mea nui o te ao. He tāngata, he tāngata, he tāngata.

What is the most important thing in the world? It is people, it is people, it is people.

