About the name. Yes, we know — it's spelled barracuda (two Rs). That name was taken on crates.io, so we dropped one R and kept swimming.
Idiomatic, ergonomic Rust wrappers for the entire NVIDIA CUDA stack: the Driver and Runtime APIs plus the ML-adjacent libraries (NVRTC, cuBLAS + cuBLASLt + cuBLASXt, cuFFT, cuSPARSE, cuSOLVER + cuSOLVERMg, cuRAND, cuTENSOR, cuDNN, NCCL, NPP, nvJPEG, nvCOMP, CV-CUDA, NVML, CUPTI, cuFile, nvJitLink, TensorRT, cuDF).
Every library is wrapped as a two-crate pair — `baracuda-<lib>-sys` for the raw FFI + dynamic loader, and `baracuda-<lib>` for the safe, idiomatic API — and re-exported through the umbrella `baracuda` crate behind cargo features.
- **Breadth.** Every library in NVIDIA's compute stack, wrapped with the same conventions — handles are RAII, scalars are generic over `f32`/`f64`/`Complex32`/`Complex64` where the math allows, errors flow through a shared `CudaStatus` trait.
- **Dynamic loading everywhere.** `cargo build` needs no CUDA toolkit installed. The resulting binary runs against whatever NVIDIA driver + CUDA runtime the user has — or fails loudly with a typed loader error if they don't (see the sketch after this list).
- **Version-flexible.** CUDA 11.4 floor, targets the latest 13.x. Newer APIs (green contexts, multicast, cuDNN Graph API, conditional graph nodes, etc.) are feature-gated at the safe layer but always present in the FFI.
- **Two-layer crates.** If you only need the type vocabulary (`Half`, `BFloat16`, `Complex32`/`Complex64`, `DeviceRepr`, `CudaVersion`), depend on `baracuda-types` alone — no loader, no CUDA, no transitive weight.
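For example, the loader failure mode surfaces as an ordinary `Result`. A minimal sketch using `set_device` from the quick start below; the exact error type and message are illustrative, not confirmed API:

```rust
use baracuda::runtime::set_device;

fn main() {
    // On a machine with no NVIDIA driver this returns the typed loader error
    // instead of aborting: the binary builds and starts without any CUDA install.
    if let Err(e) = set_device(0) {
        eprintln!("CUDA unavailable: {e}");
        std::process::exit(1);
    }
    println!("CUDA loaded; device 0 selected");
}
```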
Add the umbrella crate to your `Cargo.toml`:

```toml
[dependencies]
baracuda = { version = "0.1", features = ["driver", "runtime"] }
```

```rust
use baracuda::runtime::{DeviceBuffer, set_device};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    set_device(0)?;
    let host = vec![1.0f32, 2.0, 3.0];
    let d = DeviceBuffer::from_slice(&host)?;
    let mut back = vec![0.0f32; 3];
    d.copy_to_host(&mut back)?;
    assert_eq!(host, back);
    Ok(())
}
```

Turn on additional libraries with features:

```toml
baracuda = { version = "0.1", features = ["ml"] }
# ml = driver + runtime + nvrtc + nvjitlink + cublas + cufft + cusparse
# + cusolver + curand + cudnn + nccl + tensorrt
```

Full guides live under `docs/guides/`:
- Getting started — install, first kernel, common pitfalls.
- Matrix multiply with cuBLAS — SGEMM + GemmEx + cuBLASLt.
- NVRTC runtime compilation — compile CUDA C++ to PTX at runtime (sketched just below).
- Async pipelines — overlap H2D / compute / D2H.
- CUDA graphs — capture and replay.
- Memory patterns — device / unified / pinned / pool.
- Using feature flags — picking the right bundle.
See `examples/` for end-to-end runnable programs.
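The NVRTC guide boils down to a three-step flow: create a program, compile it, pull the PTX. A minimal sketch, assuming a `Program` type at `baracuda::nvrtc` with `new` / `compile` / `ptx` methods (names and the option string are illustrative, not confirmed API):

```rust
use baracuda::nvrtc::Program; // module path assumed

fn jit_to_ptx() -> Result<String, Box<dyn std::error::Error>> {
    let src = r#"
        extern "C" __global__ void axpy(float a, const float* x, float* y) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            y[i] += a * x[i];
        }
    "#;
    let prog = Program::new(src, "axpy.cu")?;          // create the NVRTC program
    prog.compile(&["--gpu-architecture=compute_70"])?; // compile options illustrative
    Ok(prog.ptx()?)                                    // fetch the generated PTX
}
```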
```text
baracuda/
├── crates/
│   ├── baracuda/                 umbrella crate — re-exports per feature
│   ├── baracuda-types/           pure-data types: Half, BFloat16, Complex, DeviceRepr
│   ├── baracuda-types-derive/    #[derive(DeviceRepr)], #[derive(KernelArg)]
│   ├── baracuda-core/            loader (libloading), Error plumbing, stream_mode
│   ├── baracuda-build/           build.rs helpers
│   ├── baracuda-cuda-sys/        Driver + Runtime FFI
│   ├── baracuda-driver/          safe Driver API
│   ├── baracuda-runtime/         safe Runtime API
│   ├── baracuda-nvrtc{,-sys}     runtime CUDA C++ → PTX
│   ├── baracuda-nvjitlink{,-sys} CUDA 12+ JIT linker
│   ├── baracuda-cublas{,-sys}    cuBLAS + cuBLASLt + cuBLASXt
│   ├── baracuda-cufft{,-sys}     cuFFT
│   ├── baracuda-cusparse{,-sys}  generic sparse API
│   ├── baracuda-cusolver{,-sys}  Dn + Sp + Rf + Mg
│   ├── baracuda-curand{,-sys}    RNG + quasi-random
│   ├── baracuda-cudnn{,-sys}     classic ops + Graph API
│   ├── baracuda-nccl{,-sys}      multi-GPU collectives
│   ├── baracuda-npp{,-sys}       image / signal primitives (workhorse subset)
│   ├── baracuda-nvjpeg{,-sys}    JPEG codec (single + batched + hybrid + encoder)
│   ├── baracuda-nvml{,-sys}      device monitoring, ECC, NVLink, events
│   ├── baracuda-cupti{,-sys}     profiling: activity + callback + event/metric + profiler host
│   ├── baracuda-cutensor{,-sys}  contraction / reduction / elementwise / permutation
│   ├── baracuda-nvcomp{,-sys}    LZ4 / Snappy / Zstd / GDeflate / Deflate / Bitcomp / ANS / Cascaded / Gzip / CRC32
│   ├── baracuda-cvcuda{,-sys}    computer-vision operator catalog
│   ├── baracuda-cufile{,-sys}    GPUDirect Storage (Linux-only)
│   ├── baracuda-tensorrt{,-sys}  runtime inference (engine deserialize + execute)
│   └── baracuda-cudf{,-sys}      RAPIDS GPU DataFrames (skeleton — tracks libcudf_c)
├── assets/      README imagery
├── docs/        design notes + per-feature guides
└── examples/    runnable end-to-end programs
```
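For the types-only path, depending on `baracuda-types` alone gives you the plain-data vocabulary with no loader or GPU in sight. A sketch under assumed names: `from_f32` / `to_f32`, `Complex32::new`, and public `re`/`im` fields are illustrative, not confirmed API:

```rust
use baracuda_types::{Complex32, Half}; // crate-root re-exports assumed

fn main() {
    // Plain host-side data: no loader, no CUDA toolkit, no GPU required.
    let h = Half::from_f32(1.5);        // constructor name assumed
    let z = Complex32::new(3.0, -4.0);  // constructor name assumed
    println!("round-trip: {}", h.to_f32());            // accessor assumed
    println!("|z|^2 = {}", z.re * z.re + z.im * z.im); // field names assumed
}
```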
```text
default = ["driver", "runtime"]
# Core APIs
driver nvrtc nvjitlink runtime
# Math libraries
cublas cufft cusparse cusolver curand cutensor
# Deep learning
cudnn nccl tensorrt
# Data / I/O / imaging
cudf npp nvjpeg nvcomp cvcuda
# System tooling
nvml cupti cufile
# Convenience bundles
math = ["cublas", "cufft", "cusparse", "cusolver", "curand"]
dl = ["cudnn", "nccl", "tensorrt"]
imaging = ["npp", "nvjpeg", "cvcuda"]
ml = ["driver", "runtime", "nvrtc", "nvjitlink", "math", "dl"]
full = ["ml", "imaging", "nvcomp", "nvml", "cupti", "cufile", "cudf", "cutensor"]
```
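If your own library forwards these features, the usual companion is a `cfg` gate so GPU code paths compile only when the feature is on. This is a generic Cargo pattern, not a baracuda API:

```rust
// In a crate that forwards baracuda's feature names to its own users:
#[cfg(feature = "cublas")]
pub fn gpu_gemm(a: &[f32], b: &[f32]) -> Vec<f32> {
    // A real implementation would call into baracuda::cublas here.
    let _ = (a, b);
    unimplemented!("GPU path")
}

#[cfg(not(feature = "cublas"))]
pub fn gpu_gemm(_a: &[f32], _b: &[f32]) -> Vec<f32> {
    panic!("this build did not enable the `cublas` feature")
}
```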
Legend: ✅ comprehensive — full host-facing surface wrapped idiomatically · 🟢 workhorse — enough for 95% of uses, with a documented tail · 🟡 scaffolding — loader + smoke coverage, specific ops on request.

| API / library | Status | Notes |
|---|---|---|
| CUDA Driver API | ✅ | Device / Context (primary + explicit + green) / Module / Library / Function / Kernel / Memory (device + host pinned + managed + pool + VMM + IPC) / Stream / Event / Graph (capture + exec + conditional + switch nodes) / Launch / Occupancy / Texture / Surface / Tensor Map / Multicast / External memory + semaphore / Peer / Graphics interop / Profiler |
| CUDA Runtime API | ✅ | Mirror of Driver API for everything that Runtime exposes; zero-cost interop of Stream/Event/Graph between the two |
| NVRTC | ✅ | Program, compile options, include-path registration, error-log retrieval |
| nvJitLink | ✅ | Full linker API |
| cuBLAS + LT + XT | ✅ | L1/L2/L3 S/D/C/Z + real-only L2 (symv/trmv/trsv/ger/syr) + Ex variants (axpy/dot/nrm2/scal/rot) + batched GEMM + batched direct solvers (getrf / getrs / getri / matinv) + GemmEx + GemmStridedBatchedEx + cuBLASLt (descriptors, preference, heuristics, matmul) + cuBLASXt multi-GPU GEMM |
| cuFFT | ✅ | Plan1d / Plan2d / Plan3d / PlanMany + R2C / C2R / C2C / D2Z / Z2D / Z2Z + XT multi-GPU + callbacks |
| cuSPARSE | ✅ | CSR / CSC / COO / BSR + DnVec / DnMat + SpMV / SpMM / SpGEMM (3-phase) / SpSV / SpSM / SDDMM + sparse↔dense / CSR↔CSC conversions + sparse-BLAS-1 (axpby / gather / scatter / rot) |
| cuSOLVER | ✅ | Dense: LU / QR (geqrf + orgqr + ormqr) / Cholesky (potrf + potrs + potri) / SVD / syevd / syevj / gesvdj / syevj_batched / gesvdj_batched / gels S/D/C/Z + generic 64-bit X… (Xgetrf/Xgetrs/Xgeqrf/Xpotrf/Xpotrs/Xsyevd); Sparse: csrlsvchol / csrlsvqr; Refactor: full Rf; Multi-GPU: cuSOLVERMg (getrf / potrf / syevd) |
| cuRAND | ✅ | Host + device generators, quasi-random (Sobol + scrambled), all documented distributions (uniform, normal, log-normal, Poisson, binomial) |
| cuTENSOR | ✅ | Handle, ComputeDescriptor, TensorDescriptor, OperationDescriptor, PlanPreference, Plan + full op catalog: Contraction, Reduction, ElementwiseBinary, ElementwiseTrinary, Permutation, BlockSparseContraction, TrinaryContraction + plan-cache I/O |
| cuDNN | 🟢 | Classic ops (activation, conv forward + backward data + backward filter + backward bias, pooling, softmax, BN inference + training + backward, LRN, dropout, op-tensor, reduce, transform, add, scale), spatial transformer, CTC loss, RNN forward + backward, N-D tensor / filter / pooling / conv descriptors, full Graph (backend) API scaffolding with BackendDescriptor::set_attribute_raw |
| NCCL | ✅ | Communicators + all collectives (all-reduce, reduce, reduce-scatter, all-gather, broadcast, send/recv) + p2p + group API + memory alloc/free + comm split + register/deregister + abort/finalize |
| nvJPEG | ✅ | Single + batched decode (simple + three-phase hybrid: host / transfer / device), planar decode, buffer pools (pinned + device), DecodeParams (output format / ROI / CMYK), JpegStream parser, full encoder (quality, chroma subsampling, optimized Huffman, encode image / YUV, retrieve bitstream) |
| NPP | 🟡 | version() + sample arithmetic; NPP has thousands of function variants — expansion on request |
| nvCOMP | ✅ | LZ4, Snappy, Zstd, GDeflate, Gzip, Deflate, Bitcomp, ANS, Cascaded (compress + decompress async, temp-size, max-chunk-size, alignment queries) + CRC32 + status / property APIs |
| CV-CUDA | 🟢 | ~30 operators — geometric (Resize / PillowResize / Warp* / Remap / Rotate / Crop / CopyMakeBorder), filters (Gaussian / Median / Average / Laplacian / Bilateral / Motion / Conv2D), morphology, edges (Canny), thresholds, color (ColorTwist / BrightnessContrast / GammaContrast), stats, composite, misc |
| NVML | ✅ | Enumeration, memory / temperature / power / fan / utilization, clock info (current / max / applications / default), applications-clock set, power-limit get / set / range, P-states (power + performance), temperature thresholds, ECC (per-location + total error counts, mode get / set), PCI info / link gen / link width / throughput, NVLink state / version / capability, running processes (compute + graphics), compute mode, UUID / serial / index / minor number, handle lookup by UUID / PCI bus ID, event set with register + wait, field values |
| CUPTI | ✅ | Activity API + record walker, Callback API (Subscriber RAII), Event API (Group RAII + device/domain enumeration + read), Metric API (id-by-name, attributes, value), Profiler Host API (initialize / de-initialize / begin session / end session / set config / begin pass / end pass / enable / disable / push range / pop range / flush / counter availability), cuptiGetResultString helper |
| cuFile | ✅ | Driver lifecycle + properties, file-handle register, buffer / stream register, sync + async read / write, BatchIO (setup / submit / poll / cancel / destroy), configurable direct-I/O / cache / pinned-mem limits, op-status error-string helper |
| TensorRT | 🟢 | Runtime-side inference: Runtime, Engine (deserialize + inspect IO / shapes / dtypes + serialize back to bytes), ExecutionContext (with allocation strategy), set input shape, set tensor address, enqueueV3. Builder side remains C++-only by TensorRT's design — use trtexec or the Python API to produce engine blobs, then load them here. |
| cuDF | 🟡 | Tracks RAPIDS libcudf_c's emerging C ABI — CSV / Parquet I/O, Column, Table, TypeId. Anything not exposed through libcudf_c itself is bounded by upstream progress. |
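As a flavor of the math layer cataloged above, a hypothetical SGEMM call could look like the sketch below. `CublasHandle`, `Operation`, and the `sgemm` signature are assumptions for illustration (the matmul guide documents the real surface); `DeviceBuffer` and `copy_to_host` come from the quick start:

```rust
use baracuda::cublas::{CublasHandle, Operation}; // paths and names assumed
use baracuda::runtime::DeviceBuffer;

fn gemm() -> Result<(), Box<dyn std::error::Error>> {
    let h = CublasHandle::new()?; // constructor assumed
    // Column-major 2x2 matrices, the layout cuBLAS expects.
    let a = DeviceBuffer::from_slice(&[1.0f32, 2.0, 3.0, 4.0])?;
    let b = DeviceBuffer::from_slice(&[5.0f32, 6.0, 7.0, 8.0])?;
    let mut c = DeviceBuffer::from_slice(&[0.0f32; 4])?;
    // C = alpha * A * B + beta * C; signature illustrative.
    h.sgemm(Operation::N, Operation::N, 2, 2, 2, 1.0, &a, 2, &b, 2, 0.0, &mut c, 2)?;
    let mut out = vec![0.0f32; 4];
    c.copy_to_host(&mut out)?;
    Ok(())
}
```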
| Example | Demonstrates |
|---|---|
| `hello_kernel` | Driver API: load PTX, launch kernel, readback |
| `hello_runtime` | Runtime API: allocation + H2D + D2H |
| `matmul_cublas` | SGEMM via `baracuda-cublas` |
| `nvrtc_jit` | Compile CUDA C++ at runtime and run it |
| `stream_pipeline` | Overlap H2D / compute / D2H on multiple streams |
| `graph_capture` | Record and replay a graph |
| Feature | `baracuda` | `cust` | `cudarc` | `rustacuda` |
|---|---|---|---|---|
| CUDA Driver API | ✅ | ✅ | ✅ | ✅ (stale) |
| CUDA Runtime API | ✅ | — | partial | — |
| ML-stack libraries | ✅ (everything) | — | ✅ (opinionated) | — |
| cuBLASLt / cuBLASXt | ✅ | — | — | — |
| TensorRT | 🟢 runtime | — | — | — |
| CUPTI profiling | ✅ | — | — | — |
| Dynamic loading | ✅ | ✅ (opt) | ✅ | — |
| Active maintenance | ✅ | ✅ | ✅ | abandoned |
See `docs/compared-to-cust.md` and `docs/compared-to-cudarc.md` for detailed positioning.
| Platform | Driver | Runtime | Libraries |
|---|---|---|---|
| Linux x86_64 | ✅ | ✅ | all |
| Linux aarch64 | ✅ | ✅ | all (Jetson: full NvSci support) |
| Windows x86_64 | ✅ | ✅ | all except cuFile and CV-CUDA (both are Linux-primary) |
| macOS | — | — | NVIDIA dropped Mac CUDA support in 2019 |
- **Every handle is RAII.** Contexts, streams, events, plans, descriptors, communicators — they all have `Drop` impls and `as_raw()` escape hatches (see the sketch after this list).
- **Ownership flows through lifetimes.** Descriptors borrow their backing `DeviceBuffer`s; you cannot free a matrix while cuSPARSE still holds a pointer to it.
- **`Send` / `!Sync` everywhere it matters.** Handles that NVIDIA documents as "single-thread at a time" (cuBLAS, cuDNN, cuSPARSE) are `Send` but not `Sync`; the ones that are thread-safe (NCCL, most CUDA Graph types) are both.
- **No hidden global state.** The loader's singletons are the only global state; `baracuda_core::stream_mode::init` is the one knob you set per process if you want non-default per-thread-default-stream semantics.
- **Errors are typed per-library but unified by `CudaStatus`.** An application-level error type can absorb all of them through `baracuda::types::CudaStatus`.
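A sketch of the first principle in practice. The constructor path (`Stream::new`) is assumed for illustration; `as_raw()` and the `Drop` impl are the conventions named above:

```rust
use baracuda::runtime::Stream; // constructor path assumed

fn demo() -> Result<(), Box<dyn std::error::Error>> {
    let stream = Stream::new()?; // acquire: creates the CUDA stream
    let raw = stream.as_raw();   // escape hatch: raw handle for manual FFI interop
    let _ = raw;                 // ... hand `raw` to an unsafe call here ...
    Ok(())
}                                // release: `Drop` destroys the stream
```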
```sh
# Host-only tests (no GPU required):
cargo test --workspace --lib

# Full workspace build, proves the dynamic-load story works without CUDA:
cargo build --workspace

# GPU-backed integration tests (requires a working NVIDIA driver + CUDA runtime):
BARACUDA_GPU_TESTS=1 cargo test --workspace -- --ignored
```
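The gating behind that last command is the standard `#[ignore]` plus env-var pattern; a sketch of how such a test could be written (the double gate is an assumption about this repo's convention, not confirmed):

```rust
#[test]
#[ignore = "needs a GPU: BARACUDA_GPU_TESTS=1 cargo test -- --ignored"]
fn device_roundtrip() {
    // Double gate: `-- --ignored` opts in on the CLI, the env var per run.
    if std::env::var("BARACUDA_GPU_TESTS").as_deref() != Ok("1") {
        return;
    }
    // ... allocate, copy, and assert against a real device here ...
}
```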
Dual-licensed under MIT or Apache-2.0. Pick whichever fits your project. Contributions are accepted under the same terms.

NVIDIA CUDA libraries (`libcuda`, `libcudart`, `libcublas`, `libcudnn`, `libnccl`, `cutensor.so`, `nvcomp64_5.dll`, etc.) are not redistributed by this project. You must obtain them from NVIDIA separately — either through the CUDA Toolkit or through the individual library's download page. The dynamic loader opens whatever the driver / toolkit has installed.
The barracuda image in assets/ is the project's namesake mascot; replace it
with any freely-licensed barracuda photo of your choice.
