TensaLang is a tensor-first programming language, compiler, and runtime that lets you develop a model's inference engine (e.g. for LLMs) in a simple high-level language, then compile it through MLIR to multiple targets (e.g. CPU, CUDA, ROCm). You can change attention, sampling, tiling, and memory placement without rewriting the compiler.
Most LLM inference stacks are monolithic: kernels are "the runtime," and the model is encoded by the library's implementation. TensaLang flips that:
- LLM logic is source code. RMSNorm, RoPE, attention, MLP, sampling: all written in `.tl`.
- MLIR is the core IR. You get a full compiler pipeline, not a black box.
- Scheduling is part of the language. You can hint tile sizes, parallel indices, and memory placement directly in source (`with tile=... parallel=... memory=...`).
- Builtins are overridable. You can replace `softmax` or `layernorm` in source for custom behavior.
- Target-dependent lowerings live under `Targets/`. Today: CPU + CUDA. Planned: MLX and ROCm lowering pipelines.
Build:

```bash
./build.sh
```

Download example weights:

```bash
git clone https://huggingface.co/DatarusAI/Tensa-Lang models
```

This provides `models/llama2_7b/` and `models/Qwen2.5-0.5B-Coder-Instruct/`.
Run Llama2 fp16 (CPU):

```bash
./bin/tensalang-run examples/llama2_manual_tiling_fp16.tl \
  --model models/llama2_7b/llama2_7b_f16.safetensors \
  --tokenizer models/llama2_7b/tokenizer.json \
  --prompt "Once upon a time"
```

Run Llama2 fp16 (CUDA):

```bash
./bin/tensalang-run examples/llama2_manual_tiling_fp16.tl \
  --model models/llama2_7b/llama2_7b_f16.safetensors \
  --tokenizer models/llama2_7b/tokenizer.json \
  --prompt "Once upon a time" \
  --target cuda \
  --cuda-arch sm_89
```

Run Qwen2.5-Coder-0.5B-Instruct (CUDA):

```bash
./bin/tensalang-run examples/qwen25_coder_bf16.tl \
  --model models/Qwen2.5-0.5B-Coder-Instruct/qwen25_0.5b_bf16.safetensors \
  --tokenizer models/Qwen2.5-0.5B-Coder-Instruct/tokenizer.json \
  --prompt "Once upon a time" \
  --target cuda \
  --cuda-arch sm_89
```

Notes:
- Defaults: `--steps 128`, `--target cpu`, `--fused-attention 2`, `--seed 12345`.
- The runner prints a `Settings used:` block at startup and an `OUTPUT:` block at the end.
Build the image once:

```bash
docker build -f docker/Dockerfile -t tensalang:local .
```

Run with the helper (auto-mounts model/tokenizer paths):

```bash
./docker_command_exec examples/llama2_manual_tiling_fp16.tl \
  --model /path/to/llama2_7b_f16.safetensors \
  --tokenizer /path/to/tokenizer.json \
  --prompt "Once upon a time" \
  --target cuda \
  --cuda-arch sm_89
```

Notes:

- Set `TENSALANG_DOCKER_IMAGE` to override the image name.
- Use `TENSALANG_DOCKER_FORCE_GPU=1` or `TENSALANG_DOCKER_NO_GPU=1` to force GPU on/off.
```text
.
├── src/               # compiler, runtime, and sugar front-end
├── Targets/           # target backends (CUDA, CPU, etc.)
├── examples/          # reference programs
├── models/            # local model weights + tokenizers (clone from HF)
├── tools/             # conversion utilities
├── bin/               # binaries + static runtime lib
├── tensalang_run.cpp  # C++ runner (builds bin/tensalang-run)
└── docs.md            # full language reference
```
```mermaid
flowchart LR
  TL[".tl source"] --> SEXPR["S-expression IR"]
  SEXPR --> MLIR["MLIR module"]
  MLIR --> PASS["MLIR passes"]
  PASS --> CPU["CPU lowering + LLVM JIT"]
  PASS --> CUDA["CUDA lowering + NVVM"]
  CPU --> RTCPU["CPU runtime + kernels"]
  CUDA --> RTCUDA["CUDA runtime + kernels"]
  RTCPU --> RUN["Token generation"]
  RTCUDA --> RUN
```
- The surface syntax is parsed by `src/tensalang_sugar.py`.
- It emits a compact S-expression IR (s-expr) that is easy to consume in C++.
- `src/codegen.cpp` lowers the s-expr into MLIR using linalg/scf/arith/memref ops.
- The compiler builds a full MLIR module, then runs optimization passes.
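To illustrate why an s-expr IR is easy to hand off between the Python front-end and C++, here is a minimal sketch of serializing a nested tuple as an s-expression. The `to_sexpr` helper and the data shapes are illustrative assumptions, not the actual `src/tensalang_sugar.py` API.

```python
# Hypothetical sketch: serialize a (head, *children) tuple tree as an
# s-expression string, similar in shape to the IR this repo emits.
def to_sexpr(node):
    """Recursively print a tuple tree in s-expression form."""
    if isinstance(node, tuple):
        return "(" + " ".join(to_sexpr(child) for child in node) + ")"
    return str(node)

# A tiny fragment resembling one parameter declaration in the IR.
param_q = ("param", "q", ("tensor", "f32", "[H Dh]"))
print(to_sexpr(param_q))  # → (param q (tensor f32 [H Dh]))
```

Because the output is plain parenthesized text, the C++ side only needs a trivial recursive-descent reader to reconstruct the tree.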
CPU path

- MLIR is lowered to LLVM and JIT-compiled via the MLIR ExecutionEngine.
- Hot paths are specialized in `Targets/CPU/runtime_cpu.cpp` (SIMD matvec, RMSNorm, RoPE, attention).
CUDA path

- GPU mapping passes convert loops into GPU kernels.
- GPU kernels are lowered to NVVM, serialized to cubin, and launched via runtime wrappers.
- CUDA runtime wrappers live in `Targets/CUDA/runtime_cuda.cpp` and handle:
  - CUDA context + streams
  - device alloc/free + memcpy
  - kernel launches
  - cuBLAS GEMV fast path
The runtime provides the "outside world" that the `.tl` program calls into:
- Safetensors I/O
- Tokenization + decoding
- Sampling helpers
- Arena allocator for temporary tensors
- GPU helpers (CUDA, cuBLAS)
Source (`.tl`):

```
fn attn_scores(q: Tensor<f32, [H, Dh]>, k: Tensor<f16, [T, Dh]>, scale: f32) -> Tensor<f32, [H, T]>
    with tile=[8, 64], parallel=[h, t] {
  var s: Tensor<f32, [H, T]>
  s[h, t] = sum(i) q[h, i] * (k[t, i] as f32) * scale
  return s
}
```

S-expression IR:

```
(program
  (fn attn_scores
    (params
      (param q (tensor f32 [H Dh]))
      (param k (tensor f16 [T Dh]))
      (param scale f32)
    )
    (returns (tensor f32 [H T]))
    (with
      (tile (list (number 8) (number 64)))
      (parallel (list (symbol h) (symbol t)))
    )
    (body
      (var s (tensor f32 [H T]))
      (assign (index (symbol s) (symbol h) (symbol t)) (reduce sum i (binary * (binary * (index (symbol q) (symbol h) (symbol i)) (cast (index (symbol k) (symbol t) (symbol i)) f32)) (symbol scale))))
      (return (symbol s))
    )
  )
)
```

MLIR (CUDA path, excerpt after gpu-kernel-outlining, showing the GPU dialect):
```mlir
module {
  func.func @attn_scores(%arg0: memref<?x?xf32>, %arg1: memref<?x?xf16>, %arg2: f32) -> memref<?x?xf32> {
    %c16 = arith.constant 16 : index
    %c1 = arith.constant 1 : index
    %c0 = arith.constant 0 : index
    %dim = memref.dim %arg0, %c0 : memref<?x?xf32>
    %0 = arith.index_cast %dim : index to i64
    %1 = arith.sitofp %0 : i64 to f32
    %dim_0 = memref.dim %arg0, %c1 : memref<?x?xf32>
    %2 = arith.index_cast %dim_0 : index to i64
    %3 = arith.sitofp %2 : i64 to f32
    %dim_1 = memref.dim %arg1, %c0 : memref<?x?xf16>
    %4 = arith.index_cast %dim_1 : index to i64
    %5 = arith.sitofp %4 : i64 to f32
    %6 = arith.fptosi %1 : f32 to i64
    %7 = arith.index_cast %6 : i64 to index
    %8 = arith.fptosi %5 : f32 to i64
    %9 = arith.index_cast %8 : i64 to index
    %alloc = memref.alloc(%7, %9) : memref<?x?xf32>
    %dim_2 = memref.dim %arg0, %c0 : memref<?x?xf32>
    %dim_3 = memref.dim %arg1, %c0 : memref<?x?xf16>
    %10 = arith.subi %dim_2, %c1 : index
    %11 = arith.subi %dim_3, %c1 : index
    %12 = arith.addi %10, %c16 : index
    %13 = arith.addi %11, %c16 : index
    %14 = arith.divsi %12, %c16 : index
    %15 = arith.divsi %13, %c16 : index
    gpu.launch blocks(%arg3, %arg4, %arg5) in (%arg9 = %14, %arg10 = %15, %arg11 = %c1)
               threads(%arg6, %arg7, %arg8) in (%arg12 = %c16, %arg13 = %c16, %arg14 = %c1) {
      %16 = arith.muli %arg3, %c16 : index
      %17 = arith.muli %arg4, %c16 : index
      %18 = arith.addi %16, %arg6 : index
      %19 = arith.addi %17, %arg7 : index
      %20 = arith.cmpi slt, %18, %dim_2 : index
      %21 = arith.cmpi slt, %19, %dim_3 : index
      %22 = arith.andi %20, %21 : i1
      scf.if %22 {
        %cst = arith.constant 0.000000e+00 : f32
        %dim_4 = memref.dim %arg0, %c1 : memref<?x?xf32>
        %23 = scf.for %arg15 = %c0 to %dim_4 step %c1 iter_args(%arg16 = %cst) -> (f32) {
          %24 = memref.load %arg0[%18, %arg15] : memref<?x?xf32>
          %25 = memref.load %arg1[%19, %arg15] : memref<?x?xf16>
          %26 = arith.extf %25 : f16 to f32
          %27 = arith.mulf %24, %26 : f32
          %28 = arith.mulf %27, %arg2 : f32
          %29 = arith.addf %arg16, %28 : f32
          scf.yield %29 : f32
        }
        memref.store %23, %alloc[%18, %19] : memref<?x?xf32>
      }
      gpu.terminator
    }
    return %alloc : memref<?x?xf32>
  }
}
```

The Llama2 program calls `attention_f16_fused`. The compiler detects it and emits:
- a fused kernel (`--fused-attention 1`), or
- a two-stage fused kernel (`--fused-attention 2`, the default)
The two-stage variant splits score computation and normalization for better numerical stability and efficiency. To use your own attention, define `attention_f16_fused` in `.tl`.
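The numerical idea behind splitting score computation and normalization can be sketched in plain Python: stage 1 reduces the scores to a running max and a sum of shifted exponentials, stage 2 normalizes with those reduced values. This is a generic stable-softmax sketch, not the actual CUDA kernel.

```python
import math

def softmax_two_stage(scores):
    """Numerically stable softmax in two stages: reduce, then normalize.
    Illustrative Python, not the repo's fused attention kernel."""
    # Stage 1: reduce to (max, sum of exp(s - max)); shifting by the max
    # keeps every exponent <= 0, so exp() cannot overflow.
    m = max(scores)
    denom = sum(math.exp(s - m) for s in scores)
    # Stage 2: normalize each score with the reduced values.
    return [math.exp(s - m) / denom for s in scores]

# Large raw scores would overflow a naive exp(); the shifted form is safe.
probs = softmax_two_stage([1000.0, 1001.0, 1002.0])
```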
Matvec for projection weights can dispatch to cuBLAS GEMV (on CUDA) when layouts are compatible.
`arena_begin()` / `arena_reset()` / `arena_end()` enable a bump-pointer allocator that reuses memory each token, reducing allocation overhead during decoding.
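A bump-pointer arena can be sketched in a few lines: one upfront buffer, an offset that only moves forward, and a reset that rewinds it per token. The class and method names below are illustrative; the real allocator lives in the C++ runtime.

```python
class Arena:
    """Minimal bump-pointer arena sketch mirroring the shape of
    arena_begin() / arena_reset() / arena_end(). Illustrative only."""
    def __init__(self, capacity):
        self.buffer = bytearray(capacity)  # one upfront allocation
        self.offset = 0

    def begin(self):
        # arena_begin(): start a fresh decode step.
        self.offset = 0

    def alloc(self, nbytes, align=64):
        # Round the bump pointer up to the alignment boundary, then bump.
        start = (self.offset + align - 1) // align * align
        if start + nbytes > len(self.buffer):
            raise MemoryError("arena exhausted")
        self.offset = start + nbytes
        return memoryview(self.buffer)[start:start + nbytes]

    def reset(self):
        # arena_reset(): the next token reuses the same buffer.
        self.offset = 0

arena = Arena(1 << 20)
arena.begin()
scratch = arena.alloc(4096)  # temporary tensor storage for this token
arena.reset()                # no free() calls; just rewind the offset
```

Freeing is O(1) regardless of how many temporaries were allocated, which is why this pattern suits the per-token scratch tensors of decoding.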
TensaLang lets you embed scheduling preferences directly in source. This is the key differentiator for experimentation.
```
fn attention_f16(q: Tensor<f32, [D]>,
                 key_cache: Tensor<f16, [L, SeqLen, KvDim]>,
                 value_cache: Tensor<f16, [L, SeqLen, KvDim]>,
                 layer: i32, pos: i32, H: i32, KvH: i32, scale: f32) -> Tensor<f32, [D]>
    with tile=[8, 64], parallel=[h, t],
         memory={key_cache: shared_mem, value_cache: shared_mem} {
  var att: Tensor<f32, [H, SeqLen]> = zeros([H, SeqLen])
  att[h, t] = if t > pos { -inf } else {
    sum(i) q[h * Dh + i] * (key_cache[layer, t, (h / kv_mul) * Dh + i] as f32) * scale
  }
  var weights: Tensor<f32, [H, SeqLen]> = softmax(att)
  ...
}
```

`examples/llama2_manual_tiling_fp16.tl` uses explicit hints. `examples/llama2_auto_tiling_fp16.tl` omits them and lets defaults apply.
Hints are preferences; the compiler may clamp or ignore them if they are unsafe.
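As a sketch of what "clamp or ignore" might mean, here is a hypothetical validator that bounds each requested tile size by the problem dimension and a hardware cap, so an oversized hint degrades gracefully instead of failing. The function and its policy are assumptions for illustration, not the actual compiler logic.

```python
def clamp_tile_hint(tile_hint, dims, max_tile=256):
    """Hypothetical hint validation: clamp each requested tile size to
    [1, min(dim, max_tile)]. Not the real TensaLang pass."""
    clamped = []
    for hint, dim in zip(tile_hint, dims):
        size = max(1, min(hint, dim, max_tile))  # never exceed the dim or cap
        clamped.append(size)
    return clamped

# A tile=[8, 64] hint on a [4, 4096] problem: 8 exceeds the first dim.
clamp_tile_hint([8, 64], [4, 4096])  # → [4, 64]
```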
```
fn matmul_vec(w: Tensor<f32, [O, I]>, x: Tensor<f32, [I]>) -> Tensor<f32, [O]> {
  var y: Tensor<f32, [O]>
  y[o] = sum(i) w[o, i] * x[i]
  return y
}
```

```
fn scores_to_probs(scores: Tensor<f32, [H, T]>) -> Tensor<f32, [H, T]> {
  return softmax(scores)
}
```

If you define your own `softmax`, it overrides the builtin implementation.
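The override rule amounts to a simple lookup order: user definitions shadow builtins of the same name. The sketch below shows that resolution order in Python; the dictionaries and `resolve` helper are illustrative, not the compiler's actual symbol table.

```python
# Hypothetical sketch of builtin-override resolution.
BUILTINS = {"softmax": lambda xs: "builtin softmax"}

def resolve(name, user_functions):
    """User-defined functions take precedence over builtins of the same name."""
    if name in user_functions:
        return user_functions[name]
    return BUILTINS[name]

user_fns = {"softmax": lambda xs: "custom softmax"}
resolve("softmax", user_fns)([1.0])  # → "custom softmax"
resolve("softmax", {})([1.0])        # → "builtin softmax"
```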
```bash
./bin/tensalang-run <source.tl> --model <safetensors> --tokenizer <tokenizer.json> --prompt <text> [options]
```

Key options:

- `--target cpu|cuda`
- `--fused-attention 0|1|2` (default: 2)
- `--cuda-device <idx>`
- `--cuda-arch <sm_XX>`
- `--bench-tokens <n>`, `--bench-skip <n>`
- `--cpu-threads <n>`
```bash
./bin/tensalang [--run] [--emit mlir|none] [--entry name] [--target cpu|cuda] <input.sexp>
```
Planned next:

- Auto-tiling and fusion passes inside the MLIR pipeline.
- MLX and ROCm lowering pipelines under `Targets/MLX` and `Targets/ROCm`.
- Quantization and mixed-precision tooling.
Full language reference: `docs.md`

