
TensaLang

TensaLang is a tensor-first programming language, compiler, and runtime that lets you write a model's inference engine (e.g. for LLMs) in a simple high-level language, then compile it through MLIR to multiple targets (e.g. CPU, CUDA, ROCm). You can change attention, sampling, tiling, and memory placement without rewriting the compiler.



Why this is different (and why it matters)

Most LLM inference stacks are monolithic: kernels are "the runtime," and the model is encoded by the library's implementation. TensaLang flips that:

  • LLM logic is source code. RMSNorm, RoPE, attention, MLP, sampling—all written in .tl.
  • MLIR is the core IR. You get a full compiler pipeline, not a black box.
  • Scheduling is part of the language. You can hint tile sizes, parallel indices, and memory placement directly in source (with tile=..., parallel=..., memory=...).
  • Builtins are overridable. You can replace softmax or layernorm in source for custom behavior.
  • Target‑dependent lowerings live under Targets/. Today: CPU + CUDA. Planned: MLX and ROCm lowering pipelines.



Quick Start

Build:

./build.sh

Download example weights:

git clone https://huggingface.co/DatarusAI/Tensa-Lang models

This provides models/llama2_7b/ and models/Qwen2.5-0.5B-Coder-Instruct/.

Run Llama2 fp16 (CPU):

./bin/tensalang-run examples/llama2_manual_tiling_fp16.tl \
  --model models/llama2_7b/llama2_7b_f16.safetensors \
  --tokenizer models/llama2_7b/tokenizer.json \
  --prompt "Once upon a time"

Run Llama2 fp16 (CUDA):

./bin/tensalang-run examples/llama2_manual_tiling_fp16.tl \
  --model models/llama2_7b/llama2_7b_f16.safetensors \
  --tokenizer models/llama2_7b/tokenizer.json \
  --prompt "Once upon a time" \
  --target cuda \
  --cuda-arch sm_89

Run Qwen2.5-Coder-0.5B-Instruct (CUDA):

./bin/tensalang-run examples/qwen25_coder_bf16.tl \
  --model models/Qwen2.5-0.5B-Coder-Instruct/qwen25_0.5b_bf16.safetensors \
  --tokenizer models/Qwen2.5-0.5B-Coder-Instruct/tokenizer.json \
  --prompt "Once upon a time" \
  --target cuda \
  --cuda-arch sm_89

Notes:

  • Defaults: --steps 128, --target cpu, --fused-attention 2, --seed 12345.
  • The runner prints a Settings used: block at startup and an OUTPUT: block at the end.

Docker Quick Test

Build the image once:

docker build -f docker/Dockerfile -t tensalang:local .

Run with the helper (auto-mounts model/tokenizer paths):

./docker_command_exec examples/llama2_manual_tiling_fp16.tl \
  --model /path/to/llama2_7b_f16.safetensors \
  --tokenizer /path/to/tokenizer.json \
  --prompt "Once upon a time" \
  --target cuda \
  --cuda-arch sm_89

Notes:

  • Set TENSALANG_DOCKER_IMAGE to override the image name.
  • Use TENSALANG_DOCKER_FORCE_GPU=1 or TENSALANG_DOCKER_NO_GPU=1 to force GPU on/off.

Project Layout

.
├── src/                      # compiler, runtime, and sugar front-end
├── Targets/                  # target backends (CUDA, CPU, etc.)
├── examples/                 # reference programs
├── models/                   # local model weights + tokenizers (clone from HF)
├── tools/                    # conversion utilities
├── bin/                      # binaries + static runtime lib
├── tensalang_run.cpp         # C++ runner (builds bin/tensalang-run)
└── docs.md                   # full language reference

How it works (end-to-end)

Pipeline flow (Mermaid diagram)

flowchart LR
  TL[".tl source"] --> SEXPR["S-expression IR"]
  SEXPR --> MLIR["MLIR module"]
  MLIR --> PASS["MLIR passes"]
  PASS --> CPU["CPU lowering + LLVM JIT"]
  PASS --> CUDA["CUDA lowering + NVVM"]
  CPU --> RTCPU["CPU runtime + kernels"]
  CUDA --> RTCUDA["CUDA runtime + kernels"]
  RTCPU --> RUN["Token generation"]
  RTCUDA --> RUN

1) .tl source → S-expression IR

  • The surface syntax is parsed by src/tensalang_sugar.py.
  • It emits a compact S-expression IR (s-expr) that is easy to consume in C++.
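To make the front-end's job concrete, here is a minimal sketch of serializing a nested AST into s-expression text. This is an illustration only, assuming a toy tuple-based AST; the real emitter in src/tensalang_sugar.py is more involved and its exact output format may differ (e.g. bracketed shape lists).

```python
def to_sexpr(node):
    """Serialize a nested tuple/list AST into an s-expression string.
    Toy sketch: atoms print as-is, sequences become parenthesized forms."""
    if isinstance(node, (list, tuple)):
        return "(" + " ".join(to_sexpr(child) for child in node) + ")"
    return str(node)

# A fragment shaped like the attn_scores IR shown further below:
ast = ("param", "q", ("tensor", "f32", ("H", "Dh")))
print(to_sexpr(ast))  # (param q (tensor f32 (H Dh)))
```

The payoff of this representation is that the C++ side only needs a trivial recursive-descent parser to rebuild the tree.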

2) S-expression → MLIR

  • src/codegen.cpp lowers the s-expr into MLIR using linalg/scf/arith/memref ops.
  • The compiler builds a full MLIR module, then runs optimization passes.
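Conceptually, an assign-with-reduce like `s[h, t] = sum(i) q[h, i] * (k[t, i] as f32) * scale` lowers to two parallel loops over the indexed dimensions wrapping a sequential reduction loop. A pure-Python sketch of that loop nest (the compiler emits linalg/scf ops, not Python):

```python
def attn_scores_loops(q, k, scale):
    """Reference loop nest for: s[h, t] = sum(i) q[h, i] * k[t, i] * scale."""
    H, Dh, T = len(q), len(q[0]), len(k)
    s = [[0.0] * T for _ in range(H)]
    for h in range(H):           # parallel index (hinted via parallel=[h, t])
        for t in range(T):       # parallel index
            acc = 0.0
            for i in range(Dh):  # reduction index from sum(i)
                acc += q[h][i] * float(k[t][i]) * scale
            s[h][t] = acc
    return s
```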

3) MLIR → Target-specific lowering

CPU path

  • MLIR is lowered to LLVM and JIT‑compiled via MLIR ExecutionEngine.
  • Hot paths are specialized in Targets/CPU/runtime_cpu.cpp (SIMD matvec, RMSNorm, RoPE, attention).

CUDA path

  • GPU mapping passes convert loops into GPU kernels.
  • GPU kernels are lowered to NVVM, serialized to cubin, and launched via runtime wrappers.
  • CUDA runtime wrappers live in Targets/CUDA/runtime_cuda.cpp and handle:
    • CUDA context + streams
    • device alloc/free + memcpy
    • kernel launches
    • cuBLAS GEMV fast path

4) Runtime services

The runtime provides the "outside world" that the .tl program calls into:

  • Safetensors I/O
  • Tokenization + decoding
  • Sampling helpers
  • Arena allocator for temporary tensors
  • GPU helpers (CUDA, cuBLAS)

Lowering snapshots (LLM-ish attention score kernel)

Source (.tl):

fn attn_scores(q: Tensor<f32, [H, Dh]>, k: Tensor<f16, [T, Dh]>, scale: f32) -> Tensor<f32, [H, T]>
    with tile=[8, 64], parallel=[h, t] {
  var s: Tensor<f32, [H, T]>
  s[h, t] = sum(i) q[h, i] * (k[t, i] as f32) * scale
  return s
}

S-expression IR:

(program
  (fn attn_scores
    (params
      (param q (tensor f32 [H Dh]))
      (param k (tensor f16 [T Dh]))
      (param scale f32)
    )
    (returns (tensor f32 [H T]))
    (with
      (tile (list (number 8) (number 64)))
      (parallel (list (symbol h) (symbol t)))
    )
    (body
      (var s (tensor f32 [H T]))
      (assign (index (symbol s) (symbol h) (symbol t)) (reduce sum i (binary * (binary * (index (symbol q) (symbol h) (symbol i)) (cast (index (symbol k) (symbol t) (symbol i)) f32)) (symbol scale))))
      (return (symbol s))
    )
  )
)

MLIR (CUDA path, excerpt after gpu-kernel-outlining, showing GPU dialect):

module {
  func.func @attn_scores(%arg0: memref<?x?xf32>, %arg1: memref<?x?xf16>, %arg2: f32) -> memref<?x?xf32> {
    %c16 = arith.constant 16 : index
    %c1 = arith.constant 1 : index
    %c0 = arith.constant 0 : index
    %dim = memref.dim %arg0, %c0 : memref<?x?xf32>
    %0 = arith.index_cast %dim : index to i64
    %1 = arith.sitofp %0 : i64 to f32
    %dim_0 = memref.dim %arg0, %c1 : memref<?x?xf32>
    %2 = arith.index_cast %dim_0 : index to i64
    %3 = arith.sitofp %2 : i64 to f32
    %dim_1 = memref.dim %arg1, %c0 : memref<?x?xf16>
    %4 = arith.index_cast %dim_1 : index to i64
    %5 = arith.sitofp %4 : i64 to f32
    %6 = arith.fptosi %1 : f32 to i64
    %7 = arith.index_cast %6 : i64 to index
    %8 = arith.fptosi %5 : f32 to i64
    %9 = arith.index_cast %8 : i64 to index
    %alloc = memref.alloc(%7, %9) : memref<?x?xf32>
    %dim_2 = memref.dim %arg0, %c0 : memref<?x?xf32>
    %dim_3 = memref.dim %arg1, %c0 : memref<?x?xf16>
    %10 = arith.subi %dim_2, %c1 : index
    %11 = arith.subi %dim_3, %c1 : index
    %12 = arith.addi %10, %c16 : index
    %13 = arith.addi %11, %c16 : index
    %14 = arith.divsi %12, %c16 : index
    %15 = arith.divsi %13, %c16 : index
    gpu.launch blocks(%arg3, %arg4, %arg5) in (%arg9 = %14, %arg10 = %15, %arg11 = %c1)
               threads(%arg6, %arg7, %arg8) in (%arg12 = %c16, %arg13 = %c16, %arg14 = %c1) {
      %16 = arith.muli %arg3, %c16 : index
      %17 = arith.muli %arg4, %c16 : index
      %18 = arith.addi %16, %arg6 : index
      %19 = arith.addi %17, %arg7 : index
      %20 = arith.cmpi slt, %18, %dim_2 : index
      %21 = arith.cmpi slt, %19, %dim_3 : index
      %22 = arith.andi %20, %21 : i1
      scf.if %22 {
        %cst = arith.constant 0.000000e+00 : f32
        %dim_4 = memref.dim %arg0, %c1 : memref<?x?xf32>
        %23 = scf.for %arg15 = %c0 to %dim_4 step %c1 iter_args(%arg16 = %cst) -> (f32) {
          %24 = memref.load %arg0[%18, %arg15] : memref<?x?xf32>
          %25 = memref.load %arg1[%19, %arg15] : memref<?x?xf16>
          %26 = arith.extf %25 : f16 to f32
          %27 = arith.mulf %24, %26 : f32
          %28 = arith.mulf %27, %arg2 : f32
          %29 = arith.addf %arg16, %28 : f32
          scf.yield %29 : f32
        }
        memref.store %23, %alloc[%18, %19] : memref<?x?xf32>
      }
      gpu.terminator
    }
    return %alloc : memref<?x?xf32>
  }
}
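Note how the excerpt sizes the launch grid: `(dim - 1 + 16) / 16` is a ceiling division, so partial tiles still get a full 16×16 block, and the in-kernel `scf.if` bounds check masks off out-of-range threads. A quick sketch of that grid math, assuming the 16×16 blocks shown in the IR:

```python
def grid_dims(H, T, block=16):
    """Grid sizing as in the IR above: (dim - 1 + block) // block == ceil(dim / block)."""
    return ((H - 1 + block) // block, (T - 1 + block) // block)

print(grid_dims(33, 70))  # (3, 5): partial tiles get a whole block, masked by scf.if
```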

LLM‑specific optimizations

Fused attention

The Llama2 program calls attention_f16_fused. The compiler detects it and emits:

  • fused kernel (--fused-attention 1) or
  • two‑stage fused kernel (--fused-attention 2, default)

Two‑stage splits score computation and normalization for better numerical stability and efficiency. If you want to use your own attention, define attention_f16_fused in .tl.
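The idea behind the two-stage split can be sketched in pure Python: stage 1 produces raw scores plus their maximum, and stage 2 normalizes with that maximum subtracted so `exp()` cannot overflow. This is an illustration with hypothetical helper names, not the generated fused kernel.

```python
import math

def stage1_scores(q, keys, scale):
    """Stage 1: raw attention scores and their maximum."""
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    return scores, max(scores)

def stage2_normalize(scores, m):
    """Stage 2: softmax with the stage-1 max subtracted for numerical stability."""
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores, m = stage1_scores([1.0, 0.0], [[4.0, 0.0], [0.0, 4.0]], scale=0.5)
weights = stage2_normalize(scores, m)  # weights sum to 1.0
```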

cuBLAS GEMV

Matvec for projection weights can dispatch to cuBLAS GEMV (on CUDA) when layouts are compatible.

Arena allocator

arena_begin() / arena_reset() / arena_end() enable a bump‑pointer allocator that reuses memory each token, reducing allocation overhead during decoding.
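The lifecycle maps to a classic bump-pointer design: one upfront buffer, allocation is a pointer increment, and a reset recycles everything at once. A hedged pure-Python sketch of that pattern (names and sizes are illustrative, not the runtime's actual implementation):

```python
class Arena:
    """Bump-pointer arena: alloc() bumps an offset; reset() recycles everything."""

    def __init__(self, capacity):
        self.buf = bytearray(capacity)  # single upfront allocation
        self.offset = 0

    def alloc(self, nbytes):
        # No per-allocation bookkeeping and no free(): just bump the offset.
        if self.offset + nbytes > len(self.buf):
            raise MemoryError("arena exhausted")
        view = memoryview(self.buf)[self.offset:self.offset + nbytes]
        self.offset += nbytes
        return view

    def reset(self):
        # Reuse the same memory for the next token's temporaries.
        self.offset = 0

arena = Arena(1 << 20)
for _token in range(3):
    tmp = arena.alloc(4096)  # temporary tensor storage for this token
    arena.reset()            # all temporaries recycled in O(1)
```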


Source-level scheduling hints (manual tiling)

TensaLang lets you embed scheduling preferences directly in source. This is the key differentiator for experimentation.

fn attention_f16(q: Tensor<f32, [D]>,
                 key_cache: Tensor<f16, [L, SeqLen, KvDim]>,
                 value_cache: Tensor<f16, [L, SeqLen, KvDim]>,
                 layer: i32, pos: i32, H: i32, KvH: i32, scale: f32) -> Tensor<f32, [D]>
    with tile=[8, 64], parallel=[h, t],
         memory={key_cache: shared_mem, value_cache: shared_mem} {
  var att: Tensor<f32, [H, SeqLen]> = zeros([H, SeqLen])
  att[h, t] = if t > pos { -inf } else {
    sum(i) q[h * Dh + i] * (key_cache[layer, t, (h / kv_mul) * Dh + i] as f32) * scale
  }
  var weights: Tensor<f32, [H, SeqLen]> = softmax(att)
  ...
}
  • examples/llama2_manual_tiling_fp16.tl uses explicit hints.
  • examples/llama2_auto_tiling_fp16.tl omits them and lets defaults apply.

Hints are preferences; the compiler may clamp or ignore if unsafe.
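One simple form such clamping could take: cap each hinted tile size at the corresponding dimension extent. This sketch is illustrative only; the compiler's actual policy may differ.

```python
def clamp_tiles(tile_hint, dims):
    """Clamp each hinted tile size into [1, dim] for its dimension."""
    return [max(1, min(t, d)) for t, d in zip(tile_hint, dims)]

print(clamp_tiles([8, 64], [32, 16]))  # [8, 16]: the 64 hint exceeds the extent 16
```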


Example snippets

Minimal matvec

fn matmul_vec(w: Tensor<f32, [O, I]>, x: Tensor<f32, [I]>) -> Tensor<f32, [O]> {
  var y: Tensor<f32, [O]>
  y[o] = sum(i) w[o, i] * x[i]
  return y
}
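For checking a kernel like this against known inputs, a pure-Python oracle with the same semantics (`y[o] = sum(i) w[o, i] * x[i]`) can be handy. This is a reference for testing, not the compiled code path:

```python
def matmul_vec_ref(w, x):
    """Reference matvec: y[o] = sum(i) w[o, i] * x[i]."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

print(matmul_vec_ref([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0]))  # [3.0, 7.0]
```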

Softmax builtin (overridable)

fn scores_to_probs(scores: Tensor<f32, [H, T]>) -> Tensor<f32, [H, T]> {
  return softmax(scores)
}

If you define your own softmax, it overrides the builtin implementation.
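For reference, an override of softmax over a `[H, T]` tensor would need to normalize along the last axis. A hedged pure-Python sketch of the row-wise, max-subtracted form a custom definition typically computes:

```python
import math

def softmax_rows(scores):
    """Row-wise softmax over an [H, T] matrix, max-subtracted for stability."""
    out = []
    for row in scores:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out
```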


CLI

Runner (recommended)

./bin/tensalang-run <source.tl> --model <safetensors> --tokenizer <tokenizer.json> --prompt <text> [options]

Key options:

  • --target cpu|cuda
  • --fused-attention 0|1|2 (default: 2)
  • --cuda-device <idx>
  • --cuda-arch <sm_XX>
  • --bench-tokens <n>, --bench-skip <n>
  • --cpu-threads <n>

Compiler (low-level)

./bin/tensalang [--run] [--emit mlir|none] [--entry name] [--target cpu|cuda] <input.sexp>

Roadmap

Planned next:

  • Auto‑tiling and fusion passes inside MLIR pipeline.
  • MLX and ROCm lowering pipelines under Targets/MLX and Targets/ROCm.
  • Quantization and mixed‑precision tooling.

Docs

Full language reference: docs.md
