
TensaLang

TensaLang is a tensor-first programming language, compiler, and runtime that lets you write a model's inference engine (e.g. for LLMs) in a simple high-level language, then compile it through MLIR to multiple targets (e.g. CPU, CUDA, ROCm). You can change attention, sampling, tiling, and memory placement without rewriting the compiler.



Why this is different (and why it matters)

Most LLM inference stacks are monolithic: kernels are "the runtime," and the model is encoded by the library's implementation. TensaLang flips that:

  • LLM logic is source code. RMSNorm, RoPE, attention, MLP, sampling—all written in .tl.
  • MLIR is the core IR. You get a full compiler pipeline, not a black box.
  • Scheduling is part of the language. You can hint tile sizes, parallel indices, and memory placement directly in source (with tile=..., parallel=..., memory=...).
  • Builtins are overridable. You can replace softmax or layernorm in source for custom behavior.
  • Target‑dependent lowerings live under Targets/. Today: CPU + CUDA. Planned: MLX and ROCm lowering pipelines.



Quick Start

Build:

./build.sh

Download example weights:

git clone https://huggingface.co/DatarusAI/Tensa-Lang models

This provides models/llama2_7b/ and models/Qwen2.5-0.5B-Coder-Instruct/.

Run Llama2 fp16 (CPU):

./bin/tensalang-run examples/llama2_manual_tiling_fp16.tl \
  --model models/llama2_7b/llama2_7b_f16.safetensors \
  --tokenizer models/llama2_7b/tokenizer.json \
  --prompt "Once upon a time"

Run Llama2 fp16 (CUDA):

./bin/tensalang-run examples/llama2_manual_tiling_fp16.tl \
  --model models/llama2_7b/llama2_7b_f16.safetensors \
  --tokenizer models/llama2_7b/tokenizer.json \
  --prompt "Once upon a time" \
  --target cuda \
  --cuda-arch sm_89

Run Qwen2.5-Coder-0.5B-Instruct (CUDA):

./bin/tensalang-run examples/qwen25_coder_bf16.tl \
  --model models/Qwen2.5-0.5B-Coder-Instruct/qwen25_0.5b_bf16.safetensors \
  --tokenizer models/Qwen2.5-0.5B-Coder-Instruct/tokenizer.json \
  --prompt "Once upon a time" \
  --target cuda \
  --cuda-arch sm_89

Notes:

  • Defaults: --steps 128, --target cpu, --fused-attention 2, --seed 12345.
  • The runner prints a Settings used: block at startup and an OUTPUT: block at the end.

Docker Quick Test

Build the image once:

docker build -f docker/Dockerfile -t tensalang:local .

Run with the helper (auto-mounts model/tokenizer paths):

./docker_command_exec examples/llama2_manual_tiling_fp16.tl \
  --model /path/to/llama2_7b_f16.safetensors \
  --tokenizer /path/to/tokenizer.json \
  --prompt "Once upon a time" \
  --target cuda \
  --cuda-arch sm_89

Notes:

  • Set TENSALANG_DOCKER_IMAGE to override the image name.
  • Use TENSALANG_DOCKER_FORCE_GPU=1 or TENSALANG_DOCKER_NO_GPU=1 to force GPU on/off.

Project Layout

.
├── src/                      # compiler, runtime, and sugar front-end
├── Targets/                  # target backends (CUDA, CPU, etc.)
├── examples/                 # reference programs
├── models/                   # local model weights + tokenizers (clone from HF)
├── tools/                    # conversion utilities
├── bin/                      # binaries + static runtime lib
├── tensalang_run.cpp         # C++ runner (builds bin/tensalang-run)
└── docs.md                   # full language reference

How it works (end-to-end)

Pipeline flow (Mermaid diagram)

flowchart LR
  TL[".tl source"] --> SEXPR["S-expression IR"]
  SEXPR --> MLIR["MLIR module"]
  MLIR --> PASS["MLIR passes"]
  PASS --> CPU["CPU lowering + LLVM JIT"]
  PASS --> CUDA["CUDA lowering + NVVM"]
  CPU --> RTCPU["CPU runtime + kernels"]
  CUDA --> RTCUDA["CUDA runtime + kernels"]
  RTCPU --> RUN["Token generation"]
  RTCUDA --> RUN

1) .tl source → S-expression IR

  • The surface syntax is parsed by src/tensalang_sugar.py.
  • It emits a compact S-expression IR (s-expr) that is easy to consume in C++.
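To make the front-end's job concrete, here is a minimal sketch of serializing a nested AST into s-expression text. This is an illustration only, assuming a toy tuple-based AST; the real emitter in src/tensalang_sugar.py is more involved and its exact output format may differ (e.g. bracketed shape lists).

```python
def to_sexpr(node):
    """Serialize a nested tuple/list AST into an s-expression string.
    Toy sketch: atoms print as-is, sequences become parenthesized forms."""
    if isinstance(node, (list, tuple)):
        return "(" + " ".join(to_sexpr(child) for child in node) + ")"
    return str(node)

# A fragment shaped like the attn_scores IR shown further below:
ast = ("param", "q", ("tensor", "f32", ("H", "Dh")))
print(to_sexpr(ast))  # (param q (tensor f32 (H Dh)))
```

The payoff of this representation is that the C++ side only needs a trivial recursive-descent parser to rebuild the tree.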

2) S-expression → MLIR

  • src/codegen.cpp lowers the s-expr into MLIR using linalg/scf/arith/memref ops.
  • The compiler builds a full MLIR module, then runs optimization passes.
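Conceptually, an assign-with-reduce like `s[h, t] = sum(i) q[h, i] * (k[t, i] as f32) * scale` lowers to two parallel loops over the indexed dimensions wrapping a sequential reduction loop. A pure-Python sketch of that loop nest (the compiler emits linalg/scf ops, not Python):

```python
def attn_scores_loops(q, k, scale):
    """Reference loop nest for: s[h, t] = sum(i) q[h, i] * k[t, i] * scale."""
    H, Dh, T = len(q), len(q[0]), len(k)
    s = [[0.0] * T for _ in range(H)]
    for h in range(H):           # parallel index (hinted via parallel=[h, t])
        for t in range(T):       # parallel index
            acc = 0.0
            for i in range(Dh):  # reduction index from sum(i)
                acc += q[h][i] * float(k[t][i]) * scale
            s[h][t] = acc
    return s
```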

3) MLIR → Target-specific lowering

CPU path

  • MLIR is lowered to LLVM and JIT‑compiled via MLIR ExecutionEngine.
  • Hot paths are specialized in Targets/CPU/runtime_cpu.cpp (SIMD matvec, RMSNorm, RoPE, attention).

CUDA path

  • GPU mapping passes convert loops into GPU kernels.
  • GPU kernels are lowered to NVVM, serialized to cubin, and launched via runtime wrappers.
  • CUDA runtime wrappers live in Targets/CUDA/runtime_cuda.cpp and handle:
    • CUDA context + streams
    • device alloc/free + memcpy
    • kernel launches
    • cuBLAS GEMV fast path

4) Runtime services

The runtime provides the "outside world" that the .tl program calls into:

  • Safetensors I/O
  • Tokenization + decoding
  • Sampling helpers
  • Arena allocator for temporary tensors
  • GPU helpers (CUDA, cuBLAS)

Lowering snapshots (LLM-ish attention score kernel)

Source (.tl):

fn attn_scores(q: Tensor<f32, [H, Dh]>, k: Tensor<f16, [T, Dh]>, scale: f32) -> Tensor<f32, [H, T]>
    with tile=[8, 64], parallel=[h, t] {
  var s: Tensor<f32, [H, T]>
  s[h, t] = sum(i) q[h, i] * (k[t, i] as f32) * scale
  return s
}

S-expression IR:

(program
  (fn attn_scores
    (params
      (param q (tensor f32 [H Dh]))
      (param k (tensor f16 [T Dh]))
      (param scale f32)
    )
    (returns (tensor f32 [H T]))
    (with
      (tile (list (number 8) (number 64)))
      (parallel (list (symbol h) (symbol t)))
    )
    (body
      (var s (tensor f32 [H T]))
      (assign (index (symbol s) (symbol h) (symbol t)) (reduce sum i (binary * (binary * (index (symbol q) (symbol h) (symbol i)) (cast (index (symbol k) (symbol t) (symbol i)) f32)) (symbol scale))))
      (return (symbol s))
    )
  )
)

MLIR (CUDA path, excerpt after gpu-kernel-outlining, showing GPU dialect):

module {
  func.func @attn_scores(%arg0: memref<?x?xf32>, %arg1: memref<?x?xf16>, %arg2: f32) -> memref<?x?xf32> {
    %c16 = arith.constant 16 : index
    %c1 = arith.constant 1 : index
    %c0 = arith.constant 0 : index
    %dim = memref.dim %arg0, %c0 : memref<?x?xf32>
    %0 = arith.index_cast %dim : index to i64
    %1 = arith.sitofp %0 : i64 to f32
    %dim_0 = memref.dim %arg0, %c1 : memref<?x?xf32>
    %2 = arith.index_cast %dim_0 : index to i64
    %3 = arith.sitofp %2 : i64 to f32
    %dim_1 = memref.dim %arg1, %c0 : memref<?x?xf16>
    %4 = arith.index_cast %dim_1 : index to i64
    %5 = arith.sitofp %4 : i64 to f32
    %6 = arith.fptosi %1 : f32 to i64
    %7 = arith.index_cast %6 : i64 to index
    %8 = arith.fptosi %5 : f32 to i64
    %9 = arith.index_cast %8 : i64 to index
    %alloc = memref.alloc(%7, %9) : memref<?x?xf32>
    %dim_2 = memref.dim %arg0, %c0 : memref<?x?xf32>
    %dim_3 = memref.dim %arg1, %c0 : memref<?x?xf16>
    %10 = arith.subi %dim_2, %c1 : index
    %11 = arith.subi %dim_3, %c1 : index
    %12 = arith.addi %10, %c16 : index
    %13 = arith.addi %11, %c16 : index
    %14 = arith.divsi %12, %c16 : index
    %15 = arith.divsi %13, %c16 : index
    gpu.launch blocks(%arg3, %arg4, %arg5) in (%arg9 = %14, %arg10 = %15, %arg11 = %c1)
               threads(%arg6, %arg7, %arg8) in (%arg12 = %c16, %arg13 = %c16, %arg14 = %c1) {
      %16 = arith.muli %arg3, %c16 : index
      %17 = arith.muli %arg4, %c16 : index
      %18 = arith.addi %16, %arg6 : index
      %19 = arith.addi %17, %arg7 : index
      %20 = arith.cmpi slt, %18, %dim_2 : index
      %21 = arith.cmpi slt, %19, %dim_3 : index
      %22 = arith.andi %20, %21 : i1
      scf.if %22 {
        %cst = arith.constant 0.000000e+00 : f32
        %dim_4 = memref.dim %arg0, %c1 : memref<?x?xf32>
        %23 = scf.for %arg15 = %c0 to %dim_4 step %c1 iter_args(%arg16 = %cst) -> (f32) {
          %24 = memref.load %arg0[%18, %arg15] : memref<?x?xf32>
          %25 = memref.load %arg1[%19, %arg15] : memref<?x?xf16>
          %26 = arith.extf %25 : f16 to f32
          %27 = arith.mulf %24, %26 : f32
          %28 = arith.mulf %27, %arg2 : f32
          %29 = arith.addf %arg16, %28 : f32
          scf.yield %29 : f32
        }
        memref.store %23, %alloc[%18, %19] : memref<?x?xf32>
      }
      gpu.terminator
    }
    return %alloc : memref<?x?xf32>
  }
}
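Note how the excerpt sizes the launch grid: `(dim - 1 + 16) / 16` is a ceiling division, so partial tiles still get a full 16×16 block, and the in-kernel `scf.if` bounds check masks off out-of-range threads. A quick sketch of that grid math, assuming the 16×16 blocks shown in the IR:

```python
def grid_dims(H, T, block=16):
    """Grid sizing as in the IR above: (dim - 1 + block) // block == ceil(dim / block)."""
    return ((H - 1 + block) // block, (T - 1 + block) // block)

print(grid_dims(33, 70))  # (3, 5): partial tiles get a whole block, masked by scf.if
```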

LLM‑specific optimizations

Fused attention

The Llama2 program calls attention_f16_fused. The compiler detects it and emits:

  • fused kernel (--fused-attention 1) or
  • two‑stage fused kernel (--fused-attention 2, default)

Two‑stage splits score computation and normalization for better numerical stability and efficiency. If you want to use your own attention, define attention_f16_fused in .tl.
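The idea behind the two-stage split can be sketched in pure Python: stage 1 produces raw scores plus their maximum, and stage 2 normalizes with that maximum subtracted so `exp()` cannot overflow. This is an illustration with hypothetical helper names, not the generated fused kernel.

```python
import math

def stage1_scores(q, keys, scale):
    """Stage 1: raw attention scores and their maximum."""
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    return scores, max(scores)

def stage2_normalize(scores, m):
    """Stage 2: softmax with the stage-1 max subtracted for numerical stability."""
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores, m = stage1_scores([1.0, 0.0], [[4.0, 0.0], [0.0, 4.0]], scale=0.5)
weights = stage2_normalize(scores, m)  # weights sum to 1.0
```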

cuBLAS GEMV

Matvec for projection weights can dispatch to cuBLAS GEMV (on CUDA) when layouts are compatible.

Arena allocator

arena_begin() / arena_reset() / arena_end() enable a bump‑pointer allocator that reuses memory each token, reducing allocation overhead during decoding.
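The lifecycle maps to a classic bump-pointer design: one upfront buffer, allocation is a pointer increment, and a reset recycles everything at once. A hedged pure-Python sketch of that pattern (names and sizes are illustrative, not the runtime's actual implementation):

```python
class Arena:
    """Bump-pointer arena: alloc() bumps an offset; reset() recycles everything."""

    def __init__(self, capacity):
        self.buf = bytearray(capacity)  # single upfront allocation
        self.offset = 0

    def alloc(self, nbytes):
        # No per-allocation bookkeeping and no free(): just bump the offset.
        if self.offset + nbytes > len(self.buf):
            raise MemoryError("arena exhausted")
        view = memoryview(self.buf)[self.offset:self.offset + nbytes]
        self.offset += nbytes
        return view

    def reset(self):
        # Reuse the same memory for the next token's temporaries.
        self.offset = 0

arena = Arena(1 << 20)
for _token in range(3):
    tmp = arena.alloc(4096)  # temporary tensor storage for this token
    arena.reset()            # all temporaries recycled in O(1)
```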


Source-level scheduling hints (manual tiling)

TensaLang lets you embed scheduling preferences directly in source. This is the key differentiator for experimentation.

fn attention_f16(q: Tensor<f32, [D]>,
                 key_cache: Tensor<f16, [L, SeqLen, KvDim]>,
                 value_cache: Tensor<f16, [L, SeqLen, KvDim]>,
                 layer: i32, pos: i32, H: i32, KvH: i32, scale: f32) -> Tensor<f32, [D]>
    with tile=[8, 64], parallel=[h, t],
         memory={key_cache: shared_mem, value_cache: shared_mem} {
  var att: Tensor<f32, [H, SeqLen]> = zeros([H, SeqLen])
  att[h, t] = if t > pos { -inf } else {
    sum(i) q[h * Dh + i] * (key_cache[layer, t, (h / kv_mul) * Dh + i] as f32) * scale
  }
  var weights: Tensor<f32, [H, SeqLen]> = softmax(att)
  ...
}
  • examples/llama2_manual_tiling_fp16.tl uses explicit hints.
  • examples/llama2_auto_tiling_fp16.tl omits them and lets defaults apply.

Hints are preferences; the compiler may clamp or ignore if unsafe.
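One simple form such clamping could take: cap each hinted tile size at the corresponding dimension extent. This sketch is illustrative only; the compiler's actual policy may differ.

```python
def clamp_tiles(tile_hint, dims):
    """Clamp each hinted tile size into [1, dim] for its dimension."""
    return [max(1, min(t, d)) for t, d in zip(tile_hint, dims)]

print(clamp_tiles([8, 64], [32, 16]))  # [8, 16]: the 64 hint exceeds the extent 16
```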


Example snippets

Minimal matvec

fn matmul_vec(w: Tensor<f32, [O, I]>, x: Tensor<f32, [I]>) -> Tensor<f32, [O]> {
  var y: Tensor<f32, [O]>
  y[o] = sum(i) w[o, i] * x[i]
  return y
}
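For checking a kernel like this against known inputs, a pure-Python oracle with the same semantics (`y[o] = sum(i) w[o, i] * x[i]`) can be handy. This is a reference for testing, not the compiled code path:

```python
def matmul_vec_ref(w, x):
    """Reference matvec: y[o] = sum(i) w[o, i] * x[i]."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

print(matmul_vec_ref([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0]))  # [3.0, 7.0]
```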

Softmax builtin (overridable)

fn scores_to_probs(scores: Tensor<f32, [H, T]>) -> Tensor<f32, [H, T]> {
  return softmax(scores)
}

If you define your own softmax, it overrides the builtin implementation.
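For reference, an override of softmax over a `[H, T]` tensor would need to normalize along the last axis. A hedged pure-Python sketch of the row-wise, max-subtracted form a custom definition typically computes:

```python
import math

def softmax_rows(scores):
    """Row-wise softmax over an [H, T] matrix, max-subtracted for stability."""
    out = []
    for row in scores:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out
```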


CLI

Runner (recommended)

./bin/tensalang-run <source.tl> --model <safetensors> --tokenizer <tokenizer.json> --prompt <text> [options]

Key options:

  • --target cpu|cuda
  • --fused-attention 0|1|2 (default: 2)
  • --cuda-device <idx>
  • --cuda-arch <sm_XX>
  • --bench-tokens <n>, --bench-skip <n>
  • --cpu-threads <n>

Compiler (low-level)

./bin/tensalang [--run] [--emit mlir|none] [--entry name] [--target cpu|cuda] <input.sexp>

Roadmap

Planned next:

  • Auto‑tiling and fusion passes inside MLIR pipeline.
  • MLX and ROCm lowering pipelines under Targets/MLX and Targets/ROCm.
  • Quantization and mixed‑precision tooling.

Docs

Full language reference: docs.md
