Reverse engineering NVIDIA SASS from controlled kernels to production audits.
Article 1 · Article 2 · Knowledge base · Pattern library · SM120 instruction glossary · Encoding notes · Start here · Project structure · Tensor-core chapters · Contributing
SASS King is a systematic reverse-engineering project for NVIDIA SASS, the native GPU instruction set emitted inside compiled CUDA binaries. The project starts with SM120 / SM120a consumer Blackwell hardware and expands toward a full cross-architecture ISA and pattern library over time.
The goal is practical: help a kernel engineer open a SASS dump, recognize compiler patterns, identify performance-relevant structures, and connect the binary back to source-level optimization decisions.
The project has completed its initial Phase 3 pattern library: 29 reusable SASS signatures are now formalized under patterns/, with knowledge/FINDINGS.md kept as the full evidence trail. The next major step is Phase 4: applying those patterns to real production kernels.
| If you want to... | Start here | Then read |
|---|---|---|
| Understand the project in 10 minutes | docs/README.md |
docs/START_HERE.md, then docs/PROJECT_STRUCTURE.md |
| Reproduce the evidence | corpus/README.md |
one chapter conclusion*.md, then its .sass dump |
| Find the source of truth | knowledge/FINDINGS.md |
knowledge/SASS_INSTRUCTIONS_SM120.md, knowledge/encoding/README.md |
| Recognize a pattern in a new dump | patterns/README.md |
the matching patterns/NN-*.md page |
| Start a production audit | production/README.md |
matching PATTERN-NN pages and source evidence |
| Contribute a correction or dump | CONTRIBUTING.md |
docs/START_HERE.md |
The repository is organized as an evidence pipeline:
corpus/ controlled kernels and raw SASS evidence
knowledge/ project-wide findings, instruction notes, and encoding notes
patterns/ reusable Phase 3 audit signatures
production/ Phase 4 real-kernel audits
The last broad public SASS reverse-engineering work comparable in spirit was Jia et al. on Volta and Turing in 2018. Ampere, Hopper, and Blackwell have changed the instruction mix substantially: async copy paths, tensor-core families, matrix load/store instructions, sparse and scaled MMA forms, and new uniform-register flows.
SASS King fills that gap by combining controlled micro-kernels, raw SASS reading, runtime probes, and production-kernel audits.
| Area | Status | Where |
|---|---|---|
| SM120 teaching kernels | Complete through kernels 01-12 | corpus/basics/01_vector_add/ to corpus/math_and_spills/12_register_spill/ |
| Tensor-core studies | Complete through Kernel 25 | corpus/tensor_cores/ |
| Global findings | Active source of truth | knowledge/FINDINGS.md |
| SM120 instruction glossary | Active, evidence-backed | knowledge/SASS_INSTRUCTIONS_SM120.md |
| Encoding pilots | Started with LDSM, STSM, QMMA |
knowledge/encoding/ |
| denvdis cross-validation | Initial pass complete; deeper control-code gaps remain | knowledge/DENVDIS_INTEGRATION.md |
| Pattern library | Initial Phase 3 library complete | patterns/ |
| Production audits | Next phase | production/ |
The formal pattern library is the main output of Phase 3. It turns the chapter-local evidence into reusable audit signatures, so an audit can cite a named pattern instead of rewriting the full research trail every time.
Phase 3 is considered complete because:
- the repeated structures found in chapters 01-25 have been promoted into 29 named pattern pages;
- every pattern has a plain-English explanation, SASS signature, variants, anti-patterns, open gaps, and confidence level;
- claim tags remain bounded to the source evidence in
knowledge/FINDINGS.md; - audit-facing navigation now starts from
patterns/README.md; - unresolved items are explicitly carried forward as gaps instead of being hidden inside the pattern text.
| Pattern family | Examples | Where |
|---|---|---|
| Tensor-core compute | HMMA, QMMA, OMMA accumulator chains; sparse metadata; narrow fragments | patterns/02-* to patterns/04-*, patterns/10-*, patterns/21-* |
| Matrix memory and epilogues | LDSM, STSM, async copy pipelines, REDG reduction epilogues | patterns/05-*, patterns/06-*, patterns/07-*, patterns/28-* |
| Control flow | divergence/reconvergence, loop back-edges, predicated exits, cold traps, local CALLs | patterns/08-*, patterns/14-*, patterns/16-*, patterns/26-*, patterns/29-* |
| Memory and registers | vectorized global memory, spills, shared-memory staging, descriptors, uniform-register flow | patterns/09-*, patterns/11-*, patterns/17-*, patterns/19-*, patterns/20-* |
| Arithmetic and scheduling | FFMA fusion, constants, MUFU slowpaths, scoreboards, lifetime recycling | patterns/12-*, patterns/18-*, patterns/22-*, patterns/23-*, patterns/24-* |
| Warp collectives | warp reductions, shuffle/vote/match/sync primitives | patterns/01-*, patterns/25-* |
Each pattern page includes:
- plain-English meaning;
- SASS signature;
- observed variants;
- interpretation boundaries;
- anti-patterns;
- open gaps;
- confidence level.
Use patterns/README.md as the audit-facing index. Use knowledge/FINDINGS.md when you need the longer research context behind a pattern.
Phase 3 does not claim that every NVIDIA SASS behavior is decoded. It establishes a reusable SM120 / SM120a pattern layer good enough to begin manual production audits. Runtime layout decode, full control-code bit placement, automated cubin reporting, and cross-architecture replay remain future work.
- New to the project: read Start Here.
- Want the project-wide map: read the knowledge base index.
- Want the current instruction map: read SASS instructions on SM120 / SM120a.
- Want encoding notes: read encoding notes.
- Want the raw source of truth: read findings.
- Want reusable audit signatures: read the pattern library.
- Want tensor-core evidence: start with tensor-core chapters.
- Want to contribute dumps or corrections: read contributing.
- Want the v0.1 boundary: read release notes.
Public writeups:
Controlled variation. Two kernels differ by exactly one variable: dtype, operand order, unroll factor, memory layout, or compilation target. The SASS diff isolates the compiler decision.
Strict claim tags. Every technical claim uses a tag:
| Tag | Meaning |
|---|---|
[OBS] |
Directly observed in a dump, log, runtime output, or profile. |
[INF] |
Inferred from observed evidence. |
[HYP] |
Plausible but not confirmed. |
[RES] |
A prior hypothesis resolved by later evidence. |
[GAP] |
Open question documented explicitly. |
Top-down and bottom-up together. Micro-kernels isolate individual instructions and compiler decisions. Production-like kernels show which patterns matter in real code.
Pattern-first audits. A production audit should cite a formal PATTERN-NN page only after matching the visible SASS signature and carrying over its confidence limits, anti-patterns, and open gaps.
The first pass focuses on the SM120 tensor-core and memory pipeline:
HMMA,QMMA,OMMALDSM,STSMLDGSTS,LDGDEPBAR,DEPBARLDG,STG,LDS,STS,REDGBRA,EXIT,BSSY,BSYNC,WARPSYNCSHFL,VOTE,REDUX- uniform-register flow:
S2UR,R2UR,UMOV,ULEA,LDCU
The project does not pretend the ISA is complete yet. The public glossary tracks what is observed and explained; deeper pages under knowledge/encoding/ track families with enough evidence for matcher-style documentation.
SASS King does not compete with bit-level SASS disassemblers. The project uses local dumps as primary evidence and may use redplait/denvdis as a cross-check for instruction fields, scheduling tables, predicates, and register tracking. denvdis can validate low-level encoding interpretations; SASS King owns the controlled-variation evidence, semantic pattern layer, and production-audit interpretation.
flowchart LR
P1["Phase 1<br/>Teaching kernels<br/>01-12"] --> P2["Phase 2<br/>SM120 tensor-core corpus<br/>13-25"]
P2 --> P25["Phase 2.5<br/>denvdis cross-validation<br/>bit-level backend"]
P25 --> P3["Phase 3<br/>Pattern library<br/>compiler signatures"]
P3 --> P4["Phase 4<br/>Production audits<br/>real kernels"]
P4 --> P5["Phase 5<br/>Audit tool<br/>cubin reports"]
P5 --> P6["Phase 6<br/>Cross-architecture replay<br/>SM80/86/89/90a/100a/120"]
classDef done fill:#0b6d55,color:#fff,stroke:#0b6d55;
classDef active fill:#f4c95d,color:#111,stroke:#b89422;
classDef planned fill:#1f2937,color:#fff,stroke:#6b7280;
class P1,P2,P25,P3 done;
class P4 active;
class P5,P6 planned;
| Phase | Status | Output | Why it matters |
|---|---|---|---|
| 1. Teaching kernels | Done | corpus/basics/, corpus/warp_collectives/, corpus/math_and_spills/ |
Establishes the reading vocabulary from controlled CUDA-to-SASS experiments. |
| 2. SM120 tensor-core corpus | Done | corpus/tensor_cores/13_hmma_fp16/ to 25_stsm_epilogue/ |
Captures the first SM120 / SM120a tensor-core, matrix-memory, control-flow, and epilogue evidence set. |
| 2.5. denvdis cross-validation | Initial pass complete | knowledge/DENVDIS_INTEGRATION.md, knowledge/encoding/CONTROL_CODE.md |
Uses denvdis as a bit-level cross-check without replacing local dump evidence. Full stall/yield bit placement remains open. |
| 3. Pattern library | Initial library complete | patterns/ |
Turns repeated compiler/SASS structures into reusable signatures. |
| 4. Production audits | Next | production/ |
Tests whether corpus patterns explain real kernels from production libraries. |
| 5. Audit tool | Planned | cubin-to-report pipeline | Makes the pattern layer scriptable and repeatable. |
| 6. Cross-architecture replay | Planned | SM80, SM86, SM89, SM90a, SM100a, SM120 comparisons | Separates architecture-specific facts from general NVIDIA SASS behavior. |
Kernels 01-12 establish baseline SASS concepts: FMA fusion, scoreboard behavior, loop lowering, shared memory, global memory, warp primitives, slow-path math, and local-memory spills.
Kernels 13-25 cover the current SM120 tensor-core path:
| Kernel | Topic |
|---|---|
| 13 | HMMA baseline, register allocation, accumulator chaining |
| 14 | QMMA FP8 / FP6 / FP4 baseline |
| 15 | Narrow MMA variants |
| 16 | FP4 peak and block-scaled OMMA/QMMA |
| 17 | LDSM and matrix-load behavior |
| 18 | Pipelined MMA tile and async copy staging |
| 19 | Sparse MMA metadata |
| 20 | Control flow and back-edge detection |
| 21 | Divergence and reconvergence |
| 22 | STSM matrix-store behavior |
| 23 | FP4 / FP6 fragment layout probes |
| 24 | Production mini-GEMM audit |
| 25 | STSM epilogue layout and storeback semantics |
Validate redplait/denvdis as the bit-level cross-check backend for SM120 / SM120a before production audits depend on the pattern library. The pass runs nvd -O, nvd -S, nvd -p, and where useful nvd -T on representative local cubins or dumps covering HMMA, QMMA, QMMA.SF, QMMA.SP, OMMA, LDSM, STSM b16/b8, LDGSTS, DEPBAR, and divergence markers.
The output is knowledge/DENVDIS_INTEGRATION.md: a factual compatibility table from family to denvdis recognition status, modifier coverage, exposed control-code fields, and the SASS King action. denvdis output is supporting evidence, not a replacement for local dump observations.
Formalized recurring structures into reusable signatures:
LDGSTS -> DEPBAR -> LDSM -> MMA- chained
HMMA/QMMA/OMMA STSM -> BAR -> LDS -> STG- warp reductions and cross-lane collectives
- register-spill signatures
- scalar and uniform control-flow patterns
The initial Phase 3 library contains 29 pattern pages under patterns/. knowledge/FINDINGS.md remains the research log and source of truth; patterns/ is the audit-facing entry point.
Phase 3 is complete at the initial library level. Remaining items such as runtime layout decode, full control-code bit placement, and cross-architecture replay are tracked as gaps or future phases, not blockers for starting Phase 4 production audits.
Apply the pattern library to real kernels from libraries such as FlashAttention, CUTLASS, xFormers, Transformer Engine, FlashInfer, llama.cpp / ggml, tinygrad, and related projects. The goal is representative coverage by algorithmic pattern, not one markdown file per kernel.
The first Phase 4 deliverable should be a manual audit report that:
- segments one real kernel into SASS regions;
- cites matching
PATTERN-NNpages; - assigns confidence levels to each conclusion;
- records unexplained regions as new gaps;
- avoids building an audit tool until at least one manual report is stable.
Build a pipeline that takes a cubin, detects known patterns, and emits an optimization-oriented report.
Replay the methodology on additional targets:
| Arch | Representative GPU | Why |
|---|---|---|
| SM80 | A100 | Datacenter Ampere baseline |
| SM86 | RTX 3090 | Consumer Ampere corpus |
| SM89 | RTX 4090 | Common consumer inference card |
| SM90a | H100 | TMA, WGMMA, warp specialization, clusters |
| SM100a | B200 | tcgen05.mma, TMEM |
| SM120 | RTX 5070 Ti / 5090 | Consumer Blackwell starting point |
.
├── corpus/ # Controlled kernels, dumps, and chapter writeups
│ ├── basics/ # Kernels 01-08: scalar/vector and memory basics
│ ├── warp_collectives/ # Kernels 09-10: shuffle, vote, reduction
│ ├── math_and_spills/ # Kernels 11-12: slow paths and spills
│ └── tensor_cores/ # Kernels 13-25: tensor-core studies
├── knowledge/ # Findings, glossary, encoding notes
│ ├── FINDINGS.md
│ ├── SASS_INSTRUCTIONS_SM120.md
│ └── encoding/
├── patterns/ # Formal Phase 3 pattern library
├── production/ # Phase 4 production-kernel audits
├── docs/ # Onboarding, structure, and release-facing notes
└── guide/ # External SASS reading guide submodule
Each chapter folder contains source kernels, compiled artifacts when relevant, SASS dumps when they are part of the validated evidence set, and a conclusion<N>.md writeup.
For a fuller explanation of what belongs in each directory, read Project Structure.
cuobjdump --dump-sassfor raw disassembly.gpuasm.comfor scoreboards, stalls, pressure, and dependency arrows.- Nsight Compute for profiling and stall attribution.
%clockmicrobenchmarks for instruction latency probes.nvcc -Xptxas -vfor register and spill metadata.
- Jia et al. 2018, "Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking" for the empirical microbenchmarking discipline behind latency, throughput, and dependency validation.
- kuterdinel.com/nv_isa for fuzzed NVIDIA ISA encoding work, especially the idea of deriving machine-readable encoding rules from disassembler behavior.
- redplait/denvdis for opcode tables, bit-level disassembly, encoding-field inspection, scheduling analysis, register tracking, and cubin manipulation. The extracted SM120 data12 tables are used as a low-level cross-check while local dumps remain primary evidence.
- Redplait tooling and notes: ced cubin editor, SASS disassembly Perl bindings, SASS latency analysis, and libcuda/nvasm_internal notes.
- Huerta et al. 2025 for reverse-engineering compiler-guided scheduling, control codes, dependency counters, reuse flags, and yield behavior.
- Yan et al. 2026 for the driver-layer launch and pushbuffer analysis below SASS.
- MaxAS and TuringAS as prior public SASS assembler efforts for older NVIDIA architectures.
- NVIDIA CUDA Binary Utilities documentation for official cubin, fatbin, and disassembly tooling.
SASS King operates at the algorithmic pattern layer: recognizing how compiled kernels are structured and connecting those structures to source-level optimization decisions.
Contributions are welcome, especially:
- raw SASS dumps from hardware not directly available here;
- controlled kernel studies that isolate one compiler decision;
- corrections to existing observations;
- new production-kernel pattern proposals;
- cross-architecture comparisons.
See CONTRIBUTING.md for the expected metadata and writing standard.
Florian Mattana. florianmattana.com