feat(safetensors): support F8_E4M3 / F8_E5M2 / F8_E8M0 / I8 dtypes#35
Open
mikeumus wants to merge 12 commits into chrishayuk:main from
Conversation
Proposes extending LarQL from weight-analysis into analysis+editing via three new subcommands that implement ROME/MEMIT-family algorithms on top of the existing larql-inference forward pass and capture hooks. Based on 9 chapters of experimentation on Gemma 4 (4B and 26B) documented in Divinci-AI/server notebooks/CHAPTER_15 through CHAPTER_23:

- `larql crown`: per-edit crown-layer discovery via module ablation
- `larql edit`: single-fact rank-1 edit with auto-scale calibration
- `larql memit`: batch fact editing via joint least-squares, grouped by crown

Also defines a patch file format (~55 KB per Gemma 4 4B single edit) and a non-destructive `larql apply-patch` command. Phased 4-step rollout plan.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implements Phase A of RFC-0001 (#2): per-layer MLP ablation scan to find the layer whose last-position MLP output is load-bearing for a given (prompt, expected-token) pair.

Changes:
- crates/larql-inference/src/ffn/ablating.rs — new `LastPositionAblatingFfn` that wraps any `FfnBackend` and zeroes its output at the last-token row for one target layer. Thin wrapper, no math changes.
- crates/larql-cli/src/commands/extraction/crown_cmd.rs — new `larql crown` subcommand. Tokenises the prompt, runs a baseline forward pass, then iterates layers in [start..=end] running `predict_with_ffn` against the ablating backend, reports per-layer Δ in expected-token probability, and picks the layer whose ablation causes the top prediction to flip with the largest suppression magnitude.

Methodology matches Phase 125c of Divinci-AI/server notebooks/CHAPTER_17_CORONATION.md — on Gemma 4 4B, ablating the L27 MLP on "Capital of France? A:" flips the top prediction from " Paris" to "France" (the country token). The command outputs JSON (optional --json) so downstream commands (edit, memit) can consume the crown_layer field.

Compile-checked with `cargo check --package larql-cli`.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
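The crown-pick rule described above (restrict to layers whose ablation flips the top-1 prediction, then maximize suppression of the expected token) can be sketched in isolation. The `LayerScan` struct and function names below are hypothetical illustrations, not the actual crown_cmd.rs API:

```rust
/// One row of the ablation scan: per-layer result of zeroing the MLP
/// output at the last position. (Hypothetical names, for illustration.)
struct LayerScan {
    layer: usize,
    expected_prob: f32, // P(expected token) with this layer ablated
    top_token: u32,     // argmax token id with this layer ablated
}

/// Among layers whose ablation flips the top-1 away from the expected
/// token, pick the one with the largest drop in expected-token probability.
fn pick_crown(baseline_prob: f32, expected_token: u32, scans: &[LayerScan]) -> Option<usize> {
    scans
        .iter()
        .filter(|s| s.top_token != expected_token) // top prediction flipped
        .map(|s| (s.layer, baseline_prob - s.expected_prob)) // suppression
        .max_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(layer, _)| layer)
}

fn main() {
    let scans = vec![
        LayerScan { layer: 10, expected_prob: 0.90, top_token: 42 }, // no flip
        LayerScan { layer: 27, expected_prob: 0.05, top_token: 7 },  // flip, large drop
    ];
    println!("{:?}", pick_crown(0.95, 42, &scans)); // Some(27)
}
```

Returning `None` when no layer flips the prediction is a deliberate choice here: it surfaces "no crown found" to the caller rather than silently picking a weakly-suppressing layer.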
… RFC-0001) (#7) Implements Phase B of RFC-0001 (#2): single-fact rank-1 editor with portable patch file format. Builds on Phase A's `LastPositionAblatingFfn` (#3) and adds the symmetric `LastPositionInjectingFfn` for scale search.

### New library module: `larql-inference/src/edit.rs`
- `EditPatch` struct (serializable via serde)
- `compute_rank1(k, d, scale, layer, provenance) -> EditPatch`
- `write_patch(path, &patch)` / `read_patch(path) -> EditPatch` with a simple binary format: LQPATCH magic + JSON meta + little-endian f32 vectors for d and k_norm. ~55 KB for Gemma 4 4B.
- `apply_patch(&mut ModelWeights, &EditPatch)`: installs the rank-1 outer product into `down_proj.weight` in place, handling both `[hidden, intermediate]` and `[intermediate, hidden]` layouts.

### New FFN wrapper: `larql-inference/src/ffn/injecting.rs`
- `LastPositionInjectingFfn` — adds a fixed delta vector to the inner backend's last-row output at one target layer. Symmetric to the ablating wrapper from PR #3. Used for auto-scale search.

### New CLI commands
- `larql edit <model> --src "..." --tgt "..." --new-token " Tokyo" --output f2t.lqpatch`
  Runs Phase A crown discovery (or accepts `--layer`), captures k at the crown layer for both prompts, computes d = W_down @ (k_tgt - k_src), linearly searches [0.5, 1, 1.5, 2, 2.5, 3, 4] for the minimum scale that flips the source's top-1 to --new-token, and emits the patch.
- `larql apply-patch <model> --patch f2t.lqpatch --prompt "..."`
  Non-destructively installs one or more patches into the loaded weights and optionally runs a test prediction. Supports `--reverse` to subtract a patch (verifies reversibility).

### Supporting change
- Added `InferenceModel::weights_mut()` accessor so apply-patch can mutate the in-memory weight map without reloading.
Methodology validated in Python across Divinci-AI/server notebooks/CHAPTER_20_HONEY.md (Phase 140c: France→Tokyo with 11/11 specificity at 0.9% weight perturbation) and CHAPTER_18_THE_EDIT.md (Phase 130 scale search). The Rust port preserves the same math.

Compile-checked with `cargo check --package larql-cli`.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
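The rank-1 install described in the Phase B notes above can be sketched against a flat row-major buffer. This is an illustrative assumption — the real `apply_patch` operates on the loaded tensor type — and for the transposed `[intermediate, hidden]` layout the caller would swap the roles of d and k_norm:

```rust
/// Minimal sketch of W += scale * d ⊗ k_norm for the [hidden, intermediate]
/// layout, assuming flat row-major storage (illustration only; the real
/// apply_patch works on the loaded weight tensor).
fn apply_rank1(w: &mut [f32], rows: usize, cols: usize, d: &[f32], k_norm: &[f32], scale: f32) {
    assert_eq!(w.len(), rows * cols);
    assert_eq!(d.len(), rows);      // rows index the output-direction vector d
    assert_eq!(k_norm.len(), cols); // cols index the normalized key k_norm
    for r in 0..rows {
        for c in 0..cols {
            w[r * cols + c] += scale * d[r] * k_norm[c];
        }
    }
}

fn main() {
    let mut w = vec![0.0f32; 4];
    apply_rank1(&mut w, 2, 2, &[1.0, 2.0], &[3.0, 4.0], 0.5);
    println!("{:?}", w); // [1.5, 2.0, 3.0, 4.0]
}
```

Because the update is purely additive, applying the same patch with the scale negated restores the original weights exactly, which is what the `--reverse` flag relies on.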
Wraps the existing covariance-MEMIT solver (`larql_inference::forward::memit::run_memit`) with a CLI, an edits.json file format, and automatic crown-layer discovery for each edit. Groups edits by crown layer, invokes the joint least-squares solve, and emits one dense `.lqpatch` per affected layer plus a manifest.json. Phase C of RFC-0001 (#2), stacked on Phase B (#4).

### Extended patch file format (still backward compatible)
- Bumped patch version 1 → 2 with a `kind` field (defaults to "rank_one").
- New `kind = "dense"` variant carries a flat row-major ΔW matrix, needed because MEMIT's covariance-projected solve isn't natively a rank-1 outer product. Larger on disk (~72 MB per Gemma 4 4B layer) but semantically exact — no SVD approximation step.
- `write_patch`, `read_patch`, `apply_patch` all dispatch on kind. Phase B rank-1 patches continue to round-trip unchanged.
- New `compute_dense()` helper builds a Dense patch from an `Array2<f32>`.

### New CLI: `larql memit`
- Reads edits.json (a list of {label, src, new_token, layer?} records).
- For each edit: tokenises src, resolves target_token_id, resolves the crown layer (explicit or auto-scan).
- Calls `run_memit` with `Vec<MemitFact>`, receives one `MemitResult` per affected layer.
- Serialises each layer's ΔW as a Dense patch into the output directory and writes a manifest.json enumerating them.
- Prints the apply-patch command to install the batch.

### Usage

```
cat > edits.json <<EOF
[
  {"label":"france-to-tokyo","src":"Capital of France? A:", "new_token":" Tokyo","layer":27},
  {"label":"germany-to-rome","src":"Capital of Germany? A:", "new_token":" Rome","layer":27}
]
EOF

larql memit /path/to/gemma4 --edits edits.json --output patches/

larql apply-patch /path/to/gemma4 \
  -p patches/memit_L27.lqpatch \
  --prompt "Capital of France? A:"
```

### Known ceiling
Chapter 22 established that single-layer MEMIT with correlated keys (~60% cosine) lands ~3/5 concurrent targets.
For 5+ correlated edits, users can now distribute across multiple crown layers via `layer` overrides in edits.json — MEMIT runs once per layer group.

Compile-checked with `cargo check --package larql-cli`.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
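The grouping step (one MEMIT solve per crown layer) can be sketched as follows. `EditSpec` is a hypothetical, trimmed-down edits.json record — the real one also carries src and new_token:

```rust
use std::collections::BTreeMap;

/// Hypothetical, trimmed-down edits.json record (the real record also
/// carries src and new_token); `layer` is the resolved crown layer.
#[derive(Debug, Clone)]
struct EditSpec {
    label: String,
    layer: usize,
}

/// Group edits by crown layer so the joint least-squares solve runs
/// once per layer group, producing one dense patch per layer.
fn group_by_crown(edits: Vec<EditSpec>) -> BTreeMap<usize, Vec<EditSpec>> {
    let mut groups: BTreeMap<usize, Vec<EditSpec>> = BTreeMap::new();
    for e in edits {
        groups.entry(e.layer).or_default().push(e);
    }
    groups
}

fn main() {
    let edits = vec![
        EditSpec { label: "france-to-tokyo".into(), layer: 27 },
        EditSpec { label: "germany-to-rome".into(), layer: 27 },
        EditSpec { label: "spain-to-oslo".into(), layer: 21 },
    ];
    let groups = group_by_crown(edits);
    println!("{} layer groups", groups.len()); // 2 layer groups
}
```

A `BTreeMap` keeps the layer groups in ascending order, so patch files and the manifest come out in a deterministic order across runs.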
… of RFC-0001) (#9) Exposes the Phase A-C commands as Python callables so the Chapter 15-23 Colab experiments from Divinci-AI/server become one-liner Rust invocations from Jupyter — no CLI shell-outs, no JSON parsing.

### New module: crates/larql-python/src/edit_py.rs
Four #[pyfunction] entry points:
- `crown(model, prompt, expect, start_layer=None, end_layer=None, top_k=100)` — returns {crown_layer, crown_delta_prob, top_after_ablation, scan: [...]}.
- `edit(model, src, tgt, new_token, output, layer=None, scales=None, fixed_scale=None, top_k=100, label=None)` — writes a rank-1 .lqpatch; returns {layer, scale, output, d_norm}.
- `apply_patch(model, patches: list[str], prompt=None, top_k=5, reverse=False)` — applies patches in-memory; an optional prompt returns {predictions: [(tok, prob), ...]}.
- `memit(model, edits: list[dict], output_dir, ridge=0.01, target_alpha=1.0, top_k=100)` — batch fact editor wrapping run_memit; writes one dense patch per layer into output_dir plus a manifest.

### Wiring
- Registered in the _native pymodule (src/lib.rs) via m.add_function.
- Re-exported from python/larql/__init__.py under the public `larql` namespace alongside the existing load_vindex/create_session functions.

### Example

```python
import larql

scan = larql.crown("/path/to/gemma4", "Capital of France? A:", " Paris")
print(scan["crown_layer"])  # 27 (on Gemma 4 4B)

larql.edit("/path/to/gemma4", src="Capital of France? A:",
           tgt="Capital of Japan? A:", new_token=" Tokyo",
           output="france_to_tokyo.lqpatch")

r = larql.apply_patch("/path/to/gemma4", patches=["france_to_tokyo.lqpatch"],
                      prompt="Capital of France? A:")
print(r["predictions"][0])  # ['Tokyo', 0.97]
```

This closes the RFC-0001 phased rollout: Python scripts can now drive the mechanistic fact-editing pipeline end-to-end.

Compile-checked with `cargo check --package larql-python`. Runtime import requires `maturin develop` — standard PyO3 workflow; no Python side of the package changed structurally.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…#10) Gemma 4's `use_double_wide_mlp=True` widens gate/up/down_proj to 2× base `intermediate_size` on KV-shared layers. On gemma-4-e2b-it (35 layers, last 20 shared), layers 15–34 have `intermediate=12288`, layers 0–14 have 6144. Crown-scan defaults to `(3n/5)=21` and lands on a double-wide layer, so the rank-1 edit hit `intermediate-size mismatch in captured keys` against the config-wide base size.

Adds `ModelArchitecture::intermediate_size_for_layer(layer) -> usize` (default = `config.intermediate_size`, mirroring `head_dim_for_layer`). `Gemma4Arch` overrides by reusing the precomputed `kv_sources` set — one source of truth for KV-shared-layer membership.

Thread the per-layer lookup through:
- `edit_py.rs`: compute `intermediate` after `chosen_layer` is picked.
- `edit_cmd.rs`: same for the CLI path.
- `memit.rs`: `ffn_dim` is now per-layer; `run_memit` already solves per layer, so covariances remain correctly sized across mixed layers.

Parse `use_double_wide_mlp` in `detect.rs`; add it to `ModelConfig`.

Tests (in `detect.rs`):
- `test_detect_gemma4_e2b`: asserts 6144 on L0/L14, 12288 on L15/L21/L34 — matches the actual HF tensor shapes verified in the Colab repl.
- `test_gemma4_31b_no_double_wide`: 31B lacks the flag → base everywhere.
- `test_non_gemma4_intermediate_default`: Llama returns base for all layers via the default trait impl.

The bare `weights.intermediate_size` field is left as "base" for display / metadata call sites (demos, patch-print, vindex stats). Patch file format unchanged: `compute_rank1` / `compute_dense` already derive `intermediate_size` from the runtime tensor, so new patches for double-wide layers store 12288 correctly without a version bump.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
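The default-method-plus-override shape described above can be sketched as follows. Trait and struct names mirror the commit message, but the fields are simplified assumptions — the real `Gemma4Arch` derives membership from its precomputed `kv_sources` set rather than a bare `HashSet` field:

```rust
use std::collections::HashSet;

// Sketch of the per-layer intermediate-size lookup (simplified assumptions;
// not the actual larql trait definition).
trait ModelArchitecture {
    fn base_intermediate_size(&self) -> usize;

    /// Default: config-wide base size (mirrors the head_dim_for_layer pattern).
    fn intermediate_size_for_layer(&self, _layer: usize) -> usize {
        self.base_intermediate_size()
    }
}

struct Gemma4Arch {
    base: usize,
    double_wide: bool,         // parsed use_double_wide_mlp flag
    kv_shared: HashSet<usize>, // KV-shared layer indices
}

impl ModelArchitecture for Gemma4Arch {
    fn base_intermediate_size(&self) -> usize {
        self.base
    }

    fn intermediate_size_for_layer(&self, layer: usize) -> usize {
        if self.double_wide && self.kv_shared.contains(&layer) {
            self.base * 2 // double-wide MLP on KV-shared layers
        } else {
            self.base
        }
    }
}

fn main() {
    // gemma-4-e2b-it shape from the commit message: 35 layers, last 20 shared.
    let arch = Gemma4Arch {
        base: 6144,
        double_wide: true,
        kv_shared: (15..35).collect(),
    };
    println!("{}", arch.intermediate_size_for_layer(14)); // 6144
    println!("{}", arch.intermediate_size_for_layer(21)); // 12288
}
```

Architectures without the flag never override the method, so they fall through to the default base size for every layer — the same behaviour the `test_non_gemma4_intermediate_default` test pins down.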
…g fixes (#12)

* feat(models): per-layer intermediate_size for Gemma 4 double-wide MLP

  (Same message as PR #10 above.)

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: write-lock starvation on INFER + patch-revert down/up vector leak

  Three fixes for larql-server session management:
1. **Bug 1 — write-lock starvation on INFER**: switched `sessions_blocking_write` → `sessions_blocking_read` on the INFER path; made `last_accessed` an `AtomicU64` so `touch()` takes `&self`.
2. **Bug 2 — rebuild_overrides leak**: added `base.down_overrides.clear()` + `base.up_overrides.clear()` before replaying patches on remove.
3. **Bug 3 — blocking_read inside async**: pre-acquire the base vindex before entering the write lock in apply_patch to avoid a tokio panic.

All three gates verified: T2 concurrent PASS, T3 global-leak PASS, T4 throughput PASS (mixed p50 0.94× same-session), T5 revert PASS.

* ci: add isolation-harness gates + synthetic tiny-vindex testdata

  Three gates run on every push/PR (T2=concurrent, T3=global-leak, T5=revert). Requires the HARNESS_REPO_TOKEN secret (fine-grained PAT, Contents:read on Divinci-AI/larql-isolation-harness). testdata/tiny-vindex is a reproducible 5 MB synthetic vindex generated by generate.py (seed=42, 8 layers, hidden=128) — no real model weights needed.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
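The Bug-1 pattern generalizes: storing `last_accessed` as an `AtomicU64` means touching a session needs only `&self`, so the INFER hot path can hold a shared read lock on the session map instead of a write lock that starves other readers. A simplified sketch (the real `Session` holds much more state):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{SystemTime, UNIX_EPOCH};

// Simplified sketch of the &self touch() pattern; not the actual
// larql-server Session type.
struct Session {
    last_accessed: AtomicU64, // unix seconds
}

impl Session {
    /// Takes &self, not &mut self: callers holding only a shared read
    /// lock on the session map can still refresh the timestamp.
    fn touch(&self) {
        let now = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .expect("clock before unix epoch")
            .as_secs();
        self.last_accessed.store(now, Ordering::Relaxed);
    }
}

fn main() {
    let s = Session { last_accessed: AtomicU64::new(0) };
    s.touch(); // works through a shared reference
    println!("{}", s.last_accessed.load(Ordering::Relaxed) > 0); // true
}
```

`Ordering::Relaxed` is sufficient here because the timestamp is advisory (idle-session eviction), not a synchronization point.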
* working on arch b, unified insert
* working on memit with vindex, and templates
* memit style
* working on latest memit
* working on wasm
* working on wasm
* cleaned up vindex and larql
* fix: Linux support — conditional BLAS and Q4 scalar fallback

  Implement Q4 scalar fallback for non-ARM targets:
  - Move decode_f16() before #if aarch64 (shared by both paths)
  - Replace empty stub functions with correct scalar implementations
  - q4_0_matvec_c and q4_0_vecmat_c now produce correct results on x86_64

  Affects: larql-compute/csrc/q4_dot.c

  Tested on Ubuntu 24 (WSL2, x86_64): cargo build --release and cargo test --workspace pass with 0 failures. macOS path untested — preserves accelerate via cfg(target_os) and requires validation on Apple hardware.
* working on bounded compute script
* refactored lql
* improved refactor
* updated executor
* gemma 4
* working on compute
* improved for gemma 4
* test: cherry-pick GGUF shape + Q4 correctness tests from chrishayuk#20
* updated examples
* working through python parity
* working on q4k tidyup
* improving testing and quantization
* improving testing
* gemma 4 support
* improved clu
* autoregressive generation
* kv cache works
* working on shader pipeline
* working shaders
* working on shaders and graph
* moved to full graph
* working through ffn walk performance
* working version
* modularized shaders
* working on decoupling decode
* working on performance
* more performance improvements
* improving performance
* more performance improvements
* working on performance
* working on distributed grid
* working on grid
* improving docs and moe
* working on moe
* improved publish pull
* binary format
* working binary format and performance
* updated vindex server specs for binary
* improved lm_head
* improved prefill
* improved lm head
* gemma 4 vindex
* working on gemma 4 moe
* working on cleanup for merge
* fixed issue with select
* residual stream
* working on benchmarks

---------

Co-authored-by: chrishayuk <chrishayuk@googlemail.com>
Co-authored-by: Remi <remipetiot@hotmail.com>
Co-authored-by: chrishayuk <chrishayuk@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…on guard for rebuild_overrides (#14)

README: add a fork notice block with badges (Divinci AI, Hugging Face, Vindex Viewer Space, License, Upstream link). Frames this repo as the Divinci-AI fork of chrishayuk/larql carrying RFC-0001 mechanistic fact-editing, Phase-1 unlearning with the revert-leak fix, Gemma 4 per-layer intermediate-size, and the CI isolation harness.

Test (overlay_apply): add `rebuild_overrides_clears_base_down_and_up_overrides` — a permanent regression guard for the Phase-1 unlearning revert path. Pre-populates `base.down_overrides` + `base.up_overrides` via `set_down_vector` / `set_up_vector` (the COMPILE-WITH-REFINE write path), pushes any patch onto the overlay so `remove_patch(0)` triggers `rebuild_overrides`, then asserts both base maps are empty after revert. If a future refactor drops the two `clear()` calls in `rebuild_overrides`, this test turns red — catching the same regression Gate 3 catches at the integration level, but in 1 ms instead of 5 s.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cloud Run and Kubernetes inject secrets as env vars, not as CLI args.
When the value lives in `valueFrom: secretKeyRef`, Cloud Run does NOT
substitute it into container `args` via `$(VAR)` expansion — that only
works for inline `value:` envs. As a result there's no ergonomic way to
pass a secret to `--api-key` today, and deployments end up unauthenticated
at the app layer even when a bearer token is provisioned.
Adding `env = "LARQL_API_KEY"` to the clap arg lets `valueFrom: secretKeyRef`
flow directly in:
```
env:
  - name: LARQL_API_KEY
    valueFrom:
      secretKeyRef:
        name: larql-s2s-token-staging
        key: latest
```
The CLI arg still wins when both are set (standard clap precedence).
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
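The actual change is the one-line `env = "LARQL_API_KEY"` attribute on the clap arg. This std-only sketch shows the precedence that attribute buys — an explicit flag value still wins, and the environment variable is consulted only when the flag is absent:

```rust
// Std-only illustration of the precedence clap implements once the arg
// carries env = "LARQL_API_KEY": CLI value first, env var as fallback.
fn resolve_api_key(cli_value: Option<String>) -> Option<String> {
    cli_value.or_else(|| std::env::var("LARQL_API_KEY").ok())
}

fn main() {
    // With --api-key passed, the env var is never consulted.
    let key = resolve_api_key(Some("from-cli".to_string()));
    println!("{:?}", key); // Some("from-cli")
}
```

In the real CLI, clap also reports the env var name in `--help` output for args declared this way, which documents the Cloud Run wiring for free.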
Bump the safetensors crate 0.5 → 0.7 (which adds the F8_E8M0 enum variant required by Open Compute MX-format scales) and add bit-pattern → f32 decoders for the four new dtypes in larql-models/src/loading/safetensors.rs.

This unblocks loading any safetensors file that uses MXFP4 expert weights (I8 packed nibbles + F8_E8M0 per-32-element scales — used by deepseek-ai/DeepSeek-V4-* and unsloth/DeepSeek-V4-* among others) or plain FP8 attention weights (F8_E4M3 / F8_E5M2 — GPT-OSS, etc.).

Currently `tensor_to_f32` decodes each tensor in isolation. Proper MXFP4 unpacking (where the I8 packed-nibble weight is paired with its F8_E8M0 scale companion) still needs cross-tensor logic — left as a follow-up for the FFN tensor loading layer where weight + scale are loaded together.

Also includes:
- bench_cmd.rs: strip the metal-only code path so `cargo build --no-default-features` works on Linux (the metal crate is `cfg(target_os = "macos")`-only).
- compile_cmd/save.rs: fix `safetensors::serialize(&views, &None)` → `serialize(&views, None)` for the safetensors 0.7 signature change.

Verified `cargo check -p larql-cli --no-default-features` clean (1 dead-code warning unrelated to this PR).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
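A bit-pattern sketch of the decode paths described above, written from the OCP MX / FP8 encodings rather than copied from safetensors.rs; the low-nibble-first element order in the MXFP4 helper is an assumption:

```rust
/// F8_E8M0 (OCP MX scale): value = 2^(byte − 127); 0xFF encodes NaN.
fn decode_f8_e8m0(b: u8) -> f32 {
    if b == 0xFF {
        return f32::NAN;
    }
    2f32.powi(b as i32 - 127)
}

/// F8_E4M3 (FN encoding): 1 sign, 4 exponent (bias 7), 3 mantissa bits.
/// No infinities; S.1111.111 is NaN. E5M2 decodes analogously (bias 15).
fn decode_f8_e4m3(b: u8) -> f32 {
    if b & 0x7F == 0x7F {
        return f32::NAN;
    }
    let sign = if b & 0x80 != 0 { -1.0 } else { 1.0 };
    let exp = ((b >> 3) & 0x0F) as i32;
    let man = (b & 0x07) as f32;
    if exp == 0 {
        sign * (man / 8.0) * 2f32.powi(-6) // subnormal range
    } else {
        sign * (1.0 + man / 8.0) * 2f32.powi(exp - 7)
    }
}

/// FP4 (E2M1) magnitudes: 0, 0.5, 1, 1.5, 2, 3, 4, 6, plus a sign bit.
const FP4_LUT: [f32; 16] = [
    0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
    -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0,
];

/// Sketch of the cross-tensor follow-up step: each byte of the I8 tensor
/// packs two FP4 nibbles (low nibble first — an assumed order), and every
/// 32 decoded elements share one F8_E8M0 scale byte.
fn dequant_mxfp4(packed: &[u8], scales: &[u8]) -> Vec<f32> {
    let mut out = Vec::with_capacity(packed.len() * 2);
    for &byte in packed {
        for nib in [byte & 0x0F, byte >> 4] {
            let scale = decode_f8_e8m0(scales[out.len() / 32]);
            out.push(FP4_LUT[nib as usize] * scale);
        }
    }
    out
}

fn main() {
    println!("{}", decode_f8_e8m0(127)); // 1 (scale 2^0)
    println!("{}", decode_f8_e4m3(0x38)); // 1
    println!("{:?}", dequant_mxfp4(&[0x21], &[127])); // [0.5, 1.0]
}
```

The scale lookup `scales[out.len() / 32]` is exactly the cross-tensor coupling the per-tensor `tensor_to_f32` dispatch cannot see, which is why the full dequant belongs in the FFN loading layer.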
### Summary

Bumps `safetensors` 0.5 → 0.7 and adds bit-pattern → f32 decoders for four dtypes that the current loader rejects with `parse error: InvalidHeaderDeserialization`:

- `F8_E8M0` — Open Compute MX-format scale (per-32-element; value `2^(byte − 127)`). Required by every MXFP4-quantized model.
- `F8_E4M3` / `F8_E5M2` — standard FP8 weights (Open Compute Project FN encoding). Used by GPT-OSS, DeepSeek-V3 attention weights, etc.
- `I8` — signed bytes; in the MXFP4 case these are packed FP4 nibbles (the `.weight` companion of an `F8_E8M0` `.scale`).

This unblocks loading any safetensors file from the DeepSeek-V4 family (`deepseek-ai/DeepSeek-V4-Flash`, `…-Pro`, and unsloth repacks of either) and similar MXFP4-quantized MoE models.

### What's NOT in this PR (intentional)
`tensor_to_f32` here decodes each tensor in isolation. Proper MXFP4 dequantization (pairing an `I8` packed-nibble weight with its `F8_E8M0` scale companion) belongs at the FFN tensor loading layer, where both tensors are loaded together — not in the per-tensor dispatch. A code comment marks this as the next layer to extend (pointing at `crates/larql-models/src/quant/mxfp4.rs`, which already has the unpack primitive).

### Drive-by build fixes
While here, two small unrelated fixes so `cargo build --no-default-features` is clean on Linux from a fresh clone:

- `bench_cmd.rs`: strip the metal-only `MetalBackend::new()` branch so the file compiles when the `metal` feature is off (the `metal` crate is `cfg(target_os = "macos")`-only). Bench is unaffected — it always used `CpuBackend` outside of macOS anyway.
- `compile_cmd/save.rs`: update `safetensors::serialize(&views, &None)` → `serialize(&views, None)` — safetensors 0.7 takes the second argument by value instead of by reference.

### Verification
```
cargo check -p larql-cli --no-default-features --bin larql
→ Finished `dev` profile [unoptimized + debuginfo] target(s)
  (1 unrelated dead-code warning on hf_api_url)
```
Tested against a deepseek-ai/DeepSeek-V4-Flash safetensors snapshot — `larql extract` now gets past the header parse and into the actual weight loading.

🤖 Generated with Claude Code