feat(safetensors): support F8_E4M3 / F8_E5M2 / F8_E8M0 / I8 dtypes#35
Open
mikeumus wants to merge 12 commits into chrishayuk:main from
Conversation
Proposes extending LarQL from weight-analysis into analysis+editing via three new subcommands that implement ROME/MEMIT-family algorithms on top of the existing larql-inference forward pass and capture hooks. Based on 9 chapters of experimentation on Gemma 4 (4B and 26B) documented in Divinci-AI/server notebooks/CHAPTER_15 through CHAPTER_23:

- `larql crown`: per-edit crown-layer discovery via module ablation
- `larql edit`: single-fact rank-1 edit with auto-scale calibration
- `larql memit`: batch fact editing via joint least-squares, grouped by crown

Also defines a patch file format (~55 KB per Gemma 4 4B single edit) and a non-destructive `larql apply-patch` command. Phased 4-step rollout plan.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implements Phase A of RFC-0001 (#2): per-layer MLP ablation scan to find the layer whose last-position MLP output is load-bearing for a given (prompt, expected-token) pair.

Changes:
- crates/larql-inference/src/ffn/ablating.rs — new `LastPositionAblatingFfn` that wraps any `FfnBackend` and zeroes its output at the last-token row for one target layer. Thin wrapper, no math changes.
- crates/larql-cli/src/commands/extraction/crown_cmd.rs — new `larql crown` subcommand. Tokenises the prompt, runs a baseline forward pass, then iterates layers in [start..=end] running `predict_with_ffn` against the ablating backend, reports per-layer Δ in expected-token probability, and picks the layer whose ablation causes the top prediction to flip with the largest suppression magnitude.

Methodology matches Phase 125c of Divinci-AI/server notebooks/CHAPTER_17_CORONATION.md — on Gemma 4 4B, ablating the L27 MLP on "Capital of France? A:" flips the top prediction from " Paris" to "France" (the country token). The command outputs JSON (optional --json) so downstream commands (edit, memit) can consume the crown_layer field.

Compile-checked with `cargo check --package larql-cli`.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
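The crown-pick rule described above (restrict to layers whose ablation flips the top-1 prediction, then maximize suppression of the expected token) can be sketched in isolation. The `LayerScan` struct and function names below are hypothetical illustrations, not the actual crown_cmd.rs API:

```rust
/// One row of the ablation scan: per-layer result of zeroing the MLP
/// output at the last position. (Hypothetical names, for illustration.)
struct LayerScan {
    layer: usize,
    expected_prob: f32, // P(expected token) with this layer ablated
    top_token: u32,     // argmax token id with this layer ablated
}

/// Among layers whose ablation flips the top-1 away from the expected
/// token, pick the one with the largest drop in expected-token probability.
fn pick_crown(baseline_prob: f32, expected_token: u32, scans: &[LayerScan]) -> Option<usize> {
    scans
        .iter()
        .filter(|s| s.top_token != expected_token) // top prediction flipped
        .map(|s| (s.layer, baseline_prob - s.expected_prob)) // suppression
        .max_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(layer, _)| layer)
}

fn main() {
    let scans = vec![
        LayerScan { layer: 10, expected_prob: 0.90, top_token: 42 }, // no flip
        LayerScan { layer: 27, expected_prob: 0.05, top_token: 7 },  // flip, large drop
    ];
    println!("{:?}", pick_crown(0.95, 42, &scans)); // Some(27)
}
```

Returning `None` when no layer flips the prediction is a deliberate choice here: it surfaces "no crown found" to the caller rather than silently picking a weakly-suppressing layer.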
… RFC-0001) (#7) Implements Phase B of RFC-0001 (#2): single-fact rank-1 editor with portable patch file format. Builds on Phase A's `LastPositionAblatingFfn` (#3) and adds the symmetric `LastPositionInjectingFfn` for scale search.

### New library module: `larql-inference/src/edit.rs`
- `EditPatch` struct (serializable via serde)
- `compute_rank1(k, d, scale, layer, provenance) -> EditPatch`
- `write_patch(path, &patch)` / `read_patch(path) -> EditPatch` with a simple binary format: LQPATCH magic + JSON meta + little-endian f32 vectors for d and k_norm. ~55 KB for Gemma 4 4B.
- `apply_patch(&mut ModelWeights, &EditPatch)`: installs the rank-1 outer product into `down_proj.weight` in place, handling both `[hidden, intermediate]` and `[intermediate, hidden]` layouts.

### New FFN wrapper: `larql-inference/src/ffn/injecting.rs`
- `LastPositionInjectingFfn` — adds a fixed delta vector to the inner backend's last-row output at one target layer. Symmetric to the ablating wrapper from PR #3. Used for auto-scale search.

### New CLI commands
- `larql edit <model> --src "..." --tgt "..." --new-token " Tokyo" --output f2t.lqpatch`
  Runs Phase A crown discovery (or accepts `--layer`), captures k at the crown layer for both prompts, computes d = W_down @ (k_tgt - k_src), linearly searches [0.5, 1, 1.5, 2, 2.5, 3, 4] for the minimum scale that flips the source's top-1 to --new-token, and emits the patch.
- `larql apply-patch <model> --patch f2t.lqpatch --prompt "..."`
  Non-destructively installs one or more patches into the loaded weights and optionally runs a test prediction. Supports `--reverse` to subtract a patch (verifies reversibility).

### Supporting change
- Added `InferenceModel::weights_mut()` accessor so apply-patch can mutate the in-memory weight map without reloading.
Methodology validated in Python across Divinci-AI/server notebooks/CHAPTER_20_HONEY.md (Phase 140c: France→Tokyo with 11/11 specificity at 0.9% weight perturbation) and CHAPTER_18_THE_EDIT.md (Phase 130 scale search). The Rust port preserves the same math.

Compile-checked with `cargo check --package larql-cli`.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
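The rank-1 install described in the Phase B notes above can be sketched against a flat row-major buffer. This is an illustrative assumption — the real `apply_patch` operates on the loaded tensor type — and for the transposed `[intermediate, hidden]` layout the caller would swap the roles of d and k_norm:

```rust
/// Minimal sketch of W += scale * d ⊗ k_norm for the [hidden, intermediate]
/// layout, assuming flat row-major storage (illustration only; the real
/// apply_patch works on the loaded weight tensor).
fn apply_rank1(w: &mut [f32], rows: usize, cols: usize, d: &[f32], k_norm: &[f32], scale: f32) {
    assert_eq!(w.len(), rows * cols);
    assert_eq!(d.len(), rows);      // rows index the output-direction vector d
    assert_eq!(k_norm.len(), cols); // cols index the normalized key k_norm
    for r in 0..rows {
        for c in 0..cols {
            w[r * cols + c] += scale * d[r] * k_norm[c];
        }
    }
}

fn main() {
    let mut w = vec![0.0f32; 4];
    apply_rank1(&mut w, 2, 2, &[1.0, 2.0], &[3.0, 4.0], 0.5);
    println!("{:?}", w); // [1.5, 2.0, 3.0, 4.0]
}
```

Because the update is purely additive, applying the same patch with the scale negated restores the original weights exactly, which is what the `--reverse` flag relies on.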
Wraps the existing covariance-MEMIT solver (`larql_inference::forward::memit::run_memit`) with a CLI, an edits.json file format, and automatic crown-layer discovery for each edit. Groups edits by crown layer, invokes the joint least-squares solve, and emits one dense `.lqpatch` per affected layer plus a manifest.json. Phase C of RFC-0001 (#2), stacked on Phase B (#4).

### Extended patch file format (still backward compatible)
- Bumped patch version 1 → 2 with a `kind` field (defaults to "rank_one").
- New `kind = "dense"` variant carries a flat row-major ΔW matrix, needed because MEMIT's covariance-projected solve isn't natively a rank-1 outer product. Larger on disk (~72 MB per Gemma 4 4B layer) but semantically exact — no SVD approximation step.
- `write_patch`, `read_patch`, `apply_patch` all dispatch on kind. Phase B rank-1 patches continue to round-trip unchanged.
- New `compute_dense()` helper builds a Dense patch from an `Array2<f32>`.

### New CLI: `larql memit`
- Reads edits.json (a list of {label, src, new_token, layer?} records).
- For each edit: tokenises src, resolves target_token_id, resolves the crown layer (explicit or auto-scan).
- Calls `run_memit` with `Vec<MemitFact>`, receives one `MemitResult` per affected layer.
- Serialises each layer's ΔW as a Dense patch into the output directory and writes a manifest.json enumerating them.
- Prints the apply-patch command to install the batch.

### Usage

```
cat > edits.json <<EOF
[
  {"label":"france-to-tokyo","src":"Capital of France? A:", "new_token":" Tokyo","layer":27},
  {"label":"germany-to-rome","src":"Capital of Germany? A:", "new_token":" Rome","layer":27}
]
EOF

larql memit /path/to/gemma4 --edits edits.json --output patches/

larql apply-patch /path/to/gemma4 \
  -p patches/memit_L27.lqpatch \
  --prompt "Capital of France? A:"
```

### Known ceiling
Chapter 22 established that single-layer MEMIT with correlated keys (~60% cosine) lands ~3/5 concurrent targets.
For 5+ correlated edits, users can now distribute across multiple crown layers via `layer` overrides in edits.json — MEMIT runs once per layer group.

Compile-checked with `cargo check --package larql-cli`.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
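The grouping step (one MEMIT solve per crown layer) can be sketched as follows. `EditSpec` is a hypothetical, trimmed-down edits.json record — the real one also carries src and new_token:

```rust
use std::collections::BTreeMap;

/// Hypothetical, trimmed-down edits.json record (the real record also
/// carries src and new_token); `layer` is the resolved crown layer.
#[derive(Debug, Clone)]
struct EditSpec {
    label: String,
    layer: usize,
}

/// Group edits by crown layer so the joint least-squares solve runs
/// once per layer group, producing one dense patch per layer.
fn group_by_crown(edits: Vec<EditSpec>) -> BTreeMap<usize, Vec<EditSpec>> {
    let mut groups: BTreeMap<usize, Vec<EditSpec>> = BTreeMap::new();
    for e in edits {
        groups.entry(e.layer).or_default().push(e);
    }
    groups
}

fn main() {
    let edits = vec![
        EditSpec { label: "france-to-tokyo".into(), layer: 27 },
        EditSpec { label: "germany-to-rome".into(), layer: 27 },
        EditSpec { label: "spain-to-oslo".into(), layer: 21 },
    ];
    let groups = group_by_crown(edits);
    println!("{} layer groups", groups.len()); // 2 layer groups
}
```

A `BTreeMap` keeps the layer groups in ascending order, so patch files and the manifest come out in a deterministic order across runs.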
… of RFC-0001) (#9) Exposes the Phase A-C commands as Python callables so the Chapter 15-23 Colab experiments from Divinci-AI/server become one-liner Rust invocations from Jupyter — no CLI shell-outs, no JSON parsing.

### New module: crates/larql-python/src/edit_py.rs
Four #[pyfunction] entry points:
- `crown(model, prompt, expect, start_layer=None, end_layer=None, top_k=100)` — returns {crown_layer, crown_delta_prob, top_after_ablation, scan: [...]}.
- `edit(model, src, tgt, new_token, output, layer=None, scales=None, fixed_scale=None, top_k=100, label=None)` — writes a rank-1 .lqpatch; returns {layer, scale, output, d_norm}.
- `apply_patch(model, patches: list[str], prompt=None, top_k=5, reverse=False)` — applies patches in-memory; an optional prompt returns {predictions: [(tok, prob), ...]}.
- `memit(model, edits: list[dict], output_dir, ridge=0.01, target_alpha=1.0, top_k=100)` — batch fact editor wrapping run_memit; writes one dense patch per layer into output_dir plus a manifest.

### Wiring
- Registered in the _native pymodule (src/lib.rs) via m.add_function.
- Re-exported from python/larql/__init__.py under the public `larql` namespace alongside the existing load_vindex/create_session functions.

### Example

```python
import larql

scan = larql.crown("/path/to/gemma4", "Capital of France? A:", " Paris")
print(scan["crown_layer"])  # 27 (on Gemma 4 4B)

larql.edit("/path/to/gemma4", src="Capital of France? A:",
           tgt="Capital of Japan? A:", new_token=" Tokyo",
           output="france_to_tokyo.lqpatch")

r = larql.apply_patch("/path/to/gemma4", patches=["france_to_tokyo.lqpatch"],
                      prompt="Capital of France? A:")
print(r["predictions"][0])  # ['Tokyo', 0.97]
```

This closes the RFC-0001 phased rollout: Python scripts can now drive the mechanistic fact-editing pipeline end-to-end.

Compile-checked with `cargo check --package larql-python`. Runtime import requires `maturin develop` — standard PyO3 workflow; no Python side of the package changed structurally.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…#10) Gemma 4's `use_double_wide_mlp=True` widens gate/up/down_proj to 2× base `intermediate_size` on KV-shared layers. On gemma-4-e2b-it (35 layers, last 20 shared), layers 15–34 have `intermediate=12288`, layers 0–14 have 6144. Crown-scan defaults to `(3n/5)=21` and lands on a double-wide layer, so the rank-1 edit hit `intermediate-size mismatch in captured keys` against the config-wide base size.

Adds `ModelArchitecture::intermediate_size_for_layer(layer) -> usize` (default = `config.intermediate_size`, mirroring `head_dim_for_layer`). `Gemma4Arch` overrides by reusing the precomputed `kv_sources` set — one source of truth for KV-shared-layer membership.

Thread the per-layer lookup through:
- `edit_py.rs`: compute `intermediate` after `chosen_layer` is picked.
- `edit_cmd.rs`: same for the CLI path.
- `memit.rs`: `ffn_dim` is now per-layer; `run_memit` already solves per layer, so covariances remain correctly sized across mixed layers.

Parse `use_double_wide_mlp` in `detect.rs`; add it to `ModelConfig`.

Tests (in `detect.rs`):
- `test_detect_gemma4_e2b`: asserts 6144 on L0/L14, 12288 on L15/L21/L34 — matches the actual HF tensor shapes verified in the Colab repl.
- `test_gemma4_31b_no_double_wide`: 31B lacks the flag → base everywhere.
- `test_non_gemma4_intermediate_default`: Llama returns base for all layers via the default trait impl.

The bare `weights.intermediate_size` field is left as "base" for display / metadata call sites (demos, patch-print, vindex stats). Patch file format unchanged: `compute_rank1` / `compute_dense` already derive `intermediate_size` from the runtime tensor, so new patches for double-wide layers store 12288 correctly without a version bump.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
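The default-method-plus-override shape described above can be sketched as follows. Trait and struct names mirror the commit message, but the fields are simplified assumptions — the real `Gemma4Arch` derives membership from its precomputed `kv_sources` set rather than a bare `HashSet` field:

```rust
use std::collections::HashSet;

// Sketch of the per-layer intermediate-size lookup (simplified assumptions;
// not the actual larql trait definition).
trait ModelArchitecture {
    fn base_intermediate_size(&self) -> usize;

    /// Default: config-wide base size (mirrors the head_dim_for_layer pattern).
    fn intermediate_size_for_layer(&self, _layer: usize) -> usize {
        self.base_intermediate_size()
    }
}

struct Gemma4Arch {
    base: usize,
    double_wide: bool,         // parsed use_double_wide_mlp flag
    kv_shared: HashSet<usize>, // KV-shared layer indices
}

impl ModelArchitecture for Gemma4Arch {
    fn base_intermediate_size(&self) -> usize {
        self.base
    }

    fn intermediate_size_for_layer(&self, layer: usize) -> usize {
        if self.double_wide && self.kv_shared.contains(&layer) {
            self.base * 2 // double-wide MLP on KV-shared layers
        } else {
            self.base
        }
    }
}

fn main() {
    // gemma-4-e2b-it shape from the commit message: 35 layers, last 20 shared.
    let arch = Gemma4Arch {
        base: 6144,
        double_wide: true,
        kv_shared: (15..35).collect(),
    };
    println!("{}", arch.intermediate_size_for_layer(14)); // 6144
    println!("{}", arch.intermediate_size_for_layer(21)); // 12288
}
```

Architectures without the flag never override the method, so they fall through to the default base size for every layer — the same behaviour the `test_non_gemma4_intermediate_default` test pins down.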
…g fixes (#12)

* feat(models): per-layer intermediate_size for Gemma 4 double-wide MLP

  (Same message as PR #10 above.)

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: write-lock starvation on INFER + patch-revert down/up vector leak

  Three fixes for larql-server session management:
1. **Bug 1 — write-lock starvation on INFER**: switched `sessions_blocking_write` → `sessions_blocking_read` on the INFER path; made `last_accessed` an `AtomicU64` so `touch()` takes `&self`.
2. **Bug 2 — rebuild_overrides leak**: added `base.down_overrides.clear()` + `base.up_overrides.clear()` before replaying patches on remove.
3. **Bug 3 — blocking_read inside async**: pre-acquire the base vindex before entering the write lock in apply_patch to avoid a tokio panic.

All three gates verified: T2 concurrent PASS, T3 global-leak PASS, T4 throughput PASS (mixed p50 0.94× same-session), T5 revert PASS.

* ci: add isolation-harness gates + synthetic tiny-vindex testdata

  Three gates run on every push/PR (T2=concurrent, T3=global-leak, T5=revert). Requires the HARNESS_REPO_TOKEN secret (fine-grained PAT, Contents:read on Divinci-AI/larql-isolation-harness). testdata/tiny-vindex is a reproducible 5 MB synthetic vindex generated by generate.py (seed=42, 8 layers, hidden=128) — no real model weights needed.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
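The Bug-1 pattern generalizes: storing `last_accessed` as an `AtomicU64` means touching a session needs only `&self`, so the INFER hot path can hold a shared read lock on the session map instead of a write lock that starves other readers. A simplified sketch (the real `Session` holds much more state):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{SystemTime, UNIX_EPOCH};

// Simplified sketch of the &self touch() pattern; not the actual
// larql-server Session type.
struct Session {
    last_accessed: AtomicU64, // unix seconds
}

impl Session {
    /// Takes &self, not &mut self: callers holding only a shared read
    /// lock on the session map can still refresh the timestamp.
    fn touch(&self) {
        let now = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .expect("clock before unix epoch")
            .as_secs();
        self.last_accessed.store(now, Ordering::Relaxed);
    }
}

fn main() {
    let s = Session { last_accessed: AtomicU64::new(0) };
    s.touch(); // works through a shared reference
    println!("{}", s.last_accessed.load(Ordering::Relaxed) > 0); // true
}
```

`Ordering::Relaxed` is sufficient here because the timestamp is advisory (idle-session eviction), not a synchronization point.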
* working on arch b, unified insert
* working on memit with vindex, and templates
* memit style
* working on latest memit
* working on wasm
* working on wasm
* cleaned up vindex and larql
* fix: Linux support — conditional BLAS and Q4 scalar fallback

  Implement Q4 scalar fallback for non-ARM targets:
  - Move decode_f16() before #if aarch64 (shared by both paths)
  - Replace empty stub functions with correct scalar implementations
  - q4_0_matvec_c and q4_0_vecmat_c now produce correct results on x86_64

  Affects: larql-compute/csrc/q4_dot.c

  Tested on Ubuntu 24 (WSL2, x86_64): cargo build --release and cargo test --workspace pass with 0 failures. macOS path untested — preserves accelerate via cfg(target_os) and requires validation on Apple hardware.
* working on bounded compute script
* refactored lql
* improved refactor
* updated executor
* gemma 4
* working on compute
* improved for gemma 4
* test: cherry-pick GGUF shape + Q4 correctness tests from chrishayuk#20
* updated examples
* working through python parity
* working on q4k tidyup
* improving testing and quantization
* improving testing
* gemma 4 support
* improved clu
* autoregressive generation
* kv cache works
* working on shader pipeline
* working shaders
* working on shaders and graph
* moved to full graph
* working through ffn walk performance
* working version
* modularized shaders
* working on decoupling decode
* working on performance
* more performance improvements
* improving performance
* more performance improvements
* working on performance
* working on distributed grid
* working on grid
* improving docs and moe
* working on moe
* improved publish pull
* binary format
* working binary format and performance
* updated vindex server specs for binary
* improved lm_head
* improved prefill
* improved lm head
* gemma 4 vindex
* working on gemma 4 moe
* working on cleanup for merge
* fixed issue with select
* residual stream
* working on benchmarks

---------

Co-authored-by: chrishayuk <chrishayuk@googlemail.com>
Co-authored-by: Remi <remipetiot@hotmail.com>
Co-authored-by: chrishayuk <chrishayuk@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…on guard for rebuild_overrides (#14)

README: add a fork notice block with badges (Divinci AI, Hugging Face, Vindex Viewer Space, License, Upstream link). Frames this repo as the Divinci-AI fork of chrishayuk/larql carrying RFC-0001 mechanistic fact-editing, Phase-1 unlearning with the revert-leak fix, Gemma 4 per-layer intermediate-size, and the CI isolation harness.

Test (overlay_apply): add `rebuild_overrides_clears_base_down_and_up_overrides` — a permanent regression guard for the Phase-1 unlearning revert path. Pre-populates `base.down_overrides` + `base.up_overrides` via `set_down_vector` / `set_up_vector` (the COMPILE-WITH-REFINE write path), pushes any patch onto the overlay so `remove_patch(0)` triggers `rebuild_overrides`, then asserts both base maps are empty after revert. If a future refactor drops the two `clear()` calls in `rebuild_overrides`, this test turns red — catching the same regression Gate 3 catches at the integration level, but in 1 ms instead of 5 s.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cloud Run and Kubernetes inject secrets as env vars, not as CLI args.
When the value lives in `valueFrom: secretKeyRef`, Cloud Run does NOT
substitute it into container `args` via `$(VAR)` expansion — that only
works for inline `value:` envs. As a result there's no ergonomic way to
pass a secret to `--api-key` today, and deployments end up unauthenticated
at the app layer even when a bearer token is provisioned.
Adding `env = "LARQL_API_KEY"` to the clap arg lets `valueFrom: secretKeyRef`
flow directly in:
```
env:
  - name: LARQL_API_KEY
    valueFrom:
      secretKeyRef:
        name: larql-s2s-token-staging
        key: latest
```
The CLI arg still wins when both are set (standard clap precedence).
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
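The actual change is the one-line `env = "LARQL_API_KEY"` attribute on the clap arg. This std-only sketch shows the precedence that attribute buys — an explicit flag value still wins, and the environment variable is consulted only when the flag is absent:

```rust
// Std-only illustration of the precedence clap implements once the arg
// carries env = "LARQL_API_KEY": CLI value first, env var as fallback.
fn resolve_api_key(cli_value: Option<String>) -> Option<String> {
    cli_value.or_else(|| std::env::var("LARQL_API_KEY").ok())
}

fn main() {
    // With --api-key passed, the env var is never consulted.
    let key = resolve_api_key(Some("from-cli".to_string()));
    println!("{:?}", key); // Some("from-cli")
}
```

In the real CLI, clap also reports the env var name in `--help` output for args declared this way, which documents the Cloud Run wiring for free.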
Bump the safetensors crate 0.5 → 0.7 (which adds the F8_E8M0 enum variant required by Open Compute MX-format scales) and add bit-pattern → f32 decoders for the four new dtypes in larql-models/src/loading/safetensors.rs.

This unblocks loading any safetensors file that uses MXFP4 expert weights (I8 packed nibbles + F8_E8M0 per-32-element scales — used by deepseek-ai/DeepSeek-V4-* and unsloth/DeepSeek-V4-* among others) or plain FP8 attention weights (F8_E4M3 / F8_E5M2 — GPT-OSS, etc.).

Currently `tensor_to_f32` decodes each tensor in isolation. Proper MXFP4 unpacking (where the I8 packed-nibble weight is paired with its F8_E8M0 scale companion) still needs cross-tensor logic — left as a follow-up for the FFN tensor loading layer where weight + scale are loaded together.

Also includes:
- bench_cmd.rs: strip the metal-only code path so `cargo build --no-default-features` works on Linux (the metal crate is `cfg(target_os = "macos")`-only).
- compile_cmd/save.rs: fix `safetensors::serialize(&views, &None)` → `serialize(&views, None)` for the safetensors 0.7 signature change.

Verified `cargo check -p larql-cli --no-default-features` clean (1 dead-code warning unrelated to this PR).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
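A bit-pattern sketch of the decode paths described above, written from the OCP MX / FP8 encodings rather than copied from safetensors.rs; the low-nibble-first element order in the MXFP4 helper is an assumption:

```rust
/// F8_E8M0 (OCP MX scale): value = 2^(byte − 127); 0xFF encodes NaN.
fn decode_f8_e8m0(b: u8) -> f32 {
    if b == 0xFF {
        return f32::NAN;
    }
    2f32.powi(b as i32 - 127)
}

/// F8_E4M3 (FN encoding): 1 sign, 4 exponent (bias 7), 3 mantissa bits.
/// No infinities; S.1111.111 is NaN. E5M2 decodes analogously (bias 15).
fn decode_f8_e4m3(b: u8) -> f32 {
    if b & 0x7F == 0x7F {
        return f32::NAN;
    }
    let sign = if b & 0x80 != 0 { -1.0 } else { 1.0 };
    let exp = ((b >> 3) & 0x0F) as i32;
    let man = (b & 0x07) as f32;
    if exp == 0 {
        sign * (man / 8.0) * 2f32.powi(-6) // subnormal range
    } else {
        sign * (1.0 + man / 8.0) * 2f32.powi(exp - 7)
    }
}

/// FP4 (E2M1) magnitudes: 0, 0.5, 1, 1.5, 2, 3, 4, 6, plus a sign bit.
const FP4_LUT: [f32; 16] = [
    0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
    -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0,
];

/// Sketch of the cross-tensor follow-up step: each byte of the I8 tensor
/// packs two FP4 nibbles (low nibble first — an assumed order), and every
/// 32 decoded elements share one F8_E8M0 scale byte.
fn dequant_mxfp4(packed: &[u8], scales: &[u8]) -> Vec<f32> {
    let mut out = Vec::with_capacity(packed.len() * 2);
    for &byte in packed {
        for nib in [byte & 0x0F, byte >> 4] {
            let scale = decode_f8_e8m0(scales[out.len() / 32]);
            out.push(FP4_LUT[nib as usize] * scale);
        }
    }
    out
}

fn main() {
    println!("{}", decode_f8_e8m0(127)); // 1 (scale 2^0)
    println!("{}", decode_f8_e4m3(0x38)); // 1
    println!("{:?}", dequant_mxfp4(&[0x21], &[127])); // [0.5, 1.0]
}
```

The scale lookup `scales[out.len() / 32]` is exactly the cross-tensor coupling the per-tensor `tensor_to_f32` dispatch cannot see, which is why the full dequant belongs in the FFN loading layer.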
### Summary

Bumps `safetensors` 0.5 → 0.7 and adds bit-pattern → f32 decoders for four dtypes that the current loader rejects with `parse error: InvalidHeaderDeserialization`:

- `F8_E8M0` — Open Compute MX-format scale (per-32-element; value `2^(byte − 127)`). Required by every MXFP4-quantized model.
- `F8_E4M3` / `F8_E5M2` — standard FP8 weights (Open Compute Project FN encoding). Used by GPT-OSS, DeepSeek-V3 attention weights, etc.
- `I8` — signed bytes; in the MXFP4 case these are packed FP4 nibbles (the `.weight` companion of an `F8_E8M0` `.scale`).

This unblocks loading any safetensors file from the DeepSeek-V4 family (`deepseek-ai/DeepSeek-V4-Flash`, `…-Pro`, and unsloth repacks of either) and similar MXFP4-quantized MoE models.

### What's NOT in this PR (intentional)
`tensor_to_f32` here decodes each tensor in isolation. Proper MXFP4 dequantization (pairing an `I8` packed-nibble weight with its `F8_E8M0` scale companion) belongs at the FFN tensor loading layer, where both tensors are loaded together — not in the per-tensor dispatch. A code comment marks this as the next layer to extend (pointing at `crates/larql-models/src/quant/mxfp4.rs`, which already has the unpack primitive).

### Drive-by build fixes
While here, two small unrelated fixes so `cargo build --no-default-features` is clean on Linux from a fresh clone:

- `bench_cmd.rs`: strip the metal-only `MetalBackend::new()` branch so the file compiles when the `metal` feature is off (the `metal` crate is `cfg(target_os = "macos")`-only). Bench is unaffected — it always used `CpuBackend` outside of macOS anyway.
- `compile_cmd/save.rs`: update `safetensors::serialize(&views, &None)` → `serialize(&views, None)` — safetensors 0.7 takes the second argument by value instead of by reference.

### Verification
```
cargo check -p larql-cli --no-default-features --bin larql
→ Finished `dev` profile [unoptimized + debuginfo] target(s)
  (1 unrelated dead-code warning on hf_api_url)
```
Tested against a deepseek-ai/DeepSeek-V4-Flash safetensors snapshot — `larql extract` now gets past the header parse and into the actual weight loading.

🤖 Generated with Claude Code