RFC: ggml-bridge - Standardized Tensor Exchange between llama.cpp and stable-diffusion.cpp #24538

martinbu69 · 2026-06-12T18:15:30Z

RFC: ggml-bridge — Standardized Tensor Exchange Between llama.cpp and stable-diffusion.cpp

Authors: [TBD]
Status: Draft
Target: huggingface/llama.cpp, leejet/stable-diffusion.cpp
Date: June 2026

Abstract

We propose ggml-bridge, a lightweight specification and library for exchanging intermediate tensor data (embeddings, conditioning vectors) between ggml-based inference tools — primarily llama.cpp and stable-diffusion.cpp.

This enables a UNIX-philosophy approach to multimodal AI: each binary does one thing well, and a standardized tensor pipe connects them.

Problem Statement

The Duplication Problem

stable-diffusion.cpp currently reimplements transformer inference for text and vision encoders that llama.cpp already handles — often better:

| Encoder | sd.cpp (reimplemented) | llama.cpp (native) |
| --- | --- | --- |
| CLIP ViT-L/14 | ✅ Custom C++ forward pass | ✅ Native (LLaVA, CLIP models) |
| CLIP ViT-bigG | ✅ Custom C++ forward pass | ✅ Native |
| T5-XXL | ✅ Custom C++ forward pass | ✅ Native (seq2seq models) |
| Qwen3-VL-8B (Ideogram 4) | ❌ Not implemented | ✅ Native (multimodal) |
| Future vision encoders | ❌ Must reimplement each one | ✅ Already supported |

This creates several problems:

Duplicated effort: Every new encoder must be re-implemented in C++ by both projects
Unequal optimization: llama.cpp has Flash Attention, KV-cache, advanced quantization — sd.cpp's encoder copies lag behind
Architectural bloat: sd.cpp grows with every new model architecture, moving away from its core strength (diffusion inference)
Blocked features: Ideogram 4's vision-encoder-based character reference is impossible in sd.cpp today — not because of a fundamental limitation, but because the VLM integration path doesn't exist

The Multimodal Gap

Modern image generation models are increasingly multi-model pipelines:

Ideogram 4:    Qwen3-VL-8B (vision) → DiT (diffusion)
FLUX.2:        T5-XXL (text) + CLIP (text) → DiT (diffusion)  
SD3:           CLIP-L + CLIP-G + T5-XXL → MMDiT (diffusion)

Each pipeline combines a transformer encoder with a diffusion backbone. Today, sd.cpp must implement both internally. With ggml-bridge, the split becomes natural:

llama.cpp  = All transformer inference (text, vision, audio encoding)
sd.cpp     = All diffusion inference (denoising, VAE, sampling)
ggml-bridge = The pipe between them

Proposed Solution

Architecture

┌─────────────────┐      ┌──────────────┐      ┌────────────────────┐
│   llama.cpp     │      │  ggml-bridge │      │       sd.cpp       │
│                 │      │              │      │                    │
│  Load encoder   │      │  ┌────────┐  │      │  Load diffusion    │
│  (CLIP/T5/VLM)  │─────▶│  │ .ggmlb │  │─────▶│  (UNet/DiT)        │
│                 │      │  │file/shm│  │      │                    │
│  Encode text    │      │  └────────┘  │      │  Read conditioning │
│  Encode image   │      │              │      │  Run denoising     │
│  Export tensor  │      │              │      │  VAE decode        │
│                 │      │              │      │  Output image      │
└─────────────────┘      └──────────────┘      └────────────────────┘

Build Modes for sd.cpp

A key concern is that sd.cpp must remain standalone-capable. We propose three compile-time build modes, selected through CMake, so the codebase can be cleanly separated without losing any capability:

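# cmake -DBRIDGE_MODE=STANDALONE | BRIDGED | JOINT
# option() only defines ON/OFF booleans, so the three-valued mode is a cached string.
set(BRIDGE_MODE "STANDALONE" CACHE STRING "Build mode: STANDALONE, BRIDGED, or JOINT")
set_property(CACHE BRIDGE_MODE PROPERTY STRINGS STANDALONE BRIDGED JOINT)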

STANDALONE (default — today's behavior)

┌──────────────────────────────────────────────┐
│  sd-cli                                      │
│  ┌─────────┐  ┌─────────┐  ┌──────────────┐ │
│  │ CLIP.cpp│  │ T5.cpp  │  │ DiT/UNet.cpp │ │
│  └─────────┘  └─────────┘  └──────────────┘ │
└──────────────────────────────────────────────┘

Nothing changes. All internal encoders compiled in. No bridge dependency.

BRIDGED (slim — needs external llama.cpp)

┌──────────────────────────────────────────────┐
│  sd-cli-slim                                 │
│  ┌──────────────┐  ┌──────────────────────┐  │
│  │ bridge_reader│  │ DiT/UNet.cpp         │  │
│  └──────────────┘  └──────────────────────┘  │
└──────────────────────────────────────────────┘

Internal encoders stripped out. Conditioning must come via bridge files or SHM. Smallest possible binary, focused purely on diffusion inference.

JOINT (combined binary: standalone + bridge)

┌──────────────────────────────────────────────┐
│  sd-cli-full                                 │
│  ┌───────────────────┐  ┌──────────────────┐ │
│  │ llama.cpp (linked)│  │ DiT/UNet.cpp     │ │
│  │  CLIP, T5, VLMs   │  │                  │ │
│  └────────┬──────────┘  └────────┬─────────┘ │
│           │    bridge (in-proc)  │           │
│           └──────────────────────┘           │
└──────────────────────────────────────────────┘

Links llama.cpp into the binary as the encoder backend (statically in this proposal; static versus dynamic linking is listed under Open Questions). The result is a single, fully standalone binary that still uses the bridge architecture internally. The bridge becomes an in-process function call with no IPC overhead.

This combines clean separation of concerns internally with single-file deployment externally.

Code Separation

The build mode controls which code path is compiled:

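// In sd.cpp's conditioning pipeline.
// The mode names must be integer macros: an undefined identifier evaluates to 0
// inside #if, so comparing against bare STANDALONE/BRIDGED/JOINT would always
// select the first branch. The build is assumed to pass -DBRIDGE_MODE=<constant>.
#define BRIDGE_MODE_STANDALONE 0
#define BRIDGE_MODE_BRIDGED    1
#define BRIDGE_MODE_JOINT      2

#if BRIDGE_MODE == BRIDGE_MODE_STANDALONE
    // Legacy path: internal CLIP forward pass
    struct ggml_tensor * cond = clip_text_encode(ctx, tokens, n_tokens);
#elif BRIDGE_MODE == BRIDGE_MODE_BRIDGED
    // External path: read pre-computed conditioning
    struct ggml_tensor * cond = ggmlb_reader_get(reader, "clip_text_embeddings", ctx);
#elif BRIDGE_MODE == BRIDGE_MODE_JOINT
    // In-process path: call llama.cpp's encoder directly
    // (llama_encode_clip is a placeholder name, not an existing llama.cpp function)
    struct ggml_tensor * cond = llama_encode_clip(llama_ctx, prompt);
#else
#error "BRIDGE_MODE must be STANDALONE, BRIDGED, or JOINT"
#endif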

Over time, the STANDALONE code paths can be deprecated without breaking anything — the JOINT mode provides identical functionality with better optimization.

File Format: `.ggmlb` (ggml bridge)

A minimal binary format for exchanging named tensors between processes. Designed to be:

mmap-compatible for zero-copy IPC
Self-describing (tensor names, shapes, types)
Compatible with existing GGUF infrastructure

// Header
struct ggmlb_header {
    uint32_t magic;          // "GMLB" = 0x424C4D47
    uint32_t version;        // 1
    uint32_t n_tensors;      // Number of tensors in this file
    uint32_t metadata_size;  // Size in bytes of the optional JSON metadata (model name, encoding params)
};

// Per-tensor info (follows header)
struct ggmlb_tensor_info {
    char     name[64];       // e.g. "clip_text_embeddings", "t5_hidden_states"
    uint32_t type;           // ggml_type enum (F32, F16, Q8_0, ...)
    uint32_t n_dims;         // 1-4
    int64_t  ne[4];          // Dimensions
    uint64_t offset;         // Offset to raw data from start of data section
};

// Data section: raw tensor bytes (page-aligned for mmap)

Note

This is intentionally simpler than GGUF. GGUF is a model storage format with rich metadata. .ggmlb is an IPC format — it carries only the tensors needed for one inference step.

Transport Layer: File + SHM

The bridge supports two transport mechanisms with the same ggmlb format:

| Transport | Mechanism | Best for | Persistence |
| --- | --- | --- | --- |
| File | mmap'd `.ggmlb` file on disk | Batch workflows, caching, encode-once-generate-many | ✅ Persists on disk |
| SHM | POSIX `shm_open` / Win32 `CreateFileMapping` | Live pipelines, concurrent processes, zero-copy | ❌ Ephemeral |

Both transports mmap the same header + tensor layout. The only difference is the open call:

// File-based (batch, caching, portable)
ggmlb_reader * ggmlb_reader_open_file(const char * path);
ggmlb_writer * ggmlb_writer_open_file(const char * path);

// SHM-based (live pipeline, zero-copy IPC)
ggmlb_reader * ggmlb_reader_open_shm(const char * shm_name);
ggmlb_writer * ggmlb_writer_open_shm(const char * shm_name, size_t capacity);

// Common API (both transports)
void ggmlb_writer_add(ggmlb_writer * w, const char * name, struct ggml_tensor * tensor);
void ggmlb_writer_close(ggmlb_writer * w);
struct ggml_tensor * ggmlb_reader_get(ggmlb_reader * r, const char * name, struct ggml_context * ctx);
void ggmlb_reader_close(ggmlb_reader * r);

The CLI uses a shm:// prefix to select the transport:

# File-based (default)
llama-cli --model clip.gguf --prompt "..." --export-bridge clip_cond.ggmlb
sd-cli --model flux2.gguf --bridge-conditioning clip_cond.ggmlb -o result.png

# SHM-based (zero-copy, concurrent)
llama-cli --model clip.gguf --prompt "..." --export-bridge shm://clip_cond
sd-cli --model flux2.gguf --bridge-conditioning shm://clip_cond -o result.png

Note

On POSIX, shm_open() returns a file descriptor that supports mmap() — so the reader/writer code is nearly identical for both transports. On Windows, CreateFileMapping with INVALID_HANDLE_VALUE provides equivalent functionality.

CLI Integration

llama.cpp: `--export-bridge`

# Export CLIP text embeddings
llama-cli --model clip-vit-l-14.gguf \
          --prompt "a mountain meadow at sunset" \
          --export-bridge clip_cond.ggmlb

# Export T5-XXL hidden states
llama-cli --model t5-xxl.gguf \
          --prompt "a mountain meadow at sunset" \
          --export-bridge t5_cond.ggmlb

# Export Qwen3-VL vision embeddings from a reference image
llama-cli --model qwen3-vl-8b.gguf \
          --image reference_photo.png \
          --export-bridge vision_cond.ggmlb

sd.cpp: `--bridge-conditioning`

# Generate image using pre-computed conditioning
sd-cli --model ideogram4-dit.gguf \
       --bridge-conditioning clip_cond.ggmlb \
       --bridge-conditioning vision_cond.ggmlb \
       --output result.png

Combined pipeline (shell)

# Full Ideogram 4 pipeline with character reference — zero Python
llama-cli --model qwen3-vl-8b.gguf --image ref.png --export-bridge vision.ggmlb
llama-cli --model clip-vit-l-14.gguf --prompt "a woman hiking" --export-bridge clip.ggmlb

sd-cli --model ideogram4-dit.gguf \
       --bridge-conditioning vision.ggmlb \
       --bridge-conditioning clip.ggmlb \
       --output hiking_scene.png

Use Cases

1. Ideogram 4 Character Reference (currently impossible in sd.cpp)

# Vision encoder analyzes reference photo
llama-cli --model qwen3-vl-8b.gguf --image person.png --export-bridge char_ref.ggmlb

# Diffusion generates consistent character
sd-cli --model ideogram4.gguf --prompt "same person skiing" \
       --bridge-conditioning char_ref.ggmlb -o skiing.png

2. FLUX.2 with better T5 encoding

# Use llama.cpp's optimized T5 with Flash Attention + KV-cache
llama-cli --model t5-xxl-q4.gguf --prompt "..." --export-bridge t5.ggmlb

# sd.cpp skips its internal T5, uses pre-computed embeddings
sd-cli --model flux2-dev.gguf --bridge-conditioning t5.ggmlb -o result.png

3. Audio-to-Image (future)

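# Whisper encodes speech (assumes a Whisper-style audio encoder is exposed through llama-cli; Whisper itself currently lives in whisper.cpp)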
llama-cli --model whisper-large.gguf --audio narration.wav --export-bridge audio.ggmlb

# Diffusion generates matching image
sd-cli --model sd3.gguf --bridge-conditioning audio.ggmlb -o scene.png

4. Batch processing with cached encodings

# Encode once
llama-cli --model clip.gguf --prompt "mountain landscape" --export-bridge landscape.ggmlb

# Generate many variations without re-encoding
for seed in 1 2 3 4 5; do
  sd-cli --model sdxl.gguf --bridge-conditioning landscape.ggmlb --seed $seed -o "var_${seed}.png"
done

Benefits

For llama.cpp / Hugging Face

Positions llama.cpp as the universal encoder backend for the entire ggml ecosystem
Natural extension of HF's "single-click local AI" strategy
Minimal implementation effort — embedding export is mostly a serialization step

For sd.cpp / leejet

Dramatically reduces codebase: CLIP, T5, future VLM encoders can all be removed over time
Enables features that are currently architecturally impossible (Ideogram 4 character reference)
Focus on core strength: diffusion inference, sampling, VAE

For the ecosystem

Composability: Any ggml tool can produce or consume bridge files
UNIX philosophy: Small, focused tools connected by standard interfaces
Future tools (whisper.cpp, bark.cpp, etc.) can join the pipeline without modifying sd.cpp or llama.cpp

Implementation Roadmap

Phase 1: Minimal POC (weeks)

Implement ggmlb reader/writer as a standalone C library (~500 lines)
Patch llama.cpp: add --export-bridge for CLIP text embeddings
Patch sd.cpp: add --bridge-conditioning that reads .ggmlb instead of running internal CLIP
Demo: SD1.5 image generation with CLIP running in llama.cpp

Phase 2: Multi-encoder support (months)

T5-XXL export from llama.cpp
SDXL and SD3 support in sd.cpp (dual CLIP + T5 conditioning from bridge files)
FLUX.2 support

Phase 3: Vision encoders (months)

Qwen3-VL vision embedding export from llama.cpp
Ideogram 4 character reference via bridge
SigLIP, InternVL, and other vision encoders

Alternatives Considered

| Alternative | Why not |
| --- | --- |
| Merge sd.cpp into llama.cpp | Different domains, different maintainers, different release cycles |
| ComfyUI as orchestrator | Heavy Python dependency, GUI-centric, not composable from CLI |
| Shared library (libggml-encode) | Forces tight coupling; doesn't work across language boundaries |
| HTTP API between processes | Overhead for large tensors; unnecessary complexity for local IPC |
| Runtime-only flag (no build modes) | Keeps dead encoder code in every binary; no clean separation path |

Open Questions

Important

Tensor naming convention: Should bridge files use standardized names (e.g., clip_l_hidden_states, t5_encoder_output) or model-specific names? A registry of standard names would improve interoperability.

Important

JOINT mode linking: Should the JOINT binary link llama.cpp statically or dynamically? Static linking produces a single file but increases binary size. Dynamic linking (libllama.so) allows shared updates but adds a deployment dependency.

References

llama.cpp — Transformer inference in C/C++
stable-diffusion.cpp — Diffusion inference in C/C++
GGUF specification — Model file format
Ideogram 4 architecture — Multi-modal DiT with Qwen3-VL encoder
