[RFC/Proposal] End-to-End Architecture & Implementation Plan for Phi-3.5-Vision #17464
khemchand-zetta started this conversation in Ideas
Hi @ggerganov and the `llama.cpp` community,

I am opening this discussion to propose a roadmap for implementing Phi-3-Vision (and Phi-3.5-Vision) support in `llama.cpp`. While the text-only variants of Phi-3 (Mini/Medium) are currently supported and perform exceptionally well, the Vision variant remains unimplemented due to significant architectural divergences from the standard LLaVA pipeline currently available in the library.
I have spent some time deconstructing the `Phi3VForCausalLM` architecture and its specific configuration requirements. Below is an analysis of the existing gaps, the challenges faced by previous attempts, and a proposed end-to-end plan to tackle them.

## 1. Architectural Analysis
The Phi-3-Vision model is not a standard LLaVA clone. It utilizes a specific set of mechanisms for image processing and positional encoding that require custom handling.
### A. The Vision Tower & Image Processing (The "HD Transform")

- **Backbone:** a `CLIP-ViT-L/14` vision encoder.
- **HD Transform:** rather than padding to a single fixed square, the image is dynamically sliced into a variable number of local high-resolution crops plus one downscaled global crop.
- **Ordering (`sub_glb`):** Crucially, the configuration specifies a `sub_glb` order. The local high-res crops are processed first, and the global crop is appended at the end. This is the reverse of many standard implementations.

### B. The Projector & Separators

The encoder outputs are mapped into the decoder's embedding space by an MLP projector, and learnable separator embeddings (stored under `model.vision_embed_tokens`) are inserted between the crop sequences.
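To make the projector/separator idea concrete, here is a toy pure-Python sketch. Everything here is illustrative: the real projector's dimensions and activation, and the actual separator tensor layout, must be read from the checkpoint before porting.

```python
import math

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x ** 3)))

def mlp_project(vec, w1, b1, w2, b2):
    """2-layer MLP projector: vision feature vector -> decoder embedding.
    Pure-Python matrix math with toy shapes (2 -> 3 -> 2)."""
    h = [gelu(sum(v * w for v, w in zip(vec, col)) + b) for col, b in zip(w1, b1)]
    return [sum(hv * w for hv, w in zip(h, col)) + b for col, b in zip(w2, b2)]

# Toy weights: 2-d vision features -> 3-d hidden -> 2-d decoder dim (all assumptions).
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]; b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]];  b2 = [0.0, 0.0]
separator = [9.0, 9.0]  # stand-in for a learnable separator embedding row

crop_tokens = [[1.0, 2.0], [0.5, -0.5]]
projected = [mlp_project(t, w1, b1, w2, b2) for t in crop_tokens]
sequence = projected + [separator]  # separator appended after the crop's token block
print(len(sequence))  # 3 rows: 2 projected tokens + 1 separator
```

The point of the sketch is the data flow (project each crop's tokens, then splice in separator rows), not the arithmetic.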
### C. The Text Decoder & "Su" RoPE

The decoder uses the "Su" scaled-RoPE variant, configured through explicit `short_factor` and `long_factor` arrays in the config. This allows the model to handle context lengths up to 128k, but it fundamentally differs from the `linear` or `yarn` scaling currently implemented in the generic Llama kernels.

### D. Fusion Mechanism
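As a concrete picture of this fusion step, here is a toy sketch of the placeholder expansion. The helper name and the string stand-ins for embedding rows are mine, not from any existing API:

```python
def expand_image_placeholder(tokens, image_token, local_embs, separator, global_emb):
    """Replace each image placeholder with the flattened
    [local embeddings + separator + global embeddings] sequence (sub_glb order)."""
    out = []
    for t in tokens:
        if t == image_token:
            out.extend(local_embs)  # local high-res crops first ...
            out.extend(separator)   # ... then a separator ...
            out.extend(global_emb)  # ... then the global crop, appended last
        else:
            out.append(t)
    return out

# Strings stand in for embedding rows to keep the example readable:
seq = expand_image_placeholder(
    ["<s>", "<|image_1|>", "Describe", "this"],
    "<|image_1|>",
    local_embs=["L0", "L1", "L2", "L3"],
    separator=["SEP"],
    global_emb=["G0"],
)
print(seq)  # ['<s>', 'L0', 'L1', 'L2', 'L3', 'SEP', 'G0', 'Describe', 'this']
```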
Images are referenced in the prompt via placeholder tokens (e.g. `<|image_1|>`). During inference, these placeholders are expanded and replaced by the flattened sequence of [Local Embeddings + Separator + Global Embeddings].

## 2. Previous Context & Challenges
The community has successfully integrated the Phi-3 text models, proving that the tokenizer and block structure are compatible with GGUF. However, the Vision integration has historically stalled due to two main friction points:
- **The `Su` RoPE gap:** the mismatch between the `Su` scaling factors and the default RoPE implementation in `llama.cpp`.
- **Static preprocessing:** the `llava` examples in `llama.cpp` are optimized for static image sizes (padding to squares). Phi-3's requirement to generate a variable number of crops per image requires a rework of the `clip_image_preprocess` function to support dynamic batching within the vision encoder.

## 3. Proposed Implementation Plan
I am interested in contributing to this implementation. To ensure stability and compatibility, I propose tackling this in the following phases:
### Phase 1: The GGUF Conversion Layer

We need to ensure the `convert-hf-to-gguf.py` script captures the unique metadata Phi-3 requires:

- Serialize the `rope_scaling` dictionary (specifically the `short_factor` and `long_factor` arrays) into the GGUF header.
- Map the `model.vision_embed_tokens` tensors (separator weights) and the MLP projector weights to standardized `mmproj` tensor keys.

### Phase 2: The "Su" Kernel (The Core Blocker)
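To frame the discussion, here is my understanding of what a `Su` kernel has to compute, as a toy Python sketch. The per-dimension division of the inverse frequencies and the log-based magnitude correction follow my reading of the Hugging Face reference implementation and should be verified; the factor values below are placeholders, not the real Phi-3 arrays.

```python
import math

def su_inv_freq(head_dim, base, factors):
    """Per-dimension inverse frequencies, each divided by its explicit
    rescale factor. This per-dimension array is what distinguishes 'su'
    from linear/yarn scaling, which use a single scalar."""
    return [1.0 / (factors[i // 2] * base ** (i / head_dim))
            for i in range(0, head_dim, 2)]

def pick_factors(seq_len, original_max_pos, short_factor, long_factor):
    """The long factors kick in only past the original trained context."""
    return long_factor if seq_len > original_max_pos else short_factor

# Toy config: head_dim 8 -> 4 rotary dimension pairs (placeholder values).
head_dim, base = 8, 10000.0
short = [1.0, 1.0, 1.0, 1.0]
long = [1.0, 2.0, 4.0, 8.0]

f = pick_factors(seq_len=131072, original_max_pos=4096,
                 short_factor=short, long_factor=long)
inv = su_inv_freq(head_dim, base, f)

# Magnitude correction applied to cos/sin, per my reading of the reference code:
scale = 131072 / 4096
mscale = math.sqrt(1 + math.log(scale) / math.log(4096))
```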
Before touching the vision part, the text decoder must correctly interpret the `Su` factors. I see two options:

- Introduce a dedicated scaling type (e.g. `LLAMA_ROPE_SCALING_TYPE_SU`).
- Or: can we extend the existing `yarn` implementation to accept these explicit factor arrays?

### Phase 3: Logic Porting for Inference
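Before the C++ port, the crop-grid selection can be prototyped in a few lines of Python. This is my own sketch of the idea (pick the tile grid whose aspect ratio best matches the image, under a crop budget), not a line-by-line port of the HF preprocessor:

```python
def choose_crop_grid(width, height, tile=336, max_crops=16):
    """Pick a (cols, rows) tile grid whose aspect ratio is closest to the
    image's, subject to cols * rows <= max_crops. The image would then be
    resized to (cols*tile, rows*tile) and sliced into local crops."""
    target = width / height
    best, best_err = (1, 1), float("inf")
    for cols in range(1, max_crops + 1):
        for rows in range(1, max_crops // cols + 1):
            err = abs(cols / rows - target)
            # Prefer the closer aspect match; break exact ties with more coverage.
            if err < best_err or (err == best_err and cols * rows > best[0] * best[1]):
                best, best_err = (cols, rows), err
    return best

print(choose_crop_grid(1344, 672))  # wide 2:1 image -> (4, 2)
```

The variable output of this step is exactly why `clip_image_preprocess` needs dynamic batching: a 2:1 image yields 8 local crops here, while a square one fills the whole budget.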
Once the weights are loaded, the `llava` example logic needs adaptation:

- Port the `hd_transform` logic to C++: calculating the optimal crop grid and slicing the input image tensor accordingly.
- Assemble the final embedding sequence in the `sub_glb` order.

## 4. Request for Feedback
Before diving into the code, I would appreciate guidance on the RoPE implementation strategy. Given the complexity of the `Su` factors, is there an existing mechanism in `llama.cpp`'s KV cache management that closely mimics this, or should this be treated as a completely new scaling paradigm?

I look forward to your thoughts and am happy to draft the initial conversion scripts if this approach aligns with the project's goals.
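As a starting point for those conversion scripts, here is a sketch of the config fields Phase 1 would need to pull out and validate. The GGUF key names below are placeholders I made up for illustration; the real keys would follow `llama.cpp`'s naming conventions.

```python
def extract_su_metadata(config: dict) -> dict:
    """Collect the Phi-3 rope-scaling metadata for the GGUF header.
    GGUF key names here are illustrative placeholders only."""
    rs = config["rope_scaling"]
    head_dim = config["hidden_size"] // config["num_attention_heads"]
    short, long_ = rs["short_factor"], rs["long_factor"]
    # Each array holds one factor per rotary dimension pair (head_dim / 2).
    assert len(short) == len(long_) == head_dim // 2, "factor array length mismatch"
    return {
        "phi3.rope.scaling.type": rs["type"],
        "phi3.rope.scaling.short_factor": short,
        "phi3.rope.scaling.long_factor": long_,
        "phi3.context_length": config["max_position_embeddings"],
        "phi3.rope.scaling.original_context_length": config["original_max_position_embeddings"],
    }

# Miniature fake config for demonstration (head_dim 8 -> 4-entry arrays):
cfg = {
    "hidden_size": 64, "num_attention_heads": 8,
    "max_position_embeddings": 131072,
    "original_max_position_embeddings": 4096,
    "rope_scaling": {"type": "su",
                     "short_factor": [1.0, 1.0, 1.0, 1.1],
                     "long_factor": [1.0, 2.0, 4.0, 8.0]},
}
meta = extract_su_metadata(cfg)
print(meta["phi3.rope.scaling.type"])  # su
```

The length check matters: silently truncated or mis-sized factor arrays would corrupt long-context generation in hard-to-debug ways.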
Best regards,
Khemchand Nagar