[RFC/Proposal] End-to-End Architecture & Implementation Plan for Phi-3.5-Vision #17464
khemchand-zetta started this conversation in Ideas
Hi @ggerganov and the `llama.cpp` community,

I am opening this discussion to propose a roadmap for implementing Phi-3-Vision (and Phi-3.5-Vision) support in `llama.cpp`. While the text-only variants of Phi-3 (Mini/Medium) are currently supported and perform exceptionally well, the Vision variant remains unimplemented due to significant architectural divergences from the standard LLaVA pipeline currently available in the library.
I have spent some time deconstructing the `Phi3VForCausalLM` architecture and its specific configuration requirements. Below is an analysis of the existing gaps, the challenges faced by previous attempts, and a proposed end-to-end plan to tackle them.

## 1. Architectural Analysis
The Phi-3-Vision model is not a standard LLaVA clone. It utilizes a specific set of mechanisms for image processing and positional encoding that require custom handling.
### A. The Vision Tower & Image Processing (The "HD Transform")

- **Backbone:** a `CLIP-ViT-L/14` vision encoder.
- **HD Transform:** rather than padding to a single fixed square, the image is dynamically sliced into a variable number of local high-resolution crops plus one downscaled global crop.
- **Ordering (`sub_glb`):** Crucially, the configuration specifies a `sub_glb` order. The local high-res crops are processed first, and the global crop is appended at the end. This is the reverse of many standard implementations.

### B. The Projector & Separators

The encoder outputs are mapped into the decoder's embedding space by an MLP projector, and learnable separator embeddings (stored under `model.vision_embed_tokens`) are inserted between the crop sequences.
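To make the projector/separator idea concrete, here is a toy pure-Python sketch. Everything here is illustrative: the real projector's dimensions and activation, and the actual separator tensor layout, must be read from the checkpoint before porting.

```python
import math

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x ** 3)))

def mlp_project(vec, w1, b1, w2, b2):
    """2-layer MLP projector: vision feature vector -> decoder embedding.
    Pure-Python matrix math with toy shapes (2 -> 3 -> 2)."""
    h = [gelu(sum(v * w for v, w in zip(vec, col)) + b) for col, b in zip(w1, b1)]
    return [sum(hv * w for hv, w in zip(h, col)) + b for col, b in zip(w2, b2)]

# Toy weights: 2-d vision features -> 3-d hidden -> 2-d decoder dim (all assumptions).
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]; b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]];  b2 = [0.0, 0.0]
separator = [9.0, 9.0]  # stand-in for a learnable separator embedding row

crop_tokens = [[1.0, 2.0], [0.5, -0.5]]
projected = [mlp_project(t, w1, b1, w2, b2) for t in crop_tokens]
sequence = projected + [separator]  # separator appended after the crop's token block
print(len(sequence))  # 3 rows: 2 projected tokens + 1 separator
```

The point of the sketch is the data flow (project each crop's tokens, then splice in separator rows), not the arithmetic.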
### C. The Text Decoder & "Su" RoPE

The decoder uses the "Su" scaled-RoPE variant, configured through explicit `short_factor` and `long_factor` arrays in the config. This allows the model to handle context lengths up to 128k, but it fundamentally differs from the `linear` or `yarn` scaling currently implemented in the generic Llama kernels.

### D. Fusion Mechanism
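As a concrete picture of this fusion step, here is a toy sketch of the placeholder expansion. The helper name and the string stand-ins for embedding rows are mine, not from any existing API:

```python
def expand_image_placeholder(tokens, image_token, local_embs, separator, global_emb):
    """Replace each image placeholder with the flattened
    [local embeddings + separator + global embeddings] sequence (sub_glb order)."""
    out = []
    for t in tokens:
        if t == image_token:
            out.extend(local_embs)  # local high-res crops first ...
            out.extend(separator)   # ... then a separator ...
            out.extend(global_emb)  # ... then the global crop, appended last
        else:
            out.append(t)
    return out

# Strings stand in for embedding rows to keep the example readable:
seq = expand_image_placeholder(
    ["<s>", "<|image_1|>", "Describe", "this"],
    "<|image_1|>",
    local_embs=["L0", "L1", "L2", "L3"],
    separator=["SEP"],
    global_emb=["G0"],
)
print(seq)  # ['<s>', 'L0', 'L1', 'L2', 'L3', 'SEP', 'G0', 'Describe', 'this']
```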
Images are referenced in the prompt via placeholder tokens (e.g. `<|image_1|>`). During inference, these placeholders are expanded and replaced by the flattened sequence of [Local Embeddings + Separator + Global Embeddings].

## 2. Previous Context & Challenges
The community has successfully integrated the Phi-3 text models, proving that the tokenizer and block structure are compatible with GGUF. However, the Vision integration has historically stalled due to two main friction points:
- **The `Su` RoPE gap:** the mismatch between the `Su` scaling factors and the default RoPE implementation in `llama.cpp`.
- **Static preprocessing:** the `llava` examples in `llama.cpp` are optimized for static image sizes (padding to squares). Phi-3's requirement to generate a variable number of crops per image requires a rework of the `clip_image_preprocess` function to support dynamic batching within the vision encoder.

## 3. Proposed Implementation Plan
I am interested in contributing to this implementation. To ensure stability and compatibility, I propose tackling this in the following phases:
### Phase 1: The GGUF Conversion Layer

We need to ensure the `convert-hf-to-gguf.py` script captures the unique metadata Phi-3 requires:

- Serialize the `rope_scaling` dictionary (specifically the `short_factor` and `long_factor` arrays) into the GGUF header.
- Map the `model.vision_embed_tokens` tensors (separator weights) and the MLP projector weights to standardized `mmproj` tensor keys.

### Phase 2: The "Su" Kernel (The Core Blocker)
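To frame the discussion, here is my understanding of what a `Su` kernel has to compute, as a toy Python sketch. The per-dimension division of the inverse frequencies and the log-based magnitude correction follow my reading of the Hugging Face reference implementation and should be verified; the factor values below are placeholders, not the real Phi-3 arrays.

```python
import math

def su_inv_freq(head_dim, base, factors):
    """Per-dimension inverse frequencies, each divided by its explicit
    rescale factor. This per-dimension array is what distinguishes 'su'
    from linear/yarn scaling, which use a single scalar."""
    return [1.0 / (factors[i // 2] * base ** (i / head_dim))
            for i in range(0, head_dim, 2)]

def pick_factors(seq_len, original_max_pos, short_factor, long_factor):
    """The long factors kick in only past the original trained context."""
    return long_factor if seq_len > original_max_pos else short_factor

# Toy config: head_dim 8 -> 4 rotary dimension pairs (placeholder values).
head_dim, base = 8, 10000.0
short = [1.0, 1.0, 1.0, 1.0]
long = [1.0, 2.0, 4.0, 8.0]

f = pick_factors(seq_len=131072, original_max_pos=4096,
                 short_factor=short, long_factor=long)
inv = su_inv_freq(head_dim, base, f)

# Magnitude correction applied to cos/sin, per my reading of the reference code:
scale = 131072 / 4096
mscale = math.sqrt(1 + math.log(scale) / math.log(4096))
```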
Before touching the vision part, the text decoder must correctly interpret the `Su` factors. I see two options:

- Introduce a dedicated scaling type (e.g. `LLAMA_ROPE_SCALING_TYPE_SU`).
- Or: can we extend the existing `yarn` implementation to accept these explicit factor arrays?

### Phase 3: Logic Porting for Inference
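Before the C++ port, the crop-grid selection can be prototyped in a few lines of Python. This is my own sketch of the idea (pick the tile grid whose aspect ratio best matches the image, under a crop budget), not a line-by-line port of the HF preprocessor:

```python
def choose_crop_grid(width, height, tile=336, max_crops=16):
    """Pick a (cols, rows) tile grid whose aspect ratio is closest to the
    image's, subject to cols * rows <= max_crops. The image would then be
    resized to (cols*tile, rows*tile) and sliced into local crops."""
    target = width / height
    best, best_err = (1, 1), float("inf")
    for cols in range(1, max_crops + 1):
        for rows in range(1, max_crops // cols + 1):
            err = abs(cols / rows - target)
            # Prefer the closer aspect match; break exact ties with more coverage.
            if err < best_err or (err == best_err and cols * rows > best[0] * best[1]):
                best, best_err = (cols, rows), err
    return best

print(choose_crop_grid(1344, 672))  # wide 2:1 image -> (4, 2)
```

The variable output of this step is exactly why `clip_image_preprocess` needs dynamic batching: a 2:1 image yields 8 local crops here, while a square one fills the whole budget.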
Once the weights are loaded, the `llava` example logic needs adaptation:

- Port the `hd_transform` logic to C++: calculating the optimal crop grid and slicing the input image tensor accordingly.
- Assemble the final embedding sequence in the `sub_glb` order.

## 4. Request for Feedback
Before diving into the code, I would appreciate guidance on the RoPE implementation strategy. Given the complexity of the `Su` factors, is there an existing mechanism in `llama.cpp`'s KV cache management that closely mimics this, or should this be treated as a completely new scaling paradigm?

I look forward to your thoughts and am happy to draft the initial conversion scripts if this approach aligns with the project's goals.
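As a starting point for those conversion scripts, here is a sketch of the config fields Phase 1 would need to pull out and validate. The GGUF key names below are placeholders I made up for illustration; the real keys would follow `llama.cpp`'s naming conventions.

```python
def extract_su_metadata(config: dict) -> dict:
    """Collect the Phi-3 rope-scaling metadata for the GGUF header.
    GGUF key names here are illustrative placeholders only."""
    rs = config["rope_scaling"]
    head_dim = config["hidden_size"] // config["num_attention_heads"]
    short, long_ = rs["short_factor"], rs["long_factor"]
    # Each array holds one factor per rotary dimension pair (head_dim / 2).
    assert len(short) == len(long_) == head_dim // 2, "factor array length mismatch"
    return {
        "phi3.rope.scaling.type": rs["type"],
        "phi3.rope.scaling.short_factor": short,
        "phi3.rope.scaling.long_factor": long_,
        "phi3.context_length": config["max_position_embeddings"],
        "phi3.rope.scaling.original_context_length": config["original_max_position_embeddings"],
    }

# Miniature fake config for demonstration (head_dim 8 -> 4-entry arrays):
cfg = {
    "hidden_size": 64, "num_attention_heads": 8,
    "max_position_embeddings": 131072,
    "original_max_position_embeddings": 4096,
    "rope_scaling": {"type": "su",
                     "short_factor": [1.0, 1.0, 1.0, 1.1],
                     "long_factor": [1.0, 2.0, 4.0, 8.0]},
}
meta = extract_su_metadata(cfg)
print(meta["phi3.rope.scaling.type"])  # su
```

The length check matters: silently truncated or mis-sized factor arrays would corrupt long-context generation in hard-to-debug ways.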
Best regards,
Khemchand Nagar