
Conversation

@tdakhran
Contributor

cont: #17577

LFM2-VL is sensitive to media marker placement and expects image embeddings to be placed before the text.

By default,

bin/llama-mtmd-cli -m $CKPT/LFM2-VL-1.6B-F32.gguf  --mmproj $CKPT/mmproj-LFM2-VL-1.6B-F32.gguf --image /data/playground/issue_17290/siglip_1024.png -p "OCR."

places a media marker after the message.

Following @ngxson's suggestion, a C-compatible API is introduced to control the default media marker placement.

Affects only LFM2-VL.
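
For illustration, here is a minimal standalone sketch of the difference in how the single-turn prompt is assembled (the stub below stands in for mtmd_default_marker() from mtmd.h; names are illustrative, not the actual CLI code):

#include <cstdio>
#include <string>

// Stand-in for mtmd_default_marker() from mtmd.h, used here only so the sketch compiles on its own.
static const char * default_marker() { return "<__media__>"; }

int main() {
    const std::string user_text = "OCR.";

    // previous default: marker appended, so image embeddings follow the text
    const std::string before = user_text + default_marker();              // "OCR.<__media__>"

    // new default for LFM2-VL: marker prepended, image embeddings come first
    const std::string after  = std::string(default_marker()) + user_text; // "<__media__>OCR."

    std::printf("before: %s\nafter:  %s\n", before.c_str(), after.c_str());
    return 0;
}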

For the image siglip_1024, the output of

bin/llama-mtmd-cli -m $CKPT/LFM2-VL-1.6B-F32.gguf  --mmproj $CKPT/mmproj-LFM2-VL-1.6B-F32.gguf --image /data/playground/issue_17290/siglip_1024.png -p "OCR."

before change:

It looks like you're describing a system for converting input images into token sequences using specific encoders. Here's a more detailed and structured version of your description:
...

after change:

For the vision tower, LEM2-VL uses SigilLP2 NaFlex encoders to convert input images into token sequences. Two variants are implemented:

@tdakhran tdakhran requested a review from ngxson as a code owner November 30, 2025 11:17
@tdakhran tdakhran mentioned this pull request Nov 30, 2025
@tdakhran
Contributor Author

Debugging was done in #17290 (comment)

@ngxson
Collaborator

ngxson commented Nov 30, 2025

Small question though, in the case below:

> This is the first step:
> /image step1.png
> Then the next step:
> /image step2.png
> What do you see?

We still expect to place images at the beginning, is that correct?

For example, the formatted chat is:

<img1><img2>
This is the first step:
Then the next step:
What do you see?

(If yes, it's probably because the dataset of this model does not include such a case, am I correct?)

@tdakhran
Contributor Author

> Small question though, in the case below:
>
> > This is the first step:
> > /image step1.png
> > Then the next step:
> > /image step2.png
> > What do you see?
>
> We still expect to place images at the beginning, is that correct?
>
> For example, the formatted chat is:
>
> <img1><img2>
> This is the first step:
> Then the next step:
> What do you see?
>
> (If yes, it's probably because the dataset of this model does not include such a case, am I correct?)

Yes, it seems like a training data issue.

Multi-turn CLI works for OCR if the order is preserved:

> /image /data/playground/issue_17290/siglip_1024.png
/data/playground/issue_17290/siglip_1024.png image loaded

> OCR.
encoding image slice...
image slice encoded in 762 ms
decoding image batch 1/1, n_tokens_batch = 105
image decoded (batch 1/1) in 295 ms

For the vision tower, LEM2-VL uses Sigil LP2 NaFlex encoders to convert input images into token sequences. Two variants are implemented:

return 0;
}

static std::string insert_default_marker(mtmd_context * ctx, const std::string & msg) {
Collaborator

not quite sure if this works in the case where the user input is text-image-text-image

reading the code, I assume that it will produce image-text-image-text instead, while we want image-image-text-text

I think I will have to rework the loop that handles user input a bit; I'll push a commit for that. The rest of the changes seem OK to me.
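
As a rough illustration of that intended reordering, here is a hypothetical helper (not the actual mtmd-cli code) that gathers image markers before text:

#include <string>
#include <vector>

// Hypothetical sketch: collect every image marker first, then the text parts,
// so interleaved input becomes image-image-text-text instead of image-text-image-text.
struct chat_part {
    bool        is_image;
    std::string text; // empty for image parts
};

static std::string build_prompt_images_first(const std::vector<chat_part> & parts,
                                             const std::string & marker) {
    std::string images;
    std::string texts;
    for (const auto & p : parts) {
        if (p.is_image) {
            images += marker;        // one marker per image, all moved to the front
        } else {
            texts  += p.text + "\n"; // text keeps its original relative order
        }
    }
    return images + texts;
}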

Contributor Author

@tdakhran tdakhran Nov 30, 2025

Ran a multi-turn test with the PyTorch code from HF, for the conversation

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "OCR first image."},
            {"type": "image", "image": image1},
            {"type": "text", "text": "OCR second image."},
        ],
    },
]

It produces

<|startoftext|><|im_start|>user
<|image_start|><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><|image_end|>OCR first image.<|image_start|><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><|image_end|>OCR second image.<|im_end|>
<|im_start|>assistant

The order looks like <image0>text0<image1>text1

UPD: it just follows the order in conversation.

Collaborator

hmm, do you have an example of how the model was trained with multiple image inputs? (I guess you have access to the internal dataset, right?)

Contributor Author

I'll ask the training team and get back with a reply.

Contributor Author

@ngxson, I got the response:

> The model is trained on an arbitrary number and order of images interleaved with text.

I suggest keeping everything else as is and only addressing the default behaviour of

bin/llama-mtmd-cli -m $CKPT/LFM2-VL-1.6B-F32.gguf  --mmproj $CKPT/mmproj-LFM2-VL-1.6B-F32.gguf --image siglip_1024.png -p "OCR."

wdyt?

Collaborator

@ngxson ngxson Dec 1, 2025

In other words, only the if (is_single_turn) branch in mtmd-cli needs to be changed:

params.prompt += mtmd_default_marker();

// changes to
params.prompt = mtmd_default_marker() + params.prompt;

Contributor Author

Yes, that's what is implemented in this PR. It's essentially a workaround for a specific LFM2-VL use case.

Collaborator

I doubt that this is actually a specific use case, as it is technically a problem in dataset preparation (i.e. consistency of image placement); so I think we should not add an API just to fix a one-off problem. There is always a risk that this will get fixed in an upcoming version of LFM2.

I think we have 2 options now:

  1. Implement the change (without any API changes) to all models, not just LFM2
  2. Or, leave the CLI code as-is because most users will use it via API anyway

Collaborator

@ngxson ngxson Dec 1, 2025

I'll try option (1) a bit later to see if it negatively affects other models. But I hope this won't be a roadblock on your side. As I mentioned, we can probably just fix this here in the CLI, but there is nothing preventing an API user from placing text before images. So the ultimate fix is to refine the dataset.

Contributor Author

@tdakhran tdakhran Dec 1, 2025

Agreed, an API change is overkill for fixing a single model. I also confirmed that upcoming models will be less sensitive to image placement. It would be great if we could proceed with option (1); otherwise, let's keep things as is.
-p "<__media__>OCR." is also an option that works.
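
For reference, that flag combined with the command from the PR description (paths as in the earlier examples) would be:

bin/llama-mtmd-cli -m $CKPT/LFM2-VL-1.6B-F32.gguf --mmproj $CKPT/mmproj-LFM2-VL-1.6B-F32.gguf --image siglip_1024.png -p "<__media__>OCR."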
