
Conversation

@tdakhran
Contributor

cont: #17577

LFM2-VL is sensitive to media marker placement and expects image embeddings to be placed before the text.

By default,

bin/llama-mtmd-cli -m $CKPT/LFM2-VL-1.6B-F32.gguf  --mmproj $CKPT/mmproj-LFM2-VL-1.6B-F32.gguf --image /data/playground/issue_17290/siglip_1024.png -p "OCR."

places a media marker after the message.

Following @ngxson's suggestion, a C-compatible API is introduced to control the default media marker placement.

Affects only LFM2-VL.
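
For illustration, here is a minimal standalone sketch of the difference in how the single-turn prompt is assembled (the stub below stands in for mtmd_default_marker() from mtmd.h; names are illustrative, not the actual CLI code):

#include <cstdio>
#include <string>

// Stand-in for mtmd_default_marker() from mtmd.h, used here only so the sketch compiles on its own.
static const char * default_marker() { return "<__media__>"; }

int main() {
    const std::string user_text = "OCR.";

    // previous default: marker appended, so image embeddings follow the text
    const std::string before = user_text + default_marker();              // "OCR.<__media__>"

    // new default for LFM2-VL: marker prepended, image embeddings come first
    const std::string after  = std::string(default_marker()) + user_text; // "<__media__>OCR."

    std::printf("before: %s\nafter:  %s\n", before.c_str(), after.c_str());
    return 0;
}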

For the image siglip_1024, the output of

bin/llama-mtmd-cli -m $CKPT/LFM2-VL-1.6B-F32.gguf  --mmproj $CKPT/mmproj-LFM2-VL-1.6B-F32.gguf --image /data/playground/issue_17290/siglip_1024.png -p "OCR."

before change:

It looks like you're describing a system for converting input images into token sequences using specific encoders. Here's a more detailed and structured version of your description:
...

after change:

For the vision tower, LEM2-VL uses SigilLP2 NaFlex encoders to convert input images into token sequences. Two variants are implemented:

@tdakhran tdakhran requested a review from ngxson as a code owner November 30, 2025 11:17
@tdakhran tdakhran mentioned this pull request Nov 30, 2025
@tdakhran
Contributor Author

Debugging was done in #17290 (comment)

@ngxson
Collaborator

ngxson commented Nov 30, 2025

Small question though, in the case below:

> This is the first step:
> /image step1.png
> Then the next step:
> /image step2.png
> What do you see?

We still expect to place images at the beginning, is that correct?

For example, the formatted chat is:

<img1><img2>
This is the first step:
Then the next step:
What do you see?

(If yes, it's probably because the dataset of this model does not include such a case, am I correct?)

@tdakhran
Contributor Author

> Small question though, in the case below:
>
> > This is the first step:
> > /image step1.png
> > Then the next step:
> > /image step2.png
> > What do you see?
>
> We still expect to place images at the beginning, is that correct?
>
> For example, the formatted chat is:
>
> <img1><img2>
> This is the first step:
> Then the next step:
> What do you see?
>
> (If yes, it's probably because the dataset of this model does not include such a case, am I correct?)

Yes, it seems like a training data issue.

Multi-turn CLI works for OCR if the order is preserved:

> /image /data/playground/issue_17290/siglip_1024.png
/data/playground/issue_17290/siglip_1024.png image loaded

> OCR.
encoding image slice...
image slice encoded in 762 ms
decoding image batch 1/1, n_tokens_batch = 105
image decoded (batch 1/1) in 295 ms

For the vision tower, LEM2-VL uses Sigil LP2 NaFlex encoders to convert input images into token sequences. Two variants are implemented:

return 0;
}

static std::string insert_default_marker(mtmd_context * ctx, const std::string & msg) {
Collaborator

not quite sure if this works in the case where the user input is text-image-text-image

reading the code, I assume that it will produce image-text-image-text instead, while we want image-image-text-text

I think I will have to rework the loop that handles user input a bit; I'll push a commit for that. The rest of the changes seem OK to me.
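
As a rough illustration of that intended reordering, here is a hypothetical helper (not the actual mtmd-cli code) that gathers image markers before text:

#include <string>
#include <vector>

// Hypothetical sketch: collect every image marker first, then the text parts,
// so interleaved input becomes image-image-text-text instead of image-text-image-text.
struct chat_part {
    bool        is_image;
    std::string text; // empty for image parts
};

static std::string build_prompt_images_first(const std::vector<chat_part> & parts,
                                             const std::string & marker) {
    std::string images;
    std::string texts;
    for (const auto & p : parts) {
        if (p.is_image) {
            images += marker;        // one marker per image, all moved to the front
        } else {
            texts  += p.text + "\n"; // text keeps its original relative order
        }
    }
    return images + texts;
}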

Contributor Author

@tdakhran tdakhran Nov 30, 2025

Ran a multi-turn test with the PyTorch code from HF, for the conversation

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "OCR first image."},
            {"type": "image", "image": image1},
            {"type": "text", "text": "OCR second image."},
        ],
    },
]

It produces

<|startoftext|><|im_start|>user
<|image_start|><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><|image_end|>OCR first image.<|image_start|><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><|image_end|>OCR second image.<|im_end|>
<|im_start|>assistant

The order looks like <image0>text0<image1>text1

UPD: it just follows the order in conversation.

Collaborator

hmm, do you have an example of how the model was trained with multiple image inputs? (I guess you have access to the internal dataset, right?)

Contributor Author

I'll ask the training team and get back with a reply.

Contributor Author

@ngxson, I got the response:

> The model is trained on an arbitrary number and order of images interleaved with text.

I suggest keeping everything else as is and only addressing the default behaviour of

bin/llama-mtmd-cli -m $CKPT/LFM2-VL-1.6B-F32.gguf  --mmproj $CKPT/mmproj-LFM2-VL-1.6B-F32.gguf --image siglip_1024.png -p "OCR."

wdyt?

Collaborator

@ngxson ngxson Dec 1, 2025

In other words, only the if (is_single_turn) branch in mtmd-cli needs to be changed:

params.prompt += mtmd_default_marker();

// changes to
params.prompt = mtmd_default_marker() + params.prompt;

Contributor Author

Yes, that's what is implemented in this PR. It's essentially a workaround for a specific LFM2-VL use case.

Collaborator

I doubt that this is actually a specific use case, as it is technically a problem in dataset preparation (i.e. consistency of image placement); so I think we should not add an API just to fix a one-off problem. There is always a risk that this will get fixed in an upcoming version of LFM2.

I think we have 2 options now:

  1. Implement the change (without any API changes) to all models, not just LFM2
  2. Or, leave the CLI code as-is because most users will use it via API anyway

Collaborator

@ngxson ngxson Dec 1, 2025

I'll try option (1) a bit later to see if it negatively affects other models. But I hope this won't be a roadblock on your side. As I mentioned, we can probably just fix this here in the CLI, but there is nothing preventing an API user from placing text before images. So the ultimate fix is to refine the dataset.

Contributor Author

@tdakhran tdakhran Dec 1, 2025

Agreed, an API change is overkill for fixing a single model. I also confirmed that upcoming models will be less sensitive to image placement. It would be great if we could proceed with option (1); otherwise, let's keep things as is.
-p "<__media__>OCR." is also an option that works.
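
For reference, that flag combined with the command from the PR description (paths as in the earlier examples) would be:

bin/llama-mtmd-cli -m $CKPT/LFM2-VL-1.6B-F32.gguf --mmproj $CKPT/mmproj-LFM2-VL-1.6B-F32.gguf --image siglip_1024.png -p "<__media__>OCR."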
