model : Fix marker placement for LFM2-VL in single turn llama-mtmd-cli #17616
Conversation
Debugging was done in #17290 (comment).
Small question though: in the case below, we still expect to place images at the beginning, is that correct? For example, the formatted chat is: (If yes, it is probably because the dataset of this model does not include such a case, am I correct?)
Yes, it seems like a training data issue. Multiturn CLI works for OCR if the order is preserved.
```cpp
    return 0;
}

static std::string insert_default_marker(mtmd_context * ctx, const std::string & msg) {
```
not quite sure if this works in the case the user sends text-image-text-image
reading the code, I assume that it will produce image-text-image-text instead, while we want image-image-text-text
I think I will have to rework a bit the loop that handles user input, will push a commit for that. The rest of the changes seem OK to me.
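For illustration, a self-contained sketch of hoisting every marker to the front of a user message (`hoist_markers` and its marker argument are hypothetical, not code from this PR):

```cpp
#include <string>

// Move every occurrence of the media marker to the front of the message,
// so "text1<marker>text2<marker>" becomes "<marker><marker>text1text2"
// (image-image-text-text ordering).
static std::string hoist_markers(const std::string & msg, const std::string & marker) {
    std::string text;
    size_t n_markers = 0;
    size_t pos = 0;
    while (pos < msg.size()) {
        const size_t found = msg.find(marker, pos);
        if (found == std::string::npos) {
            text += msg.substr(pos); // trailing text after the last marker
            break;
        }
        text += msg.substr(pos, found - pos); // text before this marker
        n_markers++;
        pos = found + marker.size();
    }
    std::string out;
    for (size_t i = 0; i < n_markers; i++) {
        out += marker;
    }
    return out + text;
}
```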
Ran multiturn with the PyTorch code from HF, for the conversation:

```python
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "OCR first image."},
            {"type": "image", "image": image1},
            {"type": "text", "text": "OCR second image."},
        ],
    },
]
```
It produces:
```
<|startoftext|><|im_start|>user
<|image_start|><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><|image_end|>OCR first image.<|image_start|><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><|image_end|>OCR second image.<|im_end|>
<|im_start|>assistant
```
The order looks like `<image0>text0<image1>text1`.
UPD: it just follows the order in the conversation.
hmm, do you have an example of how the model was trained with multiple image inputs? (I guess you have access to the internal dataset, right?)
I'll ask the training team and get back with a reply.
@ngxson, I got the response:

> The model is trained on an arbitrary number and order of images interleaved with text.

Suggest keeping everything else as is and only addressing the default behaviour of

```
bin/llama-mtmd-cli -m $CKPT/LFM2-VL-1.6B-F32.gguf --mmproj $CKPT/mmproj-LFM2-VL-1.6B-F32.gguf --image siglip_1024.png -p "OCR."
```

wdyt?
In other words, only the `if (is_single_turn)` branch in mtmd-cli needs to be changed:

```cpp
params.prompt += mtmd_default_marker();
// changes to
params.prompt = mtmd_default_marker() + params.prompt;
```
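As a self-contained sketch of that behaviour (`ensure_leading_marker` is a hypothetical name, not the PR's code):

```cpp
#include <string>

// Prepend the marker only when the prompt does not already contain one,
// so image embeddings precede the text; an explicit user-placed marker
// is left where it is.
static std::string ensure_leading_marker(const std::string & prompt, const std::string & marker) {
    if (prompt.find(marker) != std::string::npos) {
        return prompt; // user already placed a marker explicitly
    }
    return marker + prompt;
}
```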
Yes, that's what is implemented in the PR. It's essentially a workaround for a specific LFM2-VL use case.
I doubt that this is actually a specific use case; it is technically a problem in how the dataset was prepared (i.e. consistency of image placement), so I think we should not add an API just to fix a one-off problem. There is always a risk that this will get fixed in an upcoming version of LFM2.
I think we have 2 options now:
- Implement the change (without any API changes) for all models, not just LFM2
- Or, leave the CLI code as-is, because most users will use it via the API anyway
I'll try option (1) a bit later to see if it negatively affects other models. But I hope this won't be a roadblock on your side. As I mentioned, we can probably just fix this here in the CLI, but there is nothing preventing API users from placing text before images. So the ultimate fix is to refine the dataset.
Agree, an API change is overkill to fix a single model. I also confirmed that upcoming models will be less sensitive to image placement. It would be great if we could proceed with option (1); otherwise, let's keep things as is.
`-p "<__media__>OCR."` is also an option that works.
cont: #17577
LFM2-VL is sensitive to media marker placement and expects image embeddings to be placed before the text.
By default, llama-mtmd-cli appends the media marker after the message.
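Concretely, with the default marker `<__media__>`, the single-turn prompt changes as follows:

```
before: OCR.<__media__>   // marker appended, image embeddings follow the text
after:  <__media__>OCR.   // marker prepended, image embeddings precede the text
```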
At @ngxson's suggestion, a C-compatible API is introduced to control default media marker placement. Affects only LFM2-VL.
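For context, such a C-compatible knob might look something like this (a sketch; the names are purely illustrative, not the actual PR interface):

```cpp
// Hypothetical: controls where the default marker goes when a message
// contains no explicit media marker.
enum mtmd_marker_placement {
    MTMD_MARKER_PLACEMENT_APPEND,  // current behaviour: marker after the text
    MTMD_MARKER_PLACEMENT_PREPEND, // what LFM2-VL expects: marker before the text
};
```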
For the example image, screenshots of the command's output before and after the change were attached (not reproduced here).