11 changes: 10 additions & 1 deletion tools/mtmd/mtmd-cli.cpp
@@ -265,6 +265,15 @@ static int eval_message(mtmd_cli_context & ctx, common_chat_msg & msg) {
     return 0;
 }

+static std::string insert_default_marker(mtmd_context * ctx, const std::string & msg) {
Collaborator

Not quite sure this works when the user input is text-image-text-image.

Reading the code, I assume that it will produce image-text-image-text instead, while we want image-image-text-text.

I think I will have to rework the loop that handles user input a bit; I will push a commit for that. The rest of the changes seem OK to me.

Contributor Author

@tdakhran tdakhran Nov 30, 2025

I ran a multiturn test with the PyTorch code from HF, for the conversation

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "OCR first image."},
            {"type": "image", "image": image1},
            {"type": "text", "text": "OCR second image."},
        ],
    },
]

It produces

<|startoftext|><|im_start|>user
<|image_start|><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><|image_end|>OCR first image.<|image_start|><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><|image_end|>OCR second image.<|im_end|>
<|im_start|>assistant

The order looks like <image0>text0<image1>text1

UPD: it just follows the order in the conversation.
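
To picture "follows the order in the conversation" on the llama.cpp side, here is a minimal sketch that splits a marked prompt into ordered text and media slots. It is illustrative only and is not the mtmd tokenizer: the `chunk` struct and `split_by_marker` are hypothetical names, and the marker value `<__media__>` is assumed from the `-p "<__media__>OCR."` example in this thread.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical chunk type for illustration; not one of the mtmd structs.
struct chunk {
    bool        is_media; // true for a media slot, false for a text span
    std::string text;     // empty for media slots
};

// Split `prompt` on every occurrence of `marker`, preserving the original order.
// Empty text spans (e.g. two adjacent markers) are skipped.
static std::vector<chunk> split_by_marker(const std::string & prompt,
                                          const std::string & marker = "<__media__>") {
    std::vector<chunk> out;
    if (marker.empty()) {
        out.push_back({false, prompt}); // nothing to split on
        return out;
    }
    size_t pos = 0;
    while (true) {
        const size_t hit = prompt.find(marker, pos);
        const size_t len = hit == std::string::npos ? std::string::npos : hit - pos;
        const std::string text = prompt.substr(pos, len);
        if (!text.empty()) {
            out.push_back({false, text});
        }
        if (hit == std::string::npos) {
            break;
        }
        out.push_back({true, ""});
        pos = hit + marker.size();
    }
    return out;
}
```

Under these assumptions, `split_by_marker("<__media__>OCR first image.<__media__>OCR second image.")` yields media, text, media, text, which matches the `<image0>text0<image1>text1` layout above.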

Collaborator

Hmm, do you have an example of how the model was trained with multi-image input? (I guess you have access to the internal dataset, right?)

Contributor Author

I'll ask the training team and get back with a reply.

Contributor Author

@ngxson, I got the response:

The model is trained on an arbitrary number and order of images interleaved with text.

I suggest keeping everything else as is and only addressing the default behaviour of

bin/llama-mtmd-cli -m $CKPT/LFM2-VL-1.6B-F32.gguf  --mmproj $CKPT/mmproj-LFM2-VL-1.6B-F32.gguf --image siglip_1024.png -p "OCR."

wdyt?

Collaborator

@ngxson ngxson Dec 1, 2025

In other words, only the if (is_single_turn) branch in mtmd-cli needs to be changed:

params.prompt += mtmd_default_marker();

// changes to
params.prompt = mtmd_default_marker() + params.prompt;
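
As a rough illustration of what this one-line change does to the final prompt, the sketch below contrasts the two loops. It is self-contained and does not link against mtmd: `default_marker()` is a stand-in for `mtmd_default_marker()`, its value `<__media__>` is an assumption taken from the `-p` example in this thread, and `append_markers` / `prepend_markers` are hypothetical helper names.

```cpp
#include <cstddef>
#include <string>

// Stand-in for mtmd_default_marker(); the value "<__media__>" is assumed.
static const char * default_marker() { return "<__media__>"; }

// Current behaviour: one marker is appended per image.
static std::string append_markers(std::string prompt, size_t n_images) {
    for (size_t i = 0; i < n_images; i++) {
        prompt += default_marker();
    }
    return prompt;
}

// Suggested behaviour: one marker is prepended per image.
static std::string prepend_markers(std::string prompt, size_t n_images) {
    for (size_t i = 0; i < n_images; i++) {
        prompt = default_marker() + prompt;
    }
    return prompt;
}
```

With one image and the prompt `OCR.`, the first helper gives `OCR.<__media__>` and the second gives `<__media__>OCR.`. With two images the second helper gives `<__media__><__media__>OCR.`, so all images precede the text.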

Contributor Author

Yes, that's what is implemented in the PR. It's essentially an erratum for a specific use case for LFM2-VL.

Collaborator

I doubt that this is actually a specific use case, as it is technically a problem that arises when preparing the dataset (i.e. consistency of image placement), so I think we should not add an API just to fix a one-off problem. There is always a chance that this gets fixed in an upcoming version of LFM2.

I think we have 2 options now:

  1. Implement the change (without any API changes) for all models, not just LFM2
  2. Or leave the CLI code as-is, because most users will use it via the API anyway

Collaborator

@ngxson ngxson Dec 1, 2025

I'll try option (1) a bit later to see if it negatively affects other models, but I hope this won't be a blocker on your side. As I mentioned, we can probably just fix it here in the CLI, but nothing prevents an API user from placing text before the image. So the ultimate fix is to refine the dataset.

Contributor Author

@tdakhran tdakhran Dec 1, 2025

Agreed, an API change is overkill to fix a single model. I also confirmed that upcoming models will be less sensitive to image placement. It would be great if we could proceed with option (1); otherwise, let's keep things as is.
-p "<__media__>OCR." is also an option that works.
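
The explicit-marker workaround relies on the CLI only adding markers when the prompt contains none. A minimal sketch of that check, assuming the marker string is `<__media__>` and using a hypothetical helper name `ensure_markers` (the fallback mirrors the existing append loop, not the change proposed in this PR):

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: leaves the prompt untouched when the user already
// placed a marker, otherwise appends one marker per image.
static std::string ensure_markers(const std::string & prompt,
                                  size_t              n_images,
                                  const std::string & marker = "<__media__>") {
    if (prompt.find(marker) != std::string::npos) {
        return prompt; // explicit placement by the user wins
    }
    std::string out = prompt;
    for (size_t i = 0; i < n_images; i++) {
        out += marker;
    }
    return out;
}
```

Under these assumptions, `ensure_markers("<__media__>OCR.", 1)` returns the prompt unchanged, so the image lands before the text without any code change, while `ensure_markers("OCR.", 1)` returns `OCR.<__media__>`.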

+    switch (mtmd_get_default_marker_placement(ctx)) {
+        case MTMD_DEFAULT_MARKER_PLACEMENT_BEGIN: return mtmd_default_marker() + msg;
+        case MTMD_DEFAULT_MARKER_PLACEMENT_NONE:
+        case MTMD_DEFAULT_MARKER_PLACEMENT_END:
+        default: return msg + mtmd_default_marker();
+    }
+}
+
 int main(int argc, char ** argv) {
     ggml_time_init();

@@ -313,7 +322,7 @@ int main(int argc, char ** argv) {
         g_is_generating = true;
         if (params.prompt.find(mtmd_default_marker()) == std::string::npos) {
             for (size_t i = 0; i < params.image.size(); i++) {
-                params.prompt += mtmd_default_marker();
+                params.prompt = insert_default_marker(ctx.ctx_vision.get(), params.prompt);
             }
         }
         common_chat_msg msg;
8 changes: 8 additions & 0 deletions tools/mtmd/mtmd.cpp
@@ -1099,3 +1099,11 @@ void mtmd_log_set(ggml_log_callback log_callback, void * user_data) {
     g_logger_state.log_callback = log_callback ? log_callback : clip_log_callback_default;
     g_logger_state.log_callback_user_data = user_data;
 }
+
+mtmd_default_marker_placement mtmd_get_default_marker_placement(mtmd_context * ctx) {
+    if (ctx && ctx->ctx_v && clip_get_projector_type(ctx->ctx_v) == PROJECTOR_TYPE_LFM2) {
+        return MTMD_DEFAULT_MARKER_PLACEMENT_BEGIN;
+    }
+
+    return MTMD_DEFAULT_MARKER_PLACEMENT_NONE;
+}
9 changes: 8 additions & 1 deletion tools/mtmd/mtmd.h
@@ -51,6 +51,12 @@ enum mtmd_input_chunk_type {
     MTMD_INPUT_CHUNK_TYPE_AUDIO,
 };

+enum mtmd_default_marker_placement {
+    MTMD_DEFAULT_MARKER_PLACEMENT_NONE, // place media marker freely inside the message
+    MTMD_DEFAULT_MARKER_PLACEMENT_BEGIN, // place media marker in the beginning of the message
+    MTMD_DEFAULT_MARKER_PLACEMENT_END, // place media marker in the end of the message
+};
+
 // opaque types
 struct mtmd_context;
 struct mtmd_bitmap;
@@ -88,7 +94,8 @@ struct mtmd_context_params {
     int image_max_tokens; // maximum number of tokens for image input (default: read from metadata)
 };

-MTMD_API const char * mtmd_default_marker(void);
+MTMD_API const char * mtmd_default_marker(void);
+MTMD_API mtmd_default_marker_placement mtmd_get_default_marker_placement(mtmd_context * ctx);

 MTMD_API struct mtmd_context_params mtmd_context_params_default(void);