Feature Request: add mtmd functions to get vision image size and patch size #16703

@deadprogram

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

It would be very helpful to be able to determine the expected image size and patch size for vision models.

This information is already available internally; it is just not exposed through a convenient function, the way the audio bitrate is exposed by mtmd_get_audio_bitrate.

I propose adding 2 new functions:

// get vision image size in pixels, for example 1024
// return -1 if vision is not supported
MTMD_API int mtmd_get_vision_image_size(mtmd_context * ctx);

// get vision patch size, for example 14
// return -1 if vision is not supported
MTMD_API int mtmd_get_vision_patch_size(mtmd_context * ctx);

Motivation

This will make it easier to do any image preprocessing before calling into the projector/model.

Possible Implementation

int mtmd_get_vision_image_size(mtmd_context * ctx) {
    if (!ctx->ctx_v) {
        return -1;
    }

    return clip_get_image_size(ctx->ctx_v);
}

int mtmd_get_vision_patch_size(mtmd_context * ctx) {
    if (!ctx->ctx_v) {
        return -1;
    }

    return clip_get_patch_size(ctx->ctx_v);
}
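
As an illustration of the kind of preprocessing this would enable, here is a minimal sketch of how a caller might turn the two proposed values into a patch grid. The helpers `patches_per_side` and `total_patches` are hypothetical and are not part of mtmd. They take plain integers so the sketch compiles without the library, and they assume a square input. The number of tokens a model actually consumes depends on the projector, so treat the result only as the raw patch count.

```c
// Hypothetical helper, not part of mtmd: number of patches along one side
// of a square input, given the values the proposed getters would return.
// Returns -1 if either value is -1 (vision not supported) or otherwise invalid.
static int patches_per_side(int image_size, int patch_size) {
    if (image_size <= 0 || patch_size <= 0) {
        return -1;
    }
    // Integer division: any remainder is dropped.
    return image_size / patch_size;
}

// Hypothetical helper: total raw patches for an image_size x image_size input.
// Propagates -1 when the inputs are invalid.
static int total_patches(int image_size, int patch_size) {
    int n = patches_per_side(image_size, patch_size);
    return n < 0 ? -1 : n * n;
}
```

With the proposed API in place, a caller would pass `mtmd_get_vision_image_size(ctx)` and `mtmd_get_vision_patch_size(ctx)` as the two arguments and treat a -1 result as "no vision support".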

Labels: enhancement (New feature or request)
