Pull image processing out of the LLM/infrastructure? #5667

Closed
ericwj opened this issue Nov 19, 2024 · 4 comments

@ericwj

ericwj commented Nov 19, 2024

The first thing I try with M.E.AI, using Ollama with llama3.2-vision, is to send a 16-bit TIF.

The second thing I try is converting to PNG and asking about EXIF metadata: really basic stuff, like what the resolution of the image is. And where in the image is the girl? So, what are her approximate coördinates?

Now that might just be novice naivety, since the LLM isn't good at that whatsoever, and I did pick a file with a huge resolution that is nearly incompressible because it is 16-bit, but it suggests two things to me:

  • Why in the world would the server need any specific file format? Shouldn't it be able to receive a raw RGB buffer: a width, a height, and a (compressed/strided) frame? That way M.E.AI could offer the most efficient way of doing that, and it would be pluggable so it could support my TIF. TIF is a very basic format, and it is what comes out of my photo scanner. Others will want WebP, or AV1. And it shouldn't matter which LLM I use.

  • Why in the world would it be smart to send that image over every time I add more history? I think it's very wasteful, even twice. Perhaps it should give me an embedding to send over, or some ID to send on the second reference to the same data? I even think it works like that, with caching on the Ollama side, but at the very least it wastes my bandwidth to have to re-send the image as fast as the user can press Enter. Sure, I know I can protect against misuse, especially if it's me pressing Enter, but that's not the point I'm making.

To be fair, it can tell me what is wrong with the image (it's old and very purple) and it does a fairly good job of suggesting what I should do to fix it. And since I am typing here anyway: does it know anything technical about the image, like the resolution? It insists the image is 1024x720, which is certainly not true and is also not the correct aspect ratio. But I have not a single clue who touched my image, and what they did, before it went into the LLM.

@ericwj
Author

ericwj commented Nov 19, 2024

Microsoft.Extensions.AI

@stephentoub
Member

Why in the world would the server need any specific file format? Shouldn't it be able to receive a raw RGB buffer: a width, a height, and a (compressed/strided) frame? That way M.E.AI could offer the most efficient way of doing that, and it would be pluggable so it could support my TIF. TIF is a very basic format, and it is what comes out of my photo scanner. Others will want WebP, or AV1. And it shouldn't matter which LLM I use.

I'm not clear on what you're asking for. Are you asking for the M.E.AI libraries to be able to infer the image type from the bytes and then normalize the image format to one the target LLM natively understands?
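
(To make the question concrete: by "infer the image type from the bytes" I mean something like the sketch below, which just sniffs a few well-known magic numbers. The helper is purely illustrative and is not an existing M.E.AI API.)

```csharp
// Illustrative only: detect a media type from the first few bytes of an image.
// A real implementation would need to cover many more formats.
static string? SniffImageMediaType(ReadOnlySpan<byte> data)
{
    if (data.Length >= 4 && data[0] == 0x89 && data[1] == 0x50 && data[2] == 0x4E && data[3] == 0x47)
        return "image/png";                                   // \x89PNG

    if (data.Length >= 3 && data[0] == 0xFF && data[1] == 0xD8 && data[2] == 0xFF)
        return "image/jpeg";                                  // JPEG SOI marker

    if (data.Length >= 4 &&
        ((data[0] == 0x49 && data[1] == 0x49 && data[2] == 0x2A && data[3] == 0x00) ||  // "II*\0", little-endian TIFF
         (data[0] == 0x4D && data[1] == 0x4D && data[2] == 0x00 && data[3] == 0x2A)))   // "MM\0*", big-endian TIFF
        return "image/tiff";

    return null; // unknown
}
```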

Why in the world would it be smart to send that image over every time I add more history? I think it's very wasteful, even twice. Perhaps it should give me an embedding to send over, or some ID to send on the second reference to the same data? I even think it works like that, with caching on the Ollama side, but at the very least it wastes my bandwidth to have to re-send the image as fast as the user can press Enter. Sure, I know I can protect against misuse, especially if it's me pressing Enter, but that's not the point I'm making.

The LLMs are stateless. You need to provide them with the whole request each time. Some services allow you to upload images ahead of time and then refer to those images by URL / ID. And with M.E.AI, you can create an ImageContent referencing a URL, so if you want to upload your image somewhere accessible to the LLM and it supports being told about images at a URL (like OpenAI does), then you can avoid sending the image each time.
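
As a rough sketch of that last option (the constructor and method shapes here are from the current previews and may change; the prompt and URL are of course placeholders):

```csharp
using Microsoft.Extensions.AI;

// Sketch: send a prompt that references an already-uploaded image by URL rather
// than inlining the bytes into every request. Only useful against a service that
// can fetch images from a URL (OpenAI can; Ollama cannot).
static async Task<string?> AskAboutImageAsync(IChatClient client, Uri imageUrl)
{
    var message = new ChatMessage(ChatRole.User,
    [
        new TextContent("Where in this image is the girl? Give approximate coordinates."),
        new ImageContent(imageUrl, "image/png"),
    ]);

    ChatCompletion completion = await client.CompleteAsync([message]);
    return completion.Message.Text;
}
```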

@ericwj
Author

ericwj commented Nov 20, 2024

Are you asking for the M.E.AI libraries to be able to infer the image type from the bytes and then normalize the image format to one the target LLM natively understands?

By a raw RGB buffer I would have meant a simple, very portable format. Similarly with audio: everyone should support PCM, but nobody should pick it as a first choice by default. But I guess there is no equivalent image format.

I'm not entirely sure exactly what I'm asking for. I know little about the standards, capabilities, behaviors, and conventions (if any) of either the server hosting the LLM or the LLM itself, or even what is typical, or whether the answers I got were true from the LLM's perspective or hallucinations.

But I can say that I would want to pre-empt processing by the LLM server.

  • For performance reasons, especially if it is done repeatedly: it costs bandwidth, and if I host my own LLM server it also costs CPU/GPU, both of which are highly contested resources.
  • If the LLM is aware of image metadata like resolution and DPI, as it suggests to me, and is supposed to be able to identify the locations of people and things in the image, then it is a gross faux pas for the LLM server to resize the image, and especially to change its aspect ratio. I would rather have it fail. But if llama3.2-vision merely hallucinates about that, then wrong image metadata is not really a reason to want to preempt processing of the image by the LLM server.
    I just don't know, and don't yet know how to find out for certain.

So the question becomes whether M.E.AI has any idea of the capabilities of the LLM server, or of some component between it and the LLM: metadata it could request about the input the server accepts, the server's ability to reason about that input, and whether there is any standard way of reporting errors that M.E.AI could handle (without regex, that is). I think those are all the ways that M.E.AI could learn anything about the LLM service. At least for Ollama there are, AFAIK, no custom options or metadata that I can get to.

If there is metadata from the LLM server that M.E.AI could use, then on the client side there is at the very least a MIME type, and I would say there should be a filename extension. Getting dimensions is also usually easy. But as soon as the bytes in an ImageContent need to be inspected, that is likely to cover my first-run experience 99% of the way, until someone really needs some JXR or BMP or who knows what: there will be holes.

  • I think it would be useful first and foremost to know more about the model, beyond a model string, the provider name, and a CLR type, which is all ChatClient.Metadata gives me. And I don't think M.E.AI does any talking to the service before the first chat message is sent, or that doing so would be efficient. If not, then all M.E.AI can do is wait for an HTTP 400 and a string, it appears, in the case of Ollama, and the user will likely just get "Something went wrong". If M.E.AI cannot do much here, then it is up to the wider AI community to evolve these kinds of things as the field matures.
  • Second, having infrastructure abstractions that can be configured and set up, especially combining, say, M.E.AI middleware with ASP.NET middleware/endpoints, and making it all work efficiently, smartly, configurably, and straightforwardly, based on facts about what needs processing and what doesn't, could be a very nice addition to both M.E.AI and ASP.NET (see the sketch after this list). It could be extremely efficient, especially when ASP.NET and the LLM are close on the network or on the same machine, or when memory is really scarce, which in my experience it always is in the cloud.
  • I was surprised by how ImageContent works. In practice people will likely end up having all of the image bytes in memory twice: once in an array, which is handed (copied?) to ImageContent before it gets encoded, with that array then lingering for as long as it is in scope. And it doesn't appear easy to me to, say, substitute my own content for ImageContent that would stream from disk, since that code path goes all the way down to the HttpClient, I suspect.
  • And I am surprised that the chat response can be streamed but not the request. That would be useful with images, and even more so (perhaps required) for video and audio (e.g. Whisper).
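
For the middleware bullet above, something along these lines is what I imagine, purely as a sketch: a DelegatingChatClient that transcodes image content the target model cannot handle before the request reaches the inner client. The member names are the ones I see in the current preview and may shift, ConvertToPng is a placeholder for a real imaging library, and no such component exists in M.E.AI today.

```csharp
using Microsoft.Extensions.AI;

// Hypothetical middleware: re-encode TIFF image content to PNG before the
// request ever reaches the inner client.
public sealed class ImageNormalizingChatClient(IChatClient innerClient) : DelegatingChatClient(innerClient)
{
    public override async Task<ChatCompletion> CompleteAsync(
        IList<ChatMessage> chatMessages, ChatOptions? options = null, CancellationToken cancellationToken = default)
    {
        foreach (var message in chatMessages)
        {
            for (int i = 0; i < message.Contents.Count; i++)
            {
                if (message.Contents[i] is ImageContent { Data: not null } image &&
                    image.MediaType == "image/tiff")
                {
                    ReadOnlyMemory<byte> png = ConvertToPng(image.Data.Value);
                    message.Contents[i] = new ImageContent(png, "image/png");
                }
            }
        }

        return await InnerClient.CompleteAsync(chatMessages, options, cancellationToken);
    }

    // Placeholder: transcode with an imaging library of your choice (e.g. ImageSharp).
    private static ReadOnlyMemory<byte> ConvertToPng(ReadOnlyMemory<byte> tiff) =>
        throw new NotImplementedException();
}
```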

Here's a fact that suggests to me that there is some metadata somewhere that could potentially be used:

  • Uploading a TIF gives me an InvalidOperationException when running Ollama against llama3.2-vision, and the exception has an error string with sloppy punctuation and capitalization (i.e. non-Microsoft), but I can send the same bytes to phi3.5 without a problem (and have them ignored). I'm not sure how Ollama makes that distinction, or the decision whether to fail, but I don't think it necessarily special-cases any model, and I guess it doesn't try to do anything with the TIFF when talking to phi3.5, since that model doesn't have vision. But how or why it knows to ignore the TIF, I don't know.

The LLMs are stateless.

Yes, I could have guessed. Apart from sending lots of data over repeatedly, processing it again is a cost, and both contribute to the delays I was experiencing. But it is not as bad as I thought. The delays I was seeing have at least two other causes: switching models while only one model can be loaded at a time, and an aggressively short default expiration of Ollama model loads. The code editor is such an attention hog, don't you agree? Five minutes is absolutely nothing. I wasn't aware of any of that yet, either.

@stephentoub
Member

If there is metadata from the LLM server that M.E.AI could use

In general there is not.

And it doesn't appear easy to me to, say, substitute my own content for ImageContent that would stream from disk, since that code path goes all the way down to the HttpClient, I suspect.

You can write your own custom AIContent-derived type if you'd like. You can also use ImageContent to just refer to a URL. The reason ImageContent currently works the way it does is that, in general, it needs to be available to be examined multiple times, e.g. as part of a chat history that's repeatedly augmented and sent, and a transient stream of data doesn't work well with that. We've contemplated augmenting ImageContent to accept a seekable Stream, or possibly having a dedicated StreamingDataContent type for that, but it's still challenging in the face of other things you want to do with it, like caching, where an input containing the content needs to be hashed and the output containing image content needs to live in the cache.
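
As a sketch of that first option (the type and its members below are hypothetical; an off-the-shelf client won't know what to do with it unless you also add a DelegatingChatClient that translates it into content the client understands):

```csharp
using System.IO;
using Microsoft.Extensions.AI;

// Sketch of a custom AIContent-derived type that carries a file path instead of
// the image bytes, so the bytes can be read only at the point a client actually
// serializes the request.
public sealed class FileImageContent : AIContent
{
    public FileImageContent(string path, string? mediaType = null)
    {
        Path = path;
        MediaType = mediaType;
    }

    public string Path { get; }
    public string? MediaType { get; }

    // Opens a fresh stream each time the content needs to be examined,
    // e.g. when the same chat history is sent again.
    public Stream OpenRead() => File.OpenRead(Path);
}
```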

And I am surprised that the chat response can be streamed but not the request.

That's the nature of all of the chat completion services today.

I appreciate your thoughts on the matter. At the moment, though, I don't see anything in the issue that's actionable, so I'm going to close it. If you feel I'm missing something, please feel free to re-open and clarify a concrete request. Thanks.

@stephentoub closed this as not planned (won't fix, can't repro, duplicate, stale) on Dec 2, 2024