Added the option to wait for model & tokenizer in lmql.model.serve #15
Currently the server starts accepting connections while the model and tokenizer are still loading.
The client has no way to tell when the model and tokenizer are ready, so it may send requests during this period. Because models often take a long time to load, such requests time out and raise an exception.
A simple solution is an optional flag that tells the model server not to start until the model and tokenizer have finished loading. The client then knows not to send requests while the model is loading, because the port is not yet open.
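The idea can be sketched as follows. This is a minimal illustration of the pattern, not LMQL's actual implementation; the `serve`, `load_model_and_tokenizer`, and `wait_until_ready` names are hypothetical:

```python
import socket
import threading
import time

def load_model_and_tokenizer():
    # Placeholder for the (potentially slow) model/tokenizer load.
    time.sleep(0.1)
    return object(), object()

def serve(port: int, wait_until_ready: bool = False) -> socket.socket:
    """Start a toy server, optionally deferring the port bind
    until the model and tokenizer are loaded."""
    if wait_until_ready:
        # New behavior: load first. The port stays closed, so a client
        # that fails to connect knows the server is not ready yet.
        load_model_and_tokenizer()
        sock = socket.socket()
        sock.bind(("127.0.0.1", port))
        sock.listen()
    else:
        # Current behavior: bind immediately and load in the background,
        # so early requests can time out while the model loads.
        sock = socket.socket()
        sock.bind(("127.0.0.1", port))
        sock.listen()
        threading.Thread(target=load_model_and_tokenizer, daemon=True).start()
    return sock
```

With `wait_until_ready=True`, a client can simply retry the connection until the port opens, instead of guessing whether an accepted request will time out.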
This flag is particularly convenient in use cases where a Python script wants to host a model and then query it for inference.
Since the flag is optional, it does not change the current behavior at all; it simply adds the option for those who want it.