Closes #5
By default, model quantisation (int4) is turned off, since quantised results are much worse and defaulting to off avoids a breaking change.
The Quanto integration in the Transformers library is fairly new and broken in the few releases that include it. So for now, a temporary build from main with the fix was published at ae9is/transformers. (This seemed simpler than setting up the Docker builds to check out and build the source repository.)
With int4 quantisation the model API needs only ~256 MB instead of ~512 MB, which was right at the cut-off for small VMs and was causing out-of-memory errors.