Batch embed strings to fit model context #1186
Conversation
@iamlemec absolute legend, thank you! I'll test that this doesn't interfere with the current prefix-matching kv logic, and once that looks good I should be able to merge. Wrt normalizing the embeddings, what do you think are the benefits for / against?
Actually, found a bug in the normalization! (the flag …)
I could go either way on the default for `normalize`. From a practical perspective, I assume most people are using these for cosine similarities, in which case you'd want to normalize them. But normalization is also a destructive process, so in some sense not normalizing is the safer bet. Also, I should note that the …
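A small illustration of the cosine-similarity point above (not code from the PR, just a numpy sketch): once embeddings are scaled to unit length, cosine similarity reduces to a plain dot product, which is why normalizing by default is convenient for the common retrieval use case.

```python
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    # Scale to unit L2 norm; this throws away magnitude information,
    # which is the sense in which normalization is "destructive".
    return v / np.linalg.norm(v)

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.5, 1.0])

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot_of_normalized = np.dot(normalize(a), normalize(b))

assert np.isclose(cosine, dot_of_normalized)
```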
Defaulting `normalize` to true sounds like the best approach then. I've made a minor adjustment to the … I'll test a little more tomorrow; for good measure we may want to add a …
Actually, I'll just go ahead and do that so we can merge. Looks good to me, thank you so much for your work on this!
Wonderful, thanks for a great library! Will keep testing.
Excerpt from the diff under review:

```python
s_sizes = []
# ...
# add to batch
self._batch.add_sequence(tokens, len(s_sizes), False)
```
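For context, the batching strategy can be thought of as greedy packing of tokenized inputs into batches that fit within the batch size. The helper below is only an illustrative sketch (`pack_into_batches` and its signature are invented here, not part of the PR); the actual code does the equivalent packing inline around `_LlamaBatch.add_sequence` and `reset`.

```python
from typing import List

def pack_into_batches(token_lists: List[List[int]], n_batch: int) -> List[List[List[int]]]:
    """Greedily group tokenized inputs into batches whose total token count
    stays within n_batch. Illustrative only."""
    batches: List[List[List[int]]] = []
    current: List[List[int]] = []
    used = 0
    for tokens in token_lists:
        # flush the current batch if this sequence would not fit
        if current and used + len(tokens) > n_batch:
            batches.append(current)
            current, used = [], 0
        current.append(tokens)
        used += len(tokens)
    if current:
        batches.append(current)
    return batches

# Example: three inputs packed under a 6-token budget
print(pack_into_batches([[1, 2, 3], [4, 5], [6, 7, 8, 9]], n_batch=6))
# -> [[[1, 2, 3], [4, 5]], [[6, 7, 8, 9]]]
```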
@iamlemec There is a null pointer access in this function if n_batch < n_ctx and the prompt exceeds n_batch.
Yup! Check out #1194. It changes the bounds checking from n_ctx to n_batch.
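Roughly, the fix described above amounts to validating the prompt length against the batch size rather than only the context size. The function below is a hedged sketch of that idea (the name `check_input_length` and the `truncate` parameter are illustrative, not the library's actual API):

```python
def check_input_length(n_tokens: int, n_ctx: int, n_batch: int, truncate: bool) -> int:
    # The batch buffer holds at most n_batch tokens per decode call, so it is
    # the binding constraint whenever n_batch < n_ctx.
    limit = min(n_ctx, n_batch)
    if n_tokens > limit:
        if truncate:
            return limit  # caller truncates the token list to this length
        raise ValueError(f"Requested tokens ({n_tokens}) exceed available space ({limit})")
    return n_tokens
```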
Now that `llama.cpp` supports BERT embedding models (ggml-org/llama.cpp#5423), this modifies `embed`/`create_embedding` to truncate and combine inputs to fit into the model context size, which allows for efficient high-volume embedding. I tried to be minimal with changes, and am definitely open to alternate suggestions. What we have now:

- `reset` and `add_sequence` methods added to `_LlamaBatch` to make multi-sequence batches
- inputs are truncated to fit `n_ctx` by default, and without truncation an error is thrown if `n_tokens > n_ctx`

One could eke out some gains from grouping batches more intelligently, but I wanted to keep it simple.
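A rough usage sketch of the new behavior, assuming a build with this PR merged (the model path and parameters are placeholders): `embed` accepts a list of strings, packs them into multi-sequence batches, and truncates inputs that exceed the context by default.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/embedding-model.gguf",  # placeholder: any GGUF embedding model
    embedding=True,  # run the model in embedding mode
    n_ctx=512,
    n_batch=512,
)

texts = [
    "llama.cpp now supports BERT embedding models.",
    "Inputs are truncated and packed into multi-sequence batches.",
    "This makes high-volume embedding much more efficient.",
]

# One embedding per input string; overly long inputs are truncated to fit the context.
embeddings = llm.embed(texts)
print(len(embeddings), len(embeddings[0]))
```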