
Conversation

@iamlemec
Contributor

Now that llama.cpp supports BERT embedding models (ggml-org/llama.cpp#5423), this PR modifies embed / create_embedding to truncate and combine inputs so they fit within the model context size, allowing for efficient high-volume embedding. I tried to keep the changes minimal, and am definitely open to alternate suggestions. What we have now:

  • Add reset and add_sequence methods to _LlamaBatch to make multi-sequence batches
  • Tokens are truncated to n_ctx by default, and without truncation an error is thrown if n_tokens > n_ctx
  • Embeddings are normalized by default but this can be turned off

One could eke out some gains from grouping batches more intelligently, but I wanted to keep it simple.
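
For context, here's a minimal usage sketch of the behavior described above. The model path is a placeholder, and the normalize / truncate keyword names follow the description in this PR rather than a fixed API reference:

from llama_cpp import Llama

# load a GGUF embedding model in embedding mode (path is a placeholder)
llm = Llama(model_path="./bge-small-en-v1.5-q8_0.gguf", embedding=True)

# a list of inputs is packed into multi-sequence batches; each input is
# truncated to n_ctx by default, and the results are normalized by default
docs = ["first document", "second document", "third document"]
vectors = llm.embed(docs)

# either behavior can be turned off; without truncation, an input longer
# than n_ctx raises an error instead of being silently cut
raw = llm.embed("a single string", normalize=False, truncate=False)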

@abetlen
Owner

abetlen commented Feb 14, 2024

@iamlemec absolute legend, thank you!

I'll test that this doesn't interfere with the current prefix-matching kv logic, and once that looks good I should be able to merge. Regarding normalizing the embeddings, what do you think are the arguments for / against?

@iamlemec
Contributor Author

Actually, I found a bug in the normalization (the normalize flag gets overwritten by the function normalize). Will push a fix in a minute.
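
(Purely illustrative, not the actual diff; the bug pattern was roughly a nested definition rebinding the name of the keyword argument:)

def embed(self, input, normalize=True):
    ...
    # defining a helper with the same name rebinds `normalize` in this scope,
    # so the boolean flag passed by the caller is lost
    def normalize(embedding, norm):
        return [v / norm for v in embedding]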

@iamlemec
Contributor Author

I could go either way on the default for normalize. It looks like sentence_transformers defaults to normalize_embeddings=False. But actually, their results come back the same regardless of what you set this flag to, so I guess it's getting normalized anyway somewhere else?

From a practical perspective, I assume most people are using these for cosine similarities, in which case you'd want to normalize them. But normalization is also a destructive process, so in some sense not normalizing is the safer bet.

Also, I should note that the llama.cpp code returns the pooled sum over token embeddings, so we also divide by the number of tokens here to make it a pooled mean, which is more standard.
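
Concretely, the post-processing amounts to something like this (a numpy sketch with made-up dimensions, not the library code itself):

import numpy as np

token_embeddings = np.random.rand(12, 384)            # 12 tokens, hypothetical 384-dim model
pooled_sum = token_embeddings.sum(axis=0)             # what llama.cpp returns: sum over tokens
pooled_mean = pooled_sum / token_embeddings.shape[0]  # divide by token count -> mean pooling
unit = pooled_mean / np.linalg.norm(pooled_mean)      # L2-normalize, ready for cosine similarity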

@abetlen
Owner

abetlen commented Feb 14, 2024

Defaulting normalize to true sounds like the best approach then. I've made a minor adjustment to the embed method so that it returns a list[float] if a string is passed; this should avoid any breaking changes.

I'll test a little more tomorrow; for good measure, we may want to add a kv_cache_clear / reset to clean up any sequences taking up space in the cache.
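
As a usage note, with normalized embeddings cosine similarity reduces to a dot product (sketch only; the model path is a placeholder):

import numpy as np
from llama_cpp import Llama

llm = Llama(model_path="./embedding-model.gguf", embedding=True)
a = np.array(llm.embed("the cat sat on the mat"))     # list[float] when a single string is passed
b = np.array(llm.embed("a cat is sitting on a rug"))
similarity = float(a @ b)                             # unit-length vectors, so dot product == cosine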

@abetlen
Owner

abetlen commented Feb 14, 2024

Actually I'll just go ahead and do that so we can merge.

Looks good to me, thank you so much for your work on this!

@abetlen abetlen merged commit d7a6791 into abetlen:main Feb 14, 2024
@iamlemec
Contributor Author

Wonderful, thanks for a great library! Will keep testing.

s_sizes = []

# add to batch
self._batch.add_sequence(tokens, len(s_sizes), False)
Contributor

@iamlemec There is a null pointer access in this function if n_batch < n_ctx and the prompt exceeds n_batch.

Contributor Author

Yup! Check out #1194. It changes the bounds checking from n_ctx to n_batch.
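
(Schematically, the guard moves from the context limit to the batch limit; this is a sketch of the idea, not the exact change in #1194:)

# before: only inputs longer than the context window were caught
if truncate and len(tokens) > n_ctx:
    tokens = tokens[:n_ctx]

# after: decoding fills a buffer of at most n_batch tokens, so the check
# must use n_batch to avoid reading past the end of that buffer
if truncate and len(tokens) > n_batch:
    tokens = tokens[:n_batch]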
