-
Notifications
You must be signed in to change notification settings - Fork 35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Include input token count in results #334
Include input token count in results #334
Conversation
FYI: @markstur |
Signed-off-by: Mynhardt Burger <Mynhardt.Burger@ibm.com>
89fd252
to
3c1f4b9
Compare
Signed-off-by: Mynhardt Burger <Mynhardt.Burger@ibm.com>
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to bump minimum caikit version to get the proposed data model changes that are being used here.
Otherwise just some nits.
source_sentence, truncate_input_tokens=truncate_input_tokens | ||
) | ||
embeddings = self._encode_with_retry( | ||
embeddings, embeddings_token_count = self._encode_with_retry( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: embeddings_token_count is a confusing var name.
sentences_token_count or well even count2 would be better "embeddings_token_count", but doesn't matter. Varable name can be fixed whenever.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renaming to sentences_token_count
.
source_sentences, truncate_input_tokens=truncate_input_tokens | ||
) | ||
embeddings = self._encode_with_retry( | ||
embeddings, embeddings_token_count = self._encode_with_retry( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renaming to sentences_token_count
.
to_tokenize = [texts] | ||
else: | ||
assert 0 | ||
to_tokenize = [texts] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
given this change, I think this line can go after the asserts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. reodering.
|
||
def encode( | ||
self, | ||
sentences: Union[str, List[str]], | ||
sentences: Union[str, Collection[str]], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already have a signature does not match the base class situation (added an arg we need and chose to leave some out that we ignore), but I don't see much value in these new tweaks. Were we already abusing the hints on those?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was an attempt to follow this guidance which states to use abstract base classes for arguments and concrete types for return types.
I'm reverting this to List[str]
since I also personally find it easier to read (the specifics of ABC's are not widely known).
Signed-off-by: Mynhardt Burger <Mynhardt.Burger@ibm.com>
@mynhardtburger caikit 0.26.14 is available with the data model update |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the contribution! caikit should now be ready to bump, and looks like linting might need some attention
Signed-off-by: Mynhardt Burger <Mynhardt.Burger@ibm.com>
Signed-off-by: Mynhardt Burger <Mynhardt.Burger@ibm.com>
Signed-off-by: Mynhardt Burger <Mynhardt.Burger@ibm.com>
Signed-off-by: Mynhardt Burger <Mynhardt.Burger@ibm.com>
Signed-off-by: Mynhardt Burger <Mynhardt.Burger@ibm.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few questions, and it'd probably be good for @markstur to review again as well
Co-authored-by: Evaline Ju <69598118+evaline-ju@users.noreply.github.com> Signed-off-by: Mynhardt Burger <mynhardt@gmail.com>
Signed-off-by: Mynhardt Burger <Mynhardt.Burger@ibm.com>
Signed-off-by: Mynhardt Burger <Mynhardt.Burger@ibm.com>
Signed-off-by: Mynhardt Burger <Mynhardt.Burger@ibm.com>
Signed-off-by: Mynhardt Burger <Mynhardt.Burger@ibm.com>
Signed-off-by: Mynhardt Burger <Mynhardt.Burger@ibm.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks!
Depends on caikit data model updates in this PR: caikit/caikit#675
This extends the embedding module to include the
input_token_count
in the results of theEmbeddingModule
'srun_
methods.The
sum_token_count(tokenized: BatchEncoding) -> int
function calculates the count of tokens requiring model attention, based on theEncoding.attention_mask
property, as returned bySentenceTransformerWithTruncate.tokenizer()
. [PAD] is irrelevant for truncation and max_token_count parameters, while [CLS] and [SEP] are counted by the model when it considers the max length and truncation.Additionally tests to confirm sort order is maintained was added.
Various other quality of life type hints were added.