Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Include input token count in results #334

Merged

Conversation

mynhardtburger
Copy link
Contributor

@mynhardtburger mynhardtburger commented Mar 4, 2024

Depends on caikit data model updates in this PR: caikit/caikit#675

This extends the embedding module to include the input_token_count in the results of the EmbeddingModule's run_ methods.

The sum_token_count(tokenized: BatchEncoding) -> int function calculates the count of tokens requiring model attention, based on the Encoding.attention_mask property, as returned by SentenceTransformerWithTruncate.tokenizer(). [PAD] is irrelevant for truncation and max_token_count parameters, while [CLS] and [SEP] are counted by the model when it considers the max length and truncation.

Additionally tests to confirm sort order is maintained was added.

Various other quality of life type hints were added.

@mynhardtburger
Copy link
Contributor Author

FYI: @markstur

Signed-off-by: Mynhardt Burger <Mynhardt.Burger@ibm.com>
@mynhardtburger mynhardtburger force-pushed the inlude-input_token_count-in-results branch from 89fd252 to 3c1f4b9 Compare March 5, 2024 01:30
Signed-off-by: Mynhardt Burger <Mynhardt.Burger@ibm.com>
Signed-off-by: Mynhardt Burger <Mynhardt.Burger@ibm.com>
@mynhardtburger
Copy link
Contributor Author

mynhardtburger commented Mar 5, 2024

_truncate_input_tokens()'s error case could also possibly be refactored to make use of _sum_token_count() to avoid having to rerun the tokenization.

Copy link
Contributor

@markstur markstur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to bump minimum caikit version to get the proposed data model changes that are being used here.

Otherwise just some nits.

source_sentence, truncate_input_tokens=truncate_input_tokens
)
embeddings = self._encode_with_retry(
embeddings, embeddings_token_count = self._encode_with_retry(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: embeddings_token_count is a confusing var name.

sentences_token_count or well even count2 would be better "embeddings_token_count", but doesn't matter. Varable name can be fixed whenever.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renaming to sentences_token_count.

source_sentences, truncate_input_tokens=truncate_input_tokens
)
embeddings = self._encode_with_retry(
embeddings, embeddings_token_count = self._encode_with_retry(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renaming to sentences_token_count.

to_tokenize = [texts]
else:
assert 0
to_tokenize = [texts]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

given this change, I think this line can go after the asserts

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. reodering.


def encode(
self,
sentences: Union[str, List[str]],
sentences: Union[str, Collection[str]],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have a signature does not match the base class situation (added an arg we need and chose to leave some out that we ignore), but I don't see much value in these new tweaks. Were we already abusing the hints on those?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was an attempt to follow this guidance which states to use abstract base classes for arguments and concrete types for return types.

I'm reverting this to List[str] since I also personally find it easier to read (the specifics of ABC's are not widely known).

Signed-off-by: Mynhardt Burger <Mynhardt.Burger@ibm.com>
@evaline-ju
Copy link
Collaborator

@mynhardtburger caikit 0.26.14 is available with the data model update

Copy link
Collaborator

@evaline-ju evaline-ju left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution! caikit should now be ready to bump, and looks like linting might need some attention

caikit_nlp/modules/text_embedding/embedding.py Outdated Show resolved Hide resolved
Signed-off-by: Mynhardt Burger <Mynhardt.Burger@ibm.com>
Signed-off-by: Mynhardt Burger <Mynhardt.Burger@ibm.com>
Signed-off-by: Mynhardt Burger <Mynhardt.Burger@ibm.com>
Signed-off-by: Mynhardt Burger <Mynhardt.Burger@ibm.com>
Signed-off-by: Mynhardt Burger <Mynhardt.Burger@ibm.com>
Signed-off-by: Mynhardt Burger <Mynhardt.Burger@ibm.com>
Copy link
Collaborator

@evaline-ju evaline-ju left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few questions, and it'd probably be good for @markstur to review again as well

caikit_nlp/modules/text_embedding/embedding.py Outdated Show resolved Hide resolved
tests/modules/text_embedding/test_embedding.py Outdated Show resolved Hide resolved
tests/modules/text_embedding/test_embedding.py Outdated Show resolved Hide resolved
Co-authored-by: Evaline Ju <69598118+evaline-ju@users.noreply.github.com>
Signed-off-by: Mynhardt Burger <mynhardt@gmail.com>
Signed-off-by: Mynhardt Burger <Mynhardt.Burger@ibm.com>
Signed-off-by: Mynhardt Burger <Mynhardt.Burger@ibm.com>
Signed-off-by: Mynhardt Burger <Mynhardt.Burger@ibm.com>
Signed-off-by: Mynhardt Burger <Mynhardt.Burger@ibm.com>
Signed-off-by: Mynhardt Burger <Mynhardt.Burger@ibm.com>
Copy link
Collaborator

@evaline-ju evaline-ju left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Contributor

@markstur markstur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

@evaline-ju evaline-ju merged commit 01526aa into caikit:main Mar 13, 2024
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants