Include input token count in results #334

mynhardtburger · 2024-03-04T23:23:41Z

Depends on caikit data model updates in this PR: caikit/caikit#675

This extends the embedding module to include the input_token_count in the results of the EmbeddingModule's run_ methods.

The sum_token_count(tokenized: BatchEncoding) -> int function calculates the count of tokens requiring model attention, based on the Encoding.attention_mask property, as returned by SentenceTransformerWithTruncate.tokenizer(). [PAD] tokens are excluded because they are irrelevant to the truncation and max_token_count parameters, while [CLS] and [SEP] are included because the model counts them when it considers the max length and truncation.
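As a rough illustration of the counting rule described above (not the code in this PR), the sketch below sums attention-mask values across a batch. FakeEncoding and sum_token_count_sketch are hypothetical names used only for this example; the real function works on a BatchEncoding returned by the tokenizer.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeEncoding:
    """Hypothetical stand-in for a tokenizer Encoding; only attention_mask is modeled."""

    attention_mask: List[int]


def sum_token_count_sketch(encodings: List[FakeEncoding]) -> int:
    """Count attended tokens: mask value 1 marks a real token, 0 marks [PAD]."""
    return sum(sum(enc.attention_mask) for enc in encodings)


# Two inputs padded to length 6 (assumed layout):
#   [CLS] a b [SEP] [PAD] [PAD]  -> 4 attended tokens
#   [CLS] a b c d [SEP]          -> 6 attended tokens
batch = [
    FakeEncoding(attention_mask=[1, 1, 1, 1, 0, 0]),
    FakeEncoding(attention_mask=[1, 1, 1, 1, 1, 1]),
]
total = sum_token_count_sketch(batch)
```

Under these assumptions the special tokens [CLS] and [SEP] contribute to the total because their mask value is 1, while padding contributes nothing.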

Additionally, tests were added to confirm that sort order is maintained.

Various other quality-of-life type hints were added.

mynhardtburger · 2024-03-04T23:25:46Z

FYI: @markstur

mynhardtburger · 2024-03-05T14:17:03Z

_truncate_input_tokens()'s error case could also possibly be refactored to make use of _sum_token_count() to avoid having to rerun the tokenization.

markstur

Need to bump minimum caikit version to get the proposed data model changes that are being used here.

Otherwise just some nits.

markstur · 2024-03-06T00:01:55Z

caikit_nlp/modules/text_embedding/embedding.py

            source_sentence, truncate_input_tokens=truncate_input_tokens
        )
-        embeddings = self._encode_with_retry(
+        embeddings, embeddings_token_count = self._encode_with_retry(


nit: embeddings_token_count is a confusing var name.

sentences_token_count, or even count2, would be better than "embeddings_token_count", but it doesn't matter. The variable name can be fixed whenever.

Renaming to sentences_token_count.

markstur · 2024-03-06T00:02:24Z

caikit_nlp/modules/text_embedding/embedding.py

            source_sentences, truncate_input_tokens=truncate_input_tokens
        )
-        embeddings = self._encode_with_retry(
+        embeddings, embeddings_token_count = self._encode_with_retry(


Renaming to sentences_token_count.

markstur · 2024-03-06T00:08:54Z

caikit_nlp/modules/text_embedding/embedding.py

-            to_tokenize = [texts]
-        else:
-            assert 0
+        to_tokenize = [texts]


Given this change, I think this line can go after the asserts.

Agreed. Reordering.

markstur · 2024-03-06T00:19:38Z

caikit_nlp/modules/text_embedding/embedding.py


    def encode(
        self,
-        sentences: Union[str, List[str]],
+        sentences: Union[str, Collection[str]],


We already have a situation where the signature does not match the base class (we added an arg we need and chose to leave out some that we ignore), but I don't see much value in these new tweaks. Were we already abusing the hints on those?

It was an attempt to follow this guidance which states to use abstract base classes for arguments and concrete types for return types.

I'm reverting this to List[str] since I also personally find it easier to read (the specifics of ABCs are not widely known).
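For context on the trade-off discussed in this thread, here is a minimal sketch of the "abstract types for arguments, concrete types for return values" convention. The function name sentence_lengths is hypothetical and is not part of caikit_nlp; it only shows what the broader Collection[str] hint permits compared with List[str].

```python
from typing import Collection, List


def sentence_lengths(sentences: Collection[str]) -> List[int]:
    """Accept any sized, iterable container of strings; return a concrete list."""
    return [len(s) for s in sentences]


# A list and a tuple both satisfy Collection[str]; with a List[str] hint,
# a type checker would flag the tuple call.
from_list = sentence_lengths(["first sentence", "second"])
from_tuple = sentence_lengths(("first sentence", "second"))
```

Either hint behaves the same at runtime; the difference only matters to static type checkers and to readers, which is the readability concern raised above.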

evaline-ju · 2024-03-06T21:32:01Z

@mynhardtburger caikit 0.26.14 is available with the data model update

evaline-ju

Thanks for the contribution! caikit should now be ready to bump, and it looks like linting might need some attention.

evaline-ju

A few questions, and it'd probably be good for @markstur to review again as well

evaline-ju

LGTM!

markstur

LGTM. Thanks!

markstur reviewed Mar 4, 2024

Add input_token_count to results

3c1f4b9

Signed-off-by: Mynhardt Burger <Mynhardt.Burger@ibm.com>

mynhardtburger force-pushed the inlude-input_token_count-in-results branch from 89fd252 to 3c1f4b9 Compare March 5, 2024 01:30

Bug fixes

ef72725

Signed-off-by: Mynhardt Burger <Mynhardt.Burger@ibm.com>

mynhardtburger mentioned this pull request Mar 5, 2024

Add input token count to embedding, reranker, sentence similarity caikit/caikit#675

Merged

3 tasks

Add tests and bug fixes

14b298d

Signed-off-by: Mynhardt Burger <Mynhardt.Burger@ibm.com>

mynhardtburger marked this pull request as ready for review March 5, 2024 18:50

mynhardtburger requested review from alex-jw-brooks, gkumbhat, evaline-ju, gabe-l-hart, tharapalanivel and Ssukriti as code owners March 5, 2024 18:50

markstur reviewed Mar 6, 2024

Review comments

700a5d5

Signed-off-by: Mynhardt Burger <Mynhardt.Burger@ibm.com>

evaline-ju reviewed Mar 6, 2024

bump caikit dependency for datamodel updates

6547083

Signed-off-by: Mynhardt Burger <Mynhardt.Burger@ibm.com>

evaline-ju mentioned this pull request Mar 7, 2024

🥅 Allow int thresholds #336

Merged

mynhardtburger added 4 commits March 7, 2024 22:17

add test for sort order

4c7fd20

Signed-off-by: Mynhardt Burger <Mynhardt.Burger@ibm.com>

Fix warnings

df52040

Signed-off-by: Mynhardt Burger <Mynhardt.Burger@ibm.com>

refactor _truncate_input_tokens and sum_token_count

0465796

Signed-off-by: Mynhardt Burger <Mynhardt.Burger@ibm.com>

Linting

09e66fc

Signed-off-by: Mynhardt Burger <Mynhardt.Burger@ibm.com>

mynhardtburger requested review from evaline-ju and markstur March 8, 2024 13:50

Add token count asserts for all endpoint tests

9cf0491

Signed-off-by: Mynhardt Burger <Mynhardt.Burger@ibm.com>

evaline-ju reviewed Mar 8, 2024

Update tests/modules/text_embedding/test_embedding.py

4ef9dc1

Co-authored-by: Evaline Ju <69598118+evaline-ju@users.noreply.github.com> Signed-off-by: Mynhardt Burger <mynhardt@gmail.com>

mynhardtburger added 5 commits March 8, 2024 17:56

Fix docstring

c8b2a8f

Signed-off-by: Mynhardt Burger <Mynhardt.Burger@ibm.com>

Fix get_sample_start_indexes

11d6135

Signed-off-by: Mynhardt Burger <Mynhardt.Burger@ibm.com>

Add comments about token counts

e89a84d

Signed-off-by: Mynhardt Burger <Mynhardt.Burger@ibm.com>

Readability updates for get_sample_start_indexes

942826f

Signed-off-by: Mynhardt Burger <Mynhardt.Burger@ibm.com>

Remove #type: ignore

bdda232

Signed-off-by: Mynhardt Burger <Mynhardt.Burger@ibm.com>

evaline-ju approved these changes Mar 12, 2024

markstur approved these changes Mar 13, 2024

evaline-ju merged commit 01526aa into caikit:main Mar 13, 2024
5 checks passed
