
Make encode() in wrapped model compatible with super encode() #337

Merged
merged 3 commits
Mar 19, 2024

Conversation

markstur
Contributor

  • Adding missing params
  • Don't return unexpected tuple (with token count) unless asked
  • Adding check to not use our params if given an unwrapped model
  • Fixing some param position things


Signed-off-by: Mark Sturdevant <markstur@Marks-MacBook-Pro.local>
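The compatibility fix described in the bullets above can be sketched roughly as follows. All names here (`FakeBaseEncoder`, `WrappedEncoder`) are hypothetical stand-ins for illustration, not the actual caikit classes:

```python
class FakeBaseEncoder:
    """Stand-in for an unwrapped model such as a SentenceTransformer (hypothetical)."""

    def encode(self, sentences, batch_size=32, **kwargs):
        # Pretend to embed: one small fixed vector per sentence.
        return [[0.0, 0.0] for _ in sentences]


class WrappedEncoder(FakeBaseEncoder):
    """Wrapper whose encode() stays call-compatible with the base encode()."""

    def encode(
        self,
        sentences,
        batch_size=32,
        truncate_input_tokens=0,
        return_token_count=False,
        **kwargs,
    ):
        embeddings = super().encode(sentences, batch_size=batch_size, **kwargs)
        if return_token_count:
            # Toy whitespace token count; real code would count tokenizer tokens.
            token_count = sum(len(s.split()) for s in sentences)
            return embeddings, token_count
        # Callers expecting the base signature get embeddings only,
        # never an unexpected (embeddings, token_count) tuple.
        return embeddings
```

With the default `return_token_count=False`, callers written against the base `encode()` signature see no behavioral change.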
@markstur
Contributor Author

@mynhardtburger


# Else...
# It's possible to init with a model that doesn't have the added kwargs.
# E.g. a SentenceTransformer or other transormer model. Remove those kwargs!
Collaborator


Suggested change
# E.g. a SentenceTransformer or other transormer model. Remove those kwargs!
# E.g. a SentenceTransformer or other transformer model. Remove those kwargs!

Contributor Author


good catch, done

:param truncate_input_tokens: Truncation length for input tokens.
If less than zero, this truncation is left up to the tokenizer default (model max).
If zero or greater than the model's maximum, then this is used as a test
to see if truncation is needed. If truncation is needed, an exception is thrown.
Otherwise, we take this usable truncation limit to truncate the input tokens.
:param return_token_count: If true, a tuple is returned to add the input token count.

:return:
A tuple of the embedding, as a numpy matrix, and the input_token_count int.
Collaborator


could this return be updated to reflect the return_token_count update?

Contributor Author


almost missed this again, but done!

# This is not the normal use case, but at least don't pass invalid kwargs to encode()
# and don't return the unexpected tuple (adding token count).
del kwargs["truncate_input_tokens"]
del kwargs["return_token_count"]
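The `del` statements above raise `KeyError` when those keys were never passed. A safer pattern, sketched here with a hypothetical helper name, is `dict.pop()` with a default, which silently does nothing when a key is absent:

```python
def strip_wrapper_kwargs(kwargs):
    """Remove wrapper-only kwargs before calling an unwrapped model's encode().

    Unlike `del kwargs[key]`, pop() with a default never raises KeyError,
    so this is safe whether or not the caller supplied these kwargs.
    """
    kwargs.pop("truncate_input_tokens", None)
    kwargs.pop("return_token_count", None)
    return kwargs
```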
Collaborator


might be good to have a small test for this case, in case an update could break it (again)?

Contributor Author


Yes, and I came here to make the PR a draft because this isn't catching KeyErrors (noticed while reviewing with Flavia)

Contributor Author


Yep and done. Definitely was a silly miss to not check AND not test. :(

@markstur markstur marked this pull request as draft March 15, 2024 17:37
…tion behavior.

* First draft could KeyError when deleting kwargs that don't exist.  Tests added.
* Adding a config option so the desired default behavior can be either:
  - Throw an error if truncation is happening implicitly, or
  - Nah. Just let it go.

The first was requested, so truncation does not happen quietly. This is common
behavior for some models.

The second is more aligned with SentenceTransformers and is probably
necessary for standard tests to run without errors.
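The two default behaviors described above could be sketched with a single config flag. The flag name and function here are hypothetical, chosen only to illustrate the choice between erroring on implicit truncation and letting it happen quietly:

```python
def check_truncation(token_count, limit, implicit_truncation_errors=True):
    """Decide what to do when input exceeds the usable token limit.

    Returns False when no truncation is needed and True when the caller
    should truncate quietly (SentenceTransformers-style). When
    implicit_truncation_errors is True, raises instead of allowing
    truncation to happen implicitly.
    """
    if token_count <= limit:
        return False  # input fits; no truncation needed
    if implicit_truncation_errors:
        raise ValueError(
            f"Token count {token_count} exceeds the limit of {limit}; "
            "implicit truncation is disabled."
        )
    return True  # truncate quietly
```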

Signed-off-by: Mark Sturdevant <mark.sturdevant@ibm.com>
@markstur markstur marked this pull request as ready for review March 16, 2024 23:19
Collaborator

@evaline-ju evaline-ju left a comment


LGTM! Thanks for the test update!

@evaline-ju evaline-ju merged commit ce34b1c into caikit:main Mar 19, 2024
5 checks passed