[ML] Fix Array out of bounds exception in the XLM Roberta tokenizer #106655

Merged
merged 2 commits into from Mar 22, 2024

davidkyle
Member

The following ArrayIndexOutOfBoundsException has been observed when tokenising certain Unicode characters in the XLM Roberta tokenizer. The overflow comes from a fixed-size reusable buffer that holds the normalised form of a single Unicode character encoded as UTF-8. The normalised form can be much longer than the allocated buffer size (8 chars); for example, 'ﷺ' normalises to 33 characters. That is surprising, so I compared the output to other SentencePiece implementations, and the results are in agreement.
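
To illustrate how a single character can expand this much, here is a minimal sketch that uses the JDK's `java.text.Normalizer` with NFKC. This is an assumption for illustration only: the tokenizer in this PR uses a SentencePiece precompiled char map, not `java.text.Normalizer`, and the class and method names below are hypothetical. With plain NFKC, 'ﷺ' (U+FDFA) becomes 18 UTF-16 chars, which take 33 bytes when encoded as UTF-8; either count exceeds an 8-element buffer.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

// Hypothetical demo class; not part of the Elasticsearch code base.
class NormalisedLengthDemo {

    // Assumption: plain NFKC stands in for the SentencePiece precompiled char map.
    static String nfkc(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFKC);
    }

    // Number of bytes needed to encode the string as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String ligature = "\uFDFA"; // 'ﷺ', a single UTF-16 char
        String normalised = nfkc(ligature);
        System.out.println("input chars:       " + ligature.length());   // 1
        System.out.println("normalised chars:  " + normalised.length()); // 18
        System.out.println("normalised UTF-8:  " + utf8Length(normalised) + " bytes"); // 33 bytes
    }
}
```

The point of the sketch is only that the output size of normalisation is not bounded by the input size, so a buffer sized for "one character" needs generous headroom.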

The fix here is to spend a few bytes increasing the buffer size.

java.lang.ArrayIndexOutOfBoundsException: Index 8 out of bounds for length 8
	at org.apache.lucene.util.UnicodeUtil.UTF8toUTF16(UnicodeUtil.java:650) ~[lucene-core-9.9.2.jar:?]
	at org.elasticsearch.xpack.ml.inference.nlp.tokenizers.PrecompiledCharMapNormalizer.normalize(PrecompiledCharMapNormalizer.java:194) ~[?:?]
	at org.elasticsearch.xpack.ml.inference.nlp.tokenizers.PrecompiledCharMapNormalizer.fill(PrecompiledCharMapNormalizer.java:263) ~[?:?]
	at org.elasticsearch.xpack.ml.inference.nlp.tokenizers.PrecompiledCharMapNormalizer.read(PrecompiledCharMapNormalizer.java:242) ~[?:?]
	at org.apache.lucene.analysis.CharacterUtils.readFully(CharacterUtils.java:183) ~[lucene-core-9.9.2.jar:?]
	at org.apache.lucene.analysis.CharacterUtils.fill(CharacterUtils.java:159) ~[lucene-core-9.9.2.jar:?]
	at org.apache.lucene.analysis.CharacterUtils.fill(CharacterUtils.java:177) ~[lucene-core-9.9.2.jar:?]
	at org.elasticsearch.xpack.ml.inference.nlp.tokenizers.UnigramTokenizer$SimpleWhitespaceTokenizer.next(UnigramTokenizer.java:464) ~[?:?]
	at org.elasticsearch.xpack.ml.inference.nlp.tokenizers.UnigramTokenizer.incrementToken(UnigramTokenizer.java:165) ~[?:?]
	at org.elasticsearch.xpack.ml.inference.nlp.tokenizers.XLMRobertaTokenizer.innerTokenize(XLMRobertaTokenizer.java:173) ~[?:?]
	at org.elasticsearch.xpack.ml.inference.nlp.tokenizers.NlpTokenizer.tokenize(NlpTokenizer.java:60) ~[?:?]
	at org.elasticsearch.xpack.ml.inference.nlp.tokenizers.XLMRobertaTokenizer.lambda$requestBuilder$0(XLMRobertaTokenizer.java:132) ~[?:?]
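
The failure mode in the trace above can be reproduced in miniature. The sketch below is a hedged illustration, not the code from `PrecompiledCharMapNormalizer`: the class `ReusableBufferDemo`, its `fill` method, and the capacity of 64 are all hypothetical, and `java.text.Normalizer` (NFKC) is assumed as a stand-in for the SentencePiece normaliser. It shows a reusable buffer of 8 chars overflowing on 'ﷺ' and a larger buffer accepting the same input.

```java
import java.text.Normalizer;

// Hypothetical sketch; names and sizes are illustrative, not the values used in the PR.
class ReusableBufferDemo {
    private final char[] buffer;

    ReusableBufferDemo(int capacity) {
        this.buffer = new char[capacity];
    }

    // Copies the normalised form of one character into the reusable buffer
    // and returns the number of chars written.
    int fill(String singleChar) {
        // Assumption: NFKC approximates the tokenizer's normalisation step.
        String normalised = Normalizer.normalize(singleChar, Normalizer.Form.NFKC);
        for (int i = 0; i < normalised.length(); i++) {
            // Throws ArrayIndexOutOfBoundsException once i reaches buffer.length.
            buffer[i] = normalised.charAt(i);
        }
        return normalised.length();
    }

    public static void main(String[] args) {
        try {
            new ReusableBufferDemo(8).fill("\uFDFA");
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("small buffer failed: " + e.getMessage());
        }
        // A larger (hypothetical) capacity leaves room for long normalised forms.
        System.out.println("large buffer wrote " + new ReusableBufferDemo(64).fill("\uFDFA") + " chars");
    }
}
```

Enlarging a fixed buffer, as this PR does, keeps the hot path allocation-free; growing the buffer on demand would be an alternative design, at the cost of an extra length check per character.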

This PR also removes some unused fields.

elasticsearchmachine added the Team:ML label Mar 22, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@elasticsearchmachine
Collaborator

Hi @davidkyle, I've created a changelog YAML for you.

Contributor

droberts195 left a comment

LGTM

droberts195 added the auto-backport-and-merge label Mar 22, 2024
davidkyle merged commit a8188f8 into elastic:main Mar 22, 2024
14 checks passed
@elasticsearchmachine
Collaborator

💚 Backport successful

Branch: 8.13

davidkyle added a commit to davidkyle/elasticsearch that referenced this pull request Mar 22, 2024
…lastic#106655)

Increases the buffer size for the normalised form of the input Unicode
character. Certain characters can have surprisingly long normalised forms.
elasticsearchmachine pushed a commit that referenced this pull request Mar 22, 2024
…106655) (#106661)

Increases the buffer size for the normalised form of the input Unicode
character. Certain characters can have surprisingly long normalised forms.
Labels
auto-backport-and-merge, >bug, :ml, Team:ML, v8.13.1, v8.14.0

3 participants