[ML] Fix Array out of bounds exception in the XLM Roberta tokenizer #106655

davidkyle · 2024-03-22T10:16:04Z

The following ArrayIndexOutOfBoundsException has been observed when tokenising certain Unicode characters with the XLM Roberta tokenizer. The overflow comes from a fixed-size reusable buffer that holds the normalised form of a single Unicode character encoded as UTF-8. The normalised form can be much longer than the allocated buffer size (8 chars); for example, 'ﷺ' normalises to 33 characters. That is surprising, so I compared the output to other SentencePiece implementations and the results agree.
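To see the expansion outside the tokenizer, here is a minimal sketch that uses the JDK's `java.text.Normalizer` with NFKC as a stand-in. That is an assumption: the tokenizer uses a SentencePiece precompiled char map, which may not match JDK NFKC exactly. The class name `NormalizedLengthDemo` and the constant `OLD_BUFFER_SIZE` are hypothetical and only mirror the 8-char figure quoted above. It prints both the UTF-16 char count and the UTF-8 byte count, since the two differ for non-ASCII text.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

public class NormalizedLengthDemo {
    // Hypothetical constant mirroring the 8-char buffer described in the PR text.
    public static final int OLD_BUFFER_SIZE = 8;

    // JDK NFKC is used as an approximation of the SentencePiece normaliser (assumption).
    public static String nfkc(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFKC);
    }

    // True when the normalised form fits in a char buffer of the given size.
    public static boolean fits(String input, int bufferSize) {
        return nfkc(input).length() <= bufferSize;
    }

    public static void main(String[] args) {
        String input = "\uFDFA"; // the 'ﷺ' ligature from the PR description
        String normalized = nfkc(input);
        System.out.println("UTF-16 chars: " + normalized.length());
        System.out.println("UTF-8 bytes:  " + normalized.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("fits in " + OLD_BUFFER_SIZE + " chars: " + fits(input, OLD_BUFFER_SIZE));
    }
}
```

A check like `fits` could be run over a sample corpus to estimate how large a fixed buffer needs to be in practice.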

The fix here is to spend a few bytes increasing the buffer size.

java.lang.ArrayIndexOutOfBoundsException: Index 8 out of bounds for length 8
	at org.apache.lucene.util.UnicodeUtil.UTF8toUTF16(UnicodeUtil.java:650) ~[lucene-core-9.9.2.jar:?]
	at org.elasticsearch.xpack.ml.inference.nlp.tokenizers.PrecompiledCharMapNormalizer.normalize(PrecompiledCharMapNormalizer.java:194) ~[?:?]
	at org.elasticsearch.xpack.ml.inference.nlp.tokenizers.PrecompiledCharMapNormalizer.fill(PrecompiledCharMapNormalizer.java:263) ~[?:?]
	at org.elasticsearch.xpack.ml.inference.nlp.tokenizers.PrecompiledCharMapNormalizer.read(PrecompiledCharMapNormalizer.java:242) ~[?:?]
	at org.apache.lucene.analysis.CharacterUtils.readFully(CharacterUtils.java:183) ~[lucene-core-9.9.2.jar:?]
	at org.apache.lucene.analysis.CharacterUtils.fill(CharacterUtils.java:159) ~[lucene-core-9.9.2.jar:?]
	at org.apache.lucene.analysis.CharacterUtils.fill(CharacterUtils.java:177) ~[lucene-core-9.9.2.jar:?]
	at org.elasticsearch.xpack.ml.inference.nlp.tokenizers.UnigramTokenizer$SimpleWhitespaceTokenizer.next(UnigramTokenizer.java:464) ~[?:?]
	at org.elasticsearch.xpack.ml.inference.nlp.tokenizers.UnigramTokenizer.incrementToken(UnigramTokenizer.java:165) ~[?:?]
	at org.elasticsearch.xpack.ml.inference.nlp.tokenizers.XLMRobertaTokenizer.innerTokenize(XLMRobertaTokenizer.java:173) ~[?:?]
	at org.elasticsearch.xpack.ml.inference.nlp.tokenizers.NlpTokenizer.tokenize(NlpTokenizer.java:60) ~[?:?]
	at org.elasticsearch.xpack.ml.inference.nlp.tokenizers.XLMRobertaTokenizer.lambda$requestBuilder$0(XLMRobertaTokenizer.java:132) ~[?:?]
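The trace shows the overflow happening while UTF-8 bytes are decoded into a fixed `char[]`. The fix in this PR is to enlarge that buffer. As a separate illustration, and not the change made here, the sketch below sizes a reusable buffer from the input instead, relying on the fact that decoding UTF-8 never produces more UTF-16 chars than there are input bytes. The class `ReusableCharBuffer` and its methods are hypothetical, and it uses the JDK `CharsetDecoder` rather than Lucene's `UnicodeUtil`.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class ReusableCharBuffer {
    private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
    private char[] buffer;

    public ReusableCharBuffer(int initialCapacity) {
        buffer = new char[initialCapacity];
    }

    // Decodes the UTF-8 slice into the reusable buffer and returns the number of chars written.
    public int fill(byte[] utf8, int offset, int length) {
        // The byte count is an upper bound on the decoded char count,
        // so a buffer of that size can never overflow.
        if (length > buffer.length) {
            buffer = new char[length]; // old contents are overwritten, no copy needed
        }
        CharBuffer out = CharBuffer.wrap(buffer);
        decoder.reset();
        CoderResult result = decoder.decode(ByteBuffer.wrap(utf8, offset, length), out, true);
        if (result.isError()) {
            throw new IllegalArgumentException("malformed UTF-8 input");
        }
        decoder.flush(out);
        return out.position();
    }

    public char[] chars() {
        return buffer;
    }

    public int capacity() {
        return buffer.length;
    }
}
```

The trade-off is an occasional reallocation on unusually long normalised forms, versus the PR's simpler approach of paying a few extra bytes up front.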

This PR also removes some unused fields.

elasticsearchmachine · 2024-03-22T10:16:27Z

Pinging @elastic/ml-core (Team:ML)

elasticsearchmachine · 2024-03-22T10:16:27Z

Hi @davidkyle, I've created a changelog YAML for you.

droberts195

LGTM

elasticsearchmachine · 2024-03-22T11:13:54Z

💚 Backport successful

Status	Branch	Result
✅	8.13

…lastic#106655) Increases the buffer size for the normalised form of the input unicode character. Certain characters can have surprisingly long normalised forms

…106655) (#106661) Increases the buffer size for the normalised form of the input unicode character. Certain characters can have surprisingly long normalised forms

increase buffer size

54e5dff

davidkyle added >bug :ml Machine learning v8.13.1 v8.14.0 labels Mar 22, 2024

elasticsearchmachine added the Team:ML Meta label for the ML team label Mar 22, 2024

Update docs/changelog/106655.yaml

620ca4a

droberts195 approved these changes Mar 22, 2024


droberts195 added the auto-backport-and-merge Automatically create backport pull requests and merge when ready label Mar 22, 2024

davidkyle merged commit a8188f8 into elastic:main Mar 22, 2024
14 checks passed

davidkyle mentioned this pull request Mar 22, 2024

[8.13] [ML] Fix Array out of bounds exception in the XLM Roberta tokenizer (#106655) #106661

Merged

This was referenced Mar 25, 2024

Set index mode earlier for new downsample index #106728

Merged

Added initial metrics for synthetic source #106732

Merged
