
[ML] _infer using XLM-RoBERTa model failed when input contains emojis #104981

Closed
wwang500 opened this issue Jan 31, 2024 · 2 comments · Fixed by #105183
Labels
>bug :ml Machine learning Team:ML Meta label for the ML team

Comments

@wwang500

Elasticsearch Version

8.12.0

OS Version

Linux x86

Problem Description

While running inference with the XLM-RoBERTa model 02shanky/finetuned-twitter-xlm-roberta-base-emotion, if the input contains certain mixes of emojis and text, such as: 😱😱😱😱😱😱this is weird, inference fails.

The error message is: [array_index_out_of_bounds_exception Root causes: array_index_out_of_bounds_exception: Index 33 out of bounds for length 32]: Index 33 out of bounds for length 32

Steps to Reproduce

  1. Import 02shanky/finetuned-twitter-xlm-roberta-base-emotion model from huggingface: eland_import_hub_model --url {es_url} -u {es_user} -p {es_password} --insecure --hub-model-id 02shanky/finetuned-twitter-xlm-roberta-base-emotion --task-type text_classification --start
  2. Go to Machine Learning -> Model Management -> Trained Models: test the model using the text: "😱😱😱😱😱😱this is weird"
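For a scripted reproduction of step 2, the same test can be issued as a REST call. A minimal sketch of the request: the endpoint path follows the error log below, while the `text_field` document key is an assumption for a `text_classification` deployment.

```python
import json

# Hypothetical client-side sketch of the _infer request that triggers the bug.
MODEL_ID = "02shanky__finetuned-twitter-xlm-roberta-base-emotion"
ENDPOINT = f"/_ml/trained_models/{MODEL_ID}/_infer"

# The mixed emoji/text input from the bug report.
body = {"docs": [{"text_field": "😱😱😱😱😱😱this is weird"}]}
payload = json.dumps(body, ensure_ascii=False)

print("POST", ENDPOINT)
print(payload)
```

The printed path and payload can then be sent with any HTTP client (for example `curl -k -u {es_user}:{es_password} -X POST "{es_url}$ENDPOINT" -H 'Content-Type: application/json' -d "$payload"`); on an affected 8.12.0 cluster this returns the 500 response shown below.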

Observed

(Screenshot: inference error shown in the Trained Models test panel)

Error log

[2024-01-31T10:38:25,435][WARN ][r.suppressed             ] [node-0] path: /_ml/trained_models/02shanky__finetuned-twitter-xlm-roberta-base-emotion/_infer, params: {model_id=02shanky__finetuned-twitter-xlm-roberta-base-emotion}, status: 500
java.lang.ArrayIndexOutOfBoundsException: Index 33 out of bounds for length 32
        at org.elasticsearch.xpack.ml.inference.nlp.tokenizers.UnigramTokenizer.tokenize(UnigramTokenizer.java:283) ~[?:?]
        at org.elasticsearch.xpack.ml.inference.nlp.tokenizers.UnigramTokenizer.incrementToken(UnigramTokenizer.java:223) ~[?:?]
        at org.elasticsearch.xpack.ml.inference.nlp.tokenizers.XLMRobertaTokenizer.innerTokenize(XLMRobertaTokenizer.java:173) ~[?:?]
        at org.elasticsearch.xpack.ml.inference.nlp.tokenizers.NlpTokenizer.tokenize(NlpTokenizer.java:60) ~[?:?]
        at org.elasticsearch.xpack.ml.inference.nlp.tokenizers.XLMRobertaTokenizer.lambda$requestBuilder$0(XLMRobertaTokenizer.java:132) ~[?:?]
        at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:273) ~[?:?]
        at java.util.stream.IntPipeline$1$1.accept(IntPipeline.java:180) ~[?:?]
        at java.util.stream.Streams$RangeIntSpliterator.forEachRemaining(Streams.java:104) ~[?:?]
        at java.util.Spliterator$OfInt.forEachRemaining(Spliterator.java:712) ~[?:?]
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509) ~[?:?]
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499) ~[?:?]
        at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921) ~[?:?]
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:?]
        at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682) ~[?:?]
        at org.elasticsearch.xpack.ml.inference.nlp.tokenizers.XLMRobertaTokenizer.lambda$requestBuilder$1(XLMRobertaTokenizer.java:133) ~[?:?]
        at org.elasticsearch.xpack.ml.inference.deployment.InferencePyTorchAction.doRun(InferencePyTorchAction.java:122) ~[?:?]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984) ~[elasticsearch-8.13.0-SNAPSHOT.jar:?]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.13.0-SNAPSHOT.jar:?]
        at org.elasticsearch.xpack.ml.inference.pytorch.PriorityProcessWorkerExecutorService$OrderedRunnable.run(PriorityProcessWorkerExecutorService.java:54) ~[?:?]
        at org.elasticsearch.xpack.ml.job.process.AbstractProcessWorkerExecutorService.start(AbstractProcessWorkerExecutorService.java:111) ~[?:?]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572) ~[?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:317) ~[?:?]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:917) ~[elasticsearch-8.13.0-SNAPSHOT.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
        at java.lang.Thread.run(Thread.java:1583) ~[?:?]
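The failing frames sit in the offset handling of UnigramTokenizer. One plausible mechanism (an assumption, not confirmed anywhere in this thread) is a mix-up between UTF-16 code-unit lengths, which Java's String.length() reports, and Unicode code-point counts: 😱 (U+1F631) lies outside the BMP and occupies a surrogate pair, so emoji-heavy input makes the two counts diverge. A small Python illustration of the divergence:

```python
s = "😱😱😱😱😱😱this is weird"  # the input from the bug report

# Python's len() counts Unicode code points; Java's String.length() counts
# UTF-16 code units, where each non-BMP character (like 😱) takes two.
code_points = len(s)
utf16_units = len(s.encode("utf-16-le")) // 2

print(code_points, utf16_units)  # 19 code points vs 25 UTF-16 units
```

If an offset computed in one of these units is used to index a buffer sized in the other, an out-of-range access of exactly this shape (ArrayIndexOutOfBoundsException slightly past the buffer length) can result.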

@wwang500 wwang500 added >bug :ml Machine learning Team:ML Meta label for the ML team labels Jan 31, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@joeafari
Contributor

joeafari commented Feb 2, 2024

Hey team,

This issue also occurs when the input contains only emojis.

(Screenshot: inference error for emoji-only input)

Different models lead to the same issue as well.
