
[ML] _infer using XLM-RoBERTa model failed when input contains emojis #104981

Closed
wwang500 opened this issue Jan 31, 2024 · 2 comments · Fixed by #105183
Labels
>bug :ml Machine learning Team:ML Meta label for the ML team

Comments

@wwang500

Elasticsearch Version

8.12.0

OS Version

Linux x86

Problem Description

While running inference with the XLM-RoBERTa model 02shanky/finetuned-twitter-xlm-roberta-base-emotion, if the input contains certain mixes of emojis and text, such as: 😱😱😱😱😱😱this is weird, inference fails.

The error message is: [array_index_out_of_bounds_exception Root causes: array_index_out_of_bounds_exception: Index 33 out of bounds for length 32]: Index 33 out of bounds for length 32

Steps to Reproduce

  1. Import 02shanky/finetuned-twitter-xlm-roberta-base-emotion model from huggingface: eland_import_hub_model --url {es_url} -u {es_user} -p {es_password} --insecure --hub-model-id 02shanky/finetuned-twitter-xlm-roberta-base-emotion --task-type text_classification --start
  2. Go to Machine Learning -> Model Management -> Trained Models: test the model using the text: "😱😱😱😱😱😱this is weird"
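For a scripted reproduction of step 2, the same test can be issued as a REST call. A minimal sketch of the request: the endpoint path follows the error log below, while the `text_field` document key is an assumption for a `text_classification` deployment.

```python
import json

# Hypothetical client-side sketch of the _infer request that triggers the bug.
MODEL_ID = "02shanky__finetuned-twitter-xlm-roberta-base-emotion"
ENDPOINT = f"/_ml/trained_models/{MODEL_ID}/_infer"

# The mixed emoji/text input from the bug report.
body = {"docs": [{"text_field": "😱😱😱😱😱😱this is weird"}]}
payload = json.dumps(body, ensure_ascii=False)

print("POST", ENDPOINT)
print(payload)
```

The printed path and payload can then be sent with any HTTP client (for example `curl -k -u {es_user}:{es_password} -X POST "{es_url}$ENDPOINT" -H 'Content-Type: application/json' -d "$payload"`); on an affected 8.12.0 cluster this returns the 500 response shown below.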

Observed

(Screenshot: inference error shown in the Trained Models test panel)

Error log

[2024-01-31T10:38:25,435][WARN ][r.suppressed             ] [node-0] path: /_ml/trained_models/02shanky__finetuned-twitter-xlm-roberta-base-emotion/_infer, params: {model_id=02shanky__finetuned-twitter-xlm-roberta-base-emotion}, status: 500
java.lang.ArrayIndexOutOfBoundsException: Index 33 out of bounds for length 32
        at org.elasticsearch.xpack.ml.inference.nlp.tokenizers.UnigramTokenizer.tokenize(UnigramTokenizer.java:283) ~[?:?]
        at org.elasticsearch.xpack.ml.inference.nlp.tokenizers.UnigramTokenizer.incrementToken(UnigramTokenizer.java:223) ~[?:?]
        at org.elasticsearch.xpack.ml.inference.nlp.tokenizers.XLMRobertaTokenizer.innerTokenize(XLMRobertaTokenizer.java:173) ~[?:?]
        at org.elasticsearch.xpack.ml.inference.nlp.tokenizers.NlpTokenizer.tokenize(NlpTokenizer.java:60) ~[?:?]
        at org.elasticsearch.xpack.ml.inference.nlp.tokenizers.XLMRobertaTokenizer.lambda$requestBuilder$0(XLMRobertaTokenizer.java:132) ~[?:?]
        at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:273) ~[?:?]
        at java.util.stream.IntPipeline$1$1.accept(IntPipeline.java:180) ~[?:?]
        at java.util.stream.Streams$RangeIntSpliterator.forEachRemaining(Streams.java:104) ~[?:?]
        at java.util.Spliterator$OfInt.forEachRemaining(Spliterator.java:712) ~[?:?]
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509) ~[?:?]
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499) ~[?:?]
        at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921) ~[?:?]
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:?]
        at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682) ~[?:?]
        at org.elasticsearch.xpack.ml.inference.nlp.tokenizers.XLMRobertaTokenizer.lambda$requestBuilder$1(XLMRobertaTokenizer.java:133) ~[?:?]
        at org.elasticsearch.xpack.ml.inference.deployment.InferencePyTorchAction.doRun(InferencePyTorchAction.java:122) ~[?:?]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984) ~[elasticsearch-8.13.0-SNAPSHOT.jar:?]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.13.0-SNAPSHOT.jar:?]
        at org.elasticsearch.xpack.ml.inference.pytorch.PriorityProcessWorkerExecutorService$OrderedRunnable.run(PriorityProcessWorkerExecutorService.java:54) ~[?:?]
        at org.elasticsearch.xpack.ml.job.process.AbstractProcessWorkerExecutorService.start(AbstractProcessWorkerExecutorService.java:111) ~[?:?]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572) ~[?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:317) ~[?:?]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:917) ~[elasticsearch-8.13.0-SNAPSHOT.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
        at java.lang.Thread.run(Thread.java:1583) ~[?:?]
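The failing frames sit in the offset handling of UnigramTokenizer. One plausible mechanism (an assumption, not confirmed anywhere in this thread) is a mix-up between UTF-16 code-unit lengths, which Java's String.length() reports, and Unicode code-point counts: 😱 (U+1F631) lies outside the BMP and occupies a surrogate pair, so emoji-heavy input makes the two counts diverge. A small Python illustration of the divergence:

```python
s = "😱😱😱😱😱😱this is weird"  # the input from the bug report

# Python's len() counts Unicode code points; Java's String.length() counts
# UTF-16 code units, where each non-BMP character (like 😱) takes two.
code_points = len(s)
utf16_units = len(s.encode("utf-16-le")) // 2

print(code_points, utf16_units)  # 19 code points vs 25 UTF-16 units
```

If an offset computed in one of these units is used to index a buffer sized in the other, an out-of-range access of exactly this shape (ArrayIndexOutOfBoundsException slightly past the buffer length) can result.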

@wwang500 wwang500 added >bug :ml Machine learning Team:ML Meta label for the ML team labels Jan 31, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@joeafari
Contributor

joeafari commented Feb 2, 2024

Hey team,

This issue also occurs when the input contains only emojis.

(Screenshot: inference error for emoji-only input)

Different models lead to the same issue as well.
