
[ML] ELSER crashes in local serverless setup #106206

Open
jonathan-buttner opened this issue Mar 11, 2024 · 18 comments
Assignees: dimkots
Labels: >bug, Feature:NLP, :ml, Team:ML

@jonathan-buttner (Contributor) commented Mar 11, 2024

Description

When interacting with ELSER in a local serverless setup, it crashes when attempting to perform inference.

Steps to reproduce

  • Ensure Docker is set up and running
  1. Check out Kibana and bootstrap it
  2. Start Elasticsearch serverless locally: yarn es serverless --projectType=security --ssl
  3. Start Kibana locally: yarn start --serverless=security --ssl
  4. Download ELSER (see the sketch after the deployment request below)
  5. Deploy ELSER via the inference API:
PUT _inference/sparse_embedding/elser
{
  "service": "elser",
  "service_settings": {
    "model_id": ".elser_model_2",
    "num_allocations": 1,
    "num_threads": 1
  },
  "task_settings": {}
}
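For step 4, a minimal sketch of one way to trigger the download via the trained models API (an illustrative assumption, not necessarily how the reporter did it; the model can also be downloaded from the Trained Models page in Kibana):

PUT _ml/trained_models/.elser_model_2
{
  "input": {
    "field_names": ["text_field"]
  }
}

GET _ml/trained_models/.elser_model_2?include=definition_status

The GET request can be polled until the model definition reports as fully defined.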
  6. Add an ingest pipeline with an inference processor:
PUT _ingest/pipeline/elser
{
  "processors": [
    {
      "inference": {
        "model_id": "elser",
        "input_output": [
            {
                "input_field": "content",
                "output_field": "text_embedding"
            }
        ]
      }
    },
    {
      "set": {
        "field": "timestamp",
        "value": "{{_ingest.timestamp}}"
      }
    }
  ]
}
  7. Attempt to perform inference:
POST _ingest/pipeline/elser/_simulate
{
  "docs": [
    {
      "_source": {
             "content": "hello" 
      }
    }]
}
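The crash can also be triggered without the ingest pipeline by calling the inference endpoint directly. This was not part of the original report; it is a hedged alternative for illustration, reusing the "elser" endpoint created above:

POST _inference/sparse_embedding/elser
{
  "input": "hello"
}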
  8. Retrieve the stats from the trained models API (see the example request below) to observe that the process has crashed:
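For example (the model id is an assumption based on the steps above; GET _ml/trained_models/_stats returns stats for all models):

GET _ml/trained_models/.elser_model_2/_stats

The relevant part of the response shows the failed routing state: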
            "routing_state": {
              "routing_state": "failed",
              "reason": """inference process crashed due to reason [[my-elser-model] pytorch_inference/659 process stopped unexpectedly: Fatal error: 'si_signo 11, si_code: 1, si_errno: 0, address: 0xffff83b20140, library: /lib/aarch64-linux-gnu/libc.so.6, base: 0xffff83a13000, normalized address: 0x10d140', version: 8.14.0-SNAPSHOT (build 38a5b0ec077958)
]"""
            },
@jonathan-buttner added the >bug, :ml, and Team:ML labels on Mar 11, 2024
@elasticsearchmachine (Collaborator)

Pinging @elastic/ml-core (Team:ML)

@droberts195 (Contributor)

I just confirmed that these steps don't cause a crash in the ESS CFT region running 8.14.0-SNAPSHOT. This is interesting, because the code should be very similar.

Serverless is running on c6i instances in AWS. CFT is running on n2 instances in GCP. So the problem might be down to serverless or might be down to the exact type of hardware.

@droberts195 (Contributor)

Logs show the crash happened on ARM:

"inference process crashed due to reason [[.elser_model_2] pytorch_inference/644 process stopped unexpectedly: Fatal error: 'si_signo 11, si_code: 1, si_errno: 0, address: 0xffff7a188140, library: /lib/aarch64-linux-gnu/libc.so.6, base: 0xffff7a07b000, normalized address: 0x10d140', version: 8.14.0-SNAPSHOT (build 38a5b0ec077958)\n]"

ML nodes on serverless are supposed to be on Intel hardware. I just tried reproducing this in a serverless project and the steps worked fine. However, as expected, my ML node was on Intel.

So it may be that the bug here is really "ELSER crashes on ARM".

And then the next question would be how did we end up with an ML node on ARM in serverless?

@droberts195 (Contributor)

Just reading through the report more closely, this wasn't even using real serverless. It was using simulated serverless running locally on a Mac. That explains why it was on ARM.

But also, running locally on a Mac, it's running Docker images in a Linux VM. We don't know how much memory that Linux VM had. It may be that it was trying to do too much in too little memory and, because of the vagaries of Docker on a Mac, that ended up as a SEGV rather than an out-of-memory error.

Given the circumstances I don't think this bug is anywhere near as serious as the title makes it sound.

@dimkots changed the title from "[ML] ELSER crashes in serverless" to "[ML] ELSER crashes in local serverless setup" on Mar 12, 2024
@droberts195 (Contributor)

I tried these steps on an m6g.2xlarge AWS instance, and they ran successfully without the process crashing.

(Originally, I tried on an m6g.large instance with 8GB RAM, and there pytorch_inference was killed by the OOM killer. But that was running Elasticsearch as a single-node cluster, so 50% of memory went to the JVM heap, with Kibana also running on the same machine. So that problem really was due to lack of memory. On the 32GB m6g.2xlarge, inference worked fine.)

Therefore, this problem really does seem to be confined to running in a Docker container in a Linux VM on ARM macOS. It's not great that this crash happens, and it's still a bug that running in Docker on a Mac doesn't work, but at least it's not going to affect customers in production.

@jonathan-buttner added the Feature:NLP label on Apr 15, 2024
@maxjakob

I encountered this bug yesterday trying to set up some integration tests locally on my Mac through Docker. The problem is not ELSER-specific but happens for other trained models too. For local dev it would be quite nice to have this working.

@sophiec20 (Contributor)

@maxjakob Which other trained models did you try?

@maxjakob commented May 8, 2024

I deployed sentence-transformers/msmarco-minilm-l-12-v3 with Eland, which worked fine, but upon search inference I got

"type": "status_exception",
"reason": "Error in inference process: [inference canceled as process is stopping]"

and the logs showed

... "message":"Inference process [sentence-transformers__msmarco-minilm-l-12-v3] failed due to [[sentence-transformers__msmarco-minilm-l-12-v3] pytorch_inference/229 process stopped unexpectedly:
Fatal error: 'si_signo 11, si_code: 1, si_errno: 0, address: 0xffff8407c140,
library: /lib/aarch64-linux-gnu/libc.so.6,
base: 0xffff83f6f000, normalized address: 0x10d140', version: 8.13.2 (build fdd7177d8c1325)\n]. This is the [1] failure in 24 hours, and the process will be restarted.", ...

(line breaks added by me to show that it's the same issue as reported above)
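For reference, a model like this is typically imported with Eland's eland_import_hub_model CLI. A minimal sketch, assuming a local cluster; the URL, credentials, and hub id casing here are assumptions, not the exact command used:

eland_import_hub_model \
  --url https://elastic:changeme@localhost:9200 \
  --hub-model-id sentence-transformers/msmarco-minilm-l-12-v3 \
  --task-type text_embedding \
  --start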

@maxjakob commented May 8, 2024

And I should add, this was with a regular Elasticsearch docker.elastic.co/elasticsearch/elasticsearch container, not with serverless!

@tveasey (Contributor) commented May 8, 2024

Looking back over the comments on this issue, I'm trying to understand if the problem is running the Linux version of our inference code on ARM Macs.

There is no reason to expect that instructions used by libtorch will be supported if they don't exist on the target platform: it uses a lot of hand-rolled SIMD code via MKL. These instructions are sometimes emulated, but that isn't guaranteed.

I would have bet that this was the cause, except the latest error report was for a SIGSEGV (11) rather than a SIGILL (4). In any case, I think we need to understand exactly which build of our inference code is being run in this scenario.

@davidkyle (Member) commented May 10, 2024

I've tested on a bunch of different Docker versions, and the good news is that before 8.13 you can run the ELSER model in Docker on macOS without it crashing.

In 8.13, libtorch was upgraded (elastic/ml-cpp#2612) from 1.13 to 2.1.2. This was a major version upgrade and could have introduced some incompatibility. MKL was also upgraded in 8.13, but that shouldn't be a problem, as MKL is only used in the Linux x86 build and these crashes are on aarch64 (library: /lib/aarch64-linux-gnu/libc.so.6).

Perhaps something changed in the way the Docker image is created in 8.13; eliminating that possibility would be a good first step.
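For anyone repeating this comparison across versions, a minimal sketch of starting a single-node container on a given version follows. The image tag, memory limit, and disabled security are illustrative assumptions, not the exact setup used in these tests:

docker run --rm -p 9200:9200 -m 8g \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  docker.elastic.co/elasticsearch/elasticsearch:8.12.2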

@droberts195 (Contributor)

oneapi-src/oneDNN#1832 looks interesting.

@jonathan-buttner (Contributor, issue author)

Including some ideas from @davidkyle

@tushar8590 commented May 14, 2024

I am using macOS 13.6.6 on Intel hardware. I have self-hosted Elasticsearch 8.13.2 on a local machine and am getting the same error while running inference on a Hugging Face model (sentence-transformers__stsb-distilroberta-base-v2).
Can someone help troubleshoot the issue?

@fred-maussion commented Jul 22, 2024

Facing the same issue in different types of environments where I can't use ELSER (.elser_model_2_linux-x86_64).
The model is deployed correctly but crashes as soon as I try to call it.

Environment 1

  • Container
  • Elastic : v8.13.2 / v8.14.3
  • Docker Env
  • MacOS M1

Environment 2

  • Virtual Machine - Ubuntu 22.04 - x86
  • Elastic : v8.13.2 / v8.14.3
  • Docker Env

Environment 3

  • Virtual Machine - Ubuntu 22.04 - x86
  • Elastic : v8.13.2 / v8.14.3
  • Package installation

Error

On every environment, the same behavior occurs.

The models are deployed correctly:
[screenshot]

But I get the following error with the ELSER Linux version as soon as I try to ingest the Observability Knowledge Base:
[screenshot]

Let me know if I can help.

@sejbot commented Aug 20, 2024

I also experience this issue. Running plain Elasticsearch, not serverless.

My setup is a MacBook M1 Pro running macOS Sonoma 14.5, and I am running Elasticsearch in a Docker container for local development and integration tests.
I am using the bundled E5 model in Elasticsearch. Deployment works fine, but inference crashes the model.

I have tested it on 8.12.2, 8.14.1, and 8.15.0.
8.12.2 works fine, but on 8.14.1 and 8.15.0 the model crashes when using inference. This is the error output I get:
[.multilingual-e5-small] inference process crashed due to reason [[.multilingual-e5-small] pytorch_inference/982 process stopped unexpectedly: Fatal error: 'si_signo 11, si_code: 1, si_errno: 0, address: 0xffff839a0140, library: /lib/aarch64-linux-gnu/libc.so.6, base: 0xffff83893000, normalized address: 0x10d140', version: 8.15.0 (build 64f00009177815)
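For context, inference against the bundled E5 model can be triggered directly with the infer trained model API. A minimal sketch, assuming the deployment is already started; the text_field input field name is an assumption based on the bundled model's default configuration:

POST _ml/trained_models/.multilingual-e5-small/_infer
{
  "docs": [
    { "text_field": "hello" }
  ]
}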

@edsavage (Contributor)

The crashes related to running Elasticsearch in a Docker container on Apple Silicon Macs are almost certainly due to the xbyak_aarch64 bug mentioned above - #106206 (comment). I reproduced the crash and obtained a stack trace:

#0  raise (sig=sig@entry=11) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x0000ffffa3ac34d4 in ml::core::crashHandler (sig=11, info=0xffff9d7f8940, context=<optimized out>) at /ml-cpp/lib/core/CCrashHandler_Linux.cc:65
#2  <signal handler called>
#3  0x0000ffffa359a140 in __aarch64_cas4_acq () from /lib/aarch64-linux-gnu/libc.so.6
#4  0x0000ffffa352c560 in __GI___readdir64 (dirp=dirp@entry=0x0) at ../sysdeps/posix/readdir.c:44
#5  0x0000ffffa8abeb34 in Xbyak_aarch64::util::Cpu::getFilePathMaxTailNumPlus1 (this=this@entry=0xffffaaf12730 <dnnl::impl::cpu::aarch64::cpu()::cpu_>, path=path@entry=0xffffa9d1cd48 "/sys/devices/system/node/node") at /usr/src/pytorch/third_party/ideep/mkl-dnn/src/cpu/aarch64/xbyak_aarch64/src/util_impl.cpp:175

This bug seems to have been fixed back in March, and hence PyTorch 2.3.1 is unaffected.

@dimkots self-assigned this on Aug 28, 2024
@toughcoding

Is any procedure available for upgrading PyTorch in the Elasticsearch Docker image, or do we have to do it ourselves?
