
[ML] ELSER crashes in local serverless setup #106206

Open
jonathan-buttner opened this issue Mar 11, 2024 · 18 comments
Assignees: dimkots
Labels: >bug, Feature:NLP, :ml, Team:ML

@jonathan-buttner (Contributor) commented Mar 11, 2024

Description

When interacting with ELSER in a local serverless setup, it crashes when attempting to perform inference.

Steps to reproduce

  • Ensure Docker is set up and running
  1. Check out Kibana and bootstrap it
  2. Start Elasticsearch serverless locally: yarn es serverless --projectType=security --ssl
  3. Start Kibana locally: yarn start --serverless=security --ssl
  4. Download ELSER (see the sketch after the deployment request below)
  5. Deploy ELSER via the inference API:
PUT _inference/sparse_embedding/elser
{
  "service": "elser",
  "service_settings": {
    "model_id": ".elser_model_2",
    "num_allocations": 1,
    "num_threads": 1
  },
  "task_settings": {}
}
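For step 4, a minimal sketch of one way to trigger the download via the trained models API (an illustrative assumption, not necessarily how the reporter did it; the model can also be downloaded from the Trained Models page in Kibana):

PUT _ml/trained_models/.elser_model_2
{
  "input": {
    "field_names": ["text_field"]
  }
}

GET _ml/trained_models/.elser_model_2?include=definition_status

The GET request can be polled until the model definition reports as fully defined.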
  6. Add an ingest pipeline with an inference processor:
PUT _ingest/pipeline/elser
{
  "processors": [
    {
      "inference": {
        "model_id": "elser",
        "input_output": [
            {
                "input_field": "content",
                "output_field": "text_embedding"
            }
        ]
      }
    },
    {
      "set": {
        "field": "timestamp",
        "value": "{{_ingest.timestamp}}"
      }
    }
  ]
}
  7. Attempt to perform inference:
POST _ingest/pipeline/elser/_simulate
{
  "docs": [
    {
      "_source": {
             "content": "hello" 
      }
    }]
}
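The crash can also be triggered without the ingest pipeline by calling the inference endpoint directly. This was not part of the original report; it is a hedged alternative for illustration, reusing the "elser" endpoint created above:

POST _inference/sparse_embedding/elser
{
  "input": "hello"
}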
  8. Retrieve the stats from the trained models API (see the example request below) to observe that the process has crashed:
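For example (the model id is an assumption based on the steps above; GET _ml/trained_models/_stats returns stats for all models):

GET _ml/trained_models/.elser_model_2/_stats

The relevant part of the response shows the failed routing state: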
            "routing_state": {
              "routing_state": "failed",
              "reason": """inference process crashed due to reason [[my-elser-model] pytorch_inference/659 process stopped unexpectedly: Fatal error: 'si_signo 11, si_code: 1, si_errno: 0, address: 0xffff83b20140, library: /lib/aarch64-linux-gnu/libc.so.6, base: 0xffff83a13000, normalized address: 0x10d140', version: 8.14.0-SNAPSHOT (build 38a5b0ec077958)
]"""
            },
@jonathan-buttner added the >bug, :ml, and Team:ML labels on Mar 11, 2024
@elasticsearchmachine (Collaborator)

Pinging @elastic/ml-core (Team:ML)

@droberts195 (Contributor)

I just confirmed that these steps don't cause a crash in the ESS CFT region running 8.14.0-SNAPSHOT. This is interesting, because the code should be very similar.

Serverless is running on c6i instances in AWS. CFT is running on n2 instances in GCP. So the problem might be down to serverless or might be down to the exact type of hardware.

@droberts195 (Contributor)

Logs show the crash happened on ARM:

"inference process crashed due to reason [[.elser_model_2] pytorch_inference/644 process stopped unexpectedly: Fatal error: 'si_signo 11, si_code: 1, si_errno: 0, address: 0xffff7a188140, library: /lib/aarch64-linux-gnu/libc.so.6, base: 0xffff7a07b000, normalized address: 0x10d140', version: 8.14.0-SNAPSHOT (build 38a5b0ec077958)\n]"

ML nodes on serverless are supposed to be on Intel hardware. I just tried reproducing this in a serverless project and the steps worked fine. However, as expected, my ML node was on Intel.

So it may be that the bug here is really "ELSER crashes on ARM".

And then the next question would be how did we end up with an ML node on ARM in serverless?

@droberts195 (Contributor)

Just reading through the report more closely, this wasn't even using real serverless. It was using simulated serverless running locally on a Mac. That explains why it was on ARM.

But also, running locally on a Mac, it's running Docker images in a Linux VM. We don't know how much memory that Linux VM had. It may be that it was trying to do too much in too little memory and, because of the vagaries of Docker on a Mac, that ended up as a SEGV rather than an out-of-memory error.

Given the circumstances I don't think this bug is anywhere near as serious as the title makes it sound.

@dimkots changed the title from "[ML] ELSER crashes in serverless" to "[ML] ELSER crashes in local serverless setup" on Mar 12, 2024
@droberts195 (Contributor)

I tried these steps on an m6g.2xlarge AWS instance, and they ran successfully without the process crashing.

(Originally, I tried on an m6g.large instance with 8GB RAM, and there pytorch_inference was killed by the OOM killer. But that was running Elasticsearch as a single-node cluster, so 50% of memory went to the JVM heap, with Kibana also running on the same machine. So that problem really was due to lack of memory. On the 32GB m6g.2xlarge, inference worked fine.)

Therefore, this problem really does seem to be confined to running in a Docker container in a Linux VM on ARM macOS. It's not great that this crash happens, and it's still a bug that running in Docker on a Mac doesn't work, but at least it's not going to affect customers in production.

@jonathan-buttner added the Feature:NLP label on Apr 15, 2024
@maxjakob

I encountered this bug yesterday trying to set up some integration tests locally on my Mac through Docker. The problem is not ELSER-specific but happens for other trained models too. For local dev it would be quite nice to have this working.

@sophiec20 (Contributor)

@maxjakob Which other trained models did you try?

@maxjakob commented May 8, 2024

I deployed sentence-transformers/msmarco-minilm-l-12-v3 with Eland, which worked fine, but upon search inference I got

"type": "status_exception",
"reason": "Error in inference process: [inference canceled as process is stopping]"

and the logs showed

... "message":"Inference process [sentence-transformers__msmarco-minilm-l-12-v3] failed due to [[sentence-transformers__msmarco-minilm-l-12-v3] pytorch_inference/229 process stopped unexpectedly:
Fatal error: 'si_signo 11, si_code: 1, si_errno: 0, address: 0xffff8407c140,
library: /lib/aarch64-linux-gnu/libc.so.6,
base: 0xffff83f6f000, normalized address: 0x10d140', version: 8.13.2 (build fdd7177d8c1325)\n]. This is the [1] failure in 24 hours, and the process will be restarted.", ...

(line breaks added by me to show that it's the same issue as reported above)
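For reference, a model like this is typically imported with Eland's eland_import_hub_model CLI. A minimal sketch, assuming a local cluster; the URL, credentials, and hub id casing here are assumptions, not the exact command used:

eland_import_hub_model \
  --url https://elastic:changeme@localhost:9200 \
  --hub-model-id sentence-transformers/msmarco-minilm-l-12-v3 \
  --task-type text_embedding \
  --start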

@maxjakob commented May 8, 2024

And I should add, this was with a regular Elasticsearch docker.elastic.co/elasticsearch/elasticsearch container, not with serverless!

@tveasey (Contributor) commented May 8, 2024

Looking back over the comments on this issue, I'm trying to understand if the problem is running the Linux version of our inference code on ARM Macs.

There is no reason to expect that instructions used by libtorch will be supported if they don't exist on the target platform: it uses a lot of hand-rolled SIMD code via MKL. These instructions are sometimes emulated, but that isn't guaranteed.

I would have bet that this was the cause, except the latest error report was for a SIGSEGV (11) rather than a SIGILL (4). In any case, I think we need to understand exactly which build of our inference code is being run in this scenario.

@davidkyle (Member) commented May 10, 2024

I've tested on a bunch of different Docker versions, and the good news is that before 8.13 you can run the ELSER model in Docker on macOS without it crashing.

In 8.13, libtorch was upgraded (elastic/ml-cpp#2612) from 1.13 to 2.1.2. This was a major version upgrade and could have introduced some incompatibility. MKL was also upgraded in 8.13, but that shouldn't be a problem, as MKL is only used in the Linux x86 build and these crashes are on aarch64 (library: /lib/aarch64-linux-gnu/libc.so.6).

Perhaps something changed in the way the Docker image is created in 8.13; eliminating that possibility would be a good first step.
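For anyone repeating this comparison across versions, a minimal sketch of starting a single-node container on a given version follows. The image tag, memory limit, and disabled security are illustrative assumptions, not the exact setup used in these tests:

docker run --rm -p 9200:9200 -m 8g \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  docker.elastic.co/elasticsearch/elasticsearch:8.12.2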

@droberts195 (Contributor)

oneapi-src/oneDNN#1832 looks interesting.

@jonathan-buttner (Contributor, issue author)

Including some ideas from @davidkyle

@tushar8590 commented May 14, 2024

I am using macOS 13.6.6 on Intel hardware. I have self-hosted Elasticsearch 8.13.2 on a local machine and am getting the same error while running inference on a Hugging Face model (sentence-transformers__stsb-distilroberta-base-v2).
Can someone help troubleshoot the issue?

@fred-maussion commented Jul 22, 2024

Facing the same issue in different types of environments where I can't use ELSER (.elser_model_2_linux-x86_64).
The model is deployed correctly but crashes as soon as I try to call it.

Environment 1

  • Container
  • Elastic : v8.13.2 / v8.14.3
  • Docker Env
  • MacOS M1

Environment 2

  • Virtual Machine - Ubuntu 22.04 - x86
  • Elastic : v8.13.2 / v8.14.3
  • Docker Env

Environment 3

  • Virtual Machine - Ubuntu 22.04 - x86
  • Elastic : v8.13.2 / v8.14.3
  • Package installation

Error

On every environment, the same behavior occurs.

The models are deployed correctly:
[screenshot]

But I get the following error with the ELSER Linux version as soon as I try to ingest the Observability Knowledge Base:
[screenshot]

Let me know if I can help.

@sejbot commented Aug 20, 2024

I also experience this issue. Running plain Elasticsearch, not serverless.

My setup is a MacBook M1 Pro running macOS Sonoma 14.5, and I am running Elasticsearch in a Docker container for local development and integration tests.
I am using the bundled E5 model in Elasticsearch. Deployment works fine, but inference crashes the model.

I have tested it on 8.12.2, 8.14.1, and 8.15.0.
8.12.2 works fine, but on 8.14.1 and 8.15.0 the model crashes when using inference. This is the error output I get:
[.multilingual-e5-small] inference process crashed due to reason [[.multilingual-e5-small] pytorch_inference/982 process stopped unexpectedly: Fatal error: 'si_signo 11, si_code: 1, si_errno: 0, address: 0xffff839a0140, library: /lib/aarch64-linux-gnu/libc.so.6, base: 0xffff83893000, normalized address: 0x10d140', version: 8.15.0 (build 64f00009177815)
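For context, inference against the bundled E5 model can be triggered directly with the infer trained model API. A minimal sketch, assuming the deployment is already started; the text_field input field name is an assumption based on the bundled model's default configuration:

POST _ml/trained_models/.multilingual-e5-small/_infer
{
  "docs": [
    { "text_field": "hello" }
  ]
}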

@edsavage (Contributor)

The crashes related to running Elasticsearch in a Docker container on Apple Silicon Macs are almost certainly due to the xbyak_aarch64 bug mentioned above - #106206 (comment). I reproduced the crash and obtained a stack trace:

#0  raise (sig=sig@entry=11) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x0000ffffa3ac34d4 in ml::core::crashHandler (sig=11, info=0xffff9d7f8940, context=<optimized out>) at /ml-cpp/lib/core/CCrashHandler_Linux.cc:65
#2  <signal handler called>
#3  0x0000ffffa359a140 in __aarch64_cas4_acq () from /lib/aarch64-linux-gnu/libc.so.6
#4  0x0000ffffa352c560 in __GI___readdir64 (dirp=dirp@entry=0x0) at ../sysdeps/posix/readdir.c:44
#5  0x0000ffffa8abeb34 in Xbyak_aarch64::util::Cpu::getFilePathMaxTailNumPlus1 (this=this@entry=0xffffaaf12730 <dnnl::impl::cpu::aarch64::cpu()::cpu_>, path=path@entry=0xffffa9d1cd48 "/sys/devices/system/node/node") at /usr/src/pytorch/third_party/ideep/mkl-dnn/src/cpu/aarch64/xbyak_aarch64/src/util_impl.cpp:175

This bug seems to have been fixed back in March, and hence PyTorch 2.3.1 is unaffected.

@dimkots self-assigned this on Aug 28, 2024
@toughcoding

Is any procedure available for upgrading PyTorch in the Elasticsearch Docker image, or do we have to do it ourselves?
