[bug] Recent PyTorch images causing Zombie Process #3965
Comments
We too are facing the same error. The traceback is: …
The base image that we are using is …
Hi, we are tracking this issue internally. The current fix is in progress in aws/sagemaker-pytorch-inference-toolkit#166. Alternatively, a quick workaround if running the DLC manually would be to add …
Hi @sirutBuasai, I am working on a Batch Transform job using a PyTorch model, and I saw the new release …
Hi @sirutBuasai, I'm currently facing this exact issue when trying to deploy a PyTorch model on AWS SageMaker using torch==2.2.0. I saw here that the fix was merged in … When will this fix be available for deploying models?
@conti748 @adrien-code-it I've tried putting …
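Until the patched images land, one way to sidestep the bug is to pin the deployment to a known-good image tag instead of letting the SDK resolve the latest image for the framework version. A minimal sketch using the SageMaker Python SDK; the image tag, model artifact path, and IAM role below are illustrative placeholders, not values from this thread:

```python
from sagemaker.pytorch import PyTorchModel

# Pin an explicit, known-good DLC tag rather than letting the SDK
# resolve the latest (affected) image for the framework version.
model = PyTorchModel(
    model_data="s3://my-bucket/model.tar.gz",  # placeholder artifact
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    entry_point="inference.py",
    image_uri=(
        "763104351884.dkr.ecr.us-west-2.amazonaws.com/"
        "pytorch-inference:2.1.0-cpu-py310-ubuntu20.04-sagemaker-v1.7"  # example pre-May-22 tag
    ),
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
)
```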
Getting the same traceback as @dylanhellems; I've compared it by filenames and line numbers. Base image in our case: …
We are experiencing intermittent errors on inference endpoints during cold container starts (scaling). Usually the next several requests to the endpoint resolve the issue, but yeah, it's not stable behaviour. Our Dockerfile is: …
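Since retries eventually succeed once a healthy container comes up, a stopgap some teams use is a retry wrapper around the runtime call. A sketch with boto3; the endpoint name and payload are placeholders:

```python
import time

import boto3

runtime = boto3.client("sagemaker-runtime")

def invoke_with_retries(payload: bytes, attempts: int = 5) -> bytes:
    """Retry invoke_endpoint to ride out failing cold-started containers."""
    for attempt in range(attempts):
        try:
            resp = runtime.invoke_endpoint(
                EndpointName="my-endpoint",  # placeholder name
                ContentType="application/json",
                Body=payload,
            )
            return resp["Body"].read()
        except runtime.exceptions.ModelError:
            # The container failed mid-request (e.g. the zombie-process
            # crash on cold start); back off briefly and try again.
            time.sleep(2 ** attempt)
    raise RuntimeError("endpoint kept failing after retries")
```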
The recent releases would have solved all the issues if …
Hi, we are in the process of upgrading the toolkit versions in the PyTorch Inference DLCs. Once the PRs are merged, I will update this thread when the images are publicly released again.
@5agado @angarsky @adrien-code-it …
Hi all, the PyTorch 2.2 SageMaker Inference DLC has been released: https://github.com/aws/deep-learning-containers/releases/tag/v1.13-pt-sagemaker-2.2.0-inf-py310
PT 1.13 is still WIP; I will update the release status once it is merged and built.
PT 1.13 has been released: https://github.com/aws/deep-learning-containers/releases/tag/v1.26-pt-sagemaker-1.13.1-inf-cpu-py39
All images are patched; closing the issue.
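For anyone picking up the patched images, the SageMaker Python SDK can resolve the current inference image URI for a framework version rather than hard-coding an ECR path. A small sketch; the region, version, and instance type are examples:

```python
from sagemaker import image_uris

# Resolve the current (patched) PyTorch inference DLC for a given
# framework version; the SDK returns the full ECR image URI.
uri = image_uris.retrieve(
    framework="pytorch",
    region="us-west-2",
    version="2.2.0",               # e.g. the patched PT 2.2 release
    py_version="py310",
    instance_type="ml.m5.xlarge",  # selects the CPU vs GPU variant
    image_scope="inference",
)
print(uri)
```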
Checklist
- I've attached the script to reproduce the bug
- I've documented below the tests I've run on the DLC image
- I've built my own container based off the DLC (and I've attached the code used to build my own image)

Concise Description:
As of the May 22nd release of the PyTorch 2.1.0 images, our SageMaker Endpoints and Batch Transform Jobs using the new images have been failing. No obvious errors are thrown other than a
psutil.ZombieProcess: PID still exists but it's a zombie
from the pytorch_serving entrypoint.

DLC image/dockerfile:
763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:2.1.0-cpu-py310-ubuntu20.04-sagemaker-v1.8
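For context on the error itself: psutil raises ZombieProcess when a queried PID refers to a process that has exited but has not been reaped by its parent. A minimal Linux-only sketch, independent of the DLC, that reproduces the exception:

```python
import os
import time

import psutil

# Fork a child that exits immediately; because the parent does not
# wait() on it yet, the child lingers in the process table as a zombie.
pid = os.fork()
if pid == 0:
    os._exit(0)       # child: exit right away
time.sleep(0.5)       # parent: give the child time to terminate

proc = psutil.Process(pid)
print(proc.status())  # 'zombie'

try:
    proc.exe()        # querying a zombie's executable raises on Linux
except psutil.ZombieProcess as exc:
    print(exc)        # "... PID still exists but it's a zombie ..."
finally:
    os.waitpid(pid, 0)  # reap the child so the zombie disappears
```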
Current behavior:
SageMaker Endpoints and Batch Transform Jobs are failing with a
psutil.ZombieProcess: PID still exists but it's a zombie
error from the pytorch_serving entrypoint.

Expected behavior:
SageMaker Endpoints and Batch Transform Jobs work as expected.
Additional context:
We had previously been using the 2.1.0-cpu-py310 and 2.1.0-gpu-py310 images but have had to pin the images back to their May 14th releases. The error is present in both pytorch-training and pytorch-inference. We made no changes to our deployments during this time; they simply started to fail out of the blue once the new image was released.

Here is the full stacktrace from a failed Batch Transform Job: …