[bug] Recent PyTorch images causing Zombie Process #3965

Closed

dylanhellems opened this issue May 29, 2024 · 12 comments

@dylanhellems

dylanhellems commented May 29, 2024


Concise Description:
As of the May 22nd release of the PyTorch 2.1.0 images, our SageMaker Endpoints and Batch Transform Jobs using the new images have been failing. No obvious errors are thrown other than psutil.ZombieProcess: PID still exists but it's a zombie, raised from the pytorch_serving entrypoint.

DLC image/dockerfile:
763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:2.1.0-cpu-py310-ubuntu20.04-sagemaker-v1.8

Current behavior:
SageMaker Endpoints and Batch Transform Jobs are failing with a psutil.ZombieProcess: PID still exists but it's a zombie error from the pytorch_serving entrypoint.

Expected behavior:
SageMaker Endpoints and Batch Transform Jobs work as expected.

Additional context:
We had previously been using the 2.1.0-cpu-py310 and 2.1.0-gpu-py310 images but have had to pin the images back to their May 14th releases. The error is present in both pytorch-training and pytorch-inference. We made no changes to our deployments during this time; they simply started to fail once the new image was released.

Here is the full stacktrace from a failed Batch Transform Job:

Traceback (most recent call last):
  File "/usr/local/bin/dockerd-entrypoint.py", line 23, in <module>
    serving.main()
  File "/opt/conda/lib/python3.10/site-packages/sagemaker_pytorch_serving_container/serving.py", line 38, in main
    _start_torchserve()
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 56, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 257, in call
    return attempt.get(self._wrap_exception)
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 301, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/opt/conda/lib/python3.10/site-packages/six.py", line 719, in reraise
    raise value
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 251, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/opt/conda/lib/python3.10/site-packages/sagemaker_pytorch_serving_container/serving.py", line 34, in _start_torchserve
    torchserve.start_torchserve(handler_service=HANDLER_SERVICE)
  File "/opt/conda/lib/python3.10/site-packages/sagemaker_pytorch_serving_container/torchserve.py", line 102, in start_torchserve
    ts_process = _retrieve_ts_server_process()
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 56, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 266, in call
    raise attempt.get()
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 301, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/opt/conda/lib/python3.10/site-packages/six.py", line 719, in reraise
    raise value
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 251, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/opt/conda/lib/python3.10/site-packages/sagemaker_pytorch_serving_container/torchserve.py", line 187, in _retrieve_ts_server_process
    if TS_NAMESPACE in process.cmdline():
  File "/opt/conda/lib/python3.10/site-packages/psutil/__init__.py", line 719, in cmdline
    raise ImportError(msg)
  File "/opt/conda/lib/python3.10/site-packages/psutil/_pslinux.py", line 1714, in wrapper
    ret['name'] = name
  File "/opt/conda/lib/python3.10/site-packages/psutil/_pslinux.py", line 1853, in cmdline
    stime = float(values['stime']) / CLOCK_TICKS
  File "/opt/conda/lib/python3.10/site-packages/psutil/_pslinux.py", line 1758, in _raise_if_zombie
    name = decode(name)
   
psutil.ZombieProcess: PID still exists but it's a zombie (pid=104)
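
For context, the crash happens while the serving entrypoint scans the container's processes with psutil to find the torchserve launcher (the if TS_NAMESPACE in process.cmdline() frame above); if any process in the container is a zombie, process.cmdline() raises psutil.ZombieProcess. Below is a minimal illustrative sketch of that scan with a guard that skips zombie processes; it is not the toolkit's actual code, and the TS_NAMESPACE value is assumed.

import psutil

# Namespace string the serving container looks for in process command lines (assumed value).
TS_NAMESPACE = "org.pytorch.serve.ModelServer"

def find_torchserve_process():
    """Return the torchserve process, skipping zombies instead of crashing."""
    for process in psutil.process_iter():
        try:
            if TS_NAMESPACE in process.cmdline():
                return process
        except (psutil.ZombieProcess, psutil.NoSuchProcess, psutil.AccessDenied):
            # Zombie or vanished processes cannot report a cmdline; skip them.
            continue
    return None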
dylanhellems changed the title from "[bug] Recent PyTorch containers causing Zombie Process" to "[bug] Recent PyTorch images causing Zombie Process" on May 29, 2024
@greeshmaPr

greeshmaPr commented May 30, 2024

We are facing the same error too. The traceback is:

['torchserve', '--start', '--model-store', '/.sagemaker/ts/models', '--ts-config', '/etc/sagemaker-ts.properties', '--log-config', '/opt/conda/lib/python3.9/site-packages/sagemaker_pytorch_serving_container/etc/log4j2.xml', '--models', 'model=/opt/ml/model']
    serving.main()
  File "/opt/conda/lib/python3.9/site-packages/sagemaker_pytorch_serving_container/serving.py", line 38, in main
    _start_torchserve()
  File "/opt/conda/lib/python3.9/site-packages/retrying.py", line 56, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/opt/conda/lib/python3.9/site-packages/retrying.py", line 257, in call
    return attempt.get(self._wrap_exception)
  File "/opt/conda/lib/python3.9/site-packages/retrying.py", line 301, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/opt/conda/lib/python3.9/site-packages/six.py", line 719, in reraise
    raise value
  File "/opt/conda/lib/python3.9/site-packages/retrying.py", line 251, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/opt/conda/lib/python3.9/site-packages/sagemaker_pytorch_serving_container/serving.py", line 34, in _start_torchserve
    torchserve.start_torchserve(handler_service=HANDLER_SERVICE)
  File "/opt/conda/lib/python3.9/site-packages/sagemaker_pytorch_serving_container/torchserve.py", line 102, in start_torchserve
    ts_process = _retrieve_ts_server_process()
  File "/opt/conda/lib/python3.9/site-packages/retrying.py", line 56, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/opt/conda/lib/python3.9/site-packages/retrying.py", line 266, in call
    raise attempt.get()
  File "/opt/conda/lib/python3.9/site-packages/retrying.py", line 301, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/opt/conda/lib/python3.9/site-packages/six.py", line 719, in reraise
    raise value
  File "/opt/conda/lib/python3.9/site-packages/retrying.py", line 251, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/opt/conda/lib/python3.9/site-packages/sagemaker_pytorch_serving_container/torchserve.py", line 187, in _retrieve_ts_server_process
    if TS_NAMESPACE in process.cmdline():
  File "/opt/conda/lib/python3.9/site-packages/psutil/__init__.py", line 719, in cmdline
    return self._proc.cmdline()
  File "/opt/conda/lib/python3.9/site-packages/psutil/_pslinux.py", line 1714, in wrapper
    return fun(self, *args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/psutil/_pslinux.py", line 1853, in cmdline
    self._raise_if_zombie()
  File "/opt/conda/lib/python3.9/site-packages/psutil/_pslinux.py", line 1758, in _raise_if_zombie
    raise ZombieProcess(self.pid, self._name, self._ppid)

The base image that we are using is 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:1.13.1-cpu-py39

@sirutBuasai
Contributor

Hi, we are tracking this issue internally. A fix is in progress in aws/sagemaker-pytorch-inference-toolkit#166. Alternatively, a quick workaround when running the DLC manually is to add the --init flag to the docker run command; --init runs a minimal init process as PID 1 inside the container, which reaps zombie child processes.
For example:

docker run --init --name sagemaker_pt_dlc 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference-graviton:2.1.0-cpu-py310-ubuntu20.04-sagemaker serve

@conti748

conti748 commented Jun 1, 2024

Hi @sirutBuasai,

I am running a Batch Transform job with a PyTorch model on 763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-inference:2.2.0-gpu-py310, and I am experiencing the same error.

I saw the new 2.0.24 release of the sagemaker-pytorch-inference-toolkit package and tried installing it in the image via requirements.txt, but I got the same error.

@adrien-code-it

Hi @sirutBuasai,

I'm currently facing this exact issue when trying to deploy a PyTorch model on AWS SageMaker using torch==2.2.0.

I saw that the fix was merged into aws:master two days ago (https://github.com/aws/sagemaker-pytorch-inference-toolkit/pull/166); however, my latest deployment today still fails during torchserve startup with the error: psutil.ZombieProcess: PID still exists but it's a zombie.

When will this fix be available for deploying models?
Regards

@alan1420

alan1420 commented Jun 2, 2024

@conti748 @adrien-code-it I've tried putting git+https://github.com/aws/sagemaker-pytorch-inference-toolkit.git in requirements.txt and it works
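
For anyone trying this, here is a sketch of where that line typically goes, assuming the standard model archive layout that the PyTorch inference toolkit installs requirements from (file names are illustrative):

model.tar.gz
├── model.pth                # serialized model artifacts
└── code/
    ├── inference.py         # custom handler script
    └── requirements.txt     # add: git+https://github.com/aws/sagemaker-pytorch-inference-toolkit.git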

@5agado

5agado commented Jun 4, 2024

Same situation as @conti748. I tried adding it to the model inference requirements as suggested by @alan1420, but it didn't work.
Using 763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-inference:2.2.0-gpu-py310-cu118-ubuntu20.04-sagemaker as the base image.

@angarsky

angarsky commented Jun 4, 2024

I get the same traceback as @dylanhellems; I've compared it by file names and line numbers.

Base image in our case: 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.1.0-cpu-py310-ubuntu20.04-sagemaker

We are experiencing intermittent errors on inference endpoints during cold container starts (scaling). Usually the next several requests to the endpoint succeed, but it's not stable behaviour.

Our Dockerfile is:

FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.1.0-cpu-py310-ubuntu20.04-sagemaker

# Update torch version to resolve the issue with `mmcv` and `import mmdet.apis`.
# Similar issue: https://github.com/open-mmlab/mmdetection/issues/4291
RUN pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cpu

# Install MMDetection framework.
RUN pip install -U openmim && \
  mim install mmengine && \
  mim install mmcv && \
  mim install mmdet

# Install some extra pip packages.
RUN pip install imutils sagemaker flask

# An attempt to fix a permissions issue.
RUN mkdir -p /logs && chmod -R 777 /logs

# NOTE: SageMaker in local mode overrides the SAGEMAKER_* variables.
ENV AWS_DEFAULT_REGION us-east-1

# Use single worker for a serverless mode.
ENV SAGEMAKER_MODEL_SERVER_WORKERS 1

# Cleanup
RUN pip cache purge \
  && rm -rf /tmp/tmp* \
  && rm -rf /root/.cache

EXPOSE 8080 8081
ENTRYPOINT ["python", "/usr/local/bin/dockerd-entrypoint.py"]
CMD ["torchserve", "--start", "--ts-config", "/home/model-server/config.properties", "--model-store", "/home/model-server/"]

@5agado

5agado commented Jun 4, 2024

The recent releases would have solved all the issues if sagemaker-pytorch-inference had been updated to include the fix; instead it is still stuck at 2.0.23 :/
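
For reference, a quick way to confirm which toolkit version an image actually ships (run inside the container); the distribution name sagemaker-pytorch-inference is taken from the comment above and assumed to be the installed package name:

from importlib.metadata import version

# Prints the installed toolkit version, e.g. "2.0.23" on the unpatched images.
print(version("sagemaker-pytorch-inference"))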

@sirutBuasai
Contributor

sirutBuasai commented Jun 5, 2024

Hi, we are in the process of upgrading the toolkit version in the PyTorch Inference DLCs.
Please track the progress for each image here:
PyTorch 2.2 SageMaker Inference DLC: #3984
PyTorch 2.2 SageMaker Graviton Inference DLC: #3985
PyTorch 2.1 SageMaker Inference DLC: #3986
PyTorch 2.1 SageMaker Graviton Inference DLC: #3987
PyTorch 1.13 SageMaker Inference DLC: #3988

Once the PRs are merged, I will post an update when the images are publicly released again.

@conti748

conti748 commented Jun 5, 2024

@5agado @angarsky @adrien-code-it
The only solution I found was to roll back to the image 763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-inference:2.1.0-gpu-py310-cu118-ubuntu20.04-sagemaker-v1.8
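
A minimal sketch of pinning that rollback explicitly when deploying with the SageMaker Python SDK; the bucket, role, entry point, and instance type below are placeholders:

from sagemaker.pytorch import PyTorchModel

model = PyTorchModel(
    model_data="s3://my-bucket/model.tar.gz",               # placeholder
    role="arn:aws:iam::123456789012:role/SageMakerRole",    # placeholder
    entry_point="inference.py",                             # placeholder handler
    image_uri=(
        "763104351884.dkr.ecr.eu-central-1.amazonaws.com/"
        "pytorch-inference:2.1.0-gpu-py310-cu118-ubuntu20.04-sagemaker-v1.8"
    ),
)

predictor = model.deploy(initial_instance_count=1, instance_type="ml.g4dn.xlarge")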

@sirutBuasai
Contributor

Hi all,
Patched images for PT 2.1 and PT 2.2 have been released. See the linked release tags.

PyTorch 2.2 SageMaker Inference DLC: https://github.com/aws/deep-learning-containers/releases/tag/v1.13-pt-sagemaker-2.2.0-inf-py310
PyTorch 2.2 SageMaker Graviton Inference DLC: https://github.com/aws/deep-learning-containers/releases/tag/v1.9-pt-graviton-sagemaker-2.2.1-inf-cpu-py310
PyTorch 2.1 SageMaker Inference DLC: https://github.com/aws/deep-learning-containers/releases/tag/v1.12-pt-sagemaker-2.1.0-inf-py310
PyTorch 2.1 SageMaker Graviton Inference DLC: https://github.com/aws/deep-learning-containers/releases/tag/v1.10-pt-graviton-sagemaker-2.1.0-inf-cpu-py310

PT 1.13 is still WIP; I will update the release status once it is merged and built.

@sirutBuasai
Contributor

PT 1.13 has been released: https://github.com/aws/deep-learning-containers/releases/tag/v1.26-pt-sagemaker-1.13.1-inf-cpu-py39

All images are patched; closing the issue.
