[bug] Recent PyTorch images causing Zombie Process #3965

Closed

dylanhellems opened this issue May 29, 2024 · 12 comments

@dylanhellems

dylanhellems commented May 29, 2024


Concise Description:
As of the May 22nd release of the PyTorch 2.1.0 images, our SageMaker Endpoints and Batch Transform Jobs using the new images have been failing. No obvious errors are thrown other than psutil.ZombieProcess: PID still exists but it's a zombie, raised from the pytorch_serving entrypoint.

DLC image/dockerfile:
763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:2.1.0-cpu-py310-ubuntu20.04-sagemaker-v1.8

Current behavior:
SageMaker Endpoints and Batch Transform Jobs are failing with a psutil.ZombieProcess: PID still exists but it's a zombie error from the pytorch_serving entrypoint.

Expected behavior:
SageMaker Endpoints and Batch Transform Jobs work as expected.

Additional context:
We had previously been using the 2.1.0-cpu-py310 and 2.1.0-gpu-py310 images but have had to pin the images back to their May 14th releases. The error is present in both pytorch-training and pytorch-inference. We made no changes to our deployments during this time; they simply started to fail once the new image was released.

Here is the full stacktrace from a failed Batch Transform Job:

Traceback (most recent call last):
  File "/usr/local/bin/dockerd-entrypoint.py", line 23, in <module>
    serving.main()
  File "/opt/conda/lib/python3.10/site-packages/sagemaker_pytorch_serving_container/serving.py", line 38, in main
    _start_torchserve()
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 56, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 257, in call
    return attempt.get(self._wrap_exception)
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 301, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/opt/conda/lib/python3.10/site-packages/six.py", line 719, in reraise
    raise value
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 251, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/opt/conda/lib/python3.10/site-packages/sagemaker_pytorch_serving_container/serving.py", line 34, in _start_torchserve
    torchserve.start_torchserve(handler_service=HANDLER_SERVICE)
  File "/opt/conda/lib/python3.10/site-packages/sagemaker_pytorch_serving_container/torchserve.py", line 102, in start_torchserve
    ts_process = _retrieve_ts_server_process()
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 56, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 266, in call
    raise attempt.get()
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 301, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/opt/conda/lib/python3.10/site-packages/six.py", line 719, in reraise
    raise value
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 251, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/opt/conda/lib/python3.10/site-packages/sagemaker_pytorch_serving_container/torchserve.py", line 187, in _retrieve_ts_server_process
    if TS_NAMESPACE in process.cmdline():
  File "/opt/conda/lib/python3.10/site-packages/psutil/__init__.py", line 719, in cmdline
    raise ImportError(msg)
  File "/opt/conda/lib/python3.10/site-packages/psutil/_pslinux.py", line 1714, in wrapper
    ret['name'] = name
  File "/opt/conda/lib/python3.10/site-packages/psutil/_pslinux.py", line 1853, in cmdline
    stime = float(values['stime']) / CLOCK_TICKS
  File "/opt/conda/lib/python3.10/site-packages/psutil/_pslinux.py", line 1758, in _raise_if_zombie
    name = decode(name)
   
psutil.ZombieProcess: PID still exists but it's a zombie (pid=104)
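
For context, the crash happens while the serving entrypoint scans the container's processes with psutil to find the torchserve launcher (the if TS_NAMESPACE in process.cmdline() frame above); if any process in the container is a zombie, process.cmdline() raises psutil.ZombieProcess. Below is a minimal illustrative sketch of that scan with a guard that skips zombie processes; it is not the toolkit's actual code, and the TS_NAMESPACE value is assumed.

import psutil

# Namespace string the serving container looks for in process command lines (assumed value).
TS_NAMESPACE = "org.pytorch.serve.ModelServer"

def find_torchserve_process():
    """Return the torchserve process, skipping zombies instead of crashing."""
    for process in psutil.process_iter():
        try:
            if TS_NAMESPACE in process.cmdline():
                return process
        except (psutil.ZombieProcess, psutil.NoSuchProcess, psutil.AccessDenied):
            # Zombie or vanished processes cannot report a cmdline; skip them.
            continue
    return None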
dylanhellems changed the title from "[bug] Recent PyTorch containers causing Zombie Process" to "[bug] Recent PyTorch images causing Zombie Process" on May 29, 2024
@greeshmaPr

greeshmaPr commented May 30, 2024

We are facing the same error too. The traceback is:

['torchserve', '--start', '--model-store', '/.sagemaker/ts/models', '--ts-config', '/etc/sagemaker-ts.properties', '--log-config', '/opt/conda/lib/python3.9/site-packages/sagemaker_pytorch_serving_container/etc/log4j2.xml', '--models', 'model=/opt/ml/model']
    serving.main()
  File "/opt/conda/lib/python3.9/site-packages/sagemaker_pytorch_serving_container/serving.py", line 38, in main
    _start_torchserve()
  File "/opt/conda/lib/python3.9/site-packages/retrying.py", line 56, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/opt/conda/lib/python3.9/site-packages/retrying.py", line 257, in call
    return attempt.get(self._wrap_exception)
  File "/opt/conda/lib/python3.9/site-packages/retrying.py", line 301, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/opt/conda/lib/python3.9/site-packages/six.py", line 719, in reraise
    raise value
  File "/opt/conda/lib/python3.9/site-packages/retrying.py", line 251, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/opt/conda/lib/python3.9/site-packages/sagemaker_pytorch_serving_container/serving.py", line 34, in _start_torchserve
    torchserve.start_torchserve(handler_service=HANDLER_SERVICE)
  File "/opt/conda/lib/python3.9/site-packages/sagemaker_pytorch_serving_container/torchserve.py", line 102, in start_torchserve
    ts_process = _retrieve_ts_server_process()
  File "/opt/conda/lib/python3.9/site-packages/retrying.py", line 56, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/opt/conda/lib/python3.9/site-packages/retrying.py", line 266, in call
    raise attempt.get()
  File "/opt/conda/lib/python3.9/site-packages/retrying.py", line 301, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/opt/conda/lib/python3.9/site-packages/six.py", line 719, in reraise
    raise value
  File "/opt/conda/lib/python3.9/site-packages/retrying.py", line 251, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/opt/conda/lib/python3.9/site-packages/sagemaker_pytorch_serving_container/torchserve.py", line 187, in _retrieve_ts_server_process
    if TS_NAMESPACE in process.cmdline():
  File "/opt/conda/lib/python3.9/site-packages/psutil/__init__.py", line 719, in cmdline
    return self._proc.cmdline()
  File "/opt/conda/lib/python3.9/site-packages/psutil/_pslinux.py", line 1714, in wrapper
    return fun(self, *args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/psutil/_pslinux.py", line 1853, in cmdline
    self._raise_if_zombie()
  File "/opt/conda/lib/python3.9/site-packages/psutil/_pslinux.py", line 1758, in _raise_if_zombie
    raise ZombieProcess(self.pid, self._name, self._ppid)

The base image that we are using is 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:1.13.1-cpu-py39

@sirutBuasai
Contributor

Hi, we are tracking this issue internally. A fix is in progress in aws/sagemaker-pytorch-inference-toolkit#166. Alternatively, a quick workaround when running the DLC manually is to add the --init flag to the docker run command; --init runs a minimal init process as PID 1 inside the container, which reaps zombie child processes.
For example:

docker run --init --name sagemaker_pt_dlc 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference-graviton:2.1.0-cpu-py310-ubuntu20.04-sagemaker serve

@conti748

conti748 commented Jun 1, 2024

Hi @sirutBuasai,

I am running a Batch Transform job with a PyTorch model on 763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-inference:2.2.0-gpu-py310, and I am experiencing the same error.

I saw the new 2.0.24 release of the sagemaker-pytorch-inference-toolkit package and tried installing it in the image via requirements.txt, but I got the same error.

@adrien-code-it

Hi @sirutBuasai,

I'm currently facing this exact issue when trying to deploy a PyTorch model on AWS SageMaker using torch==2.2.0.

I saw that the fix was merged into aws:master two days ago (https://github.com/aws/sagemaker-pytorch-inference-toolkit/pull/166); however, my latest deployment today still fails during torchserve startup with the error: psutil.ZombieProcess: PID still exists but it's a zombie.

When will this fix be available for deploying models?
Regards

@alan1420

alan1420 commented Jun 2, 2024

@conti748 @adrien-code-it I've tried putting git+https://github.com/aws/sagemaker-pytorch-inference-toolkit.git in requirements.txt and it works
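
For anyone trying this, here is a sketch of where that line typically goes, assuming the standard model archive layout that the PyTorch inference toolkit installs requirements from (file names are illustrative):

model.tar.gz
├── model.pth                # serialized model artifacts
└── code/
    ├── inference.py         # custom handler script
    └── requirements.txt     # add: git+https://github.com/aws/sagemaker-pytorch-inference-toolkit.git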

@5agado

5agado commented Jun 4, 2024

Same situation as @conti748. I tried adding it to the model inference requirements as suggested by @alan1420, but it didn't work.
Using 763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-inference:2.2.0-gpu-py310-cu118-ubuntu20.04-sagemaker as the base image.

@angarsky

angarsky commented Jun 4, 2024

I get the same traceback as @dylanhellems; I've compared it by file names and line numbers.

Base image in our case: 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.1.0-cpu-py310-ubuntu20.04-sagemaker

We are experiencing intermittent errors on inference endpoints during cold container starts (scaling). Usually the next several requests to the endpoint succeed, but it's not stable behaviour.

Our Dockerfile is:

FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.1.0-cpu-py310-ubuntu20.04-sagemaker

# Update torch version to resolve the issue with `mmcv` and `import mmdet.apis`.
# Similar issue: https://github.com/open-mmlab/mmdetection/issues/4291
RUN pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cpu

# Install MMDetection framework.
RUN pip install -U openmim && \
  mim install mmengine && \
  mim install mmcv && \
  mim install mmdet

# Install some extra pip packages.
RUN pip install imutils sagemaker flask

# An attempt to fix a permissions issue.
RUN mkdir -p /logs && chmod -R 777 /logs

# NOTE: SageMaker in local mode overrides the SAGEMAKER_* variables.
ENV AWS_DEFAULT_REGION us-east-1

# Use single worker for a serverless mode.
ENV SAGEMAKER_MODEL_SERVER_WORKERS 1

# Cleanup
RUN pip cache purge \
  && rm -rf /tmp/tmp* \
  && rm -rf /root/.cache

EXPOSE 8080 8081
ENTRYPOINT ["python", "/usr/local/bin/dockerd-entrypoint.py"]
CMD ["torchserve", "--start", "--ts-config", "/home/model-server/config.properties", "--model-store", "/home/model-server/"]

@5agado

5agado commented Jun 4, 2024

The recent releases would have solved all the issues if sagemaker-pytorch-inference had been updated to include the fix; instead it is still stuck at 2.0.23 :/
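
For reference, a quick way to confirm which toolkit version an image actually ships (run inside the container); the distribution name sagemaker-pytorch-inference is taken from the comment above and assumed to be the installed package name:

from importlib.metadata import version

# Prints the installed toolkit version, e.g. "2.0.23" on the unpatched images.
print(version("sagemaker-pytorch-inference"))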

@sirutBuasai
Contributor

sirutBuasai commented Jun 5, 2024

Hi, we are in the process of upgrading the toolkit version in the PyTorch Inference DLCs.
Please track the progress for each image here:
PyTorch 2.2 SageMaker Inference DLC: #3984
PyTorch 2.2 SageMaker Graviton Inference DLC: #3985
PyTorch 2.1 SageMaker Inference DLC: #3986
PyTorch 2.1 SageMaker Graviton Inference DLC: #3987
PyTorch 1.13 SageMaker Inference DLC: #3988

Once the PRs are merged, I will post an update when the images are publicly released again.

@conti748

conti748 commented Jun 5, 2024

@5agado @angarsky @adrien-code-it
The only solution I found was to roll back to the image 763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-inference:2.1.0-gpu-py310-cu118-ubuntu20.04-sagemaker-v1.8
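
A minimal sketch of pinning that rollback explicitly when deploying with the SageMaker Python SDK; the bucket, role, entry point, and instance type below are placeholders:

from sagemaker.pytorch import PyTorchModel

model = PyTorchModel(
    model_data="s3://my-bucket/model.tar.gz",               # placeholder
    role="arn:aws:iam::123456789012:role/SageMakerRole",    # placeholder
    entry_point="inference.py",                             # placeholder handler
    image_uri=(
        "763104351884.dkr.ecr.eu-central-1.amazonaws.com/"
        "pytorch-inference:2.1.0-gpu-py310-cu118-ubuntu20.04-sagemaker-v1.8"
    ),
)

predictor = model.deploy(initial_instance_count=1, instance_type="ml.g4dn.xlarge")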

@sirutBuasai
Contributor

Hi all,
Patched images for PT 2.1 and PT 2.2 have been released. See the linked release tags.

PyTorch 2.2 SageMaker Inference DLC: https://github.com/aws/deep-learning-containers/releases/tag/v1.13-pt-sagemaker-2.2.0-inf-py310
PyTorch 2.2 SageMaker Graviton Inference DLC: https://github.com/aws/deep-learning-containers/releases/tag/v1.9-pt-graviton-sagemaker-2.2.1-inf-cpu-py310
PyTorch 2.1 SageMaker Inference DLC: https://github.com/aws/deep-learning-containers/releases/tag/v1.12-pt-sagemaker-2.1.0-inf-py310
PyTorch 2.1 SageMaker Graviton Inference DLC: https://github.com/aws/deep-learning-containers/releases/tag/v1.10-pt-graviton-sagemaker-2.1.0-inf-cpu-py310

PT 1.13 is still WIP; I will update the release status once it is merged and built.

@sirutBuasai
Contributor

PT 1.13 has been released: https://github.com/aws/deep-learning-containers/releases/tag/v1.26-pt-sagemaker-1.13.1-inf-cpu-py39

All images are patched; closing the issue.
