
Ignore zombie processes when detecting TorchServe status #166

Merged
6 commits merged into aws:master on May 31, 2024

Conversation

namannandan
Contributor

@namannandan namannandan commented May 29, 2024

Description of changes:
When checking to see if the TorchServe process is running, we iterate through the current list of running processes using psutil:

def _retrieve_ts_server_process():
    ts_server_processes = list()
    for process in psutil.process_iter():
        if TS_NAMESPACE in process.cmdline():
            ts_server_processes.append(process)

Calling the cmdline() psutil API on a zombie process raises the psutil.ZombieProcess exception. This unhandled exception causes TorchServe to be stopped, which is not the expected behavior in DLC: https://github.com/aws/deep-learning-containers/tree/master/pytorch/inference

  File "/opt/conda/lib/python3.10/site-packages/sagemaker_pytorch_serving_container/torchserve.py", line 187, in _retrieve_ts_server_process
    if TS_NAMESPACE in process.cmdline():
  File "/opt/conda/lib/python3.10/site-packages/psutil/__init__.py", line 719, in cmdline
    raise ImportError(msg)
  File "/opt/conda/lib/python3.10/site-packages/psutil/_pslinux.py", line 1714, in wrapper
    ret['name'] = name
  File "/opt/conda/lib/python3.10/site-packages/psutil/_pslinux.py", line 1853, in cmdline
    stime = float(values['stime']) / CLOCK_TICKS
  File "/opt/conda/lib/python3.10/site-packages/psutil/_pslinux.py", line 1758, in _raise_if_zombie
    name = decode(name)
psutil.ZombieProcess: PID still exists but it's a zombie (pid=9)

We can ignore zombie processes when detecting the presence of a running TorchServe process. Reference: https://psutil.readthedocs.io/en/latest/#psutil.ZombieProcess
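As a minimal sketch (illustrative only, not necessarily the exact change merged in this PR), the loop above could skip zombies by catching the exception that psutil raises:

def _retrieve_ts_server_process():
    ts_server_processes = list()
    for process in psutil.process_iter():
        try:
            if TS_NAMESPACE in process.cmdline():
                ts_server_processes.append(process)
        except psutil.ZombieProcess:
            # cmdline() raises ZombieProcess for defunct processes; skip them
            continue
    return ts_server_processes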

Tests:

  • CI
  • Manual testing
    • Without the fix in this PR
$ docker run --name sagemaker_pt_dlc  -p 8080:8080 --mount type=bind,source=/home/ubuntu/dlc_test/dump,target=/opt/ml/model -v /home/ubuntu/dlc_test/dump:/hostfs -e SAGEMAKER_NGINX_PROXY_READ_TIMEOUT_SECONDS=600 -e SAGEMAKER_MODEL_SERVER_WORKERS=1 -e OMP_NUM_THREADS=1 -e DNNL_VERBOSE=1 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:2.1.0-cpu-py310-ubuntu20.04-sagemaker serve
.....
.....
2024-05-31T17:49:14,665 [INFO ] W-9000-model_1.0 TS_METRICS - WorkerLoadTime.Milliseconds:2257.0|#WorkerName:W-9000-model_1.0,Level:Host|#hostname:c988cd31b0a0,timestamp:1717177754
2024-05-31T17:49:14,665 [INFO ] W-9000-model_1.0 TS_METRICS - WorkerThreadTime.Milliseconds:3.0|#Level:Host|#hostname:c988cd31b0a0,timestamp:1717177754
['torchserve', '--start', '--model-store', '/.sagemaker/ts/models', '--ts-config', '/etc/sagemaker-ts.properties', '--log-config', '/opt/conda/lib/python3.10/site-packages/sagemaker_pytorch_serving_container/etc/log4j2.xml', '--models', 'model=/opt/ml/model']
Traceback (most recent call last):
  File "/usr/local/bin/dockerd-entrypoint.py", line 23, in <module>
    serving.main()
  File "/opt/conda/lib/python3.10/site-packages/sagemaker_pytorch_serving_container/serving.py", line 38, in main
    _start_torchserve()
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 56, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 257, in call
    return attempt.get(self._wrap_exception)
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 301, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/opt/conda/lib/python3.10/site-packages/six.py", line 719, in reraise
    raise value
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 251, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/opt/conda/lib/python3.10/site-packages/sagemaker_pytorch_serving_container/serving.py", line 34, in _start_torchserve
    torchserve.start_torchserve(handler_service=HANDLER_SERVICE)
  File "/opt/conda/lib/python3.10/site-packages/sagemaker_pytorch_serving_container/torchserve.py", line 102, in start_torchserve
    ts_process = _retrieve_ts_server_process()
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 56, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 266, in call
    raise attempt.get()
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 301, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/opt/conda/lib/python3.10/site-packages/six.py", line 719, in reraise
    raise value
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 251, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/opt/conda/lib/python3.10/site-packages/sagemaker_pytorch_serving_container/torchserve.py", line 187, in _retrieve_ts_server_process
    if TS_NAMESPACE in process.cmdline():
  File "/opt/conda/lib/python3.10/site-packages/psutil/__init__.py", line 719, in cmdline
    return self._proc.cmdline()
  File "/opt/conda/lib/python3.10/site-packages/psutil/_pslinux.py", line 1714, in wrapper
    return fun(self, *args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/psutil/_pslinux.py", line 1853, in cmdline
    self._raise_if_zombie()
  File "/opt/conda/lib/python3.10/site-packages/psutil/_pslinux.py", line 1758, in _raise_if_zombie
    raise ZombieProcess(self.pid, self._name, self._ppid)
psutil.ZombieProcess: PID still exists but it's a zombie (pid=10)
    • With the fix in this PR
$ docker run -it 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:2.1.0-cpu-py310-ubuntu20.04-sagemaker /bin/bash
$ git clone https://github.com/namannandan/sagemaker-pytorch-inference-toolkit.git
$ cd sagemaker-pytorch-inference-toolkit
$ git checkout psutil-zombie-fix
$ pip install .
.....
$ docker container commit ca81ce0ac2c7 test:psutil-fix
$ docker run --name sagemaker_pt_dlc  -p 8080:8080 --mount type=bind,source=/home/ubuntu/dlc_test/dump,target=/opt/ml/model -v /home/ubuntu/dlc_test/dump:/hostfs -e SAGEMAKER_NGINX_PROXY_READ_TIMEOUT_SECONDS=600 -e SAGEMAKER_MODEL_SERVER_WORKERS=1 -e OMP_NUM_THREADS=1 -e DNNL_VERBOSE=1 test:psutil-fix serve
.....
.....
2024-05-31T18:00:40,498 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Torch worker started.
2024-05-31T18:00:40,499 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Python runtime: 3.10.9
2024-05-31T18:00:40,502 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000
2024-05-31T18:00:40,507 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2024-05-31T18:00:40,511 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Looping backend response at: 1717178440511
2024-05-31T18:00:40,537 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - model_name: model, batchSize: 1
2024-05-31T18:00:41,286 [WARN ] W-9000-model_1.0-stderr MODEL_LOG - /opt/conda/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
2024-05-31T18:00:41,287 [WARN ] W-9000-model_1.0-stderr MODEL_LOG -   return self.fget.__get__(instance, owner)()
2024-05-31T18:00:41,456 [WARN ] W-9000-model_1.0-stderr MODEL_LOG - /opt/conda/lib/python3.10/site-packages/transformers/pipelines/text_classification.py:104: UserWarning: `return_all_scores` is now deprecated,  if want a similar functionality use `top_k=None` instead of `return_all_scores=True` or `top_k=1` instead of `return_all_scores=False`.
2024-05-31T18:00:41,456 [WARN ] W-9000-model_1.0-stderr MODEL_LOG -   warnings.warn(
2024-05-31T18:00:41,460 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 949
2024-05-31T18:00:41,461 [INFO ] W-9000-model_1.0 TS_METRICS - WorkerLoadTime.Milliseconds:2217.0|#WorkerName:W-9000-model_1.0,Level:Host|#hostname:09c62ecb1844,timestamp:1717178441
2024-05-31T18:00:41,461 [INFO ] W-9000-model_1.0 TS_METRICS - WorkerThreadTime.Milliseconds:3.0|#Level:Host|#hostname:09c62ecb1844,timestamp:1717178441
2024-05-31T18:01:39,547 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:09c62ecb1844,timestamp:1717178499
2024-05-31T18:01:39,548 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:456.5026435852051|#Level:Host|#hostname:09c62ecb1844,timestamp:1717178499
2024-05-31T18:01:39,548 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:39.51191711425781|#Level:Host|#hostname:09c62ecb1844,timestamp:1717178499
2024-05-31T18:01:39,548 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:8.0|#Level:Host|#hostname:09c62ecb1844,timestamp:1717178499
2024-05-31T18:01:39,549 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:61538.0625|#Level:Host|#hostname:09c62ecb1844,timestamp:1717178499
2024-05-31T18:01:39,549 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:1350.4765625|#Level:Host|#hostname:09c62ecb1844,timestamp:1717178499
2024-05-31T18:01:39,549 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:3.3|#Level:Host|#hostname:09c62ecb1844,timestamp:1717178499
.....
.....

TorchServe continues to run and the container is not terminated.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@namannandan namannandan requested review from nskool and lxning May 29, 2024 23:18
@visinfo

visinfo commented May 30, 2024

@namannandan should we just check the process status rather than swallowing the exception?
See reference:
https://github.com/aws/sagemaker-inference-toolkit/blob/master/src/sagemaker_inference/model_server.py#L276-L277
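For illustration, a status-based check along those lines (a sketch modeled on the referenced code; the exact condition in model_server.py may differ) could look like:

for process in psutil.process_iter():
    # skip defunct processes instead of letting cmdline() raise ZombieProcess
    if process.status() != psutil.STATUS_ZOMBIE and TS_NAMESPACE in process.cmdline():
        ts_server_processes.append(process)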

@namannandan namannandan changed the title Ignore processes that are not running when detecting TorchServe status Ignore zombie processes when detecting TorchServe status May 30, 2024
@namannandan
Contributor Author

@namannandan should we just check the process status rather than swallowing the exception? See reference: https://github.com/aws/sagemaker-inference-toolkit/blob/master/src/sagemaker_inference/model_server.py#L276-L277

Thanks @visinfo that makes sense, updated the PR.

@namannandan namannandan requested a review from chen3933 May 31, 2024 15:42
@namannandan namannandan merged commit 9a24052 into aws:master May 31, 2024
1 check passed
@adrien-code-it

I'm currently facing this exact issue when trying to deploy a PyTorch model in AWS SageMaker using torch==2.2.0.
I saw here that the fix was merged into aws:master two days ago; however, my latest deployment still fails during TorchServe startup with the error: psutil.ZombieProcess: PID still exists but it's a zombie.

When will this fix be available for deploying models?
Regards

@5agado

5agado commented Jun 4, 2024

Like @adrien-code-it, I also tried with a new model on 763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-inference:2.1.0-gpu-py310-cu118-ubuntu20.04-sagemaker as well as 763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-inference:2.2.0-gpu-py310-cu118-ubuntu20.04-sagemaker, and I still get the error.

@namannandan, @visinfo is there something we need to do to deploy using the update? Or when will it be distributed to all instances?

@adrien-code-it

adrien-code-it commented Jun 4, 2024

@5agado I was able to deploy my model by adding a requirements.txt file alongside my inference.py file and specifying that pip should install the latest sagemaker-pytorch-inference-toolkit:
git+https://github.com/aws/sagemaker-pytorch-inference-toolkit.git

Although it's not a permanent solution (I would prefer pulling a fixed version, not the latest), it's working as of now.
Moreover, when deploying, it still fails once, but then it succeeds the second time.
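For reference, pip's VCS requirement syntax also supports pinning to a specific commit or tag, which may help until a released version includes the fix (the ref below is a placeholder, not an actual release):

git+https://github.com/aws/sagemaker-pytorch-inference-toolkit.git@<commit-or-tag>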

@5agado

5agado commented Jun 4, 2024

@adrien-code-it are you deploying the model as an endpoint, or using it in batch transform?
I tried the same with the latter, but it doesn't work for me (I think it's related to the "succeeds the second time" aspect you mention there).

@adrien-code-it

@5agado the fix in requirements.txt seems to only work when deploying the model as an endpoint (for inference, in my case) :(

For batch transform, unfortunately, I didn't find any fix that works...
Maybe @namannandan has a solution ?
