
Ignore zombie processes when detecting TorchServe status #166

Merged
6 commits merged into aws:master on May 31, 2024

Conversation

namannandan
Contributor

@namannandan namannandan commented May 29, 2024

Description of changes:
When checking to see if the TorchServe process is running, we iterate through the current list of running processes using psutil:

def _retrieve_ts_server_process():
    ts_server_processes = list()
    for process in psutil.process_iter():
        if TS_NAMESPACE in process.cmdline():
            ts_server_processes.append(process)

Calling the cmdline() psutil API on a zombie process raises the psutil.ZombieProcess exception. This unhandled exception causes TorchServe to be stopped, which is not the expected behavior in DLC: https://github.com/aws/deep-learning-containers/tree/master/pytorch/inference

  File "/opt/conda/lib/python3.10/site-packages/sagemaker_pytorch_serving_container/torchserve.py", line 187, in _retrieve_ts_server_process
    if TS_NAMESPACE in process.cmdline():
  File "/opt/conda/lib/python3.10/site-packages/psutil/__init__.py", line 719, in cmdline
    raise ImportError(msg)
  File "/opt/conda/lib/python3.10/site-packages/psutil/_pslinux.py", line 1714, in wrapper
    ret['name'] = name
  File "/opt/conda/lib/python3.10/site-packages/psutil/_pslinux.py", line 1853, in cmdline
    stime = float(values['stime']) / CLOCK_TICKS
  File "/opt/conda/lib/python3.10/site-packages/psutil/_pslinux.py", line 1758, in _raise_if_zombie
    name = decode(name)
psutil.ZombieProcess: PID still exists but it's a zombie (pid=9)

We can ignore zombie processes when detecting the presence of a running TorchServe process. Reference: https://psutil.readthedocs.io/en/latest/#psutil.ZombieProcess
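As a minimal sketch (illustrative only, not necessarily the exact change merged in this PR), the loop above could skip zombies by catching the exception that psutil raises:

def _retrieve_ts_server_process():
    ts_server_processes = list()
    for process in psutil.process_iter():
        try:
            if TS_NAMESPACE in process.cmdline():
                ts_server_processes.append(process)
        except psutil.ZombieProcess:
            # cmdline() raises ZombieProcess for defunct processes; skip them
            continue
    return ts_server_processes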

Tests:

  • CI
  • Manual testing
    • Without the fix in this PR
$ docker run --name sagemaker_pt_dlc  -p 8080:8080 --mount type=bind,source=/home/ubuntu/dlc_test/dump,target=/opt/ml/model -v /home/ubuntu/dlc_test/dump:/hostfs -e SAGEMAKER_NGINX_PROXY_READ_TIMEOUT_SECONDS=600 -e SAGEMAKER_MODEL_SERVER_WORKERS=1 -e OMP_NUM_THREADS=1 -e DNNL_VERBOSE=1 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:2.1.0-cpu-py310-ubuntu20.04-sagemaker serve
.....
.....
2024-05-31T17:49:14,665 [INFO ] W-9000-model_1.0 TS_METRICS - WorkerLoadTime.Milliseconds:2257.0|#WorkerName:W-9000-model_1.0,Level:Host|#hostname:c988cd31b0a0,timestamp:1717177754
2024-05-31T17:49:14,665 [INFO ] W-9000-model_1.0 TS_METRICS - WorkerThreadTime.Milliseconds:3.0|#Level:Host|#hostname:c988cd31b0a0,timestamp:1717177754
['torchserve', '--start', '--model-store', '/.sagemaker/ts/models', '--ts-config', '/etc/sagemaker-ts.properties', '--log-config', '/opt/conda/lib/python3.10/site-packages/sagemaker_pytorch_serving_container/etc/log4j2.xml', '--models', 'model=/opt/ml/model']
Traceback (most recent call last):
  File "/usr/local/bin/dockerd-entrypoint.py", line 23, in <module>
    serving.main()
  File "/opt/conda/lib/python3.10/site-packages/sagemaker_pytorch_serving_container/serving.py", line 38, in main
    _start_torchserve()
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 56, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 257, in call
    return attempt.get(self._wrap_exception)
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 301, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/opt/conda/lib/python3.10/site-packages/six.py", line 719, in reraise
    raise value
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 251, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/opt/conda/lib/python3.10/site-packages/sagemaker_pytorch_serving_container/serving.py", line 34, in _start_torchserve
    torchserve.start_torchserve(handler_service=HANDLER_SERVICE)
  File "/opt/conda/lib/python3.10/site-packages/sagemaker_pytorch_serving_container/torchserve.py", line 102, in start_torchserve
    ts_process = _retrieve_ts_server_process()
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 56, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 266, in call
    raise attempt.get()
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 301, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/opt/conda/lib/python3.10/site-packages/six.py", line 719, in reraise
    raise value
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 251, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/opt/conda/lib/python3.10/site-packages/sagemaker_pytorch_serving_container/torchserve.py", line 187, in _retrieve_ts_server_process
    if TS_NAMESPACE in process.cmdline():
  File "/opt/conda/lib/python3.10/site-packages/psutil/__init__.py", line 719, in cmdline
    return self._proc.cmdline()
  File "/opt/conda/lib/python3.10/site-packages/psutil/_pslinux.py", line 1714, in wrapper
    return fun(self, *args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/psutil/_pslinux.py", line 1853, in cmdline
    self._raise_if_zombie()
  File "/opt/conda/lib/python3.10/site-packages/psutil/_pslinux.py", line 1758, in _raise_if_zombie
    raise ZombieProcess(self.pid, self._name, self._ppid)
psutil.ZombieProcess: PID still exists but it's a zombie (pid=10)
    • With the fix in this PR
$ docker run -it 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:2.1.0-cpu-py310-ubuntu20.04-sagemaker /bin/bash
$ git clone https://github.com/namannandan/sagemaker-pytorch-inference-toolkit.git
$ cd sagemaker-pytorch-inference-toolkit
$ git checkout psutil-zombie-fix
$ pip install .
.....
$ docker container commit ca81ce0ac2c7 test:psutil-fix
$ docker run --name sagemaker_pt_dlc  -p 8080:8080 --mount type=bind,source=/home/ubuntu/dlc_test/dump,target=/opt/ml/model -v /home/ubuntu/dlc_test/dump:/hostfs -e SAGEMAKER_NGINX_PROXY_READ_TIMEOUT_SECONDS=600 -e SAGEMAKER_MODEL_SERVER_WORKERS=1 -e OMP_NUM_THREADS=1 -e DNNL_VERBOSE=1 test:psutil-fix serve
.....
.....
2024-05-31T18:00:40,498 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Torch worker started.
2024-05-31T18:00:40,499 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Python runtime: 3.10.9
2024-05-31T18:00:40,502 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000
2024-05-31T18:00:40,507 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2024-05-31T18:00:40,511 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Looping backend response at: 1717178440511
2024-05-31T18:00:40,537 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - model_name: model, batchSize: 1
2024-05-31T18:00:41,286 [WARN ] W-9000-model_1.0-stderr MODEL_LOG - /opt/conda/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
2024-05-31T18:00:41,287 [WARN ] W-9000-model_1.0-stderr MODEL_LOG -   return self.fget.__get__(instance, owner)()
2024-05-31T18:00:41,456 [WARN ] W-9000-model_1.0-stderr MODEL_LOG - /opt/conda/lib/python3.10/site-packages/transformers/pipelines/text_classification.py:104: UserWarning: `return_all_scores` is now deprecated,  if want a similar functionality use `top_k=None` instead of `return_all_scores=True` or `top_k=1` instead of `return_all_scores=False`.
2024-05-31T18:00:41,456 [WARN ] W-9000-model_1.0-stderr MODEL_LOG -   warnings.warn(
2024-05-31T18:00:41,460 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 949
2024-05-31T18:00:41,461 [INFO ] W-9000-model_1.0 TS_METRICS - WorkerLoadTime.Milliseconds:2217.0|#WorkerName:W-9000-model_1.0,Level:Host|#hostname:09c62ecb1844,timestamp:1717178441
2024-05-31T18:00:41,461 [INFO ] W-9000-model_1.0 TS_METRICS - WorkerThreadTime.Milliseconds:3.0|#Level:Host|#hostname:09c62ecb1844,timestamp:1717178441
2024-05-31T18:01:39,547 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:09c62ecb1844,timestamp:1717178499
2024-05-31T18:01:39,548 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:456.5026435852051|#Level:Host|#hostname:09c62ecb1844,timestamp:1717178499
2024-05-31T18:01:39,548 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:39.51191711425781|#Level:Host|#hostname:09c62ecb1844,timestamp:1717178499
2024-05-31T18:01:39,548 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:8.0|#Level:Host|#hostname:09c62ecb1844,timestamp:1717178499
2024-05-31T18:01:39,549 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:61538.0625|#Level:Host|#hostname:09c62ecb1844,timestamp:1717178499
2024-05-31T18:01:39,549 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:1350.4765625|#Level:Host|#hostname:09c62ecb1844,timestamp:1717178499
2024-05-31T18:01:39,549 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:3.3|#Level:Host|#hostname:09c62ecb1844,timestamp:1717178499
.....
.....

TorchServe continues to run and the container is not terminated.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@namannandan namannandan requested review from nskool and lxning May 29, 2024 23:18
@visinfo

visinfo commented May 30, 2024

@namannandan should we just check the process status rather than swallowing the exception?
See reference:
https://github.com/aws/sagemaker-inference-toolkit/blob/master/src/sagemaker_inference/model_server.py#L276-L277
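For illustration, a status-based check along those lines (a sketch modeled on the referenced code; the exact condition in model_server.py may differ) could look like:

for process in psutil.process_iter():
    # skip defunct processes instead of letting cmdline() raise ZombieProcess
    if process.status() != psutil.STATUS_ZOMBIE and TS_NAMESPACE in process.cmdline():
        ts_server_processes.append(process)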

@namannandan namannandan changed the title Ignore processes that are not running when detecting TorchServe status Ignore zombie processes when detecting TorchServe status May 30, 2024
@namannandan
Contributor Author

@namannandan should we just check the process status rather than swallowing the exception? See reference: https://github.com/aws/sagemaker-inference-toolkit/blob/master/src/sagemaker_inference/model_server.py#L276-L277

Thanks @visinfo that makes sense, updated the PR.

@namannandan namannandan requested a review from chen3933 May 31, 2024 15:42
@namannandan namannandan merged commit 9a24052 into aws:master May 31, 2024
1 check passed
@adrien-code-it

I'm currently facing this exact issue when trying to deploy a PyTorch model in AWS SageMaker using torch==2.2.0.
I saw here that the fix was merged into aws:master two days ago; however, my latest deployment still fails during TorchServe startup with the error: psutil.ZombieProcess: PID still exists but it's a zombie.

When will this fix be available for deploying models?
Regards

@5agado

5agado commented Jun 4, 2024

Like @adrien-code-it, I also tried with a new model on 763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-inference:2.1.0-gpu-py310-cu118-ubuntu20.04-sagemaker as well as 763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-inference:2.2.0-gpu-py310-cu118-ubuntu20.04-sagemaker, and I still get the error.

@namannandan, @visinfo is there something we need to do to deploy using the update? Or when will it be distributed to all instances?

@adrien-code-it

adrien-code-it commented Jun 4, 2024

@5agado I was able to deploy my model by adding a requirements.txt file alongside my inference.py file and specifying that pip should install the latest sagemaker-pytorch-inference-toolkit:
git+https://github.com/aws/sagemaker-pytorch-inference-toolkit.git

Although it's not a permanent solution (I would prefer pulling a fixed version, not the latest), it's working as of now.
Moreover, when deploying, it still fails once, but then it succeeds the second time.
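For reference, pip's VCS requirement syntax also supports pinning to a specific commit or tag, which may help until a released version includes the fix (the ref below is a placeholder, not an actual release):

git+https://github.com/aws/sagemaker-pytorch-inference-toolkit.git@<commit-or-tag>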

@5agado

5agado commented Jun 4, 2024

@adrien-code-it are you deploying the model as an endpoint, or using it in batch transform?
I tried the same with the latter, but it doesn't work for me (I think it's related to the "succeeds the second time" aspect you mention there).

@adrien-code-it

@5agado the fix in requirements.txt seems to only work when deploying the model as an endpoint (for inference, in my case) :(

For batch transform, unfortunately, I didn't find any fix that works...
Maybe @namannandan has a solution ?
