Prebuilt PyTorch image difference #139

Closed
ruijianw opened this issue Nov 8, 2019 · 15 comments
Labels: type: question (Further information is requested)

Comments


ruijianw commented Nov 8, 2019

Hi there,

I am bringing in a PyTorch model that was trained outside of SageMaker.

Here are my steps:

  1. Build my own Docker image on top of one of the prebuilt images (pytorch-training, pytorch-inference, or sagemaker-pytorch (before 1.2.0)).
  2. Implement the customized model_fn, predict_fn, input_fn, and output_fn.
  3. Deploy the model.

Here are my observations:

  1. With sagemaker-pytorch version 1.1.0, CPU, everything works.
  2. With pytorch-inference, version 1.2.0, CPU, the code is not copied to the container; I guess I should follow this documentation? https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html
  3. With pytorch-training, version 1.2.0, CPU, when I try to deploy the model locally, it throws the following error:
Attaching to tmpkyn4_ew2_algo-1-dgrlv_1
algo-1-dgrlv_1  | Traceback (most recent call last):
algo-1-dgrlv_1  |   File "/opt/conda/bin/serve", line 8, in <module>
algo-1-dgrlv_1  |     sys.exit(main())
algo-1-dgrlv_1  |   File "/opt/conda/lib/python3.6/site-packages/sagemaker_containers/cli/serve.py", line 17, in main
algo-1-dgrlv_1  |     server.start(env.ServingEnv().framework_module)
algo-1-dgrlv_1  |   File "/opt/conda/lib/python3.6/site-packages/sagemaker_containers/_server.py", line 75, in start
algo-1-dgrlv_1  |     nginx = subprocess.Popen(['nginx', '-c', nginx_config_file])
algo-1-dgrlv_1  |   File "/opt/conda/lib/python3.6/subprocess.py", line 709, in __init__
algo-1-dgrlv_1  |     restore_signals, start_new_session)
algo-1-dgrlv_1  |   File "/opt/conda/lib/python3.6/subprocess.py", line 1344, in _execute_child
algo-1-dgrlv_1  |     raise child_exception_type(errno_num, err_msg, err_filename)
algo-1-dgrlv_1  | FileNotFoundError: [Errno 2] No such file or directory: 'nginx': 'nginx'
tmpkyn4_ew2_algo-1-dgrlv_1 exited with code 1
Aborting on container exit...

Then it waits for the container to run until it times out.

My questions are:

  1. Any insights into the problem above?
  2. What is the difference between pytorch-training and pytorch-inference?
  3. I checked the Dockerfiles for those 3 images; it seems there are a lot of changes in pytorch-<inference|training> compared to sagemaker-pytorch. If I am not missing something here, it is probably worth revisiting the images for pytorch-<inference|training>?
nadiaya (Contributor) commented Nov 8, 2019

Could you share how you create the training job and then deploy the trained model locally?

Before, we had one container (sagemaker-pytorch) with both training and serving/inference functionality. To reduce the size of the images, we split it into two: pytorch-training and pytorch-inference. The intent is that pytorch-training would only be used for training, and pytorch-inference would be used to deploy the model and run predictions against it.

From the error message you posted, it seems that the problem is caused by using the training image to run inference, though I would need more information about how you are training and hosting the model.
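A minimal sketch of deploying against the inference image instead, assuming the v1 SageMaker Python SDK's PyTorchModel API; the image URI, model data path, and role below are placeholders, not values from this issue:

from sagemaker.pytorch import PyTorchModel

# Placeholder URI: substitute the actual pytorch-inference image for your account/region and version.
INFERENCE_IMAGE = "<account-id>.dkr.ecr.<region>.amazonaws.com/pytorch-inference:1.2.0-cpu-py3"

model = PyTorchModel(entry_point="entrypoint.py",
                     model_data="s3://<bucket>/model.tar.gz",   # placeholder
                     role="<execution-role-arn>",               # placeholder
                     image=INFERENCE_IMAGE)

predictor = model.deploy(initial_instance_count=1, instance_type="local")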

ruijianw (Author) commented Nov 8, 2019

There is no training; the model is pretrained.

Pseudo-code is as follows:

from sagemaker.pytorch import PyTorchModel

# The model is pretrained, so only a PyTorchModel is created; no training job is run.
pytorch_estimator = PyTorchModel(entry_point='entrypoint.py',
                                 model_data=MODEL_PATH,
                                 name=MODEL_NAME,
                                 role=role,
                                 image=CONTAINER_IMAGE)

predictor = pytorch_estimator.deploy(instance_type='local',
                                     initial_instance_count=1)

Please let me know if you want more details

nadiaya (Contributor) commented Nov 8, 2019

What image (CONTAINER_IMAGE) do you use to create PyTorchModel?

ruijianw (Author) commented Nov 8, 2019

This is a customized image built on top of a prebuilt AWS SageMaker image.

For prebuilt images, I tried:

  1. sagemaker-pytorch
  2. pytorch-training
  3. pytorch-inference

Only 1 works; 2 and 3 fail in different ways.

nadiaya (Contributor) commented Nov 8, 2019

2 is expected to fail.
1 and 3 should work.

What error do you get when using the pytorch-inference container?

ruijianw (Author) commented Nov 8, 2019

It cannot find the entrypoint.py file. I checked the Docker container; there is only the /opt/ml/model folder and no code files.

Some more observations:

  1. The logs say an MXNet worker started, which seems odd to me.
  2. The source code was uploaded to S3 successfully according to the log output; there is a source.tar.gz, which I downloaded and verified (see the sketch after this list).
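A minimal sketch of that verification step, assuming placeholder bucket and key names rather than the actual values from this run:

import tarfile
import boto3

# Placeholders: substitute the bucket/key the SDK logged when it uploaded source.tar.gz.
s3 = boto3.client("s3")
s3.download_file("<sagemaker-bucket>", "<prefix>/source.tar.gz", "source.tar.gz")

with tarfile.open("source.tar.gz", "r:gz") as tar:
    print(tar.getnames())  # entrypoint.py should appear in the listing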

nadiaya (Contributor) commented Nov 9, 2019

  1. You see this message because it uses MMS (MXNet Model Server) to serve the predictions.
  2. I can't reproduce the issue. The exact code sample as well as the produced logs would really help.

ruijianw (Author) commented:

I am closing the issue for now since you cannot reproduce it. I will do more experiments.

I may reopen it once I have more info.

ruijianw (Author) commented:

For now, I would like to give it another try. The following is the error message with the pytorch-inference image:

algo-1-pmyh1_1  | 2019-11-11 16:31:06,305 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Traceback (most recent call last):
algo-1-pmyh1_1  | 2019-11-11 16:31:06,305 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.6/site-packages/mms/service.py", line 108, in predict
algo-1-pmyh1_1  | 2019-11-11 16:31:06,305 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     ret = self._entry_point(input_batch, self.context)
algo-1-pmyh1_1  | 2019-11-11 16:31:06,306 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.6/site-packages/sagemaker_inference/default_handler_service.py", line 31, in handle
algo-1-pmyh1_1  | 2019-11-11 16:31:06,306 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     return self._service.transform(data, context)
algo-1-pmyh1_1  | 2019-11-11 16:31:06,306 [INFO ] W-9022-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 3
algo-1-pmyh1_1  | 2019-11-11 16:31:06,306 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.6/site-packages/sagemaker_inference/transformer.py", line 55, in transform
algo-1-pmyh1_1  | 2019-11-11 16:31:06,306 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     self.validate_and_initialize()
algo-1-pmyh1_1  | 2019-11-11 16:31:06,306 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.6/site-packages/sagemaker_inference/transformer.py", line 92, in validate_and_initialize
algo-1-pmyh1_1  | 2019-11-11 16:31:06,306 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     self._validate_user_module_and_set_functions()
algo-1-pmyh1_1  | 2019-11-11 16:31:06,306 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.6/site-packages/sagemaker_inference/transformer.py", line 103, in _validate_user_module_and_set_functions
algo-1-pmyh1_1  | 2019-11-11 16:31:06,307 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     user_module = importlib.import_module(self._environment.module_name)
algo-1-pmyh1_1  | 2019-11-11 16:31:06,307 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.6/importlib/__init__.py", line 126, in import_module
algo-1-pmyh1_1  | 2019-11-11 16:31:06,307 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     return _bootstrap._gcd_import(name[level:], package, level)
algo-1-pmyh1_1  | 2019-11-11 16:31:06,307 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "<frozen importlib._bootstrap>", line 994, in _gcd_import
algo-1-pmyh1_1  | 2019-11-11 16:31:06,307 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "<frozen importlib._bootstrap>", line 971, in _find_and_load
algo-1-pmyh1_1  | 2019-11-11 16:31:06,307 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
algo-1-pmyh1_1  | 2019-11-11 16:31:06,307 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - ModuleNotFoundError: No module named 'handler'
algo-1-pmyh1_1  | 2019-11-11 16:31:06,308 [INFO ] W-9022-model ACCESS_LOG - /172.18.0.1:58992 "POST /invocations HTTP/1.1" 503 8

ruijianw reopened this Nov 11, 2019
nadiaya (Contributor) commented Nov 11, 2019

Thanks!

When do you get this error? On startup or when trying to run predictions?

ruijianw (Author) commented:

When trying to run predictions. The container started successfully; please refer to the following logs from spinning up the container:

algo-1-pmyh1_1  | 2019-11-11 16:30:48,040 [INFO ] W-9031-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Python runtime: 3.6.6
algo-1-pmyh1_1  | 2019-11-11 16:30:48,056 [INFO ] main com.amazonaws.ml.mms.ModelServer - Inference API bind to: http://0.0.0.0:8080
algo-1-pmyh1_1  | 2019-11-11 16:30:48,056 [INFO ] main com.amazonaws.ml.mms.ModelServer - Initialize Management server with: EpollServerSocketChannel.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,059 [INFO ] main com.amazonaws.ml.mms.ModelServer - Management API bind to: http://127.0.0.1:8081
algo-1-pmyh1_1  | Model server started.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,061 [INFO ] W-9030-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9030.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,061 [INFO ] W-9015-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9015.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,061 [INFO ] W-9021-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9021.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,061 [INFO ] W-9029-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9029.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,061 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9000.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,061 [INFO ] W-9012-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9012.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,062 [INFO ] W-9024-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9024.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,062 [INFO ] W-9003-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9003.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,061 [INFO ] W-9008-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9008.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,061 [INFO ] W-9016-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9016.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,061 [INFO ] W-9020-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9020.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,061 [INFO ] W-9017-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9017.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,061 [INFO ] W-9027-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9027.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,063 [INFO ] W-9031-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9031.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,063 [INFO ] W-9011-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9011.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,061 [INFO ] W-9013-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9013.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,064 [INFO ] W-9005-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9005.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,062 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9022.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,061 [INFO ] W-9007-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9007.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,064 [INFO ] W-9023-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9023.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,061 [INFO ] W-9002-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9002.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,064 [INFO ] W-9018-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9018.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,064 [INFO ] W-9009-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9009.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,064 [INFO ] W-9014-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9014.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,065 [INFO ] W-9025-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9025.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,065 [INFO ] W-9004-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9004.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,064 [INFO ] W-9001-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9001.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,065 [INFO ] W-9006-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9006.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,065 [INFO ] W-9019-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9019.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,065 [INFO ] W-9010-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9010.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,065 [INFO ] W-9026-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9026.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,065 [INFO ] W-9028-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9028.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,564 [INFO ] W-9022-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 462
algo-1-pmyh1_1  | 2019-11-11 16:30:48,564 [INFO ] W-9029-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 463
algo-1-pmyh1_1  | 2019-11-11 16:30:48,565 [INFO ] W-9030-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 460
algo-1-pmyh1_1  | 2019-11-11 16:30:48,576 [INFO ] W-9007-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 475
algo-1-pmyh1_1  | 2019-11-11 16:30:48,576 [INFO ] W-9008-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 455
algo-1-pmyh1_1  | 2019-11-11 16:30:48,577 [INFO ] W-9024-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 476
algo-1-pmyh1_1  | 2019-11-11 16:30:48,580 [INFO ] W-9027-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 471
algo-1-pmyh1_1  | 2019-11-11 16:30:48,583 [INFO ] W-9004-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 478
algo-1-pmyh1_1  | 2019-11-11 16:30:48,585 [INFO ] W-9006-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 483
algo-1-pmyh1_1  | 2019-11-11 16:30:48,586 [INFO ] W-9026-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 485
algo-1-pmyh1_1  | 2019-11-11 16:30:48,586 [INFO ] W-9031-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 485
algo-1-pmyh1_1  | 2019-11-11 16:30:48,599 [INFO ] W-9005-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 494
algo-1-pmyh1_1  | 2019-11-11 16:30:48,605 [INFO ] W-9023-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 504
algo-1-pmyh1_1  | 2019-11-11 16:30:48,610 [INFO ] W-9002-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 501
algo-1-pmyh1_1  | 2019-11-11 16:30:48,611 [INFO ] W-9019-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 494
algo-1-pmyh1_1  | 2019-11-11 16:30:48,615 [INFO ] W-9014-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 514
algo-1-pmyh1_1  | 2019-11-11 16:30:48,617 [INFO ] W-9000-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 516
algo-1-pmyh1_1  | 2019-11-11 16:30:48,618 [INFO ] W-9017-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 520
algo-1-pmyh1_1  | 2019-11-11 16:30:48,624 [INFO ] W-9012-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 523
algo-1-pmyh1_1  | 2019-11-11 16:30:48,624 [INFO ] W-9020-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 519
algo-1-pmyh1_1  | 2019-11-11 16:30:48,625 [INFO ] W-9015-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 520
algo-1-pmyh1_1  | 2019-11-11 16:30:48,631 [INFO ] W-9011-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 522
algo-1-pmyh1_1  | 2019-11-11 16:30:48,633 [INFO ] W-9001-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 532
algo-1-pmyh1_1  | 2019-11-11 16:30:48,636 [INFO ] W-9003-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 535
algo-1-pmyh1_1  | 2019-11-11 16:30:48,643 [INFO ] W-9025-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 542
algo-1-pmyh1_1  | 2019-11-11 16:30:48,645 [INFO ] W-9009-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 543
algo-1-pmyh1_1  | 2019-11-11 16:30:48,650 [INFO ] W-9018-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 532
algo-1-pmyh1_1  | 2019-11-11 16:30:48,664 [INFO ] W-9028-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 541
algo-1-pmyh1_1  | 2019-11-11 16:30:48,666 [INFO ] W-9013-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 562
algo-1-pmyh1_1  | 2019-11-11 16:30:48,671 [INFO ] W-9021-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 570
algo-1-pmyh1_1  | 2019-11-11 16:30:48,673 [INFO ] W-9016-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 576
algo-1-pmyh1_1  | 2019-11-11 16:30:48,676 [INFO ] W-9010-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 579
INFO:sagemaker.local.entities:Checking if serving container is up, attempt: 10
algo-1-pmyh1_1  | 2019-11-11 16:30:49,982 [INFO ] pool-1-thread-33 ACCESS_LOG - /172.18.0.1:58984 "GET /ping HTTP/1.1" 200 11


stale bot commented Nov 18, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Nov 18, 2019
laurenyu added the type: question label and removed the stale label Nov 19, 2019

stale bot commented Nov 26, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Nov 26, 2019
ChoiByungWook (Contributor) commented Dec 3, 2019

> For now, I would like to give it another try. The following is the error message with the pytorch-inference image: (full error log quoted from the comment above, ending in ModuleNotFoundError: No module named 'handler')

Apologies for the late response.

That specific error happens when attempting to import your entrypoint.py as shown here: https://github.com/aws/sagemaker-inference-toolkit/blob/master/src/sagemaker_inference/transformer.py#L143

The entrypoint.py is expected to be in a specific directory, which gets added to the Python path: https://github.com/aws/sagemaker-inference-toolkit/blob/master/src/sagemaker_inference/model_server.py#L103

The specific directory itself is defined by: https://github.com/aws/sagemaker-inference-toolkit/blob/master/src/sagemaker_inference/environment.py#L32

The entrypoint.py should be placed in that specific directory by the Python SDK depending on the framework version specified as shown here: https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/pytorch/model.py#L148

Looking at how you are starting the inference jobs, it looks like the framework_version is being omitted, which may cause the conditional not to place the entrypoint.py into the specified directory.

I apologize for the experience, as this is not ideal. However, is there any chance you can retry your job after specifying a framework version higher than 1.2?
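A minimal sketch of that retry, assuming the same MODEL_PATH, MODEL_NAME, and role placeholders as the earlier snippet; the framework version shown is only an example value:

from sagemaker.pytorch import PyTorchModel

# framework_version is the key change: with it set, the SDK should repack the model
# artifact so entrypoint.py ends up under the code directory the inference toolkit
# imports from (/opt/ml/model/code). '1.3.1' is illustrative; any version the SDK
# treats as >= 1.2 should do. image= is omitted so the SDK picks its default
# pytorch-inference image; pass image=... to keep using a custom one.
pytorch_model = PyTorchModel(entry_point='entrypoint.py',
                             model_data=MODEL_PATH,
                             name=MODEL_NAME,
                             role=role,
                             framework_version='1.3.1')

predictor = pytorch_model.deploy(instance_type='local',
                                 initial_instance_count=1)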

Thanks!

nadiaya (Contributor) commented Jun 9, 2020

Closing due to inactivity.

nadiaya closed this as completed Jun 9, 2020