Prebuilt PyTorch image difference #139

Closed
ruijianw opened this issue Nov 8, 2019 · 15 comments
Labels: type: question (Further information is requested)

Comments


ruijianw commented Nov 8, 2019

Hi there,

I am bringing in a PyTorch model that was trained outside of SageMaker.

Here are my steps:

  1. Build my own Docker image on top of one of the prebuilt images (pytorch-training, pytorch-inference, or sagemaker-pytorch (before 1.2.0)).
  2. Implement the customized model_fn, predict_fn, input_fn, and output_fn.
  3. Deploy the model.

Here are my observations:

  1. With sagemaker-pytorch version 1.1.0, CPU, everything works.
  2. With pytorch-inference, version 1.2.0, CPU, the code is not copied to the container; I guess I should follow this documentation? https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html
  3. With pytorch-training, version 1.2.0, CPU, when I try to deploy the model locally, it throws the following error:
Attaching to tmpkyn4_ew2_algo-1-dgrlv_1
algo-1-dgrlv_1  | Traceback (most recent call last):
algo-1-dgrlv_1  |   File "/opt/conda/bin/serve", line 8, in <module>
algo-1-dgrlv_1  |     sys.exit(main())
algo-1-dgrlv_1  |   File "/opt/conda/lib/python3.6/site-packages/sagemaker_containers/cli/serve.py", line 17, in main
algo-1-dgrlv_1  |     server.start(env.ServingEnv().framework_module)
algo-1-dgrlv_1  |   File "/opt/conda/lib/python3.6/site-packages/sagemaker_containers/_server.py", line 75, in start
algo-1-dgrlv_1  |     nginx = subprocess.Popen(['nginx', '-c', nginx_config_file])
algo-1-dgrlv_1  |   File "/opt/conda/lib/python3.6/subprocess.py", line 709, in __init__
algo-1-dgrlv_1  |     restore_signals, start_new_session)
algo-1-dgrlv_1  |   File "/opt/conda/lib/python3.6/subprocess.py", line 1344, in _execute_child
algo-1-dgrlv_1  |     raise child_exception_type(errno_num, err_msg, err_filename)
algo-1-dgrlv_1  | FileNotFoundError: [Errno 2] No such file or directory: 'nginx': 'nginx'
tmpkyn4_ew2_algo-1-dgrlv_1 exited with code 1
Aborting on container exit...

Then it waits for the container to run until it times out.

My questions are:

  1. Any insights into the problem above?
  2. What is the difference between pytorch-training and pytorch-inference?
  3. I checked the Dockerfiles for those 3 images; it seems there are a lot of changes in pytorch-<inference|training> compared to sagemaker-pytorch. If I am not missing something here, it is probably worth revisiting the images for pytorch-<inference|training>?
nadiaya (Contributor) commented Nov 8, 2019

Could you share how you create the training job and then deploy the trained model locally?

Before, we had one container (sagemaker-pytorch) with both training and serving/inference functionality. To reduce the size of the images, we split it into two: pytorch-training and pytorch-inference. The intent is that pytorch-training would only be used for training, and pytorch-inference would be used to deploy the model and run predictions against it.

From the error message you posted, it seems that the problem is caused by using the training image to run inference, though I would need more information about how you are training and hosting the model.
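A minimal sketch of deploying against the inference image instead, assuming the v1 SageMaker Python SDK's PyTorchModel API; the image URI, model data path, and role below are placeholders, not values from this issue:

from sagemaker.pytorch import PyTorchModel

# Placeholder URI: substitute the actual pytorch-inference image for your account/region and version.
INFERENCE_IMAGE = "<account-id>.dkr.ecr.<region>.amazonaws.com/pytorch-inference:1.2.0-cpu-py3"

model = PyTorchModel(entry_point="entrypoint.py",
                     model_data="s3://<bucket>/model.tar.gz",   # placeholder
                     role="<execution-role-arn>",               # placeholder
                     image=INFERENCE_IMAGE)

predictor = model.deploy(initial_instance_count=1, instance_type="local")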

ruijianw (Author) commented Nov 8, 2019

There is no training; the model is pretrained.

Pseudo-code is as follows:

from sagemaker.pytorch import PyTorchModel

# The model is pretrained, so only a PyTorchModel is created; no training job is run.
pytorch_estimator = PyTorchModel(entry_point='entrypoint.py',
                                 model_data=MODEL_PATH,
                                 name=MODEL_NAME,
                                 role=role,
                                 image=CONTAINER_IMAGE)

predictor = pytorch_estimator.deploy(instance_type='local',
                                     initial_instance_count=1)

Please let me know if you want more details

nadiaya (Contributor) commented Nov 8, 2019

What image (CONTAINER_IMAGE) do you use to create PyTorchModel?

ruijianw (Author) commented Nov 8, 2019

This is a customized image built on top of a prebuilt AWS SageMaker image.

For prebuilt images, I tried:

  1. sagemaker-pytorch
  2. pytorch-training
  3. pytorch-inference

Only 1 works; 2 and 3 fail in different ways.

nadiaya (Contributor) commented Nov 8, 2019

2 is expected to fail.
1 and 3 should work.

What error do you get when using the pytorch-inference container?

ruijianw (Author) commented Nov 8, 2019

It cannot find the entrypoint.py file. I checked the Docker container; there is only the /opt/ml/model folder and no code files.

Some more observations:

  1. The logs say an MXNet worker started, which seems odd to me.
  2. The source code was uploaded to S3 successfully according to the log output; there is a source.tar.gz, which I downloaded and verified (see the sketch after this list).
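A minimal sketch of that verification step, assuming placeholder bucket and key names rather than the actual values from this run:

import tarfile
import boto3

# Placeholders: substitute the bucket/key the SDK logged when it uploaded source.tar.gz.
s3 = boto3.client("s3")
s3.download_file("<sagemaker-bucket>", "<prefix>/source.tar.gz", "source.tar.gz")

with tarfile.open("source.tar.gz", "r:gz") as tar:
    print(tar.getnames())  # entrypoint.py should appear in the listing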

nadiaya (Contributor) commented Nov 9, 2019

  1. You see this message because it uses MMS (MXNet Model Server) to serve the predictions.
  2. I can't reproduce the issue. The exact code sample as well as the produced logs would really help.

ruijianw (Author) commented:

I am closing the issue for now since you cannot reproduce it. I will do more experiments.

I may reopen it once I have more info.

ruijianw (Author) commented:

For now, I would like to give it another try. The following is the error message with the pytorch-inference image:

algo-1-pmyh1_1  | 2019-11-11 16:31:06,305 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Traceback (most recent call last):
algo-1-pmyh1_1  | 2019-11-11 16:31:06,305 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.6/site-packages/mms/service.py", line 108, in predict
algo-1-pmyh1_1  | 2019-11-11 16:31:06,305 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     ret = self._entry_point(input_batch, self.context)
algo-1-pmyh1_1  | 2019-11-11 16:31:06,306 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.6/site-packages/sagemaker_inference/default_handler_service.py", line 31, in handle
algo-1-pmyh1_1  | 2019-11-11 16:31:06,306 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     return self._service.transform(data, context)
algo-1-pmyh1_1  | 2019-11-11 16:31:06,306 [INFO ] W-9022-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 3
algo-1-pmyh1_1  | 2019-11-11 16:31:06,306 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.6/site-packages/sagemaker_inference/transformer.py", line 55, in transform
algo-1-pmyh1_1  | 2019-11-11 16:31:06,306 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     self.validate_and_initialize()
algo-1-pmyh1_1  | 2019-11-11 16:31:06,306 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.6/site-packages/sagemaker_inference/transformer.py", line 92, in validate_and_initialize
algo-1-pmyh1_1  | 2019-11-11 16:31:06,306 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     self._validate_user_module_and_set_functions()
algo-1-pmyh1_1  | 2019-11-11 16:31:06,306 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.6/site-packages/sagemaker_inference/transformer.py", line 103, in _validate_user_module_and_set_functions
algo-1-pmyh1_1  | 2019-11-11 16:31:06,307 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     user_module = importlib.import_module(self._environment.module_name)
algo-1-pmyh1_1  | 2019-11-11 16:31:06,307 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.6/importlib/__init__.py", line 126, in import_module
algo-1-pmyh1_1  | 2019-11-11 16:31:06,307 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     return _bootstrap._gcd_import(name[level:], package, level)
algo-1-pmyh1_1  | 2019-11-11 16:31:06,307 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "<frozen importlib._bootstrap>", line 994, in _gcd_import
algo-1-pmyh1_1  | 2019-11-11 16:31:06,307 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "<frozen importlib._bootstrap>", line 971, in _find_and_load
algo-1-pmyh1_1  | 2019-11-11 16:31:06,307 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
algo-1-pmyh1_1  | 2019-11-11 16:31:06,307 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - ModuleNotFoundError: No module named 'handler'
algo-1-pmyh1_1  | 2019-11-11 16:31:06,308 [INFO ] W-9022-model ACCESS_LOG - /172.18.0.1:58992 "POST /invocations HTTP/1.1" 503 8

ruijianw reopened this Nov 11, 2019
nadiaya (Contributor) commented Nov 11, 2019

Thanks!

When do you get this error? On startup or when trying to run predictions?

ruijianw (Author) commented:

When trying to run predictions. The container started successfully; please refer to the following logs from spinning up the container:

algo-1-pmyh1_1  | 2019-11-11 16:30:48,040 [INFO ] W-9031-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Python runtime: 3.6.6
algo-1-pmyh1_1  | 2019-11-11 16:30:48,056 [INFO ] main com.amazonaws.ml.mms.ModelServer - Inference API bind to: http://0.0.0.0:8080
algo-1-pmyh1_1  | 2019-11-11 16:30:48,056 [INFO ] main com.amazonaws.ml.mms.ModelServer - Initialize Management server with: EpollServerSocketChannel.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,059 [INFO ] main com.amazonaws.ml.mms.ModelServer - Management API bind to: http://127.0.0.1:8081
algo-1-pmyh1_1  | Model server started.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,061 [INFO ] W-9030-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9030.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,061 [INFO ] W-9015-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9015.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,061 [INFO ] W-9021-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9021.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,061 [INFO ] W-9029-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9029.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,061 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9000.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,061 [INFO ] W-9012-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9012.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,062 [INFO ] W-9024-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9024.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,062 [INFO ] W-9003-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9003.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,061 [INFO ] W-9008-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9008.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,061 [INFO ] W-9016-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9016.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,061 [INFO ] W-9020-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9020.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,061 [INFO ] W-9017-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9017.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,061 [INFO ] W-9027-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9027.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,063 [INFO ] W-9031-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9031.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,063 [INFO ] W-9011-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9011.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,061 [INFO ] W-9013-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9013.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,064 [INFO ] W-9005-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9005.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,062 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9022.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,061 [INFO ] W-9007-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9007.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,064 [INFO ] W-9023-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9023.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,061 [INFO ] W-9002-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9002.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,064 [INFO ] W-9018-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9018.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,064 [INFO ] W-9009-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9009.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,064 [INFO ] W-9014-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9014.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,065 [INFO ] W-9025-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9025.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,065 [INFO ] W-9004-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9004.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,064 [INFO ] W-9001-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9001.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,065 [INFO ] W-9006-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9006.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,065 [INFO ] W-9019-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9019.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,065 [INFO ] W-9010-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9010.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,065 [INFO ] W-9026-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9026.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,065 [INFO ] W-9028-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9028.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,564 [INFO ] W-9022-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 462
algo-1-pmyh1_1  | 2019-11-11 16:30:48,564 [INFO ] W-9029-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 463
algo-1-pmyh1_1  | 2019-11-11 16:30:48,565 [INFO ] W-9030-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 460
algo-1-pmyh1_1  | 2019-11-11 16:30:48,576 [INFO ] W-9007-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 475
algo-1-pmyh1_1  | 2019-11-11 16:30:48,576 [INFO ] W-9008-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 455
algo-1-pmyh1_1  | 2019-11-11 16:30:48,577 [INFO ] W-9024-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 476
algo-1-pmyh1_1  | 2019-11-11 16:30:48,580 [INFO ] W-9027-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 471
algo-1-pmyh1_1  | 2019-11-11 16:30:48,583 [INFO ] W-9004-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 478
algo-1-pmyh1_1  | 2019-11-11 16:30:48,585 [INFO ] W-9006-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 483
algo-1-pmyh1_1  | 2019-11-11 16:30:48,586 [INFO ] W-9026-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 485
algo-1-pmyh1_1  | 2019-11-11 16:30:48,586 [INFO ] W-9031-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 485
algo-1-pmyh1_1  | 2019-11-11 16:30:48,599 [INFO ] W-9005-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 494
algo-1-pmyh1_1  | 2019-11-11 16:30:48,605 [INFO ] W-9023-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 504
algo-1-pmyh1_1  | 2019-11-11 16:30:48,610 [INFO ] W-9002-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 501
algo-1-pmyh1_1  | 2019-11-11 16:30:48,611 [INFO ] W-9019-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 494
algo-1-pmyh1_1  | 2019-11-11 16:30:48,615 [INFO ] W-9014-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 514
algo-1-pmyh1_1  | 2019-11-11 16:30:48,617 [INFO ] W-9000-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 516
algo-1-pmyh1_1  | 2019-11-11 16:30:48,618 [INFO ] W-9017-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 520
algo-1-pmyh1_1  | 2019-11-11 16:30:48,624 [INFO ] W-9012-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 523
algo-1-pmyh1_1  | 2019-11-11 16:30:48,624 [INFO ] W-9020-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 519
algo-1-pmyh1_1  | 2019-11-11 16:30:48,625 [INFO ] W-9015-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 520
algo-1-pmyh1_1  | 2019-11-11 16:30:48,631 [INFO ] W-9011-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 522
algo-1-pmyh1_1  | 2019-11-11 16:30:48,633 [INFO ] W-9001-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 532
algo-1-pmyh1_1  | 2019-11-11 16:30:48,636 [INFO ] W-9003-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 535
algo-1-pmyh1_1  | 2019-11-11 16:30:48,643 [INFO ] W-9025-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 542
algo-1-pmyh1_1  | 2019-11-11 16:30:48,645 [INFO ] W-9009-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 543
algo-1-pmyh1_1  | 2019-11-11 16:30:48,650 [INFO ] W-9018-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 532
algo-1-pmyh1_1  | 2019-11-11 16:30:48,664 [INFO ] W-9028-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 541
algo-1-pmyh1_1  | 2019-11-11 16:30:48,666 [INFO ] W-9013-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 562
algo-1-pmyh1_1  | 2019-11-11 16:30:48,671 [INFO ] W-9021-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 570
algo-1-pmyh1_1  | 2019-11-11 16:30:48,673 [INFO ] W-9016-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 576
algo-1-pmyh1_1  | 2019-11-11 16:30:48,676 [INFO ] W-9010-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 579
INFO:sagemaker.local.entities:Checking if serving container is up, attempt: 10
algo-1-pmyh1_1  | 2019-11-11 16:30:49,982 [INFO ] pool-1-thread-33 ACCESS_LOG - /172.18.0.1:58984 "GET /ping HTTP/1.1" 200 11


stale bot commented Nov 18, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Nov 18, 2019
laurenyu added the type: question label and removed the stale label Nov 19, 2019

stale bot commented Nov 26, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Nov 26, 2019
ChoiByungWook (Contributor) commented Dec 3, 2019

> For now, I would like to give it another try. The following is the error message with the pytorch-inference image: (full error log quoted from the comment above, ending in ModuleNotFoundError: No module named 'handler')

Apologies for the late response.

That specific error happens when attempting to import your entrypoint.py as shown here: https://github.com/aws/sagemaker-inference-toolkit/blob/master/src/sagemaker_inference/transformer.py#L143

The entrypoint.py is expected to be in a specific directory, which gets added to the Python path: https://github.com/aws/sagemaker-inference-toolkit/blob/master/src/sagemaker_inference/model_server.py#L103

The specific directory itself is defined by: https://github.com/aws/sagemaker-inference-toolkit/blob/master/src/sagemaker_inference/environment.py#L32

The entrypoint.py should be placed in that specific directory by the Python SDK depending on the framework version specified as shown here: https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/pytorch/model.py#L148

Looking at how you are starting the inference jobs, it looks like the framework_version is being omitted, which may cause the conditional not to place the entrypoint.py into the specified directory.

I apologize for the experience, as this is not ideal. However, is there any chance you can retry your job after specifying a framework version higher than 1.2?
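A minimal sketch of that retry, assuming the same MODEL_PATH, MODEL_NAME, and role placeholders as the earlier snippet; the framework version shown is only an example value:

from sagemaker.pytorch import PyTorchModel

# framework_version is the key change: with it set, the SDK should repack the model
# artifact so entrypoint.py ends up under the code directory the inference toolkit
# imports from (/opt/ml/model/code). '1.3.1' is illustrative; any version the SDK
# treats as >= 1.2 should do. image= is omitted so the SDK picks its default
# pytorch-inference image; pass image=... to keep using a custom one.
pytorch_model = PyTorchModel(entry_point='entrypoint.py',
                             model_data=MODEL_PATH,
                             name=MODEL_NAME,
                             role=role,
                             framework_version='1.3.1')

predictor = pytorch_model.deploy(instance_type='local',
                                 initial_instance_count=1)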

Thanks!

nadiaya (Contributor) commented Jun 9, 2020

Closing due to inactivity.

nadiaya closed this as completed Jun 9, 2020