
[Bug Report] PyTorch MME example fails with container v1.8.1 #2943

Open
athewsey opened this issue Sep 17, 2021 · 0 comments
Comments

@athewsey
Contributor

Link to the notebook

https://github.com/aws/amazon-sagemaker-examples/tree/master/advanced_functionality/multi_model_pytorch

Describe the bug

Since the release of PyTorch DLC v1.8.1, the PyTorch MME example fails to deploy the multi-model endpoint. The sample deliberately pins only the minor version tag "1.8", so it automatically picks up new patch releases: this was intended to consume bug fixes without expecting breaking changes. It also mattered because the sample is known to work on the more recent patches of older versions (e.g. 1.7.1 and 1.6.1) but not on 1.7.0 and 1.6.0, so specifying only the minor version helped steer users towards working versions if they tried to downgrade.
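The floating-tag behaviour described above can be illustrated with a small stdlib-only sketch (the tag-resolution logic here is a simplification for illustration, not the registry's actual implementation): a tag like "1.8" resolves to the newest matching patch, which is how the sample silently moved from the working 1.8.0 image to the broken 1.8.1 one.

```python
def resolve_tag(tag, available):
    """Pick the newest available version matching a floating tag.

    Loosely mirrors how a '1.8' image tag resolves to the latest 1.8.x
    patch release (illustrative only; not the actual registry logic).
    """
    matches = [v for v in available if v == tag or v.startswith(tag + ".")]
    if not matches:
        raise ValueError("No image matches tag %r" % tag)
    return max(matches, key=lambda v: tuple(map(int, v.split("."))))


available = ["1.6.0", "1.6.1", "1.7.0", "1.7.1", "1.8.0", "1.8.1"]
print(resolve_tag("1.8", available))    # floating tag -> 1.8.1 (the broken patch)
print(resolve_tag("1.7.1", available))  # fully pinned  -> 1.7.1 (known-working)
```

This is why pinning the full patch version (e.g. "1.8.0") is the obvious short-term mitigation, at the cost of no longer picking up future fixes automatically.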

I've done a bit of investigation on this, but have been unable to find the exact cause or a working solution. v1.8.1 upgrades TorchServe from 0.3 to 0.4, so it's likely that something changed in TorchServe that stops it from recognising the model bundle and starting correctly.

To reproduce

Run through the multi_model_pytorch example notebook.

Logs

Some logs from the failed endpoint:

['torchserve', '--start', '--model-store', '/', '--ts-config', '/etc/sagemaker-ts.properties', '--log-config', '/opt/conda/lib/python3.6/site-packages/sagemaker_pytorch_serving_container/etc/log4j.properties', '--models', 'model.mar']
...
2021-09-06 07:39:28,480 [INFO ] main org.pytorch.serve.ModelServer - Loading initial models: model.mar
2021-09-06 07:39:28,484 [WARN ] main org.pytorch.serve.ModelServer - Failed to load model: model.mar
org.pytorch.serve.archive.ModelNotFoundException: Model not found at: model.mar
	at org.pytorch.serve.archive.ModelArchive.downloadModel(ModelArchive.java:86)
	at org.pytorch.serve.wlm.ModelManager.createModelArchive(ModelManager.java:135)
	at org.pytorch.serve.wlm.ModelManager.registerModel(ModelManager.java:112)
	at org.pytorch.serve.ModelServer.initModelStore(ModelServer.java:227)
	at org.pytorch.serve.ModelServer.startRESTserver(ModelServer.java:327)
	at org.pytorch.serve.ModelServer.startAndWait(ModelServer.java:114)
	at org.pytorch.serve.ModelServer.main(ModelServer.java:95)
2021-09-06 07:39:28,494 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2021-09-06 07:39:28,570 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://0.0.0.0:8080
2021-09-06 07:39:28,570 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2021-09-06 07:39:28,572 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://127.0.0.1:8082
Model server started.

...But NOTE that:

  1. These warnings about model.mar were actually present on previous (working) versions too. Modifying the sample to save a model.mar in the root of model.tar.gz does not fix the failure.
  2. Although the server does report that it starts, the endpoint never passes SageMaker's ping health checks, and SageMaker eventually fails the deployment.
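For reference, note 1's variant (placing model.mar at the root of model.tar.gz, which did not resolve the failure) can be reproduced with a short stdlib-only sketch; the file names and placeholder contents here are illustrative, not the sample's actual artifacts:

```python
import os
import tarfile
import tempfile


def repack_with_mar(mar_path, out_path):
    """Build a model.tar.gz with the .mar archive at the root.

    This is the layout tried in note 1 above; it did not fix the
    TorchServe 0.4 / DLC v1.8.1 startup failure.
    """
    with tarfile.open(out_path, "w:gz") as tar:
        tar.add(mar_path, arcname="model.mar")


# Quick demo with a stand-in file (contents are placeholder bytes):
tmp = tempfile.mkdtemp()
mar = os.path.join(tmp, "model.mar")
with open(mar, "wb") as f:
    f.write(b"placeholder archive bytes")
out = os.path.join(tmp, "model.tar.gz")
repack_with_mar(mar, out)
with tarfile.open(out) as tar:
    print(tar.getnames())  # -> ['model.mar']
```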