
[Bug Report] PyTorch MME example fails with container v1.8.1 #2943

Open
athewsey opened this issue Sep 17, 2021 · 0 comments
Comments

@athewsey
Contributor

Link to the notebook

https://github.com/aws/amazon-sagemaker-examples/tree/master/advanced_functionality/multi_model_pytorch

Describe the bug

Since the release of PyTorch DLC v1.8.1, the PyTorch MME example fails to deploy the multi-model endpoint. The sample deliberately pins only the minor version tag "1.8", so it automatically picks up new patch releases: this was intended to consume bug fixes without expecting breaking changes. It also mattered because the sample is known to work on the more recent patches of older versions (e.g. 1.7.1 and 1.6.1) but not on 1.7.0 and 1.6.0, so specifying only the minor version helped steer users towards working versions if they tried to downgrade.
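The floating-tag behaviour described above can be illustrated with a small stdlib-only sketch (the tag-resolution logic here is a simplification for illustration, not the registry's actual implementation): a tag like "1.8" resolves to the newest matching patch, which is how the sample silently moved from the working 1.8.0 image to the broken 1.8.1 one.

```python
def resolve_tag(tag, available):
    """Pick the newest available version matching a floating tag.

    Loosely mirrors how a '1.8' image tag resolves to the latest 1.8.x
    patch release (illustrative only; not the actual registry logic).
    """
    matches = [v for v in available if v == tag or v.startswith(tag + ".")]
    if not matches:
        raise ValueError("No image matches tag %r" % tag)
    return max(matches, key=lambda v: tuple(map(int, v.split("."))))


available = ["1.6.0", "1.6.1", "1.7.0", "1.7.1", "1.8.0", "1.8.1"]
print(resolve_tag("1.8", available))    # floating tag -> 1.8.1 (the broken patch)
print(resolve_tag("1.7.1", available))  # fully pinned  -> 1.7.1 (known-working)
```

This is why pinning the full patch version (e.g. "1.8.0") is the obvious short-term mitigation, at the cost of no longer picking up future fixes automatically.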

I've done a bit of investigation on this, but have been unable to find the exact cause or a working solution. v1.8.1 upgrades TorchServe from 0.3 to 0.4, so it's likely that something changed in TorchServe that stops it from recognising the model bundle and starting correctly.

To reproduce

Run through the multi_model_pytorch example notebook.

Logs

Some logs from the failed endpoint:

['torchserve', '--start', '--model-store', '/', '--ts-config', '/etc/sagemaker-ts.properties', '--log-config', '/opt/conda/lib/python3.6/site-packages/sagemaker_pytorch_serving_container/etc/log4j.properties', '--models', 'model.mar']
...
2021-09-06 07:39:28,480 [INFO ] main org.pytorch.serve.ModelServer - Loading initial models: model.mar
2021-09-06 07:39:28,484 [WARN ] main org.pytorch.serve.ModelServer - Failed to load model: model.mar
org.pytorch.serve.archive.ModelNotFoundException: Model not found at: model.mar
	at org.pytorch.serve.archive.ModelArchive.downloadModel(ModelArchive.java:86)
	at org.pytorch.serve.wlm.ModelManager.createModelArchive(ModelManager.java:135)
	at org.pytorch.serve.wlm.ModelManager.registerModel(ModelManager.java:112)
	at org.pytorch.serve.ModelServer.initModelStore(ModelServer.java:227)
	at org.pytorch.serve.ModelServer.startRESTserver(ModelServer.java:327)
	at org.pytorch.serve.ModelServer.startAndWait(ModelServer.java:114)
	at org.pytorch.serve.ModelServer.main(ModelServer.java:95)
2021-09-06 07:39:28,494 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2021-09-06 07:39:28,570 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://0.0.0.0:8080
2021-09-06 07:39:28,570 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2021-09-06 07:39:28,572 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://127.0.0.1:8082
Model server started.

...But NOTE that:

  1. These warnings about model.mar were actually present on previous (working) versions too. Modifying the sample to save a model.mar in the root of model.tar.gz does not fix the failure.
  2. Although the server does report that it starts, the endpoint never passes SageMaker's ping health checks, and SageMaker eventually fails the deployment.
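For reference, note 1's variant (placing model.mar at the root of model.tar.gz, which did not resolve the failure) can be reproduced with a short stdlib-only sketch; the file names and placeholder contents here are illustrative, not the sample's actual artifacts:

```python
import os
import tarfile
import tempfile


def repack_with_mar(mar_path, out_path):
    """Build a model.tar.gz with the .mar archive at the root.

    This is the layout tried in note 1 above; it did not fix the
    TorchServe 0.4 / DLC v1.8.1 startup failure.
    """
    with tarfile.open(out_path, "w:gz") as tar:
        tar.add(mar_path, arcname="model.mar")


# Quick demo with a stand-in file (contents are placeholder bytes):
tmp = tempfile.mkdtemp()
mar = os.path.join(tmp, "model.mar")
with open(mar, "wb") as f:
    f.write(b"placeholder archive bytes")
out = os.path.join(tmp, "model.tar.gz")
repack_with_mar(mar, out)
with tarfile.open(out) as tar:
    print(tar.getnames())  # -> ['model.mar']
```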