Since the release of PyTorch DLC v1.8.1, the PyTorch MME example fails to deploy the MME endpoint. The sample picked up the new image automatically because we deliberately tagged only "1.8", to consume bug fixes without expecting breaking changes. Tagging the minor version only was also because the sample is known to work on more recent patches of older versions (e.g. 1.7.1 and 1.6.1) but not on 1.7.0 and 1.6.0, so it helped steer users to the right versions if they tried to downgrade.
I've done a bit of investigation on this, but have been unable to find the exact cause or a solution that works. v1.8.1 upgrades TorchServe from 0.3 to 0.4, so it's likely something changed in TorchServe to stop it from recognising the model bundle & starting correctly.
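To illustrate why a sample tagged only "1.8" started picking up v1.8.1, here is a minimal sketch of how a minor-only tag floats to the newest matching patch. This is not the SageMaker SDK's actual resolution logic: `resolve_version` is a hypothetical helper, and the `AVAILABLE` list is assumed from the versions mentioned in this issue.

```python
def resolve_version(requested, available):
    """Return the newest available version matching a full or minor-only tag.

    Hypothetical helper for illustration only; it does not reproduce how the
    SageMaker SDK maps a framework version to a container image.
    """
    def sort_key(version):
        # Compare versions numerically, so "1.10.0" would sort after "1.9.0".
        return tuple(int(part) for part in version.split("."))

    prefix = requested.split(".")
    matches = [v for v in available if v.split(".")[: len(prefix)] == prefix]
    if not matches:
        raise ValueError(f"no available version matches {requested!r}")
    return max(matches, key=sort_key)


# Assumed list of published patch versions, based on those named in this issue.
AVAILABLE = ["1.6.0", "1.6.1", "1.7.0", "1.7.1", "1.8.0", "1.8.1"]

print(resolve_version("1.8", AVAILABLE))    # minor-only tag: floats to the newest 1.8.x
print(resolve_version("1.8.0", AVAILABLE))  # full tag: stays pinned to that patch
```

Under these assumptions, a minor-only tag changes behaviour as soon as a new patch is published, while a full patch version stays fixed until it is edited by hand.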
To reproduce
Run through the multi_model_pytorch example notebook.
Logs
Some logs from the failed endpoint:
['torchserve', '--start', '--model-store', '/', '--ts-config', '/etc/sagemaker-ts.properties', '--log-config', '/opt/conda/lib/python3.6/site-packages/sagemaker_pytorch_serving_container/etc/log4j.properties', '--models', 'model.mar']
...
2021-09-06 07:39:28,480 [INFO ] main org.pytorch.serve.ModelServer - Loading initial models: model.mar
2021-09-06 07:39:28,484 [WARN ] main org.pytorch.serve.ModelServer - Failed to load model: model.mar
org.pytorch.serve.archive.ModelNotFoundException: Model not found at: model.mar
#011at org.pytorch.serve.archive.ModelArchive.downloadModel(ModelArchive.java:86)
#011at org.pytorch.serve.wlm.ModelManager.createModelArchive(ModelManager.java:135)
#011at org.pytorch.serve.wlm.ModelManager.registerModel(ModelManager.java:112)
#011at org.pytorch.serve.ModelServer.initModelStore(ModelServer.java:227)
#011at org.pytorch.serve.ModelServer.startRESTserver(ModelServer.java:327)
#011at org.pytorch.serve.ModelServer.startAndWait(ModelServer.java:114)
#011at org.pytorch.serve.ModelServer.main(ModelServer.java:95)
2021-09-06 07:39:28,494 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2021-09-06 07:39:28,570 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://0.0.0.0:8080
2021-09-06 07:39:28,570 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2021-09-06 07:39:28,572 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://127.0.0.1:8082
Model server started.
...
But NOTE that:
These warnings about model.mar were actually present on previous versions too which still worked. Modifying the sample to save a model.mar in the root of model.tar.gz does not fix the failure.
Although the server does report that it starts, no successful ping checks get passed and SageMaker eventually fails it.
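For anyone comparing logs between a working and a failing image version, a rough sketch like the one below could pull out the WARN and ERROR entries for a side-by-side diff. The regular expression is an assumption based only on the log line format quoted above, and `find_problems` is a hypothetical helper. As noted, the `model.mar` warning also appears on versions that work, so a match here is not necessarily the cause.

```python
import re

# Pattern assumed from the log lines quoted in this issue, e.g.
# "2021-09-06 07:39:28,484 [WARN ] main org.pytorch.serve.ModelServer - ..."
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"\[(?P<level>[A-Z]+)\s*\] "
    r"(?P<thread>\S+) (?P<logger>\S+) - (?P<msg>.*)$"
)


def find_problems(lines, levels=("WARN", "ERROR")):
    """Return (timestamp, level, message) for log lines at the given levels."""
    problems = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and match.group("level") in levels:
            problems.append(
                (match.group("ts"), match.group("level"), match.group("msg"))
            )
    return problems


# Sample lines copied from the logs in this issue.
sample = [
    "2021-09-06 07:39:28,480 [INFO ] main org.pytorch.serve.ModelServer - Loading initial models: model.mar",
    "2021-09-06 07:39:28,484 [WARN ] main org.pytorch.serve.ModelServer - Failed to load model: model.mar",
    "Model server started.",
]

for timestamp, level, message in find_problems(sample):
    print(timestamp, level, message)
```

Lines that do not follow the timestamped format (stack trace frames, `Model server started.`) are skipped rather than treated as errors.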
Link to the notebook
https://github.com/aws/amazon-sagemaker-examples/tree/master/advanced_functionality/multi_model_pytorch