Description
I was trying out
https://github.com/aws/amazon-sagemaker-examples/tree/master/sagemaker_neo_compilation_jobs/deploy_tensorflow_model_on_Inf1_instance
and was able to get it working with the SageMaker Python SDK 1.x, but I am running into issues with SDK 2.x.
Steps
- With 2.x the example was defaulting to script mode, so I updated the scripts following
https://github.com/aws/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/tensorflow_moving_from_framework_mode_to_script_mode/tensorflow_moving_from_framework_mode_to_script_mode.ipynb
- Everything works fine up to the "Deploy the compiled model on a SageMaker endpoint" step.
- In the "invoke the endpoint" step I see the following failure:
```
Extracting /tmp/data/train-images-idx3-ubyte.gz
Extracting /tmp/data/train-labels-idx1-ubyte.gz
Extracting /tmp/data/t10k-images-idx3-ubyte.gz
Extracting /tmp/data/t10k-labels-idx1-ubyte.gz
---------------------------------------------------------------------------
ModelError                                Traceback (most recent call last)
<ipython-input-13-a13c7ab7b16b> in <module>
     12 display.display(im)
     13 # Invoke endpoint with image
---> 14 predict_response = optimized_predictor.predict(data)
     15
     16 print("========================================")

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/tensorflow/serving.py in predict(self, data, initial_args)
    116             args["CustomAttributes"] = self._model_attributes
    117
--> 118         return super(Predictor, self).predict(data, args)
    119
    120

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/predictor.py in predict(self, data, initial_args, target_model, target_variant)
    111
    112         request_args = self._create_request_args(data, initial_args, target_model, target_variant)
--> 113         response = self.sagemaker_session.sagemaker_runtime_client.invoke_endpoint(**request_args)
    114         return self._handle_response(response)
    115

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
    355                     "%s() only accepts keyword arguments." % py_operation_name)
    356             # The "self" in this scope is referring to the BaseClient.
--> 357             return self._make_api_call(operation_name, kwargs)
    358
    359         _api_call.__name__ = str(py_operation_name)

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
    674             error_code = parsed_response.get("Error", {}).get("Code")
    675             error_class = self.exceptions.from_code(error_code)
--> 676             raise error_class(parsed_response, operation_name)
    677         else:
    678             return parsed_response

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from model with message "Your invocation timed out while waiting for a response from container model. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.". See https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=/aws/sagemaker/Endpoints/tensorflow-training-2021-02-23-23-14-00-380ml-inf1 in account 448570897954 for more information.
```
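For context, here is a minimal sketch of what the failing invocation boils down to. The payload layout (a normalized batch of one 28x28 MNIST digit) and the `prepare_mnist_payload` helper are my assumptions based on the MNIST example notebook, not something confirmed by the logs above:

```python
# Sketch of the invocation, assuming the MNIST payload shape from the notebook.
import numpy as np

def prepare_mnist_payload(image: np.ndarray) -> dict:
    """Turn one 28x28 uint8 MNIST digit into a TensorFlow Serving
    style request body ({"instances": [...]}); shape is an assumption."""
    batch = image.astype("float32") / 255.0  # scale pixels to [0, 1]
    batch = batch.reshape(1, 28, 28, 1)      # NHWC batch of one
    return {"instances": batch.tolist()}

# The failing notebook cell then amounts to:
# predict_response = optimized_predictor.predict(prepare_mnist_payload(image))
```

If the compiled model expects a different input shape than the payload the predictor sends, the container can hang and the invocation times out exactly as in the traceback above.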
- I looked into the CloudWatch logs but could not find anything that explains what is going on.
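For anyone retracing this, a sketch of how I pulled the endpoint container logs the error message points at. The log-group naming follows the `/aws/sagemaker/Endpoints/<endpoint-name>` convention visible in the error URL; the boto3 part needs AWS credentials, so it is shown commented out:

```python
# Helper for the SageMaker endpoint log-group name seen in the error URL.
def endpoint_log_group(endpoint_name: str) -> str:
    return "/aws/sagemaker/Endpoints/" + endpoint_name

# With boto3 (requires AWS credentials; not runnable offline):
# import boto3
# logs = boto3.client("logs", region_name="us-east-1")
# group = endpoint_log_group("tensorflow-training-2021-02-23-23-14-00-380ml-inf1")
# for stream in logs.describe_log_streams(logGroupName=group)["logStreams"]:
#     events = logs.get_log_events(
#         logGroupName=group, logStreamName=stream["logStreamName"])
#     for event in events["events"]:
#         print(event["message"])
```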
System information
A description of your system. Please provide:
- SageMaker Python SDK version: 2.5
- Framework name (e.g. PyTorch) or algorithm (e.g. KMeans): TensorFlow
- Framework version: 1.15.0
- Python version: 3.6 (conda tensorflow_p36 environment)
- CPU or GPU: Inf1
- Custom Docker image (Y/N): N