
With sagemaker2.x not able to get tensorflow_distributed_mnist_neo_inf1.ipynb working in jupyter lab #2175

@aws-vrnatham

Description

I was trying out the example at
https://github.com/aws/amazon-sagemaker-examples/tree/master/sagemaker_neo_compilation_jobs/deploy_tensorflow_model_on_Inf1_instance

I was able to get it working with SageMaker Python SDK 1.x, but I am running into issues with 2.x.

Steps

  1. With 2.x the SDK defaulted to script mode, so I used the following notebook to convert the scripts:
    https://github.com/aws/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/tensorflow_moving_from_framework_mode_to_script_mode/tensorflow_moving_from_framework_mode_to_script_mode.ipynb

  2. Everything works fine up to "Deploy the compiled model on a SageMaker endpoint".

  3. In the "Invoke the endpoint" step I see the following failure:

Extracting /tmp/data/train-images-idx3-ubyte.gz
Extracting /tmp/data/train-labels-idx1-ubyte.gz
Extracting /tmp/data/t10k-images-idx3-ubyte.gz
Extracting /tmp/data/t10k-labels-idx1-ubyte.gz

---------------------------------------------------------------------------
ModelError                                Traceback (most recent call last)
<ipython-input-13-a13c7ab7b16b> in <module>
     12     display.display(im)
     13     # Invoke endpoint with image
---> 14     predict_response = optimized_predictor.predict(data)
     15 
     16     print("========================================")

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/tensorflow/serving.py in predict(self, data, initial_args)
    116                 args["CustomAttributes"] = self._model_attributes
    117 
--> 118         return super(Predictor, self).predict(data, args)
    119 
    120 

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/predictor.py in predict(self, data, initial_args, target_model, target_variant)
    111 
    112         request_args = self._create_request_args(data, initial_args, target_model, target_variant)
--> 113         response = self.sagemaker_session.sagemaker_runtime_client.invoke_endpoint(**request_args)
    114         return self._handle_response(response)
    115 

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
    355                     "%s() only accepts keyword arguments." % py_operation_name)
    356             # The "self" in this scope is referring to the BaseClient.
--> 357             return self._make_api_call(operation_name, kwargs)
    358 
    359         _api_call.__name__ = str(py_operation_name)

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
    674             error_code = parsed_response.get("Error", {}).get("Code")
    675             error_class = self.exceptions.from_code(error_code)
--> 676             raise error_class(parsed_response, operation_name)
    677         else:
    678             return parsed_response

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from model with message "Your invocation timed out while waiting for a response from container model. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.". See https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=/aws/sagemaker/Endpoints/tensorflow-training-2021-02-23-23-14-00-380ml-inf1 in account 448570897954 for more information.
  4. I looked into the CloudWatch logs but could not find anything that explains the failure.
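
For anyone trying to reproduce this, here is a minimal sketch of pulling the endpoint's container logs from a script instead of the console. It assumes boto3 is installed and that the credentials in use allow `logs:FilterLogEvents`. The helper names `log_group_from_error` and `fetch_log_events` are hypothetical and are not part of the SageMaker SDK; the only AWS call used is the CloudWatch Logs `filter_log_events` operation.

```python
import re

# Matches the log group embedded in the CloudWatch console URL that the
# ModelError message contains ("...#logEventViewer:group=/aws/sagemaker/Endpoints/<name> ...").
_LOG_GROUP_RE = re.compile(r"group=(/aws/sagemaker/Endpoints/[^\s;&]+)")


def log_group_from_error(message):
    """Return the endpoint log group named in a ModelError message, or None."""
    match = _LOG_GROUP_RE.search(message)
    return match.group(1) if match else None


def fetch_log_events(log_group, region, start_time_ms=None, limit=50):
    """Return up to `limit` log messages from `log_group`, oldest first.

    Requires boto3 and AWS credentials. `start_time_ms` is an optional
    epoch-milliseconds lower bound on the event timestamp.
    """
    import boto3  # imported lazily so the parsing helper works without boto3

    kwargs = {"logGroupName": log_group, "limit": limit}
    if start_time_ms is not None:
        kwargs["startTime"] = start_time_ms
    logs = boto3.client("logs", region_name=region)
    response = logs.filter_log_events(**kwargs)
    return [event["message"] for event in response["events"]]
```

In the notebook, the call to `optimized_predictor.predict(data)` could be wrapped in `try`/`except`, with `str(error)` passed to `log_group_from_error` and the result passed to `fetch_log_events` together with the region (`us-east-1` in the message above).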

System information

  • SageMaker Python SDK version: 2.5
  • Framework name (e.g. PyTorch) or algorithm (e.g. KMeans): TensorFlow
  • Framework version: 1.15.0
  • Python version: 3.6 (conda tensorflow_p36 environment)
  • CPU or GPU: Inf1
  • Custom Docker image (Y/N): N
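
The exact versions behind the list above can be collected from the notebook kernel with a short script. This is only a sketch: `collect_versions` is a hypothetical helper, and it assumes each package exposes a `__version__` attribute, which may not hold for every package.

```python
import importlib
import platform


def collect_versions(packages=("sagemaker", "tensorflow", "boto3")):
    """Map 'python' and each package name to a version string (None if unavailable)."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # Package is not installed in this kernel.
            versions[name] = None
        else:
            # Assumption: the package exposes __version__; fall back to None.
            versions[name] = getattr(module, "__version__", None)
    return versions


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print(f"{name}: {version or 'not available'}")
```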
