
[Bug Report] with Instance_type = "local_gpu", SageMaker unable to find PyTorch inference container image. #3110

Open · sermolin (Contributor) opened this issue Jan 26, 2022 · 0 comments
This is similar to the issues reported in aws/sagemaker-python-sdk#1099 and aws/sagemaker-python-sdk#1105.

Link to the notebook
Add the link to the notebook.

Describe the bug

  1. Bug-1
    `py_version="py3"` is missing from the PyTorch estimator. Without it, training fails. Corrected estimator:

    ```python
    cifar10_estimator = PyTorch(
        entry_point="source/cifar10.py",
        role=role,
        framework_version="1.7.1",
        py_version="py3",
        instance_count=1,
        instance_type=instance_type,
    )
    ```

  2. Bug-2
    If we set `instance_type = "local_gpu"`, the deployment fails:

    ```python
    cifar10_predictor = cifar10_estimator.deploy(initial_instance_count=1, instance_type=instance_type)
    ```

    ```
    CalledProcessError: Command '['docker', 'pull', '763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:1.7.1-gpu-py3']' returned non-zero exit status 1.
    ```

Running `!pip install -U sagemaker` did not help.

If we set `instance_type = "ml.p3.2xlarge"`, deployment succeeds.
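Since the failure is a `docker pull` from the AWS Deep Learning Containers registry, one workaround worth trying (an assumption based on the error message, not a confirmed fix) is to authenticate Docker against that registry manually before calling `deploy()`. The account ID, region, and image tag below are copied from the error above:

```shell
# Log Docker in to the AWS Deep Learning Containers ECR registry
# (account 763104351884 in us-west-2, both taken from the error message),
# then pull the inference image that local mode failed to fetch.
aws ecr get-login-password --region us-west-2 \
  | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-west-2.amazonaws.com

docker pull 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:1.7.1-gpu-py3
```

If the manual pull succeeds, re-running the `deploy()` cell should find the image in the local Docker cache.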

To reproduce
  1. Clone the repo.
  2. Add `py_version="py3"` to the estimator.
  3. Set `instance_type = "local_gpu"` and run the notebook through the `deploy()` call.

Logs
If applicable, add logs to help explain your problem.
You may also attach an .ipynb file to this issue if it includes relevant logs or output.


```
CalledProcessError                        Traceback (most recent call last)
in
      2
      3 #cifar10_predictor = cifar10_estimator.deploy(initial_instance_count=1, instance_type="ml.p3.2xlarge")
----> 4 cifar10_predictor = cifar10_estimator.deploy(initial_instance_count=1, instance_type=instance_type)

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/estimator.py in deploy(self, initial_instance_count, instance_type, serializer, deserializer, accelerator_type, endpoint_name, use_compiled_model, wait, model_name, kms_key, data_capture_config, tags, **kwargs)
    959
    960         model.name = model_name
--> 961
    962         return model.deploy(
    963             instance_type=instance_type,

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/model.py in deploy(self, initial_instance_count, instance_type, serializer, deserializer, accelerator_type, endpoint_name, tags, kms_key, wait, data_capture_config, **kwargs)
    787
    788         if instance_type and instance_type.startswith("ml.inf") and not self._is_compiled_model:
--> 789             LOGGER.warning(
    790                 "Your model is not compiled. Please compile your model before using Inferentia."
    791             )

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/session.py in endpoint_from_production_variants(self, name, production_variants, tags, kms_key, wait, data_capture_config_dict)
   3568         if kms_key:
   3569             config_options["KmsKeyId"] = kms_key
-> 3570         if data_capture_config_dict is not None:
   3571             config_options["DataCaptureConfig"] = data_capture_config_dict
   3572

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/session.py in create_endpoint(self, endpoint_name, config_name, tags, wait)
   3070         """
   3071         LOGGER.info("Creating endpoint with name %s", endpoint_name)
-> 3072
   3073         tags = tags or []
   3074         tags = _append_project_tags(tags)

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/local/local_session.py in create_endpoint(self, EndpointName, EndpointConfigName, Tags)
    346         endpoint = _LocalEndpoint(EndpointName, EndpointConfigName, Tags, self.sagemaker_session)
    347         LocalSagemakerClient._endpoints[EndpointName] = endpoint
--> 348         endpoint.serve()
    349
    350     def update_endpoint(self, EndpointName, EndpointConfigName):  # pylint: disable=unused-argument

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/local/entities.py in serve(self)
    576         )
    577         self.container.serve(
--> 578             self.primary_container["ModelDataUrl"], self.primary_container["Environment"]
    579         )
    580

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/local/image.py in serve(self, model_dir, environment)
    285
    286         if _ecr_login_if_needed(self.sagemaker_session.boto_session, self.image):
--> 287             _pull_image(self.image)
    288
    289         self._generate_compose_file(

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/local/image.py in _pull_image(image)
   1094     logger.info("docker command: %s", pull_image_command)
   1095
-> 1096     subprocess.check_output(pull_image_command.split())
   1097     logger.info("image pulled: %s", image)

~/anaconda3/envs/pytorch_p36/lib/python3.6/subprocess.py in check_output(timeout, *popenargs, **kwargs)
    354
    355     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
--> 356                **kwargs).stdout
    357
    358

~/anaconda3/envs/pytorch_p36/lib/python3.6/subprocess.py in run(input, timeout, check, *popenargs, **kwargs)
    436     if check and retcode:
    437         raise CalledProcessError(retcode, process.args,
--> 438                                  output=stdout, stderr=stderr)
    439     return CompletedProcess(process.args, retcode, stdout, stderr)
    440

CalledProcessError: Command '['docker', 'pull', '763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:1.7.1-gpu-py3']' returned non-zero exit status 1.
```
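The final frames of the traceback show the mechanism: `_pull_image` calls `subprocess.check_output`, which raises `CalledProcessError` whenever the command exits non-zero. A minimal sketch of that behavior, using `false` as a stand-in for the failing `docker pull`:

```python
import subprocess

# Hypothetical stand-in for the docker pull command: `false` always
# exits with status 1, just as the unauthenticated pull does here.
pull_image_command = "false"

try:
    # check_output runs the command and raises if the exit status is non-zero.
    subprocess.check_output(pull_image_command.split())
except subprocess.CalledProcessError as err:
    print(f"returned non-zero exit status {err.returncode}")
```

This is why the deploy cell surfaces a `CalledProcessError` rather than a Docker error message; the underlying `docker pull` output goes to the notebook's stderr.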
