
[Bug Report] with Instance_type = "local_gpu", SageMaker unable to find PyTorch inference container image. #3110

Open · sermolin (Contributor) opened this issue Jan 26, 2022 · 0 comments
This is similar to the issues reported in aws/sagemaker-python-sdk#1099 and aws/sagemaker-python-sdk#1105.

Link to the notebook
Add the link to the notebook.

Describe the bug

  1. Bug-1
    `py_version="py3"` is missing from the PyTorch estimator. Without it, training fails. Corrected estimator:

    ```python
    cifar10_estimator = PyTorch(
        entry_point="source/cifar10.py",
        role=role,
        framework_version="1.7.1",
        py_version="py3",
        instance_count=1,
        instance_type=instance_type,
    )
    ```

  2. Bug-2
    If we set `instance_type = "local_gpu"`, the deployment fails:

    ```python
    cifar10_predictor = cifar10_estimator.deploy(initial_instance_count=1, instance_type=instance_type)
    ```

    ```
    CalledProcessError: Command '['docker', 'pull', '763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:1.7.1-gpu-py3']' returned non-zero exit status 1.
    ```

Running `!pip install -U sagemaker` did not help.

If we set `instance_type = "ml.p3.2xlarge"`, deployment succeeds.
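Since the failure is a `docker pull` from the AWS Deep Learning Containers registry, one workaround worth trying (an assumption based on the error message, not a confirmed fix) is to authenticate Docker against that registry manually before calling `deploy()`. The account ID, region, and image tag below are copied from the error above:

```shell
# Log Docker in to the AWS Deep Learning Containers ECR registry
# (account 763104351884 in us-west-2, both taken from the error message),
# then pull the inference image that local mode failed to fetch.
aws ecr get-login-password --region us-west-2 \
  | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-west-2.amazonaws.com

docker pull 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:1.7.1-gpu-py3
```

If the manual pull succeeds, re-running the `deploy()` cell should find the image in the local Docker cache.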

To reproduce
  1. Clone the repo.
  2. Add `py_version="py3"` to the estimator.
  3. Set `instance_type = "local_gpu"` and run the notebook through the `deploy()` call.

Logs
If applicable, add logs to help explain your problem.
You may also attach an .ipynb file to this issue if it includes relevant logs or output.


```
CalledProcessError                        Traceback (most recent call last)
in
      2
      3 #cifar10_predictor = cifar10_estimator.deploy(initial_instance_count=1, instance_type="ml.p3.2xlarge")
----> 4 cifar10_predictor = cifar10_estimator.deploy(initial_instance_count=1, instance_type=instance_type)

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/estimator.py in deploy(self, initial_instance_count, instance_type, serializer, deserializer, accelerator_type, endpoint_name, use_compiled_model, wait, model_name, kms_key, data_capture_config, tags, **kwargs)
    959
    960         model.name = model_name
--> 961
    962         return model.deploy(
    963             instance_type=instance_type,

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/model.py in deploy(self, initial_instance_count, instance_type, serializer, deserializer, accelerator_type, endpoint_name, tags, kms_key, wait, data_capture_config, **kwargs)
    787
    788         if instance_type and instance_type.startswith("ml.inf") and not self._is_compiled_model:
--> 789             LOGGER.warning(
    790                 "Your model is not compiled. Please compile your model before using Inferentia."
    791             )

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/session.py in endpoint_from_production_variants(self, name, production_variants, tags, kms_key, wait, data_capture_config_dict)
   3568         if kms_key:
   3569             config_options["KmsKeyId"] = kms_key
-> 3570         if data_capture_config_dict is not None:
   3571             config_options["DataCaptureConfig"] = data_capture_config_dict
   3572

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/session.py in create_endpoint(self, endpoint_name, config_name, tags, wait)
   3070         """
   3071         LOGGER.info("Creating endpoint with name %s", endpoint_name)
-> 3072
   3073         tags = tags or []
   3074         tags = _append_project_tags(tags)

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/local/local_session.py in create_endpoint(self, EndpointName, EndpointConfigName, Tags)
    346         endpoint = _LocalEndpoint(EndpointName, EndpointConfigName, Tags, self.sagemaker_session)
    347         LocalSagemakerClient._endpoints[EndpointName] = endpoint
--> 348         endpoint.serve()
    349
    350     def update_endpoint(self, EndpointName, EndpointConfigName):  # pylint: disable=unused-argument

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/local/entities.py in serve(self)
    576         )
    577         self.container.serve(
--> 578             self.primary_container["ModelDataUrl"], self.primary_container["Environment"]
    579         )
    580

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/local/image.py in serve(self, model_dir, environment)
    285
    286         if _ecr_login_if_needed(self.sagemaker_session.boto_session, self.image):
--> 287             _pull_image(self.image)
    288
    289         self._generate_compose_file(

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/local/image.py in _pull_image(image)
   1094     logger.info("docker command: %s", pull_image_command)
   1095
-> 1096     subprocess.check_output(pull_image_command.split())
   1097     logger.info("image pulled: %s", image)

~/anaconda3/envs/pytorch_p36/lib/python3.6/subprocess.py in check_output(timeout, *popenargs, **kwargs)
    354
    355     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
--> 356                **kwargs).stdout
    357
    358

~/anaconda3/envs/pytorch_p36/lib/python3.6/subprocess.py in run(input, timeout, check, *popenargs, **kwargs)
    436     if check and retcode:
    437         raise CalledProcessError(retcode, process.args,
--> 438                                  output=stdout, stderr=stderr)
    439     return CompletedProcess(process.args, retcode, stdout, stderr)
    440

CalledProcessError: Command '['docker', 'pull', '763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:1.7.1-gpu-py3']' returned non-zero exit status 1.
```
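The final frames of the traceback show the mechanism: `_pull_image` calls `subprocess.check_output`, which raises `CalledProcessError` whenever the command exits non-zero. A minimal sketch of that behavior, using `false` as a stand-in for the failing `docker pull`:

```python
import subprocess

# Hypothetical stand-in for the docker pull command: `false` always
# exits with status 1, just as the unauthenticated pull does here.
pull_image_command = "false"

try:
    # check_output runs the command and raises if the exit status is non-zero.
    subprocess.check_output(pull_image_command.split())
except subprocess.CalledProcessError as err:
    print(f"returned non-zero exit status {err.returncode}")
```

This is why the deploy cell surfaces a `CalledProcessError` rather than a Docker error message; the underlying `docker pull` output goes to the notebook's stderr.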
