v2.102 crashes when launching PyTorch estimator job #3279

@rahul003

Describe the bug

ClientError                               Traceback (most recent call last)
<ipython-input-15-919fc4f433e5> in <module>
     50         disable_profiler=True,
     51         base_job_name=base_job_name,
---> 52         **kwargs
     53     )

~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/sagemaker/pytorch/estimator.py in __init__(self, entry_point, framework_version, py_version, source_dir, hyperparameters, image_uri, distribution, **kwargs)
    226             if instance_type[:3] == "ml.":
    227                 instance_type = instance_type[3:]
--> 228             validate_distribution_instance(self.sagemaker_session, distribution, instance_type)
    229 
    230             distribution = validate_distribution(

~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/sagemaker/fw_utils.py in validate_distribution_instance(sagemaker_session, distribution, instance_type)
    873 
    874     instance_desc = sagemaker_session.boto_session.client("ec2").describe_instance_types(
--> 875         InstanceTypes=[f"{instance_type}"]
    876     )
    877     if "GpuInfo" not in instance_desc["InstanceTypes"][0]:

~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
    506                 )
    507             # The "self" in this scope is referring to the BaseClient.
--> 508             return self._make_api_call(operation_name, kwargs)
    509 
    510         _api_call.__name__ = str(py_operation_name)

~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
    913             error_code = parsed_response.get("Error", {}).get("Code")
    914             error_class = self.exceptions.from_code(error_code)
--> 915             raise error_class(parsed_response, operation_name)
    916         else:
    917             return parsed_response

ClientError: An error occurred (UnauthorizedOperation) when calling the DescribeInstanceTypes operation: You are not authorized to perform this operation.
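The traceback shows `validate_distribution_instance` calling EC2 `describe_instance_types` and then reading `GpuInfo` from the response, so any role without permission for that call fails before the job is launched. As a rough sketch only (this is not the SDK's code or its fix), the check could treat a denied lookup as "unknown" instead of raising. The helper name `gpu_info_or_none` is hypothetical, and it assumes the raised exception carries a botocore-style `response` dict with an `Error.Code` entry, as the `ClientError` above does.

```python
def gpu_info_or_none(ec2_client, instance_type):
    """Return the GpuInfo dict for an instance type, or None if it is unknown.

    Hypothetical helper: `ec2_client` is anything exposing
    describe_instance_types(InstanceTypes=[...]), e.g. a boto3 EC2 client.
    """
    # Mirror the SDK's behaviour seen in the traceback: strip the "ml." prefix.
    if instance_type.startswith("ml."):
        instance_type = instance_type[3:]

    try:
        desc = ec2_client.describe_instance_types(InstanceTypes=[instance_type])
    except Exception as exc:
        # Assumption: botocore-style errors expose .response["Error"]["Code"].
        response = getattr(exc, "response", None) or {}
        code = response.get("Error", {}).get("Code")
        if code == "UnauthorizedOperation":
            # The caller lacks permission for the lookup, so skip the check.
            return None
        raise  # Anything else is a genuine failure and should surface.

    instance_types = desc.get("InstanceTypes", [])
    if not instance_types:
        return None
    # CPU-only instance types have no "GpuInfo" key, so this may also be None.
    return instance_types[0].get("GpuInfo")
```

With a real session this would presumably be called as `gpu_info_or_none(sagemaker_session.boto_session.client("ec2"), "ml.p4d.24xlarge")`, and a `None` result would mean the validation is skipped rather than aborting the launch.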

To reproduce
Run the following notebook with SageMaker Python SDK v2.102:
https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/model_parallel/gpt2/smp-train-gpt-simple.ipynb

Expected behavior
With v2.100, the same code works and the job launches.
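For anyone who needs to stay on v2.102, a possible workaround (untested here, and it may not be acceptable in locked-down accounts) is to grant the calling role the action named in the error message. The sketch below only builds the policy document as JSON; the variable names are illustrative, and attaching the policy to the notebook's execution role is left out.

```python
import json

# Minimal IAM policy allowing the EC2 call that the v2.102 check performs.
# DescribeInstanceTypes is a read-only, account-wide lookup, hence Resource "*".
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "ec2:DescribeInstanceTypes",
            "Resource": "*",
        }
    ],
}

policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

This does not address the regression itself: the expectation is still that launching a job should not require EC2 permissions that v2.100 did not need.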

Screenshots or logs
See the traceback under "Describe the bug" above.

System information

  • SageMaker Python SDK version: 2.102
  • Framework name (e.g. PyTorch) or algorithm (e.g. KMeans): PyTorch
  • Framework version: 1.11
  • Python version: 3.8
  • CPU or GPU: GPU
  • Custom Docker image (Y/N): N
