Skip to content

Estimator modifies input InstanceGroup configurations in-place, preventing them from being reused #4097

@saimidu

Description

@saimidu

Describe the bug
Framework Estimators such as PyTorch take the distribution parameter, where instance_groups is expected as a key-value map that tells the Estimator which instance-groups are available for it to use from the overall pool of instance-groups made available in the instance_group EstimatorBase parameter.

The Framework Estimator validates the distribution parameter as a part of the Estimator initialization. At the end of the validation function, the distribution["instance_groups"] value gets reassigned, from a list of InstanceGroup objects to a list of instance-group names.

Python passes non-primitive data structures such as dict and list by reference, which means that this in-place modification of the contents of the distribution["instance_groups"] dictionary value implicitly changes the contents of the input parameter without telling the user.

When I reuse the same distribution dictionary across multiple sequential SM Training Job attempts (such as this code where we retry the same job in a different region if we encounter CapacityErrors), it causes a ValueError in the validation step in the next attempt.

To reproduce
This is a simplified version of test_hc_smdataparallel_mnist. Run the following with a folder named mnist, an empty python script mnist/smdataparallel_mnist.py, and the role value changed to any SageMaker Execution IAM Role name:

mnist_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), "mnist")
region = "us-west-2"
region_2 = "us-east-1"
instance_count = 2
training_group = InstanceGroup("train_group", "ml.p4d.24xlarge", instance_count)
distribution = {
    "smdistributed": {"dataparallel": {"enabled": True}},
    "instance_groups": [training_group],
}
estimator_parameter = {
    "entry_point": "smdataparallel_mnist.py",
    "role": "SageMakerRole",
    "source_dir": mnist_path,
    "instance_groups": [training_group],
    "distribution": distribution,
}

sagemaker_session = sagemaker.Session(boto_session=boto3.Session(region_name=region))
pytorch = PyTorch(
    sagemaker_session=sagemaker_session,
    **estimator_parameter,
)

sagemaker_session_2 = sagemaker.Session(boto_session=boto3.Session(region_name=region_2))
pytorch = PyTorch(
    sagemaker_session=sagemaker_session_2,
    **estimator_parameter,
)

Expected behavior
The creation of Estimator objects should not modify input parameters, such as the distribution dict, in-place. In this specific scenario, the distribution["instance_groups"] list could continue to be a list of InstanceGroup objects, and any further usage of that dictionary key-value pair in the rest of the SDK code should just use <InstanceGroupObject>.instance_group_name.

Screenshots or logs

                if train_instance_group not in instance_groups:
                    # check if train instance groups belongs to what user defined in estimator set up
                    raise ValueError(
>                       f"Invalid training instance group {train_instance_group.instance_group_name} !"
                    )
E                   AttributeError: 'str' object has no attribute 'instance_group_name'

test_venv/lib/python3.8/site-packages/sagemaker/fw_utils.py:898: AttributeError

System information
A description of your system. Please provide:

  • SageMaker Python SDK version: v2.155.0, but code hasn't changed in latest version v2.181.0
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): all Frameworks, but this issue was reproduced on PyTorch
  • Framework version: N/A
  • Python version: N/A
  • CPU or GPU: N/A
  • Custom Docker image (Y/N): N

Additional context
Add any other context about the problem here.

Metadata

Metadata

Assignees

Labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions