Estimator modifies input InstanceGroup configurations in-place, preventing them from being reused

**Describe the bug**
Framework Estimators such as [PyTorch](https://github.com/aws/sagemaker-python-sdk/blob/b7539c4a779461a895678c3267396b1a2ae650ed/src/sagemaker/pytorch/estimator.py#L55) take a `distribution` parameter. Its `instance_groups` key holds a list of `InstanceGroup` objects that tells the Estimator which instance groups to use, out of the overall pool of instance groups passed in the [instance_groups](https://github.com/aws/sagemaker-python-sdk/blob/b7539c4a779461a895678c3267396b1a2ae650ed/src/sagemaker/estimator.py#L167) `EstimatorBase` parameter.

The Framework Estimator [validates](https://github.com/aws/sagemaker-python-sdk/blob/b7539c4a779461a895678c3267396b1a2ae650ed/src/sagemaker/pytorch/estimator.py#L270-L279) the `distribution` parameter as a part of the Estimator initialization. At the end of the validation function, the `distribution["instance_groups"]` value [gets reassigned](https://github.com/aws/sagemaker-python-sdk/blob/b7539c4a779461a895678c3267396b1a2ae650ed/src/sagemaker/fw_utils.py#L934-L935), from a list of `InstanceGroup` objects to a list of instance-group names.

Python passes arguments by object reference, so a mutable object such as a `dict` or `list` is shared between caller and callee rather than copied. Reassigning `distribution["instance_groups"]` inside the validation function therefore silently changes the contents of the caller's `distribution` dictionary.
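
The sharing behavior described above can be shown without the SDK. This is a minimal sketch: `Group` and `replace_groups_with_names` are hypothetical stand-ins written for illustration, not SDK code.

```python
class Group:
    """Hypothetical stand-in for an instance-group object."""

    def __init__(self, name):
        self.name = name


def replace_groups_with_names(config):
    # The function receives the caller's dict object itself, not a copy,
    # so assigning to a key here is visible to the caller.
    config["instance_groups"] = [group.name for group in config["instance_groups"]]


distribution = {"instance_groups": [Group("train_group")]}
replace_groups_with_names(distribution)

# The caller's dict now holds strings instead of Group objects.
print(distribution["instance_groups"])  # prints ['train_group']
```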

When I reuse the same `distribution` dictionary across multiple sequential SageMaker Training Job attempts (such as [this code](https://github.com/aws/deep-learning-containers/blob/163e4ed6b06dcdb81ec2c10e0065abeb3e69880a/test/sagemaker_tests/pytorch/training/integration/sagemaker/__init__.py#L80-L106), where we retry the same job in a different region if we encounter CapacityErrors), the next attempt fails in the [validation step](https://github.com/aws/sagemaker-python-sdk/blob/b7539c4a779461a895678c3267396b1a2ae650ed/src/sagemaker/fw_utils.py#L897-L899). The entries are now strings, so the membership check no longer matches, and building the `ValueError` message then raises an `AttributeError`, as shown in the log below.
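
Until this is fixed, one possible caller-side workaround is to hand each attempt its own shallow copy of the dictionary. The sketch below is an assumption-laden illustration: `FakeInstanceGroup` and `fake_validate` are hypothetical stand-ins that only mimic the behavior described in this issue, and the approach has not been verified against the SDK itself.

```python
class FakeInstanceGroup:
    """Hypothetical stand-in for sagemaker.instance_group.InstanceGroup."""

    def __init__(self, instance_group_name):
        self.instance_group_name = instance_group_name


def fake_validate(distribution, instance_groups):
    """Stand-in mimicking the check and reassignment described in this issue."""
    for group in distribution["instance_groups"]:
        if group not in instance_groups:
            raise ValueError(
                f"Invalid training instance group {group.instance_group_name} !"
            )
    # In-place reassignment: the objects are replaced by their names.
    distribution["instance_groups"] = [
        group.instance_group_name for group in distribution["instance_groups"]
    ]


training_group = FakeInstanceGroup("train_group")
distribution = {"instance_groups": [training_group]}

for region in ("us-west-2", "us-east-1"):
    # dict(...) makes a new outer dict that still references the same group
    # objects, so the membership check passes and only the copy is reassigned.
    fake_validate(dict(distribution), [training_group])

print(distribution["instance_groups"][0] is training_group)  # prints True
```

A shallow copy is used here rather than `copy.deepcopy` because the membership check compares the group objects themselves; in this stand-in, deep-copied groups would no longer match the ones passed as `instance_groups`.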

**To reproduce**
This is a simplified version of [test_hc_smdataparallel_mnist](https://github.com/aws/deep-learning-containers/blob/163e4ed6b06dcdb81ec2c10e0065abeb3e69880a/test/sagemaker_tests/pytorch/training/integration/sagemaker/test_smdataparallel.py#L181). Run the following next to a folder named `mnist` that contains an empty Python script `mnist/smdataparallel_mnist.py`, with the `role` value changed to the name of any SageMaker execution IAM role:
```
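import os

import boto3
import sagemaker
from sagemaker.instance_group import InstanceGroup
from sagemaker.pytorch import PyTorch

mnist_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), "mnist")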
region = "us-west-2"
region_2 = "us-east-1"
instance_count = 2
training_group = InstanceGroup("train_group", "ml.p4d.24xlarge", instance_count)
distribution = {
    "smdistributed": {"dataparallel": {"enabled": True}},
    "instance_groups": [training_group],
}
estimator_parameter = {
    "entry_point": "smdataparallel_mnist.py",
    "role": "SageMakerRole",
    "source_dir": mnist_path,
    "instance_groups": [training_group],
    "distribution": distribution,
}

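# First attempt: validation reassigns distribution["instance_groups"] in place.
# Depending on the SDK version, PyTorch may also need framework_version and
# py_version (or image_uri) to be passed.
sagemaker_session = sagemaker.Session(boto_session=boto3.Session(region_name=region))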
pytorch = PyTorch(
    sagemaker_session=sagemaker_session,
    **estimator_parameter,
)

# Second attempt: reuses the same (now modified) distribution dict and fails.
sagemaker_session_2 = sagemaker.Session(boto_session=boto3.Session(region_name=region_2))
pytorch = PyTorch(
    sagemaker_session=sagemaker_session_2,
    **estimator_parameter,
)
```

**Expected behavior**
The creation of `Estimator` objects should not modify input parameters, such as the `distribution` dict, in-place. In this specific scenario, the `distribution["instance_groups"]` list could continue to be a list of `InstanceGroup` objects, and any further usage of that dictionary key-value pair in the rest of the SDK code should just use [\<InstanceGroupObject\>.instance_group_name](https://github.com/aws/sagemaker-python-sdk/blob/b7539c4a779461a895678c3267396b1a2ae650ed/src/sagemaker/instance_group.py#L51).
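
A rough sketch of that non-mutating approach follows. `FakeInstanceGroup` and `training_group_names` are hypothetical names introduced for illustration; they are not existing SDK functions, and the real fix may look different.

```python
class FakeInstanceGroup:
    """Hypothetical stand-in for sagemaker.instance_group.InstanceGroup."""

    def __init__(self, instance_group_name):
        self.instance_group_name = instance_group_name


def training_group_names(distribution):
    """Derive the group names on demand without modifying the input dict."""
    return [
        group.instance_group_name
        for group in distribution.get("instance_groups", [])
    ]


group = FakeInstanceGroup("train_group")
distribution = {"instance_groups": [group]}

# Names can be derived any number of times; the input dict is left unchanged.
names_first = training_group_names(distribution)
names_second = training_group_names(distribution)
print(names_first, names_second)
```

Because the names are computed into a new list, the same `distribution` dictionary can be passed to a second Estimator without its entries having turned into strings.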

**Screenshots or logs**
```
                if train_instance_group not in instance_groups:
                    # check if train instance groups belongs to what user defined in estimator set up
                    raise ValueError(
>                       f"Invalid training instance group {train_instance_group.instance_group_name} !"
                    )
E                   AttributeError: 'str' object has no attribute 'instance_group_name'

test_venv/lib/python3.8/site-packages/sagemaker/fw_utils.py:898: AttributeError
```

**System information**
A description of your system. Please provide:
- **SageMaker Python SDK version**: v2.155.0, but the relevant code is unchanged in the latest version, v2.181.0
- **Framework name (eg. PyTorch) or algorithm (eg. KMeans)**: all Frameworks, but this issue was reproduced on PyTorch
- **Framework version**: N/A
- **Python version**: N/A
- **CPU or GPU**: N/A
- **Custom Docker image (Y/N)**: N

**Additional context**
Add any other context about the problem here.


Estimator modifies input InstanceGroup configurations in-place, preventing them from being reused #4097