-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Description
Describe the bug
Framework Estimators such as PyTorch take the distribution parameter, where instance_groups is expected as a key-value map that tells the Estimator which instance-groups are available for it to use from the overall pool of instance-groups made available in the instance_group EstimatorBase parameter.
The Framework Estimator validates the distribution parameter as a part of the Estimator initialization. At the end of the validation function, the distribution["instance_groups"] value gets reassigned, from a list of InstanceGroup objects to a list of instance-group names.
Python passes non-primitive data structures such as dict and list by reference, which means that this in-place modification of the contents of the distribution["instance_groups"] dictionary value implicitly changes the contents of the input parameter without telling the user.
When I reuse the same distribution dictionary across multiple sequential SM Training Job attempts (such as this code where we retry the same job in a different region if we encounter CapacityErrors), it causes a ValueError in the validation step in the next attempt.
To reproduce
This is a simplified version of test_hc_smdataparallel_mnist. Run the following with a folder named mnist, an empty python script mnist/smdataparallel_mnist.py, and the role value changed to any SageMaker Execution IAM Role name:
mnist_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), "mnist")
region = "us-west-2"
region_2 = "us-east-1"
instance_count = 2
training_group = InstanceGroup("train_group", "ml.p4d.24xlarge", instance_count)
distribution = {
"smdistributed": {"dataparallel": {"enabled": True}},
"instance_groups": [training_group],
}
estimator_parameter = {
"entry_point": "smdataparallel_mnist.py",
"role": "SageMakerRole",
"source_dir": mnist_path,
"instance_groups": [training_group],
"distribution": distribution,
}
sagemaker_session = sagemaker.Session(boto_session=boto3.Session(region_name=region))
pytorch = PyTorch(
sagemaker_session=sagemaker_session,
**estimator_parameter,
)
sagemaker_session_2 = sagemaker.Session(boto_session=boto3.Session(region_name=region_2))
pytorch = PyTorch(
sagemaker_session=sagemaker_session_2,
**estimator_parameter,
)
Expected behavior
The creation of Estimator objects should not modify input parameters, such as the distribution dict, in-place. In this specific scenario, the distribution["instance_groups"] list could continue to be a list of InstanceGroup objects, and any further usage of that dictionary key-value pair in the rest of the SDK code should just use <InstanceGroupObject>.instance_group_name.
Screenshots or logs
if train_instance_group not in instance_groups:
# check if train instance groups belongs to what user defined in estimator set up
raise ValueError(
> f"Invalid training instance group {train_instance_group.instance_group_name} !"
)
E AttributeError: 'str' object has no attribute 'instance_group_name'
test_venv/lib/python3.8/site-packages/sagemaker/fw_utils.py:898: AttributeError
System information
A description of your system. Please provide:
- SageMaker Python SDK version: v2.155.0, but code hasn't changed in latest version v2.181.0
- Framework name (eg. PyTorch) or algorithm (eg. KMeans): all Frameworks, but this issue was reproduced on PyTorch
- Framework version: N/A
- Python version: N/A
- CPU or GPU: N/A
- Custom Docker image (Y/N): N
Additional context
Add any other context about the problem here.