**Describe the bug**
When using the following:
```python
PARENT_TUNER = HyperparameterTuner.attach(
    tuning_job_name=PARENT_TUNING_JOB_NAME
)
```
...on a tuning job whose training job definition has:
```json
...
"StoppingCondition": {
    "MaxRuntimeInSeconds": 3600,
    "MaxWaitTimeInSeconds": 7200
},
"EnableNetworkIsolation": false,
"EnableInterContainerTrafficEncryption": false,
"EnableManagedSpotTraining": true
...
```
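The gap can be demonstrated by running the stopping-condition handling from the quoted SDK code against the definition above (a trimmed re-implementation for illustration, not the SDK itself):

```python
def prepare_init_params(job_details):
    """Trimmed mirror of _prepare_init_params_from_job_description:
    only the stopping-condition handling quoted in this issue."""
    init_params = {}
    init_params["max_run"] = job_details["StoppingCondition"]["MaxRuntimeInSeconds"]
    # No branch reads MaxWaitTimeInSeconds or EnableManagedSpotTraining,
    # so those settings are silently dropped.
    return init_params

job_details = {
    "StoppingCondition": {"MaxRuntimeInSeconds": 3600, "MaxWaitTimeInSeconds": 7200},
    "EnableManagedSpotTraining": True,
}
params = prepare_init_params(job_details)
print("max_wait" in params)  # False: the setting never reaches the estimator
```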
The max_wait and use_spot_instances settings on the attached estimator are both None. I traced this back to:
sagemaker-python-sdk/src/sagemaker/estimator.py, lines 811 to 874 at commit 481719f:

```python
def _prepare_init_params_from_job_description(cls, job_details, model_channel_name=None):
    """Convert the job description to init params that can be handled by the
    class constructor

    Args:
        job_details: the returned job details from a describe_training_job
            API call.
        model_channel_name (str): Name of the channel where pre-trained
            model data will be downloaded.

    Returns:
        dictionary: The transformed init_params
    """
    init_params = dict()

    init_params["role"] = job_details["RoleArn"]
    init_params["instance_count"] = job_details["ResourceConfig"]["InstanceCount"]
    init_params["instance_type"] = job_details["ResourceConfig"]["InstanceType"]
    init_params["volume_size"] = job_details["ResourceConfig"]["VolumeSizeInGB"]
    init_params["max_run"] = job_details["StoppingCondition"]["MaxRuntimeInSeconds"]
    init_params["input_mode"] = job_details["AlgorithmSpecification"]["TrainingInputMode"]
    init_params["base_job_name"] = base_from_name(job_details["TrainingJobName"])
    init_params["output_path"] = job_details["OutputDataConfig"]["S3OutputPath"]
    init_params["output_kms_key"] = job_details["OutputDataConfig"]["KmsKeyId"]
    if "EnableNetworkIsolation" in job_details:
        init_params["enable_network_isolation"] = job_details["EnableNetworkIsolation"]

    has_hps = "HyperParameters" in job_details
    init_params["hyperparameters"] = job_details["HyperParameters"] if has_hps else {}

    if "AlgorithmName" in job_details["AlgorithmSpecification"]:
        init_params["algorithm_arn"] = job_details["AlgorithmSpecification"]["AlgorithmName"]
    elif "TrainingImage" in job_details["AlgorithmSpecification"]:
        init_params["image_uri"] = job_details["AlgorithmSpecification"]["TrainingImage"]
    else:
        raise RuntimeError(
            "Invalid AlgorithmSpecification. Either TrainingImage or "
            "AlgorithmName is expected. None was found."
        )

    if "MetricDefinitons" in job_details["AlgorithmSpecification"]:
        init_params["metric_definitions"] = job_details["AlgorithmSpecification"][
            "MetricsDefinition"
        ]

    if "EnableInterContainerTrafficEncryption" in job_details:
        init_params["encrypt_inter_container_traffic"] = job_details[
            "EnableInterContainerTrafficEncryption"
        ]

    subnets, security_group_ids = vpc_utils.from_dict(job_details.get(vpc_utils.VPC_CONFIG_KEY))
    if subnets:
        init_params["subnets"] = subnets
    if security_group_ids:
        init_params["security_group_ids"] = security_group_ids

    if "InputDataConfig" in job_details and model_channel_name:
        for channel in job_details["InputDataConfig"]:
            if channel["ChannelName"] == model_channel_name:
                init_params["model_channel_name"] = model_channel_name
                init_params["model_uri"] = channel["DataSource"]["S3DataSource"]["S3Uri"]
                break

    return init_params
```
It seems use_spot_instances and max_wait do not get carried over to the newly created estimator.
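A minimal sketch of the kind of change that would close the gap (ours, not the SDK's actual fix), extracting the two missing fields from the same job description:

```python
def spot_init_params(job_details):
    """Sketch: the spot-training fields that
    _prepare_init_params_from_job_description currently drops."""
    params = {}
    if "EnableManagedSpotTraining" in job_details:
        params["use_spot_instances"] = job_details["EnableManagedSpotTraining"]
    max_wait = job_details.get("StoppingCondition", {}).get("MaxWaitTimeInSeconds")
    if max_wait is not None:
        params["max_wait"] = max_wait
    return params

job_details = {
    "StoppingCondition": {"MaxRuntimeInSeconds": 3600, "MaxWaitTimeInSeconds": 7200},
    "EnableManagedSpotTraining": True,
}
print(spot_init_params(job_details))  # {'use_spot_instances': True, 'max_wait': 7200}
```

Both keys are guarded, since non-spot jobs omit EnableManagedSpotTraining and MaxWaitTimeInSeconds from the describe response.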
**To reproduce**
See above.
**Expected behavior**
use_spot_instances, max_wait, etc. should all be carried over to the newly attach()ed tuner. This also affects warm start helpers like identical_data_and_algorithm().
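Until this is fixed, one workaround is to restore the fields on the attached estimator by hand (the helper name and stand-in objects below are ours; in practice the definition would come from boto3's describe_hyper_parameter_tuning_job response, under "TrainingJobDefinition"):

```python
from types import SimpleNamespace

def restore_spot_settings(tuner, training_job_definition):
    """Workaround sketch: copy the managed-spot fields from a tuning job's
    training definition onto the estimator of a tuner rebuilt via attach()."""
    tuner.estimator.use_spot_instances = training_job_definition.get(
        "EnableManagedSpotTraining", False
    )
    tuner.estimator.max_wait = training_job_definition.get(
        "StoppingCondition", {}
    ).get("MaxWaitTimeInSeconds")

# Stand-in for a real HyperparameterTuner, to show the effect:
tuner = SimpleNamespace(estimator=SimpleNamespace())
restore_spot_settings(tuner, {
    "StoppingCondition": {"MaxRuntimeInSeconds": 3600, "MaxWaitTimeInSeconds": 7200},
    "EnableManagedSpotTraining": True,
})
print(tuner.estimator.use_spot_instances, tuner.estimator.max_wait)  # True 7200
```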
**System information**
A description of your system. Please provide:
- **SageMaker Python SDK version**: v2.0.0
- **Framework name (eg. PyTorch) or algorithm (eg. KMeans)**:
- **Framework version**:
- **Python version**:
- **CPU or GPU**:
- **Custom Docker image (Y/N)**: N, official image classification image