diff --git a/sample/sagemaker/2017-07-24/service-2.json b/sample/sagemaker/2017-07-24/service-2.json index eb3132f..fdf9dfb 100644 --- a/sample/sagemaker/2017-07-24/service-2.json +++ b/sample/sagemaker/2017-07-24/service-2.json @@ -104,6 +104,32 @@ "output":{"shape":"BatchDescribeModelPackageOutput"}, "documentation":"

This action batch describes a list of versioned model packages

" }, + "BatchRebootClusterNodes":{ + "name":"BatchRebootClusterNodes", + "http":{ + "method":"POST", + "requestUri":"/" + }, + "input":{"shape":"BatchRebootClusterNodesRequest"}, + "output":{"shape":"BatchRebootClusterNodesResponse"}, + "errors":[ + {"shape":"ResourceNotFound"} + ], + "documentation":"

Reboots specific nodes within a SageMaker HyperPod cluster using a soft recovery mechanism. BatchRebootClusterNodes performs a graceful reboot of the specified nodes by calling the Amazon Elastic Compute Cloud RebootInstances API, which attempts to cleanly shut down the operating system before restarting the instance.

This operation is useful for recovering from transient issues or applying certain configuration changes that require a restart.

" + }, + "BatchReplaceClusterNodes":{ + "name":"BatchReplaceClusterNodes", + "http":{ + "method":"POST", + "requestUri":"/" + }, + "input":{"shape":"BatchReplaceClusterNodesRequest"}, + "output":{"shape":"BatchReplaceClusterNodesResponse"}, + "errors":[ + {"shape":"ResourceNotFound"} + ], + "documentation":"

Replaces specific nodes within a SageMaker HyperPod cluster with new hardware. BatchReplaceClusterNodes terminates the specified instances and provisions new replacement instances with the same configuration but fresh hardware. The Amazon Machine Image (AMI) and instance configuration remain the same.

This operation is useful for recovering from hardware failures or persistent issues that cannot be resolved through a reboot.

" + }, "CreateAction":{ "name":"CreateAction", "http":{ @@ -4587,6 +4613,30 @@ } }, "shapes":{ + "AcceleratorPartitionConfig":{ + "type":"structure", + "required":[ + "Type", + "Count" + ], + "members":{ + "Type":{ + "shape":"MIGProfileType", + "documentation":"

The Multi-Instance GPU (MIG) profile type that defines the partition configuration. The profile specifies the compute and memory allocation for each partition instance. The available profile types depend on the instance type specified in the compute quota configuration.

" + }, + "Count":{ + "shape":"AcceleratorPartitionConfigCountInteger", + "documentation":"

The number of accelerator partitions to allocate with the specified partition type. If you don't specify a value for vCPU and MemoryInGiB, SageMaker AI automatically allocates ratio-based values for those parameters based on the accelerator partition count you provide.

", + "box":true + } + }, + "documentation":"

Configuration for allocating accelerator partitions.

" + }, + "AcceleratorPartitionConfigCountInteger":{ + "type":"integer", + "max":10000000, + "min":0 + }, "AcceleratorsAmount":{ "type":"integer", "box":true, @@ -7153,6 +7203,242 @@ }, "documentation":"

Provides summary information about the model package.

" }, + "BatchRebootClusterNodeLogicalIdsError":{ + "type":"structure", + "required":[ + "NodeLogicalId", + "ErrorCode", + "Message" + ], + "members":{ + "NodeLogicalId":{ + "shape":"ClusterNodeLogicalId", + "documentation":"

The logical node ID of the node that encountered an error during the reboot operation.

" + }, + "ErrorCode":{ + "shape":"BatchRebootClusterNodesErrorCode", + "documentation":"

The error code associated with the error encountered when rebooting a node by logical node ID.

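Possible values:

InstanceIdNotFound: The node does not exist in the specified cluster.

InvalidInstanceStatus: The node is in a state that does not allow rebooting. Wait for the node to finish any ongoing changes before retrying.

InstanceIdInUse: Another operation is already in progress for this node. Wait for the operation to complete before retrying.

InternalServerError: An internal error occurred while processing this node.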

" + }, + "Message":{ + "shape":"String", + "documentation":"

A human-readable message describing the error encountered when rebooting a node by logical node ID.

" + } + }, + "documentation":"

Represents an error encountered when rebooting a node (identified by its logical node ID) from a SageMaker HyperPod cluster.

" + }, + "BatchRebootClusterNodeLogicalIdsErrors":{ + "type":"list", + "member":{"shape":"BatchRebootClusterNodeLogicalIdsError"}, + "max":25, + "min":0 + }, + "BatchRebootClusterNodesError":{ + "type":"structure", + "required":[ + "NodeId", + "ErrorCode", + "Message" + ], + "members":{ + "NodeId":{ + "shape":"ClusterNodeId", + "documentation":"

The EC2 instance ID of the node that encountered an error during the reboot operation.

" + }, + "ErrorCode":{ + "shape":"BatchRebootClusterNodesErrorCode", + "documentation":"

The error code associated with the error encountered when rebooting a node.

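Possible values:

InstanceIdNotFound: The instance does not exist in the specified cluster.

InvalidInstanceStatus: The instance is in a state that does not allow rebooting. Wait for the instance to finish any ongoing changes before retrying.

InstanceIdInUse: Another operation is already in progress for this node. Wait for the operation to complete before retrying.

InternalServerError: An internal error occurred while processing this node.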

" + }, + "Message":{ + "shape":"String", + "documentation":"

A human-readable message describing the error encountered when rebooting a node.

" + } + }, + "documentation":"

Represents an error encountered when rebooting a node from a SageMaker HyperPod cluster.

" + }, + "BatchRebootClusterNodesErrorCode":{ + "type":"string", + "enum":[ + "InstanceIdNotFound", + "InvalidInstanceStatus", + "InstanceIdInUse", + "InternalServerError" + ] + }, + "BatchRebootClusterNodesErrors":{ + "type":"list", + "member":{"shape":"BatchRebootClusterNodesError"}, + "max":25, + "min":0 + }, + "BatchRebootClusterNodesRequest":{ + "type":"structure", + "required":["ClusterName"], + "members":{ + "ClusterName":{ + "shape":"ClusterNameOrArn", + "documentation":"

The name or Amazon Resource Name (ARN) of the SageMaker HyperPod cluster containing the nodes to reboot.

" + }, + "NodeIds":{ + "shape":"BatchRebootClusterNodesRequestNodeIdsList", + "documentation":"

A list of EC2 instance IDs to reboot using soft recovery. You can specify between 1 and 25 instance IDs.

" + }, + "NodeLogicalIds":{ + "shape":"BatchRebootClusterNodesRequestNodeLogicalIdsList", + "documentation":"

A list of logical node IDs to reboot using soft recovery. You can specify between 1 and 25 logical node IDs.

The NodeLogicalId is a unique identifier that persists throughout the node's lifecycle and can be used to track nodes that are still being provisioned and don't yet have an EC2 instance ID assigned.

" + } + } + }, + "BatchRebootClusterNodesRequestNodeIdsList":{ + "type":"list", + "member":{"shape":"ClusterNodeId"}, + "max":25, + "min":1 + }, + "BatchRebootClusterNodesRequestNodeLogicalIdsList":{ + "type":"list", + "member":{"shape":"ClusterNodeLogicalId"}, + "max":25, + "min":1 + }, + "BatchRebootClusterNodesResponse":{ + "type":"structure", + "members":{ + "Successful":{ + "shape":"ClusterNodeIds", + "documentation":"

A list of EC2 instance IDs for which the reboot operation was successfully initiated.

" + }, + "Failed":{ + "shape":"BatchRebootClusterNodesErrors", + "documentation":"

A list of errors encountered for EC2 instance IDs that could not be rebooted. Each error includes the instance ID, an error code, and a descriptive message.

" + }, + "FailedNodeLogicalIds":{ + "shape":"BatchRebootClusterNodeLogicalIdsErrors", + "documentation":"

A list of errors encountered for logical node IDs that could not be rebooted. Each error includes the logical node ID, an error code, and a descriptive message. This field is only present when NodeLogicalIds were provided in the request.

" + }, + "SuccessfulNodeLogicalIds":{ + "shape":"ClusterNodeLogicalIdList", + "documentation":"

A list of logical node IDs for which the reboot operation was successfully initiated. This field is only present when NodeLogicalIds were provided in the request.

" + } + } + }, + "BatchReplaceClusterNodeLogicalIdsError":{ + "type":"structure", + "required":[ + "NodeLogicalId", + "ErrorCode", + "Message" + ], + "members":{ + "NodeLogicalId":{ + "shape":"ClusterNodeLogicalId", + "documentation":"

The logical node ID of the node that encountered an error during the replacement operation.

" + }, + "ErrorCode":{ + "shape":"BatchReplaceClusterNodesErrorCode", + "documentation":"

The error code associated with the error encountered when replacing a node by logical node ID.

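Possible values:

InstanceIdNotFound: The node does not exist in the specified cluster.

InvalidInstanceStatus: The node is in a state that does not allow replacement. Wait for the node to finish any ongoing changes before retrying.

InstanceIdInUse: Another operation is already in progress for this node. Wait for the operation to complete before retrying.

InternalServerError: An internal error occurred while processing this node.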

" + }, + "Message":{ + "shape":"String", + "documentation":"

A human-readable message describing the error encountered when replacing a node by logical node ID.

" + } + }, + "documentation":"

Represents an error encountered when replacing a node (identified by its logical node ID) in a SageMaker HyperPod cluster.

" + }, + "BatchReplaceClusterNodeLogicalIdsErrors":{ + "type":"list", + "member":{"shape":"BatchReplaceClusterNodeLogicalIdsError"}, + "max":25, + "min":0 + }, + "BatchReplaceClusterNodesError":{ + "type":"structure", + "required":[ + "NodeId", + "ErrorCode", + "Message" + ], + "members":{ + "NodeId":{ + "shape":"ClusterNodeId", + "documentation":"

The EC2 instance ID of the node that encountered an error during the replacement operation.

" + }, + "ErrorCode":{ + "shape":"BatchReplaceClusterNodesErrorCode", + "documentation":"

The error code associated with the error encountered when replacing a node.

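Possible values:

InstanceIdNotFound: The instance does not exist in the specified cluster.

InvalidInstanceStatus: The instance is in a state that does not allow replacement. Wait for the instance to finish any ongoing changes before retrying.

InstanceIdInUse: Another operation is already in progress for this node. Wait for the operation to complete before retrying.

InternalServerError: An internal error occurred while processing this node.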

" + }, + "Message":{ + "shape":"String", + "documentation":"

A human-readable message describing the error encountered when replacing a node.

" + } + }, + "documentation":"

Represents an error encountered when replacing a node in a SageMaker HyperPod cluster.

" + }, + "BatchReplaceClusterNodesErrorCode":{ + "type":"string", + "enum":[ + "InstanceIdNotFound", + "InvalidInstanceStatus", + "InstanceIdInUse", + "InternalServerError" + ] + }, + "BatchReplaceClusterNodesErrors":{ + "type":"list", + "member":{"shape":"BatchReplaceClusterNodesError"}, + "max":25, + "min":0 + }, + "BatchReplaceClusterNodesRequest":{ + "type":"structure", + "required":["ClusterName"], + "members":{ + "ClusterName":{ + "shape":"ClusterNameOrArn", + "documentation":"

The name or Amazon Resource Name (ARN) of the SageMaker HyperPod cluster containing the nodes to replace.

" + }, + "NodeIds":{ + "shape":"BatchReplaceClusterNodesRequestNodeIdsList", + "documentation":"

A list of EC2 instance IDs to replace with new hardware. You can specify between 1 and 25 instance IDs.

Replace operations destroy all instance volumes (root and secondary). Ensure you have backed up any important data before proceeding.

" + }, + "NodeLogicalIds":{ + "shape":"BatchReplaceClusterNodesRequestNodeLogicalIdsList", + "documentation":"

A list of logical node IDs to replace with new hardware. You can specify between 1 and 25 logical node IDs.

The NodeLogicalId is a unique identifier that persists throughout the node's lifecycle and can be used to track nodes that are still being provisioned and don't yet have an EC2 instance ID assigned.

" + } + } + }, + "BatchReplaceClusterNodesRequestNodeIdsList":{ + "type":"list", + "member":{"shape":"ClusterNodeId"}, + "max":25, + "min":1 + }, + "BatchReplaceClusterNodesRequestNodeLogicalIdsList":{ + "type":"list", + "member":{"shape":"ClusterNodeLogicalId"}, + "max":25, + "min":1 + }, + "BatchReplaceClusterNodesResponse":{ + "type":"structure", + "members":{ + "Successful":{ + "shape":"ClusterNodeIds", + "documentation":"

A list of EC2 instance IDs for which the replacement operation was successfully initiated.

" + }, + "Failed":{ + "shape":"BatchReplaceClusterNodesErrors", + "documentation":"

A list of errors encountered for EC2 instance IDs that could not be replaced. Each error includes the instance ID, an error code, and a descriptive message.

" + }, + "FailedNodeLogicalIds":{ + "shape":"BatchReplaceClusterNodeLogicalIdsErrors", + "documentation":"

A list of errors encountered for logical node IDs that could not be replaced. Each error includes the logical node ID, an error code, and a descriptive message. This field is only present when NodeLogicalIds were provided in the request.

" + }, + "SuccessfulNodeLogicalIds":{ + "shape":"ClusterNodeLogicalIdList", + "documentation":"

A list of logical node IDs for which the replacement operation was successfully initiated. This field is only present when NodeLogicalIds were provided in the request.

" + } + } + }, "BatchStrategy":{ "type":"string", "enum":[ @@ -9138,6 +9424,10 @@ "UltraServerInfo":{ "shape":"UltraServerInfo", "documentation":"

Contains information about the UltraServer.

" + }, + "PrivateDnsHostname":{ + "shape":"ClusterPrivateDnsHostname", + "documentation":"

The private DNS hostname of the SageMaker HyperPod cluster node.

" } }, "documentation":"

Lists a summary of the properties of an instance (also called a node interchangeably) of a SageMaker HyperPod cluster.

" @@ -9819,6 +10109,10 @@ "MemoryInGiB":{ "shape":"MemoryInGiBAmount", "documentation":"

The amount of memory in GiB to allocate. If you specify a value only for this parameter, SageMaker AI automatically allocates a ratio-based value for vCPU based on this memory that you provide. For example, if you allocate 200 out of 400 total memory in GiB, SageMaker AI uses the ratio of 0.5 and allocates values to vCPU. Accelerators are set to 0.

" + }, + "AcceleratorPartition":{ + "shape":"AcceleratorPartitionConfig", + "documentation":"

The accelerator partition configuration for fractional GPU allocation.

" } }, "documentation":"

Configuration of the resources used for the compute allocation definition.

" @@ -19647,7 +19941,7 @@ }, "TargetResources":{ "shape":"SageMakerResourceNames", - "documentation":"

The target resources (e.g., SageMaker Training Jobs, SageMaker HyperPod) that can use this training plan.

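Training plans are specific to their target resource.

A training plan designed for SageMaker training jobs can only be used to schedule and run training jobs.

A training plan for HyperPod clusters can be used exclusively to provide compute resources to a cluster's instance group.

" + "documentation":"

The target resources (e.g., SageMaker Training Jobs, SageMaker HyperPod, SageMaker Endpoints) that can use this training plan.

Training plans are specific to their target resource.

A training plan designed for SageMaker training jobs can only be used to schedule and run training jobs.

A training plan for HyperPod clusters can be used exclusively to provide compute resources to a cluster's instance group.

A training plan for SageMaker endpoints can be used exclusively to provide compute resources to SageMaker endpoints for model deployment.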

" }, "ReservedCapacitySummaries":{ "shape":"ReservedCapacitySummaries", @@ -30935,6 +31229,39 @@ "min":0, "pattern":"(https|s3)://([^/]+)/?(.*)" }, + "MIGProfileType":{ + "type":"string", + "enum":[ + "mig-1g.5gb", + "mig-1g.10gb", + "mig-1g.18gb", + "mig-1g.20gb", + "mig-1g.23gb", + "mig-1g.35gb", + "mig-1g.45gb", + "mig-1g.47gb", + "mig-2g.10gb", + "mig-2g.20gb", + "mig-2g.35gb", + "mig-2g.45gb", + "mig-2g.47gb", + "mig-3g.20gb", + "mig-3g.40gb", + "mig-3g.71gb", + "mig-3g.90gb", + "mig-3g.93gb", + "mig-4g.20gb", + "mig-4g.40gb", + "mig-4g.71gb", + "mig-4g.90gb", + "mig-4g.93gb", + "mig-7g.40gb", + "mig-7g.80gb", + "mig-7g.141gb", + "mig-7g.180gb", + "mig-7g.186gb" + ] + }, "MLFramework":{ "type":"string", "max":128, @@ -39164,7 +39491,8 @@ "type":"string", "enum":[ "training-job", - "hyperpod-cluster" + "hyperpod-cluster", + "endpoint" ] }, "SageMakerResourceNames":{ @@ -39504,7 +39832,7 @@ }, "TargetResources":{ "shape":"SageMakerResourceNames", - "documentation":"

The target resources (e.g., SageMaker Training Jobs, SageMaker HyperPod) to search for in the offerings.

Training plans are specific to their target resource.

" + "documentation":"

The target resources (e.g., SageMaker Training Jobs, SageMaker HyperPod, SageMaker Endpoints) to search for in the offerings.

Training plans are specific to their target resource.

" } } }, @@ -42375,7 +42703,7 @@ }, "TargetResources":{ "shape":"SageMakerResourceNames", - "documentation":"

The target resources (e.g., SageMaker Training Jobs, SageMaker HyperPod) for this training plan offering.

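Training plans are specific to their target resource.

A training plan designed for SageMaker training jobs can only be used to schedule and run training jobs.

A training plan for HyperPod clusters can be used exclusively to provide compute resources to a cluster's instance group.

" + "documentation":"

The target resources (e.g., SageMaker Training Jobs, SageMaker HyperPod, SageMaker Endpoints) for this training plan offering.

Training plans are specific to their target resource.

A training plan designed for SageMaker training jobs can only be used to schedule and run training jobs.

A training plan for HyperPod clusters can be used exclusively to provide compute resources to a cluster's instance group.

A training plan for SageMaker endpoints can be used exclusively to provide compute resources to SageMaker endpoints for model deployment.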

" }, "RequestedStartTimeAfter":{ "shape":"Timestamp", @@ -42519,7 +42847,7 @@ }, "TargetResources":{ "shape":"SageMakerResourceNames", - "documentation":"

The target resources (e.g., training jobs, HyperPod clusters) that can use this training plan.

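Training plans are specific to their target resource.

A training plan designed for SageMaker training jobs can only be used to schedule and run training jobs.

A training plan for HyperPod clusters can be used exclusively to provide compute resources to a cluster's instance group.

" + "documentation":"

The target resources (e.g., training jobs, HyperPod clusters, Endpoints) that can use this training plan.

Training plans are specific to their target resource.

A training plan designed for SageMaker training jobs can only be used to schedule and run training jobs.

A training plan for HyperPod clusters can be used exclusively to provide compute resources to a cluster's instance group.

A training plan for SageMaker endpoints can be used exclusively to provide compute resources to SageMaker endpoints for model deployment.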

" }, "ReservedCapacitySummaries":{ "shape":"ReservedCapacitySummaries", diff --git a/src/sagemaker_core/main/code_injection/shape_dag.py b/src/sagemaker_core/main/code_injection/shape_dag.py index c2ed820..a683a65 100644 --- a/src/sagemaker_core/main/code_injection/shape_dag.py +++ b/src/sagemaker_core/main/code_injection/shape_dag.py @@ -1,4 +1,11 @@ SHAPE_DAG = { + "AcceleratorPartitionConfig": { + "members": [ + {"name": "Type", "shape": "MIGProfileType", "type": "string"}, + {"name": "Count", "shape": "AcceleratorPartitionConfigCountInteger", "type": "integer"}, + ], + "type": "structure", + }, "AccessForbidden": { "members": [{"name": "Message", "shape": "Message", "type": "string"}], "type": "structure", @@ -1047,6 +1054,144 @@ "members": [{"name": "Errors", "shape": "BatchPutMetricsErrorList", "type": "list"}], "type": "structure", }, + "BatchRebootClusterNodeLogicalIdsError": { + "members": [ + {"name": "NodeLogicalId", "shape": "ClusterNodeLogicalId", "type": "string"}, + {"name": "ErrorCode", "shape": "BatchRebootClusterNodesErrorCode", "type": "string"}, + {"name": "Message", "shape": "String", "type": "string"}, + ], + "type": "structure", + }, + "BatchRebootClusterNodeLogicalIdsErrors": { + "member_shape": "BatchRebootClusterNodeLogicalIdsError", + "member_type": "structure", + "type": "list", + }, + "BatchRebootClusterNodesError": { + "members": [ + {"name": "NodeId", "shape": "ClusterNodeId", "type": "string"}, + {"name": "ErrorCode", "shape": "BatchRebootClusterNodesErrorCode", "type": "string"}, + {"name": "Message", "shape": "String", "type": "string"}, + ], + "type": "structure", + }, + "BatchRebootClusterNodesErrors": { + "member_shape": "BatchRebootClusterNodesError", + "member_type": "structure", + "type": "list", + }, + "BatchRebootClusterNodesRequest": { + "members": [ + {"name": "ClusterName", "shape": "ClusterNameOrArn", "type": "string"}, + { + "name": "NodeIds", + "shape": "BatchRebootClusterNodesRequestNodeIdsList", + "type": "list", + 
}, + { + "name": "NodeLogicalIds", + "shape": "BatchRebootClusterNodesRequestNodeLogicalIdsList", + "type": "list", + }, + ], + "type": "structure", + }, + "BatchRebootClusterNodesRequestNodeIdsList": { + "member_shape": "ClusterNodeId", + "member_type": "string", + "type": "list", + }, + "BatchRebootClusterNodesRequestNodeLogicalIdsList": { + "member_shape": "ClusterNodeLogicalId", + "member_type": "string", + "type": "list", + }, + "BatchRebootClusterNodesResponse": { + "members": [ + {"name": "Successful", "shape": "ClusterNodeIds", "type": "list"}, + {"name": "Failed", "shape": "BatchRebootClusterNodesErrors", "type": "list"}, + { + "name": "FailedNodeLogicalIds", + "shape": "BatchRebootClusterNodeLogicalIdsErrors", + "type": "list", + }, + { + "name": "SuccessfulNodeLogicalIds", + "shape": "ClusterNodeLogicalIdList", + "type": "list", + }, + ], + "type": "structure", + }, + "BatchReplaceClusterNodeLogicalIdsError": { + "members": [ + {"name": "NodeLogicalId", "shape": "ClusterNodeLogicalId", "type": "string"}, + {"name": "ErrorCode", "shape": "BatchReplaceClusterNodesErrorCode", "type": "string"}, + {"name": "Message", "shape": "String", "type": "string"}, + ], + "type": "structure", + }, + "BatchReplaceClusterNodeLogicalIdsErrors": { + "member_shape": "BatchReplaceClusterNodeLogicalIdsError", + "member_type": "structure", + "type": "list", + }, + "BatchReplaceClusterNodesError": { + "members": [ + {"name": "NodeId", "shape": "ClusterNodeId", "type": "string"}, + {"name": "ErrorCode", "shape": "BatchReplaceClusterNodesErrorCode", "type": "string"}, + {"name": "Message", "shape": "String", "type": "string"}, + ], + "type": "structure", + }, + "BatchReplaceClusterNodesErrors": { + "member_shape": "BatchReplaceClusterNodesError", + "member_type": "structure", + "type": "list", + }, + "BatchReplaceClusterNodesRequest": { + "members": [ + {"name": "ClusterName", "shape": "ClusterNameOrArn", "type": "string"}, + { + "name": "NodeIds", + "shape": 
"BatchReplaceClusterNodesRequestNodeIdsList", + "type": "list", + }, + { + "name": "NodeLogicalIds", + "shape": "BatchReplaceClusterNodesRequestNodeLogicalIdsList", + "type": "list", + }, + ], + "type": "structure", + }, + "BatchReplaceClusterNodesRequestNodeIdsList": { + "member_shape": "ClusterNodeId", + "member_type": "string", + "type": "list", + }, + "BatchReplaceClusterNodesRequestNodeLogicalIdsList": { + "member_shape": "ClusterNodeLogicalId", + "member_type": "string", + "type": "list", + }, + "BatchReplaceClusterNodesResponse": { + "members": [ + {"name": "Successful", "shape": "ClusterNodeIds", "type": "list"}, + {"name": "Failed", "shape": "BatchReplaceClusterNodesErrors", "type": "list"}, + { + "name": "FailedNodeLogicalIds", + "shape": "BatchReplaceClusterNodeLogicalIdsErrors", + "type": "list", + }, + { + "name": "SuccessfulNodeLogicalIds", + "shape": "ClusterNodeLogicalIdList", + "type": "list", + }, + ], + "type": "structure", + }, "BatchTransformInput": { "members": [ {"name": "DataCapturedDestinationS3Uri", "shape": "DestinationS3Uri", "type": "string"}, @@ -1689,6 +1834,7 @@ "type": "structure", }, {"name": "UltraServerInfo", "shape": "UltraServerInfo", "type": "structure"}, + {"name": "PrivateDnsHostname", "shape": "ClusterPrivateDnsHostname", "type": "string"}, ], "type": "structure", }, @@ -1961,6 +2107,11 @@ {"name": "Accelerators", "shape": "AcceleratorsAmount", "type": "integer"}, {"name": "VCpu", "shape": "VCpuAmount", "type": "float"}, {"name": "MemoryInGiB", "shape": "MemoryInGiBAmount", "type": "float"}, + { + "name": "AcceleratorPartition", + "shape": "AcceleratorPartitionConfig", + "type": "structure", + }, ], "type": "structure", }, diff --git a/src/sagemaker_core/main/resources.py b/src/sagemaker_core/main/resources.py index f6610c4..0e4d21c 100644 --- a/src/sagemaker_core/main/resources.py +++ b/src/sagemaker_core/main/resources.py @@ -29088,7 +29088,7 @@ class TrainingPlan(Base): unhealthy_instance_count: The number of instances 
in the training plan that are currently in an unhealthy state. available_spare_instance_count: The number of available spare instances in the training plan. total_ultra_server_count: The total number of UltraServers reserved to this training plan. - target_resources: The target resources (e.g., SageMaker Training Jobs, SageMaker HyperPod) that can use this training plan. Training plans are specific to their target resource. A training plan designed for SageMaker training jobs can only be used to schedule and run training jobs. A training plan for HyperPod clusters can be used exclusively to provide compute resources to a cluster's instance group. + target_resources: The target resources (e.g., SageMaker Training Jobs, SageMaker HyperPod, SageMaker Endpoints) that can use this training plan. Training plans are specific to their target resource. A training plan designed for SageMaker training jobs can only be used to schedule and run training jobs. A training plan for HyperPod clusters can be used exclusively to provide compute resources to a cluster's instance group. A training plan for SageMaker endpoints can be used exclusively to provide compute resources to SageMaker endpoints for model deployment. reserved_capacity_summaries: The list of Reserved Capacity providing the underlying compute resources of the plan. """ diff --git a/src/sagemaker_core/main/shapes.py b/src/sagemaker_core/main/shapes.py index 6d91f87..d927e09 100644 --- a/src/sagemaker_core/main/shapes.py +++ b/src/sagemaker_core/main/shapes.py @@ -452,6 +452,21 @@ class RawMetricData(Base): step: Optional[int] = Unassigned() +class AcceleratorPartitionConfig(Base): + """ + AcceleratorPartitionConfig + Configuration for allocating accelerator partitions. + + Attributes + ---------------------- + type: The Multi-Instance GPU (MIG) profile type that defines the partition configuration. The profile specifies the compute and memory allocation for each partition instance. 
The available profile types depend on the instance type specified in the compute quota configuration. + count: The number of accelerator partitions to allocate with the specified partition type. If you don't specify a value for vCPU and MemoryInGiB, SageMaker AI automatically allocates ratio-based values for those parameters based on the accelerator partition count you provide. + """ + + type: str + count: int + + class ActionSource(Base): """ ActionSource @@ -2659,6 +2674,74 @@ class BatchDescribeModelPackageOutput(Base): ) +class BatchRebootClusterNodeLogicalIdsError(Base): + """ + BatchRebootClusterNodeLogicalIdsError + Represents an error encountered when rebooting a node (identified by its logical node ID) from a SageMaker HyperPod cluster. + + Attributes + ---------------------- + node_logical_id: The logical node ID of the node that encountered an error during the reboot operation. + error_code: The error code associated with the error encountered when rebooting a node by logical node ID. Possible values: InstanceIdNotFound: The node does not exist in the specified cluster. InvalidInstanceStatus: The node is in a state that does not allow rebooting. Wait for the node to finish any ongoing changes before retrying. InstanceIdInUse: Another operation is already in progress for this node. Wait for the operation to complete before retrying. InternalServerError: An internal error occurred while processing this node. + message: A human-readable message describing the error encountered when rebooting a node by logical node ID. + """ + + node_logical_id: str + error_code: str + message: str + + +class BatchRebootClusterNodesError(Base): + """ + BatchRebootClusterNodesError + Represents an error encountered when rebooting a node from a SageMaker HyperPod cluster. + + Attributes + ---------------------- + node_id: The EC2 instance ID of the node that encountered an error during the reboot operation. 
+ error_code: The error code associated with the error encountered when rebooting a node. Possible values: InstanceIdNotFound: The instance does not exist in the specified cluster. InvalidInstanceStatus: The instance is in a state that does not allow rebooting. Wait for the instance to finish any ongoing changes before retrying. InstanceIdInUse: Another operation is already in progress for this node. Wait for the operation to complete before retrying. InternalServerError: An internal error occurred while processing this node. + message: A human-readable message describing the error encountered when rebooting a node. + """ + + node_id: str + error_code: str + message: str + + +class BatchReplaceClusterNodeLogicalIdsError(Base): + """ + BatchReplaceClusterNodeLogicalIdsError + Represents an error encountered when replacing a node (identified by its logical node ID) in a SageMaker HyperPod cluster. + + Attributes + ---------------------- + node_logical_id: The logical node ID of the node that encountered an error during the replacement operation. + error_code: The error code associated with the error encountered when replacing a node by logical node ID. Possible values: InstanceIdNotFound: The node does not exist in the specified cluster. InvalidInstanceStatus: The node is in a state that does not allow replacement. Wait for the node to finish any ongoing changes before retrying. InstanceIdInUse: Another operation is already in progress for this node. Wait for the operation to complete before retrying. InternalServerError: An internal error occurred while processing this node. + message: A human-readable message describing the error encountered when replacing a node by logical node ID. + """ + + node_logical_id: str + error_code: str + message: str + + +class BatchReplaceClusterNodesError(Base): + """ + BatchReplaceClusterNodesError + Represents an error encountered when replacing a node in a SageMaker HyperPod cluster. 
+ + Attributes + ---------------------- + node_id: The EC2 instance ID of the node that encountered an error during the replacement operation. + error_code: The error code associated with the error encountered when replacing a node. Possible values: InstanceIdNotFound: The instance does not exist in the specified cluster. InvalidInstanceStatus: The instance is in a state that does not allow replacement. Wait for the instance to finish any ongoing changes before retrying. InstanceIdInUse: Another operation is already in progress for this node. Wait for the operation to complete before retrying. InternalServerError: An internal error occurred while processing this node. + message: A human-readable message describing the error encountered when replacing a node. + """ + + node_id: str + error_code: str + message: str + + class MonitoringCsvDatasetFormat(Base): """ MonitoringCsvDatasetFormat @@ -3904,6 +3987,7 @@ class ClusterNodeSummary(Base): last_software_update_time: The time when SageMaker last updated the software of the instances in the cluster. instance_status: The status of the instance. ultra_server_info: Contains information about the UltraServer. + private_dns_hostname: The private DNS hostname of the SageMaker HyperPod cluster node. """ instance_group_name: str @@ -3914,6 +3998,7 @@ class ClusterNodeSummary(Base): node_logical_id: Optional[str] = Unassigned() last_software_update_time: Optional[datetime.datetime] = Unassigned() ultra_server_info: Optional[UltraServerInfo] = Unassigned() + private_dns_hostname: Optional[str] = Unassigned() class ClusterOrchestratorEksConfig(Base): @@ -4327,6 +4412,7 @@ class ComputeQuotaResourceConfig(Base): accelerators: The number of accelerators to allocate. If you don't specify a value for vCPU and MemoryInGiB, SageMaker AI automatically allocates ratio-based values for those parameters based on the number of accelerators you provide. 
For example, if you allocate 16 out of 32 total accelerators, SageMaker AI uses the ratio of 0.5 and allocates values to vCPU and MemoryInGiB. v_cpu: The number of vCPU to allocate. If you specify a value only for vCPU, SageMaker AI automatically allocates ratio-based values for MemoryInGiB based on this vCPU parameter. For example, if you allocate 20 out of 40 total vCPU, SageMaker AI uses the ratio of 0.5 and allocates values to MemoryInGiB. Accelerators are set to 0. memory_in_gi_b: The amount of memory in GiB to allocate. If you specify a value only for this parameter, SageMaker AI automatically allocates a ratio-based value for vCPU based on this memory that you provide. For example, if you allocate 200 out of 400 total memory in GiB, SageMaker AI uses the ratio of 0.5 and allocates values to vCPU. Accelerators are set to 0. + accelerator_partition: The accelerator partition configuration for fractional GPU allocation. """ instance_type: str @@ -4334,6 +4420,7 @@ class ComputeQuotaResourceConfig(Base): accelerators: Optional[int] = Unassigned() v_cpu: Optional[float] = Unassigned() memory_in_gi_b: Optional[float] = Unassigned() + accelerator_partition: Optional[AcceleratorPartitionConfig] = Unassigned() class ResourceSharingConfig(Base): @@ -12455,7 +12542,7 @@ class TrainingPlanSummary(Base): available_instance_count: The number of instances currently available for use in this training plan. in_use_instance_count: The number of instances currently in use from this training plan. total_ultra_server_count: The total number of UltraServers allocated to this training plan. - target_resources: The target resources (e.g., training jobs, HyperPod clusters) that can use this training plan. Training plans are specific to their target resource. A training plan designed for SageMaker training jobs can only be used to schedule and run training jobs. A training plan for HyperPod clusters can be used exclusively to provide compute resources to a cluster's instance group. 
+ target_resources: The target resources (e.g., training jobs, HyperPod clusters, Endpoints) that can use this training plan. Training plans are specific to their target resource. A training plan designed for SageMaker training jobs can only be used to schedule and run training jobs. A training plan for HyperPod clusters can be used exclusively to provide compute resources to a cluster's instance group. A training plan for SageMaker endpoints can be used exclusively to provide compute resources to SageMaker endpoints for model deployment. reserved_capacity_summaries: A list of reserved capacities associated with this training plan, including details such as instance types, counts, and availability zones. """ @@ -13688,7 +13775,7 @@ class TrainingPlanOffering(Base): Attributes ---------------------- training_plan_offering_id: The unique identifier for this training plan offering. - target_resources: The target resources (e.g., SageMaker Training Jobs, SageMaker HyperPod) for this training plan offering. Training plans are specific to their target resource. A training plan designed for SageMaker training jobs can only be used to schedule and run training jobs. A training plan for HyperPod clusters can be used exclusively to provide compute resources to a cluster's instance group. + target_resources: The target resources (e.g., SageMaker Training Jobs, SageMaker HyperPod, SageMaker Endpoints) for this training plan offering. Training plans are specific to their target resource. A training plan designed for SageMaker training jobs can only be used to schedule and run training jobs. A training plan for HyperPod clusters can be used exclusively to provide compute resources to a cluster's instance group. A training plan for SageMaker endpoints can be used exclusively to provide compute resources to SageMaker endpoints for model deployment. requested_start_time_after: The requested start time that the user specified when searching for the training plan offering. 
requested_end_time_before: The requested end time that the user specified when searching for the training plan offering. duration_hours: The number of whole hours in the total duration for this training plan offering. diff --git a/src/sagemaker_core/tools/api_coverage.json b/src/sagemaker_core/tools/api_coverage.json index 22cda04..d5155c5 100644 --- a/src/sagemaker_core/tools/api_coverage.json +++ b/src/sagemaker_core/tools/api_coverage.json @@ -1 +1 @@ -{"SupportedAPIs": 365, "UnsupportedAPIs": 15} \ No newline at end of file +{"SupportedAPIs": 365, "UnsupportedAPIs": 17} \ No newline at end of file