
PySparkProcessor - Possibility to choose different instance types for the driver node and the worker nodes #3616

@HarryPommier

Description

Describe the feature you'd like
It would be nice to be able to choose different instance types for the driver node and the worker nodes when using the PySparkProcessor.

How would this feature be used? Please describe.

from sagemaker.spark.processing import PySparkProcessor

pyspark_processor = PySparkProcessor(
    base_job_name=...,
    framework_version=...,
    role=...,
    # Proposed new parameters: a larger instance for the Spark driver
    # and smaller instances for the workers.
    driver_instance_type="ml.m5.4xlarge",
    worker_instance_type="ml.m5.large",
    instance_count=...,
    sagemaker_session=pipeline_session,
    max_runtime_in_seconds=...,
)

Describe alternatives you've considered
It is possible to choose a high-memory instance type for all nodes, but this can be unnecessarily costly for the user.
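
For reference, a minimal sketch of this workaround with the existing API, assuming a SageMaker pipeline session; the framework version, job name, and instance count are illustrative. The single instance_type parameter applies to the driver and every worker, so the whole cluster ends up sized for the driver's peak memory.

from sagemaker.spark.processing import PySparkProcessor

# Current workaround: one instance_type is shared by the driver and all workers.
pyspark_processor = PySparkProcessor(
    base_job_name="spark-workaround",     # illustrative job name
    framework_version="3.1",              # illustrative Spark version
    role=role,
    instance_type="ml.m5.4xlarge",        # sized for the driver's peak memory
    instance_count=4,                     # workers are forced onto the same type
    sagemaker_session=pipeline_session,
)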

Additional context
Some PySpark operations (e.g. .toPandas()) are memory-intensive on the driver node.
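
To make this concrete, here is a minimal PySpark sketch (the S3 path is hypothetical): .toPandas() collects the entire distributed DataFrame into the driver process, so the driver needs enough memory for the whole dataset while the workers sit mostly idle.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The DataFrame is partitioned across the worker nodes while it stays distributed.
df = spark.read.parquet("s3://my-bucket/large-dataset/")  # hypothetical path

# .toPandas() pulls every row back into the driver, so the driver's memory,
# not the workers', becomes the bottleneck.
pdf = df.toPandas()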
