ProcessingStep in SageMaker Pipelines - allow customisation of the S3 destination path #3278

@tomaszdudek7

Description

Hi,

I want to fully customize the S3 destination path of my ProcessingStep. For example, I'd like to save its results to s3://bucket/<pipeline_execution_id>/preprocessing/data.csv.

However, it seems you cannot customize it that way. The ProcessingOutput and its destination parameter seem to be explicitly IGNORED, as can be seen in the source code.
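
For reference, a rough sketch of what I am trying to do (the processor type, framework version, role, script name and instance settings are placeholders, not my actual pipeline):

from sagemaker.processing import ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.execution_variables import ExecutionVariables
from sagemaker.workflow.functions import Join
from sagemaker.workflow.steps import ProcessingStep

# the path I want the output written to
desired_destination = Join(
    on="/",
    values=["s3://bucket", ExecutionVariables.PIPELINE_EXECUTION_ID, "preprocessing"],
)

processor = SKLearnProcessor(
    framework_version="1.0-1",                       # placeholder
    role="arn:aws:iam::123456789012:role/example",   # placeholder
    instance_type="ml.m5.xlarge",                    # placeholder
    instance_count=1,
)

step = ProcessingStep(
    name="Preprocessing",
    processor=processor,
    code="preprocessing.py",                         # placeholder
    outputs=[
        ProcessingOutput(
            output_name="data",
            source="/opt/ml/processing/output",
            # this destination is what appears to be ignored
            destination=desired_destination,
        )
    ],
)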

Question: Why is that the case? Are there any workarounds?


APPENDIX 1:

Such behavior is fully possible when using Transformer and TransformStep.

If you use sagemaker.workflow.functions.Join and sagemaker.workflow.execution_variables.ExecutionVariables, you can create the desired path as follows:

path_that_i_want = Join(on='/', values=["s3://bucket", ExecutionVariables.PIPELINE_EXECUTION_ID, "transformer"])

and then set it as the output_path of the Transformer:

from sagemaker.transformer import Transformer
from sagemaker.workflow.steps import TransformStep

trf = Transformer(
    # other params omitted
    output_path=path_that_i_want
)

trf_step = TransformStep(
    # other params omitted
    transformer=trf
)

If I then add trf_step to my Pipeline, its result will be saved to s3://bucket/<execution_id>/transformer.
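
For completeness, wiring that step into a pipeline looks roughly like this (the pipeline name is a placeholder):

from sagemaker.workflow.pipeline import Pipeline

pipeline = Pipeline(
    name="example-pipeline",   # placeholder name
    steps=[trf_step],
)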


APPENDIX 2:

Such behavior is partially possible with the Estimator.

Similarly to Appendix 1, if you specify the output_path of the Estimator as follows:

path_that_i_want = Join(on='/', values=["s3://bucket", ExecutionVariables.PIPELINE_EXECUTION_ID, "estimator"])

and then pass it to the Estimator, your models will be saved to s3://bucket/<execution_id>/estimator/pipelines-<execution_id>-<step_name>-<some_weird_id>/.
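
A rough sketch of that Estimator setup (the image URI, role and instance settings are placeholders):

from sagemaker.estimator import Estimator
from sagemaker.workflow.execution_variables import ExecutionVariables
from sagemaker.workflow.functions import Join
from sagemaker.workflow.steps import TrainingStep

path_that_i_want = Join(
    on="/",
    values=["s3://bucket", ExecutionVariables.PIPELINE_EXECUTION_ID, "estimator"],
)

estimator = Estimator(
    image_uri="<training-image-uri>",                # placeholder
    role="arn:aws:iam::123456789012:role/example",   # placeholder
    instance_type="ml.m5.xlarge",                    # placeholder
    instance_count=1,
    output_path=path_that_i_want,                    # model artifacts land under this prefix
)

train_step = TrainingStep(
    name="Train",
    estimator=estimator,
)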

This partially does the trick, but still adds an unnecessary prefix in S3 in the form of pipelines-<execution_id>-<step_name>-<some_weird_id>/.


APPENDIX 3:

(so that this doesn't become an X-Y question).

The reason I ask is that I don't like the current pathing of SM Pipelines. The S3 artifacts are grouped by step and then by execution; I'd like them grouped by execution and then by step.

Having this:

  • s3://bucket/<execution_id>/models
  • s3://bucket/<execution_id>/preprocessing
  • s3://bucket/<execution_id>/postprocessing

would be way better than having this:

  • s3://bucket/models/<execution_id>
  • s3://bucket/preprocessing/<execution_id>
  • s3://bucket/postprocessing/<execution_id>

If something goes wrong with a given execution, you can quickly browse all of its artifacts just by opening one subdirectory, namely <execution_id>. The current layout requires you to jump between many subdirectories.
