ProcessingStep in SageMaker Pipelines - allow customisation of the S3 destination path #3278

@tomaszdudek7

Description

Hi,

I want to fully customize the S3 destination path of my ProcessingStep. For example, I'd like to save its results to s3://bucket/<pipeline_execution_id>/preprocessing/data.csv.

However, it seems you cannot customize it that way. The ProcessingOutput and its destination parameter seem to be explicitly IGNORED, as can be seen in the source code.
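
For reference, a rough sketch of what I am trying to do (the processor type, framework version, role, script name and instance settings are placeholders, not my actual pipeline):

from sagemaker.processing import ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.execution_variables import ExecutionVariables
from sagemaker.workflow.functions import Join
from sagemaker.workflow.steps import ProcessingStep

# the path I want the output written to
desired_destination = Join(
    on="/",
    values=["s3://bucket", ExecutionVariables.PIPELINE_EXECUTION_ID, "preprocessing"],
)

processor = SKLearnProcessor(
    framework_version="1.0-1",                       # placeholder
    role="arn:aws:iam::123456789012:role/example",   # placeholder
    instance_type="ml.m5.xlarge",                    # placeholder
    instance_count=1,
)

step = ProcessingStep(
    name="Preprocessing",
    processor=processor,
    code="preprocessing.py",                         # placeholder
    outputs=[
        ProcessingOutput(
            output_name="data",
            source="/opt/ml/processing/output",
            # this destination is what appears to be ignored
            destination=desired_destination,
        )
    ],
)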

Question: Why is that the case? Are there any workarounds?


APPENDIX 1:

Such behavior is fully possible when using Transformer and TransformStep.

If you use sagemaker.workflow.functions.Join and sagemaker.workflow.execution_variables.ExecutionVariables, you can create the desired path as follows:

path_that_i_want = Join(on='/', values=["s3://bucket", ExecutionVariables.PIPELINE_EXECUTION_ID, "transformer"])

and then set it as the output_path of the Transformer:

from sagemaker.transformer import Transformer
from sagemaker.workflow.steps import TransformStep

trf = Transformer(
    # other params omitted
    output_path=path_that_i_want
)

trf_step = TransformStep(
    # other params omitted
    transformer=trf
)

If I then add trf_step to my Pipeline, its result will be saved to s3://bucket/<execution_id>/transformer.
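
For completeness, wiring that step into a pipeline looks roughly like this (the pipeline name is a placeholder):

from sagemaker.workflow.pipeline import Pipeline

pipeline = Pipeline(
    name="example-pipeline",   # placeholder name
    steps=[trf_step],
)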


APPENDIX 2:

Such behavior is partially possible with the Estimator.

Similarly to Appendix 1, if you specify the output_path of the Estimator as follows:

path_that_i_want = Join(on='/', values=["s3://bucket", ExecutionVariables.PIPELINE_EXECUTION_ID, "estimator"])

and then pass it to the Estimator, your models will be saved to s3://bucket/<execution_id>/estimator/pipelines-<execution_id>-<step_name>-<some_weird_id>/.
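
A rough sketch of that Estimator setup (the image URI, role and instance settings are placeholders):

from sagemaker.estimator import Estimator
from sagemaker.workflow.execution_variables import ExecutionVariables
from sagemaker.workflow.functions import Join
from sagemaker.workflow.steps import TrainingStep

path_that_i_want = Join(
    on="/",
    values=["s3://bucket", ExecutionVariables.PIPELINE_EXECUTION_ID, "estimator"],
)

estimator = Estimator(
    image_uri="<training-image-uri>",                # placeholder
    role="arn:aws:iam::123456789012:role/example",   # placeholder
    instance_type="ml.m5.xlarge",                    # placeholder
    instance_count=1,
    output_path=path_that_i_want,                    # model artifacts land under this prefix
)

train_step = TrainingStep(
    name="Train",
    estimator=estimator,
)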

This partially does the trick, but still adds an unnecessary prefix in S3 in the form of pipelines-<execution_id>-<step_name>-<some_weird_id>/.


APPENDIX 3:

(so that this doesn't become an X-Y question).

The reason I ask is that I don't like the current pathing of SM Pipelines. The S3 artifacts are grouped by step and then by execution; I'd like them grouped by execution and then by step.

Having this:

  • s3://bucket/<execution_id>/models
  • s3://bucket/<execution_id>/preprocessing
  • s3://bucket/<execution_id>/postprocessing

would be way better than having this:

  • s3://bucket/models/<execution_id>
  • s3://bucket/preprocessing/<execution_id>
  • s3://bucket/postprocessing/<execution_id>

If something goes wrong with a given execution, you can quickly browse all of its artifacts just by opening one subdirectory, namely <execution_id>. The current layout requires you to jump between many subdirectories.
