Description
Hi,
I want to fully customize the S3 destination path of my ProcessingStep. For example, I'd like to save its results to s3://bucket/<pipeline_execution_id>/preprocessing/data.csv.
However, it seems like you cannot customize it that way. The ProcessingOutput and its destination parameter seem to be explicitly IGNORED, as can be seen in the source code.
Question: Why is that the case? Are there any workarounds?
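For concreteness, here is a minimal sketch of what I am trying to do (the bucket name, source path, and output name are placeholders); as far as I can tell, the destination below gets ignored when the output is used in a ProcessingStep:
from sagemaker.processing import ProcessingOutput
from sagemaker.workflow.functions import Join
from sagemaker.workflow.execution_variables import ExecutionVariables

desired_destination = Join(
    on='/',
    values=["s3://bucket", ExecutionVariables.PIPELINE_EXECUTION_ID, "preprocessing"],
)

output = ProcessingOutput(
    output_name="data",
    source="/opt/ml/processing/output",  # path inside the processing container
    destination=desired_destination,     # this destination seems to be ignored
)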
APPENDIX 1:
Such behavior is fully possible when using Transformer and TransformStep.
If you use sagemaker.workflow.functions.Join and sagemaker.workflow.execution_variables.ExecutionVariables, you can create the desired path as follows:
path_that_i_want = Join(on='/', values=["s3://bucket", ExecutionVariables.PIPELINE_EXECUTION_ID, "transformer"])
and then set it as the output_path of the Transformer:
from sagemaker.transformer import Transformer
from sagemaker.workflow.steps import TransformStep

trf = Transformer(
    # other params omitted
    output_path=path_that_i_want,
)
trf_step = TransformStep(
    # other params omitted
    transformer=trf,
)
If I then add trf_step to my Pipeline, its result will be saved to s3://bucket/<execution_id>/transformer.
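For completeness, a rough sketch of wiring the step into a pipeline (the pipeline name is a placeholder, and role/session setup is omitted):
from sagemaker.workflow.pipeline import Pipeline

pipeline = Pipeline(
    name="my-pipeline",  # placeholder name
    steps=[trf_step],    # other steps and parameters omitted
)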
APPENDIX 2:
Such behavior is partially possible with the Estimator.
Similar to appendix 1, if you create the path as follows:
path_that_i_want = Join(on='/', values=["s3://bucket", ExecutionVariables.PIPELINE_EXECUTION_ID, "estimator"])
and then set it as the output_path of the Estimator, your models will be saved to s3://bucket/<execution_id>/estimator/pipelines-<execution_id>-<step_name>-<some_weird_id>/.
This partially does the trick, but it still adds an unnecessary prefix in S3, in the form of pipelines-<execution_id>-<step_name>-<some_weird_id>/.
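A minimal sketch of what that looks like (all other constructor arguments are omitted, in the same spirit as the Transformer example above; only output_path matters here):
from sagemaker.estimator import Estimator

est = Estimator(
    # image_uri, role, instance settings, etc. omitted
    output_path=path_that_i_want,
)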
APPENDIX 3:
(To avoid this becoming an X-Y question.)
The reason I ask is that I don't like the current pathing of SM Pipelines. The S3 artifacts are grouped by step and then by execution; I'd like to group them by execution and then by step.
Having this:
s3://bucket/<execution_id>/models
s3://bucket/<execution_id>/preprocessing
s3://bucket/<execution_id>/postprocessing
would be way better than having this:
s3://bucket/models/<execution_id>
s3://bucket/preprocessing/<execution_id>
s3://bucket/postprocessing/<execution_id>
If something goes wrong with a given execution, you'd be able to quickly browse all artifacts of that execution just by opening one subdirectory, namely <execution_id>. The current state requires you to juggle between many subdirectories.