Use of source_dir argument for FrameworkProcessor disables step caching functionality #4246

@francisco-camargo

Describe the bug
I am looking for a Processor that lets me specify a requirements.txt per container/step and lets the container import custom code from my own scripts, in addition to the script passed via the code argument. sagemaker.processing.FrameworkProcessor seems like a good fit for this.

However, when I use the source_dir argument (which enables me to import custom code), it seems to prevent step caching from working.
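
Before attributing the cache misses to source_dir itself, it may be worth confirming that the directory contents are byte-identical between runs (for example, that no generated files such as __pycache__ are landing in it). A minimal sketch, assuming a hypothetical helper named fingerprint_dir that is not part of the SageMaker SDK:

```python
import hashlib
import os


def fingerprint_dir(root: str) -> str:
    """Return a SHA-256 hex digest over the relative paths and contents of all files under root."""
    digest = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        # Sort in place so the traversal order is identical on every run.
        dirnames.sort()
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root).replace(os.sep, "/")
            # Hash the relative path so renames change the fingerprint too.
            digest.update(rel.encode("utf-8"))
            digest.update(b"\0")
            with open(path, "rb") as handle:
                digest.update(handle.read())
            digest.update(b"\0")
    return digest.hexdigest()
```

Calling fingerprint_dir(BASE_DIR) before two consecutive pipeline runs and comparing the results shows whether the directory itself changed. If the digests match and caching still misses, the difference must come from somewhere else in the step definition.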

To reproduce

Here is code for which step caching does work:

I define

    import os

    BASE_DIR = os.path.dirname(os.path.realpath(__file__))

and the processing step is defined as

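    from sagemaker.processing import FrameworkProcessor, ProcessingOutput
    from sagemaker.sklearn.estimator import SKLearn
    from sagemaker.workflow.steps import CacheConfig, ProcessingStep

    # processing_instance_type, processing_instance_count, base_job_name,
    # pipeline_session, role, input_data, random_seed, train_fraction,
    # validation_fraction and step_name are defined elsewhere in the pipeline script.
    sklearn_processor = FrameworkProcessor(
        estimator_cls=SKLearn,
        framework_version="0.23-1",
        instance_type=processing_instance_type,
        instance_count=processing_instance_count,
        base_job_name=base_job_name,
        sagemaker_session=pipeline_session,
        role=role,
        )
    # With a pipeline session, run() does not start a job; it returns the step arguments.
    step_args = sklearn_processor.run(
        outputs=[
            ProcessingOutput(output_name="train", source="/opt/ml/processing/train"),
            ProcessingOutput(output_name="validation", source="/opt/ml/processing/validation"),
            ProcessingOutput(output_name="test", source="/opt/ml/processing/test"),
        ],
        code=os.path.join(BASE_DIR, "preprocess.py"),
        arguments=[
            '--input-data', input_data,
            '--random-seed', random_seed,
            '--train-fraction', train_fraction,
            '--validation-fraction', validation_fraction,
            ],
        )
    # Step caching is enabled here with a 12-hour expiry.
    step_process = ProcessingStep(
        name=step_name,
        step_args=step_args,
        cache_config=CacheConfig(enable_caching=True, expire_after="T12h"),
        )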

This works as expected: the first run takes about 5 minutes, and subsequent runs take only a couple of seconds.

If I switch to

        code="preprocess.py",
        source_dir=BASE_DIR,

caching no longer works, and the pipeline takes 5 minutes to execute every time.
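
Step caching reuses a previous execution only when the step is invoked with matching arguments, so one way to narrow this down is to build the pipeline twice and compare the two definitions. The sketch below is a generic JSON diff. The names diff_json and diff_definitions are hypothetical helpers, not SDK functions, and the assumption is that the two definition strings come from something like Pipeline.definition() on two separately built pipelines.

```python
import json


def diff_json(a, b, path="$"):
    """Return path strings for every location where two parsed JSON values differ."""
    if isinstance(a, dict) and isinstance(b, dict):
        out = []
        for key in sorted(set(a) | set(b)):
            sub = f"{path}.{key}"
            if key not in a or key not in b:
                # Key present on one side only.
                out.append(sub)
            else:
                out.extend(diff_json(a[key], b[key], sub))
        return out
    if isinstance(a, list) and isinstance(b, list):
        if len(a) != len(b):
            return [path]
        out = []
        for i, (x, y) in enumerate(zip(a, b)):
            out.extend(diff_json(x, y, f"{path}[{i}]"))
        return out
    # Scalars, or values of mismatched types.
    return [] if a == b else [path]


def diff_definitions(first: str, second: str):
    """Compare two pipeline definition JSON strings and list the differing paths."""
    return diff_json(json.loads(first), json.loads(second))
```

Any path reported under the processing step's arguments (for example an S3 URI that changes on every build) is a candidate explanation for the cache miss. An empty list would point away from the definition and toward something else.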

Note: if I instead switch to

        code=os.path.join(BASE_DIR, "preprocess.py"),
        source_dir=BASE_DIR,

the SageMaker Pipeline fails.
[Image: screenshot of the failed pipeline execution]
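
The two variants above suggest that, when source_dir is set, code is expected to be a path relative to that directory. This is an inference from the behavior reported here, not a confirmed contract of the SDK. Under that assumption, a small hypothetical helper (entry_point_relative_to, not an SDK function) can normalize either form of code before it is passed to run():

```python
import os


def entry_point_relative_to(code: str, source_dir: str) -> str:
    """Return code as a path relative to source_dir; raise if it lies outside source_dir."""
    source_abs = os.path.abspath(source_dir)
    # A relative code path is interpreted as relative to source_dir (assumption).
    code_abs = code if os.path.isabs(code) else os.path.join(source_abs, code)
    code_abs = os.path.abspath(code_abs)
    rel = os.path.relpath(code_abs, source_abs)
    if rel == os.pardir or rel.startswith(os.pardir + os.sep):
        raise ValueError(f"{code!r} is not inside source_dir {source_dir!r}")
    return rel
```

With this, both code="preprocess.py" and code=os.path.join(BASE_DIR, "preprocess.py") reduce to the same relative entry point when source_dir=BASE_DIR.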

System information

  • SageMaker Python SDK version: sagemaker-2.196.0
  • Framework name (e.g. PyTorch) or algorithm (e.g. KMeans):
  • Framework version:
  • Python version: 3.11
  • CPU or GPU:
  • Custom Docker image (Y/N):
