
ScriptProcessor does not check local_code config before uploading code to S3 #3560

Open
lodo1995 opened this issue Dec 27, 2022 · 4 comments

Comments

@lodo1995

Describe the bug
When a LocalSession or LocalPipelineSession is configured to use local code, as follows

session.config = {'local': {'local_code': True}}

the code passed to a pipeline ProcessingStep or directly to the run method of a processor (ScriptProcessor, FrameworkProcessor, ...) should not be uploaded to S3.

However, ScriptProcessor does not honor this setting. Its _include_code_in_inputs method (called by _normalize_args of the base class Processor, which runs both when the processor is invoked directly and when it runs through a pipeline) always tries to upload the code to S3.

def _include_code_in_inputs(self, inputs, code, kms_key=None):
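
For context, the upload itself boils down to an unconditional call to the S3 uploader. The sketch below is a simplified approximation of that behavior (helper names and the exact key layout are illustrative, not copied verbatim from the SDK source); note that nothing consults the local.local_code setting:

from sagemaker import s3

def _include_code_in_inputs(self, inputs, code, kms_key=None):
    # Illustrative approximation: the user script is always uploaded to the
    # session's default bucket, regardless of the 'local.local_code' setting.
    desired_s3_uri = s3.s3_path_join(
        's3://',
        self.sagemaker_session.default_bucket(),
        self._current_job_name,
        'input',
        'code',
    )
    code_s3_uri = s3.S3Uploader.upload(
        local_path=code,
        desired_s3_uri=desired_s3_uri,
        kms_key=kms_key,
        sagemaker_session=self.sagemaker_session,
    )
    # ...the uploaded script is then added to `inputs` as a ProcessingInput
    # and set as the container entrypoint.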

Compare this with the Model class, used for example in the TrainingStep. Its _upload_code method checks the session configuration and skips the upload to S3 when local code is enabled.

def _upload_code(self, key_prefix: str, repack: bool = False) -> None:
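
The relevant part of that check looks roughly like the following (a simplified sketch, not the verbatim SDK source); an equivalent guard is what _include_code_in_inputs is missing:

from sagemaker import utils

def _upload_code(self, key_prefix: str, repack: bool = False) -> None:
    # Simplified sketch: when the session is in local mode and
    # 'local.local_code' is enabled, skip the S3 upload entirely.
    local_code = utils.get_config_value('local.local_code', self.sagemaker_session.config)
    if (self.sagemaker_session.local_mode and local_code) or self.entry_point is None:
        self.uploaded_code = None
        return
    # ...otherwise the entry point / source_dir is uploaded to S3 as usual.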

To reproduce
In the absence of any AWS credentials (which should not be needed when running completely locally), the following code fails while trying to upload the processing.py script to S3 (raising botocore.exceptions.NoCredentialsError). Note that, in addition to the following code, a processing.py file must exist in the working directory (its contents don't matter).

Code
import boto3
import sagemaker
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.pipeline_context import LocalPipelineSession
from sagemaker.processing import ProcessingInput, ProcessingOutput, ScriptProcessor
from sagemaker.workflow.steps import ProcessingStep

role = 'arn:aws:iam::123456789012:role/MyRole'

local_pipeline_session = LocalPipelineSession(boto_session = boto3.Session(region_name = 'eu-west-1'))
local_pipeline_session.config = {'local': {'local_code': True}}

script_processor = ScriptProcessor(
    image_uri = 'docker.io/library/python:3.8',
    command = ['python'],
    instance_type = 'local',
    instance_count = 1,
    sagemaker_session = local_pipeline_session,
    role = role,
)

processing_step = ProcessingStep(
    name = 'Processing Step',
    processor = script_processor,
    code = 'processing.py',
    inputs = [
        ProcessingInput(
            source = './input-data',
            destination = '/opt/ml/processing/input',
        )
    ],
    outputs = [
        ProcessingOutput(
            source = '/opt/ml/processing/output',
            destination = './output-data',
        )
    ],
)

pipeline = Pipeline(
    name = 'MyPipeline',
    steps = [processing_step],
    sagemaker_session = local_pipeline_session
)

pipeline.upsert(role_arn = role)

pipeline_run = pipeline.start()

System information

  • SageMaker Python SDK version: 2.126.0
@lodo1995 lodo1995 added the bug label Dec 27, 2022
@clausagerskov

@lodo1995 Any developments? Is local development actually possible at the moment?

@lodo1995
Author

@clausagerskov In general, local development is partially possible: some things do work, others (such as the one described in this bug) don't. Your mileage may vary.

Regarding this specific bug, as far as I can tell no AWS developer has even looked at it, nor at any of the other bugs I opened. I don't have time to take care of all of this, so I decided to just avoid SageMaker for the time being.

@Adamwgoh

Adamwgoh commented Jan 9, 2024

Following this as well.

https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines-local-mode.html

Although the documentation says local mode is supported, it still requires the user to upload the input code to S3, as described in this issue. If the upload is forced as a side effect, why is it required at all when local mode is meant to use local resources to run the pipeline?

@clausagerskov

I wonder if it is possible to emulate an S3 location without having to pay for some tool.
