
FrameworkProcessor doesn't install packages from requirements.txt when run in a SageMaker Project #3166

@mstfldmr

Description

Describe the bug
I created a pipeline with a single ProcessingStep based on FrameworkProcessor. When I upsert() and start() the pipeline from a notebook, it runs successfully.

The code lives in a SageMaker Project generated from an AWS-provided template. When I push a change to CodeCommit, the processing job fails because of a missing package: although the package is listed in source_dir/requirements.txt, it is never installed.

Traceback (most recent call last):
  File "preprocess.py", line 7, in <module>
    import sagemaker
ModuleNotFoundError: No module named 'sagemaker'
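
For context, the source_dir handed to the processor (see the reproduction below) is laid out roughly as follows; the requirements.txt content is inferred from the traceback:

preprocessing/               <- source_dir passed to get_run_args()
├── preprocess.py            <- entry point (code="preprocess.py")
└── requirements.txt         <- lists sagemaker; expected to be pip-installed before preprocess.py runs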

To reproduce

import os

import sagemaker
import sagemaker.session
import sagemaker.sklearn.estimator
from sagemaker.processing import FrameworkProcessor, ProcessingOutput
from sagemaker.workflow.parameters import ParameterInteger, ParameterString
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep

# BASE_DIR and get_session() are defined by the template-generated pipeline.py;
# BASE_DIR points at the directory containing this file.
BASE_DIR = os.path.dirname(os.path.realpath(__file__))


def get_pipeline(
    region,
    sagemaker_project_arn=None,
    role=None,
    default_bucket=None,
    model_package_group_name="MstfPackageGroup",
    pipeline_name="MstfPipeline",
    base_job_prefix="Mstf",
    feature_group_name="ninja-x5-y3-feature-group-02-14-30-03",
    from_date="2021-05-13",
    to_date="2022-05-13"
):

    sagemaker_session = get_session(region, default_bucket)
    s3_default_bucket = sagemaker_session.default_bucket()
    if role is None:
        role = sagemaker.session.get_execution_role(sagemaker_session)

    # parameters for pipeline execution
    processing_instance_count = ParameterInteger(name="ProcessingInstanceCount", default_value=1)
    processing_instance_type = ParameterString(
        name="ProcessingInstanceType", default_value="ml.m5.xlarge"
    )
    training_instance_type = ParameterString(
        name="TrainingInstanceType", default_value="ml.m5.xlarge"
    )
    model_approval_status = ParameterString(
        name="ModelApprovalStatus", default_value="PendingManualApproval"
    )
    
    feature_group_name_input = ParameterString(
        name="InputFeatureGroupName",
        default_value=feature_group_name
    )
    
    input_s3_bucket = ParameterString(
        name="DefaultS3Bucket",
        default_value=f"s3://{s3_default_bucket}/{base_job_prefix}"
    )
    
    input_from_date = ParameterString(
        name="DataFromDate",
        default_value=from_date
    )
    
    input_to_date = ParameterString(
        name="DataToDate",
        default_value=to_date
    )
    
    est_cls = sagemaker.sklearn.estimator.SKLearn
    framework_version_str = "0.23-1"

    script_processor = FrameworkProcessor(
        estimator_cls=est_cls,
        framework_version=framework_version_str,
        role=role,
        instance_count=1,
        instance_type="ml.m5.xlarge",
        sagemaker_session=sagemaker_session
    )
    
    processor_run_args = script_processor.get_run_args(
        code="preprocess.py",
        source_dir=os.path.join(BASE_DIR, "preprocessing"),
        inputs=[],
        outputs=[
            ProcessingOutput(output_name="train", source="/opt/ml/processing/train"),
            ProcessingOutput(output_name="validation", source="/opt/ml/processing/validation"),
            ProcessingOutput(output_name="test", source="/opt/ml/processing/test"),
        ],
        arguments=[
            "--feature_group_name", feature_group_name_input,
            "--input_s3_bucket", input_s3_bucket,
            "--from_date", input_from_date,
            "--to_date", input_to_date,
        ],
    )
    
    step_process = ProcessingStep(
        name="PreprocessMstfData",
        processor=script_processor,
        inputs=processor_run_args.inputs,
        outputs=processor_run_args.outputs,
        job_arguments=processor_run_args.arguments,
        code=processor_run_args.code,
    )


    # pipeline instance
    pipeline = Pipeline(
        name=pipeline_name,
        parameters=[
            processing_instance_type,
            processing_instance_count,
            training_instance_type,
            model_approval_status,
            feature_group_name_input,
            input_s3_bucket,
            input_from_date,
            input_to_date
        ],
        steps=[step_process],
        sagemaker_session=sagemaker_session,
    )
    return pipeline
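
For completeness, the pipeline is driven roughly like this from the notebook, where everything works; the region value is a placeholder:

import sagemaker

role = sagemaker.get_execution_role()                   # notebook execution role
pipeline = get_pipeline(region="eu-west-1", role=role)
pipeline.upsert(role_arn=role)                          # create or update the pipeline definition
execution = pipeline.start()                            # here the processing job installs requirements.txt and succeeds
execution.wait()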

Expected behavior
The sagemaker package should be installed, because it is listed in requirements.txt, and preprocess.py should then run successfully.
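
For reference, the top of preprocess.py looks roughly like the sketch below; only the failing import and the argument names are taken from the report, the rest is assumed:

# preprocess.py (sketch) -- entry point run by the ProcessingStep
import argparse

import sagemaker  # the import that fails when requirements.txt has not been installed

parser = argparse.ArgumentParser()
parser.add_argument("--feature_group_name")
parser.add_argument("--input_s3_bucket")
parser.add_argument("--from_date")
parser.add_argument("--to_date")
args = parser.parse_args()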

System information
A description of your system. Please provide:

  • SageMaker Python SDK version: 2.86.2
  • Framework name (e.g. PyTorch) or algorithm (e.g. KMeans): SKLearn
  • Framework version: 0.23-1
  • Python version: 3.7
  • CPU or GPU: CPU
  • Custom Docker image (Y/N): N
