Sagemaker Pipeline DHCP error in VPC config #3955

@m-rajput

Description

Describe the bug
When running a SageMaker pipeline with a VPC network config, it throws the following error:

ClientError: ClientError: expected DHCP options to include keys domain-name-servers and domain-name, but missing one or more attributes: {DhcpOptionsId:dopt-XXXXXXXXXXXXXXXXX DomainNameServers:[0xc009bd0e08] DomainNameSearch:<nil>}.Please refer to https://docs.aws.amazon.com/vpc/latest/userguide/DHCPOptionSet.html for more details

It seems to require the DHCP option set to include both domain-name and domain-name-servers, but ours doesn't have a domain-name, as shown below.

Our DHCP option set:

  • DHCP option set ID: dopt-XXXXXXXXXXXXXXXXX
  • NetBIOS name servers: –
  • Domain name: –
  • NetBIOS node type: –
  • Domain name servers: XX.X.X.X
  • Owner: XXXXXXXXXXXX
  • NTP servers: –

Reporting this as a bug because SageMaker Studio and numerous other AWS services run fine in the same VPC.
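
For context, here is a minimal boto3 sketch (placeholder IDs and example values, not taken from our account) of how to confirm which keys the VPC's DHCP option set actually defines, and what a workaround of attaching a replacement option set that also defines domain-name might look like:

import boto3

ec2 = boto3.client("ec2")

# Look up the DHCP option set attached to the VPC (placeholder VPC ID)
vpc = ec2.describe_vpcs(VpcIds=["vpc-XXXXXXXXXXXXXXXXX"])["Vpcs"][0]
dopt_id = vpc["DhcpOptionsId"]

# Print the keys the option set defines; in our case only domain-name-servers is present
dopt = ec2.describe_dhcp_options(DhcpOptionsIds=[dopt_id])["DhcpOptions"][0]
print({c["Key"]: [v["Value"] for v in c["Values"]] for c in dopt["DhcpConfigurations"]})

# Possible workaround (untested here): attach a replacement option set that also
# defines domain-name, keeping the existing DNS server(s)
new_dopt = ec2.create_dhcp_options(
    DhcpConfigurations=[
        {"Key": "domain-name", "Values": ["ec2.internal"]},   # example value
        {"Key": "domain-name-servers", "Values": ["XX.X.X.X"]},  # keep existing servers
    ]
)["DhcpOptions"]
ec2.associate_dhcp_options(DhcpOptionsId=new_dopt["DhcpOptionsId"], VpcId=vpc["VpcId"])

Swapping the option set affects the whole VPC, so this is only a workaround; the point of the report is that this check appears stricter than what Studio and other services require.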

To reproduce

  1. Create a SageMaker pipeline and, in the first processing step of the pipeline, add the network config.

  2. NetworkConfig instance:

import sagemaker.network

network_config = sagemaker.network.NetworkConfig(
    security_group_ids=["sg-XXXXXXXXXXXXXXXXX"],
    subnets=["subnet-XXXXXXXXXXXXXXXXX"],
)
  3. Define the pipeline parameters and the ScriptProcessor, passing the network config:
# Imports needed for the snippets below (SageMaker Python SDK 2.x module paths)
import os

from sagemaker.processing import ProcessingInput, ProcessingOutput, ScriptProcessor
from sagemaker.workflow.parameters import ParameterInteger, ParameterString
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.workflow.steps import CacheConfig, ProcessingStep

processing_instance_count = ParameterInteger(
    name="ProcessingInstanceCount", default_value=1
)
processing_instance_type = ParameterString(
    name="ProcessingInstanceType", default_value="ml.t3.medium"
)
training_instance_type = ParameterString(
    name="TrainingInstanceType", default_value="ml.t3.medium"
)

param_postgres_db = ParameterString(
    name="PostgreSqlDBName",
)
param_clickhouse_db = ParameterString(
    name="ClickHouseDBName",
)
param_event_id = ParameterString(
    name="EventId",
)
param_role_arn = ParameterString(name="SagemakerExecutionRoleARN")

pipeline_session = PipelineSession()
cache_config = CacheConfig(enable_caching=True, expire_after="PT12H")
BASE_DIR = os.path.dirname(os.path.realpath(__file__))
default_bucket = <ADD_BUCKET>

# Step 1 - query and build dataset
dataset_builder_processor = ScriptProcessor(
    role=param_role_arn,
    image_uri=custom_image_uri,  # ECR URI of the custom image (see Additional context)
    command=["python"],
    instance_count=processing_instance_count,
    instance_type=processing_instance_type,
    volume_size_in_gb=20,
    max_runtime_in_seconds=7200,
    base_job_name="Dataset-Builder",
    sagemaker_session=pipeline_session,
    network_config=network_config,
)
  4. Create the processing step:
dataset_process = ProcessingStep(
        name="Query-Data-and-Build-Features",
        processor=dataset_builder_processor,
        inputs=[
            ProcessingInput(
                source=(
                    f"s3://{default_bucket}/"
                    "Feature-Groups-Builder-Pipeline/src_module/"
                ),
                destination="/opt/ml/processing/input/code/src/",
                input_name="src_module",
            )
        ],
        outputs=[
            ProcessingOutput(
                output_name="visitor_features_df",
                source="/opt/ml/processing/output/visitor_features_df",
            ),
            ProcessingOutput(
                output_name="exhibitor_features_df",
                source="/opt/ml/processing/output/exhibitor_features_df",
            ),
            ProcessingOutput(
                output_name="visitor_embeddings_df",
                source="/opt/ml/processing/output/visitor_embeddings_df",
            ),
            ProcessingOutput(
                output_name="exhibitor_embeddings_df",
                source="/opt/ml/processing/output/exhibitor_embeddings_df",
            ),
            ProcessingOutput(
                output_name="vis_exh_interactions_df",
                source="/opt/ml/processing/output/vis_exh_interactions_df",
            ),
            ProcessingOutput(
                output_name="idx_mapper",
                source="/opt/ml/processing/output/idx_mapper",
            ),
        ],
        code=os.path.join(BASE_DIR, "dataset_builder.py"),
        job_arguments=[
            "--postgres_db",
            param_postgres_db,
            "--clickhouse_db",
            param_clickhouse_db,
            "--event_id",
            param_event_id.to_string(),
            "--role_arn",
            param_role_arn,
        ],
        cache_config=cache_config,
    )
  5. Create the pipeline instance:
# pipeline instance
pipeline = Pipeline(
    name=pipeline_name,
    parameters=[
        processing_instance_type,
        processing_instance_count,
        training_instance_type,
        param_postgres_db,
        param_clickhouse_db,
        param_event_id,
        param_role_arn,
    ],
    steps=[dataset_process],  # there are other steps, but execution never reaches them
    sagemaker_session=pipeline_session,
)
  6. Upsert and run the pipeline:
upsert_response = pipeline.upsert(
    role_arn=<SAGEMAKER_PIPELINE_EXECUTION_ROLE>,
    description="Build number: 181",
)

execution = pipeline.start(
    PostgreSqlDBName="postgres_db_name",
    ClickHouseDBName="ch_db_name",
    EventId="99",
    SagemakerExecutionRoleARN=<SAGEMAKER_PIPELINE_EXECUTION_ROLE>,
)
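
For reference, this is roughly how the per-step status and the failure reason shown in the screenshot can be pulled from the execution object (list_steps() wraps the ListPipelineExecutionSteps API; the key names below are as that API returns them):

# Wait for the execution to finish (it fails on the first step), then print
# each step's status and the FailureReason reported by SageMaker.
try:
    execution.wait()
except Exception:
    pass  # wait() raises once the execution ends in a Failed state

for step in execution.list_steps():
    print(step["StepName"], step["StepStatus"], step.get("FailureReason"))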

Expected behavior
The SageMaker pipeline should not throw this error.

Screenshots or logs
[Screenshot of the failed pipeline step, 2023-06-23]

This step failed. For more information, view the logs

ClientError: ClientError: expected DHCP options to include keys domain-name-servers and domain-name, but missing one or more attributes: {DhcpOptionsId:dopt-XXXXXXXXXXXXXXXXX DomainNameServers:[0xc009bd0e08] DomainNameSearch:<nil>}.Please refer to https://docs.aws.amazon.com/vpc/latest/userguide/DHCPOptionSet.html for more details

System information

  • SageMaker Python SDK version: 2.161.0
  • Framework name (e.g. PyTorch) or algorithm (e.g. KMeans): -
  • Framework version: -
  • Python version: 3.9
  • CPU or GPU: CPU
  • Custom Docker image (Y/N): Yes (Base image python:3.9-slim-buster)

Additional context

  1. Custom image used in processing step:
FROM python:3.9-slim-buster

RUN pip3 install --no-cache-dir \
    numpy==1.24.3 \
    pandas==2.0.2 \
    psycopg2-binary==2.9.6 \
    sqlalchemy==2.0.15 \
    clickhouse-driver==0.2.6 \
    sentence-transformers==2.2.2 \
    sagemaker==2.161.0 \
    boto3==1.26.145

ENV PYTHONUNBUFFERED=TRUE

ENTRYPOINT ["python"]
  2. There are no private subnets in our VPC.
  3. The VPC only has an IPv4 CIDR.
  4. The DHCP option set is associated with the VPC.
  5. SageMaker Studio also runs in this VPC, and there are no such problems with either Studio or any other AWS service.
  6. Other sagemaker.network.NetworkConfig parameters were left at their defaults (spelled out in the sketch below).
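
To make item 6 concrete, here is the same config with the remaining parameters written out at what I believe are their default values (a sketch for clarity, not a claim about the SDK's internals):

from sagemaker.network import NetworkConfig

# Same NetworkConfig as in the repro, with the remaining parameters spelled out
# at (what I believe are) their default values.
network_config = NetworkConfig(
    security_group_ids=["sg-XXXXXXXXXXXXXXXXX"],
    subnets=["subnet-XXXXXXXXXXXXXXXXX"],
    enable_network_isolation=False,        # default
    encrypt_inter_container_traffic=None,  # default
)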
