ModuleNotFoundError: SageMaker only copies the entry_point file to /opt/ml/code/ instead of the whole cloned source code #4195

@celsofranssa

I am using the SageMaker PyTorch Estimator with a custom Docker image stored in AWS ECR:

from sagemaker.pytorch.estimator import PyTorch

role = "arn:..."

estimator = PyTorch(
    image_uri="1...ecr...amazonaws.com/...:prototype",
    git_config={"repo": "https://github.com/celsofranssa/LightningPrototype.git", "branch": "sagemaker"},
    entry_point="main.py",
    role=role,
    region="us-...",
    instance_type="local",  # ml.g4dn.2xlarge
    instance_count=1,
    volume_size=225,
    hyperparameters=hparams,
)
estimator.fit()

SageMaker correctly clones the repository from GitHub and checks out the specified branch.

The Bug:
However, it copies only main.py to /opt/ml/code inside the container instead of the whole cloned source code, which causes ModuleNotFoundError: No module named 'source':

Traceback (most recent call last):
2y9byzwyxr-algo-1-reuoy  |   File "/opt/ml/code/main.py", line 15, in <module>
2y9byzwyxr-algo-1-reuoy  |     from source.helper.EvalHelper import EvalHelper
2y9byzwyxr-algo-1-reuoy  | ModuleNotFoundError: No module named 'source'
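To narrow down a traceback like this one, it can help to check from inside the container whether the package directory is actually present next to the entry point and whether Python can resolve it. The sketch below uses only the standard library; the helper name diagnose_code_dir is hypothetical and not part of the SageMaker SDK.

```python
import importlib.util
import os
import sys


def diagnose_code_dir(code_dir, package):
    """Report what is in code_dir and whether `package` resolves from it.

    Hypothetical helper for debugging; not part of the SageMaker SDK.
    """
    entries = sorted(os.listdir(code_dir))
    has_package_dir = os.path.isdir(os.path.join(code_dir, package))

    # Resolve the package the way `import` would, with code_dir
    # temporarily placed at the front of sys.path.
    sys.path.insert(0, code_dir)
    try:
        importlib.invalidate_caches()
        spec = importlib.util.find_spec(package)
    finally:
        sys.path.remove(code_dir)

    return {
        "entries": entries,
        "has_package_dir": has_package_dir,
        "importable": spec is not None,
    }
```

Calling diagnose_code_dir(os.getcwd(), "source") at the top of main.py, before the failing import, would show whether the source directory ever reached the container.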

Logging the contents of /opt/ml/code shows only main.py:

print(f"Content: {os.listdir(os.getcwd())}")
['main.py']
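A possible workaround, not verified against this setup: the SageMaker Python SDK documents a source_dir parameter for the directory that holds training code besides the entry point, and states that with git_config it is interpreted relative to the repository root. The sketch below assumes that source_dir="." selects the whole cloned repository. The function estimator_kwargs is a hypothetical helper that only assembles the arguments; the elided values are kept exactly as in the snippet above.

```python
def estimator_kwargs(hparams):
    """Assemble keyword arguments for sagemaker.pytorch.estimator.PyTorch.

    Hypothetical helper; the only change from the original call is source_dir.
    """
    return {
        "image_uri": "1...ecr...amazonaws.com/...:prototype",  # elided as in the issue
        "git_config": {
            "repo": "https://github.com/celsofranssa/LightningPrototype.git",
            "branch": "sagemaker",
        },
        # Assumption: "." is resolved against the root of the cloned repo,
        # so the whole checkout is packaged, not just main.py.
        "source_dir": ".",
        "entry_point": "main.py",
        "instance_type": "local",  # ml.g4dn.2xlarge
        "instance_count": 1,
        "volume_size": 225,
        "hyperparameters": hparams,
    }


# Intended usage (requires the sagemaker package and a valid role):
#   from sagemaker.pytorch.estimator import PyTorch
#   estimator = PyTorch(role=role, **estimator_kwargs(hparams))
#   estimator.fit()
```

If the training code lived in a subdirectory of the repository, source_dir would point at that subdirectory and entry_point would be given relative to it.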
