Skip to content

tests: unskip three tests with individual issues in train#5894

Merged
lucasjia-aws merged 6 commits into
aws:masterfrom
lucasjia-aws:fix/TestBenchmarkEvaluatorIntegration
May 28, 2026
Merged

tests: unskip three tests with individual issues in train#5894
lucasjia-aws merged 6 commits into
aws:masterfrom
lucasjia-aws:fix/TestBenchmarkEvaluatorIntegration

Conversation

@lucasjia-aws
Copy link
Copy Markdown
Collaborator

@lucasjia-aws lucasjia-aws commented May 26, 2026

test_hp_contract_mpi_script -- unskip with fix
test_benchmark_evaluation_base_model_only -- unskip
test_benchmark_evaluation_nova_model -- remain skipped but eliminate dependency to cross account resources

test_hp_contract_mpi_script fix

Problem: mpi_driver.py used absolute imports (from sagemaker.train.container_drivers.distributed_drivers.mpi_utils import ...), but the sagemaker-train package is not installed in the training container. The driver scripts are uploaded as standalone files to /opt/ml/input/data/sm_drivers/ and executed directly, so the absolute imports fail with ModuleNotFoundError: No module named 'sagemaker.train'.

Fix: Changed mpi_driver.py to use sys.path.insert + relative module name imports, matching the pattern already used by torchrun_driver.py, which works correctly in the container environment.

@lucasjia-aws lucasjia-aws changed the title tests: unskip three tests tests: unskip three tests in train May 26, 2026
The MPI driver script used absolute imports (from sagemaker.train.container_drivers...)
which fail at runtime in the training container because sagemaker-train is not installed
there. The driver scripts are copied to /opt/ml/input/data/sm_drivers/ and executed
directly by the container entrypoint.

Changed to sys.path-based relative imports matching the pattern used by
torchrun_driver.py, which works correctly in the container environment.
Remove cross-account dependency in test_benchmark_evaluation_nova_model by
replacing resources from account 052150106756 with our test account
(729646638167) in us-east-1. Also removed mlflow_tracking_server_arn since
no MLflow server exists in us-east-1.

Test remains skipped pending us-east-1 test infrastructure migration to a
dedicated test account.
@lucasjia-aws lucasjia-aws changed the title tests: unskip three tests in train tests: unskip three individual tests in train May 28, 2026
@lucasjia-aws lucasjia-aws changed the title tests: unskip three individual tests in train tests: unskip three tests with individual issues in train May 28, 2026
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this failure of sagemaker.train not being available a cause of sagemaker not being installed within the training container (or the correct version of sagemaker)? I'm not sure relative paths is the correct fix here, although feel free to correct me if I'm wrong

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question! The driver scripts (mpi_driver.py, torchrun_driver.py, etc.) are not meant to run as part of an installed sagemaker-train package. They're uploaded as standalone files to /opt/ml/input/data/sm_drivers/ in the training container and executed directly by the entrypoint bash script. The container uses a standard AWS DLC image (e.g., pytorch-training:2.0.0-cpu-py310) which doesn't have sagemaker-train installed — and it shouldn't need to.

torchrun_driver.py already uses sys.path.insert + relative module imports and works correctly. mpi_driver.py was the only driver using absolute imports, which was inconsistent with the rest of the codebase. I'm actually mirroring what we have done and is running fine here.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, I see in PySDK V2 that a similar approach is being used here too, thanks for the clarification!

PySDK V2 link: https://github.com/aws/sagemaker-python-sdk/blob/master-v2/src/sagemaker/modules/train/container_drivers/distributed_drivers/mpi_driver.py#L31

@lucasjia-aws lucasjia-aws merged commit 16ffb94 into aws:master May 28, 2026
21 of 32 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants