Use PyTorch Estimator for SageMaker training jobs #2031

Merged: 2 commits merged into azavea:master from sagemaker-ddp on Jan 24, 2024


@AdeelH AdeelH commented Jan 18, 2024

Overview

This PR replaces the use of the generic Estimator with the PyTorch-specific Estimator provided by the SageMaker Python SDK. This allows us to distribute training jobs across multiple instances using torchrun.
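
As a rough illustration of what this switch involves, the sketch below assembles keyword arguments for the SDK's `sagemaker.pytorch.PyTorch` estimator, including the `torch_distributed` distribution option that the SDK uses to launch the entry point via torchrun. The helper names, entry point, image URI, and role ARN are placeholders chosen for this example and are not taken from the PR. Whether a given SDK version accepts this exact combination with a custom image is also an assumption.

```python
from typing import Any


def build_pytorch_estimator_kwargs(
    entry_point: str,
    image_uri: str,
    role: str,
    instance_type: str,
    instance_count: int,
) -> dict[str, Any]:
    """Assemble kwargs for sagemaker.pytorch.PyTorch (hypothetical helper)."""
    if instance_count < 1:
        raise ValueError("instance_count must be >= 1")
    return {
        "entry_point": entry_point,
        "image_uri": image_uri,
        "role": role,
        "instance_type": instance_type,
        "instance_count": instance_count,
        # Ask the SDK to launch the entry point with torchrun on each instance.
        "distribution": {"torch_distributed": {"enabled": True}},
    }


def make_estimator(**kwargs: Any):
    """Create the estimator; needs the sagemaker package and AWS credentials."""
    # Imported lazily so this sketch can be loaded without the SDK installed.
    from sagemaker.pytorch import PyTorch

    return PyTorch(**kwargs)


# All values below are placeholders, not the ones used in this PR.
example_kwargs = build_pytorch_estimator_kwargs(
    entry_point="train.py",
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/example-repo:latest",
    role="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    instance_type="ml.p3.8xlarge",
    instance_count=2,
)
```

Calling `make_estimator(**example_kwargs).fit()` would then submit the job, assuming valid credentials and real values in place of the placeholders.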

Other changes:

  • Add train_image, train_instance_type, and train_instance_count RV config options.

Checklist

  • Added unit tests, if applicable
  • Updated documentation, if applicable
  • Added needs-backport label if the change should be back-ported to the previous release
  • PR has a name that won't get you publicly shamed for vagueness

Notes

The current hack of creating a fake Python script will no longer be necessary once aws/sagemaker-python-sdk#4324 is merged and released.
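
The PyTorch Estimator expects a Python script as its entry point, which is presumably why a placeholder script is needed to run an arbitrary command. The sketch below shows one way such a stub could be generated. It is an assumption for illustration only, not the code in this PR, and `write_stub_entry_point` is a hypothetical name.

```python
import tempfile
from pathlib import Path


def write_stub_entry_point(cmd: list[str], out_dir: str) -> Path:
    """Write a tiny Python script that just runs `cmd` (hypothetical helper)."""
    script = (
        "import subprocess\n"
        "import sys\n"
        # Propagate the command's exit code so a failure fails the job too.
        f"sys.exit(subprocess.run({cmd!r}).returncode)\n"
    )
    path = Path(out_dir) / "stub_entry_point.py"
    path.write_text(script)
    return path


with tempfile.TemporaryDirectory() as tmp_dir:
    # "echo hello" stands in for the real training command.
    stub = write_stub_entry_point(["echo", "hello"], tmp_dir)
    stub_source = stub.read_text()
```

The generated file can then be passed as the estimator's `entry_point`, with the real command substituted for the placeholder.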

Testing Instructions

Sample command:

SAGEMAKER_TRAIN_INSTANCE_TYPE=ml.p3.8xlarge SAGEMAKER_TRAIN_INSTANCE_COUNT=2 rastervision run sagemaker \
	"rastervision_pytorch_backend/rastervision/pytorch_backend/examples/semantic_segmentation/isprs_potsdam.py" \
	-a raw_uri "s3://raster-vision-raw-data/isprs-potsdam" \
	-a root_uri "s3://raster-vision-ahassan/rvtest/sagemaker_dist/isprs_potsdam_2024_01_23b/"
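
The two environment variables in the command above carry the instance type and count. A minimal sketch of how such variables could be read and validated follows. Raster Vision has its own configuration mechanism, so `TrainInstanceSettings` and `read_train_settings` are hypothetical names used only to illustrate the mapping.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class TrainInstanceSettings:
    """Instance settings for a training job (hypothetical container)."""
    instance_type: str
    instance_count: int


def read_train_settings(env: Mapping[str, str] = os.environ) -> TrainInstanceSettings:
    """Read the settings from environment-style key/value pairs."""
    instance_type = env["SAGEMAKER_TRAIN_INSTANCE_TYPE"]
    instance_count = int(env["SAGEMAKER_TRAIN_INSTANCE_COUNT"])
    if instance_count < 1:
        raise ValueError("SAGEMAKER_TRAIN_INSTANCE_COUNT must be >= 1")
    return TrainInstanceSettings(instance_type, instance_count)


# Same values as in the sample command, passed as a plain dict for the example.
settings = read_train_settings({
    "SAGEMAKER_TRAIN_INSTANCE_TYPE": "ml.p3.8xlarge",
    "SAGEMAKER_TRAIN_INSTANCE_COUNT": "2",
})
```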

- This is required for using the image via AWS SageMaker's PyTorch Estimator.
- Also bump boto3 and awscli to compatible versions.

codecov bot commented Jan 24, 2024

Codecov Report

Attention: 1 line in your changes is missing coverage. Please review.

Comparison is base (6a918ba) 85.16% compared to head (1752019) 85.23%.

| Files | Patch % | Lines |
|---|---|---|
| ...rastervision/aws_sagemaker/aws_sagemaker_runner.py | 98.48% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2031      +/-   ##
==========================================
+ Coverage   85.16%   85.23%   +0.06%     
==========================================
  Files         196      196              
  Lines        9856     9908      +52     
==========================================
+ Hits         8394     8445      +51     
- Misses       1462     1463       +1     


@AdeelH AdeelH marked this pull request as ready for review January 24, 2024 21:32
@AdeelH AdeelH merged commit 87e59e2 into azavea:master Jan 24, 2024
2 checks passed
@AdeelH AdeelH deleted the sagemaker-ddp branch January 24, 2024 21:45