Convert pytorchddp distribution to smdistributed distribution #4698

Merged
8 commits merged into aws:master on May 22, 2024


@tombousso (Contributor) commented May 22, 2024

The pytorchddp and smdistributed launchers are very similar, which confuses customers about which one to use. To simplify things while still supporting both config options, this PR internally converts configs that use pytorchddp to use smdistributed instead.

I've verified that the unit tests pass and tested a few jobs with the change. I tried running the integration tests locally, but the setup seems non-trivial, so I'm hoping to run them through the review pipeline instead.
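
For reviewers, here is a rough sketch of the kind of rewrite described above. It is not the code in this PR: the helper name `convert_pytorchddp_to_smdistributed` is hypothetical, and it assumes the usual `distribution` shapes, `{"pytorchddp": {"enabled": True}}` and `{"smdistributed": {"dataparallel": {"enabled": True}}}`. Rejecting a config that sets both options is also an assumption made for this illustration.

```python
from typing import Any, Dict, Optional


def convert_pytorchddp_to_smdistributed(
    distribution: Optional[Dict[str, Any]],
) -> Optional[Dict[str, Any]]:
    """Return a copy of `distribution` with pytorchddp rewritten to smdistributed.

    Hypothetical helper for illustration only; the name and the exact
    behaviour are assumptions, not the SDK's API.
    """
    # Nothing to convert: pass the config through unchanged.
    if distribution is None or "pytorchddp" not in distribution:
        return distribution

    # Assumption: treating both options at once as ambiguous and rejecting it.
    if "smdistributed" in distribution:
        raise ValueError(
            "Cannot use both pytorchddp and smdistributed distribution options together."
        )

    # Build a new dict so the caller's config is not mutated.
    converted = {k: v for k, v in distribution.items() if k != "pytorchddp"}
    # The pytorchddp settings become the smdistributed dataparallel settings.
    converted["smdistributed"] = {"dataparallel": distribution["pytorchddp"]}
    return converted
```

With this sketch, a caller who passes `{"pytorchddp": {"enabled": True}}` ends up with an smdistributed data-parallel config, while configs that already use smdistributed (or no distribution at all) are returned untouched.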

@tombousso marked this pull request as ready for review May 22, 2024 17:08
@tombousso requested a review from a team as a code owner May 22, 2024 17:08
@tombousso requested review from liujiaorr and removed request for a team May 22, 2024 17:08
@liujiaorr merged commit 6196b75 into aws:master May 22, 2024
11 checks passed
knikure added a commit that referenced this pull request Jun 5, 2024
* fix: mainline alt config parsing (#4602)

* fix: parsing

* fix: commit tests

* fix: types

* updated

* fix

* Add Triton v24.03 URI (#4605)

Co-authored-by: Nikhil Kulkarni <nikhilsk@amazon.com>

* feature: support session tag chaining for training job (#4596)

* feature: support session tag chaining for training job

* fix: resolve typo

* fix: resolve typo and build failure

* fix: resolve typo and unit test failure

---------

Co-authored-by: Jessica Zhu <jessicazhu3@106775307+jessicazhu3@users.noreply.github.com>

* prepare release v2.217.0

* update development version to v2.217.1.dev0

* fix: properly close files in lineage queries and tests (#4587)

Closes #4458

* feature: set default allow_pickle param to False (#4557)

* breaking: set default allow_pickle param to False

* breaking: fix unit tests and linting

NumpyDeserializer will not allow deserialization
unless allow_pickle flag is set to True explicitly

* fix: black-check

---------

Co-authored-by: Ashwin Krishna <ashwikri@amazon.com>

* Fix:invalid component error with new metadata (#4634)

* fix: invalid component name

* tests

* format

* fix vulnerable model integ tests llama 2

* updated

* fix: training dataset location

* prepare release v2.218.0

* update development version to v2.218.1.dev0

* chore: update skipped flaky tests (#4644)

* Update skipped flaky tests

* flake8

* format

* format

* chore: release tgi 2.0.1 (#4642)

* chore: release tgi 2.0.1

* minor fix

---------

Co-authored-by: Zhaoqi <52220743+zhaoqizqwang@users.noreply.github.com>

* fix: Fix UserAgent logging in Python SDK (#4647)

* prepare release v2.218.1

* update development version to v2.218.2.dev0

* feature: allow choosing js payload by alias in private method

* Updates for SMP v2.3.1 (#4660)

Co-authored-by: Suhit Kodgule <skodgule@amazon.com>

* chore(deps): bump jinja2 from 3.1.3 to 3.1.4 in /doc (#4655)

Bumps [jinja2](https://github.com/pallets/jinja) from 3.1.3 to 3.1.4.
- [Release notes](https://github.com/pallets/jinja/releases)
- [Changelog](https://github.com/pallets/jinja/blob/main/CHANGES.rst)
- [Commits](pallets/jinja@3.1.3...3.1.4)

---
updated-dependencies:
- dependency-name: jinja2
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore(deps): bump tqdm from 4.66.2 to 4.66.3 in /tests/data/serve_resources/mlflow/pytorch (#4650)

Bumps [tqdm](https://github.com/tqdm/tqdm) from 4.66.2 to 4.66.3.
- [Release notes](https://github.com/tqdm/tqdm/releases)
- [Commits](tqdm/tqdm@v4.66.2...v4.66.3)

---
updated-dependencies:
- dependency-name: tqdm
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore(deps): bump jinja2 from 3.1.3 to 3.1.4 in /requirements/extras (#4654)

Bumps [jinja2](https://github.com/pallets/jinja) from 3.1.3 to 3.1.4.
- [Release notes](https://github.com/pallets/jinja/releases)
- [Changelog](https://github.com/pallets/jinja/blob/main/CHANGES.rst)
- [Commits](pallets/jinja@3.1.3...3.1.4)

---
updated-dependencies:
- dependency-name: jinja2
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* prepare release v2.219.0

* update development version to v2.219.1.dev0

* fix: skip flakey tests pending investigation (#4667)

* change: update image_uri_configs  05-09-2024 07:17:41 PST

* Add tensorflow_serving support for mlflow models and enable lineage tracking for mlflow models (#4662)

* Initial commit for tensorflow_serving support of MLflow

* Add integ tests for mlflow tf_serving

* fix style issues

* remove unused attributes from tf builder

* Add deep ping for tf_serving local mode

* Initial commit for lineage impl

* Initial commit for tensorflow_serving support of MLflow

* Add integ tests for mlflow tf_serving

* fix style issues

* remove unused attributes from tf builder

* Add deep ping for tf_serving local mode

* Add integ tests and uts

* fix local mode for tf_serving

* Allow lineage tracking only in sagemaker endpoint mode

* fix regex pattern

* fix style issues

* fix regex pattern and hard coded py version in ut

* fix missing session

* Resolve pr comments and fix regex for mlflow registry and ids

* fix: model builder race condition on sagemaker session (#4673)

Co-authored-by: Jonathan Makunga <makung@amazon.com>

* feat: Add telemetry support for mlflow models (#4674)

* Initial commit for telemetry support

* Fix style issues and add more logger messages

* fix value error messages in ut

* feat: add new images for HF TGI release (#4677)

* chore: add new images for HF TGI release

* test

* feature: AutoGluon 1.1.0 image_uris update (#4679)

Co-authored-by: Ubuntu <ubuntu@ip-172-31-9-154.us-west-2.compute.internal>

* change: add debug logs to workflow container dist creation (#4682)

* prepare release v2.220.0

* update development version to v2.220.1.dev0

* fix: Image URI should take precedence for HF models (#4684)

* Fix: Image URI should take precedence for HF models

* Fix formatting

* Fix formatting

* Fix formatting

* Increase coverage -  UT pass

* feat: onboard tei image config to pysdk (#4681)

* feat: onboard tei image config to pysdk

* fix formatting issue

* minor fix func name

* fix unit tests

---------

Co-authored-by: Mufaddal Rohawala <89424143+mufaddal-rohawala@users.noreply.github.com>

* fix: model builder limited container support for endpoint mode. (#4683)

* Allow ModelBuilder's endpoint mode for Jumpstart models packaged with containers other than TGI and DJL

* increase coverage

* Add JS Support for MMS Serving

* Add JS Support for MMS Serving

* Unit tests

* Refactoring

* Refactoring

* Refactoring

---------

Co-authored-by: Jonathan Makunga <makung@amazon.com>

* change: Add more debuging (#4687)

* change: cover tei with image_uris.retrieve API (#4689)

* fix: JS Model with non-TGI/non-DJL deployment failure (#4688)

* Debug

* Debug

* Debug

* Debug

* Debug

* Debug

* fix docstyle

* Refactoring

* Add Integ tests

---------

Co-authored-by: Jonathan Makunga <makung@amazon.com>

* Feat: Pull latest tei container for sentence similiarity models on HuggingFace hub (#4686)

* Update: Pull latest tei container for sentence similiarity models

* Fix formatting

* Address PR comments

* Fix formatting

* Fix check

* Switch sentence similarity to be deployed on tgi

* Fix formatting

* Fix formatting

* Fix formatting

* Fix formatting

* Introduce TEI builder with TGI server

* Fix formmatting

* Add integ test

* Fix formatting

* Add integ test

* Add integ test

* Add integ test

* Add integ test

* Add integ test

* Fix formatting

* Move to G5 for integ test

* Fix formatting

* Integ test updates

* Integ test updates

* Integ test updates

* Fix formatting

* Integ test updates

* Move back to generate for ping

* Integ test updates

* Integ test updates

* Fix: Add Image URI overrides for transformers models (#4693)

* Fix: Add Image URI overrides for transformers models

* Increase coverage

* Fix formatting

* prepare release v2.221.0

* update development version to v2.221.1.dev0

* Add tei cpu image (#4695)

* Add tei cpu image

* fix format issue

* fix unit tests

* fix typo

* fix typo

* Feat: Add TEI support for ModelBuilder (#4694)

* Add TEI Serving

* Add TEI Serving

* Add TEI Serving

* Add TEI Serving

* Add TEI Serving

* Add TEI Serving

* Notebook testing

* Notebook testing

* Notebook testing

* Refactoring

* Refactoring

* UT

* UT

* Refactoring

* Test coverage

* Refactoring

* Refactoring

---------

Co-authored-by: Jonathan Makunga <makung@amazon.com>

* Convert pytorchddp distribution to smdistributed distribution (#4698)

* rewrite pytorchddp to smdistributed

* remove instance type check

* Update estimator.py

* remove validate_pytorch_distribution

* fix

* fix unit tests

* fix formatting

* check instance type not None

* prepare release v2.221.1

* update development version to v2.221.2.dev0

* Update: SM Endpoint Routing Strategy Support. (#4702)

* RoutingConfig

* Refactoring

* Docstring

* UT

* Refactoring

* Refactoring

---------

Co-authored-by: Jonathan Makunga <makung@amazon.com>

* change: update image_uri_configs  05-29-2024 07:17:35 PST

* Making project name in workflow files dynamic (#4708)

* fix: Fix ci unit-tests (#4713)

* chore(deps): bump requests from 2.31.0 to 2.32.2 in /tests/data/serve_resources/mlflow/pytorch (#4709)

Bumps [requests](https://github.com/psf/requests) from 2.31.0 to 2.32.2.
- [Release notes](https://github.com/psf/requests/releases)
- [Changelog](https://github.com/psf/requests/blob/main/HISTORY.md)
- [Commits](psf/requests@v2.31.0...v2.32.2)

---
updated-dependencies:
- dependency-name: requests
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore(deps): bump apache-airflow from 2.9.0 to 2.9.1 in /requirements/extras (#4703)

* chore(deps): bump apache-airflow in /requirements/extras

Bumps [apache-airflow](https://github.com/apache/airflow) from 2.9.0 to 2.9.1.
- [Release notes](https://github.com/apache/airflow/releases)
- [Changelog](https://github.com/apache/airflow/blob/main/RELEASE_NOTES.rst)
- [Commits](apache/airflow@2.9.0...2.9.1)

---
updated-dependencies:
- dependency-name: apache-airflow
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

* Update tox.ini to bump apache-airflow

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Kalyani Nikure <110067132+knikure@users.noreply.github.com>

* chore(deps): bump mlflow from 2.10.2 to 2.12.1 in /tests/data/serve_resources/mlflow/pytorch (#4690)

Bumps [mlflow](https://github.com/mlflow/mlflow) from 2.10.2 to 2.12.1.
- [Release notes](https://github.com/mlflow/mlflow/releases)
- [Changelog](https://github.com/mlflow/mlflow/blob/master/CHANGELOG.md)
- [Commits](mlflow/mlflow@v2.10.2...v2.12.1)

---
updated-dependencies:
- dependency-name: mlflow
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore(deps): bump mlflow from 2.11.1 to 2.12.1 in /tests/data/serve_resources/mlflow/xgboost (#4692)

Bumps [mlflow](https://github.com/mlflow/mlflow) from 2.11.1 to 2.12.1.
- [Release notes](https://github.com/mlflow/mlflow/releases)
- [Changelog](https://github.com/mlflow/mlflow/blob/master/CHANGELOG.md)
- [Commits](mlflow/mlflow@v2.11.1...v2.12.1)

---
updated-dependencies:
- dependency-name: mlflow
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore(deps): bump mlflow from 2.11.1 to 2.12.1 in /tests/data/serve_resources/mlflow/tensorflow (#4691)

Bumps [mlflow](https://github.com/mlflow/mlflow) from 2.11.1 to 2.12.1.
- [Release notes](https://github.com/mlflow/mlflow/releases)
- [Changelog](https://github.com/mlflow/mlflow/blob/master/CHANGELOG.md)
- [Commits](mlflow/mlflow@v2.11.1...v2.12.1)

---
updated-dependencies:
- dependency-name: mlflow
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* change: Updates for DJL 0.28.0 release (#4701)

* Sync Branch

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Haotian An <33510317+Captainia@users.noreply.github.com>
Co-authored-by: Nikhil Kulkarni <knikhil29@gmail.com>
Co-authored-by: Nikhil Kulkarni <nikhilsk@amazon.com>
Co-authored-by: jessicazhu3 <106775307+jessicazhu3@users.noreply.github.com>
Co-authored-by: Jessica Zhu <jessicazhu3@106775307+jessicazhu3@users.noreply.github.com>
Co-authored-by: ci <ci>
Co-authored-by: Justin <justinm088@hotmail.com>
Co-authored-by: ASHWIN KRISHNA <38850354+akrishna1995@users.noreply.github.com>
Co-authored-by: Ashwin Krishna <ashwikri@amazon.com>
Co-authored-by: Haixin Wang <98612668+haixiw@users.noreply.github.com>
Co-authored-by: Zhaoqi <52220743+zhaoqizqwang@users.noreply.github.com>
Co-authored-by: Kalyani Nikure <110067132+knikure@users.noreply.github.com>
Co-authored-by: Keerthan Vasist <kvasist@amazon.com>
Co-authored-by: SuhitK <kodgule.suhit@gmail.com>
Co-authored-by: Suhit Kodgule <skodgule@amazon.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: sagemaker-bot <sagemaker-bot@amazon.com>
Co-authored-by: jiapinw <95885824+jiapinw@users.noreply.github.com>
Co-authored-by: Jonathan Makunga <makung@amazon.com>
Co-authored-by: Prateek M Desai <prateekmdesai04@gmail.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-9-154.us-west-2.compute.internal>
Co-authored-by: Mufaddal Rohawala <89424143+mufaddal-rohawala@users.noreply.github.com>
Co-authored-by: Samrudhi Sharma <154457034+samruds@users.noreply.github.com>
Co-authored-by: Tom Bousso <tombousso@gmail.com>
Co-authored-by: Zhaoqi <jzhaoqwa@amazon.com>
Co-authored-by: Tyler Osterberg <tyoster@amazon.com>