Add neuron enhancements (#2355) · aws/deep-learning-containers@e78bf86

Add neuron enhancements (#2355)

* [test] Add efa test as placeholder (#185)

* [pytorch][sagemaker] PT 1.8.0 cu110 EFA support (#171)

* PT 1.7.1 cu110 EFA support

* rebase PT 1.7.1 dockerfile and add EFA to PT 1.8.0 dockerfile

* Install hwloc, dependency of smdataparallel

* Disabled smdataparallel integration test temporarily since current smdataparallel wheel is incompatible with EFA

* Updated EFA version to 1.11.2 which comes with MPI v4.1.0

* fix nccl version and add test

* update mpi

* fix style

* Fixed NCCL branch name and moved the Horovod installation before SM Distributed

* Disable the framework build and test which is not applicable to this PR

* fix failing test

* Add MPI flags for EFA

* Fixed pytorch nccl version test

* Fixed pytorch nccl version python test and disable fresh builds

* Disable new builds and enabled smdataparallel test

* Re-trigger CI

* Revert build config changes

Co-authored-by: Lai Wei <royweilai@gmail.com>
Co-authored-by: Akhil Mehra <armehra@amazon.com>

* [TensorFlow][Sagemaker] TF 2.4 cu110 EFA support (#172)

* TF 2.4 cu110 EFA support

* Added -g option for EFA installer

* Update NCCL installation

* Fixed NCCL installation

* Add constant at top

* Install hwloc, dependency of smdataparallel

* Disabled smdataparallel integration test temporarily since current smdataparallel wheel is incompatible with EFA

* Updated EFA version to 1.11.2 which comes with MPI v4.1.0

* update OPEN_MPI

* Install NCCL from source and updated the openMPI path

* Re-trigger CI

* Disable the framework build and test which is not applicable to this PR and added EFA related flag

* Fix mpi flag failure

* Add correct runtime MPI flags

* Add correct MPI flags, modify build config

* Disable new builds and Fixed SM Horovod test

* Enabled smdataparallel test

* Removed building NCCL with specific arch. Use default config which builds for all arch

* Revert build config changes

Co-authored-by: yselivonchyk <y.selivonchyk@gmail.com>
Co-authored-by: Akhil Mehra <armehra@amazon.com>

* Run PT to test EFA (#191)

add sanity efa test

* [pytorch] | [test] | [sagemaker] SMModel Parallel pytorch EFA tests on p3dn (#187)

SMModel Parallel pytorch EFA tests

Co-authored-by: Jeetendra Patil <jeet4320@users.noreply.github.com>
Co-authored-by: Karan Jariwala <karankjariwala@gmail.com>
Co-authored-by: Lai Wei <royweilai@gmail.com>
Co-authored-by: yselivonchyk <y.selivonchyk@gmail.com>

* [tensorflow] | [test] | [sagemaker] (#188)

add efa test for tf2

Co-authored-by: Jeetendra Patil <jeet4320@users.noreply.github.com>
Co-authored-by: Karan Jariwala <karankjariwala@gmail.com>
Co-authored-by: Lai Wei <royweilai@gmail.com>
Co-authored-by: yselivonchyk <y.selivonchyk@gmail.com>

* Run PT Rubik EFA test (#194)

* run pt efa rubik

* skip inference

* revert

* Run rubik efa tests on tf2 (#195)

* run rubik efa tests on tf2

* [test][sagemaker] Add reupload_image_to_test_ecr to SM tests conftest (#193)

* [PyTorch][test][sagemaker] EFA test for smdataparallel (#189)

EFA test for smdataparallel

* [habana] Placeholder for Build and Test Functionality for Habana (#197)

* [habana] build functionality

* modify habana dedicated flag

* enable habana build

* build config changes

* add pytorch and modify test configuration

* move build artifact

* test support for habana

* nit changes

* build changes

* nit change

* support for SM and benchmark

* address comments

* build eia and neuron

* enable new builds

* nit

* revert temp configs

* remove dead code from eks test

* [Habana] Add changeset logic (#198)

* changeset logic for habana

* enable habana mode

* test buildspec

* change dockerfiles

* disable habana mode and revert changes

* remove unwanted code

* [test] Run test using existing EC2 instance locally (#201)

* Run test using existing EC2 instance

* rename pytest fixture

* Removing any SM related installs from Dockerfile (#200)

* Removing any SM related installs

* Cleaned Dockerfile. Added 2.5 folder

Co-authored-by: Tejas Chumbalkar <34728580+tejaschumbalkar@users.noreply.github.com>

* [pytorch/tensorflow] Habana DLC python 3.7, OMPI in base installer and pytorch DLC fixes  (#202)

* Habana Pytorch DLC and OMPI Install In Habana Bases

* Fix docker path

* Rebased and added TF2.5

* Update pytorch to 0.15.0 synapse

* Updated Pytorch docker file (#204)

* Updated Pytorch docker file. Also updated buildspec to pull whl from s3 bucket

* Removed SM packages. Added a few more Python packages. Renamed folder to 0.15

* Minor fix in buildspec

* build habana images

* correct build config

* disable build config

Co-authored-by: tejaschumbalkar <tejaschumbalkar@gmail.com>

* Update buildspec.yml (#206)

Updated pytorch wheel.
Added HPUBase for test cases.

* SynapseAI 0.15.0 Release DLC Changes (#205)

* SynapseAI 0.15.0 Release

* Add example branch parse and Habana PR build

* Fix extra slash

* Revert ENABLE_HABANA_MODE

* [Habana][Build] Fix torchvision python version py37 (#207)

* Fix torchvision python version py37

* Updated h5py version to 3.1.0

* enable habana mode and disable test

* Using pypi package for torchvision

* add docker build artifacts

* add build artifacts references to buildspec

* revert config

Co-authored-by: tejaschumbalkar <tejaschumbalkar@gmail.com>

* SynapseAI v0.15.1 release updates (#208)

* SynapseAI v0.15.1 release updates

* build habana switch on

* fix pt parse

* ENABLE_HABANA_MODE=False

* Updating TF binaries with callback fixes (#210)

* Updating TF binaries with callback fixes

* Enabling Habana build

* Resetting ENABLE_HABANA_MODE=False

* SynapseAI v0.15.2 release updates (#209)

* SynapseAI v0.15.2 release updates
* SynapseAI v0.15.2 release updates

* Fix folder naming

* Re-Disable ENABLE_HABANA_MODE in build_config.py

* SynapseAI v0.15.2 release updates
* SynapseAI v0.15.2 release updates

* Fix folder naming

* Re-Disable ENABLE_HABANA_MODE in build_config.py

* Updating Torchvision binary (#211)

* Updating Torchvision binary as we need to build with the same setup as pytorch for compatibility

* Enabling Habana mode

* Reset ENABLE_HABANA_MODE=False

* SynapseAI v0.15.3 release updates (#213)

* SynapseAI v0.15.3 release updates
* SynapseAI v0.15.3 release updates

* Enable Habana Mode

* Disable Habana Mode

* address rebase modifications

* [DO NOT MERGE] [autogluon][build, test] Initial PR for training containers (#214)

* [autogluon][build, test] fixing instance types (#218)

* format ecr repo from image uri (#217)

* format ecr repo from image uri

* pytest markers for hpu test

* more markers

* nit habana changes

* [habana][build] fix docker entrypoint (#219)

* fix docker entrypoint

* revert habana mode

* Fixed version in autogluon buildspec (#215)

* Fixed version in autogluon buildspec

* Enabling sagemaker tests

* Enable building a new container

* Added MAJOR_VERSION into docker files, added autogluon_training fixture

* [autogluon][test] SageMaker remote mode tests

* [autogluon][test] removed datasets requirement

Co-authored-by: Sergey Togulev <togulev@amazon.com>
Co-authored-by: Alexander Shirkov <ashyrkou@amazon.com>

* [autogluon][test] tests fixes (#220)

* [autogluon][test] tests fixes

* [autogluon][test] tests fixes

* [autogluon][test] removed jupyter dependencies leftovers

* [autogluon][test] removed jupyter dependencies leftovers

* [autogluon][test] version checks fixes

* [autogluon][test] pip check fixes

* [autogluon][test] pip check fixes

* [autogluon][test] sm_local tests fixes

* [autogluon][test] sm_local tests fixes

* [autogluon][test] applied pillow security fixes to autogluon

* [autogluon][test] removed jupyter dependencies leftovers

* [build][test]Rolling back default parameters changes (#224)

* Rolling back default parameters changes

* [autogluon][test] test fixes

Co-authored-by: Sergey Togulev <togulev@amazon.com>
Co-authored-by: Alexander Shirkov <ashyrkou@amazon.com>

* [autogluon][release]Releasing Autogluon 0.2.1 (#227)

Co-authored-by: Sergey Togulev <togulev@amazon.com>

* [autogluon][test]Fixes for AG sanity tests (#226)

Co-authored-by: Sergey Togulev <togulev@amazon.com>

* [release] Fixed release notes logic (#228)

Co-authored-by: Sergey Togulev <togulev@amazon.com>

* [release] Fix for AG release notes (#229)

* [release] Fixed release notes logic

* [release] Fixed release notes logic

Co-authored-by: Sergey Togulev <togulev@amazon.com>

* [autogluon][release] Release AG container (#230)

* [release] Fixed release notes logic

* [release] Fixed release notes logic

* [release] Fixed release notes logic

Co-authored-by: Sergey Togulev <togulev@amazon.com>

* [release] Fix for imp_pip_packages (#231)

* [release] Fixed release notes logic

* [release] Fixed release notes logic

* [release] Fixed release notes logic

* [release] Fixed release notes logic

* [release] Fixed release notes logic

Co-authored-by: Sergey Togulev <togulev@amazon.com>

* Ag release (#232)

* [release] Fixed release notes logic

* [release] Fixed release notes logic

* [release] Fixed release notes logic

* [release] Fixed release notes logic

* [release] Fixed release notes logic

* [autogluon][build] Build AG 0.3.0

Co-authored-by: Sergey Togulev <togulev@amazon.com>

* [habana] fix pip check requirements (#225)

* habana sanity test

* reinstall boto3

* upgrade boto3

* remove comments

* revert temp configs

* [test] Merge testrunner from public (#234)

Co-authored-by: Sergey Togulev <togulev@amazon.com>

* SynapseAI v0.15.4 release updates (#233)

* SynapseAI v0.15.4 release updates
* SynapseAI v0.15.4 release updates

* Enable Habana Mode

* Revert "Enable Habana Mode"

This reverts commit 9ed1a8f58d2d5c71977ff0cc660e3228c3dd8874.

* [test] Building AG 0.2.1 (#236)

Co-authored-by: Sergey Togulev <togulev@amazon.com>

* Remove hb-torch & install into --user for python packages (#237)

* Remove hb-torch before installing AWS torch

* python packages to user space install

* add -y to uninstall

* enable habana mode

* disable habana mode

Co-authored-by: tejaschumbalkar <tejaschumbalkar@gmail.com>

* [build] habana build modifications (#238)

* habana build modifications

* run test safety

* make sanity test compatible with hpu processor

* fix sanity test

* sync up utility test changes from public repo

* address comments

* revert temp config

* release habana dlc to gamma stage (#243)

* [release] fix numbering on release_images.yml (#244)

* fix_numbering

* move syai inside of job_type

* remove PT1.7 and TF2.5 from release_images.yml (#245)

* Remove keras package before installing tensorflow (#247)

* Remove keras package before installing tensorflow

* Enable habana_mode

* run test safety

* disable habana mode

* revert safety test changes

Co-authored-by: tejaschumbalkar <tejaschumbalkar@gmail.com>

* Bump tensorflow in /test/sagemaker_tests/huggingface_tensorflow/training (#242)

Bumps [tensorflow](https://github.com/tensorflow/tensorflow) from 2.5.0 to 2.5.1.
- [Release notes](https://github.com/tensorflow/tensorflow/releases)
- [Changelog](https://github.com/tensorflow/tensorflow/blob/master/RELEASE.md)
- [Commits](https://github.com/tensorflow/tensorflow/compare/v2.5.0...v2.5.1)

---
updated-dependencies:
- dependency-name: tensorflow
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com>

* [hopper][build] Add hopper build code (#246)

* Merge master into private-master (#248)

* [test] Add hopper_mode to quick checks tests (#251)

* [test] Add efa test as placeholder (#185)

* [pytorch][sagemaker] PT 1.8.0 cu110 EFA support (#171)

* PT 1.7.1 cu110 EFA support

* rebase PT 1.7.1 dockerfile and add EFA to PT 1.8.0 dockerfile

* Install hwloc, dependency of smdataparallel

* Disabled smdataparallel integration test temporarily since current smdataparallel wheel is incompatible with EFA

* Updated EFA version to 1.11.2 which comes with MPI v4.1.0

* fix nccl version and add test

* update mpi

* fix style

* Fixed NCCL branch name and moved the Horovod installation before SM Distributed

* Disable the framework build and test which is not applicable to this PR

* fix failing test

* Add MPI flags for EFA

* Fixed pytorch nccl version test

* Fixed pytorch nccl version python test and disable fresh builds

* Disable new builds and enabled smdataparallel test

* Re-trigger CI

* Revert build config changes

Co-authored-by: Lai Wei <royweilai@gmail.com>
Co-authored-by: Akhil Mehra <armehra@amazon.com>

* [TensorFlow][Sagemaker] TF 2.4 cu110 EFA support (#172)

* TF 2.4 cu110 EFA support

* Added -g option for EFA installer

* Update NCCL installation

* Fixed NCCL installation

* Add constant at top

* Install hwloc, dependency of smdataparallel

* Disabled smdataparallel integration test temporarily since current smdataparallel wheel is incompatible with EFA

* Updated EFA version to 1.11.2 which comes with MPI v4.1.0

* update OPEN_MPI

* Install NCCL from source and updated the openMPI path

* Re-trigger CI

* Disable the framework build and test which is not applicable to this PR and added EFA related flag

* Fix mpi flag failure

* Add correct runtime MPI flags

* Add correct MPI flags, modify build config

* Disable new builds and Fixed SM Horovod test

* Enabled smdataparallel test

* Removed building NCCL with specific arch. Use default config which builds for all arch

* Revert build config changes

Co-authored-by: yselivonchyk <y.selivonchyk@gmail.com>
Co-authored-by: Akhil Mehra <armehra@amazon.com>

* Run PT to test EFA (#191)

add sanity efa test

* [pytorch] | [test] | [sagemaker] SMModel Parallel pytorch EFA tests on p3dn (#187)

SMModel Parallel pytorch EFA tests

Co-authored-by: Jeetendra Patil <jeet4320@users.noreply.github.com>
Co-authored-by: Karan Jariwala <karankjariwala@gmail.com>
Co-authored-by: Lai Wei <royweilai@gmail.com>
Co-authored-by: yselivonchyk <y.selivonchyk@gmail.com>

* [tensorflow] | [test] | [sagemaker] (#188)

add efa test for tf2

Co-authored-by: Jeetendra Patil <jeet4320@users.noreply.github.com>
Co-authored-by: Karan Jariwala <karankjariwala@gmail.com>
Co-authored-by: Lai Wei <royweilai@gmail.com>
Co-authored-by: yselivonchyk <y.selivonchyk@gmail.com>

* Run PT Rubik EFA test (#194)

* run pt efa rubik

* skip inference

* revert

* Run rubik efa tests on tf2 (#195)

* run rubik efa tests on tf2

* [test][sagemaker] Add reupload_image_to_test_ecr to SM tests conftest (#193)

* [PyTorch][test][sagemaker] EFA test for smdataparallel (#189)

EFA test for smdataparallel

* [habana] Placeholder for Build and Test Functionality for Habana (#197)

* [habana] build functionality

* modify habana dedicated flag

* enable habana build

* build config changes

* add pytorch and modify test configuration

* move build artifact

* test support for habana

* nit changes

* build changes

* nit change

* support for SM and benchmark

* address comments

* build eia and neuron

* enable new builds

* nit

* revert temp configs

* remove dead code from eks test

* [Habana] Add changeset logic (#198)

* changeset logic for habana

* enable habana mode

* test buildspec

* change dockerfiles

* disable habana mode and revert changes

* remove unwanted code

* [test] Run test using existing EC2 instance locally (#201)

* Run test using existing EC2 instance

* rename pytest fixture

* Removing any SM related installs from Dockerfile (#200)

* Removing any SM related installs

* Cleaned Dockerfile. Added 2.5 folder

Co-authored-by: Tejas Chumbalkar <34728580+tejaschumbalkar@users.noreply.github.com>

* [pytorch/tensorflow] Habana DLC python 3.7, OMPI in base installer and pytorch DLC fixes  (#202)

* Habana Pytorch DLC and OMPI Install In Habana Bases

* Fix docker path

* Rebased and added TF2.5

* Update pytorch to 0.15.0 synapse

* Updated Pytorch docker file (#204)

* Updated Pytorch docker file. Also updated buildspec to pull whl from s3 bucket

* Removed SM packages. Added a few more Python packages. Renamed folder to 0.15

* Minor fix in buildspec

* build habana images

* correct build config

* disable build config

Co-authored-by: tejaschumbalkar <tejaschumbalkar@gmail.com>

* Update buildspec.yml (#206)

Updated pytorch wheel.
Added HPUBase for test cases.

* SynapseAI 0.15.0 Release DLC Changes (#205)

* SynapseAI 0.15.0 Release

* Add example branch parse and Habana PR build

* Fix extra slash

* Revert ENABLE_HABANA_MODE

* [Habana][Build] Fix torchvision python version py37 (#207)

* Fix torchvision python version py37

* Updated h5py version to 3.1.0

* enable habana mode and disable test

* Using pypi package for torchvision

* add docker build artifacts

* add build artifacts references to buildspec

* revert config

Co-authored-by: tejaschumbalkar <tejaschumbalkar@gmail.com>

* SynapseAI v0.15.1 release updates (#208)

* SynapseAI v0.15.1 release updates

* build habana switch on

* fix pt parse

* ENABLE_HABANA_MODE=False

* Updating TF binaries with callback fixes (#210)

* Updating TF binaries with callback fixes

* Enabling Habana build

* Resetting ENABLE_HABANA_MODE=False

* SynapseAI v0.15.2 release updates (#209)

* SynapseAI v0.15.2 release updates
* SynapseAI v0.15.2 release updates

* Fix folder naming

* Re-Disable ENABLE_HABANA_MODE in build_config.py

* SynapseAI v0.15.2 release updates
* SynapseAI v0.15.2 release updates

* Fix folder naming

* Re-Disable ENABLE_HABANA_MODE in build_config.py

* Updating Torchvision binary (#211)

* Updating Torchvision binary as we need to build with the same setup as pytorch for compatibility

* Enabling Habana mode

* Reset ENABLE_HABANA_MODE=False

* SynapseAI v0.15.3 release updates (#213)

* SynapseAI v0.15.3 release updates
* SynapseAI v0.15.3 release updates

* Enable Habana Mode

* Disable Habana Mode

* address rebase modifications

* [DO NOT MERGE] [autogluon][build, test] Initial PR for training containers (#214)

* [autogluon][build, test] fixing instance types (#218)

* format ecr repo from image uri (#217)

* format ecr repo from image uri

* pytest markers for hpu test

* more markers

* nit habana changes

* [habana][build] fix docker entrypoint (#219)

* fix docker entrypoint

* revert habana mode

* Fixed version in autogluon buildspec (#215)

* Fixed version in autogluon buildspec

* Enabling sagemaker tests

* Enable building a new container

* Added MAJOR_VERSION into docker files, added autogluon_training fixture

* [autogluon][test] SageMaker remote mode tests

* [autogluon][test] removed datasets requirement

Co-authored-by: Sergey Togulev <togulev@amazon.com>
Co-authored-by: Alexander Shirkov <ashyrkou@amazon.com>

* [autogluon][test] tests fixes (#220)

* [autogluon][test] tests fixes

* [autogluon][test] tests fixes

* [autogluon][test] removed jupyter dependencies leftovers

* [autogluon][test] removed jupyter dependencies leftovers

* [autogluon][test] version checks fixes

* [autogluon][test] pip check fixes

* [autogluon][test] pip check fixes

* [autogluon][test] sm_local tests fixes

* [autogluon][test] sm_local tests fixes

* [autogluon][test] applied pillow security fixes to autogluon

* [autogluon][test] removed jupyter dependencies leftovers

* [build][test]Rolling back default parameters changes (#224)

* Rolling back default parameters changes

* [autogluon][test] test fixes

Co-authored-by: Sergey Togulev <togulev@amazon.com>
Co-authored-by: Alexander Shirkov <ashyrkou@amazon.com>

* [autogluon][test]Fixes for AG sanity tests (#226)

Co-authored-by: Sergey Togulev <togulev@amazon.com>

* [release] Fixed release notes logic (#228)

Co-authored-by: Sergey Togulev <togulev@amazon.com>

* [release] Fix for imp_pip_packages (#231)

* [release] Fixed release notes logic

* [release] Fixed release notes logic

* [release] Fixed release notes logic

* [release] Fixed release notes logic

* [release] Fixed release notes logic

Co-authored-by: Sergey Togulev <togulev@amazon.com>

* [habana] fix pip check requirements (#225)

* habana sanity test

* reinstall boto3

* upgrade boto3

* remove comments

* revert temp configs

* SynapseAI v0.15.4 release updates (#233)

* SynapseAI v0.15.4 release updates
* SynapseAI v0.15.4 release updates

* Enable Habana Mode

* Revert "Enable Habana Mode"

This reverts commit 9ed1a8f58d2d5c71977ff0cc660e3228c3dd8874.

* Remove hb-torch & install into --user for python packages (#237)

* Remove hb-torch before installing AWS torch

* python packages to user space install

* add -y to uninstall

* enable habana mode

* disable habana mode

Co-authored-by: tejaschumbalkar <tejaschumbalkar@gmail.com>

* [build] habana build modifications (#238)

* habana build modifications

* run test safety

* make sanity test compatible with hpu processor

* fix sanity test

* sync up utility test changes from public repo

* address comments

* revert temp config

* remove PT1.7 and TF2.5 from release_images.yml (#245)

* Remove keras package before installing tensorflow (#247)

* Remove keras package before installing tensorflow

* Enable habana_mode

* run test safety

* disable habana mode

* revert safety test changes

Co-authored-by: tejaschumbalkar <tejaschumbalkar@gmail.com>

* [hopper][build] Add hopper build code (#246)

* Merge master into private-master (#248)

* [test] Add hopper_mode to quick checks tests (#251)

* followup sync changes

* [hopper][build] sync hopper dockerfiles with huggingface dockerfiles (#254)

* [hopper][build] sync hopper dockerfiles with huggingface dockerfiles

* Enable hopper mode

* Fix bug with CI for Hopper

* Use py38 wheel and disable debug env vars

* Update xla wheel and set buildspec correctly for hopper

* Fix framework path and artifact name

* Fix framework version path

* Disable hopper mode

Co-authored-by: Sai Parthasarathy Miduthuri <saimidu@amazon.com>

* [hopper][build] Add more wheels for hopper (#258)

* buildspec and status modifications (#261)

* [hopper][pytorch][test] Fix horovod tests (#266)

* Reinstall horovod for hopper

* Enable hopper mode

* Remove hopper dedicated

* Revert hopper dedicated

* Update dlc_developer_config.toml

* [hopper][test] Fix getting framework for hopper (#265)

* [hopper][test] Fix getting framework for hopper

* Add dummy change to trigger build

* Add dummy change in buildspec to trigger build

* Add dummy change in dockerfile

* Remove hopper dedicated

* Update main.py

* Update main.py

* Update main.py

* Remove dummy changes

* Update dlc_developer_config.toml

* [hopper][pytorch][build] Update transformers wheel (#267)

* [hopper][pytorch][build] Update transformers wheel to the latest (#269)

* [hopper][pytorch][build] Update hopper wheels (#270)

* [habana] fix pip check and unpin werkzeug package (#271)

* unpin werkzeug package

* install latest version

* fix rebase changes

* fix pip check

* revert temp config

* install typing

* build habana dlc

* revert temp changes

* release PT1.9 diy/sm (#272)

* [release] adjust customer_type for diy/sm (#273)

* adjust customer_type

* adjust customer_type

* nit change

* remove neuron (#274)

* add habana packages to release page (#241)

* [hopper][build][pytorch] Update hopper pytorch wheels (#275)

* Update hopper pytorch wheels

* [hopper][build][pytorch] Update transformers wheel (#276)

* [hopper][build][pytorch] Update transformers wheel

* [hopper][build][pytorch] Update transformers wheel (#278)

* [hopper][build][pytorch] Update transformers wheel

* Disable hopper mode

* Synch HF images from public (#281)

Co-authored-by: Sergey Togulev <togulev@amazon.com>

* [hopper][build][pytorch] Upgrade transformers to 11.0 (#282)

* Upgrade transformers to 11.0

* Update transformers version

* Disable hopper mode

* trigger builds

* retrigger builds

Co-authored-by: tejaschumbalkar <tejaschumbalkar@gmail.com>

* [hopper][huggingface_tensorflow][huggingface_pytorch][build][test] Build and test Hopper images with sm pysdk (#280)

* Added the changes to build hopper images with sm pysdk

* Added the tests to run using sm pysdk

* Added debug lines

* Run SM local tests and address comments

* Deactivated ecs and eks tests.

* Reverting the dev config changes

* [test][sagemaker] Make PySDK binary selection logic generic for the SM tests and SM local tests (#283)

* Make PySDK binary selection logic generic for the SM and SM local tests

* Make hopper mode true

* Revert the changes

* [hopper][build][pytorch][tensorflow] Update fw wheels with init changes (#284)

* [hopper][build] Update fw wheels with init changes

* Enable test flags

* Fix typo

* Disable test flags

* [hopper][build][pytorch] Fix Hopper DT NaN issue (#288)

* Fix Hopper DT NaN issue

* Update dlc_developer_config.toml

Co-authored-by: pinaraws <47152339+pinaraws@users.noreply.github.com>

* [hopper][build][pytorch][tensorflow] Fix licence files (#289)

* [hopper] [build] [pytorch] Updating SM trcomp PT wheels for DT support (#293)

* Updating SM trcomp PT wheels for DT support

* Update dlc_developer_config.toml

Co-authored-by: pinaraws <47152339+pinaraws@users.noreply.github.com>

* [hopper][build][pytorch] Include examples dir in transformers wheel (#291)

* Include examples dir in transformers wheel

* Update transformers wheel

* Update dlc_developer_config.toml

Co-authored-by: pinaraws <47152339+pinaraws@users.noreply.github.com>

* [hopper] [test] [sagemaker] Adding tests targeting the SM Training Compiler integrated containers Private master (#286)

* Fix bugs in framework init functions. +new Fx Wheels for HF-trcomp
Create remote and local test for HF-PT-trcomp
Create remote tests for HF-TF-trcomp
Make tests shorter

* Added handlers for non implemented tests

* Updating HF-trcomp tests to look for log messages in the training logs indicating trcomp has been engaged

* Fix for smdebug EC2 test.

* Adding HF-PT-trcomp tests to test different trcomp configs. Porting testing to work with HF-TF-trcomp.

* Finalizing HF-trcomp tests
Fixed HF-TF-trcomp build recipe. Add redundancy to all trcomp build recipes
Fixing test dependencies

* Increasing retries for HF trcomp tests

* Skipping HF-PT-trcomp local test since it hangs. Will fix later

* Reverting test mode

Co-authored-by: Sergey Togulev <togulev@amazon.com>

* [test] Fix smart retry benchmark tests (#1452) (#296)

* Fix for multithreading error in SM local tests

* Rollback dlc_developer_config changes

* Fix for SM local tests

* Rolled back dev_config changes

* Fix for multithreading error in SM local tests

* Rollback dlc_developer_config changes

* Fix for SM local tests

* Rolled back dev_config changes

* Fix for smart retry benchmark tests

Co-authored-by: Sergey Togulev <togulev@amazon.com>
(cherry picked from commit df440538a7c5f580301c5f3a1c56c14beab48821)

Fix smart retry (#1451)

* Fix for multithreading error in SM local tests

* Rollback dlc_developer_config changes

* Fix for SM local tests

* Rolled back dev_config changes

Co-authored-by: Sergey Togulev <togulev@amazon.com>
(cherry picked from commit 97fb152a7022f252d4349742cbc7d7c3bc0af9a6)

[test] Smart retry functionality (#1414)

* check pytest cache

* enable builds

* enable builds

* enable builds

* enable builds

* disable builds

* disable builds

* enable builds

* Added -p to mkdir

* Using dynamic obj name

* Added try-catches

* Moved everything to separate functions

* Fixed a small bug

* Removed separate functions

* Removed separate functions

* Fixed bugs

* Fixed bugs

* Fixed bugs

* Added tests for sagemaker

* Typo fix

* Added last-failed for sagemaker

* Fixing sm-local tests

* Removed json

* updated ec2 commands

* using string in threads pool instead of dict

* moved to p.map again

* moved to p.map again

* Rolled back dev_config changes

* Fixed sm-local tests

* Fixed sm-local tests

* Fixed sm-local tests

* refactored pytest_cache.py

* fixed a bug

* removed code for sagemaker remote tests

* rolled back dev config

* A few changes after the review

* A few changes after the review

* Fixed a typo

* Added account number parameter

* Refactored utils instantiating

* A few NITs

Co-authored-by: Sergey Togulev <togulev@amazon.com>

(cherry picked from commit 5938a87927cbd7c4500a04a98c2d58dea82d3dad)

Co-authored-by: Sergey Togulev <togulev@amazon.com>

* Fix for smart retry (#300)

Co-authored-by: Sergey Togulev <togulev@amazon.com>

* [trcomp] [build] Fixing debug artifact path for trcomp (#299)

* [trcomp] [build] Fixing debug artifact path for trcomp

* fix: Adding additional checks to trcomp HF-PT debug tests to ensure debug artifacts are uploaded.

* Reverting PR test config

* [hopper][build][pytorch] Fix transformers gradient clipping issue (#304)

* Fix transformers gradient clipping issue

* Trigger build

* Use pipeline-built transformers wheel

* Update dlc_developer_config.toml

Co-authored-by: pinaraws <47152339+pinaraws@users.noreply.github.com>

* release_images.yml with hopper images (#306)

Co-authored-by: Sergey Togulev <togulev@amazon.com>

* [release] Release trcomp (#307)

* release_images.yml with hopper images

* Added trcomp

Co-authored-by: Sergey Togulev <togulev@amazon.com>

* [hopper][build][pytorch] Add distributed training entry point (#308)

* [hopper][build][pytorch] Add distributed training entry point

* Disable tests

* Skipping benchmark tests for trcomp containers (#309)

Co-authored-by: Sergey Togulev <togulev@amazon.com>

* [tensorflow][build][test] Tensorflow2.6 with SM PySDK keynote3 (#287)

* Tensorflow2.6 with SM PySDK keynote3

* Adding leftover changes

* Increase image size

* Use partially complete keynote3 PySDK

* Added changes to pass pr quick checks

* Minor fix for sanity and quick checks

* Fixing the download path

* Log absolute path

* Fixing the path for pr checks

* Reformatted using black -l 120

* Addressed comments

* Increased image size

* After the latest wheel release

* [config] Fix `do_build` config option (#1494)

* Set do_build as false

* Sync the cpu dockerfile with public master

* Added the keras version pinning

* Minor fix

* Pinned tensorflow io

* Make gpu dockerfile same as public with pinned tfio

* Install new sm binaries

* Added the increased sizes

* Added changes for tf2.6.2

* Make image baseline 8000

* Changed the tf2.6.2 binaries to many_linux latest

* Revert dlc developer config

Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com>

* Skipping sm debugger tests for trcomp containers (#310)

Co-authored-by: Sergey Togulev <togulev@amazon.com>

* add graviton support (#313)

* revert graviton release specs (#314)

* [trcomp][build][pytorch] Fix distributed training entry point (#315)

* [trcomp][build][pytorch] Fix distributed training entry point

* Skipping sm debugger tests for trcomp containers

Co-authored-by: Sergey Togulev <togulev@amazon.com>
Co-authored-by: Sergey Togulev <34056697+SergTogul@users.noreply.github.com>

* [build]|[test]|[tensorflow] Made changes to build TF2.6.2 with SmPySDK and Boto (#316)

* Made changes to build TF2.6.2 with SmPySDK and Boto

* Revert temp changes

* Added sanity check tests

* release graviton for gamma testing (#317)

* [huggingface-neuron] Update release_images.yml (#318)

* Update release_images.yml (#319)

* Update release_images.yml

For hf neuron for the time being have disable_sm_tag to True

* Update release_images.yml

Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com>

* [trcomp] [pytorch] [build] Defaulting GPU_NUM_DEVICES to 1 (#321)

* [trcomp] [pytorch] [build] Defaulting GPU_NUM_DEVICES to 1

* [trcomp] [pytorch] [test] Testing default value of GPU_NUM_DEVICES

* Reverting PR config

* Upgrade pillow in TF hopper container (#322)

Co-authored-by: Sergey Togulev <togulev@amazon.com>

* Pillow fix (#323)

* Upgrade pillow in TF hopper container

* fixed a typo in a dockerfile

Co-authored-by: Sergey Togulev <togulev@amazon.com>

* [trcomp] [pytorch] [build] Fixing CVEs (#324)

* [trcomp] [pytorch] [build] Fixing CVEs

* Skipping not needed frameworks

* Removing hf-pt to trigger hopper tests

* Trying to execute hopper tests

* Skipping not needed frameworks

* Fixed dependency check issues self-discovery

* Added print for debugging

* [trcomp] [pytorch] [build] Fixing CVE in bokeh

* Moved bokeh installation into a different block

* Removed temp logging

* [trcomp] [pytorch] [build] Fixing CVE in numpy and ipython

* Rollback temp changes

Co-authored-by: Sergey Togulev <34056697+SergTogul@users.noreply.github.com>
Co-authored-by: Sergey Togulev <togulev@amazon.com>

* Bump tensorflow in /test/sagemaker_tests/huggingface_tensorflow/training (#295)

Bumps [tensorflow](https://github.com/tensorflow/tensorflow) from 2.5.1 to 2.5.2.
- [Release notes](https://github.com/tensorflow/tensorflow/releases)
- [Changelog](https://github.com/tensorflow/tensorflow/blob/master/RELEASE.md)
- [Commits](https://github.com/tensorflow/tensorflow/compare/v2.5.1...v2.5.2)

---
updated-dependencies:
- dependency-name: tensorflow
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Tejas Chumbalkar <34728580+tejaschumbalkar@users.noreply.github.com>

* [trcomp] [pytorch] [build] Fixing perf issues in g4dn instances (#325)

* [trcomp] [pytorch] [build] Fixing perf issues in g4dn instances

* Revert PR check config

Co-authored-by: Sergey Togulev <34056697+SergTogul@users.noreply.github.com>

* [test][sanity] Removed temp changes from test runner (#327)

* [trcomp] [pytorch] [build] Fixing CVEs

* Skipping not needed frameworks

* Removing hf-pt to trigger hopper tests

* Trying to execute hopper tests

* Skipping not needed frameworks

* Fixed dependency check issues self-discovery

* Added print for debugging

* [trcomp] [pytorch] [build] Fixing CVE in bokeh

* Moved bokeh installation into a different block

* Removed temp logging

* Rollback temp changes

* Rollback temp changes

Co-authored-by: Loki <lokravi@amazon.com>
Co-authored-by: Sergey Togulev <togulev@amazon.com>

* Using pypi sagemaker (#332)

Co-authored-by: Sergey Togulev <togulev@amazon.com>

* Merging from PUBLIC (#333)

* Merging from PUBLIC

* Fixed docker login

* Fixed parameter passing

* Fixed import

* Fixed sm_helper import

* Rollback config changes

Co-authored-by: Sergey Togulev <togulev@amazon.com>

* [Trcomp][CI] logic change copied from PR331 (#337)

* [Trcomp][CI] logic change copied from PR331

* comment out failed dockerfile commands

* revert dev config

* update dev config

* address comments

* set dev config

* fix typo

* update

* remove sagemaker test skip

* sync with PUBLIC

* remove unwanted habana test

* revert dev config

* remove sagemaker test skip for pytorch trcomp

Co-authored-by: tejaschumbalkar <tejaschumbalkar@gmail.com>

* [trcomp] [pytorch] [build] Adding support for PyTorch 1.10 (#329)

* [trcomp] [pytorch] [build] Adding support for PyTorch 1.10

* Setting developer config for PR validation tests

* [trcomp] [pytorch] [build] Release PyTorch 1.10.0

* [trcomp] [pytorch] [build] Adding common training dependencies

* [trcomp] [pytorch] [test] Changing tests to reflect changes to HF logging in 4.16.2

* [trcomp] [pytorch] [build] Adding common training dependencies

* [trcomp] [pytorch] [build] Upgrading PT from 1.10.0 to 1.10.2

* [trcomp] [pytorch] [build] Adding torchaudio binaries

* [trcomp] [pytorch] [build] Updating NCCL version in binaries

* [trcomp] [pytorch] [test] Adding back skip markers after bad merge

* [trcomp] [pytorch] [build] Updating torch version to reflect X.Y.Z+cuABC

* [trcomp] [pytorch] [build] Fixing numpy version to fix dependency for package numba

* fix sanity failures

* rename dockerfile

* remove duplicate test skip logic

* update e3 test skip logic

* fix sagemaker test directory

* fix sanity test

* enable ec2 test run and fix smdebug test

* nit change

* fix framework name

* fix variable name

* [trcomp] [test] Removing/Replacing internal code names

* [trcomp] [pytorch] [build] Fixing GPU_NUM_DEVICES issue with Distributed Training

* [trcomp] [pytorch] [build] Adding support for G5 instances with A10 GPUs

* Reverting developer config

Co-authored-by: tejaschumbalkar <tejaschumbalkar@gmail.com>
Co-authored-by: Qingzi-Lan <qingzila@amazon.com>

* [trcomp][build] fix the base image version for TF 2.6.3 (#331)

* fix the base image version

* update dev config

* upgrade numpy & openssl

* downgrade numpy to 1.21

* fix sanity tests

* enable ec2 test

* update ec2 test skip logic

* update dockerfile name logic

* update

* update

* update

* fix typo

* update

* update

* update

* fix typo

* skip horovod test

* update

* update dev config

* fix sagemaker test path

* update sagemaker test skip fixture

* update

* update dev config

* revert dev config

Co-authored-by: Qingzi-Lan <qingzila@amazon.com>
Co-authored-by: Qingzi-Lan <83724147+Qingzi-Lan@users.noreply.github.com>

* [release] release HF Trcomp TF2.6.3 & PT 1.10.2 (#338)

* release HF Trcomp TF2.6.3 & PT 1.10.2

* backup previous release_images.yml

* Sync eks infrastructure changes (#340)

* Graviton eks infrastructure (#1579)

* initial commit

* add pre-deploy

* add nodegroup support

* modify eks buildspec

* build a cluster

* add kubeconfig

* nit change

* revert temp changes

* explicitly set managed node

* remove managed option

* add option to upgrade nodegroup

* nit change

* template update

Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-219.us-west-2.compute.internal>
Co-authored-by: Qingzi-Lan <83724147+Qingzi-Lan@users.noreply.github.com>
Co-authored-by: Qingzi-Lan <qingzila@amazon.com>

* [eks] Upgrade EKS nodegroups and enable eks test for graviton (#1821)

* ung

* enable eks test for graviton

* build image

* disable config

* deploy graviton nodegroups

Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-219.us-west-2.compute.internal>
Co-authored-by: Qingzi-Lan <83724147+Qingzi-Lan@users.noreply.github.com>
Co-authored-by: Qingzi-Lan <qingzila@amazon.com>

* upgrade nodegroup (#341)

* Merge from PUBLIC repo @ef69cf4 (#339)

* test merge from PUBLIC

* trigger test

* update dev config

* revert dockerfile change

* change dockerfile

* update utils

* debug modified dockerfile regexp

* debug github handler file changed

* revert debug info, and force to_build to true

* enable habana build

* fix merge error

* restore files from PUBLIC

* revert dev config and "changeset limited to 20 files" workaround

* [build] Find buildspecs using configured env vars (#366)

* [pytorch][build] Remove patch version from buildspec file name (#376)

* Sync from public repo (#387)

* release pt-1.10.0 (#1616)

* release pt-1.10.0

Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com>

* [huggingface_pytorch][NEURON][build] Huggingface Neuron inference DLC (#1578)

Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com>
Co-authored-by: Venky Natham <vrnatham@amazon.com>

* [build][graviton][mxnet][pytorch] fix graviton image build (#1618)

* fix graviton image build

* revert dev config

* Run dependency check on HF neuron images (#1622)

* [tensorflow][test][benchmark] Makeshift fix for flaky benchmark tests (#1575)

* Makeshift fix for flaky benchmark tests

* Shifted the if condition

* Reverting change

* Removing unnecessary import

* reverting temp changes

* Add support for multistage dockerfiles for e3/sagemaker (#1532)

* Exclude dependency check library from tool (#1611)

* [MXNet][build][test] Release MX 1.9.0 inference & training binaries (#1217)

Co-authored-by: Sai Parthasarathy Miduthuri <saimidu@amazon.com>
Co-authored-by: Wei Chu <weichu@amazon.com>
Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com>

* Update release images for MX1.9 (#1639)

* Run MX sagemaker benchmarks on SM images (#1640)

* [test][sagemaker]Sm remote smart retry (#1573)

* Refactored mxnet sm multi-region tests

* Rollback devconfig changes

* Update SM smart retry

* converting custom_cache_directory to string

* converting custom_cache_directory to string

* converting custom_cache_directory to string

* upload cache to s3

* upload cache to s3

* upload cache to s3

* upload cache to s3

* upload cache to s3

* added broken test

* added broken and working tests

* added broken and working tests

* added broken and working tests

* Fixed bug

* Fixed bug

* Revert temp changes

* Fixed bug

* Rolled back temp changes

* Added a few comments

* A few edits after review

* Rolled-back temp changes

Co-authored-by: Sergey Togulev <togulev@amazon.com>

* [doc] Added NVIDIA Triton inference containers to available images (#1591)

* [NEURON][TEST] - Update the manifest for 1.17.0 release (#1632)

* [neuron][huggingface] Update MMS version in HF Neuron DLCs (#1644)

* support py38 in MX sagemaker tests (#1652)

* Update MX 1.9 example images (#1654)

* Update numpy version in MX images (#1656)

* Pin numpy to <1.20 in MX 1.9 images (#1657)

* Pin numpy to <1.20 in MX 1.9 images

* update buildspec

* Habana Synapseai v1.2.0 dockerfiles (#1627)

* Habana 1.1.1 release update
* Update docker image path to 1.1.1 release docker
* Added 1.9.1 pytorch
* Added 2.7.0 tensorflow

* Turn on habana_mode=true

* update framework binaries

* update dockerfile to py38+ul20

* Fix Pytorch docker container path

* update license files

* Update 1.2.0 links

* update binaries for PT1.10

* update pt binaries

* remove pytorch_binary from buildspec

* Remove dataclass/typing workaround from previous releases

* fix few build failures

* Unpin Pillow package and fix dataclass/typing on 2.7 instead of 2.5

* unpin request

* allow openssl cve

* update tf wheel with tensorflow-cpu

* fix security issue

* nit change

* revert developer config

Co-authored-by: Wei Chu <weichu@amazon.com>
Co-authored-by: tejaschumbalkar <tejaschumbalkar@gmail.com>
Co-authored-by: Tejas Chumbalkar <34728580+tejaschumbalkar@users.noreply.github.com>

* [NEURON][BUILD][MX] - update to sdk1.17.0 (#1636)

* [NEURON][BUILD][TF2.5] - update to use sdk1.17.0 and also tf2.5.2 (#1635)

* Release MX inference images for MXNet 1.9 (#1662)

* update available_images.md for MX1.9 (#1655)

* [NEURON][BUILD][PT] - move to sdk1.17.0 and also use pytorch 1.10.1 (#1634)

* [NEURON][RELEASE] - Update yml file to add PT1.10.1 and TF2.5.2 (#1668)

* Release Neuron Images for sdk1.16.0

Release PT1.9.1, TF1.15.5, TF2.5.1, MX:1.8.0

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* don't look for sm tag

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* Add neuron release 1.16.1 version

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* add neuron release 1.16.1

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* update available images for neuron

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* fix md file to have py37 for pt

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* add old neuron versions

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* Release PT1.10.1 and TF2.5.2 Neuron DLC

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* add to release_images.yml

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* add mxnet

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* Update release_images.yml

* Update .release_images_template.yml

Co-authored-by: Sai Parthasarathy Miduthuri <54188298+saimidu@users.noreply.github.com>
Co-authored-by: Tejas Chumbalkar <34728580+tejaschumbalkar@users.noreply.github.com>
Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com>

* [NEURON][BUILD][TF] - Upgrade tf1.15.5 to use the neuron sdk 1.17.0 (#1642)

* Release neuron sdk1.17.0 version of tf1.15.5 dlc (#1673)

* Release Neuron Images for sdk1.16.0

Release PT1.9.1, TF1.15.5, TF2.5.1, MX:1.8.0

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* don't look for sm tag

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* Add neuron release 1.16.1 version

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* add neuron release 1.16.1

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* update available images for neuron

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* fix md file to have py37 for pt

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* add old neuron versions

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* Release PT1.10.1 and TF2.5.2 Neuron DLC

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* add to release_images.yml

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* add mxnet

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* Update release_images.yml

* Update .release_images_template.yml

* release neuron sdk 1.17.0 version of tf1.15.5

Signed-off-by: Venky Natham <vrnatham@amazon.com>

Co-authored-by: Sai Parthasarathy Miduthuri <54188298+saimidu@users.noreply.github.com>
Co-authored-by: Tejas Chumbalkar <34728580+tejaschumbalkar@users.noreply.github.com>
Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com>

* tensorflow_serving 2.8 e3 inference container  (#1671)

* add wip dockerfiles

* add tensorflow_model_server

* update ci instructions

* update tensorrt

* change pyversion in buildspec; rm files in /tmp for stray_file_test

* update cve allow list

* revert dev config

* update tmp file delete

* revert dev config, add tf27 buildspec

Co-authored-by: Sergey Togulev <34056697+SergTogul@users.noreply.github.com>
Co-authored-by: Qingzi-Lan <qingzila@amazon.com>
Co-authored-by: Qingzi-Lan <83724147+Qingzi-Lan@users.noreply.github.com>

* Add e3 dockerfiles for tensorflow 2.8 (#1647)

* add tf28 e3 container dockerfiles

* update buildspec

* use numpy tensorflow dependency

* update buildspec to reflect change of python version

* Update buildspec.yml

* install cudnn-dev

* update cudnn

* fix typo

* enable safety scan

* update horovod installation

* A few security upgrades

* upgrade pillow to 9.0.1
* urllib3 to the latest
* ignore numpy false positive vulnerability

* Fixed urllib constraint

* Skipped a couple of safety tests

* Turn off safety scan

* update wheel

* remove temporary pem file

* revert dev config

* set dev config with safety check

* revert dev config

Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com>
Co-authored-by: Sergey Togulev <34056697+SergTogul@users.noreply.github.com>
Co-authored-by: Sergey Togulev <togulev@amazon.com>
Co-authored-by: Qingzi-Lan <qingzila@amazon.com>
Co-authored-by: Qingzi-Lan <83724147+Qingzi-Lan@users.noreply.github.com>

* Bump tensorflow in /test/sagemaker_tests/huggingface_tensorflow/training (#1677)

Bumps [tensorflow](https://github.com/tensorflow/tensorflow) from 2.5.2 to 2.5.3.
- [Release notes](https://github.com/tensorflow/tensorflow/releases)
- [Changelog](https://github.com/tensorflow/tensorflow/blob/master/RELEASE.md)
- [Commits](https://github.com/tensorflow/tensorflow/compare/v2.5.2...v2.5.3)

---
updated-dependencies:
- dependency-name: tensorflow
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* use js_import instead of js_include for TF serving nginx configuration (#1666)

* tf2.7 inf build

* fix buildspec

* nginx configuration

* revert config

* remove -

* rename tfs file

* nginx configuration

* remove js_content

* manage export statement

* fix nginx errors

* Enabling safety test

* revert temp changes

* address comments

* nit change

* change file name

* adjust file name

* nit change

* enable inference build

* revert buildspecfile changes

Co-authored-by: Sergey Togulev <34056697+SergTogul@users.noreply.github.com>

* [Habana]|[Build]|[Test] Enable Safety Scan Ignore list for Habana numpy issues (#1678)

* Enable Safety Scan Ignore list for Habana numpy issues

* Changed the ignore messages

* Reverted developer config changes

Co-authored-by: Shantanu Tripathi <trshanta@amazon.com>

* add TF2.8 in release images (#1676)

* Release neuron sdk 1.17.0 tf1.15.5 (#1681)

* Release Neuron Images for sdk1.16.0

Release PT1.9.1, TF1.15.5, TF2.5.1, MX:1.8.0

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* don't look for sm tag

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* Add neuron release 1.16.1 version

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* add neuron release 1.16.1

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* update available images for neuron

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* fix md file to have py37 for pt

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* add old neuron versions

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* Release PT1.10.1 and TF2.5.2 Neuron DLC

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* add to release_images.yml

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* add mxnet

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* Update release_images.yml

* Update .release_images_template.yml

* release neuron sdk 1.17.0 version of tf1.15.5

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* release neuron sdk 1.17.0 based tf1.15.5

Signed-off-by: Venky Natham <vrnatham@amazon.com>

Co-authored-by: Sai Parthasarathy Miduthuri <54188298+saimidu@users.noreply.github.com>
Co-authored-by: Tejas Chumbalkar <34728580+tejaschumbalkar@users.noreply.github.com>
Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com>

* Add support for sagemaker-like E3 tag (#1672)

* [canary] Update python versions for TF canaries (#1682)

* fix TF tests issues (#1684)

* Habana DLC Perf/TestSuite TF/PT tests -- gaudi test suite (#1567)

* Habana DLC Perf/TestSuite TF/PT tests
* Add Habana DLAMI Tensorflow Performance Benchmarks
* Add Habana DLAMI PyTorch Performance Benchmarks
* Add Habana DLAMI Tensorflow Test Suite
* Add Habana DLAMI PyTorch Test Suite

* Apply gaudi-test-suite to test bert, rn50, maskrcnn, framework, etc.

* Test cleanup and exit code fix

* Fix gaudi-test-suite branch name

* To extract the Throughput correctly

* Update scripts for 1.2.0 release

* Add tf requirement installation

* Remove comments

* fix test scripts

* enable habana mode

* configure git creds

* build habana images

* adjust test dir

* run benchmark tests

* fix docker command

* update pt binary

* build new image

* use dedicated github branch

* nit change

* pin pt setuptools

* pin setuptools

* fix log file

* fix benchmark test

* awscli support

* fix dep check

* nit changes

* run benchmark test

* adjust pytest timeout for habana

* turn off benchmark mode

* add habana fixture

* run benchmark test

* increase timeout

* revert temp config

* increase timeout to 5hr

* build image

* run benchmark test

* revert temp config

Co-authored-by: Sergey Togulev <34056697+SergTogul@users.noreply.github.com>
Co-authored-by: Buke Ao <bukeao@bukeao-vm.habana-labs.com>
Co-authored-by: Anny Chung <achung@habana.ai>
Co-authored-by: tejaschumbalkar <tejaschumbalkar@gmail.com>
Co-authored-by: Tejas Chumbalkar <34728580+tejaschumbalkar@users.noreply.github.com>
Co-authored-by: Wei Chu <weichu@amazon.com>

* change cudnn version for tf2.8 for compatibility with p2 instances (#1688)

* update cudnn version

* update buildspec

* test on p2 instance

* revert dev config

Co-authored-by: Qingzi-Lan <qingzila@amazon.com>

* Habana release v1.2 images for TF and PT (#1687)

* release v1.2

* nit

* habana release v1.2 (#1691)

* Bump numpy in /test/sagemaker_tests/pytorch/inference (#1679)

Bumps [numpy](https://github.com/numpy/numpy) from 1.16.4 to 1.21.0.
- [Release notes](https://github.com/numpy/numpy/releases)
- [Changelog](https://github.com/numpy/numpy/blob/main/doc/HOWTO_RELEASE.rst.txt)
- [Commits](https://github.com/numpy/numpy/compare/v1.16.4...v1.21.0)

---
updated-dependencies:
- dependency-name: numpy
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com>

* [NEURON][BUILD][HF] - move hf neuron dlc to use latest sdk (#1669)

* [NEURON][BUILD][HF] - use ubuntu18 (#1700)

* use ubuntu18

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* enable test

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* remove libtinfo6 install as that is specific to u20

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* Update dlc_developer_config.toml

Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com>

* [NEURON][BUILD][TF] - Move tf2.5.2 neuron to sdk 1.17.1 (#1696)

* [NEURON][BUILD][MX] - Move to neuron sdk1.17.1 (#1698)

* [NEURON][BUILD][PT] - Move pt1.10 to neuron sdk1.17.1 (#1699)

* [NEURON][BUILD][TF] - Move tf1.15.5 to use neuron sdk 1.17.1 (#1697)

* Release neuron sdk 1.17.1 version (#1702)

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* [doc] Update available images for neuron sdk release 1.17.1 (#1703)

* Add release images definition for HF PyTorch Neuron (#1694)

* [PyTorch E3] PT 1.10.2 DLC release (#1683)

* pt1.10.2

* add dgl

* update vision binaries

* update numpy and pillow versions

* fix numpy 1.22.0 installation

* update versions for cpu

* pin ipython version

* fix ipython installation

* update dgl pt container tests

* config for e3 only

* pin numpy version

* skip CVE 44463

* fix format

* update dev config

* update dev config

* disable dgl

* disable dgl cpu test for eks

* revert graviton changes

* revert sagemaker wheel

* remove pt1.10.0 buildspec

* revert dev config

* Update dlc_developer_config.toml

Co-authored-by: Wei Chu <weichu@amazon.com>
Co-authored-by: Qingzi-Lan <83724147+Qingzi-Lan@users.noreply.github.com>
Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com>

* Add tf2.7 training sagemaker dockerfiles (#1628)

* add tf2.7 sagemaker dockerfiles

* update buildspec

* remove non-compatible python packages

* add dependencies for kebros

* use manylinux wheels; add sagemaker dockerfiles

* update horovod installation env vars

* update horovod installation script

* use numpy as tensorflow dep

* Update buildspec.yml

* install boost

* increase image size limit

* update pillow and add docker labels

* use wheels from smdebuggers pipelines

* fix sanity test

* add labels for tf 2.7 sm cpu

* rerun

* build+rerun

* reinstall horovod cpu

* install smdebug directly from tag

* fix typo;

* Revert "fix typo;"

This reverts commit c5bd300d2141a91ac4f3f1d6d13711aa975370cb.

* Revert "install smdebug directly from tag"

This reverts commit c51ef6b95b20de6f65397f34a29806ab77c03461.

* Executing safety check in PR

* install smdebug directly from the branch

* bump up tensorflow to 2.7.1

* install higher version of tensorflow-io to avoid overriding tensorflow

* Ignoring a false positive vulnerability

* install tfds

* change pytest commands

* do not install dependencies as they have been installed in the dockerfiles

* add SAGEMAKER_TRAINING_MODULE environment variable

* remove pem file in tmp folder

* update sagemaker-tensorflow

* add smdataparallel

* revert rm /tmp

* remove /tmp/git-secrets

* experiment with an smdebug fix

* Revert "experiment with an smdebug fix"

This reverts commit b19ee8347ed6208ff9c2ac81d489dba785632199.

* skip test_keras_mirrored.py

* fix error in buildspec

* Revert "fix error in buildspec"

This reverts commit b973fa415e324a6f63d1fb816b22848e35600934.

* revert developer_config

* fix buildspec

* fix buildspec

* fix py version

* revert buildspec to mainline

Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com>
Co-authored-by: tejaschumbalkar <tejaschumbalkar@gmail.com>
Co-authored-by: Sergey Togulev <34056697+SergTogul@users.noreply.github.com>
Co-authored-by: Sergey Togulev <togulev@amazon.com>
Co-authored-by: Qingzi-Lan <qingzila@amazon.com>
Co-authored-by: Qingzi-Lan <83724147+Qingzi-Lan@users.noreply.github.com>

* pt1.10.2 release images (#1706)

* pt1.10.2 release images

* add example

* TF2.8: Clean up dockerfiles, update HVD test (#1693)

* update pt1.10.2 release images (#1707)

* update pt1.10.2 release images

* Update release_images.yml

Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com>

* [Build][tensorflow] fix TF27 GPU CVE-2022-24407 (#1710)

* test

* update

* update

* update

* should fail

* test cpu and gpu

* update gpu sasl package

* update libsasl manually

* update

* add TF27 release images (#1714)

* [Tensorflow] add comment on py39 installation on TF 2.8 dockerfiles (#1715)

* document TF28 dockerfile

* update

* Release TF2.8 e3 images (#1716)

Co-authored-by: Qingzi-Lan <83724147+Qingzi-Lan@users.noreply.github.com>

* [Tensorflow][Test][ec2] Fix Habana Tensorflow EC2 tests (#1704)

* Changed dev config to build images

* Added safety check test true

* Changed the build to true in buildspec

* Add logic to upload and read from s3 with a break statement

* Remove break and fix tail bug

* Change loop time and last line of script

* Added modularity

* Removing unwanted logs

* Modifying the while loop to check if the test can end early

* Reformatting the code

* Fixing bugs and refactoring

* Minor fix

* refactored code and added buckets for each account

* Refactored to include the ValueError within execute_async method

* Implemented bucket logic

* Reverting temp changes

Co-authored-by: Shantanu Tripathi <trshanta@amazon.com>

* bug fix (#1717)

* re-release TF27 sagemaker cpu training (#1720)

* [build][pytorch] pt1.10 add openssh support (#1619)

* [tensorflow] Bug fixes to TF2.8 E3 images (#1723)

* [tensorflow] Bug fixes to TF2.8 E3 images

* add sasl install

* upgrade sasl instead of reinstalling

* Revert "upgrade sasl instead of reinstalling"

This reverts commit 51eb07408a404edde16e5bb2ddb3aa3b782a37a7.

* [Habana] [test] [ec2, sagemaker]  Fix to skip SM tests for Habana and modify async testing API (#1724)

* Fix to skip SM tests for Habana and modify async testing API

* Added the hang detection window variable

* Revert developer config

Co-authored-by: Shantanu Tripathi <trshanta@amazon.com>

* Move sasl to upgrade instead of install (#1726)

* Add dependabot config file to scan Dockerfiles (#1727)

* Add dependabot config file to scan Dockerfiles

* Update dependabot.yml

* [PyTorch] PyTorch 1.10.2 SageMaker DLC (#1709)

* pt1.10.2 sm dlc

* merge from upstream master

* refactor smdebug installation

* set enable_test_promotion:false for e3

Co-authored-by: Wei Chu <weichu@amazon.com>

* Configured release_images.yml for TF2.8e3 re-release and PT1.10.2 SM release (#1737)

* Configured release_images.yml for TF2.8e3 re-release

* Update release_images.yml

* Add Pytorch release changes to the yml

Co-authored-by: Shantanu Tripathi <trshanta@amazon.com>
Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com>

* [build][pytorch] pytorch 1.9 add openssh support (#1621)

* add openssh support

* build training image only

* revert dev config

* update

* update package version

* update

* revert dev config

* [tensorflow] Add dockerfiles for TF2.8 (#1685)

* add sagemaker dockerfiles

* update developer config

* update buildspec

* fix typos

* fix typo for python version

* add smdebug

* add sagemaker-tensorflow

* add smdataparallel

* remove tmp files

* update test config

* remove wrong ldlib path

* update tensorflow-io version

* remove sagemaker-tensorflow until py39 pkg becomes available

* remove sagemaker-tensorflow

* add sagemaker-tensorflow

* install sagemaker-tensorflow from source

* install tfds

* do not install tensorflow-dataset in the tests as it was installed in the image

* set datetime_tag to false

* correct python version

* update buildspec

* pass arguments related to python to e3 and sagemaker stages as env vars

* install smdebug from the tag

* minor update for sagemaker-tensorflow installation

* bug fix

* Changes to config file

* Make fix for cyrus CVE

* Change configs file to disable safety_check_test

* bump up requests

* run benchmark without rebuild

* run sagemaker rc tests

* run efa tests

* uninstall tfds as it is installed in the image already

* run rc tests

* remove unused env vars

* fix license

* update buildspec to build sagemaker images only

* Revert "update buildspec to build sagemaker images only"

This reverts commit 908c89dcec178fe964346516cf12f52b6448868d.

* skip safety checks

* remove license from sagemaker stage

* revert dlc_developer_config.toml

* remove unused comments

* skip test_keras_mirrored for TF2.7

* fix styling issues

* add env var for TF version

* comment out e3 and example images build

Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com>
Co-authored-by: Shantanu Tripathi <trshanta@amazon.com>
Co-authored-by: Shantanu Tripathi <shantanutripathi237@gmail.com>

* Update available images for TF2.7 SM and TF2.8 E3 (#1741)

* [TensorFlow] bump up tensorflow to 2.6.3 (#1721)

* [TensorFlow] add sagemaker dockerfiles for tensorflow_serving (#1689)

* add sagemaker dockerfiles

* build sagemaker image

* pass build args as env variables to sagemaker stages, remove unused dockerfile

Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com>
Co-authored-by: Sai Parthasarathy Miduthuri <saimidu@amazon.com>

* [tensorflow] [build] [test] TF2.8 SM image fix (#1748)

* TF2.8 SM image fix

* Rebuild images with new SMDebug tag released

* Changed the smdebug versioning format

* Removed additional code for skipping tests

* Change buildspec and revert temp changes

* Added newline at ends of buildspecs

* [tensorflow][build][sagemaker] enhance gunicorn logging (#1750)

* [autogluon][build] AutoGluon 0.3.1 container patching (#1734)

Co-authored-by: Sergey Togulev <34056697+SergTogul@users.noreply.github.com>

* [autogluon][build] AutoGluon 0.3.2 container (AG 0.3.1 with patched images) (#1752)

* [release] Add TF 2.8 SM DLCs to release images (#1755)

* [doc] Update available images (#1754)

* [release] Add AG 0.3.2 images to release (#1757)

* [Tensorflow][Test][benchmark][ec2]  Invoke all the Habana benchmark tests using async execution  (#1711)

* Basic config for building images and running the tests

* Added timeout for benchmark runs

* Invoking async execution for the benchmark test

* Added uuid to logs and increased loop time

* Added background p…

83 people committed Oct 20, 2022

1 parent 1b3d5ed commit e78bf86

.release_images_template.yml

@@ -1058,4 +1058,4 @@ release_images:
           cuda_version: "cu112"
           example: False
           disable_sm_tag: False  # [Default: False] This option is not used by Example images
-          force_release: False
+          force_release: False

data/ignore_ids_safety_scan.json

@@ -410,6 +410,20 @@
                     "51159":"cryptography>38.0.1 does not exist yet"
                 }
             },
+            "training-neuron":{
+                "_comment":"py2 is deprecated",
+                "py2": {
+                },
+                "py3": {
+                    "43453":"numpy > 1.22.0 is not available for py37",
+                    "44715":"numpy > 1.22.0 is not available for py37",
+                    "44717":"numpy > 1.22.0 is not available for py37",
+                    "44716":"numpy > 1.22.0 is not available for py37",
+                    "51159":"cryptography>38.0.1 does not exist yet",
+                    "51358":"Safety is test pkg and not part of image",
+                    "51457":"Ignored- please check https://github.com/pytest-dev/py/issues/287"
+                }
+            },
             "inference": {
                 "_comment":"py2 is deprecated",
                 "py2": {
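
The hunk above adds a "training-neuron" block that maps Safety vulnerability IDs to the reason each one is ignored, split by Python version. A minimal sketch of how such a map could be consulted is shown below. The helper names `ignored_ids` and `unignored` are hypothetical, the sample holds only two of the entries from the diff, and the real file may nest this block under further keys that are not visible here.

```python
import json

# Two entries copied from the "training-neuron" block in the diff above.
SAMPLE = json.loads("""
{
  "training-neuron": {
    "_comment": "py2 is deprecated",
    "py2": {},
    "py3": {
      "51159": "cryptography>38.0.1 does not exist yet",
      "51358": "Safety is test pkg and not part of image"
    }
  }
}
""")


def ignored_ids(ignore_map, image_type, python_version):
    """Return {vulnerability_id: reason} for one image type and Python version.

    Hypothetical helper: unknown image types or versions yield an empty dict.
    """
    return dict(ignore_map.get(image_type, {}).get(python_version, {}))


def unignored(found_ids, ignore_map, image_type, python_version):
    """Return the reported IDs that are not covered by the ignore map."""
    ignored = ignored_ids(ignore_map, image_type, python_version)
    return sorted(i for i in found_ids if i not in ignored)


# "99999" is a made-up ID used only to show an un-ignored finding.
print(unignored(["51159", "99999"], SAMPLE, "training-neuron", "py3"))
```

Keeping the reason string next to each ID, as the JSON does, makes it easier to revisit an entry once the blocking condition (for example, a missing upstream release) no longer holds.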

pytorch/buildspec-1-10-neuron.yml

@@ -2,17 +2,33 @@ account_id: &ACCOUNT_ID <set-$ACCOUNT_ID-in-environment>
     region: &REGION <set-$REGION-in-environment>
     framework: &FRAMEWORK pytorch
     version: &VERSION 1.10.2
+    os_version: &OS_VERSION ubuntu18.04
     short_version: &SHORT_VERSION "1.10"
     arch_type: x86
     repository_info:
+      training_repository: &TRAINING_REPOSITORY
+        image_type: &TRAINING_IMAGE_TYPE training
+        root: !join [ *FRAMEWORK, "/", *TRAINING_IMAGE_TYPE ]
+        repository_name: &REPOSITORY_NAME !join [pr, "-", *FRAMEWORK, "-", *TRAINING_IMAGE_TYPE, "-", neuron]
+        repository: &REPOSITORY !join [ *ACCOUNT_ID, .dkr.ecr., *REGION, .amazonaws.com/, *REPOSITORY_NAME ]
       inference_repository: &INFERENCE_REPOSITORY
           image_type: &INFERENCE_IMAGE_TYPE inference
           root: !join [ *FRAMEWORK, "/", *INFERENCE_IMAGE_TYPE ]
           repository_name: &REPOSITORY_NAME !join [pr, "-", *FRAMEWORK, "-", *INFERENCE_IMAGE_TYPE, "-", neuron]
           repository: &REPOSITORY !join [ *ACCOUNT_ID, .dkr.ecr., *REGION, .amazonaws.com/, *REPOSITORY_NAME ]
     context:
+      training_context: &TRAINING_CONTEXT
+        changehostname:
+          source: docker/build_artifacts/changehostname.c
+          target: changehostname.c
+        start_with_right_hostname:
+          source: docker/build_artifacts/start_with_right_hostname.sh
+          target: start_with_right_hostname.sh
+        deep_learning_container:
+          source: ../../src/deep_learning_container.py
+          target: deep_learning_container.py
       inference_context: &INFERENCE_CONTEXT
         neuron-monitor:
           source: docker/build_artifacts/neuron-monitor.sh
@@ -28,6 +44,19 @@ context:
           target: config.properties
     images:
+      BuildNeuronPTTrainingPy3DockerImage:
+        <<: *TRAINING_REPOSITORY
+        build: &PYTORCH_INF_TRAINING_PY3 false
+        image_size_baseline: 5000
+        device_type: &DEVICE_TYPE neuron
+        python_version: &DOCKER_PYTHON_VERSION py3
+        tag_python_version: &TAG_PYTHON_VERSION py36
+        neuron_sdk_version: &NEURON_SDK_VERSION sdk2.1.1
+        os_version: &OS_VERSION ubuntu18.04
+        tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *NEURON_SDK_VERSION, "-", *OS_VERSION ]
+        docker_file: !join [ docker/, *SHORT_VERSION, /, *DOCKER_PYTHON_VERSION, /, *NEURON_SDK_VERSION, /Dockerfile., neuron ]
+        context:
+          <<: *TRAINING_CONTEXT
       BuildNeuronPTInferencePy3DockerImage:
         <<: *INFERENCE_REPOSITORY
         build: &PYTORCH_INF_INFERENCE_PY3 false

pytorch/buildspec-1-11-neuron.yml

@@ -0,0 +1,59 @@
+    account_id: &ACCOUNT_ID <set-$ACCOUNT_ID-in-environment>
+    region: &REGION <set-$REGION-in-environment>
+    framework: &FRAMEWORK pytorch
+    version: &VERSION 1.11.0
+    os_version: &OS_VERSION ubuntu20.04
+    short_version: &SHORT_VERSION "1.11"
+    arch_type: x86
+    repository_info:
+      training_repository: &TRAINING_REPOSITORY
+        image_type: &TRAINING_IMAGE_TYPE training
+        root: !join [ *FRAMEWORK, "/", *TRAINING_IMAGE_TYPE ]
+        repository_name: &REPOSITORY_NAME !join [pr, "-", *FRAMEWORK, "-", *TRAINING_IMAGE_TYPE, "-", neuron]
+        repository: &REPOSITORY !join [ *ACCOUNT_ID, .dkr.ecr., *REGION, .amazonaws.com/, *REPOSITORY_NAME ]
+      inference_repository: &INFERENCE_REPOSITORY
+          image_type: &INFERENCE_IMAGE_TYPE inference
+          root: !join [ *FRAMEWORK, "/", *INFERENCE_IMAGE_TYPE ]
+          repository_name: &REPOSITORY_NAME !join [pr, "-", *FRAMEWORK, "-", *INFERENCE_IMAGE_TYPE, "-", neuron]
+          repository: &REPOSITORY !join [ *ACCOUNT_ID, .dkr.ecr., *REGION, .amazonaws.com/, *REPOSITORY_NAME ]
+    context:
+      training_context: &TRAINING_CONTEXT
+        changehostname:
+          source: docker/build_artifacts/changehostname.c
+          target: changehostname.c
+        start_with_right_hostname:
+          source: docker/build_artifacts/start_with_right_hostname.sh
+          target: start_with_right_hostname.sh
+        deep_learning_container:
+          source: ../../src/deep_learning_container.py
+          target: deep_learning_container.py
+      inference_context: &INFERENCE_CONTEXT
+        neuron-monitor:
+          source: docker/build_artifacts/neuron-monitor.sh
+          target: neuron-monitor.sh
+        neuron-entrypoint:
+          source: docker/build_artifacts/neuron-entrypoint.py
+          target: neuron-entrypoint.py
+        torchserve-neuron:
+          source: docker/build_artifacts/torchserve-neuron.sh
+          target: torchserve-neuron.sh
+        config:
+          source: docker/build_artifacts/config.properties
+          target: config.properties
+    images:
+      BuildNeuronPTTrainingPy3DockerImage:
+        <<: *TRAINING_REPOSITORY
+        build: &PYTORCH_INF_TRAINING_PY3 false
+        image_size_baseline: 10000
+        device_type: &DEVICE_TYPE neuron
+        python_version: &DOCKER_PYTHON_VERSION py3
+        tag_python_version: &TAG_PYTHON_VERSION py38
+        neuron_sdk_version: &NEURON_SDK_VERSION sdk2.3.0
+        os_version: &OS_VERSION ubuntu20.04
+        tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *NEURON_SDK_VERSION, "-", *OS_VERSION ]
+        docker_file: !join [ docker/, *SHORT_VERSION, /, *DOCKER_PYTHON_VERSION, /, *NEURON_SDK_VERSION, /Dockerfile., neuron ]
+        context:
+          <<: *TRAINING_CONTEXT
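
The `tag`, `docker_file`, and `repository_name` fields in this buildspec are built with the custom `!join` tag. Because every separator is written out explicitly in the list, the sketch below assumes `!join` simply concatenates its items as strings with no separator; that behaviour is an assumption drawn from usage, not from the tag's implementation. The `join` function is a stand-in written for illustration, and the values are copied from the file above.

```python
def join(parts):
    """Concatenate parts as strings with no separator (assumed !join behaviour)."""
    return "".join(str(p) for p in parts)


# Values copied from pytorch/buildspec-1-11-neuron.yml.
VERSION = "1.11.0"
SHORT_VERSION = "1.11"
DEVICE_TYPE = "neuron"
DOCKER_PYTHON_VERSION = "py3"
TAG_PYTHON_VERSION = "py38"
NEURON_SDK_VERSION = "sdk2.3.0"
OS_VERSION = "ubuntu20.04"

tag = join([VERSION, "-", DEVICE_TYPE, "-", TAG_PYTHON_VERSION, "-",
            NEURON_SDK_VERSION, "-", OS_VERSION])
docker_file = join(["docker/", SHORT_VERSION, "/", DOCKER_PYTHON_VERSION, "/",
                    NEURON_SDK_VERSION, "/Dockerfile.", "neuron"])
repository_name = join(["pr", "-", "pytorch", "-", "training", "-", "neuron"])

print(tag)              # 1.11.0-neuron-py38-sdk2.3.0-ubuntu20.04
print(docker_file)      # docker/1.11/py3/sdk2.3.0/Dockerfile.neuron
print(repository_name)  # pr-pytorch-training-neuron
```

Under that assumption, bumping `NEURON_SDK_VERSION` changes both the image tag and the Dockerfile directory, which is why the SDK version appears only once as an anchor and is referenced by alias elsewhere.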

pytorch/buildspec-neuron.yml

            
                      Original file line number
                      Diff line number
                      Diff line change
                  
    @@ -1,18 +1,34 @@
  
    account_id: &ACCOUNT_ID <set-$ACCOUNT_ID-in-environment>

    region: &REGION <set-$REGION-in-environment>

    framework: &FRAMEWORK pytorch

    version: &VERSION 1.10.2

    short_version: &SHORT_VERSION "1.10"

    version: &VERSION 1.11.0

    os_version: &OS_VERSION ubuntu20.04

    short_version: &SHORT_VERSION "1.11"

    arch_type: x86

    repository_info:

      training_repository: &TRAINING_REPOSITORY

        image_type: &TRAINING_IMAGE_TYPE training

        root: !join [ *FRAMEWORK, "/", *TRAINING_IMAGE_TYPE ]

        repository_name: &REPOSITORY_NAME !join [pr, "-", *FRAMEWORK, "-", *TRAINING_IMAGE_TYPE, "-", neuron]

        repository: &REPOSITORY !join [ *ACCOUNT_ID, .dkr.ecr., *REGION, .amazonaws.com/, *REPOSITORY_NAME ]

      inference_repository: &INFERENCE_REPOSITORY

          image_type: &INFERENCE_IMAGE_TYPE inference

          root: !join [ *FRAMEWORK, "/", *INFERENCE_IMAGE_TYPE ]

          repository_name: &REPOSITORY_NAME !join [pr, "-", *FRAMEWORK, "-", *INFERENCE_IMAGE_TYPE, "-", neuron]

          repository: &REPOSITORY !join [ *ACCOUNT_ID, .dkr.ecr., *REGION, .amazonaws.com/, *REPOSITORY_NAME ]

    context:

      training_context: &TRAINING_CONTEXT

        changehostname:

          source: docker/build_artifacts/changehostname.c

          target: changehostname.c

        start_with_right_hostname:

          source: docker/build_artifacts/start_with_right_hostname.sh

          target: start_with_right_hostname.sh

        deep_learning_container:

          source: ../../src/deep_learning_container.py

          target: deep_learning_container.py

      inference_context: &INFERENCE_CONTEXT

        neuron-monitor:

          source: docker/build_artifacts/neuron-monitor.sh

    @@ -28,18 +44,29 @@ context:
  
          target: config.properties

    images:

      BuildNeuronPTInferencePy3DockerImage:

        <<: *INFERENCE_REPOSITORY

        build: &PYTORCH_INF_INFERENCE_PY3 false

      # BuildNeuronPTInferencePy3DockerImage:

      #   <<: *INFERENCE_REPOSITORY

      #   build: &PYTORCH_INF_INFERENCE_PY3 false

      #   image_size_baseline: 10000

      #   device_type: &DEVICE_TYPE neuron

      #   python_version: &DOCKER_PYTHON_VERSION py3

      #   tag_python_version: &TAG_PYTHON_VERSION py37

      #   os_version: &OS_VERSION ubuntu18.04

      #   neuron_sdk_version: &NEURON_SDK_VERSION sdk1.19.0

      #   tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *NEURON_SDK_VERSION, "-", *OS_VERSION ]

      #   docker_file: !join [ docker/, *SHORT_VERSION, /, *DOCKER_PYTHON_VERSION, /, *NEURON_SDK_VERSION, /Dockerfile., neuron ]

      #   context:

      #     <<: *INFERENCE_CONTEXT

      BuildNeuronPTTrainingPy3DockerImage:

        <<: *TRAINING_REPOSITORY

        build: &PYTORCH_INF_TRAINING_PY3 false

        image_size_baseline: 10000

        device_type: &DEVICE_TYPE neuron

        python_version: &DOCKER_PYTHON_VERSION py3

        tag_python_version: &TAG_PYTHON_VERSION py37

        os_version: &OS_VERSION ubuntu18.04

        neuron_sdk_version: &NEURON_SDK_VERSION sdk1.19.0

        tag_python_version: &TAG_PYTHON_VERSION py38

        neuron_sdk_version: &NEURON_SDK_VERSION sdk2.3.0

        os_version: &OS_VERSION ubuntu20.04

        tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *NEURON_SDK_VERSION, "-", *OS_VERSION ]

        docker_file: !join [ docker/, *SHORT_VERSION, /, *DOCKER_PYTHON_VERSION, /, *NEURON_SDK_VERSION, /Dockerfile., neuron ]

        context:

          <<: *INFERENCE_CONTEXT

          <<: *TRAINING_CONTEXT
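
Each image entry in these buildspecs pulls in a shared mapping through the YAML merge key, for example `<<: *TRAINING_REPOSITORY` and `<<: *TRAINING_CONTEXT`. The sketch below mimics that behaviour with plain dictionaries, assuming standard merge-key semantics in which keys written explicitly in the entry take precedence over merged ones. `resolve_merge` is a hypothetical helper, not part of the build tooling, and the `repository` field is left out because it depends on the account ID and region placeholders.

```python
def resolve_merge(mapping):
    """Expand a '<<' merge key: merged keys come first, explicit keys win."""
    merged = dict(mapping.get("<<", {}))
    merged.update((k, v) for k, v in mapping.items() if k != "<<")
    return merged


# Stand-in for the &TRAINING_REPOSITORY anchor, with aliases already resolved.
TRAINING_REPOSITORY = {
    "image_type": "training",
    "root": "pytorch/training",
    "repository_name": "pr-pytorch-training-neuron",
}

image = resolve_merge({
    "<<": TRAINING_REPOSITORY,
    "build": False,
    "device_type": "neuron",
    "os_version": "ubuntu20.04",
})

print(sorted(image))
```

This is why switching an image from the inference context to the training context only needs the alias after `<<:` to change: the copied artifacts come from whichever anchor is merged in.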

0 comments on commit `e78bf86`
