[mxnet, tensorflow, pytorch] | [build, test] | [sagemaker] Update Pillow version #208

Merged
arjkesh merged 2 commits into aws:master on May 28, 2020

Conversation

@arjkesh arjkesh commented May 20, 2020

Issue #, if available:

Checklist

  • I've prepended PR tag with frameworks/job this applies to : [mxnet, tensorflow, pytorch] | [build] | [test] | [build, test] | [ec2, ecs, eks, sagemaker]
  • (If applicable) I've documented below the DLC image/dockerfile this relates to

Description:
Part of this change updates old Dockerfiles (i.e. tf 2.0.1) so that anyone building from them gets a Pillow version that includes the security update. The main issue it resolves is updating the test dependency for the SageMaker PyTorch (SM PT) tests.
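A quick way to spot Dockerfiles that still carry an older pin is to scan their text for `Pillow==` requirements. The following is only a minimal sketch, not tooling from this repository: the `outdated_pillow_pins` helper and the `MIN_PILLOW` threshold are hypothetical, and the threshold assumes 6.2.2 (the version this PR first moves to) is the lowest acceptable release.

```python
import re

# Assumption: 6.2.2 is treated as the lowest acceptable Pillow release.
MIN_PILLOW = (6, 2, 2)

# Matches exact pins such as "Pillow==6.2.0" (case-insensitive).
PIN_RE = re.compile(r"\bPillow==(\d+(?:\.\d+)*)", re.IGNORECASE)


def parse_version(text):
    """Turn '6.2.0' into (6, 2, 0); handles purely numeric versions only."""
    return tuple(int(part) for part in text.split("."))


def outdated_pillow_pins(dockerfile_text, minimum=MIN_PILLOW):
    """Return (line_number, version) for every Pillow pin below `minimum`."""
    findings = []
    for number, line in enumerate(dockerfile_text.splitlines(), start=1):
        for match in PIN_RE.finditer(line):
            if parse_version(match.group(1)) < minimum:
                findings.append((number, match.group(1)))
    return findings


# Sample text mirroring the pip install block touched by this PR.
sample = """RUN ${PIP} install --no-cache --upgrade \\
    onnx==1.6.0 \\
    numpy==1.17.2 \\
    pandas==0.25.1 \\
    Pillow==6.2.0 \\
"""

print(outdated_pillow_pins(sample))  # [(5, '6.2.0')]
```

The sketch only understands exact `==` pins with numeric components; ranges, extras, or pre-release tags would need a proper requirement parser.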

Tests run:

DLC image/dockerfile:
mxnet 1.6, pt 1.4, tf 2.0.1

Additional context:
Resolving Pillow issue

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@arjkesh arjkesh changed the title from "[mxnet, tensorflow, pytorch] | [build, testUpdate Pillow 6.2.0 to 6.2.2" to "[mxnet, tensorflow, pytorch] | [build, test] | [sagemaker] Update Pillow 6.2.0 to 6.2.2" on May 20, 2020
nskool previously approved these changes May 20, 2020
@@ -105,7 +105,7 @@ RUN ${PIP} install --no-cache --upgrade \
   onnx==1.6.0 \
   numpy==1.17.2 \
   pandas==0.25.1 \
-  Pillow==6.2.0 \
+  Pillow==6.2.2 \

can we update this to 7.1.2 for all py3 images?

+1.

We could try it. Note that some of these are old images that I'm not sure we will release again, so I wanted to keep the major version the same.

@Satish615 if I update, do you want me to rebuild all the old containers and test, or do you think it's safe to just test the new containers?
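The trade-off discussed here is a patch-level bump (6.2.0 to 6.2.2) versus a major-version jump (6.2.0 to 7.1.2). As a rough illustration only, the hypothetical `bump_kind` helper below classifies a version change by its first differing component, assuming plain numeric `major.minor.patch` strings.

```python
def parse_version(text):
    """Parse 'X.Y.Z' into a 3-tuple of ints, padding missing parts with 0."""
    parts = tuple(int(part) for part in text.split("."))
    return (parts + (0, 0, 0))[:3]


def bump_kind(old, new):
    """Name the most significant version component that differs."""
    old_v, new_v = parse_version(old), parse_version(new)
    for label, before, after in zip(("major", "minor", "patch"), old_v, new_v):
        if before != after:
            return label
    return "none"


print(bump_kind("6.2.0", "6.2.2"))  # patch
print(bump_kind("6.2.0", "7.1.2"))  # major
```

A "major" result is the case where rebuilding and retesting the older containers is more likely to be worthwhile, since a major release may remove or change APIs.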

@arjkesh arjkesh changed the title from "[mxnet, tensorflow, pytorch] | [build, test] | [sagemaker] Update Pillow 6.2.0 to 6.2.2" to "[mxnet, tensorflow, pytorch] | [build, test] | [sagemaker] Update Pillow version" on May 22, 2020
@arjkesh arjkesh merged commit 50f387c into aws:master May 28, 2020
tejaschumbalkar pushed a commit to tejaschumbalkar/deep-learning-containers that referenced this pull request Jul 15, 2022
* SynapseAI v0.15.1 release updates

* build habana switch on

* fix pt parse

* ENABLE_HABANA_MODE=False
tejaschumbalkar added a commit that referenced this pull request Oct 20, 2022
* [test] Add efa test as placeholder (#185)

* [pytorch][sagemaker] PT 1.8.0 cu110 EFA support (#171)

* PT 1.7.1 cu110 EFA support

* rebase PT 1.7.1 dockerfile and add EFA to PT 1.8.0 dockerfile

* Install hwloc, dependency of smdataparallel

* Disabled smdataparallel integration test temporarily since current smdataparallel wheel is incompatible with EFA

* Updated EFA version to 1.11.2 which comes with MPI v4.1.0

* fix nccl version and add test

* update mpi

* fix style

* Fixed NCCL branch name and moved the Horovod installation before SM Distributed

* Disable the framework build and test which is not applicable to this PR

* fix failing test

* Add MPI flags for EFA

* Fixed pytorch nccl version test

* Fixed pytorch nccl version python test and disable fresh builds

* Disable new builds and enabled smdataparallel test

* Re-trigger CI

* Revert build config changes

Co-authored-by: Lai Wei <royweilai@gmail.com>
Co-authored-by: Akhil Mehra <armehra@amazon.com>

* [TensorFlow][Sagemaker] TF 2.4 cu110 EFA support (#172)

* TF 2.4 cu110 EFA support

* Added -g option for EFA installer

* Update NCCL installation

* Fixed NCCL installation

* Add constant at top

* Install hwloc, dependency of smdataparallel

* Disabled smdataparallel integration test temporarily since current smdataparallel wheel is incompatible with EFA

* Updated EFA version to 1.11.2 which comes with MPI v4.1.0

* update OPEN_MPI

* Install NCCL from source and updated the openMPI path

* Re-trigger CI

* Disable the framework build and test which is not applicable to this PR and added EFA related flag

* Fix mpi flag failure

* Add correct runtime MPI flags

* Add correct MPI flags, modify build config

* Disable new builds and Fixed SM Horovod test

* Enabled smdataparallel test

* Removed building NCCL with specific arch. Use default config which builds for all arch

* Revert build config changes

Co-authored-by: yselivonchyk <y.selivonchyk@gmail.com>
Co-authored-by: Akhil Mehra <armehra@amazon.com>

* Run PT to test EFA (#191)

add sanity efa test

* [pytorch] | [test] | [sagemaker] SMModel Parallel pytorch EFA tests on p3dn (#187)

SMModel Parallel pytorch EFA tests

Co-authored-by: Jeetendra Patil <jeet4320@users.noreply.github.com>
Co-authored-by: Karan Jariwala <karankjariwala@gmail.com>
Co-authored-by: Lai Wei <royweilai@gmail.com>
Co-authored-by: yselivonchyk <y.selivonchyk@gmail.com>

* [tensorflow] | [test] | [sagemaker] (#188)

add efa test for tf2

Co-authored-by: Jeetendra Patil <jeet4320@users.noreply.github.com>
Co-authored-by: Karan Jariwala <karankjariwala@gmail.com>
Co-authored-by: Lai Wei <royweilai@gmail.com>
Co-authored-by: yselivonchyk <y.selivonchyk@gmail.com>

* Run PT Rubik EFA test (#194)

* run pt efa rubik

* skip inference

* revert

* Run rubik efa tests on tf2 (#195)

* run rubik efa tests on tf2

* [test][sagemaker] Add reupload_image_to_test_ecr to SM tests conftest (#193)

* [PyTorch][test][sagemaker] EFA test for smdataparallel (#189)

EFA test for smdataparallel

* [habana] Placeholder for Build and Test Functionality for Habana (#197)

* [habana] build functionality

* modify habana dedicated flag

* enable habana build

* build config changes

* add pytorch and modify test configuration

* move build artifact

* test support for habana

* nit changes

* build changes

* nit change

* support for SM and benchmark

* address comments

* build eia and neuron

* enable new builds

* nit

* revert temp configs

* remove dead code from eks test

* [Habana] Add changeset logic (#198)

* changeset logic for habana

* enable habana mode

* test buildspec

* change dockerfiles

* disable habana mode and revert changes

* remove unwanted code

* [test] Run test using existing EC2 instance locally (#201)

* Run test using existing EC2 instance

* rename pytest fixture

* Removing any SM related installs from Dockerfile (#200)

* Removing any SM related installs

* Cleaned Dockerfile. Added 2.5 folder

Co-authored-by: Tejas Chumbalkar <34728580+tejaschumbalkar@users.noreply.github.com>

* [pytorch/tensorflow] Habana DLC python 3.7, OMPI in base installer and pytorch DLC fixes  (#202)

* Habana Pytorch DLC and OMPI Install In Habana Bases

* Fix docker path

* Rebased and added TF2.5

* Update pytorch to 0.15.0 synapse

* Updated Pytorch docker file (#204)

* Updated Pytorch docker file. Also updated buildspec to pull whl from s3 bucket

* Removed SM packages. Added a few more Python packages. Renamed folder to 0.15

* Minor fix in buildspec

* build habana images

* correct build config

* disable build config

Co-authored-by: tejaschumbalkar <tejaschumbalkar@gmail.com>

* Update buildspec.yml (#206)

Updated pytorch wheel.
Added HPUBase for test cases.

* SynapseAI 0.15.0 Release DLC Changes (#205)

* SynapseAI 0.15.0 Release

* Add example branch parse and Habana PR build

* Fix extra slash

* Revert ENABLE_HABANA_MODE

* [Habana][Build] Fix torchvision python version py37 (#207)

* Fix torchvision python version py37

* Updated h5py version to 3.1.0

* enable habana mode and disable test

* Using pypi package for torchvision

* add docker build artifacts

* add build artifacts references to buildspec

* revert config

Co-authored-by: tejaschumbalkar <tejaschumbalkar@gmail.com>

* SynapseAI v0.15.1 release updates (#208)

* SynapseAI v0.15.1 release updates

* build habana switch on

* fix pt parse

* ENABLE_HABANA_MODE=False

* Updating TF binaries with callback fixes (#210)

* Updating TF binaries with callback fixes

* Enabling Habana build

* Resetting ENABLE_HABANA_MODE=False

* SynapseAI v0.15.2 release updates (#209)

* SynapseAI v0.15.2 release updates
* SynapseAI v0.15.2 release updates

* Fix folder naming

* Re-Disable ENABLE_HABANA_MODE in build_config.py

* SynapseAI v0.15.2 release updates
* SynapseAI v0.15.2 release updates

* Fix folder naming

* Re-Disable ENABLE_HABANA_MODE in build_config.py

* Updating Torchvision binary (#211)

* Updating Torchvision binary as we need to build with same setup as pytorch for compatibility

* Enabling Habana mode

* Reset ENABLE_HABANA_MODE= False

* SynapseAI v0.15.3 release updates (#213)

* SynapseAI v0.15.3 release updates
* SynapseAI v0.15.3 release updates

* Enable Habana Mode

* Disable Habana Mode

* address rebase modifications

* [DO NOT MERGE] [autogluon][build, test] Initial PR for training containers (#214)

* [autogluon][build, test] fixing instance types (#218)

* format ecr repo from image uri (#217)

* format ecr repo from image uri

* pytest markers for hpu test

* more markers

* nit habana changes

* [habana][build] fix docker entrypoint (#219)

* fix docker entrypoint

* revert habana mode

* Fixed version in autogluon buildspec (#215)

* Fixed version in autogluon buildspec

* Enabling sagemaker tests

* Enable building a new container

* Added MAJOR_VERSION into docker files, added autogluon_training fixture

* [autogluon][test] SageMaker remote mode tests

* [autogluon][test] removed datasets requirement

Co-authored-by: Sergey Togulev <togulev@amazon.com>
Co-authored-by: Alexander Shirkov <ashyrkou@amazon.com>

* [autogluon][test] tests fixes (#220)

* [autogluon][test] tests fixes

* [autogluon][test] tests fixes

* [autogluon][test] removed jupyter dependencies leftovers

* [autogluon][test] removed jupyter dependencies leftovers

* [autogluon][test] version checks fixes

* [autogluon][test] pip check fixes

* [autogluon][test] pip check fixes

* [autogluon][test] sm_local tests fixes

* [autogluon][test] sm_local tests fixes

* [autogluon][test] applied pillow security fixes to autogluon

* [autogluon][test] removed jupyter dependencies leftovers

* [build][test]Rolling back default parameters changes (#224)

* Rolling back default parameters changes

* [autogluon][test] test fixes

Co-authored-by: Sergey Togulev <togulev@amazon.com>
Co-authored-by: Alexander Shirkov <ashyrkou@amazon.com>

* [autogluon][release]Releasing Autogluon 0.2.1 (#227)

Co-authored-by: Sergey Togulev <togulev@amazon.com>

* [autogluon][test]Fixes for AG sanity tests (#226)

Co-authored-by: Sergey Togulev <togulev@amazon.com>

* [release] Fixed release notes logic (#228)

Co-authored-by: Sergey Togulev <togulev@amazon.com>

* [release] Fix for AG release notes (#229)

* [release] Fixed release notes logic

* [release] Fixed release notes logic

Co-authored-by: Sergey Togulev <togulev@amazon.com>

* [autogluon][release] Release AG container (#230)

* [release] Fixed release notes logic

* [release] Fixed release notes logic

* [release] Fixed release notes logic

Co-authored-by: Sergey Togulev <togulev@amazon.com>

* [release] Fix for imp_pip_packages (#231)

* [release] Fixed release notes logic

* [release] Fixed release notes logic

* [release] Fixed release notes logic

* [release] Fixed release notes logic

* [release] Fixed release notes logic

Co-authored-by: Sergey Togulev <togulev@amazon.com>

* Ag release (#232)

* [release] Fixed release notes logic

* [release] Fixed release notes logic

* [release] Fixed release notes logic

* [release] Fixed release notes logic

* [release] Fixed release notes logic

* [autogluon][build] Build AG 0.3.0

Co-authored-by: Sergey Togulev <togulev@amazon.com>

* [habana] fix pip check requirements (#225)

* habana sanity test

* reinstall boto3

* upgrade boto3

* remove comments

* revert temp configs

* [test] Merge testrunner from public (#234)

Co-authored-by: Sergey Togulev <togulev@amazon.com>

* SynapseAI v0.15.4 release updates (#233)

* SynapseAI v0.15.4 release updates
* SynapseAI v0.15.4 release updates

* Enable Habana Mode

* Revert "Enable Habana Mode"

This reverts commit 9ed1a8f58d2d5c71977ff0cc660e3228c3dd8874.

* [test] Building AG 0.2.1 (#236)

Co-authored-by: Sergey Togulev <togulev@amazon.com>

* Remove hb-torch & install into --user for python packages (#237)

* Remove hb-torch before installing AWS torch

* python packages to user space install

* add -y to uninstall

* enable habana mode

* disable habana mode

Co-authored-by: tejaschumbalkar <tejaschumbalkar@gmail.com>

* [build] habana build modifications (#238)

* habana build modifications

* run test safety

* make sanity test compatible with hpu processor

* fix sanity test

* sync up utility test changes from public repo

* address comments

* revert temp config

* release habana dlc to gamma stage (#243)

* [release] fix numbering on release_images.yml (#244)

* fix_numbering

* move syai inside of job_type

* remove PT1.7 and TF2.5 from release_images.yml (#245)

* Remove keras package before installing tensorflow (#247)

* Remove keras package before installing tensorflow

* Enable habana_mode

* run test safety

* disable habana mode

* revert safety test changes

Co-authored-by: tejaschumbalkar <tejaschumbalkar@gmail.com>

* Bump tensorflow in /test/sagemaker_tests/huggingface_tensorflow/training (#242)

Bumps [tensorflow](https://github.com/tensorflow/tensorflow) from 2.5.0 to 2.5.1.
- [Release notes](https://github.com/tensorflow/tensorflow/releases)
- [Changelog](https://github.com/tensorflow/tensorflow/blob/master/RELEASE.md)
- [Commits](https://github.com/tensorflow/tensorflow/compare/v2.5.0...v2.5.1)

---
updated-dependencies:
- dependency-name: tensorflow
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com>

* [hopper][build] Add hopper build code (#246)

* Merge master into private-master (#248)

* [test] Add hopper_mode to quick checks tests (#251)

* [test] Add efa test as placeholder (#185)

* [pytorch][sagemaker] PT 1.8.0 cu110 EFA support (#171)

* PT 1.7.1 cu110 EFA support

* rebase PT 1.7.1 dockerfile and add EFA to PT 1.8.0 dockerfile

* Install hwloc, dependency of smdataparallel

* Disabled smdataparallel integration test temporarily since current smdataparallel wheel is incompatible with EFA

* Updated EFA version to 1.11.2 which comes with MPI v4.1.0

* fix nccl version and add test

* update mpi

* fix style

* Fixed NCCL branch name and moved the Horovod installation before SM Distributed

* Disable the framework build and test which is not applicable to this PR

* fix failing test

* Add MPI flags for EFA

* Fixed pytorch nccl version test

* Fixed pytorch nccl version python test and disable fresh builds

* Disable new builds and enabled smdataparallel test

* Re-trigger CI

* Revert build config changes

Co-authored-by: Lai Wei <royweilai@gmail.com>
Co-authored-by: Akhil Mehra <armehra@amazon.com>

* [TensorFlow][Sagemaker] TF 2.4 cu110 EFA support (#172)

* TF 2.4 cu110 EFA support

* Added -g option for EFA installer

* Update NCCL installation

* Fixed NCCL installation

* Add constant at top

* Install hwloc, dependency of smdataparallel

* Disabled smdataparallel integration test temporarily since current smdataparallel wheel is incompatible with EFA

* Updated EFA version to 1.11.2 which comes with MPI v4.1.0

* update OPEN_MPI

* Install NCCL from source and updated the openMPI path

* Re-trigger CI

* Disable the framework build and test which is not applicable to this PR and added EFA related flag

* Fix mpi flag failure

* Add correct runtime MPI flags

* Add correct MPI flags, modify build config

* Disable new builds and Fixed SM Horovod test

* Enabled smdataparallel test

* Removed building NCCL with specific arch. Use default config which builds for all arch

* Revert build config changes

Co-authored-by: yselivonchyk <y.selivonchyk@gmail.com>
Co-authored-by: Akhil Mehra <armehra@amazon.com>

* Run PT to test EFA (#191)

add sanity efa test

* [pytorch] | [test] | [sagemaker] SMModel Parallel pytorch EFA tests on p3dn (#187)

SMModel Parallel pytorch EFA tests

Co-authored-by: Jeetendra Patil <jeet4320@users.noreply.github.com>
Co-authored-by: Karan Jariwala <karankjariwala@gmail.com>
Co-authored-by: Lai Wei <royweilai@gmail.com>
Co-authored-by: yselivonchyk <y.selivonchyk@gmail.com>

* [tensorflow] | [test] | [sagemaker] (#188)

add efa test for tf2

Co-authored-by: Jeetendra Patil <jeet4320@users.noreply.github.com>
Co-authored-by: Karan Jariwala <karankjariwala@gmail.com>
Co-authored-by: Lai Wei <royweilai@gmail.com>
Co-authored-by: yselivonchyk <y.selivonchyk@gmail.com>

* Run PT Rubik EFA test (#194)

* run pt efa rubik

* skip inference

* revert

* Run rubik efa tests on tf2 (#195)

* run rubik efa tests on tf2

* [test][sagemaker] Add reupload_image_to_test_ecr to SM tests conftest (#193)

* [PyTorch][test][sagemaker] EFA test for smdataparallel (#189)

EFA test for smdataparallel

* [habana] Placeholder for Build and Test Functionality for Habana (#197)

* [habana] build functionality

* modify habana dedicated flag

* enable habana build

* build config changes

* add pytorch and modify test configuration

* move build artifact

* test support for habana

* nit changes

* build changes

* nit change

* support for SM and benchmark

* address comments

* build eia and neuron

* enable new builds

* nit

* revert temp configs

* remove dead code from eks test

* [Habana] Add changeset logic (#198)

* changeset logic for habana

* enable habana mode

* test buildspec

* change dockerfiles

* disable habana mode and revert changes

* remove unwanted code

* [test] Run test using existing EC2 instance locally (#201)

* Run test using existing EC2 instance

* rename pytest fixture

* Removing any SM related installs from Dockerfile (#200)

* Removing any SM related installs

* Cleaned Dockerfile. Added 2.5 folder

Co-authored-by: Tejas Chumbalkar <34728580+tejaschumbalkar@users.noreply.github.com>

* [pytorch/tensorflow] Habana DLC python 3.7, OMPI in base installer and pytorch DLC fixes  (#202)

* Habana Pytorch DLC and OMPI Install In Habana Bases

* Fix docker path

* Rebased and added TF2.5

* Update pytorch to 0.15.0 synapse

* Updated Pytorch docker file (#204)

* Updated Pytorch docker file. Also updated buildspec to pull whl from s3 bucket

* Removed SM packages. Added a few more Python packages. Renamed folder to 0.15

* Minor fix in buildspec

* build habana images

* correct build config

* disable build config

Co-authored-by: tejaschumbalkar <tejaschumbalkar@gmail.com>

* Update buildspec.yml (#206)

Updated pytorch wheel.
Added HPUBase for test cases.

* SynapseAI 0.15.0 Release DLC Changes (#205)

* SynapseAI 0.15.0 Release

* Add example branch parse and Habana PR build

* Fix extra slash

* Revert ENABLE_HABANA_MODE

* [Habana][Build] Fix torchvision python version py37 (#207)

* Fix torchvision python version py37

* Updated h5py version to 3.1.0

* enable habana mode and disable test

* Using pypi package for torchvision

* add docker build artifacts

* add build artifacts references to buildspec

* revert config

Co-authored-by: tejaschumbalkar <tejaschumbalkar@gmail.com>

* SynapseAI v0.15.1 release updates (#208)

* SynapseAI v0.15.1 release updates

* build habana switch on

* fix pt parse

* ENABLE_HABANA_MODE=False

* Updating TF binaries with callback fixes (#210)

* Updating TF binaries with callback fixes

* Enabling Habana build

* Resetting ENABLE_HABANA_MODE=False

* SynapseAI v0.15.2 release updates (#209)

* SynapseAI v0.15.2 release updates
* SynapseAI v0.15.2 release updates

* Fix folder naming

* Re-Disable ENABLE_HABANA_MODE in build_config.py

* SynapseAI v0.15.2 release updates
* SynapseAI v0.15.2 release updates

* Fix folder naming

* Re-Disable ENABLE_HABANA_MODE in build_config.py

* Updating Torchvision binary (#211)

* Updating Torchvision binary as we need to build with same setup as pytorch for compatibility

* Enabling Habana mode

* Reset ENABLE_HABANA_MODE= False

* SynapseAI v0.15.3 release updates (#213)

* SynapseAI v0.15.3 release updates
* SynapseAI v0.15.3 release updates

* Enable Habana Mode

* Disable Habana Mode

* address rebase modifications

* [DO NOT MERGE] [autogluon][build, test] Initial PR for training containers (#214)

* [autogluon][build, test] fixing instance types (#218)

* format ecr repo from image uri (#217)

* format ecr repo from image uri

* pytest markers for hpu test

* more markers

* nit habana changes

* [habana][build] fix docker entrypoint (#219)

* fix docker entrypoint

* revert habana mode

* Fixed version in autogluon buildspec (#215)

* Fixed version in autogluon buildspec

* Enabling sagemaker tests

* Enable building a new container

* Added MAJOR_VERSION into docker files, added autogluon_training fixture

* [autogluon][test] SageMaker remote mode tests

* [autogluon][test] removed datasets requirement

Co-authored-by: Sergey Togulev <togulev@amazon.com>
Co-authored-by: Alexander Shirkov <ashyrkou@amazon.com>

* [autogluon][test] tests fixes (#220)

* [autogluon][test] tests fixes

* [autogluon][test] tests fixes

* [autogluon][test] removed jupyter dependencies leftovers

* [autogluon][test] removed jupyter dependencies leftovers

* [autogluon][test] version checks fixes

* [autogluon][test] pip check fixes

* [autogluon][test] pip check fixes

* [autogluon][test] sm_local tests fixes

* [autogluon][test] sm_local tests fixes

* [autogluon][test] applied pillow security fixes to autogluon

* [autogluon][test] removed jupyter dependencies leftovers

* [build][test]Rolling back default parameters changes (#224)

* Rolling back default parameters changes

* [autogluon][test] test fixes

Co-authored-by: Sergey Togulev <togulev@amazon.com>
Co-authored-by: Alexander Shirkov <ashyrkou@amazon.com>

* [autogluon][test]Fixes for AG sanity tests (#226)

Co-authored-by: Sergey Togulev <togulev@amazon.com>

* [release] Fixed release notes logic (#228)

Co-authored-by: Sergey Togulev <togulev@amazon.com>

* [release] Fix for imp_pip_packages (#231)

* [release] Fixed release notes logic

* [release] Fixed release notes logic

* [release] Fixed release notes logic

* [release] Fixed release notes logic

* [release] Fixed release notes logic

Co-authored-by: Sergey Togulev <togulev@amazon.com>

* [habana] fix pip check requirements (#225)

* habana sanity test

* reinstall boto3

* upgrade boto3

* remove comments

* revert temp configs

* SynapseAI v0.15.4 release updates (#233)

* SynapseAI v0.15.4 release updates
* SynapseAI v0.15.4 release updates

* Enable Habana Mode

* Revert "Enable Habana Mode"

This reverts commit 9ed1a8f58d2d5c71977ff0cc660e3228c3dd8874.

* Remove hb-torch & install into --user for python packages (#237)

* Remove hb-torch before installing AWS torch

* python packages to user space install

* add -y to uninstall

* enable habana mode

* disable habana mode

Co-authored-by: tejaschumbalkar <tejaschumbalkar@gmail.com>

* [build] habana build modifications (#238)

* habana build modifications

* run test safety

* make sanity test compatible with hpu processor

* fix sanity test

* sync up utility test changes from public repo

* address comments

* revert temp config

* remove PT1.7 and TF2.5 from release_images.yml (#245)

* Remove keras package before installing tensorflow (#247)

* Remove keras package before installing tensorflow

* Enable habana_mode

* run test safety

* disable habana mode

* revert safety test changes

Co-authored-by: tejaschumbalkar <tejaschumbalkar@gmail.com>

* [hopper][build] Add hopper build code (#246)

* Merge master into private-master (#248)

* [test] Add hopper_mode to quick checks tests (#251)

* followup sync changes

* [hopper][build] sync hopper dockerfiles with huggingface dockerfiles (#254)

* [hopper][build] sync hopper dockerfiles with huggingface dockerfiles

* Enable hopper mode

* Fix bug with CI for Hopper

* Use py38 wheel and disable debug env vars

* Update xla wheel and set buildspec correctly for hopper

* Fix framework path and artifact name

* Fix framework version path

* Disable hopper mode

Co-authored-by: Sai Parthasarathy Miduthuri <saimidu@amazon.com>

* [hopper][build] Add more wheels for hopper (#258)

* buildspec and status modifications (#261)

* [hopper][pytorch][test] Fix horovod tests (#266)

* Reinstall horovod for hopper

* Enable hopper mode

* Remove hopper dedicated

* Revert hopper dedicated

* Update dlc_developer_config.toml

* [hopper][test] Fix getting framework for hopper (#265)

* [hopper][test] Fix getting framework for hopper

* Add dummy change to trigger build

* Add dummy change in buildspec to trigger build

* Add dummy change in dockerfile

* Remove hopper dedicated

* Update main.py

* Update main.py

* Update main.py

* Remove dummy changes

* Update dlc_developer_config.toml

* [hopper][pytorch][build] Update transformers wheel (#267)

* [hopper][pytorch][build] Update transformers wheel to the latest (#269)

* [hopper][pytorch][build] Update hopper wheels (#270)

* [habana] fix pip check and unpin werkzeug package (#271)

* unpin werkzeug package

* install latest version

* fix rebase changes

* fix pip check

* revert temp config

* install typing

* build habana dlc

* revert temp changes

* release PT1.9 diy/sm (#272)

* [release] adjust customer_type for diy/sm (#273)

* adjust customer_type

* adjust customer_type

* nit change

* remove neuron (#274)

* add habana packages to release page (#241)

* [hopper][build][pytorch] Update hopper pytorch wheels (#275)

* Update hopper pytorch wheels

* [hopper][build][pytorch] Update transformers wheel (#276)

* [hopper][build][pytorch] Update transformers wheel

* [hopper][build][pytorch] Update transformers wheel (#278)

* [hopper][build][pytorch] Update transformers wheel

* Disable hopper mode

* Synch HF images from public (#281)

Co-authored-by: Sergey Togulev <togulev@amazon.com>

* [hopper][build][pytorch] Upgrade transformers to 11.0 (#282)

* Upgrade transformers to 11.0

* Update transformers version

* Disable hopper mode

* trigger builds

* retrigger builds

Co-authored-by: tejaschumbalkar <tejaschumbalkar@gmail.com>

* [hopper][huggingface_tensorflow][huggingface_pytorch][build][test] Build and test Hopper images with sm pysdk (#280)

* Added the changes to build hopper images with sm pysdk

* Added the tests to run using sm pysdk

* Added debug lines

* Run SM local tests and address comments

* Deactivated ecs and eks tests.

* Reverting the dev config changes

* [test][sagemaker] Make PySDK binary selection logic generic for the SM tests and SM local tests (#283)

* Make PySDK binary selection logic generic for the SM and SM local tests

* Make hopper mode true

* Revert the changes

* [hopper][build][pytorch][tensorflow] Update fw wheels with init changes (#284)

* [hopper][build] Update fw wheels with init changes

* Enable test flags

* Fix typo

* Disable test flags

* [hopper][build][pytorch] Fix Hopper DT NaN issue (#288)

* Fix Hopper DT NaN issue

* Update dlc_developer_config.toml

Co-authored-by: pinaraws <47152339+pinaraws@users.noreply.github.com>

* [hopper][build][pytorch][tensorflow] Fix licence files (#289)

* [hopper] [build] [pytorch] Updating SM trcomp PT wheels for DT support (#293)

* Updating SM trcomp PT wheels for DT support

* Update dlc_developer_config.toml

Co-authored-by: pinaraws <47152339+pinaraws@users.noreply.github.com>

* [hopper][build][pytorch] Include examples dir in transformers wheel (#291)

* Include examples dir in transformers wheel

* Update transformers wheel

* Update dlc_developer_config.toml

Co-authored-by: pinaraws <47152339+pinaraws@users.noreply.github.com>

* [hopper] [test] [sagemaker] Adding tests targeting the SM Training Compiler integrated containers Private master (#286)

* Fix bugs in framework init functions. +new Fx Wheels for HF-trcomp
Create remote and local test for HF-PT-trcomp
Create remote tests for HF-TF-trcomp
Make tests shorter

* Added handlers for non implemented tests

* Updating HF-trcomp tests to look for log messages in the training logs indicating trcomp has been engaged

* Fix for smdebug EC2 test.

* Adding HF-PT-trcomp tests to test different trcomp configs. Porting testing to work with HF-TF-trcomp.

* Finalizing HF-trcomp tests
Fixed HF-TF-trcomp build recipe. Add redundancy to all trcomp build recipes
Fixing test dependencies

* Increasing retries for HF trcomp tests

* Skipping HF-PT-trcomp local test since it hangs. Will fix later

* Reverting test mode

Co-authored-by: Sergey Togulev <togulev@amazon.com>

* [test] Fix smart retry benchmark tests (#1452) (#296)

* Fix for multithreading error in SM local tests

* Rollback dlc_developer_config changes

* Fix for SM local tests

* Rolled back dev_config changes

* Fix for multithreading error in SM local tests

* Rollback dlc_developer_config changes

* Fix for SM local tests

* Rolled back dev_config changes

* Fix for smart retry benchmark tests

Co-authored-by: Sergey Togulev <togulev@amazon.com>
(cherry picked from commit df440538a7c5f580301c5f3a1c56c14beab48821)

Fix smart retry (#1451)

* Fix for multithreading error in SM local tests

* Rollback dlc_developer_config changes

* Fix for SM local tests

* Rolled back dev_config changes

Co-authored-by: Sergey Togulev <togulev@amazon.com>
(cherry picked from commit 97fb152a7022f252d4349742cbc7d7c3bc0af9a6)

[test] Smart retry functionality (#1414)

* check pytest cache

* enable builds

* enable builds

* enable builds

* enable builds

* disable builds

* disable builds

* enable builds

* Added -p to mkdir

* Using dynamic obj name

* Added try-catches

* Moved everything to separate functions

* Fixed a small bug

* Removed separate functions

* Removed separate functions

* Fixed bugs

* Fixed bugs

* Fixed bugs

* Added tests for sagemaker

* Typo fix

* Added last-failed for sagemaker

* Fixing sm-local tests

* Removed json

* updated ec2 commands

* using string in threads pool instead of dict

* moved to p.map again

* moved to p.map again

* Rolled back dev_config changes

* Fixed sm-local tests

* Fixed sm-local tests

* Fixed sm-local tests

* refactored pytest_cache.py

* fixed a bug

* removed code for sagemaker remote tests

* rolled back dev config

* A few changes after the review

* A few changes after the review

* Fixed a typo

* Added account number parameter

* Refactored utils instantiating

* A few NITs

Co-authored-by: Sergey Togulev <togulev@amazon.com>

(cherry picked from commit 5938a87927cbd7c4500a04a98c2d58dea82d3dad)

Co-authored-by: Sergey Togulev <togulev@amazon.com>

* Fix for smart retry (#300)

Co-authored-by: Sergey Togulev <togulev@amazon.com>

* [trcomp] [build] Fixing debug artifact path for trcomp (#299)

* [trcomp] [build] Fixing debug artifact path for trcomp

* fix: Adding additional checks to trcomp HF-PT debug tests to ensure debug artifacts are uploaded.

* Reverting PR test config

* [hopper][build][pytorch] Fix transformers gradient clipping issue (#304)

* Fix transformers gradient clipping issue

* Trigger build

* Use pipeline-built transformers wheel

* Update dlc_developer_config.toml

Co-authored-by: pinaraws <47152339+pinaraws@users.noreply.github.com>

* release_images.yml with hopper images (#306)

Co-authored-by: Sergey Togulev <togulev@amazon.com>

* [release] Release trcomp (#307)

* release_images.yml with hopper images

* Added trcomp

Co-authored-by: Sergey Togulev <togulev@amazon.com>

* [hopper][build][pytorch] Add distributed training entry point (#308)

* [hopper][build][pytorch] Add distributed training entry point

* Disable tests

* Skipping benchmark tests for trcomp containers (#309)

Co-authored-by: Sergey Togulev <togulev@amazon.com>

* [tensorflow][build][test] Tensorflow2.6 with SM PySDK keynote3 (#287)

* Tensorflow2.6 with SM PySDK keynote3

* Adding leftover changes

* Increase image size

* Use partially complete keynote3 PySDK

* Added changes to pass pr quick checks

* Minor fix for sanity and quick checks

* Fixing the download path

* Log absolute path

* Fixing the path for pr checks

* Reformatted using black -l 120

* Addressed comments

* Increased image size

* After the latest wheel release

* [config] Fix `do_build` config option (#1494)

* Set do_build as false

* Sync the cpu dockerfile with public master

* Added the keras version pinning

* Minor fix

* Pinned tensorflow io

* Make gpu dockerfile same as public with pinned tfio

* Install new sm binaries

* Added the increased sizes

* Added changes for tf2.6.2

* Make image baseline 8000

* Changed the tf2.6.2 binaries to many_linux latest

* Revert dlc developer config

Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com>

* Skipping sm debugger tests for trcomp containers (#310)

Co-authored-by: Sergey Togulev <togulev@amazon.com>

* add graviton support (#313)

* revert graviton release specs (#314)

* [trcomp][build][pytorch] Fix distributed training entry point (#315)

* [trcomp][build][pytorch] Fix distributed training entry point

* Skipping sm debugger tests for trcomp containers

Co-authored-by: Sergey Togulev <togulev@amazon.com>
Co-authored-by: Sergey Togulev <34056697+SergTogul@users.noreply.github.com>

* [build]|[test]|[tensorflow] Made changes to build TF2.6.2 with SmPySDK and Boto (#316)

* Made changes to build TF2.6.2 with SmPySDK and Boto

* Revert temp changes

* Added sanity check tests

* release graviton for gamma testing (#317)

* [huggingface-neuron] Update release_images.yml (#318)

* Update release_images.yml (#319)

* Update release_images.yml

For hf neuron for the time being have disable_sm_tag to True

* Update release_images.yml

Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com>

* [trcomp] [pytorch] [build] Defaulting GPU_NUM_DEVICES to 1 (#321)

* [trcomp] [pytorch] [build] Defaulting GPU_NUM_DEVICES to 1

* [trcomp] [pytorch] [test] Testing default value of GPU_NUM_DEVICES

* Reverting PR config

* Upgrade pillow in TF hopper container (#322)

Co-authored-by: Sergey Togulev <togulev@amazon.com>

* Pillow fix (#323)

* Upgrade pillow in TF hopper container

* fixed a typo in a dockerfile

Co-authored-by: Sergey Togulev <togulev@amazon.com>

* [trcomp] [pytorch] [build] Fixing CVEs (#324)

* [trcomp] [pytorch] [build] Fixing CVEs

* Skipping not needed frameworks

* Removing hf-pt to trigger hopper tests

* Trying to execute hopper tests

* Skipping not needed frameworks

* Fixed dependency check issues self-discovery

* Added print for debugging

* [trcomp] [pytorch] [build] Fixing CVE in bokeh

* Moved bokeh installation into a different block

* Removed temp logging

* [trcomp] [pytorch] [build] Fixing CVE in numpy and ipython

* Rollback temp changes

Co-authored-by: Sergey Togulev <34056697+SergTogul@users.noreply.github.com>
Co-authored-by: Sergey Togulev <togulev@amazon.com>

* Bump tensorflow in /test/sagemaker_tests/huggingface_tensorflow/training (#295)

Bumps [tensorflow](https://github.com/tensorflow/tensorflow) from 2.5.1 to 2.5.2.
- [Release notes](https://github.com/tensorflow/tensorflow/releases)
- [Changelog](https://github.com/tensorflow/tensorflow/blob/master/RELEASE.md)
- [Commits](https://github.com/tensorflow/tensorflow/compare/v2.5.1...v2.5.2)

---
updated-dependencies:
- dependency-name: tensorflow
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Tejas Chumbalkar <34728580+tejaschumbalkar@users.noreply.github.com>

* [trcomp] [pytorch] [build] Fixing perf issues in g4dn instances (#325)

* [trcomp] [pytorch] [build] Fixing perf issues in g4dn instances

* Revert PR check config

Co-authored-by: Sergey Togulev <34056697+SergTogul@users.noreply.github.com>

* [test][sanity] Removed temp changes from test runner (#327)

* [trcomp] [pytorch] [build] Fixing CVEs

* Skipping not needed frameworks

* Removing hf-pt to trigger hopper tests

* Trying to execute hopper tests

* Skipping not needed frameworks

* Fixed dependency check issues self-discovery

* Added print for debugging

* [trcomp] [pytorch] [build] Fixing CVE in bokeh

* Moved bokeh installation into a different block

* Removed temp logging

* Rollback temp changes

* Rollback temp changes

Co-authored-by: Loki <lokravi@amazon.com>
Co-authored-by: Sergey Togulev <togulev@amazon.com>

* Using pypi sagemaker (#332)

Co-authored-by: Sergey Togulev <togulev@amazon.com>

* Merging from PUBLIC (#333)

* Merging from PUBLIC

* Fixed docker login

* Fixed parameter passing

* Fixed import

* Fixed sm_helper import

* Rollback config changes

Co-authored-by: Sergey Togulev <togulev@amazon.com>

* [Trcomp][CI] logic change copied from PR331 (#337)

* [Trcomp][CI] logic change copied from PR331

* comment out failed dockerfile commands

* revert dev config

* update dev config

* address comments

* set dev config

* fix typo

* update

* remove sagemaker test skip

* sync with PUBLIC

* remove unwanted habana test

* revert dev config

* remove sagemaker test skip for pytorch trcomp

Co-authored-by: tejaschumbalkar <tejaschumbalkar@gmail.com>

* [trcomp] [pytorch] [build] Adding support for PyTorch 1.10 (#329)

* [trcomp] [pytorch] [build] Adding support for PyTorch 1.10

* Setting developer config for PR validation tests

* [trcomp] [pytorch] [build] Release PyTorch 1.10.0

* [trcomp] [pytorch] [build] Adding common training dependencies

* [trcomp] [pytorch] [test] Changing tests to reflect changes to HF logging in 4.16.2

* [trcomp] [pytorch] [build] Adding common training dependencies

* [trcomp] [pytorch] [build] Upgrading PT from 1.10.0 to 1.10.2

* [trcomp] [pytorch] [build] Adding torchaudio binaries

* [trcomp] [pytorch] [build] Updating NCCL version in binaries

* [trcomp] [pytorch] [test] Adding back skip markers after bad merge

* [trcomp] [pytorch] [build] Updating torch version to reflect X.Y.Z+cuABC

* [trcomp] [pytorch] [build] Fixing numpy version to fix dependency for package numba

* fix sanity failures

* rename dockerfile

* remove duplicate test skip logic

* update e3 test skip logic

* fix sagemaker test directory

* fix sanity test

* enable ec2 test run and fix smdebug test

* nit change

* fix framework name

* fix variable name

* [trcomp] [test] Removing/Replacing internal code names

* [trcomp] [pytorch] [build] Fixing GPU_NUM_DEVICES issue with Distributed Training

* [trcomp] [pytorch] [build] Adding support for G5 instances with A10 GPUs

* Reverting developer config

Co-authored-by: tejaschumbalkar <tejaschumbalkar@gmail.com>
Co-authored-by: Qingzi-Lan <qingzila@amazon.com>

* [trcomp][build] fix the base image version for TF 2.6.3 (#331)

* fix the base image version

* update dev config

* upgrade numpy & openssl

* downgrade numpy to 1.21

* fix sanity tests

* enable ec2 test

* update ec2 test skip logic

* update dockerfile name logic

* update

* update

* update

* fix typo

* update

* update

* update

* fix typo

* skip horovod test

* update

* update dev config

* fix sagemaker test path

* update sagemaker test skip fixture

* update

* update dev config

* revert dev config

Co-authored-by: Qingzi-Lan <qingzila@amazon.com>
Co-authored-by: Qingzi-Lan <83724147+Qingzi-Lan@users.noreply.github.com>

* [release] release HF Trcomp TF2.6.3 & PT 1.10.2 (#338)

* release HF Trcomp TF2.6.3 & PT 1.10.2

* backup previous release_images.yml

* Sync eks infrastructure changes (#340)

* Graviton eks infrastructure (#1579)

* initial commit

* add pre-deploy

* add nodegroup support

* modify eks buildspec

* build a cluster

* add kubeconfig

* nit change

* revert temp changes

* explicitly set managed node

* remove managed option

* add option to upgrade nodegroup

* nit change

* template update

Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-219.us-west-2.compute.internal>
Co-authored-by: Qingzi-Lan <83724147+Qingzi-Lan@users.noreply.github.com>
Co-authored-by: Qingzi-Lan <qingzila@amazon.com>

* [eks] Upgrade EKS nodegroups and enable eks test for graviton (#1821)

* ung

* enable eks test for graviton

* build image

* disable config

* deploy graviton nodegroups

Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-219.us-west-2.compute.internal>
Co-authored-by: Qingzi-Lan <83724147+Qingzi-Lan@users.noreply.github.com>
Co-authored-by: Qingzi-Lan <qingzila@amazon.com>

* upgrade nodegroup (#341)

* Merge from PUBLIC repo @ef69cf4 (#339)

* test merge from PUBLIC

* trigger test

* update dev config

* revert dockerfile change

* change dockerfile

* update utils

* debug modified dockerfile regexp

* debug github handler file changed

* revert debug info, and force to_build to true

* enable habana build

* fix merge error

* restore files from PUBLIC

* revert dev config and "changeset limited to 20files" work around

* [build] Find buildspecs using configured env vars (#366)

* [pytorch][build] Remove patch version from buildspec file name (#376)

* Sync from public repo (#387)

* release pt-1.10.0 (#1616)

* release pt-1.10.0

Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com>

* [huggingface_pytorch][NEURON][build] Huggingface Neuron inference DLC (#1578)

Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com>
Co-authored-by: Venky Natham <vrnatham@amazon.com>

* [build][graviton][mxnet][pytorch] fix graviton image build (#1618)

* fix graviton image build

* revert dev config

* Run dependency check on HF neuron images (#1622)

* [tensorflow][test][benchmark] Makeshift fix for flaky benchmark tests (#1575)

* Makeshift fix for flaky benchmark tests

* Shifted the if condition

* Reverting change

* Removing unnecessary import

* reverting temp changes

* Add support for multistage dockerfiles for e3/sagemaker (#1532)

* Exclude dependency check library from tool (#1611)

* [MXNet][build][test] Release MX 1.9.0 inference & training binaries (#1217)

Co-authored-by: Sai Parthasarathy Miduthuri <saimidu@amazon.com>
Co-authored-by: Wei Chu <weichu@amazon.com>
Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com>

* Update release images for MX1.9 (#1639)

* Run MX sagemaker benchmarks on SM images (#1640)

* [test][sagemaker]Sm remote smart retry (#1573)

* Refactored mxnet sm multi-region tests

* Rollback devconfig changes

* Update SM smart retry

* converting custom_cache_directory to string

* converting custom_cache_directory to string

* converting custom_cache_directory to string

* upload cache to s3

* upload cache to s3

* upload cache to s3

* upload cache to s3

* upload cache to s3

* added broken test

* added broken and working tests

* added broken and working tests

* added broken and working tests

* Fixed bug

* Fixed bug

* Revert temp changes

* Fixed bug

* Rolled back temp changes

* Added a few comments

* A few edits after review

* Rolled-back temp changes

Co-authored-by: Sergey Togulev <togulev@amazon.com>

* [doc] Added NVIDIA Triton inference containers to available images (#1591)

* [NEURON][TEST] - Update the manifest for 1.17.0 release (#1632)

* [neuron][huggingface] Update MMS version in HF Neuron DLCs (#1644)

* support py38 in MX sagemaker tests (#1652)

* Update MX 1.9 example images (#1654)

* Update numpy version in MX images (#1656)

* Pin numpy to <1.20 in MX 1.9 images (#1657)

* Pin numpy to <1.20 in MX 1.9 images

* update buildspec

* Habana Synapseai v1.2.0 dockerfiles (#1627)

* Habana 1.1.1 release update
* Update docker image path to 1.1.1 release docker
* Added 1.9.1 pytorch
* Added 2.7.0 tensorflow

* Turn on habana_mode=true

* update framework binaries

* update dockerfile to py38+ul20

* Fix Pytorch docker container path

* update license files

* Update 1.2.0 links

* update binaries for PT1.10

* update pt binaries

* remove pytorch_binary from buildspec

* Remove dataclass/typing workaround from previous releases

* fix few build failures

* Unpin Pillow package and fix dataclass/typing on 2.7 instead of 2.5

* unpin request

* allow openssl cve

* update tf wheel with tensorflow-cpu

* fix security issue

* nit change

* revert developer config

Co-authored-by: Wei Chu <weichu@amazon.com>
Co-authored-by: tejaschumbalkar <tejaschumbalkar@gmail.com>
Co-authored-by: Tejas Chumbalkar <34728580+tejaschumbalkar@users.noreply.github.com>

* [NEURON][BUILD][MX] - update to sdk1.17.0 (#1636)

* [NEURON][BUILD][TF2.5] - update to use sdk1.17.0 and also tf2.5.2 (#1635)

* Release MX inference images for MXNet 1.9 (#1662)

* update available_images.md for MX1.9 (#1655)

* [NEURON][BUILD][PT] - move to sdk1.17.0 and also use pytorch 1.10.1 (#1634)

* [NEURON][RELEASE] - Update yml file to add PT1.10.1 and TF2.5.2 (#1668)

* Release Neuron Images for sdk1.16.0

Release PT1.9.1, TF1.15.5, Tf2.5.1, MX:1.8.0

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* don't look for sm tag

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* Add neuron release 1.16.1 version

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* add neuron release 1.16.1

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* update available images for neuron

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* fix md file to have py37 for pt

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* add old neuron versions

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* Release PT1.10.1 and TF2.5.2 Neuron DLC

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* add to release_images.yml

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* add mxnet

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* Update release_images.yml

* Update .release_images_template.yml

Co-authored-by: Sai Parthasarathy Miduthuri <54188298+saimidu@users.noreply.github.com>
Co-authored-by: Tejas Chumbalkar <34728580+tejaschumbalkar@users.noreply.github.com>
Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com>

* [NEURON][BUILD][TF] - Upgrade tf1.15.5 to use the neuron sdk 1.17.0 (#1642)

* Release neuron sdk1.17.0 version of tf1.15.5 dlc (#1673)

* Release Neuron Images for sdk1.16.0

Release PT1.9.1, TF1.15.5, Tf2.5.1, MX:1.8.0

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* don't look for sm tag

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* Add neuron release 1.16.1 version

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* add neuron release 1.16.1

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* update available images for neuron

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* fix md file to have py37 for pt

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* add old neuron versions

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* Release PT1.10.1 and TF2.5.2 Neuron DLC

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* add to release_images.yml

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* add mxnet

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* Update release_images.yml

* Update .release_images_template.yml

* release neuron sdk 1.17.0 version of tf1.15.5

Signed-off-by: Venky Natham <vrnatham@amazon.com>

Co-authored-by: Sai Parthasarathy Miduthuri <54188298+saimidu@users.noreply.github.com>
Co-authored-by: Tejas Chumbalkar <34728580+tejaschumbalkar@users.noreply.github.com>
Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com>

* tensorflow_serving 2.8 e3 inference container  (#1671)

* add wip dockerfiles

* add tensorflow_model_server

* update ci instructions

* update tensorrt

* change pyversion in buildspec; rm files in /tmp for stray_file_test

* update cve allow list

* revert dev config

* update tmp file delete

* revert dev config, add tf27 buildspec

Co-authored-by: Sergey Togulev <34056697+SergTogul@users.noreply.github.com>
Co-authored-by: Qingzi-Lan <qingzila@amazon.com>
Co-authored-by: Qingzi-Lan <83724147+Qingzi-Lan@users.noreply.github.com>

* Add e3 dockerfiles for tensorflow 2.8 (#1647)

* add tf28 e3 container dockerfiles

* update buildspec

* use numpy tensorflow dependency

* update buildspec to reflect change of python version

* Update buildspec.yml

* install cudnn-dev

* update cudnn

* fix typo

* enable safety scan

* update horovod installation

* A few security upgrades

* upgrade pillow to 9.0.1
* urllib3 to the latest
* ignore numpy false positive vulnerability

* Fixed urllib constraint

* Skipped a couple of safety tests

* Turn off safety scan

* update wheel

* remove temporary pem file

* revert dev config

* set dev config with safety check

* revert dev config

Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com>
Co-authored-by: Sergey Togulev <34056697+SergTogul@users.noreply.github.com>
Co-authored-by: Sergey Togulev <togulev@amazon.com>
Co-authored-by: Qingzi-Lan <qingzila@amazon.com>
Co-authored-by: Qingzi-Lan <83724147+Qingzi-Lan@users.noreply.github.com>

* Bump tensorflow in /test/sagemaker_tests/huggingface_tensorflow/training (#1677)

Bumps [tensorflow](https://github.com/tensorflow/tensorflow) from 2.5.2 to 2.5.3.
- [Release notes](https://github.com/tensorflow/tensorflow/releases)
- [Changelog](https://github.com/tensorflow/tensorflow/blob/master/RELEASE.md)
- [Commits](https://github.com/tensorflow/tensorflow/compare/v2.5.2...v2.5.3)

---
updated-dependencies:
- dependency-name: tensorflow
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* use js_import instead of js_include for TF serving nginx configuration (#1666)

* tf2.7 inf build

* fix buildspec

* nginx configuration

* revert config

* remove -

* rename tfs file

* nginx configuration

* remove js_content

* manage export statement

* fix nginx errors

* Enabling safety test

* revert temp changes

* address comments

* nit change

* change file name

* adjust file name

* nit change

* enable inference build

* revert buildspecfile changes

Co-authored-by: Sergey Togulev <34056697+SergTogul@users.noreply.github.com>

* [Habana]|[Build]|[Test] Enable Safety Scan Ignore list for Habana numpy issues (#1678)

* Enable Safety Scan Ignore list for Habana numpy issues

* Changed the ignore messages

* Reverted developer config changes

Co-authored-by: Shantanu Tripathi <trshanta@amazon.com>

* add TF2.8 in release images (#1676)

* Release neuron sdk 1.17.0 tf1.15.5 (#1681)

* Release Neuron Images for sdk1.16.0

Release PT1.9.1, TF1.15.5, Tf2.5.1, MX:1.8.0

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* don't look for sm tag

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* Add neuron release 1.16.1 version

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* add neuron release 1.16.1

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* update available images for neuron

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* fix md file to have py37 for pt

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* add old neuron versions

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* Release PT1.10.1 and TF2.5.2 Neuron DLC

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* add to release_images.yml

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* add mxnet

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* Update release_images.yml

* Update .release_images_template.yml

* release neuron sdk 1.17.0 version of tf1.15.5

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* release neuron sdk 1.17.0 based tf1.15.5

Signed-off-by: Venky Natham <vrnatham@amazon.com>

Co-authored-by: Sai Parthasarathy Miduthuri <54188298+saimidu@users.noreply.github.com>
Co-authored-by: Tejas Chumbalkar <34728580+tejaschumbalkar@users.noreply.github.com>
Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com>

* Add support for sagemaker-like E3 tag (#1672)

* [canary] Update python versions for TF canaries (#1682)

* fix TF tests issues (#1684)

* Habana DLC Perf/TestSuite TF/PT tests -- gaudi test suite (#1567)

* Habana DLC Perf/TestSuite TF/PT tests
* Add Habana DLAMI Tensorflow Performance Benchmarks
* Add Habana DLAMI PyTorch Performance Benchmarks
* Add Habana DLAMI Tensorflow Test Suite
* Add Habana DLAMI PyTorch Test Suite

* Apply gaudi-test-suite to test bert, rn50, maskrcnn, framework, etc.

* Test cleanup and exit code fix

* Fix gaudi-test-suite branch name

* To extract the Throughput correctly

* Update scripts for 1.2.0 release

* Add tf requirement installation

* Remove comments

* fix test scripts

* enable habana mode

* configure git creds

* build habana images

* adjust test dir

* run benchmark tests

* fix docker command

* update pt binary

* build new image

* use dedicated github branch

* nit change

* pin pt setuptools

* pin setuptools

* fix log file

* fix benchmark test

* awscli support

* fix dep check

* nit changes

* run benchmark test

* adjust pytest timeout for habana

* turn off benchmark mode

* add habana fixture

* run benchmark test

* increase timeout

* revert temp config

* increase timeout to 5hr

* build image

* run benchmark test

* revert temp config

Co-authored-by: Sergey Togulev <34056697+SergTogul@users.noreply.github.com>
Co-authored-by: Buke Ao <bukeao@bukeao-vm.habana-labs.com>
Co-authored-by: Anny Chung <achung@habana.ai>
Co-authored-by: tejaschumbalkar <tejaschumbalkar@gmail.com>
Co-authored-by: Tejas Chumbalkar <34728580+tejaschumbalkar@users.noreply.github.com>
Co-authored-by: Wei Chu <weichu@amazon.com>

* change cudnn version for tf2.8 for compatibility with p2 instances (#1688)

* update cudnn version

* update buildspec

* test on p2 instance

* revert dev config

Co-authored-by: Qingzi-Lan <qingzila@amazon.com>

* Habana release v1.2 images for TF and PT (#1687)

* release v1.2

* nit

* habana release v1.2 (#1691)

* Bump numpy in /test/sagemaker_tests/pytorch/inference (#1679)

Bumps [numpy](https://github.com/numpy/numpy) from 1.16.4 to 1.21.0.
- [Release notes](https://github.com/numpy/numpy/releases)
- [Changelog](https://github.com/numpy/numpy/blob/main/doc/HOWTO_RELEASE.rst.txt)
- [Commits](https://github.com/numpy/numpy/compare/v1.16.4...v1.21.0)

---
updated-dependencies:
- dependency-name: numpy
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com>

* [NEURON][BUILD][HF] - move hf neuron dlc to use latest sdk (#1669)

* [NEURON][BUILD][HF] - use ubuntu18 (#1700)

* use ubuntu18

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* enable test

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* remove libtinfo6 install as that is specific to u20

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* Update dlc_developer_config.toml

Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com>

* [NEURON][BUILD][TF] - Move tf2.5.2 neuron to sdk 1.17.1 (#1696)

* [NEURON][BUILD][MX] - Move to neuron sdk1.17.1 (#1698)

* [NEURON][BUILD][PT] - Move pt1.10 to neuron sdk1.17.1 (#1699)

* [NEURON][BUILD][TF] - Move tf1.15.5 to use neuron sdk 1.17.1 (#1697)

* Release neuron sdk 1.17.1 version (#1702)

Signed-off-by: Venky Natham <vrnatham@amazon.com>

* [doc] Update available images for neuron sdk release 1.17.1 (#1703)

* Add release images definition for HF PyTorch Neuron (#1694)

* [PyTorch E3] PT 1.10.2 DLC release (#1683)

* pt1.10.2

* add dgl

* update vision binaries

* update numpy and pillow versions

* fix numpy 1.22.0 installation

* update versions for cpu

* pin ipython version

* fix ipython installation

* update dgl pt container tests

* config for e3 only

* pin numpy version

* skip CVE 44463

* fix format

* update dev config

* update dev config

* disable dgl

* disable dgl cpu test for eks

* revert graviton changes

* revert sagemaker wheel

* remove pt1.10.0 buildspec

* revert dev config

* Update dlc_developer_config.toml

Co-authored-by: Wei Chu <weichu@amazon.com>
Co-authored-by: Qingzi-Lan <83724147+Qingzi-Lan@users.noreply.github.com>
Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com>

* Add tf2.7 training sagemaker dockerfiles (#1628)

* add tf2.7 sagemaker dockerfiles

* update buildspec

* remove non-compatible python packages

* add dependencies for kerberos

* use manylinux wheels; add sagemaker dockerfiles

* update horovod installation env vars

* update horovod installation script

* use numpy as tensorflow dep

* Update buildspec.yml

* install boost

* increase image size limit

* update pillow and add docker labels

* use wheels from smdebuggers pipelines

* fix sanity test

* add labels for tf 2.7 sm cpu

* rerun

* build+rerun

* reinstall horovod cpu

* install smdebug directly from tag

* fix typo;

* Revert "fix typo;"

This reverts commit c5bd300d2141a91ac4f3f1d6d13711aa975370cb.

* Revert "install smdebug directly from tag"

This reverts commit c51ef6b95b20de6f65397f34a29806ab77c03461.

* Executing safety check in PR

* install smdebug directly from the branch

* bump up tensorflow to 2.7.1

* install higher version of tensorflow-io to avoid overriding tensorflow

* Ignoring a false positive vulnerability

* install tfds

* change pytest commands

* do not install dependencies as they have been installed in the dockerfiles

* add SAGEMAKER_TRAINING_MODULE environment variable

* remove pem file in tmp folder

* update sagemaker-tensorflow

* add smdataparallel

* revert rm /tmp

* remove /tmp/git-secrets

* experiment with an smdebug fix

* Revert "experiment with an smdebug fix"

This reverts commit b19ee8347ed6208ff9c2ac81d489dba785632199.

* skip test_keras_mirrored.py

* fix error in buildspec

* Revert "fix error in buildspec"

This reverts commit b973fa415e324a6f63d1fb816b22848e35600934.

* revert developer_config

* fix buildspec

* fix buildspec

* fix py version

* revert buildspec to mainline

Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com>
Co-authored-by: tejaschumbalkar <tejaschumbalkar@gmail.com>
Co-authored-by: Sergey Togulev <34056697+SergTogul@users.noreply.github.com>
Co-authored-by: Sergey Togulev <togulev@amazon.com>
Co-authored-by: Qingzi-Lan <qingzila@amazon.com>
Co-authored-by: Qingzi-Lan <83724147+Qingzi-Lan@users.noreply.github.com>

* pt1.10.2 release images (#1706)

* pt1.10.2 release images

* add example

* TF2.8: Clean up dockerfiles, update HVD test (#1693)

* update pt1.10.2 release images (#1707)

* update pt1.10.2 release images

* Update release_images.yml

Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com>

* [Build][tensorflow] fix TF27 GPU CVE-2022-24407 (#1710)

* test

* update

* update

* update

* should fail

* test cpu and gpu

* update gpu sasl package

* update libsasl manually

* update

* add TF27 release images (#1714)

* [Tensorflow] add comment on py39 installation on TF 2.8 dockerfiles (#1715)

* document TF28 dockerfile

* update

* Release TF2.8 e3 images (#1716)

Co-authored-by: Qingzi-Lan <83724147+Qingzi-Lan@users.noreply.github.com>

* [Tensorflow][Test][ec2] Fix Habana Tensorflow EC2 tests (#1704)

* Changed dev config to build images

* Added safety check test true

* Changed the build to true in buildspec

* Add logic to upload and read from s3 with a break statement

* Remove break and fix tail bug

* Change loop time and last line of script

* Added modularity

* Removing unwanted logs

* Modifying the while loop to check if the test can end early

* Reformatting the code

* Fixing bugs and refactoring

* Minor fix

* refactored code and added buckets for each account

* Refactored to include the ValueError within execute_async method

* Implemented bucket logic

* Reverting temp changes

Co-authored-by: Shantanu Tripathi <trshanta@amazon.com>

* bug fix (#1717)

* re-release TF27 sagemaker cpu training (#1720)

* [build][pytorch] pt1.10 add openssh support (#1619)

* [tensorflow] Bug fixes to TF2.8 E3 images (#1723)

* [tensorflow] Bug fixes to TF2.8 E3 images

* add sasl install

* upgrade sasl instead of reinstalling

* Revert "upgrade sasl instead of reinstalling"

This reverts commit 51eb07408a404edde16e5bb2ddb3aa3b782a37a7.

* [Habana] [test] [ec2, sagemaker]  Fix to skip SM tests for Habana and modify async testing API (#1724)

* Fix to skip SM tests for Habana and modify async testing API

* Added the hang detection window variable

* Revert developer config

Co-authored-by: Shantanu Tripathi <trshanta@amazon.com>

* Move sasl to upgrade instead of install (#1726)

* Add dependabot config file to scan Dockerfiles (#1727)

* Add dependabot config file to scan Dockerfiles

* Update dependabot.yml

* [PyTorch] PyTorch 1.10.2 SageMaker DLC (#1709)

* pt1.10.2 sm dlc

* merge from upstream master

* refactor smdebug installation

* set enable_test_promotion:false for e3

Co-authored-by: Wei Chu <weichu@amazon.com>

* Configured release_images.yml for TF2.8e3 re-release and PT1.10.2 SM release (#1737)

* Configured release_images.yml for TF2.8e3 re-release

* Update release_images.yml

* Add Pytorch release changes to the yml

Co-authored-by: Shantanu Tripathi <trshanta@amazon.com>
Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com>

* [build][pytorch] pytorch 1.9 add openssh support (#1621)

* add openssh support

* build training image only

* revert dev config

* update

* update package version

* update

* revert dev config

* [tensorflow] Add dockerfiles for TF2.8 (#1685)

* add sagemaker dockerfiles

* update developer config

* update buildspec

* fix typos

* fix typo for python version

* add smdebug

* add sagemaker-tensorflow

* add smdataparallel

* remove tmp files

* update test config

* remove wrong ldlib path

* update tensorflow-io version

* remove sagemaker-tensorflow until py39 pkg becomes available

* remove sagemaker-tensorflow

* add sagemaker-tensorflow

* install sagemaker-tensorflow from source

* install tfds

* do not install tensorflow-dataset in the tests as it was installed in the image

* set datetime_tag to false

* correct python version

* update buildspec

* pass arguments related to python to e3 and sagemaker stages as env vars

* install smdebug from the tag

* minor update for sagemaker-tensorflow installation

* bug fix

* Changes to config file

* Make fix for cyrus CVE

* Change configs file to disable safety_check_test

* bump up requests

* run benchmark without rebuild

* run sagemaker rc tests

* run efa tests

* uninstall tfds as it is installed in the image already

* run rc tests

* remove unused env vars

* fix license

* update buildspec to build sagemaker images only

* Revert "update buildspec to build sagemaker images only"

This reverts commit 908c89dcec178fe964346516cf12f52b6448868d.

* skip safety checks

* remove license from sagemaker stage

* revert dlc_developer_config.toml

* remove unused comments

* skip test_keras_mirrored for TF2.7

* fix styling issues

* add env var for TF version

* comment out e3 and example images build

Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com>
Co-authored-by: Shantanu Tripathi <trshanta@amazon.com>
Co-authored-by: Shantanu Tripathi <shantanutripathi237@gmail.com>

* Update available images for TF2.7 SM and TF2.8 E3 (#1741)

* [TensorFlow] bump up tensorflow to 2.6.3 (#1721)

* [TensorFlow] add sagemaker dockerfiles for tensorflow_serving (#1689)

* add sagemaker dockerfiles

* build sagemaker image

* pass build args as env variables to sagemaker stages, remove unused dockerfile

Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com>
Co-authored-by: Sai Parthasarathy Miduthuri <saimidu@amazon.com>

* [tensorflow] [build] [test] TF2.8 SM image fix (#1748)

* TF2.8 SM image fix

* Rebuild images with new SMDebug tag released

* Changed the smdebug versioning format

* Removed additional code for skipping tests

* Change buildspec and revert temp changes

* Added newline at ends of buildspecs

* [tensorflow][build][sagemaker] enhance gunicorn logging (#1750)

* [autogluon][build] AutoGluon 0.3.1 container patching (#1734)

Co-authored-by: Sergey Togulev <34056697+SergTogul@users.noreply.github.com>

* [autogluon][build] AutoGluon 0.3.2 container (AG 0.3.1 with patched images) (#1752)

* [release] Add TF 2.8 SM DLCs to release images (#1755)

* [doc] Update available images (#1754)

* [release] Add AG 0.3.2 images to release (#1757)

* [Tensorflow][Test][benchmark][ec2] Invoke all the Habana benchmark tests using async execution (#1711)

* Basic config for building images and running the tests

* Added timeout for benchmark runs

* Invoking async execution for the benchmark test

* Added uuid to logs and increased loop time

* Added background p…