Commit
* [test] Add efa test as placeholder (#185)
* [pytorch][sagemaker] PT 1.8.0 cu110 EFA support (#171)
* PT 1.7.1 cu110 EFA support
* rebase PT 1.7.1 dockerfile and add EFA to PT 1.8.0 dockerfile
* Install hwloc, dependency of smdataparallel
* Disabled smdataparallel integration test temporarily since current smdataparallel wheel is incompatible with EFA
* Updated EFA version to 1.11.2 which comes with MPI v4.1.0
* fix nccl version and add test
* update mpi
* fix style
* Fixed NCCL branch name and moved the Horovod installation before SM Distributed
* Disable the framework build and test which is not applicable to this PR
* fix failing test
* Add MPI flags for EFA
* Fixed pytorch nccl version test
* Fixed pytorch nccl version python test and disable fresh builds
* Disable new builds and enabled smdataparallel test
* Re-trigger CI
* Revert build config changes
Co-authored-by: Lai Wei <royweilai@gmail.com>
Co-authored-by: Akhil Mehra <armehra@amazon.com>
* [TensorFlow][Sagemaker] TF 2.4 cu110 EFA support (#172)
* TF 2.4 cu110 EFA support
* Added -g option for EFA installer
* Update NCCL installation
* Fixed NCCL installation
* Add constant at top
* Install hwloc, dependency of smdataparallel
* Disabled smdataparallel integration test temporarily since current smdataparallel wheel is incompatible with EFA
* Updated EFA version to 1.11.2 which comes with MPI v4.1.0
* update OPEN_MPI
* Install NCCL from source and updated the openMPI path
* Re-trigger CI
* Disable the framework build and test which is not applicable to this PR and added EFA related flag
* Fix mpi flag failure
* Add correct runtime MPI flags
* Add correct MPI flags, modify build config
* Disable new builds and Fixed SM Horovod test
* Enabled smdataparallel test
* Removed building NCCL with specific arch. Use default config which builds for all arch
* Revert build config changes
Co-authored-by: yselivonchyk <y.selivonchyk@gmail.com>
Co-authored-by: Akhil Mehra <armehra@amazon.com>
* Run PT to test EFA (#191) add sanity efa test
* [pytorch] | [test] | [sagemaker] SMModel Parallel pytorch EFA tests on p3dn (#187) SMModel Parallel pytorch EFA tests
Co-authored-by: Jeetendra Patil <jeet4320@users.noreply.github.com>
Co-authored-by: Karan Jariwala <karankjariwala@gmail.com>
Co-authored-by: Lai Wei <royweilai@gmail.com>
Co-authored-by: yselivonchyk <y.selivonchyk@gmail.com>
* [tensorflow] | [test] | [sagemaker] (#188) add efa test for tf2
Co-authored-by: Jeetendra Patil <jeet4320@users.noreply.github.com>
Co-authored-by: Karan Jariwala <karankjariwala@gmail.com>
Co-authored-by: Lai Wei <royweilai@gmail.com>
Co-authored-by: yselivonchyk <y.selivonchyk@gmail.com>
* Run PT Rubik EFA test (#194)
* run pt efa rubik
* skip inference
* revert
* Run rubik efa tests on tf2 (#195)
* run rubik efa tests on tf2
* [test][sagemaker] Add reupload_image_to_test_ecr to SM tests conftest (#193)
* [PyTorch][test][sagemaker] EFA test for smdataparallel (#189) EFA test for smdataparallel
* [habana] Placeholder for Build and Test Functionality for Habana (#197)
* [habana] build functionality
* modify habana dedicated flag
* enable habana build
* build config changes
* add pytorch and modify test configuration
* move build artifact
* test support for habana
* nit changes
* build changes
* nit change
* support for SM and benchmark
* address comments
* build eia and neuron
* enable new builds
* nit
* revert temp configs
* remove dead code from eks test
* [Habana] Add changeset logic (#198)
* changeset logic for habana
* enable habana mode
* test buildspec
* change dockerfiles
* disable habana mode and revert changes
* remove unwanted code
* [test] Run test using existing EC2 instance locally (#201)
* Run test using existing EC2 instance
* rename pytest fixture
* Removing any SM related installs from Dockerfile (#200)
* Removing any SM related installs
* Cleaned Dockerfile.Added 2.5 folder
Co-authored-by: Tejas Chumbalkar <34728580+tejaschumbalkar@users.noreply.github.com>
* [pytorch/tensorflow] Habana DLC python 3.7, OMPI in base installer and pytorch DLC fixes (#202)
* Habana Pytorch DLC and OMPI Install In Habana Bases
* Fix docker path
* Rebased and added TF2.5
* Update pytorch to 0.15.0 synapse
* Updated Pytorch docker file (#204)
* Updated Pytorch docker file. Also updated buildspec to pull whl from s3 bucket
* Removed SM packages. Added few more pythom packages. Renamed folder to 0.15
* Minor fix in buildspec
* build habana images
* correct build config
* disable build config
Co-authored-by: tejaschumbalkar <tejaschumbalkar@gmail.com>
* Update buildspec.yml (#206) Updated pytorch wheel. Added HPUBase for test cases.
* SynapseAI 0.15.0 Release DLC Changes (#205)
* SynapseAI 0.15.0 Release
* Add example branch parse and Habana PR build
* Fix extra slash
* Revert ENABLE_HABANA_MODE
* [Habana][Build] Fix torchvision python version py37 (#207)
* Fix torchvision python version py37
* Updated h5py version to 3.1.0
* enable habana mode and disable test
* Using pypi package for torchvision
* add docker build artifacts
* add build artifacts references to buildspec
* revert config
Co-authored-by: tejaschumbalkar <tejaschumbalkar@gmail.com>
* SynapseAI v0.15.1 release updates (#208)
* SynapseAI v0.15.1 release updates
* build habana switch on
* fix pt parse
* ENABLE_HABANA_MODE=False
* Updating TF binaries with callback fixes (#210)
* Updating TF binaries with callback fixes
* Enabling Habana build
* Resetting ENABLE_HABANA_MODE=False
* SynapseAI v0.15.2 release updates (#209)
* SynapseAI v0.15.2 release updates
* SynapseAI v0.15.2 release updates
* Fix folder naming
* Re-Disable ENABLE_HABANA_MODE in build_config.py
* SynapseAI v0.15.2 release updates
* SynapseAI v0.15.2 release updates
* Fix folder naming
* Re-Disable ENABLE_HABANA_MODE in build_config.py
* Updating Torchvision binary (#211)
* Updating Torchvision binary as we need to build with same setup as pytorch for compatibilty
* Enabling Habana mode
* Reset ENABLE_HABANA_MODE= False
* SynapseAI v0.15.3 release updates (#213)
* SynapseAI v0.15.3 release updates
* SynapseAI v0.15.3 release updates
* Enable Habana Mode
* Disable Habana Mode
* address rebase modifications
* [DO NOT MERGE] [autogluon][build, test] Initial PR for training containers (#214)
* [autogluon][build, test] fixing instance types (#218)
* format ecr repo from image uri (#217)
* format ecr repo from image uri
* pytest markers for hpu test
* more markers
* nit habana changes
* [habana][build] fix docker entrypoint (#219)
* fix docker entrypoint
* revert habana mode
* Fixed version in autogluon buildspec (#215)
* Fixed version in autogluon buildspec
* Enabling sagemaker tests
* Enable building a new container
* Added MAJOR_VERSION into docker files, added autogluon_training fixture
* [autogluon][test] SageMaker remote mode tests
* [autogluon][test] removed datasets requirement
Co-authored-by: Sergey Togulev <togulev@amazon.com>
Co-authored-by: Alexander Shirkov <ashyrkou@amazon.com>
* [autogluon][test] tests fixes (#220)
* [autogluon][test] tests fixes
* [autogluon][test] tests fixes
* [autogluon][test] removed jupyter dependencies leftovers
* [autogluon][test] removed jupyter dependencies leftovers
* [autogluon][test] version checks fixes
* [autogluon][test] pip check fixes
* [autogluon][test] pip check fixes
* [autogluon][test] sm_local tests fixes
* [autogluon][test] sm_local tests fixes
* [autogluon][test] applied pillow security fixes to autogluon
* [autogluon][test] removed jupyter dependencies leftovers
* [build][test]Rolling back default parameters changes (#224)
* Rolling back default parameters changes
* [autogluon][test] test fixes
Co-authored-by: Sergey Togulev <togulev@amazon.com>
Co-authored-by: Alexander Shirkov <ashyrkou@amazon.com>
* [autogluon][release]Releasing Autogluon 0.2.1 (#227)
Co-authored-by: Sergey Togulev <togulev@amazon.com>
* [autogluon][test]Fixes for AG sanity tests (#226)
Co-authored-by: Sergey Togulev <togulev@amazon.com>
* [release] Fixed release notes logic (#228)
Co-authored-by: Sergey Togulev <togulev@amazon.com>
* [release] Fix for AG release notes (#229)
* [release] Fixed release notes logic
* [release] Fixed release notes logic
Co-authored-by: Sergey Togulev <togulev@amazon.com>
* [autogluon][release] Release AG container (#230)
* [release] Fixed release notes logic
* [release] Fixed release notes logic
* [release] Fixed release notes logic
Co-authored-by: Sergey Togulev <togulev@amazon.com>
* [release] Fix for imp_pip_packages (#231)
* [release] Fixed release notes logic
* [release] Fixed release notes logic
* [release] Fixed release notes logic
* [release] Fixed release notes logic
* [release] Fixed release notes logic
Co-authored-by: Sergey Togulev <togulev@amazon.com>
* Ag release (#232)
* [release] Fixed release notes logic
* [release] Fixed release notes logic
* [release] Fixed release notes logic
* [release] Fixed release notes logic
* [release] Fixed release notes logic
* [autogluon][build] Build AG 0.3.0
Co-authored-by: Sergey Togulev <togulev@amazon.com>
* [habana] fix pip check requirements (#225)
* habana sanity test
* reinstall boto3
* upgrade boto3
* remove comments
* revert temp configs
* [test] Merger testrunner from public (#234)
Co-authored-by: Sergey Togulev <togulev@amazon.com>
* SynapseAI v0.15.4 release updates (#233)
* SynapseAI v0.15.4 release updates
* SynapseAI v0.15.4 release updates
* Enable Habana Mode
* Revert "Enable Habana Mode"
This reverts commit 9ed1a8f58d2d5c71977ff0cc660e3228c3dd8874.
* [test] Building AG 0.2.1 (#236)
Co-authored-by: Sergey Togulev <togulev@amazon.com>
* Remove hb-torch & install into --user for python packages (#237)
* Remove hb-torch before installing AWS torch
* python packages to user space install
* add -y to uninstall
* enable habana mode
* disable habana mode
Co-authored-by: tejaschumbalkar <tejaschumbalkar@gmail.com>
* [build] habana build modifications (#238)
* habana build modifications
* run test safety
* make sanity test compatible with hpu processor
* fix sanity test
* sync up utility test changes from public repo
* address comments
* revert temp config
* release habana dlc to gamma stage (#243)
* [release] fix numbering on release_images.yml (#244)
* fix_numbering
* move syai inside of job_type
* remove PT1.7 and TF2.5 from release_images.yml (#245)
* Remove keras package before installing tensorflow (#247)
* Remove keras package before installing tensorflow
* Enable habana_mode
* run test safety
* disable habana mode
* revert safety test changes
Co-authored-by: tejaschumbalkar <tejaschumbalkar@gmail.com>
* Bump tensorflow in /test/sagemaker_tests/huggingface_tensorflow/training (#242)
Bumps [tensorflow](https://github.com/tensorflow/tensorflow) from 2.5.0 to 2.5.1.
- [Release notes](https://github.com/tensorflow/tensorflow/releases)
- [Changelog](https://github.com/tensorflow/tensorflow/blob/master/RELEASE.md)
- [Commits](https://github.com/tensorflow/tensorflow/compare/v2.5.0...v2.5.1)
---
updated-dependencies:
- dependency-name: tensorflow
  dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com>
* [hopper][build] Add hopper build code (#246)
* Merge master into private-master (#248)
* [test] Add hopper_mode to quick checks tests (#251)
* [test] Add efa test as placeholder (#185)
* [pytorch][sagemaker] PT 1.8.0 cu110 EFA support (#171)
* PT 1.7.1 cu110 EFA support
* rebase PT 1.7.1 dockerfile and add EFA to PT 1.8.0 dockerfile
* Install hwloc, dependency of smdataparallel
* Disabled smdataparallel integration test temporarily since current smdataparallel wheel is incompatible with EFA
* Updated EFA version to 1.11.2 which comes with MPI v4.1.0
* fix nccl version and add test
* update mpi
* fix style
* Fixed NCCL branch name and moved the Horovod installation before SM Distributed
* Disable the framework build and test which is not applicable to this PR
* fix failing test
* Add MPI flags for EFA
* Fixed pytorch nccl version test
* Fixed pytorch nccl version python test and disable fresh builds
* Disable new builds and enabled smdataparallel test
* Re-trigger CI
* Revert build config changes
Co-authored-by: Lai Wei <royweilai@gmail.com>
Co-authored-by: Akhil Mehra <armehra@amazon.com>
* [TensorFlow][Sagemaker] TF 2.4 cu110 EFA support (#172)
* TF 2.4 cu110 EFA support
* Added -g option for EFA installer
* Update NCCL installation
* Fixed NCCL installation
* Add constant at top
* Install hwloc, dependency of smdataparallel
* Disabled smdataparallel integration test temporarily since current smdataparallel wheel is incompatible with EFA
* Updated EFA version to 1.11.2 which comes with MPI v4.1.0
* update OPEN_MPI
* Install NCCL from source and updated the openMPI path
* Re-trigger CI
* Disable the framework build and test which is not applicable to this PR and added EFA related flag
* Fix mpi flag failure
* Add correct runtime MPI flags
* Add correct MPI flags, modify build config
* Disable new builds and Fixed SM Horovod test
* Enabled smdataparallel test
* Removed building NCCL with specific arch. Use default config which builds for all arch
* Revert build config changes
Co-authored-by: yselivonchyk <y.selivonchyk@gmail.com>
Co-authored-by: Akhil Mehra <armehra@amazon.com>
* Run PT to test EFA (#191) add sanity efa test
* [pytorch] | [test] | [sagemaker] SMModel Parallel pytorch EFA tests on p3dn (#187) SMModel Parallel pytorch EFA tests
Co-authored-by: Jeetendra Patil <jeet4320@users.noreply.github.com>
Co-authored-by: Karan Jariwala <karankjariwala@gmail.com>
Co-authored-by: Lai Wei <royweilai@gmail.com>
Co-authored-by: yselivonchyk <y.selivonchyk@gmail.com>
* [tensorflow] | [test] | [sagemaker] (#188) add efa test for tf2
Co-authored-by: Jeetendra Patil <jeet4320@users.noreply.github.com>
Co-authored-by: Karan Jariwala <karankjariwala@gmail.com>
Co-authored-by: Lai Wei <royweilai@gmail.com>
Co-authored-by: yselivonchyk <y.selivonchyk@gmail.com>
* Run PT Rubik EFA test (#194)
* run pt efa rubik
* skip inference
* revert
* Run rubik efa tests on tf2 (#195)
* run rubik efa tests on tf2
* [test][sagemaker] Add reupload_image_to_test_ecr to SM tests conftest (#193)
* [PyTorch][test][sagemaker] EFA test for smdataparallel (#189) EFA test for smdataparallel
* [habana] Placeholder for Build and Test Functionality for Habana (#197)
* [habana] build functionality
* modify habana dedicated flag
* enable habana build
* build config changes
* add pytorch and modify test configuration
* move build artifact
* test support for habana
* nit changes
* build changes
* nit change
* support for SM and benchmark
* address comments
* build eia and neuron
* enable new builds
* nit
* revert temp configs
* remove dead code from eks test
* [Habana] Add changeset logic (#198)
* changeset logic for habana
* enable habana mode
* test buildspec
* change dockerfiles
* disable habana mode and revert changes
* remove unwanted code
* [test] Run test using existing EC2 instance locally (#201)
* Run test using existing EC2 instance
* rename pytest fixture
* Removing any SM related installs from Dockerfile (#200)
* Removing any SM related installs
* Cleaned Dockerfile.Added 2.5 folder
Co-authored-by: Tejas Chumbalkar <34728580+tejaschumbalkar@users.noreply.github.com>
* [pytorch/tensorflow] Habana DLC python 3.7, OMPI in base installer and pytorch DLC fixes (#202)
* Habana Pytorch DLC and OMPI Install In Habana Bases
* Fix docker path
* Rebased and added TF2.5
* Update pytorch to 0.15.0 synapse
* Updated Pytorch docker file (#204)
* Updated Pytorch docker file. Also updated buildspec to pull whl from s3 bucket
* Removed SM packages. Added few more pythom packages. Renamed folder to 0.15
* Minor fix in buildspec
* build habana images
* correct build config
* disable build config
Co-authored-by: tejaschumbalkar <tejaschumbalkar@gmail.com>
* Update buildspec.yml (#206) Updated pytorch wheel. Added HPUBase for test cases.
* SynapseAI 0.15.0 Release DLC Changes (#205)
* SynapseAI 0.15.0 Release
* Add example branch parse and Habana PR build
* Fix extra slash
* Revert ENABLE_HABANA_MODE
* [Habana][Build] Fix torchvision python version py37 (#207)
* Fix torchvision python version py37
* Updated h5py version to 3.1.0
* enable habana mode and disable test
* Using pypi package for torchvision
* add docker build artifacts
* add build artifacts references to buildspec
* revert config
Co-authored-by: tejaschumbalkar <tejaschumbalkar@gmail.com>
* SynapseAI v0.15.1 release updates (#208)
* SynapseAI v0.15.1 release updates
* build habana switch on
* fix pt parse
* ENABLE_HABANA_MODE=False
* Updating TF binaries with callback fixes (#210)
* Updating TF binaries with callback fixes
* Enabling Habana build
* Resetting ENABLE_HABANA_MODE=False
* SynapseAI v0.15.2 release updates (#209)
* SynapseAI v0.15.2 release updates
* SynapseAI v0.15.2 release updates
* Fix folder naming
* Re-Disable ENABLE_HABANA_MODE in build_config.py
* SynapseAI v0.15.2 release updates
* SynapseAI v0.15.2 release updates
* Fix folder naming
* Re-Disable ENABLE_HABANA_MODE in build_config.py
* Updating Torchvision binary (#211)
* Updating Torchvision binary as we need to build with same setup as pytorch for compatibilty
* Enabling Habana mode
* Reset ENABLE_HABANA_MODE= False
* SynapseAI v0.15.3 release updates (#213)
* SynapseAI v0.15.3 release updates
* SynapseAI v0.15.3 release updates
* Enable Habana Mode
* Disable Habana Mode
* address rebase modifications
* [DO NOT MERGE] [autogluon][build, test] Initial PR for training containers (#214)
* [autogluon][build, test] fixing instance types (#218)
* format ecr repo from image uri (#217)
* format ecr repo from image uri
* pytest markers for hpu test
* more markers
* nit habana changes
* [habana][build] fix docker entrypoint (#219)
* fix docker entrypoint
* revert habana mode
* Fixed version in autogluon buildspec (#215)
* Fixed version in autogluon buildspec
* Enabling sagemaker tests
* Enable building a new container
* Added MAJOR_VERSION into docker files, added autogluon_training fixture
* [autogluon][test] SageMaker remote mode tests
* [autogluon][test] removed datasets requirement
Co-authored-by: Sergey Togulev <togulev@amazon.com>
Co-authored-by: Alexander Shirkov <ashyrkou@amazon.com>
* [autogluon][test] tests fixes (#220)
* [autogluon][test] tests fixes
* [autogluon][test] tests fixes
* [autogluon][test] removed jupyter dependencies leftovers
* [autogluon][test] removed jupyter dependencies leftovers
* [autogluon][test] version checks fixes
* [autogluon][test] pip check fixes
* [autogluon][test] pip check fixes
* [autogluon][test] sm_local tests fixes
* [autogluon][test] sm_local tests fixes
* [autogluon][test] applied pillow security fixes to autogluon
* [autogluon][test] removed jupyter dependencies leftovers
* [build][test]Rolling back default parameters changes (#224)
* Rolling back default parameters changes
* [autogluon][test] test fixes
Co-authored-by: Sergey Togulev <togulev@amazon.com>
Co-authored-by: Alexander Shirkov <ashyrkou@amazon.com>
* [autogluon][test]Fixes for AG sanity tests (#226)
Co-authored-by: Sergey Togulev <togulev@amazon.com>
* [release] Fixed release notes logic (#228)
Co-authored-by: Sergey Togulev <togulev@amazon.com>
* [release] Fix for imp_pip_packages (#231)
* [release] Fixed release notes logic
* [release] Fixed release notes logic
* [release] Fixed release notes logic
* [release] Fixed release notes logic
* [release] Fixed release notes logic
Co-authored-by: Sergey Togulev <togulev@amazon.com>
* [habana] fix pip check requirements (#225)
* habana sanity test
* reinstall boto3
* upgrade boto3
* remove comments
* revert temp configs
* SynapseAI v0.15.4 release updates (#233)
* SynapseAI v0.15.4 release updates
* SynapseAI v0.15.4 release updates
* Enable Habana Mode
* Revert "Enable Habana Mode"
This reverts commit 9ed1a8f58d2d5c71977ff0cc660e3228c3dd8874.
* Remove hb-torch & install into --user for python packages (#237)
* Remove hb-torch before installing AWS torch
* python packages to user space install
* add -y to uninstall
* enable habana mode
* disable habana mode
Co-authored-by: tejaschumbalkar <tejaschumbalkar@gmail.com>
* [build] habana build modifications (#238)
* habana build modifications
* run test safety
* make sanity test compatible with hpu processor
* fix sanity test
* sync up utility test changes from public repo
* address comments
* revert temp config
* remove PT1.7 and TF2.5 from release_images.yml (#245)
* Remove keras package before installing tensorflow (#247)
* Remove keras package before installing tensorflow
* Enable habana_mode
* run test safety
* disable habana mode
* revert safety test changes
Co-authored-by: tejaschumbalkar <tejaschumbalkar@gmail.com>
* [hopper][build] Add hopper build code (#246)
* Merge master into private-master (#248)
* [test] Add hopper_mode to quick checks tests (#251)
* followup sync changes
* [hopper][build] sync hopper dockerfiles with huggingface dockerfiles (#254)
* [hopper][build] sync hopper dockerfiles with huggingface dockerfiles
* Enable hopper mode
* Fix bug with CI for Hopper
* Use py38 wheel and disable debug env vars
* Update xla wheel and set buildspec correctly for hopper
* Fix framework path and artifact name
* Fix framework version path
* Disable hopper mode
Co-authored-by: Sai Parthasarathy Miduthuri <saimidu@amazon.com>
* [hopper][build] Add more wheels for hopper (#258)
* buildspec and status modifications (#261)
* [hopper][pytorch][test] Fix horovod tests (#266)
* Reinstall horovod for hopper
* Enable hopper mode
* Remove hopper dedicated
* Revert hopper dedicated
* Update dlc_developer_config.toml
* [hopper][test] Fix getting framework for hopper (#265)
* [hopper][test] Fix getting framework for hopper
* Add dummy change to trigger build
* Add dummy change in buildspec to trigger build
* Add dummy change in dockerfile
* Remove hopper dedicated
* Update main.py
* Update main.py
* Update main.py
* Remove dummy changes
* Update dlc_developer_config.toml
* [hopper][pytorch][build] Update transformers wheel (#267)
* [hopper][pytorch][build] Update transformers wheel to the latest (#269)
* [hopper][pytorch][build] Update hopper wheels (#270)
* [habana] fix pip check and unpin werkzeug package (#271)
* unpin werkzeug package
* install latest version
* fix rebase changes
* fix pip check
* revert temp config
* install typing
* build habana dlc
* revert temp changes
* release PT1.9 diy/sm (#272)
* [release] adjust customer_type for diy/sm (#273)
* adjust customer_type
* adjust customer_type
* nit change
* remove neuron (#274)
* add habana packages to release page (#241)
* [hopper][build][pytorch] Update hopper pytorch wheels (#275)
* Update hopper pytorch wheels
* [hopper][build][pytorch] Update transformers wheel (#276)
* [hopper][build][pytorch] Update transformers wheel
* [hopper][build][pytorch] Update transformers wheel (#278)
* [hopper][build][pytorch] Update transformers wheel
* Disable hopper mode
* Synch HF images from public (#281)
Co-authored-by: Sergey Togulev <togulev@amazon.com>
* [hopper][build][pytorch] Upgrade transformers to 11.0 (#282)
* Upgrade transformers to 11.0
* Update transformers version
* Disable hopper mode
* trigger builds
* retrigger builds
Co-authored-by: tejaschumbalkar <tejaschumbalkar@gmail.com>
* [hopper][huggingface_tensorflow][huggingface_pytorch][build][test] Build and test Hopper images with sm pysdk (#280)
* Added the changes to build hopper images with sm pysdk
* Added the tests to run using sm pysdk
* Added debug lines
* Run SM local tests and address comments
* Deactivated ecs and eks tests.
* Reverting the dev config changes
* [test][sagemaker] Make PySDK binary selection logic generic for the SM tests and SM local tests (#283)
* Make PySDK binary selection logic generic for the SM and SM local tests
* Make hopper mode true
* Revert the changes
* [hopper][build][pytorch][tensorflow] Update fw wheels with init changes (#284)
* [hopper][build] Update fw wheels with init changes
* Enable test flags
* Fix typo
* Disable test flags
* [hopper][build][pytorch] Fix Hopper DT NaN issue (#288)
* Fix Hopper DT NaN issue
* Update dlc_developer_config.toml
Co-authored-by: pinaraws <47152339+pinaraws@users.noreply.github.com>
* [hopper][build][pytorch][tensorflow] Fix licence files (#289)
* [hopper] [build] [pytorch] Updating SM trcomp PT wheels for DT support (#293)
* Updating SM trcomp PT wheels for DT support
* Update dlc_developer_config.toml
Co-authored-by: pinaraws <47152339+pinaraws@users.noreply.github.com>
* [hopper][build][pytorch] Include examples dir in transformers wheel (#291)
* Include examples dir in transformers wheel
* Update transformers wheel
* Update dlc_developer_config.toml
Co-authored-by: pinaraws <47152339+pinaraws@users.noreply.github.com>
* [hopper] [test] [sagemaker] Adding tests targeting the SM Training Compiler integrated containers Private master (#286)
* Fix bugs in framework init functions. +new Fx Wheels for HF-trcomp Create remote and local test for HF-PT-trcomp Create remote tests for HF-TF-trcomp Make tests shorter
* Added handlers for non implemented tests
* Updating HF-trcomp tests to look for log messages indicating trcomp has been ingaged in the training logs
* Fix for smdebug EC2 test.
* Adding HF-PT-trcomp tests to test different trcomp configs. Porting testing to work with HF-TF-trcomp.
* Finalizing HF-trcomp tests Fixed HF-TF-trcomp build recipe. Add redundancy to all trcomp build recipes Fixing test dependencies
* Increasing retries for HF trcomp tests
* Skipping HF-PT-trcomp local test since it hangs. Will fix later
* Reverting test mode
Co-authored-by: Sergey Togulev <togulev@amazon.com>
* [test] Fix smart retry benchmark tests (#1452) (#296)
* Fix for multithreading error in SM local tests
* Rollback dlc_developer_config changes
* Fix for SM local tests
* Rolled back dev_config changes
* Fix for multithreading error in SM local tests
* Rollback dlc_developer_config changes
* Fix for SM local tests
* Rolled back dev_config changes
* Fix for smart retry benchmark tests
Co-authored-by: Sergey Togulev <togulev@amazon.com>
(cherry picked from commit df440538a7c5f580301c5f3a1c56c14beab48821)
Fix smart retry (#1451)
* Fix for multithreading error in SM local tests
* Rollback dlc_developer_config changes
* Fix for SM local tests
* Rolled back dev_config changes
Co-authored-by: Sergey Togulev <togulev@amazon.com>
(cherry picked from commit 97fb152a7022f252d4349742cbc7d7c3bc0af9a6)
[test] Smart retry functionality (#1414)
* check pytest cache
* enable builds
* enable builds
* enable builds
* enable builds
* disable builds
* disable builds
* enable builds
* Added -p to mkdir
* Using dinamic obj name
* Added try-catches
* Moved everything to separate functions
* Fixed a small bug
* Removed separate functions
* Removed separate functions
* Fixed bugs
* Fixed bugs
* Fixed bugs
* Added tests for sagemaker
* Typo fix
* Added last-failed for sagemaker
* Fixing sm-local tests
* Removed json
* updated ec2 commands
* using string in threads pool instead of dict
* moved to p.map again
* moved to p.map again
* Rolled back dev_config changes
* Fixed sm-local tests
* Fixed sm-local tests
* Fixed sm-local tests
* refactored pytest_cache.py
* fixed a bug
* removed code for sagemaker remote tests
* rolled back dev config
* A few changes after the review
* A few changes after the review
* Fixed a typo
* Added account number parameter
* Refactored utils instantiating
* A few NITs
Co-authored-by: Sergey Togulev <togulev@amazon.com>
(cherry picked from commit 5938a87927cbd7c4500a04a98c2d58dea82d3dad)
Co-authored-by: Sergey Togulev <togulev@amazon.com>
* Fix for smart retry (#300)
Co-authored-by: Sergey Togulev <togulev@amazon.com>
* [trcomp] [build] Fixing debug artifact path for trcomp (#299)
* [trcomp] [build] Fixing debug artifact path for trcomp
* fix: Adding additional checks to trcomp HF-PT debug tests to ensure debug artifacts are uploaded.
* Reverting PR test config
* [hopper][build][pytorch] Fix transformers gradient clipping issue (#304)
* Fix transformers gradient clipping issue
* Trigger build
* Use pipeline-built transformers wheel
* Update dlc_developer_config.toml
Co-authored-by: pinaraws <47152339+pinaraws@users.noreply.github.com>
* release_images.yml with hopper images (#306)
Co-authored-by: Sergey Togulev <togulev@amazon.com>
* [release] Release trcomp (#307)
* release_images.yml with hopper images
* Added trcomp
Co-authored-by: Sergey Togulev <togulev@amazon.com>
* [hopper][build][pytorch] Add distributed training entry point (#308)
* [hopper][build][pytorch] Add distributed training entry point
* Disable tests
* Skipping benchmark tests for trcomp containers (#309)
Co-authored-by: Sergey Togulev <togulev@amazon.com>
* [tensorflow][build][test] Tensorflow2.6 with SM PySDK keynote3 (#287)
* Tensorflow2.6 with SM PySDK keynote3
* Adding leftover changes
* Increase image size
* Use partially complete keynote3 PySDK
* Added changes to pass pr quick checks
* Minor fix for sanity and quick checks
* Fixing the download path
* Log absolute path
* Fixing the path for pr checks
* Reformatted using black -l 120
* Addressed comments
* Increased image size
* After the latest wheel release
* [config] Fix `do_build` config option (#1494)
* Set do_build as false
* Sync the cpu dockerfile with public master
* Added the keras version pinning
* Minor fix
* Pinned tensorflow io
* Make gpu dockerfile same as public with pinned tfio
* Install new sm binaries
* Added the increased sizes
* Added changes for tf2.6.2
* Make image baseline 8000
* Changed the tf2.6.2 binaries to many_linux latest
* Revert dlc developer config
Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com>
* Skipping sm debugger tests for trcomp containers (#310)
Co-authored-by: Sergey Togulev <togulev@amazon.com>
* add graviton support (#313)
* revert graviton release specs (#314)
* [trcomp][build][pytorch] Fix distributed training entry point (#315)
* [trcomp][build][pytorch] Fix distributed training entry point
* Skipping sm debugger tests for trcomp containers
Co-authored-by: Sergey Togulev <togulev@amazon.com>
Co-authored-by: Sergey Togulev <34056697+SergTogul@users.noreply.github.com>
* [build]|[test]|[tensorflow] Made changes to build TF2.6.2 with SmPySDK and Boto (#316)
* Made changes to build TF2.6.2 with SmPySDK and Boto
* Revert temp chagnes
* Added sanity check tests
* release graviton for gamma testing (#317)
* [huggingface-neuron] Update release_images.yml (#318)
* Update release_images.yml (#319)
* Update release_images.yml For hf neuron for the time being have disable_sm_tag to True
* Update release_images.yml
Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com>
* [trcomp] [pytorch] [build] Defaulting GPU_NUM_DEVICES to 1 (#321)
* [trcomp] [pytorch] [build] Defaulting GPU_NUM_DEVICES to 1
* [trcomp] [pytorch] [test] Testing default value of GPU_NUM_DEVICES
* Reverting PR config
* Upgrade pillow in TF hopper container (#322)
Co-authored-by: Sergey Togulev <togulev@amazon.com>
* Pillow fix (#323)
* Upgrade pillow in TF hopper container
* fixed a typo in a dockerfile
Co-authored-by: Sergey Togulev <togulev@amazon.com>
* [trcomp] [pytorch] [build] Fixing CVEs (#324)
* [trcomp] [pytorch] [build] Fixing CVEs
* Skipping not needed frameworks
* Removing hf-pt to trigger hopper tests
* Trying to execute hopper tests
* Skipping not needed frameworks
* Fixed dependency check issues self-discovery
* Addded print for debugging
* [trcomp] [pytorch] [build] Fixing CVE in bokeh
* Moved bokeh installation into a different block
* Removed temp logging
* [trcomp] [pytorch] [build] Fixing CVE in numpy and ipython
* Rollback temp changes
Co-authored-by: Sergey Togulev <34056697+SergTogul@users.noreply.github.com>
Co-authored-by: Sergey Togulev <togulev@amazon.com>
* Bump tensorflow in /test/sagemaker_tests/huggingface_tensorflow/training (#295)
Bumps [tensorflow](https://github.com/tensorflow/tensorflow) from 2.5.1 to 2.5.2.
- [Release notes](https://github.com/tensorflow/tensorflow/releases)
- [Changelog](https://github.com/tensorflow/tensorflow/blob/master/RELEASE.md)
- [Commits](https://github.com/tensorflow/tensorflow/compare/v2.5.1...v2.5.2)
---
updated-dependencies:
- dependency-name: tensorflow
  dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Tejas Chumbalkar <34728580+tejaschumbalkar@users.noreply.github.com>
* [trcomp] [pytorch] [build] Fixing perf issues in g4dn instances (#325)
* [trcomp] [pytorch] [build] Fixing perf issues in g4dn instances
* Revert PR check config
Co-authored-by: Sergey Togulev <34056697+SergTogul@users.noreply.github.com>
* [test][sanity] Removed temp changes from test runner (#327)
* [trcomp] [pytorch] [build] Fixing CVEs
* Skipping not needed frameworks
* Removing hf-pt to trigger hopper tests
* Trying to execute hopper tests
* Skipping not needed frameworks
* Fixed dependency check issues self-discovery
* Addded print for debugging
* [trcomp] [pytorch] [build] Fixing CVE in bokeh
* Moved bokeh installation into a different block
* Removed temp logging
* Rollback temp changes
* Rollback temp changes
Co-authored-by: Loki <lokravi@amazon.com>
Co-authored-by: Sergey Togulev <togulev@amazon.com>
* Using pypi sagemaker (#332)
Co-authored-by: Sergey Togulev <togulev@amazon.com>
* Merging from PUBLIC (#333)
* Merging from PUBLIC
* Fixed docker login
* Fixed parameter passing
* Fixed import
* Fixed sm_helper import
* Rollback config changes
Co-authored-by: Sergey Togulev <togulev@amazon.com>
* [Trcomp][CI] logic change copied from PR331 (#337)
* [Trcomp][CI] logic change copied from PR331
* comment out failed dockerfile commands
* revert dev config
* update dev config
* address comments
* set dev config
* fix typo
* update
* remove sagemaker test
skip * sync with PUBLIC * remove unwanted habana test * revert dev config * remove sagemaker test skip for pytorch trcomp Co-authored-by: tejaschumbalkar <tejaschumbalkar@gmail.com> * [trcomp] [pytorch] [build] Adding support for PyTorch 1.10 (#329) * [trcomp] [pytorch] [build] Adding support for PyTorch 1.10 * Setting developer config for PR validation tests * [trcomp] [pytorch] [build] Release PyTorch 1.10.0 * [trcomp] [pytorch] [build] Adding common training dependencies * [trcomp] [pytorch] [test] Changing tests to reflect changes to HF logging in 4.16.2 * [trcomp] [pytorch] [build] Adding common training dependencies * [trcomp] [pytorch] [build] Upgrading PT from 1.10.0 to 1.10.2 * [trcomp] [pytorch] [build] Adding torchaudio binaries * [trcomp] [pytorch] [build] Updating NCCL version in binaries * [trcomp] [pytorch] [test] Adding back skip markers after bad merge * [trcomp] [pytorch] [build] Updating torch version to reflect X.Y.Z+cuABC * [trcomp] [pytorch] [build] Fixing numpy version to fix dependency for package numba * fiix sanity failures * rename dockerfile * remove duplicate test skip logic * update e3 test skip logic * fix sagemaker test directory * fix sanity test * enable ec2 test run and fix smdebug test * nit change * fix framework name * fix variable name * [trcomp] [test] Removing/Replacing internal code names * [trcomp] [pytorch] [build] Fixing GPU_NUM_DEVICES issue with Distributed Training * [trcomp] [pytorch] [build] Adding support for G5 instances with A10 GPUs * Reverting developer config Co-authored-by: tejaschumbalkar <tejaschumbalkar@gmail.com> Co-authored-by: Qingzi-Lan <qingzila@amazon.com> * [trcomp][build] fix the base image version for TF 2.6.3 (#331) * fix the base image version * update dev config * upgrade numpy & openssl * downgrade numpy to 1.21 * fix sanity tests * enable ec2 test * update ec2 test skip logic * update dockerfile name logic * update * update * update * fix typo * update * update * update * fix typo * skip 
horovod test * update * update dev config * fix sagemaker test path * update sagemaker test skip fixture * update * update dev config * revert dev config Co-authored-by: Qingzi-Lan <qingzila@amazon.com> Co-authored-by: Qingzi-Lan <83724147+Qingzi-Lan@users.noreply.github.com> * [release] release HF Trcomp TF2.6.3 & PT 1.10.2 (#338) * release HF Trcomp TF2.6.3 & PT 1.10.2 * backup previous release_images.yml * Sync eks infrastructure changes (#340) * Graviton eks infrastructure (#1579) * initial commit * add pre-deploy * add nodegroup support * modify eks buildspec * build a cluster * add kubeconfig * nit change * revert temp changes * explictly set managed node * remove managed option * add option to upgrade nodegroup * nit change * template update Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-219.us-west-2.compute.internal> Co-authored-by: Qingzi-Lan <83724147+Qingzi-Lan@users.noreply.github.com> Co-authored-by: Qingzi-Lan <qingzila@amazon.com> * [eks] Upgrade EKS nodegroups and enable eks test for graviton (#1821) * ung * enable eks test for graviton * build image * disable config * deploy graviton nodegroups Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-219.us-west-2.compute.internal> Co-authored-by: Qingzi-Lan <83724147+Qingzi-Lan@users.noreply.github.com> Co-authored-by: Qingzi-Lan <qingzila@amazon.com> * upgrade nodegroup (#341) * Merge from PUBLIC repo @ef69cf4 (#339) * test merge from PUBLIC * trigger test * update dev config * revert dockerfile change * change dockerfile * update utils * debug modified dockerfile regexp * debug github handler file changed * revert debug info, and force to_build to true * enable habana build * fix merge error * restore files from PUBLIC * revert dev config and "changeset limited to 20files" work around * [build] Find buildspecs using configured env vars (#366) * [pytorch][build] Remove patch version from buildspec file name (#376) * Sync from public repo (#387) * release pt-1.10.0 (#1616) * release pt-1.10.0 Co-authored-by: 
arjkesh <33526713+arjkesh@users.noreply.github.com> * [huggingface_pytorch][NEURON][build] Huggingface Neuron inference DLC (#1578) Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com> Co-authored-by: Venky Natham <vrnatham@amazon.com> * [build][graviton][mxnet][pytorch] fix graviton image build (#1618) * fix graviton image build * revert dev config * Run dependency check on HF neuron images (#1622) * [tensorflow][test][benchmark] Makeshift fix for flaky benchmark tests (#1575) * Makeshift fix for flaky benchmark tests * Shifted the if condition * Reverting change * Removing unnecessary import * reverting temp changes * Add support for multistage dockerfiles for e3/sagemaker (#1532) * Exclude dependency check library from tool (#1611) * [MXNet][build][test] Release MX 1.9.0 inference & training binaries (#1217) Co-authored-by: Sai Parthasarathy Miduthuri <saimidu@amazon.com> Co-authored-by: Wei Chu <weichu@amazon.com> Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com> * Update release images for MX1.9 (#1639) * Run MX sagemaker benchmarks on SM images (#1640) * [test][sagemaker]Sm remote smart retry (#1573) * Refactored mxnet sm multi-region tests * Rollback devconfig changes * Update SM smart retry * converting custom_cache_directory to string * converting custom_cache_directory to string * converting custom_cache_directory to string * upload cache to s3 * upload cache to s3 * upload cache to s3 * upload cache to s3 * upload cache to s3 * added broken test * added broken and working tests * added broken and working tests * added broken and working tests * Fixed bug * Fixed bug * Revert temp changes * Fixed bug * Rolled back temp changes * Added a few comments * A few edits after review * Rolled-back temp changes Co-authored-by: Sergey Togulev <togulev@amazon.com> * [doc] Added NVIDIA Triton inference containers to available images (#1591) * [NEURON][TEST] - Update the manifest for 1.17.0 release (#1632) * [neuron][huggingface] 
Update MMS version in HF Neuron DLCs (#1644) * support py38 in MX sagemaker tests (#1652) * Update MX 1.9 example images (#1654) * Update numpy version in MX images (#1656) * Pin numpy to <1.20 in MX 1.9 images (#1657) * Pin numpy to <1.20 in MX 1.9 images * update buildspec * Habana Synapseai v1.2.0 dockerfiles (#1627) * Habana 1.1.1 release update * Update docker image path to 1.1.1 release docker * Added 1.9.1 pytorch * Added 2.7.0 tensorflow * Turn on habana_mode=true * update framework binaries * update dockerfile to py38+ul20 * Fix Pytorch docker container path * update license files * Update 1.2.0 links * update binaries for PT1.10 * update pt binaries * remove pytorch_binary from buildspec * Remove dataclass/typing workaround from previous releases * fix few build failures * Unpin Pillow package and fix dataclass/typing on 2.7 instead of 2.5 * unpin request * allow openssl cve * update tf wheel with tensorflow-cpu * fix security issue * nit change * revert developer config Co-authored-by: Wei Chu <weichu@amazon.com> Co-authored-by: tejaschumbalkar <tejaschumbalkar@gmail.com> Co-authored-by: Tejas Chumbalkar <34728580+tejaschumbalkar@users.noreply.github.com> * [NEURON][BUILD][MX] - update to sdk1.17.0 (#1636) * [NEURON][BUILD][TF2.5] - update to use sdk1.17.0 and also tf2.5.2 (#1635) * Release MX inference images for MXNet 1.9 (#1662) * update availabel_images.md for MX1.9 (#1655) * [NEURON][BUILD][PT] - move to sdk1.17.0 and also use pytorch 1.10.1 (#1634) * [NEURON][RELEASE] - Update yml file to add PT1.10.1 and TF2.5.2 (#1668) * Relase Neuron Images for sdk1.16.0 Release PT1.9.1, TF1.15.5, Tf2.5.1, MX:1.8.0 Signed-off-by: Venky Natham <vrnatham@amazon.com> * don't look for sm tag Signed-off-by: Venky Natham <vrnatham@amazon.com> * Add neuron release 1.16.1 version Signed-off-by: Venky Natham <vrnatham@amazon.com> * add neuron release 1.16.1 Signed-off-by: Venky Natham <vrnatham@amazon.com> * update available images for neuron Signed-off-by: Venky Natham 
<vrnatham@amazon.com> * fix md file to have py37 for pt Signed-off-by: Venky Natham <vrnatham@amazon.com> * add old neuron versions Signed-off-by: Venky Natham <vrnatham@amazon.com> * Release PT1.10.1 and TF2.5.2 Neuron DLC Signed-off-by: Venky Natham <vrnatham@amazon.com> * add to release_images.yml Signed-off-by: Venky Natham <vrnatham@amazon.com> * add mxnet Signed-off-by: Venky Natham <vrnatham@amazon.com> * Update release_images.yml * Update .release_images_template.yml Co-authored-by: Sai Parthasarathy Miduthuri <54188298+saimidu@users.noreply.github.com> Co-authored-by: Tejas Chumbalkar <34728580+tejaschumbalkar@users.noreply.github.com> Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com> * [NEURON][BUILD][TF] - Upgrade tf1.15.5 to use the neuron sdk 1.17.0 (#1642) * Release neuron sdk1.17.0 version of tf1.15.5 dlc (#1673) * Relase Neuron Images for sdk1.16.0 Release PT1.9.1, TF1.15.5, Tf2.5.1, MX:1.8.0 Signed-off-by: Venky Natham <vrnatham@amazon.com> * don't look for sm tag Signed-off-by: Venky Natham <vrnatham@amazon.com> * Add neuron release 1.16.1 version Signed-off-by: Venky Natham <vrnatham@amazon.com> * add neuron release 1.16.1 Signed-off-by: Venky Natham <vrnatham@amazon.com> * update available images for neuron Signed-off-by: Venky Natham <vrnatham@amazon.com> * fix md file to have py37 for pt Signed-off-by: Venky Natham <vrnatham@amazon.com> * add old neuron versions Signed-off-by: Venky Natham <vrnatham@amazon.com> * Release PT1.10.1 and TF2.5.2 Neuron DLC Signed-off-by: Venky Natham <vrnatham@amazon.com> * add to release_images.yml Signed-off-by: Venky Natham <vrnatham@amazon.com> * add mxnet Signed-off-by: Venky Natham <vrnatham@amazon.com> * Update release_images.yml * Update .release_images_template.yml * release neuron sdk 1.17.0 version of tf1.15.5 Signed-off-by: Venky Natham <vrnatham@amazon.com> Co-authored-by: Sai Parthasarathy Miduthuri <54188298+saimidu@users.noreply.github.com> Co-authored-by: Tejas Chumbalkar 
<34728580+tejaschumbalkar@users.noreply.github.com> Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com> * tensorflow_serving 2.8 e3 inference container (#1671) * add wip dockerfiles * add tensorflow_model_server * update ci instructions * update tensorrt * change pyversion in buidlpsec;rm files in /tmp for stray_file_test * update cve allow list * revert dev config * udpate tmp file delete * revert dev config, add tf27 buildspec Co-authored-by: Sergey Togulev <34056697+SergTogul@users.noreply.github.com> Co-authored-by: Qingzi-Lan <qingzila@amazon.com> Co-authored-by: Qingzi-Lan <83724147+Qingzi-Lan@users.noreply.github.com> * Add e3 dockerfiles for tensorflow 2.8 (#1647) * add tf28 e3 container dockerfiles * update buildspec * use numpy tensorflow dependency * update buildspec to reflect change of python version * Update buildspec.yml * install cudnn-dev * update cudnn * fix typo * enable safety scan * update horovod installation * A few security upgrades * upgrade pillow to 9.0.1 * urllib3 to the latest * ignore numpy false positive vulnerability * Fixed urllib constrain * Skipped a couple of safety tests * Turn off safety scan * update wheel * remove tempory pem file * revert dev config * set dev config with safety check * revert dev config Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com> Co-authored-by: Sergey Togulev <34056697+SergTogul@users.noreply.github.com> Co-authored-by: Sergey Togulev <togulev@amazon.com> Co-authored-by: Qingzi-Lan <qingzila@amazon.com> Co-authored-by: Qingzi-Lan <83724147+Qingzi-Lan@users.noreply.github.com> * Bump tensorflow in /test/sagemaker_tests/huggingface_tensorflow/training (#1677) Bumps [tensorflow](https://github.com/tensorflow/tensorflow) from 2.5.2 to 2.5.3. 
- [Release notes](https://github.com/tensorflow/tensorflow/releases) - [Changelog](https://github.com/tensorflow/tensorflow/blob/master/RELEASE.md) - [Commits](https://github.com/tensorflow/tensorflow/compare/v2.5.2...v2.5.3) --- updated-dependencies: - dependency-name: tensorflow dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * use js_import instead of js_include for TF serving nginx configuration (#1666) * tf2.7 inf build * fix buildspec * nginx configuration * revert config * remove - * rename tfs file * nginx configuration * remove js_content * manage export statement * fix nginx errors * Enabling safety test * revert temp changes * address comments * nit change * change file name * adjust file name * nit change * enable inference build * revert buildspecfile changes Co-authored-by: Sergey Togulev <34056697+SergTogul@users.noreply.github.com> * [Habana]|[Builld]|[Test] Enable Safety Scan Ignore list for Habana numpy issues (#1678) * Enable Safety Scan Ignore list for Habana numpy issues * Changed the ignore messages * Reverted developer config changes Co-authored-by: Shantanu Tripathi <trshanta@amazon.com> * add TF2.8 in release images (#1676) * Release neuron sdk 1.17.0 tf1.15.5 (#1681) * Relase Neuron Images for sdk1.16.0 Release PT1.9.1, TF1.15.5, Tf2.5.1, MX:1.8.0 Signed-off-by: Venky Natham <vrnatham@amazon.com> * don't look for sm tag Signed-off-by: Venky Natham <vrnatham@amazon.com> * Add neuron release 1.16.1 version Signed-off-by: Venky Natham <vrnatham@amazon.com> * add neuron release 1.16.1 Signed-off-by: Venky Natham <vrnatham@amazon.com> * update available images for neuron Signed-off-by: Venky Natham <vrnatham@amazon.com> * fix md file to have py37 for pt Signed-off-by: Venky Natham <vrnatham@amazon.com> * add old neuron versions Signed-off-by: Venky Natham <vrnatham@amazon.com> * Release PT1.10.1 and TF2.5.2 Neuron DLC 
Signed-off-by: Venky Natham <vrnatham@amazon.com> * add to release_images.yml Signed-off-by: Venky Natham <vrnatham@amazon.com> * add mxnet Signed-off-by: Venky Natham <vrnatham@amazon.com> * Update release_images.yml * Update .release_images_template.yml * release neuron sdk 1.17.0 version of tf1.15.5 Signed-off-by: Venky Natham <vrnatham@amazon.com> * release neuron sdk 1.17.0 based tf1.15.5 Signed-off-by: Venky Natham <vrnatham@amazon.com> Co-authored-by: Sai Parthasarathy Miduthuri <54188298+saimidu@users.noreply.github.com> Co-authored-by: Tejas Chumbalkar <34728580+tejaschumbalkar@users.noreply.github.com> Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com> * Add support for sagemaker-like E3 tag (#1672) * [canary] Update python versions for TF canaries (#1682) * fix TF tests issues (#1684) * Habana DLC Perf/TestSuite TF/PT tests -- gaudi test suite (#1567) * Habana DLC Perf/TestSuite TF/PT tests * Add Habana DLAMI Tensorflow Performance Benchmarks * Add Habana DLAMI PyTorch Performance Benchmarks * Add Habana DLAMI Tensorflow Test Suite * Add Habana DLAMI PyTorch Test Suite * Apply gaudi-test-suite to test bert, rn50, maskrcnn, framework, etc. 
* Test cleanup and exit code fix * Fix gaudi-test-suite branch name * To extract the Throughput correctly * Update scripts for 1.2.0 release * Add tf requirement installation * Remove comments * fix test scripts * enable habana mode * configure git creds * build habana images * adjust test dir * run benchmark tests * fix docker command * update pt binary * build new image * use dedicated github granch * nit change * pin pt setuptools * pin setuptools * fix log file * fix benchmark test * awscli support * fix dep check * nit changes * run benchmark test * adjust pytest timeout for habana * turn off benchmark mode * add habana fixture * run benchmark test * increase timeout * revert temp config * increase timeout to 5hr * build image * run benchmark test * revert temp config Co-authored-by: Sergey Togulev <34056697+SergTogul@users.noreply.github.com> Co-authored-by: Buke Ao <bukeao@bukeao-vm.habana-labs.com> Co-authored-by: Anny Chung <achung@habana.ai> Co-authored-by: tejaschumbalkar <tejaschumbalkar@gmail.com> Co-authored-by: Tejas Chumbalkar <34728580+tejaschumbalkar@users.noreply.github.com> Co-authored-by: Wei Chu <weichu@amazon.com> * change cudnn version for tf2.8 for compatibility with p2 instances (#1688) * update cudnn version * update buildspec * test on p2 instance * revert dev config Co-authored-by: Qingzi-Lan <qingzila@amazon.com> * Habana release v1.2 images for TF and PT (#1687) * release v1.2 * nit * habana release v1.2 (#1691) * Bump numpy in /test/sagemaker_tests/pytorch/inference (#1679) Bumps [numpy](https://github.com/numpy/numpy) from 1.16.4 to 1.21.0. - [Release notes](https://github.com/numpy/numpy/releases) - [Changelog](https://github.com/numpy/numpy/blob/main/doc/HOWTO_RELEASE.rst.txt) - [Commits](https://github.com/numpy/numpy/compare/v1.16.4...v1.21.0) --- updated-dependencies: - dependency-name: numpy dependency-type: direct:production ... 
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com> * [NEURON][BUILD][HF] - move hf neuron dlc to use latest sdk (#1669) * [NEURON][BUILD][HF] - use ubuntu18 (#1700) * use ubuntu18 Signed-off-by: Venky Natham <vrnatham@amazon.com> * enable test Signed-off-by: Venky Natham <vrnatham@amazon.com> * remove libtinfo6 install as that is specific to u20 Signed-off-by: Venky Natham <vrnatham@amazon.com> * Update dlc_developer_config.toml Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com> * [NEURON][BUILD][TF] - Move tf2.5.2 neuron to sdk 1.17.1 (#1696) * [NEURON][BUILD][MX] - Move to neuron sdk1.17.1 (#1698) * [NEURON][BUILD][PT] - Move pt1.10 to neuron sdk1.17.1 (#1699) * [NEURON][BUILD][TF] - Move tf1.15.5 to use neuron sdk 1.17.1 (#1697) * Release neuron sdk 1.17.1 version (#1702) Signed-off-by: Venky Natham <vrnatham@amazon.com> * [doc] Update available images for neuron sdk release 1.17.1 (#1703) * Add release images definition for HF PyTorch Neuron (#1694) * [PyTorch E3] PT 1.10.2 DLC release (#1683) * pt1.10.2 * add dgl * update vision binaries * update numpy and pillow versions * fix numpy 1.22.0 installation * update versions for cpu * pin ipython version * fix ipython installation * update dgl pt container tests * config for e3 only * pin numpy version * skip CVE 44463 * fix format * update dev config * update dev config * disable dgl * disable dgl cpu test for eks * revert graviton changes * revert sagemaker wheel * remove pt1.10.0 buildspec * revert dev config * Update dlc_developer_config.toml Co-authored-by: Wei Chu <weichu@amazon.com> Co-authored-by: Qingzi-Lan <83724147+Qingzi-Lan@users.noreply.github.com> Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com> * Add tf2.7 training sagemaker dockerfiles (#1628) * add tf2.7 sagemaker dockerfiles * update buidlspec * remove 
non-compatible python packages * add dependencies for kebros * use manylinux wheels; add sagemaker dockerfiles * update horovod installation env vars * update horovod installation script * use numpy as tensorflow dep * Update buildspec.yml * install boost * increase image size limit * update pillow and add docker lables * use wheels from smdebuggers pipelines * fix sanity test * add labels for tf 2.7 sm cpu * rerun * build+rerun * reinstall horovod cpu * install smdebug directly from tag * fix typo; * Revert "fix typo;" This reverts commit c5bd300d2141a91ac4f3f1d6d13711aa975370cb. * Revert "install smdebug directly from tag" This reverts commit c51ef6b95b20de6f65397f34a29806ab77c03461. * Executing safety check in PR * install smdebug directly from the branch * bump up tensorflow to 2.7.1 * install higher version of tensorflow-io to avoid overriding tensorflow * Ignoring a false positive vulnerability * install tfds * change pytest comands * do not install dependencies as they have been installed in the dockerfiles * add SAGEMAKER_TRAINING_MODULE environment variable * remove pem file in tmp folder * update sagemaker-tensorflow * add smdataparallel * revert rm /tmp * remove /tmp/git-secrets * experiment with an smdebug fix * Revert "experiment with an smdebug fix" This reverts commit b19ee8347ed6208ff9c2ac81d489dba785632199. * skip test_keras_mirrored.py * fix error in buidlspec * Revert "fix error in buidlspec" This reverts commit b973fa415e324a6f63d1fb816b22848e35600934. 
* revert developer_config * fix buildspec * fix buildspec * fix py version * revert buildspec to mainline Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com> Co-authored-by: tejaschumbalkar <tejaschumbalkar@gmail.com> Co-authored-by: Sergey Togulev <34056697+SergTogul@users.noreply.github.com> Co-authored-by: Sergey Togulev <togulev@amazon.com> Co-authored-by: Qingzi-Lan <qingzila@amazon.com> Co-authored-by: Qingzi-Lan <83724147+Qingzi-Lan@users.noreply.github.com> * pt1.10.2 release images (#1706) * pt1.10.2 release images * add example * TF2.8: Clean up dockerfiles, update HVD test (#1693) * update pt1.10.2 release images (#1707) * update pt1.10.2 release images * Update release_images.yml Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com> * [Build][tensorflow] fix TF27 GPU CVE-2022-24407 (#1710) * test * update * udpate * update * should fail * test cpu and gpu * update gpu sasl package * udpate libsasl manually * update * add TF27 release images (#1714) * [Tensorflow] add comment on py39 installation on TF 2.8 dockerfiles (#1715) * document TF28 dockerfile * update * Release TF2.8 e3 images (#1716) Co-authored-by: Qingzi-Lan <83724147+Qingzi-Lan@users.noreply.github.com> * [Tensorflow][Test][ec2] Fix Habana Tensorflow EC2 tests (#1704) * Changed dev config to build images * Added safety check test true * Changed the build to true in buildspec * Add logic to upload and read from s3 with a break statement * Remove break and fix tail bug * Change loop time and last line of script * Added modularity * Removing unwanted logs * Modifying the while loop to check if the test can end early * Reformatting the code * Fixing bugs and refactoring * Minor fix * refactored code and added buckets for each account * Refactored to include the ValueError within execute_async method * Implemented bucket logic * Reverting temp changes Co-authored-by: Shantanu Tripathi <trshanta@amazon.com> * bug fix (#1717) * re-releas TF27 sagemaker cpu training 
(#1720) * [build][pytorch] pt1.10 add openssh support (#1619) * [tensorflow] Bug fixes to TF2.8 E3 images (#1723) * [tensorflow] Bug fixes to TF2.8 E3 images * add sasl install * upgrade sasl instead of reinstalling * Revert "upgrade sasl instead of reinstalling" This reverts commit 51eb07408a404edde16e5bb2ddb3aa3b782a37a7. * [Habana] [test] [ec2, sagemaker] Fix to skip SM tests for Habana and modify async testing API (#1724) * Fix to skip SM tests for Habana and modify async testing API * Added the hang detection window variable * Revert developer config Co-authored-by: Shantanu Tripathi <trshanta@amazon.com> * Move sasl to upgrade instead of install (#1726) * Add dependabot config file to scan Dockerfiles (#1727) * Add dependabot config file to scan Dockerfiles * Update dependabot.yml * [PyTorch] PyTorch 1.10.2 SageMaker DLC (#1709) * pt1.10.2 sm dlc * merge from upstream master * refactor smdebug installation * set enable_test_promotion:false for e3 Co-authored-by: Wei Chu <weichu@amazon.com> * Configured release_images.yml for TF2.8e3 re-release and PT1.10.2 SM release (#1737) * Configured release_images.yml for TF2.8e3 re-release * Update release_images.yml * Add Pytorch release changes to the yml Co-authored-by: Shantanu Tripathi <trshanta@amazon.com> Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com> * [build][pytorch] pytorch 1.9 add openssh support (#1621) * add openssh support * build training image only * revert dev config * update * update package version * udpate * revert dev config * [tensorflow] Add dockerfiles for TF2.8 (#1685) * add sagemaker dockerfiles * update developer config * update buildspec * fix typos * fix typo for python version * add smdebug * add sagemaker-tensorflow * add smdataparallel * remove tmp files * update test config * remove wrong ldlib path * update tensorflow-io version` * remove sagemaker-tensorflow til py39 pkg become available * remove sagemaker-tensorflow * add sagemaker-tensorflow * install 
sagemaker-tensorflow from source * install tfds * do not install tesnorflow-dataset in the tests as it was installed in the image * set datetime_tag to false * correct python version * update buildspec * pass arguments related to python to e3 and sagemaker stages as env vars * install smdebug from the tag * minor update for sagemaker-tensorflow installation * bug fix * Changes to config file * Make fix for cyrus CVE * Change configs file to disable safety_check_test * bump up requests * run benchmark without rebuild * run sagemaker rc tests * run efa tests * unistall tfds as it is installed in the image already * run rc tests * remove unused env vars * fix license * update buildspec to build sagemaker images only * Revert "update buildspec to build sagemaker images only" This reverts commit 908c89dcec178fe964346516cf12f52b6448868d. * skip safty checks * remove license from sagemaker stage * revert dlc_developer_config.toml * remove unused comments * skip test_keras_mirrored for TF2.7 * fix styling issues * add env var for TF version * comment out e3 and example images build Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com> Co-authored-by: Shantanu Tripathi <trshanta@amazon.com> Co-authored-by: Shantanu Tripathi <shantanutripathi237@gmail.com> * Update available images for TF2.7 SM and TF2.8 E3 (#1741) * [TensorFlow] bump up tensorflow to 2.6.3 (#1721) * [TensorFlow] add sagmaker dockerfiles for tensorflow_serving (#1689) * add sagmaker dockerfiles * build sagemaker image * pass build args as env variables to sagemaker stages, remove unused dockerfle Co-authored-by: arjkesh <33526713+arjkesh@users.noreply.github.com> Co-authored-by: Sai Parthasarathy Miduthuri <saimidu@amazon.com> * [tensorflow] [build] [test] TF2.8 SM image fix (#1748) * TF2.8 SM image fix * Rebuild images with new SMDebug tag released * Changed the smdebug versioning format * Removed additional code for skipping tests * Change buildspec and revert temp changes * Added newline at 
ends of buildspecs * [tensorflow][build][sagemaker] enhance gunicorn logging (#1750) * [autogluon][build] AutoGluon 0.3.1 container patching (#1734) Co-authored-by: Sergey Togulev <34056697+SergTogul@users.noreply.github.com> * [autogluon][build] AutoGluon 0.3.2 container (AG 0.3.1 with patched images) (#1752) * [release] Add TF 2.8 SM DLCs to release images (#1755) * [doc] Update available images (#1754) * [release] Add AG 0.3.2 images to release (#1757) * [Tensorflow][Test][benchmark][ec2] Invoke all the Habana benchmark tests using async execution (#1711) * Basic config for building images and running the tests * Added timeout for benchmark runs * Invoking async execution for the benchmark test * Added uuid to logs and increased loop time * Added background p…