Extend Gradtape ZCC to 2.1.2 #384

Merged
merged 1 commit into from
Oct 23, 2020

NihalHarish
Contributor

Description of changes:

  • GradientTape ZCC features were backported to AWS TF 2.1.2.
  • Updated the test to reflect this change.
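
As an illustration only, a version gate of the kind this test change implies could look like the sketch below. The names `parse_version`, `supports_gradtape_zcc`, and `MIN_GRADTAPE_ZCC_VERSION` are hypothetical, and treating 2.1.2 as a simple lower bound is an assumption; the PR only states that the features were backported to AWS TF 2.1.2.

```python
import re

# Hypothetical threshold: the PR only says the GradientTape ZCC features were
# backported to AWS TF 2.1.2; using it as a lower bound is an assumption.
MIN_GRADTAPE_ZCC_VERSION = (2, 1, 2)


def parse_version(version: str) -> tuple:
    """Return the leading major.minor[.patch] numbers as a tuple of ints."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    # A missing patch component is treated as 0.
    return tuple(int(part) if part is not None else 0 for part in match.groups())


def supports_gradtape_zcc(tf_version: str) -> bool:
    """Return True if the given TF version is at or above the assumed threshold."""
    return parse_version(tf_version) >= MIN_GRADTAPE_ZCC_VERSION
```

A test could then skip or adjust its expectations based on `supports_gradtape_zcc(tf.__version__)` rather than hard-coding a single release.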

Style and formatting:

I have run pre-commit install to ensure that auto-formatting happens with every commit.

Issue number, if available

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Contributor

@NRauschmayr NRauschmayr left a comment

changes look good to me

@codecov-io

codecov-io commented Oct 23, 2020

Codecov Report

Merging #384 into master will decrease coverage by 2.82%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master     #384      +/-   ##
==========================================
- Coverage   85.49%   82.66%   -2.83%     
==========================================
  Files          86       86              
  Lines        6514     6514              
==========================================
- Hits         5569     5385     -184     
- Misses        945     1129     +184     
| Impacted Files | Coverage Δ |
|---|---|
| smdebug/tensorflow/utils.py | 63.45% <0.00%> (-24.88%) ⬇️ |
| smdebug/tensorflow/singleton_utils.py | 83.33% <0.00%> (-16.67%) ⬇️ |
| smdebug/tensorflow/keras.py | 79.01% <0.00%> (-13.32%) ⬇️ |
| smdebug/tensorflow/callable_cache.py | 69.56% <0.00%> (-13.05%) ⬇️ |
| smdebug/tensorflow/collection.py | 84.53% <0.00%> (-11.35%) ⬇️ |
| smdebug/tensorflow/tensor_ref.py | 82.25% <0.00%> (-6.46%) ⬇️ |
| smdebug/tensorflow/base_hook.py | 76.19% <0.00%> (-4.37%) ⬇️ |
| smdebug/tensorflow/session.py | 88.46% <0.00%> (-3.37%) ⬇️ |
| smdebug/core/tfevent/event_file_writer.py | 93.75% <0.00%> (-2.50%) ⬇️ |
| smdebug/core/collection_manager.py | 90.24% <0.00%> (-2.44%) ⬇️ |
| ... and 6 more | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update ef3b6b2...28e0047.

@NihalHarish NihalHarish merged commit 6884ee9 into master Oct 23, 2020
@NihalHarish NihalHarish deleted the extend_grad_tape_zcc branch October 23, 2020 05:48
NihalHarish added a commit that referenced this pull request Oct 23, 2020
NihalHarish added a commit that referenced this pull request Oct 30, 2020
leleamol added a commit that referenced this pull request Dec 8, 2020
…low_datasets version (#114)

* Update sagemaker.md (#250)

* Bumping version to 0.9.0 (#251)

* Skip using standalone keras Py3.7+ (#253)

* Gradtape zcc (#252)

* Fix Incorrect Log Statement (#256)

* Incorrect number of tensors saved with MirroredStrategy (#257)

* Change Version to 0.8.1 (#258)

* Save Scalars With Mirrored Strategy (#259)

* skip flaky test (#262)

* Don't export to collections for all workers with unsupported distrib training (#263)

* version bump (#265)

* Avoiding Basehook object pickling (#266)

* handle eager tensors (#271)

* TF 2.x: Support for keras to estimator (#268)

* Revert "TF 2.x: Support for keras to estimator (#268)" (#273)

This reverts commit 749bded.

* Disable TB Testing  (#275)

* Support for TF 2 estimator (#274)

* Adding a TF2 Hvd example and test (#279)

* Moved end of training log from info to debug (#281)

#280

* Adding action class (#285)

* Adding action class
Actions added: stop training job, email, sms

* Fix buildspec used for PR CI (#287)

* Adding a test to check that PT model is saved without issues (#283)

* test that model can be pickled without issues

* Save Model Inputs, Model Outputs, Gradients, Custom Tensors, Layer Inputs, Layer Outputs (#282)

* Pin pytest version (#293)

* Load IRIS Dataset from S3 (#298)

* Load dataset from s3 (#299)

* remove problematic log (#300)

* Change Enum (#301)

* Doc update (#292)

* rename enum (#305)

* version bump to 0.9.1 (#304)

* modify asserts (#307)

* version compare (#306)

* Support TF 2.3 Tests (#312)

* Disable TB in ZCC for AWS TF 2.3.0 (#316)

* Update Assert Statements For New TF 2.2.0 DLC (#320)

* Version Bump (#319)

* add a note for TF 2.2 limited support (#303)


Co-authored-by: Miyoung Choi <cmiyoung@amazon.com>
Co-authored-by: Nihal Harish <nihal42harish@gmail.com>

* TF 2.2 documentation update  (#322)

* update TF 2.2 smdebug features
* Update code samples/notes for new pySDK and smdebug/add and fix links
* add 'New features' note
Co-authored-by: Miyoung Choi <cmiyoung@amazon.com>

* Adding pagination in list_training_jobs (#323)

* Adding pagination in list_Training_jobs

* Test Custom Step Usecase (#331)

* save tf2 model (#333)

* Add ability to only save shapes of tensors (#328)

* Revert "Add ability to only save shapes of tensors (#328)" (#337)

This reverts commit c9eb769.

* Function to Test If the hook has been configured with the Default hook config (#332)

* Default hook config (#338)

* version bump (#339)

* TF ZCC limitation footnote (#342)

* Ability to save shapes (#341)

* WIP saveshape

* Add shape writer

* Add pytorch test

* Add untested keras test

* fix syntax

* fix syntax

* Import

* Import

* Add tests for TF

* Simplify read code

* Add read API and tests

* Add mxnet test

* Add s3 and json tests

* lint

* Fix payload

* fix import

* Handle different num tensors for losses

* Fix exact equal condition

* Fix mode bug

* trigger CI

* Add support for distributed training with writer map

* Check that value throws exception

* Fix tests to make them more resilient

* Fix mxnet and pytorch tests

* Remove tensor names

* pre-commit

* Fix get_mode

* Fix bug with old index files

* Fix keras test with names of tensors

* Set original name to None if tf_obj is None

* Fix mirrored test for cpu

* Add docs

* trigger CI

* Fix shape writer get

* Simplify by removing shape writer

* Cleanup

* Fix name of writer

* Addressed review comments

* trigger ci

* retrigger CI

Co-authored-by: NihalHarish <nihal42harish@gmail.com>

* Support Inputs and Labels in the dict format (#345)

* 0.9.4 (#347)

* Refactor Make Numpy Array (#329)

* warn gradtape users  about tf.function support (#348)

* Support all tf types (#346)

* Model Subclassing Test (#351)

* Modify Should Save Tensor Test To Work on Any Version of TF (#352)

* framework version updates (#360)

* list training jobs improvements (#349)

* Previously, list training jobs always made 50 attempts, which could generate unnecessary traffic.
* If training jobs with the prefix are found, we break.
* If exceptions are caught more than 5 times, we break.
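
A minimal sketch of the early-exit loop described in the bullets above. `find_jobs_with_prefix` and `fetch_page` are hypothetical stand-ins for the real SageMaker listing code, which is not shown in this log; only the limits (50 attempts, more than 5 exceptions) come from the commit message.

```python
from typing import Callable, List

MAX_ATTEMPTS = 50    # upper bound on attempts, as stated in the commit message
MAX_EXCEPTIONS = 5   # stop once more than this many exceptions have been caught


def find_jobs_with_prefix(fetch_page: Callable[[int], List[str]], prefix: str) -> List[str]:
    """Poll `fetch_page` until matching jobs are found or a limit is reached.

    `fetch_page` is a hypothetical callable that takes the attempt number and
    returns a list of training job names; it may raise on transient errors.
    """
    exceptions_caught = 0
    for attempt in range(MAX_ATTEMPTS):
        try:
            names = fetch_page(attempt)
        except Exception:
            exceptions_caught += 1
            if exceptions_caught > MAX_EXCEPTIONS:
                break  # too many failures: stop retrying
            continue
        matches = [name for name in names if name.startswith(prefix)]
        if matches:
            return matches  # jobs found: no further requests needed
    return []  # always a list, even when nothing was found
```

Returning an empty list rather than `None` keeps the return type uniform for callers.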

* Handle Deprecation Of experimental_ref api (#356)

* check file exist before moving (#364)

* check file exist before moving when closing the file.

* Support Saving Tensors in Graph Mode with add_for_mode (#353)

* Change layer name logic (#357)

* Pass Variable Length Argument To Old Function Call (#366)

* test concat layers (#367)

* Update README.md (#371)

* Pinning the version of tensorflow_datasets package so that it does not require updating TF (#373)

Co-authored-by: NihalHarish <nihal42harish@gmail.com>

* Bugfix: Debugger breaks if should_save_tensor is called before collections are prepared (#372)

* Fixing the nightly build pipelines. Avoid force reinstall of rules package when not necessary (#374)

* returning list instead of dict keys (#376)

Fix the return value of _get_sm_tj_jobs_with_prefix. This function should always return a list.

* Add support for mixed precision training (#378)

* Modify Asserts to Work with TF 2.1.0 and TF 2.0.0 (#380)

* pytorch tmp (#382)

* extend zcc to 2.1.2 (#384)

* disable pytorch (#386)

* Removed the redundant installation of smdebug and smdebug-rules (#391)

* Incrementing the version to 0.9.5 (#396)

* pin tensorflow dataset in test config (#399)

* add back test

* revert some changes

* unpin pytest version

Co-authored-by: Nihal Harish <nihal42harish@gmail.com>
Co-authored-by: Vikas-kum <vikumar@amazon.com>
Co-authored-by: Vandana Kannan <vandanavk@users.noreply.github.com>
Co-authored-by: Anirudh <anirudhkrec@gmail.com>
Co-authored-by: Miyoung <myoung8739@gmail.com>
Co-authored-by: Miyoung Choi <cmiyoung@amazon.com>
Co-authored-by: Rahul Huilgol <huilgolr@amazon.com>
Co-authored-by: Amol Lele <19983848+leleamol@users.noreply.github.com>
leleamol added a commit that referenced this pull request Dec 8, 2020
* rotation policy

* fix tests

* fix write event call

* add comments in code

* add a test through hook

* fix rotation

* some fixes

* delete file if empty

* enable multi-process test

* fix multi-process test

* add pt distrib test

* Revert "add pt distrib test"

This reverts commit a8fc661.

* enable write to s3

* address some review comments

* address some more review comments

* cleanup

* some fixes

* make timestamp mandatory
* filename timestamp matches 1st event

* more cleanup and fixes

* consolidate classes
* timestamp in UTC

* address review comments

* edit base_start_time

* remove delete if empty

* default queue size and flush secs

* Add timestamp test

* add abs and rel timestamp in record

* save default values to constants file

* Cached the names of parsed files to avoid parsing them every time.

* address review comments

* lazy file creation
* drop events if file creation fails
* rename file to event end ts

* correct s3 bucket name

* test timestamp with file rotation

check that the timestamps of all events in a file are less than the timestamp in the file name

* remove ref to s3

* remove changes to s3.py

* add checks for healthy writer

* test file open failure

* Cleanup hook

* Added the buffer for looking up trace file, removed the get_events_at_time function, updated the implementation of get_events to return the active events

* make timestamp mandatory everywhere

* fix mxnet test

* Corrected the multiplier for microseconds

* remove flush_secs

* Updating the tests directory with new file format.

* Simplify class structure

* save base_start_time in record

* Updated the test directories to the updated YYYYMMDDHR format

* init env variables once

* Renamed the function and added function comments

* address some review comments

* cleanup

* Fixed the trace file look for start and end time events

* Truncating the trace files and updating the test file.

* fix pt test

* fallback node ID

* Removed the functionality to cap the upper_bound_timestamp

* Optimize the refreshing the file list based on the last available timestamp in the datasource viz. local or S3

* Correctly named the file suffix. Truncated the horovod timeline file

* Added the functionality to download the S3 files in parallel

* Addressed the review comments

* address review comments

* Trace events writer - part 2 (#6)

* ensure there's a dir for the new file
* add .tmp
* handle the case when events are far apart
* fix a mistake in cur_hour
* updated last_file_close_time to now

Co-authored-by: Vikas-kum <dev.vikas94@gmail.com>

* Record step duration in keras hook (#8)

* add step duration to keras hook

Co-authored-by: Vikas-kum <dev.vikas94@gmail.com>

* test TF step time with timeline writer (#9)

* Read node ID from Resource config (#10)

* read host ID from resource config

* use timeline writer directly (#11)

* Added functionality to record node_id in the events (#7)

* Added functionality to record node_id in the events

* Added the test to verify node id from file

* Moved the functions to extract node id and timestamp to utils directory.

* Add profiler config parser (#12)

* Timeline file name timestamp in us (#15)

* file timestamp in us

* Add comprehensive tests for detailed profiler config (#18)

* adding comprehensive tests

* refactoring fixtures

* renaming vars

* remove imports

* remove extraneous fixture

* PR changes

* documenting test cases

* documenting test cases

* refactoring fixtures

* Supporting efficiently downloading s3 files for distributed training (#14)

* Supporting efficiently downloading s3 files for distributed training

* updated op_name and args when recording step duration (#17)

* fixes for right directory name (#20)

* Fix folder name (#21)

* fixes
* change all variables to microsecs

* Updating the files to fix the pre-commit failures (#23)

* Change invalid file path (#25)

* change invalid file path

* fix other precommit errors

* Add error handling for parsing profiler config (#27)

* Fixing the tests for CI (#28)

* Fixing the tests for CI

* fix out_dir bug

Co-authored-by: Neelesh Dodda <ndodda@amazon.com>

* Default path for profiler has changed (#29)

* Update and correct some documentation (#30)

* Enabling TF profiler in smdebug (#5)

* Enabling TF profiler in smdebug
Co-authored-by: Neelesh Dodda <ndodda@amazon.com>

* change variable name and folder path (#35)

* change variable name and folder path

* add tests to check rotation policy

* Add ProfilerSystemMetricFileParser and basic tests (#16)

* Add ProfilerSystemMetricFileParser and basic tests
* Refactor MetricsReaderBase class
* Fix timestamp to event files mapping for both MetricsReader and SystemMetricsReader
* rename MetricsReader to AlgorithmMetricsReader

* refactoring. Providing a way to avoid cache and hence going OOM (#38)

* refactoring. Providing a way to avoid cache and hence going OOM
* modifying test cases to have use_in_memory_cache param

* Time annotations in PyTorch hook (#13)

* modified pytorch hook to record time annotations
Co-authored-by: Vikas Kumar <vikumar@amazon.com>

* Pulling in changes from smdebug repo to private (#39)

* latest commit from smdebug repo master is 
* Disable TB Testing  (#275) with commit id b8661de
Co-authored-by: Nihal Harish <nihal42harish@gmail.com>
Co-authored-by: Vikas-kum <vikumar@amazon.com>

* Reorganizing the profiler tests for PR CI build (#41)

* Organized the profiler tests.

* Updated the tests.sh for PR CI build

* Updated the tests.sh for PR CI build

* profiler dashboards (#4)

* add files for profiler dashboards
* updated dashboards to use timeline reader
* fixed bug 2,5,6,7,9,10 from bugbash
* fixed bug 1,3,4,8,16,17,19 from bugbash
* linked x-axis of timeline charts

* Creating a generic profiler dashboard & report (#42)

* Creating a generic profiler dashboard which can take a training job name and region
and execute the notebook.

* review comments

* Updated notebooks and added Pandas functionalities (#43) (#44)

* updated notebook and added Pandas functionalities
* minor fixes in profiler_generic_dashboard.ipynb

Co-authored-by: Nathalie Rauschmayr <n.rauschmayr@gmail.com>

* Enable file rotation for Horovod trace file (#33)

* Hvd file reader and rotation of files

Co-authored-by: Anirudh <anirudhkrec@gmail.com>

* Pytorch profiler new (#40)

* adding profiling info to pytorch hook

* more changes

* capturing forward and backward time from within pytorch hook
Note that hook provides backward end time, so backward start time
is approximated to end of last forward/backward or now
So, forward times and backward end times should be accurate while
backward start time is approximated.
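
The approximation described above can be sketched as follows. This is a hypothetical helper, not the smdebug hook: `PassTimer` and its method names are invented for illustration, and it assumes forward passes report both a start and an end while backward passes report only an end.

```python
import time
from typing import Callable, Optional, Tuple

Event = Tuple[str, float, float]  # (phase, start, end)


class PassTimer:
    """Hypothetical timer that approximates backward start times."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._forward_start: Optional[float] = None
        self._last_end: Optional[float] = None  # end of the last forward/backward

    def on_forward_start(self) -> None:
        self._forward_start = self._clock()

    def on_forward_end(self) -> Event:
        end = self._clock()
        start = self._forward_start if self._forward_start is not None else end
        self._last_end = end
        return ("forward", start, end)

    def on_backward_end(self) -> Event:
        end = self._clock()
        # Only the end is observed; approximate the start as the end of the
        # previous forward/backward event, or "now" if there is none yet.
        start = self._last_end if self._last_end is not None else end
        self._last_end = end
        return ("backward", start, end)
```

With this scheme, forward times and backward end times are measured directly, while each backward start is only as accurate as the assumption that the backward pass began right after the previous event finished.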

* removed print statements

* ran pre-commit and removed some log statements

* pre commit run

* Fixed the assert

* Temporarily skipping the test on codebuild projects where pytorch is not installed.

* Temporarily skipping the test on codebuild projects where pytorch is not installed.

* Temporarily skipping the test on codebuild projects where pytorch is not installed.

* Temporarily skipping the test on codebuild projects where pytorch is not installed.

* Temporarily skipping the test on codebuild projects where pytorch is not installed.

* reverted the temporary changes

* Fixed the assert

* Fixing the CI test failure

* Fixed the code to include the last layer

* Updated the tests and refactored the TraceEvent class.

* Converted the rnn test to pytest variant

* Fixed the assert for passing CI

Co-authored-by: Vikas-Kum <dev.vikas94@gmail.com>
Co-authored-by: Vikas Kumar <vikumar@amazon.com>

* Python profiler (#36)


Co-authored-by: Neelesh Dodda <ndodda@amazon.com>

* Changes to horovod file parser (#46)

* TF2 profiler tests (#48)

* test detailed step/time based profiling

* Bug fixes for autograd profiler in Pytorch hook. (#50)

* fixed pytorch hook

* fixed merge conflict

* fixed bug in hook

* Adding action class (#285) (#54)

* Adding action class
Actions added: stop training job, email, sms

Co-authored-by: Vikas-kum <vikumar@amazon.com>

* Pull in changes from the sagemaker-debugger repository (#55)

* Pull in changes from the sagemaker-debugger repository

* Typecasting profiling parameters to int (#52)

* Refactor analysis utils (#57)

* Integration tests for profiler on sagemaker (#19)

scripts and infrastructure code

* Typecasting str profiling parameters to bool (#58)

* Typecasting str profiling parameters to bool

* Add pyinstrument for python profiling (#56)

* Make DetailedProfilingConfig a string in profiler config (#67)

* detailed profiling config now is string

* install tf_datasets (#66)

* Convert profiler data to pandas frame (#47)

* add class to convert profiler data to pandas frame

* fixed local reader

* add notebook for pandas queries

* added code to find workload balancing issues in multi GPU training

* Adding more checks to integration tests (#73)

* pytorch Added step event, mode and more details to detailed profiling (#78)

* Added step event, mode and more details to detailed profiling
* Changing op name string
* Making op_name equivalent to TF
* changing step num to mode_step
* Adding phase to autograd events

* Change timeline node_id for distributed workers (#80)

* change timeline node_id for distributed workers

* Add integration tests for detailed profiling and python profiling (#71)

* Fixing a bug where step num was not correctly used when enabling
detailed profiling
Dumping the torch autograd profiler every step. If there are multiple steps
then data builds up and can cause gpu memory build up.

* Feature to profile for different step phases
2.Capturing profiling step phases for pytorch
3.Fix bug with path string which was always having cprofile in path
even if pyinstrument profiler is used

* Fix pre-commit

* Fix call to stats_filename

* Fixing PythonStepStats

* auto commit

* fix

* Fix

* fix

* pre commit fix

* fix bug

* removed code

* make profiling parameters case insensitive

* docstring for case insensitive config

* precommit

* push profiler images to alpha and get tag from environment variable

* push profiler images to alpha and get tag from environment variable

* Add height param to HeatMap

* specify registry ID as env variable, alpha by default

* Some cleanup, adding total time in cprofile

* Refactored metricsHistogram and stepHistogram and made more modular

* separate usepyinstrument

* Fixes for metrics histogram

* Fixing StepHistogram

* replacing print with logger

* refactoring

* changes in detailed profiling

* remove imports

* notebook fixes and histogram class fixes

* Adding wheel file

* running pre-commit

* fix tests

* Adding unique thread id , pid, for trace event parser
In every event added event_phase, node_id

* pre-commit

* fixing notebook and other changes

* fix check for event_Args None

* Changing notebook

* upload files to s3 during test

* minor fix

* create new s3 folder for stats

* fix syntax errors

* Some cleanup

* Fix int typecast for rotatemaxfilesizebytes (#19)

Co-authored-by: Vikas-kum <vikumar@amazon.com>

* Pull in smdebug 145d43b (#38)

* Pull in latest smdebug (0.9.1) (upto commit 145d43b)
* Reverting the change to GET_OBJECTS_MULTIPROCESSING_THRESHOLD in #14.

* Adding metadata file for TF Profiler parser to include start time (#4)

* TF profiler event parser
* fix can_start_prof bug
* populate start time
* handle tf trace json in reader
* separate file for metadata

* Reorder the writing of events so that events get correctly written according to their end timestamp. (#39)

Co-authored-by: Vikas-kum <vikumar@amazon.com>

* Enable profiling between steps for tensorflow (#2)

* Dump HTML for each pyinstrument stats file (#16)

* output html in python profiler
* dump output html for pyinstrument

* Add higher level analysis functions for cProfile python profiling (#6)

* Updated preview notebooks  (#8)

* Valid trace file check (#41)

* fix valid trace file check
* change log level

* Adding analysis utils and updating the analysis notebook (#9)

* add pandas analysis utils
* update profiler analysis notebook (#32)
* Updated analysis utils (#34)
* add python profiling to notebook (untested)

Co-authored-by: NRauschmayr <n.rauschmayr@gmail.com>
Co-authored-by: Neelesh Dodda <ndodda@amazon.com>

* check record end time similar to c++ writer (#45)

* remove flakiness offset from sm tests (#43)

* Add example notebook fixes for python profiling (#46)

* Refactored profiler dashboards  (#42)

* refactored dashboards to plot new system metrics

* updated step timeline chart to plot train/eval/global step

* bugfixes for analysis notebook (#44)

* Bugfixes in analysis and notebooks (#49)

* Followup to the PR on analysis utils (#50)

* Prevent metrics reader from reading invalid files (#52)

* Modify horovod tests to generate check for horovod timeline (#51)

* Bugfixes  (#57)

* fix for dashboards

* Add timeline image for bottlenecks notebook (#59)

* Error handling for pyinstrument (#58)

* Enable/disable python profiling after forward pass of pytorch hook instead of backward pass (#56)

* Pytorch integration tests (#33)

* Enabling integration tests for pytorch

* Fixed the job index for codebuild project.

* Fixed the job index for codebuild project.

* Fixing the codebuild project to install smdebugger in docker

* Fixing codebuild project

* Adding cpu jobs

* Adjusted the parameters for cpu jobs

* PyTorch detailed profiler traces are not present in detailed_profiling directory.

* Fixing the test yml file.

* Fixing the test yml file.

* Removed commented code.

* Added test configuration for absent profiler.

* Preloading the cifar10 dataset into source directory.

* Enabled the assert for checking the timestamp

* adjusted the tracefile counts

* Fixed the job names, added tests for cprofile

* Updated the job configs

* Adjusted the expected trace file count.

* Changed the order in which the trace events are written

* Reduced the batch size for cpu tests.

* Reduced the batch size for cpu tests.

* Fixed the imports

* Added capability to handle html file.

* Adding horovod tests for integration

* Adding horovod tests for integration

* Fixed the assert for horovod trace file count

* Valid trace file check (#41)

* fix valid trace file check
* change log level

* Fixed the expected count of stats and trace files.

* Fixed the profiler config name UsePyinstrument

* Preloading mnist dataset to avoid downloading it from internet during training.

* Bugfixes in analysis and notebooks (#49)

* Added test scenario to test the file rotations.

* Adding more test scenarios

* Adding integration test for distributed training using distributed api

* Adding horovod training with resnet50 and cifar10

* Fixing the launcher script for resnet50 with horovod.

* Increased the batch size

* Supporting res50 and cifar with horovod.

* Fixed the validation for horovod tracefiles.

* Update tests/sagemaker/test_profiler_pytorch.py

Co-authored-by: Anirudh <anirudhkrec@gmail.com>

* Scheduling sagemaker jobs in parallel.

* Fixed the config file path.

Co-authored-by: Vandana Kannan <vandanavk@users.noreply.github.com>
Co-authored-by: Nathalie Rauschmayr <n.rauschmayr@gmail.com>
Co-authored-by: Anirudh <anirudhkrec@gmail.com>

* Fix buildspec yaml file for TF integration tests (#66)

* Merge latest changes from smdebug to smprofiler (#68)

* Updating analysis utils (#63)

* Modify step stats util to compute stats for multiproc data
* Modify utils to handle multi-node data
* Modify notebook utils to handle multi-node data

Co-authored-by: Neelesh Dodda <ndodda@amazon.com>

* Merge timeline for framework events (#5)

* Fixing the CI failure caused by awscli (#72)

* Add metrics config (#67)

* Add API functions to python profiling analysis for correlation with framework metrics (#53)

* Dataloader analysis for PyTorch (#64)

* Adding the functions to get the dataloader events for pytorch

* Adding the training script and notebook for dataloader analysis

* Fixed the time conversion from timestamp to UTC and fixed the local reader for system trace files.

* Updating the dataloader analysis notebook

* Updated the notebook with analysis for batch processing.

* Updated notebook to display python profiler stats.

* Updated the notebook with documentation and layout

* Updated the notebook to have static contents

* Updating the notebook to handle absence of traceevents

* Fixed the trace events as per the current format and added notebook for triggering the pytorch training jobs

* Moved the analysis functions from notebook to a class

* Updated the utility functions to retrieve the dataloader events

* Added the test scripts for horovod and distributed training

* Adding a script that uses dummy custom dataloader

* Addressed the review comments

* Updated the utility code and added a training script that uses custom datasets

* Added hyperparameters for custom dataset training.

* Fix TF event file decompression issue (#73)

* Fix bugs in keras hook (#75)

* Reorder events in pytorch hook (#60)

* Refactor metrics config (#76)

* Perf benchmark (#31)

* Fix for hvd reader issue and one more change (#74)

* Fixing the batch time analysis in interactive notebook to not generate incorrect plot (#81)

* Fixing the computation of batchtime

* Fixing the computation of batchtime

* retrigger CI

* Attempting to fix PR CI

* Attempting to fix PR CI for PyTorch

* Attempting to fix PR CI for PyTorch

* Merge timeline fixes (#82)

* Merge timeline fixes
1) putting the node_ids as threads.
2) Providing right sort order for processes and threads
3) Fixing bugs

* add check if gpu is available (#62)

Co-authored-by: Vikas-kum <vikumar@amazon.com>

* Performance benchmarking for PyTorch (#78)

* Pytorch performance tests

* Fixed the estimator

* Fixed the training script for correct metrics generation

* Added train duration metrics in the training script

* Adjusted the alarm values

* Adjusted the alarm values

* Fixed the job name for no smdebug and no profiler

* Optimized the training script and added comments in the driver script.

* Updated the scripts for framework only training job

* Removed the unnecessary code.

* Updating the instance types.

* Notebook for interactive analysis (#69)

* Notebook for interactive analysis

* add python profiling to interactive analysis notebook

* Updated the interactive notebook with dataloader analysis for pytorch

* updated the utility functions to retrieve the dataloader events

* some changes to the nb

* some fixes to the nb

* fixes

* reset index

* editing nb content

* fixes

* nit fix

* fixes after metricsconfig

* update notebooks

* add updated job notebooks

* updated notebooks for bug bash

* update TF notebook

* rename notebooks

* rename notebooks

* updating notebooks with feedback

* Renamed Profiler to EagleEye

* minor edits

* scripts

* fix

* Updated the interactive analysis notebook with minor fix.

* Updated the instance type for rules to ml.m5.8xlarge

* Updated the rules instances to ml.r5.4xlarge

* miyoung's changes

Co-authored-by: Neelesh Dodda <ndodda@amazon.com>
Co-authored-by: Amol Lele <19983848+leleamol@users.noreply.github.com>
Co-authored-by: Anirudh <anirudhkrec@gmail.com>

* Fixed the metrics names to have correct instance names. (#88)

* Added empty name in an event during merge_timeline if it is missing (#87)

* Add an empty name only for Horovod and Herring events if name is missing for E events.

* Add ProfilerTrial class and profiler builtin rules  (#54)

* add files for gpu usage rule
* adding rule to detect cpu bottlenecks
* add rule to detect outliers in step duration
* added node id to rule analysis
* add rule for checking gpu memory increase
* added rules for batch size and max initialization time
* add rule to detect load balancing issues in multi GPU training
* add dockerfiles to build rule container
* applying changes from https://github.com/awslabs/sagemaker-profiler/commit/57dfe2bd960ae798610b6ff52f661a4f5475eded fixed output directory and label legends

Co-authored-by: Vandana Kannan <vandana268@gmail.com>
Co-authored-by: Vikas Kumar <dev.vikas94@gmail.com>

* Fixing the writing of first event in the tracefile that stores the start time from epoch (#85)

* Fixing the writing of first event in the tracefile.

* Added the master table to ensure that we always write the metaevent in the new traceevent file.

* Fixing bugs in KerasHook and profiler utils (#89)

* Change smdebug version in notebooks (#90)

* change smdebug version
* rename tf_python_stats_dir to python_stats_dir

Co-authored-by: Neelesh Dodda <ndodda@amazon.com>

* Dynamic ON/OFF Herring timeline for PyTorch framework (#80)

* Fix pytest version (#91)

* support mixed precision training (#96)

* merging sys metrics and bottlenecks in the timeline (#93)

* merging sys metrics and bottlenecks in the timeline

* Fix hvd failures and add native TF training in TF integration tests (#97)

* Reading rule stop signal file and stopping the rule if gracetime has … (#98)

* Reading rule stop signal file and stopping the rule if gracetime(60s) has passed

* [Sync] Sync smdebug with sagemaker-debugger master branch (#95)

Co-authored-by: Vikas-kum <vikumar@amazon.com>
Co-authored-by: Vandana Kannan <vandanavk@users.noreply.github.com>
Co-authored-by: Anirudh <anirudhkrec@gmail.com>
Co-authored-by: Miyoung <myoung8739@gmail.com>
Co-authored-by: Miyoung Choi <cmiyoung@amazon.com>
Co-authored-by: Rahul Huilgol <huilgolr@amazon.com>
Co-authored-by: Amol Lele <19983848+leleamol@users.noreply.github.com>

* add rule for framework metrics  (#100)

* add rule for framework metrics overview

* update report

* replaced matplotlib figures with bokeh charts

* fix pre-commit error

* minor fixes in report notebook

Co-authored-by: Connor Goggins <cgoggins0@gmail.com>

* Update Profiler Trial and Rules to Generate Report on Every Invoke (#102)

* [TRSL-1037] Emit RuleEvaluationConditionMet from ProfilerReport Rule (#105)

* [TRSL-1037] Emit RuleEvaluationConditionMet from ProfilerReport Rule

Update the ProfilerReport rule to emit RuleEvaluationConditionMet if any subrule
has its rule evaluation condition met.

* Update to emit RuleEvaluationConditionMet at the end of job

* Fix comment

* add unit test for ProfilerReport

* remove scanel_interval passed in

* Update unit tests

* Fix incorrect comment on last step.

* Update log message.

* Sync with sagemaker-debugger master branch and fix issue with tensorflow_datasets version (#114)

* Update sagemaker.md (#250)

* Bumping version to 0.9.0 (#251)

* Skip using standalone keras Py3.7+ (#253)

* Gradtape zcc (#252)

* Fix Incorrect Log Statement (#256)

* Incorrect number of tensors saved with MirroredStrategy (#257)

* Change Version to 0.8.1 (#258)

* Save Scalars With Mirrored Strategy (#259)

* skip flaky test (#262)

* Don't export to collections for all workers with unsupported distrib training (#263)

* version bump (#265)

* Avoiding Basehook object pickling (#266)

* handle eager tensors (#271)

* TF 2.x: Support for keras to estimator (#268)

* Revert "TF 2.x: Support for keras to estimator (#268)" (#273)

This reverts commit 749bded.

* Disable TB Testing  (#275)

* Support for TF 2 estimator (#274)

* Adding a TF2 Hvd example and test (#279)

* Moved end of training log from info to debug (#281)

#280

* Adding action class (#285)

* Adding action class
Actions added: stop training job, email, SMS

* Fix buildspec used for PR CI (#287)

* Adding a test to check that PT model is saved without issues (#283)

* test that model can be pickled without issues

* Save Model Inputs, Model Outputs, Gradients, Custom Tensors, Layer Inputs, Layer Outputs (#282)

* Pin pytest version (#293)

* Load IRIS Dataset from S3 (#298)

* Load dataset from s3 (#299)

* remove problematic log (#300)

* Change Enum (#301)

* Doc update (#292)

* rename enum (#305)

* version bump to 0.9.1 (#304)

* modify asserts (#307)

* version compare (#306)

* Support TF 2.3 Tests (#312)

* Disable TB in ZCC for AWS TF 2.3.0 (#316)

* Update Assert Statements For New TF 2.2.0 DLC (#320)

* Version Bump (#319)

* add a note for TF 2.2 limited support (#303)


Co-authored-by: Miyoung Choi <cmiyoung@amazon.com>
Co-authored-by: Nihal Harish <nihal42harish@gmail.com>

* TF 2.2 documentation update  (#322)

* update TF 2.2 smdebug features
* Update code samples/notes for new pySDK and smdebug/add and fix links
* add 'New features' note
Co-authored-by: Miyoung Choi <cmiyoung@amazon.com>

* Adding pagination in list_training_jobs (#323)

* Adding pagination in list_training_jobs

* Test Custom Step Usecase (#331)

* save tf2 model (#333)

* Add ability to only save shapes of tensors (#328)

* Revert "Add ability to only save shapes of tensors (#328)" (#337)

This reverts commit c9eb769.

* Function to Test If the hook has been configured with the Default hook config (#332)

* Default hook config (#338)

* version bump (#339)

* TF ZCC limitation footnote (#342)

* Ability to save shapes (#341)

* WIP saveshape

* Add shape writer

* Add pytorch test

* Add untested keras test

* fix syntax

* fix syntax

* Import

* Import

* Add tests for TF

* Simplify read code

* Add read API and tests

* Add mxnet test

* Add s3 and json tests

* lint

* Fix payload

* fix import

* Handle different num tensors for losses

* Fix exact equal condition

* Fix mode bug

* trigger CI

* Add support for distributed training with writer map

* Check that value throws exception

* Fix tests to make them more resilient

* Fix mxnet and pytorch tests

* Remove tensor names

* pre-commit

* Fix get_mode

* Fix bug with old index files

* Fix keras test with names of tensors

* Set original name to None if tf_obj is None

* Fix mirrored test for cpu

* Add docs

* trigger CI

* Fix shape writer get

* Simplify by removing shape writer

* Cleanup

* Fix name of writer

* Addressed review comments

* trigger ci

* retrigger CI

Co-authored-by: NihalHarish <nihal42harish@gmail.com>

* Support Inputs and Labels in the dict format (#345)

* 0.9.4 (#347)

* Refactor Make Numpy Array (#329)

* warn gradtape users  about tf.function support (#348)

* Support all tf types (#346)

* Model Subclassing Test (#351)

* Modify Should Save Tensor Test To Work on Any Version of TF (#352)

* framework version updates (#360)

* list training jobs improvements (#349)

* Previously, listing training jobs always made 50 attempts regardless of the outcome, which generated unnecessary traffic.
* If training jobs with the prefix are found, we break.
* If exceptions are caught more than 5 times, we break.

* Handle Deprecation Of experimental_ref api (#356)

* check file exist before moving (#364)

* check that the file exists before moving it when closing the file.

* Support Saving Tensors in Graph Mode with add_for_mode (#353)

* Change layer name logic (#357)

* Pass Variable Length Argument To Old Function Call (#366)

* test concat layers (#367)

* Update README.md (#371)

* Pinning the version of tensorflow_datasets package so that it does not require updating TF (#373)

Co-authored-by: NihalHarish <nihal42harish@gmail.com>

* Bugfix: Debugger breaks if should_save_tensor is called before collections are prepared (#372)

* Fixing the nightly build pipelines. Avoid force reinstall of rules package when not necessary (#374)

* returning list instead of dict keys (#376)

Fix the return value of _get_sm_tj_jobs_with_prefix. This function should always return a list.

* Add support for mixed precision training (#378)

* Modify Asserts to Work with TF 2.1.0 and TF 2.0.0 (#380)

* pytorch tmp (#382)

* extend zcc to 2.1.2 (#384)

* disable pytorch (#386)

* Removed the redundant installation of smdebug and smdebug-rules (#391)

* Incrementing the version to 0.9.5 (#396)

* pin tensorflow dataset in test config (#399)

* add back test

* revert some changes

* unpin pytest version

Co-authored-by: Nihal Harish <nihal42harish@gmail.com>
Co-authored-by: Vikas-kum <vikumar@amazon.com>
Co-authored-by: Vandana Kannan <vandanavk@users.noreply.github.com>
Co-authored-by: Anirudh <anirudhkrec@gmail.com>
Co-authored-by: Miyoung <myoung8739@gmail.com>
Co-authored-by: Miyoung Choi <cmiyoung@amazon.com>
Co-authored-by: Rahul Huilgol <huilgolr@amazon.com>
Co-authored-by: Amol Lele <19983848+leleamol@users.noreply.github.com>

* Changing the Herring user-facing API (#110)

* [TRSL-998] Update Rule Test with Result Checking (#106)

* [TRSL-998] Update Rule Test with Result Checking

Update existing rule testing to assert against rule output. This will ensure
rules are tested against their report results, which should be deterministic through CI.

* Generate HTML Report at every ProfilerReport invoke (#112)

This change adds HTML report generation at the end of every invoke of ProfilerReport rule.

* Update RuleEvaluationConditionMet to indicate end of the rule (#115)

* fix: Remove the hard code notebook file path (#117)

* Run rules tests in CI (#116)

* Log fix memory issue fix (#113)

* Changed the Herring API and variable names (#118)

* Removing the functionality to attach the backward hook to the module (#125)

* Removing the functionality to attach the backward hook to the module

* Updated the number of traceevents as the backward hook is no longer registered.

* Herring TF2 Native Gradient Tape SMDebugger support (#122)

* Fix bug in base hook (#127)

* Minor bugfixes/changes in rules (#126)

* minor bugfixes for rules

* Updating batch size rule (#123)

* fix for batch size rule

* Dataloader rule (#108)

* added dataloader rule and updated profiler report

* Redesign TF dataloader metrics collection (#92)

* Update profiler config parser to match latest SDK changes (#120)

* Replaced herringsinglenode command with smddpsinglenode (#129)

* Updating the version for profiler GA release (#124)

* Updating the version for profiler GA release

* Trigger Build

* Trigger Build

* Trigger Build

* Fix paths in profiler report (#131)

* changed path in profiler report

* fixed env variable (#132)

* making info log to debug from trace event parser as it is very verbose (#134)

* Only do detailed profiling for supported TF versions. (#135)

* Update PT tests (#136)

* Fix bug in parser (#137)

* smdistributed.dataparallel should be invoked from mpi command (#138)

* smdistributed.dataparallel should be invoked from mpi command

* Added comments

* Bugfix: Invalid Worker (#139)

* smdistributed.dataparallel environment check (#140)

* smdistributed.dataparallel environment check

* addressed comments

* Modified check_smdataparallel_env logic

* Install rules packages in PR CI (#143)

* Removed the files and folders that are not required in the public repository

* Removed the integration tests.

* Fixed the pre-commit checks

Co-authored-by: Vandana Kannan <vandana268@gmail.com>
Co-authored-by: Vikas-kum <dev.vikas94@gmail.com>
Co-authored-by: Vandana Kannan <vandanavk@users.noreply.github.com>
Co-authored-by: Nathalie Rauschmayr <n.rauschmayr@gmail.com>
Co-authored-by: Neelesh Dodda <ndodda@amazon.com>
Co-authored-by: Rajan Singh <srajanku@amazon.com>
Co-authored-by: sife <sifei.li@hotmail.com>
Co-authored-by: Anirudh <anirudhkrec@gmail.com>
Co-authored-by: Vikas Kumar <vikumar@amazon.com>
Co-authored-by: Anirudh <aanirud@amazon.com>
Co-authored-by: Karan Jariwala <karankjariwala@gmail.com>
Co-authored-by: Nihal Harish <nihal42harish@gmail.com>
Co-authored-by: Miyoung <myoung8739@gmail.com>
Co-authored-by: Miyoung Choi <cmiyoung@amazon.com>
Co-authored-by: Rahul Huilgol <huilgolr@amazon.com>
Co-authored-by: Connor Goggins <cgoggins0@gmail.com>
Co-authored-by: JC-Gu <jiacheg@amazon.com>
NihalHarish added a commit that referenced this pull request Jan 25, 2021
* rotation policy

* fix tests

* fix write event call

* add comments in code

* add a test through hook

* fix rotation

* some fixes

* delete file if empty

* enable multi-process test

* fix multi-process test

* add pt distrib test

* Revert "add pt distrib test"

This reverts commit a8fc661.

* enable write to s3

* address some review comments

* address some more review comments

* cleanup

* some fixes

* make timestamp mandatory
* filename timestamp matches 1st event

* more cleanup and fixes

* consolidate classes
* timestamp in UTC

* address review comments

* edit base_start_time

* remove delete if empty

* default queue size and flush secs

* Add timestamp test

* add abs and rel timestamp in record

* save default values to constants file

* Cached the names of parsed files to avoid parsing them every time.

* address review comments

* lazy file creation
* drop events if file creation fails
* rename file to event end ts

* correct s3 bucket name

* test timestamp with file rotation

Check that the timestamps of all events in a file are less than the timestamp in the file name.

* remove ref to s3

* remove changes to s3.py

* add checks for healthy writer

* test file open failure

* Cleanup hook

* Added the buffer for looking up trace file, removed the get_events_at_time function, updated the implementation of get_events to return the active events

* make timestamp mandatory everywhere

* fix mxnet test

* Corrected the multiplier for microseconds

* remove flush_secs

* Updating the tests directory with new file format.

* Simplify class structure

* save base_start_time in record

* Updated the test directories to the updated YYYYMMDDHR format

* init env variables once

* Renamed the function and added function comments

* address some review comments

* cleanup

* Fixed the trace file lookup for start and end time events

* Truncating the trace files and updating the test file.

* fix pt test

* fallback node ID

* Removed the functionality to cap the upper_bound_timestamp

* Optimize refreshing of the file list based on the last available timestamp in the data source, viz. local or S3

* Correctly named the file suffix. Truncated the horovod timeline file

* Added the functionality to download the S3 files in parallel

* Addressed the review comments

* address review comments

* Trace events writer - part 2 (#6)

* ensure there's a dir for the new file
* add .tmp
* handle the case when events are far apart
* fix a mistake in cur_hour
* updated last_file_close_time to now

Co-authored-by: Vikas-kum <dev.vikas94@gmail.com>

* Record step duration in keras hook (#8)

* add step duration to keras hook

Co-authored-by: Vikas-kum <dev.vikas94@gmail.com>

* test TF step time with timeline writer (#9)

* Read node ID from Resource config (#10)

* read host ID from resource config

* use timeline writer directly (#11)

* Added functionality to record node_id in the events (#7)

* Added functionality to record node_id in the events

* Added the test to verify node id from file

* Moved the functions to extract node id and timestamp to utils directory.

* Add profiler config parser (#12)

* Timeline file name timestamp in us (#15)

* file timestamp in us

* Add comprehensive tests for detailed profiler config (#18)

* adding comprehensive tests

* refactoring fixtures

* renaming vars

* remove imports

* remove extraneous fixture

* PR changes

* documenting test cases

* documenting test cases

* refactoring fixtures

* Supporting efficiently downloading s3 files for distributed training (#14)

* Supporting efficiently downloading s3 files for distributed training

* updated op_name and args when recording step duration (#17)

* fixes for right directory name (#20)

* Fix folder name (#21)

* fixes
* change all variables to microsecs

* Updating the files to fix the pre-commit failures (#23)

* Change invalid file path (#25)

* change invalid file path

* fix other precommit errors

* Add error handling for parsing profiler config (#27)

* Fixing the tests for CI (#28)

* Fixing the tests for CI

* fix out_dir bug

Co-authored-by: Neelesh Dodda <ndodda@amazon.com>

* Default path for profiler has changed (#29)

* Update and correct some documentation (#30)

* Enabling TF profiler in smdebug (#5)

* Enabling TF profiler in smdebug
Co-authored-by: Neelesh Dodda <ndodda@amazon.com>

* change variable name and folder path (#35)

* change variable name and folder path

* add tests to check rotation policy

* Add ProfilerSystemMetricFileParser and basic tests (#16)

* Add ProfilerSystemMetricFileParser and basic tests
* Refactor MetricsReaderBase class
* Fix timestamp to event files mapping for both MetricsReader and SystemMetricsReader
* rename MetricsReader to AlgorithmMetricsReader

* refactoring. Providing a way to avoid cache and hence going OOM (#38)

* refactoring. Providing a way to avoid cache and hence going OOM
* modifying test cases to have use_in_memory_cache param

* Time annotations in PyTorch hook (#13)

* modified pytorch hook to record time annotations
Co-authored-by: Vikas Kumar <vikumar@amazon.com>

* Pulling in changes from smdebug repo to private (#39)

* latest commit from smdebug repo master is 
* Disable TB Testing  (#275) with commit id b8661de
Co-authored-by: Nihal Harish <nihal42harish@gmail.com>
Co-authored-by: Vikas-kum <vikumar@amazon.com>

* Reorganizing the profiler tests for PR CI build (#41)

* Organized the profiler tests.

* Updated the tests.sh for PR CI build

* Updated the tests.sh for PR CI build

* profiler dashboards (#4)

* add files for profiler dashboards
* updated dashboards to use timeline reader
* fixed bug 2,5,6,7,9,10 from bugbash
* fixed bug 1,3,4,8,16,17,19 from bugbash
* linked x-axis of timeline charts

* Creating a generic profiler dashboard & report (#42)

* Creating a generic profiler dashboard which can take a training job name and region
and execute the notebook.

* review comments

* Updated notebooks and added Pandas functionalities (#43) (#44)

* updated notebook and added Pandas functionalities
* minor fixes in profiler_generic_dashboard.ipynb

Co-authored-by: Nathalie Rauschmayr <n.rauschmayr@gmail.com>

* Enable file rotation for Horovod trace file (#33)

* Hvd file reader and rotation of files

Co-authored-by: Anirudh <anirudhkrec@gmail.com>

* Pytorch profiler new (#40)

* adding profiling info to pytorch hook

* more changes

* capturing forward and backward time from within pytorch hook
Note that hook provides backward end time, so backward start time
is approximated to end of last forward/backward or now
So, forward times and backward end times should be accurate while
backward start time is approximated.

* removed print statements

* ran pre-commit and removed some log statements

* pre commit run

* Fixed the assert

* Temporarily skipping the test on codebuild projects where pytorch is not installed.

* Temporarily skipping the test on codebuild projects where pytorch is not installed.

* Temporarily skipping the test on codebuild projects where pytorch is not installed.

* Temporarily skipping the test on codebuild projects where pytorch is not installed.

* Temporarily skipping the test on codebuild projects where pytorch is not installed.

* reverted the temporary changes

* Fixed the assert

* Fixing the CI test failure

* Fixed the code to include the last layer

* Updated the tests and refactored the TraceEvent class.

* Converted the rnn test to pytest variant

* Fixed the assert for passing CI

Co-authored-by: Vikas-Kum <dev.vikas94@gmail.com>
Co-authored-by: Vikas Kumar <vikumar@amazon.com>

* Python profiler (#36)


Co-authored-by: Neelesh Dodda <ndodda@amazon.com>

* Changes to horovod file parser (#46)

* TF2 profiler tests (#48)

* test detailed step/time based profiling

* Bug fixes for autograd profiler in Pytorch hook. (#50)

* fixed pytorch hook

* fixed merge conflict

* fixed bug in hook

* Adding action class (#285) (#54)

* Adding action class
Actions added: stop training job, email, SMS

Co-authored-by: Vikas-kum <vikumar@amazon.com>

* Pull in changes from the sagemaker-debugger repository (#55)

* Pull in changes from the sagemaker-debugger repository

* Typecasting profiling parameters to int (#52)

* Refactor analysis utils (#57)

* Integration tests for profiler on sagemaker (#19)

scripts and infrastructure code

* Typecasting str profiling parameters to bool (#58)

* Typecasting str profiling parameters to bool

* Add pyinstrument for python profiling (#56)

* Make DetailedProfilingConfig a string in profiler config (#67)

* detailed profiling config now is string

* install tf_datasets (#66)

* Convert profiler data to pandas frame (#47)

* add class to convert profiler data to pandas frame

* fixed local reader

* add notebook for pandas queries

* added code to find workload balancing issues in multi GPU training

* Adding more checks to integration tests (#73)

* pytorch Added step event, mode and more details to detailed profiling (#78)

* Added step event, mode and more details to detailed profiling
* Changing op name string
* Making op_name equivalent to TF
* changing step num to mode_step
* Adding phase to autograd events

* Change timeline node_id for distributed workers (#80)

* change timeline node_id for distributed workers

* Add integration tests for detailed profiling and python profiling (#71)

* Fixing a bug where step num was not correctly used when enabling
detailed profiling.
Dumping the torch autograd profiler every step: if data from multiple steps
accumulates, it can cause GPU memory build-up.

* Feature to profile for different step phases
2. Capturing profiling step phases for PyTorch
3. Fix bug where the path string always contained cprofile,
even when the pyinstrument profiler is used

* Fix pre-commit

* Fix call to stats_filename

* Fixing PythonStepStats

* auto commit

* fix

* Fix

* fix

* pre commit fix

* fix bug

* removed code

* make profiling parameters case insensitive

* docstring for case insensitive config

* precommit

* push profiler images to alpha and get tag from environment variable

* push profiler images to alpha and get tag from environment variable

* Add height param to HeatMap

* specify registry ID as env variable, alpha by default

* Some cleanup, adding total time in cprofile

* Refactored metricsHistogram and stepHistogram and made them more modular

* separate usepyinstrument

* Fixes for metrics histogram

* Fixing StepHistogram

* replacing print with logger

* refactoring

* changes in detailed profiling

* remove imports

* notebook fixes and histogram class fixes

* Adding wheel file

* running pre-commit

* fix tests

* Adding unique thread id and pid for trace event parser
Added event_phase and node_id to every event

* pre-commit

* fixing notebook and other changes

* fix check for event_Args None

* Changing notebook

* upload files to s3 during test

* minor fix

* create new s3 folder for stats

* fix syntax errors

* Some cleanup

* Fix int typecast for rotatemaxfilesizebytes (#19)

Co-authored-by: Vikas-kum <vikumar@amazon.com>

* Pull in smdebug 145d43b (#38)

* Pull in latest smdebug (0.9.1) (upto commit 145d43b)
* Reverting the change to GET_OBJECTS_MULTIPROCESSING_THRESHOLD in #14.

* Adding metadata file for TF Profiler parser to include start time (#4)

* TF profiler event parser
* fix can_start_prof bug
* populate start time
* handle tf trace json in reader
* separate file for metadata

* Reorder the writing of events so that events get correctly written according to their end timestamp. (#39)

Co-authored-by: Vikas-kum <vikumar@amazon.com>

* Enable profiling between steps for tensorflow (#2)

* Dump HTML for each pyinstrument stats file (#16)

* output html in python profiler
* dump output html for pyinstrument

* Add higher level analysis functions for cProfile python profiling (#6)

* Updated preview notebooks  (#8)

* Valid trace file check (#41)

* fix valid trace file check
* change log level

* Adding analysis utils and updating the analysis notebook (#9)

* add pandas analysis utils
* update profiler analysis notebook (#32)
* Updated analysis utils (#34)
* add python profiling to notebook (untested)

Co-authored-by: NRauschmayr <n.rauschmayr@gmail.com>
Co-authored-by: Neelesh Dodda <ndodda@amazon.com>

* check record end time similar to c++ writer (#45)

* remove flakiness offset from sm tests (#43)

* Add example notebook fixes for python profiling (#46)

* Refactored profiler dashboards  (#42)

* refactored dashboards to plot new system metrics

* updated step timeline chart to plot train/eval/global step

* bugfixes for analysis notebook (#44)

* Bugfixes in analysis and notebooks (#49)

* Followup to the PR on analysis utils (#50)

* Prevent metrics reader from reading invalid files (#52)

* Modify horovod tests to generate check for horovod timeline (#51)

* Bugfixes  (#57)

* fix for dashboards

* Add timeline image for bottlenecks notebook (#59)

* Error handling for pyinstrument (#58)

* Enable/disable python profiling after forward pass of pytorch hook instead of backward pass (#56)

* Pytorch integration tests (#33)

* Enabling integration tests for pytorch

* Fixed the job index for codebuild project.

* Fixed the job index for codebuild project.

* Fixing the codebuild project to install smdebugger in docker

* Fixing codebuild project

* Adding cpu jobs

* Adjusted the parameters for cpu jobs

* PyTorch detailed profiler traces are not present in detailed_profiling directory.

* Fixing the test yml file.

* Fixing the test yml file.

* Removed commented code.

* Added test configuration for absent profiler.

* Preloading the cifar10 dataset into source directory.

* Enabled the assert for checking the timestamp

* adjusted the tracefile counts

* Fixed the job names, added tests for cprofile

* Updated the job configs

* Adjusted the expected trace file count.

* Changed the order in which the trace events are written

* Reduced the batch size for cpu tests.

* Reduced the batch size for cpu tests.

* Fixed the imports

* Added capability to handle html file.

* Adding horovod tests for integration

* Adding horovod tests for integration

* Fixed the assert for horovod trace file count

* Valid trace file check (#41)

* fix valid trace file check
* change log level

* Fixed the expected count of stats and trace files.

* Fixed the profiler config name UsePyinstrument

* Preloading mnist dataset to avoid downloading it from internet during training.

* Bugfixes in analysis and notebooks (#49)

* Added test scenario to test the file rotations.

* Adding more test scenarios

* Adding integration test for distributed training using distributed api

* Adding horovod training with resnet50 and cifar10

* Fixing the launcher script for resnet50 with horovod.

* Increased the batch size

* Supporting res50 and cifar with horovod.

* Fixed the validation for horovod tracefiles.

* Update tests/sagemaker/test_profiler_pytorch.py

Co-authored-by: Anirudh <anirudhkrec@gmail.com>

* Scheduling sagemaker jobs in parallel.

* Fixed the config file path.

Co-authored-by: Vandana Kannan <vandanavk@users.noreply.github.com>
Co-authored-by: Nathalie Rauschmayr <n.rauschmayr@gmail.com>
Co-authored-by: Anirudh <anirudhkrec@gmail.com>

* Fix buildspec yaml file for TF integration tests (#66)

* Merge latest changes from smdebug to smprofiler (#68)

* Updating analysis utils (#63)

* Modify step stats util to compute stats for multiproc data
* Modify utils to handle multi-node data
* Modify notebook utils to handle multi-node data

Co-authored-by: Neelesh Dodda <ndodda@amazon.com>

* Merge timeline for framework events (#5)

* Fixing the CI failure caused by awscli (#72)

* Add metrics config (#67)

* Add API functions to python profiling analysis for correlation with framework metrics (#53)

* Dataloader analysis for PyTorch (#64)

* Adding the functions to get the dataloader events for pytorch

* Adding the training script and notebook for dataloader analysis

* Fixed the time conversion from timestamp to UTC and fixed the local reader for system tracefiles.

* Updating the dataloader analysis notebook

* Updated the notebook with analysis for batch processing.

* Updated notebook to display python profiler stats.

* Updated the notebook with documentation and layout

* Updated the notebook to have static contents

* Updating the notebook to handle absence of traceevents

* Fixed the traceevents as per the current format and added notebook for triggering the pytorch training jobs

* Moved the analysis functions from notebook to a class

* Updated the utility functions to retrieve the dataloader events

* Added the test scripts for horovod and distributed training

* Adding a script that uses dummy custom dataloader

* Addressed the review comments

* Updated the utility code and added a training script that uses custom datasets

* Added hyperparameters for custom dataset training.

* Fix TF event file decompression issue (#73)

* Fix bugs in keras hook (#75)

* Reorder events in pytorch hook (#60)

* Refactor metrics config (#76)

* Perf benchmark (#31)

* Fix for hvd reader issue and one more change (#74)

* Fixing the batch time analysis in interactive notebook to not generate incorrect plot (#81)

* Fixing the computation of batchtime

* Fixing the computation of batchtime

* retrigger CI

* Attempting to fix PR CI

* Attempting to fix PR CI for PyTorch

* Attempting to fix PR CI for PyTorch

* Merge timeline fixes (#82)

* Merge timeline fixes
1) putting the node_ids as threads.
2) Providing right sort order for processes and threads
3) Fixing bugs

* add check if gpu is available (#62)

Co-authored-by: Vikas-kum <vikumar@amazon.com>

* Performance benchmarking for PyTorch (#78)

* Pytorch performance tests

* Fixed the estimator

* Fixed the training script for correct metrics generation

* Added train duration metrics in the training script

* Adjusted the alarm values

* Adjusted the alarm values

* Fixed the job name for no smdebug and no profiler

* Optimized the training script and added comments in the driver script.

* Updated the scripts for framework only training job

* Removed the unnecessary code.

* Updating the instance types.

* Notebook for interactive analysis (#69)

* Notebook for interactive analysis

* add python profiling to interactive analysis notebook

* Updated the interactive notebook with dataloader analysis for pytorch

* updated the utility functions to retrieve the dataloader events

* some changes to the nb

* some fixes to the nb

* fixes

* reset index

* editing nb content

* fixes

* nit fix

* fixes after metricsconfig

* update notebooks

* add updated job notebooks

* updated notebooks for bug bash

* update TF notebook

* rename notebooks

* rename notebooks

* updating notebooks with feedback

* Renamed Profiler to EagleEye

* minor edits

* scripts

* fix

* Updated the interactive analysis notebook with minor fix.

* Updated the instance type for rules to ml.m5.8xlarge

* Updated the rules instances to ml.r5.4xlarge

* miyoung's changes

Co-authored-by: Neelesh Dodda <ndodda@amazon.com>
Co-authored-by: Amol Lele <19983848+leleamol@users.noreply.github.com>
Co-authored-by: Anirudh <anirudhkrec@gmail.com>

* Fixed the metrics names to have correct instance names. (#88)

* Added empty name in an event during merge_timeline if it is missing (#87)

* Add an empty name only for Horovod and Herring events if name is missing for E events.

* Add ProfilerTrial class and profiler builtin rules  (#54)

* add files for gpu usage rule
* adding rule to detect cpu bottlenecks
* add rule to detect outliers in step duration
* added node id to rule analysis
* add rule for checking gpu memory increase
* added rules for batch size and max initialization time
* add rule to detect load balancing issues in multi GPU training
* add dockerfiles to build rule container
* applying changes from https://github.com/awslabs/sagemaker-profiler/commit/57dfe2bd960ae798610b6ff52f661a4f5475eded fixed output directory and label legends

Co-authored-by: Vandana Kannan <vandana268@gmail.com>
Co-authored-by: Vikas Kumar <dev.vikas94@gmail.com>

* Fixing the writing of first event in the tracefile that stores the start time from epoch (#85)

* Fixing the writing of first event in the tracefile.

* Added the master table to ensure that we always write the metaevent in the new traceevent file.

* Fixing bugs in KerasHook and profiler utils (#89)

* Change smdebug version in notebooks (#90)

* change smdebug version
* rename tf_python_stats_dir to python_stats_dir

Co-authored-by: Neelesh Dodda <ndodda@amazon.com>

* Dynamic ON/OFF Herring timeline for PyTorch framework (#80)

* Fix pytest version (#91)

* support mixed precision training (#96)

* merging sys metrics and bottlenecks in the timeline (#93)

* merging sys metrics and bottlenecks in the timeline

* Fix hvd failures and add native TF training in TF integration tests (#97)

* Reading rule stop signal file and stopping the rule if gracetime has … (#98)

* Reading rule stop signal file and stopping the rule if gracetime(60s) has passed

* [Sync] Sync smdebug with sagemaker-debugger master branch (#95)

Co-authored-by: Vikas-kum <vikumar@amazon.com>
Co-authored-by: Vandana Kannan <vandanavk@users.noreply.github.com>
Co-authored-by: Anirudh <anirudhkrec@gmail.com>
Co-authored-by: Miyoung <myoung8739@gmail.com>
Co-authored-by: Miyoung Choi <cmiyoung@amazon.com>
Co-authored-by: Rahul Huilgol <huilgolr@amazon.com>
Co-authored-by: Amol Lele <19983848+leleamol@users.noreply.github.com>

* add rule for framework metrics  (#100)

* add rule for framework metrics overview

* update report

* replaced matplotlib figures with bokeh charts

* fix pre-commit error

* minor fixes in report notebook

Co-authored-by: Connor Goggins <cgoggins0@gmail.com>

* Update Profiler Trial and Rules to Generate Report on Every Invoke (#102)

* [TRSL-1037] Emit RuleEvaluationConditionMet from ProfilerReport Rule (#105)

* [TRSL-1037] Emit RuleEvaluationConditionMet from ProfilerReport Rule

Update ProfilerReport rule to emit RuleEvaluationConditionMet if any subrule
has its rule evaluation condition met.

* Update to emit RuleEvaluationConditionMet at the end of job

* Fix comment

* add unit test for ProfilerReport

* remove scanel_interval passed in

* Update unit tests

* Fix incorrect comment on last step.

* Update log message.

* Sync with sagemaker-debugger master branch and fix issue with tensorflow_datasets version (#114)

* Update sagemaker.md (#250)

* Bumping version to 0.9.0 (#251)

* Skip using standalone keras Py3.7+ (#253)

* Gradtape zcc (#252)

* Fix Incorrect Log Statement (#256)

* Incorrect number of tensors saved with MirroredStrategy (#257)

* Change Version to 0.8.1 (#258)

* Save Scalars With Mirrored Strategy (#259)

* skip flaky test (#262)

* Don't export to collections for all workers with unsupported distrib training (#263)

* version bump (#265)

* Avoiding Basehook object pickling (#266)

* handle eager tensors (#271)

* TF 2.x: Support for keras to estimator (#268)

* Revert "TF 2.x: Support for keras to estimator (#268)" (#273)

This reverts commit 749bded.

* Disable TB Testing  (#275)

* Support for TF 2 estimator (#274)

* Adding a TF2 Hvd example and test (#279)

* Moved end of training log from info to debug (#281)

#280

* Adding action class (#285)

* Adding action class
Actions added: stop training job, email, sms

* Fix buildspec used for PR CI (#287)

* Adding a test to check that PT model is saved without issues (#283)

* test that model can be pickled without issues

* Save Model Inputs, Model Outputs, Gradients, Custom Tensors, Layer Inputs, Layer Outputs (#282)

* Pin pytest version (#293)

* Load IRIS Dataset from S3 (#298)

* Load dataset from s3 (#299)

* remove problematic log (#300)

* Change Enum (#301)

* Doc update (#292)

* rename enum (#305)

* version bump to 0.9.1 (#304)

* modify asserts (#307)

* version compare (#306)

* Support TF 2.3 Tests (#312)

* Disable TB in ZCC for AWS TF 2.3.0 (#316)

* Update Assert Statements For New TF 2.2.0 DLC (#320)

* Version Bump (#319)

* add a note for TF 2.2 limited support (#303)


Co-authored-by: Miyoung Choi <cmiyoung@amazon.com>
Co-authored-by: Nihal Harish <nihal42harish@gmail.com>

* TF 2.2 documentation update  (#322)

* update TF 2.2 smdebug features
* Update code samples/notes for new pySDK and smdebug/add and fix links
* add 'New features' note
Co-authored-by: Miyoung Choi <cmiyoung@amazon.com>

* Adding pagination in list_training_jobs (#323)

* Adding pagination in list_training_jobs

* Test Custom Step Usecase (#331)

* save tf2 model (#333)

* Add ability to only save shapes of tensors (#328)

* Revert "Add ability to only save shapes of tensors (#328)" (#337)

This reverts commit c9eb769.

* Function to Test If the hook has been configured with the Default hook config (#332)

* Default hook config (#338)

* version bump (#339)

* TF ZCC limitation footnote (#342)

* Ability to save shapes (#341)

* WIP saveshape

* Add shape writer

* Add pytorch test

* Add untested keras test

* fix syntax

* fix syntax

* Import

* Import

* Add tests for TF

* Simplify read code

* Add read API and tests

* Add mxnet test

* Add s3 and json tests

* lint

* Fix payload

* fix import

* Handle different num tensors for losses

* Fix exact equal condition

* Fix mode bug

* trigger CI

* Add support for distributed training with writer map

* Check that value throws exception

* Fix tests to make them more resilient

* Fix mxnet and pytorch tests

* Remove tensor names

* pre-commit

* Fix get_mode

* Fix bug with old index files

* Fix keras test with names of tensors

* Set original name to None if tf_obj is None

* Fix mirrored test for cpu

* Add docs

* trigger CI

* Fix shape writer get

* Simplify by removing shape writer

* Cleanup

* Fix name of writer

* Addressed review comments

* trigger ci

* retrigger CI

Co-authored-by: NihalHarish <nihal42harish@gmail.com>

* Support Inputs and Labels in the dict format (#345)

* 0.9.4 (#347)

* Refactor Make Numpy Array (#329)

* warn gradtape users  about tf.function support (#348)

* Support all tf types (#346)

* Model Subclassing Test (#351)

* Modify Should Save Tensor Test To Work on Any Version of TF (#352)

* framework version updates (#360)

* list training jobs improvements (#349)

* Previously, listing training jobs always made 50 attempts regardless of the outcome, which generated unnecessary traffic.
* If training jobs with the prefix are found, we break.
* If exceptions are caught more than 5 times, we break.
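
The early-exit behaviour described in the notes above can be sketched roughly as follows. This is a minimal illustration, not the code from #349: `find_jobs_with_prefix` and the `fetch_page` callable are hypothetical names standing in for the real listing call, and only the limits of 50 attempts and 5 exceptions come from the notes.

```python
from typing import Callable, List

MAX_ATTEMPTS = 50    # upper bound on attempts, taken from the note above
MAX_EXCEPTIONS = 5   # exception budget, taken from the note above


def find_jobs_with_prefix(fetch_page: Callable[[int], List[str]], prefix: str) -> List[str]:
    """Poll `fetch_page` until matching jobs are found or a budget runs out.

    `fetch_page` is a hypothetical stand-in for the real listing call: it
    receives the attempt number and returns job names, or raises on failure.
    """
    exceptions = 0
    for attempt in range(MAX_ATTEMPTS):
        try:
            names = fetch_page(attempt)
        except Exception:
            exceptions += 1
            if exceptions > MAX_EXCEPTIONS:
                break  # more than 5 failures: give up instead of retrying
            continue
        matches = [name for name in names if name.startswith(prefix)]
        if matches:
            return matches  # jobs with the prefix were found: stop early
    return []  # always a list, even when nothing was found
```

Returning an empty list rather than `None` keeps the return type uniform for callers.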

* Handle Deprecation Of experimental_ref api (#356)

* check file exist before moving (#364)

* check file exist before moving when closing the file.

* Support Saving Tensors in Graph Mode with add_for_mode (#353)

* Change layer name logic (#357)

* Pass Variable Length Argument To Old Function Call (#366)

* test concat layers (#367)

* Update README.md (#371)

* Pinning the version of tensorflow_datasets package so that it does not require updating TF (#373)

Co-authored-by: NihalHarish <nihal42harish@gmail.com>

* Bugfix: Debugger breaks if should_save_tensor is called before collections are prepared (#372)

* Fixing the nightly build pipelines. Avoid force reinstall of rules package when not necessary (#374)

* returning list instead of dict keys (#376)

Fix the return value of _get_sm_tj_jobs_with_prefix. This function should always return a list.

* Add support for mixed precision training (#378)

* Modify Asserts to Work with TF 2.1.0 and TF 2.0.0 (#380)

* pytorch tmp (#382)

* extend zcc to 2.1.2 (#384)

* disable pytorch (#386)

* Removed the redundant installation of smdebug and smdebug-rules (#391)

* Incrementing the version to 0.9.5 (#396)

* pin tensorflow dataset in test config (#399)

* add back test

* revert some changes

* unpin pytest version

Co-authored-by: Nihal Harish <nihal42harish@gmail.com>
Co-authored-by: Vikas-kum <vikumar@amazon.com>
Co-authored-by: Vandana Kannan <vandanavk@users.noreply.github.com>
Co-authored-by: Anirudh <anirudhkrec@gmail.com>
Co-authored-by: Miyoung <myoung8739@gmail.com>
Co-authored-by: Miyoung Choi <cmiyoung@amazon.com>
Co-authored-by: Rahul Huilgol <huilgolr@amazon.com>
Co-authored-by: Amol Lele <19983848+leleamol@users.noreply.github.com>

* Changing the Herring user-facing API (#110)

* [TRSL-998] Update Rule Test with Result Checking (#106)

* [TRSL-998] Update Rule Test with Result Checking

Update existing rule tests to assert against rule output. This ensures that
rules are tested against their report results, which should be deterministic in CI.

* Generate HTML Report at every ProfilerReport invoke (#112)

This change adds HTML report generation at the end of every invoke of ProfilerReport rule.

* Update RuleEvaluationConditionMet to indicate end of the rule (#115)

* fix: Remove the hard code notebook file path (#117)

* Run rules tests in CI (#116)

* Log fix memory issue fix (#113)

* Changed the Herring API and variable names (#118)

* Removing the functionality to attach the backward hook to the module (#125)

* Removing the functionality to attach the backward hook to the module

* Updated the number of traceevents as the backward hook is no longer registered.

* Herring TF2 Native Gradient Tape SMDebugger support (#122)

* Fix bug in base hook (#127)

* Minor bugfixes/changes in rules (#126)

* minor bugfixes for rules

* Updating batch size rule (#123)

* fix for batch size rule

* Dataloader rule (#108)

* added dataloader rule and updated profiler report

* Redesign TF dataloader metrics collection (#92)

* Update profiler config parser to match latest SDK changes (#120)

* Replaced herringsinglenode command with smddpsinglenode (#129)

* Updating the version for profiler GA release (#124)

* Updating the version for profiler GA release

* Trigger Build

* Trigger Build

* Trigger Build

* Fix paths in profiler report (#131)

* changed path in profiler report

* fixed env variable (#132)

* making info log to debug from trace event parser as it is very verbose (#134)

* Only do detailed profiling for supported TF versions. (#135)

* Update PT tests (#136)

* Fix bug in parser (#137)

* smdistributed.dataparallel should be invoked from mpi command (#138)

* smdistributed.dataparallel should be invoked from mpi command

* Added comments

* Bugfix: Invalid Worker (#139)

* smdistributed.dataparallel environment check (#140)

* smdistributed.dataparallel environment check

* addressed comments

* Modified check_smdataparallel_env logic

* Install rules packages in PR CI (#143)

* Removed the files and folders that are not required in the public repository

* Removed the integration tests.

* Fixed the pre-commit checks

Co-authored-by: Vandana Kannan <vandana268@gmail.com>
Co-authored-by: Vikas-kum <dev.vikas94@gmail.com>
Co-authored-by: Vandana Kannan <vandanavk@users.noreply.github.com>
Co-authored-by: Nathalie Rauschmayr <n.rauschmayr@gmail.com>
Co-authored-by: Neelesh Dodda <ndodda@amazon.com>
Co-authored-by: Rajan Singh <srajanku@amazon.com>
Co-authored-by: sife <sifei.li@hotmail.com>
Co-authored-by: Anirudh <anirudhkrec@gmail.com>
Co-authored-by: Vikas Kumar <vikumar@amazon.com>
Co-authored-by: Anirudh <aanirud@amazon.com>
Co-authored-by: Karan Jariwala <karankjariwala@gmail.com>
Co-authored-by: Nihal Harish <nihal42harish@gmail.com>
Co-authored-by: Miyoung <myoung8739@gmail.com>
Co-authored-by: Miyoung Choi <cmiyoung@amazon.com>
Co-authored-by: Rahul Huilgol <huilgolr@amazon.com>
Co-authored-by: Connor Goggins <cgoggins0@gmail.com>
Co-authored-by: JC-Gu <jiacheg@amazon.com>