What is the #252
Comments
Thank you so much.
atqy
pushed a commit
to atqy/amazon-sagemaker-examples
that referenced
this issue
Aug 16, 2022
* Save histograms for weights and gradients * Use standard TF summary function * undo line break changes * fix cases when bool tensor was being passed to add_histogram, and fix tests * Fix region bug and update tb_writer construction * Include summaries if any write_histogram was set to True * Refactor writers in core * set default step to 0 * Use new writer in hook * Cherry picking change of refactor writers * set default step to 0 * remove histogram related stuff * rename IndexUtil * Fix imports * remove import of re * Fix import of summary proto * Fix step usage in writers * Fix step usage by event file writer * Remove direcotry in tensorboard directory, and add collection name as prefix for summaries created * Fix import errors * Fix resnet example which did not have str2bool args * Fix core test * Fix core test * Indentation and move some code to a new function * Merged Vikas' branch on tb data read * Add untested support to read tensorboard data * Write mode and mode_step for summaries, and fix the error of multiple global steps being assigned to same train step * remove unnecessary file * remove test script * Remove changes to imagenet script * working scalars * Change path of tornasole event files * Have new index file per mode for tensorboard events * Move tensor values to different file * move to outside tensors folder * Change frequencies for tf examples * Introduce CollectionKeys * Merging export as json * Make histogram a reduction config property, and add save_raw_tensor field to reduction config. Verified the usage for tensorflow. Also some cleanup with respect to save config in save manager * Fix bug in loading collections * Fix writing tensorboard data in global mode * Add graph support to pytorch models. Copied some new protos, and a couple of files from torch.tensorboard. 
* Working graph export for mxnet * Save graph correctly for mxnet * undo utils change worker pid * fix import * fix import * do not flush index writer * remove data files * Fix save config issue * make save_histogram a property of collection * Fix save config bugs, and add scalar support to TF * Skip summaries whose tensors are unreachable in graph, and avoid adding histogram when original collection is not included * Move histogram creation to writer instead of event_file_writer, refactor should_save_collection in save manager, add save_scalar methods to MXNet and Pytorch * WIP tensor scalar support * undo add of data * remove test * use correct writer * Make saving scalars work, and added type checks * Writing scalars and tensors supported. tested in tensorboard. need to test through trials * WIP testing steps * remove save scalar and tensor for now because of step number issues. work on trial loading tensorboard data and come back to this * Working reads in non index mode * Tensorboard reads working with indexing * cleanup index file location function * Make pytorch tests working * Reduce length of test_estimator_modes, and add tf tensorboard test * Add basic scalar summary test * Untested completed reads of tensorboard data * Add more tensorboard tests for trial * fix test when reading event files for tensorboard from s3 * Fixed a reduction test * Fix reduction test in TF * Fix merge of a test * fix logger import, and default save/reduction config in save manager * Fix reduction save_raw_tensor in TF * Some cleanup of prepare and collection includes * fix tf tests * Fix all tests * Add tensorboard index test * Fix tensorboard test wrt optimizer_variables * not save histogram for strings * remove when nan support * add hash * Fix collection checks in xgboost * add xgboost tests * Typo * Update hook.py (aws#243) * reduce length of test and add / to prefix * WIP move to tornasole hist summaries for TF * Change collections_to_save_for_step, make TF use custom 
histograms, refactor to _save_tensor method for all frameworks * rename to save_for_tensor * undo some files * undo some files * Update tests.sh * remove pytorch graph support * remove mxnet graph support * cleanup * remove tf tensorboard duplicated test * Fix bug of tb writer not being closed after exporting graph * WIP fixing tests * Remove read changes * fix value types remaining in code * fix tests * catch exception when nan * use make_numpy_array for xgboost * Fix xgboost error where collections_in_set was empty but not none * change log * remove summary collections * tweak dry run behavior * Fix dry run flag * undo move of steps to own file * Delete steps.py * fix import * fix import in test * cleanup * remove index for tensorboard data * Address review comments * Update hook.py
atqy
pushed a commit
to atqy/amazon-sagemaker-examples
that referenced
this issue
Aug 16, 2022
* rotation policy * fix tests * fix write event call * add comments in code * add a test through hook * fix rotation * some fixes * delete file if empty * enable multi-process test * fix multi-process test * add pt distrib test * Revert "add pt distrib test" This reverts commit a8fc661a02ba29e6fdc49019006b2dafc3cbd67d. * enable write to s3 * address some review comments * address some more review comments * cleanup * some fixes * make timestamp mandatory * filename timestamp matches 1st event * more cleanup and fixes * consolidate classes * timestamp in UTC * address review comments * edit base_start_time * remove delete if empty * default queue size and flush secs * Add timestamp test * add abs and rel timestamp in record * save default values to constants file * Cached the names of parsed files to avoid parsing them everytime. * address review comments * lazy file creation * drop events if file creation fails * rename file to event end ts * correct s3 bucket name * test timestamp with file rotation check if timestamp of all events in a file are lesser than timestamp in file name * remove ref to s3 * remove changes to s3.py * add checks for healthy writer * test file open failure * Cleanup hook * Added the buffer for looking up trace file, removed the get_events_at_time function, updated the implementation of get_events to return the active events * make timestamp mandatory everywhere * fix mxnet test * Corrected the multiplier for microseconds * remove flush_secs * Updating the tests directory with new file format. * Simplify class structure * save base_start_time in record * Updated the test directories to the updated YYYYMMDDHR format * init env variables once * Renamed the function and added function comments * address some review comments * cleanup * Fixed the trace file look for start and end time events * Truncating the trace files and updating the test file. 
* fix pt test * fallback node ID * Removed the functionality to cap the upper_bound_timestamp * Optimize the refreshing the file list based on the last available timestamp in the datasource viz. local or S3 * Correctly named the file suffix. Truncated the horovod timeline file * Added the functionality to download the S3 files in parallel * Addressed the review comments * address review comments * Trace events writer - part 2 (#6) * ensure there's a dir for the new file * add .tmp * handle the case when events are far apart * fix a mistake in cur_hour * updated last_file_close_time to now Co-authored-by: Vikas-kum <dev.vikas94@gmail.com> * Record step duration in keras hook (#8) * add step duration to keras hook Co-authored-by: Vikas-kum <dev.vikas94@gmail.com> * test TF step time with timeline writer (#9) * Read node ID from Resource config (#10) * read host ID from resource config * use timeline writer directly (#11) * Added functionality to record node_id in the events (#7) * Added functionality to record node_id in the events * Added the test to verify node id from file * Moved the functions to extract node id and timestamp to utils directory. 
* Add profiler config parser (#12) * Timeline file name timestamp in us (#15) * file timestamp in us * Add comprehensive tests for detailed profiler config (#18) * adding comprehensive tests * refactoring fixtures * renaming vars * remove imports * remove extraneous fixture * PR changes * documenting test cases * documenting test cases * refactoring fixtures * Supporting efficiently downloding s3 files for distributed training (#14) * Supporting efficiently downloding s3 files for distributed training * updated op_name and args when recording step duration (#17) * fixes for right directory name(#20) * Fix folder name (#21) * fixes * change all variables to microsecs * Updating the files to fix the pre-commit failures (#23) * Change invalid file path (#25) * change invalid file path * fix other precommit errors * Add error handling for parsing profiler config (#27) * Fixing the tests for CI (#28) * Fixing the tests for CI * fix out_dir bug Co-authored-by: Neelesh Dodda <ndodda@amazon.com> * Default path for profiler has changed (#29) * Update and correct some documentation (#30) * Enabling TF profiler in smdebug (#5) * Enabling TF profiler in smdebug Co-authored-by: Neelesh Dodda <ndodda@amazon.com> * change variable name and folder path (#35) * change variable name and folder path * add tests to check rotation policy * Add ProfilerSystemMetricFileParser and basic tests (#16) * Add ProfilerSystemMetricFileParser and basic tests * Refactor MetricsReaderBase class * Fix timestamp to event files mapping for both MetricsReader and SystemMetricsReader * rename MetricsReader to AlgorithmMetricsReader * refactoring. Providing a way to avoid cache and hence going OOM (#38) * refactoring. 
Providing a way to avoid cache and hence going OOM * modifying test cases to have use_in_memory_cache param * Time annotations in PyTorch hook (#13) * modified pytorch hook to record time annotations Co-authored-by: Vikas Kumar <vikumar@amazon.com> * Pulling in changes from smdebug repo to private (#39) * latest commit from smdebug repo master is * Disable TB Testing (aws#275) with commit id b8661de Co-authored-by: Nihal Harish <nihal42harish@gmail.com> Co-authored-by: Vikas-kum <vikumar@amazon.com> * Reorganizing the profiler tests for PR CI build (#41) * Organized the profiler tests. * Updated the tests.sh for PR CI build * Updated the tests.sh for PR CI build * profiler dashboards (#4) * add files for profiler dashboards * updated dashboards to use timeline reader * fixed bug 2,5,6,7,9,10 from bugbash * fixed bug 1,3,4,8,16,17,19 from bugbash * linked x-axis of timeline charts * Creating a generic profiler dashboard & report (#42) * Creating a generic profiler dashboard which can take a training job name and region and execute the notebook. * review comments * Updated notebooks and added Pandas functionalities (aws#43) (aws#44) * updated notebook and added Pandas functionalities * minor fixes in profiler_generic_dashboard.ipynb Co-authored-by: Nathalie Rauschmayr <n.rauschmayr@gmail.com> * Enable file rotation for Horovod trace file (#33) * Hvd file reader and rotation of files Co-authored-by: Anirudh <anirudhkrec@gmail.com> * Pytorch profiler new (#40) * adding profiling info to pytorch hook * imore changes * capturing forward and backward time from within pytorch hook Note that hook provides backward end time, so backward start time is approximated to end of last forward/backward or now So, forward times and backward end times should be accurate while backward start time is approximated. 
* irmeoved print statements * ran pre-commit and removed some log statements * pre commit run * Fixed the assert * Temporarily skipping the test on codebuild projects where pytorch is not installed. * Temporarily skipping the test on codebuild projects where pytorch is not installed. * Temporarily skipping the test on codebuild projects where pytorch is not installed. * Temporarily skipping the test on codebuild projects where pytorch is not installed. * Temporarily skipping the test on codebuild projects where pytorch is not installed. * reverted the temporary changes * Fixed the assert * FIxing the CI test failure * Fixed the code to include the last layer * Updated the tests and refactored the TraceEvent class. * Converted the rnn test to pytest variant * Fixed the assert for passing CI Co-authored-by: Vikas-Kum <dev.vikas94@gmail.com> Co-authored-by: Vikas Kumar <vikumar@amazon.com> * Python profiler (#36) Co-authored-by: Neelesh Dodda <ndodda@amazon.com> * Changes to horovod file parser (aws#46) * TF2 profiler tests (aws#48) * test detailed step/time based profiling * Bug fixes for autograd profiler in Pytorch hook. 
(aws#50) * fixed pytorch hook * fixed merge conflict * fixed bug in hook * Adding action class (aws#285) (aws#54) * Adding action class Actions added: stop trianing job, email, sms Co-authored-by: Vikas-kum <vikumar@amazon.com> * Pull in changes from the sagemaker-debugger repository (aws#55) * Pull in changes from the sagemaker-debugger repository * Typecasting profiling parameters to int (aws#52) * Refactor analysis utils (aws#57) * Integration tests for profiler on sagemaker (#19) scripts and infrastructure code * Typecasting str profiling parameters to bool (aws#58) * Typecasting str profiling parameters to bool * Add pyinstrument for python profiling (aws#56) * Make DetailedProfilingConfig a string in profiler config (aws#67) * detailed profiling config now is string * install tf_datasets (aws#66) * Convert profiler data to pandas frame (aws#47) * add class to convert profiler data to pandas frame * fixed local reader * add notebook for pandas queries * added code to find workload balancing issues in multi GPU training * Adding more checks to integration tests (aws#73) * pytorch Added step event, mode and more details to detailed profiling (aws#78) * Added step event, mode and more details to detailed profiling * Changing op name string * Making op_name equivalent to TF * changing step num to mode_step * Adding phase to autograd events * Change timeline node_id for distributed workers (aws#80) * change timeline node_id for distributed workers * Add integration tests for detailed profiling and python profiling (aws#71) * Fixing a bug where step num was not correctly used when enabling detailed profiling Dumping the torch autograd profiler every step. If there are multiple steps then data builds up and can cause gpu memory build up. 
* Feature to profile for different step phases 2.Capturing profiling step phases for pytorch 3.Fix bug with path string which was always having cprofile in path even if pyinstrument profiler is used * Fix pre-commit * Fix call to stats_filename * Fixing PythonStepStats * auto commit * ifix x * iFix * fix * pre commit fix * fix bug * removed code * make profiling parameters case insensitive * docstring for case insensitive config * precommit * push profiler images to alpha and get tag from environment variable * push profiler images to alpha and get tag from environment variable * Add height param to HeatMap * specify registry ID as env variable, alpha by default * Some cleanup, adding total time in cprofile * Refactored metricsHistogram and stepHistogram and amde more modular * separate usepyinstrument * iFixes for metrics historgram * Fixing StepHistogram * removing pritn with logger * refactoring * changes in detailed profiling * remove imports * notebook fixes and histogram class fixes * Adding wheel lfile * running pre-commit * fix tests * Adding unique thread id , pid, for trace event parser In every event added event_phase, node_id * pre-commit * fixing notebook and other changes * fix check for event_Args None * Changing ntoebook * upload files to s3 during test * minor fix * create new s3 folder for stats * fix syntax errors * Some cleanup * Fix int typecast for rotatemaxfilesizebytes (#19) Co-authored-by: Vikas-kum <vikumar@amazon.com> * Pull in smdebug 145d43b (#38) * Pull in latest smdebug (0.9.1) (upto commit 145d43b) * Reverting the change to GET_OBJECTS_MULTIPROCESSING_THRESHOLD in #14. * Adding metadata file for TF Profiler parser to include startitime (#4) * TF profiler event parser * fix can_start_prof bug * populate start time * handle tf trace json in reader * separate file for metadata * Reorder the writing of events so that events get correctly written according to their end timestamp. 
(#39) Co-authored-by: Vikas-kum <vikumar@amazon.com> * Enable profiling between steps for tensorflow (#2) * Dump HTML for each pyinstrument stats file (#16) * output html in python profiler * dump output html for pyinstrument * Add higher level analysis functions for cProfile python profiling (#6) * Updated preview notebooks (#8) * Valid trace file check (#41) * fix valid trace file check * change log level * Adding analysis utils and updating the analysis notebook (#9) * add pandas analysis utils * update profiler analysis notebook (#32) * Updated analysis utils (#34) * add python profiling to notebook (untested) Co-authored-by: NRauschmayr <n.rauschmayr@gmail.com> Co-authored-by: Neelesh Dodda <ndodda@amazon.com> * check record end time similar to c++ writer (aws#45) * remove flakiness offset from sm tests (aws#43) * Add example notebook fixes for python profiling (aws#46) * Refactored profiler dashboards (#42) * refactored dashboards to plot new system metrics * updated step timeline chart to plot train/eval/global step * bugfixes for analysis notebook (aws#44) * Bugfixes in analysis and notebooks (aws#49) * Followup to the PR on analysis utils (aws#50) * Prevent metrics reader from reading invalid files (aws#52) * Modify horovod tests to generate check for horovod timeline (aws#51) * Bugfixes (aws#57) * fix for dashboards * Add timeline image for bottlenecks notebook (aws#59) * Error handling for pyinstrument (aws#58) * Enable/disable python profiling after forward pass of pytorch hook instead of backward pass (aws#56) * Pytorch integration tests (#33) * Enabling integration tests for pytorch * Fixed the job index for codebuild project. * Fixed the job index for codebuild project. * Fixing the codebuild project to install smdebugger in docker * Fixing codebuild project * Adding cpu jobs * Adjusted the parameters for cpu jobs * PyTorch detailed profiler traces are not present in detailed_profiling directory. * Fixing the test yml file. 
* Fixing the test yml file. * Removed commented code. * Added test configuration for absent profiler. * Preloading the cifar10 dataset into source directory. * ENabled the assert for checking the timestamp * adjusted the tracefile counts * Fixed the job names, added tests for cprofile * Updated the job configs * Adjusted the expected trace file count. * Changed the order in which the trace events are written * Reduced the batch size for cpu tests. * Reduced the batch size for cpu tests. * Fixed the imports * Added capability to handle html file. * Adding horovod tests for integration * Adding horovod tests for integration * Fixed the assert for horovod trace file count * Valid trace file check (#41) * fix valid trace file check * change log level * Fixed the expected count of stats and trace files. * Fixed the profiler config name UsePyinstrument * Preloading mnist dataset to avoid downloading it from internet during training. * Bugfixes in analysis and notebooks (aws#49) * Added test scenario to test the file rotations. * Adding more test scenarios * Adding integration test for distributed training using distributed api * Adding horovod training with resnet50 and cifar10 * FIxing tehe launcher script for resnet50 with horovod. * Increased the batch size * Supporting res50 and cifar with horovod. * Fixed the validation for horovod tracefiles. * Update tests/sagemaker/test_profiler_pytorch.py Co-authored-by: Anirudh <anirudhkrec@gmail.com> * Scheduling sagemaker jobs in parallel. * Fixed the config file path. 
Co-authored-by: Vandana Kannan <vandanavk@users.noreply.github.com> Co-authored-by: Nathalie Rauschmayr <n.rauschmayr@gmail.com> Co-authored-by: Anirudh <anirudhkrec@gmail.com> * Fix buildspec yaml file for TF integration tests (aws#66) * Merge latest changes from smdebug to smprofiler (aws#68) * Updating analysis utils (aws#63) * Modify step stats util to compute stats for multiproc data * Modify utils to handle multi-node data * Modify notebook utils to handle multi-node data Co-authored-by: Neelesh Dodda <ndodda@amazon.com> * Merge timeline for framework events (#5) * Fixing the CI failure caused by awscli (aws#72) * Add metrics config (aws#67) * Add API functions to python profiling analysis for correlation with framework metrics (aws#53) * Dataloader analysis for PyTorch (aws#64) * Adding the functions to get the dataloader events for pytorch * Adding the training script and notebook for dataloader analysis * Fixed the timeconversion from timestamp to UTC and fixed the local reader for system tracefiles. * Updating the dataloader analysis notebook * Updated the notebook with analysis for batch processing. * Updated notebook to display python profiler stats. * Updated the notebook with documenation and layout * Updated the notebook to have static contents * Updating the notebook to handle absence of traceevents * FIxed the tracevents as per the current format and added notebook for triggering the pytorch training jobs * Moved the analysis functions from notebook to a class * Updated the utility functions to retrieve the dataloader events * Added the test scripts for horovod and distributed training * Adding a script that uses dummy custom dataloader * Addressed the review comments * Updated the utility code and added a training script that uses custom datasets * Added hyper parameteres for custom dataset training. 
* Fix TF event file decompression issue (aws#73) * Fix bugs in keras hook (aws#75) * Reorder events in pytorch hook (aws#60) * Refactor metrics config (aws#76) * Perf benchmark (#31) * Fix for hvd reader issue and one more change (aws#74) * Fixing the batch time analysis in interactive notebook to not generate incorrect plot (aws#81) * Fixing the compuation of batchtime * Fixing the compuation of batchtime * retrigger CI * Attempting to fix PR CI * Attempting to fix PR CI for PyTorch * Attempting to fix PR CI for PyTorch * Merge timeline fixes (aws#82) * Merge timeline fixes 1) putting the node_ids as threads. 2) Providing right sort order for processes and threads 3) Fixing bugs * add check if gpu is available (aws#62) Co-authored-by: Vikas-kum <vikumar@amazon.com> * Performance benchmarking for PyTorch (aws#78) * Pytorch performance tests * Fixed the estimator * Fixed the training script for correct metrics generation * Added train duration metrics in the training script * Adjusted the alarm values * Adjusted the alarm values * Fixed the job name for no smdebug and no profiler * Optimized the training script and added comments in the driver script. * Updated the scripts for framework only training job * Removed the unenecessary code. * Updating the instance types. * Notebook for interactive analysis (aws#69) * Notebook for interactive analysis * add python profiling to interactive analysis notebook * Updated the interactive notebook with dataloader analysis for pytorch * updated the utility functions to retrieve the dataloader events * some changes to the nb * some fixes to the nb * fixes * reset index * editing nb content * fixes * nit fix * fixes after metricsconfig * update notebooks * add updated job notebooks * updated notebooks for bug bash * update TF notebook * rename notebooks * rename notebooks * updating notebooks with feedback * Renamed Profiler to EagleEye * minor edits * scripts * fix * Updated the interactive anlaysis notebook with minor fix. 
* Updated the instance type for rules to ml.m5.8xlarge' * Updated the rules instances to ml.r5.4xlarge' * miyoung's changes Co-authored-by: Neelesh Dodda <ndodda@amazon.com> Co-authored-by: Amol Lele <19983848+leleamol@users.noreply.github.com> Co-authored-by: Anirudh <anirudhkrec@gmail.com> * Fixed the metrics names to have correct instance names. (aws#88) * Added empty name in an event during merge_timeline if it is missing (aws#87) * Add an empty name only for Horovod and Herring events if name is missing for E events. * Add ProfilerTrial class and profiler builtin rules (aws#54) * add files for gpu usage rule * adding rule to detect cpu bottlenecks * add rule to detect outliers in step duration * added node id to rule analysis * add rule for checking gpu memory increase * added rules for batch size and max initialization time * add rule to detect load balancing issues in multi GPU training * add dockerfiles to build rule container * applying changes from https://github.com/awslabs/sagemaker-profiler/commit/57dfe2bd960ae798610b6ff52f661a4f5475eded fixed output directory and label legends Co-authored-by: Vandana Kannan <vandana268@gmail.com> Co-authored-by: Vikas Kumar <dev.vikas94@gmail.com> * Fixing the writing of first event in the tracefile that stores the start time from epoch (aws#85) * Fixing the writing of first event in the tracefile. * Added the master table to ensure that we always write the metaevent in the new traceevent file. 
* Fixing bugs in KerasHook and profiler utils (aws#89) * Change smdebug version in notebooks (aws#90) * change smdebug version * rename tf_python_stats_dir to python_stats_dir Co-authored-by: Neelesh Dodda <ndodda@amazon.com> * Dynamic ON/OFF Herring timeline for PyTorch framework (aws#80) * Fix pytest version (aws#91) * support mixed precision training (aws#96) * merging sys metrics and bottlenecks in the timeline (aws#93) * merging sys metrics and bottlenecks in the timeline * Fix hvd failures and add native TF training in TF integration tests (aws#97) * Reading rule stop signal file and stopping the rule if gracetime has … (aws#98) * Reading rule stop signal file and stopping the rule if gracetime(60s) has passed * [Sync] Sync smdebug with sagemaker-debugger master branch (aws#95) Co-authored-by: Vikas-kum <vikumar@amazon.com> Co-authored-by: Vandana Kannan <vandanavk@users.noreply.github.com> Co-authored-by: Anirudh <anirudhkrec@gmail.com> Co-authored-by: Miyoung <myoung8739@gmail.com> Co-authored-by: Miyoung Choi <cmiyoung@amazon.com> Co-authored-by: Rahul Huilgol <huilgolr@amazon.com> Co-authored-by: Amol Lele <19983848+leleamol@users.noreply.github.com> * add rule for framework metrics (aws#100) * add rule for framework metrics overview * update report * replaced matplolib figures with bokeh charts * fix pre-commit error * minor fixes in report notebook Co-authored-by: Connor Goggins <cgoggins0@gmail.com> * Update Profiler Trial and Rules to Generate Report on Every Invoke (aws#102) * [TRSL-1037] Emit RuleEvaluationConditionMet from ProfilerReport Rule (aws#105) * [TRSL-1037] Emit RuleEvaluationConditionMet from ProfilerReport Rule Update ProfilerReport rule to emit RuleEvaluationConditionMet if any subrule having rule evaluation confition met. * Update to emit RuleEvaluationConditionMet at the end of job * Fix comment * add unit test for ProfilerReport * remove scanel_interval passed in * Update unit tests * Fix incorrect comment on last step. 
* Update log message. * Sync with sagemaker-debugger master branch and fix issue with tensorflow_datasets version (aws#114) * Update sagemaker.md (aws#250) * Bumping version to 0.9.0 (aws#251) * Skip using standalone keras Py3.7+ (aws#253) * Gradtape zcc (aws#252) * Fix Incorrect Log Statement (aws#256) * Incorrect number of tensors saved with MirroredStrategy (aws#257) * Change Version to 0.8.1 (aws#258) * Save Scalars With Mirrored Strategy (aws#259) * skip flaky test (aws#262) * Don't export to collections for all workers with unsupported distrib training (aws#263) * version bump (aws#265) * Avoiding Basehook object pickling (aws#266) * handle eager tensors (aws#271) * TF 2.x: Support for keras to estimator (aws#268) * Revert "TF 2.x: Support for keras to estimator (aws#268)" (aws#273) This reverts commit 749bded. * Disable TB Testing (aws#275) * Support for TF 2 estimator (aws#274) * Adding a TF2 Hvd example and test (aws#279) * Moved end of training log from info to debug (aws#281) awslabs/sagemaker-debugger#280 * Adding action class (aws#285) * Adding action class Actions added: stop trianing job, email, sms * Fix buildspec used for PR CI (aws#287) * Adding a test to check that PT model is saved without issues (aws#283) * test that model can be pickled without issues * Save Model Inputs, Model Outputs, Gradients, Custom Tensors, Layer Inputs, Layer Outputs (aws#282) * Pin pytest version (aws#293) * Load IRIS Dataset from S3 (aws#298) * Load dataset from s3 (aws#299) * remove problematic log (aws#300) * Change Enum (aws#301) * Doc update (aws#292) * rename enum (aws#305) * version bump to 0.9.1 (aws#304) * modify asserts (aws#307) * version compare (aws#306) * Support TF 2.3 Tests (aws#312) * Disable TB in ZCC for AWS TF 2.3.0 (aws#316) * Update Assert Statements For New TF 2.2.0 DLC (aws#320) * Version Bump (aws#319) * add a note for TF 2.2 limited support (aws#303) Co-authored-by: Miyoung Choi <cmiyoung@amazon.com> Co-authored-by: Nihal Harish 
<nihal42harish@gmail.com> * TF 2.2 documentation update (aws#322) * update TF 2.2 smdebug features * Update code samples/notes for new pySDK and smdebug/add and fix links * add 'New features' note Co-authored-by: Miyoung Choi <cmiyoung@amazon.com> * Adding pagination in list_training_jobs (aws#323) * Adding pagination in list_Training_jobs * Test Custom Step Usecase (aws#331) * save tf2 model (aws#333) * Add ability to only save shapes of tensors (aws#328) * Revert "Add ability to only save shapes of tensors (aws#328)" (aws#337) This reverts commit c9eb769. * Function to Test If the hook has been configured with the Default hook config (aws#332) * Default hook config (aws#338) * version bump (aws#339) * TF ZCC limitation footnote (aws#342) * Ability to save shapes (aws#341) * WIP saveshape * Add shape writer * Add pytorch test * Add untested keras test * fix syntax * fix syntax * Import * Import * Add tests for TF * Simplify read code * Add read API and tests * Add mxnet test * Add s3 and json tests * lint * Fix payload * fix import * Handle different num tensors for losses * Fix exact equal condition * Fix mode bug * trigger CI * Add support for distributed training with writer map * Check that value throws exception * Fix tests to make them more resilient * Fix mxnet and pytorch tests * Remove tensor names * pre-commmit * Fix get_mode * Fix bug with old index files * Fix keras test with names of tensors * Set original name to None if tf_obj is None * Fix mirrored test for cpu * Add docs * trigger CI * Fix shape writer get * Simplify by removing shape writer * Cleanup * Fix name of writer * Addressed review comments * trigger ci * retrigger CI Co-authored-by: NihalHarish <nihal42harish@gmail.com> * Support Inputs and Labels in the dict format (aws#345) * 0.9.4 (aws#347) * Refactor Make Numpy Array (aws#329) * warn gradtape users about tf.function support (aws#348) * Support all tf types (aws#346) * Model Subclassing Test (aws#351) * Modify Should Save Tensor Test 
To Work on Any Version of TF (aws#352) * framework version updates (aws#360) * list training jobs improvements (aws#349) * Earlier list training job would make 50 attempts irrespective. This may be bad because of unnecessary traffic. * if there are training jobs found with prefix, we break * if there are exceptions caught more than 5 times we break. * Handle Deprecation Of experimental_ref api (aws#356) * check file exist before moving (aws#364) * check file exist before moving when closing the file. * Support Saving Tensors in Graph Mode with add_for_mode (aws#353) * Change layer name logic (aws#357) * Pass Variable Length Argument To Old Function Call (aws#366) * test concat layers (aws#367) * Update README.md (aws#371) * Pinning the version of tensorflow_datasets package so that it does not require updating TF (aws#373) Co-authored-by: NihalHarish <nihal42harish@gmail.com> * Bugfix: Debugger breaks if should_save_tensor is called before collections are prepared (aws#372) * Fixing the nightly build pipelines. Avoid force reinstall of rules package when not necessary (aws#374) * returning list instead of dict keys (aws#376) fix in reuturn of _get_sm_tj_jobs_with_prefix . This function should return list always. 
* Add support for mixed precision training (aws#378) * Modify Asserts to Work with TF 2.1.0 and TF 2.0.0 (aws#380) * pytorch tmp (aws#382) * extend zcc to 2.1.2 (aws#384) * disable pytorch (aws#386) * Removed the redundant installation of smdebug and smdebug-rules (aws#391) * Incrementing the version to 0.9.5 (aws#396) * pin tensorflow dataset in test config (aws#399) * add back test * revert some changes * unpin pytest version Co-authored-by: Nihal Harish <nihal42harish@gmail.com> Co-authored-by: Vikas-kum <vikumar@amazon.com> Co-authored-by: Vandana Kannan <vandanavk@users.noreply.github.com> Co-authored-by: Anirudh <anirudhkrec@gmail.com> Co-authored-by: Miyoung <myoung8739@gmail.com> Co-authored-by: Miyoung Choi <cmiyoung@amazon.com> Co-authored-by: Rahul Huilgol <huilgolr@amazon.com> Co-authored-by: Amol Lele <19983848+leleamol@users.noreply.github.com> * Changing the Herring user-facing API (aws#110) * [TRSL-998] Update Rule Test with Result Checking (aws#106) * [TRSL-998] Update Rule Test with Result Checking Update existing rule testing to assert against rule output. This will ensure rule are tested with its report result which should be deterministic thru CI. * Generate HTML Report at every ProfilerReport invoke (aws#112) This change adds HTML report generation at the end of every invoke of ProfilerReport rule. * Update RuleEvaluationConditionMet to indicate end of the rule (aws#115) * fix: Remove the hard code notebook file path (aws#117) * Run rules tests in CI (aws#116) * Log fix memory issue fix (aws#113) * Changed the Herring API and variable names (aws#118) * Removing the functionality to attach the backward hook to the module (aws#125) * Removing the functionality to attach the backward hook to the module * Updated the number of traceevents as the backward hook is no longer registered. 
* Herring TF2 Native Graident Tape SMDebugger support (aws#122) * Fix bug in base hook (aws#127) * Minor bugfixes/changes in rules (aws#126) * minor bugfixes for rules * Updating batch size rule (aws#123) * fix for batch size rule * Dataloader rule (aws#108) * added dataloader rule and updated profiler report * Redesign TF dataloader metrics collection (aws#92) * Update profiler config parser to match latest SDK changes (aws#120) * Replaced herringsinglenode command with smddpsinglenode (aws#129) * Updating the version for profiler GA release (aws#124) * Updating the version for profiler GA release * Trigger Build * Trigger Build * Trigger Build * Fix paths in profiler report (aws#131) * changed path in profiler report * fixed env variable (aws#132) * making info log to debug from trace event parser as it is very verbose (aws#134) * Only do detailed profiling for supported TF versions. (aws#135) * Update PT tests (aws#136) * Fix bug in parser (aws#137) * smdistributed.dataparallel should be invoked from mpi command (aws#138) * smdistributed.dataparallel should be invoked from mpi command * Added comments * Bugfix: Invalid Worker (aws#139) * smdistributed.dataparallel environment check (aws#140) * smdistributed.dataparallel environment check * addressed comments * Modified check_smdataparallel_env logic * Install rules packages in PR CI (aws#143) * Removed the files and folders that are not required in the public repository * Removed the integration tests. 
* FIxed the pre-commit checks Co-authored-by: Vandana Kannan <vandana268@gmail.com> Co-authored-by: Vikas-kum <dev.vikas94@gmail.com> Co-authored-by: Vandana Kannan <vandanavk@users.noreply.github.com> Co-authored-by: Nathalie Rauschmayr <n.rauschmayr@gmail.com> Co-authored-by: Neelesh Dodda <ndodda@amazon.com> Co-authored-by: Rajan Singh <srajanku@amazon.com> Co-authored-by: sife <sifei.li@hotmail.com> Co-authored-by: Anirudh <anirudhkrec@gmail.com> Co-authored-by: Vikas Kumar <vikumar@amazon.com> Co-authored-by: Anirudh <aanirud@amazon.com> Co-authored-by: Karan Jariwala <karankjariwala@gmail.com> Co-authored-by: Nihal Harish <nihal42harish@gmail.com> Co-authored-by: Miyoung <myoung8739@gmail.com> Co-authored-by: Miyoung Choi <cmiyoung@amazon.com> Co-authored-by: Rahul Huilgol <huilgolr@amazon.com> Co-authored-by: Connor Goggins <cgoggins0@gmail.com> Co-authored-by: JC-Gu <jiacheg@amazon.com>
Re: amazon-sagemaker-examples/introduction_to_applying_machine_learning/gluon_recommender_system/
import pip
pip.main(['install', 'pandas'])
This method does not work with newer versions of pip, and the p2.xlarge instance does not seem to have pandas available. What is the recommended alternative for installing Python packages when the training and inference environments do not include a particular module?
Thank you.
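For reference, `pip.main` was moved out of pip's public API in pip 10, which is why the call above fails on newer versions. One commonly used alternative is to run pip as a subprocess of the current interpreter. The snippet below is only a minimal sketch of that approach: the helper names `build_pip_command` and `install` are hypothetical, and it assumes the training or inference container can reach a package index over the network.

```python
import subprocess
import sys


def build_pip_command(package):
    """Build the argv list that installs `package` with the running interpreter's pip.

    Hypothetical helper for illustration; not part of pip or SageMaker.
    """
    # sys.executable ensures the package lands in the same environment
    # that is executing this script, not some other Python on the PATH.
    return [sys.executable, "-m", "pip", "install", package]


def install(package):
    """Run pip in a child process; raises CalledProcessError if pip fails."""
    subprocess.check_call(build_pip_command(package))


# Usage (assumes network access from the container):
# install("pandas")
# import pandas  # import only after the install has finished
```

Calling `install` at the top of the entry-point script, before the first `import pandas`, keeps the dependency step in the same process environment that later imports the module.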