Fix script mode training hang with logging enabled #77

aws-patlin · 2019-12-02T21:24:31Z

Description of changes:
A user discovered that after a recent release, their training script stopped working when a stdout stream handler was added to the logger. The issue was traced back to this commit in sagemaker-containers.
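For reference, the kind of logger setup that triggers the problem might look like the sketch below. The function name `build_logger` and the logger name `train` are hypothetical; the only detail taken from the report is that a stream handler pointing at stdout is attached to the logger.

```python
import logging
import sys


def build_logger(name="train"):
    """Return a logger that writes INFO-level records to stdout.

    Hypothetical helper: the names here are illustrative only.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # Attach a stdout handler only once, so repeated calls do not
    # produce duplicate log lines.
    already_attached = any(
        isinstance(h, logging.StreamHandler)
        and getattr(h, "stream", None) is sys.stdout
        for h in logger.handlers
    )
    if not already_attached:
        logger.addHandler(logging.StreamHandler(sys.stdout))
    return logger


if __name__ == "__main__":
    logger = build_logger()
    logger.info("Starting training")
```

A script mode entry point that configures logging this way is ordinary Python, so the hang is surprising from the user's side.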

capture_error=True appends stderr to the error message that is raised if training fails. For context, this was a workaround specific to PyTorch, which can throw an error even if training succeeds, so I don't believe it is necessary for XGBoost.
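To make the behavior concrete, here is a minimal sketch of what "appending stderr to the error message" means in general. This is not the sagemaker-containers implementation; `run_and_capture` is a hypothetical helper built on the standard `subprocess` module, and its `capture_error` parameter only mimics the flag discussed above.

```python
import subprocess
import sys


def run_and_capture(cmd, capture_error=True):
    """Run cmd; on failure, raise an error that optionally includes stderr.

    Hypothetical helper for illustration only.
    """
    # With capture_error, the child's stderr goes through a pipe instead of
    # being inherited from the parent process.
    stderr = subprocess.PIPE if capture_error else None
    proc = subprocess.run(cmd, stderr=stderr, text=True)
    if proc.returncode != 0:
        message = "Command failed with exit code {}".format(proc.returncode)
        if capture_error and proc.stderr:
            message += "\n" + proc.stderr
        raise RuntimeError(message)
    return proc.returncode


if __name__ == "__main__":
    failing = [sys.executable, "-c", "import sys; sys.stderr.write('boom'); sys.exit(3)"]
    try:
        run_and_capture(failing)
    except RuntimeError as err:
        print(err)
```

With capture disabled, the child process writes to the parent's stderr directly and the raised error carries only the exit code.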

Corresponding PR for 0.90-2: #78

Testing:
Using the prod image with capture_error enabled, the training script would hang with no log output. With capture_error disabled on a custom image, I was able to complete the training job successfully with the expected log output.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

… image (#179)

* Bump Python to 3.7.10
* Merge commits from 0.90-1 back to reverted master
* Fix CSV Pipe parsing argument to use weight instead of weights. Fix requirements for tox. (#81)
* Fix script mode training hang with logging enabled. (#77)
* Fix training unit test to match PR #77. (#84)
* Fix label concatenation for RecordIO-protobuf dmatrix (#85) Closes #83
* Add verbosity to hyperparameter validation. (#87)
* Add verbosity to hyperparameter validation.
* Set scipy requirement to 1.2.2 for sagemaker-containers.
* Add missing eval_metrics to hp validation. (#82)
* Added aucpr and cox-nloglik to eval_metric hp validation.
* Add two separate list for MAXIMIZE and MINIMIZE metrics.

Co-authored-by: ericangelokim <39601338+ericangelokim@users.noreply.github.com>
Co-authored-by: Patrick Lin <52252844+aws-patlin@users.noreply.github.com>
Co-authored-by: rizwangilani <rizwan.gl@gmail.com>

Fix script mode training hang with logging enabled.

ecedc4e

aws-patlin requested review from iyerr3 and ericangelokim December 2, 2019 21:24

aws-patlin mentioned this pull request Dec 2, 2019

Fix script mode training hang with logging enabled. #78

Closed

laurenyu approved these changes Dec 3, 2019

ericangelokim approved these changes Dec 4, 2019

aws-patlin merged commit 363e39d into aws:master Dec 4, 2019

aws-patlin added a commit to aws-patlin/sagemaker-xgboost-container that referenced this pull request Dec 5, 2019

Fix training unit test to match PR aws#77.

b1c894a

aws-patlin added a commit to aws-patlin/sagemaker-xgboost-container that referenced this pull request Dec 5, 2019

Fix training unit test to match PR aws#77.

bf4281e

aws-patlin mentioned this pull request Dec 5, 2019

Fix training unit test to match PR #77. #84

Merged

aws-patlin added a commit that referenced this pull request Dec 5, 2019

Fix training unit test to match PR #77. (#84)

5656a9a

edwardjkim pushed a commit to edwardjkim/sagemaker-xgboost-container that referenced this pull request Mar 17, 2021

Fix script mode training hang with logging enabled. (aws#77)

f12afc7

edwardjkim pushed a commit to edwardjkim/sagemaker-xgboost-container that referenced this pull request Mar 17, 2021

Fix training unit test to match PR aws#77. (aws#84)

6c67a66

edwardjkim pushed a commit to edwardjkim/sagemaker-xgboost-container that referenced this pull request Mar 17, 2021

Fix script mode training hang with logging enabled. (aws#77)

c69703d

edwardjkim pushed a commit to edwardjkim/sagemaker-xgboost-container that referenced this pull request Mar 17, 2021

Fix training unit test to match PR aws#77. (aws#84)

3a71dfc
