Commit

fix: increase worker waiting time for ORTE proc (#178)
* fix: increase worker waiting time for ORTE proc

* remove todo tag to pass the pylint check

* pin sagemaker version to avoid a credential error
yl-to committed Apr 4, 2023
1 parent 5053a62 commit 5b5fac3
Showing 2 changed files with 4 additions and 3 deletions.
5 changes: 3 additions & 2 deletions src/sagemaker_training/mpi.py
@@ -171,8 +171,9 @@ def _wait_orted_process_to_finish():  # type: () -> None


 def _orted_process():  # pylint: disable=inconsistent-return-statements
-    """Wait a maximum of 5 minutes for orted process to start."""
-    for _ in range(5 * 60):
+    """Wait a maximum of 20 minutes for orted process to start."""
+    # the wait time here should be set to a dynamic value according to cluster size
+    for _ in range(20 * 60):
         procs = [p for p in psutil.process_iter(attrs=["name"]) if p.info["name"] == "orted"]
         if procs:
             logger.info("Process[es]: %s", procs)
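The new comment in `_orted_process` notes that the wait time should eventually depend on cluster size rather than being a fixed 20 minutes. A minimal sketch of what that could look like follows. It is not part of this commit: `orted_wait_seconds`, `wait_for_procs`, and the scaling constants (5-minute base, 10 seconds per additional host, 20-minute cap) are hypothetical names and values chosen for illustration, and the loop assumes one poll per second.

```python
import time
from typing import Callable, List, Optional


def orted_wait_seconds(
    num_hosts: int,
    base_seconds: int = 5 * 60,
    per_host_seconds: int = 10,
    max_seconds: int = 20 * 60,
) -> int:
    """Hypothetical heuristic: grow the timeout with cluster size, up to a cap."""
    if num_hosts < 1:
        raise ValueError("num_hosts must be at least 1")
    # A single host gets the base wait; every extra host adds a fixed increment.
    return min(max_seconds, base_seconds + per_host_seconds * (num_hosts - 1))


def wait_for_procs(
    find_procs: Callable[[], List[object]],
    timeout_seconds: int,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[List[object]]:
    """Poll once per second until find_procs() returns a non-empty list.

    Returns the list of processes, or None if the timeout elapses first.
    """
    for _ in range(timeout_seconds):
        procs = find_procs()
        if procs:
            return procs
        sleep(1)
    return None
```

Here `find_procs` stands in for the `psutil.process_iter` list comprehension shown in the diff, so the polling logic can be exercised without a running `orted` process. Capping the result at the 20-minute value from this commit keeps the worst case no longer than the fixed wait it would replace.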
2 changes: 1 addition & 1 deletion tox.ini
@@ -49,7 +49,7 @@ deps =
     pytest-asyncio
     mock
     awslogs
-    sagemaker[local]
+    sagemaker[local]==2.136.0
     numpy
     flask
     gunicorn