
docs: add model debug doc #1895

Merged
rb-determined-ai merged 4 commits into determined-ai:master from debug-doc on Feb 2, 2021

Conversation

@rb-determined-ai (Member) commented on Jan 30, 2021

Description

It is looking like the observability project is going to solve what I was trying to solve in #1625, only much better. As a result, I'm pulling out just the debug documentation as a separate PR to land now.

@rb-determined-ai added the documentation (Improvements or additions to documentation) label on Jan 30, 2021
@cla-bot added the cla-signed label on Jan 30, 2021
@rb-determined-ai mentioned this pull request on Jan 30, 2021
@neilconway (Contributor) left a comment:

Overall, I think this is great! I added a few minor comments.

I think the discussion might benefit from a bit more elaboration on the thought process behind these debugging steps. For example, you could start by saying: "code running in Determined is different from code running in your local environment in several ways: (a) it runs as part of the Trial API, (b) it runs in a container on a remote machine, and (c) it might run as part of distributed training. When debugging, it helps to isolate whether any of those environmental differences is the root cause of the observed issue." Then add some detail to each section, e.g., "now that we have confirmed that the code works locally, the next step is to ..."
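(An illustration of the "confirm the code works locally" step described above — a minimal sketch, not taken from the PR or the doc under review. `MyModel` and `make_batch` are hypothetical stand-ins for your own model class and data-loading code.)

```python
# Minimal local smoke test: run one training step in plain PyTorch,
# outside Determined entirely. If this fails, the bug is in the model
# code itself rather than in the Trial API, the container image, or
# distributed training.
import torch

from my_project import MyModel, make_batch  # hypothetical helpers

model = MyModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

inputs, labels = make_batch()  # one small batch of real or synthetic data
loss = torch.nn.functional.cross_entropy(model(inputs), labels)
loss.backward()
optimizer.step()
print("local smoke test passed, loss =", loss.item())
```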

(Inline review comments on docs/how-to/model-debug.txt, all since resolved; one thread is shown below.)

**How to diagnose failures:** If you are using the ``determined``
library APIs correctly, then theoretically distributed training should
"just work". However, you should be aware of some common pitfalls:
neilconway (Contributor):

Talk about networking issues? And maybe scheduling problems / hangs?

rb-determined-ai (Member, Author):

Added a detail about not being scheduled.

Are there common networking issues that affect distributed training but not normal training? Almost all of the networking issues I've helped debug affected even a single notebook.

Re: hangs, I think hangs in our harness are nearly 100% our responsibility. I can't think of any hangs that were due to the user doing something wrong.

neilconway (Contributor):

Yeah -- I mostly meant hangs due to dtrain trials not being scheduled.

Re: networking, I was thinking of issues like: (a) matching network interface names, (b) additional ports we use for NCCL, SSH, etc. traffic that we don't use for single-GPU training, and (c) issues with Istio or other proxies that seem to be more problematic for dtrain trials (e.g., https://determined-community.slack.com/archives/CV3MTNZ6U/p1607112511338900).
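(A hedged sketch related to (a) and (b): NCCL itself exposes standard environment variables for surfacing or pinning its networking choices. These are NCCL settings, not Determined ones, and `eth0` is only an assumed interface name; they must be set in the training environment before the distributed backend initializes.)

```python
import os

# Make NCCL log its interface, transport, and port choices, which helps
# diagnose the interface-matching and port issues mentioned above.
os.environ["NCCL_DEBUG"] = "INFO"

# Pin NCCL to a specific network interface ("eth0" is an assumed name;
# check `ip addr` on the actual machines).
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"
```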

rb-determined-ai (Member, Author):

I think things like port configurations belong in a separate "debug/validate your Determined cluster" document, where we would have a series of tests that new cluster admins could run to ensure that their users' code would work. I think this document should target the ML end user.

@rb-determined-ai (Member, Author):

Danny is going to update the rstfmt command to support non-auto-enumerated lists so that I can pass `make -C doc check`. Other than that, I got rid of all TODOs in the doc, and I think it's ready for another round of review.

@neilconway (Contributor) left a comment:

LGTM. Can we find a few places elsewhere in the docs to link to this doc? E.g., maybe add it as a suggestion in the Keras/PyTorch tutorials, add an FAQ entry, etc.

@rb-determined-ai (Member, Author):

I added it in several places:

  • FAQ
  • Keras/PyTorch/Estimator API docs
  • Model definition topic guides

@rb-determined-ai merged commit a00492c into determined-ai:master on Feb 2, 2021
@rb-determined-ai deleted the debug-doc branch on Feb 2, 2021
determined-dsw pushed a commit that referenced this pull request on Feb 3, 2021 (cherry picked from commit a00492c)
@dannysauer added this to the 0.14.2 milestone on Feb 6, 2024