
docs: add model debug doc #1895

Merged
rb-determined-ai merged 4 commits into determined-ai:master from debug-doc on Feb 2, 2021

Conversation

@rb-determined-ai (Member) commented on Jan 30, 2021

Description

It is looking like the observability project is going to solve what I was trying to solve in #1625, only much better. As a result, I'm pulling out just the debug documentation as a separate PR to land now.

@rb-determined-ai added the documentation (Improvements or additions to documentation) label on Jan 30, 2021
@cla-bot added the cla-signed label on Jan 30, 2021
@rb-determined-ai mentioned this pull request on Jan 30, 2021
@neilconway (Contributor) left a comment:

Overall, I think this is great! I added a few minor comments.

I think the discussion might benefit from a bit more elaboration on the thought process behind these debugging steps. For example, you could start by saying: "code running in Determined is different from code running in your local environment in several ways: (a) it runs as part of the Trial API, (b) it runs in a container on a remote machine, and (c) it might run as part of distributed training. When debugging, it helps to isolate whether any of those environmental differences is the root cause of the observed issue." Then add some detail to each section, e.g., "now that we have confirmed that the code works locally, the next step is to ..."
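(An illustration of the "confirm the code works locally" step described above — a minimal sketch, not taken from the PR or the doc under review. `MyModel` and `make_batch` are hypothetical stand-ins for your own model class and data-loading code.)

```python
# Minimal local smoke test: run one training step in plain PyTorch,
# outside Determined entirely. If this fails, the bug is in the model
# code itself rather than in the Trial API, the container image, or
# distributed training.
import torch

from my_project import MyModel, make_batch  # hypothetical helpers

model = MyModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

inputs, labels = make_batch()  # one small batch of real or synthetic data
loss = torch.nn.functional.cross_entropy(model(inputs), labels)
loss.backward()
optimizer.step()
print("local smoke test passed, loss =", loss.item())
```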

(Inline review comments on docs/how-to/model-debug.txt, all since resolved; one thread is shown below.)

**How to diagnose failures:** If you are using the ``determined``
library APIs correctly, then theoretically distributed training should
"just work". However, you should be aware of some common pitfalls:
neilconway (Contributor):

Talk about networking issues? And maybe scheduling problems / hangs?

rb-determined-ai (Member, Author):

Added a detail about not being scheduled.

Are there common networking issues that affect distributed training but not normal training? Almost all of the networking issues I've helped debug affected even a single notebook.

Re: hangs, I think hangs in our harness are nearly 100% our responsibility. I can't think of any hangs that were due to the user doing something wrong.

neilconway (Contributor):

Yeah -- I mostly meant hangs due to dtrain trials not being scheduled.

Re: networking, I was thinking of issues like: (a) matching network interface names, (b) additional ports we use for NCCL, SSH, etc. traffic that we don't use for single-GPU training, and (c) issues with Istio or other proxies that seem to be more problematic for dtrain trials (e.g., https://determined-community.slack.com/archives/CV3MTNZ6U/p1607112511338900).
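(A hedged sketch related to (a) and (b): NCCL itself exposes standard environment variables for surfacing or pinning its networking choices. These are NCCL settings, not Determined ones, and `eth0` is only an assumed interface name; they must be set in the training environment before the distributed backend initializes.)

```python
import os

# Make NCCL log its interface, transport, and port choices, which helps
# diagnose the interface-matching and port issues mentioned above.
os.environ["NCCL_DEBUG"] = "INFO"

# Pin NCCL to a specific network interface ("eth0" is an assumed name;
# check `ip addr` on the actual machines).
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"
```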

rb-determined-ai (Member, Author):

I think things like port configurations belong in a separate "debug/validate your Determined cluster" document, where we would have a series of tests that new cluster admins could run to ensure that their users' code would work. I think this document should target the ML end user.

@rb-determined-ai (Member, Author):

Danny is going to update the rstfmt command to support non-auto-enumerated lists so that I can pass `make -C doc check`. Other than that, I got rid of all TODOs in the doc, and I think it's ready for another round of review.

@neilconway (Contributor) left a comment:

LGTM. Can we find a few places elsewhere in the docs to link to this doc? E.g., maybe add it as a suggestion in the Keras/PyTorch tutorials, add an FAQ entry, etc.

@rb-determined-ai (Member, Author):

I added it in several places:

  • FAQ
  • Keras/PyTorch/Estimator API docs
  • Model definition topic guides

@rb-determined-ai merged commit a00492c into determined-ai:master on Feb 2, 2021
@rb-determined-ai deleted the debug-doc branch on Feb 2, 2021
determined-dsw pushed a commit that referenced this pull request on Feb 3, 2021 (cherry picked from commit a00492c)
@dannysauer added this to the 0.14.2 milestone on Feb 6, 2024