docs: add model debug doc #1895
Conversation
Overall, I think this is great! I added a few minor comments.
I think the discussion might benefit from a bit more elaboration on the thought process behind these debugging steps. For example, you could start by saying "code running in Determined is different from code running in your local environment in several ways: (a) runs as part of the Trial API (b) runs in a container on a remote machine (c) might run as part of distributed training. When debugging, it helps to isolate whether any of those environmental differences are the root cause of any observed issues." And then add some details to each section, "now that we have confirmed that the code works locally, the next step is to ..."
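To make the "confirm that the code works locally" step concrete, here is a minimal sketch of that kind of check: a single forward/backward pass run with plain PyTorch, entirely outside Determined. The model, batch shape, and hyperparameters are placeholders rather than anything from the actual tutorial code.

```python
# Hypothetical local sanity check: run one training step of the model code by
# itself, outside Determined, on a tiny synthetic batch. If this fails, the
# bug is in the model code itself, not in the Trial API, the container image,
# or distributed training.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

inputs = torch.randn(8, 10)         # synthetic batch of 8 examples
labels = torch.randint(0, 2, (8,))  # synthetic binary labels

optimizer.zero_grad()
loss = loss_fn(model(inputs), labels)
loss.backward()
optimizer.step()
print(f"one training step succeeded, loss={loss.item():.4f}")
```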
docs/how-to/model-debug.txt (Outdated)

    **How to diagnose failures:** If you are using the ``determined``
    library APIs correctly, then theoretically distributed training should
    "just work". However, you should be aware of some common pitfalls:
Talk about networking issues? And maybe scheduling problems / hangs?
Added a detail about not being scheduled.
Are there common networking issues that affect distributed training and not normal training? Almost all of the networking issues I've helped debug were issues that affected even a single notebook.
Re: hangs, I think hangs in our harness are nearly 100% our responsibility. I can't think of any hangs that were due to the user doing something wrong.
Yeah -- I mostly meant hangs due to dtrain trials not being scheduled.
Re: networking, I was thinking of issues like: (a) matching network interface names, (b) the additional ports we use for NCCL, ssh, etc. traffic that we don't use for single-GPU training, and (c) issues with istio or other proxies that seem to be more problematic for dtrain trials (e.g., https://determined-community.slack.com/archives/CV3MTNZ6U/p1607112511338900).
I think things like port configurations belong in a separate "debug/validate your Determined cluster" document, where we would have a series of tests that new cluster admins could run to ensure that their users' code would work. I think this document should target the ML end user.
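As a rough illustration of the kind of cluster-validation test described above, here is a minimal sketch that checks whether a peer agent machine is reachable on the ports that distributed training depends on. The peer address and port numbers are placeholders; the real values would come from the cluster configuration, and this script is not part of the Determined CLI or library.

```python
# Hypothetical cluster-validation check (not part of Determined): from one
# agent machine, verify that a peer agent is reachable on the ports that
# distributed training traffic (ssh, NCCL/rendezvous, etc.) would use.
# PEER_AGENT and PORTS_TO_CHECK are placeholders -- substitute the real
# values from your cluster configuration.
import socket

PEER_AGENT = "10.0.0.12"             # placeholder: address of another agent
PORTS_TO_CHECK = [22, 12350, 12351]  # placeholder port numbers

for port in PORTS_TO_CHECK:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(2.0)
        try:
            sock.connect((PEER_AGENT, port))
            print(f"port {port}: reachable")
        except OSError as exc:
            print(f"port {port}: NOT reachable ({exc})")
```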
Force-pushed from 2776641 to 3c7ed17
Force-pushed from 3c7ed17 to adc1c5f
Danny is going to update the …
lgtm. Can we find a few places to add links to this doc elsewhere in the docs? e.g., maybe add it as a suggestion in the Keras/PyTorch tutorials, or add an FAQ, etc.
I added it in several places:
(cherry picked from commit a00492c)
Description
It looks like the observability project is going to solve what I was trying to solve in #1625, only much better. As a result, I'm pulling out just the debug documentation as a separate PR to land now.