
[BugFix] Make program thread-local to support multi-threading #338

Merged: 5 commits, Jan 13, 2019


lingfanyu
Collaborator

Description

DGLGraph uses one global execution plan (program), which causes data races when multi-threading is used (e.g. PyTorch DataParallel). This PR fixes the bug by making the schedule program a threading.local object (#302).
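To see why a single global program races under threads, here is a minimal sketch that is not DGL code: `Program`, `run_two_replicas`, and the getter/setter helpers are hypothetical stand-ins for the scheduler's "current program" state. Two threads (playing the role of DataParallel replicas) each install their own program and then record an op into whatever the current program is. A `threading.Barrier` forces both installs to happen before either record, so the outcome is the same on every run.

```python
import threading


class Program:
    """Hypothetical stand-in for an execution plan: a list of recorded ops."""

    def __init__(self, owner):
        self.owner = owner
        self.ops = []


# Buggy pattern: one slot shared by every thread.
_shared = {"prog": None}

def set_shared(prog):
    _shared["prog"] = prog

def get_shared():
    return _shared["prog"]


# Fixed pattern: threading.local gives each thread its own slot.
_local = threading.local()

def set_local(prog):
    _local.prog = prog

def get_local():
    return _local.prog


def run_two_replicas(set_current, get_current):
    """Run two replica threads and return {owner: ops recorded in its program}."""
    barrier = threading.Barrier(2)
    programs = {}

    def replica(name):
        prog = Program(name)
        programs[name] = prog
        set_current(prog)
        # Wait until both threads have installed their program,
        # so the record step below always sees the same interleaving.
        barrier.wait()
        get_current().ops.append("op from " + name)

    threads = [threading.Thread(target=replica, args=(n,)) for n in ("gpu0", "gpu1")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return {name: prog.ops for name, prog in programs.items()}


if __name__ == "__main__":
    print(run_two_replicas(set_shared, get_shared))  # one program ends up with both ops
    print(run_two_replicas(set_local, get_local))    # each program holds only its own op
```

With the shared slot, whichever thread installed its program last "wins", so both replicas record into the same program and the other stays empty. With `threading.local`, each thread reads back the program it installed.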

Checklist

  • The PR title starts with [$CATEGORY] (such as [Model], [Doc], [Feature])
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change,
    or have been fixed to be compatible with this change
  • Related issue is referenced in this PR

Changes

  • Keep the reference to the current schedule program in a threading.local object.
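The change can be sketched as follows, assuming the scheduler keeps its current program in a module-level `threading.local()`. The names `current_program`, `use_program`, and `record` are hypothetical, not DGL's API, and the program is modelled as a plain list of op names.

```python
import threading
from contextlib import contextmanager

# Hypothetical module-level state; one independent "program" slot per thread.
_state = threading.local()


def current_program():
    """Return this thread's active program, or None if none is active."""
    return getattr(_state, "program", None)


@contextmanager
def use_program(program):
    """Make `program` current for the calling thread only; restore the previous one on exit."""
    previous = current_program()
    _state.program = program
    try:
        yield program
    finally:
        _state.program = previous


def record(op):
    """Append an op to the calling thread's current program."""
    prog = current_program()
    if prog is None:
        raise RuntimeError("no active program in this thread")
    prog.append(op)
```

For example, `with use_program(prog): record("op1")` appends to `prog` without affecting any other thread. Restoring the previous program in `finally` (instead of resetting to `None`) keeps nested uses correct and leaves the slot clean even if recording raises.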

@lingfanyu
Collaborator Author

@yzh119 This PR should fix the multi-GPU issue. Can you double-check?

@yzh119
Member

yzh119 commented Jan 5, 2019

Thanks, the PR solved my problem.

@yzh119
Member

yzh119 commented Jan 7, 2019

Well, I find that enabling multi-GPU currently does not speed up training: only one GPU is ever active during training.

@lingfanyu
Collaborator Author

@yzh119 Sounds weird. When I ran your transformer on a 4-GPU instance, all GPUs were active but with low utilization (<25%).

Member

@jermainewang jermainewang left a comment


LGTM. The performance issue seems to be caused by multi-threading itself, which is outside the scope of this PR.

@lingfanyu lingfanyu merged commit ed1948b into dmlc:master Jan 13, 2019
@lingfanyu lingfanyu deleted the fix-multigpu-program branch January 13, 2019 03:04
@jermainewang jermainewang mentioned this pull request Feb 18, 2019
26 tasks

3 participants