
[Model] GraphSAGE with control variate sampling on new sampler #1355

Merged: 17 commits into dmlc:master from graphsage-cv on Apr 26, 2020

Conversation

@BarclayII (Collaborator) commented on Mar 12, 2020

Description

This implements control variate sampling from https://arxiv.org/abs/1710.10568 with the new sampler interface.
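For reference, the estimator in that paper aggregates only the difference between current and historical activations over the sampled neighbors, then adds a full-neighborhood aggregation of the (cheap, precomputed) historical activations as the control variate. A minimal PyTorch sketch, where cv_aggregate, P_full, P_sampled, h, and h_hist are illustrative names rather than the PR's actual API:

import torch

def cv_aggregate(P_full, P_sampled, h, h_hist):
    # P_sampled: sampled and rescaled propagation matrix (sparse, N x N)
    # P_full:    full propagation matrix (sparse, N x N)
    # h:         current activations (N x D)
    # h_hist:    historical activations (N x D)
    # Variance-reduced aggregation: the delta over sampled neighbors
    # plus the full aggregation of the historical activations.
    return torch.sparse.mm(P_sampled, h - h_hist) + torch.sparse.mm(P_full, h_hist)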

@zheng-da @lingfanyu Please take a look if you are interested.

TODO

  • Profile performance (speed and memory)
  • Verify correctness on multi-GPU
  • Better document the code
  • Refactor the code with vanilla GraphSAGE and extract common logic. (Skipping this for now.)

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

@BarclayII closed this on Mar 13, 2020
@BarclayII reopened this on Mar 13, 2020
@lingfanyu (Collaborator)

Wondering why not start with a non-distributed version?

@BarclayII (Collaborator, Author)

There's not much difference between a non-distributed version and a distributed one: the only differences are initializing a distributed context and synchronizing the gradients among GPUs.
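Concretely, those two pieces look roughly like this (a sketch assuming NCCL with one process per GPU; the function names and the address/port are placeholders, not the PR's code):

import torch
import torch.distributed as dist

def init_distributed(rank, world_size, dev_id):
    # Difference 1: initialize the distributed context.
    dist.init_process_group('nccl', init_method='tcp://127.0.0.1:12345',
                            rank=rank, world_size=world_size)
    torch.cuda.set_device(dev_id)

def sync_gradients(model, world_size):
    # Difference 2: average gradients across GPUs after backward().
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size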

@BarclayII marked this pull request as ready for review on March 28, 2020 11:55
"""
Copys features and labels of a set of nodes onto GPU.
"""
blocks[0].srcdata['features'] = g.ndata['features'][blocks[0].srcdata[dgl.NID]].to(dev_id)
Collaborator:
By the way, the original paper plays a trick: it aggregates the input features for the first layer in advance.
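That trick amounts to one full-graph aggregation of the raw input features before training, since the layer-0 inputs never change. A sketch using DGL's built-in message functions with a mean aggregator ('features_agg' is a hypothetical field name, not the PR's):

import dgl.function as fn

# One-time full-graph mean aggregation of the input features; the first
# layer can then read 'features_agg' instead of re-aggregating 'features'
# for every minibatch.
g.update_all(fn.copy_u('features', 'm'), fn.mean('m', 'features_agg'))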

examples/pytorch/graphsage/train_cv_multi_gpu.py
        else:
            assert isinstance(exception, Exception)
            raise exception.__class__(trace)
    return decorated_function
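For context, the full helper presumably looks something like the reconstruction below: run the function body in a fresh thread, ship any exception together with its formatted traceback back through a queue, and re-raise it in the caller (thread_wrapped_func matches the name used in other DGL examples; the details are a sketch, not the exact diff):

import traceback
from functools import wraps
from _thread import start_new_thread
import torch.multiprocessing as mp

def thread_wrapped_func(func):
    @wraps(func)
    def decorated_function(*args, **kwargs):
        queue = mp.Queue()
        def _queue_result():
            exception, trace, res = None, None, None
            try:
                res = func(*args, **kwargs)
            except Exception as e:
                exception = e
                trace = traceback.format_exc()
            # Ship the result (or the failure) back to the caller.
            queue.put((res, exception, trace))
        start_new_thread(_queue_result, ())
        result, exception, trace = queue.get()
        if exception is None:
            return result
        else:
            assert isinstance(exception, Exception)
            raise exception.__class__(trace)
    return decorated_function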
Collaborator:

Can we keep this somewhere inside DGL?

Collaborator (Author):

Totally agree. Will put it in utils.

It is everywhere in our newly contributed examples already.

Collaborator (Author):

I'm now wondering if utils is a good place to put this function, since it depends on PyTorch multiprocessing module.

@zheng-da (Collaborator)

Why not follow the original example?

@BarclayII (Collaborator, Author)

I thought using push could save communication compared to using pull. My bad. Will change it.

# Column name of the layer-(i + 1) history.
hist_col = 'hist_%d' % (i + 1)

# Move the freshly computed activations to CPU and scatter them back
# into the historical-activation storage for the seed nodes.
h_new = block.dstdata['h_new'].cpu()
g.ndata[hist_col][ids] = h_new
Member:

Will there be a data race issue?

Collaborator (Author):

Apparently the child processes don't share the same frame.
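For reference, those per-layer history columns would presumably be allocated once on CPU before the worker processes start, along these lines (n_layers and n_hidden are illustrative):

import torch

# Hypothetical setup: one CPU-resident history column per hidden layer,
# overwritten in place as minibatches produce new activations.
for i in range(n_layers - 1):
    g.ndata['hist_%d' % (i + 1)] = torch.zeros(g.number_of_nodes(), n_hidden)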

@jermainewang (Member) left a comment:

LGTM

@jermainewang (Member)

Other suggestions:

  • Add the training results to the README.
  • Does control-variate-based sampling converge faster? If so, you could highlight that in the README too.
  • There is a lot of code duplication between train_cv.py and train_cv_multi_gpu.py. I can see that putting everything in one script is potentially easier for our users to copy-paste, so I'll leave it to you to decide whether to put the common parts in a shared file.

@BarclayII (Collaborator, Author)

I'll leave them as is.

@BarclayII merged commit 97bb85d into dmlc:master on Apr 26, 2020
@BarclayII deleted the graphsage-cv branch on April 26, 2020