Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GraphBolt] Update ItemSampler #7408

Merged
merged 12 commits into from
May 21, 2024
Merged

Conversation

Skeleton003
Copy link
Collaborator

@Skeleton003 Skeleton003 commented May 15, 2024

Description

  1. Update ItemSampler to support correct stochastic sharding across distributed groups.
  2. Modify the logic of ItemSet.__getitem__() when index is an iterable of int.

Benchmark: https://docs.google.com/document/d/1Pzk2PJoFtTZSu17wTXVK4mqvfrMLAj2xK6fcGC1pwEg/edit?usp=sharing

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature]])
  • I've leverage the tools to beautify the python and c++ code.
  • The PR is complete and small, read the Google eng practice (CL equals to PR) to understand more about small PR. In DGL, we consider PRs with less than 200 lines of core code change are small (example, test and documentation could be exempted).
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
  • Related issue is referred in this PR
  • If the PR is for a new model/paper, I've updated the example index here.

Changes

@dgl-bot
Copy link
Collaborator

dgl-bot commented May 15, 2024

To trigger regression tests:

  • @dgl-bot run [instance-type] [which tests] [compare-with-branch];
    For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master

@mfbalin
Copy link
Collaborator

mfbalin commented May 15, 2024

The fact that runtime performance is unchanged is good. However, to verify whether the old or the new implementation is more performant, we need to track the CPU utilization. Since ItemSampler and rest of the sampling pipeline runs concurrently, runtime is not enough information to determine that.

@dgl-bot
Copy link
Collaborator

dgl-bot commented May 15, 2024

Commit ID: 0134ddf

Build ID: 1

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@Skeleton003
Copy link
Collaborator Author

The fact that runtime performance is unchanged is good. However, to verify whether the old or the new implementation is more performant, we need to track the CPU utilization. Since ItemSampler and rest of the sampling pipeline runs concurrently, runtime is not enough information to determine that.

Sure, could u please give me some guide on how to do that?

@mfbalin
Copy link
Collaborator

mfbalin commented May 15, 2024

The fact that runtime performance is unchanged is good. However, to verify whether the old or the new implementation is more performant, we need to track the CPU utilization. Since ItemSampler and rest of the sampling pipeline runs concurrently, runtime is not enough information to determine that.

Sure, could u please give me some guide on how to do that?

The simplest way to monitor htop -d5 or nvtop to see what is the CPU utilization. If you wanted to be more precise, you could also insert timing code into ItemSampler to see how long each call takes. It may also make sense to write a regression benchmark so that we can monitor its runtime. We can have a sampling pipeline only containing the ItemSampler and benchmark it.

Copy link
Collaborator

@Rhett-Ying Rhett-Ying left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code looks clean and code to me.

One more thing is please benchmark on larger dataset with larger batch-size such as ogbn-papers100M and heterogenous dataset, link prediction datasets(to measure the perf of indexing on tuple of tensors). Let's make sure it's performance efficient on most common datasets.

python/dgl/graphbolt/item_sampler.py Show resolved Hide resolved
python/dgl/graphbolt/item_sampler.py Show resolved Hide resolved
python/dgl/graphbolt/item_sampler.py Show resolved Hide resolved
python/dgl/graphbolt/item_sampler.py Show resolved Hide resolved
@dgl-bot
Copy link
Collaborator

dgl-bot commented May 16, 2024

Commit ID: 428d47a3b483c1e7f9b403124a56763afbac038b

Build ID: 2

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@dgl-bot
Copy link
Collaborator

dgl-bot commented May 16, 2024

Commit ID: fa3ccb337322150991a773951da7f74ed897e8a7

Build ID: 3

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@dgl-bot
Copy link
Collaborator

dgl-bot commented May 17, 2024

Commit ID: bacdfb1c6088d51474c3581e84ef998da9d6185b

Build ID: 4

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@Skeleton003
Copy link
Collaborator Author

@mfbalin Thank you for your valuable suggestions, but considering the scope of this PR, I'd like to defer them to another PR.

@Rhett-Ying If everything looks good to you, please approve so I can work on.

@Rhett-Ying
Copy link
Collaborator

@Skeleton003 According to the doc, examples/sampling/graphbolt/link_prediction.py runs a bit slower. Please keep an eye on the regression results.

@dgl-bot
Copy link
Collaborator

dgl-bot commented May 21, 2024

Commit ID: 1d69f6b7b4119f7cefb5801ba4ea01c94c573a4f

Build ID: 5

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@dgl-bot
Copy link
Collaborator

dgl-bot commented May 21, 2024

Commit ID: 4987175ffa9db59998db8797ced119ee91c8d946

Build ID: 6

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@dgl-bot
Copy link
Collaborator

dgl-bot commented May 21, 2024

Commit ID: a4b3f08b42d47b5105adaa6190a7a14593bc9ca6

Build ID: 7

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@Skeleton003 Skeleton003 merged commit 3574bff into dmlc:master May 21, 2024
2 checks passed
@Skeleton003 Skeleton003 mentioned this pull request Jun 3, 2024
9 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants