
[Model] Refine GraphSAINT #3328

Merged
merged 36 commits into dmlc:master on Oct 7, 2021

Conversation

LspongebobJH
Contributor

@LspongebobJH LspongebobJH commented Sep 6, 2021

Description

A refined version of GraphSAINT that implements online+offline sampling, plus concurrent sampling using the multi-process workers of torch.DataLoader.
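For context, a minimal sketch of the online vs. offline distinction, assuming a uniform node sampler (names are illustrative; the PR's actual samplers use probability-based node/edge/random-walk schemes):

```python
import dgl
import torch

def sample_subgraph(g, node_budget):
    # Illustrative node sampler: draw `node_budget` nodes uniformly at random
    # and take the induced subgraph. The real samplers weight the draw by
    # degree-based probabilities.
    nodes = torch.randperm(g.num_nodes())[:node_budget]
    return dgl.node_subgraph(g, nodes)

g = dgl.rand_graph(10000, 40000)

# Offline sampling: pre-sample a fixed pool of subgraphs before training
# (GraphSAINT also estimates its normalization coefficients from this pool).
pool = [sample_subgraph(g, 1000) for _ in range(32)]

# Online sampling: draw a fresh subgraph at every training iteration instead.
for _ in range(32):
    sg = sample_subgraph(g, 1000)
    # ... forward/backward on sg ...
```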

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change,
    or have been fixed to be compatible with this change
  • The related issue is referenced in this PR
  • If the PR is for a new model/paper, I've updated the example index here.

Changes

  • Implement online sampling in the training phase
  • Implement concurrent sampling in the pre-sampling phase using the multi-process workers of torch.DataLoader (see the sketch after this list)
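A minimal sketch of the DataLoader-based concurrent sampling idea (the dataset class and names are hypothetical; the PR's actual `SAINTSampler` differs): each `__getitem__` call samples one subgraph, so the `num_workers` worker processes sample in parallel during pre-sampling.

```python
import dgl
import torch
from torch.utils.data import DataLoader, Dataset

class SubgraphDataset(Dataset):
    """Hypothetical dataset: every item is a freshly sampled subgraph."""

    def __init__(self, g, node_budget, num_subgraphs):
        self.g = g
        self.node_budget = node_budget
        self.num_subgraphs = num_subgraphs

    def __len__(self):
        return self.num_subgraphs

    def __getitem__(self, idx):
        # Runs inside a DataLoader worker process, so both the sampling and
        # the subgraph conversion are spread across `num_workers` processes.
        nodes = torch.randperm(self.g.num_nodes())[:self.node_budget]
        return dgl.node_subgraph(self.g, nodes)

def collate_one(batch):
    # batch_size=1, so each batch holds exactly one subgraph.
    return batch[0]

if __name__ == "__main__":
    g = dgl.rand_graph(10000, 40000)
    loader = DataLoader(
        SubgraphDataset(g, node_budget=1000, num_subgraphs=64),
        batch_size=1,
        num_workers=4,          # concurrent sampling processes
        collate_fn=collate_one,
    )
    subgraph_pool = list(loader)  # the offline pre-sampling pass
```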

cc @mufeili

Checked the basic pipeline of the code. Next: check the details of the samplers, the GCN layer (forward propagation), and the loss (backward propagation).
There are still some bugs in the sampling during the training procedure.
Validated on the ppi_node experiments; other setups have not been tested yet.
1. Online sampling in the ppi_node experiments performs very well.
2. Sampling speed is somewhat slow because of the `dgl.subgraph` operations; the next step is to speed this up by parallelizing the conversion.
3. Figure out why the offline+online sampling method performs poorly, which is unexpected.
4. Run experiments on the other setups.
Use torch.DataLoader to speed up SAINT sampling, with experiments. Apart from Amazon, which is too large, we have run experiments on the other four datasets: ppi, flickr, reddit, and yelp. Preliminary results show that both the time consumed and the metrics reach a reasonable level. The next step is to use a more precise profiler, line_profiler, to measure where the time goes, and to tune num_workers so that sampling on certain datasets gets faster (see the profiling sketch below).
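A sketch of the proposed line_profiler workflow (file and function names are hypothetical): decorate the hot function and run the script under kernprof, which injects the `@profile` decorator at runtime.

```python
# profile_sampling.py -- hypothetical script; run it with:
#   kernprof -l -v profile_sampling.py
import dgl
import torch

@profile  # provided by kernprof at runtime; undefined under plain `python`
def sample_subgraph(g, node_budget):
    nodes = torch.randperm(g.num_nodes())[:node_budget]
    return dgl.node_subgraph(g, nodes)

g = dgl.rand_graph(10000, 40000)
for _ in range(100):
    sample_subgraph(g, 1000)
```

kernprof then prints per-line timings for `sample_subgraph`, which makes it easy to see whether the node draw or the `dgl.node_subgraph` conversion dominates, and to tune num_workers accordingly.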
Reorganize some codes and comments.
Fix the bugs that kept fully offline sampling and the authors' version from working.
Reorganize files and code, then run experiments comparing the performance of offline and online sampling.
@dgl-bot
Collaborator

dgl-bot commented Sep 6, 2021

To trigger regression tests:

  • @dgl-bot run [instance-type] [which tests] [compare-with-branch];
    For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master

@mufeili mufeili self-requested a review September 7, 2021 06:29
@mufeili
Member

mufeili commented Sep 12, 2021

Done a first pass. cc @BarclayII in case you want to take a look.

@BarclayII BarclayII self-requested a review September 13, 2021 04:37
@mufeili mufeili added this to PR in progress in DGL Tracker via automation Sep 13, 2021
@mufeili mufeili moved this from PR in progress to Review in progress in DGL Tracker Sep 13, 2021
1. handle the directory named 'graphsaintdata'
2. control moving the graph between GPU and CPU for the large dataset ('amazon')
3. remove the 'train' parameter
4. refine the comments of the sampler
5. update README.md, including dataset info, dependency info, etc.
explain the config differences in the TEST part
remove a sampling-time variable
make 'online' an argument
change 'norm' to 'sampler'
explain the parameters in README.md
* make online an argument
* refine README.md
* refine the `collate_fn` code in sampler.py: the training phase only returns one subgraph, so there is no need to check whether the number of subgraphs is larger than 1 (see the sketch below)
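A minimal sketch of that simplification (illustrative, not the PR's exact sampler.py):

```python
def collate_fn(batch):
    # The training phase uses batch_size=1, so `batch` always contains
    # exactly one subgraph; no length check is needed.
    return batch[0]
```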
Confirm that the problem on flickr is overfitting.
LspongebobJH and others added 8 commits September 23, 2021 11:05
Fix the overfitting problem on the `flickr` dataset. We need to restrict the number of subgraphs (and hence the number of iterations) used in each training epoch; otherwise the model can overfit by the time we validate at the end of each epoch. The limit follows a formula specified by the authors (see the sketch after this list).

* Add a new flag `full` specifying whether the number of subgraphs used in the training phase equals the number of pre-sampled subgraphs

* Modify the code and comments related to the new flag

* Add a new parameter `node_budget` to the base class `SAINTSampler` for computing that formula
* Finish the experiments on Flickr, done after adding the new `full` flag
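A sketch of that iteration limit, assuming it follows the authors' convention of roughly one pass over the training nodes per epoch (the exact formula lives in the authors' code):

```python
import math

def num_iters_per_epoch(num_train_nodes, node_budget):
    # Assumed formula: cap the iterations so that each epoch touches about
    # `num_train_nodes` nodes in total, in expectation.
    return math.ceil(num_train_nodes / node_budget)

# Illustrative numbers: 44,625 training nodes with a node budget of 6,000
# would cap an epoch at 8 iterations.
print(num_iters_per_epoch(44625, 6000))  # -> 8
```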
* use half of the edges in the original graph for sampling
* test dgl.random.choice, with and without replacement, on the half-edges
~ next: test whether moving the probability calculation out of __getitem__ speeds up sampling, and try to implement the authors' sampling method (see the sketch after this list)
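A sketch of the half-edge experiment (the per-edge probabilities are illustrative; GraphSAINT derives them from node degrees), assuming `dgl.random.choice`'s `(a, size, replace, prob)` signature:

```python
import dgl
import torch

num_edges = 40000
num_half_edges = num_edges // 2  # one direction of each bidirectional edge pair

# Illustrative per-edge sampling weights, normalized to a distribution.
prob = torch.rand(num_half_edges)
prob = prob / prob.sum()

edge_budget = 1000
picked_with_repl = dgl.random.choice(num_half_edges, edge_budget,
                                     replace=True, prob=prob)
picked_without_repl = dgl.random.choice(num_half_edges, edge_budget,
                                        replace=False, prob=prob)
```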
* use Cython to implement per-edge edge sampling
* run experiments measuring the time consumed and the performance
** the time consumed dropped to approximately 480s, while the performance dropped by about 5 points
* deprecate the Cython implementation
* This reverts commit 4ba4f09
* Deprecate the Cython implementation
* Keep the half-edges mechanism
* delete unnecessary comments
Member

@mufeili mufeili left a comment


Well done

DGL Tracker automation moved this from Review in progress to Reviewer approved Oct 7, 2021
@mufeili mufeili merged commit aef96df into dmlc:master Oct 7, 2021
DGL Tracker automation moved this from Reviewer approved to Done Oct 7, 2021
@mufeili
Member

mufeili commented Oct 7, 2021

@BarclayII Currently the edge sampler is still not as efficient as the authors' Cython implementation, particularly in the case of Amazon. We might want to have a more efficient implementation in DGL core.

BarclayII pushed a commit that referenced this pull request Nov 5, 2021