
Wish clarification about two optimization strategies. #20

Closed
backyes opened this issue Apr 11, 2020 · 4 comments · Fixed by #42
Labels
feature New feature or request

Comments

@backyes

backyes commented Apr 11, 2020

  • How can DNN computation on the GPU be balanced with sampling computation on the CPU in graph-learn, when the GPU is fast but the data produced by CPU sampling cannot keep up? Generally, a latency-hiding technique is used to prefetch and buffer the samples produced by the CPU.

If this is not addressed, the GPU will not be fully utilized in some situations.

Some clarification on these issues would be appreciated, thanks a lot.

@baoleai added the feature (New feature or request) label Apr 13, 2020
@baoleai
Collaborator

baoleai commented Apr 13, 2020

Good questions.

  1. We are trying to parallelize sampling and make it asynchronous with the training process to improve GPU utilization (a generic sketch of this overlap pattern is shown after this list). Reducing sampling time through message fusion can also improve GPU utilization in distributed mode.

  2. Aggregator in core/operator/aggregator is a WIP feature, which will be used to optimize aggregation in distributed training through message fusion, see aggregator: C++ version vs Python version #15.
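To make point 1 concrete, here is a minimal sketch of the general producer-consumer pattern for overlapping CPU-side sampling with GPU-side training. This is not graph-learn's actual implementation; `sample_batch` and `train_step` are hypothetical placeholders for the real sampler and training step.

```python
import queue
import random
import threading
import time

BUFFER_SIZE = 8    # number of prefetched batches kept in the buffer
NUM_STEPS = 100

def sample_batch():
    # Stand-in for CPU-side neighbor sampling.
    time.sleep(0.01)  # simulate sampling latency
    return [random.randint(0, 1000) for _ in range(256)]

def train_step(batch):
    # Stand-in for the GPU forward/backward pass.
    time.sleep(0.005)

buffer = queue.Queue(maxsize=BUFFER_SIZE)

def producer():
    # Keep the buffer full so training never waits on sampling.
    while True:
        buffer.put(sample_batch())

threading.Thread(target=producer, daemon=True).start()

for _ in range(NUM_STEPS):
    batch = buffer.get()   # blocks only if sampling falls behind
    train_step(batch)
```

As long as sampling one batch takes no longer on average than one training step, the bounded queue keeps the GPU busy; otherwise more sampling threads (or faster sampling) are needed.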

@Seventeen17 linked a pull request on May 8, 2020 that will close this issue
@Seventeen17
Collaborator

I have opened a PR for the Aggregator, FYI @backyes.

@lorinlee
Contributor

@baoleai Hi, is "parallelizing sampling and making it asynchronous with training" already done, or is it still a work in progress? Thanks! Also, I'm confused about why tf.data.Dataset.prefetch is not used to do the sampling. I'm a beginner in TensorFlow, so maybe I have misunderstood this method.

@YijianLiu

> @baoleai Hi, is "parallelizing sampling and making it asynchronous with training" already done, or is it still a work in progress? Thanks! Also, I'm confused about why tf.data.Dataset.prefetch is not used to do the sampling. I'm a beginner in TensorFlow, so maybe I have misunderstood this method.

Have you solved this problem? I am trying to use this method to do sampling.
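Regarding tf.data.Dataset.prefetch: a minimal sketch of how CPU sampling could be wrapped in a tf.data pipeline and prefetched so it overlaps with GPU training is below. The `sample_batches` generator and the batch shape are assumptions for illustration, not graph-learn code.

```python
import numpy as np
import tensorflow as tf

def sample_batches():
    # Hypothetical stand-in for the CPU-side graph sampler.
    while True:
        yield np.random.rand(256, 64).astype("float32")

dataset = (
    tf.data.Dataset.from_generator(
        sample_batches,
        output_signature=tf.TensorSpec(shape=(256, 64), dtype=tf.float32),
    )
    .prefetch(tf.data.AUTOTUNE)  # prefetch batches while the GPU trains
)

for batch in dataset.take(100):
    pass  # replace with the actual training step, e.g. model.train_on_batch(batch)
```

Whether this fits graph-learn's sampling API depends on how samples are actually produced, which the maintainers can confirm.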
