
Wish clarification about two optimization strategies. #20

Closed
backyes opened this issue Apr 11, 2020 · 4 comments · Fixed by #42
Labels
feature New feature or request

Comments

@backyes

backyes commented Apr 11, 2020

  • How can DNN computation on the GPU be balanced with sampling computation on the CPU in graph-learn, when the GPU is fast but the data produced by CPU sampling cannot keep up? Generally, a latency-hiding technique is used to prefetch and buffer the samples produced by the CPU.

If this is not addressed, the GPU will not be fully utilized in some situations.

Some clarification on these issues would be appreciated, thanks a lot.

@baoleai added the feature (New feature or request) label Apr 13, 2020
@baoleai
Collaborator

baoleai commented Apr 13, 2020

Good questions.

  1. We are trying to parallelize sampling and make it asynchronous with the training process to improve GPU utilization (a generic sketch of this overlap pattern is shown after this list). Reducing sampling time through message fusion can also improve GPU utilization in distributed mode.

  2. Aggregator in core/operator/aggregator is a WIP feature, which will be used to optimize aggregation in distributed training through message fusion, see aggregator: C++ version vs Python version #15.
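To make point 1 concrete, here is a minimal sketch of the general producer-consumer pattern for overlapping CPU-side sampling with GPU-side training. This is not graph-learn's actual implementation; `sample_batch` and `train_step` are hypothetical placeholders for the real sampler and training step.

```python
import queue
import random
import threading
import time

BUFFER_SIZE = 8    # number of prefetched batches kept in the buffer
NUM_STEPS = 100

def sample_batch():
    # Stand-in for CPU-side neighbor sampling.
    time.sleep(0.01)  # simulate sampling latency
    return [random.randint(0, 1000) for _ in range(256)]

def train_step(batch):
    # Stand-in for the GPU forward/backward pass.
    time.sleep(0.005)

buffer = queue.Queue(maxsize=BUFFER_SIZE)

def producer():
    # Keep the buffer full so training never waits on sampling.
    while True:
        buffer.put(sample_batch())

threading.Thread(target=producer, daemon=True).start()

for _ in range(NUM_STEPS):
    batch = buffer.get()   # blocks only if sampling falls behind
    train_step(batch)
```

As long as sampling one batch takes no longer on average than one training step, the bounded queue keeps the GPU busy; otherwise more sampling threads (or faster sampling) are needed.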

@Seventeen17 linked a pull request on May 8, 2020 that will close this issue
@Seventeen17
Collaborator

I have opened a PR for the Aggregator, FYI @backyes.

@lorinlee
Contributor

@baoleai Hi, is "parallelizing sampling and making it asynchronous with training" already done, or is it still a work in progress? Thanks! Also, I'm confused about why tf.data.Dataset.prefetch is not used to do the sampling. I'm a beginner in TensorFlow, so maybe I have misunderstood this method.

@YijianLiu

> @baoleai Hi, is "parallelizing sampling and making it asynchronous with training" already done, or is it still a work in progress? Thanks! Also, I'm confused about why tf.data.Dataset.prefetch is not used to do the sampling. I'm a beginner in TensorFlow, so maybe I have misunderstood this method.

Have you solved this problem? I am trying to use this method to do sampling.
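Regarding tf.data.Dataset.prefetch: a minimal sketch of how CPU sampling could be wrapped in a tf.data pipeline and prefetched so it overlaps with GPU training is below. The `sample_batches` generator and the batch shape are assumptions for illustration, not graph-learn code.

```python
import numpy as np
import tensorflow as tf

def sample_batches():
    # Hypothetical stand-in for the CPU-side graph sampler.
    while True:
        yield np.random.rand(256, 64).astype("float32")

dataset = (
    tf.data.Dataset.from_generator(
        sample_batches,
        output_signature=tf.TensorSpec(shape=(256, 64), dtype=tf.float32),
    )
    .prefetch(tf.data.AUTOTUNE)  # prefetch batches while the GPU trains
)

for batch in dataset.take(100):
    pass  # replace with the actual training step, e.g. model.train_on_batch(batch)
```

Whether this fits graph-learn's sampling API depends on how samples are actually produced, which the maintainers can confirm.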
