[WIP] Out-of-core training on V3 #3762

Open: wants to merge 46 commits into base: v3

6 participants
imaihal commented Nov 1, 2017

This PR enables out-of-core (OoC) training on Chainer v3. Out-of-core training makes it possible to train very large models whose memory usage exceeds GPU memory, by using CPU memory as a swap area. This PR is a successor of #2027 (v1 OoC), and its basic functions are ported from @anaruse's branch for v2 OoC. Credit for those functions should go to @anaruse. Our contributions are as follows:

  • Ported automatic breakpoint insertion, which splits the computational graph, to v3
  • Ported the swap-in/swap-out functions to v3
  • Enabled data-parallel execution for OoC
  • Added enlarged GoogLeNet and ResNet models and examples that use them
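
To illustrate the general idea of using CPU memory as a swap area, here is a toy simulation in plain Python. It is not the code in this PR: the class `ToySwapMemory`, the function `run_toy_training`, the abstract size units, and the "evict the oldest buffer" policy are all assumptions made for illustration. No real GPU or Chainer API is involved.

```python
# Toy simulation of out-of-core training. NOT the implementation in this PR.
# "GPU" and "CPU" memory are plain dicts; sizes are abstract units.


class ToySwapMemory:
    def __init__(self, gpu_capacity):
        self.gpu_capacity = gpu_capacity
        self.gpu = {}  # name -> size of buffers resident on the "GPU"
        self.cpu = {}  # name -> size of buffers swapped out to the "CPU"
        self.peak_gpu = 0

    def gpu_used(self):
        return sum(self.gpu.values())

    def swap_out(self, name):
        # Move a buffer from "GPU" to "CPU" memory.
        self.cpu[name] = self.gpu.pop(name)

    def allocate(self, name, size):
        # Assumed policy: evict the oldest resident buffers until the new one fits.
        while self.gpu and self.gpu_used() + size > self.gpu_capacity:
            self.swap_out(next(iter(self.gpu)))
        if size > self.gpu_capacity:
            raise MemoryError(f"{name} does not fit even in an empty GPU")
        self.gpu[name] = size
        self.peak_gpu = max(self.peak_gpu, self.gpu_used())

    def swap_in(self, name):
        # Bring a buffer back to the "GPU" if it was swapped out.
        if name not in self.gpu:
            self.allocate(name, self.cpu.pop(name))

    def free(self, name):
        self.gpu.pop(name, None)
        self.cpu.pop(name, None)


def run_toy_training(activation_sizes, gpu_capacity):
    """Run one simulated forward/backward pass and return peak "GPU" usage."""
    mem = ToySwapMemory(gpu_capacity)
    names = [f"act{i}" for i in range(len(activation_sizes))]
    # Forward pass: each layer's activation is kept for the backward pass.
    for name, size in zip(names, activation_sizes):
        mem.allocate(name, size)
    # Backward pass: visit layers in reverse, swapping activations back in.
    for name in reversed(names):
        mem.swap_in(name)
        mem.free(name)
    return mem.peak_gpu


# Four activations of size 4 (16 units in total) on a "GPU" with 8 units.
print(run_toy_training([4, 4, 4, 4], gpu_capacity=8))  # prints 8
```

In this sketch the activations need 16 units in total, yet peak "GPU" usage stays at the 8-unit capacity because older activations are moved to "CPU" memory during the forward pass and brought back during the backward pass.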

Please see the examples (examples/imagenet/train_imagenet_OOC_ibm.py and examples/imagenet/train_imagenet_data_parallel_OOC_ibm.py) to run this feature. As the examples show, "with chainer.out_of_core_mode()" enables the out-of-core functions. The command-line options "--ooc" and "--insize" are added in the examples. You don't have to modify model files.
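
As a rough sketch of how a training script might wire these pieces together, the snippet below parses the two options and enters a context manager only when out-of-core mode is requested. Everything here is an assumption for illustration: `out_of_core_mode` is a local stand-in that only toggles a flag (it is not the PR's `chainer.out_of_core_mode()`), `--ooc` is assumed to be a boolean switch, `--insize` is assumed to be an integer with a default of 224, and `train` is a placeholder.

```python
import argparse
import contextlib

_ooc_enabled = False  # hypothetical flag; stands in for library-internal state


def ooc_enabled():
    return _ooc_enabled


@contextlib.contextmanager
def out_of_core_mode():
    """Local stand-in for the PR's chainer.out_of_core_mode() (assumption)."""
    global _ooc_enabled
    previous = _ooc_enabled
    _ooc_enabled = True
    try:
        yield
    finally:
        # Restore the previous state even if training raises.
        _ooc_enabled = previous


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Option names come from the PR description; types and defaults are assumed.
    parser.add_argument("--ooc", action="store_true",
                        help="enable out-of-core training")
    parser.add_argument("--insize", type=int, default=224,
                        help="input image size")
    return parser.parse_args(argv)


def train(args):
    # Placeholder for the real training loop: report the effective settings.
    return {"ooc": ooc_enabled(), "insize": args.insize}


def main(argv=None):
    args = parse_args(argv)
    if args.ooc:
        with out_of_core_mode():
            return train(args)
    return train(args)


print(main(["--ooc", "--insize", "448"]))  # prints {'ooc': True, 'insize': 448}
```

The model code itself is untouched in this sketch; only the call site of the training loop changes, which matches the statement that model files need no modification.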

I created another PR against CuPy v2 (Swap in/out between GPU and CPU memory #694), and this PR depends on it.

imaihal and others added some commits Sep 26, 2017

Merge remote-tracking branch 'origin/3.0.0rc1-trl-ooc-parallel-fix-re…
…dundant-breakpoints-withoutOOC' into 3.0.0rc1-trl-ooc-parallel
Merge branch 'v3-trl-ooc-pr' of github.ibm.com:TUNG/chainer into v3-t…
…rl-ooc-pr

Conflicts:
	chainer/functions/normalization/local_response_normalization.py
stale bot commented Mar 7, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Mar 7, 2018

@niboshi niboshi removed the stale label Mar 7, 2018

@beam2d beam2d assigned mitmul and unassigned niboshi Jun 4, 2018

@kmaehashi kmaehashi added this to the Future Task milestone Jun 4, 2018

Member

mitmul commented Jun 4, 2018

@imaihal Hi, thank you for sending this PR. How's the progress? It seems to be still a "WIP" PR, and it's based on Chainer v3, but the next major version of Chainer is v5. I think a lot of code in this PR can work in v5 as is, but some conflicts have already happened. Could you resolve those conflicts and let us know the status and plans to finish this PR?

stale bot commented Sep 2, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Sep 2, 2018

imaihal commented Sep 5, 2018

@mitmul Sorry for the late reply, and thank you for your comment. We have implemented this PR on Chainer v4, but not yet on Chainer v5. Unfortunately, as of now, we have no plan to do so. However, we are interested in updating this PR, so we would like to discuss the best way to merge it into the official code at some point.

Should I close this PR for now?

@stale stale bot removed the stale label Sep 5, 2018

Member

mitmul commented Sep 8, 2018

@imaihal I think it's OK to keep it open so that other committers can track it easily. Then how about discussing how to proceed with this here or on the Slack channel (https://chainer.slack.com)?

imaihal commented Sep 10, 2018

@mitmul OK. Thanks. I think it is OK to discuss here.
