[WIP] Out-of-core training on V3 #3762

Closed
wants to merge 46 commits into from

imaihal commented Nov 1, 2017

This PR enables out-of-core (OoC) training on V3. Out-of-core training makes it possible to train very large models whose memory usage exceeds GPU memory by using CPU memory as a swap area. This PR is a successor of #2027 (v1 OoC), and its basic functions are ported from @anaruse's branch for v2 OoC. Credit for those belongs to @anaruse; our contributions are as follows:

  • Ported automatic breakpoint insertion to split computational graph to V3
  • Ported swap-in/out related functions to V3
  • Enabled data parallel execution for OoC
  • Added enlarged models of GoogLeNet and ResNet and examples to use it

Please see the examples (examples/imagenet/train_imagenet_OOC_ibm.py and examples/imagenet/train_imagenet_data_parallel_OOC_ibm.py) to run this function. As the examples show, "with chainer.out_of_core_mode()" enables the out-of-core functions. The command line options "--ooc" and "--insize" are added in the examples. You don't have to modify model files.

I created another PR to Cupy v2 (Swap in/out between GPU and CPU memory #694) and this PR depends on it.
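
As a rough illustration of the swap idea described above (not the code in this PR), the following toy sketch uses plain Python dictionaries to stand in for GPU and CPU memory. The class ToySwapPool, the function train_step, the sizes, and the rule of swapping out the previous activation at every layer boundary are all assumptions made for the example. The PR itself inserts breakpoints automatically and relies on the separate CuPy PR for the actual transfers.

```python
# Toy illustration only: dictionaries stand in for GPU and CPU memory.
# None of these names come from Chainer or CuPy.


class ToySwapPool(object):
    """Tracks which activations are resident on a size-limited 'device'."""

    def __init__(self, device_capacity):
        self.device_capacity = device_capacity
        self.device = {}  # name -> size, stands in for GPU memory
        self.host = {}    # name -> size, stands in for CPU memory (swap area)
        self.peak = 0     # highest device usage seen so far

    def used(self):
        return sum(self.device.values())

    def allocate(self, name, size):
        if self.used() + size > self.device_capacity:
            raise MemoryError("device out of memory while allocating " + name)
        self.device[name] = size
        self.peak = max(self.peak, self.used())

    def swap_out(self, name):
        # Move an activation from the device to the host swap area.
        self.host[name] = self.device.pop(name)

    def swap_in(self, name):
        # Bring an activation back to the device before it is needed.
        self.allocate(name, self.host.pop(name))

    def free(self, name):
        del self.device[name]


def train_step(pool, layer_sizes, out_of_core):
    """Simulate one forward/backward pass over a chain of layers."""
    names = ["act%d" % i for i in range(len(layer_sizes))]
    # Forward: every layer produces an activation that backward needs again.
    for i, (name, size) in enumerate(zip(names, layer_sizes)):
        pool.allocate(name, size)
        if out_of_core and i > 0:
            # Assumed breakpoint: the previous activation can leave the device.
            pool.swap_out(names[i - 1])
    # Backward: walk the layers in reverse, swapping activations back in.
    for i in reversed(range(len(names))):
        if names[i] in pool.host:
            pool.swap_in(names[i])
        pool.free(names[i])
    return pool.peak
```

With three activations of size 4 and a device capacity of 8, the call train_step(ToySwapPool(8), [4, 4, 4], out_of_core=False) raises MemoryError, while the same call with out_of_core=True completes because at most two activations are resident at any time.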

imaihal and others added 24 commits Sep 26, 2017
…dundant-breakpoints-withoutOOC' into 3.0.0rc1-trl-ooc-parallel
…rl-ooc-pr

Conflicts:
	chainer/functions/normalization/local_response_normalization.py
@imaihal imaihal force-pushed the imaihal:v3-trl-ooc-pr branch from 7f4b94c to 7543460 Nov 15, 2017
imaihal added 3 commits Nov 15, 2017
…rain_imagenet_data_parallel.py

stale bot commented Mar 7, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Mar 7, 2018
@niboshi niboshi removed the stale label Mar 7, 2018
@beam2d beam2d assigned mitmul and unassigned niboshi Jun 4, 2018
@kmaehashi kmaehashi added this to the Future Task milestone Jun 4, 2018
Member

mitmul commented Jun 4, 2018

@imaihal Hi, thank you for sending this PR. How's the progress? It seems to be still a "WIP" PR, and it's based on Chainer v3, but the next major version of Chainer is v5. I think a lot of code in this PR can work in v5 as is, but some conflicts have already happened. Could you resolve those conflicts and let us know the status and plans to finish this PR?

stale bot commented Sep 2, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Sep 2, 2018
Author

imaihal commented Sep 5, 2018

@mitmul Sorry for the late reply, and thank you for your comment. We've implemented this PR on Chainer V4, but not on Chainer V5 yet. Unfortunately, as of now, we don't have a plan to do so. However, we are interested in updating this PR, so we would like to discuss the best way to merge it into the official code at some point.

Should I close this PR for now?

@stale stale bot removed the stale label Sep 5, 2018
Member

mitmul commented Sep 8, 2018

@imaihal I think it's OK to keep it open so that other committers can track it easily. Then how about discussing how to proceed here or on the Slack channel (https://chainer.slack.com)?

Author

imaihal commented Sep 10, 2018

@mitmul OK. Thanks. I think it is OK to discuss here.

lxqlxq21 commented Nov 5, 2018

Hi @imaihal, I was able to install v3-ooc, but there are some features of Chainer V4 that I want to use. Do you mind sharing v4-ooc? I'm having a problem installing the following version:
https://developer.ibm.com/linuxonpower/2018/07/31/deep-learning-openpower-install-ibm-optimized-chainer-v4-easily-pip-command-operpower-linux-systems/

Author

imaihal commented Nov 6, 2018

Hi @lxqlxq21, thank you for trying v3-ooc and the instructions in the blog.
Chainer v4-ooc and CuPy for v4-ooc are included in the links listed in the blog. Have you already tried them?

lxqlxq21 commented Nov 6, 2018

Hi @imaihal, I tried it with pip install, but import chainer and import cupy both raise errors. V3-OOC worked fine for me following the install instructions, but I can't get any version of chainer and cupy installed after uninstalling V3-OOC. It's probably because I haven't uninstalled it correctly. Either I need to resolve that, or is there any way I can install it as a ready-made Anaconda env? Would it be possible for you to create and share such an env?

Author

imaihal commented Nov 6, 2018

@lxqlxq21 I also think there is a problem with the uninstallation. You may have done this already, but could you try the following? These steps are not specific to our code, but I often do them when uninstalling.
Sorry, I usually use pyenv and I'm not familiar with Anaconda.

lxqlxq21 commented Nov 6, 2018

@imaihal I tried everything, but nothing seems to help.

The following is the error I see.

The first time I run it:

File "/home/xueqing/.local/lib/python2.7/site-packages/cupy/cuda/cudnn_util.py", line 13
print(msg, file=sys.stderr, flush=True)
^
SyntaxError: invalid syntax

If I run it again:

ImportError Traceback (most recent call last)
in ()
2 import glob
3 import math, random
----> 4 import chainer
5 from chainer import cuda, Function, gradient_check, report, training, utils, Variable
6 from chainer import datasets, iterators, optimizers, serializers

/usr/local/lib/python2.7/dist-packages/chainer/__init__.py in <module>()
6 import numpy
7
----> 8 from chainer import _version
9 from chainer import backends # NOQA
10 from chainer import configuration # NOQA

ImportError: cannot import name _version

Then I uninstalled Anaconda completely and installed from the downloaded folder, and it worked!
Either Anaconda messed up my directories or pip install doesn't work for me.
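
The two tracebacks above point at different directories (~/.local/lib/python2.7/site-packages for cupy and /usr/local/lib/python2.7/dist-packages for chainer), which suggests that more than one copy may have been installed. One possible way to check for leftover copies is sketched below: it lists every directory on sys.path that holds a given package. The helper name find_package_copies is made up for this illustration, and it only detects regular packages that ship an __init__.py file. As a side note, the SyntaxError on print(msg, file=sys.stderr, flush=True) is what Python 2.7 reports when it parses a Python 3 style print call as a print statement.

```python
import os
import sys


def find_package_copies(package, search_paths=None):
    """Return every directory on the search path that contains `package`.

    Hypothetical helper for illustration; it only recognises regular
    packages, i.e. directories that contain an __init__.py file.
    """
    if search_paths is None:
        search_paths = sys.path
    copies = []
    for entry in search_paths:
        # An empty sys.path entry means the current working directory.
        candidate = os.path.join(entry or os.getcwd(), package)
        has_init = os.path.isfile(os.path.join(candidate, "__init__.py"))
        if has_init and candidate not in copies:
            copies.append(candidate)
    return copies


if __name__ == "__main__":
    for name in ("chainer", "cupy"):
        # More than one line per package hints at conflicting installs.
        for path in find_package_copies(name):
            sys.stdout.write("%s: %s\n" % (name, path))
```

If a package shows up in more than one location, removing the extra copies before reinstalling is a reasonable first step, since the first match on sys.path is the one that gets imported.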

stale bot commented Feb 7, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Feb 7, 2019

stale bot commented Mar 9, 2019

This issue is closed as announced. Feel free to re-open it if needed.

@stale stale bot closed this Mar 9, 2019
@kmaehashi kmaehashi removed the stale label Mar 26, 2019