
Remove "free-gpu" from *_train and create queue-freegpu.pl #938

Merged
merged 8 commits into espnet:master on Jul 31, 2019

Conversation

@kamo-naoyuki (Contributor) commented Jun 28, 2019

This PR refactors this part of *_train.py:

if args.ngpu > 0:
    # python 2 case
    if platform.python_version_tuple()[0] == '2':
        if "clsp.jhu.edu" in subprocess.check_output(["hostname", "-f"]):
            cvd = subprocess.check_output(["/usr/local/bin/free-gpu", "-n", str(args.ngpu)]).strip()
            logging.info('CLSP: use gpu' + cvd)
            os.environ['CUDA_VISIBLE_DEVICES'] = cvd
    # python 3 case
    else:
        if "clsp.jhu.edu" in subprocess.check_output(["hostname", "-f"]).decode():
            cvd = subprocess.check_output(["/usr/local/bin/free-gpu", "-n", str(args.ngpu)]).decode().strip()
            logging.info('CLSP: use gpu' + cvd)
            os.environ['CUDA_VISIBLE_DEVICES'] = cvd
    cvd = os.environ.get("CUDA_VISIBLE_DEVICES")
    if cvd is None:
        logging.warning("CUDA_VISIBLE_DEVICES is not set.")
    elif args.ngpu != len(cvd.split(",")):
        logging.error("#gpus is not matched with CUDA_VISIBLE_DEVICES.")
        sys.exit(1)

This code works only at JHU, and such site-specific logic shouldn't live at the Python level.
I created queue-freegpu.pl, which is almost the same as Kaldi's queue.pl except that it invokes the free-gpu command and sets CUDA_VISIBLE_DEVICES, just as the code above does.
Note that queue-freegpu.pl is a JHU-specific tool.
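
queue-freegpu.pl itself is Perl, but the idea is small. As a rough Python sketch of the same allocate-then-exec behaviour (the free-gpu path and -n flag are taken from the code above; the argument handling here is only illustrative, not how the actual script is invoked):

```
import os
import subprocess
import sys

# Illustrative usage: wrapper.py <ngpu> <training command ...>
ngpu = int(sys.argv[1])
cmd = sys.argv[2:]

# Ask JHU's free-gpu utility for <ngpu> unused GPU ids (comma-separated),
# export them, then replace this process with the unchanged training command.
cvd = subprocess.check_output(
    ["/usr/local/bin/free-gpu", "-n", str(ngpu)]).decode().strip()
os.environ["CUDA_VISIBLE_DEVICES"] = cvd
os.execvp(cmd[0], cmd)
```

This keeps the site-specific GPU allocation in the job-submission layer, so the training scripts themselves only need to read CUDA_VISIBLE_DEVICES.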

I also changed the behaviour of --ngpu in *_train.py:

  1. If --ngpu is given, use it as is.
  2. elif CUDA_VISIBLE_DEVICES is set, use the number of visible devices.
  3. elif nvidia-smi can be invoked, use the number of all devices.
  4. else ngpu == 0.

Recent job schedulers such as Slurm and Torque can manage GPU resources, so a user program doesn't need to handle GPU arguments itself; it is enough just to use the 'visible' devices allocated by the job scheduler.

About 3., I decided to use nvidia-smi instead of a Python-level method for getting the number of GPUs, e.g. torch.cuda.device_count(), because I don't want this part to be tied to a particular DNN backend.
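
Roughly, the fallback looks like this (a minimal sketch, not the exact diff in this PR; step 1 is handled by the --ngpu argument itself, and nvidia-smi -L prints one line per GPU):

```
import os
import subprocess


def default_ngpu():
    # 2. CUDA_VISIBLE_DEVICES is set: count the devices visible to this job.
    cvd = os.environ.get("CUDA_VISIBLE_DEVICES")
    if cvd is not None:
        return len(cvd.split(","))
    # 3. Otherwise query nvidia-smi directly, so no DNN backend is assumed.
    try:
        out = subprocess.check_output(["nvidia-smi", "-L"])
    except (OSError, subprocess.CalledProcessError):
        # 4. nvidia-smi is missing or fails: fall back to CPU.
        return 0
    return len(out.decode().strip().splitlines())
```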

I also planned to remove --ngpu from run.sh for *_train.py, because it is specified twice (once for cmd and once for the script), but in the end I kept it as it is. It is still needed for run.pl and ssh.pl, and I don't see a better way to handle it.

Actually, I have access to neither SGE nor free-gpu, so could someone please test this tool?

@codecov bot commented Jun 28, 2019

Codecov Report

Merging #938 into master will increase coverage by 0.07%.
The diff coverage is 0%.


@@            Coverage Diff             @@
##           master     #938      +/-   ##
==========================================
+ Coverage    50.2%   50.28%   +0.07%     
==========================================
  Files          88       88              
  Lines        9777     9762      -15     
==========================================
  Hits         4909     4909              
+ Misses       4868     4853      -15
| Impacted Files | Coverage Δ |
|---|---|
| espnet/bin/tts_train.py | 0% <0%> (ø) ⬆️ |
| espnet/bin/asr_train.py | 0% <0%> (ø) ⬆️ |
| espnet/bin/lm_train.py | 0% <0%> (ø) ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update f6b55b3...f18c569.

@kan-bayashi kan-bayashi added this to the v.0.4.1 milestone Jul 1, 2019
@sw005320 (Contributor) commented Jul 2, 2019

Thanks a lot!!!
I may not find time to test it...
@27jiangziyan, could you test it in the CLSP environment?

Review comment on egs/an4/asr1/cmd.sh (outdated, resolved)
@sw005320 sw005320 modified the milestones: v.0.4.1, v.0.4.2 Jul 3, 2019
@Xiaofei-Wang (Contributor) commented Jul 17, 2019

Hi Kamo,
I tested your pull request on the JHU cluster using --ngpu=1, but it failed. Here is the returned error:

Traceback (most recent call last):
  File "/export/c03/xwang/espnet_test1/espnet/egs/an4/asr1/../../../espnet/bin/asr_train.py", line 373, in <module>
    main(sys.argv[1:])
  File "/export/c03/xwang/espnet_test1/espnet/egs/an4/asr1/../../../espnet/bin/asr_train.py", line 361, in main
    train(args)
  File "/export/c03/xwang/espnet_test1/espnet/espnet/asr/pytorch_backend/asr.py", line 291, in train
    model = model.to(device)
  File "/export/c03/xwang/espnet_test1/espnet/tools/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 381, in to
    return self._apply(convert)
  File "/export/c03/xwang/espnet_test1/espnet/tools/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
    module._apply(fn)
  File "/export/c03/xwang/espnet_test1/espnet/tools/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
    module._apply(fn)
  File "/export/c03/xwang/espnet_test1/espnet/tools/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
    module._apply(fn)
  [Previous line repeated 1 more time]
  File "/export/c03/xwang/espnet_test1/espnet/tools/venv/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 116, in _apply
    ret = super(RNNBase, self)._apply(fn)
  File "/export/c03/xwang/espnet_test1/espnet/tools/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 193, in _apply
    param.data = fn(param.data)
  File "/export/c03/xwang/espnet_test1/espnet/tools/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 379, in convert
    return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
  File "/export/c03/xwang/espnet_test1/espnet/tools/venv/lib/python3.7/site-packages/torch/cuda/__init__.py", line 162, in _lazy_init
    torch._C._cuda_init()
RuntimeError: cuda runtime error (38) : no CUDA-capable device is detected at /opt/conda/conda-bld/pytorch_1549635019666/work/aten/src/THC/THCGeneral.cpp:51

I haven't tried the case of ngpu=None; I'll keep you posted when I finish the remaining tests.
@kan-bayashi kan-bayashi modified the milestones: v.0.4.2, v.0.4.3 Jul 23, 2019
@sw005320 (Contributor) commented Jul 30, 2019

@Xiaofei-Wang, can you test it again?

@Xiaofei-Wang (Contributor) commented Jul 31, 2019

I've tested it, and it works well on the JHU grid.

# Started at Wed Jul 31 00:38:04 EDT 2019
# asr_train.py --config conf/train_mtlalpha1.0.yaml --ngpu 1 --backend pytorch --outdir exp/train_nodev_pytorch_train_mtlalpha1.0/results --tensorboard-dir tensorboard/train_nodev_pytorch_train_mtlalpha1.0 --debugmode 1 --dict data/lang_1char/train_nodev_units.txt --debugdir exp/train_nodev_pytorch_train_mtlalpha1.0 --minibatches 0 --verbose 1 --resume --train-json dump/train_nodev/deltafalse/data.json --valid-json dump/train_dev/deltafalse/data.json 
free gpu: 0
2019-07-31 00:38:05,582 (asr_train:331) INFO: ngpu: 1
2019-07-31 00:38:05,583 (asr_train:334) INFO: python path = :/home/xwang/anaconda2/pkgs
2019-07-31 00:38:05,583 (asr_train:337) INFO: random seed = 1
2019-07-31 00:38:05,598 (asr_train:354) INFO: backend = pytorch
/export/c03/xwang/espnet-test/espnet/tools/venv/lib/python3.7/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
2019-07-31 00:38:15,164 (deterministic_utils:24) INFO: torch type check is disabled
2019-07-31 00:38:15,228 (asr:241) INFO: #input dims : 83
2019-07-31 00:38:15,228 (asr:242) INFO: #output dims: 30
2019-07-31 00:38:15,228 (asr:247) INFO: Pure CTC mode
2019-07-31 00:38:15,228 (e2e_asr:86) INFO: subsample: 1 2 2 1 1
2019-07-31 00:38:15,328 (encoders:236) INFO: BLSTM with every-layer projection for encoder
2019-07-31 00:38:15,572 (asr:274) INFO: writing a model config file to exp/train_nodev_pytorch_train_mtlalpha1.0/results/model.json
@sw005320 (Contributor) commented Jul 31, 2019

Thanks, @Xiaofei-Wang.
@kamo-naoyuki, it would be great if you could prepare cmd.sh for the other recipes.

@sw005320 sw005320 merged commit ff84d95 into espnet:master Jul 31, 2019
5 checks passed
ci/circleci: test-centos7 Your tests passed on CircleCI!
ci/circleci: test-debian9 Your tests passed on CircleCI!
ci/circleci: test-ubuntu16 Your tests passed on CircleCI!
ci/circleci: test-ubuntu18 Your tests passed on CircleCI!
continuous-integration/travis-ci/pr The Travis CI build passed