bug coverage_attn in multi-gpu mode #5

Open
nikhilweee opened this Issue Jun 8, 2018 · 25 comments

nikhilweee commented Jun 8, 2018

Could you please help me understand how to use the new GPU options? I used the flags -gpuid 0 1 -gpu_verbose 0 -gpu_rank 0 for the training script, which resulted in the following error:

Traceback (most recent call last):
  File "/data/projects/opennmt-ubiqus/train_multi.py", line 43, in run
    single_main(opt)
  File "/data/projects/opennmt-ubiqus/train_single.py", line 120, in main
    opt.valid_steps)
  File "/data/projects/opennmt-ubiqus/onmt/trainer.py", line 143, in train
    if self.gpu_verbose > 1:
TypeError: unorderable types: list() > int()
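For reference, Python 3 raises this error whenever a list is ordered against an int; presumably gpu_verbose ends up parsed as a list here. A minimal sketch with assumed values (not the actual option-parsing code):

```python
# Assumed: option parsing leaves gpu_verbose as a list rather than an int,
# so the trainer's `if self.gpu_verbose > 1:` check compares list vs int.
gpu_verbose = [0]  # hypothetical parsed value

try:
    gpu_verbose > 1  # Python 3: ordering a list against an int is a TypeError
except TypeError as exc:
    print(type(exc).__name__)  # prints: TypeError
```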

Here's the output from nvidia-smi, just in case.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30                 Driver Version: 390.30                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000A20E:00:00.0 Off |                    0 |
| N/A   70C    P0    65W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 0000C0B5:00:00.0 Off |                    0 |
| N/A   38C    P0    72W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
vince62s commented Jun 8, 2018

Hi,
Can you please provide the full command line you used?
Using just -gpuid 0 1 should be fine (as it used to be).
Did you export CUDA_VISIBLE_DEVICES=0,1 ?

nikhilweee commented Jun 8, 2018

I used the following flags. Exporting CUDA_VISIBLE_DEVICES=0,1 has no effect.

python3 train.py -data data-ubiqus -save_model model-ubiqus -gpuid 0 1 -train_steps 3000 \
-valid_steps 60 -optim adam -learning_rate 0.001 -learning_rate_decay 0.9 -enc_layers 4 \
-dec_layers 2 -copy_attn -reuse_copy_attn -gpu_verbose 0 -gpu_rank 0
vince62s commented Jun 8, 2018

OK, there is a bug with an option; I'll push a fix.
Pushed. Can you try again?

nikhilweee commented Jun 8, 2018

Ouch! I get another error. I'm not sure if this error is related to your fix, though.

python3 opennmt-ubiqus/train.py -data data-ubiqus -save_model model-ubiqus-coverage -gpuid 0 1 \
-train_steps 3000 -valid_steps 60 -optim adam -learning_rate 0.001 -learning_rate_decay 0.9 \
-enc_layers 4 -dec_layers 2 -copy_attn -reuse_copy_attn -coverage_attn -gpu_verbose 0 -gpu_rank 0
Traceback (most recent call last):
  File "/data/projects/bash2nl/opennmt-ubiqus/train_multi.py", line 43, in run
    single_main(opt)
  File "/data/projects/bash2nl/opennmt-ubiqus/train_single.py", line 120, in main
    opt.valid_steps)
  File "/data/projects/bash2nl/opennmt-ubiqus/onmt/trainer.py", line 162, in train
    report_stats, normalization)
  File "/data/projects/bash2nl/opennmt-ubiqus/onmt/trainer.py", line 275, in _gradient_accumulation
    grads = [p.grad.data for p in self.model.parameters()
  File "/data/projects/bash2nl/opennmt-ubiqus/onmt/trainer.py", line 276, in <listcomp>
    if p.requires_grad]
AttributeError: 'NoneType' object has no attribute 'data'
vince62s commented Jun 8, 2018

Please remove -gpu_verbose and -gpu_rank from the command line.

nikhilweee commented Jun 8, 2018

Ah! Without -gpu_verbose and -gpu_rank it worked earlier too. I just wanted to know how to use these options, as I couldn't find the relevant documentation.

vince62s commented Jun 8, 2018

gpu_verbose: default 0; when set to 1 or 2, it will print more info in multi-gpu mode.

gpu_rank: it should not be exposed at the present time; it might be used for multi-node training in the future.

nikhilweee commented Jun 8, 2018

@vince62s I'm sorry I didn't check before. I get the same AttributeError: 'NoneType' object has no attribute 'data' when I remove -gpu_verbose and -gpu_rank.

vince62s commented Jun 8, 2018

Okay, I'm not sure about the task you are working on. We'll look into it, but if you can try without the attn flags, that might help narrow down the issue.

vince62s changed the title from "How to use the new GPU options?" to "bug coverage_attn in multi-gpu mode" on Jun 8, 2018

Collaborator

pltrdy commented Jun 8, 2018

@nikhilweee I could add a condition at trainer.py L275, i.e.

grads = [p.grad.data for p in self.model.parameters()
         if p.requires_grad and p.grad is not None]

Still, as I do not really understand which parameters have None grads (and why), we could print some information, i.e.

grads = []
for name, p in self.model.named_parameters():
    if p.requires_grad:
        if p.grad is not None:
            grads += [p.grad.data]
        else:
            print("Model parameter '%s' has None grad" % name)          

Could you try those?

nikhilweee commented Jun 9, 2018

@pltrdy I think this should help you out.

python3 train.py -data data-ubiqus -save_model model-ubiqus-coverage -gpuid 0 1 \
-train_steps 3000 -valid_steps 60 -save_checkpoint_steps 60 -optim adam \
-learning_rate 0.001 -learning_rate_decay 0.9 -enc_layers 4 -dec_layers 2 -copy_attn \
-reuse_copy_attn -coverage_attn
Model parameter 'decoder.attn.linear_cover.weight' has None grad
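For anyone hitting this: a parameter that exists on the model but is never used in the forward pass keeps a None grad after backward(), which matches the linear_cover symptom above. A toy sketch of the mechanism (not OpenNMT's actual decoder):

```python
import torch
import torch.nn as nn

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.main = nn.Linear(4, 4)
        # Defined but never used in forward(), like a coverage layer
        # that a given code path skips.
        self.cover = nn.Linear(4, 4)

    def forward(self, x):
        return self.main(x)

model = Toy()
model(torch.ones(1, 4)).sum().backward()
print(model.main.weight.grad is None)   # False: used, so it has a grad
print(model.cover.weight.grad is None)  # True: unused, grad stays None
```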

On a separate note, I still get this warning which was supposed to be fixed in #6

/data/projects/opennmt-py/onmt/modules/copy_generator.py:94: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
  prob = self.softmax(logits)
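That warning comes from constructing nn.Softmax without a dim argument; the usual fix (a sketch, not necessarily the exact patch from #6) is to pass the dimension explicitly:

```python
import torch
import torch.nn as nn

logits = torch.randn(2, 5)

# nn.Softmax() with no dim triggers the deprecation warning above;
# an explicit dim silences it and makes the intent clear.
softmax = nn.Softmax(dim=-1)
prob = softmax(logits)
print(torch.allclose(prob.sum(dim=-1), torch.ones(2)))  # True: each row sums to 1
```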

EDIT: The error AttributeError: 'NoneType' object has no attribute 'data' seems to be a side effect of 04c67bd. It works if I revert those changes.

rylanchiu commented Jun 26, 2018

@nikhilweee I still get the error... Could you please tell me which branch you cloned? Thanks!

nikhilweee commented Jun 26, 2018

@rylanchiu I'm not completely sure, but I'm pretty certain I was on master.

rylanchiu commented Jun 26, 2018

@nikhilweee Thanks for your reply. It seems you forked a version from 18 days ago, when you opened this issue, but at that time the master branch did not support multi-gpu yet. Could you please send the forked library to my email (rylanlzc@gmail.com)? It would be of great help. Thanks in advance.

rylanchiu commented Jun 26, 2018

@nikhilweee BTW, which versions of torchtext and pytorch are you using? Thanks!

nikhilweee commented Jun 26, 2018

I was using pytorch 0.4 and torchtext 0.3.0. I guess you can find the older version at 052b2f7.

rylanchiu commented Jun 26, 2018

@nikhilweee The error still exists... Anyway, thanks for your help.

vince62s commented Jun 26, 2018

@rylanchiu Are you talking about the error with coverage_attn in multi-gpu mode?
That is still a bug; I haven't had a chance to look into fixing it.
If you have another issue, open a new one. Thanks.

rylanchiu commented Jun 26, 2018

@vince62s Yes, that's what I am talking about. What is the cause of that bug? Do you have a rough idea of when it can be fixed? Thanks.

pltrdy commented Jul 4, 2018

I submitted PR OpenNMT#799 in the source repo (which is now up to date with this one).

It would require a bit of testing / investigation, @rylanchiu @nikhilweee

rylanchiu commented Jul 6, 2018

Thanks a lot for your great work, @pltrdy! I will give it a try immediately. BTW, can I merge this branch with your other reinforcement learning branch without significant conflicts?

rylanchiu commented Jul 7, 2018

@pltrdy @vince62s I gave it a try and now there is a new problem. The multiprocessing run hangs, and this seems to be where it hangs:

  File "train.py", line 40, in <module>
    main(opt)
  File "train.py", line 25, in main
    multi_main(opt)
  File "/opt/conda/lib/python3.6/site-packages/OpenNMT_py-0.2-py3.6.egg/onmt/train_multi.py", line 38, in main
    p.join()
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 124, in join
    res = self._popen.wait(timeout)
  File "/opt/conda/lib/python3.6/multiprocessing/popen_fork.py", line 50, in wait
    return self.poll(os.WNOHANG if timeout == 0.0 else 0)
  File "/opt/conda/lib/python3.6/multiprocessing/popen_fork.py", line 28, in poll
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt

There is no such issue in earlier commits. What could be the reason for it? Thanks.

vince62s commented Jul 7, 2018

Can you please post your command line? Thanks.

rylanchiu commented Jul 7, 2018

Sorry, I forgot that. Here it is:

python -u train.py -data data/tfm -save_model models/tfm -layers 4 -rnn_size 512 -word_vec_size 512 -max_grad_norm 0 -optim adam -encoder_type transformer -decoder_type transformer -position_encoding -dropout 0.2 -param_init 0 -learning_rate 0.001 -batch_size 4096 -batch_type tokens -normalization tokens -train_steps 1000000 -save_checkpoint_steps 1000 -share_embeddings -copy_attn -param_init_glorot -gpuid 0 1

pltrdy commented Jul 9, 2018

@rylanchiu Concerning RL, it's not easy to make it work; the code isn't ready to be merged. Most of the discussion about it is in PR OpenNMT#319.
