This repository was archived by the owner on Mar 20, 2026. It is now read-only.
Fixes bugs of evaluation with BLEU score when training with multi-gpus. #3237
Closed
cordercorder wants to merge 4 commits into facebookresearch:master from cordercorder:master
Conversation
alexeib (Contributor) suggested changes on Feb 12, 2021:
thanks for the contribution, please see inline comment!
fairseq/tasks/translation.py (Outdated)
Comment on lines +399 to +400:

```python
if isinstance(result, torch.Tensor) and result.device.type != "cpu":
    result = result.cpu()
```
Contributor
maybe let's do:

```python
if torch.is_tensor(result):
    result = result.cpu()
```
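The suggested check lands in the metric-aggregation path of `reduce_metrics`. Below is a torch-free sketch of that pattern, under stated assumptions: `sum_logs` mirrors the aggregation helper in `translation.py`, `FakeTensor` is a hypothetical stand-in for a CUDA tensor, and the `hasattr(x, "cpu")` duck test plays the role of `torch.is_tensor`, so the example runs without PyTorch or a GPU.

```python
def sum_logs(logging_outputs, key):
    """Sum one logged value across workers, moving any tensor result
    to host memory first (the pattern the reviewer suggests).

    Duck-typed sketch: in fairseq this would be torch.is_tensor(result).
    """
    result = sum(log.get(key, 0) for log in logging_outputs)
    if hasattr(result, "cpu"):   # stands in for torch.is_tensor(result)
        result = result.cpu()    # move off the GPU before NumPy sees it
    return result


class FakeTensor:
    """Hypothetical stand-in for a CUDA tensor: supports + and .cpu()."""

    def __init__(self, value, device="cuda:0"):
        self.value, self.device = value, device

    def __add__(self, other):
        other = other.value if isinstance(other, FakeTensor) else other
        return FakeTensor(self.value + other, self.device)

    __radd__ = __add__

    def cpu(self):
        # Mirrors Tensor.cpu(): a copy that lives in host memory.
        return FakeTensor(self.value, "cpu")


# Per-rank logging outputs, as reduce_metrics would receive them.
logs = [{"_bleu_counts": FakeTensor(3)}, {"_bleu_counts": FakeTensor(4)}]
total = sum_logs(logs, "_bleu_counts")  # FakeTensor(7) on "cpu"
```

Plain Python numbers pass through `sum_logs` untouched, which is why the single-GPU path (where the logged values are already scalars) never hit the bug.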
Contributor (Author)
Thanks for your review!
facebook-github-bot (Contributor) left a comment:
@alexeib has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Contributor harkash pushed a commit to harkash/fairseq that referenced this pull request on Feb 23, 2021:
…s. (facebookresearch#3237)

Summary: …ith BLEU scores

# Before submitting
- [no] Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
- [yes] Did you read the [contributor guideline](https://github.com/pytorch/fairseq/blob/master/CONTRIBUTING.md)?
- [no need] Did you make sure to update the docs?
- [no need] Did you write any new necessary tests?

## What does this PR do?
Fixes bugs of evaluation with BLEU score when training with multiple GPUs; no error occurs without distributed training. When `--eval-bleu` is set to `True` (by default it is `False` and the best checkpoint is selected according to loss) and training uses multiple GPUs (i.e. more than one GPU participates in distributed training), the following error occurs. Each rank prints the same traceback, interleaved in the log; one representative copy is:

```bash
Traceback (most recent call last):
  File "/data/cordercorder/anaconda3/envs/nmt/bin/fairseq-train", line 33, in <module>
    sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-train')())
  File "/data1/cordercorder/fairseq/fairseq_cli/train.py", line 450, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/data1/cordercorder/fairseq/fairseq/distributed/utils.py", line 349, in call_main
    distributed_main(cfg.distributed_training.device_id, main, cfg, kwargs)
  File "/data1/cordercorder/fairseq/fairseq/distributed/utils.py", line 326, in distributed_main
    main(cfg, **kwargs)
  File "/data1/cordercorder/fairseq/fairseq_cli/train.py", line 143, in main
    valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
  File "/data/cordercorder/anaconda3/envs/nmt/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/data1/cordercorder/fairseq/fairseq_cli/train.py", line 259, in train
    cfg, trainer, task, epoch_itr, valid_subsets, end_of_epoch
  File "/data1/cordercorder/fairseq/fairseq_cli/train.py", line 345, in validate_and_save
    valid_losses = validate(cfg, trainer, task, epoch_itr, valid_subsets)
  File "/data1/cordercorder/fairseq/fairseq_cli/train.py", line 413, in validate
    trainer.valid_step(sample)
  File "/data/cordercorder/anaconda3/envs/nmt/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/data1/cordercorder/fairseq/fairseq/trainer.py", line 834, in valid_step
    logging_output = self._reduce_and_log_stats(logging_outputs, sample_size)
  File "/data1/cordercorder/fairseq/fairseq/trainer.py", line 1157, in _reduce_and_log_stats
    self.task.reduce_metrics(logging_outputs, self.get_criterion())
  File "/data1/cordercorder/fairseq/fairseq/tasks/translation.py", line 410, in reduce_metrics
    metrics.log_scalar("_bleu_counts", np.array(counts))
  File "/data/cordercorder/anaconda3/envs/nmt/lib/python3.7/site-packages/torch/tensor.py", line 480, in __array__
    return self.numpy()
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
```

The same `TypeError` is raised on every rank (cuda:0 through cuda:3), after which `torch.distributed.launch` reports:

```bash
Traceback (most recent call last):
  File "/data/cordercorder/anaconda3/envs/nmt/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/data/cordercorder/anaconda3/envs/nmt/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/data/cordercorder/anaconda3/envs/nmt/lib/python3.7/site-packages/torch/distributed/launch.py", line 261, in <module>
    main()
  File "/data/cordercorder/anaconda3/envs/nmt/lib/python3.7/site-packages/torch/distributed/launch.py", line 257, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/data/cordercorder/anaconda3/envs/nmt/bin/python', '-u', '/data/cordercorder/anaconda3/envs/nmt/bin/fairseq-train', '--local_rank=3', 'tiny_data_bin', '--distributed-world-size', '4', '--arch', 'transformer', '--share-decoder-input-output-embed', '--optimizer', 'adam', '--adam-betas', '(0.9, 0.98)', '--clip-norm', '0.0', '--lr-scheduler', 'inverse_sqrt', '--warmup-init-lr', '1e-07', '--warmup-updates', '3000', '--lr', '0.0005', '--stop-min-lr', '1e-09', '--dropout', '0.25', '--weight-decay', '0.0001', '--criterion', 'label_smoothed_cross_entropy', '--label-smoothing', '0.1', '--max-tokens', '5000', '--batch-size', '64', '--update-freq', '4', '--max-epoch', '30', '--save-dir', 'checkpoint', '--skip-invalid-size-inputs-valid-test', '--eval-bleu', '--eval-bleu-args', '{"beam": 5}', '--eval-bleu-remove-bpe', 'sentencepiece', '--eval-bleu-print-samples', '--eval-tokenized-bleu', '--best-checkpoint-metric', 'bleu', '--maximize-best-checkpoint-metric', '--validate-interval-updates', '1']' returned non-zero exit status 1.
```

The error is caused by the fact that NumPy 1.20.1 does not support code like the following:

```python
import torch
import numpy as np

a = torch.tensor(0, device="cuda:0")
b = np.array([a])
```

This raises "TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.", but the same code runs fine with NumPy 1.18.1 or 1.17.0 (presumably any NumPy below 1.20.0 is fine). However, the latest version of fairseq seems to require NumPy 1.20.0 or higher (issue facebookresearch#3203).

### Reproduce the error
Download the fairseq source code (commit ID: 7061a0f) and run:

```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3
data_bin_dir=tiny_data_bin

python -m torch.distributed.launch --nproc_per_node=4 \
    --master_addr="127.0.0.1" \
    --master_port=12345 \
    $(which fairseq-train) ${data_bin_dir} \
    --distributed-world-size 4 \
    --arch transformer \
    --share-decoder-input-output-embed \
    --optimizer adam \
    --adam-betas '(0.9, 0.98)' \
    --clip-norm 0.0 \
    --lr-scheduler inverse_sqrt \
    --warmup-init-lr 1e-07 \
    --warmup-updates 3000 \
    --lr 0.0005 \
    --stop-min-lr 1e-09 \
    --dropout 0.25 \
    --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.1 \
    --max-tokens 5000 \
    --batch-size 64 \
    --update-freq 4 \
    --max-epoch 30 \
    --save-dir checkpoint \
    --skip-invalid-size-inputs-valid-test \
    --eval-bleu \
    --eval-bleu-args '{"beam": 5}' \
    --eval-bleu-remove-bpe sentencepiece \
    --eval-bleu-print-samples \
    --eval-tokenized-bleu \
    --best-checkpoint-metric bleu \
    --maximize-best-checkpoint-metric \
    --validate-interval-updates 1
```

## PR review
Anyone in the community is free to review the PR once the tests have passed. If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

## Did you have fun?
Make sure you had fun coding 🙂

Pull Request resolved: facebookresearch#3237
Reviewed By: myleott
Differential Revision: D26429732
Pulled By: alexeib
fbshipit-source-id: bc887ce952d28541cb07dbbdc7e80e99428a6b34
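The failure boils down to NumPy invoking the object's `__array__` hook; from NumPy 1.20 on, `torch.Tensor.__array__` raises for CUDA-resident tensors instead of converting silently. A minimal, torch-free illustration of that mechanism follows; `GpuValue` is a hypothetical stand-in written for this sketch, not a real torch or fairseq class:

```python
import numpy as np


class GpuValue:
    """Hypothetical stand-in for a CUDA tensor. Its __array__ hook
    raises while the value is "on the GPU", mimicking what
    torch.Tensor.__array__ does under NumPy >= 1.20."""

    def __init__(self, value, on_gpu=True):
        self.value = value
        self.on_gpu = on_gpu

    def __array__(self, dtype=None, copy=None):
        if self.on_gpu:
            raise TypeError(
                "can't convert cuda device type tensor to numpy. "
                "Use Tensor.cpu() to copy the tensor to host memory first."
            )
        return np.array(self.value, dtype=dtype)

    def cpu(self):
        # Mirrors Tensor.cpu(): return a host-memory copy.
        return GpuValue(self.value, on_gpu=False)


a = GpuValue(5)

try:
    np.asarray(a)          # NumPy calls a.__array__() -> TypeError, as in the PR
    failed = False
except TypeError:
    failed = True

arr = np.asarray(a.cpu())  # after the .cpu() move the conversion succeeds
```

Moving the value to host memory first, which is exactly the `.cpu()` call this PR adds before `metrics.log_scalar`, sidesteps the hook's refusal.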
jinyiyang-jhu pushed a commit to jinyiyang-jhu/fairseq-jyang that referenced this pull request on Feb 26, 2021.
caltia pushed a commit to caltia/fairseq that referenced this pull request on Jul 8, 2025.
Traceback (most recent call last):
  File "/data/cordercorder/anaconda3/envs/nmt/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/data/cordercorder/anaconda3/envs/nmt/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/data/cordercorder/anaconda3/envs/nmt/lib/python3.7/site-packages/torch/distributed/launch.py", line 261, in <module>
    main()
  File "/data/cordercorder/anaconda3/envs/nmt/lib/python3.7/site-packages/torch/distributed/launch.py", line 257, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/data/cordercorder/anaconda3/envs/nmt/bin/python', '-u', '/data/cordercorder/anaconda3/envs/nmt/bin/fairseq-train', '--local_rank=3', 'tiny_data_bin', '--distributed-world-size', '4', '--arch', 'transformer', '--share-decoder-input-output-embed', '--optimizer', 'adam', '--adam-betas', '(0.9, 0.98)', '--clip-norm', '0.0', '--lr-scheduler', 'inverse_sqrt', '--warmup-init-lr', '1e-07', '--warmup-updates', '3000', '--lr', '0.0005', '--stop-min-lr', '1e-09', '--dropout', '0.25', '--weight-decay', '0.0001', '--criterion', 'label_smoothed_cross_entropy', '--label-smoothing', '0.1', '--max-tokens', '5000', '--batch-size', '64', '--update-freq', '4', '--max-epoch', '30', '--save-dir', 'checkpoint', '--skip-invalid-size-inputs-valid-test', '--eval-bleu', '--eval-bleu-args', '{"beam": 5}', '--eval-bleu-remove-bpe', 'sentencepiece', '--eval-bleu-print-samples', '--eval-tokenized-bleu', '--best-checkpoint-metric', 'bleu', '--maximize-best-checkpoint-metric', '--validate-interval-updates', '1']' returned non-zero exit status 1.
```
The error is caused by the fact that NumPy 1.20.1 no longer supports code like the following:
```python
import torch
import numpy as np

a = torch.tensor(0, device="cuda:0")
b = np.array([a])
```
The code above raises `TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.`, but it runs without error on NumPy 1.18.1 or 1.17.0 (presumably any version below 1.20.0 is fine). However, the latest version of fairseq appears to require NumPy 1.20.0 or higher (issue facebookresearch/fairseq#3203).

### Reproduce the error

Download the source code of fairseq (commit ID: 7061a0f) and run the following:

```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3

data_bin_dir=tiny_data_bin

python -m torch.distributed.launch --nproc_per_node=4 \
    --master_addr="127.0.0.1" \
    --master_port=12345 \
    $(which fairseq-train) ${data_bin_dir} \
    --distributed-world-size 4 \
    --arch transformer \
    --share-decoder-input-output-embed \
    --optimizer adam \
    --adam-betas '(0.9, 0.98)' \
    --clip-norm 0.0 \
    --lr-scheduler inverse_sqrt \
    --warmup-init-lr 1e-07 \
    --warmup-updates 3000 \
    --lr 0.0005 \
    --stop-min-lr 1e-09 \
    --dropout 0.25 \
    --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.1 \
    --max-tokens 5000 \
    --batch-size 64 \
    --update-freq 4 \
    --max-epoch 30 \
    --save-dir checkpoint \
    --skip-invalid-size-inputs-valid-test \
    --eval-bleu \
    --eval-bleu-args '{"beam": 5}' \
    --eval-bleu-remove-bpe sentencepiece \
    --eval-bleu-print-samples \
    --eval-tokenized-bleu \
    --best-checkpoint-metric bleu \
    --maximize-best-checkpoint-metric \
    --validate-interval-updates 1
```

## PR review

Anyone in the community is free to review the PR once the tests have passed. If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

## Did you have fun?

Make sure you had fun coding 🙃

Pull Request resolved: facebookresearch/fairseq#3237
Reviewed By: myleott
Differential Revision: D26429732
Pulled By: alexeib
fbshipit-source-id: bc887ce952d28541cb07dbbdc7e80e99428a6b34
…ith BLEU scores
Before submitting

- [no] Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
- [yes] Did you read the [contributor guideline](https://github.com/pytorch/fairseq/blob/master/CONTRIBUTING.md)?
- [no need] Did you make sure to update the docs?
- [no need] Did you write any new necessary tests?
What does this PR do?
Fixes bugs of evaluation with BLEU score when training with multiple GPUs. No error happens when training is not distributed.
When `--eval-bleu` is set to `True` (by default it is `False` and the best checkpoint is selected according to loss) and training uses multiple GPUs (i.e. more than one GPU participates in distributed training), the error shown in the traceback above occurs. It is caused by the fact that NumPy 1.20.1 does not support code like the following:
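The incompatible pattern from the commit message, as a minimal standalone snippet. It falls back to a CPU tensor when no GPU is available, so the `.cpu()` workaround can be exercised anywhere:

```python
import numpy as np
import torch

# On NumPy >= 1.20, wrapping a CUDA tensor in np.array(...) raises
# "TypeError: can't convert cuda:0 device type tensor to numpy. ..."
# because NumPy invokes tensor.__array__(), which calls tensor.numpy().
device = "cuda:0" if torch.cuda.is_available() else "cpu"
a = torch.tensor(0, device=device)

try:
    b = np.array([a])        # fails for CUDA tensors on NumPy >= 1.20
except TypeError:
    b = np.array([a.cpu()])  # the workaround: copy to host memory first

print(b, b.shape)
```

Either branch yields an ordinary one-element NumPy array, which is what `metrics.log_scalar` expects downstream.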
The code above raises `TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.`, but it runs without error on NumPy 1.18.1 or 1.17.0 (presumably any version below 1.20.0 is fine). However, the latest version of fairseq appears to require NumPy 1.20.0 or higher (issue #3203).
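The fix this PR lands (following the inline review suggestion) is to move any tensor result to host memory before `reduce_metrics` hands it to `np.array`. A minimal sketch of that check; the helper name `to_cpu` is mine, not fairseq's:

```python
import torch

def to_cpu(result):
    # Per the review suggestion: torch.is_tensor covers both CPU and
    # CUDA tensors, and .cpu() is a no-op for a tensor already on the
    # CPU, so no explicit device check is needed. Non-tensor values
    # (plain ints/floats from logging outputs) pass through unchanged.
    if torch.is_tensor(result):
        result = result.cpu()
    return result
```

In the PR this check sits in `fairseq/tasks/translation.py`, so the summed logging outputs behind `_bleu_counts` and `_bleu_totals` reach the `np.array(...)` calls as CPU tensors.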
Reproduce the error
Download the source code of fairseq (commit ID: 7061a0f) and run the multi-GPU training command quoted in the commit message above (4 GPUs, with `--eval-bleu` and `--best-checkpoint-metric bleu` enabled).
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃