
Fine-tuning, ValueError: DistributedDataParallel device_ids and output_device arguments only work... #81

Closed
fpservant opened this issue Jan 28, 2024 · 1 comment


@fpservant

Hi,

While trying to fine-tune colbert-ir/colbertv2.0 for the first time on my Mac, this error appears in the log:

ValueError: DistributedDataParallel device_ids and output_device arguments only work with single-device/multiple-device GPU modules or CPU modules, but got device_ids [0], output_device 0, and module parameters {device(type='cpu')}

My code is just the following, mostly copied and pasted from the README (I removed the all_documents arg):

```python
trainer = RAGTrainer(model_name="MyFineTunedColBERT",
                     pretrained_model_name="colbert-ir/colbertv2.0")  # In this example, we run fine-tuning

# This step handles all the data processing, check the examples for more details!
trainer.prepare_training_data(raw_data=pairs,
                              data_out_path="../data/",
                              # all_documents=my_full_corpus
                              )
```

(`pairs` is a list of 3581 tuples such as `('1234yf', "Le 1234yf est un bla bla...")`.)

Here is the trace:

```
...
Using config.bsize = 32 (per process) and config.accumsteps = 1
[Jan 28, 14:42:04] #> Loading the queries from ../data/queries.train.colbert.tsv ...
[Jan 28, 14:42:04] #> Got 3581 queries. All QIDs are unique.

[Jan 28, 14:42:04] #> Loading collection...
0M 
[Jan 28, 14:42:05] Loading segmented_maxsim_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
Process Process-5:
Traceback (most recent call last):
  File "/Users/fps/.pyenv/versions/3.10.11/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/Users/fps/.pyenv/versions/3.10.11/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/fps/.pyenv/versions/fps_env/lib/python3.10/site-packages/colbert/infra/launcher.py", line 128, in setup_new_process
    return_val = callee(config, *args)
  File "/Users/fps/.pyenv/versions/fps_env/lib/python3.10/site-packages/colbert/training/training.py", line 55, in train
    colbert = torch.nn.parallel.DistributedDataParallel(colbert, device_ids=[config.rank],
  File "/Users/fps/.pyenv/versions/fps_env/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 603, in __init__
    self._log_and_throw(
  File "/Users/fps/.pyenv/versions/fps_env/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 769, in _log_and_throw
    raise err_type(err_msg)
ValueError: DistributedDataParallel device_ids and output_device arguments only work with single-device/multiple-device GPU modules or CPU modules, but got device_ids [0], output_device 0, and module parameters {device(type='cpu')}.
```

This didn't stop the execution, which seems to continue (without producing any new output).

@fpservant (Author)

Ah, sorry, I only now see #69: no CPU support for training.
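Since the ValueError is raised inside a worker process and training then hangs silently, it may help to fail fast before calling the trainer on a CPU-only machine. This is only a sketch, not part of the RAGatouille API: `require_gpu` is a hypothetical helper, and in practice you would pass it `torch.cuda.is_available()`.

```python
def require_gpu(cuda_available: bool) -> None:
    """Fail fast before fine-tuning on a CPU-only machine.

    ColBERT's training step (training.py line 55 in the trace above) wraps
    the model in DistributedDataParallel with device_ids=[config.rank],
    which rejects CPU module parameters -- hence the ValueError.
    """
    if not cuda_available:
        raise RuntimeError(
            "ColBERT fine-tuning requires a CUDA GPU; CPU-only training "
            "is not supported (see issue #69)."
        )

# Hypothetical usage, assuming PyTorch is installed:
# require_gpu(torch.cuda.is_available())
```

Raising in the parent process before the multiprocessing launcher starts gives a clear, immediate error instead of a traceback from `Process-5` followed by a silent hang.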
