
DistributedDataParallel issue on .train() #23

Closed

adlindenberg opened this issue Jan 7, 2024 · 3 comments

Comments

adlindenberg commented Jan 7, 2024
I understand this may not be a RAGatouille issue, but I can't seem to get a simple training example to work. I keep running into the following error within trainer.train(...):

ValueError: DistributedDataParallel device_ids and output_device arguments only work with single-device/multiple-device GPU modules or CPU modules, but got device_ids [0], output_device 0, and module parameters {device(type='cpu')}.

Any ideas? I'm on an Apple M3 Max chip; here is the script I'm running with Python 3.10:

from ragatouille import RAGTrainer
from ragatouille.utils import get_wikipedia_page

if __name__ == "__main__":

    pairs = [
        ("Who won the premier league in 1976?", "Liverpool won the premier league in 1976."),
        ("Who was the manager for the premier league winners in 1976?", "Bob Paisley was the manager for the premier league winners in 1976."),
        ("Who was the premier league runner up in 1988-89?", "Liverpool was the premier league runner up in 1988-89."),
        ("Who has the most premier league titles?", "Manchester United has the most premier league titles."),
    ]

    my_full_corpus = [get_wikipedia_page("List_of_English_football_champions")]

    trainer = RAGTrainer(model_name="MyFineTunedColBERT", pretrained_model_name="colbert-ir/colbertv2.0") # In this example, we run fine-tuning

    trainer.prepare_training_data(raw_data=pairs, data_out_path="./data/", all_documents=my_full_corpus)

    trainer.train(batch_size=32) # Train with the default hyperparams
adlindenberg changed the title from "DistributedDataParallel issue" to "DistributedDataParallel issue on .train()" on Jan 7, 2024

bclavie (Collaborator) commented Jan 7, 2024

Hey, I'll add a disclaimer to the README to make this more obvious, but currently ColBERT can only be trained on a GPU and doesn't support mps for training, so there isn't a way to train locally on a MacBook (@okhat, with how good the Apple Silicon chips are, this could be a good "Help Wanted!" for upstream ColBERT).
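
As a side note (this is not part of RAGatouille's API, just an illustrative guard assuming PyTorch is installed), you can fail fast with a readable message on machines without a CUDA GPU instead of hitting the DistributedDataParallel ValueError:

import torch

def assert_cuda_available() -> None:
    # Hypothetical helper: ColBERT training currently expects a CUDA device,
    # so bail out early with a clear error on MPS/CPU-only machines.
    if not torch.cuda.is_available():
        raise RuntimeError(
            "ColBERT training requires a CUDA GPU; MPS/CPU (e.g. Apple Silicon) "
            "is not supported for training."
        )

assert_cuda_available()  # call this before trainer.train(...)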

adlindenberg (Author) commented Jan 7, 2024

@bclavie Got it, thank you for the details! Apologies if this is a noob question; I'll use a different machine!

bclavie (Collaborator) commented Jan 7, 2024

No problem! And (not that there's anything wrong with noob questions!) it's really on me -- this should've been documented more clearly.
