
DistributedDataParallel issue on .train() #23

Closed

adlindenberg opened this issue Jan 7, 2024 · 3 comments

Comments

adlindenberg commented Jan 7, 2024
I understand this may not be a RAGatouille issue, but I can't seem to get a simple training example to work. I keep running into the following error within trainer.train(...):

ValueError: DistributedDataParallel device_ids and output_device arguments only work with single-device/multiple-device GPU modules or CPU modules, but got device_ids [0], output_device 0, and module parameters {device(type='cpu')}.

Any ideas? I'm on an Apple M3 Max chip; here is the script I'm running with Python 3.10:

from ragatouille import RAGTrainer
from ragatouille.utils import get_wikipedia_page

if __name__ == "__main__":

    pairs = [
        ("Who won the premier league in 1976?", "Liverpool won the premier league in 1976."),
        ("Who was the manager for the premier league winners in 1976?", "Bob Paisley was the manager for the premier league winners in 1976."),
        ("Who was the premier league runner up in 1988-89?", "Liverpool was the premier league runner up in 1988-89."),
        ("Who has the most premier league titles?", "Manchester United has the most premier league titles."),
    ]

    my_full_corpus = [get_wikipedia_page("List_of_English_football_champions")]

    trainer = RAGTrainer(model_name="MyFineTunedColBERT", pretrained_model_name="colbert-ir/colbertv2.0") # In this example, we run fine-tuning

    trainer.prepare_training_data(raw_data=pairs, data_out_path="./data/", all_documents=my_full_corpus)

    trainer.train(batch_size=32) # Train with the default hyperparams
adlindenberg changed the title from "DistributedDataParallel issue" to "DistributedDataParallel issue on .train()" on Jan 7, 2024

bclavie (Collaborator) commented Jan 7, 2024

Hey, I'll add a disclaimer to the README to make this more obvious, but currently ColBERT can only be trained on a GPU and doesn't support mps for training, so there isn't a way to train locally on a MacBook (@okhat, with how good the Apple Silicon chips are, this could be a good "Help Wanted!" for upstream ColBERT).
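
As a side note (this is not part of RAGatouille's API, just an illustrative guard assuming PyTorch is installed), you can fail fast with a readable message on machines without a CUDA GPU instead of hitting the DistributedDataParallel ValueError:

import torch

def assert_cuda_available() -> None:
    # Hypothetical helper: ColBERT training currently expects a CUDA device,
    # so bail out early with a clear error on MPS/CPU-only machines.
    if not torch.cuda.is_available():
        raise RuntimeError(
            "ColBERT training requires a CUDA GPU; MPS/CPU (e.g. Apple Silicon) "
            "is not supported for training."
        )

assert_cuda_available()  # call this before trainer.train(...)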

adlindenberg (Author) commented Jan 7, 2024

@bclavie Got it, thank you for the details! Apologies if this is a noob question; I'll use a different machine!

bclavie (Collaborator) commented Jan 7, 2024

No problem! And (not that there's anything wrong with noob questions!) it's really on me -- this should've been documented more clearly.
