Multi-GPU training #159
Comments
Hi all, I would like to confirm this: I have the same issue with the above technology stack. |
Hey @zouharvi @maxiek0071. Can you try the linked PR and let me know if that works (if it does not, post the error trace)? You can install it like this: python -m pip install git+https://github.com/Unbabel/COMET.git@refs/pull/160/head |
Hi @BramVanroy, I installed COMET from the branch you specified, and now I'm getting a similar error for EvaluationLoop.
I suppose further adjustments are necessary. |
@maxiek0071 I've been looking at this over lunch and I have made some progress, but not enough, I believe. I do not currently have the time or patience to dig deeper into the idiosyncrasies of PyTorch Lightning (where the issue lies), but I've written up what the issue is and what someone with more experience can do to fix it in this PR: #160 (comment) So perhaps if you share that PR with your network, other people may chime in and we can solve it quickly. But I cannot dig deeper into this for now, sorry! Maybe @ricardorei has some ideas. |
Thanks @BramVanroy for your help, I appreciate it! I will first evaluate how not using encoder fine-tuning impacts the QE quality (#158). If the speed stays at 13.98it/s throughout training, it takes about 12-15h for 3-4 epochs for me. Could @ricardorei confirm that they executed |
Hi all! I'll look into this today. I had this fixed before, but PyTorch Lightning likes to change things. Maybe it's just a quick fix... Like Bram said in his PR, I think the problem is with torchmetrics. |
I updated lightning and metrics, tested multi-GPU training, and it was working. I used Please give it a try |
Use the latest version 2.1.0 |
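For anyone following along, a quick way to confirm which versions are actually installed before retrying multi-GPU training could look like the sketch below. It is only an illustration: the distribution names (`unbabel-comet`, `pytorch-lightning`, `torchmetrics`, `torch`) are assumed to be the PyPI names in your environment, and the helper functions are hypothetical, not part of COMET.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def version_tuple(version):
    """Turn '2.1.0' into (2, 1, 0); stops at the first non-numeric component."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(version, minimum):
    """True if `version` is known and is at least `minimum`."""
    return version is not None and version_tuple(version) >= version_tuple(minimum)


if __name__ == "__main__":
    # Distribution names are assumptions; adjust them to match your setup.
    for name in ("unbabel-comet", "pytorch-lightning", "torchmetrics", "torch"):
        print(f"{name}: {installed_version(name) or 'not installed'}")
    print("COMET >= 2.1.0:", meets_minimum(installed_version("unbabel-comet"), "2.1.0"))
```

The comparison is deliberately simple and only handles plain numeric versions; pre-release suffixes are cut off rather than parsed.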
Hi @ricardorei, I have just checked with this version, and I can execute training on multiple GPUs. |
Thanks, @ricardorei 🙂 |
I am attempting to run comet-train with multiple GPUs.
Command (abbreviated):
Config (abbreviated):
Output with error (abbreviated):
I'm using NVIDIA A10G GPUs and the following software versions: