[Bug] trainer.distribute not working #134

Open
xyyimian opened this issue Dec 8, 2023 · 1 comment
Labels: bug (Something isn't working)

Comments

xyyimian commented Dec 8, 2023

Describe the bug

For single-GPU training, I am using train_yourtts.py. When I switch to multi-GPU, the program runs but shows no speedup. I checked the code in distribute.py and found that it only sets environment variables and starts parallel processes; it does not perform any collection or synchronization. Is this by design, or am I misusing trainer.distribute?
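
A launcher that only sets environment variables and spawns processes is a common pattern for distributed training in PyTorch: the launcher gives each process a rank, and gradient synchronization happens inside each worker, typically when the training script wraps the model in `torch.nn.parallel.DistributedDataParallel`. The sketch below illustrates that pattern with the standard library only. It is not the code in distribute.py; the function names `build_worker_envs` and `launch` and the port value are hypothetical, and the variable names follow the convention used by `torch.distributed` environment-variable initialization.

```python
import os
import subprocess
import sys


def build_worker_envs(base_env, visible_devices, master_port="29500"):
    """Return one environment dict per worker process (one per visible GPU).

    Hypothetical helper: it only prepares rank information. No gradients are
    exchanged here; that would happen inside each worker process.
    """
    device_ids = [d.strip() for d in visible_devices.split(",") if d.strip()]
    world_size = len(device_ids)
    envs = []
    for rank in range(world_size):
        env = dict(base_env)                  # copy, so workers do not share state
        env["RANK"] = str(rank)               # global rank of this process
        env["LOCAL_RANK"] = str(rank)         # single-node case: same as RANK
        env["WORLD_SIZE"] = str(world_size)   # total number of processes
        env["MASTER_ADDR"] = "localhost"      # rendezvous address (assumed)
        env["MASTER_PORT"] = master_port      # rendezvous port (assumed)
        envs.append(env)
    return envs


def launch(script, visible_devices):
    """Start one copy of `script` per GPU and wait for all of them to exit."""
    envs = build_worker_envs(os.environ, visible_devices)
    procs = [subprocess.Popen([sys.executable, script], env=env) for env in envs]
    return [p.wait() for p in procs]
```

Under this pattern, the absence of collection or sync code in the launcher does not by itself show that the workers are unsynchronized; the place to check would be the training code that each process runs.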

To Reproduce

CUDA_VISIBLE_DEVICES=0,1 python -m trainer.distribute --script recipes/vctk/yourtts/train_yourtts.py

Expected behavior

I expected a 2x speedup, but the progress is actually the same as with single-GPU training.

Logs

No response

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA H100 80GB HBM3",
            "NVIDIA H100 80GB HBM3",
            "NVIDIA H100 80GB HBM3",
            "NVIDIA H100 80GB HBM3",
            "NVIDIA H100 80GB HBM3",
            "NVIDIA H100 80GB HBM3",
            "NVIDIA H100 80GB HBM3",
            "NVIDIA H100 80GB HBM3"
        ],
        "available": true,
        "version": "12.1"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.1.1+cu121",
        "Trainer": "v0.0.34",
        "numpy": "1.22.0"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.9.18",
        "version": "#99-Ubuntu SMP Mon Oct 30 20:42:41 UTC 2023"
    }
}

Additional context

No response

xyyimian added the bug label Dec 8, 2023

erogol (Member) commented Dec 18, 2023

How do you know it doesn't work? Also, why do you expect a 2x speedup?

(I am jealous of your shiny H100s.)
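
One way to approach both questions is to compare samples processed per second rather than steps per second. In data-parallel training, each process usually keeps its own per-GPU batch size, so the time per step stays roughly the same while each step covers more data and an epoch needs fewer steps. The sketch below works through that arithmetic. All numbers and function names are hypothetical, and it assumes the dataset is split evenly across processes; whether the recipe in this issue does so is not confirmed here.

```python
import math


def samples_per_second(batch_size_per_gpu, num_gpus, seconds_per_step):
    """Throughput when every GPU processes its own batch in each step."""
    return batch_size_per_gpu * num_gpus / seconds_per_step


def steps_per_epoch(dataset_size, batch_size_per_gpu, num_gpus):
    """Steps needed to see the whole dataset once with an even split."""
    return math.ceil(dataset_size / (batch_size_per_gpu * num_gpus))


# Hypothetical figures: 64,000 samples, batch size 32 per GPU, 0.5 s per step.
single = samples_per_second(32, 1, 0.5), steps_per_epoch(64000, 32, 1)
double = samples_per_second(32, 2, 0.5), steps_per_epoch(64000, 32, 2)
print("1 GPU :", single)  # (64.0, 2000)
print("2 GPUs:", double)  # (128.0, 1000)
```

With these figures the step rate is identical in both runs, yet the two-GPU run finishes an epoch in half as many steps. In practice communication overhead makes each step somewhat slower, so the gain is usually below 2x.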
