CUDA bfloat16 problem #122

Open
A-2-H opened this issue Oct 23, 2023 · 4 comments
A-2-H commented Oct 23, 2023

2023-10-23 10:49:30,409 WARNING: logs/HiFiSVC doesn't exist yet!
Global seed set to 594461
Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Missing logger folder: logs/HiFiSVC
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
  | Name          | Type                     | Params
-----------------------------------------------------------
0 | generator     | HiFiSinger               | 14.9 M
1 | mpd           | MultiPeriodDiscriminator | 57.5 M
2 | msd           | MultiScaleDiscriminator  | 29.6 M
3 | mel_transform | MelSpectrogram           | 0     
-----------------------------------------------------------
102 M     Trainable params
0         Non-trainable params
102 M     Total params
408.124   Total estimated model params size (MB)
Sanity Checking: 0it [00:00, ?it/s]/content/env/envs/fish_diffusion/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:442: PossibleUserWarning: The dataloader, val_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 8 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
Sanity Checking DataLoader 0:   0% 0/2 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/content/fish-diffusion/tools/hifisinger/train.py", line 83, in <module>
    trainer.fit(model, train_loader, valid_loader, ckpt_path=args.resume)
  File "/content/env/envs/fish_diffusion/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 532, in fit
    call._call_and_handle_interrupt(
  File "/content/env/envs/fish_diffusion/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/content/env/envs/fish_diffusion/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 571, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/content/env/envs/fish_diffusion/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 980, in _run
    results = self._run_stage()
  File "/content/env/envs/fish_diffusion/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1021, in _run_stage
    self._run_sanity_check()
  File "/content/env/envs/fish_diffusion/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1050, in _run_sanity_check
    val_loop.run()
  File "/content/env/envs/fish_diffusion/lib/python3.10/site-packages/pytorch_lightning/loops/utilities.py", line 181, in _decorator
    return loop_run(self, *args, **kwargs)
  File "/content/env/envs/fish_diffusion/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 115, in run
    self._evaluation_step(batch, batch_idx, dataloader_idx)
  File "/content/env/envs/fish_diffusion/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 376, in _evaluation_step
    output = call._call_strategy_hook(trainer, hook_name, *step_kwargs.values())
  File "/content/env/envs/fish_diffusion/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 294, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/content/env/envs/fish_diffusion/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 391, in validation_step
    with self.precision_plugin.val_step_context():
  File "/content/env/envs/fish_diffusion/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/content/env/envs/fish_diffusion/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 170, in val_step_context
    with self.forward_context():
  File "/content/env/envs/fish_diffusion/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/content/env/envs/fish_diffusion/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/amp.py", line 118, in forward_context
    with self.autocast_context_manager():
  File "/content/env/envs/fish_diffusion/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/amp.py", line 113, in autocast_context_manager
    return torch.autocast(self.device, dtype=torch.bfloat16 if self.precision == "bf16-mixed" else torch.half)
  File "/content/env/envs/fish_diffusion/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 234, in __init__
    raise RuntimeError('Current CUDA Device does not support bfloat16. Please switch dtype to float16.')
RuntimeError: Current CUDA Device does not support bfloat16. Please switch dtype to float16.

Google Colab
Using a T4 GPU (and other Colab GPUs as well): still the same error whenever I try to train my model.
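For context, a likely cause (not confirmed in this thread): bfloat16 CUDA kernels require an Ampere-class GPU (compute capability 8.0, e.g. A100 or RTX 30xx) or newer, while the Colab T4 is Turing (compute capability 7.5). On a real machine you can query `torch.cuda.get_device_capability()` or call `torch.cuda.is_bf16_supported()` directly. A minimal sketch of the decision, written GPU-free so it runs anywhere (the helper name `pick_precision` is hypothetical, not part of fish-diffusion):

```python
# Sketch: map a CUDA compute capability to a PyTorch Lightning precision
# string. bf16 AMP needs compute capability >= 8.0 (Ampere); older GPUs
# such as the Colab T4 (7.5) should fall back to fp16 AMP ("16-mixed").

def pick_precision(major: int, minor: int) -> str:
    """Return a Lightning precision string for the given compute capability."""
    if (major, minor) >= (8, 0):   # Ampere (A100, RTX 30xx) and newer
        return "bf16-mixed"
    return "16-mixed"              # fp16 AMP works on pre-Ampere GPUs

print(pick_precision(7, 5))  # T4 (Turing)  -> 16-mixed
print(pick_precision(8, 0))  # A100 (Ampere) -> bf16-mixed
```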

@li-henan

Hi friend, I'm running into the same error. Have you solved it? I'd sincerely appreciate any help.


A-2-H commented Apr 12, 2024

> Hi friend, I'm running into the same error. Have you solved it? I'd sincerely appreciate any help.

It's not fixed yet, but I found a workaround. After you create the environment and clone the fish-diffusion repo into your Colab, you have to edit a config file, because the Colab GPU doesn't support bfloat16, so the precision setting has to be changed (as a quick solution for now). The file is: /content/fish-diffusion/configs/base/trainers/base.py

In this file, change line 18 from 'precision="bf16-mixed",' to 'precision="16-mixed",'
and save it. Training should work now.
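After the edit, the relevant fragment of configs/base/trainers/base.py would look roughly like this (a sketch of the one field named above; the surrounding trainer settings and the exact structure of the config are assumptions, not copied from the repo):

```python
# configs/base/trainers/base.py (sketch of the relevant fragment only)
trainer = dict(
    # precision="bf16-mixed",  # original value: requires an Ampere-class GPU
    precision="16-mixed",      # fp16 AMP, supported on the Colab T4
)
```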

@li-henan

Thank you very much for your help, it works now.
I look forward to discussing the model's results with you.
Sincerely yours!

@li-henan

Dear friend, this code can fine-tune the text encoder projection layer + diffusion, or fine-tune HiFi-GAN, but have you fine-tuned ContentVec with it? For example, using different transformer layers of ContentVec to reduce the model size.

Thank you sincerely for your help.
