Using "accelerate" for multi-GPU computation (enable device_map = "auto") #130

Open
CruelBrutalMan opened this issue Jul 17, 2024 · 0 comments


CruelBrutalMan commented Jul 17, 2024

I'm trying to run SUPIR across several GPUs (8× NVIDIA A5000), since my task requires processing large images (2500×2500 pixels and larger). LLaVA can work with this library, and by setting device_map = 'auto' all GPUs are used in the computation.

I tried using load_checkpoint_and_dispatch when loading the models in util.py:

...
from accelerate import load_checkpoint_and_dispatch

...

def create_SUPIR_model(config_path, SUPIR_sign=None, load_default_setting=False):
    config = OmegaConf.load(config_path)
    model = instantiate_from_config(config.model).cpu() 
    print(f'Loaded model config from [{config_path}]')
    if config.SDXL_CKPT is not None:
        print(config.SDXL_CKPT)
        # model.load_state_dict(load_state_dict(config.SDXL_CKPT), strict=False)        
        model = load_checkpoint_and_dispatch(model, checkpoint=config.SDXL_CKPT, device_map="auto")
    if config.SUPIR_CKPT is not None:
        print(config.SUPIR_CKPT)
        model.load_state_dict(load_state_dict(config.SUPIR_CKPT), strict=False)
        #model = load_checkpoint_and_dispatch(model, checkpoint=config.SUPIR_CKPT, device_map="auto")
    if SUPIR_sign is not None:
        assert SUPIR_sign in ['F', 'Q']
        if SUPIR_sign == 'F':
            print(config.SUPIR_CKPT_F)
            # model.load_state_dict(load_state_dict(config.SUPIR_CKPT_F), strict=False)
            model = load_checkpoint_and_dispatch(model, checkpoint=config.SUPIR_CKPT_F, device_map="auto")
        elif SUPIR_sign == 'Q':
            print(config.SUPIR_CKPT_Q)
            # model.load_state_dict(load_state_dict(config.SUPIR_CKPT_Q), strict=False)
            model = load_checkpoint_and_dispatch(model, checkpoint=config.SUPIR_CKPT_Q, device_map="auto")
    if load_default_setting:
        default_setting = config.default_setting
        return model, default_setting
    return model
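One workaround worth trying (a sketch, not tested against SUPIR itself) is to let accelerate dispatch the large diffusion backbone across GPUs, but pin any submodule that mixes its own tensors internally, such as the text conditioner that crashes in the traceback below, to a single device after dispatch. The function and the submodule name here are illustrative, not SUPIR's actual API:

```python
import torch
import torch.nn as nn

def pin_submodule(model: nn.Module, name: str, device: str) -> nn.Module:
    """Move one named submodule (all its parameters and buffers) to a
    single device, so its internal ops never see mixed devices."""
    getattr(model, name).to(device)
    return model

# Toy demonstration on CPU; with 8 GPUs you might pin the conditioner
# to e.g. "cuda:0" after load_checkpoint_and_dispatch.
class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.conditioner = nn.Linear(4, 4)  # stand-in for SUPIR's conditioner
        self.unet = nn.Linear(4, 4)         # stand-in for the dispatched backbone

model = pin_submodule(Toy(), "conditioner", "cpu")
```

The inputs fed to the pinned submodule would still need to be moved to the same device before its forward call.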

And the models load, everything is fine!
But during inference, naturally, an error appears because intermediate results of the computation end up on different GPUs:

Traceback (most recent call last):
  File "test.py", line 191, in <module>
    samples = model.batchify_sample(LQ_img, captions, num_steps=args.edm_steps, restoration_scale=args.s_stage1, s_churn=args.s_churn,
  File "/nvmedata/programs/anaconda3/envs/SUPIR/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/nvmedata/Mikhail/arc_upscale/SUPIR/SUPIR/models/SUPIR_model.py", line 121, in batchify_sample
    c, uc = self.prepare_condition(_z, p, p_p, n_p, N)
  File "/nvmedata/Mikhail/arc_upscale/SUPIR/SUPIR/models/SUPIR_model.py", line 166, in prepare_condition
    c, uc = self.conditioner.get_unconditional_conditioning(batch, batch_uc)
  File "/nvmedata/Mikhail/arc_upscale/SUPIR/sgm/modules/encoders/modules.py", line 185, in get_unconditional_conditioning
    c = self(batch_c)
  File "/nvmedata/programs/anaconda3/envs/SUPIR/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/nvmedata/programs/anaconda3/envs/SUPIR/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/nvmedata/Mikhail/arc_upscale/SUPIR/sgm/modules/encoders/modules.py", line 206, in forward
    emb_out = embedder(batch[embedder.input_key])
  File "/nvmedata/programs/anaconda3/envs/SUPIR/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/nvmedata/programs/anaconda3/envs/SUPIR/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/nvmedata/Mikhail/arc_upscale/SUPIR/sgm/util.py", line 59, in do_autocast
    return f(*args, **kwargs)
  File "/nvmedata/Mikhail/arc_upscale/SUPIR/sgm/modules/encoders/modules.py", line 560, in forward
    z = self.encode_with_transformer(tokens.to(self.device))
  File "/nvmedata/Mikhail/arc_upscale/SUPIR/sgm/modules/encoders/modules.py", line 570, in encode_with_transformer
    x = x + self.model.positional_embedding
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:7 and cuda:6!

I need to rewrite the "forward" method (or some other method) so that the result ends up on a single GPU. After all, this is implemented in LLaVA. But I haven't figured out yet how and where to do this in SUPIR. If anyone can help, that would be great!
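The usual fix for this specific RuntimeError is to reconcile the two operands' devices at the failing line (`x = x + self.model.positional_embedding` in encode_with_transformer). A minimal sketch of the pattern, assuming the positional embedding can simply follow the activation's device; the helper name is hypothetical:

```python
import torch

def add_positional_embedding(x: torch.Tensor, pos_emb: torch.Tensor) -> torch.Tensor:
    """Device-safe version of `x = x + self.model.positional_embedding`:
    move the embedding to whatever device accelerate placed `x` on."""
    return x + pos_emb.to(x.device)

# CPU demonstration; with dispatched modules, x could live on cuda:7
# while pos_emb lives on cuda:6, and .to(x.device) reconciles them.
x = torch.ones(2, 3)
pos = torch.arange(3, dtype=torch.float32)
out = add_positional_embedding(x, pos)
```

The same `.to(some_tensor.device)` treatment may be needed at other cross-device ops the traceback surfaces one by one.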

I found that library developers can add this support to their models themselves:
huggingface/transformers#29786
But I'm not sure whether that applies to SUPIR.
