Using "accelerate" for multi-GPU computation (enable device_map = "auto") #130

Open
CruelBrutalMan opened this issue Jul 17, 2024 · 0 comments


CruelBrutalMan commented Jul 17, 2024

I'm trying to run SUPIR across several GPUs (8× NVIDIA A5000), since my task requires processing large images (2500×2500 pixels and larger). LLaVA can work with this library, and by setting device_map = 'auto' all GPUs are used in the computation.

I tried using load_checkpoint_and_dispatch when loading the models in util.py:

...
from accelerate import load_checkpoint_and_dispatch

...

def create_SUPIR_model(config_path, SUPIR_sign=None, load_default_setting=False):
    config = OmegaConf.load(config_path)
    model = instantiate_from_config(config.model).cpu() 
    print(f'Loaded model config from [{config_path}]')
    if config.SDXL_CKPT is not None:
        print(config.SDXL_CKPT)
        # model.load_state_dict(load_state_dict(config.SDXL_CKPT), strict=False)        
        model = load_checkpoint_and_dispatch(model, checkpoint=config.SDXL_CKPT, device_map="auto")
    if config.SUPIR_CKPT is not None:
        print(config.SUPIR_CKPT)
        model.load_state_dict(load_state_dict(config.SUPIR_CKPT), strict=False)
        #model = load_checkpoint_and_dispatch(model, checkpoint=config.SUPIR_CKPT, device_map="auto")
    if SUPIR_sign is not None:
        assert SUPIR_sign in ['F', 'Q']
        if SUPIR_sign == 'F':
            print(config.SUPIR_CKPT_F)
            # model.load_state_dict(load_state_dict(config.SUPIR_CKPT_F), strict=False)
            model = load_checkpoint_and_dispatch(model, checkpoint=config.SUPIR_CKPT_F, device_map="auto")
        elif SUPIR_sign == 'Q':
            print(config.SUPIR_CKPT_Q)
            # model.load_state_dict(load_state_dict(config.SUPIR_CKPT_Q), strict=False)
            model = load_checkpoint_and_dispatch(model, checkpoint=config.SUPIR_CKPT_Q, device_map="auto")
    if load_default_setting:
        default_setting = config.default_setting
        return model, default_setting
    return model
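One workaround worth trying (a sketch, not tested against SUPIR itself) is to let accelerate dispatch the large diffusion backbone across GPUs, but pin any submodule that mixes its own tensors internally, such as the text conditioner that crashes in the traceback below, to a single device after dispatch. The function and the submodule name here are illustrative, not SUPIR's actual API:

```python
import torch
import torch.nn as nn

def pin_submodule(model: nn.Module, name: str, device: str) -> nn.Module:
    """Move one named submodule (all its parameters and buffers) to a
    single device, so its internal ops never see mixed devices."""
    getattr(model, name).to(device)
    return model

# Toy demonstration on CPU; with 8 GPUs you might pin the conditioner
# to e.g. "cuda:0" after load_checkpoint_and_dispatch.
class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.conditioner = nn.Linear(4, 4)  # stand-in for SUPIR's conditioner
        self.unet = nn.Linear(4, 4)         # stand-in for the dispatched backbone

model = pin_submodule(Toy(), "conditioner", "cpu")
```

The inputs fed to the pinned submodule would still need to be moved to the same device before its forward call.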

And the models load, everything is fine!
But during inference, naturally, an error appears because intermediate results of the computation end up on different GPUs:

Traceback (most recent call last):
  File "test.py", line 191, in <module>
    samples = model.batchify_sample(LQ_img, captions, num_steps=args.edm_steps, restoration_scale=args.s_stage1, s_churn=args.s_churn,
  File "/nvmedata/programs/anaconda3/envs/SUPIR/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/nvmedata/Mikhail/arc_upscale/SUPIR/SUPIR/models/SUPIR_model.py", line 121, in batchify_sample
    c, uc = self.prepare_condition(_z, p, p_p, n_p, N)
  File "/nvmedata/Mikhail/arc_upscale/SUPIR/SUPIR/models/SUPIR_model.py", line 166, in prepare_condition
    c, uc = self.conditioner.get_unconditional_conditioning(batch, batch_uc)
  File "/nvmedata/Mikhail/arc_upscale/SUPIR/sgm/modules/encoders/modules.py", line 185, in get_unconditional_conditioning
    c = self(batch_c)
  File "/nvmedata/programs/anaconda3/envs/SUPIR/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/nvmedata/programs/anaconda3/envs/SUPIR/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/nvmedata/Mikhail/arc_upscale/SUPIR/sgm/modules/encoders/modules.py", line 206, in forward
    emb_out = embedder(batch[embedder.input_key])
  File "/nvmedata/programs/anaconda3/envs/SUPIR/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/nvmedata/programs/anaconda3/envs/SUPIR/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/nvmedata/Mikhail/arc_upscale/SUPIR/sgm/util.py", line 59, in do_autocast
    return f(*args, **kwargs)
  File "/nvmedata/Mikhail/arc_upscale/SUPIR/sgm/modules/encoders/modules.py", line 560, in forward
    z = self.encode_with_transformer(tokens.to(self.device))
  File "/nvmedata/Mikhail/arc_upscale/SUPIR/sgm/modules/encoders/modules.py", line 570, in encode_with_transformer
    x = x + self.model.positional_embedding
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:7 and cuda:6!

I need to rewrite the "forward" method (or some other method) so that the result ends up on a single GPU. After all, this is implemented in LLaVA. But I haven't figured out yet how and where to do this in SUPIR. If anyone can help, that would be great!
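The usual fix for this specific RuntimeError is to reconcile the two operands' devices at the failing line (`x = x + self.model.positional_embedding` in encode_with_transformer). A minimal sketch of the pattern, assuming the positional embedding can simply follow the activation's device; the helper name is hypothetical:

```python
import torch

def add_positional_embedding(x: torch.Tensor, pos_emb: torch.Tensor) -> torch.Tensor:
    """Device-safe version of `x = x + self.model.positional_embedding`:
    move the embedding to whatever device accelerate placed `x` on."""
    return x + pos_emb.to(x.device)

# CPU demonstration; with dispatched modules, x could live on cuda:7
# while pos_emb lives on cuda:6, and .to(x.device) reconciles them.
x = torch.ones(2, 3)
pos = torch.arange(3, dtype=torch.float32)
out = add_positional_embedding(x, pos)
```

The same `.to(some_tensor.device)` treatment may be needed at other cross-device ops the traceback surfaces one by one.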

I found that library developers can add this support to their models themselves:
huggingface/transformers#29786
But I'm not sure whether that applies to SUPIR.
