I'm trying to run SUPIR across several GPUs (8× NVIDIA A5000), since my task requires processing large images (2500×2500 pixels and larger). LLaVA works with this library: by setting device_map = 'auto', all GPUs are used in the computation.
I tried using load_checkpoint_and_dispatch when loading the models in util.py:
...
from accelerate import load_checkpoint_and_dispatch
...

def create_SUPIR_model(config_path, SUPIR_sign=None, load_default_setting=False):
    config = OmegaConf.load(config_path)
    model = instantiate_from_config(config.model).cpu()
    print(f'Loaded model config from [{config_path}]')
    if config.SDXL_CKPT is not None:
        print(config.SDXL_CKPT)
        # model.load_state_dict(load_state_dict(config.SDXL_CKPT), strict=False)
        model = load_checkpoint_and_dispatch(model, checkpoint=config.SDXL_CKPT, device_map="auto")
    if config.SUPIR_CKPT is not None:
        print(config.SUPIR_CKPT)
        model.load_state_dict(load_state_dict(config.SUPIR_CKPT), strict=False)
        # model = load_checkpoint_and_dispatch(model, checkpoint=config.SUPIR_CKPT, device_map="auto")
    if SUPIR_sign is not None:
        assert SUPIR_sign in ['F', 'Q']
        if SUPIR_sign == 'F':
            print(config.SUPIR_CKPT_F)
            # model.load_state_dict(load_state_dict(config.SUPIR_CKPT_F), strict=False)
            model = load_checkpoint_and_dispatch(model, checkpoint=config.SUPIR_CKPT_F, device_map="auto")
        elif SUPIR_sign == 'Q':
            print(config.SUPIR_CKPT_Q)  # was printing SUPIR_CKPT_F by mistake
            # model.load_state_dict(load_state_dict(config.SUPIR_CKPT_Q), strict=False)
            model = load_checkpoint_and_dispatch(model, checkpoint=config.SUPIR_CKPT_Q, device_map="auto")
    if load_default_setting:
        default_setting = config.default_setting
        return model, default_setting
    return model
And the models load fine!
But during computation, as expected, an error appears because intermediate results from the different GPUs end up on different devices:
Traceback (most recent call last):
File "test.py", line 191, in <module>
samples = model.batchify_sample(LQ_img, captions, num_steps=args.edm_steps, restoration_scale=args.s_stage1, s_churn=args.s_churn,
File "/nvmedata/programs/anaconda3/envs/SUPIR/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/nvmedata/Mikhail/arc_upscale/SUPIR/SUPIR/models/SUPIR_model.py", line 121, in batchify_sample
c, uc = self.prepare_condition(_z, p, p_p, n_p, N)
File "/nvmedata/Mikhail/arc_upscale/SUPIR/SUPIR/models/SUPIR_model.py", line 166, in prepare_condition
c, uc = self.conditioner.get_unconditional_conditioning(batch, batch_uc)
File "/nvmedata/Mikhail/arc_upscale/SUPIR/sgm/modules/encoders/modules.py", line 185, in get_unconditional_conditioning
c = self(batch_c)
File "/nvmedata/programs/anaconda3/envs/SUPIR/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/nvmedata/programs/anaconda3/envs/SUPIR/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/nvmedata/Mikhail/arc_upscale/SUPIR/sgm/modules/encoders/modules.py", line 206, in forward
emb_out = embedder(batch[embedder.input_key])
File "/nvmedata/programs/anaconda3/envs/SUPIR/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/nvmedata/programs/anaconda3/envs/SUPIR/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/nvmedata/Mikhail/arc_upscale/SUPIR/sgm/util.py", line 59, in do_autocast
return f(*args, **kwargs)
File "/nvmedata/Mikhail/arc_upscale/SUPIR/sgm/modules/encoders/modules.py", line 560, in forward
z = self.encode_with_transformer(tokens.to(self.device))
File "/nvmedata/Mikhail/arc_upscale/SUPIR/sgm/modules/encoders/modules.py", line 570, in encode_with_transformer
x = x + self.model.positional_embedding
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:7 and cuda:6!
I need to rewrite the forward method (or some other one) so that the operands end up on the same GPU. After all, LLaVA implements this. But I haven't yet figured out how and where to do this in SUPIR. If anyone can help, that would be great!
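One pattern that usually fixes this class of error (this is not SUPIR's actual code, just a sketch of the technique): inside forward, move module-owned tensors to the input's device before combining them. A self-contained toy illustrating it, with a made-up module name standing in for the CLIP text encoder:

```python
import torch
import torch.nn as nn

class DeviceAgnosticPosEmbed(nn.Module):
    """Hypothetical stand-in for a positional-embedding module.

    The fix: instead of assuming the parameter already lives on the same
    device as the hidden states, move it there inside forward. When
    accelerate shards the model, x may arrive on cuda:6 while the
    parameter sits on cuda:7; .to(x.device) reconciles them, and it is a
    no-op when the devices already match, so single-GPU runs are unaffected.
    """
    def __init__(self, seq_len: int, dim: int):
        super().__init__()
        self.positional_embedding = nn.Parameter(torch.zeros(seq_len, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.positional_embedding.to(x.device)

emb = DeviceAgnosticPosEmbed(seq_len=4, dim=8)
out = emb(torch.ones(2, 4, 8))
print(out.shape)  # torch.Size([2, 4, 8])
```

Applied to SUPIR, the corresponding one-line change at the line from the traceback (sgm/modules/encoders/modules.py, line 570) would presumably read x = x + self.model.positional_embedding.to(x.device); any later device mismatches further down the forward pass would need the same treatment.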
I found that the developers could add this to their models themselves: huggingface/transformers#29786
But I'm not sure whether it works for SUPIR.