max_memory and offload_folder options not working for big models #78
Hi,

I have a GeForce RTX 3060 GPU with 12GB of VRAM. I am able to load models of up to 3B parameters and quantize them; however, I run into trouble when I try to load 6B-parameter or bigger models.

Here are the GPU details:

I try to load them using:

However, I get the error:

No matter what memory I specify in `max_memory`, I get the same error. What am I missing?

---

Comments
I find the same problem with the following:

```python
from auto_gptq import AutoGPTQForCausalLM

path_to_model = 'models/TheBloke_stable-vicuna-13B-GPTQ'
params = {
    'model_basename': 'stable-vicuna-13B-GPTQ-4bit.compat.no-act-order',
    'use_triton': False,
    'use_safetensors': True,
    'max_memory': {0: '2GiB', 'cpu': '99GiB'},
}
model = AutoGPTQForCausalLM.from_quantized(path_to_model, **params)
input()  # pause so memory usage can be inspected
```
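As a point of reference, a `max_memory` dict of this shape is what accelerate consumes when it builds a device map, and accelerate is the library AutoGPTQ relies on for offloading. A minimal sketch of that step (illustrative only, assuming accelerate and transformers are installed; this is not AutoGPTQ's exact code path):

```python
# Sketch: how a max_memory dict like the one above is typically turned into a
# device map via accelerate. Illustrative; AutoGPTQ's internals may differ.
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained('models/TheBloke_stable-vicuna-13B-GPTQ')
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(config)  # no weights allocated

device_map = infer_auto_device_map(empty_model, max_memory={0: '2GiB', 'cpu': '99GiB'})
print(device_map)  # early layers -> GPU 0 (up to ~2GiB), the rest -> 'cpu'
```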
---

I ended up having to massively reduce my example dataset in order to get it to load, because of the VRAM constraints involved. Disabling the cache-on-GPU flag didn't appear to affect this result. Also, it appears to be impossible to do a truly CPU-only run, as a lack of GPUs leads to a division by zero.
---

Hi @PanQiWei, I am still running into an error while trying to quantize large models (ones that don't fit in the 12GB of VRAM). The script ran for 8 minutes and then failed:

```python
pretrained_model_dir = "EleutherAI/gpt-j-6b"
quantized_model_dir = "EleutherAI/gpt-j-6b-4bit-128g"

quantize_config = BaseQuantizeConfig(
    bits=4,          # quantize model to 4-bit
    group_size=128,  # it is recommended to set the value to 128
)

max_memory = {0: "6GiB", 'cpu': '80GiB'}

# load un-quantized model; by default the model is loaded into CPU memory
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config, max_memory=max_memory)

# quantize model; examples should be a list of dicts whose keys can only be "input_ids" and "attention_mask"
model.quantize(examples, use_triton=False)

# save quantized model
model.save_quantized(quantized_model_dir)
```

I run into the following problem:

```
---------------------------------------------------------------------------
NotImplementedError Traceback (most recent call last)
Cell In[6], line 16
13 model.quantize(examples, use_triton=False)
15 # save quantized model
---> 16 model.save_quantized(quantized_model_dir)
18 # save quantized model using safetensors
19 model.save_quantized(quantized_model_dir, use_safetensors=True)
File /opt/anaconda3/envs/autogptq/lib/python3.9/site-packages/auto_gptq/modeling/_base.py:392, in BaseGPTQForCausalLM.save_quantized(self, save_dir, use_safetensors)
389 if not self.quantized:
390 raise EnvironmentError("can only save quantized model, please execute .quantize first.")
--> 392 self.model.to(CPU)
394 model_save_name = f"gptq_model-{self.quantize_config.bits}bit-{self.quantize_config.group_size}g"
395 if use_safetensors:
File /opt/anaconda3/envs/autogptq/lib/python3.9/site-packages/transformers/modeling_utils.py:1878, in PreTrainedModel.to(self, *args, **kwargs)
1873 raise ValueError(
1874 "`.to` is not supported for `8-bit` models. Please use the model as it is, since the"
1875 " model has already been set to the correct devices and casted to the correct `dtype`."
1876 )
1877 else:
-> 1878 return super().to(*args, **kwargs)
File /opt/anaconda3/envs/autogptq/lib/python3.9/site-packages/torch/nn/modules/module.py:1145, in Module.to(self, *args, **kwargs)
1141 return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None,
1142 non_blocking, memory_format=convert_to_format)
1143 return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
-> 1145 return self._apply(convert)
File /opt/anaconda3/envs/autogptq/lib/python3.9/site-packages/torch/nn/modules/module.py:797, in Module._apply(self, fn)
795 def _apply(self, fn):
796 for module in self.children():
--> 797 module._apply(fn)
799 def compute_should_use_set_data(tensor, tensor_applied):
800 if torch._has_compatible_shallow_copy_type(tensor, tensor_applied):
801 # If the new tensor has compatible tensor type as the existing tensor,
802 # the current behavior is to change the tensor in-place using `.data =`,
(...)
807 # global flag to let the user control whether they want the future
808 # behavior of overwriting the existing tensor or not.
File /opt/anaconda3/envs/autogptq/lib/python3.9/site-packages/torch/nn/modules/module.py:797, in Module._apply(self, fn)
795 def _apply(self, fn):
796 for module in self.children():
--> 797 module._apply(fn)
799 def compute_should_use_set_data(tensor, tensor_applied):
800 if torch._has_compatible_shallow_copy_type(tensor, tensor_applied):
801 # If the new tensor has compatible tensor type as the existing tensor,
802 # the current behavior is to change the tensor in-place using `.data =`,
(...)
807 # global flag to let the user control whether they want the future
808 # behavior of overwriting the existing tensor or not.
[... skipping similar frames: Module._apply at line 797 (1 times)]
File /opt/anaconda3/envs/autogptq/lib/python3.9/site-packages/torch/nn/modules/module.py:797, in Module._apply(self, fn)
795 def _apply(self, fn):
796 for module in self.children():
--> 797 module._apply(fn)
799 def compute_should_use_set_data(tensor, tensor_applied):
800 if torch._has_compatible_shallow_copy_type(tensor, tensor_applied):
801 # If the new tensor has compatible tensor type as the existing tensor,
802 # the current behavior is to change the tensor in-place using `.data =`,
(...)
807 # global flag to let the user control whether they want the future
808 # behavior of overwriting the existing tensor or not.
File /opt/anaconda3/envs/autogptq/lib/python3.9/site-packages/torch/nn/modules/module.py:820, in Module._apply(self, fn)
816 # Tensors stored in modules are graph leaves, and we don't want to
817 # track autograd history of `param_applied`, so we have to use
818 # `with torch.no_grad():`
819 with torch.no_grad():
--> 820 param_applied = fn(param)
821 should_use_set_data = compute_should_use_set_data(param, param_applied)
822 if should_use_set_data:
File /opt/anaconda3/envs/autogptq/lib/python3.9/site-packages/torch/nn/modules/module.py:1143, in Module.to.<locals>.convert(t)
1140 if convert_to_format is not None and t.dim() in (4, 5):
1141 return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None,
1142 non_blocking, memory_format=convert_to_format)
-> 1143 return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
NotImplementedError: Cannot copy out of meta tensor; no data!
```

Also, I see that the VRAM usage goes up to 11GB even when I have specified 6GB in the `max_memory` dict.
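The exception itself is straightforward to reproduce in isolation: offloaded weights live on PyTorch's meta device, and a meta tensor has a shape and dtype but no storage, so it cannot be copied to a real device. A minimal demonstration, independent of AutoGPTQ:

```python
import torch

# A meta tensor carries only metadata (shape, dtype); it has no underlying data.
t = torch.empty(4, 4, device="meta")
print(t.is_meta)  # True

# Trying to move it to a real device fails exactly like save_quantized() does:
t.to("cpu")  # NotImplementedError: Cannot copy out of meta tensor; no data!
```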
---

I may be mistaken, but there might be a bug in the code that distributes the weights to devices. In #47 (comment) I've shown what device map is generated when you specify `max_memory`:

```
{'model.embed_tokens': 0, 'model.layers.0': 0, 'model.layers.1': 0, 'model.layers.2': 0, 'model.layers.3': 0, 'model.layers.4': 0, 'model.layers.5': 0, 'model.layers.6': 0, 'model.layers.7': 0, 'model.layers.8': 'cpu', 'model.layers.9': 'cpu', 'model.layers.10': 'cpu', 'model.layers.11': 'cpu', 'model.layers.12': 'cpu', 'model.layers.13': 'cpu', 'model.layers.14': 'cpu', 'model.layers.15': 'cpu', 'model.layers.16': 'cpu', 'model.layers.17': 'cpu', 'model.layers.18': 'cpu', 'model.layers.19': 'cpu', 'model.layers.20': 'cpu', 'model.layers.21': 'cpu', 'model.layers.22': 'cpu', 'model.layers.23': 'cpu', 'model.layers.24': 'cpu', 'model.layers.25': 'cpu', 'model.layers.26': 'cpu', 'model.layers.27': 'cpu', 'model.layers.28': 'cpu', 'model.layers.29': 'cpu', 'model.layers.30': 'cpu', 'model.layers.31': 'cpu', 'model.layers.32': 'cpu', 'model.layers.33': 'cpu', 'model.layers.34': 'cpu', 'model.layers.35': 'cpu', 'model.layers.36': 'cpu', 'model.layers.37': 'cpu', 'model.layers.38': 'cpu', 'model.layers.39': 'cpu', 'model.norm': 'cpu', 'lm_head': 'cpu'}
```

Let's take this part of the loading code:

```python
# For each device, get the weights that go there
device_weights = {device: [] for device in devices}
for module_name, device in device_map.items():
    if device in devices:
        device_weights[device].extend([k for k in weight_names if k.startswith(module_name)])
```

This code is supposed to distribute all modules to their respective devices, so that each device ends up with the list of weight names that belong to it. And here's the problem: because the match uses `startswith`,

```
'model.layers.1': 0, 'model.layers.2': 0, 'model.layers.3': 0
```

would mean that not only layers 1-3 go to the GPU, but layers 10-39 as well. In fact, it seems like layers 10-39 go to both CPU and GPU.

Disclaimer: I have no idea how …

Also, as I've shown in #47 (comment), even if you put only …
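If this reading is right, the collision is easy to reproduce with toy data, since `'model.layers.1'` is a string prefix of `'model.layers.10'` through `'model.layers.19'` (and similarly for layers 2 and 3). A self-contained sketch; the weight names here are made up for illustration:

```python
# Toy reproduction of the prefix collision described above.
device_map = {'model.layers.1': 0, 'model.layers.10': 'cpu'}
weight_names = [
    'model.layers.1.self_attn.q_proj.weight',
    'model.layers.10.self_attn.q_proj.weight',
]
devices = [0, 'cpu']

device_weights = {device: [] for device in devices}
for module_name, device in device_map.items():
    if device in devices:
        device_weights[device].extend([k for k in weight_names if k.startswith(module_name)])

print(device_weights[0])
# ['model.layers.1.self_attn.q_proj.weight', 'model.layers.10.self_attn.q_proj.weight']
# Layer 10's weight matches the 'model.layers.1' prefix, so it lands in the GPU
# list too (and, via its own entry, in the CPU list as well).

# A common fix is to match on a module boundary instead of a bare prefix:
matches = [k for k in weight_names
           if k == 'model.layers.1' or k.startswith('model.layers.1' + '.')]
print(matches)  # ['model.layers.1.self_attn.q_proj.weight'] -- layer 10 no longer matches
```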
---

For accelerate I always have to lower the memory limit I actually feed into it to get usable results: I will tell it to use 16GB and it will load 18GB.
---

Hi! PR #100 fixed the bug where a quantized model couldn't be saved after the pretrained model was loaded with CPU offload.
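For readers hitting this before upgrading: the failure can at least be detected up front by checking for meta tensors before moving the model. A rough illustrative sketch of that idea (this is not the actual PR #100 change, just a guard around the failing `.to(CPU)` call; the helper name is made up):

```python
import torch

def has_offloaded_params(model: torch.nn.Module) -> bool:
    """Return True if any parameter is still on the meta device,
    i.e. its data lives in an offload folder rather than in RAM/VRAM."""
    return any(p.is_meta for p in model.parameters())

# Hypothetical guard around the call that crashes in save_quantized:
# if has_offloaded_params(model):
#     raise RuntimeError("model has offloaded (meta) weights; reload them before saving")
# model.to("cpu")
```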
---

Closing this issue, as the problem mentioned here has been fixed.