
Int4 quantization throws "RuntimeError: CUDA Error: no kernel image is available for execution on the device" #56

Closed
landxman opened this issue Jul 13, 2023 · 7 comments

@landxman

Quantized to int4.
NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.0
Running "streamlit run web_demo.py" starts up fine, but as soon as I ask a question it throws the error below.

[user] 你是谁? (Who are you?)
2023-07-13 12:53:14.567 Uncaught app exception
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/streamlit/runtime/scriptrunner/script_runner.py", line 552, in _run_script
exec(code, module.__dict__)
File "/root/Baichuan-13B/web_demo.py", line 72, in <module>
main()
File "/root/Baichuan-13B/web_demo.py", line 61, in main
for response in model.chat(tokenizer, messages, stream=True):
File "/root/.cache/huggingface/modules/transformers_modules/Baichuan-13B-Chat/modeling_baichuan.py", line 527, in stream_generator
for token in self.generate(input_ids, generation_config=stream_config):
File "/usr/local/lib/python3.8/dist-packages/torch/utils/_contextlib.py", line 35, in generator_context
response = gen.send(None)
File "/usr/local/lib/python3.8/dist-packages/transformers_stream_generator/main.py", line 931, in sample_stream
outputs = self(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/Baichuan-13B-Chat/modeling_baichuan.py", line 382, in forward
outputs = self.model(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/Baichuan-13B-Chat/modeling_baichuan.py", line 325, in forward
layer_outputs = decoder_layer(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/Baichuan-13B-Chat/modeling_baichuan.py", line 178, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/Baichuan-13B-Chat/modeling_baichuan.py", line 113, in forward
proj = self.W_pack(hidden_states)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/Baichuan-13B-Chat/quantizer.py", line 116, in forward
rweight = dequant4(self.weight, self.scale, input).T
File "/root/.cache/huggingface/modules/transformers_modules/Baichuan-13B-Chat/quantizer.py", line 82, in dequant4
kernels.int4_to_fp16(
File "/usr/local/lib/python3.8/dist-packages/cpm_kernels/kernels/base.py", line 48, in call
func = self._prepare_func()
File "/usr/local/lib/python3.8/dist-packages/cpm_kernels/kernels/base.py", line 40, in _prepare_func
self._module.get_module(), self._func_name
File "/usr/local/lib/python3.8/dist-packages/cpm_kernels/kernels/base.py", line 24, in get_module
self._module[curr_device] = cuda.cuModuleLoadData(self._code)
File "/usr/local/lib/python3.8/dist-packages/cpm_kernels/library/base.py", line 94, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/cpm_kernels/library/cuda.py", line 233, in cuModuleLoadData
checkCUStatus(cuda.cuModuleLoadData(ctypes.byref(module), data))
File "/usr/local/lib/python3.8/dist-packages/cpm_kernels/library/cuda.py", line 216, in checkCUStatus
raise RuntimeError("CUDA Error: %s" % cuGetErrorString(error))
RuntimeError: CUDA Error: no kernel image is available for execution on the device
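The "no kernel image is available" error at the bottom of the trace usually means the precompiled int4 kernels do not cover this GPU's compute capability. A minimal diagnostic sketch (not from this thread) to record what the failing machine actually has:

import torch

# Print the CUDA toolkit version PyTorch was built against, the GPU model,
# and its compute capability, e.g. (8, 6) for sm_86, to compare against the
# architectures the quantization kernels were compiled for.
print("torch CUDA:", torch.version.cuda)
print("device:", torch.cuda.get_device_name(0))
print("compute capability:", torch.cuda.get_device_capability(0))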

@sun1092469590

I'm hitting the same problem: quantizing to int4 the official way, inference fails.

@jameswu2014
Collaborator

Could you paste your code?

@landxman
Author

def init_model():
    print("init model ...")
    model = AutoModelForCausalLM.from_pretrained(
        "/data/baichuan/Baichuan-13B-Chat",
        torch_dtype=torch.float16,
        trust_remote_code=True
    )
    model = model.quantize(4).cuda()
    model.generation_config = GenerationConfig.from_pretrained(
        "/data/baichuan/Baichuan-13B-Chat"
    )
    tokenizer = AutoTokenizer.from_pretrained(
        "/data/baichuan/Baichuan-13B-Chat",
    #    use_fast=False,
        trust_remote_code=True
    )
    return model, tokenizer

@jameswu2014
Collaborator

def init_model():
    print("init model ...")
    model = AutoModelForCausalLM.from_pretrained(
        "/data/baichuan/Baichuan-13B-Chat",
        torch_dtype=torch.float16,
        trust_remote_code=True
    )
    model = model.quantize(4).cuda()
    model.generation_config = GenerationConfig.from_pretrained(
        "/data/baichuan/Baichuan-13B-Chat"
    )
    tokenizer = AutoTokenizer.from_pretrained(
        "/data/baichuan/Baichuan-13B-Chat",
    #    use_fast=False,
        trust_remote_code=True
    )
    return model, tokenizer

My code is about the same as yours:
def init_model():
    model = AutoModelForCausalLM.from_pretrained(
        "baichuan-inc/Baichuan-13B-Chat",
        torch_dtype=torch.float16,
        # device_map="auto",
        trust_remote_code=True
    )
    model = model.quantize(4).cuda()
    model.generation_config = GenerationConfig.from_pretrained(
        "baichuan-inc/Baichuan-13B-Chat"
    )
    tokenizer = AutoTokenizer.from_pretrained(
        "baichuan-inc/Baichuan-13B-Chat",
        use_fast=False,
        trust_remote_code=True
    )
    return model, tokenizer

It runs fine on my side.

@bxjxxyy

bxjxxyy commented Jul 20, 2023

def init_model():
    print("init model ...")
    model = AutoModelForCausalLM.from_pretrained(
        "/data/baichuan/Baichuan-13B-Chat",
        torch_dtype=torch.float16,
        trust_remote_code=True
    )
    model = model.quantize(4).cuda()
    model.generation_config = GenerationConfig.from_pretrained(
        "/data/baichuan/Baichuan-13B-Chat"
    )
    tokenizer = AutoTokenizer.from_pretrained(
        "/data/baichuan/Baichuan-13B-Chat",
    #    use_fast=False,
        trust_remote_code=True
    )
    return model, tokenizer

My code is about the same as yours:

def init_model():
    model = AutoModelForCausalLM.from_pretrained(
        "baichuan-inc/Baichuan-13B-Chat",
        torch_dtype=torch.float16,
        # device_map="auto",
        trust_remote_code=True
    )
    model = model.quantize(4).cuda()
    model.generation_config = GenerationConfig.from_pretrained(
        "baichuan-inc/Baichuan-13B-Chat"
    )
    tokenizer = AutoTokenizer.from_pretrained(
        "baichuan-inc/Baichuan-13B-Chat",
        use_fast=False,
        trust_remote_code=True
    )
    return model, tokenizer

It runs fine on my side.

Mine is the same as yours, but it won't run; the error is the same as the OP's.
[screenshot of the same error]
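Since the same init_model code works on one machine and fails on another, the difference is likely in the environment rather than the code. A small sketch (an assumption, not posted in the thread) for comparing the relevant package versions on both machines:

from importlib.metadata import version

# Compare these (plus the GPU details noted earlier) between the machine that
# works and the one that fails. Names are the PyPI distribution names.
for pkg in ("torch", "transformers", "cpm-kernels", "transformers-stream-generator"):
    try:
        print(pkg, version(pkg))
    except Exception:
        print(pkg, "not installed")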

@dalong2hongmei

Has this been solved? I'm running into the same problem.

@shesung

shesung commented Aug 9, 2023

The kernel in quantizer.py has a problem; you can replace it with the code from chatglm2.
https://gist.github.com/shesung/3acd80c22a19d3e019553ad7e497a707
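A minimal sketch of applying that replacement, assuming the gist file has been saved locally as quantizer_from_gist.py (a hypothetical filename) and that the module cache path matches the one in the traceback above:

import shutil

# Overwrite the cached quantizer.py with the version from the gist.
# The destination path is taken from the traceback; adjust it to your setup.
shutil.copy(
    "quantizer_from_gist.py",
    "/root/.cache/huggingface/modules/transformers_modules/Baichuan-13B-Chat/quantizer.py",
)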
