
[Baichuan2 Error] : CUDA error 9 at xxx/llama.cpp/ggml-cuda.cu:6862: invalid configuration argument #3740

Closed
wzp123123 opened this issue Oct 23, 2023 · 10 comments · Fixed by #3921

Comments

@wzp123123

Pipeline

I tried to convert the Baichuan2 model to GGUF format and load it.

Step 1. Use the script from https://github.com/baichuan-inc/Baichuan2/blob/main/README_EN.md#migrating-inference-optimizations-from-baichuan-1-to-baichuan-2 to convert Baichuan2 to Baichuan1.

Step 2. Use convert.py or convert-baichuan-hf-to-gguf.py to convert the Baichuan1 checkpoint to GGUF.

Step 3. Use build/bin/quantize to quantize the GGUF model to q4_0.

Step 4. Use build/bin/main to run a prompt.

I tried both the 7B-Chat and 13B-Chat models, and both convert.py and convert-baichuan-hf-to-gguf.py.

CPU inference works fine, but the GPU run fails with the error below. I am sure I am using the latest llama.cpp version. The commands I ran are sketched below.
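To make the steps concrete, here is a minimal sketch of the pipeline. The directory names, output file names, and the ftype argument to the conversion script are assumptions (check each script's --help), and the inline Python in step 1 only mirrors what the linked migration script is described as doing (normalizing lm_head.weight) and assumes a single pytorch_model.bin checkpoint.

```bash
# step 1: normalize lm_head.weight, as the linked Baichuan2 -> Baichuan1
# migration script does (assumes a single pytorch_model.bin checkpoint)
python - <<'EOF'
import os, torch
src = "Baichuan2-7B-Chat"        # assumed source HF model directory
dst = "Baichuan2-7B-Chat-norm"   # assumed copy of src to write into
sd = torch.load(os.path.join(src, "pytorch_model.bin"), map_location="cpu")
sd["lm_head.weight"] = torch.nn.functional.normalize(sd["lm_head.weight"])
torch.save(sd, os.path.join(dst, "pytorch_model.bin"))
EOF

# step 2: convert the normalized HF checkpoint to GGUF (f16)
python convert-baichuan-hf-to-gguf.py Baichuan2-7B-Chat-norm 1

# step 3: quantize the f16 GGUF to q4_0
./build/bin/quantize Baichuan2-7B-Chat-norm/ggml-model-f16.gguf \
    ../model/gguf/baichuan2-7b-chat.Q4_0.gguf q4_0

# step 4: run a prompt with one layer offloaded to the GPU
./build/bin/main -m ../model/gguf/baichuan2-7b-chat.Q4_0.gguf \
    --prompt "赏析:白日依山尽,黄河入海流" -ngl 1
```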

Log and Error Message

build/bin/main -m ../model/gguf/baichuan2-7b-chat.Q4_0.gguf --prompt "赏析:白日依山尽,黄河入海流" -ngl 1
Log start
main: build = 1414 (96981f3)
main: built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for x86_64-linux-gnu
main: seed = 1698054643
ggml_init_cublas: found 1 CUDA devices:
Device 0: Tesla T4, compute capability 7.5
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from ../model/gguf/baichuan2-7b-chat.Q4_0.gguf (version unknown)
llama_model_loader: - tensor 0: token_embd.weight q4_0 [ 4096, 125696, 1, 1 ]
llama_model_loader: - tensor 1: blk.0.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 2: blk.0.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 3: blk.0.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 4: blk.0.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 5: blk.0.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 6: blk.0.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 7: blk.1.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 8: blk.1.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 9: blk.1.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 10: blk.1.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 11: blk.1.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 12: blk.1.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 13: blk.2.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 14: blk.2.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 15: blk.2.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 16: blk.2.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 17: blk.2.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 18: blk.2.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 19: blk.3.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 20: blk.3.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 21: blk.3.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 22: blk.3.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 23: blk.3.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 24: blk.3.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 25: blk.4.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 26: blk.4.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 27: blk.4.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 28: blk.4.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 29: blk.4.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 30: blk.4.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 31: blk.5.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 32: blk.5.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 33: blk.5.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 34: blk.5.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 35: blk.5.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 36: blk.5.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 37: blk.6.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 38: blk.6.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 39: blk.6.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 40: blk.6.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 41: blk.6.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 42: blk.6.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 43: blk.7.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 44: blk.7.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 45: blk.7.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 46: blk.7.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 47: blk.7.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 48: blk.7.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 49: blk.8.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 50: blk.8.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 51: blk.8.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 52: blk.8.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 53: blk.8.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 54: blk.8.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 55: blk.9.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 56: blk.9.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 57: blk.9.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 58: blk.9.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 59: blk.9.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 60: blk.9.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 61: blk.10.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 62: blk.10.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 63: blk.10.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 64: blk.10.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 65: blk.10.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 66: blk.10.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 67: blk.11.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 68: blk.11.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 69: blk.11.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 70: blk.11.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 71: blk.11.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 72: blk.11.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 73: blk.12.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 74: blk.12.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 75: blk.12.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 76: blk.12.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 77: blk.12.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 78: blk.12.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 79: blk.13.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 80: blk.13.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 81: blk.13.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 82: blk.13.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 83: blk.13.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 84: blk.13.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 85: blk.14.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 86: blk.14.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 87: blk.14.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 88: blk.14.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 89: blk.14.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 90: blk.14.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 91: blk.15.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 92: blk.15.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 93: blk.15.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 94: blk.15.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 95: blk.15.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 96: blk.15.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 97: blk.16.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 98: blk.16.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 99: blk.16.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 100: blk.16.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 101: blk.16.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 102: blk.16.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 103: blk.17.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 104: blk.17.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 105: blk.17.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 106: blk.17.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 107: blk.17.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 108: blk.17.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 109: blk.18.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 110: blk.18.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 111: blk.18.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 112: blk.18.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 113: blk.18.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 114: blk.18.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 115: blk.19.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 116: blk.19.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 117: blk.19.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 118: blk.19.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 119: blk.19.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 120: blk.19.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 121: blk.20.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 122: blk.20.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 123: blk.20.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 124: blk.20.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 125: blk.20.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 126: blk.20.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 127: blk.21.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 128: blk.21.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 129: blk.21.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 130: blk.21.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 131: blk.21.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 132: blk.21.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 133: blk.22.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 134: blk.22.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 135: blk.22.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 136: blk.22.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 137: blk.22.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 138: blk.22.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 139: blk.23.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 140: blk.23.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 141: blk.23.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 142: blk.23.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 143: blk.23.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 144: blk.23.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 145: blk.24.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 146: blk.24.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 147: blk.24.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 148: blk.24.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 149: blk.24.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 150: blk.24.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 151: blk.25.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 152: blk.25.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 153: blk.25.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 154: blk.25.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 155: blk.25.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 156: blk.25.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 157: blk.26.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 158: blk.26.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 159: blk.26.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 160: blk.26.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 161: blk.26.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 162: blk.26.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 163: blk.27.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 164: blk.27.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 165: blk.27.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 166: blk.27.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 167: blk.27.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 168: blk.27.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 169: blk.28.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 170: blk.28.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 171: blk.28.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 172: blk.28.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 173: blk.28.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 174: blk.28.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 175: blk.29.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 176: blk.29.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 177: blk.29.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 178: blk.29.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 179: blk.29.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 180: blk.29.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 181: blk.30.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 182: blk.30.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 183: blk.30.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 184: blk.30.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 185: blk.30.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 186: blk.30.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 187: blk.31.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 188: blk.31.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 189: blk.31.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 190: blk.31.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 191: blk.31.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 192: blk.31.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 193: output_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 194: output.weight q6_K [ 4096, 125696, 1, 1 ]
llama_model_loader: - tensor 195: blk.0.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 196: blk.0.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 197: blk.0.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 198: blk.1.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 199: blk.1.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 200: blk.1.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 201: blk.2.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 202: blk.2.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 203: blk.2.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 204: blk.3.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 205: blk.3.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 206: blk.3.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 207: blk.4.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 208: blk.4.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 209: blk.4.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 210: blk.5.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 211: blk.5.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 212: blk.5.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 213: blk.6.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 214: blk.6.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 215: blk.6.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 216: blk.7.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 217: blk.7.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 218: blk.7.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 219: blk.8.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 220: blk.8.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 221: blk.8.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 222: blk.9.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 223: blk.9.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 224: blk.9.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 225: blk.10.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 226: blk.10.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 227: blk.10.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 228: blk.11.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 229: blk.11.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 230: blk.11.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 231: blk.12.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 232: blk.12.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 233: blk.12.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 234: blk.13.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 235: blk.13.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 236: blk.13.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 237: blk.14.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 238: blk.14.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 239: blk.14.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 240: blk.15.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 241: blk.15.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 242: blk.15.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 243: blk.16.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 244: blk.16.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 245: blk.16.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 246: blk.17.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 247: blk.17.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 248: blk.17.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 249: blk.18.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 250: blk.18.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 251: blk.18.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 252: blk.19.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 253: blk.19.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 254: blk.19.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 255: blk.20.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 256: blk.20.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 257: blk.20.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 258: blk.21.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 259: blk.21.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 260: blk.21.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 261: blk.22.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 262: blk.22.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 263: blk.22.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 264: blk.23.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 265: blk.23.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 266: blk.23.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 267: blk.24.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 268: blk.24.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 269: blk.24.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 270: blk.25.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 271: blk.25.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 272: blk.25.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 273: blk.26.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 274: blk.26.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 275: blk.26.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 276: blk.27.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 277: blk.27.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 278: blk.27.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 279: blk.28.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 280: blk.28.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 281: blk.28.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 282: blk.29.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 283: blk.29.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 284: blk.29.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 285: blk.30.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 286: blk.30.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 287: blk.30.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 288: blk.31.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 289: blk.31.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 290: blk.31.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - kv 0: general.architecture str
llama_model_loader: - kv 1: general.name str
llama_model_loader: - kv 2: baichuan.tensor_data_layout str
llama_model_loader: - kv 3: baichuan.context_length u32
llama_model_loader: - kv 4: baichuan.embedding_length u32
llama_model_loader: - kv 5: baichuan.block_count u32
llama_model_loader: - kv 6: baichuan.feed_forward_length u32
llama_model_loader: - kv 7: baichuan.rope.dimension_count u32
llama_model_loader: - kv 8: baichuan.attention.head_count u32
llama_model_loader: - kv 9: baichuan.attention.head_count_kv u32
llama_model_loader: - kv 10: baichuan.attention.layer_norm_rms_epsilon f32
llama_model_loader: - kv 11: tokenizer.ggml.model str
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr
llama_model_loader: - kv 13: tokenizer.ggml.scores arr
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr
llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32
llama_model_loader: - kv 17: tokenizer.ggml.padding_token_id u32
llama_model_loader: - kv 18: general.quantization_version u32
llama_model_loader: - kv 19: general.file_type u32
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_0: 225 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: mismatch in special tokens definition ( 1298/125696 vs 259/125696 ).
llm_load_print_meta: format = unknown
llm_load_print_meta: arch = baichuan
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 125696
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 11008
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = mostly Q4_0
llm_load_print_meta: model params = 7.51 B
llm_load_print_meta: model size = 4.06 GiB (4.64 BPW)
llm_load_print_meta: general.name = Baichuan2-7B-Chat
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 1099 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.10 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 4045.48 MB
llm_load_tensors: offloading 1 repeating layers to GPU
llm_load_tensors: offloaded 1/35 layers to GPU
llm_load_tensors: VRAM used: 108.59 MB
......................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 256.00 MB
llama_new_context_with_model: compute buffer total size = 259.63 MB
llama_new_context_with_model: VRAM scratch buffer: 253.50 MB
llama_new_context_with_model: total VRAM used: 362.09 MB (model: 108.59 MB, context: 253.50 MB)

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0

赏析:白日依山尽,黄河入海流。
CUDA error 9 at /home/ubuntu/workspace/baichuan2-gguf-sagemaker/llama.cpp/ggml-cuda.cu:6862: invalid configuration argument
current device: 0

@happyme531

Exact same error with CausalLM.
Maybe (not) related: #3732

/m/n/l/llama.cpp (master) [1]> ./main -ngl 999 -i -m ../../text-generation-webui/models/causallm_14b.Q5_1.gguf
Log start
main: build = 239 (96981f3)
main: built with cc (GCC) 13.2.1 20230801 for x86_64-pc-linux-gnu
main: seed  = 1698080075
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
llama_model_loader: loaded meta data with 21 key-value pairs and 363 tensors from ../../text-generation-webui/models/causallm_14b.Q5_1.gguf (version unknown)
llama_model_loader: - tensor    0:                token_embd.weight q5_1     [  5120, 152064,     1,     1 ]
llama_model_loader: - tensor    1:              blk.0.attn_q.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    2:              blk.0.attn_k.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    3:              blk.0.attn_v.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    4:         blk.0.attn_output.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_gate.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.ffn_up.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor    7:            blk.0.ffn_down.weight q5_1     [ 13696,  5120,     1,     1 ]
llama_model_loader: - tensor    8:           blk.0.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor    9:            blk.0.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   10:              blk.1.attn_q.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   11:              blk.1.attn_k.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   12:              blk.1.attn_v.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   13:         blk.1.attn_output.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   14:            blk.1.ffn_gate.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor   15:              blk.1.ffn_up.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor   16:            blk.1.ffn_down.weight q5_1     [ 13696,  5120,     1,     1 ]
llama_model_loader: - tensor   17:           blk.1.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   18:            blk.1.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   19:              blk.2.attn_q.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   20:              blk.2.attn_k.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   21:              blk.2.attn_v.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   22:         blk.2.attn_output.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   23:            blk.2.ffn_gate.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor   24:              blk.2.ffn_up.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor   25:            blk.2.ffn_down.weight q5_1     [ 13696,  5120,     1,     1 ]
llama_model_loader: - tensor   26:           blk.2.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   27:            blk.2.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   28:              blk.3.attn_q.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   29:              blk.3.attn_k.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   30:              blk.3.attn_v.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   31:         blk.3.attn_output.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   32:            blk.3.ffn_gate.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor   33:              blk.3.ffn_up.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor   34:            blk.3.ffn_down.weight q5_1     [ 13696,  5120,     1,     1 ]
llama_model_loader: - tensor   35:           blk.3.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   36:            blk.3.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   37:              blk.4.attn_q.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   38:              blk.4.attn_k.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   39:              blk.4.attn_v.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   40:         blk.4.attn_output.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   41:            blk.4.ffn_gate.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor   42:              blk.4.ffn_up.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor   43:            blk.4.ffn_down.weight q5_1     [ 13696,  5120,     1,     1 ]
llama_model_loader: - tensor   44:           blk.4.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   45:            blk.4.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   46:              blk.5.attn_q.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   47:              blk.5.attn_k.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   48:              blk.5.attn_v.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   49:         blk.5.attn_output.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   50:            blk.5.ffn_gate.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor   51:              blk.5.ffn_up.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor   52:            blk.5.ffn_down.weight q5_1     [ 13696,  5120,     1,     1 ]
llama_model_loader: - tensor   53:           blk.5.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   54:            blk.5.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   55:              blk.6.attn_q.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   56:              blk.6.attn_k.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   57:              blk.6.attn_v.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   58:         blk.6.attn_output.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   59:            blk.6.ffn_gate.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor   60:              blk.6.ffn_up.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor   61:            blk.6.ffn_down.weight q5_1     [ 13696,  5120,     1,     1 ]
llama_model_loader: - tensor   62:           blk.6.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   63:            blk.6.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   64:              blk.7.attn_q.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   65:              blk.7.attn_k.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   66:              blk.7.attn_v.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   67:         blk.7.attn_output.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   68:            blk.7.ffn_gate.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor   69:              blk.7.ffn_up.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor   70:            blk.7.ffn_down.weight q5_1     [ 13696,  5120,     1,     1 ]
llama_model_loader: - tensor   71:           blk.7.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   72:            blk.7.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   73:              blk.8.attn_q.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   74:              blk.8.attn_k.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   75:              blk.8.attn_v.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   76:         blk.8.attn_output.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   77:            blk.8.ffn_gate.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor   78:              blk.8.ffn_up.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor   79:            blk.8.ffn_down.weight q5_1     [ 13696,  5120,     1,     1 ]
llama_model_loader: - tensor   80:           blk.8.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   81:            blk.8.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   82:              blk.9.attn_q.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   83:              blk.9.attn_k.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   84:              blk.9.attn_v.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   85:         blk.9.attn_output.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   86:            blk.9.ffn_gate.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor   87:              blk.9.ffn_up.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor   88:            blk.9.ffn_down.weight q5_1     [ 13696,  5120,     1,     1 ]
llama_model_loader: - tensor   89:           blk.9.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   90:            blk.9.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   91:             blk.10.attn_q.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   92:             blk.10.attn_k.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   93:             blk.10.attn_v.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   94:        blk.10.attn_output.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   95:           blk.10.ffn_gate.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor   96:             blk.10.ffn_up.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor   97:           blk.10.ffn_down.weight q5_1     [ 13696,  5120,     1,     1 ]
llama_model_loader: - tensor   98:          blk.10.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   99:           blk.10.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  100:             blk.11.attn_q.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  101:             blk.11.attn_k.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  102:             blk.11.attn_v.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  103:        blk.11.attn_output.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  104:           blk.11.ffn_gate.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  105:             blk.11.ffn_up.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  106:           blk.11.ffn_down.weight q5_1     [ 13696,  5120,     1,     1 ]
llama_model_loader: - tensor  107:          blk.11.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  108:           blk.11.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  109:             blk.12.attn_q.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  110:             blk.12.attn_k.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  111:             blk.12.attn_v.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  112:        blk.12.attn_output.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  113:           blk.12.ffn_gate.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  114:             blk.12.ffn_up.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  115:           blk.12.ffn_down.weight q5_1     [ 13696,  5120,     1,     1 ]
llama_model_loader: - tensor  116:          blk.12.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  117:           blk.12.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  118:             blk.13.attn_q.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  119:             blk.13.attn_k.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  120:             blk.13.attn_v.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  121:        blk.13.attn_output.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  122:           blk.13.ffn_gate.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  123:             blk.13.ffn_up.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  124:           blk.13.ffn_down.weight q5_1     [ 13696,  5120,     1,     1 ]
llama_model_loader: - tensor  125:          blk.13.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  126:           blk.13.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  127:             blk.14.attn_q.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  128:             blk.14.attn_k.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  129:             blk.14.attn_v.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  130:        blk.14.attn_output.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  131:           blk.14.ffn_gate.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  132:             blk.14.ffn_up.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  133:           blk.14.ffn_down.weight q5_1     [ 13696,  5120,     1,     1 ]
llama_model_loader: - tensor  134:          blk.14.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  135:           blk.14.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  136:             blk.15.attn_q.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  137:             blk.15.attn_k.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  138:             blk.15.attn_v.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  139:        blk.15.attn_output.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  140:           blk.15.ffn_gate.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  141:             blk.15.ffn_up.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  142:           blk.15.ffn_down.weight q5_1     [ 13696,  5120,     1,     1 ]
llama_model_loader: - tensor  143:          blk.15.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  144:           blk.15.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  145:             blk.16.attn_q.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  146:             blk.16.attn_k.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  147:             blk.16.attn_v.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  148:        blk.16.attn_output.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  149:           blk.16.ffn_gate.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  150:             blk.16.ffn_up.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  151:           blk.16.ffn_down.weight q5_1     [ 13696,  5120,     1,     1 ]
llama_model_loader: - tensor  152:          blk.16.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  153:           blk.16.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  154:             blk.17.attn_q.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  155:             blk.17.attn_k.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  156:             blk.17.attn_v.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  157:        blk.17.attn_output.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  158:           blk.17.ffn_gate.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  159:             blk.17.ffn_up.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  160:           blk.17.ffn_down.weight q5_1     [ 13696,  5120,     1,     1 ]
llama_model_loader: - tensor  161:          blk.17.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  162:           blk.17.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  163:             blk.18.attn_q.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  164:             blk.18.attn_k.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  165:             blk.18.attn_v.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  166:        blk.18.attn_output.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  167:           blk.18.ffn_gate.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  168:             blk.18.ffn_up.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  169:           blk.18.ffn_down.weight q5_1     [ 13696,  5120,     1,     1 ]
llama_model_loader: - tensor  170:          blk.18.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  171:           blk.18.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  172:             blk.19.attn_q.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  173:             blk.19.attn_k.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  174:             blk.19.attn_v.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  175:        blk.19.attn_output.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  176:           blk.19.ffn_gate.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  177:             blk.19.ffn_up.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  178:           blk.19.ffn_down.weight q5_1     [ 13696,  5120,     1,     1 ]
llama_model_loader: - tensor  179:          blk.19.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  180:           blk.19.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  181:             blk.20.attn_q.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  182:             blk.20.attn_k.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  183:             blk.20.attn_v.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  184:        blk.20.attn_output.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  185:           blk.20.ffn_gate.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  186:             blk.20.ffn_up.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  187:           blk.20.ffn_down.weight q5_1     [ 13696,  5120,     1,     1 ]
llama_model_loader: - tensor  188:          blk.20.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  189:           blk.20.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  190:             blk.21.attn_q.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  191:             blk.21.attn_k.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  192:             blk.21.attn_v.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  193:        blk.21.attn_output.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  194:           blk.21.ffn_gate.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  195:             blk.21.ffn_up.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  196:           blk.21.ffn_down.weight q5_1     [ 13696,  5120,     1,     1 ]
llama_model_loader: - tensor  197:          blk.21.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  198:           blk.21.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  199:             blk.22.attn_q.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  200:             blk.22.attn_k.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  201:             blk.22.attn_v.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  202:        blk.22.attn_output.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  203:           blk.22.ffn_gate.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  204:             blk.22.ffn_up.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  205:           blk.22.ffn_down.weight q5_1     [ 13696,  5120,     1,     1 ]
llama_model_loader: - tensor  206:          blk.22.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  207:           blk.22.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  208:             blk.23.attn_q.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  209:             blk.23.attn_k.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  210:             blk.23.attn_v.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  211:        blk.23.attn_output.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  212:           blk.23.ffn_gate.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  213:             blk.23.ffn_up.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  214:           blk.23.ffn_down.weight q5_1     [ 13696,  5120,     1,     1 ]
llama_model_loader: - tensor  215:          blk.23.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  216:           blk.23.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  217:             blk.24.attn_q.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  218:             blk.24.attn_k.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  219:             blk.24.attn_v.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  220:        blk.24.attn_output.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  221:           blk.24.ffn_gate.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  222:             blk.24.ffn_up.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  223:           blk.24.ffn_down.weight q5_1     [ 13696,  5120,     1,     1 ]
llama_model_loader: - tensor  224:          blk.24.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  225:           blk.24.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  226:             blk.25.attn_q.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  227:             blk.25.attn_k.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  228:             blk.25.attn_v.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  229:        blk.25.attn_output.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  230:           blk.25.ffn_gate.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  231:             blk.25.ffn_up.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  232:           blk.25.ffn_down.weight q5_1     [ 13696,  5120,     1,     1 ]
llama_model_loader: - tensor  233:          blk.25.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  234:           blk.25.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  235:             blk.26.attn_q.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  236:             blk.26.attn_k.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  237:             blk.26.attn_v.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  238:        blk.26.attn_output.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  239:           blk.26.ffn_gate.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  240:             blk.26.ffn_up.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  241:           blk.26.ffn_down.weight q5_1     [ 13696,  5120,     1,     1 ]
llama_model_loader: - tensor  242:          blk.26.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  243:           blk.26.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  244:             blk.27.attn_q.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  245:             blk.27.attn_k.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  246:             blk.27.attn_v.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  247:        blk.27.attn_output.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  248:           blk.27.ffn_gate.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  249:             blk.27.ffn_up.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  250:           blk.27.ffn_down.weight q5_1     [ 13696,  5120,     1,     1 ]
llama_model_loader: - tensor  251:          blk.27.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  252:           blk.27.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  253:             blk.28.attn_q.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  254:             blk.28.attn_k.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  255:             blk.28.attn_v.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  256:        blk.28.attn_output.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  257:           blk.28.ffn_gate.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  258:             blk.28.ffn_up.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  259:           blk.28.ffn_down.weight q5_1     [ 13696,  5120,     1,     1 ]
llama_model_loader: - tensor  260:          blk.28.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  261:           blk.28.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  262:             blk.29.attn_q.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  263:             blk.29.attn_k.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  264:             blk.29.attn_v.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  265:        blk.29.attn_output.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  266:           blk.29.ffn_gate.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  267:             blk.29.ffn_up.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  268:           blk.29.ffn_down.weight q5_1     [ 13696,  5120,     1,     1 ]
llama_model_loader: - tensor  269:          blk.29.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  270:           blk.29.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  271:             blk.30.attn_q.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  272:             blk.30.attn_k.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  273:             blk.30.attn_v.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  274:        blk.30.attn_output.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  275:           blk.30.ffn_gate.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  276:             blk.30.ffn_up.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  277:           blk.30.ffn_down.weight q5_1     [ 13696,  5120,     1,     1 ]
llama_model_loader: - tensor  278:          blk.30.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  279:           blk.30.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  280:             blk.31.attn_q.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  281:             blk.31.attn_k.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  282:             blk.31.attn_v.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  283:        blk.31.attn_output.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  284:           blk.31.ffn_gate.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  285:             blk.31.ffn_up.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  286:           blk.31.ffn_down.weight q5_1     [ 13696,  5120,     1,     1 ]
llama_model_loader: - tensor  287:          blk.31.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  288:           blk.31.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  289:             blk.32.attn_q.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  290:             blk.32.attn_k.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  291:             blk.32.attn_v.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  292:        blk.32.attn_output.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  293:           blk.32.ffn_gate.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  294:             blk.32.ffn_up.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  295:           blk.32.ffn_down.weight q5_1     [ 13696,  5120,     1,     1 ]
llama_model_loader: - tensor  296:          blk.32.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  297:           blk.32.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  298:             blk.33.attn_q.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  299:             blk.33.attn_k.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  300:             blk.33.attn_v.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  301:        blk.33.attn_output.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  302:           blk.33.ffn_gate.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  303:             blk.33.ffn_up.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  304:           blk.33.ffn_down.weight q5_1     [ 13696,  5120,     1,     1 ]
llama_model_loader: - tensor  305:          blk.33.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  306:           blk.33.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  307:             blk.34.attn_q.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  308:             blk.34.attn_k.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  309:             blk.34.attn_v.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  310:        blk.34.attn_output.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  311:           blk.34.ffn_gate.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  312:             blk.34.ffn_up.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  313:           blk.34.ffn_down.weight q5_1     [ 13696,  5120,     1,     1 ]
llama_model_loader: - tensor  314:          blk.34.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  315:           blk.34.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  316:             blk.35.attn_q.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  317:             blk.35.attn_k.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  318:             blk.35.attn_v.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  319:        blk.35.attn_output.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  320:           blk.35.ffn_gate.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  321:             blk.35.ffn_up.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  322:           blk.35.ffn_down.weight q5_1     [ 13696,  5120,     1,     1 ]
llama_model_loader: - tensor  323:          blk.35.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  324:           blk.35.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  325:             blk.36.attn_q.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  326:             blk.36.attn_k.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  327:             blk.36.attn_v.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  328:        blk.36.attn_output.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  329:           blk.36.ffn_gate.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  330:             blk.36.ffn_up.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  331:           blk.36.ffn_down.weight q5_1     [ 13696,  5120,     1,     1 ]
llama_model_loader: - tensor  332:          blk.36.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  333:           blk.36.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  334:             blk.37.attn_q.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  335:             blk.37.attn_k.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  336:             blk.37.attn_v.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  337:        blk.37.attn_output.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  338:           blk.37.ffn_gate.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  339:             blk.37.ffn_up.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  340:           blk.37.ffn_down.weight q5_1     [ 13696,  5120,     1,     1 ]
llama_model_loader: - tensor  341:          blk.37.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  342:           blk.37.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  343:             blk.38.attn_q.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  344:             blk.38.attn_k.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  345:             blk.38.attn_v.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  346:        blk.38.attn_output.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  347:           blk.38.ffn_gate.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  348:             blk.38.ffn_up.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  349:           blk.38.ffn_down.weight q5_1     [ 13696,  5120,     1,     1 ]
llama_model_loader: - tensor  350:          blk.38.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  351:           blk.38.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  352:             blk.39.attn_q.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  353:             blk.39.attn_k.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  354:             blk.39.attn_v.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  355:        blk.39.attn_output.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  356:           blk.39.ffn_gate.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  357:             blk.39.ffn_up.weight q5_1     [  5120, 13696,     1,     1 ]
llama_model_loader: - tensor  358:           blk.39.ffn_down.weight q5_1     [ 13696,  5120,     1,     1 ]
llama_model_loader: - tensor  359:          blk.39.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  360:           blk.39.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  361:               output_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  362:                    output.weight q6_K     [  5120, 152064,     1,     1 ]
llama_model_loader: - kv   0:                       general.architecture str     
llama_model_loader: - kv   1:                               general.name str     
llama_model_loader: - kv   2:                       llama.context_length u32     
llama_model_loader: - kv   3:                     llama.embedding_length u32     
llama_model_loader: - kv   4:                          llama.block_count u32     
llama_model_loader: - kv   5:                  llama.feed_forward_length u32     
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32     
llama_model_loader: - kv   7:                 llama.attention.head_count u32     
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32     
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32     
llama_model_loader: - kv  10:                       llama.rope.freq_base f32     
llama_model_loader: - kv  11:                          general.file_type u32     
llama_model_loader: - kv  12:                       tokenizer.ggml.model str     
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr     
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr     
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr     
llama_model_loader: - kv  16:                      tokenizer.ggml.merges arr     
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32     
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32     
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32     
llama_model_loader: - kv  20:               general.quantization_version u32     
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type q5_1:  281 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: mismatch in special tokens definition ( 421/152064 vs 213/152064 ).
llm_load_print_meta: format           = unknown
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 152064
llm_load_print_meta: n_merges         = 109170
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_head           = 40
llm_load_print_meta: n_head_kv        = 40
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 13696
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type       = 13B
llm_load_print_meta: model ftype      = mostly Q5_1
llm_load_print_meta: model params     = 14.17 B
llm_load_print_meta: model size       = 9.95 GiB (6.03 BPW) 
llm_load_print_meta: general.name   = causallm_14b
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151643 '<|endoftext|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token  = 128 'Ä'
llm_load_tensors: ggml ctx size =    0.12 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  =  557.00 MB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 43/43 layers to GPU
llm_load_tensors: VRAM used: 9629.41 MB
...........................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 400.00 MB
llama_new_context_with_model: kv self size  =  400.00 MB
llama_new_context_with_model: compute buffer total size = 313.13 MB
llama_new_context_with_model: VRAM scratch buffer: 307.00 MB
llama_new_context_with_model: total VRAM used: 10336.41 MB (model: 9629.41 MB, context: 707.00 MB)

system_info: n_threads = 12 / 24 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
main: interactive mode on.
sampling: 
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

<|endoftext|>

你好
CUDA error 9 at ggml-cuda.cu:6862: invalid configuration argument
current device: 0

@JeremyBickel
Copy link

I'm getting an error from the same line of code, but mine says "no kernel image is available for execution on the device" instead of "invalid configuration argument".
SciSharp/LLamaSharp#189

@RoacherM
Copy link

Same error:
llm_load_vocab: mismatch in special tokens definition ( 9/100008 vs 8/100008 ).
llm_load_print_meta: format = unknown
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 100008
llm_load_print_meta: n_merges = 99743
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 6144
llm_load_print_meta: n_head = 48
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 60
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 6
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 24576
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 0.25
llm_load_print_meta: model type = 30B
llm_load_print_meta: model ftype = mostly Q4_K - Medium
llm_load_print_meta: model params = 33.69 B
llm_load_print_meta: model size = 18.93 GiB (4.83 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 0 '<|endoftext|>'
llm_load_print_meta: EOS token = 0 '<|endoftext|>'
llm_load_print_meta: UNK token = 0 '<|endoftext|>'
llm_load_print_meta: LF token = 129 'Ä'
llm_load_tensors: ggml ctx size = 0.18 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 13118.48 MB
llm_load_tensors: offloading 20 repeating layers to GPU
llm_load_tensors: offloaded 20/63 layers to GPU
llm_load_tensors: VRAM used: 6270.00 MB
..................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 0.25
llama_new_context_with_model: kv self size = 120.00 MB
llama_new_context_with_model: compute buffer total size = 213.45 MB
llama_new_context_with_model: VRAM scratch buffer: 207.33 MB
llama_new_context_with_model: total VRAM used: 6477.33 MB (model: 6270.00 MB, context: 207.33 MB)

system_info: n_threads = 16 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
main: interactive mode on.
Reverse prompt: 'User:'
sampling: repeat_last_n = 64, repeat_penalty = 1.000000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 512, n_keep = 0

== Running in interactive mode. ==

  • Press Ctrl+C to interject at any time.
  • Press Return to return control to LLaMa.
  • To return control without starting a new line, end your input with '/'.
  • If you want to submit another line, end your input with '\'.

Transcript of a dialog, where the User interacts with an Assistant named Bob. Bob is helpful, kind, honest, good at writing, and never fails to answer the User's requests immediately and with precision.

User: Hello, Bob.
Bob: Hello. How may I help you today?
User: Please tell me the largest city in Europe.
Bob: Sure. The largest city in Europe is Moscow, the capital of Russia.
User:你好
Bob
CUDA error 9 at ggml-cuda.cu:6862: invalid configuration argument
current device: 0

@slaren
Copy link
Collaborator

slaren commented Oct 25, 2023

A call stack and the kernel launch parameters that cause this error would help get this fixed quicker.

@happyme531
Copy link

happyme531 commented Oct 26, 2023

A call stack and the kernel launch parameters that cause this error would help get this fixed quicker.

Here it is.

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

<|endoftext|>
warning: Cuda API error detected: cudaLaunchKernel returned (0x9)

warning: Cuda API error detected: cudaGetLastError returned (0x9)


Thread 1 "main" hit Breakpoint 1, ggml_cuda_op_mul_mat (src0=0x7fff9ba002e0, src1=0x7ff99e918dc0, dst=0x7ff99e918f20, op=0x5555556bcb9b <ggml_cuda_op_mul_mat_vec_q(ggml_tensor const*, ggml_tensor const*, ggml_tensor*, char const*, float const*, char const*, float*, long, long, long, long, CUstream_st* const&)>, convert_src1_to_q8_1=true) at /mnt/ntfs/llm/llama.cpp/ggml-cuda.cu:6863
6863                        printf("Error in mul_mat\n");}
(cuda-gdb) bt
#0  ggml_cuda_op_mul_mat (src0=0x7fff9ba002e0, src1=0x7ff99e918dc0, dst=0x7ff99e918f20, 
    op=0x5555556bcb9b <ggml_cuda_op_mul_mat_vec_q(ggml_tensor const*, ggml_tensor const*, ggml_tensor*, char const*, float const*, char const*, float*, long, long, long, long, CUstream_st* const&)>, convert_src1_to_q8_1=true)
    at /mnt/ntfs/llm/llama.cpp/ggml-cuda.cu:6863
#1  0x00005555556a7c87 in ggml_cuda_mul_mat (src0=0x7fff9ba002e0, src1=0x7ff99e918dc0, dst=0x7ff99e918f20) at /mnt/ntfs/llm/llama.cpp/ggml-cuda.cu:7077
#2  0x00005555556a9fad in ggml_cuda_compute_forward (params=0x7ffffffead10, tensor=0x7ff99e918f20) at /mnt/ntfs/llm/llama.cpp/ggml-cuda.cu:7550
#3  0x00005555555a901a in ggml_compute_forward (params=0x7ffffffead10, tensor=0x7ff99e918f20) at ggml.c:16606
#4  0x00005555555ae0eb in ggml_graph_compute_thread (data=0x7ffffffead60) at ggml.c:18327
#5  0x00005555555af9b0 in ggml_graph_compute (cgraph=0x7ff99e800020, cplan=0x7ffffffeae60) at ggml.c:18903
#6  0x00005555555bd5a4 in ggml_graph_compute_helper (buf=..., graph=0x7ff99e800020, n_threads=1) at llama.cpp:568
#7  0x00005555555dea55 in llama_decode_internal (lctx=..., batch=...) at llama.cpp:6006
#8  0x00005555555ebcd3 in llama_decode (ctx=0x555556e3a4a0, batch=...) at llama.cpp:9656
#9  0x000055555556696f in main (argc=6, argv=0x7fffffffce58) at examples/main/main.cpp:584
(cuda-gdb) info cuda kernels
No CUDA kernels.
(cuda-gdb) info locals
i02 = 0
src1_ddq_i_offset = 0
src0_dd_i = 0x7ff9c8000000 ""
src1_ddf_i = 0x7ff96a000000
dst_dd_i = 0x7ff97d600000
i03 = 0
src1_ddq_i = 0x7fff9b9bfc00 ""
i0 = 0
src1_on_device = true
dst_on_device = false
row_diff = 152064
stream = 0x555556578a70
id = 0
is = 0
src1_ncols = 1
src1_col_0 = 0
ne00 = 5120
ne01 = 152064
ne02 = 1
ne03 = 1
nrows0 = 152064
ne10 = 5120
ne11 = 1
ne12 = 1
ne13 = 1
nrows1 = 1
ne0 = 152064
ne1 = 1
nb2 = 608256
nb3 = 608256
i02_divisor = 1
src0_ts = 210
src0_bs = 256
q8_1_ts = 36
q8_1_bs = 32
src0_extra = 0x55555b576660
src1_extra = 0x7fffbc91f390
dst_extra = 0x0
src0_on_device = true
src0_is_contiguous = true
src1_is_contiguous = true
src1_padded_col_size = 5120
split = true
src0_dd = {0x7ff9c8000000 "", 0x0 <repeats 15 times>}
src1_ddf = {0x7ff96a000000, 0x0 <repeats 15 times>}
src1_ddq = {0x7fff9b9bfc00 "", 0x0 <repeats 15 times>}
dst_dd = {0x7ff97d600000, 0x0 <repeats 15 times>}
src0_as = {0 <repeats 16 times>}
src1_asf = {0 <repeats 16 times>}
src1_asq = {17408, 0 <repeats 15 times>}
dst_as = {5504256, 0 <repeats 15 times>}
--Type <RET> for more, q to quit, c to continue without paging--
row_low = {0, 0, 93825000467376, 0, 4294967306, 140710083923392, 140735804342656, 140737239086272, 5120, 1, 93825093053984, 93825093053984, 140737488268256, 93825028256304, 0, 93824993610497}
row_high = {152064, 140710083923392, 140735804342656, 140710083923040, 128, 282578800082984, 4, 512, 512, 0, 0, 0, 93825009158768, 140709201969152, 140735804015616, 140709201969152}
src1_col_stride = 1

@slaren
Copy link
Collaborator

slaren commented Oct 27, 2023

The problem here seems to be that n_vocab is very large, and this value is used as the y dimension of the kernel launch grid, which has a maximum of 65535. The correct fix would be to move this to the x dimension, which has no such limit. As a workaround, building with a higher value of LLAMA_CUDA_MMV_Y may fix this: try adding LLAMA_CUDA_MMV_Y=4 to the make or cmake command line, since it acts as a divisor of the number of blocks launched along y.
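
For illustration, here is a minimal standalone CUDA sketch of that limit (hypothetical code, not taken from llama.cpp; the kernel and file names are made up): launching with the row count seen in the locals above (152064) as the grid's y dimension fails with error 9, while the same count in the x dimension launches fine on any GPU of compute capability 3.0 or newer.

// grid_limit_demo.cu -- build with: nvcc grid_limit_demo.cu -o grid_limit_demo
#include <cstdio>
#include <cuda_runtime.h>

__global__ void touch(int *out) {
    // trivial kernel; only the launch configuration matters for this demo
    if (blockIdx.x == 0 && blockIdx.y == 0 && threadIdx.x == 0) {
        *out = 1;
    }
}

int main() {
    int *out;
    cudaMalloc(&out, sizeof(int));

    const int nrows = 152064; // e.g. the n_vocab-sized row count from the locals above

    // rows mapped to grid.y: exceeds the 65535 limit -> "invalid configuration argument" (error 9)
    touch<<<dim3(1, nrows, 1), 32>>>(out);
    printf("grid.y = %d -> %s\n", nrows, cudaGetErrorString(cudaGetLastError()));

    // rows mapped to grid.x: the limit there is 2^31 - 1, so this launch succeeds
    touch<<<dim3(nrows, 1, 1), 32>>>(out);
    printf("grid.x = %d -> %s\n", nrows, cudaGetErrorString(cudaGetLastError()));

    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}

Under that reading, a larger LLAMA_CUDA_MMV_Y shrinks the number of blocks along y because each block covers more rows, which is presumably why the workaround helps; rebuilding with something like make clean && make LLAMA_CUBLAS=1 LLAMA_CUDA_MMV_Y=4 (assuming a make-based cuBLAS build) is enough to test it.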

@ArtyomZemlyak
Copy link

@slaren this also worked for CausalLM on GPU! (I had the same issue earlier.)

@wzp123123
Copy link
Author

I tested Baichuan2 and it now works well on GPU, thanks a lot! @slaren

@slaren
Copy link
Collaborator

slaren commented Oct 30, 2023

We still need to fix this properly, though; this is just a workaround.

@zolastro
Copy link

zolastro commented Nov 2, 2023

Not sure if I did something wrong, but after recompiling with the LLAMA_CUDA_MMV_Y=4 flag and loading the Bloom model on GPU correctly, it runs very slowly compared to other models. In some tests against Falcon 7B and LLaMA v2, it runs about 30x slower on GPU with q8_0 quantization. Is this expected given this workaround?
