
MPT support in llama.cpp #3417

Merged
merged 17 commits into master on Oct 10, 2023

Conversation

jploski
Contributor

@jploski commented Sep 30, 2023

As per #1333 (comment)

Some comments regarding this initial implementation:

  • It's based on the Falcon integration as much as possible (copy-paste), with only the required MPT changes ported from ggml/examples/mpt; this certainly makes it perform well (in terms of time per token) and hopefully will make refactoring out common code easier. It's slightly confusing because MPT does not use GQA, but I kept the references intact (just remember that n_head_kv == n_head).
  • Only tested with mpt-7b-storywriter (plausible output) and mpt-mini-shakespeare (token-by-token match for a couple of example prompts).
  • The conversion script duplicates the token embeddings tensor ("wte") to serve as the "output" tensor; this makes the GGUF file ~150 MB bigger than necessary (for the 7 GB model) and probably also wastes RAM. This allowed me to avoid what seemed like tricky changes to llama.cpp's internals (as I understand it, "wte" weights reside on CPU, while "output" is on GPU). But it might also mean the GGUF format for MPT is not final / not ready for publishing. (See the sketch after this list.)
  • The conversion script expands vocab_size to match the size of the embeddings tensor; in MPT the embeddings tensor is oversized (it contains rows for some extra unused tokens), which causes problems in llama.cpp if not addressed.
  • Guesswork code around the application of alibi and n_kv; as I did not fully understand the recent changes from llama : custom attention mask + parallel decoding + no context swaps #3228, I just made it a priority that the assertion ne1 == ne0 + n_past does not fail in compute_forward_alibi; needs review.
  • I did not implement the "rope_shift". I'm not sure whether it's needed when there is no RoPE, although I suspect it may be. If so, this should only affect generation of outputs longer than the context length specified by the -c parameter; I did not notice any obviously garbage output because of that.
  • The GPU version fails to load mpt-mini-shakespeare because of some failed assertion. Surprisingly, this problem does not occur for 7B.
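For illustration only, here is a rough Python sketch of the two conversion-script workarounds mentioned above (expanding the vocab to the embedding tensor's size and duplicating "wte" as the "output" tensor). The writer calls follow the gguf-py API used by the convert-*-hf-to-gguf.py scripts, but the helper itself and its exact behaviour are an assumption, not the script's actual code:

import numpy as np
import gguf  # gguf-py writer, as used by the convert-*-hf-to-gguf.py scripts

def write_mpt_embeddings(writer: gguf.GGUFWriter, wte: np.ndarray, tokens: list):
    # MPT's embedding tensor is oversized (rows for extra unused tokens), so
    # expand the vocab to match the tensor rather than the other way around.
    while len(tokens) < wte.shape[0]:
        tokens.append(f"[PAD{len(tokens)}]".encode("utf-8"))  # hypothetical filler tokens
    writer.add_token_list(tokens)

    writer.add_tensor("token_embd.weight", wte)
    # Workaround: duplicate the (tied) embeddings as the output projection so
    # llama.cpp finds an "output" tensor; ~150 MB extra for the 7B model.
    writer.add_tensor("output.weight", wte)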

@cebtenzzre
Collaborator

@goerch Another PR that conflicts with #3252

@ggerganov added the high priority and model labels on Sep 30, 2023
@cebtenzzre
Collaborator

quantize warns because it is looking for attn_k and not attn_qkv:

llama_model_quantize_internal ============ Strange model: n_attention_wv = 0, n_feed_forward_w2 = 32, hparams.n_layer = 32

@jploski
Contributor Author

jploski commented Sep 30, 2023

quantize warns because it is looking for attn_k and not attn_qkv:

llama_model_quantize_internal ============ Strange model: n_attention_wv = 0, n_feed_forward_w2 = 32, hparams.n_layer = 32

Now fixed as well.
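For context, the warning comes from a sanity check in quantize that counts per-layer attention and feed-forward tensors. The snippet below is only a hypothetical illustration of the idea behind the fix (counting a fused attn_qkv projection the same as a separate attn_v one); the real check lives in llama.cpp's C++ code and differs in detail:

def quantize_sanity_check(tensor_names, n_layer):
    # A fused-QKV model like MPT has blk.N.attn_qkv.weight instead of
    # separate blk.N.attn_k / blk.N.attn_v weights, so count both variants.
    n_attention_wv = sum(
        name.endswith((".attn_v.weight", ".attn_qkv.weight"))
        for name in tensor_names
    )
    n_feed_forward_w2 = sum(name.endswith(".ffn_down.weight") for name in tensor_names)
    if n_attention_wv != n_layer or n_feed_forward_w2 != n_layer:
        print(f"Strange model: n_attention_wv = {n_attention_wv}, "
              f"n_feed_forward_w2 = {n_feed_forward_w2}, n_layer = {n_layer}")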

@ggerganov
Copy link
Owner

  • The rope shift is not needed when using ALiBi
  • The ggml_alibi assert should be removed and we should just pass 0 for n_past. At some point I will look into replacing ggml_alibi with a ggml_add, but for now this should work (see the sketch below).
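For the record, a minimal NumPy sketch of why ALiBi makes the rope shift unnecessary: the bias is simply added to the attention scores before the softmax, which is also why replacing ggml_alibi with a ggml_add is plausible. It assumes the standard ALiBi slope formula for a power-of-two head count and MPT's max_alibi_bias of 8; it is an illustration, not ggml's actual implementation:

import numpy as np

def alibi_bias(n_head, n_kv, max_bias=8.0):
    # One slope per head: 2^(-max_bias * h / n_head) for h = 1..n_head
    # (exact for power-of-two head counts, per the ALiBi paper).
    slopes = 2.0 ** (-max_bias * np.arange(1, n_head + 1) / n_head)
    # The bias grows linearly with the key position and is simply added to
    # the attention scores, so cached K/V entries never need to be shifted.
    return slopes[:, None] * np.arange(n_kv, dtype=np.float32)[None, :]

# usage: scores[h, q, k] += alibi_bias(n_head, n_kv)[h, k] for every query
# position q, before the softmax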

@jploski
Copy link
Contributor Author

jploski commented Oct 3, 2023

Note that this PR does not yet include the modifications to the convert script proposed in #3252 and referred to in #3417 (comment). Since this PR is based on a pre-merge commit of #3252, it may be easier to add this change after the merge.

@ggerganov mentioned this pull request on Oct 9, 2023
@ggerganov
Owner

@cebtenzzre Thanks for the merge. If anyone can give this a quick try and confirm it works, we should merge.
I can do so tomorrow, but feel free to merge upon success before then.

@jploski
Contributor Author

jploski commented Oct 9, 2023

@cebtenzzre Thanks for the merge. If anyone can give this a quick try and confirm it works, we should merge. I can do so tomorrow, but feel free to merge upon success before then.

Works for me. The PR is now almost the same as my own previous private merge attempt.

The disable-n_past-assertion changes to ggml_compute_forward_alibi_f16 and ggml_compute_forward_alibi_f32 could be made syntactically more consistent, but AFAICS they are functionally equivalent, so this is not a showstopper for merging into master.

@ggerganov merged commit f5f9121 into ggerganov:master on Oct 10, 2023
32 of 38 checks passed
Collaborator

@goerch left a comment

Tested this, works fine for me. The test failure in test-tokenizer-1-bpe is due to added tokens. I'll fix this in a future PR.

joelkuiper added a commit to vortext/llama.cpp that referenced this pull request Oct 12, 2023
…example

* 'master' of github.com:ggerganov/llama.cpp: (34 commits)
  examples: support LLaVA v1.5 (multimodal model) (ggerganov#3436)
  docs : fix typo GOMP_CPU_AFFINITY (ggerganov#3597)
  cmake : fix add_compile_options on macOS
  typo : it is `--n-gpu-layers` not `--gpu-layers` (ggerganov#3592)
  ci : check if there is enough VRAM (ggerganov#3596)
  server : add completion mode (no chat) (ggerganov#3582)
  prompts : add mnemonics.txt
  server : fix kv cache management (ggerganov#3588)
  main : fix session loading bug (ggerganov#3400)
  server : add parameter -tb N, --threads-batch N (ggerganov#3584)
  common : fix mirostat state when using multiple sequences (ggerganov#3543)
  batched : add bench tool (ggerganov#3545)
  examples : add batched.swift + improve CI for swift (ggerganov#3562)
  Add MPT model to supported models in README.md (ggerganov#3574)
  Minor improvements in GPT2 tokenizer (ggerganov#3567)
  readme : add bloom (ggerganov#3570)
  llm : add bloom models (ggerganov#3553)
  swift : improvements and fixes (ggerganov#3564)
  llm : add MPT support (ggerganov#3417)
  infill. : fix tokenization (ggerganov#3508)
  ...
cebtenzzre pushed a commit to nomic-ai/llama.cpp that referenced this pull request Oct 16, 2023
Co-authored-by: Cebtenzzre <cebtenzzre@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

(cherry picked from commit f5f9121)
snichols pushed a commit to xgaicc/llama.cpp that referenced this pull request Oct 16, 2023
* CUDA: added support for ggml_clamp (see also: ggerganov/ggml#545)

* mpt : added an implementation based (mostly) on falcon integration, modified with deltas from ggml/examples/mpt

* mpt : protect against "clip_qkv": null in mpt-7b

* mpt : quick fix to avoid "Strange model" warning when quantizing MPT models

* mpt : addendum to changeset:84e30e8 - leave parameter clamp_kqv out from metadata rather than use 0.0 to indicate "no clamping" (more compliant with the current GGUF spec?)

* mpt : standardized all tensor names to follow GGUF spec

* mpt : addendum to changeset:1be89c40 - use "req" parameter of GGUF_GET_KEY macro instead of duplicate code

* mpt : fixed comment s/gptneox/mpt/

* mpt : remove tabs, trailing whitespace

* mpt : removed ne01 + n_past == ne00 assertion from alibi (cuda/f32) and rope_shift from build_mpt

* mpt : updated convert-mpt-hf-to-gguf.py to reflect changes made to convert-gptneox-hf-to-gguf.py in pr:3252

* comment out n_past instead of marking it unused

* mpt : removed hardcoded +178 from convert script in favor of utilizing hparams["vocab_size"]

* mpt : remove unused tokenizer_json in convert script

* ggml : remove obsolete n_past assert in ggml_alibi

* llama : print clamp_kqv and max_alibi_bias hparams

---------

Co-authored-by: Cebtenzzre <cebtenzzre@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@maddes8cht
Contributor

maddes8cht commented Oct 18, 2023

I converted mpt-7b-chat and mpt-7b-storywriter. The conversion and quantization complete successfully and produce the .gguf files. However, the files don't work for me. When running main with them, I get an

ERROR: byte not found in vocab: '

For reference, here is the full output:

Chosen model: E:\hf\mosaicml-mpt-7b-chat-gguf\ggml-mosaicml-mpt-7b-chat-Q2_K.gguf
Subdirectory: ggml_mosaicml_mpt_7b_chat_Q2_K
Parameter: -m E:\hf\mosaicml-mpt-7b-chat-gguf\ggml-mosaicml-mpt-7b-chat-Q2_K.gguf
Log start
main: build = 1299 (f5ef5cf)
main: built with MSVC 19.35.32217.1 for x64
main: seed  = 1697605592
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6
llama_model_loader: loaded meta data with 19 key-value pairs and 195 tensors from E:\hf\mosaicml-mpt-7b-chat-gguf\ggml-mosaicml-mpt-7b-chat-Q2_K.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q2_K     [  4096, 50432,     1,     1 ]
llama_model_loader: - tensor    1:                    output.weight q6_K     [  4096, 50432,     1,     1 ]
llama_model_loader: - tensor    2:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.attn_qkv.weight q2_K     [  4096, 12288,     1,     1 ]
llama_model_loader: - tensor    4:         blk.0.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.ffn_up.weight q3_K     [  4096, 16384,     1,     1 ]
llama_model_loader: - tensor    7:            blk.0.ffn_down.weight q3_K     [ 16384,  4096,     1,     1 ]
llama_model_loader: - tensor    8:           blk.1.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    9:            blk.1.attn_qkv.weight q2_K     [  4096, 12288,     1,     1 ]
llama_model_loader: - tensor   10:         blk.1.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   11:            blk.1.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   12:              blk.1.ffn_up.weight q3_K     [  4096, 16384,     1,     1 ]
llama_model_loader: - tensor   13:            blk.1.ffn_down.weight q3_K     [ 16384,  4096,     1,     1 ]
llama_model_loader: - tensor   14:           blk.2.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   15:            blk.2.attn_qkv.weight q2_K     [  4096, 12288,     1,     1 ]
llama_model_loader: - tensor   16:         blk.2.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   17:            blk.2.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   18:              blk.2.ffn_up.weight q3_K     [  4096, 16384,     1,     1 ]
llama_model_loader: - tensor   19:            blk.2.ffn_down.weight q3_K     [ 16384,  4096,     1,     1 ]
llama_model_loader: - tensor   20:           blk.3.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   21:            blk.3.attn_qkv.weight q2_K     [  4096, 12288,     1,     1 ]
llama_model_loader: - tensor   22:         blk.3.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   23:            blk.3.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   24:              blk.3.ffn_up.weight q3_K     [  4096, 16384,     1,     1 ]
llama_model_loader: - tensor   25:            blk.3.ffn_down.weight q3_K     [ 16384,  4096,     1,     1 ]
llama_model_loader: - tensor   26:           blk.4.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   27:            blk.4.attn_qkv.weight q2_K     [  4096, 12288,     1,     1 ]
llama_model_loader: - tensor   28:         blk.4.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   29:            blk.4.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   30:              blk.4.ffn_up.weight q3_K     [  4096, 16384,     1,     1 ]
llama_model_loader: - tensor   31:            blk.4.ffn_down.weight q3_K     [ 16384,  4096,     1,     1 ]
llama_model_loader: - tensor   32:           blk.5.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   33:            blk.5.attn_qkv.weight q2_K     [  4096, 12288,     1,     1 ]
llama_model_loader: - tensor   34:         blk.5.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   35:            blk.5.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   36:              blk.5.ffn_up.weight q3_K     [  4096, 16384,     1,     1 ]
llama_model_loader: - tensor   37:            blk.5.ffn_down.weight q3_K     [ 16384,  4096,     1,     1 ]
llama_model_loader: - tensor   38:           blk.6.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   39:            blk.6.attn_qkv.weight q2_K     [  4096, 12288,     1,     1 ]
llama_model_loader: - tensor   40:         blk.6.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   41:            blk.6.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   42:              blk.6.ffn_up.weight q3_K     [  4096, 16384,     1,     1 ]
llama_model_loader: - tensor   43:            blk.6.ffn_down.weight q3_K     [ 16384,  4096,     1,     1 ]
llama_model_loader: - tensor   44:           blk.7.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   45:            blk.7.attn_qkv.weight q2_K     [  4096, 12288,     1,     1 ]
llama_model_loader: - tensor   46:         blk.7.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   47:            blk.7.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   48:              blk.7.ffn_up.weight q3_K     [  4096, 16384,     1,     1 ]
llama_model_loader: - tensor   49:            blk.7.ffn_down.weight q3_K     [ 16384,  4096,     1,     1 ]
llama_model_loader: - tensor   50:           blk.8.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   51:            blk.8.attn_qkv.weight q2_K     [  4096, 12288,     1,     1 ]
llama_model_loader: - tensor   52:         blk.8.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   53:            blk.8.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   54:              blk.8.ffn_up.weight q3_K     [  4096, 16384,     1,     1 ]
llama_model_loader: - tensor   55:            blk.8.ffn_down.weight q3_K     [ 16384,  4096,     1,     1 ]
llama_model_loader: - tensor   56:           blk.9.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   57:            blk.9.attn_qkv.weight q2_K     [  4096, 12288,     1,     1 ]
llama_model_loader: - tensor   58:         blk.9.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   59:            blk.9.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   60:              blk.9.ffn_up.weight q3_K     [  4096, 16384,     1,     1 ]
llama_model_loader: - tensor   61:            blk.9.ffn_down.weight q3_K     [ 16384,  4096,     1,     1 ]
llama_model_loader: - tensor   62:          blk.10.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   63:           blk.10.attn_qkv.weight q2_K     [  4096, 12288,     1,     1 ]
llama_model_loader: - tensor   64:        blk.10.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   65:           blk.10.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   66:             blk.10.ffn_up.weight q3_K     [  4096, 16384,     1,     1 ]
llama_model_loader: - tensor   67:           blk.10.ffn_down.weight q3_K     [ 16384,  4096,     1,     1 ]
llama_model_loader: - tensor   68:          blk.11.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   69:           blk.11.attn_qkv.weight q2_K     [  4096, 12288,     1,     1 ]
llama_model_loader: - tensor   70:        blk.11.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   71:           blk.11.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   72:             blk.11.ffn_up.weight q3_K     [  4096, 16384,     1,     1 ]
llama_model_loader: - tensor   73:           blk.11.ffn_down.weight q3_K     [ 16384,  4096,     1,     1 ]
llama_model_loader: - tensor   74:          blk.12.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   75:           blk.12.attn_qkv.weight q2_K     [  4096, 12288,     1,     1 ]
llama_model_loader: - tensor   76:        blk.12.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   77:           blk.12.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   78:             blk.12.ffn_up.weight q3_K     [  4096, 16384,     1,     1 ]
llama_model_loader: - tensor   79:           blk.12.ffn_down.weight q3_K     [ 16384,  4096,     1,     1 ]
llama_model_loader: - tensor   80:          blk.13.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   81:           blk.13.attn_qkv.weight q2_K     [  4096, 12288,     1,     1 ]
llama_model_loader: - tensor   82:        blk.13.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   83:           blk.13.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   84:             blk.13.ffn_up.weight q3_K     [  4096, 16384,     1,     1 ]
llama_model_loader: - tensor   85:           blk.13.ffn_down.weight q3_K     [ 16384,  4096,     1,     1 ]
llama_model_loader: - tensor   86:          blk.14.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   87:           blk.14.attn_qkv.weight q2_K     [  4096, 12288,     1,     1 ]
llama_model_loader: - tensor   88:        blk.14.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   89:           blk.14.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   90:             blk.14.ffn_up.weight q3_K     [  4096, 16384,     1,     1 ]
llama_model_loader: - tensor   91:           blk.14.ffn_down.weight q3_K     [ 16384,  4096,     1,     1 ]
llama_model_loader: - tensor   92:          blk.15.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   93:           blk.15.attn_qkv.weight q2_K     [  4096, 12288,     1,     1 ]
llama_model_loader: - tensor   94:        blk.15.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   95:           blk.15.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   96:             blk.15.ffn_up.weight q3_K     [  4096, 16384,     1,     1 ]
llama_model_loader: - tensor   97:           blk.15.ffn_down.weight q3_K     [ 16384,  4096,     1,     1 ]
llama_model_loader: - tensor   98:          blk.16.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   99:           blk.16.attn_qkv.weight q2_K     [  4096, 12288,     1,     1 ]
llama_model_loader: - tensor  100:        blk.16.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  101:           blk.16.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  102:             blk.16.ffn_up.weight q3_K     [  4096, 16384,     1,     1 ]
llama_model_loader: - tensor  103:           blk.16.ffn_down.weight q3_K     [ 16384,  4096,     1,     1 ]
llama_model_loader: - tensor  104:          blk.17.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  105:           blk.17.attn_qkv.weight q2_K     [  4096, 12288,     1,     1 ]
llama_model_loader: - tensor  106:        blk.17.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  107:           blk.17.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  108:             blk.17.ffn_up.weight q3_K     [  4096, 16384,     1,     1 ]
llama_model_loader: - tensor  109:           blk.17.ffn_down.weight q3_K     [ 16384,  4096,     1,     1 ]
llama_model_loader: - tensor  110:          blk.18.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  111:           blk.18.attn_qkv.weight q2_K     [  4096, 12288,     1,     1 ]
llama_model_loader: - tensor  112:        blk.18.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  113:           blk.18.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  114:             blk.18.ffn_up.weight q3_K     [  4096, 16384,     1,     1 ]
llama_model_loader: - tensor  115:           blk.18.ffn_down.weight q3_K     [ 16384,  4096,     1,     1 ]
llama_model_loader: - tensor  116:          blk.19.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  117:           blk.19.attn_qkv.weight q2_K     [  4096, 12288,     1,     1 ]
llama_model_loader: - tensor  118:        blk.19.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  119:           blk.19.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  120:             blk.19.ffn_up.weight q3_K     [  4096, 16384,     1,     1 ]
llama_model_loader: - tensor  121:           blk.19.ffn_down.weight q3_K     [ 16384,  4096,     1,     1 ]
llama_model_loader: - tensor  122:          blk.20.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  123:           blk.20.attn_qkv.weight q2_K     [  4096, 12288,     1,     1 ]
llama_model_loader: - tensor  124:        blk.20.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  125:           blk.20.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  126:             blk.20.ffn_up.weight q3_K     [  4096, 16384,     1,     1 ]
llama_model_loader: - tensor  127:           blk.20.ffn_down.weight q3_K     [ 16384,  4096,     1,     1 ]
llama_model_loader: - tensor  128:          blk.21.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  129:           blk.21.attn_qkv.weight q2_K     [  4096, 12288,     1,     1 ]
llama_model_loader: - tensor  130:        blk.21.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  131:           blk.21.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  132:             blk.21.ffn_up.weight q3_K     [  4096, 16384,     1,     1 ]
llama_model_loader: - tensor  133:           blk.21.ffn_down.weight q3_K     [ 16384,  4096,     1,     1 ]
llama_model_loader: - tensor  134:          blk.22.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  135:           blk.22.attn_qkv.weight q2_K     [  4096, 12288,     1,     1 ]
llama_model_loader: - tensor  136:        blk.22.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  137:           blk.22.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  138:             blk.22.ffn_up.weight q3_K     [  4096, 16384,     1,     1 ]
llama_model_loader: - tensor  139:           blk.22.ffn_down.weight q3_K     [ 16384,  4096,     1,     1 ]
llama_model_loader: - tensor  140:          blk.23.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  141:           blk.23.attn_qkv.weight q2_K     [  4096, 12288,     1,     1 ]
llama_model_loader: - tensor  142:        blk.23.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  143:           blk.23.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  144:             blk.23.ffn_up.weight q3_K     [  4096, 16384,     1,     1 ]
llama_model_loader: - tensor  145:           blk.23.ffn_down.weight q3_K     [ 16384,  4096,     1,     1 ]
llama_model_loader: - tensor  146:          blk.24.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  147:           blk.24.attn_qkv.weight q2_K     [  4096, 12288,     1,     1 ]
llama_model_loader: - tensor  148:        blk.24.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  149:           blk.24.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  150:             blk.24.ffn_up.weight q3_K     [  4096, 16384,     1,     1 ]
llama_model_loader: - tensor  151:           blk.24.ffn_down.weight q3_K     [ 16384,  4096,     1,     1 ]
llama_model_loader: - tensor  152:          blk.25.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  153:           blk.25.attn_qkv.weight q2_K     [  4096, 12288,     1,     1 ]
llama_model_loader: - tensor  154:        blk.25.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  155:           blk.25.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  156:             blk.25.ffn_up.weight q3_K     [  4096, 16384,     1,     1 ]
llama_model_loader: - tensor  157:           blk.25.ffn_down.weight q3_K     [ 16384,  4096,     1,     1 ]
llama_model_loader: - tensor  158:          blk.26.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  159:           blk.26.attn_qkv.weight q2_K     [  4096, 12288,     1,     1 ]
llama_model_loader: - tensor  160:        blk.26.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  161:           blk.26.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  162:             blk.26.ffn_up.weight q3_K     [  4096, 16384,     1,     1 ]
llama_model_loader: - tensor  163:           blk.26.ffn_down.weight q3_K     [ 16384,  4096,     1,     1 ]
llama_model_loader: - tensor  164:          blk.27.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  165:           blk.27.attn_qkv.weight q2_K     [  4096, 12288,     1,     1 ]
llama_model_loader: - tensor  166:        blk.27.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  167:           blk.27.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  168:             blk.27.ffn_up.weight q3_K     [  4096, 16384,     1,     1 ]
llama_model_loader: - tensor  169:           blk.27.ffn_down.weight q3_K     [ 16384,  4096,     1,     1 ]
llama_model_loader: - tensor  170:          blk.28.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  171:           blk.28.attn_qkv.weight q2_K     [  4096, 12288,     1,     1 ]
llama_model_loader: - tensor  172:        blk.28.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  173:           blk.28.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  174:             blk.28.ffn_up.weight q3_K     [  4096, 16384,     1,     1 ]
llama_model_loader: - tensor  175:           blk.28.ffn_down.weight q3_K     [ 16384,  4096,     1,     1 ]
llama_model_loader: - tensor  176:          blk.29.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  177:           blk.29.attn_qkv.weight q2_K     [  4096, 12288,     1,     1 ]
llama_model_loader: - tensor  178:        blk.29.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  179:           blk.29.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  180:             blk.29.ffn_up.weight q3_K     [  4096, 16384,     1,     1 ]
llama_model_loader: - tensor  181:           blk.29.ffn_down.weight q3_K     [ 16384,  4096,     1,     1 ]
llama_model_loader: - tensor  182:          blk.30.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  183:           blk.30.attn_qkv.weight q2_K     [  4096, 12288,     1,     1 ]
llama_model_loader: - tensor  184:        blk.30.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  185:           blk.30.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  186:             blk.30.ffn_up.weight q3_K     [  4096, 16384,     1,     1 ]
llama_model_loader: - tensor  187:           blk.30.ffn_down.weight q3_K     [ 16384,  4096,     1,     1 ]
llama_model_loader: - tensor  188:          blk.31.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  189:           blk.31.attn_qkv.weight q2_K     [  4096, 12288,     1,     1 ]
llama_model_loader: - tensor  190:        blk.31.attn_output.weight q3_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  191:           blk.31.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  192:             blk.31.ffn_up.weight q3_K     [  4096, 16384,     1,     1 ]
llama_model_loader: - tensor  193:           blk.31.ffn_down.weight q3_K     [ 16384,  4096,     1,     1 ]
llama_model_loader: - tensor  194:               output_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - kv   0:                       general.architecture str
llama_model_loader: - kv   1:                               general.name str
llama_model_loader: - kv   2:                         mpt.context_length u32
llama_model_loader: - kv   3:                       mpt.embedding_length u32
llama_model_loader: - kv   4:                            mpt.block_count u32
llama_model_loader: - kv   5:                    mpt.feed_forward_length u32
llama_model_loader: - kv   6:                   mpt.attention.head_count u32
llama_model_loader: - kv   7:           mpt.attention.layer_norm_epsilon f32
llama_model_loader: - kv   8:               mpt.attention.max_alibi_bias f32
llama_model_loader: - kv   9:                       tokenizer.ggml.model str
llama_model_loader: - kv  10:                      tokenizer.ggml.tokens arr
llama_model_loader: - kv  11:                      tokenizer.ggml.scores arr
llama_model_loader: - kv  12:                  tokenizer.ggml.token_type arr
llama_model_loader: - kv  13:                      tokenizer.ggml.merges arr
llama_model_loader: - kv  14:                tokenizer.ggml.bos_token_id u32
llama_model_loader: - kv  15:                tokenizer.ggml.eos_token_id u32
llama_model_loader: - kv  16:            tokenizer.ggml.unknown_token_id u32
llama_model_loader: - kv  17:               general.quantization_version u32
llama_model_loader: - kv  18:                          general.file_type u32
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q2_K:   33 tensors
llama_model_loader: - type q3_K:   96 tensors
llama_model_loader: - type q6_K:    1 tensors
ERROR: byte not found in vocab: '
'

I have already successfully converted a bunch of Falcon models that work fine, but the MPT conversion script does not work for me.
I'm running on a Windows 10 machine.

@maddes8cht
Contributor

Here is a hexdump of the beginning of the files:

hexdump -C ggml-mosaicml-mpt-7b-chat-Q2_K.gguf |head -n 20
00000000  47 47 55 46 02 00 00 00  c3 00 00 00 00 00 00 00  |GGUF............|
00000010  13 00 00 00 00 00 00 00  14 00 00 00 00 00 00 00  |................|
00000020  67 65 6e 65 72 61 6c 2e  61 72 63 68 69 74 65 63  |general.architec|
00000030  74 75 72 65 08 00 00 00  03 00 00 00 00 00 00 00  |ture............|
00000040  6d 70 74 0c 00 00 00 00  00 00 00 67 65 6e 65 72  |mpt........gener|
00000050  61 6c 2e 6e 61 6d 65 08  00 00 00 09 00 00 00 00  |al.name.........|
00000060  00 00 00 66 61 6c 63 6f  6e 2d 69 6e 12 00 00 00  |...falcon-in....|
00000070  00 00 00 00 6d 70 74 2e  63 6f 6e 74 65 78 74 5f  |....mpt.context_|
00000080  6c 65 6e 67 74 68 04 00  00 00 00 08 00 00 14 00  |length..........|
00000090  00 00 00 00 00 00 6d 70  74 2e 65 6d 62 65 64 64  |......mpt.embedd|
000000a0  69 6e 67 5f 6c 65 6e 67  74 68 04 00 00 00 00 10  |ing_length......|
000000b0  00 00 0f 00 00 00 00 00  00 00 6d 70 74 2e 62 6c  |..........mpt.bl|
000000c0  6f 63 6b 5f 63 6f 75 6e  74 04 00 00 00 20 00 00  |ock_count.... ..|
000000d0  00 17 00 00 00 00 00 00  00 6d 70 74 2e 66 65 65  |.........mpt.fee|
000000e0  64 5f 66 6f 72 77 61 72  64 5f 6c 65 6e 67 74 68  |d_forward_length|
000000f0  04 00 00 00 00 40 00 00  18 00 00 00 00 00 00 00  |.....@..........|
00000100  6d 70 74 2e 61 74 74 65  6e 74 69 6f 6e 2e 68 65  |mpt.attention.he|
00000110  61 64 5f 63 6f 75 6e 74  04 00 00 00 20 00 00 00  |ad_count.... ...|
00000120  20 00 00 00 00 00 00 00  6d 70 74 2e 61 74 74 65  | .......mpt.atte|
00000130  6e 74 69 6f 6e 2e 6c 61  79 65 72 5f 6e 6f 72 6d  |ntion.layer_norm|

In comparison, here is the OpenBuddy Falcon conversion, which works fine:

hexdump -C ggml-OpenBuddy-openbuddy-falcon-7b-v5-fp16-Q4_0.gguf |head -n 20
00000000  47 47 55 46 02 00 00 00  c4 00 00 00 00 00 00 00  |GGUF............|
00000010  10 00 00 00 00 00 00 00  14 00 00 00 00 00 00 00  |................|
00000020  67 65 6e 65 72 61 6c 2e  61 72 63 68 69 74 65 63  |general.architec|
00000030  74 75 72 65 08 00 00 00  06 00 00 00 00 00 00 00  |ture............|
00000040  66 61 6c 63 6f 6e 0c 00  00 00 00 00 00 00 67 65  |falcon........ge|
00000050  6e 65 72 61 6c 2e 6e 61  6d 65 08 00 00 00 06 00  |neral.name......|
00000060  00 00 00 00 00 00 46 61  6c 63 6f 6e 15 00 00 00  |......Falcon....|
00000070  00 00 00 00 66 61 6c 63  6f 6e 2e 63 6f 6e 74 65  |....falcon.conte|
00000080  78 74 5f 6c 65 6e 67 74  68 04 00 00 00 00 08 00  |xt_length.......|
00000090  00 19 00 00 00 00 00 00  00 66 61 6c 63 6f 6e 2e  |.........falcon.|
000000a0  74 65 6e 73 6f 72 5f 64  61 74 61 5f 6c 61 79 6f  |tensor_data_layo|
000000b0  75 74 08 00 00 00 07 00  00 00 00 00 00 00 6a 70  |ut............jp|
000000c0  6c 6f 73 6b 69 17 00 00  00 00 00 00 00 66 61 6c  |loski........fal|
000000d0  63 6f 6e 2e 65 6d 62 65  64 64 69 6e 67 5f 6c 65  |con.embedding_le|
000000e0  6e 67 74 68 04 00 00 00  c0 11 00 00 1a 00 00 00  |ngth............|
000000f0  00 00 00 00 66 61 6c 63  6f 6e 2e 66 65 65 64 5f  |....falcon.feed_|
00000100  66 6f 72 77 61 72 64 5f  6c 65 6e 67 74 68 04 00  |forward_length..|
00000110  00 00 00 47 00 00 12 00  00 00 00 00 00 00 66 61  |...G..........fa|
00000120  6c 63 6f 6e 2e 62 6c 6f  63 6b 5f 63 6f 75 6e 74  |lcon.block_count|
00000130  04 00 00 00 20 00 00 00  1b 00 00 00 00 00 00 00  |.... ...........|

What I notice is that after general.name there is falcon-in. This is the name of the directory where the conversion was done in my case (because I just reused the directory structure I had created for my Falcon conversions).

In contrast, the actual Falcon model has Falcon after general.name, which is not a directory name.
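As a side note, the GGUF header can be inspected directly. The hypothetical Python snippet below walks the v2 header exactly as laid out in the hexdumps above (4-byte magic, u32 version, u64 tensor count, u64 KV count, then length-prefixed keys and typed values) and prints the leading string-valued keys, so the falcon-in value of general.name is easy to confirm; it is a quick-check sketch, not part of the tooling in this PR:

import struct

def read_str(f):
    (n,) = struct.unpack("<Q", f.read(8))
    return f.read(n).decode("utf-8")

def dump_leading_strings(path, limit=4):
    with open(path, "rb") as f:
        assert f.read(4) == b"GGUF"
        (version,) = struct.unpack("<I", f.read(4))
        n_tensors, n_kv = struct.unpack("<QQ", f.read(16))
        print(f"version={version} tensors={n_tensors} kv={n_kv}")
        for _ in range(min(limit, n_kv)):
            key = read_str(f)
            (vtype,) = struct.unpack("<I", f.read(4))
            if vtype != 8:          # 8 = GGUF string; stop at other value types
                break
            print(f"{key} = {read_str(f)}")

# e.g. dump_leading_strings("ggml-mosaicml-mpt-7b-chat-Q2_K.gguf")
# -> general.architecture = mpt, general.name = falcon-in (per the hexdump above)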
