
Model conversion support for T5 and FLAN-T5 model variants #8055

Merged · 8 commits into ggerganov:master · Jun 24, 2024

Conversation

@fairydreaming (Collaborator) commented Jun 21, 2024:

This PR adds model conversion support for T5 and FLAN-T5 model variants. It is the first in a series of PRs adding support for the T5 and FLAN-T5 model families.

@github-actions bot added the python (python script changes) label on Jun 21, 2024.
@mofosyne added the Review Complexity: Medium (generally requires more time to grok, but manageable by beginner-to-medium expertise) label on Jun 21, 2024.
@felladrin (Contributor) left a comment:


Thank you for this implementation, @fairydreaming!
I've just tested the conversion of t5-small and it worked great!
I hope you can also bring support for flan-t5 later 🙏

@fairydreaming (Collaborator, Author):

> Thank you for this implementation, @fairydreaming! I've just tested the conversion of t5-small and it worked great! I hope you can also bring support for flan-t5 later 🙏

Hmm, since it's the same architecture with small tweaks (gated gelu instead of relu, separate lm_head), it shouldn't be hard.
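The two architectural tweaks mentioned above can be sketched side by side. This is a minimal NumPy illustration (not the PR's actual conversion code); the weight names `wi`, `wi_0`, `wi_1`, `wo` follow the Hugging Face T5 convention, and the GELU uses the common tanh approximation:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, widely used in T5-family implementations
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def t5_ffn(x, wi, wo):
    # original T5: one up-projection followed by ReLU
    return np.maximum(x @ wi, 0.0) @ wo

def flan_t5_ffn(x, wi_0, wi_1, wo):
    # FLAN-T5 (t5-v1.1 style): gated GELU -- two up-projections,
    # elementwise product of the gated branch with the linear branch
    return (gelu(x @ wi_0) * (x @ wi_1)) @ wo
```

Both variants map the hidden size to the feed-forward size and back, so the converter mostly has to pick up the extra `wi_1` tensor and record the activation type.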

@fairydreaming (Collaborator, Author):

> I hope you can also bring support for flan-t5 later 🙏

@felladrin It's now supported.

@fairydreaming fairydreaming changed the title Model conversion support for T5 model variants Model conversion support for T5 and FLAN-T5 model variants Jun 23, 2024
@felladrin (Contributor):

Amazing work!

I have just one more thought:
Would it be possible not to require the spiece.model file when converting?
I'm asking because MBZUAI/LaMini-T5-61M and MBZUAI/LaMini-Flan-T5-77M, for example, don't have this file in their repo; but even with this file missing they can be converted to GGUF by huggingface/candle (and I'd guess the answer is somewhere around candle-transformers/src/models/t5.rs).

…tokens tensors (they are duplicates of shared tensor)
@fairydreaming (Collaborator, Author):

> Amazing work!
>
> I have just one more thought: Would it be possible not to require the spiece.model file when converting? I'm asking because MBZUAI/LaMini-T5-61M and MBZUAI/LaMini-Flan-T5-77M, for example, don't have this file in their repo; but even with this file missing they can be converted to GGUF by huggingface/candle (and I'd guess the answer is somewhere around candle-transformers/src/models/t5.rs).

@felladrin From what I see, all models from the T5 and FLAN-T5 families use the same spiece.model file. If they fine-tuned T5 or FLAN-T5 to create the LaMini-T5 and LaMini-Flan-T5 models without changing the tokens, then you can simply copy spiece.model from T5 or FLAN-T5. I added one more commit that allows converting both of the LaMini models you mentioned. They seem to work just fine on my t5 branch (https://github.com/fairydreaming/llama.cpp/tree/t5):

./llama-cli --temp 0.01 -m models/lamini-flan-t5-77m.gguf -p 'how can I become more healthy?'

...
llama_output_reserve: reallocating output buffer from size 0.12 MiB to 1.00 MiB
ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: reallocating buffers automatically
ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: reallocating buffers automatically
 You can become more healthy by practicing good nutrition, getting enough sleep, eating a balanced diet, staying hydrated, and getting enough sleep. [end of text]

llama_print_timings:        load time =      16.45 ms
llama_print_timings:      sample time =       3.35 ms /    30 runs   (    0.11 ms per token,  8957.90 tokens per second)
llama_print_timings: prompt eval time =      14.23 ms /     9 tokens (    1.58 ms per token,   632.42 tokens per second)
llama_print_timings:        eval time =     141.37 ms /    29 runs   (    4.87 ms per token,   205.13 tokens per second)
llama_print_timings:       total time =     222.85 ms /    38 tokens
Log end
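The "copy spiece.model from the base model" workaround described above could be scripted roughly like this. It is a sketch, not part of the PR: the helper name `fetch_spiece` and the default `base_repo` are hypothetical, and it only makes sense when the fine-tune did not change the tokenizer. It uses `hf_hub_download` from the `huggingface_hub` package:

```python
from pathlib import Path

def fetch_spiece(model_dir: str, base_repo: str = "google/flan-t5-base") -> Path:
    """Place spiece.model into a local model dir that lacks it (e.g. the
    LaMini fine-tunes), copying it from a base T5/FLAN-T5 repo on the Hub.
    Only valid when the fine-tune kept the original tokenizer."""
    target = Path(model_dir) / "spiece.model"
    if target.exists():
        return target  # nothing to do
    # imported lazily so the function can short-circuit without the package
    from huggingface_hub import hf_hub_download  # pip install huggingface_hub
    src = hf_hub_download(repo_id=base_repo, filename="spiece.model")
    target.write_bytes(Path(src).read_bytes())
    return target
```

After this, convert-hf-to-gguf.py should find the file in the model directory as usual.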

@felladrin (Contributor):

Thank you! Onwards!

@compilade (Collaborator) left a comment:

I only have some very minor comments on this, which is great!

Comment on lines +347 to +348
MODEL_TENSOR.DEC_OUTPUT_NORM: "dec.output_norm",
MODEL_TENSOR.ENC_ATTN_NORM: "enc.blk.{bid}.attn_norm",
Collaborator:

The enc and dec prefixes will (eventually) need to be also handled by the new markdown output mode of gguf-dump.py (#7853).

Can be fixed in a separate PR, I'm mentioning this for future reference.

(@mofosyne, you should be aware of this)

@fairydreaming (Collaborator, Author):

@compilade I tried it on one example model (python3 gguf-py/scripts/gguf-dump.py --markdown /mnt/md0/models/t5-small.gguf) and I'm not sure what should be fixed; can you be more specific?

T_ID Tensor Layer Name Human Friendly Tensor Layer Name Elements Shape Type
0 dec.blk.0.attn_k.weight Dec Block 0 Attention Key (W) (~262K) 262144 512 x 512 x 1 x 1 F16
1 dec.blk.0.attn_o.weight Dec Block 0 Attn_O (W) (~262K) 262144 512 x 512 x 1 x 1 F16
2 dec.blk.0.attn_q.weight Dec Block 0 Attention Query (W) (~262K) 262144 512 x 512 x 1 x 1 F16
3 dec.blk.0.attn_rel_b.weight Dec Block 0 Attn_Rel_B (W) ( 256) 256 8 x 32 x 1 x 1 F16
4 dec.blk.0.attn_v.weight Dec Block 0 Attention Value (W) (~262K) 262144 512 x 512 x 1 x 1 F16
5 dec.blk.0.attn_norm.weight Dec Block 0 Attention Normalization (W) ( 512) 512 512 x 1 x 1 x 1 F32
6 dec.blk.0.cross_attn_k.weight Dec Block 0 Cross_Attn_K (W) (~262K) 262144 512 x 512 x 1 x 1 F16
7 dec.blk.0.cross_attn_o.weight Dec Block 0 Cross_Attn_O (W) (~262K) 262144 512 x 512 x 1 x 1 F16
8 dec.blk.0.cross_attn_q.weight Dec Block 0 Cross_Attn_Q (W) (~262K) 262144 512 x 512 x 1 x 1 F16
9 dec.blk.0.cross_attn_rel_b.weight Dec Block 0 Cross_Attn_Rel_B (W) ( 256) 256 8 x 32 x 1 x 1 F16
10 dec.blk.0.cross_attn_v.weight Dec Block 0 Cross_Attn_V (W) (~262K) 262144 512 x 512 x 1 x 1 F16
11 dec.blk.0.cross_attn_norm.weight Dec Block 0 Cross_Attn_Norm (W) ( 512) 512 512 x 1 x 1 x 1 F32
12 dec.blk.0.ffn_up.weight Dec Block 0 Feed-Forward Network "Up" (W) ( ~1M) 1048576 512 x 2048 x 1 x 1 F16
13 dec.blk.0.ffn_down.weight Dec Block 0 Feed-Forward Network "Down" (W) ( ~1M) 1048576 2048 x 512 x 1 x 1 F16
14 dec.blk.0.ffn_norm.weight Dec Block 0 Feed-Forward Network Normalization (W) ( 512) 512 512 x 1 x 1 x 1 F32
15 dec.blk.1.attn_k.weight Dec Block 1 Attention Key (W) (~262K) 262144 512 x 512 x 1 x 1 F16
16 dec.blk.1.attn_o.weight Dec Block 1 Attn_O (W) (~262K) 262144 512 x 512 x 1 x 1 F16
17 dec.blk.1.attn_q.weight Dec Block 1 Attention Query (W) (~262K) 262144 512 x 512 x 1 x 1 F16
18 dec.blk.1.attn_v.weight Dec Block 1 Attention Value (W) (~262K) 262144 512 x 512 x 1 x 1 F16
19 dec.blk.1.attn_norm.weight Dec Block 1 Attention Normalization (W) ( 512) 512 512 x 1 x 1 x 1 F32
20 dec.blk.1.cross_attn_k.weight Dec Block 1 Cross_Attn_K (W) (~262K) 262144 512 x 512 x 1 x 1 F16
21 dec.blk.1.cross_attn_o.weight Dec Block 1 Cross_Attn_O (W) (~262K) 262144 512 x 512 x 1 x 1 F16
22 dec.blk.1.cross_attn_q.weight Dec Block 1 Cross_Attn_Q (W) (~262K) 262144 512 x 512 x 1 x 1 F16
23 dec.blk.1.cross_attn_v.weight Dec Block 1 Cross_Attn_V (W) (~262K) 262144 512 x 512 x 1 x 1 F16
24 dec.blk.1.cross_attn_norm.weight Dec Block 1 Cross_Attn_Norm (W) ( 512) 512 512 x 1 x 1 x 1 F32
25 dec.blk.1.ffn_up.weight Dec Block 1 Feed-Forward Network "Up" (W) ( ~1M) 1048576 512 x 2048 x 1 x 1 F16
26 dec.blk.1.ffn_down.weight Dec Block 1 Feed-Forward Network "Down" (W) ( ~1M) 1048576 2048 x 512 x 1 x 1 F16
27 dec.blk.1.ffn_norm.weight Dec Block 1 Feed-Forward Network Normalization (W) ( 512) 512 512 x 1 x 1 x 1 F32
28 dec.blk.2.attn_k.weight Dec Block 2 Attention Key (W) (~262K) 262144 512 x 512 x 1 x 1 F16
29 dec.blk.2.attn_o.weight Dec Block 2 Attn_O (W) (~262K) 262144 512 x 512 x 1 x 1 F16
30 dec.blk.2.attn_q.weight Dec Block 2 Attention Query (W) (~262K) 262144 512 x 512 x 1 x 1 F16
31 dec.blk.2.attn_v.weight Dec Block 2 Attention Value (W) (~262K) 262144 512 x 512 x 1 x 1 F16
32 dec.blk.2.attn_norm.weight Dec Block 2 Attention Normalization (W) ( 512) 512 512 x 1 x 1 x 1 F32
33 dec.blk.2.cross_attn_k.weight Dec Block 2 Cross_Attn_K (W) (~262K) 262144 512 x 512 x 1 x 1 F16
34 dec.blk.2.cross_attn_o.weight Dec Block 2 Cross_Attn_O (W) (~262K) 262144 512 x 512 x 1 x 1 F16
35 dec.blk.2.cross_attn_q.weight Dec Block 2 Cross_Attn_Q (W) (~262K) 262144 512 x 512 x 1 x 1 F16
36 dec.blk.2.cross_attn_v.weight Dec Block 2 Cross_Attn_V (W) (~262K) 262144 512 x 512 x 1 x 1 F16
37 dec.blk.2.cross_attn_norm.weight Dec Block 2 Cross_Attn_Norm (W) ( 512) 512 512 x 1 x 1 x 1 F32
38 dec.blk.2.ffn_up.weight Dec Block 2 Feed-Forward Network "Up" (W) ( ~1M) 1048576 512 x 2048 x 1 x 1 F16
39 dec.blk.2.ffn_down.weight Dec Block 2 Feed-Forward Network "Down" (W) ( ~1M) 1048576 2048 x 512 x 1 x 1 F16
40 dec.blk.2.ffn_norm.weight Dec Block 2 Feed-Forward Network Normalization (W) ( 512) 512 512 x 1 x 1 x 1 F32
41 dec.blk.3.attn_k.weight Dec Block 3 Attention Key (W) (~262K) 262144 512 x 512 x 1 x 1 F16
42 dec.blk.3.attn_o.weight Dec Block 3 Attn_O (W) (~262K) 262144 512 x 512 x 1 x 1 F16
43 dec.blk.3.attn_q.weight Dec Block 3 Attention Query (W) (~262K) 262144 512 x 512 x 1 x 1 F16
44 dec.blk.3.attn_v.weight Dec Block 3 Attention Value (W) (~262K) 262144 512 x 512 x 1 x 1 F16
45 dec.blk.3.attn_norm.weight Dec Block 3 Attention Normalization (W) ( 512) 512 512 x 1 x 1 x 1 F32
46 dec.blk.3.cross_attn_k.weight Dec Block 3 Cross_Attn_K (W) (~262K) 262144 512 x 512 x 1 x 1 F16
47 dec.blk.3.cross_attn_o.weight Dec Block 3 Cross_Attn_O (W) (~262K) 262144 512 x 512 x 1 x 1 F16
48 dec.blk.3.cross_attn_q.weight Dec Block 3 Cross_Attn_Q (W) (~262K) 262144 512 x 512 x 1 x 1 F16
49 dec.blk.3.cross_attn_v.weight Dec Block 3 Cross_Attn_V (W) (~262K) 262144 512 x 512 x 1 x 1 F16
50 dec.blk.3.cross_attn_norm.weight Dec Block 3 Cross_Attn_Norm (W) ( 512) 512 512 x 1 x 1 x 1 F32
51 dec.blk.3.ffn_up.weight Dec Block 3 Feed-Forward Network "Up" (W) ( ~1M) 1048576 512 x 2048 x 1 x 1 F16
52 dec.blk.3.ffn_down.weight Dec Block 3 Feed-Forward Network "Down" (W) ( ~1M) 1048576 2048 x 512 x 1 x 1 F16
53 dec.blk.3.ffn_norm.weight Dec Block 3 Feed-Forward Network Normalization (W) ( 512) 512 512 x 1 x 1 x 1 F32
54 dec.blk.4.attn_k.weight Dec Block 4 Attention Key (W) (~262K) 262144 512 x 512 x 1 x 1 F16
55 dec.blk.4.attn_o.weight Dec Block 4 Attn_O (W) (~262K) 262144 512 x 512 x 1 x 1 F16
56 dec.blk.4.attn_q.weight Dec Block 4 Attention Query (W) (~262K) 262144 512 x 512 x 1 x 1 F16
57 dec.blk.4.attn_v.weight Dec Block 4 Attention Value (W) (~262K) 262144 512 x 512 x 1 x 1 F16
58 dec.blk.4.attn_norm.weight Dec Block 4 Attention Normalization (W) ( 512) 512 512 x 1 x 1 x 1 F32
59 dec.blk.4.cross_attn_k.weight Dec Block 4 Cross_Attn_K (W) (~262K) 262144 512 x 512 x 1 x 1 F16
60 dec.blk.4.cross_attn_o.weight Dec Block 4 Cross_Attn_O (W) (~262K) 262144 512 x 512 x 1 x 1 F16
61 dec.blk.4.cross_attn_q.weight Dec Block 4 Cross_Attn_Q (W) (~262K) 262144 512 x 512 x 1 x 1 F16
62 dec.blk.4.cross_attn_v.weight Dec Block 4 Cross_Attn_V (W) (~262K) 262144 512 x 512 x 1 x 1 F16
63 dec.blk.4.cross_attn_norm.weight Dec Block 4 Cross_Attn_Norm (W) ( 512) 512 512 x 1 x 1 x 1 F32
64 dec.blk.4.ffn_up.weight Dec Block 4 Feed-Forward Network "Up" (W) ( ~1M) 1048576 512 x 2048 x 1 x 1 F16
65 dec.blk.4.ffn_down.weight Dec Block 4 Feed-Forward Network "Down" (W) ( ~1M) 1048576 2048 x 512 x 1 x 1 F16
66 dec.blk.4.ffn_norm.weight Dec Block 4 Feed-Forward Network Normalization (W) ( 512) 512 512 x 1 x 1 x 1 F32
67 dec.blk.5.attn_k.weight Dec Block 5 Attention Key (W) (~262K) 262144 512 x 512 x 1 x 1 F16
68 dec.blk.5.attn_o.weight Dec Block 5 Attn_O (W) (~262K) 262144 512 x 512 x 1 x 1 F16
69 dec.blk.5.attn_q.weight Dec Block 5 Attention Query (W) (~262K) 262144 512 x 512 x 1 x 1 F16
70 dec.blk.5.attn_v.weight Dec Block 5 Attention Value (W) (~262K) 262144 512 x 512 x 1 x 1 F16
71 dec.blk.5.attn_norm.weight Dec Block 5 Attention Normalization (W) ( 512) 512 512 x 1 x 1 x 1 F32
72 dec.blk.5.cross_attn_k.weight Dec Block 5 Cross_Attn_K (W) (~262K) 262144 512 x 512 x 1 x 1 F16
73 dec.blk.5.cross_attn_o.weight Dec Block 5 Cross_Attn_O (W) (~262K) 262144 512 x 512 x 1 x 1 F16
74 dec.blk.5.cross_attn_q.weight Dec Block 5 Cross_Attn_Q (W) (~262K) 262144 512 x 512 x 1 x 1 F16
75 dec.blk.5.cross_attn_v.weight Dec Block 5 Cross_Attn_V (W) (~262K) 262144 512 x 512 x 1 x 1 F16
76 dec.blk.5.cross_attn_norm.weight Dec Block 5 Cross_Attn_Norm (W) ( 512) 512 512 x 1 x 1 x 1 F32
77 dec.blk.5.ffn_up.weight Dec Block 5 Feed-Forward Network "Up" (W) ( ~1M) 1048576 512 x 2048 x 1 x 1 F16
78 dec.blk.5.ffn_down.weight Dec Block 5 Feed-Forward Network "Down" (W) ( ~1M) 1048576 2048 x 512 x 1 x 1 F16
79 dec.blk.5.ffn_norm.weight Dec Block 5 Feed-Forward Network Normalization (W) ( 512) 512 512 x 1 x 1 x 1 F32
80 dec.output_norm.weight Dec Output Normalization (W) ( 512) 512 512 x 1 x 1 x 1 F32
81 enc.blk.0.attn_k.weight Enc Block 0 Attention Key (W) (~262K) 262144 512 x 512 x 1 x 1 F16
82 enc.blk.0.attn_o.weight Enc Block 0 Attn_O (W) (~262K) 262144 512 x 512 x 1 x 1 F16
83 enc.blk.0.attn_q.weight Enc Block 0 Attention Query (W) (~262K) 262144 512 x 512 x 1 x 1 F16
84 enc.blk.0.attn_rel_b.weight Enc Block 0 Attn_Rel_B (W) ( 256) 256 8 x 32 x 1 x 1 F16
85 enc.blk.0.attn_v.weight Enc Block 0 Attention Value (W) (~262K) 262144 512 x 512 x 1 x 1 F16
86 enc.blk.0.attn_norm.weight Enc Block 0 Attention Normalization (W) ( 512) 512 512 x 1 x 1 x 1 F32
87 enc.blk.0.ffn_up.weight Enc Block 0 Feed-Forward Network "Up" (W) ( ~1M) 1048576 512 x 2048 x 1 x 1 F16
88 enc.blk.0.ffn_down.weight Enc Block 0 Feed-Forward Network "Down" (W) ( ~1M) 1048576 2048 x 512 x 1 x 1 F16
89 enc.blk.0.ffn_norm.weight Enc Block 0 Feed-Forward Network Normalization (W) ( 512) 512 512 x 1 x 1 x 1 F32
90 enc.blk.1.attn_k.weight Enc Block 1 Attention Key (W) (~262K) 262144 512 x 512 x 1 x 1 F16
91 enc.blk.1.attn_o.weight Enc Block 1 Attn_O (W) (~262K) 262144 512 x 512 x 1 x 1 F16
92 enc.blk.1.attn_q.weight Enc Block 1 Attention Query (W) (~262K) 262144 512 x 512 x 1 x 1 F16
93 enc.blk.1.attn_v.weight Enc Block 1 Attention Value (W) (~262K) 262144 512 x 512 x 1 x 1 F16
94 enc.blk.1.attn_norm.weight Enc Block 1 Attention Normalization (W) ( 512) 512 512 x 1 x 1 x 1 F32
95 enc.blk.1.ffn_up.weight Enc Block 1 Feed-Forward Network "Up" (W) ( ~1M) 1048576 512 x 2048 x 1 x 1 F16
96 enc.blk.1.ffn_down.weight Enc Block 1 Feed-Forward Network "Down" (W) ( ~1M) 1048576 2048 x 512 x 1 x 1 F16
97 enc.blk.1.ffn_norm.weight Enc Block 1 Feed-Forward Network Normalization (W) ( 512) 512 512 x 1 x 1 x 1 F32
98 enc.blk.2.attn_k.weight Enc Block 2 Attention Key (W) (~262K) 262144 512 x 512 x 1 x 1 F16
99 enc.blk.2.attn_o.weight Enc Block 2 Attn_O (W) (~262K) 262144 512 x 512 x 1 x 1 F16
100 enc.blk.2.attn_q.weight Enc Block 2 Attention Query (W) (~262K) 262144 512 x 512 x 1 x 1 F16
101 enc.blk.2.attn_v.weight Enc Block 2 Attention Value (W) (~262K) 262144 512 x 512 x 1 x 1 F16
102 enc.blk.2.attn_norm.weight Enc Block 2 Attention Normalization (W) ( 512) 512 512 x 1 x 1 x 1 F32
103 enc.blk.2.ffn_up.weight Enc Block 2 Feed-Forward Network "Up" (W) ( ~1M) 1048576 512 x 2048 x 1 x 1 F16
104 enc.blk.2.ffn_down.weight Enc Block 2 Feed-Forward Network "Down" (W) ( ~1M) 1048576 2048 x 512 x 1 x 1 F16
105 enc.blk.2.ffn_norm.weight Enc Block 2 Feed-Forward Network Normalization (W) ( 512) 512 512 x 1 x 1 x 1 F32
106 enc.blk.3.attn_k.weight Enc Block 3 Attention Key (W) (~262K) 262144 512 x 512 x 1 x 1 F16
107 enc.blk.3.attn_o.weight Enc Block 3 Attn_O (W) (~262K) 262144 512 x 512 x 1 x 1 F16
108 enc.blk.3.attn_q.weight Enc Block 3 Attention Query (W) (~262K) 262144 512 x 512 x 1 x 1 F16
109 enc.blk.3.attn_v.weight Enc Block 3 Attention Value (W) (~262K) 262144 512 x 512 x 1 x 1 F16
110 enc.blk.3.attn_norm.weight Enc Block 3 Attention Normalization (W) ( 512) 512 512 x 1 x 1 x 1 F32
111 enc.blk.3.ffn_up.weight Enc Block 3 Feed-Forward Network "Up" (W) ( ~1M) 1048576 512 x 2048 x 1 x 1 F16
112 enc.blk.3.ffn_down.weight Enc Block 3 Feed-Forward Network "Down" (W) ( ~1M) 1048576 2048 x 512 x 1 x 1 F16
113 enc.blk.3.ffn_norm.weight Enc Block 3 Feed-Forward Network Normalization (W) ( 512) 512 512 x 1 x 1 x 1 F32
114 enc.blk.4.attn_k.weight Enc Block 4 Attention Key (W) (~262K) 262144 512 x 512 x 1 x 1 F16
115 enc.blk.4.attn_o.weight Enc Block 4 Attn_O (W) (~262K) 262144 512 x 512 x 1 x 1 F16
116 enc.blk.4.attn_q.weight Enc Block 4 Attention Query (W) (~262K) 262144 512 x 512 x 1 x 1 F16
117 enc.blk.4.attn_v.weight Enc Block 4 Attention Value (W) (~262K) 262144 512 x 512 x 1 x 1 F16
118 enc.blk.4.attn_norm.weight Enc Block 4 Attention Normalization (W) ( 512) 512 512 x 1 x 1 x 1 F32
119 enc.blk.4.ffn_up.weight Enc Block 4 Feed-Forward Network "Up" (W) ( ~1M) 1048576 512 x 2048 x 1 x 1 F16
120 enc.blk.4.ffn_down.weight Enc Block 4 Feed-Forward Network "Down" (W) ( ~1M) 1048576 2048 x 512 x 1 x 1 F16
121 enc.blk.4.ffn_norm.weight Enc Block 4 Feed-Forward Network Normalization (W) ( 512) 512 512 x 1 x 1 x 1 F32
122 enc.blk.5.attn_k.weight Enc Block 5 Attention Key (W) (~262K) 262144 512 x 512 x 1 x 1 F16
123 enc.blk.5.attn_o.weight Enc Block 5 Attn_O (W) (~262K) 262144 512 x 512 x 1 x 1 F16
124 enc.blk.5.attn_q.weight Enc Block 5 Attention Query (W) (~262K) 262144 512 x 512 x 1 x 1 F16
125 enc.blk.5.attn_v.weight Enc Block 5 Attention Value (W) (~262K) 262144 512 x 512 x 1 x 1 F16
126 enc.blk.5.attn_norm.weight Enc Block 5 Attention Normalization (W) ( 512) 512 512 x 1 x 1 x 1 F32
127 enc.blk.5.ffn_up.weight Enc Block 5 Feed-Forward Network "Up" (W) ( ~1M) 1048576 512 x 2048 x 1 x 1 F16
128 enc.blk.5.ffn_down.weight Enc Block 5 Feed-Forward Network "Down" (W) ( ~1M) 1048576 2048 x 512 x 1 x 1 F16
129 enc.blk.5.ffn_norm.weight Enc Block 5 Feed-Forward Network Normalization (W) ( 512) 512 512 x 1 x 1 x 1 F32
130 enc.output_norm.weight Enc Output Normalization (W) ( 512) 512 512 x 1 x 1 x 1 F32
131 token_embd.weight Token Embedding (W) ( ~16M) 16449536 512 x 32128 x 1 x 1 F16

@compilade (Collaborator) commented Jun 24, 2024:

In the markdown output of gguf-dump.py, there's currently a special case for tensor names which don't start with blk (ref: #7853 (comment), it seemed reasonable at the time), and it puts them all in the same section (so that token_embd.weight is in the same section as output.weight). If you try it on a non-T5 model (e.g. tinyllama or something), you'll notice that there are sections for each layer number.
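The sectioning behaviour described above can be illustrated with a small sketch. This is hypothetical code, not the real gguf-dump.py logic: it groups tensor names by layer number the way the markdown mode does for blk.N names, extended to recognize the enc./dec. prefixes that T5 introduces, and lumps everything else into one "base" section:

```python
import re
from collections import defaultdict

def group_tensors(names):
    """Group GGUF tensor names into per-layer sections.

    Names like "blk.3.attn_k.weight", "enc.blk.0.ffn_up.weight" and
    "dec.blk.5.attn_v.weight" get a (prefix, layer) key; anything that
    doesn't start with an optional enc./dec. prefix plus blk.N (e.g.
    "token_embd.weight", "output.weight") falls into the ("base", -1) bucket.
    """
    sections = defaultdict(list)
    for name in names:
        m = re.match(r"(?:(enc|dec)\.)?blk\.(\d+)\.", name)
        key = (m.group(1) or "", int(m.group(2))) if m else ("base", -1)
        sections[key].append(name)
    return dict(sections)
```

With such a grouping, the T5 dump above would split into per-layer encoder and decoder sections instead of one flat table.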

@fairydreaming (Collaborator, Author):

Fixed in #8090

@@ -49,6 +49,7 @@ class LLM:
EXPERT_WEIGHTS_SCALE = "{arch}.expert_weights_scale"
POOLING_TYPE = "{arch}.pooling_type"
LOGIT_SCALE = "{arch}.logit_scale"
DECODER_START_TOKEN_ID = "{arch}.decoder_start_token_id"
Collaborator:

Is there a specific reason why the decoder_start_token_id isn't with the rest of the tokenizer config (like e.g. tokenizer.ggml.bos_token_id)?

In what way is it different from tokenizer.ggml.bos_token_id? When is it used?

@fairydreaming (Collaborator, Author):

Yes, it's different. It's not related to the tokenizer at all; it's a model parameter. The decoder start token is not a separate dedicated token like BOS, EOS, or PAD. It's used in encoder-decoder models like T5 as the initial token of the autoregressive decoding process. The model creators decided to reuse one of the existing tokens as the decoder start token (PAD in the case of T5), and the id of that token is what this parameter stores.
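The role of decoder_start_token_id can be sketched as a minimal greedy decoding loop. This is illustrative only: `encode` and `decode_step` are hypothetical stand-ins for the real encoder and decoder calls, not llama.cpp APIs:

```python
def greedy_decode(encode, decode_step, decoder_start_token_id, eos_id, max_new=32):
    """Minimal sketch of autoregressive decoding in an encoder-decoder model.

    The first decoder input is decoder_start_token_id (id 0, the PAD token,
    in T5) -- a model parameter, distinct from the tokenizer-level BOS id.
    """
    enc_out = encode()                 # run the encoder once over the prompt
    tokens = [decoder_start_token_id]  # seed the decoder
    while len(tokens) < max_new:
        next_id = decode_step(enc_out, tokens)  # e.g. argmax over the logits
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens[1:]  # the start token is usually not part of the output
```

This also shows why the parameter lives with the model hyperparameters in the GGUF metadata (t5.decoder_start_token_id) rather than under tokenizer.ggml.*.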

@fairydreaming fairydreaming merged commit de0d6a6 into ggerganov:master Jun 24, 2024
19 checks passed
@Sadeghi85:

Hello,

Is Madlad-400 also supported? It's based on T5.

@fairydreaming (Collaborator, Author):

> Is Madlad-400 also supported? It's based on T5.

Currently it converts OK but then crashes with a big boom (Segmentation fault (core dumped)) during inference. But that's good; at least we'll fix more bugs before the merge.

@MoonRide303 commented Jun 24, 2024:

I tried to convert pile-t5-xl (blog post) using 52fc870 - it didn't work:

python D:\repos-git\llama.cpp\convert-hf-to-gguf.py --outtype f16 ..\pile-t5-xl\ --outfile pile-t5-xl-F16.gguf
INFO:hf-to-gguf:Loading model: pile-t5-xl
ERROR:hf-to-gguf:Model UMT5ForConditionalGeneration is not supported

Could it be supported, too? It uses Llama tokenizer.

@fairydreaming (Collaborator, Author):

> Is Madlad-400 also supported? It's based on T5.

@Sadeghi85 I added some fixes that allow running it (tested on madlad400-3b), but they are currently only in my branch: https://github.com/fairydreaming/llama.cpp/tree/t5

@Sadeghi85:

> > Is Madlad-400 also supported? It's based on T5.
>
> @Sadeghi85 I added some fixes allowing to run this (tested on madlad400-3b), but they are currently in my branch: https://github.com/fairydreaming/llama.cpp/tree/t5

I converted the HF model to GGUF and it went OK. Then I compiled the t5 branch and ran llama-server with the converted GGUF; it gave the error below:

GGML_ASSERT: J:\fairydreaming\llama.cpp\examples\server\server.cpp:690: llama_add_eos_token(model) != 1

@fairydreaming (Collaborator, Author) commented Jun 24, 2024:

> Is Madlad-400 also supported? It's based on T5.
>
> @Sadeghi85 I added some fixes allowing to run this (tested on madlad400-3b), but they are currently in my branch: https://github.com/fairydreaming/llama.cpp/tree/t5
>
> I converted hf model to gguf, it went ok. then compiled t5 branch and ran llama-server with the converted gguf, it gave below error:
>
> GGML_ASSERT: J:\fairydreaming\llama.cpp\examples\server\server.cpp:690: llama_add_eos_token(model) != 1

@Sadeghi85 Only llama-cli supports encoder-decoder models at this moment.

Example:

(llama.cpp) phm@epyc:~/projects/llama.cpp-t5$ ./llama-cli --temp 0.01 -m /mnt/md0/models/madlad400-3b.gguf -p '<2de> I love pizza!'
Log start
main: build = 3235 (68b51162)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1719255737
...
llama_output_reserve: reallocating output buffer from size 0.98 MiB to 6.86 MiB
ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: reallocating buffers automatically
▅ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: reallocating buffers automatically
 Ich liebe Pizza! [end of text]

llama_print_timings:        load time =     267.43 ms
llama_print_timings:      sample time =       5.05 ms /     7 runs   (    0.72 ms per token,  1385.04 tokens per second)
llama_print_timings: prompt eval time =     390.53 ms /     8 tokens (   48.82 ms per token,    20.48 tokens per second)
llama_print_timings:        eval time =     731.00 ms /     6 runs   (  121.83 ms per token,     8.21 tokens per second)
llama_print_timings:       total time =    1613.58 ms /    14 tokens
Log end

It looks like there's some weird extra character output with madlad400-3b, but I haven't had time to investigate it yet.

@fairydreaming (Collaborator, Author):

> I tried to convert pile-t5-xl (blog post) using 52fc870 - it didn't work:
>
> python D:\repos-git\llama.cpp\convert-hf-to-gguf.py --outtype f16 ..\pile-t5-xl\ --outfile pile-t5-xl-F16.gguf
> INFO:hf-to-gguf:Loading model: pile-t5-xl
> ERROR:hf-to-gguf:Model UMT5ForConditionalGeneration is not supported
>
> Could it be supported, too? It uses Llama tokenizer.

From the description it looks like it's based on T5X, not T5.

> I tried to convert pile-t5-xl (blog post) using 52fc870 - it didn't work:
>
> python D:\repos-git\llama.cpp\convert-hf-to-gguf.py --outtype f16 ..\pile-t5-xl\ --outfile pile-t5-xl-F16.gguf
> INFO:hf-to-gguf:Loading model: pile-t5-xl
> ERROR:hf-to-gguf:Model UMT5ForConditionalGeneration is not supported
>
> Could it be supported, too? It uses Llama tokenizer.

@MoonRide303 It looks like it would require some extra work, so maybe some day.

@MoonRide303:

@fairydreaming It seems that they've released both T5 and T5X checkpoints. I mentioned them because they show some improvements on benchmarks compared to vanilla T5 and looked roughly compatible; but if it's not trivial to add support, then I guess they'll have to wait for better times.

@Sadeghi85:

> @Sadeghi85 Only llama-cli supports encoder-decoder models at this moment.

I tried with my own finetune of madlad400-7b and it worked correctly (there is an extra character at the start, as you mentioned).

Thanks.

@fairydreaming (Collaborator, Author):

> @fairydreaming It seems that they've released both T5 and T5x checkpoints. I've mentioned those, cause they've got some improvements on benchmarks compared to vanilla T5, and looked roughly compatible - but if it's not trivial to add support for it, then I guess they'll have to wait for better times.

@MoonRide303 I managed to run pile-t5-base, but it looks like all it can do is "to take a string of text that has been partially replaced with mask tokens and predict a sequence of tokens that would replace those mask tokens". Are there any fine-tunes of pile-t5 with more interesting use-cases?
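The masked-span objective described above (T5-style span corruption) can be made concrete with a small sketch. This is illustrative only; the helper `mask_spans` is hypothetical, but the `<extra_id_N>` sentinel tokens are the ones the T5-family tokenizers actually define:

```python
def mask_spans(words, spans):
    """Replace each (start, end) word span with a T5 sentinel token.

    Sketch of the denoising input the base pile-t5 models expect: the model
    is asked to predict the masked spans, each introduced by its sentinel
    (e.g. "<extra_id_0> cute dog <extra_id_1> the").
    """
    out, sentinel_id, i = [], 0, 0
    for start, end in sorted(spans):
        out += words[i:start] + [f"<extra_id_{sentinel_id}>"]
        sentinel_id, i = sentinel_id + 1, end
    out += words[i:]
    return " ".join(out)
```

So prompting a base (non-fine-tuned) checkpoint with ordinary instructions yields little; it only fills in sentinel-marked gaps.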

@fairydreaming (Collaborator, Author):

> > @Sadeghi85 Only llama-cli supports encoder-decoder models at this moment.
>
> I tried with my own finetune of madlad400-7b and it worked correctly. (there is an extra character at the start as you mentioned)
>
> Thanks.

@Sadeghi85 I know what this extra char is: it's the decoder start token (the initial token passed to the decoder to start the autoregressive decoding process). In madlad400 the decoder start token has id 0, token 0 is unk_token, and llama.cpp prints unknown tokens as "▅" (U+2585 Lower Five Eighths Block). So it's not exactly a bug, but I'm not sure whether llama-cli should print the decoder start token or not.
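The stray character and the obvious fix can be shown in a few lines. This is a hypothetical detokenizer sketch, not llama.cpp code; it just mimics the "print unknown tokens as U+2585" behaviour and shows that skipping the decoder start token removes the artifact:

```python
LOWER_FIVE_EIGHTHS_BLOCK = "\u2585"  # what unknown tokens render as

def detok(token_ids, vocab, unk_id, decoder_start_id=0, skip_start=True):
    """Sketch: in madlad400, decoder_start_token_id == unk_token (id 0),
    so the start token renders as a stray U+2585 unless it is skipped."""
    if skip_start and token_ids and token_ids[0] == decoder_start_id:
        token_ids = token_ids[1:]
    return "".join(LOWER_FIVE_EIGHTHS_BLOCK if t == unk_id else vocab[t]
                   for t in token_ids)
```

Whether llama-cli should do this skipping itself is exactly the open question above.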

@MoonRide303:

> > @fairydreaming It seems that they've released both T5 and T5x checkpoints. I've mentioned those, cause they've got some improvements on benchmarks compared to vanilla T5, and looked roughly compatible - but if it's not trivial to add support for it, then I guess they'll have to wait for better times.
>
> @MoonRide303 I managed to run pile-t5-base, but it looks like all it can do is "to take a string of text that has been partially replaced with mask tokens and predict a sequence of tokens that would replace those mask tokens". Are there any fine-tunes of pile-t5 with more interesting use-cases?

I've found finetuned variants (like FLAN) on HF, but haven't tested them yet. I was wondering if the base models could be used as an alternative to vanilla T5 for image generation (in architectures like SD3 or PixArt Sigma); it might require training a new model with Pile-T5 from scratch, though.

@fairydreaming (Collaborator, Author):

> > @MoonRide303 I managed to run pile-t5-base, but it looks like all it can do is "to take a string of text that has been partially replaced with mask tokens and predict a sequence of tokens that would replace those mask tokens". Are there any fine-tunes of pile-t5 with more interesting use-cases?
>
> I've found finetuned variants (like FLAN) on HF, but didn't test those, yet. I was wondering if the base models could be used as an alternative for vanilla T5 for the purpose of image generation (in architectures like SD3 or PixArt Sigma) - it might require training new model with Pile-T5 from the scratch, though.

@MoonRide303 Pile-T5 models should now work in my t5 branch. I checked the pile-t5-xl-flan you mentioned; it seems to generate coherent output.

@MathiasSchindler:

> > @Sadeghi85 Only llama-cli supports encoder-decoder models at this moment.
>
> I tried with my own finetune of madlad400-7b and it worked correctly. (there is an extra character at the start as you mentioned)
>
> Thanks.

Congratulations. Since this is outside the scope of this thread, would you be able to point me to a simple explanation of how to use the MADLAD-400 model with llama.cpp? This would be greatly appreciated.

@Sadeghi85:

> Congratulations. Since this is outside the scope of this thread here, would you be able to point to me to a simple explanation how to use the MADLAD-400 model using llama.cpp? This would be greatly appreciated.

Follow T5 support progression here: #5763

When it's complete, you can use madlad like any other model.

If you want to test it now, you have to compile fairydreaming's t5 branch. Use convert-hf-to-gguf.py to convert the madlad model to GGUF, and use llama-cli for inference.

@vladfaust:

It may be out of the scope of this PR, but I'd like to note that ./llama-quantize ./models/t5-small/ggml-model-f16.gguf ./models/t5-small/ggml-model-Q4_K_M.gguf Q4_K_M fails with the following output:

main: build = 3252 (7d7fff46)
main: built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.2.0
main: quantizing './models/t5-small/ggml-model-f16.gguf' to './models/t5-small/ggml-model-Q4_K_M.gguf' as Q4_K_M
llama_model_loader: loaded meta data with 28 key-value pairs and 132 tensors from ./models/t5-small/ggml-model-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = t5
llama_model_loader: - kv   1:                               general.name str              = T5
llama_model_loader: - kv   2:                          t5.context_length u32              = 512
llama_model_loader: - kv   3:                        t5.embedding_length u32              = 512
llama_model_loader: - kv   4:                     t5.feed_forward_length u32              = 2048
llama_model_loader: - kv   5:                             t5.block_count u32              = 6
llama_model_loader: - kv   6:                    t5.attention.head_count u32              = 8
llama_model_loader: - kv   7:                    t5.attention.key_length u32              = 64
llama_model_loader: - kv   8:                  t5.attention.value_length u32              = 64
llama_model_loader: - kv   9:            t5.attention.layer_norm_epsilon f32              = 0.000001
llama_model_loader: - kv  10:        t5.attention.relative_buckets_count u32              = 32
llama_model_loader: - kv  11:        t5.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  12:                  t5.decoder_start_token_id u32              = 0
llama_model_loader: - kv  13:                          general.file_type u32              = 1
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = t5
llama_model_loader: - kv  15:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,32128]   = ["<pad>", "</s>", "<unk>", "▁", "X"...
llama_model_loader: - kv  17:                      tokenizer.ggml.scores arr[f32,32128]   = [0.000000, 0.000000, 0.000000, -2.012...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,32128]   = [3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:            tokenizer.ggml.add_space_prefix bool             = true
llama_model_loader: - kv  20:    tokenizer.ggml.remove_extra_whitespaces bool             = true
llama_model_loader: - kv  21:        tokenizer.ggml.precompiled_charsmap arr[u8,237539]   = [0, 180, 2, 0, 0, 132, 0, 0, 0, 0, 0,...
llama_model_loader: - kv  22:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  23:            tokenizer.ggml.unknown_token_id u32              = 2
llama_model_loader: - kv  24:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  25:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  26:               tokenizer.ggml.add_eos_token bool             = true
llama_model_loader: - kv  27:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   32 tensors
llama_model_loader: - type  f16:  100 tensors
GGML_ASSERT: src/llama.cpp:17201: (qs.n_attention_wv == 0 || qs.n_attention_wv == (int)model.hparams.n_layer) && "n_attention_wv is unexpected"
[1]    68589 abort      ./llama-quantize ./models/t5-small/ggml-model-f16.gguf  Q4_K_M
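A sketch of why the assertion trips. This is a hypothetical reconstruction, assuming (as the assertion message suggests) that llama-quantize counts tensors whose name contains "attn_v.weight" and expects that count to be 0 or exactly n_layer. In a T5 GGUF, encoder self-attention, decoder self-attention, and decoder cross-attention each contribute one such tensor per block, so the count is roughly three times n_layer:

```python
def check_attn_v(tensor_names, n_layer):
    """Reconstruct the llama-quantize sanity check (assumed behaviour):
    count tensors whose name contains "attn_v.weight" and require the
    count to be 0 or exactly n_layer."""
    n_attention_wv = sum("attn_v.weight" in name for name in tensor_names)
    return n_attention_wv, n_attention_wv in (0, n_layer)
```

For t5-small (6 encoder + 6 decoder blocks) the count would be 18, not 6, which matches the observed abort.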

@fairydreaming (Collaborator, Author):

> It may be out of the scope of this PR, but I'd like to note that ./llama-quantize ./models/t5-small/ggml-model-f16.gguf ./models/t5-small/ggml-model-Q4_K_M.gguf Q4_K_M fails with the following output:

@vladfaust I added fixes for this in PR #8141, thanks for reporting!

@compilade compilade mentioned this pull request Jun 30, 2024