Model conversion support for T5 and FLAN-T5 model variants #8055
…alGeneration and T5WithLMHeadModel
Thank you for this implementation, @fairydreaming!
I've just tested the conversion of t5-small and it worked great!
I hope you can also bring support for flan-t5 later 🙏
Hmm, since it's the same architecture with small tweaks (gated GELU instead of ReLU, separate lm_head), it shouldn't be hard.
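To make the "small tweaks" concrete, here is a minimal pure-Python sketch of the two feed-forward variants: a plain ReLU block and a gated-GELU block. The function and weight names (`ffn_relu`, `ffn_gated_gelu`, `w_gate`, `w_up`, `w_out`) are hypothetical, the tanh approximation of GELU is an assumption about the exact activation, and this is not the code from the PR.

```python
import math

def relu(x):
    return max(0.0, x)

def gelu_tanh(x):
    # tanh approximation of GELU (assumed variant, for illustration only)
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def matvec(w, x):
    # w is a list of rows; returns w @ x
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def ffn_relu(x, w_in, w_out):
    # ReLU variant: out = W_out * relu(W_in * x)
    hidden = [relu(v) for v in matvec(w_in, x)]
    return matvec(w_out, hidden)

def ffn_gated_gelu(x, w_gate, w_up, w_out):
    # Gated-GELU variant: out = W_out * (gelu(W_gate * x) * (W_up * x))
    gate = [gelu_tanh(v) for v in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_out, hidden)
```

The gated variant needs one extra input projection per block, which is why a converter has to map one more tensor per feed-forward layer.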
@felladrin It's now supported.
Amazing work! I have just one more thought:
…tokens tensors (they are duplicates of shared tensor)
@felladrin From what I see, all models from the T5 and FLAN-T5 families use the same spiece.model file. If they fine-tuned T5 or FLAN-T5 to create the LaMini-T5 and LaMini-Flan-T5 models without changing tokens, then you can simply copy spiece.model from T5 or FLAN-T5. I added one more commit that allows converting both of the LaMini models you mentioned. They seem to work just fine on my t5 branch (https://github.com/fairydreaming/llama.cpp/tree/t5):
Thank you! Onwards!
I only have some very minor comments on this, which is great!
MODEL_TENSOR.DEC_OUTPUT_NORM: "dec.output_norm",
MODEL_TENSOR.ENC_ATTN_NORM: "enc.blk.{bid}.attn_norm",
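For readers unfamiliar with these templates, the sketch below shows how a `{bid}` placeholder could be expanded into a concrete per-block tensor name such as `enc.blk.0.attn_norm.weight`. The helper `tensor_name` and its `suffix` parameter are hypothetical and only illustrate the naming pattern; they are not part of gguf-py.

```python
# Templates copied from the diff above; everything else is illustrative.
ENC_ATTN_NORM = "enc.blk.{bid}.attn_norm"
DEC_OUTPUT_NORM = "dec.output_norm"

def tensor_name(template, bid=None, suffix=".weight"):
    # Per-block templates carry a {bid} placeholder; global ones do not.
    name = template.format(bid=bid) if "{bid}" in template else template
    return name + suffix

print(tensor_name(ENC_ATTN_NORM, bid=0))
print(tensor_name(DEC_OUTPUT_NORM))
```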
@compilade I tried it on one example model (`python3 gguf-py/scripts/gguf-dump.py --markdown /mnt/md0/models/t5-small.gguf`) and I'm not sure what could be fixed. Can you be more specific?
T_ID | Tensor Layer Name | Human Friendly Tensor Layer Name | Elements | Shape | Type |
---|---|---|---|---|---|
0 | dec.blk.0.attn_k.weight | Dec Block 0 Attention Key (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
1 | dec.blk.0.attn_o.weight | Dec Block 0 Attn_O (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
2 | dec.blk.0.attn_q.weight | Dec Block 0 Attention Query (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
3 | dec.blk.0.attn_rel_b.weight | Dec Block 0 Attn_Rel_B (W) | ( 256) 256 | 8 x 32 x 1 x 1 | F16 |
4 | dec.blk.0.attn_v.weight | Dec Block 0 Attention Value (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
5 | dec.blk.0.attn_norm.weight | Dec Block 0 Attention Normalization (W) | ( 512) 512 | 512 x 1 x 1 x 1 | F32 |
6 | dec.blk.0.cross_attn_k.weight | Dec Block 0 Cross_Attn_K (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
7 | dec.blk.0.cross_attn_o.weight | Dec Block 0 Cross_Attn_O (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
8 | dec.blk.0.cross_attn_q.weight | Dec Block 0 Cross_Attn_Q (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
9 | dec.blk.0.cross_attn_rel_b.weight | Dec Block 0 Cross_Attn_Rel_B (W) | ( 256) 256 | 8 x 32 x 1 x 1 | F16 |
10 | dec.blk.0.cross_attn_v.weight | Dec Block 0 Cross_Attn_V (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
11 | dec.blk.0.cross_attn_norm.weight | Dec Block 0 Cross_Attn_Norm (W) | ( 512) 512 | 512 x 1 x 1 x 1 | F32 |
12 | dec.blk.0.ffn_up.weight | Dec Block 0 Feed-Forward Network "Up" (W) | ( ~1M) 1048576 | 512 x 2048 x 1 x 1 | F16 |
13 | dec.blk.0.ffn_down.weight | Dec Block 0 Feed-Forward Network "Down" (W) | ( ~1M) 1048576 | 2048 x 512 x 1 x 1 | F16 |
14 | dec.blk.0.ffn_norm.weight | Dec Block 0 Feed-Forward Network Normalization (W) | ( 512) 512 | 512 x 1 x 1 x 1 | F32 |
15 | dec.blk.1.attn_k.weight | Dec Block 1 Attention Key (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
16 | dec.blk.1.attn_o.weight | Dec Block 1 Attn_O (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
17 | dec.blk.1.attn_q.weight | Dec Block 1 Attention Query (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
18 | dec.blk.1.attn_v.weight | Dec Block 1 Attention Value (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
19 | dec.blk.1.attn_norm.weight | Dec Block 1 Attention Normalization (W) | ( 512) 512 | 512 x 1 x 1 x 1 | F32 |
20 | dec.blk.1.cross_attn_k.weight | Dec Block 1 Cross_Attn_K (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
21 | dec.blk.1.cross_attn_o.weight | Dec Block 1 Cross_Attn_O (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
22 | dec.blk.1.cross_attn_q.weight | Dec Block 1 Cross_Attn_Q (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
23 | dec.blk.1.cross_attn_v.weight | Dec Block 1 Cross_Attn_V (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
24 | dec.blk.1.cross_attn_norm.weight | Dec Block 1 Cross_Attn_Norm (W) | ( 512) 512 | 512 x 1 x 1 x 1 | F32 |
25 | dec.blk.1.ffn_up.weight | Dec Block 1 Feed-Forward Network "Up" (W) | ( ~1M) 1048576 | 512 x 2048 x 1 x 1 | F16 |
26 | dec.blk.1.ffn_down.weight | Dec Block 1 Feed-Forward Network "Down" (W) | ( ~1M) 1048576 | 2048 x 512 x 1 x 1 | F16 |
27 | dec.blk.1.ffn_norm.weight | Dec Block 1 Feed-Forward Network Normalization (W) | ( 512) 512 | 512 x 1 x 1 x 1 | F32 |
28 | dec.blk.2.attn_k.weight | Dec Block 2 Attention Key (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
29 | dec.blk.2.attn_o.weight | Dec Block 2 Attn_O (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
30 | dec.blk.2.attn_q.weight | Dec Block 2 Attention Query (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
31 | dec.blk.2.attn_v.weight | Dec Block 2 Attention Value (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
32 | dec.blk.2.attn_norm.weight | Dec Block 2 Attention Normalization (W) | ( 512) 512 | 512 x 1 x 1 x 1 | F32 |
33 | dec.blk.2.cross_attn_k.weight | Dec Block 2 Cross_Attn_K (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
34 | dec.blk.2.cross_attn_o.weight | Dec Block 2 Cross_Attn_O (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
35 | dec.blk.2.cross_attn_q.weight | Dec Block 2 Cross_Attn_Q (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
36 | dec.blk.2.cross_attn_v.weight | Dec Block 2 Cross_Attn_V (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
37 | dec.blk.2.cross_attn_norm.weight | Dec Block 2 Cross_Attn_Norm (W) | ( 512) 512 | 512 x 1 x 1 x 1 | F32 |
38 | dec.blk.2.ffn_up.weight | Dec Block 2 Feed-Forward Network "Up" (W) | ( ~1M) 1048576 | 512 x 2048 x 1 x 1 | F16 |
39 | dec.blk.2.ffn_down.weight | Dec Block 2 Feed-Forward Network "Down" (W) | ( ~1M) 1048576 | 2048 x 512 x 1 x 1 | F16 |
40 | dec.blk.2.ffn_norm.weight | Dec Block 2 Feed-Forward Network Normalization (W) | ( 512) 512 | 512 x 1 x 1 x 1 | F32 |
41 | dec.blk.3.attn_k.weight | Dec Block 3 Attention Key (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
42 | dec.blk.3.attn_o.weight | Dec Block 3 Attn_O (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
43 | dec.blk.3.attn_q.weight | Dec Block 3 Attention Query (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
44 | dec.blk.3.attn_v.weight | Dec Block 3 Attention Value (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
45 | dec.blk.3.attn_norm.weight | Dec Block 3 Attention Normalization (W) | ( 512) 512 | 512 x 1 x 1 x 1 | F32 |
46 | dec.blk.3.cross_attn_k.weight | Dec Block 3 Cross_Attn_K (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
47 | dec.blk.3.cross_attn_o.weight | Dec Block 3 Cross_Attn_O (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
48 | dec.blk.3.cross_attn_q.weight | Dec Block 3 Cross_Attn_Q (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
49 | dec.blk.3.cross_attn_v.weight | Dec Block 3 Cross_Attn_V (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
50 | dec.blk.3.cross_attn_norm.weight | Dec Block 3 Cross_Attn_Norm (W) | ( 512) 512 | 512 x 1 x 1 x 1 | F32 |
51 | dec.blk.3.ffn_up.weight | Dec Block 3 Feed-Forward Network "Up" (W) | ( ~1M) 1048576 | 512 x 2048 x 1 x 1 | F16 |
52 | dec.blk.3.ffn_down.weight | Dec Block 3 Feed-Forward Network "Down" (W) | ( ~1M) 1048576 | 2048 x 512 x 1 x 1 | F16 |
53 | dec.blk.3.ffn_norm.weight | Dec Block 3 Feed-Forward Network Normalization (W) | ( 512) 512 | 512 x 1 x 1 x 1 | F32 |
54 | dec.blk.4.attn_k.weight | Dec Block 4 Attention Key (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
55 | dec.blk.4.attn_o.weight | Dec Block 4 Attn_O (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
56 | dec.blk.4.attn_q.weight | Dec Block 4 Attention Query (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
57 | dec.blk.4.attn_v.weight | Dec Block 4 Attention Value (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
58 | dec.blk.4.attn_norm.weight | Dec Block 4 Attention Normalization (W) | ( 512) 512 | 512 x 1 x 1 x 1 | F32 |
59 | dec.blk.4.cross_attn_k.weight | Dec Block 4 Cross_Attn_K (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
60 | dec.blk.4.cross_attn_o.weight | Dec Block 4 Cross_Attn_O (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
61 | dec.blk.4.cross_attn_q.weight | Dec Block 4 Cross_Attn_Q (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
62 | dec.blk.4.cross_attn_v.weight | Dec Block 4 Cross_Attn_V (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
63 | dec.blk.4.cross_attn_norm.weight | Dec Block 4 Cross_Attn_Norm (W) | ( 512) 512 | 512 x 1 x 1 x 1 | F32 |
64 | dec.blk.4.ffn_up.weight | Dec Block 4 Feed-Forward Network "Up" (W) | ( ~1M) 1048576 | 512 x 2048 x 1 x 1 | F16 |
65 | dec.blk.4.ffn_down.weight | Dec Block 4 Feed-Forward Network "Down" (W) | ( ~1M) 1048576 | 2048 x 512 x 1 x 1 | F16 |
66 | dec.blk.4.ffn_norm.weight | Dec Block 4 Feed-Forward Network Normalization (W) | ( 512) 512 | 512 x 1 x 1 x 1 | F32 |
67 | dec.blk.5.attn_k.weight | Dec Block 5 Attention Key (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
68 | dec.blk.5.attn_o.weight | Dec Block 5 Attn_O (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
69 | dec.blk.5.attn_q.weight | Dec Block 5 Attention Query (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
70 | dec.blk.5.attn_v.weight | Dec Block 5 Attention Value (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
71 | dec.blk.5.attn_norm.weight | Dec Block 5 Attention Normalization (W) | ( 512) 512 | 512 x 1 x 1 x 1 | F32 |
72 | dec.blk.5.cross_attn_k.weight | Dec Block 5 Cross_Attn_K (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
73 | dec.blk.5.cross_attn_o.weight | Dec Block 5 Cross_Attn_O (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
74 | dec.blk.5.cross_attn_q.weight | Dec Block 5 Cross_Attn_Q (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
75 | dec.blk.5.cross_attn_v.weight | Dec Block 5 Cross_Attn_V (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
76 | dec.blk.5.cross_attn_norm.weight | Dec Block 5 Cross_Attn_Norm (W) | ( 512) 512 | 512 x 1 x 1 x 1 | F32 |
77 | dec.blk.5.ffn_up.weight | Dec Block 5 Feed-Forward Network "Up" (W) | ( ~1M) 1048576 | 512 x 2048 x 1 x 1 | F16 |
78 | dec.blk.5.ffn_down.weight | Dec Block 5 Feed-Forward Network "Down" (W) | ( ~1M) 1048576 | 2048 x 512 x 1 x 1 | F16 |
79 | dec.blk.5.ffn_norm.weight | Dec Block 5 Feed-Forward Network Normalization (W) | ( 512) 512 | 512 x 1 x 1 x 1 | F32 |
80 | dec.output_norm.weight | Dec Output Normalization (W) | ( 512) 512 | 512 x 1 x 1 x 1 | F32 |
81 | enc.blk.0.attn_k.weight | Enc Block 0 Attention Key (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
82 | enc.blk.0.attn_o.weight | Enc Block 0 Attn_O (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
83 | enc.blk.0.attn_q.weight | Enc Block 0 Attention Query (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
84 | enc.blk.0.attn_rel_b.weight | Enc Block 0 Attn_Rel_B (W) | ( 256) 256 | 8 x 32 x 1 x 1 | F16 |
85 | enc.blk.0.attn_v.weight | Enc Block 0 Attention Value (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
86 | enc.blk.0.attn_norm.weight | Enc Block 0 Attention Normalization (W) | ( 512) 512 | 512 x 1 x 1 x 1 | F32 |
87 | enc.blk.0.ffn_up.weight | Enc Block 0 Feed-Forward Network "Up" (W) | ( ~1M) 1048576 | 512 x 2048 x 1 x 1 | F16 |
88 | enc.blk.0.ffn_down.weight | Enc Block 0 Feed-Forward Network "Down" (W) | ( ~1M) 1048576 | 2048 x 512 x 1 x 1 | F16 |
89 | enc.blk.0.ffn_norm.weight | Enc Block 0 Feed-Forward Network Normalization (W) | ( 512) 512 | 512 x 1 x 1 x 1 | F32 |
90 | enc.blk.1.attn_k.weight | Enc Block 1 Attention Key (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
91 | enc.blk.1.attn_o.weight | Enc Block 1 Attn_O (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
92 | enc.blk.1.attn_q.weight | Enc Block 1 Attention Query (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
93 | enc.blk.1.attn_v.weight | Enc Block 1 Attention Value (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
94 | enc.blk.1.attn_norm.weight | Enc Block 1 Attention Normalization (W) | ( 512) 512 | 512 x 1 x 1 x 1 | F32 |
95 | enc.blk.1.ffn_up.weight | Enc Block 1 Feed-Forward Network "Up" (W) | ( ~1M) 1048576 | 512 x 2048 x 1 x 1 | F16 |
96 | enc.blk.1.ffn_down.weight | Enc Block 1 Feed-Forward Network "Down" (W) | ( ~1M) 1048576 | 2048 x 512 x 1 x 1 | F16 |
97 | enc.blk.1.ffn_norm.weight | Enc Block 1 Feed-Forward Network Normalization (W) | ( 512) 512 | 512 x 1 x 1 x 1 | F32 |
98 | enc.blk.2.attn_k.weight | Enc Block 2 Attention Key (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
99 | enc.blk.2.attn_o.weight | Enc Block 2 Attn_O (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
100 | enc.blk.2.attn_q.weight | Enc Block 2 Attention Query (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
101 | enc.blk.2.attn_v.weight | Enc Block 2 Attention Value (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
102 | enc.blk.2.attn_norm.weight | Enc Block 2 Attention Normalization (W) | ( 512) 512 | 512 x 1 x 1 x 1 | F32 |
103 | enc.blk.2.ffn_up.weight | Enc Block 2 Feed-Forward Network "Up" (W) | ( ~1M) 1048576 | 512 x 2048 x 1 x 1 | F16 |
104 | enc.blk.2.ffn_down.weight | Enc Block 2 Feed-Forward Network "Down" (W) | ( ~1M) 1048576 | 2048 x 512 x 1 x 1 | F16 |
105 | enc.blk.2.ffn_norm.weight | Enc Block 2 Feed-Forward Network Normalization (W) | ( 512) 512 | 512 x 1 x 1 x 1 | F32 |
106 | enc.blk.3.attn_k.weight | Enc Block 3 Attention Key (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
107 | enc.blk.3.attn_o.weight | Enc Block 3 Attn_O (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
108 | enc.blk.3.attn_q.weight | Enc Block 3 Attention Query (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
109 | enc.blk.3.attn_v.weight | Enc Block 3 Attention Value (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
110 | enc.blk.3.attn_norm.weight | Enc Block 3 Attention Normalization (W) | ( 512) 512 | 512 x 1 x 1 x 1 | F32 |
111 | enc.blk.3.ffn_up.weight | Enc Block 3 Feed-Forward Network "Up" (W) | ( ~1M) 1048576 | 512 x 2048 x 1 x 1 | F16 |
112 | enc.blk.3.ffn_down.weight | Enc Block 3 Feed-Forward Network "Down" (W) | ( ~1M) 1048576 | 2048 x 512 x 1 x 1 | F16 |
113 | enc.blk.3.ffn_norm.weight | Enc Block 3 Feed-Forward Network Normalization (W) | ( 512) 512 | 512 x 1 x 1 x 1 | F32 |
114 | enc.blk.4.attn_k.weight | Enc Block 4 Attention Key (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
115 | enc.blk.4.attn_o.weight | Enc Block 4 Attn_O (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
116 | enc.blk.4.attn_q.weight | Enc Block 4 Attention Query (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
117 | enc.blk.4.attn_v.weight | Enc Block 4 Attention Value (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
118 | enc.blk.4.attn_norm.weight | Enc Block 4 Attention Normalization (W) | ( 512) 512 | 512 x 1 x 1 x 1 | F32 |
119 | enc.blk.4.ffn_up.weight | Enc Block 4 Feed-Forward Network "Up" (W) | ( ~1M) 1048576 | 512 x 2048 x 1 x 1 | F16 |
120 | enc.blk.4.ffn_down.weight | Enc Block 4 Feed-Forward Network "Down" (W) | ( ~1M) 1048576 | 2048 x 512 x 1 x 1 | F16 |
121 | enc.blk.4.ffn_norm.weight | Enc Block 4 Feed-Forward Network Normalization (W) | ( 512) 512 | 512 x 1 x 1 x 1 | F32 |
122 | enc.blk.5.attn_k.weight | Enc Block 5 Attention Key (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
123 | enc.blk.5.attn_o.weight | Enc Block 5 Attn_O (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
124 | enc.blk.5.attn_q.weight | Enc Block 5 Attention Query (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
125 | enc.blk.5.attn_v.weight | Enc Block 5 Attention Value (W) | (~262K) 262144 | 512 x 512 x 1 x 1 | F16 |
126 | enc.blk.5.attn_norm.weight | Enc Block 5 Attention Normalization (W) | ( 512) 512 | 512 x 1 x 1 x 1 | F32 |
127 | enc.blk.5.ffn_up.weight | Enc Block 5 Feed-Forward Network "Up" (W) | ( ~1M) 1048576 | 512 x 2048 x 1 x 1 | F16 |
128 | enc.blk.5.ffn_down.weight | Enc Block 5 Feed-Forward Network "Down" (W) | ( ~1M) 1048576 | 2048 x 512 x 1 x 1 | F16 |
129 | enc.blk.5.ffn_norm.weight | Enc Block 5 Feed-Forward Network Normalization (W) | ( 512) 512 | 512 x 1 x 1 x 1 | F32 |
130 | enc.output_norm.weight | Enc Output Normalization (W) | ( 512) 512 | 512 x 1 x 1 x 1 | F32 |
131 | token_embd.weight | Token Embedding (W) | ( ~16M) 16449536 | 512 x 32128 x 1 x 1 | F16 |
In the markdown output of `gguf-dump.py`, there's currently a special case for tensor names which don't start with `blk` (ref: #7853 (comment); it seemed reasonable at the time), and it puts them all in the same section (so that `token_embd.weight` is in the same section as `output.weight`). If you try it on a non-T5 model (e.g. tinyllama or something), you'll notice that there are sections for each layer number.
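As an illustration of the grouping issue, here is a minimal sketch of how tensor names could be split into per-block sections when the block prefix may be `blk.N`, `enc.blk.N` or `dec.blk.N`. The names `section_of` and `group_by_section` and the `"base"` section label are hypothetical; this is neither the current `gguf-dump.py` logic nor the actual fix.

```python
import re

# Optional "enc."/"dec." prefix, then "blk.<number>."
_BLOCK_RE = re.compile(r"^(?:(enc|dec)\.)?blk\.(\d+)\.")

def section_of(name):
    # Tensors without a block index (embeddings, output norms) share one section.
    m = _BLOCK_RE.match(name)
    if m is None:
        return "base"
    prefix, bid = m.group(1), int(m.group(2))
    return f"{prefix}.blk.{bid}" if prefix else f"blk.{bid}"

def group_by_section(names):
    # Dicts preserve insertion order, so sections appear in first-seen order.
    groups = {}
    for name in names:
        groups.setdefault(section_of(name), []).append(name)
    return groups
```

With the T5 names from the table above, `dec.blk.0.attn_k.weight` would land in a `dec.blk.0` section instead of being lumped together with `token_embd.weight`.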
Fixed in #8090
@@ -49,6 +49,7 @@ class LLM:
EXPERT_WEIGHTS_SCALE = "{arch}.expert_weights_scale"
POOLING_TYPE = "{arch}.pooling_type"
LOGIT_SCALE = "{arch}.logit_scale"
DECODER_START_TOKEN_ID = "{arch}.decoder_start_token_id"
Is there a specific reason why the `decoder_start_token_id` isn't with the rest of the tokenizer config (like e.g. `tokenizer.ggml.bos_token_id`)?
In what way is it different from `tokenizer.ggml.bos_token_id`? When is it used?
Yes, it's different. It's not related to the tokenizer at all; it's a model parameter. The decoder start token is not a separate specific token like BOS, EOS or PAD. It's used in encoder-decoder models like T5 as the initial token of the autoregressive decoding process. The model creators decided to use one of the existing tokens as the decoder start token (PAD in the case of T5), and the id of this token is stored in this parameter.
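A rough sketch of where this parameter enters decoding: the decoder input starts with the decoder start token and grows one token at a time. Everything here is illustrative; `greedy_decode` and `toy_next_token` are hypothetical, the token ids are toy values, and the architecture name `"t5"` used to expand the key template is an assumption.

```python
# Key template from the diff above; "t5" as the arch value is assumed.
DECODER_START_TOKEN_ID = "{arch}.decoder_start_token_id"
key = DECODER_START_TOKEN_ID.format(arch="t5")

def greedy_decode(next_token, decoder_start_token_id, eos_token_id, max_new_tokens=16):
    # The decoder never starts from an empty sequence: it is seeded with
    # the decoder start token, then extended autoregressively.
    tokens = [decoder_start_token_id]
    for _ in range(max_new_tokens):
        tok = next_token(tokens)
        tokens.append(tok)
        if tok == eos_token_id:
            break
    return tokens

def toy_next_token(tokens):
    # Stand-in for a real decoder step: replays a fixed script.
    script = [10, 11, 12, 1]
    return script[len(tokens) - 1]

out = greedy_decode(toy_next_token, decoder_start_token_id=0, eos_token_id=1)
print(key, out)
```

Because the start id is read from model metadata rather than hard-coded, the same loop would serve a model that picks a different existing token as its decoder start token.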
Hello, is Madlad-400 also supported? It's based on T5.
Currently it converts OK, but then crashes with a big boom:
I tried to convert pile-t5-xl (blog post) using 52fc870 - it didn't work:
Could it be supported, too? It uses the Llama tokenizer.
@Sadeghi85 I added some fixes that allow running this (tested on madlad400-3b), but they are currently in my branch: https://github.com/fairydreaming/llama.cpp/tree/t5
I converted the HF model to GGUF and it went OK. Then I compiled the t5 branch and ran llama-server with the converted GGUF; it gave the error below: GGML_ASSERT: J:\fairydreaming\llama.cpp\examples\server\server.cpp:690: llama_add_eos_token(model) != 1
@Sadeghi85 Only llama-cli supports encoder-decoder models at this moment. Example:
It looks like there's some weird extra character output with madlad400-3b, but I haven't had time to investigate this yet.
From the description it looks like it's based on T5X, not T5.
@MoonRide303 It looks like it would require some extra work, so maybe some day.
@fairydreaming It seems that they've released both T5 and T5x checkpoints. I mentioned those because they show some improvements on benchmarks compared to vanilla T5 and looked roughly compatible, but if it's not trivial to add support for them, then I guess they'll have to wait for better times.
I tried with my own finetune of madlad400-7b and it worked correctly (there is an extra character at the start, as you mentioned). Thanks.
@MoonRide303 I managed to run pile-t5-base, but it looks like all it can do is "to take a string of text that has been partially replaced with mask tokens and predict a sequence of tokens that would replace those mask tokens". Are there any fine-tunes of pile-t5 with more interesting use-cases?
@Sadeghi85 I know what this extra char is: it's the decoder starting token (the initial token passed to the decoder to start the autoregressive decoding process). In madlad400 the decoder starting token has id 0, token 0 is unk_token, and llama prints unknown tokens as “▅” (U+2585, Lower Five Eighths Block). So it's not exactly a bug, but I'm not sure whether llama-cli should print the decoder starting token or not.
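To illustrate the effect, here is a toy sketch of how printing the decoder starting token produces the stray glyph, and how skipping it would avoid that. The `detokenize` helper, the `skip_start` flag and the tiny vocabulary are all hypothetical; this does not describe what llama-cli does or will do.

```python
UNK_GLYPH = "\u2585"  # "▅", Lower Five Eighths Block

def detokenize(tokens, vocab, decoder_start_token_id, skip_start=True):
    # Optionally drop the leading decoder start token before rendering.
    if skip_start and tokens and tokens[0] == decoder_start_token_id:
        tokens = tokens[1:]
    # Ids missing from the vocabulary are rendered as the unknown glyph.
    return "".join(vocab.get(t, UNK_GLYPH) for t in tokens)

# Toy vocabulary: id 0 is the unknown token, mirroring the madlad400 case.
vocab = {0: UNK_GLYPH, 5: "Hello", 6: " world"}
generated = [0, 5, 6]  # decoder start token (id 0) followed by real output

print(detokenize(generated, vocab, decoder_start_token_id=0, skip_start=False))
print(detokenize(generated, vocab, decoder_start_token_id=0))
```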
I've found finetuned variants (like FLAN) on HF, but haven't tested those yet. I was wondering if the base models could be used as an alternative to vanilla T5 for the purpose of image generation (in architectures like SD3 or PixArt Sigma). It might require training a new model with Pile-T5 from scratch, though.
@MoonRide303 Pile-T5 models should now work in my t5 branch. I checked the pile-t5-xl-flan you mentioned; it seems to generate coherent output.
Congratulations. Since this is outside the scope of this thread, would you be able to point me to a simple explanation of how to use the MADLAD-400 model with llama.cpp? This would be greatly appreciated.
Follow the T5 support progress here: #5763. When it's complete, you can use madlad like any other model. If you want to test it now, you have to compile fairydreaming's t5 branch. Use convert-hf-to-gguf.py to convert the madlad model to GGUF and use llama-cli for inference.
It may be out of the scope of this PR, but I'd like to note that
@vladfaust I added fixes for this in PR #8141, thanks for reporting!
This PR adds model conversion support for T5 and FLAN-T5 model variants:
It's the first PR in a series of PRs adding support for the T5 and FLAN-T5 model families.