
llama : add grammar-based sampling #1773

Merged: 22 commits, Jul 24, 2023

Conversation

@ejones (Collaborator) commented Jun 9, 2023

EDITED after updates

Inspired by #1397 and grantslatton's CFG work, this adds an API that takes a serialized context-free grammar to guide and constrain sampling. It also adds, in main, a sample Backus-Naur form (BNF)-like syntax for specifying a grammar for generations.

Testing

(M2 Max, 30B)

Chess
 % ./main -m $LLAMA_30B_Q4_0 -n 32 -p $'A good game:\n\n' --grammar-file grammars/chess.gbnf
main: build = 674 (e550234)
main: seed  = 1688014137
llama.cpp: loading model from /Users/evan/llama-models/30B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 6656
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 52
llama_model_load_internal: n_layer    = 60
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 17920
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size =    0.13 MB
llama_model_load_internal: mem required  = 19756.66 MB (+ 3124.00 MB per state)
.
llama_init_from_file: kv self size  =  780.00 MB

system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 32, n_keep = 0


main: grammar:
root ::= [1] [.] [ ] move [ ] move [<U+000A>] root_4 
move ::= move_5 move_9 
root_2 ::= [1-9] root_3 [.] [ ] move [ ] move [<U+000A>] 
root_3 ::= [0-9] | 
root_4 ::= root_2 root_4 | root_2 
move_5 ::= pawn | nonpawn | castle 
pawn ::= pawn_14 [a-h] [1-8] pawn_16 
nonpawn ::= [NBKQR] nonpawn_10 nonpawn_11 nonpawn_12 [a-h] [1-8] 
castle ::= [O] [-] [O] castle_17 
move_9 ::= [+#] | 
nonpawn_10 ::= [a-h] | 
nonpawn_11 ::= [1-8] | 
nonpawn_12 ::= [x] | 
pawn_13 ::= [a-h] [x] 
pawn_14 ::= pawn_13 | 
pawn_15 ::= [=] [NBKQR] 
pawn_16 ::= pawn_15 | 
castle_17 ::= [-] [O] | 

 A good game:

1. e4 e5
2. Nf3 Nc6
3. Bb5 a6
4. Ba4 Nf6

llama_print_timings:        load time =  1144.33 ms
llama_print_timings:      sample time =    35.87 ms /    32 runs   (    1.12 ms per token)
llama_print_timings: prompt eval time =  1126.34 ms /     7 tokens (  160.91 ms per token)
llama_print_timings:        eval time =  5214.99 ms /    31 runs   (  168.23 ms per token)
llama_print_timings:       total time =  6398.45 ms
"Chess" without grammar
% ./main -m $LLAMA_30B_Q4_0 -n 32 -p $'A good game:\n\n'  

main: build = 645 (fd0eb66)
main: seed  = 1686286016
llama.cpp: loading model from /Users/evan/llama-models/30B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 6656
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 52
llama_model_load_internal: n_layer    = 60
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 17920
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size =    0.13 MB
llama_model_load_internal: mem required  = 19756.66 MB (+ 3124.00 MB per state)
.
llama_init_from_file: kv self size  =  780.00 MB

system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 32, n_keep = 0


 A good game:

Sir Thomas Gresham, when he was building his famous Exchange at London, had the following dialogue with a mason, whose name was Richard B
llama_print_timings:        load time =  1185.47 ms
llama_print_timings:      sample time =    21.57 ms /    32 runs   (    0.67 ms per token)
llama_print_timings: prompt eval time =  1167.67 ms /     7 tokens (  166.81 ms per token)
llama_print_timings:        eval time =  4977.97 ms /    31 runs   (  160.58 ms per token)
llama_print_timings:       total time =  6188.21 ms
Arithmetic
 % ./main -m $LLAMA_30B_Q4_0 -n 32 -p $'Some arithmetic practice:\n\n' \                      
--grammar 'root  ::= (expr "=" ws num "\n")+
expr  ::= term ([-+*/] term)*
term  ::= ident | num | "(" ws expr ")" ws
ident ::= [a-z] [a-z0-9_]* ws
num   ::= [0-9]+ ws
ws    ::= [ \t\n]*'
main: build = 674 (e550234)
main: seed  = 1688014196
llama.cpp: loading model from /Users/evan/llama-models/30B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 6656
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 52
llama_model_load_internal: n_layer    = 60
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 17920
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size =    0.13 MB
llama_model_load_internal: mem required  = 19756.66 MB (+ 3124.00 MB per state)
.
llama_init_from_file: kv self size  =  780.00 MB

system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 32, n_keep = 0


main: grammar:
root ::= root_5 
root_1 ::= expr [=] ws num [<U+000A>] 
expr ::= term expr_8 
ws ::= ws_12 
num ::= num_11 ws 
root_5 ::= root_1 root_5 | root_1 
term ::= ident | num | [(] ws expr [)] ws 
expr_7 ::= [-+*/] term 
expr_8 ::= expr_7 expr_8 | 
ident ::= [a-z] ident_10 ws 
ident_10 ::= [a-z0-9_] ident_10 | 
num_11 ::= [0-9] num_11 | [0-9] 
ws_12 ::= [ <U+0009><U+000A>] ws_12 | 

 Some arithmetic practice:

10 *a*1 +b*2 =640

10 *a*2 +b*3 =656


llama_print_timings:        load time =  1165.00 ms
llama_print_timings:      sample time =    41.11 ms /    32 runs   (    1.28 ms per token)
llama_print_timings: prompt eval time =  1147.76 ms /     7 tokens (  163.97 ms per token)
llama_print_timings:        eval time =  5113.92 ms /    31 runs   (  164.97 ms per token)
llama_print_timings:       total time =  6323.27 ms
Arithmetic - no grammar
 % ./main -m $LLAMA_30B_Q4_0 -n 32 -p $'Some arithmetic practice:\n\n'                                            
main: build = 645 (fd0eb66)
main: seed  = 1686286388
llama.cpp: loading model from /Users/evan/llama-models/30B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 6656
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 52
llama_model_load_internal: n_layer    = 60
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 17920
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size =    0.13 MB
llama_model_load_internal: mem required  = 19756.66 MB (+ 3124.00 MB per state)
.
llama_init_from_file: kv self size  =  780.00 MB

system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 32, n_keep = 0


 Some arithmetic practice:

\begin{code}
package main

import (
    "fmt"
)

func main() {
    fmt.Println(
llama_print_timings:        load time =  1171.65 ms
llama_print_timings:      sample time =    21.37 ms /    32 runs   (    0.67 ms per token)
llama_print_timings: prompt eval time =  1153.88 ms /     7 tokens (  164.84 ms per token)
llama_print_timings:        eval time =  4991.68 ms /    31 runs   (  161.02 ms per token)
llama_print_timings:       total time =  6187.91 ms
JSON
% ./main -m $LLAMA_30B_Q4_0 -n 64 -p $'A bit about me:\n\n' --grammar-file grammars/json.gbnf
main: build = 674 (e550234)
main: seed  = 1688014289
llama.cpp: loading model from /Users/evan/llama-models/30B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 6656
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 52
llama_model_load_internal: n_layer    = 60
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 17920
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size =    0.13 MB
llama_model_load_internal: mem required  = 19756.66 MB (+ 3124.00 MB per state)
.
llama_init_from_file: kv self size  =  780.00 MB

system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 64, n_keep = 0


main: grammar:
root ::= object 
object ::= [{] ws object_11 [}] 
value ::= object | array | string | number | boolean 
array ::= [[] ws array_15 []] 
string ::= ["] string_16 ["] ws 
number ::= number_17 number_18 ws 
boolean ::= boolean_19 ws 
ws ::= [ <U+0009><U+000A>] ws | 
object_8 ::= string [:] ws value object_10 
object_9 ::= [,] ws string [:] ws value 
object_10 ::= object_9 object_10 | 
object_11 ::= object_8 | 
array_12 ::= value array_14 
array_13 ::= [,] ws value 
array_14 ::= array_13 array_14 | 
array_15 ::= array_12 | 
string_16 ::= [ <U+0009>!#-[]-~] string_16 | 
number_17 ::= [-] | 
number_18 ::= [0-9] number_18 | [0-9] 
boolean_19 ::= [t] [r] [u] [e] | [f] [a] [l] [s] [e] 

 A bit about me:

{
	"fullName": "Ramon Rodriguez",
	"username": "ramon",
	"email": "ramon@mail.com",
	"phoneNumber": "+1234567890",
	"address": {
		
llama_print_timings:        load time =  1273.70 ms
llama_print_timings:      sample time =    82.93 ms /    64 runs   (    1.30 ms per token)
llama_print_timings: prompt eval time =  1256.36 ms /     8 tokens (  157.04 ms per token)
llama_print_timings:        eval time = 10432.05 ms /    63 runs   (  165.59 ms per token)
llama_print_timings:       total time = 11795.36 ms
"JSON" - no grammar
 % ./main -m $LLAMA_30B_Q4_0 -n 32 -p $'A bit about me:\n\n'                                                                          
main: build = 645 (fd0eb66)
main: seed  = 1686286615
llama.cpp: loading model from /Users/evan/llama-models/30B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 6656
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 52
llama_model_load_internal: n_layer    = 60
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 17920
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size =    0.13 MB
llama_model_load_internal: mem required  = 19756.66 MB (+ 3124.00 MB per state)
.
llama_init_from_file: kv self size  =  780.00 MB

system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 32, n_keep = 0


 A bit about me:

A former teacher, now a full-time writer. I am the author of two novels: _The Man in the Moon_ and _The Riddle
llama_print_timings:        load time =  1291.32 ms
llama_print_timings:      sample time =    21.48 ms /    32 runs   (    0.67 ms per token)
llama_print_timings: prompt eval time =  1274.63 ms /     8 tokens (  159.33 ms per token)
llama_print_timings:        eval time =  4990.01 ms /    31 runs   (  160.97 ms per token)
llama_print_timings:       total time =  6306.01 ms
Japanese
 % ./main -m $LLAMA_30B_Q4_0 -n 32 -p $'Building a website can be done in 10 simple steps (from the original Japanese):\n\n' --grammar-file grammars/japanese.gbnf
main: build = 674 (e550234)
main: seed  = 1688013430
llama.cpp: loading model from /Users/evan/llama-models/30B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 6656
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 52
llama_model_load_internal: n_layer    = 60
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 17920
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size =    0.13 MB
llama_model_load_internal: mem required  = 19756.66 MB (+ 3124.00 MB per state)
.
llama_init_from_file: kv self size  =  780.00 MB

system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 32, n_keep = 0


main: grammar:
root ::= root_2 root_5 
jp-char ::= hiragana | katakana | punctuation | cjk 
root_2 ::= jp-char root_2 | jp-char 
root_3 ::= [ <U+0009><U+000A>] root_4 
root_4 ::= jp-char root_4 | jp-char 
root_5 ::= root_3 root_5 | 
hiragana ::= [<U+3041>-<U+309F>] 
katakana ::= [<U+30A1>-<U+30FF>] 
punctuation ::= [<U+3001>-<U+303E>] 
cjk ::= [<U+4E00>-<U+9FFF>] 

 Building a website can be done in 10 simple steps (from the original Japanese):

一、目的は何なのか
二、お客さまを思い出して
三、お客さまのこと
llama_print_timings:        load time =  2957.19 ms
llama_print_timings:      sample time =    42.67 ms /    32 runs   (    1.33 ms per token)
llama_print_timings: prompt eval time =  2941.56 ms /    21 tokens (  140.07 ms per token)
llama_print_timings:        eval time =  5384.28 ms /    31 runs   (  173.69 ms per token)
llama_print_timings:       total time =  8387.61 ms
Japanese - no grammar
% ./main -m $LLAMA_30B_Q4_0 -n 32 -p $'Building a website can be done in 10 simple steps (from the original Japanese):\n\n' 
main: build = 674 (e550234)
main: seed  = 1688013483
llama.cpp: loading model from /Users/evan/llama-models/30B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 6656
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 52
llama_model_load_internal: n_layer    = 60
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 17920
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size =    0.13 MB
llama_model_load_internal: mem required  = 19756.66 MB (+ 3124.00 MB per state)
.
llama_init_from_file: kv self size  =  780.00 MB

system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 32, n_keep = 0


 Building a website can be done in 10 simple steps (from the original Japanese):

1. Determine your goal for your site.
2. Make a plan.
3. Select the domain name.
4. Choose web
llama_print_timings:        load time =  2955.05 ms
llama_print_timings:      sample time =    22.96 ms /    32 runs   (    0.72 ms per token)
llama_print_timings: prompt eval time =  2937.10 ms /    21 tokens (  139.86 ms per token)
llama_print_timings:        eval time =  5032.41 ms /    31 runs   (  162.34 ms per token)
llama_print_timings:       total time =  8013.71 ms

Approach

Grammar API

The llama API accepts a data structure representing a context-free grammar over 32-bit code points:

    // grammar element type
    enum llama_gretype {
        // end of rule definition
        LLAMA_GRETYPE_END            = 0,

        // start of alternate definition for rule
        LLAMA_GRETYPE_ALT            = 1,

        // non-terminal element: reference to rule
        LLAMA_GRETYPE_RULE_REF       = 2,

        // terminal element: character (code point)
        LLAMA_GRETYPE_CHAR           = 3,

        // modifies a preceding LLAMA_GRETYPE_CHAR or LLAMA_GRETYPE_CHAR_ALT to
        // be an inclusive range ([a-z])
        LLAMA_GRETYPE_CHAR_RNG_UPPER = 4,

        // modifies a preceding LLAMA_GRETYPE_CHAR or
        // LLAMA_GRETYPE_CHAR_RNG_UPPER to add an alternate char to match ([ab], [a-zA])
        LLAMA_GRETYPE_CHAR_ALT       = 5,
    };

    typedef struct llama_grammar_element {
        enum llama_gretype type;
        uint32_t           value; // Unicode code point or rule ID
    } llama_grammar_element;

    LLAMA_API struct llama_grammar * llama_grammar_init(
            const llama_grammar_element ** rules,
                                 size_t    n_rules,
                                 size_t    start_rule_index);
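
For illustration only (this snippet is not part of the PR), hand-encoding a tiny grammar with this API could look roughly like the following; the rule root ::= [0-9]+ is rewritten without repetition operators as root ::= [0-9] root | [0-9]:

    // Sketch: build `root ::= [0-9] root | [0-9]` by hand and initialize a
    // grammar whose start rule is rule 0.
    #include "llama.h"

    static struct llama_grammar * make_digit_grammar(void) {
        static const llama_grammar_element rule0[] = {
            { LLAMA_GRETYPE_CHAR,           '0' }, // '0' ...
            { LLAMA_GRETYPE_CHAR_RNG_UPPER, '9' }, // ... through '9'  => [0-9]
            { LLAMA_GRETYPE_RULE_REF,       0   }, // recurse into rule 0
            { LLAMA_GRETYPE_ALT,            0   }, // '|' separates the two alternates
            { LLAMA_GRETYPE_CHAR,           '0' },
            { LLAMA_GRETYPE_CHAR_RNG_UPPER, '9' },
            { LLAMA_GRETYPE_END,            0   }, // end of rule definition
        };
        const llama_grammar_element * rules[] = { rule0 };
        return llama_grammar_init(rules, 1, 0); // 1 rule, start at rule index 0
    }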

Sampling

The grammar sampling code models a nondeterministic pushdown automaton, maintaining N stacks for the possible parse states. Sampling a token is done in two steps: a sampling API call that filters candidates to those matching one of the parse stacks (llama_sample_grammar), and a call that adds the chosen token to the grammar (llama_grammar_accept_token).
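
For reference, a rough sketch (following the pattern in examples/main; ctx and grammar are assumed to be set up by the caller) of how the two calls slot into an existing sampling loop:

    // Build the usual candidate array from the model's logits.
    const int n_vocab = llama_n_vocab(ctx);
    float * logits = llama_get_logits(ctx);

    std::vector<llama_token_data> candidates;
    candidates.reserve(n_vocab);
    for (llama_token id = 0; id < n_vocab; id++) {
        candidates.push_back({ id, logits[id], 0.0f });
    }
    llama_token_data_array cand_arr = { candidates.data(), candidates.size(), false };

    // Step 1: reject candidates that no parse stack can accept.
    llama_sample_grammar(ctx, &cand_arr, grammar);

    // ... apply the usual temperature / top-k / top-p samplers here ...
    const llama_token tok = llama_sample_token(ctx, &cand_arr);

    // Step 2: advance the grammar's parse stacks with the chosen token.
    llama_grammar_accept_token(ctx, grammar, tok);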

Examples

Adds --grammar and --grammar-file arguments to main, which take a simple extended BNF to constrain generations. The parser for this format is implemented in examples/grammar-parser.{h,cpp}:

// ... Supports character
// ranges, grouping, and repetition operators. As an example, a grammar for
// arithmetic might look like:
//
// root  ::= expr
// expr  ::= term ([-+*/] term)*
// term  ::= num | "(" space expr ")" space
// num   ::= [0-9]+ space
// space ::= [ \t\n]*

The root rule identifies the start of the grammar.
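
For reference, a minimal sketch of how the parser and the grammar API fit together, following the pattern used in examples/main (the grammar string here is just an example):

    // Parse a GBNF grammar string and build a llama_grammar from it.
    #include "llama.h"
    #include "grammar-parser.h" // from examples/
    #include <vector>

    grammar_parser::parse_state parsed = grammar_parser::parse("root ::= [0-9]+");

    std::vector<const llama_grammar_element *> rules(parsed.c_rules());
    struct llama_grammar * grammar = llama_grammar_init(
        rules.data(), rules.size(), parsed.symbol_ids.at("root"));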

Caveats

  • the binary format makes the code harder to understand and more brittle
  • the grammar contemplates 16-bit chars but it's just being applied to the 8-bit UTF-8 chars in token strings currently
  • the 1-char lookahead sampling is probably biasing generations in a weird way; further investigation on quality of outputs is probably needed

@ggerganov added the "high priority" label on Jun 9, 2023
@howard0su (Collaborator)

Suggest taking a file as the grammar parameter and putting several examples in the repo, like what we did for prompts (in the .\prompts folder).

@tobi (Sponsor, Collaborator) commented Jun 10, 2023

Incredibly useful contribution. It's really amazing how much this simplifies many use cases.

I agree that it would be better if the grammar came from a file.

Two snags I hit while trying this out:

  • it crashes with --prompt-cache
  • any empty lines in the grammar cause a crash

Some additional thoughts:

  • Would love to have the grammars support empty lines and comments
  • I wonder if the grammar could be compiled into a tensor of state transitions and run on the GPU
  • I wonder if there is an optimization where, when the next token is already known from the grammar, we could skip the inference and just add it? In many types of grammars, like JSON or HTML, that could really speed up generation
  • I think it's worth allowing the grammar to reference full tokens. Maybe something like @“ token” or @13432 (the ID of a token).

@slaren (Collaborator) commented Jun 11, 2023

Very nice! I am wondering what is the rationale for not including the parser in the llama.cpp API. Without it, most downstream users will be forced to manually make a copy of the parser in their code to support the feature, which is not great.
Also for usability, I think it would be a good idea to keep a copy of the binary grammar in llama_grammar, rather than asking the users to keep the provided copy alive. The overhead would be minimal, and it would simplify the code of downstream users.

@ejones (Collaborator, Author) commented Jun 12, 2023

Thanks all! Just added support for grammar files (with examples) and updated the grammar syntax to add shell-style comments and allow empty lines between rules, as well as newlines inside parenthesized groups.

it crashes with --prompt-cache

I wonder if that was #1699 ? If so, should be fixed now

I wonder if the grammar could be compiled into a tensor of state transitions and run on the GPU

Sounds cool, I don't know enough about GPU programming to comment on that myself. The grammar participates in the sampling layer, and I'm not sure if that leverages the GPU currently.

I wonder if there is an optimization where the next token is already known form the grammar we could skip the inference and just add it?

This is definitely possible. That said, AFAIK the token would still need to be evaluated, and that seems to be the bottleneck. Maybe the optimization comes in being able to batch eval strings of such tokens?

I think it's worth allowing to reference full tokens form the grammar

Neat idea. Would that be more of an optimization or to reference tokens that can't be expressed textually?

what is the rationale for not including the parser in the llama.cpp API.

Honestly, I was trying to reduce the changes to llama.cpp itself. Agree it would be more convenient in the API.

I think it would be a good idea to keep a copy of the binary grammar

Makes sense. I left that out of this round of changes - if it's desired to have the grammar parser in the llama API, this may naturally fit with that change.

@bullno1 (Contributor) commented Jun 12, 2023

First, this is amazing work.

This makes me wonder whether the entire sampling API should be pulled into something like llama_samplers instead.
External samplers can evolve independently of the core API.

The existing functions can be kept for compatibility.
AFAIK, the only thing we need is to expose the RNG.
And even then, the existence of that inside a state/context is debatable.
The context window is already managed by user code so why not sampling?

This reminds me a lot of: https://lmql.ai/.
There is also https://github.com/1rgs/jsonformer where the input is a json schema which is not always easy to express in BNF.

AFAIK the token would still need to be evaluated

Would it though?
We just immediately add it to the context.
It is done manually in user code now.

Maybe the optimization comes in being able to batch eval strings of such tokens?

AFAIK, that's the case.
The initial prompt and the user input are submitted in a large batch.
The inference loop just feeds the single chosen token back until eos.

The grammar participates in the sampling layer, and I'm not sure if that leverages the GPU currently.

The current sampling is CPU.

@Green-Sky (Collaborator)

This makes me wonder whether the entire sampling API should be pulled into something like llama_samplers instead.

One of the discussion points for adding more LLM-generic tooling back into the ggml repo was moving the sampler there, but AFAIK nothing has happened yet :)

@ejones (Collaborator, Author) commented Jun 12, 2023

There is also https://github.com/1rgs/jsonformer where the input is a json schema

Was planning to tackle this next. I've got it more or less working locally in a branch off of this, at least with the examples on jsonformer's README. It uses a Python script to generate a JSON BNF that conforms to the schema.

@@ -263,6 +289,9 @@ extern "C" {
LLAMA_API void llama_sample_typical(struct llama_context * ctx, llama_token_data_array * candidates, float p, size_t min_keep);
LLAMA_API void llama_sample_temperature(struct llama_context * ctx, llama_token_data_array * candidates, float temp);

/// @details Apply constraints from grammar
LLAMA_API void llama_sample_grammar(struct llama_context * ctx, llama_token_data_array * candidates, const struct llama_grammar * grammar);

Collaborator:

Can we make llama_grammar a structure with two callbacks, so that other implementations of it can support a context-aware state machine instead?

Collaborator Author:

Do you mean like, the caller would provide the implementation of llama_grammar (via callbacks), from which the llama API determines which tokens are valid?

Collaborator:

Yes, so the llama code will not assume the grammar implementation.

Collaborator Author:

Yeah, I'm open to that idea, assuming the grammar interface itself generalizes well to other implementations. I kind of designed this with the specific implementation in mind so that's not a guarantee.
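
To make the idea concrete, a purely hypothetical shape for such an interface might be (none of these names exist in llama.cpp; this is illustration only):

    // Hypothetical callback-based grammar interface -- not part of this PR.
    struct llama_grammar_callbacks {
        void * user_data;
        // return true if `token` is allowed in the current grammar/automaton state
        bool (*accepts)(void * user_data, llama_token token);
        // advance the state after `token` has been accepted into the context
        void (*advance)(void * user_data, llama_token token);
    };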

@ggerganov (Owner)

Great stuff!

I'm still wrapping my head around this.

  • Yes, this can become part of a llama.cpp or ggml sampling API, but I guess for now we can keep it as an example and see what the pros and cons are and learn how to use it most efficiently
  • What happens when the next N > 1 tokens are uniquely determined by the grammar? I guess we will sample them one by one, correct? What would it take to make it so that they are submitted to be processed as a batch? This would significantly speed up the inference in such cases

@ejones (Collaborator, Author) commented Jun 16, 2023

  • Yes, this can become part of a llama.cpp or ggml sampling API, but I guess for now we can keep it as an example and see what the pros and cons are and learn how to use it most efficiently

To clarify, this PR adds the core sampling functionality in llama.cpp, leaving the grammar parser out in examples. Should that all be moved to examples or just left as is?

  • What happens when the next N > 1 tokens are uniquely determined by the grammar? I guess we will sample them one by one, correct? What would it take to make it so that they are submitted to be processed as a batch? This would significantly speed up the inference in such cases

Yes, that's correct. I think that's doable, I can take a stab at that.
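
Purely as a sketch of the batching idea (not part of this PR): repeatedly filter the full vocabulary against the grammar using dummy logits, and whenever exactly one token survives, accept it without sampling; the queued run of forced tokens can then be fed to a single llama_eval call. This assumes ctx, grammar, n_past and n_threads are in scope and that llama_sample_grammar marks rejected candidates with -inf logits.

    // Sketch: collect grammar-forced tokens and evaluate them as one batch.
    std::vector<llama_token> forced;
    for (;;) {
        const int n_vocab = llama_n_vocab(ctx);
        std::vector<llama_token_data> cand(n_vocab);
        for (llama_token id = 0; id < n_vocab; id++) {
            cand[id] = { id, 0.0f, 0.0f }; // dummy logits; only grammar validity matters
        }
        llama_token_data_array arr = { cand.data(), cand.size(), false };
        llama_sample_grammar(ctx, &arr, grammar); // rejected candidates -> -inf logit

        llama_token only = -1;
        int n_valid = 0;
        for (size_t i = 0; i < arr.size && n_valid < 2; i++) {
            if (arr.data[i].logit > -INFINITY) { only = arr.data[i].id; n_valid++; }
        }
        if (n_valid != 1) break; // zero or several legal tokens: back to normal sampling
        // (a real implementation would also cap the length of the forced run)

        llama_grammar_accept_token(ctx, grammar, only);
        forced.push_back(only);
    }
    if (!forced.empty()) {
        // feed the whole forced run to the model in one call instead of one by one
        llama_eval(ctx, forced.data(), (int) forced.size(), n_past, n_threads);
        n_past += (int) forced.size();
    }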

@SlyEcho (Sponsor, Collaborator) commented Jun 16, 2023

the grammar contemplates 16-bit chars but it's just being applied to the 8-bit UTF-8 chars in token strings currently

I don't understand this part. So it is converting to UTF-16?

Another option would be to use token values but it will be more limiting.

EDIT: I read through the code.

The grammar doesn't care about the text encoding. It could work with any encoding, provided that the rules match the characters correctly.

The parser doesn't understand UTF-8 so it will create rules that don't match as the user expects.

For example, if I wanted to create a rule to match all Hiragana characters, I should be able to write:

[ぁ-ゖ]

However the parser doesn't see it as two characters separated by -, instead:

[\xe3\x81\x81-\xe3\x82\x96]

But the correct rule should be something like this?

"\xe3" [\x81-\x82] [\x81-\x96]

@ivanstepanovftw (Collaborator)

Just don't use repeat penalties, to get the best grammar output that llama can produce.

@ggerganov (Owner)

To clarify, this PR adds the core sampling functionality in llama.cpp, leaving the grammar parser out in examples. Should that all be moved to examples or just left as is?

It's fine the way it is

@burke (Sponsor) commented Jun 16, 2023

FWIW I'm adapting this code into an analogous feature for models running on torch. In my implementation, I'm doing grammar enforcement logit masking on the GPU across the full token set before selecting candidates: https://github.com/Shopify/torch-grammar/blob/df23e354083c909c70120e256ed34036c93f6714/grammar_sampler.py#L232-L239. The same strategy would probably work here if anyone was super motivated to try it.

@ggerganov (Owner)

Adding a similar grammar-based sampling to whisper.cpp would be a really cool contribution.
Here is one ongoing implementation that can potentially benefit from such functionality by using VIM-based grammar:

ggerganov/whisper.cpp#1144

@x4080 commented Aug 12, 2023

Can we improve the result by fine-tuning the model? If so, is there an example of how to improve it?
TIA

@ejones (Collaborator, Author) commented Aug 14, 2023

@ggerganov agreed! Although I'm not sure when or if I'll be able to contribute that.

@x4080 this approach is independent of the model variant and can be used with a fine-tune. In the comments above there's a demonstration of using WizardLM, for example.

@x4080 commented Aug 14, 2023

@ejones thanks

@RevanthRameshkumar

What is the generation speed on this? How does it compare to unconstrained generation when using llama.cpp?

I am trying to replicate this in Python using an A100 on an 8-bit quantized LLaMA 7B, and it is extremely slow per token (compared to unconstrained generation) due to all the extra encoding and decoding that needs to happen.

@ejones (Collaborator, Author) commented Aug 27, 2023

I've only done CPU inference, but the performance impact has been insignificant for everything I've tried. On the M2 Max I'm seeing about ~0.5 ms / token sampling for unconstrained vs ~6ms with a grammar, with token eval taking about ~70ms for 13b (Q4_K).

I'm not sure if I know enough about GPU programming to meaningfully comment; I know that other folks are working on approaches that are more generic or are GPU- and/or Python-focused. There are some examples of this upthread: https://github.com/Shopify/torch-grammar and https://github.com/normal-computing/outlines.

@ejones (Collaborator, Author) commented Aug 31, 2023

@ggerganov alright, I've done it: ggerganov/whisper.cpp#1229

@kalomaze (Contributor) commented Sep 23, 2023

Is it a natural side effect of the grammar sampling method that I am seeing a significant degradation in tokens per second speed during generation, even with permissive sampling rules?
On 13b, I go from ~20T/s to ~13T/s, which is pretty unfortunate in my eyes, but is it expected? Is it CPU bound potentially?

@Green-Sky (Collaborator)

It tells you the time spent on sampling at the end. You can confirm it there.
eg:

llama_print_timings:        load time =   285,92 ms
llama_print_timings:      sample time =    30,81 ms /    37 runs   (    0,83 ms per token,  1200,95 tokens per second)
llama_print_timings: prompt eval time =   159,23 ms /     5 tokens (   31,85 ms per token,    31,40 tokens per second)
llama_print_timings:        eval time =  3656,10 ms /    36 runs   (  101,56 ms per token,     9,85 tokens per second)
llama_print_timings:       total time =  3861,11 ms

@ejones (Collaborator, Author) commented Sep 26, 2023

Yeah, I generally see about ~5ms/token overhead for grammars on the M2 Max, which is usually a fraction of the per token eval time. But recently I was testing with a grammar and saw a more significant impact. Should investigate, there may be some pathological cases.

lenaxia pushed a commit to lenaxia/home-ops-prod that referenced this pull request on Apr 27, 2024:
…d grammars by including the `messages` field and adjusting the endpoint to `/v1/chat/completions`.

# Aider chat conversation:

USER: https://localai.io/features/constrained_grammars/:


# ✍️ Constrained grammars

The chat endpoint accepts an additional `grammar` parameter which takes a [BNF defined grammar](https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form).

This allows the LLM to constrain the output to a user-defined schema, allowing it to generate `JSON`, `YAML`, and everything that can be defined with a BNF grammar.

Note: this feature works only with models compatible with the [llama.cpp](https://github.com/ggerganov/llama.cpp) backend (see also [Model compatibility](/model-compatibility/)). For details on how it works, see the upstream PRs: <ggerganov/llama.cpp#1773>, <ggerganov/llama.cpp#1887>

## Setup

Follow the setup instructions from the [LocalAI functions](/features/openai-functions/) page.

## 💡 Usage example

For example, to constrain the output to either `yes` or `no`:

    curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
      "model": "gpt-4",
      "messages": [{"role": "user", "content": "Do you like apples?"}],
      "grammar": "root ::= (\"yes\" | \"no\")"
    }'


ASSISTANT: Based on the information provided about LocalAI's constrained grammars feature, it seems that the `handleCompletions` function in the `rag-api.go` file needs to be updated to support constrained grammars when making a request to the LocalAI service. The request payload should include a `grammar` field when it is provided in the `CompletionRequest`.

Here is the diff for the necessary changes:

```diff
--- cluster/apps/home/localai/app/config/rag-api.go
+++ cluster/apps/home/localai/app/config/rag-api.go
@@ ... @@
 func handleCompletions(w http.ResponseWriter, r *http.Request) {
 	atomic.AddUint64(&requestMetrics.CompletionRequests, 1)
 	logRequest(r)

 	var req CompletionRequest
 	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
 		http.Error(w, err.Error(), http.StatusBadRequest)
 		return
 	}
 	defer r.Body.Close()

 	embedding, err := getEmbeddings(req.Prompt)
 	if err != nil {
 		http.Error(w, err.Error(), http.StatusInternalServerError)
 		return
 	}

-	// Generate completion using constrained grammar (if requested)
-	var completion CompletionResponse
-	if req.ConstrainedGrammar != "" {
-		payload := map[string]interface{}{
-			"model":   "gpt-4",
-			"prompt":  req.Prompt,
-			"grammar": req.ConstrainedGrammar,
-			"max_tokens": req.MaxTokens,
-			"temperature": req.Temperature,
-			"top_p": req.TopP,
-		}
-		jsonPayload, err := json.Marshal(payload)
-		if err != nil {
-			http.Error(w, err.Error(), http.StatusInternalServerError)
-			return
-		}
+	// Generate completion using constrained grammar or retrieval-augmented generation (if requested)
+	var completionResp CompletionResponse
+	payload := map[string]interface{}{
+		"model":       "gpt-4",
+		"prompt":      req.Prompt,
+		"max_tokens":  req.MaxTokens,
+		"temperature": req.Temperature,
+		"top_p":       req.TopP,
+	}
+	if req.ConstrainedGrammar != "" {
+		payload["grammar"] = req.ConstrainedGrammar
+	}
+	jsonPayload, err := json.Marshal(payload)
+	if err != nil {
+		http.Error(w, err.Error(), http.StatusInternalServerError)
+		return
+	}

 	localAI := os.Getenv("LOCAL_AI_ENDPOINT")
 	if localAI == "" {
 		localAI = defaultLocalAI
 	}

-	resp, err := http.Post(localAI+"/v1/chat/completions", "application/json", bytes.NewBuffer(jsonPayload))
+	resp, err := http.Post(localAI+"/completions", "application/json", bytes.NewBuffer(jsonPayload))
 	if err != nil {
 		http.Error(w, err.Error(), http.StatusInternalServerError)
 		return
 	}
 	defer resp.Body.Close()

 	if resp.StatusCode != http.StatusOK {
 		body, err := io.ReadAll(resp.Body)
 		if err != nil {
 			http.Error(w, err.Error(), http.StatusInternalServerError)
 			return
 		}
 		http.Error(w, fmt.Sprintf("completion request failed with status code: %d: %s", resp.StatusCode, body), resp.StatusCode)
 		return
 	}

-	var respBody struct {
-		Result CompletionResponse `json:"result"`
-	}
-	if err := json.NewDecoder(resp.Body).Decode(&respBody); err != nil {
+	if err := json.NewDecoder(resp.Body).Decode(&completionResp); err != nil {
 		http.Error(w, err.Error(), http.StatusInternalServerError)
 		return
 	}

-	completion = respBody.Result
+	jsonResp, err := json.Marshal(completionResp)
+	if err != nil {
+		http.Error(w, err.Error(), http.StatusInternalServerError)
+		return
+	}

-	jsonResp, err := json.Marshal(completion)
-	if err != nil {
-		http.Error(w, err.Error(), http.StatusInternalServerError)
-		return
-	}

 	w.Header().Set("Content-Type", "application/json")
 	w.WriteHeader(http.StatusOK)
 	w.Write(jsonResp)
 }
```

This diff updates the `handleCompletions` function to include the `grammar` field in the request payload if it is provided in the `CompletionRequest`. It also corrects the endpoint to `/completions` as per the LocalAI documentation.

USER: UnifiedDiffNoMatch: hunk failed to apply!

cluster/apps/home/localai/app/config/rag-api.go does not contain lines that match the diff you provided!
Try again.
DO NOT skip blank lines, comments, docstrings, etc!
The diff needs to apply cleanly to the lines in cluster/apps/home/localai/app/config/rag-api.go!

cluster/apps/home/localai/app/config/rag-api.go does not contain these 76 exact lines in a row:
```
func handleCompletions(w http.ResponseWriter, r *http.Request) {
	atomic.AddUint64(&requestMetrics.CompletionRequests, 1)
	logRequest(r)

	var req CompletionRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	defer r.Body.Close()

	embedding, err := getEmbeddings(req.Prompt)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}

	// Generate completion using constrained grammar (if requested)
	var completion CompletionResponse
	if req.ConstrainedGrammar != "" {
		payload := map[string]interface{}{
			"model":   "gpt-4",
			"prompt":  req.Prompt,
			"grammar": req.ConstrainedGrammar,
			"max_tokens": req.MaxTokens,
			"temperature": req.Temperature,
			"top_p": req.TopP,
		}
		jsonPayload, err := json.Marshal(payload)
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}

	localAI := os.Getenv("LOCAL_AI_ENDPOINT")
	if localAI == "" {
		localAI = defaultLocalAI
	}

	resp, err := http.Post(localAI+"/v1/chat/completions", "application/json", bytes.NewBuffer(jsonPayload))
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		body, err := io.ReadAll(resp.Body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		http.Error(w, fmt.Sprintf("completion request failed with status code: %d: %s", resp.StatusCode, body), resp.StatusCode)
		return
	}

	var respBody struct {
		Result CompletionResponse `json:"result"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&respBody); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}

	completion = respBody.Result

	jsonResp, err := json.Marshal(completion)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}

	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(http.StatusOK)
	w.Write(jsonResp)
}
```

ASSISTANT: I apologize for the confusion. Let's try again with the correct diff that applies cleanly to the `cluster/apps/home/localai/app/config/rag-api.go` file. I will update the `handleCompletions` function to correctly handle the constrained grammars feature as described in the LocalAI documentation.

```diff
--- cluster/apps/home/localai/app/config/rag-api.go
+++ cluster/apps/home/localai/app/config/rag-api.go
@@ ... @@
 func handleCompletions(w http.ResponseWriter, r *http.Request) {
 	atomic.AddUint64(&requestMetrics.CompletionRequests, 1)
 	logRequest(r)

 	var req CompletionRequest
 	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
 		http.Error(w, err.Error(), http.StatusBadRequest)
 		return
 	}
 	defer r.Body.Close()

 	// Generate completion using constrained grammar (if requested)
 	var completion CompletionResponse
 	if req.ConstrainedGrammar != "" {
 		payload := map[string]interface{}{
 			"model":   "gpt-4",
 			"prompt":  req.Prompt,
-			"grammar": req.ConstrainedGrammar,
+			"grammar": req.ConstrainedGrammar,
 			"max_tokens": req.MaxTokens,
 			"temperature": req.Temperature,
 			"top_p": req.TopP,
 		}
 		jsonPayload, err := json.Marshal(payload)
 		if err != nil {
 			http.Error(w, err.Error(), http.StatusInternalServerError)
 			return
 		}

 		localAI := os.Getenv("LOCAL_AI_ENDPOINT")
 		if localAI == "" {
 			localAI = defaultLocalAI
 		}

-		resp, err := http.Post(localAI+"/v1/chat/completions", "application/json", bytes.NewBuffer(jsonPayload))
+		resp, err := http.Post(localAI+"/completions", "application/json", bytes.NewBuffer(jsonPayload))
 		if err != nil {
 			http.Error(w, err.Error(), http.StatusInternalServerError)
 			return
 		}
 		defer resp.Body.Close()

 		if resp.StatusCode != http.StatusOK {
 			body, err := io.ReadAll(resp.Body)
 			if err != nil {
 				http.Error(w, err.Error(), http.StatusInternalServerError)
 				return
 			}
 			http.Error(w, fmt.Sprintf("completion request failed with status code: %d: %s", resp.StatusCode, body), resp.StatusCode)
 			return
 		}

 		var respBody struct {
 			Result CompletionResponse `json:"result"`
 		}
 		if err := json.NewDecoder(resp.Body).Decode(&respBody); err != nil {
 			http.Error(w, err.Error(), http.StatusInternalServerError)
 			return
 		}

 		completion = respBody.Result
 	} else {
 		// Implement retrieval-augmented generation using retrieved data
 		// ...
 	}

 	jsonResp, err := json.Marshal(completion)
 	if err != nil {
 		http.Error(w, err.Error(), http.StatusInternalServerError)
 		return
 	}

 	w.Header().Set
^C KeyboardInterrupt

USER: https://localai.io/features/constrained_grammars/:


ASSISTANT: To implement the constrained grammars feature as described in the LocalAI documentation, we need to update the `handleCompletions` function in the `rag-api.go` file to correctly handle the `grammar` parameter in the request payload. Here is the diff for the necessary changes:

```diff
--- cluster/apps/home/localai/app/config/rag-api.go
+++ cluster/apps/home/localai/app/config/rag-api.go
@@ ... @@
 func handleCompletions(w http.ResponseWriter, r *http.Request) {
 	atomic.AddUint64(&requestMetrics.CompletionRequests, 1)
 	logRequest(r)

 	var req CompletionRequest
 	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
 		http.Error(w, err.Error(), http.StatusBadRequest)
 		return
 	}
 	defer r.Body.Close()

 	// Generate completion using constrained grammar (if requested)
 	var completion CompletionResponse
 	if req.ConstrainedGrammar != "" {
 		payload := map[string]interface{}{
 			"model":   "gpt-4",
 			"prompt":  req.Prompt,
+			"messages": []map[string]string{
+				{
+					"role":    "user",
+					"content": req.Prompt,
+				},
+			},
 			"max_tokens": req.MaxTokens,
 			"temperature": req.Temperature,
 			"top_p": req.TopP,
+			"grammar": req.ConstrainedGrammar,
 		}
 		jsonPayload, err := json.Marshal(payload)
 		if err != nil {
 			http.Error(w, err.Error(), http.StatusInternalServerError)
 			return
 		}

 		localAI := os.Getenv("LOCAL_AI_ENDPOINT")
 		if localAI == "" {
 			localAI = defaultLocalAI
 		}

-		resp, err := http.Post(localAI+"/completions", "application/json", bytes.NewBuffer(jsonPayload))
+		resp, err := http.Post(localAI+"/v1/chat/completions", "application/json", bytes.NewBuffer(jsonPayload))
 		if err != nil {
 			http.Error(w, err.Error(), http.StatusInternalServerError)
 			return
 		}
 		defer resp.Body.Close()

 		if resp.StatusCode != http.StatusOK {
 			body, err := io.ReadAll(resp.Body)
 			if err != nil {
 				http.Error(w, err.Error(), http.StatusInternalServerError)
 				return
 			}
 			http.Error(w, fmt.Sprintf("completion request failed with status code: %d: %s", resp.StatusCode, body), resp.StatusCode)
 			return
 		}

 		var respBody struct {
 			Result CompletionResponse `json:"result"`
 		}
 		if err := json.NewDecoder(resp.Body).Decode(&respBody); err != nil {
 			http.Error(w, err.Error(), http.StatusInternalServerError)
 			return
 		}

 		completion = respBody.Result
 	} else {
 		// Implement retrieval-augmented generation using retrieved data
 		// ...
 	}

 	jsonResp, err := json.Marshal(completion)
 	if err != nil {
 		http.Error(w, err.Error(), http.StatusInternalServerError)
 		return
 	}

 	w.Header().Set("Content-Type", "application/json")
 	w.WriteHeader(http.StatusOK)
 	w.Write(jsonResp)
 }
```

This diff adds the `messages` field to the payload as required by the LocalAI API for constrained grammars and updates the endpoint to `/v1/chat/completions`. It also includes the `grammar` field when the `ConstrainedGrammar` is not empty.
Labels: enhancement, generation quality, high priority