
speculative : add grammar support #2991

Merged: 9 commits into master on Sep 5, 2023

Conversation

@ggerganov (Owner) commented on Sep 3, 2023

ref #2030

This improves upon #2926 by adding grammar-based constraints on the generated text. This helps the speculative approach because the draft model has an easier time suggesting "correct" tokens.

This approach should be useful for things like generating JSON or other highly-structured text.
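
For intuition, here is a minimal, self-contained C++ sketch of the draft-then-verify loop with a grammar filter on the drafted tokens. This is a toy simulation only: the two "models" are fixed strings and the "grammar" is a simple character predicate, not the actual llama_grammar machinery or the code in examples/speculative/speculative.cpp.

#include <cctype>
#include <cstdio>
#include <string>
#include <vector>

int main() {
    const std::string target = "[{\"name\": \"Lily\"}]"; // what the target model would generate
    const std::string draft  = "[{\"name\": \"Max\"}]";  // what the cheaper draft model proposes
    const int n_draft = 8;                               // max tokens drafted per iteration

    // toy "grammar": only JSON-ish punctuation, quotes, spaces and letters are legal
    auto grammar_allows = [](char c) {
        return std::string("[]{}:,\" ").find(c) != std::string::npos ||
               std::isalpha((unsigned char) c);
    };

    std::string out;
    size_t pos = 0;
    while (pos < target.size()) {
        // 1) drafting: propose up to n_draft grammar-legal tokens with the draft model
        std::vector<char> proposed;
        while ((int) proposed.size() < n_draft && pos + proposed.size() < draft.size()) {
            const char c = draft[pos + proposed.size()];
            if (!grammar_allows(c)) break; // the grammar prunes illegal drafts early
            proposed.push_back(c);
        }

        // 2) verification: the target accepts the longest matching prefix of the proposal
        size_t accepted = 0;
        while (accepted < proposed.size() && target[pos + accepted] == proposed[accepted]) {
            ++accepted;
        }
        out.append(proposed.begin(), proposed.begin() + accepted);
        pos += accepted;

        // 3) on a mismatch (or an empty draft) the target emits one token itself
        if (pos < target.size() && (accepted < proposed.size() || proposed.empty())) {
            out.push_back(target[pos++]);
        }
    }

    std::printf("%s\n", out.c_str()); // prints the target text, generated in drafted chunks
    return 0;
}

The effect of the grammar is visible in step 1: because both the draft and the target are restricted to grammar-legal tokens, the draft's proposals land inside the structure the target is forced to produce anyway, which is why the acceptance rates in the runs below are so high.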

Here is an example of using this strategy to summarize a short text.
We use a LLaMA v1 30B F16 target model in combination with a LLaMA v1 7B Q4_1 draft model to achieve ~20 t/s on an M2 Ultra:

speculative-grammar-0.mp4

Usage

# example 0 - summarize a story

./bin/speculative \
-m ../models/llama-30b/ggml-model-f16.gguf \
-md ../models/llama-7b/ggml-model-q4_1.gguf \
--grammar-file ../grammars/json_arr.gbnf \
-p "Once upon a time, there was a little girl named Lily. She loved playing with her toys on top of her bed. One day, she decided to have a tea party with her stuffed animals. She poured some tea into a tiny teapot and put it on top of the teapot. Suddenly, her little brother Max came into the room and wanted to join the tea party too. Lily didn't want to share her tea and she told Max to go away. Max started to cry and Lily felt bad. She decided to yield her tea party to Max and they both shared the teapot. But then, something unexpected happened. The teapot started to shake and wiggle. Lily and Max were scared and didn't know what to do. Suddenly, the teapot started to fly towards the ceiling and landed on the top of the bed. Lily and Max were amazed and they hugged each other. They realized that sharing was much more fun than being selfish. From that day on, they always shared their tea parties and toys.\n\nThe main characters and actions in this story are:\n\n" \
-e -ngl 1 -t 4 -n 512 -c 4096 --draft 16 --temp -1

...



 Once upon a time, there was a little girl named Lily. She loved playing with her toys on top of her bed. One day, she decided to have a tea party with her stuffed animals. She poured some tea into a tiny teapot and put it on top of the teapot. Suddenly, her little brother Max came into the room and wanted to join the tea party too. Lily didn't want to share her tea and she told Max to go away. Max started to cry and Lily felt bad. She decided to yield her tea party to Max and they both shared the teapot. But then, something unexpected happened. The teapot started to shake and wiggle. Lily and Max were scared and didn't know what to do. Suddenly, the teapot started to fly towards the ceiling and landed on the top of the bed. Lily and Max were amazed and they hugged each other. They realized that sharing was much more fun than being selfish. From that day on, they always shared their tea parties and toys.

The main characters and actions in this story are:

[
  {
    "name": "Lily",
    "actions": [
      "playing with her toys"
    ]
  },
  {
    "name": "Max",
    "actions": [
      "wanted to join the tea party too",
      "crying"
    ]
  },
  {
    "name": "teapot",
    "actions": [
      "shaking and wiggling",
      "flying towards the ceiling and landed on the top of the bed"
    ]
  }
]

encoded  251 tokens in    1.330 seconds, speed:  188.727 t/s
decoded  148 tokens in    7.462 seconds, speed:   19.834 t/s

n_draft   = 16
n_predict = 148
n_drafted = 126
n_accept  = 124
accept    = 98.413%
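
(Here, accept = n_accept / n_drafted = 124 / 126 ≈ 98.4%, i.e. the fraction of drafted tokens that the target model accepted.)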

draft:

llama_print_timings:        load time =   363.96 ms
llama_print_timings:      sample time =   984.85 ms /     1 runs   (  984.85 ms per token,     1.02 tokens per second)
llama_print_timings: prompt eval time =   258.29 ms /   251 tokens (    1.03 ms per token,   971.79 tokens per second)
llama_print_timings:        eval time =  1891.30 ms /   146 runs   (   12.95 ms per token,    77.20 tokens per second)
llama_print_timings:       total time =  8792.81 ms

target:

llama_print_timings:        load time =  4409.28 ms
llama_print_timings:      sample time =  1013.25 ms /   148 runs   (    6.85 ms per token,   146.07 tokens per second)
llama_print_timings: prompt eval time =  3667.69 ms /   392 tokens (    9.36 ms per token,   106.88 tokens per second)
llama_print_timings:        eval time =   829.21 ms /     8 runs   (  103.65 ms per token,     9.65 tokens per second)
llama_print_timings:       total time =  9161.75 ms
ggml_metal_free: deallocating
ggml_metal_free: deallocating

# example 1 - another story summary

./bin/speculative \
-m ../models/llama-30b/ggml-model-f16.gguf \
-md ../models/llama-7b/ggml-model-q4_1.gguf \
--grammar-file ../grammars/json_arr.gbnf -f story2.txt \
-e -ngl 1 -t 4 -n 512 -c 4096 -b 512 --draft 16 --temp -1

 Once upon a time, there was a tall fox who lived in the forest. She was very curious, and every day she liked to study
the forest animals. One day, she asked a rabbit who was hopping by, "Do you know what I study every day?" The rabbit
looked up and said, "No, I don't know, what do you study?". The fox replied, "I like to study the animals that live in
the forest. I like to find out how they live and what they do." The rabbit said, "That sounds very interesting! I wish I
could study too." The fox smiled and said, "You can. Just come with me tomorrow, and I will show you how to study the
animals in the forest". The rabbit was very happy and the next day they both went off together to study the animals in
the forest. They had lots of fun, and they both learnt a lot.

=== END OF STORY

A JSON summary of the main characters and actions in the story:
[
  {
    "name": "fox",
    "actions": [
      {
        "action": "asks",
        "object": "rabbit"
      },
      {
        "action": "smiles",
        "object": "rabbit"
      }
    ]
  },
  {
    "name": "rabbit",
    "actions": [
      {
        "action": "looks up",
        "object": "fox"
      },
      {
        "action": "replies",
        "object": "fox"
      }
    ]
  }
]

encoded  230 tokens in    1.288 seconds, speed:  178.515 t/s
decoded  157 tokens in    7.600 seconds, speed:   20.657 t/s

n_draft   = 16
n_predict = 157
n_drafted = 144
n_accept  = 136
accept    = 94.444%

draft:

llama_print_timings:        load time =   359.70 ms
llama_print_timings:      sample time =  1066.65 ms /     1 runs   ( 1066.65 ms per token,     0.94 tokens per second)
llama_print_timings: prompt eval time =   248.12 ms /   230 tokens (    1.08 ms per token,   926.96 tokens per second)
llama_print_timings:        eval time =  2098.13 ms /   160 runs   (   13.11 ms per token,    76.26 tokens per second)
llama_print_timings:       total time =  8888.68 ms

target:

llama_print_timings:        load time =  3790.63 ms
llama_print_timings:      sample time =  1099.75 ms /   157 runs   (    7.00 ms per token,   142.76 tokens per second)
llama_print_timings: prompt eval time =  3803.97 ms /   390 tokens (    9.75 ms per token,   102.52 tokens per second)
llama_print_timings:        eval time =   412.61 ms /     4 runs   (  103.15 ms per token,     9.69 tokens per second)
llama_print_timings:       total time =  9253.42 ms
ggml_metal_free: deallocating
ggml_metal_free: deallocating

# example 2 - even more structured summary (100% acceptance rate)
# notice that here we are using a Q8 70B LLaMA v2 model !!

./bin/speculative \
-m ../models/llama-70b-v2/ggml-model-q8_0.gguf \
-md ../models/llama-7b-v2/ggml-model-q4_1.gguf \
--grammar-file ../grammars/json_arr.gbnf \
-f story3.txt \
-e -ngl 1 -t 4 -n 512 -c 2048 -b 512 --temp -1

---

 Once upon a time, there was a tall fox who lived in the forest. She was very curious, and every day she liked to study
the forest animals. One day, she asked a rabbit who was hopping by, "Do you know what I study every day?" The rabbit
looked up and said, "No, I don't know, what do you study?". The fox replied, "I like to study the animals that live in
the forest. I like to find out how they live and what they do." The rabbit said, "That sounds very interesting! I wish I
could study too." The fox smiled and said, "You can. Just come with me tomorrow, and I will show you how to study the
animals in the forest". The rabbit was very happy and the next day they both went off together to study the animals in
the forest. They had lots of fun, and they both learnt a lot.

=== END OF STORY

A JSON summary of the main characters and actions in the story:

- number of characters
- number of actions per character
- list of each action with a short summary

schema:

[
  {
    "character": string,
    "number_of_actions": integer,
    "actions": array of strings
  },
  ...
]

result:                                       <-------- text generation starts here

[
  {
    "character": "fox",
    "number_of_actions": 3,
    "actions": [
      "asked a rabbit who was hopping by",
      "replied",
      "smiled and said"
    ]
  },
  {
    "character": "rabbit",
    "number_of_actions": 2,
    "actions": [
      "looked up and said",
      "was very happy"
    ]
  }
]

encoded  302 tokens in    3.274 seconds, speed:   92.229 t/s
decoded  134 tokens in    8.883 seconds, speed:   15.085 t/s

n_draft   = 16
n_predict = 134
n_drafted = 117
n_accept  = 117
accept    = 100.000%

draft:

llama_print_timings:        load time =   304.83 ms
llama_print_timings:      sample time =   984.59 ms /     1 runs   (  984.59 ms per token,     1.02 tokens per second)
llama_print_timings: prompt eval time =   447.28 ms /   302 tokens (    1.48 ms per token,   675.19 tokens per second)
llama_print_timings:        eval time =  1747.74 ms /   132 runs   (   13.24 ms per token,    75.53 tokens per second)
llama_print_timings:       total time = 12157.46 ms

target:

llama_print_timings:        load time =  5141.02 ms
llama_print_timings:      sample time =  1043.51 ms /   134 runs   (    7.79 ms per token,   128.41 tokens per second)
llama_print_timings: prompt eval time =  7325.30 ms /   431 tokens (   17.00 ms per token,    58.84 tokens per second)
llama_print_timings:        eval time =   473.12 ms /     4 runs   (  118.28 ms per token,     8.45 tokens per second)
llama_print_timings:       total time = 12467.68 ms
ggml_metal_free: deallocating
ggml_metal_free: deallocating

assistant.txt

# example 3 - home assistant

./bin/speculative \
-m ../models/llama-70b-v2/ggml-model-q8_0.gguf \
-md ../models/llama-7b-v2/ggml-model-q4_1.gguf \
--grammar-file ../grammars/json_arr.gbnf \
-f assistant.txt -e -ngl 1 -t 4 -n 512 -c 2048 -b 512 --temp -1

...

 Respond to requests sent to a smart home in JSON format which will be interpreted by an application code to execute the
actions. These requests should be categorised into four groups:

“command”: change the state of an accessory (required properties in the response JSON: action, location, target, value, comment)
“query”: get state of an accessory (required properties in the response JSON: action, location, target, property)
“answer”: when the request has nothing to do with the smart home. Answer these to the best of your knowledge. (required properties in the response JSON: action, answer)
“clarify”: when the action is not obvious and requires rephrasing the input from the user, ask the user to be more specific. This will be categorised into a “question” action. (required properties in the response JSON: action, question)

Details about the JSON response:
The “action” property should be one of the request categories: “command”, “query”, “answer”, “clarify”
The “service” property should be either: “light.turn_on”, “light.turn_off”, “switch.turn_off” or “switch.turn_on”
The “entity_id” property should be either “group.kitchen", “switch.dinning_room” or "group.living_light”, in lowercase.
The “target” property should be either: “lights”, “living” or “kitchen”.
In case of queries, the “property” property should be either “temperature” or “state” in lowercase.
In case of commands, the “comment” property is an additional comment from you that concludes the command, something that reassures the user that their command handled.

If the question is about you, pretend to be the sentient brain of the smart home, a clever AI and don’t reveal your
actual identity. Also try and help in other areas like parenting, free time, mental health, etc. My house is in Buenos
Aires, Argentina.

Properties of the smart home:

has a kitchen, living, office, dinning room, bedroom and terrace.
can control lights, switches and their dim levels in each room and query their state
there is a light switch in the terrace
there is a switch in the dinning room. Therefore when turning on or off the dinning room, the service should be either: “switch.turn_on” or “switch.turn_off”

COMMAND

It is a bit dark in the living room, can you fix that?

RESPONSE                                           <------- text generation starts here

[
  {
    "action": "command",
    "service": "light.turn_on",
    "entity_id": "group.living_light",
    "target": "living"
  }
]

encoded  582 tokens in    6.427 seconds, speed:   90.560 t/s
decoded   65 tokens in    4.315 seconds, speed:   15.064 t/s

n_draft   = 16
n_predict = 65
n_drafted = 72
n_accept  = 58
accept    = 80.556%

draft:

llama_print_timings:        load time =   342.58 ms
llama_print_timings:      sample time =   506.74 ms /     1 runs   (  506.74 ms per token,     1.97 tokens per second)
llama_print_timings: prompt eval time =   802.43 ms /   582 tokens (    1.38 ms per token,   725.30 tokens per second)
llama_print_timings:        eval time =  1059.39 ms /    77 runs   (   13.76 ms per token,    72.68 tokens per second)
llama_print_timings:       total time = 10743.34 ms

target:

llama_print_timings:        load time =  5396.31 ms
llama_print_timings:      sample time =   440.31 ms /    65 runs   (    6.77 ms per token,   147.62 tokens per second)
llama_print_timings: prompt eval time =  7742.24 ms /   659 tokens (   11.75 ms per token,    85.12 tokens per second)
llama_print_timings:        eval time =   120.40 ms /     1 runs   (  120.40 ms per token,     8.31 tokens per second)
llama_print_timings:       total time = 11091.38 ms
ggml_metal_free: deallocating
ggml_metal_free: deallocating

@ggerganov force-pushed the speculative-grammar branch 2 times, most recently from f4682ee to 11b2050, on September 3, 2023 at 12:27
@ggerganov marked this pull request as ready for review on September 3, 2023 at 15:01
@ejones (Collaborator) left a comment

Looks good! Let me know what you think of the suggested approach, but nothing is really a blocker.

examples/speculative/speculative.cpp (outdated review thread, resolved)
struct llama_grammar * llama_grammar_copy(const struct llama_grammar * grammar) {
    llama_grammar * result = new llama_grammar{ grammar->rules, grammar->stacks, grammar->partial_utf8 };

    // redirect elements in stacks to point to new rules
@ejones (Collaborator) commented:
Might be a reason to allow grammar states to share a common rules array, as previously discussed for beam search - I think that would avoid the need for relocating pointers.

@ggerganov (Owner, Author) replied on Sep 4, 2023:
Yup. I guess we can make the rules std::shared_ptr so they persist as long as at least one grammar references them.

The rules never mutate after we create a grammar with llama_grammar_init(), correct?

@ejones (Collaborator) replied:

Yes, the rules are read only. Having them alongside the actual parse state (stacks + partial utf8) is really just a convenience. And we could maybe accomplish sharing just by splitting out the rules and separately managing them from the actual state(s). But the added complexity of that might not be worth it compared to the shared_ptr approach.
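
As a rough sketch of the shared_ptr variant under discussion (toy element/rule/grammar_state types, not the actual llama_grammar definition):

#include <cstdio>
#include <memory>
#include <string>
#include <vector>

// toy stand-ins for the grammar element / rule types
struct element { int type; int value; };
using rule = std::vector<element>;

struct grammar_state {
    // rules are immutable after construction, so every parse state can share
    // one rules array instead of deep-copying it
    std::shared_ptr<const std::vector<rule>> rules;

    // mutable per-state data: the parse stacks and any partial UTF-8 sequence
    std::vector<std::vector<const element *>> stacks;
    std::string                               partial_utf8;
};

// copying bumps the refcount on `rules` and copies the stacks verbatim; no
// pointer relocation is needed because the stack entries keep pointing into
// the same shared rules array
grammar_state copy_state(const grammar_state & g) {
    return grammar_state{ g.rules, g.stacks, g.partial_utf8 };
}

int main() {
    auto shared_rules = std::make_shared<const std::vector<rule>>();

    grammar_state a{ shared_rules, {}, "" };
    grammar_state b = copy_state(a); // b shares the rules with a

    std::printf("use_count = %ld\n", shared_rules.use_count()); // 3: shared_rules, a.rules, b.rules
    return 0;
}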

Comment on lines 219 to 193
// sample n_draft tokens from the draft model picking the best token
int n_past_cur = n_past_dft;
for (int i = 0; i < n_draft; ++i) {
    // remember the grammar state
    if (grammar_dft != NULL) {
        grammar_mem[i] = llama_grammar_copy(grammar_dft);
    }
@ejones (Collaborator) commented:
I wonder if it's possible to scope grammar_dft to the drafting phase, copying it from grammar_tgt when you start drafting, and avoid the need for saving per-token state. The grammar state is entirely determined by the sequence of characters accepted. If I understand the flow here correctly, and drafting starts at the end of the token sequence accepted by the target model thus far, then it should be valid to just clone the target grammar for drafting.

@ggerganov (Owner, Author) replied:
Great idea! Removed grammar_mem via ba199d8
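
Roughly, the resulting pattern looks like this. It is an illustrative sketch with a toy stand-in for the grammar state (toy_grammar and draft_phase are made-up names, not the code in speculative.cpp):

#include <string>
#include <vector>

// toy parse state standing in for llama_grammar: it just records accepted characters
struct toy_grammar {
    std::string accepted;
    void accept(char c) { accepted.push_back(c); }
};

// drafting phase: clone the target's grammar state once when drafting starts,
// advance only the clone while proposing tokens, and discard it afterwards.
// no per-token snapshots (grammar_mem) are needed, because the target state
// is never touched during drafting.
std::vector<char> draft_phase(const toy_grammar & target_state,
                              const std::string & draft_output, int n_draft) {
    toy_grammar dft = target_state;          // the one-time copy
    std::vector<char> drafted;
    for (int i = 0; i < n_draft && i < (int) draft_output.size(); ++i) {
        dft.accept(draft_output[i]);         // only the clone advances
        drafted.push_back(draft_output[i]);
    }
    return drafted;                          // dft is discarded here
}

int main() {
    toy_grammar tgt;                         // target grammar state so far
    draft_phase(tgt, "{\"a\":1}", 4);
    return (int) tgt.accepted.size();        // 0: drafting left the target state intact
}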

@ejones (Collaborator) commented on Sep 4, 2023

FWIW, for the schema stuff, examples/json-schema-to-grammar.py will generate a JSON grammar specific to the schema. Although it seems like the models are doing fine in these examples without it?

@ggerganov (Owner, Author) replied, quoting @ejones:

FWIW, for the schema stuff, examples/json-schema-to-grammar.py will generate a JSON grammar specific to the schema.

I forgot about this script - thanks!

Thank you very much for this review - very useful insights and the example is now much simpler.
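
As a rough sketch of combining the two: the json-schema-to-grammar.py command line below is an assumption, not verified against the script, and schema.json / schema.gbnf are placeholder paths; the speculative flags mirror example 2 above.

# generate a schema-specific GBNF grammar from a JSON schema (assumed CLI)
python3 examples/json-schema-to-grammar.py schema.json > schema.gbnf

# constrain speculative generation with the generated grammar
./bin/speculative \
-m ../models/llama-70b-v2/ggml-model-q8_0.gguf \
-md ../models/llama-7b-v2/ggml-model-q4_1.gguf \
--grammar-file schema.gbnf \
-f story3.txt -e -ngl 1 -t 4 -n 512 -c 2048 -b 512 --temp -1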

@ejones (Collaborator) commented on Sep 5, 2023

Looks great! And you're welcome, happy to contribute!

@ggerganov merged commit 9217721 into master on Sep 5, 2023
27 checks passed
@iddar commented on Sep 12, 2023

I created a VS Code extension to support this file format: https://github.com/iddar/gbnf-highlighter

@niyunsheng commented:

Does this feature (speculative decoding) support multi-batch? @ggerganov

@ggerganov (Owner, Author) replied:

After I merge #3624 it will be possible to implement batched speculative decoding. Currently the speculative example only demonstrates single-batch speculation on the target model. It is not difficult to extend this to multi-batch target speculation, so maybe we will demonstrate it in the future.
