
Grammar optimization: eliminate redundant grammar trees (~4x faster grammar sampling) #6616

Merged 1 commit on Apr 12, 2024

Conversation

@HanClinto (Collaborator) commented Apr 11, 2024

tl;dr

~4x-5x speedup on processing complex grammars.

Previously discussed in #4218 (comment)

Motivation:

The grammar stacks have a tendency to explode exponentially in the case of redundant ambiguities (see #4218 (comment) ). In these cases, the grammar engine winds up duplicating effort by evaluating the feasibility of multiple redundant copies of grammars, even though they are equivalent.
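The blow-up can be illustrated with a toy model: expanding a chain of rules where each rule has two equivalent alternatives. Without de-duplication, each ambiguous rule doubles the number of live stacks. This is an illustrative sketch only, not the real llama.cpp stack representation.

```cpp
#include <algorithm>
#include <vector>

// Toy model: expand n chained ambiguous rules where both alternatives are
// identical. Without de-duplication each rule doubles the number of live
// stacks (2^n total); with it, the count stays flat at 1.
static size_t count_stacks(int n_rules, bool dedup) {
    std::vector<std::vector<int>> stacks = {{}};
    for (int r = 0; r < n_rules; r++) {
        std::vector<std::vector<int>> next;
        for (const auto & s : stacks) {
            for (int alt = 0; alt < 2; alt++) {  // two equivalent alternatives
                std::vector<int> cand = s;
                cand.push_back(r);               // both alts yield the same state
                if (!dedup || std::find(next.begin(), next.end(), cand) == next.end()) {
                    next.push_back(std::move(cand));
                }
            }
        }
        stacks = std::move(next);
    }
    return stacks.size();
}
```

With ten chained rules, the undeduplicated model tracks 1024 stacks while the deduplicated one tracks a single stack.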

Fix:

After each token, the grammar engine rebuilds the stacks representing the possible directions the parser could take. When building these stacks, this change takes care not to add a new stack if an identical one is already present. Although std::find() is an O(N) operation, the savings from not expanding entire trees of redundant grammar stacks vastly outweigh the cost of this check.
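The shape of the check can be sketched as follows. Type names here are simplified stand-ins, not the real llama.cpp API:

```cpp
#include <algorithm>
#include <vector>

// Simplified stand-ins: in llama.cpp a "stack" is a vector of pointers to
// grammar elements; an int element is used here purely for illustration.
using llama_grammar_element = int;
using grammar_stack = std::vector<const llama_grammar_element *>;

// Only push a newly built stack if an identical one is not already present.
// std::find is O(N) per insertion, but it prevents exponential duplication
// of equivalent stacks across subsequent expansion steps.
static void push_unique(std::vector<grammar_stack> & stacks,
                        grammar_stack && candidate) {
    if (std::find(stacks.begin(), stacks.end(), candidate) == stacks.end()) {
        stacks.push_back(std::move(candidate));
    }
}
```

Because `std::vector` provides element-wise `operator==`, two stacks compare equal exactly when they hold the same element pointers in the same order.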

Results:

Integration Benchmark

I expanded the grammar integration tests to add some crude timing metrics.
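The timing wrapper amounts to something like the following (an illustrative sketch, not the exact code from test-grammar-integration):

```cpp
#include <chrono>

// Crude timing helper: run a callable once and report elapsed microseconds.
// steady_clock is used because it is monotonic (immune to wall-clock jumps).
template <typename Fn>
static long long time_us(Fn && fn) {
    const auto start = std::chrono::steady_clock::now();
    fn();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
}
```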

Before:

```
./tests/test-grammar-integration
Expected error:  parse: error parsing grammar: Undefined rule identifier 'numero'
End of expected error. Test successful.

Timings:
Simple grammar: 18 us
Complex grammar: 939 us
Chained ambiguity: 238687 us
Chained ambiguity (grouped): 45 us
Failure missing root: 5 us
Failure missing reference: 188 us
```

After:

```
./tests/test-grammar-integration
Expected error:  parse: error parsing grammar: Undefined rule identifier 'numero'
End of expected error. Test successful.

Timings:
Simple grammar: 38 us
Complex grammar: 1032 us
Chained ambiguity: 71 us
Chained ambiguity (grouped): 19 us
Failure missing root: 4 us
Failure missing reference: 156 us
```

Note the significant improvement in the chained ambiguity case, which is what this PR targets most directly. The other timing differences hover around the noise floor and are not consistently faster or slower in either direction.

"Full" benchmark

Thanks to @ochafik for teaching me how to use Hyperfine, here are the results of the benchmark cribbed from #6609, showing a ~9x speedup vs. master:

Hyperfine Results
```
Benchmark 1: ./main \
        -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
        --grammar-file json_numbers.grammar \
        -p "List of 20 integers starting from 0" \
        --seed 12344 (branch = master)
  Time (mean ± σ):     19.028 s ±  0.887 s    [User: 11.001 s, System: 5.890 s]
  Range (min … max):   18.092 s … 20.205 s    5 runs

Benchmark 2: ./main \
        -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
        --grammar-file json_numbers.grammar \
        -p "List of 20 integers starting from 0" \
        --seed 12344 (branch = gbnf-optimize-ambiguity2)
  Time (mean ± σ):      2.095 s ±  0.051 s    [User: 0.861 s, System: 0.143 s]
  Range (min … max):    2.058 s …  2.175 s    5 runs

Summary
  ./main \
        -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
        --grammar-file json_numbers.grammar \
        -p "List of 20 integers starting from 0" \
        --seed 12344 (branch = gbnf-optimize-ambiguity2) ran
    9.08 ± 0.48 times faster than ./main \
        -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
        --grammar-file json_numbers.grammar \
        -p "List of 20 integers starting from 0" \
        --seed 12344 (branch = master)
```

It's interesting that even though this test's grammar is written to avoid redundant ambiguities, this PR still delivers a quite significant improvement.

github-actions bot (Contributor) commented Apr 11, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 448 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=10513.65ms p(95)=27209.89ms fails=, finish reason: stop=396 truncated=52
  • Prompt processing (pp): avg=116.48tk/s p(95)=513.75tk/s
  • Token generation (tg): avg=28.08tk/s p(95)=36.3tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=gbnf-optimize-ambiguity2 commit=80553e53c075a2d4e1936735eb06368eba5f599d

[Charts omitted: prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, and requests_processing for llama.cpp bench-server-baseline on Standard_NC4as_T4_v3 (duration=10m, 448 iterations).]

@HanClinto (Collaborator, Author) commented Apr 11, 2024

I was getting cross-platform build errors on the new benchmark reporting that I put into test-grammar-integration, so I removed it for now. The tests (along with my "cool" benchmarking utility functions) are still available to view in 490d06f, and maybe I'll figure out the %lld vs. %ld fprintf() issue at some point and bring it back in.
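For what it's worth, the usual portable fix for the %lld vs. %ld mismatch is the PRId64 macro from `<cinttypes>`, since int64_t is `long` on some platforms and `long long` on others. A minimal sketch (the `format_us` helper is hypothetical, not code from this PR):

```cpp
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <string>

// Format a microsecond count portably. Neither %ld nor %lld is safe for
// int64_t on every platform; PRId64 expands to the correct specifier.
static std::string format_us(int64_t us) {
    char buf[64];
    std::snprintf(buf, sizeof(buf), "%" PRId64 " us", us);
    return std::string(buf);
}
```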

Regardless, taking out the changes to the integration tests makes this a much cleaner change, and hopefully easier to review.

Also rebased on top of latest master to take advantage of speed improvements in #6609, and the speed improvements from this PR are still there. Benching with 10 iterations showed a roughly 4x-5x speedup:

`gbnf-optimize-ambiguity2` ran 5.41 ± 0.23 times faster than `master`
```
Benchmark 1: ./main \
        -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
        --grammar-file json_numbers.grammar \
        -p "List of 20 integers starting from 0" \
        --seed 12344 (branch = master)
  Time (mean ± σ):     11.198 s ±  0.372 s    [User: 8.113 s, System: 1.386 s]
  Range (min … max):   10.675 s … 11.607 s    10 runs

Benchmark 2: ./main \
        -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
        --grammar-file json_numbers.grammar \
        -p "List of 20 integers starting from 0" \
        --seed 12344 (branch = gbnf-optimize-ambiguity2)
  Time (mean ± σ):      2.070 s ±  0.053 s    [User: 0.845 s, System: 0.119 s]
  Range (min … max):    2.026 s …  2.213 s    10 runs

Summary
  ./main \
        -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
        --grammar-file json_numbers.grammar \
        -p "List of 20 integers starting from 0" \
        --seed 12344 (branch = gbnf-optimize-ambiguity2) ran
    5.41 ± 0.23 times faster than ./main \
        -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
        --grammar-file json_numbers.grammar \
        -p "List of 20 integers starting from 0" \
        --seed 12344 (branch = master)
```

Edit: I was accidentally comparing against an old master, so I was mistakenly including the improvements from #6609 in my numbers instead of separating them out. After updating to the latest head, the improvement dropped from ~9x to ~5x. Still good, but not that good. :)

Overall I'm feeling optimistic about this PR, but would love your review whenever you have time, @ochafik

@HanClinto HanClinto changed the title Grammar Optimization: Eliminate Redundant Grammar Trees Grammar optimization: eliminate redundant grammar trees (~9x faster grammar sampling) Apr 11, 2024
@HanClinto HanClinto changed the title Grammar optimization: eliminate redundant grammar trees (~9x faster grammar sampling) Grammar optimization: eliminate redundant grammar trees (~4x faster grammar sampling) Apr 11, 2024
@ochafik (Collaborator) commented Apr 12, 2024

This is amazing (getting 8.5x speedup on my M2 Max), and such a small change, great catch!

[Review thread on llama.cpp — resolved]
@ochafik (Collaborator) left a review comment:

Looks great to me!!

@HanClinto HanClinto merged commit 04a5ac2 into ggerganov:master Apr 12, 2024
61 of 62 checks passed
tybalex pushed a commit to rubra-ai/tools.cpp that referenced this pull request Apr 17, 2024