Can the token output rate be improved?

> ./ds4-bench \
  -m ds4flash.gguf \
  --prompt-file speed-bench/promessi_sposi.txt \
  --ctx-start 2048 \
  --ctx-max 65536 \
  --step-incr 2048 \
  --gen-tokens 128
ds4-bench: context buffers 1311.89 MiB (ctx=65665, backend=metal, prefill_chunk=2048, raw_kv_rows=2304, compressed_kv_rows=16418)
ds4: Metal device Apple M5 Max, 128.00 GiB RAM
ds4: requesting Metal residency (may take tens of seconds)... done
ds4: warming Metal model views... done
ds4: Metal model views created in 2.374 ms, residency requested in 669.804 ms, warmup 5.021 ms (mapped 82697.67 MiB from offset 5.08 MiB)
ds4: Metal mapped mmaped model as 2 overlapping shared buffers
ds4: metal backend initialized for graph diagnostics
ctx_tokens,prefill_tokens,prefill_tps,gen_tokens,gen_tps,kvcache_bytes
2048,2048,370.57,128,31.25,52184460
4096,2048,324.39,128,30.50,80373132
6144,2048,330.33,128,30.58,108561804
8192,2048,329.27,128,30.64,136750476
10240,2048,323.10,128,30.34,164939148
12288,2048,320.78,128,30.31,193127820
14336,2048,314.43,128,30.09,221316492
16384,2048,313.95,128,30.04,249505164
18432,2048,308.94,128,29.67,277693836
20480,2048,305.58,128,29.78,305882508
22528,2048,301.93,128,29.44,334071180
24576,2048,299.79,128,29.49,362259852
26624,2048,294.64,128,29.13,390448524
28672,2048,292.27,128,29.16,418637196
30720,2048,288.95,128,28.89,446825868
32768,2048,286.91,128,28.83,475014540
34816,2048,281.73,128,28.54,503203212
36864,2048,279.67,128,28.52,531391884
38912,2048,273.01,128,28.32,559580556
40960,2048,272.47,128,28.36,587769228
43008,2048,267.34,128,28.11,615957900
45056,2048,263.90,128,28.12,644146572
47104,2048,262.77,128,27.88,672335244
49152,2048,261.83,128,27.90,700523916
51200,2048,257.78,128,27.68,728712588
53248,2048,255.64,128,27.67,756901260
55296,2048,253.01,128,27.39,785089932
57344,2048,251.09,128,27.48,813278604
59392,2048,247.10,128,27.21,841467276
61440,2048,244.94,128,27.24,869655948
63488,2048,242.50,128,26.95,897844620
65536,2048,240.32,128,26.90,926033292

Hello, the above is my benchmark data on an M5 Max with 128 GB of RAM. In actual use, the token output rate peaks at about 38 t/s. As the context length grows, the rate usually drops by roughly half, to only about 20 t/s, which is too slow for writing code. Is there any other way to speed up generation? I tried adding a number of parameters, but the rate actually decreased.
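To put numbers on the slowdown in the benchmark itself, here is a minimal Python sketch that compares the first and last rows of the CSV output above. It embeds three rows copied from that output; the `summarize` helper and the key names it returns are made up for illustration and are not part of ds4-bench. The per-token figure is simply the growth of the `kvcache_bytes` column divided by the growth in `ctx_tokens`, so treat it as a derived estimate.

```python
import csv
import io

# Three rows copied from the ds4-bench output above (first, middle, last).
BENCH_CSV = """\
ctx_tokens,prefill_tokens,prefill_tps,gen_tokens,gen_tps,kvcache_bytes
2048,2048,370.57,128,31.25,52184460
32768,2048,286.91,128,28.83,475014540
65536,2048,240.32,128,26.90,926033292
"""


def summarize(csv_text):
    """Compare the first and last benchmark rows (hypothetical helper)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    first, last = rows[0], rows[-1]

    def drop_pct(column):
        # Relative slowdown from the shortest to the longest context.
        start, end = float(first[column]), float(last[column])
        return 100.0 * (start - end) / start

    ctx_span = int(last["ctx_tokens"]) - int(first["ctx_tokens"])
    kv_span = int(last["kvcache_bytes"]) - int(first["kvcache_bytes"])
    return {
        "gen_drop_pct": drop_pct("gen_tps"),
        "prefill_drop_pct": drop_pct("prefill_tps"),
        "kv_bytes_per_token": kv_span / ctx_span,
    }


summary = summarize(BENCH_CSV)
print(f"gen_tps drop:       {summary['gen_drop_pct']:.1f}%")
print(f"prefill_tps drop:   {summary['prefill_drop_pct']:.1f}%")
print(f"KV bytes per token: {summary['kv_bytes_per_token']:.0f}")
```

On these rows, generation slows by about 14% between 2048 and 65536 context tokens, while prefill slows by about 35%. That is a much smaller drop than the halving I see in actual use, so the benchmark may not reflect my real workload.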




Can the token output rate be improved? #220
