llama : update llama_timings.n_p_eval setting #7160

Merged
merged 1 commit into ggerganov:master on May 9, 2024

Conversation

@danbev danbev commented May 9, 2024

This commit changes the value assigned to llama_timings.n_p_eval when ctx->n_p_eval is 0 to be 0 instead of 1, which is the current value.

The motivation for this change is that if session caching is enabled, for example using the --prompt-cache main-session.txt command-line argument for the main example, and the same prompt is used on subsequent runs, the prompt tokens will not actually be passed to llama_decode, and n_p_eval will not be updated by llama_synchronize.

But the value of n_p_eval will be set to 1 by llama_get_timings because ctx->n_p_eval will be 0. This could be interpreted as meaning that 1 prompt token was evaluated, which could be misleading for applications using this value.


Example using main

Current behavior on master

First run:

./main -m models/tinyllama-1.1b-1t-openorca.Q2_K.gguf --prompt 'Hello world' -n 3 --verbose-prompt --prompt-cache main-session.txt
...
main: prompt: 'Hello world'
main: number of tokens in prompt = 3
     1 -> '<s>'
 15043 -> ' Hello'
  3186 -> ' world'

sampling: 
	repeat_last_n = 64, repeat_penalty = 1,000, frequency_penalty = 0,000, presence_penalty = 0,000
	top_k = 40, tfs_z = 1,000, top_p = 0,950, min_p = 0,050, typical_p = 1,000, temp = 0,800
	mirostat = 0, mirostat_lr = 0,100, mirostat_ent = 5,000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 512, n_batch = 2048, n_predict = 3, n_keep = 1


<s> Hello world, what is
llama_print_timings:        load time =     109,87 ms
llama_print_timings:      sample time =       0,07 ms /     3 runs   (    0,02 ms per token, 44776,12 tokens per second)
llama_print_timings: prompt eval time =     129,70 ms /     3 tokens (   43,23 ms per token,    23,13 tokens per second)
llama_print_timings:        eval time =     109,38 ms /     2 runs   (   54,69 ms per token,    18,29 tokens per second)
llama_print_timings:       total time =     240,37 ms /     5 tokens

Second run (note the prompt eval line: it reports 1 token even though no prompt tokens were actually evaluated):

./main -m models/tinyllama-1.1b-1t-openorca.Q2_K.gguf --prompt 'Hello world' -n 3 --verbose-prompt --prompt-cache main-session.txt
...
main: prompt: 'Hello world'
main: number of tokens in prompt = 3
     1 -> '<s>'
 15043 -> ' Hello'
  3186 -> ' world'

sampling: 
	repeat_last_n = 64, repeat_penalty = 1,000, frequency_penalty = 0,000, presence_penalty = 0,000
	top_k = 40, tfs_z = 1,000, top_p = 0,950, min_p = 0,050, typical_p = 1,000, temp = 0,800
	mirostat = 0, mirostat_lr = 0,100, mirostat_ent = 5,000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 512, n_batch = 2048, n_predict = 3, n_keep = 1


<s> Hello world, this is
llama_print_timings:        load time =     136,40 ms
llama_print_timings:      sample time =       0,07 ms /     3 runs   (    0,02 ms per token, 46153,85 tokens per second)
llama_print_timings: prompt eval time =       0,00 ms /     1 tokens (    0,00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =     101,48 ms /     2 runs   (   50,74 ms per token,    19,71 tokens per second)
llama_print_timings:       total time =     123,87 ms /     3 tokens

Behavior with this pull request

First run is the same as the first run above.

Second run:

./main -m models/tinyllama-1.1b-1t-openorca.Q2_K.gguf --prompt 'Hello world' -n 3 --verbose-prompt --prompt-cache main-session.txt
...
main: prompt: 'Hello world'
main: number of tokens in prompt = 3
     1 -> '<s>'
 15043 -> ' Hello'
  3186 -> ' world'

sampling: 
	repeat_last_n = 64, repeat_penalty = 1,000, frequency_penalty = 0,000, presence_penalty = 0,000
	top_k = 40, tfs_z = 1,000, top_p = 0,950, min_p = 0,050, typical_p = 1,000, temp = 0,800
	mirostat = 0, mirostat_lr = 0,100, mirostat_ent = 5,000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 512, n_batch = 2048, n_predict = 3, n_keep = 1


<s> Hello worlds in which
llama_print_timings:        load time =     122,11 ms
llama_print_timings:      sample time =       0,06 ms /     3 runs   (    0,02 ms per token, 50847,46 tokens per second)
llama_print_timings: prompt eval time =       0,00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =     100,23 ms /     2 runs   (   50,12 ms per token,    19,95 tokens per second)
llama_print_timings:       total time =     122,41 ms /     2 tokens
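
With this change a fully cached prompt is reported as 0 tokens, and the per-token averages above become 0/0, hence the -nan. Applications consuming the timings should therefore handle the zero case explicitly; a minimal, hypothetical sketch (the helper name is illustrative, the struct fields are from llama.h):

```cpp
#include <cstdio>
#include "llama.h"

// Hypothetical consumer-side helper: report prompt-eval stats while guarding
// the n_p_eval == 0 case that a fully cached prompt now produces.
static void report_prompt_eval(struct llama_context * ctx) {
    const struct llama_timings t = llama_get_timings(ctx);
    if (t.n_p_eval == 0) {
        printf("prompt served entirely from the session cache, 0 tokens evaluated\n");
    } else {
        printf("prompt eval: %.2f ms / %d tokens (%.2f ms per token)\n",
               t.t_p_eval_ms, t.n_p_eval, t.t_p_eval_ms / t.n_p_eval);
    }
}
```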

github-actions bot commented May 9, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 542 iterations 🚀

Expand details for performance related PR only
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8630.71ms p(95)=20082.53ms fails=, finish reason: stop=472 truncated=70
  • Prompt processing (pp): avg=99.38tk/s p(95)=485.99tk/s
  • Token generation (tg): avg=34.71tk/s p(95)=49.54tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=llama_get_timings_n_p_eval commit=0db386f4ca67350eeee8dd1797f71d22e4daf01d

[chart: llamacpp:prompt_tokens_seconds, llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 542 iterations]
[chart: llamacpp:predicted_tokens_seconds, llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 542 iterations]

[chart: llamacpp:kv_cache_usage_ratio, llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 542 iterations]
[chart: llamacpp:requests_processing, llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 542 iterations]

@ggerganov ggerganov merged commit fd9f92b into ggerganov:master May 9, 2024
64 checks passed
@mofosyne mofosyne added the labels bugfix (fixes an issue or bug) and Review Complexity : Low (trivial changes to code that most beginner devs, or those who want a break, can tackle, e.g. a UI fix) on May 10, 2024