Eval bug: Incorrect KV cache calculation in llama.android example #12211

@hanyin-arm

Description

Name and Version

version: 4818 (dfd6b2c)
built with Android (10552028, +pgo, +bolt, +lto, -mlgo, based on r487747d) clang version 17.0.2 (https://android.googlesource.com/toolchain/llvm-project d9f89f4d16663d5012e5c09495f3b30ece3d2362) for x86_64-apple-darwin23.6.0

Operating systems

Other? (Please let us know in description)

GGML backends

CPU

Hardware

CPU: Google Tensor G4 (Pixel 9)

Models

No response

Problem description & steps to reproduce

The KV cache size calculation on line 364 of llama-android.cpp is incorrect: the two `tokens_list.size()` terms cancel out, so the expression simply assigns `n_len` to `n_kv_req`:

auto n_kv_req = tokens_list.size() + (n_len - tokens_list.size());

Since `tokens_list` is tokenized from the input text (whether formatted or not), while `n_len` is the maximum number of tokens to be generated, the required KV cache size should naturally be the sum of the two.

First Bad Commit

No response

Relevant log output

(empty; no tokens are generated. When I send a long message formatted with a system prompt and a user prompt, `n_len`, with a default value of `64`, becomes the actual `n_kv_req` and is immediately exceeded)
