One million tokens prompt club #24622

fairydreaming · 2026-06-14T18:44:17Z

fairydreaming
Jun 14, 2026
Collaborator

More and more models support a 1M-token context length (DeepSeek V4, MiMo-V2.5, MiniMax M3), so let's get llama.cpp into better shape by running 1M-token prompts through various models and backends, then reporting and fixing any errors we hit. I attached a prompt file containing 1048572 dot characters separated by spaces. It should tokenize to 1048572 tokens (checked with DeepSeek V4, still to be confirmed with other models).
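If you would rather regenerate the prompt than download the attachment, here is a minimal Python sketch. The exact layout is an assumption: it writes the dots joined by single spaces with no leading or trailing whitespace, which may not match the attached file byte for byte. The function names and the output filename are made up for illustration.

```python
from pathlib import Path

N_DOTS = 1_048_572  # dot count stated in the post


def make_dot_prompt(n_dots: int) -> str:
    """Return n_dots '.' characters separated by single spaces."""
    # Assumption: no leading/trailing space or newline; the attached file may differ.
    return " ".join("." for _ in range(n_dots))


def write_dot_prompt(path: str, n_dots: int = N_DOTS) -> int:
    """Write the prompt to `path` and return the number of characters written."""
    text = make_dot_prompt(n_dots)
    Path(path).write_text(text, encoding="ascii")
    return len(text)


if __name__ == "__main__":
    # Hypothetical output name, chosen so it does not clobber the attached prompt-1m.txt.
    n_chars = write_dot_prompt("prompt-1m-regenerated.txt")
    print(f"wrote {n_chars} characters")
```

Whether this yields exactly one token per dot depends on the tokenizer, so it is worth checking the token count for each model before comparing results.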

I will start (DeepSeek V4 Flash, CUDA backend, CPU MoE offloading, [WIP] DeepSeek V4 branch):

$ ./bin/llama-completion -m ../../llama.cpp-dsv4/models/DeepSeek-V4-Flash.gguf -f ../prompt-1m.txt --no-warmup --temp 0.01 -fa 1 -ngl 99 -fit off -ncmoe 999 -no-cnv -n 1 -b 8192 -ub 8192
0.00.367.532 I llama_completion: llama backend init
0.00.367.554 I llama_completion: load the model and apply lora adapter, if any
0.02.025.736 W llama_model_loader: tensor overrides to CPU are used with mmap enabled - consider using --no-mmap for better performance
0.58.109.622 I common_chat_templates_init: no chat template found in model, using built-in inline template for arch 'deepseek4'
0.58.130.216 I llama_completion: llama threadpool init, n_threads = 32
0.58.131.471 I 
0.58.131.599 I system_info: n_threads = 32 (n_threads_batch = 32) / 64 | CUDA : ARCHS = 1200 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | BLACKWELL_NATIVE_FP4 = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
0.58.131.601 I 
1.05.915.547 I sampler seed: 3788675317
1.05.915.575 I sampler params: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
	top_k = 40, top_p = 1.000, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.010
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000, adaptive_target = -1.000, adaptive_decay = 0.900
1.05.915.598 I sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> ?top-p -> min-p -> ?xtc -> temp-ext -> dist 
1.05.915.598 I generate: n_ctx = 1048576, n_batch = 8192, n_predict = 1, n_keep = 0
1.05.915.599 I 
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
...a few hundred thousand dots later...
/home/phm/projects/llama.cpp-dsv4-opt/ggml/src/ggml-cuda/ggml-cuda.cu:104: CUDA error
57.47.970.902 E CUDA error: an illegal memory access was encountered
57.47.970.911 E   current device: 0, in function ggml_backend_cuda_synchronize at /home/phm/projects/llama.cpp-dsv4-opt/ggml/src/ggml-cuda/ggml-cuda.cu:3249
57.47.970.912 E   cudaStreamSynchronize(cuda_ctx->stream())
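For a sense of scale, the numbers in the command line and the generate: log line above can be turned into a quick batch count. This is plain arithmetic on values from the post (prompt length, n_ctx, n_batch); the helper name batch_plan is made up for illustration, and the sketch says nothing about how llama.cpp schedules batches internally.

```python
N_PROMPT = 1_048_572  # prompt tokens, per the post
N_CTX = 1_048_576     # n_ctx from the "generate:" log line
N_BATCH = 8_192       # value passed to -b / -ub


def batch_plan(n_prompt: int, n_batch: int) -> tuple[int, int]:
    """Return (number of batches, size of the last batch) for a prompt."""
    n_batches = -(-n_prompt // n_batch)  # integer ceiling division
    last = n_prompt - (n_batches - 1) * n_batch
    return n_batches, last


n_batches, last_batch = batch_plan(N_PROMPT, N_BATCH)
headroom = N_CTX - N_PROMPT

print(f"batches: {n_batches}, last batch: {last_batch} tokens")
print(f"context headroom after the prompt: {headroom} tokens")
```

With these inputs the prompt splits into 128 batches, the last one holding 8188 tokens, and 4 context slots remain free after the prompt.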

Let's run it under CUDA compute-sanitizer to identify the source of the problem. I replaced the model with my 4-layer DeepSeek V4 so I won't have to wait hours for the result. I also compiled llama.cpp with cmake .. -DGGML_CUDA=1 -DCMAKE_BUILD_TYPE=Debug -DGGML_CUDA_DEBUG=1 to get CUDA debug line info.

$ GGML_CUDA_DISABLE_GRAPHS=1 compute-sanitizer ./bin/llama-completion -m ../../llama.cpp-dsv4/models/DeepSeek-V4-Flash-4Layers-auto.gguf -f ../../llama.cpp-dsv4-opt/prompt-1m.txt --no-warmup --temp 0.01 -fa 1 -ngl 99 -fit off -no-cnv -n 1 -b 8192 -ub 8192
========= COMPUTE-SANITIZER
0.00.424.967 I llama_completion: llama backend init
0.00.424.988 I llama_completion: load the model and apply lora adapter, if any
0.12.107.584 I llama_completion: llama threadpool init, n_threads = 32
0.12.108.983 I 
0.12.109.113 I system_info: n_threads = 32 (n_threads_batch = 32) / 64 | CUDA : ARCHS = 1200 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | BLACKWELL_NATIVE_FP4 = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
0.12.109.115 I 
0.19.666.572 I sampler seed: 1244734138
0.19.666.596 I sampler params: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
	top_k = 40, top_p = 1.000, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.010
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000, adaptive_target = -1.000, adaptive_decay = 0.900
0.19.666.638 I sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> ?top-p -> min-p -> ?xtc -> temp-ext -> dist 
0.19.666.639 I generate: n_ctx = 1048576, n_batch = 8192, n_predict = 1, n_keep = 0
0.19.666.640 I 
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
...a few hundred thousand dots later...
========= Invalid __global__ read of size 4 bytes
=========     at void k_bin_bcast<&op_add, float, float, float, const float *>(const T2 *, const T3 *, T4 *, int, int, int, uint3, uint3, uint3, uint3, uint3, int, int, int, int, int, int, int, int, int, int, int, T5...)+0x640 in binbcast.cu:83
=========     by thread (0,0,0) in block (5505181,0,0)
=========     Address 0x70e937434280 is out of bounds
=========     and is 8,300,313,984 bytes before the nearest allocation at 0x70eb26000000 of size 21,957,693,952 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========         Host Frame: cuLaunchKernelEx [0x39d75f] in libcuda.so.1
=========         Host Frame:  [0x141c1] in libcudart.so.12
=========         Host Frame: cudaLaunchKernelExC [0x7a203] in libcudart.so.12
=========         Host Frame: cudaLaunchKernelEx<float const*, float const*, float*, int, int, int, uint3, uint3, uint3, uint3, uint3, int, int, int, int, int, int, int, int, int, int, int, float const*, float const*&, float const*&, float*&, long&, long&, long&, uint3 const&, uint3 const&, uint3 const&, uint3 const&, uint3 const&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, float const*>(cudaLaunchConfig_st const*, void (*)(float const*, float const*, float*, int, int, int, uint3, uint3, uint3, uint3, uint3, int, int, int, int, int, int, int, int, int, int, int, float const*), float const*&, float const*&, float*&, long&, long&, long&, uint3 const&, uint3 const&, uint3 const&, uint3 const&, uint3 const&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, float const*&&)::{lambda(float const*, float const*, float*, int, int, int, uint3, uint3, uint3, uint3, uint3, int, int, int, int, int, int, int, int, int, int, int, float const*)#1}::operator()(float const*, float const*, float*, int, int, int, uint3, uint3, uint3, uint3, uint3, int, int, int, int, int, int, int, int, int, int, int, float const*) const in cuda_runtime.h:288 [0x267632] in libggml-cuda.so.0
=========         Host Frame: cudaError cudaLaunchKernelEx<float const*, float const*, float*, int, int, int, uint3, uint3, uint3, uint3, uint3, int, int, int, int, int, int, int, int, int, int, int, float const*, float const*&, float const*&, float*&, long&, long&, long&, uint3 const&, uint3 const&, uint3 const&, uint3 const&, uint3 const&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, float const*>(cudaLaunchConfig_st const*, void (*)(float const*, float const*, float*, int, int, int, uint3, uint3, uint3, uint3, uint3, int, int, int, int, int, int, int, int, int, int, int, float const*), float const*&, float const*&, float*&, long&, long&, long&, uint3 const&, uint3 const&, uint3 const&, uint3 const&, uint3 const&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, float const*&&) in cuda_runtime.h:289 [0x267a54] in libggml-cuda.so.0
=========         Host Frame: void ggml_cuda_kernel_launch<void (*)(float const*, float const*, float*, int, int, int, uint3, uint3, uint3, uint3, uint3, int, int, int, int, int, int, int, int, int, int, int, float const*), float const*&, float const*&, float*&, long&, long&, long&, uint3 const&, uint3 const&, uint3 const&, uint3 const&, uint3 const&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, float const*>(void (*)(float const*, float const*, float*, int, int, int, uint3, uint3, uint3, uint3, uint3, int, int, int, int, int, int, int, int, int, int, int, float const*), ggml_cuda_kernel_launch_params const&, float const*&, float const*&, float*&, long&, long&, long&, uint3 const&, uint3 const&, uint3 const&, uint3 const&, uint3 const&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, float const*&&) in common.cuh:1633 [0x24663c] in libggml-cuda.so.0
=========         Host Frame: void launch_bin_bcast_pack<&(op_add(float, float)), float, float, float, 0ul>(ggml_tensor const*, ggml_tensor const*, ggml_tensor*, float const*, float const*, float*, CUstream_st*, std::integer_sequence<unsigned long, 0ul>) in binbcast.cu:304 [0x20ac34] in libggml-cuda.so.0
=========         Host Frame: void bin_bcast_cuda<&(op_add(float, float)), 1>::operator()<float, float, float>(ggml_tensor const*, ggml_tensor const*, ggml_tensor*, float const*, float const*, float*, CUstream_st*) in binbcast.cu:350 [0x1bbd7e] in libggml-cuda.so.0
=========         Host Frame: void ggml_cuda_op_bin_bcast<bin_bcast_cuda<&(op_add(float, float)), 1> >(ggml_tensor const*, ggml_tensor const*, ggml_tensor*, void const*, void const*, void*, CUstream_st*) in binbcast.cu:375 [0x1b052d] in libggml-cuda.so.0
=========         Host Frame: ggml_cuda_op_add(ggml_backend_cuda_context&, ggml_tensor*) in binbcast.cu:394 [0x16da83] in libggml-cuda.so.0
=========         Host Frame: ggml_cuda_compute_forward(ggml_backend_cuda_context&, ggml_tensor*) in ggml-cuda.cu:2828 [0x38ec30] in libggml-cuda.so.0
=========         Host Frame: ggml_cuda_graph_evaluate_and_capture(ggml_backend_cuda_context*, ggml_cgraph*, bool, bool, void const*) in ggml-cuda.cu:4402 [0x3964c1] in libggml-cuda.so.0
=========         Host Frame: ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) in ggml-cuda.cu:4522 [0x396c5c] in libggml-cuda.so.0
=========         Host Frame: ggml_backend_graph_compute_async in ggml-backend.cpp:452 [0xb2d81] in libggml-base.so.0
=========         Host Frame: ggml_backend_sched_compute_splits(ggml_backend_sched*) in ggml-backend.cpp:1678 [0xb7ef0] in libggml-base.so.0
=========         Host Frame: ggml_backend_sched_graph_compute_async in ggml-backend.cpp:1901 [0xb8ec3] in libggml-base.so.0
=========         Host Frame: llama_context::graph_compute(ggml_cgraph*, bool) in llama-context.cpp:2338 [0x58c859] in libllama.so.0
=========         Host Frame: llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) in llama-context.cpp:1317 [0x58743c] in libllama.so.0
=========         Host Frame: llama_context::decode(llama_batch const&) in llama-context.cpp:1795 [0x589763] in libllama.so.0
=========         Host Frame: llama_decode in llama-context.cpp:3937 [0x591a8c] in libllama.so.0
=========         Host Frame: common_prompt_batch_decode(llama_context*, std::vector<int, std::allocator<int> > const&, int, int&, int, std::basic_string_view<char, std::char_traits<char> >, bool) in common.cpp:2026 [0x8fa4d5] in libllama-common.so.0
=========         Host Frame: llama_completion(int, char**) in completion.cpp:693 [0x36f4b] in libllama-completion-impl.so
=========         Host Frame: main in main.cpp:4 [0x116c] in llama-completion
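The addresses in the sanitizer report can be sanity-checked with a few lines of Python. The sketch below recomputes the distance between the faulting address and the nearest allocation, then converts it to a count of 4-byte floats. The last part is only a hypothesis, not something the report confirms: since k_bin_bcast takes many int parameters, a signed 32-bit element index that wraps negative could produce an address this far below a buffer. It assumes the access was indexed relative to the start of that allocation, which the report does not say.

```python
BAD_ADDR = 0x70E937434280      # "Address ... is out of bounds"
ALLOC_BASE = 0x70EB26000000    # nearest allocation reported by the sanitizer
REPORTED_GAP = 8_300_313_984   # "bytes before the nearest allocation"
ELEM_SIZE = 4                  # the kernel reads float, "read of size 4 bytes"


def wrap_int32(x: int) -> int:
    """Interpret x modulo 2**32 as a signed 32-bit integer."""
    return (x + 2**31) % 2**32 - 2**31


gap_bytes = ALLOC_BASE - BAD_ADDR
gap_elems = gap_bytes // ELEM_SIZE

print(f"gap: {gap_bytes} bytes, matches report: {gap_bytes == REPORTED_GAP}")
print(f"gap in floats: {gap_elems}, below 2**31: {gap_elems < 2**31}")

# Hypothetical illustration: the 64-bit element index that would wrap to
# -gap_elems if it were truncated to a signed 32-bit int.
suspect_index = 2**32 - gap_elems
print(f"index {suspect_index} wraps to {wrap_int32(suspect_index)}")
```

The recomputed gap matches the reported 8,300,313,984 bytes, which is 2,075,078,496 floats and therefore within the reach of a negative 32-bit index. That is consistent with an integer overflow but does not prove one; the real cause still has to be found in binbcast.cu.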

Now I have a starting point to investigate further and create a bug report.

Inviting @AesSedai to try this with MiMo-V2.5.

prompt-1m.zip
