Expose min_p and repetition_penalty in completion options #560
Fixes cactus-compute#554. Signed-off-by: Hang Zhengyang <hang.zhengyang1010@gmail.com>
Pull request overview
This PR exposes min_p and repetition_penalty as generation options via options_json, threads them through InferenceOptions and model decode APIs, and extends graph/kernel sampling to support min_p consistently across FP32/FP16.
Changes:
- Add `min_p`/`repetition_penalty` to `InferenceOptions` parsing and pass them through FFI decode/transcribe call paths.
- Extend `Model::decode*` and graph sampling APIs (`OpParams`, `sample_with_options`) to carry the new options while preserving backward compatibility.
- Update kernel sampling to support `min_p` on both FP32 and FP16 paths and remove FP16's global static token history.
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| docs/cactus_engine.md | Documents new min_p / repetition_penalty options and defaults. |
| cactus/models/model.h | Extends model decode virtuals to accept min_p / repetition_penalty. |
| cactus/models/model_whisper.cpp | Passes new options into shared sample_token path. |
| cactus/models/model_parakeet.cpp | Adds new parameters (currently unused for this model). |
| cactus/models/model_parakeet_tdt.cpp | Adds new parameters (currently unused for this model). |
| cactus/models/model_moonshine.cpp | Passes new options into shared sample_token path. |
| cactus/models/model_lfm2vl.cpp | Threads options through VLM wrapper to underlying language model decode. |
| cactus/models/gemma4/model_gemma4.h | Extends multimodal Gemma4 decode APIs to accept new options. |
| cactus/models/gemma4/model_gemma4_mm.cpp | Uses graph sampling with new options and tracks token history for multimodal path. |
| cactus/kernel/kernel.h | Adds _ex sampler APIs with min_p / repetition_penalty parameters. |
| cactus/kernel/kernel_nn.cpp | Implements _ex samplers; adds FP16 min_p; removes static FP16 repetition state. |
| cactus/graph/graph.h | Extends OpParams and adds sample_with_options API. |
| cactus/graph/graph_ops_sample.cpp | Routes sampling node execution through _ex kernel APIs. |
| cactus/graph/graph_builder.cpp | Implements sample_with_options; keeps sample() behavior via forwarding defaults. |
| cactus/ffi/cactus_utils.h | Adds new fields to InferenceOptions and parses them from options_json. |
| cactus/ffi/cactus_transcribe.cpp | Forwards new options into audio decode calls; keeps language probe greedy/option-independent. |
| cactus/ffi/cactus_complete.cpp | Forwards new options into decode and audio decode paths. |
| cactus/engine/engine.h | Extends base Model decode APIs; clears token history on reset_cache(). |
| cactus/engine/engine_model.cpp | Implements repetition penalty as a logit-bias adjustment; threads min_p into sampling. |
Force-pushed from 9ce35db to 482882b.
Copilot reviewed 19 out of 19 changed files in this pull request and generated 4 comments.
@yujonglee wanna review this one? Thanks @DuFanYin!
I think most of the file touches are justified by the chosen plumbing path. That said, I don't think the implementation is fully consistent yet:
Fixed. They now use `language_model_.sample_token(...)` plus `record_sampled_token(...)`, matching text decode.
Fixes #554
Approach
Did not change the public C function signatures: new fields live in `options_json` and flow through `InferenceOptions`. Internal C++ (`Model::decode`, `decode_with_images`, `decode_with_audio`, `sample_token`, graph `OpParams`) gains parameters with defaults, so existing call sites keep building without changes.
What changed
- `InferenceOptions`: add `min_p` (default `0.15`) and `repetition_penalty` (default `1.1`); extend `parse_inference_options_json` to parse both keys, keeping defaults when omitted.
- Update `decode`/`decode_with_audio` call sites in `cactus_complete` and `cactus_transcribe`, so the expanded `Model::decode*` signatures receive the right arguments from `options_json`.
- Extend `OpParams`; add `CactusGraph::sample_with_options`. Existing `sample()` forwards to it with the prior implicit defaults, so unmodified call sites are unaffected.
- Add `cactus_sample_f32_ex`/`cactus_sample_f16_ex`; keep `cactus_sample_f32`/`cactus_sample_f16` as wrappers for backward compatibility.
- `min_p`: now applied on both FP32 and FP16 paths (FP16 previously skipped this). Standard rule: after temperature / top-k / bias, compute probabilities and drop logits below `max_prob * min_p`.
- `repetition_penalty`: implemented in `Model::sample_token` via the existing logit-bias map, subtracting `log(penalty)` for each token in `token_history_`. History is bounded to 128 tokens per instance and cleared on `reset_cache()`.
- Remove `static token_history` from the FP16 sampler (process-wide state, broken for multiple concurrent sessions).
- `docs/cactus_engine.md`: document both options and their defaults.
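The min_p rule described above (drop everything below `max_prob * min_p` after temperature/top-k/bias) can be sketched in isolation. The function name `apply_min_p` is illustrative, not the kernel's actual symbol; masked tokens are set to negative infinity so any subsequent softmax/sampling step assigns them zero probability.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

// Illustrative sketch of min_p filtering: convert logits to probabilities
// and mask any token whose probability is below max_prob * min_p.
void apply_min_p(std::vector<float>& logits, float min_p) {
    if (min_p <= 0.0f || logits.empty()) return;

    // Softmax, shifted by the max logit for numerical stability.
    float max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<float> probs(logits.size());
    float sum = 0.0f;
    for (size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp(logits[i] - max_logit);
        sum += probs[i];
    }
    float max_prob = 0.0f;
    for (float& p : probs) {
        p /= sum;
        max_prob = std::max(max_prob, p);
    }

    // Mask tokens below the relative threshold.
    float threshold = max_prob * min_p;
    for (size_t i = 0; i < probs.size(); ++i) {
        if (probs[i] < threshold)
            logits[i] = -std::numeric_limits<float>::infinity();
    }
}
```

Because the threshold is relative to the top probability, a confident distribution prunes aggressively while a flat one keeps most candidates, which is the usual rationale for min_p over a fixed top-p cutoff.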
Design notes for reviewers
- `repetition_penalty` is intentionally a no-op at the kernel level. `cactus_sample_f32_ex`/`cactus_sample_f16_ex` accept the parameter but discard it (`(void)`). The penalty is fully applied upstream in `Model::sample_token` as a logit bias before the graph executes; the kernel parameter exists for API symmetry and future direct-graph use cases. This is not a bug.
- `sample()` now implicitly activates `min_p=0.15`. The old `sample()` had a hardcoded `constexpr float min_p = 0.15f` internally; the new `sample()` forwards to `sample_with_options` with the same value, so behavior is unchanged for existing callers.
- The Whisper language probe remains greedy and option-independent. The initial language probe calls `decode_with_audio` with `temperature=0`, `top_p=0`, `top_k=0`, and explicitly passes `min_p=0`/`repetition_penalty=1`, so user decoding settings do not affect language detection.
- Known limitation (pre-existing): some multimodal decode paths sample directly from graph nodes (without `Model::sample_token`), so repetition behavior is not fully unified with the standard text decode path in this PR.
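The logit-bias formulation of the repetition penalty can be sketched as follows. Type and method names (`RepetitionState`, `record`, `apply`, `reset`) are illustrative, not the engine's actual API; the mechanics match the description above: one `-log(penalty)` bias per distinct token in a per-instance history capped at 128 entries, cleared alongside `reset_cache()`.

```cpp
#include <cmath>
#include <deque>
#include <unordered_map>
#include <vector>

// Illustrative per-instance state: no process-wide statics, so multiple
// concurrent sessions each keep their own history.
constexpr size_t kMaxHistory = 128;

struct RepetitionState {
    std::deque<int> history;

    void record(int token) {
        history.push_back(token);
        if (history.size() > kMaxHistory) history.pop_front();
    }

    void reset() { history.clear(); }  // mirrors clearing on reset_cache()

    // Subtract log(penalty) from the logit of each recently seen token,
    // via a bias map (one entry per distinct token).
    void apply(std::vector<float>& logits, float penalty) const {
        if (penalty == 1.0f) return;  // penalty of 1.0 is a no-op
        std::unordered_map<int, float> bias;
        for (int tok : history) bias[tok] = -std::log(penalty);
        for (const auto& [tok, b] : bias) logits[tok] += b;
    }
};
```

With `penalty > 1`, `log(penalty) > 0`, so seen tokens lose logit mass and become less likely; `penalty = 1` leaves logits untouched, matching the "disabled" default semantics.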