Expose min_p and repetition_penalty in completion options#560

Merged
HenryNdubuaku merged 6 commits into
cactus-compute:mainfrom
DuFanYin:hang/expose-minp-repetition-penalty
Apr 9, 2026

@DuFanYin
Contributor

@DuFanYin DuFanYin commented Apr 7, 2026

Fixes #554

Scope note: This PR intentionally lands the minimal option-exposure fix first. If preferred by maintainers, I can add follow-up commits in this PR to refactor multimodal sampling paths (and any API-shaping changes) for fully consistent repetition behavior.

Approach

The public C function signatures are unchanged: the new fields live in options_json and flow through InferenceOptions. Internal C++ (Model::decode, decode_with_images, decode_with_audio, sample_token, graph OpParams) gains parameters with defaults, so existing call sites keep building without changes.
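
To illustrate the defaulted-parameter approach, here is a minimal sketch. The `DecodeArgs` struct, the free function `decode`, and both call-site helpers are hypothetical stand-ins rather than the real `Model::decode` signature, and the default values are an assumption borrowed from the InferenceOptions defaults listed in this PR.

```cpp
#include <cassert>

// Hypothetical record of the arguments a decode call received; not a cactus type.
struct DecodeArgs {
    float temperature;
    float top_p;
    int top_k;
    float min_p;
    float repetition_penalty;
};

// Hypothetical decode-style function. The two trailing parameters are new and
// defaulted; the default values are assumptions taken from the InferenceOptions
// defaults described in this PR.
inline DecodeArgs decode(float temperature, float top_p, int top_k,
                         float min_p = 0.15f, float repetition_penalty = 1.1f) {
    assert(min_p >= 0.0f && repetition_penalty > 0.0f);
    return DecodeArgs{temperature, top_p, top_k, min_p, repetition_penalty};
}

// A call site written before the change: it still compiles and picks up the defaults.
inline DecodeArgs legacy_call_site() {
    return decode(0.7f, 0.9f, 40);
}

// A call site updated to forward user-supplied options explicitly.
inline DecodeArgs updated_call_site(float min_p, float repetition_penalty) {
    return decode(0.7f, 0.9f, 40, min_p, repetition_penalty);
}
```

Only call sites that need to forward user options have to change; everything else keeps its old three-argument form.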

What changed

  • InferenceOptions: add min_p (default 0.15) and repetition_penalty (default 1.1); extend parse_inference_options_json to parse both keys, keeping defaults when omitted.
  • FFI: pass both through all decode / decode_with_audio call sites in cactus_complete and cactus_transcribe, so the expanded Model::decode* signatures receive the right arguments from options_json.
  • Graph: extend OpParams; add CactusGraph::sample_with_options. Existing sample() forwards to it with the prior implicit defaults so unmodified call sites are unaffected.
  • Kernel: add cactus_sample_f32_ex / cactus_sample_f16_ex; keep cactus_sample_f32 / cactus_sample_f16 as wrappers for backward compatibility.
  • min_p: now applied on both FP32 and FP16 paths (FP16 previously skipped this). Standard rule: after temperature / top-k / bias, compute probabilities and drop every token whose probability falls below max_prob * min_p.
  • repetition_penalty: implemented in Model::sample_token via the existing logit-bias map — subtract log(penalty) for each token in token_history_. History is bounded to 128 tokens per instance and cleared on reset_cache().
  • Remove static token_history from the FP16 sampler (process-wide state, broken for multiple concurrent sessions).
  • docs/cactus_engine.md: document both options and defaults.
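
The min_p rule in the list above can be sketched as follows. This is an illustrative helper, not the cactus kernel code: `apply_min_p` is a hypothetical name, and it assumes temperature, top-k, and bias have already been applied to the logits. It relies on the identity p_i / p_max = exp(logit_i - logit_max), so the cutoff can be tested in logit space without computing the full softmax.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// Hypothetical helper: masks every token whose probability would fall below
// max_prob * min_p by setting its logit to -infinity.
inline std::vector<float> apply_min_p(std::vector<float> logits, float min_p) {
    assert(!logits.empty());
    if (min_p <= 0.0f) {
        return logits;  // min_p disabled: nothing is filtered
    }
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    // p_i < p_max * min_p  <=>  logit_i - max_logit < log(min_p).
    // Clamp min_p to 1 so the most likely token always survives.
    const float threshold = std::log(std::min(min_p, 1.0f));
    for (float& logit : logits) {
        if (logit - max_logit < threshold) {
            logit = -std::numeric_limits<float>::infinity();
        }
    }
    return logits;
}
```

With min_p = 0.15 the threshold is log(0.15), roughly -1.9, so a token survives only if its logit is within about 1.9 of the maximum.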

Design notes for reviewers

  • repetition_penalty is intentionally a no-op at the kernel level. cactus_sample_f32_ex / cactus_sample_f16_ex accept the parameter but discard it ((void)). The penalty is fully applied upstream in Model::sample_token as a logit bias before the graph executes. The kernel parameter exists for API symmetry and future direct-graph use cases. This is not a bug.

  • sample() now implicitly activates min_p=0.15. The old sample() had a hardcoded constexpr float min_p = 0.15f internally. The new sample() forwards to sample_with_options with the same value, so behavior is unchanged for existing callers.

  • Whisper language probe remains greedy and option-independent. The initial language probe calls decode_with_audio with temperature=0, top_p=0, top_k=0, and explicitly passes min_p=0 / repetition_penalty=1 so user decoding settings do not affect language detection.

  • Known limitation (pre-existing): some multimodal decode paths sample directly from graph nodes (without Model::sample_token), so repetition behavior is not fully unified with the standard text decode path in this PR.
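
The repetition-penalty design described in these notes (a bounded per-instance history whose tokens receive a logit bias of -log(penalty) before sampling) might look roughly like the sketch below. `RepetitionTracker` and its methods are hypothetical names, not the `Model::sample_token` code. Applying the bias once per distinct token rather than once per occurrence is an assumption; the PR text does not say which.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <unordered_map>
#include <unordered_set>

// Hypothetical per-instance tracker; each model instance would own one, so
// concurrent sessions never share history.
class RepetitionTracker {
public:
    explicit RepetitionTracker(std::size_t capacity = 128) : capacity_(capacity) {}

    // Record a sampled token, evicting the oldest once the bound is exceeded.
    void record(std::uint32_t token) {
        history_.push_back(token);
        if (history_.size() > capacity_) {
            history_.pop_front();
        }
    }

    // Counterpart of clearing history on reset_cache().
    void reset() { history_.clear(); }

    // Subtract log(penalty) from the bias of each distinct token in history.
    // Assumption: one subtraction per distinct token, not per occurrence.
    void apply(std::unordered_map<std::uint32_t, float>& bias, float penalty) const {
        assert(penalty > 0.0f);
        if (penalty == 1.0f) {
            return;  // a penalty of 1 is a no-op
        }
        const float delta = std::log(penalty);
        std::unordered_set<std::uint32_t> seen;
        for (std::uint32_t token : history_) {
            if (seen.insert(token).second) {
                bias[token] -= delta;  // missing entries start at 0.0f
            }
        }
    }

private:
    std::size_t capacity_;
    std::deque<std::uint32_t> history_;
};
```

Because the penalty lands in the bias map before the graph runs, the kernel-level sampler does not need to know about it, which matches the note that the kernel parameter is intentionally unused.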


Fixes cactus-compute#554.

Signed-off-by: Hang Zhengyang <hang.zhengyang1010@gmail.com>
Copilot AI review requested due to automatic review settings April 7, 2026 06:28
Contributor

Copilot AI left a comment

Pull request overview

This PR exposes min_p and repetition_penalty as generation options via options_json, threads them through InferenceOptions and model decode APIs, and extends graph/kernel sampling to support min_p consistently across FP32/FP16.

Changes:

  • Add min_p / repetition_penalty to InferenceOptions parsing and pass them through FFI decode/transcribe call paths.
  • Extend Model::decode* and graph sampling APIs (OpParams, sample_with_options) to carry the new options while preserving backward compatibility.
  • Update kernel sampling to support min_p on both FP32 and FP16 paths and remove FP16’s global static token history.

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| docs/cactus_engine.md | Documents new min_p / repetition_penalty options and defaults. |
| cactus/models/model.h | Extends model decode virtuals to accept min_p / repetition_penalty. |
| cactus/models/model_whisper.cpp | Passes new options into shared sample_token path. |
| cactus/models/model_parakeet.cpp | Adds new parameters (currently unused for this model). |
| cactus/models/model_parakeet_tdt.cpp | Adds new parameters (currently unused for this model). |
| cactus/models/model_moonshine.cpp | Passes new options into shared sample_token path. |
| cactus/models/model_lfm2vl.cpp | Threads options through VLM wrapper to underlying language model decode. |
| cactus/models/gemma4/model_gemma4.h | Extends multimodal Gemma4 decode APIs to accept new options. |
| cactus/models/gemma4/model_gemma4_mm.cpp | Uses graph sampling with new options and tracks token history for multimodal path. |
| cactus/kernel/kernel.h | Adds _ex sampler APIs with min_p / repetition_penalty parameters. |
| cactus/kernel/kernel_nn.cpp | Implements _ex samplers; adds FP16 min_p; removes static FP16 repetition state. |
| cactus/graph/graph.h | Extends OpParams and adds sample_with_options API. |
| cactus/graph/graph_ops_sample.cpp | Routes sampling node execution through _ex kernel APIs. |
| cactus/graph/graph_builder.cpp | Implements sample_with_options; keeps sample() behavior via forwarding defaults. |
| cactus/ffi/cactus_utils.h | Adds new fields to InferenceOptions and parses them from options_json. |
| cactus/ffi/cactus_transcribe.cpp | Forwards new options into audio decode calls; keeps language probe greedy/option-independent. |
| cactus/ffi/cactus_complete.cpp | Forwards new options into decode and audio decode paths. |
| cactus/engine/engine.h | Extends base Model decode APIs; clears token history on reset_cache(). |
| cactus/engine/engine_model.cpp | Implements repetition penalty as a logit-bias adjustment; threads min_p into sampling. |

Comment thread cactus/ffi/cactus_utils.h Outdated
Comment thread cactus/ffi/cactus_utils.h Outdated
Comment thread cactus/engine/engine_model.cpp Outdated
Comment thread cactus/kernel/kernel_nn.cpp
Comment thread cactus/models/gemma4/model_gemma4_mm.cpp Outdated
Comment thread cactus/ffi/cactus_utils.h
@DuFanYin DuFanYin force-pushed the hang/expose-minp-repetition-penalty branch from 9ce35db to 482882b Compare April 7, 2026 06:46
DuFanYin added 2 commits April 7, 2026 14:55
Signed-off-by: Hang Zhengyang <hang.zhengyang1010@gmail.com>
Signed-off-by: Hang Zhengyang <hang.zhengyang1010@gmail.com>
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 19 out of 19 changed files in this pull request and generated 4 comments.


Comment thread cactus/models/model_whisper.cpp
Comment thread cactus/models/model_moonshine.cpp
Comment thread cactus/models/gemma4/model_gemma4_mm.cpp Outdated
Comment thread cactus/engine/engine.h
DuFanYin added 2 commits April 7, 2026 22:37
Signed-off-by: Hang Zhengyang <hang.zhengyang1010@gmail.com>
Signed-off-by: Hang Zhengyang <hang.zhengyang1010@gmail.com>
@HenryNdubuaku
Collaborator

@yujonglee wanna review this one? Thanks @DuFanYin !

@yujonglee
Contributor

I think most of the file touches are justified by the chosen plumbing path, especially for min_p.

That said, I don’t think the implementation is fully consistent yet:

  • Lfm2VlModel::decode_with_images() still samples directly via gb->sample(...) on the image path, so it bypasses the new min_p / repetition_penalty handling that now lives in Model::sample_token(). It also looks like the sampled token is not recorded back into history on that path.
  • Gemma4MmModel::decode_multimodal() now threads min_p, but it still samples directly instead of going through Model::sample_token(), so repetition_penalty is still not honored on that multimodal path.

…tion_penalty

Signed-off-by: Hang Zhengyang <hang.zhengyang1010@gmail.com>
@DuFanYin
Contributor Author

DuFanYin commented Apr 7, 2026

Fixed. Both multimodal paths now use language_model_.sample_token(...) plus record_sampled_token(...), matching the text decode path.

@HenryNdubuaku HenryNdubuaku merged commit e9cc468 into cactus-compute:main Apr 9, 2026
1 check passed
@DuFanYin DuFanYin deleted the hang/expose-minp-repetition-penalty branch April 9, 2026 17:52
DuFanYin added a commit to DuFanYin/cactus that referenced this pull request Apr 13, 2026
Signed-off-by: Hang Zhengyang <hang.zhengyang1010@gmail.com>
DuFanYin added a commit to DuFanYin/cactus that referenced this pull request Apr 13, 2026
Signed-off-by: Hang Zhengyang <hang.zhengyang1010@gmail.com>
Development

Successfully merging this pull request may close these issues.

Expose min_p and repetition_penalty in Cactus generation options

4 participants