Expose min_p and repetition_penalty in completion options #560
Fixes cactus-compute#554. Signed-off-by: Hang Zhengyang <hang.zhengyang1010@gmail.com>
Pull request overview
This PR exposes min_p and repetition_penalty as generation options via options_json, threads them through InferenceOptions and model decode APIs, and extends graph/kernel sampling to support min_p consistently across FP32/FP16.
Changes:
- Add `min_p`/`repetition_penalty` to `InferenceOptions` parsing and pass them through FFI decode/transcribe call paths.
- Extend `Model::decode*` and graph sampling APIs (`OpParams`, `sample_with_options`) to carry the new options while preserving backward compatibility.
- Update kernel sampling to support `min_p` on both FP32 and FP16 paths and remove FP16's global static token history.
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| docs/cactus_engine.md | Documents new min_p / repetition_penalty options and defaults. |
| cactus/models/model.h | Extends model decode virtuals to accept min_p / repetition_penalty. |
| cactus/models/model_whisper.cpp | Passes new options into shared sample_token path. |
| cactus/models/model_parakeet.cpp | Adds new parameters (currently unused for this model). |
| cactus/models/model_parakeet_tdt.cpp | Adds new parameters (currently unused for this model). |
| cactus/models/model_moonshine.cpp | Passes new options into shared sample_token path. |
| cactus/models/model_lfm2vl.cpp | Threads options through VLM wrapper to underlying language model decode. |
| cactus/models/gemma4/model_gemma4.h | Extends multimodal Gemma4 decode APIs to accept new options. |
| cactus/models/gemma4/model_gemma4_mm.cpp | Uses graph sampling with new options and tracks token history for multimodal path. |
| cactus/kernel/kernel.h | Adds _ex sampler APIs with min_p / repetition_penalty parameters. |
| cactus/kernel/kernel_nn.cpp | Implements _ex samplers; adds FP16 min_p; removes static FP16 repetition state. |
| cactus/graph/graph.h | Extends OpParams and adds sample_with_options API. |
| cactus/graph/graph_ops_sample.cpp | Routes sampling node execution through _ex kernel APIs. |
| cactus/graph/graph_builder.cpp | Implements sample_with_options; keeps sample() behavior via forwarding defaults. |
| cactus/ffi/cactus_utils.h | Adds new fields to InferenceOptions and parses them from options_json. |
| cactus/ffi/cactus_transcribe.cpp | Forwards new options into audio decode calls; keeps language probe greedy/option-independent. |
| cactus/ffi/cactus_complete.cpp | Forwards new options into decode and audio decode paths. |
| cactus/engine/engine.h | Extends base Model decode APIs; clears token history on reset_cache(). |
| cactus/engine/engine_model.cpp | Implements repetition penalty as a logit-bias adjustment; threads min_p into sampling. |
Force-pushed from 9ce35db to 482882b.
Copilot reviewed 19 out of 19 changed files in this pull request and generated 4 comments.
@yujonglee wanna review this one? Thanks @DuFanYin!
I think most of the file touches are justified by the chosen plumbing path. That said, I don't think the implementation is fully consistent yet:
Fixed. They now use `language_model_.sample_token(...)` plus `record_sampled_token(...)`, matching text decode.
Fixes #554
Approach
Did not change the public C function signatures: new fields live in `options_json` and flow through `InferenceOptions`. Internal C++ (`Model::decode`, `decode_with_images`, `decode_with_audio`, `sample_token`, graph `OpParams`) gains parameters with defaults, so existing call sites keep building without changes.
What changed
- `InferenceOptions`: add `min_p` (default `0.15`) and `repetition_penalty` (default `1.1`); extend `parse_inference_options_json` to parse both keys, keeping defaults when omitted.
- Update `decode`/`decode_with_audio` call sites in `cactus_complete` and `cactus_transcribe`, so the expanded `Model::decode*` signatures receive the right arguments from `options_json`.
- Extend `OpParams`; add `CactusGraph::sample_with_options`. Existing `sample()` forwards to it with the prior implicit defaults, so unmodified call sites are unaffected.
- Add `cactus_sample_f32_ex`/`cactus_sample_f16_ex`; keep `cactus_sample_f32`/`cactus_sample_f16` as wrappers for backward compatibility.
- `min_p`: now applied on both FP32 and FP16 paths (FP16 previously skipped this). Standard rule: after temperature / top-k / bias, compute probabilities and drop logits below `max_prob * min_p`.
- `repetition_penalty`: implemented in `Model::sample_token` via the existing logit-bias map, subtracting `log(penalty)` for each token in `token_history_`. History is bounded to 128 tokens per instance and cleared on `reset_cache()`.
- Remove `static token_history` from the FP16 sampler (process-wide state, broken for multiple concurrent sessions).
- `docs/cactus_engine.md`: document both options and their defaults.
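The min_p rule described above (drop everything below `max_prob * min_p` after temperature/top-k/bias) can be sketched in isolation. The function name `apply_min_p` is illustrative, not the kernel's actual symbol; masked tokens are set to negative infinity so any subsequent softmax/sampling step assigns them zero probability.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

// Illustrative sketch of min_p filtering: convert logits to probabilities
// and mask any token whose probability is below max_prob * min_p.
void apply_min_p(std::vector<float>& logits, float min_p) {
    if (min_p <= 0.0f || logits.empty()) return;

    // Softmax, shifted by the max logit for numerical stability.
    float max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<float> probs(logits.size());
    float sum = 0.0f;
    for (size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp(logits[i] - max_logit);
        sum += probs[i];
    }
    float max_prob = 0.0f;
    for (float& p : probs) {
        p /= sum;
        max_prob = std::max(max_prob, p);
    }

    // Mask tokens below the relative threshold.
    float threshold = max_prob * min_p;
    for (size_t i = 0; i < probs.size(); ++i) {
        if (probs[i] < threshold)
            logits[i] = -std::numeric_limits<float>::infinity();
    }
}
```

Because the threshold is relative to the top probability, a confident distribution prunes aggressively while a flat one keeps most candidates, which is the usual rationale for min_p over a fixed top-p cutoff.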
Design notes for reviewers
- `repetition_penalty` is intentionally a no-op at the kernel level. `cactus_sample_f32_ex`/`cactus_sample_f16_ex` accept the parameter but discard it (`(void)`). The penalty is fully applied upstream in `Model::sample_token` as a logit bias before the graph executes; the kernel parameter exists for API symmetry and future direct-graph use cases. This is not a bug.
- `sample()` now implicitly activates `min_p=0.15`. The old `sample()` had a hardcoded `constexpr float min_p = 0.15f` internally; the new `sample()` forwards to `sample_with_options` with the same value, so behavior is unchanged for existing callers.
- The Whisper language probe remains greedy and option-independent. The initial language probe calls `decode_with_audio` with `temperature=0`, `top_p=0`, `top_k=0`, and explicitly passes `min_p=0`/`repetition_penalty=1`, so user decoding settings do not affect language detection.
- Known limitation (pre-existing): some multimodal decode paths sample directly from graph nodes (without `Model::sample_token`), so repetition behavior is not fully unified with the standard text decode path in this PR.
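The logit-bias formulation of the repetition penalty can be sketched as follows. Type and method names (`RepetitionState`, `record`, `apply`, `reset`) are illustrative, not the engine's actual API; the mechanics match the description above: one `-log(penalty)` bias per distinct token in a per-instance history capped at 128 entries, cleared alongside `reset_cache()`.

```cpp
#include <cmath>
#include <deque>
#include <unordered_map>
#include <vector>

// Illustrative per-instance state: no process-wide statics, so multiple
// concurrent sessions each keep their own history.
constexpr size_t kMaxHistory = 128;

struct RepetitionState {
    std::deque<int> history;

    void record(int token) {
        history.push_back(token);
        if (history.size() > kMaxHistory) history.pop_front();
    }

    void reset() { history.clear(); }  // mirrors clearing on reset_cache()

    // Subtract log(penalty) from the logit of each recently seen token,
    // via a bias map (one entry per distinct token).
    void apply(std::vector<float>& logits, float penalty) const {
        if (penalty == 1.0f) return;  // penalty of 1.0 is a no-op
        std::unordered_map<int, float> bias;
        for (int tok : history) bias[tok] = -std::log(penalty);
        for (const auto& [tok, b] : bias) logits[tok] += b;
    }
};
```

With `penalty > 1`, `log(penalty) > 0`, so seen tokens lose logit mass and become less likely; `penalty = 1` leaves logits untouched, matching the "disabled" default semantics.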