
Conversation

@KKkai0315 (Contributor) commented Nov 23, 2025

  1. Fixed the MiniCPM-O tokenization logic: the convertMessage function previously could not encode images and audio simultaneously

  2. Refactored the code under examples/minicpm_o/, splitting it into:

    • main.cpp (omni end-to-end inference)
    • audio_infer.cpp (audio + text inference)
    • image_infer.cpp (image + text inference)

Summary by CodeRabbit

  • New Features

    • Full MiniCPM4 support: model, config, tokenizer, and inference integration.
    • Enhanced audio for MiniCPMO: mel-spectrogram preprocessing, Whisper-style encoder, and audio-aware tokenization/merge.
    • New 1D average-pooling layer for models.
  • Examples

    • New MiniCPM4 interactive CLI example.
    • Updated MiniCPMO examples: separate image and audio inference programs, plus streamlined run flows.


@coderabbitai bot (Contributor) commented Nov 23, 2025

Walkthrough

Adds MiniCPM4 model support and an example; adds MiniCPMO audio/Whisper processing and refactors its examples; implements a 1D AvgPool operation across the nn/aops/backend layers with a CPU implementation and factory registration; and updates various build targets along with minor fixes.

Changes

  • Build Configuration
    Files: examples/CMakeLists.txt
    Commented out multiple example subdirectories (qwen2*, qwen3, llama, etc.), activated minicpm4 in place of the previous minicpm_o entry, preserved the qwen_npu conditional, and added a tracy_example conditional.
  • MiniCPM4 Example
    Files: examples/minicpm4/CMakeLists.txt, examples/minicpm4/main.cpp, examples/minicpm4/config_minicpm4.json
    New example executable mllm-minicpm4-runner, an interactive CLI app for MiniCPM4, and a JSON config describing model hyperparameters and RoPE/longrope scaling.
  • MiniCPM4 Model & Tokenizer
    Files: mllm/models/minicpm4/... (configuration_minicpm4.hpp, modeling_minicpm4.hpp, tokenization_minicpm4.hpp)
    Added the MiniCPM4 config loader, tokenizer, transformer components (MLP, attention with RoPE/longrope), decoder stack, and a causal LM wrapper with KV cache and API surfaces.
  • MiniCPMO Example Split & New Examples
    Files: examples/minicpm_o/CMakeLists.txt, examples/minicpm_o/* (main.cpp, image_infer.cpp, audio_infer.cpp)
    Reorganized examples into separate main_image, main_audio, and main_minicpmo executables; added new image/audio inference programs and an updated main flow (including TTS/audio output).
  • AvgPool1d Operation (nn / aops / backend)
    Files: mllm/nn/layers/AvgPool1d.{hpp,cpp}, mllm/core/aops/AvgPool1dOp.{hpp,cpp}, mllm/core/OpTypes.hpp, mllm/backends/cpu/ops/AvgPool1dOp.{hpp,cpp}, mllm/backends/cpu/CPUBackend.cpp, mllm/nn/Nn.hpp
    Introduced the AvgPool1d layer, the aops definition/options, a CPU backend implementation with a parallelized forward pass, factory registration in CPUBackend, and the OpTypes mapping.
  • MiniCPMO2.6 Audio & Whisper Integration
    Files: mllm/models/minicpm_o2_6/audio_preprocessor_minicpmo.hpp, mllm/models/minicpm_o2_6/modeling_whisper_encoder.hpp, mllm/models/minicpm_o2_6/modeling_minicpmo.hpp, mllm/models/minicpm_o2_6/tokenization_minicpmo.hpp
    Added a Slaney mel spectrogram extractor and MelSpectrogramFeatures, a Whisper-style encoder, integration of the audio encoder/projection/avg-pool into the MiniCPMO model, and a tokenizer expanded to handle audio placeholders, audio bounds, and multimodal tensor assembly; removed the old MiniCPMO2_6 config header.
  • Streaming & Misc Model Fixes
    Files: mllm/models/minicpm_o2_6/streaming_generation.cpp, mllm/models/ARGeneration.cpp, mllm/models/vocos/modeling_vocos.hpp
    Adjusted the TTS streaming mask and min_new_token, applied a minor device-variable capture optimization in ARGeneration, and fixed a formatting newline.

Sequence Diagram(s)

sequenceDiagram
    participant User as User/CLI
    participant App as MiniCPM4 App
    participant Tokenizer as MiniCPM4Tokenizer
    participant Model as MiniCPM4ForCausalLM
    participant Backend as CPU Backend

    User->>App: Run with model/tokenizer/config paths
    App->>Tokenizer: Init tokenizer
    App->>Model: Init model & load weights
    loop chat
        App->>Tokenizer: encode(prompt)
        Tokenizer-->>App: token IDs
        App->>Model: forward(input_ids)
        Model->>Backend: compute attention/MLP (RoPE)
        Backend-->>Model: logits
        Model-->>App: token id (stream)
        App->>Tokenizer: decode(id)
        Tokenizer-->>App: text
    end
    App->>App: perf & memory report
sequenceDiagram
    participant User as User/CLI
    participant App as MiniCPMO Audio Infer
    participant AudioProc as Audio Processor
    participant Tokenizer as MiniCPMOTokenizer
    participant Model as MiniCPMOForCausalLM

    User->>App: Run with audio_path + model
    App->>AudioProc: processAudio(file) -> mel
    AudioProc-->>App: mel spectrogram
    App->>Tokenizer: build message with audio placeholder
    Tokenizer->>Tokenizer: replace placeholder -> audio token sequence
    Tokenizer-->>App: input tensors + audio_bounds
    App->>Model: forward(inputs, audio_features, audio_bounds)
    Model->>Model: encode_audio -> audio embeddings
    Model->>Model: merge_audio_text_embeddings
    Model-->>App: generated tokens (stream)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Attention areas requiring careful review:
    • MiniCPM4 modeling: RoPE/longrope math, attention KV cache correctness, tensor shapes across decode/prefill.
    • CPU AvgPool1d: stride/padding/ceil_mode/count_include_pad edge cases and parallelization correctness.
    • Audio pipeline: Slaney mel filterbank correctness, whisper encoder feature flow, and audio/text merging logic.
    • Tokenizer multimodal assembly: placeholder substitution, bounds calculation, and padding/stacking of image/audio tensors.
    • Build changes: CMake targets and example linking adjustments.

Possibly related PRs

Suggested reviewers

  • chenghuaWang
  • yirongjie
  • oreomaker

Poem

🐰 A rabbit hops through pooling and rhyme,
MiniCPM4 strides in long RoPE time,
Audio sings, images twine,
New ops and examples neatly align,
Code hums softly — merge when it's fine.

Pre-merge checks

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 9.52%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Title check ✅ Passed: The title clearly and specifically summarizes the main changes: fixing MiniCPM-O tokenization logic and streaming generation.
  • Description check ✅ Passed: The description covers the two primary objectives (the tokenization fix and the refactoring under examples/minicpm_o/), though it lacks the structured sections from the template.

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 10

🧹 Nitpick comments (22)
mllm/core/aops/AvgPool1dOp.cpp (2)

15-43: Validate pooling hyperparameters and guard against invalid output lengths

reshape assumes kernel_size > 0, stride > 0, and yields a positive pooled_length. With malformed options (e.g., stride == 0, kernel_size <= 0, or extreme kernel_size vs (length + 2 * padding) in ceil_mode), this can produce division-by-zero or non‑positive output sizes and eventually invalid tensor shapes.

Consider validating the options and failing fast before computing pooled_length, and also checking that the computed pooled_length is positive (or at least non‑negative) before allocating the output:

   const int kernel_size = options_.kernel_size;
   const int stride = options_.stride;
   const int padding = options_.padding;
+
+  if (kernel_size <= 0 || stride <= 0) {
+    MLLM_ERROR_EXIT(ExitCode::kCoreError,
+                    "AvgPool1dOp expects kernel_size > 0 and stride > 0, got kernel_size={}, stride={}",
+                    kernel_size, stride);
+  }
+
   int pooled_length;
   if (options_.ceil_mode) {
     pooled_length = static_cast<int>(std::ceil(static_cast<float>(length + 2 * padding - kernel_size) / stride)) + 1;
   } else {
     pooled_length = ((length + 2 * padding - kernel_size) / stride) + 1;
   }
+
+  if (pooled_length <= 0) {
+    MLLM_ERROR_EXIT(ExitCode::kCoreError,
+                    "AvgPool1dOp computed non-positive output length: {}", pooled_length);
+  }

This keeps failures explicit and avoids undefined behavior from bad configs.
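For a concrete sense of the failure modes (numbers are illustrative): with length = 10, kernel_size = 3, stride = 2, padding = 0, floor mode gives (10 + 0 - 3)/2 + 1 = 4 output frames; the same input with kernel_size = 12 gives (10 - 12)/2 + 1 = 0 in C++ integer arithmetic, an invalid shape, and stride = 0 divides by zero before either check.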


49-51: Make the base forward fail fast instead of silently doing nothing

Right now AvgPool1dOp::forward is a no‑op. If a caller accidentally instantiates or dispatches through this base op instead of a backend-specific subclass (like CPUAvgPool1dOp), the call will silently succeed while producing no output.

Consider turning this into an explicit error so mis-wired backends are caught immediately:

 void AvgPool1dOp::forward(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) {
-  // Forward will be dispatched to backend implementation
+  MLLM_ERROR_EXIT(ExitCode::kCoreError,
+                  "AvgPool1dOp::forward must be implemented by a backend-specific subclass");
 }

Alternatively, you could make forward pure virtual in the header if that fits the existing op hierarchy.

mllm/backends/cpu/ops/AvgPool1dOp.cpp (1)

12-36: Align dtype handling with pointer access and add light sanity checks

The forward path looks correct overall, but a couple of small robustness tweaks would make it safer:

  • input.ptr<float>() / output.ptr<float>() are called before the switch (output.dtype()). If ptr<T>() validates dtype internally, this may fail before reaching the NYI path for unsupported dtypes.
  • There’s no explicit assertion that input.dtype() == output.dtype() == kFloat32, which is what the implementation assumes.

You can tighten this up by moving pointer acquisition inside the kFloat32 case and asserting dtype consistency:

-  auto input_ptr = input.ptr<float>();
-  auto output_ptr = output.ptr<float>();
-
-  std::fill_n(output_ptr, output.numel(), 0.0f);
-
-  int total_iterations = batch * channels * pooled_length;
-
-  switch (output.dtype()) {
-    case kFloat32:
-      MLLM_CONDITIONAL_PARALLEL_FOR(options_.getThreads() > 1, 4, idx, 0, total_iterations, 1, {
+  int total_iterations = batch * channels * pooled_length;
+
+  switch (output.dtype()) {
+    case kFloat32: {
+      MLLM_ASSERT(input.dtype() == output.dtype(),
+                  "AvgPool1d: input/output dtype mismatch: {} vs {}", input.dtype(), output.dtype());
+      auto input_ptr = input.ptr<float>();
+      auto output_ptr = output.ptr<float>();
+      std::fill_n(output_ptr, output.numel(), 0.0f);
+
+      MLLM_CONDITIONAL_PARALLEL_FOR(options_.getThreads() > 1, 4, idx, 0, total_iterations, 1, {
         // Decode 1D index to 3D indices
         int b = idx / (channels * pooled_length);
         int c = (idx % (channels * pooled_length)) / pooled_length;
@@
         if (count > 0) {
           output_ptr[output_idx] = sum / static_cast<float>(count);
         }
-      });
-      break;
+      });
+      break;
+    }

This keeps unsupported dtypes routed cleanly to the NYI case and ensures the float-specific path only runs under the expected conditions.

examples/minicpm4/main.cpp (1)

55-59: Designated initializer requires C++20; consider more portable initialization

convertMessage({.prompt = prompt_text}); uses a C++20 designated initializer syntax for MiniCPM4Message. If the project is still built with C++17 (or older), this will fail to compile.

If you need C++17 compatibility, you can instead write:

-    auto inputs = minicpm4_tokenizer.convertMessage({.prompt = prompt_text});
+    MiniCPM4Message msg;
+    msg.prompt = prompt_text;
+    auto inputs = minicpm4_tokenizer.convertMessage(msg);

Please confirm the project’s C++ standard setting and adjust accordingly.

mllm/models/minicpm4/tokenization_minicpm4.hpp (3)

13-16: Make message_template const and guard the placeholder replacement

MiniCPM4Message::message_template is a mutable static inline std::string, and convertMessage assumes the {{{prompt}}} placeholder is always present:

auto applied_string = MiniCPM4Message::message_template;
size_t pos = applied_string.find("{{{prompt}}}");
applied_string.replace(pos, 12, message.prompt);

Two suggestions:

  1. Declare the template as const to avoid accidental runtime mutation:
-  static inline std::string message_template = "...";
+  static inline const std::string message_template = "...";
  2. Add a simple guard in case the placeholder is ever edited or localized:
auto applied_string = MiniCPM4Message::message_template;
const std::string placeholder = "{{{prompt}}}";
auto pos = applied_string.find(placeholder);
if (pos == std::string::npos) {
  // Fallback: append prompt or throw/log.
  applied_string += message.prompt;
} else {
  applied_string.replace(pos, placeholder.size(), message.prompt);
}

This makes the tokenizer more robust to future template changes.

Also applies to: 112-121


52-58: Avoid duplicating encode logic in convertMessage

Both encode and convertMessage perform the same pattern of tokenize(...) followed by bpe_._lookup_vocab:

// encode
auto tokens = tokenize(str);
for (const auto& token : tokens) { ids.push_back(bpe_._lookup_vocab(token)); }

// convertMessage
auto tokens = tokenize(applied_string);
for (const auto& token : tokens) { ids.push_back(bpe_._lookup_vocab(token)); }

To keep behavior consistent and easier to maintain, consider reusing encode inside convertMessage:

-    auto tokens = tokenize(applied_string);
-
-    std::vector<int64_t> ids;
-    ids.reserve(tokens.size());
-    for (const auto& token : tokens) { ids.push_back(bpe_._lookup_vocab(token)); }
+    std::vector<int64_t> ids = encode(applied_string);

This way, any future changes to encoding/tokenization only need to be made in one place.

Also applies to: 119-127


34-50: Normalize/decode logic is coherent; consider small cleanup and comments

The handling of the SentencePiece underscore (▁) in normalize, decode, decode_token, and decode_string is consistent and should round‑trip text correctly. To make future maintenance easier:

  • Add a brief comment near the hard‑coded "\xE2\x96\x81" explaining that it is the UTF‑8 encoding of ▁ (U+2581) used by SentencePiece.
  • Optionally factor the repeated "strip leading ▁ and replace remaining ▁ with spaces" logic into a small helper used by both decode and decode_string, as sketched after this item.

These are non‑blocking polish items; the current implementation is functionally sound.

Also applies to: 60-80, 135-170
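As a rough illustration of the second point, a shared helper could look like this (a minimal sketch; the function name is hypothetical, and decode would still strip the leading ▁ before calling it):

inline std::string replaceSentencePieceUnderscores(std::string text) {
  // "\xE2\x96\x81" is the UTF-8 encoding of ▁ (U+2581) used by SentencePiece.
  static const std::string kUnderscore = "\xE2\x96\x81";
  size_t pos = 0;
  while ((pos = text.find(kUnderscore, pos)) != std::string::npos) {
    text.replace(pos, kUnderscore.size(), " ");
    ++pos;  // advance past the single space we just inserted
  }
  return text;
}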

mllm/models/minicpm4/modeling_minicpm4.hpp (3)

37-84: RoPE embedding computation looks correct but is O(B * S * D); consider reuse if perf becomes an issue

The rotary embedding construction (freqs + duplicated sin/cos) is structurally correct for per‑head RoPE and matches expected shapes. It does, however, recompute the full [B, S, dim] embeddings every forward.

If you later profile MiniCPM4 and find RoPE dominating CPU time for long sequences, it may be worth:

  • Caching inv_freq/freqs per sequence length, or
  • Vectorizing the loops more aggressively.

Fine for now; just something to be aware of.
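If caching does become worthwhile, a per-length cache could be as small as the following standalone sketch (RoPECache and its method are hypothetical, not mllm API; the table layout assumes the duplicated sin/cos convention described above):

#include <cmath>
#include <unordered_map>
#include <vector>

struct RoPECache {
  std::unordered_map<int, std::vector<float>> cos_by_len;

  // Returns a [seq_len x dim] cos table, computing it at most once per length.
  const std::vector<float>& cosTable(int seq_len, int dim, float base = 10000.0f) {
    auto it = cos_by_len.find(seq_len);
    if (it != cos_by_len.end()) return it->second;
    std::vector<float> table(static_cast<size_t>(seq_len) * dim);
    for (int pos = 0; pos < seq_len; ++pos) {
      for (int i = 0; i < dim; ++i) {
        // Duplicated layout: dimensions 2j and 2j+1 share the same frequency.
        float inv_freq = std::pow(base, -2.0f * static_cast<float>(i / 2) / dim);
        table[static_cast<size_t>(pos) * dim + i] = std::cos(pos * inv_freq);
      }
    }
    return cos_by_len.emplace(seq_len, std::move(table)).first->second;
  }
};

A matching sinTable would follow the same pattern; whether the memory trade-off pays off depends on how varied the sequence lengths are in practice.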


268-274: Layer index wiring for KV cache is correct but brittle to external construction

Assigning:

for (auto [idx, b] : enumerate(decode_blocks_.list())) {
  b.self_attn_.layer_idx_ = idx;
}

correctly sets the per‑layer index used by StaticCache::updateKVCache. However, because layer_idx_ is public and only set here, creating MiniCPM4Attention instances elsewhere without doing this will break the cache logic.

You may want to:

  • Make layer_idx_ private with a setter, or
  • Pass layer_idx into MiniCPM4Attention’s constructor and initialize it there.

Not a blocker, but it would reduce footguns.
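The constructor-injection variant could be as small as this (sketch only; the real MiniCPM4Attention has many more members):

class MiniCPM4Attention final {
 public:
  explicit MiniCPM4Attention(int layer_idx) : layer_idx_(layer_idx) {}
  int layerIdx() const { return layer_idx_; }  // read-only access for the cache

 private:
  int layer_idx_;  // fixed at construction, so every instance is cache-safe
};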


331-349: Position id handling assumes seq_len == 1 for decode and shared lengths across batch

In the decode path:

if (input.count("position_ids")) {
  position_ids = input.at("position_ids");
  if (seq_len == 1) {
    auto last_pos = *position_ids.offsettedPtr<int64_t>({0, position_ids.shape()[1] - 1});
    position_ids = Tensor::empty({batch_size, 1}, kInt64, kCPU).alloc();
    *position_ids.offsettedPtr<int64_t>({0, 0}) = last_pos + 1;
  }
}

this:

  • Only updates a single scalar position for batch 0, and
  • Assumes decode is always called with seq_len == 1.

If your generation loop ever supports:

  • Multiple tokens per decode step, or
  • Different sequence lengths per batch element,

this logic will not generalize correctly.

If the intended contract is “decode is always called with a single new token per batch and synchronized lengths”, it would help to either assert this or extend the logic to compute per‑batch positions (e.g., reading last_pos per batch row).
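A per-batch extension of the existing snippet could read the last position of every row rather than only row 0 (sketch reusing the tensor calls already shown above; still assumes one new token per row):

auto next_positions = Tensor::empty({batch_size, 1}, kInt64, kCPU).alloc();
for (int b = 0; b < batch_size; ++b) {
  // Read each row's last position instead of hard-coding batch index 0.
  auto last_pos = *position_ids.offsettedPtr<int64_t>({b, position_ids.shape()[1] - 1});
  *next_positions.offsettedPtr<int64_t>({b, 0}) = last_pos + 1;
}
position_ids = next_positions;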

examples/minicpm_o/main.cpp (1)

131-132: Perf report filename is misleading

The perf report is saved as "minicpm4.perf" even though this executable is main_minicpmo. Renaming the report (e.g., to "minicpmo_omni.perf") would avoid confusion when inspecting traces from multiple examples.

-  mllm::perf::saveReport("minicpm4.perf");
+  mllm::perf::saveReport("minicpmo_omni.perf");
examples/minicpm_o/audio_infer.cpp (1)

46-51: Fail fast on unsupported model_version values

Right now, any unexpected --model_version silently falls back to kV1. It’s safer to validate and exit with an error when the flag is wrong so users immediately know they passed an invalid value.

-  mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
-  if (model_version.get() == "v1") {
-    file_version = mllm::ModelFileVersion::kV1;
-  } else if (model_version.get() == "v2") {
-    file_version = mllm::ModelFileVersion::kV2;
-  }
+  mllm::ModelFileVersion file_version;
+  if (model_version.get() == "v1") {
+    file_version = mllm::ModelFileVersion::kV1;
+  } else if (model_version.get() == "v2") {
+    file_version = mllm::ModelFileVersion::kV2;
+  } else {
+    fmt::print("Unsupported model_version '{}'; expected 'v1' or 'v2'.\n", model_version.get());
+    mllm::shutdownContext();
+    return 1;
+  }
examples/minicpm_o/image_infer.cpp (1)

28-39: Align perf tracing start/stop and avoid starting it on --help

With MLLM_PERFETTO_ENABLE, mllm::perf::start() is called before checking the --help flag, but perf::stop()/saveReport() only run in the execution block. If a user invokes --help, perf is started but never stopped or reported. It’s cleaner to start perf only when we’re actually running the model:

-  Argparse::parse(argc, argv);
-
-#ifdef MLLM_PERFETTO_ENABLE
-  mllm::perf::start();
-#endif
-
-  mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
+  Argparse::parse(argc, argv);
+
+  mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
@@
-  if (help.isSet()) {
+  if (help.isSet()) {
     Argparse::printHelp();
     mllm::shutdownContext();
     return 0;
   }
-  {
+  {
+#ifdef MLLM_PERFETTO_ENABLE
+    mllm::perf::start();
+#endif
@@
-#ifdef MLLM_PERFETTO_ENABLE
-    mllm::perf::stop();
-    mllm::perf::saveReport("minicpmo.perf");
-#endif
+#ifdef MLLM_PERFETTO_ENABLE
+    mllm::perf::stop();
+    mllm::perf::saveReport("minicpmo.perf");
+#endif
examples/minicpm_o/CMakeLists.txt (1)

3-15: CMake wiring for the three MiniCPMO examples looks consistent

The three executables (main_image, main_audio, main_minicpmo) are all linked against MllmRT and MllmCPUBackend with the shared include directory, which matches how the corresponding example sources are structured. If this list grows further, you could consider a small helper function to DRY up the repeated target_link_libraries/target_include_directories calls, but it’s not necessary right now.

mllm/models/minicpm_o2_6/audio_preprocessor_minicpmo.hpp (3)

167-208: Verify STFT → mel matmul and dynamic range steps for multi‑dimensional inputs

mel_specgram = nn::functional::matmul(specgram.T(), melscale_fbanks_).T(); assumes a particular STFT output layout (effectively 2D [freq_bins, frames]). If nn::STFT returns a 3D tensor like [B, freq_bins, frames] (which is typical), the semantics of .T() and matmul broadcasting need to be double‑checked to ensure you end up with [B, n_mels, n_frames] as intended.

Similarly, auto max_val = mel_specgram.max(); followed by a global threshold applies dynamic range compression over the entire tensor rather than per‑sample. If B > 1, this differs from the usual per‑example normalization.

I recommend:

  • Verifying the actual shapes from stft_ at runtime (or docs) and, if needed, making the mel transform explicit, e.g. mel_specgram = nn::functional::matmul(melscale_fbanks_.T(), specgram) or a batched [B, T, F] x [F, M] -> [B, T, M] transform.
  • Clarifying whether Whisper here is expected to use a per‑example max (e.g., over [freq, time] for each batch element) and adjusting the max() reduction accordingly.

Right now this is shape‑ and reduction‑sensitive logic that could silently misbehave if the STFT layout differs from what you assumed.


228-265: Placeholder length may not match encoded audio length due to fixed padding in processAudioData

getAudioPlaceholder computes output_lens from audio_length via hop length and two downsampling stages. However, processAudioData always pads or truncates to n_samples_ = chunk_length * sample_rate_ before STFT, so the encoder input length is effectively fixed (30s @ 16 kHz by default), independent of audio_length.

This can lead to:

  • output_lens < actual audio_seq_len from encode_audio, truncating embeddings when merging into text; or
  • A mismatch with how placeholders were generated during training if the model expects a fixed number of <unk> tokens per 30s chunk.

Consider either:

  • Basing output_lens on n_samples_ instead of audio_length (or min(audio_length, n_samples_)), and/or
  • Returning the actual encoded length from the audio pipeline and using that to derive unk count.

Also, the chunked path (chunk_input == true) uses a hardcoded fbank_feat_in_chunk = chunk_length * 100, which assumes 100 frames/sec; this should stay in sync with hop_length_ and your actual STFT config.
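A minimal sketch of the clamping option, reusing the names already present in getAudioPlaceholder (the downstream downsampling math is deliberately left untouched, since it must stay in sync with the encoder):

// Clamp to what the padded/truncated STFT input actually sees, then feed
// effective_len through the existing hop-length and downsampling formulas.
int effective_len = std::min(static_cast<int>(audio_length), n_samples_);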


268-325: Minor: factor shared file/data processing and include missing headers explicitly

processAudio and processAudioData duplicate padding/truncation and mel extraction logic. Not a bug, but you could reduce maintenance overhead by having the file‑based path reuse the raw‑data path.

Also, since this header uses std::numeric_limits<float>::max() and std::memcpy, it would be safer to explicitly include <limits> and <cstring> here instead of relying on transitive includes.

mllm/models/minicpm_o2_6/modeling_minicpmo.hpp (2)

242-266: Remove or gate noisy debug prints in encode_audio

encode_audio currently calls mllm::print multiple times to dump tensor shapes (“finish apm!”, “before audio avg pooler!”, etc.). These are useful while debugging but will be noisy in production and can slow down inference.

Consider either removing them or guarding behind a debug flag / log level:

-    mllm::print("finish apm!, tensor shape:", audio_states.shape());
+    // mllm::print("finish apm!, tensor shape:", audio_states.shape());
...
-    mllm::print("final audio embeds!, tensor shape:", audio_embeds.shape());
+    // mllm::print("final audio embeds!, tensor shape:", audio_embeds.shape());

or similar conditional logging.
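If you'd rather keep the traces available, one compile-time guard could look like this (MLLM_MINICPMO_DEBUG is a hypothetical flag, not an existing mllm define):

#ifdef MLLM_MINICPMO_DEBUG
#define MINICPMO_AUDIO_DBG(...) mllm::print(__VA_ARGS__)
#else
#define MINICPMO_AUDIO_DBG(...) ((void)0)  // compiles away by default
#endif

// encode_audio then keeps its traces without paying for them in release builds:
MINICPMO_AUDIO_DBG("finish apm!, tensor shape:", audio_states.shape());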


301-325: Batching assumptions in merge_audio_text_embeddings

merge_audio_text_embeddings assumes:

  • audio_embeddings.shape()[0] equals the number of audio segments (num_bounds), and
  • Each row audio_embeddings[bound_idx, ...] aligns one‑to‑one with audio_bounds[bound_idx].

This matches the current tokenizer path (single audio segment → audio_features with batch size 1 → one bounds pair), but will need care if you later support multiple audio clips per request (e.g., MiniCPMOInput::audio_paths).

Not an immediate bug, but consider documenting this assumption or asserting num_bounds == audio_embeddings.shape()[0] to catch future mismatches early.
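Mirroring the assert pattern suggested earlier for the CPU op, the guard could be a single line at the top of merge_audio_text_embeddings (illustrative; adjust to whatever assertion macro the codebase standardizes on):

MLLM_ASSERT(num_bounds == audio_embeddings.shape()[0],
            "merge_audio_text_embeddings: {} bounds vs {} audio embedding rows",
            num_bounds, audio_embeddings.shape()[0]);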

mllm/models/minicpm_o2_6/tokenization_minicpmo.hpp (3)

163-186: buildChatMessage now emits a consistent assistant preamble

The updated buildChatMessage always appends:

result += "<|im_start|>assistant\n";
if (gen_audio) { result += "<|spk_bos|><|spk|><|spk_eos|><|tts_bos|>"; }

after the user segment, which is aligned with your audio‑generation flow and the special tokens registered in the trie.

Note that the has_img and has_audio parameters are currently unused inside this method; they’re redundant given you read img_file_path / audio_file_path directly. Optional: either wire them into the logic (e.g., to support multi‑image/audio variants) or drop them from the signature to avoid confusion.


318-419: Image+audio are now processed in two clean phases; only single audio placeholder is supported

The refactored convertMessage:

  • Processes image/audio first (Phase 1),
  • Replaces image placeholders, then audio placeholders, in the same applied_string (Phases 2–3),
  • Tokenizes once and computes image/audio bounds on the final sequence, and
  • Passes everything to convertToTensors.

This fixes the previous limitation where image and audio could not be encoded together; the current ordering preserves the audio placeholder across the image‑replacement step, so both modalities co‑exist correctly.

Two minor limitations to be aware of:

  • Only the first occurrence of "(<audio>./</audio>)" is replaced:

    size_t audio_placeholder_pos = applied_string.find("(<audio>./</audio>)");
    if (audio_placeholder_pos != std::string::npos) {
      applied_string.replace(audio_placeholder_pos, 19, audio_placeholder);
    }

    If you ever support multiple audio segments per prompt, this should loop over all matches (see the sketch after this list).

  • The hard‑coded length 19 is brittle; better to derive it from the placeholder literal:

-    size_t audio_placeholder_pos = applied_string.find("(<audio>./</audio>)");
-    if (audio_placeholder_pos != std::string::npos) {
-      applied_string.replace(audio_placeholder_pos, 19, audio_placeholder);
-    }
+    constexpr char kAudioTag[] = "(<audio>./</audio>)";
+    size_t audio_placeholder_pos = applied_string.find(kAudioTag);
+    if (audio_placeholder_pos != std::string::npos) {
+      applied_string.replace(audio_placeholder_pos, std::strlen(kAudioTag), audio_placeholder);
+    }

Functionally this addresses the image+audio coexistence issue; the above are robustness improvements.
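For the multi-clip case flagged in the first bullet, the replacement could loop over every occurrence (sketch building on the kAudioTag constant above; assumes each clip expands to the same placeholder string and that <cstring> is included for std::strlen):

size_t pos = 0;
while ((pos = applied_string.find(kAudioTag, pos)) != std::string::npos) {
  applied_string.replace(pos, std::strlen(kAudioTag), audio_placeholder);
  pos += audio_placeholder.size();  // skip past the expansion we just inserted
}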

524-526: Audio preprocessor member is correctly introduced; consider future configurability

Adding:

// For audio only.
MiniCPMOAudioProcessor audio_preprocessor_;

to the tokenizer is a good encapsulation of audio‑specific logic. Right now it’s constructed with fixed Whisper defaults (16kHz, 80 mel bins, hop 160); if you foresee running variants of MiniCPMO with different audio front‑ends, you may eventually want to pass these from config or the constructor instead of hard‑coding.

For the current MiniCPMO‑2.6 setup, this is fine.
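If that configurability is ever needed, the front-end parameters could be forwarded from the tokenizer's constructor instead of hard-coded (illustrative only; parameter names and order are assumptions, not the actual MiniCPMOAudioProcessor signature):

explicit MiniCPMOTokenizer(const std::string& file_path,
                           int sample_rate = 16000, int n_mels = 80,
                           int hop_length = 160, int chunk_length = 30)
    : audio_preprocessor_(sample_rate, n_mels, hop_length, chunk_length) {
  // ... existing vocab/trie setup unchanged ...
}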

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 836aa97 and 518d66d.

📒 Files selected for processing (27)
  • examples/CMakeLists.txt (1 hunks)
  • examples/minicpm4/CMakeLists.txt (1 hunks)
  • examples/minicpm4/config_minicpm4.json (1 hunks)
  • examples/minicpm4/main.cpp (1 hunks)
  • examples/minicpm_o/CMakeLists.txt (1 hunks)
  • examples/minicpm_o/audio_infer.cpp (1 hunks)
  • examples/minicpm_o/image_infer.cpp (1 hunks)
  • examples/minicpm_o/main.cpp (1 hunks)
  • mllm/backends/cpu/CPUBackend.cpp (2 hunks)
  • mllm/backends/cpu/ops/AvgPool1dOp.cpp (1 hunks)
  • mllm/backends/cpu/ops/AvgPool1dOp.hpp (1 hunks)
  • mllm/core/OpTypes.hpp (2 hunks)
  • mllm/core/aops/AvgPool1dOp.cpp (1 hunks)
  • mllm/core/aops/AvgPool1dOp.hpp (1 hunks)
  • mllm/models/ARGeneration.cpp (2 hunks)
  • mllm/models/minicpm4/configuration_minicpm4.hpp (1 hunks)
  • mllm/models/minicpm4/modeling_minicpm4.hpp (1 hunks)
  • mllm/models/minicpm4/tokenization_minicpm4.hpp (1 hunks)
  • mllm/models/minicpm_o2_6/audio_preprocessor_minicpmo.hpp (1 hunks)
  • mllm/models/minicpm_o2_6/configuration_minicpm_o2_6.hpp (0 hunks)
  • mllm/models/minicpm_o2_6/modeling_minicpmo.hpp (8 hunks)
  • mllm/models/minicpm_o2_6/modeling_whisper_encoder.hpp (1 hunks)
  • mllm/models/minicpm_o2_6/tokenization_minicpmo.hpp (10 hunks)
  • mllm/models/vocos/modeling_vocos.hpp (1 hunks)
  • mllm/nn/Nn.hpp (1 hunks)
  • mllm/nn/layers/AvgPool1d.cpp (1 hunks)
  • mllm/nn/layers/AvgPool1d.hpp (1 hunks)
💤 Files with no reviewable changes (1)
  • mllm/models/minicpm_o2_6/configuration_minicpm_o2_6.hpp
🧰 Additional context used
🧬 Code graph analysis (9)
mllm/models/ARGeneration.cpp (2)
mllm/core/Tensor.cpp (4)
  • device (351-351)
  • device (351-351)
  • empty (70-74)
  • empty (70-70)
mllm/core/Tensor.hpp (1)
  • device (395-395)
mllm/nn/layers/AvgPool1d.hpp (1)
mllm/backends/cpu/ops/AvgPool1dOp.hpp (2)
  • options (20-22)
  • options (20-20)
mllm/nn/layers/AvgPool1d.cpp (2)
mllm/nn/layers/AvgPool1d.hpp (3)
  • AvgPool1d (11-11)
  • AvgPool1d (13-14)
  • AvgPool1d (16-16)
mllm/backends/cpu/ops/AvgPool1dOp.hpp (2)
  • options (20-22)
  • options (20-20)
mllm/backends/cpu/ops/AvgPool1dOp.cpp (3)
mllm/backends/cpu/ops/AvgPool1dOp.hpp (4)
  • CPUAvgPool1dOp (13-13)
  • options (20-22)
  • options (20-20)
  • inputs (15-15)
mllm/core/aops/AvgPool1dOp.cpp (2)
  • forward (49-51)
  • forward (49-49)
mllm/core/aops/AvgPool1dOp.hpp (3)
  • inputs (24-24)
  • inputs (26-26)
  • inputs (28-28)
mllm/backends/cpu/ops/AvgPool1dOp.hpp (1)
mllm/core/aops/AvgPool1dOp.hpp (3)
  • inputs (24-24)
  • inputs (26-26)
  • inputs (28-28)
mllm/core/aops/AvgPool1dOp.hpp (1)
mllm/backends/cpu/ops/AvgPool1dOp.hpp (3)
  • options (20-22)
  • options (20-20)
  • inputs (15-15)
examples/minicpm_o/main.cpp (7)
mllm/models/minicpm_o2_6/modeling_minicpmo.hpp (10)
  • config (140-154)
  • config (140-140)
  • inputs (33-39)
  • inputs (33-33)
  • inputs (56-58)
  • inputs (56-56)
  • inputs (75-81)
  • inputs (75-75)
  • inputs (161-240)
  • inputs (161-161)
mllm/models/minicpm_o2_6/configuration_minicpmo.hpp (3)
  • MiniCPMOConfig (11-11)
  • MiniCPMOConfig (13-54)
  • MiniCPMOConfig (13-13)
mllm/models/minicpm_o2_6/tokenization_minicpmo.hpp (2)
  • MiniCPMOTokenizer (199-261)
  • MiniCPMOTokenizer (199-199)
mllm/models/minicpm_o2_6/image_preprocessor_minicpmo.hpp (2)
  • image_path (253-283)
  • image_path (253-254)
mllm/models/minicpm_o2_6/modeling_siglip.hpp (4)
  • inputs (40-154)
  • inputs (40-40)
  • inputs (178-186)
  • inputs (178-178)
mllm/models/minicpm_o2_6/modeling_resampler.hpp (4)
  • inputs (70-168)
  • inputs (70-70)
  • inputs (211-351)
  • inputs (211-211)
mllm/nn/Functional.cpp (2)
  • concat (45-47)
  • concat (45-45)
mllm/models/minicpm_o2_6/modeling_minicpmo.hpp (2)
mllm/models/minicpm_o2_6/modeling_siglip.hpp (12)
  • inputs (40-154)
  • inputs (40-40)
  • inputs (178-186)
  • inputs (178-178)
  • inputs (216-253)
  • inputs (216-216)
  • inputs (277-298)
  • inputs (277-277)
  • inputs (315-322)
  • inputs (315-315)
  • inputs (346-433)
  • inputs (346-346)
mllm/models/minicpm_o2_6/modeling_resampler.hpp (4)
  • inputs (70-168)
  • inputs (70-70)
  • inputs (211-351)
  • inputs (211-211)
mllm/models/minicpm_o2_6/audio_preprocessor_minicpmo.hpp (4)
mllm/models/minicpm_o2_6/modeling_minicpmo.hpp (8)
  • inputs (33-39)
  • inputs (33-33)
  • inputs (56-58)
  • inputs (56-56)
  • inputs (75-81)
  • inputs (75-75)
  • inputs (161-240)
  • inputs (161-161)
mllm/models/minicpm_o2_6/modeling_whisper_encoder.hpp (6)
  • inputs (39-73)
  • inputs (39-39)
  • inputs (108-130)
  • inputs (108-108)
  • inputs (177-207)
  • inputs (177-177)
mllm/models/minicpm_o2_6/tokenization_minicpmo.hpp (1)
  • nodiscard (163-186)
mllm/models/minicpm_o2_6/streaming_generation.hpp (1)
  • mel_spec (203-203)
🪛 Clang (14.0.6)
mllm/nn/layers/AvgPool1d.hpp

[error] 3-3: 'cstdint' file not found

(clang-diagnostic-error)

mllm/nn/layers/AvgPool1d.cpp

[error] 1-1: 'mllm/nn/layers/AvgPool1d.hpp' file not found

(clang-diagnostic-error)

mllm/backends/cpu/ops/AvgPool1dOp.cpp

[error] 4-4: 'mllm/backends/cpu/ops/AvgPool1dOp.hpp' file not found

(clang-diagnostic-error)

mllm/backends/cpu/ops/AvgPool1dOp.hpp

[error] 6-6: 'mllm/core/BaseOp.hpp' file not found

(clang-diagnostic-error)

mllm/core/aops/AvgPool1dOp.hpp

[error] 6-6: 'cstdint' file not found

(clang-diagnostic-error)


[error] 12-12: constructor does not initialize these fields: kernel_size, stride

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 33-33: member variable 'options_' has protected visibility

(cppcoreguidelines-non-private-member-variables-in-classes,-warnings-as-errors)

mllm/models/minicpm4/configuration_minicpm4.hpp

[error] 5-5: 'mllm/core/aops/LinearOp.hpp' file not found

(clang-diagnostic-error)


[error] 11-11: constructor does not initialize these fields: short_factor, long_factor

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 14-14: constructor does not initialize these fields: short_factor, long_factor

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 16-16: 'attention_bias' should be initialized in a member initializer of the constructor

(cppcoreguidelines-prefer-member-initializer,-warnings-as-errors)


[error] 23-23: 'rms_norm_eps' should be initialized in a member initializer of the constructor

(cppcoreguidelines-prefer-member-initializer,-warnings-as-errors)


[error] 28-28: if with identical then and else branches

(bugprone-branch-clone,-warnings-as-errors)


[error] 71-71: 1e-05 is a magic number; consider replacing it with a named constant

(cppcoreguidelines-avoid-magic-numbers,readability-magic-numbers,-warnings-as-errors)


[error] 77-77: 10000.0 is a magic number; consider replacing it with a named constant

(cppcoreguidelines-avoid-magic-numbers,readability-magic-numbers,-warnings-as-errors)


[error] 83-83: 12.0 is a magic number; consider replacing it with a named constant

(cppcoreguidelines-avoid-magic-numbers,readability-magic-numbers,-warnings-as-errors)


[error] 84-84: 256.0 is a magic number; consider replacing it with a named constant

(cppcoreguidelines-avoid-magic-numbers,readability-magic-numbers,-warnings-as-errors)


[error] 85-85: 1.4 is a magic number; consider replacing it with a named constant

(cppcoreguidelines-avoid-magic-numbers,readability-magic-numbers,-warnings-as-errors)

examples/minicpm_o/main.cpp

[error] 4-4: 'fmt/core.h' file not found

(clang-diagnostic-error)


[error] 18-18: variable 'MLLM_MAIN' is non-const and globally accessible, consider making it const

(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)


[error] 18-18: invalid case style for variable 'MLLM_MAIN'

(readability-identifier-naming,-warnings-as-errors)

mllm/models/minicpm_o2_6/modeling_minicpmo.hpp

[error] 48-48: constructor does not initialize these fields: avg_pool_

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 52-52: constructor does not initialize these fields: avg_pool_

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 52-52: 3 adjacent parameters of 'AudioAvgPooler' of similar type are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 56-56: 2 adjacent parameters of 'forward' of similar type ('const int &') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 113-113: variable 'audio_output_dim' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 128-128: member variable 'apm_' has public visibility

(cppcoreguidelines-non-private-member-variables-in-classes,-warnings-as-errors)


[error] 129-129: member variable 'audio_projection_layer_' has public visibility

(cppcoreguidelines-non-private-member-variables-in-classes,-warnings-as-errors)


[error] 130-130: member variable 'audio_avg_pooler_' has public visibility

(cppcoreguidelines-non-private-member-variables-in-classes,-warnings-as-errors)


[error] 192-192: variable 'audio_bounds' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 301-301: 3 adjacent parameters of 'merge_audio_text_embeddings' of similar type ('int &') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)

examples/minicpm4/main.cpp

[error] 1-1: 'iostream' file not found

(clang-diagnostic-error)


[error] 11-11: variable 'MLLM_MAIN' is non-const and globally accessible, consider making it const

(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)


[error] 11-11: invalid case style for variable 'MLLM_MAIN'

(readability-identifier-naming,-warnings-as-errors)

examples/minicpm_o/image_infer.cpp

[error] 1-1: 'iostream' file not found

(clang-diagnostic-error)


[error] 11-11: variable 'MLLM_MAIN' is non-const and globally accessible, consider making it const

(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)


[error] 11-11: invalid case style for variable 'MLLM_MAIN'

(readability-identifier-naming,-warnings-as-errors)

mllm/models/minicpm4/tokenization_minicpm4.hpp

[error] 4-4: 'vector' file not found

(clang-diagnostic-error)


[error] 13-13: constructor does not initialize these fields: prompt

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 15-15: variable 'message_template' is non-const and globally accessible, consider making it const

(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)


[error] 20-20: constructor does not initialize these fields: bpe_

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 35-35: variable 'normalized' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 61-61: variable 'result' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 68-68: variable 'decoded' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 107-107: variable 'result' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 114-114: variable 'pos' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 123-123: variable 'sequence' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 136-136: variable 'decoded' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 154-154: variable 'result' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 158-158: variable 'decoded' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)

mllm/models/minicpm4/modeling_minicpm4.hpp

[error] 4-4: 'mllm/mllm.hpp' file not found

(clang-diagnostic-error)


[error] 16-16: 4 adjacent parameters of 'makeRoPEInvFreqWithLongRoPE' of convertible types are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 32-32: floating point literal has suffix 'f', which is not uppercase

(readability-uppercase-literal-suffix,-warnings-as-errors)


[error] 33-33: variable 'scale' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 37-37: 2 adjacent parameters of 'makeRotaryPosEmbedding' of convertible types are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 86-86: constructor does not initialize these fields: gate_proj_, up_proj_, down_proj_, silu_

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 94-94: constructor does not initialize these fields: gate_proj_, up_proj_, down_proj_, silu_

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 94-94: 2 adjacent parameters of 'MiniCPM4MLP' of similar type ('const int &') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 101-101: 2 adjacent parameters of 'forward' of similar type ('const int &') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 103-103: variable name 'x' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 105-105: variable name 'y' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 112-112: constructor does not initialize these fields: q_proj_, k_proj_, v_proj_, o_proj_, q_rope_, k_rope_, mask_, softmax_, hidden_size_, head_dim_, num_attention_heads_, num_key_value_heads_, num_key_value_groups_, layer_idx_

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 131-131: constructor does not initialize these fields: q_proj_, k_proj_, v_proj_, o_proj_, q_rope_, k_rope_, mask_, softmax_, hidden_size_, num_attention_heads_, num_key_value_heads_, layer_idx_

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 131-131: 2 adjacent parameters of 'MiniCPM4Attention' of similar type ('const int &') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 136-136: 'num_key_value_groups_' should be initialized in a member initializer of the constructor

(cppcoreguidelines-prefer-member-initializer,-warnings-as-errors)


[error] 152-152: 2 adjacent parameters of 'forward' of similar type ('const int &') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 153-153: variable name 'x' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 163-163: variable 'B' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 163-163: variable name 'B' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 163-163: invalid case style for variable 'B'

(readability-identifier-naming,-warnings-as-errors)


[error] 164-164: variable 'S' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 164-164: variable name 'S' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 164-164: invalid case style for variable 'S'

(readability-identifier-naming,-warnings-as-errors)


[error] 186-186: variable 'attn' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 187-187: repeated branch in conditional chain

(bugprone-branch-clone,-warnings-as-errors)


[error] 209-209: member variable 'layer_idx_' has public visibility

(cppcoreguidelines-non-private-member-variables-in-classes,-warnings-as-errors)


[error] 212-212: constructor does not initialize these fields: input_layer_norm_, post_attention_layer_norm_, scale_depth_, num_hidden_layers_

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 224-224: constructor does not initialize these fields: input_layer_norm_, post_attention_layer_norm_, scale_depth_, num_hidden_layers_

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 224-224: 2 adjacent parameters of 'MiniCPM4Decoder' of similar type ('const int &') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 234-234: 2 adjacent parameters of 'forward' of similar type ('const int &') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 240-240: variable 'residual_scale' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 244-244: variable name 'x' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 259-259: constructor does not initialize these fields: decode_blocks_, norm_, embedding_, scale_emb_

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 268-268: constructor does not initialize these fields: decode_blocks_, norm_, embedding_, scale_emb_

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 268-268: 2 adjacent parameters of 'MiniCPM4Text' of similar type ('const int &') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 276-276: 2 adjacent parameters of 'forward' of similar type ('const int &') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 280-280: variable name 'x' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 282-282: floating point literal has suffix 'f', which is not uppercase

(readability-uppercase-literal-suffix,-warnings-as-errors)


[error] 298-298: constructor does not initialize these fields: lm_head_, tie_word_embeddings_, kv_cache_, longrope_scaling_factor_

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 322-322: 2 adjacent parameters of 'forward' of similar type ('const int &') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 331-331: variable 'position_ids' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 358-358: variable name 'S' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 358-358: invalid case style for variable 'S'

(readability-identifier-naming,-warnings-as-errors)


[error] 365-365: variable 'scale_factor' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)

mllm/core/aops/AvgPool1dOp.cpp

[error] 15-15: method 'reshape' can be made static

(readability-convert-member-functions-to-static,-warnings-as-errors)


[error] 16-16: variable name 'i' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 25-25: variable 'batch' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 26-26: variable 'channels' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 27-27: variable 'length' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 29-29: variable 'kernel_size' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 30-30: variable 'stride' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 31-31: variable 'padding' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 33-33: variable 'pooled_length' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)

mllm/models/minicpm_o2_6/audio_preprocessor_minicpmo.hpp

[error] 5-5: 'string' file not found

(clang-diagnostic-error)


[error] 19-19: floating point literal has suffix 'f', which is not uppercase

(readability-uppercase-literal-suffix,-warnings-as-errors)


[error] 20-20: floating point literal has suffix 'f', which is not uppercase

(readability-uppercase-literal-suffix,-warnings-as-errors)


[error] 21-21: variable 'logstep' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 24-24: 3.0f is a magic number; consider replacing it with a named constant

(cppcoreguidelines-avoid-magic-numbers,readability-magic-numbers,-warnings-as-errors)


[error] 24-24: floating point literal has suffix 'f', which is not uppercase

(readability-uppercase-literal-suffix,-warnings-as-errors)


[error] 24-24: 200.0f is a magic number; consider replacing it with a named constant

(cppcoreguidelines-avoid-magic-numbers,readability-magic-numbers,-warnings-as-errors)


[error] 24-24: floating point literal has suffix 'f', which is not uppercase

(readability-uppercase-literal-suffix,-warnings-as-errors)


[error] 25-25: do not use 'else' after 'return'

(readability-else-after-return,-warnings-as-errors)


[error] 32-32: floating point literal has suffix 'f', which is not uppercase

(readability-uppercase-literal-suffix,-warnings-as-errors)


[error] 33-33: floating point literal has suffix 'f', which is not uppercase

(readability-uppercase-literal-suffix,-warnings-as-errors)


[error] 34-34: variable 'logstep' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 37-37: 200.0f is a magic number; consider replacing it with a named constant

(cppcoreguidelines-avoid-magic-numbers,readability-magic-numbers,-warnings-as-errors)


[error] 37-37: floating point literal has suffix 'f', which is not uppercase

(readability-uppercase-literal-suffix,-warnings-as-errors)


[error] 37-37: 3.0f is a magic number; consider replacing it with a named constant

(cppcoreguidelines-avoid-magic-numbers,readability-magic-numbers,-warnings-as-errors)


[error] 37-37: floating point literal has suffix 'f', which is not uppercase

(readability-uppercase-literal-suffix,-warnings-as-errors)


[error] 38-38: do not use 'else' after 'return'

(readability-else-after-return,-warnings-as-errors)


[error] 45-45: variable 'length' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 47-47: variable 'window_ptr' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 61-61: 3 adjacent parameters of 'create_mel_filterbank' of convertible types are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 62-62: 2 adjacent parameters of 'create_mel_filterbank' of convertible types are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 85-85: variable 'filters_ptr' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 128-128: constructor does not initialize these fields: n_fft_, hop_length_, win_length_, n_mels_, padding_, power_, stft_, window_, melscale_fbanks_

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 143-143: constructor does not initialize these fields: n_fft_, hop_length_, win_length_, n_mels_, padding_, stft_, window_, melscale_fbanks_

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 143-143: 7 adjacent parameters of 'MelSpectrogramFeatures' of similar type are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 143-143: 24000 is a magic number; consider replacing it with a named constant

(cppcoreguidelines-avoid-magic-numbers,readability-magic-numbers,-warnings-as-errors)


[error] 143-143: 1024 is a magic number; consider replacing it with a named constant

(cppcoreguidelines-avoid-magic-numbers,readability-magic-numbers,-warnings-as-errors)


[error] 144-144: 256 is a magic number; consider replacing it with a named constant

(cppcoreguidelines-avoid-magic-numbers,readability-magic-numbers,-warnings-as-errors)


[error] 144-144: 100 is a magic number; consider replacing it with a named constant

(cppcoreguidelines-avoid-magic-numbers,readability-magic-numbers,-warnings-as-errors)


[error] 167-167: 2 adjacent parameters of 'forward' of similar type ('const int &') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 201-201: variable 'threshold' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 220-220: constructor does not initialize these fields: sample_rate_, hop_length_, chunk_length_, n_samples_

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 220-220: 4 adjacent parameters of 'MiniCPMOAudioProcessor' of similar type ('int') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 220-220: 16000 is a magic number; consider replacing it with a named constant

(cppcoreguidelines-avoid-magic-numbers,readability-magic-numbers,-warnings-as-errors)


[error] 220-220: 80 is a magic number; consider replacing it with a named constant

(cppcoreguidelines-avoid-magic-numbers,readability-magic-numbers,-warnings-as-errors)


[error] 220-220: 160 is a magic number; consider replacing it with a named constant

(cppcoreguidelines-avoid-magic-numbers,readability-magic-numbers,-warnings-as-errors)


[error] 221-221: 30 is a magic number; consider replacing it with a named constant

(cppcoreguidelines-avoid-magic-numbers,readability-magic-numbers,-warnings-as-errors)


[error] 230-230: variable 'pool_step' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 233-233: variable 'feature_lens' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 239-239: variable 'output_lens' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 241-241: variable 'audio_placeholder' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 245-245: variable 'fbank_feat_in_chunk' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 246-246: variable 'cnn_feat_in_chunk' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 247-247: variable 'audio_embeds_in_chunk' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 248-248: variable 'num_audio_chunks' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 250-250: variable 'total_unk_len' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 276-276: variable 'audio_length' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 277-277: variable 'padded_length' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 280-280: variable 'audio_ptr' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 282-282: if with identical then and else branches

(bugprone-branch-clone,-warnings-as-errors)


[error] 307-307: variable 'padded_length' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 310-310: variable 'audio_ptr' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 312-312: if with identical then and else branches

(bugprone-branch-clone,-warnings-as-errors)

mllm/models/minicpm_o2_6/tokenization_minicpmo.hpp

[error] 333-333: variable 'audio_features' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 374-374: variable 'final_text' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 386-386: variable 'audio_placeholder' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 387-387: variable 'audio_placeholder_pos' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 411-411: variable 'audio_start_id' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 412-412: variable 'audio_end_id' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 422-422: 2 adjacent parameters of 'convertToTensors' of similar type ('int &') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 423-423: all parameters should be named in a function

(readability-named-parameter,-warnings-as-errors)


[error] 426-426: variable 'result' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 439-439: variable 'channels' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 440-440: variable 'patch_size' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 441-441: variable 'HW_patch_size' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 441-441: invalid case style for variable 'HW_patch_size'

(readability-identifier-naming,-warnings-as-errors)


[error] 505-505: variable 'audio_bounds_tensor' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 520-520: redundant access specifier has the same accessibility as the previous access specifier

(readability-redundant-access-specifiers,-warnings-as-errors)

mllm/models/minicpm_o2_6/modeling_whisper_encoder.hpp

[error] 5-5: 'mllm/mllm.hpp' file not found

(clang-diagnostic-error)


[error] 15-15: constructor does not initialize these fields: embed_dim_, num_heads_, head_dim_, dropout_, k_proj_, v_proj_, q_proj_, out_proj_

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 29-29: constructor does not initialize these fields: embed_dim_, num_heads_, head_dim_, k_proj_, v_proj_, q_proj_, out_proj_

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 29-29: 4 adjacent parameters of 'WhisperEncoderAttention' of convertible types are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 29-29: floating point literal has suffix 'f', which is not uppercase

(readability-uppercase-literal-suffix,-warnings-as-errors)


[error] 39-39: 2 adjacent parameters of 'forward' of similar type ('const int &') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 58-58: variable 'scale' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 77-77: constructor does not initialize these fields: embed_dim_, ffn_dim_, dropout_, activation_dropout_, self_attn_layer_norm_, fc1_, fc2_, final_layer_norm_, activation_fn_

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 93-93: constructor does not initialize these fields: embed_dim_, ffn_dim_, self_attn_layer_norm_, fc1_, fc2_, final_layer_norm_, activation_fn_

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 93-93: 2 adjacent parameters of 'WhisperEncoderLayer' of similar type ('const int &') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 97-97: 'dropout_' should be initialized in a member initializer of the constructor

(cppcoreguidelines-prefer-member-initializer,-warnings-as-errors)


[error] 97-97: floating point literal has suffix 'f', which is not uppercase

(readability-uppercase-literal-suffix,-warnings-as-errors)


[error] 98-98: 'activation_dropout_' should be initialized in a member initializer of the constructor

(cppcoreguidelines-prefer-member-initializer,-warnings-as-errors)


[error] 98-98: floating point literal has suffix 'f', which is not uppercase

(readability-uppercase-literal-suffix,-warnings-as-errors)


[error] 108-108: 2 adjacent parameters of 'forward' of similar type ('const int &') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 134-134: constructor does not initialize these fields: embed_dim_, num_mel_bins_, max_source_positions_, num_layers_, conv1_, gelu1_, conv2_, gelu2_, embed_positions_, layers_, layer_norm_

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 151-151: constructor does not initialize these fields: embed_dim_, num_mel_bins_, max_source_positions_, num_layers_, conv1_, gelu1_, conv2_, gelu2_, embed_positions_, layers_, layer_norm_

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 151-151: 2 adjacent parameters of 'WhisperEncoder' of similar type ('const int &') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 177-177: 2 adjacent parameters of 'forward' of similar type ('const int &') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)

examples/minicpm_o/audio_infer.cpp

[error] 4-4: 'iostream' file not found

(clang-diagnostic-error)


[error] 16-16: variable 'MLLM_MAIN' is non-const and globally accessible, consider making it const

(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)


[error] 16-16: invalid case style for variable 'MLLM_MAIN'

(readability-identifier-naming,-warnings-as-errors)
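
Most of these findings follow a small number of mechanical fix patterns. As an illustrative sketch only (not code from this PR; padded_length and audio_ptr are borrowed from the diagnostics above, is_last_chunk and do_work are hypothetical), the recurring cppcoreguidelines-init-variables and bugprone-branch-clone errors are typically resolved like this:

// Before: flagged by cppcoreguidelines-init-variables
int padded_length;
float* audio_ptr;

// After: initialize locals at the point of declaration
int padded_length = 0;
float* audio_ptr = nullptr;

// Before: flagged by bugprone-branch-clone (identical then/else branches)
if (is_last_chunk) { do_work(); } else { do_work(); }

// After: drop the redundant branch
do_work();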

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build-android
  • GitHub Check: build-macos

Comment on lines +79 to +116
  models::minicpmo::StreamingGenerationConfig stream_config;
  stream_config.generate_audio = true;
  stream_config.output_chunk_size = 25;
  stream_config.max_new_tokens = 100;
  stream_config.force_no_stop = false;
  stream_config.top_p = 0.7f;
  stream_config.top_k = 20;
  stream_config.sampling = false;
  stream_config.tts_temperature = {0.1f, 0.3f, 0.1f, 0.3f};

  auto streaming_gen = models::minicpmo::StreamingGenerator(gen_ids, Tensor::nil(), model, minicpmo_tokenizer,
                                                            &chattts_tokenizer, stream_config, chattts_config);
  int chunk_count = 0;
  std::vector<Tensor> audio_chunks;
  for (auto& output : streaming_gen) {
    fmt::print("\n--- Chunk {} ---\n", chunk_count);
    ++chunk_count;
    fmt::print("Text: {}\n", output.text);

    if (output.audio_wav && !output.audio_wav.value().isNil()) {
      auto& audio = output.audio_wav.value();
      fmt::print("Audio: [{} samples @ {}Hz]\n", audio.shape()[1], output.sampling_rate);
      audio_chunks.emplace_back(audio);
    }

    if (output.finished) {
      fmt::print("\n=== Generation Finished ===\n");
      break;
    }
  }

  Tensor audio_output = nn::functional::concat(audio_chunks, -1);

  print("Final audio shape:", audio_output.shape(), audio_output);
  audio_output = audio_output * 32767;

  wenet::WavWriter wav_writer2(audio_output.ptr<float>(), audio_output.shape().back(), 1, 24000, 16);
  wav_writer2.Write("./omni.wav");

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Guard against empty audio, use runtime sampling rate, and verify manual scaling

There are a few issues/opportunities in the streaming/audio output section:

  • If no audio_wav chunks are returned, concat(audio_chunks, -1) will be called with an empty vector, which is likely to assert or crash.
  • The WAV writer is hard‑coded to 24000 Hz even though output.sampling_rate is available.
  • audio_output is manually scaled by 32767 before being passed as float* to WavWriter; depending on WavWriter’s contract this may be redundant or incorrect.

You can harden this block and align it with the runtime metadata like this:

-  int chunk_count = 0;
-  std::vector<Tensor> audio_chunks;
+  int chunk_count = 0;
+  std::vector<Tensor> audio_chunks;
+  int audio_sampling_rate = 24000;
@@
-    if (output.audio_wav && !output.audio_wav.value().isNil()) {
-      auto& audio = output.audio_wav.value();
-      fmt::print("Audio: [{} samples @ {}Hz]\n", audio.shape()[1], output.sampling_rate);
-      audio_chunks.emplace_back(audio);
-    }
+    if (output.audio_wav && !output.audio_wav.value().isNil()) {
+      auto& audio = output.audio_wav.value();
+      fmt::print("Audio: [{} samples @ {}Hz]\n", audio.shape()[1], output.sampling_rate);
+      audio_chunks.emplace_back(audio);
+      audio_sampling_rate = output.sampling_rate;
+    }
@@
-  Tensor audio_output = nn::functional::concat(audio_chunks, -1);
-
-  print("Final audio shape:", audio_output.shape(), audio_output);
-  audio_output = audio_output * 32767;
-
-  wenet::WavWriter wav_writer2(audio_output.ptr<float>(), audio_output.shape().back(), 1, 24000, 16);
-  wav_writer2.Write("./omni.wav");
+  if (audio_chunks.empty()) {
+    fmt::print("No audio chunks were generated; skipping WAV write.\n");
+  } else {
+    Tensor audio_output = nn::functional::concat(audio_chunks, -1);
+
+    print("Final audio shape:", audio_output.shape(), audio_output);
+    // Verify whether this manual scaling matches WavWriter's expected input range.
+    audio_output = audio_output * 32767;
+
+    wenet::WavWriter wav_writer2(
+        audio_output.ptr<float>(), audio_output.shape().back(), 1, audio_sampling_rate, 16);
+    wav_writer2.Write("./omni.wav");
+  }

This prevents a crash when no audio is produced and keeps the sample rate consistent with what the model actually outputs. Please double‑check WavWriter’s expected input format to confirm whether the explicit * 32767 scaling is correct or should be removed/adjusted.


🏁 Script executed:

#!/bin/bash
# Find WavWriter class to understand its contract
rg -n "class WavWriter" --type=cpp --type=h -A 15

Length of output: 1344


🏁 Script executed:

#!/bin/bash
# Find nn::functional::concat implementation
rg -n "concat" --type=cpp --type=h -A 5 | grep -E "(concat|functional)" -A 5 -B 2 | head -100

Length of output: 7620


🏁 Script executed:

#!/bin/bash
# Verify the actual code in main.cpp at the specified lines
cat examples/minicpm_o/main.cpp | sed -n '79,116p'

Length of output: 1469


🏁 Script executed:

#!/bin/bash
# Check if there are any guards or checks before concat elsewhere in the codebase
rg -n "audio_chunks" --type=cpp examples/

Length of output: 531


🏁 Script executed:

#!/bin/bash
# Find nn::functional::concat implementation
find . -name "*.h" -o -name "*.cpp" | xargs grep -l "namespace functional" | head -5

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Search for concat function implementation
rg -n "functional::concat|fn::concat" --type=cpp --type=h -A 20 | head -100

Length of output: 9243


🏁 Script executed:

#!/bin/bash
# Check if there's any documentation or comments about tensor concat with empty vectors
rg -n "def concat|void concat|Tensor concat" --type=cpp --type=h -B 3 -A 10 | head -150

Length of output: 8337


🏁 Script executed:

#!/bin/bash
# Look for any existing empty checks in similar audio processing code
rg -n "empty()" --type=cpp examples/ -B 2 -A 2

Length of output: 2313


🏁 Script executed:

#!/bin/bash
# Search for OpTypes::kConcat operation implementation
rg -n "kConcat|concat.*Op" --type=cpp --type=h mllm/nn/ -B 2 -A 15 | head -200

Length of output: 1116


🏁 Script executed:

#!/bin/bash
# Look at WavWriter Write implementation more carefully
cat third_party/wenet_audio/wav.h | sed -n '157,250p'

Length of output: 2195


🏁 Script executed:

#!/bin/bash
# Check for any tests or examples that show concat with various inputs
rg -n "functional::concat" --type=cpp -B 5 -A 5 | grep -E "(empty|vector|list|\[\])" -B 3 -A 3 | head -80

Length of output: 6572


🏁 Script executed:

#!/bin/bash
# Look for how audio_wav is populated - could it be nil/empty for all chunks?
rg -n "audio_wav\|output.audio_wav" --type=cpp --type=h -B 2 -A 2 | head -60

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Find StreamingGenerator output structure definition
rg -n "struct.*output|class.*Output" --type=cpp --type=h mllm/models/minicpm_o* -A 10 | grep -E "(audio_wav|sampling_rate)" -B 5 -A 5 | head -80

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Search for buildOpAndSubmitTask to understand how it handles empty vectors
rg -n "buildOpAndSubmitTask" --type=cpp --type=h -B 3 -A 20 | head -150

Length of output: 11003


🏁 Script executed:

#!/bin/bash
# Look for ConcatOp or concat operation actual implementation
rg -n "class.*ConcatOp|ConcatOpOptions" --type=cpp --type=h -A 20 | head -150

Length of output: 10101


🏁 Script executed:

#!/bin/bash
# Check if there are integration tests showing concat behavior
find . -name "*test*.cpp" -o -name "*test*.h" | xargs grep -l "concat" | head -3

Length of output: 0


Guard against empty audio chunks, use runtime sampling rate

Verification confirms all issues:

  1. Empty audio crash: ConcatOp::reshape() exits with error if inputs are empty (see mllm/core/aops/ConcatOp.cpp:29-31). If no audio_wav chunks are generated, line 110 will crash.

  2. Hard-coded sampling rate: Line 115 uses 24000 instead of output.sampling_rate available from the loop (line 102).

  3. Manual scaling: The * 32767 is correct—WavWriter::Write() casts float directly to int16_t for 16-bit PCM, so the scaling converts normalized float [-1, 1] to PCM range appropriately.

Apply the suggested fixes at examples/minicpm_o/main.cpp:92–116 and examples/minicpm_o/demo_minicpmo_streaming.cpp:114–137 to add empty vector guard and capture runtime sampling rate.

🤖 Prompt for AI Agents
In examples/minicpm_o/main.cpp around lines 92–116 and
examples/minicpm_o/demo_minicpmo_streaming.cpp around lines 114–137, guard
against calling concat and writing a WAV when audio_chunks is empty and use the
runtime sampling rate collected from the streaming outputs instead of the
hard-coded 24000; specifically, track the last non-zero output.sampling_rate
inside the loop, after the loop check if audio_chunks.empty() and skip WAV
creation (or log/return) to avoid ConcatOp crashes, otherwise call concat, keep
the existing float-to-PCM scaling, and pass the captured sampling_rate to
WavWriter instead of 24000.

Comment on lines +31 to +36
mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
if (model_version.get() == "v1") {
  file_version = mllm::ModelFileVersion::kV1;
} else if (model_version.get() == "v2") {
  file_version = mllm::ModelFileVersion::kV2;
}

⚠️ Potential issue | 🟡 Minor

Validate model_version instead of silently defaulting to v1

If -mv|--model_version receives anything other than "v1" or "v2", file_version remains kV1. That silent fallback can hide typos and cause confusing behavior (loading the wrong layout for a v2 model file).

Consider explicitly rejecting unknown versions, e.g., by printing an error and exiting, or at least logging a warning when model_version isn’t recognized.
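
A minimal sketch of the fail-fast variant, reusing names visible in this example (it assumes model_version.get() yields a std::string, as the comparisons above suggest; the message wording and exit path are illustrative):

mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
const std::string ver = model_version.get();
if (ver == "v1") {
  file_version = mllm::ModelFileVersion::kV1;
} else if (ver == "v2") {
  file_version = mllm::ModelFileVersion::kV2;
} else {
  // Reject unknown versions instead of silently falling back to kV1.
  fmt::print(stderr, "Unknown --model_version '{}'; expected 'v1' or 'v2'.\n", ver);
  mllm::shutdownContext();
  return 1;
}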

Comment on lines +52 to +76
fmt::print("\n{:*^60}\n", " MiniCPM4 Interactive CLI ");
fmt::print("Enter 'exit' or 'quit' to end the session\n\n");

std::string prompt_text = "你是谁?";

try {
auto inputs = minicpm4_tokenizer.convertMessage({.prompt = prompt_text});

fmt::print("\nResponse: ");

mllm::models::ARGenerationArgs args;
args["temperature"] = mllm::AnyValue(0.8f);
args["top_p"] = mllm::AnyValue(0.9f);
args["do_sample"] = mllm::AnyValue(true);

for (auto& step : minicpm4.chat(inputs, args)) {
auto token_str = minicpm4_tokenizer.detokenize(step.cur_token_id);
if (token_str == "<|im_end|>") { break; }
std::cout << token_str << std::flush;
}

fmt::print("\n{}\n", std::string(60, '-'));
} catch (const std::exception& e) { fmt::print("\nError: {}\n{}\n", e.what(), std::string(60, '-')); }

minicpm4.perfSummary();

⚠️ Potential issue | 🟠 Major

CLI is not actually interactive despite prompts and naming

The banner and message:

  • " MiniCPM4 Interactive CLI "
  • "Enter 'exit' or 'quit' to end the session"

imply a read–eval–print loop, but the code only ever runs a single, hard‑coded prompt ("你是谁?") and then exits—there is no loop or std::getline over user input.

Either:

  • Implement a proper REPL that reads user input until "exit"/"quit" and re‑invokes convertMessage/chat, or
  • Downgrade the messaging to indicate this is a single‑shot demo and remove the exit/quit notice.

As it stands, the UX is misleading for anyone expecting an interactive session.
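
For the first option, a minimal REPL shape consistent with the snippet above (assumes <iostream> and <string> are included and args is the ARGenerationArgs built earlier; whether the KV cache must be reset between turns is not shown here and would need checking):

std::string line;
while (true) {
  fmt::print("\n> ");
  if (!std::getline(std::cin, line)) break;     // EOF also ends the session
  if (line == "exit" || line == "quit") break;  // matches the banner text
  if (line.empty()) continue;

  auto inputs = minicpm4_tokenizer.convertMessage({.prompt = line});
  fmt::print("Response: ");
  for (auto& step : minicpm4.chat(inputs, args)) {
    auto token_str = minicpm4_tokenizer.detokenize(step.cur_token_id);
    if (token_str == "<|im_end|>") break;
    std::cout << token_str << std::flush;
  }
  fmt::print("\n");
}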

🤖 Prompt for AI Agents
In examples/minicpm4/main.cpp around lines 52 to 76, the code displays an
"Interactive CLI" banner and tells the user to "Enter 'exit' or 'quit' to end
the session" but only runs a single hard-coded prompt ("你是谁?") and exits; fix by
either implementing a REPL loop that repeatedly reads input with std::getline,
checks for "exit"/"quit", calls convertMessage and minicpm4.chat for each input,
prints responses and continues until quit, or change the banner and help text to
indicate this is a single-shot demo and remove the exit/quit instructions so the
messaging matches behavior.

Comment on lines +12 to +18
struct AvgPool1dOpOptions : public BaseOpOptions<AvgPool1dOpOptions> {
  int32_t kernel_size;
  int32_t stride;
  int32_t padding = 0;
  bool ceil_mode = false;
  bool count_include_pad = true;
};

⚠️ Potential issue | 🔴 Critical

Initialize required fields in AvgPool1dOpOptions.

The fields kernel_size and stride have no default values. A value-initialized AvgPool1dOpOptions{} (as used in mllm/nn/layers/AvgPool1d.cpp line 7) zero-fills them, leaving a kernel_size and stride of 0 that no pooling configuration can legitimately use, while a plain AvgPool1dOpOptions opts; leaves them indeterminate, which is undefined behavior to read.

Apply this diff to add default values:

 struct AvgPool1dOpOptions : public BaseOpOptions<AvgPool1dOpOptions> {
-  int32_t kernel_size;
-  int32_t stride;
+  int32_t kernel_size = 1;
+  int32_t stride = 1;
   int32_t padding = 0;
   bool ceil_mode = false;
   bool count_include_pad = true;
 };
📝 Committable suggestion


Suggested change
struct AvgPool1dOpOptions : public BaseOpOptions<AvgPool1dOpOptions> {
  int32_t kernel_size;
  int32_t stride;
  int32_t padding = 0;
  bool ceil_mode = false;
  bool count_include_pad = true;
};

struct AvgPool1dOpOptions : public BaseOpOptions<AvgPool1dOpOptions> {
  int32_t kernel_size = 1;
  int32_t stride = 1;
  int32_t padding = 0;
  bool ceil_mode = false;
  bool count_include_pad = true;
};
🧰 Tools
🪛 Clang (14.0.6)

[error] 12-12: constructor does not initialize these fields: kernel_size, stride

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)

🤖 Prompt for AI Agents
In mllm/core/aops/AvgPool1dOp.hpp around lines 12 to 18, the AvgPool1dOpOptions
struct leaves kernel_size and stride uninitialized which leads to undefined
behavior when default-constructed; set sensible default values (e.g.,
kernel_size = 1 and stride = 1) in the struct definition so AvgPool1dOpOptions{}
produces valid, initialized members while keeping existing defaults for padding,
ceil_mode, and count_include_pad.

Comment on lines +43 to +58
// Create a Hann window
inline Tensor create_hann_window(int32_t window_length, bool periodic = true) {
  int32_t length = periodic ? window_length + 1 : window_length;
  auto window = Tensor::empty({1, window_length}, kFloat32, kCPU).alloc();
  float* window_ptr = window.ptr<float>();

  for (int32_t i = 0; i < window_length; ++i) {
    // Hann window formula: 0.5 - 0.5 * cos(2π * n / (N-1))
    // For periodic: use N instead of N-1
    float n = static_cast<float>(i);
    float denominator = periodic ? static_cast<float>(length) : static_cast<float>(length - 1);
    window_ptr[i] = 0.5f - 0.5f * std::cos(2.0f * M_PI * n / denominator);
  }

  return window;
}

⚠️ Potential issue | 🟠 Major

Periodic Hann window formula is likely incorrect vs standard/Whisper behavior

create_hann_window uses length = window_length + 1 and denominator = length when periodic == true, but still iterates i only up to window_length - 1. This yields 0.5 - 0.5 * cos(2π n / (N+1)) for n ∈ [0, N-1], which does not match the usual periodic Hann window (2π n / N) or the standard non‑periodic form (2π n / (N-1)).

Given this window is fed directly into nn::STFT and you explicitly aim to match Whisper/NumPy behavior, this mismatch will change the STFT magnitudes and thus the mel features.

Consider simplifying to an explicit, standard formula, e.g.:

-inline Tensor create_hann_window(int32_t window_length, bool periodic = true) {
-  int32_t length = periodic ? window_length + 1 : window_length;
-  auto window = Tensor::empty({1, window_length}, kFloat32, kCPU).alloc();
-  float* window_ptr = window.ptr<float>();
-
-  for (int32_t i = 0; i < window_length; ++i) {
-    // Hann window formula: 0.5 - 0.5 * cos(2π * n / (N-1))
-    // For periodic: use N instead of N-1
-    float n = static_cast<float>(i);
-    float denominator = periodic ? static_cast<float>(length) : static_cast<float>(length - 1);
-    window_ptr[i] = 0.5f - 0.5f * std::cos(2.0f * M_PI * n / denominator);
-  }
+inline Tensor create_hann_window(int32_t window_length, bool periodic = true) {
+  auto window = Tensor::empty({1, window_length}, kFloat32, kCPU).alloc();
+  float* window_ptr = window.ptr<float>();
+
+  const float denom = periodic ? static_cast<float>(window_length)
+                               : static_cast<float>(window_length - 1);
+  for (int32_t i = 0; i < window_length; ++i) {
+    float n = static_cast<float>(i);
+    window_ptr[i] = 0.5f - 0.5f * std::cos(2.0f * M_PI * n / denom);
+  }
   return window;
 }

This aligns periodic and non‑periodic forms with common definitions and avoids the off‑by‑one in the periodic branch.

📝 Committable suggestion


Suggested change
// Create a Hann window
inline Tensor create_hann_window(int32_t window_length, bool periodic = true) {
  int32_t length = periodic ? window_length + 1 : window_length;
  auto window = Tensor::empty({1, window_length}, kFloat32, kCPU).alloc();
  float* window_ptr = window.ptr<float>();
  for (int32_t i = 0; i < window_length; ++i) {
    // Hann window formula: 0.5 - 0.5 * cos(2π * n / (N-1))
    // For periodic: use N instead of N-1
    float n = static_cast<float>(i);
    float denominator = periodic ? static_cast<float>(length) : static_cast<float>(length - 1);
    window_ptr[i] = 0.5f - 0.5f * std::cos(2.0f * M_PI * n / denominator);
  }
  return window;
}

// Create a Hann window
inline Tensor create_hann_window(int32_t window_length, bool periodic = true) {
  auto window = Tensor::empty({1, window_length}, kFloat32, kCPU).alloc();
  float* window_ptr = window.ptr<float>();
  const float denom = periodic ? static_cast<float>(window_length)
                               : static_cast<float>(window_length - 1);
  for (int32_t i = 0; i < window_length; ++i) {
    float n = static_cast<float>(i);
    window_ptr[i] = 0.5f - 0.5f * std::cos(2.0f * M_PI * n / denom);
  }
  return window;
}
🧰 Tools
🪛 Clang (14.0.6)

[error] 45-45: variable 'length' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 47-47: variable 'window_ptr' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)

🤖 Prompt for AI Agents
In mllm/models/minicpm_o2_6/audio_preprocessor_minicpmo.hpp around lines 43-58,
the periodic Hann branch uses length = window_length + 1 and then divides by
length while iterating i in [0, window_length-1], producing an off-by-one
denominator; change to use the standard formulas: set denominator = periodic ?
static_cast<float>(window_length) : static_cast<float>(window_length - 1) (and
keep the loop range 0..window_length-1), so the periodic case uses 2π * n / N
and the non-periodic case uses 2π * n / (N-1), matching NumPy/Whisper behavior.
Ensure the Tensor allocation and indices remain based on window_length.

Comment on lines +28 to +34
if (data()["eos_token_id"].is_array()) {
eos_token_ids = data()["eos_token_id"].get<std::vector<int64_t>>();
eos_token_id = eos_token_ids[0];
} else {
eos_token_id = data()["eos_token_id"];
eos_token_ids.push_back(eos_token_id);
}

⚠️ Potential issue | 🟠 Major

Guard against empty eos_token_id arrays before indexing

When eos_token_id is an array, eos_token_ids[0] is accessed without checking that the array is non‑empty. A malformed or partially edited config would cause undefined behavior here.

Consider validating and failing fast, for example:

-    if (data()["eos_token_id"].is_array()) {
-      eos_token_ids = data()["eos_token_id"].get<std::vector<int64_t>>();
-      eos_token_id = eos_token_ids[0];
-    } else {
+    if (data()["eos_token_id"].is_array()) {
+      eos_token_ids = data()["eos_token_id"].get<std::vector<int64_t>>();
+      if (eos_token_ids.empty()) {
+        throw std::runtime_error("MiniCPM4Config: `eos_token_id` array must not be empty");
+      }
+      eos_token_id = eos_token_ids[0];
+    } else {
       eos_token_id = data()["eos_token_id"];
       eos_token_ids.push_back(eos_token_id);
     }
📝 Committable suggestion


Suggested change
if (data()["eos_token_id"].is_array()) {
eos_token_ids = data()["eos_token_id"].get<std::vector<int64_t>>();
eos_token_id = eos_token_ids[0];
} else {
eos_token_id = data()["eos_token_id"];
eos_token_ids.push_back(eos_token_id);
}
if (data()["eos_token_id"].is_array()) {
eos_token_ids = data()["eos_token_id"].get<std::vector<int64_t>>();
if (eos_token_ids.empty()) {
throw std::runtime_error("MiniCPM4Config: `eos_token_id` array must not be empty");
}
eos_token_id = eos_token_ids[0];
} else {
eos_token_id = data()["eos_token_id"];
eos_token_ids.push_back(eos_token_id);
}
🧰 Tools
🪛 Clang (14.0.6)

[error] 28-28: if with identical then and else branches

(bugprone-branch-clone,-warnings-as-errors)

🤖 Prompt for AI Agents
In mllm/models/minicpm4/configuration_minicpm4.hpp around lines 28-34, the code
assumes data()["eos_token_id"] array contains at least one element and directly
accesses eos_token_ids[0]; add a guard to validate the array is non-empty before
indexing and fail fast if it is empty (e.g., log/throw an informative exception
or return an error), and ensure eos_token_id and eos_token_ids are set
consistently afterwards; also validate the non-array path (presence/type) and
handle malformed or missing values with the same fail-fast behavior.

Comment on lines +46 to +59
// RoPE scaling configuration
if (data().contains("rope_scaling") && !data()["rope_scaling"].is_null()) {
  auto rope_scaling = data()["rope_scaling"];
  rope_scaling_type = rope_scaling["rope_type"];
  rope_scaling_factor = rope_scaling.value("factor", 1.0);

  if (rope_scaling_type == "longrope") {
    original_max_position_embeddings = rope_scaling["original_max_position_embeddings"];

    // Load short_factor and long_factor arrays
    if (rope_scaling.contains("short_factor")) { short_factor = rope_scaling["short_factor"].get<std::vector<float>>(); }
    if (rope_scaling.contains("long_factor")) { long_factor = rope_scaling["long_factor"].get<std::vector<float>>(); }
  }
}

⚠️ Potential issue | 🟠 Major

Validate RoPE scaling factors for longrope configs

rope_scaling_type, short_factor, and long_factor are loaded here, but there’s no validation that:

  • long_factor is populated when short_factor is non‑empty, and
  • both factor vectors have length at least head_dim / 2 (the length used later in makeRoPEInvFreqWithLongRoPE).

Given the modeling code unconditionally indexes ext_factors[i] when !cfg.short_factor.empty(), a config with only short_factor (or mismatched lengths) will cause out‑of‑bounds reads.

Suggest adding basic shape/consistency checks after loading the vectors (and either throwing with a clear message or falling back to unscaled RoPE if invalid).
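
One possible shape for that check, run right after the factors are parsed (a sketch only; it assumes head_dim is computable as hidden_size / num_attention_heads at this point in the constructor and that <stdexcept> is available):

if (rope_scaling_type == "longrope" && !short_factor.empty()) {
  const size_t required_len = static_cast<size_t>(head_dim) / 2;  // head_dim assumed in scope
  if (short_factor.size() < required_len || long_factor.size() < required_len) {
    throw std::runtime_error(
        "MiniCPM4Config: short_factor/long_factor must each have at least head_dim/2 entries");
  }
}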

🤖 Prompt for AI Agents
mllm/models/minicpm4/configuration_minicpm4.hpp around lines 46 to 59: after
loading rope_scaling_type, original_max_position_embeddings, short_factor and
long_factor, add validation that if short_factor is non-empty then long_factor
is also non-empty, and verify both short_factor.size() and long_factor.size()
are >= required_len (compute required_len as head_dim/2 or pass head_dim in
scope); if checks fail either throw a clear std::runtime_error with a
descriptive message including which vector/mismatch failed, or reset
rope_scaling_type to disable longrope (fallback to unscaled RoPE) and log/warn
the issue; ensure validations occur before any later code that indexes
ext_factors to prevent out‑of‑bounds access.

Comment on lines +16 to +26
inline auto makeRoPEInvFreqWithLongRoPE(int output_dim, float rope_theta, const MiniCPM4Config& cfg, int seq_len) -> Tensor {
  auto inv_freq = Tensor::empty({output_dim / 2}, kFloat32, kCPU).alloc();
  auto inv_freq_ptr = inv_freq.ptr<float>();

  for (int i = 0; i < output_dim / 2; i++) { inv_freq_ptr[i] = 1.0 / std::pow(rope_theta, 2.0 * i / output_dim); }

  if (!cfg.short_factor.empty()) {
    const auto& ext_factors = (seq_len > cfg.original_max_position_embeddings) ? cfg.long_factor : cfg.short_factor;

    for (int i = 0; i < output_dim / 2; i++) { inv_freq_ptr[i] = inv_freq_ptr[i] / ext_factors[i]; }
  }

⚠️ Potential issue | 🟠 Major

Avoid out‑of‑bounds access when applying long/short RoPE factors

makeRoPEInvFreqWithLongRoPE assumes that whenever !cfg.short_factor.empty():

  • cfg.long_factor is also populated for the long‑sequence case, and
  • the chosen ext_factors vector has length ≥ output_dim / 2.

If a config ships with only short_factor (or with shorter vectors than expected), ext_factors[i] will read out of bounds.

It’s safer to gate this scaling on both vectors being well‑formed, and to assert or fall back when sizes don’t match:

-  if (!cfg.short_factor.empty()) {
-    const auto& ext_factors = (seq_len > cfg.original_max_position_embeddings) ? cfg.long_factor : cfg.short_factor;
-
-    for (int i = 0; i < output_dim / 2; i++) { inv_freq_ptr[i] = inv_freq_ptr[i] / ext_factors[i]; }
-  }
+  const int dim_half = output_dim / 2;
+  if (!cfg.short_factor.empty()) {
+    const auto& ext_factors =
+        (seq_len > cfg.original_max_position_embeddings) ? cfg.long_factor : cfg.short_factor;
+    if ((int)ext_factors.size() < dim_half) {
+      throw std::runtime_error("MiniCPM4: RoPE scaling factors length mismatch");
+    }
+    for (int i = 0; i < dim_half; ++i) {
+      inv_freq_ptr[i] /= ext_factors[i];
+    }
+  }
🧰 Tools
🪛 Clang (14.0.6)

[error] 16-16: 4 adjacent parameters of 'makeRoPEInvFreqWithLongRoPE' of convertible types are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)

🤖 Prompt for AI Agents
In mllm/models/minicpm4/modeling_minicpm4.hpp around lines 16–26, the code
applies ext_factors[i] without ensuring ext_factors exists and is at least
output_dim/2 long, risking out‑of‑bounds access; update the logic to (1) choose
ext_factors only if the intended vector (long_factor for long sequences or
short_factor) is nonempty and its size() >= output_dim/2, (2) otherwise fall
back to no scaling (or to the other factor if it is valid), and (3) optionally
add an assert/log when sizes mismatch; implement the scaling loop using the
validated size boundary (or min(size, output_dim/2)) so you never index past the
vector bounds.

Comment on lines +298 to +391
  explicit MiniCPM4ForCausalLM(const MiniCPM4Config& cfg) : cfg(cfg) {
    kv_cache_ = nn::StaticCache(cfg.max_cache_length, cfg.num_hidden_layers,
                                cfg.num_attention_heads,                    // q_heads
                                cfg.num_key_value_heads,                    // kv_heads
                                cfg.hidden_size / cfg.num_attention_heads,  // kv_dim
                                kFloat32,                                   // k_dtype
                                kFloat32,                                   // v_dtype
                                kCPU,                                       // device_type
                                false                                       // use_fa2
    );
    eos_token_id_ = cfg.eos_token_id;
    max_length_ = cfg.max_cache_length;
    tie_word_embeddings_ = cfg.tie_word_embeddings;

    llm = reg<MiniCPM4Text>("model", cfg);

    // Only create lm_head if NOT using tied embeddings
    if (!cfg.tie_word_embeddings) {
      lm_head_ = reg<nn::Linear>("lm_head", cfg.hidden_size, cfg.vocab_size, false, cfg.linear_impl_type);
    }

    longrope_scaling_factor_ = computeLongRoPEScalingFactor(cfg);
  }

  ARGenerationOutputPast forward(const ARGenerationOutputPast& input, const ARGenerationArgs& args) override {
    auto sequence = input.at("sequence");

    // Generate position_ids for the current sequence
    auto batch_size = sequence.shape()[0];
    auto seq_len = sequence.shape()[1];

    auto inv_freq = makeRoPEInvFreqWithLongRoPE(cfg.hidden_size / cfg.num_attention_heads, cfg.rope_theta, cfg, seq_len);

    Tensor position_ids = Tensor::nil();
    if (input.count("position_ids")) {
      // Use existing position_ids for decode phase
      position_ids = input.at("position_ids");

      // For decode phase, increment the last position
      if (seq_len == 1) {
        auto last_pos = *position_ids.offsettedPtr<int64_t>({0, position_ids.shape()[1] - 1});
        position_ids = Tensor::empty({batch_size, 1}, kInt64, kCPU).alloc();
        *position_ids.offsettedPtr<int64_t>({0, 0}) = last_pos + 1;
      }
    } else {
      // Generate position_ids for prefill phase
      position_ids = Tensor::empty({batch_size, seq_len}, kInt64, kCPU).alloc();
      auto position_ids_ptr = position_ids.ptr<int64_t>();
      for (int b = 0; b < batch_size; ++b) {
        for (int s = 0; s < seq_len; ++s) { position_ids_ptr[b * seq_len + s] = s; }
      }
    }

    // Generate RoPE embeddings using the inv_freq buffer
    auto [llm_embedding_sin, llm_embedding_cos] = makeRotaryPosEmbedding(position_ids, inv_freq, longrope_scaling_factor_);

    sequence = llm(sequence, llm_embedding_sin, llm_embedding_cos, AnyValue(&kv_cache_))[0];

    // clip x to one seq length
    {
      auto S = sequence.shape()[1];
      sequence = sequence[{kAll, {S - 1}, kAll}];
    }

    // Apply MiniCPM specific scaling before output
    // logits = lm_head(hidden_states / (hidden_size / dim_model_base))
    if (cfg.dim_model_base != 1.0f) {
      float scale_factor = cfg.hidden_size / cfg.dim_model_base;
      sequence = sequence / scale_factor;
    }

    if (tie_word_embeddings_) {
      auto embedding_weight = llm.embedding_.weight();
      sequence = nn::functional::matmul(sequence, embedding_weight, false, true);
    } else {
      sequence = lm_head_(sequence);
    }

    return {
        {"sequence", sequence},
        {"position_ids", position_ids},
    };
  }

  inline nn::StaticCache& kvCache() { return kv_cache_; }

 private:
  const MiniCPM4Config& cfg;
  MiniCPM4Text llm;
  nn::Linear lm_head_;
  bool tie_word_embeddings_;
  nn::StaticCache kv_cache_;
  float longrope_scaling_factor_;
};

⚠️ Potential issue | 🔴 Critical

Avoid storing MiniCPM4Config by reference; this can dangle and is used in forward

MiniCPM4ForCausalLM stores:

private:
  const MiniCPM4Config& cfg;

and uses cfg in both the constructor and forward (e.g., cfg.hidden_size, cfg.dim_model_base). If the caller ever constructs the model from a temporary config:

MiniCPM4ForCausalLM model(MiniCPM4Config("config_minicpm4.json"));

cfg will dangle as soon as the temporary is destroyed, leading to undefined behavior on the first forward.

To make this robust, prefer storing the config by value:

-class MiniCPM4ForCausalLM : public ARGeneration, public nn::Module {
+class MiniCPM4ForCausalLM : public ARGeneration, public nn::Module {
   ...
-private:
-  const MiniCPM4Config& cfg;
+private:
+  MiniCPM4Config cfg;

and update the constructor to take MiniCPM4Config cfg_ by value or const ref and copy.

🧰 Tools
🪛 Clang (14.0.6)

[error] 298-298: constructor does not initialize these fields: lm_head_, tie_word_embeddings_, kv_cache_, longrope_scaling_factor_

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 322-322: 2 adjacent parameters of 'forward' of similar type ('const int &') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 331-331: variable 'position_ids' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 358-358: variable name 'S' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 358-358: invalid case style for variable 'S'

(readability-identifier-naming,-warnings-as-errors)


[error] 365-365: variable 'scale_factor' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


namespace mllm::nn {

AvgPool1d::AvgPool1d() : Layer(OpTypes::kAvgPool1d, aops::AvgPool1dOpOptions{}) {}

⚠️ Potential issue | 🔴 Critical

Default constructor uses uninitialized AvgPool1dOpOptions.

The default constructor passes aops::AvgPool1dOpOptions{}, which zero-fills kernel_size and stride (as noted in the review of mllm/core/aops/AvgPool1dOp.hpp). A kernel size and stride of 0 are not a usable pooling configuration, so any op built from this default is broken.

Consider one of these approaches:

Option 1: Remove the default constructor if it's not needed:

-AvgPool1d::AvgPool1d() : Layer(OpTypes::kAvgPool1d, aops::AvgPool1dOpOptions{}) {}

Option 2: Provide sensible defaults in the constructor:

-AvgPool1d::AvgPool1d() : Layer(OpTypes::kAvgPool1d, aops::AvgPool1dOpOptions{}) {}
+AvgPool1d::AvgPool1d() : Layer(OpTypes::kAvgPool1d, aops::AvgPool1dOpOptions{
+                                                          .kernel_size = 1,
+                                                          .stride = 1,
+                                                          .padding = 0,
+                                                          .ceil_mode = false,
+                                                          .count_include_pad = true}) {}

Note: This issue will be resolved if the fields are initialized in AvgPool1dOpOptions as suggested in the review of mllm/core/aops/AvgPool1dOp.hpp.

📝 Committable suggestion


Suggested change
AvgPool1d::AvgPool1d() : Layer(OpTypes::kAvgPool1d, aops::AvgPool1dOpOptions{}) {}
🤖 Prompt for AI Agents
In mllm/nn/layers/AvgPool1d.cpp around line 7, the default constructor uses
aops::AvgPool1dOpOptions{} which leaves kernel_size and stride uninitialized;
either remove this default constructor if not needed, or change it to construct
with explicit safe defaults (e.g. set kernel_size and stride to 1 or other
sensible values) so AvgPool1dOpOptions fields are initialized before passing to
Layer; alternatively, ensure AvgPool1dOpOptions itself initializes its members
and adjust this constructor accordingly.

@KKkai0315 KKkai0315 changed the title from "fix(minicpmo): fix minicpmo tokenization logic" to "fix(minicpmo): fix minicpmo tokenization logic & streaming generation" on Nov 24, 2025
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
mllm/models/minicpm_o2_6/streaming_generation.cpp (1)

207-217: Consider extracting num_spk_embs to avoid duplication.

The mask construction logic is correct and aligns with the intended streaming TTS behavior. However, num_spk_embs = 1 is hardcoded here and also in prepare_tts_text (line 196), creating duplication.

Consider extracting num_spk_embs as a class member constant to maintain consistency and simplify future adjustments:

// In the class header
static constexpr int32_t kNumSpeakerEmbeddings = 1;

Then reference it in both methods:

-  int32_t num_spk_embs = 1;
+  int32_t num_spk_embs = kNumSpeakerEmbeddings;

Additionally, the inline comment // = 303 on line 208 is only valid when streaming_text_reserved_len_ equals 300, which may not be obvious. Consider either removing it or making it conditional on the actual value.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 518d66d and db4c4f7.

📒 Files selected for processing (1)
  • mllm/models/minicpm_o2_6/streaming_generation.cpp (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
mllm/models/minicpm_o2_6/streaming_generation.cpp (1)
mllm/models/minicpm_o2_6/streaming_generation.hpp (1)
  • tts_token_lens (196-196)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build-android
  • GitHub Check: build-macos
🔇 Additional comments (1)
mllm/models/minicpm_o2_6/streaming_generation.cpp (1)

116-116: Verify that lowering min_new_token to 0 doesn't compromise TTS quality.

The minimum token generation requirement has been lowered from 10 to 0, which allows the TTS model to finish generation immediately without producing any new tokens. This could potentially affect streaming audio quality or cause premature termination in edge cases.

Please confirm this change has been tested and doesn't degrade the streaming TTS experience.

@oreomaker oreomaker merged commit cb6ffbd into UbiquitousLearning:main Nov 25, 2025
3 checks passed