Fix streaming transcribe#576
Conversation
Signed-off-by: jakmro <kubamroz124@gmail.com>
Signed-off-by: jakmro <kubamroz124@gmail.com>
d2f3aa3 to
c32926b
Compare
There was a problem hiding this comment.
Pull request overview
Adds coverage and fixes for streaming speech-to-text behavior, particularly around Parakeet streaming semantics and token piece decoding, to address issues with “streaming transcribe”.
Changes:
- Add a new
stream_transcriptiontest that feeds PCM audio in chunks and consumesconfirmed/pendingstreaming results. - Adjust Parakeet model-type handling in
cactus_stream_transcribe_process/stop(apply certain Parakeet-specific streaming heuristics only toPARAKEET_TDT). - Update
SPTokenizer::decode()to better support single-token “piece” decoding (used by streaming/token callbacks).
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/test_stt.cpp | Adds a new streaming transcription test and wires it into the test runner. |
| cactus/ffi/cactus_stream.cpp | Refines Parakeet streaming branches and stop/flush behavior to fix streaming transcription output. |
| cactus/engine/engine_sp.cpp | Tweaks SentencePiece-style decoding for single-token decode calls used in streaming output. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| size_t word_count = 0; | ||
| bool in_word = false; | ||
| for (char c : full_transcription) { | ||
| if (std::isspace(c)) { |
There was a problem hiding this comment.
std::isspace(c) is undefined behavior when char is signed and c is negative. Cast to unsigned char (or use a locale-aware approach) before calling std::isspace to avoid UB on non-ASCII bytes.
| if (std::isspace(c)) { | |
| if (std::isspace(static_cast<unsigned char>(c))) { |
| std::cout << "\n[Results]\n" | ||
| << " \"success\": true,\n" | ||
| << " \"total_time_ms\": " << std::fixed << std::setprecision(2) << elapsed << ",\n" | ||
| << " \"audio_chunks\": " << ((pcm_samples.size() + chunk_size - 1) / chunk_size) << ",\n" | ||
| << " \"pcm_samples\": " << pcm_samples.size() << ",\n" | ||
| << " \"duration_sec\": " << std::setprecision(2) << (pcm_samples.size() / 16000.0) << ",\n" | ||
| << " \"words_transcribed\": " << word_count << "\n" | ||
| << "├─ Full transcription: \"" << full_transcription << "\"" << std::endl; | ||
|
|
||
| cactus_destroy(model); | ||
| return true; |
There was a problem hiding this comment.
This test currently returns true even if no transcript is produced (e.g., empty full_transcription / word_count == 0 / JSON success false). To actually catch regressions in streaming, add at least one assertion that some confirmed text/words were produced (or that the response JSON indicates success and non-empty output).
| return false; | ||
| } | ||
|
|
||
| std::string audio_path = std::string(g_assets_path) + "/test_long.wav"; |
There was a problem hiding this comment.
g_assets_path is used to build audio_path without checking it is set. If CACTUS_TEST_ASSETS is unset this will construct std::string(nullptr) and crash; please add a skip/early-fail similar to other tests in this file that require assets.
| fseek(wav_file, 44, SEEK_SET); | ||
| std::vector<int16_t> pcm_samples; | ||
| int16_t sample; | ||
| while (fread(&sample, sizeof(int16_t), 1, wav_file) == 1) { | ||
| pcm_samples.push_back(sample); | ||
| } |
There was a problem hiding this comment.
The WAV parsing assumes a fixed 44-byte header and raw int16 PCM data. This will misread files with extra chunks (e.g., LIST/fact) or non-16bit/mono formats; consider using the existing WAV loader/resampler utilities used in other tests so the input format is validated and converted to 16kHz PCM as needed.
No description provided.