feat: SRT-driven edit pipeline + edit-plan recommender by xiaogang-sudo · Pull Request #41 · browser-use/video-use

xiaogang-sudo · 2026-05-19T14:17:15Z

Summary

Adds an independent SRT-driven editing pipeline plus a lexical recommender that bridges Scribe transcripts to it. All existing helpers (render.py, grade.py, transcribe.py, etc.) are untouched.

helpers/srt_driven_edit.py — full extract → gap → concat → final-compose pipeline:
- safe-ASCII temp work dir so CJK / quoted user paths never reach libavfilter
- SRT encoding fallback (utf-8-sig / utf-8 / gb18030 / cp936 / cp1252) and cue-settings tolerance (position:90% etc.)
- ffmpeg + ffprobe preflight; per-source ffprobe with auto-degrade when source has no audio
- sync tails (fps=24,setpts=PTS-STARTPTS / aresample=async=1:first_pts=0,asetpts) on every clip
- per-segment cache keyed by ffmpeg version + encoding params + effective bg_volume
- global --voice spans the whole output timeline (mixed in the final compose, not per-segment)
- batch manifest (jobs.json / .csv) with auto-isolated outputs, --continue-on-error, --no-overwrite
- QC report (per-segment drift, audio mode, disk usage, subtitle style)
- subtitle burn LAST (Hard Rule 1), 30ms audio fades (Hard Rule 3)
helpers/recommend_edit_plan.py — bridges Scribe transcript JSON → edit_plan.json:
- candidate segmentation by sentence-end punctuation / silence gap / speaker change; phrase / hard window splits for long candidates
- local lexical scoring: 0.6 SequenceMatcher + 0.4 Jaccard (token-level for Latin, char-2-gram for CJK) blended with duration similarity
- greedy assignment (no reuse by default)
- emits Form A (default, drop-in for srt_driven_edit --plan) or Form B; sidecar *_review.md for human QA
- no LLM, no API; intentionally local. The matcher cannot understand storyline — low-score matches are flagged in the review markdown
- --packed / --context-window flags reserved as placeholders (documented as such)
tests/ — 28 pytest tests using lavfi-synthesized media:
- 9 e2e tests (basic, GBK SRT, CJK output path, per-segment voice, video-only source auto-degrade, range out-of-bounds, cache hit on rerun, --no-overwrite, gap insertion)
- 3 global-voice tests including a regression that proves segments cache independently of the global voice
- 5 batch tests (auto-isolation, continue-on-error, hard abort, CSV manifest, per-job bg_volume cache distinctness)
- 11 recommender tests including a full chain: recommend → sde.run_job → final.mp4
pyproject.toml — adds dev = ["pytest>=7"] as an optional dependency.
CLAUDE.md / AGENTS.md — project guidance for AI assistants working in this repo. Happy to remove or reword if these conflict with upstream framing.
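
As a rough illustration of the per-segment cache key listed above for helpers/srt_driven_edit.py, the sketch below hashes the ffmpeg version, the encoding parameters, and the effective bg_volume together with the source range. The function name `segment_cache_key`, the payload field names, and the `crf` / `preset` parameters in the usage note are assumptions made for this example; the PR's actual key layout may differ. The global voice is deliberately left out of the key, which matches the stated goal that segments cache independently of it.

```python
import hashlib
import json


def segment_cache_key(
    ffmpeg_version: str,
    encode_params: dict,
    source_path: str,
    start: float,
    end: float,
    bg_volume: float,
) -> str:
    """Return a stable hex digest; changing any input changes the key.

    Hypothetical helper: the field names below are illustrative only.
    """
    payload = {
        "ffmpeg": ffmpeg_version,
        "encode": encode_params,
        "source": source_path,
        "range": [round(start, 3), round(end, 3)],
        "bg_volume": round(bg_volume, 3),
    }
    # sort_keys makes the serialization independent of dict insertion order.
    blob = json.dumps(payload, sort_keys=True, ensure_ascii=True)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()[:16]
```

With a key of this shape, `segment_cache_key("8.1.1", {"crf": 18, "preset": "fast"}, "a.mp4", 0.0, 1.5, 0.2)` and the same call with `bg_volume=0.0` produce different digests, so a segment rendered for one job is not reused by a job with a different background level.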

Pipeline position

script.srt + transcript.json
  --(recommend_edit_plan.py)-->
edit_plan.json + edit_plan_review.md
  --(srt_driven_edit.py)-->
final.mp4

This complements the existing transcript-first EDL flow rather than replacing it — use the new pipeline when you already have a finished narration script and want to align it to a source recording.

Reviewer notes

Pure additive; no existing helper modified.
srt_driven_edit.py reuses 4 symbols from render.py via try: from render import ... with fallbacks, so it still runs if render.py is unavailable.
Tests need ffmpeg + ffprobe on PATH; conftest.py skips the whole tests/ directory otherwise.
The smoke test at examples/srt_driven/_smoke_test.py is a no-pytest fallback that covers the parser / encoding / cache-key layers.
All ffmpeg pipeline behavior verified end-to-end on Windows with ffmpeg 8.1.1 + Python 3.12; should be portable since the code uses only stdlib + ffmpeg subprocesses.
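
The ffmpeg + ffprobe requirement noted above can be checked with a small PATH lookup. This is a minimal sketch, not the PR's conftest.py: the helper name `missing_binaries` and its injectable `which` parameter are assumptions introduced here so the check can be exercised without ffmpeg installed.

```python
import shutil
from typing import Callable, Iterable, Optional


def missing_binaries(
    names: Iterable[str] = ("ffmpeg", "ffprobe"),
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return the tool names that cannot be resolved on PATH.

    `which` is injectable so the lookup can be faked in tests.
    """
    return [name for name in names if which(name) is None]
```

In a conftest.py, one way to skip collection when tools are missing is to set pytest's `collect_ignore_glob = ["test_*.py"]` whenever `missing_binaries()` returns a non-empty list; whether the PR uses this mechanism or skip markers is not stated.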

Test plan

pip install -e ".[dev]"
python -m pytest tests/ -v (~40s on a typical machine)
Optional offline check: python examples/srt_driven/_smoke_test.py

🤖 Generated with Claude Code

Summary by cubic

Add a standalone SRT‑driven edit pipeline and a local edit‑plan recommender to align finished scripts to source footage without changing existing helpers. Enables an offline flow: script.srt + transcript.json → edit_plan.json → final.mp4.

New Features
- helpers/srt_driven_edit.py: End‑to‑end SRT‑driven pipeline (parse + validate + align → cached extract → gap insert → concat → final compose with global --voice), with safe‑ASCII temp paths, SRT encoding fallback, ffmpeg/ffprobe preflight, batch manifests, QC report, and “burn subtitles last” with short audio fades.
- helpers/recommend_edit_plan.py: Builds edit_plan.json from script.srt and Scribe word‑level transcript via local lexical scoring and greedy assignment; outputs Form A/B and a *_review.md for QA.
Dependencies
- Add dev extra: pytest>=7.

Written for commit 87439d1. Summary will update on new commits.

Independent helper that assembles a final cut by aligning source ranges to an SRT timeline, bypassing the existing transcript-based EDL flow. Use when you have a finished script (script.srt = final captions timeline) and a list of source ranges keyed by SRT id.

Pipeline: parse SRT + plan -> strict validate -> align -> extract segments (per-source ffprobe, HDR tone-map, sync tails, cache) -> gap clips for non-contiguous SRT cues -> lossless concat -> final pass with optional global voice mix + subtitle burn LAST (Hard Rule 1).

Key correctness properties:
- All intermediates land in a safe-ASCII temp work_dir; CJK / quoted user paths never reach libavfilter or the concat demuxer.
- SRT input decoded with utf-8-sig / utf-8 / gb18030 / cp936 / cp1252 fallback; cue settings (position:90% etc.) tolerated.
- Per-segment cache keyed by ffmpeg version + encoding params + effective bg_volume so encoder tweaks invalidate stale clips.
- Source streams probed once; no-audio source auto-degrades bg_volume to 0 for its segments; out-of-bounds ranges fail fast.
- Global --voice spans the whole timeline (apad/atrim to total_duration in the final compose), not per-segment; a 5s VO does not restart at every cut.
- 30ms audio fades + fps=24,setpts and aresample sync tails on every segment prevent A/V drift through many short concats.
- burn_subtitles is self-defending: unsafe subs paths are copied to a temp ASCII SRT before being fed to libavfilter.
- Batch (jobs.json / .csv) auto-isolates outputs by manifest index; --continue-on-error skips failing rows; --no-overwrite refuses to clobber existing outputs.

Includes examples (Form A array, Form B object with multi-source + voices, batch manifest, CJK SRT) and pytest coverage (14 e2e + batch tests using lavfi-synthesized media; passes against ffmpeg 8.x on Windows).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
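
The SRT encoding fallback described in this commit message can be sketched as an ordered decode loop. The encoding order is taken from the message; the function name `decode_srt_bytes` and the choice to raise `ValueError` when every encoding fails are assumptions for illustration, not the PR's actual code.

```python
SRT_ENCODINGS = ("utf-8-sig", "utf-8", "gb18030", "cp936", "cp1252")


def decode_srt_bytes(raw: bytes, encodings=SRT_ENCODINGS) -> tuple[str, str]:
    """Return (text, encoding_used), trying each encoding strictly in order."""
    last_error = None
    for enc in encodings:
        try:
            # Strict decoding: any invalid byte sequence moves on to the next codec.
            return raw.decode(enc), enc
        except UnicodeDecodeError as exc:
            last_error = exc
    raise ValueError(f"could not decode SRT with any of {encodings}") from last_error
```

Placing utf-8-sig first means a leading BOM is stripped when present, while GBK-family files that are not valid UTF-8 fall through to gb18030.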

…cript

Bridges the gap between Scribe word-level transcripts and the srt_driven_edit pipeline. Given a final-cut script.srt and a source recording's Scribe JSON, produces an edit_plan.json (Form A or B) plus a sidecar review markdown for human-in-the-loop QA.

Matching strategy is intentionally local (no LLM, no API):
1. Filter the transcript to timestamped 'word' tokens (audio_event / spacing skipped; --keep-audio-events keeps markers as context).
2. Group consecutive words into non-overlapping candidates, breaking on sentence-end punctuation, silences >= gap_threshold, or speaker change. Long candidates split at phrase punctuation, then by hard word-level windows. All edges land on word boundaries.
3. Score each (cue, candidate) pair as 0.7 * (0.6 * SequenceMatcher + 0.4 * Jaccard) + 0.3 * 1/(1+|dur_delta|/cue_dur), where Jaccard auto-switches between Latin word-token and CJK character-bigram representations.
4. Greedy assignment; --allow-reuse drops the no-reuse constraint.
5. Emit Form A (default, drop-in for srt_driven_edit --plan) or Form B; review markdown lists matched text, score, duration delta, and warnings (low score / duration mismatch / candidate-shorter-than-cue).

Hard failure modes (exit 1): any cue with no assignable candidate; malformed transcript JSON; transcript with no word tokens.
Soft failures (warnings only): low score, candidate too short for cue.

The matcher cannot understand storyline: if SRT narration words do not appear in the source transcript, scores will be low. The sidecar review.md is the manual QA surface; it is intentionally not pulled into the plan (parse_plan in srt_driven_edit stays strict).

--packed (takes_packed.md) and --context-window flags are reserved placeholders only; both raise no error but do not yet alter behavior.

Includes 11 pytest tests including a full end-to-end: recommend -> sde.run_job -> final.mp4 against lavfi-synthesized media.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
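
The scoring formula and greedy assignment in this commit message can be sketched in a few lines. The weights come from the message; everything else is an assumption for illustration: the function names, the CJK detection regex, the `(text, duration)` tuple shape, and the greedy order (all pairs ranked by descending score, each cue taking its best still-free candidate). The PR's implementation may rank or tie-break differently.

```python
import re
from difflib import SequenceMatcher

# Assumed CJK ranges (Han, kana, Hangul); the PR's detection rule is not stated.
_CJK = re.compile(r"[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]")


def _features(text: str) -> set[str]:
    """Character bigrams for CJK text, lowercase word tokens otherwise."""
    if _CJK.search(text):
        chars = [c for c in text if not c.isspace()]
        return {"".join(chars[i:i + 2]) for i in range(max(len(chars) - 1, 1))}
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a: str, b: str) -> float:
    fa, fb = _features(a), _features(b)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)


def pair_score(cue_text: str, cue_dur: float, cand_text: str, cand_dur: float) -> float:
    """0.7 * lexical blend + 0.3 * duration similarity; cue_dur must be > 0."""
    seq = SequenceMatcher(None, cue_text, cand_text).ratio()
    lexical = 0.6 * seq + 0.4 * jaccard(cue_text, cand_text)
    duration = 1.0 / (1.0 + abs(cand_dur - cue_dur) / cue_dur)
    return 0.7 * lexical + 0.3 * duration


def greedy_assign(cues, candidates, allow_reuse=False):
    """cues / candidates are lists of (text, duration).

    Returns {cue_index: (candidate_index, score)}; a cue is absent from the
    result when no free candidate remains for it.
    """
    pairs = sorted(
        (
            (pair_score(ct, cd, kt, kd), ci, ki)
            for ci, (ct, cd) in enumerate(cues)
            for ki, (kt, kd) in enumerate(candidates)
        ),
        key=lambda p: -p[0],
    )
    assigned, used = {}, set()
    for score, ci, ki in pairs:
        if ci in assigned or (ki in used and not allow_reuse):
            continue
        assigned[ci] = (ki, score)
        used.add(ki)
    return assigned
```

Because the duration term only contributes 0.3 of the score, a candidate with unrelated text but a matching length still scores low, which is the kind of match the review markdown is meant to flag.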

CLAUDE.md is auto-loaded by Claude Code when working in this directory, giving sessions a consistent picture of the project's scope, tech constraints, and out-of-bounds behaviors before the user has to say it. AGENTS.md does the same for Codex review sessions, classifying review output into must-fix / should-improve / later so suggestions are actionable rather than open-ended rewrites. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

cubic-dev-ai

3 issues found across 15 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="helpers/recommend_edit_plan.py">

<violation number="1" location="helpers/recommend_edit_plan.py:134">
P2: `--keep-audio-events` / `keep_audio_events` is dead code: audio events kept in `load_transcript_words` are silently discarded in `build_candidates`' unconditional `type != "word"` filter. The flag produces identical output in both states, misleading users who expect `(laughter)`/`(applause)` context to be included in candidate text.</violation>
</file>

<file name="helpers/srt_driven_edit.py">

<violation number="1" location="helpers/srt_driven_edit.py:553">
P2: Per-segment voice files lack preflight audio stream validation</violation>

<violation number="2" location="helpers/srt_driven_edit.py:771">
P1: Per-segment orientation scaling conflicts with concat demuxer `-c copy` stream-copy requirement, which mandates identical stream parameters (including resolution) across all inputs. In Form B multi-source plans, mixing portrait and landscape sources produces clips with mismatched resolutions (e.g. 1080×1920 vs 1920×1080), causing ffmpeg concat failure or glitched output.</violation>
</file>


cubic-dev-ai · 2026-05-19T14:37:19Z

+    vf_parts: list[str] = []
+    if is_hdr_source(seg.source_path):
+        vf_parts.append(TONEMAP_CHAIN)
+    vf_parts.append(scale_filter_for(seg.source_path))


P1: Per-segment orientation scaling conflicts with concat demuxer -c copy stream-copy requirement, which mandates identical stream parameters (including resolution) across all inputs. In Form B multi-source plans, mixing portrait and landscape sources produces clips with mismatched resolutions (e.g. 1080×1920 vs 1920×1080), causing ffmpeg concat failure or glitched output.
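
One possible mitigation for this issue, sketched under assumptions rather than taken from the PR: normalize every segment to a single job-wide canvas before the stream-copy concat, so portrait and landscape sources end up with identical dimensions. The helper name `uniform_canvas_filter` is hypothetical; the `scale`, `pad`, and `setsar` filters and the `force_original_aspect_ratio` option are standard ffmpeg.

```python
def uniform_canvas_filter(width: int, height: int) -> str:
    """Build a -vf chain that fits any input inside width x height.

    The input is scaled down or up to fit while keeping its aspect ratio,
    then padded (centered) to the exact canvas so every clip shares one
    resolution. Hypothetical helper, not part of the PR.
    """
    if width <= 0 or height <= 0 or width % 2 or height % 2:
        raise ValueError("canvas dimensions must be positive even integers")
    return (
        f"scale={width}:{height}:force_original_aspect_ratio=decrease,"
        f"pad={width}:{height}:(ow-iw)/2:(oh-ih)/2,"
        "setsar=1"
    )
```

Using the same returned string in place of a per-source scale filter would letterbox or pillarbox mismatched sources instead of emitting clips at different resolutions; choosing the canvas (for example, from the first source) is left open here.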


cubic-dev-ai · 2026-05-19T14:37:19Z

+    return out
+
+
+def build_candidates(


P2: --keep-audio-events / keep_audio_events is dead code: audio events kept in load_transcript_words are silently discarded in build_candidates' unconditional type != "word" filter. The flag produces identical output in both states, misleading users who expect (laughter)/(applause) context to be included in candidate text.
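
A minimal sketch of a filter that honors the flag, assuming each transcript entry is a dict with a `type` field of `word`, `spacing`, or `audio_event`. The function name `candidate_tokens` is hypothetical, and this is only one way the filter in `build_candidates` could be made conditional.

```python
def candidate_tokens(words: list[dict], keep_audio_events: bool = False) -> list[dict]:
    """Keep 'word' tokens, plus 'audio_event' markers when the flag is set.

    Spacing tokens are always dropped. Entry shape is an assumption.
    """
    allowed = {"word", "audio_event"} if keep_audio_events else {"word"}
    return [w for w in words if w.get("type") in allowed]
```

With this shape, the two flag states produce different candidate inputs whenever the transcript contains audio events, which is what a test for the flag would need to assert.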


cubic-dev-ai · 2026-05-19T14:37:19Z

+            raise SystemExit(f"source '{name}' missing on disk: {sp}")
+    for name, vp in voices_map.items():
+        if not vp.exists():
+            raise SystemExit(f"voice '{name}' missing on disk: {vp}")


P2: Per-segment voice files lack preflight audio stream validation
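
A preflight of the kind this comment asks for could probe each voice file for an audio stream before any rendering starts. The sketch below is an assumption about how that might look, not the PR's fix: the helper names are hypothetical, while the ffprobe options used (`-select_streams a`, `-show_entries stream=codec_type`, `-of json`) are standard.

```python
import json
import subprocess


def ffprobe_audio_cmd(path: str) -> list[str]:
    """ffprobe invocation that lists only audio streams as JSON."""
    return [
        "ffprobe", "-v", "error",
        "-select_streams", "a",
        "-show_entries", "stream=codec_type",
        "-of", "json", path,
    ]


def has_audio_stream(probe_json: str) -> bool:
    """True when the ffprobe JSON output reports at least one audio stream."""
    data = json.loads(probe_json)
    return any(s.get("codec_type") == "audio" for s in data.get("streams", []))


def voice_has_audio(path: str) -> bool:
    """Run ffprobe on path; a failed probe counts as 'no audio'."""
    proc = subprocess.run(
        ffprobe_audio_cmd(path), capture_output=True, text=True, check=False
    )
    return proc.returncode == 0 and has_audio_stream(proc.stdout or "{}")
```

Calling `voice_has_audio` next to the existing `vp.exists()` check would let the job fail fast with a clear message instead of failing inside the ffmpeg mix step.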


xiaogang-sudo and others added 3 commits May 19, 2026 21:00

cubic-dev-ai Bot reviewed May 19, 2026


xiaogang-sudo mentioned this pull request May 19, 2026

feat(run_episodes): batch runner for episode directories (depends on #41) #42

Open

3 tasks

xiaogang-sudo wants to merge 3 commits into browser-use:main from xiaogang-sudo:feat/srt-driven-edit
