feat: SRT-driven edit pipeline + edit-plan recommender#41

Open
xiaogang-sudo wants to merge 3 commits into browser-use:main from xiaogang-sudo:feat/srt-driven-edit

Conversation


@xiaogang-sudo xiaogang-sudo commented May 19, 2026

Summary

Adds an independent SRT-driven editing pipeline plus a lexical recommender that bridges Scribe transcripts to it. All existing helpers (render.py, grade.py, transcribe.py, etc.) are untouched.

  • helpers/srt_driven_edit.py — full extract → gap → concat → final-compose pipeline:

    • safe-ASCII temp work dir so CJK / quoted user paths never reach libavfilter
    • SRT encoding fallback (utf-8-sig / utf-8 / gb18030 / cp936 / cp1252) and cue-settings tolerance (position:90% etc.)
    • ffmpeg + ffprobe preflight; per-source ffprobe with auto-degrade when source has no audio
    • sync tails (fps=24,setpts=PTS-STARTPTS / aresample=async=1:first_pts=0,asetpts) on every clip
    • per-segment cache keyed by ffmpeg version + encoding params + effective bg_volume
    • global --voice spans the whole output timeline (mixed in the final compose, not per-segment)
    • batch manifest (jobs.json / .csv) with auto-isolated outputs, --continue-on-error, --no-overwrite
    • QC report (per-segment drift, audio mode, disk usage, subtitle style)
    • subtitle burn LAST (Hard Rule 1), 30ms audio fades (Hard Rule 3)
  • helpers/recommend_edit_plan.py — bridges Scribe transcript JSON → edit_plan.json:

    • candidate segmentation by sentence-end punctuation / silence gap / speaker change; phrase / hard window splits for long candidates
    • local lexical scoring: 0.6 SequenceMatcher + 0.4 Jaccard (token-level for Latin, char-2-gram for CJK) blended with duration similarity
    • greedy assignment (no reuse by default)
    • emits Form A (default, drop-in for srt_driven_edit --plan) or Form B; sidecar *_review.md for human QA
    • no LLM, no API; intentionally local. The matcher cannot understand storyline — low-score matches are flagged in the review markdown
    • --packed / --context-window flags reserved as placeholders (documented as such)
  • tests/ — 28 pytest tests using lavfi-synthesized media:

    • 9 e2e tests (basic, GBK SRT, CJK output path, per-segment voice, video-only source auto-degrade, range out-of-bounds, cache hit on rerun, --no-overwrite, gap insertion)
    • 3 global-voice tests including a regression that proves segments cache independently of the global voice
    • 5 batch tests (auto-isolation, continue-on-error, hard abort, CSV manifest, per-job bg_volume cache distinctness)
    • 11 recommender tests including a full chain: recommend → sde.run_job → final.mp4
  • pyproject.toml — adds dev = ["pytest>=7"] as an optional dependency.

  • CLAUDE.md / AGENTS.md — project guidance for AI assistants working in this repo. Happy to remove or reword if these conflict with upstream framing.
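The SRT encoding fallback listed above can be pictured with a minimal sketch. The function name `decode_srt_bytes` is hypothetical and is not taken from `helpers/srt_driven_edit.py`; only the encoding order comes from the description, and the real helper may handle errors differently.

```python
from typing import Iterable, Tuple

# Order taken from the PR description; everything else here is an assumption.
FALLBACK_ENCODINGS = ("utf-8-sig", "utf-8", "gb18030", "cp936", "cp1252")


def decode_srt_bytes(
    data: bytes, encodings: Iterable[str] = FALLBACK_ENCODINGS
) -> Tuple[str, str]:
    """Return (text, encoding_used), trying each encoding in order."""
    for enc in encodings:
        try:
            # Strict decoding: any invalid byte sequence raises and we move on.
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("could not decode SRT with any fallback encoding")


# A GBK/GB18030-encoded cue is not valid UTF-8, so the third entry is used.
sample = "1\n00:00:00,000 --> 00:00:01,000\n你好\n".encode("gb18030")
text, used = decode_srt_bytes(sample)
print(used)
```

Note that `utf-8-sig` also accepts input without a BOM, so in this sketch the plain `utf-8` entry is effectively a safety net rather than a distinct path.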

Pipeline position

script.srt + transcript.json
  --(recommend_edit_plan.py)-->
edit_plan.json + edit_plan_review.md
  --(srt_driven_edit.py)-->
final.mp4

This complements the existing transcript-first EDL flow rather than replacing it — use the new pipeline when you already have a finished narration script and want to align it to a source recording.

Reviewer notes

  • Pure additive; no existing helper modified.
  • srt_driven_edit.py reuses 4 symbols from render.py via try: from render import ... with fallbacks, so it still runs if render.py is unavailable.
  • Tests need ffmpeg + ffprobe on PATH; conftest.py skips the whole tests/ directory otherwise.
  • The smoke test at examples/srt_driven/_smoke_test.py is a no-pytest fallback that covers the parser / encoding / cache-key layers.
  • All ffmpeg pipeline behavior verified end-to-end on Windows with ffmpeg 8.1.1 + Python 3.12; should be portable since the code uses only stdlib + ffmpeg subprocesses.
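As a rough illustration of the ffmpeg + ffprobe requirement, a preflight check could look like the sketch below. `missing_tools` is a hypothetical name; how `conftest.py` and the helper actually detect the binaries is not shown in this PR page.

```python
import shutil
from typing import List, Sequence


def missing_tools(names: Sequence[str] = ("ffmpeg", "ffprobe")) -> List[str]:
    """Return the executables from `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


# An empty list means both tools are available; a caller could abort
# (or a test suite could skip) when the list is non-empty.
print(missing_tools())
```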

Test plan

  • pip install -e ".[dev]"
  • python -m pytest tests/ -v (~40s on a typical machine)
  • Optional offline check: python examples/srt_driven/_smoke_test.py

🤖 Generated with Claude Code


Summary by cubic

Add a standalone SRT‑driven edit pipeline and a local edit‑plan recommender to align finished scripts to source footage without changing existing helpers. Enables an offline flow: script.srt + transcript.json → edit_plan.json → final.mp4.

  • New Features

    • helpers/srt_driven_edit.py: End‑to‑end SRT‑driven pipeline (parse + validate + align → cached extract → gap insert → concat → final compose with global --voice), with safe‑ASCII temp paths, SRT encoding fallback, ffmpeg/ffprobe preflight, batch manifests, QC report, and “burn subtitles last” with short audio fades.
    • helpers/recommend_edit_plan.py: Builds edit_plan.json from script.srt and Scribe word‑level transcript via local lexical scoring and greedy assignment; outputs Form A/B and a *_review.md for QA.
  • Dependencies

    • Add dev extra: pytest>=7.

Written for commit 87439d1.

xiaogang-sudo and others added 3 commits May 19, 2026 21:00
Independent helper that assembles a final cut by aligning source ranges
to an SRT timeline, bypassing the existing transcript-based EDL flow.
Use when you have a finished script (script.srt = final captions
timeline) and a list of source ranges keyed by SRT id.

Pipeline: parse SRT + plan -> strict validate -> align -> extract
segments (per-source ffprobe, HDR tone-map, sync tails, cache) -> gap
clips for non-contiguous SRT cues -> lossless concat -> final pass with
optional global voice mix + subtitle burn LAST (Hard Rule 1).
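The "gap clips for non-contiguous SRT cues" step can be sketched as a scan over consecutive cues. This is an assumption-laden illustration: the `(start, end)` tuple representation, the `find_gaps` name, and the tolerance `eps` are hypothetical, and the sketch does not cover whether the real pipeline also fills time before the first cue.

```python
from typing import List, Tuple

Cue = Tuple[float, float]  # (start_seconds, end_seconds); hypothetical shape


def find_gaps(cues: List[Cue], eps: float = 1e-3) -> List[Tuple[float, float]]:
    """Return (start, end) spans where one cue ends before the next begins."""
    gaps = []
    for (_, prev_end), (next_start, _) in zip(cues, cues[1:]):
        # Differences below eps are treated as contiguous.
        if next_start - prev_end > eps:
            gaps.append((prev_end, next_start))
    return gaps


# The second and third cues are separated by half a second.
print(find_gaps([(0.0, 2.0), (2.0, 3.5), (4.0, 6.0)]))
```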

Key correctness properties:
- All intermediates land in a safe-ASCII temp work_dir; CJK / quoted
  user paths never reach libavfilter or the concat demuxer.
- SRT input decoded with utf-8-sig / utf-8 / gb18030 / cp936 / cp1252
  fallback; cue settings (position:90% etc.) tolerated.
- Per-segment cache keyed by ffmpeg version + encoding params +
  effective bg_volume so encoder tweaks invalidate stale clips.
- Source streams probed once; no-audio source auto-degrades bg_volume
  to 0 for its segments; out-of-bounds ranges fail fast.
- Global --voice spans the whole timeline (apad/atrim to total_duration
  in the final compose), not per-segment — a 5s VO does not restart at
  every cut.
- 30ms audio fades + fps=24,setpts and aresample sync tails on every
  segment prevent A/V drift through many short concats.
- burn_subtitles is self-defending: unsafe subs paths are copied to a
  temp ASCII SRT before being fed to libavfilter.
- Batch (jobs.json / .csv) auto-isolates outputs by manifest index;
  --continue-on-error skips failing rows; --no-overwrite refuses to
  clobber existing outputs.
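To illustrate why encoder tweaks invalidate stale clips, here is a minimal sketch of a cache key built from the three ingredients named above. The function name, the parameter names (`crf`, `preset`), and the inclusion of source path and range in the key are assumptions for illustration; the actual key layout in `srt_driven_edit.py` is not shown here.

```python
import hashlib
import json


def segment_cache_key(
    ffmpeg_version: str,
    encode_params: dict,
    bg_volume: float,
    source: str,
    start: float,
    end: float,
) -> str:
    """Hash everything that affects the encoded clip into a short hex key."""
    payload = {
        "ffmpeg": ffmpeg_version,
        "enc": encode_params,
        "bg_volume": round(bg_volume, 4),
        "source": source,  # assumption: the key also covers what is extracted
        "range": [round(start, 3), round(end, 3)],
    }
    # sort_keys makes the key independent of dict insertion order.
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()[:16]


key = segment_cache_key("8.1.1", {"crf": 18, "preset": "fast"}, 0.2, "a.mp4", 1.0, 2.5)
print(key)
```

Under this scheme, changing only `bg_volume` (for example when a video-only source degrades it to 0) produces a different key, so a clip rendered with the old volume is never reused.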

Includes examples (Form A array, Form B object with multi-source +
voices, batch manifest, CJK SRT) and pytest coverage (14 e2e + batch
tests using lavfi-synthesized media; passes against ffmpeg 8.x on
Windows).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…cript

Bridges the gap between Scribe word-level transcripts and the
srt_driven_edit pipeline. Given a final-cut script.srt and a source
recording's Scribe JSON, produces an edit_plan.json (Form A or B) plus
a sidecar review markdown for human-in-the-loop QA.

Matching strategy is intentionally local (no LLM, no API):
  1. Filter the transcript to timestamped 'word' tokens (audio_event /
     spacing skipped; --keep-audio-events keeps markers as context).
  2. Group consecutive words into non-overlapping candidates, breaking
     on sentence-end punctuation, silences >= gap_threshold, or speaker
     change. Long candidates split at phrase punctuation, then by hard
     word-level windows. All edges land on word boundaries.
  3. Score each (cue, candidate) pair as
       0.7 * (0.6 * SequenceMatcher + 0.4 * Jaccard)
       + 0.3 * 1/(1+|dur_delta|/cue_dur)
     where Jaccard auto-switches between Latin word-token and CJK
     character-bigram representations.
  4. Greedy assignment; --allow-reuse drops the no-reuse constraint.
  5. Emit Form A (default, drop-in for srt_driven_edit --plan) or Form
     B; review markdown lists matched text, score, duration delta, and
     warnings (low score / duration mismatch / candidate-shorter-than-
     cue).
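The scoring formula in step 3 can be written out as a small sketch. The weights come from the commit message; the function names, the CJK detection regex, and the exact tokenization rules are assumptions and may differ from `recommend_edit_plan.py`.

```python
import re
from difflib import SequenceMatcher

# Assumption: "CJK" is detected via common Han, kana, and Hangul ranges.
_CJK = re.compile(r"[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]")


def _tokens(text: str) -> set:
    """Character bigrams if the text contains CJK, else lowercase word tokens."""
    if _CJK.search(text):
        chars = [c for c in text if not c.isspace()]
        bigrams = {"".join(chars[i:i + 2]) for i in range(len(chars) - 1)}
        return bigrams or set(chars)  # single-character text falls back to itself
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a: str, b: str) -> float:
    ta, tb = _tokens(a), _tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def score(cue_text: str, cand_text: str, cue_dur: float, cand_dur: float) -> float:
    """0.7 * lexical + 0.3 * duration similarity; cue_dur must be positive."""
    seq = SequenceMatcher(None, cue_text.lower(), cand_text.lower()).ratio()
    lexical = 0.6 * seq + 0.4 * jaccard(cue_text, cand_text)
    duration = 1.0 / (1.0 + abs(cand_dur - cue_dur) / cue_dur)
    return 0.7 * lexical + 0.3 * duration


print(round(score("hello world", "hello world", 2.0, 4.0), 3))
```

With identical text, the lexical term is 1 and only the duration term varies: a candidate twice as long as the cue gives a duration similarity of 1/(1+1) = 0.5, so the total is 0.7 + 0.15 = 0.85.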

Hard failure modes (exit 1): any cue with no assignable candidate;
malformed transcript JSON; transcript with no word tokens.
Soft failures (warnings only): low score, candidate too short for cue.
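The greedy assignment and its hard-failure case might look roughly like the following. This is only a sketch: whether the real code visits pairs globally by score or cue by cue is not stated here, and `greedy_assign` and its input shape are hypothetical.

```python
from typing import Dict, List, Tuple


def greedy_assign(
    scores: Dict[Tuple[int, int], float],
    cue_ids: List[int],
    allow_reuse: bool = False,
) -> Dict[int, int]:
    """Map each cue id to a candidate index, taking the best pairs first."""
    assigned: Dict[int, int] = {}
    used = set()
    # Highest score first; ties are broken by (cue, candidate) for determinism.
    for (cue, cand), _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0])):
        if cue in assigned:
            continue
        if not allow_reuse and cand in used:
            continue
        assigned[cue] = cand
        used.add(cand)
    missing = [c for c in cue_ids if c not in assigned]
    if missing:
        # Mirrors the "any cue with no assignable candidate" hard failure.
        raise SystemExit(f"no assignable candidate for cue(s): {missing}")
    return assigned


pairs = {(1, 0): 0.9, (1, 1): 0.5, (2, 0): 0.8, (2, 1): 0.4}
print(greedy_assign(pairs, [1, 2]))
```

In the example, cue 2 prefers candidate 0 but loses it to cue 1, so it falls back to candidate 1; with `allow_reuse=True` both cues would take candidate 0.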

The matcher cannot understand storyline — if SRT narration words do
not appear in the source transcript, scores will be low. The sidecar
review.md is the manual QA surface; it is intentionally not pulled
into the plan (parse_plan in srt_driven_edit stays strict).

--packed (takes_packed.md) and --context-window flags are reserved
placeholders only; both are accepted without error but do not yet alter
behavior.

Includes 11 pytest tests, among them a full end-to-end chain:
recommend -> sde.run_job -> final.mp4 against lavfi-synthesized media.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

CLAUDE.md is auto-loaded by Claude Code when working in this directory,
giving sessions a consistent picture of the project's scope, tech
constraints, and out-of-bounds behaviors before the user has to say it.

AGENTS.md does the same for Codex review sessions, classifying review
output into must-fix / should-improve / later so suggestions are
actionable rather than open-ended rewrites.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@cubic-dev-ai cubic-dev-ai Bot left a comment


3 issues found across 15 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="helpers/recommend_edit_plan.py">

<violation number="1" location="helpers/recommend_edit_plan.py:134">
P2: `--keep-audio-events` / `keep_audio_events` is dead code: audio events kept in `load_transcript_words` are silently discarded in `build_candidates`' unconditional `type != "word"` filter. The flag produces identical output in both states, misleading users who expect `(laughter)`/`(applause)` context to be included in candidate text.</violation>
</file>

<file name="helpers/srt_driven_edit.py">

<violation number="1" location="helpers/srt_driven_edit.py:553">
P2: Per-segment voice files lack preflight audio stream validation</violation>

<violation number="2" location="helpers/srt_driven_edit.py:771">
P1: Per-segment orientation scaling conflicts with concat demuxer `-c copy` stream-copy requirement, which mandates identical stream parameters (including resolution) across all inputs. In Form B multi-source plans, mixing portrait and landscape sources produces clips with mismatched resolutions (e.g. 1080×1920 vs 1920×1080), causing ffmpeg concat failure or glitched output.</violation>
</file>


vf_parts: list[str] = []
if is_hdr_source(seg.source_path):
    vf_parts.append(TONEMAP_CHAIN)
vf_parts.append(scale_filter_for(seg.source_path))

@cubic-dev-ai cubic-dev-ai Bot May 19, 2026


P1: Per-segment orientation scaling conflicts with concat demuxer -c copy stream-copy requirement, which mandates identical stream parameters (including resolution) across all inputs. In Form B multi-source plans, mixing portrait and landscape sources produces clips with mismatched resolutions (e.g. 1080×1920 vs 1920×1080), causing ffmpeg concat failure or glitched output.
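One possible direction for this finding, sketched under assumptions, is to fit every segment into a single fixed canvas so that all clips share one resolution before the stream-copy concat. The `canvas_filter` name and the 1920×1080 default are hypothetical and are not part of the PR; the filter options themselves (`scale` with `force_original_aspect_ratio`, `pad`, `setsar`) are standard ffmpeg filters.

```python
def canvas_filter(width: int = 1920, height: int = 1080) -> str:
    """Build an ffmpeg -vf chain that fits any input inside a fixed canvas."""
    return (
        # Shrink or grow to fit inside the canvas without changing aspect ratio.
        f"scale={width}:{height}:force_original_aspect_ratio=decrease,"
        # Center the result and fill the remainder with the pad color (black by default).
        f"pad={width}:{height}:(ow-iw)/2:(oh-ih)/2,"
        # Normalize the sample aspect ratio so all clips agree.
        "setsar=1"
    )


print(canvas_filter())
```

A portrait source would then be pillarboxed inside the landscape canvas instead of being emitted at 1080×1920. Resolution is only one of the parameters that stream-copy concat needs to match; pixel format and frame rate would need the same treatment.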

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At helpers/srt_driven_edit.py, line 771:

<comment>Per-segment orientation scaling conflicts with concat demuxer `-c copy` stream-copy requirement, which mandates identical stream parameters (including resolution) across all inputs. In Form B multi-source plans, mixing portrait and landscape sources produces clips with mismatched resolutions (e.g. 1080×1920 vs 1920×1080), causing ffmpeg concat failure or glitched output.</comment>

<file context>
@@ -0,0 +1,1522 @@
+    vf_parts: list[str] = []
+    if is_hdr_source(seg.source_path):
+        vf_parts.append(TONEMAP_CHAIN)
+    vf_parts.append(scale_filter_for(seg.source_path))
+
+    if seg.pad_short and seg.plan_src_dur + 1e-6 < target:
</file context>

    return out


def build_candidates(

@cubic-dev-ai cubic-dev-ai Bot May 19, 2026


P2: --keep-audio-events / keep_audio_events is dead code: audio events kept in load_transcript_words are silently discarded in build_candidates' unconditional type != "word" filter. The flag produces identical output in both states, misleading users who expect (laughter)/(applause) context to be included in candidate text.
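A minimal sketch of a filter that actually honors the flag is shown below. It assumes transcript items are dicts with a `type` field (as the quoted `type != "word"` check suggests) and a `text` field, which is an assumption; `select_tokens` is a hypothetical name, not a function in the PR.

```python
from typing import Dict, List


def select_tokens(words: List[Dict], keep_audio_events: bool = False) -> List[Dict]:
    """Keep 'word' tokens, plus 'audio_event' tokens when the flag is set."""
    allowed = {"word", "audio_event"} if keep_audio_events else {"word"}
    return [w for w in words if w.get("type") in allowed]


items = [
    {"type": "word", "text": "hello"},
    {"type": "spacing", "text": " "},
    {"type": "audio_event", "text": "(laughter)"},
]
print([w["text"] for w in select_tokens(items, keep_audio_events=True)])
```

The point of the sketch is that the same predicate has to be applied wherever tokens are filtered; a second unconditional `type != "word"` check downstream would cancel the flag again.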

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At helpers/recommend_edit_plan.py, line 134:

<comment>`--keep-audio-events` / `keep_audio_events` is dead code: audio events kept in `load_transcript_words` are silently discarded in `build_candidates`' unconditional `type != "word"` filter. The flag produces identical output in both states, misleading users who expect `(laughter)`/`(applause)` context to be included in candidate text.</comment>

<file context>
@@ -0,0 +1,561 @@
+    return out
+
+
+def build_candidates(
+    words: list[dict],
+    *,
</file context>

        raise SystemExit(f"source '{name}' missing on disk: {sp}")
for name, vp in voices_map.items():
    if not vp.exists():
        raise SystemExit(f"voice '{name}' missing on disk: {vp}")

@cubic-dev-ai cubic-dev-ai Bot May 19, 2026


P2: Per-segment voice files lack preflight audio stream validation
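A preflight of the kind this finding asks for could be sketched as follows, assuming ffprobe's JSON output is used. The helper names (`ffprobe_cmd`, `has_audio_stream`, `require_audio`) are hypothetical; the PR's existing per-source probe may already expose something equivalent that would be the better place to hook in.

```python
import json
import subprocess
from pathlib import Path


def ffprobe_cmd(path: Path) -> list:
    """Command line that asks ffprobe for all streams as JSON."""
    return ["ffprobe", "-v", "error", "-show_streams", "-of", "json", str(path)]


def has_audio_stream(probe: dict) -> bool:
    """True if the parsed ffprobe JSON lists at least one audio stream."""
    return any(s.get("codec_type") == "audio" for s in probe.get("streams", []))


def require_audio(name: str, path: Path) -> None:
    """Fail fast, in the style of the existing existence checks."""
    result = subprocess.run(
        ffprobe_cmd(path), capture_output=True, text=True, check=True
    )
    if not has_audio_stream(json.loads(result.stdout)):
        raise SystemExit(f"voice '{name}' has no audio stream: {path}")


# The parsing step can be exercised without running ffprobe:
print(has_audio_stream({"streams": [{"codec_type": "video"}]}))
```

Splitting the command construction and the JSON check from the subprocess call keeps the validation logic testable on machines without ffprobe.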

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At helpers/srt_driven_edit.py, line 553:

<comment>Per-segment voice files lack preflight audio stream validation</comment>

<file context>
@@ -0,0 +1,1522 @@
+            raise SystemExit(f"source '{name}' missing on disk: {sp}")
+    for name, vp in voices_map.items():
+        if not vp.exists():
+            raise SystemExit(f"voice '{name}' missing on disk: {vp}")
+    if legacy_default_source is not None and not legacy_default_source.exists():
+        raise SystemExit(f"--source missing on disk: {legacy_default_source}")
</file context>
