SubtitleCreationWorkflow
This guide outlines the recommended process for creating professional-grade subtitles. The system is designed around a flexible, multi-stage pipeline that begins with a simple "Staging" process. After this initial step, you can choose one of two paths based on your quality and effort requirements:
- The High-Quality Path: For perfectionists. You will manually refine the AI-generated timings to create a "ground truth" master timeline, ensuring every subtitle is flawless.
- The Automatic-Quality Path: For speed and efficiency. You will trust the AI's initial timing and segmentation, focusing only on adding essential metadata before proceeding.
Both paths converge into a final, common pipeline that uses advanced AI tools to perform deep analysis, nuanced translation, and intelligent formatting.
The process is iterative: you will alternate between running the tool and performing manual refinement. This guide will walk you through each stage in detail.
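As a rough sketch of how the stages described in this guide fit together (using the directory names introduced below):

```
Staging (quick rough draft)
        |
        +--> ManualHQWorkflow    (High-Quality Path: manually refine the AI timings)
        +--> AutomaticHQWorkflow (Automatic Path: trust the AI timings)
        |
        v
Common pipeline: transcriptions -> analysis -> translation -> arbitration
        |
        v
Final manual review -> my-video.srt
```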
A Note on Multi-Part Videos: If your video is split into multiple files (e.g., `part-a`, `part-b`), the best way to process them is by creating a simple `.vseq` file. This text file lists all the parts in order and allows the tool to treat them as a single, continuous video. This ensures the AI has full context of the entire scene, leading to better translations, and the tool will automatically handle splitting the final subtitles for you. For detailed instructions, please see the Handling Multi-Part Videos guide.

When you are using a `.vseq` file, you will have multiple `.srt` files, one for each video part (for example, `my-video-part-a.manual-edit.srt`). You will need to edit all of these files. It is essential to treat them as a single, continuous timeline. The tool understands the sequence, so context you provide in one file will carry over to the next.
- Persistent Metadata: Context tags like `{OngoingSpeakers}` or `{OngoingContext}` only need to be placed at the beginning of the relevant scene, even if that scene spans multiple files. For example, if you set `{OngoingSpeakers: Hana, Ena}` in the first subtitle of `part-a.srt`, that context will remain active for `part-b.srt` and `part-c.srt` until you explicitly change it.
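For instance, using the names above, a hypothetical opening entry of `my-video-part-a.manual-edit.srt` could look like this; the tag then stays in effect through `part-b` and `part-c` without being repeated:

```
1
00:00:01,000 --> 00:00:03,500
Good morning. {OngoingSpeakers: Hana, Ena}
```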
The first step for any new video is to get a quick, rough draft of the subtitles. This allows you to assess the video's content and decide if you want to commit to the full, high-quality process without spending time or API credits.
🧑💻 Your Task:
- Create a new folder for your project inside the `Staging` directory (e.g., `Staging/MyNewVideo/`).
- Place your video file(s) or `.vseq` file inside this new project folder.
- Run the `FSTB-CreateSubtitles.bat` script. Because your video is in the `Staging` folder, the tool will automatically use the simple workflow override.
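Assuming a single video file (the filename here is just an example), the project would look something like this before the run:

```
Staging/
└── MyNewVideo/
    └── MyNewVideo.mp4        (or several part files plus a .vseq file)
```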
⚙️ What the Tool Does:
- Audio Extraction: Extracts the audio from your video.
- Whisper Transcription: Performs a full transcription using the local Purfview-Whisper engine (`full-whisper`).
- Full Clone: Clones `full-whisper` into `full`.
- Basic Translation: Translates the text using the basic Google Translate API (`full_google`).
- Output: Generates a `MyNewVideo.srt` file for easy viewing.
At this point, you can review the basic subtitle. If you're happy with the video and want to create a professional-grade translation for it, proceed to the next step.
After the initial staging, move your entire project folder from the Staging directory into one of the two workflow directories. Your choice here will determine the next steps.
- For Maximum Quality and Control: Move your project to the `ManualHQWorkflow` folder.
- For Maximum Speed and Automation: Move your project to the `AutomaticHQWorkflow` folder.
This is also the point at which creating a `.vseq` file will give better results.
🧑💻 Your Task:
- Move your entire project folder from the `Staging` directory to the advanced directory (e.g., from `Staging/MyNewVideo/` to `ManualHQWorkflow/MyNewVideo/`).
🧑💻 Your Task:
- Run the `FSTB-CreateSubtitles.bat` script again. The tool will detect the new location, load the subfolder override, and begin the advanced process. It will see the work done in Staging and continue from there.
⚙️ What the Tool Does (First Run):
- High-Quality AI Transcription: Both advanced workflows begin by performing an advanced AI transcription (`full-ai`). It runs on your original audio, creating a much more accurate and well-segmented transcription than the initial Whisper pass.
- Output for Manual Review: The tool generates a file named `my-video.manual-input.srt`. This file contains the superior AI-generated timings, ready for your crucial manual review.

Troubleshooting: What if an AI worker fails?
From this point forward, several AI-powered workers will run (`full-ai`, `visual-analyst`, etc.). Occasionally, an AI might return a response with a syntax error. If this happens, the tool will pause and create a `.txt` file in your project folder for you to correct. For a complete guide on how to resolve these errors, please see the Troubleshooting section of The AI Process.
🧑💻 Your Task (Back to you):
- Open `my-video.manual-input.srt` in Subtitle Edit.
- [Only for ManualHQWorkflow] The file is populated with subtitle segments generated by the `full-ai` worker. While the transcribed text is included for you to refer to, your primary goal in this step is not to perfect the text, but to perfect the TIMINGS. Go through the entire file in Subtitle Edit and perform the following essential actions:
  - Add any missing subtitles. This is necessary for two main reasons:
    - Missed Dialogue: Occasionally, the transcriber AI might miss a short or quiet line of dialogue. Create a new subtitle with the correct timing for the missed line. The text you enter doesn't need to be perfect; the main goal is to create a timed segment that will be re-transcribed with higher accuracy later.
    - On-Screen Text: Text that appears on screen also needs its own timed subtitle (see the dedicated item below).
  - Adjust the start and end times of existing subtitles to tightly match the spoken dialogue in the waveform.
  - Split long subtitles into smaller, more readable chunks.
  - Merge subtitles that are part of the same sentence and are very close together.
  - Delete any unnecessary subtitles that correspond to non-speech sounds (e.g., moans, heavy breathing, background noise).
- [Only for AutomaticHQWorkflow] While doing the next steps, do not delete or alter the timings of existing subtitles, because they will be part of the "official" timings and need to remain identical to the `full-ai` timings.
- Add any missing On-Screen Text subtitles: The transcriber cannot "hear" text that appears on screen (like narration on a black screen). You must manually create a new subtitle that spans the duration the text is visible, and set its text to `{GrabOnScreenText}`. The tool will use this tag to perform OCR in the next step (see the sketch after this list).
- Add Contextual Metadata: Add the following special tags to the text of certain subtitles to guide the AI.
  - `{OngoingContext}`: On the first subtitle of a scene, provide a brief, general description of the setting. The `visual-analyst` will fill in the details later. However, if you plan to disable the `visual-analyst` worker to save API costs, you should provide a more detailed description here. Example: `{OngoingContext: The scene takes place in a bedroom. The girl is the tutor for POV-man...}`
  - `{VisualTraining}`: If you are using the `visual-analyst` with multiple characters, this tag is essential for helping the AI distinguish between them. On at least one subtitle where the characters are clearly visible, add this tag with a brief description. For example: `{VisualTraining: Hana is on the left; Ena is on the right.}`
  - `{OngoingSpeakers}`: This tag is essential for identifying the active speakers and should always be used. It tells the tool who the potential speakers are. You must add this tag to the first subtitle of a new scene and update it anytime the group of active speakers changes.

    Tip: Here's how to handle the different speaker scenarios.
    - Case 1: Single Speaker for the Entire Video. If the same person speaks throughout, you only need to add the tag once on the very first subtitle.
      - Example: `{OngoingSpeakers: Woman}`
      - Or with a name: `{OngoingSpeakers: Hana}`
    - Case 2: Consistent Group of Speakers. If the same group of people is present for the whole video, list all their names, separated by commas. Again, you only need to set this once on the first subtitle.
      - Example: `{OngoingSpeakers: Hana, Ena}`
    - Case 3: Speakers Change During the Video. This is where updating the tag is most important. Place a new `{OngoingSpeakers}` tag on the first subtitle where the group of active speakers changes. For example, if a scene progresses like this:
      - Start of video: Hana and Ena are talking.
      - At 5 minutes: Ena leaves, and only Hana is talking.
      - At 10 minutes: Ena returns, and a new character, Yui, joins them.

      You would tag your subtitles as follows:
      - Subtitle #1 (Time 00:00:01.000): `Good morning. {OngoingSpeakers: Hana, Ena}`
      - ... (many subtitles with both Hana and Ena speaking) ...
      - Subtitle #150 (Time 00:05:03.000): `Now that she's gone... {OngoingSpeakers: Hana}` (This is the first line after Ena leaves)
      - ... (many subtitles with only Hana speaking) ...
      - Subtitle #310 (Time 00:10:01.000): `We're back! {OngoingSpeakers: Hana, Ena, Yui}` (This is the first line after the group changes again)
- Important: Notice that several key tags start with the prefix `Ongoing`. This is not just a naming convention; it triggers a special behavior in the tool that allows context to be carried forward automatically across the entire video, even between different processing batches sent to the AI. For a detailed explanation, please read the Special Metadata section of the Core Concepts guide.
- The `[USER-REVISION-NEEDED]` Flag: In the first subtitle, you will see this text. This is a safety lock: the tool will halt the workflow and refuse to run the advanced workers as long as this line is present. Now that you have finished editing the file, delete this line to signal that the ground truth is ready.
- Run the `FSTB-CreateSubtitles.bat` script again.
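As a sketch of the On-Screen Text step above, a manually added cue in standard SRT form might look like this (the index and timings are placeholders; the tag is the only text the cue needs):

```
42
00:03:15,000 --> 00:03:22,000
{GrabOnScreenText}
```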
⚙️ What the Tool Does (Second Run):
- Clones `manual-edit` as `timings`.
- [Only for AutomaticHQWorkflow] Clones `full-ai` as `voice-texts`.
- [Only for ManualHQWorkflow] Performs the `singlevad-ai` transcription using the `timings`.
- [Only for ManualHQWorkflow] Clones `singlevad-ai` as `voice-texts`.
- We have now created the trinity of sources for the rest of the process: `timings`, `voice-texts`, and `manual-edit`.
Under the Hood: How `Cloning` Makes This Possible
The clone feature is mostly used to create "aliases" of existing sources.
It acts like a switchboard, allowing different workflows to plug their preferred transcription source into the main pipeline without having to redefine every subsequent step. All of the most advanced steps depend on the aliases, which can be created in different ways.
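Summarizing the steps above, both workflows end up with the same three sources, just wired from different places:

```
ManualHQWorkflow:     manual-edit  --clone--> timings
                      singlevad-ai --clone--> voice-texts   (transcribed using the timings)

AutomaticHQWorkflow:  manual-edit  --clone--> timings
                      full-ai      --clone--> voice-texts
```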
No matter which path you choose, all subsequent steps are the same. From this point on, the tool will build upon the "ground truth" you established.
🧑💻 Your Task:
- Simply run the `FSTB-CreateSubtitles.bat` script again.
⚙️ What the Tool Does:
With `timings`, `voice-texts`, and `manual-edit` in place, the tool then proceeds with the full suite of advanced analysis:
- Performs Multiple Transcriptions: Based on your `timings`, it runs several specialized transcription workers. Each provides a different "take", giving the final AI more data to work with. (See Transcribers for more details.)
  - `mergedvad`: Provides an alternative transcription by stitching audio chunks together with small silent gaps and transcribing them as a single file. This method can serve as a useful backup in cases where the `singlevad` AI struggles with a particular segment.
  - `on-screen-texts`: If you used the `{GrabOnScreenText}` tag, the tool takes a screenshot at that time and sends it to an AI to perform Optical Character Recognition (OCR).
  - `visual-analysis`: Takes a screenshot from the middle of each subtitle where someone is speaking to analyze the scene. It generates detailed metadata about character poses, appearance, and environment, and, through the `TranslationAnalysis` field, provides direct guidance to the translator on tone and subtext based on the speaker's non-verbal cues. This visual context is invaluable for the final translation phase.
- Pre-validates Speakers: Intelligently attempts to assign speakers, minimizing your workload for the next step.
🧑💻 Your Task:
- After the script has finished its automated analysis, if there are still subtitles with ambiguous speakers to be validated, the interactive Speaker Identification Tool will launch.
Your task is to use the tool to assign the correct speaker to each remaining line. Click the link for a full guide.
This is an advanced step for power users who want to review and correct the AI's analysis before the final translation. To enable it, you must enable the `metadatas-review` worker in your override file before running the process. This step will intentionally pause the process until you have reviewed the metadata.
🧑💻 Your Task:
- The `metadatas-review` worker will copy the content of `my-video.wip-metadatas.srt` to `my-video.manual-edit.srt` (the old content will be in the backup folder). It will also add the `[USER-REVISION-NEEDED]` flag to the file.
- Review and Edit: Open the new `my-video.manual-input.srt` in Subtitle Edit. It will be filled with all the AI-generated metadata. Correct any errors directly in the text. Do not change the timings.
Example: Correcting a Visual Analysis Error
After renaming the files, you open the new `my-video.perfect-vad.srt`. A subtitle at 00:10:25.000 might look like this in Subtitle Edit:
[singlevad] {VoiceText:What are you looking at?}
[validated-speakers] {Speaker:Hana}
[visual-analyst] {ParticipantsPoses:Hana is **smiling** while looking at the camera.}
[visual-analyst] {TranslationAnalysis:Her **smiling expression suggests her question is playful and teasing.**}
You watch the video and notice she is actually frowning. You can edit the text directly in the subtitle editor to correct not only the pose but also the now-incorrect analysis:
[singlevad] {VoiceText:What are you looking at?}
[validated-speakers] {Speaker:Hana}
[visual-analyst] {ParticipantsPoses:Hana is **frowning** while looking away from the camera.}
[visual-analyst] {TranslationAnalysis:Her **frowning expression suggests her question is confrontational, not playful**.}
When you save the file, your manual corrections are locked in. The tool will now use your more accurate descriptions in the final translation step, leading to a better result.
When you run the tool again, it will read this heavily annotated `manual-edit.srt` file as its new "ground truth." Because the original timing information is preserved, the workflow continues seamlessly, but now using your manual corrections to override the initial AI-generated data. The `manual-edit` source is always added last among the metadata sources, which makes it override any other source.
This is the final automated stage, where the tool uses multiple AI "personas" to generate translation options and then an "Arbitrer" to select, refine, and format the best possible result.
🧑💻 Your Task:
- Run the `FSTB-CreateSubtitles.bat` script again.
⚙️ What the Tool Does: This step involves a sophisticated multi-part process:
- Persona-Based Translation: The tool first generates several different English translations for the transcribed text using AI translators with distinct "personalities." The default configuration includes:
  - `naturalist`: Aims for a translation that is faithful to the original text but uses natural, everyday English.
  - `maverick`: A more creative persona that considers the overall scene context (including your corrections from Step 4) and is willing to deviate from a literal translation to deliver more impactful or character-appropriate dialogue.
  - (You can define your own translator personas in the config file. See Translators for more info.)
- Candidate Aggregation: The `arbitrer-choices` worker collects all the different translations (`naturalist`, `maverick`, etc.) and all the original transcriptions for each timed segment (see the illustrative sketch after this list).
- AI Arbitration: The `arbitrer-final-choice` worker receives all this information. It acts as a final editor with a strict set of rules to:
  - Select the Best Translation: It chooses the best option from the candidates provided, following a user-defined order of preference.
  - Enforce Subtitle Constraints: It ensures the final text adheres to technical limits (e.g., maximum two lines, character limits per line).
  - Add Line Breaks: It strategically inserts line breaks (`\n`) for readability.
  - Merge & Delete: It intelligently merges consecutive lines that form a single sentence and deletes subtitles that are redundant or unnecessary (e.g., simple moans). The `AutoMergeOn`/`AutoDeleteOn` properties need to be specified for these changes to be automated (which they are, by default).
- Output: The tool generates the file `my-video.final-arbitrer-choice.srt`. This file represents the AI's best effort at a polished, technically correct final subtitle.
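To make the aggregation and arbitration more concrete, here is a purely hypothetical sketch of the candidates the arbitrer might weigh for a single timed segment; the layout and wording are invented for illustration and do not reflect the tool's actual intermediate file format:

```
Segment 00:05:03,200 --> 00:05:05,800
  singlevad   (transcription) : <original-language line>
  naturalist  (candidate)     : I have to go now.
  maverick    (candidate)     : I really should get going.
  final choice                : I have to go now.
  (selected per the preference order; fits the two-line and per-line character limits)
```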
The AI has done its best, and the result is often very close to perfect. This final manual step is your chance to apply the finishing touches.
🧑💻 Your Task:
- Open `my-video.final-arbitrer-choice.srt` in your subtitle editor.
- Read through the subtitles, making any final corrections for grammar, style, or flow.
- Save the file as your final subtitle (e.g., `my-video.srt`).
Congratulations! You have completed the high-quality subtitle workflow.