zalunda edited this page Sep 17, 2025 · 5 revisions

Configuration: Infra: The AI Process (AIEngine & AIOptions)

The AIEngine object specifies the endpoint and protocol for communicating with an AI. There are two types.

AIEngineAPI

  • Description: This is the primary, automated engine. It's a generic client designed to connect to any API that follows the standard OpenAI chat completions format. This makes it compatible with a wide range of services, including Poe, Google's Gemini API (via its OpenAI-compatible endpoint), and local models served through tools like LM Studio or Ollama.
  • Use Case: This is the engine used for all automated AI tasks in the default workflow.

Example Configuration:

"Engine": {
    "$type": "AIEngineAPI",
    "BaseAddress": "https://api.poe.com/v1",
    "Model": "GPT-5",
    "APIKeyName": "APIKeyPoe",
    "UseStreaming": true,
    "RequestBodyExtension": {
      "temperature": 0.5
    }
}

Parameters:

  • BaseAddress (required): The root URL of the AI service's API (e.g., https://api.poe.com/v1 or http://localhost:1234/v1).
  • Model (required): The specific identifier for the model you want to use (e.g., gemini-1.5-pro-latest, claude-3-haiku-20240307).
  • APIKeyName (required): The name of the key in your --FSTB-SubtitleGenerator.private.config file that holds the API key for this service.
  • UseStreaming (default: true): If true, the tool will stream the response from the AI token-by-token. This provides immediate feedback and is highly recommended. Critically, it also allows the tool to salvage partially completed responses. If the connection is interrupted, you won't lose all progress and can resume from the last successfully processed item. If false, the tool waits for the entire response, and any interruption means the entire batch must be re-run.
  • ValidateModelNameInResponse (default: false): A useful debugging feature for services like LM Studio. If true, the tool will check if the model name in the AI's response matches the Model you specified, preventing issues where the server might be using a different model than expected.
  • RequestBodyExtension (default: null): An object that allows you to add any extra parameters to the body of the API request. This is how you control things like temperature, max_tokens, or other model-specific settings. Refer to the documentation of your chosen AI service for available parameters.
  • TimeOut (default: 5 minutes): The maximum time to wait for a response from the API.
  • PauseBeforeSendingRequest / PauseBeforeSavingResponse (default: false): Debugging flags. If true, the tool will pause and wait for you to press a key, allowing you to review the generated prompt or the received response.
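Putting several of these parameters together, a configuration for a local model might look like the sketch below. The BaseAddress, Model, and key name are illustrative (any OpenAI-compatible local server such as LM Studio or Ollama would work), and the TimeOut value is shown in hh:mm:ss form as an assumption — check how durations are expressed in your version:

```json
"Engine": {
    "$type": "AIEngineAPI",
    "BaseAddress": "http://localhost:1234/v1",
    "Model": "qwen2.5-14b-instruct",
    "APIKeyName": "APIKeyLocal",
    "UseStreaming": true,
    "ValidateModelNameInResponse": true,
    "TimeOut": "00:10:00",
    "RequestBodyExtension": {
      "temperature": 0.3
    }
}
```

Enabling ValidateModelNameInResponse is particularly useful here, since local servers sometimes serve a different model than the one requested.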

Troubleshooting: Handling AI Response Errors

Occasionally, an AI model may return a response that is not valid JSON, with syntax errors such as a missing comma or an unclosed bracket. The tool is designed to handle this gracefully by allowing for manual correction.

When the tool detects a JSON syntax error in an AI's response, it will halt processing for that worker and perform the following actions:

  1. A .txt file is generated in your project directory. The filename will correspond to the AI worker that failed (e.g., my-video.TODO-singlevad_0004.txt).

  2. The error message is written at the top of this file, followed by the raw (or partially corrected) text received from the AI. The file content will look similar to this:

    EndTime not received and StartTime '00:06:02.176' cannot be matched to an existing item (closest time found:00:06:02.233). Error is segment starting at line 57, position 3.
    
    [
      {
        "StartTime": "00:04:42.974",
        "VoiceText": "恥ずかしいことじゃないから。",
        "ParticipantsPoses": "Man (POV): seated facing forward. Nurse: kneeling very close in front of him, torso leaning slightly toward his lap, both hands resting near the lower edge of the frame.",
        "TranslationAnalysis": "Her close, steady kneel and gentle forward lean frame the reassurance as intimate clinical care rather than detached advice.",
      },
    ...
    

To fix this, follow these steps:

  1. Open the generated .txt file in a text editor.
  2. Read the error message at the top to understand the problem and note the line number where the error occurred.
  3. Delete all the error message lines from the top of the file, leaving only the JSON content (starting with [ or {).
  4. Go to the line number mentioned in the error message and fix the JSON syntax. Common issues include missing or extra commas, unescaped quotation marks within text values, or misplaced brackets. The error may also indicate a wrong StartTime; in that case, change the node's StartTime to the closest time reported in the error message.
  5. Save the file.
  6. Rerun the tool. The tool will detect the corrected file, parse its content, and continue the workflow without needing to call the API again for this batch.
  7. As a last resort, you can delete the file and rerun the tool, hoping the AI doesn't make the same mistake again.
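Continuing the example file shown above, a repaired version might look like the sketch below: the error lines at the top are deleted, the trailing comma after the TranslationAnalysis value is removed, and the unmatched StartTime '00:06:02.176' (further down in the file) is changed to the closest match '00:06:02.233' reported in the error message:

```json
[
  {
    "StartTime": "00:04:42.974",
    "VoiceText": "恥ずかしいことじゃないから。",
    "ParticipantsPoses": "Man (POV): seated facing forward. Nurse: kneeling very close in front of him, torso leaning slightly toward his lap, both hands resting near the lower edge of the frame.",
    "TranslationAnalysis": "Her close, steady kneel and gentle forward lean frame the reassurance as intimate clinical care rather than detached advice."
  },
...
```

Once the file parses as valid JSON and every StartTime matches an existing item, rerunning the tool will pick it up and continue.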

* * *

AIEngineChatBot

  • Description: This is a manual fallback engine for interacting with AIs that do not have an API (e.g., a standard ChatGPT session).
  • Use Case: Allows you to leverage any AI model, even if it's not API-accessible, by turning the process into a manual copy-paste workflow.

How it works:

  1. The tool generates a .txt file containing the complete prompt.
  2. A "To-Do" message appears, telling you to open the file.
  3. You copy the content from the file and paste it into the AI's chat interface.
  4. You copy the AI's full response and paste it back into the .txt file, replacing the original prompt.
  5. You run the script again. The tool will read the response from the file and proceed.

Example Configuration:

"Engine": {
    "$type": "AIEngineChatBot"
}

This engine has no specific parameters.


AIOptions: The Content Layer

The AIOptions object is responsible for assembling the prompt. It acts as a query builder, filtering the necessary items, adding instructions and context, and formatting everything into a structured request that the AI can understand.

Example Configuration:

"Options": {
    "SystemPrompt": { "$ref": "ArbitrerSystemPrompt" },
    "MetadataNeeded": "CandidatesText",
    "MetadataAlwaysProduced": "FinalText",
    "BatchSize": 150,
    "NbContextItems": 15
    // ... other formatting options
}

Parameters:

Prompt & Metadata Rules

  • SystemPrompt / UserPrompt: These properties hold the main instructions for the AI. They typically use a $ref to point to a multi-line AIPrompt object defined in the SharedObjects section.
  • MetadataNeeded (required): A powerful filter that tells the worker which timed items it should process. An item will only be included in the batch if its metadata matches the rule. The syntax supports multiple rules separated by | (OR). Each rule can be a simple key, a negative match (!), or a regular expression.
    • "MetadataNeeded": "VoiceText": Process only items that have a VoiceText field.
    • "MetadataNeeded": "VoiceText|OnScreenText": Process items that have either a VoiceText field or an OnScreenText field.
    • "MetadataNeeded": "!OnScreenText": Process all items that do not have an OnScreenText field.
    • "MetadataNeeded": "!OnScreenText,!GrabOnScreenText": An example of an AND condition. This is achieved implicitly by separating rules with a comma. This would process items that do not have OnScreenText AND do not have 'GrabOnScreenText'.
  • MetadataAlwaysProduced (required): The primary metadata key that the worker's prompt instructs the AI to generate in its response. The tool uses the presence of this key to determine if an item has already been successfully processed, allowing it to resume an interrupted job. The tool will, however, extract and record all metadata fields returned by the AI.
  • MetadataForTraining: Used to provide "few-shot" examples to the AI. The tool will find items with this metadata key and present them to the AI as training or reference material before showing the items that need to be processed. This is used by the visual-analyst to learn character appearances. It could also be used for tasks like audio diarization by providing examples of different speakers' voices.
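As a sketch of how these three metadata rules combine, a hypothetical visual-analysis worker might process items that have a VoiceText but no pose analysis yet, while learning from previously tagged reference items (the key names and $ref target here are illustrative, not taken from the default workflow):

```json
"Options": {
    "SystemPrompt": { "$ref": "VisualAnalystSystemPrompt" },
    "MetadataNeeded": "VoiceText,!ParticipantsPoses",
    "MetadataAlwaysProduced": "ParticipantsPoses",
    "MetadataForTraining": "CharacterReference"
}
```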

Batching & Context Control

  • BatchSize (default: 100000): The maximum number of items to include in a single API request. Smaller batches are faster per request but less efficient overall. Larger batches are more efficient but risk hitting the AI's context window limits. Some AI models also have a "thinking" or reasoning budget per request; in these cases, sending smaller batches may allocate more "thinking time" per item, potentially improving quality.
  • BatchSplitWindows (default: 0): Defines a flexible window for the BatchSize. The tool will look for the largest time gap between items within this margin (from BatchSize - BatchSplitWindows to BatchSize) and split the batch there. This is a powerful feature for workers like the Arbitrer, as it prevents potential merges from being missed because the two related subtitles were split across different API requests.
  • NbContextItems (default: 100000): The number of already processed items to include before the main batch. This gives the AI a "memory" of what just happened, which is crucial for maintaining conversational context.
  • NbItemsMinimumReceivedToContinue (default: 50): A safety mechanism. If an API call successfully completes but the AI returns fewer than this number of items in its response, the tool will stop. This prevents runaway API usage if the AI starts failing to follow instructions.

Prompt Formatting

  • TextBefore... / TextAfter...: These properties (TextBeforeTrainingData, TextAfterContextData, TextBeforeAnalysis, etc.) are string wrappers that structure the final prompt. They add the headings and separators that divide the prompt into logical sections (e.g., "Context from preceding nodes:", "Begin Node Analysis:"), making it easier for the AI to understand its task.
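A sketch of how these wrappers might be set (the heading strings are illustrative; the exact set of TextBefore.../TextAfter... properties available depends on the worker):

```json
"Options": {
    "TextBeforeTrainingData": "Reference material (already analyzed):",
    "TextAfterContextData": "End of context. The following items require analysis:",
    "TextBeforeAnalysis": "Begin Node Analysis:"
}
```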

The AIPrompt Object

The AIPrompt object is a simple but important component used to define the text of a system or user prompt for an AI model. It is typically defined once in the SharedObjects section of your configuration and then referenced by one or more workers using $ref.

It offers a flexible way to construct the final prompt text using a combination of the following properties:

  • Text: The core content of the prompt. This property supports the "Hybrid-JSON" multi-line format, making it ideal for long prompts.
  • TextBefore (optional): A string that is prepended to the Text property.
  • TextAfter (optional): A string that is appended to the Text property.
  • Lines: An alternative way to define the prompt as a JSON array of strings.

The tool constructs the final prompt using the following logic:

  1. Primary Method (using Text): If the Text property is present in your configuration, the tool will create the prompt by combining TextBefore + Text + TextAfter. This is the recommended approach as it is the most flexible and supports multi-line text blocks. This structure is especially powerful when using configuration overrides; instead of redefining an entire multi-line prompt just to add an instruction, your override file can specify only the TextBefore or TextAfter property. This makes minor adjustments to complex prompts clean and easy to manage.
  2. Fallback Method (using Lines): If the Text property is not provided, the tool will fall back to using the Lines property. It will join all the strings in the array with newline characters to form the final prompt.

Note: The presence of the Text property will always cause the Lines property to be ignored.

Example: Using Text

This example shows how you can build a prompt from multiple parts.

{
  "$id": "MySystemPrompt",
  "$type": "AIPrompt",
  "Text": "
=_______________________________________

# System Prompt

This is the core instruction.

_______________________________________="
}

Example: Using the Lines property

This method is useful for users who prefer to work within strict JSON standards without using the multi-line text feature.

{
  "$id": "MySystemPrompt",
  "$type": "AIPrompt",
  "Lines": [
    "",
    "# System Prompt",
    "",
    "This is the first line of my prompt.",
    "This is the second line."
  ]
}
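The override mechanism described earlier can be sketched as follows. Assuming your override file redefines the shared object under the same $id (check how overrides are merged in your version), it only needs to supply the extra property, leaving the long Text block of the base prompt untouched; the instruction text here is illustrative:

```json
{
  "$id": "MySystemPrompt",
  "$type": "AIPrompt",
  "TextAfter": "\n\nAdditional instruction: keep every translation under 40 characters."
}
```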

Dynamic Placeholders

The text within an AIPrompt object is not entirely static. The tool will automatically find and replace the following placeholders at runtime, allowing you to create more generic and reusable prompts:

  • [TranscriptionLanguage]: Is replaced by the long name of the source language (e.g., "Japanese").
  • [TranslationLanguage]: Is replaced by the long name of the target language (e.g., "English").

Example Usage in a Prompt: "Translate the following from [TranscriptionLanguage] to [TranslationLanguage]:"
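Put together, a reusable prompt object using these placeholders might look like this sketch (the $id and wording are illustrative):

```json
{
  "$id": "TranslatorSystemPrompt",
  "$type": "AIPrompt",
  "Text": "You are a professional subtitler. Translate each VoiceText field from [TranscriptionLanguage] to [TranslationLanguage], preserving tone and register."
}
```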
