# Video Recaptioning Task Agent

This notebook demonstrates how to use the Encord Agents Task Runner to build a video recaptioning workflow. We'll use the GPT-4o-mini model to generate three alternative captions from a human-written caption and the first frame of the video.

The workflow for this agent is as follows:

1. A human creates a caption in the first text field of a video
2. The agent is triggered and generates three alternative captions
3. The human can review and optionally correct the generated captions

This notebook will walk through:
- Setting up the required dependencies
- Understanding the ontology structure needed
- Implementing the task agent using the Runner
- Running the agent on your project

## Setup

First, let's install the required dependencies:

In [None]:
!pip install encord-agents langchain-openai openai

Now let's import the necessary libraries:

In [None]:
import os
from typing import Annotated

import numpy as np
from encord.exceptions import LabelRowError
from encord.objects.classification_instance import ClassificationInstance
from encord.objects.ontology_labels_impl import LabelRowV2
from encord.project import Project
from encord.workflow.stages.agent import AgentTask
from encord_agents.core.data_model import Frame
from encord_agents.tasks import Runner
from encord_agents.tasks.dependencies import dep_single_frame
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field

## API Keys

You'll need to set up your API keys to use this notebook. Let's configure the OpenAI API key:

In [None]:
# Set your OpenAI API key
os.environ["OPENAI_API_KEY"] = "your-api-key-here"

# Your Encord SSH key should be configured in your environment
# or you can set it here
# os.environ["ENCORD_SSH_KEY_FILE"] = "/path/to/your_private_key"
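
Hardcoding a key in a notebook makes it easy to leak when the notebook is shared. Here is a minimal sketch of an alternative, assuming the key is either already exported in your environment or typed in interactively. The helper name `ensure_env` is hypothetical and not part of any library:

```python
import os
from getpass import getpass


def ensure_env(name: str, prompt_if_missing: bool = True) -> str:
    """Return the value of environment variable `name`, prompting for it if unset."""
    value = os.environ.get(name, "").strip()
    if not value and prompt_if_missing:
        # getpass hides the input so the key does not end up in the cell output
        value = getpass(f"Enter {name}: ").strip()
    if not value:
        raise RuntimeError(f"{name} is not set")
    # Store the cleaned value so libraries reading the environment can find it
    os.environ[name] = value
    return value
```

Calling `ensure_env("OPENAI_API_KEY")` before creating the LLM would then replace the hardcoded assignment.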

## Response Model

Let's define the response model for the agent to follow:

In [None]:
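class AgentCaptionResponse(BaseModel):
    """Structured LLM output: three rephrases of the human caption."""

    rephrase_1: str = Field(description="The first rephrased caption.")
    rephrase_2: str = Field(description="The second rephrased caption.")
    rephrase_3: str = Field(description="The third rephrased caption.")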

## LLM Setup

Let's set up the LLM with the system prompt:

In [None]:
# System prompt for the LLM to follow
SYSTEM_PROMPT = """
You are a helpful assistant that rephrases captions.

I will provide you with a video caption and an image of the scene of the video. 

The captions follow this format:

"The droid picks up <cup_0> and puts it on the <table_0>."

The captions that you make should replace the tags, e.g., <cup_0>, with the actual object names.
The replacements should be consistent with the scene.

Here are three rephrases: 

1. The droid picks up the blue mug and puts it on the left side of the table.
2. The droid picks up the cup and puts it to the left of the plate.
3. The droid is picking up the mug on the right side of the table and putting it down next to the plate.

Rephrase the caption in three different ways, as above. The rephrases should be:

1. Diverse in terms of adjectives, object relations, and object positions.
2. Sound in relation to the scene. You cannot talk about objects you cannot see.
3. Short and concise. Keep it within one sentence.
"""

# Create the LLM instance with structured output
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.4)
llm_structured = llm.with_structured_output(AgentCaptionResponse)
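
The prompt asks the model to replace every tag such as `<cup_0>`, but nothing in the pipeline verifies that it did. Here is a minimal sketch of a post-check, assuming tags always look like `<name_index>` as in the example caption. `TAG_PATTERN` and `leftover_tags` are hypothetical names introduced for illustration:

```python
import re

# Assumption: tags are a word followed by an underscore and a number, e.g. <cup_0>
TAG_PATTERN = re.compile(r"<\w+_\d+>")


def leftover_tags(text: str) -> list[str]:
    """Return every tag of the form <name_index> that is still present in `text`."""
    return TAG_PATTERN.findall(text)
```

A rephrase for which `leftover_tags` returns a non-empty list could be flagged for human review instead of being trusted as-is.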

## Creating the LLM Prompt Function

Now let's create a function to prompt the LLM:

In [None]:
def prompt_gpt(caption: str, image: Frame) -> AgentCaptionResponse:
    """
    Prompt GPT with a caption and an image to get three rephrases of the caption.
    
    Args:
        caption: The original caption written by a human.
        image: The frame from the video that the caption refers to.
        
    Returns:
        A structured response with three rephrases of the original caption.
    """
    prompt = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": f"Video caption: `{caption}`"},
                image.b64_encoding(output_format="openai"),
            ],
        },
    ]
    return llm_structured.invoke(prompt)
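
For orientation, here is a sketch of what an OpenAI-style multimodal user message looks like when the image is passed inline as a base64 data URL. It uses only the standard library. That `Frame.b64_encoding(output_format="openai")` produces an `image_url` entry of this shape is an assumption, not something this notebook confirms, and `build_user_message` is a hypothetical helper:

```python
import base64


def build_user_message(caption: str, jpeg_bytes: bytes) -> dict:
    """Build a user message with a text part and an inline JPEG image part."""
    encoded = base64.b64encode(jpeg_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": f"Video caption: `{caption}`"},
            {
                # Assumed shape of an inline image part in the OpenAI chat format
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{encoded}"},
            },
        ],
    }
```

Inspecting a message like this can help when debugging unexpectedly large requests, since the whole frame travels inside the prompt.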

## Creating the Task Agent

Now let's define our task agent function that will be triggered by the Runner:

In [None]:
from encord_agents.tasks import Depends  # marks parameters the Runner should inject


def recaption_agent(
    project: Project,
    label_row: LabelRowV2,
    task: AgentTask,
    frame_content: Annotated[np.ndarray, Depends(dep_single_frame)],
) -> str:
    """
    Task agent that generates alternative captions for a video based on a human-written caption.
    
    Args:
        project: The Encord project.
        label_row: The label row containing the annotations.
        task: The agent task.
        frame_content: The content of the first frame of the video.
        
    Returns:
        The name of the pathway to follow after completing the task.
    """
    # Get the relevant ontology information
    # We expect: [human annotation, llm recaption 1, llm recaption 2, llm recaption 3]
    cap, *rs = label_row.ontology_structure.classifications
    
    # Read the existing human caption from frame 0 (a task agent has no notion of a
    # "current frame"). If several instances match, the helper below ranks those
    # with an annotation on frame 0 last, so that `[-1]` picks one of them.
    instances = label_row.get_classification_instances(filter_ontology_classification=cap, filter_frames=[0])
    if not instances:
        # Nothing to do if there are no human labels
        return "No Human Caption"
    elif len(instances) > 1:
        def order_by_current_frame_else_frame_0(instance: ClassificationInstance) -> int:
            try:
                instance.get_annotation(0)
                return 1
            except LabelRowError:
                return 0

        instance = sorted(instances, key=order_by_current_frame_else_frame_0)[-1]
    else:
        instance = instances[0]

    # Read the actual string caption
    caption = instance.get_answer()
    
    # Wrap the first frame in a Frame so it can be base64-encoded, then run it
    # and the human caption against the LLM
    response = prompt_gpt(caption, Frame(frame=0, content=frame_content))
    
    # Upsert the new captions
    for r, t in zip(
        rs, [response.rephrase_1, response.rephrase_2, response.rephrase_3]
    ):
        # Overwrite any existing re-captions
        existing_instances = label_row.get_classification_instances(filter_ontology_classification=r)
        for existing_instance in existing_instances:
            label_row.remove_classification(existing_instance)

        # Create new instances
        ins = r.create_instance()
        ins.set_answer(t, attribute=r.attributes[0])
        ins.set_for_frames(0)
        label_row.add_classification_instance(ins)

    # Save the label row
    label_row.save()
    
    # Return the pathway name to follow
    return "Completed"
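
The unpacking `cap, *rs = ...` relies on the ontology listing the human caption first, followed by the three recaption fields, and `zip` silently drops items if the counts differ. Here is a stand-alone sketch of that pairing with an explicit length check, using plain strings in place of Encord classification objects. `pair_recaptions` is a hypothetical helper:

```python
def pair_recaptions(classifications: list[str], rephrases: list[str]) -> list[tuple[str, str]]:
    """Pair each recaption field (all but the first classification) with one rephrase."""
    if not classifications:
        raise ValueError("The ontology has no classifications")
    # The first entry is the human caption; the rest are the fields for the LLM
    _human_caption, *recaption_fields = classifications
    if len(recaption_fields) != len(rephrases):
        raise ValueError(
            f"Expected {len(rephrases)} recaption fields, found {len(recaption_fields)}"
        )
    return list(zip(recaption_fields, rephrases))
```

Failing loudly here surfaces an ontology that does not match the expected layout, rather than saving a partially filled label row.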

## Setting Up the Runner

Now let's create and configure the Runner with our task agent:

In [None]:
# Initialize the Runner with your project hash
runner = Runner(project_hash="your-project-hash-here")

# Register recaption_agent for the "Recaption Video" agent stage in your workflow.
# The Runner injects the arguments declared in the function's signature.
runner.stage("Recaption Video")(recaption_agent)

## Running the Agent

Now we can run our agent on the project:

In [None]:
# Set to refresh every 60 seconds to continuously check for new tasks
runner(refresh_every=60)

## How It Works

### Workflow Requirements

For this notebook to work, your project should have:

1. An ontology with four text classifications:
   - A text classification for a human-created caption
   - Three text classifications for the LLM to fill in

2. A workflow with at least one agent stage named "Recaption Video" and at least two pathways:
   - "Completed" - for tasks that successfully generated captions
   - "No Human Caption" - for tasks that had no human caption to work with
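
These requirements can be checked before starting the Runner. Here is a sketch of such a preflight check over plain Python values. How you read the classification and pathway names from your project is deliberately left out, and `check_requirements` is a hypothetical helper:

```python
REQUIRED_PATHWAYS = {"Completed", "No Human Caption"}
REQUIRED_TEXT_CLASSIFICATIONS = 4  # one human caption plus three LLM recaptions


def check_requirements(classification_names: list[str], pathway_names: list[str]) -> list[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if len(classification_names) != REQUIRED_TEXT_CLASSIFICATIONS:
        problems.append(
            f"Expected {REQUIRED_TEXT_CLASSIFICATIONS} text classifications, "
            f"found {len(classification_names)}"
        )
    missing = REQUIRED_PATHWAYS - set(pathway_names)
    if missing:
        problems.append(f"Missing pathways: {sorted(missing)}")
    return problems
```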

### Process Flow

1. The Runner periodically checks for tasks in the "Recaption Video" stage
2. For each task, it loads the label row and checks for a human-created caption on the first frame; tasks without one follow the "No Human Caption" pathway
3. If a caption exists, it sends the caption and the first frame to GPT-4o-mini
4. The LLM generates three alternative captions
5. The agent adds these captions to the label row and saves it
6. The task moves to the "Completed" pathway
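
The routing decision in the steps above can be sketched in isolation. This is a conceptual model only, not the Runner's implementation: tasks are represented as a mapping from a made-up task id to an optional caption, and the `rephrase` callable stands in for the LLM call. `route_task` and `process_batch` are hypothetical names:

```python
from typing import Callable, Optional


def route_task(
    human_caption: Optional[str], rephrase: Callable[[str], list[str]]
) -> tuple[str, list[str]]:
    """Return the pathway name and the generated captions for a single task."""
    if not human_caption:
        # No caption to rephrase, so no LLM call is made
        return "No Human Caption", []
    return "Completed", rephrase(human_caption)


def process_batch(
    tasks: dict[str, Optional[str]], rephrase: Callable[[str], list[str]]
) -> dict[str, str]:
    """Map each task id to the pathway it would follow."""
    return {task_id: route_task(caption, rephrase)[0] for task_id, caption in tasks.items()}
```

Passing a stub such as `lambda caption: [caption] * 3` as `rephrase` lets you exercise the routing without spending API credits.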

### Agent Customization

You can customize this agent by:
- Changing the system prompt to get different types of rephrases
- Using a different LLM model
- Adjusting the temperature for more or less creative outputs
- Adding more classification fields for additional types of captions

## Conclusion

In this notebook, we created a task agent that automatically generates alternative captions for videos based on human-written descriptions. The agent uses GPT-4o-mini to create diverse, scene-consistent, and concise rephrases of the original caption.

This approach can be extended to other use cases, such as:
- Generating descriptions in multiple languages
- Creating captions optimized for different purposes (marketing, accessibility, technical documentation)
- Building a dataset of varied descriptions for training computer vision models

By using the Encord Agents Task Runner, we've created a workflow that can process tasks in batches, automatically retry failed operations, and continuously monitor for new tasks.