This document outlines the internal architecture and operational flow of the Omni-Agent, an autonomous system designed for UI interaction and task automation.
The agent operates on a graph-based workflow, managing state and transitioning between specialized AI agents to understand user requests, perceive UI elements, plan actions, and execute them in a continuous loop.
At the heart of the system is a State object. This object tracks all relevant information throughout the lifecycle of a user request, including:
- The original user request and expected outcome.
- The current plan (list of tasks) and the index of the current task.
- Outputs from various agents (Search Agent guide, Image Agent description, Planning Agent's new tasks, Action Agent's result).
- The current screenshot, detected UI elements, and last action performed.
- Agent histories (for Image and Planning agents).
- Error messages.
All agents and graph nodes read from and write to this state, ensuring a consistent view of the ongoing process.
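A minimal sketch of such a state container, assuming a LangGraph-style `TypedDict` (field names follow the keys referenced throughout this document; the exact types are illustrative):

```python
from typing import Any, List, Optional, TypedDict


class AgentState(TypedDict, total=False):
    # Original user intent
    original_request: str
    original_expected_output: str

    # Plan bookkeeping
    task_list: List[dict]          # each sub-task: {"request": ..., "expected_output": ...}
    current_task_index: int
    plan_mode: str                 # "initial" or "replan"
    step: int

    # Agent outputs
    search_agent_guide: str
    image_agent_output: str
    newly_planned_tasks: Any       # list of sub-tasks or the string "continue"
    action_agent_tool_call_name: str
    action_result: str

    # Perception and last action
    current_screenshot: Any        # e.g., a PIL.Image or raw bytes
    current_elements: List[dict]   # detected UI elements (coordinates, content, type)
    last_action_done: str

    # Histories and errors
    image_agent_history: List[dict]
    planning_agent_history: List[dict]
    error_message: Optional[str]
```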
`main.py` is responsible for setting up and running the main operational loop defined using a state graph (e.g., LangGraph). It initializes agents and defines the sequence of operations:
- Search (Once): A `SearchAgent` first processes the initial request to generate a high-level guide.
- Main Loop: The workflow then enters a continuous loop:
  a. Screenshot: Captures the current screen and identifies UI elements.
  b. Image Analysis: An `ImageAgent` analyzes the screenshot to provide a textual description and identify interactable elements, maintaining a history to note changes.
  c. Planning: A `PlanningAgent` uses the search guide (initially), the image analysis, and its own conversational history (of previous plans and actions) to decide on the next steps. It either generates a new list of sub-tasks or outputs "continue" if the previous plan is still valid and has pending steps.
  d. Action: If there is a sub-task to perform, an `ActionAgent` takes the sub-task details, the image analysis, detected UI elements, and the direct screenshot to execute a specific UI interaction (e.g., click, type) using low-level input functions.
  e. The loop then repeats from the Screenshot step.
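As a rough illustration, the graph wiring in `main.py` might look like the following. This is a sketch, not the project's actual code: the node functions are assumed to exist elsewhere, the edge labels are hypothetical, and the `loop_entry` step is collapsed into the screenshot node here.

```python
from langgraph.graph import StateGraph, START, END

workflow = StateGraph(AgentState)

# Nodes are plain functions that read and update AgentState
workflow.add_node("search", search_node)                 # runs once
workflow.add_node("screenshot", screenshot_node)
workflow.add_node("image_agent", image_agent_node)
workflow.add_node("planning_agent", planning_agent_node)
workflow.add_node("process_planning_output", process_planning_output_node)
workflow.add_node("action_agent", action_agent_node)
workflow.add_node("update_after_action", update_after_action_node)

# One-time search, then the perception -> planning -> action loop
workflow.add_edge(START, "search")
workflow.add_edge("search", "screenshot")
workflow.add_edge("screenshot", "image_agent")
workflow.add_edge("image_agent", "planning_agent")
workflow.add_edge("planning_agent", "process_planning_output")

# Either act on the next sub-task, loop back for a fresh screenshot, or stop on error
workflow.add_conditional_edges(
    "process_planning_output",
    should_action_or_loop,
    {"action": "action_agent", "loop": "screenshot", "end": END},
)
workflow.add_edge("action_agent", "update_after_action")
workflow.add_edge("update_after_action", "screenshot")

app = workflow.compile()
```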
The system employs several AI-driven agents:
- `SearchAgent`
  - Responsibility: To perform an initial web search based on the user's request and expected output.
  - Inputs: `original_request`, `original_expected_output`.
  - Process: Uses a generative AI model with search capabilities.
  - Outputs: `search_agent_guide` (a brief textual guide for completing the task).
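A hedged sketch of how such an agent could be shaped; the injected `generate_with_search` callable is hypothetical, standing in for whatever search-grounded model call the project actually uses:

```python
class SearchAgent:
    """Produces a brief, high-level guide for completing the user's request."""

    def __init__(self, generate_with_search):
        # Assumed: a callable wrapping a generative model with web-search grounding
        self._generate_with_search = generate_with_search

    def run(self, original_request: str, original_expected_output: str) -> str:
        prompt = (
            "You are guiding a UI-automation agent.\n"
            f"Request: {original_request}\n"
            f"Expected output: {original_expected_output}\n"
            "Search the web if needed and return a short step-by-step guide."
        )
        search_agent_guide = self._generate_with_search(prompt)
        return search_agent_guide
```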
- `ImageAgent`
  - Responsibility: To analyze the current screenshot of the UI.
  - Inputs: The current screenshot image.
  - Process: Uses a generative AI model to:
    - Provide a concise description of the GUI shown.
    - List visible, interactable UI elements. It maintains a history to notice changes from previous screenshots.
  - Outputs: `image_agent_output` (textual description and element list).
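An illustrative shape for the history-keeping analysis step; the multimodal `generate` callable is hypothetical:

```python
class ImageAgent:
    """Describes the current GUI and lists interactable elements, tracking changes over time."""

    def __init__(self, multimodal_generate):
        self._generate = multimodal_generate  # assumed: (prompt, image, history) -> str
        self.history: list[dict] = []

    def run(self, current_screenshot) -> str:
        prompt = (
            "Describe the GUI in this screenshot and list visible, interactable elements. "
            "Note anything that changed compared to the previous screenshots."
        )
        image_agent_output = self._generate(prompt, current_screenshot, self.history)
        # Keep a running history so the next call can reason about UI changes
        self.history.append({"role": "assistant", "content": image_agent_output})
        return image_agent_output

    def reset_history(self) -> None:
        self.history.clear()
```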
- `PlanningAgent`
  - Responsibility: To devise a sequence of actionable sub-tasks or decide to continue with an existing plan.
  - Inputs (varies by `plan_mode`):
    - Initial: `original_request`, `original_expected_output`, `search_agent_guide`, `image_agent_output`.
    - Replan: `original_request`, `original_expected_output`, `image_agent_output`, `last_action_done`, `step`.
    - Crucially, it uses its own conversational history to recall the previously generated plan.
  - Process: Uses a generative AI model. Based on the inputs and its history, it either:
    - Generates a new list of sub-tasks (each with a request and expected output).
    - Outputs the string "continue" if it deems the prior plan (from its history) still viable with pending steps.
  - Outputs: `newly_planned_tasks` (either the list of sub-tasks or the string "continue").
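A sketch of how the two `plan_mode` branches might assemble their prompts. The `generate_with_history` callable is hypothetical, and the JSON parsing assumes the model was instructed to reply with a JSON list of sub-tasks:

```python
import json
from typing import List, Union


class PlanningAgent:
    """Returns either a new list of sub-tasks or the literal string 'continue'."""

    def __init__(self, generate_with_history):
        self._generate = generate_with_history  # assumed: (prompt, history) -> str
        self.history: list[dict] = []

    def run(self, state: dict) -> Union[str, List[dict]]:
        if state.get("plan_mode", "initial") == "initial":
            prompt = (
                f"Request: {state['original_request']}\n"
                f"Expected output: {state['original_expected_output']}\n"
                f"Search guide: {state['search_agent_guide']}\n"
                f"Current UI: {state['image_agent_output']}\n"
                "Plan a JSON list of sub-tasks (each with a request and expected output)."
            )
        else:  # replan
            prompt = (
                f"Request: {state['original_request']}\n"
                f"Expected output: {state['original_expected_output']}\n"
                f"Current UI: {state['image_agent_output']}\n"
                f"Last action: {state['last_action_done']} (step {state['step']})\n"
                "Either revise the plan from our conversation as a JSON list, or answer "
                "'continue' if the existing plan still applies and has pending steps."
            )
        raw = self._generate(prompt, self.history)
        self.history.append({"role": "assistant", "content": raw})
        if raw.strip().lower() == "continue":
            return "continue"
        # Assumes a JSON-formatted reply; real code would validate and handle parse errors
        return json.loads(raw)
```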
- `ActionAgent`
  - Responsibility: To execute the current sub-task identified from the `task_list`.
  - Inputs: The current sub-task's request and expected output, `image_agent_output`, `current_elements` (from the screenshot node), and the current screenshot image directly.
  - Process:
    - Uses a generative AI model (with function calling) to determine the specific UI interaction needed (e.g., click, type).
    - The model's decision is informed by the sub-task, the textual UI description, the structured list of UI elements, and the direct visual context from the screenshot.
    - Calls low-level functions (from `src/utils/input_functions.py`) for actual UI interaction.
  - Outputs: `action_agent_tool_call_name` and `action_result` (the outcome of the function call).
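A sketch of the dispatch from the model's chosen tool call to the low-level input functions. The tool-call extraction itself is omitted, and the function names simply mirror what `src/utils/input_functions.py` is described as providing:

```python
from src.utils import input_functions

# Map tool-call names (as declared in function_definitions.py) to the actual input functions
ACTION_DISPATCH = {
    "click": lambda args: input_functions.click(args["x"], args["y"]),
    "type": lambda args: input_functions.type(args["text"]),
}


def execute_tool_call(tool_call_name: str, tool_call_args: dict) -> dict:
    """Run the UI interaction chosen by the ActionAgent's model and report the outcome."""
    handler = ACTION_DISPATCH.get(tool_call_name)
    if handler is None:
        return {"action_agent_tool_call_name": tool_call_name,
                "action_result": f"Unknown tool: {tool_call_name}"}
    try:
        result = handler(tool_call_args)
        return {"action_agent_tool_call_name": tool_call_name,
                "action_result": str(result)}
    except Exception as exc:  # surface failures back into the state rather than crashing the loop
        return {"action_agent_tool_call_name": tool_call_name,
                "action_result": f"Error: {exc}"}
```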
In addition to the agents, the graph defines several nodes and edge functions:
- `screenshot_node`: Takes the screenshot, performs Omni processing (if models are available) to identify elements, and updates the state with `current_screenshot` and `current_elements`.
- `process_planning_output_node`: Handles the `PlanningAgent`'s output. If new tasks are provided, it updates the main `task_list`. If "continue" is received and the current `task_list` is exhausted, it triggers a history refresh for the `PlanningAgent` and `ImageAgent` and resets `plan_mode` to "initial".
- `update_after_action_node`: Updates the state with `last_action_done` and `step` after an action is completed and advances `current_task_index`.
- `should_action_or_loop` (conditional edge logic): Directs flow to `action_agent_node` if there is an actionable task, or back to `loop_entry` (effectively to `screenshot_node`) if no task is pending (e.g., after "continue" on an exhausted list or an empty plan from the planner).
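As an illustration, the conditional edge can be expressed as a plain function over the state; the return labels match the hypothetical edge mapping shown in the earlier graph sketch:

```python
def should_action_or_loop(state: dict) -> str:
    """Decide whether to act on a pending sub-task, loop for a new screenshot, or stop."""
    if state.get("error_message"):
        return "end"

    task_list = state.get("task_list") or []
    index = state.get("current_task_index", 0)

    # A pending sub-task exists: hand it to the ActionAgent
    if index < len(task_list):
        return "action"

    # No actionable task (exhausted list or empty plan): take a fresh screenshot and replan
    return "loop"
```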
Supporting utility modules:
- `Omni_loader.py`: Loads AI models for screen understanding (SOM for object detection, captioning models).
- `screenshot.py`: Captures screenshots and uses Omni models (via `model_helpers.py`) to detect UI elements, their content, coordinates, and types.
- `model_helpers.py`: Assumed to contain functions for running object detection and image captioning models; not directly modified in this refactor but used by `screenshot.py`.
- `input_functions.py`: Provides low-level UI interaction functions like `click(x, y)`, `type(text)`, etc., callable by the `ActionAgent`.
- `function_definitions.py`: Contains JSON schemas for the functions available to the `ActionAgent`'s model.
- `config/settings.py`: Stores configurations like API keys, model paths, and prompt paths.
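For context, the low-level input functions could be thin wrappers over a desktop-automation library. A minimal sketch assuming `pyautogui` (the backend actually used by `input_functions.py` is not specified here):

```python
import pyautogui


def click(x: int, y: int) -> str:
    """Move to screen coordinates (x, y) and left-click."""
    pyautogui.click(x=x, y=y)
    return f"clicked at ({x}, {y})"


def type(text: str) -> str:  # noqa: A001 - mirrors the documented function name
    """Type the given text into the currently focused element."""
    pyautogui.write(text, interval=0.02)
    return f"typed {len(text)} characters"
```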
The end-to-end operational flow:
- Initialization: `main.py` loads settings and initializes all agents and the LangGraph structure.
- Search (Once): The `SearchAgent` processes the `original_request` and `original_expected_output` to produce a `search_agent_guide`.
- Main Interaction Loop: The graph transitions to a loop starting with `loop_entry`:
  a. Screenshot (`screenshot_node`): Captures the screen and processes it (potentially with Omni models) to get `current_screenshot` (image) and `current_elements` (list of UI element data such as coordinates, content, and type).
  b. Image Analysis (`image_agent_node`): The `ImageAgent` receives `current_screenshot`, analyzes it using its history, and produces `image_agent_output` (a textual description of the UI and interactable elements).
  c. Planning (`planning_agent_node`): The `PlanningAgent` takes the current context (`image_agent_output`, plus `search_agent_guide` if initial, or `last_action_done` etc. if replanning) and its conversational history to produce `newly_planned_tasks` (a list of sub-tasks or the string "continue").
  d. Process Planning Output (`process_planning_output_node`):
     - If new tasks: `task_list` in the state is updated and `current_task_index` is reset.
     - If "continue" and the `task_list` is complete/empty: the histories of the `ImageAgent` and `PlanningAgent` are cleared and `plan_mode` becomes "initial".
  e. Decision (`should_action_or_loop`):
     - If `task_list` has a pending task at `current_task_index`: proceed to Action.
     - Else (no task, or error): loop back to `loop_entry` (for a new screenshot and cycle).
  f. Action (`action_agent_node`): If a task is pending, the `ActionAgent` receives the sub-task details, `image_agent_output`, `current_elements`, and the `current_screenshot`. It determines and executes a UI function call, producing `action_result`.
  g. Update After Action (`update_after_action_node`): `last_action_done` and `step` are updated in the state, and `current_task_index` is incremented.
  h. The flow returns to `loop_entry`.
- Workflow End: The loop can be ended by an error condition in `should_action_or_loop` or if a maximum iteration count (safety break) is hit, as sketched below.
- Logging: Agents log their inputs, outputs, and significant decisions to their respective log files.
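Since the graph is intentionally cyclic, the safety break can be enforced at invocation time; with LangGraph this is typically a recursion limit on the compiled graph. The request text and limit value below are purely illustrative:

```python
initial_state = {
    "original_request": "Open the settings page and enable dark mode",
    "original_expected_output": "Dark mode is enabled",
    "plan_mode": "initial",
    "current_task_index": 0,
    "task_list": [],
}

# Each node execution counts toward the recursion limit, so this caps the loop length
final_state = app.invoke(initial_state, config={"recursion_limit": 100})
print(final_state.get("error_message") or "Finished without errors")
```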
This revised flow emphasizes continuous perception and adaptation, with planning decisions closely tied to the latest visual and textual understanding of the UI. It is acknowledged that this iterative process can be slow.