feat: add agent robustness improvements #449

Merged
jmacAJ merged 4 commits into aj-archipelago:main from data-angel:agent-robustness-pr
Jan 27, 2026

@data-angel

Summary

This PR adds several improvements to make the agentic system more robust and reliable:

  • Tool call timeout handling: Tools now have a configurable timeout (default 2 minutes) to prevent hanging operations
  • Result truncation: Oversized tool results are automatically truncated to prevent context overflow
  • Improved SSE streaming: Better timeout, error detection, and completion tracking for streaming responses
  • withTimeout utility: A reusable utility function for wrapping promises with timeouts
  • Context compression pathway: New sys_compress_context pathway for chat history compression
  • Comprehensive tests: Added tests for agent error handling scenarios
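
As a rough illustration of the withTimeout idea described above, a promise can be raced against a timer that rejects. This is a minimal sketch, not the code from lib/pathwayTools.js: the parameter order, the default constant name, and the error message format are assumptions made for the example.

```javascript
// Hypothetical sketch; the real withTimeout in lib/pathwayTools.js may differ.
// Assumed default: 2 minutes, matching the default stated in the PR summary.
const DEFAULT_TOOL_TIMEOUT_MS = 2 * 60 * 1000;

function withTimeout(promise, timeoutMs = DEFAULT_TOOL_TIMEOUT_MS, message = 'Operation timed out') {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${message} after ${timeoutMs}ms`)),
      timeoutMs
    );
  });
  // Whichever promise settles first wins. The timer is always cleared so a
  // finished call does not leave a pending timeout behind.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

A tool call could then be wrapped as `await withTimeout(runTool(args), toolTimeoutMs, 'Tool call timed out')`, where `runTool` and `toolTimeoutMs` are placeholder names. Note that racing only stops the caller from waiting; it does not cancel the underlying operation.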

Test plan

  • Verify tool calls timeout correctly after configured duration
  • Verify large tool results are truncated with appropriate message
  • Verify streaming responses complete properly and handle errors gracefully
  • Run new unit tests: npm test -- --grep "sys_entity_agent_errors"

- Add tool call timeout handling (default 2 min, configurable per tool)
- Add truncation for oversized tool results to prevent context overflow
- Improve SSE stream handling with timeout, error detection, and completion tracking
- Add withTimeout utility function for promise timeout wrapping
- Add sys_compress_context pathway for chat history compression
- Add comprehensive tests for agent error handling

These changes improve the reliability and stability of the agentic system
by preventing hanging tool calls and handling large responses gracefully.

Copilot AI left a comment

Pull request overview

This PR adds robustness improvements to the agentic system including tool call timeouts, result truncation, improved SSE streaming, a new context compression pathway, and comprehensive error handling tests.

Changes:

  • Added configurable tool call timeout with withTimeout utility function to prevent hanging operations
  • Implemented automatic truncation of oversized tool results (>50KB) to prevent context overflow
  • Added new sys_compress_context pathway for chat history compression when approaching context limits
  • Enhanced SSE streaming with timeout detection, error tracking, and completion signal handling
  • Added comprehensive unit tests for agent error handling scenarios
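
The truncation step mentioned above might look roughly like the following. This is a sketch under stated assumptions: the function name, the limit being counted in characters rather than bytes, and the wording of the truncation notice are all illustrative and not taken from sys_entity_agent.js.

```javascript
// Hypothetical sketch; names and message wording are assumptions.
// The review mentions a >50KB threshold; this example counts characters.
const MAX_TOOL_RESULT_CHARS = 50 * 1024;

function truncateToolResult(result, maxLength = MAX_TOOL_RESULT_CHARS) {
  // Tool results may be objects, so normalize to a string first.
  const text = typeof result === 'string' ? result : (JSON.stringify(result) ?? '');
  if (text.length <= maxLength) {
    return text;
  }
  const omitted = text.length - maxLength;
  // Keep the head of the result and tell the model that content is missing.
  return `${text.slice(0, maxLength)}\n\n[Result truncated: ${omitted} characters omitted]`;
}
```

Appending an explicit notice, rather than silently cutting the string, lets the model know that the tool output is incomplete.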

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| tests/unit/sys_entity_agent_errors.test.js | New comprehensive test suite for agent error handling, tool timeouts, result truncation, and SSE streaming logic |
| lib/pathwayTools.js | Added withTimeout utility for wrapping promises with configurable timeouts |
| pathways/system/entity/sys_entity_agent.js | Implemented tool call timeout wrapper, result truncation, and improved error handling |
| pathways/system/entity/sys_compress_context.js | New pathway for compressing chat history while preserving URLs, citations, and numerical data |
| server/pathwayResolver.js | Enhanced SSE streaming with timeout, error detection, and completion tracking |
| server/plugins/gemini15VisionPlugin.js | Added toolCallbackInvoked signal to indicate expected stream closure for tool callbacks |

Comments suppressed due to low confidence (1)

server/pathwayResolver.js:337

  • The toolCallbackInvoked flag set by the Gemini plugin (line 550 in gemini15VisionPlugin.js) is set on a local requestProgress object inside the onParse closure, but this value needs to be tracked at the handleStream function level so it can be checked after the stream closes at lines 388-403.

Add let toolCallbackInvoked = false; near line 278, and then inside onParse after line 317, add:

if (requestProgress.toolCallbackInvoked) {
  toolCallbackInvoked = true;
}

Then update the completion logic at lines 388-403 to check this flag.

                const onParse = (event) => {
                    let requestProgress = {
                        requestId
                    };

                    logger.debug(`Received event: ${event.type}`);

                    if (event.type === 'event') {
                        logger.debug('Received event!')
                        logger.debug(`id: ${event.id || '<none>'}`)
                        logger.debug(`name: ${event.name || '<none>'}`)
                        logger.debug(`data: ${event.data}`)
                        
                        // Check for error events in the stream data
                        try {
                            const eventData = JSON.parse(event.data);
                            if (eventData.error) {
                                streamErrorOccurred = true;
                                streamErrorMessage = eventData.error.message || JSON.stringify(eventData.error);
                                logger.error(`Stream contained error event: ${streamErrorMessage}`);
                            }
                        } catch {
                            // Not JSON or no error field, continue normal processing
                        }
                    } else if (event.type === 'reconnect-interval') {
                        logger.debug(`We should set reconnect interval to ${event.value} milliseconds`)
                    }

                    try {
                        requestProgress = this.modelExecutor.plugin.processStreamEvent(event, requestProgress);
                    } catch (error) {
                        streamErrorOccurred = true;
                        streamErrorMessage = error instanceof Error ? error.message : String(error);
                        logger.error(`Stream processing error: ${error instanceof Error ? error.stack || error.message : JSON.stringify(error)}`);
                        incomingMessage.off('data', processStream);
                        return;
                    }

                    try {
                        if (!streamEnded && requestProgress.data) {
                            this.publishNestedRequestProgress(requestProgress);
                            streamEnded = requestProgress.progress === 1;
                            if (streamEnded) {
                                completionSent = true;
                            }
                        }
                    } catch (error) {
                        logger.error(`Could not publish the stream message: "${event.data}", ${error instanceof Error ? error.stack || error.message : JSON.stringify(error)}`);
                    }


data-angel and others added 3 commits January 22, 2026 20:38
- Remove buggy 5-minute stream timeout that caused false positive errors
  for nested requests (completionSent was never set for nested requests)
- Add receivedSSEData tracking to avoid false warnings for non-streaming responses
- Add toolCallbackInvoked check to prevent premature stream closure during tool execution
- Fix completion logic to check all conditions before publishing
- Add MALFORMED_FUNCTION_CALL handling in Gemini plugins (both streaming and non-streaming)
  to gracefully handle cases where the model generates invalid function call JSON
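
One way to picture the MALFORMED_FUNCTION_CALL handling is a guard on the candidate's finish reason before any function call is parsed. The sketch below assumes the usual Gemini response shape (`candidates[0].finishReason`, `content.parts`); the function name, return shape, and fallback message are hypothetical and not taken from the plugins.

```javascript
// Hypothetical sketch; the actual plugin code and fallback text may differ.
function extractGeminiText(response) {
  const candidate = response?.candidates?.[0];
  if (!candidate) {
    return { text: '', malformedFunctionCall: false };
  }
  if (candidate.finishReason === 'MALFORMED_FUNCTION_CALL') {
    // The model emitted invalid function call JSON. Return a readable
    // message instead of attempting to parse or execute the call.
    return {
      text: 'The model produced an invalid function call. Please try again.',
      malformedFunctionCall: true,
    };
  }
  const parts = candidate.content?.parts ?? [];
  return {
    text: parts.map((part) => part.text ?? '').join(''),
    malformedFunctionCall: false,
  };
}
```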

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

When a provider (e.g., Gemini) opens a stream but immediately closes it
with no data, the client would hang forever. Now we:

- Track if ANY data was received from the stream (not just SSE events)
- Detect empty streams (opened but closed with no data)
- Retry up to 3 times on empty stream
- Send error completion if all retries exhausted so client doesn't hang
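
The retry behavior listed above can be sketched as a small loop. This is an illustration only: `openStream` is a hypothetical callback that opens the provider stream and reports whether any data arrived, and "retry up to 3 times" is read here as one initial attempt plus up to three retries.

```javascript
// Hypothetical sketch; openStream and the result shape are assumptions.
const MAX_EMPTY_STREAM_RETRIES = 3;

async function runWithEmptyStreamRetry(openStream, maxRetries = MAX_EMPTY_STREAM_RETRIES) {
  // Attempt 0 is the initial request; attempts 1..maxRetries are retries.
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const { receivedAnyData, result } = await openStream(attempt);
    if (receivedAnyData) {
      return { ok: true, result, attempts: attempt + 1 };
    }
  }
  // All attempts produced an empty stream. Return an error the caller can
  // publish as a completion so the client is not left waiting.
  return {
    ok: false,
    error: `Stream closed with no data after ${maxRetries} retries`,
    attempts: maxRetries + 1,
  };
}
```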

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

When finish_reason is 'tool_calls', set requestProgress.toolCallbackInvoked = true
so that pathwayResolver knows to keep the stream open for tool results instead
of ending it when the [DONE] event arrives.

This fixes a regression where tool calls were incorrectly terminating the
parent stream, breaking agentic workflows.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
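
The interaction between finish_reason, toolCallbackInvoked, and the [DONE] event described in the last commit can be sketched as follows. This is a simplified model assuming OpenAI-style chunks (`choices[0].finish_reason`); `processStreamEvent` here is a stand-in for the plugin method, and `createStreamTracker` is a hypothetical helper, not code from pathwayResolver.js. The flag is kept outside the per-event object because requestProgress is recreated for every event.

```javascript
// Hypothetical stand-in for a plugin's processStreamEvent.
function processStreamEvent(event, requestProgress) {
  if (event.data.trim() === '[DONE]') {
    requestProgress.progress = 1;
    return requestProgress;
  }
  const choice = JSON.parse(event.data).choices?.[0];
  if (choice?.delta?.content) {
    requestProgress.data = choice.delta.content;
  }
  if (choice?.finish_reason === 'tool_calls') {
    // Signal that a tool callback will continue this conversation.
    requestProgress.toolCallbackInvoked = true;
  }
  return requestProgress;
}

// Hypothetical helper: remembers the flag across events for one stream.
function createStreamTracker() {
  let toolCallbackInvoked = false;
  return {
    onEvent(event) {
      const progress = processStreamEvent(event, {});
      if (progress.toolCallbackInvoked) {
        toolCallbackInvoked = true;
      }
      return progress;
    },
    // The parent stream ends on [DONE] only if no tool callback is pending.
    shouldEndStream(progress) {
      return progress.progress === 1 && !toolCallbackInvoked;
    },
  };
}
```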
@jmacAJ jmacAJ merged commit b084b3a into aj-archipelago:main Jan 27, 2026