Durable execution for agent-openai-advanced + agent-langgraph-advanced #195

Open
dhruv0811 wants to merge 39 commits into main from
dhruv0811/durable-execution-templates
Conversation


@dhruv0811 dhruv0811 commented Apr 20, 2026

Summary

Wires agent-openai-advanced + agent-langgraph-advanced + the shared e2e-chatbot-app-next frontend to the durable-execution contract in databricks-ai-bridge PR #416 (ML-64230).

Agent code stays unchanged. All durability logic lives in the bridge or in the chatbot proxy.

Template changes

  • pyproject.toml — pin databricks-ai-bridge + integration package to the bridge PR branch (revert to registry once released).
  • start_server.py — raise databricks_ai_bridge logger to LOG_LEVEL so [durable] messages surface in app logs. The LongRunningAgentServer subclass already exists on main.
  • No changes to agent.py. Read-time repair happens inside AsyncDatabricksSession.get_items() (openai) and _repair_loaded_checkpoint_tuple wrapping the checkpointer (langgraph) — both in the bridge.

Chatbot proxy (e2e-chatbot-app-next/server/src/index.ts)

The Express /invocations handler rewrites streaming POSTs into the bridge's background-mode contract and transparently resumes on upstream drops. Zero client-side changes.

  • Rewrite POST /invocations {stream: true} → backend {background: true, stream: true}
  • pumpStream forwards SSE frames to the browser; three pure helpers (parseSseFrame, extractResponseId, isTerminalErrorFrame) classify each frame
  • On upstream close without [DONE], the loop reconnects via GET /responses/{id}?stream=true&starting_after={lastSeq}, capped at 10 attempts
  • Short-circuits on task_failed / task_timeout terminal errors
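
The reconnect decision described above can be sketched as two pure helpers (a minimal sketch; `buildResumeUrl` and `shouldResume` are illustrative names, not the actual functions in the proxy):

```typescript
// Hypothetical sketch of the proxy's resume decision logic. Assumes the
// GET /responses/{id}?stream=true&starting_after={seq} contract and the
// 10-attempt cap described above.
const MAX_RESUME_ATTEMPTS = 10;

// Build the durable-resume URL for a dropped stream.
function buildResumeUrl(base: string, responseId: string, lastSeq: number): string {
  return `${base}/responses/${responseId}?stream=true&starting_after=${lastSeq}`;
}

// Decide whether the pump loop should reconnect after an upstream close.
function shouldResume(opts: {
  attempt: number;        // reconnects so far
  sawDone: boolean;       // upstream emitted [DONE]
  terminalError: boolean; // task_failed / task_timeout frame seen
}): boolean {
  // Terminal errors and clean completion both short-circuit the loop.
  if (opts.sawDone || opts.terminalError) return false;
  return opts.attempt < MAX_RESUME_ATTEMPTS;
}
```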

AI SDK provider (packages/ai-sdk-providers/src/request-context.ts)

  • New getApiProxyUrl() helper resolves the proxy URL:
    1. explicit API_PROXY env var wins
    2. DATABRICKS_SERVING_ENDPOINT set → direct-endpoint mode, no proxy
    3. default → route via this Node server's own /invocations (advanced-template convention)
  • Advanced templates no longer declare API_PROXY / AGENT_BACKEND_URL in databricks.yml / app.yaml — the defaults live in chatbot code.
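
The three-step resolution order can be sketched as follows (a simplified stand-in for `getApiProxyUrl`, with env access parameterized so the precedence is easy to test; the default port chain is an assumption based on the convention described above):

```typescript
// Sketch of the proxy-URL resolution described above, not the exact
// implementation in packages/ai-sdk-providers.
function resolveApiProxyUrl(env: Record<string, string | undefined>): string | undefined {
  // 1. An explicit API_PROXY always wins.
  if (env.API_PROXY) return env.API_PROXY;
  // 2. Direct-endpoint mode: talk straight to the serving endpoint, no proxy.
  if (env.DATABRICKS_SERVING_ENDPOINT) return undefined;
  // 3. Advanced-template default: this Node server's own /invocations route.
  const port = env.CHAT_APP_PORT ?? env.PORT ?? "3000";
  return `http://localhost:${port}/invocations`;
}
```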

Testing

  • UI end-to-end on Claude via deployed agent-openai-advanced: multi-tool turns (get_current_time, get_weather, get_stock_price, deep_research) interrupted mid-stream via /_debug/kill_task/{id}. Durable resume inherits completed tool pairs and injects synthetic [INTERRUPTED] output for the killed call; the agent continues without re-running completed tools. Tool cards dedupe across attempts.
  • HTTP-only crash-and-recover loop (no browser) exits status=completed, attempt_number=2.
  • Local e2e suite green on openai-advanced[autoscaling]. Other failures in the local run are environmental (Python 3.14 mlflow simulator compat, vector-embedding cold-start on the langgraph LTM test) — not PR-caused.

How to test

cd agent-openai-advanced   # or agent-langgraph-advanced

uv run quickstart --profile <profile> --lakebase-provisioned-name <instance>

# Enable the debug kill endpoint in databricks.yml / app.yaml:
#   env:
#     - name: LONG_RUNNING_ENABLE_DEBUG_KILL
#       value: "1"

databricks bundle deploy --profile <profile>
databricks bundle run agent_openai_advanced --profile <profile>

# Grant Lakebase permissions to the app SP
SP=$(databricks apps get <app-name> --profile <profile> --output json | jq -r .service_principal_client_id)
DATABRICKS_CONFIG_PROFILE=<profile> uv run python scripts/grant_lakebase_permissions.py $SP

Mid-stream crash test (UI)

  1. Open the deployed app, send a long prompt (e.g. "Do deep_research on quantum computing basics").
  2. Tail app logs and grab response_id from [/invocations] background started response_id=resp_....
  3. While the response is still streaming:
    curl -sS -X POST -H "Authorization: Bearer $TOKEN" "$APP_URL/_debug/kill_task/$RID"
  4. After the ~10s stale window, the UI continues from where it left off; completed tool calls don't re-run.

Mid-stream crash test (HTTP only)

RESP=$(curl -sS -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
  -d '{"input":[{"role":"user","content":"Write 500 words about Linux history."}],"background":true}' \
  "$APP_URL/responses")
RID=$(echo "$RESP" | jq -r .id)

sleep 3
curl -sS -X POST -H "Authorization: Bearer $TOKEN" "$APP_URL/_debug/kill_task/$RID"
sleep 12

for i in $(seq 1 15); do
  sleep 3
  curl -sS -H "Authorization: Bearer $TOKEN" "$APP_URL/responses/$RID" | jq '{status, attempt_number}'
done
# Expect: status=completed, attempt_number=2

Pre-merge checklist

  • Bridge PR #416 merged and a release cut
  • Revert pyproject.toml git-branch pins in both advanced templates to registry versions
  • Revert the APP_TEMPLATES_BRANCH default in both templates' scripts/start_app.py from dhruv0811/durable-execution-templates to main
  • Remove LONG_RUNNING_ENABLE_DEBUG_KILL=1 from deploy configs before production use

Pins both advanced templates to the ai-bridge PR branch so the long-running
agent server crash-resumes in-flight runs via heartbeat + CAS claim. Revert
the [tool.uv.sources] entry once that PR merges and a new release is cut.

Also fixes a latent IndexError in agent-openai-advanced's deduplicate_input:
when the long-running server re-invokes the handler with input=[] to resume
from the session (the agnostic resume contract validated by prototyping),
messages[-1] blew up. Now we return [] for empty input — the session already
has prior turns so there is nothing to dedupe.

No change to either template's agent.py.
Makes the bundled chat UI durable end-to-end without any client-side
changes. The Express /invocations proxy in e2e-chatbot-app-next now:

- Rewrites streaming POSTs to { ...body, background: true, stream: true },
  so every user turn persists each SSE event to Lakebase via
  LongRunningAgentServer.
- Sniffs response.id + sequence_number out of the forwarded SSE stream.
- If upstream closes before [DONE] (pod died, lost connection), the proxy
  transparently reconnects via
    GET /responses/{id}?stream=true&starting_after=N
  and resumes emitting events to the still-connected browser client. The
  browser sees one continuous stream.

Non-streaming requests and non-POST methods keep the original passthrough
behavior.
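
The rewrite itself is a small body transform; a hedged sketch (`rewriteForBackgroundMode` is an illustrative name):

```typescript
// Only streaming POST bodies are rewritten into background mode; everything
// else keeps the original passthrough behavior, as described above.
type InvocationsBody = { stream?: boolean; background?: boolean; [k: string]: unknown };

function rewriteForBackgroundMode(body: InvocationsBody): InvocationsBody {
  if (!body.stream) return body; // non-streaming: untouched passthrough
  return { ...body, background: true, stream: true };
}
```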

Also points agent-openai-advanced/scripts/start_app.py at the
dhruv0811/durable-execution-templates branch of app-templates so the new
proxy code is actually deployed (override via APP_TEMPLATES_BRANCH env
var). Revert once this lands on main.
… actually fires

Previous attempt left the proxy dead-code: the Node AI SDK honored API_PROXY
verbatim and sent requests straight to http://localhost:8000/invocations
(FastAPI), skipping the Express /invocations handler at :3000 entirely.
Confirmed in logs: requests reached the backend with {"stream": true}
but never with "background": true.

Split the two concerns across env vars:
  API_PROXY=http://localhost:3000/invocations  (AI SDK -> Express proxy)
  AGENT_BACKEND_URL=http://localhost:8000/invocations  (Express proxy -> FastAPI)

Express handler prefers AGENT_BACKEND_URL, falls back to API_PROXY for
backwards compat so existing templates don't break.
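
The fallback is a one-liner; sketched here with env access parameterized (illustrative, not the actual handler code):

```typescript
// The Express handler's backend-URL choice: prefer the dedicated var, fall
// back to API_PROXY so pre-split templates keep working.
function resolveBackendUrl(env: Record<string, string | undefined>): string | undefined {
  return env.AGENT_BACKEND_URL ?? env.API_PROXY;
}
```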
response_id is buried in the raw backend SSE stream and never surfaces to
the browser because the Vercel AI SDK re-wraps the stream as its own
message format before sending to the client. Log it on the server side
instead so test instructions can `grep 'background started response_id=' `
from apps logs. Also distinguish the startup log so it's clear the
durable-resume code path is live.

No behavior change; pure observability.
app.yaml env vars were overriding databricks.yml at runtime, so the AI SDK
was still talking directly to the Python FastAPI backend and the Express
/invocations proxy never saw the request. Keep both files in sync.
…RL to FastAPI

The script was unconditionally overwriting API_PROXY with the backend URL
right before launching the frontend, which defeated our whole durable-
resume-rewrite story: the Node AI SDK bypassed the Express /invocations
handler and streamed straight from FastAPI.

Fix: API_PROXY now points at CHAT_APP_PORT (the Express proxy), and we
default AGENT_BACKEND_URL (previously unset) to the Python backend. Use
os.environ.setdefault for AGENT_BACKEND_URL so operators can still override
via databricks.yml or app.yaml.
…resp_*

Broadens the response_id parser so it works whether the backend tags frames
with top-level response_id (preferred) or the older nested-only shape.
…tally

Matches the [/invocations] prefix so the full story is greppable from apps
logs without correlating Node and Python timestamps.
The library logger inherits from root (default WARNING) so INFO-level
lifecycle messages from LongRunningAgentServer (heartbeat, claim, resume,
stream lifecycle) were being dropped. Set both the ai-bridge logger and
the root level to LOG_LEVEL so apps logs carry the full durable-resume
story without requiring callers to tune logging themselves.
When a response is killed mid-stream, the partial assistant text that was
already rendered to the client kept receiving fresh deltas from attempt 2 —
users saw attempt-1-partial + attempt-2-full concatenated in one bubble.

Express /invocations proxy now seals the in-progress assistant message
across an attempt boundary:

1. On upstream close without [DONE], immediately append a
   '(connection interrupted — reconnecting…)' suffix delta to the active
   message so the user sees something is happening during the ~10s stale
   window.
2. On the response.resumed sentinel, emit synthetic
   response.content_part.done + response.output_item.done events for the
   active message — effectively ending the first assistant bubble at
   OpenAI Responses API level.
3. Attempt 2's natural response.output_item.added (with a fresh item_id)
   then creates a clean second bubble showing the full answer.

Tool calls naturally de-dup by call_id across attempts, so no closure
synthesis needed for them.

Also mirrors the routing + logging fixes previously applied to
agent-openai-advanced onto agent-langgraph-advanced so both templates get
durable resume with the full [durable] log lifecycle visible:

- app.yaml + databricks.yml: split API_PROXY (-> Express :3000) from
  AGENT_BACKEND_URL (-> FastAPI :8000).
- scripts/start_app.py: honor AGENT_BACKEND_URL, point API_PROXY at the
  Express proxy, clone e2e-chatbot-app-next from the durable-execution
  branch.
- agent_server/start_server.py: raise databricks_ai_bridge + root logger to
  LOG_LEVEL so [durable] INFO lines surface in apps logs.
Durable-resume can interrupt the pod between an LLM emitting tool_calls
and the SDK finishing the tool executions — the Session is left with
function_call items whose matching function_call_output never got written.
The next LLM request over that session fails:

  400 BAD_REQUEST: An assistant message with 'tool_calls' must be followed
  by tool messages responding to each 'tool_call_id'. The following
  tool_call_ids did not have response messages: call_xxx, call_yyy, ...

Piggy-back on deduplicate_input (which already touches the session each
turn) to inject synthetic function_call_output items for every orphan
function_call. Message is plain-text, so the LLM sees 'tool X was
interrupted, please retry if needed' and can decide whether to re-call
or continue. No change to agent.py.
The previous heal added synthetic function_call_output at the END of the
session (add_items only appends). When the conversation has a message
between the orphan function_call and the synthetic output, the SDK
rebuilds the LLM request as an assistant-with-tool_calls message that
doesn't have its tool responses right after it, and the API rejects with
'assistant message with tool_calls must be followed by tool messages'.

Also: the Vercel AI SDK client echoes the full conversation back each
turn. deduplicate_input drops most of it but the Runner.run path can
still re-persist prior items, leaving DUPLICATE function_call rows for
the same call_id.

Replace with a clear+rebuild sanitize pass: dedupe function_call /
function_call_output by call_id, inject synthetic outputs immediately
after any orphan function_call, clear the session, and re-add the
canonical sequence. No-op when already clean.
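
The sanitize pass is data munging over the item list; a language-agnostic sketch (written in TypeScript for consistency with the other examples here — the real code is Python in the template, and the item shapes below are assumptions):

```typescript
// Sketch of the clear+rebuild sanitize: dedupe function_call /
// function_call_output by call_id, then inject a synthetic output immediately
// after any orphan function_call. No-op when the sequence is already clean.
type Item =
  | { type: "function_call"; call_id: string }
  | { type: "function_call_output"; call_id: string; output: string }
  | { type: "message"; content: string };

function sanitizeSequence(items: Item[]): Item[] {
  const seenCalls = new Set<string>();
  const seenOutputs = new Set<string>();
  // Pass 1: drop duplicate rows by call_id.
  const deduped = items.filter((it) => {
    if (it.type === "function_call") {
      if (seenCalls.has(it.call_id)) return false;
      seenCalls.add(it.call_id);
    } else if (it.type === "function_call_output") {
      if (seenOutputs.has(it.call_id)) return false;
      seenOutputs.add(it.call_id);
    }
    return true;
  });
  // Pass 2: rebuild, closing each orphan call right where it appears.
  const result: Item[] = [];
  for (const it of deduped) {
    result.push(it);
    if (it.type === "function_call" && !seenOutputs.has(it.call_id)) {
      result.push({
        type: "function_call_output",
        call_id: it.call_id,
        output: "tool was interrupted, please retry if needed",
      });
      seenOutputs.add(it.call_id);
    }
  }
  return result;
}
```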
Keep the UI minimal but fix the doubled-text issue: when a mid-stream
kill happens, the AI SDK merges all deltas within one streamText call
into one UIMessage — so our proxy-level seal events were valid but
invisible, and attempt 2's text kept appending to attempt 1's partial.

Minimal solution:

1. Express /invocations proxy already emits response.resumed at the
   attempt boundary (unchanged).
2. chat.ts server: detect response.resumed via onChunk and forward it
   to the UI stream as { type: 'data-resumed', data: { attempt } }.
3. chat.tsx client: on 'data-resumed', call setMessages to drop all
   text parts from the last (assistant) message. Tool call parts stay
   because they dedupe by call_id naturally.

Also: fix auto-resume loop burning MAX_RESUME_ATTEMPTS on terminal
errors by exiting early when an error event with code=task_failed or
code=task_timeout comes through the proxy.

No changes to agent.py. Agnosticism tenet intact.
Your 'clean up at end of stream' idea — much more robust than relying on
mid-stream mutation sticking. On data-resumed we now snapshot the
attempt-1 text length, and in onFinish we slice exactly that many chars
off the front of the last assistant message's text parts. Whatever the AI
SDK accumulator did during streaming, the final rendered state contains
only attempt 2's content.

The mid-stream mutation wipe stays in place too — when it sticks the
text visibly clears during the 10s stale window, which is nicer UX than
waiting for onFinish. When it doesn't stick, onFinish catches it.
PreviewMessage is memoized: while loading it compares prevProps.message
to nextProps.message by reference; when not loading it deep-equals the
parts array (which short-circuits on identical references). Our previous
truncate mutated part.text in place and returned [...prev] — same
message + same parts array refs, so the memo skipped the re-render and
the old text stuck on screen even though state was technically updated.

Map to NEW part objects with sliced text and wrap a NEW message object
so both the reference check (loading path) and deep-equal (done path)
see a change and re-render.
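
The shape of the fix, sketched with simplified stand-ins for the UIMessage types (the real types come from the AI SDK; these are assumptions for illustration):

```typescript
// Truncate by building NEW part objects and a NEW message wrapper, so both
// the reference check (loading path) and deep-equal (done path) see a change.
type TextPart = { type: string; text: string };
type Message = { id: string; parts: TextPart[] };

function truncateLeading(message: Message, cut: number): Message {
  return {
    ...message, // fresh message object
    parts: message.parts.map((p) =>
      p.type === "text" ? { ...p, text: p.text.slice(cut) } : p, // fresh parts
    ),
  };
}
```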
State-level wipes were getting clobbered by the AI SDK accumulator —
ReactChatState.replaceMessage deep-clones state.message on every write(),
and activeTextParts keeps mutating the originals behind the UI's back.

Solution: transform at the VIEW layer instead of fighting the state
machine. Chat component tracks attempt1TextLen per messageId (state, not
ref, so it propagates to children). Messages maps each message through a
render-time slice that drops the leading attempt-1 chars from text parts
before passing to PreviewMessage. Creates new message + part objects so
the memo's reference check trips and the component re-renders.

onFinish still does the authoritative setMessages truncate so the
persisted-to-DB final message reflects only attempt 2. That truncate now
also clears attempt1TextLen, so the render-time slice becomes a no-op
after completion (state is already truncated).
…cution-templates

# Conflicts:
#	agent-openai-advanced/databricks.yml
Drop the [chat][onData] / [chat][onFinish] / [chat][onChunk] tracing
statements that were used to trace the attempt-1 → attempt-2 flow
while tuning the render-time slice and post-stream truncate. The
server-side Express proxy still logs resume lifecycle (background
started / resume fetch / terminal error / stream done) since that's
operationally useful; the ai-bridge backend's [durable] INFO logs
stay as-is.

Co-authored-by: Isaac
Move the per-template workarounds for mid-tool crash-resume into the
databricks-ai-bridge library and wire them in:

- agent-openai-advanced/utils.py: deduplicate_input now calls
  session.repair() (new public method on AsyncDatabricksSession) instead
  of the 100-line in-template _sanitize_session. Same behavior — dedupe
  function_call/function_call_output by call_id, inject synthetic
  outputs for orphans — just owned by the library.
- agent-langgraph-advanced/agent.py: before agent.astream, call
  build_tool_resume_repair on the checkpointer's messages and apply via
  agent.aupdate_state(..., as_node="tools"). The as_node is critical —
  without it LangGraph re-evaluates the model→{tools,END} branch from
  the updated state and crashes with KeyError: 'model'.
- agent-langgraph-advanced/agent.py: when the checkpointer already has
  a thread, only forward the latest user turn from request.input — the
  UI client (Vercel AI SDK) re-echoes the full history on every turn,
  which can re-inject orphan tool_uses from a previously-interrupted
  attempt that the client kept in its buffer.

Both pyproject.toml files now pin databricks-openai / databricks-langchain
to the same ai-bridge branch (subdirectory git sources) so the new
helpers are picked up. Temporary; revert to registry once the bridge PR
merges.

Co-authored-by: Isaac
Library side (databricks-langchain, PR #416):
- New build_tool_resume_repair_middleware() returns an AgentMiddleware whose
  before_model hook runs build_tool_resume_repair. Swaps the manual
  aget_state / aupdate_state(as_node="tools") surgery in the template for a
  one-line `middleware=[...]` arg to create_agent.
- The as_node="tools" footgun (KeyError: 'model' in the model→{tools,END}
  conditional branch re-eval) disappears entirely; repair runs inside the
  graph's own execution flow, not as external state surgery.

Template (agent-langgraph-advanced):
- init_agent: add middleware=[build_tool_resume_repair_middleware()] to
  create_agent. stream_handler drops the 8-line repair block.
- utils.py process_agent_astream_events: skip None node_data (the graph's
  updates stream emits {middleware_node: None} when the middleware is a
  no-op, which is every turn on the happy path).

UI (e2e-chatbot-app-next):
- On data-resumed from the backend, wipe text parts from the last assistant
  message in one setMessages. Tool-call parts are kept as-is (they already
  dedupe across attempts by call_id). Dropped:
    * attempt1TextLen state + per-message snapshot in onData
    * render-time text slice in Messages.tsx
    * onFinish authoritative post-stream truncate
  The AI SDK's seal-on-resume synthesis (Express proxy) still creates a
  fresh output_item_id for attempt 2, so new deltas land in a fresh text
  part — our wipe of the old text part is sufficient.

Net: -99 LOC across 4 files. Same behavior for the "delete old text,
leave tools alone" UX; substantially less state-machine choreography.

Co-authored-by: Isaac
setMessages can't wipe mid-stream — the AI SDK's activeResponse.state
is a snapshot taken at makeRequest time, and every text-delta calls
write() → this.state.replaceMessage(lastIdx, activeResponse.state.message),
which overwrites any setMessages we do. Our wipe was visible for a single
chunk then reverted.

Fix: snapshot the assistant message's parts.length at data-resumed, and
at render time hide text parts at indices BEFORE that cutoff. Tool / step
parts render normally at every index. Works for openai and langgraph
because it transforms at the view layer rather than fighting the AI SDK
state machine.
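
The render-time filter can be sketched as a pure function over the message's parts (illustrative shapes; the cutoff is the parts.length snapshot taken at data-resumed):

```typescript
// Hide text parts at indices before the resume cutoff; tool / step parts
// render normally at every index, as described above.
type Part = { type: string; text?: string };

function visibleParts(parts: Part[], resumeCutIndex: number | undefined): Part[] {
  if (resumeCutIndex === undefined) return parts; // no resume this turn
  return parts.filter((p, i) => p.type !== "text" || i >= resumeCutIndex);
}
```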

Removes server-side debug log. Keeps the minimal delete-old-text UX.

Co-authored-by: Isaac
…lper

- Removed the "_(connection interrupted — reconnecting…)_" delta block.
  Render-time slice hides attempt-1 text on resume anyway, so the suffix
  was invisible past the 10s stale window and too subtle during it.
- Extracted writeEvent(type, payload) helper; sealActiveMessage went from
  45 → 22 lines, no behavior change.
- Removed readActive() TS-widening helper (no longer needed without the
  suffix block).
- Inlined onFirstResponseId helper into its single call site.

Net: 92 lines removed, 36 added in this file.

Co-authored-by: Isaac
@dhruv0811 dhruv0811 marked this pull request as ready for review April 21, 2026 23:14
@dhruv0811 dhruv0811 requested a review from bbqiu April 21, 2026 23:20
Durability mechanics now live entirely in databricks-ai-bridge's
LongRunningAgentServer (rotate conv_id on resume + full-history input
sanitizer, see ai-bridge PR #416). Templates can drop the explicit repair
surface:

- agent-langgraph-advanced/agent.py: drop
  middleware=[build_tool_resume_repair_middleware()] from create_agent
  and the unused import. Also drop the stream_handler UI-echo dedupe
  block — the server sanitizer handles mid-history orphans end-to-end.
- agent-openai-advanced/utils.py: drop await session.repair() from
  deduplicate_input. session.repair() stays available as a public method
  for callers who want destructive session cleanup.

Net: agent.py / utils.py in both advanced templates have zero
durability-specific lines. The contract becomes "use our checkpointer/
session classes with LongRunningAgentServer — durable resume + orphan
repair is free."

Co-authored-by: Isaac
Temporarily short-circuit the resumeCutIndex write so attempt-1's text
stays visible while attempt-2 streams over it. Lets us see how the
server-side inheritance + synthetic-output prompt shape the LLM's
mid-turn continuation behavior without the visual wipe hiding what
attempt-2 actually emits.

Re-enable by uncommenting the block; the rest of the wipe plumbing
(state hook, Messages prop threading, render-time slice) is left in
place so re-enabling is a 1-line flip.

Co-authored-by: Isaac
…les resume

Server-side changes earlier in this branch (prior-attempt tool-event
inheritance + partial-stream reassembly in databricks-ai-bridge) make
the client-side "wipe attempt-1 text when resume fires" machinery
unnecessary: attempt-2's LLM sees attempt-1's work as history and
continues seamlessly instead of restarting. The wipe was also hiding
the new continuation quality from the user. Turning the wipe off in
UI testing confirmed the server-side story is sufficient.

Delete the full stack:

- packages/core/src/types.ts: drop `resumed` from CustomUIDataTypes.
- server/src/routes/chat.ts: drop writerRef + emittedResumedAttempts
  + the onChunk raw-event branch that emitted data-resumed parts.
  Trace-extraction stays; only the resume-forwarding path is removed.
- client/src/components/chat.tsx: drop resumeCutIndex state hook, the
  data-resumed onData handler (was already commented out), and the
  prop pass to <Messages/>.
- client/src/components/messages.tsx: drop resumeCutIndex prop from
  MessagesProps + its destructuring + the render-time text-part slice.

The server still emits `response.resumed` as a sentinel so the Express
proxy's sealActiveMessage() call correctly closes attempt-1's open
text part before attempt-2's fresh output_item.added creates a new
one. The proxy no longer extracts it into a UI data part.

Co-authored-by: Isaac
Remove everything that isn't strictly required for durable resume with
the server-side-only approach in ai-bridge PR #416:

- agent-langgraph-advanced/agent_server/agent.py: revert entirely. The
  test-scaffolding tools (get_weather, get_stock_price, deep_research)
  were only for crash-test harnesses; the asyncio import only existed
  to support them. User-space durability surface for this template is
  now zero lines.
- agent-openai-advanced/agent_server/agent.py: revert entirely. Drop
  the test-scaffolding tools (get_weather, get_stock_price,
  search_best_restaurants, deep_research) and asyncio import. Same
  zero-user-space result.
- agent-langgraph-advanced/agent_server/utils.py: revert. The
  "middleware nodes that no-op return None" guard was defensive
  against middleware we no longer install.
- agent-openai-advanced/agent_server/utils.py: revert. The empty-input
  guard was defensive against the old input=[] resume replay that no
  longer happens — server always replays the original input.
- e2e-chatbot-app-next/server/src/index.ts: drop the activeMessage /
  sealActiveMessage / writeEvent machinery. Was synthesizing closure
  events on response.resumed to seal attempt-1's text part for the UI
  wipe. UI wipe is gone; the AI SDK creates parts by item_id so
  attempt-2's fresh output_item.added naturally starts a new part and
  attempt-1's open part finalizes on stream end.
- Plus the earlier UI cleanup (chat.tsx, messages.tsx, types.ts,
  routes/chat.ts) that removed the data-resumed / resumeCutIndex
  plumbing.

Remaining essentials:
- agent_server/start_server.py: log-level setup so [durable] logs
  surface in app logs.
- scripts/start_app.py: API_PROXY / AGENT_BACKEND_URL wiring so the
  Node AI SDK routes streaming POSTs through the Express
  background-mode + auto-resume proxy. Clone-from-branch is marked
  TEMPORARY (revert when ai-bridge ships).
- pyproject.toml: databricks-ai-bridge git source pointer (TEMPORARY).
- e2e-chatbot-app-next/server/src/index.ts: background-mode rewrite +
  auto-resume proxy for the /invocations route.

Co-authored-by: Isaac
Infinite stream-resume loop seen with Claude multi-tool turns via
durable retrieve. Root cause:

  - useChat's onStreamPart reset resumeAttemptCountRef on every chunk,
    so the 3-retry cap was only enforced when a stream ended empty.
    When Claude's provider failed to emit a clean `finish` UIMessageChunk
    at the end of the stream, lastPart.type !== 'finish' kept
    streamIncomplete = true. Each resume replayed the cached stream,
    delivered chunks, reset the counter to 0, onFinish fired without
    `finish`, looped.

Fix:

  - Remove the per-chunk reset in onStreamPart.
  - Reset only in prepareSendMessagesRequest when the last message is a
    user message (a genuine new turn). Tool-result continuations
    (non-user-message continuations) don't reset.
  - Cap stays at 3; after that, fetchChatHistory() pulls the
    DB-persisted state so the user sees the final assistant output
    instead of spinning forever.

Co-authored-by: Isaac
Final stable state for durable execution. End-to-end UI-validated
scenarios that now work:

  - Multi-tool turn interrupted mid-sequence, durable resume inherits
    completed tool pairs + narrative (reordered) + synthetic output
    for the interrupted call, agent continues from where it left off.
  - Text-only mid-stream crash, partial-text reassembly + Claude
    prefill → continuation.
  - Cross-turn recall after crash-and-resume (stable thread via read-
    time checkpoint repair on LangGraph / session auto-repair on
    OpenAI).
  - Multi-tool on GPT-5 + openai-agents (single-response-per-turn).

Template fix here: process_agent_stream_events now disambiguates by
(a) item.type bucket for delta routing and (b) call_id bucket for
multiple open function_calls. The original single curr_item_id bucket
worked for GPT-5's strictly serial events but collided on Claude's
interleaved + parallel tool-call events, which produced two items
sharing one id and broke the client's part tracking.

Pairs with databricks-ai-bridge PR #416 changes (rotate + replay +
full-history sanitizer + prior-attempt tool-pair inheritance +
narrative hoist + checkpoint read-time repair + session auto-repair).

Co-authored-by: Isaac
End-to-end UI test on Claude (via deployed agent-openai-advanced with
the updated databricks-ai-bridge) confirmed that the bridge-side
ordering fix (sanitizer + narrative hoist + tool-pair inheritance +
session auto-repair) is sufficient on its own. The two template-side
guards added in earlier commits are no longer needed:

- Revert 0ddbd60: `process_agent_stream_events` per-type + per-call-id
  id tracking. The single-bucket implementation handles Claude's
  interleaved + parallel tool-call events correctly now that the
  upstream ordering is clean.
- Revert 5f3c507: `chat.tsx` user-message-only resume-counter reset.
  Claude now emits a clean `finish` UIMessageChunk through the durable
  retrieve path, so the per-chunk reset no longer traps the 3-retry
  cap in an infinite loop.

Keeps the advanced templates lean — durability logic lives entirely in
databricks-ai-bridge (LongRunningAgentServer).

Co-authored-by: Isaac
Extract three pure helpers above the route handler so the SSE frame
loop reads like prose:

- parseSseFrame(frame): classifies a frame as done / passthrough / data.
- extractResponseId(payload): tolerates FastAPI's three response_id
  locations (response_id, response.id, top-level id with resp_ prefix).
- isTerminalErrorFrame(payload): detects task_failed / task_timeout so
  the resume loop can short-circuit.

pumpStream now just drives the reader + forwards bytes; the parsing
logic is testable in isolation and the handler body is substantially
shorter.
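
Simplified versions of the three helpers, as a sketch (frame and payload shapes are assumptions based on this description, not the exact upstream code):

```typescript
// parseSseFrame: classify a raw SSE frame as done / passthrough / data.
type FrameKind =
  | { kind: "done" }
  | { kind: "passthrough" }
  | { kind: "data"; payload: any };

function parseSseFrame(frame: string): FrameKind {
  const data = frame
    .split("\n")
    .filter((l) => l.startsWith("data:"))
    .map((l) => l.slice(5).trim())
    .join("\n");
  if (!data) return { kind: "passthrough" }; // comments, event-only frames
  if (data === "[DONE]") return { kind: "done" };
  try {
    return { kind: "data", payload: JSON.parse(data) };
  } catch {
    return { kind: "passthrough" }; // non-JSON data: forward untouched
  }
}

// extractResponseId: tolerate the three response_id locations named above.
function extractResponseId(payload: any): string | undefined {
  if (typeof payload?.response_id === "string") return payload.response_id;
  if (typeof payload?.response?.id === "string") return payload.response.id;
  if (typeof payload?.id === "string" && payload.id.startsWith("resp_")) return payload.id;
  return undefined;
}

// isTerminalErrorFrame: task_failed / task_timeout short-circuit the resume loop.
function isTerminalErrorFrame(payload: any): boolean {
  const code = payload?.error?.code ?? payload?.code;
  return code === "task_failed" || code === "task_timeout";
}
```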

Co-authored-by: Isaac
Both advanced templates were setting these env vars to hard-coded
localhost URLs that match the bundled-process topology (Node on 3000,
FastAPI on 8000). The values are fixed by the templates themselves —
a customer deploying the advanced stack can't change them without
breaking the bundle. Making them required in yaml adds noise without
adding configurability.

Push the defaults into the chatbot:

- New ``getApiProxyUrl()`` helper in ``packages/ai-sdk-providers/src/
  api-proxy.ts`` resolves the effective proxy URL:
    1. explicit ``API_PROXY`` wins,
    2. ``DATABRICKS_SERVING_ENDPOINT`` set → direct-endpoint mode, no
       proxy,
    3. otherwise → ``http://localhost:${CHAT_APP_PORT|PORT|3000}/invocations``
      (advanced-template convention).
  Used from ``providers-server.ts`` and ``request-context.ts`` so both
  agree on proxy activation.

- ``server/src/index.ts`` defaults ``AGENT_BACKEND_URL`` to
  ``http://localhost:8000/invocations`` when unset. Explicit empty
  string still disables the ``/invocations`` proxy route.

- Drop the ``API_PROXY`` / ``AGENT_BACKEND_URL`` block (and its comment)
  from both advanced templates' ``app.yaml`` and ``databricks.yml``.

Preserves direct-serving-endpoint CUJs: when
``DATABRICKS_SERVING_ENDPOINT`` is set (basic chatbot deployments), the
AI SDK talks straight to the endpoint and never hits ``/invocations``.

Co-authored-by: Isaac
Prior cleanup commit dropped ``API_PROXY=http://localhost:8000/invocations``
from the advanced templates' ``app.yaml`` and ``databricks.yml``. That
line pre-existed on ``main``; the PR never meant to remove it. Scope of
the previous change was only the *newly-added* ``API_PROXY`` +
``AGENT_BACKEND_URL`` block that activated the Node proxy path.

Restore the four files to exactly match ``main``. The chatbot-side
``getApiProxyUrl()`` default only fires when ``API_PROXY`` is unset, so
users with main's explicit setting keep their existing behavior.

Co-authored-by: Isaac
Both helpers answer routing-decision questions for the provider layer
(proxy URL + context-injection gate), and the separate file wasn't
buying isolation — providers-server.ts already imports from
request-context.ts. One file, same logic.

Co-authored-by: Isaac
