Workflow results become inaccessible after CU termination — retrieval path is bound to CU lifetime even though storage outlives it #5363

@mengw15

What happened?

Execution results (results, runtime_stats, console_logs Iceberg tables) are written to a global storage backend (Iceberg catalog server + S3 / MinIO) that survives well beyond the lifetime of the Computing Unit (CU) that produced them. However, the result-read endpoints — the WebSocket and REST resources that the frontend talks to for the Result panel — are hosted by a Dropwizard app that lives inside the CU pod, and the Iceberg client that connects to the storage backend is currently only wired up there.

When a user terminates the CU and later reopens the same workflow, the frontend has no live CU to talk to for results. The Result panel comes up empty even though the underlying data is still intact in the global storage.

This is an asymmetry between data lifetime and access-path lifetime:

Storage backend (global):  CU created ─── CU terminated ─── workflow reopened
                                                              data still present ✓

Retrieval path (in CU)  :  CU created ─── CU terminated ─── workflow reopened
                                          retrieval path gone ✗

Why it happens (architecture):

Texera deploys two separate Dropwizard applications:

  • TexeraWebApplication (amber/.../web/TexeraWebApplication.scala) — the brain-layer web server (dashboard, auth, workflow CRUD, admin, etc.). One per Texera deployment, persistent. Does not import or use IcebergCatalogInstance / IcebergDocument / DocumentFactory — the brain currently has no Iceberg client at all.
  • ComputingUnitMaster (amber/.../web/ComputingUnitMaster.scala) — runs inside each CU pod (the container named computing-unit-master in KubernetesClient.scala:138). Registers the result-read endpoints WorkflowWebsocketResource (line 116) and WorkflowExecutionsResource (line 182). Hosts WorkflowExecutionService / ExecutionResultService / ExecutionStatsService / ExecutionConsoleService / ResultExportService, all of which obtain their Iceberg client via IcebergCatalogInstance — a Scala object (per-JVM singleton) that only gets instantiated in this JVM.
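
The per-JVM singleton behaviour described above can be illustrated with a small, self-contained sketch. It does not use Texera's classes: ToyCatalogClient, ToyCatalogInstance, ClientCounter, and the endpoint URL are hypothetical stand-ins. It only shows that a Scala object builds its client on first access in the JVM that references it, so a process that never touches the object never gets a client.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Toy stand-in for an Iceberg catalog client; NOT Texera's real class.
final class ToyCatalogClient(val endpoint: String)

// Counts how many clients this JVM has constructed so far.
object ClientCounter {
  val constructed = new AtomicInteger(0)
}

// A Scala `object` is a singleton per JVM (strictly, per class loader).
// The lazy val is evaluated on first access only, so a process that never
// references ToyCatalogInstance.client never constructs a client.
object ToyCatalogInstance {
  lazy val client: ToyCatalogClient = {
    ClientCounter.constructed.incrementAndGet()
    new ToyCatalogClient("http://catalog.example:8181") // hypothetical endpoint
  }
}

object SingletonDemo {
  def main(args: Array[String]): Unit = {
    // Number of clients built before anything touched the singleton.
    println(ClientCounter.constructed.get())
    val first = ToyCatalogInstance.client
    val second = ToyCatalogInstance.client
    // Both accesses return the same instance.
    println(first eq second)
    // Number of clients built after two accesses.
    println(ClientCounter.constructed.get())
  }
}
```

In this toy model the "brain" corresponds to a JVM that never evaluates ToyCatalogInstance.client, which mirrors the observation that TexeraWebApplication has no Iceberg client at all.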

ComputingUnitManagingResource @DELETE /{cuid}/terminate (ComputingUnitManagingResource.scala:622-641) calls KubernetesClient.deletePod(cuid), which tears down the entire ComputingUnitMaster JVM — taking the WebSocket endpoints, REST resources, and the IcebergCatalogInstance client with it. The storage backend on the other end (catalog server + S3) is untouched, but there is no longer any wired-up client able to talk to it on behalf of the user.
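
The lifetime mismatch can be modelled in a few lines. This is a toy sketch under stated assumptions, not Texera code: GlobalStore stands in for the catalog server plus S3 / MinIO, ToyComputingUnit stands in for the CU pod that hosts the only result-read path, and the table naming is invented for illustration.

```scala
import scala.collection.mutable

// Toy global storage: stands in for the catalog server + object store.
final class GlobalStore {
  private val tables = mutable.Map.empty[String, Vector[String]]

  def write(table: String, rows: Vector[String]): Unit = {
    tables.update(table, rows)
  }

  def read(table: String): Option[Vector[String]] = tables.get(table)
}

// Toy CU: owns the only read endpoint the "frontend" can reach.
final class ToyComputingUnit(store: GlobalStore) {
  private var alive = true

  // Running a workflow writes results into the global store.
  def execute(executionId: String): Unit = {
    store.write(s"results_$executionId", Vector("row-1", "row-2"))
  }

  // Result-read endpoint: answers only while the CU is alive.
  def readResults(executionId: String): Option[Vector[String]] =
    if (alive) store.read(s"results_$executionId") else None

  // Models pod deletion: the endpoint goes away, the store is untouched.
  def terminate(): Unit = {
    alive = false
  }
}

object LifetimeDemo {
  def main(args: Array[String]): Unit = {
    val store = new GlobalStore
    val cu = new ToyComputingUnit(store)
    cu.execute("42")
    println(cu.readResults("42").isDefined)     // read through the live CU
    cu.terminate()
    println(cu.readResults("42").isDefined)     // read through the terminated CU
    println(store.read("results_42").isDefined) // direct read from the store
  }
}
```

The point of the model is that terminate() changes only what the CU-hosted endpoint can return; the data held by the store is the same before and after.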

Related: #4126 (Migrate to Result Service and MinIO for Execution Results), #5135 (Per-user Iceberg warehouse).

How to reproduce?

  1. Create a workflow with an operator that produces a result table (e.g., Filter → sink).
  2. Execute on a CU; observe the rows in the Result panel.
  3. Terminate the CU (Dashboard → Computing Units → Terminate).
  4. Reopen the same workflow.
  5. The Result panel does not display the prior execution's results. The underlying data is still present in the global storage backend (verifiable by listing the bucket, e.g. with aws s3 ls or MinIO's mc ls), but the UI has no path to reach it without a live CU.

Version

1.1.0-incubating (Pre-release/Master)
