Refactor: Decoupling Direct Database Connection From ComputingUnitMaster & ComputingUnitWorker #5295

bobbai00 · 2026-05-30T04:35:43Z

bobbai00
May 30, 2026
Collaborator

Motivation of this refactoring

Currently the Amber engine queries Postgres directly for execution status updates, workflow compilation, and cost-based optimization. As a result, DB credentials have to be stored in each Computing Unit (CU) as environment variables. Removing the DB credentials from the Computing Unit makes each CU a better, safer sandbox for executing workflows, especially workflows with UDF operators.

Refactor Overview

The fix is to move privileged database work out of the executor, not to harden the sandbox.

Before:  Frontend ─WS─▶ CU Master ─JDBC─▶ Postgres        (UDFs share the creds)
After:   Frontend ─WS─▶ CU Master ─HTTP(JWT)─▶ Backend ─JDBC─▶ Postgres
                        (no creds; runs a pre-resolved spec)

The Iceberg REST catalog already moves Iceberg metadata operations off direct Postgres access. The remaining dependencies are what I want to decouple and discuss in this post.

Current System Architecture and Flow

Here are two diagrams describing the high-level traffic between services and storages when a workflow is being executed:

Phase 1: Workflow (logical plan) is submitted to Computing Unit

Phase 2: Physical plan is being executed

Execution Flow

Every arrow in the red-shaded region below is a direct database hit from the Computing Unit.

sequenceDiagram
  autonumber
  actor U as Frontend
  participant M as CU Master
  participant PG as Postgres
  participant CAT as Catalog Service
  participant S3I as S3 (Iceberg Tables)

  U->>M: Run workflow (**logical plan** + JWT)
  rect rgb(255, 235, 235)
  M->>PG: Look up latest workflow version
  M->>PG: Create execution record (get eid)
  M->>PG: Clear previous run's output locations
  opt fault tolerance
    M->>PG: Save replay-log location
  end
  Note over M: compile **logical plan** → **physical plan** (in-process)
  M->>PG: Resolve dataset paths to storage locations
  Note over M: cost-based optimization (in-process)
  M->>PG: Read last run's stats (cost-based optimization)
  M->>PG: Record where results/console/stats go
  end
  M->>CAT: Commit result/console/stats tables (REST — no DB login)
  M->>S3I: Write result/console/stats data
  M-->>U: Live status, stats & errors (WebSocket)

Proposed Design

Design A — Proxy to the Backend

Keep the CU's logic in place; replace each SqlServer call with an HTTP call that forwards the user's JWT. The Dashboard Service authorizes and runs the SQL.

sequenceDiagram
  autonumber
  actor U as Frontend
  participant M as CU Master
  participant DS as Dashboard Service
  participant FS as file-service
  participant PG as Postgres

  U->>M: Run workflow (**logical plan** + JWT)
  Note over M: no DB credentials
  M->>DS: Create execution record (JWT)
  DS->>PG: Create execution record
  DS-->>M: eid
  rect rgb(255, 245, 200)
  Note over M: Compile & Execute
  M->>FS: Resolve dataset paths (JWT)
  FS->>PG: Look up datasets & versions
  FS-->>M: resolved dataset URIs
  Note over M: compile **logical plan** → **physical plan**, then execute (in CU Master)
  end
  M->>DS: Report result/console/stats locations (JWT)
  DS->>PG: Persist result/console/stats locations
  M-->>U: Live status, stats & console (WebSocket)

Yellow highlights compilation and execution — in Design A both happen inside the CU Master.

The main ideas of this design:

The CU Master keeps its current role — it still compiles the logical plan → physical plan and executes, both in-process — only its database access is removed.
Every former direct SqlServer call becomes an HTTP call that forwards the user's JWT: execution-record create/update goes to the Dashboard Service, dataset resolution goes to file-service, and each service authorizes the caller before touching Postgres.
ComputingUnitMaster still receives the logical plan, but resolves datasets and reads/writes execution metadata over HTTP — it holds no DB credentials and never touches Postgres directly.
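
As a rough illustration of the proxy idea above, the sketch below builds (but does not send) the HTTP request that could stand in for one former direct SqlServer call. The base URL, the `/api/executions` path, and the body fields are hypothetical and not taken from the codebase; the only part that comes from the design is forwarding the user's JWT instead of holding DB credentials.

```python
import json
import urllib.request


def build_create_execution_request(
    base_url: str, jwt: str, wid: int, cuid: int
) -> urllib.request.Request:
    """Build the request that replaces a direct "create execution record" insert.

    The endpoint path and payload shape are assumptions for illustration only.
    """
    body = json.dumps({"wid": wid, "cuid": cuid}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/executions",  # hypothetical Dashboard Service endpoint
        data=body,
        method="POST",
        headers={
            # Forward the user's JWT; the CU holds no database credentials.
            "Authorization": f"Bearer {jwt}",
            "Content-Type": "application/json",
        },
    )


# Example: the request the CU Master would hand to its HTTP client.
req = build_create_execution_request(
    "http://dashboard-service:8080", "user.jwt.token", wid=42, cuid=7
)
```

The receiving service would still need to validate the token and check that the caller may access the workflow before touching Postgres.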

Design B — Pure Execution Backend (My Preference)

The CU becomes a stateless executor. It receives one self-contained ExecutionSpec (the already-compiled physical plan + eid + last execution stats), runs it, writes only to object storage through the REST catalog, and reports a completion manifest. The backend (Dashboard Service + workflow-compiling-service + file-service) does all compilation, dataset resolution, eid allocation, and persistence.

sequenceDiagram
  autonumber
  actor FE as Frontend
  participant WCS as workflow-compiling-service
  participant DS as Dashboard Service
  participant FS as file-service
  participant PG as Postgres
  participant CU as CU Master
  participant CAT as Catalog Service

  rect rgb(255, 245, 200)
  Note over WCS: Compile
  FE->>WCS: Compile workflow (**logical plan** + JWT)
  WCS->>FS: Resolve dataset paths (access-checked)
  FS->>PG: Look up datasets & versions
  WCS-->>FE: **physical plan** (dataset URIs resolved)
  end
  FE->>DS: Prepare execution (wid, cuid, JWT)
  DS->>PG: Create execution record & read last execution stats
  DS-->>FE: eid + last execution stats
  rect rgb(255, 245, 200)
  Note over CU: Execute
  FE->>CU: Run execution (**physical plan** + eid + last execution stats + JWT)
  Note over CU: no DB credentials
  CU->>CAT: Commit & write results (REST — no DB login)
  end
  CU-->>FE: Live status, stats & console (WebSocket)
  CU->>DS: Report final status & manifest (JWT)
  DS->>PG: Persist final status & locations

Yellow highlights compilation (in workflow-compiling-service) and execution (in CU Master) — split across services, unlike Design A.

The main ideas of this design:

The frontend asks workflow-compiling-service to compile the workflow — which resolves dataset paths via file-service and bakes the resolved URIs into the physical plan.
It then asks the Dashboard Service to prepare the execution, passing only the workflow id/version + cuid (not the plan). The Dashboard Service creates the execution record, reads the last execution stats (used by the engine for cost-based optimization), and returns the eid + last execution stats. It never sees the physical plan and never resolves datasets.
The frontend assembles the run request — physical plan (from the compiling-service) + eid + last execution stats (from the Dashboard Service) — and sends it to the CU over the WebSocket.
ComputingUnitMaster receives the physical plan, executes it, streams status to the frontend, and reports the final status/manifest back. It no longer compiles the logical plan and never touches Postgres.
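
To make the run request more concrete, here is a minimal sketch of what a self-contained ExecutionSpec could look like and how the frontend-side assembly step might combine the two backend responses. The field names (`eid`, `physical_plan`, `last_execution_stats`) and the `lastExecutionStats` response key are assumptions; the proposal only states which pieces of information travel together.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass(frozen=True)
class ExecutionSpec:
    """Everything a credential-free CU needs to run (field names are assumed)."""

    eid: int
    physical_plan: dict[str, Any]
    last_execution_stats: dict[str, Any] = field(default_factory=dict)

    def to_json(self) -> str:
        # Serialize deterministically so the same spec always yields the same payload.
        return json.dumps(asdict(self), sort_keys=True)


def assemble_spec(compiled_plan: dict[str, Any], prepared: dict[str, Any]) -> ExecutionSpec:
    """Combine the compiling-service output with the Dashboard Service response."""
    return ExecutionSpec(
        eid=prepared["eid"],
        physical_plan=compiled_plan,
        last_execution_stats=prepared.get("lastExecutionStats", {}),
    )


spec = assemble_spec(
    {"operators": [], "links": []},
    {"eid": 1, "lastExecutionStats": {"op-1": {"rows": 100}}},
)
payload = spec.to_json()
```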

aglinxinyuan · 2026-05-30T05:34:28Z

aglinxinyuan
May 30, 2026
Collaborator

I like design B. What's the disadvantage of design B compared to design A?

1 reply

bobbai00 May 30, 2026
Collaborator Author

B is harder to implement as it touches more components. That's the only disadvantage of B I can think of.

mengw15 · 2026-05-30T06:43:42Z

mengw15
May 30, 2026
Collaborator

Agreed, Design B feels more natural: the CU as a pure executor feels clean.

Two questions:

1. Effort & transition. How big is the change of design B? Could we land Design A first (DB-credential removal, small delta) as a stepping stone, then evolve to B without a big-bang switchover?

2. Who orchestrates? In Design B's current sketch the frontend coordinates compile → eid allocation → CU dispatch. That makes the frontend a critical control-plane component, which feels off to me. Would it be worth introducing a thin Execution Service that owns this orchestration? Any client would then just call POST /executions {wid, cuid} and the Execution Service handles compile / eid allocation / CU dispatch internally.
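
A minimal sketch of the orchestration such an Execution Service might own is shown below. The three collaborators are injected as plain callables because their real interfaces are not defined in this thread; the function name, the callable signatures, and the dispatched payload keys are all hypothetical.

```python
from typing import Any, Callable


def create_execution(
    wid: int,
    cuid: int,
    compile_workflow: Callable[[int], dict[str, Any]],
    allocate_eid: Callable[[int, int], int],
    dispatch_to_cu: Callable[[int, dict[str, Any]], None],
) -> int:
    """Handle a hypothetical POST /executions {wid, cuid} request."""
    physical_plan = compile_workflow(wid)  # compile step
    eid = allocate_eid(wid, cuid)          # eid allocation step
    # CU dispatch step: the CU receives only what it needs to run.
    dispatch_to_cu(cuid, {"eid": eid, "physicalPlan": physical_plan})
    return eid


# Example wiring with in-memory stand-ins for the real services.
dispatched: list[tuple[int, dict[str, Any]]] = []
new_eid = create_execution(
    wid=42,
    cuid=7,
    compile_workflow=lambda wid: {"operators": [], "links": []},
    allocate_eid=lambda wid, cuid: 1001,
    dispatch_to_cu=lambda cuid, spec: dispatched.append((cuid, spec)),
)
```

With this shape, any client only needs the workflow id and the computing unit id, and the physical plan never has to pass through the client.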

2 replies

bobbai00 May 30, 2026
Collaborator Author

The effort should be manageable.
The concern is valid. We currently let the clients handle a lot of logic, so I am exploring the possibility of delegating some of that work to CUMaster.

aglinxinyuan May 30, 2026
Collaborator

It's actually a good idea to let clients do the work so we can utilize their computing power and decrease transfer cost.

Yicong-Huang · 2026-05-30T06:57:07Z

Yicong-Huang
May 30, 2026
Collaborator

First, I want to ask: what is the biggest issue with the current architecture? Exposing DB credentials could be one, but that's not that fundamental. I need some more justification for the redesign.

That aside, I always wanted to go with option B, and that was the reason we initially introduced the workflow compilation service. But there are some blockers:

Is it secure to expose the physical plan to the frontend in option B? To be more clear: after compilation, we will have a physical plan that contains resolved file URLs. Is it safe to expose that to the frontend? What it does not have is the execution configuration (that is only available after scheduling, which currently stays in the CU master), so computation resources and worker locations are less of a concern.
I believe that during execution the CU also needs to write things, including Iceberg metadata, runtime logs, etc. How are we going to handle that DB access with option B?

3 replies

bobbai00 May 30, 2026
Collaborator Author

I added the "Motivation of this refactoring" section.

For the two blockers:

The concern about exposing the physical plan to the frontend is valid. There are two ways to address it: (1) encrypt the physical plan sent to the frontend; (2) the frontend still sends the logical plan to the CU, and the CU sends a request to WorkflowCompilingService to compile the logical plan into a physical plan. I think (2) might be the way to go.
This is addressed by switching to RESTCatalog.

aglinxinyuan May 30, 2026
Collaborator

Why is exposing the physical plan to the frontend considered dangerous? Is a resolved file URL actually sensitive, and if so, what specific information or risks does it expose?

I think it’s important that we understand the reasoning behind each design decision rather than relying on a general feeling that exposing the physical plan is dangerous. If there are concrete security, privacy, or architectural concerns, we should identify and document them so we can make informed trade-offs. We can look at each piece of information the physical plan contains and encrypt only the sensitive ones instead of encrypting the whole physical plan.

Yicong-Huang Jun 1, 2026
Collaborator

I think the resolved file path is not safe to send back to the frontend. I am not sure about the other info.

The physical plan (the DAG shape) itself is not that sensitive IMO. I think we can avoid encrypting the whole plan if we can move the sensitive info out of it, or encrypt only the sensitive info in it.

chenlica · 2026-05-30T21:51:26Z

chenlica
May 30, 2026
Collaborator

A related note: I believe currently a workflow is compiled twice in its lifecycle, and we want to remove one of them in the near future.

2 replies

bobbai00 May 31, 2026
Collaborator Author

Yes, it is a related topic. Design B will make the compilation happen only once.

Yicong-Huang Jun 1, 2026
Collaborator

Yeah, I think we are heading in this direction, which is good.

bobbai00 · 2026-06-06T18:58:16Z

bobbai00
Jun 6, 2026
Collaborator Author

Here is the layout of the physical plan:

Physical Plan Spec

Layout:

{
  "operators": [...list of physical operators...],
  "links": [...list of physical links...]
}

Physical Operator Spec

{
  "id": {
    "logicalOpId": { "id": "CSVScanSource-operator-id" },
    "layerName": "main"                                   // distinguishes physical stages from same logical op: main | partial | final
  },
  "workflowId": { "id": 0 },                              
  "executionId": { "id": 1 },                            

  "opExecInitInfo": {                                     // tells Amber how to construct the runtime executor
    // JVM operators use kind "className":
    "kind": "className",
    "className": "org.apache.texera.amber.operator.source.scan.csv.CSVScanSourceOpExec",
    "descString": "{...a JSON STRING that describes the property of the physical operator...}"
    // For scan sources (CSV/JSONL/Arrow/file), the source path lives here as `fileName`.
    //   It looks like `dataset:///dataset-15/versionHash/raw/data.csv` (if the file is resolved on the local file system, it starts with `file:///...` instead).
    // UDF operators use kind "code" instead:
    //   { "kind": "code", "code": "class ProcessTupleOperator(...): ...", "language": "python" }
  },

  "parallelizable": true,
  "locationPreference": { "type": "roundRobin" },
  "partitionRequirement": [],                             // what each INPUT expects (array: one entry per input port)
    // null                                          -> no requirement for that input
    // { "type": "single" }                          -> gather into one partition
    // { "type": "hash", "hashAttributeNames": ["id"] } -> hash-partitioned by attributes
    // { "type": "broadcast" }                       -> broadcast to workers
    // { "type": "oneToOne" }                        -> partitioning maps one-to-one
    // { "type": "none" }                            -> no partitioning

  "partitionDeriveSpec": { "type": "passthrough" },       // what partitioning this operator PRODUCES
    // passthrough                                   -> preserve upstream partitioning
    // toSingle                                      -> produce a single partition
    // toHash + hashAttributeNames                   -> produce hash partitioning
    // toUnknown                                     -> partitioning unknown
    // projection                                    -> derive through projection

  "inputPortsSerialized": {},                             // map keyed "<portId>_<internalFlag>", e.g. "0_false"
  "outputPortsSerialized": {},                            // value = 2-item array: [portMetadata, schema|null]
    // portMetadata: { id:{id,internal}, displayName, blocking, mode }
    //   output `mode`: 0 = set snapshot | 1 = set delta | 2 = single snapshot
    // schema: { attributes: [ { attributeName, attributeType }, ... ] } or null
    //   attributeType: string | integer (32-bit) | long (64-bit) | double |
    //                  boolean | timestamp | binary | large_binary (pointer-like)

  "isOneToManyOp": false,
  "suggestedWorkerNum": 1,
  "pveName": ""
}
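
For readers consuming `inputPortsSerialized` / `outputPortsSerialized`, the small helper below sketches how a key of the form `<portId>_<internalFlag>` could be split back into a port id object. The helper name and its error handling are my own assumptions; only the key format and the `{ id, internal }` shape come from the spec above.

```python
def parse_port_key(key: str) -> dict:
    """Split a serialized port key such as "0_false" into {"id": 0, "internal": False}."""
    port_id, internal_flag = key.rsplit("_", 1)
    if internal_flag not in ("true", "false"):
        raise ValueError(f"unexpected internal flag in port key: {key!r}")
    return {"id": int(port_id), "internal": internal_flag == "true"}


external_port = parse_port_key("0_false")
internal_port = parse_port_key("1_true")
```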

Physical Link Spec

Each item in links connects one physical output port to one physical input port.

{
  "fromOpId": {
    "logicalOpId": { "id": "source-op-id" },
    "layerName": "main"
  },
  "fromPortId": { "id": 0, "internal": false },
  "toOpId": {
    "logicalOpId": { "id": "target-op-id" },
    "layerName": "main"
  },
  "toPortId": { "id": 0, "internal": false }
}
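
Since every link refers to operators by the pair (`logicalOpId.id`, `layerName`), a receiver of the plan can run a basic sanity check before executing it. The sketch below is a hypothetical helper, not existing engine code; it only relies on the fields shown in the two specs above.

```python
def op_key(op_id: dict) -> tuple:
    """Identify a physical operator by its logical operator id and layer name."""
    return (op_id["logicalOpId"]["id"], op_id["layerName"])


def dangling_links(plan: dict) -> list:
    """Return the links whose source or target operator is missing from the plan."""
    known = {op_key(op["id"]) for op in plan["operators"]}
    return [
        link
        for link in plan["links"]
        if op_key(link["fromOpId"]) not in known or op_key(link["toOpId"]) not in known
    ]


example_plan = {
    "operators": [
        {"id": {"logicalOpId": {"id": "source-op-id"}, "layerName": "main"}},
        {"id": {"logicalOpId": {"id": "target-op-id"}, "layerName": "main"}},
    ],
    "links": [
        {
            "fromOpId": {"logicalOpId": {"id": "source-op-id"}, "layerName": "main"},
            "fromPortId": {"id": 0, "internal": False},
            "toOpId": {"logicalOpId": {"id": "target-op-id"}, "layerName": "main"},
            "toPortId": {"id": 0, "internal": False},
        }
    ],
}
problems = dangling_links(example_plan)
```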

@Yicong-Huang @aglinxinyuan @Xiao-zhen-Liu Is this interpretation accurate? If so, I don't think the physical plan contains any sensitive information, and we can safely expose it to the client.

1 reply

Yicong-Huang Jun 7, 2026
Collaborator

The frontend environment is owned by a user, and we don't want to expose our backend details, so I would push back on exposing the following information:
a. resolved file path, which could reveal the actual file location in our backend (local or lakeFS).
b. worker machine location. I don't see it in the plan you posted; I just want to make sure.
other users' information during compilation of shared sessions. I know the logical plan possibly already has/exposes it on the frontend, but we need to be careful about access to the compilation endpoint and not expose it to the wrong user.
c. API keys (from the Twitter operator, LLM operators, etc.).
d. DB source passwords, tables, etc.
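
One way to act on item (a) without encrypting the whole plan is to swap resolved paths for opaque handles before the plan leaves the backend. The sketch below is only an illustration under assumptions: it relies on `descString` being a JSON string that carries the path as `fileName` (as described in the Physical Operator Spec), while the handle format and the idea of a backend-side handle table are hypothetical.

```python
import copy
import json


def redact_file_paths(plan: dict) -> tuple[dict, dict]:
    """Return a copy of the plan with resolved paths replaced by opaque handles.

    The second return value maps each handle back to the real path and is meant
    to stay on the backend.
    """
    redacted = copy.deepcopy(plan)  # never mutate the caller's plan
    handles: dict[str, str] = {}
    for op in redacted["operators"]:
        info = op.get("opExecInitInfo", {})
        if info.get("kind") != "className" or "descString" not in info:
            continue  # e.g. UDF operators with kind "code" carry no descString
        desc = json.loads(info["descString"])
        if "fileName" in desc:
            handle = f"file-handle-{len(handles)}"  # hypothetical handle format
            handles[handle] = desc["fileName"]
            desc["fileName"] = handle
            info["descString"] = json.dumps(desc)
    return redacted, handles


plan = {
    "operators": [
        {
            "opExecInitInfo": {
                "kind": "className",
                "descString": json.dumps(
                    {"fileName": "dataset:///dataset-15/versionHash/raw/data.csv"}
                ),
            }
        }
    ],
    "links": [],
}
safe_plan, handle_table = redact_file_paths(plan)
```

API keys and DB source passwords (items c and d) would need the same kind of per-field treatment, which depends on where each operator stores them.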
