[Hackathon] feat(bin): one-command local dev orchestrator (bin/texera)#5079

Open
MelihErduran wants to merge 1 commit into apache:main from MelihErduran:texera-launcher

PR: One-command local dev orchestrator for Texera (bin/texera)

Summary

Adds bin/texera — a single Bash CLI that replaces the previous "open 7 IntelliJ run configs in the right order, then yarn start in a different terminal" workflow with one command:

texera start

It launches Postgres + LakeFS/MinIO + every backend JVM service + agent-service + frontend, in the right order, with prefixed log streams in one terminal, a live bottom-pinned health bar, and clean teardown on Ctrl+C. Also ships subcommands for setup, build, stop, status, and logs. Complementary helper bin/check-services.sh provides a one-shot probe outside an active session.

This is a dev-tool addition — nothing about the services themselves changes. The existing .run/*.xml IntelliJ configs and bin/single-node Docker deploy paths are untouched.

Motivation

Before this PR, getting Texera running locally required:

  • Knowing the launch order (master before worker, infra before JVMs).
  • Knowing the seven bin/*-service.sh scripts plus the un-scripted agent-service and frontend.
  • Eyeballing seven different terminals to figure out whether the stack was actually up.
  • Manually pkilling JVMs when something crashed, because there was no cleanup story.
  • Hitting a file-service crash on boot ~50% of the time, whenever LakeFS wasn't quite ready.

New contributors hit all of this on day one. Existing contributors lived with it but lost ~5 minutes per restart.

What's in this PR

bin/texera                  new   one-command orchestrator (~875 lines)
bin/check-services.sh       new   standalone health probe (~118 lines)
bin/build-services.sh       mod   add access-control-service to dist+unzip;
                                  rename amber zip target (texera-*.zip → amber-*.zip)

bin/texera is the main feature; the other two are small.

Subcommands

texera setup           One-time: toolchain check + sbt build + frontend/python deps + SQL DDLs
texera build           Re-build staged backend binaries (after backend code changes)
texera start [mode]    Start services. Interactive menu if no mode given.
texera stop            Stop everything started by `texera start`.
texera status          Per-service port reachability table.
texera logs <service>  Tail one service's log file.

texera setup

Idempotent first-time bootstrap. Verifies the toolchain (java 17, sbt, node 24, yarn, docker, pg_isready, psql, curl, unzip), runs bin/build-services.sh, installs frontend (yarn install) and agent-service (bun install) deps, applies sql/texera_ddl.sql and sql/iceberg_postgres_catalog.sql. Skips agent-service gracefully if bun isn't installed.
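The toolchain check can be sketched as a loop over command -v. This is a minimal sketch rather than the code in bin/texera: the function name check_tools is hypothetical, and the version checks (java 17, node 24) are left out.

```bash
# Hypothetical helper: report every required tool that is not on PATH.
# Prints the missing names to stderr and returns non-zero if any are missing.
check_tools() {
  local tool missing=()
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || missing+=("$tool")
  done
  if ((${#missing[@]} > 0)); then
    printf 'missing tools: %s\n' "${missing[*]}" >&2
    return 1
  fi
}
```

Usage would look like check_tools java sbt node yarn docker pg_isready psql curl unzip || exit 1. Collecting all missing tools before failing means a new contributor sees the full list in one run.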

texera build

Delegates to bin/build-services.sh (sbt clean dist + unzip each service's stage). Same path the deploy scripts use.

texera start

Five modes, chosen by argument or interactive menu:

Mode       Postgres + LakeFS/MinIO   Backend JVM services + agent   Frontend
full       yes                       yes                            yes
backend
frontend
infra      yes                       no                             no
services

The interactive menu (texera start with no arg, TTY only) renders a box-drawn numbered prompt; q quits. Stdin not a TTY + no arg → errors with the list of valid modes (so it's CI-safe).
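The argument / TTY / error decision above can be sketched as follows. The names resolve_mode and VALID_MODES are hypothetical, and a plain Bash select stands in for the box-drawn menu; the real script may structure this differently.

```bash
VALID_MODES="full backend frontend infra services"

# Hypothetical helper: print the chosen mode on stdout, or fail with the valid list.
resolve_mode() {
  local arg=${1:-} m
  if [[ -z $arg ]]; then
    if [[ -t 0 ]]; then
      # Interactive: `select` writes its menu to stderr, so only the
      # chosen mode reaches stdout. Typing q quits.
      select m in $VALID_MODES; do
        [[ $REPLY == q ]] && return 1
        [[ -n $m ]] && { printf '%s\n' "$m"; return 0; }
      done
      return 1
    fi
    echo "error: no mode given and stdin is not a TTY; valid modes: $VALID_MODES" >&2
    return 2
  fi
  for m in $VALID_MODES; do
    [[ $m == "$arg" ]] && { printf '%s\n' "$arg"; return 0; }
  done
  echo "error: unknown mode '$arg'; valid modes: $VALID_MODES" >&2
  return 2
}
```

A caller would use mode=$(resolve_mode "$1") || exit $?. Because the non-TTY branch never reads stdin, a CI job that forgets the mode fails immediately instead of hanging on a prompt.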

Service registry is a single declarative table inside the script:

SERVICES=(
  "config|.|target/config-service-*/bin/config-service|9094"
  "compile|.|target/workflow-compiling-service-*/bin/workflow-compiling-service|9090"
  "file|.|target/file-service-*/bin/file-service|9092"
  "managing|.|target/computing-unit-managing-service-*/bin/computing-unit-managing-service|8888"
  "access|.|target/access-control-service-*/bin/access-control-service|9096"
  "master|amber|target/amber-*/bin/computing-unit-master|8085"
  "worker|amber|target/amber-*/bin/computing-unit-worker|-"
  "web|amber|target/amber-*/bin/texera-web-application|8080"
)

Adding a service later means adding one row.

Each row spawns from its sbt-native-packager staged binary (not sbt runMain) — that avoids the sbt boot-lock contention you get from launching several sbt processes in parallel and skips sbt startup overhead per service.
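For readers unfamiliar with this table style, here is a minimal sketch of how a row could be split and its staged binary located. It assumes the four fields are name, directory, binary glob, and port (with - meaning no port to probe); parse_row and resolve_binary are hypothetical names, not necessarily what bin/texera uses.

```bash
SERVICES=(
  "config|.|target/config-service-*/bin/config-service|9094"
  "master|amber|target/amber-*/bin/computing-unit-master|8085"
  "worker|amber|target/amber-*/bin/computing-unit-worker|-"
)

# Split one row into NAME, DIR, BIN_GLOB, PORT (globals, for brevity).
parse_row() {
  IFS='|' read -r NAME DIR BIN_GLOB PORT <<<"$1"
}

# Expand the staged-binary glob under a repo root and print the first
# executable match. $dir and $glob are left unquoted on purpose so the
# shell expands the wildcard; an unmatched pattern fails the -x test.
resolve_binary() {
  local root=$1 dir=$2 glob=$3 match
  for match in "$root"/$dir/$glob; do
    [[ -x $match ]] && { printf '%s\n' "$match"; return 0; }
  done
  return 1
}
```

The wildcard in each glob absorbs the version suffix of the staged directory, so the table does not need editing when the project version changes.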

Each service's stdout/stderr is piped through a colored prefixer:

[config]   INFO  ConfigService starting…
[compile]  INFO  WorkflowCompilingService starting…
[file]     ERROR Failed to connect to lake fs server: …
[master]   [INFO] [ClusterListener] received member event = MemberUp(...)

Color is a stable hash of the service name → ANSI palette. Stream prefixer is awk -v p="$prefix" '{ print p, $0; fflush(); }'.

Per-service logs also written to logs/texera-dev/<name>.log (so texera logs <name> works mid-run).
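A minimal sketch of the prefixer and the name-to-color mapping. The awk command is the one quoted above; the use of cksum as the stable hash, the six-color palette, and the function names are assumptions for illustration.

```bash
PALETTE=(31 32 33 34 35 36)   # ANSI foreground codes; palette choice is an assumption

# Map a service name to a palette entry. cksum yields the same number on
# every run, so a service keeps its color across restarts.
color_for() {
  local sum
  sum=$(printf '%s' "$1" | cksum | cut -d ' ' -f 1)
  printf '%s\n' "${PALETTE[sum % ${#PALETTE[@]}]}"
}

# Prefix every stdin line with a colored "[name]" tag, flushing per line
# so output from several services interleaves at line granularity.
prefix_stream() {
  local name=$1 color prefix
  color=$(color_for "$name")
  prefix=$(printf '\033[%sm[%s]\033[0m' "$color" "$name")
  awk -v p="$prefix" '{ print p, $0; fflush(); }'
}
```

A service would then be wired up roughly as some-binary 2>&1 | tee -a logs/texera-dev/file.log | prefix_stream file, which keeps the on-disk log free of color codes.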

texera stop

stop kills every service launched by texera start, then docker compose downs the LakeFS/MinIO stack.

The kill path matters because the previous scripts left orphan JVMs — see the Hard problems section below.

texera status and texera logs

status makes one curl /api/healthcheck per service and renders an aligned table (up/down). Independent of any active texera start. Useful for checking dev state from a different shell.

logs <name> is tail -F logs/texera-dev/<name>.log. Names come from the same registry.
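The status table can be sketched like this. The probe is passed in as a parameter so the rendering can be exercised without a live stack. probe_http, render_status, the localhost host, the 2 s cap, and skipping rows whose port is - are all assumptions, not a copy of the script.

```bash
# Assumed default probe: one HTTP GET against /api/healthcheck, capped at 2 s.
probe_http() {
  curl -fsS --max-time 2 -o /dev/null "http://localhost:$1/api/healthcheck" 2>/dev/null
}

# Render an aligned "SERVICE PORT STATE" table from registry rows.
# usage: render_status <probe-command> <row>...
render_status() {
  local probe=$1 row name port state
  shift
  printf '%-10s %-6s %s\n' SERVICE PORT STATE
  for row in "$@"; do
    name=${row%%|*}          # first field
    port=${row##*|}          # last field
    [[ $port == - ]] && continue
    if "$probe" "$port"; then state=up; else state=down; fi
    printf '%-10s %-6s %s\n' "$name" "$port" "$state"
  done
}
```

With the registry in scope, render_status probe_http "${SERVICES[@]}" prints one line per probed service.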

Hard problems and how they're solved

1. Per-service liveness while logs scroll past

Spawning seven JVM services into one terminal means thousands of lines of log spam during a normal boot. The user can't tell from the stream which services are up.

Solution: persistent bottom-pinned status bar. When stdout is a TTY, status_bar_init sets the terminal scroll region via DECSTBM (ESC[1;LINES-3 r), reserving the bottom 3 rows. A background poller redraws those rows every 2 s:

═════════════════════════════════════════════════════════════
 ✓ ALL 9 SERVICES UP  (47s elapsed)
═════════════════════════════════════════════════════════════

or on failure:

═════════════════════════════════════════════════════════════
 ✗ 2/9 DOWN: master✗ file…  (12s)
═════════════════════════════════════════════════════════════

Symbols: ✗ = pipeline collapsed (JVM exited), … = process alive but port not yet bound.

The whole 3-row redraw is one printf with DECSC/DECRC (save/restore cursor) around it, so concurrent log writes from the spawned services and the poller don't interleave at byte level. Worst case is one garbled frame, self-heals on the next 2 s tick.

When stdout isn't a TTY (CI, | tee log.txt, etc), status_bar_supported returns false and the code falls back to a one-shot wait + trailing banner.

Teardown is wired in two places: shutdown (the Ctrl+C trap) calls status_bar_teardown before printing anything else, so "shutting down…" lands in the normal layout, and trap status_bar_teardown EXIT is a belt-and-suspenders safety net so the terminal is never left with a stuck scroll region, even on an unexpected exit.
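The escape sequences involved can be sketched as below. The function names follow the ones mentioned above, but the bodies are an illustrative reconstruction: the row count is passed in explicitly, and status_bar_frame is a hypothetical helper that only builds the string.

```bash
ESC=$'\033'
RULE='═════════════════════════════════════════════════════════════'

status_bar_supported() { [[ -t 1 ]]; }

# Reserve the bottom 3 rows: DECSTBM sets the scroll region to rows 1..(rows-3),
# then the cursor is parked on the last row of that region.
status_bar_init() {
  local rows=$1
  printf '%s[1;%dr%s[%d;1H' "$ESC" "$((rows - 3))" "$ESC" "$((rows - 3))"
}

# Emit one frame in a single printf: DECSC (ESC 7), paint the three
# reserved rows, DECRC (ESC 8).
status_bar_frame() {
  local rows=$1 msg=$2
  printf '%s' "${ESC}7${ESC}[$((rows - 2));1H${ESC}[2K${RULE}${ESC}[$((rows - 1));1H${ESC}[2K ${msg}${ESC}[${rows};1H${ESC}[2K${RULE}${ESC}8"
}

# Reset the scroll region to the full screen, keeping the cursor where it
# was. Safe to call more than once, and a no-op when stdout is not a TTY.
status_bar_teardown() {
  status_bar_supported || return 0
  printf '%s' "${ESC}7${ESC}[r${ESC}8"
}
```

A caller would guard with status_bar_supported, run status_bar_init "$(tput lines)", register trap status_bar_teardown EXIT, and have the poller call status_bar_frame every 2 s. Making teardown idempotent is what lets both the Ctrl+C path and the EXIT trap call it.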

2. file-service vs LakeFS startup race

file-service calls LakeFSStorageClient.healthCheck() during boot (file-service/src/main/scala/.../FileService.scala:77). If LakeFS isn't accepting HTTP, the JVM exits.

docker compose up -d returns when the container is up, not when LakeFS's HTTP server is accepting connections — a 5–15 s gap. So file-service crashed ~50% of the time on cold starts.

Solution: start_lakefs now polls http://localhost:8000/_health (falling back to /) for up to 60 s after docker compose up -d, and only returns once LakeFS answers. Both endpoints verified to return 200 against the running container.
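The readiness wait can be sketched as a generic poll-until-success loop. wait_until and lakefs_ready are hypothetical names, and the 2 s per-request cap is an assumption; the endpoints and the 60 s budget are the ones described above.

```bash
# Poll a command until it succeeds or the timeout elapses.
# usage: wait_until <timeout_s> <interval_s> <command> [args...]
wait_until() {
  local timeout=$1 interval=$2 deadline
  shift 2
  deadline=$((SECONDS + timeout))
  until "$@"; do
    ((SECONDS >= deadline)) && return 1
    sleep "$interval"
  done
}

# Assumed probe: try /_health first, then fall back to /.
lakefs_ready() {
  curl -fsS --max-time 2 -o /dev/null http://localhost:8000/_health 2>/dev/null ||
    curl -fsS --max-time 2 -o /dev/null http://localhost:8000/ 2>/dev/null
}
```

After docker compose up -d, the caller would run wait_until 60 1 lakefs_ready || { echo "LakeFS not ready after 60 s" >&2; exit 1; } before spawning file-service.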

3. Orphan JVMs holding ports after stop

The previous bin/*-service.sh launchers and earlier iterations of bin/texera recorded the wrong PID. The pipeline ( exec binary ) | tee log | prefix_stream ends up with $! = the awk PID. Killing awk does not propagate to the JVM, which is a sibling, not a child. The fallback pkill -f <basename> didn't help either, because the launcher script's filename (computing-unit-master etc) isn't in the JVM's command line after exec java -cp ….

Result: every texera start after the first failed with BindException: 127.0.0.1:2552 Address already in use, and you'd have to lsof -ti :2552 | xargs kill -9 manually.

Solution: process groups.

  • Each spawn_* now toggles set -m briefly so the backgrounded pipeline gets its own process group. With job control on, the PGID equals the PID of the pipeline leader, which is the JVM subshell.
  • pgid_of_pipeline reads it via ps -o pgid= -p $! and stores it in the pidfile (so the pidfile effectively holds the JVM's PID, not awk's).
  • kill_all_pgids <grace> does kill -TERM -- -PGID per recorded group → SIGTERMs JVM + tee + awk together. Sleeps grace seconds. Then kill -KILL -- -PGID on any group still alive. Used by both shutdown (Ctrl+C, 2 s grace) and stop (subcommand, 3 s grace).
  • For JVMs left over from before this PR existed (no pidfile to consult), stop also runs pkill -f <mainclass> for each known Java main class:
    • org.apache.texera.web.{ComputingUnitMaster, ComputingUnitWorker, TexeraWebApplication}
    • org.apache.texera.service.{ConfigService, FileService, AccessControlService, ComputingUnitManagingService, WorkflowCompilingService}
    • List verified against META-INF/MANIFEST.MF in every built jar and the app_mainclass= declarations in the amber launcher scripts.

Free side benefit: is_spawn_alive now kill -0 PGIDs, which directly probes the JVM leader rather than using awk's liveness as a proxy. The status bar's "crashed" detection is precise.
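The process-group round trip can be sketched as below. This is an illustration, not the script's code: spawn_in_group and kill_group are hypothetical names, cat stands in for the tee + prefixer stages, and the kill polls for exit during the grace period instead of sleeping for all of it.

```bash
# Start "cmd | cat" as a background job in its own process group and
# record the group ID in the global SPAWN_PGID.
spawn_in_group() {
  set -m                                   # job control on: new job, new process group
  ( exec "$@" ) 2>&1 | cat >/dev/null &
  local last=$!                            # PID of the LAST stage (cat), not of the command
  set +m
  SPAWN_PGID=$(ps -o pgid= -p "$last" 2>/dev/null | tr -d '[:space:]')
}

# TERM the whole group, wait up to <grace> seconds, then KILL what is left.
kill_group() {
  local pgid=$1 grace=${2:-2} i
  [[ $pgid =~ ^[0-9]+$ ]] || return 1      # never signal an empty or bogus group
  kill -TERM -- "-$pgid" 2>/dev/null || return 0
  for ((i = 0; i < grace * 10; i++)); do
    kill -0 -- "-$pgid" 2>/dev/null || return 0   # group is gone
    sleep 0.1
  done
  kill -KILL -- "-$pgid" 2>/dev/null || true
}
```

For example, spawn_in_group sleep 30 followed by kill_group "$SPAWN_PGID" 2 takes down both sleep and cat with one signal. The result is stored in a global rather than printed because calling the spawner inside $(...) would start the job from a subshell.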

4. Ordering constraints

ComputingUnitMaster must bind its Pekko/Akka cluster port before computing-unit-worker tries to join. Encoded as one sleep 4 after spawning the master row. The launch loop walks SERVICES in declaration order, so the table itself is the canonical ordering.

LakeFS comes before all JVM spawns because file-service depends on it; Postgres comes before LakeFS because LakeFS uses it.
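The ordering rules can be shown as a dry-run plan. launch_plan is a hypothetical helper that only prints the steps; a real launcher would start infra and spawn each row where this one echoes.

```bash
# Print the launch plan for a registry, one step per line.
# Infra goes first (Postgres, then LakeFS), then rows in declaration order,
# with the 4 s pause after the master row described above.
launch_plan() {
  local row name
  echo "infra postgres"
  echo "infra lakefs"
  for row in "$@"; do
    name=${row%%|*}
    echo "spawn $name"
    if [[ $name == master ]]; then
      echo "sleep 4"
    fi
  done
}
```

Because the plan is derived from row order alone, moving a row in the table is the only change needed to reorder startup.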

File-by-file

  • bin/texera — entire orchestrator. Sections: service registry, mode table, color/printing, tool checks, infra (ensure_postgres, start_lakefs, stop_lakefs), stream prefixer, spawns, kill_all_pgids + shutdown trap, status/logs subcommands, setup/build, mode lookup + interactive menu, readiness probes (probe_port, is_spawn_alive, wait_for_services, print_readiness_banner), status bar, start, stop, dispatch.

  • bin/check-services.sh — standalone one-shot probe of every service's HTTP port. Independent of texera start session state. Prints a per-service table + a green/red trailing banner, exits non-zero on any failure. Useful from a second shell or in CI.

  • bin/build-services.sh — minor: adds the access-control-service unzip step that was missing, and renames the amber zip target from texera-*.zip to amber-*.zip to match the new artifact name.

What's intentionally not in scope

  • The IntelliJ .run/*.xml configs still work; they're the path for breakpoint debugging. texera start is for "I want everything running, fast."
  • The bin/single-node Docker Compose deploy isn't touched.
  • No CI hookup added. texera start backend works in non-TTY mode (banner fallback), but no GitHub Actions job exercises it.

Configuration knobs

  • TEXERA_READY_TIMEOUT (default 90) — seconds the one-shot non-TTY readiness check waits before giving up. The persistent bar polls forever; this only applies to the fallback path.
  • TEXERA_HOST (default localhost, used by check-services.sh) — host to probe from.
  • TEXERA_PROBE_TIMEOUT (default 2, used by check-services.sh) — per-probe curl timeout.

LakeFS-ready timeout in start_lakefs is currently hard-coded at 60 s; making it env-configurable is a small follow-up.
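A minimal sketch of how such knobs are usually defaulted and validated in Bash. TEXERA_LAKEFS_TIMEOUT is the proposed follow-up name and does not exist yet; require_uint is a hypothetical helper.

```bash
# Apply a default only when the variable is unset or empty.
: "${TEXERA_READY_TIMEOUT:=90}"
: "${TEXERA_HOST:=localhost}"
: "${TEXERA_PROBE_TIMEOUT:=2}"
: "${TEXERA_LAKEFS_TIMEOUT:=60}"   # hypothetical; currently hard-coded in start_lakefs

# Reject non-numeric overrides up front rather than failing inside a wait loop.
# usage: require_uint VARIABLE_NAME
require_uint() {
  local name=$1 value=${!1}
  [[ $value =~ ^[0-9]+$ ]] && return 0
  echo "error: $name must be a non-negative integer, got '$value'" >&2
  return 1
}
```

An override is then a per-invocation prefix, for example TEXERA_READY_TIMEOUT=180 texera start backend.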

Test plan

Verified locally (macOS, bash 3.2):

  • texera setup from a clean checkout, then texera start full → menu → mode 1 → all 9 services come up → bar flips green → frontend loads at :4200.
  • Ctrl+C while running → bar disappears, scroll region restored, JVMs all exit within a couple seconds.
  • Immediate texera start full again → no port conflicts (the previous orphan-JVM regression is gone).
  • texera stop from a separate shell while a start is running → both terminals come back clean.
  • Kill file-service mid-run via pkill -f FileService → bar flips to ✗ 1/9 DOWN: file✗ (… elapsed) within 2 s.
  • texera start infra → only Postgres + LakeFS/MinIO come up; script exits cleanly without blocking on wait.
  • texera status from a second terminal during a healthy run → all up.
  • texera logs file → tails logs/texera-dev/file.log.
  • PGID/group kill round-trip verified with a synthetic sleep | cat | awk pipeline (ps -o pid,pgid,comm -g <pgid> empty after one TERM).
  • LakeFS probe endpoints (/_health, /) both verified to return 200.
  • All Java mainclass names verified against built jar manifests and amber launcher scripts.

Not yet verified (follow-ups, see below):

  • Cross-terminal: only tested in macOS Terminal.app + tmux. Behavior in iTerm2, VS Code's embedded terminal, IntelliJ console, screen, etc. unverified.
  • Terminal resize during a run (SIGWINCH).
  • Headless / texera start backend | tee path through CI.

Known limitations

  • TTY only for the live bar. Non-TTY runs fall back to a one-shot banner. This is intentional but means texera start full | tee session.log won't show the live view.
  • Concurrent log writes can occasionally corrupt one bar frame. Single-printf renders mitigate but don't fully eliminate byte-level interleaving on shared stdout. Self-heals on the next refresh.
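  • Mainclass pkill in stop is broad. If you have another checkout of this repo running, texera stop here will kill that one too. Could be narrowed to the current user's processes with pkill -u "$USER" (that protects other accounts on a shared machine, though not a second checkout run by the same user); left as-is for now since most devs run a single instance.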
  • set -m semantics vary slightly across bash versions. Verified on macOS bash 3.2 and Linux bash 5.x; unusual non-POSIX shells aren't supported (and the shebang is #!/usr/bin/env bash anyway).

Follow-ups

Tracking these separately, not blocking this PR:

  1. bin/README.md section documenting subcommands, modes, env vars, and the status bar.
  2. Make the LakeFS readiness timeout env-configurable (TEXERA_LAKEFS_TIMEOUT).
  3. pkill -u "$USER" on the orphan-mainclass fallback.
  4. CI smoke job: texera start backend headless, assert exit code on readiness.
  5. AGENTS.md mention so subagents prefer texera start over the bin/*-service.sh set when bringing the stack up.

Migration notes

For existing contributors: nothing breaks. The old bin/*-service.sh scripts, IntelliJ .run/*.xml configs, and bin/single-node deploy are untouched and continue to work. texera start is opt-in.

The first time you use it: texera setup once, then texera start. If you've ever Ctrl+C'd one of the old scripts and left an orphan JVM, run texera stop first — its mainclass fallback will clean those up.
