[Hackathon] feat(bin): one-command local dev orchestrator (bin/texera) #5079
Open
MelihErduran wants to merge 1 commit into
# PR: One-command local dev orchestrator for Texera (`bin/texera`)

## Summary

Adds `bin/texera`, a single Bash CLI that replaces the previous "open 7 IntelliJ run configs in the right order, then `yarn start` in a different terminal" workflow with one command. It launches Postgres, LakeFS/MinIO, every backend JVM service, agent-service, and the frontend in the right order, with prefixed log streams in one terminal, a live bottom-pinned health bar, and clean teardown on Ctrl+C. It also ships the subcommands `setup`, `build`, `stop`, `status`, and `logs`. A complementary helper, `bin/check-services.sh`, provides a one-shot probe outside an active session.

This is a dev-tool addition: nothing about the services themselves changes. The existing `.run/*.xml` IntelliJ configs and `bin/single-node` Docker deploy paths are untouched.

## Motivation
Before this PR, getting Texera running locally required:
- Running the `bin/*-service.sh` scripts plus the un-scripted agent-service and frontend.
- `pkill`ing JVMs when something crashed, because there was no cleanup story.
- Restarting `file-service`, which crashed on boot ~50% of the time when LakeFS wasn't quite ready.

New contributors hit all of this on day one. Existing contributors lived with it but lost ~5 minutes per restart.
## What's in this PR

`bin/texera` is the main feature; the other two changes (`bin/check-services.sh` and `bin/build-services.sh`) are small.

### Subcommands
#### `texera setup`

Idempotent first-time bootstrap. Verifies the toolchain (java 17, sbt, node 24, yarn, docker, pg_isready, psql, curl, unzip), runs `bin/build-services.sh`, installs frontend (`yarn install`) and agent-service (`bun install`) dependencies, and applies `sql/texera_ddl.sql` and `sql/iceberg_postgres_catalog.sql`. Skips agent-service gracefully if `bun` isn't installed.

#### `texera build`

Delegates to `bin/build-services.sh` (`sbt clean dist`, then unzip each service's stage). This is the same path the deploy scripts use.

#### `texera start`

Five modes, chosen by argument or interactive menu: `full`, `backend`, `frontend`, `infra`, and `services`.

The interactive menu (`texera start` with no argument, TTY only) renders a box-drawn numbered prompt; `q` quits. If stdin is not a TTY and no argument is given, it errors out with the list of valid modes, so it's CI-safe.

The service registry is a single declarative table inside the script, so adding a service later means adding one row.
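A minimal sketch of how such a registry and the mode check might look in Bash, assuming a pipe-delimited `name|port|launcher` row format. The row format, the ports, the launcher paths, and the helpers `service_names`, `service_field`, and `resolve_mode` are all hypothetical; only the five mode names come from this PR, and a plain `read -p` prompt stands in for the box-drawn menu.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: a declarative service registry plus mode validation.
# The "name|port|launcher" row format and all values are placeholders,
# not the real contents of bin/texera. Indexed arrays only (bash 3.2 safe).

SERVICES=(
  "config|9001|config-service/bin/config-service"
  "file|9002|file-service/bin/file-service"
  "computing-unit-master|9003|amber/bin/computing-unit-master"
  "computing-unit-worker|9004|amber/bin/computing-unit-worker"
)

VALID_MODES="full backend frontend infra services"

# Print every service name in declaration order (which is the launch order).
service_names() {
  local row
  for row in "${SERVICES[@]}"; do
    echo "${row%%|*}"
  done
}

# service_field NAME N: print column N (1-based) of NAME's row; fail if absent.
service_field() {
  local row
  for row in "${SERVICES[@]}"; do
    if [ "${row%%|*}" = "$1" ]; then
      printf '%s\n' "$row" | cut -d'|' -f"$2"
      return 0
    fi
  done
  return 1
}

# resolve_mode [MODE]: echo a valid mode or fail listing the valid ones.
# With no argument and no TTY on stdin there is nobody to ask, so fail fast
# instead of blocking on a prompt (the CI-safe behaviour described above).
resolve_mode() {
  local mode=${1:-}
  if [ -z "$mode" ]; then
    if [ ! -t 0 ]; then
      echo "no mode given; valid modes: $VALID_MODES" >&2
      return 2
    fi
    read -r -p "mode ($VALID_MODES, q to quit): " mode
    if [ "$mode" = q ]; then
      return 1
    fi
  fi
  case " $VALID_MODES " in
    *" $mode "*) echo "$mode" ;;
    *) echo "unknown mode '$mode'; valid modes: $VALID_MODES" >&2; return 2 ;;
  esac
}
```

With this layout, adding a service is one more string in `SERVICES`; for example, `service_field file 2` prints `9002` here.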
Each row spawns from its sbt-native-packager staged binary (not `sbt runMain`). That avoids the sbt boot-lock contention you get from launching several `sbt` processes in parallel, and skips the sbt startup overhead per service.

Each service's stdout/stderr is piped through a colored prefixer. The color is a stable hash of the service name mapped onto an ANSI palette, and the stream prefixer is `awk -v p="$prefix" '{ print p, $0; fflush(); }'`. Per-service logs are also written to `logs/texera-dev/<name>.log`, so `texera logs <name>` works mid-run.

#### `texera stop`

`stop` kills every service launched by `texera start`, then runs `docker compose down` on the LakeFS/MinIO stack. The kill path matters because the previous scripts left orphan JVMs; see the Hard problems section below.
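Going back to the `texera start` spawn path, the per-service log pipeline (tee into `logs/texera-dev/<name>.log`, then the awk prefixer with a hash-derived color) can be sketched as follows. The helper names `color_for`, `prefix_stream`, and `run_logged`, the use of `cksum` as the stable hash, and the six-color palette are assumptions for illustration; the awk program is the one quoted above.

```bash
#!/usr/bin/env bash
# Sketch of the per-service log path: tee into logs/texera-dev/<name>.log,
# then prefix each line with a colored "[name]" tag. Helper names, the cksum
# hash, and the 6-color palette are assumptions, not bin/texera's exact code.

ESC=$'\033'
LOG_DIR=${LOG_DIR:-logs/texera-dev}

# Stable color per name: cksum's CRC is deterministic, so the same service
# always maps to the same ANSI foreground code in the range 31..36.
color_for() {
  local crc
  crc=$(printf '%s' "$1" | cksum | cut -d' ' -f1)
  echo $((31 + crc % 6))
}

# Prefix every line of stdin. fflush() pushes each line out immediately
# instead of letting it sit in awk's buffer while the JVM keeps running.
prefix_stream() {
  local prefix="${ESC}[$(color_for "$1")m[$1]${ESC}[0m"
  awk -v p="$prefix" '{ print p, $0; fflush(); }'
}

# Run a command with stdout+stderr appended to its log file and shown,
# prefixed, on the terminal.
run_logged() {
  local name=$1
  shift
  mkdir -p "$LOG_DIR"
  "$@" 2>&1 | tee -a "$LOG_DIR/$name.log" | prefix_stream "$name"
}
```

The log file receives the raw, unprefixed lines, which keeps `tail -F` output (and grepping) free of color codes.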
#### `texera status` and `texera logs`

`status` makes one `curl /api/healthcheck` per service and renders an aligned table (up/down). It is independent of any active `texera start`, which makes it useful for checking dev state from a different shell. `logs <name>` is `tail -F logs/texera-dev/<name>.log`; names come from the same registry.

## Hard problems and how they're solved
### 1. Per-service liveness while logs scroll past

Spawning seven JVM services into one terminal means thousands of lines of log spam during a normal boot, and the user can't tell from the stream which services are up.

Solution: a persistent bottom-pinned status bar. When stdout is a TTY, `status_bar_init` sets the terminal scroll region via DECSTBM (`ESC[1;<LINES-3>r`), reserving the bottom 3 rows. A background poller redraws those rows every 2 s, in both the healthy state and on failure.
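A minimal sketch of the three terminal operations involved (reserve, redraw, reset), using raw VT100/xterm escape sequences. The function names mirror the ones this PR mentions, but the bodies are an illustrative reconstruction, not the actual `bin/texera` code; in particular, treating `[ -t 1 ]` as the whole of `status_bar_supported` is an assumption.

```bash
#!/usr/bin/env bash
# Sketch of a bottom-pinned 3-row status bar using raw escape sequences.
# Names mirror the PR's (status_bar_init etc.); bodies are a reconstruction.

ESC=$'\033'

# Only draw when stdout is a terminal; pipes and CI get plain output.
status_bar_supported() { [ -t 1 ]; }

# DECSTBM: restrict scrolling to rows 1..LINES-3 so the bottom 3 rows stay put.
status_bar_init() {
  local lines=$1
  printf '%s' "${ESC}[1;$((lines - 3))r"
}

# Redraw the reserved rows. DECSC (ESC 7) saves the cursor and DECRC (ESC 8)
# restores it, so log output resumes where it left off. The frame is built
# first and emitted by a single printf so it normally goes out in one write.
status_bar_draw() {
  local lines=$1
  shift
  local frame="${ESC}7" row=$((lines - 2)) text
  for text in "$@"; do
    # CUP to the row, erase the entire line (EL 2), then write the text.
    frame+="${ESC}[${row};1H${ESC}[2K${text}"
    row=$((row + 1))
  done
  printf '%s' "${frame}${ESC}8"
}

# DECSTBM with no parameters resets the scroll region to the full screen.
status_bar_teardown() {
  printf '%s' "${ESC}[r"
}
```

Wiring it up would mean calling `status_bar_init "$(tput lines)"` once, registering `trap status_bar_teardown EXIT`, and running `status_bar_draw` from a background loop with `sleep 2`; those wiring details are assumptions about the real script.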
Symbols: `✗` = pipeline collapsed (JVM exited); `…` = process alive but port not yet bound.

The whole 3-row redraw is one `printf` wrapped in DECSC/DECRC (save/restore cursor), so concurrent log writes from the spawned services and the poller don't interleave at the byte level. The worst case is one garbled frame, which self-heals on the next 2 s tick.

When stdout isn't a TTY (CI, `| tee log.txt`, etc.), `status_bar_supported` returns false and the code falls back to a one-shot wait plus a trailing banner.

Teardown lives in two places: `shutdown` (the Ctrl+C trap) calls `status_bar_teardown` before printing anything else, so "shutting down…" lands in the normal layout; and `trap status_bar_teardown EXIT` is a belt-and-suspenders safety net, so the terminal is never left with a stuck scroll region even on an unexpected exit.

### 2. file-service vs LakeFS startup race
`file-service` calls `LakeFSStorageClient.healthCheck()` during boot (`file-service/src/main/scala/.../FileService.scala:77`). If LakeFS isn't accepting HTTP, the JVM exits. `docker compose up -d` returns when the container is up, not when LakeFS's HTTP server is accepting connections, which leaves a 5–15 s gap. As a result, file-service crashed ~50% of the time on cold starts.

Solution: `start_lakefs` now polls `http://localhost:8000/_health` (falling back to `/`) for up to 60 s after `docker compose up -d`, and only returns once LakeFS answers. Both endpoints were verified to return 200 against the running container.

### 3. Orphan JVMs holding ports after stop
The previous `bin/*-service.sh` launchers and earlier iterations of `bin/texera` recorded the wrong PID. The pipeline `( exec binary ) | tee log | prefix_stream` ends up with `$!` = the awk PID. Killing awk does not propagate to the JVM, which is a sibling, not a child. The fallback `pkill -f <basename>` didn't help either, because the launcher script's filename (`computing-unit-master` etc.) isn't in the JVM's command line after `exec java -cp …`.

Result: every `texera start` after the first failed with `BindException: 127.0.0.1:2552 Address already in use`, and you had to run `lsof -ti :2552 | xargs kill -9` manually.

Solution: process groups.
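Stripped down to a standalone sketch, the mechanism looks roughly like this; the details of how `bin/texera` wires it in follow. It assumes bash with job control available via `set -m`. `spawn_in_group`, `pgid_of`, and `kill_group` are hypothetical stand-ins for the script's `spawn_*`, `pgid_of_pipeline`, and `kill_all_pgids`, and `cat` stands in for the tee/awk stages. One deliberate swap: the PR reads the PGID with `ps -o pgid= -p $!`, while this sketch reads `/proc/<pid>/stat` first on Linux and only falls back to `ps`.

```bash
#!/usr/bin/env bash
# Sketch of process-group teardown for a backgrounded pipeline.
# spawn_in_group / pgid_of / kill_group are hypothetical stand-ins for
# bin/texera's spawn_* / pgid_of_pipeline / kill_all_pgids.

# PGID of a PID. Reads /proc on Linux; otherwise uses ps, as the PR does.
pgid_of() {
  local pid=$1 stat
  if [ -r "/proc/$pid/stat" ]; then
    read -r stat < "/proc/$pid/stat"
    # Drop "pid (comm) " so spaces in comm can't shift fields;
    # what remains starts with: state ppid pgrp ...
    set -- ${stat##*") "}
    echo "$3"
  else
    ps -o pgid= -p "$pid" | tr -d ' '
  fi
}

# Start "$@" as the first stage of a new process group, piped into a
# stand-in for the tee/prefix stages. Sets SPAWNED_PGID.
spawn_in_group() {
  set -m                                  # job control on: own process group
  ( exec "$@" ) 2>&1 | cat > /dev/null &
  set +m
  # $! is the *last* stage (cat), but the whole pipeline shares one PGID,
  # equal to the PID of the first stage (the exec'd command).
  SPAWNED_PGID=$(pgid_of "$!")
}

# SIGTERM the whole group, wait `grace` seconds, SIGKILL anything left.
kill_group() {
  local pgid=$1 grace=${2:-2}
  # Safety net: never signal the group this shell itself belongs to.
  if [ "$pgid" = "$(pgid_of $$)" ]; then
    echo "refusing to signal own process group $pgid" >&2
    return 1
  fi
  kill -TERM -- "-$pgid" 2>/dev/null || return 0   # group already gone
  sleep "$grace"
  if kill -0 -- "-$pgid" 2>/dev/null; then
    kill -KILL -- "-$pgid" 2>/dev/null
  fi
  return 0
}
```

The `pgid_of $$` guard matters: if job control were not actually in effect, the pipeline would share the caller's group, and signalling `-PGID` would take down the calling shell as well.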
- `spawn_*` now toggles `set -m` briefly so the backgrounded pipeline gets its own process group. With job control on, the PGID equals the PID of the pipeline leader, which is the JVM subshell. `pgid_of_pipeline` reads it via `ps -o pgid= -p $!` and stores it in the pidfile (so the pidfile effectively holds the JVM's PID, not awk's).
- `kill_all_pgids <grace>` does `kill -TERM -- -PGID` per recorded group, which SIGTERMs the JVM, tee, and awk together. It sleeps `grace` seconds, then sends `kill -KILL -- -PGID` to any group still alive. It is used by both `shutdown` (Ctrl+C, 2 s grace) and `stop` (subcommand, 3 s grace).
- `stop` also runs `pkill -f <mainclass>` for each known Java mainclass:
  - `org.apache.texera.web.{ComputingUnitMaster, ComputingUnitWorker, TexeraWebApplication}`
  - `org.apache.texera.service.{ConfigService, FileService, AccessControlService, ComputingUnitManagingService, WorkflowCompilingService}`

  These names match `META-INF/MANIFEST.MF` in every built jar and the `app_mainclass=` declarations in the amber launcher scripts.

Free side benefit: `is_spawn_alive` now runs `kill -0` against the recorded PGID, which directly probes the JVM leader rather than using awk's liveness as a proxy, so the status bar's "crashed" detection is precise.

### 4. Ordering constraints
`ComputingUnitMaster` must bind its Pekko/Akka cluster port before `computing-unit-worker` tries to join. This is encoded as one `sleep 4` after spawning the master row. The launch loop walks `SERVICES` in declaration order, so the table itself is the canonical ordering.

LakeFS comes before all JVM spawns because file-service depends on it; Postgres comes before LakeFS because LakeFS uses it.
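The ordering rules above, together with the LakeFS readiness gate from section 2, can be sketched as a generic poll-until-ready helper plus an ordered launch loop. `wait_until`, `lakefs_ready`, `spawn_service`, `launch_in_order`, `LAKEFS_URL`, and `MASTER_SETTLE_SECONDS` are hypothetical names; only the 60 s limit, the `/_health` then `/` fallback, and the 4 s pause after the master come from this PR.

```bash
#!/usr/bin/env bash
# Sketch of start ordering: wait for LakeFS to answer HTTP, then spawn
# services in registry order with a pause after the cluster master.
# All helper and variable names here are hypothetical.

# wait_until TIMEOUT CMD...: re-run CMD once per second until it succeeds
# or TIMEOUT seconds have passed. Returns 0 on success, 1 on timeout.
wait_until() {
  local timeout=$1
  shift
  local deadline=$((SECONDS + timeout))
  until "$@"; do
    if [ "$SECONDS" -ge "$deadline" ]; then
      return 1
    fi
    sleep 1
  done
}

# LakeFS counts as ready once /_health, or failing that /, answers without
# an HTTP error (curl -f treats 4xx/5xx responses as failures).
lakefs_ready() {
  local base=${LAKEFS_URL:-http://localhost:8000}
  curl -fsS -o /dev/null --max-time 2 "$base/_health" 2>/dev/null ||
    curl -fsS -o /dev/null --max-time 2 "$base/" 2>/dev/null
}

MASTER_SETTLE_SECONDS=${MASTER_SETTLE_SECONDS:-4}

# Placeholder spawn; the real script launches the staged binary here.
spawn_service() { echo "spawn $1"; }

# Spawn services in the order given (i.e. registry declaration order).
launch_in_order() {
  local name
  for name in "$@"; do
    spawn_service "$name"
    if [ "$name" = computing-unit-master ]; then
      sleep "$MASTER_SETTLE_SECONDS"   # let the cluster port bind first
    fi
  done
}

# Hypothetical wiring (not executed here):
#   docker compose up -d
#   wait_until 60 lakefs_ready || { echo "LakeFS not ready" >&2; exit 1; }
#   launch_in_order config file computing-unit-master computing-unit-worker
```

A fixed pause after the master is simple but timing-based; polling the master's cluster port with the same `wait_until` helper would be a readiness-based alternative.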
## File-by-file

- `bin/texera`: the entire orchestrator. Sections: service registry, mode table, color/printing, tool checks, infra (`ensure_postgres`, `start_lakefs`, `stop_lakefs`), stream prefixer, spawns, `kill_all_pgids` + `shutdown` trap, `status`/`logs` subcommands, `setup`/`build`, mode lookup + interactive menu, readiness probes (`probe_port`, `is_spawn_alive`, `wait_for_services`, `print_readiness_banner`), status bar, `start`, `stop`, dispatch.
- `bin/check-services.sh`: a standalone one-shot probe of every service's HTTP port, independent of `texera start` session state. It prints a per-service table plus a green/red trailing banner, and exits non-zero on any failure. Useful from a second shell or in CI.
- `bin/build-services.sh`: minor. Adds the `access-control-service` unzip step that was missing, and renames the amber zip target from `texera-*.zip` to `amber-*.zip` to match the new artifact name.

## What's intentionally not in scope
- The `.run/*.xml` configs still work; they're the path for breakpoint debugging. `texera start` is for "I want everything running, fast."
- The `bin/single-node` Docker Compose deploy isn't touched.
- `texera start backend` works in non-TTY mode (banner fallback), but no GitHub Actions job exercises it.

## Configuration knobs
- `TEXERA_READY_TIMEOUT` (default 90): seconds the one-shot non-TTY readiness check waits before giving up. The persistent bar polls forever; this only applies to the fallback path.
- `TEXERA_HOST` (default `localhost`, used by `check-services.sh`): the host to probe.
- `TEXERA_PROBE_TIMEOUT` (default 2, used by `check-services.sh`): per-probe curl timeout.

The LakeFS-ready timeout in `start_lakefs` is currently hard-coded at 60 s; making it env-configurable is a small follow-up.

## Test plan
Verified locally (macOS, bash 3.2):
- `texera setup` from a clean checkout, then `texera start full` → menu → mode 1 → all 9 services come up → bar flips green → frontend loads at `:4200`.
- `texera start full` again → no port conflicts (the previous orphan-JVM regression is gone).
- `texera stop` from a separate shell while a `start` is running → both terminals come back clean.
- Killing `file-service` mid-run via `pkill -f FileService` → bar flips to `✗ 1/9 DOWN: file✗ (… elapsed)` within 2 s.
- `texera start infra` → only Postgres + LakeFS/MinIO come up; the script exits cleanly without blocking on `wait`.
- `texera status` from a second terminal during a healthy run → all up.
- `texera logs file` → tails `logs/texera-dev/file.log`.
- Process-group teardown on a synthetic `sleep | cat | awk` pipeline (`ps -o pid,pgid,comm -g <pgid>` empty after one TERM).
- LakeFS endpoints (`/_health`, `/`) both verified to return 200.

Not yet verified (follow-ups, see below):

- The `texera start backend | tee` path through CI.

## Known limitations
- `texera start full | tee session.log` won't show the live view; without a TTY it falls back to the one-shot banner.
- `stop` is broad. If you have another checkout of this repo running, `texera stop` here will kill that one too. `pkill -u "$USER"` would at least limit it to your own processes (another checkout run by the same user would still match); left as-is for now since most devs run a single instance.
- `set -m` semantics vary slightly across bash versions. Verified on macOS bash 3.2 and Linux bash 5.x; unusual non-POSIX shells aren't supported (and the shebang is `#!/usr/bin/env bash` anyway).

## Follow-ups
Tracking these separately, not blocking this PR:
- A `bin/README.md` section documenting subcommands, modes, env vars, and the status bar.
- Make the LakeFS-ready timeout env-configurable (`TEXERA_LAKEFS_TIMEOUT`).
- Use `pkill -u "$USER"` on the orphan-mainclass fallback.
- A CI job that runs `texera start backend` headless and asserts the exit code on readiness.
- An `AGENTS.md` mention so subagents prefer `texera start` over the `bin/*-service.sh` set when bringing the stack up.

## Migration notes
For existing contributors, nothing breaks. The old `bin/*-service.sh` scripts, IntelliJ `.run/*.xml` configs, and `bin/single-node` deploy are untouched and continue to work. `texera start` is opt-in.

The first time you use it, run `texera setup` once, then `texera start`. If you've ever Ctrl+C'd one of the old scripts and left an orphan JVM, run `texera stop` first; its mainclass fallback will clean those up.