fix(docker): unblock 8 GB VM target — OpenBLAS musl pthread + RTS memory discipline#60

Merged: ccomb merged 2 commits into main from fix/openblas-musl-pthread-stack on May 16, 2026

ccomb commented on May 16, 2026

Goal

Make VoLCA run on an 8 GB RAM VM with pre-loaded databases (no parser pressure). Two independent root causes were stopping us; both are addressed here, in two atomic commits.

Commit 1 — RTS memory return (subset of #59)

`docker/rts-flags.sh`:

  • Add `-Fd1.0` (GHC 9.10+). Decays free heap blocks back to the OS over ~1 idle period. Without it (default 4.0), RSS stays pinned near the peak for minutes after a spike.
  • `-I30` → `-I0.3` (GHC default). A 30 s deferred idle GC was hiding live-data drops and starving `-Fd` of free blocks.

`-M` left at 75 % of RAM for now. #59 also halves it to 50 %; that's likely the right call eventually (MUMPS Fortran workspace allocates outside the GHC heap so 75 % leaves no headroom on tight VMs), but the immediate motivator for #59 was the OOM-killer firing — which we now attribute to the OpenBLAS musl crash addressed in commit 2. Re-evaluate `-M` once we have RSS curves on the 8 GB target.

`-A`, `-c`, `-F1.5`, `-qg0`, `-n` left alone — they trade off with throughput and shouldn't move without benchmarking.
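
As a rough illustration of how the memory-related flags above fit together, here is a minimal Python sketch. It is not the contents of `docker/rts-flags.sh`; the function name `rts_flags` and its parameters are hypothetical, and only the three flags whose values are stated above are composed (the throughput flags are omitted because their values are not all given here).

```python
def rts_flags(total_ram_bytes: int, heap_percent: int = 75) -> str:
    """Compose a '+RTS ... -RTS' string for the memory-related flags.

    Hypothetical helper for illustration only; the real script may
    derive the memory limit differently (for example from a cgroup limit).
    """
    mib = 1024 * 1024
    # -M caps the GHC heap. GHC accepts a size with an 'm' suffix (MiB).
    max_heap_mib = (total_ram_bytes * heap_percent // 100) // mib
    flags = [
        f"-M{max_heap_mib}m",  # heap cap at heap_percent of RAM
        "-Fd1.0",              # return free heap blocks to the OS faster
        "-I0.3",               # idle GC after 0.3 s of inactivity
    ]
    return "+RTS " + " ".join(flags) + " -RTS"


if __name__ == "__main__":
    # 8 GiB VM with the 75 % cap kept by this PR.
    print(rts_flags(8 * 1024**3))
```

Passing `heap_percent=50` would model the stricter cap proposed in #59.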

Commit 2 — OpenBLAS pthread stack on musl

Static Alpine/musl builds segfaulted (exit 139 / SIGSEGV) inside MUMPS factorization on the first dense BLAS3 call. musl's hardcoded 128 KB default pthread stack (vs glibc's 8 MB read from `RLIMIT_STACK`) is overflowed by OpenBLAS `DYNAMIC_ARCH` Fortran kernels with large auto-arrays.

Patch `driver/others/blas_server.c` during the Docker build to call `pthread_attr_setstacksize(&attr, 8 << 20)` right after `pthread_attr_init`. Two `grep` guards bracket the `sed` — a future upstream refactor fails the Docker build loudly instead of silently regressing the runtime.
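
The guard-patch-guard pattern can be sketched as follows. The Docker build uses `grep` and `sed`; this Python version is only a stand-in to show the control flow. The exact anchor string, the injected text layout, and the names `patch_stack_size` and `PatchError` are assumptions made for the example.

```python
ANCHOR = "pthread_attr_init(&attr);"                    # assumed anchor text
INJECTED = "pthread_attr_setstacksize(&attr, 8 << 20);"  # 8 MiB stack


class PatchError(RuntimeError):
    """Raised when the patch cannot be applied or verified."""


def patch_stack_size(source: str) -> str:
    # Guard 1: the anchor must exist, otherwise upstream has refactored
    # the file and the substitution would silently do nothing.
    if ANCHOR not in source:
        raise PatchError("anchor not found: " + ANCHOR)

    # The substitution itself: append the call right after the anchor.
    patched = source.replace(ANCHOR, ANCHOR + " " + INJECTED)

    # Guard 2: the injected call must now be present in the output.
    if INJECTED not in patched:
        raise PatchError("injected call missing after patch")
    return patched


if __name__ == "__main__":
    sample = "pthread_attr_init(&attr);\npthread_create(&t, &attr, fn, 0);\n"
    print(patch_stack_size(sample))
```

Both guards are purely textual. They confirm that the line is in the source file, but they cannot tell whether it sits inside a preprocessor branch that is compiled out.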

Workaround for verification: `OPENBLAS_NUM_THREADS=1` (no worker pthreads spawned) makes the crash disappear without the patch, which points to the worker-thread stack as the cause.

Why combine in one PR

The two changes solve different layers (Haskell RTS hygiene vs native BLAS thread setup), but the goal is shared: shipping a binary that runs on an 8 GB VM with Agribalyse-class workloads. The OpenBLAS fix is the load-bearing change; the RTS tweaks are cheap incremental hygiene on top.

Test plan

  • `./docker-build.sh --with-frontend` succeeds (both `grep` guards pass)
  • On an 8 GB Alpine VM: `docker run … volca-with-frontend …`, load Agribalyse 3.2 from cache, request `/impacts/Environmental Footprint 3.1 (adapted)` — completes without exit 139
  • No `OPENBLAS_NUM_THREADS` env override needed: parallel BLAS workers auto-scale to `nproc`
  • Printed `RTS: ... -> +RTS ...` summary shows `-Fd1.0` and `-I0.3` (`-M` stays at 75 % of cgroup limit)
  • Spot-check post-request RSS drops within seconds of going idle (`-Fd1.0` working)

Statically-linked Alpine/musl builds segfault inside MUMPS factorization
on the first BLAS3 call from a worker thread (exit 139). musl's default
pthread stack is 128 KB, vs glibc's 8 MB read from RLIMIT_STACK. OpenBLAS
worker threads inherit that 128 KB and overflow on DYNAMIC_ARCH Fortran
kernels that hold large auto-arrays — typically dgemm/dtrsm on dense
frontal blocks during sparse LU factorization.

Reproduces on a 16 GB VM running volca-with-frontend, on the first
impact request hitting Agribalyse 3.2 (21510 activities). The crash is
not memory-bound (RSS stays low, exit code 139 = SIGSEGV, not 137).

Patch driver/others/blas_server.c to call pthread_attr_setstacksize at
8 MB before pthread_create. The change aligns musl's behaviour with
glibc's effective default and is a no-op on glibc rebuilds.

Two grep guards bracket the sed: if a future OpenBLAS release moves the
pthread_attr_init anchor, the Docker build fails loudly instead of
silently producing a binary that crashes in production.

ccomb changed the title from "fix(docker): patch OpenBLAS pthread stack size for musl static build" to "fix(docker): unblock 8 GB VM target — OpenBLAS musl pthread + RTS memory discipline" on May 16, 2026

Two cheap RTS tweaks (cherry-picked from #59, minus the -M change):

- Add -Fd1.0 (GHC 9.10+): decay free heap blocks back to the OS over
  ~1 idle period instead of the default 4.0, which keeps RSS pinned
  near peak for minutes after a parsing spike.
- -I30 -> -I0.3 (GHC default): trigger idle-time major GC promptly.
  The previous 30 s deferral hid live-data drops and starved -Fd of
  free blocks to release.

Keeping -M at 75 % of RAM for now: dropping to 50 % may be the right
call eventually, but the OpenBLAS musl crash that motivated the change
in #59 is fixed independently in the previous commit. Re-evaluate -M
once we have RSS curves on the 8 GB target.

ccomb force-pushed the fix/openblas-musl-pthread-stack branch from 19a7e21 to bf4ffd7 on May 16, 2026 15:44
ccomb merged commit 3b3cf5e into main on May 16, 2026 (5 checks passed)
ccomb deleted the fix/openblas-musl-pthread-stack branch on May 16, 2026 16:00

ccomb added a commit that referenced this pull request on May 16, 2026:

PR #60 shipped broken because nothing in the build pipeline verified
the injected pthread_attr_setstacksize call survived compilation; the
silent regression was only caught by a production SIGSEGV. The fix in
this branch (-Wl,-z,stack-size=8388608) replaces that with a different
invariant — "PT_GNU_STACK->p_memsz == 0x800000 in the shipped ELF" —
which is again only checked manually.

Add a readelf-based assertion right after UPX so a future linker-flag
refactor (or a UPX-side header rewrite) fails the image build with a
diagnostic naming the actual value found, instead of recurring as a
runtime crash on a customer VM.

The check runs on /build/output/volca (post-strip, post-UPX) because
that's the binary that actually runs. binutils — which provides
readelf — is already in the build-stage apk add.
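
To make the invariant concrete, the sketch below reads `PT_GNU_STACK`'s `p_memsz` straight from the ELF program headers and compares it with `0x800000`. The build itself uses `readelf`; this Python version is only an illustration of what that assertion inspects. It handles 64-bit little-endian ELF only, and the names `gnu_stack_memsz` and `check_stack_size` are hypothetical.

```python
import struct

PT_GNU_STACK = 0x6474E551       # program header type for the stack segment
EXPECTED_STACK_SIZE = 0x800000  # 8 MiB, i.e. -Wl,-z,stack-size=8388608


def gnu_stack_memsz(elf: bytes):
    """Return p_memsz of PT_GNU_STACK, or None if the header is absent."""
    if elf[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if elf[4] != 2 or elf[5] != 1:
        raise ValueError("sketch handles 64-bit little-endian ELF only")
    # ELF64 header: e_phoff at offset 32, e_phentsize/e_phnum at 54/56.
    (e_phoff,) = struct.unpack_from("<Q", elf, 32)
    e_phentsize, e_phnum = struct.unpack_from("<HH", elf, 54)
    for i in range(e_phnum):
        off = e_phoff + i * e_phentsize
        (p_type,) = struct.unpack_from("<I", elf, off)
        if p_type == PT_GNU_STACK:
            # Elf64_Phdr: p_memsz lives 40 bytes into the entry.
            (p_memsz,) = struct.unpack_from("<Q", elf, off + 40)
            return p_memsz
    return None


def check_stack_size(elf: bytes, expected: int = EXPECTED_STACK_SIZE) -> None:
    """Raise with the value actually found if the invariant does not hold."""
    found = gnu_stack_memsz(elf)
    if found != expected:
        shown = "missing" if found is None else hex(found)
        raise RuntimeError(
            f"PT_GNU_STACK p_memsz is {shown}, expected {hex(expected)}"
        )
```

A build step could call `check_stack_size(open("/build/output/volca", "rb").read())` and let the exception fail the build.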

ccomb added a commit that referenced this pull request on May 16, 2026:

PR #60 diagnosed the SIGSEGV correctly (musl's 128 KB default pthread
stack overflows on the first BLAS3 call from MUMPS factorization) but
patched the wrong place: the injected pthread_attr_setstacksize sat
inside #ifdef NEED_STACKATTR, which blas_server.c #undef's on Linux.
The code compiled out; the crash reproduced unchanged on a 16 GB VM
and `OPENBLAS_NUM_THREADS=1` worked around it.

Real fix: bake the desired default into the ELF's PT_GNU_STACK header
via -Wl,-z,stack-size=8388608 in LINK_MODE=musl. musl reads p_memsz at
process start and uses it as __default_stacksize, so every pthread
created with NULL attr — OpenBLAS workers, GHC RTS capabilities,
anything else — starts with 8 MB. Linker flag covers more ground than
a source patch would and avoids the second pthread_create site in
goto_set_num_threads that lacks the `attr` symbol.

Also adds a readelf assertion in docker/Dockerfile right after UPX so
a future linker-flag refactor — or a UPX-side header rewrite — fails
the image build loudly instead of recurring as a runtime crash on a
customer VM.