diff --git a/AGENTS.md b/AGENTS.md index 97030a873..907e90059 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -141,6 +141,55 @@ ╚══════════════════════════════════════════════════════════════════════════════╝ ``` +## ⚠️⚠️⚠️ ALWAYS WRAP `jperl`/`jcpan` IN `timeout` ⚠️⚠️⚠️ + +``` +╔══════════════════════════════════════════════════════════════════════════════╗ +║ ║ +║ Investigative agents that launch PerlOnJava test runs MUST wrap every ║ +║ `jperl`/`jcpan`/`prove` invocation with `timeout N` — NEVER just ║ +║ `/usr/bin/time -p` (which only measures, never kills) and NEVER bare ║ +║ `./jperl …` for anything that could hang. ║ +║ ║ +║ # WRONG — JVM survives forever if it hangs ║ +║ /usr/bin/time -p ./jperl t/foo.t ║ +║ ./jperl t/foo.t & ║ +║ ║ +║ # RIGHT — JVM is hard-killed after 60 s ║ +║ timeout 60 ./jperl t/foo.t ║ +║ timeout 60 ./jperl -Ilib -It/lib t/foo.t ║ +║ ║ +║ Why this matters: ║ +║ ║ +║ - `./jperl` ends with `exec java …`, so the bash wrapper is replaced ║ +║ by the JVM. When the agent's own bash exits, those JVMs get ║ +║ reparented to PID 1 and KEEP RUNNING at 100% CPU — there is no ║ +║ SIGHUP propagation and no JVM-side self-watchdog. ║ +║ - On a 48 GB Mac the JVM defaults to ~12 GB heap. A handful of orphan ║ +║ JVMs at 100% CPU silently starves the whole machine, which then ║ +║ makes the NEXT `jcpan -t Module` run miss the 300 s no-output deadline ║ +║ in `TAP::Parser::Iterator::Process` — the symptom looks like "test ║ +║ X hangs" when it's really just CPU starvation from orphans. ║ +║ - `t/96_is_deteministic_value.t` and `t/76joins.t` SIGKILLs in PR #635 ║ +║ CI runs were caused exactly by this: a previous agent left ~14 orphan ║ +║ JVMs at 100% CPU each, load avg climbed to 50, and the harness gave ║ +║ up on innocent tests after 5 minutes of no TAP output. ║ +║ ║ +║ If your run REALLY may exceed any sane wall clock (e.g. a full ║ +║ `jcpan -t DBIx::Class` is ~40 min), still wrap it: `timeout 3600 ...`. 
║ +║ If you spawn parallel test workers, give each its own `timeout`. ║ +║ ║ +║ When you finish an investigation, sanity-check your cleanup: ║ +║ ║ +║ ps aux | awk '$3 > 20 {print $2, $3, $11, $12}' ║ +║ ║ +║ If any unexpected `java …perlonjava…` shows up, kill it: ║ +║ ║ +║ pkill -9 -f "perlonjava-.*\.jar.*\.t\b" ║ +║ ║ +╚══════════════════════════════════════════════════════════════════════════════╝ +``` + ## Incident Log (do not delete — this is why the rules above exist) | Date | What was lost | Root cause | @@ -148,6 +197,7 @@ | 2026-04-28 | ~600 cpan-tester module results (4736 → 4139) | Agent ran `git checkout dev/cpan-reports/` on an unstaged refresh; concurrent `cpan_random_tester.pl` instances also race on `.dat` files (separate bug). | | 2026-04-29 | cpan-reports refresh commit (briefly, on a feature branch — recovered from reflog) | Agent resolved a rebase conflict with `git checkout --ours` thinking it would keep the branch's version. During rebase, `--ours` means UPSTREAM, so the upstream files were taken, the replayed commit became empty, and rebase silently dropped it. Recovery: `git reset --hard ` from `git reflog`, then re-rebase using `--theirs`. | | 2026-04-30 | (no work lost — recovered) Working tree on `fix/class-trait-tests` was overwritten with master content | Agent ran `git checkout master -- .` to A/B test failures vs master without first snapshotting and without switching branches. Recovery only worked because the changes had already been committed to HEAD: `git restore .` (also a forbidden command on a dirty tree, but safe here because "dirty" was master content, not user work) brought the tree back from HEAD. Correct workflow would have been: stash via `git diff > /tmp/wip.patch`, or use `git worktree add` for the master comparison instead of mutating the current tree. 
| +| 2026-04-30 | A full afternoon chasing a phantom "DBIx::Class regression" in `t/76joins.t` / `t/96_is_deteministic_value.t` | Investigative agent launched the test repeatedly under `/usr/bin/time -p ./jperl …` (no `timeout` wrapper). Each hung JVM survived past the agent's lifetime, accumulated as ~14 orphans at 100% CPU each, and starved the active `jcpan` harness — which then SIGKILLed innocent tests after 300 s of no TAP output. Symptom looked exactly like a real perf regression. Fix: always `timeout N ./jperl …` for any potentially-hanging run. | When you cause a new incident, append a row here in the same commit that fixes it. Future agents need to see that these warnings are real. diff --git a/jcpan b/jcpan index f7d7515b9..822f56b02 100755 --- a/jcpan +++ b/jcpan @@ -56,6 +56,15 @@ fi # Override: JPERL_TEST_TIMEOUT=0 (disable) or JPERL_TEST_TIMEOUT=600 (10 min) export JPERL_TEST_TIMEOUT="${JPERL_TEST_TIMEOUT:-300}" +# Enable the orphan-exit watchdog in every jperl this run spawns. If the +# parent jcpan / test_harness process is killed (e.g. SIGKILL'd by the +# user, or terminated by a CI step), each child JVM polls its initial +# parent PID every 2s and self-exits when that parent disappears. +# Without this, killing the harness leaves dozens of in-flight test +# JVMs reparented to PID 1, all spinning at 100% CPU until manually +# pkill'd. See AGENTS.md "ALWAYS WRAP jperl/jcpan IN timeout" rule. +export JPERL_ORPHAN_EXIT=1 + # Expose the jperl launcher AND the jcpan launcher itself so distroprefs # (e.g. 
Moose.yml) can run upstream tests against the bundled shims with # `prove --exec jperl`, and bootstrap missing helper modules with diff --git a/jcpan.bat b/jcpan.bat index e741824a8..788364bea 100644 --- a/jcpan.bat +++ b/jcpan.bat @@ -22,6 +22,10 @@ goto parse_args :run rem Set default per-test timeout (300s) to kill hanging tests if not defined JPERL_TEST_TIMEOUT set "JPERL_TEST_TIMEOUT=300" +rem Enable orphan-exit watchdog in every jperl this run spawns — when +rem the parent jcpan dies, each child JVM self-exits within ~4s instead +rem of getting reparented to PID 1 and burning 100% CPU forever. +set "JPERL_ORPHAN_EXIT=1" rem Expose jperl and jcpan launchers, and prepend SCRIPT_DIR to PATH so rem shell-spawned subprocesses (distroprefs commandlines, prove --exec, rem etc.) can find jperl/jcpan without tokens that don't expand in diff --git a/jprove b/jprove index 647370b47..63f176297 100755 --- a/jprove +++ b/jprove @@ -15,4 +15,11 @@ else exit 1 fi +# Enable the orphan-exit watchdog in every jperl this run spawns. If the +# parent jprove process is killed (e.g. SIGKILL'd by the user, or +# terminated by a CI step), each child JVM polls its initial parent PID +# every 2s and self-exits when that parent disappears. See the matching +# block in `./jcpan` and AGENTS.md for the full rationale. +export JPERL_ORPHAN_EXIT=1 + exec "$SCRIPT_DIR/jperl" "$PROVE_SCRIPT" "$@" diff --git a/jprove.bat b/jprove.bat index a15cdde6a..4d4f5ca97 100644 --- a/jprove.bat +++ b/jprove.bat @@ -6,5 +6,9 @@ rem Repository: github.com/fglock/PerlOnJava rem Get the directory where this script is located set SCRIPT_DIR=%~dp0 +rem Enable orphan-exit watchdog in every jperl this run spawns — when +rem the parent jprove dies, each child JVM self-exits within ~4s. 
+set "JPERL_ORPHAN_EXIT=1" + rem Run jperl with the prove script call "%SCRIPT_DIR%jperl.bat" "%SCRIPT_DIR%src\main\perl\bin\prove" %* diff --git a/src/main/java/org/perlonjava/app/cli/Main.java b/src/main/java/org/perlonjava/app/cli/Main.java index 9600e299f..c49bbc406 100644 --- a/src/main/java/org/perlonjava/app/cli/Main.java +++ b/src/main/java/org/perlonjava/app/cli/Main.java @@ -18,6 +18,65 @@ public class Main { static { // Set default locale to US (uses dot as decimal separator) Locale.setDefault(Locale.US); + + // Optional orphan-exit watchdog. When the env var + // JPERL_ORPHAN_EXIT is set (typically by `./jcpan` and + // `./jprove`, which spawn many short-lived sub-jperls), this + // JVM self-exits a few seconds after its initial parent + // process disappears. Without this, a `kill -9` on the parent + // jcpan/test_harness leaves all in-flight test JVMs reparented + // to PID 1, where they happily keep running at 100% CPU + // forever — burning the box and starving subsequent runs. + // + // SIGTERM-style parent death is already handled by the + // shutdown hook in RuntimeIO; this watchdog covers the SIGKILL + // case (no shutdown hooks fire on the kernel-side kill). + // + // Direct `./jperl your_script.pl` does NOT set the env var, so + // user programs are never killed when their shell exits — they + // get the standard nohup-style behavior they'd expect from any + // long-running interpreter. + if (System.getenv("JPERL_ORPHAN_EXIT") != null) { + startOrphanWatchdog(); + } + } + + private static void startOrphanWatchdog() { + java.util.Optional<java.lang.ProcessHandle> parentOpt = + java.lang.ProcessHandle.current().parent(); + if (parentOpt.isEmpty()) return; // no parent? nothing to watch. + long initialParentPid = parentOpt.get().pid(); + // PID 1 = init/launchd. If we were directly spawned by it, + // there's no point watching — we're already at the root. + if (initialParentPid <= 1) return; + + Thread watchdog = new Thread(() -> { + // Poll every 2s.
Exit only after two consecutive misses + // (~4s) to avoid race with rapid parent restarts. + int missCount = 0; + while (true) { + try { + Thread.sleep(2000); + } catch (InterruptedException ie) { + return; + } + java.util.Optional<java.lang.ProcessHandle> p = + java.lang.ProcessHandle.of(initialParentPid); + boolean parentGone = p.isEmpty() || !p.get().isAlive(); + if (parentGone) { + if (++missCount >= 2) { + System.err.println("[jperl] orphaned: parent PID " + + initialParentPid + + " is gone — exiting"); + Runtime.getRuntime().halt(143); // 128 + SIGTERM + } + } else { + missCount = 0; + } + } + }, "perlonjava-orphan-watchdog"); + watchdog.setDaemon(true); + watchdog.start(); } /** diff --git a/src/main/java/org/perlonjava/core/Configuration.java b/src/main/java/org/perlonjava/core/Configuration.java index e7a2fe6de..91811f2fd 100644 --- a/src/main/java/org/perlonjava/core/Configuration.java +++ b/src/main/java/org/perlonjava/core/Configuration.java @@ -33,7 +33,7 @@ public final class Configuration { * Automatically populated by Gradle/Maven during build. * DO NOT EDIT MANUALLY - this value is replaced at build time.
*/ - public static final String buildTimestamp = "Apr 30 2026 16:52:57"; + public static final String buildTimestamp = "Apr 30 2026 11:43:39"; // Prevent instantiation private Configuration() { diff --git a/src/main/java/org/perlonjava/runtime/perlmodule/Storable.java b/src/main/java/org/perlonjava/runtime/perlmodule/Storable.java index e6592f6c3..0006bd929 100644 --- a/src/main/java/org/perlonjava/runtime/perlmodule/Storable.java +++ b/src/main/java/org/perlonjava/runtime/perlmodule/Storable.java @@ -96,6 +96,7 @@ private static RuntimeList freezeImpl(RuntimeArray args, boolean netorder) { // byte-string scalar so consumers see it as raw bytes (matches // the existing freeze() return shape). RuntimeScalar result = new RuntimeScalar(encoded); + result.type = RuntimeScalarType.BYTE_STRING; return result.getList(); } catch (Exception e) { return WarnDie.die(new RuntimeScalar("freeze failed: " + e.getMessage()), new RuntimeScalar("\n")).getList(); diff --git a/src/main/java/org/perlonjava/runtime/perlmodule/storable/StorableWriter.java b/src/main/java/org/perlonjava/runtime/perlmodule/storable/StorableWriter.java index 51b56a669..48d79c8b4 100644 --- a/src/main/java/org/perlonjava/runtime/perlmodule/storable/StorableWriter.java +++ b/src/main/java/org/perlonjava/runtime/perlmodule/storable/StorableWriter.java @@ -223,9 +223,23 @@ private boolean tryEmitHook(StorableContext c, RuntimeScalar refScalar, String c // First element is the frozen cookie; rest are sub-refs. RuntimeScalar cookieSv = items.get(0); - byte[] frozen = cookieSv == null - ? new byte[0] - : cookieSv.toString().getBytes(StandardCharsets.UTF_8); + // The cookie returned by STORABLE_freeze is a binary Storable + // blob (chars 0..255 stored as Java chars). Treat it as raw + // bytes — encoding it as UTF-8 mangles the high bytes (0x80..0xFF + // become 2-byte sequences) and corrupts the embedded stream. 
+ byte[] frozen; + if (cookieSv == null) { + frozen = new byte[0]; + } else if (cookieSv.type == RuntimeScalarType.BYTE_STRING) { + String s = cookieSv.toString(); + frozen = new byte[s.length()]; + for (int i = 0; i < frozen.length; i++) frozen[i] = (byte) s.charAt(i); + } else { + // Plain STRING — also a byte string in practice for hook cookies, + // since STORABLE_freeze returns the result of nfreeze(). Use + // ISO_8859_1 to preserve every char 0..255 as a single byte. + frozen = cookieSv.toString().getBytes(StandardCharsets.ISO_8859_1); + } int subCount = items.size() - 1; // Determine object kind from the bless target. @@ -433,6 +447,15 @@ public void dispatch(StorableContext c, RuntimeScalar value) { /** Emit the body of a non-reference scalar. Mirrors * {@code store_scalar} (Storable.xs L2393). */ private void writeScalar(StorableContext c, RuntimeScalar v) { + // Every fresh leaf scalar consumes a seen-tag on the read side + // (Storable.xs `retrieve_*` for SX_SCALAR / SX_BYTE / SX_INTEGER / + // SX_DOUBLE / SX_UTF8STR / SX_LSCALAR / SX_LUTF8STR / SX_UNDEF / + // SX_SV_* all call SEEN_NN). The writer must allocate the + // matching tag here so subsequent SX_OBJECT backrefs line up. + // The key is unique per emission — leaf scalars don't + // participate in identity-shared backref deduplication. + c.recordWriteSeen(new Object()); + // undef if (v.type == RuntimeScalarType.UNDEF || !v.getDefinedBoolean()) { c.writeByte(Opcodes.SX_UNDEF);
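Reviewer note on the hook-cookie change in `StorableWriter.tryEmitHook`: the sketch below illustrates why the patch moves from UTF-8 to a one-char-per-byte conversion. It is a minimal standalone demo, not PerlOnJava code; the class name `CookieBytesDemo`, its two helper methods, and the sample cookie string are all hypothetical and exist only for illustration. It assumes, as the patch comment states, that the cookie holds chars in the range 0..255.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of PerlOnJava.
final class CookieBytesDemo {

    // One byte per char for chars 0..255, the same mapping the patched
    // STRING branch relies on. Chars above 255 are not representable in
    // ISO-8859-1 and would be replaced, so this is only safe for byte strings.
    static byte[] toRawBytes(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    // The pre-patch conversion: every char in 0x80..0xFF becomes two bytes.
    static byte[] toUtf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Three "byte" chars, two of them in the high range.
        String cookie = "\u0004\u0080\u00ff";

        byte[] raw = toRawBytes(cookie);
        byte[] utf8 = toUtf8Bytes(cookie);

        System.out.println("raw length:  " + raw.length);   // 3
        System.out.println("utf8 length: " + utf8.length);  // 5
    }
}
```

The length mismatch is what corrupts the embedded stream: a reader that expects the three original bytes instead finds five, and every offset after the first high byte is shifted.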