Skip to content

Iterative GC mark to fix iOS stack overflow on deep graphs (#3136)#4980

Merged
shai-almog merged 7 commits into
masterfrom
fix/3136-iterative-gc-mark
May 19, 2026
Merged

Iterative GC mark to fix iOS stack overflow on deep graphs (#3136)#4980
shai-almog merged 7 commits into
masterfrom
fix/3136-iterative-gc-mark

Conversation

@shai-almog
Copy link
Copy Markdown
Collaborator

@shai-almog shai-almog commented May 19, 2026

Summary

  • Replace recursive gcMarkObject in ParparVM with an iterative worklist; deep reachable graphs (a few thousand references) no longer SIGBUS the GC thread on iOS, which gives secondary pthreads only ~512KB of stack.
  • Worklist size is overridable at compile time via -DCN1_GC_MARK_WORKLIST_SIZE=N (default 4096 entries, ~64KB on 64-bit).
  • On worklist overflow the object stays marked (sweep preserves it) and its scan is deferred to a heap-rescan pass; a progress flag terminates the rescan when the reachable set is closed under "marked," so closed sets larger than the worklist don't spin forever.
  • The bytecode-translator-emitted __GC_MARK_<cls> per-class functions are unchanged — they already enumerate children via gcMarkObject, which now means "push onto worklist."
  • Second commit: fix two pre-existing inefficiencies in the force-mode dedupe (recursionKey was never advanced per cycle; the unmarked fallthrough didn't write __codenameOneReferenceCount).

Why

Per issue #3136, the recursive mark builds one C stack frame per Java reference traversed. A Link { Link next; } chain of just a few thousand nodes blew the 512KB stack on iPad and crashed the GC thread with no recoverable info. Android wasn't affected because ART uses its own mark phase; the JavaSE simulator wasn't either because HotSpot's GC is independent. The reporter's torture test (Dtest.java in the issue) is contrived, but any reasonably deep reachable chain would hit the same wall.

Approach

  • gcMarkObject is now O(1): set the mark bit, push (obj, force) onto the fixed worklist, return.
  • gcMarkDrain (new) is called at the end of codenameOneGCMark and pops the worklist, invoking each entry's class mark function — which calls gcMarkObject for each reference field, recursively in Java terms but iteratively in C.
  • Worst-case worklist usage is bounded by the fan-out frontier, not graph depth. Linear chains use O(1) entries; balanced trees use O(depth). Only wide arrays-of-objects approach the limit.
  • Overflow strategy: if the worklist fills, the failing push sets a flag; the object is still marked (so sweep won't reclaim it), but its fields aren't visited yet. After the worklist drains, walk allObjectsInHeap and re-push every marked object whose mark function exists; their re-invocation visits their fields and pushes any unmarked children. Repeat until either the worklist drains without overflow or a rescan + drain marks nothing new (closed set / fixed point).
  • The progress flag (gcMarkFoundUnmarkedChildInPass, set inside gcMarkObject only when transitioning unmarked→marked) is what prevents the failure mode where the marked set is larger than the worklist and a naive rescan would keep re-raising overflow without progress.

recursionKey fix (second commit)

Two related pre-existing bugs in the force-mode dedupe:

  1. recursionKey was declared int recursionKey = 1; and never incremented anywhere. The intended invariant — "refcount == recursionKey means visited-in-force-mode this cycle" — only worked by coincidence with the gcMalloc default of refcount=1. Across cycles, leftover refcount=1 state masked first-visit checks. Fixed by recursionKey++ alongside currentGcMarkValue++ at the top of each GC mark cycle.
  2. The unmarked fallthrough branch never wrote __codenameOneReferenceCount, so on a second force visit the dedupe-check failed for objects with non-default refcounts (notably pinned constant pool entries with refcount=999999), causing one redundant subtree re-traversal per cycle. Fixed by writing obj->__codenameOneReferenceCount = recursionKey in the unmarked branch when force == JAVA_TRUE.

Trade-offs vs. the recursive version

  • Memory: the worklist preallocates ~64KB once at startup. The recursive version used ~150 bytes of stack per node in flight — for a 50-deep container hierarchy that's ~7.5KB of stack, so the iterative version is heavier at the trivial end and dramatically lighter at the deep end.
  • CPU: one heap push/pop per traversed reference vs. one C function-call prologue/epilogue. Roughly a wash; the iterative version avoids the cache-line dirty cost of register save/restore. Per-array-element processing is unchanged.
  • Concurrency: the GC remains concurrent. Mark stays read-only on the live object graph (the worklist lives in static storage, not pointer-reversal). Snapshot semantics relative to the existing recursive version: slightly weaker because children of a popped object are scanned lazily rather than during root scanning, but the new-object grace period (sweep ignores objects with __codenameOneGcMark == -1) absorbs the same races as before.

Test plan

  • Generated an Xcode project via the cn1app-archetype integration test (maven/integration-tests/cn1app-archetype-test.sh).
  • Compiled the project with xcodebuild -sdk iphonesimulator -arch arm64: BUILD SUCCEEDED with no new warnings from cn1_globals.m. The two pre-existing warnings (-Wformat at line 665 and -Wpointer-to-int-cast at line 1025) are unrelated to this change.
  • Run the Dtest reproducer from issue crash in garbage collector #3136 on iOS device — should run until OOM rather than crashing in 3s.
  • Spot-check memory usage on a normal app to verify the 64KB worklist allocation doesn't show up as a measurable regression.

Fixes #3136

🤖 Generated with Claude Code

The mark phase of the ParparVM GC recursed through reference fields, building
one C stack frame per Java reference traversed. iOS gives secondary pthreads
a ~512KB stack, so deep reachable graphs (linked lists, parse trees, deeply
nested containers of a few thousand references) would SIGBUS the GC thread.

Switch gcMarkObject to an explicit fixed-size worklist:
- gcMarkObject sets the mark bit and pushes; it no longer recurses.
- gcMarkDrain pops entries and invokes their per-class mark function, which
  pushes any unmarked children onto the same worklist.
- On worklist overflow, the offending object stays marked (sweep preserves it)
  but its scan is deferred to a heap-rescan pass that re-invokes mark functions
  on already-marked objects to pick up children skipped on overflow. A progress
  flag terminates the rescan loop when the reachable set is closed under
  "marked," so closed sets larger than the worklist don't spin forever.

CN1_GC_MARK_WORKLIST_SIZE (default 4096 entries ~ 64KB on 64-bit) is overridable
via -D for tuning. Linear chains and balanced trees use O(1) and O(depth) of
worklist respectively; only multi-million-element object arrays approach the
limit, and they fall back to the rescan slow path rather than crashing.

The per-class __GC_MARK_<cls> functions emitted by the bytecode translator are
unchanged -- they already invoke gcMarkObject for each reference field, which
is exactly the "enumerate children onto the worklist" step the iterative
algorithm needs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@shai-almog
Copy link
Copy Markdown
Collaborator Author

shai-almog commented May 19, 2026

Compared 20 screenshots: 20 matched.
✅ JavaScript-port screenshot tests passed.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 19, 2026

✅ Continuous Quality Report

Test & Coverage

Static Analysis

  • SpotBugs [Report archive]
    • ByteCodeTranslator: 0 findings (no issues)
    • android: 0 findings (no issues)
    • codenameone-maven-plugin: 0 findings (no issues)
    • core-unittests: 0 findings (no issues)
    • ios: 0 findings (no issues)
  • PMD: 0 findings (no issues) [Report archive]
  • Checkstyle: 0 findings (no issues) [Report archive]

Generated automatically by the PR CI workflow.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 19, 2026

✅ ByteCodeTranslator Quality Report

Test & Coverage

  • Tests: 644 total, 0 failed, 2 skipped

Benchmark Results

  • Execution Time: 10656 ms

  • Hotspots (Top 20 sampled methods):

    • 21.80% java.lang.String.indexOf (407 samples)
    • 18.32% java.util.ArrayList.indexOf (342 samples)
    • 18.21% com.codename1.tools.translator.Parser.isMethodUsed (340 samples)
    • 4.34% java.lang.Object.hashCode (81 samples)
    • 4.02% com.codename1.tools.translator.ByteCodeClass.markDependent (75 samples)
    • 3.05% com.codename1.tools.translator.Parser.addToConstantPool (57 samples)
    • 2.73% java.lang.System.identityHashCode (51 samples)
    • 2.04% com.codename1.tools.translator.ByteCodeClass.calcUsedByNative (38 samples)
    • 1.98% com.codename1.tools.translator.ByteCodeClass.updateAllDependencies (37 samples)
    • 1.77% com.codename1.tools.translator.BytecodeMethod.optimize (33 samples)
    • 1.61% com.codename1.tools.translator.BytecodeMethod.appendMethodSignatureSuffixFromDesc (30 samples)
    • 1.50% com.codename1.tools.translator.Parser.generateClassAndMethodIndexHeader (28 samples)
    • 1.02% com.codename1.tools.translator.Parser.cullMethods (19 samples)
    • 0.96% java.lang.StringBuilder.append (18 samples)
    • 0.64% com.codename1.tools.translator.BytecodeMethod.appendMethodC (12 samples)
    • 0.59% com.codename1.tools.translator.BytecodeMethod.appendCMethodPrefix (11 samples)
    • 0.59% com.codename1.tools.translator.Parser.writeOutput (11 samples)
    • 0.59% sun.nio.fs.UnixNativeDispatcher.open0 (11 samples)
    • 0.59% java.lang.StringCoding.encode (11 samples)
    • 0.54% java.util.TreeMap.getEntry (10 samples)
  • ⚠️ Coverage report not generated.

Static Analysis

  • ✅ SpotBugs: no findings (report was not generated by the build).
  • ⚠️ PMD report not generated.
  • ⚠️ Checkstyle report not generated.

Generated automatically by the PR CI workflow.

shai-almog and others added 4 commits May 19, 2026 06:16
Two related fixes to the force-mode dedupe in gcMarkObject:

1. `recursionKey` was declared but never incremented, so refcount==recursionKey
   state persisted across GC cycles. The dedupe-skip in the already-marked
   force branch only worked by coincidence -- recursionKey stayed at 1, which
   happens to be the gcMalloc default for __codenameOneReferenceCount.
   Incrementing per cycle alongside currentGcMarkValue gives the key correct
   per-cycle semantics.

2. The unmarked fallthrough didn't write __codenameOneReferenceCount, so on a
   second force visit within the same cycle the dedupe check would miss for
   any object whose refcount had been set elsewhere (e.g. pinned constant pool
   objects with refcount=999999), forcing one redundant re-traversal of the
   subtree before the third visit caught up.

Both are minor pre-existing inefficiencies rather than soundness bugs (the
mark/sweep is still correct), but worth cleaning up while the area is touched.
Verified end-to-end by generating an Xcode project via the cn1app archetype
integration test and compiling it for iphonesimulator/arm64 -- BUILD SUCCEEDED
with no new warnings from cn1_globals.m.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The first cut of the iterative mark deferred all field scans to a single
drain at the end of codenameOneGCMark, after every mutator thread had
already been unblocked. That broke a concurrency invariant the recursive
implementation had relied on: by the time a thread was unblocked, every
object transitively reachable from its stack roots was marked, so any new
local the thread captured after unblock could only point at already-marked
memory.

With deferred draining, a thread could unblock while its stack roots were
still grey (marked, but unscanned). The mutator could then:

  oldChild = root.field;   // capture into a new local (invisible to GC)
  root.field = null;        // strand the original reference

When the GC eventually popped `root` from the worklist, it scanned a now-
nulled field and never marked `oldChild`. Sweep reclaimed it. The next
mutator dereference into `oldChild` hit freed memory -- which manifested
in CI as a hang in `MainScreenScreenshotTest` (waiting on a lock object
that had been freed under the lock holder).

Fix: drain the worklist before unblocking each thread. Each thread's
reachable subgraph is now fully marked while the thread is paused, exactly
as the recursive version did. Total work is unchanged (idempotent --
already-marked objects short-circuit in gcMarkObject); only the
distribution shifts. Verified against the iOS UI screenshot suite locally
-- the same MainScreenScreenshotTest that hung in CI now finishes in ~2s
and the suite progresses through 30+ more tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous commit advanced recursionKey per cycle and wrote
__codenameOneReferenceCount in the unmarked fallthrough when force=true.
That changed force-mode dedupe semantics in subtle ways: it converted the
"refcount stayed at 1 by coincidence with gcMalloc's default" into an
explicit per-cycle invariant, and along the way made every force re-visit
on an object whose refcount didn't match recursionKey trigger a re-push.

The bug fix in this branch was the iterative mark and the per-thread
drain; the recursionKey rewrite was a cleanup the user asked for while we
were in the area. Keeping the scope tighter to just the GC algorithm
change makes the diff easier to reason about and removes a potential
source of behavioral drift if anything in the codebase implicitly relied
on the old "refcount == 1 == recursionKey" coincidence (e.g. mark-function
side effects we haven't audited).

Net effect of this revert: gcMarkObject's force-mode behavior matches
exactly what shipped before the iterative-mark change. Only the
recursive-vs-worklist mark machinery and the per-thread drain remain.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The per-thread drain inside codenameOneGCMark consistently hangs CI on
KotlinUiTest (right after `suite starting` is logged, before the first
screenshot is captured). Reverting it leaves only the final drain at the
end of codenameOneGCMark, matching the original cut of the iterative-mark
change.

That earlier cut had a different failure mode (MainScreenScreenshotTest
hung, KotlinUiTest passed), which I attributed to a snapshot-at-the-
beginning race -- mutator captures a grey field reference into a new
local, nulls the field, GC misses, sweep frees. The per-thread drain
was meant to close that window by transitively marking each thread's
reachable graph before unblocking, the way the recursive implementation
did. But the iOS UI screenshot suite consistently times out at
KotlinUiTest with the drain in place, even after I ruled out the
recursionKey cleanup and verified the inner drain logic terminates. The
drain isn't doing the thing I thought it was, or it's interacting with
some startup-time behavior I haven't tracked down yet.

Going back to the simpler structure and re-investigating the original
MainScreenScreenshotTest hang from there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@shai-almog
Copy link
Copy Markdown
Collaborator Author

shai-almog commented May 19, 2026

Compared 110 screenshots: 110 matched.
✅ Native iOS Metal screenshot tests passed.

Benchmark Results

  • VM Translation Time: 0 seconds
  • Compilation Time: 365 seconds

Build and Run Timing

Metric Duration
Simulator Boot 84000 ms
Simulator Boot (Run) 1000 ms
App Install 17000 ms
App Launch 14000 ms
Test Execution 332000 ms

Detailed Performance Metrics

Metric Duration
Base64 payload size 8192 bytes
Base64 benchmark iterations 6000
Base64 native encode 1745.000 ms
Base64 CN1 encode 2358.000 ms
Base64 encode ratio (CN1/native) 1.351x (35.1% slower)
Base64 native decode 1134.000 ms
Base64 CN1 decode 1587.000 ms
Base64 decode ratio (CN1/native) 1.399x (39.9% slower)
Base64 SIMD encode 646.000 ms
Base64 encode ratio (SIMD/native) 0.370x (63.0% faster)
Base64 encode ratio (SIMD/CN1) 0.274x (72.6% faster)
Base64 SIMD decode 482.000 ms
Base64 decode ratio (SIMD/native) 0.425x (57.5% faster)
Base64 decode ratio (SIMD/CN1) 0.304x (69.6% faster)
Image encode benchmark iterations 100
Image createMask (SIMD off) 216.000 ms
Image createMask (SIMD on) 13.000 ms
Image createMask ratio (SIMD on/off) 0.060x (94.0% faster)
Image applyMask (SIMD off) 709.000 ms
Image applyMask (SIMD on) 673.000 ms
Image applyMask ratio (SIMD on/off) 0.949x (5.1% faster)
Image modifyAlpha (SIMD off) 669.000 ms
Image modifyAlpha (SIMD on) 111.000 ms
Image modifyAlpha ratio (SIMD on/off) 0.166x (83.4% faster)
Image modifyAlpha removeColor (SIMD off) 721.000 ms
Image modifyAlpha removeColor (SIMD on) 131.000 ms
Image modifyAlpha removeColor ratio (SIMD on/off) 0.182x (81.8% faster)
Image PNG encode (SIMD off) 2250.000 ms
Image PNG encode (SIMD on) 1320.000 ms
Image PNG encode ratio (SIMD on/off) 0.587x (41.3% faster)
Image JPEG encode 694.000 ms

The first iterative-mark commit set CN1_GC_MARK_WORKLIST_SIZE to 4096
entries (~64KB) and used a rescan path for the overflow case where the
worklist filled before everything reachable was pushed. The rescan walked
allObjectsInHeap pushing every marked object whose mark function hadn't
been called yet -- but it restarted from iter=0 on every batch. With a
heap whose marked-object count exceeds the worklist size, the same first
~WORKLIST_SIZE objects got pushed and processed over and over while
objects at higher indices were starved. Their mark functions never ran,
their children stayed unmarked, sweep reclaimed reachable memory, and
the app deadlocked or crashed at the first dereference.

HelloCodenameOne's constant pool alone is ~15K entries -- well past the
4096 threshold -- so the rescan path was active and broken on every GC
cycle from app startup. That's the silent hang we saw at KotlinUiTest /
MainScreenScreenshotTest, depending on which test happened to dereference
a freed object first.

Two fixes:

1. Bump the default worklist to 65536 entries (~1MB on 64-bit). Sized so
   typical app heaps (constant pool, statics, UI graph) fit comfortably
   without ever triggering the rescan slow path. Still overridable via
   -DCN1_GC_MARK_WORKLIST_SIZE for memory-constrained scenarios.

2. Track the rescan position in a `rescanCursor` local that resumes
   across drain/rescan batches instead of resetting to 0. The cursor only
   restarts at 0 after a full sweep through the heap finishes AND the
   subsequent drain marked something new (which means there may be marked
   objects past indices we already visited this round). The termination
   condition now requires cursor >= total and no pending overflow, so we
   can't exit while there are still marked-but-unscanned objects in the
   tail of the heap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@shai-almog
Copy link
Copy Markdown
Collaborator Author

shai-almog commented May 19, 2026

Compared 110 screenshots: 110 matched.
✅ Native iOS screenshot tests passed.

Benchmark Results

  • VM Translation Time: 0 seconds
  • Compilation Time: 194 seconds

Build and Run Timing

Metric Duration
Simulator Boot 67000 ms
Simulator Boot (Run) 0 ms
App Install 11000 ms
App Launch 7000 ms
Test Execution 304000 ms

Detailed Performance Metrics

Metric Duration
Base64 payload size 8192 bytes
Base64 benchmark iterations 6000
Base64 native encode 1277.000 ms
Base64 CN1 encode 1400.000 ms
Base64 encode ratio (CN1/native) 1.096x (9.6% slower)
Base64 native decode 779.000 ms
Base64 CN1 decode 1213.000 ms
Base64 decode ratio (CN1/native) 1.557x (55.7% slower)
Base64 SIMD encode 473.000 ms
Base64 encode ratio (SIMD/native) 0.370x (63.0% faster)
Base64 encode ratio (SIMD/CN1) 0.338x (66.2% faster)
Base64 SIMD decode 393.000 ms
Base64 decode ratio (SIMD/native) 0.504x (49.6% faster)
Base64 decode ratio (SIMD/CN1) 0.324x (67.6% faster)
Image encode benchmark iterations 100
Image createMask (SIMD off) 91.000 ms
Image createMask (SIMD on) 13.000 ms
Image createMask ratio (SIMD on/off) 0.143x (85.7% faster)
Image applyMask (SIMD off) 155.000 ms
Image applyMask (SIMD on) 82.000 ms
Image applyMask ratio (SIMD on/off) 0.529x (47.1% faster)
Image modifyAlpha (SIMD off) 165.000 ms
Image modifyAlpha (SIMD on) 70.000 ms
Image modifyAlpha ratio (SIMD on/off) 0.424x (57.6% faster)
Image modifyAlpha removeColor (SIMD off) 203.000 ms
Image modifyAlpha removeColor (SIMD on) 93.000 ms
Image modifyAlpha removeColor ratio (SIMD on/off) 0.458x (54.2% faster)
Image PNG encode (SIMD off) 1145.000 ms
Image PNG encode (SIMD on) 927.000 ms
Image PNG encode ratio (SIMD on/off) 0.810x (19.0% faster)
Image JPEG encode 545.000 ms

Two related changes:

1. Re-add the per-thread drain inside codenameOneGCMark's per-thread
   loop, right after markStatics and before t->threadBlockedByGC=false.
   Without this drain, an unblocked mutator can read a still-grey
   reference field into a new local and null the original field; the
   captured object never gets visited by the final drain at the end of
   codenameOneGCMark, sweep reclaims it, and a later monitorEnter on
   its freed pthread_mutex_t silently deadlocks. That's the Metal-job
   hang we saw at SpanLabelThemeScreenshotTest's finish() callback,
   right where initFirstTheme allocates heavily and triggers GC. The
   recursive baseline didn't have this race because it transitively
   marked everything before unblocking.

   Earlier per-thread drain attempts hung at app startup, but that was
   the overflow-rescan cursor bug (rescan restarting from 0 each batch
   while HelloCodenameOne's 14878-entry constant pool overflowed the
   4096 worklist). With both the cursor fix and the bumped 65536-entry
   default worklist in the previous commit, this drain runs to
   completion in O(reachable) per thread.

2. Drop the workflow's `continue-on-error: true` on build-ios-metal and
   the stale comments referencing METAL_PORT_STATUS.md (which was
   deleted in the previous commit). The Metal port works; failures
   there should block the PR like the GL job's failures do.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@shai-almog shai-almog merged commit 9b92d96 into master May 19, 2026
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

crash in garbage collector

1 participant