Iterative GC mark to fix iOS stack overflow on deep graphs (#3136)#4980
Merged
Conversation
The mark phase of the ParparVM GC recursed through reference fields, building one C stack frame per Java reference traversed. iOS gives secondary pthreads a ~512KB stack, so deep reachable graphs (linked lists, parse trees, deeply nested containers of a few thousand references) would SIGBUS the GC thread. Switch gcMarkObject to an explicit fixed-size worklist: - gcMarkObject sets the mark bit and pushes; it no longer recurses. - gcMarkDrain pops entries and invokes their per-class mark function, which pushes any unmarked children onto the same worklist. - On worklist overflow, the offending object stays marked (sweep preserves it) but its scan is deferred to a heap-rescan pass that re-invokes mark functions on already-marked objects to pick up children skipped on overflow. A progress flag terminates the rescan loop when the reachable set is closed under "marked," so closed sets larger than the worklist don't spin forever. CN1_GC_MARK_WORKLIST_SIZE (default 4096 entries ~ 64KB on 64-bit) is overridable via -D for tuning. Linear chains and balanced trees use O(1) and O(depth) of worklist respectively; only multi-million-element object arrays approach the limit, and they fall back to the rescan slow path rather than crashing. The per-class __GC_MARK_<cls> functions emitted by the bytecode translator are unchanged -- they already invoke gcMarkObject for each reference field, which is exactly the "enumerate children onto the worklist" step the iterative algorithm needs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collaborator
Author
|
Compared 20 screenshots: 20 matched. |
Contributor
Contributor
✅ ByteCodeTranslator Quality ReportTest & Coverage
Benchmark Results
Static Analysis
Generated automatically by the PR CI workflow. |
Two related fixes to the force-mode dedupe in gcMarkObject: 1. `recursionKey` was declared but never incremented, so refcount==recursionKey state persisted across GC cycles. The dedupe-skip in the already-marked force branch only worked by coincidence -- recursionKey stayed at 1, which happens to be the gcMalloc default for __codenameOneReferenceCount. Incrementing per cycle alongside currentGcMarkValue gives the key correct per-cycle semantics. 2. The unmarked fallthrough didn't write __codenameOneReferenceCount, so on a second force visit within the same cycle the dedupe check would miss for any object whose refcount had been set elsewhere (e.g. pinned constant pool objects with refcount=999999), forcing one redundant re-traversal of the subtree before the third visit caught up. Both are minor pre-existing inefficiencies rather than soundness bugs (the mark/sweep is still correct), but worth cleaning up while the area is touched. Verified end-to-end by generating an Xcode project via the cn1app archetype integration test and compiling it for iphonesimulator/arm64 -- BUILD SUCCEEDED with no new warnings from cn1_globals.m. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The first cut of the iterative mark deferred all field scans to a single drain at the end of codenameOneGCMark, after every mutator thread had already been unblocked. That broke a concurrency invariant the recursive implementation had relied on: by the time a thread was unblocked, every object transitively reachable from its stack roots was marked, so any new local the thread captured after unblock could only point at already-marked memory. With deferred draining, a thread could unblock while its stack roots were still grey (marked, but unscanned). The mutator could then: oldChild = root.field; // capture into a new local (invisible to GC) root.field = null; // strand the original reference When the GC eventually popped `root` from the worklist, it scanned a now- nulled field and never marked `oldChild`. Sweep reclaimed it. The next mutator dereference into `oldChild` hit freed memory -- which manifested in CI as a hang in `MainScreenScreenshotTest` (waiting on a lock object that had been freed under the lock holder). Fix: drain the worklist before unblocking each thread. Each thread's reachable subgraph is now fully marked while the thread is paused, exactly as the recursive version did. Total work is unchanged (idempotent -- already-marked objects short-circuit in gcMarkObject); only the distribution shifts. Verified against the iOS UI screenshot suite locally -- the same MainScreenScreenshotTest that hung in CI now finishes in ~2s and the suite progresses through 30+ more tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous commit advanced recursionKey per cycle and wrote __codenameOneReferenceCount in the unmarked fallthrough when force=true. That changed force-mode dedupe semantics in subtle ways: it converted the "refcount stayed at 1 by coincidence with gcMalloc's default" into an explicit per-cycle invariant, and along the way made every force re-visit on an object whose refcount didn't match recursionKey trigger a re-push. The bug fix in this branch was the iterative mark and the per-thread drain; the recursionKey rewrite was a cleanup the user asked for while we were in the area. Keeping the scope tighter to just the GC algorithm change makes the diff easier to reason about and removes a potential source of behavioral drift if anything in the codebase implicitly relied on the old "refcount == 1 == recursionKey" coincidence (e.g. mark-function side effects we haven't audited). Net effect of this revert: gcMarkObject's force-mode behavior matches exactly what shipped before the iterative-mark change. Only the recursive-vs-worklist mark machinery and the per-thread drain remain. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The per-thread drain inside codenameOneGCMark consistently hangs CI on KotlinUiTest (right after `suite starting` is logged, before the first screenshot is captured). Reverting it leaves only the final drain at the end of codenameOneGCMark, matching the original cut of the iterative-mark change. That earlier cut had a different failure mode (MainScreenScreenshotTest hung, KotlinUiTest passed), which I attributed to a snapshot-at-the- beginning race -- mutator captures a grey field reference into a new local, nulls the field, GC misses, sweep frees. The per-thread drain was meant to close that window by transitively marking each thread's reachable graph before unblocking, the way the recursive implementation did. But the iOS UI screenshot suite consistently times out at KotlinUiTest with the drain in place, even after I ruled out the recursionKey cleanup and verified the inner drain logic terminates. The drain isn't doing the thing I thought it was, or it's interacting with some startup-time behavior I haven't tracked down yet. Going back to the simpler structure and re-investigating the original MainScreenScreenshotTest hang from there. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collaborator
Author
|
Compared 110 screenshots: 110 matched. Benchmark Results
Build and Run Timing
Detailed Performance Metrics
|
The first iterative-mark commit set CN1_GC_MARK_WORKLIST_SIZE to 4096 entries (~64KB) and used a rescan path for the overflow case where the worklist filled before everything reachable was pushed. The rescan walked allObjectsInHeap pushing every marked object whose mark function hadn't been called yet -- but it restarted from iter=0 on every batch. With a heap whose marked-object count exceeds the worklist size, the same first ~WORKLIST_SIZE objects got pushed and processed over and over while objects at higher indices were starved. Their mark functions never ran, their children stayed unmarked, sweep reclaimed reachable memory, and the app deadlocked or crashed at the first dereference. HelloCodenameOne's constant pool alone is ~15K entries -- well past the 4096 threshold -- so the rescan path was active and broken on every GC cycle from app startup. That's the silent hang we saw at KotlinUiTest / MainScreenScreenshotTest, depending on which test happened to dereference a freed object first. Two fixes: 1. Bump the default worklist to 65536 entries (~1MB on 64-bit). Sized so typical app heaps (constant pool, statics, UI graph) fit comfortably without ever triggering the rescan slow path. Still overridable via -DCN1_GC_MARK_WORKLIST_SIZE for memory-constrained scenarios. 2. Track the rescan position in a `rescanCursor` local that resumes across drain/rescan batches instead of resetting to 0. The cursor only restarts at 0 after a full sweep through the heap finishes AND the subsequent drain marked something new (which means there may be marked objects past indices we already visited this round). The termination condition now requires cursor >= total and no pending overflow, so we can't exit while there are still marked-but-unscanned objects in the tail of the heap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collaborator
Author
|
Compared 110 screenshots: 110 matched. Benchmark Results
Build and Run Timing
Detailed Performance Metrics
|
Two related changes: 1. Re-add the per-thread drain inside codenameOneGCMark's per-thread loop, right after markStatics and before t->threadBlockedByGC=false. Without this drain, an unblocked mutator can read a still-grey reference field into a new local and null the original field; the captured object never gets visited by the final drain at the end of codenameOneGCMark, sweep reclaims it, and a later monitorEnter on its freed pthread_mutex_t silently deadlocks. That's the Metal-job hang we saw at SpanLabelThemeScreenshotTest's finish() callback, right where initFirstTheme allocates heavily and triggers GC. The recursive baseline didn't have this race because it transitively marked everything before unblocking. Earlier per-thread drain attempts hung at app startup, but that was the overflow-rescan cursor bug (rescan restarting from 0 each batch while HelloCodenameOne's 14878-entry constant pool overflowed the 4096 worklist). With both the cursor fix and the bumped 65536-entry default worklist in the previous commit, this drain runs to completion in O(reachable) per thread. 2. Drop the workflow's `continue-on-error: true` on build-ios-metal and the stale comments referencing METAL_PORT_STATUS.md (which was deleted in the previous commit). The Metal port works; failures there should block the PR like the GL job's failures do. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
gcMarkObjectin ParparVM with an iterative worklist; deep reachable graphs (a few thousand references) no longer SIGBUS the GC thread on iOS, which gives secondary pthreads only ~512KB of stack.-DCN1_GC_MARK_WORKLIST_SIZE=N(default 4096 entries, ~64KB on 64-bit).__GC_MARK_<cls>per-class functions are unchanged — they already enumerate children viagcMarkObject, which now means "push onto worklist."recursionKeywas never advanced per cycle; the unmarked fallthrough didn't write__codenameOneReferenceCount).Why
Per issue #3136, the recursive mark builds one C stack frame per Java reference traversed. A
Link { Link next; }chain of just a few thousand nodes blew the 512KB stack on iPad and crashed the GC thread with no recoverable info. Android wasn't affected because ART uses its own mark phase; the JavaSE simulator wasn't either because HotSpot's GC is independent. The reporter's torture test (Dtest.java in the issue) is contrived, but any reasonably deep reachable chain would hit the same wall.Approach
gcMarkObjectis now O(1): set the mark bit, push(obj, force)onto the fixed worklist, return.gcMarkDrain(new) is called at the end ofcodenameOneGCMarkand pops the worklist, invoking each entry's class mark function — which callsgcMarkObjectfor each reference field, recursively in Java terms but iteratively in C.allObjectsInHeapand re-push every marked object whose mark function exists; their re-invocation visits their fields and pushes any unmarked children. Repeat until either the worklist drains without overflow or a rescan + drain marks nothing new (closed set / fixed point).gcMarkFoundUnmarkedChildInPass, set insidegcMarkObjectonly when transitioning unmarked→marked) is what prevents the failure mode where the marked set is larger than the worklist and a naive rescan would keep re-raising overflow without progress.recursionKey fix (second commit)
Two related pre-existing bugs in the force-mode dedupe:
recursionKeywas declaredint recursionKey = 1;and never incremented anywhere. The intended invariant — "refcount == recursionKey means visited-in-force-mode this cycle" — only worked by coincidence with the gcMalloc default ofrefcount=1. Across cycles, leftover refcount=1 state masked first-visit checks. Fixed byrecursionKey++alongsidecurrentGcMarkValue++at the top of each GC mark cycle.__codenameOneReferenceCount, so on a second force visit the dedupe-check failed for objects with non-default refcounts (notably pinned constant pool entries withrefcount=999999), causing one redundant subtree re-traversal per cycle. Fixed by writingobj->__codenameOneReferenceCount = recursionKeyin the unmarked branch whenforce == JAVA_TRUE.Trade-offs vs. the recursive version
__codenameOneGcMark == -1) absorbs the same races as before.Test plan
cn1app-archetypeintegration test (maven/integration-tests/cn1app-archetype-test.sh).xcodebuild -sdk iphonesimulator -arch arm64: BUILD SUCCEEDED with no new warnings fromcn1_globals.m. The two pre-existing warnings (-Wformatat line 665 and-Wpointer-to-int-castat line 1025) are unrelated to this change.Fixes #3136
🤖 Generated with Claude Code