Skip to content

fix(browse): clarify localhost bind failures in sandboxes#1664

Closed
spacegeologist wants to merge 1 commit into
garrytan:mainfrom
spacegeologist:fix-browse-startup-diagnostics-code-only
Closed

fix(browse): clarify localhost bind failures in sandboxes#1664
spacegeologist wants to merge 1 commit into
garrytan:mainfrom
spacegeologist:fix-browse-startup-diagnostics-code-only

Conversation

@spacegeologist
Copy link
Copy Markdown
Contributor

Reviewer Guide

  • browse/src/server.ts preserves the actual localhost bind failure from net.createServer().listen() and separates real EADDRINUSE port occupancy from sandbox / OS bind-denial failures such as EPERM.
  • browse/test/findport.test.ts adds regression coverage for both explicit BROWSE_PORT and random-port diagnostics, while keeping the old “port is in use” meaning for true EADDRINUSE.

Why

I have been experimenting with running gstack inside Codex and noticed that
browse can report random port exhaustion when the underlying issue is actually
a sandbox-denied 127.0.0.1 listen() call.

browse startup currently collapses every failed localhost bind attempt into a port-availability message like “No available port after 5 attempts” or “Port ... is in use.” In a sandboxed agent environment, including Codex, listen() can fail with EPERM even when the sampled port is not occupied. That sends the user and reviewer toward the wrong diagnosis: looking for port exhaustion instead of realizing that localhost binding is blocked by sandbox / OS permissions.

This patch keeps the successful path unchanged, but carries the original bind error through the diagnostic path. If the failure is EADDRINUSE, browse still reports an occupied port. If the failure is something else, the error now says that localhost binding is likely blocked and suggests allowing localhost binding, setting BROWSE_PORT to an approved port, or running from an unrestricted terminal.

Evidence

I hit this while dogfooding browse from a sandboxed Codex session. The installed gstack binary failed with the old message:

[browse] No available port after 5 attempts in range 10000-60000

After the local build, the same default sandbox scenario reports the actual class of failure:

[browse] Cannot bind localhost ports after 5 attempts in range 10000-60000. Last error: <sampled-port> (EPERM: Failed to listen at 127.0.0.1). This usually means the current sandbox or OS permissions are blocking localhost port binding, not that every sampled port is occupied. Allow localhost binding, set BROWSE_PORT to an approved port, or run browse from an unrestricted terminal.

When localhost binding is approved in the same environment, browse status becomes healthy, which confirms this is a diagnostic clarity issue rather than every random port being occupied.

Risk / Scope

  • Low runtime risk: successful port selection and server lifecycle are unchanged.
  • True EADDRINUSE still reports the existing occupied-port message.
  • This does not attempt to bypass sandbox restrictions or change permission behavior.

Verification

  • bun test browse/test/findport.test.ts
  • bun test browse/test/server-factory.test.ts
  • bun run build
  • git diff --check
  • Manual dogfood: browse/dist/browse status under the default sandbox now reports the EPERM bind-denial diagnostic instead of the misleading port-exhaustion message.

garrytan added a commit that referenced this pull request May 25, 2026
…bundle (#1682)

* fix(office-hours): #1671 — session writer was writing to the legacy file

User-visible symptom: returning /office-hours users get the same closing
pitch every visit, no matter how many times they've run the skill. The
welcome_back tier (which exists specifically to skip the pitch for
returning users) was unreachable. Live since 2026-04-18 / v1.0.0.0 on
every fresh-$HOME user.

Root cause: the v1.0.0.0 migration moved the read path to
~/.gstack/developer-profile.json but left the writer in
office-hours/SKILL.md.tmpl writing to the legacy
~/.gstack/builder-profile.jsonl. Reader and writer disagreed on storage,
so SESSION_COUNT never incremented and /office-hours always treated the
user as a first-timer.

Fix:
- bin/gstack-developer-profile: new --log-session subcommand that
  read-modify-writes developer-profile.json's sessions[] array (atomic
  mktemp+mv, signals/resources/topics aggregation, gbrain-enqueue mirror
  of gstack-timeline-log:40). Naming matches the gstack-*-log family verb.
- bin/gstack-developer-profile: do_read filters mode:"resources" entries
  when picking LAST_PROJECT/LAST_ASSIGNMENT/LAST_DESIGN_TITLE so the Phase
  6 resources auto-append doesn't clobber real-session state. Latent bug
  that was masked by the broken writer; activated by the fix.
- office-hours/SKILL.md.tmpl: lines 490 + 893 swap echo >> for --log-session.
- test/gstack-developer-profile.test.ts: +8 tests covering --log-session
  contract (regression, aggregation, dedup, validation, ts handling) plus
  the mode-filter regression. All 8 fail on main, all 8 pass with this fix.
- test/static-no-legacy-writes.test.ts: new static-grep invariant walking
  every skill dir to prevent future regressions onto the legacy file.

Affected users: stranded builder-profile.jsonl entries are not recovered
automatically by this PR. On their next /office-hours run, the first new
session lands in welcome_back; past data stays in the legacy file (still
readable by other tools during deprecation). Most pre-existing users have
only a handful of stranded sessions.

See docs/designs/FIX_1671_PROFILE_MIGRATION.md for scope decisions
(RC2/RC3 follow-ups, what was intentionally left out, and why).

Issue: #1671

* test(office-hours): refine #1671 invariant regex comment for literal-path scope

Clarifies that the WRITE_PATTERN regex catches literal-path writes only;
variable-indirected writes (FILE=...; echo >> "$FILE") are not detected.
The SKILL.md.tmpl assertions in the same suite pin the exact #1671
regression class directly; this regex is a backstop, not a flow analyzer.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(timeline): pass read filters as data

* feat(next-version): support monorepo VERSION paths via --version-path + .gstack/version-path

The workspace-aware ship queue hardcoded the VERSION file at the repo root.
In monorepos where versioning is subproject-scoped (one app inside a larger
repo), every PR's VERSION lookup 404s, the queue silently empties, and
parallel /ship sessions all bump from "current main + 1" — producing a
cascade of slot collisions.

Repro: tinas-second-brain repo. Root VERSION is absent; the real VERSION
lives at "Tinas Second Brain/health-tracker/VERSION". In one day, four
sequential collisions: 0.4.0.1 -> 0.5.0.0 -> 0.5.0.1 -> 0.5.0.2 -> 0.5.0.3.

Fix: add a --version-path flag and a repo-local .gstack/version-path
config file. Resolution priority: CLI flag > .gstack/version-path > "VERSION".
The resolved path threads through all four call sites — git show
origin/<base>:<path>, the GitHub Contents API, the GitLab files API, and
the local sibling-worktree scan — and shows up in the JSON output as
version_path so /ship and operators can see what got picked.

The previous warning "could not fetch VERSION (fork or private)" was
misleading whenever the real cause was wrong path. The new wording names
the path that 404'd and hints at the two knobs.

Backward-compatible: no flag, no config, no change in behavior.

Tests: 6 unit tests for resolveVersionPath (priority, parsing, blank /
missing / empty edge cases) + a second integration smoke that drives
--version-path end-to-end and asserts it surfaces in JSON output.

* fix(investigate): support standalone freeze hook path

* fix(browse): clarify localhost bind failures

* fix(migration): defer v1.40.0.0 done-marker until every repair succeeds (#1581)

The v1.40.0.0 migration unconditionally `touch`ed its done-marker, even
when the jq-gated `.brain-privacy-map.json` patch was skipped because jq
was missing on the user's machine. On subsequent runs, the script
short-circuited on the marker so the privacy-map repair never landed.
Federation sync then silently dropped `/plan-eng-review` test plans.

Track every failure mode via a single `incomplete` flag: jq missing,
malformed JSON, jq mutation failure, tempfile creation failure, `mv`
failure, allowlist append failure, gitattributes append failure. The
marker is written only when `incomplete=0`, so the migration runner
retries on the next /gstack-upgrade once the prerequisites are met.

* test(migration): unit tests for v1.40.0.0 deferred done-marker fix (#1581)

8 cases pinning the fix:

- Case 1 (happy path): jq present, fresh privacy-map → all three files
  patched, marker written.
- Case 2 (regression for #1581): jq missing, privacy-map present →
  marker must NOT be written. Fails against the buggy script, passes
  against the fix.
- Case 3 (recovery): jq missing, then jq restored → patch lands on
  second run.
- Case 4 (idempotency): privacy-map already has correct entry →
  no mutation, marker written.
- Case 5 (fresh-init): privacy-map file absent → allowlist + gitattrs
  patched, marker written.
- Case 6 (malformed JSON): broken privacy-map JSON → no marker, no
  mutation.
- Case 7 (jq mutation failure): fake jq returning 1 → no marker,
  tempfile cleaned up.
- Case 8 (allowlist append failure): read-only allowlist → no marker.

Tests use spawnSync('bash', [MIGRATION], …) with isolated tmpHomes.
"jq missing" sets PATH to a curated dir of symlinks to standard utils,
omitting jq; "jq mutation fails" uses an `exit 1` shim. Avoids
blanket-clearing PATH (which would hide bash/grep/etc).

* fix(brain-sync): make artifact sync work on Windows (discover-new + drain)

Automatic artifact sync was fully non-functional on Windows (Git Bash):
--discover-new enqueued nothing and the --once drain staged nothing, so
artifacts_sync_mode looked active but no artifacts ever reached the repo.
Three independent Windows-only causes in bin/gstack-brain-sync:

1. discover-new matched os.path.relpath (backslash separators on Windows)
   against the forward-slash allowlist globs, so no nested file ever matched.
   Normalized the relpath to "/".
2. discover-new enqueued via subprocess.run([gstack-brain-enqueue, rel]), but
   Windows Python cannot exec a bash-shebang script, so nothing was enqueued
   even once matched. Now appends to the queue in-process.
3. compute_paths_to_stage ends in print(p); Windows Python emits CRLF, the
   bash `read -r` keeps the trailing CR, and `git add -- "path<CR>"` matches
   nothing under `2>/dev/null || true`. Now strips the CR before staging.

The in-process enqueue mirrors gstack-brain-enqueue's contract: one atomic
O_APPEND write per record (each line < PIPE_BUF) so a parallel writer-shim
append can't interleave mid-record, and the discover cursor advances only
after the write succeeds, so a failed write retries instead of silently
recording the file as synced. Skip-list entries are separator-normalized on
both the discover and drain (compute_paths_to_stage) sides, so a backslash
.brain-skip.txt entry can't be honored at discovery yet bypassed at commit.

Adds test/brain-sync-windows-paths.test.ts (static invariants -- behavioral
spawn tests cannot run on the Windows lane, since Node/Bun cannot exec the
bin/ shebang scripts there) and wires it into windows-free-tests.yml.
Verified red->green and end-to-end on Windows 11 / Git Bash; macOS/Linux
behavior unchanged (os.sep is already "/", no CRLF, compute path logic
unchanged besides the shared skip normalization).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix: detect bun.lock (Bun v1.2+ text lockfile) in diff-scope CONFIG

gstack-diff-scope only matched the legacy binary lockfile `bun.lockb`
but not the newer text-based `bun.lock` introduced in Bun v1.2+.
Projects using current Bun versions were silently missing the
SCOPE_CONFIG signal when only the lockfile changed.

🤖 Generated with [Qoder][https://qoder.com]

* fix(ios-qa): resolve CoreDevice tunnel via devicectl + keep tunnel alive

The daemon's tunnel bootstrap used `dns.resolve6` to look up
`<device>.coredevice.local`, which fails with ESERVFAIL on macOS 26.x
(Darwin 25.x) because Node's resolve6 path goes through libresolv and
does NOT consult mDNSResponder. `dns.lookup` (getaddrinfo) does.

Even when resolution works, CoreDevice in Xcode 26 only holds the
USB tunnel up while a devicectl command is in-flight, so the IPv6 ULA
becomes unroutable within ~10-15s of idle and subsequent proxy
requests time out.

Two-part fix:

  1. Resolution order is now (a) `xcrun devicectl device info details
     --json-output` to read `result.connectionProperties.tunnelIPAddress`
     directly, (b) mDNS via `dns.lookup`, (c) legacy `dns.resolve6` as
     a last-ditch fallback.
  2. After a successful bootstrap the daemon spawns a periodic
     `devicectl device info details` (~5s) to keep the tunnel session
     alive. Cleaned up on SIGINT/SIGTERM/exit.

Adds tests for `getDeviceTunnelIPv6FromDevicectl`, the
`resolveTunnelIPv6` fallback chain, and `startTunnelKeepalive`.
Existing bootstrap tests updated to include the new
`device info details` spawn step.

Tested against: iPhone 12 Pro on iOS 26.x via Mac Mini M-series
running macOS Sequoia 15.x / Darwin 25.3.0.

* chore(release): v1.44.1.0 — 9-PR community fix wave (post-windhoek paper-cut)

Bump VERSION + CHANGELOG entry. Wave covers /office-hours session
counter, iOS QA macOS 26 tunnels, Windows brain-sync, browse server
bind diagnostics, monorepo VERSION layouts, /investigate freeze hook
on standalone installs, gstack-timeline-read quote injection,
v1.40.0.0 migration on jq-less machines, bun.lock detection.

9 community PRs: #1676 #1635 #1627 #1648 #1664 #1589 #1672 #1649 #1673
9 contributors credited: @pryow @jbetala7 @cfeddersen @Gujiassh
@spacegeologist @stedfn @daveowenatl @hiSandog @sternryan
4 issues closed: #1671 #1677 #1634 #1647 #1581

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Rook <rook@robomovers.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: Jayesh Betala <jayesh.betala7@gmail.com>
Co-authored-by: Christoph <astaran@herr-der-ringe-film.de>
Co-authored-by: gujishh <baiaoshh@163.com>
Co-authored-by: zhengzuo0-ai <zheng.zuo0@gmail.com>
Co-authored-by: Stefan Neamtu <stefan.neamtu@nearone.org>
Co-authored-by: Dave Owen <daveowen66@gmail.com>
Co-authored-by: 陈家名 <chenjiaming@kezaihui.com>
Co-authored-by: Ryan Stern <206953196+sternryan@users.noreply.github.com>
@garrytan
Copy link
Copy Markdown
Owner

Landed in 64f9aaf as part of v1.44.1.0 wave (PR #1682). Thanks for the fix, @spacegeologist.

@garrytan garrytan closed this May 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants