docs: add runner port-forward staleness keepalive (stop-gap) guide #126

Open
weicao wants to merge 1 commit into main from feature/addon-runner-portforward-staleness-keepalive

weicao (Contributor) commented May 13, 2026

Summary

New methodology doc addon-runner-portforward-staleness-keepalive-guide.md covering workstation-side kubectl port-forward silent TCP staleness (process alive, stream dead) and a lightweight /version probe + auto-restart stop-gap monitor. The doc explicitly frames the keepalive as transitional; preferred structural fix is addon-runner-incluster-vcluster-access-pattern-guide.md.

Body (generic methodology, version-agnostic, no engine binding):

  • §1: failure shape (PID alive but TCP stream dead, kubectl returns EOF / Client.Timeout); runner cannot distinguish from remote outage
  • §2: 6 hard rules — monitor only probes + restarts (no proxy, no business-request retry); INTERVAL and FAIL_THRESHOLD separately tunable, both non-extreme; precise pkill -f pattern that won't kill other port-forwards on workstation; monitor liveness logged; framed as runner-harness only (NOT product / addon / KB / vcluster fix evidence); when in-cluster runner pattern is feasible, prefer it and retire keepalive
  • §3: 5-point PR review checklist
  • §4: 3 anti-pattern vs correct-pattern pairs
  • §5: relation to addon-runner-incluster-vcluster-access-pattern-guide.md, addon-runner-openapi-schema-fetch-brittleness-guide.md, addon-test-runner-cadence-discipline-guide.md, addon-evidence-discipline-guide.md
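
As a rough illustration of the monitor shape that §2 describes, the sketch below keeps the decision logic (counting consecutive probe failures against FAIL_THRESHOLD) apart from the side effects (the probe itself and the restart). The names `KeepaliveMonitor`, `probe`, `restart`, and the default values are hypothetical and not taken from the guide. In a real harness, `probe` might issue a GET against the forwarded API server's /version path and `restart` might kill and relaunch the one `kubectl port-forward` process; both are injected here so the sketch stays self-contained.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class KeepaliveMonitor:
    """Probe-and-restart only: no proxying, no retry of business requests."""

    probe: Callable[[], bool]     # hypothetical hook: True if /version answered
    restart: Callable[[], None]   # hypothetical hook: relaunch the port-forward
    fail_threshold: int = 3       # assumed value; tuned separately from interval_s
    interval_s: float = 5.0       # assumed value; tuned separately from fail_threshold
    consecutive_failures: int = 0
    restarts: int = 0
    log: List[str] = field(default_factory=list)  # stands in for the liveness log file

    def tick(self) -> None:
        """Run one probe cycle; restart once the failure threshold is reached."""
        if self.probe():
            self.consecutive_failures = 0
            self.log.append("ok")
            return
        self.consecutive_failures += 1
        self.log.append(f"fail {self.consecutive_failures}/{self.fail_threshold}")
        if self.consecutive_failures >= self.fail_threshold:
            self.restart()
            self.restarts += 1
            self.consecutive_failures = 0
            self.log.append("restart")

    def run(self, cycles: int, sleep: Callable[[float], None] = time.sleep) -> None:
        """Run a bounded number of cycles; a real monitor would loop until stopped."""
        for _ in range(cycles):
            self.tick()
            sleep(self.interval_s)
```

Requiring several consecutive failures before restarting keeps a single slow probe from tearing down a healthy tunnel, while a successful probe resets the counter so isolated failures never accumulate.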

Appendix A is an OceanBase enterprise addon case: (A.1) the N=3 attempt with RUN_ID pitr-runtime-runner-hardening-N3-... overran the T1 1800s budget because of workstation pf-staleness; the remote was healthy throughout. (A.2) The keepalive landed in RUN_ID pitr-runtime-pf-keepalive-N3-...: monitor PID 87376 stayed alive for 47m21s and performed 3 auto-restarts (05:33:26Z / 05:34:13Z / 05:38:51Z), and T1 PASSed where the prior attempt failed. (A.3) The subsequent migration to the in-cluster pattern retired the monitor. Explicit boundary: the 1-sample keepalive landing is not extrapolated to permanent immunity.

SKILL-INDEX.md updated: added an entry under ### 5. 改造 runner / 工具链 ("Modifying the runner / toolchain").

Test plan

  • Manual: body explicitly frames stop-gap, prefers structural fix
  • Manual: appendix has explicit "1-sample, not extrapolated" boundary
  • Manual: cross-doc references resolve

🤖 Generated with Claude Code

Methodology body covers:
- Why kubectl port-forward can be "process alive but TCP stream dead"
- 6 hard rules: monitor only probes + restarts (no proxy / no retry); separate
  INTERVAL and FAIL_THRESHOLD; precise pkill pattern (no collateral kill of
  other port-forwards); monitor liveness logged to file; classified as
  runner-harness only (not product / addon / KB fix evidence); when in-cluster
  pattern is available, prefer it and let keepalive retire
- 5-point PR review checklist
- 3 anti-pattern vs correct-pattern pairs
- Explicit framing as stop-gap; preferred structural fix references
  addon-runner-incluster-vcluster-access-pattern-guide.md

Appendix A is OceanBase enterprise addon N=3 attempt pf-staleness case +
1-sample keepalive landing (PID 87376, 3 restarts) + later structural
migration retiring the monitor. Explicit boundary: keepalive 1 sample only,
not extrapolated to permanent staleness elimination.
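
To make the "precise pkill pattern" rule concrete, here is a minimal sketch of how a pattern pinned to one namespace, target, and local port avoids collateral matches. `pkill -f` matches an extended regular expression against the full command line, so the helper escapes ERE metacharacters and the check uses Python's `re.search` as a stand-in for that matching. The function names, the namespace `vc-test`, the target `svc/vc-test`, and the assumed launch form of the command are all hypothetical.

```python
import re

# Characters that have special meaning in a POSIX extended regular expression.
ERE_META = set(".[]()*+?{}|^$\\")


def ere_escape(text: str) -> str:
    """Backslash-escape ERE metacharacters so the text matches literally."""
    return "".join("\\" + ch if ch in ERE_META else ch for ch in text)


def port_forward_pattern(namespace: str, target: str, local_port: int) -> str:
    """Build a pattern pinned to one namespace, target, and local port.

    Assumes the tunnel was launched exactly as
    `kubectl port-forward -n <namespace> <target> <local_port>:<remote_port>`.
    """
    return (
        "kubectl port-forward -n "
        + ere_escape(namespace)
        + " "
        + ere_escape(target)
        + " "
        + str(local_port)
        + ":"
    )


def would_match(pattern: str, cmdline: str) -> bool:
    """Approximate `pkill -f` matching: search anywhere in the command line."""
    return re.search(pattern, cmdline) is not None


# Hypothetical command lines that might coexist on one workstation.
RUNNER_PF = "kubectl port-forward -n vc-test svc/vc-test 8443:443"
OTHER_PF = "kubectl port-forward -n monitoring svc/grafana 3000:80"
SIBLING_PF = "kubectl port-forward -n vc-test svc/vc-test 18443:443"
```

With these samples, the pinned pattern for `("vc-test", "svc/vc-test", 8443)` matches only `RUNNER_PF`, whereas the bare pattern `kubectl port-forward` matches all three and would kill unrelated tunnels.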

weicao (Contributor, Author) commented May 13, 2026

Blocking for merge:

  1. PR body still contains 🤖 Generated with [Claude Code].... Public PR body must not include AI/tool attribution.
  2. The new guide intro has only 4 standard fields. Please add > **Affected by version skew**: ... after Applies to KB version.
  3. This PR links to docs that are not on main yet: addon-runner-incluster-vcluster-access-pattern-guide.md (docs: add runner in-cluster vcluster access pattern guide #124) and addon-runner-openapi-schema-fetch-brittleness-guide.md (docs: add runner OpenAPI schema fetch brittleness + --validate=false guide #125). Please merge/rebase in dependency order, or remove/defer those links until targets exist.

The stop-gap framing is important and should stay: port-forward keepalive is transitional; new/long-running test paths should prefer host-runner + in-cluster vcluster API access.

weicao pushed a commit that referenced this pull request May 17, 2026
…llout guide

Lessons distilled from 2026-05-18 SQL Server PITR PR #126 backport second-round
validation: after revert to stock + delay, the backport image was still
in containerd cache on node3, so re-sideload was not needed. Document the
probe pod recipe (nsenter + crictl images filter) and when re-sideload
is actually required (node restart + GC / explicit rmi / node rebuild).
This avoids unnecessary DevOps round-trips for multi-round agent loops.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>