feat(telemetry): 7-day delivered-heartbeat with stamp-on-success#173
Merged
saurabhjain1592 merged 7 commits intomainfrom Apr 29, 2026
Merged
feat(telemetry): 7-day delivered-heartbeat with stamp-on-success#173saurabhjain1592 merged 7 commits intomainfrom
saurabhjain1592 merged 7 commits intomainfrom
Conversation
Mirror of the Go SDK reference impl (commit 45658bf), adapted to Python's
threading + asyncio model. The SDK now follows the cross-language
contract:
AxonFlow emits at most one anonymous heartbeat per environment every
7 days during SDK activity.
Implementation:
- New axonflow/heartbeat.py owns the gate. Module-level singleton
HeartbeatState (shared across all clients in the process so
multiple AxonFlow instances coalesce). Fields:
* threading.Lock
* last_checked_monotonic (in-memory 1-hour cache to bound stat()
spam on hot request paths)
* in_flight (coalesces concurrent stampedes)
* stamp_path (auto-resolved via _resolve_stamp_path() — cross-
platform: macOS Library/Caches, XDG on Linux, LOCALAPPDATA on
Windows. None when no cache dir is available, e.g. Lambda).
- maybe_send_heartbeat checks in order: AXONFLOW_TELEMETRY=off /
config gating → in-flight → 1h cache → stamp mtime. Each predicate
short-circuits; AXONFLOW_TELEMETRY=off is checked first (lock-free)
so mid-process opt-out toggles take effect.
- Stamp-on-DELIVERY semantics: stamp written ONLY on POST success.
Failed POSTs leave stamp unchanged so the next call after the 1h
cache expires retries. No "one transient failure = silence for 7
days" failure mode.
- Single hook point: client._pre_request_hook() called at the top of
_request, _orchestrator_request, and _map_request — the three
central HTTP entry points. Async-thread-spawn so user API calls
are never delayed.
- Stamp file written via tempfile.mkstemp + os.replace (atomic on
POSIX). Contents are advisory; SDK reads mtime, not contents.
- Atexit flush handler tracks heartbeat threads (mirrors the existing
pattern from issue #1692) so short-lived processes still deliver
the ping before main() returns.
- _send_telemetry_ping_now is the new pure-network function (returns
bool); _do_ping is kept as a backward-compat wrapper for the legacy
daemon-thread call path that the existing test suite exercises.
- HeartbeatState constructor uses a sentinel (_USE_DEFAULT_CACHE_DIR)
to distinguish "auto-resolve" (default) from "no persistence"
(explicit None) — needed for clean Lambda/restricted-env tests.
Tests:
- tests/test_heartbeat.py — 9-case matrix:
1. cold start, no stamp → 1 ping, stamp written
2. fresh stamp (1d) → 0 pings
3. stale stamp (8d) → 1 ping, stamp updated
4. 5 calls within 1h cache → exactly 1 ping
5. cache expired + stale stamp → 2nd ping fires
6. AXONFLOW_TELEMETRY=off mid-run → 0 pings, stamp unchanged
7. 100 concurrent threads → exactly 1 ping (stampede coalesced)
8. no cache dir (stamp_path=None) → ping per process, no crash
9. ping returns False → stamp NOT written; retry on True lands
- tests/test_heartbeat_e2e.py — 4-run cycle:
Run 1 cold → 1 ping; Run 2 warm → 0; Run 3 stale → 1; Run 4 stale+503
→ attempt counted but stamp NOT advanced; retry on success lands and
advances stamp.
Deep-review fixes on the heartbeat module:
- write_stamp_atomic runs outside the lock so concurrent gate runs are
not serialized through mkdir + tempfile + rename syscalls.
- Cleaned up dead double-replacement in the no_cache_dir test.
- Folded in a pre-existing PLC0415 lint that was tripping ruff.
E2E test upgraded to use a real http.server.HTTPServer on a localhost
socket instead of stubbing _send_telemetry_ping_now via mock — now
matches the Go httptest and Java WireMock E2E coverage. The autouse
_disable_telemetry fixture in conftest.py blocks real httpx as a
safety net; this E2E un-mocks httpx.{get,post} for the duration of
its fixture so the SDK actually hits the local socket.
New cross-platform real-stack E2E under tests/heartbeat-real-stack/:
stands up a localhost fake checkpoint server, constructs AxonFlow
through its public async API, verifies the OS-native stamp file
appears, and runs a warm-cache regression check that asserts 0
additional checkpoint hits on a second construction.
New workflow .github/workflows/heartbeat-real-stack.yml runs this
matrix on [ubuntu-latest, windows-latest, macos-latest]. Validated
locally on macOS + Linux (Docker) before push.
Comment on lines
+20
to
+42
| name: real-stack ${{ matrix.os }} | ||
| runs-on: ${{ matrix.os }} | ||
| strategy: | ||
| fail-fast: false | ||
| matrix: | ||
| os: [ubuntu-latest, windows-latest, macos-latest] | ||
|
|
||
| steps: | ||
| - uses: actions/checkout@v4 | ||
|
|
||
| - name: Set up Python | ||
| uses: actions/setup-python@v5 | ||
| with: | ||
| python-version: '3.11' | ||
|
|
||
| - name: Install SDK (editable) | ||
| run: pip install -e . | ||
|
|
||
| - name: Run real-stack heartbeat E2E (cold + warm) | ||
| # AXONFLOW_TELEMETRY must be empty (NOT off) for this test — | ||
| # we're explicitly validating the heartbeat path. The driver | ||
| # sets this in the smoke env. | ||
| run: python tests/heartbeat-real-stack/run_real_stack.py python tests/heartbeat-real-stack/smoke_python.py |
CI surfaced two issues the local validation didn't catch: - macOS: 15s port-file timeout was too short for cold-start macos-latest runners. Bump to 60s and surface the server's stdout+stderr on timeout so the cause is visible in CI logs instead of failing silently. - ruff: 20 lints across the new tests/heartbeat-real-stack/ harness files, plus a TRY300 in axonflow/telemetry.py introduced by the heartbeat refactor. Added a tests/heartbeat-real-stack/** per-file-ignores entry for harness-appropriate lints (subprocess with constructed args, urlopen on a localhost endpoint, global counters in the fake server, etc.). TRY300 in telemetry.py gets an inline noqa with rationale — restructuring as `else:` would force splitting the try block; linear flow is more readable.
The 7-day-heartbeat work imported `maybe_send_heartbeat` and dropped `send_telemetry_ping` in axonflow/client.py, shifting line numbers in the falsey-clobber baseline by ~14 lines. Same finding count (65), same patterns — purely a baseline refresh.
saurabhjain1592
added a commit
that referenced
this pull request
Apr 29, 2026
The merge of #173 (7-day delivered-heartbeat) into the release branch was auto-resolved cleanly by git, but the section title on the [7.0.0] CHANGELOG header still listed only DO_NOT_TRACK + StaticPolicy + skip_llm. The heartbeat is a behavioural change real users will notice, so promote it to the section title alongside the other headlines.
saurabhjain1592
added a commit
that referenced
this pull request
Apr 29, 2026
* chore(release)!: cut [Unreleased] → [7.0.0] - 2026-04-29 Major release. Headline breaking change: removal of DO_NOT_TRACK as an AxonFlow telemetry opt-out — AXONFLOW_TELEMETRY=off is the canonical and only opt-out signal. Bundles StaticPolicy/PolicyVersion snake_case alignment with the OpenAPI spec and the new ClientRequest.skip_llm request flag (both already merged on main). CHANGELOG cuts the Unreleased section over to a versioned 7.0.0 release dated 2026-04-29 UTC, drops the internal \`### CI / Testing\` block per the user-facing-only changelog policy, and tightens a few descriptions. Bumps pyproject.toml + axonflow/_version.py to 7.0.0. Companion releases ship the same day: TypeScript v7.0.0, Go v7.0.0 (with /v7 module path migration), Java v7.0.0. * chore(release): expand 7.0.0 title to surface heartbeat as headline The merge of #173 (7-day delivered-heartbeat) into the release branch was auto-resolved cleanly by git, but the section title on the [7.0.0] CHANGELOG header still listed only DO_NOT_TRACK + StaticPolicy + skip_llm. The heartbeat is a behavioural change real users will notice, so promote it to the section title alongside the other headlines. * docs(release): prepend "Upgrade strongly recommended" banner to release notes Surfaces the family-wide hardening message at the top of the 2026-04-29 release section so it lands in the GitHub Release body when the tag is cut. Users browsing the CHANGELOG at any future point also see it as the first line of the release notes. The line is identical across all 4 plugins + 4 SDKs in this same-day release train. * docs(release): drop technical title + descriptor recap from 7.0.0 header Header reduces from "## [7.0.0] - 2026-04-29 — DO_NOT_TRACK removal + 7-day delivered heartbeat" to just "## [7.0.0] - 2026-04-29". The descriptor paragraph that followed and re-listed those two headlines shrinks to a one-line coordinated-release note pointing to the same-day companion SDK releases. The substantive bullets in BREAKING / Changed / Fixed are unchanged — users who care about the specifics will read those. The "Upgrade strongly recommended" banner above already conveys the release's intent for everyone else. * docs(release): canonicalize telemetry entries + restructure plugins to BREAKING-first Aligns the common DNT removal + 7-day heartbeat + deprecation-warning removal entries to identical compact wording across all 4 plugins + 4 SDKs. Plugins now use the same `### BREAKING` section header as the SDKs (was: `**BREAKING:**` inline under `### Removed`), so the four sections — BREAKING / Added / Changed / Fixed / Security — read in the same order whether you're looking at a plugin or an SDK CHANGELOG. Telemetry change descriptions trimmed: kept the substantive contract (7-day cadence, stamp-on-delivery, transient-failure resilience, in-flight de-dup, restricted-runtime fallback), dropped the implementation detail (specific syscall names, cache dir paths, 1-hour in-memory cache) — the bullets in BREAKING / Changed / Fixed all carry the headline behaviour without restating it three times. The "Upgrade strongly recommended" banner above and the bullets below cover the message; this commit just removes redundancy. No semantic content removed. Anyone who wants the implementation details can read the source. Anyone who wants to know what changed sees it in three or four lines. * docs: surface security framing + GHSA links + README upgrade banner Mirrors the format of axonflow-enterprise#1772: - Restore CHANGELOG H2 suffix to "— Production, quality, and security hardening — upgrade encouraged". - Add "Security highlights" block under the upgrade-recommended banner citing the three vulnerability fixes shipped in this cycle (webhook signing-key exposure, DO_NOT_TRACK removal, nightly strict-mode integration) plus a link to the per-SDK advisory GHSA-7f4h-6264-89fr and the consolidated platform advisory GHSA-9h64-2846-7x7f. - Add "Reliability and bug-fix highlights" block citing the three operator-facing fixes (retry_context + idempotency_key, atexit telemetry flush, wire-shape contract CI + baseline burndown). - Add upgrade-recommended banner near the top of README.md. Diff is CHANGELOG.md + README.md only; no code or test changes. * docs(readme): switch upgrade banner to evergreen format Mirrors axonflow-enterprise#1774. The banner no longer hard-codes the current version or specific GHSA IDs — instead it links to the canonical /releases/latest and /security/advisories surfaces of this repository so the README doesn't need a re-edit on every release. Same one-paragraph blockquote near the top of the README, just with evergreen links.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Move SDK telemetry from "ping per AxonFlow construction" to the cross-language contract:
Mirrors Go SDK reference (#TBD), TS (#TBD), Java (#TBD); contract in axonflow-enterprise.
How
axonflow/heartbeat.pywith cross-platform stamp file (XDG /Library/Caches/LOCALAPPDATA), 1-hour in-memory cache, in-flight gate, stamp-on-DELIVERY semantics._pre_request_hook()called from_request,_orchestrator_request,_map_request— the three central HTTP entry points._send_telemetry_ping_now() -> boolis the new pure-network function (returns delivery status);send_telemetry_pingis the legacy fire-and-forget shim.Tests
tests/test_heartbeat.py— 9-case matrix.tests/test_heartbeat_e2e.py— 4-run cycle.Test plan
pytest tests/test_heartbeat.py tests/test_heartbeat_e2e.py tests/test_telemetry.pygreen