feat(telemetry): 7-day delivered-heartbeat with stamp-on-success #173

Merged
saurabhjain1592 merged 7 commits into main from feat/7d-telemetry-heartbeat
Apr 29, 2026
@saurabhjain1592
Member

Summary

Move SDK telemetry from "ping per AxonFlow construction" to the cross-language contract:

AxonFlow emits at most one anonymous heartbeat per environment every 7 days during SDK activity.

Mirrors Go SDK reference (#TBD), TS (#TBD), Java (#TBD); contract in axonflow-enterprise.

How

  • New axonflow/heartbeat.py with cross-platform stamp file (XDG / Library/Caches / LOCALAPPDATA), 1-hour in-memory cache, in-flight gate, stamp-on-DELIVERY semantics.
  • Single hook point: _pre_request_hook() called from _request, _orchestrator_request, _map_request — the three central HTTP entry points.
  • _send_telemetry_ping_now() -> bool is the new pure-network function (returns delivery status); send_telemetry_ping is the legacy fire-and-forget shim.
  • Atexit-tracked daemon thread preserves the short-lived-process delivery behaviour from issue #1692.
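
As a rough illustration of the cross-platform stamp location described in the first bullet, the sketch below picks a per-user cache directory by platform. It is not the code in axonflow/heartbeat.py: the function name, the "axonflow" directory, the "heartbeat.stamp" file name, and the ~/.cache fallback are all assumptions made for the example.

```python
import os
import sys
from pathlib import Path
from typing import Mapping, Optional


def resolve_stamp_path(
    platform: str, env: Mapping[str, str], home: Optional[Path]
) -> Optional[Path]:
    """Pick a per-user cache location for the stamp file.

    Hypothetical helper: directory and file names are placeholders,
    not the SDK's real values.
    """
    if platform == "darwin":
        # macOS: ~/Library/Caches
        base = home / "Library" / "Caches" if home else None
    elif platform.startswith("win"):
        # Windows: %LOCALAPPDATA%
        local = env.get("LOCALAPPDATA")
        base = Path(local) if local else None
    else:
        # Linux and other POSIX: $XDG_CACHE_HOME, else ~/.cache (assumed fallback)
        xdg = env.get("XDG_CACHE_HOME")
        if xdg:
            base = Path(xdg)
        else:
            base = home / ".cache" if home else None
    if base is None:
        # No usable cache dir: the caller runs without persistence.
        return None
    return base / "axonflow" / "heartbeat.stamp"


def default_stamp_path() -> Optional[Path]:
    """Resolve against the real process environment."""
    try:
        home: Optional[Path] = Path.home()
    except RuntimeError:
        home = None
    return resolve_stamp_path(sys.platform, os.environ, home)
```

Taking the platform, environment, and home directory as parameters keeps the resolution logic testable without patching sys.platform or os.environ.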

Tests

  • tests/test_heartbeat.py — 9-case matrix.
  • tests/test_heartbeat_e2e.py — 4-run cycle.
  • 35/35 telemetry tests pass.

Test plan

  • pytest tests/test_heartbeat.py tests/test_heartbeat_e2e.py tests/test_telemetry.py green
  • Ruff format + lint clean
  • CI green

Mirror of the Go SDK reference impl (commit 45658bf), adapted to Python's
threading + asyncio model. The SDK now follows the cross-language
contract:

  AxonFlow emits at most one anonymous heartbeat per environment every
  7 days during SDK activity.

Implementation:

- New axonflow/heartbeat.py owns the gate. Module-level singleton
  HeartbeatState (shared across all clients in the process so
  multiple AxonFlow instances coalesce). Fields:
  * threading.Lock
  * last_checked_monotonic (in-memory 1-hour cache to bound stat()
    spam on hot request paths)
  * in_flight (coalesces concurrent stampedes)
  * stamp_path (auto-resolved via _resolve_stamp_path() — cross-
    platform: macOS Library/Caches, XDG on Linux, LOCALAPPDATA on
    Windows. None when no cache dir is available, e.g. Lambda).
- maybe_send_heartbeat checks in order: AXONFLOW_TELEMETRY=off /
  config gating → in-flight → 1h cache → stamp mtime. Each predicate
  short-circuits; AXONFLOW_TELEMETRY=off is checked first (lock-free)
  so mid-process opt-out toggles take effect.
- Stamp-on-DELIVERY semantics: stamp written ONLY on POST success.
  Failed POSTs leave stamp unchanged so the next call after the 1h
  cache expires retries. No "one transient failure = silence for 7
  days" failure mode.
- Single hook point: client._pre_request_hook() called at the top of
  _request, _orchestrator_request, and _map_request, the three
  central HTTP entry points. The heartbeat runs on a spawned
  background thread, so user API calls are never delayed.
- Stamp file written via tempfile.mkstemp + os.replace (atomic on
  POSIX). Contents are advisory; SDK reads mtime, not contents.
- Atexit flush handler tracks heartbeat threads (mirrors the existing
  pattern from issue #1692) so short-lived processes still deliver
  the ping before the interpreter exits.
- _send_telemetry_ping_now is the new pure-network function (returns
  bool); _do_ping is kept as a backward-compat wrapper for the legacy
  daemon-thread call path that the existing test suite exercises.
- HeartbeatState constructor uses a sentinel (_USE_DEFAULT_CACHE_DIR)
  to distinguish "auto-resolve" (default) from "no persistence"
  (explicit None) — needed for clean Lambda/restricted-env tests.
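
The gate ordering and stamp-on-delivery rule above can be sketched roughly as follows. This is an illustration under stated assumptions, not the module itself: the class name, the injected send and clock callables, and the case-insensitive "off" check are hypothetical, the stamp write here is a plain write rather than the atomic one, and the send runs synchronously where the SDK spawns a thread.

```python
import os
import threading
import time
from pathlib import Path
from typing import Callable, Optional

SEVEN_DAYS_S = 7 * 24 * 3600  # heartbeat interval from the contract
CACHE_TTL_S = 3600            # in-memory re-check interval


class HeartbeatGate:
    """Illustrative gate; names and structure are assumptions."""

    def __init__(
        self,
        stamp_path: Optional[Path],
        send: Callable[[], bool],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._lock = threading.Lock()
        self._last_checked: Optional[float] = None
        self._in_flight = False
        self._stamp_path = stamp_path
        self._send = send
        self._clock = clock

    def _stamp_is_fresh(self) -> bool:
        if self._stamp_path is None:
            return False
        try:
            mtime = self._stamp_path.stat().st_mtime
        except OSError:
            return False  # missing or unreadable stamp counts as stale
        return time.time() - mtime < SEVEN_DAYS_S

    def _write_stamp(self) -> None:
        if self._stamp_path is None:
            return
        try:
            self._stamp_path.parent.mkdir(parents=True, exist_ok=True)
            self._stamp_path.write_text(str(time.time()))
        except OSError:
            pass  # persistence is best-effort

    def maybe_send(self) -> bool:
        """Return True only when a ping was attempted and delivered."""
        # 1. Opt-out is checked first, without taking the lock.
        if os.environ.get("AXONFLOW_TELEMETRY", "").strip().lower() == "off":
            return False
        with self._lock:
            # 2. Another caller is already sending.
            if self._in_flight:
                return False
            # 3. The stamp was checked less than an hour ago.
            now = self._clock()
            if self._last_checked is not None and now - self._last_checked < CACHE_TTL_S:
                return False
            self._last_checked = now
            # 4. The stamp on disk is younger than 7 days.
            if self._stamp_is_fresh():
                return False
            self._in_flight = True
        try:
            delivered = self._send()
            if delivered:
                # Stamp only on delivery; a failed send leaves it untouched.
                self._write_stamp()
            return delivered
        finally:
            with self._lock:
                self._in_flight = False
```

Because a failed send still refreshes the in-memory check time but not the stamp, the next attempt happens once the 1-hour cache expires, which matches the retry behaviour described above.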

Tests:

- tests/test_heartbeat.py — 9-case matrix:
  1. cold start, no stamp           → 1 ping, stamp written
  2. fresh stamp (1d)               → 0 pings
  3. stale stamp (8d)               → 1 ping, stamp updated
  4. 5 calls within 1h cache        → exactly 1 ping
  5. cache expired + stale stamp    → 2nd ping fires
  6. AXONFLOW_TELEMETRY=off mid-run → 0 pings, stamp unchanged
  7. 100 concurrent threads         → exactly 1 ping (stampede coalesced)
  8. no cache dir (stamp_path=None) → ping per process, no crash
  9. ping returns False             → stamp NOT written; retry on True lands

- tests/test_heartbeat_e2e.py — 4-run cycle:
  Run 1 cold → 1 ping; Run 2 warm → 0; Run 3 stale → 1; Run 4 stale+503
  → attempt counted but stamp NOT advanced; retry on success lands and
  advances stamp.
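
Case 7 in the matrix (100 concurrent threads, exactly 1 ping) could be exercised along these lines. The gate below is a stripped-down stand-in, not HeartbeatState: the class, the _delivered flag (standing in for the stamp and 1-hour cache), and the run_stampede helper are hypothetical names for this sketch.

```python
import threading
from typing import Callable


class CoalescingGate:
    """Minimal in-flight gate; a stand-in for the real heartbeat state."""

    def __init__(self, send: Callable[[], bool]) -> None:
        self._lock = threading.Lock()
        self._in_flight = False
        self._delivered = False  # stands in for the stamp + 1h cache
        self._send = send

    def maybe_send(self) -> None:
        with self._lock:
            if self._in_flight or self._delivered:
                return
            self._in_flight = True
        delivered = False
        try:
            delivered = self._send()
        finally:
            # Clear the in-flight flag and record delivery in one step,
            # so no late thread can slip through between the two.
            with self._lock:
                self._in_flight = False
                self._delivered = delivered


def run_stampede(gate: CoalescingGate, n_threads: int = 100) -> None:
    """Release n_threads at the same moment and wait for all of them."""
    barrier = threading.Barrier(n_threads)

    def worker() -> None:
        barrier.wait()  # line every thread up before calling the gate
        gate.maybe_send()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

The barrier makes the stampede as simultaneous as the scheduler allows, which is what gives the "exactly 1 ping" assertion its value.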
Deep-review fixes on the heartbeat module:
- write_stamp_atomic runs outside the lock so concurrent gate runs are
  not serialized through mkdir + tempfile + rename syscalls.
- Cleaned up dead double-replacement in the no_cache_dir test.
- Folded in a pre-existing PLC0415 lint that was tripping ruff.
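
For reference, an atomic stamp write built on tempfile.mkstemp + os.replace might look like the sketch below. The signature, the bool return, the temp-file prefix, and the file contents are assumptions; only the mkstemp-then-replace technique comes from the description above. It takes no lock, in line with the note that the write runs outside the gate's lock.

```python
import os
import tempfile
import time
from pathlib import Path


def write_stamp_atomic(stamp_path: Path) -> bool:
    """Write the stamp via a temp file and rename; True on success.

    Sketch only: readers use the file's mtime, so the contents
    written here are advisory.
    """
    try:
        stamp_path.parent.mkdir(parents=True, exist_ok=True)
        # The temp file must live in the same directory so that
        # os.replace is a same-filesystem rename.
        fd, tmp_name = tempfile.mkstemp(
            dir=str(stamp_path.parent), prefix=".stamp-"
        )
    except OSError:
        return False
    try:
        with os.fdopen(fd, "w") as fh:
            fh.write(f"{time.time():.0f}\n")
        os.replace(tmp_name, stamp_path)
    except OSError:
        # Best-effort cleanup of the orphaned temp file.
        try:
            os.unlink(tmp_name)
        except OSError:
            pass
        return False
    return True
```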

E2E test upgraded to use a real http.server.HTTPServer on a localhost
socket instead of stubbing _send_telemetry_ping_now via mock — now
matches the Go httptest and Java WireMock E2E coverage. The autouse
_disable_telemetry fixture in conftest.py blocks real httpx as a
safety net; this E2E un-mocks httpx.{get,post} for the duration of
its fixture so the SDK actually hits the local socket.
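
A localhost server of the kind described can be stood up with the standard library alone. The sketch below is not the fixture from tests/test_heartbeat_e2e.py: it uses urllib instead of httpx to stay dependency-free, and the handler, helper names, and payload are invented for the example.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class CountingHandler(BaseHTTPRequestHandler):
    """Counts POSTs and answers with a configurable status code."""

    hits = 0
    status = 200

    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)  # drain the request body
        type(self).hits += 1
        self.send_response(type(self).status)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, *args) -> None:
        pass  # keep test output quiet


def start_server() -> HTTPServer:
    """Bind to an ephemeral localhost port and serve in the background."""
    server = HTTPServer(("127.0.0.1", 0), CountingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def post_ping(url: str) -> bool:
    """POST a tiny JSON body; True only on a 2xx response."""
    req = urllib.request.Request(
        url,
        data=json.dumps({"ping": 1}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Bypass any proxy settings so the request really hits localhost.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(req, timeout=5) as resp:
            return 200 <= resp.status < 300
    except OSError:  # URLError and HTTPError are OSError subclasses
        return False
```

Flipping CountingHandler.status to 503 reproduces the "attempt counted but not delivered" leg of the 4-run cycle without any mocking.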

New cross-platform real-stack E2E under tests/heartbeat-real-stack/:
stands up a localhost fake checkpoint server, constructs AxonFlow
through its public async API, verifies the OS-native stamp file
appears, and runs a warm-cache regression check that asserts 0
additional checkpoint hits on a second construction.

New workflow .github/workflows/heartbeat-real-stack.yml runs this
matrix on [ubuntu-latest, windows-latest, macos-latest]. Validated
locally on macOS + Linux (Docker) before push.
Comment on lines +20 to +42
    name: real-stack ${{ matrix.os }}
    runs-on: ${{ matrix.os }}
    strategy:
      fail-fast: false
      matrix:
        os: [ubuntu-latest, windows-latest, macos-latest]

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install SDK (editable)
        run: pip install -e .

      - name: Run real-stack heartbeat E2E (cold + warm)
        # AXONFLOW_TELEMETRY must be empty (NOT off) for this test;
        # we're explicitly validating the heartbeat path. The driver
        # sets this in the smoke env.
        run: python tests/heartbeat-real-stack/run_real_stack.py python tests/heartbeat-real-stack/smoke_python.py

CI surfaced two issues the local validation didn't catch:

- macOS: 15s port-file timeout was too short for cold-start
  macos-latest runners. Bump to 60s and surface the server's
  stdout+stderr on timeout so the cause is visible in CI logs
  instead of failing silently.

- ruff: 20 lints across the new tests/heartbeat-real-stack/ harness
  files, plus a TRY300 in axonflow/telemetry.py introduced by the
  heartbeat refactor. Added a tests/heartbeat-real-stack/**
  per-file-ignores entry for harness-appropriate lints (subprocess
  with constructed args, urlopen on a localhost endpoint, global
  counters in the fake server, etc.). TRY300 in telemetry.py gets
  an inline noqa with rationale — restructuring as `else:` would
  force splitting the try block; linear flow is more readable.
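
To make the TRY300 trade-off concrete, here are the two shapes side by side. These functions are hypothetical and are not the code in axonflow/telemetry.py; post stands for any callable that returns an HTTP status code or raises OSError.

```python
from typing import Callable


def ping_linear(post: Callable[[], int]) -> bool:
    """Linear flow: success return stays inside the try block."""
    try:
        status = post()
        if status >= 400:
            return False
        return True  # noqa: TRY300 - kept inline so the flow reads top to bottom
    except OSError:
        return False


def ping_with_else(post: Callable[[], int]) -> bool:
    """Shape TRY300 asks for: success handling moves to an else block."""
    try:
        status = post()
    except OSError:
        return False
    else:
        return status < 400
```

Both behave the same here; the difference is only where the success path lives relative to the try body.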
The 7-day-heartbeat work imported `maybe_send_heartbeat` and dropped
`send_telemetry_ping` in axonflow/client.py, shifting line numbers
in the falsey-clobber baseline by ~14 lines. Same finding count
(65), same patterns — purely a baseline refresh.
saurabhjain1592 merged commit 02665be into main on Apr 29, 2026
18 checks passed
saurabhjain1592 added a commit that referenced this pull request Apr 29, 2026
The merge of #173 (7-day delivered-heartbeat) into the release branch
was auto-resolved cleanly by git, but the section title on the
[7.0.0] CHANGELOG header still listed only DO_NOT_TRACK + StaticPolicy
+ skip_llm. The heartbeat is a behavioural change real users will
notice, so promote it to the section title alongside the other
headlines.
saurabhjain1592 added a commit that referenced this pull request Apr 29, 2026
* chore(release)!: cut [Unreleased] → [7.0.0] - 2026-04-29

Major release. Headline breaking change: removal of DO_NOT_TRACK as an
AxonFlow telemetry opt-out — AXONFLOW_TELEMETRY=off is the canonical
and only opt-out signal. Bundles StaticPolicy/PolicyVersion snake_case
alignment with the OpenAPI spec and the new ClientRequest.skip_llm
request flag (both already merged on main).

CHANGELOG cuts the Unreleased section over to a versioned 7.0.0
release dated 2026-04-29 UTC, drops the internal `### CI / Testing`
block per the user-facing-only changelog policy, and tightens a few
descriptions. Bumps pyproject.toml + axonflow/_version.py to 7.0.0.

Companion releases ship the same day: TypeScript v7.0.0,
Go v7.0.0 (with /v7 module path migration), Java v7.0.0.

* chore(release): expand 7.0.0 title to surface heartbeat as headline

The merge of #173 (7-day delivered-heartbeat) into the release branch
was auto-resolved cleanly by git, but the section title on the
[7.0.0] CHANGELOG header still listed only DO_NOT_TRACK + StaticPolicy
+ skip_llm. The heartbeat is a behavioural change real users will
notice, so promote it to the section title alongside the other
headlines.

* docs(release): prepend "Upgrade strongly recommended" banner to release notes

Surfaces the family-wide hardening message at the top of the 2026-04-29
release section so it lands in the GitHub Release body when the tag is
cut. Users browsing the CHANGELOG at any future point also see it as
the first line of the release notes.

The line is identical across all 4 plugins + 4 SDKs in this same-day
release train.

* docs(release): drop technical title + descriptor recap from 7.0.0 header

Header reduces from "## [7.0.0] - 2026-04-29 — DO_NOT_TRACK removal +
7-day delivered heartbeat" to just "## [7.0.0] - 2026-04-29". The
descriptor paragraph that followed and re-listed those two headlines
shrinks to a one-line coordinated-release note pointing to the same-day
companion SDK releases.

The substantive bullets in BREAKING / Changed / Fixed are unchanged —
users who care about the specifics will read those. The
"Upgrade strongly recommended" banner above already conveys the
release's intent for everyone else.

* docs(release): canonicalize telemetry entries + restructure plugins to BREAKING-first

Aligns the common DNT removal + 7-day heartbeat + deprecation-warning
removal entries to identical compact wording across all 4 plugins +
4 SDKs. Plugins now use the same `### BREAKING` section header as the
SDKs (was: `**BREAKING:**` inline under `### Removed`), so the
sections (BREAKING / Added / Changed / Fixed / Security) read in
the same order whether you're looking at a plugin or an SDK CHANGELOG.

Telemetry change descriptions trimmed: kept the substantive contract
(7-day cadence, stamp-on-delivery, transient-failure resilience,
in-flight de-dup, restricted-runtime fallback), dropped the
implementation detail (specific syscall names, cache dir paths,
1-hour in-memory cache) — the bullets in BREAKING / Changed / Fixed
all carry the headline behaviour without restating it three times.
The "Upgrade strongly recommended" banner above and the bullets
below cover the message; this commit just removes redundancy.

No semantic content removed. Anyone who wants the implementation
details can read the source. Anyone who wants to know what changed
sees it in three or four lines.

* docs: surface security framing + GHSA links + README upgrade banner

Mirrors the format of axonflow-enterprise#1772:

- Restore CHANGELOG H2 suffix to "— Production, quality, and security
  hardening — upgrade encouraged".
- Add "Security highlights" block under the upgrade-recommended banner
  citing the three vulnerability fixes shipped in this cycle (webhook
  signing-key exposure, DO_NOT_TRACK removal, nightly strict-mode
  integration) plus a link to the per-SDK advisory GHSA-7f4h-6264-89fr
  and the consolidated platform advisory GHSA-9h64-2846-7x7f.
- Add "Reliability and bug-fix highlights" block citing the three
  operator-facing fixes (retry_context + idempotency_key, atexit
  telemetry flush, wire-shape contract CI + baseline burndown).
- Add upgrade-recommended banner near the top of README.md.

Diff is CHANGELOG.md + README.md only; no code or test changes.

* docs(readme): switch upgrade banner to evergreen format

Mirrors axonflow-enterprise#1774. The banner no longer hard-codes
the current version or specific GHSA IDs — instead it links to the
canonical /releases/latest and /security/advisories surfaces of
this repository so the README doesn't need a re-edit on every
release. Same one-paragraph blockquote near the top of the README,
just with evergreen links.
saurabhjain1592 deleted the feat/7d-telemetry-heartbeat branch on April 29, 2026 23:53