feat(observability): add LFS phase metrics and snapshot serve bandwidth histogram#318
078b88e to
c8fc785
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c8fc7857a1
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
c8fc785 to
265fee5
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 265fee5f2f
…th histogram

Adds two pieces of visibility motivated by staging benchmark findings:

1. cachew.git.snapshot_serve_bandwidth_mbps (per-request MiB/s, by source and repository). Aggregate bytes/duration averages are ambiguous: e.g. cash-server cached serves averaged ~325 MiB/s while a raw curl from the same workstation saw ~588 MiB/s. A per-request distribution lets us see whether slow clients pull the tail down vs the server itself being slow.

2. cachew.git.lfs_phase_duration_seconds and cachew.git.lfs_phase_bytes for LFS-snapshot generation phases (discover, clone, fetch, archive_upload). LFS snapshot generation is the biggest server-side cost in staging (~8.7 min average), and today we only see total duration; per-phase breakdown is needed to know whether to target clone, LFS fetch, or pack/upload.

Also stamps cachew.snapshot.bandwidth_mbps onto the active snapshot span so trace samples carry the same value.

Co-authored-by: Amp <amp@ampcode.com>
Amp-Thread-ID: https://ampcode.com/threads/T-019e41af-0a15-718d-a9d8-e26df6071f9b
265fee5 to
77b4ac2
The bandwidth histogram added in #318 topped out at 5000 MiB/s (~5.2 GB/s), so any serve from ~5 GiB/s through the cachew server NIC ceiling collapses into the +Inf bucket and we lose the signal where we most need it.

Add 10000 and 15000 MiB/s buckets so cachew.git.snapshot_serve_bandwidth_mbps can distinguish 'saturating a 10 GbE workstation' from 'approaching the server NIC limit', with some headroom past the theoretical max to spot misattribution.

Amp-Thread-ID: https://ampcode.com/threads/T-019e41af-0a15-718d-a9d8-e26df6071f9b
Co-authored-by: Amp <amp@ampcode.com>
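To illustrate the +Inf collapse described in this commit, here is a minimal sketch of how an explicit-bucket histogram assigns a value to a bucket. It is not the cachew implementation: the `bucket_label` helper is hypothetical, and every boundary below 5000 is an assumption made up for the example. Only the 5000, 10000 and 15000 MiB/s boundaries come from the commit message.

```python
from bisect import bisect_left

# ASSUMPTION: boundaries below 5000 are invented for illustration.
# Only 5000 (old top), 10000 and 15000 (new) are stated in the commit message.
OLD_BOUNDS = [50, 100, 250, 500, 1000, 2500, 5000]
NEW_BOUNDS = OLD_BOUNDS + [10000, 15000]


def bucket_label(value_mibps, bounds):
    """Return the upper bound of the bucket a value lands in.

    Buckets are treated as upper-inclusive: a value equal to a boundary
    lands in the bucket that boundary closes. Anything above the last
    boundary falls into the overflow bucket, labelled "+Inf".
    """
    index = bisect_left(bounds, value_mibps)
    return "+Inf" if index == len(bounds) else str(bounds[index])


# A 7000 MiB/s and a 12000 MiB/s serve are indistinguishable with the old
# boundaries, but land in separate buckets with the new ones.
for value in (7000, 12000):
    print(value, bucket_label(value, OLD_BOUNDS), bucket_label(value, NEW_BOUNDS))
```

With the old boundaries both sample values report as `+Inf`; with the new ones they fall under 10000 and 15000 respectively, which is the distinction the commit is after.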
Adds two pieces of visibility motivated by staging benchmark findings:
- `cachew.git.snapshot_serve_bandwidth_mbps`: per-request MiB/s, keyed by source and repository. Aggregate bytes/duration averages are ambiguous: `cash-server` cached serves averaged ~325 MiB/s while a raw `curl` from the same workstation saw ~588 MiB/s. A per-request distribution lets us tell whether slow clients pull the tail down vs the server itself being slow. Also stamped onto the active snapshot span as `cachew.snapshot.bandwidth_mbps`.
- `cachew.git.lfs_phase_duration_seconds` and `cachew.git.lfs_phase_bytes`: broken out by phase (`discover`, `clone`, `fetch`, `archive_upload`). LFS-snapshot generation is the biggest server-side cost in staging (~8.7 min average), and today we only see total duration; per-phase breakdown is needed to know whether to target clone, LFS fetch, or pack/upload.

Both histograms use explicit buckets sized for the values we expect.
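The two ideas in this description can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `bandwidth_mibps`, `aggregate_mibps` and `lfs_phase` are hypothetical helpers, the request sizes and durations are made-up numbers, and a plain list stands in for the real metrics recorder.

```python
import time
from contextlib import contextmanager

MIB = 1024 * 1024


def bandwidth_mibps(bytes_served, duration_seconds):
    """Per-request throughput in MiB/s, or None for a zero-length serve."""
    if duration_seconds <= 0:
        return None
    return bytes_served / MIB / duration_seconds


def aggregate_mibps(requests):
    """Total bytes over total duration: the ambiguous average."""
    total_bytes = sum(size for size, _ in requests)
    total_seconds = sum(duration for _, duration in requests)
    return total_bytes / MIB / total_seconds


@contextmanager
def lfs_phase(recorded, phase):
    """Time one phase and append (phase, seconds) to a stand-in recorder."""
    start = time.monotonic()
    try:
        yield
    finally:
        recorded.append((phase, time.monotonic() - start))


# ASSUMPTION: made-up sample of three fast serves and one slow client.
requests = [(1000 * MIB, 2.0)] * 3 + [(1000 * MIB, 10.0)]
per_request = [bandwidth_mibps(size, secs) for size, secs in requests]
aggregate = aggregate_mibps(requests)
print(per_request)  # three requests at 500.0 and one at 100.0
print(aggregate)    # 250.0, a rate no individual request actually saw

# Phase names are the ones listed in the description; the bodies are empty here.
recorded = []
for phase in ("discover", "clone", "fetch", "archive_upload"):
    with lfs_phase(recorded, phase):
        pass
```

In this sample the aggregate lands at 250 MiB/s even though no request ran at that rate, which is why a per-request distribution separates a slow-client tail from a slow server. The phase wrapper shows the shape of a per-phase breakdown: one duration sample per named phase instead of a single total.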