
feat(observability): add LFS phase metrics and snapshot serve bandwidth histogram#318

Merged
worstell merged 1 commit into main from feat/lfs-and-bandwidth-metrics
May 19, 2026
Conversation

@worstell
Contributor

Adds two pieces of visibility motivated by staging benchmark findings:

  1. cachew.git.snapshot_serve_bandwidth_mbps: per-request MiB/s, keyed by source and repository. Aggregate bytes/duration averages are ambiguous: cash-server cached serves averaged ~325 MiB/s while a raw curl from the same workstation saw ~588 MiB/s. A per-request distribution lets us tell whether slow clients are pulling the tail down or the server itself is slow. The same value is also stamped onto the active snapshot span as cachew.snapshot.bandwidth_mbps.

  2. cachew.git.lfs_phase_duration_seconds and cachew.git.lfs_phase_bytes: broken out by phase (discover, clone, fetch, archive_upload). LFS-snapshot generation is the biggest server-side cost in staging (~8.7 min average), and today we only see total duration; a per-phase breakdown is needed to know whether to target clone, LFS fetch, or pack/upload.
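
As a rough illustration of the per-phase breakdown in item 2, the sketch below times each phase separately and records a duration and byte count per phase. It is not the code from this PR: `phaseRecorder`, `phaseSample`, `run`, `observe`, and `slowest` are hypothetical names, and a slice stands in for the real histograms. Only the phase names come from the description above.

```go
package main

import (
	"fmt"
	"time"
)

// phaseSample is a hypothetical record of one LFS snapshot phase.
type phaseSample struct {
	Phase    string
	Duration time.Duration
	Bytes    int64
}

// phaseRecorder collects per-phase samples. A real implementation would
// record into duration and byte histograms rather than a slice.
type phaseRecorder struct {
	samples []phaseSample
}

// observe stores one sample for the named phase.
func (r *phaseRecorder) observe(phase string, d time.Duration, bytes int64) {
	r.samples = append(r.samples, phaseSample{Phase: phase, Duration: d, Bytes: bytes})
}

// run times fn with the wall clock and records the result under phase.
// fn returns the number of bytes the phase moved.
func (r *phaseRecorder) run(phase string, fn func() (int64, error)) error {
	start := time.Now()
	n, err := fn()
	r.observe(phase, time.Since(start), n)
	return err
}

// slowest returns the phase with the largest recorded duration,
// or "" if nothing with a positive duration was recorded.
func (r *phaseRecorder) slowest() string {
	var best phaseSample
	for _, s := range r.samples {
		if s.Duration > best.Duration {
			best = s
		}
	}
	return best.Phase
}

func main() {
	r := &phaseRecorder{}
	// Phase names are from the PR description; the work done here is a stub.
	for _, p := range []string{"discover", "clone", "fetch", "archive_upload"} {
		_ = r.run(p, func() (int64, error) { return 0, nil })
	}
	for _, s := range r.samples {
		fmt.Printf("%s: %v, %d bytes\n", s.Phase, s.Duration, s.Bytes)
	}
}
```

With one sample per phase, the slowest phase can be read off directly instead of being inferred from a single total duration.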

Both histograms use explicit buckets sized for the values we expect.
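
To make the per-request versus aggregate distinction concrete, here is a minimal sketch of computing MiB/s for a single request and placing it in an explicit-bucket histogram. The helper names (`mibPerSecond`, `bucketIndex`) and the bucket bounds are assumptions for illustration, not values taken from this PR.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

const mib = 1 << 20

// mibPerSecond converts one request's byte count and duration to MiB/s.
// A non-positive duration yields 0 rather than dividing by zero.
func mibPerSecond(bytes int64, d time.Duration) float64 {
	if d <= 0 {
		return 0
	}
	return float64(bytes) / mib / d.Seconds()
}

// bucketIndex returns the index of the first upper bound >= v, or
// len(bounds) for the overflow (+Inf) bucket. bounds must be sorted.
func bucketIndex(bounds []float64, v float64) int {
	return sort.SearchFloat64s(bounds, v)
}

func main() {
	// Hypothetical bounds in MiB/s, chosen only for this example.
	bounds := []float64{100, 250, 500, 1000}

	type request struct {
		bytes int64
		d     time.Duration
	}
	// Two serves of the same 1 GiB snapshot: one fast client, one slow.
	reqs := []request{
		{1 << 30, 2 * time.Second},
		{1 << 30, 8 * time.Second},
	}

	var totalBytes int64
	var totalDur time.Duration
	for _, r := range reqs {
		v := mibPerSecond(r.bytes, r.d)
		fmt.Printf("request: %.1f MiB/s -> bucket %d\n", v, bucketIndex(bounds, v))
		totalBytes += r.bytes
		totalDur += r.d
	}
	// The aggregate hides the spread between the two requests.
	fmt.Printf("aggregate: %.1f MiB/s\n", mibPerSecond(totalBytes, totalDur))
}
```

In this example the two requests land at 512 and 128 MiB/s while the aggregate reports 204.8 MiB/s, a rate neither client actually saw. That gap is what a per-request histogram is meant to expose.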

@worstell worstell changed the title from "observability: add LFS phase metrics and snapshot serve bandwidth histogram" to "feat(observability): add LFS phase metrics and snapshot serve bandwidth histogram" May 19, 2026
@worstell worstell force-pushed the feat/lfs-and-bandwidth-metrics branch from 078b88e to c8fc785 Compare May 19, 2026 22:19
@worstell worstell marked this pull request as ready for review May 19, 2026 22:21
@worstell worstell requested a review from a team as a code owner May 19, 2026 22:21
@worstell worstell requested review from inez and removed request for a team May 19, 2026 22:21

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c8fc7857a1

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread internal/strategy/git/snapshot.go Outdated
@worstell worstell force-pushed the feat/lfs-and-bandwidth-metrics branch from c8fc785 to 265fee5 Compare May 19, 2026 22:30

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 265fee5f2f


Comment thread internal/strategy/git/snapshot.go Outdated
…th histogram

Adds two pieces of visibility motivated by staging benchmark findings:

1. cachew.git.snapshot_serve_bandwidth_mbps (per-request MiB/s, by source
   and repository). Aggregate bytes/duration averages are ambiguous —
   e.g. cash-server cached serves averaged ~325 MiB/s while a raw curl
   from the same workstation saw ~588 MiB/s. A per-request distribution
   lets us see whether slow clients pull the tail down vs the server
   itself being slow.

2. cachew.git.lfs_phase_duration_seconds and cachew.git.lfs_phase_bytes
   for LFS-snapshot generation phases (discover, clone, fetch,
   archive_upload). LFS snapshot generation is the biggest server-side
   cost in staging (~8.7 min average), and today we only see total
   duration; per-phase breakdown is needed to know whether to target
   clone, LFS fetch, or pack/upload.

Also stamps cachew.snapshot.bandwidth_mbps onto the active snapshot
span so trace samples carry the same value.

Co-authored-by: Amp <amp@ampcode.com>
Amp-Thread-ID: https://ampcode.com/threads/T-019e41af-0a15-718d-a9d8-e26df6071f9b
@worstell worstell force-pushed the feat/lfs-and-bandwidth-metrics branch from 265fee5 to 77b4ac2 Compare May 19, 2026 22:40
@worstell worstell merged commit 6b1f756 into main May 19, 2026
8 checks passed
@worstell worstell deleted the feat/lfs-and-bandwidth-metrics branch May 19, 2026 22:43
worstell added a commit that referenced this pull request May 19, 2026
The bandwidth histogram added in #318 topped out at 5000 MiB/s (~5.2 GB/s),
so any serve from ~5 GiB/s through the cachew server NIC ceiling collapses
into the +Inf bucket and we lose the signal where we most need it.

Add 10000 and 15000 MiB/s buckets so cachew.git.snapshot_serve_bandwidth_mbps
can distinguish 'saturating a 10 GbE workstation' from 'approaching the
server NIC limit', with some headroom past the theoretical max to spot
misattribution.

Amp-Thread-ID: https://ampcode.com/threads/T-019e41af-0a15-718d-a9d8-e26df6071f9b
Co-authored-by: Amp <amp@ampcode.com>
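
The effect of the extra buckets can be sketched as follows. Only the 5000, 10000, and 15000 MiB/s bounds come from the commit message above. The lower bounds, the sample values, and the `countByBucket` helper are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// countByBucket tallies values into explicit histogram buckets. Bucket i
// holds values <= bounds[i] (and above the previous bound); the final
// slot is the overflow (+Inf) bucket. bounds must be sorted.
func countByBucket(bounds, values []float64) []int {
	counts := make([]int, len(bounds)+1)
	for _, v := range values {
		counts[sort.SearchFloat64s(bounds, v)]++
	}
	return counts
}

func main() {
	// Lower bounds are made up; 5000, 10000, 15000 are from the commit message.
	before := []float64{1000, 2500, 5000}
	after := []float64{1000, 2500, 5000, 10000, 15000}

	// Hypothetical per-request bandwidths in MiB/s.
	serves := []float64{800, 4200, 6000, 9500, 12000}

	fmt.Println(countByBucket(before, serves)) // prints [1 0 1 3]
	fmt.Println(countByBucket(after, serves))  // prints [1 0 1 2 1 0]
}
```

With the original top bound, the three fastest serves are indistinguishable in the overflow bucket. With the added bounds they split into the 10000 and 15000 buckets, and the overflow bucket stays empty unless something exceeds 15000 MiB/s.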