Skip to content

chore(agent-builder): diagnostic logs for /daily 403 root-cause hunt#420

Merged
eanzhao merged 2 commits intodevfrom
chore/2026-04-25_diag-daily-preflight-403
Apr 25, 2026
Merged

chore(agent-builder): diagnostic logs for /daily 403 root-cause hunt#420
eanzhao merged 2 commits intodevfrom
chore/2026-04-25_diag-daily-preflight-403

Conversation

@eanzhao
Copy link
Copy Markdown
Contributor

@eanzhao eanzhao commented Apr 25, 2026

Summary

After #418 merged, production /daily still fails with github_proxy_access_denied. Independent manual reproduction (session-token-minted child key, both narrow and broad scopes) succeeds end-to-end against the same NyxID account, GitHub UserService, and proxy slug. The production-only failure has to live in either the deployed POST /api-keys payload or the preflight response shape, neither of which is visible in current logs.

This PR adds diagnostic-only log lines on the daily-report create path so the next failed /daily captures ground truth, then revert.

What's logged

  • Before CreateApiKeyAsync: resolved per-user UserService.ids and the literal payload JSON. Tells us whether the deployed binary actually carries allow_all_services=false and whether ResolveProxyServiceIdsAsync returned per-user ids vs catalog ids.
  • After successful create: the new api-key id, for correlating to NyxID-side proxy_request_denied audit records.
  • Inside PreflightGitHubProxyAsync (Information level): the raw probe response. The parsed proxy_body we return only carries the inner Lark/GitHub body string when SendAsync wraps non-2xx; the unparsed probe is the only place we see exactly what NyxID returned. Distinguishes:
    • NyxID ApiKeyScopeForbidden (error_code:9000, our payload is still wrong somehow)
    • GitHub upstream 403 (Bad credentials, OAuth grant revoked)
    • Other shapes
  • On preflight failure (Warning level): the structured envelope we hand back to the user, so the failure is one self-contained log line.

Why this PR exists separately

These logs are temporary debugging — once the production log captures the real failure shape, this PR (or just the four log lines) gets reverted. Splitting it off keeps the revert clean and avoids re-opening the #418 conversation.

Test plan

  • dotnet build agents/Aevatar.GAgents.ChannelRuntime/Aevatar.GAgents.ChannelRuntime.csproj — clean (40 pre-existing warnings, 0 errors).
  • dotnet test test/Aevatar.GAgents.ChannelRuntime.Tests/Aevatar.GAgents.ChannelRuntime.Tests.csproj --filter AgentBuilderToolTests — 30/30 passing.
  • Deploy + trigger /daily alice from Lark, capture log lines tagged Diagnostic[#417]. Revert this PR after.

Follow-up after capture

Depending on what Diagnostic[#417]: GitHub preflight probe response shows:

  • If NyxID error_code:9000 → resolver still returns wrong ids despite the apparent fix; need to revisit ResolveProxyServiceIdsAsync or how the relay-token auth path lists user-services
  • If GitHub Bad credentials → OAuth grant on the bot owner's GitHub binding is unhealthy on the relay-token path even though it's healthy on session-token reads; need to look at how NyxID picks the credential to inject for relay-scoped api-keys
  • If something else → analyze case-by-case

eanzhao and others added 2 commits April 25, 2026 21:37
After PR #418 merged to dev, production `/daily` from a Lark DM still
fails with `github_proxy_access_denied`. Manual reproduction with the
same NyxID account (session-token-minted child key, both narrow
`allow_all_services=false` and broad scopes) succeeds end-to-end. The
production-only failure means the deployed `POST /api-keys` payload or
the preflight response shape carries information we can't see in the
existing logs.

Adds three diagnostic log lines on the daily-report create path:
- Right before `nyxClient.CreateApiKeyAsync`: the resolved per-user
  `UserService.id`s and the literal payload JSON. Lets us verify
  whether the deployed binary actually emits `allow_all_services=false`
  and whether the resolver returned `UserService.id`s vs catalog ids.
- Right after a successful create: the new api-key id, so we can
  correlate to NyxID-side audit logs (`proxy_request_denied`).
- Inside `PreflightGitHubProxyAsync`: the *raw* probe response. The
  parsed `proxy_body` we return only carries the inner Lark/GitHub
  body string when SendAsync wraps non-2xx; the unparsed probe is the
  only place we see what NyxID actually returned. Distinguishes:
    - NyxID `ApiKeyScopeForbidden` (`error_code:9000`, our payload still wrong)
    - GitHub upstream 403 (`Bad credentials`, OAuth grant revoked)
    - other shapes
- And on preflight failure: the structured envelope we send back to
  the user (`github_proxy_access_denied`), so the failure record is
  self-contained in one log line.

Information-only, behavior unchanged. Tests pass (30/30 in
AgentBuilderToolTests). Intended to be reverted once the production
log captures the real failure shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 25, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.39%. Comparing base (87568b6) to head (c81879a).
⚠️ Report is 4 commits behind head on dev.

@@            Coverage Diff             @@
##              dev     #420      +/-   ##
==========================================
+ Coverage   70.37%   70.39%   +0.01%     
==========================================
  Files        1175     1175              
  Lines       84453    84453              
  Branches    11124    11124              
==========================================
+ Hits        59438    59447       +9     
+ Misses      20723    20715       -8     
+ Partials     4292     4291       -1     
Flag Coverage Δ
ci 70.39% <ø> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.
see 3 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

eanzhao added a commit that referenced this pull request Apr 25, 2026
The four `Diagnostic[#417]` log lines added in PR #420 captured the
production failure shape (GitHub 403 with body "Request forbidden by
administrative rules ... User-Agent header"), which led to the fix in
this PR (default User-Agent injection in `NyxIdApiClient`). The logs
were always intended to be temporary; they include full payload JSON
including api-key prefixes and don't have a long-term place in the
hot path.

Also tightens the comment block at the preflight call site to reflect
the new understanding: preflight catches misconfigurations that surface
at request time, not just OAuth revocation. Original case (#421) was a
missing User-Agent header.

421/421 ChannelRuntime tests + 12/12 AI tests passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant