Skip to content

Fix relay registration scope backfill#389

Merged
eanzhao merged 5 commits intodevfrom
fix/2026-04-24_relay-empty-scope-backfill
Apr 24, 2026
Merged

Fix relay registration scope backfill#389
eanzhao merged 5 commits intodevfrom
fix/2026-04-24_relay-empty-scope-backfill

Conversation

@eanzhao
Copy link
Copy Markdown
Contributor

@eanzhao eanzhao commented Apr 24, 2026

Problem

Existing Lark relay registrations can have an empty scope_id, which prevents relay callbacks from resolving the bot owner's Aevatar scope and applying the owner UserConfig.

Solution

  • Reject new Lark channel bot registrations that do not carry a canonical scope_id.
  • Resolve scope_id from the current NyxID request context for channel_registrations tool calls, so nyxid-chat should not ask the user for it.
  • Add rebuild/backfill logic that re-registers empty-scope Lark mirror entries with the resolved scope before dispatching projection rebuild.
  • Allow targeted repair by registration_id or nyx_agent_api_key_id, with force for deliberate bulk repair.

Verification

  • dotnet test test/Aevatar.GAgents.ChannelRuntime.Tests/Aevatar.GAgents.ChannelRuntime.Tests.csproj --filter "FullyQualifiedName~ChannelBotRegistrationGAgentTests|FullyQualifiedName~ChannelCallbackEndpointsTests|FullyQualifiedName~ChannelRegistrationToolTests|FullyQualifiedName~NyxLarkProvisioningServiceTests" --no-restore --nologo
  • bash tools/ci/test_stability_guards.sh

@eanzhao eanzhao changed the base branch from fix/2026-04-24_relay-scope-id-from-registration to dev April 24, 2026 11:00
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 448c251faa

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +73 to +74
if (registrationId is null && apiKeyId is null && !selection.Force && candidates.Length != 1)
{
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Gate single-candidate backfill behind explicit selector

The backfill path auto-repairs whenever there is exactly one empty-scope registration, even if the caller did not provide registration_id/nyx_agent_api_key_id. In that case the record is reassigned to the caller’s resolved scope_id without any ownership proof, so a manual rebuild from another tenant can silently claim an orphaned registration. The current guard only blocks multi-candidate bulk updates; it does not protect the single-candidate case.

Useful? React with 👍 / 👎.

Comment on lines +202 to +205
catch (JsonException ex)
{
logger.LogWarning(ex, "Invalid channel registration rebuild request payload");
return Results.BadRequest(new { error = "Invalid JSON" });
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Catch unsupported content types in rebuild request parsing

ReadOptionalRebuildRequestAsync uses ReadFromJsonAsync, which throws InvalidOperationException when Content-Type is not a JSON media type, but HandleRebuildRegistrationsAsync only catches JsonException. A rebuild request sent with a non-JSON body/content-type (for example a common curl -d default) will bubble an unhandled exception instead of returning a controlled client error, turning a recoverable input issue into a server failure path.

Useful? React with 👍 / 👎.

@eanzhao
Copy link
Copy Markdown
Contributor Author

eanzhao commented Apr 24, 2026

Reviewed — the shape is right (reject empty-scope at the edge, provide a backfill path) but the backfill authorization story has a couple of real holes. Posting what I’d want tightened before this goes to prod.

P1 — Force=true 把所有 empty-scope 条目强制挪到调用者的 scope,没有所有权校验

ChannelBotRegistrationScopeBackfill.BackfillAsyncregistrationId == null && apiKeyId == null && Force=true 时,会对所有 scope_id 为空的 Lark registration 用 normalizedScopeId(即调用者 JWT 的 scope_id claim 或 body 里的值)直接 DispatchRegisterAsync 重建一次。HandleRegister 侧没有任何所有者一致性校验(它根本不知道原 owner 是谁,这正是本 PR 要修复的历史坏数据)。

// ChannelBotRegistrationScopeBackfill.cs:L80-100
if (registrationId is null && apiKeyId is null && !selection.Force && candidates.Length != 1) { ... }

foreach (var entry in candidates)
{
    await ChannelBotRegistrationStoreCommands.DispatchRegisterAsync(
        actorRuntime, dispatchPort,
        new ChannelBotRegisterCommand { RequestedId = entry.Id, ScopeId = normalizedScopeId, ... }, ct);
}

在多租户部署里,任一持有合法 NyxID JWT 的用户(比如租户 A)调一次 POST /api/channels/registrations/rebuild {"force": true},就能把历史上由租户 B 创建、只是 scope_id 丢失的 Lark 桥接记录全部改挂到 A 的 scope 下——之后 relay 回调就会按 A 的 UserConfig 走 A 的 LLM route,B 的 bot 静默“易主”。这和 #385 里 Codex 标的 ambiguous api_key_id cross-tenant 的根因一样:缺一个 owner anchor。

建议至少选一个口子收紧:

  • Force 语义限定成“selector 命中多条时允许全部处理”,而不是“selector 为空时扫全表”——即 registrationId is null && apiKeyId is null && Force 直接拒绝。
  • 或者在 backfill 之前先调 NyxID(/api/v1/agents/api-keys/{id} 之类)验证 entry.NyxAgentApiKeyId 的 owner scope 和调用者 scope 一致,不一致的 entry 跳过、并在 note 里明确列出。
  • 或者把 Force 挪到只有运维/admin 角色才能走的后台端点。

P2 — 单选择器路径(registration_id / nyx_agent_api_key_id)同样没有所有权校验

即使不走 Force,调用者只要猜对/扫到一个 nyx_agent_api_key_id(它就在 GET /api/channels/registrations 列表里可见),就能把对应 entry 抢挂到自己的 scope。爆破面更窄,但根因一致。建议同上——要么 NyxID 侧反查 owner 核对,要么只接受能通过独立身份证明的选择器(例如仅允许 registration_id 匹配当前 caller 名下已知的 bot 集合)。

Info — HandleRegister 对 empty ScopeId 静默吞掉

// ChannelBotRegistrationGAgent.cs:L46-54
if (string.IsNullOrWhiteSpace(cmd.ScopeId))
{
    Logger.LogWarning(...);
    return;   // 没有 event、没有拒绝信号
}

定位成 defense-in-depth 的不变量守卫没问题(NyxLarkProvisioningService + HTTP endpoint + tool 都已经在上游拒了),但 LogWarning + silent return 会让“某条代码路径 bug 把 empty-scope 命令打到 actor”这件事在运维侧完全不可见。考虑改成 LogError + 一条 diagnostic 记录(IChannelRuntimeDiagnostics),或者至少 bump 到 metric 计数,这样合约破裂能被看到。

Info — backfill 重入时 CreatedAt 被覆盖成“现在”

HandleRegister 每次都 CreatedAt = Timestamp.FromDateTimeOffset(DateTimeOffset.UtcNow)ApplyRegistered 是 remove-then-add 的 upsert,所以 backfill 重建后原 entry 的原始创建时间丢失——历史取证能力会受影响。如果不打算保留(用 dedicated event 区分 scope 填充 vs 真正重新注册),至少在 note / 日志里记一下“backfill overwrites createdAt”。

测试覆盖缺口

新增的 tests 覆盖了 happy path(带 scope claim 能 backfill 1 条)和 missing-scope 拒绝。没有覆盖:

  • Force=true 多条 empty-scope 时的行为(= P1 的回归测试)。
  • body 里 scope_id 和 claim scope_id 不一致时 ResolveScopeId 返回 400(已经写了 if,但没测)。
  • backfill 中某一条 Dispatch 抛异常时的原子性(目前是 foreach 裸循环,一半成功一半失败,状态观察不到)。

总体方向 OK,P1 修掉就可以上。

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 24, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 69.90%. Comparing base (63520ab) to head (14da71e).
⚠️ Report is 6 commits behind head on dev.

@@            Coverage Diff             @@
##              dev     #389      +/-   ##
==========================================
- Coverage   69.91%   69.90%   -0.01%     
==========================================
  Files        1172     1172              
  Lines       83456    83456              
  Branches    10969    10969              
==========================================
- Hits        58346    58341       -5     
- Misses      20899    20902       +3     
- Partials     4211     4213       +2     
Flag Coverage Δ
ci 69.90% <ø> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.
see 3 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@eanzhao
Copy link
Copy Markdown
Contributor Author

eanzhao commented Apr 24, 2026

修复看了一遍,P1(跨租户)核心路径已经堵上,整体方向没问题。列几个仍值得收紧的点,都是 Info 级:

✅ 核心安全不变量已成立

  • hasExplicitSelector 守卫 + 「Force 只在 selector 命中多条时生效」的新语义,已经禁掉了「无选择器 Force 扫全表」这条路。
  • NyxRelayApiKeyOwnershipVerifier 走 NyxID 的 /api/v1/api-keys/{id} + /api/v1/users/me,对接 NyxID 侧 resolve_api_key_read_owner 的「admin 可读,member/viewer 只读」授权:
    • credential_source.type == "org"role != "admin" → 返回 nullapi_key_owner_scope_unresolved,member/viewer 无法劫持 org bot。✅
    • credential_source.type == "personal" → 依赖 NyxID 侧「跨 user 读不到对方 personal key」保证,/users/me 回到的 id 必然是 key 的 owner。✅
    • org 且 admin → org_id 必须等于 caller JWT claim 的 scope_id,不一致直接拒。✅
  • endpoint / tool 两条入口都把 bearer 透传进 ChannelBotRegistrationScopeBackfillAuthorizationINyxRelayApiKeyOwnershipVerifier 注入方式是 GetService(允许未注册时安全失败,不是 GetRequiredService),这层 defense-in-depth 也合理。

Info — credential_source.type == "personal" 隐式依赖 NyxID 读权限合同

// 走到 /users/me fallback 的前提是「caller 能 GET 这个 api-key 回来」
var currentUserResponse = await _nyxClient.GetCurrentUserAsync(accessToken, ct);

这条路径的正确性完全依赖 NyxID resolve_api_key_read_owner:personal key 不会被 owner 之外的用户读到。这个不变量现在成立(api_keys.rs:75-92),但这里没有显式防御——如果未来 NyxID 放宽了 personal key 的读策略(比如让同组织成员也能看),Aevatar 这里会静默跟着放开。

两种风格二选一:

  • 加一行注释点明这条依赖;或
  • 在 personal 分支里也做一次 /users/me.id == key.user_id 的显式比对(NyxID 的 api-key response 里有 user_id 字段——api_keys.rsenrich_api_keys_batch 会回它)。后者更鲁棒。

Info — api_key_owner_scope_unresolved 失败 detail 信息量不够

if (!string.Equals(role, "admin", StringComparison.OrdinalIgnoreCase))
    return null;
// → caller 看到: "failed NyxID api-key ownership verification: api_key_owner_scope_unresolved"

运维看到这条日志只能猜——是 role 不够、还是 credential_source 结构变了、还是 /users/me 回空。建议 Failure 的 detail 带上上下文,比如:

  • api_key_owner_scope_unresolved org_role=member
  • api_key_owner_scope_unresolved source_type=unknown
  • api_key_owner_scope_unresolved current_user_id_missing

测试缺口

  • 非 admin org 成员失败路径credential_source={type:"org", role:"member"} 时 verifier 返回 api_key_owner_scope_unresolved——目前没覆盖,这是 P1 的反向守卫。
  • NyxID 返回 error envelopeTryReadErrorEnvelope 分支(NyxID 回 404 / 401 / body 的场景)没测。
  • verifier 失败时 dispatch 不发出ChannelCallbackEndpointsTestsRecordingOwnershipVerifier 总是返 success,没测「verifier 拒绝 → 一个 DispatchRegisterAsync 都不发、ChannelBotRebuildProjectionCommand 仍然发」——这条原子性现在只是 foreach 在抛前 return,值得钉一个测试。

Info — backfill 失败时 projection rebuild 照发

// ChannelCallbackEndpoints.cs:L232-258
// backfill 可能返回 BackfilledRegistrations=0 + note="...ownership verification"
// 但后面无条件执行:
await ChannelBotRegistrationStoreCommands.DispatchRebuildProjectionAsync(...);

rebuild 和 backfill 在语义上是分开的动作(rebuild 对既有权威态做投影重放),这个分层 OK,但外部看到「status=accepted + empty_scope_registrations_backfilled=0 + note=权限验证失败」时容易误读成「操作成功」。要不要在 response 里加一个 backfill_status: "skipped|verified|rejected" 三态字段,或者当 backfill 拒绝且 caller 明确传了 selector 时把 HTTP status 降到 200 OKwarnings 数组。不是必须,但能让 CLI/UI 侧更好处理。

Nit

  • NyxRelayApiKeyOwnershipVerifier.csinternal,但 NyxRelayApiKeyOwnershipVerifierTests 能跨 assembly 访问说明有 InternalsVisibleTo——如果有,就地 ok;如果没加而测试靠反射,值得检查一下。
  • HandleRebuildRegistrationsAsync 新增了 InvalidOperationException 捕获(不支持的 content-type)——好;但 ReadOptionalRebuildRequestAsync 里并没有显式抛 InvalidOperationException,这个是靠 ReadFromJsonAsync 对非 JSON content-type 的默认行为。加一行注释说明会更清楚。

P1 已经解决,以上都是 Info,不阻塞合并。

@eanzhao eanzhao merged commit 010f5d9 into dev Apr 24, 2026
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant