feat(api-proxy): add startup API key validation#2200
Conversation
Add a validateApiKeys() function that probes each configured provider's API at startup to detect expired or invalid credentials before the agent starts making requests. This directly addresses issue #2185 where an expired COPILOT_GITHUB_TOKEN caused cryptic 401 errors deep in agent logs with no clear guidance on the fix. Key design: - Validates Copilot (GET /models), OpenAI (GET /v1/models), Anthropic (POST /v1/messages with anthropic-version header), and Gemini (GET /v1beta/models) tokens - Runs after all listeners are ready via a startup latch - Results exposed in /health endpoint (key_validation field) - Non-blocking by default (AWF_VALIDATE_KEYS=warn) — logs clear error messages but doesn't prevent startup - AWF_VALIDATE_KEYS=strict exits with code 1 on auth rejection - AWF_VALIDATE_KEYS=off disables validation entirely - Skips validation for custom API targets (non-default endpoints) - Skips COPILOT_API_KEY-only setups (no probe endpoint available) - Classifies errors as auth_rejected vs network_error vs inconclusive - Routes probe requests through Squid proxy (respects domain allowlist) - 10s timeout per probe, all probes run in parallel Closes #2185 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Pull request overview
Adds startup-time API key validation to the api-proxy sidecar so misconfigured/expired provider credentials (notably Copilot) are surfaced immediately with actionable logs and are exposed via /health.
Changes:
- Implement startup key validation probes for Copilot/OpenAI/Anthropic/Gemini with warn/strict/off modes.
- Expose validation status/results in the
/healthresponse and trigger validation after listener startup. - Add Jest coverage for the low-level
httpProbehelper.
Show a summary per file
| File | Description |
|---|---|
| containers/api-proxy/server.js | Adds key validation workflow, probe helpers, health reporting, and a startup latch to run validation after servers listen. |
| containers/api-proxy/server.test.js | Adds unit tests for httpProbe covering success, auth failure, bad request, connection refusal, and timeout. |
Copilot's findings
Tip
Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Files reviewed: 2/2 changed files
- Comments generated: 4
|
|
||
| const req = mod.request(reqOpts, (res) => { | ||
| // Consume body to free the socket | ||
| res.resume(); | ||
| res.on('end', () => resolve(res.statusCode)); | ||
| }); | ||
|
|
||
| req.on('timeout', () => { | ||
| req.destroy(new Error(`Probe timed out after ${timeoutMs}ms`)); | ||
| }); | ||
| req.on('error', reject); |
There was a problem hiding this comment.
httpProbe only resolves on the response 'end' event and only rejects on request errors/timeouts. If the response stream emits an 'error' (or the socket closes early), this Promise can hang and keep validateApiKeys() from ever setting keyValidationComplete. Consider also rejecting on res.on('error') (and/or resolving on 'close' as a fallback).
| const req = mod.request(reqOpts, (res) => { | |
| // Consume body to free the socket | |
| res.resume(); | |
| res.on('end', () => resolve(res.statusCode)); | |
| }); | |
| req.on('timeout', () => { | |
| req.destroy(new Error(`Probe timed out after ${timeoutMs}ms`)); | |
| }); | |
| req.on('error', reject); | |
| let settled = false; | |
| const resolveOnce = (statusCode) => { | |
| if (settled) return; | |
| settled = true; | |
| resolve(statusCode); | |
| }; | |
| const rejectOnce = (err) => { | |
| if (settled) return; | |
| settled = true; | |
| reject(err); | |
| }; | |
| const req = mod.request(reqOpts, (res) => { | |
| // Consume body to free the socket | |
| res.resume(); | |
| res.on('end', () => resolveOnce(res.statusCode)); | |
| res.on('error', rejectOnce); | |
| res.on('close', () => resolveOnce(res.statusCode)); | |
| }); | |
| req.on('timeout', () => { | |
| req.destroy(new Error(`Probe timed out after ${timeoutMs}ms`)); | |
| }); | |
| req.on('error', rejectOnce); |
| // Startup latch: count expected listeners, run validation when all are ready | ||
| let expectedListeners = 1; // port 10000 (always) | ||
| if (ANTHROPIC_API_KEY) expectedListeners++; | ||
| if (COPILOT_AUTH_TOKEN) expectedListeners++; | ||
| if (GEMINI_API_KEY) expectedListeners++; | ||
| if (OPENAI_API_KEY || ANTHROPIC_API_KEY || COPILOT_AUTH_TOKEN) expectedListeners++; // OpenCode (10004) | ||
| let readyListeners = 0; | ||
| function onListenerReady() { | ||
| readyListeners++; | ||
| if (readyListeners === expectedListeners) { | ||
| logRequest('info', 'startup_complete', { message: `All ${expectedListeners} listeners ready, starting key validation` }); |
There was a problem hiding this comment.
The startup latch counts the Gemini listener only when GEMINI_API_KEY is set, but the process still starts a server on port 10003 even when the key is missing (503 handler). As a result, validateApiKeys() can run before all listener ports are actually bound, and the startup_complete log message can be inaccurate. Consider counting port 10003 in expectedListeners unconditionally and calling onListenerReady() in the no-key Gemini branch (or clarify the latch semantics in comments/logs).
| // Startup latch: count expected listeners, run validation when all are ready | |
| let expectedListeners = 1; // port 10000 (always) | |
| if (ANTHROPIC_API_KEY) expectedListeners++; | |
| if (COPILOT_AUTH_TOKEN) expectedListeners++; | |
| if (GEMINI_API_KEY) expectedListeners++; | |
| if (OPENAI_API_KEY || ANTHROPIC_API_KEY || COPILOT_AUTH_TOKEN) expectedListeners++; // OpenCode (10004) | |
| let readyListeners = 0; | |
| function onListenerReady() { | |
| readyListeners++; | |
| if (readyListeners === expectedListeners) { | |
| logRequest('info', 'startup_complete', { message: `All ${expectedListeners} listeners ready, starting key validation` }); | |
| // Startup latch: count listeners that participate in startup key validation, | |
| // then run validation when that subset is ready. This does not necessarily | |
| // include every port that may be bound for disabled-provider fallback handling. | |
| let expectedListeners = 1; // port 10000 (always) | |
| if (ANTHROPIC_API_KEY) expectedListeners++; | |
| if (COPILOT_AUTH_TOKEN) expectedListeners++; | |
| if (GEMINI_API_KEY) expectedListeners++; // Count Gemini only when it participates in key validation | |
| if (OPENAI_API_KEY || ANTHROPIC_API_KEY || COPILOT_AUTH_TOKEN) expectedListeners++; // OpenCode (10004) | |
| let readyListeners = 0; | |
| function onListenerReady() { | |
| readyListeners++; | |
| if (readyListeners === expectedListeners) { | |
| logRequest('info', 'startup_complete', { message: `All ${expectedListeners} validation-participating listeners ready, starting key validation` }); |
| async function validateApiKeys() { | ||
| const mode = (process.env.AWF_VALIDATE_KEYS || 'warn').toLowerCase(); // off | warn | strict |
There was a problem hiding this comment.
New startup key-validation behavior (validateApiKeys/probeProvider) is not covered by tests (only httpProbe is). Adding unit tests for at least: mode handling (off/warn/strict), skip behavior for custom targets/COPILOT_API_KEY-only, and status classification (auth_rejected vs inconclusive) would prevent regressions and make strict-mode CI behavior safer to change.
| it('should reject on connection refused', async () => { | ||
| await expect( | ||
| httpProbe('http://127.0.0.1:19999/health', { | ||
| method: 'GET', | ||
| headers: {}, | ||
| }, 5000) | ||
| ).rejects.toThrow(); | ||
| }); |
There was a problem hiding this comment.
This test uses a hard-coded port (19999) to simulate connection refused. That can be flaky if something happens to be listening on that port in the test environment. Consider allocating an unused port deterministically (e.g., start a server on port 0, capture the port, close it, then probe it) so the refusal is guaranteed.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
Port and adapt 18 tests from PR #2199 (copilot-swe-agent) covering the validateApiKeys orchestrator for all four providers: - OpenAI: valid (200), auth_rejected (401), skipped (custom target), no-op (no key) - Anthropic: valid (400 = key accepted), auth_rejected (401, 403), skipped (custom target) - Copilot: valid (200 with ghu_ token), auth_rejected (401), skipped (custom target, BYOK mode) - Gemini: valid (200), auth_rejected (403), skipped (custom target) - Cross-cutting: network_error (timeout), no-op (no keys at all) To make validateApiKeys testable without module-level state: - Added overrides parameter for injecting keys/targets in tests - Exported keyValidationResults and resetKeyValidationState() - Used 'in' operator for override resolution (supports explicit undefined) Co-authored-by: copilot-swe-agent[bot] <198982749+copilot-swe-agent[bot]@users.noreply.github.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
| const anthropicTarget = ov('anthropicTarget', ANTHROPIC_API_TARGET); | ||
| const copilotGithubToken = ov('copilotGithubToken', COPILOT_GITHUB_TOKEN); | ||
| const copilotApiKey = ov('copilotApiKey', COPILOT_API_KEY); | ||
| const copilotAuthToken = ov('copilotAuthToken', COPILOT_AUTH_TOKEN); |
- httpProbe: add settle-once guard with resolveOnce/rejectOnce to prevent hanging if response stream errors or socket closes early; also handle res 'error' and 'close' events - Startup latch: clarify comment that only validation-participating listeners are counted (no-key Gemini 503 handler excluded) - Test: replace hard-coded port 19999 with dynamic port allocation to prevent flakiness when something listens on that port Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
🤖 Copilot Engine Smoke Test Results
Overall: PASS Author: @lpcox | Assignees: none
|
🔥 Smoke Test: Copilot BYOK (Offline) Mode
Running in BYOK offline mode ( Overall: PASS — PR by @lpcox, no assignees.
|
|
Smoke Test Results
Status: PASS
|
🏗️ Build Test Suite Results
Overall: 0/8 ecosystems passed — ❌ FAIL Error detailsAll repositories failed to clone with the same error: The
|
|
Smoke test report
Warning The following domain was blocked by the firewall during workflow execution:
To allow these domains, add them to the network:
allowed:
- defaults
- "registry.npmjs.org"See Network Configuration for more information.
|
Chroot Version Comparison Results
Overall: ❌ Not all tests passed — Python and Node.js versions differ between host and chroot.
|
Smoke Test Results: GitHub Actions Services Connectivity
All checks failed. The
|
Problem
Smoke Copilot has been failing since April 24 with cryptic
HTTP 401errors from the api-proxy (issue #2185). The root cause is an expiredCOPILOT_GITHUB_TOKENsecret, but the error only appeared deep in agent process logs with no clear guidance:The api-proxy had no mechanism to detect this at startup — it would happily proxy requests with an expired token and let the upstream API reject them.
Solution
Add startup API key validation to the api-proxy sidecar. After all HTTP listeners are ready, the proxy probes each configured provider's API with a lightweight request:
GET /modelsGET /v1/modelsPOST /v1/messages(empty body)GET /v1beta/modelsKey design decisions
AWF_VALIDATE_KEYS=warn): logs clear error messages but doesn't prevent startup. The agent might have other working providers.AWF_VALIDATE_KEYS=strict): exits with code 1 on auth rejection — useful for CI smoke tests./models).auth_rejected(401/403) fromnetwork_error(proxy/DNS issues) frominconclusive(unexpected status)./healthresponse underkey_validation.Example log output (what operators will see)
{"level":"error","event":"key_validation_failed","provider":"copilot","message":"COPILOT API key validation failed — HTTP 401 — token expired or invalid. Rotate the secret and re-run."}Testing
httpProbecovering 200, 401, 400, connection refused, and timeout casesRoot cause note
The immediate fix for #2185 is to rotate the expired
COPILOT_GITHUB_TOKENrepo secret. This PR adds the diagnostic infrastructure to catch such issues at startup in the future, with clear error messages pointing to the fix.Closes #2185