feat(api-proxy): add startup API key validation #2200

Merged
lpcox merged 3 commits into main from fix/copilot-token-validation
Apr 24, 2026

lpcox commented Apr 24, 2026

Problem

Smoke Copilot has been failing since April 24 with cryptic HTTP 401 errors from the api-proxy (issue #2185). The root cause is an expired COPILOT_GITHUB_TOKEN secret, but the error only appeared deep in agent process logs with no clear guidance:

Authentication failed with provider at http://172.30.0.30:10002 (HTTP 401).
  Check your COPILOT_PROVIDER_API_KEY or COPILOT_PROVIDER_BEARER_TOKEN.

The api-proxy had no mechanism to detect this at startup — it would happily proxy requests with an expired token and let the upstream API reject them.

Solution

Add startup API key validation to the api-proxy sidecar. After all HTTP listeners are ready, the proxy probes each configured provider's API with a lightweight request:

| Provider | Probe | Valid | Invalid |
| --- | --- | --- | --- |
| Copilot | GET /models | 200 | 401/403 |
| OpenAI | GET /v1/models | 200 | 401/403 |
| Anthropic | POST /v1/messages (empty body) | 400 | 401/403 |
| Gemini | GET /v1beta/models | 200 | 401/403 |
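
The table above amounts to a small lookup from provider and HTTP status to a validation outcome. The following is a minimal sketch of that mapping, not the code in server.js: the names `VALID_STATUS` and `classifyProbeStatus` are hypothetical, and only the status codes and the outcome labels (valid, auth_rejected, inconclusive) come from this PR.

```javascript
// Hypothetical helper (names are assumptions for illustration).
// The "valid" status per provider is taken from the probe table above.
const VALID_STATUS = {
  copilot: 200,
  openai: 200,
  anthropic: 400, // an empty-body POST rejected as a bad request means the key itself was accepted
  gemini: 200,
};

function classifyProbeStatus(provider, statusCode) {
  // 401/403 mean the upstream API rejected the credential.
  if (statusCode === 401 || statusCode === 403) return 'auth_rejected';
  if (statusCode === VALID_STATUS[provider]) return 'valid';
  // Any other status neither confirms nor rejects the key.
  return 'inconclusive';
}

console.log(classifyProbeStatus('anthropic', 400));
console.log(classifyProbeStatus('copilot', 401));
```

Note the Anthropic row: a 400 counts as success because the request got past authentication before being rejected for its empty body.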

Key design decisions

  • Non-blocking by default (AWF_VALIDATE_KEYS=warn): logs clear error messages but doesn't prevent startup. The agent might have other working providers.
  • Strict mode (AWF_VALIDATE_KEYS=strict): exits with code 1 on auth rejection — useful for CI smoke tests.
  • Custom targets skipped: validation only runs against known default API endpoints. Custom/enterprise targets may have different probe endpoints.
  • COPILOT_API_KEY-only skipped: no probe endpoint works with direct API keys (only COPILOT_GITHUB_TOKEN supports /models).
  • Error classification: distinguishes auth_rejected (401/403) from network_error (proxy/DNS issues) from inconclusive (unexpected status).
  • Health endpoint integration: validation results are exposed in /health response under key_validation.
  • Startup latch: validation waits for all listener ports to be ready before running probes.
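
To make the mode semantics concrete, here is a rough sketch of how the warn/strict/off setting could translate into a startup decision. The env-var parsing mirrors the line shown in this PR's diff; `resolveMode`, `shouldExit`, and the shape of the `results` object (`{ provider: { status } }`) are assumptions for illustration only.

```javascript
// Hypothetical sketch; function names and the results shape are assumptions.
function resolveMode(env = process.env) {
  // Defaults to 'warn' when AWF_VALIDATE_KEYS is unset.
  return (env.AWF_VALIDATE_KEYS || 'warn').toLowerCase(); // off | warn | strict
}

function shouldExit(mode, results) {
  // Only strict mode turns an auth rejection into a fatal startup error.
  // Network errors and inconclusive probes never cause an exit here.
  if (mode !== 'strict') return false;
  return Object.values(results).some((r) => r.status === 'auth_rejected');
}

const results = { copilot: { status: 'auth_rejected' }, openai: { status: 'valid' } };
console.log(shouldExit(resolveMode({ AWF_VALIDATE_KEYS: 'strict' }), results));
console.log(shouldExit(resolveMode({}), results));
```

Keeping the exit decision limited to auth_rejected means a transient proxy or DNS failure would not fail a strict-mode CI run under this sketch.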

Example log output (what operators will see)

{"level":"error","event":"key_validation_failed","provider":"copilot","message":"COPILOT API key validation failed — HTTP 401 — token expired or invalid. Rotate the secret and re-run."}

Testing

  • 5 new tests for httpProbe covering 200, 401, 400, connection refused, and timeout cases
  • All 257 api-proxy tests pass
  • All other existing repo tests pass; the docker-manager test failures are pre-existing and unrelated to this change

Root cause note

The immediate fix for #2185 is to rotate the expired COPILOT_GITHUB_TOKEN repo secret. This PR adds the diagnostic infrastructure to catch such issues at startup in the future, with clear error messages pointing to the fix.

Closes #2185

Add a validateApiKeys() function that probes each configured provider's
API at startup to detect expired or invalid credentials before the agent
starts making requests. This directly addresses issue #2185 where an
expired COPILOT_GITHUB_TOKEN caused cryptic 401 errors deep in agent
logs with no clear guidance on the fix.

Key design:
- Validates Copilot (GET /models), OpenAI (GET /v1/models), Anthropic
  (POST /v1/messages with anthropic-version header), and Gemini
  (GET /v1beta/models) tokens
- Runs after all listeners are ready via a startup latch
- Results exposed in /health endpoint (key_validation field)
- Non-blocking by default (AWF_VALIDATE_KEYS=warn) — logs clear error
  messages but doesn't prevent startup
- AWF_VALIDATE_KEYS=strict exits with code 1 on auth rejection
- AWF_VALIDATE_KEYS=off disables validation entirely
- Skips validation for custom API targets (non-default endpoints)
- Skips COPILOT_API_KEY-only setups (no probe endpoint available)
- Classifies errors as auth_rejected vs network_error vs inconclusive
- Routes probe requests through Squid proxy (respects domain allowlist)
- 10s timeout per probe, all probes run in parallel

Closes #2185

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
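
The "all probes run in parallel" and "network_error" points in the commit message above can be sketched with `Promise.allSettled`. This is an illustrative outline under stated assumptions, not the orchestration in server.js: `runProbes` and the result shape are hypothetical, and each probe is assumed to be a function returning a promise of an HTTP status code that rejects on timeout or connection failure.

```javascript
// Hypothetical orchestration sketch (names and result shape are assumptions).
// probes: { providerName: () => Promise<number> }
async function runProbes(probes) {
  const names = Object.keys(probes);
  // allSettled waits for every probe, so one failure cannot hide the others.
  const settled = await Promise.allSettled(names.map((name) => probes[name]()));
  const results = {};
  settled.forEach((outcome, i) => {
    results[names[i]] = outcome.status === 'fulfilled'
      ? { statusCode: outcome.value }
      : { status: 'network_error', error: outcome.reason.message };
  });
  return results;
}

// Usage with stand-in probes instead of real HTTP requests.
runProbes({
  openai: async () => 200,
  gemini: async () => { throw new Error('Probe timed out after 10000ms'); },
}).then((results) => console.log(results));
```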
@lpcox lpcox requested a review from Mossaka as a code owner April 24, 2026 17:56
Copilot AI review requested due to automatic review settings April 24, 2026 17:56
@lpcox lpcox mentioned this pull request Apr 24, 2026
Copilot AI left a comment

Pull request overview

Adds startup-time API key validation to the api-proxy sidecar so misconfigured/expired provider credentials (notably Copilot) are surfaced immediately with actionable logs and are exposed via /health.

Changes:

  • Implement startup key validation probes for Copilot/OpenAI/Anthropic/Gemini with warn/strict/off modes.
  • Expose validation status/results in the /health response and trigger validation after listener startup.
  • Add Jest coverage for the low-level httpProbe helper.

| File | Description |
| --- | --- |
| containers/api-proxy/server.js | Adds key validation workflow, probe helpers, health reporting, and a startup latch to run validation after servers listen. |
| containers/api-proxy/server.test.js | Adds unit tests for httpProbe covering success, auth failure, bad request, connection refusal, and timeout. |

Copilot's findings

  • Files reviewed: 2/2 changed files
  • Comments generated: 4

Comment thread containers/api-proxy/server.js Outdated
Comment on lines +1078 to +1088

const req = mod.request(reqOpts, (res) => {
  // Consume body to free the socket
  res.resume();
  res.on('end', () => resolve(res.statusCode));
});

req.on('timeout', () => {
  req.destroy(new Error(`Probe timed out after ${timeoutMs}ms`));
});
req.on('error', reject);

Copilot AI Apr 24, 2026

httpProbe only resolves on the response 'end' event and only rejects on request errors/timeouts. If the response stream emits an 'error' (or the socket closes early), this Promise can hang and keep validateApiKeys() from ever setting keyValidationComplete. Consider also rejecting on res.on('error') (and/or resolving on 'close' as a fallback).

Suggested change
- const req = mod.request(reqOpts, (res) => {
-   // Consume body to free the socket
-   res.resume();
-   res.on('end', () => resolve(res.statusCode));
- });
- req.on('timeout', () => {
-   req.destroy(new Error(`Probe timed out after ${timeoutMs}ms`));
- });
- req.on('error', reject);
+ let settled = false;
+ const resolveOnce = (statusCode) => {
+   if (settled) return;
+   settled = true;
+   resolve(statusCode);
+ };
+ const rejectOnce = (err) => {
+   if (settled) return;
+   settled = true;
+   reject(err);
+ };
+ const req = mod.request(reqOpts, (res) => {
+   // Consume body to free the socket
+   res.resume();
+   res.on('end', () => resolveOnce(res.statusCode));
+   res.on('error', rejectOnce);
+   res.on('close', () => resolveOnce(res.statusCode));
+ });
+ req.on('timeout', () => {
+   req.destroy(new Error(`Probe timed out after ${timeoutMs}ms`));
+ });
+ req.on('error', rejectOnce);

Comment thread containers/api-proxy/server.js Outdated
Comment on lines +1140 to +1150
// Startup latch: count expected listeners, run validation when all are ready
let expectedListeners = 1; // port 10000 (always)
if (ANTHROPIC_API_KEY) expectedListeners++;
if (COPILOT_AUTH_TOKEN) expectedListeners++;
if (GEMINI_API_KEY) expectedListeners++;
if (OPENAI_API_KEY || ANTHROPIC_API_KEY || COPILOT_AUTH_TOKEN) expectedListeners++; // OpenCode (10004)
let readyListeners = 0;
function onListenerReady() {
  readyListeners++;
  if (readyListeners === expectedListeners) {
    logRequest('info', 'startup_complete', { message: `All ${expectedListeners} listeners ready, starting key validation` });

Copilot AI Apr 24, 2026

The startup latch counts the Gemini listener only when GEMINI_API_KEY is set, but the process still starts a server on port 10003 even when the key is missing (503 handler). As a result, validateApiKeys() can run before all listener ports are actually bound, and the startup_complete log message can be inaccurate. Consider counting port 10003 in expectedListeners unconditionally and calling onListenerReady() in the no-key Gemini branch (or clarify the latch semantics in comments/logs).

Suggested change
- // Startup latch: count expected listeners, run validation when all are ready
- let expectedListeners = 1; // port 10000 (always)
- if (ANTHROPIC_API_KEY) expectedListeners++;
- if (COPILOT_AUTH_TOKEN) expectedListeners++;
- if (GEMINI_API_KEY) expectedListeners++;
- if (OPENAI_API_KEY || ANTHROPIC_API_KEY || COPILOT_AUTH_TOKEN) expectedListeners++; // OpenCode (10004)
- let readyListeners = 0;
- function onListenerReady() {
-   readyListeners++;
-   if (readyListeners === expectedListeners) {
-     logRequest('info', 'startup_complete', { message: `All ${expectedListeners} listeners ready, starting key validation` });
+ // Startup latch: count listeners that participate in startup key validation,
+ // then run validation when that subset is ready. This does not necessarily
+ // include every port that may be bound for disabled-provider fallback handling.
+ let expectedListeners = 1; // port 10000 (always)
+ if (ANTHROPIC_API_KEY) expectedListeners++;
+ if (COPILOT_AUTH_TOKEN) expectedListeners++;
+ if (GEMINI_API_KEY) expectedListeners++; // Count Gemini only when it participates in key validation
+ if (OPENAI_API_KEY || ANTHROPIC_API_KEY || COPILOT_AUTH_TOKEN) expectedListeners++; // OpenCode (10004)
+ let readyListeners = 0;
+ function onListenerReady() {
+   readyListeners++;
+   if (readyListeners === expectedListeners) {
+     logRequest('info', 'startup_complete', { message: `All ${expectedListeners} validation-participating listeners ready, starting key validation` });

Comment thread containers/api-proxy/server.js Outdated
Comment on lines +915 to +916
async function validateApiKeys() {
  const mode = (process.env.AWF_VALIDATE_KEYS || 'warn').toLowerCase(); // off | warn | strict

Copilot AI Apr 24, 2026

New startup key-validation behavior (validateApiKeys/probeProvider) is not covered by tests (only httpProbe is). Adding unit tests for at least: mode handling (off/warn/strict), skip behavior for custom targets/COPILOT_API_KEY-only, and status classification (auth_rejected vs inconclusive) would prevent regressions and make strict-mode CI behavior safer to change.

Comment on lines +1037 to +1044
it('should reject on connection refused', async () => {
  await expect(
    httpProbe('http://127.0.0.1:19999/health', {
      method: 'GET',
      headers: {},
    }, 5000)
  ).rejects.toThrow();
});

Copilot AI Apr 24, 2026

This test uses a hard-coded port (19999) to simulate connection refused. That can be flaky if something happens to be listening on that port in the test environment. Consider allocating an unused port deterministically (e.g., start a server on port 0, capture the port, close it, then probe it) so the refusal is guaranteed.
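
The port-0 approach described in this comment could look roughly like the helper below. It is a sketch only: `getUnusedPort` is a hypothetical name, and the technique still leaves a small window in which another process could claim the port after it is released, so it reduces flakiness rather than eliminating it.

```javascript
const net = require('net');

// Hypothetical test helper: ask the OS for a free port by binding to port 0,
// read back the assigned port, then close the server so the port is unbound.
function getUnusedPort() {
  return new Promise((resolve, reject) => {
    const server = net.createServer();
    server.once('error', reject);
    server.listen(0, '127.0.0.1', () => {
      const { port } = server.address();
      // Resolve only after the server has fully closed.
      server.close((err) => (err ? reject(err) : resolve(port)));
    });
  });
}

getUnusedPort().then((port) => console.log(`probe target: http://127.0.0.1:${port}/health`));
```

A connection-refused test could then await `getUnusedPort()` and pass the resulting URL to httpProbe in place of the fixed port 19999.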

Port and adapt 18 tests from PR #2199 (copilot-swe-agent) covering the
validateApiKeys orchestrator for all four providers:

- OpenAI: valid (200), auth_rejected (401), skipped (custom target),
  no-op (no key)
- Anthropic: valid (400 = key accepted), auth_rejected (401, 403),
  skipped (custom target)
- Copilot: valid (200 with ghu_ token), auth_rejected (401),
  skipped (custom target, BYOK mode)
- Gemini: valid (200), auth_rejected (403), skipped (custom target)
- Cross-cutting: network_error (timeout), no-op (no keys at all)

To make validateApiKeys testable without module-level state:
- Added overrides parameter for injecting keys/targets in tests
- Exported keyValidationResults and resetKeyValidationState()
- Used 'in' operator for override resolution (supports explicit undefined)

Co-authored-by: copilot-swe-agent[bot] <198982749+copilot-swe-agent[bot]@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
const anthropicTarget = ov('anthropicTarget', ANTHROPIC_API_TARGET);
const copilotGithubToken = ov('copilotGithubToken', COPILOT_GITHUB_TOKEN);
const copilotApiKey = ov('copilotApiKey', COPILOT_API_KEY);
const copilotAuthToken = ov('copilotAuthToken', COPILOT_AUTH_TOKEN);
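
The `ov(name, fallback)` calls above rely on the 'in' operator so that a test can pass an explicit undefined. Only the call shape appears in this PR; the body below is a guess at how such a resolver might work, and `makeOv` is a hypothetical factory name introduced for illustration.

```javascript
// Hypothetical reconstruction; only the ov(name, fallback) call shape is from the PR.
function makeOv(overrides = {}) {
  // 'in' distinguishes "key present but undefined" from "key absent",
  // which `overrides[name] ?? fallback` or `||` could not do.
  return (name, fallback) => (name in overrides ? overrides[name] : fallback);
}

const ov = makeOv({ copilotGithubToken: undefined, anthropicTarget: 'api.example.test' });
console.log(ov('copilotGithubToken', 'token-from-env')); // explicit undefined wins
console.log(ov('anthropicTarget', 'default-target'));    // override wins
console.log(ov('copilotApiKey', 'key-from-env'));        // absent key falls back
```

This lets a test simulate "no token configured" even when the module-level constant was populated from the environment.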
- httpProbe: add settle-once guard with resolveOnce/rejectOnce to
  prevent hanging if response stream errors or socket closes early;
  also handle res 'error' and 'close' events
- Startup latch: clarify comment that only validation-participating
  listeners are counted (no-key Gemini 503 handler excluded)
- Test: replace hard-coded port 19999 with dynamic port allocation
  to prevent flakiness when something listens on that port

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions

🤖 Copilot Engine Smoke Test Results

Overall: PASS

Author: @lpcox | Assignees: none

📰 BREAKING: Report filed by Smoke Copilot

@github-actions

🔥 Smoke Test: Copilot BYOK (Offline) Mode

Test Result
GitHub MCP (list merged PRs) ✅ PR #2171 "feat: add Gemini engine smoke test workflow"
GitHub.com connectivity (HTTP 200)
File write/read
BYOK inference (this response)

Running in BYOK offline mode (COPILOT_OFFLINE=true) via api-proxy → api.githubcopilot.com

Overall: PASS — PR by @lpcox, no assignees.

🔑 BYOK report filed by Smoke Copilot BYOK

@github-actions

Smoke Test Results

  • ✅ GitHub MCP: Last 2 merged PRs retrieved
  • ✅ Playwright: github.com page title verified
  • ✅ File Writing: Test file created successfully
  • ✅ Bash: File contents verified

Status: PASS

💥 [THE END] — Illustrated by Smoke Claude

@github-actions

🏗️ Build Test Suite Results

⚠️ ALL CLONES FAILED — The firewall/proxy blocked access to all external test repositories (Mossaka/*). All gh repo clone commands returned HTTP 403.

Ecosystem Project Build/Install Tests Status
Bun elysia N/A ❌ CLONE_FAILED
Bun hono N/A ❌ CLONE_FAILED
C++ fmt N/A ❌ CLONE_FAILED
C++ json N/A ❌ CLONE_FAILED
Deno oak N/A ❌ CLONE_FAILED
Deno std N/A ❌ CLONE_FAILED
.NET hello-world N/A ❌ CLONE_FAILED
.NET json-parse N/A ❌ CLONE_FAILED
Go color N/A ❌ CLONE_FAILED
Go env N/A ❌ CLONE_FAILED
Go uuid N/A ❌ CLONE_FAILED
Java gson N/A ❌ CLONE_FAILED
Java caffeine N/A ❌ CLONE_FAILED
Node.js clsx N/A ❌ CLONE_FAILED
Node.js execa N/A ❌ CLONE_FAILED
Node.js p-limit N/A ❌ CLONE_FAILED
Rust fd N/A ❌ CLONE_FAILED
Rust zoxide N/A ❌ CLONE_FAILED

Overall: 0/8 ecosystems passed — ❌ FAIL

Error details

All repositories failed to clone with the same error:

remote: access denied: unrecognized endpoint
fatal: unable to access '(localhost/redacted) The requested URL returned error: 403

The gh CLI proxy sidecar (localhost:18443) rejected requests to the Mossaka organization repositories. This indicates the firewall is not permitting access to these external repositories in this environment.

Generated by Build Test Suite for issue #2200

@github-actions github-actions Bot mentioned this pull request Apr 24, 2026
@github-actions

Smoke test report
PR titles:

  • feat(api-proxy): add startup API key validation
  • fix: check binary existence for gh-aw install instead of gh aw --version
  1. GitHub MCP testing (last 2 merged PRs reviewed): ✅
  2. safeinputs-gh query test: ❌
  3. Playwright title contains "GitHub": ✅
  4. Tavily search test: ❌
  5. File write: ✅
  6. Bash cat readback: ✅
  7. Discussion oracle comment: ✅
  8. npm ci && npm run build: ✅

Overall status: FAIL

Warning

⚠️ Firewall blocked 1 domain

The following domain was blocked by the firewall during workflow execution:

  • registry.npmjs.org

To allow these domains, add them to the network.allowed list in your workflow frontmatter:

network:
  allowed:
    - defaults
    - "registry.npmjs.org"

See Network Configuration for more information.

🔮 The oracle has spoken through Smoke Codex

@github-actions

Chroot Version Comparison Results

| Runtime | Host Version | Chroot Version | Match? |
| --- | --- | --- | --- |
| Python | Python 3.12.13 | Python 3.12.3 | ❌ NO |
| Node.js | v24.14.1 | v20.20.2 | ❌ NO |
| Go | go1.22.12 | go1.22.12 | ✅ YES |

Overall: ❌ Not all tests passed — Python and Node.js versions differ between host and chroot.

Tested by Smoke Chroot

@github-actions

Smoke Test Results: GitHub Actions Services Connectivity

| Check | Status | Details |
| --- | --- | --- |
| Redis PING (host.docker.internal:6379) | ❌ FAILED | redis-cli not available; apt-get is non-functional in this environment |
| PostgreSQL pg_isready (host.docker.internal:5432) | ❌ FAILED | no response (exit code 2) |
| PostgreSQL SELECT 1 (smoketest db) | ❌ FAILED | Host unreachable (skipped after pg_isready failure) |

All checks failed. The host.docker.internal hostname is not reachable from this runner environment, and package installation via apt-get is unavailable (no /etc/apt/sources.list). The smoke-services label was not applied.

🔌 Service connectivity validated by Smoke Services

@lpcox lpcox merged commit c7d506a into main Apr 24, 2026
63 of 68 checks passed