feat(api-proxy): add startup API key validation by lpcox · Pull Request #2200 · github/gh-aw-firewall

lpcox · 2026-04-24T17:56:21Z

Problem

Smoke Copilot has been failing since April 24 with cryptic HTTP 401 errors from the api-proxy (issue #2185). The root cause is an expired COPILOT_GITHUB_TOKEN secret, but the error only appeared deep in agent process logs with no clear guidance:

Authentication failed with provider at http://172.30.0.30:10002 (HTTP 401).
  Check your COPILOT_PROVIDER_API_KEY or COPILOT_PROVIDER_BEARER_TOKEN.

The api-proxy had no mechanism to detect this at startup — it would happily proxy requests with an expired token and let the upstream API reject them.

Solution

Add startup API key validation to the api-proxy sidecar. After all HTTP listeners are ready, the proxy probes each configured provider's API with a lightweight request:

Provider	Probe	Valid	Invalid
Copilot	`GET /models`	200	401/403
OpenAI	`GET /v1/models`	200	401/403
Anthropic	`POST /v1/messages` (empty body)	400	401/403
Gemini	`GET /v1beta/models`	200	401/403
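
The probe table can be read as a small classification function. The sketch below is illustrative only: `classifyProbeStatus` and `EXPECTED_VALID_STATUS` are hypothetical names, not necessarily those used in `server.js`, and treating every other status as `inconclusive` is an assumption based on the error classification described in this PR.

```javascript
// Hypothetical sketch: map a probe's HTTP status to a validation outcome,
// following the probe table in this PR description.
const EXPECTED_VALID_STATUS = {
  copilot: 200,   // GET /models
  openai: 200,    // GET /v1/models
  anthropic: 400, // POST /v1/messages with an empty body: 400 means the key was accepted
  gemini: 200,    // GET /v1beta/models
};

function classifyProbeStatus(provider, statusCode) {
  if (!Object.prototype.hasOwnProperty.call(EXPECTED_VALID_STATUS, provider)) {
    throw new Error(`Unknown provider: ${provider}`);
  }
  // 401/403 mean the upstream API rejected the credential.
  if (statusCode === 401 || statusCode === 403) return 'auth_rejected';
  if (statusCode === EXPECTED_VALID_STATUS[provider]) return 'valid';
  // Assumption: any other status neither confirms nor rejects the key.
  return 'inconclusive';
}
```

Note the Anthropic row: the probe is deliberately malformed, so a 400 response shows that the request passed authentication before the body was rejected.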

Key design decisions

Non-blocking by default (AWF_VALIDATE_KEYS=warn): logs clear error messages but doesn't prevent startup. The agent might have other working providers.
Strict mode (AWF_VALIDATE_KEYS=strict): exits with code 1 on auth rejection — useful for CI smoke tests.
Custom targets skipped: validation only runs against known default API endpoints. Custom/enterprise targets may have different probe endpoints.
COPILOT_API_KEY-only skipped: no probe endpoint works with direct API keys (only COPILOT_GITHUB_TOKEN supports /models).
Error classification: distinguishes auth_rejected (401/403) from network_error (proxy/DNS issues) from inconclusive (unexpected status).
Health endpoint integration: validation results are exposed in /health response under key_validation.
Startup latch: validation waits for all listener ports to be ready before running probes.
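
As a rough illustration of the warn/strict split described above, the mode can be resolved once and then used to decide whether an auth rejection should stop the process. This is a sketch under assumptions: `resolveValidationMode`, `exitCodeFor`, the `{ provider, status }` result shape, and the fallback to `warn` for unrecognized values are all hypothetical; only the `AWF_VALIDATE_KEYS` variable and its `off`, `warn`, and `strict` values come from this PR.

```javascript
// Hypothetical sketch of AWF_VALIDATE_KEYS mode handling.
const VALID_MODES = ['off', 'warn', 'strict'];

function resolveValidationMode(env = process.env) {
  const mode = (env.AWF_VALIDATE_KEYS || 'warn').toLowerCase();
  // Assumption: unrecognized values fall back to the default, 'warn'.
  return VALID_MODES.includes(mode) ? mode : 'warn';
}

// Returns the exit code the proxy should use, or null to keep running.
// Only an auth rejection is fatal, and only in strict mode; network errors
// and inconclusive probes never stop startup in this sketch.
function exitCodeFor(mode, results) {
  const rejected = results.some((r) => r.status === 'auth_rejected');
  return mode === 'strict' && rejected ? 1 : null;
}
```

A caller might write `const code = exitCodeFor(mode, results); if (code !== null) process.exit(code);` once all probes have settled.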

Example log output (what operators will see)

{"level":"error","event":"key_validation_failed","provider":"copilot","message":"COPILOT API key validation failed — HTTP 401 — token expired or invalid. Rotate the secret and re-run."}

Testing

5 new tests for httpProbe covering 200, 401, 400, connection refused, and timeout cases
All 257 api-proxy tests pass
All other existing repo tests pass (the docker-manager test failures are pre-existing and unrelated)

Root cause note

The immediate fix for #2185 is to rotate the expired COPILOT_GITHUB_TOKEN repo secret. This PR adds the diagnostic infrastructure to catch such issues at startup in the future, with clear error messages pointing to the fix.

Closes #2185

Add a validateApiKeys() function that probes each configured provider's API at startup to detect expired or invalid credentials before the agent starts making requests. This directly addresses issue #2185 where an expired COPILOT_GITHUB_TOKEN caused cryptic 401 errors deep in agent logs with no clear guidance on the fix.

Key design:
- Validates Copilot (GET /models), OpenAI (GET /v1/models), Anthropic (POST /v1/messages with anthropic-version header), and Gemini (GET /v1beta/models) tokens
- Runs after all listeners are ready via a startup latch
- Results exposed in /health endpoint (key_validation field)
- Non-blocking by default (AWF_VALIDATE_KEYS=warn): logs clear error messages but doesn't prevent startup
- AWF_VALIDATE_KEYS=strict exits with code 1 on auth rejection
- AWF_VALIDATE_KEYS=off disables validation entirely
- Skips validation for custom API targets (non-default endpoints)
- Skips COPILOT_API_KEY-only setups (no probe endpoint available)
- Classifies errors as auth_rejected vs network_error vs inconclusive
- Routes probe requests through Squid proxy (respects domain allowlist)
- 10s timeout per probe, all probes run in parallel

Closes #2185

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
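
The "10s timeout per probe, all probes run in parallel" behavior could look roughly like the following. This is a minimal sketch, not the code in this commit: `withTimeout` and `runProbes` are hypothetical names, the probes are plain functions returning a promise of a status code, and the timeout is applied with `Promise.race` rather than at the request level as `httpProbe` does.

```javascript
// Hypothetical sketch: run all provider probes in parallel, each bounded by a timeout.
function withTimeout(promise, timeoutMs) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`Probe timed out after ${timeoutMs}ms`)),
      timeoutMs
    );
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// probes: { providerName: () => Promise<statusCode> }
async function runProbes(probes, timeoutMs = 10_000) {
  const entries = Object.entries(probes);
  // allSettled never rejects, so one failing provider cannot hide the others.
  const settled = await Promise.allSettled(
    entries.map(([, probe]) => withTimeout(Promise.resolve().then(probe), timeoutMs))
  );
  return entries.map(([provider], i) => {
    const outcome = settled[i];
    return outcome.status === 'fulfilled'
      ? { provider, statusCode: outcome.value }
      : { provider, error: outcome.reason.message };
  });
}
```

Because the probes start together, total validation time is bounded by the slowest probe (at most one timeout) rather than the sum of all of them.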

Copilot

Pull request overview

Adds startup-time API key validation to the api-proxy sidecar so misconfigured/expired provider credentials (notably Copilot) are surfaced immediately with actionable logs and are exposed via /health.

Changes:

Implement startup key validation probes for Copilot/OpenAI/Anthropic/Gemini with warn/strict/off modes.
Expose validation status/results in the /health response and trigger validation after listener startup.
Add Jest coverage for the low-level httpProbe helper.

Summary per file

File	Description
containers/api-proxy/server.js	Adds key validation workflow, probe helpers, health reporting, and a startup latch to run validation after servers listen.
containers/api-proxy/server.test.js	Adds unit tests for `httpProbe` covering success, auth failure, bad request, connection refusal, and timeout.

Copilot's findings

Files reviewed: 2/2 changed files
Comments generated: 4

Copilot · 2026-04-24T17:59:32Z

+
+    const req = mod.request(reqOpts, (res) => {
+      // Consume body to free the socket
+      res.resume();
+      res.on('end', () => resolve(res.statusCode));
+    });
+
+    req.on('timeout', () => {
+      req.destroy(new Error(`Probe timed out after ${timeoutMs}ms`));
+    });
+    req.on('error', reject);


httpProbe only resolves on the response 'end' event and only rejects on request errors/timeouts. If the response stream emits an 'error' (or the socket closes early), this Promise can hang and keep validateApiKeys() from ever setting keyValidationComplete. Consider also rejecting on res.on('error') (and/or resolving on 'close' as a fallback).

Suggested change

-    const req = mod.request(reqOpts, (res) => {
-      // Consume body to free the socket
-      res.resume();
-      res.on('end', () => resolve(res.statusCode));
-    });
-    req.on('timeout', () => {
-      req.destroy(new Error(`Probe timed out after ${timeoutMs}ms`));
-    });
-    req.on('error', reject);
+    let settled = false;
+    const resolveOnce = (statusCode) => {
+      if (settled) return;
+      settled = true;
+      resolve(statusCode);
+    };
+    const rejectOnce = (err) => {
+      if (settled) return;
+      settled = true;
+      reject(err);
+    };
+    const req = mod.request(reqOpts, (res) => {
+      // Consume body to free the socket
+      res.resume();
+      res.on('end', () => resolveOnce(res.statusCode));
+      res.on('error', rejectOnce);
+      res.on('close', () => resolveOnce(res.statusCode));
+    });
+    req.on('timeout', () => {
+      req.destroy(new Error(`Probe timed out after ${timeoutMs}ms`));
+    });
+    req.on('error', rejectOnce);

Copilot · 2026-04-24T17:59:33Z

+  // Startup latch: count expected listeners, run validation when all are ready
+  let expectedListeners = 1; // port 10000 (always)
+  if (ANTHROPIC_API_KEY) expectedListeners++;
+  if (COPILOT_AUTH_TOKEN) expectedListeners++;
+  if (GEMINI_API_KEY) expectedListeners++;
+  if (OPENAI_API_KEY || ANTHROPIC_API_KEY || COPILOT_AUTH_TOKEN) expectedListeners++; // OpenCode (10004)
+  let readyListeners = 0;
+  function onListenerReady() {
+    readyListeners++;
+    if (readyListeners === expectedListeners) {
+      logRequest('info', 'startup_complete', { message: `All ${expectedListeners} listeners ready, starting key validation` });


The startup latch counts the Gemini listener only when GEMINI_API_KEY is set, but the process still starts a server on port 10003 even when the key is missing (503 handler). As a result, validateApiKeys() can run before all listener ports are actually bound, and the startup_complete log message can be inaccurate. Consider counting port 10003 in expectedListeners unconditionally and calling onListenerReady() in the no-key Gemini branch (or clarify the latch semantics in comments/logs).

Suggested change

-  // Startup latch: count expected listeners, run validation when all are ready
-  let expectedListeners = 1; // port 10000 (always)
-  if (ANTHROPIC_API_KEY) expectedListeners++;
-  if (COPILOT_AUTH_TOKEN) expectedListeners++;
-  if (GEMINI_API_KEY) expectedListeners++;
-  if (OPENAI_API_KEY || ANTHROPIC_API_KEY || COPILOT_AUTH_TOKEN) expectedListeners++; // OpenCode (10004)
-  let readyListeners = 0;
-  function onListenerReady() {
-    readyListeners++;
-    if (readyListeners === expectedListeners) {
-      logRequest('info', 'startup_complete', { message: `All ${expectedListeners} listeners ready, starting key validation` });
+  // Startup latch: count listeners that participate in startup key validation,
+  // then run validation when that subset is ready. This does not necessarily
+  // include every port that may be bound for disabled-provider fallback handling.
+  let expectedListeners = 1; // port 10000 (always)
+  if (ANTHROPIC_API_KEY) expectedListeners++;
+  if (COPILOT_AUTH_TOKEN) expectedListeners++;
+  if (GEMINI_API_KEY) expectedListeners++; // Count Gemini only when it participates in key validation
+  if (OPENAI_API_KEY || ANTHROPIC_API_KEY || COPILOT_AUTH_TOKEN) expectedListeners++; // OpenCode (10004)
+  let readyListeners = 0;
+  function onListenerReady() {
+    readyListeners++;
+    if (readyListeners === expectedListeners) {
+      logRequest('info', 'startup_complete', { message: `All ${expectedListeners} validation-participating listeners ready, starting key validation` });
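
For readers unfamiliar with the pattern, the startup latch discussed in this comment is a counter that fires a callback once the expected number of listeners have reported ready. The sketch below is a generic, hypothetical version (`createStartupLatch` is not a function in `server.js`); it uses `>=` and a `fired` flag so that a miscounted or extra listener cannot trigger validation twice.

```javascript
// Hypothetical sketch of a counting startup latch.
function createStartupLatch(expected, onAllReady) {
  let ready = 0;
  let fired = false;
  // Each server passes this function to its listen() callback.
  return function onListenerReady() {
    ready++;
    if (!fired && ready >= expected) {
      fired = true; // guard: run the callback at most once
      onAllReady(ready);
    }
  };
}
```

With this shape, each listener would call something like `server.listen(port, onListenerReady)`. Which ports count toward `expected` remains the design question raised in the review comment.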

Copilot · 2026-04-24T17:59:33Z

+async function validateApiKeys() {
+  const mode = (process.env.AWF_VALIDATE_KEYS || 'warn').toLowerCase(); // off | warn | strict


New startup key-validation behavior (validateApiKeys/probeProvider) is not covered by tests (only httpProbe is). Adding unit tests for at least: mode handling (off/warn/strict), skip behavior for custom targets/COPILOT_API_KEY-only, and status classification (auth_rejected vs inconclusive) would prevent regressions and make strict-mode CI behavior safer to change.

Copilot · 2026-04-24T17:59:33Z

+  it('should reject on connection refused', async () => {
+    await expect(
+      httpProbe('http://127.0.0.1:19999/health', {
+        method: 'GET',
+        headers: {},
+      }, 5000)
+    ).rejects.toThrow();
+  });


This test uses a hard-coded port (19999) to simulate connection refused. That can be flaky if something happens to be listening on that port in the test environment. Consider allocating an unused port deterministically (e.g., start a server on port 0, capture the port, close it, then probe it) so the refusal is guaranteed.
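
One way to get a port that is very likely closed, along the lines suggested here, is to let the OS assign one, record it, and release it before probing. The helper name `getUnusedPort` is hypothetical, and the approach narrows rather than eliminates the race: another process could still bind the port between `close()` and the probe.

```javascript
const net = require('node:net');

// Hypothetical test helper: ask the OS for a free port, then release it.
function getUnusedPort() {
  return new Promise((resolve, reject) => {
    const server = net.createServer();
    server.once('error', reject);
    // Port 0 tells the OS to pick any available port.
    server.listen(0, '127.0.0.1', () => {
      const { port } = server.address();
      // Resolve only after the socket is fully closed, so nothing listens on it.
      server.close((err) => (err ? reject(err) : resolve(port)));
    });
  });
}
```

The connection-refused test could then build its URL from the returned port instead of hard-coding 19999.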

Port and adapt 18 tests from PR #2199 (copilot-swe-agent) covering the validateApiKeys orchestrator for all four providers:
- OpenAI: valid (200), auth_rejected (401), skipped (custom target), no-op (no key)
- Anthropic: valid (400 = key accepted), auth_rejected (401, 403), skipped (custom target)
- Copilot: valid (200 with ghu_ token), auth_rejected (401), skipped (custom target, BYOK mode)
- Gemini: valid (200), auth_rejected (403), skipped (custom target)
- Cross-cutting: network_error (timeout), no-op (no keys at all)

To make validateApiKeys testable without module-level state:
- Added overrides parameter for injecting keys/targets in tests
- Exported keyValidationResults and resetKeyValidationState()
- Used 'in' operator for override resolution (supports explicit undefined)

Co-authored-by: copilot-swe-agent[bot] <198982749+copilot-swe-agent[bot]@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

+  const anthropicTarget = ov('anthropicTarget', ANTHROPIC_API_TARGET);
+  const copilotGithubToken = ov('copilotGithubToken', COPILOT_GITHUB_TOKEN);
+  const copilotApiKey = ov('copilotApiKey', COPILOT_API_KEY);
+  const copilotAuthToken = ov('copilotAuthToken', COPILOT_AUTH_TOKEN);
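
The `ov(...)` calls above rely on override resolution with the `in` operator. A minimal sketch of how such a resolver might work follows; `makeOverrideResolver` is a hypothetical name, and the actual `ov` helper in `server.js` may differ.

```javascript
// Hypothetical sketch of override resolution with the `in` operator.
function makeOverrideResolver(overrides = {}) {
  // `in` tests for key presence, not truthiness, so a test can pass
  // { copilotGithubToken: undefined } to simulate "no token configured".
  // A `??` or `||` fallback would silently restore the module-level value.
  return (name, fallback) => (name in overrides ? overrides[name] : fallback);
}
```

For example, `makeOverrideResolver({ copilotGithubToken: undefined })('copilotGithubToken', 'ghu_real')` yields `undefined`, while a key that is absent from the overrides object falls back to the module-level value.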


- httpProbe: add settle-once guard with resolveOnce/rejectOnce to prevent hanging if response stream errors or socket closes early; also handle res 'error' and 'close' events
- Startup latch: clarify comment that only validation-participating listeners are counted (no-key Gemini 503 handler excluded)
- Test: replace hard-coded port 19999 with dynamic port allocation to prevent flakiness when something listens on that port

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

github-actions · 2026-04-24T18:11:34Z

🤖 Copilot Engine Smoke Test Results

MCP GitHub: ✅ Listed merged PR #2171 "feat: add Gemini engine smoke test workflow"
GitHub.com connectivity: ✅ HTTP 200/301
File write/read: ✅ smoke-test-copilot-24904610437.txt verified

Overall: PASS

Author: @lpcox | Assignees: none

📰 BREAKING: Report filed by Smoke Copilot

github-actions · 2026-04-24T18:11:49Z

🔥 Smoke Test: Copilot BYOK (Offline) Mode

Test	Result
GitHub MCP (list merged PRs)	✅ PR #2171 "feat: add Gemini engine smoke test workflow"
GitHub.com connectivity (HTTP 200)	✅
File write/read	✅
BYOK inference (this response)	✅

Running in BYOK offline mode (COPILOT_OFFLINE=true) via api-proxy → api.githubcopilot.com

Overall: PASS — PR by @lpcox, no assignees.

🔑 BYOK report filed by Smoke Copilot BYOK

github-actions · 2026-04-24T18:11:55Z

Smoke Test Results

✅ GitHub MCP: Last 2 merged PRs retrieved
✅ Playwright: github.com page title verified
✅ File Writing: Test file created successfully
✅ Bash: File contents verified

Status: PASS

💥 [THE END] — Illustrated by Smoke Claude

github-actions · 2026-04-24T18:12:21Z

🏗️ Build Test Suite Results

⚠️ ALL CLONES FAILED — The firewall/proxy blocked access to all external test repositories (Mossaka/*). All gh repo clone commands returned HTTP 403.

Ecosystem	Project	Build/Install	Tests	Status
Bun	elysia	❌	N/A	❌ CLONE_FAILED
Bun	hono	❌	N/A	❌ CLONE_FAILED
C++	fmt	❌	N/A	❌ CLONE_FAILED
C++	json	❌	N/A	❌ CLONE_FAILED
Deno	oak	❌	N/A	❌ CLONE_FAILED
Deno	std	❌	N/A	❌ CLONE_FAILED
.NET	hello-world	❌	N/A	❌ CLONE_FAILED
.NET	json-parse	❌	N/A	❌ CLONE_FAILED
Go	color	❌	N/A	❌ CLONE_FAILED
Go	env	❌	N/A	❌ CLONE_FAILED
Go	uuid	❌	N/A	❌ CLONE_FAILED
Java	gson	❌	N/A	❌ CLONE_FAILED
Java	caffeine	❌	N/A	❌ CLONE_FAILED
Node.js	clsx	❌	N/A	❌ CLONE_FAILED
Node.js	execa	❌	N/A	❌ CLONE_FAILED
Node.js	p-limit	❌	N/A	❌ CLONE_FAILED
Rust	fd	❌	N/A	❌ CLONE_FAILED
Rust	zoxide	❌	N/A	❌ CLONE_FAILED

Overall: 0/8 ecosystems passed — ❌ FAIL

Error details

All repositories failed to clone with the same error:

remote: access denied: unrecognized endpoint
fatal: unable to access '(localhost/redacted) The requested URL returned error: 403

The gh CLI proxy sidecar (localhost:18443) rejected requests to the Mossaka organization repositories. This indicates the firewall is not permitting access to these external repositories in this environment.

Generated by Build Test Suite for issue #2200

github-actions · 2026-04-24T18:12:49Z

Smoke test report
PR titles:

feat(api-proxy): add startup API key validation
fix: check binary existence for gh-aw install instead of gh aw --version

GitHub MCP testing (last 2 merged PRs reviewed): ✅
safeinputs-gh query test: ❌
Playwright title contains "GitHub": ✅
Tavily search test: ❌
File write: ✅
Bash cat readback: ✅
Discussion oracle comment: ✅
npm ci && npm run build: ✅
Overall status: FAIL

Warning

⚠️ Firewall blocked 1 domain

The following domain was blocked by the firewall during workflow execution:

registry.npmjs.org

To allow these domains, add them to the network.allowed list in your workflow frontmatter:

network:
  allowed:
    - defaults
    - "registry.npmjs.org"

See Network Configuration for more information.

🔮 The oracle has spoken through Smoke Codex

github-actions · 2026-04-24T18:12:56Z

Chroot Version Comparison Results

Runtime	Host Version	Chroot Version	Match?
Python	Python 3.12.13	Python 3.12.3	❌ NO
Node.js	v24.14.1	v20.20.2	❌ NO
Go	go1.22.12	go1.22.12	✅ YES

Overall: ❌ Not all tests passed — Python and Node.js versions differ between host and chroot.

Tested by Smoke Chroot

github-actions · 2026-04-24T18:13:25Z

Smoke Test Results: GitHub Actions Services Connectivity

Check	Status	Details
Redis PING (`host.docker.internal:6379`)	❌ FAILED	`redis-cli` not available — `apt-get` is non-functional in this environment
PostgreSQL `pg_isready` (`host.docker.internal:5432`)	❌ FAILED	`no response` (exit code 2)
PostgreSQL `SELECT 1` (`smoketest` db)	❌ FAILED	Host unreachable (skipped after `pg_isready` failure)

All checks failed. The host.docker.internal hostname is not reachable from this runner environment, and package installation via apt-get is unavailable (no /etc/apt/sources.list). The smoke-services label was not applied.

🔌 Service connectivity validated by Smoke Services

lpcox requested a review from Mossaka as a code owner April 24, 2026 17:56

Copilot AI review requested due to automatic review settings April 24, 2026 17:56

lpcox mentioned this pull request Apr 24, 2026

[aw] Smoke Copilot failed #2185

Closed

Copilot started reviewing on behalf of lpcox April 24, 2026 17:56

github-actions Bot added the smoke-copilot-byok label Apr 24, 2026

Copilot AI reviewed Apr 24, 2026

github-actions Bot added the smoke-claude label Apr 24, 2026

github-actions Bot added the build-test label Apr 24, 2026

lpcox mentioned this pull request Apr 24, 2026

feat(api-proxy): fail-fast API key validation at startup #2199

Closed

github-advanced-security AI found potential problems Apr 24, 2026

github-actions Bot added the smoke-copilot label Apr 24, 2026

github-actions Bot mentioned this pull request Apr 24, 2026

[aw] No-Op Runs #2151

Open

lpcox merged commit c7d506a into main Apr 24, 2026
63 of 68 checks passed

lpcox deleted the fix/copilot-token-validation branch April 24, 2026 18:25

-    const req = mod.request(reqOpts, (res) => {
-      // Consume body to free the socket
-      res.resume();
-      res.on('end', () => resolve(res.statusCode));
-    });
-    req.on('timeout', () => {
-      req.destroy(new Error(`Probe timed out after ${timeoutMs}ms`));
-    });
-    req.on('error', reject);
+    let settled = false;
+    const resolveOnce = (statusCode) => {
+      if (settled) return;
+      settled = true;
+      resolve(statusCode);
+    };
+    const rejectOnce = (err) => {
+      if (settled) return;
+      settled = true;
+      reject(err);
+    };
+    const req = mod.request(reqOpts, (res) => {
+      // Consume body to free the socket
+      res.resume();
+      res.on('end', () => resolveOnce(res.statusCode));
+      res.on('error', rejectOnce);
+      res.on('close', () => resolveOnce(res.statusCode));
+    });
+    req.on('timeout', () => {
+      req.destroy(new Error(`Probe timed out after ${timeoutMs}ms`));
+    });
+    req.on('error', rejectOnce);

+async function validateApiKeys() {
+  const mode = (process.env.AWF_VALIDATE_KEYS || 'warn').toLowerCase(); // off | warn | strict

3 participants