## Summary
The Copilot CLI's hosted mode (no BYOK; default Copilot Enterprise
premium-request path) authenticates and runs the agent loop
correctly, but it does not autonomously call `write` / `edit` /
`create_file` tools even when the prompt explicitly asks for code
to be generated and the workspace is empty. The agent makes
exploratory tool calls (`view`, `glob`, `shell`, `grep`) and emits
text responses, then closes the session without writing any files.
This makes the Copilot CLI unsuitable for unattended autonomous
codegen workloads (e.g. CI pipelines, factory tiers) regardless of
which model is selected.
## Environment

- copilot CLI: 0.0.353 / 1.0.36-0
- github-copilot-sdk: 0.3.0 (Python wrapper)
- Auth: GitHub fine-grained PAT (`github_pat_*`) with Copilot scope
  in the `COPILOT_GITHUB_TOKEN` env var
- Models tried (selected via the SDK `model=` parameter):
  `claude-sonnet-4.5`, `gpt-5`
## Reproducer

```python
import asyncio

from copilot import CopilotClient, SubprocessConfig


async def main() -> None:
    client = CopilotClient(
        config=SubprocessConfig(
            github_token="<github_pat_with_copilot_scope>",
            use_logged_in_user=True,
        ),
    )
    session = await client.create_session(
        model="claude-sonnet-4.5",  # also seen with gpt-5
        working_directory="/tmp/empty-dir",
        github_token="<github_pat_with_copilot_scope>",
        on_permission_request=async_handler_returning_allow,
    )
    # Realistic codegen prompt — full NLSpec/LangSpec inlined,
    # detailed task description, target package layout, acceptance
    # criteria, the works. Same prompt that produces 14 passing
    # files via Anthropic's Claude Agent SDK.
    response = await session.send_and_wait(prompt, timeout=60 * 30)
    print(await session.get_messages())  # zero write tool calls


asyncio.run(main())
```
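`async_handler_returning_allow` above is our own helper, not an SDK export. As a rough sketch of what it does (the request/response dict shape is an assumption from our wrapper, not a documented Copilot SDK contract):

```python
# Hypothetical sketch of the auto-allow handler passed to
# on_permission_request. The dict shapes here are assumptions from
# our wrapper, not a documented Copilot SDK contract; adapt to
# whatever the SDK actually sends.
async def async_handler_returning_allow(request: dict) -> dict:
    # Log what the agent asked for, then approve unconditionally.
    # Appropriate only for throwaway sandboxes like /tmp/empty-dir.
    print(f"permission requested: {request.get('tool', '<unknown>')}")
    return {"behavior": "allow"}
```

With this handler every wire-level permission request is granted, so any refusal to write cannot be blamed on denied approvals.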
## Expected behavior

The agent — given an explicit prompt asking for code to be written
to a workspace, with permissions auto-granted — calls `write_file`
/ `edit_file` / `create_file` (or whatever tool name the CLI's tool
catalog uses) to produce the requested artifacts. The same prompt
produces working code via Anthropic's Claude Agent SDK
(`claude_agent_sdk.query`) and OpenAI's Codex SDK
(`codex_app_server.Codex`).
## Actual behavior

- 12 successful LLM round-trips against `claude-sonnet-4.5` over
  ~110 seconds; ~12 premium requests consumed.
- Tool calls observed in the Copilot CLI log: `view` ×4, `glob` ×3,
  `shell` ×2, `grep` ×1. Zero `write` / `edit` / `create_file` calls.
- Session closes cleanly without errors. Workspace stays empty.
- Same outcome with `gpt-5` instead of Claude Sonnet 4.5
  (different scaffolding test under BYOK→OpenAI direct).
- Same outcome with `qwen3-coder-next:51b` under BYOK→Ollama (where
  the CLI auto-allows but the model still doesn't commit writes).
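The zero-write outcome is mechanically detectable, which matters for unattended runs. A minimal sketch of a guard over the transcript returned by `session.get_messages()` (the `tool_calls`/`name` dict shape is an assumed normalization of our wrapper's output, not a documented Copilot CLI schema):

```python
from collections import Counter

# Tools that mutate the workspace, vs. read-only exploration.
# Tool names vary by CLI version, so we match a superset.
WRITE_TOOLS = {"write", "edit", "create_file", "write_file", "edit_file"}


def tally_tool_calls(messages: list[dict]) -> Counter:
    """Count tool calls by name across a transcript of message dicts.

    Assumes each message may carry a "tool_calls" list of
    {"name": ...} entries. That shape is an assumption about our
    wrapper's normalized output, not a documented schema.
    """
    counts: Counter = Counter()
    for msg in messages:
        for call in msg.get("tool_calls", []):
            counts[call["name"]] += 1
    return counts


def wrote_anything(messages: list[dict]) -> bool:
    """True if any write-capable tool was called at least once."""
    return any(name in WRITE_TOOLS for name in tally_tool_calls(messages))
```

A CI runner can fail fast on `wrote_anything(...) == False` instead of discovering an empty workspace downstream.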
The pattern is consistent: the Copilot CLI's agent loop appears
biased against autonomous file writes regardless of the underlying
model. The CLI's system prompt or tool-use posture is likely tuned
for interactive developer assistance ("answer the question,
suggest, let the human review") rather than autonomous codegen.
## Why this matters
We're standing up a code-agent abstraction layer above Microsoft
Agent Framework where each vendor's CLI / SDK is a peer adapter.
Anthropic's Claude Agent SDK (`claude-sonnet-4.5`) and OpenAI's
Codex SDK (`gpt-5.4` via Foundry) both produce passing test suites
on the same prompt that yields zero files via Copilot CLI. The
Copilot CLI is the outlier; the model is fine.
For organizations with GitHub Copilot Enterprise quota who want to
use Copilot CLI for unattended autonomous codegen (CI runners,
factory tiers, batch jobs), this behavior gap is a hard block —
even though the auth, billing, and wire layers all work.
## Suggested fixes

- Add a `permission_mode: "bypassPermissions"`-style option
  that biases the agent loop toward autonomous file writes when
  the calling environment is non-interactive, mirroring the Claude
  Agent SDK's `permission_mode="bypassPermissions"`. The
  `on_permission_request` callback handles the wire-level
  approval; the agent loop's "should I write or just suggest"
  disposition is separate and currently leans "suggest."
- Document the autonomous-codegen posture clearly, so operators
  know the CLI is interactive-first and should not expect it to
  write files autonomously without explicit prompting tweaks.
- Surface a system-prompt override / agent-mode selector so
  the SDK caller can opt the agent loop into
  "implement-and-write" mode (the current behavior would remain
  the default for human-in-the-loop usage).
## Workarounds we don't think work cleanly

- Prompt engineering ("you MUST write the files using the
  `write_file` tool, ABSOLUTELY DO NOT just describe what you
  would do"): the agent's posture appears system-prompt-rooted,
  not user-prompt steerable in our experiments.
- Pre-creating empty stub files in the workspace: the agent calls
  `view` to inspect them but still doesn't `edit` to populate them.
- Different model selection: tested with `claude-sonnet-4.5`,
  `gpt-5`, and `qwen3-coder-next:51b` (BYOK Ollama). Same outcome
  across all three.
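The remaining fallback is scraping the agent's text response for fenced code blocks and writing the files ourselves, which reintroduces exactly the brittleness the agent loop is supposed to remove. A minimal sketch, assuming (optimistically) that the model labels each fence with a `path=` hint, which in practice it rarely does consistently:

```python
import re
from pathlib import Path

# Matches fences like: ```python path=src/app.py ... ```
# The "path=" labelling convention is our own assumption; models
# rarely emit it reliably, which is why this is not a clean
# workaround either.
FENCE = re.compile(r"```\w*\s+path=(\S+)\n(.*?)```", re.DOTALL)


def scrape_files(response_text: str, root: Path) -> list[Path]:
    """Write every path-labelled fenced block under root; return paths."""
    written: list[Path] = []
    for rel_path, body in FENCE.findall(response_text):
        target = root / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(body)
        written.append(target)
    return written
```

Unlabelled fences, partial files, and "rest unchanged" ellipses all defeat this, so we treat it as a measure of the gap rather than a fix.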
Companion issue: #
Related (different scope):