fix(cli): pin UTF-8 encoding on init-options and .extensionignore I/O by Quratulain-bilal · Pull Request #2686 · github/spec-kit

Quratulain-bilal · 2026-05-23T10:44:44Z

What

Path.read_text() / Path.write_text() default to the system locale codec, which is cp1252 / gb2312 / cp932 on Windows.
Two user-facing file paths in spec-kit were calling them without an explicit encoding= argument:

src/specify_cli/__init__.py:400, 412 — save_init_options / load_init_options for
.specify/init-options.json. A peer machine with a different default locale (or a UTF-8 Unix CI runner reading a file
written on a cp1252 Windows host) cannot decode the file, raising UnicodeDecodeError. UnicodeDecodeError is a
subclass of ValueError — not OSError / json.JSONDecodeError — so the existing fall-back except tuple in
load_init_options also misses it and the error propagates raw to the CLI.
src/specify_cli/extensions.py:764 — .extensionignore pattern reader. The very next line already normalises
backslashes "so Windows-authored files work", proving the codebase expects Windows authors to write this file.
Multibyte UTF-8 patterns (Chinese filenames, accented directory names) silently mojibake when the host locale is not
UTF-8, so the patterns fail to match and unintended files are shipped with the extension.

Reproducer

# On a non-UTF-8 Windows host (e.g. cp1252):
from specify_cli import save_init_options, load_init_options
save_init_options(Path("."), {"project_name": "café"})
# File written as cp1252.

# On a UTF-8 host (or another Windows host with a different default codec):
load_init_options(Path("."))
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9...
# — propagates raw because UnicodeDecodeError is a ValueError, not an OSError.

Why this matters

The sibling integration-catalog reader at src/specify_cli/integrations/catalog.py:150,156,193,202,374 already pins
encoding="utf-8" everywhere. PR fix(powershell): ensure UTF-8 templates are written without BOM #2280 (f684305) fixed the symmetric PowerShell-template BOM bug. The two paths in
this PR were the remaining drifted ones.
init-options.json is meant to be a portable record of how a project was scaffolded — a peer cloning the repo on a
different OS / locale must be able to read it. Today they can't if the original author's project name (or any future
field) contains non-ASCII.
.extensionignore already explicitly supports Windows authors (see line 766). UTF-8 patterns are part of that same
contract.

The change

src/specify_cli/__init__.py — pin encoding="utf-8" on both write_text and read_text; extend the existing
except tuple in load_init_options to include UnicodeDecodeError so a peer file written in a non-UTF-8 codec falls
back to {} per the existing contract instead of crashing.
src/specify_cli/extensions.py — pin encoding="utf-8" on the .extensionignore reader.

Tests

tests/test_presets.py::TestInitOptions — parametrized non-ASCII round-trip (CJK / Latin-1 / Greek / emoji) plus a
0xe9-byte corrupted-file fallback test.
tests/test_extensions.py::TestExtensionIgnore — Japanese (ドキュメント/) and Latin-1 (café/) ignore patterns
correctly exclude their directories during install.

$ python -m pytest tests/test_presets.py::TestInitOptions tests/test_extensions.py::TestExtensionIgnore -q
22 passed in 4.56s
$ python -m ruff check src/
All checks passed!

Scope

Intentionally narrow: no behaviour change for ASCII content (UTF-8 is a superset). Only non-ASCII content that previously
round-tripped accidentally (when host locale happened to be UTF-8) or silently mojibaked (when it wasn't) now
round-trips reliably across all hosts.

``Path.read_text`` / ``Path.write_text`` default to the system locale codec, which is cp1252 / gb2312 / cp932 on Windows. Two user-facing file paths in spec-kit were calling them without an explicit ``encoding=`` argument: - ``src/specify_cli/__init__.py:400,412`` — ``save_init_options`` / ``load_init_options`` for ``.specify/init-options.json``. A peer machine with a different default locale (or a UTF-8 Unix CI runner reading a file written on a cp1252 Windows host) cannot decode the file, raising ``UnicodeDecodeError``. ``UnicodeDecodeError`` is a subclass of ``ValueError`` — not ``OSError`` / ``json.JSONDecodeError`` — so the existing fall-back ``except`` tuple in ``load_init_options`` also misses it and the error propagates raw to the CLI. - ``src/specify_cli/extensions.py:764`` — ``.extensionignore`` pattern reader. The very next line already normalises backslashes "so Windows-authored files work", proving the codebase expects Windows authors to write this file. Multibyte UTF-8 patterns (Chinese filenames, accented directory names) silently mojibake when the host locale is not UTF-8, so the patterns fail to match and unintended files are shipped with the extension. The sibling integration-catalog reader at ``src/specify_cli/integrations/catalog.py:150,156,193,202,374`` already pins ``encoding="utf-8"`` everywhere. PR github#2280 fixed the symmetric PowerShell-template BOM bug. This change brings the two remaining drifted paths in line with that precedent. Regression tests: - ``tests/test_presets.py::TestInitOptions`` — parametrized non-ASCII round-trip (CJK, Latin-1, Greek, emoji) plus a corrupted-file case that asserts the existing "fall back to {}" contract still holds when a peer file contains bytes invalid as UTF-8. - ``tests/test_extensions.py::TestExtensionIgnore`` — Japanese (``ドキュメント/``) and Latin-1 (``café/``) ignore patterns correctly exclude their directories during install.

Copilot

Pull request overview

Pins UTF-8 explicitly for a couple of user-authored/portable files to avoid Windows locale-dependent decoding/encoding behavior, and adds regression tests around non-ASCII content.

Changes:

Use encoding="utf-8" for .specify/init-options.json read/write and catch UnicodeDecodeError in the loader fallback path.
Use encoding="utf-8" when reading .extensionignore so multibyte patterns round-trip across platforms/locales.
Add tests covering UTF-8 patterns in .extensionignore and fallback behavior for invalid UTF-8 init-options.

Show a summary per file

File	Description
`src/specify_cli/__init__.py`	Pins UTF-8 for init-options I/O; extends error handling to include decode failures.
`src/specify_cli/extensions.py`	Pins UTF-8 when reading `.extensionignore` patterns.
`tests/test_presets.py`	Adds init-options tests for non-ASCII values and corrupted-byte fallback.
`tests/test_extensions.py`	Adds `.extensionignore` tests ensuring non-ASCII patterns exclude intended paths.

Copilot's findings

Tip

Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Files reviewed: 4/4 changed files
Comments generated: 3

mnriem · 2026-05-27T12:50:50Z

Please address Copilot feedback. If not applicable, please explain why

Addresses Copilot review feedback on this PR. Three issues, three fixes: 1. ``save_init_options`` now writes JSON with ``ensure_ascii=False``. Without that flag, ``json.dumps`` emits ASCII-only ``\uXXXX`` escapes, which means the ``encoding="utf-8"`` pin on the surrounding ``Path.write_text`` makes no observable difference for any value we currently write. Flipping ``ensure_ascii`` makes the non-ASCII bytes hit the file directly, so the encoding pin becomes the thing that decides between cp1252 garbage and clean UTF-8 on Windows. The comment above the call now describes the real reason instead of the previously-misleading rationale Copilot flagged. 2. ``test_save_load_round_trip_preserves_non_ascii`` was a no-op under the old ``ensure_ascii=True`` writer (Copilot's second comment). Added ``test_save_writes_real_utf8_bytes`` that asserts the on-disk bytes contain the UTF-8 encoding of ``café`` (``0xC3 0xA9``), not its JSON escape form ``é``. Removing either ``ensure_ascii=False`` or ``encoding="utf-8"`` from the writer now breaks this test — the contract is pinned. 3. ``.extensionignore`` reader wraps ``UnicodeDecodeError`` as ``ValidationError`` with a pointer to the offending byte (Copilot's third comment). Mirrors ``ExtensionManifest._load_yaml``'s existing handler for ``extension.yml``. Adds ``test_extensionignore_invalid_utf8_raises_validation_error`` asserting installation aborts with the wrapped error instead of a raw Python traceback.

Copilot

Copilot's findings

Files reviewed: 4/4 changed files
Comments generated: 0 new

mnriem · 2026-06-02T12:19:21Z

Thank you!

Quratulain-bilal requested a review from mnriem as a code owner May 23, 2026 10:44

mnriem requested a review from Copilot May 26, 2026 18:40

Copilot started reviewing on behalf of mnriem May 26, 2026 18:40 View session

Copilot AI reviewed May 26, 2026

View reviewed changes

Comment thread src/specify_cli/__init__.py Outdated

Comment thread tests/test_presets.py

Comment thread src/specify_cli/extensions.py Outdated

mnriem requested a review from Copilot June 1, 2026 16:19

Copilot started reviewing on behalf of mnriem June 1, 2026 16:20 View session

Copilot AI reviewed Jun 1, 2026

View reviewed changes

mnriem approved these changes Jun 2, 2026

View reviewed changes

mnriem merged commit 442a581 into github:main Jun 2, 2026
11 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(cli): pin UTF-8 encoding on init-options and .extensionignore I/O#2686

fix(cli): pin UTF-8 encoding on init-options and .extensionignore I/O#2686
mnriem merged 2 commits into
github:mainfrom
Quratulain-bilal:fix/cli-explicit-utf8-encoding

Quratulain-bilal commented May 23, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mnriem commented May 27, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

mnriem commented Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

Quratulain-bilal commented May 23, 2026

What

Reproducer

Why this matters

The change

Tests

Scope

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Copilot's findings

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mnriem commented May 27, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Copilot's findings

Uh oh!

Uh oh!

mnriem commented Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants