Skip to content

fix(cli): pin UTF-8 encoding on init-options and .extensionignore I/O#2686

Merged
mnriem merged 2 commits into
github:mainfrom
Quratulain-bilal:fix/cli-explicit-utf8-encoding
Jun 2, 2026
Merged

fix(cli): pin UTF-8 encoding on init-options and .extensionignore I/O#2686
mnriem merged 2 commits into
github:mainfrom
Quratulain-bilal:fix/cli-explicit-utf8-encoding

Conversation

@Quratulain-bilal
Copy link
Copy Markdown
Contributor

What

Path.read_text() / Path.write_text() default to the system locale codec, which is cp1252 / gb2312 / cp932 on Windows.
Two user-facing file paths in spec-kit were calling them without an explicit encoding= argument:

  • src/specify_cli/__init__.py:400, 412save_init_options / load_init_options for
    .specify/init-options.json. A peer machine with a different default locale (or a UTF-8 Unix CI runner reading a file
    written on a cp1252 Windows host) cannot decode the file, raising UnicodeDecodeError. UnicodeDecodeError is a
    subclass of ValueError — not OSError / json.JSONDecodeError — so the existing fall-back except tuple in
    load_init_options also misses it and the error propagates raw to the CLI.

  • src/specify_cli/extensions.py:764.extensionignore pattern reader. The very next line already normalises
    backslashes "so Windows-authored files work", proving the codebase expects Windows authors to write this file.
    Multibyte UTF-8 patterns (Chinese filenames, accented directory names) silently mojibake when the host locale is not
    UTF-8, so the patterns fail to match and unintended files are shipped with the extension.

Reproducer

# On a non-UTF-8 Windows host (e.g. cp1252):
from specify_cli import save_init_options, load_init_options
save_init_options(Path("."), {"project_name": "café"})
# File written as cp1252.

# On a UTF-8 host (or another Windows host with a different default codec):
load_init_options(Path("."))
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9...
# — propagates raw because UnicodeDecodeError is a ValueError, not an OSError.

Why this matters

  • The sibling integration-catalog reader at src/specify_cli/integrations/catalog.py:150,156,193,202,374 already pins
    encoding="utf-8" everywhere. PR fix(powershell): ensure UTF-8 templates are written without BOM #2280 (f684305) fixed the symmetric PowerShell-template BOM bug. The two paths in
    this PR were the remaining drifted ones.
  • init-options.json is meant to be a portable record of how a project was scaffolded — a peer cloning the repo on a
    different OS / locale must be able to read it. Today they can't if the original author's project name (or any future
    field) contains non-ASCII.
  • .extensionignore already explicitly supports Windows authors (see line 766). UTF-8 patterns are part of that same
    contract.

The change

  • src/specify_cli/__init__.py — pin encoding="utf-8" on both write_text and read_text; extend the existing
    except tuple in load_init_options to include UnicodeDecodeError so a peer file written in a non-UTF-8 codec falls
    back to {} per the existing contract instead of crashing.
  • src/specify_cli/extensions.py — pin encoding="utf-8" on the .extensionignore reader.

Tests

  • tests/test_presets.py::TestInitOptions — parametrized non-ASCII round-trip (CJK / Latin-1 / Greek / emoji) plus a
    0xe9-byte corrupted-file fallback test.
  • tests/test_extensions.py::TestExtensionIgnore — Japanese (ドキュメント/) and Latin-1 (café/) ignore patterns
    correctly exclude their directories during install.
$ python -m pytest tests/test_presets.py::TestInitOptions tests/test_extensions.py::TestExtensionIgnore -q
22 passed in 4.56s
$ python -m ruff check src/
All checks passed!

Scope

Intentionally narrow: no behaviour change for ASCII content (UTF-8 is a superset). Only non-ASCII content that previously
round-tripped accidentally (when host locale happened to be UTF-8) or silently mojibaked (when it wasn't) now
round-trips reliably across all hosts.

``Path.read_text`` / ``Path.write_text`` default to the system locale
codec, which is cp1252 / gb2312 / cp932 on Windows. Two user-facing
file paths in spec-kit were calling them without an explicit
``encoding=`` argument:

  - ``src/specify_cli/__init__.py:400,412`` —
    ``save_init_options`` / ``load_init_options`` for
    ``.specify/init-options.json``. A peer machine with a different
    default locale (or a UTF-8 Unix CI runner reading a file written on
    a cp1252 Windows host) cannot decode the file, raising
    ``UnicodeDecodeError``. ``UnicodeDecodeError`` is a subclass of
    ``ValueError`` — not ``OSError`` / ``json.JSONDecodeError`` — so
    the existing fall-back ``except`` tuple in ``load_init_options``
    also misses it and the error propagates raw to the CLI.

  - ``src/specify_cli/extensions.py:764`` — ``.extensionignore``
    pattern reader. The very next line already normalises
    backslashes "so Windows-authored files work", proving the codebase
    expects Windows authors to write this file. Multibyte UTF-8
    patterns (Chinese filenames, accented directory names) silently
    mojibake when the host locale is not UTF-8, so the patterns fail
    to match and unintended files are shipped with the extension.

The sibling integration-catalog reader at
``src/specify_cli/integrations/catalog.py:150,156,193,202,374``
already pins ``encoding="utf-8"`` everywhere. PR github#2280 fixed the
symmetric PowerShell-template BOM bug. This change brings the two
remaining drifted paths in line with that precedent.

Regression tests:

  - ``tests/test_presets.py::TestInitOptions`` — parametrized non-ASCII
    round-trip (CJK, Latin-1, Greek, emoji) plus a corrupted-file case
    that asserts the existing "fall back to {}" contract still holds
    when a peer file contains bytes invalid as UTF-8.
  - ``tests/test_extensions.py::TestExtensionIgnore`` — Japanese
    (``ドキュメント/``) and Latin-1 (``café/``) ignore patterns
    correctly exclude their directories during install.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Pins UTF-8 explicitly for a couple of user-authored/portable files to avoid Windows locale-dependent decoding/encoding behavior, and adds regression tests around non-ASCII content.

Changes:

  • Use encoding="utf-8" for .specify/init-options.json read/write and catch UnicodeDecodeError in the loader fallback path.
  • Use encoding="utf-8" when reading .extensionignore so multibyte patterns round-trip across platforms/locales.
  • Add tests covering UTF-8 patterns in .extensionignore and fallback behavior for invalid UTF-8 init-options.
Show a summary per file
File Description
src/specify_cli/__init__.py Pins UTF-8 for init-options I/O; extends error handling to include decode failures.
src/specify_cli/extensions.py Pins UTF-8 when reading .extensionignore patterns.
tests/test_presets.py Adds init-options tests for non-ASCII values and corrupted-byte fallback.
tests/test_extensions.py Adds .extensionignore tests ensuring non-ASCII patterns exclude intended paths.

Copilot's findings

Tip

Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

  • Files reviewed: 4/4 changed files
  • Comments generated: 3

Comment thread src/specify_cli/__init__.py Outdated
Comment thread tests/test_presets.py
Comment thread src/specify_cli/extensions.py Outdated
@mnriem
Copy link
Copy Markdown
Collaborator

mnriem commented May 27, 2026

Please address Copilot feedback. If not applicable, please explain why

Addresses Copilot review feedback on this PR.

Three issues, three fixes:

1. ``save_init_options`` now writes JSON with ``ensure_ascii=False``.
   Without that flag, ``json.dumps`` emits ASCII-only ``\uXXXX``
   escapes, which means the ``encoding="utf-8"`` pin on the
   surrounding ``Path.write_text`` makes no observable difference for
   any value we currently write. Flipping ``ensure_ascii`` makes the
   non-ASCII bytes hit the file directly, so the encoding pin becomes
   the thing that decides between cp1252 garbage and clean UTF-8 on
   Windows. The comment above the call now describes the real reason
   instead of the previously-misleading rationale Copilot flagged.

2. ``test_save_load_round_trip_preserves_non_ascii`` was a no-op under
   the old ``ensure_ascii=True`` writer (Copilot's second comment).
   Added ``test_save_writes_real_utf8_bytes`` that asserts the on-disk
   bytes contain the UTF-8 encoding of ``café`` (``0xC3 0xA9``), not
   its JSON escape form ``é``. Removing either
   ``ensure_ascii=False`` or ``encoding="utf-8"`` from the writer now
   breaks this test — the contract is pinned.

3. ``.extensionignore`` reader wraps ``UnicodeDecodeError`` as
   ``ValidationError`` with a pointer to the offending byte
   (Copilot's third comment). Mirrors
   ``ExtensionManifest._load_yaml``'s existing handler for
   ``extension.yml``. Adds
   ``test_extensionignore_invalid_utf8_raises_validation_error``
   asserting installation aborts with the wrapped error instead of a
   raw Python traceback.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot's findings

  • Files reviewed: 4/4 changed files
  • Comments generated: 0 new

@mnriem mnriem merged commit 442a581 into github:main Jun 2, 2026
11 checks passed
@mnriem
Copy link
Copy Markdown
Collaborator

mnriem commented Jun 2, 2026

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants