fix(extensions): use explicit UTF-8 encoding when reading manifest YAML #2370

Merged
mnriem merged 3 commits into github:main from Quratulain-bilal:fix/utf8-encoding-windows
Apr 28, 2026

Conversation

@Quratulain-bilal
Contributor

Summary

Fixes #2325: specify extension add crashes on Windows with UnicodeDecodeError: 'gbk' codec can't decode byte ...
when extension.yml / preset.yml contains non-ASCII content (e.g., Chinese text in the description).

Root cause

ExtensionManifest._load_yaml (src/specify_cli/extensions.py:142) and PresetManifest._load_yaml
(src/specify_cli/presets.py:139) call open(path, 'r') without an explicit encoding. On Windows, Python then decodes
with the system locale encoding (GBK on Chinese Windows), which fails on UTF-8 manifests.
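
The mismatch can be reproduced on any platform by decoding UTF-8 bytes with the GBK codec by hand. The snippet below is a minimal sketch, not code from this PR: the sample text and the helper name try_decode are illustrative assumptions.

```python
import locale


def try_decode(raw: bytes, encoding: str):
    """Return the decoded text, or None if the bytes are invalid in `encoding`."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# Illustrative manifest line; the Chinese description is an assumption.
text = "description: 中文描述扩展"
raw = text.encode("utf-8")

# Without an explicit encoding, open() typically falls back to this
# locale-dependent value.
print("locale default:", locale.getpreferredencoding(False))

# UTF-8 round-trips. GBK either raises or yields mojibake, never the original.
print("utf-8 round-trips:", try_decode(raw, "utf-8") == text)  # True
print("gbk round-trips:", try_decode(raw, "gbk") == text)      # False
```

Whether GBK raises or silently produces garbage depends on the exact byte sequence, which is why the crash only shows up for some manifests.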

Fix

Add encoding='utf-8' to both open() calls, matching the convention already used in integrations/catalog.py:449,
workflows/engine.py:63, and every read_text(encoding="utf-8") call site.
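
As a minimal sketch of the change, the two reads below pass the encoding explicitly, once through open() and once through Path.read_text(). The helper names, the temporary file, and its contents are illustrative assumptions; the real loaders hand the open file to yaml.safe_load, which is omitted here to keep the example standard-library only.

```python
import tempfile
from pathlib import Path


def read_with_open(path: Path) -> str:
    # Explicit encoding: the result no longer depends on the system locale.
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


def read_with_pathlib(path: Path) -> str:
    # Same guarantee through pathlib.
    return path.read_text(encoding="utf-8")


content = "description: 中文描述\n"

with tempfile.TemporaryDirectory() as tmp:
    manifest = Path(tmp) / "extension.yml"
    # Write raw UTF-8 bytes so file creation is locale-independent too.
    manifest.write_bytes(content.encode("utf-8"))
    same = read_with_open(manifest) == read_with_pathlib(manifest) == content
    print("both reads match:", same)  # True
```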

Test plan

  • Manual: extension.yml with Chinese description loads on Windows
  • No behavior change on macOS/Linux (UTF-8 is already the default there)
  • CI green

On Windows, Python's open() defaults to the system locale encoding
(e.g., GBK on Chinese Windows), which causes UnicodeDecodeError when
extension.yml or preset.yml contains non-ASCII content such as Chinese
characters in description fields.

Add encoding='utf-8' to ExtensionManifest._load_yaml and
PresetManifest._load_yaml so manifests are read consistently across
platforms.

Fixes github#2325
Contributor

Copilot AI left a comment

Pull request overview

Fixes Windows UnicodeDecodeError failures when loading UTF-8 encoded extension.yml / preset.yml by explicitly reading those YAML manifests using UTF-8.

Changes:

  • Read preset.yml using open(..., encoding="utf-8") in PresetManifest._load_yaml.
  • Read extension.yml using open(..., encoding="utf-8") in ExtensionManifest._load_yaml.
Summary per file:

| File | Description |
| --- | --- |
| src/specify_cli/presets.py | Forces UTF-8 decoding when loading preset manifests to avoid Windows locale-dependent decoding. |
| src/specify_cli/extensions.py | Forces UTF-8 decoding when loading extension manifests to avoid Windows locale-dependent decoding. |

Copilot's findings

Comments suppressed due to low confidence (5)

src/specify_cli/presets.py:140

  • Test coverage: consider adding a unit test that writes a UTF-8 preset.yml containing non-ASCII text (e.g., Chinese description) and verifies PresetManifest loads successfully. This would lock in the intended Windows behavior and guard against future regressions.
        try:
            with open(path, 'r', encoding='utf-8') as f:
                return yaml.safe_load(f) or {}

src/specify_cli/extensions.py:147

  • _load_yaml() only wraps yaml.YAMLError/FileNotFoundError. If the manifest is unreadable (e.g., UnicodeDecodeError for non-UTF-8 content, PermissionError, etc.), this will bubble up as an unhandled exception and can crash specify extension add. Consider catching (OSError, UnicodeError) here and re-raising as ValidationError with a clear message (similar to IntegrationDescriptor._load).

This issue also appears on line 141 of the same file.

    def _load_yaml(self, path: Path) -> dict:
        """Load YAML file safely."""
        try:
            with open(path, 'r', encoding='utf-8') as f:
                data = yaml.safe_load(f)
        except yaml.YAMLError as e:
            raise ValidationError(f"Invalid YAML in {path}: {e}")
        except FileNotFoundError:
            raise ValidationError(f"Manifest not found: {path}")

src/specify_cli/presets.py:144

  • PresetManifest._load_yaml() currently returns yaml.safe_load(...) without validating the root type. If preset.yml parses to a non-mapping (e.g., a list/string), _validate() will raise TypeError/KeyError instead of a PresetValidationError. Consider validating data is a dict (and treating None as {}) before returning, matching ExtensionManifest/IntegrationDescriptor behavior.

This issue also appears on line 138 of the same file.

    def _load_yaml(self, path: Path) -> dict:
        """Load YAML file safely."""
        try:
            with open(path, 'r', encoding='utf-8') as f:
                return yaml.safe_load(f) or {}
        except yaml.YAMLError as e:
            raise PresetValidationError(f"Invalid YAML in {path}: {e}")
        except FileNotFoundError:
            raise PresetValidationError(f"Manifest not found: {path}")

src/specify_cli/presets.py:145

  • PresetManifest._load_yaml() doesn’t wrap file I/O / decode failures (e.g., UnicodeDecodeError, PermissionError). With the new explicit UTF-8 decoding, non-UTF-8 manifests will now raise an unhandled exception. Consider catching (OSError, UnicodeError) and re-raising as PresetValidationError with a readable error message.
        try:
            with open(path, 'r', encoding='utf-8') as f:
                return yaml.safe_load(f) or {}
        except yaml.YAMLError as e:
            raise PresetValidationError(f"Invalid YAML in {path}: {e}")
        except FileNotFoundError:
            raise PresetValidationError(f"Manifest not found: {path}")

src/specify_cli/extensions.py:143

  • Test coverage: this change fixes a Windows-specific decoding failure when manifests contain non-ASCII characters. There don’t appear to be tests exercising non-ASCII UTF-8 extension.yml parsing; adding one (write bytes encoded as UTF-8 with e.g. Chinese description, then load ExtensionManifest) would prevent regressions.
        try:
            with open(path, 'r', encoding='utf-8') as f:
                data = yaml.safe_load(f)
  • Files reviewed: 2/2 changed files
  • Comments generated: 0
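
The suppressed comments about unreadable manifests describe one pattern: turn every read failure into the loader's own validation error. A hedged sketch of that pattern follows. ManifestError and read_manifest_text are hypothetical names standing in for ValidationError / PresetValidationError and _load_yaml; the messages are illustrative and YAML parsing is left out.

```python
import tempfile
from pathlib import Path


class ManifestError(Exception):
    """Hypothetical stand-in for ValidationError / PresetValidationError."""


def read_manifest_text(path: Path) -> str:
    """Read a manifest as UTF-8, normalizing read failures to ManifestError."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return f.read()
    except FileNotFoundError:
        # Must come before OSError: FileNotFoundError is an OSError subclass.
        raise ManifestError(f"Manifest not found: {path}") from None
    except UnicodeDecodeError as exc:
        # UnicodeDecodeError derives from ValueError, not OSError.
        raise ManifestError(f"Manifest is not valid UTF-8: {path}: {exc}") from exc
    except OSError as exc:
        # PermissionError, IsADirectoryError, and other I/O failures.
        raise ManifestError(f"Could not read manifest {path}: {exc}") from exc


with tempfile.TemporaryDirectory() as tmp:
    bad = Path(tmp) / "extension.yml"
    bad.write_bytes(b"description: \xff\xfe\n")  # not valid UTF-8
    try:
        read_manifest_text(bad)
    except ManifestError as err:
        print("rejected:", type(err).__name__)  # rejected: ManifestError
```

With this shape, callers only need to handle one exception type, and the order of the except clauses matters because the first matching clause wins.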

@mnriem
Collaborator

mnriem commented Apr 27, 2026

Can you add positive / negative tests so we do not regress on this?

Collaborator

@mnriem mnriem left a comment

See above

…hub#2325

Positive: extension.yml/preset.yml with non-ASCII (Chinese + emoji)
descriptions load correctly when written as UTF-8 bytes — fails on
Windows without explicit encoding='utf-8'.

Negative: files containing invalid UTF-8 bytes raise a clean error
(ValidationError or UnicodeDecodeError), not a silent crash.
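
A rough sketch of what such positive and negative checks can look like, using only the standard library. ManifestError, read_manifest_text, and both check functions are hypothetical stand-ins, not the tests added in this PR; in a pytest suite the directory argument would typically come from the tmp_path fixture.

```python
import tempfile
from pathlib import Path


class ManifestError(Exception):
    """Hypothetical stand-in for the project's validation errors."""


def read_manifest_text(path: Path) -> str:
    # Hypothetical loader: explicit UTF-8, decode failures become ManifestError.
    try:
        return path.read_text(encoding="utf-8")
    except UnicodeDecodeError as exc:
        raise ManifestError(f"Manifest is not valid UTF-8: {path}") from exc


def check_non_ascii_manifest_loads(tmp_dir: Path) -> None:
    # Positive: write explicit UTF-8 bytes so creating the file does not
    # depend on the platform's default encoding.
    description = "测试扩展 \U0001F680"  # Chinese text plus an emoji
    path = tmp_dir / "extension.yml"
    path.write_bytes(f"description: {description}\n".encode("utf-8"))
    assert description in read_manifest_text(path)


def check_invalid_utf8_is_rejected(tmp_dir: Path) -> None:
    # Negative: the byte 0xFF never appears in well-formed UTF-8.
    path = tmp_dir / "preset.yml"
    path.write_bytes(b"description: \xff\xfe\n")
    try:
        read_manifest_text(path)
    except ManifestError as err:
        assert "UTF-8" in str(err)
    else:
        raise AssertionError("invalid UTF-8 was accepted")


with tempfile.TemporaryDirectory() as tmp:
    check_non_ascii_manifest_loads(Path(tmp))
    check_invalid_utf8_is_rejected(Path(tmp))
```

Writing bytes rather than text in the positive case is the important detail: a test that writes with the platform default encoding would pass on Windows even without the fix.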
Contributor

Copilot AI left a comment

Pull request overview

This PR addresses a Windows-specific crash when reading extension.yml / preset.yml containing non‑ASCII UTF‑8 content by ensuring the manifest loaders read YAML with an explicit UTF‑8 encoding.

Changes:

  • Use encoding="utf-8" when opening extension and preset manifest YAML files.
  • Add regression tests that write explicit UTF‑8 bytes and validate non‑ASCII descriptions load correctly.
  • Add negative tests around invalid UTF‑8 bytes.
Summary per file:

| File | Description |
| --- | --- |
| src/specify_cli/extensions.py | Opens extension.yml using explicit UTF‑8 to avoid Windows locale decoding issues. |
| src/specify_cli/presets.py | Opens preset.yml using explicit UTF‑8 to avoid Windows locale decoding issues. |
| tests/test_extensions.py | Adds regression test for UTF‑8 non‑ASCII descriptions and a negative invalid-bytes test. |
| tests/test_presets.py | Adds regression test for UTF‑8 non‑ASCII descriptions and a negative invalid-bytes test. |

Copilot's findings

Comments suppressed due to low confidence (2)

src/specify_cli/presets.py:145

  • PresetManifest._load_yaml() will currently propagate UnicodeDecodeError for invalid UTF-8, and it also returns whatever yaml.safe_load() yields (which may be a scalar), which can later trigger a TypeError in _validate() when checking required fields. Consider loading into a local data, catching UnicodeDecodeError to raise PresetValidationError, and validating data is a mapping (similar to ExtensionManifest) so invalid/garbled manifests consistently produce a PresetValidationError instead of leaking low-level exceptions.
    def _load_yaml(self, path: Path) -> dict:
        """Load YAML file safely."""
        try:
            with open(path, 'r', encoding='utf-8') as f:
                return yaml.safe_load(f) or {}
        except yaml.YAMLError as e:
            raise PresetValidationError(f"Invalid YAML in {path}: {e}")
        except FileNotFoundError:
            raise PresetValidationError(f"Manifest not found: {path}")

src/specify_cli/extensions.py:147

  • ExtensionManifest._load_yaml() can still raise a raw UnicodeDecodeError (e.g., if extension.yml contains invalid UTF-8). Since this code already normalizes YAML parse/file-not-found errors into ValidationError, consider also catching UnicodeDecodeError and raising ValidationError with a clear message (e.g., that the manifest must be valid UTF-8) so specify extension add fails gracefully and tests can assert a single error type.
    def _load_yaml(self, path: Path) -> dict:
        """Load YAML file safely."""
        try:
            with open(path, 'r', encoding='utf-8') as f:
                data = yaml.safe_load(f)
        except yaml.YAMLError as e:
            raise ValidationError(f"Invalid YAML in {path}: {e}")
        except FileNotFoundError:
            raise ValidationError(f"Manifest not found: {path}")
  • Files reviewed: 4/4 changed files
  • Comments generated: 2

Comment thread tests/test_extensions.py
Comment thread tests/test_presets.py
@mnriem
Collaborator

mnriem commented Apr 27, 2026

Please address Copilot feedback

Address remaining Copilot concerns on github#2370:

- Catch UnicodeDecodeError and OSError in both manifest loaders and
  re-raise as ValidationError / PresetValidationError so callers see a
  consistent error type, not a bare decode/IO traceback.
- Validate that PresetManifest YAML root is a mapping (extensions.py
  already had this; presets.py was missing it). Treat None as {} for
  empty-file compatibility.
- Tighten the negative regression tests to assert the specific message,
  and add a non-mapping-root test for PresetManifest matching the
  existing one for ExtensionManifest.
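
The mapping check from the second bullet can be sketched independently of the YAML parser. ensure_mapping and ManifestError are hypothetical names and the message is illustrative; the data argument stands for whatever yaml.safe_load returned.

```python
from typing import Any


class ManifestError(Exception):
    """Hypothetical stand-in for PresetValidationError."""


def ensure_mapping(data: Any, source: str) -> dict:
    """Validate the parsed YAML root; an empty file (None) becomes {}."""
    if data is None:
        return {}
    if not isinstance(data, dict):
        raise ManifestError(
            f"Manifest root in {source} must be a mapping, "
            f"got {type(data).__name__}"
        )
    return data


print(ensure_mapping(None, "preset.yml"))              # {}
print(ensure_mapping({"name": "demo"}, "preset.yml"))  # {'name': 'demo'}
try:
    ensure_mapping(["a", "b"], "preset.yml")
except ManifestError as err:
    print(err)  # Manifest root in preset.yml must be a mapping, got list
```

Unlike the expression yaml.safe_load(f) or {}, this maps only None to an empty dict, so a falsy non-mapping root such as an empty list is rejected rather than silently accepted.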
Contributor

Copilot AI left a comment

Pull request overview

Fixes Windows-specific crashes when loading UTF-8 extension.yml / preset.yml by ensuring manifests are read with an explicit UTF-8 encoding, and adds regression coverage around non-ASCII content and invalid UTF-8 bytes.

Changes:

  • Read extension/preset YAML manifests with encoding="utf-8" to avoid Windows locale/codepage decoding issues.
  • Convert UnicodeDecodeError (and other read failures) into manifest validation errors with clearer messaging.
  • Add tests covering UTF-8 non-ASCII descriptions and invalid UTF-8 byte sequences for both extensions and presets.
Summary per file:

| File | Description |
| --- | --- |
| src/specify_cli/extensions.py | Reads extension.yml using explicit UTF-8 and wraps decode/read errors as ValidationError. |
| src/specify_cli/presets.py | Reads preset.yml using explicit UTF-8 and wraps decode/read errors as PresetValidationError. |
| tests/test_extensions.py | Adds regression tests for UTF-8 non-ASCII manifest content and invalid UTF-8 bytes. |
| tests/test_presets.py | Adds regression tests for UTF-8 non-ASCII manifest content, invalid UTF-8 bytes, and non-mapping YAML roots. |

Copilot's findings

  • Files reviewed: 4/4 changed files
  • Comments generated: 0

@mnriem mnriem self-requested a review April 28, 2026 13:46
@mnriem mnriem merged commit 3a7f64c into github:main Apr 28, 2026
15 checks passed
@mnriem
Collaborator

mnriem commented Apr 28, 2026

Thank you!

kanfil added a commit to tikalk/agentic-sdlc-spec-kit that referenced this pull request Apr 29, 2026
Upstream changes (22 commits):
- fix: include --from git+... in upgrade hint to avoid PyPI squat package (github#2411)
- fix: dispatch opencode commands via run (github#2410)
- feat: add catalog discovery CLI commands (github#2360)
- fix(extensions): use explicit UTF-8 encoding when reading manifest YAML (github#2370)
- feat: Speckit preset fiction book v1.7 - Support for RAG (Chroma DB) (github#2367)
- chore: release 0.8.2, begin 0.8.3.dev0 development (github#2397)
- Catalog updates: security review v1.3.0, v-model v0.6.0, threatmodel,
  isaqb-architecture-governance, m365, MarkItDown

Fork customizations preserved:
- Fork package name and version (agentic-sdlc-specify-cli)
- skill_app integration from cli_customization
- Bundled extensions and presets
Development

Successfully merging this pull request may close these issues.

[Bug]: extension.yml fails to read UTF-8 encoded Chinese content on Windows

3 participants