fix(extensions): use explicit UTF-8 encoding when reading manifest YAML#2370
Conversation
On Windows, Python's open() defaults to the system locale encoding (e.g., GBK on Chinese Windows), which causes UnicodeDecodeError when extension.yml or preset.yml contains non-ASCII content such as Chinese characters in description fields. Add encoding='utf-8' to ExtensionManifest._load_yaml and PresetManifest._load_yaml so manifests are read consistently across platforms. Fixes github#2325
There was a problem hiding this comment.
Pull request overview
Fixes Windows UnicodeDecodeError failures when loading UTF-8 encoded extension.yml / preset.yml by explicitly reading those YAML manifests using UTF-8.
Changes:
- Read
preset.ymlusingopen(..., encoding="utf-8")inPresetManifest._load_yaml. - Read
extension.ymlusingopen(..., encoding="utf-8")inExtensionManifest._load_yaml.
Show a summary per file
| File | Description |
|---|---|
| src/specify_cli/presets.py | Forces UTF-8 decoding when loading preset manifests to avoid Windows locale-dependent decoding. |
| src/specify_cli/extensions.py | Forces UTF-8 decoding when loading extension manifests to avoid Windows locale-dependent decoding. |
Copilot's findings
Tip
Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comments suppressed due to low confidence (5)
src/specify_cli/presets.py:140
- Test coverage: consider adding a unit test that writes a UTF-8
preset.ymlcontaining non-ASCII text (e.g., Chinesedescription) and verifiesPresetManifestloads successfully. This would lock in the intended Windows behavior and guard against future regressions.
try:
with open(path, 'r', encoding='utf-8') as f:
return yaml.safe_load(f) or {}
src/specify_cli/extensions.py:147
_load_yaml()only wrapsyaml.YAMLError/FileNotFoundError. If the manifest is unreadable (e.g.,UnicodeDecodeErrorfor non-UTF-8 content,PermissionError, etc.), this will bubble up as an unhandled exception and can crashspecify extension add. Consider catching(OSError, UnicodeError)here and re-raising asValidationErrorwith a clear message (similar toIntegrationDescriptor._load).
This issue also appears on line 141 of the same file.
def _load_yaml(self, path: Path) -> dict:
"""Load YAML file safely."""
try:
with open(path, 'r', encoding='utf-8') as f:
data = yaml.safe_load(f)
except yaml.YAMLError as e:
raise ValidationError(f"Invalid YAML in {path}: {e}")
except FileNotFoundError:
raise ValidationError(f"Manifest not found: {path}")
src/specify_cli/presets.py:144
PresetManifest._load_yaml()currently returnsyaml.safe_load(...)without validating the root type. Ifpreset.ymlparses to a non-mapping (e.g., a list/string),_validate()will raiseTypeError/KeyErrorinstead of aPresetValidationError. Consider validatingdatais adict(and treatingNoneas{}) before returning, matchingExtensionManifest/IntegrationDescriptorbehavior.
This issue also appears in the following locations of the same file:
- line 138
- line 138
def _load_yaml(self, path: Path) -> dict:
"""Load YAML file safely."""
try:
with open(path, 'r', encoding='utf-8') as f:
return yaml.safe_load(f) or {}
except yaml.YAMLError as e:
raise PresetValidationError(f"Invalid YAML in {path}: {e}")
except FileNotFoundError:
raise PresetValidationError(f"Manifest not found: {path}")
src/specify_cli/presets.py:145
PresetManifest._load_yaml()doesn’t wrap file I/O / decode failures (e.g.,UnicodeDecodeError,PermissionError). With the new explicit UTF-8 decoding, non-UTF-8 manifests will now raise an unhandled exception. Consider catching(OSError, UnicodeError)and re-raising asPresetValidationErrorwith a readable error message.
try:
with open(path, 'r', encoding='utf-8') as f:
return yaml.safe_load(f) or {}
except yaml.YAMLError as e:
raise PresetValidationError(f"Invalid YAML in {path}: {e}")
except FileNotFoundError:
raise PresetValidationError(f"Manifest not found: {path}")
src/specify_cli/extensions.py:143
- Test coverage: this change fixes a Windows-specific decoding failure when manifests contain non-ASCII characters. There don’t appear to be tests exercising non-ASCII UTF-8
extension.ymlparsing; adding one (write bytes encoded as UTF-8 with e.g. Chinesedescription, then loadExtensionManifest) would prevent regressions.
try:
with open(path, 'r', encoding='utf-8') as f:
data = yaml.safe_load(f)
- Files reviewed: 2/2 changed files
- Comments generated: 0
|
Can you add positive / negative tests so we do not regress on this? |
…hub#2325 Positive: extension.yml/preset.yml with non-ASCII (Chinese + emoji) descriptions load correctly when written as UTF-8 bytes — fails on Windows without explicit encoding='utf-8'. Negative: files containing invalid UTF-8 bytes raise a clean error (ValidationError or UnicodeDecodeError), not a silent crash.
There was a problem hiding this comment.
Pull request overview
This PR addresses a Windows-specific crash when reading extension.yml / preset.yml containing non‑ASCII UTF‑8 content by ensuring the manifest loaders read YAML with an explicit UTF‑8 encoding.
Changes:
- Use
encoding="utf-8"when opening extension and preset manifest YAML files. - Add regression tests that write explicit UTF‑8 bytes and validate non‑ASCII descriptions load correctly.
- Add negative tests around invalid UTF‑8 bytes.
Show a summary per file
| File | Description |
|---|---|
src/specify_cli/extensions.py |
Opens extension.yml using explicit UTF‑8 to avoid Windows locale decoding issues. |
src/specify_cli/presets.py |
Opens preset.yml using explicit UTF‑8 to avoid Windows locale decoding issues. |
tests/test_extensions.py |
Adds regression test for UTF‑8 non‑ASCII descriptions and a negative invalid-bytes test. |
tests/test_presets.py |
Adds regression test for UTF‑8 non‑ASCII descriptions and a negative invalid-bytes test. |
Copilot's findings
Tip
Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comments suppressed due to low confidence (2)
src/specify_cli/presets.py:145
PresetManifest._load_yaml()will currently propagateUnicodeDecodeErrorfor invalid UTF-8, and it also returns whateveryaml.safe_load()yields (which may be a scalar), which can later trigger aTypeErrorin_validate()when checking required fields. Consider loading into a localdata, catchingUnicodeDecodeErrorto raisePresetValidationError, and validatingdatais a mapping (similar toExtensionManifest) so invalid/garbled manifests consistently produce aPresetValidationErrorinstead of leaking low-level exceptions.
def _load_yaml(self, path: Path) -> dict:
"""Load YAML file safely."""
try:
with open(path, 'r', encoding='utf-8') as f:
return yaml.safe_load(f) or {}
except yaml.YAMLError as e:
raise PresetValidationError(f"Invalid YAML in {path}: {e}")
except FileNotFoundError:
raise PresetValidationError(f"Manifest not found: {path}")
src/specify_cli/extensions.py:147
ExtensionManifest._load_yaml()can still raise a rawUnicodeDecodeError(e.g., ifextension.ymlcontains invalid UTF-8). Since this code already normalizes YAML parse/file-not-found errors intoValidationError, consider also catchingUnicodeDecodeErrorand raisingValidationErrorwith a clear message (e.g., that the manifest must be valid UTF-8) sospecify extension addfails gracefully and tests can assert a single error type.
def _load_yaml(self, path: Path) -> dict:
"""Load YAML file safely."""
try:
with open(path, 'r', encoding='utf-8') as f:
data = yaml.safe_load(f)
except yaml.YAMLError as e:
raise ValidationError(f"Invalid YAML in {path}: {e}")
except FileNotFoundError:
raise ValidationError(f"Manifest not found: {path}")
- Files reviewed: 4/4 changed files
- Comments generated: 2
|
Please address Copilot feedback |
Address remaining Copilot concerns on github#2370: - Catch UnicodeDecodeError and OSError in both manifest loaders and re-raise as ValidationError / PresetValidationError so callers see a consistent error type, not a bare decode/IO traceback. - Validate that PresetManifest YAML root is a mapping (extensions.py already had this; presets.py was missing it). Treat None as {} for empty-file compatibility. - Tighten the negative regression tests to assert the specific message, and add a non-mapping-root test for PresetManifest matching the existing one for ExtensionManifest.
There was a problem hiding this comment.
Pull request overview
Fixes Windows-specific crashes when loading UTF-8 extension.yml / preset.yml by ensuring manifests are read with an explicit UTF-8 encoding, and adds regression coverage around non-ASCII content and invalid UTF-8 bytes.
Changes:
- Read extension/preset YAML manifests with
encoding="utf-8"to avoid Windows locale/codepage decoding issues. - Convert
UnicodeDecodeError(and other read failures) into manifest validation errors with clearer messaging. - Add tests covering UTF-8 non-ASCII descriptions and invalid UTF-8 byte sequences for both extensions and presets.
Show a summary per file
| File | Description |
|---|---|
src/specify_cli/extensions.py |
Reads extension.yml using explicit UTF-8 and wraps decode/read errors as ValidationError. |
src/specify_cli/presets.py |
Reads preset.yml using explicit UTF-8 and wraps decode/read errors as PresetValidationError. |
tests/test_extensions.py |
Adds regression tests for UTF-8 non-ASCII manifest content and invalid UTF-8 bytes. |
tests/test_presets.py |
Adds regression tests for UTF-8 non-ASCII manifest content, invalid UTF-8 bytes, and non-mapping YAML roots. |
Copilot's findings
Tip
Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Files reviewed: 4/4 changed files
- Comments generated: 0
|
Thank you! |
Upstream changes (22 commits): - fix: include --from git+... in upgrade hint to avoid PyPI squat package (github#2411) - fix: dispatch opencode commands via run (github#2410) - feat: add catalog discovery CLI commands (github#2360) - fix(extensions): use explicit UTF-8 encoding when reading manifest YAML (github#2370) - feat: Speckit preset fiction book v1.7 - Support for RAG (Chroma DB) (github#2367) - chore: release 0.8.2, begin 0.8.3.dev0 development (github#2397) - Catalog updates: security review v1.3.0, v-model v0.6.0, threatmodel, isaqb-architecture-governance, m365, MarkItDown Fork customizations preserved: - Fork package name and version (agentic-sdlc-specify-cli) - skill_app integration from cli_customization - Bundled extensions and presets
Summary
Fixes #2325 —
specify extension addcrashes on Windows withUnicodeDecodeError: 'gbk' codec can't decode byte ...when
extension.yml/preset.ymlcontains non-ASCII content (e.g., Chinese indescription).Root cause
ExtensionManifest._load_yaml(src/specify_cli/extensions.py:142) andPresetManifest._load_yaml(
src/specify_cli/presets.py:139) callopen(path, 'r')without explicit encoding. On Windows, Python uses the systemlocale (GBK on Chinese Windows), which fails on UTF-8 manifests.
Fix
Add
encoding='utf-8'to bothopen()calls — matching convention already used inintegrations/catalog.py:449,workflows/engine.py:63, and everyread_text(encoding="utf-8")callsite.Test plan