17 changes: 17 additions & 0 deletions .auxiliary/notes/detextive-bugs.md
@@ -0,0 +1,17 @@
# Detextive Issues

## Binary Data Decoded as UTF-16-LE

**Issue**: Detextive incorrectly decodes certain binary data as UTF-16-LE text.

**Example**: A file containing alternating bytes `0xFF 0x00` repeated (i.e., `bytes([0xFF, 0x00] * 52)`) is successfully detected as having charset `utf-16-le` and decoded as text, producing a string of repeated `ÿ` characters.
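The false positive is easy to reproduce with Python's own codec machinery, independent of Detextive; this sketch shows why the byte pattern looks like valid UTF-16-LE:

```python
# Alternating 0xFF 0x00 bytes form valid UTF-16-LE: each little-endian
# pair decodes to code unit 0x00FF, which is 'ÿ'.
payload = bytes([0xFF, 0x00] * 52)
text = payload.decode('utf-16-le')
assert text == '\xff' * 52
# Because the decode succeeds, any detector that validates a candidate
# charset via trial decode will accept this binary blob as text.
```

This is why a trial-decode heuristic alone cannot distinguish this binary pattern from genuine UTF-16-LE text.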

**Impact**: Binary files that should be rejected are accepted as valid text files. While this is not a security risk in most cases (the "decoded" content is gibberish), it means mimeogram may accept files that are not genuinely textual.

**Workaround**: Tests have been updated to use binary files with more recognizable headers (like PE executables with `MZ` magic bytes) that Detextive properly rejects. These files cause decode failures even when a charset is detected.
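The workaround's reasoning can be sketched with a hypothetical PE-style header stub (illustrative bytes, not a complete executable): the `0x90` byte cannot begin a UTF-8 sequence, so a trial decode fails outright instead of producing gibberish.

```python
# Minimal DOS/PE-style header stub: 'MZ' magic followed by bytes that
# are invalid in UTF-8 (0x90 is a continuation byte and cannot start
# a UTF-8 sequence).
pe_stub = b'MZ\x90\x00\x03\x00\x00\x00'
try:
    pe_stub.decode('utf-8')
except UnicodeDecodeError:
    print('rejected as text')  # this branch is taken
```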

**Status**: This is a general limitation of charset-detection algorithms: alternating binary patterns can appear to match multi-byte encodings such as UTF-16. The issue should be reported to the Detextive project for potential improvement of its validation heuristics.

**Related Tests**:
- `test_410_application_x_security`: Updated to check for truly dangerous files only
- `test_520_nontextual_mime`: Updated to use PE executable header instead of simple binary pattern
78 changes: 78 additions & 0 deletions .auxiliary/notes/issues.md
@@ -0,0 +1,78 @@
# Known Issues

## CLI Parser Failure with tyro

**Discovered**: 2025-11-09 during Detextive 2.0 port verification

**Status**: Pre-existing issue (present before the Detextive 2.0 port)

**Severity**: Critical - CLI is completely non-functional

### Description

The mimeogram CLI fails to start with a tyro parser error:

```
AssertionError: UnsupportedStructTypeMessage(message="Empty hints for <slot wrapper '__init__' of '_io.TextIOWrapper' objects>!")
```

### Reproduction

```bash
hatch run mimeogram --help
# or any other command: version, create, apply, provide-prompt
```

### Analysis

The error originates from `tyro` attempting to parse the CLI structure and encountering a type that lacks proper type hints. The error occurs in:

```
File "/root/.local/share/hatch/env/.../tyro/_parsers.py", line 113, in from_callable_or_type
assert not isinstance(out, UnsupportedStructTypeMessage), out
```

The error mentions `_io.TextIOWrapper`, suggesting that somewhere in the command classes or their dependencies, there's a reference to stdin/stdout/stderr or file handles that tyro cannot introspect.
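The "Empty hints" message matches what Python introspection reports for that initializer. This check is an assumption about the failure mode, not a confirmed root cause: the C-level `__init__` exposes no annotations for a hint-driven parser like tyro to consume.

```python
import io

# _io.TextIOWrapper.__init__ is a C "slot wrapper": it carries no
# __annotations__ attribute, so hint-driven parsers see no type hints.
init = io.TextIOWrapper.__init__
hints = getattr(init, '__annotations__', {})
assert not hints  # no usable type hints for introspection
```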

### Timeline

- **Commit 556db71** (Merge PR #9 - appcore cutover): Error present
- **Commit fac1d9f** (Integrate detextive package): Error present
- **Commit c1401a1** (Port to Detextive 2.0): Error present
- **Commit 32a777f** (Fix linter errors): Error present

This indicates the issue was introduced during the appcore refactor (PR #9), not by the Detextive 2.0 port.

### Investigation Points

1. **appcore type annotations**: The issue likely stems from how `appcore` types are exposed to tyro
2. **CLI command definitions**: Check `cli.py`, `create.py`, `apply.py`, `prompt.py` for problematic type hints
3. **TextIOWrapper references**: Search for uses of `sys.stdin`, `sys.stdout`, `sys.stderr` that may need explicit typing

Confirmed uses in codebase:
- `sources/mimeogram/apply.py:134`: `__.sys.stdin.isatty()`
- `sources/mimeogram/apply.py:144`: `__.sys.stdin.read()`
- `sources/mimeogram/interactions.py:76`: `__.sys.stdout.flush()`
- `sources/mimeogram/display.py:60`: `__.sys.stdin.isatty()`
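For investigation point 3, one way to make the stream usage introspection-friendly is to funnel it through an explicitly annotated parameter. `read_payload` below is a hypothetical helper sketching the pattern, not code from mimeogram:

```python
import sys
import typing

def read_payload(stream: typing.TextIO = sys.stdin) -> str:
    ''' Reads piped input; returns empty string on an interactive TTY. '''
    if stream.isatty():
        return ''
    return stream.read()
```

The explicit `typing.TextIO` annotation keeps the concrete `_io.TextIOWrapper` type out of anything a CLI parser introspects, and the parameter makes the helper testable with an in-memory stream.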

### Suggested Fix

Based on the user's suggestion: switch to `emcd-appcore[cli]`, which likely includes the additional dependencies or type stubs that tyro needs to parse the CLI structure.

### Impact

- **Tests**: All 173 tests pass (tests don't exercise CLI parsing; they import modules directly)
- **Linters**: Pass cleanly (ruff and pyright)
- **Detextive integration**: Working correctly
- **CLI functionality**: Completely broken - cannot run any commands

### Workaround

None currently available. The application can be used programmatically by importing modules directly, but the CLI is unusable.

### Next Steps

1. Try switching dependency from `emcd-appcore~=1.4` to `emcd-appcore[cli]~=1.4`
2. If that doesn't resolve it, investigate the specific type annotation that tyro cannot parse
3. Consider adding explicit type annotations to any stdin/stdout/stderr usage
4. Report the issue to `tyro` if it turns out to be a limitation of its type introspection
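If step 1 pans out, the change would be a one-line edit in `pyproject.toml`. A sketch, assuming the `[cli]` extra exists in emcd-appcore:

```toml
[project]
dependencies = [
    # 'emcd-appcore~=1.4',      # current pin
    'emcd-appcore[cli]~=1.4',   # assumed extra pulling in CLI support
]
```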
4 changes: 1 addition & 3 deletions pyproject.toml
@@ -19,8 +19,7 @@ dependencies = [
'absence~=1.1',
'accretive~=4.1',
'aiofiles',
'chardet',
'detextive~=1.0',
'detextive~=2.0',
'dynadoc~=1.4',
'emcd-appcore~=1.4',
'exceptiongroup',
Expand All @@ -29,7 +28,6 @@ dependencies = [
'httpx',
'icecream-truck~=1.5',
'patiencediff',
'puremagic',
'pyperclip',
'python-dotenv', # TODO: Remove after cutover to appcore.
'readchar',
146 changes: 19 additions & 127 deletions sources/mimeogram/acquirers.py
@@ -78,13 +78,16 @@ async def _acquire_from_file( location: __.Path ) -> _parts.Part:
async with _aiofiles.open( location, 'rb' ) as f: # pyright: ignore
content_bytes = await f.read( )
except Exception as exc: raise ContentAcquireFailure( location ) from exc
mimetype, charset = _detect_mimetype_and_charset( content_bytes, location )
mimetype, charset = __.detextive.infer_mimetype_charset(
content_bytes, location = str( location ) )
if charset is None: raise ContentDecodeFailure( location, '???' )
linesep = _parts.LineSeparators.detect_bytes( content_bytes )
linesep = __.detextive.LineSeparators.detect_bytes( content_bytes )
if linesep is None:
_scribe.warning( f"No line separator detected in '{location}'." )
linesep = _parts.LineSeparators( __.os.linesep )
try: content = content_bytes.decode( charset )
linesep = __.detextive.LineSeparators( __.os.linesep )
try:
content = __.detextive.decode(
content_bytes, location = str( location ) )
except Exception as exc:
raise ContentDecodeFailure( location, charset ) from exc
_scribe.debug( f"Read file: {location}" )
@@ -105,21 +108,22 @@ async def _acquire_via_http(
response = await client.get( url )
response.raise_for_status( )
except Exception as exc: raise ContentAcquireFailure( url ) from exc
mimetype = (
response.headers.get( 'content-type', 'application/octet-stream' )
.split( ';' )[ 0 ].strip( ) )
http_content_type = response.headers.get( 'content-type' )
content_bytes = response.content
charset = response.encoding or _detect_charset( content_bytes )
mimetype, charset = __.detextive.infer_mimetype_charset(
content_bytes,
location = url,
http_content_type = http_content_type or __.absent )
if charset is None: raise ContentDecodeFailure( url, '???' )
if not _is_textual_mimetype( mimetype ):
mimetype, _ = (
_detect_mimetype_and_charset(
content_bytes, url, charset = charset ) )
linesep = _parts.LineSeparators.detect_bytes( content_bytes )
linesep = __.detextive.LineSeparators.detect_bytes( content_bytes )
if linesep is None:
_scribe.warning( f"No line separator detected in '{url}'." )
linesep = _parts.LineSeparators( __.os.linesep )
try: content = content_bytes.decode( charset )
linesep = __.detextive.LineSeparators( __.os.linesep )
try:
content = __.detextive.decode(
content_bytes,
location = url,
http_content_type = http_content_type or __.absent )
except Exception as exc:
raise ContentDecodeFailure( url, charset ) from exc
_scribe.debug( f"Fetched URL: {url}" )
@@ -157,102 +161,6 @@ def _collect_directory_files(
return paths


def _detect_charset( content: bytes ) -> str | None:
from chardet import detect
charset = detect( content )[ 'encoding' ]
if charset is None: return charset
if charset.startswith( 'utf' ): return charset
match charset:
case 'ascii': return 'utf-8' # Assume superset.
case _: pass
# Shake out false positives, like 'MacRoman'.
try: content.decode( 'utf-8' )
except UnicodeDecodeError: return charset
return 'utf-8'


def _detect_mimetype( content: bytes, location: str | __.Path ) -> str | None:
from mimetypes import guess_type
from puremagic import PureError, from_string # pyright: ignore
try: return from_string( content, mime = True )
except ( PureError, ValueError ):
return guess_type( str( location ) )[ 0 ]


def _detect_mimetype_and_charset(
content: bytes,
location: str | __.Path, *,
mimetype: __.Absential[ str ] = __.absent,
charset: __.Absential[ str ] = __.absent,
) -> tuple[ str, str | None ]:
from .exceptions import TextualMimetypeInvalidity
if __.is_absent( mimetype ):
mimetype_ = _detect_mimetype( content, location )
else: mimetype_ = mimetype
if __.is_absent( charset ): # noqa: SIM108
charset_ = _detect_charset( content )
else: charset_ = charset
if not mimetype_:
if charset_:
mimetype_ = 'text/plain'
_validate_mimetype_with_trial_decode(
content, location, mimetype_, charset_ )
return mimetype_, charset_
mimetype_ = 'application/octet-stream'
if _is_textual_mimetype( mimetype_ ):
return mimetype_, charset_
if charset_ is None:
raise TextualMimetypeInvalidity( location, mimetype_ )
_validate_mimetype_with_trial_decode(
content, location, mimetype_, charset_ )
return mimetype_, charset_


def _is_reasonable_text_content( content: str ) -> bool:
''' Checks if decoded content appears to be meaningful text. '''
if not content: return False
# Check for excessive repetition of single characters (likely binary)
if len( set( content ) ) == 1: return False
# Check for excessive control characters (excluding common whitespace)
common_whitespace = '\t\n\r'
ascii_control_limit = 32
control_chars = sum(
1 for c in content
if ord( c ) < ascii_control_limit and c not in common_whitespace )
if control_chars > len( content ) * 0.1: return False # >10% control chars
# Check for reasonable printable character ratio
printable_chars = sum(
1 for c in content if c.isprintable( ) or c in common_whitespace )
return printable_chars >= len( content ) * 0.8 # >=80% printable


# MIME types that are considered textual beyond those starting with 'text/'.
_TEXTUAL_MIME_TYPES = frozenset( (
'application/json',
'application/xml',
'application/xhtml+xml',
'application/x-perl',
'application/x-python',
'application/x-php',
'application/x-ruby',
'application/x-shell',
'application/javascript',
'image/svg+xml',
) )
# MIME type suffixes that indicate textual content.
_TEXTUAL_SUFFIXES = ( '+xml', '+json', '+yaml', '+toml' )
def _is_textual_mimetype( mimetype: str ) -> bool:
''' Checks if MIME type represents textual content. '''
_scribe.debug( f"MIME type: {mimetype}" )
if mimetype.startswith( ( 'text/', 'text/x-' ) ): return True
if mimetype in _TEXTUAL_MIME_TYPES: return True
if mimetype.endswith( _TEXTUAL_SUFFIXES ):
_scribe.debug(
f"MIME type '{mimetype}' accepted due to textual suffix." )
return True
return False


def _produce_fs_tasks(
location: str | __.Path, recursive: bool = False
) -> tuple[ __.cabc.Coroutine[ None, None, _parts.Part ], ...]:
@@ -277,19 +185,3 @@ async def _execute_session( ) -> _parts.Part:
) as client: return await _acquire_via_http( client, url )

return _execute_session( )


def _validate_mimetype_with_trial_decode(
content: bytes, location: str | __.Path, mimetype: str, charset: str
) -> None:
''' Validates charset fallback and returns appropriate MIME type. '''
from .exceptions import TextualMimetypeInvalidity
try: text = content.decode( charset )
except ( UnicodeDecodeError, LookupError ) as exc:
raise TextualMimetypeInvalidity( location, mimetype ) from exc
if _is_reasonable_text_content( text ):
_scribe.debug(
f"MIME type '{mimetype}' accepted after successful "
f"decode test with charset '{charset}' for '{location}'." )
return
raise TextualMimetypeInvalidity( location, mimetype )
2 changes: 1 addition & 1 deletion sources/mimeogram/formatters.py
@@ -45,7 +45,7 @@ def format_mimeogram(
location = 'mimeogram://message',
mimetype = 'text/plain', # TODO? Markdown
charset = 'utf-8',
linesep = _parts.LineSeparators.LF,
linesep = __.detextive.LineSeparators.LF,
content = message )
lines.append( format_part( message_part, boundary ) )
for part in parts:
8 changes: 5 additions & 3 deletions sources/mimeogram/parsers.py
@@ -109,17 +109,19 @@ def _parse_descriptor_and_content(


_QUOTES = '"\''
def _parse_mimetype( header: str ) -> tuple[ str, str, _parts.LineSeparators ]:
def _parse_mimetype(
header: str
) -> tuple[ str, str, __.detextive.LineSeparators ]:
''' Extracts MIME type and charset from Content-Type header. '''
parts = [ p.strip( ) for p in header.split( ';' ) ]
mimetype = parts[ 0 ]
charset = 'utf-8'
linesep = _parts.LineSeparators.LF
linesep = __.detextive.LineSeparators.LF
for part in parts[ 1: ]:
if part.startswith( 'charset=' ):
charset = part[ 8: ].strip( _QUOTES )
if part.startswith( 'linesep=' ):
linesep = _parts.LineSeparators[
linesep = __.detextive.LineSeparators[
part[ 8: ].strip( _QUOTES ).upper( ) ]
return mimetype, charset, linesep

44 changes: 1 addition & 43 deletions sources/mimeogram/parts.py
@@ -25,48 +25,6 @@
from . import fsprotect as _fsprotect


class LineSeparators( __.enum.Enum ):
''' Line separators for various platforms. '''

CR = '\r' # Classic MacOS
CRLF = '\r\n' # DOS/Windows
LF = '\n' # Unix/Linux

@classmethod
def detect_bytes(
selfclass, content: bytes, limit = 1024
) -> "LineSeparators | None":
''' Detects newline characters in bytes array. '''
sample = content[ : limit ]
found_cr = False
for byte in sample:
match byte:
case 0xd:
if found_cr: return selfclass.CR
found_cr = True
case 0xa: # linefeed
if found_cr: return selfclass.CRLF
return selfclass.LF
case _:
if found_cr: return selfclass.CR
return None

@classmethod
def normalize_universal( selfclass, content: str ) -> str:
''' Normalizes all varieties of newline characters in text. '''
return content.replace( '\r\n', '\r' ).replace( '\r', '\n' )

def nativize( self, content: str ) -> str:
''' Nativizes specific variety newline characters in text. '''
if LineSeparators.LF is self: return content
return content.replace( '\n', self.value )

def normalize( self, content: str ) -> str:
''' Normalizes specific variety newline characters in text. '''
if LineSeparators.LF is self: return content
return content.replace( self.value, '\n' )


class Resolutions( __.enum.Enum ):
''' Available resolutions for each part. '''

@@ -79,7 +37,7 @@ class Part( __.immut.DataclassObject ):
location: str # TODO? 'Url' class
mimetype: str
charset: str
linesep: "LineSeparators"
linesep: __.detextive.LineSeparators
content: str

# TODO? 'format' method
2 changes: 1 addition & 1 deletion sources/mimeogram/updaters.py
@@ -182,7 +182,7 @@ async def _update_content_atomic(
location: __.Path,
content: str,
charset: str = 'utf-8',
linesep: _parts.LineSeparators = _parts.LineSeparators.LF
linesep: __.detextive.LineSeparators = __.detextive.LineSeparators.LF
) -> None:
''' Updates file content atomically, if possible. '''
import aiofiles.os as os # noqa: PLR0402