Bug Description
When converting HTML to PDF with the convert_html_to_pdf tool, the conversion fails with a UTF-8 decoding error if the HTML file is not UTF-8 encoded.
Error Message
❌ Conversion failed: 'utf-8' codec can't decode byte 0x9a in position 0: invalid start byte
Current Behavior
- Tool assumes all HTML files are UTF-8 encoded
- No encoding detection or fallback
- Fails with cryptic error message
- File may be deleted after failed conversion (unclear if intentional)
Expected Behavior
- Auto-detect encoding: Try UTF-8, then fallback to other common encodings (GBK, Latin-1, etc.)
- Better error message: Tell user what went wrong and suggest solutions
- Graceful handling: Don't delete the source file on conversion failure
- Optional encoding parameter: Allow user to specify encoding if auto-detect fails
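The expected behavior above could look roughly like the following. This is a minimal sketch, not the tool's actual code: `read_html`, `HtmlDecodeError`, the `encoding` parameter, and the `DEFAULT_ENCODINGS` list are all hypothetical names chosen for illustration. The file is opened read-only and is never deleted or modified, whether decoding succeeds or fails.

```python
# Hypothetical helper: the names and the fallback list are assumptions,
# not part of the existing convert_html_to_pdf implementation.
DEFAULT_ENCODINGS = ("utf-8", "gbk", "cp1252")


class HtmlDecodeError(ValueError):
    """Raised when the HTML file cannot be decoded with any candidate encoding."""


def read_html(path, encoding=None, fallbacks=DEFAULT_ENCODINGS):
    """Return (text, encoding_used). The source file is only ever read."""
    with open(path, "rb") as f:
        raw = f.read()

    # An explicit encoding from the user overrides auto-detection entirely.
    candidates = (encoding,) if encoding else tuple(fallbacks)
    for name in candidates:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue

    # Say what was tried and what the user can do about it.
    raise HtmlDecodeError(
        f"Could not decode {path!r} as any of {list(candidates)}. "
        "Re-save the file as UTF-8 or pass encoding=... explicitly."
    )
```

Returning the encoding that succeeded lets the caller report it to the user, which helps when a fallback silently picks the wrong one.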
Reproduction Steps
- Create an HTML file with non-UTF-8 encoding (e.g., GBK, Windows-1252)
- Use the convert_html_to_pdf tool to convert it
- Observe: Conversion fails with UTF-8 decoding error
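To make the first step reproducible, a GBK-encoded test file can be generated with a few lines of Python. This is only a sketch: the helper names and the sample HTML are made up for illustration, and it shows the underlying decode failure directly instead of invoking the tool.

```python
def write_gbk_html(path):
    """Write a small HTML file encoded as GBK (sample content is arbitrary)."""
    html = "<html><body><p>你好，世界</p></body></html>"
    with open(path, "w", encoding="gbk") as f:
        f.write(html)


def fails_as_utf8(path):
    """Return True if reading the file as UTF-8 raises UnicodeDecodeError."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            f.read()
    except UnicodeDecodeError:
        return True
    return False
```

Calling `write_gbk_html("sample_gbk.html")` and then passing that file to the tool should trigger the error reported above; `fails_as_utf8` confirms that the file really is not valid UTF-8.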
Suggested Fix
# Sketch of encoding detection
def read_html_with_encoding_detection(file_path):
    # latin-1 maps every byte to a character and never raises, so it must
    # come last; any encoding listed after it would never be tried.
    encodings_to_try = ['utf-8', 'gbk', 'big5', 'cp1252', 'latin-1']
    for encoding in encodings_to_try:
        try:
            with open(file_path, 'r', encoding=encoding) as f:
                return f.read()
        except UnicodeDecodeError:
            continue
    # Only reachable if latin-1 is removed from the list above.
    raise ValueError(f"Unable to decode file with common encodings: {encodings_to_try}")
Priority
🔴 High - affects core file conversion functionality, especially for Chinese users (GBK encoding)
Related
- May affect other file reading tools that assume UTF-8
- Should be consistent with the read_document tool's encoding handling