retype doesn't check source encodings before parsing #2

Closed
movermeyer opened this Issue Jun 7, 2017 · 1 comment


movermeyer commented Jun 7, 2017

I was trying to retype a file, but the file had a comment that contained Unicode characters.

Example file (core.py):

#This is a comment with unicode characters: "Афон"
foo = "bar"

Example stub (types/core.pyi):

foo = ... # type: str

Command and output:

$>retype --traceback core.py
error: core.py: 'charmap' codec can't decode byte 0x90 in position 72: character maps to <undefined>
Traceback (most recent call last):
  File "retype.py", line 110, in retype_path
    retype_file(src, pyi_dir, targets, quiet=quiet, hg=hg)
  File "retype.py", line 131, in retype_file
    src_txt = src_file.read()
  File "Python36-32\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 72: character maps to <undefined>

I would have expected retype not to crash on files containing non-ASCII characters, especially when those characters appear only in comments.

Tested using:

  • Python 3.6.1 (v3.6.1:69c0db5, Mar 21 2017, 17:54:52) [MSC v.1900 32 bit (Intel)] on win32

movermeyer commented Jun 8, 2017

The code that reads the source file (see the src_file.read() call in the traceback above) doesn't set an encoding, so Python falls back to a platform-specific default encoding.

On Windows (10), this is cp1252, which has no mapping for byte 0x90 and so cannot decode the UTF-8 encoded Cyrillic characters in the comment.
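The failure can be reproduced on any platform without retype. The snippet below is only a sketch: the helper name try_decode is made up for illustration, and it decodes the UTF-8 bytes of the word from the example comment with both codecs.

```python
# UTF-8 bytes of the Cyrillic word from the example comment.
data = "Афон".encode("utf-8")


def try_decode(raw: bytes, encoding: str) -> str:
    """Return the decoded text, or the error message if decoding fails.

    Hypothetical helper for illustration; it is not part of retype.
    """
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"error: {exc}"


# UTF-8 round-trips the text; cp1252 fails on byte 0x90.
print(try_decode(data, "utf-8") == "Афон")
print(try_decode(data, "cp1252"))
```

The second byte of the UTF-8 encoding of "А" is 0x90, which Python's cp1252 codec leaves undefined, hence the 'charmap' error.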

  • According to PEP 3120, UTF-8 is the default source encoding for Python 3 source
  • According to PEP 263, ASCII is the default source encoding for Python 2 source

However, it is also possible for users to define their own source encoding using a comment syntax as defined in PEP 263.

So it seems that in an ideal world, retype would determine the encoding from these PEP 263 comment lines, defaulting to UTF-8.
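One way this could look, as a rough sketch rather than the change that actually landed in #4: the standard library's tokenize module already implements the PEP 263 lookup (BOM or coding comment in the first two lines, otherwise UTF-8). The function names read_source and declared_encoding are hypothetical and not part of retype.

```python
import tokenize


def read_source(path: str) -> str:
    """Read a Python source file using its declared encoding.

    tokenize.open() looks for a BOM or a PEP 263 coding comment and
    falls back to UTF-8 when neither is present.
    """
    with tokenize.open(path) as src_file:
        return src_file.read()


def declared_encoding(path: str) -> str:
    """Return the encoding name that tokenize detects for the file."""
    with open(path, "rb") as raw:
        # detect_encoding() expects a readline callable that returns bytes.
        encoding, _first_lines = tokenize.detect_encoding(raw.readline)
    return encoding
```

With something like this, read_source("core.py") would decode the example file as UTF-8 regardless of the platform's default encoding, and a file starting with a coding comment such as "# -*- coding: latin-1 -*-" would be decoded with that encoding instead.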

movermeyer added a commit to movermeyer/retype that referenced this issue Jun 9, 2017

movermeyer changed the title from "retype fails on files that contain Unicode characters" to "retype doesn't check source encodings before parsing" Jun 9, 2017

ambv closed this in #4 Jun 9, 2017

ambv added a commit that referenced this issue Jun 9, 2017
