unicode imports #3119
Phew, a fairly big chunk of changes. Since the C/C++ Unicode identifiers are worth some debate, I would like to ask you to exclude them from this PR.
Cython/Compiler/Main.py
Outdated
@@ -433,6 +446,16 @@ def create_default_resultobj(compilation_source, options):
def run_pipeline(source, options, full_module_name=None, context=None):
    from . import Pipeline

    # ensure that the inputs are unicode (for Python 2)
    try:
        source = source.decode("utf-8")
Shouldn't the source always be Unicode already? (If not, then that's a bug, and this is not the right place to fix it.)
`source` here is the source filename, not the source code.
Ah, ok. Then there's the file system encoding for that. Not all file systems use utf-8. See Utils.decode_filename().
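To illustrate the point about the file system encoding, here is a minimal sketch of decoding a filename with `sys.getfilesystemencoding()` rather than a hard-coded "utf-8". The function name `decode_filename_sketch` is hypothetical and this is not the code of `Utils.decode_filename()`; it only shows the general idea.

```python
import sys


def decode_filename_sketch(filename):
    """Return `filename` as text, decoding bytes with the file system encoding.

    Hypothetical helper for illustration only; not Cython's implementation.
    """
    if isinstance(filename, bytes):
        # Fall back to the default encoding if no file system encoding is reported.
        encoding = sys.getfilesystemencoding() or sys.getdefaultencoding()
        try:
            return filename.decode(encoding)
        except UnicodeDecodeError:
            # Leave undecodable names untouched rather than guessing.
            return filename
    # Already text: nothing to do.
    return filename
```

Text input is passed through unchanged, so calling it on a value that is already unicode is harmless.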
Done. Should "module_name" be handled the same way? I assume it shouldn't, but it is usually derived from the filename.
Also add a test-case. (Also add a test case for passing keyword arguments, which already worked, but is good to have as a test case).
When they are to be used internally the name is mangled with punycode as normal. When they are to be used externally (e.g. "cdef public" or "cdef extern from") the name is taken exactly as-is and simply slash-escaped ("\uNNNN"). The vast majority of C compilers are capable of dealing with \uNNNN characters in literal names.
(Only valid under Python 3)
Added tests and a small fix
+ a few other small errors in generation of module code
This reverts commit 463c7f9. Removes the more controversial unicode C identifiers, but leaves unicode modules
Also, handle generated .h files properly
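As a rough illustration of the "\uNNNN" slash-escaping mentioned in the commit messages above, the following sketch turns a unicode identifier into a C universal-character-name form. The function name `escape_identifier_for_c` is hypothetical, and this is not the code from the PR; it assumes the standard C forms `\uXXXX` (four hex digits) and `\UXXXXXXXX` (eight hex digits).

```python
def escape_identifier_for_c(name):
    """Escape non-ASCII characters as C universal character names.

    Hypothetical helper for illustration; no mangling or normalization is done.
    """
    parts = []
    for ch in name:
        code_point = ord(ch)
        if code_point < 128:
            # Plain ASCII is emitted unchanged.
            parts.append(ch)
        elif code_point <= 0xFFFF:
            # Characters in the Basic Multilingual Plane use the 4-digit form.
            parts.append("\\u%04X" % code_point)
        else:
            # Everything else needs the 8-digit form.
            parts.append("\\U%08X" % code_point)
    return "".join(parts)


print(escape_identifier_for_c("caf\u00e9"))  # prints caf\u00E9
```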
045a853 to 98df21b
I've moved the C stuff out of this PR - I'll make a new PR with it in at some point. I think I've addressed most of the issues raised.
Cython/Compiler/ModuleNode.py
Outdated
from .Pythran import has_np_pythran


def replace_suffix(path, newsuf):
    x = utils_replace_suffix(path, newsuf)
What is `x` here?
I think it's generally not a good idea to reuse the name of an imported function and then let it do something else. This function apparently does something specific to file names. That should be reflected in its name.
Cython/Compiler/ModuleNode.py
Outdated

def replace_suffix(path, newsuf):
    x = utils_replace_suffix(path, newsuf)
    return encoded_string_or_bytes_literal(x, sys.getfilesystemencoding())
This line actually seems worth a helper function, e.g. as_encoded_filename(). Although it generally seems quite late to handle file name encodings at this point, on the way out.
Done. The problem is that Utils.replace_suffix doesn't seem to know about EncodedString/BytesLiteral (I could import them but it seems like a design decision that it doesn't use anything from "Compiler"), so it's quite hard to deal with the file name encoding earlier
Cython/Compiler/ModuleNode.py
Outdated
@@ -208,7 +216,14 @@ def h_entries(entries, api=0, pxd=0):
h_code.putln("/* It now returns a PyModuleDef instance instead of a PyModule instance. */")
h_code.putln("")
h_code.putln("#if PY_MAJOR_VERSION < 3")
h_code.putln("PyMODINIT_FUNC init%s(void);" % env.module_name)
try:
    env.module_name.encode("ascii")
What about adding a name_is_ascii() helper function to Utils.py that try-encodes in Py2 and just calls isascii() in Py3? That would avoid all these repeated try-except usages.
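A minimal sketch of what such a helper might look like, assuming the input is a text (unicode) string. The name `name_is_ascii` comes from the suggestion above; the body is an assumption, not code from the PR. Note that `str.isascii()` only exists from Python 3.7, so the sketch falls back to try-encoding wherever it is missing.

```python
def name_is_ascii(name):
    """Return True if the text string `name` contains only ASCII characters.

    Illustrative sketch; assumes `name` is text, not bytes.
    """
    isascii = getattr(name, "isascii", None)
    if isascii is not None:
        # Available on str from Python 3.7 onwards.
        return isascii()
    try:
        name.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True
```

A call site would then read `if name_is_ascii(env.module_name): ...` instead of repeating the try-except block.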
I've added an isascii method to EncodedString and BytesLiteral instead (for Py2 only). I haven't rewritten the punycoding code since this is taken from PEP 489 and it seems best to match that.
Cython/Compiler/ModuleNode.py
Outdated
@@ -230,7 +245,9 @@ def generate_public_declaration(self, entry, h_code, i_code):
        entry.type.declaration_code(entry.cname, pyrex=1)))

def api_name(self, env):
I would allow a prefix argument here and pass it through. That makes the concatenation more obvious.
Cython/Compiler/ModuleNode.py
Outdated
try:
    env.module_name.encode('ascii')
except UnicodeEncodeError:
    py2_mod_name = env.module_name.encode("ascii", errors="ignore").decode("utf8")
What if there are no non-ascii characters at all? I think this deserves at least a comment right here that the compilation is intended to fail completely, further down. And/or, rename the no_py2 flag to fail_compilation_in_py2, and set it before this line instead of after it.
Cython/Compiler/ModuleNode.py
Outdated
@@ -2665,13 +2699,14 @@ def generate_module_import_setup(self, env, code):
fq_module_name = self.full_module_name
if fq_module_name.endswith('.__init__'):
    fq_module_name = fq_module_name[:-len('.__init__')]
fq_module_name_cstring = EncodedString(fq_module_name).as_c_string_literal()
As mentioned above, self.full_module_name should be an EncodedString already (for consistency reasons), and then it only needs to be rewrapped above if it actually gets modified here.
@@ -39,6 +39,7 @@
    from cStringIO import StringIO
except ImportError:
    from io import StringIO
import sys
Left-over?

# For Python 2 and Python <= 3.4 just run pyx->c;
# don't compile the C file
modules = cythonize(files)
👍
Updated again to address your comments.
Very nice, thanks for keeping up the good work!
I'm still getting failures on Windows, even after resolving several encoding issues and whatnot: @da-woods, could you try to find out what's going wrong there?
I'll have a look...
@scoder I think it's a distutils bug (or maybe setuptools) rather than a Cython bug. When I run it on Windows I get:
The key bit is Relevant code is in... CPython bug report: https://bugs.python.org/issue39432. However, in the short term this looks like something we could patch in
@scoder A really hacky patch for the Cython side of it:
This goes immediately before this line: cython/Cython/Build/Dependencies.py Line 975 in 074362b
This hopefully "finishes" the support for unicode identifiers. It builds off #3096, so I don't think there's much point in looking at it before that's finalized.

It adds 1 feature (originally 2):

Support for unicode identifiers in C/C++ features such as structs and cppclasses (since moved out of this PR). For structs used purely in Python I've mangled the names with punycode. For features that are exported/imported to C with "public" or "extern", I've translated the names to be \uXXXX escaped (without any mangling or normalization). Pretty much every modern C/C++ compiler supports unicode in identifiers in this form (only Clang supports it in raw form I think), so this seems like the most compatible thing to do. I've trusted that the user knows what names they want and not performed any normalization for these (I don't think normalization is yet defined in C/C++ standards, so it's hard to do anything else).

Support for import of modules with unicode in their names. As in PEP 489 this is only supported from Python 3.5 onwards (but the Cython translation step should work fine in earlier versions...). The Python 2 filename handling in Cython.Build seems a bit fragile with unicode filenames (since most os.path functions require bytes) - I haven't made any serious attempt to fix that beyond getting my test cases to work.
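For reference, the init-function naming scheme from PEP 489 that the unicode module support relies on can be sketched as follows. This follows the pseudo-code in the PEP (ASCII names get a `PyInit_` prefix, other names get `PyInitU_` plus the punycode form with hyphens replaced by underscores); the function name `initfunc_name` is only illustrative and this is not Cython's actual implementation.

```python
def initfunc_name(module_name):
    """Return the C init function name for `module_name`, per PEP 489.

    Illustrative sketch only; assumes `module_name` is a text string
    holding the last component of the module's dotted name.
    """
    try:
        # Pure-ASCII names keep the traditional PyInit_<name> form.
        suffix = "_" + module_name.encode("ascii").decode("ascii")
    except UnicodeEncodeError:
        # Non-ASCII names are punycode-encoded, with "-" mapped to "_"
        # so that the result is a valid C identifier.
        encoded = module_name.encode("punycode").decode("ascii")
        suffix = "U_" + encoded.replace("-", "_")
    return "PyInit" + suffix


print(initfunc_name("spam"))       # prints PyInit_spam
print(initfunc_name("caf\u00e9"))  # prints PyInitU_caf_dma
```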