unicode imports #3119

da-woods · 2019-09-01T18:09:59Z

This hopefully "finishes" the support for unicode identifiers. It builds off #3096, so I don't think there's much point in looking at it before that's finalized.

It adds ~~2 features~~ 1 feature:

Support for unicode identifiers in C/C++ features such as structs and cppclasses. For structs used purely in Python I've mangled the names with punycode. For features that are exported/imported to C with "public" or "extern", I've translated the names to be \uXXXX escaped (without any mangling or normalization). Pretty much every modern C/C++ compiler supports unicode in identifiers in this form (only Clang supports it in raw form I think), so this this seems like the most compatible thing to do. I've trusted that the user knows what names they want and not performed any normalization for these (I don't think normalization is yet defined in C/C++ standards, so it's hard to do anything else).
Support for import of modules with unicode in their names. As in PEP 489 this is only supported from Python 3.5 onwards (but the Cython translation step should work fine in earlier versions...). The Python 2 filename handling in Cython.Build seems a bit fragile with unicode filenames (since most os.path functions require bytes) - I haven't made any serious attempt to fix that beyond getting my test-cases to work.

scoder

Phew, a fairly big chunk of changes. Since the C/C++ Unicode identifiers are worth some debate, I would like to ask you to exclude them from this PR.

Cython/Build/Dependencies.py

Cython/Compiler/Main.py

scoder · 2019-09-24T19:52:44Z

Cython/Compiler/Main.py

@@ -433,6 +446,16 @@ def create_default_resultobj(compilation_source, options):
 def run_pipeline(source, options, full_module_name=None, context=None):
    from . import Pipeline

+    # ensure that the inputs are unicode (for Python 2)
+    try:
+        source = source.decode("utf-8")


Shouldn't the source always be Unicode already? (If not, then that's a bug, and this is not the right place to fix it.)

source here is the source filename, not the source code.

Ah, ok. Then there's the file system encoding for that. Not all file systems use utf-8. See Utils.decode_filename().

Done - should "module_name" be the same? I assume it shouldn't, but it is derived from the filename usually

Cython/StringIOTree.py

Cython/Utility/CConvert.pyx

tests/run/unicode_identifiers.pyx

tests/run/unicode_imports.srctree

Also add a test-case. (Also add a test case for passing keyword arguments, which already worked, but is good to have as a test case).

When they are to be used internally the name is mangled with punycode as normal. When they are to be used externally (e.g. "cdef public" or "cdef from extern") the name is taken exactly as-is and simply slash-escaped ("\uNNNN"). The vast majority of C compilers are capable to dealing with \uNNNN characters in literal names.

(Only valid under Python 3)

Added tests and a small fix

+ a few other small errors in generation of module code

This reverts commit 463c7f9. Removes the more controversal unicode C identifiers, but leaves unicode modules

Also, handle generated .h files properly

da-woods · 2019-09-25T12:25:02Z

I've moved the C stuff out of this PR - I'll make a new PR with it in at some point.

I think I've addressed most of the issues raised

scoder · 2019-09-26T17:01:18Z

Cython/Compiler/Main.py

@@ -433,6 +446,16 @@ def create_default_resultobj(compilation_source, options):
 def run_pipeline(source, options, full_module_name=None, context=None):
    from . import Pipeline

+    # ensure that the inputs are unicode (for Python 2)
+    try:
+        source = source.decode("utf-8")


Ah, ok. Then there's the file system encoding for that. Not all file systems use utf-8. See Utils.decode_filename().

scoder · 2019-09-26T17:02:45Z

Cython/Compiler/ModuleNode.py

 from .Pythran import has_np_pythran

+
+def replace_suffix(path, newsuf):
+    x = utils_replace_suffix(path, newsuf)


What is x here?
I think it's generally not a good idea to reuse the name of an imported function and then let it do something else. This function apparently does something specific to file names. That should be reflected in its name.

scoder · 2019-09-26T17:09:34Z

Cython/Compiler/ModuleNode.py

+
+def replace_suffix(path, newsuf):
+    x = utils_replace_suffix(path, newsuf)
+    return encoded_string_or_bytes_literal(x, sys.getfilesystemencoding())


This line actually seems worth a helper function, e.g. as_encoded_filename(). Although it generally seems quite late to handle file name encodings at this point, on the way out.

Done. The problem is that Utils.replace_suffix doesn't seem to know about EncodedString/BytesLiteral (I could import them but it seems like a design decision that it doesn't use anything from "Compiler"), so it's quite hard to deal with the file name encoding earlier

scoder · 2019-09-28T13:43:14Z

Cython/Compiler/ModuleNode.py

@@ -208,7 +216,14 @@ def h_entries(entries, api=0, pxd=0):
            h_code.putln("/* It now returns a PyModuleDef instance instead of a PyModule instance. */")
            h_code.putln("")
            h_code.putln("#if PY_MAJOR_VERSION < 3")
-            h_code.putln("PyMODINIT_FUNC init%s(void);" % env.module_name)
+            try:
+                env.module_name.encode("ascii")


What about adding a name_is_ascii() helper function to Utils.py that try-encodes in Py2 and just calls isascii() in Py3? That would avoid all these repeated try-except usages.

I've added an isascii method to EncodedString and BytesLiteral instead (for Py2 only). I haven't rewritten the punycoding code since this is taken from PEP489 and it seems best to match that.

scoder · 2019-09-28T13:47:49Z

Cython/Compiler/ModuleNode.py

@@ -230,7 +245,9 @@ def generate_public_declaration(self, entry, h_code, i_code):
                entry.type.declaration_code(entry.cname, pyrex=1)))

    def api_name(self, env):


I would allow a prefix argument here and pass it through. That makes the concatenation more obvious.

scoder · 2019-09-28T13:58:53Z

Cython/Compiler/ModuleNode.py

+        try:
+            env.module_name.encode('ascii')
+        except UnicodeEncodeError:
+            py2_mod_name = env.module_name.encode("ascii", errors="ignore").decode("utf8")


What if there are no non-ascii characters at all? I think this deserves at least a comment right here that the compilation is intended to fail completely, further down. And/or, rename the no_py2 flag to fail_compilation_in_py2, and set it before this line instead of after it.

scoder · 2019-09-28T14:03:48Z

Cython/Compiler/ModuleNode.py

@@ -2665,13 +2699,14 @@ def generate_module_import_setup(self, env, code):
        fq_module_name = self.full_module_name
        if fq_module_name.endswith('.__init__'):
            fq_module_name = fq_module_name[:-len('.__init__')]
+        fq_module_name_cstring = EncodedString(fq_module_name).as_c_string_literal()


As mentioned above, self.full_module_name should be an EncodedString already (for consistency reasons), and then it only needs to be rewrapped above if it actually gets modified here.

scoder · 2019-09-28T14:09:00Z

Cython/StringIOTree.py

@@ -39,6 +39,7 @@
    from cStringIO import StringIO
 except ImportError:
    from io import StringIO
+import sys


Cython/Utility/CConvert.pyx

scoder · 2019-09-28T14:12:25Z

tests/run/unicode_imports.srctree

+
+# For Python 2 and Python <= 3.4 just run pyx->c; 
+# don't compile the C file
+modules = cythonize(files)


da-woods · 2019-09-29T10:30:06Z

Updated again to address your comments.

scoder

Very nice, thanks for keeping up the good work!

Cython/Compiler/StringEncoding.py

scoder · 2020-01-22T20:14:11Z

I'm still getting failures on Windows, even after resolving several encoding issues and what not:
https://ci.appveyor.com/project/cython/cython/builds/29947270
Here's the branch that I've been working with:
#3194

@da-woods, could you try to find out what's going wrong there?

da-woods · 2020-01-22T20:31:52Z

I'll have a look...

da-woods · 2020-01-23T13:10:24Z

@scoder I think it's a distutils bug (or maybe setuptools) rather than a Cython bug. When I run it on Windows I get:

C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.24.28314\bin\HostX86\x64\link.exe /nologo /INCREMENTAL:NO /LTCG /DLL /MANIFEST:EMBED,ID=2 /MANIFESTUAC:NO /LIBPATH:C:\Users\XXXXX\Miniconda3\envs\py37\libs /LIBPATH:C:\Users\XXXXX\Miniconda3\envs\py37\PCbuild\amd64 "/LIBPATH:C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.24.28314\lib\x64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\10\lib\10.0.18362.0\ucrt\x64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\10\lib\10.0.18362.0\um\x64" /EXPORT:PyInit_mymoð build\temp.win-amd64-3.7\Release\mymoð.obj /OUT:C:\Users\XXXXX\Documents\cython-fix_srctree_tests_on_windows\TEST_TMP\run\unicode_imports\mymoð.cp37-win_amd64.pyd /IMPLIB:build\temp.win-amd64-3.7\Release\mymoð.cp37-win_amd64.lib
LINK : error LNK2001: unresolved external symbol PyInit_mymoð
build\temp.win-amd64-3.7\Release\mymoð.cp37-win_amd64.lib : fatal error LNK1120: 1 unresolved externals

The key bit is /EXPORT:PyInit_mymoð - distutils/setuptools adds this to ensure that the PyInit_ symbol has the correct linkage (presumably incase PyMODINIT_FUNC has been forgotten). However, for a unicode filename the symbol is PyInitU_<punicodified_something>, so you get a link error.

Relevant code is in...
https://github.com/python/cpython/blob/0d30ae1a03102de07758650af9243fd31211325a/Lib/distutils/command/build_ext.py#L686

CPython bug report: https://bugs.python.org/issue39432. However, in the short-term this looks like something we could patch in cythonize so I'll have a go at doing that.

da-woods · 2020-01-23T13:38:29Z

@scoder A really hacky patch for the Cython side of it:

    class DummyExportSymbols(list):
        # export_symbols for unicode filename cause link errors on Windows
        # Cython doesn't need them (it already defines PyInit with the correct linkage)
        # so use this class as a temporary fix to stop them from being generated
        def __init__(self):
            super(DummyExportSymbols, self).__init__() # empty list
        def __contains__(self, val):
            # so distutils doesn't "helpfully" add PyInit_<name>
            return True
    for m in module_list:
        m.export_symbols = DummyExportSymbols()

This goes immediately before this line:

cython/Cython/Build/Dependencies.py

Line 975 in 074362b

deps = create_dependency_tree(ctx, quiet=quiet)

Posting it here because I don't know how to submit a PR against your branch.

scoder requested changes Sep 24, 2019

View reviewed changes

da-woods and others added 8 commits September 25, 2019 11:45

Handle normalization of unicode identifiers

7e43627

Also add a test-case. (Also add a test case for passing keyword arguments, which already worked, but is good to have as a test case).

Support unicode characters in module names

483a779

(Only valid under Python 3)

Unicode structs -> dict

5dbd068

Added tests and a small fix

Fixed breakage of single-phase init

f888e8a

+ a few other small errors in generation of module code

Fixed broken building using Extension on Python2

27f9ab8

Revert "Allow unicode named structs/other C items"

165c03e

This reverts commit 463c7f9. Removes the more controversal unicode C identifiers, but leaves unicode modules

Made requested changes from code review

98df21b

Also, handle generated .h files properly

da-woods force-pushed the unicode_identifiers_cstruct branch from 045a853 to 98df21b Compare September 25, 2019 12:23

da-woods changed the title ~~(more) unicode identifiers and imports~~ unicode imports Sep 25, 2019

da-woods mentioned this pull request Sep 25, 2019

Unicode identifiers (C structures) #3156

Open

scoder reviewed Sep 28, 2019

View reviewed changes

Changes from code review

1f8db88

Fixed isascii on Python 3.(<7)

da989d5

scoder approved these changes Sep 30, 2019

View reviewed changes

Cython/Compiler/StringEncoding.py Show resolved Hide resolved

scoder merged commit 074362b into cython:master Sep 30, 2019

da-woods deleted the unicode_identifiers_cstruct branch September 30, 2019 19:47

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

unicode imports #3119

unicode imports #3119

da-woods commented Sep 1, 2019 •

edited

Loading

scoder left a comment

scoder Sep 24, 2019

da-woods Sep 25, 2019

scoder Sep 26, 2019

da-woods Sep 29, 2019

da-woods commented Sep 25, 2019

scoder Sep 26, 2019

scoder Sep 26, 2019

scoder Sep 26, 2019

da-woods Sep 29, 2019

scoder Sep 28, 2019

da-woods Sep 29, 2019

scoder Sep 28, 2019

scoder Sep 28, 2019

scoder Sep 28, 2019

scoder Sep 28, 2019

scoder Sep 28, 2019

da-woods commented Sep 29, 2019

scoder left a comment

scoder commented Jan 22, 2020

da-woods commented Jan 22, 2020

da-woods commented Jan 23, 2020 •

edited

Loading

da-woods commented Jan 23, 2020 •

edited

Loading

		@@ -230,7 +245,9 @@ def generate_public_declaration(self, entry, h_code, i_code):
		entry.type.declaration_code(entry.cname, pyrex=1)))

		def api_name(self, env):

unicode imports #3119

unicode imports #3119

Conversation

da-woods commented Sep 1, 2019 • edited Loading

scoder left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

da-woods commented Sep 25, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

da-woods commented Sep 29, 2019

scoder left a comment

Choose a reason for hiding this comment

scoder commented Jan 22, 2020

da-woods commented Jan 22, 2020

da-woods commented Jan 23, 2020 • edited Loading

da-woods commented Jan 23, 2020 • edited Loading

da-woods commented Sep 1, 2019 •

edited

Loading

da-woods commented Jan 23, 2020 •

edited

Loading

da-woods commented Jan 23, 2020 •

edited

Loading