Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unicode imports #3119

Merged
merged 10 commits into from
Sep 30, 2019
Merged

unicode imports #3119

merged 10 commits into from
Sep 30, 2019

Conversation

da-woods
Copy link
Contributor

@da-woods da-woods commented Sep 1, 2019

This hopefully "finishes" the support for unicode identifiers. It builds off #3096, so I don't think there's much point in looking at it before that's finalized.

It adds 2 features 1 feature:

  1. Support for unicode identifiers in C/C++ features such as structs and cppclasses. For structs used purely in Python I've mangled the names with punycode. For features that are exported/imported to C with "public" or "extern", I've translated the names to be \uXXXX escaped (without any mangling or normalization). Pretty much every modern C/C++ compiler supports unicode in identifiers in this form (only Clang supports it in raw form I think), so this this seems like the most compatible thing to do. I've trusted that the user knows what names they want and not performed any normalization for these (I don't think normalization is yet defined in C/C++ standards, so it's hard to do anything else).

  2. Support for import of modules with unicode in their names. As in PEP 489 this is only supported from Python 3.5 onwards (but the Cython translation step should work fine in earlier versions...). The Python 2 filename handling in Cython.Build seems a bit fragile with unicode filenames (since most os.path functions require bytes) - I haven't made any serious attempt to fix that beyond getting my test-cases to work.

Copy link
Contributor

@scoder scoder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Phew, a fairly big chunk of changes. Since the C/C++ Unicode identifiers are worth some debate, I would like to ask you to exclude them from this PR.

Cython/Build/Dependencies.py Show resolved Hide resolved
Cython/Compiler/Main.py Show resolved Hide resolved
Cython/Compiler/Main.py Show resolved Hide resolved
Cython/Compiler/Main.py Outdated Show resolved Hide resolved
@@ -433,6 +446,16 @@ def create_default_resultobj(compilation_source, options):
def run_pipeline(source, options, full_module_name=None, context=None):
from . import Pipeline

# ensure that the inputs are unicode (for Python 2)
try:
source = source.decode("utf-8")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't the source always be Unicode already? (If not, then that's a bug, and this is not the right place to fix it.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

source here is the source filename, not the source code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, ok. Then there's the file system encoding for that. Not all file systems use utf-8. See Utils.decode_filename().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done - should "module_name" be the same? I assume it shouldn't, but it is derived from the filename usually

Cython/StringIOTree.py Outdated Show resolved Hide resolved
Cython/Utility/CConvert.pyx Show resolved Hide resolved
tests/run/unicode_identifiers.pyx Outdated Show resolved Hide resolved
tests/run/unicode_imports.srctree Outdated Show resolved Hide resolved
tests/run/unicode_imports.srctree Outdated Show resolved Hide resolved
da-woods and others added 8 commits September 25, 2019 11:45
Also add a test-case. (Also add a test case for passing keyword
arguments, which already worked, but is good to have as a test
case).
When they are to be used internally the name is mangled with
punycode as normal. When they are to be used externally
(e.g. "cdef public" or "cdef from extern") the name is taken
exactly as-is and simply slash-escaped ("\uNNNN"). The
vast majority of C compilers are capable to dealing with
\uNNNN characters in literal names.
(Only valid under Python 3)
Added tests and a small fix
+ a few other small errors in generation of module code
This reverts commit 463c7f9.
Removes the more controversal unicode C identifiers, but leaves
unicode modules
Also, handle generated .h files properly
@da-woods
Copy link
Contributor Author

I've moved the C stuff out of this PR - I'll make a new PR with it in at some point.

I think I've addressed most of the issues raised

@da-woods da-woods changed the title (more) unicode identifiers and imports unicode imports Sep 25, 2019
@@ -433,6 +446,16 @@ def create_default_resultobj(compilation_source, options):
def run_pipeline(source, options, full_module_name=None, context=None):
from . import Pipeline

# ensure that the inputs are unicode (for Python 2)
try:
source = source.decode("utf-8")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, ok. Then there's the file system encoding for that. Not all file systems use utf-8. See Utils.decode_filename().

from .Pythran import has_np_pythran


def replace_suffix(path, newsuf):
x = utils_replace_suffix(path, newsuf)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is x here?
I think it's generally not a good idea to reuse the name of an imported function and then let it do something else. This function apparently does something specific to file names. That should be reflected in its name.


def replace_suffix(path, newsuf):
x = utils_replace_suffix(path, newsuf)
return encoded_string_or_bytes_literal(x, sys.getfilesystemencoding())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line actually seems worth a helper function, e.g. as_encoded_filename(). Although it generally seems quite late to handle file name encodings at this point, on the way out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. The problem is that Utils.replace_suffix doesn't seem to know about EncodedString/BytesLiteral (I could import them but it seems like a design decision that it doesn't use anything from "Compiler"), so it's quite hard to deal with the file name encoding earlier

@@ -208,7 +216,14 @@ def h_entries(entries, api=0, pxd=0):
h_code.putln("/* It now returns a PyModuleDef instance instead of a PyModule instance. */")
h_code.putln("")
h_code.putln("#if PY_MAJOR_VERSION < 3")
h_code.putln("PyMODINIT_FUNC init%s(void);" % env.module_name)
try:
env.module_name.encode("ascii")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about adding a name_is_ascii() helper function to Utils.py that try-encodes in Py2 and just calls isascii() in Py3? That would avoid all these repeated try-except usages.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added an isascii method to EncodedString and BytesLiteral instead (for Py2 only). I haven't rewritten the punycoding code since this is taken from PEP489 and it seems best to match that.

@@ -230,7 +245,9 @@ def generate_public_declaration(self, entry, h_code, i_code):
entry.type.declaration_code(entry.cname, pyrex=1)))

def api_name(self, env):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would allow a prefix argument here and pass it through. That makes the concatenation more obvious.

try:
env.module_name.encode('ascii')
except UnicodeEncodeError:
py2_mod_name = env.module_name.encode("ascii", errors="ignore").decode("utf8")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if there are no non-ascii characters at all? I think this deserves at least a comment right here that the compilation is intended to fail completely, further down. And/or, rename the no_py2 flag to fail_compilation_in_py2, and set it before this line instead of after it.

@@ -2665,13 +2699,14 @@ def generate_module_import_setup(self, env, code):
fq_module_name = self.full_module_name
if fq_module_name.endswith('.__init__'):
fq_module_name = fq_module_name[:-len('.__init__')]
fq_module_name_cstring = EncodedString(fq_module_name).as_c_string_literal()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned above, self.full_module_name should be an EncodedString already (for consistency reasons), and then it only needs to be rewrapped above if it actually gets modified here.

@@ -39,6 +39,7 @@
from cStringIO import StringIO
except ImportError:
from io import StringIO
import sys
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left-over?

Cython/Utility/CConvert.pyx Show resolved Hide resolved

# For Python 2 and Python <= 3.4 just run pyx->c;
# don't compile the C file
modules = cythonize(files)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@da-woods
Copy link
Contributor Author

Updated again to address your comments.

Copy link
Contributor

@scoder scoder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice, thanks for keeping up the good work!

Cython/Compiler/StringEncoding.py Show resolved Hide resolved
@scoder scoder merged commit 074362b into cython:master Sep 30, 2019
@da-woods da-woods deleted the unicode_identifiers_cstruct branch September 30, 2019 19:47
@scoder
Copy link
Contributor

scoder commented Jan 22, 2020

I'm still getting failures on Windows, even after resolving several encoding issues and what not:
https://ci.appveyor.com/project/cython/cython/builds/29947270
Here's the branch that I've been working with:
#3194

@da-woods, could you try to find out what's going wrong there?

@da-woods
Copy link
Contributor Author

I'll have a look...

@da-woods
Copy link
Contributor Author

da-woods commented Jan 23, 2020

@scoder I think it's a distutils bug (or maybe setuptools) rather than a Cython bug. When I run it on Windows I get:

C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.24.28314\bin\HostX86\x64\link.exe /nologo /INCREMENTAL:NO /LTCG /DLL /MANIFEST:EMBED,ID=2 /MANIFESTUAC:NO /LIBPATH:C:\Users\XXXXX\Miniconda3\envs\py37\libs /LIBPATH:C:\Users\XXXXX\Miniconda3\envs\py37\PCbuild\amd64 "/LIBPATH:C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.24.28314\lib\x64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\10\lib\10.0.18362.0\ucrt\x64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\10\lib\10.0.18362.0\um\x64" /EXPORT:PyInit_mymoð build\temp.win-amd64-3.7\Release\mymoð.obj /OUT:C:\Users\XXXXX\Documents\cython-fix_srctree_tests_on_windows\TEST_TMP\run\unicode_imports\mymoð.cp37-win_amd64.pyd /IMPLIB:build\temp.win-amd64-3.7\Release\mymoð.cp37-win_amd64.lib
LINK : error LNK2001: unresolved external symbol PyInit_mymoð
build\temp.win-amd64-3.7\Release\mymoð.cp37-win_amd64.lib : fatal error LNK1120: 1 unresolved externals

The key bit is /EXPORT:PyInit_mymoð - distutils/setuptools adds this to ensure that the PyInit_ symbol has the correct linkage (presumably incase PyMODINIT_FUNC has been forgotten). However, for a unicode filename the symbol is PyInitU_<punicodified_something>, so you get a link error.

Relevant code is in...
https://github.com/python/cpython/blob/0d30ae1a03102de07758650af9243fd31211325a/Lib/distutils/command/build_ext.py#L686

CPython bug report: https://bugs.python.org/issue39432. However, in the short-term this looks like something we could patch in cythonize so I'll have a go at doing that.

@da-woods
Copy link
Contributor Author

da-woods commented Jan 23, 2020

@scoder A really hacky patch for the Cython side of it:

    class DummyExportSymbols(list):
        # export_symbols for unicode filename cause link errors on Windows
        # Cython doesn't need them (it already defines PyInit with the correct linkage)
        # so use this class as a temporary fix to stop them from being generated
        def __init__(self):
            super(DummyExportSymbols, self).__init__() # empty list
        def __contains__(self, val):
            # so distutils doesn't "helpfully" add PyInit_<name>
            return True
    for m in module_list:
        m.export_symbols = DummyExportSymbols()

This goes immediately before this line:

deps = create_dependency_tree(ctx, quiet=quiet)
Posting it here because I don't know how to submit a PR against your branch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants