Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] unicode.split does not allow to pass None for sep #4737

Closed
navytux opened this issue Apr 15, 2022 · 16 comments · Fixed by #4740
Closed

[BUG] unicode.split does not allow to pass None for sep #4737

navytux opened this issue Apr 15, 2022 · 16 comments · Fixed by #4740

Comments

@navytux
Copy link
Contributor

navytux commented Apr 15, 2022

Describe the bug
I'm hitting the difference in behaviour in between CPython and Cython for unicode.split - with Cython passing sep=None explicitly raises TypeError. Please find details below:

To Reproduce
Code to reproduce the behaviour:

---- 8< ---- usplit.pyx

# cython: language_level=3

def mysplit(q):
    return unicode.split(q, None)

print(mysplit("hello world"))

Expected behavior

I expect it to behave the same as in Python - i.e. print ['hello', 'world']:

---- 8< ---- usplit_py.py

def mysplit(q):
    return str.split(q, None)

print(mysplit("hello world"))
$ python usplit_py.py 
['hello', 'world']

However what I get instead is the following exception that None could not be used for sep:

$ cythonize -i usplit.pyx 
Compiling /home/kirr/usplit.pyx because it changed.
[1/1] Cythonizing /home/kirr/usplit.pyx
running build_ext
building 'usplit' extension
creating /home/kirr/tmp3kckc5wa/home
creating /home/kirr/tmp3kckc5wa/home/kirr
x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -ffile-prefix-map=/build/python3.9-RNBry6/python3.9-3.9.2=. -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -ffile-prefix-map=/build/python3.9-RNBry6/python3.9-3.9.2=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/home/kirr/src/wendelin/venv/py3.venv/include -I/usr/include/python3.9 -c /home/kirr/usplit.c -o /home/kirr/tmp3kckc5wa/home/kirr/usplit.o
x86_64-linux-gnu-gcc -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-z,relro -g -fwrapv -O2 -Wl,-z,relro -g -fwrapv -O2 -g -ffile-prefix-map=/build/python3.9-RNBry6/python3.9-3.9.2=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 /home/kirr/tmp3kckc5wa/home/kirr/usplit.o -o /home/kirr/usplit.cpython-39-x86_64-linux-gnu.so
$ python -c 'import usplit'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "usplit.pyx", line 6, in init usplit
    print(mysplit("hello world"))
  File "usplit.pyx", line 4, in usplit.mysplit
    return unicode.split(q, None)
TypeError: must be str, not NoneType

Environment (please complete the following information):

  • OS: [Debian GNU/Linux 11]
  • Python version [e.g. 3.9.2]
  • Cython version [e.g. 0.29.27]

Thanks beforehand,
Kirill

@da-woods
Copy link
Contributor

So it looks like we translate unicode.split to PyUnicode_split

def _handle_simple_method_unicode_split(self, node, function, args, is_unbound_method):
"""Replace unicode.split(...) by a direct call to the
corresponding C-API function.
"""
if len(args) not in (1,2,3):
self._error_wrong_arg_count('unicode.split', node, args, "1-3")
return node
if len(args) < 2:
args.append(ExprNodes.NullNode(node.pos))
self._inject_int_default_argument(
node, args, 2, PyrexTypes.c_py_ssize_t_type, "-1")
return self._substitute_method_call(
node, function,
"PyUnicode_Split", self.PyUnicode_Split_func_type,
'split', is_unbound_method, args)
PyUnicode_Join_func_type = PyrexTypes.CFuncType(
Builtin.unicode_type, [
PyrexTypes.CFuncTypeArg("str", Builtin.unicode_type, None),
PyrexTypes.CFuncTypeArg("seq", PyrexTypes.py_object_type, None),
])

The documentation for that (https://docs.python.org/3/c-api/unicode.html#c.PyUnicode_Split) implies that it treats sep==NULL as "split on whitespace". So we probably just want a tiny wrapper function to translate None to NULL and maybe also to special-case it when a None constant is passed.

@navytux
Copy link
Contributor Author

navytux commented Apr 15, 2022

Yes, @da-woods, thanks. I'm already doing my workarounds for other unicode methods - count, endswith, find etc for start and stop, e.g. as follows:

    def count(self, sub, start=None, end=None):                                                            
        # cython optimizes unicode.count to directly call PyUnicode_Count - cannot use None for start/stop        
        if start is None: start = 0                                                                        
        if end   is None: end   = PY_SSIZE_T_MAX                                                           
        return unicode.count(self, pyu(sub), start, end)                                                   

but for unicode.split I could not find a clean way to pass sep as NULL myself.

@da-woods
Copy link
Contributor

If you have a list of the unicode methods that you know of that have similar bugs then that'd be very useful?

@navytux
Copy link
Contributor Author

navytux commented Apr 15, 2022

Sure, here is my current list:

  • count
  • endswith
  • find
  • index
  • rfind
  • rindex
  • split (this one)
  • startswith

@da-woods
Copy link
Contributor

Your best workaround in the short-term would be to stop Cython from knowing that the object is unicode. So

def call_split(u: unicode):
   return (<object>u).split(None)

should at least prevent Cython from doing its optimization

@navytux
Copy link
Contributor Author

navytux commented Apr 15, 2022

@da-woods, thanks for the hint. For the reference I already worked it around as follows:

    def split(self, sep=None, maxsplit=-1):
        # cython optimizes unicode.split to directly call PyUnicode_Split - cannot use None for sep
        # and cannot also use object=NULL https://github.com/cython/cython/issues/4737
        if sep is None:
            v = unicode.split(self, maxsplit=maxsplit)
        else:
            v = unicode.split(self, pyu(sep), maxsplit)
        return list([pyu(_) for _ in v])

@scoder
Copy link
Contributor

scoder commented Apr 15, 2022

Agreed that Cython should handle this internally. Thanks for bringing it up.

@navytux
Copy link
Contributor Author

navytux commented Apr 15, 2022

You are welcome, @scoder. Thanks for feedback.

@scoder
Copy link
Contributor

scoder commented Apr 15, 2022

I looked into this. @navytux, apart from split(), the other methods you mentioned do not seem to accept None as argument except for the index range. That's what you meant, right? Supporting that is a bit tricky, because it involves two different coercions: 1) None -> 0 / PY_SSIZE_T_MAX and 2) object -> Py_ssize_t. I'll see what I can do.

@navytux
Copy link
Contributor Author

navytux commented Apr 15, 2022

@scoder, thanks for looking into it. Yes, for everything besides split I was talking about start/end parameters that specify interested index range.

scoder added a commit to scoder/cython that referenced this issue Apr 15, 2022
This is used by some optimisations for builtins that call C-API functions directly but need to convert None arguments to NULL or special integer values in order to mimic the original Python interface.

Also add and backport the CPython macros for None checks (and True/False, while we're at it):
https://docs.python.org/3/c-api/structures.html#c.Py_Is

Closes cython#4737
See cython#4706
@scoder
Copy link
Contributor

scoder commented Apr 15, 2022

I submitted a PR in #4740

@navytux, since there are a bunch of methods that are affected, maybe you could help us by writing up the tests for them? I added a few and it looks good, but most methods are still missing, and I think it'd be good to also test the None arguments separately and not just together as two Nones.

@navytux
Copy link
Contributor Author

navytux commented Apr 15, 2022

@scoder, thanks. Sure, I could try to help. Since I'm kind of newbie when it comes to work on Cython compiler itself, I've started with something small first:

and I think it'd be good to also test the None arguments separately and not just together as two Nones.

Does navytux@74ba995 look reasonable doing so for startswith ? I'm also unsure why in that file for every case things are tested twice - e.g. by inspecting output of text.startswith(...) and output of startswith_start_end(text, <same...>). Is there a particular reason to do so? In other words what are the downsides of e.g.

--- a/tests/run/unicodemethods.pyx
+++ b/tests/run/unicodemethods.pyx
@@ -357,37 +357,21 @@ def startswith(unicode s, sub):
     "//PythonCapiCallNode")
 def startswith_start_end(unicode s, sub, start, end):
     """
-    >>> text.startswith('b ', 1, 5)
-    True
     >>> startswith_start_end(text, 'b ', 1, 5)
     'MATCH'
-    >>> text.startswith('ab ', -1000, 5000)
-    True
     >>> startswith_start_end(text, 'ab ', -1000, 5000)
     'MATCH'
-    >>> text.startswith('b X', 1, 5)
-    False
     >>> startswith_start_end(text, 'b X', 1, 5)
     'NO MATCH'
 
-    >>> text.startswith('ab ', None, None)
-    True
     >>> startswith_start_end(text, 'ab ', None, None)
     'MATCH'
-    >>> text.startswith('ab ', 1, None)
-    False
     >>> startswith_start_end(text, 'ab ', 1, None)
     'NO MATCH'
-    >>> text.startswith('b ', 1, None)
-    True
     >>> startswith_start_end(text, 'b ', 1, None)
     'MATCH'
-    >>> text.startswith('ab ', None, 3)
-    True
     >>> startswith_start_end(text, 'ab ', None, 3)
     'MATCH'
-    >>> text.startswith('ab ', None, 2)
-    False
     >>> startswith_start_end(text, 'ab ', None, 2)
     'NO MATCH'
     """

?

Once it is clear for me regarding testing, and a small code delta of mine looks good for you, I can proceed with adding tests for all other methods.

@scoder
Copy link
Contributor

scoder commented Apr 15, 2022

You are looking at the doctest part, which is executed by Python (not compiled by Cython). We use this to a) run our compiled test function and validate the result, and sometimes also to b) do the same operation in Python to make sure it produces the same (expected) result, and it stays that way across Python versions. What you remove above is the Python comparison test part.

@navytux
Copy link
Contributor Author

navytux commented Apr 15, 2022

Now I see, thanks for clarifying. Then since we are adding many cases, would the following be ok with you:

--- a/tests/run/unicodemethods.pyx
+++ b/tests/run/unicodemethods.pyx
@@ -357,44 +357,33 @@ def startswith(unicode s, sub):
     "//PythonCapiCallNode")
 def startswith_start_end(unicode s, sub, start, end):
     """
-    >>> text.startswith('b ', 1, 5)
+    >>> def _(sub, start, end):
+    ...     py = text.startswith(sub, start, end)
+    ...     cy = startswith_start_end(text, sub, start, end)
+    ...     assert py == cy, (py, cy)
+    ...     return cy
+    >>> _('b ', 1, 5)
     True
-    >>> startswith_start_end(text, 'b ', 1, 5)
-    'MATCH'
-    >>> text.startswith('ab ', -1000, 5000)
+    >>> _('ab ', -1000, 5000)
     True
-    >>> startswith_start_end(text, 'ab ', -1000, 5000)
-    'MATCH'
-    >>> text.startswith('b X', 1, 5)
+    >>> _('b X', 1, 5)
     False
-    >>> startswith_start_end(text, 'b X', 1, 5)
-    'NO MATCH'
 
-    >>> text.startswith('ab ', None, None)
+    >>> _('ab ', None, None)
     True
-    >>> startswith_start_end(text, 'ab ', None, None)
-    'MATCH'
-    >>> text.startswith('ab ', 1, None)
+    >>> _('ab ', 1, None)
     False
-    >>> startswith_start_end(text, 'ab ', 1, None)
-    'NO MATCH'
-    >>> text.startswith('b ', 1, None)
+    >>> _('b ',  1, None)
     True
-    >>> startswith_start_end(text, 'b ', 1, None)
-    'MATCH'
-    >>> text.startswith('ab ', None, 3)
+    >>> _('ab ', None, 3)
     True
-    >>> startswith_start_end(text, 'ab ', None, 3)
-    'MATCH'
-    >>> text.startswith('ab ', None, 2)
+    >>> _('ab ', None, 2)
     False
-    >>> startswith_start_end(text, 'ab ', None, 2)
-    'NO MATCH'
     """
     if s.startswith(sub, start, end):
-        return 'MATCH'
+        return True
     else:
-        return 'NO MATCH'
+        return False

i.e. do the check of what Python produced wrt what Cython produced automatically, and have only one case for each test entry. So reworked test looks e.g. like the following:

@cython.test_fail_if_path_exists(
    "//CoerceToPyTypeNode",
    "//CastNode", "//TypecastNode")
@cython.test_assert_path_exists(
    "//CoerceFromPyTypeNode",
    "//PythonCapiCallNode")
def startswith_start_end(unicode s, sub, start, end):
    """
    >>> def _(sub, start, end):
    ...     py = text.startswith(sub, start, end)
    ...     cy = startswith_start_end(text, sub, start, end)
    ...     assert py == cy, (py, cy)
    ...     return cy
    >>> _('b ', 1, 5)
    True
    >>> _('ab ', -1000, 5000)
    True
    >>> _('b X', 1, 5)
    False

    >>> _('ab ', None, None)
    True
    >>> _('ab ', 1, None)
    False
    >>> _('b ',  1, None)
    True
    >>> _('ab ', None, 3)
    True
    >>> _('ab ', None, 2)
    False
    """
    if s.startswith(sub, start, end):
        return True
    else:
        return False

and personally I find it easier to understand because the signal/noise ratio becomes higher. (I would rework it even further to just do return s.startswith(...) but then it starts to fail detecting CoerceToPyTypeNode.

Would such style be ok with you?

@scoder
Copy link
Contributor

scoder commented Apr 15, 2022 via email

scoder added a commit to scoder/cython that referenced this issue Apr 16, 2022
This is used by some optimisations for builtins that call C-API functions directly but need to convert None arguments to NULL or special integer values in order to mimic the original Python interface.

Also add and backport the CPython macros for None checks (and True/False, while we're at it):
https://docs.python.org/3/c-api/structures.html#c.Py_Is

Closes cython#4737
See cython#4706
navytux added a commit to navytux/cython that referenced this issue Apr 18, 2022
…s compared to via-Python output

In every test, that is doing such comparison, replace the duplication
with a function that calls both py-function and cy-function, asserts the
result is the same and returns the result.

With this helper we can retain only one doctest entry instead of two for
every tested case. Tests also become a bit more robust because the
possibility of slight unintended error in between py and cy inputs is now
gone.

This refactoring is done in preparation to add more tests for unicode
methods that accept start and stop parameters.

Discussion: cython#4737 (comment)
@navytux
Copy link
Contributor Author

navytux commented Apr 18, 2022

Thanks, @scoder. So I've tried to do those tests for unicode methods the way we discussed. Please find the patches at #4743. More specifically the patches are:

Hope it is ok,
Kirill

scoder added a commit that referenced this issue Apr 18, 2022
…H-4740)

This is used by some optimisations for builtins that call C-API functions directly but need to convert None arguments to NULL or special integer values in order to mimic the original Python interface.

Also add and backport the CPython macros for None checks (and True/False, while we're at it):
https://docs.python.org/3/c-api/structures.html#c.Py_Is

Closes #4737
See #4706
scoder pushed a commit that referenced this issue Apr 21, 2022
…4743)

Also refactor the tests to (always) use a direct comparison between the Python generated result and the result from the optimised Cython code.

Adds tests for #4740
See #4737
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants