Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Verbatim C code using docstring syntax. #1915

Merged
merged 3 commits into from
Nov 22, 2017
Merged

Conversation

jdemeyer
Copy link
Contributor

@jdemeyer jdemeyer commented Oct 6, 2017

Allow including snippets of C code in a cdef extern from * block using docstring syntax:

cdef extern from *:
    """
    static long square(long x)
    {
        return x * x;
    }
    """
    long square(long)

This is an alternative to adding an external .h file containing the C code. This would be useful especially for simple things, like a one-line macro.

I still have to write documentation, but I'll wait for feedback first.

Edit: This is currently a syntax error and thus cannot conflict with existing code.

@robertwb
Copy link
Contributor

robertwb commented Oct 6, 2017

While I can see this would be handy when you need just a tiny snippet of C, in practice I think this would lead to polyglot files with hunks of C code embedded in string literals that I'd like to avoid. If you need/want to write raw C, creating a C file is I think a fair price to pay.

@jdemeyer
Copy link
Contributor Author

jdemeyer commented Oct 6, 2017

While I can see this would be handy when you need just a tiny snippet of C, in practice I think this would lead to polyglot files with hunks of C code embedded in string literals that I'd like to avoid. If you need/want to write raw C, creating a C file is I think a fair price to pay.

You may or may not be right, but what you are saying is really a style issue. The fact that some people might write ugly code using a feature is not a reason to not include that feature. Like any feature, it can be used properly or abused. If the C code is pretty simple (say, a one-liner), this feature will actually make it easier to understand code compared to the case of an additional helper.h file.

I see it as analogous to the cname mechanism cdef foo "some_funny_c_code_here"() which has also been abused. But sometimes it's convenient that this is possible. I also think that this feature is useful for the Jupyter %%cython mode. There, you are essentially restricted to a single file and you cannot easily include arbitrary C code otherwise.

@robertwb
Copy link
Contributor

robertwb commented Oct 6, 2017 via email

@scoder
Copy link
Contributor

scoder commented Oct 7, 2017 via email

@jdemeyer
Copy link
Contributor Author

jdemeyer commented Oct 7, 2017

Every feature introduced needs to have its cost balanced against its benefits, and readability of code matters here.

Why does it matter to you how other projects use Cython features? With the feature that I'm proposing, it would be possible to do something like

cdef extern from *:
    """#define my_special_macro(x, y) some_C_code_here"""
    void my_special_macro(long, long)

That's my use case. I don't plan to add 100 lines of C this way (1). I think this is easier to understand than having a separate file with the macro definition. Sage has a lot of these small .h files and it would improve readability if the small ones could be inlined in the Cython sources.

Let me also say that this is feature that I have wanted for a long time in Cython. I only made this PR now because my work on late includes made me realize that it would be very simple to implement it this way.

It has happened many times to me that Cython isn't able to generate the exact C code that I want for some minor reason (2). Sometimes, it is possible to use cname hacks to work around that. But, with cname hacks you are restricted to a specific syntax (3) which doesn't always work. There is a big gap in functionality and usability between cname hacks and cdef extern from "helper.h". This PR would nicely fill this gap.

(1) Although it's good to know that I could. I don't like it when a language artificially restrict functionality for style reasons (Python's restricted decorator syntax comes to mind here).
(2) The most common problem is probably that something is an lvalue but Cython doesn't know it.
(3) Athough, one time I actually managed to write multi-line C code in a cname string. Do you think it's a mistake in Cython to allow that?

@jdemeyer
Copy link
Contributor Author

jdemeyer commented Oct 7, 2017

You can write files to disk from notebooks with the %%file magic.

The question is not "is this functionality possible currently?"

The question is "would this functionality make things easier for our users?" I'm certain that the answer to this question is "yes".

This feature is easy to use for users, easy to implement for Cython, it doesn't break any existing code, it enables some use cases which weren't possible before (%%cython), it simplifies some use cases which were already possible. But just for some hypothetical style issues (people might use this feature to write ugly code), you refuse it?

@scoder
Copy link
Contributor

scoder commented Oct 7, 2017 via email

@jdemeyer
Copy link
Contributor Author

jdemeyer commented Oct 7, 2017

The question is, should we encourage this kind of code mix by promoting it to a language feature

Well, both cname strings and cdef extern from "helper.h" are language features which encourage this kind of code mix. So the possibility is already there, I only want to make it easier in some cases.

we should try to improve and simplify "good" usages, fill holes in the language and the compiler, and improve the coverage of C/C++ features to further reduce the need for writing native code in the first place.

Of course you should. But in the mean time, the fact remains that Cython does not support all C or C++ code. So such hand-written C snippets are needed in practice for some use cases.

@dalcinl
Copy link
Member

dalcinl commented Oct 20, 2017

@scoder @robertwb I've been using Cython for years, after abandoning SWIG for all of my projects. SWIG have support for embedding verbatim code, and I've been missing that feature for years, I would love to have it in Cython. I would actually refactor my own projects to take advantage of it.

Please note that my use case (and the one of many other users) is not to use Cython as faster superset of Python, but rather as a tool with superb capabilities for wrapping third-party C/C++/Fortran code. @jdemeyer is right that we already have some features like cdef foo "some_funny_c_code_here"(). Why do we have these quirks in the language? Well, maybe just because practicality beats purity. The additional quirk proposed in this PR is not essentially different, and yet more powerful.

I understand some users will abuse of these features to maybe write horrible code. People manage to write awful code in any language, even in Python! However, right now the experienced developer using Cython has no other option than write a helper C file on the side, keeping implementation and use separate, which is not always good, as it makes Cython code actually harder to read.

@jdemeyer
Copy link
Contributor Author

@dalcinl Thanks for the support! Let me also repeat that this is a feature that I wanted since a long time. It's not just a crazy idea that I had once.

@@ -54,6 +54,15 @@ def generate_c_code_config(env, options):
c_line_in_traceback=options.c_line_in_traceback)


class VerbatimString(_unicode):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're probably aware that this is a hack, right?


if byte_decoded_filenname[0] == '<' and byte_decoded_filenname[-1] == '>':
code.putln('#include %s' % byte_decoded_filenname)
f = str(filename)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can raise a UnicodeEncodeError for non-ASCII content in Py2.

@jdemeyer
Copy link
Contributor Author

You're probably aware that this is a hack, right?

I think it's not so bad. I am saying that a verbatim include string is essentially a special kind of string.

Now, in order to things properly object oriented, we should probably introduce a new base class for includes in general. This would actually make sense as it might help to deal with <...> includes vs "..." includes, early vs late includes, file include vs verbatim includes. I thought about that, but didn't do it because I didn't want to make this PR more complicated than it needs be. I also think that one should separate "refactoring" PRs from "new features" PRs.

@dalcinl
Copy link
Member

dalcinl commented Oct 28, 2017

@jdemeyer While I do support this PR, I think we have to do it the best we can.

Would you changes support the following use case? I'm working on a branch to update the NumPy pxd files, and the thing below would allow me to provide backward compatibility with old NumPy releases.

cdef extern from "numpy/arrayobject.h":
    """
    /* Legacy NumPy C API */
    #ifndef NPY_ARRAY_CARRAY
    #define NPY_ARRAY_CARRAY NPY_CARRAY
    #endif
    """
    enum: NPY_ARRAY_CARRAY

@jdemeyer
Copy link
Contributor Author

Would you changes support the following use case?

"support" in which sense? I cannot comment on the numpy specifics, I never use numpy.

@dalcinl
Copy link
Member

dalcinl commented Oct 29, 2017

@jdemeyer "Support" in the sense that the snippet above will be valid Cython syntax and that the generated C code corresponding to such Cython code looks like:

#include "numpy/arrayobject.h"
/* Legacy NumPy C API */
#ifndef NPY_ARRAY_CARRAY
#define NPY_ARRAY_CARRAY NPY_CARRAY
#endif

@jdemeyer
Copy link
Contributor Author

I see what you did now. You combined cdef extern from "include.h with verbatim docstring syntax. In my current PR, this is not supported because that is not my use case. But I agree that it makes a lot of sense.

@jdemeyer
Copy link
Contributor Author

I updated my branch with larger changes to include files in general. I added a new class IncludeCode to factor out some of the logic to deal with includes.

Copy link
Contributor

@scoder scoder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feature looks good to me now, but see my code comments regarding the implementation.

def __new__(cls, *args, **kwds):
self = object.__new__(cls)
self.order = cls.number
cls.number += 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can also be done in __init__(). No need to implement __new__().

code.putln('#include "%s"' % filename)
for inc in sorted(env.c_includes.values(), key=IncludeCode.sortkey):
if inc.location == inc.INITIAL:
code.putln(str(inc))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's usually the other way round: inc would have a method to write its code contribution, receiving code as an argument. I specifically dislike calling str() on something non-trivial for code generation.

# include. We'll deal with this at code generation time.
if late:
incs = self.include_files_late
from .ModuleNode import IncludeCode
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These sprinkled imports suggest that ModuleNode might not be the perfect place to put the include files class. Why not put it in Code.py?

>>> test_set_size((1, 2, 3), 2)
(1, 2)
"""
Py_SET_SIZE(x, size)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't this leak a reference? I'd rather do without such a test.

def append(self, text):
self.text += text

def dict_update(self, d, key):
Copy link
Contributor

@scoder scoder Oct 31, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found it difficult and confusing to read the code where this is used. Renaming it to update_from_dict() might help, but in fact, this doesn't care about the dict at all, only one of its entries. I don't think it would hurt much to pass key and (potentially None) value instead, but it might make it clearer what's happening.

if self.verbatim_include:
if inc is None:
inc = env.add_include_file("", late)
inc.append(self.verbatim_include)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like it should be merged into add_include_file() as an optional argument. That would avoid the need to return inc from it.

text = self.text
if not isinstance(text, str):
# Python 2: encode unicode -> str
text = text.encode('utf-8')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't happen here. The code writer takes Unicode strings and does the encoding on the way out.

if not include:
self.text = u""
elif include[0] == '<' and include[-1] == '>':
self.text = u'#include {0}\n'.format(include)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should get rid of the trailing backslash. code.putln() does that for us.

self.location = self.EARLY

def append(self, text):
self.text += text
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather store and serialise the parts separately.

@jdemeyer
Copy link
Contributor Author

The code writer takes Unicode strings and does the encoding on the way out.

I only did that because the old code to write includes also did conversion unicode -> str before calling putln(). So you are saying that it's fine to pass unicode to putln()?

@jdemeyer
Copy link
Contributor Author

Wouldn't this leak a reference? I'd rather do without such a test.

The real test here is that the code compiles. So I can just leave the function but never call it.

@jdemeyer
Copy link
Contributor Author

This should get rid of the trailing backslash. code.putln() does that for us.

Sorry, I have no idea what you mean here...

@jdemeyer
Copy link
Contributor Author

jdemeyer commented Oct 31, 2017

I'd rather store and serialise the parts separately.

Do you mean to store both parts in the same IncludeCode object or as two separate IncludeCode objects?

EDIT: I will assume in the same IncludeCode object because if somebody writes

cdef extern from "foo.h":
    "/* C code here */"

it makes sense to keep #include "foo.h" and /* C code here */ together.

EDIT2: maybe not. Having one piece of code per IncludeCode object is conceptually simpler.

@jdemeyer
Copy link
Contributor Author

in fact, this doesn't care about the dict at all, only one of its entries.

That is not quite true. For a given key, it needs to access d[key] if that key exists in the dict but it also needs to add d[key] to the dict if it was not already there.

@jdemeyer
Copy link
Contributor Author

I found it difficult and confusing to read the code where this is used.

I think that the code makes sense but I'll try to add more comments.

@scoder
Copy link
Contributor

scoder commented Oct 31, 2017 via email

@scoder
Copy link
Contributor

scoder commented Oct 31, 2017 via email

@jdemeyer
Copy link
Contributor Author

jdemeyer commented Nov 3, 2017

Then probably the best solution is to put the verbatim text in the same IncludeCode object after all...

I did that now by introducing a new attribute IncludeCode.extratext. This needs to be a separate attribute because the #include "header.h" text should be unique but the verbatim text not.

If you want a separation of concerns, then reverse the responsibility and let the module scope do the content management of its "c_includes" dict internally by itself

I tried to do that in this version with a method ModuleScope.process_include(). It takes an IncludeCode instance and does with it what it needs to do. All processing of includes now goes through this method.

@jdemeyer
Copy link
Contributor Author

I believe that this addresses all your concerns and it implements the use case of @dalcinl

Do you have any further comments?

@scoder scoder added this to the 0.28 milestone Nov 10, 2017
@dalcinl
Copy link
Member

dalcinl commented Nov 10, 2017

@jdemeyer Bravo! Many thanks for the contribution...

@scoder
Copy link
Contributor

scoder commented Nov 10, 2017

Since this is a major new language feature, it needs documentation. @jdemeyer, could you please write something up?

@jdemeyer
Copy link
Contributor Author

I already promised that in the PR description, so yes. Can I assume that this will be merged once that doc has been written?

@scoder
Copy link
Contributor

scoder commented Nov 10, 2017

@robertwb has the final word on this.

My opinion: Cython is a pragmatic tool, and this change solves a real problem in a pragmatic way. I've seen some truly ugly hacks that try to achieve something similar from within Cython code, and this is a much cleaner way to do it. Cython is, by its very design, a tool that people can use to avoid writing C, and that suggests to me that people will keep avoiding to write C, and not start dumping large blobs of C into docstrings. And even if some of them do that, we'll just have to assume that they are adult enough to know what they are doing.

I can also see Robert's point, though, that adding this introduces a feature penalty. Now we have a docstring which is not a docstring, and people using it for documentation will get weird C compiler errors. I'd personally be ok with restricting this feature some more to mitigate that drawback, e.g. enable it only for cdef extern from * and not for actual include files, something like that. But I would also accept it as it is. As I said, I'll leave the final decision to Robert.

@dalcinl
Copy link
Member

dalcinl commented Nov 10, 2017

@scoder about restricting syntax, please look at my comment above, I'm planning to use the feature as currently implemented to modernize Cython's NumPy support, so we can get rid of these annoying C compiler warnings about NumPy API version.

@robertwb
Copy link
Contributor

Though I haven't been very active on this thread of late, I have been giving it a lot of thought. Despite my concerns, I've come to the conclusion that this is useful enough to include in Cython.

The cdef extern from "header.h" variant still seems a bit odd to me--on reading it it's not obvious whether the verbatim code comes before or after the #include statement. There's also the issue that a single file may be #included multiple times, possibly via cimports, and if each had its own verbatim block attached we'd have to establish rules about where and in what order they all got attached. Given that one can accomplish the same thing with a cdef extern from * block that immediately follows or proceeds the header-including one, it might pay to be more conservative here.

The other concern I have is placing these code bocks in .pxd files. If declarations are not declared to be static, one will have ODR violations, and if they are declared to be static, the fact that one has multiple copies (rather than a shared copy that is cimported) would likely be surprising (especially for any declared values, but could be for functions too if they have static locals or similar).

@scoder
Copy link
Contributor

scoder commented Nov 11, 2017

the fact that one has multiple copies (rather than a shared copy that is cimported) would likely be surprising

Hmm, right. Deduplication (towards the first occurrence) would be desirable, but that could mean that re-ordering your cimports might lead to surprising situations.

OTOH, this is an expert feature that is clearly designed to bypass Cython's view (and its safety checks). Users will have to take care themselves that interdependencies are properly resolved. In fact, it might not be much of a problem in practice, since you should always cimport anything you depend on anyway, so it should be relatively easy (for users) to make the first occurrence a safe place to drop the code.

Thinking about the limitation to cdef extern from *, I can now see that ordering is actually problematic already. Whether we write the verbatim block before or after the header include, it can always happen that we reorder other extern blocks due to early/late include decisions. That makes it difficult to enforce a certain order. Putting a * block before an include block does not necessarily mean that the verbatim code in both gets written out in that order. It now depends on what kind of declarations both blocks contain. Not ... great...

@jdemeyer
Copy link
Contributor Author

Hmm, right. Deduplication would be desirable.

I agree. If the same verbatim block is found through different cimports, it should be emitted only once.

@jdemeyer
Copy link
Contributor Author

Updated version which avoids duplication. There is also a new testcase checking for this.

@jdemeyer
Copy link
Contributor Author

The current solution might be overcomplicated though. Let me see if I can simplify it.

I would like to know whether there are any objections to the functionality though.

@jdemeyer
Copy link
Contributor Author

On second thought, there is not much simplification possible. The code is not so simple, but it seems that all complexity is really needed for the various features (clever deduplication, support for initial/early/late includes).

@jdemeyer
Copy link
Contributor Author

Since this is a major new language feature, it needs documentation. @jdemeyer, could you please write something up?

Done.

@scoder
Copy link
Contributor

scoder commented Nov 22, 2017

Let's get this in. It's a long requested feature and the implementation is solid, too.
Thank you for your contribution, Jeroen!

@scoder scoder merged commit 8ccdda6 into cython:master Nov 22, 2017
@jdemeyer
Copy link
Contributor Author

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants