New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Parser: Use the Clang API #51

Open
Arcnor opened this Issue Dec 3, 2015 · 41 comments

Comments

Projects
None yet
2 participants
@Arcnor
Copy link

Arcnor commented Dec 3, 2015

I'm in the process of doing this right now. Currently, the following issues exists with the approach I'm taking:

  • Preprocessor directives are not supported: This is fixable, because Clang can be made aware of those, but it doesn't by default. I'll leave this for later.
  • Normal comments cannot be read: Unless I've missed some Clang flag, non-doc comments are completely ignored. I'm not sure this is a big deal, but nonetheless, it's different from the current behavior.

@saudet saudet added the enhancement label Dec 5, 2015

@saudet

This comment has been minimized.

Copy link
Member

saudet commented Dec 29, 2015

BTW, we probably want to use the C++ API of Clang for this. It is not currently mapped by the presets, so as initial work, we would either have to:

  1. Code the new parser temporarily in C++, or
  2. Create the presets for the C++ API of Clang, using the current Parser.

Either way is fine with me. Thanks for your interest in this project and let me know how I can help!

@Arcnor

This comment has been minimized.

Copy link

Arcnor commented Dec 29, 2015

Sorry, I haven't had the time to work on this lately.

My final changes allowed me to parse a lot, but some missing things from the C++ API prevented me finishing it IIRC, so I think option 2 should help us get there (or, as you said, code it in C++, but it's non-trivial :D)

@saudet saudet added the help wanted label Jan 3, 2016

@Arcnor

This comment has been minimized.

Copy link

Arcnor commented Jan 6, 2016

I've continued a bit on this, and for now, I've decided to write the Clang bindings manually, as I think the surface we need from Clang is not that big (I might be wrong, though).

Once we hace a working parser, we can generate proper bindings for Clang itself and use them in the generator, closing the circle ;)

I have some doubts about how to implement some of the bindings, though, so I'll hit the forums in a few days with my questions :)

@saudet

This comment has been minimized.

Copy link
Member

saudet commented Dec 6, 2017

@Arcnor Any progress with this?

libclang seems to be getting pretty good for that sort of thing, for example:
https://github.com/rust-lang-nursery/rust-bindgen
https://rust-lang-nursery.github.io/rust-bindgen/
So maybe we won't need to use the C++ API after all...

@Arcnor

This comment has been minimized.

Copy link

Arcnor commented Dec 6, 2017

Hi Samuel,

No, unfortunately I haven't had the time to continue, not enough incentive for me to do so right now (the project(s) that were using JavaCPP all stopped for one reason or another, mostly priorities).

So libclang is getting better, eh? That's great news! I've had a quick look at the API again, and it seems to contain some goodies I don't remember from 2 years ago, so yeah, maybe now it's enough for our purposes.

If they kept the same names on the AST I might even be able to reuse some of the code I made years ago that parsed the unstable AST output of CLang (https://github.com/Arcnor/objc2robovm/blob/master/src/main/java/com/arcnor/objcclang/parser/CLangHandler.java for example).

Anyway, unless somebody else is working on this, I'll try to give it another look if it doesn't look too complex to interact with it, as time is limited :).

@saudet

This comment has been minimized.

Copy link
Member

saudet commented Dec 6, 2017

@Arcnor

This comment has been minimized.

Copy link

Arcnor commented Dec 7, 2017

I'm going to need some help to generate the bindings it seems. I'll put my question(s) here as they are related, but if you need me to use the forum I'll go there instead:

The bindings have the following code:

typedef struct CXVirtualFileOverlayImpl *CXVirtualFileOverlay;

CXVirtualFileOverlay clang_VirtualFileOverlay_create(unsigned options);
...

How can I rename CXVirtualFileOverlayImpl to CXVirtualFileOverlay? I've tried with javaNames but that's obviously not for that.

@Arcnor

This comment has been minimized.

Copy link

Arcnor commented Dec 7, 2017

So besides that small problem, the whole API seems to work (well, at least compile) with very few manual mappings, which is cool.

I'll take more time tomorrow to actually figure out if the stuff I couldn't do ~2 years ago is now possible :).

@saudet

This comment has been minimized.

Copy link
Member

saudet commented Dec 7, 2017

Sounds good! BTW, the bindings for libclang are already available here:
https://github.com/bytedeco/javacpp-presets/blob/master/llvm/src/main/java/org/bytedeco/javacpp/clang.java
AFAIK, we just need to use those.

If there's anything to fix about those though, please send pull requests against the presets config:
https://github.com/bytedeco/javacpp-presets/blob/master/llvm/src/main/java/org/bytedeco/javacpp/presets/clang.java
Thanks!!

@Arcnor

This comment has been minimized.

Copy link

Arcnor commented Dec 7, 2017

@saudet

This comment has been minimized.

Copy link
Member

saudet commented Dec 7, 2017

They already work with LLVM 5.0.0 yes:
https://github.com/bytedeco/javacpp-presets/tree/master/llvm

All const char * should already get mapped to String as well as BytePointer, but if there are char * that should also be mapped to String, yes please, do let me know! Thanks

@Arcnor

This comment has been minimized.

Copy link

Arcnor commented Dec 7, 2017

@saudet

This comment has been minimized.

Copy link
Member

saudet commented Dec 7, 2017

Right, the problem with return values is that when we need a Pointer, we can't get one from the String, but we can get a String from a BytePointer with getString()...

@Arcnor

This comment has been minimized.

Copy link

Arcnor commented Dec 7, 2017

@saudet

This comment has been minimized.

Copy link
Member

saudet commented Dec 7, 2017

If not, we can add to CXString a helper String getString() { return getCString().getString(); } method :)

saudet added a commit to bytedeco/javacpp-presets that referenced this issue Dec 7, 2017

@saudet

This comment has been minimized.

Copy link
Member

saudet commented Dec 7, 2017

Actually, after calling clang_getCString() we need to call clang_disposeString(), so simply returning a String isn't that convenient. I've added the helper function I talked about in the commit above.

@Arcnor

This comment has been minimized.

Copy link

Arcnor commented Dec 7, 2017

@Arcnor

This comment has been minimized.

Copy link

Arcnor commented Dec 7, 2017

I've finally checked this properly, and it seems this method was the only one that made sense to have as BytePointer, as it has the dispose. As far as I can see, the others (except clang_EvalResult_getAsStr which also has special disposing) only make sense as String (like CXUnsavedFile.Filename() or clang_getTUResourceUsageName())

Anyway, for now I'll continue as it is, we can always improve things later without many changes.

@Arcnor

This comment has been minimized.

Copy link

Arcnor commented Dec 8, 2017

I'm now getting crashes (randomly, like 1 for every 5 executions or so) like this one:

Stack: [0x000070000836e000,0x000070000846e000],  sp=0x000070000846cf30,  free space=1019k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
j  org.bytedeco.javacpp.Loader.offsetof(Ljava/lang/Class;Ljava/lang/String;)I+11
j  org.bytedeco.javacpp.Pointer.offsetof(Ljava/lang/String;)I+16
j  org.bytedeco.javacpp.Pointer.sizeof()I+24
j  org.bytedeco.javacpp.Pointer$DeallocatorReference.<init>(Lorg/bytedeco/javacpp/Pointer;Lorg/bytedeco/javacpp/Pointer$Deallocator;)V+29
j  org.bytedeco.javacpp.Pointer$NativeDeallocator.<init>(Lorg/bytedeco/javacpp/Pointer;JJ)V+3
j  org.bytedeco.javacpp.Pointer.init(JJJJ)V+44
v  ~StubRoutines::call_stub
V  [libjvm.dylib+0x2ee9aa]
V  [libjvm.dylib+0x325b59]
V  [libjvm.dylib+0x31b166]
C  [libjniclang.dylib+0x1804]  _ZL19JavaCPP_initPointerP7JNIEnv_P8_jobjectPKvxPvPFvS5_E+0x74
C  [libjniclang.dylib+0x13ef2]  Java_org_bytedeco_javacpp_clang_00024CXCursorVisitor_allocate+0x62
j  org.bytedeco.javacpp.clang$CXCursorVisitor.allocate()V+0
j  org.bytedeco.javacpp.clang$CXCursorVisitor.<init>()V+5
j  com.arcnor.javacpp.Main$Visitor.<init>()V+1
j  com.arcnor.javacpp.Main$Visitor.<init>(Lcom/arcnor/javacpp/Main$1;)V+1
j  com.arcnor.javacpp.Main.visit(Lorg/bytedeco/javacpp/clang$CXTranslationUnit;)V+11
j  com.arcnor.javacpp.Main.main([Ljava/lang/String;)V+22
v  ~StubRoutines::call_stub
V  [libjvm.dylib+0x2ee9aa]
V  [libjvm.dylib+0x3257c2]
V  [libjvm.dylib+0x31e539]
C  [java+0x3931]  JavaMain+0x9c4
C  [libsystem_pthread.dylib+0x393b]  _pthread_body+0xb4
C  [libsystem_pthread.dylib+0x3887]  _pthread_body+0x0
C  [libsystem_pthread.dylib+0x308d]  thread_start+0xd
C  0x0000000000000000

Visitor is a class I created that looks exactly like this:

private static class Visitor extends CXCursorVisitor {
  @Override
  public int call(CXCursor cursor, CXCursor parent, CXClientData client_data) {
    return CXChildVisit_Continue;
  }
}

...and I'm just instantiating by calling new Visitor(). I'm not sure if there are any extra considerations to take when instantiating FunctionPointer classes like this one?

@saudet

This comment has been minimized.

Copy link
Member

saudet commented Dec 8, 2017

Have you disabled "crash recovery"?

∗ In the case of Clang, we might need to disable crash recovery with the LIBCLANG_DISABLE_CRASH_RECOVERY=1 environment variable to prevent clashes with the JVM's own signal handlers.

https://github.com/bytedeco/javacpp-presets/tree/master/llvm

@Arcnor

This comment has been minimized.

Copy link

Arcnor commented Dec 8, 2017

Ahh, nice, will try that. I've had a good run of ~10 without crashes though, so it will be difficult to prove if it worked (unless I get it again :D)

saudet added a commit to bytedeco/javacpp-presets that referenced this issue Dec 8, 2017

@saudet

This comment has been minimized.

Copy link
Member

saudet commented Dec 8, 2017

I've added getString() helper methods for CXTUResourceUsageKind and CXEvalResult as well, as per the commit above, but for things like contents and filenames, the encoding can change at runtime. AFAIK there's no pretty way to make String work under those conditions so we might as well just make sure users call BytePointer.getString(). If you have any good ideas though, let me know. Thanks!

@saudet

This comment has been minimized.

Copy link
Member

saudet commented Dec 24, 2017

Let me know if there's anything else missing from the API that would prevent you from making progress. Thanks!!

@Arcnor

This comment has been minimized.

Copy link

Arcnor commented Feb 8, 2018

Sorry for the delay, I read this over Christmas but forgot about it later.

I haven't worked on this much lately, and what I have so far is a separate program that takes a header and tries to parse it using libclang. So far, everything seems to work, and the last thing I did was start working on structure generation.

I'll try to upload it to GitHub at some point, so even if I stop somebody else can continue or use it as reference, although it's Kotlin right now instead of Java.

@saudet

This comment has been minimized.

Copy link
Member

saudet commented Feb 8, 2018

Sounds great! Thanks

Does it look like we can do all that we need to do with the C API?
Or do you already see things that can not be done through the C++ API?

@Arcnor

This comment has been minimized.

Copy link

Arcnor commented Feb 8, 2018

For now, it seems we can do everything with this one, although I think I've had to hack a few things but it might just be my memory failing me. I'll let you know as soon as I finish my "demo" (parsing LibTCOD headers is the first thing I do when writing a generator like this. It's C only but it's a good non-trivial example)

@Arcnor

This comment has been minimized.

Copy link

Arcnor commented Feb 13, 2018

I need some help with some examples I'm trying to stress the API, I hope you don't mind me posting stuff here directly:

This is a contrived example, and it's probably something that is not a good way of doing things, but as it compiles & works correctly when tested, I'm guessing it should be supported on JavaCPP as well.

  struct StructTwo;

  struct StructOne {
     int *ptr2;
     int val1;

     struct StructTwo *ptr3;
  } StructOneVar;

  struct StructTwo {
     int *ptr2;
     int val1;

     struct StructOne *ptr3;
  } StructTwoVar;

  void passStruct(struct StructOne *s1);

What JavaCPP is generating for that right now is:

@Namespace("StructOne") @Name("StructOneVar.ptr2") public static native IntPointer StructOneVar_ptr2(); public static native void StructOneVar_ptr2(IntPointer StructOneVar_ptr2);
@Namespace("StructOne") @Name("StructOneVar.val1") public static native int StructOneVar_val1(); public static native void StructOneVar_val1(int StructOneVar_val1);

@Namespace("StructOne") @Name("StructOneVar.ptr3") public static native StructTwo StructOneVar_ptr3(); public static native void StructOneVar_ptr3(StructTwo StructOneVar_ptr3);

@Namespace("StructTwo") @Name("StructTwoVar.ptr2") public static native IntPointer StructTwoVar_ptr2(); public static native void StructTwoVar_ptr2(IntPointer StructTwoVar_ptr2);
@Namespace("StructTwo") @Name("StructTwoVar.val1") public static native int StructTwoVar_val1(); public static native void StructTwoVar_val1(int StructTwoVar_val1);

@Namespace("StructTwo") @Name("StructTwoVar.ptr3") public static native StructOne StructTwoVar_ptr3(); public static native void StructTwoVar_ptr3(StructOne StructTwoVar_ptr3);

public static native void passStruct(StructOne s1);

Obviously this won't work, because there is no StructOne or StructTwo defined (because they're not types but anonymous structures), but in C you can still do passStruct(&StructOneVar) and it will compile and work correctly.

Is this something that cannot be modelled in JavaCPP and a workaround must be found, or something we can still support?

Thanks!

@saudet

This comment has been minimized.

Copy link
Member

saudet commented Feb 16, 2018

No, that should work. It should produce a class definition + a variable declaration for each. That it doesn't means there's a bug somewhere...

@saudet

This comment has been minimized.

Copy link
Member

saudet commented Mar 6, 2018

@Arcnor How is it looking? Have you encountered any other issues?

@Arcnor

This comment has been minimized.

Copy link

Arcnor commented Mar 6, 2018

It's been mainly a general lack of time as always. IIRC I can fully parse a C header now except for a few things:

  1. I still need to solve that question about how anonymous structs with defined instances should be generated.
  2. I can't seem to find a way of getting enough information about C function pointers from the Clang C API (not sure if the C++ API is better, or I'm simply looking in the wrong place). For example, the names of the arguments seem to be lost, and I was fighting to get the real types instead of the resolved ones, but this is all from memory, I need to check again to see what the status was.
@saudet

This comment has been minimized.

Copy link
Member

saudet commented Mar 6, 2018

@Arcnor

This comment has been minimized.

Copy link

Arcnor commented Mar 6, 2018

My code is just a very big hack right now, so I don't want to put that publicly yet :).

I can upload & give you access on Gitlab, are you saudet there?

@saudet

This comment has been minimized.

Copy link
Member

saudet commented Mar 6, 2018

Don't worry about making it public. JavaCPP is a pretty big hack, and everything else out there is too. :)

In any case, yes, I'm also "saudet" on GitLab. Thanks!

@Arcnor

This comment has been minimized.

Copy link

Arcnor commented Mar 6, 2018

Well, you'll see what I mean when you check out the code :)

I've given you access to the repo. Let me know what you think, and if you can help with the issues I'm having, please do! :)

@saudet

This comment has been minimized.

Copy link
Member

saudet commented Mar 6, 2018

Thanks! As for anonymous structs, there are some instances of that in FFmpeg and OpenCV, for example:
https://github.com/bytedeco/javacpp-presets/blob/master/ffmpeg/src/main/java/org/bytedeco/javacpp/avutil.java#L5655

        @Name("data.ptr") public native @Cast("uchar*") BytePointer data_ptr(); public native CvMat data_ptr(BytePointer data_ptr);
        @Name("data.s") public native ShortPointer data_s(); public native CvMat data_s(ShortPointer data_s);
        @Name("data.i") public native IntPointer data_i(); public native CvMat data_i(IntPointer data_i);
        @Name("data.fl") public native FloatPointer data_fl(); public native CvMat data_fl(FloatPointer data_fl);
        @Name("data.db") public native DoublePointer data_db(); public native CvMat data_db(DoublePointer data_db);

https://raw.githubusercontent.com/bytedeco/javacpp-presets/master/opencv/src/main/java/org/bytedeco/javacpp/opencv_core.java

@Arcnor

This comment has been minimized.

Copy link

Arcnor commented Mar 6, 2018

Yeah, that works, but just because they don't reference the anonymous struct inside itself, and that's the issue I presented above. If data is of type MyStruct, if you add another field inside its struct of type MyStruct* then the generation is wrong.

So the question is, should we (always?) generate a class for the anonymous struct?

@saudet

This comment has been minimized.

Copy link
Member

saudet commented Mar 6, 2018

In the example that you gave above, the types are not anonymous, so we can allocate them and they should have a wrapper class, yes. Anonymous structs cannot be allocated independently because they have no name.

@Arcnor

This comment has been minimized.

Copy link

Arcnor commented Mar 6, 2018

I'm probably using the wrong terminology, yeah. So this requires a "wrapper" to be created, fine.

Then the next question is, what should this wrapper be named? I'm just asking because the type of struct M {} is not M, but struct M, and there can exist another type called M that is a completely different thing. Should we prefix all structs with Struct for example? (I'd normally check JavaCPP for current behavior, but as I said, right now no classes are being generated in such cases)

@saudet

This comment has been minimized.

Copy link
Member

saudet commented Mar 6, 2018

@saudet

This comment has been minimized.

Copy link
Member

saudet commented Jun 8, 2018

Hey, I'm finally able to start working on this a bit. Sorry it took so long. I looked at your repository, it's a great start!! Thanks

I've tried to figure out how to get information about the argument names for function pointers, but I haven't found much of anything either. I'm sure there are going to be a lot of cases like that, so we need something general enough to work around those issues. The API gives us a way to get the extents of the declarations from inside the source files themselves. It looks like this:

 val extent = clang_getCursorExtent(cursor)
 val start = clang_getRangeStart(extent)
 val end = clang_getRangeEnd(extent)

 val file1 = CXFile(); val line1 = IntArray(1); val column1 = IntArray(1); val offset1 = IntArray(1)
 val file2 = CXFile(); val line2 = IntArray(1); val column2 = IntArray(1); val offset2 = IntArray(1)
 clang_getSpellingLocation(start, file1, line1, column1, offset1);
 clang_getSpellingLocation(end, file2, line2, column2, offset2);
 println("${clang_getFileName(file1).string} ${line1[0]} ${column1[0]} ${offset1[0]}")
 println("${clang_getFileName(file2).string} ${line2[0]} ${column2[0]} ${offset2[0]}")
 val bytes = Files.readAllBytes(Paths.get(clang_getFileName(file1).string))
 println(String(bytes, offset1[0], offset2[0] - offset1[0]))

And the output looks like this for that function pointer in a typedef:

src/main/resources/complex.h 10 1 97
src/main/resources/complex.h 10 90 186
typedef value_t (*parser_custom_t)(value_t *lex, void *listener, int str, char *propname)

Pretty clean, with all the names and everything. We can then go on parsing that manually possibly with a regex or in some other ad hoc manner, such as striping out the "typedef" part and passing that back to Clang as a function prototype. In any case, it shouldn't be too hard in general given that Clang has usually already done most of the hard work at this point. What do you think of this approach, not just for function pointers but for any other corner case that we encounter?

@Arcnor

This comment has been minimized.

Copy link

Arcnor commented Jun 10, 2018

Hey, sorry for the delay.

Good that you can finally work on this! I've been more and more busy lately, so I haven't been able to continue on it (also I'm not using it for the time being due to different reasons).

The parsing sounds good, although it's obviously not as good as having it as part of the API. Maybe we should open some ticket on their tracker to see if it can be done, or to ask for the capability to be added?

In any case, parsing by regex should be fine as long as the output you get from that code is what CLang has processed and not directly from the file (which I guess it is), although even in that case I think it's more foolproof to have a real parser (maybe Clang itself after removing the first part of the output so it thinks it's a normal function declaration?). I'm just worried about edge cases and whatnot, like having some sort of default parameter value or similar (I know this is not possible, but just so you get the idea of what I'm worried about).

I also hope this is the only place where we cannot really use the API, but I can't remember how far my code was from generating usable bindings.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment