Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Android support for TensorFlow presets #297

Merged
merged 18 commits into from
Oct 15, 2016

Conversation

andreas-eberle
Copy link
Contributor

This PR is not complete yet. However, as I need some support I open this for the discussion. My problems are the following:

  • The android patch file contains the hard coded android ndk path for my machine. How can I make this dynamic? I cannot patch in an environment variable because bazel does not understand that.
  • The cppbuild.sh works when I run it for android-arm. However, when I run the mvn build for android-arm, the generated tensorflow.java file is empty and the library is not packaged into the installed dependency. Nevertheless, the maven build is still shown as a success.

How can I fix these problems?

@saudet
Copy link
Member

saudet commented Oct 1, 2016

  • We could use sed for dynamically patching files.
  • We'll need to add "android" to the @Platform list in presets/tenserflow.java

Conflicts:
	pom.xml
	tensorflow/cppbuild.sh
	tensorflow/pom.xml
	tensorflow/src/main/java/org/bytedeco/javacpp/presets/tensorflow.java
@andreas-eberle
Copy link
Contributor Author

@saudet: I updated to the new tensorflow version by merging in the master and added the android in the @platform list. However, now the android build doesn't work any more. I get these errors:
https://gist.github.com/andreas-eberle/54aef6450b578f25566f5e14663fd6ba

Do you have an idea what I need to fix?

I will try using sed for patching the file.

@saudet
Copy link
Member

saudet commented Oct 2, 2016

A lot of things have changed between 0.9.0 and 0.10.0. Maybe the build for Android broke. Please confirm with someone upstream that build for Android isn't broken. Thanks!

@andreas-eberle
Copy link
Contributor Author

The cppbuild.sh build for android worked fine. The compilation of the jni classes is failing, not the tensorflow build. But I will test there Android example to be sure.

@saudet
Copy link
Member

saudet commented Oct 2, 2016

The errors you are getting mean that libtensorflow_cc.so is not properly built. That has nothing to do with JavaCPP. Please ask them how to get a properly working version of libtensorflow_cc.so on Android, regardless of whether the other Android examples work or not. We need libtensorflow_cc.so.

@andreas-eberle
Copy link
Contributor Author

Because the TensorFlow guys prefer questions like this on stackoverflow, I created a question there. https://stackoverflow.com/questions/39855672/tensorflow-how-to-compile-libtensorflow-cc-so-for-android

@andreas-eberle
Copy link
Contributor Author

From what I read in some tensorflow issues, it seams that they don't support the ops for Android. I tried to include them, but there are several dependencies that aren't working for Android or at least not without significant work.

@saudet: Is there a way to exclude the ops when building for Android?

@saudet
Copy link
Member

saudet commented Oct 6, 2016

Sure, we can provide a different list of header files to use with @Platform(value="android", include=...

@andreas-eberle
Copy link
Contributor Author

@saudet: I had to remove some headers and also some Infos from the map. For easier comparison which headers have been removed, I temporarily removed the macos and linux Platform.

At the moment there are two problems:

  1. It still doesn't compile due to the following errors https://gist.github.com/andreas-eberle/51330bf6c8f35f8741a0c593abcf7b2b. Do you have an idea how I can fix them? I ran bazel with the -s option and saw that e.g. RegisterOps is included in the build.
  2. As you can see in my commit, I had to remove several Infos put into the infoMap to get rid of tons of errors. How can I define that these Infos must only be used for mac and linux?

Many thanks in advance!

@saudet
Copy link
Member

saudet commented Oct 7, 2016

DotGraph and others are obviously not included in the build. You will need to get those in the build, or remove their interfaces from the presets.

You shouldn't need to remove any Info to get this working. If you could point me to a simple example of one Info that is causing problems, it will be easier to start explaining what you need to do.

@andreas-eberle
Copy link
Contributor Author

I removed the dot.h from the presets, as it was indeed not included in the binary. However, I still get these errors: https://gist.github.com/andreas-eberle/90e987abcf2cdd3f91838339e675cec1

I checked and the op.cc file is compiled and included into the binary. I cannot figure out what could be missing. Furthermore, if I remove that header, I have to remove many others (I din't find a valid configuration yet).

What Infos could help you to identify the issue? What would you do?

@saudet
Copy link
Member

saudet commented Oct 7, 2016

Well, seeing the errors that you get would be a good start.

@andreas-eberle
Copy link
Contributor Author

Sorry that I didn't give you the full log output... https://gist.github.com/andreas-eberle/0ae0185f3dfab6ad26c78348621e6cba is the output of a complete clean install.

I also did the bazel build with the -s option to see all files that are included. You can find the result here: https://gist.github.com/andreas-eberle/ce3ccda61c54f9a8292b9de484b402d0

@saudet
Copy link
Member

saudet commented Oct 7, 2016

Thanks!

You should either move the file to this package or depend on an appropriate rule there. <- You should probably try to fix those build warnings from Bazel because I don't get those for the normal build.

error: undefined reference to 'RegisterOps' <- If you didn't build the ops, that's pretty normal, so let's remove those header files and see what errors we get then.

@andreas-eberle
Copy link
Contributor Author

You should either move the file to this package or depend on an appropriate rule there.

This is caused by the built-in android build rules. The same thing happens when you build the TensorFlow Android example. Therefore I don't think that's making us problems. Furthermore, from what I read and understand, this only says it's not good style to include sources from another package and you'd better only include libraries from other packages.

error: undefined reference to 'RegisterOps'

I found out that most of the ops are actually included in the Android build (see tensorflow/tensorflow#4795).

What I don't understand about the RegisterOps is that is declared in core/framework/op.h as ./tensorflow/core/framework/op.h:extern "C" void RegisterOps(void* registry_ptr); but never implemented anywhere. (I grepped for it in the tensorflow folder. I also grepped for it after I did a complete build for MacOSX to check if it was generated, but couldn't find it.) Anyways, the op.cc file is included in the build. So I really don't get how this normally works and why it isn't working the same way here. Do you have an idea? Or should I ask the TensorFlow guys how this is working?

I will try to remove those headers in the meantime to see if I can get something building.

@saudet
Copy link
Member

saudet commented Oct 8, 2016

Ok, so we can skip those with new Info("tensorflow::RegisterOps(void*)", ...).skip()

@andreas-eberle
Copy link
Contributor Author

andreas-eberle commented Oct 8, 2016

I skipped the RegisterOps. As the other errors were still there, I also skipped those definitions (see commit). Furthermore, I had to remove the definitions of the OpDefBuilderWrapper. However, now I get these errors: https://gist.github.com/andreas-eberle/4cfcbd1b26b4b65dcd5f2db709d21225

In the generated tensorflow.java, I saw that in line 10312 the class OpDefBuilderReceiver is generated. How can I skip that whole class?

EDIT: Never mind, found it.

@andreas-eberle
Copy link
Contributor Author

The current version compiles and sometimes works fine. However, sometimes it crashes after loading the model with one of the following errors (Android does not give me more than this):

A/libc: Fatal signal 11 (SIGSEGV), code 1, fault addr 0x5030481 in tid 25840 (JavaCPP Dealloc)

10-08 21:35:16.214 27914-27958/arconsis.com.tensorflowtestapplication A/libc: Invalid address 0xdea55164 passed to free: value not allocated 10-08 21:35:16.214 27914-27958/arconsis.com.tensorflowtestapplication A/libc: Fatal signal 6 (SIGABRT), code -6 in tid 27958 (JavaCPP Dealloc)

I debugged into my dummy application and found that this does not happen during any call to the tensorflow/JavaCPP library. It seems to happen in the background. @saudet: Do you have an idea what could be causing this?

@andreas-eberle andreas-eberle force-pushed the tensorflow-android branch 2 times, most recently from 169f0e6 to 768f0e4 Compare October 8, 2016 22:29
@saudet
Copy link
Member

saudet commented Oct 10, 2016

Ah, I see. Fixed in the commit above. Thanks for reporting!

@andreas-eberle
Copy link
Contributor Author

Thx, that worked. I now wanted to replace the last skip in these lines with the @Platform(not = "android") annotation.

However, in the generated sources for OpDefBuilderReceiver, the constructors have arguments of type TrueOpDefBuilderWrapper. As this type is also annotated with the @Platform(not = "android") annotation, the annotation is added to the parameter of the constructor. See this generated code:

@Platform(not = "android") @Namespace("tensorflow::register_op") public static class OpDefBuilderReceiver extends Pointer {
    static { Loader.load(); }
    /** Pointer cast constructor. Invokes {@link Pointer#Pointer(Pointer)}. */
    public OpDefBuilderReceiver(Pointer p) { super(p); }

  // To call OpRegistry::Global()->Register(...), used by the
  // REGISTER_OP macro below.
  // Note: These are implicitly converting constructors.
  public OpDefBuilderReceiver(
        @Const @Platform(not = "android") @ByRef TrueOpDefBuilderWrapper wrapper) { super((Pointer)null); allocate(wrapper); }
  private native @Platform(not = "android") void allocate(
        @Const @Platform(not = "android") @ByRef TrueOpDefBuilderWrapper wrapper);  // NOLINT(runtime/explicit)
  public OpDefBuilderReceiver(@Const @Platform(not = "android") @ByRef FalseOpDefBuilderWrapper arg0) { super((Pointer)null); allocate(arg0); }
  private native @Platform(not = "android") void allocate(@Const @Platform(not = "android") @ByRef FalseOpDefBuilderWrapper arg0);  // NOLINT(runtime/explicit)
}

Then, the Java compiler complains that the @platform annotation is used for a parameter. I guess that this annotation must never be added to parameters. @saudet: What do you think?

@saudet
Copy link
Member

saudet commented Oct 10, 2016

Maybe these don't even get built for Linux or Mac OS X either you know. Have you confirmed that they do? Android is sensitive to undefined symbols, so errors show up, but there is no reason to have those interfaces around if they are not defined anywhere anyway.

@andreas-eberle
Copy link
Contributor Author

The classes OpDefBuilderReceiver and OpDefBuilderWrapper actually exist and are defined in tensorflow/core/framework/op.h and tensorflow/core/framework/op.cc. From my understanding, they are used to register new operations. So I think they can be used on macos/linux but aren't in the Android build.

But regardless of that, isn't the problem that the code generator adds the @Platform annotation to constructor/method parameters, where they aren't allowed, a problem similar to the one you fixed with 076cd97 in the javacpp project?

Anyways, what is your preferred way? Skipping these classes on all platforms or only on Android?

@saudet
Copy link
Member

saudet commented Oct 10, 2016

So, why not simply exclude the op.h file entirely in the case of Android?

@andreas-eberle
Copy link
Contributor Author

andreas-eberle commented Oct 10, 2016

@saudet: Thx for your question, it helped me to find out that I got some things mixed up.

  1. op.cc is part of the Android library.
  2. When removing all the changes I did to the infoMap, one problem was that the method declaration extern "C" void RegisterOps(void *) is an orphan, that has not been removed when it's implementation was removed. I created a PR for tensorflow (Removed orphaned declaration of extern "C" void RegisterOps(void *) tensorflow/tensorflow#4872) and fixed it in this PR via the patch file.
  3. Furthermore, the OpDefBuilder::Doc method was always reported as missing, although the op_def_builder.cc was included in the build. When I took a look at the code, I found out that there was an #ifndef TF_LEAN_BINARY which was only applied to this single method (I grepped the repository). In the tensorflow master, this code is already completely changed. Therefore I also fixed this via the patch file.

Therefore, there is only one problem left: The sporadic, asynchronous segfault when I run my test application:

10-10 17:31:31.461 13685-13816/arconsis.com.tensorflowtestapplication A/libc: Fatal signal 11 (SIGSEGV), code 1, fault addr 0x63202c2e in tid 13816 (JavaCPP Dealloc)

Is there a special JavaCPP Dealloc or something which is referred in that message? The exception is occurring at random times without my code doing anything special. Is there a chance the GC kills the library?

EDIT: I found the deallocator thread of JavaCPP, which is actually causing the problem. I verified this by pausing that thread. Now the application does not Segfault any more. Do you have an idea why that thread is causing this?

@saudet
Copy link
Member

saudet commented Oct 11, 2016

Well, like I said, you're probably doing something wrong in your code. Try to figure it out, you can set the "org.bytedeco.javacpp.logger.debug" system property to "true" to get more info in the log about what JavaCPP itself is doing.

@andreas-eberle
Copy link
Contributor Author

Thanks. The debugging output helped me to find the pointer that was cleaned up to early. The solution is that you have to keep a reference to the SessionOptions object. This is because tensorflow's session internally keeps a reference to the SessionOptions. But if you do not keep a reference in Java, JavaCPP does deallocate the SessionOptions object which then leads to a SegFault when running the Session the next time.

Would it be an option for you to keep the SessionOption object as a field in the AbstractSession class in the helper/tensorflow.java file?

Conflicts:
	tensorflow/src/main/java/org/bytedeco/javacpp/presets/tensorflow.java
	tensorflow/src/main/java/org/bytedeco/javacpp/tensorflow.java
saudet added a commit that referenced this pull request Oct 12, 2016
@saudet
Copy link
Member

saudet commented Oct 12, 2016

Yes, that would be the way to go. I made this change in the commit above. Thanks for looking into this!

@andreas-eberle
Copy link
Contributor Author

andreas-eberle commented Oct 12, 2016

Thx, I merged in your commit. Is there anything left from your side to merge this PR? And if you merge this PR, when will you release the 1.2.5 version?

@saudet
Copy link
Member

saudet commented Oct 12, 2016

Let's see, could you remove the tabs and put back the spaces instead?

@andreas-eberle
Copy link
Contributor Author

Done.

@saudet
Copy link
Member

saudet commented Oct 13, 2016

Thanks! It doesn't build with Bazel 0.3.0. I'm guessing you're using 0.3.1 or 0.3.2? These don't work with TensorFlow 0.10.0 and CUDA... Anyway, looks like TensorFlow 0.11.0 will get released before we all get this working :) I'm fine to merge this, but it won't get released.

@andreas-eberle
Copy link
Contributor Author

andreas-eberle commented Oct 14, 2016

I just switched back to Bazel 0.3.0 (I had 0.3.1) and tested the build. What build doesn't work for you? For me Android and MacOsX worked with Bazel 0.3.0.

@saudet
Copy link
Member

saudet commented Oct 14, 2016 via email

@andreas-eberle
Copy link
Contributor Author

andreas-eberle commented Oct 14, 2016

So are you building on a Linux or building for Linux? What is your used system and the target platform? Are you compiling with CUDA or without?

@saudet
Copy link
Member

saudet commented Oct 15, 2016

It just looks like we need to set the following environment variables under Fedora, even when building for Android:

export CC="/usr/bin/gcc"
export CXX="/usr/bin/g++"

Then it builds fine, for some reason, you know Bazel, whatever.

@saudet saudet merged commit 3b27084 into bytedeco:master Oct 15, 2016
saudet added a commit that referenced this pull request Oct 16, 2016
@andreas-eberle andreas-eberle deleted the tensorflow-android branch October 16, 2016 19:44
@andreas-eberle
Copy link
Contributor Author

Cool, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants