
Bare Java JNI Bindings optimized for Apache Spark #1798

Merged
merged 34 commits into from Jun 4, 2019

4 participants

@eisber (Collaborator) commented Mar 20, 2019

I moved the existing Java bindings into classic/. The additional bare bindings are under bare/.

Based on @jackgerrits's comment, I added java/bare/README.md to provide some justification for why these are separate.

The actual Apache Spark bindings using MLlib interfaces will be found in MMLSpark.

vagrant and others added some commits Mar 9, 2019

@JohnLangford (Member) commented Mar 20, 2019

@deaktator @jon-morra-zefr comments on this?

@eisber Is there an argument that we need two sets of bindings?

@jackgerrits changed the title from "Bara Java JNI Bindings optimized for Apache Spark" to "Bare Java JNI Bindings optimized for Apache Spark" on Mar 20, 2019

@jon-morra-zefr (Contributor) commented Mar 20, 2019

The methodology seems weird here. Why have two totally different interfaces? It seems like it would be confusing for people as opposed to integrating it all into one. It also duplicates a heck of a lot of surrounding infrastructure (pom, cmake, etc.).

@eisber (Collaborator, Author) commented Mar 21, 2019

Hi,

I outlined the difference in https://github.com/eisber/vowpal_wabbit/blob/marcozo/spark/java/bare/README.md

There are a couple of reasons this is duplicated:

  • the 2nd set of bindings is not intended for end users of VW, but rather for further integration layers; e.g. the feature hashing must happen before calling into these bindings.
  • the bare layer does not need the locking that happens in the classic layer, and that locking would negatively impact performance.
  • as for the pom: the bare build packages the Boost and zlib dependencies into the jar. As this is a fundamental change, I don't want to break the original approach.
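The "hashing must happen before calling" point can be sketched as follows. This is only an illustrative pre-hashing step: VW actually uses murmur3, while `String.hashCode()` here is a stand-in, and the `FeatureHasher`/`toIndex` names are hypothetical, not part of any VW API.

```java
// Hypothetical sketch: hash feature strings to indices in [0, 2^numBits)
// before handing index/value arrays to a bare JNI call. VW itself uses
// murmur3; String.hashCode() here is only a stand-in.
public class FeatureHasher {
    private final int mask;

    public FeatureHasher(int numBits) {
        this.mask = (1 << numBits) - 1;
    }

    public int toIndex(String namespace, String feature) {
        // Combine namespace and feature name, then mask down to the
        // weight-vector size (mirroring VW's -b bit setting).
        int h = (namespace + "^" + feature).hashCode();
        return h & mask; // masking also drops the sign bit
    }

    public static void main(String[] args) {
        FeatureHasher hasher = new FeatureHasher(18); // like vw -b 18
        int[] indices = {
            hasher.toIndex("user", "age=25"),
            hasher.toIndex("doc", "topic=sports")
        };
        float[] values = {1.0f, 1.0f};
        // A bare binding would now take (indices, values) directly,
        // skipping VW's text parser entirely.
        for (int i : indices) {
            if (i < 0 || i >= (1 << 18)) throw new AssertionError("index out of range");
        }
        System.out.println("indices in range: true");
    }
}
```

Doing this on the Spark side means the JNI layer never sees feature strings, which is the computational win being claimed.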

@jon-morra-zefr

  • what's your proposal for integration?
  • modify the original Java object? I'm worried about breaking any existing dependencies. Also, as stated above, I don't want to bear the cost of the locking.
  • have the classes in the same src/ folder?
  • create a single jar file?

Markus

@JohnLangford (Member) commented Mar 22, 2019

Reading through the README:

  1. Being able to construct examples directly in Java is a big computational win.
  2. Driving passes is possible with the current interface, so is not using/allowing a cache file a limitation of bare?
  3. Exposing allreduce/spanning tree is obviously important for parallel learning, which can become a big computational win.
  4. I'm a little bit unclear on locking. Having locking on by default seems reasonable since the VW reduction stack is not reentrant, but if you know you are using it in a thread-safe way it does add some overhead. I expect this is modest, however. @eisber, do you have any measurements of this?
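A rough way to answer the measurement question in (4) is a micro-harness comparing an empty synchronized method against an unsynchronized one. This is a hypothetical sketch, not a VW measurement, and a rigorous comparison would use JMH rather than hand-rolled timing; the class and method names are illustrative.

```java
// Hypothetical micro-harness: how much does a synchronized wrapper add
// per call? This times an empty locked vs. unlocked method, not VW
// itself; JMH would be the rigorous tool for this question.
public class LockOverhead {
    private long counter;

    public synchronized void lockedStep() { counter++; }
    public void bareStep() { counter++; }

    static long time(Runnable r, int iters) {
        long start = System.nanoTime();
        for (int i = 0; i < iters; i++) r.run();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        LockOverhead o = new LockOverhead();
        int iters = 5_000_000;
        time(() -> o.bareStep(), iters);   // warm-up for the JIT
        time(() -> o.lockedStep(), iters);
        long bare = time(() -> o.bareStep(), iters);
        long locked = time(() -> o.lockedStep(), iters);
        System.out.printf("bare: %.1f ns/call, locked: %.1f ns/call%n",
                (double) bare / iters, (double) locked / iters);
    }
}
```

On an uncontended lock the per-call cost is typically small in absolute terms, which is consistent with the "modest" expectation above; under contention from many Spark task threads it can grow considerably.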

So (1) and (3) seem like obviously desirable additional features of an interface.

The biggest concern I have is around having two interfaces: a single interface combining 'classic' with 1, 3, and optionally 4 would be more powerful, less confusing, and more maintainable.

@eisber (Collaborator, Author) commented Mar 22, 2019

  1. One could expose RunPasses, but the primary use-case is Spark, and there we have all the examples hashed in memory anyway, so performing disk I/O would just hurt performance.
  2. Not yet, but it feels weird to artificially introduce additional computation when trying to strip away as much as possible (though I can relate to adding some computation for the sake of simplifying the API).

Please also consider the different packaging, as the bare interface includes all the binaries inside the jar. It's definitely desirable to do it this way for ease of use on Azure Databricks/Spark; it's less clear for other existing use-cases, as it requires write access to a temp directory.
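The packaging trade-off above (binaries bundled inside the jar, extracted at load time, hence the temp-directory write requirement) can be sketched roughly like this. The resource layout and method names are hypothetical, and the demo substitutes an in-memory byte stream for the real jar resource so it runs standalone.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch of the "binaries in the jar" approach: copy a
// bundled native library out to a writable temp directory, then load it.
// Resource name and layout are illustrative, not VW's actual packaging.
public class NativeExtractor {
    static Path extractToTemp(InputStream bundled, String libName) throws IOException {
        Path dir = Files.createTempDirectory("vw-natives");
        Path target = dir.resolve(libName);
        Files.copy(bundled, target, StandardCopyOption.REPLACE_EXISTING);
        // In a real loader this would be followed by:
        //   System.load(target.toAbsolutePath().toString());
        // which is why write access to a temp directory is required.
        return target;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for something like getResourceAsStream("/natives/libvw_jni.so"):
        InputStream fake = new ByteArrayInputStream(new byte[] {0x7f, 'E', 'L', 'F'});
        Path extracted = extractToTemp(fake, "libvw_jni.so");
        System.out.println("extracted exists: " + Files.exists(extracted));
    }
}
```

This pattern is convenient on managed clusters because a single fat jar is the only artifact to ship, at the cost of assuming a writable temp filesystem on every executor.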

As for maintainability: the overlap between the old and the new interface is minimal (basically the vw::initialize() and finish() calls). Also, the pom files differ significantly (all binaries are included in the jar; unit tests cannot be used, only integration tests, as the classpath needs to be set on the jar to be able to extract the binaries).

The bare API is built with the Spark wrapper code in mind, which can control the state properly. Looking at https://github.com/VowpalWabbit/vowpal_wabbit/blob/master/java/src/main/java/vowpalWabbit/learner/VWIntLearner.java#L38, there is overhead for the lock and for the check whether the learner is still open.
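The overhead being described is the synchronized block plus the is-open check around every predict in the classic layer. A simplified contrast of the two calling conventions, with a dummy static method standing in for the JNI call and all class/method names hypothetical:

```java
// Simplified contrast of the two calling conventions under discussion.
// nativePredict stands in for the real JNI call; names are hypothetical.
public class CallStyles {
    static int nativePredict(int input) { return input * 2; } // fake JNI call

    static class ClassicLearner {
        private volatile boolean open = true;
        private final Object lock = new Object();

        // classic style: every call pays a lock plus an is-open check
        int predict(int input) {
            synchronized (lock) {
                if (!open) throw new IllegalStateException("learner closed");
                return nativePredict(input);
            }
        }

        void close() { synchronized (lock) { open = false; } }
    }

    static class BareLearner {
        // bare style: the caller (e.g. the Spark wrapper) owns lifecycle
        // and thread-safety, so the hot path is a plain delegation
        int predict(int input) { return nativePredict(input); }
    }

    public static void main(String[] args) {
        ClassicLearner classic = new ClassicLearner();
        BareLearner bare = new BareLearner();
        System.out.println(classic.predict(21) == bare.predict(21)); // prints true
    }
}
```

The bare style is only safe when some outer layer guarantees single-threaded access per learner and a well-defined close, which is exactly the responsibility the Spark wrapper is said to take on.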

Maybe naming can resolve the confusion, along with a bit more cleanup in the pom.

@eisber (Collaborator, Author) commented Mar 22, 2019

Potentially, building classic on top of bare would be an option... though the packaging and library loading need some careful thought. 2nd iteration?

@JohnLangford (Member) commented Mar 25, 2019

Is there a reason to avoid a union interface?

@jon-morra-zefr (Contributor) commented Mar 25, 2019

I haven't looked at this in a while, so I'm not going to comment on the details here. Suffice it to say that it feels like the goal here is to expose a different interface in addition to the one already exposed. Ideally these two interfaces would sit right beside each other in the same code base, and the build tools would be modified to produce multiple artifacts instead of maintaining two build chains. But since I'm not in the weeds, I'm not going to put my foot down here; just food for thought.

@eisber (Collaborator, Author) commented Apr 1, 2019

I'm happy to rename classic -> simple, bare -> spark.

I can refactor the pom.xml to avoid duplication.
'same can be done for C++ code' refers to CMake files.

I'm still iterating on implementation, so it might take a bit.

@JohnLangford (Member) commented Apr 1, 2019

Ok, sounds like we have a plan.

eisber and others added some commits Apr 29, 2019

Resolved review threads (outdated): java/CMakeLists.txt (×2), java/README.md, java/src/main/c++/util.h
@JohnLangford (Member) commented Jun 3, 2019

@eisber Can you respond to @jackgerrits? This looks very close to a merge, so I'd like to get it in for the next release.

eisber added some commits Jun 4, 2019

clean up of CMakeFile.txt
formatted code
improved Spanning Tree error message
re-added README.md
moved JNI util to the right place

@eisber eisber merged commit fa5c3df into VowpalWabbit:master Jun 4, 2019

7 of 8 checks passed

  • LGTM analysis: Java - Analysis failed (could not build the merge commit and the base commit (7b1fd30))
  • LGTM analysis: C# - No code changes detected
  • LGTM analysis: C/C++ - No new or fixed alerts
  • LGTM analysis: JavaScript - No code changes detected
  • LGTM analysis: Python - No code changes detected
  • continuous-integration/appveyor/pr - AppVeyor build succeeded
  • continuous-integration/travis-ci/pr - The Travis CI build passed
  • coverage/coveralls - Coverage increased (+0.007%) to 73.423%