Use Apache Arrow for interprocess communication #58

Open
dfdx opened this issue Mar 21, 2018 · 7 comments

dfdx (Owner) commented Mar 21, 2018

According to the Apache Arrow description, it provides "zero-copy streaming messaging", which may help eliminate the cost of transferring data between the JVM and Julia. Julia already has bindings for Arrow, so it shouldn't be too much work (although it may be worth waiting until Julia 0.7).
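For context, here is a minimal sketch of what the streaming IPC format looks like from the Julia side, assuming the current Arrow.jl API (the sample table is made up):

```julia
using Arrow, Tables

# Producer side: serialize a table to the Arrow IPC stream format.
buf = Arrow.tobuffer((id = [1, 2, 3], name = ["a", "b", "c"]))

# Consumer side: iterate record batches; column buffers are read
# directly from the underlying bytes rather than deserialized.
for batch in Arrow.Stream(buf)
    println(Tables.columntable(batch))
end
```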

cc: @xmehaut

ExpandingMan commented

I'm toying with the idea of expanding my Arrow package into a complete implementation so that we can use it here. I haven't completely made up my mind on whether I'll take this on yet, but if people here are interested, I'd like to hear about it.

dfdx (Owner) commented Feb 27, 2019

Yes! From my observations, interprocess communication is the main performance killer for the RDD API, so switching to Arrow should be the most important improvement in a while. I haven't looked into the Arrow API yet, though, so the change will take time.

ValdarT commented Feb 12, 2021

Now that Arrow.jl is in a good state, might it be worth revisiting this?

dfdx (Owner) commented Feb 15, 2021

Although integrating Arrow into the existing API may be easy, I believe we need to drop the RDD API and fully migrate to the Dataset API first; otherwise we would have to implement the serialization layer twice, one of those times for an interface that is little used nowadays.

To fully support the Dataset API we must implement Julia UDFs, similar to PythonUDF. PythonUDF extends Catalyst's Expression, and honestly I don't yet understand all the underlying machinery, so it's quite a huge change. I'll try to gather more information and create a preliminary plan, but I can't commit to any changes in the near future.
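For illustration only, the user-facing side might eventually look something like the sketch below; every name in it is hypothetical, and the hard part (a JuliaUDF Catalyst expression analogous to PythonUDF) is hidden behind the macro:

```julia
# Purely hypothetical sketch -- none of these functions exist in
# Spark.jl today; the point is only the shape of the desired API.
using Spark

spark = SparkSession()                      # hypothetical session setup
ds = read_parquet(spark, "people.parquet")  # hypothetical reader

# @udf would need to ship the closure to Julia workers and register a
# JuliaUDF expression (analogous to PythonUDF) with Catalyst.
age_next_year = @udf (age::Int32) -> age + Int32(1)

ds2 = with_column(ds, "age_next_year", age_next_year(col("age")))
```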

exyi (Contributor) commented Jul 17, 2021

Using Arrow seems like a good idea to me. I'd be willing to help implement this, but I'd probably need some help. I think I know how to use Arrow in Julia and how to create Arrow data in Spark, but I'm really not sure how to send the data from the JVM to Julia without copying :/

I also find it limiting that only a few primitive field types are currently supported, with no support for arrays, structs, and maps. Using Arrow instead of the jcall-based conversion mechanism (on Dataset) or the custom format (in RDDs) should also help with this problem (apart from being faster), right?
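For what it's worth, Arrow itself covers those nested types. A quick sketch with the current Arrow.jl (made-up data) round-trips lists, structs, and maps:

```julia
using Arrow, Tables

tbl = (
    xs     = [[1, 2], [3]],                               # list column
    person = [(name = "a", n = 1), (name = "b", n = 2)],  # struct column
    attrs  = [Dict("k" => 1), Dict("k" => 2)],            # map column
)

# Serialize to the IPC stream format and read it back.
out = Arrow.Table(Arrow.tobuffer(tbl))
Tables.columntable(out)
```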

dfdx (Owner) commented Jul 17, 2021

Currently Julia and the JVM communicate in two ways:

  • Julia starts a JVM and calls Java functions via JavaCall. In particular, the Julia driver creates a Spark application and delegates computations to the JVM, including JuliaRDD.
  • The JVM, or more specifically JuliaRDD, starts a Julia worker per partition and streams the RDD's data to it via ordinary OS sockets.

The most important parts on the JVM side are JuliaRDD.writeValueToStream() and JuliaRDD.readValueFromStream(). The Julia worker is started via worker_runner.jl.
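If we swapped the custom format for Arrow, the worker loop could look roughly like this sketch (the port handshake is an assumption, and whether Arrow.Stream reads incrementally from a socket rather than buffering would need checking):

```julia
using Sockets, Arrow, Tables

# Assumed handshake: the JVM passes the data port as the first argument,
# in the spirit of how worker_runner.jl receives its configuration.
port = parse(Int, ARGS[1])
sock = connect(ip"127.0.0.1", port)

# Each record batch would correspond to a chunk of the partition's data.
for batch in Arrow.Stream(sock)
    for row in Tables.rows(batch)
        # ... apply the user's Julia function to `row` ...
    end
end
```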

At the moment I don't see a way to implement Julia UDFs, so apparently we are left with the RDD API. I haven't used Arrow or Arrow.jl myself, but I guess that to migrate to it in Spark.jl we'd need to create an Arrow data structure in the JVM and then reference it from the Julia worker. Since I haven't done much Spark work lately, I don't have a specific plan for this, but I'll be happy to support the effort as much as I can!
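One possible "reference rather than copy" handoff, sketched under the assumption that the JVM writes a partition to an Arrow file and sends its path over the existing channel: Arrow.Table memory-maps files, so the Julia worker would read the columns without copying them.

```julia
using Arrow

# Assumed protocol: the JVM writes an Arrow file and sends the path
# over the existing socket / stdin channel.
path = readline(stdin)

# Arrow.Table memory-maps the file; column data is not copied.
tbl = Arrow.Table(path)
sum(tbl.value)   # `value` is a made-up column name for illustration
```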

dfdx (Owner) commented Jul 17, 2021

More broadly, there are several ways to efficiently bring custom Julia functions to Spark clusters, including things like compiling Julia to Java or creating a new distributed computation framework. But the demand for such features is unclear to me. @exyi, do you already have a use case for custom Julia functions on Spark?
