
Spark.jl for Julia 1.1.0 #70

Closed
msobiesk23 opened this issue Mar 8, 2019 · 13 comments

@msobiesk23

Hello! I was wondering -- is there any timeline for the Spark.jl package being updated for Julia 1.1.0? Thank you for any info you can give!

@dfdx
Owner

dfdx commented Mar 8, 2019

Hi! Spark.jl should work fine on Julia 1.x, including Julia 1.1. Are you running into any issues with it?

@msobiesk23
Author

I have been having a few issues running Spark.jl on Julia 1.1.0. First, whenever I tried to run Spark.init() on a Hadoop cluster I got the error

ERROR: BoundsError: attempt to access 1-element Array{SubString{String},1} at index [2]
Stacktrace:
[1] getindex at ./array.jl:729 [inlined]
[2] load_spark_defaults(::Dict{Any,Any}) at /home/s32cqh/.julia/packages/Spark/kFCaM/src/init.jl:61
[3] init() at /home/s32cqh/.julia/packages/Spark/kFCaM/src/init.jl:5
[4] top-level scope at none:0

I believe there was also another error (I can’t quite remember what caused this one) that resulted in Julia closing entirely.

Do you have any recommendation for how to resolve these issues? Thank you very much for any advice you can give!

@dfdx
Owner

dfdx commented Mar 10, 2019

This looks like an unhandled error while reading the Spark configuration. Can you show your spark-defaults.conf file (it should be in the $SPARK_HOME directory by default)?
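
For reference, that BoundsError is consistent with a config line that has a key but no value. A rough sketch of the failure mode (not the actual Spark.jl parser, which may differ):

line = "spark.eventLog.enabled"    # hypothetical spark-defaults.conf line that is missing its value
parts = split(line)                # 1-element Array{SubString{String},1}
key, value = parts[1], parts[2]    # ERROR: BoundsError: attempt to access 1-element Array ... at index [2]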

@msobiesk23
Author

Sorry about this -- I went to check, and I realized that I was actually looking at an old issue we were able to address. The actual problem now is that every time we run Spark.init() we get the error

Segmentation fault

and then Julia closes without any further context. I've also attached some of the output we got when adding the package to Julia, in case that helps at all.

[attached screenshot: image001]

@dfdx
Owner

dfdx commented Mar 11, 2019

It turns out to be an issue with JavaCall, see JuliaInterop/JavaCall.jl#96. Unfortunately, the only workaround for now is to stick to Julia 1.0.

@msobiesk23
Author

I'm currently trying Julia 1.0.3, and JavaCall appears to still be having some issues. When I try to run the code

using Spark
Spark.init()
sc = SparkContext(master="local")
text = parallelize(sc, ["hello world", "the world is one", "we are the world"])
words = flat_map(text, split)
words_tuple = cartesian(words, parallelize(sc, [1]))
counts = reduce_by_key(words_tuple, +)
result = collect(counts)

everything runs fine up until the last line

result = collect(counts)

which causes the error below. Do you have any suggestions for how to deal with or fix it? Thank you so much for any guidance you can provide!

ERROR: JavaCall.JavaCallError("Error calling Java: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.lang.Exception: MethodError(transcode, (UInt8, "hello"), 0x000000000000620d)\nStacktrace:\n [1] writeobj(::Sockets.TCPSocket, ::SubString{String}) at /home/s2k6av/.julia/packages/Spark/kFCaM/src/worker.jl:65\n [2] dump_stream(::Sockets.TCPSocket, ::Base.Iterators.Flatten{Array{Array{SubString{String},1},1}}) at /home/s2k6av/.julia/packages/Spark/kFCaM/src/worker.jl:92\n [3] main() at /home/s2k6av/.julia/packages/Spark/kFCaM/src/worker_runner.jl:25\n [4] top-level scope at none:0\n [5] include at ./boot.jl:317 [inlined]\n [6] include_relative(::Module, ::String) at ./loading.jl:1044\n [7] include(::Module, ::String) at ./sysimg.jl:29\n [8] exec_options(::Base.JLOptions) at ./client.jl:266\n [9] _start() at ./client.jl:425\n\tat org.apache.spark.api.julia.JuliaRDD$.readValueFromStream(JuliaRDD.scala:181)\n\tat org.apache.spark.api.julia.InputIterator.read(InputIterator.scala:33)\n\tat org.apache.spark.api.julia.InputIterator.<init>(InputIterator.scala:54)\n\tat org.apache.spark.api.julia.AbstractJuliaRDD.compute(JuliaRDD.scala:42)\n\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:287)\n\tat org.apache.spark.rdd.CartesianRDD.compute(CartesianRDD.scala:75)\n\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:287)\n\tat org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)\n\tat org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)\n\tat org.apache.spark.scheduler.Task.run(Task.scala:99)\n\tat org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\n\nDriver stacktrace:")
Stacktrace:
[1] geterror(::Bool) at /home/s2k6av/.julia/packages/JavaCall/toamy/src/core.jl:294
[2] geterror at /home/s2k6av/.julia/packages/JavaCall/toamy/src/core.jl:274 [inlined]
[3] _jcall(::JavaCall.JavaMetaClass{Symbol("org.apache.spark.api.julia.JuliaPairRDD")}, ::Ptr{Nothing}, ::Ptr{Nothing}, ::Type, ::Tuple{DataType}, ::JavaCall.JavaObject{Symbol("org.apache.spark.api.java.JavaPairRDD")}) at /home/s2k6av/.julia/packages/JavaCall/toamy/src/core.jl:247
[4] jcall(::Type{JavaCall.JavaObject{Symbol("org.apache.spark.api.julia.JuliaPairRDD")}}, ::String, ::Type, ::Tuple{DataType}, ::JavaCall.JavaObject{Symbol("org.apache.spark.api.java.JavaPairRDD")}) at /home/s2k6av/.julia/packages/JavaCall/toamy/src/core.jl:143
[5] collect_internal(::Spark.PipelinedPairRDD, ::Type, ::Type) at /home/s2k6av/.julia/packages/Spark/kFCaM/src/rdd.jl:233
[6] collect(::Spark.PipelinedPairRDD) at /home/s2k6av/.julia/packages/Spark/kFCaM/src/rdd.jl:281
[7] top-level scope at none:0

@dfdx
Owner

dfdx commented Mar 21, 2019

We use transcode to convert strings to byte arrays before moving them between the JVM and Julia. It turns out there's no transcode method for the substrings you get from split. A minimal example of the issue:

julia> ss = split("hello world")[1]
"hello"

julia> transcode(UInt8, ss)
ERROR: MethodError: no method matching transcode(::Type{UInt8}, ::SubString{String})
Closest candidates are:
  transcode(::Type{UInt8}, ::Array{#s57,1} where #s57<:Union{Int32, UInt32}) at c.jl:284
  transcode(::Type{UInt8}, ::AbstractArray{UInt16,1}) at c.jl:346
  transcode(::Type{T<:Union{Int32, UInt16, UInt32, UInt8}}, ::AbstractArray{T<:Union{Int32, UInt16, UInt32, UInt8},1}) where T<:Union{Int32, UInt16, UInt32, UInt8} at c.jl:276
  ...
Stacktrace:
 [1] top-level scope at none:0
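
Converting the substring back to a full String gives transcode a method it can dispatch to, which illustrates the workaround at the REPL:

julia> transcode(UInt8, String(ss))
5-element Array{UInt8,1}:
 0x68
 0x65
 0x6c
 0x6c
 0x6f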

So the quickest fix for now is to convert the substrings to strings in your pipeline:

using Spark

Spark.init()
sc = SparkContext(master="local")
text = parallelize(sc, ["hello world", "the world is one", "we are the world"])
words = flat_map(text, s -> [string(word) for word in split(s)])  # <-- this line changed
words_tuple = cartesian(words, parallelize(sc, [1]))
counts = reduce_by_key(words_tuple, +)
result = collect(counts)

I'll add a fix for this to master soon.
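
The fix on the Spark.jl side would probably look something like the sketch below (hypothetical; the actual patch may differ), converting substrings to full strings before they reach transcode:

# Sketch of a possible addition in src/worker.jl -- not the actual patch
writeobj(io, s::SubString) = writeobj(io, String(s))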

@msobiesk23
Author

Thank you for this! The code appears to be running now. However, we have a website that tracks whether the code is accessing the cluster, and it says the job is somehow not running on the cluster at all. Some technical support people I talked with said it appeared to be running on "some local version of Spark" rather than accessing the cluster. Have you encountered this type of issue before? Thanks again for all of your help!

@dfdx
Owner

dfdx commented Apr 7, 2019

Spark supports several cluster types. In my example I used local, which only requires Spark.jl itself to be installed on the machine and doesn't connect to any external resources. The most typical external cluster type is YARN, which you can connect to using:

using Spark

Spark.init()
sc = SparkContext(master="yarn-client")
...

Note that this requires your machine to be configured appropriately; in particular, Spark.jl should be able to find yarn-site.xml in the standard paths. If your machine is not pre-configured, please ask technical support how to set it up.
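
If the standard paths aren't picked up automatically, one common approach is to point the Hadoop configuration environment variables at the directory containing yarn-site.xml before initialising. The paths below are placeholders and depend on your cluster:

# Hypothetical paths -- adjust to wherever your cluster keeps its Hadoop configuration
ENV["HADOOP_CONF_DIR"] = "/etc/hadoop/conf"   # directory containing yarn-site.xml and core-site.xml
ENV["YARN_CONF_DIR"]   = "/etc/hadoop/conf"

using Spark
Spark.init()
sc = SparkContext(master="yarn-client")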

Other types of clusters are Standalone and Mesos.

@Moelf

Moelf commented Jun 13, 2019

Sorry for the noise, but is Spark.jl now considered stable/mature for common uses and for Julia 1.x in general?

@dfdx
Owner

dfdx commented Jun 13, 2019

Spark.jl is still missing a lot of the API available in Scala and Python, mostly because that API is quite large and demand from the Julia community is quite low, but adding the missing pieces is usually trivial.

The bigger problem is compatibility with Julia > 1.0: JuliaInterop/JavaCall.jl#96 is still open, which means any library based on JavaCall.jl will not run on Julia 1.1. (It may work on the upcoming Julia 1.2, though.)

If you are OK with these two points, Spark.jl should work pretty reliably.

@NDari

NDari commented Mar 7, 2020

This seems to be fixed in JavaCall 0.7.3 on Julia 1.3+, when running Julia with JULIA_COPY_STACKS set to 1.

@aviks
Collaborator

aviks commented Jun 29, 2020

Closing this. As per the JavaCall README:

On Non-Windows operating system: JavaCall and its derivatives do not work correctly on Julia 1.1 and Julia 1.2. On Julia 1.3, please set the environment variable JULIA_COPY_STACKS. On 1.1 and 1.2, and on 1.3 without JULIA_COPY_STACKS set, you may see segfaults or incorrect results. This is typically due to stack corruption. The Julia long-term-support version of 1.0.x continues to work correctly as before.
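
For reference, JULIA_COPY_STACKS is read when Julia starts, so it typically has to be set in the environment before launching Julia rather than from inside the REPL. A small sanity check you can run before loading Spark.jl (a sketch; it only emits a warning rather than a hard error):

# Warn if the JavaCall workaround isn't active in this session
if VERSION >= v"1.1" && get(ENV, "JULIA_COPY_STACKS", "") == ""
    @warn "JULIA_COPY_STACKS is not set; JavaCall-based packages such as Spark.jl may segfault"
end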
