
Spark.jl for Julia 1.1.0 #70

Closed
msobiesk23 opened this issue Mar 8, 2019 · 13 comments

@msobiesk23

Hello! I was wondering -- is there any timeline for the Spark.jl package being updated for Julia 1.1.0? Thank you for any info you can give!

@dfdx
Owner

dfdx commented Mar 8, 2019

Hi! Spark.jl should work fine on Julia 1.x, including Julia 1.1. Are you running into any issues with it?

@msobiesk23
Author

I have been having a few issues running Spark.jl on Julia 1.1.0. First, whenever I tried to run Spark.init() on a Hadoop cluster I got the error

ERROR: BoundsError: attempt to access 1-element Array{SubString{String},1} at index [2]
Stacktrace:
[1] getindex at ./array.jl:729 [inlined]
[2] load_spark_defaults(::Dict{Any,Any}) at /home/s32cqh/.julia/packages/Spark/kFCaM/src/init.jl:61
[3] init() at /home/s32cqh/.julia/packages/Spark/kFCaM/src/init.jl:5
[4] top-level scope at none:0

I believe there was also another error (I can’t quite remember what caused this one) that resulted in Julia closing entirely.

Do you have any recommendation for how to resolve these issues? Thank you very much for any advice you can give!

@dfdx
Owner

dfdx commented Mar 10, 2019

This looks like an unhandled error while reading the Spark configuration. Can you show your spark-defaults.conf file (it should be in the $SPARK_HOME directory by default)?
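
For reference, that BoundsError is consistent with a config line that has a key but no value. A rough sketch of the failure mode (not the actual Spark.jl parser, which may differ):

line = "spark.eventLog.enabled"    # hypothetical spark-defaults.conf line that is missing its value
parts = split(line)                # 1-element Array{SubString{String},1}
key, value = parts[1], parts[2]    # ERROR: BoundsError: attempt to access 1-element Array ... at index [2]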

@msobiesk23
Author

Sorry about this -- I went to check, and I realized that I was actually looking at an old issue we were able to address. The actual problem now is that every time we run Spark.init() we get the error

Segmentation fault

and then Julia closes without any further context. I've also attached some of the output we got when adding the package to Julia, in case that helps at all.

[attached screenshot: image001]

@dfdx
Owner

dfdx commented Mar 11, 2019

It turns out to be an issue with JavaCall, see JuliaInterop/JavaCall.jl#96. Unfortunately, the only workaround for now is to stick to Julia 1.0.

@msobiesk23
Author

I'm currently trying Julia 1.0.3, and JavaCall appears to still be having some issues. When I try to run the code

using Spark
Spark.init()
sc = SparkContext(master="local")
text = parallelize(sc, ["hello world", "the world is one", "we are the world"])
words = flat_map(text, split)
words_tuple = cartesian(words, parallelize(sc, [1]))
counts = reduce_by_key(words_tuple, +)
result = collect(counts)

everything runs fine up until the last line

result = collect(counts)

which causes the error below. Do you have any suggestions for how to deal with or fix it? Thank you so much for any guidance you can provide!

ERROR: JavaCall.JavaCallError("Error calling Java: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.lang.Exception: MethodError(transcode, (UInt8, "hello"), 0x000000000000620d)\nStacktrace:\n [1] writeobj(::Sockets.TCPSocket, ::SubString{String}) at /home/s2k6av/.julia/packages/Spark/kFCaM/src/worker.jl:65\n [2] dump_stream(::Sockets.TCPSocket, ::Base.Iterators.Flatten{Array{Array{SubString{String},1},1}}) at /home/s2k6av/.julia/packages/Spark/kFCaM/src/worker.jl:92\n [3] main() at /home/s2k6av/.julia/packages/Spark/kFCaM/src/worker_runner.jl:25\n [4] top-level scope at none:0\n [5] include at ./boot.jl:317 [inlined]\n [6] include_relative(::Module, ::String) at ./loading.jl:1044\n [7] include(::Module, ::String) at ./sysimg.jl:29\n [8] exec_options(::Base.JLOptions) at ./client.jl:266\n [9] _start() at ./client.jl:425\n\tat org.apache.spark.api.julia.JuliaRDD$.readValueFromStream(JuliaRDD.scala:181)\n\tat org.apache.spark.api.julia.InputIterator.read(InputIterator.scala:33)\n\tat org.apache.spark.api.julia.InputIterator.<init>(InputIterator.scala:54)\n\tat org.apache.spark.api.julia.AbstractJuliaRDD.compute(JuliaRDD.scala:42)\n\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:287)\n\tat org.apache.spark.rdd.CartesianRDD.compute(CartesianRDD.scala:75)\n\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:287)\n\tat org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)\n\tat org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)\n\tat org.apache.spark.scheduler.Task.run(Task.scala:99)\n\tat org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\n\nDriver stacktrace:")
Stacktrace:
[1] geterror(::Bool) at /home/s2k6av/.julia/packages/JavaCall/toamy/src/core.jl:294
[2] geterror at /home/s2k6av/.julia/packages/JavaCall/toamy/src/core.jl:274 [inlined]
[3] _jcall(::JavaCall.JavaMetaClass{Symbol("org.apache.spark.api.julia.JuliaPairRDD")}, ::Ptr{Nothing}, ::Ptr{Nothing}, ::Type, ::Tuple{DataType}, ::JavaCall.JavaObject{Symbol("org.apache.spark.api.java.JavaPairRDD")}) at /home/s2k6av/.julia/packages/JavaCall/toamy/src/core.jl:247
[4] jcall(::Type{JavaCall.JavaObject{Symbol("org.apache.spark.api.julia.JuliaPairRDD")}}, ::String, ::Type, ::Tuple{DataType}, ::JavaCall.JavaObject{Symbol("org.apache.spark.api.java.JavaPairRDD")}) at /home/s2k6av/.julia/packages/JavaCall/toamy/src/core.jl:143
[5] collect_internal(::Spark.PipelinedPairRDD, ::Type, ::Type) at /home/s2k6av/.julia/packages/Spark/kFCaM/src/rdd.jl:233
[6] collect(::Spark.PipelinedPairRDD) at /home/s2k6av/.julia/packages/Spark/kFCaM/src/rdd.jl:281
[7] top-level scope at none:0

@dfdx
Owner

dfdx commented Mar 21, 2019

We use transcode to convert strings to byte arrays before moving them between the JVM and Julia. It turns out there's no transcode method for the substrings you get from split. A minimal example of the issue:

julia> ss = split("hello world")[1]
"hello"

julia> transcode(UInt8, ss)
ERROR: MethodError: no method matching transcode(::Type{UInt8}, ::SubString{String})
Closest candidates are:
  transcode(::Type{UInt8}, ::Array{#s57,1} where #s57<:Union{Int32, UInt32}) at c.jl:284
  transcode(::Type{UInt8}, ::AbstractArray{UInt16,1}) at c.jl:346
  transcode(::Type{T<:Union{Int32, UInt16, UInt32, UInt8}}, ::AbstractArray{T<:Union{Int32, UInt16, UInt32, UInt8},1}) where T<:Union{Int32, UInt16, UInt32, UInt8} at c.jl:276
  ...
Stacktrace:
 [1] top-level scope at none:0
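
Converting the substring back to a full String gives transcode a method it can dispatch to, which illustrates the workaround at the REPL:

julia> transcode(UInt8, String(ss))
5-element Array{UInt8,1}:
 0x68
 0x65
 0x6c
 0x6c
 0x6f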

So the quickest fix for now is to convert the substrings to strings in your pipeline:

using Spark

Spark.init()
sc = SparkContext(master="local")
text = parallelize(sc, ["hello world", "the world is one", "we are the world"])
words = flat_map(text, s -> [string(word) for word in split(s)])  # <-- this line changed
words_tuple = cartesian(words, parallelize(sc, [1]))
counts = reduce_by_key(words_tuple, +)
result = collect(counts)

I'll add a fix for this to master soon.
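
The fix on the Spark.jl side would probably look something like the sketch below (hypothetical; the actual patch may differ), converting substrings to full strings before they reach transcode:

# Sketch of a possible addition in src/worker.jl -- not the actual patch
writeobj(io, s::SubString) = writeobj(io, String(s))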

@msobiesk23
Author

Thank you for this! The code appears to be running now. However, we have a website that tracks whether the code is accessing the cluster, and it says the job is somehow not running on the cluster at all. Some technical support people I talked with said it appeared to be running on "some local version of Spark" rather than accessing the cluster. Have you encountered this type of issue before? Thanks again for all of your help!

@dfdx
Owner

dfdx commented Apr 7, 2019

Spark supports several cluster types. In my example I used local, which only requires Spark.jl itself to be installed on the machine and doesn't connect to any external resources. The most typical external cluster type is YARN, which you can connect to using:

using Spark

Spark.init()
sc = SparkContext(master="yarn-client")
...

Note that this requires your machine to be configured appropriately; in particular, Spark.jl should be able to find yarn-site.xml in the standard paths. If your machine is not pre-configured, please ask technical support how to set it up.
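
If the standard paths aren't picked up automatically, one common approach is to point the Hadoop configuration environment variables at the directory containing yarn-site.xml before initialising. The paths below are placeholders and depend on your cluster:

# Hypothetical paths -- adjust to wherever your cluster keeps its Hadoop configuration
ENV["HADOOP_CONF_DIR"] = "/etc/hadoop/conf"   # directory containing yarn-site.xml and core-site.xml
ENV["YARN_CONF_DIR"]   = "/etc/hadoop/conf"

using Spark
Spark.init()
sc = SparkContext(master="yarn-client")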

Other types of clusters are Standalone and Mesos.

@Moelf

Moelf commented Jun 13, 2019

Sorry for the noise, but is Spark.jl now considered stable/mature for common uses and for Julia 1.x in general?

@dfdx
Owner

dfdx commented Jun 13, 2019

Spark.jl is still missing a lot of the API available in Scala and Python, mostly because that API is quite large and demand from the Julia community is quite low, but adding the missing pieces is usually trivial.

The bigger problem is compatibility with Julia > 1.0: JuliaInterop/JavaCall.jl#96 is still open, which means any library based on JavaCall.jl will not run on Julia 1.1. (It may work on the upcoming Julia 1.2, though.)

If you are OK with these two points, Spark.jl should work pretty reliably.

@NDari

NDari commented Mar 7, 2020

This seems to be fixed in JavaCall 0.7.3 on Julia 1.3+, when running Julia with JULIA_COPY_STACKS set to 1.

@aviks
Collaborator

aviks commented Jun 29, 2020

Closing this. As per the JavaCall README:

On Non-Windows operating system: JavaCall and its derivatives do not work correctly on Julia 1.1 and Julia 1.2. On Julia 1.3, please set the environment variable JULIA_COPY_STACKS. On 1.1 and 1.2, and on 1.3 without JULIA_COPY_STACKS set, you may see segfaults or incorrect results. This is typically due to stack corruption. The Julia long-term-support version of 1.0.x continues to work correctly as before.
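
For reference, JULIA_COPY_STACKS is read when Julia starts, so it typically has to be set in the environment before launching Julia rather than from inside the REPL. A small sanity check you can run before loading Spark.jl (a sketch; it only emits a warning rather than a hard error):

# Warn if the JavaCall workaround isn't active in this session
if VERSION >= v"1.1" && get(ENV, "JULIA_COPY_STACKS", "") == ""
    @warn "JULIA_COPY_STACKS is not set; JavaCall-based packages such as Spark.jl may segfault"
end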
