no method for reduce_by_key(::Spark.PipelinedRDD, ::Function) #51

Closed · ExpandingMan opened this issue Nov 15, 2017 · 7 comments

Comments

@ExpandingMan commented Nov 15, 2017

I am going through the Spark documentation examples here and trying to reproduce all of them. For the very first one

using Spark
Spark.init()
samplefile = ...

sc = SparkContext(master="local")
txt = text_file(sc, samplefile)
rdd = flat_map(txt, line -> split(line))
rdd = map(rdd, word -> (word, 1))
rdd = reduce_by_key(rdd, +)
o = collect(rdd)
close(sc)

one gets

ERROR: LoadError: MethodError: no method matching reduce_by_key(::Spark.PipelinedRDD, ::Base.#+)
Closest candidates are:
  reduce_by_key(::Spark.PairRDD, ::Function) at /home/savastio/.julia/v0.6/Spark/src/rdd.jl:359
Stacktrace:
 [1] test1() at /home/savastio/src/test_sparkjl.jl:11
 [2] include_from_node1(::String) at ./loading.jl:576
 [3] include(::String) at ./sysimg.jl:14
while loading /home/expandingman/src/test_sparkjl.jl, in expression starting on line 17

My guess is that either these method signatures are unnecessarily restrictive (in which case some associated methods would also have to be changed), or one of these functions is returning a PipelinedRDD when it should be returning a PipelinedPairRDD.

@dfdx (Owner) commented Nov 16, 2017

Don't expect Scala or Python examples to be reproducible line by line. Scala, for example, uses an implicit conversion from RDD[Pair] to PairRDD, which are two different types, while Python uses duck typing to make these two types look identical. In Julia you need to convert between them explicitly, which in my opinion is the right behavior. I'm not sure we have a method for it right now, so I'll check tomorrow.

@dfdx (Owner) commented Nov 16, 2017

Try this:

using Spark
Spark.init()
samplefile = ...

sc = SparkContext(master="local")
txt = text_file(sc, samplefile)
rdd = flat_map(txt, line -> split(line))
rdd = map_pair(rdd, word -> (word, 1))    # <-- this line has changed
rdd = reduce_by_key(rdd, +)
o = collect(rdd)         
close(sc)           

map_pair creates a PipelinedPairRDD (instead of a PipelinedRDD), which is suitable for functions like reduce_by_key(). Note that there are also map_partitions_pair and flat_map_pair.
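For instance, the word count above could produce pairs in a single step with flat_map_pair (a hypothetical sketch, assuming it takes the same (rdd, function) arguments as flat_map and that the function returns an iterable of (key, value) tuples; this is not taken from Spark.jl's docs):

using Spark
Spark.init()
samplefile = ...

sc = SparkContext(master="local")
txt = text_file(sc, samplefile)
# flat_map_pair should yield a PipelinedPairRDD directly: one (word, 1) pair per token
rdd = flat_map_pair(txt, line -> [(word, 1) for word in split(line)])
rdd = reduce_by_key(rdd, +)
o = collect(rdd)
close(sc)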

Let me know if there are any other issues with Spark examples.

dfdx closed this as completed Nov 16, 2017
@dfdx (Owner) commented Nov 16, 2017

By the way, thanks for raising the issue! I think many people come across the project, try it out without success, and silently go away thinking it's not working. I'll add a couple of examples from that page to Spark.jl's docs.

@ExpandingMan (Author)

That seems to work fine, thanks. Of course, I think it would be extremely useful to have some documentation that explains the differences from the Scala and Java APIs in such a way that it would be a bit clearer what to do in cases like this. When I finish creating Julia versions of all the examples here, it might be helpful to post them in the documentation, so that people can see Julia code equivalent to well-known examples in the other three languages.

I'm not sure the remaining examples can even be done yet, as they require some dataframes functionality. I've started looking through the source code and made a fork in the hope of adding some of what's missing, but going through the Spark API docs is pretty painful (as I'm sure you know), and I'm not a Spark expert. We'll see how far I get with this.

@dfdx (Owner) commented Nov 16, 2017

> Of course, I think it would be extremely useful to have some documentation that explains the differences from the Scala and Java APIs in such a way that it would be a bit clearer what to do in cases like this.

I don't really think there are many "cases like this". The Spark API uses different tricks in different languages. I always look at the Java API, since that is what we actually call under the hood, and it contains almost no hidden operations (unlike the implicit conversions in Scala, for example). But the Java API is not very useful for end users, so we need to provide independent and easily discoverable examples.

> going through the Spark API docs is pretty painful (as I'm sure you know), and I'm not a Spark expert.

Don't hesitate to ask questions! I spent quite a lot of time both writing Spark programs in Scala and Python and wrapping them in Julia, so I'll be glad to help whenever possible. On my side, I'm going to implement a couple of important functions from that page today or tomorrow to make things easier.

@ExpandingMan (Author)

That sounds great, thanks. In my limited experience, one of the best uses of Spark is as a sane replacement for SQL, so I would consider dataframe operations such as groupby and join to be quite valuable (though I realize that under the hood it's probably just reduce_by_key and the like).

@dfdx (Owner) commented Nov 16, 2017

group_by and join are easy. In general, anything involving only fixed operations on Java types is as simple as calling a few API methods. For example, here's the whole implementation of join over a single column:

function join(left::Dataset, right::Dataset, col_name)
    # delegate to the Java-side Dataset.join(Dataset, String) via JavaCall's jcall
    jdf = jcall(left.jdf, "join", JDataset, (JDataset, JString), right.jdf, col_name)
    return Dataset(jdf)
end
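Usage would then look something like this (a hedged sketch: the people and orders Datasets and their shared "id" column are hypothetical, and loading them is outside the scope of the snippet):

# assumes `people` and `orders` are Spark.jl Datasets that both contain an "id" column
joined = join(people, orders, "id")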

group_by introduces one more of Spark's data types, so it requires a bit more code, but it is conceptually just as easy. I've put simple implementations of both in #52. Unfortunately, I don't have much time to spend on Spark.jl right now, but if you need something else and have trouble figuring out how to add it, feel free to ping me.
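To give a flavor of the extra data type involved, here is what such a wrapper might look like (a hypothetical sketch of the same jcall pattern, not the actual code from #52; JRelationalGroupedDataset, GroupedDataset, and the exact signature are assumptions):

using JavaCall

# Java return type of Dataset.groupBy(String col, String... cols)
const JRelationalGroupedDataset =
    JavaObject{Symbol("org.apache.spark.sql.RelationalGroupedDataset")}

function group_by(ds::Dataset, col_name::String)
    # varargs on the Java side are passed as an (empty) array of JString
    jgd = jcall(ds.jdf, "groupBy", JRelationalGroupedDataset,
                (JString, Vector{JString}), col_name, String[])
    return GroupedDataset(jgd)   # hypothetical wrapper type around the Java object
end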
