no method for reduce_by_key(::Spark.PipelinedRDD, ::Function) #51

Closed · ExpandingMan opened this issue Nov 15, 2017 · 7 comments

Comments

@ExpandingMan commented Nov 15, 2017

I am going through the Spark documentation examples here and trying to reproduce all of them. For the very first one

using Spark
Spark.init()
samplefile = ...

sc = SparkContext(master="local")
txt = text_file(sc, samplefile)
rdd = flat_map(txt, line -> split(line))
rdd = map(rdd, word -> (word, 1))
rdd = reduce_by_key(rdd, +)
o = collect(rdd)
close(sc)

one gets

ERROR: LoadError: MethodError: no method matching reduce_by_key(::Spark.PipelinedRDD, ::Base.#+)
Closest candidates are:
  reduce_by_key(::Spark.PairRDD, ::Function) at /home/savastio/.julia/v0.6/Spark/src/rdd.jl:359
Stacktrace:
 [1] test1() at /home/savastio/src/test_sparkjl.jl:11
 [2] include_from_node1(::String) at ./loading.jl:576
 [3] include(::String) at ./sysimg.jl:14
while loading /home/expandingman/src/test_sparkjl.jl, in expression starting on line 17

My guess is that either these method signatures are unnecessarily restrictive (in which case some associated methods would also have to be changed), or one of these functions is returning a PipelinedRDD when it should be returning a PipelinedPairRDD.

@dfdx (Owner) commented Nov 16, 2017

Don't expect Scala or Python examples to be reproducible line by line. Scala, for example, uses an implicit conversion from RDD[Pair] to PairRDD, which are two different types, while Python uses duck typing to make these two types look identical. In Julia you need to convert between them explicitly, which in my opinion is the right behavior. I'm not sure we have a method for it right now, so I'll check tomorrow.

@dfdx (Owner) commented Nov 16, 2017

Try this:

using Spark
Spark.init()
samplefile = ...

sc = SparkContext(master="local")
txt = text_file(sc, samplefile)
rdd = flat_map(txt, line -> split(line))
rdd = map_pair(rdd, word -> (word, 1))    # <-- this line has changed
rdd = reduce_by_key(rdd, +)
o = collect(rdd)         
close(sc)           

map_pair creates a PipelinedPairRDD (instead of a PipelinedRDD), which is suitable for functions like reduce_by_key(). Note that there are also map_partitions_pair and flat_map_pair.
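For instance, the word count above could produce pairs in a single step with flat_map_pair (a hypothetical sketch, assuming it takes the same (rdd, function) arguments as flat_map and that the function returns an iterable of (key, value) tuples; this is not taken from Spark.jl's docs):

using Spark
Spark.init()
samplefile = ...

sc = SparkContext(master="local")
txt = text_file(sc, samplefile)
# flat_map_pair should yield a PipelinedPairRDD directly: one (word, 1) pair per token
rdd = flat_map_pair(txt, line -> [(word, 1) for word in split(line)])
rdd = reduce_by_key(rdd, +)
o = collect(rdd)
close(sc)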

Let me know if there are any other issues with Spark examples.

dfdx closed this as completed Nov 16, 2017
@dfdx (Owner) commented Nov 16, 2017

By the way, thanks for raising the issue! I think many people come across the project, try it out without success, and silently go away thinking it's not working. I'll add a couple of examples from that page to Spark.jl's docs.

@ExpandingMan (Author)

That seems to work fine, thanks. Of course, I think it would be extremely useful to have some documentation that explains the differences from the Scala and Java APIs in such a way that it would be a bit clearer what to do in cases like this. When I finish creating Julia versions of all the examples here, it might be helpful to post them in the documentation, so that people can see Julia code equivalent to well-known examples in the other three languages.

I'm not sure the remaining examples can even be done yet, as they require some dataframes functionality. I've started looking through the source code and made a fork in the hope of adding some of what's missing, but going through the Spark API docs is pretty painful (as I'm sure you know), and I'm not a Spark expert. We'll see how far I get with this.

@dfdx (Owner) commented Nov 16, 2017

> Of course, I think it would be extremely useful to have some documentation that explains the differences from the Scala and Java APIs in such a way that it would be a bit clearer what to do in cases like this.

I don't really think there are many "cases like this". The Spark API uses different tricks in different languages. I always look at the Java API, since that is what we actually call under the hood, and it contains almost no hidden operations (unlike the implicit conversions in Scala, for example). But the Java API is not very useful for end users, so we need to provide independent and easily discoverable examples.

> going through the Spark API docs is pretty painful (as I'm sure you know), and I'm not a Spark expert.

Don't hesitate to ask questions! I spent quite a lot of time both writing Spark programs in Scala and Python and wrapping them in Julia, so I'll be glad to help whenever possible. On my side, I'm going to implement a couple of important functions from that page today or tomorrow to make things easier.

@ExpandingMan (Author)

That sounds great, thanks. In my limited experience, one of the best uses of Spark is as a sane replacement for SQL, so I would consider dataframe operations such as groupby and join to be quite valuable (though I realize that under the hood it's probably just reduce_by_key and the like).

@dfdx (Owner) commented Nov 16, 2017

group_by and join are easy. In general, anything involving only fixed operations on Java types is as simple as calling a few API methods. For example, here's the whole implementation of join over a single column:

function join(left::Dataset, right::Dataset, col_name)
    # delegate to the Java-side Dataset.join(Dataset, String) via JavaCall's jcall
    jdf = jcall(left.jdf, "join", JDataset, (JDataset, JString), right.jdf, col_name)
    return Dataset(jdf)
end
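Usage would then look something like this (a hedged sketch: the people and orders Datasets and their shared "id" column are hypothetical, and loading them is outside the scope of the snippet):

# assumes `people` and `orders` are Spark.jl Datasets that both contain an "id" column
joined = join(people, orders, "id")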

group_by introduces one more of Spark's data types, so it requires a bit more code, but it is conceptually just as easy. I've put simple implementations of both in #52. Unfortunately, I don't have much time to spend on Spark.jl right now, but if you need something else and have trouble figuring out how to add it, feel free to ping me.
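To give a flavor of the extra data type involved, here is what such a wrapper might look like (a hypothetical sketch of the same jcall pattern, not the actual code from #52; JRelationalGroupedDataset, GroupedDataset, and the exact signature are assumptions):

using JavaCall

# Java return type of Dataset.groupBy(String col, String... cols)
const JRelationalGroupedDataset =
    JavaObject{Symbol("org.apache.spark.sql.RelationalGroupedDataset")}

function group_by(ds::Dataset, col_name::String)
    # varargs on the Java side are passed as an (empty) array of JString
    jgd = jcall(ds.jdf, "groupBy", JRelationalGroupedDataset,
                (JString, Vector{JString}), col_name, String[])
    return GroupedDataset(jgd)   # hypothetical wrapper type around the Java object
end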
