
[SPARK-12604] [CORE] Java count(AprroxDistinct)ByKey methods return Scala Long not Java #10554

Closed · wants to merge 1 commit

Conversation

@srowen (Member) commented Jan 2, 2016

Change Java countByKey and countApproxDistinctByKey return types to use Java Long, not Scala Long; update similar methods for consistency on java.lang.Long.valueOf with no API change

@SparkQA commented Jan 2, 2016

Test build #2297 has finished for PR 10554 at commit 1a27421.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin (Contributor) commented Jan 4, 2016

Does this actually change anything w.r.t. the bytecode signature?

@srowen (Author) commented Jan 4, 2016

I don't think it does. It may cause source incompatibility, though, since the generic type changes. The diff below shows an example of the kind of fix a caller may need to make.

@@ -848,6 +848,6 @@ object JavaPairDStream {

   def scalaToJavaLong[K: ClassTag](dstream: JavaPairDStream[K, Long])
     : JavaPairDStream[K, JLong] = {
-    DStream.toPairDStreamFunctions(dstream.dstream).mapValues(new JLong(_))
+    DStream.toPairDStreamFunctions(dstream.dstream).mapValues(JLong.valueOf)
@srowen (Author) commented on the diff:

Changed for consistency; valueOf is slightly more efficient than the constructor since the JDK actually caches Long objects for small values.
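A minimal sketch of that caching behavior (the JDK specifies a cache for values in -128..127; the object name here is illustrative, not from the PR):

```scala
// Long.valueOf returns a cached instance for small values, while the
// constructor always allocates a fresh object.
object LongCacheDemo {
  def main(args: Array[String]): Unit = {
    val a = java.lang.Long.valueOf(42L)
    val b = java.lang.Long.valueOf(42L)
    println(a eq b) // true: same cached instance
  }
}
```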

A contributor replied:

+1

@rxin (Contributor) commented Jan 4, 2016

@srowen If this doesn't change any signature, I think it actually makes things slower by adding another layer of iterators. Maybe it'd make more sense to just change the signature rather than doing actual "conversions".

@srowen (Author) commented Jan 4, 2016

It doesn't compile that way though since the values are Scala Longs and the signature says Java Longs. One way or the other, such a conversion has to happen somewhere. If it works for anyone it's because they're already doing this in the calling code.

@rxin (Contributor) commented Jan 4, 2016

One thing I don't understand: how can these methods return Scala types at the bytecode level? A Scala Long is just a primitive long. All the places you found are generics, which cannot be primitive types, so I assume they are all Java boxed types?
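A minimal sketch of the boxing being described (the object name is illustrative): in a generic position a scala.Long cannot stay primitive, so the runtime value really is a java.lang.Long.

```scala
object BoxingDemo {
  def main(args: Array[String]): Unit = {
    // Long appears in a generic position here, so the compiler boxes it.
    val m: Map[String, Long] = Map("a" -> 1L)
    val v: Any = m("a")
    println(v.getClass.getName) // java.lang.Long
  }
}
```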

@srowen (Author) commented Jan 4, 2016

Yeah, that's a good point -- let me see if I can understand it better. At some level it has to be an Object, not a long, in order to be part of a java.util.Map, so something is transforming it. It may be that implicits convert it to a java.lang.Long in the end, maybe via RichLong, which would be pretty good news -- this change would then just fix the signature, make an implicit conversion explicit, and maybe even make things more efficient. Let me experiment a bit.

@srowen (Author) commented Jan 4, 2016

@rxin so I tried unpacking this code in IntelliJ:

val m = new java.util.HashMap[String, java.lang.Long]()
val l = 1L
m.put("foo", l)

... and it tells me that the implicit conversion that is applied here is Predef.long2Long:

implicit def long2Long(x: Long) = java.lang.Long.valueOf(x)

It looks like this is the same underlying transformation that happens in SerializableMapWrapper and mapAsSerializableJavaMap already. So it looks to me like this is already the implicit transformation being applied anyway.

Right now the returned type of these methods in Java is Map<K, Object>, and the wrapper does the conversion on the fly in get(), I believe. I think making it a real Map<K, Long> requires doing the transformation up front -- short of writing some other special-purpose wrapper.

WDYT?

@srowen (Author) commented Jan 4, 2016

... or I suppose the implementation can just cast to java.util.Map[K, java.lang.Long] since that's safe. It's less change, and mimics what callers are doing anyway in Java.
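A minimal sketch of that cast (names are illustrative, not the actual Spark code): the boxed values already are java.lang.Long at runtime, so the unchecked cast changes only the static type, not the data.

```scala
object CastDemo {
  def main(args: Array[String]): Unit = {
    val m = new java.util.HashMap[String, AnyRef]()
    m.put("a", java.lang.Long.valueOf(1L))
    // Safe at runtime: every value in m really is a java.lang.Long.
    val typed = m.asInstanceOf[java.util.Map[String, java.lang.Long]]
    println(typed.get("a")) // 1
  }
}
```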

@rxin (Contributor) commented Jan 4, 2016

Yea your latest suggestion sounds great (a very tiny change).

@srowen (Author) commented Jan 5, 2016

@rxin OK, I almost did that. I realized that JavaRDD.countByValue already does a mapValues, so I left countByKey acting the same way, doing the mapping. Other methods that return JavaPairRDD went back to doing a cast. (Similar methods in the streaming API actually also do a mapValues.) I also tried to use JLong vs. jl.Long consistently within each file without going overboard; both of these files use nearly every different form. I think the result is more consistent internally, but WDYT? I'm neutral, and would change things further, keeping them consistent, if anyone has a preference.

@SparkQA commented Jan 5, 2016

Test build #48759 has finished for PR 10554 at commit 39fa6e7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -288,17 +288,18 @@ class JavaPairRDD[K, V](val rdd: RDD[(K, V)])
* immediately to the master as a Map. This will also perform the merging locally on each mapper
* before sending results to a reducer, similarly to a "combiner" in MapReduce.
*/
def reduceByKeyLocally(func: JFunction2[V, V, V]): java.util.Map[K, V] =
A contributor commented on the diff:

fwiw, i actually think spelling out java.util.Map is more clear ...

@SparkQA commented Jan 6, 2016

Test build #48847 has finished for PR 10554 at commit 293b5e4.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 6, 2016

Test build #2335 has finished for PR 10554 at commit 293b5e4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin (Contributor) commented Jan 7, 2016

I'm going to merge this. Thanks Sean.

@asfgit asfgit closed this in ac56cf6 Jan 7, 2016
@@ -448,7 +448,7 @@ trait JavaRDDLike[T, This <: JavaRDDLike[T, This]] extends Serializable {
   * combine step happens locally on the master, equivalent to running a single reduce task.
   */
  def countByValue(): java.util.Map[T, jl.Long] =
-    mapAsSerializableJavaMap(rdd.countByValue().map((x => (x._1, new jl.Long(x._2)))))
+    mapAsSerializableJavaMap(rdd.countByValue().mapValues(jl.Long.valueOf))
A contributor commented on the diff:

@srowen how come there is still a mapValues here?

@srowen (Author) replied:

See my reply at #10554 (comment) -- it was because I then saw this was how countByKey had been implemented. I picked consistency, but it's as reasonable to change countByKey I suppose. Do you feel strongly about it one way or the other?

@srowen srowen deleted the SPARK-12604 branch January 7, 2016 09:51