
[SPARK-14063][SQL] SQLContext.range should return Dataset[java.lang.Long] #11880

Closed
wants to merge 1 commit

Conversation

rxin
Contributor

rxin commented Mar 22, 2016

What changes were proposed in this pull request?

This patch changed the return type for SQLContext.range from Dataset[Long] (Scala primitive) to Dataset[java.lang.Long] (Java boxed long).

Previously, SPARK-13894 changed the return type of range from Dataset[Row] to Dataset[Long]. The problem is that, due to https://issues.scala-lang.org/browse/SI-4388, Scala compiles primitive types in generics down to just Object, i.e. at the bytecode level range now simply returns Dataset[Object]. This is really bad for Java users because they lose type safety and also need to add a type cast every time they use range.
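
To illustrate the erasure behaviour described above, here is a minimal plain-Scala sketch (no Spark involved; Box and ErasureDemo are made-up names for this example): a Scala primitive used as a type argument disappears from the emitted generic signature, which is all that Java callers can see.

  class Box[T](val value: T)

  object ErasureDemo {
    def longBox(): Box[Long] = new Box(42L)   // source type: Box[Long]

    def main(args: Array[String]): Unit = {
      val m = getClass.getMethod("longBox")
      // Prints something like "Box<java.lang.Object>": the Long is gone, which is
      // why the pre-patch range looked like Dataset<Object> to Java callers.
      println(m.getGenericReturnType)
    }
  }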

I talked to Jason Zaugg from Lightbend (Typesafe), who suggested that the best approach is to return Dataset[java.lang.Long]. The downside is that when Scala users want to explicitly type a closure used on the dataset returned by range, they need to use java.lang.Long instead of Scala's Long.
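
As a minimal sketch of that downside (assuming a local Spark setup of that era; the object and app names are made up), an explicitly typed closure over the result of range now has to name the boxed type:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext

  object RangeClosureDemo {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("range-demo").setMaster("local[2]"))
      val sqlContext = new SQLContext(sc)
      import sqlContext.implicits._

      // range now yields Dataset[java.lang.Long], so the parameter type below must be
      // java.lang.Long rather than scala.Long; Predef unboxes it for the arithmetic.
      val doubled = sqlContext.range(5).map((n: java.lang.Long) => n * 2L)
      doubled.show()

      sc.stop()
    }
  }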

How was this patch tested?

The signature change should be covered by existing unit tests and API tests. I also added a new test case in DatasetSuite for range.

@cloud-fan
Contributor

LGTM

@@ -354,7 +354,7 @@ public void testCountMinSketch() {

@Test
public void testBloomFilter() {
Dataset df = context.range(1000);
Contributor

Why does this code pass compile? Is it because we only turn scalac warnings into errors, but not javac warnings?

Contributor Author

Yes

@SparkQA

SparkQA commented Mar 22, 2016

Test build #53754 has finished for PR 11880 at commit 8272a1f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Mar 22, 2016

Good catch of course. I grepped around for similar issues and noticed this in JavaDoubleRDD:

  def histogram(bucketCount: Int): (Array[scala.Double], Array[Long]) = {
    val result = srdd.histogram(bucketCount)
    (result._1, result._2)
  }

  def histogram(buckets: Array[scala.Double]): Array[Long] = {
    srdd.histogram(buckets, false)
  }

  def histogram(buckets: Array[JDouble], evenBuckets: Boolean): Array[Long] = {
    srdd.histogram(buckets.map(_.toDouble), evenBuckets)
  }

For a second I thought it was the same problem, but I suppose an Array[scala.Long] is actually long[] and so is already fine in Java, unlike the example you're fixing.
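
For reference, a small plain-Scala sketch (not from the PR; ArrayVsGeneric is a made-up name) of that distinction: a primitive element type in an Array survives erasure as a JVM primitive array, while the same primitive as a generic type argument erases to Object.

  object ArrayVsGeneric {
    def counts(): Array[Long] = Array(1L, 2L, 3L)   // Java sees long[]
    def ids(): Seq[Long]      = Seq(1L, 2L, 3L)     // Java sees Seq<Object>

    def main(args: Array[String]): Unit = {
      println(getClass.getMethod("counts").getReturnType)       // e.g. class [J
      println(getClass.getMethod("ids").getGenericReturnType)   // e.g. scala.collection.Seq<java.lang.Object>
    }
  }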

@rxin
Contributor Author

rxin commented Mar 22, 2016

@srowen yea I think arrays are fine since those are not generics.

Merging in master.

asfgit closed this in 297c202 on Mar 22, 2016