SPARK-4963 [SQL] Add copy to SQL's Sample operator #3827

yanboliang · 2014-12-29T09:43:24Z

https://issues.apache.org/jira/browse/SPARK-4963
SchemaRDD.sample() returns wrong results because GapSamplingIterator operates on mutable rows.
HiveTableScan builds an RDD of SpecificMutableRow, and SchemaRDD.sample() returns a GapSamplingIterator to iterate over it.

override def next(): T = {
  val r = data.next()
  advance
  r
}

GapSamplingIterator.next() takes the current underlying element and assigns it to r.
However, if the underlying iterator yields a mutable row, as HiveTableScan does, the underlying iterator and r point to the same object.
The advance operation then drops some underlying elements, which also changes r unexpectedly, so the value we return differs from the initial r.
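
To make the aliasing concrete, here is a minimal, self-contained sketch. `MutableRow`, `reusingScan` and `SkippingIterator` are hypothetical stand-ins, not Spark classes, and a fixed skip of one element replaces the real random gap; only the shape of next() matches the snippet above.

```scala
object AliasingDemo {
  // Hypothetical stand-in for a reusable mutable row (not Spark's SpecificMutableRow).
  final class MutableRow(var key: Int) {
    def copy(): MutableRow = new MutableRow(key)
  }

  // Mimics a scan that reuses ONE row object and overwrites it for every record.
  def reusingScan(keys: Seq[Int]): Iterator[MutableRow] = {
    val shared = new MutableRow(0)
    keys.iterator.map { k => shared.key = k; shared }
  }

  // Simplified sampler: return the current element, then skip the next one.
  final class SkippingIterator[T](data: Iterator[T]) extends Iterator[T] {
    override def hasNext: Boolean = data.hasNext
    override def next(): T = {
      val r = data.next()
      if (data.hasNext) data.next() // "advance": overwrites the shared row, and so r
      r
    }
  }

  val keys = Seq(0, 1, 2, 3, 4, 5)

  // Without a copy, every returned row shows the value of the skipped record.
  val aliased: List[Int] =
    new SkippingIterator(reusingScan(keys)).map(_.key).toList

  // Copying each row before it reaches the sampler keeps the expected values.
  val copied: List[Int] =
    new SkippingIterator(reusingScan(keys).map(_.copy())).map(_.key).toList

  def main(args: Array[String]): Unit = {
    println(aliased) // List(1, 3, 5)
    println(copied)  // List(0, 2, 4)
  }
}
```

The sampler is meant to return keys 0, 2 and 4, but the aliased run reports the keys of the rows it skipped.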

To fix this issue, the most direct way is to make HiveTableScan return a copy of each mutable row, as in my initial commit. With this solution HiveTableScan cannot take full advantage of the reusable MutableRow, but the sample operation returns correct results.
Furthermore, we need to investigate GapSamplingIterator.next() and see whether it can perform the copy itself. To achieve this, every element type that an RDD can store would have to implement something like a cloneable interface, which would be a huge change.

AmplabJenkins · 2014-12-29T09:47:10Z

Can one of the admins verify this patch?

mengxr · 2014-12-29T19:40:08Z

add to whitelist

mengxr · 2014-12-29T19:40:13Z

ok to test

SparkQA · 2014-12-29T19:42:32Z

Test build #24866 has started for PR 3827 at commit 6eaee5e.

This patch merges cleanly.

SparkQA · 2014-12-29T20:52:37Z

Test build #24866 has finished for PR 3827 at commit 6eaee5e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-12-29T20:52:40Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24866/
Test PASSed.

liancheng · 2014-12-30T04:34:21Z

Hey @yanbohappy, as I've commented in the JIRA, would you mind running a micro benchmark using the code in #758 to see whether this fix introduces a noticeable performance regression?

liancheng · 2014-12-30T04:43:16Z

@yanbohappy Actually, we can move the copy call to execution.Sample.execute. In this way, queries without sampling are not negatively affected.
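
A rough sketch of that placement, under stated assumptions: `MutableRow`, `scan` and `sample` below are hypothetical and are not Spark's `HiveTableScan` or `execution.Sample`, and a seeded Bernoulli filter stands in for the real sampler. The point is only that the scan keeps reusing its row while the sampling path pays for the copy.

```scala
import scala.util.Random

object SampleCopyDemo {
  // Hypothetical row type standing in for a reusable mutable row.
  final class MutableRow(var key: Int) {
    def copy(): MutableRow = new MutableRow(key)
  }

  // The scan reuses one row object and never copies, so plain scans stay cheap.
  def scan(keys: Seq[Int]): Iterator[MutableRow] = {
    val shared = new MutableRow(0)
    keys.iterator.map { k => shared.key = k; shared }
  }

  // The sampling operator copies each row first, then keeps roughly `fraction` of them.
  def sample(rows: Iterator[MutableRow], fraction: Double, seed: Long): List[MutableRow] = {
    val rng = new Random(seed)
    rows.map(_.copy()).filter(_ => rng.nextDouble() < fraction).toList
  }

  def main(args: Array[String]): Unit = {
    // Retaining rows from the bare scan exposes the reuse: all entries are one object.
    println(scan(Seq(0, 2, 4)).toList.map(_.key)) // List(4, 4, 4)

    // With fraction = 1.0 every row is kept, and the copies preserve each key.
    println(sample(scan(Seq(0, 2, 4)), fraction = 1.0, seed = 1L).map(_.key)) // List(0, 2, 4)
  }
}
```

Consumers that process each scanned row immediately never see the reuse, which is why the copy only needs to live where rows are held across calls to the underlying iterator.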

SparkQA · 2014-12-30T08:12:36Z

Test build #24888 has started for PR 3827 at commit cea7e2e.

This patch merges cleanly.

yanboliang · 2014-12-30T08:37:39Z

@liancheng I agree with moving the copy call to execution.Sample.execute and have added new commits.
HiveTableScan is no longer affected.

SparkQA · 2014-12-30T09:25:09Z

Test build #24888 has finished for PR 3827 at commit cea7e2e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-12-30T09:25:12Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24888/
Test PASSed.

marmbrus · 2014-12-30T19:21:15Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveTableScanSuite.scala

+    TestHive.sql("SELECT * FROM src WHERE key % 2 = 0")
+      .sample(withReplacement = false, fraction = 0.3)
+      .registerTempTable("sampled")
+    assert((1 to 10)


Use checkAnswer here instead, as it gives better output when there is an exception or the answer is wrong.

(1 to 10).foreach { i =>
  checkAnswer(sql("SELECT * FROM sampled WHERE key % 2 = 1"), Seq.empty)
}

SparkQA · 2014-12-31T06:02:30Z

Test build #24942 has started for PR 3827 at commit 55c7c56.

This patch merges cleanly.

yanboliang · 2014-12-31T06:08:08Z

Changed the test for better output and moved it to a more appropriate test file.

SparkQA · 2014-12-31T06:15:36Z

Test build #24942 has finished for PR 3827 at commit 55c7c56.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-12-31T06:15:38Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24942/
Test FAILed.

SparkQA · 2014-12-31T06:52:33Z

Test build #24947 has started for PR 3827 at commit 65c4e7c.

This patch merges cleanly.

SparkQA · 2014-12-31T08:05:24Z

Test build #24947 has finished for PR 3827 at commit 65c4e7c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-12-31T08:05:27Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24947/
Test PASSed.

yanboliang · 2015-01-08T10:17:03Z

Can anyone verify and merge this patch? This bug appears frequently, so it would be better to fix it as soon as possible. @marmbrus

liancheng · 2015-01-08T12:21:34Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala

+    sql("SELECT * FROM src WHERE key % 2 = 0")
+      .sample(withReplacement = false, fraction = 0.3)
+      .registerTempTable("sampled")
+    (1 to 10).foreach{ i =>


Nit: space before {.

liancheng · 2015-01-08T12:22:14Z

Sorry for the late response. This LGTM except for a minor styling issue. Thanks!

SparkQA · 2015-01-09T03:22:40Z

Test build #25293 has started for PR 3827 at commit 0912ca0.

This patch merges cleanly.

SparkQA · 2015-01-09T04:31:12Z

Test build #25293 has finished for PR 3827 at commit 0912ca0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-01-09T04:31:15Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25293/
Test PASSed.

marmbrus · 2015-01-10T22:13:09Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala

+    sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
+    val location =
+      Utils.getSparkClassLoader.getResource("data/files/kv1.txt").getFile()
+    sql(s"LOAD DATA LOCAL INPATH '$location' INTO TABLE src")


You don't need to create the src table. Our test harness does that automatically whenever the test tables are referenced. I can remove this when merging.

marmbrus · 2015-01-10T22:20:17Z

Thanks! I've merged this to master.

yanboliang changed the title ~~HiveTableScan return mutable row with copy~~ SPARK-4963 [SQL] HiveTableScan return mutable row with copy Dec 29, 2014

Yanbo Liang added 2 commits December 30, 2014 16:03

HiveTableScan return mutable row with copy

e840829

SchemaRDD add copy operation before Sample operator

cea7e2e

yanboliang force-pushed the spark-4963 branch from 6eaee5e to cea7e2e Compare December 30, 2014 08:10

yanboliang changed the title ~~SPARK-4963 [SQL] HiveTableScan return mutable row with copy~~ SPARK-4963 [SQL] Add copy to SQL's Sample operator Dec 30, 2014

marmbrus reviewed Dec 30, 2014

better output of test case

55c7c56

import file and clear annotation

65c4e7c

liancheng reviewed Jan 8, 2015

code format keep

0912ca0

marmbrus reviewed Jan 10, 2015

asfgit closed this in 77106df Jan 10, 2015

yanboliang deleted the spark-4963 branch February 19, 2015 14:23
