Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SPARK-4963 [SQL] Add copy to SQL's Sample operator #3827

Closed
wants to merge 5 commits into from

Conversation

yanboliang
Copy link
Contributor

https://issues.apache.org/jira/browse/SPARK-4963
SchemaRDD.sample() return wrong results due to GapSamplingIterator operating on mutable row.
HiveTableScan make RDD with SpecificMutableRow and SchemaRDD.sample() will return GapSamplingIterator for iterating.

override def next(): T = {
val r = data.next()
advance
r
}

GapSamplingIterator.next() return the current underlying element and assigned it to r.
However if the underlying iterator is mutable row just like what HiveTableScan returned, underlying iterator and r will point to the same object.
After advance operation, we drop some underlying elments and it also changed r which is not expected. Then we return the wrong value different from initial r.

To fix this issue, the most direct way is to make HiveTableScan return mutable row with copy just like the initial commit that I have made. This solution will make HiveTableScan can not get the full advantage of reusable MutableRow, but it can make sample operation return correct result.
Further more, we need to investigate GapSamplingIterator.next() and make it can implement copy operation inside it. To achieve this, we should define every elements that RDD can store implement the function like cloneable and it will make huge change.

@yanboliang yanboliang changed the title HiveTableScan return mutable row with copy SPARK-4963 [SQL] HiveTableScan return mutable row with copy Dec 29, 2014
@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@mengxr
Copy link
Contributor

mengxr commented Dec 29, 2014

add to whitelist

@mengxr
Copy link
Contributor

mengxr commented Dec 29, 2014

ok to test

@SparkQA
Copy link

SparkQA commented Dec 29, 2014

Test build #24866 has started for PR 3827 at commit 6eaee5e.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Dec 29, 2014

Test build #24866 has finished for PR 3827 at commit 6eaee5e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24866/
Test PASSed.

@liancheng
Copy link
Contributor

Hey @yanbohappy, as I've commented in the JIRA, would you mind to do a micro benchmark using code in #758 to see whether this fix introduces noticeable performance regression?

@liancheng
Copy link
Contributor

@yanbohappy Actually, we can move the copy call to execution.Sample.execute. In this way, queries without sampling are not negatively affected.

@SparkQA
Copy link

SparkQA commented Dec 30, 2014

Test build #24888 has started for PR 3827 at commit cea7e2e.

  • This patch merges cleanly.

@yanboliang
Copy link
Contributor Author

@liancheng I agree to move the copy call to execution.Sample.execute and added new commits.
It will take no effect on HiveTableScan.

@yanboliang yanboliang changed the title SPARK-4963 [SQL] HiveTableScan return mutable row with copy SPARK-4963 [SQL] Add copy to SQL's Sample operator Dec 30, 2014
@SparkQA
Copy link

SparkQA commented Dec 30, 2014

Test build #24888 has finished for PR 3827 at commit cea7e2e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24888/
Test PASSed.

TestHive.sql("SELECT * FROM src WHERE key % 2 = 0")
.sample(withReplacement = false, fraction = 0.3)
.registerTempTable("sampled")
assert((1 to 10)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use checkAnswer instead here as it give better output when there is an exception or the answer is wrong.

(1 to 10).foreach { i =>
  checkAnswer(
    sql("SELECT * FROM sampled WHERE key % 2 = 1"),
    Seq.empty)
}

@SparkQA
Copy link

SparkQA commented Dec 31, 2014

Test build #24942 has started for PR 3827 at commit 55c7c56.

  • This patch merges cleanly.

@yanboliang
Copy link
Contributor Author

Change for better test output and move it to another test file which is more reasonable.

@SparkQA
Copy link

SparkQA commented Dec 31, 2014

Test build #24942 has finished for PR 3827 at commit 55c7c56.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24942/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Dec 31, 2014

Test build #24947 has started for PR 3827 at commit 65c4e7c.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Dec 31, 2014

Test build #24947 has finished for PR 3827 at commit 65c4e7c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24947/
Test PASSed.

@yanboliang
Copy link
Contributor Author

Can anyone verify and merge this patch? It's a bug appeared frequently and fix it asap will be better. @marmbrus

sql("SELECT * FROM src WHERE key % 2 = 0")
.sample(withReplacement = false, fraction = 0.3)
.registerTempTable("sampled")
(1 to 10).foreach{ i =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: space before {.

@liancheng
Copy link
Contributor

Sorry for the late response. This LGTM except a minor styling issue. Thanks!

@SparkQA
Copy link

SparkQA commented Jan 9, 2015

Test build #25293 has started for PR 3827 at commit 0912ca0.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Jan 9, 2015

Test build #25293 has finished for PR 3827 at commit 0912ca0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25293/
Test PASSed.

sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
val location =
Utils.getSparkClassLoader.getResource("data/files/kv1.txt").getFile()
sql(s"LOAD DATA LOCAL INPATH '$location' INTO TABLE src")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to create the src table. Our test harness does that automatically whenever the test tables are referenced. I can remove this when merging.

@marmbrus
Copy link
Contributor

Thanks! I've merged this to master.

@asfgit asfgit closed this in 77106df Jan 10, 2015
@yanboliang yanboliang deleted the spark-4963 branch February 19, 2015 14:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants