SPARK-4963 [SQL] Add copy to SQL's Sample operator #3827
Can one of the admins verify this patch?
add to whitelist
ok to test
Test build #24866 has started for PR 3827 at commit
Test build #24866 has finished for PR 3827 at commit
Test PASSed.
Hey @yanbohappy, as I commented in the JIRA, would you mind running a micro benchmark using the code in #758 to see whether this fix introduces a noticeable performance regression?
@yanbohappy Actually, we can move the |
6eaee5e to cea7e2e
Test build #24888 has started for PR 3827 at commit
@liancheng I agree with moving the copy call to execution.Sample.execute and have added new commits.
Test build #24888 has finished for PR 3827 at commit
Test PASSed.
TestHive.sql("SELECT * FROM src WHERE key % 2 = 0")
  .sample(withReplacement = false, fraction = 0.3)
  .registerTempTable("sampled")
assert((1 to 10)
Use checkAnswer instead here, as it gives better output when there is an exception or the answer is wrong.
(1 to 10).foreach { i =>
  checkAnswer(
    sql("SELECT * FROM sampled WHERE key % 2 = 1"),
    Seq.empty)
}
Test build #24942 has started for PR 3827 at commit
Changed the test for better output and moved it to a more appropriate test file.
Test build #24942 has finished for PR 3827 at commit
Test FAILed.
Test build #24947 has started for PR 3827 at commit
Test build #24947 has finished for PR 3827 at commit
Test PASSed.
Can anyone verify and merge this patch? This bug appears frequently, so it would be better to fix it ASAP. @marmbrus
sql("SELECT * FROM src WHERE key % 2 = 0")
  .sample(withReplacement = false, fraction = 0.3)
  .registerTempTable("sampled")
(1 to 10).foreach{ i =>
Nit: space before {.
Sorry for the late response. This LGTM except for a minor styling issue. Thanks!
Test build #25293 has started for PR 3827 at commit
Test build #25293 has finished for PR 3827 at commit
Test PASSed.
sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
val location =
  Utils.getSparkClassLoader.getResource("data/files/kv1.txt").getFile()
sql(s"LOAD DATA LOCAL INPATH '$location' INTO TABLE src")
You don't need to create the src table. Our test harness does that automatically whenever the test tables are referenced. I can remove this when merging.
Thanks! I've merged this to master.
https://issues.apache.org/jira/browse/SPARK-4963
SchemaRDD.sample() returns wrong results because GapSamplingIterator operates on mutable rows.
HiveTableScan produces an RDD of SpecificMutableRow, and SchemaRDD.sample() returns a GapSamplingIterator to iterate over it.
override def next(): T = {
  val r = data.next()
  advance
  r
}
GapSamplingIterator.next() takes the current element from the underlying iterator and assigns it to r.
However, if the underlying iterator yields a reused mutable row, as HiveTableScan does, the iterator's current row and r point to the same object.
The advance call then skips some underlying elements, which overwrites that shared row and therefore changes r unexpectedly. The value we return differs from what r initially held.
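The aliasing described above can be reproduced without Spark. The following is a minimal sketch under stated assumptions: MutableRow, reusingScan, and FixedGapIterator are hypothetical stand-ins (not Spark's SpecificMutableRow, HiveTableScan, or GapSamplingIterator), and the gap is fixed rather than random so the result is deterministic.

```scala
object SampleAliasingDemo {
  // Hypothetical stand-in for a reusable mutable row.
  final class MutableRow(var key: Int) {
    def copy(): MutableRow = new MutableRow(key)
  }

  // Mimics a scan that reuses one row object for every record it emits.
  def reusingScan(keys: Seq[Int]): Iterator[MutableRow] = {
    val shared = new MutableRow(0)
    keys.iterator.map { k => shared.key = k; shared }
  }

  // Simplified gap sampler with the same next() shape as quoted above:
  // fetch an element, skip `gap` further elements, then return the first one.
  final class FixedGapIterator[T](data: Iterator[T], gap: Int) extends Iterator[T] {
    override def hasNext: Boolean = data.hasNext

    private def advance(): Unit = {
      var i = 0
      while (i < gap && data.hasNext) { data.next(); i += 1 }
    }

    override def next(): T = {
      val r = data.next()
      advance() // with a reused row, this overwrites the object r points to
      r
    }
  }

  val keys: Seq[Int] = Seq(0, 1, 2, 3, 4, 5)

  // No copy: every returned row has already been overwritten by the skipped record.
  def sampledWithoutCopy: List[Int] =
    new FixedGapIterator(reusingScan(keys), gap = 1).map(_.key).toList // List(1, 3, 5)

  // Copy each row before sampling: the sampler holds independent objects.
  def sampledWithCopy: List[Int] =
    new FixedGapIterator(reusingScan(keys).map(_.copy()), gap = 1).map(_.key).toList // List(0, 2, 4)
}
```

Without the copy, the sampler hands back the keys of the rows it skipped rather than the rows it selected. Copying each row before it reaches the sampler restores the expected result.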
The most direct fix is to make HiveTableScan return a copy of each mutable row, as in my initial commit. With this solution HiveTableScan cannot take full advantage of the reusable MutableRow, but the sample operation returns correct results.
Furthermore, we should investigate whether GapSamplingIterator.next() could perform the copy itself. That would require every element type an RDD can store to implement something like a clone function, which would be a huge change.
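To illustrate why copying inside the sampler is invasive, here is a rough sketch of what that alternative might look like. Copyable, Row, and CopyingGapIterator are hypothetical names introduced only for this example; nothing like them is claimed to exist in Spark. The point is that the sampler needs evidence that every element type can be duplicated.

```scala
object CopyingSamplerSketch {
  // Hypothetical type class: evidence that a value of type T can be duplicated.
  trait Copyable[T] { def copyOf(value: T): T }

  // Hypothetical mutable row plus its Copyable instance.
  final class Row(var key: Int)
  implicit val rowCopyable: Copyable[Row] = new Copyable[Row] {
    def copyOf(value: Row): Row = new Row(value.key)
  }

  // Gap sampler that snapshots the element before skipping ahead.
  // Every element type must supply a Copyable instance to be sampled.
  final class CopyingGapIterator[T](data: Iterator[T], gap: Int)(implicit c: Copyable[T])
      extends Iterator[T] {
    override def hasNext: Boolean = data.hasNext

    override def next(): T = {
      val r = c.copyOf(data.next()) // copy first, so later mutation cannot affect r
      var i = 0
      while (i < gap && data.hasNext) { data.next(); i += 1 }
      r
    }
  }

  // Samples keys from a scan that reuses a single Row object.
  def sample(keys: Seq[Int]): List[Int] = {
    val shared = new Row(0)
    val scan = keys.iterator.map { k => shared.key = k; shared }
    new CopyingGapIterator(scan, gap = 1).map(_.key).toList
  }
}
```

The implicit Copyable parameter is what makes this a large change: any element type without an instance could no longer be sampled, whereas copying in the SQL operator only touches rows.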