-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-9615][SPARK-9616][SQL][MLLIB] Bugs related to FrequentItems when merging and with Tungsten #7945
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #39762 has finished for PR 7945 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we still need to recompute minCount after each update, I think the code is simpler without tracking minCount in all cases. We can compute minCount directly in this branch.
if (baseMap.contains(key)) {
baseMap(key) += count
} else {
baseMap(key) = count
if (baseMap.size > size) {
val minCount = baseMap.values.min
baseMap.retain((_, v) => v > minCount)
baseMap.transform((_, v) => v - minCount)
}
}|
Test build #39884 has finished for PR 7945 at commit
|
|
test this please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor, this could happen after transforming exiting counts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the reason I had this up here, was so that this gets deleted if count = minCount, and that we don't add key -> 0 to the map
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay, this is minor
|
Test build #39910 has finished for PR 7945 at commit
|
|
Test build #39956 has finished for PR 7945 at commit
|
… when merging and with Tungsten In short: 1- FrequentItems should not use the InternalRow representation, because the keys in the map get messed up. For example, every key in the Map correspond to the very last element observed in the partition, when the elements are strings. 2- Merging two partitions had a bug: **Existing behavior with size 3** Partition A -> Map(1 -> 3, 2 -> 3, 3 -> 4) Partition B -> Map(4 -> 25) Result -> Map() **Correct Behavior:** Partition A -> Map(1 -> 3, 2 -> 3, 3 -> 4) Partition B -> Map(4 -> 25) Result -> Map(3 -> 1, 4 -> 22) cc mengxr rxin JoshRosen Author: Burak Yavuz <brkyvz@gmail.com> Closes #7945 from brkyvz/freq-fix and squashes the following commits: 07fa001 [Burak Yavuz] address 2 1dc61a8 [Burak Yavuz] address 1 506753e [Burak Yavuz] fixed and added reg test 47bfd50 [Burak Yavuz] pushing (cherry picked from commit 98e6946) Signed-off-by: Xiangrui Meng <meng@databricks.com>
|
LGTM. Merged into master and branch-1.5. Thanks! |
In short:
1- FrequentItems should not use the InternalRow representation, because the keys in the map get messed up. For example, every key in the Map correspond to the very last element observed in the partition, when the elements are strings.
2- Merging two partitions had a bug:
Existing behavior with size 3
Partition A -> Map(1 -> 3, 2 -> 3, 3 -> 4)
Partition B -> Map(4 -> 25)
Result -> Map()
Correct Behavior:
Partition A -> Map(1 -> 3, 2 -> 3, 3 -> 4)
Partition B -> Map(4 -> 25)
Result -> Map(3 -> 1, 4 -> 22)
cc @mengxr @rxin @JoshRosen