Skip to content

Conversation

@brkyvz
Copy link
Contributor

@brkyvz brkyvz commented Aug 4, 2015

In short:
1- FrequentItems should not use the InternalRow representation, because the keys in the map get messed up. For example, every key in the Map correspond to the very last element observed in the partition, when the elements are strings.

2- Merging two partitions had a bug:

Existing behavior with size 3
Partition A -> Map(1 -> 3, 2 -> 3, 3 -> 4)
Partition B -> Map(4 -> 25)
Result -> Map()

Correct Behavior:
Partition A -> Map(1 -> 3, 2 -> 3, 3 -> 4)
Partition B -> Map(4 -> 25)
Result -> Map(3 -> 1, 4 -> 22)

cc @mengxr @rxin @JoshRosen

@SparkQA
Copy link

SparkQA commented Aug 5, 2015

Test build #39762 has finished for PR 7945 at commit 506753e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we still need to recompute minCount after each update, I think the code is simpler without tracking minCount in all cases. We can compute minCount directly in this branch.

if (baseMap.contains(key)) {
  baseMap(key) += count
} else {
  baseMap(key) = count
  if (baseMap.size > size) {
    val minCount = baseMap.values.min
    baseMap.retain((_, v) => v > minCount)
    baseMap.transform((_, v) => v - minCount)
  }
}

@SparkQA
Copy link

SparkQA commented Aug 5, 2015

Test build #39884 has finished for PR 7945 at commit 1dc61a8.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Copy link
Contributor

mengxr commented Aug 5, 2015

test this please

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor, this could happen after transforming exiting counts

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the reason I had this up here, was so that this gets deleted if count = minCount, and that we don't add key -> 0 to the map

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, this is minor

@SparkQA
Copy link

SparkQA commented Aug 5, 2015

Test build #39910 has finished for PR 7945 at commit 1dc61a8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 6, 2015

Test build #39956 has finished for PR 7945 at commit 07fa001.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit asfgit closed this in 98e6946 Aug 6, 2015
asfgit pushed a commit that referenced this pull request Aug 6, 2015
… when merging and with Tungsten

In short:
1- FrequentItems should not use the InternalRow representation, because the keys in the map get messed up. For example, every key in the Map correspond to the very last element observed in the partition, when the elements are strings.

2- Merging two partitions had a bug:

**Existing behavior with size 3**
Partition A -> Map(1 -> 3, 2 -> 3, 3 -> 4)
Partition B -> Map(4 -> 25)
Result -> Map()

**Correct Behavior:**
Partition A -> Map(1 -> 3, 2 -> 3, 3 -> 4)
Partition B -> Map(4 -> 25)
Result -> Map(3 -> 1, 4 -> 22)

cc mengxr rxin JoshRosen

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #7945 from brkyvz/freq-fix and squashes the following commits:

07fa001 [Burak Yavuz] address 2
1dc61a8 [Burak Yavuz] address 1
506753e [Burak Yavuz] fixed and added reg test
47bfd50 [Burak Yavuz] pushing

(cherry picked from commit 98e6946)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
@mengxr
Copy link
Contributor

mengxr commented Aug 6, 2015

LGTM. Merged into master and branch-1.5. Thanks!

@brkyvz brkyvz deleted the freq-fix branch February 3, 2019 20:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants