
[SPARK-26878] QueryTest.compare() does not handle maps with array keys correctly #23789

Closed
wants to merge 5 commits

Conversation

@ala (Contributor) commented Feb 14, 2019

What changes were proposed in this pull request?

The previous strategy for comparing Maps leveraged sorting (key, value) tuples by their _.toString. However, the _.toString representation of an array has nothing to do with its content. If a map has array keys, its (key, value) pairs would be compared with those of other maps essentially at random. This could result in false negatives in tests.

This change first compares the keys to find matching ones, and then compares the associated values.
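
To make the failure mode concrete, here is a minimal standalone sketch (illustrative only; the names and structure are not the actual QueryTest code). Array.toString prints an identity hash rather than the contents, so sorting (key, value) tuples by _.toString cannot line up array keys, whereas matching keys structurally does:

    // Standalone illustration; QueryTest's real compare() handles many more types.
    object MapCompareSketch {
      // Structural comparison covering just arrays, maps, and plain values.
      def compare(a: Any, b: Any): Boolean = (a, b) match {
        case (x: Array[_], y: Array[_]) =>
          x.length == y.length && x.zip(y).forall { case (l, r) => compare(l, r) }
        case (x: Map[_, _], y: Map[_, _]) =>
          val (xm, ym) = (x.asInstanceOf[Map[Any, Any]], y.asInstanceOf[Map[Any, Any]])
          xm.size == ym.size && xm.keys.forall { aKey =>
            ym.keys.find(bKey => compare(aKey, bKey)).exists(bKey => compare(xm(aKey), ym(bKey)))
          }
        case (x, y) => x == y
      }

      def main(args: Array[String]): Unit = {
        val m1 = Map(Array(1, 2) -> "a")
        val m2 = Map(Array(1, 2) -> "a")
        // Array.toString looks like "[I@1b6d3586", so sorting entries by _.toString
        // orders the two (equal) maps inconsistently.
        println(m1.head.toString == m2.head.toString) // almost always false
        println(compare(m1, m2))                      // true: keys matched by content
      }
    }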

How was this patch tested?

New unit test added.

@ala (Contributor, Author) commented Feb 14, 2019

@gatorsmile fyi

@SparkQA commented Feb 14, 2019

Test build #102345 has finished for PR 23789 at commit 7d6daa1.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR (Contributor) left a comment

LGTM. Confirmed the test failed on old code and passed on new code.

We might also be able to leverage the fact that the keys can be sorted when the key type is orderable, but this is test-only code and I'm not sure how much it would reduce test time, so it's no big deal and completely optional.
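
For reference, a rough sketch of that optional fast path (hypothetical, test-only code; not part of this PR): when the key type has an Ordering, both maps could be sorted by key and compared pairwise instead of scanning for each key.

    // Hypothetical helper; only applicable when the key type is orderable.
    def compareSortedKeyMaps[K: Ordering, V](a: Map[K, V], b: Map[K, V])
                                            (compareValues: (V, V) => Boolean): Boolean = {
      val aSorted = a.toSeq.sortBy(_._1)
      val bSorted = b.toSeq.sortBy(_._1)
      a.size == b.size && aSorted.zip(bSorted).forall { case ((ak, av), (bk, bv)) =>
        ak == bk && compareValues(av, bv)
      }
    }

Array keys have no Ordering, so this could only ever be a shortcut for simple key types; the find-based comparison in this PR remains the general case.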

@HyukjinKwon (Member) commented

retest this please

  }.reduce(_ && _)
} else {
  a.size == b.size
}
Member:

How about:

      a.size == b.size && a.keys.forall { aKey =>
        val maybeBKey = b.keys.find(bKey => compare(aKey, bKey))
        maybeBKey.isDefined && compare(a(aKey), b(maybeBKey.get))
      }

I think it's similar to the other iterable and array comparisons.

cc @cloud-fan who touched this code lately.
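
As a side note on the suggestion above, the leading a.size == b.size guard is what stops the forall over a's keys from accepting a map whose keys are a strict subset of the other's. A tiny self-contained check (illustrative values only):

    val a = Map("x" -> 1)
    val b = Map("x" -> 1, "y" -> 2)
    val subsetOnly = a.keys.forall(k => b.get(k).contains(a(k))) // true, although a != b
    val withGuard  = a.size == b.size && subsetOnly              // false, as intended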

Contributor:

+1, this looks cleaner

@SparkQA commented Feb 15, 2019

Test build #102384 has finished for PR 23789 at commit 7d6daa1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

a.size == b.size && a.keys.forall { aKey =>
  b.keys.find(bKey => compare(aKey, bKey))
    .map(bKey => compare(a(aKey), b(bKey)))
    .getOrElse(false)
Member:

Either way is fine, but to be clear, chaining isn't necessarily always preferred (see https://github.com/databricks/scala-style-guide#monadic-chaining). I made a PR to your branch during the review. I wonder why you picked this over the suggestion, though.

Ah, it was simply missed. That's okay :).

Contributor (Author):

Yes, missed it, sorry. Thanks for the PR, though 👍

@SparkQA commented Feb 15, 2019

Test build #102397 has finished for PR 23789 at commit 33c5bcc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val entries2 = b.iterator.toSeq.sortBy(_.toString())
compare(entries1, entries2)
a.size == b.size && a.keys.forall { aKey =>
  b.keys.find(bKey => compare(aKey, bKey))
Contributor:

nit: 2 space indentation.

compare(entries1, entries2)
a.size == b.size && a.keys.forall { aKey =>
  b.keys.find(bKey => compare(aKey, bKey))
    .map(bKey => compare(a(aKey), b(bKey)))
Contributor:

nit: if we want to use chaining:

b.keys.find(compare(aKey, _)).exists(bKey => compare(a(aKey), b(bKey)))
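
For reference, the chained form behaves the same as the map/getOrElse version: on Option, exists(p) returns false for None and p(x) for Some(x), exactly like map(p).getOrElse(false). A quick check:

    val found: Option[Int] = Some(3)
    assert(found.map(_ > 1).getOrElse(false) == found.exists(_ > 1))
    val missing: Option[Int] = None
    assert(missing.map(_ > 1).getOrElse(false) == missing.exists(_ > 1))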

Contributor (Author):

Nice.

@SparkQA commented Feb 17, 2019

Test build #102433 has finished for PR 23789 at commit 2713011.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 17, 2019

Test build #102434 has finished for PR 23789 at commit 5933bbb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) commented

thanks, merging to master!

@cloud-fan cloud-fan closed this in 36902e1 Feb 18, 2019
@HyukjinKwon (Member) commented

LGTM too

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
[SPARK-26878] QueryTest.compare() does not handle maps with array keys correctly

Closes apache#23789 from ala/compare-map.

Authored-by: Ala Luszczak <ala@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>