Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

.usingRecursiveComparison().ignoringCollectionOrder() is very slow for medium to large unordered collections #3020

Closed
JozsefKutas opened this issue Apr 16, 2023 · 7 comments
Milestone

Comments

@JozsefKutas
Copy link

Comparing two unordered collections recursively is very slow if the two collections are not small and their elements are not in a similar order.

I've had a look through the implementation in RecursiveComparisonDifferenceCalculator#compareUnorderedIterables and it looks like the complexity in the worst case is quadratic.

I think this could be improved by using a hash map to avoid unnecessary comparisons, but this would require a deep hash implementation that is consistent with the deep equals configuration. (There was a deepHashCode method in an earlier version of RecursiveComparisonDifferenceCalculator, but it was removed in commit 7406133.)

I've tested this with assertj version 3.24.2.

Some example timed comparisons:

@Test
public void test_usingRecursiveComparison() {
    List<Integer> range100 = IntStream.range(0, 100).boxed().toList();
    List<Integer> range1000 = IntStream.range(0, 1000).boxed().toList();
    List<Integer> range10000 = IntStream.range(0, 10000).boxed().toList();

    // everything works fine for comparisons of ordered collections
    Stopwatch stopwatch = Stopwatch.createStarted();
    assertThat(range100)
            .usingRecursiveComparison()
            .isEqualTo(range100);
    System.out.println(stopwatch);  // 176.6 ms
    stopwatch = Stopwatch.createStarted();
    assertThat(range1000)
            .usingRecursiveComparison()
            .isEqualTo(range1000);
    System.out.println(stopwatch);  // 53.10 μs
    stopwatch = Stopwatch.createStarted();
    assertThat(range10000)
            .usingRecursiveComparison()
            .isEqualTo(range10000);
    System.out.println(stopwatch);  // 32.40 μs

    List<Integer> reversed100 = IntStream.range(0, 100).boxed()
            .sorted(comparing(x -> -x)).toList();
    List<Integer> reversed1000 = IntStream.range(0, 1000).boxed()
            .sorted(comparing(x -> -x)).toList();
    List<Integer> reversed10000 = IntStream.range(0, 10000).boxed()
            .sorted(comparing(x -> -x)).toList();

    // but comparisons of unordered collections become slow very quickly
    stopwatch = Stopwatch.createStarted();
    assertThat(range100)
            .usingRecursiveComparison()
            .ignoringCollectionOrder()
            .isEqualTo(reversed100);
    System.out.println(stopwatch);  // 153.8 ms
    stopwatch = Stopwatch.createStarted();
    assertThat(range1000)
            .usingRecursiveComparison()
            .ignoringCollectionOrder()
            .isEqualTo(reversed1000);
    System.out.println(stopwatch);  // 1.471 s
    stopwatch = Stopwatch.createStarted();
    assertThat(range10000)
            .usingRecursiveComparison()
            .ignoringCollectionOrder()
            .isEqualTo(reversed10000);
    System.out.println(stopwatch);  // 1.639 min
}
@joel-costigliola
Copy link
Member

We know this is slow because unordered collection comparison is O(n2) (see #2979).

To use any kind of hashcode, we would need to be consistent with the recursive comparison configuration, let's say you are comparing Person with an Address field and an id of type string field, if the hashcode is computed with both fields but the recursive comparison ignores the id then using the hashcode return lead to false positives.

So I believe this could work if the hashcode is consistent with the recursive comparison configuration.

@JozsefKutas
Copy link
Author

Thanks for the link - I did a quick search for existing issues before posting this one, but somehow missed that.

I just had a go at fixing the problem and it's pretty easy to get something that covers the simple cases, but my solution doesn't take into account ignored fields or custom comparators. For those I think I'd have to modify the deep hash code to track the field location.

If I come up with something that seems to work I'll start a PR.

JozsefKutas pushed a commit to JozsefKutas/assertj that referenced this issue Apr 16, 2023
JozsefKutas pushed a commit to JozsefKutas/assertj that referenced this issue Apr 16, 2023
JozsefKutas pushed a commit to JozsefKutas/assertj that referenced this issue Apr 17, 2023
JozsefKutas pushed a commit to JozsefKutas/assertj that referenced this issue Apr 17, 2023
… some common cases. Does not account for all recursive comparison configuration options. See assertj#3020
JozsefKutas pushed a commit to JozsefKutas/assertj that referenced this issue Apr 18, 2023
… some common cases. Does not account for all recursive comparison configuration options. See assertj#3020
JozsefKutas pushed a commit to JozsefKutas/assertj that referenced this issue Apr 18, 2023
… some common cases. Does not account for all recursive comparison configuration options. See assertj#3020
JozsefKutas pushed a commit to JozsefKutas/assertj that referenced this issue Apr 18, 2023
@JozsefKutas
Copy link
Author

I've looked a bit more into this, and I don't think it's possible to write a general hash code function that matches the recursive comparison, as with certain configurations the recursive comparison may not define an equivalence relation. This is most obvious with configuration options like ignoreAllActualNullFields which isn't symmetric or transitive. In this case, whether expected.x should be compared (and so included in the hash calculation) depends on the nullness of actual.x. There are other configurations that violate the equivalence relation conditions in less obvious ways.

I think it's still possible to improve on the current behaviour though. The hash code could just be used as a hint for which elements in the unordered iterable to compare to first - if expected doesn't match any actual element in the corresponding hash bucket, we check the other elements.

@joel-costigliola
Copy link
Member

Thanks for the investigation, I agree that comparing first elements with matching hashcode could improve the overall number of comparison, I don't think we need a deep hashcode (as there was in the implementation at the beginning), the object hashcode should do the job. I would curious how much we gain.

@JozsefKutas
Copy link
Author

OK, I just created a PR using the object hashcode.

@joel-costigliola
Copy link
Member

@JozsefKutas out of curiosity, what is the reason for having to compare huge collection with the recursive comparison? I'm kinda of expecting unit test to operate on small data set so I'm curious what your use case is

@joel-costigliola joel-costigliola added this to the 3.25.0 milestone Apr 18, 2023
@JozsefKutas
Copy link
Author

I'm running some tests where I import some medium sized test data that's used for some financial modelling calculations and I try exporting/reimporting the data, or persisting/fetching the data, and use a recursive comparison to test that it hasn't been changed in any meaningful way.

I can work around the problem by only using a subsample of the test data for tests requiring recursive comparisons (this is my current workaround), but it would be better if this wasn't necessary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants