.usingRecursiveComparison().ignoringCollectionOrder() is very slow for medium to large unordered collections #3020

JozsefKutas · 2023-04-16T14:37:53Z

Comparing two unordered collections recursively is very slow if the two collections are not small and their elements are not in a similar order.

I've had a look through the implementation in RecursiveComparisonDifferenceCalculator#compareUnorderedIterables and it looks like the complexity in the worst case is quadratic.

I think this could be improved by using a hash map to avoid unnecessary comparisons, but this would require a deep hash implementation that is consistent with the deep equals configuration. (There was a deepHashCode method in an earlier version of RecursiveComparisonDifferenceCalculator, but it was removed in commit 7406133.)

I've tested this with assertj version 3.24.2.

Some example timed comparisons:

@Test
public void test_usingRecursiveComparison() {
    List<Integer> range100 = IntStream.range(0, 100).boxed().toList();
    List<Integer> range1000 = IntStream.range(0, 1000).boxed().toList();
    List<Integer> range10000 = IntStream.range(0, 10000).boxed().toList();

    // everything works fine for comparisons of ordered collections
    Stopwatch stopwatch = Stopwatch.createStarted();
    assertThat(range100)
            .usingRecursiveComparison()
            .isEqualTo(range100);
    System.out.println(stopwatch);  // 176.6 ms
    stopwatch = Stopwatch.createStarted();
    assertThat(range1000)
            .usingRecursiveComparison()
            .isEqualTo(range1000);
    System.out.println(stopwatch);  // 53.10 μs
    stopwatch = Stopwatch.createStarted();
    assertThat(range10000)
            .usingRecursiveComparison()
            .isEqualTo(range10000);
    System.out.println(stopwatch);  // 32.40 μs

    List<Integer> reversed100 = IntStream.range(0, 100).boxed()
            .sorted(comparing(x -> -x)).toList();
    List<Integer> reversed1000 = IntStream.range(0, 1000).boxed()
            .sorted(comparing(x -> -x)).toList();
    List<Integer> reversed10000 = IntStream.range(0, 10000).boxed()
            .sorted(comparing(x -> -x)).toList();

    // but comparisons of unordered collections become slow very quickly
    stopwatch = Stopwatch.createStarted();
    assertThat(range100)
            .usingRecursiveComparison()
            .ignoringCollectionOrder()
            .isEqualTo(reversed100);
    System.out.println(stopwatch);  // 153.8 ms
    stopwatch = Stopwatch.createStarted();
    assertThat(range1000)
            .usingRecursiveComparison()
            .ignoringCollectionOrder()
            .isEqualTo(reversed1000);
    System.out.println(stopwatch);  // 1.471 s
    stopwatch = Stopwatch.createStarted();
    assertThat(range10000)
            .usingRecursiveComparison()
            .ignoringCollectionOrder()
            .isEqualTo(reversed10000);
    System.out.println(stopwatch);  // 1.639 min
}

The text was updated successfully, but these errors were encountered:

joel-costigliola · 2023-04-16T15:05:44Z

We know this is slow because unordered collection comparison is O(n2) (see #2979).

To use any kind of hashcode, we would need to be consistent with the recursive comparison configuration, let's say you are comparing Person with an Address field and an id of type string field, if the hashcode is computed with both fields but the recursive comparison ignores the id then using the hashcode return lead to false positives.

So I believe this could work if the hashcode is consistent with the recursive comparison configuration.

JozsefKutas · 2023-04-16T15:37:56Z

Thanks for the link - I did a quick search for existing issues before posting this one, but somehow missed that.

I just had a go at fixing the problem and it's pretty easy to get something that covers the simple cases, but my solution doesn't take into account ignored fields or custom comparators. For those I think I'd have to modify the deep hash code to track the field location.

If I come up with something that seems to work I'll start a PR.

… some common cases. See assertj#3020

… some common cases. Does not account for all recursive comparison configuration options. See assertj#3020

… some common cases. See assertj#3020

JozsefKutas · 2023-04-18T00:58:32Z

I've looked a bit more into this, and I don't think it's possible to write a general hash code function that matches the recursive comparison, as with certain configurations the recursive comparison may not define an equivalence relation. This is most obvious with configuration options like ignoreAllActualNullFields which isn't symmetric or transitive. In this case, whether expected.x should be compared (and so included in the hash calculation) depends on the nullness of actual.x. There are other configurations that violate the equivalence relation conditions in less obvious ways.

I think it's still possible to improve on the current behaviour though. The hash code could just be used as a hint for which elements in the unordered iterable to compare to first - if expected doesn't match any actual element in the corresponding hash bucket, we check the other elements.

joel-costigliola · 2023-04-18T09:40:10Z

Thanks for the investigation, I agree that comparing first elements with matching hashcode could improve the overall number of comparison, I don't think we need a deep hashcode (as there was in the implementation at the beginning), the object hashcode should do the job. I would curious how much we gain.

JozsefKutas · 2023-04-18T11:04:42Z

OK, I just created a PR using the object hashcode.

joel-costigliola · 2023-04-18T17:27:26Z

@JozsefKutas out of curiosity, what is the reason for having to compare huge collection with the recursive comparison? I'm kinda of expecting unit test to operate on small data set so I'm curious what your use case is

JozsefKutas · 2023-04-18T17:57:01Z

I'm running some tests where I import some medium sized test data that's used for some financial modelling calculations and I try exporting/reimporting the data, or persisting/fetching the data, and use a recursive comparison to test that it hasn't been changed in any meaningful way.

I can work around the problem by only using a subsample of the test data for tests requiring recursive comparisons (this is my current workaround), but it would be better if this wasn't necessary.

JozsefKutas pushed a commit to JozsefKutas/assertj that referenced this issue Apr 16, 2023

Use hash code for compareUnorderedIterables to improve performance in…

4827bb0

… some common cases. See assertj#3020

JozsefKutas pushed a commit to JozsefKutas/assertj that referenced this issue Apr 16, 2023

Use hash code for compareUnorderedIterables to improve performance in…

104469c

… some common cases. See assertj#3020

JozsefKutas pushed a commit to JozsefKutas/assertj that referenced this issue Apr 17, 2023

Use hash code for compareUnorderedIterables to improve performance in…

2532434

… some common cases. See assertj#3020

JozsefKutas pushed a commit to JozsefKutas/assertj that referenced this issue Apr 17, 2023

Use hash code for compareUnorderedIterables to improve performance in…

a11715b

… some common cases. Does not account for all recursive comparison configuration options. See assertj#3020

JozsefKutas pushed a commit to JozsefKutas/assertj that referenced this issue Apr 18, 2023

Use hash code for compareUnorderedIterables to improve performance in…

3014123

… some common cases. Does not account for all recursive comparison configuration options. See assertj#3020

JozsefKutas pushed a commit to JozsefKutas/assertj that referenced this issue Apr 18, 2023

Use hash code for compareUnorderedIterables to improve performance in…

c8c2386

… some common cases. Does not account for all recursive comparison configuration options. See assertj#3020

JozsefKutas pushed a commit to JozsefKutas/assertj that referenced this issue Apr 18, 2023

Use hash code for compareUnorderedIterables to improve performance in…

0aae64d

… some common cases. See assertj#3020

JozsefKutas mentioned this issue Apr 18, 2023

Use hash code for compareUnorderedIterables to improve performance in… #3021

Closed

joel-costigliola added this to the 3.25.0 milestone Apr 18, 2023

joel-costigliola closed this as completed in c2e981d Apr 25, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

.usingRecursiveComparison().ignoringCollectionOrder() is very slow for medium to large unordered collections #3020

.usingRecursiveComparison().ignoringCollectionOrder() is very slow for medium to large unordered collections #3020

JozsefKutas commented Apr 16, 2023

joel-costigliola commented Apr 16, 2023

JozsefKutas commented Apr 16, 2023

JozsefKutas commented Apr 18, 2023

joel-costigliola commented Apr 18, 2023

JozsefKutas commented Apr 18, 2023

joel-costigliola commented Apr 18, 2023

JozsefKutas commented Apr 18, 2023

.usingRecursiveComparison().ignoringCollectionOrder() is very slow for medium to large unordered collections #3020

.usingRecursiveComparison().ignoringCollectionOrder() is very slow for medium to large unordered collections #3020

Comments

JozsefKutas commented Apr 16, 2023

joel-costigliola commented Apr 16, 2023

JozsefKutas commented Apr 16, 2023

JozsefKutas commented Apr 18, 2023

joel-costigliola commented Apr 18, 2023

JozsefKutas commented Apr 18, 2023

joel-costigliola commented Apr 18, 2023

JozsefKutas commented Apr 18, 2023