New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
.usingRecursiveComparison().ignoringCollectionOrder() is very slow for medium to large unordered collections #3020
Comments
We know this is slow because unordered collection comparison is O(n2) (see #2979). To use any kind of hashcode, we would need to be consistent with the recursive comparison configuration, let's say you are comparing So I believe this could work if the hashcode is consistent with the recursive comparison configuration. |
Thanks for the link - I did a quick search for existing issues before posting this one, but somehow missed that. I just had a go at fixing the problem and it's pretty easy to get something that covers the simple cases, but my solution doesn't take into account ignored fields or custom comparators. For those I think I'd have to modify the deep hash code to track the field location. If I come up with something that seems to work I'll start a PR. |
… some common cases. See assertj#3020
… some common cases. See assertj#3020
… some common cases. See assertj#3020
… some common cases. Does not account for all recursive comparison configuration options. See assertj#3020
… some common cases. Does not account for all recursive comparison configuration options. See assertj#3020
… some common cases. Does not account for all recursive comparison configuration options. See assertj#3020
… some common cases. See assertj#3020
I've looked a bit more into this, and I don't think it's possible to write a general hash code function that matches the recursive comparison, as with certain configurations the recursive comparison may not define an equivalence relation. This is most obvious with configuration options like I think it's still possible to improve on the current behaviour though. The hash code could just be used as a hint for which elements in the unordered iterable to compare to first - if |
Thanks for the investigation, I agree that comparing first elements with matching hashcode could improve the overall number of comparison, I don't think we need a deep hashcode (as there was in the implementation at the beginning), the object hashcode should do the job. I would curious how much we gain. |
OK, I just created a PR using the object hashcode. |
@JozsefKutas out of curiosity, what is the reason for having to compare huge collection with the recursive comparison? I'm kinda of expecting unit test to operate on small data set so I'm curious what your use case is |
I'm running some tests where I import some medium sized test data that's used for some financial modelling calculations and I try exporting/reimporting the data, or persisting/fetching the data, and use a recursive comparison to test that it hasn't been changed in any meaningful way. I can work around the problem by only using a subsample of the test data for tests requiring recursive comparisons (this is my current workaround), but it would be better if this wasn't necessary. |
Comparing two unordered collections recursively is very slow if the two collections are not small and their elements are not in a similar order.
I've had a look through the implementation in
RecursiveComparisonDifferenceCalculator#compareUnorderedIterables
and it looks like the complexity in the worst case is quadratic.I think this could be improved by using a hash map to avoid unnecessary comparisons, but this would require a deep hash implementation that is consistent with the deep equals configuration. (There was a
deepHashCode
method in an earlier version ofRecursiveComparisonDifferenceCalculator
, but it was removed in commit 7406133.)I've tested this with assertj version 3.24.2.
Some example timed comparisons:
The text was updated successfully, but these errors were encountered: