Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Find the longest match in the suffix-sorted array #6

Merged
merged 1 commit into from
Feb 23, 2018

Conversation

phmccarty
Copy link
Contributor

@phmccarty phmccarty commented Dec 19, 2017

Fixes #2

The current implementation performs a binary search through the suffix-sorted array to find byte sequence matches between old and new files. However, the algorithm is not optimal when it repeatedly matches very short strings, leading to performance issues as reported in issue #2 .

This commit changes the algorithm to consider all matches encountered during the binary search and choose the longest of these matches.

The overall performance impact is yet to be determined, but it appears to yield a small percentage increase in diff creation time (expected) and a large percentage decrease for the case of diffing the files from issue #2. In my testing, the creation time there decreases from approx 64 minutes to 4.5 seconds.

Fixes clearlinux#2

The current implementation performs a binary search through the
suffix-sorted array to find byte sequence matches between old and new
files. However, the algorithm is not optimal when it repeatedly matches
very short strings, leading to performance issues as reported in issue
 clearlinux#2.

This commit changes the algorithm to consider *all* matches encountered
during the binary search and choose the longest of these matches.

The overall performance impact is yet to be determined, but it appears
to yield a small percentage increase in diff creation time (expected)
and a large percentage *decrease* for the case of diffing the files from
issue clearlinux#2. In my testing, the creation time there decreases from approx
64 minutes to 4.5 seconds.

Signed-off-by: Patrick McCarty <patrick.mccarty@intel.com>
@matthewrsj
Copy link

+1, I think the large decrease in diff creation times for files from issue #2 is a good tradeoff for the small increase seen for most files. It should also be mentioned that patch times are faster as a result of the greater match lengths (paraphrasing @phmccarty here).

@phmccarty
Copy link
Contributor Author

(revisiting this now)

I was concerned about the small performance increase for the general case, but the tradeoff for vastly improved performance for certain special cases (e.g. #2) is acceptable for now. I can iterate from this point.

@phmccarty phmccarty merged commit 85b1c53 into clearlinux:master Feb 23, 2018
@tpepper
Copy link

tpepper commented Feb 23, 2018

Is there a notable difference in memory consumption during runs before/after this change?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bsdiff runs very long
3 participants