Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't run SequenceMatcher on identical input. #889

Closed
wants to merge 1 commit into from

Conversation

Carreau
Copy link
Contributor

@Carreau Carreau commented Mar 14, 2017

Seem like obvious optimisation but SequenceMatcher does not do it.

Seem like obvious optimisation but SequenceMatcher does not do it.
@davidhalter
Copy link
Owner

This is probably going to change a bit anyway (the future would be comparing lines instead of a string).

The SequenceMatcher is really fast. Do you really think that this is an issue?

@Carreau
Copy link
Contributor Author

Carreau commented Mar 18, 2017

The SequenceMatcher is really fast. Do you really think that this is an issue?

Yes. Because DiffLib is sometimes wrong, using it is shooting yourself in the foot in some rare case. I wrote the above after scratching my head for a week on an incomprehensible problem.

As to "fast", I've benchmarks somewhere that show that in same case it gets O(n**2) if not worse.
And that sometime it's "fast" because it does not give you minimal diff, but some crazy results, that are correct diff but far from being the expected ones.

This is a "won't fix" upstream as SequenceMatcher() is made for "human friendly" diff, and they won't change the behavior. The O(n**2) behavior is also what makes it by default set autojunk=True which drop the most common character before performing the diff if len(sequence) > 200:

    # from starting any matching block at a junk element ...
    # b2j also does not contain entries for "popular" elements, meaning
    # elements that account for more than 1 + 1% of the total elements, and
    # when the sequence is reasonably large (>= 200 elements); this can
    # be viewed as an adaptive notion of semi-junk, and yields an enormous
    # speedup when, e.g., comparing program files with hundreds of
    # instances of "return NULL;"

So yeah, i've seen a lot of surprising behavior of Sequence Matcher, and have started to write my own a couple of years ago (I should revive that). But I don't trust it to even be fast on2 similars inputs.

@davidhalter
Copy link
Owner

Thanks for the hint. I have integrated similar logic (comparing lines) in a different abstraction where it makes more sense (this abstraction didn't even exist before).

With all the different improvements, the whole test suite runs in 15s now on my notebook, which is pretty good for the amount of work it has to do. That's like 9% faster than the master branch.

@davidhalter
Copy link
Owner

Oh and BTW it's ~ 30% faster than both Jedi 0.8.1 and 0.9 while having 400 more tests than 0.9 and 550 (+ 50%) more tests 0.8.1, which makes it even more impressive (and fun to work with :))

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants