Don't run SequenceMatcher on identical input. #889

Carreau · 2017-03-14T02:31:40Z

Seem like obvious optimisation but SequenceMatcher does not do it.

davidhalter · 2017-03-15T18:37:48Z

This is probably going to change a bit anyway (the future would be comparing lines instead of a string).

The SequenceMatcher is really fast. Do you really think that this is an issue?

Carreau · 2017-03-18T21:40:21Z

The SequenceMatcher is really fast. Do you really think that this is an issue?

Yes. Because DiffLib is sometimes wrong, using it is shooting yourself in the foot in some rare case. I wrote the above after scratching my head for a week on an incomprehensible problem.

As to "fast", I've benchmarks somewhere that show that in same case it gets O(n**2) if not worse.
And that sometime it's "fast" because it does not give you minimal diff, but some crazy results, that are correct diff but far from being the expected ones.

This is a "won't fix" upstream as SequenceMatcher() is made for "human friendly" diff, and they won't change the behavior. The O(n**2) behavior is also what makes it by default set autojunk=True which drop the most common character before performing the diff if len(sequence) > 200:

    # from starting any matching block at a junk element ...
    # b2j also does not contain entries for "popular" elements, meaning
    # elements that account for more than 1 + 1% of the total elements, and
    # when the sequence is reasonably large (>= 200 elements); this can
    # be viewed as an adaptive notion of semi-junk, and yields an enormous
    # speedup when, e.g., comparing program files with hundreds of
    # instances of "return NULL;"

So yeah, i've seen a lot of surprising behavior of Sequence Matcher, and have started to write my own a couple of years ago (I should revive that). But I don't trust it to even be fast on2 similars inputs.

davidhalter · 2017-03-31T21:14:21Z

Thanks for the hint. I have integrated similar logic (comparing lines) in a different abstraction where it makes more sense (this abstraction didn't even exist before).

With all the different improvements, the whole test suite runs in 15s now on my notebook, which is pretty good for the amount of work it has to do. That's like 9% faster than the master branch.

davidhalter · 2017-03-31T21:25:34Z

Oh and BTW it's ~ 30% faster than both Jedi 0.8.1 and 0.9 while having 400 more tests than 0.9 and 550 (+ 50%) more tests 0.8.1, which makes it even more impressive (and fun to work with :))

Don't run SequenceMatcher on identical input.

7926f58

Seem like obvious optimisation but SequenceMatcher does not do it.

davidhalter closed this Mar 31, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Don't run SequenceMatcher on identical input. #889

Don't run SequenceMatcher on identical input. #889

Carreau commented Mar 14, 2017

davidhalter commented Mar 15, 2017

Carreau commented Mar 18, 2017

davidhalter commented Mar 31, 2017

davidhalter commented Mar 31, 2017

Don't run SequenceMatcher on identical input. #889

Don't run SequenceMatcher on identical input. #889

Conversation

Carreau commented Mar 14, 2017

davidhalter commented Mar 15, 2017

Carreau commented Mar 18, 2017

davidhalter commented Mar 31, 2017

davidhalter commented Mar 31, 2017