SequenceMatcher can chunkify suboptimally #206

rogalski · 2021-09-23T22:34:57Z

This bug is based on my experiments with darker on closed-source repo.
I am not at liberty to disclose any code. We'll have to work on recreating failure criteria on our own.

This one is hardest one to reproduce without code which triggered this exact behavior...

Particular baseline combined with particular patch triggered a case where black chunk was extremely large (~150 lines) and rather non-sensical. This caused bisect to go on and on. In result, a lot of lines not belonging to diff was reformatted as well. By introducing small heuristic via isjunk I was able to trigger smaller chunks which then got reformatted into something sensible.

We may want diving into adjusting junk heuristics in SequenceMatcher or using different matching algorithm (something that forces minimal diffs instead of human-readable diffs). Also, possibly use something like hypothesmith to try to reproduce this behavior in synthetic manner.

The text was updated successfully, but these errors were encountered:

akaihola · 2021-09-24T07:30:23Z

bisect to go on and on

Like how many rounds?

Originally I had a naïve algorithm which expanded the number of extra context lines one by one. This of course caused laughable performance with large files which didn't diff well.

I thought bisection would result in a reasonable number of required rounds on any imaginable source code files, but apparently this isn't the case?

different matching algorithm (something that forces minimal diffs instead of human-readable diffs)

This is what I've been thinking might be the solution to many cases where bisection is triggered. Are you aware of any good Python implementations of such algorithms?

rogalski · 2021-09-24T11:20:57Z

Last time I read about that, Levenstein was supposedly best.
There is C-extension: https://pypi.org/project/Levenshtein/
I am unsure how well maintained that is or if public API is fully interchangeable.

rogalski · 2021-09-24T13:32:37Z

Btw. I still like my heuristic to treat whitespace-only lines as junk. A lot of code in the world already use empty lines to split code into "somewhat" independent sections. Forcing chunk mechanism to break on those changes seems very intuitive to me.

akaihola · 2021-09-25T13:48:29Z

Sounds like your blank line heuristic would be a smaller and simpler change. Do you have a Darker patch for it?

rogalski · 2021-09-28T11:16:25Z

darker/src/darker/diff.py

Line 92 in ac4f2b0

matcher = SequenceMatcher(None, src.lines, dst.lines, autojunk=False)

Something along the lines below of should do the trick:

   matcher = SequenceMatcher(lambda line: all(char in string.whitespace for char in line), src.lines, dst.lines, autojunk=False)

Please note that this is not taking into consideration any performance concerns. O(1) lookup via set may be desirable.

rogalski · 2021-09-28T20:16:35Z

Silly idea, maybe we should delegate computing diffs to git binary itself? (git diff --no-index f1.py f2.py)
We have to create some temporary files, but algorithm itself should be already quite well-tuned for code.
Also, compiled code may be faster than difflib, even with imposed overhead.

akaihola · 2021-09-30T19:06:36Z

I think it would be a fair approach to implement support for git diff --no-index f1.py f2.py in Darker and both compare the results on "difficult" files as well as benchmark the performance against the current implementation.

akaihola · 2021-10-10T17:02:00Z

@rogalski I experimented with

matcher = SequenceMatcher(
    lambda line: all(char in string.whitespace for char in line),
    src.lines,
    dst.lines,
    autojunk=False
)

but it actually makes my results worse, not better. If two reformatted lines are separated by a blank line, and only one of them was modified by the user, both still get reformatted by Darker.

Minimal case:

import string
from difflib import SequenceMatcher

print(
    SequenceMatcher(
        lambda line: all(char in string.whitespace for char in line),
        ["old 1", "", "old 2"],
        ["new 1", "", "new 2"],
        autojunk=False
    ).get_opcodes()
)

[('replace', 0, 3, 0, 3)]

whereas

print(
    SequenceMatcher(
        None,
        ["old 1", "", "old 2"],
        ["new 1", "", "new 2"],
        autojunk=False
    ).get_opcodes()
)

[('replace', 0, 1, 0, 1), ('equal', 1, 2, 1, 2), ('replace', 2, 3, 2, 3)]

akaihola · 2021-10-10T19:28:02Z

delegate computing diffs to git binary itself? (git diff --no-index f1.py f2.py)

@rogalski FYI, src/darker/git_diff.py (link to commit just before removal) used to contain code for invoking git diff and parsing the result. I had to discard that approach when improving isort integration for Darker 1.0.0 (in 8a4e13e).

With --no-index (which I was unaware of until you pointed it out) we could very well use git diff again, now not only for user edits and isort results, but also for Black reformatting.

The best option would of course be to be able to get the distinct reformatted chunks from Black itself without having to diff and analyze its output. But I haven't yet looked into that. And Black hasn't yet defined a stable public API (although that hasn't stopped Darker from calling format_str(), assert_equivalent(), and find_project_root()).

akaihola · 2021-10-11T06:47:06Z

@rogalski, I started to experiment and see what it would take to get more information from Black in order to apply reformattings with more granularity. See #221.

Update: It does quickly get really hairy – obtaining reformattings in chunks from Black is not a silver bullet.

akaihola · 2022-02-18T06:53:28Z

Moving this to milestone 1.5.0, there are a plenty of bugfixes and documentation improvements coming up for a 1.4.2 bugfix release already.

akaihola · 2022-02-24T21:48:38Z

Postponing to 1.6.0 – this one doesn't seem like a simple change.

akaihola · 2022-12-25T16:26:59Z

Postponing to 1.7.1.

rogalski mentioned this issue Sep 23, 2021

Missing --jobs support #178

Closed

akaihola added the question Further information is requested label Sep 24, 2021

akaihola added this to the 1.3.2 milestone Oct 5, 2021

akaihola modified the milestones: 1.3.2, 1.4.0, 1.4.1 Oct 28, 2021

akaihola modified the milestones: 1.4.1, 1.4.2 Feb 10, 2022

akaihola modified the milestones: 1.4.2, 1.5.0 Feb 18, 2022

akaihola modified the milestones: 1.5.0, 1.6.0 Feb 24, 2022

akaihola modified the milestones: 1.6.0, 1.6.1 Dec 19, 2022

akaihola modified the milestones: 1.6.1, 1.7.1 Dec 25, 2022

akaihola modified the milestones: 1.7.1, 1.7.2 Feb 19, 2023

akaihola modified the milestones: Darker 1.7.2, Darker 1.8.1 Mar 26, 2023

akaihola added enhancement New feature or request help wanted Extra attention is needed labels Mar 28, 2024

akaihola modified the milestones: Darker 2.1.1, Darker 2.2.0 - New feature release Mar 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SequenceMatcher can chunkify suboptimally #206

SequenceMatcher can chunkify suboptimally #206

rogalski commented Sep 23, 2021 •

edited

Loading

akaihola commented Sep 24, 2021

rogalski commented Sep 24, 2021

rogalski commented Sep 24, 2021

akaihola commented Sep 25, 2021 •

edited

Loading

rogalski commented Sep 28, 2021

rogalski commented Sep 28, 2021

akaihola commented Sep 30, 2021

akaihola commented Oct 10, 2021

akaihola commented Oct 10, 2021 •

edited

Loading

akaihola commented Oct 11, 2021 •

edited

Loading

akaihola commented Feb 18, 2022

akaihola commented Feb 24, 2022

akaihola commented Dec 25, 2022

SequenceMatcher can chunkify suboptimally #206

SequenceMatcher can chunkify suboptimally #206

Comments

rogalski commented Sep 23, 2021 • edited Loading

akaihola commented Sep 24, 2021

rogalski commented Sep 24, 2021

rogalski commented Sep 24, 2021

akaihola commented Sep 25, 2021 • edited Loading

rogalski commented Sep 28, 2021

rogalski commented Sep 28, 2021

akaihola commented Sep 30, 2021

akaihola commented Oct 10, 2021

akaihola commented Oct 10, 2021 • edited Loading

akaihola commented Oct 11, 2021 • edited Loading

akaihola commented Feb 18, 2022

akaihola commented Feb 24, 2022

akaihola commented Dec 25, 2022

rogalski commented Sep 23, 2021 •

edited

Loading

akaihola commented Sep 25, 2021 •

edited

Loading

akaihola commented Oct 10, 2021 •

edited

Loading

akaihola commented Oct 11, 2021 •

edited

Loading