Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🚀 Highlighting of --word-diff output, and custom "squashed" diff output #152

Open
ryanberckmans opened this issue Apr 28, 2020 · 22 comments
Projects

Comments

@ryanberckmans
Copy link

Does delta support diff --word-diff?

@AvnerCohen
Copy link

In a quick check - no. Would be awesome to add though.

@dandavison dandavison changed the title ➕Does delta support diff --word-diff? 🚀 Support --word-diff May 5, 2020
@dandavison
Copy link
Owner

Hi @ryanberckmans (and @AvnerCohen), thanks for this! Could you expand on exactly what you'd like to see here?

In a sense Delta does provide something like --word-diff, in that it infers and highlights within-line edits (using a Levenshtein-type DP algorithm). However, I'm not saying that there wouldn't be benefit in working with git's native output, and also extending and improving the algorithms implemented by Delta (see #140).

@dandavison
Copy link
Owner

Are you thinking of replacing git's default (and ugly) {+ and [- markers with background colors? (Presumably in this mode, Delta's native within-line edit inference would be disabled).

_SUPPORTED_PYTHONS = [-['3.7',-]{+['3.8', '3.7',+} '3.6', '3.5', '3.4',[-'3.3',-] '2.7']

@AvnerCohen
Copy link

How about something like:

_SUPPORTED_PYTHONS = [['3.7',]{['3.8', '3.7',} '3.6', '3.5', '3.4',[-'3.3',-] '2.7']

With relevant color scheme applied as well ?

What would be nice is if the color is applied not across all lin, but just on the +- parts.

@dandavison
Copy link
Owner

dandavison commented May 5, 2020

@AvnerCohen could you comment on how you see the role of delta's default within-line edit highlighting versus --word-diff? For example, when would you use one versus the other? Do you have examples of cases where Delta's within-line edit inference is not doing a good job? (Delta's within-line edits (like GitHub, GitLab) are displayed on two different lines, unlike --word-diff which combines the added and removed on a single line.) Note that you may want to play with --max-line-distance to adjust Delta's within-line edit inference algorithm.

Again, I'm not suggesting we shouldn't support Git's --word-diff output, I just want to understand how people view those two different, but somewhat similar, possibilities.

@AvnerCohen
Copy link

@dandavison Thanks for the time and effort.
First, I have checked back and seems like I got a bit confused, I actually meant "--color-words" and not "word-diff".
That being said, I have done couple of tests, and while I remember specific cases where color-words was pretty helpful, I couldn't really find a specific case where this would be better than the correct default you suggest, with the specific part of the word highlighted, for me, it feels like it's just a matter of getting used to it.

@ryanberckmans
Copy link
Author

@dandavison to answer your question on what I'd like to see here-

Simply put, right now delta doesn't seem to support --word-diff. The output has no or incorrect highlighting.

Here's an example
image

In this screenshot, I'd expect the removals and additions to be styled according to the configured delta theme. Right now they are unstyled.

Thanks!

@dandavison
Copy link
Owner

Thanks @ryanberckmans. I agree that that would make much more sense and that, whatever features delta offers, people will still send it --word-diff output and it would be good for it to display it nicely.

Out of interest, do you have any comments on when you use --word-diff versus relying on delta's within-line edit highlighting (below)?

git diff | delta --max-line-distance=0.6
image

@ryanberckmans
Copy link
Author

@dandavison delta's within-line edit highlighting displays on two lines, whereas --word-diff shows "just the words that changed" on a single diff line. For example,

image
^ delta, uses two lines

image
^ --word-diff, uses one line

@lilyball
Copy link

lilyball commented May 7, 2020

At the very least I'd love to see --color-words and --word-diff not confuse delta quite so much.

@pilif-pilif
Copy link

@dandavison I would like to have this feature and help coding it. Could you give me some guidelines where to start?

@dandavison
Copy link
Owner

Hi @pilif-pilif, thanks for the offer! Let's start thinking about this then.

I think that we need to start off with some design thinking before getting to code changes. Basically, what exactly are we aiming for? One question is:

  • Do we give users the option for the lines with --word-diff markers in them to be syntax highlighted by delta? Or do we forget about syntax highlighting and just aim to style the output appropriately?

Syntax highlighting might be quite hard: I think we would have to parse the input to reconstruct the original minus/plus lines of valid syntax in the target language, send those to the syntax highlighter, and then recombine them.

Honestly, after writing the notes below, I am starting to wonder about a different approach here: maybe we should just emit the diff --word-diff output in raw mode, so that the colors that git applies come through unchanged?

What are your initial thoughts about how you want this to behave when you are finished with it?


Here are some more notes. I'm just writing down my initial thoughts, so I am probably overlooking possibilities.

Here is some raw output from git --word-diff, not using delta:

image

We see that

  1. Git has added some special markup to the code: [-removed-] and {+added+}
  2. Git is also emitting ANSI color escape sequences causing these to be displayed as red and green

Now, here is the same output, but using delta:

image

We see that

  1. Delta has removed the red and green from git
  2. Delta has syntax-highlighted the code as Rust

The [- and {+ markers mean that this is not valid Rust syntax, and so the syntax-highlighting could have gone wrong. But in fact, in this case, it doesn't look too bad. In contrast, here's an example where the syntax-highlighting isn't quite what we'd want:

image

@pilif-pilif
Copy link

pilif-pilif commented Jun 16, 2021

Hi @dandavison,

sorry for the delay.

I think that raw for only the [- and {+ part is the only real usable solution since it is a squashed diff. As a user I would still want syntax highlighting for the rest so i can better understand/see the Code where the change is happening.

When I use a squashed diff I want to see fast what changed where and in what environment/segment. So what would be raw green, where - raw red and environment delta syntax highlighting.

so maybe delta can:

  1. stop removing the red/green ASCI color when there are [- and {+ markers
  2. syntax highlighting treats [- and {+ markers as not valid syntax per default and changes nothing on the provided ASCI color

what do you think?

@dandavison
Copy link
Owner

dandavison commented Jun 23, 2021

Hi @pilif-pilif, Two questions

  1. Do you feel that you want syntax highlighting on the line containing [- and {+ itself, or would it be sufficient to have syntax -highlighting on surrounding lines but the line containing the word diff markers in raw mode?

  2. Are you envisaging activating the new behavior via a new delta option, or are you hoping that delta will automatically detect the --word-diff markers? I'm not sure the latter is possible as (a) it may be valid syntax in some language, and (b) we'd have to parse the code to avoid being triggered by that pattern in strings/comments etc., (c) I don't think we have a nice way to interrogate the parent git process for whether --word-diff is in effect (do we?)

syntax highlighting treats [- and {+ markers as not valid syntax per default and changes nothing on the provided ASCI color

The behavior here will depend on the sublime syntax definition for the language in question so I think all we can say is that the syntax highlighting behavior is undefined/unpredictable: it may change it, and as the example above showed a comment can confuse it.

So I'm not sure that I have a better suggestion than adding a new option to delta that causes it to emit the affected lines in raw mode throughout the entire line. Perhaps that new option is not specifically related to word-diff but instead an optional regex that delta will test against each line and emit the line as raw if it matches (just thinking aloud, that might be a bad idea :) ).

@pilif-pilif
Copy link

pilif-pilif commented Jun 25, 2021

Hi @dandavison,

1, I think it would be sufficient to have s.-highlighting on the surrounding lines only. That way the raw line would be easy to spot.

  1. a new delta option would be ok

I want to check my logic - please correct me if I am wrong: (I know the ".." part is very abstract formulated)

  1. git generates ASCII "colors"

  2. delta gets "text" with ASCII "colors"

  3. delta removes the ASCII part - with the new option delta would skip this step.
    or
    3a. delta removes the ASCII part but before that saves the raw lines internally as per the new option

  4. delta sends only the "text part" to the syntax highlighter
    or
    4a. delta sends normally the "text" to the syntax highlighter (nothing changes)

  5. syntax highlighter "puts new ASCII" but leaves the ASCII from git per line
    or
    5.a. if 3a then delta restores saved lines - that way syntax highlighting will not break.

  6. Less displays everything.

can delta influence directly the syntax highlighter prosses?

@dandavison
Copy link
Owner

I want to check my logic - please correct me if I am wrong:

Yes, that's basically accurate. As you say, delta saves a copy of the input line with ANSI color codes stripped and also saves the raw line with ANSI codes intact.

Then, in the main state machine parser loop delta calls handle_hunk_line when its handling a "minus" (removed) or "plus" (added) or "zero" (unchanged) line.

I think that you are going to want to write a new version of handle_hunk_line specially for this git-word-diff feature. Here's the current handle_hunk_line:

delta/src/delta.rs

Lines 458 to 525 in a4f5a4f

fn handle_hunk_line(&mut self) -> std::io::Result<bool> {
// Don't let the line buffers become arbitrarily large -- if we
// were to allow that, then for a large deleted/added file we
// would process the entire file before painting anything.
if self.painter.minus_lines.len() > self.config.line_buffer_size
|| self.painter.plus_lines.len() > self.config.line_buffer_size
{
self.painter.paint_buffered_minus_and_plus_lines();
}
self.state = match self.line.chars().next() {
Some('-') => {
if let State::HunkPlus(_) = self.state {
self.painter.paint_buffered_minus_and_plus_lines();
}
let state = match self.config.inspect_raw_lines {
cli::InspectRawLines::True
if style::line_has_style_other_than(
&self.raw_line,
[*style::GIT_DEFAULT_MINUS_STYLE, self.config.git_minus_style].iter(),
) =>
{
State::HunkMinus(Some(self.painter.prepare_raw_line(&self.raw_line)))
}
_ => State::HunkMinus(None),
};
self.painter
.minus_lines
.push((self.painter.prepare(&self.line), state.clone()));
state
}
Some('+') => {
let state = match self.config.inspect_raw_lines {
cli::InspectRawLines::True
if style::line_has_style_other_than(
&self.raw_line,
[*style::GIT_DEFAULT_PLUS_STYLE, self.config.git_plus_style].iter(),
) =>
{
State::HunkPlus(Some(self.painter.prepare_raw_line(&self.raw_line)))
}
_ => State::HunkPlus(None),
};
self.painter
.plus_lines
.push((self.painter.prepare(&self.line), state.clone()));
state
}
Some(' ') => {
self.painter.paint_buffered_minus_and_plus_lines();
self.painter.paint_zero_line(&self.line);
State::HunkZero
}
_ => {
// The first character here could be e.g. '\' from '\ No newline at end of file'. This
// is not a hunk line, but the parser does not have a more accurate state corresponding
// to this.
self.painter.paint_buffered_minus_and_plus_lines();
self.painter
.output_buffer
.push_str(&self.painter.expand_tabs(self.raw_line.graphemes(true)));
self.painter.output_buffer.push('\n');
State::HunkZero
}
};
self.painter.emit()?;
Ok(true)
}
}

As you can see it's basically a match statement handling the 3 cases: minus, zero, and plus. These are recognized by a '-', ' ' , and '+' character at the start of the line respectively. But with --word-diff, that's no longer true: instead of these 3 types of lines, you have two types of lines:

  1. Squashed-changed lines. This is a new sort of line that delta doesn't previously know about. They are recognized by having substrings like [- ... -] and/or {+ ... +} . We want to emit these raw (look at handle_hunk_line -- it shows how to emit a line as raw).
  2. "Zero" (unchanged) lines. But with git-word-diff these don't start with a leading ' ' character; they are recognized simply by not being squashed-changed lines. These you want to syntax-highlight (so look at handle_hunk_line for that)

Here's the code location downstream where we see delta not syntax-highlighting line that should be emitted "raw" (the fact that is should be emitted "raw" is represented by the enum instance containing a non-empty value. In fact the data contained by the enum is the raw line characters.)

State::HunkMinus(Some(_)) | State::HunkPlus(Some(_)) => false,

So hopefully that helps a bit. I'm going to add some more thoughts in a separate comment.

@dandavison
Copy link
Owner

I think that we should carry on thinking about how we can make this feature as good as possible. Maybe this question helps: what would your absolutely ideal way be for delta to display --word-diff output, ignoring any implementation challenges?

Here are some more higher-level notes:

Delta already has two ways of displaying diffs: traditional, and side-by-side.

But git has a third style: what you are calling "squashed" (which seems like a good name). These are produced by git --word-diff.

The aim of your work could be said to be two different things (which one do you think of it as?):

  1. To handle git diff --word-diff
  2. To add the "squashed diff" style to delta.

If the aim is (2), there is a totally different approach we could take: we could forget about git diff --word-diff entirely, adding an entry like this to the README:

delta does not support git --word-diff output. If you want this style of diff, use delta's squashed option (or whatever it would be called).

Then you would add to delta the ability to squash the minus and plus lines into a single squashed line. That way the lines would already be syntax highlighted, you wouldn't have to worry about parsing/recognizing any weird git output containing [- etc, and you could focus on how to style the removed and added parts of the squashed line.

In contrast, what we're talking about above is (1): i.e. having git do the squashing and delta learning how to parse and style that style of output.

If we are doing (1) then one thing that worries me is how will you handle someone using git diff --word-diff=color? There the git output does not contain the [- and {+ markers; instead removed and added material is indicated by color alone. Here's an example of that:

image

and for comparison here's normal git diff --word-diff:

image

Perhaps we ignore the [- and {+ markers and do the whole thing by recognizing the red and green colors in the raw git output? (And strip the [- and {+ markers if they are present). And perhaps you could actually syntax highlight everything: red and green tells you what is removed and what is added, so you could reconstruct the original code, and send it to the syntax highlighter, and then reassemble it into a single squashed line, with the desired (background color) styling.

I think this gets back to my question above: in an ideal world, if we could just ask for any implementation of this and get it with no work, what would we ask for?

@dandavison
Copy link
Owner

dandavison commented Jun 25, 2021

cc @th1000s just in case you have any thoughts: the topic here is git's --word-diff output format, which merges sections of added and removed content together on a single line.

It seems to me that there are two related objectives:

(a) Handle git diff --word-diff output from git
(b) Design how delta should visually present added and removed content together in a single "squashed diff" line

So far we have been talking as if the new feature would do (a) and (b) so that a user who enters git diff --word-diff or git diff --word-diff will get the single-line squashed output from delta, and this would be the only way to get single-line squashed diff output from delta.

But another way of approaching this would be to

  1. Implement parsing of git diff --word-diff output to delta's standard internal data structures ("minus" and "plus" lines, and "zero" lines)
  2. Implement conversion of delta's standard internal data structures to a single-line squashed-diff output, somewhat like what git produces with --word-diff.

This way a delta user would be able to select single-line squashed diff output whether or not they had passed --word-diff to git.

@th1000s
Copy link
Contributor

th1000s commented Jul 14, 2021

On a), handling the input: Detecting unified diff vs. word diff format is probably not that difficult, when a hunk contains no +/- characters in column zero it (only probably) is word diff output. Or (platform dependent?) look at the arguments of the parent process. Then, parsing [-removed-]{+added+} unambiguously can't be guaranteed either, which would leave grabbing the ANSI. That would mean users have to be instructed to use --word-diff=color or --color-words (or even =porcelain). But the colors are not always red/green, e.g. git -c color.diff.old=yellow -c color.diff.new=cyan diff --color-words.

And b), unless I misunderstand the issue, should just be the standard red/green background.

2.) Sounds much better, only the --word-diff-regex setting from git would be lost. However, then configuring delta as the pager for git would always use the word diff view. There should probably be a quick way to switch that on or off.

@unphased
Copy link

unphased commented Nov 26, 2021

Thanks for linking me here @dandavison and apologies for my delay. As is often the case various things take priority over tinkering with tools but every time I get back to tools I am reminded that it's the foundation for my success.

I'm grateful for the ground that has already been covered on these topics here, and I would like to contribute from a few directions:

My personal experience with an independent diff input format

I fell down the rabbit hole of enhanced line diffs on the command line ever since I caught a glimpse of intra line highlighting via things like diff-highlight (a perl script that comes with git? I've had it in my scripts repo in $PATH forever), and github. In mid 2013 I began maintaining a python script that used diff_match_patch as a dependency to obtain character-wise diffs and quickly converted that (because it was slow) to a particular C++ STL implementation of diff-match-patch ("dmp") which this py script would exec. As such, I never went deep with git --word-diff. I think that git diff may well provide many of the same capabilities, I think there is even regex tokenization configuration available. git word diff can be used outside of git using --no-index. It is merely happenstance that I didn't choose to use it and reinvented this wheel myself.

You may see in the above linked discussions #776 #785 where I have posted screenshots, my tool is fundamentally pretty similar to git word-diff. All my python script sift does is massage the raw protocol output from C++ dmp into suitably-placed ANSI background/reverse codes.

One of the many deficiencies that I wanted to address with stuff like diff-highlight (and which I perceived on cursory evaluation of other options, and I might add this applies to github as well is that the results are sort of suboptimal in the sense that no effort is put toward bringing out a lot of the details in intra-line changes. For example please compare the quality of highlighted regions between

image

(assume the usual where + line is green and - line is red)

and

image

As you might imagine I have become rather reliant on these maximally concise diffs where there are several main optimal properties:

  • no unchanged context content is ever duplicated, so context uses up the least amount of space possible
  • to my knowledge this can produce intelligible output highlighting under many stressful scenarios, when alternative less intensive diff algorithms cannot succeed. Examples include:
    • text dumps of SQL tables for double-checking transactions,
    • automatic code formatting applied to code: confirming no functional code changes. You would think that something like diff-highlight can take care of this when only changing whitespace, but as already shown when more than one change exists on a given line, it comes up short, and as a general rule, line based tools including Github often completely come apart at the seams even when only whitespace changes are introduced, in particular when newlines get introduced or removed,
    • something I did the other day which was the comparison between a Jenkins consoleText output and a tmux capture-pane -eJ export, which inserts a large number of trailing spaces in each line and many small insignificant changes

So far I've never had a need for changing the behavior of this from doing a kind of brute force character-wise diff. I have not had a chance to more deeply explore git --word-diff, i would want to adjust its output somewhat and wrap it still if I were to use it, for example added or removed newlines don't show well in the color configuration.

Now computing a character-wise diff does increase computational load compared to regular diff implementations and I use it for almost, but not all, git diff viewing scenarios: I keep aliases around to use lesser, much faster diffs, and use them often when engaging in deep archaeology sessions because the runtime of dmp appears in practice to be a bit worse than n*log(n) complexity (an area to investigate another time).

All of this to say, I have a unique perspective on this issue of "squashed-diff" output, I come bearing a third possible input to Delta, and am strongly in support of having flexibility of input for Delta so that I may apply it to my own needs. I want to be able to use all diff tools at my disposal in git, and I've described (unfortunately not very concisely) the delightful extent of the benefits inside and outside of git that can be had from dedicating a little more compute to diff calculation.

I feel that Delta has independently gone down the same path in this regard, to great effect and with a much more efficient implementation. When I consider choosing between using Delta or sift with git diff, there is usually no contest because Delta is a more pleasant experience in most cases, whether unified or side by side (quite the accomplishment I'd say). It is a great deal faster. But in many examples when only small syntactical changes are present, a large amount of actual output is produced, making quickly scanning and understanding the change more difficult than it has to be! A small amount of this is down to differences in comparison token boundary definitions, which I think are configurable and can be optimized further to reduce the amount of raw change characters highlighted (note, brute char-wise diff does go overboard finding useless matching patterns and causing garbled output, so it is not always better), a larger part of the extra content is due to duplication found in the minus- and plus- context lines. They're clearly helpful when changes become more complex, but they are also visually distracting when changes are simple.

So, the point I want to make (and finally coming back to the topic at hand):

word-diff is often more concise and readable

It's clear to me now that I have been driving Delta for a few weeks that its only weaknesses compared to my beloved sift is verbosity. In the vast majority of cases, a change involves small portions of a few lines. The problem with Delta/github is that it will always use up two lines of context to show a change, no matter how small. Since I'm very used to it, I would be one of these squashed-diff users, so my two cents would be in support of enabling all the possibilities (threw in a --color-moved related bullet but i think we should ignore that for this discussion as it is out of scope, but mentioning in case it is relevant in implementation):

  • to be capable of identifying and interpreting different types of input diffs:
    • vanilla ansi colored unified diff (already works, also works as raw stdin without ansi colors e.g. diff -u ... | delta)
    • word-diff in whatever flavors is most suitable for parsing, perhaps porcelain. Then I can edit sift to emit porcelain word-diff format and run e.g. sift | delta for my non-git shenanigans for example as a way to perhaps input a diff suggestion (this feature might be ill-formed, sorry)
    • color-moved when enabled as default or zebra (they are the same) as I expect parsing dimmed-zebra may be nontrivial
    • raw pair of input files (i'm not familiar with how it computes diff internally but i know it works)
  • to internally rebuild minus- and plus-lines in hunks and syntax-highlight each version of each hunk
  • to have configurable output modes
    • unified
    • side-by-side
    • squashed

On syntax highlighting of concise/squashed diff output

I am not sure if there was a consensus regarding whether/how to highlight the squashed diff lines.

I think the minus and plus sections themselves never have ambiguity in their syntax. They should just be highlighted per their computed values based on the minus- and plus- reconstructed hunks.

It is the context lines that have ambiguity. This is, after all, why I believe we need to consider even in the squashed-diff mode we're proposing, that we may still want to have a configurable mode where if the syntax in the unchanged context would be different that we should show the minus- and plus- unchanged context content. A simple example of this is if some lines become commented. They turn from code syntax to comment syntax. I know what I would like to do to view these scenarios is I would like to see both versions of diff, squashed to make it easier to see all of the chars changed only (i'd do it today with sift!), and I'd like to also see it expanded so I can confirm the green sides all have commented syntax (today Delta crushes this task). So I argue this should be an option.

Misc. (somewhat off topic but might help someone who has gone off the deep end like me)

I've found it difficult ever since I discovered Delta recently, to toggle between 3 ways to view git diff,

  1. delta
  2. sift (its output getting sent to delta works fine it turns out -- delta knows not how to interpret it so it spits it back out verbatim, but I get whatever delta's been configured for for the header styles/decoration)
  3. vanilla (git --no-pager diff --color=always | less as a way to disable delta as pager) as a backup which always runs fast, which I will probably never use going forward now that I have delta

The issue I was running into was that if I configure an external diff (diff.external), --color-moved never gets applied. This makes a little bit of sense, but I think it is a bug, because I believe when --no-ext-diff is specified it should behave as though no external diff is configured, but it doesn't do that, it seems to disable --color-moved.

In conjunction with this, since this issue means that I cannot use a diff.external because I do want to benefit from --color-moved, I had a great deal of difficulty enabling sift inside git log -p which I do use a great deal.

I just did some testing and was able to find the workaround!
It is possible to use GIT_EXTERNAL_DIFF=sift git log -p --ext-diff to enable an ext diff for log -p. I think I have all the puzzle pieces for this again now. This makes me feel much better being able to have all the choices for the most capable tools at hand. But most ideal would be if one tool can be the best in all scenarios, that is the dream, one diff to rule them all!

@dandavison
Copy link
Owner

The current release (0.10.3) refuses to handle --word-diff or --color-words at all, instead allowing the raw git output through. I think this is at least a step in the right direction since what delta was doing previously to that input was entirely invalid.

@dandavison dandavison changed the title 🚀 Support --word-diff 🚀 Highlighting of --word-diff output, and custom "squashed" diff output Dec 4, 2021
vorburger added a commit to vorburger/vorburger-dotfiles-bin-etc that referenced this issue Nov 21, 2022
Note dandavison/delta#152
is causing a (small) change in that WS intendation is
no longer ignored, but it's still possible to use git diff -w.

PS: I also considered https://github.com/so-fancy/diff-so-fancy
but opted for delta after reading https://dandavison.github.io/delta/features.html
and seeing https://dandavison.github.io/delta/comparisons-with-other-tools.html
@vigo
Copy link

vigo commented Feb 23, 2023

$ git diff --word-diff
fatal: invalid regular expression: [a-zA-Z_][a-zA-Z0-9_]*|[-+0-9.eE]+i?|0[xX]?[0-9a-fA-F]+i?|[-+*/<>%&^|=!:]=|--|\+\+|<<=?|>>=?|&\^=?|&&|\|\||<-|\.{3}|[^[:space:]]|[�-�][�-�]+

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Triage
High priority
Development

No branches or pull requests

8 participants