Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Syntax highlighting sometimes incorrect with multiline constructs #117

Open
dandavison opened this issue Mar 23, 2020 · 11 comments
Open

Syntax highlighting sometimes incorrect with multiline constructs #117

dandavison opened this issue Mar 23, 2020 · 11 comments

Comments

@dandavison
Copy link
Owner

dandavison commented Mar 23, 2020

If delta receives a hunk that looks something like the example below, then it will syntax highlight the code as a string. (This is a python example, but the problem exists in any language with multiline string literals I think):

    ...and this is the end of my docstring.
    """
    x = some_function("some_string")  # This line should be syntax-highlighted as code.
image

That seems like an inevitable ambiguity. However, are there perhaps some heuristics that could be introduced to allow delta to guess which side of the """ is code and which side string literal?

@dandavison dandavison changed the title Heuristics for multiline string parse ambiguity Syntax highlighting incorrect with multiline strings Sep 4, 2020
@dandavison
Copy link
Owner Author

From @kofls in #316

I think a simple solution would be to get the syntax style per file and use that for each diff section

Do you mean that delta should try to locate the entire file on disk, and use that to determine exactly what is going on in a diff that shows just a fragment of the file? That's a good point, delta could try to do that. It won't always work of course, because the diff might just be piped to delta and the files mentioned in the diff might not even exist, but delta does already use libgit2 to see if it is in a git repo. As you say, it could be an option.

@kofls
Copy link

kofls commented Sep 4, 2020

Yup, that was what I meant. I hadn't considered that the diff source might not be files.

@YodaEmbedding
Copy link

YodaEmbedding commented Apr 5, 2021

Quick n' dirty: I think that syntax highlighting both portions (even if incorrect) is less disastrous than not highlighting the code at all. Perhaps a quick and simple workaround would be to ignore """ when highlighting, so that everything gets highlighted?

Robust and accurate: delta knows the file location, line numbers, and the line number of function definition. So read the file, skip to the function definition, then highlight from there until the end of the function. Now extract the relevant lines from the diff and display those. There are probably some edge cases to handle but that seems to be a good way to do this properly.

@Alphare
Copy link

Alphare commented May 5, 2021

Robust and accurate: delta knows the file location, line numbers, and the line number of function definition

FYI this would only work with Git which is content-addressable, and not (as simply) with Mercurial. I'm not even sure that every operation in Git can be recovered by looking up a file given the information available in a diff, but I don't know enough about Git.

@Alphare
Copy link

Alphare commented May 7, 2021

There exists the git diff -U and hg diff -U flags that give more context. delta could use this flag and then strip down the added context. Note that the theoretical "right" approach in Mercurial would be to use some kind of extdiff mechanism, but that's probably more work. :)

@TheKevJames
Copy link

TheKevJames commented Oct 26, 2022

Potentially a middle-ground between @YodaEmbedding's options:

Try to highlight both sides, capture how successful we were (does bat expose details on what the guessed language was? This could even be a simple "bat guessed the wrong language for this file, probably a comment!"), only show the highlights if error_count < n.

@dandavison
Copy link
Owner Author

dandavison commented Oct 26, 2022

Hi @TheKevJames, delta does not use bat for syntax highlighting or guessing languages. I'm curious -- could you tell me where you got the impression that it does? You're not alone, it's a common misconception!

@TheKevJames
Copy link

@dandavison oh, interesting! I took a quick glance at the README before writing my above message and came across the following lines:

Language syntax highlighting with the same syntax-highlighting themes as bat
All the syntax-highlighting color themes that are available with bat are available with delta:

They definitely don't state that "bat is used for syntax highlighting" directly, but my read from them implied it. IMO "why mention bat if it's not relevant here?"

Guesswork start here: perhaps the library you use for syntax highlighting is just the same one as bat's, or something like that? Maybe mentioning the library could help? eg.

"All the syntax-highlighting color themes that are available with $library -- the same library used by bat -- are available with delta"

@dandavison
Copy link
Owner Author

Actually, re-reading what you wrote, you're just talking about guessing languages, not syntax highlighting. OK, so the way this works is:

  1. Delta uses some code from bat to map a file name to a language.
  2. Delta uses syntect to syntax highlight that language.

@dandavison
Copy link
Owner Author

Yes, exactly, you're right -- delta and bat use the same syntax highlighting library. Thanks, I'll change the README as you suggest -- this confuses lots of people. I thought it might help to mention that they're the same themes to reassure people that they will be able to create a visually consistent experience using both programs.

@TheKevJames
Copy link

Definitely agree on mentioning the similarity to bat -- the two having the same themes was certainly relevant to me when I started looking into using delta! But yeah, mentioning/linking to syntect will certainly help folks looking for a more specific understanding of what lib does what thing.

So in that case, it looks to me like the functionality I described would best be implemented via changes to syntect code; eg. they would need to offer someway to check if a given SyntaxSet "seems to be correct".

My initial thought is that "seems to be correct" might be approximate by checking on the number of modifications: my naive assumption is that in cases where we're highlighting the code we would get many tokens highlighted with their syntax but in cases where we're highlighting the comment we would get few highlights (or perhaps "all the same"? Not sure if there's a highlight for "this token doesn't make sense here").

So in that case, an option which doesn't require syntect changes might be to modify this chunk of delta.

Pardon my incredible lack of rust fluency here, but in attempted psuedo-code / bad code, we might be able to do something like:

let mut line_styles = Vec::new();
for (line, _) in lines.iter() {
    let sections = highlighter
        .highlight_line(line, &config.syntax_set)
        .unwrap();
    # MIN_STYLED_TOKENS might be "at least n tokens should be highlighted"
    # alternatively, it could be "at least n% of the line is hightlighted"
    let is_styled = sections
        .reduce(|acc, (style, _)| {
            if (style != NULL && style != INVALID_TOKEN) { acc += 1 }
        }) <= MIN_STYLED_TOKENS;
    line_styles.push((sections, is_styled));
}

# ie. "split our vector into lines before and lines after a block comment"
let index = line_styles.iter().position(|(line, _)| line == "\"\"\"").unwrap();
let styled_tokens_before = line_styles[..index]
    .filter(|(_, is_styled)| { is_styled })
    .count();
let styled_tokens_after = line_styles[index..]
    .filter(|(_, is_styled)| { is_styled })
    .count();

# This could also be thresholded: eg. "if they're close enough, style both"
if (styled_tokens_before > styled_tokens_after) {
    line_sections.push(line_styles[..index]);
    for (line, _) in lines[index..].iter() {
        line_sections.push(vec![(config.null_syntect_style, line.as_str())]);
    }
} else {
    for (line, _) in lines[..index].iter() {
        line_sections.push(vec![(config.null_syntect_style, line.as_str())]);
    }
    line_sections.push(line_styles[index..]);
}

@dandavison dandavison changed the title Syntax highlighting incorrect with multiline strings Syntax highlighting sometimes incorrect with multiline constructs Jan 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants