Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Syntax highlighted diffs #3101

Merged
merged 135 commits into from
Oct 31, 2017
Merged

Syntax highlighted diffs #3101

merged 135 commits into from
Oct 31, 2017

Conversation

niik
Copy link
Member

@niik niik commented Oct 19, 2017

Wouldn't it be nice with some syntax highlighting in your diffs?

A screenshot of GitHub Desktop showing a diff with syntax highlighting

The Problem

Syntax highlighting is a well-understood problem with tons of options. Atom uses TextMate grammars to do theirs but since we're already using CodeMirror I took a stab at implementing ours using that.

Syntax highlighted diffs have been a much appreciated feature of GitHub.com for a long time now and one that I have missed in GitHub Desktop for a long time. Highlighting in diffs presents some added complexity over that of highlighting in a normal source file though. Pretty much all languages are contextual, in that what happened on some line "higher up" affects what's going on further down. As such you can't just pull out a line from a diff and expect it to be highlighted properly. Here's a good example

A screenshot of GitHub Desktop showing a diff with a multi-line comment which is missing the opening statement

Had we just tried to highlight individual lines here we wouldn't have been able to infer that the first line was part of a multi-line comment.

Instead, we have to take the contents of the file before the change, and the contents of it after and run highlighting on both versions. Once that's done we can stitch these together to form one syntax highlighted diff.

The Approach

When we are about to perform highlighting on a diff we start out by scanning through the diff to figure out which lines we need from which file. Context lines can be pulled from either version while added/removed lines obviously need to come from a particular version. If we find that a file consists entirely of additions or entirely of deletions we can optimize further by adding a preference for one of the versions and thus getting away with loading just one file.

Once we've got that settled we load the first 256kb from both versions (256kb picked arbitrarily because I figured it should cover the majority of source files while adding a very manageable memory overhead for the feature). We then pass this content, along with which lines we want to get tokens for to one or two web workers which then run the modes. CodeMirror modes are synchronous but running them in a web worker means we can get on with other things while we're tokenizing up to half a megabyte of content in up to two different threads (threads in javascript, what have we come to). It also means that we have a real nice containment of the highlighting process and that we can terminate it should it for some reason end up taking a very long time to complete.

When we get the results from the workers we apply our own custom CodeMirror mode which takes the tokens from the language modes and applies them inside of our diff. That means that there's a small window in between when users see the diff and when highlighting gets applied. In my testing it's barely noticeable and it means we can deliver what really matters (the diff) as quickly as we've done before.

The Performance.

So far I've been blown away by how well this performs, even on my 2015 12" macbook it's real snappy. We do a couple of optimizations behind the scenes and we keep web workers alive once we've fired them up to avoid the penalty of parsing the javascript of all the modes.

Languages currently supported

We can obviously support every mode that CodeMirror supports but for now I've opted to include these until we get a sense of how well this is working.

JavaScript, JSON, TypeScript, HTML, Markdown, Yaml, XML, Objective-C, Scala, Java, C, C++, sh/bash, Go, Perl, PHP, Python, Ruby.

Feel free to clone https://github.com/niik/highlighter-tests to test some of the modes

screen shot 2017-10-19 at 18 47 39

Future improvements

  • TextMate grammars seem to have a ton of traction. We might want to look at using them instead of CodeMirror modes in the future. The problem with them is that they use RegExes that aren't 100% compatible with JavaScript's RegEx engine so we'd have to bring in another tool for that. Atom uses the oniguruma library which is a C++ library. With our current implementation using web workers that's not an option for us since we can't use native modules inside of a web worker safely. There's a neat package that uses the JavaScript implementation of oniguruma but the author does call out that there are incompatibilities between the two. The most likely thing for us to do is run highlighting inside of a renderer instead.
  • If we decide to stick with the CodeMirror modes and we add a bunch of other languages to the point where spinning up a worker becomes an issue we could consider loading modes on-demand in the worker.
  • With the performance having improved so much since I initially went down the web worker route it's quite possible that we could move highlighting back into the renderer and manage concurrency by chunking up our tokenization work into animation frames.

Caveats

  • I've spotted a fairly common occurrence bug in the typescript mode which can cause the entire file after a particular combination of tokens to be classifies as a string. This has been fixed in CodeMirror/master but not yet released.
  • There's a very small race condition window in between when we get the diff from Git and when we read the file contents of a modified file in the working directory where the contents of the file could have changed. The results would be pretty benign with syntax being offset or wrong. We could add some logic to ensure that lines match up after we've tokenized them. I did that in the beginning but removed it due to the overhead it added in the token model. We could also generate diffs with infinite context and manually collapse them in code. That's a much riskier approach from a memory conservation standpoint but it would mean that we could support expanding context dynamically like dotcom does (though there are other ways to achieve that as well).

Fixes #1312

niik and others added 2 commits October 26, 2017 13:53
Also, move the disable comment in lib/globals.d.ts to the top of the file

�
Degeneralize the ESLint-disabling comment in highlighter/globals.d.ts
@niik
Copy link
Member Author

niik commented Oct 30, 2017

Is there anything I can do to help out here @shiftkey?

* information contained within the ILineTokens interface.
*/
export interface IToken {
length: number

This comment was marked as spam.

@@ -39,3 +40,24 @@ export async function getBlobContents(

return Buffer.from(blobContents.stdout, 'binary')
}

export async function getPartialBlobContents(

This comment was marked as spam.

// exists.
const worker =
highlightWorkers.shift() ||
new Worker(`file:///${__dirname}/highlighter.js`)

This comment was marked as spam.

This comment was marked as spam.

@niik
Copy link
Member Author

niik commented Oct 30, 2017

📼


### I want to add my favorite language

Cool! As long as it's a language that [CodeMirror supports out of the box](https://codemirror.net/mode/index.html) we should be able to make it work. Open an issue and we'll take it from there. It would be swell if you could also submit a PR with a sample file for the language to [niik/highlighter-tests](https://github.com/niik/highlighter-tests) (we'll find a better spot for this in the future).

This comment was marked as spam.


When we are about to perform highlighting on a diff we start out by scanning through the diff to figure out which lines we need from which file. Context lines can be pulled from either version while added/removed lines obviously need to come from a particular version. If we find that a file consists entirely of additions or entirely of deletions we can optimize further by adding a preference for one of the versions and thus getting away with loading just one file.

Once we've got that settled we load the first 256kb from both versions (256kb picked arbitrarily because I figured it should cover the majority of source files while adding a very manageable memory overhead for the feature). We then pass this content, along with which lines we want to get tokens for to one or two web workers which then run the modes. CodeMirror modes are synchronous but running them in a web worker means we can get on with other things while we're tokenizing up to half a megabyte of content in up to two different threads (threads in javascript, what have we come to). It also means that we have a real nice containment of the highlighting process and that we can terminate it should it for some reason end up taking a very long time to complete.

This comment was marked as spam.

timeout = window.setTimeout(() => {
worker.terminate()
log.error('Highlighting worker timed out')
reject(resolve({}))

This comment was marked as spam.

This comment was marked as spam.

* stream to count columns. See CodeMirror's StringStream
* class for more details.
*/
tabSize: number

This comment was marked as spam.

}
})
})
if (exitCodes.has(code) || signal) {

This comment was marked as spam.

0,
MaxHighlightContentLength - 1
)
} else if (file instanceof CommittedFileChange) {

This comment was marked as spam.

@niik
Copy link
Member Author

niik commented Oct 31, 2017

🍸

@shiftkey
Copy link
Member

shiftkey commented Oct 31, 2017

@xt0rted
Copy link
Contributor

xt0rted commented Nov 7, 2017

The release notes have #1312 as the issue/pr for the feature instead of this or #2550

@shiftkey
Copy link
Member

shiftkey commented Nov 7, 2017

@xt0rted yeah, this was generated by what-the-changelog - I think what happened here was that it chose the only linked issue (which was #1312) as we didn't link to the original issue.

@nyssance
Copy link

nyssance commented Nov 8, 2017

Need a preference to open/close the syntax, it's in a daze.

@j-f1
Copy link
Contributor

j-f1 commented Nov 8, 2017

@nypisces Can you open a new issue and give us a few more details about your feature request there?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-for-review Pull Requests that are ready to be reviewed by the maintainers
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants