Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a new injection point setting to support parsing Rust macros via language injection #19556

Merged
merged 6 commits into from Jun 18, 2019

Conversation

@maxbrunsfeld
Copy link
Contributor

commented Jun 17, 2019

Fixes #19435

Summary

This PR upgrades tree-sitter-rust to the latest version. Because of some changes that I've made to that parser, we need to change the way that we highlight macro definitions and macro invocations in Rust code. This change required adding a new option to the existing injection point API. It should improve the experience of editing Rust code that contains macros.

Background - Rust Macros

Macros come up a lot in Rust, especially macro invocations, like this:

assert_eq!(a.b(), C::D { e: f });

The Rust parser does not actually parse the arguments to macros (everything after the ! in the example above) as written using normal Rust syntax. Instead, it parses them into simple token trees - loose sequences of tokens with bracket-matching as the only structure. The contents of the Token trees get parsed later, after macro expansion.

Background - Syntax Highlighting Macros

Even though macro arguments are technically just token trees, in practice, they usually contain valid Rust syntax. So it's desirable to highlight macro arguments as if they were Rust expressions.

In the old version of tree-sitter-rust (0.13.7 and below), the parser itself would try to parse token trees as Rust code, and fall back to treating them as token trees if this parsing failed. It did this using Tree-sitter's GLR system for handling ambiguity. So in most cases, macro calls would appear in the syntax tree with their arguments fully parsed.

Changing How Macros are Highlighted

There were a couple of problems with this ☝️ approach:

First of all, if there was anything invalid inside of a macro call, the entire macro call would switch back to being parsed as a simple token tree. During the course of typing, this sometimes caused unexpectedly large changes to the highlighting. Also, the ambiguity of parsing token trees as Rust code caused the parser to have a larger binary size than it would otherwise need to.

I've since realized that Rust macros should be highlighted using language injection, similarly to what we do for Ruby code within ERB, and C code within C macro definitions. This way, we won't totally give up on parsing the macros due to small parse errors, and we don't need to complicate the Rust parser itself to handle the behavior.

The includeChildren API

Unfortunately, the current language injection API isn't quite suitable for parsing Rust token trees as Rust code.

Currently, when you add an injection point for a given node N, we assume that if N has internal structure, we should preserve that internal structure. Specifically:

  • We continue to render highlighting for the tokens within N
  • When computing which ranges of text we should pass to the injected language parser, we subtract those tokens' ranges from N's range, so that we can pass only N's own text, not its children's text.

This is not the behavior we want for handling Rust macros. All of the tokens within a token tree are already parsed out as separate tokens by the outer Rust parser. We do want to reinterpret them via a second pass of the Rust parser.

I've added a new boolean option to the GrammarRegistry.addInjectionPoint method called includeChildren that implements this new behavior. If you call addInjectionPoint with includeChildren set to true, then the matching nodes' entire text will be passed to the injected parser, including the nodes' children's text.

QA Process

In addition to writing one unit test for this new behavior, I've been testing this manually by opening large Rust source files containing lots of macro invocations and macro definitions, like this one from serde, editing, and verifying that syntax highlighting looks good.

@maxbrunsfeld maxbrunsfeld force-pushed the mb-rust-macro-injection branch from dc91d69 to 7bfd33c Jun 17, 2019

content(node) {
return node.lastChild;
},
includeChildren: true

This comment has been minimized.

Copy link
@maxbrunsfeld

maxbrunsfeld Jun 18, 2019

Author Contributor

Here's where we use the new API. We includeChildren so that the tokens within a token tree get included in the text that we re-parse as Rust.

this.languageMode = languageMode;
this.grammar = grammar;
this.tree = null;
this.currentParsePromise = null;
this.patchSinceCurrentParseStarted = null;
this.contentChildTypes = contentChildTypes;
this.depth = depth;
this.isOpaque = isOpaque;

This comment has been minimized.

Copy link
@maxbrunsfeld

maxbrunsfeld Jun 18, 2019

Author Contributor

I added a boolean flag on LanguageLayer that I'm calling isOpaque, which indicates whether or not the parent layer's token should show through this layer or not. It basically matches the value of includeChildren. So by default, when includeChildren is false, the parent layer's highlighting does show. But for injections with includeChildren: true, the parent layer is covered up.

this.iterators.push(iterator);
}
}
this.iterators.sort((a, b) => b.compare(a));

This comment has been minimized.

Copy link
@maxbrunsfeld

maxbrunsfeld Jun 18, 2019

Author Contributor

I've also done a refactor which made it easier to implement the new functionality. I've changed the invariant of the HighlightIterator a bit. Previously, the the iterators array always contained one HighlightLayerIterator for every HighlightLayer. Layer iterators that were no longer active had a done flag set to true. Now, once a HighlightLayerIterator advances to the end of its syntax tree, I remove it from the iterators array. This removes the need for the done flag.

The reason for this diff is that LayerIterator.seek can now return false, which means that the iterator has already fallen off the end of its tree. If seek returns false, we avoid adding the iterator to the array at all.

This comment has been minimized.

Copy link
@as-cii

as-cii Jun 18, 2019

Contributor

Sounds great! ⚡️ Not sure how big these arrays usually get, but seems nice not to have to pay for sorting those elements as well.

Show resolved Hide resolved src/tree-sitter-language-mode.js Outdated
Show resolved Hide resolved src/tree-sitter-language-mode.js
});
}
position = child.endPosition;
index = child.endIndex;

This comment has been minimized.

Copy link
@maxbrunsfeld

maxbrunsfeld Jun 18, 2019

Author Contributor

This is where we would subtract out the ranges of child nodes when computing injection ranges. The includeChildren flag suppresses that behavior.

@maxbrunsfeld

This comment has been minimized.

Copy link
Contributor Author

commented Jun 18, 2019

Here's an example of the improvement

Previous behavior - when typing inside of a macro call, temporarily invalid code would cause the macro to lose its Rust syntax highlighting. Notice it happening at different levels of nesting:

rust-macro-bad

New behavior - we get normal error recovery in the injected parse tree, so the highlighting doesn't change due to minor errors:

rust-macro-good

@maxbrunsfeld maxbrunsfeld force-pushed the mb-rust-macro-injection branch from 3480e31 to ea6d061 Jun 18, 2019

@maxbrunsfeld

This comment has been minimized.

Copy link
Contributor Author

commented Jun 18, 2019

Hey @as-cii or @nathansobo any chance you could give this a 👀 at some point? It contains @as-cii's tree-sitter-rust fix and some other improvements. I'm trying to stay more on top of the task of keeping all the Tree-sitter stuff up-to-date in Atom.

Unfortunately, the code changes here are kinda complex. Some of the naming might need some work too. I could talk about it synchronously at some point if it'd be helpful.

@maxbrunsfeld

This comment has been minimized.

Copy link
Contributor Author

commented Jun 18, 2019

Looks like a spurious CI failure in find-and-replace:

ResultsView
  copying items with core:copy
    it ResultsView copying items with core:copy copies the selected line onto the clipboard
      timeout: timed out after 60000 msec waiting for spec promise to resolve
  copying path with find-and-replace:copy-path
    it ResultsView copying path with find-and-replace:copy-path copies the selected file path to clipboard
      timeout: timed out after 60000 msec waiting for spec promise to resolve
    it ResultsView copying path with find-and-replace:copy-path copies the selected file path to the clipboard when there are multiple project folders
      timeout: timed out after 60000 msec waiting for spec promise to resolve
@as-cii
Copy link
Contributor

left a comment

I left a couple of comments and started reviewing this alongside @nathansobo as well, who may have additional feedback.

Overall, I think this approach works well and implementation-wise this seems pretty good already. As for isOpaque, I think it conveys the idea and fits well in the layer analogy, so I'd be okay with keeping it! 👍

this.iterators.push(iterator);
}
}
this.iterators.sort((a, b) => b.compare(a));

This comment has been minimized.

Copy link
@as-cii

as-cii Jun 18, 2019

Contributor

Sounds great! ⚡️ Not sure how big these arrays usually get, but seems nice not to have to pay for sorting those elements as well.

Show resolved Hide resolved src/tree-sitter-language-mode.js
Show resolved Hide resolved src/tree-sitter-language-mode.js
Show resolved Hide resolved src/tree-sitter-language-mode.js Outdated
return;
}
}
this.currentScopeIsCovered = false;

This comment has been minimized.

Copy link
@maxbrunsfeld

maxbrunsfeld Jun 18, 2019

Author Contributor

Ok, I changed the way that we suppress the scope boundaries from the outer layers. I got rid of the concept of "opaque" language layers, and instead I just detect duplicate scope boundaries individually. After each call to seek or moveToSuccessor, we check if the leading layer iterator is on top of a more deeply-nested iterator. If so, we hide its scope boundaries.

This comment has been minimized.

Copy link
@nathansobo

nathansobo Jun 18, 2019

Contributor

Are you concerned about boundaries from a shallower language mode that don't line up to the same offsets as boundaries in the deeper mode? I guess in the Rust macro scenario, the boundaries in the token tree produced by the outer grammar align with the boundaries in the inner grammar, because the tree produced by the inner grammar is going to align with those tokens. Will there ever be cases where scope boundaries in the outer grammar don't align?

If so, I'm wondering if it makes sense to somehow assign a depth on the HighlightIterator as a whole that can be incremented when we enter an opaque zone and decremented when it ends. Then you could just ignore boundaries from a language mode that's shallower than the current depth regardless of the offset at which they occur. You could sort the deeper layers to be later in the array and assign the depth whenever you start emitting token boundaries on a deeper layer. As soon as that layer gets popped from the array, you could decrement the depth to the previous layer.

I suppose if we're not worried about scope boundaries from shallower layers that don't align with boundaries at the deeper layer, this doesn't matter. Just thought I'd share the thought process in case it's valuable.

This comment has been minimized.

Copy link
@maxbrunsfeld

maxbrunsfeld Jun 18, 2019

Author Contributor

Are you concerned about boundaries from a shallower language mode that don't line up to the same offsets as boundaries in the deeper mode?

I'm not totally sure, because I don't have a use case in mind, but I actually think it's better that we allow non-aligned scopes like that from the parent layer to still be rendered.

Even though my initial implementation had this concept of "opaque" layers, which would cover up other layers, I don't really think that concept is needed. And if we ever do need it, we might not want to conflate it with the includeChildren option. I had conflated the two at first, but then I realized that all I was really trying to solve was the problem of duplicate scope boundaries from inner and outer layers.

It's also kind of tricky to implement the opaque language layer thing, because technically, even an opaque language layer could have "gaps" in it, in which outer layers' scopes show through, because there might be gaps in the parent layer 😬. For example, if there were some kind of ERB-like templating language that was based on Rust instead of Ruby, there might be "gaps" in the Rust code, so any nested Rust trees would need to respect those initial gaps.

My earlier implementation wouldn't have handled that situation correctly. I'm sure there's a way to do it, but it seems good to just drop the "opaque" concept for now.

@maxbrunsfeld maxbrunsfeld force-pushed the mb-rust-macro-injection branch from 714f7a3 to 5e6770b Jun 18, 2019

@maxbrunsfeld maxbrunsfeld merged commit 5e418c3 into master Jun 18, 2019

1 check passed

Atom Pull Requests #20190618.8 succeeded
Details

@maxbrunsfeld maxbrunsfeld deleted the mb-rust-macro-injection branch Jun 18, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.