Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

different nodes for multi-clause directives? #1

Closed
the-mikedavis opened this issue Jan 7, 2022 · 5 comments · Fixed by #4
Closed

different nodes for multi-clause directives? #1

the-mikedavis opened this issue Jan 7, 2022 · 5 comments · Fixed by #4

Comments

@the-mikedavis
Copy link
Contributor

the-mikedavis commented Jan 7, 2022

👋 happy new year @connorlay! Sorry for this massive wall of text I'm dropping in your issue tracker 😅

Helix is right on the cusp of having combined injections (helix-editor/helix#1378) which make injections for multi-directive expressions possible (e.g. <%= if true do %>..<% end %> having its elixir expressions joined and handled as one combined tree-sitter node).

I noticed that there's some odd behavior with heex and eex with combined injections because the $.code nodes don't contain a trailing newline, which ends up making non-multi-directive expressions (expressions not split across multiple directives) get parsed as errors by the elixir grammar (in which newlines are syntactically important). For example, I notice some errors popping up when you have two function calls right next to each other:

combined-no-semicolon

So the directives are being combined and the resulting elixir code comes out to:

csrf_meta_tag() live_title_tag assigns[:page_title] || "Foo", suffix: " · Foo"

Which the Elixir parser recognizes as having errors. Put in semicolons to separate the expressions, though, and it all looks great:

combined-semicolon

Why doesn't this happen with tree-sitter-embedded-template for the injected ruby (erb) and javascript (ejs)? I looked into those grammars and they're super permissive about newlines, so even take an ejs template like this one from the expressjs repo:

  <% posts.forEach(function(post) { %>
    <dt><%= post.title %></dt>
    <dd><%= post.body %></dd>
  <% }) %>

Which comes out to this javascript:

posts.forEach(function(post) { post.title post.body })

And tree-sitter-javascript says that this is valid, even though it's in fact nonsense javascript afaik. The same happens for erb/ruby. So really one approach for fixing this problem is for me to open up an issue/PR in tree-sitter-elixir that makes newline less syntactically important. (I really like how close-to-the-parser tree-sitter-elixir currently is, though, so I'd prefer to avoid doing that.)

Initially I was thinking that it might make sense for there to be a new directive in tree-sitter's query language for specifying a character used to join combined expressions (I think \n would work in this case), but that got me curious how the HEEx and ultimately EEx compiler do this. Are those compilers inserting newlines? That led me to EEx.Tokenizer (see this clause of tokenize/6 and token_key/2). The tokenizer separates out directives into the records :start_expr, :middle_expr and :end_expr based on the contained elixir tokens when the expression spans multiple directives, and :expr otherwise. So my current thinking is that it'd be best for tree-sitter-eex and tree-sitter-heex (and maybe tree-sitter-surface) to do this same separation so that you would have a node type for complete expressions ($.directive?) and a separate one for partially valid expressions that need to be combined ($.multi_clause_directive?). That way you could write a query for the injections.scm like

([
   (expression (code) @injection.content) ; for heex
   (directive (code) @injection.content)
 ]
 (#set! injection.language "elixir"))

((multi_clause_directive (code) @injection.content)
 (#set! injection.combined)
 (#set! injection.language "elixir"))

I'm a little worried about how difficult this could be implementation-wise. Would eex and heex (and surface?) grammars need to take a dependency on the elixir grammar to correctly determine which directives are partial expressions? Can we accomplish something reasonably close without pulling in the grammar? I'll hack around with this grammar and see if this is easier than I'm thinking 🕵️

@the-mikedavis
Copy link
Contributor Author

the-mikedavis commented Jan 7, 2022

If I'm reading EEx.Tokenizer rules correctly from token_keys/2:

if the expression... then it's a
starts with "end" and ends with "do" middle_expr
ends with a "do" start_expr
ends with a block-identifier (?) middle_expr
starts with an "end" and ends with a stab operator ("->") middle_expr
ends with a stab operator and (other more complicated rules) start_expr/middle_expr
"end" preceeded by /[\[\(\{]*/ end_expr
otherwise expr

Despite my github-code-search-fu I can't seem to figure out what a block_identifier token is 🤔

And that "end preceeded by..." rule code looks a little weird to me: why is closing_bracket?/1 checking against ~w"( [ {"a? I can't really imagine any code that would look like that.

I think we can map that out to simpler rules without pulling in the elixir grammar given that we may not care what's a start/middle/end and just that it is/is-not an expr.

It's a multi-clause directive if the code:

  • ends with "do", block-identifier, or "->"
  • starts with seq(/[\[\(\{]*/, "end")

@the-mikedavis
Copy link
Contributor Author

the closing_bracket?/1 thing might be a bug, I'll investigate that separately elixir-lang/elixir@94fa4e5#diff-37ace2010a0277f8d6f8c940c500ceeecb5adf03ca2609fef69aa8b3b0f5a391L14

@connorlay
Copy link
Owner

Hi @the-mikedavis, happy 2022 and thank you for the investigation into this issue! I agree we should avoid embedding the entire tree-sitter-elixir grammar if possible. I'll review your PRs and see if I spot any issues 😄

@the-mikedavis
Copy link
Contributor Author

Ok I think I did not think this all the way through 😅

There's this case that comes up here and there on large templates:

eex-two

Which would look like this to the elixir grammar because of the combined injections

ex-two

I'm not sure what the best place to fix this is. On the one hand, like the javascript+jsx example above, it could make sense to fix this by having the elixir grammar allow space to be a $._terminator, making the elixir grammar more permissive. On the other hand it might be fixable in this grammar by splitting up the new multi-clause directives into start/middle/end-expr like the EEx.Tokenizer does. That way you could write the injections like

((start_expression) @injection.content
 (middle_expression)* @injection.content
 (end_expression) @injection.content
 (#set! injection.combined)
 (#set! injection.include-children)
 (#set! injection.language "elixir"))

for multi-clause directives, and you would only be combining each expression, so you wouldn't need any change in the elixir grammar.

I'll give this a try here, but if that doesn't work I'll open an issue in the elixir grammar 👍

@the-mikedavis
Copy link
Contributor Author

I tried out reorganizing the expressions so you have (directive_group)s that start with a start/middle directive, contain pretty much anything, and end with an ending directive, (branch is here) but unfortunately this query:

((directive_group (directive (partial_expression) @injection.content))
 (#set! injection.combined)
 (#set! injection.include-children)
 (#set! injection.language "elixir"))

doesn't work the way I hoped. It combines the (partial_expression)s from all (directive_group)s despite them being different matches, so the problem persists.

I think this means that it either has to be solved in the elixir grammar or injections need a new #set! property to change the behavior. I'll noodle on it some more but I don't think it's possible to fix this only with changes to the eex grammar

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants