Skip to content
This repository has been archived by the owner on Dec 15, 2022. It is now read-only.

Update draft tree-sitter grammar in #303 #438

Draft
wants to merge 21 commits into
base: master
Choose a base branch
from

Conversation

claytonrcarter
Copy link

@claytonrcarter claytonrcarter commented Aug 2, 2021

Requirements

  • ✅ Filling out the template is required.
  • ✅ All new code requires tests

Description of the Change

⚠️ This PR builds on the mb-tree-sitter-parser branch from #303. If this is merged, #303 can be closed. ⚠️

  1. updates Add tree-sitter PHP grammar #303 to master, resolving conflicts
  2. updates tree-sitter-php to v0.16.2
  3. ports most existing specs from the current TextMate grammar to run against the new tree-sitter grammar
  4. updates the changes started in Add tree-sitter PHP grammar #303 to "finish" the tree-sitter PHP grammar
  5. changes Add tree-sitter PHP grammar #303 to use source.php instead of text.html.php. This matches the TM grammar and fixes things like snippets.

I say that this "finishes" the tree-sitter grammar because there are still many current lanuage features that aren't available in tree-sitter-php 0.16.2. Updating to 0.19 (which is itself currently blocked by atom/atom#22130) won't even help w/ these, as many of them have only been added to TS-php recently. Until this is resolved, the following language features are being parsed as errors:

Current Status

This is still very much ⚠️ draft ⚠️ , but appears to be usable enough for day to day work (at least for me). Most of the TM grammar specs were ported to TS w/o issue, though many were left out b/c of the current status of tree-sitter-php. I am using this for my day to day work, but otherwise I would suggest that it's blocking on atom/atom#22130 and on a new release of tree-sitter-php that includes all of the modern language goodies.

TODO

  • Port as many tests as possible from TextMate grammar to tree-sitter
  • Clean up the tests to reduce duplication. (There are lots of places where I ported expectations verbatim, but they're really just testing parsing of the language. These aren't as relevant anymore since parsing is handled by tree-sitter, not this grammar.)
  • once all specs are passing, clean up, reorganize and simplify the grammar by removing duplicate selectors, etc
  • folds seem to be working OK, but I don't use them very much, so it would be great if someone else could confirm that they are working as expected
  • Figure out how to handle scopeing for builtin classes and functions
    • the TM grammar relies are large regexs for these, but these don't work w/ Atom's TS grammars
  • Still having trouble scoping some basic punctuation:
    • comment delimiters: //, /*, */ and /**
    • namespace separator: \ (as in Foo\Bar)
    • opening and closing strings: ie " and '
    • decimal/float separator: ie . in 1.23

Breaking Changes:

scopes for opening & closing punctuation: tree-sitter grammar includes opening & closing punctuation in the meta scopes, while the original TM grammar does not. For example, in function foo($arg) { echo 'bar'; }, the TM grammar applies scope meta.function.parameters.php to $arg and meta.function.body.php to echo 'bar';, while this TS grammar applies scope meta.function.parameters.php to ($arg) and meta.function.body.php to { echo 'bar'; }. This is due to how TS parses and not b/c of any decision of mine.

In order to maintain BC on this, we would need to select "everything but the opening and closing punctuation (brace, paren, bracket, etc)" and I don't believe that this is possible w/ Atom's current TS selector implementation. This applies to lots of things: class definitions, use declarations, arrays, function/method parameters, etc.

scopes for semicolons: same as above. The TM grammar did not include the semicolon in it's meta scopes, but TS does include them in statements. This means that, for example, use A; will now have meta.use.php applied to the semicolon, whereas the TM grammar would stop that scope at A.

scopes for typehinted parameters: the TM grammar supports 3 scopes for function/method parameter defns: meta.function.parameter.no-default.php when no default value is provided, meta.function.parameter.default.php when a
default value is provided, and meta.function.parameter.typehinted.php when a typehint is provided and which overrides the previous 2 regardless of whether a default value has been provided. Because of how TS parses, I don't know how to duplicate this exact behavior. Instead, this grammar leaves the default/no-default scopes in place for these parameters and then applies an additional meta.function.parameter.typehint.php scope just to the typehint.

For example, in function foo($arg1, $arg2 = 2, int $arg3, int $arg4 = 4) {}, the TM grammar would scope the first param as no-default, the second as default, and the 3rd and 4th would both be scoped as typehinted. This TS grammar instead scopes the 1st and 3rd params both as no-default, the 2nd and 4th as default, and then applies the typehint scope only to the int typehints.

In order to maintain BC on this, we would need to select "parameter nodes that contain a typehint node" and I don't believe that this is possible w/ Atom's current TS selector implemenation.

heredoc and nowdoc indetifiers: as far as I can tell, TS does not expose the heredoc/nowdoc identifiers as a named node, which makes it impossible to scope them as the TM grammar does. For example, in <<<HEREDOC, TM would scope HEREDOC as keyword.operator.heredoc.php, but this grammar is unable to do so. The overall heredoc/nowdoc string is scoped correctly, but the identifiers are no longer scoped.

string interpolation: along w/ newer language features, string interpolation is currently unsupported by TSPHP, so is not supported here either.

PHPDoc & language-todo: PHPDoc and TODOs are no longer highlighted. Regarding PHPDoc, see issue @ TSPHP regarding PHPDoc. Basically, the recommendation is to use a separate tree-sitter parser just for phpdocs. As for language-todo, it looks like it just doesn't work w/ any tree-sitter grammars, according to this long standing issue. There are initial drafts of TS parsers for phpdoc and TODO, but I haven't explored what it would take to get them updated ad usable for this.

Alternate Designs

The main alternate I thought about was leaving the new specs in Coffeescript to match the existing specs. They would probably be alot easier to read in Coffeescript, as JS+prettier is turning 1 line of CS into 5-8+ lines of JS.

Also, this PR is attempting to maintain the legacy/textmate scopes as closely as possible. At the risk of introducing some breaking changes, we could instead follow some of the other tree-sitter grammars and implement a much simpler, stripped down set of scopes. For example, the JS and C TS grammars scope all operators as keyword.operator (+/- a language scope like .js) while this grammar is applying add'l scopes like .arithmetic, .assignment, etc based on the type of operator. Another example: the JS grammar seems to scope all parens/brackets/braces the same, while this grammar applies add'l scopes based on if the punctuation is used in a function, class, method, use, namescape, etc. Not bad, per se, but it does set this grammar apart from the others a little bit, as far as I can tell. See this draft RFC: atom/atom#19623 and this issue for more background: atom/language-javascript#649

Benefits

Finishes up a long-standing PR; speeds up PHP highlighting, folding, indenting, etc in Atom.

Possible Drawbacks

Until atom/atom#22130 is resolved, and tree-sitter-php is released w/ support for newer language features, the tree-sitter grammar will lack support for newer language features that are already supported by the existing TM grammar. This could be seen as a regression.

Applicable Issues

Previous PR this is based on: #303
tree-sitter-php issue to add full language support: tree-sitter/tree-sitter-php#58
PR to update Atom to tree-sitter 0.19.0: atom/atom#22130

@claytonrcarter
Copy link
Author

claytonrcarter commented Aug 2, 2021

Oh, hmm. Should I not be including the merge commit from master in this PR? What's the best practice for that? Technically, the only changes in this PR are to files

grammars/tree-sitter-php.cson
specs/tree-sitter-spec.js
package.json
package-lock.json

All of the other changes are from master...

@claytonrcarter
Copy link
Author

Also, any assistance on debugging the window's builds would be much appreciated. Specs are passing on linux and macOS, but it looks like something is causing the tree-sitter grammar to not load on Windows because the specs appear to be using the existing TM grammar instead of the TS grammar.

Possibly related: atom/atom#22675

@KapitanOczywisty
Copy link
Contributor

Best would be to rebase changes after current master's head. Just make sure you don't change commits from master. Easiest way which comes to mind: You can make patch from your commits (you can squash them) apply onto original mb-tree-sitter-parser and then make rebase.

And make PR to master.

About windows build: cleanup commits and I'll try this on my end, but you should not be concerned about that if this is atom's problem.

@claytonrcarter claytonrcarter changed the base branch from mb-tree-sitter-parser to master August 2, 2021 12:42
@claytonrcarter
Copy link
Author

OK thanks! I've rebased this branch (including original commits from #303) onto master, and updated this PR to target master. PLMK if I did anything wrong in that process.

B/c this is still draft, I would prefer to keep my commits unsquashed as I work, but I agree that they should be squashed before (if?) this gets merged.

@KapitanOczywisty
Copy link
Contributor

Looks good now.

@KapitanOczywisty
Copy link
Contributor

Seems like windows problems are caused by different node versions used for tree-sitter-php compilation. It will probably solve itself when that package will be bundled with atom itself, what will happen after update of package.json on master.

@claytonrcarter
Copy link
Author

I'm ported over as many of the existing tests as possible, and things are largely working OK, aside from the BCs mentioned in the description, many of which are out of our control at this point. Two things that are in our control, though are:

  • figuring out how to scope the punctuation that I mentioned in the descriptions. I don't know why it's so tricky to scope separators in classnames and comments, etc, but it's really evading me.
  • figuring a way to scope all of the standard library builtins. It would be great if the TS grammar could share the regexs used by the TM grammar, but Atom's TS support appears to limit how the regexs are used. eg it appears that multiline regexs w/ (?xi) are not allowed, and these are used extensively by the TM grammar.

@KapitanOczywisty
Copy link
Contributor

KapitanOczywisty commented Aug 9, 2021

About punctuation problems: get a few most popular themes, get some php source codes like symfony, and check if it's looking ok. TM has more scopes that is ever used, so if not needed - move on.

  • figuring a way to scope all of the standard library builtins. It would be great if the TS grammar could share the regexs used by the TM grammar, but Atom's TS support appears to limit how the regexs are used. eg it appears that multiline regexs w/ (?xi) are not allowed, and these are used extensively by the TM grammar.

Function and constant names are out of date by now. I'm thinking about regenerating these from php source. For the first release could be probably skipped.

I'll try to get tree-sitter running on my machine to help with some grammar, but maybe you could get some help in tree-sitter or tree-sitter-php repositories.
Edit: I've got windows running with node-gyp rebuild -m --target=9.4.4 --dist-url=https://electronjs.org/headers

@claytonrcarter
Copy link
Author

Hi, quick update: this seems to be working just fine for day-to-day use for non-template PHP files.

I'm playing w/ how to support HTML+PHP in the same document. More on this later, but the main scope of this grammar appears to need to be text.html.php as I've encountered at least 1 package (php-ide-serenata) that is expecting that scope instead of source.php. (This makes sense for BC, since the current grammar uses text.html.php)

I'm also playing w/ a tree-sitter-todo package to enable TODO markers in this grammar, and perhaps other tree-sitter grammars. (This is a long standing issue not limited to this grammar, see atom/language-todo#82. Also, this is definitely a nice-to-have, and not a priority.)

I'm working on developing a tree-sitter phpdoc parser so that we can get phpdoc scopes working. There was already some code out there (https://github.com/john-nguyen09/tree-sitter-phpdoc) that has gotten me started. So far this is also working fine with the exception that the parser is looking for "correct" syntax, which means that certain things the current PHPdoc tests were allowing won't work b/c they were actually invalid phpdoc. (Ie @var, @param, etc technically require a variable name, but these were missing from the tests.)

Now I have an important question I need some input on: unless we get a BC release of tree-sitter-php, this PR will be stuck w/o support for lots of modern PHP features until Atom updates it's tree-sitter support. As an alternative, we could maintain a lightweight fork of tree-sitter-php that simple downgrades the version of tree-sitter that is used to generate the grammar, thereby allowing the modern syntax from tree-sitter-php@master to work to work w/I current Atom. I don't love the idea of maintaining fork like that, but I also doubt that it would be too much work, and I like that it would remove one (of many) blocking issues related to this work. (I'm using this approach locally to add scopes for arrow functions, etc and it's working just fine.) What do you think?

Function and constant names are out of date by now. I'm thinking about regenerating these from php source. For the first release could be probably skipped.

OK, maybe what I'll do is just manually add the built-ins that I encounter as I work just to get started, when we can work on a more robust a/o long term solution for later releases.

@claytonrcarter
Copy link
Author

Another quick update: this is not finished, but it has been my daily PHP driver for ~4 months now and -- as far as I can tell -- seems to work just fine. There are still a number of regressions between the TS grammar and the TM grammar, and most of these are noted in the PR description.

There are 2 main things that have changed since I last posted an update:

  • I got impatient waiting for some compatibility updates/downgrades and have started using a custom "fork" of tree-sitter-php that is really just master but w/ the parser generated by tree-sitter 0.17 instead of 0.20. This means it's using an ABI that's compatible w/ Atom's version of tree-sitter. I don't see a simpler way around this until Atom updates it's version of tree-sitter
  • Facing the difficulties of the text.html.php/source.php question, I changed the parser structure to match how the TM grammar does it: there are separate tree-sitter parsers for the "HTML" part of every PHP file (aka tree-sitter-embedded-php and another for the actual PHP parts (aka tree-sitter-php). The HTML parser is just a lightweight wrapper that injects the PHP parser. This allows us to properly (I think!, but ??) apply the text.html.php scope separately from the source.php scope, and closely matches how the TM grammar does it.
  • Not a big change, but this also includes phpdoc support.

I think that these are mostly working correctly, but there are a number of differences from the TM grammar that are flagged in the tree-sitter-spec.js file.

NOTE/WARNING/WATCH OUT
In order to get this working, I created 3 separate dependencies in package.js:

I know that this is not maintainable in the long run, but it felt like a decent way to get the ball rolling until something more sustainable can be figured out.

A few months have passed since I worked on this code, and I see now that there are a few updates to bring in from tree-sitter-php (not many) and from language-php@0.48.1. (As far as I can recall, this is up-to-date w/ language-php@0.48.)

@sadick254 @darangi Would greatly appreciate any input or guidance on this, if you can spare the time.

@claytonrcarter
Copy link
Author

It looks like CI is failing, but the tests are passing for me locally ... of course. 🙄 I'll try to look into it, but I won't have time to do so today.

@claytonrcarter
Copy link
Author

I've been chipping away at some issues w/ the embedding grammar the past few days. This isn't something that I've looked at much before, but I'm tinkering w/ having some add'l tests to confirm that the embedding tree-sitter grammar is up to snuff w/ the embedding TextMate grammar.

In particular, shebangs (eg #!/usr/bin/env php) currently aren't recognized or highlighted by the tree-sitter grammar. Locally, I have them recognized, but not yet highlighted. This is sort of interesting (at least, to me 😄) b/c the shebang needs to come before the opening php tag. W/o a special case for this, though, anything outside of php tags is considered HTML, and parsed/highlighted as such. In order for this to work w/i Atom's tree-sitter system, I think I'll need to work on an exception where the first line (and only the first line) can be considered "PHP" even if it's not w/i the start tag, but only if it's a shebang. We'll see...

@mikehaertl
Copy link

@claytonrcarter I came here just by coincidence from the linked issue in the ts php parser. But I stumbled over hash bang handling in the js parser yesterday: https://github.com/tree-sitter/tree-sitter-javascript/blob/master/grammar.js#L92

Maybe it helps.

Ben3eeE and others added 11 commits April 27, 2022 20:17
Also remove scope for variable and for syntax-error
Fixes scopes of anonymous nodes
Scopes constant.language.php for language constants
Scope the class of catch expressions as a class name
Change object_creation_expression to scope as entity.name.type.class
This uses a custom, "off label" package that is just the up-to-date tree-sitter-php rebuilt to use an older tree-sitter ABI that is compatible w/ the older verion that ships w/ Atom.
These are exhaustive to tests ported from the TextMate grammar test for
maximum compatibility. Never-the-less, there are still lots of FIXME
comments to flag area that still need improvement or work.
@claytonrcarter
Copy link
Author

Just rebased this to update it and squashed my ~70 commits down into a more reasonable handful. Though it's not without some issues, this is still my daily driver.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants