Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding to assignment in grammar #25

Merged
merged 1 commit into from
Sep 28, 2021
Merged

Conversation

colleend
Copy link
Contributor

Fixes errors where lines like this.foo = bar don't parse.

To test: use

npx tree-sitter test

The this assignment test should pass.

Pad's Comment:

'This is nice, and it might simplify the original grammar of Kotlin, but I'm a bit worried that we deviate now even
more from the original grammar. Fwcd already deviated from the original grammar a little on how he is handling expressions. I'd rather see us go back to the original spec of Kotlin where there is a directly_assignable_expression,
and a postfix_unary_operator, etc.'

@fwcd
Copy link
Owner

fwcd commented Sep 28, 2021

Lgtm, thanks!

@fwcd fwcd merged commit 7da84e6 into fwcd:main Sep 28, 2021
@fwcd
Copy link
Owner

fwcd commented Oct 5, 2021

I am a bit concerned about the size of the generated parser. The grammar update after this PR (8d8a9c9) increased the size of parser.c from 28.6 MB to 44.5 MB, for example. Not sure if this is directly related to these changes, but this is already magnitudes larger than e.g. tree-sitter-java with 1.55 MB or tree-sitter-typescript with 5.75 MB. This might also be related to #11.

Any ideas on what makes the parser so large or how we could slim it down?

cc @aryx

@aryx
Copy link
Contributor

aryx commented Oct 5, 2021

good point, not sure why indeed. tree-sitter-c# is also pretty good, but still less with 18MB.

@aryx
Copy link
Contributor

aryx commented Oct 5, 2021

I remember the author of tree-sitter told me to look out for the number of states in parser.c, and it seems to be
26000, whereas in C# it's 6700. Not sure how to lower that number though.

@aryx
Copy link
Contributor

aryx commented Oct 5, 2021

Maybe @maxbrunsfeld knows some tricks to reduce the number of states? Max, do you see anything obviously wrong in the grammar.js for kotlin?

@fwcd
Copy link
Owner

fwcd commented Oct 5, 2021

After some investigation, I found that b651419 seems to have bumped the parser size from 2.4 MB to 30.9 MB. Perhaps control structures in Kotlin are simply hard to parse? The corresponding keywords may indicate very different constructs in different contexts, since they could be either statements or (possibly nested) expressions.

@aryx
Copy link
Contributor

aryx commented Oct 5, 2021

I wonder if using the 'word:' directive in grammar.js would help,
see https://tree-sitter.github.io/tree-sitter/creating-parsers#keyword-extraction

@aryx
Copy link
Contributor

aryx commented Oct 5, 2021

Ok this reduces parser.c to 8MB (from 48MB):

diff --git a/grammar.js b/grammar.js
index 6acfe27..83b57bc 100644
--- a/grammar.js
+++ b/grammar.js
@@ -54,6 +54,8 @@ const REAL_EXPONENT = token(seq(/[eE]/, optional(/[+-]/), DEC_DIGITS))
 module.exports = grammar({
   name: "kotlin",
 
+  word: $ => $._alpha_identifier,
+    
   conflicts: $ => [
     // Ambiguous when used in an explicit delegation expression,
     // since the '{' could either be interpreted as the class body
@@ -1088,10 +1090,12 @@ module.exports = grammar({
     // Identifiers
     // ==========
 
-    _lexical_identifier: $ => choice(
-      /[a-zA-Z_][a-zA-Z_0-9]*/,
-      /`[^\r\n`]+`/
-    ),
+      _lexical_identifier: $ => choice(
+         $._alpha_identifier,
+         $._backquote_identifier
+      ),
+      _alpha_identifier: $ => /[a-zA-Z_][a-zA-Z_0-9]*/,
+      _backquote_identifier: $ => /`[^\r\n`]+`/,
 
     _uni_character_literal: $ => seq(
       "\\",

@aryx
Copy link
Contributor

aryx commented Oct 5, 2021

and the STATE_COUNT is at 4119 now. Better!

@fwcd
Copy link
Owner

fwcd commented Oct 5, 2021

Added in 456c100, thanks!

@maxbrunsfeld
Copy link

Nice, glad you solved it!

@colleend
Copy link
Contributor Author

colleend commented Oct 5, 2021

this is awesome!

@aryx
Copy link
Contributor

aryx commented Oct 6, 2021

Hmm was too easy. Apparently the word: directive reduce the size of parser.c but we now get more parsing errors on our corpus.
We drop from 90% to 79% according to @colleend . @maxbrunsfeld any idea why the word: directive can change the behavior of the parser?

@aryx
Copy link
Contributor

aryx commented Oct 6, 2021

Also @fwcd would it be possible for you before some commits to run some regressions testing like running tree-sitter stat
on big corpus, just to double check we don't regress.

@aryx
Copy link
Contributor

aryx commented Oct 6, 2021

We were thinking at some point at R2C to add some infrastructure to make it easy for tree-sitter projects to run those parsing regressions stats in CI but we're not there yet.

@fwcd
Copy link
Owner

fwcd commented Oct 6, 2021

Ideally we would add the critical snippets that currently don't parse correctly to the corpus in this repo, to ensure that we don't regress on anything. But I agree, running the bigger corpus, preferably in CI, would be great.

@colleend
Copy link
Contributor Author

colleend commented Oct 8, 2021

I looked into this a bit more, and found two test cases that are erroring now that weren't erroring before:

1):

 bar.foo()
     .show()

Before, this parsed to:

(source_file [0, 0] - [2, 0]
  (call_expression [0, 0] - [0, 9]
    (navigation_expression [0, 0] - [0, 7]
      (simple_identifier [0, 0] - [0, 3])
      (navigation_suffix [0, 3] - [0, 7]
        (simple_identifier [0, 4] - [0, 7])))
    (call_suffix [0, 7] - [0, 9]
      (value_arguments [0, 7] - [0, 9])))
  (call_expression [1, 0] - [1, 11]
    (navigation_expression [1, 0] - [1, 9]
      (this_expression [1, 0] - [1, 0])
      (navigation_suffix [1, 4] - [1, 9]
        (simple_identifier [1, 5] - [1, 9])))
    (call_suffix [1, 9] - [1, 11]                                                                                                                         X
      (value_arguments [1, 9] - [1, 11]))))
test.kt	0 ms	(MISSING "this" [1, 0] - [1, 0])

Now it parses to:

(source_file [0, 0] - [2, 0]
  (call_expression [0, 0] - [0, 9]
    (navigation_expression [0, 0] - [0, 7]
      (simple_identifier [0, 0] - [0, 3])
      (navigation_suffix [0, 3] - [0, 7]
        (simple_identifier [0, 4] - [0, 7])))
    (call_suffix [0, 7] - [0, 9]
      (value_arguments [0, 7] - [0, 9])))
  (ERROR [1, 3] - [1, 4])
  (call_expression [1, 4] - [1, 9]
    (simple_identifier [1, 4] - [1, 7])
    (call_suffix [1, 7] - [1, 9]
      (value_arguments [1, 7] - [1, 9]))))
test.kt	0 ms	(ERROR [1, 3] - [1, 4])
expect(1)

Before, this parsed to:

(source_file [0, 2] - [1, 0]
  (call_expression [0, 2] - [0, 11]
    (simple_identifier [0, 2] - [0, 8])
    (call_suffix [0, 8] - [0, 11]
      (value_arguments [0, 8] - [0, 11]
        (value_argument [0, 9] - [0, 10]
          (integer_literal [0, 9] - [0, 10]))))))

Now it parses to:

(source_file [0, 0] - [1, 0]
  (ERROR [0, 0] - [0, 6]
    (platform_modifier [0, 0] - [0, 6]))
  (parenthesized_expression [0, 6] - [0, 9]
    (integer_literal [0, 7] - [0, 8])))
test2.kt	0 ms	(ERROR [0, 0] - [0, 6])

@fwcd
Copy link
Owner

fwcd commented Oct 11, 2021

@colleend The failure in the second example was a regression that I mistakenly introduced in 5f18577. Apparently, tree-sitter assigns literal strings a higher precedence than regexes, causing it to never consider the conflict between expect as a platform modifier and as an identifier. That should be fixed in 88faefe. Thanks!

Regarding your first example, I am not sure whether that is the correct parse in the first place. Shouldn't it be a single call expression?

@colleend
Copy link
Contributor Author

Interesting! Didn't know that tree-sitter did that -- thanks for fixing that 🙏

On the topic of the first example -- good point, that should definitely be a single call expression (probably something like

(source_file [0, 0] - [1, 0]
  (call_expression [0, 0] - [0, 15]
    (navigation_expression [0, 0] - [0, 13]
      (call_expression [0, 0] - [0, 9]
        (navigation_expression [0, 0] - [0, 7]
          (simple_identifier [0, 0] - [0, 3])
          (navigation_suffix [0, 3] - [0, 7]
            (simple_identifier [0, 4] - [0, 7])))
        (call_suffix [0, 7] - [0, 9]
          (value_arguments [0, 7] - [0, 9])))
      (navigation_suffix [0, 9] - [0, 13]
        (simple_identifier [0, 10] - [0, 13])))
    (call_suffix [0, 13] - [0, 15]
      (value_arguments [0, 13] - [0, 15]))))

I'll look into this case and see what we can change to make this parsing correct!

@aryx
Copy link
Contributor

aryx commented Oct 18, 2021

Hmm something weird is this test.kt 0 ms (MISSING "this" [1, 0] - [1, 0])
No idea why when there is no word: section, the parser automatically insert 'this' in the code
to be able to parse.
@maxbrunsfeld where this MISSING error recovery mechanism is documented? And why does it interact weird with the word: directive?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants