Adding to assignment in grammar #25

colleend · 2021-09-27T19:42:49Z

Fixes errors where lines like this.foo = bar don't parse.

To test: use

npx tree-sitter test

The this assignment test should pass.

Pad's Comment:

'This is nice, and it might simplify the original grammar of Kotlin, but I'm a bit worried that we deviate now even
more from the original grammar. Fwcd already deviated from the original grammar a little on how he is handling expressions. I'd rather see us go back to the original spec of Kotlin where there is a directly_assignable_expression,
and a postfix_unary_operator, etc.'

fwcd · 2021-09-28T10:43:51Z

Lgtm, thanks!

fwcd · 2021-10-05T08:45:42Z

I am a bit concerned about the size of the generated parser. The grammar update after this PR (8d8a9c9) increased the size of parser.c from 28.6 MB to 44.5 MB, for example. Not sure if this is directly related to these changes, but this is already magnitudes larger than e.g. tree-sitter-java with 1.55 MB or tree-sitter-typescript with 5.75 MB. This might also be related to #11.

Any ideas on what makes the parser so large or how we could slim it down?

cc @aryx

aryx · 2021-10-05T09:15:26Z

good point, not sure why indeed. tree-sitter-c# is also pretty good, but still less with 18MB.

aryx · 2021-10-05T09:17:10Z

I remember the author of tree-sitter told me to look out for the number of states in parser.c, and it seems to be
26000, whereas in C# it's 6700. Not sure how to lower that number though.

aryx · 2021-10-05T09:21:58Z

Maybe @maxbrunsfeld knows some tricks to reduce the number of states? Max, do you see anything obviously wrong in the grammar.js for kotlin?

fwcd · 2021-10-05T09:34:27Z

After some investigation, I found that b651419 seems to have bumped the parser size from 2.4 MB to 30.9 MB. Perhaps control structures in Kotlin are simply hard to parse? The corresponding keywords may indicate very different constructs in different contexts, since they could be either statements or (possibly nested) expressions.

aryx · 2021-10-05T14:10:44Z

I wonder if using the 'word:' directive in grammar.js would help,
see https://tree-sitter.github.io/tree-sitter/creating-parsers#keyword-extraction

aryx · 2021-10-05T14:18:58Z

Ok this reduces parser.c to 8MB (from 48MB):

diff --git a/grammar.js b/grammar.js
index 6acfe27..83b57bc 100644
--- a/grammar.js
+++ b/grammar.js
@@ -54,6 +54,8 @@ const REAL_EXPONENT = token(seq(/[eE]/, optional(/[+-]/), DEC_DIGITS))
 module.exports = grammar({
   name: "kotlin",
 
+  word: $ => $._alpha_identifier,
+    
   conflicts: $ => [
     // Ambiguous when used in an explicit delegation expression,
     // since the '{' could either be interpreted as the class body
@@ -1088,10 +1090,12 @@ module.exports = grammar({
     // Identifiers
     // ==========
 
-    _lexical_identifier: $ => choice(
-      /[a-zA-Z_][a-zA-Z_0-9]*/,
-      /`[^\r\n`]+`/
-    ),
+      _lexical_identifier: $ => choice(
+         $._alpha_identifier,
+         $._backquote_identifier
+      ),
+      _alpha_identifier: $ => /[a-zA-Z_][a-zA-Z_0-9]*/,
+      _backquote_identifier: $ => /`[^\r\n`]+`/,
 
     _uni_character_literal: $ => seq(
       "\\",

aryx · 2021-10-05T14:19:48Z

and the STATE_COUNT is at 4119 now. Better!

fwcd · 2021-10-05T14:53:40Z

Added in 456c100, thanks!

maxbrunsfeld · 2021-10-05T16:47:46Z

Nice, glad you solved it!

colleend · 2021-10-05T19:09:00Z

this is awesome!

aryx · 2021-10-06T16:04:58Z

Hmm was too easy. Apparently the word: directive reduce the size of parser.c but we now get more parsing errors on our corpus.
We drop from 90% to 79% according to @colleend . @maxbrunsfeld any idea why the word: directive can change the behavior of the parser?

aryx · 2021-10-06T16:05:37Z

Also @fwcd would it be possible for you before some commits to run some regressions testing like running tree-sitter stat
on big corpus, just to double check we don't regress.

aryx · 2021-10-06T16:06:07Z

We were thinking at some point at R2C to add some infrastructure to make it easy for tree-sitter projects to run those parsing regressions stats in CI but we're not there yet.

fwcd · 2021-10-06T20:53:53Z

Ideally we would add the critical snippets that currently don't parse correctly to the corpus in this repo, to ensure that we don't regress on anything. But I agree, running the bigger corpus, preferably in CI, would be great.

colleend · 2021-10-08T22:01:40Z

I looked into this a bit more, and found two test cases that are erroring now that weren't erroring before:

1):

 bar.foo()
     .show()

Before, this parsed to:

(source_file [0, 0] - [2, 0]
  (call_expression [0, 0] - [0, 9]
    (navigation_expression [0, 0] - [0, 7]
      (simple_identifier [0, 0] - [0, 3])
      (navigation_suffix [0, 3] - [0, 7]
        (simple_identifier [0, 4] - [0, 7])))
    (call_suffix [0, 7] - [0, 9]
      (value_arguments [0, 7] - [0, 9])))
  (call_expression [1, 0] - [1, 11]
    (navigation_expression [1, 0] - [1, 9]
      (this_expression [1, 0] - [1, 0])
      (navigation_suffix [1, 4] - [1, 9]
        (simple_identifier [1, 5] - [1, 9])))
    (call_suffix [1, 9] - [1, 11]                                                                                                                         X
      (value_arguments [1, 9] - [1, 11]))))
test.kt	0 ms	(MISSING "this" [1, 0] - [1, 0])

Now it parses to:

(source_file [0, 0] - [2, 0]
  (call_expression [0, 0] - [0, 9]
    (navigation_expression [0, 0] - [0, 7]
      (simple_identifier [0, 0] - [0, 3])
      (navigation_suffix [0, 3] - [0, 7]
        (simple_identifier [0, 4] - [0, 7])))
    (call_suffix [0, 7] - [0, 9]
      (value_arguments [0, 7] - [0, 9])))
  (ERROR [1, 3] - [1, 4])
  (call_expression [1, 4] - [1, 9]
    (simple_identifier [1, 4] - [1, 7])
    (call_suffix [1, 7] - [1, 9]
      (value_arguments [1, 7] - [1, 9]))))
test.kt	0 ms	(ERROR [1, 3] - [1, 4])

expect(1)

Before, this parsed to:

(source_file [0, 2] - [1, 0]
  (call_expression [0, 2] - [0, 11]
    (simple_identifier [0, 2] - [0, 8])
    (call_suffix [0, 8] - [0, 11]
      (value_arguments [0, 8] - [0, 11]
        (value_argument [0, 9] - [0, 10]
          (integer_literal [0, 9] - [0, 10]))))))

Now it parses to:

(source_file [0, 0] - [1, 0]
  (ERROR [0, 0] - [0, 6]
    (platform_modifier [0, 0] - [0, 6]))
  (parenthesized_expression [0, 6] - [0, 9]
    (integer_literal [0, 7] - [0, 8])))
test2.kt	0 ms	(ERROR [0, 0] - [0, 6])

fwcd · 2021-10-11T20:59:26Z

@colleend The failure in the second example was a regression that I mistakenly introduced in 5f18577. Apparently, tree-sitter assigns literal strings a higher precedence than regexes, causing it to never consider the conflict between expect as a platform modifier and as an identifier. That should be fixed in 88faefe. Thanks!

Regarding your first example, I am not sure whether that is the correct parse in the first place. Shouldn't it be a single call expression?

colleend · 2021-10-11T22:54:50Z

Interesting! Didn't know that tree-sitter did that -- thanks for fixing that 🙏

On the topic of the first example -- good point, that should definitely be a single call expression (probably something like

(source_file [0, 0] - [1, 0]
  (call_expression [0, 0] - [0, 15]
    (navigation_expression [0, 0] - [0, 13]
      (call_expression [0, 0] - [0, 9]
        (navigation_expression [0, 0] - [0, 7]
          (simple_identifier [0, 0] - [0, 3])
          (navigation_suffix [0, 3] - [0, 7]
            (simple_identifier [0, 4] - [0, 7])))
        (call_suffix [0, 7] - [0, 9]
          (value_arguments [0, 7] - [0, 9])))
      (navigation_suffix [0, 9] - [0, 13]
        (simple_identifier [0, 10] - [0, 13])))
    (call_suffix [0, 13] - [0, 15]
      (value_arguments [0, 13] - [0, 15]))))

I'll look into this case and see what we can change to make this parsing correct!

aryx · 2021-10-18T13:04:25Z

Hmm something weird is this test.kt 0 ms (MISSING "this" [1, 0] - [1, 0])
No idea why when there is no word: section, the parser automatically insert 'this' in the code
to be able to parse.
@maxbrunsfeld where this MISSING error recovery mechanism is documented? And why does it interact weird with the word: directive?

rebase on main

7da84e6

colleend mentioned this pull request Sep 27, 2021

rebase on main semgrep/tree-sitter-kotlin#5

Closed

fwcd merged commit 7da84e6 into fwcd:main Sep 28, 2021

This was referenced Oct 18, 2021

newline in method chain calls leads to parsing errors #32

Closed

Add support for MISSING nodes semgrep/ocaml-tree-sitter-core#8

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adding to assignment in grammar #25

Adding to assignment in grammar #25

colleend commented Sep 27, 2021

fwcd commented Sep 28, 2021

fwcd commented Oct 5, 2021

aryx commented Oct 5, 2021

aryx commented Oct 5, 2021

aryx commented Oct 5, 2021

fwcd commented Oct 5, 2021

aryx commented Oct 5, 2021

aryx commented Oct 5, 2021

aryx commented Oct 5, 2021

fwcd commented Oct 5, 2021

maxbrunsfeld commented Oct 5, 2021

colleend commented Oct 5, 2021

aryx commented Oct 6, 2021

aryx commented Oct 6, 2021

aryx commented Oct 6, 2021

fwcd commented Oct 6, 2021

colleend commented Oct 8, 2021 •

edited

Loading

fwcd commented Oct 11, 2021

colleend commented Oct 11, 2021

aryx commented Oct 18, 2021

Adding to assignment in grammar #25

Adding to assignment in grammar #25

Conversation

colleend commented Sep 27, 2021

fwcd commented Sep 28, 2021

fwcd commented Oct 5, 2021

aryx commented Oct 5, 2021

aryx commented Oct 5, 2021

aryx commented Oct 5, 2021

fwcd commented Oct 5, 2021

aryx commented Oct 5, 2021

aryx commented Oct 5, 2021

aryx commented Oct 5, 2021

fwcd commented Oct 5, 2021

maxbrunsfeld commented Oct 5, 2021

colleend commented Oct 5, 2021

aryx commented Oct 6, 2021

aryx commented Oct 6, 2021

aryx commented Oct 6, 2021

fwcd commented Oct 6, 2021

colleend commented Oct 8, 2021 • edited Loading

fwcd commented Oct 11, 2021

colleend commented Oct 11, 2021

aryx commented Oct 18, 2021

colleend commented Oct 8, 2021 •

edited

Loading