complete the grammar #1

amaanq · 2023-06-15T22:10:51Z

from nvim-treesitter/nvim-treesitter#4944

There's a lot of changes, frankly I was too busy writing code to segment my commits so I hope you'll just review it as it is altogether

Main updates:

remove redundant fielding/aliasing
don't alias literals like delimiters/keywords
shorten the external scanner code, fuzz it for sanity (had an endless loop bug or something before..)
add a test script on real matlab code

Todo:

update all tests (I only did some..there's so many and who knows if this is the "final" grammar)
update emacs queries (not something I care about, no offense)

Thanks!

acristoffers · 2023-06-16T08:27:10Z

Ok, this is a huge mix of things I really want in with things I won't accept at all, so let me try to separate them here first.

Before the list: for me tree-sitter is not only about syntax-highlight. I want this parser to be mostly correct with what it parses, and not to simply match as much of wrong code as possible. I'm not ok with oversimplifications for no reason, as an ERROR token should signal something is in fact wrong with your code. I don't want the tree to be the simplest possible, I want it to be mostly correct. I already changed my README yesterday to reflect that (you didn't git pull, the "fuzzyfication" you did I had already done).

Love

The _eof token.
The separated line_continutation token.
The addition of property_name inside function_call.
Removing redudances of fields/operators.
Renaming parethesis, matrix and cell.
The simplification of things with the commaSep functions.

Hate

Field expression instead of struct. This is bad because it's wrong: (1+1).D is not valid. Both side of the dot have stricter rules than that and it boils down to what matlab calls a struct, so it should be called a struct here too. We can still debate the name, but matching _expression on both sides is wrong.
The String and Comment changes. You removed formatting sequence and I don't like it. It allows you to take comments out of the scanner and bring back escape chars to single-quoted strings, but that's a worse trade-off in my opinion. MATLAB nowadays basically threats strings and char vectors equally (in the sense that you can almost always use them interchangably) so the way I did allows the user to use one if they want the highlight and the other if not. It also allows them to change to double, verify if the formatting expression is ok and then change to single again if double-quoted is not acceped in some function.
The assignment operator. Nope. Big no. [a;b]=a() is invalid. And so is [1 ~; 2 3]. And so is [1 2] = a(). And also 1+1=2. If I type something wrong like that, I want the highlight to break and show me I did something stupid.

Want to discuss

The removal of the operator aliases. Again, it's not about syntax-highlighting, but other uses. I can see the use of tree-sitter to write a beautifier/formatter, and having tokens for the operators would simplify that.
The import statement. Simply because it's not a statement, it's a function. You can do import = 2 if you want, but you cannot do global = 2. Also, you say it can have import_path inside it, but never defines what import_path is. For me import should not have a special threatment, just like no other function is getting special threatment.

So, I think those are all the changes to the code itself (the extra files you added I'm ok with). Point it out if I overlooked something and tell me whether you want to adapt the pull request so you get properly credited in the commit history or if you don't want to have the trouble, in which case I'll commit the Love part myself, making a mention of you in the README and leaving your name in the header you added.

acristoffers · 2023-06-16T08:34:40Z

Oh, I was busy reading the changes and forgot to say thank you for the pull request 🙈 So, thank you for your work, even if I don't agree with everything.

clason · 2023-06-16T10:56:30Z

Before the list: for me tree-sitter is not only about syntax-highlight. I want this parser to be mostly correct with what it parses, and not to simply match as much of wrong code as possible. I'm not ok with oversimplifications for no reason, as an ERROR token should signal something is in fact wrong with your code.

To be perfectly frank, that is not the stance of Neovim (and nvim-treesitter). For us, tree-sitter is a tool for working with code in a text editor, not a general-purpose parser and definitely not a linter. This:

If I type something wrong like that, I want the highlight to break and show me I did something stupid.

is very much a non-goal. So

parses correct code correctly -> necessary
does not parse incorrect code -> unnecessary, and unwanted if this complicates the parser to the point where it negatively affects highlighting performance.

(We had major issues with parsers that have objectively unacceptable parsing speed, which makes them unusable for our purpose.)

Of course, if your goals are different and don't align, that is fine -- there are many valid applications of tree-sitter besides ours -- but it may mean that we just cannot use your parser in Neovim.

(Although I hope that it doesn't come to that and both goals can be met satisfactorily.)

acristoffers · 2023-06-16T11:53:47Z

I understand that we have some different goals, but I don't think that they are necessarily conflicting yet. There is one question: is this parser with all the modifications OBJECTIVELY more performant than the parser withtout the ones I'm rejecting?

If yes, than maybe we should clone this repo into a nvim-tree-sitter-matlab and apply those modification. If not, we're all set.

I don't plan to go overboard with the parser either, don't expect me to all of sudden come back changing things unless it is to fix corner cases or add missing/new features of MATLAB. No matter the result of this disscussion I'll implement the "Love" part and then the parser will be considered stable/release, so I won't be modifying it without need.

For me, looking at it through your perspective of nvim-treesitter's goals, the things in the "Hate" list are premature optmizations. If they don't really make a meaninful difference in performance, then there is no reason for that kind of simplification.

And about the operator tokens, the python grammar returns fields for those, so is this really such a bad thing?

clason · 2023-06-16T12:11:14Z

I don't think we are necessarily in conflict, either; I just wanted to make clear our stance after you made clear yours, so everybody knows where the other person is coming from.

And the question is less about optimal performance and more about unacceptable performance. If (and only if) your goal of a strict parser leads to the latter, we'll simply stick with the current parser until someone else writes and proposes one that is a better fit for us (no need to change the name; you didn't feel the need to rename yours to distinguish it from the previous one).

As regards the specific changes, I have no personal opinion either way (having left Matlab years ago and never looked back; I'm not even recommending it to my students anymore).

acristoffers · 2023-06-16T12:40:45Z

It's good to know where we are standing, I agree.

I just created a file with about 1000 lines using a concatenation of some of my code (lots of functions one after the other) and my parser (without any modification) took about 3ms to run it, the same as this one. Then I copied the file into itself 10 times for a file with about 10000 lines. Both parsers took 31ms. The changes don't see to make any meaninful difference in performance. Way of measuring: clear; for i in (seq 100); tree-sitter parse -tq t.m; end. Fish. So yeah, no gain at all.

Afterwards I though about using only the different parts (multivar assignment and string with formatting) to see how badly it goes. Same thing again, same time to both.

And yeah, soft-power made me stick to MATLAB. "Manda quem pode, obedece quem tem juízo".

acristoffers · 2023-06-16T12:55:38Z

you didn't feel the need to rename yours to distinguish it from the previous one

Oh, I skipped that. Actually, tree-sitter's documentation asks for projects to be named this way, that's way I named it so.

https://tree-sitter.github.io/tree-sitter/creating-parsers#project-setup

clason · 2023-06-16T15:08:33Z

No, that's fine, that's why I'm saying different parsers can (and need to) share the same name.

Anyway, the point is simply that: as long as the parser is objectively more correct and not significantly less performant, we will switch. And the (non)existence of a hypothetical better parser does not factor into this decision. (Of course, if someone later comes with a better parser -- say, just as correct but noticeably faster -- we have no qualms of switching to that from yours...)

amaanq · 2023-06-16T17:13:19Z

Field expression instead of struct. This is bad because it's wrong: (1+1).D is not valid. Both side of the dot have stricter rules than that and it boils down to what matlab calls a struct, so it should be called a struct here too. We can still debate the name, but matching _expression on both sides is wrong.

I haven't touched matlab in a while, but creating complex rules to wiggle out a couple of items to be a strict syntax checker is not really what tree-sitter is about, it makes sense if, say, the state count is massively inflated or it causes conflicts that resolve incorrectly everywhere, but in this case, I would think it's not worth it.

The String and Comment changes. You removed formatting sequence and I don't like it. It allows you to take comments out of the scanner and bring back escape chars to single-quoted strings, but that's a worse trade-off in my opinion. MATLAB nowadays basically threats strings and char vectors equally (in the sense that you can almost always use them interchangably) so the way I did allows the user to use one if they want the highlight and the other if not. It also allows them to change to double, verify if the formatting expression is ok and then change to single again if double-quoted is not acceped in some function.

Well, the format verbs are only valid in the context of a printing function, no? Otherwise, a %d in a string is, literally, % and d. I could be incorrect, but it's what I recall. A solution would be injections using format verbs, but that's on pause atm (C/C++ also want/would benefit from this).

The assignment operator. Nope. Big no. [a;b]=a() is invalid. And so is [1 ~; 2 3]. And so is [1 2] = a(). And also 1+1=2. If I type something wrong like that, I want the highlight to break and show me I did something stupid.

Well, isn't that an LSPs job? This is one change I could add back, but it was causing some annoying conflicts that I thought solving it in 5 seconds by removing it was easier.

The removal of the operator aliases. Again, it's not about syntax-highlighting, but other uses. I can see the use of tree-sitter to write a beautifier/formatter, and having tokens for the operators would simplify that.

I don't think it'd make it simpler, and all it does for the purposes of the cst is pollute the tree, you'd still check the actual operator either way for a formatter I'd think.

2. The import statement. Simply because it's not a statement, it's a function. You can do import = 2 if you want, but you cannot do global = 2. Also, you say it can have import_path inside it, but never defines what import_path is. For me import should not have a special threatment, just like no other function is getting special threatment.

Well, I saw import a.b.c.* where * I'm assuming denotes "import all items" from the module - if it shouldn't be treated as a statement, a command argument should allow for this then

acristoffers · 2023-06-16T19:36:32Z

I haven't touched matlab in a while, but creating complex rules to wiggle out a couple of items to be a strict syntax checker is not really what tree-sitter is about, it makes sense if, say, the state count is massively inflated or it causes conflicts that resolve incorrectly everywhere, but in this case, I would think it's not worth it.

I'm not giving up correctness for no reason. If it is not causing a real problem nor creating a performance drawback then it's not going away.

Well, the format verbs are only valid in the context of a printing function, no? Otherwise, a %d in a string is, literally, % and d. I could be incorrect, but it's what I recall. A solution would be injections using format verbs, but that's on pause atm (C/C++ also want/would benefit from this).

Escapes are also only valid in that context, but there is no way for us to know since the user can just be doing

fmt = "%.2f"
sprintf(fmt, 2)

Well, isn't that an LSPs job? This is one change I could add back, but it was causing some annoying conflicts that I thought solving it in 5 seconds by removing it was easier.

I'm not giving up correctness for no reason. I think that at this point it has already been established that I would rather not be in nvim-treesitter than to accept those 3 specific changes.

I don't think it'd make it simpler, and all it does for the purposes of the cst is pollute the tree, you'd still check the actual operator either way for a formatter I'd think.

Not for a formatter, you would only probably care what kind of operation you have (binary, boolean, post, pre). The actual operator doesn´t matter. If you know where it actually is you just copy the string without having to do any extra check.

Well, I saw import a.b.c.* where * I'm assuming denotes "import all items" from the module - if it shouldn't be treated as a statement, a command argument should allow for this then

At first I thought I had missed it too, but it's not an operator but a function. And there is nothing to be changed, command arguments accepts those arguments just fine.

I'm sorry but you won't change my mind on the 3 rejections, I really dislike those 3 changes. If running the performance tests I had seen a difference I would change my mind, but there is no gain at all. Truly none. So I'm not changing my stance on those. This talk about complexity reminds me of people in my field talking about the conservativeness of equations: really nice on paper with zero improvements on real life, so what's even the point? I'm not giving up on that correctness for an imaginary improvement.

clason · 2023-06-16T19:59:35Z

It's your parser; you can do whatever you like with it. And as I said, as long as it's better in some regards and not worse in every other (by a non-negligible margin), I am happy to switch to it.

(But I can assure you that conservativeness of schemes does matter, practically ;))

acristoffers · 2023-06-17T07:32:25Z

@clason at this point I'm just waiting for @amaanq to remove the three changes from the request or tell me to apply the changes myself, so I can fix the tests and other queries and be done with it.

Going off-topic, but conservativeness of schemes? Are you talking IT or math here? What I'm talking about is this kind of conservativeness:

you have that P-A'*P*A has to be semi-definite negative. But A contains a variable and P is a variable (all matrices) so the multiplication of variables makes it impossible to solve (there is no bilinear solver, whoever does that for sure gets a Nobel). There are multiple solutions, and you want to search the solution space for the optional one. To solve it you turn it into (Schur complement) [-P A'*P; P*A -P] and then you can apply variable substitution so solve the P*A problem. But this adds conservativeness as all solutions for this are solutions for the original problem, but the original problem has solutions that are not valid for this. So you can now solve the problem, but you may not have the true optimal solution. Some people really have a sort of fetish with removing this conservativeness from known problems, which seems awesome until you see that the conservative solutions that already exist still find the same optimal solutions or solutions so close that the gain is useless. And sometimes the less conservative solution adds so much computation time to the solver that it is really impractical even for small problems. I'm talking about the conservative method finding a solution in under 1 minute and the non-conservative taking more than 3 hours to find the same or just slightly better.

clason · 2023-06-17T07:56:33Z

That is not a use of the word I was familiar with ;) (And that sounds like a fairly standard semidefinite optimization problem, for which there exist a whole zoo of solvers?)

acristoffers · 2023-06-17T10:38:50Z

This is quite common meaning of the word in the field of Electrical Engineering, more specifically Control Theory. I believe mathematicians use it too. And yes, there are plenty of LINEAR solvers. But the problems are born, usually, non-linear, and that's where you start to have this kind of problem. You cannot give P-A'*P*A directly to any solver if P and A are variables. In reality A is actually A-B*K or A-L*C or something like that, and K and L are the actual variables. And this, of course, is a very simple and basic example to illustrate the concept, things get much hairier.

amaanq · 2023-06-17T19:52:33Z

I apologize if I came off as trying to strong-arm you into accepting my changes as absolute - I chose what was simpler and passed all tests nicely, while ensuring syntax highlighting & related worked fine. I added back what you wanted back for the most part, and fixed a couple of scanner bugs detected in test suites, though whether some of these bugs were actually bugs is subjective..for example, is a standalone % a format verb? "Hey, % is a %". I'd assume not, but it was being parsed as one. Another was double quotes only surrounded by whitespace, e.g. " "" ", and leading whitespace were being trimmed in string contents. I won't delve into them all, but hopefully now the PR better reflects what we'd both like for the grammar - cleanliness, correctness, and stability 😸

One of the fuzzing failures, interestingly, is a bug with tree-sitter and not this repo (it's mentioned here if you're interested)

Let me know if there's anything else to discuss, and if not I can update all the tests then

acristoffers · 2023-06-18T09:32:33Z

That was exactly the impression I was under. Apologies accepted. I'm now in accordance with the changes, except property_name in expression and field_expression because they are generating wrong trees:

The file

a.b(3) + 2

is parsed by these changes as

(source_file [0, 0] - [2, 0]
  (field_expression [0, 0] - [0, 10]
    object: (identifier [0, 0] - [0, 1])
    field: (binary_operator [0, 2] - [0, 10]
      left: (function_call [0, 2] - [0, 6]
        name: (identifier [0, 2] - [0, 3])
        (arguments [0, 4] - [0, 5]
          argument: (number [0, 4] - [0, 5])))
      right: (number [0, 9] - [0, 10]))))

and it should be something like

(source_file [0, 0] - [2, 0]
  (binary_operator [0, 0] - [0, 10]
    left: (struct [0, 0] - [0, 6]
      (identifier [0, 0] - [0, 1])
      (function_call [0, 2] - [0, 6]
        name: (identifier [0, 2] - [0, 3])
        (func_call_paren [0, 3] - [0, 4])
        arguments: (arguments [0, 4] - [0, 5]
          argument: (number [0, 4] - [0, 5]))
        (func_call_paren [0, 5] - [0, 6])))
    (operator [0, 7] - [0, 8])
    right: (number [0, 9] - [0, 10])))

So I'll just accept the pull request and fix that and the tests myself. Waiting for you to do that would be nitpicking.

Thank you very much for the changes, especially for catching those bugs. On the infinity-loop you found: was it when running tests? I didn't come across any on my machines.

amaanq · 2023-06-19T06:29:00Z

Thank you very much for the changes, especially for catching those bugs. On the infinity-loop you found: was it when running tests? I didn't come across any on my machines.

It's from fuzzing, I can send the script to test it, and usually it does catch bugs with external scanners pretty quickly

Actually its in ts questions here now: https://github.com/sogaiu/ts-questions/blob/master/questions/failed-fuzzing/script/test-fuzzing.sh

A simpler version:

#!/bin/sh

set -eu

ROOT_DIR="fuzzer"

LANG=$1
TIME=$2
CPP=$3
# if scanner = scanner.cc then XFLAG = c++ else XFLAG = c
if [ "$CPP" = "cpp" ]; then
	SCANNER="scanner.cc"
	XFLAG="c++"
else
	SCANNER="scanner.c"
	XFLAG="c"
fi

shift 3

export PATH="/root/.cargo/bin:$PATH"
export CFLAGS="$(pkg-config --cflags --libs tree-sitter) -O0 -g -Wall"

JQ_FILTER='.. | if .type? == "STRING" or (.type? == "ALIAS" and .named? == false) then .value else null end'

build_dict() {
	jq "$JQ_FILTER" <src/grammar.json |
		grep -v "\\\\" | grep -v null >"$ROOT_DIR/dict"
}

build_fuzzer() {
	cat <<END | clang -fsanitize=fuzzer,address $CFLAGS -lstdc++ -g -x $XFLAG - src/$SCANNER src/parser.c $@ -o $ROOT_DIR/fuzzer
#include <stdio.h>
#include <stdlib.h>
#include <tree_sitter/api.h>

#ifdef __cplusplus
extern "C"
#endif
TSLanguage *tree_sitter_$LANG();

#ifdef __cplusplus
extern "C"
#endif
int LLVMFuzzerTestOneInput(const uint8_t * data, const size_t len) {
  // Create a parser.
  TSParser *parser = ts_parser_new();

  // Set the parser's language.
  ts_parser_set_language(parser, tree_sitter_$LANG());

  // Build a syntax tree based on source code stored in a string.
  TSTree *tree = ts_parser_parse_string(
    parser,
    NULL,
    (const char *)data,
    len
  );
  // Free all of the heap-allocated memory.
  ts_tree_delete(tree);
  ts_parser_delete(parser);
  return 0;
}
END
}

generate_fuzzer() {
	tree-sitter generate
}

makedirs() {
	mkdir -p "$ROOT_DIR"
	mkdir -p "$ROOT_DIR/out"
}

makedirs
generate_fuzzer

build_dict
build_fuzzer $@
cd "$ROOT_DIR"
./fuzzer -dict=dict -timeout=2 -max_total_time=$TIME out/

and then just run ./fuzz.sh matlab 5 c

amaanq mentioned this pull request Jun 15, 2023

feat(matlab)!: replace parser nvim-treesitter/nvim-treesitter#4944

Merged

amaanq added 4 commits June 17, 2023 15:44

feat: rework and complete the grammar

b73925c

chore: add eslint formatting

54d390a

feat: add ci test script

4ace974

chore: update some tests

8770881

amaanq force-pushed the main branch from eed4e0c to 4359b43 Compare June 17, 2023 19:44

amaanq added 3 commits June 17, 2023 15:47

chore: update neovim queries

37c27db

chore: update dots

03f3efa

chore: generate

3bbc23c

amaanq force-pushed the main branch from 4359b43 to 3bbc23c Compare June 17, 2023 19:47

acristoffers merged commit a09be54 into acristoffers:main Jun 18, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

complete the grammar #1

complete the grammar #1

amaanq commented Jun 15, 2023

acristoffers commented Jun 16, 2023

acristoffers commented Jun 16, 2023

clason commented Jun 16, 2023 •

edited

Loading

acristoffers commented Jun 16, 2023

clason commented Jun 16, 2023

acristoffers commented Jun 16, 2023 •

edited

Loading

acristoffers commented Jun 16, 2023 •

edited

Loading

clason commented Jun 16, 2023

amaanq commented Jun 16, 2023

acristoffers commented Jun 16, 2023

clason commented Jun 16, 2023 •

edited

Loading

acristoffers commented Jun 17, 2023

clason commented Jun 17, 2023

acristoffers commented Jun 17, 2023

amaanq commented Jun 17, 2023 •

edited

Loading

acristoffers commented Jun 18, 2023

amaanq commented Jun 19, 2023 •

edited

Loading

complete the grammar #1

complete the grammar #1

Conversation

amaanq commented Jun 15, 2023

acristoffers commented Jun 16, 2023

acristoffers commented Jun 16, 2023

clason commented Jun 16, 2023 • edited Loading

acristoffers commented Jun 16, 2023

clason commented Jun 16, 2023

acristoffers commented Jun 16, 2023 • edited Loading

acristoffers commented Jun 16, 2023 • edited Loading

clason commented Jun 16, 2023

amaanq commented Jun 16, 2023

acristoffers commented Jun 16, 2023

clason commented Jun 16, 2023 • edited Loading

acristoffers commented Jun 17, 2023

clason commented Jun 17, 2023

acristoffers commented Jun 17, 2023

amaanq commented Jun 17, 2023 • edited Loading

acristoffers commented Jun 18, 2023

amaanq commented Jun 19, 2023 • edited Loading

clason commented Jun 16, 2023 •

edited

Loading

acristoffers commented Jun 16, 2023 •

edited

Loading

acristoffers commented Jun 16, 2023 •

edited

Loading

clason commented Jun 16, 2023 •

edited

Loading

amaanq commented Jun 17, 2023 •

edited

Loading

amaanq commented Jun 19, 2023 •

edited

Loading