Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ctags: Matlab: generate tag with only the function name, not the function name plus arguments #3358

Merged
merged 3 commits into from
Oct 14, 2023

Conversation

cousteaulecommandant
Copy link
Contributor

A line like function y = foo(a, b, c) should yield a tag of foo, not foo(a, b, c).
That way, Ctrl-clicking foo somewhere in the code will take me there.
The function name is foo after all, not foo(a, b c).

Also, fixed issue where a line like function foo() % this = bug would yield a tag of bug instead of foo because the = in the comment was not ignored.

Finally, added a check to ensure that the line starts with the keyword function, and not any word starting with function... which could be a variable name (e.g. functionality).

Here I am considering that function names contain only alphanumeric characters (and underscore) as Matlab's documentation states. I'm not aware of the possibility of declaring functions with . or other special characters directly using function.

Example Matlab file demonstrating the issue:

function y = foo(a, b, c) % the function name should be `foo`, not `foo(a, b, c)`
    y = a+b+c;
end
function baz() % this = bug
    disp('The function name is `baz`, not `bug`');
end
functionality = struct('a', 1, 'b', 2);

This used to yield tags foo(a, b, c) and bug, as well as struct('a', 1, 'b', 2); as a function; now it yields tags foo and baz as expected, and omits the functionality thing.

(If it's any consolation, notice that GitHub also messes up the highlighting of baz() because of the = in the comment.)


PS: Similarly, it might be good to figure out a way to also exclude occurrences of the substring "struct" that aren't the keyword struct itself, such as in strings, or as part of words, e.g.:

restructure = true;
disp('You are on the way to destruction.');

@elextr
Copy link
Member

elextr commented Dec 25, 2022

Please submit the changes to upstream Universal Ctags, we need to be able to import it unchanged, we don't have the resources to keep patches that will have to be re-applied every time ctags is updated.

@cousteaulecommandant
Copy link
Contributor Author

Please submit the changes to upstream Universal Ctags, we need to be able to import it unchanged, we don't have the resources to keep patches that will have to be re-applied every time ctags is updated.

The file I modified is ctags/parsers/geany_matlab.c, which looks like a modified version by Geany (it has "geany" in the name!), is it not?
I just checked in the universal-ctags project and their matlab.c is completely different to the one in Geany, so I guess Geany just uses its own modified version.
I'm a bit confused about all this process (this is my first "real" commit to Geany with actual C code in it).

(There's still the small issue of all the tests failing; I'm working on it)

PS: maybe universal-ctags's matlab.c was removed for historical reasons and the devs forgot about it, and it should just be added back replacing Geany's own modified version? I can test if the upstream version works too.

@cousteaulecommandant
Copy link
Contributor Author

PS: maybe universal-ctags's matlab.c was removed for historical reasons and the devs forgot about it, and it should just be added back replacing Geany's own modified version? I can test if the upstream version works too.

I just had a look at that file and it seems to be more complete (it seems to do the right thing, plus it also parses classes and variables), so maybe it's a better idea to just pull that file instead, and ignore this PR (unless there's a reason to use geany_matlab.c instead of the one universal-ctags provides).

I naïvely tried to integrate that file into Geany myself and failed miserably, so maybe I'll leave this parked until someone with more experience has time to have a look at this. (There's no rush at the moment; it's not like having poor tag detection in Matlab is such a big deal; it's just that I thought this was going to be a quick fix that wouldn't be bothering any dev until after the holidays.)

@elextr
Copy link
Member

elextr commented Dec 26, 2022

Ahh, ok, time for history lesson, stone age ctags became exuberant ctags which went unmaintained for a long time during the dark ages, before the renaissance when universal ctags was forked to provide what it calls "a maintained ctags implementation" which has now become the modern ctags implementation.

During the dark ages when changes were not being accepted in ectags, changes were made to Geanys copy. Since the renaissance some of those changes have been pushed upstream if acceptable and Geany and uctags have been unified, some have simply been replaced by upstreams which are significantly better (C/C++), and some are so different that they are too hard to integrate into changed upstreams so the modified Geany version has been kept.

I guess Matlab is in that category.

Also the upstream matlab parser is a regex parser which Geany historically did not accept (@techee is that still the case?) so upstream couldn't be merged, so the olde version is kept.

Pending what @techee advises I guess the options are to replace Geany's one with upstream if regex parsers are now supported (to my quick look upstream parses classes that Geany's doesn't, so its "better") or to keep maintaining Geany's unique version until regex parsers are supported.

[Edit: see also #3170]

@elextr
Copy link
Member

elextr commented Dec 26, 2022

The test problem looks like you have a gap between the name and the data in the reference tags file, whereas AFAICT all others do not have that. So probably your code is right and the test is wrong ;-P

@cousteaulecommandant
Copy link
Contributor Author

The test problem looks like you have a gap between the name and the data in the reference tags file, whereas AFAICT all others do not have that. So probably your code is right and the test is wrong ;-P

Oooh, that was it :) So yeah, the example matlab_test.m file included a space at the end of each function, like function func3 , which the previous (erroneous) parser was including as part of the function name, since it parses everything from the first non-whitespace after function to the end of the line (or beginning of comment), and the test was tailored to make that work. Anyway, all tests pass now :)

Maybe I should add a few extra lines to that test with some of the new corner cases I'm now capturing, like the case with = in a comment.

As per the level of support for regular expressions, I see no other ctags parser using tagRegexTable, so I'm not sure it's supported. (But even if it is, I'm not sure it's a good idea to use it only for Matlab.)

PS: I have no idea how to write ctags parsers and have just been blindly modifying what geany_matlab.c did. I notice that other tests have a much more complex .tags file which seems to include an argument list, probably for autocompletion hints. My commit only addresses the sidebar and ctrl-click thing; maybe the proper fix would be to use these more advanced features, but I'm not at that level yet...

@elextr
Copy link
Member

elextr commented Dec 26, 2022

Maybe I should add a few extra lines to that test with some of the new corner cases I'm now capturing, like the case with = in a comment.

Certainly more is always better, in case someone else breaks it at some point :-)

As per the level of support for regular expressions, I see no other ctags parser using tagRegexTable, so I'm not sure it's supported. (But even if it is, I'm not sure it's a good idea to use it only for Matlab.)

Yeah, as per the comment in #3170 there are no others, and there is a common apprehension that regex parsers are slow, so having them run between keystrokes is an issue.

I notice that other tests have a much more complex .tags file which seems to include an argument list,

You are running up against the limits of my knowledge of ctags/tagmangler (the expert referred to previously is NOT me) but IIUC the arglist is a separate field on the tag, its not part of the tag name (for the reasons you made this PR). Not sure how that is added, maybe look at a parser that does it?

@cousteaulecommandant
Copy link
Contributor Author

I... think it's fine without arglist for now. matlab.c from universal-ctags doesn't add those either anyway (it's a really simple parser). I'd need to study the documentation of universal-ctags for that.

As for the use of regex, matlab.c's regex are simple enough that they could be replaced with normal C code. Maybe in a future I'll look into adding the missing classdef parser (and possibly remove the struct parser, which seems superfluous and potentially problematic).

@techee
Copy link
Member

techee commented Jan 1, 2023

@cousteaulecommandant Thanks for the patches, they look good to me.

Regarding the regex ctags version of the parser vs Geany's version of the parser: we could use the regex ctags version, I was just thinking that since we have the hand-written version in Geany already, it might be a base for a hand-written parser that could eventually be submitted upstream so I kept the geany_ parser. Hand-written parsers tend to offer better flexibility in parsing and are much faster than regex parsers. But before such a parser could be submitted upstream, it would have to offer all the functionality the current regex parser offers.

Could be a nice early 2023 project, what do you think :-)

PS: Similarly, it might be good to figure out a way to also exclude occurrences of the substring "struct" that aren't the keyword struct itself, such as in strings, or as part of words, e.g.:

Probably could be done by checking if after

p=(const unsigned char*) strstr ((const char*) line, "struct");

p-1 and p+6 are not alnum (plus all the necessary range checks).

PS: I have no idea how to write ctags parsers and have just been blindly modifying what geany_matlab.c did.

This is the correct and official way to write ctags parsers, welcome :-). Plus cannibalising other ctags parsers.

I notice that other tests have a much more complex .tags file which seems to include an argument list, probably for autocompletion hints.

Instead of makeSimpleTag() you create a function similar to the makeTagFull() here

static int makeTagFull (tokenInfo *const token, const goKind kind,

which fills in the things you want to support into the tagEntryInfo struct and then create the tag using makeTagEntry(). For the argument list it's the e.extensionFields.signature member. You should also call initTagEntry() to set the tag name and kind. The rest is optional. For readline-based parsers ctags sets the line number for you, you don't have to care about it by yourself.

@cousteaulecommandant
Copy link
Contributor Author

Regarding the regex ctags version of the parser vs Geany's version of the parser: we could use the regex ctags version, I was just thinking that since we have the hand-written version in Geany already, it might be a base for a hand-written parser that could eventually be submitted upstream so I kept the geany_ parser. Hand-written parsers tend to offer better flexibility in parsing and are much faster than regex parsers. But before such a parser could be submitted upstream, it would have to offer all the functionality the current regex parser offers.

I think the custom parser is currently a bit messy and could use some restructuring, but your point is valid.
Right now, I think the only real advantage of the regex version of the parser is its readability, but it doesn't seem to be really leveraging the full power of regexes -- it is rather simple and can probably be translated to "plain C" easily.
(Also, at first glance, it seems that those regexes aren't too good either; for example, I believe the second one will match functionality = 42 too.)

Re: speed, I see four "levels" in which the parser can be implemented:

  1. Compare characters one by one
  2. strncmp() and strstr()
  3. sscanf() 💡
  4. Regular expressions

I think the regex parser could be easily re-implemented using sscanf() as a faster alternative to regex, so if that's an option I think it'd be an elegant solution -- readable, efficient, and less prone to errors than options 1 and 2.

Something like

if (sscanf(line, " function [%*[^]%]] = %[A-Za-z0-9_]", buffer) == 1) ...
if (sscanf(line, " function%*[ \t]%*[A-Za-z0-9_] = %[A-Za-z0-9_]", buffer) == 1) ...

etc.
(where " " matches zero or more whitespace chars, "%*[ \t]" matches one or more spaces/tabs, "%*[^]%]" matches anything but ] and %, etc -- it's not incredibly readable, but it's fast.)

So, what do you think? Would sscanf() be fast enough, or better to keep matching individual chars and substrings?

Probably could be done by checking if after

p=(const unsigned char*) strstr ((const char*) line, "struct");

p-1 and p+6 are not alnum (plus all the necessary range checks).

That still won't ignore words in strings (and maybe other corner cases).
Honestly I think I'd entirely ditch parsing structs; knowing that a certain variable at a certain point in the program is a struct isn't really that relevant, and universal-ctags doesn't do it anyway. Class parsing would be more useful.
For a similar reason, I'd avoid parsing all variables as universal-ctags does; having a list with EVERY variable assignment in EVERY function in the script seems excessive. (However, it might be a good idea to list global and persistent variables.)

@elextr
Copy link
Member

elextr commented Jan 3, 2023

Warning: a problem with regexes (regexen? regexii? what is the plural ;-) and I think also your sscanf approach is lack of context, so things that look like functions inside strings and comments might be found. Perhaps the sscanf method can find strings and comments first and not look for other stuff inside them?

Don't know how fast sscanf is, its possible it won't be any faster than regexes since its still scanning the input more than once. Thats what is slow for regex parsers, the fact that multiple regexes are applied to the same input, not the fact that a well optimised regex library like PCRE is slow. Repeat scans is one thing well written character by character parsers try not to do.

I'd avoid parsing all variables as universal-ctags does;

Thats the problem with assignment is declaration (see also Python and Julia for egs). It takes a parser well above the pay grade of these ones to decide which assignments are actually declarations, and which are "just" assignments. The writers of the Python parser took the "every assignment is a declaration" approach and the Julia parser writers took the approach "no assignment is a declaration". So Python is a precedent for having all the names available, and I havn't seen too many complaints about it.

Having them available also makes autocomplete more comprehensive and if they are removed the parser won't get back upstream probably since it would not be as "good" as the upstream one and you don't know if other users (emacs, vim, ...) of Matlab ctags can make use of them.

Honestly I think I'd entirely ditch parsing structs;

Don't know matlab enough to comment, but if no Matlabbers object I guess its ok if upstream doesn't do it.

@cousteaulecommandant
Copy link
Contributor Author

Perhaps the sscanf method can find strings and comments first and not look for other stuff inside them?

The regices(?) defined in upstream ctags are meant to match from the beginning of the line (notice the ^ at the beginning), so they're not going to accidentally match things inside an end-of-line comment or a string.

If you want to ignore definitions inside block comments (and I considered modifying my current PR or creating a new one for that) then the good news is that that's relatively easy to do, since a block comment is always going to be delimited by lines containing only %{ and %} (possibly with some whitespace) and nothing else, so there's no risk to accidentally interpret the content of a string or comment as a block comment delimiter. The only difficulty in this regard is that block comments can be nested (think of C's #if 0 ... #endif), but that's easy to solve with a counter.

As for end-of-line comments, those always start with % or ... (excluding those in a string, but you won't find strings in function definitions), so they are easy to avoid. And strings don't span multiple lines. So I think that other than the rare %{...%} case, it should be straightforward.

Don't know how fast sscanf is, its possible it won't be any faster than regexes since its still scanning the input more than once. Thats what is slow for regex parsers, the fact that multiple regexes are applied to the same input, not the fact that a well optimised regex library like PCRE is slow. Repeat scans is one thing well written character by character parsers try not to do.

My understanding is that the scanf family functions only need to parse one character at a time, and never backtrack or look ahead more than one character, so they don't need to perform multiple passes on the input. For example, if you do scanf("%d", &i), that's going to read character by character from stdin, and as soon as it finds a character that doesn't match an integer (it's not a digit or a leading whitespace/sign), it'll put that one character back into stdin (see ungetc()) and return. Think of it as a possessive regex.
Regular expressions, on the other hand, have to try to match different combinations until one of them works (unless they're possessive), hence the need for backtracking and multiple passes and inefficiencies.

For example, when matching the string abcde123 against the regex ([a-z]+)([aeiou]+) it will succeed, because the first capture group will match abcd (even if it could match all of abcde) and the second part will match e. But the scanf pattern %[a-z]%[aeiou] will fail, because the first specifier will just eat all the letters and not leave any for the second.

So in other words, sscanf() is a glorified char-by-char parser that only needs to look ahead one character at most. And it's also probably very well optimized, so it might be even faster than writing the parsing state machine yourself. Maybe we could just write some tests and time them.

The writers of the Python parser took the "every assignment is a declaration" approach and the Julia parser writers took the approach "no assignment is a declaration". So Python is a precedent for having all the names available, and I haven't seen too many complaints about it.

I had never noticed this, and it feels kinda wrong that the same variable can be "declared" in multiple places, but then again, Python programs are usually a bunch of functions/classes with maybe a few "file-scope" variables, and the parser ignores assignments performed inside a function (which are local to the function). So for a typical Python file structure, it makes sense to assume that every assignment done outside of a function is some sort of "global" variable. But one could also make a Python script where most/all the code is outside of a function and is executed directly (this would look a bit ugly in Geany because of all the variables, but that's the price to pay if we want a "normal" module-like Python file to look good).

Similarly, Matlab files can be of two types: either "scripts" where all the code inside is executed or "functions" containing one or more function definitions. Maybe we can just disable variable assignment detection when inside a function (i.e., when a line defining a function has been scanned before), and then we'd have Python's behavior.

However, I'd argue that in the case of Matlab files, there's nothing similar to C's "file-scope variables" since you'll never mix function definitions and variable declarations, so maybe it's a bit pointless to parse variables. But I'm OK with it as long as the ones in functions are excluded.

Honestly I think I'd entirely ditch parsing structs;

Don't know matlab enough to comment, but if no Matlabbers object I guess its ok if upstream doesn't do it.

I have some experience with Matlab and I'd say structs aren't used that often, or at least I don't use them often (and they're far from the only type of data structure). Plus using struct explicitly isn't the only way to declare a struct (and I'd say it's rarely done); just doing a.b.c.d = 42 will declare a as a struct containing a struct containing a struct if a doesn't exist yet. I think in practice I'd only ever use struct explicitly if I wanted to "reset" a struct variable to the empty struct.

@techee
Copy link
Member

techee commented Jan 3, 2023

Re: speed, I see four "levels" in which the parser can be implemented:

  1. Compare characters one by one

Definitely fastest but not very readable and if you wanted to go this way, it would be better to convert the parser into the token-based parser (i.e. the "proper" parser). Such parsers first split an input like

function y = func4(a, b)

into tokens like these

  1. function - keyword
  2. y - identifier
  3. = - = operator
  4. func4 - identifier
  5. (
  6. a - identifier
  7. ,
  8. b - identifer
  9. )

first and then perform analysis on top of these pre-parsed tokens. Also when creating these tokens, these parsers skip things like whitespace or comments so you don't have to worry about these in the rest of the code. When creating these tokens, the parsers read the input character by character and do the necessary comparisons character-wise so they are very fast. In ctags these are all the parsers that don't use readLineFromInputFile() or regular expressions.

This is definitely the way to go if you want the best possible parser - but they require more time to implement and you'd have to rewrite the current implementation of the Matlab parser from scratch.

readLine() based parsers are definitely shittier but often just fine if the language isn't too crazy.

  1. strncmp() and strstr()

This is used in most ctags readLine() based parsers.

  1. sscanf() 💡

Like Lex, I'm not entirely sure by the performance of this - even though you don't have to backtrack, I'm not sure how these rules are evaluated and if it's fast enough. Also, personally, I'd prefer just plain C code that does this stuff - it's more readable and it can be reused - you can remove the whole string behind % in C first and then the rest of the code doesn't have to care about this any more (this is one of the typical simplifications of readLine() based parsers - % could be inside of a string in which case you shouldn't do this).

What's sure is that ctags parsers don't really use this method.

  1. Regular expressions

Regexii (as the ancient Romans commonly called them) are probably slowest and also least flexible but fastest to write and better than no parser at all.

Honestly I think I'd entirely ditch parsing structs; knowing that a certain variable at a certain point in the program is a struct isn't really that relevant, and universal-ctags doesn't do it anyway. Class parsing would be more useful.

I'm not a Matlab user but I guess this is probably fine for Geany.

For a similar reason, I'd avoid parsing all variables as universal-ctags does; having a list with EVERY variable assignment in EVERY function in the script seems excessive. (However, it might be a good idea to list global and persistent variables.)

This is where you might run into a problem in universal ctags - the "kinds" ctags support is kind of an interface and dropping it means backwards-incompatible change. In any case, before you spend more time on this parser, I'd suggest opening an issue in the universal-ctags project describing which way you want to proceed and asking if it's fine to avoid some unnecessary work (the maintainer of the project tends to be very responsive and supportive).

@cousteaulecommandant
Copy link
Contributor Author

cousteaulecommandant commented Sep 25, 2023

Regardless of what direction does Geany go in a future regarding Matlab parsing (upstream ctags, regex, sscanf, strncmp...), can this PR be merged? As far as I understand it is ready to merge, and solves an issue with Matlab file parsing.

I still have some interest in improving this in a future (%{...%} comments, classes, etc), but for the time being this PR solves the issue at hand.

EDIT: Actually it seems that I DO have the commit for %{...%} ready, along with one that allows function definitions to be indented (which is perfectly fine in Matlab, and pretty common in class definitions too). Those have been in my waiting list since April, just waiting for this commit to be pulled since they were built on top of it to avoid conflicts.
Maybe I should submit a PR for that one so that you can merge both at the same time.

@cousteaulecommandant
Copy link
Contributor Author

I have included this PR as part of #3563, in case you want to merge both in one sitting.

A line like `function y = foo(a, b, c)` should yield a tag of `foo`,
not `foo(a, b, c)`.
That way, Ctrl-clicking `foo` somewhere in the code will take me there.
The function name is `foo` after all, not `foo(a, b c)`.

Also, fixed issue where a line like `function foo() % this = bug`
would yield a tag of `bug` instead of `foo` because the `=` in the
comment was not ignored.
tests/ctags/matlab_test.m now captures more corner cases
(comments with `=` and variable names starting with `function`).
This will prevent accidental regressions in future commits.

For now, block comments (text between `%{` and `%}`) are NOT ignored.
@kugel- kugel- requested a review from techee October 4, 2023 07:23
@techee techee merged commit 20c39c8 into geany:master Oct 14, 2023
4 checks passed
@techee
Copy link
Member

techee commented Oct 14, 2023

I had a look at the code and it appears to do the right thing. The changes seem simple enough that they shouldn't cause any problems - so even though we are close to release, I'm merging this.

Thanks!

@cousteaulecommandant
Copy link
Contributor Author

Excellent! Thank you very much :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants