ctags: Matlab: generate tag with only the function name, not the function name plus arguments #3358
Please submit the changes to upstream Universal Ctags. We need to be able to import it unchanged; we don't have the resources to keep patches that will have to be re-applied every time ctags is updated. |
The file I modified is (There's still the small issue of all the tests failing; I'm working on it) PS: maybe universal-ctags's |
I just had a look at that file and it seems to be more complete (it seems to do the right thing, plus it also parses classes and variables), so maybe it's a better idea to just pull that file instead, and ignore this PR (unless there's a reason to use geany_matlab.c instead of the one universal-ctags provides). I naïvely tried to integrate that file into Geany myself and failed miserably, so maybe I'll leave this parked until someone with more experience has time to have a look at this. (There's no rush at the moment; it's not like having poor tag detection in Matlab is such a big deal; it's just that I thought this was going to be a quick fix that wouldn't be bothering any dev until after the holidays.) |
Ahh, OK, time for a history lesson. Stone-age ctags became Exuberant Ctags, which went unmaintained for a long time during the dark ages, before the renaissance, when Universal Ctags was forked to provide what it calls "a maintained ctags implementation", which has now become the modern ctags implementation. During the dark ages, when changes were not being accepted in ectags, changes were made to Geany's copy. Since the renaissance, some of those changes have been pushed upstream where acceptable and Geany and uctags have been unified; some have simply been replaced by upstream versions which are significantly better (C/C++); and some are so different that they are too hard to integrate into the changed upstream, so the modified Geany version has been kept. I guess Matlab is in that category. Also, the upstream Matlab parser is a regex parser, which Geany historically did not accept (@techee is that still the case?), so upstream couldn't be merged and the olde version is kept. Pending what @techee advises, I guess the options are to replace Geany's one with upstream if regex parsers are now supported (from my quick look, upstream parses classes that Geany's doesn't, so it's "better") or to keep maintaining Geany's unique version until regex parsers are supported. [Edit: see also #3170] |
The test problem looks like you have a gap between the name and the data in the reference tags file, whereas AFAICT all others do not have that. So probably your code is right and the test is wrong ;-P |
Force-pushed from 315cb94 to 8d1c3af.
Oooh, that was it :) So yeah, the example Maybe I should add a few extra lines to that test with some of the new corner cases I'm now capturing, like the case with As per the level of support for regular expressions, I see no other ctags parser using PS: I have no idea how to write ctags parsers and have just been blindly modifying what geany_matlab.c did. I notice that other tests have a much more complex .tags file which seems to include an argument list, probably for autocompletion hints. My commit only addresses the sidebar and ctrl-click thing; maybe the proper fix would be to use these more advanced features, but I'm not at that level yet... |
Certainly more is always better, in case someone else breaks it at some point :-)
Yeah, as per the comment in #3170 there are no others, and there is a common apprehension that regex parsers are slow, so having them run between keystrokes is an issue.
You are running up against the limits of my knowledge of ctags/tagmangler (the expert referred to previously is NOT me) but IIUC the arglist is a separate field on the tag; it's not part of the tag name (for the reasons you made this PR). Not sure how that is added; maybe look at a parser that does it? |
I... think it's fine without arglist for now. As for the use of regex, matlab.c's regex are simple enough that they could be replaced with normal C code. Maybe in a future I'll look into adding the missing |
@cousteaulecommandant Thanks for the patches, they look good to me. Regarding the regex ctags version of the parser vs Geany's version of the parser: we could use the regex ctags version, I was just thinking that since we have the hand-written version in Geany already, it might be a base for a hand-written parser that could eventually be submitted upstream so I kept the Could be a nice early 2023 project, what do you think :-)
Probably could be done by checking if after
This is the correct and official way to write ctags parsers, welcome :-). Plus cannibalising other ctags parsers.
Instead of Line 682 in 607fcec
which fills in the things you want to support into the |
I think the custom parser is currently a bit messy and could use some restructuring, but your point is valid. Re: speed, I see four "levels" in which the parser can be implemented:
I think the regex parser could be easily re-implemented using sscanf() as a faster alternative to regex, so if that's an option I think it'd be an elegant solution -- readable, efficient, and less prone to errors than options 1 and 2. Something like

    if (sscanf(line, " function [%*[^]%]] = %[A-Za-z0-9_]", buffer) == 1) ...
    if (sscanf(line, " function%*[ \t]%*[A-Za-z0-9_] = %[A-Za-z0-9_]", buffer) == 1) ...

etc. So, what do you think? Would
That still won't ignore words in strings (and maybe other corner cases). |
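For reference, the sscanf() idea above can be fleshed out roughly as follows. This is only a minimal sketch, not code from this PR or from upstream ctags: the helper name `matlab_function_name`, the third pattern (for functions without an output variable) and the `%63` width limits are my own assumptions, added so the scan cannot overflow the buffer. Character ranges such as `A-Z` inside a scanset are implementation-defined in standard C, although glibc treats them as ranges.

```c
#include <stdio.h>

/* Hypothetical helper (not part of Geany or ctags): tries to pull the
 * function name out of one Matlab source line using sscanf() only.
 * `name` must have room for at least 64 bytes.
 * Returns 1 if a name was stored, 0 otherwise. */
int matlab_function_name(const char *line, char *name)
{
    /* function [a, b] = name(...) */
    if (sscanf(line, " function [%*[^]%]] = %63[A-Za-z0-9_]", name) == 1)
        return 1;
    /* function out = name(...) */
    if (sscanf(line, " function%*[ \t]%*[A-Za-z0-9_] = %63[A-Za-z0-9_]", name) == 1)
        return 1;
    /* function name(...) with no output variable (assumed extra case) */
    if (sscanf(line, " function%*[ \t]%63[A-Za-z0-9_]", name) == 1)
        return 1;
    return 0;
}
```

Because `%*[ \t]` has to match at least one blank, a line such as `functionality = 3;` is rejected by the last two patterns. As noted in the reply above, this still has no notion of strings, so it is not a complete answer on its own.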
Warning: a problem with regexes (regexen? regexii? what is the plural ;-) and, I think, also with your sscanf approach is lack of context, so things that look like functions inside strings and comments might be found. Perhaps the sscanf method can find strings and comments first and not look for other stuff inside them? I don't know how fast sscanf is; it's possible it won't be any faster than regexes since it's still scanning the input more than once. That's what is slow for regex parsers: the fact that multiple regexes are applied to the same input, not that a well-optimised regex library like PCRE is slow. Repeated scans are one thing well-written character-by-character parsers try not to do.
That's the problem with assignment-is-declaration (see also Python and Julia for examples). It takes a parser well above the pay grade of these ones to decide which assignments are actually declarations and which are "just" assignments. The writers of the Python parser took the "every assignment is a declaration" approach and the Julia parser writers took the "no assignment is a declaration" approach. So Python is a precedent for having all the names available, and I haven't seen too many complaints about it. Having them available also makes autocomplete more comprehensive, and if they are removed the parser probably won't get back upstream, since it would not be as "good" as the upstream one, and you don't know if other users (emacs, vim, ...) of Matlab ctags can make use of them.
I don't know Matlab well enough to comment, but if no Matlabbers object I guess it's OK if upstream doesn't do it. |
The regices(?) defined in upstream ctags are meant to match from the beginning of the line (notice the If you want to ignore definitions inside block comments (and I considered modifying my current PR or creating a new one for that) then the good news is that that's relatively easy to do, since a block comment is always going to be delimited by lines containing only As for end-of-line comments, those always start with
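The block-comment part of the comment above could look something like the sketch below. It assumes the delimiters are `%{` and `%}`, each alone on its line apart from whitespace (those are the delimiters named in this PR's commit message), and it also assumes that block comments may nest, hence the depth counter. The names `is_only` and `in_block_comment` are made up for illustration and do not exist in Geany or ctags.

```c
#include <ctype.h>
#include <string.h>

/* Returns 1 if `line` contains nothing but `marker`, optionally
 * surrounded by whitespace (including a trailing newline). */
int is_only(const char *line, const char *marker)
{
    size_t n = strlen(marker);

    while (isspace((unsigned char)*line))
        line++;
    if (strncmp(line, marker, n) != 0)
        return 0;
    line += n;
    while (isspace((unsigned char)*line))
        line++;
    return *line == '\0';
}

/* Call once per input line, in order. Updates *depth (start it at 0)
 * and returns 1 if the line belongs to a block comment and should not
 * be scanned for tags, 0 otherwise. */
int in_block_comment(const char *line, int *depth)
{
    if (is_only(line, "%{")) {      /* opening delimiter */
        (*depth)++;
        return 1;
    }
    if (*depth > 0) {
        if (is_only(line, "%}"))    /* closing delimiter */
            (*depth)--;
        return 1;
    }
    return 0;
}
```

A line-based parser would call `in_block_comment()` first and skip the line whenever it returns 1, before looking for `function` or anything else.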
My understanding is that the scanf family functions only need to parse one character at a time, and never backtrack or look ahead more than one character, so they don't need to perform multiple passes on the input. For example, if you do For example, when matching the string So in other words,
I had never noticed this, and it feels kinda wrong that the same variable can be "declared" in multiple places, but then again, Python programs are usually a bunch of functions/classes with maybe a few "file-scope" variables, and the parser ignores assignments performed inside a function (which are local to the function). So for a typical Python file structure, it makes sense to assume that every assignment done outside of a function is some sort of "global" variable. But one could also make a Python script where most/all the code is outside of a function and is executed directly (this would look a bit ugly in Geany because of all the variables, but that's the price to pay if we want a "normal" module-like Python file to look good). Similarly, Matlab files can be of two types: either "scripts" where all the code inside is executed or "functions" containing one or more function definitions. Maybe we can just disable variable assignment detection when inside a function (i.e., when a line defining a function has been scanned before), and then we'd have Python's behavior. However, I'd argue that in the case of Matlab files, there's nothing similar to C's "file-scope variables" since you'll never mix function definitions and variable declarations, so maybe it's a bit pointless to parse variables. But I'm OK with it as long as the ones in functions are excluded.
I have some experience with Matlab and I'd say structs aren't used that often, or at least I don't use them often (and they're far from the only type of data structure). Plus using |
Definitely fastest but not very readable and if you wanted to go this way, it would be better to convert the parser into the token-based parser (i.e. the "proper" parser). Such parsers first split an input like
into tokens like these
first and then perform analysis on top of these pre-parsed tokens. Also when creating these tokens, these parsers skip things like whitespace or comments so you don't have to worry about these in the rest of the code. When creating these tokens, the parsers read the input character by character and do the necessary comparisons character-wise so they are very fast. In ctags these are all the parsers that don't use This is definitely the way to go if you want the best possible parser - but they require more time to implement and you'd have to rewrite the current implementation of the Matlab parser from scratch.
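To make the token-based idea a bit more concrete, here is a very reduced sketch of such a tokenizer. It is not taken from any existing ctags parser: `Token`, `TokenType` and `next_token` are hypothetical names, numbers and strings are not handled specially, and a `%` is simply treated as the start of a comment that ends the line.

```c
#include <ctype.h>
#include <stddef.h>

typedef enum { TOK_IDENT, TOK_SYMBOL, TOK_END } TokenType;

typedef struct {
    TokenType type;
    char text[64];
} Token;

/* Reads one token starting at *cursor and advances *cursor past it.
 * Whitespace is skipped; a '%' comment or the end of the string
 * yields TOK_END. The input is read once, character by character. */
Token next_token(const char **cursor)
{
    Token tok = { TOK_END, "" };
    const char *p = *cursor;

    while (isspace((unsigned char)*p))
        p++;
    if (*p == '\0' || *p == '%') {
        *cursor = p;
        return tok;
    }
    if (isalpha((unsigned char)*p) || *p == '_') {
        size_t n = 0;
        while (isalnum((unsigned char)*p) || *p == '_') {
            if (n < sizeof tok.text - 1)   /* truncate overlong names */
                tok.text[n++] = *p;
            p++;
        }
        tok.text[n] = '\0';
        tok.type = TOK_IDENT;
    } else {
        tok.text[0] = *p++;                /* single-character symbol */
        tok.text[1] = '\0';
        tok.type = TOK_SYMBOL;
    }
    *cursor = p;
    return tok;
}
```

With something like this in place, the rest of the parser only has to look at sequences such as `function`, identifier, `=`, identifier, `(` and never at raw whitespace or comments. A real implementation would also need string tokens so that a `%` inside a string is not mistaken for a comment.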
This is used in most ctags
Like Lex, I'm not entirely sure about the performance of this - even though you don't have to backtrack, I'm not sure how these rules are evaluated and if it's fast enough. Also, personally, I'd prefer just plain C code that does this stuff - it's more readable and it can be reused - you can remove the whole string behind What's sure is that ctags parsers don't really use this method.
Regexii (as the ancient Romans commonly called them) are probably slowest and also least flexible but fastest to write and better than no parser at all.
I'm not a Matlab user but I guess this is probably fine for Geany.
This is where you might run into a problem in universal ctags - the "kinds" ctags supports are kind of an interface, and dropping one means a backwards-incompatible change. In any case, before you spend more time on this parser, I'd suggest opening an issue in the universal-ctags project describing which way you want to proceed and asking if it's fine, to avoid some unnecessary work (the maintainer of the project tends to be very responsive and supportive). |
Regardless of which direction Geany takes in the future regarding Matlab parsing (upstream ctags, regex, sscanf, strncmp...), can this PR be merged? As far as I understand, it is ready to merge and solves an issue with Matlab file parsing. I still have some interest in improving this in the future ( EDIT: Actually it seems that I DO have the commit for |
I have included this PR as part of #3563, in case you want to merge both in one sitting. |
A line like `function y = foo(a, b, c)` should yield a tag of `foo`, not `foo(a, b, c)`. That way, Ctrl-clicking `foo` somewhere in the code will take me there. The function name is `foo` after all, not `foo(a, b, c)`. Also, fixed an issue where a line like `function foo() % this = bug` would yield a tag of `bug` instead of `foo` because the `=` in the comment was not ignored.
tests/ctags/matlab_test.m now captures more corner cases (comments with `=` and variable names starting with `function`). This will prevent accidental regressions in future commits. For now, block comments (text between `%{` and `%}`) are NOT ignored.
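The behaviour described in these commit messages can be illustrated with a small standalone function. This is a sketch written for this explanation, not the actual change made to geany_matlab.c: `extract_function_tag` and `is_ident_char` are hypothetical names, and the code assumes one complete source line per call and that function names contain only alphanumeric characters and underscores.

```c
#include <ctype.h>
#include <string.h>

static int is_ident_char(char c)
{
    return isalnum((unsigned char)c) || c == '_';
}

/* Writes the function name found on `line` into `out` (NUL-terminated,
 * at most out_size - 1 characters). Returns 1 if a name was found,
 * 0 if the line is not a function definition. */
int extract_function_tag(const char *line, char *out, size_t out_size)
{
    const char *p = line;
    const char *end, *eq;
    size_t n = 0;

    if (out_size == 0)
        return 0;
    while (isspace((unsigned char)*p))
        p++;
    /* Must be the keyword itself, not e.g. "functionality". */
    if (strncmp(p, "function", 8) != 0 || is_ident_char(p[8]))
        return 0;
    p += 8;

    /* Ignore everything from the first '%' on (end-of-line comment). */
    end = strchr(p, '%');
    if (end == NULL)
        end = p + strlen(p);

    /* If there is an '=' before the comment, the name follows it. */
    eq = memchr(p, '=', (size_t)(end - p));
    if (eq != NULL)
        p = eq + 1;

    while (p < end && isspace((unsigned char)*p))
        p++;
    /* Copy identifier characters only, so the argument list is dropped. */
    while (p < end && is_ident_char(*p) && n + 1 < out_size)
        out[n++] = *p++;
    out[n] = '\0';
    return n > 0;
}
```

Block comments are not handled here, matching the limitation stated above.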
Force-pushed from 78870da to 19feb68.
I had a look at the code and it appears to do the right thing. The changes seem simple enough that they shouldn't cause any problems - so even though we are close to release, I'm merging this. Thanks! |
Excellent! Thank you very much :) |
A line like `function y = foo(a, b, c)` should yield a tag of `foo`, not `foo(a, b, c)`. That way, Ctrl-clicking `foo` somewhere in the code will take me there. The function name is `foo` after all, not `foo(a, b, c)`.

Also, fixed an issue where a line like `function foo() % this = bug` would yield a tag of `bug` instead of `foo` because the `=` in the comment was not ignored.

Finally, added a check to ensure that the line starts with the keyword `function`, and not any word starting with `function...` which could be a variable name (e.g. `functionality`).

Here I am considering that function names contain only alphanumeric characters (and underscore) as Matlab's documentation states. I'm not aware of the possibility of declaring functions with `.` or other special characters directly using `function`.

Example Matlab file demonstrating the issue:

This used to yield tags `foo(a, b, c)` and `bug`, as well as `struct('a', 1, 'b', 2);` as a function; now it yields tags `foo` and `baz` as expected, and omits the `functionality` thing.

(If it's any consolation, notice that GitHub also messes up the highlighting of `baz()` because of the `=` in the comment.)

PS: Similarly, it might be good to figure out a way to also exclude occurrences of the substring `"struct"` that aren't the keyword `struct` itself, such as in strings, or as part of words, e.g.: