Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use cxx parser from uctags #3032

Merged
merged 10 commits into from Dec 19, 2021
Merged

Use cxx parser from uctags #3032

merged 10 commits into from Dec 19, 2021

Conversation

techee
Copy link
Member

@techee techee commented Nov 28, 2021

This pull request switches us to the new cxx parser for C and C++.

There is nothing really special worth noting here apart from the
change how anonymous tags are called in the new parser. In the
old parser we had e.g.

anon_struct_1
anon_union_2
anon_typedef_3

which are now called just

__anon1
__anon2
__anon3

The question is whether we should do something about that or just keep it this way (it should be possible to rename the tags in our code, the only problem is that we have to go through all the tags and apart from tag names also update scope names).

Fixes #2349, fixes #1314, fixes #1249, fixes #1944, fixes #2916

This is to avoid clash with cpreprocessor.c/h used by the new cxx parser.
Merging lcpp.c/h with cpreprocessor.c/h would be difficult (at least for
now) because of the differences in c.c so keep them separate for now.
As a result, when we copy the new cxx parser, we don't have clashes of
these symbols from the two different parsers.
This patch only makes the parser compile, it doesn't enable
it yet.
There are several things needed for this:

1. The new preprocessor has to be defined as a separate parser.

2. Tags from the new c/c++ parsers and the preprocessor parser have to
be mapped to Geany types. We still need to keep the old mappings because
some parsers like Ferite or GLSL still use the old C parser.

3. Anonymous tags have a different name so we have to reflect this in
tm_tag_is_anon().
The changes are mostly these:

1. Spaces in function argument list

(int var1, int var2, ...) - before
(int var1,int var2,...) - now

2. Anonymous tags

anon_struct_1
anon_union_2
anon_typedef_3

vs

__anon1
__anon2
__anon3

3. Improved parsing of the new parser
@techee
Copy link
Member Author

techee commented Nov 28, 2021

Just repushed, there was a problem with one unit test.

@elextr
Copy link
Member

elextr commented Nov 28, 2021

Wow 77 files changed, although some are tests.

@techee
Copy link
Member Author

techee commented Nov 28, 2021

Wow 77 files changed, although some are tests.

Yeah, this plus the new cxx parser consisting of about 30 files.

@elextr
Copy link
Member

elextr commented Nov 28, 2021

Hmmm, well, I can't tell any difference from the old one, are you sure its being used?

None of the features here seem to work, maybe Geany needs more to support them?

@elextr
Copy link
Member

elextr commented Nov 28, 2021

Also following error was output on terminal:

(geany:100237): Tagmanager-WARNING **: 00:10:04.254: definition tag for refonly kind(using) is made: Setting

I will look for it tomorrow and try to make a MRE.

[Edit: which I guess proves its being used :)]

@techee
Copy link
Member Author

techee commented Nov 28, 2021

Hmmm, well, I can't tell any difference from the old one, are you sure its being used?
None of the features here seem to work, maybe Geany needs more to support them?

These tag kinds are currently not mapped to anything in Geany but better to make sure the parser works correctly before adding more stuff (I did the same with other parsers too):

{'h', tm_tag_undef_t}, \
{'l', tm_tag_undef_t}, \
{'z', tm_tag_undef_t}, \
{'L', tm_tag_undef_t}, \
{'D', tm_tag_undef_t},
{'A', tm_tag_undef_t},
{'N', tm_tag_undef_t},
{'U', tm_tag_undef_t},
{'Z', tm_tag_undef_t},

Have you tried some tricky file that produced wrong output before with the new parser?

(geany:100237): Tagmanager-WARNING **: 00:10:04.254: definition tag for refonly kind(using) is made: Setting

I'll have a look at this.

@techee
Copy link
Member Author

techee commented Nov 28, 2021

I created a bug report regarding the warning here:

universal-ctags/ctags#3203

It should be harmless though.

@masatake
Copy link
Contributor

You can use isTagExtraBitMarked() declared in ctags:main/entry.h to detect whether a tag is anonymous or not.
As the extra parameter, use XTAG_ANONYMOUS declared in main/entry.h.

@techee
Copy link
Member Author

techee commented Nov 28, 2021

You can use isTagExtraBitMarked() declared in ctags:main/entry.h to detect whether a tag is anonymous or not.
As the extra parameter, use XTAG_ANONYMOUS declared in main/entry.h.

Yep, I'm aware of that and I want to address that in the future. The problem is that our internal representation of tags TMTag

typedef struct TMTag

doesn't contain a field where this information could be stored and adding more fields is an ABI change because this structure is accessible to plugins so better avoid that. But I think we could use the pointerOrder integer inside TMTag which is just some legacy field unused by anything neither in Geany nor plugins and could convert it to a flag field where information about whether a tag is anonymous or not could be stored.

@elextr
Copy link
Member

elextr commented Nov 28, 2021

is an ABI change because this structure is accessible to plugins so better avoid that.

Its ok to add fields at the end of an ABI structure, just don't change the order of any existing ones.

@techee
Copy link
Member Author

techee commented Nov 29, 2021

Its ok to add fields at the end of an ABI structure, just don't change the order of any existing ones.

It wouldn't matter if you store pointers only to TMTag. But when you have values like

TMTag foo[10];

the size of TMTag increases with a new field and things break.

@techee
Copy link
Member Author

techee commented Nov 29, 2021

I've just pushed fix for the warning from

universal-ctags/ctags@fb305d8

@elextr
Copy link
Member

elextr commented Nov 29, 2021

TMTag foo[10];

I guess you mean a pointer to an array is passed from TM and used by plugins, not that plugins allocate TMTags themselves.

Thats going to be a bit of a problem when languages that are not C need additional data, sigh, add it to the TM hate list. Since it was stolen from Anjuta, maybe we should steal their Sqlite replacement :)

@elextr
Copy link
Member

elextr commented Nov 29, 2021

I've just pushed fix for the warning from

Anyway crisis temporarily averted.

@techee
Copy link
Member Author

techee commented Nov 29, 2021

I guess you mean a pointer to an array is passed from TM and used by plugins, not that plugins allocate TMTags themselves.

I actually meant TMTags allocated by plugins e.g. on stack but I was wrong - in that case it's not a problem so it shouldn't be theoretically a problem to add more fields.

Since it was stolen from Anjuta, maybe we should steal their Sqlite replacement :)

No way, I think our TM is pretty good now (many parts rewritten/removed by me and Colomban) and storing tags to a database doesn't solve anything for us.

@techee
Copy link
Member Author

techee commented Nov 29, 2021

@masatake Maybe one question - currently we have scope separators hard-coded for individual languages here:

const gchar *tm_parser_context_separator(TMParserType lang)

Is it possible to get this information from ctags?

@elextr
Copy link
Member

elextr commented Nov 29, 2021

I actually meant TMTags allocated by plugins

Thats outlawed by HACKING for exactly this reason.

@techee
Copy link
Member Author

techee commented Nov 30, 2021

@masatake Maybe one question - currently we have scope separators hard-coded for individual languages here:
Is it possible to get this information from ctags?

To answer myself after looking at several parsers: no, these are hard-coded in parsers themselves in many cases. Maybe it's possible for those using ATTACH_SEPARATORS() but it's just some parsers.

@masatake
Copy link
Contributor

masatake commented Nov 30, 2021

Sorry to be late.
As you found, I introduced 'ATTACH_SEPARTORS', the more systematic way to define separators.
However, only some parsers use it. Mainly because there is no strong demand:-).

In tags output, you can see separator definitions:

[yamato@dev64]~/var/ctags-github% ./ctags --pseudo-tags=TAG_KIND_SEPARATOR --extras=+'{pseudo}' -o - /tmp/input.php 
!_TAG_KIND_SEPARATOR!PHP	::	/*a/
!_TAG_KIND_SEPARATOR!PHP	::	/*c/
!_TAG_KIND_SEPARATOR!PHP	::	/*d/
!_TAG_KIND_SEPARATOR!PHP	::	/*f/
!_TAG_KIND_SEPARATOR!PHP	::	/*i/
!_TAG_KIND_SEPARATOR!PHP	::	/*l/
!_TAG_KIND_SEPARATOR!PHP	::	/*n/
!_TAG_KIND_SEPARATOR!PHP	::	/*t/
!_TAG_KIND_SEPARATOR!PHP	::	/*v/
!_TAG_KIND_SEPARATOR!PHP	\\	/na/
!_TAG_KIND_SEPARATOR!PHP	\\	/nc/
!_TAG_KIND_SEPARATOR!PHP	\\	/nd/
!_TAG_KIND_SEPARATOR!PHP	\\	/nf/
!_TAG_KIND_SEPARATOR!PHP	\\	/ni/
!_TAG_KIND_SEPARATOR!PHP	\\	/nl/
!_TAG_KIND_SEPARATOR!PHP	\\	/nn/
!_TAG_KIND_SEPARATOR!PHP	\\	/nt/
!_TAG_KIND_SEPARATOR!PHP	\\	/nv/

If you need improvement in this area, please, make an issue at u-ctags.

@masatake
Copy link
Contributor

masatake commented Nov 30, 2021

I'm reading mini-geany.c.
It accesses kindDefinition. So separator definition attached to it can be accessed from geany.
See scopeSeparator in main/kind.h of u-ctags.

However, there is no way to access 'parserDefinition' where a default scope separator is defined.
The default scope separator is used only when per kind scopeSeparator is not defined.
No accessor for the default scope separator is for Geany now.
scopeSeparatorFor() is a good candidate as a part of API exported to Geany.
I will inspect more if you request.


How about following interface:

diff --git a/main/kind.h b/main/kind.h
index 5a9b243a..4d27a3fc 100644
--- a/main/kind.h
+++ b/main/kind.h
@@ -104,5 +104,7 @@ struct sKindDefinition {
 */
 
 extern const char *scopeSeparatorFor (langType lang, int kindIndex, int parentKindIndex);
+extern const char *scopeRootSeparatorFor (langType language);
+extern const char *scopeDefaultSeparatorFor (langType language);
 
 #endif	/* CTAGS_MAIN_KIND_H */

Of course, a parser defines separators explicitly. But it is a bit different topic.

@techee
Copy link
Member Author

techee commented Dec 1, 2021

@masatake Thanks for your detailed explanation. I don't think we really need any special API for that, I was just asking in case we were doing something stupid unnecessarily :-)

Those multiple scope separators in PHP are quite unpleasant though - our code currently assumes there's a single type of scope separator per language. I guess the easiest way for us will be to rewrite tags we receive from ctags to have a single type of scope separator. Is there any other language that uses multiple types of scope separators?

(I'm not sure if multiple types of scope separators are a good idea even if the given language uses these for certain kinds - I think tools processing ctags files won't be very happy they have to deal with this.)

I also started realizing we'll have to take "roles" into account which we currently ignore - in the go compiler for instance the role is used to distinguish between the package of the file and imported packages which are 2 different things and confuse our code:

static roleDefinition GoPackageRoles [] = {
	{ true, "imported", "imported package" },
};

{true, 'p', "package", "packages",
	  .referenceOnly = false, ATTACH_ROLES (GoPackageRoles)},

Am I right to assume that roleBitsType roleBits inside sTagEntryInfo defines whether a kind has a certain role? For instance for some role definition like

static roleDefinition SomeRoles [] = {
	{ true, "foo", "" },
	{ true, "bar", "" },
	{ true, "baz", "" },
};

5 (i.e. 101 in binary) would mean that the kind has the foo and baz roles?

@techee
Copy link
Member Author

techee commented Dec 1, 2021

@masatake After thinking about it for a while, is it actually a good idea to distinguish between imports and package definitions using a role in go (and possibly some other languages)? The way I understand semantics of roles is that they further distinguish the nature of a single kind but if this extra information is ignored, nothing bad happens. For instance in Python I think the following is alright because it just distinguishes different types of imports

/* Roles related to `import'
 * ==========================
 * import X              X = (kind:module, role:imported)
 *
 * import X as Y         X = (kind:module, role:indirectlyImported),
 *                       Y = (kind:namespace, nameref:module:X)
 *                       ------------------------------------------------
 *                       Don't confuse the kind of Y with namespace role of module kind.
 *
 * from X import *       X = (kind:module,  role:namespace)
 *
 * from X import Y       X = (kind:module,  role:namespace),
 *                       Y = (kind:unknown, role:imported, scope:module:X)
 *
 * from X import Y as Z  X = (kind:module,  role:namespace),
 *                       Y = (kind:unknown, role:indirectlyImported, scope:module:X)
 *                       Z = (kind:unknown, nameref:unknown:Y) */

But package definitions and imports of other packages are two fundamentally different things and IMO there should rather be two different kinds in go for these.

Note: Take the above as a note of someone who is lazy to do the extra work in Geany and possibly just tries to figure out how to delegate some work to others :-). What I wrote above make sense in my (very biased) brain but whatever is implemented should primarily be a good thing for ctags.

@techee
Copy link
Member Author

techee commented Dec 2, 2021

Understandable. However, I wanted to make ctags pass extracted information to client tools including Geany as is.
I thought ctags should not do too clever. Reducing information can be done in a client tool. In ctags level, just passing too much information is better than reducing information. Even if a target language doing wrong, ctags doesn't try to fix it. This basic rule makes the development ctags a bit easier.

What ctags can do is providing more API for:
(1) getting separators, and
(2) adjusting separators as a client tool wants when making tags.

OK, understood, I think no API is needed, I just converted \ to :: for PHP in Geany, no real problem there.

It seems that you misunderstand the meaning of roles.

It's true I misunderstood roles, the problem is still present though - I just created a new issue against ctags here universal-ctags/ctags#3211 so we don't pollute this thread.

@techee
Copy link
Member Author

techee commented Dec 3, 2021

@elextr What are the tags you are missing with the new parser? Is it the local variables? It is the 'l' kind you can map e.g. to tm_tag_variable_t and then you get all local variables parsed (and everything works - sidebar, going to tag definition/declaration, scope completion etc.). I just think that maybe showing local variables in the sidebar is a little verbose.

I think we should improve our tag mapping capabilities for the sidebar and also improve the maintainability of the code by moving the mappings from

switch (ft_id)

to tm_ctags.c next to the ctags mappings so they are side by side and easier to manage. It should be also made possible to disable some kinds for the sidebar while keeping them for the rest of what we use tags for like going to tag definitions, declarations etc. One example might be the local variables (but I don't use the sidebar myself much and maybe we want them) but it is also possible to get tags for function parameters - while these can be good for autocompletion or jumping in the code, we don't want to show them separately in the sidebar as they are already part of function prototypes.

#define map_CPP map_C
# define COMMON_C_NEW_PARSER \
{'h', tm_tag_undef_t}, \
{'l', tm_tag_undef_t}, \
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@elextr If you want to experiment, map the 'l' kind here to tm_tag_variable_t and you'll get all local variables parsed.

The syntax is slightly different from the previous syntax and is
described here:

https://docs.ctags.io/en/latest/parser-cxx.html

Basic usage should be the same, uctags just doesn't support Geany's
wildcard ignores like G_GNUC_*. On the other hand the new parser is
much more resilient to macros so there shouldn't be so much need
for manual ignores.

The original code is still kept for parsers from c.c that still use the
old preprocessor.
@techee
Copy link
Member Author

techee commented Dec 5, 2021

I just added a patch to pass our ignored.tags file to the new cxx parser. The new syntax is a bit different than before so we should maybe also update our documentation for that.

@techee
Copy link
Member Author

techee commented Dec 6, 2021

Just went through the various issue reports and this pull request should fix (at least) the following ones.

Fixes #2349, #1314, #1249, #1944, #2916

@elextr
Copy link
Member

elextr commented Dec 7, 2021

@techee I guess the conflict is from merging #2991

# Conflicts:
#	ctags/Makefile.am
@techee
Copy link
Member Author

techee commented Dec 7, 2021

I just resolved the conflict (only Makefile.am) for this pull request, #3031 and #3035. All the 3 pull requests are more or less ready to be merged from my perspective.

For the cxx parser here, I'd just point out 2 things to decide:

  • whether the __anon tags are fine for the sidebar for anonymous declarations
  • whether the different syntax and capabilities of the ignored.tags file are fine (but I'm afraid there's not much we can do here). If so, I'll update the documentation.

@elextr
Copy link
Member

elextr commented Dec 7, 2021

Will test soon(ish).

whether the different syntax and capabilities of the ignored.tags file are fine (but I'm afraid there's not much we can do here)

This is one of the downsides of following closely upstream, and as you say, not much we can do

I'll update the documentation.

Please.

@elextr
Copy link
Member

elextr commented Dec 7, 2021

Seems to work.

whether the __anon tags are fine for the sidebar for anonymous declarations

Hmmm, I see what you mean, looking at the typedefs at the start of tm_parser.c the struct is shown as __anon and the the typedef then shows later, but there is no link between them. So AFAICT uses of the typedef can't find the members for autocomplete?

I enabled local variables as you suggested (why I was editing tm_parser.c as above) above and yep they are found :-).

But not all (see example below) :-(.

What about function parameters?

These two will help with many simple cases to show options for local_var. or parameter->, at least in C, so for Geany development.

But its raising the question of how far should we try to go?

Without project_organiser or similar to get all symbols in the project, and also the C++ STL, or (for example) libc, GTK and Glib for Geany, we won't get much further than very simple cases. But then with all symbols gathered the current method of scope free autocomplete becomes unusable due to far too many useless options being presented. So proper scope tracking is needed. But scoping also means following the includes, which is not done now, and according to @b4n is very unlikely.

But even with all of those, the trend to inferred types is going to be a big limitation (ok, not for C but for C++ and for other languages that infer types).

For example (showing examples of STL includes and scope and inference):

#include <utility>
#include <string>
using namespace std::string_literals;
f(){
    // the type returned from make_pair is inferred from its parameters as std::pair<int, std::string>
    // the types of i and s are inferred by decomposing that return type
    auto [i, s] = std::make_pair(1, "abc"s); 
    // as above but assigning the pair undecomposed
    auto p = std::make_pair(1, "abc"s);
    s.si????
    p.se????
}

If I do ctrl+space at the ????s what should autocomplete show (answer "size" and "second").

But in fact i and s are not parsed as locals currently so "si" gets me tons of Linux signal symbols, and although p is parsed, because it is inferred, ctags gives it no type and no scope autocompletion, so it gets a motley mixture of "se" symbols from C!!!.

Still being able to get explicitly declared locals and parameters would be a big help with Geany 😄
Especially if we can fix those typedefed __anon structs.
But beyond that is something to leave to the big boys and be happy with the medium capability IMO.

@techee
Copy link
Member Author

techee commented Dec 7, 2021

Hmmm, I see what you mean, looking at the typedefs at the start of tm_parser.c the struct is shown as __anon and the the typedef then shows later, but there is no link between them. So AFAICT uses of the typedef can't find the members for autocomplete?

The disconnection between typedefs and e.g. anonymous enums isn't something new - it was already present in the old parser. The new thing here is that the name lacks the information about what kind of anonymous type it is. For instance, before, you'd get anon_enum_1, anon_struct_2 and now you just get __anon1 and __anon2. Scope autocompletion should still work, it should resolve the typedefs correctly, it's just the display in the sidebar.

If it's not a problem, I would open a separate pull request for this - it's not just a simple renaming of tag names, the problem is that these anonymous tags may appear in scopes as parents of other tags and then we have to update the scope information too. Also we should start using the information from ctags to detect whether a tag is anonymous and not depend on the detection based on just tag name.

But its raising the question of how far should we try to go?

Our scope completion but also jumping to tag definition/declaration isn't just ready for enabling local variables yet - there will be too many false positives. As you said, we can't fully rely on the scope information because we don't know what files get parsed and I think we'll have to find some hybrid approach where we use the scope information in some cases and in others not.

What about function parameters?

They should eventually get enabled too, but again, we aren't ready for them yet. We primarily need to improve our abilities to be able decide what get shown in the symbol tree and how. For instance, while we want the parameters to be parsed by ctags, we don't want to see them below the function name in the sidebar - those would be confused with local variables.

Once all the currently pending parser pull requests are merged, I'd like to move the tree mappings from symbols.c to tm_parsers and improve their flexibility and also simplify maintenance and adding new parsers. For instance for makefiles the definition would look as follows:

/* in addition to what we already have here we could specify the icon for the symbol tree here */
static TMParserMapEntry map_MAKEFILE[] = {
	{'m', tm_tag_macro_t, ICON_MACRO},
	{'t', tm_tag_function_t, ICON_METHOD},
	{'I', tm_tag_undef_t, ICON_NONE},
};

/* This would define the roots in the symbol tree, what icon is used for them and what tags are placed 
under the roots. Note that tm_tag_macro_t is a bit mask so if we wanted to add several kinds under 
the same root, we'd just xor them in the last element. */
static TMParserMapGroup group_MAKEFILE[] = {
	{_("Targets"), ICON_METHOD, tm_tag_macro_t},
	{_("Macros"), ICON_MACRO, tm_tag_function_t},
};

@elextr
Copy link
Member

elextr commented Dec 7, 2021

Scope autocompletion should still work, it should resolve the typedefs correctly, it's just the display in the sidebar.

Ok, I had "autocomplete symbols" off and it seems ctrl+space won't scope autocomplete, just name autocomplete, I didn't realise that. Both scope and name work with it on.

Scope autocomplete works for locals too 👍 (eg typing map-> at tm_parser.c:674) but as you pointed out also works outside the scope of the local 😞

we don't want to see them below the function name in the sidebar - those would be confused with local variables.

Well technically parameters are more correct than locals since they are visible for the whole of the function whereas locals are only visible from the declaration to end of scope.

Neither Vscode or Eclipse show locals or parameters in their symbols for functions, which sort of makes sense, you can't scope specify them (ie can't do funct_name::local_name or funct_name::param_name) but both show their declarations in a tooltip on hover (which I'm still ambivalent on, it can be annoying). And the locals and parameters names are styled different, but thats not something we can do because Scintilla.

Anyway this is just experimenting, not part of this PR, which so far seems to work, I would be happy for it to be merged so it gets more testing.

@techee
Copy link
Member Author

techee commented Dec 7, 2021

Neither Vscode or Eclipse show locals or parameters in their symbols for functions

Yeah, both of them should be disabled for the sidebar.

I would be happy for it to be merged so it gets more testing.

👍

@elextr
Copy link
Member

elextr commented Dec 10, 2021

Docs LGTM

@elextr
Copy link
Member

elextr commented Dec 11, 2021

If nobody raises a problem or beats me to it will squish and merge "soon"

@techee
Copy link
Member Author

techee commented Dec 11, 2021

"soon"

Yeah :-). Anyway, I really think that all the parser pull requests should be merged soon (notice - no apostrophes) to get some testing. If there are problems, we can either report them upstream and get a fixed version or just revert back to the old parser - that should be mostly trivial.

The only parser I'd like to have a look at first is #3034 and test it on Raspberry Pi to see how it works on slower hardware.

@techee
Copy link
Member Author

techee commented Dec 11, 2021

@elextr One more thing: once this pull request is merged, it will be possible to use the upstream ASM parser that depends on the new cxx preprocessor this pull request adds (the parser would have been included in the "parsers with big changes" pull request otherwise if there weren't this dependency). Should I create a separate pull request for that or add it e.g. into #3035?

@elextr
Copy link
Member

elextr commented Dec 11, 2021

"soon"

Just the weekend in case somebody has a problem, really, truely, honestly ... 😄

Should I create a separate pull request

I would, small is good, especially in pull requests.

@techee
Copy link
Member Author

techee commented Dec 16, 2021

Just the weekend in case somebody has a problem, really, truely, honestly ... 😄

Just a gentle reminder - the weekend is over, another weekend is coming :-).

@elextr
Copy link
Member

elextr commented Dec 16, 2021

Just a gentle reminder - the weekend is over, another weekend is coming :-).

Well, I didn't say which weekend did I?? 😀

Just ran out of time, hopefully soon.

@elextr elextr merged commit 5ea3e3e into geany:master Dec 19, 2021
@elextr
Copy link
Member

elextr commented Dec 19, 2021

There ya go, just had to paranoid re-check it still worked for me. 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants