Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for Kotlin tags #3034

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

dolik-rce
Copy link
Contributor

This is a third incarnation of PR to add support for Kotlin language. Previous discussions can be found in #2778 and dolik-rce#1.

Included changes:

  • Small modification of update-ctags.sh, so it can import peg based parsers from ctags.
  • Pull Kotlin parser from ctags and integrate it into tagmanager.
  • Rename filetypes.Kotlin.conf to filetypes.kotlin, to make it work correctly with tagmanager.
  • Added test for kotlin tags.

Disclaimer: The parser was written (and is maintained) by me.

@@ -13,12 +13,18 @@
dstdir = os.path.abspath(sys.argv[2])

os.chdir(dstdir + '/parsers')
parser_dst_files = glob.glob('*.c') + glob.glob('*.h')
parser_dst_files = glob.glob('**/*.c') + glob.glob('**/*.h')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I would slightly prefer mirroring the ctags directory structure - i.e. having peg outside the parsers directory. That would mean explicitly mentioning peg and parsers directories in the globs and in the code below eliminating the

    if os.path.exists('parsers/' + f):
         shutil.copy('parsers/' + f, dstdir + '/parsers/' + f)

part.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I can add ctags/peg/ directory.

name�16384�com.example.helloworld.Worker�0
process�16�com.example.helloworld.AbstractWorker�0
process�16�com.example.helloworld.Worker�0
result�16384�com.example.helloworld.main�0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't checked but does the unit test cover all kinds from TMParserMapEntry map_KOTLIN?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it covers all 8 tag kinds.

@techee
Copy link
Member

techee commented Dec 5, 2021

Looks good to me in general. I'm not familiar with the peg parsers - I assume these are generated and one has to always run make before so the sources get created, right?

@dolik-rce
Copy link
Contributor Author

I'm not familiar with the peg parsers - I assume these are generated and one has to always run make before so the sources get created, right?

Yes, they are not committed in the git, but they are generated by packcc as needed. At the very least make peg/kotllin.c should be called, otherwise the update script fails. Should I add the call in the script?

@techee
Copy link
Member

techee commented Dec 6, 2021

Yes, they are not committed in the git, but they are generated by packcc as needed. At the very least make peg/kotllin.c should be called, otherwise the update script fails. Should I add the call in the script?

Maybe a good idea, yes. It could be easily forgotten otherwise.

os.system('make -C {} peg/{}'.format(srcdir, ' peg/'.join(peg_dst_files)))
print('Copying peg parsers... ({} files)'.format(len(peg_dst_files)))
for f in peg_dst_files:
shutil.copy(f, dstdir + '/peg')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the code does more or less the same thing twice, it could be changed to something like the following:

os.chdir(srcdir + '/peg')
os.system('make -C {} peg/{}'.format(srcdir, ' peg/'.join(peg_dst_files)))
for dir in ['parsers', 'peg']:
    ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was actually thinking that the script could use some "readability improvements". Like using functions and giving it some structure... But it seemed a bit out of scope for this PR, so I stayed with the current style.

Would you mind if I implemented it in follow-up PR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind if I implemented it in follow-up PR?

Yep, agreed. #3032 will do some more changes to the script so maybe better wait until both of these pull requests are merged.

@techee
Copy link
Member

techee commented Dec 6, 2021

I just noticed the size of the parser - 25klocs and 1MB of sources. Have you checked how much the size of stripped libgeany.so increases by this patch?

@elextr
Copy link
Member

elextr commented Dec 7, 2021

@techee maybe its time for "somebody" to add the capability for uctags to load parsers from dll :-)

Although the latest Ubuntu uctags package has an installed size of just 2Mb, and we don't use it all (yet :-)

Of more interest is how fast (slow) is it, remembering it runs while editing, will Kotlists get big pauses?

@dolik-rce
Copy link
Contributor Author

I just noticed the size of the parser - 25klocs and 1MB of sources. Have you checked how much the size of stripped libgeany.so increases by this patch?

Yes, the generated code is humongous. I guess it is price for implementing full Kotlin grammar. I have tried to check the library size with make && strip --strip-unneeded. Current master is about 3.57 MB, this branch yields 3.72 MB, so roughly 150kB. I agree it is quite a lot, but still manageable I hope.

Of more interest is how fast (slow) is it, remembering it runs while editing, will Kotlists get big pauses?

Frankly, the speed could be better. This has already have been discussed when I was adding the parser to ctags and it even resulted in some improvements in PackCC (which is used to generate the code), but it is still slower than handwritten parsers. You can have a look at universal-ctags/ctags#2866 and universal-ctags/ctags#2878 if you're interested.

Anyway, the pauses during editing are not noticeable for reasonably sized files. The only situation where the speed might be annoying is when user opens big kotlin project. But I'm not sure how this actually works, does Geany always process all open files on start? Or is there some kind of caching? I think some of the projects-related plugin even allowed to set option to index all files, even those that are not open. This could then take quite some time.

@elextr
Copy link
Member

elextr commented Dec 7, 2021

does Geany always process all open files on start? Or is there some kind of caching? I think some of the projects-related plugin even allowed to set option to index all files, even those that are not open. This could then take quite some time.

IIUC yes, no, yes, probably

@elextr
Copy link
Member

elextr commented Dec 7, 2021

Implementing the whole grammar has problems for this use-case, not only speed, but also robustness. It is being used on files that are being edited and may be eronneous or incomplet.

Unless the parser can recover from an error it is going to lose all or many symbols in more cases if it parses things that are not essential to detecting the tags since it will then find more errors. It is a problem with many languages, not just kotlin (see for example #2916), but the fact many tag parsers skip lots of code reduces the impact of partly edited code (outside of declarations).

This is also why full fat IDEs to try to insist that at least the the structural syntax is complete (eg {}s match) so their parsers have some handles to resync and recover after errors.

@dolik-rce
Copy link
Contributor Author

The kotlin parser can recover from errors. If it fails to parse something, it continues until it finds next parseable top-level object (e.g. function or class). The backtracking parser is actually quite good for this.

I did try at first to write simple, handwritten parser that only gets the important tags. However the problem with Kotlin is that the grammar is quite complex and it lead to way too many inaccuracies. That is why I had to implement it all in the end. Also, it was very hard to track scope, when only some parts of the code were parsed.

@techee
Copy link
Member

techee commented Dec 7, 2021

Yes, the generated code is humongous. I guess it is price for implementing full Kotlin grammar. I have tried to check the library size with make && strip --strip-unneeded. Current master is about 3.57 MB, this branch yields 3.72 MB, so roughly 150kB. I agree it is quite a lot, but still manageable I hope.

It's not that bad, I guess I'm still stuck with the "it must fit a floppy" mentality.

Implementation Time Avg. per file Tags added
C 4 s 66 us 567695
RegExp 155 s 2.5 ms 527908
Peg 523s 8.6 ms 782500

This is quite surprising, I would have expected the performance to be somewhere between a hand-written parser and a regex parser but it probably depends on the grammar and the lookahead it has to perform. Have you tried profiling the parser to see if there's not some unexpected bottleneck? With hand-written parsers I used the following:

https://wiki.geany.org/howtos/profiling/gperftools

Unless the parser can recover from an error it is going to lose all or many symbols in more cases if it parses things that are not essential to detecting the tags since it will then find more errors. It is a problem with many languages, not just kotlin (see for example #2916), but the fact many tag parsers skip lots of code reduces the impact of partly edited code (outside of declarations).

To be fair, hand-written ctags parsers aren't that great at error recovery either - when you type an extra { at the beginning of a file, it won't get parsed.

@techee
Copy link
Member

techee commented Dec 7, 2021

Of more interest is how fast (slow) is it, remembering it runs while editing, will Kotlists get big pauses?

If the table above is valid and from what I remember the C parser speed is about 50MB of sources per second and the kotlin parser is about 100x slower than the C parser, it gives us 500KB/s for kotlin. And the pauses should be just something like 0.2s not to be noticeable so about 100KB kotlin source files would be roughly the maximum for Geany.

@dolik-rce
Copy link
Contributor Author

This is quite surprising, I would have expected the performance to be somewhere between a hand-written parser and a regex parser but it probably depends on the grammar and the lookahead it has to perform. Have you tried profiling the parser to see if there's not some unexpected bottleneck?

I did a lot of profiling on PackCC and it turns out it is not very efficient with memory allocations (see universal-ctags/ctags#2866 (comment)), the backtracking is constantly allocating and deallocating small chunks of memory. So, more rules leads to more backtracking, which leads to more allocations, which slows it down. I have even proposed couple possible optimizations, but I did not have from enough manpower to actually implement it (sadly, I am not very proficient in C).

I might look into the grammar optimization in future, so the parser might become somewhat faster (especially, if it slows Geany down, that would annoy me a lot). Probably could beat the RegExp based ones (for comparable grammars). According to creator of ctags, the peg parser is about 15-30 times slower than handwritten parser (citation from universal-ctags/ctags#2866 (comment)). Even with grammar optimizations, it would still probably be about an order of magnitude slower than let's say current C parser.

@techee
Copy link
Member

techee commented Dec 12, 2021

Hmm, I'm afraid I don't have very good news regarding the performance. I tried it on Raspberry Pi 400 (which should be slightly faster than regular Raspberry Pi 4) with the following file which IMO isn't that big

https://github.com/JetBrains/Exposed/blob/master/exposed-core/src/main/kotlin/org/jetbrains/exposed/sql/SQLExpressionBuilder.kt

and the lag was already very annoying (to the point that I'd much rather have parsing disabled than editing this way). IMO Raspberry Pi 4 is a reasonable low-end machine Geany should support without big speed problems (apart from that Geany is the semi-official Raspberry Pi editor as it's installed by default) and the 500 LOC file isn't anything extreme either. So I'm not sure whether this parser should be used in Geany. It's shame that PEG parsers are so slow currently - if they were e.g. 10x slower than hand-written parsers, it would be fine I think but 100x is a little too much.

By the way I also tried this file

https://github.com/JetBrains/Exposed/blob/master/exposed-core/src/main/kotlin/org/jetbrains/exposed/sql/SQLExpressionBuilder.kt

which seems to contain many anonymous functions and it seems to be a common pattern in Kotlin files and seeing the symbol tree filled with <lambda> doesn't seem that useful to me so maybe these shouldn't be reported to Geany. In any case, the performance issue seems to be a bigger problem though.

@dolik-rce
Copy link
Contributor Author

That is really surprising ☹️ I'd expect RPi 4 to handle this much better. I don't have any slower hardware readily available right now, but I have tested it with cpulimit -l 50 -i geany SQLExpressionBuilder.kt, which on my i5 should mean it should get about 1.2GHz and was still pretty smooth. There might of course also be some difference due to memory speed.

Well, I started this as a "Christmas project" last year, so I guess I'll just have a another one for this year: to make it faster.

... seems to contain many anonymous functions and it seems to be a common pattern in Kotlin files and seeing the symbol tree filled with doesn't seem that useful to me so maybe these shouldn't be reported to Geany.

Yes, lambdas can be quite common in Kotlin code and you're right, that they are probably useless in the symbol tree most of the time. I think the only valid reason to keep them is that there might be something else declared in their scope. Perhaps we could filter out all lambdas without any child nodes?

@techee
Copy link
Member

techee commented Dec 12, 2021

There might of course also be some difference due to memory speed.

That's quite probable.

Well, I started this as a "Christmas project" last year, so I guess I'll just have a another one for this year: to make it faster.

I know those! #3032 is Christmas 2018 :-).

I think the only valid reason to keep them is that there might be something else declared in their scope.

That's a good point. Filtering tags should in theory be possible, the question is how many such things we should do in Geany. But maybe anonymous functions are quite a common pattern to deserve this special handling. I'm just preparing a pull request that improves handling of anonymous tags so I might add something like that.

@dolik-rce dolik-rce force-pushed the add-kotlin-support-3 branch 2 times, most recently from fa83ce4 to fca1f25 Compare October 31, 2023 17:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants