New parser! #154

mrob95 · 2019-10-22T18:42:44Z

This has been bugging me all week so I finally took another look at it. Turns out I was using the slow, forgiving lark parser (Earley) rather than the fast, extremely uncooperative parser (LALR).

Headlines:

Tests down from 6s to 3s
Breathe start-up down from ~3s to <1s
Caster start-up down from ~20s to <10s
Major simplification

This is still a work in progress, although I think I have caught at least most of the potential edge cases. The only thing remaining that it doesn't like in caster is "((meth|method)|(meth|method)s)" in ide_shared.py. I don't think this should actually be valid syntax, but am willing to be persuaded.

As I said, LALR is quite unforgiving and the error messages are not brilliant so modify this at your peril :-). I tried a lot of things which I thought should have worked but didn't.

TODO:

~~Proper error messages~~
clean up directory structure, delete old parser and tests
~~more tests?~~

Questions:

What happens in apps/family/loader_parser.py? It appears to rely on the old parser but I have no idea what any of that stuff does

drmfinlay · 2019-10-23T09:36:12Z

Thanks for this Mike! I haven't had a look at the code yet, but these sound like pretty good improvements. Unfortunately, I won't have much time to check it over the next week. I'll try to get back to you.

I'm not sure what the modules in apps/family do either. I'll try using them soon.

mrob95 · 2019-10-23T11:09:48Z

No rush, it might be good to get a few people to test this with their grammars to make sure this hasn't broken anything.

I can't make head or tail of apps/family. The original commit for some of it is here but doesnt really give any details. I can't find anything which says what it might be for.

drmfinlay · 2019-11-10T03:31:22Z

Hey @mrob95, sorry I haven't been reviewing this recently. I've noticed the CI build for this issue is failing for Python 2.7 (doc tests), even though GitHub isn't displaying the CI status. I am trying to fix that problem at the moment. I'm pretty sure the Python 3.x builds would also be failing if they included the doc tests.

I wouldn't worry about the apps/family stuff for the moment.

I should be more active on here in the next week to get some issues/PRs sorted out.

mrob95 · 2019-11-10T11:49:19Z

I'll take a look - the tests seem to be passing on my machine and I haven't been seeing the CI builds until now

Edit: Okay, easy fixes. The failing 2.7 tests were (obviously) a Unicode issue - lark always outputs Unicode strings so the assertions need to reflect this.

The failing doctest was due to a misunderstanding - I had assumed that sentences would be parsed into a sequence of individual literals each one word, but it seems like they should actually just be one literal. Makes sense.

drmfinlay · 2019-11-10T23:55:56Z

Thanks for the fixes. I'll try to get the doctests working with Python 3.x today.

I was wondering if the new parser's grammar could be modified to allow quoted strings. This would start to fix #138.

mrob95 · 2019-11-11T07:44:20Z

It already allows e.g. "hello 'world'" -> Literal(u"hello 'world'"), if we needed to produce a different class than that would be easy to do. Do you know what the QuotedLiteral needs to look like?

drmfinlay · 2019-11-11T23:51:17Z

While the parser does accept quotes, it doesn't do exactly what we need for QuotedLiteral yet. Quoted text should be in a separate element without the quotes. As an example, Compound(spec='testing "quoted text"') will currently produce a single literal element Literal(['testing', '"quoted', 'text"']) instead of a Sequence:

Sequence([
    Literal(['testing']),
    QuotedLiteral(['quoted text'])
])

The QuotedLiteral.words property should return the original text without quotes. This part is easy to do in the constructor:

class QuotedLiteral(Literal):
    def __init__(self, text, name=None, value=None, default=None):
        Literal.__init__(self, text, name, value, default)
        self._words = [text]  # treat all words as one word.

I would change the parser myself, but you're already familiar with lark. #138 is low-priority, so never mind if this is too complicated. I can handle the engine tasks after QuotedLiteral is implemented.

mrob95 · 2019-11-12T08:11:58Z

Hmm, it is complicated slightly by the fact that we have to deal with words containing apostrophes as well as potentially quoted words, so there is a lot of potential for ambiguity.

Would it not be simpler to just parse everything as sequences of individual literals? Seems like this would fix the bug without anything having to be quoted or special cased.

Edit, reading that thread again I think I had things upside down. I think the best way to implement this in a clean way is to do it using a different character to ', perhaps backticks?

drmfinlay · 2019-11-12T23:27:10Z

Using sequences of individual literals should fix the problem actually, unless I'm missing something. That is effectively what Vocola does by quoting grammar words. It would also simplify the changes to the engine compilers.

mrob95 · 2019-11-13T07:34:07Z

I think it may be best for me to do that as a separate PR in any case, it seems like it might need a few rounds of testing and iteration and the commit history for this is already a mess ^^

drmfinlay · 2019-11-13T13:34:40Z

Yep, fair enough. It's low-priority anyway.

Speaking of the commit history, did you want to combine/squash some commits by rebasing before merging?

BTW I've had a look at dragonfly.apps.family. It looks like the parser parts are used for custom rule elements. The whole family sub-package seems to be a complex system for loading interconnected command config files (.txt) and library files (.py) called "families", which can span multiple directories monitored for changes.

Anyway, t4ngo never finished implementing it. I think it should be removed from dragonfly2 entirely.

mrob95 · 2019-11-13T16:40:24Z

Yeah my bad I have been meaning to clean it up. Every time you think you are getting the hang of git it finds a way to kick you in the nuts - some combination of committing from multiple different profiles and rebasing confused it and led to duplicate commits.

Regarding apps.family I agree that if it isn't fully implemented and nobody is using it then we should get rid of it. The idea of having dedicated grammar files is potentially a good one, but I think it would be easier to start fresh than try to figure out a half baked implementation.

drmfinlay · 2019-11-14T00:17:13Z

No worries, thanks for cleaning that up. Yep, Git can be tricky to use at times.

Agreed. I'll remove that in a separate PR for future reference. I'd also like to remove the Windows dialog-related classes under dragonfly.windows and some files in the root directory for pretty much the same reason.

So this is ready for merging now?

Work on the command family app was started by Christo Butcher, Dragonfly's original author, but was never finished. It is being removed prior to merging the new parser (PR #154), which is incompatible with parser code in family/loader_parser.py.

mrob95 · 2019-11-14T08:59:38Z

Yes, I think this is ready to merge. I think it is quite likely that someone will find an edge case which worked with the old parser but isn't handled by this one, but it should be simple to fix those as they come up.

drmfinlay · 2019-11-15T04:15:47Z

Okay then, I'll merge this now. Yes, issues like that can be fixed as they come up.

Thanks again for putting this together! 👍

drmfinlay assigned mrob95 Oct 23, 2019

drmfinlay added the Enhancement Enhancement existing feature label Oct 23, 2019

mrob95 mentioned this pull request Oct 24, 2019

Fix questionable syntax in ide.py synkarius/Caster#47

Merged

Add Lark-based parser for compound specs

ff6f39a

mrob95 force-pushed the parsing branch from 320dc67 to ff6f39a Compare November 13, 2019 16:35

drmfinlay mentioned this pull request Nov 14, 2019

Remove old unused files #166

Merged

drmfinlay merged commit 046cf16 into dictation-toolbox:master Nov 15, 2019

mrob95 deleted the parsing branch November 15, 2020 23:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

New parser! #154

New parser! #154

mrob95 commented Oct 22, 2019 •

edited

Loading

drmfinlay commented Oct 23, 2019

mrob95 commented Oct 23, 2019

drmfinlay commented Nov 10, 2019

mrob95 commented Nov 10, 2019 •

edited

Loading

drmfinlay commented Nov 10, 2019

mrob95 commented Nov 11, 2019

drmfinlay commented Nov 11, 2019

mrob95 commented Nov 12, 2019 •

edited

Loading

drmfinlay commented Nov 12, 2019

mrob95 commented Nov 13, 2019

drmfinlay commented Nov 13, 2019

mrob95 commented Nov 13, 2019

drmfinlay commented Nov 14, 2019

mrob95 commented Nov 14, 2019

drmfinlay commented Nov 15, 2019

New parser! #154

New parser! #154

Conversation

mrob95 commented Oct 22, 2019 • edited Loading

drmfinlay commented Oct 23, 2019

mrob95 commented Oct 23, 2019

drmfinlay commented Nov 10, 2019

mrob95 commented Nov 10, 2019 • edited Loading

drmfinlay commented Nov 10, 2019

mrob95 commented Nov 11, 2019

drmfinlay commented Nov 11, 2019

mrob95 commented Nov 12, 2019 • edited Loading

drmfinlay commented Nov 12, 2019

mrob95 commented Nov 13, 2019

drmfinlay commented Nov 13, 2019

mrob95 commented Nov 13, 2019

drmfinlay commented Nov 14, 2019

mrob95 commented Nov 14, 2019

drmfinlay commented Nov 15, 2019

mrob95 commented Oct 22, 2019 •

edited

Loading

mrob95 commented Nov 10, 2019 •

edited

Loading

mrob95 commented Nov 12, 2019 •

edited

Loading