Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New parser! #154

Merged
merged 1 commit into from
Nov 15, 2019
Merged

New parser! #154

merged 1 commit into from
Nov 15, 2019

Conversation

mrob95
Copy link
Member

@mrob95 mrob95 commented Oct 22, 2019

This has been bugging me all week so I finally took another look at it. Turns out I was using the slow, forgiving lark parser (Earley) rather than the fast, extremely uncooperative parser (LALR).

Headlines:

  • Tests down from 6s to 3s
  • Breathe start-up down from ~3s to <1s
  • Caster start-up down from ~20s to <10s
  • Major simplification

This is still a work in progress, although I think I have caught at least most of the potential edge cases. The only thing remaining that it doesn't like in caster is "((meth|method)|(meth|method)s)" in ide_shared.py. I don't think this should actually be valid syntax, but am willing to be persuaded.

As I said, LALR is quite unforgiving and the error messages are not brilliant so modify this at your peril :-). I tried a lot of things which I thought should have worked but didn't.

TODO:

  • Proper error messages
  • clean up directory structure, delete old parser and tests
  • more tests?

Questions:

  • What happens in apps/family/loader_parser.py? It appears to rely on the old parser but I have no idea what any of that stuff does

@drmfinlay
Copy link
Member

Thanks for this Mike! I haven't had a look at the code yet, but these sound like pretty good improvements. Unfortunately, I won't have much time to check it over the next week. I'll try to get back to you.

I'm not sure what the modules in apps/family do either. I'll try using them soon.

@drmfinlay drmfinlay added the Enhancement Enhancement existing feature label Oct 23, 2019
@mrob95
Copy link
Member Author

mrob95 commented Oct 23, 2019

No rush, it might be good to get a few people to test this with their grammars to make sure this hasn't broken anything.

I can't make head or tail of apps/family. The original commit for some of it is here but doesnt really give any details. I can't find anything which says what it might be for.

@drmfinlay
Copy link
Member

Hey @mrob95, sorry I haven't been reviewing this recently. I've noticed the CI build for this issue is failing for Python 2.7 (doc tests), even though GitHub isn't displaying the CI status. I am trying to fix that problem at the moment. I'm pretty sure the Python 3.x builds would also be failing if they included the doc tests.

I wouldn't worry about the apps/family stuff for the moment.

I should be more active on here in the next week to get some issues/PRs sorted out.

@mrob95
Copy link
Member Author

mrob95 commented Nov 10, 2019

I'll take a look - the tests seem to be passing on my machine and I haven't been seeing the CI builds until now

Edit: Okay, easy fixes. The failing 2.7 tests were (obviously) a Unicode issue - lark always outputs Unicode strings so the assertions need to reflect this.

The failing doctest was due to a misunderstanding - I had assumed that sentences would be parsed into a sequence of individual literals each one word, but it seems like they should actually just be one literal. Makes sense.

@drmfinlay
Copy link
Member

Thanks for the fixes. I'll try to get the doctests working with Python 3.x today.

I was wondering if the new parser's grammar could be modified to allow quoted strings. This would start to fix #138.

@mrob95
Copy link
Member Author

mrob95 commented Nov 11, 2019

It already allows e.g. "hello 'world'" -> Literal(u"hello 'world'"), if we needed to produce a different class than that would be easy to do. Do you know what the QuotedLiteral needs to look like?

@drmfinlay
Copy link
Member

While the parser does accept quotes, it doesn't do exactly what we need for QuotedLiteral yet. Quoted text should be in a separate element without the quotes. As an example, Compound(spec='testing "quoted text"') will currently produce a single literal element Literal(['testing', '"quoted', 'text"']) instead of a Sequence:

Sequence([
    Literal(['testing']),
    QuotedLiteral(['quoted text'])
])

The QuotedLiteral.words property should return the original text without quotes. This part is easy to do in the constructor:

class QuotedLiteral(Literal):
    def __init__(self, text, name=None, value=None, default=None):
        Literal.__init__(self, text, name, value, default)
        self._words = [text]  # treat all words as one word.

I would change the parser myself, but you're already familiar with lark. #138 is low-priority, so never mind if this is too complicated. I can handle the engine tasks after QuotedLiteral is implemented.

@mrob95
Copy link
Member Author

mrob95 commented Nov 12, 2019

Hmm, it is complicated slightly by the fact that we have to deal with words containing apostrophes as well as potentially quoted words, so there is a lot of potential for ambiguity.

Would it not be simpler to just parse everything as sequences of individual literals? Seems like this would fix the bug without anything having to be quoted or special cased.

Edit, reading that thread again I think I had things upside down. I think the best way to implement this in a clean way is to do it using a different character to ', perhaps backticks?

@drmfinlay
Copy link
Member

Using sequences of individual literals should fix the problem actually, unless I'm missing something. That is effectively what Vocola does by quoting grammar words. It would also simplify the changes to the engine compilers.

@mrob95
Copy link
Member Author

mrob95 commented Nov 13, 2019

I think it may be best for me to do that as a separate PR in any case, it seems like it might need a few rounds of testing and iteration and the commit history for this is already a mess ^^

@drmfinlay
Copy link
Member

Yep, fair enough. It's low-priority anyway.

Speaking of the commit history, did you want to combine/squash some commits by rebasing before merging?

BTW I've had a look at dragonfly.apps.family. It looks like the parser parts are used for custom rule elements. The whole family sub-package seems to be a complex system for loading interconnected command config files (.txt) and library files (.py) called "families", which can span multiple directories monitored for changes.

Anyway, t4ngo never finished implementing it. I think it should be removed from dragonfly2 entirely.

@mrob95
Copy link
Member Author

mrob95 commented Nov 13, 2019

Yeah my bad I have been meaning to clean it up. Every time you think you are getting the hang of git it finds a way to kick you in the nuts - some combination of committing from multiple different profiles and rebasing confused it and led to duplicate commits.

Regarding apps.family I agree that if it isn't fully implemented and nobody is using it then we should get rid of it. The idea of having dedicated grammar files is potentially a good one, but I think it would be easier to start fresh than try to figure out a half baked implementation.

@drmfinlay
Copy link
Member

No worries, thanks for cleaning that up. Yep, Git can be tricky to use at times.

Agreed. I'll remove that in a separate PR for future reference. I'd also like to remove the Windows dialog-related classes under dragonfly.windows and some files in the root directory for pretty much the same reason.

So this is ready for merging now?

drmfinlay added a commit that referenced this pull request Nov 14, 2019
Work on the command family app was started by Christo Butcher,
Dragonfly's original author, but was never finished.

It is being removed prior to merging the new parser (PR #154),
which is incompatible with parser code in family/loader_parser.py.
@mrob95
Copy link
Member Author

mrob95 commented Nov 14, 2019

Yes, I think this is ready to merge. I think it is quite likely that someone will find an edge case which worked with the old parser but isn't handled by this one, but it should be simple to fix those as they come up.

@drmfinlay
Copy link
Member

Okay then, I'll merge this now. Yes, issues like that can be fixed as they come up.

Thanks again for putting this together! 👍

@drmfinlay drmfinlay merged commit 046cf16 into dictation-toolbox:master Nov 15, 2019
@mrob95 mrob95 deleted the parsing branch November 15, 2020 23:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Enhancement existing feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants