Error in lexer matching rules #15

Closed
srathbun opened this issue May 8, 2012 · 2 comments

srathbun commented May 8, 2012

I have two rules that both match my input, but instead of returning the token for the first matching rule, PLY returns the broadest rule every time. Reordering the rules does not help, and I've checked that my input string contains no unexpected characters.

    tokens = (
            'FORMFEED','PAGE','ACCOUNTS','ENDSTATEMENT','START','VALIDLINE',
    )

    ##
    ## Regexes for use in tokens
    ##
    ##

    FORMFEED  = r'\f'
    PAGE      = r'\s+STATEMENT PAGE \#: 1\s*'
    ACCOUNTS  = r'=+ S H A R E  A C C O U N T S =+'
    ENDSTATEMENT = r'<\d+>=+ E N D   O F   S T A T E M E N T =+'
    VALIDLINE = r'[\S \t]+'
    START     = r'[\x00]+[ ]+'

    ##
    ## Lexer states
    ##
    states = (
    )

    # Newlines
    def t_NEWLINE(self, t):
        r'\n+'
        t.lexer.lineno += t.value.count("\n")

    @TOKEN(START)
    def t_START(self, t):
        return t

    @TOKEN(PAGE)
    def t_PAGE(self, t):
        return t

    @TOKEN(ACCOUNTS)
    def t_ACCOUNTS(self, t):
        return t

    @TOKEN(ENDSTATEMENT)
    def t_ENDSTATEMENT(self, t):
        return t

    @TOKEN(VALIDLINE)
    def t_VALIDLINE(self, t):
        return t

    @TOKEN(FORMFEED)
    def t_FORMFEED(self, t):
        return t

When I give it the input string:

=============================== S H A R E A C C O U N T S ===============================

I expect to receive an ACCOUNTS token. Instead I get a VALIDLINE token. Putting the lexer into debug mode shows that the master regex is:

'(?P<t_NEWLINE>\\n+)|(?P<t_START>[\\x00]+[ ]+)|(?P<t_PAGE>\\s+STATEMENT PAGE \\#: 1\\s*)|(?P<t_ACCOUNTS>=+ S H A R E A C C O U N T S =+)|(?P<t_ENDSTATEMENT><\\d+>=+ E N D O F S T A T E M E N T =+)|(?P<t_VALIDLINE>.+)|(?P<t_FORMFEED>\\f)'

I tested that regex directly with Python's re module and got the expected match. Why is PLY returning the wrong token?
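
For reference, that kind of check can be reproduced with the re module alone. This is a minimal sketch, not the actual test that was run: the master regex is cut down to two alternatives, the patterns are copied from the rule definitions above (so VALIDLINE is `[\S \t]+`), and the sample `line` is a shortened, made-up input. `lastgroup` reports which named alternative matched.

```python
import re

# Cut-down master regex: two alternatives only, in the same order
# as in the lexer's debug output. Compiled with no flags.
master = re.compile(
    r'(?P<t_ACCOUNTS>=+ S H A R E  A C C O U N T S =+)'
    r'|(?P<t_VALIDLINE>[\S \t]+)'
)

# Shortened sample input (assumption: two spaces between the words,
# matching the ACCOUNTS pattern as written in the rule definitions).
line = '=== S H A R E  A C C O U N T S ==='

m = master.match(line)
print(m.lastgroup)  # t_ACCOUNTS
```

With plain `re.compile`, the earlier alternative wins whenever it matches at the current position, which is why this test reports ACCOUNTS rather than VALIDLINE.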

srathbun (Author) commented May 8, 2012

After additional testing, I've determined that this is caused by the literal spaces in the regex patterns. PLY apparently behaves like flex here: whitespace in a pattern has to be written as [ ] or \s. This differs from the default behavior of the Python re module, which treats a space in the pattern as a literal space.

That is why my tests of the master regex worked with the re module but not with PLY. Should this be changed to follow Python, or should a note be added to the documentation?
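
The difference can be shown without PLY. The PLY documentation notes that rule patterns are compiled with the `re.VERBOSE` flag, and in that mode unescaped whitespace outside a character class is ignored. The sketch below compiles the ACCOUNTS pattern from the original post both ways; the sample `line` is a shortened, made-up input.

```python
import re

# Pattern copied from the ACCOUNTS rule in the original post.
ACCOUNTS = r'=+ S H A R E  A C C O U N T S =+'

# Shortened sample input (assumption, not the real statement line).
line = '=== S H A R E  A C C O U N T S ==='

plain = re.compile(ACCOUNTS)                # how the manual re test compiled it
verbose = re.compile(ACCOUNTS, re.VERBOSE)  # closer to how PLY compiles rules

print(plain.match(line) is not None)    # True
print(verbose.match(line) is not None)  # False

# Under re.VERBOSE the pattern is effectively '=+SHAREACCOUNTS=+'.
print(verbose.match('===SHAREACCOUNTS===') is not None)  # True
```

Since the ACCOUNTS alternative no longer matches the spaced-out input, the lexer falls through to the next rule that does, which here is VALIDLINE.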

dabeaz (Owner) commented Apr 17, 2015

I've added a comment in the documentation about this.
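
For anyone landing here, a sketch of how two of the patterns from the original post could be rewritten so they survive `re.VERBOSE` compilation. These rewrites are hypothetical, not taken from the documentation, and are checked here with the re module only, not with PLY itself. Every literal space is placed in a character class, where verbose mode keeps it. The sample inputs are made up.

```python
import re

# Hypothetical whitespace-safe rewrites of two patterns from the post.
# '[ ]' is a literal space that re.VERBOSE does not strip.
ACCOUNTS = r'=+[ ]S[ ]H[ ]A[ ]R[ ]E[ ]{2}A[ ]C[ ]C[ ]O[ ]U[ ]N[ ]T[ ]S[ ]=+'
# '\#' stays escaped: an unescaped '#' starts a comment in verbose mode.
PAGE = r'\s+STATEMENT[ ]PAGE[ ]\#:[ ]1\s*'

accounts_re = re.compile(ACCOUNTS, re.VERBOSE)
page_re = re.compile(PAGE, re.VERBOSE)

# Made-up sample inputs for illustration.
accounts_line = '=== S H A R E  A C C O U N T S ==='
page_line = '   STATEMENT PAGE #: 1  '

print(accounts_re.match(accounts_line) is not None)  # True
print(page_re.match(page_line) is not None)          # True
```

Using `\s` instead of `[ ]` also works, but it additionally matches tabs and newlines, so `[ ]` is the closer equivalent of a literal space.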

dabeaz closed this as completed Apr 17, 2015