Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ERL-31: leex ignores {m,n} and {m} repetition quantifiers #3053

Closed
OTP-Maintainer opened this issue Oct 21, 2015 · 8 comments
Closed

ERL-31: leex ignores {m,n} and {m} repetition quantifiers #3053

OTP-Maintainer opened this issue Oct 21, 2015 · 8 comments
Assignees
Labels
enhancement priority:medium team:PO Assigned to OTP team PO wontfix Issue will not be fixed by OTP

Comments

@OTP-Maintainer
Copy link

Original reporter: sunaku
Affected version: OTP-18.1
Component: parsetools
Migrated from: https://bugs.erlang.org/browse/ERL-31


Hello,

I'm using Erlang/OTP 18.1 ([erts-7.1] [source] [64-bit] [smp:24:24]
[async-threads:10] [hipe] [kernel-poll:false]) where I find that leex
does not honor extended regexp numerical repetition quantifiers [1].

For example, I'm trying to recognize exactly 11, 10, or 7 uppercase
alphanumeric characters as a product_id in a leex rule, as follows:
{code:erlang}
[A-Z0-9]{10,11}|[A-Z0-9]{7}
           : {token, {product_id, TokenLine, TokenChars}}.
{code}
However, that didn't work, so I tried using AWK syntax [2] exactly:
{code:erlang}
[A-Z0-9]\{10,11\}|[A-Z0-9]\{7\}
           : {token, {product_id, TokenLine, TokenChars}}.
{code}
However, that didn't work either, so I was forced to expand it out:
{code:erlang}
[A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9]|[A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9]|[A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9]
           : {token, {product_id, TokenLine, TokenChars}}.
{code}
This is a giant step backwards in terms of regexp readability and
maintainability, so please make leex honor the extended regexp
numerical repetition quantifiers [1] since AWK supports them [2].

Furthermore, it seems that this work had already been started [3]
but, for reasons unknown to me, was commented-out in the code [4].

Thanks for your consideration

[1] http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html#tag_09_03_06
[2] http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html#tag_09_04
[3] https://github.com/erlang/otp/blob/maint/lib/parsetools/src/leex.erl#L138
[4] https://github.com/erlang/otp/blob/maint/lib/parsetools/src/leex.erl#L692-L700
@OTP-Maintainer
Copy link
Author

jlouis said:

You can probably handle this by simply recognizing [A-Z0-9]+ and then handle it in a function:
{code:erlang}
[A-Z0-9]+ : recognize_length(TokenLine, TokenChars)

recognize_length(Line, Chs) when length(Chs) >= 10 andalso length(Chs) =< 11 ->
    {token, {product_id, Line, Chs}};
recognize_length(Line, Chs) when length(Chs) == 7 ->
    {token, {product_id, Line, Chs}};
recognize_length(_, _) -> {error, blargh}.
{code}
I'm not really sure why you want to extend leex with extensions to its regex engines. It isn't that common in lexers to provide these kinds of extensions, based on my knowledge of them over the years.

@OTP-Maintainer
Copy link
Author

sunaku said:

Thanks for suggesting a workaround.  However, in my humble opinion, the workaround feels laborious in comparison to just specifying numerical repetition quantifiers.  In addition, I feel it may become more complicated when having to workaround more complex regular expressions that nest numerical quantifiers among greedy ones, like this: 
{noformat}
(([[:alnum:]]{10,11}\.){3,5})+
{noformat}

Since you considered numerical repetition quantifiers as an "extension" to leex's regex engine, I looked into [AWK's regex specification](http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html#tag_09_03_06) again more closely and found that numerical repetition quantifiers (also known as "BREs Matching Multiple Characters") are part of AWK's general (or _basic_) regular expression syntax (and not the extended syntax as I had previously assumed --- I was wrong about that; sorry for the confusion).

As a result, this is all the more reason for leex to support numerical repetition quantifiers (or "BREs Matching Multiple Characters") because they're part of AWK's basic regular expressions.

Thanks for your consideration.

@OTP-Maintainer
Copy link
Author

jlouis said:

I'm calling "isn't really needed". Usually, you would just tokenize your product_id blindly and then handle the verification at a later stage, either in the parser or even later in the stages. It is also this reason there has been no pressing need for such an extension. The {m} and {m,n} repetitions are fairly easy to implement by unrolling the DFA construction the right number of times, but I'm sure it would be that useful.

@OTP-Maintainer
Copy link
Author

sunaku said:

Alright, what about the fact that this "extension" seems to have been already implemented [1] but partially commented-out [2] in the codebase?

Would you be open to accepting a patch / pull request if I restored this "extension" to proper working order?

Thanks for your consideration.

[1] https://github.com/erlang/otp/blob/maint/lib/parsetools/src/leex.erl#L138
[2] https://github.com/erlang/otp/blob/maint/lib/parsetools/src/leex.erl#L692-L700

@OTP-Maintainer
Copy link
Author

kenneth said:

Changed type from Bug to New feature since we don't think this is a bug. The leex documentation does not mention anything about support for repetition quantifiers.

@OTP-Maintainer
Copy link
Author

rvirding said:

A major problem with this is the \{M\} pair is already in use for macro expansion. How would this be handled in that case? Doing a \ \{ \ \} would not work as this is needed now to quote the \{ and \} .

Also it would break consistency if some control characters apply with \ and some don't.

So I am against this suggestion.

@OTP-Maintainer
Copy link
Author

kenneth said:

As Robert Virding is the original developer of leex we follow his advice and reject this suggestion.
Ther reasons are then: 1) Would cause backward compatibility problems,
2) Can't see that it is important enough to motivate 1.

@OTP-Maintainer
Copy link
Author

kenneth said:

See my last comment before this one.

@OTP-Maintainer OTP-Maintainer added wontfix Issue will not be fixed by OTP enhancement team:PO Assigned to OTP team PO priority:medium labels Feb 10, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement priority:medium team:PO Assigned to OTP team PO wontfix Issue will not be fixed by OTP
Projects
None yet
Development

No branches or pull requests

2 participants