Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent scanning of properties within sets #28

Closed
jaynetics opened this issue Nov 29, 2016 · 1 comment
Closed

Inconsistent scanning of properties within sets #28

jaynetics opened this issue Nov 29, 2016 · 1 comment

Comments

@jaynetics
Copy link
Collaborator

Outside of sets, Regexp::Scanner recognizes two types of nonproperty:

S = Regexp::Scanner
S.scan /\p{ascii}/    # => [[:property,    :ascii, "\\p{ascii}", 0, 9]]
S.scan /\p{^ascii}/   # => [[:nonproperty, :ascii, "\\p{^ascii}", 0, 10]]
S.scan /\P{ascii}/    # => [[:nonproperty, :ascii, "\\P{ascii}", 0, 9]]

Within sets, only one of them is recognized as nonproperty:

S.scan /[\p{ascii}]/  # => [..., [:set, :ascii, "\\p{ascii}", 1, 10], ...]]
S.scan /[\p{^ascii}]/ # => [..., [:nonproperty, :ascii, "\\p{^ascii}", 1, 11], ...]]
S.scan /[\P{ascii}]/  # => [..., [:set, :ascii, "\\P{ascii}", 1, 10], ...]]

And I guess you would actually see it as a bug that \p{^...} does not get the type :set like everything else in a set, is that right? That would be easy to fix, and I think it would make sense from the viewpoint of consistency to change that.

However, fixing that would still make it necessary to scan the data of properties in sets for /\\(P|p\{\^)/ to detect whether they are negative. That is different from how they are handled outside of sets and different from how classes are handled, as classes have their "polarity" encoded at index 1:

S.scan /[[:ascii:]]/  # => [..., [:set, :class_ascii, "[:ascii:]", 1, 10], ...]
S.scan /[[:^ascii:]]/ # => [..., [:set, :class_nonascii, "[:^ascii:]", 1, 10], ...]

IMHO the ideal solution would be if all the information that tokens generate outside of sets was also available when they occur in a set. I am aware that this would require a substantial refactoring and would not be backwards compatible. But it would greatly help acting on specific tokens (escapes are another example) wherever they occur, without having to scan the token data.

Any thoughts?

@ammar
Copy link
Owner

ammar commented Dec 4, 2016

I do see the inconsistency as a bug.

Looks like the in_set global is not being set correctly and emitting the inconsistent type here. I expect the same is happening in the :subset case.

The solution you suggest makes sense, and is very tempting. Having the same characters emit the same tokens, at least within sets at first, can simplify certain applications. It's a different level of consistency.

I would like to assess the amount of work needed for that change and consider the backward compatibility implications for a bit.

In the meantime, fixing the inconsistent negated property type seems like a reasonable first step.

Thanks for submitting this issue!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants