Inconsistent scanning of properties within sets #28

jaynetics · 2016-11-29T20:30:19Z

Outside of sets, Regexp::Scanner recognizes two types of nonproperty:

S = Regexp::Scanner
S.scan /\p{ascii}/    # => [[:property,    :ascii, "\\p{ascii}", 0, 9]]
S.scan /\p{^ascii}/   # => [[:nonproperty, :ascii, "\\p{^ascii}", 0, 10]]
S.scan /\P{ascii}/    # => [[:nonproperty, :ascii, "\\P{ascii}", 0, 9]]

Within sets, only one of them is recognized as nonproperty:

S.scan /[\p{ascii}]/  # => [..., [:set, :ascii, "\\p{ascii}", 1, 10], ...]
S.scan /[\p{^ascii}]/ # => [..., [:nonproperty, :ascii, "\\p{^ascii}", 1, 11], ...]
S.scan /[\P{ascii}]/  # => [..., [:set, :ascii, "\\P{ascii}", 1, 10], ...]

I guess you would consider it a bug that \p{^...} does not get the type :set like everything else in a set, is that right? That would be easy to fix, and changing it would make sense for consistency.

However, even with that fix, one would still have to match the text of properties in sets against /\\(P|p\{\^)/ to detect whether they are negated. That differs from how properties are handled outside of sets, and from how classes are handled, since classes have their "polarity" encoded in the token name at index 1:

S.scan /[[:ascii:]]/  # => [..., [:set, :class_ascii, "[:ascii:]", 1, 10], ...]
S.scan /[[:^ascii:]]/ # => [..., [:set, :class_nonascii, "[:^ascii:]", 1, 10], ...]
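
To make the workaround concrete, here is a minimal sketch of what a consumer would have to do today. The token arrays are copied from the scanner output quoted above rather than produced by a live scan, and `negated_property?` is a hypothetical helper for illustration, not part of Regexp::Scanner.

```ruby
# Tokens as quoted above for properties inside a set.
# Layout: [type, token, text, start, end]
SET_TOKENS = [
  [:set,         :ascii, "\\p{ascii}",  1, 10],
  [:nonproperty, :ascii, "\\p{^ascii}", 1, 11],
  [:set,         :ascii, "\\P{ascii}",  1, 10],
]

# Same pattern as mentioned above, anchored to the start of the token text.
NEGATED_PROPERTY = /\A\\(?:P|p\{\^)/

# Hypothetical helper: the token type is not reliable inside sets,
# so negation has to be derived from the raw text at index 2.
def negated_property?(token)
  _type, _name, text = token
  text.match?(NEGATED_PROPERTY)
end

SET_TOKENS.map { |t| negated_property?(t) } # => [false, true, true]
```

If the type or token name carried the polarity, as it does for classes, the regex check on the text would not be needed.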

IMHO the ideal solution would be if all the information that tokens generate outside of sets was also available when they occur in a set. I am aware that this would require a substantial refactoring and would not be backwards compatible. But it would greatly help acting on specific tokens (escapes are another example) wherever they occur, without having to scan the token data.

Any thoughts?

ammar · 2016-12-04T15:45:39Z

I do see the inconsistency as a bug.

Looks like the in_set global is not being set correctly, so the scanner emits the inconsistent type here. I expect the same is happening in the :subset case.

The solution you suggest makes sense, and is very tempting. Having the same characters emit the same tokens, at least within sets at first, can simplify certain applications. It's a different level of consistency.

I would like to assess the amount of work needed for that change and consider the backward compatibility implications for a bit.

In the meantime, fixing the inconsistent negated property type seems like a reasonable first step.

Thanks for submitting this issue!

jaynetics mentioned this issue Dec 5, 2016

Only alphanumeric character set ranges are detected as ranges #29

Closed

jaynetics mentioned this issue Apr 17, 2017

Make property scanning in sets consistent #35

Merged

jaynetics closed this as completed in 2ef8ccf Sep 17, 2017

jaynetics mentioned this issue Nov 18, 2017

Overhaul of Set#members needed #47

Closed