Improve set handling #55

jaynetics · 2018-04-30T14:50:08Z

This makes CharacterSet a standard Subexpression, as suggested in #47 (comment).

Tokens that also exist outside of sets now result in the same Scanner and Parser emissions inside sets.

The new CharacterSet::Range and CharacterSet::Intersection expressions represent ranges and intersections as their own subtrees.

Other notable changes are:

| example | from type, token | to type, token | from exp | to exp |
| --- | --- | --- | --- | --- |
| `[\b]` | `:set, :backspace` | `:escape, :backspace` | none/String | `ES::Backspace` |
| `[[:xy:]]` | `:set, :char_xy` | `:posixclass, :xy` | none/String | `PosixClass` |
| `[[:^xy:]]` | `:set, :char_nonxy` | `:nonposixclass, :xy` | none/String | `PosixClass` |
| `\x20` | `:escape, :hex` | `:escape, :hex` | `ES::Literal` | `ES::Hex` |
| `\x20` | `:escape, :octal` | `:escape, :octal` | `ES::Literal` | `ES::Octal` |
| `\u1234` | `:escape, :codepoint` | `:escape, :codepoint` | `ES::Literal` | `ES::Codepoint` |
| `\u{12 34}` | `:escape, :codepoint_list` | `:escape, :codepoint_list` | `ES::Literal` | `ES::CodepointList` |
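
The first table row reflects that `\b` means something different inside a set than outside of one. A quick check in plain Ruby (no regexp_parser calls involved, and the variable names are only for illustration) suggests why classifying it as an escape sequence is the more informative choice:

```ruby
backspace = "\b" # in a double-quoted string, "\b" is the backspace character "\x08"

# Inside a set, \b is the backspace escape and matches that character.
in_set = backspace.match?(/[\b]/)

# Outside a set, \b is a word-boundary anchor. A lone backspace contains
# no word characters, so there is no boundary to match.
outside_set = backspace.match?(/\b/)

puts in_set      # true
puts outside_set # false
```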

@ammar What do you think? The commit messages provide a bit more explanation if you are wondering about some of the changes, but feel free to suggest any other solution.

jaynetics · 2018-04-30T20:04:34Z

Turns out I should have read the docs...
https://github.com/k-takata/Onigmo/blob/79114095/doc/RE#L155-L156

Intersections apply to all expressions in their set, not just adjacent ones.

'abc1'.scan(/[a b \d && b c [:digit:]]/x) # => ["b", "1"]
'abc1'.scan(/[^a b \d && b c [:digit:]]/x) # => ["a", "c"]
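
To make the grouping explicit, here is a small sketch in plain Ruby that compares the flat form with one where each side of `&&` is wrapped in its own nested set. The variable names are only illustrative, and the regexes drop the `x` flag and the spaces used above:

```ruby
input = 'abc1'

# Written without explicit grouping, as in the examples above.
flat = input.scan(/[ab\d&&bc[:digit:]]/)

# Each side of && wrapped in its own nested set.
grouped = input.scan(/[[ab\d]&&[bc[:digit:]]]/)

# Negation applies to the result of the whole intersection.
negated = input.scan(/[^ab\d&&bc[:digit:]]/)

p flat    # ["b", "1"]
p grouped # ["b", "1"]
p negated # ["a", "c"]
```

Both spellings give the same result, which is consistent with `&&` splitting the whole set into two operands rather than combining only its direct neighbours.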

So maybe Intersection parse results need to look somewhat like this:

RP.parse(/[a&&b]/).first.first # =>
  #<Intersection @expressions=[
    #<Intersection::Left @expressions=[
      #<Literal @text="a"/>
    ]/>,
    #<Intersection::Right @expressions=[
      #<Literal @text="b"/>
    ]/>
  ]/>

Now that would require quite a bit of tree restructuring while parsing.

Not to mention that there can be more than one intersection:

'abc1&'.scan(/[abc && ab && bc]/x) # => ["b"]

Another option could be to treat Sets as a group of Sequences by default. That, however, might make all sets harder to handle just to support this somewhat exotic feature.
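
As a rough illustration of the "group of Sequences" idea, the following sketch models an intersection with any number of operands. `IntersectedSequence` and `SetIntersection` are hypothetical stand-ins made up for this example, not regexp_parser classes, and members are plain strings (a Range or Regexp would also work with `===`):

```ruby
# Hypothetical stand-ins for illustration only, not regexp_parser classes.
IntersectedSequence = Struct.new(:members) do
  # A sequence matches when any one of its members matches the character.
  def match?(char)
    members.any? { |member| member === char }
  end
end

SetIntersection = Struct.new(:sequences, :negative) do
  # The set matches when every sequence matches; [^...] inverts the result.
  def match?(char)
    hit = sequences.all? { |sequence| sequence.match?(char) }
    negative ? !hit : hit
  end
end

# Models [abc&&ab&&bc]: three sequences, one per operand of &&.
set = SetIntersection.new(
  [
    IntersectedSequence.new(%w[a b c]),
    IntersectedSequence.new(%w[a b]),
    IntersectedSequence.new(%w[b c])
  ],
  false
)

matches = 'abc1&'.chars.select { |char| set.match?(char) }
p matches # ["b"]
```

With this shape, a further `&&` only appends another sequence, so no left/right restructuring of the tree is needed.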

Hmmm ...

jaynetics · 2018-05-03T20:21:33Z

I'm reasonably happy with this now ...

... this fixes the `>` and `l` parts of #strfregexp_tree for Alternation (currently broken on master) and for the new Intersection expression. Maybe #level should be renamed to #group_level, and #nesting_level should become #level instead, for clarity's sake?

jaynetics added 22 commits April 13, 2018 20:41

Make set members subexpressions instead

35a5588

Merge branch 'master' into improve_set_handling

4e63c3a

# Conflicts: # lib/regexp_parser/expression/subexpression.rb # lib/regexp_parser/lexer.rb # lib/regexp_parser/parser.rb # lib/regexp_parser/scanner/scanner.rl

Make CharacterSet non-terminal

26701ef

Extract char_type scanner and use it in set and main scanner

b0dcf49

Emit properties with type :property/:nonproperty in sets, too

55efa8b

Replace member_hex and range_hex with std tokens

4b14a40

Replace set member token with literal that is not merged in Lexer

70c9097

Replace :set, :escape token by reusing escape_sequence scanner

a90a24e

Remove :escape, :space as \s is always a char type

7aba19e

Handle props and char types in sets through shared escape scanner

9fee8d7

Add more tests

259cdf0

Remove some unused tokens and expressions

bcf2213

Use std #unshift instead of overridden #insert

0014db2

Simplify whitelisting of warnings

2efd57e

Introduce Intersection and Range Subexpression classes

0be70c0

Merge branch 'master' into improve_set_handling

ab38475

# Conflicts: # lib/regexp_parser/syntax/ruby/1.8.6.rb # lib/regexp_parser/syntax/ruby/1.9.1.rb # lib/regexp_parser/syntax/ruby/2.0.0.rb # test/syntax/versions/test_1.8.rb

Revert merge mistake

5270318

Classify backspace in sets as escape sequence - more informative

203dc06

Use new Property-like CharacterClass exp, not Literal, for [:...:]

eea0bfe

Add missing codepoint and use unused hex escape classes

9205cc5

Remove wide hex escapes, not supported in Ruby >= 1.8.6

fdcc3d8

Sharpen set parse tests, extract #include? test to exp test file

e43e9d3

Add SequenceOperation to match Alternation behavior in Intersection

3d1a13c

jaynetics added 5 commits May 4, 2018 23:12

Fix and actually run CharacterClass tests, add #nesting_level test

a098c88

Rename CharacterClass->PosixClass, avoid confusion with character sets

ce37e12

Emit previously unused EscapeSequence::Octal for :escape, :octal tokens

9b7a0bd

Prepare ChangeLog entry

7df8243

Merge branch 'master' into improve_set_handling

21b7e13

# Conflicts: # ChangeLog # test/scanner/test_sets.rb # test/warnings.yml

jaynetics merged commit 87beeea into master Aug 28, 2018

jaynetics deleted the improve_set_handling branch August 28, 2018 10:47

This was referenced Sep 7, 2018

Overhaul of Set#members needed #47

Closed

Only alphanumeric character set ranges are detected as ranges #29

Closed
