Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with
or
.
Download ZIP

Loading…

Merge subpattern references #18

Open
wants to merge 134 commits into from

2 participants

@nbtrap

Subpattern references enable the matching of self-similar strings by
way of recursion. Unlike backreferences, which refer to the string
matched by a register, subpattern references refer to the pattern
contained within the register and cause the regex engine to recurse,
as though by an actual function call, to the referenced subpattern.

SYNTAX

A subpattern reference node has the form

(:SUBPATTERN-REFERENCE <ref>)

where is a positive fixnum denoting a register number or a
string or symbol denoting a register name.

Using the Perl syntax, a subpattern reference looks like

(?N)

or
(?&NAME)

where N is a positive (decimal) integer and NAME is a register name.

API CHANGES

There are no API changes.

KNOWN ISSUES

Perl Incompatibilities
======================
The semantics of subpattern references (or "sub calls") in Perl are
not well defined. In particular, as of version 5.19.9, the
interaction between subpattern references and backreferences is
inconsistent. This issue was recently raised on the p5p mailing
list, and the Perl devs seem to be seriously considering adopting
the semantics implemented here. See
https://rt.perl.org/Public/Bug/Display.html?id=121299 for details.

Embedded Modifiers
==================
The interaction between subpattern references and embedded modifiers
(e.g. :CASE-INSENSITIVE-P) is undefined for now and will be
addressed in a future release.

AllegroCL Compatibility Mode
===========================
So far as I know, the AllegroCL compatibility mode (enabled by
adding :USE-ACL-REGEXP2-ENGINE to FEATURES before compiling) does
not support this feature.

Other Bugs
==========
Several outstanding bugs are known to at least indirectly affect
subpattern references. Cf. #17 and #12, for example.

IMPLEMENTATION DETAILS

During the match phase, the subpattern reference closure calls the
register closure, passing it an extra argument: the match
continuation.

When the register closure sees that it has been called with an extra
argument, it knows that it has been entered via subpattern
reference. At this point, it saves the state of the local
registers' offsets and creates new dynamic "bindings" for them.
Then it calls the register's inner matcher, restoring the register
offsets state upon return therefrom. If the inner matcher has
succeeded, the subpattern reference's continuation is called.

The presence of one or more subpattern references precludes certain
optimizations. However, the performance for existing code (i.e.,
for regular expressions not containing subpattern references) should
be unaffected hereby.

OTHER CHANGES

The testing code has been overhauled. Of note:

1. The Perl script that generates many of tests has been modified,
among other things, to print results for as many capture groups as
are defined by each regex, but no more.  (The purpose of this was
to support tests involving arbitrarily many capture groups.)
Thus, much of perltestdata file seems to have changed, but it's
mostly superficial.

2. Perl tests are now run with *ALLOW-NAMED-REGISTERS* bound to T
when the regex contains one or more named registers.

3. Many more tests have been added to *TESTS-TO-SKIP*.  These are
by and large a result of Perl's undefined behavior vis-a-vis
subpattern references.
nbtrap added some commits
@nbtrap nbtrap Add RECURSIVE-SUBPATTERN class. 90ebc6b
@nbtrap nbtrap Add lexer tokens for (?R) and (?&NAME) sequences. bf19675
@nbtrap nbtrap Use CASE option instead of IF expression. 9bf3bf0
@nbtrap nbtrap Optionally make PARSE-REGISTER-NAME-AUX parse from within (?&name) co…
…nstruct.
35c05dd
@nbtrap nbtrap Make (?<digit>+) and (?&NAME) constructs <legal-token>s. 5adc50c
@nbtrap nbtrap Parse subpattern references correctly. ca8f723
@nbtrap nbtrap Make note to support (?0) and (?R). da2bba6
@nbtrap nbtrap Rename RECURSIVE-SUBPATTERN -> SUBPATTERN-REFERENCE. 6a493f8
@nbtrap nbtrap Beginnings of code to check for invalid subpattern references. 95c400a
@nbtrap nbtrap Add to doc comment of CONVERT-AUX. b4697b6
@nbtrap nbtrap Add shell of method for handling subpattern reference parse-tree nodes. ceecaf8
@nbtrap nbtrap Give flesh to CONVERT-COMPOUND-PARSE-TREE.
Be sure to keep track of named subpattern references as well as the
highes numbered subpattern reference encountered.
561228d
@nbtrap nbtrap Rename variable NAMED-SUBPATTERN-REFS-SEEN -> NAMED-SUBPATTERN-REFS. 82fdd76
@nbtrap nbtrap Convert named subpattern refs to numbered subpattern refs inside CONV…
…ERT.

Also, keep track of which registers have been referenced by number.
514cb45
@nbtrap nbtrap Return a fifth value from CONVERT, namely, the list of numbers of sub…
…patterns referenced in the regex.
a2a9632
@nbtrap nbtrap Create binding of special variable REFERENCED-REGISTER-MATCHERS-PLIST. 8910783
@nbtrap nbtrap Define the closure that matches subpattern references, and modify the…
… register closure.

This required several things that may not have been necessary and will
have to be revisited.  First of all, for every register, we now create
two inner matchers: one that matches the contents of the register and
what follows the register, and one that only matches the contents of
the register.  Also, we now stop accumulating into STARTS-WITH once we
encounter a register or subpattern reference.

With this patch, subpattern references seem to work for the most part.
They do not yet work with repetitions.
eb3d9ec
@nbtrap nbtrap Define REGEX-LENGTH on SUBPATTERN-REFERENCE. ce1b0fe
@nbtrap nbtrap Define COPY-REGEX and COMPUTE-OFFSETS on SUBPATTERN-REFERENCE.
At this point, one thing that doesn't work quite right is the
determination of register offsets for registers accessed indirectly by
subpattern references.  For example:

  (cl-ppcre:scan "(\\([^()]*((?1)\\)|\\)))" "((()))")

says that the second register is at position (3, 6), though it should
be (1, 6).  Fixing this will require binding a special variable from
subpattern reference closures that tells register closures not to
touch the register offsets.
972dba5
@nbtrap nbtrap Be sure not to touch register offsets when matching a register indire…
…ctly from a subpattern reference.

With this patch, the following invocation:
  (cl-ppcre:scan "(\\([^()]*((?1)\\)|\\)))" "((()))")
gives the correct offset values for the second register as (1,6).

One problem that remains is the danger of infinite recursion during
backtracking.  The following invocation:
  (cl-ppcre:scan "(?1)(?2)(a|b|(?1))(c)" "acba")

causes a stack overflow because the second (?1) is called endlessly
during backtracking without the match position advancing through the
string.  Such behavior may be able to be remedied by having the
subpattern reference's closure keep track of where in *STRING* it has
been called before.
b7ab328
@nbtrap nbtrap Don't backtrack through subpattern references ad infinitum.
This is going to be reverted immediately, since apparently Perl isn't
smart enough to do this and will itself overflow the stack.
69f0d7c
@nbtrap nbtrap Revert "Don't backtrack through subpattern references ad infinitum."
This reverts commit 69f0d7c.
55a48e0
@nbtrap nbtrap Use COND instead of multi-branch IF. 38986bd
@nbtrap nbtrap Merge branch 'master' into subpattern-references d79af78
@nbtrap nbtrap Add tests for subpattern references.
1634 and 1635 currently don't work.
5de7565
@nbtrap nbtrap Remove unused variable declaration. fa60aff
@nbtrap nbtrap Skip end of string optimizations with subpattern references. 09625d5
@nbtrap nbtrap Actually, only skip the "skip" optimization when optimizing ends of s…
…trings in patterns containing subpattern references.
a4e1eaa
@nbtrap nbtrap Add forgotten type to ETYPECASE expression. e100f55
@nbtrap nbtrap Add more tests for subpattern references.
Current, the following tests fail: 1638, 1639, 1641, 1642, 1643, 1644,
1645, 1646.
ada178a
@nbtrap nbtrap Use SETF instead of SETQ. e7d4eff
@nbtrap nbtrap Make sure INSIDE-SUBPATTERN-REFERENCE gets set to NIL when NEXT-FN is…
… not called.

The NEXT-FN (or OTHER-FN) parameter is only called when the register's
inner matcher succeeds.  When the inner matcher failed,
INSIDE-SUBPATTERN-REFERENCE was not being unset, though it should have
been.
0309c76
@nbtrap nbtrap Number tests correctly. 806857a
@nbtrap nbtrap Create a temporary set of registers for each pass through a subpatter…
…n reference.

This patch fixes tests 1643-1646.
047c17e
@nbtrap nbtrap Detect whether perl tess regexes contain named registers. 22c26a8
@nbtrap nbtrap Add some additional (trivial) cases to ETYPECASE in CONVERT-NAMED-SUB…
…PATTERN-REFS.
3bbe139
@nbtrap nbtrap Add some tests for named subpattern references. b05a808
@nbtrap nbtrap Get rid of unused INSIDE-SUBPATTERN-REFERENCE variable. c70985f
@nbtrap nbtrap Only compute INNER-MATCHER-WITHOUT-NEXT-FN when it's needed.
It's actually not clear that this is faster, though it probably is.
cdb513c
@nbtrap nbtrap Remove FIXME comment from closures.lisp, and rename one of the variab…
…les.

There is another way to go about this, but there's really no telling
which way would be faster.  The advantage to contructing two matchers
is that it only happens once and compilation time.
4fe74d0
@nbtrap nbtrap Remove FIXME comment from CREATE-MATCHER-AUX for SUBPATTERN-REFERENCE.
The optimization suggested there is trivial.
56243a1
@nbtrap nbtrap Remove more FIXME comments.
Perl does not allow whitespace around numbers/names in subpattern
references.
8591bd4
@nbtrap nbtrap Make named subpattern references refer to the first subpattern with t…
…he given name, as in Perl.

This currently doesn't work for forward references.  E.g.:

(let ((ppcre::*allow-named-registers* t))
  (ppcre:scan (ppcre:parse-string "(?&foo)(?<foo>f)(?<foo>o)") "ffo"))

returns NIL.
8572c7a
@nbtrap nbtrap Don't declare type of variable that may be either FUNCTION or NIL. 4d1c609
@nbtrap nbtrap Add two tests (1652 and 1675) for testing forward subpattern referenc…
…es for special case.

The special case is where the forward reference refers to the
beginning of the constant end of string.  These tests currently fail.
e4abae6
@nbtrap nbtrap Reorder the subpattern reference tests.
The tests are now so ordered that every numbered subpattern reference
test is followed by a corresponding named subpattern reference test.
36acbdd
@nbtrap nbtrap Don't use START-OF-END-STRING-P optimization when subpattern referenc…
…es are present.
87c9afc
@nbtrap nbtrap Add tests for patterns containing illegal whitespace in subpattern re…
…ferences.
6620354
@nbtrap nbtrap Bind *ALLOW-NAMED-REGISTERS* to NIL before running simple tests.
If the user has bound this variable to T, the simple tests will not
all pass.
09ad0d4
@nbtrap nbtrap Add two more tests for dealing with START-OF-END-STRING-P and subpatt…
…ern references.
68d4215
@nbtrap nbtrap Add two more tests.
These tests make sure that CL-PPCRE uses the correct named register
when multiple registers have the same name.
ffca226
@nbtrap nbtrap Remove FIXME comment about disambiguating named subpattern references.
This question has been answered an implemented in a previous commit.
e9c94db
@nbtrap nbtrap Change FIXME comment in COMPUTE-OFFSETS method on SUBPATTERN-REFERENCE. d7d8941
@nbtrap nbtrap Remove another FIXME comment.
As with their offsets, determing the minimum lengths subpattern
references is only feasible for patterns that don't really need
subpattern references to begin with.
c5e06f7
@nbtrap nbtrap Remove the FIXME comment from the REGEX-LENGTH method for SUBPATTERN-…
…REFERENCE.
e04ccda
@nbtrap nbtrap Remove unused sub from perltest.pl eval. 1f1bba8
@nbtrap nbtrap Make perltest.pl handle arbitrarily large and variable numbers of reg…
…isters.

The perltestdata file produced by this updated perltest.pl only
reports results for registers that are actually contained in the
corresponding pattern.
0b70e23
@nbtrap nbtrap Add two tests for double- and triple-digit register number subpattern…
… references.
7bd7c22
@nbtrap nbtrap Remove comment about possibly supporting (?0) and (?R).
These are not worth the time, especially considering that they're
trivially easy to simulate.
f4f8137
@nbtrap nbtrap Remove unneeded variable NAMED-REG-SEEN.
We know a named reg has been seen when one or more of the elements of
REG-NAMES is true.
a37bf58
@nbtrap nbtrap Add some failing tests taken from the pcre distribution. 8678c2b
@nbtrap nbtrap Add HAS-SUBPATTERN-REF-P function to convert.lisp. 41cbc80
@nbtrap nbtrap Don't needlessly stop accumulating for string-beginning optimization.
Specifically, once we see a register, continue building the constant
string beginning unless the regex contains a subpattern reference.
8cdc21b
@nbtrap nbtrap Remove specific test references from comment.
The test numbers are no longer correct and are subject to change further.
7c501f0
@nbtrap nbtrap Fix indentation. e4dfb25
@nbtrap nbtrap Don't create a separate matcher for matching registers from subpatter…
…n references.

Instead, every time we descend into a register from a subpattern
reference, we first push a value onto a register-specific list and pop
it off once we return.  When STORE-END-OF-REG sees that there is a
value on the list, it knows we entered the register from a subpattern
reference and doesn't try to match the part of the regex following the
register.

It's hard to say whether this improves or degrades performance and
readability.  But it does seem simpler.
49878b0
@nbtrap nbtrap Use IF instead of COND for simple condition. da8f474
@nbtrap nbtrap Add some more tests that check for correct backtracking through subpa…
…ttern references.

These currently fail.
2805231
@nbtrap nbtrap Backtrack correctly into subpattern references.
This enables correct matching for calls such as:

  (ppcre:scan "(?1)(o(?1)?)" "oo")
5c0f673
@nbtrap nbtrap Clean up CREATE-MATCHER-AUX method for REGISTER.
Add some comments, and rename a variable.
987c5b4
@nbtrap nbtrap Clarify and remove some comments. bab5e70
@nbtrap nbtrap Add some more subpattern reference tests.
These were taken from PCRE's file testdata/testinput2 with slight
modifications.
e32cf51
@nbtrap nbtrap Add more tests for subpattern references.
These tests were taken from PCRE's testdata/testinput2 with slight
modifications.
6f0ad13
@nbtrap nbtrap Add some more tests for subpattern references.
These were taken from PCRE's testdata/testinput2 with slight modifications.
1b51a27
@nbtrap nbtrap Remove FIXME comment from CONVERT-COMPOUND-PARSE-TREE method on :SUBP…
…ATTERN-REFERENCE.
82414b7
@nbtrap nbtrap Rename REFERENCED-REGISTER-MATCHERS -> REGISTER-MATCHERS. da706f6
@nbtrap nbtrap Reformat comments in the style of other comments in the package.
(No periods, no capitalized sentences.)
5f7af85
@nbtrap nbtrap Reformat comments added in changes adding subpattern references to 70…
… columns.
6d1bee4
@nbtrap nbtrap Add FILTER and WORD-BOUNDARY to the default ETYPECASE clause in CONVE…
…RT-NAMED-SUBPATTERN-REF.

I believe this exhausts all possibilities that need to be covered.
2852e76
@nbtrap nbtrap Convert ETYPECASE -> TYPECASE, since all possibilities are accounted …
…for.

Also, add a test for using subpattern references with the :FILTER
feature.
7d1ed2b
@nbtrap nbtrap Add a test for handling back-references within subpattern references …
…referring to registers outside the referenced subpattern.

This currently fails.  When entering into a subpattern reference, Perl
only creates new registers for those capture groups located inside the
subpattern reference.  The capture groups outside the subpattern
reference retain their values.
eed3e27
@nbtrap nbtrap Add some more tests verifying correct behavior of subpattern- and bac…
…k-reference cooperation.
10a6af6
@nbtrap nbtrap Begin transitioning to the new register offsets storage model.
Instead of storing the beginning/end of register offsets directly in
the corresponding arrays, we will now store lists where the car of
each list is the offset.  The reason for this is that when we descend
into a subpattern reference, instead of making a new array where all
offsets are reset, we will push new offsets onto the front of the
lists corresponding only to those registers contained within the
register we're entering via the subpattern reference.  In other words,
we'll only be resetting certain registers.  This will fix tests 1783
and 1785.

Which registers contain which other registers will be computed during
regex compilation.
c3c5e06
@nbtrap nbtrap Continue transitioning to the new register offsets storage model. 3891ee5
@nbtrap nbtrap Add CONTAINING-REGISTERS slot to REGISTER class. 2b40341
@nbtrap nbtrap Compute CONTAINING-REGISTERS slot of REGISTER instances. 6c6771a
@nbtrap nbtrap Continue transition to new register offsets storage model.
At this points, I had expected everything to work as well as with the
old model, but there are still many tests that are failing, so
apparently there are some bugs left to iron out.
474b91a
@nbtrap nbtrap Add some test cases that illumine one of the current register offsets…
… storage model's defects.
d938c89
@nbtrap nbtrap Don't store possible register offset of register entered via subpatte…
…rn reference.

This was causing many tests to fail with bizarre error messages.  It
wasn't caught using the old register offsets storage model since the
array the value was beging stored in was a temporary array to begin
with.

With this commit, the only tests run by RUN-ALL-TESTS that fail are
those that Perl itself gets wrong.
ffeff74
@nbtrap nbtrap Tidy up the comments in CREATE-MATCHER-AUX method specialized on REGI…
…STER.
c3a7c2a
@nbtrap nbtrap Rename CONTAINING-REGISTERS -> SUBREGISTERS. 1627892
@nbtrap nbtrap Fix test 1798.
It was missing the definition of regThree.
9d63ac7
@nbtrap nbtrap Remove some redundant code in CREATE-MATCHER-AUX specialized on REGIS…
…TER.

The redundant code was moved into LABELS function definitions.
d517e82
@nbtrap nbtrap Inline POP-OFFSETS and PUSH-OFFSETS. bfe5c92
@nbtrap nbtrap Use LOOP instead of MAPCAR/MAPC for pushing/popping register offsets. eb26c38
@nbtrap nbtrap Disable some tests in test/perltestdata, but add them to test/simple.
These are tests that even Perl gets wrong.
1da422d
@nbtrap nbtrap Record SUBREGISTER-COUNT instead of a list of SUBREGISTERS.
The numbers of registers nested within a register must be contiguous
with each other and with that of the parent register, so there's no
need to compute a list of subregisters--we just need to know how many
are nested within, and we can compute the rest therefrom.
0b1b87e
@nbtrap nbtrap Merge branch 'nested-refs' into subpattern-references d3a1dee
@nbtrap nbtrap Add two more tests that currently fail.
Each fails for a different reason.  The first fails because it matches
where Perl does not.  (It's not clear whose bug this is.)  The second
fails because we're not waiting until the regex has finished compiling
before validating register names.
8db37a9
@nbtrap nbtrap Add two more tests that fail.
These are like the previous two tests added in the previous commit,
except they deal with "self-referential" backreferences rather than
forward backreferences.
f08f4e2
@nbtrap nbtrap Document subpattern references in the html documentation. f064a0c
@nbtrap nbtrap Add subpattern reference commentary to docs on *ALLOW-NAMED-REGISTERS*
Also, add some FIXME comments to return to later.
0de65d3
@nbtrap nbtrap Restore the original docstrings to *REG-STARTS*, etc.
This begins a third attempt at a register offsets storage model.

The problem with the current model is that it changed the
implementation of *REG-STARTS*, etc., variables that I didn't realize
were part of the api.  This latest attempt will restore the original
semantics to those variables and store offset stacks (for recursive
subpattern references) in separate variables.
adb9156
@nbtrap nbtrap Add three new special variables for holding the register offsets stacks. 07c8d92
@nbtrap nbtrap Bind new special variables when scanning. b1d2ec7
@nbtrap nbtrap Make *REG-STARTS*, etc., have NIL as initial value instead of (list n…
…il).
ad4d39f
@nbtrap nbtrap Access offset values directly in *REG-STARTS*, etc., instead of thoru…
…gh CAR.
00c2ca5
@nbtrap nbtrap Finish implementing the newest register offsets storage model. e64e5ba
@nbtrap nbtrap Declare type once instead of using THE repeatedly. 9aaa4da
@nbtrap nbtrap Merge branch 'nested-refs2' into subpattern-references 92b459e
@nbtrap nbtrap Add more tests to *TESTS-TO-SKIP*
These are tests that CL-PPCRE gets right but Perl gets wrong, except
fot 1812, which is caused by validating backreference names too soon.
f77686c
@nbtrap nbtrap Move more tests (1809-1812) into test/simple.
These are tests that Perl gets wrong.
f50d7a6
@nbtrap nbtrap Add url reference to comment about Perl mishandling certain construct…
…ions.
53cdefe
@nbtrap nbtrap Add FIXME comment to come back to later.
Does Perl try to match more than one capture group if more than one
have the same name?
2f4c042
@nbtrap nbtrap Add more tests for subpattern-/back-reference cooperation.
These currently fail due to Perl's incorrect behavior.  See Perl's RT
8f838eb
@nbtrap nbtrap Move more tests from test/perltestdata into test/simple. 2fc37a6
@nbtrap nbtrap Create new bindings for the referenced register upon entry to subpatt…
…ern-reference.

When entering a register x via subpattern-reference, the registers
local to x receive new dynamic bindings, which shadow the old bindings
for the duration of the subpattern call.  Previously, "local" did not
include the register itself--x in this case.  With this patch, the
referenced register now receives a new binding as well.

It's not entirely clear that this is the appropriate behavior.  In a
regex like "(.\1?)(?1)", the back-reference to '\1' now will always
fail, rather than potentially matching according to what was matched
in the first pass through the first register.
9b97c91
@nbtrap nbtrap Remove FIXME comment from convert.lisp.
The issue referred to there is one of the the subjects of #17.
46a078c
@nbtrap nbtrap Use "recurse" instead of "refer" to describe the action associated wi…
…th subpattern references.
9b0a9c0
@nbtrap nbtrap Convert calls to PUSH-OFFSETS and POP-OFFSETS to fewer calls to more …
…specific functions.
81be1e4
@nbtrap nbtrap Update comment. a941400
@nbtrap nbtrap Clarify comments in CREATE-MATCHER-AUX method specialized over regist…
…ers.
846c22e
@nbtrap nbtrap Get rid of useless declaration. b5b406c
@nbtrap nbtrap Wrap docstrings to 70 columns. 72d020e
@nbtrap nbtrap Fix lexical/special binding bug.
This went undetected for so long because of a bug in SBCL (and ECL,
apparently).  The way it was written, it shouldn't have worked, but it
did--except on CLISP, which is how the bug was caught.
498e3f0
@nbtrap nbtrap Merge tag 'v2.0.7' into subpattern-references 544fcd1
@nbtrap

Tested on SBCL, ECL, and CLISP.

The documentation says that subpattern refs were added in version 2.1.0, so you might want to change that if you don't bump the version number like that.

I wrote the pull request as plain text (ignoring GitHub's "markdown") so it could be used as the merge's commit message.

closures.lisp
((16 lines not shown))
- (funcall next-fn start-pos)))
+ (declare (special register-matchers))
+ (let ((num (num register))
+ (subregister-count (subregister-count register))
+ ;; a place to store the next function to call when we arrive
+ ;; here via a subpattern reference
+ subpattern-ref-continuations)
+ (declare (fixnum num subregister-count)
+ (list subpattern-ref-continuations))
+ (labels
+ ((push-registers-state (new-starts new-maybe-starts new-ends)
+ (declare (list new-starts new-maybe-starts new-ends))
+ ;; only push the register states for this register and registers
+ ;; local to it
+ (loop for idx from num upto (+ num subregister-count) do
+ (let ()
@hanshuebner Owner

Remove this gratitious let.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
closures.lisp
((29 lines not shown))
+ ;; local to it
+ (loop for idx from num upto (+ num subregister-count) do
+ (let ()
+ (declare (fixnum idx))
+ (push (svref *reg-ends* idx) (svref *reg-ends-stacks* idx))
+ (setf (svref *reg-ends* idx) (pop new-ends))
+ (push (svref *regs-maybe-start* idx) (svref *regs-maybe-start-stacks* idx))
+ (setf (svref *regs-maybe-start* idx) (pop new-maybe-starts))
+ (push (svref *reg-starts* idx) (svref *reg-starts-stacks* idx))
+ (setf (svref *reg-starts* idx) (pop new-starts)))))
+ (pop-registers-state ()
+ ;; return the state that was destroyed by this restore
+ (let (old-starts old-maybe-starts old-ends)
+ (declare (list old-starts old-maybe-starts old-ends))
+ (loop for idx from (+ num subregister-count) downto num do
+ (let ()
@hanshuebner Owner

Extra let, lose it.

@nbtrap
nbtrap added a note

The point of those LETs is so I don't have to use THE every time I reference IDX. Do you prefer that I use LOCALLY, use THE multiple times, or get rid of the type declaration for IDX altogether?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
closures.lisp
((44 lines not shown))
+ (let ()
+ (declare (fixnum idx))
+ (push (svref *reg-ends* idx) old-ends)
+ (setf (svref *reg-ends* idx) (pop (svref *reg-ends-stacks* idx)))
+ (push (svref *regs-maybe-start* idx) old-maybe-starts)
+ (setf (svref *regs-maybe-start* idx) (pop (svref *regs-maybe-start-stacks* idx)))
+ (push (svref *reg-starts* idx) old-starts)
+ (setf (svref *reg-starts* idx) (pop (svref *reg-starts-stacks* idx)))))
+ (values old-starts old-maybe-starts old-ends)))
+ ;; STORE-END-OF-REG is a thin wrapper around NEXT-FN which
+ ;; will update register offsets after the inner matcher has
+ ;; succeded
+ (store-end-of-reg (start-pos)
+ (declare (fixnum start-pos)
+ (function next-fn))
+ (if subpattern-ref-continuations
@hanshuebner Owner

Use cond instead of this let

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
closures.lisp
((95 lines not shown))
- ;; regular expressions like /(a|\1x)*/
- (setf (svref *regs-maybe-start* num) start-pos)
- (let ((next-pos (funcall inner-matcher start-pos)))
- (unless next-pos
- ;; restore old values on failure
- (setf (svref *reg-starts* num) old-*reg-starts*
- (svref *regs-maybe-start* num) old-*regs-maybe-start*
- (svref *reg-ends* num) old-*reg-ends*))
- next-pos)))))))
+ ;; here comes the actual closure for REGISTER; save it in a
+ ;; special variable so it can be called by subpattern
+ ;; references
+ (setf (getf (car register-matchers) num)
+ (lambda (start-pos &optional other-fn)
+ (declare (fixnum start-pos))
+ (if other-fn
@hanshuebner Owner

Use cond instead of let

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
closures.lisp
((109 lines not shown))
+ (declare (fixnum start-pos))
+ (if other-fn
+ ;; the presence of OTHER-FN indicates that this
+ ;; register has been entered via a subpattern
+ ;; reference closure; save the registers state,
+ ;; creating fresh new "bindings" for the local
+ ;; register offsets; restore the state before
+ ;; returning to the caller
+ (progn
+ (push other-fn subpattern-ref-continuations)
+ (push-registers-state nil nil nil)
+ (prog1
+ (funcall inner-matcher start-pos)
+ (pop-registers-state)
+ (pop subpattern-ref-continuations)))
+ (let ((old-*reg-starts* (svref *reg-starts* num))
@hanshuebner Owner

confusing naming. is this special or not? earmuffs are used for special variables only.

@nbtrap
nbtrap added a note

Those names were there already. Shall I change them anyway?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
closures.lisp
@@ -427,6 +493,19 @@ against CHR-EXPR."
reg-start reg-end)
(funcall next-fn next-pos)))))))))
+(defmethod create-matcher-aux ((subpattern-reference subpattern-reference) next-fn)
+ (declare #.*standard-optimize-settings*)
+ (declare (special register-matchers)
+ (function next-fn))
+ ;; close over the special variable REGISTER-MATCHERS in order to
+ ;; reference it during the match phase
+ (let ((num (num subpattern-reference))
+ (register-matchers register-matchers))
+ (declare (fixnum num) (function next-fn))
+ (lambda (start-pos)
+ (let ((subpattern-matcher (getf (car register-matchers) (1- num))))
@hanshuebner Owner

extra let, move the expression into the funcall.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
lexer.lisp
@@ -693,7 +697,36 @@ closing #\> will also be consumed."
;; also syntax error
(signal-syntax-error* (1- (lexer-pos lexer))
"Character '~A' may not follow '(?<'."
- next-char ))))))
+ next-char))))))
+ ((#\&)
+ ;; subpattern reference by register name
+ (unless *allow-named-registers*
+ (signal-syntax-error* (1- (lexer-pos lexer))
+ "Character '~A' may not follow '(?'."
+ next-char))
+ (let ((next-char (next-char lexer)))
+ (if (alpha-char-p next-char)
@hanshuebner Owner

use unless -> signal

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@hanshuebner hanshuebner commented on the diff
lexer.lisp
((10 lines not shown))
+ "Character '~A' may not follow '(?'."
+ next-char))
+ (let ((next-char (next-char lexer)))
+ (if (alpha-char-p next-char)
+ (progn
+ ;; put the letter back
+ (decf (lexer-pos lexer))
+ (list :subpattern-reference
+ (parse-register-name-aux lexer :subpattern-reference t)))
+ (signal-syntax-error* (1- (lexer-pos lexer))
+ "Character '~A' may not follow '(?&'."
+ next-char))))
+ ((#\1 #\2 #\3 #\4 #\5 #\6 #\7 #\8 #\9)
+ ;; put the digit back
+ (decf (lexer-pos lexer))
+ (prog1
@hanshuebner Owner

Indentation is wrong. first expression in prog1 gets two spaces extra indentation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@hanshuebner
Owner

My review does not constitute a willingness to merge the change, which is up to Edi to decide.

@nbtrap

Suggested changes have been made.

nbtrap added some commits
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Commits on Dec 30, 2013
  1. @nbtrap

    Add RECURSIVE-SUBPATTERN class.

    nbtrap authored
  2. @nbtrap
  3. @nbtrap
  4. @nbtrap
  5. @nbtrap
  6. @nbtrap
  7. @nbtrap
  8. @nbtrap
  9. @nbtrap
  10. @nbtrap
  11. @nbtrap
  12. @nbtrap

    Give flesh to CONVERT-COMPOUND-PARSE-TREE.

    nbtrap authored
    Be sure to keep track of named subpattern references as well as the
    highes numbered subpattern reference encountered.
  13. @nbtrap
  14. @nbtrap

    Convert named subpattern refs to numbered subpattern refs inside CONV…

    nbtrap authored
    …ERT.
    
    Also, keep track of which registers have been referenced by number.
  15. @nbtrap

    Return a fifth value from CONVERT, namely, the list of numbers of sub…

    nbtrap authored
    …patterns referenced in the regex.
  16. @nbtrap
  17. @nbtrap

    Define the closure that matches subpattern references, and modify the…

    nbtrap authored
    … register closure.
    
    This required several things that may not have been necessary and will
    have to be revisited.  First of all, for every register, we now create
    two inner matchers: one that matches the contents of the register and
    what follows the register, and one that only matches the contents of
    the register.  Also, we now stop accumulating into STARTS-WITH once we
    encounter a register or subpattern reference.
    
    With this patch, subpattern references seem to work for the most part.
    They do not yet work with repetitions.
  18. @nbtrap
  19. @nbtrap

    Define COPY-REGEX and COMPUTE-OFFSETS on SUBPATTERN-REFERENCE.

    nbtrap authored
    At this point, one thing that doesn't work quite right is the
    determination of register offsets for registers accessed indirectly by
    subpattern references.  For example:
    
      (cl-ppcre:scan "(\\([^()]*((?1)\\)|\\)))" "((()))")
    
    says that the second register is at position (3, 6), though it should
    be (1, 6).  Fixing this will require binding a special variable from
    subpattern reference closures that tells register closures not to
    touch the register offsets.
  20. @nbtrap

    Be sure not to touch register offsets when matching a register indire…

    nbtrap authored
    …ctly from a subpattern reference.
    
    With this patch, the following invocation:
      (cl-ppcre:scan "(\\([^()]*((?1)\\)|\\)))" "((()))")
    gives the correct offset values for the second register as (1,6).
    
    One problem that remains is the danger of infinite recursion during
    backtracking.  The following invocation:
      (cl-ppcre:scan "(?1)(?2)(a|b|(?1))(c)" "acba")
    
    causes a stack overflow because the second (?1) is called endlessly
    during backtracking without the match position advancing through the
    string.  Such behavior may be able to be remedied by having the
    subpattern reference's closure keep track of where in *STRING* it has
    been called before.
  21. @nbtrap

    Don't backtrack through subpattern references ad infinitum.

    nbtrap authored
    This is going to be reverted immediately, since apparently Perl isn't
    smart enough to do this and will itself overflow the stack.
  22. @nbtrap
Commits on Jan 2, 2014
  1. @nbtrap
Commits on Jan 6, 2014
  1. @nbtrap
  2. @nbtrap

    Add tests for subpattern references.

    nbtrap authored
    1634 and 1635 currently don't work.
  3. @nbtrap
  4. @nbtrap
  5. @nbtrap

    Actually, only skip the "skip" optimization when optimizing ends of s…

    nbtrap authored
    …trings in patterns containing subpattern references.
  6. @nbtrap
  7. @nbtrap

    Add more tests for subpattern references.

    nbtrap authored
    Current, the following tests fail: 1638, 1639, 1641, 1642, 1643, 1644,
    1645, 1646.
  8. @nbtrap

    Use SETF instead of SETQ.

    nbtrap authored
Commits on Jan 7, 2014
  1. @nbtrap

    Make sure INSIDE-SUBPATTERN-REFERENCE gets set to NIL when NEXT-FN is…

    nbtrap authored
    … not called.
    
    The NEXT-FN (or OTHER-FN) parameter is only called when the register's
    inner matcher succeeds.  When the inner matcher failed,
    INSIDE-SUBPATTERN-REFERENCE was not being unset, though it should have
    been.
  2. @nbtrap

    Number tests correctly.

    nbtrap authored
  3. @nbtrap

    Create a temporary set of registers for each pass through a subpatter…

    nbtrap authored
    …n reference.
    
    This patch fixes tests 1643-1646.
  4. @nbtrap
  5. @nbtrap
  6. @nbtrap
Commits on Jan 8, 2014
  1. @nbtrap
  2. @nbtrap

    Only compute INNER-MATCHER-WITHOUT-NEXT-FN when it's needed.

    nbtrap authored
    It's actually not clear that this is faster, though it probably is.
Commits on Jan 12, 2014
  1. @nbtrap

    Remove FIXME comment from closures.lisp, and rename one of the variab…

    nbtrap authored
    …les.
    
    There is another way to go about this, but there's really no telling
    which way would be faster.  The advantage to contructing two matchers
    is that it only happens once and compilation time.
  2. @nbtrap

    Remove FIXME comment from CREATE-MATCHER-AUX for SUBPATTERN-REFERENCE.

    nbtrap authored
    The optimization suggested there is trivial.
  3. @nbtrap

    Remove more FIXME comments.

    nbtrap authored
    Perl does not allow whitespace around numbers/names in subpattern
    references.
  4. @nbtrap

    Make named subpattern references refer to the first subpattern with t…

    nbtrap authored
    …he given name, as in Perl.
    
    This currently doesn't work for forward references.  E.g.:
    
    (let ((ppcre::*allow-named-registers* t))
      (ppcre:scan (ppcre:parse-string "(?&foo)(?<foo>f)(?<foo>o)") "ffo"))
    
    returns NIL.
  5. @nbtrap
  6. @nbtrap

    Add two tests (1652 and 1675) for testing forward subpattern referenc…

    nbtrap authored
    …es for special case.
    
    The special case is where the forward reference refers to the
    beginning of the constant end of string.  These tests currently fail.
  7. @nbtrap

    Reorder the subpattern reference tests.

    nbtrap authored
    The tests are now so ordered that every numbered subpattern reference
    test is followed by a corresponding named subpattern reference test.
  8. @nbtrap
  9. @nbtrap
  10. @nbtrap

    Bind *ALLOW-NAMED-REGISTERS* to NIL before running simple tests.

    nbtrap authored
    If the user has bound this variable to T, the simple tests will not
    all pass.
  11. @nbtrap
  12. @nbtrap

    Add two more tests.

    nbtrap authored
    These tests make sure that CL-PPCRE uses the correct named register
    when multiple registers have the same name.
Commits on Jan 15, 2014
  1. @nbtrap

    Remove FIXME comment about disambiguating named subpattern references.

    nbtrap authored
    This question has been answered an implemented in a previous commit.
  2. @nbtrap
  3. @nbtrap

    Remove another FIXME comment.

    nbtrap authored
    As with their offsets, determing the minimum lengths subpattern
    references is only feasible for patterns that don't really need
    subpattern references to begin with.
  4. @nbtrap
  5. @nbtrap
  6. @nbtrap

    Make perltest.pl handle arbitrarily large and variable numbers of reg…

    nbtrap authored
    …isters.
    
    The perltestdata file produced by this updated perltest.pl only
    reports results for registers that are actually contained in the
    corresponding pattern.
  7. @nbtrap
  8. @nbtrap

    Remove comment about possibly supporting (?0) and (?R).

    nbtrap authored
    These are not worth the time, especially considering that they're
    trivially easy to simulate.
  9. @nbtrap

    Remove unneeded variable NAMED-REG-SEEN.

    nbtrap authored
    We know a named reg has been seen when one or more of the elements of
    REG-NAMES is true.
Commits on Jan 16, 2014
  1. @nbtrap
  2. @nbtrap
  3. @nbtrap

    Don't needlessly stop accumulating for string-beginning optimization.

    nbtrap authored
    Specifically, once we see a register, continue building the constant
    string beginning unless the regex contains a subpattern reference.
Commits on Jan 17, 2014
  1. @nbtrap

    Remove specific test references from comment.

    nbtrap authored
    The test numbers are no longer correct and are subject to change further.
  2. @nbtrap

    Fix indentation.

    nbtrap authored
  3. @nbtrap

    Don't create a separate matcher for matching registers from subpatter…

    nbtrap authored
    …n references.
    
    Instead, every time we descend into a register from a subpattern
    reference, we first push a value onto a register-specific list and pop
    it off once we return.  When STORE-END-OF-REG sees that there is a
    value on the list, it knows we entered the register from a subpattern
    reference and doesn't try to match the part of the regex following the
    register.
    
    It's hard to say whether this improves or degrades performance and
    readability.  But it does seem simpler.
  4. @nbtrap
  5. @nbtrap

    Add some more tests that check for correct backtracking through subpa…

    nbtrap authored
    …ttern references.
    
    These currently fail.
Commits on Jan 19, 2014
  1. @nbtrap

    Backtrack correctly into subpattern references.

    nbtrap authored
    This enables correct matching for calls such as:
    
      (ppcre:scan "(?1)(o(?1)?)" "oo")
  2. @nbtrap

    Clean up CREATE-MATCHER-AUX method for REGISTER.

    nbtrap authored
    Add some comments, and rename a variable.
  3. @nbtrap
Commits on Jan 20, 2014
  1. @nbtrap

    Add some more subpattern reference tests.

    nbtrap authored
    These were taken from PCRE's file testdata/testinput2 with slight
    modifications.
Commits on Jan 26, 2014
  1. @nbtrap

    Add more tests for subpattern references.

    nbtrap authored
    These tests were taken from PCRE's testdata/testinput2 with slight
    modifications.
  2. @nbtrap

    Add some more tests for subpattern references.

    nbtrap authored
    These were taken from PCRE's testdata/testinput2 with slight modifications.
  3. @nbtrap
  4. @nbtrap
  5. @nbtrap

    Reformat comments in the style of other comments in the package.

    nbtrap authored
    (No periods, no capitalized sentences.)
  6. @nbtrap
Commits on Feb 8, 2014
  1. @nbtrap

    Add FILTER and WORD-BOUNDARY to the default ETYPECASE clause in CONVE…

    nbtrap authored
    …RT-NAMED-SUBPATTERN-REF.
    
    I believe this exhausts all possibilities that need to be covered.
  2. @nbtrap

    Convert ETYPECASE -> TYPECASE, since all possibilities are accounted …

    nbtrap authored
    …for.
    
    Also, add a test for using subpattern references with the :FILTER
    feature.
  3. @nbtrap

    Add a test for handling back-references within subpattern references …

    nbtrap authored
    …referring to registers outside the referenced subpattern.
    
    This currently fails.  When entering into a subpattern reference, Perl
    only creates new registers for those capture groups located inside the
    subpattern reference.  The capture groups outside the subpattern
    reference retain their values.
Commits on Feb 16, 2014
  1. @nbtrap

    Add some more tests verifying correct behavior of subpattern- and bac…

    nbtrap authored
    …k-reference cooperation.
Commits on Feb 17, 2014
  1. @nbtrap

    Begin transitioning to the new register offsets storage model.

    nbtrap authored
    Instead of storing the beginning/end of register offsets directly in
    the corresponding arrays, we will now store lists where the car of
    each list is the offset.  The reason for this is that when we descend
    into a subpattern reference, instead of making a new array where all
    offsets are reset, we will push new offsets onto the front of the
    lists corresponding only to those registers contained within the
    register we're entering via the subpattern reference.  In other words,
    we'll only be resetting certain registers.  This will fix tests 1783
    and 1785.
    
    Which registers contain which other registers will be computed during
    regex compilation.
  2. @nbtrap
  3. @nbtrap
  4. @nbtrap
  5. @nbtrap

    Continue transition to new register offsets storage model.

    nbtrap authored
    At this points, I had expected everything to work as well as with the
    old model, but there are still many tests that are failing, so
    apparently there are some bugs left to iron out.
  6. @nbtrap

    Add some test cases that illumine one of the current register offsets…

    nbtrap authored
    … storage model's defects.
  7. @nbtrap

    Don't store possible register offset of register entered via subpatte…

    nbtrap authored
    …rn reference.
    
    This was causing many tests to fail with bizarre error messages.  It
    wasn't caught using the old register offsets storage model since the
    array the value was beging stored in was a temporary array to begin
    with.
    
    With this commit, the only tests run by RUN-ALL-TESTS that fail are
    those that Perl itself gets wrong.
  8. @nbtrap
  9. @nbtrap
  10. @nbtrap

    Fix test 1798.

    nbtrap authored
    It was missing the definition of regThree.
  11. @nbtrap

    Remove some redundant code in CREATE-MATCHER-AUX specialized on REGIS…

    nbtrap authored
    …TER.
    
    The redundant code was moved into LABELS function definitions.
Commits on Feb 18, 2014
  1. @nbtrap
Commits on Feb 19, 2014
  1. @nbtrap
  2. @nbtrap

    Disable some tests in test/perltestdata, but add them to test/simple.

    nbtrap authored
    These are tests that even Perl gets wrong.
  3. @nbtrap

    Record SUBREGISTER-COUNT instead of a list of SUBREGISTERS.

    nbtrap authored
    The numbers of registers nested within a register must be contiguous
    with each other and with that of the parent register, so there's no
    need to compute a list of subregisters--we just need to know how many
    are nested within, and we can compute the rest therefrom.
Commits on Feb 22, 2014
  1. @nbtrap
  2. @nbtrap

    Add two more tests that currently fail.

    nbtrap authored
    Each fails for a different reason.  The first fails because it matches
    where Perl does not.  (It's not clear whose bug this is.)  The second
    fails because we're not waiting until the regex has finished compiling
    before validating register names.
  3. @nbtrap

    Add two more tests that fail.

    nbtrap authored
    These are like the previous two tests added in the previous commit,
    except they deal with "self-referential" backreferences rather than
    forward backreferences.
  4. @nbtrap
  5. @nbtrap

    Add subpattern reference commentary to docs on *ALLOW-NAMED-REGISTERS*

    nbtrap authored
    Also, add some FIXME comments to return to later.
  6. @nbtrap

    Restore the original docstrings to *REG-STARTS*, etc.

    nbtrap authored
    This begins a third attempt at a register offsets storage model.
    
    The problem with the current model is that it changed the
    implementation of *REG-STARTS*, etc., variables that I didn't realize
    were part of the api.  This latest attempt will restore the original
    semantics to those variables and store offset stacks (for recursive
    subpattern references) in separate variables.
  7. @nbtrap
  8. @nbtrap
  9. @nbtrap
  10. @nbtrap
  11. @nbtrap
  12. @nbtrap
  13. @nbtrap
  14. @nbtrap

    Add more tests to *TESTS-TO-SKIP*

    nbtrap authored
    These are tests that CL-PPCRE gets right but Perl gets wrong, except
    fot 1812, which is caused by validating backreference names too soon.
  15. @nbtrap

    Move more tests (1809-1812) into test/simple.

    nbtrap authored
    These are tests that Perl gets wrong.
  16. @nbtrap
  17. @nbtrap

    Add FIXME comment to come back to later.

    nbtrap authored
    Does Perl try to match more than one capture group if more than one
    have the same name?
Commits on Feb 24, 2014
  1. @nbtrap

    Add more tests for subpattern-/back-reference cooperation.

    nbtrap authored
    These currently fail due to Perl's incorrect behavior.  See Perl's RT
  2. @nbtrap
Commits on Feb 26, 2014
  1. @nbtrap

    Create new bindings for the referenced register upon entry to subpatt…

    nbtrap authored
    …ern-reference.
    
    When entering a register x via subpattern-reference, the registers
    local to x receive new dynamic bindings, which shadow the old bindings
    for the duration of the subpattern call.  Previously, "local" did not
    include the register itself--x in this case.  With this patch, the
    referenced register now receives a new binding as well.
    
    It's not entirely clear that this is the appropriate behavior.  In a
    regex like "(.\1?)(?1)", the back-reference to '\1' now will always
    fail, rather than potentially matching according to what was matched
    in the first pass through the first register.
Commits on Feb 27, 2014
  1. @nbtrap

    Remove FIXME comment from convert.lisp.

    nbtrap authored
    The issue referred to there is one of the the subjects of #17.
  2. @nbtrap

    Use "recurse" instead of "refer" to describe the action associated wi…

    nbtrap authored
    …th subpattern references.
Commits on Feb 28, 2014
  1. @nbtrap
  2. @nbtrap

    Update comment.

    nbtrap authored
  3. @nbtrap
  4. @nbtrap

    Get rid of useless declaration.

    nbtrap authored
Commits on Mar 1, 2014
  1. @nbtrap

    Wrap docstrings to 70 columns.

    nbtrap authored
  2. @nbtrap

    Fix lexical/special binding bug.

    nbtrap authored
    This went undetected for so long because of a bug in SBCL (and ECL,
    apparently).  The way it was written, it shouldn't have worked, but it
    did--except on CLISP, which is how the bug was caught.
  3. @nbtrap
  4. @nbtrap

    Fix indentation of PROG1.

    nbtrap authored
  5. @nbtrap

    Get rid of extra LET.

    nbtrap authored
  6. @nbtrap
  7. @nbtrap

    Rename OTHER-FN -> CONT.

    nbtrap authored
  8. @nbtrap
  9. @nbtrap
  10. @nbtrap
Commits on Mar 2, 2014
  1. @nbtrap

    Fix declaration on SUBPATTERN-REFS.

    nbtrap authored
    This should be a SPECIAL declaration, not a type declaration.
Something went wrong with that request. Please try again.