Merge subpattern references #18

Open
wants to merge 134 commits into
from
Commits on Dec 30, 2013
  1. Add RECURSIVE-SUBPATTERN class.

    nbtrap committed Dec 27, 2013
  2. Give flesh to CONVERT-COMPOUND-PARSE-TREE.

    nbtrap committed Dec 27, 2013
    Be sure to keep track of named subpattern references as well as the
    highest-numbered subpattern reference encountered.
  3. Convert named subpattern refs to numbered subpattern refs inside CONVERT.

    nbtrap committed Dec 28, 2013
    Also, keep track of which registers have been referenced by number.
  4. Return a fifth value from CONVERT, namely, the list of numbers of subpatterns referenced in the regex.

    nbtrap committed Dec 28, 2013
  5. Define the closure that matches subpattern references, and modify the register closure.

    nbtrap committed Dec 29, 2013
    This required several things that may not have been necessary and will
    have to be revisited.  First of all, for every register, we now create
    two inner matchers: one that matches the contents of the register and
    what follows the register, and one that only matches the contents of
    the register.  Also, we now stop accumulating into STARTS-WITH once we
    encounter a register or subpattern reference.
    
    With this patch, subpattern references seem to work for the most part.
    They do not yet work with repetitions.
  6. Define COPY-REGEX and COMPUTE-OFFSETS on SUBPATTERN-REFERENCE.

    nbtrap committed Dec 29, 2013
    At this point, one thing that doesn't work quite right is the
    determination of register offsets for registers accessed indirectly by
    subpattern references.  For example:
    
      (cl-ppcre:scan "(\\([^()]*((?1)\\)|\\)))" "((()))")
    
    says that the second register is at position (3, 6), though it should
    be (1, 6).  Fixing this will require binding a special variable from
    subpattern reference closures that tells register closures not to
    touch the register offsets.
  7. Be sure not to touch register offsets when matching a register indirectly from a subpattern reference.

    nbtrap committed Dec 29, 2013
    With this patch, the following invocation:
      (cl-ppcre:scan "(\\([^()]*((?1)\\)|\\)))" "((()))")
    gives the correct offset values for the second register as (1,6).
    
    One problem that remains is the danger of infinite recursion during
    backtracking.  The following invocation:
      (cl-ppcre:scan "(?1)(?2)(a|b|(?1))(c)" "acba")
    
    causes a stack overflow because the second (?1) is called endlessly
    during backtracking without the match position advancing through the
    string.  Such behavior could perhaps be remedied by having the
    subpattern reference's closure keep track of where in *STRING* it has
    been called before.
  8. Don't backtrack through subpattern references ad infinitum.

    nbtrap committed Dec 29, 2013
    This is going to be reverted immediately, since apparently Perl isn't
    smart enough to do this and will itself overflow the stack.
Commits on Jan 2, 2014
Commits on Jan 6, 2014
  1. Add tests for subpattern references.

    nbtrap committed Jan 6, 2014
    1634 and 1635 currently don't work.
  2. Actually, only skip the "skip" optimization when optimizing ends of strings in patterns containing subpattern references.

    nbtrap committed Jan 6, 2014
  3. Add more tests for subpattern references.

    nbtrap committed Jan 6, 2014
    Currently, the following tests fail: 1638, 1639, 1641, 1642, 1643, 1644,
    1645, 1646.
  4. Use SETF instead of SETQ.

    nbtrap committed Jan 6, 2014
Commits on Jan 7, 2014
  1. Make sure INSIDE-SUBPATTERN-REFERENCE gets set to NIL when NEXT-FN is not called.

    nbtrap committed Jan 7, 2014
    The NEXT-FN (or OTHER-FN) parameter is only called when the register's
    inner matcher succeeds.  When the inner matcher failed,
    INSIDE-SUBPATTERN-REFERENCE was not being unset, though it should have
    been.
  2. Number tests correctly.

    nbtrap committed Jan 7, 2014
  3. Create a temporary set of registers for each pass through a subpattern reference.

    nbtrap committed Jan 7, 2014
    This patch fixes tests 1643-1646.
Commits on Jan 8, 2014
  1. Only compute INNER-MATCHER-WITHOUT-NEXT-FN when it's needed.

    nbtrap committed Jan 8, 2014
    It's actually not clear that this is faster, though it probably is.
Commits on Jan 12, 2014
  1. Remove FIXME comment from closures.lisp, and rename one of the variables.

    nbtrap committed Jan 12, 2014
    There is another way to go about this, but there's really no telling
    which way would be faster.  The advantage to constructing two matchers
    is that it only happens once, at compilation time.
  2. Remove FIXME comment from CREATE-MATCHER-AUX for SUBPATTERN-REFERENCE.

    nbtrap committed Jan 12, 2014
    The optimization suggested there is trivial.
  3. Remove more FIXME comments.

    nbtrap committed Jan 12, 2014
    Perl does not allow whitespace around numbers/names in subpattern
    references.
  4. Make named subpattern references refer to the first subpattern with the given name, as in Perl.

    nbtrap committed Jan 12, 2014
    This currently doesn't work for forward references.  E.g.:
    
    (let ((ppcre::*allow-named-registers* t))
      (ppcre:scan (ppcre:parse-string "(?&foo)(?<foo>f)(?<foo>o)") "ffo"))
    
    returns NIL.
  5. Add two tests (1652 and 1675) for testing forward subpattern references for special case.

    nbtrap committed Jan 12, 2014
    The special case is where the forward reference refers to the
    beginning of the constant end of string.  These tests currently fail.
  6. Reorder the subpattern reference tests.

    nbtrap committed Jan 12, 2014
    The tests are now so ordered that every numbered subpattern reference
    test is followed by a corresponding named subpattern reference test.
  7. Bind *ALLOW-NAMED-REGISTERS* to NIL before running simple tests.

    nbtrap committed Jan 12, 2014
    If the user has bound this variable to T, the simple tests will not
    all pass.
  8. Add two more tests.

    nbtrap committed Jan 12, 2014
    These tests make sure that CL-PPCRE uses the correct named register
    when multiple registers have the same name.
Commits on Jan 15, 2014
  1. Remove FIXME comment about disambiguating named subpattern references.

    nbtrap committed Jan 13, 2014
    This question has been answered and implemented in a previous commit.
  2. Remove another FIXME comment.

    nbtrap committed Jan 14, 2014
    As with their offsets, determining the minimum lengths of subpattern
    references is only feasible for patterns that don't really need
    subpattern references to begin with.
  3. Make perltest.pl handle arbitrarily large and variable numbers of registers.

    nbtrap committed Jan 15, 2014
    The perltestdata file produced by this updated perltest.pl only
    reports results for registers that are actually contained in the
    corresponding pattern.
  4. Remove comment about possibly supporting (?0) and (?R).

    nbtrap committed Jan 15, 2014
    These are not worth the time, especially considering that they're
    trivially easy to simulate.
  5. Remove unneeded variable NAMED-REG-SEEN.

    nbtrap committed Jan 15, 2014
    We know a named reg has been seen when one or more of the elements of
    REG-NAMES is true.
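
The check that stands in for NAMED-REG-SEEN can be sketched as follows. This is a minimal illustration, assuming REG-NAMES is a list with one entry per register that holds the register's name, or NIL for an unnamed register. The helper name NAMED-REGISTER-SEEN-P is hypothetical and not taken from the patch.

```lisp
;; Hypothetical helper, not part of the patch.
;; Assumes REG-NAMES holds a name (a string) for each named register
;; and NIL for each unnamed one.
(defun named-register-seen-p (reg-names)
  "Return T if at least one element of REG-NAMES is non-NIL."
  (and (some #'identity reg-names) t))
```

With that layout, a separate flag carries no information that the list does not already contain.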
Commits on Jan 16, 2014
  1. Don't needlessly stop accumulating for string-beginning optimization.

    nbtrap committed Jan 16, 2014
    Specifically, once we see a register, continue building the constant
    string beginning unless the regex contains a subpattern reference.
Commits on Jan 17, 2014
  1. Remove specific test references from comment.

    nbtrap committed Jan 16, 2014
    The test numbers are no longer correct and are subject to change further.
  2. Fix indentation.

    nbtrap committed Jan 16, 2014
  3. Don't create a separate matcher for matching registers from subpattern references.

    nbtrap committed Jan 16, 2014
    Instead, every time we descend into a register from a subpattern
    reference, we first push a value onto a register-specific list and pop
    it off once we return.  When STORE-END-OF-REG sees that there is a
    value on the list, it knows we entered the register from a subpattern
    reference and doesn't try to match the part of the regex following the
    register.
    
    It's hard to say whether this improves or degrades performance and
    readability.  But it does seem simpler.
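
The push/pop scheme described in this commit can be sketched with plain closures. This is not the code from the patch: only the name STORE-END-OF-REG comes from the commit message, while the REG structure, MAKE-REGISTER, MAKE-REFERENCE-MATCHER, and MAKE-CHAR-MATCHER are hypothetical. The sketch also assumes that the value pushed onto the register's list is the continuation of the subpattern reference, which the commit message does not state.

```lisp
(defvar *string* "" "The string being matched (stand-in for the real target).")

(defstruct reg
  (return-stack '())  ; non-empty while entered via a subpattern reference
  (matcher nil))      ; closure that matches the register's contents

(defun make-char-matcher (char next-fn)
  "Match CHAR at POS, then continue with NEXT-FN."
  (lambda (pos)
    (and (< pos (length *string*))
         (char= (char *string* pos) char)
         (funcall next-fn (1+ pos)))))

(defun make-register (build-body next-fn)
  "BUILD-BODY maps a continuation to a matcher for the register's contents."
  (let ((reg (make-reg)))
    (flet ((store-end-of-reg (pos)
             (if (reg-return-stack reg)
                 ;; entered from a reference: resume after the reference
                 ;; instead of matching what follows the register
                 (let ((cont (pop (reg-return-stack reg))))
                   (unwind-protect (funcall cont pos)
                     ;; restore the entry so backtracking into the
                     ;; register still sees it
                     (push cont (reg-return-stack reg))))
                 ;; entered directly: match the rest of the regex
                 (funcall next-fn pos))))
      (setf (reg-matcher reg) (funcall build-body #'store-end-of-reg)))
    reg))

(defun make-reference-matcher (reg next-fn)
  "Match REG's contents, then continue with NEXT-FN."
  (lambda (pos)
    (push next-fn (reg-return-stack reg))
    (unwind-protect (funcall (reg-matcher reg) pos)
      (pop (reg-return-stack reg)))))

(defun demo (string)
  "Match the equivalent of \"(?1)(a)\" at the start of STRING."
  (let* ((*string* string)
         (reg (make-register (lambda (cont) (make-char-matcher #\a cont))
                             #'identity))
         (ref (make-reference-matcher reg (reg-matcher reg))))
    (funcall ref 0)))
```

Under these assumptions, `(demo "aa")` returns the end position 2 and `(demo "a")` returns NIL, since the reference consumes the only `a` and the register that follows has nothing left to match.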
  4. Add some more tests that check for correct backtracking through subpattern references.

    nbtrap committed Jan 17, 2014
    These currently fail.
Commits on Jan 19, 2014
  1. Backtrack correctly into subpattern references.

    nbtrap committed Jan 19, 2014
    This enables correct matching for calls such as:
    
      (ppcre:scan "(?1)(o(?1)?)" "oo")
  2. Clean up CREATE-MATCHER-AUX method for REGISTER.

    nbtrap committed Jan 19, 2014
    Add some comments, and rename a variable.
  3. Clarify and remove some comments.

    nbtrap committed Jan 19, 2014
Commits on Jan 20, 2014
  1. Add some more subpattern reference tests.

    nbtrap committed Jan 20, 2014
    These were taken from PCRE's file testdata/testinput2 with slight
    modifications.
Commits on Jan 26, 2014
  1. Add more tests for subpattern references.

    nbtrap committed Jan 24, 2014
    These tests were taken from PCRE's testdata/testinput2 with slight
    modifications.
  2. Add some more tests for subpattern references.

    nbtrap committed Jan 25, 2014
    These were taken from PCRE's testdata/testinput2 with slight modifications.
  3. Reformat comments in the style of other comments in the package.

    nbtrap committed Jan 25, 2014
    (No periods, no capitalized sentences.)
Commits on Feb 8, 2014
  1. Add FILTER and WORD-BOUNDARY to the default ETYPECASE clause in CONVERT-NAMED-SUBPATTERN-REF.

    nbtrap committed Feb 8, 2014
    I believe this exhausts all possibilities that need to be covered.
  2. Convert ETYPECASE -> TYPECASE, since all possibilities are accounted for.

    nbtrap committed Feb 8, 2014
    Also, add a test for using subpattern references with the :FILTER
    feature.
  3. Add a test for handling back-references within subpattern references referring to registers outside the referenced subpattern.

    nbtrap committed Feb 8, 2014
    This currently fails.  When entering into a subpattern reference, Perl
    only creates new registers for those capture groups located inside the
    subpattern reference.  The capture groups outside the subpattern
    reference retain their values.
Commits on Feb 16, 2014
  1. Add some more tests verifying correct behavior of subpattern- and back-reference cooperation.

    nbtrap committed Feb 16, 2014
Commits on Feb 17, 2014
  1. Begin transitioning to the new register offsets storage model.

    nbtrap committed Feb 16, 2014
    Instead of storing the beginning/end of register offsets directly in
    the corresponding arrays, we will now store lists where the car of
    each list is the offset.  The reason for this is that when we descend
    into a subpattern reference, instead of making a new array where all
    offsets are reset, we will push new offsets onto the front of the
    lists corresponding only to those registers contained within the
    register we're entering via the subpattern reference.  In other words,
    we'll only be resetting certain registers.  This will fix tests 1783
    and 1785.
    
    Which registers contain which other registers will be computed during
    regex compilation.
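
The storage model described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patch's code: registers are zero-indexed, an unset offset is NIL, and the names MAKE-OFFSET-TABLE, CURRENT-OFFSET, PUSH-FRESH-OFFSETS, and POP-FRESH-OFFSETS are hypothetical.

```lisp
;; Hypothetical helpers, not taken from the patch.

(defun make-offset-table (register-count)
  "Return a vector with one fresh list per register; the car of each
list is that register's current offset (NIL when unset)."
  (let ((table (make-array register-count)))
    (dotimes (i register-count table)
      (setf (aref table i) (list nil)))))

(defun current-offset (table register)
  (car (aref table register)))

(defun (setf current-offset) (new table register)
  (setf (car (aref table register)) new))

(defun push-fresh-offsets (table first last)
  "Shadow the offsets of registers FIRST..LAST with unset ones, as when
descending into a subpattern reference."
  (loop for i from first upto last
        do (push nil (aref table i))))

(defun pop-fresh-offsets (table first last)
  "Restore the offsets that PUSH-FRESH-OFFSETS shadowed."
  (loop for i from first upto last
        do (pop (aref table i))))

(defun offsets-demo ()
  "Return the offsets seen inside and after a simulated recursion."
  (let ((starts (make-offset-table 3)))
    (setf (current-offset starts 0) 0
          (current-offset starts 1) 1
          (current-offset starts 2) 4)
    ;; recurse into register 1 only; registers 0 and 2 keep their values
    (push-fresh-offsets starts 1 1)
    (let ((inside (list (current-offset starts 0)
                        (current-offset starts 1)
                        (current-offset starts 2))))
      (pop-fresh-offsets starts 1 1)
      (list inside (current-offset starts 1)))))
```

Here `(offsets-demo)` returns `((0 NIL 4) 1)`: inside the simulated recursion only register 1 is reset, and its old offset reappears once the recursion returns.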
  2. Continue transition to new register offsets storage model.

    nbtrap committed Feb 16, 2014
    At this point, I had expected everything to work as well as with the
    old model, but there are still many tests that are failing, so
    apparently there are some bugs left to iron out.
  3. Add some test cases that illumine one of the current register offsets storage model's defects.

    nbtrap committed Feb 16, 2014
  4. Don't store possible register offset of register entered via subpattern reference.

    nbtrap committed Feb 16, 2014
    This was causing many tests to fail with bizarre error messages.  It
    wasn't caught using the old register offsets storage model since the
    array the value was being stored in was a temporary array to begin
    with.
    
    With this commit, the only tests run by RUN-ALL-TESTS that fail are
    those that Perl itself gets wrong.
  5. Fix test 1798.

    nbtrap committed Feb 16, 2014
    It was missing the definition of regThree.
  6. Remove some redundant code in CREATE-MATCHER-AUX specialized on REGISTER.

    nbtrap committed Feb 17, 2014
    The redundant code was moved into LABELS function definitions.
Commits on Feb 18, 2014
Commits on Feb 19, 2014
  1. Disable some tests in test/perltestdata, but add them to test/simple.

    nbtrap committed Feb 18, 2014
    These are tests that even Perl gets wrong.
  2. Record SUBREGISTER-COUNT instead of a list of SUBREGISTERS.

    nbtrap committed Feb 19, 2014
    The numbers of registers nested within a register must be contiguous
    with each other and with that of the parent register, so there's no
    need to compute a list of subregisters--we just need to know how many
    are nested within, and we can compute the rest therefrom.
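
The contiguity argument can be illustrated on CL-PPCRE-style parse trees, in which `(:REGISTER <regex>)` and `(:NAMED-REGISTER <name> <regex>)` denote capture groups. The sketch below assumes registers are numbered in the order their opening parentheses appear. The function names are hypothetical and do not come from the patch.

```lisp
;; Hypothetical helpers, not taken from the patch.

(defun registerp (tree)
  "True if TREE is a capture-group node."
  (and (consp tree)
       (member (first tree) '(:register :named-register))))

(defun count-registers (tree)
  "Number of capture groups in TREE, counting TREE itself if it is one."
  (if (atom tree)
      0
      (+ (if (registerp tree) 1 0)
         (loop for sub in (rest tree)
               sum (count-registers sub)))))

(defun subregister-count (register)
  "Number of capture groups nested strictly inside REGISTER."
  (1- (count-registers register)))

(defun subregister-numbers (number subregister-count)
  "Numbers of the registers nested inside register NUMBER, namely
NUMBER+1 through NUMBER+SUBREGISTER-COUNT."
  (loop for i from 1 upto subregister-count
        collect (+ number i)))

;; first register of "(a(b)(c(d)))(e)"
(defparameter *first-register*
  '(:register (:sequence #\a
                         (:register #\b)
                         (:register (:sequence #\c (:register #\d))))))
```

For the first register of `(a(b)(c(d)))(e)`, `(subregister-count *first-register*)` is 3, so `(subregister-numbers 1 3)` yields `(2 3 4)` without any list being stored at compile time.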
Commits on Feb 22, 2014
  1. Add two more tests that currently fail.

    nbtrap committed Feb 21, 2014
    Each fails for a different reason.  The first fails because it matches
    where Perl does not.  (It's not clear whose bug this is.)  The second
    fails because we're not waiting until the regex has finished compiling
    before validating register names.
  2. Add two more tests that fail.

    nbtrap committed Feb 21, 2014
    These are like the previous two tests added in the previous commit,
    except they deal with "self-referential" backreferences rather than
    forward backreferences.
  3. Add subpattern reference commentary to docs on *ALLOW-NAMED-REGISTERS*

    nbtrap committed Feb 21, 2014
    Also, add some FIXME comments to return to later.
  4. Restore the original docstrings to *REG-STARTS*, etc.

    nbtrap committed Feb 21, 2014
    This begins a third attempt at a register offsets storage model.
    
    The problem with the current model is that it changed the
    implementation of *REG-STARTS*, etc., variables that I didn't realize
    were part of the api.  This latest attempt will restore the original
    semantics to those variables and store offset stacks (for recursive
    subpattern references) in separate variables.
  5. Add more tests to *TESTS-TO-SKIP*

    nbtrap committed Feb 22, 2014
    These are tests that CL-PPCRE gets right but Perl gets wrong, except
    for 1812, which is caused by validating backreference names too soon.
  6. Move more tests (1809-1812) into test/simple.

    nbtrap committed Feb 22, 2014
    These are tests that Perl gets wrong.
  7. Add FIXME comment to come back to later.

    nbtrap committed Feb 22, 2014
    Does Perl try to match more than one capture group if more than one
    have the same name?
Commits on Feb 24, 2014
  1. Add more tests for subpattern-/back-reference cooperation.

    nbtrap committed Feb 24, 2014
    These currently fail due to Perl's incorrect behavior.  See Perl's RT
Commits on Feb 26, 2014
  1. Create new bindings for the referenced register upon entry to subpattern-reference.

    nbtrap committed Feb 26, 2014
    When entering a register x via subpattern-reference, the registers
    local to x receive new dynamic bindings, which shadow the old bindings
    for the duration of the subpattern call.  Previously, "local" did not
    include the register itself--x in this case.  With this patch, the
    referenced register now receives a new binding as well.
    
    It's not entirely clear that this is the appropriate behavior.  In a
    regex like "(.\1?)(?1)", the back-reference to '\1' now will always
    fail, rather than potentially matching according to what was matched
    in the first pass through the first register.
Commits on Feb 27, 2014
  1. Remove FIXME comment from convert.lisp.

    nbtrap committed Feb 27, 2014
    The issue referred to there is one of the subjects of #17.
  2. Use "recurse" instead of "refer" to describe the action associated with subpattern references.

    nbtrap committed Feb 27, 2014
Commits on Feb 28, 2014
  1. Convert calls to PUSH-OFFSETS and POP-OFFSETS to fewer calls to more specific functions.

    nbtrap committed Feb 28, 2014
  2. Update comment.

    nbtrap committed Feb 28, 2014
  3. Get rid of useless declaration.

    nbtrap committed Feb 28, 2014
Commits on Mar 1, 2014
  1. Wrap docstrings to 70 columns.

    nbtrap committed Mar 1, 2014
  2. Fix lexical/special binding bug.

    nbtrap committed Mar 1, 2014
    This went undetected for so long because of a bug in SBCL (and ECL,
    apparently).  The way it was written, it shouldn't have worked, but it
    did--except on CLISP, which is how the bug was caught.
  3. Fix indentation of PROG1.

    nbtrap committed Mar 1, 2014
  4. Get rid of extra LET.

    nbtrap committed Mar 1, 2014
  5. Rename OTHER-FN -> CONT.

    nbtrap committed Mar 1, 2014
Commits on Mar 2, 2014
  1. Fix declaration on SUBPATTERN-REFS.

    nbtrap committed Mar 2, 2014
    This should be a SPECIAL declaration, not a type declaration.