Commits on Sep 14, 2012
  1. Use macro not swash for utf8 quotemeta

    Karl Williamson authored
    The rules for deciding whether an above-Latin1 code point should be quoted are now saved
    in a macro generated from a trie by regen/regcharclass.pl, and these are
    now used by pp.c to test these cases.  This allows removal of a wrapper
    subroutine, and also there is no need for dynamic loading at run-time
    into a swash.
    
    This macro is about as big as I'm comfortable compiling in, but it
    saves the building of a hash that can grow over time, and removes a
    subroutine and interpreter variables.  Indeed, performance benchmarks
    show that it is about the same speed as a hash, but it does not require
    having to load the rules in from disk the first time it is used.
  2. regexec.c: Use new macros instead of swashes

    Karl Williamson authored
    A previous commit has caused macros to be generated that will match
    Unicode code points of interest to the \X algorithm.  This patch uses
    them.  This speeds up modern Korean processing by 15%.
    
    Together with recent previous commits, the throughput of modern Korean
    under \X has more than doubled, and is now comparable to other
    languages (which have increased themselves by 35%).
Commits on Sep 8, 2012
  1. PL_sawampersand: use 3 bit flags rather than bool

    iabyn authored
    Set a separate flag for each of $`, $& and $'.
    It still works fine in boolean context.
    
    This will allow us to have more refined control over what parts
    of a match string to copy (we currently copy the whole string).
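    
    A minimal sketch of the flag idea, not the actual perl source.  The
    flag names and helper functions below are hypothetical; only the notion
    of one bit per variable, still usable in boolean context, comes from
    the commit message.

```c
#include <stdint.h>

/* Hypothetical flag names; the real perl macros may be named differently. */
#define SAW_PREMATCH   0x01  /* $` was seen */
#define SAW_FULLMATCH  0x02  /* $& was seen */
#define SAW_POSTMATCH  0x04  /* $' was seen */

static uint8_t saw_flags = 0;  /* stands in for PL_sawampersand */

/* Record that one of the three match variables was compiled. */
static void note_match_var(uint8_t flag) { saw_flags |= flag; }

/* Boolean context: nonzero if any of the three variables was seen. */
static int saw_any(void) { return saw_flags != 0; }

/* Refined query: is the text before the match needed? */
static int need_prematch(void) { return (saw_flags & SAW_PREMATCH) != 0; }
```

    Code that only tests the variable for truth keeps working, while new
    code can ask about a single bit to decide which part of the string to
    copy.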
Commits on Aug 28, 2012
  1. Refactor \X regex handling to avoid a typical case table lookup

    Karl Williamson authored
    Prior to this commit 98.4% of Unicode code points that went through \X
    had to be looked up to see if they begin a grapheme cluster; then looked
    up again to find that they didn't require special handling.  This commit
    refactors things so only one look-up is required for those 98.4%.  It
    changes the table generated by mktables to accomplish this, and hence
    the name of it, and references to it are changed to correspond.
Commits on Aug 26, 2012
  1. Prepare for Unicode 6.2

    Karl Williamson authored
    This changes code to be able to handle Unicode 6.2, while continuing to
    handle all previous releases.
    
    The major change was a new definition of \X, which adds a property to
    its calculation.  Unfortunately \X is hard-coded into regexec.c, and so
    has to be revised whenever there is a change of this magnitude in Unicode,
    which fortunately isn't all that often.  I refactored the code in
    mktables to make it easier next time there is a change like this one.
  2. Comment out unused function

    Karl Williamson authored
    In looking at \X handling, I noticed that this function which is
    intended for use in it, actually isn't used.  This function may someday
    be useful, so I'm leaving the source in.
Commits on Aug 21, 2012
  1. Use new types for comppad and comppad_name

    Father Chrysostomos authored
    I know that a few times I’ve looked at perl source files to find out
    what type to use in ‘<type> foo = PL_whatever’.  So I am changing
    intrpvar.h as well as the api docs.
  2. Fix format closure bug with redefined outer sub

    Father Chrysostomos authored
    CVs close over their outer CVs.  So, when you write:
    
    my $x = 52;
    sub foo {
      sub bar {
        sub baz {
          $x
        }
      }
    }
    
    baz’s CvOUTSIDE pointer points to bar, bar’s CvOUTSIDE points to foo,
    and foo’s to the main cv.
    
    When the inner reference to $x is looked up, the CvOUTSIDE chain is
    followed, and each sub’s pad is looked at to see if it has an $x.
    (This happens at compile time.)
    
    It can happen that bar is undefined and then redefined:
    
    undef &bar;
    eval 'sub bar { my $x = 34 }';
    
    After this, baz will still refer to the main cv’s $x (52), but, if baz
    had  ‘eval '$x'’ instead of just $x, it would see the new bar’s $x.
    (It’s not really a new bar, as its refaddr is the same, but it has a
    new body.)
    
    This particular case is harmless, and is obscure enough that we could
    define it any way we want, and it could still be considered correct.
    
    The real problem happens when CVs are cloned.
    
    When a CV is cloned, its name pad already contains the offsets into
    the parent pad where the values are to be found.  If the outer CV
    has been undefined and redefined, those pad offsets can be com-
    pletely bogus.
    
    Normally, a CV cannot be cloned except when its outer CV is running.
    And the outer CV cannot have been undefined without also throwing
    away the op that would have cloned the prototype.
    
    But formats can be cloned when the outer CV is not running.  So it
    is possible for cloned formats to close over bogus entries in a new
    parent pad.
    
    In this example, \$x gives us an array ref.  It shows ARRAY(0xbaff1ed)
    instead of SCALAR(0xdeafbee):
    
    sub foo {
        my $x;
    format =
    @
    ($x,warn \$x)[0]
    .
    }
    undef &foo;
    eval 'sub foo { my @x; write }';
    foo
    __END__
    
    And if the offset that the format’s pad closes over is beyond the end
    of the parent’s new pad, we can even get a crash, as in this case:
    
    eval
    'sub foo {' .
    '{my ($a,$b,$c,$d,$e,$f,$g,$h,$i,$j,$k,$l,$m,$n,$o,$p,$q,$r,$s,$t,$u)}'x999
    . q|
        my $x;
    format =
    @
    ($x,warn \$x)[0]
    .
    }
    |;
    undef &foo;
    eval 'sub foo { my @x; my $x = 34; write }';
    foo();
    __END__
    
    So now, instead of using CvROOT to identify clones of
    CvOUTSIDE(format), we use the padlist ID instead.  Padlists don’t
    actually have an ID, so we give them one.  Any time a sub is cloned,
    the new padlist gets the same ID as the old.  The format needs to
    remember what its outer sub’s padlist ID was, so we put that in the
    padlist struct, too.
Commits on Aug 2, 2012
  1. regcomp.c: Fix multi-char fold bug

    Karl Williamson authored
    Input text to be matched under /i is placed in EXACTFish nodes.  The
    current limit on such text is 255 bytes per node.  Even if we raised
    that limit, it will always be finite.  If the input text is longer than
    this, it is split across 2 or more nodes.  A problem occurs when that
    split occurs within a potential multi-character fold.  For example, if
    the final character that fits in a node is 'f', and the next character
    is 'i', it should be matchable by LATIN SMALL LIGATURE FI, but because
    Perl isn't structured to find multi-char folds that cross node
    boundaries, we will miss it.
    
    The solution presented here isn't optimum.  What we do is try to prevent
    all EXACTFish nodes from ending in a character that could be at the
    beginning or middle of a multi-char fold.  That prevents the problem.
    But in actuality, the problem only occurs if the input text is actually
    a multi-char fold, which happens much less frequently.  For example,
    we try to not end a full node with an 'f', but the problem doesn't
    actually occur unless the adjacent following node begins with an 'i' (or
    one of the other characters that 'f' participates in).  That is, this
    patch splits when it doesn't need to.
    
    At the point of execution for this patch, we only know that the final
    character that fits in the node is that 'f'.  The next character remains
    unparsed, and could be in any number of forms, a literal 'i', or a hex,
    octal, or named character constant, or it may need to be decoded (from
    'use encoding').  So look-ahead is not really viable.
    
    So finding if a real multi-character fold is involved would have to be
    done later in the process, when we have full knowledge of the nodes, at
    the places where join_exact() is now called, and would require inserting
    a new node(s) in the middle of existing ones.
    
    This solution seems reasonable instead.
    
    It does not yet address named character constants (\N{}) which currently
    bypass the code added here.
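    
    The splitting rule can be sketched roughly as follows.  This is an
    illustration under stated assumptions, not the regcomp.c code: the
    function names are hypothetical, the text is treated as plain bytes,
    and the set of "risky" characters is a tiny hand-picked sample rather
    than the set derived from Unicode data.

```c
#include <stddef.h>
#include <string.h>

/* Assumption: a tiny, illustrative set of characters that can begin or
 * continue a multi-character fold (think "ff", "fi", "ss", "st").  The
 * real set is larger and comes from Unicode data. */
static int may_start_multi_fold(char c)
{
    return c != '\0' && strchr("fFsS", c) != NULL;
}

/* Return how many bytes of text[0..len) to place in the next node, given a
 * per-node limit of max_len.  If the text must be split, back up so that
 * the node does not end in a character that might begin a multi-char
 * fold.  At least one byte is always taken so that progress is made. */
static size_t node_fill_len(const char *text, size_t len, size_t max_len)
{
    size_t n;
    if (len <= max_len)
        return len;              /* everything fits: no split needed */
    n = max_len;
    while (n > 1 && may_start_multi_fold(text[n - 1]))
        n--;                     /* push the risky character to the next node */
    return n;
}
```

    As the message notes, this splits early even when the next character
    would not have completed a fold; the sketch has the same property
    because it never looks past the node limit.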
Commits on Jul 12, 2012
  1. Eliminate PL_OP_SLAB_ALLOC

    Father Chrysostomos authored
    This commit eliminates the old slab allocator.  It had bugs in it, in
    that ops would not be cleaned up properly after syntax errors.  So why
    not fix it?  Well, the new slab allocator *is* the old one fixed.
    
    Now that this is gone, we don’t have to worry as much about ops leak-
    ing when errors occur, because it won’t happen any more.
    
    Recent commits eliminated the only reason to hang on to it:
     PERL_DEBUG_READONLY_OPS required it.
  2. PERL_DEBUG_READONLY_OPS with the new allocator

    Father Chrysostomos authored
    I want to eliminate the old slab allocator (PL_OP_SLAB_ALLOC),
    but this useful debugging tool needs to be rewritten for the new
    one first.
    
    This is slightly better than under PL_OP_SLAB_ALLOC, in that CVs cre-
    ated after the main CV starts running will get read-only ops, too.  It
    is when a CV finishes compiling and relinquishes ownership of the slab
    that the slab is made read-only, because at that point it should not
    be used again for allocation.
    
    BEGIN blocks are exempt, as they are processed before the Slab_to_ro
    call in newATTRSUB.  The Slab_to_ro call must come at the very end,
    after LEAVE_SCOPE, because otherwise the ops freed via the stack (the
    SAVEFREEOP calls near the top of newATTRSUB) will make the slab writa-
    ble again.  At that point, the BEGIN block has already been run and
    its slab freed.  Maybe slabs belonging to BEGIN blocks can be made
    read-only later.
    
    Under PERL_DEBUG_READONLY_OPS, op slabs have two extra fields to
    record the size and readonliness of each slab.  (Only the first slab
    in a CV’s slab chain uses the readonly flag, since it is conceptually
    simpler to treat them all as one unit.)  Without recording this infor-
    mation manually, things become unbearably slow, the tests taking hours
    and hours instead of minutes.
Commits on Jun 30, 2012
  1. handy.h: Fix isBLANK_uni and isBLANK_utf8

    Karl Williamson authored
    These macros have never worked outside the Latin1 range, so this extends
    them to work.
    
    There are no tests I could find for things in handy.h, except that many
    of them are called all over the place during the normal course of
    events.  This commit adds a new file for such testing, containing for
    now only a few tests for the isBLANK macros.
Commits on Jun 13, 2012
  1. eliminate PL_reginterp_cnt

    iabyn authored
    This used to be the mechanism to determine whether "use re 'eval'" needed
    to be in scope; but now that we make a clear distinction between literal
    and runtime code blocks, it's no longer needed.
Commits on Jun 5, 2012
  1. [perl #78742] Store CopSTASH in a pad under threads

    Father Chrysostomos authored
    Before this commit, a pointer to the cop’s stash was stored in
    cop->cop_stash under non-threaded perls, and the name and name length
    were stored in cop->cop_stashpv and cop->cop_stashlen under ithreads.
    
    Consequently, eval "__PACKAGE__" would end up returning the
    wrong package name under threads if the current package had been
    assigned over.
    
    This commit changes the way cops store their stash under threads.  Now
    it is an offset (cop->cop_stashoff) into the new PL_stashpad array
    (just a mallocked block), which holds pointers to all stashes that
    have code compiled in them.
    
    I didn’t use the lexical pads, because CopSTASH(cop) won’t work unless
    PL_curpad is holding the right pad.  And things start to get very
    hairy in pp_caller, since the correct pad isn’t anywhere easily
    accessible on the context stack (oldcomppad actually referring to the
    current comppad).  The approach I’ve followed uses far less code, too.
    
    In addition to fixing the bug, this also saves memory.  Instead of
    allocating a separate PV for every single statement (to hold the stash
    name), now all lines of code in a package can share the same stashpad
    slot.  So, on a 32-bit OS X, that’s 16 bytes less memory per COP for
    short package names.  Since stashoff is the same size as stashpv,
    there is no difference there.  Each package now needs just 4 bytes in
    the stashpad for storing a pointer.
    
    For speed’s sake PL_stashpadix stores the index of the last-used
    stashpad offset.  So only when switching packages is there a linear
    search through the stashpad.
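    
    A rough sketch of the lookup described above, assuming a fixed-size
    array for brevity (the message says the real PL_stashpad is a mallocked
    block).  The function name and the STASHPAD_MAX limit are hypothetical.

```c
#include <stddef.h>

#define STASHPAD_MAX 64          /* fixed size keeps the sketch short */

static const void *stashpad[STASHPAD_MAX]; /* stands in for PL_stashpad */
static size_t stashpad_used = 0;           /* number of slots in use */
static size_t stashpad_ix = 0;             /* stands in for PL_stashpadix */

/* Return the stashpad offset for `stash`, allocating a slot if needed.
 * Returns (size_t)-1 if the fixed-size pad is full. */
static size_t stashpad_offset(const void *stash)
{
    size_t i;
    /* Fast path: same package as the previous lookup. */
    if (stashpad_ix < stashpad_used && stashpad[stashpad_ix] == stash)
        return stashpad_ix;
    /* Slow path: linear search, only reached when the package changes. */
    for (i = 0; i < stashpad_used; i++) {
        if (stashpad[i] == stash)
            return stashpad_ix = i;
    }
    if (stashpad_used == STASHPAD_MAX)
        return (size_t)-1;
    stashpad[stashpad_used] = stash;       /* first code seen in this package */
    return stashpad_ix = stashpad_used++;
}
```

    Every statement compiled in the same package gets the same offset, so
    the per-statement cost is one small integer rather than a copy of the
    package name.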
Commits on May 23, 2012
  1. Excise PL_amagic_generation

    Father Chrysostomos authored
    The core is not using it any more.  Every CPAN module that increments
    it also does newXS, which triggers mro_method_changed_in, which is
    sufficient; so nothing will break.
    
    So, to keep those modules compiling, PL_amagic_generation is now an
    alias to PL_na outside the core.
Commits on Feb 18, 2012
  1. Remove gete?[ug]id caching

    Ævar Arnfjörð Bjarmason authored
    Currently we cache the UID/GID and effective UID/GID similarly to how
    we used to cache getpid() before v5.14.0-251-g0e21945. Remove this
    magical behavior in favor of always calling getuid(), getgid()
    etc. This resolves RT #96208.
    
    A minimal testcase for this is the following by Leon Timmermans
    attached to RT #96208:
    
        eval { require 'syscall.ph'; 1 } or eval { require 'sys/syscall.ph'; 1 } or die $@;
    
        if (syscall(&SYS_setuid, $ARGV[0] + 0 || 1000) >= 0 or die "$!") {
                printf "\$< = %d, getuid = %d\n", $<, syscall(&SYS_getuid);
        }
    
    I.e. if we call the sete?[ug]id() functions unbeknownst to perl the
    $<, $>, $( and $) variables won't be updated. This results in the same
    sort of issues we had with $$ before v5.14.0-251-g0e21945, and
    getppid() before my v5.15.7-407-gd7c042c patch.
    
    I'm completely eliminating the PL_egid, PL_euid, PL_gid and PL_uid
    variables as part of this patch.  This will break some CPAN modules,
    but it'll be really easy before the v5.16.0 final to reinstate
    them. I'd like to remove them to see what breaks, and how easy it is
    to fix it.
    
    These variables are not part of the public API, and the modules using
    them could either use the Perl_gete?[ug]id() functions or are working
    around the bug I'm fixing with this commit.
    
    The new PL_delaymagic_(egid|euid|gid|uid) variables I'm adding are
    *only* intended to be used internally in the interpreter to facilitate
    the delaymagic in Perl_pp_sassign. There's probably some way not to
    export these to programs that embed perl, but I haven't found out how
    to do that.
Commits on Feb 16, 2012
  1. perl #77654: quotemeta quotes non-ASCII consistently

    Karl Williamson authored
    As described in the pod changes in this commit, this changes quotemeta()
    to consistently quote non-ASCII characters when used under
    unicode_strings.  The behavior is changed for these and UTF-8 encoded
    strings to more closely align with Unicode's recommendations.
    
    The end result is that we *could* at some future point start using other
    characters as metacharacters than the 12 we do now.
Commits on Feb 15, 2012
  1. Further eliminate POSIX-emulation under LinuxThreads

    Ævar Arnfjörð Bjarmason authored
    Under POSIX threads the getpid() and getppid() functions return the
    same values across multiple threads, i.e. threads don't have their own
    PID's. This is not the case under the obsolete LinuxThreads where each
    thread has a different PID, so getpid() and getppid() will return
    different values across threads.
    
    Ever since the first perl 5.0 we've returned POSIX-consistent
    semantics for $$, until v5.14.0-251-g0e21945 when the getpid() cache
    was removed. In 5.8.1 Rafael added further explicit POSIX emulation in
    perl-5.8.0-133-g4d76a34 [1] by explicitly caching getppid(), so that
    multiple threads would always return the same value.
    
    I don't think all this effort to emulate POSIX semantics is worth it. I
    think $$ and getppid() are OS-level functions that should always
    return the same as their C equivalents. I shouldn't have to use a
    module like Linux::Pid to get the OS version of the return values.
    
    This is pretty much a complete non-issue in practice these days,
    LinuxThreads was a Linux 2.4 thread implementation that nobody
    maintains anymore[2], all modern Linux distros use NPTL threads which
    don't suffer from this discrepancy. Debian GNU/kFreeBSD does use
    LinuxThreads in the 6.0 release, but they too will be moving away from
    it in future releases, and really, nobody uses Debian GNU/kFreeBSD
    anyway.
    
    This caching makes it unnecessarily tedious to fork an embedded Perl
    interpreter. When someone constructs an embedded perl interpreter
    and forks their application, the fork(2) system call isn't going to
    run Perl_pp_fork(), and thus the return value of $$ and getppid()
    doesn't reflect the current process. See [3] for a bug in uWSGI
    related to this, and Perl::AfterFork on the CPAN for XS code that you
    need to run after forking a PerlInterpreter unbeknownst to perl.
    
    We've already been failing the tests in t/op/getpid.t on these Linux
    systems that nobody apparently uses, the Debian GNU/kFreeBSD users did
    notice and filed #96270, this patch fixes that failure by changing the
    tests to test for different behavior under LinuxThreads, I've tested
    that this works on my Debian GNU/kFreeBSD 6.0.4 virtual machine.
    
    If this change is found to be unacceptable (i.e. we want to continue
    to emulate POSIX thread semantics for the sake of LinuxThreads) we
    also need to revert v5.14.0-251-g0e21945, because currently we're only
    emulating POSIX semantics for getppid(), not getpid(). But I don't
    think we should do that, both v5.14.0-251-g0e21945 and this commit are
    awesome.
    
    This commit includes a change to embedvar.h made by "make
    regen_headers".
    
    1. http://www.nntp.perl.org/group/perl.perl5.porters/2002/08/msg64603.html
    2. http://pauillac.inria.fr/~xleroy/linuxthreads/
    3. http://projects.unbit.it/uwsgi/ticket/85
Commits on Feb 11, 2012
  1. intrpvar.h: Rmv no longer used PL_ variable

    Karl Williamson authored
    Commit 24caacb removed all uses of this
    variable, but failed to remove it.
  2. regcomp.c: /[[:lower:]]/i should match the same as /\p{Lower}/i

    Karl Williamson authored
    Same for [[:upper:]] and \p{Upper}.  These were instead matching all of
    [[:alpha:]] or \p{Alpha}.  What /\p{Lower}/i and /\p{Upper}/i actually match
    is \p{Cased}, and so that is what these should match.
Commits on Feb 9, 2012
  1. Add compile-time inversion lists for POSIX classes

    Karl Williamson authored
    These will be used in regcomp.c to replace the existing bit-wise
    handling of these, enabling subsequent optimizations.
    
    These are compiled-in, and hence affect the memory footprint of every
    program, including those that don't use Unicode.  The lists that aren't
    tiny are therefore currently restricted to only the Latin1 range;
    anything needed beyond that will have to be read in at execution time,
    just as before.
    
    The design allows for easy conversion from Latin1 to use the full
    Unicode range, should it be deemed desirable for some or all of these.
  2. regcomp.c: Use compile-time invlists

    Karl Williamson authored
    This creates three simple compile-time inversion lists from the data
    that has been generated in a previous commit, and uses two of them.
    Three PL_ variables are used to store them.
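    
    For readers unfamiliar with inversion lists, here is a minimal sketch
    of the data structure and a membership test.  It is not the regcomp.c
    implementation; the function name and the example list are
    hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* An inversion list is a sorted array of code points.  Element 0 begins a
 * range that is in the set, element 1 begins a range that is not, and so
 * on, alternating.  This example list is the ASCII digits '0'..'9'. */
static const uint32_t ascii_digit_invlist[] = { 0x30, 0x3A };

/* Return 1 if cp is in the set described by list[0..n), else 0. */
static int invlist_contains(const uint32_t *list, size_t n, uint32_t cp)
{
    size_t lo = 0, hi = n;       /* binary search: count elements <= cp */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (list[mid] <= cp)
            lo = mid + 1;
        else
            hi = mid;
    }
    return (int)(lo & 1);        /* odd count: cp is inside an included range */
}
```

    Because the list is a constant array, it can be compiled in and
    searched without loading anything from disk, which is the trade-off
    against memory footprint that the message discusses.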
Commits on Jan 13, 2012
  1. intrpvar.h: clarification in comment

    Karl Williamson authored
Commits on Nov 11, 2011
  1. Re-order intrpvar.h to avoid false warnings about holes.

    Nicholas Clark authored
    Under the default configuration options for ithreads on x86_64 *nix,
    PERL_IMPLICIT_CONTEXT is defined. The variables specific to this are at the
    end of the interpreter struct, and their size is not an integer multiple of
    its alignment constraint. Hence there will always be a "hole". Move the
    "hole" so that it is beyond the end of the structure. This avoids the Linux
    tool "pahole", used for finding wasted space, from a false positive report
    of a hole that can't be avoided.
  2. Re-order intrpvar.h to avoid holes in the interpreter struct.

    Nicholas Clark authored
    Because commit 45d91b8 needed to change a buffer size in a
    per-thread variable, it created a hole in the ithreads interpreter struct,
    as structure members after the buffer must be word aligned.
    
    Re-order various structure members to avoid the hole.
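    
    The effect of member order on padding can be illustrated with two
    small structs.  The members are hypothetical and unrelated to the real
    interpreter struct; exact sizes depend on the platform's alignment
    rules.

```c
#include <stddef.h>

/* Hypothetical members; not the real interpreter struct. */
struct with_hole {
    char  buf[13];   /* odd-sized buffer ... */
    void *ptr;       /* ... so padding is needed before this aligned member */
    char  flag;      /* and more padding follows this one */
};

struct reordered {
    void *ptr;       /* most strictly aligned member first */
    char  buf[13];   /* byte-aligned members pack together ... */
    char  flag;      /* ... leaving at most trailing padding */
};
```

    On a typical x86_64 build the first layout occupies 32 bytes and the
    second 24, though that figure is an expectation about common ABIs
    rather than something the C standard guarantees.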
Commits on Nov 9, 2011
  1. intrpvar.h: Increase size of variable that stores UTF8 bytes

    Karl Williamson authored
    A Perl utf8 character can occupy up to 13 bytes.  This only accepted up to 11,
    causing a swash assertion failure for very large code points matching
    Unicode properties.
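    
    A sketch of where the figure 13 comes from, assuming the usual layout
    of Perl's extended UTF-8 on ASCII platforms (thresholds written from
    memory of utf8.h, so treat them as an assumption).  The function name
    is hypothetical.

```c
#include <stdint.h>

/* Number of bytes needed to encode cp, assuming Perl's extended UTF-8:
 * standard lengths up to 6 bytes, a 7-byte form for up to 36 bits, and a
 * 13-byte form (one start byte plus 12 continuation bytes) beyond that. */
static int utf8_len(uint64_t cp)
{
    if (cp < 0x80)            return 1;
    if (cp < 0x800)           return 2;
    if (cp < 0x10000)         return 3;
    if (cp < 0x200000)        return 4;
    if (cp < 0x4000000)       return 5;
    if (cp < 0x80000000ULL)   return 6;
    if (cp < 0x1000000000ULL) return 7;
    return 13;
}
```

    Any buffer meant to hold one encoded character therefore needs room
    for 13 bytes, plus a terminator if one is stored.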
Commits on Oct 27, 2011
  1. Fix CORE::glob

    Father Chrysostomos authored
    This commit makes CORE::glob bypass glob overrides.
    
    A side effect of the fix is that, with the default glob implementa-
    tion, undefining *CORE::GLOBAL::glob no longer results in an ‘unde-
    fined subroutine’ error.
    
    Another side effect is that compilation of a glob op no longer assumes
    that the loading of File::Glob will create the *CORE::GLOBAL::glob
    typeglob.  ‘++$INC{"File/Glob.pm"}; sub File::Glob::csh_glob; eval '<*>';’
    used to crash.
    
    This is accomplished using a mechanism similar to lock() and
    threads::shared.  There is a new PL_globhook interpreter varia-
    ble that pp_glob calls when there is no override present.  Thus,
    File::Glob (which is supposed to be transparent, as it *is* the
    built-in implementation) no longer interferes with the user mechanism
    for overriding glob.
    
    This removes one tier from the five or so hacks that constitute glob’s
    implementation, and which work together to make it one of the buggiest
    and most inconsistent areas of Perl.
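    
    The hook mechanism can be pictured with plain function pointers.  This
    is only a sketch: every name below except the idea of a hook variable
    is hypothetical, and the real pp_glob deals with ops and SVs rather
    than C strings.

```c
#include <stddef.h>

typedef const char *(*glob_fn)(const char *pattern);

static glob_fn user_override = NULL;  /* stands in for a user's glob override */
static glob_fn glob_hook = NULL;      /* stands in for PL_globhook */

/* Built-in implementation, as the default glob module would install it. */
static const char *builtin_glob(const char *pattern)
{
    (void)pattern;
    return "builtin";
}

/* A user-supplied replacement. */
static const char *custom_glob(const char *pattern)
{
    (void)pattern;
    return "override";
}

/* Plain glob: honor the user override, otherwise call the hook. */
static const char *do_glob(const char *pattern)
{
    if (user_override)
        return user_override(pattern);
    return glob_hook ? glob_hook(pattern) : NULL;
}

/* CORE::glob: always bypass the override and go straight to the hook. */
static const char *do_core_glob(const char *pattern)
{
    return glob_hook ? glob_hook(pattern) : NULL;
}
```

    Since the built-in implementation sits behind its own pointer, it no
    longer occupies the slot that users rely on for overriding glob.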
Commits on Oct 24, 2011
  1. Remove part of intrpvar.h comment

    Father Chrysostomos authored
    This second sentence is no longer true as of 87b9e16.
Commits on Oct 1, 2011
  1. utf8.c: Add function to retrieve new _Perl_IDStart prop

    Karl Williamson authored
  2. Don't use swash to find cntrls

    Karl Williamson authored
    Unicode stability policy guarantees that no code points will ever be
    added to the control characters beyond those already in it.
    
    All such characters are in the Latin1 range, and so the Perl core
    already knows which ones those are, and so there is no need to go out to
    disk and create a swash for these.
  3. No need for swashes for properties that are ASCII-only

    Karl Williamson authored
    These three properties are restricted to being true only for ASCII
    characters.  That information is compiled into Perl, so no need to
    create swashes for them.
  4. No need for swashes for computing if ASCII

    Karl Williamson authored
    This information is trivially computed via the macro, no need to go out
    to disk and store a swash for this.
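    
    Both checks reduce to simple range tests, roughly as below.  The
    function names are hypothetical (the real code uses macros in
    handy.h); the ranges are those of Unicode's control characters and of
    ASCII.

```c
#include <stdint.h>

/* Control characters (general category Cc) are U+0000..U+001F and
 * U+007F..U+009F, all within the Latin1 range. */
static int is_cntrl_cp(uint32_t cp)
{
    return cp <= 0x1F || (cp >= 0x7F && cp <= 0x9F);
}

/* ASCII is simply the code points below 128. */
static int is_ascii_cp(uint32_t cp)
{
    return cp < 0x80;
}
```

    Neither test needs a table, so there is nothing to load from disk.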
Commits on Aug 11, 2011
  1. Simplify embedvar.h, removing a level of macro indirection for PL_* variables.

    Nicholas Clark authored
    For the default (non-multiplicity) configuration, PERLVAR*() macros now
    directly expand their arguments to tokens such as C<PL_defgv>, instead of
    expanding to C<PL_Idefgv>. This removes over 350 lines from F<embedvar.h>,
    which defined macros to map from C<PL_Idefgv> to C<PL_defgv> and so forth.
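    
    The removed indirection might be pictured like this.  The macro
    definitions are simplified, hypothetical stand-ins (the real ones
    declare struct members or globals depending on configuration), and
    the variable names old_defgv and example_count are invented for the
    sketch.

```c
/* Old style (sketch): the macro declares PL_I<name>, and a separate
 * mapping macro is needed for every variable. */
#define OLD_PERLVAR(var, type) type PL_##var;
OLD_PERLVAR(Iold_defgv, void *)
#define PL_old_defgv PL_Iold_defgv   /* one such line per variable */

/* New style (sketch): the macro expands straight to PL_<name>, so no
 * per-variable mapping line is required. */
#define PERLVAR(prefix, var, type) type PL_##var;
PERLVAR(I, defgv, void *)
PERLVAR(I, example_count, int)
```

    Dropping the per-variable mapping lines is what accounts for the
    hundreds of lines removed from embedvar.h.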
Commits on Jul 18, 2011
  1. In intrpvar.h, move all the USE_LOCALE_NUMERIC variables together.

    Nicholas Clark authored
    a453c16 added PL_numeric_radix_sv at the end of the interpreter struct
    to avoid breaking binary compatibility. However, as we now explicitly no longer
    guarantee compatibility across major releases, there's no reason not to move it
    next to the other variables related to it.
Commits on Jul 3, 2011
  1. Change inversion lists to SVs

    Karl Williamson authored
    The inversion list is an opaque object, currently implemented as an SV.
    Even if it ends up being an HV in the future, Nicholas is of the opinion
    that it should be presented to the world as an SV*.