Commit

Applied edits from hbm on PerlMonks.
chromatic committed Sep 24, 2010
1 parent 319b2a7 commit 83df8b8
Showing 3 changed files with 80 additions and 69 deletions.
6 changes: 6 additions & 0 deletions CREDITS
@@ -173,3 +173,9 @@ E: cstith@gmail.com

N: Mike Huffman
E: mhuffman@aracnet.com

N: E. Choroba
E: choroba on PerlMonks

N: hbm
E: hbm on PerlMonks
32 changes: 20 additions & 12 deletions sections/operator_characteristics.pod
@@ -133,15 +133,23 @@ X<fixity; circumfix>
X<postcircumfix>
X<fixity; postcircumfix>

The I<fixity> of an operator is its position relative to its operands. The
mathematical operators tend to be I<infix> operators, where they appear between
their operands. Other operators are I<prefix>, where they appear before their
operands; these tend to be unary operators, such as the prefix increment
operator C<++$x> or the mathematical and boolean negation operators (C<-$x> and
C<!$x>, respectively). I<Postfix> operators appear after their operands (such
as postfix increment C<$x++>). I<Circumfix> operators surround their operands,
such as the anonymous hash and anonymous array creation operators or quoting
operators (C<{ ... }> and C<[ ... ]> or C<qq{ ... }>, for example).
I<Postcircumfix> operators surround some operands but follow others, as in the
case of array or hash indices (C<$hash{ ... }> and C<$array[ ... ]>, for
example).
An operator's I<fixity> is its position relative to its operands:

=over 4

=item I<Infix> operators appear between their operands. Most mathematical
operators are infix operators, such as the multiplication operator in C<$length
* $width>.

=item I<Prefix> operators appear before their operands and I<postfix>
operators appear after. These operators tend to be unary, such as mathematical
negation (C<-$x>), boolean negation (C<!$y>), and postfix increment (C<$z++>).

=item I<Circumfix> operators surround their operands. Examples include the
anonymous hash constructor (C<{ ... }>) and quoting operators (C<qq[ ... ]>).

=item I<Postcircumfix> operators follow certain operands and surround others,
as in the case of hash or array element access (C<$hash{ ... }> and C<$array[
... ]>).

=back
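
As a minimal sketch of the fixity categories described in this hunk, the following standalone Perl snippet shows one operator of each kind. Only C<$length> and C<$width> come from the text; every other variable name is hypothetical and chosen purely for illustration.

```perl
use strict;
use warnings;

my ( $length, $width ) = ( 3, 4 );

my $area    = $length * $width;    # infix: the operator sits between its operands
my $negated = -$area;              # prefix: the operator comes before its operand
my $count   = 0;
$count++;                          # postfix: the operator comes after its operand

my $sizes   = { area => $area };   # circumfix: the braces surround the operands
my @sides   = ( $length, $width );
my %size_of = ( area => $area );

# postcircumfix: the brackets follow the variable and surround the index or key
my $first_side = $sides[0];
my $stored     = $size_of{area};

print "$area $negated $count $first_side $stored $sizes->{area}\n";
```

Run as a script, this prints C<12 -12 1 3 12 12>.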
111 changes: 54 additions & 57 deletions sections/regular_expressions.pod
@@ -197,8 +197,8 @@ X<greedy quantifiers>
X<quantifiers; greedy>

The C<+> and C<*> quantifiers by themselves are I<greedy quantifiers>; they
match as many times as possible. This is particularly pernicious when using
the tempting-but-troublesome "match any amount of anything" pattern C<.*>:
match as much of the input string as possible. This is particularly pernicious
when matching "any amount of anything" with C<.*>:

=begin programlisting

@@ -211,11 +211,10 @@ the tempting-but-troublesome "match any amount of anything" pattern C<.*>:

=end programlisting

The problem is more obvious when you expect to match a short portion of a
string. Greediness always tries to match as much of the input string as
possible I<first>, backing off only when it's obvious that the match will not
succeed. Thus you may not be able to fit all of the results into the four
boxes in 7 Down if you go looking for "loam" with:
Greedy quantifiers always try to match as much of the input string as possible
I<first>, backing off only when it's obvious that the match will not succeed.
You may not be able to fit all of the results into the four boxes in 7 Down if
you go looking for "loam" with:

=begin programlisting

@@ -227,96 +226,94 @@ You'll get C<Alabama>, C<Belgium>, and C<Bethlehem> for starters. The soil
might be nice there, but they're all too long--and the matches start in the
middle of the words.

X<regex anchors>
X<anchors; start of string>

I<Regex anchors> force a match at a specific position in a string. The I<start
of string anchor> (C<\A>) ensures that any match will start at the beginning of
the string:
Turn a greedy quantifier into a non-greedy quantifier by appending the C<?>
quantifier:

=begin programlisting

# also matches "lammed", "lawmaker", and "layman"
my $seven_down = qr/\Al${letters_only}{2}m/;
my $minimal_greedy_match = qr/hot.*?meal/;

=end programlisting

X<anchors; end of string>

Similarly, the I<end of string anchor> (C<\Z>) ensures that any match will
I<end> at the end of the string.
When given a non-greedy quantifier, the regular expression engine will prefer
the I<shortest> possible potential match, and will increase the number of
characters identified by the C<.*?> token combination only if the current
number fails to match. Because C<*> matches zero or more times, the minimal
potential match for this token combination is zero characters:

=begin programlisting

# also matches "loom", which is close enough
my $seven_down = qr/\Al${letters_only}{2}m\Z/;
say 'Found a hot meal' if 'ilikeahotmeal' =~ /$minimal_greedy_match/;

=end programlisting

X<word boundary metacharacter>

If you're not fortunate enough to have a Unix word dictionary file available,
the I<word boundary metacharacter> (C<\b>) matches only at the boundary between
a word character (C<\w>) and a non-word character (C<\W>):
Use the C<+> quantifier to match one or more items:

=begin programlisting

my $seven_down = qr/\bl${letters_only}{2}m\b/;
my $minimal_greedy_at_least_one = qr/hot.+?meal/;

unlike( 'ilikeahotmeal', $minimal_greedy_at_least_one );

like( 'i like a hot meal', $minimal_greedy_at_least_one );

=end programlisting

=begin sidebar
The C<?> quantifier modifier also applies to the C<?> (zero or one matches)
quantifier as well as the range quantifiers. In every case, it causes the
regex to match as little of the input as possible.

Like Perl, there's more than one way to write a regular expression. Consider
choosing the most expressive and maintainable one.
The greedy patterns C<.+> and C<.*> are tempting but dangerous. If you write
regular expressions with greedy matches, test them thoroughly with a
comprehensive and automated test suite with representative data to lessen the
possibility of unpleasant surprises.
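
To make the difference between greedy and non-greedy matching concrete, here is a small sketch that captures what each form actually matches. The sample string and the variable names are hypothetical; only the C<hot.*?meal> pattern comes from the text.

```perl
use strict;
use warnings;

# Hypothetical input with two possible places for the match to end
my $text = 'a hot meal and another hot meal';

# Greedy: .* grabs as much as it can first, then backs off to the last "meal"
my ($greedy) = $text =~ /(hot.*meal)/;

# Non-greedy: .*? starts at zero characters and grows only as far as needed
my ($minimal) = $text =~ /(hot.*?meal)/;

print "greedy:  $greedy\n";
print "minimal: $minimal\n";
```

The greedy capture runs from the first C<hot> to the final C<meal>, while the non-greedy capture stops at the first C<meal> it can reach.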

=end sidebar
=head1 Regex Anchors

Sometimes you can't anchor a regular expression. In those cases, you can turn
a greedy quantifier into a non-greedy quantifier by appending the C<?>
quantifier:
X<regex anchors>
X<anchors; start of string>

I<Regex anchors> force a match at a specific position in a string. The I<start
of string anchor> (C<\A>) ensures that any match will start at the beginning of
the string:

=begin programlisting

my $minimal_greedy_match = qr/hot.*?meal/;
# also matches "lammed", "lawmaker", and "layman"
my $seven_down = qr/\Al${letters_only}{2}m/;

=end programlisting

In this case, the regular expression engine will prefer the I<shortest>
possible potential match, increasing the number of characters identified by the
C<.*?> token combination only if the current number fails to match. Because
C<*> matches zero or more times, the minimal potential match for this token
combination is zero characters:
X<anchors; end of string>

The I<end of string anchor> (C<\Z>) ensures that any match will I<end> at the
end of the string (or just before a trailing newline).

=begin programlisting

say 'Found a hot meal' if 'ilikeahotmeal' =~ /$minimal_greedy_match/;
# also matches "loom", which is close enough
my $seven_down = qr/\Al${letters_only}{2}m\Z/;

=end programlisting

If this isn't what you want, use the C<+> quantifier to match one or more
items:
X<word boundary metacharacter>

The I<word boundary metacharacter> (C<\b>) matches only at the boundary between
a word character (C<\w>) and a non-word character (C<\W>). Thus to find
C<loam> but not C<Belgium>, use the anchored regex:

=begin programlisting

my $minimal_greedy_at_least_one = qr/hot.+?meal/;
my $seven_down = qr/\bl${letters_only}{2}m\b/;

unlike( 'ilikeahotmeal', $minimal_greedy_at_least_one );
=end programlisting

like( 'i like a hot meal', $minimal_greedy_at_least_one );
=begin sidebar

=end programlisting
Like Perl, there's more than one way to write a regular expression. Consider
choosing the most expressive and maintainable one.

The C<?> quantifier modifier also applies to the C<?> (zero or one matches)
quantifier as well as the range quantifiers. In every case, it causes the
regex to match as few times as possible.

In general, the greedy modifiers C<.+> and C<.*> are tempting but dangerous
tools. For simple programs which need little maintenance, they may be quick
and easy to write, but non-greedy matching seems to match human expectations
better. If you find yourself writing a lot of regular expressions with greedy
matches, test them thoroughly with a comprehensive and automated test suite
with representative data to lessen the possibility of unpleasant surprises.
=end sidebar
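
The effect of the anchors in this section can be checked with a short sketch like the one below. It assumes a stand-in definition for C<$letters_only>, which is defined outside this diff, and the sample strings are hypothetical.

```perl
use strict;
use warnings;

# Assumption: a stand-in for the $letters_only pattern defined elsewhere
my $letters_only = qr/[a-zA-Z]/;

my $unanchored   = qr/l${letters_only}{2}m/;        # may match inside a word
my $word_bounded = qr/\bl${letters_only}{2}m\b/;    # must be a whole word
my $whole_string = qr/\Al${letters_only}{2}m\Z/;    # must be the whole string

for my $candidate ( 'loam', 'gloom', 'lawmaker', 'rich loam here' ) {
    my @results = map { $candidate =~ $_ ? 'yes' : 'no' }
        $unanchored, $word_bounded, $whole_string;

    print "$candidate: unanchored=$results[0] ",
        "word-bounded=$results[1] whole-string=$results[2]\n";
}
```

Every candidate matches the unanchored pattern. Only C<loam> and C<rich loam here> match with word boundaries, and only C<loam> matches when the pattern is anchored at both ends of the string.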

=head1 Metacharacters
