---
title: Anchors in .NET Regular Expressions
description: Learn how to use anchors in regular expression patterns.
ms.date: 03/30/2017
dev_langs:
  - "csharp"
  - "vb"
helpviewer_keywords:
  - "atomic zero-width assertions"
  - "regular expressions, anchors"
  - "regular expressions, atomic zero-width assertions"
  - "anchors, in regular expressions"
  - "metacharacters, atomic zero-width assertions"
  - "metacharacters, anchors"
  - ".NET regular expressions, anchors"
  - ".NET regular expressions, atomic zero-width assertions"
ms.assetid: 336391f6-2614-499b-8b1b-07a6837108a7
---

Anchors in Regular Expressions

Anchors, or atomic zero-width assertions, specify a position in the string where a match must occur. When you use an anchor in your search expression, the regular expression engine does not advance through the string or consume characters; it looks for a match in the specified position only. For example, ^ specifies that the match must start at the beginning of a line or string. Therefore, the regular expression ^http: matches "http:" only when it occurs at the beginning of a line. The following table lists the anchors supported by the regular expressions in .NET.

| Anchor | Description |
|--------|-------------|
| `^` | By default, the match must occur at the beginning of the string; in multiline mode, it must occur at the beginning of the line. For more information, see Start of String or Line. |
| `$` | By default, the match must occur at the end of the string or before `\n` at the end of the string; in multiline mode, it must occur at the end of the line or before `\n` at the end of the line. For more information, see End of String or Line. |
| `\A` | The match must occur at the beginning of the string only (no multiline support). For more information, see Start of String Only. |
| `\Z` | The match must occur at the end of the string, or before `\n` at the end of the string. For more information, see End of String or Before Ending Newline. |
| `\z` | The match must occur at the end of the string only. For more information, see End of String Only. |
| `\G` | The match must start at the position where the previous match ended, or if there was no previous match, at the position in the string where matching started. For more information, see Contiguous Matches. |
| `\b` | The match must occur on a word boundary. For more information, see Word Boundary. |
| `\B` | The match must not occur on a word boundary. For more information, see Non-Word Boundary. |

Start of String or Line: ^

By default, the ^ anchor specifies that the following pattern must begin at the first character position of the string. If you use ^ with the xref:System.Text.RegularExpressions.RegexOptions.Multiline?displayProperty=nameWithType option (see Regular Expression Options), the match must occur at the beginning of each line.
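As a minimal sketch of that difference, the following code counts matches of the ^http: pattern from the introduction with and without multiline mode. The class name, method name, and three-line input are hypothetical and exist only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CaretAnchorSketch
{
    // Counts how many times "http:" is found at a position that ^ accepts.
    public static int CountMatches(string input, RegexOptions options) =>
        Regex.Matches(input, @"^http:", options).Count;

    public static void Main()
    {
        // Hypothetical input: three lines separated by \n.
        string input = "http://a.example\nhttp://b.example\nftp://c.example";

        // Default mode: ^ is satisfied only at the start of the string.
        Console.WriteLine(CountMatches(input, RegexOptions.None));      // 1

        // Multiline mode: ^ is satisfied at the start of every line.
        Console.WriteLine(CountMatches(input, RegexOptions.Multiline)); // 2
    }
}
```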

The following example uses the ^ anchor in a regular expression that extracts information about the years during which some professional baseball teams existed. The example calls two overloads of the xref:System.Text.RegularExpressions.Regex.Matches%2A?displayProperty=nameWithType method:

  • The call to the xref:System.Text.RegularExpressions.Regex.Matches%28System.String%2CSystem.String%29 overload finds only the first substring in the input string that matches the regular expression pattern.

  • The call to the xref:System.Text.RegularExpressions.Regex.Matches%28System.String%2CSystem.String%2CSystem.Text.RegularExpressions.RegexOptions%29 overload with the options parameter set to xref:System.Text.RegularExpressions.RegexOptions.Multiline?displayProperty=nameWithType finds all five substrings.

[!code-csharpConceptual.RegEx.Language.Assertions#1] [!code-vbConceptual.RegEx.Language.Assertions#1]

The regular expression pattern ^((\w+(\s?)){2,}),\s(\w+\s\w+),(\s\d{4}(-(\d{4}|present))?,?)+ is defined as shown in the following table.

| Pattern | Description |
|---------|-------------|
| `^` | Begin the match at the beginning of the input string (or the beginning of the line if the method is called with the xref:System.Text.RegularExpressions.RegexOptions.Multiline?displayProperty=nameWithType option). |
| `((\w+(\s?)){2,})` | Match one or more word characters followed by zero or one white-space character, at least two times. This is the first capturing group. This expression also defines a second and third capturing group: The second consists of the captured word, and the third consists of the captured white space. |
| `,\s` | Match a comma followed by a white-space character. |
| `(\w+\s\w+)` | Match one or more word characters followed by a white-space character, followed by one or more word characters. This is the fourth capturing group. |
| `,` | Match a comma. |
| `\s\d{4}` | Match a white-space character followed by four decimal digits. |
| `(-(\d{4}\|present))?` | Match zero or one occurrence of a hyphen followed by four decimal digits or the string "present". This is the sixth capturing group. It also includes a seventh capturing group. |
| `,?` | Match zero or one occurrence of a comma. |
| `(\s\d{4}(-(\d{4}\|present))?,?)+` | Match one or more occurrences of the following: a white-space character, four decimal digits, zero or one occurrence of a hyphen followed by four decimal digits or the string "present", and zero or one comma. This is the fifth capturing group. |
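To see how the capturing groups line up in practice, the following sketch applies the pattern to a single made-up record. The team name, league name, and years are invented for illustration and are not the data used by the article's own sample; the class and method names are likewise hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TeamLineSketch
{
    // The pattern broken down in the preceding table.
    public const string Pattern =
        @"^((\w+(\s?)){2,}),\s(\w+\s\w+),(\s\d{4}(-(\d{4}|present))?,?)+";

    public static Match MatchLine(string line) => Regex.Match(line, Pattern);

    public static void Main()
    {
        // Invented record in the "team, league, years" shape the pattern expects.
        Match m = MatchLine(
            "Example City Sluggers, Sample League, 1901-1910, 1925, 1950-present");

        Console.WriteLine(m.Groups[1].Value);   // group 1: the team name
        Console.WriteLine(m.Groups[4].Value);   // group 4: the league name

        // Group 5 repeats, so each year or year range is a separate capture.
        foreach (Capture c in m.Groups[5].Captures)
            Console.WriteLine(c.Value.Trim(' ', ','));
    }
}
```

Because the fifth group sits inside a + quantifier, its Value property holds only the last iteration; the Captures collection is what exposes every year entry.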

End of String or Line: $

The $ anchor specifies that the preceding pattern must occur at the end of the input string, or before \n at the end of the input string.

If you use $ with the xref:System.Text.RegularExpressions.RegexOptions.Multiline?displayProperty=nameWithType option, the match can also occur at the end of a line. The $ anchor is satisfied at \n but not at \r\n (the combination of carriage return and newline characters, or CR/LF). To handle the CR/LF character combination, include \r?$ in the regular expression pattern. Be aware that any \r matched by \r?$ becomes part of the match.
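The CR/LF behavior can be sketched as follows, assuming a hypothetical three-line input delimited by \r\n and a simple \w+ subexpression in front of the anchor. The helper name FindAll is illustrative only.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class DollarAnchorSketch
{
    // Returns the text of every match, always using multiline mode.
    public static string[] FindAll(string input, string pattern) =>
        Regex.Matches(input, pattern, RegexOptions.Multiline)
             .Cast<Match>()
             .Select(m => m.Value)
             .ToArray();

    public static void Main()
    {
        // Hypothetical input whose lines are separated by CR/LF.
        string input = "alpha\r\nbeta\r\ngamma";

        // Only "gamma" matches: on the other lines a \r sits between
        // the word and the \n, so $ is not satisfied right after the word.
        Console.WriteLine(FindAll(input, @"\w+$").Length);      // 1

        // \r?$ lets every line match, but the \r is part of the match.
        string[] lines = FindAll(input, @"\w+\r?$");
        Console.WriteLine(lines.Length);                         // 3
        Console.WriteLine(lines[0] == "alpha\r");                // True
    }
}
```

If the trailing \r is unwanted, one option is to call TrimEnd('\r') on each matched value after the fact.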

The following example adds the $ anchor to the regular expression pattern used in the example in the Start of String or Line section. It exercises the pattern in four ways:

  • When used with the original input string, which includes five lines of text, the xref:System.Text.RegularExpressions.Regex.Matches%28System.String%2CSystem.String%29?displayProperty=nameWithType method is unable to find a match, because the end of the first line does not match the $ pattern.

  • When the original input string is split into a string array, the xref:System.Text.RegularExpressions.Regex.Matches%28System.String%2CSystem.String%29?displayProperty=nameWithType method succeeds in matching each of the five lines.

  • When the xref:System.Text.RegularExpressions.Regex.Matches%28System.String%2CSystem.String%2CSystem.Text.RegularExpressions.RegexOptions%29?displayProperty=nameWithType method is called with the options parameter set to xref:System.Text.RegularExpressions.RegexOptions.Multiline?displayProperty=nameWithType, no matches are found because the regular expression pattern does not account for the carriage return character \r.

  • When the regular expression pattern is modified by replacing $ with \r?$, calling the xref:System.Text.RegularExpressions.Regex.Matches%28System.String%2CSystem.String%2CSystem.Text.RegularExpressions.RegexOptions%29?displayProperty=nameWithType method with the options parameter set to xref:System.Text.RegularExpressions.RegexOptions.Multiline?displayProperty=nameWithType again finds five matches.

[!code-csharpConceptual.RegEx.Language.Assertions#2] [!code-vbConceptual.RegEx.Language.Assertions#2]

Start of String Only: \A

The \A anchor specifies that a match must occur at the beginning of the input string. It is identical to the ^ anchor, except that \A ignores the xref:System.Text.RegularExpressions.RegexOptions.Multiline?displayProperty=nameWithType option. Therefore, it can only match the start of the first line in a multiline input string.
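A small sketch makes the contrast with ^ concrete. It assumes a hypothetical three-line input and passes the Multiline option in both calls; the class and method names are illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StartOnlyAnchorSketch
{
    // Counts matches in multiline mode so the two anchors can be compared.
    public static int Count(string input, string pattern) =>
        Regex.Matches(input, pattern, RegexOptions.Multiline).Count;

    public static void Main()
    {
        string input = "one\ntwo\nthree";   // hypothetical multiline input

        // ^ honors Multiline and matches the first word of every line.
        Console.WriteLine(Count(input, @"^\w+"));   // 3

        // \A ignores Multiline and matches only at the start of the string.
        Console.WriteLine(Count(input, @"\A\w+"));  // 1
    }
}
```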

The following example is similar to the examples for the ^ and $ anchors. It uses the \A anchor in a regular expression that extracts information about the years during which some professional baseball teams existed. The input string includes five lines. The call to the xref:System.Text.RegularExpressions.Regex.Matches%28System.String%2CSystem.String%2CSystem.Text.RegularExpressions.RegexOptions%29?displayProperty=nameWithType method finds only the first substring in the input string that matches the regular expression pattern. As the example shows, the xref:System.Text.RegularExpressions.RegexOptions.Multiline option has no effect.

[!code-csharpConceptual.RegEx.Language.Assertions#3] [!code-vbConceptual.RegEx.Language.Assertions#3]

End of String or Before Ending Newline: \Z

The \Z anchor specifies that a match must occur at the end of the input string, or before \n at the end of the input string. It is identical to the $ anchor, except that \Z ignores the xref:System.Text.RegularExpressions.RegexOptions.Multiline?displayProperty=nameWithType option. Therefore, in a multiline string, it can be satisfied only at the end of the last line, or just before a \n that ends the last line.

Note that \Z is satisfied at \n but is not satisfied at \r\n (the CR/LF character combination). To treat CR/LF as if it were \n, include \r?\Z in the regular expression pattern. Note that this will make the \r part of the match.
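The following sketch checks the three endings discussed above (no newline, \n, and \r\n) against \Z and against \r?\Z. The inputs and the \d+ subexpression are hypothetical choices made only to keep the example short.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EndOrNewlineAnchorSketch
{
    // True if the pattern finds a match anywhere in the input.
    public static bool Matches(string input, string pattern) =>
        Regex.IsMatch(input, pattern);

    public static void Main()
    {
        // Hypothetical inputs: no line ending, LF, and CR/LF.
        string[] inputs = { "2024", "2024\n", "2024\r\n" };

        foreach (string input in inputs)
        {
            // \Z accepts the end of the string or a final \n, but not \r\n.
            bool plain = Matches(input, @"\d+\Z");
            // \r?\Z also accepts \r\n, at the cost of matching the \r.
            bool withCr = Matches(input, @"\d+\r?\Z");
            Console.WriteLine($"{plain} {withCr}");
        }
        // Output:
        // True True
        // True True
        // False True
    }
}
```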

The following example uses the \Z anchor in a regular expression that is similar to the example in the Start of String or Line section, which extracts information about the years during which some professional baseball teams existed. The subexpression \r?\Z in the regular expression ^((\w+(\s?)){2,}),\s(\w+\s\w+),(\s\d{4}(-(\d{4}|present))?,?)+\r?\Z is satisfied at the end of a string, and also at the end of a string that ends with \n or \r\n. As a result, each element in the array matches the regular expression pattern.

[!code-csharpConceptual.RegEx.Language.Assertions#4] [!code-vbConceptual.RegEx.Language.Assertions#4]

End of String Only: \z

The \z anchor specifies that a match must occur at the end of the input string. Like the \Z language element, \z ignores the xref:System.Text.RegularExpressions.RegexOptions.Multiline?displayProperty=nameWithType option. Unlike the \Z language element, \z is not satisfied by a \n character at the end of a string. Therefore, it can only match the end of the input string.
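As a minimal sketch, the code below compares \Z and \z on the same hypothetical inputs. Only the input with no trailing line ending satisfies \z. The class and method names are illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EndOnlyAnchorSketch
{
    // True if the pattern finds a match anywhere in the input.
    public static bool Matches(string input, string pattern) =>
        Regex.IsMatch(input, pattern);

    public static void Main()
    {
        // Hypothetical inputs: no line ending, LF, and CR/LF.
        string[] inputs = { "2024", "2024\n", "2024\r\n" };

        foreach (string input in inputs)
        {
            bool upperZ = Matches(input, @"\d+\Z"); // tolerates a final \n
            bool lowerZ = Matches(input, @"\d+\z"); // requires the true end
            Console.WriteLine($"{upperZ} {lowerZ}");
        }
        // Output:
        // True True
        // True False
        // False False
    }
}
```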

The following example uses the \z anchor in a regular expression that is otherwise identical to the example in the previous section, which extracts information about the years during which some professional baseball teams existed. The example tries to match each of five elements in a string array with the regular expression pattern ^((\w+(\s?)){2,}),\s(\w+\s\w+),(\s\d{4}(-(\d{4}|present))?,?)+\r?\z. Two of the strings end with carriage return and line feed characters, one ends with a line feed character, and two end with neither a carriage return nor a line feed character. As the output shows, only the strings without a carriage return or line feed character match the pattern.

[!code-csharpConceptual.RegEx.Language.Assertions#5] [!code-vbConceptual.RegEx.Language.Assertions#5]

Contiguous Matches: \G

The \G anchor specifies that a match must occur at the point where the previous match ended, or if there was no previous match, at the position in the string where matching started. When you use this anchor with the xref:System.Text.RegularExpressions.Regex.Matches%2A?displayProperty=nameWithType or xref:System.Text.RegularExpressions.Match.NextMatch%2A?displayProperty=nameWithType method, it ensures that all matches are contiguous.

Tip

Typically, you place a \G anchor at the left end of your pattern. In the uncommon case you're performing a right-to-left search, place the \G anchor at the right end of your pattern.

The following example uses a regular expression to extract the names of rodent species from a comma-delimited string.

[!code-csharpConceptual.RegEx.Language.Assertions#6] [!code-vbConceptual.RegEx.Language.Assertions#6]

The regular expression \G(\w+\s?\w*),? is interpreted as shown in the following table.

| Pattern | Description |
|---------|-------------|
| `\G` | Begin where the last match ended. |
| `\w+` | Match one or more word characters. |
| `\s?` | Match zero or one space. |
| `\w*` | Match zero or more word characters. |
| `(\w+\s?\w*)` | Match one or more word characters followed by zero or one space, followed by zero or more word characters. This is the first capturing group. |
| `,?` | Match zero or one occurrence of a literal comma character. |
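The effect of \G is easiest to see when the input contains a break in the sequence. The sketch below runs the pattern with and without the anchor over a hypothetical list in which a semicolon and a space interrupt the comma-delimited run. The input and the helper name are assumptions made for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class ContiguousMatchSketch
{
    // Returns the first capturing group of every match.
    public static string[] Names(string input, string pattern) =>
        Regex.Matches(input, pattern)
             .Cast<Match>()
             .Select(m => m.Groups[1].Value)
             .ToArray();

    public static void Main()
    {
        // Hypothetical list: "; " breaks the contiguous run after "gopher".
        string input = "capybara,guinea pig,gopher; beaver,hamster";

        // With \G, matching stops at the first gap between matches.
        Console.WriteLine(string.Join("|", Names(input, @"\G(\w+\s?\w*),?")));
        // capybara|guinea pig|gopher

        // Without \G, the engine skips the gap and keeps matching.
        Console.WriteLine(string.Join("|", Names(input, @"(\w+\s?\w*),?")));
        // capybara|guinea pig|gopher|beaver|hamster
    }
}
```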

Word Boundary: \b

The \b anchor specifies that the match must occur on a boundary between a word character (the \w language element) and a non-word character (the \W language element). Word characters consist of alphanumeric characters and underscores; a non-word character is any character that is not alphanumeric or an underscore. (For more information, see Character Classes.) The match may also occur on a word boundary at the beginning or end of the string.

The \b anchor is frequently used to ensure that a subexpression matches an entire word instead of just the beginning or end of a word. The regular expression \bare\w*\b in the following example illustrates this usage. It matches any word that begins with the substring "are". The output from the example also illustrates that \b matches both the beginning and the end of the input string.

[!code-csharpConceptual.RegEx.Language.Assertions#7] [!code-vbConceptual.RegEx.Language.Assertions#7]

The regular expression pattern is interpreted as shown in the following table.

| Pattern | Description |
|---------|-------------|
| `\b` | Begin the match at a word boundary. |
| `are` | Match the substring "are". |
| `\w*` | Match zero or more word characters. |
| `\b` | End the match at a word boundary. |
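A short sketch of this pattern in use follows. The input sentence is hypothetical and was chosen so that one word contains "are" in its interior and so that matching words sit at both ends of the string; the class and method names are illustrative.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordBoundarySketch
{
    // Returns every whole word that begins with "are".
    public static string[] WordsStartingWithAre(string input) =>
        Regex.Matches(input, @"\bare\w*\b")
             .Cast<Match>()
             .Select(m => m.Value)
             .ToArray();

    public static void Main()
    {
        // "aware" contains "are" but does not start with it, so it is skipped.
        string input = "are they aware of the arena area";

        Console.WriteLine(string.Join("|", WordsStartingWithAre(input)));
        // are|arena|area
    }
}
```

The first and last matches show that the start and the end of the string count as word boundaries when a word character is adjacent to them.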

Non-Word Boundary: \B

The \B anchor specifies that the match must not occur on a word boundary. It is the opposite of the \b anchor.

The following example uses the \B anchor to locate occurrences of the substring "qu" in a word. The regular expression pattern \Bqu\w+ matches a substring that begins with a "qu" that does not start a word and that continues to the end of the word.

[!code-csharpConceptual.RegEx.Language.Assertions#8] [!code-vbConceptual.RegEx.Language.Assertions#8]

The regular expression pattern is interpreted as shown in the following table.

| Pattern | Description |
|---------|-------------|
| `\B` | Do not begin the match at a word boundary. |
| `qu` | Match the substring "qu". |
| `\w+` | Match one or more word characters. |
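The following sketch applies the pattern to a hypothetical sentence that mixes words starting with "qu" and words that contain it in the interior. Only the interior occurrences are reported. The names used here are illustrative.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class NonWordBoundarySketch
{
    // Returns each "qu..." fragment that begins inside a word.
    public static string[] InteriorQu(string input) =>
        Regex.Matches(input, @"\Bqu\w+")
             .Cast<Match>()
             .Select(m => m.Value)
             .ToArray();

    public static void Main()
    {
        // "quick" and "quota" start at a word boundary, so \B rejects them.
        string input = "quick squid equal quota sequel";

        Console.WriteLine(string.Join("|", InteriorQu(input)));
        // quid|qual|quel
    }
}
```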

See also