Short Notes on Regular Expressions
The first thing to recognize when using regular expressions is that everything is essentially a character.Patterns are written to match a specific sequence of characters. Mostly normal ASCII is used, but unicode characters can also be used to match international text. Patterns created for regex are case-sensitive.
Forward slash(\) will be used to create metacharacters. Metacharacters are characters that have special meaning for regex engine.
\d can be used to replace any digit from 0 to 9. All these texts will match this simple expression, as these sequences of characters have digits:
ABc234adf
var g=0
As Joker in a deck of cards can represent any card, dot(.) can be used to match any character, be it alphanumeric, symbol or whitespace. This metacharacter ovverrides period. To match a period, it needs to be escaped using a slash \.
Specific character can be matched by defining them inside square brackets[]. E.g. [abc]. This will match only a, b or c and nothing else.
match can
match man
match fan
skip dan
skip ran
skip pan
Corresponding regex: [cmf]a
Specific characters can be excluded using square brackets and hat(^). E.g. [^abc] will match any character except letters a,b, or c.
match hog
match dog
skip bog
Corresponding regex: [^b]o
If a set of sequential characters are to be denoted, inside square brackets, character range can be defined. E.g. [a-z] this will include any lowercase letter from a to z. Suppose you need to exclude a range of characters: E.g. [^n-p]. Multiple character ranges can also in included in a single bracket set. [A-Z0-9a-z] This includes all alphanumeric characters.
Curly braces{} are used to denote how many repetitions of a character is required. E.g. a{3} This denotes a repeated thrice. A range of repetitions can also be declared. E.g. a{1,3} meaning a will be matched atleast once and no more than 3 times. This notation can be used with any character or metacharacters.
match wazzzzzup
match wazzzup
skip wazup
Corresponding regex: z{3,6}
This is a powerful concept in regular expressions with the ability to match an arbitrary number of charcaters. Let's understand the two terms through examples.
Kleene star: Meaning 0 or more.
\d* => match any number of digits in the sequence, maybe zero.
Kleene plus: Meaning 1 or more.
\d+ => match any number of digits in the sequence, atleast one.
These can be used with any character and metacharacters.
match aaaabcc
match aabbbbc
match aacc
skip a
Corresponding regex: c+
The metacharacter question mark(?) denotes optionality. This metacharacter matches either zero or one of the preceding character or group. For example, the pattern ab?c will match either the strings "abc" or "ac" because the b is considered optional. Similar to the dot metacharacter, the question mark is a special character and it has to be escaped using a slash ? to match a plain question mark character in a string.
match 1 file found?
match 2 files found?
match 24 files found?
skip No files found.
Corresponding regex: \d+ files? found\?
Whitespaces include space(_), tab(\t), newline(\n) and carriage return(\r). Apart from these metacharacters, \s covers all whitespaces.
match 1. abc
match 2. abc
match 3. abc
skip 4.abc
Corresponding regex: \d\.\s+abc
It is best practice to write as specific regular expressions as possible to ensure that false positivesdo not creep in. E.g. search for 'success' in a file also taking into account 'Error: unsuccessful attempt'. To tighten patterns, (^)hat and ($)dollar signs are used to mark the start and end of a line. Note: This hat sign is different from the one used earlier in this tutorial to exclude characters.
match Mission: successful
skip Last Mission: unsuccessful
skip Next Mission: successful upon capture of target
Corresponding regex: ^Mission: successful$
Regular expressions allow information extraction for further processing. This is done by defining groups of characters and capturing them using the special parentheses ( and ) metacharacters. Any subpattern inside a pair of parentheses will be captured as a group. For example, ^(IMG\d+.png)$ will capture and extract the full image filename, but if extension is not required, the pattern will be ^(IMG\d+).png$ which only captures the part before the period.
capture file_record_transcript.pdf -> file_record_transcript
capture file_07241999.pdf -> file_07241999
skip testfile_fake.pdf.tmp
Corresponding regex: ^(file.+)\.pdf$
Nested groups can be used to extract multiple layers of information. Using previous example,the filename and the picture number both can be extracted using the same pattern by writing an expression like ^(IMG(\d+)).png$. The nested groups are read from left to right in the pattern, with the first capture group being the contents of the first parentheses group, etc.
capture Jan 1987 -> Jan 1987 1987
capture May 1969 -> May 1969 1969
capture Aug 2011 -> Aug 2011 2011
Corresponding regex: (\w+\s(\d+))
The | (logical OR, aka. the pipe) is used to denote different possible sets of characters. Example, "Buy more (milk|bread|juice)" will match only the strings Buy more milk, Buy more bread, or Buy more juice.
match I love cats
match I love dogs
skip I love logs
skip I love cogs
Corresponding regex: I love (cats|dogs)
Back referencing varies depending on the implementation. However, many systems allow to reference captured groups by using \0 (usually the full matched text), \1 (group 1), \2 (group 2), etc. For example, "\2-\1" to put the second captured data first, and the first captured data second. Additionally, there is a special metacharacter \b which matches the boundary between a word and a non-word character. It's most useful in capturing entire words (for example by using the pattern \w+\b).
abc… | Letters |
123… | Digits |
\d | Any Digit |
\D | Any Non-digit character |
. | Any Character |
\. | Period |
[abc] | Only a, b, or c |
[^abc] | Not a, b, nor c |
[a-z] | Characters a to z |
[0-9] | Numbers 0 to 9 |
\w | Any Alphanumeric character |
\W | Any Non-alphanumeric character |
{m} | m Repetitions |
{m,n} | m to n Repetitions |
* | Zero or more repetitions |
+ | One or more repetitions |
? | Optional character |
\s | Any Whitespace |
\S | Any Non-whitespace character |
^…$ | Starts and ends |
(…) | Capture Group |
(a(bc)) | Capture Sub-group |
(.*) | Capture all |
(abc|def) | Matches abc or def |