# Regular Expressions 
## MDS/DH

* The Swiss Army knife/chainsaw of text processing.

* A pattern-matching language that is now largely standardized across high-level programming languages (Perl, Python, JavaScript, etc.), although individual flavors still differ in some details.

* “Some people, when confronted with a problem, think ‘I know, I'll use regular expressions.’  Now they have two problems.” - Jamie Zawinski

* Don’t use regexes to parse properly structured text, like XML; use a dedicated parser for that format instead.

* See [Regex tutorial — A quick cheatsheet by examples](https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285) 

## Basic character matching

* Letters and numbers match themselves: `bcd` matches "a<u>bcd</u>e".

* There is a flag to make matching case-insensitive.

* Dot means match any single character (by default, any character except a newline): `b...f` matches "a<u>bcdef</u>g".

* Square brackets mean match a character from this group: `[bcd][bcd][bcd]` matches "a<u>bcd</u>e" and "a<u>dcb</u>e".

* Backslash toggles specialness: it makes a special character literal, and it gives some ordinary letters a special meaning. `\\` matches backslash itself; `\.` matches a full-stop.

* `\d`, `\s` and `\w` mean match any digit, whitespace character and word character, respectively.

* In simple ASCII text:
    * `\d` effectively means `[0-9]`
    * `\s` effectively means `[ \t\n\r\f\v]`
    * `\w` means `[a-zA-Z0-9_]`.
    * But Unicode is different: in Python 3, for example, these classes also match non-ASCII digits, whitespace and letters unless the ASCII flag is set.

* `\D`, `\S`, and `\W` are the inverse: any non-digit, any non-whitespace and any non-word character, respectively.
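The matching rules above can be tried out directly. The following is a minimal sketch using Python's standard `re` module; the variable names are only illustrative, and the sample strings are the ones used in the bullets (plus one Arabic-Indic digit chosen here to show the Unicode difference).

```python
import re

# re.search() scans the string and returns the first match, or None.
literal = re.search(r"bcd", "abcde").group()                  # "bcd"
dotted = re.search(r"b...f", "abcdefg").group()               # "bcdef"
char_class = re.search(r"[bcd][bcd][bcd]", "adcbe").group()   # "dcb"

# An escaped dot matches only a literal full stop, not any character.
full_stop = re.search(r"\.", "a.b").group()                   # "."

# re.IGNORECASE is Python's flag for case-insensitive matching.
ignore_case = re.search(r"bcd", "ABCDE", re.IGNORECASE).group()  # "BCD"

# For str patterns, \d is Unicode-aware; re.ASCII restricts it to [0-9].
arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE
unicode_digit = re.search(r"\d", arabic_three) is not None            # True
ascii_digit = re.search(r"\d", arabic_three, re.ASCII) is not None    # False

# \D, \S and \W are the inverses: here, every non-digit is removed.
digits_only = re.sub(r"\D", "", "tel: 555-0199")              # "5550199"
```

Writing patterns as raw strings (`r"..."`) keeps Python from interpreting the backslashes before the regex engine sees them.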

## Anchors

* Anchors do not match any *characters*; they match at a position in the string (it's as if they match *between* characters).

* `^` matches position at beginning of the string (or line): `^bcd` matches "<u>bcd</u>ef", but not "abcdef".

* `$` matches position at end of string (or line): `^bcd$` matches the string "<u>bcd</u>" and nothing else.

* `\b` matches at a word boundary: a position with a word character (`\w`) on one side and a non-word character, or the start or end of the string, on the other. `\bbcd` matches "aaa <u>bcd</u>e" but not "abcde".

* `\B` is the inverse, matching at any position that is not a word boundary: `\Bbcd` matches "a<u>bcd</u>e".

* By default, `^` and `$` match at the start/end of the whole string. A flag (`re.MULTILINE` in Python) switches them to match at the start/end of each line as well, for multi-line strings.

* A special use of `^` is at the start of `[]`, where it is not an anchor but inverts the contents of the group: `[^0-4]` means match any character that is not 0, 1, 2, 3, or 4.

* Remember: anchors do not themselves gobble up any characters in the string!
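As a rough illustration of how anchors behave, here is a small sketch using Python's `re` module. The sample strings and variable names are chosen for this example only; `re.MULTILINE` is the Python name for the multi-line flag.

```python
import re

text = "bcd\nbcdef"

# By default, ^ and $ anchor to the whole string, so nothing matches here.
whole_string = re.findall(r"^bcd$", text)                 # []

# With re.MULTILINE they also match at the start and end of each line.
per_line = re.findall(r"^bcd$", text, re.MULTILINE)       # ["bcd"]

# \b needs a word character on one side and a non-word character
# (or the edge of the string) on the other.
after_space = re.search(r"\bbcd", "aaa bcde") is not None   # True
after_paren = re.search(r"\bbcd", "(bcd)") is not None      # True
inside_word = re.search(r"\bbcd", "abcde") is not None      # False

# \B matches where \b does not, e.g. between "a" and "b" in "abcde".
not_boundary = re.search(r"\Bbcd", "abcde").start()         # 1

# Anchors are zero-width: the match consumes no characters.
zero_width = re.search(r"^", "bcd").span()                  # (0, 0)

# Inside [], a leading ^ negates the set instead of anchoring.
not_0_to_4 = re.findall(r"[^0-4]", "0123456")               # ["5", "6"]
```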

## Quantifiers

* `*` means the preceding element of the pattern can appear zero or more times:

    * `bc*d` matches "a<u>bcccd</u>efg"
    * `bc*d` matches "a<u>bd</u>efg"
    * `b.*f` matches "a<u>bcdef</u>g"
    * `b.*f` matches "a<u>bf</u>g"

* `+` means the preceding element of the pattern can appear one or more times (but has to appear at least once):
    * `bc+d` matches "a<u>bcd</u>e"
    * `bc+d` matches "a<u>bccccd</u>e" 
    * `bc+d` does **not** match "abde"
    * `b.+f` matches "a<u>bcdef</u>g"
    * `b.+f` does **not** match "aabfegg"

* Question mark means the preceding element can appear zero or one times but no more, making the element optional: 
    * `bc?d` matches "a<u>bcd</u>e" and "a<u>bd</u>e"
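The quantifier examples above can be checked with a short script. This is only a sketch: `first_match` is a hypothetical helper written for this example, not part of the `re` module.

```python
import re

def first_match(pattern, text):
    """Return the text of the first match, or None if there is no match."""
    m = re.search(pattern, text)
    return m.group() if m else None

# * allows zero or more repetitions of the preceding element.
star_many = first_match(r"bc*d", "abcccdefg")   # "bcccd"
star_zero = first_match(r"bc*d", "abdefg")      # "bd"

# + requires at least one repetition.
plus_many = first_match(r"bc+d", "abccccde")    # "bccccd"
plus_none = first_match(r"bc+d", "abde")        # None
dot_plus_none = first_match(r"b.+f", "aabfegg") # None: nothing between b and f

# ? makes the preceding element optional (zero or one).
opt_present = first_match(r"bc?d", "abcde")     # "bcd"
opt_absent = first_match(r"bc?d", "abde")       # "bd"
opt_too_many = first_match(r"bc?d", "abccde")   # None: two c's is one too many
```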

## Greedy and lazy quantifiers

* Confusingly, the `?` also has another function in relation to quantifiers: coming right after `*` or `+`, it changes the behavior of both.

* Normally, `*` and `+` are *greedy*: they match as many characters as they possibly can.

* The `?` switches off greedy matching and makes `*?` and `+?` do lazy matching instead: they match as few characters as they can while still matching:

    * `b.*c` matches "a<u>bcdeabcdeabc</u>de" (greedy) 
    * `b.*?c` matches "a<u>bc</u>deabcdeabcde" (lazy)
    * `b.+c` matches "a<u>bcdeabcdeabc</u>de" (greedy)
    * `b.+?c` matches "a<u>bcdeabc</u>deabcde" (lazy)

* Instead of "zero or more" or "one or more", you can specify an exact count or a range of counts for the preceding item by using curly braces (these are also greedy unless followed by `?`):
    * `{3}` means match exactly three of the preceding item
    * `{3,}` is three or more
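    * `{,5}` is up to five (zero to five) in Python; some other flavors, such as JavaScript, require the explicit form `{0,5}`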
    * `{3,5}` is three to five
    * `bc{3,5}d` matches "a<u>bccccd</u>e" but not "abcde" or "abccccccde".
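The difference between greedy and lazy matching, and the counted quantifiers, can be seen side by side in the following sketch. It reuses the sample strings from the bullets above; the variable names are illustrative.

```python
import re

text = "abcdeabcdeabcde"

# Greedy: .* and .+ run as far as they can, then back off to the last "c".
greedy_star = re.search(r"b.*c", text).group()    # "bcdeabcdeabc"
greedy_plus = re.search(r"b.+c", text).group()    # "bcdeabcdeabc"

# Lazy: .*? and .+? stop at the first "c" that lets the pattern succeed.
lazy_star = re.search(r"b.*?c", text).group()     # "bc"
lazy_plus = re.search(r"b.+?c", text).group()     # "bcdeabc" (.+? needs one char)

# Counted repetition: between three and five c's.
four_cs = re.search(r"bc{3,5}d", "abccccde").group()   # "bccccd"
one_c = re.search(r"bc{3,5}d", "abcde")                # None (too few)
six_cs = re.search(r"bc{3,5}d", "abccccccde")          # None (too many)
```

Note that `b.+?c` cannot return "bc": the `+` still requires at least one character between the "b" and the "c", so the first possible match ends at the second "c".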

## Grouping and Capturing

* Those quantifiers apply to the previous element, which can be a character, but can also be a more complex sub-expression.

* If you want a quantifier to apply not just to one element, but to a series of elements, you can do this by grouping with parentheses: 
    * `b(cd)+e` matches "a<u>bcde</u>f" and "a<u>bcdcde</u>f" and so on. 

* You can use the vertical bar to indicate alternative possibilities, which is particularly useful with parens:
    * `a(bc|de)f` matches "<u>abcf</u>" and "<u>adef</u>" but not "abcdef".

* Parens serve a double purpose: in addition to grouping things, things in parentheses are captured (remembered) for later use:
    * `b(c*)d` matches "a<u>bcccd</u>e" and remembers the "ccc" for later reference.

* Using parens is typically how you pull particular sub-strings out of a longer piece of text. In Python, a match object's `groups()` method returns the text captured by each set of parens, in order, and `re.findall` returns the captured groups for every match in the string.

* If you want to use the grouping aspect of parens *without* capturing and remembering the contents, you can use `(?:)` to group without capturing: 
    * `a(?:bc|de)+f` matches "<u>abcdef</u>" but does not remember the "bcde".
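A brief sketch of grouping and capturing in Python follows. The page-range string is made up for this example; `group()`, `groups()` and `re.findall` are standard `re` functionality.

```python
import re

# group(0) is the whole match; group(1) is the first set of parens.
m = re.search(r"b(c*)d", "abcccde")
whole = m.group(0)       # "bcccd"
captured = m.group(1)    # "ccc"
all_groups = m.groups()  # ("ccc",)

# With several groups, findall returns one tuple of captures per match.
ranges = re.findall(r"(\d+)-(\d+)", "pages 10-12 and 40-45")
# [("10", "12"), ("40", "45")]

# Alternation inside a group: either "bc" or "de" between a and f.
alt_ok = re.search(r"a(bc|de)f", "adef").group(1)        # "de"
alt_fail = re.search(r"a(bc|de)f", "abcdef")             # None

# A non-capturing group still groups, but records nothing.
nc = re.search(r"a(?:bc|de)+f", "abcdef")
nc_whole = nc.group()     # "abcdef"
nc_groups = nc.groups()   # ()

# A repeated capturing group keeps only its last repetition.
last_only = re.search(r"a(bc|de)+f", "abcdef").group(1)  # "de"
```

The last line is a common surprise: to capture everything a repeated group matched, wrap the repetition itself in parens, as in `a((?:bc|de)+)f`.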

## Advanced topics

* There are other aspects to regex syntax, such as look-ahead, look-behind and referring back to the contents of previous parens, etc., but it's best to get lots of practice with simpler regexes before trying to use those.
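For orientation only, here is a minimal sketch of what those three features look like in Python. The sample strings are invented for this example, and nothing here is needed for the simpler patterns above.

```python
import re

# Back-reference: \1 must match the same text that group 1 captured,
# which makes it easy to find accidentally doubled words.
doubled = re.findall(r"\b(\w+) \1\b", "it was the the best of of times")
# ["the", "of"]

# Look-ahead (?=...): require a following comma without consuming it.
before_comma = re.findall(r"\w+(?=,)", "red, green, blue")
# ["red", "green"]

# Look-behind (?<=...): require a preceding dollar sign without consuming it.
after_dollar = re.findall(r"(?<=\$)\d+", "cost $15, not 20")
# ["15"]
```

Like anchors, look-ahead and look-behind are zero-width: they test the surrounding text but do not become part of the match.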