# Covered here

* [Overview](#Overview)
    * [Introduction](#Introduction)
    * [Raw strings](#Raw-strings)
    * [Eager searching](#How-regex-engines-work:-eager-searching)
    * [What's missing from `re`?](#What's-missing-from-re?)
    * [Use string methods if possible](#Use-string-methods-if-possible)
* [Metacharacters (overview)](#Metacharacters)
    * [Character sets](#Character-sets-aka-character-classes)
    * [The wildcard](#The-wildcard)
    * [Braces: character repetitions](#Braces:-character-repetitions)
    * [Kleene star & kleene plus](#Kleene-Star-&-Kleene-Plus)
    * [Optionality: `?`](#Optionality:-?)
    * [Start- and end-of-string anchors](#Start--and-end-of-string-anchors)
    * [Match groups](#Match-groups)
    * [Conditionality: pipe operator](#Conditionality:-pipe-operator)
* [What is greediness?](#What-is-greediness?)
* [Other syntax](#Other-syntax)
    * [Word boundaries](#Word-boundaries)
    * [Backreferences](#Backreferences)
    * [Zero-width assertions](#Zero-width-assertions)
    * [Lookaround](#Lookaround)
* [re-specific syntax & features](#re-specific-syntax-&-features)
    * [Module-level `re` functions](#Module-level-re-functions)
    * [search() vs. match()][1]
    * [Using `compile`](#Using-compile)
    * [`match` object](#match-objects)
    * [Flags](#Flags)
* [Cookbook](#Cookbook)
* [grep](#grep)

[1]: #search()-vs.-match()

# References & resources

* Regex tester: [regex101.com](https://regex101.com/)
* [Regular Expression HOWTO](https://docs.python.org/3/howto/regex.html)
 * An introductory tutorial to using regular expressions in Python with the re module. It provides a gentler introduction than the corresponding section in the Library Reference
* [RegexOne: Learn Regular Expressions with simple, interactive exercises](https://regexone.com/)
* [Regular-Expressions.info](http://www.regular-expressions.info) - online version of Jan Goyvaerts's book
* [tutorialspoint: Python Regular Expressions](https://www.tutorialspoint.com/python/python_reg_expressions.htm)
* [Regex cookbook](http://www.rexegg.com/regex-cookbook.html)
* [Python Cookbook Chapter 2: Strings and Text](http://chimera.labs.oreilly.com/books/1230000000393/ch02.html)
* [Regular Expressions Cookbook 2nd Edition Code Samples](http://www.regular-expressions-cookbook.com/Regex%20Cookbook%202%20Code%20Samples.html)
* The docs for [Lib/re.py](https://docs.python.org/3/library/re.html)
* Stack Overflow examples:
 * [Top regex users](https://stackoverflow.com/tags/regex/topusers)
 * [Reference - What does this regex mean?](https://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean)
 * [Learning Regular Expressions](https://stackoverflow.com/questions/4736/learning-regular-expressions)
* [grep](https://www.gnu.org/software/grep/manual/grep.html) - The utility from the UNIX world that first made regular expressions popular

From Jan Goyvaerts:
* [PowerGREP](https://www.powergrep.com/grep.html) - Next generation grep for Microsoft Windows
* [RegexBuddy](https://www.regexbuddy.com/)

# Overview

## Author's note

Regex is a huge subject and this guide is by no means exhaustive.  It may contain direct excerpts from the links above and I do not claim the work as my own.  Some examples may be sub-par/inefficient given that I put this guide together while learning the subject at the same time.

## Introduction

**Regular expressions** (called REs, regexps, regexes, or regex patterns) are essentially a tiny, highly specialized programming language.  A regex is a **special text string, described with a formal syntax, for describing a search pattern.**  The name comes from the mathematical theory on which they are based.

The concept is largely programming language-agnostic; regex is effectively a language in itself and:
* There are multiple regular expression _engines_.  An engine is a piece of software that can process the regular expression and then try to match its pattern to a given string.  Usually, the engine is part of a larger application and you do not access the engine directly.
* There are also a number of regular expression syntaxes or _flavors_.  The most popular is Perl 5.  Others are PCRE, .NET, and the Java JDK form.
* Finally, regexes can be implemented in a multitude of languages: Python, Perl, PHP, Java, .NET.  These languages and applications _support_ regular expressions, as do many text editors when it comes to searching.
 * This tutorial focuses on implementing regexes with Python--specifically with the `re` module.  The syntax used in Python’s `re` module is based on the (native) syntax used for regular expressions in Perl.  Technically, Python is the regex flavor supported by Python's `re` module.

----

**What is a string literal?**  This is an important first question to ask before understanding regular expressions.  
* A **literal** is a notation for representing a fixed value in source code.  In contrast to literals, _variables_ or _constants_ are symbols that can take on one of a class of fixed values.
* A **string literal** is a type of literal, where:
 * the literal is enclosed by quotations;
 * the quotes are not part of the value; 
 * **one must use a method such as escape sequences to avoid the problem of delimiter collision.**

In plain English: 
* String literals can be enclosed in matching single quotes (') or double quotes ("). 
* They can also be enclosed in matching groups of three single or double quotes (these are generally referred to as triple-quoted strings). 
* The backslash (\\) character is used to escape characters that otherwise have a special meaning, such as newline, backslash itself, or the quote character.

The most basic regular expression consists of a single literal character, e.g. `a`, which will match the first occurrence of that character in the string.

The canonical import is:

In [1]:
import re

## Raw strings

Forgetting about regex for a second, native Python `str`ings use the backslash to escape special characters.  (For instance, `s = 'He said, \'Forget about it.\''`.)  The fact that the backslash is a metacharacter and is used to escape other metacharacters in regex can often lead to clumsy and confusing scenarios where 3-4 backslashes are required in a Python regex.

The solution is to use **raw strings, which leave the backslash character uninterpreted** and help keep regular expressions "sane."  Without them, every backslash (`\`) in a regular expression would have to be prefixed with another one to escape it.

To illustrate:

In [2]:
s = r'(\d+)/(\d+)/(\d+)'  # Without `r`, the backslash would be interpreted as an escape
s2 = '(\\d+)/(\\d+)/(\\d+)'  # An additional backslash required here without `r`
print(s)
print(s == s2)  # These two regexes evaluate to equality
                # Note we haven't used `re` here at all!

(\d+)/(\d+)/(\d+)
True
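A small sketch of why the `r` prefix matters in practice, using `\b` as an assumed example (the variable names are illustrative only). In a normal string literal, `'\b'` is the single backspace character; in a raw string it reaches `re` as the two-character word-boundary token.

```python
import re

plain = '\bfoo'   # Python turns \b into a backspace (1 character) before re sees it
raw = r'\bfoo'    # backslash and 'b' are kept; re reads \b as a word boundary

plain_len = len(plain)   # 4: backspace + 'foo'
raw_len = len(raw)       # 5: '\', 'b', 'f', 'o', 'o'

plain_match = re.search(plain, 'foo bar')  # looks for a literal backspace, finds none
raw_match = re.search(raw, 'foo bar')      # finds 'foo' at the start of the string

print(plain_len, raw_len)
print(plain_match, raw_match.group())
```

The two patterns look identical on the page, which is why this kind of bug is easy to miss; defaulting to raw strings for every pattern avoids it.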


## How regex engines work: _eager searching_

A regex engine always returns the leftmost match, even if a "better" match could be found later.

![eagermatch.PNG](./imgs/eagermatch.PNG)

In other words, the  engine is "eager" to report a match.
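As a rough illustration of eagerness (the sample strings are made up for this sketch), the engine reports the first position where the pattern can match, and within an alternation it takes the first alternative that succeeds, even when a longer one is available:

```python
import re

text = 'He captured a catfish for his cat'

# The scan goes left to right and stops at the first position where the
# pattern matches, so 'cat' is found inside 'catfish', not as the final word.
first = re.search(r'cat', text)
first_span = first.span()

# Alternatives are tried in order; 'Get' succeeds first, so 'GetValue' is never tried.
short_first = re.search(r'Get|GetValue', 'GetValue').group()
long_first = re.search(r'GetValue|Get', 'GetValue').group()

print(first_span, short_first, long_first)
```

Putting the longer alternative first (or anchoring the pattern) is one way to get the longer match.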

A sidenote:
> There are two kinds of regular expression engines: _text-directed engines_, and _regex-directed engines._  All the regex flavors treated in this tutorial are based on regex-directed engines.

## What's missing from `re`?

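Before Python 3.11, the only significant features missing from Python's `re` were:
- [atomic grouping](https://www.regular-expressions.info/atomic.html),
- [possessive quantifiers](https://www.regular-expressions.info/possessive.html).

Both were added to `re` in Python 3.11 (`(?>...)` for atomic groups; `*+`, `++`, `?+`, and `{m,n}+` for possessive quantifiers).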

Prior to Python 3.3, `re` did not support any [Unicode regular expression tokens](https://www.regular-expressions.info/unicode.html).

## Use string methods if possible

Sometimes using the `re` module is a mistake. If you’re matching a fixed string, or a single character class, and you’re not using any re features such as the `IGNORECASE` flag, then the full power of regular expressions may not be required. 

Strings have several methods for performing operations with fixed strings and **they’re usually much faster**, because the implementation is a single small C loop that’s been optimized for the purpose, instead of the large, more generalized regular expression engine.
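A quick side-by-side sketch of fixed-string tasks done with string methods and with `re` (the sample `line` is an assumption for illustration). The results are the same; the string-method versions are simply more direct:

```python
import re

line = 'error: disk full'

# Prefix check
starts_plain = line.startswith('error')
starts_regex = re.match(r'error', line) is not None

# Substring check
contains_plain = 'disk' in line
contains_regex = re.search(r'disk', line) is not None

# Fixed-string replacement
replaced_plain = line.replace('full', 'ok')
replaced_regex = re.sub(r'full', 'ok', line)

print(starts_plain, contains_plain, replaced_plain)
```

Reach for `re` once the task needs something a fixed string cannot express, such as a character class, repetition, or case-insensitive matching.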

# Metacharacters

Most letters and characters will simply match themselves. For example, the regular expression `test` will match the string `test` exactly.

There are exceptions to this rule; some characters are special **metacharacters**, and don’t match themselves. Instead, they signal that some out-of-the-ordinary thing should be matched, or they affect other portions of the RE by repeating them or changing their meaning.  The full list:

`. ^ $ * + ? { } [ ] \ | ( )`

**To use any of these characters as a literal in a regex, you need to escape them with a backslash.** If you want to match `1+1=2`, the correct regex is `1\+1=2`.  **Said again, if a metacharacter is not prefaced by a backslash, it has a special meaning rather than being interpreted literally.**
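The `1+1=2` case can be checked directly. The sketch below (sample text assumed) also uses `re.escape()`, which adds the needed backslashes when the literal text is not known in advance:

```python
import re

text = 'Everyone knows that 1+1=2'

# Unescaped: '1+' means "one or more 1s", so the pattern needs something like '11=2'.
unescaped = re.search(r'1+1=2', text)

# Escaped: '\+' is a literal plus sign.
escaped = re.search(r'1\+1=2', text)

# re.escape() builds an escaped pattern from arbitrary literal text.
auto = re.search(re.escape('1+1=2'), text)

print(unescaped, escaped.group(), auto.group())
```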

Basic metacharacters (click on any symbol to go to detailed section):

| Metacharacter | Usage | Note |
|:-------------:|-------|------|
| [`.`](#The-wildcard) | Matches anything except a newline character | If you specify the flag `re.DOTALL`, then this metacharacter will match a newline as well.  Use the dot sparingly; it is very powerful but allows you to be lazy. |
| [`^`](#Start--and-end-of-string-anchors) | Matches the **start** of the string. | Except in character classes, where it indicates negation. |
| [`$`](#Start--and-end-of-string-anchors) | Matches the **end** of the string. | Or just before the newline at the end of the string. |
| [`*`](#Kleene-Star-&-Kleene-Plus) | Matches **0 or more** (greedy) repetitions of the preceding regex. | _Greedy_ means that it will match as many repetitions as possible. |
| [`+`](#Kleene-Star-&-Kleene-Plus) | Matches **1 or more** (greedy) repetitions of the preceding regex. |  |
| [`?`](#Optionality:-?) | Matches **0 or 1** (greedy) of the preceding regex. |  |
| [`{` and `}`](#Braces:-character-repetitions) | `{m,n}` matches from m to n repetitions of the preceding regex. | `{m,n}?` is the non-greedy version.  A single brace is treated literally. |
| [`[` and `]`](#Character-sets-aka-character-classes)  | Specifying a character class | A `^` as the first character indicates a **complementing set**. |
| [`(` and `)`](#Match-groups) | Matches the regex inside the parentheses. | The contents can be retrieved or matched later in the string. |
| `\` | Escape all metacharacters, **or** represents a predefined set of characters that are often useful, such as `\d` |  |
| [<code>&#124;</code>](#Conditionality:-pipe-operator) | 'or' operator: A<code>&#124;</code>B creates a regex that will match either A or B. |  |

These are all covered with examples in later sections.

Advanced metacharacters:

| Metacharacter | Usage | Note |
|:-------------:|-------|------|
| `(?:...)` | Non-capturing version of regular parentheses. | Matches the enclosed regex, but the matched text cannot be retrieved or referenced later. |

[UNFINISHED; see [docs](https://github.com/python/cpython/blob/3.6/Lib/re.py#L47)]

## Character sets aka character classes

With a "character class", also called a "character set", you can tell the regex engine to **match only one out of several characters.**  For example, the pattern `[abc]` will only match a single a, b, or c letter and nothing else.

In [15]:
test = {'match' : 'can man fan', 'skip' : 'dan ran pan'}
pat = '[cmf]an'
print('matched:', re.findall(pat, test['match']))
print('empty:', re.findall(pat, test['skip']))

matched: ['can', 'man', 'fan']
empty: []


### Character ranges

When using the square bracket notation, there is a shorthand for matching a character in a list of sequential characters by using the _dash_ to indicate a character range.  For example, the pattern `[0-6]` will only match a single digit character from zero to six, and nothing else. Likewise, `[^n-p]` will only match a single character other than the letters n to p.

Multiple character ranges can also be used in the same set of brackets, along with individual characters.  Keep in mind that patterns are case-sensitive.

In [11]:
test = {'match' : 'Ana Bob Cpc', 'skip' : 'aax bby ccz'}
pat = '[A-C][n-p][a-c]'
print('matched:', re.findall(pat, test['match']))
print('empty:', re.findall(pat, test['skip']))

matched: ['Ana', 'Bob', 'Cpc']
empty: []


Note that you can combine ranges and single characters:

In [12]:
test = {'match' : '0 1 2 a b c x A B X Z ZZ', 'skip' : 'aax bby ccz'}
pat = '[0-9a-fxA-FX]'  # Note that we could use special sequences here but choose not to for now
print('matched:', re.findall(pat, test['match']))

matched: ['0', '1', '2', 'a', 'b', 'c', 'x', 'A', 'B', 'X']


### Negated character classes

You can also indicate _inverse expressions_ (aka _complementing sets_, _negated classes_) with `^` (hat):

In [16]:
test = {'match' : 'hog dog', 'skip' : 'bog'}
pat = '[^b]og' # alternative: [hd]og
print('matched:', re.findall(pat, test['match']))
print('empty:', re.findall(pat, test['skip']))

matched: ['hog', 'dog']
empty: []


Keep in mind that **a negated character class still must match a character.**  `q[^u]` does _not_ mean "a _q_ not followed by a _u_." It means: "a _q_ followed by a character that is not a _u_."  Example:

In [20]:
test = {'match' : 'Iraq is a country', 'skip' : 'Iraq'}
pat = 'q[^u]'
print('matched:', re.findall(pat, test['match']))
print('empty:', re.findall(pat, test['skip']))  # Nothing follows q; this would require negative lookahead

matched: ['q ']
empty: []
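As a brief preview of the negative lookahead mentioned in the comment above (lookaround has its own section), here is a hedged sketch of the difference. `(?!u)` asserts that no `u` comes next without consuming a character, so it can succeed at the very end of the string:

```python
import re

# The negated class must consume a character, so a trailing q cannot match.
consuming = re.findall(r'q[^u]', 'Iraq')

# The lookahead only checks what follows; the match is just the 'q'.
lookahead_end = re.findall(r'q(?!u)', 'Iraq')
lookahead_mid = re.findall(r'q(?!u)', 'Iraq is a country')

# A q that IS followed by u is rejected by both patterns.
followed_by_u = re.findall(r'q(?!u)', 'quiet')

print(consuming, lookahead_end, lookahead_mid, followed_by_u)
```

Note that with the lookahead the space after `Iraq` is no longer part of the match, unlike the `'q '` result above.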


### Metacharacters in character classes

**The only special characters or metacharacters inside a character class are the closing bracket (`]`), the backslash (`\`), the caret (`^`) and the hyphen (`-`).** The usual metacharacters are normal characters inside a character class, and do not need to be escaped by a backslash.
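A short sketch to make this concrete (the test strings are invented for illustration): the usual metacharacters are matched literally inside brackets, while the four special ones need a backslash or a "safe" position.

```python
import re

# Inside a class, these usual metacharacters are plain characters.
symbols = re.findall(r'[.+*?$|()]', 'a.b+c*d?e$f|g(h)')

# The four class-level special characters: ']' and '\' are escaped here,
# '^' is literal because it is not first, '-' is literal because it is last.
specials = re.findall(r'[\]\\^-]', r'a]b\c^d-e')

print(symbols)
print(specials)
```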

### Special sequences

Above it is noted that `\` can be used as part of a representation for a set of often-used characters.  These are known as _special sequences_ or _shorthand character classes_ and they are given below.  (Note a [few](https://docs.python.org/3/library/re.html#regular-expression-syntax) are not covered here.)

| Special sequence | Meaning |
|:-------------:|------- |
| `\d` | Matches any **decimal digit**; this is equivalent to the class `[0-9]`. |
| `\D` | Matches any **non-digit** character; this is equivalent to the class `[^0-9]`.  This is the opposite of `\d`. |
| `\s` | Matches any **whitespace** character; this is equivalent to the class `[ \t\n\r\f\v]`. |
| `\S` | Matches any **non-whitespace** character; this is equivalent to the class `[^ \t\n\r\f\v]`.  This is the opposite of `\s`. |
| `\w` | Matches any **alphanumeric** character or underscore; this is equivalent to the class `[a-zA-Z0-9_]`. |
| `\W` | Matches any **non-alphanumeric** character; this is equivalent to the class `[^a-zA-Z0-9_]`.  In other words, any character which is not a word character. This is the opposite of `\w`. |
| `\b` | Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of word characters.  This means that `r'\bfoo\b'` matches `'foo'`, `'foo.'`, `'(foo)'`, `'bar foo baz'` but not `'foobar'` or `'foo3'`. |
| `\B` | Matches the empty string, but only when it is **not** at the beginning or end of a word. This means that `r'py\B'` matches `'python'`, `'py3'`, `'py2'`, but not `'py'`, `'py.'`, or `'py!'`.  Opposite of `\b`. |
| `\\` | Matches a literal backslash. |

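Note that the capital letter version is the negation of the lowercase version.  The bracketed equivalents above are exact for ASCII matching (the `re.ASCII` flag or `bytes` patterns); by default, `str` patterns in Python 3 also match the corresponding Unicode digits, whitespace, and word characters.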
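A few of these sequences in action, as a minimal sketch with made-up sample strings:

```python
import re

digits = re.findall(r'\d+', 'Order 66 shipped 2 items')

# \w includes the underscore but not the hyphen or comma.
words = re.findall(r'\w+', 'snake_case, kebab-case')

# \s+ treats any run of spaces, tabs, or newlines as one separator.
fields = re.split(r'\s+', 'a  b\tc\nd')

# \b keeps 'foo' from matching inside 'foobar' or 'foo3'.
whole_words = re.findall(r'\bfoo\b', 'foo foobar (foo) foo3')

print(digits, words, fields, whole_words)
```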

### Repeating character classes

If you repeat a character class by using the `?`, `*`, or `+` operators, you will repeat the entire character class, and not just the character that it matched.  For instance, the regex `[0-9]+` can match '837' as well as '222'.

In [19]:
test = {'match' : '837 1234 222 333', 'skip' : 'xxx'}
pat = '[0-9]+'
print('matched:', re.findall(pat, test['match']))
print('empty:', re.findall(pat, test['skip'])) 

matched: ['837', '1234', '222', '333']
empty: []


If you want to **repeat the matched character**, rather than the class, you will need to use backreferences.  (Covered later/TODO.)
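A small sketch of that distinction, ahead of the backreference section (the sample string is assumed). `([0-9])\1+` captures one digit and then requires that same digit again, one or more times:

```python
import re

text = '837 222 1233'

# Repeating the class: any run of digits.
class_runs = re.findall(r'[0-9]+', text)

# Repeating the matched character: runs of one identical digit.
# finditer + group(0) is used because findall would return only the captured digit.
same_digit_runs = [m.group(0) for m in re.finditer(r'([0-9])\1+', text)]

print(class_runs)
print(same_digit_runs)
```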

## The wildcard

The wildcard is represented by the `.` (dot) metacharacter and can match any single character (letter, digit, whitespace, and so on) except a newline.

In [6]:
import re
test = {'match' : 'cat. 896. ?=+.', 'skip' : 'abc1'}
pat = r'...\.'  # raw string, so that \. reaches re unchanged
print('matched:', re.findall(pat, test['match']))
print('empty:', re.findall(pat, test['skip']))

matched: ['cat.', '896.', '?=+.']
empty: []


If you specify the flag `re.DOTALL`, then this metacharacter will match a newline as well.

Use the dot sparingly; it is very powerful but allows you to be lazy.
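The effect of `re.DOTALL` is easiest to see on a string that contains a newline. A minimal sketch with an assumed two-line sample:

```python
import re

text = 'first line\nsecond line'

# By default the dot refuses to match the newline between the two lines.
default = re.findall(r'line.second', text)

# With re.DOTALL the dot matches the newline as well.
dotall = re.findall(r'line.second', text, re.DOTALL)

print(default)
print(dotall)
```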

## Braces: character repetitions

One way that we can specify the number of repetitions of characters to match is to explicitly spell out exactly how many characters we want, e.g. `\d\d\d`, which would match exactly three digits.  A more convenient way is to specify how many repetitions of each character we want using the **curly braces notation**. 

| Notation | Meaning |
|----------|---------|
| `a{m}` | Match `a` exactly `m` times. |
| `a{m,}` | Match `a` at least `m` times (no upper limit). |
| `a{m,n}` | Match `a` `m` through `n` times, _inclusive_.  Write it without a space after the comma. |
| `a{m,n}?` | Non-greedy version of the above. |

This quantifier can be used with any character, or special metacharacters, for example `w{3}` (three w's), `[wxy]{5}` (five characters, each of which can be a w, x, or y) and `.{2,6}` (between two and six of any character).

In [25]:
test = {'match' : 'wazzzzzup wazzzup', 'skip' : 'wazup'}
pat = 'waz{3,5}up'
print('matched:', re.findall(pat, test['match']))
print('empty:', re.findall(pat, test['skip']))

matched: ['wazzzzzup', 'wazzzup']
empty: []


In [24]:
test = {'match' : 'wazzzzzup wazzzup', 'skip' : 'wazup'}
pat = 'waz{3,}up'
print('matched:', re.findall(pat, test['match']))
print('empty:', re.findall(pat, test['skip']))

matched: ['wazzzzzup', 'wazzzup']
empty: []


Note that a single brace is treated literally.

Match a number between 1000 and 9999:

In [31]:
test = '5462'
pat = '^[1-9][0-9]{3}$'
print(re.findall(pat, test))

['5462']


Match a number between 100 and 99999:

In [30]:
test = '546'
pat = '^[1-9][0-9]{2,4}$'
print(re.findall(pat, test))

['546']


## Kleene Star & Kleene Plus

* Kleene Star (`*`): 0 or more (greedy) of the character or group it follows.
* Kleene Plus (`+`): 1 or more (greedy) of the character or group it follows.

In [None]:
test = {'match' : 'aaaabcc aabbbbc aacc', 'skip' : 'a'}
# at least two 'a's, zero or more 'b's, and at least one 'c' in each str to match
pat = 'aa+b*c+'
print('matched:', re.findall(pat, test['match']))
print('empty:', re.findall(pat, test['skip']))

## Optionality: `?`

The `?` metacharacter denotes optionality.  This metacharacter allows you to **match either zero or one** of the preceding character or group. For example, the pattern `ab?c` will match either the strings "abc" or "ac" because the _b_ is considered optional.

In [15]:
test = {'match' : '1 file found? 2 files found? 24 files found?', 'skip' : 'No files found.'}
pat = r'\d+ files? found\?'
print('matched:', re.findall(pat, test['match']))
print('empty:', re.findall(pat, test['skip']))

matched: ['1 file found?', '2 files found?', '24 files found?']
empty: []


In [16]:
test = 'color brown, colour blue'
pat = 'colou?r'
print('matched:', re.findall(pat, test))

matched: ['color', 'colour']


You can make several tokens optional by grouping them together using round brackets, and placing the
question mark after the closing bracket. E.g.:

In [17]:
test = 'Nov, November'
pat = 'Nov(?:ember)?'  # Remember that ?: is needed in Python to make parentheses non-capturing
print('matched:', re.findall(pat, test))

matched: ['Nov', 'November']


## Start- and end-of-string anchors

We can define a pattern that describes both the start and the end of the string using the special `^` (hat) and `$` (dollar sign) metacharacters.  Anchors do not match any character at all. Instead, they match a position before, after or between characters. They can be used to "anchor" the regex match at a certain position.

Note that this is different from the hat used inside a set of brackets `[^...]` for excluding characters, which can be confusing when reading regular expressions.

In [27]:
test = {'match' : 'Mission: successful', 'skip' : 'Last Mission: unsuccessful Next Mission: successful upon capture of target'}
pat = '^Mission: successful$'
print('matched:', re.findall(pat, test['match']))
print('empty:', re.findall(pat, test['skip']))

print()
test = {'match' : 'abc aaa', 'skip' : 'bac baa'}
pat = '^a'
print('matched:', re.findall(pat, test['match']))
print('empty:', re.findall(pat, test['skip']))

matched: ['Mission: successful']
empty: []

matched: ['a']
empty: []
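By default the anchors refer to the whole string, not to each line in it. The sketch below (the multi-line `log` is an assumed sample) contrasts the default behaviour with the `re.MULTILINE` flag, and shows the one default exception: `$` also matches just before a newline at the very end of the string.

```python
import re

log = 'Mission: successful\nMission: failed\nMission: successful'

# Default: ^ is the start of the string and $ the end, so a 3-line string cannot match.
whole_string = re.findall(r'^Mission: successful$', log)

# re.MULTILINE: ^ and $ also match at the start and end of every line.
per_line = re.findall(r'^Mission: successful$', log, re.MULTILINE)

# $ tolerates a single trailing newline even without any flag.
trailing_newline = re.search(r'successful$', 'Mission: successful\n') is not None

print(whole_string, per_line, trailing_newline)
```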


Match _only_ an integer:

In [37]:
test = {'match' : '8614', 'skip' : 'qsdf4ghjk'}
pat = r'^\d+$'
print('matched:', re.findall(pat, test['match']))
print('empty:', re.findall(pat, test['skip']))

matched: ['8614']
empty: []


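Match a syntactically well-formed opening HTML tag.  The pattern checks only the shape of the tag name, not whether it is a [_valid_ HTML5 tag](https://www.w3schools.com/tags/ref_html_dtd.asp); note that `<break>` matches below even though it is not an HTML element: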

In [22]:
test = {'match': '<script> <body> <break> <h1> <h2>', 'skip': '<1>, 1, <1h>'}
pat = re.compile("""<              # Literal '<' character
                    [A-Za-z]       # Any 1 letter
                    [A-Za-z0-9]*   # Any letter or digit, 0 or more
                    >              # Literal '>' character""", re.X)
print('matched:', pat.findall(test['match']))
print('empty:', pat.findall(test['skip']))

matched: ['<script>', '<body>', '<break>', '<h1>', '<h2>']
empty: []


## Match groups

Regular expressions allow us to not just match text but also to extract information for further processing. This is done by defining groups of characters and capturing them using the special parentheses `(` and `)` metacharacters. **Any subpattern inside a pair of parentheses will be captured as a group.**  This allows you to apply a regex operator, e.g. a repetition operator, to the entire group. In practice, this can be used to extract information like phone numbers or emails.

Imagine for example that you had a command line tool to list all the image files you have in the cloud. You could then use a pattern such as `^(IMG\d+\.PNG)$` to capture and extract the full filename, but if you only wanted to capture the filename without the extension, you could use the pattern `^(IMG\d+)\.PNG$` which only captures the part before the period.

In [39]:
test = {'match' : 'file_record_transcript.pdf', 'skip' : 'testfile_fake.pdf.tmp'}
pat = r'^(file.+)\.pdf$'
print('matched:', re.findall(pat, test['match']))
print('empty:', re.findall(pat, test['skip']))

matched: ['file_record_transcript']
empty: []


In [40]:
test = {'match' : 'file_07241999.pdf', 'skip' : 'testfile_fake.pdf.tmp'}
pat = r'^(file.+)\.pdf$'
print('matched:', re.findall(pat, test['match']))
print('empty:', re.findall(pat, test['skip']))

matched: ['file_07241999']
empty: []


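Below, capture groups (parentheses) allow the contents of each group to be extracted individually:

In [22]:
text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'  # sample input (assumed; `text` was not defined in the original cell)
datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
datepat.findall(text)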

[('11', '27', '2012'), ('3', '13', '2013')]

### Nested groups

Nested groups are read from left to right in the pattern, with the first capture group being the contents of the first parentheses group.

In [None]:
test = {'match' : 'Jan 1978 Mar 1979 Jul 2011', 'skip' : 'skip over 5000'}
pat = r'([A-Z][a-z]{2} (\d{4}))'
print('matched:', re.findall(pat, test['match']))
print('empty:', re.findall(pat, test['skip']))

All the quantifiers, including the star \*, plus +, repetition {m,n}, and the question mark ?, can be used within the capture group patterns. 

For example, if I knew that a phone number may or may not contain an area code, the right pattern would test for the existence of the whole group of digits `(\d{3})?` and not the individual characters themselves (which would be wrong).

Depending on the regular expression engine you are using, you can also use non-capturing groups which will allow you to match the group but not have it show up in the results.

In [None]:
test = {'match' : '1280x720 1920x1600 1024x768', 'skip' : 'skip over 5000'}
pat = r'(\d{4})x(\d+)'
print('matched:', re.findall(pat, test['match']))
print('empty:', re.findall(pat, test['skip']))
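The optional area code mentioned above can be sketched as follows. The phone format (`555-867-5309`) and the variable names are assumptions for illustration only. The first pattern makes the whole area-code group optional and captures it; the second uses a non-capturing group so the area code is matched but left out of the results.

```python
import re

capturing = re.compile(r'(\d{3}-)?(\d{3}-\d{4})')
non_capturing = re.compile(r'(?:\d{3}-)?(\d{3}-\d{4})')

# An optional capturing group that did not take part in the match is reported as None.
with_area = capturing.fullmatch('555-867-5309').groups()
without_area = capturing.fullmatch('867-5309').groups()

# The non-capturing group still has to match, but it produces no group of its own.
local_only = non_capturing.fullmatch('555-867-5309').groups()

print(with_area)
print(without_area)
print(local_only)
```

Applying `?` to the group as a whole is what makes the area code "all or nothing"; putting `?` on the individual digits would accept partial area codes.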

### Named capturing groups

See also the section on [backreferences](#Backreferences).

All modern regular expression engines support capturing groups, which are numbered from left to right, starting with one. The numbers can then be used in backreferences to match the same text again in the regular expression, or to use part of the regex match for further processing. In a complex regular expression with many capturing groups, the numbering can get a little confusing.  By assigning a name to a capturing group, you can easily reference it by name.  Syntax:

> `(?P<name>group)` captures the match of `group` into the backreference “name.”

Python's `sub()` function allows you to reference a group by number, as `\1` or `\g<1>`, or a named group by name, as `\g<name>`.

Python and PCRE treat named capturing groups just like unnamed capturing groups, and number both kinds from left to right, starting with one.  That is, you could technically name the group but then reference by number:

In [76]:
test = 'abcd'
pat = re.compile("""(a)       # 1st capture group, literal 'a'
                    (?P<x>b)  # 2nd capture group, literal 'b', named 'x'
                    (c)       # 3rd capture group, literal 'c'
                    (?P<y>d)  # 4th capture group, literal 'd', named 'y' """, re.X)
print(pat.sub(r'\1\2\3\4', test))  # Referencing numbers only
print(pat.sub(r'\1\g<x>\3\g<y>', test))  # Referencing group names

abcd
abcd
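Named groups are also available on the match object itself. A minimal sketch, assuming a date pattern with the group names `month`, `day`, and `year` (the names are chosen for this example):

```python
import re

datepat = re.compile(r'(?P<month>\d+)/(?P<day>\d+)/(?P<year>\d+)')
sentence = 'Today is 11/27/2012.'

m = datepat.search(sentence)
by_name = m.group('year')    # look the group up by name...
by_number = m.group(3)       # ...or by its position; both refer to the same group
as_dict = m.groupdict()      # all named groups at once

# \g<name> in the replacement text reorders the parts by name.
iso = datepat.sub(r'\g<year>-\g<month>-\g<day>', sentence)

print(by_name, by_number, as_dict)
print(iso)
```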


## Conditionality: pipe operator

When using groups, you can use the `|` (logical OR, aka. the pipe) to denote different possible sets of characters.

You can write the pattern "Buy more (milk|bread|juice)" to match only the strings Buy more milk, Buy more bread, or Buy more juice.

Like normal groups, you can use any sequence of characters or metacharacters in a condition, for example, `([cb]ats*|[dh]ogs?)` would match either (cats or bats), or, (dogs or hogs).

In [43]:
test = {'match' : 'I love cats I Love dogs', 'skip' : 'I love logs I love cogs'}
pat = 'I [Ll]ove cats|dogs'
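# Alternation has the lowest precedence: this reads as 'I [Ll]ove cats' OR 'dogs'.
# Use 'I [Ll]ove (?:cats|dogs)' to limit the alternation to the last word.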
print('matched:', re.findall(pat, test['match']))
print('empty:', re.findall(pat, test['skip']))

matched: ['I love cats', 'dogs']
empty: []


### Precedence

The alternation operator has the lowest precedence of all regex operators. That is, it tells the regex engine to
match either everything to the left of the vertical bar, or everything to the right of the vertical bar. If you want
to limit the reach of the alternation, you will need to use round brackets for grouping.
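A short sketch of the precedence rule, reusing the sentence from the earlier cell. Without grouping, the bar splits the entire pattern in two; a non-capturing group limits it to the last word.

```python
import re

text = 'I love cats I Love dogs'

# Read as: ('I [Ll]ove cats') OR ('dogs')
unscoped = re.findall(r'I [Ll]ove cats|dogs', text)

# Read as: 'I [Ll]ove ' followed by ('cats' OR 'dogs')
scoped = re.findall(r'I [Ll]ove (?:cats|dogs)', text)

print(unscoped)
print(scoped)
```

A non-capturing group is used here so that `findall` keeps returning the whole match rather than just the captured word.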

# What is greediness?

The question mark (optional) metacharacter is one metacharacter that is **greedy.**  The question mark gives the regex engine two choices: try to match the part the question mark applies to, or do not try to match it. The engine always tries to match that part. Only if this causes the entire regular expression to fail, will the engine try ignoring the part the question mark applies to.

## Greedy vs. lazy quantifiers

Greediness and laziness determine the order in which the regex engine tries the possible permutations of the regex pattern. 
- A greedy quantifier will first try to repeat the token as many times as possible, and gradually give up matches as the engine backtracks to find an overall match. 
- A lazy quantifier will first repeat the token as few times as required, and gradually expand the match as the engine backtracks through the regex to find an overall match.

Because greediness and laziness change the order in which permutations are tried, they can change the overall regex match. However, they do not change the fact that the regex engine will backtrack to try all possible permutations of the regular expression in case no match can be found.
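To see the difference on a small input, here is a hedged sketch with invented sample strings. The same pattern is run once with a greedy quantifier and once with its lazy counterpart:

```python
import re

text = 'say "hi" and "bye"'

# Greedy: .* runs to the end, then backs off only as far as the LAST quote.
greedy = re.findall(r'".*"', text)

# Lazy: .*? expands one character at a time and stops at the FIRST closing quote.
lazy = re.findall(r'".*?"', text)

# The same applies to counted repetition.
greedy_count = re.search(r'a{2,4}', 'aaaaa').group()
lazy_count = re.search(r'a{2,4}?', 'aaaaa').group()

print(greedy, lazy)
print(greedy_count, lazy_count)
```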

Possessive quantifiers are a way to prevent the regex engine from trying all permutations. This is primarily useful for performance reasons. You can also use possessive quantifiers to eliminate certain matches. [UNFINISHED]

In [37]:
test = 'Today is Feb 23rd, 2003'
pat = 'Feb 23(?:rd)?'

# Catches 'Feb 23rd' and not 'Feb 23'
print(re.findall(pat, test))

['Feb 23rd']


When repeating a regular expression, as in `a*`, the resulting action is to consume as much of the string as possible. This fact often bites you when you're trying to match a pair of balanced delimiters, such as the angle brackets surrounding an HTML tag. The naive pattern for matching a single HTML tag doesn't work because of the greedy nature of `.*`.  Here's an example as it relates to finding HTML tags:

In [245]:
test = 'This is a <EM>first</EM> test'
pat = '<.*>'  # Wrong; the star operator is greedy
              # Repeat preceding token as many times as possible
print('Wrong:', re.findall(pat, test))
print()

# Solution 1 - specific characters
pat = re.compile("""<              # Literal '<' character
                    [A-Za-z]       # Any 1 letter
                    [A-Za-z0-9]*   # Any letter or digit, 0 or more
                    >              # Literal '>' character""", re.X)  # Correct
print("OK (Sol'n 1):", re.findall(pat, test))

# Solution 2 - use the non-greedy quantifiers *?, +?, ??, or {m,n}?
# These match *as little* text as possible
pat = re.compile('<[^/].*?>')  # Explicitly excludes the closing tag
print("OK (Sol'n 2):", pat.findall(test))

Wrong: ['<EM>first</EM>']

OK (Sol'n 1): ['<EM>']
OK (Sol'n 2): ['<EM>']


# Other syntax

## Word boundaries

[UNFINISHED]

## Backreferences

Using grouping with parentheses creates a **backreference.**  A backreference _stores (only) the part of the string matched by the part of the regular expression inside the parentheses._

The way to avoid backreferences is to use **non-capturing parentheses.**  It is interesting to note that using backreferences actually _slows down_ the regex engine.

**With backreference:**

In [120]:
test = {'match': 'Set SetValue', 'skip': 'Value'}
pat = 'Set(Value)?'  # Value is optional, and is backreferenced
print('matched:', re.findall(pat, test['match']))  # First backreference is empty
print('empty:', re.findall(pat, test['skip']))

matched: ['', 'Value']
empty: []


**With non-capturing parentheses:**

In [121]:
pat = 'Set(?:Value)?'  
print('matched:', re.findall(pat, test['match']))
print('empty:', re.findall(pat, test['skip']))

matched: ['Set', 'SetValue']
empty: []


The question mark and the colon after the opening round bracket are the special syntax that you can use to tell the regex engine that this pair of brackets should not create a backreference. Note the question mark after the opening bracket is unrelated to the question mark at the end of the regex.

### Using backreferences

Backreferences allow you to reuse part of the regex match. You can reuse it inside the regular expression, or afterwards.  You can use the backreference in the replacement text during a search-and-replace operation by typing `\1` (backslash one) into the replacement text as a raw string.

In [13]:
test = 'EditPad Lite'
pat = 'EditPad (Lite|Pro)'
print(re.sub(pat, r'\1 version', test))  # raw str matters: in a normal string, '\1' would be the character chr(1)

Lite version
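Another sketch of backreferences in replacement text, this time reordering several groups (the sample sentence is assumed). It also shows the `\g<1>` form, which is needed when the reference is followed directly by a digit:

```python
import re

text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'

# Groups 1, 2, 3 are month, day, year; the replacement reorders them.
iso = re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text)

# r'\10' would be read as group 10, so \g<1> is used to append a literal 0.
times_ten = re.sub(r'(\d+)', r'\g<1>0', 'cost 5')

print(iso)
print(times_ten)
```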


Backreferences can not only be used after a match has been found, but also during the match. Suppose you want to match a pair of opening and closing HTML tags, and the text in between. By putting the opening tag into a backreference, we can reuse the name of the tag for the closing tag.

In [154]:
test = '<script type="text/javascript" src="../_static/jquery.js"></script>'
test2 = 'Testing <B><I>bold italic</I></B> text'
pat = re.compile(r"""<                    # Literal '<' character
                     (                    # Opening bracket for backreference
                     [A-Za-z]             # Any uppercase letter or lowercase letter
                     [A-Za-z0-9]*         # Any letter or digit, 0 or more.  Tags must start with letter
                     )                    # Closing bracket - end of first backreference
                     [^>]*                # 0 or more of some character other than '>'
                     >                    # Literal '>' character - close the tag
                     .*?                  # Lazily match any characters (the content between the tags)
                     </\1>                # Reuse the first backreference; note literal fwd slash '/' """, re.X)
print(pat.findall(test))
print(pat.findall(test2))

['script']
['B']


You can reuse backreferences more than once:

In [56]:
test = 'axaxa bxbxb cxcxc'
pat = r'([a-c])x\1x\1'
print('matched:', re.findall(pat, test))

matched: ['a', 'b', 'c']


The regex engine does not permanently substitute backreferences in the regular expression. It will use the last match saved into the backreference each time it needs to be used. If a new match is found by capturing parentheses, the previously saved match is overwritten. There is a clear difference between the two following regexes:

- `([abc]+)`: will match 'cab' and put 'cab' into the first backreference;
- `([abc])+`: will match 'cab' but put just 'b' into the first backreference.  The '+' causes the pair of parentheses to repeat 3 times.  The first time, 'c' was stored.  The second time 'a' and the third time 'b.' Each time, the previous value was overwritten, so 'b' remains.
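
The difference between these two patterns can be checked directly. This is a minimal sketch; the variable names `whole` and `last` are illustrative only:

```python
import re

# Quantifier inside the group: the group captures the whole run.
whole = re.search(r'([abc]+)', 'cab')

# Quantifier outside the group: the group is re-captured on every
# repetition, so only the text of the last repetition is kept.
last = re.search(r'([abc])+', 'cab')

print(whole.group(0), whole.group(1))  # cab cab
print(last.group(0), last.group(1))    # cab b
```

Both patterns match the same overall text; only the content of group 1 differs.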

One last comment: parentheses and backreferences cannot be used inside character classes.

## Zero-width assertions

Some metacharacters are _zero-width assertions_.  They don’t cause the engine to advance through the string; instead, they consume no characters at all, and simply succeed or fail. For example, `\b` is an assertion that the current position is located at a word boundary; the position isn’t changed by the `\b` at all. 
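
To make "consumes no characters" concrete, the sketch below inspects the span of a `\b` match. The sample text is made up for illustration:

```python
import re

text = 'hello world'

# \b succeeds at index 0 (the start of 'hello') without consuming anything,
# so the matched text is empty and the span has zero length.
m = re.search(r'\b', text)
print(repr(m.group()), m.span())  # '' (0, 0)

# All word-boundary positions: the start and end of each word.
boundaries = [b.start() for b in re.finditer(r'\b', text)]
print(boundaries)  # [0, 5, 6, 11]
```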

> --> This means that zero-width assertions should never be repeated, because if they match once at a given location, they can obviously be matched an infinite number of times.

## Lookaround

Another zero-width assertion is the lookahead assertion. Lookahead assertions are available in both positive and negative form:

| Type of lookahead | Syntax | Definition |
| ----------------- | ------ | ---------- |
| Positive lookahead assertion | `(?=...)` | This succeeds if the contained regular expression, represented here by `...`, successfully matches at the current location, and fails otherwise. But, once the contained expression has been tried, the matching engine doesn’t advance at all; the rest of the pattern is tried right where the assertion started. |
| Negative lookahead assertion | `(?!...)` | The opposite of the positive assertion; it succeeds if the contained expression doesn’t match at the current position in the string. |
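
As a quick illustration of the two forms, the sketch below picks out numbers depending on what follows them. The sample text and variable names are made up; the `\b` after `\d+` keeps the engine from backtracking into a shorter number to satisfy the negative lookahead:

```python
import re

text = '100 dollars 200 euros'

# Positive lookahead: a number only if ' dollars' follows it.
# The lookahead text itself is not part of the match.
followed = re.findall(r'\b\d+\b(?= dollars)', text)

# Negative lookahead: a number only if ' dollars' does NOT follow it.
not_followed = re.findall(r'\b\d+\b(?! dollars)', text)

print(followed)      # ['100']
print(not_followed)  # ['200']
```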

Consider a simple pattern to match a filename and split it apart into a base name and an extension, separated by a period.  (There's no lookaround here.)

In [177]:
test = ('foo.bar', 'autoexec.bat', 'sendmail.cf', 'printers.conf', 'sample.batch')
# pat = '.*[.].*$'
pat = re.compile(""".*   # 0 or more of any character
                    [.]  # literal period mark
                    .*   # 0 or more of any character
                    $    # Asserts position at end of string """, re.X)
for t in test:
    # print(re.search(pat, t))
    print(pat.search(t))

<_sre.SRE_Match object; span=(0, 7), match='foo.bar'>
<_sre.SRE_Match object; span=(0, 12), match='autoexec.bat'>
<_sre.SRE_Match object; span=(0, 11), match='sendmail.cf'>
<_sre.SRE_Match object; span=(0, 13), match='printers.conf'>
<_sre.SRE_Match object; span=(0, 12), match='sample.batch'>


Walkthrough:
- Notice that the period needs to be treated specially because it’s a metacharacter, so it’s inside a character class to only match that specific character. 
- Also notice the trailing `$`; this is added to ensure that all the rest of the string must be included in the extension.

But what if we wanted to match filenames where the extension is not `bat`?  In this case, a negative lookahead does the trick:

In [178]:
pat = re.compile('.*[.](?!bat$)[^.]*$')
for t in test:
    print(pat.search(t))  # Skips the second filename, autoexec.bat

<_sre.SRE_Match object; span=(0, 7), match='foo.bar'>
None
<_sre.SRE_Match object; span=(0, 11), match='sendmail.cf'>
<_sre.SRE_Match object; span=(0, 13), match='printers.conf'>
<_sre.SRE_Match object; span=(0, 12), match='sample.batch'>


Walkthrough:
- The negative lookahead means: if the expression `bat$` doesn’t match at this point, try the rest of the pattern; if `bat$` does match, the whole pattern will fail. 
- The trailing `$` inside the lookahead is required to ensure that something like "sample.batch", where the extension only starts with "bat", will be allowed. 
- The `[^.]*` makes sure that the pattern works when there are multiple dots in the filename.

Excluding another filename extension is now easy; simply add it as an alternative inside the assertion. The following pattern excludes filenames that end in either `bat` or `cf`:

In [180]:
pat = re.compile('.*[.](?!bat$|cf$)[^.]*$')
for t in test:
    print(pat.search(t)) 

<_sre.SRE_Match object; span=(0, 7), match='foo.bar'>
None
None
<_sre.SRE_Match object; span=(0, 13), match='printers.conf'>
<_sre.SRE_Match object; span=(0, 12), match='sample.batch'>


UNFINISHED; see HOWTO [Lookahead Assertions](https://docs.python.org/3.6/howto/regex.html#lookahead-assertions).

Let’s say we want to **find a word that is six letters long and contains the three adjacent letters “cat”.** Actually, we can match this without lookaround. We just specify all the options and lump them together using alternation:

`pat = 'cat\w{3}|\wcat\w{2}|\w{2}cat\w|\w{3}cat'`

But this method gets unwieldy if you want to find any word between 6 and 12 letters long containing either “cat”, “dog” or “mouse”.  In this case, **positive lookaheads** come to the rescue.  They allow you to satisfy 2 conditions simultaneously.

In this example, we basically have two requirements for a successful match. 
1. First, we want a word that is 6 letters long. (`\b\w{6}\b`)
2. Second, the word we found must contain the word “cat”. (`\b\w*cat\w*\b`)

Combining the two, we get:

In [145]:
test = 'catdog'
test2 = 'dogcat'
# TODO: Verbose mode will not work here.  See bug:
# https://bugs.python.org/issue15606
pat = re.compile(r'(?=\b\w{6}\b)\b\w*cat\w*\b')
print(pat.search(test))
print(pat.search(test2))

<_sre.SRE_Match object; span=(0, 6), match='catdog'>
<_sre.SRE_Match object; span=(0, 6), match='dogcat'>


How it works:
- At each character position in the string where the regex is attempted, the engine will first attempt the regex inside the positive lookahead. 
- This sub-regex, and therefore the lookahead, matches only when the current character position in the string is at the start of a 6-letter word in the string. 
- If not, the lookahead will fail, and the engine will continue trying the regex from the start at the next character position in the string.
- The lookahead is zero-width. So when the regex inside the lookahead has found the 6-letter word, the current position in the string is still at the beginning of the 6-letter word. The regex engine then attempts the remainder of the regex at this position.
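
The harder case mentioned earlier (a word of 6 to 12 letters containing "cat", "dog", or "mouse") follows the same recipe: one lookahead for the length, then the normal pattern for the content. A sketch with a made-up test string:

```python
import re

text = 'cat catalog doghouse dog mousetrap concatenation'

# Condition 1 (lookahead): we are at the start of a word of 6 to 12 letters.
# Condition 2 (main pattern): that word contains 'cat', 'dog', or 'mouse'.
# The alternation is non-capturing so findall() returns the whole word.
pat = re.compile(r'(?=\b\w{6,12}\b)\b\w*(?:cat|dog|mouse)\w*\b')

found = pat.findall(text)
print(found)  # ['catalog', 'doghouse', 'mousetrap']
```

Here "cat" and "dog" are too short, and "concatenation" (13 letters) is too long, so the lookahead rejects them before the main pattern is tried.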

## If-Then-Else Conditionals in Regexes

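A special construct, `(?ifthen|else)`, allows you to create conditional regular expressions. In Python's `re` module, the `if` part must be a reference to a capture group by number or name, written `(?(id/name)then|else)`.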
- If the `if` part evaluates to True, then the regex engine will attempt to match the `then` part. 
    - Otherwise, the `else` part is attempted instead. 
- The syntax consists of a pair of parentheses. 
    - The opening bracket must be followed by a question mark, immediately followed by the `if` part, immediately followed by the `then` part. 
    - This part can be followed by a vertical bar and the `else` part. You may omit the `else` part, and the vertical bar with it.
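
Python's `re` supports the group-reference form of this construct, `(?(id/name)yes-pattern|no-pattern)`, where the condition is whether a given group took part in the match. The sketch below uses the deliberately simplistic email pattern from the `re` documentation; the `results` dictionary is only there for illustration:

```python
import re

# Group 1 is the optional '<'. The conditional (?(1)>|$) means:
# if group 1 took part in the match, require '>'; otherwise require end of string.
pat = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)')

candidates = ['<user@host.com>', 'user@host.com', '<user@host.com', 'user@host.com>']
results = {s: bool(pat.match(s)) for s in candidates}

for s, ok in results.items():
    print(s, '->', ok)
```

Only the first two candidates match: the angle brackets must either both be present or both be absent.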

In [150]:
# TODO: are these supported in Python?
#pat = re.compile('a?b(?ac|d)')
#test = {'match': 'bd abc abd', 'skip': 'bc'}  # should match just 'bd' in 'abd'
#print('matched:', re.findall(pat, test['match']))
#print('empty:', re.findall(pat, test['skip']))

In [113]:
# TODO: unfinished
test = 'To: BradSolomon@gmail.com xxx'
pat = r'^((From|To)|Subject): ((?(2)\w+@\w+\.[a-z]+|.+))'
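pat = re.compile(r"""^                                 # Matches the start of the string.
                     ((From|To)|Subject):[ ]           # Group 1: header name; group 2 is set only for From/To
                     ((?(2)\w+@\w+\.[a-z]+|.+))        # Group 3: an email address if group 2 matched, else anything """, re.X)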
# print(re.search(pat, test))
print(pat.search(test))

<_sre.SRE_Match object; span=(0, 25), match='To: BradSolomon@gmail.com'>


# `re`-specific syntax & features

## Module-level `re` functions

The `re` module exports the following functions:

    `match`     Match a regular expression pattern to the beginning of a string.
    `fullmatch` Match a regular expression pattern to all of a string.
    `search`    Search a string for the presence of a pattern.
    `sub`       Substitute occurrences of a pattern found in a string.
    `subn`      Same as sub, but also return the number of substitutions made.
    `split`     Split a string by the occurrences of a pattern.
    `findall`   Find all occurrences of a pattern in a string.
    `finditer`  Return an iterator yielding a match object for each match.
    `compile`   Compile a pattern into a RegexObject.
    `purge`     Clear the regular expression cache.
    `escape`    Backslash all non-alphanumerics in a string.

These functions can be separated into three main groups:
- `match` through `finditer` perform the actual searching/matching/finding/replacing.
- `compile` creates a regular expression object (a compiled pattern), discussed below.
- `purge` and `escape` are the odds-and-ends and are infrequently used.

### `search()` vs. `match()`

* `re.match()` checks for a match only at the beginning of the string.
* `re.search()` checks for a match anywhere in the string.  (Mnemonic: "search everywhere")
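
A short sketch of the difference, using a made-up sentence:

```python
import re

text = 'the cat sat'

at_start = re.match(r'cat', text)    # None: 'cat' is not at index 0
anywhere = re.search(r'cat', text)   # finds 'cat' at span (4, 7)
anchored = re.search(r'^cat', text)  # None: '^' pins search() to the start

print(at_start)
print(anywhere.span())
print(anchored)
```

In other words, `re.match(p, s)` behaves like `re.search` with the pattern anchored at the start of the string (ignoring `re.MULTILINE`, under which `^` also matches after each newline).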

## Using `compile`

`compile` is somewhat separate from the other functions above in that it creates a [regular expression object](https://docs.python.org/3/library/re.html#regular-expression-objects).  Compiled regular expression objects support methods that are similar to the module-level functions shown above.  That is, **`re.compile(regex).search(subject)` is equivalent to `re.search(regex, subject)`.**  For example,

In [153]:
test = 'The cat in the hat'

pat1 = '[ch]at'
pat2 = re.compile(pat1)

print(re.search(pat1, test))  # the module-level function
print(pat2.search(test))      # the method version

<_sre.SRE_Match object; span=(4, 7), match='cat'>
<_sre.SRE_Match object; span=(4, 7), match='cat'>


Under the hood, the module-level functions simply create a pattern object for you and call the appropriate method on it. They also store the compiled object in a cache, so future calls using the same RE won’t need to parse the pattern again and again.

### Why compile?

If you’re going to perform a lot of matches using the same pattern, it usually pays to precompile the regular expression pattern into a pattern object first.

* The technical reason is that module-level functions maintain a cache of compiled expressions. However, the size of the cache is limited, and using compiled expressions directly avoids the cache lookup overhead. 
* Another advantage of using compiled expressions is that by precompiling all expressions when the module is loaded, the compilation work is shifted to application start time, instead of to a point when the program may be responding to a user action.
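
One common way to apply this is to compile patterns once at module level and reuse them inside functions. The names `WORD_RE` and `count_words` below are hypothetical, chosen only for this sketch:

```python
import re

# Compiled once, when the module is imported.
WORD_RE = re.compile(r'\b\w+\b')

def count_words(line):
    """Count the words in a line using the precompiled pattern."""
    return len(WORD_RE.findall(line))

lines = ['The cat in the hat', 'sat on the mat', '']
counts = [count_words(line) for line in lines]
print(counts)  # [5, 4, 0]
```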

## `match` objects

A [match object](https://docs.python.org/3/library/re.html#match-objects) contains information about the match: where it starts and ends, the substring it matched, and more.

The following functions and methods return match objects (in the case of "success"):

- `search()`
- `match()`
- `fullmatch()`
- `finditer()` - returns an iterator yielding match objects

The methods and attributes of a match object are detailed in the above link; here's a summary:

    `expand`         Return the string obtained by doing backslash substitution on the passed template.
    `group`          Return one or more subgroups of the match.
    `__getitem__`    m[g] is identical to m.group(g); allows m[0], m[1], ...
    `groups`         Return a tuple containing all the subgroups of the match.
    `groupdict`      Return a dictionary containing all the named subgroups of the match, keyed by subgroup name.
    `start`          Return the index of the start of the substring matched by the given group.
    `end`            Return the index of the end of the substring matched by the given group.
    `span`           Return the 2-tuple (m.start(group), m.end(group)) for a match m.
    `pos`            The value of `pos` which was passed to the search() or match() method of a regex object.
    `endpos`         The value of `endpos` which was passed to the search() or match() method of a regex object.
    `lastindex`      The integer index of the last matched capturing group.  None if no group was matched.
    `lastgroup`      Name of the last matched capturing group.
    `re`             The regular expression object whose match() or search() method produced this match instance.
    `string`         The string passed to match() or search().
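
A few of these in action, as a small sketch. The date pattern, group names, and sample string are made up for illustration:

```python
import re

m = re.search(r'(?P<month>\d+)/(?P<day>\d+)/(?P<year>\d+)', 'Due 11/27/2012.')

print(m.group())                     # '11/27/2012' (the whole match)
print(m.group('year'), m[2])         # '2012' and '27' (by name, by index)
print(m.groups())                    # ('11', '27', '2012')
print(m.groupdict())                 # {'month': '11', 'day': '27', 'year': '2012'}
print(m.span())                      # (4, 14)
print(m.start('day'), m.end('day'))  # 7 9
print(m.lastindex, m.lastgroup)      # 3 year
```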

Example: the `start()` and `end()` methods give the indexes into the string showing where the text matched by the pattern occurs.

In [4]:
pattern = 'this'
text = 'Does this text match the pattern?'
match = re.search(pattern, text)
s = match.start()
e = match.end()
print('Found "%s"\nin "%s"\nfrom %d to %d ("%s")' % \
      (match.re.pattern, match.string, s, e, text[s:e]))

Found "this"
in "Does this text match the pattern?"
from 5 to 9 ("this")


You could use these two methods to remove a substring:

In [152]:
email = "tony@tiremove_thisger.net"
m = re.search("remove_this", email)  # Match Object
print(email[:m.start()] + email[m.end():])  # remove remove_this from email addresses

tony@tiger.net


One important feature is that `match` objects always have a boolean value of `True`.  (Or are truthy.) Since `match()` and `search()` return `None` when there is no match, you can test whether there was a match with a simple `if` statement:

In [19]:
text1 = '11/27/2012'
datepat = re.compile(r'\d+/\d+/\d+')  # Creates a compiled pattern object
bool(datepat.match(text1))

True

In [6]:
regexes = [re.compile(p) for p in ['this', 'that']]
text = 'Does this text match the pattern?'
for regex in regexes:
    print('Seeking "%s" ->' % regex.pattern)
    if regex.search(text):
        print('match!')
    else:
        print('no match')

Seeking "this" ->
match!
Seeking "that" ->
no match


Find some additional good examples of match object methods in the docs link above.

## Flags

Some functions in `re` have a `flags` parameter that contains general instructions for pattern matching.

In [7]:
# re.search(pattern, string, flags=0)
# returns a `Match` object or None if nothing found
re.search("^a", "Abc", re.IGNORECASE) # or: re.I

<_sre.SRE_Match object; span=(0, 1), match='A'>

Another flag is [`DOTALL`](https://docs.python.org/3/library/re.html#re.DOTALL):

> Make the `'.'` special character match any character at all, including a newline; without this flag, `'.'` will match     anything _except_ a newline.

Complete list of flags:

![reflags.PNG](./imgs/reflags.PNG)
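
As an example of the `DOTALL` flag described above, this sketch runs the same pattern with and without it on a made-up two-line string:

```python
import re

text = 'first line\nsecond line'

# Without DOTALL, '.' cannot cross the newline, so there is no match.
default = re.search(r'first.*second', text)

# With DOTALL (or re.S), '.' also matches '\n'.
dotall = re.search(r'first.*second', text, re.DOTALL)

print(default)               # None
print(repr(dotall.group()))  # 'first line\nsecond'
```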

# Cookbook

## Grouping field info

Frequently you need to obtain more information than just whether the RE matched or not. Regular expressions are often used to dissect strings by writing a RE divided into several subgroups which match different components of interest. For example, an RFC-822 header line is divided into a header name and a value, separated by a `':'`, like this:

In [224]:
from tabulate import tabulate

header = """
From: author@example.com
User-Agent: Thunderbird 1.5.0.9 (X11/20061227)
MIME-Version: 1.0
To: editor@example.com
"""

pat = re.compile(r'(?P<field>[A-Z]\w+):\s(?P<value>.+)\n')
groups = pat.findall(header)
print(tabulate(groups, tablefmt='grid'))

+---------+------------------------------------+
| From    | author@example.com                 |
+---------+------------------------------------+
| Agent   | Thunderbird 1.5.0.9 (X11/20061227) |
+---------+------------------------------------+
| Version | 1.0                                |
+---------+------------------------------------+
| To      | editor@example.com                 |
+---------+------------------------------------+

Note that `\w` does not match a hyphen, so `User-Agent` and `MIME-Version` are captured only as `Agent` and `Version`. Using `[\w-]+` in place of `\w+` for the field name would capture the full header names.


## Is a poker hand legitimate?

Suppose you are writing a poker program where a player’s hand is represented as a 5-character string with each character representing a card, “a” for ace, “k” for king, “q” for queen, “j” for jack, “t” for 10, and “2” through “9” representing the card with that value.  **To see if a given string is a valid hand**, one could do the following:

In [156]:
def displaymatch(match):
    """Helper function to gracefully display match objects."""
    if not match:
        return None
    return '<Match: %r, groups=%r>' % (match.group(), match.groups())

valid = re.compile(r"^[a2-9tjqk]{5}$")

# Test 4 hands
print(displaymatch(valid.match("akt5q")))  # Valid.
print(displaymatch(valid.match("727ak")))  # Valid
print(displaymatch(valid.match("akt5e")))  # Invalid.
print(displaymatch(valid.match("akt")))    # Invalid.

<Match: 'akt5q', groups=()>
<Match: '727ak', groups=()>
None
None


## Does a poker hand contain a pair?

A hand above, "727ak", contained a pair, or two of the same valued cards. To match this with a regular expression, one could use backreferences as such:

In [159]:
pair = re.compile(r".*(.).*\1")

print(displaymatch(pair.match("717ak")))
print(displaymatch(pair.match("718ak")))  # No pairs
print(displaymatch(pair.match("354aa")))
print(displaymatch(pair.match("33322")))  # Full house
print(displaymatch(pair.match("33522")))  # Two pair

<Match: '717', groups=('7',)>
None
<Match: '354aa', groups=('a',)>
<Match: '33322', groups=('2',)>
<Match: '33522', groups=('2',)>


## Text munging/scrambling

This example demonstrates using sub() with a function to “munge” text, or randomize the order of all the characters in each word of a sentence **except for the first and last characters**.

In [165]:
import random

def repl(m):
    inner_word = list(m.group(2))
    # print(inner_word)
    random.shuffle(inner_word)
    return m.group(1) + "".join(inner_word) + m.group(3)

text = "Professor Abdolmalek, please report your absences promptly."

# `repl` param can be a string or a function!
print(re.sub(r"(\w)(\w+)(\w)", repl, text))
print(re.sub(r"(\w)(\w+)(\w)", repl, text))

Psosfreor Adbmolleak, peslae rrpoet your abneecss plomtpry.
Psfoersor Aeldbmaolk, psaele rperot your abnecses pomprtly.


## Checking for repeated words

In [66]:
test = 'I accidentally typed the the word twice.'
pat = re.compile(r"""\b     # Word boundary
                     (\w+)  # 1 or more word characters; \w is equivalent to [a-zA-Z0-9_] for ASCII text
                     \s+    # 1 or more whitespace
                     \1     # Repeat backreference (the repeated word)
                     \b     # Word boundary """, re.X) 
pat.sub(r'\1', test)

'I accidentally typed the word twice.'

## Getting HTML tags
Anything between the tags is captured into the first backreference. The question mark in the regex makes the star lazy, to make sure it stops before the first closing tag rather than before the last, like a greedy star would do.

In [39]:
s = '<script type="text/javascript" src="../_static/jquery.js"></script>'
pat = r'<script\b[^>]*>(.*?)</script>'

print(re.findall(pat, s))  # [''] because this <script> element has no content

['']


## Splitting strings with multiple & varying delimiters

In [227]:
# some text with multiple & varying delimiters
line = 'asdf fjdk; afed, fjek,asdf,     foo'

# delims below: 
# {space}, {semicolon-space}, {comma-space}, {comma}, {comma-multiple spaces}
re.split(r'[;,\s]\s*', line)

['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']

Breaking down the above:
* The metacharacters `[]` specify a **character class**, which is a set of characters that you wish to match. 
* Characters can be listed individually, or a range of characters can be indicated by giving two characters and separating them by a `'-'`. 
 * For example, `[abc]` will match any of the characters a, b, or c; this is the same as `[a-c]`, which uses a range to express the same set of characters.
 * Note that most metacharacters are **not** active inside classes; for example, `[.*]` matches a literal period or asterisk.
* The `\s*` following the brackets matches any whitespace that trails the initial separator, so it is consumed as part of the delimiter.
* The `*` specifies that the preceding element (here `\s`) can be matched zero or more times, instead of exactly once.

Consider the above without `\s*` at the end:

In [39]:
re.split(r'[;,\s]', line)

['asdf', 'fjdk', '', 'afed', '', 'fjek', 'asdf', '', '', '', '', '', 'foo']

Using parentheses specifies a _capture group_; the matched text is also included in the result:

In [40]:
fields = re.split(r'(;|,|\s)\s*', line)
print(fields)

['asdf', ' ', 'fjdk', ';', 'afed', ',', 'fjek', ',', 'asdf', ',', 'foo']


You could then slice these into values and delimiters, and use them to re-form an output string:

In [11]:
values = fields[::2]; values

['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']

In [13]:
delimiters = fields[1::2] + ['']; delimiters

[' ', ';', ',', ',', ',', '']
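
With the values and delimiters separated, the pieces can be zipped back together. A sketch that repeats the split so it runs on its own; `rebuilt` is an illustrative name:

```python
import re

line = 'asdf fjdk; afed, fjek,asdf,     foo'
fields = re.split(r'(;|,|\s)\s*', line)

values = fields[::2]
delimiters = fields[1::2] + ['']  # pad so both lists have the same length

# Re-join each value with its original delimiter. Whitespace that trailed
# a delimiter was discarded by the split, so the spacing is normalized.
rebuilt = ''.join(v + d for v, d in zip(values, delimiters))
print(rebuilt)  # asdf fjdk;afed,fjek,asdf,foo
```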

## Match dates specified as digits, such as "11/27/2012"

* Another repeating metacharacter is `+`, which matches one or more times.
 * Pay careful attention to the difference between `*` and `+`; `*` matches zero or more times, so whatever’s being repeated may not be present at all, while `+` requires at least one occurrence.
* `/` is just a literal.

In [16]:
text1 = '11/27/2012'
text2 = 'Nov 27, 2012'

# Simple matching: \d+ means match one or more digits
bool(re.match(r'\d+/\d+/\d+', text1))

True

## Using the digits metacharacter

In [23]:
test = {'match' : 'abc789xyz define "123" var g = 456;', 'skip' : 'word 12 1'}
pat = r'\d\d\d'  # raw string so \d reaches the regex engine untouched
print('matched:', re.findall(pat, test['match']))
print('empty:', re.findall(pat, test['skip']))

matched: ['789', '123', '456']
empty: []


## Matching differently formatted numbers

In [51]:
test = {'match' : '3.14529 -255.38 128 1.9e10 123,340.00', 'skip' : '720p'}
# match a string that starts with an optional negative sign, one or more digits, 
# optionally followed by a comma and more digits, followed by an optional 
# fractional component which consists of a period, one or more digits, and 
# another optional component, the exponent followed by more digits
pat = r'-?\d+.+[^a-z]$' # or, applied per token: ^-?\d+(,\d+)*(\.\d+(e\d+)?)?$
print('matched:', re.findall(pat, test['match']))
print('empty:', re.findall(pat, test['skip']))

matched: ['3.14529 -255.38 128 1.9e10 123,340.00']
empty: []
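
The loose pattern above matches the whole test string as a single hit. To check each number on its own, one option is to split the text into tokens and apply the stricter, anchored pattern from the comment to each. A sketch; the `number`, `tokens`, and `valid` names are illustrative:

```python
import re

# Optional sign, digits, optional comma groups, optional fraction,
# optional exponent after the fraction.
number = re.compile(r'^-?\d+(,\d+)*(\.\d+(e\d+)?)?$')

tokens = '3.14529 -255.38 128 1.9e10 123,340.00 720p'.split()
valid = [t for t in tokens if number.match(t)]

print(valid)  # ['3.14529', '-255.38', '128', '1.9e10', '123,340.00']
```

As written, the pattern only allows an exponent after a fractional part, so a token such as `1e10` would be rejected.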


## Matching phone number area codes

In [50]:
test = {'match' : '415-555-1234 650-555-2345 (416)555-3456 202 555 4567 4035555678 1 416 555 9292', 'skip' : '720p'}
# The pattern breaks down into the country code '1?', the captured area code '\(?(\d{3})\)?', 
# and the rest of the digits '\d{3}' and '\d{4}' respectively. We use '[\s-]?' to 
# catch the space or dashes between each component
pat = r'1?[\s-]?\(?(\d{3})\)?[\s-]?\d{3}[\s-]?\d{4}'
print('matched:', re.findall(pat, test['match']))
print('empty:', re.findall(pat, test['skip']))

matched: ['415', '650', '416', '202', '403', '416']
empty: []


## Making a phonebook

In [56]:
from pandas import DataFrame  # needed for the table at the end of this cell

text = """Ross McFluff: 834.345.1254 155 Elm Street

Ronald Heathmore: 892.345.3428 436 Finley Avenue
Frank Burger: 925.541.7625 662 South Dogwood Way


Heather Albrecht: 548.326.4584 919 Park Place"""

# Convert the string into a list with each nonempty line having its own entry
entries = re.split("\n+", text)

# Split each entry into a list with first name, last name, telephone number, 
# and address.  The :? pattern matches the colon after the last name, so 
# that it does not occur in the result list. With a maxsplit of 4, we could 
# separate the house number from the street name:
entries = [re.split(":? ", entry, 4) for entry in entries]
entries = DataFrame(entries, columns=['first', 'last', 'phone', 'streetnum',
                                      'street'])
print(entries)

     first       last         phone streetnum             street
0     Ross    McFluff  834.345.1254       155         Elm Street
1   Ronald  Heathmore  892.345.3428       436      Finley Avenue
2    Frank     Burger  925.541.7625       662  South Dogwood Way
3  Heather   Albrecht  548.326.4584       919         Park Place


## Finding all adverbs

In [59]:
text = "He was carefully disguised but captured quickly by police."
re.findall(r"\w+ly", text)  # raw string avoids an invalid-escape warning for \w

['carefully', 'quickly']

## Multiple conditions

Word must start with capital C, can contain /, and must end with 1 or 2 digits:

In [76]:
matches = ['C/12', 'C/6', 'C12', 'C6']
skips = ['c6', 'c123', 'c12D', 'C12D']
pat = r'^C/?\d{1,2}$'  # '/' needs no escaping in a Python regex
print('matched:', [re.findall(pat, match) for match in matches])
print('empty:', [re.findall(pat, skip) for skip in skips])

matched: [['C/12'], ['C/6'], ['C12'], ['C6']]
empty: [[], [], [], []]


# grep

See Goyvaerts - https://www.regular-expressions.info/grep.html.