The [re module](https://docs.python.org/3/library/re.html) provides regular expression matching operations in Python.

In [1]:
import re

In [2]:
help(re)

Help on module re:

NAME
    re - Support for regular expressions (RE).

MODULE REFERENCE
    https://docs.python.org/3.7/library/re
    
    The following documentation is automatically generated from the Python
    source files.  It may be incomplete, incorrect or include features that
    are considered implementation detail and may vary between Python
    implementations.  When in doubt, consult the module reference at the
    location listed above.

DESCRIPTION
    This module provides regular expression matching operations similar to
    those found in Perl.  It supports both 8-bit and Unicode strings; both
    the pattern and the strings being processed can contain null bytes and
    characters outside the US ASCII range.
    
    Regular expressions can contain both special and ordinary characters.
    Most ordinary characters, like "A", "a", or "0", are the simplest
    regular expressions; they simply match themselves.  You can
    concatenate ordinary characters, so last mat

In [3]:
# Text to parse
strings = ['abc123',
        'this is number one two three "123"',
        'my house is = 123;']

In [4]:
help(re.search)

Help on function search in module re:

search(pattern, string, flags=0)
    Scan through string looking for a match to the pattern, returning
    a Match object, or None if no match was found.



In [5]:
for string in strings:
    print(re.search('abc', string))

<re.Match object; span=(0, 3), match='abc'>
None
None


Only the first string has the pattern 'abc'

Now, let's look for the pattern '1':

In [6]:
for string in strings:
    print(re.search('1', string))

<re.Match object; span=(3, 4), match='1'>
<re.Match object; span=(30, 31), match='1'>
<re.Match object; span=(14, 15), match='1'>


All the strings contain the pattern '1'
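
The object printed above is a Match object, and its pieces can be read directly. The following is a minimal sketch, assuming the third sample string from the list; `group()`, `span()`, `start()` and `end()` are standard Match methods, while the variable names are only illustrative.

```python
import re

# Sample text taken from the list used above
text = 'my house is = 123;'

match = re.search('1', text)
if match is not None:
    print(match.group())                     # the matched text: 1
    print(match.span())                      # (start, end) indices: (14, 15)
    print(text[match.start():match.end()])   # slicing with the indices gives 1
else:
    print('no match')

# re.search returns None when the pattern is absent
missing = re.search('abc', text)
print(missing)
```

Since `None` is falsy and a Match object is truthy, `if re.search(...)` is a common way to test whether a pattern occurs at all.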

## `\d` Digits

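Now it's time to introduce regex metacharacters. The first metacharacter we introduce is `\d`.

`\d` matches any single digit from 0 to 9 (for `str` patterns it also matches decimal digits from other scripts, unless the `re.ASCII` flag is used). Let's use it instead of '1' in the above example: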

In [7]:
# r'...' is a raw string literal (explained in the wildcard section below)
for string in strings:
    print(re.search(r'\d', string))

<re.Match object; span=(3, 4), match='1'>
<re.Match object; span=(30, 31), match='1'>
<re.Match object; span=(14, 15), match='1'>
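
`re.search` stops at the first digit it finds, which is why every result above is '1'. As a small sketch of what `\d` can match in total, the example below uses `re.findall`, which is not introduced in the text above but is part of the re module; the variable names are illustrative.

```python
import re

# Collect every single-digit match instead of only the first one
digits_found = re.findall(r'\d', 'abc123')
print(digits_found)  # ['1', '2', '3']

# A string without digits yields no match at all
no_digits = re.search(r'\d', 'house of cards')
print(no_digits)  # None
```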


## `.` Wildcard

The `.` is a wildcard that matches any single character except a newline. It can be:

- a letter (abc...)
- a digit (0123456789)
- a special character (`!@%^&*()`)
- a whitespace character (` `)

To match a literal dot `.`, escape it with a backslash: `\.`.

Below are some strings. Some of them contain a dot `.` and others do not.

First, let's create a pattern to match all of them using the wildcard `.`.

Afterward, we'll create a pattern that matches only the strings containing a dot. To do this, we need to escape the dot, using `\.` instead of `.`.

In [8]:
strings = ['My name is Mike.',
           'abc.',
           '1000.00',
           'house of cards']

# match all strings using the wildcard
for string in strings:
    print(re.search('.', string))

<re.Match object; span=(0, 1), match='M'>
<re.Match object; span=(0, 1), match='a'>
<re.Match object; span=(0, 1), match='1'>
<re.Match object; span=(0, 1), match='h'>
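
The one character the wildcard skips is the newline. The sketch below checks this; the `re.DOTALL` flag is not covered in the text above, but it is a standard re flag that makes `.` match newlines as well.

```python
import re

# A string holding only a newline gives no match for the wildcard
newline_only = re.search('.', '\n')
print(newline_only)  # None

# A single space is matched like any other character
space_match = re.search('.', ' ')
print(space_match.span())  # (0, 1)

# With re.DOTALL the wildcard also matches the newline
dotall_match = re.search('.', '\n', flags=re.DOTALL)
print(dotall_match.span())  # (0, 1)
```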


In [13]:
# match only those that contain a dot, escaping the dot with a backslash
for string in strings:
    print(re.search(r'\.', string))


<re.Match object; span=(15, 16), match='.'>
<re.Match object; span=(3, 4), match='.'>
<re.Match object; span=(4, 5), match='.'>
None


Notice that I used `r'\.'` instead of just `'\.'`.

This is because Python has a list of [valid escape sequences](https://docs.python.org/2.0/ref/strings.html) for string literals, and a backslash followed by anything else is not a valid escape sequence.

From the [re documentation](https://docs.python.org/3/library/re.html):

> "any invalid escape sequences in Python’s usage of the backslash in string literals now generate a DeprecationWarning and in the future this will become a SyntaxError. This behaviour will happen even if it is a valid escape sequence for a regular expression."

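The `r` prefix before the search pattern makes it a raw string literal. This is a feature of Python string literals rather than of the re module: in a raw string Python leaves backslashes untouched, so the pattern reaches the re module exactly as written.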
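
To see what the raw string prefix changes, the sketch below compares a raw literal with its doubled-backslash equivalent. The variable names are illustrative only.

```python
import re

raw_pattern = r'\.'       # raw string: backslash kept as-is
escaped_pattern = '\\.'   # regular string: backslash escaped by hand

# Both literals produce the same two-character string
print(raw_pattern == escaped_pattern)  # True
print(len(raw_pattern))                # 2

# In a regular string '\n' is one newline character; in a raw string it is two characters
print(len('\n'), len(r'\n'))  # 1 2

dot_match = re.search(raw_pattern, '1000.00')
print(dot_match.span())  # (4, 5)
```

Writing every pattern as a raw string avoids having to remember which backslash sequences Python itself would interpret.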

## `[]` Square Brackets: Character class

Square brackets define a character class, which matches any one of several alternative characters.

For example, `[ab]` will match either `a` or `b`:

In [24]:
# Text to parse
strings = ['abc', # match? Yes, contains a
           '123ab', # Yes, contains a
           'a', # Yes, contains a
           'b', # Yes, contains b
           'ab', # Yes, contains a
           'c', # No, doesn't contain a or b
           'cd', # No, doesn't contain a or b
           '123'] # No, doesn't contain a or b

In [23]:
for string in strings:
    print(re.search('[ab]', string))

<re.Match object; span=(0, 1), match='a'>
<re.Match object; span=(3, 4), match='a'>
<re.Match object; span=(0, 1), match='a'>
<re.Match object; span=(0, 1), match='b'>
<re.Match object; span=(0, 1), match='a'>
None
None
None
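
Because a Match object is truthy and `None` is falsy, the same search can be used to filter a list. This is a minimal sketch using the sample strings from above; the name `matching` is illustrative.

```python
import re

strings = ['abc', '123ab', 'a', 'b', 'ab', 'c', 'cd', '123']

# Keep only the strings that contain at least one 'a' or 'b'
matching = [s for s in strings if re.search('[ab]', s)]
print(matching)  # ['abc', '123ab', 'a', 'b', 'ab']
```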


## `[^ ]` Hat (caret) inside square brackets: Complementing Set

When `^` is the first character inside square brackets, it indicates a complementing set: the class matches any character that is not listed.

For example, `[^ab]` will match a string only if it contains at least one character that is not `a` or `b`.

In [39]:
# Text to parse
strings = ['a',
           'b',
           'c', # match, contains c
           'ab',
           'abc', # match, contains c, note that it also contains 'a' and 'b'
           'ac', # match, contains c
           'c', # match, contains c
           '123', # match, contains 1
           '123a'] # match, contains 1, note that it also contains 'a'

for string in strings:
    print(re.search('[^ab]', string))

None
None
<re.Match object; span=(0, 1), match='c'>
None
<re.Match object; span=(2, 3), match='c'>
<re.Match object; span=(1, 2), match='c'>
<re.Match object; span=(0, 1), match='c'>
<re.Match object; span=(0, 1), match='1'>
<re.Match object; span=(0, 1), match='1'>
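
One way to read `[^ab]` is "the string has at least one character outside the set {a, b}". The sketch below compares the regex result with that plain-Python check; the variable names and the sample list are illustrative.

```python
import re

strings = ['a', 'b', 'c', 'ab', 'abc', 'ac', '123', '123a']

results = []
for s in strings:
    regex_says = re.search('[^ab]', s) is not None
    # Equivalent check without regex: any character that is not 'a' or 'b'
    plain_says = any(ch not in 'ab' for ch in s)
    results.append((s, regex_says, plain_says))
    print(s, regex_says, plain_says)
```

Both columns agree for every string, so only 'a', 'b' and 'ab' fail to match.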


Complementing set meaning in other regex implementations.

In the re module, `^` has no special meaning if it’s not the first character in the set.

In other regex implementations, `^` can be placed in a position other than the first character inside the square brackets `[..^..]`. In this case (not Python), it will negate everything that follows it, but not what is before it. This behaviour is not tested here.

For example, in such an implementation `[a^b]` would match a string if it contains 'a' or any character that is not 'b', so it would be the same as `[^b]`. In Python, by contrast, `[a^b]` matches a literal `a`, `^` or `b`:

In [40]:
print('In python, `^` has no special meaning if it’s not the first character in the set.')

# Text to parse
strings = ['a',
           'b',
           'c', 
           'ab',
           'abc',
           'ac',
           'c',
           '123', 
           '123a'] 

for string in strings:
    print(re.search('[a^b]', string))
    
# TODO: test using SQL

In python, `^` has no special meaning if it’s not the first character in the set.
<re.Match object; span=(0, 1), match='a'>
<re.Match object; span=(0, 1), match='b'>
None
<re.Match object; span=(0, 1), match='a'>
<re.Match object; span=(0, 1), match='a'>
<re.Match object; span=(0, 1), match='a'>
None
None
<re.Match object; span=(3, 4), match='a'>
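
None of the sample strings contains a caret, so the output above does not show that `^` is treated as an ordinary character in this position. A short sketch with hypothetical inputs makes that visible:

```python
import re

# The caret itself is matched because it is a literal member of the set
caret_only = re.search('[a^b]', '^')
print(caret_only.group())  # ^

caret_inside = re.search('[a^b]', 'c^d')
print(caret_inside.span())  # (1, 2)

# A string with none of 'a', '^', 'b' does not match
no_member = re.search('[a^b]', 'cd')
print(no_member)  # None
```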


## `^` Hat (caret) outside square brackets: start of the string

The caret, when not used inside square brackets, matches the start of the string, and in MULTILINE mode also matches immediately after each newline.

For example, the search pattern `^a` will match strings that begin with "a".

In [45]:
# Text to parse
strings = ['a', # matches beginning with 'a'
           'b',
           'ab', # matches beginning with 'a'
           'ba',
           'This is a dog.',
           'A dog is an animal.', # Doesn't match: Note that it doesn't match uppercase 'A'
           'a dog is an animal'] # matches beginning with 'a'

for string in strings:
    print(re.search('^a', string))
    

<re.Match object; span=(0, 1), match='a'>
None
<re.Match object; span=(0, 1), match='a'>
None
None
None
<re.Match object; span=(0, 1), match='a'>
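
The MULTILINE behaviour mentioned above can be sketched with a small multi-line string. The text is a hypothetical example, and `re.findall` is used here only to list every match.

```python
import re

# Three lines; the first and third begin with 'a'
text = 'a dog\nb cat\na bird'

# Default mode: ^ matches only at the very start of the string
default_matches = re.findall('^a', text)
print(default_matches)  # ['a']

# MULTILINE mode: ^ also matches right after each newline
multiline_matches = re.findall('^a', text, flags=re.MULTILINE)
print(multiline_matches)  # ['a', 'a']
```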
