# Introduction to Regex Matching with Python

In this notebook, I'll show how to use Regex to check for the presence of search patterns in strings, using progressive examples.

A *regular expression* (regex, for short) is a text string that describes a search pattern.

The simplest regular expression is just a single character, like 'a'. An ordinary character simply matches itself. For example, we can check whether 'a' is in the string 'pandas'. We can also concatenate ordinary characters: for example, we can search for 'dog' in the string 'I like dogs and cats.'.

But Regex is powerful and can do much more. Below, I'll introduce Regex using progressive examples implemented with the Python re module. The [re module](https://docs.python.org/3/library/re.html) provides regex matching operations in Python.

In [1]:
import re

# for some functions defined by me to show only in jupyter
# import sys
# import os
# sys.path.insert(1, '/home/fede/Documents/Learn/Private_Scripts/Various_Utilities')
# sys.path.insert(0, os.path.abspath('/home/fede/Documents/Learn/Private_Scripts/Various_Utilities/'))
# import jupyter_demostrations
# from jupyter_demostrations import fshow

In [37]:
def fshow(code_to_execute: str, explanation: str = "") -> None:
    """ prints the input string and the output """
    print(f"{explanation}\n>>> {code_to_execute} \n{eval(code_to_execute)}")


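def fshow_multiple(dict_of_functions: dict) -> None:
    """ prints each key, then runs fshow on each of its code strings """
    # iterate over the argument, not over the global `examples`
    for key, value in dict_of_functions.items():
        print(key)
        for item in value:
            fshow(item)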

re provides several functions for using Regex in Python. The most important of them is `search`, and in this notebook I'll focus mainly on this function.

But here is the full list of functions provided by re.

- match     Match a regular expression pattern to the beginning of a string.
- fullmatch Match a regular expression pattern to all of a string.
- **search**    Search a string for the presence of a pattern.
- sub       Substitute occurrences of a pattern found in a string.
- subn      Same as sub, but also return the number of substitutions made.
- split     Split a string by the occurrences of a pattern.
- findall   Find all occurrences of a pattern in a string.
- finditer  Return an iterator yielding a Match object for each match.
- compile   Compile a pattern into a Pattern object.
- purge     Clear the regular expression cache.
- escape    Backslash all non-alphanumerics in a string.

Let's begin with a simple example. Let's look for the presence of the letter 'a' in the string 'cat'.

The re.search function inputs are:
* A regex pattern.
* A string.

In our example, the regex pattern is just `'a'` (Don't worry, I'll introduce more complex patterns shortly :)

And the string is `'cat'`.

Let's first look at the output, and afterward I'll explain it:

In [3]:
print(re.search("a", "cat"))

<re.Match object; span=(1, 2), match='a'>


As you can see, the output is not a boolean, but provides more information.

The output of `re.search` is a Match object if a match was found. In this case, since 'a' is in 'cat', it returns a Match object.

The `span` shown in the output gives the start and end indices of the match (the end index is exclusive). In this case, since 'a' is in the second position (index 1), the span is the tuple (1, 2).
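The values shown in the printed representation can also be read programmatically. Here is a minimal sketch using the standard `Match` methods `span()`, `start()`, `end()` and `group()`; the variable name `m` is just illustrative.

```python
import re

m = re.search("a", "cat")

print(m.span())   # (1, 2): start index and exclusive end index
print(m.start())  # 1
print(m.end())    # 2
print(m.group())  # a (the text that matched)
```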

Let's look for the letter 'a' in the string 'dog':

In [4]:
print(re.search("a", "dog"))

None


Since 'a' is not in 'dog', `re.search` returns `None`.
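Because the result is either a Match object or `None`, it is easy to turn it into a plain yes/no answer. The helper `contains` below is a hypothetical name I made up for this sketch; it is not part of the re module.

```python
import re


def contains(pattern: str, text: str) -> bool:
    """Return True if `pattern` is found anywhere in `text`."""
    return re.search(pattern, text) is not None


print(contains("a", "cat"))  # True
print(contains("a", "dog"))  # False
```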

In [5]:
help(re.search)

Help on function search in module re:

search(pattern, string, flags=0)
    Scan through string looking for a match to the pattern, returning
    a Match object, or None if no match was found.



## re.search vs re.match

In [6]:
# re.match: Match a regular expression pattern to the beginning of a string.
print(re.match("c", "cat"))
print(re.match("a", "cat"))
print(re.match("cat", "cat"))
# re.fullmatch: Match a regular expression pattern to all of a string.
print(re.fullmatch("c", "cat"))
print(re.fullmatch("cat", "cat"))
# re.search: Search a string for the presence of a pattern.
print(re.search("a", "cat"))
# re.sub Substitute occurrences of a pattern found in a string.
print(re.sub("c", "b", "cat"))
# re.subn Same as sub, but also return the number of substitutions made.
print(re.subn("c", "b", "cat"))
# re.split Split a string by the occurrences of a pattern.
print(re.split(" ", "hello world"))
# re.findall Find all occurrences of a pattern in a string.
print(re.findall("o", "hello world"))
# re.finditer  Return an iterator yielding a Match object for each match.
print([i for i in re.finditer("o", "hello world")])
# re.compile   Compile a pattern into a Pattern object.
# to separate definition of the regex from its use.
# to get better performance when it's run a lot of times
pattern = re.compile("a")
print(re.search(pattern, "cat"))
# re.purge Clear the regular expression cache.
# little memory benefit, and can actually hurt performance if you purged it.
re.purge()
# re.escape Backslash all non-alphanumerics in a string.
print(re.search(".", "Hello World."))  # notice the end dot in the string
print(re.search(re.escape("."), "Hello World."))  # notice the end dot in the string
re.escape(".")

<re.Match object; span=(0, 1), match='c'>
None
<re.Match object; span=(0, 3), match='cat'>
None
<re.Match object; span=(0, 3), match='cat'>
<re.Match object; span=(1, 2), match='a'>
bat
('bat', 1)
['hello', 'world']
['o', 'o']
[<re.Match object; span=(4, 5), match='o'>, <re.Match object; span=(7, 8), match='o'>]
<re.Match object; span=(1, 2), match='a'>
<re.Match object; span=(0, 1), match='H'>
<re.Match object; span=(11, 12), match='.'>


'\\.'
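To condense the comparison in this section's title, here is a small sketch that runs the same pattern through `re.match`, `re.search` and `re.fullmatch`. The sample string `"hotdog"` and the variable names are my own choices for illustration.

```python
import re

text = "hotdog"

at_start = re.match("dog", text)    # None: "dog" does not start at index 0
anywhere = re.search("dog", text)   # found later in the string, span (3, 6)
whole = re.fullmatch("dog", text)   # None: the pattern must cover the whole string

print(at_start, anywhere, whole, sep="\n")
```

In short, `match` only looks at the beginning of the string, `fullmatch` requires the entire string to match, and `search` accepts a match anywhere.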

In [7]:
examples = {
    "re.match": ("re.match('c', 'cat')", "re.match('cat', 'cat')"),
    "re.fullmatch": ("re.fullmatch('c', 'cat')", "re.fullmatch('cat', 'cat')"),
}


In [8]:
for key, value in examples.items():
    print(key)
    for i in value:
        fshow(i)

re.match

>>> re.match('c', 'cat') 
<re.Match object; span=(0, 1), match='c'>

>>> re.match('cat', 'cat') 
<re.Match object; span=(0, 3), match='cat'>
re.fullmatch

>>> re.fullmatch('c', 'cat') 
None

>>> re.fullmatch('cat', 'cat') 
<re.Match object; span=(0, 3), match='cat'>


In [9]:
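# four equivalent ways to iterate over a dictionary
# (the variable is not called `dict`, to avoid shadowing the builtin type)
fruit_colors = {"apple": "red", "mango": "green", "orange": "orange"}

for key, value in fruit_colors.items():
    print(key, value)

for item in fruit_colors.items():
    print(item[0], fruit_colors[item[0]])

for item in fruit_colors.items():
    print(item[0], item[1])

for i in enumerate(fruit_colors):
    print(i[1], fruit_colors[i[1]])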

apple red
mango green
orange orange
apple red
mango green
orange orange
apple red
mango green
orange orange
apple red
mango green
orange orange


In [10]:
fshow_multiple(examples)

re.match

>>> re.match('c', 'cat') 
<re.Match object; span=(0, 1), match='c'>

>>> re.match('cat', 'cat') 
<re.Match object; span=(0, 3), match='cat'>
re.fullmatch

>>> re.fullmatch('c', 'cat') 
None

>>> re.fullmatch('cat', 'cat') 
<re.Match object; span=(0, 3), match='cat'>


In [11]:
fshow("re.match('c', 'cat')")


>>> re.match('c', 'cat') 
<re.Match object; span=(0, 1), match='c'>


In [12]:
fshow("re.match('a', 'cat')")


>>> re.match('a', 'cat') 
None


Let's now look at the same examples using a Python loop:

In [13]:
# Text to parse
strings = ["cat", "dog"]

for string in strings:
    print(re.search("a", string))

<re.Match object; span=(1, 2), match='a'>
None


I'll add the word 'area' to the list of strings:

In [14]:
# Text to parse
strings = ["cat", "dog", "area"]

for string in strings:
    print(re.search("a", string))

<re.Match object; span=(1, 2), match='a'>
None
<re.Match object; span=(0, 1), match='a'>


In the case of 'area', the letter 'a' is found in two positions: at the beginning and at the end. The span of the returned re.Match object is (0, 1), because `re.search` returns only the first occurrence.

Let's now check if there is a match of 'og' in the list of strings 'cat', 'dog', 'fog':
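If every occurrence is needed rather than just the first, `re.finditer` (listed earlier) yields one Match object per occurrence. A minimal sketch, with `spans` as an illustrative variable name:

```python
import re

# collect the span of every 'a' in 'area', not only the first one
spans = [m.span() for m in re.finditer("a", "area")]
print(spans)  # [(0, 1), (3, 4)]
```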

In [15]:
# Text to parse
strings = ["cat", "dog", "fog"]

for string in strings:
    print(re.search("og", string))

None
<re.Match object; span=(1, 3), match='og'>
<re.Match object; span=(1, 3), match='og'>


## `\d` Digits

Admittedly, my previous examples were not impressive. The same output could have been easily achieved without using Regex.

But now it's time for the introduction of Regex special characters or metacharacters. Metacharacters are the building blocks of regular expressions, and allow us to build more complex regular expressions.

There are many Regex special characters. The first metacharacter I'll introduce is `\d`.

`\d` matches any single digit from 0 to 9. (With Unicode `str` patterns, it also matches decimal digits from other scripts.)

In [16]:
strings = ["123", "aa1", "cat"]

for string in strings:
    print(re.search(r"\d", string))  # raw string, so Python leaves the backslash alone

<re.Match object; span=(0, 1), match='1'>
<re.Match object; span=(2, 3), match='1'>
None
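As a small illustration of what `\d` is good for, the sketch below pulls every digit out of a sentence with `re.findall`. The sample sentence is made up for this example.

```python
import re

# each digit is returned separately, because \d matches exactly one character
digits = re.findall(r"\d", "Room 42, floor 3")
print(digits)  # ['4', '2', '3']
```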


## `.` Wildcard

The `.` is a wildcard that matches any single character except a newline. It can be:

- a letter (abc...)
- a digit (0123456789)
- a special character (!@%^&*())
- a white space (` `)

If you need to match a literal dot `.`, escape it with a backslash: `\.`.

Below are some strings. Some of them contain a dot `.` and others do not.

First, let's create a pattern to match all of them using the wildcard `.`.

In [17]:
strings = ["My name is Mike.", "abc.", "1000.00", "house of cards", " "]

# match all strings using the wildcard
for string in strings:
    print(re.search(".", string))

<re.Match object; span=(0, 1), match='M'>
<re.Match object; span=(0, 1), match='a'>
<re.Match object; span=(0, 1), match='1'>
<re.Match object; span=(0, 1), match='h'>
<re.Match object; span=(0, 1), match=' '>


As we can see, the unescaped `.` matches any character, even the whitespace `' '`.

Now, I'll create a pattern to match only those that contain a dot. To do this, we'll need to escape the dot, using `\.` instead of `.`:

In [18]:
# match only those that contain a dot, escaping the dot with a backslash
for string in strings:
    print(re.search(r"\.", string))

<re.Match object; span=(15, 16), match='.'>
<re.Match object; span=(3, 4), match='.'>
<re.Match object; span=(4, 5), match='.'>
None
None
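The difference between the wildcard and the escaped dot is easier to see when the dot sits between two ordinary characters. This is a minimal sketch with made-up sample strings and variable names.

```python
import re

any_char = re.search(r"a.c", "abc")      # matches: the dot stands in for 'b'
no_dot = re.search(r"a\.c", "abc")       # None: a literal dot is required
literal_dot = re.search(r"a\.c", "a.c")  # matches 'a.c'
newline = re.search(r".", "\n")          # None: by default the dot skips newlines

print(any_char, no_dot, literal_dot, newline, sep="\n")
```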


## `[]` Square Brackets: Character class

The square brackets, also called a character class, are used to match one out of several alternatives.

For example, `[ab]` will match either `a` or `b`:

In [19]:
# Text to parse
strings = [
    "abc",  # match? Yes, contains a
    "123ab",  # Yes, contains a
    "a",  # Yes, contains a
    "b",  # Yes, contains b
    "ab",  # Yes, contains a
    "c",  # No, doesn't contain a or b
    "cd",  # No, doesn't contain a or b
    "123",  # No, doesn't contain a or b
]

In [20]:
for string in strings:
    print(re.search("[ab]", string))

<re.Match object; span=(0, 1), match='a'>
<re.Match object; span=(3, 4), match='a'>
<re.Match object; span=(0, 1), match='a'>
<re.Match object; span=(0, 1), match='b'>
<re.Match object; span=(0, 1), match='a'>
None
None
None
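A character class is most useful in the middle of a longer pattern. The sketch below, with a made-up word list, accepts both spellings 'gray' and 'grey' but nothing else.

```python
import re

words = ["gray", "grey", "groy"]

# [ae] accepts exactly one character, which must be 'a' or 'e'
matched = [w for w in words if re.search("gr[ae]y", w)]
print(matched)  # ['gray', 'grey']
```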


## `[^ ]` Hat (caret) inside square brackets: Complementing Set

`^`, when it is the first character inside square brackets, indicates a complementing set.

For example, `[^ab]` will match a string only if it contains at least one character that is not a or b.

In [21]:
# Text to parse
strings = [
    "a",
    "b",
    "c",  # match, contains c
    "ab",
    "abc",  # match, contains c, note that it also contains 'a' and 'b'
    "ac",  # match, contains c, note that it also contains 'a'
    "c",  # match, contains c
    "123",  # match, contains 1
    "123a",  # match, contains 1, note that it also contains 'a'
]

for string in strings:
    print(re.search("[^ab]", string))

None
None
<re.Match object; span=(0, 1), match='c'>
None
<re.Match object; span=(2, 3), match='c'>
<re.Match object; span=(1, 2), match='c'>
<re.Match object; span=(0, 1), match='c'>
<re.Match object; span=(0, 1), match='1'>
<re.Match object; span=(0, 1), match='1'>
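To see exactly which characters a complementing set picks up, `re.findall` can list all of them. A minimal sketch with a made-up sample string:

```python
import re

# every character that is neither 'a' nor 'b'
others = re.findall("[^ab]", "abcab1")
print(others)  # ['c', '1']
```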


### Complementing set meaning in other regex implementations.

In the re module, `^` has no special meaning if it’s not the first character in the set.

In other regex implementations, `^` can reportedly be placed somewhere other than the first position inside the square brackets, as in `[..^..]`, and then negates everything that follows it but not what comes before it. This is untested here (see the TODO in the cell below); mainstream flavors such as PCRE and POSIX treat a non-leading `^` as a literal character, just as Python does.

Under that reading, `[a^b]` would match a string if it contains 'a' or any character that is not 'b'. In Python, however, `[a^b]` simply matches one of the three literal characters `a`, `^` or `b`:

In [22]:
print(
    "In Python, `^` has no special meaning if it’s not the first character in the set."
)

# Text to parse
strings = ["a", "b", "c", "ab", "abc", "ac", "c", "123", "123a"]

for string in strings:
    print(re.search("[a^b]", string))

# TODO: test using SQL

In Python, `^` has no special meaning if it’s not the first character in the set.
<re.Match object; span=(0, 1), match='a'>
<re.Match object; span=(0, 1), match='b'>
None
<re.Match object; span=(0, 1), match='a'>
<re.Match object; span=(0, 1), match='a'>
<re.Match object; span=(0, 1), match='a'>
None
None
<re.Match object; span=(3, 4), match='a'>
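One way to confirm that a non-leading caret is an ordinary character in Python is to search a string that contains a literal `^`. The sample strings and variable names below are made up for this sketch.

```python
import re

caret = re.search("[a^b]", "x^y")     # the set contains the literal '^'
nothing = re.search("[a^b]", "xyz")   # no 'a', '^' or 'b' present

print(caret.group())  # ^
print(nothing)        # None
```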


## `^` Hat (caret) outside square brackets: start of the string

The caret, when not used inside square brackets, matches the start of the string, and in MULTILINE mode also matches immediately after each newline.

For example, the search pattern `^a` will match strings that begin with "a".

In [23]:
# Text to parse
strings = [
    "a",  # matches: begins with 'a'
    "b",
    "ab",  # matches: begins with 'a'
    "ba",
    "This is a dog.",
    "A dog is an animal.",  # doesn't match: note that it doesn't match uppercase 'A'
    "a dog is an animal",  # matches: begins with 'a'
]

for string in strings:
    print(re.search("^a", string))

<re.Match object; span=(0, 1), match='a'>
None
<re.Match object; span=(0, 1), match='a'>
None
None
None
<re.Match object; span=(0, 1), match='a'>
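The MULTILINE behaviour mentioned above, and the uppercase 'A' that did not match, can both be explored with the standard flags `re.MULTILINE` and `re.IGNORECASE`. This is a minimal sketch; the three-line sample text and the variable names are made up.

```python
import re

text = "apple\nbanana\navocado"

first_only = re.findall("^a", text)                      # only the very start of the string
every_line = re.findall("^a", text, flags=re.MULTILINE)  # the start of each line
ignore_case = re.search("^a", "A dog is an animal.", flags=re.IGNORECASE)

print(first_only)           # ['a']
print(every_line)           # ['a', 'a']
print(ignore_case.group())  # A
```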


## Character Ranges

We've just learned that `[ab]` will match either a or b. Similarly:

* `[abc]` will match either a, b or c.
* `[abcd]` will match either a, b, c, or d.

And so on.

When we need to match a range of characters, Regex provides a simpler way. For example, if we need to match any character from 'a' to 'd' we can use `[a-d]`:   

In [24]:
strings = ["a", "b", "c", "d", "e"]

for string in strings:
    print(re.search("[a-d]", string))

<re.Match object; span=(0, 1), match='a'>
<re.Match object; span=(0, 1), match='b'>
<re.Match object; span=(0, 1), match='c'>
<re.Match object; span=(0, 1), match='d'>
None


Of course, this is the same as `[abcd]`, but when we have a long list, character ranges are quite useful.

We can use the hat to match only the complementing set of a character range:

In [25]:
strings = ["a", "b", "c", "d", "e"]

for string in strings:
    print(re.search("[^a-d]", string))  # the caret must be the first character in the set

None
None
None
None
<re.Match object; span=(0, 1), match='e'>
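Several ranges can be combined inside a single pair of brackets. The sketch below uses a made-up sample string to compare one range, three combined ranges, and their complement.

```python
import re

text = "R2-d2!"

lowercase = re.findall("[a-z]", text)             # lowercase letters only
alphanumeric = re.findall("[A-Za-z0-9]", text)    # letters of either case, or digits
everything_else = re.findall("[^A-Za-z0-9]", text)

print(lowercase)        # ['d']
print(alphanumeric)     # ['R', '2', 'd', '2']
print(everything_else)  # ['-', '!']
```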


## Repetitions: Curly Braces {}

Sometimes we need to find a repetition of characters. For example, 'aaa' will match three consecutive 'a'. Regex provides a convenient way to specify a number of repetitions: the curly braces `{}`.

For example, if we want to match 'aaa', we can also use `a{3}`:

In [26]:
strings = ["a", "aa", "aaa", "aaaa", "aaaaa"]

for string in strings:
    print(re.search("a{3}", string))

None
None
<re.Match object; span=(0, 3), match='aaa'>
<re.Match object; span=(0, 3), match='aaa'>
<re.Match object; span=(0, 3), match='aaa'>


Curly braces also allow us to specify a range of repetitions. For example, if we want to match three or four repetitions, but not one or two, we can use this expression: `a{3,4}`. (A string with five 'a' still produces a match, because `re.search` finds the first four of them.)

In [27]:
strings = ["a", "aa", "aaa", "aaaa", "aaaaa"]

for string in strings:
    print(re.search("a{3,4}", string))

None
None
<re.Match object; span=(0, 3), match='aaa'>
<re.Match object; span=(0, 4), match='aaaa'>
<re.Match object; span=(0, 4), match='aaaa'>
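Two related variations are worth a quick sketch: leaving out the upper bound (`a{3,}` means three or more), and using `re.fullmatch` when a string with five 'a' should be rejected outright. The variable names are illustrative.

```python
import re

open_ended = re.search("a{3,}", "aaaaa")      # no upper bound: takes all five
exact_ok = re.fullmatch("a{3,4}", "aaaa")     # the whole string is four 'a'
exact_fail = re.fullmatch("a{3,4}", "aaaaa")  # five 'a' cannot fit in {3,4}

print(open_ended.group())  # aaaaa
print(exact_ok.group())    # aaaa
print(exact_fail)          # None
```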


## Greedy vs Lazy: `?`

Take a look at the last match in the previous output. It matched only 'aaaa' (four times 'a', not five).

In Regex, there are two types of matches.

* **'Greedy'** means match the longest possible string.
* **'Lazy'** means match the shortest possible string.

Like most Regex implementations, Python's re is greedy by default. That's why in our previous example, `a{3,4}` in 'aaaa' matched 'aaaa' (the longest possible string). If we want to match only 'aaa' (three times 'a'), we can do so by making the Regex expression lazy.

To make an expression lazy, we add a question mark after the quantifier. So, the lazy version of `a{3,4}` is `a{3,4}?`.

*Note: to match the question mark, one must escape it using `\?`*

In [28]:
strings = ["a", "aa", "aaa", "aaaa", "aaaaa"]

for string in strings:
    print(re.search("a{3,4}?", string))

None
None
<re.Match object; span=(0, 3), match='aaa'>
<re.Match object; span=(0, 3), match='aaa'>
<re.Match object; span=(0, 3), match='aaa'>


Note that when it matches, it matches only 'aaa', the shortest possible string.
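The difference matters most when the repeated part is followed by something else. The sketch below uses a made-up HTML snippet and the wildcard `.` with a `{1,20}` range: the greedy version runs to the last `>`, while the lazy version stops at the first one.

```python
import re

html = "<b>bold</b>"

greedy = re.search("<.{1,20}>", html).group()  # as many characters as possible
lazy = re.search("<.{1,20}?>", html).group()   # as few characters as possible

print(greedy)  # <b>bold</b>
print(lazy)    # <b>
```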

## Repetitions: Zero or more *

The asterisk `*` matches 0 or more (greedy) repetitions of the preceding expression. For example, `a*` will match the longest possible run of `a`, but it will also match an empty string `''`, because that counts as zero repetitions:

In [29]:
strings = ["", "a", "aa", "aaa", "aaaa", "aaaaa"]

for string in strings:
    print(re.search("a*", string))

<re.Match object; span=(0, 0), match=''>
<re.Match object; span=(0, 1), match='a'>
<re.Match object; span=(0, 2), match='aa'>
<re.Match object; span=(0, 3), match='aaa'>
<re.Match object; span=(0, 4), match='aaaa'>
<re.Match object; span=(0, 5), match='aaaaa'>


The lazy version, `a*?`, will match the shortest possible run of `a`:

In [30]:
strings = ["", "a", "aa", "aaa", "aaaa", "aaaaa"]

for string in strings:
    print(re.search("a*?", string))

<re.Match object; span=(0, 0), match=''>
<re.Match object; span=(0, 0), match=''>
<re.Match object; span=(0, 0), match=''>
<re.Match object; span=(0, 0), match=''>
<re.Match object; span=(0, 0), match=''>
<re.Match object; span=(0, 0), match=''>


Because `*` includes zero repetitions, the lazy version always matches `''` (the shortest match).
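Zero repetitions can be surprising even in the greedy case: if the string does not start with 'a', `a*` succeeds immediately with an empty match. The sketch below, with made-up sample strings, shows this and a more typical use of `*` inside a longer pattern.

```python
import re

empty_at_start = re.search("a*", "baaa")  # zero 'a' at index 0 already counts as a match
whole = re.search("ba*", "baaa")          # 'b' followed by as many 'a' as possible
optional_b = re.findall("ab*c", "ac abc abbbc adc")

print(empty_at_start.span())  # (0, 0)
print(whole.group())          # baaa
print(optional_b)             # ['ac', 'abc', 'abbbc']
```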

## Repetitions: One or more +

The plus sign `+` matches 1 or more (greedy) repetitions of the preceding expression. For example, `a+` will match the longest possible run of `a` (at least one, so it will not match an empty string `''`):

In [31]:
strings = ["", "a", "aa", "aaa", "aaaa", "aaaaa"]

for string in strings:
    print(re.search("a+", string))

None
<re.Match object; span=(0, 1), match='a'>
<re.Match object; span=(0, 2), match='aa'>
<re.Match object; span=(0, 3), match='aaa'>
<re.Match object; span=(0, 4), match='aaaa'>
<re.Match object; span=(0, 5), match='aaaaa'>


As before, the lazy version `a+?` matches the shortest possible string:

In [32]:
strings = ["", "a", "aa", "aaa", "aaaa", "aaaaa"]

for string in strings:
    print(re.search("a+?", string))

None
<re.Match object; span=(0, 1), match='a'>
<re.Match object; span=(0, 1), match='a'>
<re.Match object; span=(0, 1), match='a'>
<re.Match object; span=(0, 1), match='a'>
<re.Match object; span=(0, 1), match='a'>
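Combining `+` with `\d` from earlier gives a common idiom: extracting whole numbers instead of single digits. A minimal sketch with a made-up sentence:

```python
import re

# \d+ groups consecutive digits into one match
numbers = re.findall(r"\d+", "3 apples and 12 oranges")
print(numbers)  # ['3', '12']
```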


## Backslash \ : Matching Special Characters

Sometimes we need to match special characters that are used by Regex to build regular expressions, like `*`, `.` or `?`.

In other cases, we need to match newlines or tabs.

In these cases, we need to escape the special characters using the backslash symbol `\`.

But in some cases, the Python interpreter itself performs substitutions for `\` before the re module ever sees your string. A good approach is to tell Python that we are using a raw string by prefixing the string pattern with `r`. So for example, if we need to match a backslash, we can use `re.search(r'\\', string)`. Without the `r` prefix, the pattern `'\\'` reaches re as a single backslash, which is an incomplete escape and raises an error.
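The effect of the `r` prefix can be checked directly by comparing string lengths. This is a minimal sketch; the sample value `path` and the other variable names are made up for illustration.

```python
import re

plain_len = len("\n")  # Python turns \n into a single newline character
raw_len = len(r"\n")   # raw string: a backslash followed by the letter n

path = "C:\\temp"  # the actual string contains one backslash

with_raw = re.search(r"\\", path).span()       # raw string pattern
without_raw = re.search("\\\\", path).span()   # same pattern, every backslash doubled

print(plain_len, raw_len)     # 1 2
print(with_raw, without_raw)  # (2, 3) (2, 3)
```

Both patterns reach re as the same two characters, but the raw string version is much easier to read.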

In [33]:
strings = ["a", "*"]

for string in strings:
    print(re.search(r"\*", string))

None
<re.Match object; span=(0, 1), match='*'>


In [34]:
# Example: match newlines
strings = [
    "a",
    """line1
           line2""",
]

for string in strings:
    print(re.search(r"\n", string))

None
<re.Match object; span=(5, 6), match='\n'>


In [35]:
# Example: match tabs
strings = ["a", "this line contains a tab\t"]  # \t makes the trailing tab visible

for string in strings:
    print(re.search(r"\t", string))

None
<re.Match object; span=(24, 25), match='\t'>


In [36]:
# Example: match a literal backslash
strings = ["this line contains a \\ backslash", "a"]  # "\\" is one literal backslash

for string in strings:
    print(re.search(r"\\", string))

<re.Match object; span=(21, 22), match='\\'>
None
