# CMSC331 - Spring 2021

## <font color='blue'>Regular Expressions in Python 3</font>

### Instructor: Fereydoon Vafaei

This notebook provides an introduction to RegEx in Python 3. RegEx is a good example of how **pattern matching** works. Pattern matching is used in lexical analyzers.

A regular expression (also called a regex or RE) is a special text string that describes a search pattern. You can think of regular expressions as a more powerful form of wildcard notation such as `*.txt`, which finds all text files in a file manager. The regex equivalent is `^.*\.txt$`, but regular expressions offer far more features and capabilities than wildcards.
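
As a minimal sketch of the wildcard comparison above, the following snippet applies the regex `^.*\.txt$` to a few file names. The file names are made up for illustration and are not part of the course material.

```python
import re

# Hypothetical file names used only for illustration
filenames = ['notes.txt', 'report.pdf', 'data.txt.bak', 'readme.txt']

# ^.*\.txt$ : any characters, followed by a literal '.txt' at the end of the string
pattern = r'^.*\.txt$'

text_files = [name for name in filenames if re.search(pattern, name)]
print(text_files)  # prints ['notes.txt', 'readme.txt']
```

Note that `data.txt.bak` is rejected because `$` requires `.txt` to be the very end of the name.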

A **"regex"** is a special pattern or sequence of characters describing a certain search pattern. In other words, a regex specifies the set of strings that match it.

A **"match"** is the piece of text, or sequence of bytes or characters, that the regex processing software found to correspond to the pattern. In other words, a regex can be used to check whether a string contains the specified search pattern.

The first thing to recognize when using regular expressions is that everything is essentially a character, and we are writing patterns to match a specific sequence of characters (also known as a string). Most patterns use normal ASCII, which includes letters, digits, punctuation and other symbols on your keyboard like %#$@!, but Unicode characters can also be used to match any type of international text.

Many applications and programming languages have their own implementation of regular expressions, often with slight and sometimes with significant differences from other implementations. When two applications use a different implementation of regular expressions, we say that they use different "regular expression flavors".

In Python, you are already familiar with strings and with string operations such as checking whether two strings are equal using the equality operator `==`, or testing whether a string is a substring of another using `in` or built-in methods such as `.find()` and `.index()`. Read the examples below.

In [1]:
'foo' == 'foo'

True

In [2]:
'foo' == 'bar'

False

In [3]:
s = 'foo123bar'
'123' in s

True

In [4]:
s = 'foo123bar'
s.find('123')

3

In [5]:
s.index('123')

3

In [6]:
s.index('foo')

0
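
The two methods above behave the same when the substring is present, but they differ when it is absent. The short sketch below illustrates that difference; the search text `'xyz'` is just an arbitrary example.

```python
s = 'foo123bar'

# .find() signals "not found" by returning -1
position = s.find('xyz')
print(position)  # prints -1

# .index() signals "not found" by raising ValueError
try:
    s.index('xyz')
    index_raised = False
except ValueError:
    index_raised = True

print(index_raised)  # prints True
```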

> You can also concatenate strings using the `+` operator.

In [7]:
fb = 'foo' + 'bar'
fb

'foobar'

Python's regex functions let you check whether a given regex matches a particular string. First, you need to import `re`, the standard-library module whose functions let you define and work with regexes to find matches.

In [8]:
# The Python library to work with regex
import re

> When you import the `re` module, you can start using regular expressions and the associated functions.

### `re.match()` vs `re.search()`

Python's `re` module offers two different primitive operations based on regular expressions: `re.match()` checks for a match only at the beginning of the string, while `re.search()` checks for the first location of a match **anywhere** in the string. Both functions return a `Match` object when the pattern is found and `None` otherwise, which is why a cell with no match displays no output. Notice that the `span` in the output `Match` object specifies the position of the match as a start index (inclusive) and an end index (exclusive).

In [9]:
re.match("a", "abcdef")    # Match

<re.Match object; span=(0, 1), match='a'>

In [10]:
re.match("c", "abcdef")    # No match

In [11]:
re.search("c", "abcdef")   # Match

<re.Match object; span=(2, 3), match='c'>

In [12]:
re.search("bc", "abcdef")   # Match

<re.Match object; span=(1, 3), match='bc'>

In [13]:
s = 'foo123bar'
re.search('123', s)

<re.Match object; span=(3, 6), match='123'>
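
The position shown as `span` in the output can also be read programmatically. As a small sketch, the Match object's `.span()`, `.start()`, and `.end()` methods return the match position, which can then be used to slice the original string:

```python
import re

s = 'foo123bar'
m = re.search('123', s)

print(m.span())    # prints (3, 6)
print(m.start())   # prints 3
print(m.end())     # prints 6

# The end index is exclusive, so it works directly as a slice bound
matched_text = s[m.start():m.end()]
print(matched_text)  # prints 123
```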

> A match object is **truthy**, so you can use it in a Boolean context like a conditional statement:

In [9]:
s = 'foobar'
if re.search('123', s):
    print('Found a match.')
else:
    print('No match.')

No match.


In [10]:
s = 'foo123bar'
if re.search('123', s):
    print('Found a match.')
else:
    print('No match.')

Found a match.


### Defining RegEx Patterns

As an example, the following code defines a RegEx pattern. The pattern in this example matches any five-character string starting with `a` and ending with `s`.

Here, we use the `re.match()` function to look for the pattern in `test_string`. The function returns a match object if the match is successful. If not, it returns `None`.

To specify regular expressions, **metacharacters** are used. In the following example, `^` and `$` are **metacharacters** referring to the beginning and the end of the string, respectively. The dot `.` metacharacter matches any character except a newline, so it functions like a wildcard. Thus, the pattern specifies that there are exactly three characters between `a` and `s`.

In [15]:
pattern = '^a...s$'
test_string = 'abyss'
result = re.match(pattern, test_string)

if result:
  print("Search successful.")
else:
  print("Search unsuccessful.")

Search successful.
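
To see where the boundaries of this pattern lie, the sketch below tries it on a few more strings. The test words are hypothetical examples chosen for illustration.

```python
import re

pattern = '^a...s$'

# Hypothetical test words chosen only for illustration
words = ['abyss', 'alias', 'abs', 'Alias', 'an abacus']

# Map each word to True/False depending on whether the pattern matches
results = {word: bool(re.match(pattern, word)) for word in words}

for word, matched in results.items():
    print(f'{word!r}: {"match" if matched else "no match"}')
```

Only `'abyss'` and `'alias'` match: `'abs'` is too short, `'Alias'` starts with an uppercase `A`, and `'an abacus'` is longer than five characters.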


Regular expressions beginning with the **metacharacter** `'^'` can be used with `search()` to restrict the match to the beginning of the string:

In [16]:
re.search("c", "abcdef")  # Match

<re.Match object; span=(2, 3), match='c'>

In [17]:
# Now restrict .search() with '^' to check only the beginning of the given string
re.search("^c", "abcdef")  # No match

In [18]:
re.search("^a", "abcdef")  # Match

<re.Match object; span=(0, 1), match='a'>

> Note however that in MULTILINE mode, `match()` only matches at the beginning of the string, whereas using `search()` with a regular expression beginning with `'^'` will match at the beginning of each line.

In [19]:
re.match('X', 'A\nB\nX', re.MULTILINE)  # No match

In [20]:
re.search('^X', 'A\nB\nX', re.MULTILINE)  # Match

<re.Match object; span=(4, 5), match='X'>
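
The effect of `re.MULTILINE` is easier to see when every line can match. The following sketch uses `re.findall()` and the `\w` special sequence, both covered later in this notebook, to collect the first word of each line. The sample text is an assumption made for the example.

```python
import re

# Hypothetical three-line text used only for illustration
text = 'alpha one\nbeta two\ngamma three'

# Without MULTILINE, '^' anchors only at the start of the whole string
first_only = re.findall(r'^\w+', text)
print(first_only)   # prints ['alpha']

# With MULTILINE, '^' also anchors right after each newline
each_line = re.findall(r'^\w+', text, re.MULTILINE)
print(each_line)    # prints ['alpha', 'beta', 'gamma']
```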

In the following example, `re.search()` checks whether the string starts with "The" and ends with "Spain". The star symbol `*` matches zero or more occurrences of the pattern to its left, which in this case is `.`, i.e. a wildcard.

In [21]:
txt = "The rain in Spain"
x = re.search("^The.*Spain$", txt) 

if (x):
  print("YES! We have a match!")
else:
  print("No match")

YES! We have a match!


In [11]:
txt = "The rain in Paris"
x = re.search("^The.*Spain$", txt) 

if (x):
  print("YES! We have a match!")
else:
  print("No match")

No match


### `re` Functions

The `re` module offers a set of functions that allows us to search a string for a match. The following table includes some of the functions most commonly used in working with regex in Python.

| Function |                            Description                            |                     Syntax                     |
|:--------:|:-----------------------------------------------------------------:|:----------------------------------------------:|
|  findall |               Returns a list containing all matches               |      re.findall(pattern, string, flags=0)      |
|  search  | Returns a Match object if there is a match anywhere in the string |       re.search(pattern, string, flags=0)      |
|   match  |       Checks for a match only at the beginning of the string      |       re.match(pattern, string, flags=0)       |
|   split  |    Returns a list where the string has been split at each match   | re.split(pattern, string, maxsplit=0, flags=0) |
|    sub   |             Replaces one or many matches with a string            | re.sub(pattern, repl, string, count=0, flags=0) |

> **Note:** By using the `sub()` function and replacing the pattern with the empty string `''`, you can eliminate a certain pattern from input strings. For instance, you can eliminate all whitespace by finding the matches of `'\s+'` and replacing them with the empty string. The following code shows some examples (also see Exercise-2):

In [58]:
# Some examples of using sub() method:

text = ' abcdefghi'

result1 = re.sub('abc',  '',    text)        # Delete pattern abc
result2 = re.sub('abc',  'def', text)        # Replace pattern abc -> def
result3 = re.sub(r'\s+', '',    text)        # Eliminate whitespace; the r prefix makes the pattern a raw string
result4 = re.sub('abc(def)ghi', r'\1', text) # Replace a string with a part of itself

print(result1)
print(result2)
print(result3)
print(result4)

 defghi
 defdefghi
abcdefghi
 def
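
Both `re.sub()` and `re.split()` accept an optional limit: `count` caps the number of replacements and `maxsplit` caps the number of splits, and in both cases `0` means no limit. The sketch below shows the effect on a small made-up string.

```python
import re

# Hypothetical input used only for illustration
text = 'a-b-c-d'

replaced_all = re.sub('-', '+', text)             # replace every match
replaced_two = re.sub('-', '+', text, count=2)    # replace only the first two matches
split_all = re.split('-', text)                   # split at every match
split_once = re.split('-', text, maxsplit=1)      # split only at the first match

print(replaced_all)   # prints a+b+c+d
print(replaced_two)   # prints a+b+c-d
print(split_all)      # prints ['a', 'b', 'c', 'd']
print(split_once)     # prints ['a', 'b-c-d']
```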


### Python RegEx Metacharacters

Metacharacters are characters with a special meaning. The real power of regex matching in Python emerges when regex contains special characters called **metacharacters**. These have a unique meaning to the regex matching engine and vastly enhance the capability of the search.

The full list of metacharacters used in `re` and how to use them can be found in the [Python documentation](https://docs.python.org/3.7/howto/regex.html).

### Metacharacters  `[]`  `^` `$` `.` `*` `+` `?` `{}` `|` `()` `\`

In this notebook, we're going to see the functionality of a few of the metacharacters.

The square brackets metacharacter `[]` specifies a set of characters you wish to match.

In [22]:
s = '123abc'
re.search('[abc]', s) # search whether any of a, b, or c is in s - Match 'a'

<re.Match object; span=(3, 4), match='a'>

Consider the problem of how to determine whether a string contains any three consecutive decimal digit characters.

In a regex, a set of characters specified in square brackets `[]` makes up a character class. This metacharacter sequence matches any single character that is in the class, as demonstrated in the following example:

In [23]:
s = 'foo123bar'
re.search('[0-9][0-9][0-9]', s)

<re.Match object; span=(3, 6), match='123'>

`[0-9]` matches any single decimal digit character — any character between `'0'` and `'9'`, inclusive. The full expression `[0-9][0-9][0-9]` matches any sequence of three decimal digit characters. In this case, `s` matches because it contains three consecutive decimal digit characters, `'123'`.

These strings also match:

In [24]:
re.search('[0-9][0-9][0-9]', 'foo456bar')

<re.Match object; span=(3, 6), match='456'>

In [25]:
re.search('[0-9][0-9][0-9]', '234baz')

<re.Match object; span=(0, 3), match='234'>

In [26]:
re.search('[0-9][0-9][0-9]', 'qux678')

<re.Match object; span=(3, 6), match='678'>

> **Note:**
* `[a-e]` is the same as `[abcde]`
* `[1-4]` is the same as `[1234]`
* `[0-39]` is the same as `[01239]`

You can complement (invert) the character set by placing a caret `^` symbol at the start of the square brackets `[]`.

* `[^abc]` means any character except a or b or c
* `[^0-9]` means any non-digit character
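
The character classes listed above can be checked quickly with `re.findall()`, which returns every matching character. This is a small sketch on made-up strings rather than part of the original exercises.

```python
import re

range_letters = re.findall('[a-e]', 'badge')         # letters a through e
range_mixed = re.findall('[0-39]', '0123456789')     # digits 0 through 3, plus 9
not_abc = re.findall('[^abc]', 'abcdef')             # anything except a, b, or c
non_digits = re.findall('[^0-9]', 'a1b2c3')          # anything except a digit

print(range_letters)  # prints ['b', 'a', 'd', 'e']
print(range_mixed)    # prints ['0', '1', '2', '3', '9']
print(not_abc)        # prints ['d', 'e', 'f']
print(non_digits)     # prints ['a', 'b', 'c']
```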

In [46]:
# Exercise-1: Write a RegEx that matches the first character in the string s that isn't a digit
s = '12345foo'
re.search('[^0-9]', s)

<re.Match object; span=(5, 6), match='f'>

Take a look at another regex metacharacter. As mentioned above, the dot `.` metacharacter matches any character except a newline, so it functions like a wildcard:

In [27]:
s = 'foo123bar'
re.search('1.3', s)

<re.Match object; span=(3, 6), match='123'>

In [23]:
s = 'foo13bar'
re.search('1.3', s)

> In the first example, the regex `1.3` matches `'123'` because the `'1'` and `'3'` match literally, and the `.` matches the `'2'`. Here, you’re essentially asking, "Does `s` contain a `'1'`, then any character (except a newline), then a `'3'`?" The answer is yes for `'foo123bar'` but no for `'foo13bar'`.

- The caret symbol `^`, as you saw earlier in this notebook, is used to check if a string starts with a certain pattern.
- The dollar symbol `$` is used to check if a string ends with a certain pattern.
- The star symbol `*` matches zero or more occurrences of the pattern to its left.
- The plus symbol `+` matches one or more occurrences of the pattern to its left.
- The question mark symbol `?` matches zero or one occurrence of the pattern to its left.
- Braces `{n,m}` match at least `n` and at most `m` repetitions of the pattern to their left.
- The vertical bar `|` is used for alternation (the "or" operator).
- Parentheses `()` are used to group sub-patterns. For example, `(a|b|c)xz` matches any string that contains either a or b or c followed by xz.
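
The metacharacters in the list above can be tried out one at a time. The following is a minimal sketch; the sample strings are assumptions chosen so that each quantifier has something to accept and something to reject.

```python
import re

plus = re.findall(r'ab+', 'a ab abb')               # '+' needs at least one 'b'
optional = re.findall(r'colou?r', 'color colour')   # '?' makes the 'u' optional
braces = re.findall(r'\d{2,3}', '1 22 333 4444')    # two or three digits at a time
either = re.findall(r'cat|dog', 'cat, dog, bird')   # '|' accepts either alternative
grouped = re.search(r'(a|b|c)xz', 'ab cxz')         # group, then literal 'xz'

print(plus)      # prints ['ab', 'abb']
print(optional)  # prints ['color', 'colour']
print(braces)    # prints ['22', '333', '444']
print(either)    # prints ['cat', 'dog']
print(grouped)   # prints <re.Match object; span=(3, 6), match='cxz'>
```

In the `{2,3}` case, `'4444'` yields `'444'` because the quantifier takes at most three digits and the leftover single `'4'` is too short to match on its own.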

**Note: Escaping Metacharacters**

- Occasionally, you may want to include a metacharacter in your regex, except you don't want it to carry its special meaning. Instead, you’ll want it to represent itself as a literal character.

> The backslash `\` removes the special meaning of a metacharacter.



In [11]:
# Using * literally as * NOT as a metacharacter

text = 'hi***there'
print(re.search(r'\*+', text))  # raw string, so the backslash reaches the regex engine unchanged

<re.Match object; span=(2, 5), match='***'>
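
Because the backslash is also Python's own string escape character, regex patterns that contain backslashes are usually written as raw strings with the `r` prefix. The sketch below first shows what the prefix changes, and then escapes the dot `.` so that it matches only a literal period. The sample text is an assumption made for the example.

```python
import re

# In a normal string literal '\n' is a single newline character;
# in a raw string r'\n' is two characters: a backslash and the letter n
normal_len = len('\n')
raw_len = len(r'\n')
print(normal_len, raw_len)   # prints 1 2

# Hypothetical text used only for illustration
s = 'version 1.3 build 123'

wildcard_dot = re.findall(r'1.3', s)    # '.' matches any character
literal_dot = re.findall(r'1\.3', s)    # '\.' matches only a real period

print(wildcard_dot)  # prints ['1.3', '123']
print(literal_dot)   # prints ['1.3']
```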


In [29]:
# Using * as metacharacter

print('mn ', re.search('ma*n', 'mn')) # match because * allows zero occurrences of 'a'
print('man', re.search('ma*n', 'man')) # match with one 'a'
print('maaan ', re.search('ma*n', 'maaan')) # match with three 'a' characters
print('main ', re.search('ma*n', 'main')) # no match because 'i' sits between 'a' and 'n', and a* can only consume 'a'
print('woman ', re.search('ma*n', 'woman')) # match because search() finds 'man' inside 'woman'

mn  <re.Match object; span=(0, 2), match='mn'>
man <re.Match object; span=(0, 3), match='man'>
maaan  <re.Match object; span=(0, 5), match='maaan'>
main  None
woman  <re.Match object; span=(2, 5), match='man'>


### `.group()`

> The `.group()` method of a `Match` object returns the part of the string that matched. See the following examples.

In [13]:
# .group() example-1
string = '39801 356, 2102 1111'

# Three-digit number followed by a space followed by a two-digit number
pattern = r'(\d{3}) (\d{2})'

# match variable contains a Match object
match = re.search(pattern, string) 

print(match)

if match:
  print(match.group())
else:
  print("pattern not found")


<re.Match object; span=(2, 8), match='801 35'>
801 35


> You can access individual captured groups by passing the group number to `.group()`, or get all of them at once with `.groups()`:

In [14]:
match.group(1)

'801'

In [15]:
match.group(2) 

'35'

In [16]:
match.group(1, 2)

('801', '35')

In [17]:
match.groups()

('801', '35')

In [67]:
# .group() example-2
match = re.search('(?<=abc)def', 'abcdef') #(?<=abc) excludes abc from the match!
match.group()

'def'

Notice that `re.search()` returns only the first match. To find all matches in the string, use the `re.findall()` function.

In [22]:
# .findall() example
string = '39801 356, 2102 1111'

# Three-digit number followed by a space followed by a two-digit number
pattern = r'(\d{3}) (\d{2})'

# matches is a list of tuples, one tuple of captured groups per match
matches = re.findall(pattern, string)

print(type(matches))
print(matches)

if matches:
    for groups in matches:
        print(groups)
else:
    print("pattern not found")

<class 'list'>
[('801', '35'), ('102', '11')]
('801', '35')
('102', '11')
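
The shape of the list returned by `re.findall()` depends on how many capturing groups the pattern contains. As a short sketch using the same input string, compare the pattern with no groups, one group, and two groups:

```python
import re

string = '39801 356, 2102 1111'

no_groups = re.findall(r'\d{3} \d{2}', string)       # list of whole matches
one_group = re.findall(r'(\d{3}) \d{2}', string)     # list of the single group's text
two_groups = re.findall(r'(\d{3}) (\d{2})', string)  # list of tuples, one item per group

print(no_groups)   # prints ['801 35', '102 11']
print(one_group)   # prints ['801', '102']
print(two_groups)  # prints [('801', '35'), ('102', '11')]
```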


### Special Sequences

The following list of special sequences isn’t complete. For a complete list of sequences and expanded class definitions for Unicode string patterns, see the last part of [Regular Expression Syntax](https://docs.python.org/3/library/re.html#re-syntax) in the Standard Library reference. In general, the Unicode versions match any character that’s in the appropriate category in the Unicode database.

`\d`

    Matches any decimal digit; this is equivalent to the class [0-9]
`\D`

    Matches any non-digit character; this is equivalent to the class [^0-9]
`\s`

    Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v]
`\S`

    Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v]
`\w`

    Matches any alphanumeric character; this is equivalent to the class [a-zA-Z0-9_]
`\W`

    Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_]

These sequences can be included inside a character class. For example, `[\s,.]` is a character class that will match any whitespace character, or `','` or `'.'`
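
The special sequences above can be compared side by side on one string. This is an illustrative sketch; the sample sentence is made up and contains a tab character to show that `\s` covers more than plain spaces.

```python
import re

# Hypothetical sentence used only for illustration (note the tab before 'East')
s = 'Room 42, floor_3\tEast wing.'

digits = re.findall(r'\d', s)           # single decimal digits
words = re.findall(r'\w+', s)           # runs of letters, digits, and underscores
spaces = re.findall(r'\s', s)           # whitespace characters, including the tab
chunks = re.findall(r'\S+', s)          # runs of non-whitespace characters
separators = re.findall(r'[\s,.]', s)   # whitespace, comma, or period

print(digits)           # prints ['4', '2', '3']
print(words)            # prints ['Room', '42', 'floor_3', 'East', 'wing']
print(spaces)           # prints [' ', ' ', '\t', ' ']
print(chunks)           # prints ['Room', '42,', 'floor_3', 'East', 'wing.']
print(len(separators))  # prints 6
```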

### More Exercises

In [3]:
# Exercise-2

phone = "2004-959-559 # This is Phone Number"

# Delete Python-style comments
num = re.sub('#.*$', "", phone)
print ("Phone Num : ", num)

# Remove anything other than digits
num = re.sub(r'\D', "", phone)    # '\D' Matches nondigits and is equivalent to '[^0-9]'
print ("Phone Num : ", num)

Phone Num :  2004-959-559 
Phone Num :  2004959559


In [61]:
# Exercise-3

# Find all numbers and return them as a list
string = 'hello 12 hi 89. Howdy 34'
pattern = r'\d+'

result = re.findall(pattern, string) 
print(result) # ['12', '89', '34']

['12', '89', '34']


In [74]:
# Exercise-4

# Split at each white-space character
txt = "The rain in Spain"
x = re.split(r"\s", txt)
print(x) # ['The', 'rain', 'in', 'Spain']

['The', 'rain', 'in', 'Spain']


In [9]:
# Exercise-5

phone = "001---123--234-5678"

# Replace one or more repetitions of - like --- with a single space " "
num = re.sub("-+", " ", phone)
print(num) # prints 001 123 234 5678

001 123 234 5678


In [2]:
# Exercise-6
# Write a RegEx that matches a string that has an 'a' followed by zero or more 'b'

def text_match(text):
    # 'a' followed by zero or more 'b' characters
    pattern = 'ab*'
    if re.search(pattern, text):
        return 'Found a match!'
    else:
        return 'Not matched!'

print(text_match("ac"))
print(text_match("abc"))
print(text_match("abbc"))
print(text_match("abab"))
print(text_match("bc"))
print(re.search('ab*', "ac"))
print(re.search('ab*', "abc"))
print(re.search('ab*', "abbc"))
print(re.search('ab*', "abab"))
print(re.search('ab*', "bc"))

Found a match!
Found a match!
Found a match!
Found a match!
Not matched!
<re.Match object; span=(0, 1), match='a'>
<re.Match object; span=(0, 2), match='ab'>
<re.Match object; span=(0, 3), match='abb'>
<re.Match object; span=(0, 2), match='ab'>
None


### References

[1] https://docs.python.org/3.7/library/re.html

[2] https://www.regular-expressions.info/tutorial.html

[3] https://regexone.com/lesson/introduction_abcs

[4] https://realpython.com/regex-python/

[5] https://www.w3schools.com/python/python_regex.asp

[6] https://www.programiz.com/python-programming/regex

[7] https://lzone.de/examples/Python%20re.sub

[8] https://docs.python.org/3/howto/regex.html