# Regular expressions
* A formal language for describing sets of text strings (character sequences)
* Used for pattern matching (e.g. searching & replacing in text)

1. Disjunctions
2. Negation
3. Optionality
4. Aliases
5. Anchors

## 1. Disjunctions

In [1]:
import re

In [2]:
text = '''Most of the time we can use white space.
But what about fred@gmail.com or 13/01/2021?
The students' attempts aren't working.
Maybe it's the use of apostrophes? Or it might need a more up-to-date model.
'''

In [3]:
pattern = r"the"

In [4]:
print(re.sub(pattern, "X", text))

Most of X time we can use white space.
But what about fred@gmail.com or 13/01/2021?
The students' attempts aren't working.
Maybe it's X use of apostrophes? Or it might need a more up-to-date model.



In [5]:
# also match on upper case T
pattern = r"[Tt]he"

In [6]:
# match on digits
pattern = r"[0-9]"

In [7]:
# match on upper case
pattern = r"[A-Z]"

In [8]:
# match any of these words
pattern = r"of|the|we"
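
The last few patterns are defined but never applied. As a minimal sketch of what each one matches, the cell below runs them through `re.findall`. The `sample` sentence and the other short strings are made up for illustration and are not part of the original text.

```python
import re

# Hypothetical sample text, chosen only to illustrate the patterns
sample = "The cat saw the dog in 2021."

# Character class: [Tt] matches either "T" or "t"
the_matches = re.findall(r"[Tt]he", sample)
print(the_matches)     # ['The', 'the']

# Range: [0-9] matches any single digit
digit_matches = re.findall(r"[0-9]", sample)
print(digit_matches)   # ['2', '0', '2', '1']

# Range: [A-Z] matches any single upper-case ASCII letter
upper_matches = re.findall(r"[A-Z]", sample)
print(upper_matches)   # ['T']

# Alternation: | matches any one of the alternatives
word_matches = re.findall(r"of|the|we", "most of the time we wait")
print(word_matches)    # ['of', 'the', 'we']

# Alternation matches substrings too, so "the" is found inside "other"
inside_matches = re.findall(r"of|the|we", "other")
print(inside_matches)  # ['the']
```

The last call is a reminder that these patterns match character sequences, not whole words.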

## 2. Negation

In [9]:
text = '''Most of the time we can use white space.
But what about fred@gmail.com or 13/01/2021?
The students' attempts aren't working.
Maybe it's the use of apostrophes? Or it might need a more up-to-date model.
'''

In [10]:
# match on anything that is NOT digits
pattern = r"[^0-9]"

In [11]:
print(re.sub(pattern, " ", text))

                                                                          13 01 2021                                                                                                                      


In [12]:
# match on anything that is NOT lower case characters
pattern = r"[^a-z]"
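
A small sketch of the lower-case negation on a made-up string (the `sample` value is an assumption for illustration). It also shows that `^` only negates when it is the first character inside the brackets.

```python
import re

# Hypothetical sample text
sample = "Ab3 c!"

# Every character that is not a lower-case ASCII letter becomes "_"
masked = re.sub(r"[^a-z]", "_", sample)
print(masked)         # _b__c_

# When ^ is not the first character in the brackets it is a literal caret
literal_caret = re.findall(r"[a^z]", "a^b")
print(literal_caret)  # ['a', '^']
```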

## 3. Optionality
- `.` matches any single character except a newline (wildcard).
- `?` makes the preceding character optional (matches 0 or 1 times).
- `*` matches 0 or more repetitions of the preceding character (Kleene star).
- `+` matches 1 or more repetitions of the preceding character (Kleene plus).
- Adding `?` after `*` or `+` makes the match non-greedy (matches as little as possible).
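
The operators listed above can be compared side by side. This is a rough sketch that assumes a small hypothetical word list and uses `re.fullmatch`, so a word only counts when the whole string fits the pattern.

```python
import re

# Hypothetical word list for comparing the operators
words = ["ct", "cat", "caat", "cbt"]

def matching(pattern):
    """Return the words that the pattern matches in full."""
    return [w for w in words if re.fullmatch(pattern, w)]

wildcard = matching(r"c.t")   # exactly one character between c and t
optional = matching(r"ca?t")  # zero or one "a"
star = matching(r"ca*t")      # zero or more "a"
plus = matching(r"ca+t")      # one or more "a"

print(wildcard)  # ['cat', 'cbt']
print(optional)  # ['ct', 'cat']
print(star)      # ['ct', 'cat', 'caat']
print(plus)      # ['cat', 'caat']
```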

In [13]:
text = '''begin began begun beginning'''

In [14]:
# . matches any single character (like a 'wild card')
pattern = r"beg.n"

In [15]:
print(re.sub(pattern, "X", text))

X X X Xning


In [16]:
text = '''colour can be spelled color'''

In [17]:
# ? means previous character is optional
pattern = r"colou?r"

In [18]:
print(re.sub(pattern, "", text))

 can be spelled 


In [19]:
# * is the Kleene star, meaning match 0 or more of previous char
pattern = r"w.*"

In [20]:
text = '''Most of the time we can use white space.
But what about fred@gmail.com or 13/01/2021?
The students' attempts aren't working.
Maybe it's the use of apostrophes? Or it might need a more up-to-date model.
'''

In [21]:
print(re.sub(pattern, "", text))

Most of the time 
But 
The students' attempts aren't 
Maybe it's the use of apostrophes? Or it might need a more up-to-date model.



In [22]:
# make sure the match is non-greedy using the ? character
pattern = r"w.*? "
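
The non-greedy pattern is not applied in the notebook. As a minimal sketch, the cell below compares the greedy and non-greedy versions on one line of text (the `sample` string is an assumption, taken from the first words of the longer example).

```python
import re

# Hypothetical one-line sample
sample = "we can use white space."

# Greedy: .* runs as far as it can, so the match ends at the LAST space
greedy = re.sub(r"w.* ", "", sample)
print(greedy)      # space.

# Non-greedy: .*? stops at the FIRST space after each "w"
non_greedy = re.sub(r"w.*? ", "", sample)
print(non_greedy)  # can use space.
```

With the non-greedy form, each word starting with `w` is removed separately instead of everything from the first `w` to the last space.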

In [23]:
text = '''foo fooo foooo fooooo!'''

In [24]:
# + is the Kleene plus, meaning match 1 or more of previous char
pattern = r"fooo+"

In [25]:
print(re.sub(pattern, "", text))

foo   !


## 4. Aliases

#### \w - match a word character (letter, digit, or underscore)
#### \d - match a digit
#### \s - match a whitespace character
#### \W - match a non-word character
#### \D - match a non-digit
#### \S - match a non-whitespace character
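
A short sketch of all six aliases on one made-up string (the `sample` value is an assumption for illustration). Each upper-case alias matches exactly the characters that its lower-case partner does not.

```python
import re

# Hypothetical sample text
sample = "a1 b2!"

word_chars = re.findall(r"\w", sample)
non_word_chars = re.findall(r"\W", sample)
digits = re.findall(r"\d", sample)
non_digits = re.findall(r"\D", sample)
spaces = re.findall(r"\s", sample)
non_spaces = re.findall(r"\S", sample)

print(word_chars)      # ['a', '1', 'b', '2']
print(non_word_chars)  # [' ', '!']
print(digits)          # ['1', '2']
print(non_digits)      # ['a', ' ', 'b', '!']
print(spaces)          # [' ']
print(non_spaces)      # ['a', '1', 'b', '2', '!']
```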

In [26]:
text = '''Most of the time we can use white space.
But what about fred@gmail.com or 13/01/2021?
The students' attempts aren't working.
Maybe it's the use of apostrophes? Or it might need a more up-to-date model.
'''

In [27]:
pattern = r"\w"

In [28]:
# delete all word characters
print(re.sub(pattern, "", text))

        .
   @.  //?
 '  ' .
 '    ?       -- .



In [29]:
pattern = r"\d"

In [30]:
# delete all digit characters
print(re.sub(pattern, "", text))

Most of the time we can use white space.
But what about fred@gmail.com or //?
The students' attempts aren't working.
Maybe it's the use of apostrophes? Or it might need a more up-to-date model.



## 5. Anchors

In [31]:
# delete all words
pattern = r'\w+'

In [32]:
print(re.sub(pattern, "", text))

        .
   @.  //?
 '  ' .
 '    ?       -- .



In [33]:
# delete only the word at the start of the string
pattern = r'^\w+'

In [34]:
print(re.sub(pattern, "", text))

 of the time we can use white space.
But what about fred@gmail.com or 13/01/2021?
The students' attempts aren't working.
Maybe it's the use of apostrophes? Or it might need a more up-to-date model.



In [35]:
# switch on multiline mode to delete words at the start of each line
print(re.sub(pattern, "", text, flags=re.MULTILINE))

 of the time we can use white space.
 what about fred@gmail.com or 13/01/2021?
 students' attempts aren't working.
 it's the use of apostrophes? Or it might need a more up-to-date model.



In [36]:
# use $ to anchor the match at the end of a string
pattern = r'\W$'

In [37]:
# delete non-word characters at the end of the string
print(re.sub(pattern, "", text))

Most of the time we can use white space.
But what about fred@gmail.com or 13/01/2021?
The students' attempts aren't working.
Maybe it's the use of apostrophes? Or it might need a more up-to-date model


In [38]:
# switch on multiline mode to delete non-words at the end of each line
print(re.sub(pattern, "", text, flags=re.MULTILINE))

Most of the time we can use white space
But what about fred@gmail.com or 13/01/2021
The students' attempts aren't working
Maybe it's the use of apostrophes? Or it might need a more up-to-date model
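
The anchor behaviour can be summarised on a smaller input. The sketch below assumes a hypothetical two-line `sample` and prints with `repr` so the newlines stay visible. The last call illustrates why the default-mode output above loses both the final full stop and the final newline: `$` matches at the very end of the string and also just before a trailing newline.

```python
import re

# Hypothetical two-line sample (no trailing newline)
sample = "one two\nthree four"

# ^ anchors at the start of the string, or of every line with re.MULTILINE
start_default = re.sub(r"^\w+", "X", sample)
start_multi = re.sub(r"^\w+", "X", sample, flags=re.MULTILINE)
print(repr(start_default))  # 'X two\nthree four'
print(repr(start_multi))    # 'X two\nX four'

# $ anchors at the end of the string, or of every line with re.MULTILINE
end_default = re.sub(r"\w+$", "X", sample)
end_multi = re.sub(r"\w+$", "X", sample, flags=re.MULTILINE)
print(repr(end_default))    # 'one two\nthree X'
print(repr(end_multi))      # 'one X\nthree X'

# $ also matches just before a trailing newline, so both "." and "\n" go
trailing = re.sub(r"\W$", "", "end.\n")
print(repr(trailing))       # 'end'
```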
