# Regular expressions

- A formal language for specifying patterns over text strings (character sequences)
- Used for pattern matching (e.g. searching & replacing in text)

1. Disjunctions
2. Negation
3. Optionality
4. Aliases
5. Anchors

# 1. Disjunctions

In [1]:
import re

In [2]:
text = '''Most of the time we can use white space.
But what about fred@gmail.com or 13/01/2021?
The students' attempts aren't working.
Maybe it's the use of apostrophes? or it might need a more up-to-date model.
'''

In [3]:
pattern = r'the'  # literal sequence: also matches inside longer words

In [14]:
# This cell ran after In [13], so the output below is for r'of|the|we'.
print(re.sub(pattern, "X", text))

Most X X time X can use white space.
But what about fred@gmail.com or 13/01/2021?
The students' attempts aren't working.
Maybe it's X use X apostrophes? or it might need a more up-to-date model.



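In [7]:
pattern = r'[Tt]he'  # character class: 'T' or 't', followed by 'he'

In [9]:
pattern = r'[0-9]'  # range: any single digit

In [11]:
pattern = r"[A-Z]"  # range: any single uppercase letter

In [13]:
pattern = r'of|the|we'  # alternation: any one of 'of', 'the' or 'we'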
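
To see what each of the patterns above matches, `re.findall` can list the matches instead of replacing them. This is a minimal sketch; the `sample` string and the variable names are made up for illustration and are not part of the notebook.

```python
import re

# Hypothetical sample string, chosen so most patterns have something to match.
sample = "The cat saw the other cat in 2021, then we left."

# Literal sequence: also matches inside "other" and "then", but not "The".
the_literal = re.findall(r'the', sample)

# Character class: [Tt] accepts either case for the first letter.
the_any_case = re.findall(r'[Tt]he', sample)

# Ranges inside a class: any single digit, any single uppercase letter.
digits = re.findall(r'[0-9]', sample)
capitals = re.findall(r'[A-Z]', sample)

# Alternation: any one of the listed alternatives ('of' does not occur here).
words = re.findall(r'of|the|we', sample)

print(the_literal)
print(the_any_case)
print(digits)
print(capitals)
print(words)
```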

# 2. Negation

In [15]:
text = '''Most of the time we can use white space.
But what about fred@gmail.com or 13/01/2021?
The students' attempts aren't working.
Maybe it's the use of apostrophes? or it might need a more up-to-date model.
'''

In [16]:
pattern = r"[^0-9]"  # '^' first inside [] negates the class: any character that is not a digit

In [20]:
# This cell ran after In [18], so the output below is for r"[^a-z]".
print(re.sub(pattern, " ", text))

 ost of the time we can use white space   ut what about fred gmail com or              he students  attempts aren t working   aybe it s the use of apostrophes  or it might need a more up to date model  


In [18]:
pattern = r"[^a-z]"  # any character that is not a lowercase letter (newlines included)
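
The two negated classes in this section can be compared side by side on a single line of the text. This is a short sketch; the `line` variable and the result names are introduced here only for illustration.

```python
import re

line = "But what about fred@gmail.com or 13/01/2021?"

# [^0-9] matches every character that is NOT a digit,
# so deleting those matches leaves only the digits.
only_digits = re.sub(r"[^0-9]", "", line)

# [^a-z] matches every character that is NOT a lowercase letter,
# so capitals, digits, spaces and punctuation all become spaces.
only_lower = re.sub(r"[^a-z]", " ", line)

print(only_digits)
print(only_lower)
```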

# 3. Optionality

In [21]:
text = '''begin began begun beginning'''

In [22]:
pattern = r"beg.n"  # '.' matches any single character (except a newline)

In [23]:
print(re.sub(pattern, "X", text))

X X X Xning


In [24]:
text = '''colour can be spelled color'''

In [25]:
pattern = r'colou?r'  # '?' makes the preceding 'u' optional

In [26]:
print(re.sub(pattern, "", text))

 can be spelled 
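
Listing the matches rather than replacing them makes the effect of `.` and `?` easier to see. This is a minimal sketch reusing the two strings above; the result names are illustrative.

```python
import re

# '.' stands for exactly one character, so the vowel between 'beg' and 'n' can vary.
wildcard_matches = re.findall(r"beg.n", "begin began begun beginning")

# '?' makes the preceding 'u' optional, so both spellings match.
optional_matches = re.findall(r"colou?r", "colour can be spelled color")

print(wildcard_matches)  # ['begin', 'began', 'begun', 'begin']
print(optional_matches)  # ['colour', 'color']
```

The last `'begin'` comes from the first five letters of `beginning`, which is why the substitution above leaves `Xning`.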


In [27]:
pattern = r"w.*"  # greedy: from the first 'w' to the end of the line

In [28]:
text = '''Most of the time we can use white space.
But what about fred@gmail.com or 13/01/2021?
The students' attempts aren't working.
Maybe it's the use of apostrophes? or it might need a more up-to-date model.
'''

In [33]:
# This cell ran after In [32], so the output below is for r"w.*? ".
print(re.sub(pattern, "", text))

Most of the time can use space.
But about fred@gmail.com or 13/01/2021?
The students' attempts aren't working.
Maybe it's the use of apostrophes? or it might need a more up-to-date model.



In [32]:
pattern = r"w.*? "  # lazy: from each 'w' up to and including the next space
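
The difference between the greedy `w.*` and the lazy `w.*? ` is easiest to see on one line. This is a small sketch; `line`, `greedy` and `lazy` are names introduced here for illustration.

```python
import re

line = "Most of the time we can use white space."

# Greedy: '.*' consumes as much as possible, so everything from the
# first 'w' to the end of the line is removed.
greedy = re.sub(r"w.*", "", line)

# Lazy: '.*?' consumes as little as possible, so each match stops at the
# first space after a 'w' and only that word is removed.
lazy = re.sub(r"w.*? ", "", line)

print(repr(greedy))  # 'Most of the time '
print(repr(lazy))    # 'Most of the time can use space.'
```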

In [34]:
text = '''foo fooo foooo foooooo!'''

In [37]:
pattern = r'fooo+'  # '+' means one or more of the preceding 'o', so at least three o's in total

In [38]:
print(re.sub(pattern, "", text))

foo   !


# 4. Aliases

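\w - matches a word character (letter, digit or underscore) <br>
\d - matches a digit <br>
\s - matches a whitespace character (space, tab, newline) <br>
\W - matches any character that is not a word character <br>
\D - matches any character that is not a digit <br>
\S - matches any character that is not whitespace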
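
A quick way to check what each alias covers is to list its matches on a short string. This is a sketch only; the `sample` string and the result names are made up for illustration.

```python
import re

sample = "fred@gmail.com 13/01/2021"

word_runs = re.findall(r"\w+", sample)   # runs of letters, digits or underscores
digit_runs = re.findall(r"\d+", sample)  # runs of digits
spaces = re.findall(r"\s", sample)       # single whitespace characters
non_word = re.findall(r"\W", sample)     # everything that \w does not match

print(word_runs)
print(digit_runs)
print(spaces)
print(non_word)
```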

In [39]:
text = '''Most of the time we can use white space.
But what about fred@gmail.com or 13/01/2021?
The students' attempts aren't working.
Maybe it's the use of apostrophes? or it might need a more up-to-date model.
'''

In [40]:
pattern = r'\w'

In [41]:
print(re.sub(pattern, "", text))

        .
   @.  //?
 '  ' .
 '    ?       -- .



In [44]:
pattern = r"\d"

In [45]:
print(re.sub(pattern, "", text))

Most of the time we can use white space.
But what about fred@gmail.com or //?
The students' attempts aren't working.
Maybe it's the use of apostrophes? or it might need a more up-to-date model.



# 5. Anchors

In [46]:
pattern = r'\w+'  # raw string, so the backslash reaches the regex engine unchanged

In [47]:
print(re.sub(pattern, "", text))

        .
   @.  //?
 '  ' .
 '    ?       -- .



In [48]:
pattern = r'^\w+'  # '^' anchors the match to the start of the string

In [49]:
print(re.sub(pattern, "", text))

 of the time we can use white space.
But what about fred@gmail.com or 13/01/2021?
The students' attempts aren't working.
Maybe it's the use of apostrophes? or it might need a more up-to-date model.



In [50]:
print(re.sub(pattern, "", text, flags=re.MULTILINE))

 of the time we can use white space.
 what about fred@gmail.com or 13/01/2021?
 students' attempts aren't working.
 it's the use of apostrophes? or it might need a more up-to-date model.



In [51]:
pattern = r'\W$'  # '$' anchors the match to the end of the string

In [52]:
print(re.sub(pattern, "", text))

Most of the time we can use white space.
But what about fred@gmail.com or 13/01/2021?
The students' attempts aren't working.
Maybe it's the use of apostrophes? or it might need a more up-to-date model


In [53]:
print(re.sub(pattern, "", text, flags=re.MULTILINE))

Most of the time we can use white space
But what about fred@gmail.com or 13/01/2021
The students' attempts aren't working
Maybe it's the use of apostrophes? or it might need a more up-to-date model
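
The effect of `re.MULTILINE` on `^` and `$` can be checked on a two-line string. This is a minimal sketch; `two_lines` and the result names are hypothetical and not part of the notebook.

```python
import re

two_lines = "first line.\nsecond line."

# Default: '^' matches only at the very start of the string.
start_default = re.findall(r"^\w+", two_lines)

# re.MULTILINE: '^' also matches immediately after each newline.
start_multiline = re.findall(r"^\w+", two_lines, flags=re.MULTILINE)

# Default: '$' matches only at the end of the string
# (or just before a trailing newline, which this string lacks).
end_default = re.findall(r"\W$", two_lines)

# re.MULTILINE: '$' also matches just before each newline.
end_multiline = re.findall(r"\W$", two_lines, flags=re.MULTILINE)

print(start_default, start_multiline)
print(end_default, end_multiline)
```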


In [54]:
word = 'colour color'

In [56]:
# '?' follows the second 'o', so the 'u' is still required and 'color' is not matched.
print(re.sub(r'colo?ur', "", word))

 color
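
The position of `?` decides which character becomes optional. The sketch below contrasts the pattern above with `colou?r` from section 3, assuming the aim is to match both spellings; the result names are illustrative.

```python
import re

word = 'colour color'

# 'colo?ur': the second 'o' is optional, the 'u' is required.
# Matches 'colour' (and would match 'colur'), but not 'color'.
misplaced = re.sub(r'colo?ur', "", word)

# 'colou?r': the 'u' is optional, so both 'colour' and 'color' match.
both_spellings = re.sub(r'colou?r', "", word)

print(repr(misplaced))       # ' color'
print(repr(both_spellings))  # ' '
```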
