# Mini Intro to Regular Expressions
by Dr Liang Jin

Part of Mini Python Sessions: [github.com/drliangjin/minipy](https://github.com/drliangjin/minipy)

Official Python `re` docs: [docs.python.org](https://docs.python.org/3/library/re.html)

### Python built-in `re` module

Regular expressions are a powerful tool for:
- searching text
- matching text
- extracting text
- manipulating text

In [None]:
import re

### Task: find the telephone number in `LUMS: 01524 510752`

In [None]:
easy_task = "LUMS: 01524 510752"

In [None]:
# slice the string
tel = easy_task[6:]
tel

#### What if there are thousands of lines of strings with different formats?

In [None]:
# \d => 1 digit (0 to 9)
tel = re.search(r'\d\d\d\d\d \d\d\d\d\d\d', easy_task)
print(tel)
print(tel.group(0))
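
`re.search` returns `None` when nothing matches, so calling `.group()` on the result directly can raise an `AttributeError`. Below is a minimal sketch of guarding against that; `find_tel` is a hypothetical helper name used only for illustration.

```python
import re

def find_tel(text):
    """Return the first 'ddddd dddddd' number in text, or None if absent."""
    match = re.search(r'\d\d\d\d\d \d\d\d\d\d\d', text)
    # re.search returns None when there is no match
    return match.group(0) if match else None

print(find_tel("LUMS: 01524 510752"))
print(find_tel("LUMS: no number here"))
```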

What is `r'...'`?

In [None]:
print('hello\nworld')

In [None]:
print(r'hello\nworld')
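
In a raw string the backslash is kept as a literal character, which is why patterns such as `\d` are usually written with the `r` prefix. A small sketch of the difference (the variable names are illustrative only):

```python
# '\n' is a single newline character; r'\n' is a backslash followed by 'n'
normal = 'hello\nworld'
raw = r'hello\nworld'

print(len(normal), len(raw))

# without the r prefix, every backslash in a pattern has to be doubled
same_pattern = ('\\d\\d' == r'\d\d')
print(same_pattern)
```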

#### Different ways to format a string?

In [None]:
print('We can use re to extract "%s" from "%s"' % (tel.group(0), easy_task))

In [None]:
print('We can use re to extract "{}" from "{}"'.format(tel.group(0), easy_task))

In [None]:
# my new love!
print(f'We can use re to extract "{tel.group(0)}" from "{easy_task}"')

### Task: area code and main number? `LUMS: 01524 510752`

In [None]:
# again, we can slice the string using indices
# (the area code sits at 6..10, the space at 11, the main number from 12)
print(easy_task[6:11], easy_task[12:])

In [None]:
tel = re.search(r'(\d\d\d\d\d) (\d\d\d\d\d\d)', easy_task)
print(tel.group(1))
print(tel.group(2))

In [None]:
areaCode, mainNumber = tel.groups()
print(areaCode, mainNumber)

What if there are special characters such as `( )`?

### Task: deal with special characters in `(0)1524 510752`

In [None]:
hard_task = "LUMS: (0)1524 510752"

In [None]:
# escape special characters with a backslash `\`, e.g. parentheses as `\(` and `\)`
tel = re.search(r'(\(\d\)\d\d\d\d) (\d\d\d\d\d\d)', hard_task)

areaCode, mainNumber = tel.groups()

print(areaCode, mainNumber)

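NOTE: besides `\d`, other shorthand classes include `\D` (non-digit), `\w` / `\W` (word / non-word character), and `\s` / `\S` (whitespace / non-whitespace). <br>
NOTE: like `()`, other special characters such as `|`, `[]`, and `{}` must be escaped with `\` to be matched literally.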
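
As a rough illustration of these two notes: the uppercase classes negate the lowercase ones, and `re.escape` from the standard library adds the backslashes needed to match a piece of text literally. The sample string is made up for this sketch.

```python
import re

sample = "Room 42_b: (0)1524"

digits = re.findall(r'\d', sample)       # digits only
non_digits = re.findall(r'\D', sample)   # every character that is not a digit
spaces = re.findall(r'\s', sample)       # whitespace characters

print(digits)
print(len(non_digits), len(spaces))

# re.escape returns a pattern that matches the given text literally
literal = re.escape("(0)1524")
print(literal)
print(re.search(literal, sample).group())
```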

### Task: match with alternatives using `|` and `?`

In [None]:
easy_task = "LUMS: 01524 510752"
hard_task = "LUMS: (0)1524 510752"

In [None]:
# create a pattern using a vertical bar to indicate alternatives
pattern1 = re.compile(r'(\d\d\d\d\d|\(\d\)\d\d\d\d) (\d\d\d\d\d\d)')

easy_areaCode1, easy_mainNumber1 = pattern1.search(easy_task).groups()
hard_areaCode1, hard_mainNumber1 = pattern1.search(hard_task).groups()

print(easy_areaCode1, hard_areaCode1)

In [None]:
# optional character using ?
pattern2 = re.compile(r'(\(?\d\)?\d\d\d\d) (\d\d\d\d\d\d)')

easy_areaCode2, easy_mainNumber2 = pattern2.search(easy_task).groups()
hard_areaCode2, hard_mainNumber2 = pattern2.search(hard_task).groups()

print(easy_areaCode2, hard_areaCode2)
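
The two patterns are not quite equivalent. Each `?` is independent, so `pattern2` will also capture an unbalanced parenthesis, whereas the alternation in `pattern1` accepts only the two exact forms. A small sketch with a made-up malformed input:

```python
import re

pattern1 = re.compile(r'(\d\d\d\d\d|\(\d\)\d\d\d\d) (\d\d\d\d\d\d)')
pattern2 = re.compile(r'(\(?\d\)?\d\d\d\d) (\d\d\d\d\d\d)')

malformed = "LUMS: (01524 510752"  # opening parenthesis is never closed

area1 = pattern1.search(malformed).group(1)  # skips the stray parenthesis
area2 = pattern2.search(malformed).group(1)  # captures the stray parenthesis

print(area1)
print(area2)
```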

### Regular Expression special characters

#### for a character
- `\d`: any numeric digit, same as `[0-9]`
- `\w`: any word character: a letter, a digit, or an underscore
- `\s`: any whitespace character: space, tab, or newline
- `.` : any character except newline (`\n`); the wildcard character

In [None]:
# match different separator formats
tels = ['01524 510752', '01524-510752', '01524.510752', '01524  510752', '01524510752']

In [None]:
# a dot example
pattern = re.compile(r'\d\d\d\d\d.\d\d\d\d\d\d')

for tel in tels:
    match = pattern.search(tel)
    if match:
        print("Found: {}".format(match.group()))
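
Because `.` matches any character, the dot pattern would also accept a letter as the separator. One stricter alternative is a character set; the sketch below assumes only space, dot, and hyphen are valid separators, and the `x` example is made up.

```python
import re

loose = re.compile(r'\d\d\d\d\d.\d\d\d\d\d\d')        # any single character as separator
strict = re.compile(r'\d\d\d\d\d[ .-]\d\d\d\d\d\d')   # only space, dot, or hyphen

candidates = ['01524 510752', '01524-510752', '01524.510752', '01524x510752']

loose_hits = [c for c in candidates if loose.search(c)]
strict_hits = [c for c in candidates if strict.search(c)]

print(loose_hits)
print(strict_hits)
```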

#### for a pattern
- `*`: zero or more of the preceding element; a common combination is `.*`
- `+`: one or more of the preceding element, e.g. `\d+` for at least one digit
- `?`: zero or one of the preceding element, i.e. the element is optional
- `^`: anchors the match at the start of the string
- `$`: anchors the match at the end of the string
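
A quick sketch of each symbol in the list above, using made-up strings:

```python
import re

star = re.findall(r'ab*', 'a ab abbb')              # 'b' zero or more times
plus = re.findall(r'ab+', 'a ab abbb')              # 'b' one or more times
optional = re.findall(r'colou?r', 'color colour')   # 'u' is optional

print(star)
print(plus)
print(optional)

# anchors match positions rather than characters
starts = bool(re.search(r'^015', '01524 510752'))   # begins with 015
ends = bool(re.search(r'752$', '01524 510752'))     # ends with 752
print(starts, ends)
```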

In [None]:
# match telephone numbers without a country code
tels = ['+44 01524 510752', '44 01524-510752', '01524.510752', '01524510752', '(0)1524 510752']

In [None]:
# a caret example
pattern = re.compile(r'^\(?0\)?\d+.?\d+')

for tel in tels:
    match = pattern.search(tel)
    if match:
        print("Found: {}".format(match.group()))

#### for a group or a set
- `()`: groups part of a pattern and captures the text it matches
- `[]`: a set of characters, any one of which may match
- `[^ ]`: a negated set; matches any one character that is not listed
- `|` : separates alternatives, as in the earlier example
- `{}`: how many times the preceding element repeats, e.g. `\d{5}` or `\d{5,6}`
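
A short sketch of sets and repetition counts on a made-up string:

```python
import re

text = "cat bat rat 2024 07 5"

in_set = re.findall(r'[cb]at', text)         # 'c' or 'b' followed by 'at'
not_in_set = re.findall(r'[^cb]at', text)    # any other character followed by 'at'
four_digits = re.findall(r'\d{4}', text)     # exactly four digits
two_to_four = re.findall(r'\d{2,4}', text)   # between two and four digits

print(in_set, not_in_set)
print(four_digits, two_to_four)
```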

In [None]:
# only match gentlemen (without errors)
names = ['Mr Xi',
         'Mr. Trump', 
         'Mr Trump', 
         'Ms Trump', 
         'Mrs. Trump',
         'Mr rump',
         'Mr. T']

In [None]:
pattern = re.compile(r'Mr\.?\s[A-Z]\w+')

for name in names:
    match = pattern.search(name)
    if match:
        print(match.group())

In [None]:
# only match animals
words = ['hog', 'dog', 'bog']

In [None]:
pattern = re.compile(r'[^b]og')
for word in words:
    match = pattern.search(word)
    if match:
        print(match.group())

In [None]:
# goal: match only the first two entries, whose last block has 5 or 6 digits
words = ['+44 (0)1524 65201',
         '+44 (0)1524 510752',
         '+44 (0)1524 99999999',
         '+44 (0)1524 9999']

In [None]:
pattern = re.compile(r'.*\s\d{5,6}')  # problematic: also matches the first six digits of '99999999'

for word in words:
    match = pattern.search(word)
    if match:
        print(match.group())
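
One possible fix for the pattern flagged as problematic, assuming the goal is to accept only numbers whose final block has exactly 5 or 6 digits, is to anchor the end with `$` so that no digits can be left over. The names `loose` and `anchored` are illustrative.

```python
import re

words = ['+44 (0)1524 65201',
         '+44 (0)1524 510752',
         '+44 (0)1524 99999999',
         '+44 (0)1524 9999']

# without `$`, '99999999' still matches on its first six digits
loose = re.compile(r'.*\s\d{5,6}')
# with `$`, the 5 or 6 digits must run to the end of the string
anchored = re.compile(r'.*\s\d{5,6}$')

loose_hits = [w for w in words if loose.search(w)]
anchored_hits = [w for w in words if anchored.search(w)]

print(loose_hits)
print(anchored_hits)
```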

### Other Regular Expression functions

### `re.search()` VS `re.findall()`

In [None]:
three_tels = """LUMS general office: +44 (0)1524 510752
Undergraduate enquiries: +44 (0)1524 592938
Postgraduate enquiries: +44 (0)1524 510733"""

In [None]:
pattern = re.compile(r'(\+\d{2})\s(\(?0\)?\d{4})\s(\d{5,6})')

In [None]:
# search returns only the first match and ignores the rest
match = pattern.search(three_tels)
print(match)

In [None]:
# findall returns a list of all matches (tuples of groups here, since the pattern has groups)
matches = pattern.findall(three_tels)
print(matches)
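
When the pattern contains groups, `findall` returns one tuple of groups per match rather than the full matched text. If the full match is needed, `finditer` yields match objects instead. A minimal sketch using the same data:

```python
import re

three_tels = """LUMS general office: +44 (0)1524 510752
Undergraduate enquiries: +44 (0)1524 592938
Postgraduate enquiries: +44 (0)1524 510733"""

pattern = re.compile(r'(\+\d{2})\s(\(?0\)?\d{4})\s(\d{5,6})')

# one tuple of captured groups per match
tuples = pattern.findall(three_tels)
print(tuples[0])

# finditer yields match objects, so group(0) gives the whole number
full_numbers = [m.group(0) for m in pattern.finditer(three_tels)]
print(full_numbers)
```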

### `re.sub()` for substituting strings

In [None]:
top_secret = """Classified, Max clearance level, eyes only: 
Agent Liang passes the extremely secret documents to Special Agent George.
After 15 sec, this notebook will explodeeee!"""

In [None]:
# let's censor the document log
pattern = re.compile(r'Agent\s\w+')

protected_secret = pattern.sub('YOUKNOWWHO', top_secret)

print(protected_secret)

In [None]:
# keep only each agent's initial
pattern = re.compile(r'Agent\s(\w)\w*')

censored_secret = pattern.sub(r'\1*****', top_secret)

print(censored_secret)
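
`sub` also accepts a `count` argument and, instead of a replacement string, a function that receives each match object. The sketch below uses a shortened made-up log and a hypothetical `mask` helper that keeps the initial and stars out the rest of the name.

```python
import re

log = "Agent Liang met Agent George."

pattern = re.compile(r'Agent\s(\w)(\w*)')

# count=1 replaces only the first occurrence
first_only = pattern.sub('YOUKNOWWHO', log, count=1)
print(first_only)

def mask(match):
    # keep the initial, replace each remaining letter with '*'
    return match.group(1) + '*' * len(match.group(2))

masked = pattern.sub(mask, log)
print(masked)
```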

### Example: swapping name order

In [None]:
name1 = 'Firstname Lastname'
name2 = 'Lastname, Firstname'

In [None]:
pattern1 = re.compile(r'([A-Z]\w*)\s([A-Z]\w*)')
swapped_name1 = pattern1.sub(r'\2, \1', name1)
print(swapped_name1)

In [None]:
pattern2 = re.compile(r'([A-Z]\w*),\s([A-Z]\w*)')
swapped_name2 = pattern2.sub(r'\1 \2', name2)
print(swapped_name2)
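
Numbered backreferences get hard to read as patterns grow. Named groups are one alternative; the sketch below uses `(?P<name>...)` in the pattern and `\g<name>` in the replacement, with group names `first` and `last` chosen for illustration.

```python
import re

pattern = re.compile(r'(?P<first>[A-Z]\w*)\s(?P<last>[A-Z]\w*)')

# refer to groups by name in the replacement string
swapped = pattern.sub(r'\g<last>, \g<first>', 'Firstname Lastname')
print(swapped)

# named groups are also available on the match object
match = pattern.search('Firstname Lastname')
parts = match.groupdict()
print(match.group('first'), parts)
```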