# Regular Expressions and the Python re Module

## Regular Expression (Regex) Basics

- `.`       -- Any Character Except New Line \n
- `\d`      -- Digit (0-9)
- `\D`      -- Not a Digit (0-9)
- `\w`      -- Word Character (a-z, A-Z, 0-9, _)
- `\W`      -- Not a Word Character
- `\s`      -- Whitespace (space, tab, newline)
- `\S`      -- Not Whitespace (space, tab, newline)
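
As a quick illustration of the character classes above, here is a minimal sketch using `re.findall`. The sample string is made up for this example and is not part of the original notes.

```python
import re

# Hypothetical sample string, made up for this illustration
sample = "Room 42, floor_3\tEast"

digits = re.findall(r'\d', sample)            # each single digit
word_runs = re.findall(r'\w+', sample)        # runs of word characters
non_space_runs = re.findall(r'\S+', sample)   # runs of non-whitespace
dots = re.findall(r'.', 'a\nb')               # '.' does not match the newline

print(digits)          # ['4', '2', '3']
print(word_runs)       # ['Room', '42', 'floor_3', 'East']
print(non_space_runs)  # ['Room', '42,', 'floor_3', 'East']
print(dots)            # ['a', 'b']
```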


- `\b`      -- Word Boundary
- `\B`      -- Not a Word Boundary
- `^`       -- Beginning of a String
- `$`       -- End of a String
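
By default `^` and `$` anchor to the string as a whole, not to each line. The sketch below, on a made-up two-line string, shows that behaviour and how the `re.MULTILINE` flag changes it, alongside a word-boundary match.

```python
import re

# Hypothetical two-line sample, made up for this illustration
lines = "first line\nsecond line"

start_default = re.findall(r'^\w+', lines)                 # start of the string only
start_per_line = re.findall(r'^\w+', lines, re.MULTILINE)  # start of every line
whole_words = re.findall(r'\bline\b', lines)               # 'line' as a whole word

print(start_default)   # ['first']
print(start_per_line)  # ['first', 'second']
print(whole_words)     # ['line', 'line']
```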


- `[]`      -- Matches ONE of the Characters in brackets (Character set)
- `[^ ]`    -- Matches ONE Character NOT in brackets
- `|`       -- Either Or
- `( )`     -- Group
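
A small sketch of character sets, negated sets, alternation, and groups, assuming a made-up word list:

```python
import re

# Hypothetical sample, made up for this illustration
words = "cat bat rat hat"

in_set = re.findall(r'[cb]at', words)       # 'c' or 'b', then 'at'
not_in_set = re.findall(r'[^cb]at', words)  # any character except 'c' or 'b', then 'at'
either = re.findall(r'cat|hat', words)      # alternation between two alternatives
first_letter = re.search(r'(c|h)at', words).group(1)  # text captured by the group

print(in_set)        # ['cat', 'bat']
print(not_in_set)    # ['rat', 'hat']
print(either)        # ['cat', 'hat']
print(first_letter)  # c
```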

Quantifiers (placed after any of the elements above):
- `*`       -- 0 or More
- `+`       -- 1 or More
- `?`       -- 0 or One
- `{3}`     -- Exact Number of repetitions
- `{3,4}`   -- Range of repetitions (Minimum, Maximum)
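
Each quantifier applies to the element just before it. A minimal sketch with made-up inputs:

```python
import re

star = re.findall(r'ab*', 'a ab abbb')           # 'a' followed by zero or more 'b'
plus = re.findall(r'ab+', 'a ab abbb')           # 'a' followed by one or more 'b'
optional = re.findall(r'colou?r', 'color colour')  # the 'u' is optional
exact = re.findall(r'\d{3}', '12 345 6789')      # exactly three digits
ranged = re.findall(r'\d{3,4}', '12 345 6789')   # three or four digits, as many as possible

print(star)      # ['a', 'ab', 'abbb']
print(plus)      # ['ab', 'abbb']
print(optional)  # ['color', 'colour']
print(exact)     # ['345', '678']
print(ranged)    # ['345', '6789']
```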

Metacharacters (escape with a backslash to match them literally):
- `.[{()\^$|?*+`
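
To see why escaping matters, the sketch below compares an unescaped and an escaped dot on a made-up string. It also uses `re.escape`, a standard-library helper not covered in the notes above, which escapes a literal string automatically.

```python
import re

# Hypothetical sample, made up for this illustration
text = "price: 3.50 (approx) vs 3x50"

unescaped = re.findall(r'3.50', text)   # '.' matches any character, so '3x50' also matches
escaped = re.findall(r'3\.50', text)    # '\.' matches a literal dot only

# re.escape adds the backslashes for us
literal = re.escape('(approx)')
found = re.search(literal, text).group()

print(unescaped)  # ['3.50', '3x50']
print(escaped)    # ['3.50']
print(found)      # (approx)
```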

## Python re Module

**Standard procedure:**

1. `re.compile(r'regex', flags)` : create a compiled regex pattern
    - `flags` is optional, e.g. `re.IGNORECASE` (or `re.I`) to ignore case
2. `finditer(text_to_search)` : return an iterator over all non-overlapping matches as match objects, which carry location info (span)
3. `findall(text_to_search)` : return all matches as a list of strings; if the pattern contains groups, return only the groups (as tuples when there is more than one group)
4. `match(text_to_search)` : return a match only if the pattern matches at the beginning of the string, otherwise `None`
5. `search(text_to_search)` : return the first match anywhere in the string, otherwise `None`
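
The procedure above can be sketched end to end as follows. The `text` string is a made-up sample, and `re.IGNORECASE` is used only to show where the flag goes.

```python
import re

# Hypothetical sample, made up for this illustration
text = "Cat, cat, CAT and dog"

pattern = re.compile(r'cat', re.IGNORECASE)

spans = [m.span() for m in pattern.finditer(text)]  # location of every match
all_matches = pattern.findall(text)                 # matched strings only
at_start = pattern.match(text)                      # must match at index 0
first = pattern.search(text)                        # first match anywhere
no_match = pattern.match("dog and cat")             # 'cat' is not at the beginning

print(spans)             # [(0, 3), (5, 8), (10, 13)]
print(all_matches)       # ['Cat', 'cat', 'CAT']
print(at_start.group())  # Cat
print(first.span())      # (0, 3)
print(no_match)          # None
```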

## Demonstration of Regex with re Module

In [152]:
text_to_search = '''
abcdefghijklmnopqurtuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
1234567890

Ha HaHa

coreyms.com

321-555-4321
123.555.1234
123*555*1234
800-555-1234
900-555-1234

Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T

cat pat mat bat

https://www.google.com
http://coreyms.com
https://youtube.com
https://www.nasa.gov
'''

In [153]:
import re

def re_finditer(regex, text=text_to_search):
    """Print every match of `regex` in `text`, numbered from 1."""
    pattern = re.compile(regex)
    # Materialize the iterator once so it can be tested and then looped over
    matches = list(pattern.finditer(text))
    if not matches:
        print(regex + ': Nothing Matched')
    for count, match in enumerate(matches, start=1):
        print(regex + ' ' + str(count) + ': ' + str(match))
    print('\r')

In [121]:
# Simple example
re_finditer(r'abc')
re_finditer(r'ABC')
re_finditer(r'cba')

abc 1: <_sre.SRE_Match object; span=(1, 4), match='abc'>

ABC 1: <_sre.SRE_Match object; span=(28, 31), match='ABC'>

cba: Nothing Matched



In [112]:
# With word boundaries, on the line 'Ha HaHa'
re_finditer(r'\bHa')  # 1st and 2nd Ha
re_finditer(r'Ha\b')  # 1st and 3rd Ha
re_finditer(r'\bHa\b')  # 1st Ha
re_finditer(r'\BHa')  # 3rd Ha

\bHa 1: <_sre.SRE_Match object; span=(67, 69), match='Ha'>
\bHa 2: <_sre.SRE_Match object; span=(70, 72), match='Ha'>

Ha\b 1: <_sre.SRE_Match object; span=(67, 69), match='Ha'>
Ha\b 2: <_sre.SRE_Match object; span=(72, 74), match='Ha'>

\bHa\b 1: <_sre.SRE_Match object; span=(67, 69), match='Ha'>

\BHa 1: <_sre.SRE_Match object; span=(72, 74), match='Ha'>



In [113]:
# With beginning (^) and end ($) anchors
re_finditer(r'^Start', text = 'Start the life and go to the end')
re_finditer(r'^a', text = 'Start the life and go to the end')
re_finditer(r'end$', text = 'Start the life and go to the end')
re_finditer(r'the$', text = 'Start the life and go to the end')

^Start 1: <_sre.SRE_Match object; span=(0, 5), match='Start'>

^a: Nothing Matched

end$ 1: <_sre.SRE_Match object; span=(27, 30), match='end'>

the$: Nothing Matched



In [115]:
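# Phone numbers
re_finditer(r'\d\d\d.\d\d\d.\d\d\d\d')  # '.' accepts any separator, including '*'
re_finditer(r'\d{3}.\d{3}.\d{4}')  # same as above, using quantifiers
re_finditer(r'\d{3}[-.]\d{3}[-.]\d{4}')  # separator must be '-' or '.'
re_finditer(r'[89]00[-.]\d{3}[-.]\d{4}')  # only numbers starting with 800 or 900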

\d\d\d.\d\d\d.\d\d\d\d 1: <_sre.SRE_Match object; span=(140, 152), match='321-555-4321'>
\d\d\d.\d\d\d.\d\d\d\d 2: <_sre.SRE_Match object; span=(153, 165), match='123.555.1234'>
\d\d\d.\d\d\d.\d\d\d\d 3: <_sre.SRE_Match object; span=(166, 178), match='123*555*1234'>
\d\d\d.\d\d\d.\d\d\d\d 4: <_sre.SRE_Match object; span=(179, 191), match='800-555-1234'>
\d\d\d.\d\d\d.\d\d\d\d 5: <_sre.SRE_Match object; span=(192, 204), match='900-555-1234'>

\d{3}.\d{3}.\d{4} 1: <_sre.SRE_Match object; span=(140, 152), match='321-555-4321'>
\d{3}.\d{3}.\d{4} 2: <_sre.SRE_Match object; span=(153, 165), match='123.555.1234'>
\d{3}.\d{3}.\d{4} 3: <_sre.SRE_Match object; span=(166, 178), match='123*555*1234'>
\d{3}.\d{3}.\d{4} 4: <_sre.SRE_Match object; span=(179, 191), match='800-555-1234'>
\d{3}.\d{3}.\d{4} 5: <_sre.SRE_Match object; span=(192, 204), match='900-555-1234'>

\d{3}[-.]\d{3}[-.]\d{4} 1: <_sre.SRE_Match object; span=(140, 152), match='321-555-4321'>
\d{3}[-.]\d{3}[-.]\d{4} 2: <_sre.SRE_Match 

In [122]:
# Words ending in 'at' whose first letter is not 'b'
re_finditer(r'[^b]at')

[^b]at 1: <_sre.SRE_Match object; span=(257, 260), match='cat'>
[^b]at 2: <_sre.SRE_Match object; span=(261, 264), match='pat'>
[^b]at 3: <_sre.SRE_Match object; span=(265, 268), match='mat'>



In [182]:
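# Title + name
# Caution: [r|s|rs] is a character set matching ONE of 'r', '|' or 's',
# so 'Mrs. Robinson' is not matched; a group (r|s|rs) would be needed for that
re_finditer(r'M[r|s|rs]\.?\s[A-Z]\w*')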

M[r|s|rs]\.?\s[A-Z]\w* 1: <_sre.SRE_Match object; span=(155, 166), match='Mr. Schafer'>
M[r|s|rs]\.?\s[A-Z]\w* 2: <_sre.SRE_Match object; span=(167, 175), match='Mr Smith'>
M[r|s|rs]\.?\s[A-Z]\w* 3: <_sre.SRE_Match object; span=(176, 184), match='Ms Davis'>
M[r|s|rs]\.?\s[A-Z]\w* 4: <_sre.SRE_Match object; span=(199, 204), match='Mr. T'>
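
`Mrs. Robinson` is missing from the output above because square brackets define a character set: `[r|s|rs]` matches a single `r`, `|`, or `s`. One possible fix, sketched here on a trimmed copy of the sample text, is to use a group with alternation instead.

```python
import re

# Trimmed copy of the names in text_to_search
titles = "Mr. Schafer\nMr Smith\nMs Davis\nMrs. Robinson\nMr. T"

# (r|s|rs) is a group with alternation, so 'rs' can match as a unit
pattern = re.compile(r'M(r|s|rs)\.?\s[A-Z]\w*')
names = [m.group(0) for m in pattern.finditer(titles)]

print(names)
# ['Mr. Schafer', 'Mr Smith', 'Ms Davis', 'Mrs. Robinson', 'Mr. T']
```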



In [163]:
## Extracting the domain and suffix from URLs
# Finding them
re_finditer(r'https?://(www\.)?(\w+)(\.\w+)')

https?://(www\.)?(\w+)(\.\w+) 1: <_sre.SRE_Match object; span=(223, 245), match='https://www.google.com'>
https?://(www\.)?(\w+)(\.\w+) 2: <_sre.SRE_Match object; span=(246, 264), match='http://coreyms.com'>
https?://(www\.)?(\w+)(\.\w+) 3: <_sre.SRE_Match object; span=(265, 284), match='https://youtube.com'>
https?://(www\.)?(\w+)(\.\w+) 4: <_sre.SRE_Match object; span=(285, 305), match='https://www.nasa.gov'>



In [179]:
# Understand groups
pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')
matches = pattern.finditer(text_to_search)
for match in matches:
    print ('Group 0: '+ str(match.group(0)))  # the whole match
    print ('Group 1: '+ str(match.group(1)))
    print ('Group 2: '+ str(match.group(2)))
    print ('Group 3: '+ str(match.group(3)))
    print ('\r')

Group 0: https://www.google.com
Group 1: www.
Group 2: google
Group 3: .com

Group 0: http://coreyms.com
Group 1: None
Group 2: coreyms
Group 3: .com

Group 0: https://youtube.com
Group 1: None
Group 2: youtube
Group 3: .com

Group 0: https://www.nasa.gov
Group 1: www.
Group 2: nasa
Group 3: .gov
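
As a complement to calling `group(n)` one index at a time, a match object's `groups()` method returns all captured groups as a tuple. A minimal sketch with the same URL pattern follows; the `default=''` argument is optional and only replaces the `None` of an unmatched group.

```python
import re

pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')

with_www = pattern.search('https://www.nasa.gov').groups()
without_www = pattern.search('http://coreyms.com').groups()
# Substitute '' for groups that did not participate in the match
without_www_filled = pattern.search('http://coreyms.com').groups(default='')

print(with_www)            # ('www.', 'nasa', '.gov')
print(without_www)         # (None, 'coreyms', '.com')
print(without_www_filled)  # ('', 'coreyms', '.com')
```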



In [180]:
# Replace each match with the contents of groups 2 and 3 only
subbed_urls = pattern.sub(r'\2\3', text_to_search)
print (subbed_urls)


abcdefghijklmnopqurtuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
1234567890

Ha HaHa

coreyms.com

321-555-4321
123.555.1234
123*555*1234
800-555-1234
900-555-1234

Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T

cat pat mat bat

google.com
coreyms.com
youtube.com
nasa.gov
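
If the number of replacements is also of interest, compiled patterns offer `subn`, which behaves like `sub` but returns a `(new_string, count)` tuple. A short sketch on a two-URL excerpt of the sample text:

```python
import re

pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')

# Two-line excerpt of the URLs in text_to_search
urls = "https://www.google.com\nhttp://coreyms.com"

# \2 and \3 refer to the domain and suffix groups
result, count = pattern.subn(r'\2\3', urls)

print(result)
print(count)  # 2
```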



In [188]:
## Demonstrate other methods of re
# findall without groups
# (as before, [r|s|rs] is a character set, so 'Mrs. Robinson' is not matched)
regex = re.compile(r'M[r|s|rs]\.?\s[A-Z]\w*')
matches = regex.findall(text_to_search)
print (matches)

# findall with groups: each match becomes a tuple of its groups
regex = re.compile(r'(M[r|s|rs])(\.?\s[A-Z]\w*)')
matches = regex.findall(text_to_search)
print (matches)

['Mr. Schafer', 'Mr Smith', 'Ms Davis', 'Mr. T']
[('Mr', '. Schafer'), ('Mr', ' Smith'), ('Ms', ' Davis'), ('Mr', '. T')]
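
The shape of the `findall` result depends on how many groups the pattern contains. A minimal sketch on a two-name excerpt, using the simpler set `[rs]` for the title:

```python
import re

# Two-line excerpt of the names in text_to_search
names = "Mr. Schafer\nMs Davis"

no_group = re.findall(r'M[rs]\.?\s[A-Z]\w*', names)        # whole matches
one_group = re.findall(r'M[rs]\.?\s([A-Z]\w*)', names)     # only the single group
two_groups = re.findall(r'(M[rs])\.?\s([A-Z]\w*)', names)  # tuples of both groups

print(no_group)    # ['Mr. Schafer', 'Ms Davis']
print(one_group)   # ['Schafer', 'Davis']
print(two_groups)  # [('Mr', 'Schafer'), ('Ms', 'Davis')]
```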


In [191]:
# match -- only match at the beginning
regex = re.compile(r'Start')
matches = regex.match('Start the life and go to the end')
print (matches)

regex = re.compile(r'life')
matches = regex.match('Start the life and go to the end')
print (matches)

<_sre.SRE_Match object; span=(0, 5), match='Start'>
None
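
`match` only anchors the beginning: the pattern does not have to cover the entire string. The sketch below contrasts it with `fullmatch`, which does, and shows the optional `pos` argument of a compiled pattern's `match`, which moves the starting point. Both are standard `re` features beyond what the notes above cover.

```python
import re

sentence = 'Start the life and go to the end'
regex = re.compile(r'Start')

prefix_only = regex.match(sentence)        # matches: the string begins with 'Start'
whole_string = regex.fullmatch(sentence)   # None: the pattern does not cover the whole string
exact = regex.fullmatch('Start')           # matches: pattern and string coincide

# Start matching at index 10, where 'life' begins
from_pos = re.compile(r'life').match(sentence, 10)

print(prefix_only.span())  # (0, 5)
print(whole_string)        # None
print(exact.group())       # Start
print(from_pos.span())     # (10, 14)
```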


In [193]:
# search -- only return the first match
regex = re.compile(r'the')
matches = regex.search('Start the life and go to the end')
print (matches)

<_sre.SRE_Match object; span=(6, 9), match='the'>
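
Because `search` returns `None` when nothing matches, it is usually worth checking the result before calling methods on it. A minimal sketch follows; the pattern `xyz` is only a made-up example of a failing search.

```python
import re

sentence = 'Start the life and go to the end'

first = re.search(r'the', sentence)
if first:
    # Safe to use the match object here
    print(first.group(), first.start(), first.end())  # the 6 9

missing = re.search(r'xyz', sentence)  # hypothetical pattern with no match
if missing is None:
    print('Nothing Matched')
```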


### Reference

1. [Python Tutorial: re Module - How to Write and Match Regular Expressions (Regex)](https://www.youtube.com/watch?v=K8L6KVGG-7o&t=512s)