# re Module in Python
## By Allen Huang

1. Basics
2. Anchors
3. Characters in brackets
4. Quantifiers
5. Practice
6. Capture information from groups
7. Methods other than finditer

Regular expressions: Python's re module lets us search for text patterns within strings. Largely the same pattern syntax also works in the search boxes of many text editors.

### 1. Basics

In [2]:
# a raw string is a string prefixed with an r; it tells Python not to treat backslashes in any special way
print('\tTab')
print(r'\tTab')
# we want the regular expression engine to interpret the strings we pass in, without Python processing them first

	Tab
\tTab
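
The raw-string point matters most for escapes that mean one thing to Python and another to the regex engine. The following is a minimal sketch, using a small sample string chosen here for illustration, of what happens to `\b` with and without the `r` prefix:

```python
import re

text = 'Ha HaHa'

# Without the r prefix, Python turns '\b' into a single backspace character.
plain = '\bHa'
raw = r'\bHa'
print(len(plain), len(raw))

# The backspace never occurs in the text, so the non-raw pattern finds nothing,
# while the raw pattern is read by the regex engine as "word boundary, then Ha".
plain_hits = re.findall(plain, text)
raw_hits = re.findall(raw, text)
print(plain_hits)
print(raw_hits)
```

The non-raw pattern is three characters long and matches nothing, while the raw one is four characters long and finds the two occurrences of Ha that start a word.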


In [3]:
import re

text_to_search = '''
abcdefghijklmnopqurtuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
1234567890
Ha HaHa
MetaCharacters (Need to be escaped):
. ^ $ * + ? { } [ ] \ | ( )
coreyms.com
321-555-4321
123.555.1234
123*555*1234
800-555-1234
900-555-1234
Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
'''
# storing the compiled pattern in a variable lets us reuse it to perform multiple searches
# start by searching for the literal text abc
pattern = re.compile(r'abc')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)
# finditer returns an iterator that contains all of the matches
# span is the beginning and end index of the match
# matching is case sensitive, so ABC is not matched, and the characters must appear in exactly the order abc

<re.Match object; span=(1, 4), match='abc'>


In [4]:
print(text_to_search[1:4])

abc
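
The span can also be read directly from the match object instead of being copied by hand. A short sketch, assuming a shortened sample string rather than the full `text_to_search`:

```python
import re

text = '\nabcdefg'
match = re.search(r'abc', text)

# span() returns (start, end); start() and end() return each index on its own.
start, end = match.span()
matched_text = match.group()
sliced_text = text[match.start():match.end()]

print(start, end)
print(matched_text)
print(sliced_text)
```

Slicing the original string with the span gives the same text as `match.group()`.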


In [7]:
# to match a literal . we need to escape it as \.
pattern = re.compile(r'\.')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)
# MetaCharacters (Need to be escaped):
# . ^ $ * + ? { } [ ] \ | ( )

<re.Match object; span=(111, 112), match='.'>
<re.Match object; span=(146, 147), match='.'>
<re.Match object; span=(167, 168), match='.'>
<re.Match object; span=(171, 172), match='.'>
<re.Match object; span=(218, 219), match='.'>
<re.Match object; span=(249, 250), match='.'>
<re.Match object; span=(262, 263), match='.'>


In [8]:
# match the URL, escaping the . with a backslash
pattern = re.compile(r'coreyms\.com')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

<re.Match object; span=(139, 150), match='coreyms.com'>
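
When the literal text comes from a variable, escaping by hand is easy to get wrong. One option is `re.escape`, which adds the backslashes for us. A minimal sketch, with a made-up second string added to show the difference:

```python
import re

literal = 'coreyms.com'
text = 'coreyms.com coreymsXcom'

# re.escape puts a backslash in front of the metacharacters in the literal.
escaped = re.escape(literal)
print(escaped)

# Unescaped, the dot matches any character, so the made-up variant matches too.
unescaped_hits = re.findall(literal, text)
escaped_hits = re.findall(escaped, text)
print(unescaped_hits)
print(escaped_hits)
```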


In [12]:
# dot matches any character except new line
pattern = re.compile(r'.')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

<re.Match object; span=(1, 2), match='a'>
<re.Match object; span=(2, 3), match='b'>
<re.Match object; span=(3, 4), match='c'>
<re.Match object; span=(4, 5), match='d'>
<re.Match object; span=(5, 6), match='e'>
<re.Match object; span=(6, 7), match='f'>
<re.Match object; span=(7, 8), match='g'>
<re.Match object; span=(8, 9), match='h'>
<re.Match object; span=(9, 10), match='i'>
<re.Match object; span=(10, 11), match='j'>
<re.Match object; span=(11, 12), match='k'>
<re.Match object; span=(12, 13), match='l'>
<re.Match object; span=(13, 14), match='m'>
<re.Match object; span=(14, 15), match='n'>
<re.Match object; span=(15, 16), match='o'>
<re.Match object; span=(16, 17), match='p'>
<re.Match object; span=(17, 18), match='q'>
<re.Match object; span=(18, 19), match='u'>
<re.Match object; span=(19, 20), match='r'>
<re.Match object; span=(20, 21), match='t'>
<re.Match object; span=(21, 22), match='u'>
<re.Match object; span=(22, 23), match='v'>
<re.Match object; span=(23, 24), match='w'>
...

In [11]:
# Types of characters that we can match
# the uppercase version always negates whatever the lowercase version means
# a raw string is used so that Python keeps \d, \b, \n, etc. exactly as written
type_match = r'''
.       - Any Character Except New Line
\d      - Digit (0-9)
\D      - Not a Digit (0-9)
\w      - Word Character (a-z, A-Z, 0-9, _)  including: lowercase, uppercase, digit, underscore
\W      - Not a Word Character
\s      - Whitespace (space, tab, newline)  including: ' ', \t, \n
\S      - Not Whitespace (space, tab, newline)
Anchors: do not match any characters themselves, but rather the invisible positions before or after characters. They are used in conjunction with other patterns.
\b      - Word Boundary  between a word character and a non-word character (whitespace, punctuation), or at the start/end of the string
\B      - Not a Word Boundary
^       - Beginning of a String
$       - End of a String

[]      - Matches Characters in brackets  attention: only match ONE character
[^ ]    - Matches Characters NOT in brackets
|       - Either Or
( )     - Group

Quantifiers:
*       - 0 or More
+       - 1 or More
?       - 0 or One
{3}     - Exact Number
{3,4}   - Range of Numbers (Minimum, Maximum)
'''
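
To make the table concrete, here is a small sketch that runs a few of these character classes over a made-up sample string:

```python
import re

sample = 'Room 42, floor 7'

digits = re.findall(r'\d+', sample)       # runs of digits
words = re.findall(r'\w+', sample)        # runs of word characters
spaces = re.findall(r'\s', sample)        # single whitespace characters
others = re.findall(r'[^\w\s]', sample)   # neither word character nor whitespace

print(digits)
print(words)
print(spaces)
print(others)
```

The last pattern combines a negated set with two of the shorthand classes, which leaves only the comma.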

In [14]:
pattern = re.compile(r'\s')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

<re.Match object; span=(0, 1), match='\n'>
<re.Match object; span=(27, 28), match='\n'>
<re.Match object; span=(54, 55), match='\n'>
<re.Match object; span=(65, 66), match='\n'>
<re.Match object; span=(68, 69), match=' '>
<re.Match object; span=(73, 74), match='\n'>
<re.Match object; span=(88, 89), match=' '>
<re.Match object; span=(94, 95), match=' '>
<re.Match object; span=(97, 98), match=' '>
<re.Match object; span=(100, 101), match=' '>
<re.Match object; span=(110, 111), match='\n'>
<re.Match object; span=(112, 113), match=' '>
<re.Match object; span=(114, 115), match=' '>
<re.Match object; span=(116, 117), match=' '>
<re.Match object; span=(118, 119), match=' '>
<re.Match object; span=(120, 121), match=' '>
<re.Match object; span=(122, 123), match=' '>
<re.Match object; span=(124, 125), match=' '>
<re.Match object; span=(126, 127), match=' '>
<re.Match object; span=(128, 129), match=' '>
<re.Match object; span=(130, 131), match=' '>
<re.Match object; span=(132, 133), match=' '>
...

### 2. Anchors

In [15]:
# word boundary before 
pattern = re.compile(r'\bHa')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)
# it matches the first Ha, because the start of the line is a word boundary
# it also matches the second Ha, because the space before it creates a word boundary
# the third Ha is not matched, because there is no word boundary before it

<re.Match object; span=(66, 68), match='Ha'>
<re.Match object; span=(69, 71), match='Ha'>


In [17]:
pattern = re.compile(r'\BHa')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

<re.Match object; span=(71, 73), match='Ha'>


In [16]:
# word boundary after
pattern = re.compile(r'Ha\b')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

<re.Match object; span=(66, 68), match='Ha'>
<re.Match object; span=(71, 73), match='Ha'>
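
Putting a word boundary on both sides is a common way to match a whole word only. A minimal sketch with a made-up sentence:

```python
import re

text = 'cat concat cats cat.'

# \b on both sides: cat must not be glued to other word characters.
spans = [m.span() for m in re.finditer(r'\bcat\b', text)]
print(spans)
```

Only the first and last occurrences qualify: `concat` has no boundary before `cat`, and `cats` has none after it. The trailing period still counts as a boundary because it is not a word character.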


In [20]:
sentence  = 'Start a sentence and then bring it to an end'

In [21]:
pattern = re.compile(r'^Start')
matches = pattern.finditer(sentence)
for match in matches:
    print(match)

<re.Match object; span=(0, 5), match='Start'>


In [22]:
# we get no match if the pattern is not at the beginning of the string
pattern = re.compile(r'^and')
matches = pattern.finditer(sentence)
for match in matches:
    print(match)

In [23]:
pattern = re.compile(r'end$')
matches = pattern.finditer(sentence)
for match in matches:
    print(match)

<re.Match object; span=(41, 44), match='end'>


### 3. Characters in brackets

In [25]:
# create a pattern to match 321-555-4321 and 123.555.1234
pattern = re.compile(r'\d\d\d.\d\d\d.\d\d\d\d')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

<re.Match object; span=(151, 163), match='321-555-4321'>
<re.Match object; span=(164, 176), match='123.555.1234'>
<re.Match object; span=(177, 189), match='123*555*1234'>
<re.Match object; span=(190, 202), match='800-555-1234'>
<re.Match object; span=(203, 215), match='900-555-1234'>


In [28]:
# only match a . or a -
# within a character set, we do not need to escape the .
pattern = re.compile(r'\d\d\d[.-]\d\d\d[.-]\d\d\d\d')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

<re.Match object; span=(151, 163), match='321-555-4321'>
<re.Match object; span=(164, 176), match='123.555.1234'>
<re.Match object; span=(190, 202), match='800-555-1234'>
<re.Match object; span=(203, 215), match='900-555-1234'>
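
Inside a character set the dash is only literal when it cannot be read as a range, for example when it comes last as in `[.-]`. A short sketch on a made-up string shows the three cases:

```python
import re

text = 'a-c b.d'

separators = re.findall(r'[.-]', text)   # dash is last, so it is a literal dash
in_range = re.findall(r'[a-c]', text)    # dash between letters defines a range
escaped = re.findall(r'[a\-c]', text)    # escaped dash is literal anywhere

print(separators)
print(in_range)
print(escaped)
```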


In [29]:
pattern = re.compile(r'[89]00[.-]\d\d\d[.-]\d\d\d\d')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

<re.Match object; span=(190, 202), match='800-555-1234'>
<re.Match object; span=(203, 215), match='900-555-1234'>


In [35]:
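# ranges in a character set: match the digits 1-5 and any letter
# a leading ^ inside the brackets, as in [^1-5A-Za-z], would negate the set instead
pattern = re.compile(r'[1-5A-Za-z]')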
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

<re.Match object; span=(1, 2), match='a'>
<re.Match object; span=(2, 3), match='b'>
<re.Match object; span=(3, 4), match='c'>
<re.Match object; span=(4, 5), match='d'>
<re.Match object; span=(5, 6), match='e'>
<re.Match object; span=(6, 7), match='f'>
<re.Match object; span=(7, 8), match='g'>
<re.Match object; span=(8, 9), match='h'>
<re.Match object; span=(9, 10), match='i'>
<re.Match object; span=(10, 11), match='j'>
<re.Match object; span=(11, 12), match='k'>
<re.Match object; span=(12, 13), match='l'>
<re.Match object; span=(13, 14), match='m'>
<re.Match object; span=(14, 15), match='n'>
<re.Match object; span=(15, 16), match='o'>
<re.Match object; span=(16, 17), match='p'>
<re.Match object; span=(17, 18), match='q'>
<re.Match object; span=(18, 19), match='u'>
<re.Match object; span=(19, 20), match='r'>
<re.Match object; span=(20, 21), match='t'>
<re.Match object; span=(21, 22), match='u'>
<re.Match object; span=(22, 23), match='v'>
<re.Match object; span=(23, 24), match='w'>
...

In [37]:
# match an M followed by any single character that is not an s
pattern = re.compile(r'M[^s]')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

<re.Match object; span=(40, 42), match='MN'>
<re.Match object; span=(74, 76), match='Me'>
<re.Match object; span=(216, 218), match='Mr'>
<re.Match object; span=(228, 230), match='Mr'>
<re.Match object; span=(246, 248), match='Mr'>
<re.Match object; span=(260, 262), match='Mr'>


In [None]:
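# an empty negated set is not valid in Python: re.compile(r'[^]') raises re.error,
# because the ] right after the ^ is read as a literal and the set is never closed
# [^\n] spells out "any character except a newline"
pattern = re.compile(r'[^\n]')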
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

In [27]:
# parse phone numbers out of a data file
with open('/Users/hkmac/Desktop/Carzy_Allen_Github/Data_and_Testfile/re_Module_data.txt', 'r') as f:
    contents = f.read()
    pattern = re.compile(r'\d\d\d.\d\d\d.\d\d\d\d')
    matches = pattern.finditer(contents)
    for match in matches:
        print(match)

<re.Match object; span=(12, 24), match='615-555-7164'>
<re.Match object; span=(102, 114), match='800-555-5669'>
<re.Match object; span=(191, 203), match='560-555-5153'>
<re.Match object; span=(281, 293), match='900-555-9340'>
<re.Match object; span=(378, 390), match='714-555-7405'>
<re.Match object; span=(467, 479), match='800-555-6771'>
<re.Match object; span=(557, 569), match='783-555-4799'>
<re.Match object; span=(647, 659), match='516-555-4615'>
<re.Match object; span=(740, 752), match='127-555-1867'>
<re.Match object; span=(829, 841), match='608-555-4938'>
<re.Match object; span=(915, 927), match='568-555-6051'>
<re.Match object; span=(1003, 1015), match='292-555-1875'>
<re.Match object; span=(1091, 1103), match='900-555-3205'>
<re.Match object; span=(1180, 1192), match='614-555-1166'>
<re.Match object; span=(1269, 1281), match='530-555-2676'>
<re.Match object; span=(1355, 1367), match='470-555-2750'>
<re.Match object; span=(1439, 1451), match='800-555-6089'>
...

In [30]:
with open('/Users/hkmac/Desktop/Carzy_Allen_Github/Data_and_Testfile/re_Module_data.txt', 'r') as f:
    contents = f.read()
    pattern = re.compile(r'[89]00.\d\d\d.\d\d\d\d')
    matches = pattern.finditer(contents)
    for match in matches:
        print(match)

<re.Match object; span=(102, 114), match='800-555-5669'>
<re.Match object; span=(281, 293), match='900-555-9340'>
<re.Match object; span=(467, 479), match='800-555-6771'>
<re.Match object; span=(1091, 1103), match='900-555-3205'>
<re.Match object; span=(1439, 1451), match='800-555-6089'>
<re.Match object; span=(1790, 1802), match='800-555-7100'>
<re.Match object; span=(2051, 2063), match='900-555-5118'>
<re.Match object; span=(2826, 2838), match='900-555-5428'>
<re.Match object; span=(3284, 3296), match='800-555-8810'>
<re.Match object; span=(3971, 3983), match='900-555-9598'>
<re.Match object; span=(4945, 4957), match='800-555-2420'>
<re.Match object; span=(5566, 5578), match='900-555-3567'>
<re.Match object; span=(6189, 6201), match='800-555-3216'>
<re.Match object; span=(6889, 6901), match='900-555-7755'>
<re.Match object; span=(7864, 7876), match='800-555-1372'>
<re.Match object; span=(8741, 8753), match='900-555-6426'>


### 4. Quantifiers

Unlike a character set, which matches exactly one character, a quantifier lets a pattern match several characters at once.

In [38]:
pattern = re.compile(r'\d\d\d.\d\d\d.\d\d\d\d')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

<re.Match object; span=(151, 163), match='321-555-4321'>
<re.Match object; span=(164, 176), match='123.555.1234'>
<re.Match object; span=(177, 189), match='123*555*1234'>
<re.Match object; span=(190, 202), match='800-555-1234'>
<re.Match object; span=(203, 215), match='900-555-1234'>


In [40]:
# match multiple characters at once: \d{3} means exactly three digits
pattern = re.compile(r'\d{3}.\d{3}.\d{4}')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

<re.Match object; span=(151, 163), match='321-555-4321'>
<re.Match object; span=(164, 176), match='123.555.1234'>
<re.Match object; span=(177, 189), match='123*555*1234'>
<re.Match object; span=(190, 202), match='800-555-1234'>
<re.Match object; span=(203, 215), match='900-555-1234'>
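
Quantifiers combine naturally with character sets. The sketch below, on made-up sample strings, restricts the separator to a dot or a dash and also shows the `{min,max}` form:

```python
import re

phones = '321-555-4321 123.555.1234 123*555*1234'

# Exactly 3, 3 and 4 digits, separated by a dot or a dash only.
strict = re.findall(r'\d{3}[.-]\d{3}[.-]\d{4}', phones)
print(strict)

# {2,3} accepts two or three digits; \b keeps longer numbers from matching in part.
ranged = re.findall(r'\b\d{2,3}\b', '7 42 555 1234')
print(ranged)
```

The number written with asterisks is no longer matched, and only the two- and three-digit numbers pass the range check.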


In [44]:
# the dot is escaped so that it only matches an optional literal period after the prefix
pattern = re.compile(r'M(r|s|rs)\.?\s[A-Z]\w*')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

<re.Match object; span=(216, 227), match='Mr. Schafer'>
<re.Match object; span=(228, 236), match='Mr Smith'>
<re.Match object; span=(237, 245), match='Ms Davis'>
<re.Match object; span=(246, 259), match='Mrs. Robinson'>
<re.Match object; span=(260, 265), match='Mr. T'>
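
Whether the dot after the prefix is escaped makes no difference on the sample text, but it does on other input. A small sketch, using a made-up line `Mrx Smith` to show the difference:

```python
import re

text = 'Mr. Schafer\nMrx Smith\nMrs. Robinson'

loose = re.compile(r'M(r|s|rs).?\s[A-Z]\w*')     # .? accepts any optional character
strict = re.compile(r'M(r|s|rs)\.?\s[A-Z]\w*')   # \.? accepts only an optional period

loose_hits = [m.group() for m in loose.finditer(text)]
strict_hits = [m.group() for m in strict.finditer(text)]

print(loose_hits)
print(strict_hits)
```

The unescaped version also accepts `Mrx Smith`, because `.?` is free to consume the `x`.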


### 5. Practice

In [46]:
emails = '''
CoreyMSchafer@gmail.com
corey.schafer@university.edu
corey-321-schafer@my-work.net
'''

In [58]:
pattern = re.compile(r'[cC]orey\.?-?(321)?-?M?(S|s)chafer@\w*-?\w*\.\w*')

matches = pattern.finditer(emails)

for match in matches:
    print(match)
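# this works, but it is not readable
# note: a group can be followed by a quantifier, but {} is itself a quantifier, so it cannot be followed by another repetition such as * (that raises re.error)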

<re.Match object; span=(1, 24), match='CoreyMSchafer@gmail.com'>
<re.Match object; span=(25, 53), match='corey.schafer@university.edu'>
<re.Match object; span=(54, 83), match='corey-321-schafer@my-work.net'>


In [64]:
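# build the pattern step by step
# + means repeat the preceding [] set one or more times, until we reach the @
# the dot before the top-level domain is escaped so that it only matches a literal .
pattern = re.compile(r'[a-zA-Z.0-9-]+@[a-z-]+\.[a-z]+')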

matches = pattern.finditer(emails)

for match in matches:
    print(match)

<re.Match object; span=(1, 24), match='CoreyMSchafer@gmail.com'>
<re.Match object; span=(25, 53), match='corey.schafer@university.edu'>
<re.Match object; span=(54, 83), match='corey-321-schafer@my-work.net'>


In [59]:
# a more general pattern that matches most common e-mail addresses
pattern = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')

matches = pattern.finditer(emails)

for match in matches:
    print(match)

<re.Match object; span=(1, 24), match='CoreyMSchafer@gmail.com'>
<re.Match object; span=(25, 53), match='corey.schafer@university.edu'>
<re.Match object; span=(54, 83), match='corey-321-schafer@my-work.net'>
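
The same general pattern can also be used to check single strings rather than to search a block of text. A minimal sketch using `fullmatch`, which requires the whole string to match; the candidate strings below are made up for illustration:

```python
import re

email_pattern = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')

candidates = [
    'corey.schafer@university.edu',
    'corey-321-schafer@my-work.net',
    'not-an-email',
    'missing-domain@',
    'see a@b.co now',   # contains an address, but is not one as a whole
]

# fullmatch returns None unless the pattern covers the entire string.
valid = [c for c in candidates if email_pattern.fullmatch(c)]
print(valid)
```

This is a quick plausibility check only; the pattern is not a complete description of valid e-mail addresses.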


### 6. Capture information from groups

After matching, we can actually use the information captured from those groups.

In [70]:
import re

urls = '''
https://www.google.com
http://coreyms.com
https://youtube.com
https://www.nasa.gov
'''

In [72]:
pattern = re.compile(r'[a-z]+:/{2}[a-z]+\.[a-z.]+')

matches = pattern.finditer(urls)

for match in matches:
    print(match)

<re.Match object; span=(1, 23), match='https://www.google.com'>
<re.Match object; span=(24, 42), match='http://coreyms.com'>
<re.Match object; span=(43, 62), match='https://youtube.com'>
<re.Match object; span=(63, 83), match='https://www.nasa.gov'>


In [79]:
pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')
matches = pattern.finditer(urls)

for match in matches:
    print(match.group(0))

https://www.google.com
http://coreyms.com
https://youtube.com
https://www.nasa.gov


Now we have three groups: the first is the optional www., the second is the word characters that make up the domain name, and the last is the top-level domain. Group 0 is the entire match.

In [86]:
pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')
matches = pattern.finditer(urls)
for match in matches:
    print(match.group(1), end = '*')
    print(match.group(2), end = '*')
    print(match.group(3), end = '\n')

www.*google*.com
None*coreyms*.com
None*youtube*.com
www.*nasa*.gov
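
Numbered groups get harder to follow as a pattern grows. Python also supports named groups with the `(?P<name>...)` syntax. The sketch below rewrites the same URL pattern that way; the names `www`, `domain` and `tld` are chosen here for illustration:

```python
import re

url_pattern = re.compile(r'https?://(?P<www>www\.)?(?P<domain>\w+)(?P<tld>\.\w+)')

first = url_pattern.search('https://www.google.com')
second = url_pattern.search('http://coreyms.com')

# Named groups can be read by name, or all at once as a dictionary.
first_domain = first.group('domain')
first_tld = first.group('tld')
second_groups = second.groupdict()

print(first_domain, first_tld)
print(second_groups)
```

A group that did not take part in the match, such as the optional `www` in the second URL, comes back as `None`.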


In [87]:
# back references: \2 and \3 in the replacement string refer to groups 2 and 3 of the match
# sub takes the replacement first, then the text in which to replace
subbed_urls = pattern.sub(r'\2\3', urls)
print(subbed_urls)
# every time it finds a match, it replaces the match with group 2 followed by group 3


google.com
coreyms.com
youtube.com
nasa.gov
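
The replacement passed to `sub` does not have to be a string. It can also be a function that receives each match object and returns the replacement text. A minimal sketch, with a shortened URL list and a hypothetical helper named `shout_domain`:

```python
import re

urls = 'https://www.google.com\nhttp://coreyms.com'
pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')

def shout_domain(match):
    # Hypothetical helper: upper-case the domain (group 2), keep the TLD (group 3).
    return match.group(2).upper() + match.group(3)

by_function = pattern.sub(shout_domain, urls)
# \g<2> is the unambiguous spelling of the back reference \2.
by_reference = pattern.sub(r'\g<2>\g<3>', urls)

print(by_function)
print(by_reference)
```

The `\g<n>` form is useful when a back reference is followed directly by a digit, where `\2` followed by `0` would be read as group 20.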



### 7. Methods other than finditer

- finditer: returns match objects with extra information and functionality
- findall: returns just the matches as a list of strings; if the pattern contains groups, it returns only the groups
  - with no groups, it returns a list of strings, one per match
  - with only one group, it returns a list of that group's strings
  - with multiple groups, it returns a list of tuples, each containing all of the groups
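
The three findall cases can be seen side by side in a short sketch on a made-up string:

```python
import re

text = '555-1234 555-9876'

no_groups = re.findall(r'\d{3}-\d{4}', text)        # whole matches
one_group = re.findall(r'(\d{3})-\d{4}', text)      # only the single group
two_groups = re.findall(r'(\d{3})-(\d{4})', text)   # one tuple per match

print(no_groups)
print(one_group)
print(two_groups)
```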

In [89]:
pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')
matches = pattern.findall(urls)
for match in matches:
    print(match)

('www.', 'google', '.com')
('', 'coreyms', '.com')
('', 'youtube', '.com')
('www.', 'nasa', '.gov')


match method: checks whether the regular expression matches at the beginning of the string

In [90]:
sentence = 'Start here and dance with me'

In [92]:
pattern = re.compile(r'dance')
matches = pattern.match(sentence)
print(matches)
# match returns only the first match rather than an iterator; if there is no match, it returns None
# 'dance' is not at the beginning of sentence, so we get None

None


In [93]:
pattern = re.compile(r'Start')
matches = pattern.match(sentence)
print(matches)

<re.Match object; span=(0, 5), match='Start'>


In [94]:
# search within the entire string
pattern = re.compile(r'dance')
matches = pattern.search(sentence)
print(matches)
# search returns only the first match

<re.Match object; span=(15, 20), match='dance'>


In [95]:
pattern = re.compile(r'play')
matches = pattern.search(sentence)
print(matches)

None
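
Because both match and search return None when nothing is found, the result should be checked before its methods are called. A minimal sketch, with a hypothetical helper named `first_span`:

```python
import re

sentence = 'Start here and dance with me'

def first_span(pattern, text):
    # Hypothetical helper: span of the first match, or None when there is none.
    found = re.search(pattern, text)
    return found.span() if found else None

dance_span = first_span(r'dance', sentence)
play_span = first_span(r'play', sentence)

print(dance_span)
print(play_span)
```

Calling `.span()` or `.group()` directly on a failed search would raise an AttributeError, since None has no such methods.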


flags:

- multi-line flag (re.MULTILINE): lets the caret and the dollar sign match at the beginning and end of each line in a multi-line string, rather than just the beginning and end of the whole string.
- verbose flag (re.VERBOSE): lets us add whitespace and comments directly within the pattern.
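
Both flags can be tried on small inputs. The sketch below uses a made-up three-line string for the multi-line flag and a commented phone-number pattern for the verbose flag:

```python
import re

lines = 'Start here\nstart again\nthe end'

# Without MULTILINE, ^ only matches at the very beginning of the string.
single = re.findall(r'^start', lines, re.IGNORECASE)
# Flags are combined with |; now ^ also matches right after each newline.
multi = re.findall(r'^start', lines, re.IGNORECASE | re.MULTILINE)
print(single)
print(multi)

# With VERBOSE, whitespace and # comments inside the pattern are ignored.
phone = re.compile(r'''
    \d{3}   # area code
    [.-]    # separator: a dot or a dash
    \d{3}   # prefix
    [.-]    # separator
    \d{4}   # line number
''', re.VERBOSE)
numbers = phone.findall('call 800-555-1234 or 123.555.1234')
print(numbers)
```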

In [96]:
# each letter in 'Start' could be lowercase or uppercase
# add the ignore-case flag
# shorthand: re.I
pattern = re.compile(r'start', re.IGNORECASE)
matches = pattern.finditer(sentence)
for match in matches:
    print(match)

<re.Match object; span=(0, 5), match='Start'>
