<a href="https://colab.research.google.com/github/carloslme/automating-boring-stuff/blob/main/Chapter_7_Pattern_Matching_with_Regular_Expressions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Finding Patterns of Text Without Regular Expressions
Create a function that checks whether a string is a phone number in the following format: 415-555-1234

In [None]:
def isPhoneNumber(text):

  # A valid number is exactly 12 characters long: ddd-ddd-dddd
  if len(text) != 12:
    return False

  # The first three characters (area code) must be digits
  for i in range(0, 3):
    if not text[i].isdecimal():
      return False
  # The fourth character must be a hyphen
  if text[3] != '-':
    return False

  # The next three characters must be digits
  for i in range(4, 7):
    if not text[i].isdecimal():
      return False
  # The eighth character must be a hyphen
  if text[7] != '-':
    return False

  # The last four characters must be digits
  for i in range(8, 12):
    if not text[i].isdecimal():
      return False
  return True

In [None]:
print(isPhoneNumber('415-555-4242'))
print(isPhoneNumber('Hello world!'))

True
False


In [None]:
# Find the pattern of text in a larger string
message = 'Call me at 415-555-1011 tomorrow. 411-555-9999 is my office.'
for i in range(len(message)):
  chunk = message[i:i+12]
  if isPhoneNumber(chunk):
    print('Phone number found: ' + chunk)
print('Done')

Phone number found: 415-555-1011
Phone number found: 411-555-9999
Done


##Finding Patterns of Text With Regular Expressions

### *Matching Regex Objects*

In [None]:
import re
# \d matches any single digit from 0 to 9
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')  # the r prefix marks a raw string

In [None]:
mo = phoneNumRegex.search('My number is 415-555-4242')

In [None]:
print('Phone number found: ' + mo.group())

Phone number found: 415-555-4242
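
search() returns None when the pattern is not found, so calling group() on the result would raise an AttributeError. The following is a minimal sketch of guarding against that case; the helper name find_phone_number is hypothetical and not part of the original notebook.

```python
import re

phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

def find_phone_number(text):
    # Hypothetical helper: return the first phone number in text, or None
    mo = phoneNumRegex.search(text)
    if mo is None:
        return None
    return mo.group()

found = find_phone_number('My number is 415-555-4242')
missing = find_phone_number('No number here')
print(found)
print(missing)
```

Checking for None before calling group() keeps the cell from failing on text that contains no phone number.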


### *Grouping with Parentheses*

In [None]:
import re

# Parentheses create groups in the regex: (\d\d\d) captures the area code.
# The group() match object method returns the matched text of a single group.

phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My number is 415-555-4242.')
mo.group(1)

'415'

In [None]:
mo.group(2)

'555-4242'

In [None]:
mo.group(0)

'415-555-4242'

In [None]:
mo.group()

'415-555-4242'

In [None]:
'''
If you would like to retrieve all the groups at once, use the groups() method.
'''
mo.groups()

('415', '555-4242')

In [None]:
areaCode, mainNumber = mo.groups()
print(areaCode)
print(mainNumber)

415
555-4242


In [None]:
# To match literal parentheses in the text, escape them with a backslash: \( and \)

phoneNumRegex = re.compile(r'(\(\d\d\d\)) (\d\d\d-\d\d\d\d)')

In [None]:
mo = phoneNumRegex.search('My phone number is (415) 555-4242.')

In [None]:
mo.group(1)

'(415)'

In [None]:
mo.group(2)

'555-4242'

### *Matching Multiple Groups with the Pipe*
The | character is called a pipe. You can use it anywhere you want to match one of many expressions. For example, the regular expression r'Batman|Tina Fey' will match either 'Batman' or 'Tina Fey'.

In [None]:
'''
When both Batman and Tina Fey occur in the searched string, the first occurrence of matching text will be returned as the Match object.
'''

In [None]:
import re
heroRegex = re.compile(r'Batman|Tina Fey')
mo1 = heroRegex.search('Batman and Tina Fey')
mo1.group()

'Batman'

In [None]:
mo2 = heroRegex.search('Tina Fey and Batman.')
mo2.group()

'Tina Fey'

In [None]:
'''
Specify the shared prefix once and put the alternative endings inside a group
'''
batRegex = re.compile(r'Bat(man|mobile|copter|bat)')
mo = batRegex.search('Batmobile lost a wheel')

# Return the full matched word
mo.group() 

'Batmobile'

In [None]:
# Return just the part of the matched text inside the first parentheses
mo.group(1)

'mobile'

### *Optional Matching with the Question Mark*
Sometimes there is a pattern that you want to match only optionally. That is, the regex should find a match whether or not that bit of text is there. The ? character flags the group that precedes it as an optional part of the pattern.

In [None]:
batRegex = re.compile(r'Bat(wo)?man')
mo1 = batRegex.search('The adventure of Batman')
mo1.group()

'Batman'

In [None]:
mo2 = batRegex.search('The adventures of Batwoman')
mo2.group()

'Batwoman'

The (wo)? part of the regular expression means that the pattern wo is an optional group. The regex will match text that has zero instances or one instance of wo in it. This is why the regex matches both 'Batwoman' and 'Batman'.

In [None]:
# Using it in the previous phone number examples
phoneRegex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')
mo1 = phoneRegex.search('My number is 415-555-4242')
mo1.group()

'415-555-4242'

In [None]:
mo2 = phoneRegex.search('My number is 649-5490')
mo2.group()

'649-5490'
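
When an optional group does not take part in the match, group() returns None for it. A small sketch, assuming the same phone regex as above (the variable names with_area and without_area are only for illustration):

```python
import re

phoneRegex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')

with_area = phoneRegex.search('My number is 415-555-4242')
without_area = phoneRegex.search('My number is 555-4242')

# Group 1 holds the area code when it is present...
print(with_area.group(1))
# ...and is None when the optional group matched nothing
print(without_area.group(1))
```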

### *Matching Zero or More with the Star*
The * (called the star or asterisk) means “match zero or more”: the group that precedes the star can occur any number of times in the text. It can be completely absent or repeated over and over again.

In [None]:
batRegex = re.compile(r'Bat(wo)*man')
mo1 = batRegex.search('My name is Batman.')
mo1.group()

'Batman'

In [None]:
mo2 = batRegex.search('She is Batwoman.')
mo2.group()

'Batwoman'

In [None]:
mo3 = batRegex.search('The Adventures of Batwowowowoman')
mo3.group()

'Batwowowowoman'

### *Matching One or More with the Plus*
While * means “match zero or more,” the + (or plus) means “match one or more.” Unlike the star, which does not require its group to appear in the matched string, the group preceding a plus must appear at least once. It is not optional.

In [None]:
batRegex = re.compile(r'Bat(wo)+man')
mo1 = batRegex.search('The Adventures of Batwoman')
mo1.group()

'Batwoman'

In [None]:
mo2 = batRegex.search('The Adventures of Batwowowowoman')
mo2.group()

'Batwowowowoman'

In [None]:
mo3 = batRegex.search('Batman')
mo3 == None

True

# Matching Specific Repetitions with Curly Brackets
If you have a group that you want to repeat a specific number of times, follow the group in your regex with a number in curly brackets. For example, the regex (Ha){3} will match the string 'HaHaHa', but it will not match 'HaHa', since the latter has only two repeats of the (Ha) group.

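Instead of one number, you can specify a range by writing a minimum, a comma, and a maximum between the curly brackets. For example, (Ha){3,5} will match 'HaHaHa', 'HaHaHaHa', and 'HaHaHaHaHa'. You can also leave out the first or second number in the curly brackets to leave the minimum or maximum unbounded. For example, (Ha){3,} will match three or more instances of the (Ha) group, while (Ha){,5} will match zero to five instances. Curly brackets can help make your regular expressions shorter.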

In [None]:
haRegex = re.compile(r'(Ha){3}')
mo1 = haRegex.search('HaHaHa')
mo1.group()

'HaHaHa'

In [None]:
mo2 = haRegex.search('Ha')
mo2 == None

True
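
The range and open-ended forms of the curly brackets can be sketched as follows. The variable names are illustrative and the input is simply 'Ha' repeated seven times.

```python
import re

atLeastThree = re.compile(r'(Ha){3,}')   # three or more repetitions
upToFive = re.compile(r'(Ha){,5}')       # zero to five repetitions
threeToFive = re.compile(r'(Ha){3,5}')   # between three and five repetitions

text = 'HaHaHaHaHaHaHa'  # seven repetitions of 'Ha'

unbounded = atLeastThree.search(text).group()
capped = upToFive.search(text).group()
ranged = threeToFive.search(text).group()
# {,5} allows zero repetitions, so it matches an empty string here
empty = upToFive.search('no laughs').group()

print(unbounded)
print(capped)
print(ranged)
print(repr(empty))
```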

## Greedy and Nongreedy Matching
Python’s regular expressions are greedy by default, which means that in ambiguous situations they will match the longest string possible. The nongreedy version of the curly brackets, which matches the shortest string possible, has the closing curly bracket followed by a question mark.




In [None]:
import re

greedyHaRegex = re.compile(r'(Ha){3,5}')
mo1 = greedyHaRegex.search('HaHaHaHaHa')
mo1.group()

'HaHaHaHaHa'

In [None]:
nongreedyHaRegex = re.compile(r'(Ha){3,5}?')
mo2 = nongreedyHaRegex.search('HaHaHaHaHa')
mo2.group()

'HaHaHa'

Note that the question mark can have two meanings in regular expressions: declaring a nongreedy match or flagging an optional group. These meanings are entirely unrelated.
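
To make the two meanings concrete, here is a short sketch that puts them side by side. The regex names are made up for this illustration.

```python
import re

# Meaning 1: ? after a group marks that group as optional
optionalRegex = re.compile(r'Bat(wo)?man')
# Meaning 2: ? after a repetition makes that repetition nongreedy
nongreedyRegex = re.compile(r'(wo){1,3}?')
greedyRegex = re.compile(r'(wo){1,3}')

optional_match = optionalRegex.search('Batman').group()
shortest = nongreedyRegex.search('wowowo').group()
longest = greedyRegex.search('wowowo').group()

print(optional_match)
print(shortest)
print(longest)
```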

## The findall() Method
In addition to the search() method, Regex objects also have a findall() method. While search() returns a Match object for the first matched text in the searched string, findall() returns the strings of every match in the searched string, as long as there are no groups in the regular expression.

In [None]:
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mo = phoneNumRegex.search('Cell: 415-555-9999 Work: 212-555-0000')
mo.group()

'415-555-9999'

In [None]:
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d') # has no groups
phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')

['415-555-9999', '212-555-0000']

If there are groups in the regular expression, then findall() will return a list of tuples. Each tuple represents a found match, and its items are the matched strings for each group in the regex.

In [None]:
phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)') # has groups
phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')

[('415', '555', '9999'), ('212', '555', '0000')]
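
If you want both the full number and its parts from findall(), one option is to wrap the whole pattern in an extra pair of parentheses so that the complete match becomes the first item of each tuple. This is only a sketch of that idea; the name fullNumbers is hypothetical.

```python
import re

# The outer parentheses form group 1 (the whole number); the inner ones are groups 2 to 4
phoneNumRegex = re.compile(r'((\d\d\d)-(\d\d\d)-(\d\d\d\d))')
matches = phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')

# The first item of each tuple is the complete phone number
fullNumbers = [groups[0] for groups in matches]

print(matches)
print(fullNumbers)
```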

## Shorthand Character Classes
* \d -> Any numeric digit from 0 to 9.
* \D -> Any character that is NOT a numeric digit from 0 to 9.
* \w -> Any letter, numeric digit, or the underscore character. (Think of this as matching “word” characters.)
* \W -> Any character that is NOT a letter, numeric digit, or the underscore character.
* \s -> Any space, tab, or newline character. (Think of this as matching “space” characters.)
* \S -> Any character that is NOT a space, tab, or newline.

In [None]:
import re

xmasRegex = re.compile(r'\d+\s\w+')
xmasRegex.findall('12 drummers, 11 pipers, 10 lords, 9 ladies, 8 maids, 7 swans, 6 geese, 5 rings, 4 birds, 3 hens, 2 doves, 1 partridge')

['12 drummers',
 '11 pipers',
 '10 lords',
 '9 ladies',
 '8 maids',
 '7 swans',
 '6 geese',
 '5 rings',
 '4 birds',
 '3 hens',
 '2 doves',
 '1 partridge']
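
The uppercase classes \D, \W, and \S are not used in the example above. A brief sketch of how they behave on a made-up sample string:

```python
import re

text = 'Room 42, Floor_3!'

# \D+ collects runs of characters that are not digits
non_digits = re.compile(r'\D+').findall(text)
# \W+ collects runs of characters that are not letters, digits, or underscores
non_words = re.compile(r'\W+').findall(text)
# \S+ collects runs of characters that are not whitespace
non_spaces = re.compile(r'\S+').findall(text)

print(non_digits)
print(non_words)
print(non_spaces)
```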

## Making Your Own Character Classes
You can define your own character class using square brackets. [ ]

In [None]:
vowelRegex = re.compile(r'[aeiouAEIOU]')
vowelRegex.findall('RoboCop eats baby food. BABY FOOD.')

['o', 'o', 'o', 'e', 'a', 'a', 'o', 'o', 'A', 'O', 'O']

You can also include ranges of letters or numbers by using a hyphen. For example, the character class [a-zA-Z0-9] will match all lowercase letters, uppercase letters, and numbers.
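
As a quick illustration of ranges, the sketch below pulls runs of letters and digits out of a made-up sample string. Note that the underscore, space, colon, and exclamation mark are not in the class, so they split the matches.

```python
import re

alnumRegex = re.compile(r'[a-zA-Z0-9]+')
tokens = alnumRegex.findall('user_42 logged in at 09:15!')
print(tokens)
```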

By placing a caret character (^) just after the character class’s opening bracket, you can make a negative character class. A negative character class will match all the characters that are not in the character class.

In [None]:
consonantRegex = re.compile(r'[^aeiouAEIOU]')
consonantRegex.findall('RoboCop eats baby food. BABY FOOD.')

['R',
 'b',
 'C',
 'p',
 ' ',
 't',
 's',
 ' ',
 'b',
 'b',
 'y',
 ' ',
 'f',
 'd',
 '.',
 ' ',
 'B',
 'B',
 'Y',
 ' ',
 'F',
 'D',
 '.']

## The Caret and Dollar Sign Characters
You can also use the caret symbol ( ^ ) at the start of a regex to indicate that a match must occur at the beginning of the searched text. Likewise, you can put a dollar sign ( \$ ) at the end of the regex to indicate the string must end with this regex pattern. And you can use the ^ and $ together to indicate that the entire string must match the regex—that is, it’s not enough for a match to be made on some subset of the string.

In [None]:
beginsWithHello = re.compile(r'^Hello')
beginsWithHello.search('Hello world!')

<_sre.SRE_Match object; span=(0, 5), match='Hello'>

In [None]:
beginsWithHello.search('He said hello.') == None

True

The r'\d$' regular expression string matches strings that end with a numeric character from 0 to 9.

In [None]:
import re

endsWithNumber = re.compile(r'\d$')
endsWithNumber.search('Your number is 89')

<_sre.SRE_Match object; span=(16, 17), match='9'>

In [None]:
endsWithNumber.search('Your number is forty two.') == None

True

The r'^\d+$' regular expression string matches strings that both begin and end with one or more numeric characters.

In [None]:
wholeStringIsNum = re.compile(r'^\d+$')
wholeStringIsNum.search('1234567890')

<_sre.SRE_Match object; span=(0, 10), match='1234567890'>

In [None]:
wholeStringIsNum.search('12345xyz67890') == None

True
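
Anchoring a pattern with both ^ and $ gives a compact alternative to the hand-written isPhoneNumber() function from the start of this notebook. The following is a sketch under that assumption; the name is_phone_number is hypothetical.

```python
import re

# ^ and $ force the whole string to be a phone number, not just part of it
wholePhoneRegex = re.compile(r'^\d{3}-\d{3}-\d{4}$')

def is_phone_number(text):
    # Hypothetical regex-based counterpart of isPhoneNumber()
    return wholePhoneRegex.search(text) is not None

valid = is_phone_number('415-555-4242')
not_a_number = is_phone_number('Hello world!')
embedded = is_phone_number('Call 415-555-4242 now')

print(valid, not_a_number, embedded)
```

Without the anchors, the third call would succeed because the number appears somewhere inside the string.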

## The Wildcard Character 
The . (or dot ) character in a regular expression is called a wildcard and will match any character except for a newline.

In [None]:
import re
atRegex = re.compile(r'.at')
atRegex.findall('The cat in the hat sat on the flat mat.')

['cat', 'hat', 'sat', 'lat', 'mat']

Remember that the dot character will match just one character, which is why the match for the text flat in the previous example matched only lat. To match an actual dot, escape the dot with a backslash.
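
The difference between an escaped and an unescaped dot can be sketched with decimal numbers. The sample sentence and variable names are invented for this illustration.

```python
import re

text = 'Pi is about 3.14 and e is about 2.71, not 3x14.'

# \. matches only a literal dot
escaped = re.compile(r'\d+\.\d+').findall(text)
# An unescaped . matches any character, so '3x14' is picked up as well
unescaped = re.compile(r'\d+.\d+').findall(text)

print(escaped)
print(unescaped)
```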

# Matching Everything with Dot-Star
Sometimes you will want to match everything and anything. For example, say you want to match the string 'First name:', followed by any and all text, followed by 'Last name:', and then followed by anything again. You can use the dot-star (.*) to stand in for that “anything.”

In [None]:
nameRegex = re.compile(r'First name: (.*) Last name: (.*)')
fn = nameRegex.search('First name: Carlos Last name: Lopez')
fn.group(1)

'Carlos'

In [None]:
fn.group(2)

'Lopez'

The dot-star uses greedy mode: It will always try to match as much text as possible. To match any and all text in a nongreedy fashion, use the dot, star, and question mark ( .*? ). Like with curly brackets, the question mark tells Python to match in a nongreedy way.

In [None]:
nonGreedyRegex = re.compile(r'<.*?>')
ng = nonGreedyRegex.search('<My cat is handsome> and you know it!>')
ng.group()

'<My cat is handsome>'

In [None]:
greedyRegex = re.compile(r'<.*>')
gr = greedyRegex.search('<My cat is handsome> and you know it!>')
gr.group()

'<My cat is handsome> and you know it!>'

# Matching Newlines with the Dot Character
The dot-star will match everything except a newline. By passing re.DOTALL as the second argument to re.compile(), you can make the dot character match all characters, including the newline character.

In [None]:
import re
noNewLineRegex = re.compile(r'.*')
noNewLineRegex.search('Serve the public trust. \n Protect the innocent. \nUphold the law.').group()

'Serve the public trust. '

In [None]:
noNewLineRegex = re.compile(r'.*',re.DOTALL)
noNewLineRegex.search('Serve the public trust. \n Protect the innocent. \nUphold the law.').group()

'Serve the public trust. \n Protect the innocent. \nUphold the law.'

# Review
* The ? matches zero or one of the preceding group. 
* The * matches zero or more of the preceding group. 
* The + matches one or more of the preceding group. 
* The {n} matches exactly n of the preceding group. 
* The {n,} matches n or more of the preceding group. 
* The {,m} matches 0 to m of the preceding group. 
* The {n,m} matches at least n and at most m of the preceding group. 
* {n,m}? or *? or +? performs a nongreedy match of the preceding group. 
* ^spam means the string must begin with spam.
* spam$ means the string must end with spam. 
* The . (dot) matches any character, except newline characters.
* \d, \w, and \s match a digit, word, or space character, respectively.
* \D, \W, and \S match anything except a digit, word, or space character, respectively.
* [abc] matches any character between the brackets (such as a, b, or c).
* [^abc] matches any character that isn’t between the brackets.

# Case-Insensitive Matching
Sometimes you care only about matching the letters without worrying whether they’re uppercase or lowercase. To make your regex case-insensitive, you can pass re.IGNORECASE or re.I as a second argument to re.compile().

In [None]:
robocop = re.compile(r'robocop',re.IGNORECASE)
robocop.search('RoBoCop is part man, part machine, all cop.').group()

'RoBoCop'

In [None]:
robocop.search('ROBOCOP protects the innocents.').group()

'ROBOCOP'

# Substituting Strings with the sub() Method
Regular expressions can not only find text patterns but can also substitute new text in place of those patterns. The sub() method for Regex objects is passed two arguments. The first argument is the string that replaces any matches. The second is the string to search and apply the substitutions to. The sub() method returns a new string with the substitutions applied.

In [None]:
import re 
namesRegex = re.compile(r'Agent \w+')
namesRegex.sub('CENSORED', 'Agent Alice gave the secret documents to Agent Bob.')


'CENSORED gave the secret documents to CENSORED.'

Sometimes you may need to use the matched text itself as part of the substitution. In the first argument to sub(), you can type \1, \2, \3, and so on, to mean “Enter the text of group 1, 2, 3, and so on, in the substitution.” For example, say you want to censor the names of the secret agents by showing just the first letters of their names. To do this, you could use the regex Agent (\w)\w* and pass r'\1****' as the first argument to sub(). The \1 in that string will be replaced by whatever text was matched by group 1, that is, the (\w) group of the regular expression.

In [None]:
agentNamesRegex = re.compile(r'Agent (\w)\w*')
agentNamesRegex.sub(r'\1****', 'Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent')

'A**** told C**** that E**** knew B**** was a double agent'
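
Several group references can be combined in one replacement string. As a sketch, the example below rewrites phone numbers from the 415-555-9999 form into the (415) 555-9999 form; the variable name formatted is only illustrative.

```python
import re

phoneRegex = re.compile(r'(\d{3})-(\d{3})-(\d{4})')

# \1 is the area code, \2 the first three digits, \3 the last four digits
formatted = phoneRegex.sub(r'(\1) \2-\3', 'Cell: 415-555-9999 Work: 212-555-0000')
print(formatted)
```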

# Managing Complex Regexes
Verbose mode is enabled by passing re.VERBOSE as the second argument to re.compile(). In this mode, whitespace and comments inside the regular expression string are ignored.
* ''' (triple quotes) creates a multiline string, so the regex can be spread over several lines.
* \# starts a comment that runs to the end of the line.

In [2]:
import re
# Instead of this hard-to-read regular expression...
phoneRegex = re.compile(r'((\d{3}|\(\d{3}\))?(\s|-|\.)?\d{3}(\s|-|\.)\d{4}(\s*(ext|x|ext\.)\s*\d{2,5})?)')

In [3]:
# ... something like this (each piece of the pattern sits on its own line,
# because everything after a # is treated as a comment)
phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?             # area code
    (\s|-|\.)?                     # separator
    \d{3}                          # first 3 digits
    (\s|-|\.)                      # separator
    \d{4}                          # last 4 digits
    (\s*(ext|x|ext\.)\s*\d{2,5})?  # extension
    )''', re.VERBOSE)
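
A verbose regex is used like any other Regex object, and flags can be combined with the | operator. The sketch below assumes a verbose phone pattern laid out one piece per line and adds re.IGNORECASE so that an uppercase EXT is accepted; the sample text and the name numbers are made up for this illustration.

```python
import re

phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?             # area code
    (\s|-|\.)?                     # separator
    \d{3}                          # first 3 digits
    (\s|-|\.)                      # separator
    \d{4}                          # last 4 digits
    (\s*(ext|x|ext\.)\s*\d{2,5})?  # extension
    )''', re.VERBOSE | re.IGNORECASE)

text = 'Call 415-555-1011 or (212) 555-0000 EXT 123.'

# findall() returns one tuple per match; item 0 is the outer group (the whole number)
numbers = [groups[0] for groups in phoneRegex.findall(text)]
print(numbers)
```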