In [1]:
# Create table
tableData = [['apples', 'oranges', 'cherries', 'banana'],
             ['Alice', 'Bob', 'Carol', 'David'],
             ['dogs', 'cats', 'moose', 'goose']]

print(len(tableData))

# Results would look like below
#   apples Alice  dogs
#  oranges   Bob  cats
# cherries Carol moose
#   banana David goose


def print_table(table):
    # Create a list of zeros, one for each inner list (column) in the table
    col_width = [0] * len(table)
    print(col_width)

3
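The print_table() stub above stops after allocating the column widths. One way it might be finished is sketched here, assuming the goal is the right-justified layout shown in the comments, with each inner list forming one column. The name format_table is hypothetical and not part of the original cell.

```python
def format_table(table):
    """Return the table's rows as right-justified strings.

    Assumes every inner list has the same length; each inner list is one column.
    """
    # The width of each column is the length of its longest string
    col_widths = [max(len(item) for item in column) for column in table]
    lines = []
    for row in range(len(table[0])):
        cells = [table[col][row].rjust(col_widths[col]) for col in range(len(table))]
        lines.append(' '.join(cells))
    return lines


tableData = [['apples', 'oranges', 'cherries', 'banana'],
             ['Alice', 'Bob', 'Carol', 'David'],
             ['dogs', 'cats', 'moose', 'goose']]

for line in format_table(tableData):
    print(line)
```

Returning the lines instead of printing them inside the function keeps the formatting logic easy to check separately from the output.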


# Pattern Matching with Regular Expressions

### Finding text patterns without using regular expressions vs. how to simplify with regular expressions
I’ll show you basic matching with regular expressions and then move on to some more powerful features, such as string substitution and creating your own character classes. Finally, at the end of the chapter, you’ll write a program that can automatically extract phone numbers and email addresses from a block of text.

### Finding Patterns of Text without Regular Expressions
Say you want to find a phone number in a string. You know the pattern: three numbers, a hyphen, three numbers, a hyphen, and four numbers. Here’s an example: 415-555-4242.

In [4]:
# Define isPhoneNumber() to check whether a string matches this pattern; it returns True or False
def isPhoneNumber(text):
    # Check that the string is exactly 12 characters
    if len(text) != 12:
        return False
    for i in range(0, 3):
        # Checks that the area code consists of only numeric characters
        if not text[i].isdecimal():
            return False
    # The number must have the first hyphen after the area code
    if text[3] != '-':
        return False
    # Checks for 3 more numeric characters
    for i in range(4, 7):
        if not text[i].isdecimal():
            return False
    # Check for another hyphen
    if text[7] != '-':
        return False
    # Check for the final 4 numeric characters
    for i in range(8, 12):
        if not text[i].isdecimal():
            return False
    # Return True if everything holds True
    return True

print('415-555-4242 is a phone number:')
print(isPhoneNumber('415-555-4242'))
print('Moshi moshi is a phone number:')
print(isPhoneNumber('Moshi moshi'))
print('408999999 is a phone number without dashes but will return false:')
print(isPhoneNumber('408999999'))

415-555-4242 is a phone number:
True
Moshi moshi is a phone number:
False
408999999 is a phone number without dashes but will return false:
False


In [5]:
message = 'Call me at 415-555-1011 tomorrow. 415-555-9999 is my office.'
message

'Call me at 415-555-1011 tomorrow. 415-555-9999 is my office.'

In [12]:
for i in range(len(message)):
    chunk = message[i:i+12]
#     print(chunk)

In [8]:
# Loop through the message in 12-character chunks and print each chunk that isPhoneNumber() accepts
for i in range(len(message)):
    chunk = message[i:i+12]
    # Pass chunk to isPhoneNumber() to see whether it matches the phone number pattern
    if isPhoneNumber(chunk):
        print('Phone number found: ' + chunk)
print('Done')

# The loop goes through the entire string, testing each 12-character piece and printing any chunk it finds that 
# satisfies isPhoneNumber(). Once we’re done going through message, we print Done.

Phone number found: 415-555-1011
Phone number found: 415-555-9999
Done


In [11]:
print(message[0:0+12])
print(message[1:1+12])

Call me at 4
all me at 41


### Finding Patterns of Text with Regular Expressions
Regular expressions, called regexes for short, are descriptions for a pattern of text. For example, a \d in a regex stands for a digit character, that is, any single numeral 0 to 9. The regex \d\d\d-\d\d\d-\d\d\d\d is used by Python to match the same text the previous isPhoneNumber() function did: a string of three numbers, a hyphen, three more numbers, another hyphen, and four numbers. Any other string would not match the \d\d\d-\d\d\d-\d\d\d\d regex.

But regular expressions can be much more sophisticated. For example, adding a 3 in curly brackets ({3}) after a pattern is like saying, “Match this pattern three times.” So the slightly shorter regex \d{3}-\d{3}-\d{4} also matches the correct phone number format.
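As a quick check of that claim, the sketch below compares the long and short forms of the pattern on a few made-up strings. It uses the re module, which the next section introduces; the variable names and sample strings are illustrative assumptions.

```python
import re

longForm = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
shortForm = re.compile(r'\d{3}-\d{3}-\d{4}')

samples = ['415-555-4242', '415-55-4242', 'Call 415-555-1011 now']

comparison = []
for text in samples:
    longMatch = longForm.search(text)
    shortMatch = shortForm.search(text)
    # Record the matched text (or None) from each pattern
    longText = longMatch.group() if longMatch else None
    shortText = shortMatch.group() if shortMatch else None
    comparison.append((longText, shortText))
    print(text, '->', longText, shortText)
```

Both patterns agree on every sample, including the malformed second string, which neither one matches.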

### Creating Regex Objects
All of the regex functions in Python are in the re module.

Passing a string value representing your regular expression to re.compile() returns a Regex pattern object (or simply, a Regex object).

In [12]:
import re

# Create a Regex object that matches the phone number pattern.
#  Putting an r before the first quote marks the string as a raw string,
#  in which backslashes are kept as literal characters instead of starting escape sequences
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
phoneNumRegex

re.compile(r'\d\d\d-\d\d\d-\d\d\d\d', re.UNICODE)

### Matching Regex Objects
A Regex object’s search() method searches the string it is passed for any matches to the regex. The search() method will return None if the regex pattern is not found in the string. If the pattern is found, the search() method returns a Match object. Match objects have a group() method that will return the actual matched text from the searched string.

In [16]:
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mo = phoneNumRegex.search('My number is 415-555-4242.')
mo

<re.Match object; span=(13, 25), match='415-555-4242'>

In [17]:
print('Phone number found: ' + mo.group())

Phone number found: 415-555-4242


The mo variable name is just a generic name to use for Match objects. This example might seem complicated at first, but it is much shorter than the earlier isPhoneNumber.py program and does the same thing.

Here, we pass our desired pattern to re.compile() and store the resulting Regex object in phoneNumRegex. Then we call search() on phoneNumRegex and pass search() the string we want to search for a match. The result of the search gets stored in the variable mo. In this example, we know that our pattern will be found in the string, so we know that a Match object will be returned. Knowing that mo contains a Match object and not the null value None, we can call group() on mo to return the match. Writing mo.group() inside our print statement displays the whole match, 415-555-4242.

### Review of Regular Expression Matching

While there are several steps to using regular expressions in Python, each step is fairly simple.

1. Import the regex module with import re.
2. Create a Regex object with the re.compile() function. (Remember to use a raw string.)
3. Pass the string you want to search into the Regex object’s search() method. This returns a Match object, or None if the pattern is not found.
4. Call the Match object’s group() method to return a string of the actual matched text.
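The steps above can be sketched as one small function. Because search() returns None when nothing matches, the sketch checks for None before calling group(). The function name findPhone and the sample strings are assumptions made for illustration.

```python
import re

phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')


def findPhone(text):
    """Return the first phone number found in text, or None if there is none."""
    mo = phoneNumRegex.search(text)
    if mo is None:
        # Calling mo.group() here would raise AttributeError
        return None
    return mo.group()


print(findPhone('My number is 415-555-4242.'))
print(findPhone('No number here.'))
```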



### Grouping with Parentheses

Say you want to separate the area code from the rest of the phone number. Adding parentheses will create groups in the regex: (\d\d\d)-(\d\d\d-\d\d\d\d). Then you can use the group() match object method to grab the matching text from just one group.

The first set of parentheses in a regex string will be group 1. The second set will be group 2. By passing the integer 1 or 2 to the group() match object method, you can grab different parts of the matched text. Passing 0 or nothing to the group() method will return the entire matched text.

In [20]:
# Compile regex object
phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)') # group 1 (area code) and group 2 (main number) in parentheses

# Create match object from string
mo = phoneNumRegex.search('my number is 415-222-2222')

# Retrieve groups from the match object
print(mo.group(1)) # first group
print(mo.group(2)) # second group
print(mo.group(0)) # entire match
print(mo.group()) # entire match

# All groups at once
print(mo.groups()) # plural s

415
222-2222
415-222-2222
415-222-2222
('415', '222-2222')


In [21]:
areaCode, mainNumber = mo.groups() # multiple assignment
print(areaCode)
print(mainNumber)

415
222-2222


### Matching parentheses as well
Parentheses have a special meaning in regular expressions, but what do you do if you need to match a parenthesis in your text? For instance, maybe the phone numbers you are trying to match have the area code set in parentheses. In this case, you need to escape the ( and ) characters with a backslash.

In [30]:
phoneNumRegex = re.compile(r'(\(\d\d\d\)) (\d\d\d-\d\d\d\d)') # add extra \( \) and () for the area code to match () as well
mo = phoneNumRegex.search('text this number (408) 999-9999')
print(mo.group(1))
print(mo.group(2))
print(mo.groups())

(408)
999-9999
('(408)', '999-9999')


### Matching Multiple Groups with the Pipe

The | character is called a pipe. You can use it anywhere you want to match one of many expressions. For example, the regular expression r'Batman|Tina Fey' will match either 'Batman' or 'Tina Fey'.

When both Batman and Tina Fey occur in the searched string, the first occurrence of matching text will be returned as the Match object.

In [32]:
# Search for Batman or Tina Fey using Regex
heroRegex = re.compile(r'Batman|Tina Fey')
mo_1 = heroRegex.search('Batman and Tina Fey')
mo_1.group()

'Batman'

In [34]:
mo_2 = heroRegex.search('Tina Fey and Batman')
mo_2.group()

'Tina Fey'

You can also use the pipe to match one of several patterns as part of your regex. For example, say you wanted to match any of the strings 'Batman', 'Batmobile', 'Batcopter', and 'Batbat'. Since all these strings start with Bat, it would be nice if you could specify that prefix only once. This can be done with parentheses. 

In [40]:
# Regex for prefix of words
batRegex = re.compile(r'Bat(man|mobile|copter|bat)') # the prefix 'Bat' is written once, followed by the alternatives
mo = batRegex.search('Batmobile lost a wheel with Batman')
print(mo.group()) # returns full matched text 'Batmobile'
print(mo.groups())
print(mo.group(1)) # returns just the part of the matched text inside the first parentheses group 'mobile'

Batmobile
('mobile',)
mobile


The method call mo.group() returns the full matched text 'Batmobile', while mo.group(1) returns just the part of the matched text inside the first parentheses group, 'mobile'. By using the pipe character and grouping parentheses, you can specify several alternative patterns you would like your regex to match.

If you need to match an actual pipe character, escape it with a backslash, like '\|'.
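A short sketch of the difference, using made-up sample strings: the escaped pipe matches only a literal | character, while the unescaped pipe means "either side".

```python
import re

# \| matches a literal pipe character
literalPipeRegex = re.compile(r'yes\|no')
# An unescaped | separates two alternatives
eitherRegex = re.compile(r'yes|no')

withPipe = literalPipeRegex.search('Answer yes|no below')
withoutPipe = literalPipeRegex.search('Answer yes or no below')
either = eitherRegex.search('Answer yes or no below')

print(withPipe.group())
print(withoutPipe)
print(either.group())
```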

### Optional Matching with the Question Mark
Sometimes there is a pattern that you want to match only optionally. That is, the regex should find a match whether or not that bit of text is there. The ? character flags the group that precedes it as an optional part of the pattern.

In [43]:
batRegex = re.compile(r'Bat(wo)?man') #  pattern wo is an optional group
mo_1 = batRegex.search('the adventures of Batman')
mo_1.group()

'Batman'

In [44]:
mo_2 = batRegex.search('the Batwoman adventures of Batman')
mo_2.group()

'Batwoman'

The (wo)? part of the regular expression means that the pattern wo is an optional group. The regex will match text that has zero instances or one instance of wo in it. This is why the regex matches both 'Batwoman' and 'Batman'.

Using the earlier phone number example, you can make the regex look for phone numbers that do or do not have an area code.

#### Think of the "?" as saying, "Match zero or one of the group preceding this question mark"

If you need to match an actual question mark character, escape it with \?.

In [9]:
# Think of the "?" as saying, "Match zero or one of the group preceding this question mark"
phoneRegex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d') # the ? makes the area code group optional
mo_1 = phoneRegex.search('my number is 408-999-9999 and 408-888-8888')
print(mo_1.group())

# The same pattern also matches a number that has no area code
phoneRegex2 = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')
mo_3 = phoneRegex2.search('my number is 999-9999 and 408-888-8888')
print(mo_3.group())
print(mo_3.groups()) # the optional group did not take part in the match, so it is None
print(phoneRegex2.findall('my number is 999-9999 and 408-888-8888')) # with one group, findall() returns only that group's text

408-999-9999
999-9999
(None,)
['', '408-']


In [46]:
mo_2 = phoneRegex.search('my number is 999-9999 and 888-8888')
mo_2.group()

'999-9999'

### Matching Zero or More with the Star

The * (called the star or asterisk) means “match zero or more”—the group that precedes the star can occur any number of times in the text. It can be completely absent or repeated over and over again. 

In [52]:
batRegex = re.compile(r'Bat(wo)*man')

# For 'Batman', the (wo)* part of the regex matches zero instances of wo in the string
mo_1 = batRegex.search('The Adventures of Batman')
mo_1.group()

'Batman'

In [53]:
# For 'Batwoman', the (wo)* matches one instance of wo
mo_2 = batRegex.search('The Adventures of Batwoman')
mo_2.group()

'Batwoman'

In [51]:
# For 'Batwowowowoman', (wo)* matches four instances of wo
mo_3 = batRegex.search('The Adventures of Batwowowowoman')
mo_3.group()

'Batwowowowoman'

For 'Batman', the (wo)* part of the regex matches zero instances of wo in the string; for 'Batwoman', the (wo)* matches one instance of wo; and for 'Batwowowowoman', (wo)* matches four instances of wo.

If you need to match an actual star character, prefix the star in the regular expression with a backslash, \*.

### Matching One or More with the Plus

While * means “match zero or more,” the + (or plus) means “match one or more.” Unlike the star, which does not require its group to appear in the matched string, the group preceding a plus must appear at least once. It is not optional. Enter the following into the interactive shell, and compare it with the star regexes in the previous section:

In [54]:
batRegex = re.compile(r'Bat(wo)+man')
mo1 = batRegex.search('The Adventures of Batwoman')
mo1.group()

'Batwoman'

In [55]:
mo2 = batRegex.search('The Adventures of Batwowowowoman')
mo2.group()

'Batwowowowoman'

In [59]:
mo3 = batRegex.search('The Adventures of Batman')
mo3 == None

True

### Matching Specific Repetitions with Curly Brackets

If you have a group that you want to repeat a specific number of times, follow the group in your regex with a number in curly brackets. For example, the regex (Ha){3} will match the string 'HaHaHa', but it will not match 'HaHa', since the latter has only two repeats of the (Ha) group.

Instead of one number, you can specify a range by writing a minimum, a comma, and a maximum in between the curly brackets. For example, the regex (Ha){3,5} will match 'HaHaHa', 'HaHaHaHa', and 'HaHaHaHaHa'.

You can also leave out the first or second number in the curly brackets to leave the minimum or maximum unbounded. For example, (Ha){3,} will match three or more instances of the (Ha) group, while (Ha){,5} will match zero to five instances. Curly brackets can help make your regular expressions shorter. These two regular expressions match identical patterns:

(Ha){3}
(Ha)(Ha)(Ha)
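A minimal sketch of the four curly-bracket forms, run against a made-up string of seven Ha repeats. The variable names are illustrative. Where a range allows several lengths, the results show the longest permitted run being taken.

```python
import re

text = 'Ha' * 7  # 'HaHaHaHaHaHaHa'

exact = re.compile(r'(Ha){3}').search(text).group()
bounded = re.compile(r'(Ha){3,5}').search(text).group()
atLeast = re.compile(r'(Ha){3,}').search(text).group()
atMost = re.compile(r'(Ha){,5}').search(text).group()

# Report how many repeats of 'Ha' each pattern consumed
for name, matched in [('{3}', exact), ('{3,5}', bounded),
                      ('{3,}', atLeast), ('{,5}', atMost)]:
    print(name, len(matched) // 2)

# {,5} allows zero repeats, so it matches the empty string when no Ha is present
emptyMatch = re.compile(r'(Ha){,5}').search('xyz').group()
```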

In [60]:
haRegex = re.compile(r'(ha){3}')
mo1 = haRegex.search('hahahahahahaha')
mo1.group()

'hahaha'

In [62]:
mo2 = haRegex.search('ha') 
mo2 == None # needs a minimum of ha * 3

True

### Greedy and Nongreedy Matching

Since (Ha){3,5} can match three, four, or five instances of Ha in the string 'HaHaHaHaHa', you may wonder why the Match object’s call to group() in the example below (which uses lowercase ha) returns the full five-repeat string instead of the shorter possibilities. After all, 'HaHaHa' and 'HaHaHaHa' are also valid matches of the regular expression (Ha){3,5}.

Python’s regular expressions are greedy by default, which means that in ambiguous situations they will match the longest string possible. The non-greedy version of the curly brackets, which matches the shortest string possible, has the closing curly bracket followed by a question mark.

In [63]:
greedyHaRegex = re.compile(r'(ha){3,5}') # greedy approach, which means that it will find the longest string possible
mo1 = greedyHaRegex.search('hahahahaha')
mo1.group()

'hahahahaha'

In [65]:
nongreedyHaRegEx = re.compile(r'(ha){3,5}?') # nongreedy with ? searches for the min needed (3)
mo2 = nongreedyHaRegEx.search('hahahahaha')
mo2.group()

'hahaha'

### The findall() Method

In addition to the search() method, Regex objects also have a findall() method. While search() will return a Match object of the first matched text in the searched string, the findall() method will return the strings of every match in the searched string. The following example shows how search() returns a Match object only for the first instance of matching text:

In [10]:
phoneNumRegEx = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mo = phoneNumRegEx.search('Cell: 408-888-8888 Work: 408-999-9999')
print(mo.group())
print(mo.groups())

408-888-8888
()


On the other hand, findall() will not return a Match object but a list of strings—as long as there are no groups in the regular expression. Each string in the list is a piece of the searched text that matched the regular expression. 

In [67]:
phoneNumRegEx = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d') # has no groups
phoneNumRegEx.findall('Cell: 408-888-8888 Work: 408-999-9999')

['408-888-8888', '408-999-9999']

If there are groups in the regular expression, then findall() will return a list of tuples. Each tuple represents a found match, and its items are the matched strings for each group in the regex. To see findall() in action, enter the following into the interactive shell (notice that the regular expression being compiled now has groups in parentheses):

In [75]:
phoneNumRegEx = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)') # has groups vs no groups
phoneNumRegEx.findall('Cell: 408-888-8888 Work: 408-999-9999 Bok: 999-888-8888 Line: 555-5555')

[('408', '888', '8888'), ('408', '999', '9999'), ('999', '888', '8888')]

To summarize what the findall() method returns, remember the following:

- When called on a regex with no groups, such as \d\d\d-\d\d\d-\d\d\d\d, the method findall() returns a list of string matches, such as ['415-555-9999', '212-555-0000'].
- When called on a regex that has groups, such as (\d\d\d)-(\d\d\d)-(\d\d\d\d), the method findall() returns a list of tuples of strings (one string for each group), such as [('415', '555', '9999'), ('212', '555', '0000')].
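The sketch below reproduces both cases from the summary with the sample numbers used there. It also goes one step beyond the text: finditer(), another Regex method in the re module, yields Match objects, which is one way to get the full matched text back when the pattern contains groups. The variable names are illustrative.

```python
import re

text = 'Cell: 415-555-9999 Work: 212-555-0000'

noGroups = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
withGroups = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')

strings = noGroups.findall(text)    # list of strings
tuples = withGroups.findall(text)   # list of tuples, one item per group

# finditer() yields Match objects, so group() still gives the whole match
fullMatches = [mo.group() for mo in withGroups.finditer(text)]

print(strings)
print(tuples)
print(fullMatches)
```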



### Character Classes
| Shorthand character class | Represents |
| --- | --- |
| \d | Any numeric digit from 0 to 9. |
| \D | Any character that is not a numeric digit from 0 to 9. |
| \w | Any letter, numeric digit, or the underscore character. (Think of this as matching “word” characters.) |
| \W | Any character that is not a letter, numeric digit, or the underscore character. |
| \s | Any space, tab, or newline character. (Think of this as matching “space” characters.) |
| \S | Any character that is not a space, tab, or newline. |

Character classes are nice for shortening regular expressions. The character class [0-5] will match only the numbers 0 to 5; this is much shorter than typing (0|1|2|3|4|5).
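To illustrate, the sketch below runs both spellings over a made-up string and compares the results. Because the alternation form has one group, findall() returns that group's text, which here is the matched digit.

```python
import re

classRegex = re.compile(r'[0-5]')
pipeRegex = re.compile(r'(0|1|2|3|4|5)')

text = 'Room 4072, floor 9, bay 15'

classDigits = classRegex.findall(text)
pipeDigits = pipeRegex.findall(text)

print(classDigits)
print(pipeDigits)
```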

In [80]:
xmasRegex = re.compile(r'\d+\s\w+') # + means that the group preceding + has to appear at least once
xmasRegex.findall('4rose, 5 rose, 99 no yes 11, 12 drummers, 11 pipers, 10 lords, 9 ladies, 8 maids, 7 swans, 6 geese,\n5 rings, 4 birds, 3 hens, 2 doves, 1 partridge')

['5 rose',
 '99 no',
 '12 drummers',
 '11 pipers',
 '10 lords',
 '9 ladies',
 '8 maids',
 '7 swans',
 '6 geese',
 '5 rings',
 '4 birds',
 '3 hens',
 '2 doves',
 '1 partridge']

The regular expression \d+\s\w+ will match text that has one or more numeric digits (\d+), followed by a whitespace character (\s), followed by one or more letter/digit/underscore characters (\w+). The findall() method returns all matching strings of the regex pattern in a list.

### Making Your Own Character Classes

There are times when you want to match a set of characters but the shorthand character classes (\d, \w, \s, and so on) are too broad. You can define your own character class using square brackets. For example, the character class [aeiouAEIOU] will match any vowel, both lowercase and uppercase.

In [81]:
# Match any vowels
vowelRegex = re.compile(r'[aeiouAEIOU]')
vowelRegex.findall('Robocop eats baby food. BABY FOOD.')

['o', 'o', 'o', 'e', 'a', 'a', 'o', 'o', 'A', 'O', 'O']

You can also include ranges of letters or numbers by using a hyphen. For example, the character class [a-zA-Z0-9] will match all lowercase letters, uppercase letters, and numbers.

Note that inside the square brackets, the normal regular expression symbols are not interpreted as such. This means you do not need to escape the ., *, ?, or () characters with a preceding backslash. For example, the character class [0-5.] will match digits 0 to 5 and a period. You do not need to write it as [0-5\.].
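A small sketch of this rule on a made-up version string: the escaped and unescaped forms of the class find the same characters, while a dot outside square brackets matches any character.

```python
import re

text = 'v3.14, build 9.0'

plainClass = re.compile(r'[0-5.]').findall(text)
escapedClass = re.compile(r'[0-5\.]').findall(text)

# Outside brackets the dot is a wildcard; inside brackets it is a literal period
wildcard = re.compile(r'.').findall('a.b')
literalDot = re.compile(r'[.]').findall('a.b')

print(plainClass)
print(wildcard, literalDot)
```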

By placing a caret character (^) just after the character class’s opening bracket, you can make a negative character class. A negative character class will match all the characters that are not in the character class. 

In [82]:
consonantRegex = re.compile(r'[^aeiouAEIOU]')
consonantRegex.findall('Robocop eats baby food. BABY FOOD.')

['R',
 'b',
 'c',
 'p',
 ' ',
 't',
 's',
 ' ',
 'b',
 'b',
 'y',
 ' ',
 'f',
 'd',
 '.',
 ' ',
 'B',
 'B',
 'Y',
 ' ',
 'F',
 'D',
 '.']

In [83]:
# Match any letter or digit
alnumRegex = re.compile(r'[a-zA-Z0-9]')
alnumRegex.findall('Robocop eats baby food. BABY FOOD654sad6a5s4ad6.')

['R',
 'o',
 'b',
 'o',
 'c',
 'o',
 'p',
 'e',
 'a',
 't',
 's',
 'b',
 'a',
 'b',
 'y',
 'f',
 'o',
 'o',
 'd',
 'B',
 'A',
 'B',
 'Y',
 'F',
 'O',
 'O',
 'D',
 '6',
 '5',
 '4',
 's',
 'a',
 'd',
 '6',
 'a',
 '5',
 's',
 '4',
 'a',
 'd',
 '6']

### The Caret and Dollar Sign Characters

You can also use the caret symbol (^) at the start of a regex to indicate that a match must occur at the beginning of the searched text. Likewise, you can put a dollar sign ($) at the end of the regex to indicate the string must end with this regex pattern. And you can use the ^ and $ together to indicate that the entire string must match the regex; that is, it’s not enough for a match to be made on some subset of the string.

For example, the r'^Hello' regular expression string matches strings that begin with 'Hello'. 

In [84]:
beginsWithHello = re.compile(r'^Hello')
beginsWithHello.search('Hello world!')
beginsWithHello.search('He said hello.') == None

True

In [85]:
# The r'\d$' regular expression string matches strings that end with a numeric character from 0 to 9
endsWithNumber = re.compile(r'\d$')
endsWithNumber.search('Your number is 42')
endsWithNumber.search('Your number is forty two.') == None

True

In [87]:
# The r'^\d+$' regular expression string matches strings that consist entirely of one or more numeric characters
wholeStringIsNum = re.compile(r'^\d+$')
wholeStringIsNum.search('1234567890')

<re.Match object; span=(0, 10), match='1234567890'>

In [88]:
wholeStringIsNum.search('12345xyz67890') == None # xyz (not numbers)

True

In [89]:
wholeStringIsNum.search('12 34567890') == None # space 

True

The two preceding search() calls demonstrate how the entire string must match the regex if ^ and $ are used: a stray xyz or a space anywhere in the string prevents a match.

In [91]:
wholeStringIsNum.search('1234567890') == None

False

### Matching Everything with Dot-Star
Sometimes you will want to match everything and anything. For example, say you want to match the string 'First Name:', followed by any and all text, followed by 'Last Name:', and then followed by anything again. You can use the dot-star (.*) to stand in for that “anything.” Remember that the dot character means “any single character except the newline,” and the star character means “zero or more of the preceding character.”

In [11]:
# Match everything after first name and last name using (.*)
nameRegex = re.compile(r'First Name: (.*) Last Name: (.*)')
mo = nameRegex.search('First Name: Al Last Name: Swz')
print(mo.group(1)) # prints first group
print(mo.group(2)) # prints second group
print(mo.groups()) # prints tuple of groups

Al
Swz
('Al', 'Swz')


The dot-star uses greedy mode: It will always try to match as much text as possible. To match any and all text in a nongreedy fashion, use the dot, star, and question mark (.*?). Like with curly brackets, the question mark tells Python to match in a nongreedy way.

In [12]:
# See difference between greedy and nongreedy
nongreedyRegex = re.compile(r'<.*?>') # non greedy using ?
mo = nongreedyRegex.search('<To serve man> for dinner>') # same string as the greedy example below
mo.group()

'<To serve man>'

In [14]:
# Greedy approach
greedyRegex = re.compile(r'<.*>') # greedy approach without ?
mo = greedyRegex.search('<To serve man> for dinner>') # extra > at the end
mo.group()

'<To serve man> for dinner>'

Both regexes roughly translate to “Match an opening angle bracket, followed by anything, followed by a closing angle bracket.” But the string '<To serve man> for dinner>' has two possible matches for the closing angle bracket. In the nongreedy version of the regex, Python matches the shortest possible string: '<To serve man>'. In the greedy version, Python matches the longest possible string: '<To serve man> for dinner>'.

### Matching Newlines with the Dot Character
The dot-star will match everything except a newline. By passing re.DOTALL as the second argument to re.compile(), you can make the dot character match all characters, including the newline character.

In [15]:
noNewlineRegex = re.compile('.*') # without re.DOTALL, the dot matches everything except a newline
noNewlineRegex.search('Serve the public\nProtecc the food\nAttacc the intruder').group()

'Serve the public'

In [16]:
newlineRegex = re.compile('.*', re.DOTALL) # with re.DOTALL, the dot also matches newlines
newlineRegex.search('Serve the public\nProtecc the food\nAttacc the intruder').group()

'Serve the public\nProtecc the food\nAttacc the intruder'

The regex noNewlineRegex, which did not have re.DOTALL passed to the re.compile() call that created it, will match everything only up to the first newline character, whereas newlineRegex, which did have re.DOTALL passed to re.compile(), matches everything. This is why the newlineRegex.search() call matches the full string, including its newline characters.

### Review of Regex Symbols
This chapter covered a lot of notation, so here’s a quick review of what you learned:

The ? matches zero or one of the preceding group.

The * matches zero or more of the preceding group.

The + matches one or more of the preceding group.

The {n} matches exactly n of the preceding group.

The {n,} matches n or more of the preceding group.

The {,m} matches 0 to m of the preceding group.

The {n,m} matches at least n and at most m of the preceding group.

{n,m}? or *? or +? performs a nongreedy match of the preceding group.

^spam means the string must begin with spam.

spam$ means the string must end with spam.

The . matches any character, except newline characters.

\d, \w, and \s match a digit, word, or space character, respectively.

\D, \W, and \S match anything except a digit, word, or space character, respectively.

[abc] matches any character between the brackets (such as a, b, or c).

[^abc] matches any character that isn’t between the brackets.
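The review above can be sketched as a table of small checks. Each row pairs a pattern with a made-up sample string and the first match it should produce (None where no match is expected). For brevity the sketch calls the module-level re.search() function instead of compiling each pattern first.

```python
import re

# (pattern, text, expected first match or None)
checks = [
    (r'Bat(wo)?man', 'Batwoman', 'Batwoman'),
    (r'Bat(wo)*man', 'Batman', 'Batman'),
    (r'Bat(wo)+man', 'Batman', None),
    (r'(Ha){2}', 'HaHaHa', 'HaHa'),
    (r'(Ha){2,}', 'HaHaHa', 'HaHaHa'),
    (r'(Ha){,2}', 'HaHaHa', 'HaHa'),
    (r'(Ha){1,2}?', 'HaHaHa', 'Ha'),
    (r'^spam', 'spam and eggs', 'spam'),
    (r'spam$', 'eggs and spam', 'spam'),
    (r'\d\s\w', '7 up', '7 u'),
    (r'[^abc]', 'cab!', '!'),
]

results = []
for pattern, text, expected in checks:
    mo = re.search(pattern, text)
    found = mo.group() if mo else None
    results.append(found == expected)
    print(f'{pattern!r:16} on {text!r:16} -> {found!r}')
```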

### Case-Insensitive Matching
Normally, regular expressions match text with the exact casing you specify. For example, the following regexes match completely different strings:

In [19]:
regex1 = re.compile('Robocop')
regex2 = re.compile('ROBOCOP')
regex3 = re.compile('robOcop')
regex4 = re.compile('RobocOp')

But sometimes you care only about matching the letters without worrying whether they’re uppercase or lowercase. To make your regex case-insensitive, you can pass re.IGNORECASE or re.I as a second argument to re.compile().

In [20]:
robocop = re.compile(r'robocop', re.I)
robocop.search('Robocop is part man, part machine').group()

'Robocop'

In [21]:
robocop.search('ROBOcOp is a robot').group()

'ROBOcOp'

In [22]:
robocop.search('robocop n rObOcOp').group()

'robocop'

In [23]:
robocop.search('robocop n rObOcOp').groups()

()

In [27]:
mo = robocop.search('robocop n rObOcOp')
print(mo.group())
# print(mo.group(2))
# print(mo.group(1))

robocop


### Substituting Strings with the sub() Method
Regular expressions can not only find text patterns but can also substitute new text in place of those patterns. The sub() method for Regex objects is passed two arguments. The first argument is the string that replaces each match. The second is the string to search, in which every match of the regular expression is replaced. The sub() method returns a string with the substitutions applied.

In [28]:
namesRegex = re.compile(r'Agent \w+')
namesRegex.sub('CENSORED', 'Agent Alice is a secret agent with Agent Bob.')

'CENSORED is a secret agent with CENSORED.'

Sometimes you may need to use the matched text itself as part of the substitution. In the first argument to sub(), you can type \1, \2, \3, and so on, to mean “Enter the text of group 1, 2, 3, and so on, in the substitution.”

For example, say you want to censor the names of the secret agents by showing just the first letters of their names. To do this, you could use the regex Agent (\w)\w* and pass r'\1****' as the first argument to sub(). The \1 in that string will be replaced by whatever text was matched by group 1—that is, the (\w) group of the regular expression.

In [38]:
# Show first letter of the agent name only
agentNamesRegex = re.compile(r'Agent (\w)\w*') # (\w) captures the first letter; \w* matches the rest of the name so that it is replaced too
agentNamesRegex.sub(r'\1****', 'Agent Alice told Agent Carol to buzz off before she called Agent Jack over.')

'A**** told C**** to buzz off before she called J**** over.'

### Managing Complex Regexes
Regular expressions are fine if the text pattern you need to match is simple. But matching complicated text patterns might require long, convoluted regular expressions. You can mitigate this by telling the re.compile() function to ignore whitespace and comments inside the regular expression string. This “verbose mode” can be enabled by passing the variable re.VERBOSE as the second argument to re.compile().

Now instead of a hard-to-read regular expression like this:

phoneRegex = re.compile(r'((\d{3}|\(\d{3}\))?(\s|-|\.)?\d{3}(\s|-|\.)\d{4}(\s*(ext|x|ext.)\s*\d{2,5})?)')

you can spread the regular expression over multiple lines with comments like this:

phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?            # area code
    (\s|-|\.)?                    # separator
    \d{3}                         # first 3 digits
    (\s|-|\.)                     # separator
    \d{4}                         # last 4 digits
    (\s*(ext|x|ext.)\s*\d{2,5})?  # extension
    )''', re.VERBOSE)
  
Note how the previous example uses the triple-quote syntax (''') to create a multiline string so that you can spread the regular expression definition over many lines, making it much more legible.

The comment rules inside the regular expression string are the same as regular Python code: The # symbol and everything after it to the end of the line are ignored. Also, the extra spaces inside the multiline string for the regular expression are not considered part of the text pattern to be matched. This lets you organize the regular expression so it’s easier to read.
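A Regex object compiled in verbose mode is used like any other. The sketch below compiles the pattern as written above and tries it on two made-up strings; the sample text is an assumption for illustration.

```python
import re

phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?            # area code
    (\s|-|\.)?                    # separator
    \d{3}                         # first 3 digits
    (\s|-|\.)                     # separator
    \d{4}                         # last 4 digits
    (\s*(ext|x|ext.)\s*\d{2,5})?  # extension
    )''', re.VERBOSE)

# The whitespace and comments in the pattern play no part in matching
withExtension = phoneRegex.search('Call 415-555-4242 ext 123 today').group()
withParens = phoneRegex.search('Office: (415) 555-4242').group()

print(withExtension)
print(withParens)
```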

### Combining re.IGNORECASE, re.DOTALL, and re.VERBOSE
What if you want to use re.VERBOSE to write comments in your regular expression but also want to use re.IGNORECASE to ignore capitalization? Unfortunately, the re.compile() function takes only a single value as its second argument. You can get around this limitation by combining the re.IGNORECASE, re.DOTALL, and re.VERBOSE variables using the pipe character (|), which in this context is known as the bitwise or operator.

So if you want a regular expression that’s case-insensitive and whose dot character also matches newlines, you would form your re.compile() call like this:

In [34]:
# Regex that's case-insensitive and whose dot character also matches newlines
someRegexvalue = re.compile('foo', re.IGNORECASE | re.DOTALL)
someRegexvalue.search('FoO')

<re.Match object; span=(0, 3), match='FoO'>

In [36]:
# All three options for the second argument
someRegexvalue = re.compile('foo', re.IGNORECASE | re.DOTALL | re.VERBOSE)
someRegexvalue.search('fOOFoo')

<re.Match object; span=(0, 3), match='fOO'>

### Project: Phone Number and Email Address Extractor
Say you have the boring task of finding every phone number and email address in a long web page or document. If you manually scroll through the page, you might end up searching for a long time. But if you had a program that could search the text in your clipboard for phone numbers and email addresses, you could simply press CTRL-A to select all the text, press CTRL-C to copy it to the clipboard, and then run your program. It could replace the text on the clipboard with just the phone numbers and email addresses it finds.

Whenever you’re tackling a new project, it can be tempting to dive right into writing code. But more often than not, it’s best to take a step back and consider the bigger picture. I recommend first drawing up a high-level plan for what your program needs to do. Don’t think about the actual code yet—you can worry about that later. Right now, stick to broad strokes.

For example, your phone and email address extractor will need to do the following:

1. Get the text off the clipboard.
2. Find all phone numbers and email addresses in the text.
3. Paste them onto the clipboard.

Now you can start thinking about how this might work in code. The code will need to do the following:

- Use the pyperclip module to copy and paste strings.
- Create two regexes, one for matching phone numbers and the other for matching email addresses.
- Find all matches, not just the first match, of both regexes.
- Neatly format the matched strings into a single string to paste.
- Display some kind of message if no matches were found in the text.

This list is like a road map for the project. As you write the code, you can focus on each of these steps separately. Each step is fairly manageable and expressed in terms of things you already know how to do in Python.

In [15]:
#! python3
# phoneAndEmail.py - Finds phone numbers and email addresses on the clipboard.

import pyperclip, re

# Create two regexes, one for matching phone numbers and the other for matching email addresses.
phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?                # area code: 3 digits, optionally inside parentheses
    (\s|-|\.)?                        # separator: whitespace (\s), hyphen, or period, joined with pipes
    (\d{3})                           # first 3 digits
    (\s|-|\.)                         # separator
    (\d{4})                           # last 4 digits
    (\s*(ext|x|ext\.)\s*(\d{2,5}))?   # extension
    )''', re.VERBOSE)

# Create regex for email addresses
emailRegex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+      # username: letters, digits, dot, underscore, percent sign, plus sign, hyphen
    @                      # @ symbol
    [a-zA-Z0-9.-]+         # domain name
    (\.[a-zA-Z]{2,4})      # dot-something (2 to 4 letters, so longer endings such as .museum are cut short)
    )''', re.VERBOSE)

# Find all matches in the clipboard text
text = str(pyperclip.paste())
matches = []  # collects the formatted phone numbers and email addresses

# findall() returns one tuple per match, with one string for each group in the regular expression.
# Index 0 holds the entire match. Indexes 1, 3, 5, and 8 hold the area code, first three digits,
# last four digits, and extension. Rebuilding the number from these pieces puts every phone
# number into a single, standard format.
for groups in phoneRegex.findall(text):
    phoneNum = '-'.join([groups[1], groups[3], groups[5]])
    if groups[8] != '':
        phoneNum += ' x' + groups[8]
    matches.append(phoneNum)

# For email addresses the entire match is what you want, so append index 0 of each tuple
for groups in emailRegex.findall(text):
    matches.append(groups[0])

# Join the matches into a single string and copy it to the clipboard
if len(matches) > 0:
    pyperclip.copy('\n'.join(matches))
    print('Copied to clipboard:')
    print('\n'.join(matches))
else:
    print('No phone numbers or email addresses found.')
    

No phone numbers or email addresses found.
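The run above reports no matches because it depends on whatever happened to be on the clipboard. A sketch that takes the text as a function argument is easier to try out. The function name extract_contacts() and the sample text are assumptions made for this illustration; the regexes follow the ones in the cell above, with the dot in "ext." escaped.

```python
import re

phone_regex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?                # area code
    (\s|-|\.)?                        # separator
    (\d{3})                           # first 3 digits
    (\s|-|\.)                         # separator
    (\d{4})                           # last 4 digits
    (\s*(ext|x|ext\.)\s*(\d{2,5}))?   # extension
    )''', re.VERBOSE)

email_regex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+      # username
    @                      # @ symbol
    [a-zA-Z0-9.-]+         # domain name
    (\.[a-zA-Z]{2,4})      # dot-something
    )''', re.VERBOSE)


def extract_contacts(text):
    """Return the phone numbers and email addresses found in text."""
    matches = []
    for groups in phone_regex.findall(text):
        # Indexes 1, 3, 5: area code, first three digits, last four digits
        phone_num = '-'.join([groups[1], groups[3], groups[5]])
        if groups[8] != '':
            phone_num += ' x' + groups[8]
        matches.append(phone_num)
    for groups in email_regex.findall(text):
        matches.append(groups[0])   # index 0 is the whole email address
    return matches


sample_text = ('Call 415-555-4242 ext 123 or 800.420.7240, '
               'or write to info@example.com.')
print(extract_contacts(sample_text))
```

With this split, the clipboard code shrinks to reading the text with pyperclip.paste(), calling the function, and copying the joined result back.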


In [40]:
#! python3
# phoneAndEmail.py - Finds phone numbers and email addresses on the clipboard.

import pyperclip, re

# Create two regexes, one for matching phone numbers and the other for matching email addresses.
phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?                # area code: 3 digits, optionally inside parentheses
    (\s|-|\.)?                        # separator: whitespace (\s), hyphen, or period, joined with pipes
    (\d{3})                           # first 3 digits
    (\s|-|\.)                         # separator
    (\d{4})                           # last 4 digits
    (\s*(ext|x|ext\.)\s*(\d{2,5}))?   # extension
    )''', re.VERBOSE)

# Test the regex on a sample string instead of the clipboard
mo = phoneRegex.findall('800.420.7240')
print(mo)        # list with one tuple per match
print(mo[0])     # first (and only) tuple
print(mo[0][0])  # index 0: the entire match
print(mo[0][1])  # index 1: area code
print(mo[0][3])  # index 3: first 3 digits
print(mo[0][5])  # index 5: last 4 digits
print('-'.join([mo[0][1], mo[0][3], mo[0][5]]))  # number in the standard format


[('800.420.7240', '800', '.', '420', '.', '7240', '', '', '')]
('800.420.7240', '800', '.', '420', '.', '7240', '', '', '')
800.420.7240
800
420
7240
800-420-7240
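The numeric indexes 1, 3, and 5 are easy to get wrong when the pattern changes. As a possible alternative, the sketch below uses named groups and finditer(), so each piece is looked up by name. The group names (area, prefix, line, ext) and the function format_numbers() are hypothetical choices for this example.

```python
import re

named_phone_regex = re.compile(r'''
    (?P<area>\d{3}|\(\d{3}\))?                # area code
    (?:\s|-|\.)?                              # separator (non-capturing)
    (?P<prefix>\d{3})                         # first 3 digits
    (?:\s|-|\.)                               # separator (non-capturing)
    (?P<line>\d{4})                           # last 4 digits
    (?:\s*(?:ext\.?|x)\s*(?P<ext>\d{2,5}))?   # optional extension
    ''', re.VERBOSE)


def format_numbers(text):
    """Return every phone number in text in the form 415-555-1011 x99."""
    numbers = []
    for mo in named_phone_regex.finditer(text):
        # A group that did not take part in the match is None, so fall back to ''
        number = '-'.join([mo.group('area') or '', mo.group('prefix'), mo.group('line')])
        if mo.group('ext'):
            number += ' x' + mo.group('ext')
        numbers.append(number)
    return numbers


print(format_numbers('800.420.7240 and 415-555-1011 x99'))
```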


In [41]:
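# Match 42, 1,234, and 6,368,745 but not 12,34,567 or 1234.
# Because the pattern contains a capturing group, findall() returns only the text of that
# group (its last repetition) rather than the whole match.
numRegex = re.compile(r'^\d{1,3}(,\d{3})*$')
print(numRegex.findall('42'))         # matches; the group is never used, so the result is ['']
print(numRegex.findall('1,234'))      # matches; the group holds ',234'
print(numRegex.findall('6,368,745'))  # matches; the group holds its last repetition, ',745'
print(numRegex.findall('12,34,567'))  # no match: only two digits between commas
print(numRegex.findall('1234'))       # no match: lacks comma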

['']
[',234']
[',745']
[]
[]
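The results above are hard to read because findall() reports the contents of the capturing group instead of the whole number. One way around this, sketched here with a made-up name (comma_number), is a non-capturing group, (?:...), combined with fullmatch(), which only succeeds if the entire string fits the pattern.

```python
import re

# (?:...) groups without capturing, so the whole match is reported
comma_number = re.compile(r'\d{1,3}(?:,\d{3})*')

for candidate in ['42', '1,234', '6,368,745', '12,34,567', '1234']:
    print(candidate, comma_number.fullmatch(candidate) is not None)

# With no capturing group, findall() returns the full matched text
anchored = re.compile(r'^\d{1,3}(?:,\d{3})*$')
print(anchored.findall('6,368,745'))   # ['6,368,745']
```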


In [47]:
# Match the full name of someone whose last name is Nakamoto
# Pattern: a capital letter, zero or more lowercase letters, a whitespace character, then Nakamoto
strRegex = re.compile(r'[A-Z][a-z]*\sNakamoto')
print(strRegex.findall('Satoshi Nakamoto'))
print(strRegex.findall('Alice Nakamoto'))
print(strRegex.findall('satoshi Nakamoto')) # first name is not capitalized
print(strRegex.findall('Mr. Nakamoto')) # preceding word has a nonletter character
print(strRegex.findall('Mr Nakamoto ')) # 'Mr' fits the first-name pattern, so this matches
print(strRegex.findall(' Nakamoto ')) # no first name

['Satoshi Nakamoto']
['Alice Nakamoto']
[]
[]
['Mr Nakamoto']
[]


How would you write a regex that matches a sentence where the first word is either Alice, Bob, or Carol; the second word is either eats, pets, or throws; the third word is apples, cats, or baseballs; and the sentence ends with a period?

In [57]:
firstRegex = re.compile(r'(Alice|Bob|Carol)\s(eats|pets|throws)\s(apples|cats|baseballs)\.', re.IGNORECASE)
print(firstRegex.search('ALice eats apples.').group())
print(firstRegex.findall('ALice eats apples.'))
print(firstRegex.findall('ALice eats apples.')[0][1]) # second group of the first match

print(firstRegex.findall('ALICE THROWS FOOTBALLS.')) # FOOTBALLS is not one of the allowed third words

ALice eats apples.
[('ALice', 'eats', 'apples')]
eats
[]
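Note that search() and findall() also succeed when the sentence is only part of a longer string. If the whole string has to be exactly one such sentence, fullmatch() is one option. The sketch below assumes that stricter reading of the exercise; is_valid_sentence() is an illustrative name.

```python
import re

sentence_regex = re.compile(
    r'(Alice|Bob|Carol)\s(eats|pets|throws)\s(apples|cats|baseballs)\.',
    re.IGNORECASE)


def is_valid_sentence(text):
    """Return True only if the entire string is one matching sentence."""
    return sentence_regex.fullmatch(text) is not None


print(is_valid_sentence('Bob pets cats.'))              # True
print(is_valid_sentence('Bob pets cats. Really.'))      # False: extra text after the period
print(sentence_regex.search('Bob pets cats. Really.'))  # search() still finds a match here
```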


In [62]:
#! python3
# regex_strip.py

"""
Write a function that takes a string and does the same thing as the strip()
string method. If no other arguments are passed other than the string to strip,
then whitespace characters will be removed from the beginning and end of the
string. Otherwise, the characters specified in the second argument to the
function will be removed from the string.
"""

import re

test_string = input('Please enter a string to strip: ')
char_rm = input('What characters do you want to remove? (Press enter for whitespace.)')

# This function takes two arguments: the string to strip and an optional string of characters
# to remove. Like strip(), it only removes characters from the beginning and end of the string.
# If characters are given, they are escaped with re.escape() and placed in a character class,
# so symbols such as '-' or '^' are treated literally. Otherwise whitespace (\s) is stripped.
# \Z matches only at the very end of the string. To check that everything was removed,
# print out the length of the new string.
def white_strip(string, remove=''):
    if remove != '':
        chars = '[' + re.escape(remove) + ']+'
    else:
        chars = r'\s+'
    # One alternative for the leading run, one for the trailing run
    strip_regex = re.compile('^' + chars + '|' + chars + r'\Z')
    return strip_regex.sub('', string)


new_string = white_strip(test_string, char_rm)
print(new_string)

Please enter a string to strip: Write a function that uses regular expressions to make sure the password string it is passed is strong. A strong password is defined as one that is at least eight characters long, contains both uppercase and lowercase characters, and has at least one digit. You may need to test the string against multiple regex patterns to validate its strength.
What characters do you want to remove? (Press enter for whitespace.)a-z
Write a function that uses regular expressions to make sure the password string it is passed is strong. A strong password is defined as one that is at least eight characters long, contains both uppercase and lowercase characters, and has at least one digit. You may need to test the string against multiple regex patterns to validate its strength.
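A quick way to check a regex-based strip is to compare it with the built-in strip() method on a handful of inputs. The sketch below is self-contained: regex_strip() and the test cases are illustrative names and values, not part of the exercise text. It treats an empty or missing second argument as "strip whitespace".

```python
import re


def regex_strip(text, chars=None):
    """Strip chars (or whitespace if chars is empty) from both ends of text."""
    if chars:
        # re.escape() makes symbols such as '-', '^', or ']' literal inside the class
        char_class = '[' + re.escape(chars) + ']+'
    else:
        char_class = r'\s+'
    # ^ anchors the leading run, \Z anchors the trailing run at the very end
    pattern = re.compile('^' + char_class + '|' + char_class + r'\Z')
    return pattern.sub('', text)


cases = [
    ('  hello  ', None),
    ('\t\nhello world\n', None),
    ('xxhixx', 'x'),
    ('--a-b--', '-'),
    ('[tag]', '[]'),
    ('abcHelloabc', 'abc'),
    ('^caret^', '^'),
]

for text, chars in cases:
    # The built-in method is the reference behaviour
    print(repr(regex_strip(text, chars)), regex_strip(text, chars) == text.strip(chars))
```

Characters in the middle of the string, such as the hyphen in 'a-b', are left alone, just as with strip().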


In [63]:
#! python3
# strong_password_detection.py

"""
Write a function that uses regular expressions to make sure the password
string it is passed is strong. A strong password is defined as one that
is at least eight characters long, contains both uppercase and lowercase
characters, and has at least one digit. You may need to test the string
against multiple regex patterns to validate its strength.
"""

import re

# First, create a regex for each requirement given by the problem. There should be
# 1) at least 8 characters
# 2) at least one uppercase letter
# 3) at least one lowercase letter
# 4) at least one digit
# Regexes can be tricky! Make sure all symbols, parentheses, and escape characters are
# placed in the correct spot. This may take some practice! It may help to test each regex
# one at a time until you get used to it.
length_regex = re.compile(r'.{8,}') # at least 8 characters long
lower_case_regex = re.compile(r'[a-z]') # contains a lowercase character
upper_case_regex = re.compile(r'[A-Z]') # contains an uppercase character
digit_regex = re.compile(r'\d') # contains at least 1 digit

# Collect the regexes created above in a list.
regex_list = [length_regex,
              lower_case_regex,
              upper_case_regex,
              digit_regex]

# regex_count is set to 0 when the function starts. The for loop checks each regex in turn and
# breaks out as soon as the password fails one of the criteria. Each successful search adds 1
# to regex_count. If every regex in the list matched, the function prints a message saying
# the password is strong enough.
def strong_password(password):
    regex_count = 0
    for regex in regex_list:
        if regex.search(password) is None:
            print('Sorry, your password is not strong enough')
            break
        regex_count += 1
    # Compare with ==, not "is": "is" tests object identity, not numeric equality
    if regex_count == len(regex_list):
        print('Congrats. Your password is strong enough!')

# User input to type in a password
pw = input('Please type in a password:\n')
strong_password(pw)

# Below is a list of passwords you can test. Copy it into your code and remove the comments
pw_test1 = 'testpw'
pw_test2 = 'Testpw'
pw_test3 = 'TESTPW'
pw_test4 = 'TESTPW123'
pw_test5 = 'Testpw123'
pw_test6 = 'TESTPW123!@#'
pw_test7 = 'Tb1@Tb1@'
pw_test8 = 'TestPW123'
pw_test9 = '!@345ssfe@#23T4'

strong_password(pw_test1)
strong_password(pw_test2)
strong_password(pw_test3)
strong_password(pw_test4)
strong_password(pw_test5)
strong_password(pw_test6)
strong_password(pw_test7)
strong_password(pw_test8)
strong_password(pw_test9)

"""
This is an alternate way you can write this function. If you don't want to iterate through a loop
and add a count iterator, then you can create an if loop with each regex separately. The first time
it fails it will return a statement saying your password isn't strong enough. If it makes it through
the end then the password is strong enough
def is_strong_pw(password):
    if pw_reg_1.search(password) is None:
        return 'Your password is not strong enough'
    if pw_reg_2.search(password) is None:
        return 'Your password is not strong enough'
    if pw_reg_3.search(password) is None:
        return 'Your password is not strong enough'
    if pw_reg_4.search(password) is None:
        return 'Your password is not strong enough'
    if pw_reg_5.search(password) is None:
        return 'Your password is not strong enough'
    else:
        return 'Congrats! Your password will suffice!'
pw_reg_1 = re.compile(r'[a-z]+')
pw_reg_2 = re.compile(r'[A-Z]+')
pw_reg_3 = re.compile(r'[0-9]+')
pw_reg_4 = re.compile(r'[!@#$^&*()]+')
pw_reg_5 = re.compile(r'.{8,}')
"""

Please type in a password:
a
Sorry, your password is not strong enough
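Printing a message inside the function makes the result hard to reuse or test. A possible variation, sketched below, returns True or False and leaves the printing to the caller. The names PASSWORD_CHECKS and is_strong_password() are assumptions for this example; the four checks are the ones listed in the exercise.

```python
import re

PASSWORD_CHECKS = [
    re.compile(r'.{8,}', re.DOTALL),  # at least 8 characters of any kind
    re.compile(r'[a-z]'),             # at least one lowercase letter
    re.compile(r'[A-Z]'),             # at least one uppercase letter
    re.compile(r'\d'),                # at least one digit
]


def is_strong_password(password):
    """Return True if the password passes every check in PASSWORD_CHECKS."""
    # all() stops at the first check whose search() returns None
    return all(check.search(password) for check in PASSWORD_CHECKS)


for candidate in ['testpw', 'TESTPW123', 'Testpw123', 'Tb1@Tb1@']:
    print(candidate, is_strong_password(candidate))
```

The first two candidates fail (too short with no digit or uppercase letter, and no lowercase letter, respectively), while the last two satisfy all four checks.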


"\nThis is an alternate way you can write this function. If you don't want to iterate through a loop\nand add a count iterator, then you can create an if loop with each regex separately. The first time\nit fails it will return a statement saying your password isn't strong enough. If it makes it through\nthe end then the password is strong enough\ndef is_strong_pw(password):\n    if pw_reg_1.search(password) is None:\n        return 'Your password is not strong enough'\n    if pw_reg_2.search(password) is None:\n        return 'Your password is not strong enough'\n    if pw_reg_3.search(password) is None:\n        return 'Your password is not strong enough'\n    if pw_reg_4.search(password) is None:\n        return 'Your password is not strong enough'\n    if pw_reg_5.search(password) is None:\n        return 'Your password is not strong enough'\n    else:\n        return 'Congrats! Your password will suffice!'\npw_reg_1 = re.compile(r'[a-z]+')\npw_reg_2 = re.compile(r'[A-Z]+')\npw_reg_