In [1]:
# Create table
tableData = [['apples', 'oranges', 'cherries', 'banana'],
             ['Alice', 'Bob', 'Carol', 'David'],
             ['dogs', 'cats', 'moose', 'goose']]

print(len(tableData))

# Results would look like below
#   apples Alice  dogs
#  oranges   Bob  cats
# cherries Carol moose
#   banana David goose


def print_table(table):
    # Create a list of zeros, one for each inner list in table
    col_width = [0] * len(table)
    print(col_width)

3
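
The print_table() stub above stops after setting up the column widths. One way it might be finished is sketched below. This is only an illustration under assumptions: the names col_widths, lines, and table_data are not from the original notebook, and the function returns the formatted lines in addition to printing them so the result is easy to inspect.

```python
def print_table(table):
    """Print a list of columns as a right-justified text table."""
    # One width per inner list: the length of its longest string.
    col_widths = [max(len(item) for item in column) for column in table]

    lines = []
    # Each inner list is a column, so zip(*table) yields one row at a time.
    for row in zip(*table):
        cells = [item.rjust(width) for item, width in zip(row, col_widths)]
        lines.append(' '.join(cells))

    print('\n'.join(lines))
    return lines


table_data = [['apples', 'oranges', 'cherries', 'banana'],
              ['Alice', 'Bob', 'Carol', 'David'],
              ['dogs', 'cats', 'moose', 'goose']]
rows = print_table(table_data)
```

Each inner list is treated as one column, which is why the rows are built with zip(*table) rather than by looping over the inner lists directly.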


# Pattern Matching with Regular Expressions

### Finding text patterns without using regular expressions vs. how to simplify with regular expressions
I’ll show you basic matching with regular expressions and then move on to some more powerful features, such as string substitution and creating your own character classes. Finally, at the end of the chapter, you’ll write a program that can automatically extract phone numbers and email addresses from a block of text.

### Finding Patterns of Text without Regular Expressions
Say you want to find a phone number in a string. You know the pattern: three numbers, a hyphen, three numbers, a hyphen, and four numbers. Here’s an example: 415-555-4242.

In [4]:
# Define isPhoneNumber() to check whether a string matches this pattern and return True or False
def isPhoneNumber(text):
    # Check that the string is exactly 12 characters
    if len(text) != 12:
        return False
    for i in range(0, 3):
        # Checks that the area code consists of only numeric characters
        if not text[i].isdecimal():
            return False
    # The number must have the first hyphen after the area code
    if text[3] != '-':
        return False
    # Checks for 3 more numeric characters
    for i in range(4, 7):
        if not text[i].isdecimal():
            return False
    # Check for another hyphen
    if text[7] != '-':
        return False
    # Check for the final 4 numeric characters
    for i in range(8, 12):
        if not text[i].isdecimal():
            return False
    # Return True if everything holds True
    return True

print('415-555-4242 is a phone number:')
print(isPhoneNumber('415-555-4242'))
print('Moshi moshi is a phone number:')
print(isPhoneNumber('Moshi moshi'))
print('408999999 is a phone number without dashes but will return false:')
print(isPhoneNumber('408999999'))

415-555-4242 is a phone number:
True
Moshi moshi is a phone number:
False
408999999 is a phone number without dashes but will return false:
False


In [5]:
message = 'Call me at 415-555-1011 tomorrow. 415-555-9999 is my office.'
message

'Call me at 415-555-1011 tomorrow. 415-555-9999 is my office.'

In [12]:
for i in range(len(message)):
    chunk = message[i:i+12]
#     print(chunk)

In [8]:
# Loop through 12-character chunks of message and print each chunk that isPhoneNumber() accepts
for i in range(len(message)):
    chunk = message[i:i+12]
    # Pass chunk to isPhoneNumber() to see whether it matches the phone number pattern
    if isPhoneNumber(chunk):
        print('Phone number found: ' + chunk)
print('Done')

# The loop goes through the entire string, testing each 12-character piece and printing any chunk it finds that 
# satisfies isPhoneNumber(). Once we’re done going through message, we print Done.

Phone number found: 415-555-1011
Phone number found: 415-555-9999
Done


In [11]:
print(message[0:0+12])
print(message[1:1+12])

Call me at 4
all me at 41


### Finding Patterns of Text with Regular Expressions
Regular expressions, called regexes for short, are descriptions for a pattern of text. For example, a \d in a regex stands for a digit character, that is, any single numeral from 0 to 9. The regex \d\d\d-\d\d\d-\d\d\d\d is used by Python to match the same text the previous isPhoneNumber() function did: a string of three digits, a hyphen, three more digits, another hyphen, and four digits. Any other string would not match the \d\d\d-\d\d\d-\d\d\d\d regex.

But regular expressions can be much more sophisticated. For example, adding a 3 in curly brackets ({3}) after a pattern is like saying, “Match this pattern three times.” So the slightly shorter regex \d{3}-\d{3}-\d{4} also matches the correct phone number format.
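
As a quick check of that claim, the sketch below runs both forms of the pattern over a few sample strings and compares the results. It relies on re.compile() and search(), which are introduced in the next sections. The sample strings and variable names are made up for illustration.

```python
import re

long_form = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
short_form = re.compile(r'\d{3}-\d{3}-\d{4}')

samples = ['415-555-4242', 'Call 415-555-1011 now', '415-55-4242', 'Moshi moshi']

results = []
for text in samples:
    long_hit = long_form.search(text)
    short_hit = short_form.search(text)
    # search() returns None when there is no match, so guard before group().
    long_text = long_hit.group() if long_hit else None
    short_text = short_hit.group() if short_hit else None
    results.append((long_text, short_text))
    print(f'{text!r}: {long_text} / {short_text}')
```

For every sample, the two patterns should report the same matched text, or both report None.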

### Creating Regex Objects
All of the regex functions in Python are in the re module, so import it with import re before using them.

Passing a string value representing your regular expression to re.compile() returns a Regex pattern object (or simply, a Regex object).

In [15]:
import re

# Create a Regex object that matches the phone number pattern
#  Putting an r before the opening quote marks the string as a raw string,
#  in which backslashes are kept literally instead of starting escape sequences
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
phoneNumRegex

re.compile(r'\d\d\d-\d\d\d-\d\d\d\d', re.UNICODE)

### Matching Regex Objects
A Regex object’s search() method searches the string it is passed for any matches to the regex. The search() method will return None if the regex pattern is not found in the string. If the pattern is found, the search() method returns a Match object. Match objects have a group() method that will return the actual matched text from the searched string.

In [16]:
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mo = phoneNumRegex.search('My number is 415-555-4242.')
mo

<re.Match object; span=(13, 25), match='415-555-4242'>

In [17]:
print('Phone number found: ' + mo.group())

Phone number found: 415-555-4242


The mo variable name is just a generic name to use for Match objects. This example might seem complicated at first, but it is much shorter than the earlier isPhoneNumber() function and does the same thing.

Here, we pass our desired pattern to re.compile() and store the resulting Regex object in phoneNumRegex. Then we call search() on phoneNumRegex and pass search() the string we want to search for a match. The result of the search gets stored in the variable mo. In this example, we know that our pattern will be found in the string, so we know that a Match object will be returned. Knowing that mo contains a Match object and not the null value None, we can call group() on mo to return the match. Writing mo.group() inside our print statement displays the whole match, 415-555-4242.

### Review of Regular Expression Matching

While there are several steps to using regular expressions in Python, each step is fairly simple.

1. Import the regex module with import re.
2. Create a Regex object with the re.compile() function. (Remember to use a raw string.)
3. Pass the string you want to search into the Regex object’s search() method. This returns a Match object, or None if the pattern is not found.
4. Call the Match object’s group() method to return a string of the actual matched text.
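
The four steps above can be sketched as one small helper. The function name find_first_phone_number is hypothetical and only used here for illustration. The None check is there because search() returns None when the pattern is not found.

```python
import re  # Step 1: import the regex module.


def find_first_phone_number(text):
    """Return the first 'ddd-ddd-dddd' phone number in text, or None."""
    # Step 2: compile the pattern from a raw string.
    phone_regex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
    # Step 3: search() returns a Match object, or None when nothing matches.
    match = phone_regex.search(text)
    if match is None:
        return None
    # Step 4: group() returns the matched text.
    return match.group()


print(find_first_phone_number('My number is 415-555-4242.'))
print(find_first_phone_number('No number here.'))
```

Calling group() directly on the result of search() raises an AttributeError when there is no match, so checking for None first is the safer habit.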



### Grouping with Parentheses

Say you want to separate the area code from the rest of the phone number. Adding parentheses will create groups in the regex: (\d\d\d)-(\d\d\d-\d\d\d\d). Then you can use the group() match object method to grab the matching text from just one group.

The first set of parentheses in a regex string will be group 1. The second set will be group 2. By passing the integer 1 or 2 to the group() match object method, you can grab different parts of the matched text. Passing 0 or nothing to the group() method will return the entire matched text.

In [20]:
# Compile regex object
phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)') # group 1 (area code) and group 2 (main number)

# Create match object from string
mo = phoneNumRegex.search('my number is 415-222-2222')

# Retrieve individual groups from the match object
print(mo.group(1)) # first group
print(mo.group(2)) # second group
print(mo.group(0)) # entire match
print(mo.group()) # entire match

# Retrieve all groups at once
print(mo.groups()) # plural s returns a tuple

415
222-2222
415-222-2222
415-222-2222
('415', '222-2222')


In [21]:
areaCode, mainNumber = mo.groups() # multiple assignment unpacks the tuple
print(areaCode)
print(mainNumber)

415
222-2222


### Matching parentheses as well
Parentheses have a special meaning in regular expressions, but what do you do if you need to match a parenthesis in your text? For instance, maybe the phone numbers you are trying to match have the area code set in parentheses. In this case, you need to escape the ( and ) characters with a backslash.

In [30]:
phoneNumRegex = re.compile(r'(\(\d\d\d\)) (\d\d\d-\d\d\d\d)') # \( and \) match literal parentheses; the outer ( ) still create group 1
mo = phoneNumRegex.search('text this number (408) 999-9999')
print(mo.group(1))
print(mo.group(2))
print(mo.groups())

(408)
999-9999
('(408)', '999-9999')


### Matching Multiple Groups with the Pipe

The | character is called a pipe. You can use it anywhere you want to match one of many expressions. For example, the regular expression r'Batman|Tina Fey' will match either 'Batman' or 'Tina Fey'.

When both Batman and Tina Fey occur in the searched string, the first occurrence of matching text will be returned as the Match object.

In [32]:
# Search for Batman or Tina Fey using Regex
heroRegex = re.compile(r'Batman|Tina Fey')
mo_1 = heroRegex.search('Batman and Tina Fey')
mo_1.group()

'Batman'

In [34]:
mo_2 = heroRegex.search('Tina Fey and Batman')
mo_2.group()

'Tina Fey'

You can also use the pipe to match one of several patterns as part of your regex. For example, say you wanted to match any of the strings 'Batman', 'Batmobile', 'Batcopter', and 'Batbat'. Since all these strings start with Bat, it would be nice if you could specify that prefix only once. This can be done with parentheses. 

In [40]:
# Regex for words that share the prefix 'Bat'
batRegex = re.compile(r'Bat(man|mobile|copter|bat)') # 'Bat' followed by one of four alternatives
mo = batRegex.search('Batmobile lost a wheel with Batman')
print(mo.group()) # returns full matched text 'Batmobile'
print(mo.groups())
print(mo.group(1)) # returns just the part of the matched text inside the first parentheses group 'mobile'

Batmobile
('mobile',)
mobile


The method call mo.group() returns the full matched text 'Batmobile', while mo.group(1) returns just the part of the matched text inside the first parentheses group, 'mobile'. By using the pipe character and grouping parentheses, you can specify several alternative patterns you would like your regex to match.

If you need to match an actual pipe character, escape it with a backslash, like '\|'.
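
A small sketch of the difference between an unescaped and an escaped pipe follows. The sample text and variable names are made up for illustration.

```python
import re

# Unescaped pipe: alternation between 'a' and 'b'.
either_regex = re.compile(r'a|b')
# Escaped pipe: the literal three-character text 'a|b'.
literal_regex = re.compile(r'a\|b')

text = 'choose a|b or just b'
either_match = either_regex.search(text).group()
literal_match = literal_regex.search(text).group()

print(either_match)
print(literal_match)
```

The first pattern stops at the first single 'a' or 'b' it finds, while the second only matches where the pipe character itself appears between the two letters.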

### Optional Matching with the Question Mark
Sometimes there is a pattern that you want to match only optionally. That is, the regex should find a match whether or not that bit of text is there. The ? character flags the group that precedes it as an optional part of the pattern.

In [43]:
batRegex = re.compile(r'Bat(wo)?man') #  pattern wo is an optional group
mo_1 = batRegex.search('the adventures of Batman')
mo_1.group()

'Batman'

In [44]:
mo_2 = batRegex.search('the Batwoman adventures of Batman')
mo_2.group()

'Batwoman'

The (wo)? part of the regular expression means that the pattern wo is an optional group. The regex will match text that has zero instances or one instance of wo in it. This is why the regex matches both 'Batwoman' and 'Batman'.

Using the earlier phone number example, you can make the regex look for phone numbers that do or do not have an area code.

#### Think of the "?" as saying, "Match zero or one of the group preceding this question mark"

If you need to match an actual question mark character, escape it with \?.

In [74]:
# The ? makes the preceding group (the area code and its hyphen) optional
phoneRegex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')
mo_1 = phoneRegex.search('my number is 408-999-9999 and 408-888-8888')
print(mo_1.group())

# Same pattern: the first match in this string has no area code
phoneRegex2 = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')
mo_3 = phoneRegex2.search('my number is 999-9999 and 408-888-8888')
print(mo_3.group())

408-999-9999
999-9999


In [46]:
mo_2 = phoneRegex.search('my number is 999-9999 and 888-8888')
mo_2.group()

'999-9999'

### Matching Zero or More with the Star

The * (called the star or asterisk) means “match zero or more”—the group that precedes the star can occur any number of times in the text. It can be completely absent or repeated over and over again. 

In [52]:
batRegex = re.compile(r'Bat(wo)*man')

# For 'Batman', the (wo)* part of the regex matches zero instances of wo in the string
mo_1 = batRegex.search('The Adventures of Batman')
mo_1.group()

'Batman'

In [53]:
# For 'Batwoman', the (wo)* matches one instance of wo
mo_2 = batRegex.search('The Adventures of Batwoman')
mo_2.group()

'Batwoman'

In [51]:
# For 'Batwowowowoman', (wo)* matches four instances of wo
mo_3 = batRegex.search('The Adventures of Batwowowowoman')
mo_3.group()

'Batwowowowoman'

For 'Batman', the (wo)* part of the regex matches zero instances of wo in the string; for 'Batwoman', the (wo)* matches one instance of wo; and for 'Batwowowowoman', (wo)* matches four instances of wo.

If you need to match an actual star character, prefix the star in the regular expression with a backslash, \*.

### Matching One or More with the Plus

While * means “match zero or more,” the + (or plus) means “match one or more.” Unlike the star, which does not require its group to appear in the matched string, the group preceding a plus must appear at least once. It is not optional. Compare the following examples with the star regexes in the previous section.

In [54]:
batRegex = re.compile(r'Bat(wo)+man')
mo1 = batRegex.search('The Adventures of Batwoman')
mo1.group()

'Batwoman'

In [55]:
mo2 = batRegex.search('The Adventures of Batwowowowoman')
mo2.group()

'Batwowowowoman'

In [59]:
mo3 = batRegex.search('The Adventures of Batman')
mo3 is None

True
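
To put the star and the plus side by side, the sketch below runs both regexes over the same three strings and records whether each one finds a match. The outcomes dictionary is only an illustration device, not part of the original examples.

```python
import re

star_regex = re.compile(r'Bat(wo)*man')  # zero or more 'wo'
plus_regex = re.compile(r'Bat(wo)+man')  # one or more 'wo'

outcomes = {}
for text in ['Batman', 'Batwoman', 'Batwowowowoman']:
    star_found = star_regex.search(text) is not None
    plus_found = plus_regex.search(text) is not None
    outcomes[text] = (star_found, plus_found)
    print(f'{text}: star={star_found}, plus={plus_found}')
```

The only string on which the two differ is 'Batman', where the star accepts zero repetitions of wo and the plus does not.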

### Matching Specific Repetitions with Curly Brackets

If you have a group that you want to repeat a specific number of times, follow the group in your regex with a number in curly brackets. For example, the regex (Ha){3} will match the string 'HaHaHa', but it will not match 'HaHa', since the latter has only two repeats of the (Ha) group.

Instead of one number, you can specify a range by writing a minimum, a comma, and a maximum in between the curly brackets. For example, the regex (Ha){3,5} will match 'HaHaHa', 'HaHaHaHa', and 'HaHaHaHaHa'.

You can also leave out the first or second number in the curly brackets to leave the minimum or maximum unbounded. For example, (Ha){3,} will match three or more instances of the (Ha) group, while (Ha){,5} will match zero to five instances. Curly brackets can help make your regular expressions shorter. These two regular expressions match identical patterns:

(Ha){3}
(Ha)(Ha)(Ha)
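
The bounded and unbounded forms can be compared with a small experiment. The sketch below checks, for each pattern, which repeat counts of 'Ha' match the whole string. It uses the Regex object's fullmatch() method, which is not covered in this section, and the helper name repeats_allowed is hypothetical.

```python
import re

patterns = {
    'exactly 3': r'(Ha){3}',
    '3 to 5': r'(Ha){3,5}',
    '3 or more': r'(Ha){3,}',
    'up to 5': r'(Ha){,5}',
}


def repeats_allowed(pattern, max_repeats=7):
    """Return the counts n in 0..max_repeats for which 'Ha' * n fully matches."""
    regex = re.compile(pattern)
    # fullmatch() succeeds only if the entire string matches the pattern.
    return [n for n in range(max_repeats + 1) if regex.fullmatch('Ha' * n)]


allowed = {label: repeats_allowed(pattern) for label, pattern in patterns.items()}
for label, counts in allowed.items():
    print(label, patterns[label], counts)
```

The check only goes up to seven repeats, so the "3 or more" row shows an arbitrary cutoff of the test rather than a limit of the regex.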

In [60]:
haRegex = re.compile(r'(ha){3}')
mo1 = haRegex.search('hahahahahahaha')
mo1.group()

'hahaha'

In [62]:
mo2 = haRegex.search('ha') 
mo2 is None # no match: the pattern needs three consecutive 'ha'

True

### Greedy and Nongreedy Matching

Since (Ha){3,5} can match three, four, or five instances of Ha in the string 'HaHaHaHaHa', you may wonder why a search with this regex returns 'HaHaHaHaHa' instead of the shorter possibilities. After all, 'HaHaHa' and 'HaHaHaHa' are also valid matches of the regular expression (Ha){3,5}.

Python’s regular expressions are greedy by default, which means that in ambiguous situations they will match the longest string possible. The non-greedy version of the curly brackets, which matches the shortest string possible, has the closing curly bracket followed by a question mark.

In [63]:
greedyHaRegex = re.compile(r'(ha){3,5}') # greedy approach, which means that it will find the longest string possible
mo1 = greedyHaRegex.search('hahahahaha')
mo1.group()

'hahahahaha'

In [65]:
nongreedyHaRegEx = re.compile(r'(ha){3,5}?') # nongreedy with ? searches for the min needed (3)
mo2 = nongreedyHaRegEx.search('hahahahaha')
mo2.group()

'hahaha'

### The findall() Method

In addition to the search() method, Regex objects also have a findall() method. While search() will return a Match object of the first matched text in the searched string, the findall() method will return the strings of every match in the searched string. The following example shows how search() returns a Match object only for the first instance of matching text:

In [66]:
phoneNumRegEx = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mo = phoneNumRegEx.search('Cell: 408-888-8888 Work: 408-999-9999')
mo.group()

'408-888-8888'

On the other hand, findall() will not return a Match object but a list of strings—as long as there are no groups in the regular expression. Each string in the list is a piece of the searched text that matched the regular expression. 

In [67]:
phoneNumRegEx = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d') # has no groups
phoneNumRegEx.findall('Cell: 408-888-8888 Work: 408-999-9999')

['408-888-8888', '408-999-9999']

If there are groups in the regular expression, then findall() will return a list of tuples. Each tuple represents a found match, and its items are the matched strings for each group in the regex. The following example shows findall() in action (notice that the regular expression being compiled now has groups in parentheses):

In [75]:
phoneNumRegEx = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)') # has groups vs no groups
phoneNumRegEx.findall('Cell: 408-888-8888 Work: 408-999-9999 Bok: 999-888-8888 Line: 555-5555')

[('408', '888', '8888'), ('408', '999', '9999'), ('999', '888', '8888')]

To summarize what the findall() method returns, remember the following:

- When called on a regex with no groups, such as \d\d\d-\d\d\d-\d\d\d\d, the method findall() returns a list of string matches, such as ['415-555-9999', '212-555-0000'].
- When called on a regex that has groups, such as (\d\d\d)-(\d\d\d)-(\d\d\d\d), the method findall() returns a list of tuples of strings (one string for each group), such as [('415', '555', '9999'), ('212', '555', '0000')].
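
The two cases can be compared on the same input. The sketch below also shows one possible way to rebuild the full numbers from the grouped result. The rebuilt list is an illustration only, not something findall() does itself.

```python
import re

text = 'Cell: 415-555-9999 Work: 212-555-0000'

no_groups = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
with_groups = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')

full_numbers = no_groups.findall(text)  # list of strings
parts = with_groups.findall(text)       # list of tuples, one item per group

# Join each tuple's items with hyphens to recover the full number.
rebuilt = ['-'.join(groups) for groups in parts]

print(full_numbers)
print(parts)
print(rebuilt)
```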



### Character Classes
| Shorthand character class | Represents |
| --- | --- |
| \d | Any numeric digit from 0 to 9. |
| \D | Any character that is not a numeric digit from 0 to 9. |
| \w | Any letter, numeric digit, or the underscore character. (Think of this as matching “word” characters.) |
| \W | Any character that is not a letter, numeric digit, or the underscore character. |
| \s | Any space, tab, or newline character. (Think of this as matching “space” characters.) |
| \S | Any character that is not a space, tab, or newline. |

Character classes are nice for shortening regular expressions. The character class [0-5] will match only the numbers 0 to 5; this is much shorter than typing (0|1|2|3|4|5).
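
As a small check that the two spellings behave the same, the sketch below applies both to one sample string. The sample text is made up, and the alternation is written without parentheses so that findall() returns plain strings in both cases.

```python
import re

class_regex = re.compile(r'[0-5]')
alternation_regex = re.compile(r'0|1|2|3|4|5')

text = 'Room 1408, floor 6, gate 35'

class_hits = class_regex.findall(text)
alternation_hits = alternation_regex.findall(text)

print(class_hits)
print(alternation_hits)
```

The digits 8 and 6 are skipped by both patterns because they fall outside the 0 to 5 range.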

In [80]:
xmasRegex = re.compile(r'\d+\s\w+') # + means that the group preceding + has to appear at least once
xmasRegex.findall('4rose, 5 rose, 99 no yes 11, 12 drummers, 11 pipers, 10 lords, 9 ladies, 8 maids, 7 swans, 6 geese,\n5 rings, 4 birds, 3 hens, 2 doves, 1 partridge')

['5 rose',
 '99 no',
 '12 drummers',
 '11 pipers',
 '10 lords',
 '9 ladies',
 '8 maids',
 '7 swans',
 '6 geese',
 '5 rings',
 '4 birds',
 '3 hens',
 '2 doves',
 '1 partridge']

The regular expression \d+\s\w+ will match text that has one or more numeric digits (\d+), followed by a whitespace character (\s), followed by one or more letter/digit/underscore characters (\w+). The findall() method returns all matching strings of the regex pattern in a list.