# Chapter 7
## Pattern Matching with Regular Expressions

We want to detect and identify phone numbers. *Pattern:* 3 digits - 3 digits - 4 digits

In [6]:
def isPhoneNumber(text):
    if len(text) != 12:
        return False
    for i in range(0, 3):
        if not text[i].isdecimal():  # This checks to see if the first three characters are numbers
            return False
    if text[3] != '-':
        return False
    for i in range(4,7):
        if not text[i].isdecimal():
            return False
    if text[7] != '-':
        return False
    for i in range(8,12):
        if not text[i].isdecimal():
            return False
    return True

In [7]:
print('415-555-4242 is a phone number:')
print(isPhoneNumber('415-555-4242'))
print('Moshi Moshi is a phone number:')
print(isPhoneNumber('Moshi Moshi'))

415-555-4242 is a phone number:
True
Moshi Moshi is a phone number:
False


Now we will find a phone number in a string of any length.

In [8]:
message = 'Call me at 415-555-1011 tomorrow. 415-555-9999 is my office.'
for i in range(len(message)):
    chunk = message[i:i+12]
    if isPhoneNumber(chunk):
        print('Phone number found: ' + chunk)
print('Done')

Phone number found: 415-555-1011
Phone number found: 415-555-9999
Done


The `isPhoneNumber()` function catches only one format: a number written as (415) 555-1011 will not register as a phone number.
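To make that limitation concrete, here is a small sketch. `is_phone_number` is a hypothetical, compact stand-in for `isPhoneNumber()` (same checks, fewer lines), and the two alternative formats are just sample inputs chosen for illustration.

```python
def is_phone_number(text):
    """Hypothetical compact equivalent of isPhoneNumber(): accepts ddd-ddd-dddd only."""
    if len(text) != 12 or text[3] != '-' or text[7] != '-':
        return False
    # Everything other than the two dashes must be a decimal digit.
    digits = text[0:3] + text[4:7] + text[8:12]
    return digits.isdecimal()

# Same number, three spellings; only the first one passes.
results = {
    number: is_phone_number(number)
    for number in ['415-555-1011', '(415) 555-1011', '415.555.1011']
}
print(results)
```

Supporting each extra format this way would mean more hand-written checks, which is the motivation for switching to regular expressions.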

In [2]:
import re
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mo = phoneNumRegex.search('My number is 415-555-4242.')
print('Phone number found: ' + mo.group())

Phone number found: 415-555-4242


**Observations:**
- `re.compile()` takes the regex pattern as a raw string and returns a regex object, which we store in the variable *phoneNumRegex*
- calling the `search()` method on *phoneNumRegex* with the text as an argument returns a Match object, which we store in the variable *mo*
- calling the `group()` method on *mo* returns the matched text
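The three steps can be inspected one at a time. This is only a sketch: the variable names are made up for illustration, and `re.Pattern` and `re.Match` are the type names exposed by recent Python 3 versions (an assumption about the interpreter in use).

```python
import re

# Step 1: compile the pattern into a regex object.
phone_regex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

# Step 2: search() returns a Match object, or None if nothing matches.
match = phone_regex.search('My number is 415-555-4242.')

# Step 3: group() returns the text that was matched.
found = match.group()

print(isinstance(phone_regex, re.Pattern))  # True
print(isinstance(match, re.Match))          # True
print(found, match.span())                  # 415-555-4242 (13, 25)
```

`span()` gives the start and end index of the match inside the searched string.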

Now we can return different segments of a match by placing parentheses around parts of the pattern. Each parenthesized part becomes a group that `group()` can return by number.


In [3]:
phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My number is 415-555-4242.')
mo.group(1)

'415'

In [39]:
mo.group(2)

'555-4242'

In [40]:
mo.group(0)

'415-555-4242'

In [41]:
mo.group()

'415-555-4242'

In [44]:
mo.groups()

('415', '555-4242')

In [6]:
len('My number is 415-555-4242.') - len(mo.group()) 

14

In [46]:
areaCode, mainNumber = mo.groups()
print(areaCode)
print(mainNumber)

415
555-4242


So we have an important trio of functions here that we need to chunk together:
- `compile()`
- `search()`
- `group()`

Now we will match parentheses as literal characters in the text rather than as group markers. Place a backslash in front of each parenthesis you want to match: `\(` and `\)`.
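A quick side-by-side sketch of the difference, using a made-up sample string: unescaped parentheses only create a group, while escaped ones must actually appear in the text.

```python
import re

text = 'Call (415) now'

# Unescaped parentheses create a group; they match no characters themselves.
grouped = re.compile(r'(\d{3})')
# Escaped parentheses match the literal '(' and ')' characters.
literal = re.compile(r'\(\d{3}\)')

grouped_match = grouped.search(text).group()
literal_match = literal.search(text).group()

print(grouped_match)  # 415
print(literal_match)  # (415)
```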


In [50]:
phoneNumRegex = re.compile(r'(\(\d\d\d\)) (\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My number is (415) 555-4242.')

In [51]:
mo.group()

'(415) 555-4242'

In [52]:
mo.group(1)

'(415)'

In [54]:
phoneNumRegex = re.compile(r'(\(\d{3}\)) (\d{3}-\d{4})')
mo = phoneNumRegex.search('My number is (415) 555-4242.')

In [55]:
mo.group()

'(415) 555-4242'

## Matching Multiple Groups with the Pipe `|`

The `|` character is called a pipe. We can use it to match one of several alternatives. When more than one alternative appears in the text, `search()` returns the one that occurs first.

In [56]:
heroRegex = re.compile(r'Batman|Tina Fey')
mo1 = heroRegex.search('Batman and Tina Fey.')
mo1.group()

'Batman'

In [58]:
mo2 = heroRegex.search('Tina Fey and Batman.')
mo2.group()

'Tina Fey'

In [60]:
mo = heroRegex.findall('Tina Fey and Batman.')
mo

['Tina Fey', 'Batman']

In [61]:
mo1

<re.Match object; span=(0, 6), match='Batman'>

In [62]:
mo2

<re.Match object; span=(0, 8), match='Tina Fey'>

In [63]:
batRegex = re.compile(r'Bat(man|mobile|copter|bat)')
mo = batRegex.search('Batmobile lost a wheel')
mo.group()

'Batmobile'

In [64]:
mo.group(1)

'mobile'
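The pipe also gives one possible answer to the earlier problem of numbers written as (415) 555-1011. The sketch below assumes only those two formats matter; the pattern and the sample sentence are illustrative, not a complete phone number matcher.

```python
import re

# Either "(ddd) ddd-dddd" or "ddd-ddd-dddd".
either_format = re.compile(r'\(\d{3}\) \d{3}-\d{4}|\d{3}-\d{3}-\d{4}')

text = 'Call (415) 555-1011 or 415-555-9999.'
numbers = either_format.findall(text)
print(numbers)  # ['(415) 555-1011', '415-555-9999']
```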

## Optional Matching with the Question Mark `?`

To make part of a pattern optional, place a `?` after it; the `?` marks the preceding group as optional.

In [3]:
batRegex = re.compile(r'Bat(wo)?man')
mo1 = batRegex.search('The Adventures of Batman')
mo1.group()

'Batman'

In [4]:
mo2 = batRegex.search('The Adventures of Batwoman')
mo2.group()

'Batwoman'

In [6]:
phoneRegex = re.compile(r'(\d{3}-)?\d{3}-\d{4}')
mo1 = phoneRegex.search('My number is 415-555-4242')
mo1.group()

'415-555-4242'

In [7]:
mo2 = phoneRegex.search('My number is 555-4242')
mo2.group()

'555-4242'

`?` says *Match zero or one of the group preceding this question mark.*
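One detail worth checking: when the optional group is skipped, its group number still exists but holds `None`. A small sketch with the same pattern as above (the snake_case variable names are just for this example):

```python
import re

phone_regex = re.compile(r'(\d{3}-)?\d{3}-\d{4}')

with_area = phone_regex.search('My number is 415-555-4242')
without_area = phone_regex.search('My number is 555-4242')

print(with_area.group(1))     # 415-
print(without_area.group(1))  # None

# groups() accepts a default to use in place of None for skipped groups.
print(without_area.groups(default=''))  # ('',)
```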

As with parentheses, put a backslash in front of `|` or `?` (that is, `\|` or `\?`) to match those characters literally in the text.
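For example, with two made-up sample strings, escaping turns both symbols into ordinary characters:

```python
import re

# \? matches a literal question mark instead of making 'n' optional.
question_regex = re.compile(r'Batman\?')
# \| matches a literal pipe instead of separating alternatives.
pipe_regex = re.compile(r'yes\|no')

question_match = question_regex.search('Is that Batman?').group()
pipe_match = pipe_regex.search('Answer: yes|no').group()

print(question_match)  # Batman?
print(pipe_match)      # yes|no
```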

## Matching Zero or More with the Star `*`

In [8]:
batRegex = re.compile(r'Bat(wo)*man')
mo1 = batRegex.search('The Adventures of Batman')
mo1.group()

'Batman'

In [9]:
mo2 = batRegex.search('The Adventures of Batwoman')
mo2.group()

'Batwoman'

In [10]:
mo3 = batRegex.search('The Adventures of Batwowowowoman')
mo3.group()

'Batwowowowoman'
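When a group repeats, `group(1)` keeps only the last repetition, not all of them. The sketch below uses a slightly different, hypothetical pattern with two alternatives inside the group so the effect is visible.

```python
import re

# The group may repeat and each repetition is either 'wo' or 'wa'.
bat_regex = re.compile(r'Bat(wo|wa)*man')
match = bat_regex.search('The Adventures of Batwowaman')

whole = match.group()   # the full matched text
last = match.group(1)   # only the final repetition of the group

print(whole)  # Batwowaman
print(last)   # wa
```

To get everything that was matched, use `group()` or `group(0)` rather than the group number.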

## Matching one or more with plus `+`

In [14]:
batRegex = re.compile(r'Bat(wo)+man')
mo1 = batRegex.search('The Adventures of Batwoman')
mo1.group()

'Batwoman'

In [17]:
mo2 = batRegex.search('The Adventures of Batwowowowoman')
mo2.group()

'Batwowowowoman'

In [18]:
mo3 = batRegex.search('The adventures of Batman')
mo3 is None

True
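Because `search()` returns `None` when nothing matches, calling `group()` directly on the result would raise an `AttributeError`. A minimal guard, where `describe` is a hypothetical helper written for this note:

```python
import re

bat_regex = re.compile(r'Bat(wo)+man')

def describe(text):
    """Return the matched text, or a placeholder when there is no match."""
    match = bat_regex.search(text)
    if match is None:
        return 'no match'
    return match.group()

print(describe('The Adventures of Batwoman'))  # Batwoman
print(describe('The Adventures of Batman'))    # no match
```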

## Matching Specific Repetitions with Curly Brackets

In [21]:
haRegex = re.compile(r'(Ha){3}')
mo1 = haRegex.search('HaHaHa')
mo1.group()

'HaHaHa'

In [22]:
mo2 = haRegex.search('Ha')
mo2 is None

True
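Curly brackets also accept a range, and either end of the range can be left open. A short sketch on a string of six `Ha`s (the sample text and variable names are chosen for illustration):

```python
import re

text = 'HaHaHaHaHaHa'  # six repetitions of 'Ha'

exactly_three = re.search(r'(Ha){3}', text).group()   # exactly 3
three_or_more = re.search(r'(Ha){3,}', text).group()  # 3 or more
up_to_five = re.search(r'(Ha){,5}', text).group()     # 0 to 5

print(exactly_three)  # HaHaHa
print(three_or_more)  # HaHaHaHaHaHa
print(up_to_five)     # HaHaHaHaHa
```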

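## Greedy and Non-Greedy Matching

Python's regexes are greedy by default: when a pattern could match strings of several lengths, they match the longest one possible. Putting a `?` after the closing curly bracket makes the match non-greedy, so it takes the shortest string possible.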

In [24]:
greedyHaRegex = re.compile(r'(Ha){3,5}')
mo1 = greedyHaRegex.search('HaHaHaHaHa')
mo1.group()

'HaHaHaHaHa'

In [26]:
nongreedyHaRegex = re.compile(r'(Ha){3,5}?')
mo2 = nongreedyHaRegex.search('HaHaHaHaHa')
mo2.group()

'HaHaHa'
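The same behavior shows up with digits. In this sketch (sample string made up for illustration) the greedy pattern takes as many digits as the range allows and the non-greedy one stops at the minimum. Note that `?` here has a different meaning from the optional-group `?`.

```python
import re

text = '1234567'

greedy = re.search(r'\d{3,5}', text).group()       # longest allowed: 5 digits
non_greedy = re.search(r'\d{3,5}?', text).group()  # shortest allowed: 3 digits

print(greedy)      # 12345
print(non_greedy)  # 123
```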

## The findall() Method
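`search()` returns a Match object for the first match only, while `findall()` returns a list of every match found. If the regex has no groups, the list holds the matched strings; if it has several groups, the list holds tuples with one string per group.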

In [27]:
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mo = phoneNumRegex.search('Cell: 415-555-9999 Work: 212-555-0000')
mo.group()

'415-555-9999'

In [28]:
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')

['415-555-9999', '212-555-0000']

In [29]:
phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')
phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')

[('415', '555', '9999'), ('212', '555', '0000')]
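If the pattern has groups but the full numbers are still wanted, two possible approaches are sketched below: rejoin each tuple, or use `finditer()`, which yields Match objects instead of tuples. Variable names are illustrative.

```python
import re

phone_regex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')
text = 'Cell: 415-555-9999 Work: 212-555-0000'

# Option 1: rebuild each number from its tuple of groups.
rejoined = ['-'.join(parts) for parts in phone_regex.findall(text)]

# Option 2: finditer() yields Match objects, so group() is available.
from_matches = [match.group() for match in phone_regex.finditer(text)]

print(rejoined)      # ['415-555-9999', '212-555-0000']
print(from_matches)  # ['415-555-9999', '212-555-0000']
```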