# Pattern matching with Regular Expressions p148-171.

## Finding patterns of text without regular expressions

Finding a phone number in a string using a function.

In [2]:
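def isPhoneNumber(text):
    # UK landline format: 4 digits, space, 3 digits, space, 4 digits (13 characters).
    if len(text) != 13:
        return False
    for i in range(0, 4):  # area code
        if not text[i].isdecimal():
            return False
    if text[4] != ' ':  # separator after the area code
        return False
    for i in range(5, 8):  # first three digits of the local number
        if not text[i].isdecimal():
            return False
    if text[8] != ' ':  # separator before the last four digits
        return False
    for i in range(9, 13):  # last four digits
        if not text[i].isdecimal():
            return False
    return True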


def isMobile(text):
    # UK mobile format: 5 digits, space, 6 digits (12 characters).
    if len(text) != 12:
        return False
    for i in range(0, 5):  # mobile prefix
        if not text[i].isdecimal():
            return False
    if text[5] != ' ':  # separator after the prefix
        return False
    for i in range(6, 12):  # remaining six digits
        if not text[i].isdecimal():
            return False
    return True

message = 'Call me on 0113 254 6879 tomorrow. My office number is 0161 453 5767.'
for i in range(len(message)):
    chunk = message[i:i+13]
    if isPhoneNumber(chunk):
        print('Phone number found: ' + chunk)
print('Done!')

# print('0113 264 9034 is a phone number')
# print(isPhoneNumber('0113 264 9034'))
# print('01423 mish mash is not a number')
# print(isPhoneNumber('01423 mish mash'))
# print('07756 747747 is a mobile number')
# print(isMobile('07756 747747'))
# print('07736 this is not a mobile number')
# print(isMobile('07736 this is not a mobile number'))

Phone number found: 0113 254 6879
Phone number found: 0161 453 5767
Done!


The loop at the end slides a 13-character window over the message one character at a time and passes each chunk to isPhoneNumber(): first 'Call me on 01', then 'all me on 011', and so on. When the chunk is '0113 254 6879' the function returns True and the string *'Phone number found: 0113 254 6879'* is printed.
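
The sliding-window idea can be pulled out into a reusable helper. This is only a sketch: the names `find_chunks` and `looks_like_landline` are made up for illustration and are not part of the notes above, and the checker is a compact stand-in for isPhoneNumber().

```python
def looks_like_landline(chunk):
    """Hypothetical checker: 4 digits, space, 3 digits, space, 4 digits."""
    if len(chunk) != 13:
        return False
    # Positions 4 and 8 must be spaces; every other position must be a digit.
    return all(
        ch == ' ' if i in (4, 8) else ch.isdecimal()
        for i, ch in enumerate(chunk)
    )


def find_chunks(message, width, predicate):
    """Slide a window of `width` characters over `message` and collect hits."""
    hits = []
    for start in range(len(message) - width + 1):
        chunk = message[start:start + width]
        if predicate(chunk):
            hits.append(chunk)
    return hits


message = 'Call me on 0113 254 6879 tomorrow. My office number is 0161 453 5767.'
found = find_chunks(message, 13, looks_like_landline)
print(found)
```

Passing the width and the checker as arguments means the same loop could be reused for a 12-character mobile format without copying it.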

## Finding Patterns of Text with Regular Expressions

Passing a string value representing a regular expression to *re.compile()* returns a Regex pattern object.

In [3]:
import re

phoneNumRegex = re.compile(r'\d\d\d\d \d\d\d \d\d\d\d')  # adding r in front of the expression means it is a raw string
mo = phoneNumRegex.search('My number is 0113 345 6789')
print('Phone number found: ' + mo.group())

Phone number found: 0113 345 6789


search() returns a Match object, and calling its group() method in the last line returns the whole matched text, here 0113 345 6789.
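
To see what the r prefix buys, the short sketch below compares the raw string with the same pattern written as an ordinary string, where every backslash has to be doubled. The variable names are illustrative only.

```python
import re

# Raw string: backslashes are passed through to the regex engine untouched.
raw_pattern = r'\d\d\d\d \d\d\d \d\d\d\d'

# Ordinary string: each backslash must be escaped with a second backslash.
escaped_pattern = '\\d\\d\\d\\d \\d\\d\\d \\d\\d\\d\\d'

same_pattern = raw_pattern == escaped_pattern
print(same_pattern)

# Both spellings compile to a regex that finds the same text.
text = 'My number is 0113 345 6789'
match_text = re.compile(escaped_pattern).search(text).group()
print(match_text)
```

The two strings are identical once Python has parsed them; the raw form is simply easier to read and harder to get wrong.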

## Summary

* Import the re module with **import re**.
* Create a Regex object with the re.compile() function, writing the pattern as a raw string (prefix r).
* Pass the string you want to search to the Regex object's search() method, which returns a *Match* object (or None if nothing matches).
* Call the *Match* object's group() method to return a string of the actual matched text.
* Regular expressions can be tested [here](http://regexr.com//).
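
The steps in the summary can be sketched as one small function. The name `first_phone_number` is hypothetical, and the None check is an addition to guard against text that contains no match.

```python
import re  # step 1: import the module

# Step 2: compile the pattern, written as a raw string.
phone_regex = re.compile(r'\d\d\d\d \d\d\d \d\d\d\d')


def first_phone_number(text):
    """Return the first phone number found in `text`, or None."""
    mo = phone_regex.search(text)  # step 3: search the string
    if mo is None:                 # search() returns None when nothing matches
        return None
    return mo.group()              # step 4: the matched text


print(first_phone_number('My number is 0113 345 6789'))
print(first_phone_number('I have no phone'))
```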

## More pattern matching with Regular Expressions

Parentheses in the pattern create groups, so the area code can be separated from the rest of the number with **group()**.

In [4]:
phoneNumRegex = re.compile(r'(\d\d\d\d) (\d\d\d \d\d\d\d)')
mo = phoneNumRegex.search('Here is a number that is in the text 0116 245 6987.')

mo.group(1) # first set of parentheses match

'0116'

In [5]:
mo.group(2) # second set of parentheses match

'245 6987'

In [6]:
mo.group(0) # entire set

'0116 245 6987'

In [7]:
mo.group() # entire set

'0116 245 6987'

All the groups can be retrieved at once with **groups()**, which returns a tuple (immutable).

In [8]:
mo.groups()

('0116', '245 6987')

Since groups() returns a tuple, which here holds two values, it can be unpacked into separate variables with multiple assignment:

In [9]:
areaCode, mainNumber = mo.groups()

In [10]:
print(areaCode)

0116


In [11]:
print(mainNumber)

245 6987


## Matching multiple groups with the pipe |

The regular expression r'Batman|Tina Fey' matches either 'Batman' or 'Tina Fey'. If both occur in the searched text, search() returns whichever occurs first. The **findall()** method, by contrast, returns every matching occurrence.

In [12]:
heroRegex = re.compile(r'Batman|Tina Fey')
mo1 = heroRegex.search('Batman and Tina Fey.')
mo1.group()

'Batman'

In [13]:
heroRegex = re.compile(r'Batman|Tina Fey')
mo2 = heroRegex.search('Tina Fey is not Batman.')
mo2.group()

'Tina Fey'
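
The difference between search() and findall() mentioned above can be seen side by side in this small sketch (variable names are illustrative):

```python
import re

hero_regex = re.compile(r'Batman|Tina Fey')
text = 'Tina Fey is not Batman.'

# search() stops at the first occurrence in the text.
first = hero_regex.search(text).group()

# findall() returns a list of every occurrence, in the order they appear.
everything = hero_regex.findall(text)

print(first)
print(everything)
```

Because this pattern has no parenthesised groups, findall() returns the matched strings themselves.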

The pipe can also be used inside parentheses to match one of several alternatives that share a prefix. For example, if you wanted to match any of the following: Batman, Batwoman, Batmobile, Batcopter or Batbat.

In [14]:
batRegex = re.compile(r'Bat(man|woman|mobile|copter|bat)')
mo = batRegex.search('The Batmobile lost a wheel')
mo.group()

'Batmobile'

In [15]:
mo.group(1)  # returns just the part matched inside the first parentheses group

'mobile'

In [19]:
mo = batRegex.search('The Batmotorcycle lost a wheel')
mo is None  # none of the alternatives matches 'motorcycle', so search() returns None
# calling mo.group() here would raise an AttributeError

True
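
Since search() returns None when nothing matches, it is safer to test the result before calling group(). A minimal sketch, with `describe` as a made-up helper name:

```python
import re

bat_regex = re.compile(r'Bat(man|woman|mobile|copter|bat)')


def describe(text):
    """Report the match and its suffix group, or say that nothing matched."""
    mo = bat_regex.search(text)
    if mo is None:
        # Without this check, mo.group() would raise AttributeError.
        return 'no match'
    return mo.group() + ' (suffix: ' + mo.group(1) + ')'


print(describe('The Batmobile lost a wheel'))
print(describe('The Batmotorcycle lost a wheel'))
```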

## Optional matching with ?

The ? marks the group before it as optional: the regex matches whether or not that part of the text is present.

In [24]:
batRegex = re.compile(r'Bat(wo)?man')
mo = batRegex.search('The Adventures of Batman')
mo.group()

'Batman'

In [26]:
batRegex = re.compile(r'Bat(wo)?man')  # the (wo)? means that wo is optional in the search.
mo = batRegex.search('The Adventures of Batwoman')
mo.group()

'Batwoman'

Back to the phone number example, the regex can look for phone numbers with or without the area code.

In [36]:
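# The space sits inside the optional group so a number without an area code
# is matched without a leading space.
phoneRegex = re.compile(r'(\d\d\d\d )?(\d\d\d \d\d\d\d)')
mo1 = phoneRegex.search('Here is the new number 248 6532')
mo1.group()

'248 6532'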

In [28]:
mo2 = phoneRegex.search('Here is the new number 0115 248 6532')
mo2.group()

'0115 248 6532'
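
When an optional group takes no part in the match, its slot in groups() is None. The sketch below uses a variant of the pattern in which the separating space sits inside the optional group; that placement is an assumption made for this example, as are the variable names.

```python
import re

# Optional area code (with its trailing space) followed by the local number.
phone_regex = re.compile(r'(\d\d\d\d )?(\d\d\d \d\d\d\d)')

without_code = phone_regex.search('Here is the new number 248 6532')
with_code = phone_regex.search('Here is the new number 0115 248 6532')

# The skipped optional group shows up as None in the tuple.
print(without_code.groups())

# When the area code is present, group 1 includes the trailing space.
print(with_code.groups())
```

Code that unpacks groups() from a pattern with optional parts should therefore be ready to handle None.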

## Matching zero or more with *

The asterisk matches zero or more repetitions of the group that precedes it.

In [29]:
batRegex = re.compile(r'Bat(wo)*man')
mo1 = batRegex.search('The Adventures of Batman')
mo1.group()

'Batman'

In [30]:
batRegex = re.compile(r'Bat(wo)*man')
mo2 = batRegex.search('The Adventures of Batwoman')
mo2.group()

'Batwoman'

In [33]:
batRegex = re.compile(r'Bat(wo)*man')
mo3 = batRegex.search('The Adventures of Batwowowowowoman')
mo3.group()

'Batwowowowowoman'
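
To make the difference between ? and * concrete, this sketch runs both patterns against the same text with a repeated 'wo'. The variable names are illustrative.

```python
import re

optional_regex = re.compile(r'Bat(wo)?man')  # zero or one 'wo'
star_regex = re.compile(r'Bat(wo)*man')      # zero or more 'wo'

text = 'The Adventures of Batwowowoman'

# ? allows at most one 'wo', so three in a row cannot match.
optional_result = optional_regex.search(text)
print(optional_result is None)

# * accepts any number of repetitions.
star_result = star_regex.search(text)
print(star_result.group())

# A repeated group keeps only its last repetition.
print(star_result.group(1))
```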