# Pattern matching with Regular Expressions p148-171.

## Finding patterns of text without regular expressions

Finding a phone number in a string using a function.

In [1]:
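def isPhoneNumber(text):
    if len(text) != 13:  # 'dddd ddd dddd': adapted to the English phone number format
        return False
    for i in range(0, 4):  # four-digit area code
        if not text[i].isdecimal():
            return False
    if text[4] != ' ':  # separator after the area code
        return False
    for i in range(5, 8):  # next three digits
        if not text[i].isdecimal():
            return False
    if text[8] != ' ':  # separator before the last block
        return False
    for i in range(9, 13):  # last four digits (indices 9 to 12)
        if not text[i].isdecimal():
            return False
    return True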


def isMobile(text):
    if len(text) != 12:  # 'ddddd dddddd': adapted to the English mobile number format
        return False
    for i in range(0, 5):  # five-digit prefix of a mobile number
        if not text[i].isdecimal():
            return False
    if text[5] != ' ':  # separator between the two blocks
        return False
    for i in range(6, 12):  # remaining six digits
        if not text[i].isdecimal():
            return False
    return True

message = 'Call me on 0113 254 6879 tomorrow. My office number is 0161 453 5767.'
for i in range(len(message)):
    chunk = message[i:i+13]
    if isPhoneNumber(chunk):
        print('Phone number found: ' + chunk)
print('Done!')

# print('0113 264 9034 is a phone number')
# print(isPhoneNumber('0113 264 9034'))
# print('01423 mish mash is not a number')
# print(isPhoneNumber('01423 mish mash'))
# print('07756 747747 is a mobile number')
# print(isMobile('07756 747747'))
# print('07736 this is not a mobile number')
# print(isMobile('07736 this is not a mobile number'))

Phone number found: 0113 254 6879
Phone number found: 0161 453 5767
Done!


The loop at the end slides a 13-character window along the message and passes each chunk to isPhoneNumber(): first 'Call me on 01', then 'all me on 011', and so on. When the window reaches '0113 254 6879' the function returns True and the string *'Phone number found: 0113 254 6879'* is printed.
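The scanning loop can be folded into a small reusable helper. This is a minimal sketch rather than part of the original notebook: `find_chunks` and `looks_like_landline` are hypothetical names, and the predicate assumes the same 'dddd ddd dddd' layout that isPhoneNumber() checks.

```python
def looks_like_landline(text):
    """Return True for text laid out as 'dddd ddd dddd' (assumed format)."""
    return (
        len(text) == 13
        and text[4] == ' ' and text[8] == ' '
        and text[:4].isdecimal()
        and text[5:8].isdecimal()
        and text[9:].isdecimal()
    )


def find_chunks(message, width, predicate):
    """Collect every width-character slice of message that predicate accepts."""
    found = []
    for i in range(len(message) - width + 1):
        chunk = message[i:i + width]
        if predicate(chunk):
            found.append(chunk)
    return found


message = 'Call me on 0113 254 6879 tomorrow. My office number is 0161 453 5767.'
landlines = find_chunks(message, 13, looks_like_landline)
print(landlines)
```

Passing the window width and the predicate as arguments means the same helper could scan for 12-character mobile numbers as well.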

## Finding Patterns of Text with Regular Expressions

Passing a string value representing a regular expression to *re.compile()* returns a Regex pattern object.

In [2]:
import re

phoneNumRegex = re.compile(r'\d\d\d\d \d\d\d \d\d\d\d')  # adding r in front of the expression means it is a raw string
mo = phoneNumRegex.search('My number is 0113 345 6789')
print('Phone number found: ' + mo.group())

Phone number found: 0113 345 6789


Calling group() on the Match object in the last line returns the whole matched text, here 0113 345 6789.
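To see what the r prefix buys you, the sketch below writes the same pattern twice, once as an ordinary string with doubled backslashes and once as a raw string. The variable names are illustrative only.

```python
import re

# Without the r prefix every backslash has to be doubled.
plain = '\\d\\d\\d\\d \\d\\d\\d \\d\\d\\d\\d'
raw = r'\d\d\d\d \d\d\d \d\d\d\d'

# The two spellings produce exactly the same string.
same_pattern = plain == raw
print(same_pattern)

# So both compile to the same regex and find the same number.
match = re.compile(plain).search('My number is 0113 345 6789')
print(match.group())
```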

## Summary

* Import the re module with **import re**.
* Create a Regex object with re.compile(), passing the pattern as a raw string (prefix r).
* Pass the string you want to search to the Regex object's search() method, which returns a *Match* object (or None if nothing matches).
* Call the *Match* object's group() method to return the matched text as a string.
* Regular expressions can be tested [here](http://regexr.com//)
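The steps above can be sketched as one small function. Because search() returns None when nothing matches, the sketch checks for that before calling group(); `first_phone_number` is a hypothetical helper name, not something from the original notebook.

```python
import re


def first_phone_number(text):
    """Return the first 'dddd ddd dddd' number in text, or None if absent."""
    phone_regex = re.compile(r'\d\d\d\d \d\d\d \d\d\d\d')
    mo = phone_regex.search(text)
    if mo is None:
        # Calling mo.group() here would raise AttributeError.
        return None
    return mo.group()


print(first_phone_number('My number is 0113 345 6789'))
print(first_phone_number('No number here'))
```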

## More pattern matching with Regular Expressions

Wrapping parts of the pattern in parentheses creates groups, so the area code can be separated from the rest of the number with **group()**.

In [3]:
phoneNumRegex = re.compile(r'(\d\d\d\d) (\d\d\d \d\d\d\d)')
mo = phoneNumRegex.search('Here is a number that is in the text 0116 245 6987.')

mo.group(1) # first set of parentheses match

'0116'

In [4]:
mo.group(2) # second set of parentheses match

'245 6987'

In [5]:
mo.group(0) # entire match

'0116 245 6987'

In [6]:
mo.group() # entire match (same as group(0))

'0116 245 6987'

The **groups()** method returns all the groups at once as a tuple (immutable).

In [7]:
mo.groups()

('0116', '245 6987')

Since groups() returns a tuple of several values, multiple assignment can unpack each value into its own variable:

In [8]:
areaCode, mainNumber = mo.groups()

In [9]:
print(areaCode)

0116


In [10]:
print(mainNumber)

245 6987
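Two related details are worth a short sketch: group() also accepts several group numbers at once, and parentheses that should be matched literally have to be escaped with a backslash. The bracketed number format below is an assumption made up for illustration.

```python
import re

phone_regex = re.compile(r'(\d\d\d\d) (\d\d\d \d\d\d\d)')
mo = phone_regex.search('Here is a number that is in the text 0116 245 6987.')

# group() accepts several group numbers and returns them as a tuple.
pair = mo.group(1, 2)
print(pair)

# \( and \) match real parentheses; the unescaped pair still forms a group.
bracket_regex = re.compile(r'\((\d\d\d\d)\) (\d\d\d \d\d\d\d)')
mo2 = bracket_regex.search('Call (0116) 245 6987 after six.')
print(mo2.group())
print(mo2.group(1))
```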


## Matching multiple groups with the pipe

The regular expression r'Batman|Tina Fey' will match either 'Batman' or 'Tina Fey'. If both strings occur in the searched text, search() returns whichever occurs first. Note that **findall()** will find all matching occurrences.

In [11]:
heroRegex = re.compile(r'Batman|Tina Fey')
mo1 = heroRegex.search('Batman and Tina Fey.')
mo1.group()

'Batman'

In [12]:
heroRegex = re.compile(r'Batman|Tina Fey')
mo2 = heroRegex.search('Tina Fey is not Batman.')
mo2.group()

'Tina Fey'

The pipe can also match one of several alternatives that share a prefix. For example, to match any of Batman, Batwoman, Batmobile, Batcopter or Batbat:

In [13]:
batRegex = re.compile(r'Bat(man|woman|mobile|copter|bat)')
mo = batRegex.search('The Batmobile lost a wheel')
mo.group()

'Batmobile'

In [14]:
mo.group(1) # returns just the text matched inside the first set of parentheses

'mobile'
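When the alternatives live in a list, the pattern can be assembled with '|'.join(). The sketch below is one possible approach, not the notebook's own method; it passes each entry through re.escape() so that any special characters would be treated literally, and the variable names are illustrative.

```python
import re

suffixes = ['man', 'woman', 'mobile', 'copter', 'bat']

# Build 'Bat(man|woman|mobile|copter|bat)' from the list.
pattern = 'Bat(' + '|'.join(re.escape(s) for s in suffixes) + ')'
bat_regex = re.compile(pattern)

mo = bat_regex.search('The Batcopter followed the Batmobile')
print(pattern)
print(mo.group())   # whole match
print(mo.group(1))  # only the alternative that matched
```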

## Optional matching with ? (zero or one match)

The ? marks the group before it as optional: the regex matches whether or not that part of the text is present.

In [15]:
batRegex = re.compile(r'Bat(wo)?man')
mo = batRegex.search('The Adventures of Batman')
mo.group()

'Batman'

In [16]:
batRegex = re.compile(r'Bat(wo)?man')  # the (wo)? means that wo is optional in the search.
mo = batRegex.search('The Adventures of Batwoman')
mo.group()

'Batwoman'

Back to the phone number example, the regex can look for phone numbers with or without the area code.

In [17]:
phoneRegex = re.compile(r'(\d\d\d\d )?(\d\d\d \d\d\d\d)')  # the separating space belongs inside the optional group
mo1 = phoneRegex.search('Here is the new number 248 6532')
mo1.group()

'248 6532'

In [18]:
mo2 = phoneRegex.search('Here is the new number 0115 248 6532')
mo2.group()

'0115 248 6532'
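An optional group that takes no part in the match has no text to report, so group(1) returns None for it. A minimal sketch using the Batman pattern from earlier in this section:

```python
import re

bat_regex = re.compile(r'Bat(wo)?man')

without_wo = bat_regex.search('The Adventures of Batman')
with_wo = bat_regex.search('The Adventures of Batwoman')

# The optional group was skipped, so there is nothing to return.
print(without_wo.group(1))
# The optional group matched, so its text is available.
print(with_wo.group(1))
```

Code that uses an optional group therefore needs to be ready for None as well as a string.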

## Matching zero or more with the *

The asterisk matches zero or more repetitions of the group that precedes it.

In [19]:
batRegex = re.compile(r'Bat(wo)*man')
mo1 = batRegex.search('The Adventures of Batman')
mo1.group()

'Batman'

In [20]:
batRegex = re.compile(r'Bat(wo)*man')
mo2 = batRegex.search('The Adventures of Batwoman')
mo2.group()

'Batwoman'

In [21]:
batRegex = re.compile(r'Bat(wo)*man')
mo3 = batRegex.search('The Adventures of Batwowowowowoman')
mo3.group()

'Batwowowowowoman'

## Matching one or more with the +

Whilst the * means match 'zero or more', the + (plus) means match 'one or more'. The group preceding the plus must appear at least once.

In [22]:
batRegex = re.compile(r'Bat(wo)+man')
mo1 = batRegex.search('The Adventures of Batwoman')
mo1.group()

'Batwoman'

In [23]:
mo2 = batRegex.search('The Adventures of Batwowowowoman')
mo2.group()

'Batwowowowoman'

In [24]:
mo3 = batRegex.search('The Adventures of Batman')
mo3 is None

True

The regex Bat(wo)+man will not match the string 'The Adventures of Batman' because at least one 'wo' is required by the plus sign.
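The three quantifiers seen so far can be compared side by side. The following sketch runs each pattern against the same three titles and records which ones match; the dictionary layout and names are just one convenient way to organise the comparison.

```python
import re

titles = ['Batman', 'Batwoman', 'Batwowoman']
patterns = {
    '?': r'Bat(wo)?man',  # zero or one 'wo'
    '*': r'Bat(wo)*man',  # zero or more 'wo'
    '+': r'Bat(wo)+man',  # one or more 'wo'
}

results = {}
for symbol, pattern in patterns.items():
    regex = re.compile(pattern)
    # Keep only the titles in which this pattern finds a match.
    results[symbol] = [t for t in titles if regex.search(t) is not None]

for symbol, matched in results.items():
    print(symbol, matched)
```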

## Matching specific repetitions with curly brackets {}

In [25]:
haRegex = re.compile(r'(Ha){3}')
mo = haRegex.search('HaHaHa')
mo.group()

'HaHaHa'

In [26]:
mo2 = haRegex.search('Ha')
mo2 is None

True

Here (Ha){3} matches 'HaHaHa' but not 'Ha', because 'Ha' contains only one repetition; search() returns None, so mo2 is None.
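Curly brackets also accept a range. As a brief sketch, {3,5} means three to five repetitions, {3,} means three or more, and {,2} means at most two (including none). The test string with six repetitions is chosen purely for illustration.

```python
import re

text = 'HaHaHaHaHaHa'  # six repetitions of 'Ha'

exactly_three = re.compile(r'(Ha){3}').search(text).group()
three_to_five = re.compile(r'(Ha){3,5}').search(text).group()
three_or_more = re.compile(r'(Ha){3,}').search(text).group()
up_to_two = re.compile(r'(Ha){,2}').search(text).group()

print(exactly_three)
print(three_to_five)
print(three_or_more)
print(up_to_two)
```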

## Greedy and Non-greedy matching

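Python's regular expressions are greedy by default: (Ha){3,5} can match three, four or five instances of Ha in 'HaHaHaHaHa', and it takes the longest match available. Adding a question mark after the closing curly bracket makes the match non-greedy, so the shortest possible match is returned instead.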

In [27]:
greedyRegex = re.compile(r'(Ha){3,5}')
mo1 = greedyRegex.search('HaHaHaHaHa')
mo1.group()

'HaHaHaHaHa'

In [28]:
nongreedyRegex = re.compile(r'(Ha){3,5}?')
mo2 = nongreedyRegex.search('HaHaHaHaHa')
mo2.group()

'HaHaHa'

Note that the question mark can have two meanings in regular expressions:
* Declaring a non-greedy match or
* Flagging an optional group
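The two roles of the question mark can be sketched next to each other. After a quantifier such as {3,5} or + it switches that quantifier to non-greedy; after a group it makes the group optional. The digit string is an arbitrary example.

```python
import re

digits = '1234567'

# Greedy: take as many digits as allowed (up to five).
greedy = re.compile(r'\d{3,5}').search(digits).group()
# Non-greedy: stop as soon as the minimum of three is reached.
lazy = re.compile(r'\d{3,5}?').search(digits).group()

# The same trailing ? works after + as well.
lazy_plus = re.compile(r'(Ha)+?').search('HaHaHa').group()

# Here the ? follows a group, so it marks the group as optional instead.
optional = re.compile(r'Bat(wo)?man').search('Batman').group()

print(greedy, lazy, lazy_plus, optional)
```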

## The findall() method

In [29]:
phoneNumRegex = re.compile(r'\d\d\d\d\d \d\d\d\d\d\d')
mo = phoneNumRegex.search('Cell: 07897 585884 work: 07898 567478')
mo.group()

'07897 585884'

search() returns only the first match. The findall() method returns every match as a list of strings, as long as there are no groups in the regular expression.

In [32]:
phoneNumRegex = re.compile(r'\d\d\d\d\d \d\d\d\d\d\d')
mo = phoneNumRegex.findall('Cell: 07897 585884 work: 07898 567478')
mo

['07897 585884', '07898 567478']

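If the regular expression contains two or more groups, findall() returns a list of tuples. Each tuple represents one match and holds the strings for its groups. (With exactly one group, findall() returns a list of the strings matched by that group.)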

In [36]:
phoneNumRegex = re.compile(r'(\d\d\d\d\d) (\d\d\d\d\d\d)')
mo2 = phoneNumRegex.findall('Cell: 07897 585884 work: 07898 567478')
mo2

[('07897', '585884'), ('07898', '567478')]

In summary, the findall() method returns:

* Without groups, a list of strings, e.g. ['07989 858588', '07989 353555']
* With two or more groups, a list of tuples, e.g. [('07989', '858588'), ('07989', '353555')]
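One case sits between the two in the summary: with exactly one group, findall() returns a list of the strings matched by that group. If the whole match is wanted even though the pattern has groups, one option is finditer(), which yields Match objects. A short sketch with illustrative variable names:

```python
import re

text = 'Cell: 07897 585884 work: 07898 567478'

# Exactly one group: a list of that group's strings, not tuples.
one_group = re.compile(r'(\d\d\d\d\d) \d\d\d\d\d\d').findall(text)
print(one_group)

# finditer() yields Match objects, so group() gives each whole match.
grouped_regex = re.compile(r'(\d\d\d\d\d) (\d\d\d\d\d\d)')
whole_matches = [mo.group() for mo in grouped_regex.finditer(text)]
print(whole_matches)
```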

## Using character classes

* \d Numeric digit from 0 to 9
* \D Any character that is not a numeric digit
* \w Any letter, numeric digit, or the underscore character (e.g. matching a word)
* \W Any character that is not a letter, digit or underscore
* \s Space, Tab or new line character
* \S Any character that is not a space, tab or new line

In [38]:
xmasRegex = re.compile(r'\d+\s\w+')  # one or more digits, one whitespace character, one or more word characters
mo3 = xmasRegex.findall('12 Drummers, 11 pipers, 10 lords, 9 ladies, 8 maids, 7 swans')
mo3

['12 Drummers', '11 pipers', '10 lords', '9 ladies', '8 maids', '7 swans']
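To get a feel for how the shorthand classes differ, the sketch below applies three of them to the same line. The sample text is made up for illustration and contains a tab as well as spaces.

```python
import re

line = 'Order #42: 3 boxes,\t2 crates'

# \d+ picks out runs of digits only.
digit_runs = re.compile(r'\d+').findall(line)
# \w+ picks out runs of letters, digits and underscores.
word_runs = re.compile(r'\w+').findall(line)
# \S+ picks out anything between whitespace, punctuation included.
non_space_runs = re.compile(r'\S+').findall(line)

print(digit_runs)
print(word_runs)
print(non_space_runs)
```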

## Making your own character class using []

In [39]:
vowelRegex = re.compile(r'[aeiouAEIOU]')
mo4 = vowelRegex.findall('Robocop eats baby food and GOBSTOPPERS')
mo4

['o', 'o', 'o', 'e', 'a', 'a', 'o', 'o', 'a', 'O', 'O', 'E']

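To invert the lookup and match everything that is not a vowel, put a caret '^' straight after the opening bracket. Note that this also matches spaces and any other non-vowel characters, not just consonants.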

In [40]:
vowelRegex = re.compile(r'[^aeiouAEIOU]')
mo5 = vowelRegex.findall('Robocop eats baby food and GOBSTOPPERS')
mo5

['R',
 'b',
 'c',
 'p',
 ' ',
 't',
 's',
 ' ',
 'b',
 'b',
 'y',
 ' ',
 'f',
 'd',
 ' ',
 'n',
 'd',
 ' ',
 'G',
 'B',
 'S',
 'T',
 'P',
 'P',
 'R',
 'S']
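Character classes can also hold ranges written with a hyphen, and a negated class can list shorthand classes such as \s to exclude them too. The sketch below reuses the sentence from the cells above; the variable names are illustrative.

```python
import re

text = 'Robocop eats baby food and GOBSTOPPERS'

# [a-z] is a range covering every lowercase letter.
lowercase_runs = re.compile(r'[a-z]+').findall(text)
print(lowercase_runs)

# Adding \s to the negated class drops the spaces from the result.
no_vowels_or_spaces = re.compile(r'[^aeiouAEIOU\s]').findall(text)
print(''.join(no_vowels_or_spaces))
```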