<h1>Welcome to the Dark Art of Coding:</h1>
<h2>Automating with Python</h2>
Chapter 07: Pattern matching with Regular Expressions

<h1>Today's class</h1>
<ul>
<li><strong>Review and Questions?</strong></li>
<li><strong>Objectives:</strong>
<ul>
    <li><strong>Example without RegEx</strong></li>
    <li><strong>Example with RegEx</strong></li>
    <li><strong>Debuggex</strong></li>
</ul>
</li>
</ul>

<h1>Review and questions?</h1>
<ul>
    <li>Name two ways to include an apostrophe in a string</li>
    <li>What syntax do you use to get the 5th item from a string?</li>
    <li>Name three methods associated with strings</li>
</ul>


<h1>What are <em>Regular Expressions</em>?</h1>

A regular expression (regex) is a specially formatted string that describes a text pattern. It is used to search for, match, and extract text that fits that pattern.

In [None]:
# To decide whether a string is a phone number without regular expressions,
# we can run a series of checks by hand.
# Assume the number uses the format: xxx-yyy-zzzz


def isPhoneNumber(text):
    if len(text) != 12:
        return False

In [None]:
    for i in range(0, 3):
        if not text[i].isdecimal():
            return False

In [None]:
    if text[3] != '-':
        return False

In [None]:
    for i in range(4, 7):
        if not text[i].isdecimal():
            return False

In [None]:
    if text[7] != '-':
        return False

In [None]:
    for i in range(8, 12):
        if not text[i].isdecimal():
            return False
    return True

In [2]:
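def isPhoneNumber(text):
    # A number in the format xxx-yyy-zzzz is exactly 12 characters long
    if len(text) != 12:
        return False
    # Characters 0-2: the first three digits
    for i in range(0, 3):
        if not text[i].isdecimal():
            return False
    # Character 3: the first hyphen
    if text[3] != '-':
        return False
    # Characters 4-6: the next three digits
    for i in range(4, 7):
        if not text[i].isdecimal():
            return False
    # Character 7: the second hyphen
    if text[7] != '-':
        return False
    # Characters 8-11: the last four digits
    for i in range(8, 12):
        if not text[i].isdecimal():
            return False
    # Every check passed
    return True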

In [3]:
print('Checking against: 465-814-0978')
print(isPhoneNumber('465-814-0978'))
print('Checking against: not_a_number')
print(isPhoneNumber('not_a_number'))

Checking against: 465-814-0978
True
Checking against: not_a_number
False


In [4]:
message = 'text me at 123-456-7890. call me at 098-765-4321'
for i in range(len(message)):
    chunk = message[i:i+12]
    if isPhoneNumber(chunk):
        print('Found number: ' + chunk)

Found number: 123-456-7890
Found number: 098-765-4321


<h2>What do we do when the number format varies?</h2>

456.789.0123

(443) 554-6655

098 629 7452

In [5]:
import re

phoneRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d') # We use a raw string because typing: r'\d\d\d'
                                                   # is easier than typing: '\\d\\d\\d'

In [6]:
mo = phoneRegex.search('My number is 786-234-6273')
print('found numbers:', mo.group())

found numbers: 786-234-6273
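
One thing to watch for: `search()` returns `None` when the pattern is not found, so calling `.group()` on the result raises an `AttributeError`. The following is a minimal sketch of guarding against that. The helper name `findPhone` is hypothetical and is not part of the original notebook.

```python
import re

phoneRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')


def findPhone(text):
    """Return the first phone number found in text, or None if there is none."""
    mo = phoneRegex.search(text)
    # search() returns None when nothing matches, so check before calling group()
    if mo is None:
        return None
    return mo.group()


print(findPhone('My number is 786-234-6273'))
print(findPhone('No number here'))
```

The first call prints the matched number and the second prints `None` instead of raising an error.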


In [7]:
phoneRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phoneRegex.search('My number is 786-234-6273')

In [8]:
print(mo.group(0))

786-234-6273


In [9]:
print(mo.group(1))

786


In [10]:
print(mo.group(2))

234-6273


In [11]:
areaCode, mainNumber = mo.groups()

In [12]:
print(areaCode)
print(mainNumber)

786
234-6273


In [13]:
phoneRegex = re.compile(r'(\(\d\d\d\)) (\d\d\d-\d\d\d\d)')
mo = phoneRegex.search('(808) 872-8204')
print(mo.group())

(808) 872-8204


In [14]:
multiRegex = re.compile(r'bat|cat')
mo = multiRegex.search('batman and catwoman')
print(mo.group())
mo = multiRegex.search('catwoman and batman')
print(mo.group())

bat
cat


In [15]:
endRegex = re.compile(r'bat(man|mobile|arangs|bat)')
mo = endRegex.search('Get in the batmobile!')
print(mo.group())
print(mo.group(1))

batmobile
mobile


In [16]:
phoneRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
phoneRegex.findall('Home: 726-282-0186, Cell: 873-193-8264')

['726-282-0186', '873-193-8264']

In [17]:
phoneRegex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')
phoneRegex.findall('Home: 726-282-0186, Cell: 873-193-8264')

[('726', '282', '0186'), ('873', '193', '8264')]
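
When the pattern contains groups, `findall()` returns tuples of the groups and no longer returns the full match. One possible way to get both is `finditer()`, which yields match objects. This is a small sketch and the variable names are chosen for illustration only.

```python
import re

phoneRegex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')
text = 'Home: 726-282-0186, Cell: 873-193-8264'

# findall() with groups: one tuple of group strings per match
parts = phoneRegex.findall(text)

# finditer() yields match objects, so group() still returns the whole match
fullNumbers = [mo.group() for mo in phoneRegex.finditer(text)]

# ...and individual groups remain available by number
areaCodes = [mo.group(1) for mo in phoneRegex.finditer(text)]

print(parts)
print(fullNumbers)
print(areaCodes)
```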

In [18]:
haRegex = re.compile(r'(ha){3}')
haRegex.search('hahahahaha').group()

'hahaha'

In [19]:
haRegex = re.compile(r'(ha){3,5}')

In [20]:
haRegex.search('hahahahaha').group()

'hahahahaha'

In [21]:
haRegex = re.compile(r'(ha){3,5}?')
haRegex.search('hahahahaha').group()

'hahaha'
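
The same greedy versus non-greedy behavior applies to `*` and `+`. A short sketch using `.*` and `.*?` follows. The sample text is made up for illustration.

```python
import re

text = '<To serve man> for dinner.>'

# Greedy: .* grabs as much as possible, up to the LAST '>'
greedyRegex = re.compile(r'<.*>')

# Non-greedy: .*? grabs as little as possible, up to the FIRST '>'
lazyRegex = re.compile(r'<.*?>')

print(greedyRegex.search(text).group())  # <To serve man> for dinner.>
print(lazyRegex.search(text).group())    # <To serve man>
```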

|Character class |Represents                                      |
|:--------------:|:----------------------------------------------:|
|\d              |numeric digits 0-9                              |
|\D              |everything BUT digits 0-9                       |
|\w              |any letter, digit, or underscore character      |
|\W              |everything BUT letters, digits, or underscores  |
|\s              |spaces, tabs, and newline characters            |
|\S              |everything BUT spaces, tabs, and newlines       |

|Regex symbols    |Their function                                         |
|:---------------:|:-----------------------------------------------------:|
|?                |matches zero or one                                    |
|\*               |matches zero or more                                   |
|+                |matches one or more                                    |
|{n}              |matches exactly n                                      |
|{n,}             |matches n or more                                      |
|{,m}             |matches 0 to m                                         |
|{n,m}            |matches at least n and at most m                       |
|{n,m}?, \*?, +?  |performs a non-greedy (lazy) match                     |
|^spam            |the string must begin with spam                        |
|spam$            |the string must end with spam                          |
|.                |matches any character except newlines                  |
|\d, \w, \s       |match a digit, word, and space character respectively  |
|\D, \W, \S       |match anything BUT a digit, word, or space character   |
|[abc]            |matches any single character between the brackets      |
|[^abc]           |matches any character but the ones between the brackets|
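
A few of the symbols in the table, sketched on made-up strings. The regex names below are chosen for illustration only.

```python
import re

# ? : the group (wo) may appear zero or one time
batRegex = re.compile(r'Bat(wo)?man')
print(batRegex.search('Batman').group())    # Batman
print(batRegex.search('Batwoman').group())  # Batwoman

# ^ and $ : anchor the pattern to the start and end of the string,
# so the whole string must consist of digits
wholeNumberRegex = re.compile(r'^\d+$')
print(wholeNumberRegex.search('1234567890') is not None)     # True
print(wholeNumberRegex.search('12345xyz67890') is not None)  # False

# [abc] matches one character from the set; [^abc] matches one character NOT in it
vowelRegex = re.compile(r'[aeiou]')
consonantRegex = re.compile(r'[^aeiou ]')  # the space is excluded as well
print(vowelRegex.findall('regex rules'))      # ['e', 'e', 'u', 'e']
print(consonantRegex.findall('regex rules'))  # ['r', 'g', 'x', 'r', 'l', 's']
```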

In [None]:
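re.IGNORECASE  # match letters regardless of case
re.DOTALL      # let . match newline characters as well
re.VERBOSE     # allow whitespace and comments inside the pattern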

In [22]:
helloRegex = re.compile(r'hello', re.IGNORECASE)
print(helloRegex.findall('I said "HELLO!" to the man after he said hello to me'))

['HELLO', 'hello']


In [24]:
noDotallRegex = re.compile(r'.*')  # without re.DOTALL, . stops at the first newline
print(noDotallRegex.search('Batman is love\nBatman is life').group())

Batman is love


In [25]:
dotallRegex = re.compile(r'.*', re.DOTALL)
print(dotallRegex.search('Batman is love\nBatman is life').group())

Batman is love
Batman is life


In [26]:
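phoneRegex = re.compile(r'((\d{3}|\(\d{3}\))?(\s|-|\.)?\d{3}(\s|-|\.)\d{4}(\s*(ext|x|ext\.)\s*\d{2,5})?)')

In [27]:
phoneRegex = re.compile(r'''(
                            (\d{3}|\(\d{3}\))?              # optional area code, with or without parentheses
                            (\s|-|\.)?                      # optional separator
                            \d{3}                           # first three digits
                            (\s|-|\.)                       # separator
                            \d{4}                           # last four digits
                            (\s*(ext|x|ext\.)\s*\d{2,5})?   # optional extension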
                            )''', re.VERBOSE)
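
As a rough check, the verbose pattern can be tried against the varied formats shown earlier (`456.789.0123`, `(443) 554-6655`, `098 629 7452`). The sketch below rebuilds the pattern so that it runs on its own. The dot in `ext\.` is escaped on the assumption that a literal period is intended, and the sample sentence and the `x123` extension are made up.

```python
import re

phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?              # optional area code, with or without parentheses
    (\s|-|\.)?                      # optional separator
    \d{3}                           # first three digits
    (\s|-|\.)                       # separator
    \d{4}                           # last four digits
    (\s*(ext|x|ext\.)\s*\d{2,5})?   # optional extension (assumes a literal dot)
    )''', re.VERBOSE)

text = 'Call 456.789.0123, (443) 554-6655 or 098 629 7452 x123'

# The pattern has groups, so findall() would return tuples;
# finditer() plus group() gives the full text of each match instead.
numbers = [mo.group() for mo in phoneRegex.finditer(text)]
print(numbers)
```

This prints `['456.789.0123', '(443) 554-6655', '098 629 7452 x123']`.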

In [None]:
re.compile(r'example string', re.VERBOSE | re.IGNORECASE)
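
Flags are combined with the bitwise OR operator `|`. One detail worth knowing is that under `re.VERBOSE` literal spaces in the pattern are ignored, so whitespace has to be written as `\s` (or escaped). A minimal sketch follows; the pattern and the sample text are made up for illustration.

```python
import re

greetingRegex = re.compile(r'''
    hello    # the greeting, in any letter case thanks to re.IGNORECASE
    \s+      # one or more whitespace characters (a plain space would be ignored)
    world    # the audience
    ''', re.VERBOSE | re.IGNORECASE)

mo = greetingRegex.search('He shouted HELLO   World!')
print(mo.group())  # HELLO   World
```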

<h1>Homework:</h1>
<strong>IP address Regex!</strong>