# Searching For Simple Patterns

Matching literal letters and metacharacters is the simplest task that regular expressions can perform. In this section we will see how to use regular expressions for more complex pattern matching. We can build the patterns we need from the metacharacters mentioned above.

The first metacharacter we are going to look at is the backslash (\\). We already saw that the backslash can be used to escape all the metacharacters so you can still match them in patterns. However, the backslash can also be followed by various characters to signal various special sequences. Here is a list of the special sequences we are going to look at in this notebook:

* **\d** - Matches any decimal digit; this is equivalent to the set [0-9]


* **\D** - Matches any non-digit character; this is equivalent to the set [^0-9]


* **\s** - Matches any whitespace character; this is equivalent to the set [ \t\n\r\f\v]


* **\S** - Matches any non-whitespace character; this is equivalent to the set [^ \t\n\r\f\v]


* **\w** - Matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_]


* **\W** - Matches any non-alphanumeric character; this is equivalent to the set [^a-zA-Z0-9_]
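
As a quick check of the equivalences listed above, the sketch below compares each special sequence with its explicit character set on a short string. The string `ascii_text` and the `pairs` dictionary are made up for illustration. One caveat: for Python 3 `str` patterns these sequences also match some characters outside the listed sets (for instance, `\d` matches digits from other scripts), so the listed sets are exact only when the `re.ASCII` flag is used. For plain English text like the samples in this notebook, both forms give the same matches.

```python
import re

# Hypothetical sample with letters, digits, an underscore, whitespace, and punctuation
ascii_text = 'Room_42\tis on\nfloor 3!'

# Each special sequence paired with the explicit set it stands for
pairs = {
    r'\d': r'[0-9]',
    r'\D': r'[^0-9]',
    r'\s': r'[ \t\n\r\f\v]',
    r'\S': r'[^ \t\n\r\f\v]',
    r'\w': r'[a-zA-Z0-9_]',
    r'\W': r'[^a-zA-Z0-9_]',
}

results = {}
for shorthand, explicit in pairs.items():
    # Compare the list of matches produced by both forms
    same = re.findall(shorthand, ascii_text) == re.findall(explicit, ascii_text)
    results[shorthand] = same
    print(shorthand, 'behaves like', explicit, ':', same)
```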

We can see that there is a difference between lowercase and uppercase sequences. For example, while \\d matches any digit, \D matches everything that is not a digit. Similarly, while \\s matches any whitespace character, \S matches everything that is not a whitespace character. Let's start by learning how to search for numbers using `\d`.

### Matching Numbers Using \d

In the code below we will create a regular expression to find the decimal digits in a sentence.

In [1]:
import re

sample_text = 'Alice lives in 1230 First St., Ocean City, MD 156789.'

regex = re.compile(r'\d')

matches = regex.finditer(sample_text)

for match in matches:
    print(match)

<_sre.SRE_Match object; span=(15, 16), match='1'>
<_sre.SRE_Match object; span=(16, 17), match='2'>
<_sre.SRE_Match object; span=(17, 18), match='3'>
<_sre.SRE_Match object; span=(18, 19), match='0'>
<_sre.SRE_Match object; span=(46, 47), match='1'>
<_sre.SRE_Match object; span=(47, 48), match='5'>
<_sre.SRE_Match object; span=(48, 49), match='6'>
<_sre.SRE_Match object; span=(49, 50), match='7'>
<_sre.SRE_Match object; span=(50, 51), match='8'>
<_sre.SRE_Match object; span=(51, 52), match='9'>


As we can see, all the matches found correspond to decimal digits between 0 and 9.
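
Printing the whole match object is handy for inspection, but its pieces can also be read individually. The sketch below uses the match object's `.group()` method to get the matched text and `.span()` to get its position. The list name `digits` is only an illustrative choice.

```python
import re

sample_text = 'Alice lives in 1230 First St., Ocean City, MD 156789.'
regex = re.compile(r'\d')

# Collect the matched text and its (start, end) span for every digit
digits = [(match.group(), match.span()) for match in regex.finditer(sample_text)]

for text, span in digits:
    print(text, span)
```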

Conversely, if we wanted to find all the characters that are not decimal digits, we would use \\D instead.

In [2]:
sample_text = 'Alice lives in 1230 First St., Ocean City, MD 156789.'

regex = re.compile(r'\D')

matches = regex.finditer(sample_text)

for match in matches:
    print(match)

<_sre.SRE_Match object; span=(0, 1), match='A'>
<_sre.SRE_Match object; span=(1, 2), match='l'>
<_sre.SRE_Match object; span=(2, 3), match='i'>
<_sre.SRE_Match object; span=(3, 4), match='c'>
<_sre.SRE_Match object; span=(4, 5), match='e'>
<_sre.SRE_Match object; span=(5, 6), match=' '>
<_sre.SRE_Match object; span=(6, 7), match='l'>
<_sre.SRE_Match object; span=(7, 8), match='i'>
<_sre.SRE_Match object; span=(8, 9), match='v'>
<_sre.SRE_Match object; span=(9, 10), match='e'>
<_sre.SRE_Match object; span=(10, 11), match='s'>
<_sre.SRE_Match object; span=(11, 12), match=' '>
<_sre.SRE_Match object; span=(12, 13), match='i'>
<_sre.SRE_Match object; span=(13, 14), match='n'>
<_sre.SRE_Match object; span=(14, 15), match=' '>
<_sre.SRE_Match object; span=(19, 20), match=' '>
<_sre.SRE_Match object; span=(20, 21), match='F'>
<_sre.SRE_Match object; span=(21, 22), match='i'>
<_sre.SRE_Match object; span=(22, 23), match='r'>
<_sre.SRE_Match object; span=(23, 24), match='s'>
... (output truncated)

We can see that none of the matches are decimal digits. We also see that by using \\D we were able to match every non-digit character, including periods (`.`) and spaces.
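
Since `\D` is the complement of `\d`, every character in the string is matched by exactly one of the two. A minimal sketch of that check, using `re.findall()` (which returns the matched strings as a list) instead of `.finditer()`:

```python
import re

sample_text = 'Alice lives in 1230 First St., Ocean City, MD 156789.'

digits = re.findall(r'\d', sample_text)      # every digit
non_digits = re.findall(r'\D', sample_text)  # everything else

# Together the two lists account for every character in the string
print(len(digits), 'digits +', len(non_digits), 'non-digits =', len(sample_text), 'characters')
```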

# TODO: Find IP Addresses

In the code below, we have a sentence with three IP addresses. Write a single regular expression that can match any of the IP addresses in the sentence. Then use the `.finditer()` method to find the regex in the `sample_text` string. Finally, write a loop to print the `matches`.

**HINT :** Use the special sequence `\d` and take advantage of the fact that all the IP addresses in the sentence have the same pattern.

In [3]:
# import re module
import re

sample_text = 'Here are three IP address: 123.456.789.123, 999.888.777.666, 111.222.333.444'

# Write a regex that matches an IP address (the dots must be escaped)
regex = re.compile(r'\d\d\d\.\d\d\d\.\d\d\d\.\d\d\d')

# Use the .finditer method to find the above regex
matches = regex.finditer(sample_text)

# Write a loop to print the match
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(27, 42), match='123.456.789.123'>
<_sre.SRE_Match object; span=(44, 59), match='999.888.777.666'>
<_sre.SRE_Match object; span=(61, 76), match='111.222.333.444'>


If you wrote your regex correctly you should see three matches above.
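
Because the pattern repeats the same three-digit group four times, it can also be assembled programmatically. The sketch below is one possible way to do that with `str.join()`. The names `three_digits` and `ip_pattern` are illustrative. Note that this pattern only covers addresses written with exactly three digits per group, as in this sample. Real IPv4 addresses can have one to three digits per group, with values up to 255.

```python
import re

sample_text = 'Here are three IP address: 123.456.789.123, 999.888.777.666, 111.222.333.444'

# One group of three digits, written once
three_digits = r'\d\d\d'

# Join four groups with an escaped dot so that '.' is matched literally
ip_pattern = r'\.'.join([three_digits] * 4)
regex = re.compile(ip_pattern)

# Keep only the matched text of each match
addresses = [match.group() for match in regex.finditer(sample_text)]
print(ip_pattern)
print(addresses)
```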

### Matching Whitespace Characters Using \s

In the code below we will create a regular expression to find all the whitespace characters in a sentence. For this example, we will use a string literal that spans multiple lines. To do this, we will employ triple-quotes `'''...'''`.

In [4]:
sample_text = '''
\tAlice lives in:\f
1230 First St.\r
Ocean City, MD 156789.\v
'''

regex = re.compile(r'\s')

matches = regex.finditer(sample_text)

for match in matches:
    print(match)

<_sre.SRE_Match object; span=(0, 1), match='\n'>
<_sre.SRE_Match object; span=(1, 2), match='\t'>
<_sre.SRE_Match object; span=(7, 8), match=' '>
<_sre.SRE_Match object; span=(13, 14), match=' '>
<_sre.SRE_Match object; span=(17, 18), match='\x0c'>
<_sre.SRE_Match object; span=(18, 19), match='\n'>
<_sre.SRE_Match object; span=(23, 24), match=' '>
<_sre.SRE_Match object; span=(29, 30), match=' '>
<_sre.SRE_Match object; span=(33, 34), match='\r'>
<_sre.SRE_Match object; span=(34, 35), match='\n'>
<_sre.SRE_Match object; span=(40, 41), match=' '>
<_sre.SRE_Match object; span=(46, 47), match=' '>
<_sre.SRE_Match object; span=(49, 50), match=' '>
<_sre.SRE_Match object; span=(57, 58), match='\x0b'>
<_sre.SRE_Match object; span=(58, 59), match='\n'>


As we can see, all the matches found correspond to spaces, tabs (\\t), newlines (\\n), carriage returns (\\r), formfeeds (\\f), and vertical tabs (\\v). Notice that formfeeds appear as `\x0c` and vertical tabs as `\x0b`.
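
The `\x0c` and `\x0b` forms are simply how Python displays these two characters. The short check below confirms that they are the same characters as `\f` and `\v`.

```python
# Form feed and vertical tab are displayed by repr() as hexadecimal escapes
form_feed = '\f'
vertical_tab = '\v'

print(repr(form_feed), form_feed == '\x0c', ord(form_feed))
print(repr(vertical_tab), vertical_tab == '\x0b', ord(vertical_tab))
```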

Conversely, if we wanted to find all the characters that are not whitespace characters, we would use \S instead.

In [5]:
sample_text = '''
\tAlice lives in:\f
1230 First St.\r
Ocean City, MD 156789.\v
'''

regex = re.compile(r'\S')

matches = regex.finditer(sample_text)

for match in matches:
    print(match)

<_sre.SRE_Match object; span=(2, 3), match='A'>
<_sre.SRE_Match object; span=(3, 4), match='l'>
<_sre.SRE_Match object; span=(4, 5), match='i'>
<_sre.SRE_Match object; span=(5, 6), match='c'>
<_sre.SRE_Match object; span=(6, 7), match='e'>
<_sre.SRE_Match object; span=(8, 9), match='l'>
<_sre.SRE_Match object; span=(9, 10), match='i'>
<_sre.SRE_Match object; span=(10, 11), match='v'>
<_sre.SRE_Match object; span=(11, 12), match='e'>
<_sre.SRE_Match object; span=(12, 13), match='s'>
<_sre.SRE_Match object; span=(14, 15), match='i'>
<_sre.SRE_Match object; span=(15, 16), match='n'>
<_sre.SRE_Match object; span=(16, 17), match=':'>
<_sre.SRE_Match object; span=(19, 20), match='1'>
<_sre.SRE_Match object; span=(20, 21), match='2'>
<_sre.SRE_Match object; span=(21, 22), match='3'>
<_sre.SRE_Match object; span=(22, 23), match='0'>
<_sre.SRE_Match object; span=(24, 25), match='F'>
<_sre.SRE_Match object; span=(25, 26), match='i'>
<_sre.SRE_Match object; span=(26, 27), match='r'>
... (output truncated)

We can see that none of the matches are whitespace characters. We also see that by using \\S we were able to match every non-whitespace character, including periods (`.`), letters, and numbers.

# TODO: Print Numbers Between Whitespace Characters

In the code below we have a multi-line string with numbers in between whitespace characters. Write a single regular expression that finds the tabs (`\t`) and the newlines (`\n`) in this multi-line string. Then use the `.finditer()` method to find the regex in the `sample_text` string. Then, write a loop that uses the span information from each `match` to print only the numbers found in the original multi-line string. Notice that the sets of numbers have different lengths. Your code should work for sets of numbers of any length, so if the numbers in the string were to change, your code should still print them.

**HINT :** Notice that there are no space characters in the multi-line string, only tabs and newlines. Use the `\s` sequence to find the tabs and newlines. Then notice that the `end` index of one match and the `start` index of the next match mark where each set of numbers begins and ends. Use these indices to slice the numbers out of the original multi-line string. A match object has `.start()` and `.end()` methods that return the `start` and `end` indices of its span. We can use these methods instead of the statement `match.span()[0]` that we were using before. In other words, `match.start()` is equivalent to `match.span()[0]` and `match.end()` is equivalent to `match.span()[1]`.
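
The equivalence between `.start()`, `.end()`, and `.span()` can be seen on a tiny example. The string `text` below is made up for illustration, and `re.search()` is used here only because it returns the first match directly.

```python
import re

# Hypothetical one-line sample: two numbers separated by a tab
text = '12\t345'

match = re.search(r'\s', text)

print(match.span())                # (start, end) as a tuple
print(match.start(), match.end())  # the same two indices, read separately

# Slicing with these indices recovers the text on each side of the match
before = text[:match.start()]
after = text[match.end():]
print(before, after)
```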

In [6]:
# import re module
import re

sample_text = '''
123\t45\t7895
1\t222\t33
'''

# Write a regex that matches the whitespace characters
regex = re.compile(r'\s')

# Use the .finditer method to find the above regex
matches = regex.finditer(sample_text)

counter = 0

# Write a loop to print the numbers found in the original string
for match in matches:    
    if counter != 0:
        start_idx = match.start()        
        print('\nNumbers from the original text:', sample_text[end_idx:start_idx])        
    end_idx = match.end()    
    counter += 1


Numbers from the original text: 123

Numbers from the original text: 45

Numbers from the original text: 7895

Numbers from the original text: 1

Numbers from the original text: 222

Numbers from the original text: 33
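
The same idea can be wrapped in a small reusable function. This is only one possible sketch: the name `numbers_between_whitespace` is hypothetical, and it tracks the previous match's end index with `None` instead of a counter. If two whitespace characters were adjacent, the function would return an empty string for the gap between them.

```python
import re

def numbers_between_whitespace(text):
    """Return the pieces of `text` that sit between consecutive whitespace matches."""
    pieces = []
    previous_end = None  # end index of the previous whitespace match
    for match in re.finditer(r'\s', text):
        if previous_end is not None:
            # Slice from the end of the previous match to the start of this one
            pieces.append(text[previous_end:match.start()])
        previous_end = match.end()
    return pieces

sample_text = '''
123\t45\t7895
1\t222\t33
'''

numbers = numbers_between_whitespace(sample_text)
print(numbers)
```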


### Matching Alphanumeric Characters Using \w

In the code below we will create a regular expression to find all the alphanumeric characters in a sentence. This includes the underscore ( _ ), all the numbers 0 through 9, and all uppercase and lowercase letters.

In [7]:
sample_text = '''
You can contact FAKE Company at:
fake_company12@email.com.
'''
regex = re.compile(r'\w')

matches = regex.finditer(sample_text)

for match in matches:
    print(match)

<_sre.SRE_Match object; span=(1, 2), match='Y'>
<_sre.SRE_Match object; span=(2, 3), match='o'>
<_sre.SRE_Match object; span=(3, 4), match='u'>
<_sre.SRE_Match object; span=(5, 6), match='c'>
<_sre.SRE_Match object; span=(6, 7), match='a'>
<_sre.SRE_Match object; span=(7, 8), match='n'>
<_sre.SRE_Match object; span=(9, 10), match='c'>
<_sre.SRE_Match object; span=(10, 11), match='o'>
<_sre.SRE_Match object; span=(11, 12), match='n'>
<_sre.SRE_Match object; span=(12, 13), match='t'>
<_sre.SRE_Match object; span=(13, 14), match='a'>
<_sre.SRE_Match object; span=(14, 15), match='c'>
<_sre.SRE_Match object; span=(15, 16), match='t'>
<_sre.SRE_Match object; span=(17, 18), match='F'>
<_sre.SRE_Match object; span=(18, 19), match='A'>
<_sre.SRE_Match object; span=(19, 20), match='K'>
<_sre.SRE_Match object; span=(20, 21), match='E'>
<_sre.SRE_Match object; span=(22, 23), match='C'>
<_sre.SRE_Match object; span=(23, 24), match='o'>
<_sre.SRE_Match object; span=(24, 25), match='m'>
... (output truncated)

As we can see, all the matches found correspond to alphanumeric characters only, including the underscore in the email address.
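
One caveat about the set `[a-zA-Z0-9_]`: for Python 3 `str` patterns, `\w` also matches letters and digits outside the ASCII range unless the `re.ASCII` flag is passed. The sketch below illustrates the difference on a made-up string containing an accented letter.

```python
import re

# Hypothetical sample with an accented letter (U+00E9), an underscore, and a digit
text = 'caf\u00e9_1'

unicode_matches = re.findall(r'\w', text)          # default behavior for str patterns
ascii_matches = re.findall(r'\w', text, re.ASCII)  # restrict \w to [a-zA-Z0-9_]

print(unicode_matches)
print(ascii_matches)
```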

Conversely, if we wanted to find all the characters that are not alphanumeric characters, we would use \W instead.

In [8]:
sample_text = '''
You can contact FAKE Company at:
fake_company12@email.com.
'''
regex = re.compile(r'\W')

matches = regex.finditer(sample_text)

for match in matches:
    print(match)

<_sre.SRE_Match object; span=(0, 1), match='\n'>
<_sre.SRE_Match object; span=(4, 5), match=' '>
<_sre.SRE_Match object; span=(8, 9), match=' '>
<_sre.SRE_Match object; span=(16, 17), match=' '>
<_sre.SRE_Match object; span=(21, 22), match=' '>
<_sre.SRE_Match object; span=(29, 30), match=' '>
<_sre.SRE_Match object; span=(32, 33), match=':'>
<_sre.SRE_Match object; span=(33, 34), match='\n'>
<_sre.SRE_Match object; span=(48, 49), match='@'>
<_sre.SRE_Match object; span=(54, 55), match='.'>
<_sre.SRE_Match object; span=(58, 59), match='.'>
<_sre.SRE_Match object; span=(59, 60), match='\n'>


We can see that none of the matches are alphanumeric characters. We also see that by using \\W we were able to match all the whitespace characters, as well as the colon, the periods, and the @ symbol in the email address.
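
To see at a glance which distinct characters `\W` picked up, the matches can be collected into a set. This is a small sketch, and the name `non_word_chars` is only illustrative.

```python
import re

sample_text = '''
You can contact FAKE Company at:
fake_company12@email.com.
'''

# Gather the distinct characters matched by \W, sorted by code point
non_word_chars = sorted(set(re.findall(r'\W', sample_text)))
print(non_word_chars)
```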

# TODO: Find emails

In the code below we have a multi-line string with three email addresses. All the email addresses have the same pattern, namely, the first name initial, followed by a dot (`.`), followed by the last name initial. Write a regular expression that can find all the email addresses that have this pattern. Then use the `.finditer()` method to find the regex in the `sample_text` string. Then, write a loop that prints all the `matches`.

In [51]:
# import re module
import re

sample_text = '''
John Sanders: j.s@email.com
Alice Walters: a.w@email.com
Mary Jones: m.j@email.com
'''

# Write a regex that matches the emails
regex = re.compile(r'\w\.\w@email\.com')

# Use the .finditer method to find the above regex
matches = regex.finditer(sample_text)

# Write a loop to print the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(15, 28), match='j.s@email.com'>
<_sre.SRE_Match object; span=(44, 57), match='a.w@email.com'>
<_sre.SRE_Match object; span=(70, 83), match='m.j@email.com'>


If you wrote your regex correctly, you should see three email addresses above.
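
One detail worth checking in patterns like this: an unescaped `.` is a metacharacter that matches any character except a newline, so the dot in `email.com` should be escaped if only a literal dot is acceptable. The sketch below compares both spellings on a made-up string, `tricky_text`, that is not a valid address.

```python
import re

# Hypothetical string in which the dot before "com" is replaced by another character
tricky_text = 'j.s@emailXcom'

loose = re.compile(r'\w\.\w@email.com')    # the final '.' matches any character
strict = re.compile(r'\w\.\w@email\.com')  # the final '\.' matches only a literal dot

loose_hit = loose.search(tricky_text)
strict_hit = strict.search(tricky_text)

print(loose_hit is not None)   # does the loose pattern accept the string?
print(strict_hit is not None)  # does the strict pattern accept the string?
```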