# Searching For Simple Patterns

Being able to match letters and metacharacters is the simplest task that regular expressions can do. In this section we will see how we can use regular expressions to perform more complex pattern matching. We can form any pattern we want by using the metacharacters mentioned in the previous lesson.

The first metacharacter we are going to look at is the backslash (`\`). We already saw that the backslash can be used to escape all the metacharacters, so that you can search for them directly. However, the backslash can also be followed by various characters to signal various special sequences. Here is a list of the special sequences we are going to look at in this notebook:

* `\d` - Matches any decimal digit; this is equivalent to the set `[0-9]`
* `\D` - Matches any non-digit character; this is equivalent to the set `[^0-9]`
* `\s` - Matches any whitespace character; this is equivalent to the set `[ \t\n\r\f\v]`
* `\S` - Matches any non-whitespace character; this is equivalent to the set `[^ \t\n\r\f\v]`
* `\w` - Matches any alphanumeric character and the underscore; this is equivalent to the set `[a-zA-Z0-9_]`
* `\W` - Matches any character that is neither alphanumeric nor an underscore; this is equivalent to the set `[^a-zA-Z0-9_]`

We can see that there is a difference between lowercase and uppercase sequences. For example, while `\d` matches any digit, `\D` matches everything that is **not** a digit. Similarly, while `\s` matches any whitespace character, `\S` matches everything that is **not** a whitespace character; and while `\w` matches any alphanumeric character, `\W` matches everything that is **not** an alphanumeric character.
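As a quick illustration of this lowercase/uppercase relationship, the minimal sketch below checks that `\d` and `\D` split a string into two disjoint groups of characters. The string `text` is a made-up example, and `re.findall()` is used here only to get the matched characters as a plain list.

```python
import re

# Hypothetical example string, chosen only for illustration
text = 'Room 42, Floor_3!'

# \d collects the digits, \D collects everything else
digits = re.findall(r'\d', text)
non_digits = re.findall(r'\D', text)

print(digits)  # ['4', '2', '3']

# Every character is matched by exactly one of the two sequences
print(len(digits) + len(non_digits) == len(text))  # True
```

The same check should hold for the pairs `\s`/`\S` and `\w`/`\W`, since each uppercase sequence is the complement of its lowercase counterpart.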

Let's start by learning how to use `\d` to search for decimal digits.

### Matching Numbers Using `\d`

In the code below, we will use `'\d'` as our regular expression to find all the decimal digits in our `sample_text` string:

In [1]:
# Import re module
import re

# Sample text
sample_text = 'Alice lives in 1230 First St., Ocean City, MD 156789.'

# Create a regular expression object with the regular expression '\d'
regex = re.compile(r'\d')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(15, 16), match='1'>
<_sre.SRE_Match object; span=(16, 17), match='2'>
<_sre.SRE_Match object; span=(17, 18), match='3'>
<_sre.SRE_Match object; span=(18, 19), match='0'>
<_sre.SRE_Match object; span=(46, 47), match='1'>
<_sre.SRE_Match object; span=(47, 48), match='5'>
<_sre.SRE_Match object; span=(48, 49), match='6'>
<_sre.SRE_Match object; span=(49, 50), match='7'>
<_sre.SRE_Match object; span=(50, 51), match='8'>
<_sre.SRE_Match object; span=(51, 52), match='9'>


As we can see, all the matches found above correspond to only decimal digits between 0 and 9.
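Printing a match object shows its full representation (on newer Python versions the same objects may print as `<re.Match object; ...>`). If only the matched text and its position are needed, a sketch along the following lines could be used. It relies on the standard `.group()` and `.span()` methods of match objects; the variable name `found` is just an illustrative choice.

```python
import re

sample_text = 'Alice lives in 1230 First St., Ocean City, MD 156789.'
regex = re.compile(r'\d')

# Keep only the matched text and its (start, end) position
found = [(m.group(), m.span()) for m in regex.finditer(sample_text)]

print(found[0])  # ('1', (15, 16))

# Join the individual digits back into one string
print(''.join(digit for digit, _ in found))  # 1230156789
```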

Conversely, if we want to find all the characters that are **not** decimal digits, we can use `\D` as our regular expression, as shown below:

In [2]:
# Import re module
import re

# Sample text
sample_text = 'Alice lives in 1230 First St., Ocean City, MD 156789.'

# Create a regular expression object with the regular expression '\D'
regex = re.compile(r'\D')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(0, 1), match='A'>
<_sre.SRE_Match object; span=(1, 2), match='l'>
<_sre.SRE_Match object; span=(2, 3), match='i'>
<_sre.SRE_Match object; span=(3, 4), match='c'>
<_sre.SRE_Match object; span=(4, 5), match='e'>
<_sre.SRE_Match object; span=(5, 6), match=' '>
<_sre.SRE_Match object; span=(6, 7), match='l'>
<_sre.SRE_Match object; span=(7, 8), match='i'>
<_sre.SRE_Match object; span=(8, 9), match='v'>
<_sre.SRE_Match object; span=(9, 10), match='e'>
<_sre.SRE_Match object; span=(10, 11), match='s'>
<_sre.SRE_Match object; span=(11, 12), match=' '>
<_sre.SRE_Match object; span=(12, 13), match='i'>
<_sre.SRE_Match object; span=(13, 14), match='n'>
<_sre.SRE_Match object; span=(14, 15), match=' '>
<_sre.SRE_Match object; span=(19, 20), match=' '>
<_sre.SRE_Match object; span=(20, 21), match='F'>
<_sre.SRE_Match object; span=(21, 22), match='i'>
<_sre.SRE_Match object; span=(22, 23), match='r'>
<_sre.SRE_Match object; span=(23, 24), match='s'>
...

We can see that none of the matches are decimal digits. We also see that `\D` matches every other character, including letters, periods (`.`), and spaces.

# TODO: Find IP Addresses

In the cell below, our `sample_text` string contains three IP addresses. Write a single regular expression that can match each of these IP addresses and save the regular expression object in a variable called `regex`. Then use the `.finditer()` method to search the `sample_text` string for the given regular expression. Finally, write a loop to print all the `matches` found by the `.finditer()` method.

**HINT :** Use the special sequence `\d` and take advantage of the fact that all three IP addresses have the same pattern.

In [None]:
# Import re module


# Sample text
sample_text = 'Here are three IP address: 123.456.789.123, 999.888.777.666, 111.222.333.444'

# Create a regular expression object with the regular expression
regex = 

# Search the sample_text for the regular expression
matches = 

# Print all the matches


If you wrote your regex correctly you should see three matches above corresponding to the three IP addresses in our `sample_text` string.

### Matching Whitespace Characters Using `\s`

In the code below, we will use `\s` as our regular expression to find all the whitespace characters in our `sample_text` string. For this example, we will use a string literal that spans multiple lines. To create this multi-line string, we will use triple-quotes (`'''`) both at the beginning and at the end of the multi-line string.

In [3]:
# Import re module
import re

# Sample text
sample_text = '''
\tAlice lives in:\f
1230 First St.\r
Ocean City, MD 156789.\v
'''

# Create a regular expression object with the regular expression '\s'
regex = re.compile(r'\s')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(0, 1), match='\n'>
<_sre.SRE_Match object; span=(1, 2), match='\t'>
<_sre.SRE_Match object; span=(7, 8), match=' '>
<_sre.SRE_Match object; span=(13, 14), match=' '>
<_sre.SRE_Match object; span=(17, 18), match='\x0c'>
<_sre.SRE_Match object; span=(18, 19), match='\n'>
<_sre.SRE_Match object; span=(23, 24), match=' '>
<_sre.SRE_Match object; span=(29, 30), match=' '>
<_sre.SRE_Match object; span=(33, 34), match='\r'>
<_sre.SRE_Match object; span=(34, 35), match='\n'>
<_sre.SRE_Match object; span=(40, 41), match=' '>
<_sre.SRE_Match object; span=(46, 47), match=' '>
<_sre.SRE_Match object; span=(49, 50), match=' '>
<_sre.SRE_Match object; span=(57, 58), match='\x0b'>
<_sre.SRE_Match object; span=(58, 59), match='\n'>


As we can see, all the matches found correspond to spaces, tabs (`\t`), newlines (`\n`), carriage returns (`\r`), form feeds (`\f`), and vertical tabs (`\v`). Notice that form feeds appear as `\x0c` and vertical tabs as `\x0b`.
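The `\x0c` and `\x0b` spellings are simply how Python displays these two characters. The short sketch below, which uses only built-in functions, shows that each pair of spellings refers to the same character.

```python
# '\f' and '\x0c' are two spellings of the form feed character,
# and '\v' and '\x0b' are two spellings of the vertical tab.
print('\f' == '\x0c')  # True
print('\v' == '\x0b')  # True

# repr() shows the hexadecimal spelling, which is what the match output displays
form_feed_repr = repr('\f')
vertical_tab_repr = repr('\v')
print(form_feed_repr, vertical_tab_repr)  # '\x0c' '\x0b'

# The underlying code points are 12 and 11
print(ord('\f'), ord('\v'))  # 12 11
```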

Conversely, if we want to find all the characters that are **not** whitespace characters, we can use `\S` as our regular expression, as shown below:

In [4]:
# Import re module
import re

# Sample text
sample_text = '''
\tAlice lives in:\f
1230 First St.\r
Ocean City, MD 156789.\v
'''

# Create a regular expression object with the regular expression '\S'
regex = re.compile(r'\S')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(2, 3), match='A'>
<_sre.SRE_Match object; span=(3, 4), match='l'>
<_sre.SRE_Match object; span=(4, 5), match='i'>
<_sre.SRE_Match object; span=(5, 6), match='c'>
<_sre.SRE_Match object; span=(6, 7), match='e'>
<_sre.SRE_Match object; span=(8, 9), match='l'>
<_sre.SRE_Match object; span=(9, 10), match='i'>
<_sre.SRE_Match object; span=(10, 11), match='v'>
<_sre.SRE_Match object; span=(11, 12), match='e'>
<_sre.SRE_Match object; span=(12, 13), match='s'>
<_sre.SRE_Match object; span=(14, 15), match='i'>
<_sre.SRE_Match object; span=(15, 16), match='n'>
<_sre.SRE_Match object; span=(16, 17), match=':'>
<_sre.SRE_Match object; span=(19, 20), match='1'>
<_sre.SRE_Match object; span=(20, 21), match='2'>
<_sre.SRE_Match object; span=(21, 22), match='3'>
<_sre.SRE_Match object; span=(22, 23), match='0'>
<_sre.SRE_Match object; span=(24, 25), match='F'>
<_sre.SRE_Match object; span=(25, 26), match='i'>
<_sre.SRE_Match object; span=(26, 27), match='r'>
...

We can see that none of the matches above are whitespace characters. We also see that `\S` matches every other character, including letters, digits, and punctuation such as periods (`.`).

# TODO: Print The Numbers Between Whitespace Characters

In the cell below, our `sample_text` consists of a multi-line string with numbers in between whitespace characters:

```
123	45	7895
1	222	33
```

Notice that not all the numbers have the same number of digits. For example, the first number (`123`) has three digits, but the second number (`45`) has only two digits.

Write a single regular expression that finds the tabs (`\t`) and the newlines (`\n`) in this multi-line string and save the regular expression object in a variable called `regex`. Then use the `.finditer()` method to search the `sample_text` string for the given regular expression. Then, write a loop that uses the span information from each `match` to only print the numbers found in the original multi-line string. Your code should work in the general case where the numbers can have any number of digits. For example, if the numbers in the string were to change your code should still be able to find them and print them. Finally, in this exercise you cannot use `\d` in your regular expression. 

**HINT :** Notice that there are no spaces in the multi-line string, only tabs and newlines. Use the `\s` sequence to find the tabs and newlines. Then notice that you can use the span's `end` and `start` indices from consecutive matches to figure out the number of digits of each number. Use these indices to print the numbers found in the original multi-line string. You can use the `match.span()` method we saw before to find the `start` and `end` indices of each `match`. Alternatively, you can use the `.start()` and `.end()` methods to extract the `start` and `end` indices of each match: `match.start()` is equivalent to `match.span()[0]`, and `match.end()` is equivalent to `match.span()[1]`.
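To make the relationship between `.span()`, `.start()`, and `.end()` concrete, here is a minimal sketch on an unrelated, made-up string (it is not a solution to the exercise). It uses `re.search()`, which returns only the first match, and then slices the string with the indices.

```python
import re

# Hypothetical example string, unrelated to the exercise data
text = 'red,green,blue'

# Find the first comma
match = re.search(r',', text)

start, end = match.span()

# .start() and .end() return the same indices as .span()
print(match.start() == start, match.end() == end)  # True True

# The indices can be used to slice the original string
print(text[start:end])  # ,
print(text[:start])     # red
```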

In [None]:
# Import re module


# Sample text
sample_text = '''
123\t45\t7895
1\t222\t33
'''

# Print sample_text
print('Sample Text:\n', sample_text)

# Create a regular expression object with the regular expression
regex = 

# Search the sample_text for the regular expression
matches = 


# Write a loop to print all the numbers found in the original string


### Matching Alphanumeric Characters Using `\w`

In the code below, we will use `\w` as our regular expression to find all the alphanumeric characters in our `sample_text` string. This includes the underscore ( `_` ), all the numbers from 0 through 9, and all the uppercase and lowercase letters:

In [5]:
# Import re module
import re

# Sample text
sample_text = '''
You can contact FAKE Company at:
fake_company12@email.com.
'''

# Create a regular expression object with the regular expression '\w'
regex = re.compile(r'\w')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(1, 2), match='Y'>
<_sre.SRE_Match object; span=(2, 3), match='o'>
<_sre.SRE_Match object; span=(3, 4), match='u'>
<_sre.SRE_Match object; span=(5, 6), match='c'>
<_sre.SRE_Match object; span=(6, 7), match='a'>
<_sre.SRE_Match object; span=(7, 8), match='n'>
<_sre.SRE_Match object; span=(9, 10), match='c'>
<_sre.SRE_Match object; span=(10, 11), match='o'>
<_sre.SRE_Match object; span=(11, 12), match='n'>
<_sre.SRE_Match object; span=(12, 13), match='t'>
<_sre.SRE_Match object; span=(13, 14), match='a'>
<_sre.SRE_Match object; span=(14, 15), match='c'>
<_sre.SRE_Match object; span=(15, 16), match='t'>
<_sre.SRE_Match object; span=(17, 18), match='F'>
<_sre.SRE_Match object; span=(18, 19), match='A'>
<_sre.SRE_Match object; span=(19, 20), match='K'>
<_sre.SRE_Match object; span=(20, 21), match='E'>
<_sre.SRE_Match object; span=(22, 23), match='C'>
<_sre.SRE_Match object; span=(23, 24), match='o'>
<_sre.SRE_Match object; span=(24, 25), match='m'>
...

As we can see, all the matches found correspond to alphanumeric characters only, including the underscore in the email address.
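One caveat worth keeping in mind: in Python 3, regular expressions applied to `str` objects are Unicode-aware by default, so `\w` can match more than the set `[a-zA-Z0-9_]`. The sketch below uses a made-up string containing an accented letter and the standard `re.ASCII` flag to show the difference.

```python
import re

# Hypothetical example string; '\u00e9' is the accented letter é
text = 'caf\u00e9_1'

# By default, \w also matches non-ASCII letters
unicode_words = re.findall(r'\w', text)
print(unicode_words)  # ['c', 'a', 'f', 'é', '_', '1']

# With re.ASCII, \w is restricted to [a-zA-Z0-9_]
ascii_words = re.findall(r'\w', text, re.ASCII)
print(ascii_words)  # ['c', 'a', 'f', '_', '1']
```

For plain ASCII text such as the `sample_text` used here, both variants should give the same matches.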

Conversely, if we want to find all the characters that are **not** alphanumeric characters, we can use `\W` as our regular expression, as shown below:

In [6]:
# Import re module
import re

# Sample text
sample_text = '''
You can contact FAKE Company at:
fake_company12@email.com.
'''

# Create a regular expression object with the regular expression '\W'
regex = re.compile(r'\W')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(0, 1), match='\n'>
<_sre.SRE_Match object; span=(4, 5), match=' '>
<_sre.SRE_Match object; span=(8, 9), match=' '>
<_sre.SRE_Match object; span=(16, 17), match=' '>
<_sre.SRE_Match object; span=(21, 22), match=' '>
<_sre.SRE_Match object; span=(29, 30), match=' '>
<_sre.SRE_Match object; span=(32, 33), match=':'>
<_sre.SRE_Match object; span=(33, 34), match='\n'>
<_sre.SRE_Match object; span=(48, 49), match='@'>
<_sre.SRE_Match object; span=(54, 55), match='.'>
<_sre.SRE_Match object; span=(58, 59), match='.'>
<_sre.SRE_Match object; span=(59, 60), match='\n'>


We can see that none of the matches are alphanumeric characters. We also see that `\W` matched all the whitespace characters (spaces and newlines) as well as the punctuation: the colon (`:`), the `@` symbol, and the periods (`.`).

# TODO: Find emails

In the cell below, our `sample_text` consists of a multi-line string that contains three email addresses:

```
j.s@email.com
a.w@email.com
m.j@email.com
```

Notice that all three email addresses have the same pattern: the first name initial, followed by a dot (`.`), followed by the last name initial, and ending in `@email.com`.

Take advantage of the fact that all three email addresses have the same pattern to write a single regular expression that can find all three email addresses in our `sample_text` string. As usual, save the regular expression object in a variable called `regex`. Then use the `.finditer()` method to search the `sample_text` string for the given regular expression. Finally, write a loop to print all the `matches` found by the `.finditer()` method.

In [None]:
# Import re module


# Sample text
sample_text = '''
John Sanders: j.s@email.com
Alice Walters: a.w@email.com
Mary Jones: m.j@email.com
'''

# Print sample_text
print('Sample Text:\n', sample_text)

# Create a regular expression object with the regular expression
regex = 

# Search the sample_text for the regular expression
matches = 

# Print all the matches


If you wrote your regex correctly you should see three matches above corresponding to the three email addresses found in our `sample_text` string.

# Solution

[Solution notebook](simple_patterns_solution.ipynb)