# Finding Words Using Regexes

We’ll begin by learning how to find letters and words in a string using regular expressions. In this notebook, we will use the `re` module from Python's standard library to work with regular expressions. The `re` module not only contains functions that allow us to check if a given regular expression matches a particular string, but also contains functions that allow us to modify strings in various ways.

In the code below, we will use a regular expression to find all the locations of the letter `a` in a short piece of text. To do this, we will use the `compile()` function from the `re` module. The `re.compile(pattern)` function converts a regular expression `pattern` into a regular expression object. This allows us to save our regular expressions into objects that can later be used to perform pattern matching using various methods, such as `.match()`, `.search()`, `.findall()`, and `.finditer()`.

In this example, our regular expression pattern is `'a'`, and we pass it to the `re.compile()` function as a raw string. We then use the `.finditer()` method to search `sample_text` for the regular expression stored in the `regex` object. The `.finditer()` method returns an iterator that yields a match object for each non-overlapping match of the pattern in the string. The string is scanned from left to right, and matches are returned in the order found. Because `.finditer()` returns an iterator, we can loop through it to print all the matches.

In [1]:
import re

sample_text = 'Alice and Walter are walking to the store.'

regex = re.compile(r'a')

matches = regex.finditer(sample_text)

for match in matches:
    print(match)

<_sre.SRE_Match object; span=(6, 7), match='a'>
<_sre.SRE_Match object; span=(11, 12), match='a'>
<_sre.SRE_Match object; span=(17, 18), match='a'>
<_sre.SRE_Match object; span=(22, 23), match='a'>


We can see that each match is represented by a match object with a `span` and the corresponding `match` text. The `span=(start, end)` tuple gives the `start` and `end` indices of the `match` in the string `sample_text`, where `end` is exclusive, just as in a Python slice. For example, the first match has the span `(6, 7)`, so the match occupies index 6 only. Therefore, if we print the slice `sample_text[6:7]`, we should get the letter `a`:

In [7]:
print(sample_text[6:7])

a
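
As a small sketch of the same idea, match objects also expose the span through the `.start()`, `.end()`, and `.group()` methods of the standard `re` module. The variable name `found` is only an illustrative choice, not part of the original notebook.

```python
import re

sample_text = 'Alice and Walter are walking to the store.'
regex = re.compile(r'a')

# Collect (start, end, matched text) for every match of the pattern
found = [(m.start(), m.end(), m.group()) for m in regex.finditer(sample_text)]
print(found)

# Slicing with start and end recovers the matched text (end is exclusive)
start, end, text = found[0]
print(sample_text[start:end])  # prints: a
```

Here `m.start()` and `m.end()` are the two halves of `m.span()`, and `m.group()` returns the matched text directly, so slicing the original string is optional.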


Notice, however, that even though the first letter in our `sample_text` is an `A`, the `.finditer()` method didn't return it as a match. This is because regular expressions are case sensitive by default. Therefore, in order to match this uppercase `A`, we need to use:

In [8]:
sample_text = 'Alice and Walter are walking to the store.'

regex = re.compile(r'A')

matches = regex.finditer(sample_text)

for match in matches:
    print(match)

<_sre.SRE_Match object; span=(0, 1), match='A'>


Notice that now, the `.finditer()` method only returned one match, since there is only one upper case `A` in our `sample_text`. Also, notice that the `span=(0,1)` tuple tells us that the upper case `A` is the first letter in the `sample_text` string. 

We should note that the `re` module allows us to perform **case-insensitive** searches by means of **flags**. For example, we might want to search our string for the letter `a` regardless of whether it is upper or lower case. We will learn about flags in a later lesson.
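
As a brief preview only, here is a minimal sketch that assumes the standard `re.IGNORECASE` flag is passed as the second argument to `re.compile()`. The variable name `spans` is an illustrative choice.

```python
import re

sample_text = 'Alice and Walter are walking to the store.'

# re.IGNORECASE asks the pattern to match 'a' and 'A' alike
regex = re.compile(r'a', re.IGNORECASE)

spans = [m.span() for m in regex.finditer(sample_text)]
print(spans)
```

With the flag set, the uppercase `A` at the start of the string is reported together with the four lowercase matches found earlier.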

Besides searching for a single letter, we can also search for groups of letters. This is done in exactly the same manner as with single letters. Let's search for the word `walking` in our `sample_text` string:

In [12]:
sample_text = 'Alice and Walter are walking to the store.'

regex = re.compile(r'walking')

matches = regex.finditer(sample_text)

for match in matches:
    print(match)

    print('\nMatch from the original text:', sample_text[match.span()[0]:match.span()[1]])

<_sre.SRE_Match object; span=(21, 28), match='walking'>

Match from the original text: walking
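
For comparison, the other compiled-pattern methods named earlier (`.search()`, `.match()`, and `.findall()`) can be tried on the same pattern. This is only a rough sketch of how they differ from `.finditer()`; the variable names `first`, `at_start`, and `words` are illustrative.

```python
import re

sample_text = 'Alice and Walter are walking to the store.'
regex = re.compile(r'walking')

# .search() scans the whole string and returns the first match (or None)
first = regex.search(sample_text)
print(first.span())

# .match() only succeeds if the pattern matches at the start of the string
at_start = regex.match(sample_text)
print(at_start)

# .findall() returns the matched substrings as a list of strings
words = regex.findall(sample_text)
print(words)
```

Because `sample_text` starts with `Alice`, `.match()` returns `None` here, while `.search()` and `.findall()` both locate `walking` later in the string.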


When searching for groups of letters, the order of the letters matters. For example, if we were to search for `ginwakl`, we wouldn't find any matches, even though the word `walking` contains exactly the same letters:

In [39]:
sample_text = 'Alice and Walter are walking to the store.'

regex = re.compile(r'ginwakl')

matches = regex.finditer(sample_text)

for match in matches:
    print(match)

The loop prints nothing, so there are no matches. This is because the `.finditer()` method looks for those letters in that particular order in our `sample_text` string.
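
Since an empty iterator produces no output at all, it can help to make the "no matches" case explicit. One possible sketch is to collect the iterator into a list first. The message text is just an illustration.

```python
import re

sample_text = 'Alice and Walter are walking to the store.'
regex = re.compile(r'ginwakl')

# Turn the iterator into a list so we can test whether it is empty
matches = list(regex.finditer(sample_text))

if not matches:
    print('No matches found')
else:
    for match in matches:
        print(match)
```

Converting to a list also lets us reuse the results, whereas the iterator returned by `.finditer()` can only be looped over once.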

# TODO: Find Words

In the code below, we have a string that contains the name Walter Brown written in a mixture of upper and lower case letters. Write a regular expression that matches the name `WaLtEr BroWN` in the string. Then use the `.finditer()` method to find the regex in the `sample_text` string. Next, write a loop to print the `matches`. Finally, using the span information from the `match`, print the match from the original string.

In [10]:
# import re module
import re

sample_text = 'Alice and WaLtEr BroWN are talking with wAlTer Jackson.'

# Write a regex that matches WaLtEr BroWN
regex = re.compile(r'WaLtEr BroWN')

# Use the .finditer method to find the above regex
matches = regex.finditer(sample_text)

# Write a loop to print the match
for match in matches:
    print(match)

    # Using the span information from the match, print the match from the original string
    print('\nMatch from the original text:', sample_text[match.span()[0]:match.span()[1]])

<_sre.SRE_Match object; span=(10, 22), match='WaLtEr BroWN'>

Match from the original text: WaLtEr BroWN



# Matching a Period (`.`)

Now, let's use a regular expression to find the period (`.`) at the end of the `sample_text` string. Let's search for the period in the same manner as we did for the letters:

In [40]:
sample_text = 'Alice and Walter are walking to the store.'

regex = re.compile(r'.')

matches = regex.finditer(sample_text)

for match in matches:
    print(match)

<_sre.SRE_Match object; span=(0, 1), match='A'>
<_sre.SRE_Match object; span=(1, 2), match='l'>
<_sre.SRE_Match object; span=(2, 3), match='i'>
<_sre.SRE_Match object; span=(3, 4), match='c'>
<_sre.SRE_Match object; span=(4, 5), match='e'>
<_sre.SRE_Match object; span=(5, 6), match=' '>
<_sre.SRE_Match object; span=(6, 7), match='a'>
<_sre.SRE_Match object; span=(7, 8), match='n'>
<_sre.SRE_Match object; span=(8, 9), match='d'>
<_sre.SRE_Match object; span=(9, 10), match=' '>
<_sre.SRE_Match object; span=(10, 11), match='W'>
<_sre.SRE_Match object; span=(11, 12), match='a'>
<_sre.SRE_Match object; span=(12, 13), match='l'>
<_sre.SRE_Match object; span=(13, 14), match='t'>
<_sre.SRE_Match object; span=(14, 15), match='e'>
<_sre.SRE_Match object; span=(15, 16), match='r'>
<_sre.SRE_Match object; span=(16, 17), match=' '>
<_sre.SRE_Match object; span=(17, 18), match='a'>
<_sre.SRE_Match object; span=(18, 19), match='r'>
<_sre.SRE_Match object; span=(19, 20), match='e'>
<_sre.SRE_Match obj

We can clearly see that something has gone wrong: the `.finditer()` method has matched every single character in the `sample_text` string, including whitespace, upper and lower case letters, and the period at the end.

This is because, in regular expressions, the `.` is a special character known as a **metacharacter**. Metacharacters are used to give special instructions, and we will learn about them in the next lesson.
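
As a short preview of what that lesson covers, one common way to match a literal period is to escape it with a backslash. The sketch below assumes that approach; the variable name `spans` is illustrative.

```python
import re

sample_text = 'Alice and Walter are walking to the store.'

# A backslash before the period tells the regex engine to treat it literally
regex = re.compile(r'\.')

spans = [m.span() for m in regex.finditer(sample_text)]
print(spans)

# re.escape() builds the escaped form of a string for us
print(re.escape('.'))
```

This time only the final character of `sample_text` is matched, because `\.` stands for the period itself rather than "any character".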