# Finding Complicated Patterns

### Finding Names

In the code below, we have a multi-line string with the names of the 4 highest mountains in the world according to Wikipedia. 
Let's create a regular expression that finds the names of these mountains. The first thing to notice is that the word mountain has been abbreviated in two different ways, as `Mt.` and as `Mt` (without the period). Therefore, to find all the mountain names we need a regular expression that treats the period in the abbreviation as optional. We can do this by using the `?` metacharacter. The `?` will match 0 or 1 repetitions of the preceding regular expression. Because the period is itself a metacharacter, we escape it with a backslash. For example, the regular expression `'Mt\.?'` will match either `Mt` or `Mt.`, as shown below:

In [1]:
import re

sample_text = '''
Mt Everest: Height 8,848 m
Mt. K2: Height 8,611 m
Mt Kangchenjunga: Height 8,586 m
Mt. Lhotse: Height 8,516 m
'''
regex = re.compile(r'Mt\.?')

matches = regex.finditer(sample_text)

for match in matches:
    print(match)

<_sre.SRE_Match object; span=(1, 3), match='Mt'>
<_sre.SRE_Match object; span=(28, 31), match='Mt.'>
<_sre.SRE_Match object; span=(51, 53), match='Mt'>
<_sre.SRE_Match object; span=(84, 87), match='Mt.'>


We can clearly see that we get the two different abbreviations. Now let's continue building our regular expression so that it matches the full mountain names. The next character after the abbreviation is a whitespace, so the next sequence in our regex is `\s`. After that whitespace comes the name of the mountain. The first letter in all the names is uppercase, so we will use the character set `[A-Z]` to match any uppercase letter. Then comes the tricky part. The mountain names have different lengths but only contain alphanumeric characters. To match any alphanumeric character we will use the sequence `\w`, as we saw before (note that `\w` also matches the underscore), and to match names of any length we will use the `*` metacharacter. The `*` metacharacter matches 0 or more repetitions of the preceding regular expression (as many repetitions as possible). For example, `ab*` will match `a`, or `a` followed by any number of `b`s, such as `ab` or `abbbbb`. Let's put all this together to see how it works:

In [2]:
sample_text = '''
Mt Everest: Height 8,848 m
Mt. K2: Height 8,611 m
Mt Kangchenjunga: Height 8,586 m
Mt. Lhotse: Height 8,516 m
'''
regex = re.compile(r'Mt\.?\s[A-Z]\w*')

matches = regex.finditer(sample_text)

for match in matches:
    print(match)

<_sre.SRE_Match object; span=(1, 11), match='Mt Everest'>
<_sre.SRE_Match object; span=(28, 34), match='Mt. K2'>
<_sre.SRE_Match object; span=(51, 67), match='Mt Kangchenjunga'>
<_sre.SRE_Match object; span=(84, 94), match='Mt. Lhotse'>


We can see that we managed to match all the mountain names regardless of their length or abbreviation. 
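
To see what `?` and `*` do on their own, it can help to test them against a few short strings. The sketch below is only an illustration: the variable names and test strings are made up for this example, and it uses `fullmatch`, which succeeds only when the entire string matches the pattern.

```python
import re

# ? makes the preceding item optional (0 or 1 repetitions).
optional_period = re.compile(r'Mt\.?')

# * allows 0 or more repetitions of the preceding item.
a_then_bs = re.compile(r'ab*')

# fullmatch returns a match object only if the whole string fits the pattern.
period_results = {s: bool(optional_period.fullmatch(s)) for s in ['Mt', 'Mt.', 'Mt..']}
star_results = {s: bool(a_then_bs.fullmatch(s)) for s in ['a', 'ab', 'abbbbb', 'b']}

print(period_results)  # {'Mt': True, 'Mt.': True, 'Mt..': False}
print(star_results)    # {'a': True, 'ab': True, 'abbbbb': True, 'b': False}
```

`Mt..` fails because `?` allows at most one period, and `b` fails because `ab*` still requires the leading `a`.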

# Groups

In the code below we have added a new mountain to our list, but the name of this mountain differs from the others in two ways. First, mountain has been abbreviated as `Mnt` instead of `Mt`, and second, the first letter of the name is lowercase, not uppercase.

To match the new abbreviation as well as the others, we will use the parentheses metacharacters, `( )`, to define a **group**. As their name suggests, groups group together the expressions contained inside them, and you can repeat the contents of a group with a repeating qualifier, such as `*`, `?`, or `{m}`, which we have seen before. For example, `(ab)*` will match zero or more repetitions of `ab`. We will also use the OR metacharacter, `|`, within the group to select either `Mnt` or `Mt`. For example, `(Mt|Mnt)` will match either `Mt` or `Mnt`.

To match a name whose first letter is lowercase, we will add the lowercase letters to our previous character set; that is, we will now use `[a-zA-Z]`. Let's put all this together to see how it works:

In [3]:
sample_text = '''
Mt Everest: Height 8,848 m
Mt. K2: Height 8,611 m
Mt Kangchenjunga: Height 8,586 m
Mt. Lhotse: Height 8,516 m
Mnt makalu: Height 8,485 m
'''
regex = re.compile(r'(Mt|Mnt)\.?\s[a-zA-Z]\w*')

matches = regex.finditer(sample_text)

for match in matches:
    print(match)

<_sre.SRE_Match object; span=(1, 11), match='Mt Everest'>
<_sre.SRE_Match object; span=(28, 34), match='Mt. K2'>
<_sre.SRE_Match object; span=(51, 67), match='Mt Kangchenjunga'>
<_sre.SRE_Match object; span=(84, 94), match='Mt. Lhotse'>
<_sre.SRE_Match object; span=(111, 121), match='Mnt makalu'>


Alternatively, since the first letter in both abbreviations is the same, `M`, we could have put the `M` outside the group, as shown below:

In [4]:
sample_text = '''
Mt Everest: Height 8,848 m
Mt. K2: Height 8,611 m
Mt Kangchenjunga: Height 8,586 m
Mt. Lhotse: Height 8,516 m
Mnt makalu: Height 8,485 m
'''
regex = re.compile(r'M(t|nt)\.?\s[a-zA-Z]\w*')

matches = regex.finditer(sample_text)

for match in matches:
    print(match)

<_sre.SRE_Match object; span=(1, 11), match='Mt Everest'>
<_sre.SRE_Match object; span=(28, 34), match='Mt. K2'>
<_sre.SRE_Match object; span=(51, 67), match='Mt Kangchenjunga'>
<_sre.SRE_Match object; span=(84, 94), match='Mt. Lhotse'>
<_sre.SRE_Match object; span=(111, 121), match='Mnt makalu'>


Notice that we get the same result as before. 
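
Although the overall matches are identical, the two patterns capture different text inside their groups, which matters as soon as you read `match.group(1)` or call `findall`. The following sketch illustrates this on a shortened sample string; the variable names and the non-capturing variant `(?:...)` are additions for this example and are not part of the exercise above.

```python
import re

text = 'Mt Everest, Mnt makalu'

outer = re.compile(r'(Mt|Mnt)\.?\s[a-zA-Z]\w*')           # whole abbreviation in the group
inner = re.compile(r'M(t|nt)\.?\s[a-zA-Z]\w*')            # only the part after M in the group
non_capturing = re.compile(r'M(?:t|nt)\.?\s[a-zA-Z]\w*')  # groups without capturing

# group(0) is the full match; group(1) is the text captured by the first group.
full_outer = [m.group(0) for m in outer.finditer(text)]
full_inner = [m.group(0) for m in inner.finditer(text)]
captured_outer = [m.group(1) for m in outer.finditer(text)]
captured_inner = [m.group(1) for m in inner.finditer(text)]

print(full_outer == full_inner)     # True
print(captured_outer)               # ['Mt', 'Mnt']
print(captured_inner)               # ['t', 'nt']

# findall returns the captured group when the pattern has exactly one group,
# and the full match when the pattern has no capturing groups.
print(outer.findall(text))          # ['Mt', 'Mnt']
print(non_capturing.findall(text))  # ['Mt Everest', 'Mnt makalu']
```

If the group is needed only for alternation or repetition, the non-capturing form avoids this surprise with `findall`.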

# TODO: Finding Email Addresses Revisited

In the code below, we have a multi-line string with four email addresses. As we can see, all the email addresses look different. Write a regular expression that is able to find all these email addresses. Then use the `.finditer()` method to find the regex in the `sample_text` string. Finally, write a loop to print the `matches`.

**HINT:** First, notice that the part before the `@` symbol only contains lowercase letters, underscores, and numbers. To match this part of the email address we can use the character set `[a-z_0-9]` followed by the `+` metacharacter, because every email address must have at least one character before the `@` symbol. The `+` metacharacter matches 1 or more repetitions of the preceding regular expression. For example, `ab+` will match `a` followed by any non-zero number of `b`s, but it will not match just `a`.

The `@` symbol is not a metacharacter, so we can match it directly without escaping it. Next, notice that the domain names contain lowercase letters, uppercase letters, underscores, and dashes. Again we can use a character set, `[a-zA-Z_-]`, followed by the `+` metacharacter, because every domain must have at least one character after the `@` symbol. The dash is placed last in the set so that it is treated as a literal dash rather than as part of a range. To match the first dot (`.`), we need to escape it with a backslash (`\.`) because the dot is a metacharacter. We can then use the character set `[a-z]+` to match `com`, `edu`, or `gov`.

To match the last email address we need to add an optional dot followed by another character set of only lowercase letters.
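
Before assembling the full regex, the difference between `+` and `*` can be checked in isolation. This is a minimal sketch with made-up test strings and variable names; it again relies on `fullmatch` so that only whole-string matches count.

```python
import re

one_or_more = re.compile(r'ab+')   # a followed by at least one b
zero_or_more = re.compile(r'ab*')  # a followed by any number of b's, including none

candidates = ['a', 'ab', 'abbb']
plus_results = [bool(one_or_more.fullmatch(s)) for s in candidates]
star_results = [bool(zero_or_more.fullmatch(s)) for s in candidates]

print(plus_results)  # [False, True, True]
print(star_results)  # [True, True, True]

# The same idea applied to the part of an email address before the @ symbol:
local_part = re.compile(r'[a-z_0-9]+')
has_local = bool(local_part.fullmatch('fake891_email'))
empty_local = bool(local_part.fullmatch(''))

print(has_local)     # True
print(empty_local)   # False
```

Using `+` instead of `*` is what prevents an empty string from being accepted as the part before the `@`.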

In [5]:
# import re module
import re

sample_text = '''
fake_email@fake-email.edu
fakeemail43@fake_email.com
fake891_email@fakemail.gov
52fake_email@FAKE_email.com.nl
'''

# Write a regex that matches the email addresses
regex = re.compile(r'[a-z_0-9]+@[a-zA-Z_-]+\.[a-z]+\.?[a-z]+')

# Use the .finditer method to find the above regex
matches = regex.finditer(sample_text)

# Write a loop to print the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(1, 26), match='fake_email@fake-email.edu'>
<_sre.SRE_Match object; span=(27, 53), match='fakeemail43@fake_email.com'>
<_sre.SRE_Match object; span=(54, 80), match='fake891_email@fakemail.gov'>
<_sre.SRE_Match object; span=(81, 111), match='52fake_email@FAKE_email.com.nl'>
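
As a possible variation on the solution above, the optional second suffix (such as `.nl`) can be written as an optional group, so that the dot and the letters after it are optional together. This is one sketch among several reasonable ones, not the only valid answer; the non-capturing group `(?:...)` and the names `grouped` and `emails` are choices made for this example.

```python
import re

sample_text = '''
fake_email@fake-email.edu
fakeemail43@fake_email.com
fake891_email@fakemail.gov
52fake_email@FAKE_email.com.nl
'''

# (?:\.[a-z]+)? treats "a dot followed by lowercase letters" as one optional unit.
grouped = re.compile(r'[a-z_0-9]+@[a-zA-Z_-]+\.[a-z]+(?:\.[a-z]+)?')

emails = [match.group(0) for match in grouped.finditer(sample_text)]

for email in emails:
    print(email)
```

On this sample text the grouped pattern finds the same four addresses. The two patterns differ on edge cases: `\.[a-z]+\.?[a-z]+` needs at least two letters after the first dot, while the grouped version accepts a single letter there.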
