# Character Sets

### Finding Phone Numbers

In the code below, we will create a regular expression to match phone numbers within a multi-line string that mimics a phone book. By looking at the multi-line string, we notice that even though the phone numbers have different digits, they all share the same pattern: 3 digits, followed by a single character, followed by 3 more digits, followed by another single character, followed by 4 digits. We will take advantage of this to find all the phone numbers by combining the special sequence `\d` and the dot (`.`) metacharacter, as shown in the code below:

In [1]:
import re

sample_text = '''
Mr. Brown: 555-123-4567
Mrs. Smith: 455 555 4549
Mr. Jackson: 655-777-7346
Ms. Wilson: (555)999-8464
'''
regex = re.compile(r'\d\d\d.\d\d\d.\d\d\d\d')

matches = regex.finditer(sample_text)

for match in matches:
    print(match)

<_sre.SRE_Match object; span=(12, 24), match='555-123-4567'>
<_sre.SRE_Match object; span=(37, 49), match='455 555 4549'>
<_sre.SRE_Match object; span=(63, 75), match='655-777-7346'>
<_sre.SRE_Match object; span=(89, 101), match='555)999-8464'>


We can see that we managed to find all the phone numbers in our multi-line string, even though they have different digits and different characters between the groups of numbers. Notice that by using the dot we were able to match the dash (-), the whitespace, and the parenthesis separating the groups of numbers. By using the dot, we avoid having to create three different regular expressions to match the three possible characters separating the groups of numbers. Note also that the opening parenthesis in Ms. Wilson's number is not part of the match, because the pattern starts at the first digit.
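One caveat worth keeping in mind: the dot matches any character except a newline, so the same pattern also accepts separators we may not want. The short sketch below checks this with a few made-up strings (they are illustrative assumptions, not entries from the phone book):

```python
import re

# Same pattern as above: each dot accepts any single character except a newline.
regex = re.compile(r'\d\d\d.\d\d\d.\d\d\d\d')

# Hypothetical candidates: a dash-separated number, one separated by letters,
# and a run of 12 digits where the "separators" are themselves digits.
candidates = ['555-123-4567', '555x123y4567', '555512394567']

for text in candidates:
    # fullmatch succeeds only if the whole string fits the pattern
    print(text, bool(regex.fullmatch(text)))
```

All three candidates fit the pattern, which shows that the dot is convenient but not very selective.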

Now we can write the above regular expression in a more compact form by using the `{ }` metacharacters.
`{m}` specifies that exactly `m` copies of the previous regular expression should be matched. For example, `\d{3}` will match exactly three decimal digits. In the code below, we employ these metacharacters to write the previous code in a more compact form:

In [2]:
sample_text = '''
Mr. Brown: 555-123-4567
Mrs. Smith: 455 555 4549
Mr. Jackson: 655-777-7346
Ms. Wilson: (555)999-8464
'''
regex = re.compile(r'\d{3}.\d{3}.\d{4}')

matches = regex.finditer(sample_text)

for match in matches:
    print(match)

<_sre.SRE_Match object; span=(12, 24), match='555-123-4567'>
<_sre.SRE_Match object; span=(37, 49), match='455 555 4549'>
<_sre.SRE_Match object; span=(63, 75), match='655-777-7346'>
<_sre.SRE_Match object; span=(89, 101), match='555)999-8464'>


As we can see, we get the same result as before.
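Rather than comparing the two outputs by eye, the equivalence can also be checked in code. The sketch below uses `findall`, which returns the matched strings as a list; the variable names `long_form` and `short_form` are chosen here only for illustration:

```python
import re

sample_text = '''
Mr. Brown: 555-123-4567
Mrs. Smith: 455 555 4549
Mr. Jackson: 655-777-7346
Ms. Wilson: (555)999-8464
'''

# The repeated-\d form and the {m} form of the same pattern
long_form = re.compile(r'\d\d\d.\d\d\d.\d\d\d\d')
short_form = re.compile(r'\d{3}.\d{3}.\d{4}')

# With no groups in the pattern, findall returns the full matched strings
long_matches = long_form.findall(sample_text)
short_matches = short_form.findall(sample_text)

print(long_matches == short_matches)
```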

Now let's suppose I only wanted to find phone numbers in which the groups of digits are separated by either a dash (-) or a whitespace. In this case we can use what is known as a character set, which is a set of characters that you wish to match. Character sets are specified using the `[ ]` metacharacters. For example, the character set `[- ]` (notice that there is a whitespace after the dash) will only match a dash or a whitespace, as shown below:

In [3]:
sample_text = '''
Mr. Brown: 555-123-4567
Mrs. Smith: 455 555 4549
Mr. Jackson: 655-777-7346
Ms. Wilson: (555)999-8464
'''
regex = re.compile(r'\d{3}[- ]\d{3}[- ]\d{4}')

matches = regex.finditer(sample_text)

for match in matches:
    print(match)

<_sre.SRE_Match object; span=(12, 24), match='555-123-4567'>
<_sre.SRE_Match object; span=(37, 49), match='455 555 4549'>
<_sre.SRE_Match object; span=(63, 75), match='655-777-7346'>


We can clearly see that now we only match the phone numbers whose groups are separated by either a dash or a whitespace. The last phone number is not matched because, even though its last two groups of digits are separated by a dash, its first two groups are separated by a parenthesis, which is not in our character set.

It is important to note that even though a character set can contain many characters, it only matches one of those characters at a time. For example, suppose I added a whitespace after each dash in Mr. Brown's phone number, as shown below:

In [4]:
sample_text = '''
Mr. Brown: 555- 123- 4567
'''
regex = re.compile(r'\d{3}[- ]\d{3}[- ]\d{4}')

matches = regex.finditer(sample_text)

for match in matches:
    print(match)

We can see that we now get no matches with the same regular expression. This is because the character set `[- ]` matches only a single character, which must be either a dash or a whitespace.
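If we did want to accept a dash followed by a whitespace, one option is to let the character set repeat. The sketch below applies the `{m,n}` quantifier (described in the hint at the end of this section) to the set, so that each separator may be one or two characters long. This is only one possible approach, and it is more permissive than it may look: it also accepts separators such as `--` or two spaces.

```python
import re

sample_text = '''
Mr. Brown: 555- 123- 4567
'''

# [- ]{1,2} allows one or two characters, each a dash or a whitespace
regex = re.compile(r'\d{3}[- ]{1,2}\d{3}[- ]{1,2}\d{4}')

# Collect the matched text instead of printing the match objects
matches = [match.group() for match in regex.finditer(sample_text)]
print(matches)
```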

Let's see another example of a character set. Suppose I only wanted to find the phone numbers with area code `455` or `655`. In this case we can use the character set `[46]` to indicate that the first digit should be either a 4 or a 6, as shown in the code below:

In [5]:
sample_text = '''
Mr. Brown: 555-123-4567
Mrs. Smith: 455 555 4549
Mr. Jackson: 655-777-7346
Ms. Wilson: (555)999-8464
'''
regex = re.compile(r'[46]55[- ]\d{3}[- ]\d{4}')

matches = regex.finditer(sample_text)

for match in matches:
    print(match)

<_sre.SRE_Match object; span=(37, 49), match='455 555 4549'>
<_sre.SRE_Match object; span=(63, 75), match='655-777-7346'>


We can see that we only get the two phone numbers that have area code `455` and `655`.

Now let's suppose I wanted to look for phone numbers that end in 6, 7, 8, or 9. In this case we could use the character set `[6789]`; however, there is a more compact way of doing this. Within a character set, when a dash (-) is placed **between** digits or letters, it specifies a range. Therefore, the character set `[6-9]` is the same as `[6789]`. The same applies to letters: for example, the character set `[a-f]` is the same as `[abcdef]`. It is important to note that when a dash is placed at the beginning of a character set, as we did earlier with `[- ]`, the dash is taken literally.

In the code below, we use a character set to find all the phone numbers that end in 6, 7, 8, or 9:

In [6]:
sample_text = '''
Mr. Brown: 555-123-4567
Mrs. Smith: 455 555 4549
Mr. Jackson: 655-777-7346
Ms. Wilson: (555)999-8464
'''
regex = re.compile(r'\d{3}.\d{3}.\d{3}[6-9]')

matches = regex.finditer(sample_text)

for match in matches:
    print(match)

<_sre.SRE_Match object; span=(12, 24), match='555-123-4567'>
<_sre.SRE_Match object; span=(37, 49), match='455 555 4549'>
<_sre.SRE_Match object; span=(63, 75), match='655-777-7346'>


As we can see, we get all the phone numbers that end in 6, 7, 8, or 9 (here, the ones ending in 7, 9, and 6). The last phone number is not matched because its last digit is a 4.
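The two roles of the dash inside a character set can be checked in isolation. The sketch below runs a few sets over a short string made up for this purpose (`text` is not part of the phone book):

```python
import re

# Hypothetical test string containing letters, digits, and literal dashes
text = 'a-f 0-9 cab 678'

# Dash between two characters: a range, so [a-f] behaves like [abcdef]
range_letters = re.findall(r'[a-f]', text)
listed_letters = re.findall(r'[abcdef]', text)

# Dash at the beginning of the set: a literal dash, plus 'a' and 'f'
literal_dash = re.findall(r'[-af]', text)

# Digit range: [6-9] behaves like [6789]
range_digits = re.findall(r'[6-9]', text)

print(range_letters)
print(literal_dash)
print(range_digits)
```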

Now let's suppose I wanted to find the phone numbers that do not end in 6, 7, 8, or 9. In this case we could use the character set `[0-5]`; however, we could also use the regular expression `[^6-9]` (notice the caret (^) at the beginning). We already learned that outside of a character set the caret anchors the match to the beginning of a string. However, when the caret appears at the beginning of a character set, it negates the set. This means it matches any character that is not in that character set. For example, the regular expression `[^6-9]` will match any character that is not 6, 7, 8, or 9, including characters that are not digits at all. Similarly, the regular expression `[^a-zA-Z]` will match any character that is not a lowercase or uppercase letter.

In the code below, we find the phone numbers that do not end in 6, 7, 8, or 9.

In [7]:
sample_text = '''
Mr. Brown: 555-123-4567
Mrs. Smith: 455 555 4549
Mr. Jackson: 655-777-7346
Ms. Wilson: (555)999-8464
'''
regex = re.compile(r'\d{3}.\d{3}.\d{3}[^6-9]')

matches = regex.finditer(sample_text)

for match in matches:
    print(match)

<_sre.SRE_Match object; span=(89, 101), match='555)999-8464'>


As we can see, we only get one match, since there is only one phone number that doesn't end in 6, 7, 8, or 9.
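A negated set matches any character outside the set, not just other digits. The sketch below compares `[^6-9]` with the digit-only set `[0-5]` on a few single characters chosen for illustration:

```python
import re

negated = re.compile(r'[^6-9]')      # anything except 6, 7, 8, 9
digits_only = re.compile(r'[0-5]')   # only the digits 0 through 5

# Illustrative single characters: two low digits, a high digit,
# a letter, and a newline
for ch in ['4', '0', '7', 'x', '\n']:
    print(repr(ch), bool(negated.fullmatch(ch)), bool(digits_only.fullmatch(ch)))
```

The two sets agree on the digits but differ on `'x'` and the newline, which only the negated set accepts. For the phone book above this makes no difference, but on less regular input `[0-5]` may be the safer choice.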

# TODO: 

In the code below we have the same multi-line string as before, but we have added the country calling code to each phone number. Notice that the country codes can have anywhere from 1 to 3 digits. Write a regular expression that can match the `+` character, the country calling code (regardless of the number of digits), and the phone number. Then use the `.finditer()` method to find the regex in the `sample_text` string. Finally, write a loop to print the `matches`.

**HINT:** You can use the quantifier `{m,n}`. This quantifier means there must be at least `m` repetitions, and at most `n` repetitions, of the previous regular expression. For example, `a/{1,3}b` will match `a/b`, `a//b`, and `a///b`. It won’t match `ab`, which has no slashes, or `a////b`, which has four slashes.
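The example in the hint can be verified directly. The sketch below uses `fullmatch` so that each candidate string has to fit the pattern from start to end:

```python
import re

# One to three slashes between 'a' and 'b'
regex = re.compile(r'a/{1,3}b')

# Candidates with zero to four slashes
candidates = ['ab', 'a/b', 'a//b', 'a///b', 'a////b']

results = [bool(regex.fullmatch(text)) for text in candidates]

for text, ok in zip(candidates, results):
    print(text, ok)
```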

In [8]:
# import re module
import re

sample_text = '''
Mr. Brown: +1-555-123-4567
Mrs. Smith: +61 455 555 4549
Mr. Jackson: +375-655-777-7346
Ms. Wilson: +213 (555)999-8464
'''

# Write a regex that matches the phone numbers with country codes
regex = re.compile(r'\+\d{1,3}.\d{3}.\d{3}.\d{4}')

# Use the .finditer method to find the above regex
matches = regex.finditer(sample_text)

# Write a loop to print the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(12, 27), match='+1-555-123-4567'>
<_sre.SRE_Match object; span=(40, 56), match='+61 455 555 4549'>
<_sre.SRE_Match object; span=(70, 87), match='+375-655-777-7346'>
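Note that Ms. Wilson's number does not appear in the output: two characters, a space and an opening parenthesis, sit between her country code and area code, while the regular expression allows only one. One possible variant, offered as a sketch and not as the only solution, lets the first separator be one or two characters long:

```python
import re

sample_text = '''
Mr. Brown: +1-555-123-4567
Mrs. Smith: +61 455 555 4549
Mr. Jackson: +375-655-777-7346
Ms. Wilson: +213 (555)999-8464
'''

# .{1,2} after the country code allows either a single separator
# or a two-character one such as ' ('
regex = re.compile(r'\+\d{1,3}.{1,2}\d{3}.\d{3}.\d{4}')

# Collect the matched text of every phone number
matches = [match.group() for match in regex.finditer(sample_text)]

for number in matches:
    print(number)
```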
