## Credit
- Udacity
- https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285
- https://regex101.com

In [1]:
import re

In [5]:
sample_text = r'. ^ $ * + ? { } [ ] \ | ( )'

In [11]:
regex = re.compile(r'\. \^ \$ \* \+ \? \{ \} \[ \] \\ \| \( \)')

In [12]:
matches = re.finditer(regex, sample_text)

In [13]:
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(0, 27), match='. ^ $ * + ? { } [ ] \\ | ( )'>


In [16]:
sample_text = 'John bought a winter coat for $25.99 dollars.'
regex = re.compile(r'\$25\.99')

In [17]:
re.findall(regex, sample_text)

['$25.99']
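
Escaping every metacharacter by hand gets tedious for longer literals. As a minimal sketch, the standard library's `re.escape()` can add the backslashes for us; the variable names `literal`, `text`, `pattern`, and `found` are only illustrative:

```python
import re

# Illustrative literal that contains the metacharacters '$' and '.'
literal = '$25.99'
text = 'John bought a winter coat for $25.99 dollars.'

# re.escape() returns the literal with its metacharacters backslash-escaped
pattern = re.escape(literal)
found = re.findall(pattern, text)

print(pattern)
print(found)
```

This is handy when the text to search for comes from user input rather than from a pattern we wrote ourselves.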

# Searching For Simple Patterns
#### Thanks to Udacity

Being able to match letters and metacharacters is the simplest task that regular expressions can do. In this section we will see how we can use regular expressions to perform more complex pattern matching. We can form any pattern we want by using the metacharacters mentioned in the previous lesson.

The first metacharacter we are going to look at is the backslash (`\`). We already saw that the backslash can be used to escape all the metacharacters, so that you can search for them directly. However, the backslash can also be followed by various characters to signal various special sequences. Here is a list of the special sequences we are going to look at in this notebook:

* `\d` - Matches any decimal digit; this is equivalent to the set [0-9]


* `\D` - Matches any non-digit character; this is equivalent to the set [^0-9]


* `\s` - Matches any whitespace character; this is equivalent to the set [ \t\n\r\f\v]


* `\S` - Matches any non-whitespace character; this is equivalent to the set [^ \t\n\r\f\v]


* `\w` - Matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_]


* `\W` - Matches any character that is not alphanumeric or the underscore; this is equivalent to the set [^a-zA-Z0-9_]

We can see that there is a difference between lowercase and uppercase sequences. For example, while `\d` matches any digit, `\D` matches everything that is **not** a digit. Similarly, while `\s` matches any whitespace character, `\S` matches everything that is **not** a whitespace character; and while `\w` matches any alphanumeric character or the underscore, `\W` matches everything that is **not** one of those.
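
The lowercase/uppercase pairing can be checked directly. The sketch below uses a made-up string, `text`, and counts what each sequence matches; every character is matched by exactly one member of each pair. Note that in Python 3 these sequences are Unicode-aware for `str` patterns, so the bracketed sets listed above are exact only when the `re.ASCII` flag is used.

```python
import re

# Made-up sample string for illustration
text = 'Room_7 costs $42!'

digits = re.findall(r'\d', text)          # ['7', '4', '2']
non_digits = re.findall(r'\D', text)      # everything else
word_chars = re.findall(r'\w', text)      # letters, digits and '_'
non_word_chars = re.findall(r'\W', text)  # spaces and punctuation

print(digits)
print(non_word_chars)
```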

Let's start by learning how to use `\d` to search for decimal digits.

In [20]:
sample_text = 'Alice lives in 1230 First St., Ocean City, MD 156789.'
regex = re.compile(r'\d')
matches = regex.finditer(sample_text)
for match in matches:
    start, end = match.span()
    print(sample_text[start:end])

1
2
3
0
1
5
6
7
8
9


In [24]:
sample_text = 'Here are three IP \
    address: 123.456.789.123, 999.888.777.666, 111.222.333.444'
regex = re.compile(r'\d{3}\.\d{3}\.\d{3}\.\d{3}')
matches = regex.finditer(sample_text)
for match in matches:
    start, end = match.span()
    print(sample_text[start:end])

123.456.789.123
999.888.777.666
111.222.333.444
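
The pattern `\d{3}` requires exactly three digits per group, and it happily accepts values such as `999` that cannot appear in a real IPv4 address. A hedged sketch of one way to loosen the shape and then check the range: the `{1,3}` quantifier allows one to three digits, and the numeric check is done in plain Python. The sample string and the names `candidates` and `valid` are assumptions for illustration.

```python
import re

# Illustrative text; the last entry has the right shape but invalid values
text = 'Hosts: 10.0.0.1, 192.168.1.20, 999.888.777.666'

# {1,3} means "between one and three repetitions" of \d
regex = re.compile(r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b')
candidates = regex.findall(text)

# The regex only checks the shape; the 0-255 range is checked separately
valid = [ip for ip in candidates
         if all(0 <= int(part) <= 255 for part in ip.split('.'))]

print(candidates)
print(valid)
```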


In [32]:
# Sample text
sample_text = '''
\tAlice lives in:\f
1230 First St.\r
Ocean City, MD 156789.\v
'''

# Create a regular expression object with the regular expression '\s'
regex = re.compile(r'\s')

matches = regex.finditer(sample_text)
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(0, 1), match='\n'>
<_sre.SRE_Match object; span=(1, 2), match='\t'>
<_sre.SRE_Match object; span=(7, 8), match=' '>
<_sre.SRE_Match object; span=(13, 14), match=' '>
<_sre.SRE_Match object; span=(17, 18), match='\x0c'>
<_sre.SRE_Match object; span=(18, 19), match='\n'>
<_sre.SRE_Match object; span=(23, 24), match=' '>
<_sre.SRE_Match object; span=(29, 30), match=' '>
<_sre.SRE_Match object; span=(33, 34), match='\r'>
<_sre.SRE_Match object; span=(34, 35), match='\n'>
<_sre.SRE_Match object; span=(40, 41), match=' '>
<_sre.SRE_Match object; span=(46, 47), match=' '>
<_sre.SRE_Match object; span=(49, 50), match=' '>
<_sre.SRE_Match object; span=(57, 58), match='\x0b'>
<_sre.SRE_Match object; span=(58, 59), match='\n'>


In [39]:
sample_text = '''
123\t45\t7895
1\t222\t33
'''
print('Sample Text:\n', sample_text)
regex = re.compile(r'\s')
matches = regex.finditer(sample_text)
counter = 0
for match in matches:    
    if counter != 0:
        start_idx = match.start()        
        print('\nNumbers from the original text:', sample_text[end_idx:start_idx])        
    end_idx = match.end()
    counter += 1

Sample Text:
 
123	45	7895
1	222	33


Numbers from the original text: 123

Numbers from the original text: 45

Numbers from the original text: 7895

Numbers from the original text: 1

Numbers from the original text: 222

Numbers from the original text: 33
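
The loop above recovers the numbers by slicing between consecutive whitespace matches. As an alternative sketch, `re.split()` does the same job in one call; `\s+` treats a run of whitespace as a single separator, and `strip()` is used here so that the leading and trailing newlines don't produce empty strings.

```python
import re

sample_text = '''
123\t45\t7895
1\t222\t33
'''

# Split on runs of whitespace after trimming the outer newlines
numbers = re.split(r'\s+', sample_text.strip())
print(numbers)
```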


In [35]:
list(matches)

[]
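
The empty list above appears because `.finditer()` returns an iterator, and an iterator can only be consumed once: after the `for` loop has run, nothing is left in `matches`. A small sketch with a made-up string shows the effect, and how storing the matches in a list makes them reusable:

```python
import re

regex = re.compile(r'\d')
text = 'a1b2c3'  # made-up sample

matches = regex.finditer(text)
first_pass = [m.group() for m in matches]
second_pass = [m.group() for m in matches]  # iterator is already exhausted

# Converting to a list up front keeps the match objects around
stored = list(regex.finditer(text))

print(first_pass)
print(second_pass)
print(len(stored))
```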

In the cell below, our `sample_text` consists of a multi-line string that contains three email addresses:

```
j.s@email.com
a.w@email.com
m.j@email.com
```

Notice that all three email addresses have the same pattern: the first-name initial, followed by a dot (`.`), followed by the last-name initial, and ending in `@email.com`.

Take advantage of the fact that all three email addresses have the same pattern to write a single regular expression that can find all three email addresses in our `sample_text` string. As usual, save the regular expression object in a variable called `regex`. Then use the `.finditer()` method to search the `sample_text` string for the given regular expression. Finally, write a loop to print all the `matches` found by the `.finditer()` method.

In [40]:
sample_text = '''
John Sanders: j.s@email.com
Alice Walters: a.w@email.com
Mary Jones: m.j@email.com
'''
print('Sample Text:\n', sample_text)
regex = re.compile(r'\w\.\w@email\.com')
matches = regex.finditer(sample_text)
for match in matches:
    print(match)

Sample Text:
 
John Sanders: j.s@email.com
Alice Walters: a.w@email.com
Mary Jones: m.j@email.com

<_sre.SRE_Match object; span=(15, 28), match='j.s@email.com'>
<_sre.SRE_Match object; span=(44, 57), match='a.w@email.com'>
<_sre.SRE_Match object; span=(70, 83), match='m.j@email.com'>


# Word Boundaries

We will now learn about another special sequence that you can create using the backslash:

* `\b`

This special sequence doesn't match any characters; instead, it matches the empty position at a word boundary. A word in this context is a sequence of word characters (the alphanumeric characters and the underscore matched by `\w`), and a boundary is the position between a word character and a non-word character such as whitespace or punctuation, or between a word character and the beginning or end of the string. We can have boundaries either before or after a word. Let's see how this works with an example.

In the code below, our `sample_text` string contains the following sentence:

```
The biology class will meet in the first floor classroom to learn about Theria, a subclass of mammals.
```

As we can see, the word `class` appears in three different positions:

1. As a stand-alone word: The word `class` has whitespace both before and after it.


2. At the beginning of a word: The word `class` in `classroom` has whitespace before it.


3. At the end of a word: The word `class` in `subclass` has whitespace after it.

If we used `class` as our regular expression, we would match the word `class` in all three positions. In the code below we use `\bclass` instead, which requires a word boundary before `class`, so only the stand-alone word and the start of `classroom` are matched:

In [42]:
sample_text = 'The biology class will meet in the \
    first floor classroom to learn about Theria, a subclass of mammals.'
regex = re.compile(r'\bclass')
matches = regex.finditer(sample_text)
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(12, 17), match='class'>
<_sre.SRE_Match object; span=(51, 56), match='class'>
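
To compare where the boundary can be placed, the sketch below counts the matches for four variants of the pattern on the same sentence. The dictionary name `counts` is just an illustrative choice.

```python
import re

sample_text = ('The biology class will meet in the first floor classroom '
               'to learn about Theria, a subclass of mammals.')

patterns = [r'class', r'\bclass', r'class\b', r'\bclass\b']

# Number of matches for each pattern
counts = {p: len(re.findall(p, sample_text)) for p in patterns}

for pattern, count in counts.items():
    print(pattern, count)
```

Without a boundary all three occurrences match; a boundary on one side excludes either `subclass` or `classroom`; boundaries on both sides leave only the stand-alone word.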


In [45]:
sample_text = 'John went to the store in his car, but forgot to buy bread.'
regex = re.compile(r'\b\w{3}\b')
matches = regex.finditer(sample_text)
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(13, 16), match='the'>
<_sre.SRE_Match object; span=(26, 29), match='his'>
<_sre.SRE_Match object; span=(30, 33), match='car'>
<_sre.SRE_Match object; span=(35, 38), match='but'>
<_sre.SRE_Match object; span=(49, 52), match='buy'>


In [46]:
sample_text = '''
Mr. Brown: 555-123-4567
Mrs. Smith: 455 555 4549
Mr. Jackson: 655-777-7346
Ms. Wilson: (555)999-8464
'''

regex = re.compile(r'\d{3}.\d{3}.\d{4}')
matches = regex.finditer(sample_text)
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(12, 24), match='555-123-4567'>
<_sre.SRE_Match object; span=(37, 49), match='455 555 4549'>
<_sre.SRE_Match object; span=(63, 75), match='655-777-7346'>
<_sre.SRE_Match object; span=(89, 101), match='555)999-8464'>
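
In the pattern above, the unescaped `.` matches any character except a newline, which is why `555)999-8464` is accepted with `)` as a separator. As a hedged sketch, a character set such as `[- ]` restricts the separator to a hyphen or a space; whether that is the right set depends on the phone formats you expect.

```python
import re

sample_text = '''
Mr. Brown: 555-123-4567
Mrs. Smith: 455 555 4549
Mr. Jackson: 655-777-7346
Ms. Wilson: (555)999-8464
'''

loose = re.compile(r'\d{3}.\d{3}.\d{4}')         # any separator
strict = re.compile(r'\d{3}[- ]\d{3}[- ]\d{4}')  # hyphen or space only

loose_hits = loose.findall(sample_text)
strict_hits = strict.findall(sample_text)

print(loose_hits)
print(strict_hits)
```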


# Finding Complicated Patterns

In this lesson, we will learn how to use the remaining metacharacters in our list, namely:

```
* + ? | ( )
```
We will employ these metacharacters to find more complicated patterns of text.

### Finding Names

In the code below, our `sample_text` consists of a multi-line string that contains the names and heights of the 4 highest mountains in the world according to Wikipedia:

```
Mt Everest: Height 8,848 m
Mt. K2: Height 8,611 m
Mt Kangchenjunga: Height 8,586 m
Mt. Lhotse: Height 8,516 m
```

Let's create a regular expression that will allow us to find the names of these mountains. The first thing to notice is that the word mountain has been abbreviated in two different ways, as `Mt.` and as `Mt` (without the period). Therefore, if we want to find all the names of the mountains we need to indicate in our regular expression that the period (`.`) in the abbreviation is optional. We can do this by using the `?` metacharacter in our regular expression. The `?` will match 0 or 1 repetitions of the preceding regular expression. For example, the regular expression `ab?` will match either `a` or `ab`. In other words, the `?` after the `b` indicates that the `b` after the `a` is optional. Let’s see how this works.

In the code below, we employ the `?` metacharacter to indicate that the period (`.`) after `Mt` is optional by using the regular expression `Mt\.?`:

In [47]:
sample_text = '''
Mt Everest: Height 8,848 m
Mt. K2: Height 8,611 m
Mt Kangchenjunga: Height 8,586 m
Mt. Lhotse: Height 8,516 m
'''
regex = re.compile(r'Mt\.?')
matches = regex.finditer(sample_text)
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(1, 3), match='Mt'>
<_sre.SRE_Match object; span=(28, 31), match='Mt.'>
<_sre.SRE_Match object; span=(51, 53), match='Mt'>
<_sre.SRE_Match object; span=(84, 87), match='Mt.'>


In [48]:
sample_text = '''
Mt Everest: Height 8,848 m
Mt. K2: Height 8,611 m
Mt Kangchenjunga: Height 8,586 m
Mt. Lhotse: Height 8,516 m
'''

# Create a regular expression object with a regular expression that can match all the
# mountain names
regex = re.compile(r'Mt\.?\s[A-Z]\w*')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(1, 11), match='Mt Everest'>
<_sre.SRE_Match object; span=(28, 34), match='Mt. K2'>
<_sre.SRE_Match object; span=(51, 67), match='Mt Kangchenjunga'>
<_sre.SRE_Match object; span=(84, 94), match='Mt. Lhotse'>
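
The pattern above also relies on `*`, which matches zero or more repetitions of the preceding expression, so `\w*` picks up the rest of the name after the capital letter. The sketch below contrasts `*`, `+` (one or more), and `?` (zero or one) on a made-up string; the word boundaries are there so each word is tested as a whole.

```python
import re

text = 'a ab abb abbb'  # made-up sample

star = re.findall(r'\bab*\b', text)      # 'a' followed by zero or more 'b'
plus = re.findall(r'\bab+\b', text)      # 'a' followed by one or more 'b'
optional = re.findall(r'\bab?\b', text)  # 'a' followed by at most one 'b'

print(star)
print(plus)
print(optional)
```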


In [49]:
sample_text = '''
Mt Everest: Height 8,848 m
Mt. K2: Height 8,611 m
Mt Kangchenjunga: Height 8,586 m
Mt. Lhotse: Height 8,516 m
Mnt makalu: Height 8,485 m
'''

# Create a regular expression object with a regular expression that can match all the
# mountain names
regex = re.compile(r'(Mt|Mnt)\.?\s[a-zA-Z]\w*')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(1, 11), match='Mt Everest'>
<_sre.SRE_Match object; span=(28, 34), match='Mt. K2'>
<_sre.SRE_Match object; span=(51, 67), match='Mt Kangchenjunga'>
<_sre.SRE_Match object; span=(84, 94), match='Mt. Lhotse'>
<_sre.SRE_Match object; span=(111, 121), match='Mnt makalu'>
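
The `|` metacharacter means "or", and the parentheses limit how far the alternatives extend. The sketch below, on a shortened version of the sample text, shows what happens when the parentheses are left out, and how a second group can capture just the mountain name. The names `without_group` and `pairs` are illustrative.

```python
import re

sample_text = '''
Mt Everest: Height 8,848 m
Mt. K2: Height 8,611 m
Mnt makalu: Height 8,485 m
'''

# Without parentheses the alternatives are 'Mt' OR 'Mnt\.?\s[a-zA-Z]\w*'
without_group = re.findall(r'Mt|Mnt\.?\s[a-zA-Z]\w*', sample_text)

# Group 1 captures the abbreviation, group 2 captures the name
regex = re.compile(r'(Mt|Mnt)\.?\s([a-zA-Z]\w*)')
pairs = [(m.group(1), m.group(2)) for m in regex.finditer(sample_text)]

print(without_group)
print(pairs)
```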


# Substitutions with Groups

We can do more sophisticated substitutions by using groups. Let's see an example. In the code below we have a multi-line string that contains the names of 4 people. As we can see, some people have middle names but others don't. Let's use the `.sub()` method to replace all names in the string with just the first and last name. For example, the name `John David Smith` should be replaced by `John Smith`, and `Alice Jackson` should stay the same.

The first step is to create a regular expression that matches all the names in the list. Keeping in mind that we need to make replacements later, we will use groups to distinguish between the first name, the middle name, and the last name. Since all names have a first name, we can use the group `([a-zA-Z]+)` to match all the first names. Not all names have a middle name, so the middle name is optional. Since the first and middle names are separated by a space, that space must be optional as well. To indicate that the space and the middle name are optional, we put the `?` metacharacter after the space and after the second group: `[ ]?([a-zA-Z]+)?`. After the first or middle name we have a space that we can match with `[ ]`. Notice that in this case we didn't use the sequence `\s`, since it would match newlines as well and we don't want to match those. Finally, we make a third group to match the last name. Since all names have last names, we don't need the `?` metacharacter here. Putting it all together we get:

In [52]:
sample_text = '''
John David Smith
Alice Jackson
Mary Elizabeth Wilson
Mike Brown
'''
regex = re.compile(r'([a-zA-Z]+)[ ]?([a-zA-Z]+)?[ ]([a-zA-Z]+)')
matches = regex.finditer(sample_text)
for match in matches:
    print('\nFirst Name: '+ match.group(1))
    
    if match.group(2) is None:
        print('Middle Name: None')
    else:
        print('Middle Name: '+ match.group(2))
    print('Last Name: '+ match.group(3))


First Name: John
Middle Name: David
Last Name: Smith

First Name: Alice
Middle Name: None
Last Name: Jackson

First Name: Mary
Middle Name: Elizabeth
Last Name: Wilson

First Name: Mike
Middle Name: None
Last Name: Brown


In [53]:
regex = re.compile(r'([a-zA-Z]+)[ ]?([a-zA-Z]+)?[ ]([a-zA-Z]+)')

# Substitute all names in the sample_text with the first and last name
new_text = regex.sub(r'\1 \3', sample_text)

# Print the modified text
print(new_text)


John Smith
Alice Jackson
Mary Wilson
Mike Brown
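
Numbered references such as `\1` and `\3` become hard to read as a pattern grows. As an alternative sketch, Python's `re` module also supports named groups, written `(?P<name>...)` in the pattern and `\g<name>` in the replacement. The group names `first`, `middle`, and `last` are chosen here for illustration.

```python
import re

sample_text = '''
John David Smith
Alice Jackson
Mary Elizabeth Wilson
Mike Brown
'''

# Same structure as the numbered pattern, with a name for each group
regex = re.compile(
    r'(?P<first>[a-zA-Z]+)[ ]?(?P<middle>[a-zA-Z]+)?[ ](?P<last>[a-zA-Z]+)'
)

# Keep only the first and last name
new_text = regex.sub(r'\g<first> \g<last>', sample_text)
print(new_text)
```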



# Flags

We saw at the beginning of this lesson that regexes are case sensitive, so we often have to write regexes that account for both uppercase and lowercase letters. However, the `re.compile(pattern, flags)` function has a `flags` argument that allows more flexibility. For example, the `re.IGNORECASE` flag performs **case-insensitive** matching. In the code below we have a string that contains the name Walter written in two different combinations of uppercase and lowercase letters. To find both renditions without a flag, we would need a character set for every letter, such as `[wW][aA][lL][tT][eE][rR]`. With `re.IGNORECASE` we can instead indicate that we don't care about the case of the letters; we just want to find the name Walter no matter how it is written. Let's see how this works:

In [55]:

sample_text = 'Alice and WaLtEr Brown are talking with wAlTer Jackson.'

# Create a regular expression object with the regular expression 'walter'
# that ignores the case of the letters
regex = re.compile(r'walter', re.IGNORECASE)

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(10, 16), match='WaLtEr'>
<_sre.SRE_Match object; span=(40, 46), match='wAlTer'>
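
A few related ways of passing flags, shown as a sketch: `re.I` is the short alias of `re.IGNORECASE`, the inline form `(?i)` at the start of the pattern has the same effect, and several flags can be combined with `|`. The multi-line string in the last call is made up for illustration; `re.MULTILINE` makes `^` match at the start of every line.

```python
import re

sample_text = 'Alice and WaLtEr Brown are talking with wAlTer Jackson.'

with_flag = re.findall(r'walter', sample_text, re.IGNORECASE)
with_alias = re.findall(r'walter', sample_text, re.I)
inline = re.findall(r'(?i)walter', sample_text)

# Combine flags with |: case-insensitive, and ^ matches at each line start
multi = re.findall(r'^walter', 'Alice\nWALTER\nwalter',
                   re.IGNORECASE | re.MULTILINE)

print(with_flag)
print(multi)
```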
