# Regular Expressions

Regular expressions are a way of defining *search patterns*. They are useful for searching through text for particular patterns, including ones that are quite complicated.

Regular expressions are a common feature of programming languages, text editors, and command-line utilities.

### Test your Regex here!
https://regex101.com/

# Regular Expressions in Python

To use Regular Expressions in Python, we have to import the `re` module, as follows:

In [1]:
import re

# Raw Strings

In [2]:
print('\thello, world')

	hello, world


In [3]:
print(r'\thello, world')

\thello, world
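
The two cells above show that the `r` prefix stops Python from interpreting backslash escapes such as `\t`. A minimal sketch of why this matters for regular expressions, assuming `\b` as the illustrative escape (it is not used elsewhere in this notebook): in a normal string `\b` is a backspace character, while in a raw string it reaches the regex engine as a word-boundary marker.

```python
import re

text = 'hello, world'

# Normal string: Python turns '\b' into a backspace character (ASCII 8),
# so the pattern looks for backspace + 'world' + backspace.
plain_pattern = '\bworld\b'

# Raw string: the backslashes are kept, and the regex engine reads
# \b as "word boundary".
raw_pattern = r'\bworld\b'

plain_result = re.search(plain_pattern, text)
raw_result = re.search(raw_pattern, text)

print(len(plain_pattern), len(raw_pattern))  # 7 9
print(plain_result)                          # None
print(raw_result.span())                     # (7, 12)
```

For this reason it is a common habit to write every regex pattern as a raw string, even when it contains no backslashes.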


# `re.search()` vs. `re.match()`

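`re.search()` scans through the string and returns the *first* match it finds, wherever it occurs.

`re.match()` only matches a pattern if it occurs at the very beginning of the string (not at the beginning of each line). When there is no match, both functions return `None`.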


In [5]:
test_string = 'abcdef defghjijklmnopqrstuvwxyz'

In [16]:
re.search(r'def', test_string)

<re.Match object; span=(3, 6), match='def'>

In [20]:
re.match(r'def', test_string)
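
The `re.match()` call above displays nothing because it returns `None`: `def` is not at index 0 of `test_string`. The sketch below illustrates this, and also shows that `re.match()` does not try the start of later lines even with the `re.MULTILINE` flag. The string `two_lines` is a made-up example, not part of the original notebook.

```python
import re

test_string = 'abcdef defghjijklmnopqrstuvwxyz'

# match() succeeds only when the pattern matches starting at index 0.
at_start = re.match(r'abc', test_string)
not_at_start = re.match(r'def', test_string)
print(at_start.span())   # (0, 3)
print(not_at_start)      # None

# Hypothetical two-line string: 'def' begins the second line.
two_lines = 'abc\ndef'

# Even with re.MULTILINE, match() is tried only at the start of the string.
match_multiline = re.match(r'def', two_lines, re.MULTILINE)
print(match_multiline)   # None

# search() with '^' and re.MULTILINE does find the start of a later line.
search_multiline = re.search(r'^def', two_lines, re.MULTILINE)
print(search_multiline.span())  # (4, 7)
```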

# `re.findall()` vs. `re.finditer()`

`re.findall()`, as the name suggests, finds all of the matches and returns them as a *list of strings*

`re.finditer()` returns an *iterator* that yields a `Match` object for each match

In [28]:
test_string = 'abcdef defghjijklmnopqrstuvwxyz'

In [29]:
re.findall(r'def', test_string)

['def', 'def']

In [30]:
re.finditer(r'def', test_string)

<callable_iterator at 0x10832e278>

In [31]:
for i in re.finditer(r'def', test_string):
    print(i)

<re.Match object; span=(3, 6), match='def'>
<re.Match object; span=(7, 10), match='def'>
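
One detail worth knowing when choosing between the two: the return value of `re.findall()` changes when the pattern contains capturing groups. The sketch below uses patterns invented for illustration (`d(e)f` and `(d)(e)f`) on the same `test_string`.

```python
import re

test_string = 'abcdef defghjijklmnopqrstuvwxyz'

# No groups: a list of the full matches.
no_groups = re.findall(r'def', test_string)
print(no_groups)    # ['def', 'def']

# One capturing group: a list of that group's text only.
one_group = re.findall(r'd(e)f', test_string)
print(one_group)    # ['e', 'e']

# Two or more groups: a list of tuples, one tuple per match.
two_groups = re.findall(r'(d)(e)f', test_string)
print(two_groups)   # [('d', 'e'), ('d', 'e')]

# finditer() always yields Match objects, so positions stay available.
spans = [m.span() for m in re.finditer(r'def', test_string)]
print(spans)        # [(3, 6), (7, 10)]
```

If you need positions or several groups per match, `re.finditer()` is usually the more convenient choice.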


# Match objects

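A `Match` object always has a boolean value of `True`, while `re.search()` and `re.match()` return `None` (which is falsy) when nothing matches. Therefore, we can check whether something matched by using the following pattern: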

In [41]:
test_string = 'abcdef defghjijklmnopqrstuvwxyz'

In [44]:
letter_search = re.search(r'ghj', test_string)
if letter_search:
    print('The match starts at index: {}'.format(letter_search.start()))
    print('The match ends before index: {}'.format(letter_search.end()))

The match starts at index: 10
The match ends before index: 13


In [45]:
letter_search = re.search(r'\d', test_string)
if letter_search:
    print('Match was found')
else:
    print('Match was not found')

Match was not found
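
Besides `start()` and `end()`, a `Match` object exposes the matched text and its span directly. A short sketch on the same `test_string`, with `found` and `missing` as illustrative variable names:

```python
import re

test_string = 'abcdef defghjijklmnopqrstuvwxyz'

found = re.search(r'ghj', test_string)
missing = re.search(r'\d', test_string)   # no digits in the string

print(bool(found))       # True
print(missing is None)   # True

# group() with no argument (or 0) returns the whole matched text.
print(found.group())     # ghj

# span() returns (start, end) as a tuple; end is exclusive.
print(found.span())      # (10, 13)

# The span can be used to slice the original string.
start, end = found.span()
print(test_string[start:end])  # ghj
```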


## Match Groups

In [46]:
urls = (['http://www.google.com',
        'https://www.facebook.com',
        'http://www.duke.edu',
        'http://cms.gov'])

Create a regular expression that contains 3 groups. The first should be an optional, non-capturing group that matches `www.`. The second should match the part of the URL after `www.` (if that is present), and the third should match the domain type (e.g. `.com`).

In [79]:
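# (?:www\.)?  optional, non-capturing 'www.' prefix
# (\w+)       group 1: the site name
# (\.\w+)$    group 2: the domain type, anchored to the end of the string
url_regex = re.compile(r'(?:www\.)?(\w+)(\.\w+)$')
url_matches = [url_regex.finditer(url) for url in urls]

for match_iterator in url_matches:
    for match in match_iterator:
        print(match)
        print(match.group(1))
        print(match.group(2))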


<re.Match object; span=(7, 21), match='www.google.com'>
google
.com
<re.Match object; span=(8, 24), match='www.facebook.com'>
facebook
.com
<re.Match object; span=(7, 19), match='www.duke.edu'>
duke
.edu
<re.Match object; span=(7, 14), match='cms.gov'>
cms
.gov
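
Because the `www.` group is non-capturing, it gets no group number, so `group(1)` is the site name and `group(2)` is the domain type. As an optional variation, the same pattern can be written with named groups. This is only a sketch: the names `site` and `tld` are chosen here for illustration and do not appear in the original notebook.

```python
import re

# Same pattern as above, with the two capturing groups given names.
named_url_regex = re.compile(r'(?:www\.)?(?P<site>\w+)(?P<tld>\.\w+)$')

match = named_url_regex.search('https://www.facebook.com')

# groups() returns all capturing groups as a tuple.
print(match.groups())        # ('facebook', '.com')

# Named groups can be read by name or by number.
print(match.group('site'))   # facebook
print(match.group(2))        # .com

# groupdict() maps each group name to its matched text.
print(match.groupdict())     # {'site': 'facebook', 'tld': '.com'}
```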


## A quick aside on generators and nested list comprehensions

In [82]:
url_matches = [url_regex.finditer(url) for url in urls]
[string.group(1) + string.group(2) for match_iterator in url_matches for string in match_iterator]

['google.com', 'facebook.com', 'duke.edu', 'cms.gov']
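
The cell above rebuilds `url_matches` before using it because the iterators returned by `finditer()` can be consumed only once. The sketch below writes the nested comprehension as explicit loops (the `for` clauses run left to right, outermost first) and then shows that a second pass over the same iterators yields nothing. The names `domains` and `second_pass` are illustrative.

```python
import re

urls = ['http://www.google.com',
        'https://www.facebook.com',
        'http://www.duke.edu',
        'http://cms.gov']
url_regex = re.compile(r'(?:www\.)?(\w+)(\.\w+)$')

url_matches = [url_regex.finditer(url) for url in urls]

# Equivalent of the nested comprehension, written as nested loops.
domains = []
for match_iterator in url_matches:      # outer 'for' clause
    for match in match_iterator:        # inner 'for' clause
        domains.append(match.group(1) + match.group(2))
print(domains)      # ['google.com', 'facebook.com', 'duke.edu', 'cms.gov']

# The iterators are now exhausted, so a second pass finds nothing.
second_pass = [m.group(0) for it in url_matches for m in it]
print(second_pass)  # []
```

If the matches are needed more than once, one option is to store them in lists, for example `[list(url_regex.finditer(url)) for url in urls]`.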