# 13. Regular Expressions

### Objectives
* Understand what regular expressions are and what kind of questions they can answer
* Know that regular expressions consist of **literal** and **special** characters
* Know the basic functionality of all the special characters - `. ^ $ * + ? { } [ ] \ | ( )`
* Know how to combine multiple special characters together
* Know how to change operator precedence with parentheses
* Use `contains` to select entire values and `extract` to select substrings

# Regular Expressions for more Powerful String Manipulations
**Regular Expressions** give us a way to do much more powerful string manipulations. A regular expression, or simply **regex**, is a special string that describes a specific pattern that you would like to match in another string.

### Examples of questions that regexes can answer
It might be helpful to see a list of tasks that a regex pattern can handle:
* Match all words that begin with 'S' and end in 'y'
* Match the word 'friend' or 'freind'
* Match a word with at least 3 digits in it
* Match all Gmail email addresses
* Capture the word immediately following the word 'Author'
* Capture the word immediately following the third occurrence of the word 'coffee'

# Primarily use `contains` and `extract`
This notebook will be primarily concerned with finding matching patterns within string values of a Pandas Series. We will then select all values within the Series that match the pattern via boolean indexing. The **`contains`** string Series method will be used for this.

At the end of this notebook, we will use the **`extract`** string Series method to extract particular substrings from the strings within the Series.

### A simple example without regular expressions
Let's match all movie titles that contain either an 'x', 'y', or 'z'. Without using a regex, we would call the **`contains`** string method several times and combine the results with the logical **or** operator, `|`:

In [1]:
import pandas as pd
movie = pd.read_csv('../data/movie.csv')
title = movie['title']

In [2]:
has_xyz = title.str.contains('x') | title.str.contains('y') | title.str.contains('z')
title[has_xyz].head()

9     Harry Potter and the Half-Blood Prince
21                    The Amazing Spider-Man
30                                   Skyfall
35                       Monsters University
37           Transformers: Age of Extinction
Name: title, dtype: object

We can sum up this boolean Series to determine the number of values that have either an 'x', 'y', or 'z' in them.

In [3]:
has_xyz.sum()

1193

### Use a regex instead
Instead, we can use the regex **`'[xyz]'`**, which matches any string that contains an 'x', 'y', or 'z'. We can verify that we get the same total. This regex plus many more will be covered in detail below.

In [None]:
title.str.contains('[xyz]').sum()

## Regular Expressions are a Mini-Programming Language
Regular expressions are a miniature programming language that has its own strict set of rules, just like any other language. The syntax is written as a string mixing both **literal** and **special** characters.

## Literal vs Special Characters
There are two distinct categories of characters within a regex string - **Literal** and **Special**
* **Literal** - these characters don't have any special meaning. They simply represent themselves. They are also referred to as **regular** characters.
* **Special** - these characters do have a special meaning. Each special character represents something very specific. They are also referred to as **metacharacters**.

## Matching with only Literal Characters
The simplest regex patterns you can write contain only literal characters. These strings look like most strings you would type into a search engine. Let's search for movies that have the word **`'Star'`** in them.

**`'Star'`** is a valid regular expression. We will use the **`contains`** Series string method which accepts a regular expression as its first argument. It returns a boolean Series.

In [None]:
pattern = 'Star'
title.str.contains(pattern).head()

### Filter for only movies containing `Star`
Let's take this resulting Series and use it for boolean indexing. The result should be the movie titles that have **`Star`** in them.

In [None]:
pattern = 'Star'
title[title.str.contains(pattern)]

## Regular Expressions are case sensitive
Regexes are case sensitive by default. **`'Star'`** only matches movie titles with an uppercase **`'S'`** followed immediately by lowercase **`'tar'`**. Let's search for lowercase **`'star'`**:

In [None]:
pattern = 'star'
title[title.str.contains(pattern)]
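
As a minimal sketch of case sensitivity, the following uses Python's built-in `re` module (the engine behind the pandas string methods) on a few made-up sample values rather than the movie data. The `re.IGNORECASE` flag is one way to switch case sensitivity off.

```python
import re

titles = ['Star Wars', 'Superstar', 'STARDUST']  # made-up sample values

# Default behaviour: the pattern is case sensitive
sensitive = [t for t in titles if re.search('star', t)]

# re.IGNORECASE makes the same pattern match any capitalization
insensitive = [t for t in titles if re.search('star', t, flags=re.IGNORECASE)]

print(sensitive)    # ['Superstar']
print(insensitive)  # ['Star Wars', 'Superstar', 'STARDUST']
```

In pandas, `contains` exposes the same option through its `case` parameter, for example `title.str.contains('star', case=False)`.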

### Find all movies with exact string `'Star Wars'`

In [None]:
pattern = 'Star Wars'
title[title.str.contains(pattern)]

#### Find all movies with exact string `'hine'`:

In [None]:
pattern = 'hine'
title[title.str.contains(pattern)]

## Special Characters
The following characters are the **special** characters, or **metacharacters**:

`. ^ $ * + ? { } [ ] \ | ( )`

#### Details and examples with special characters
The rest of this notebook is devoted to examples that explain each of the special characters above. This will not be an exhaustive coverage of regular expressions as they can get quite complex. There are even entire books written on the subject.

## The dot metacharacter `.`
The **dot** or **period** is a special character that matches any single character (except a newline, by default). For example, the regex **`'m.le'`** will match any string that has an **`m`** followed by any one character followed by **`le`**. It will match 'male', 'mile', 'mole', 'thimble', 'tumble', etc.

Let's see how many movie titles have this pattern:

In [None]:
pattern = 'm.le'
title[title.str.contains(pattern)]
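
To check the word list above without the movie data, here is a small sketch using Python's built-in `re` module. The extra values `'mle'` and `'m\nle'` are made up to show two cases that do not match.

```python
import re

pattern = 'm.le'
words = ['male', 'mile', 'mole', 'thimble', 'tumble', 'mle', 'm\nle']

# re.search looks for the pattern anywhere in the string, like contains does
matched = [w for w in words if re.search(pattern, w)]

print(matched)  # ['male', 'mile', 'mole', 'thimble', 'tumble']
```

`'mle'` fails because the dot must consume exactly one character, and `'m\nle'` fails because the dot does not match a newline by default.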

## The caret metacharacter `^`
The caret, **`^`** is a special character that forces the pattern to match from the beginning of the string. Let's take a look at the difference between the regexes **`War`** and **`^War`**. The first matches the word 'War' anywhere in the string. The second matches the word 'War' only at the beginning.

Let's output the differences:

In [None]:
pattern = 'War'
title[title.str.contains(pattern)].head()

In [None]:
pattern = '^War'
title[title.str.contains(pattern)]

## The dollar metacharacter `$`
The dollar character, **`$`** works analogously to the caret but instead forces a match to the **end** of the string. Let's find all the movies that end in 'War':

In [None]:
pattern = 'War$'
title[title.str.contains(pattern)]

## Start and End Anchor tags
The caret and dollar metacharacters are also known as **anchor** tags since they anchor the pattern to either the beginning or the end of the string.
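
The two anchors can also be combined. The sketch below uses Python's built-in `re` module on a handful of made-up sample values to compare a start anchor, an end anchor, and both together.

```python
import re

titles = ['War Horse', 'This Means War', 'Star Wars', 'War']  # made-up sample values

starts = [t for t in titles if re.search('^War', t)]   # 'War' at the beginning
ends = [t for t in titles if re.search('War$', t)]     # 'War' at the end
both = [t for t in titles if re.search('^War$', t)]    # the whole string is 'War'

print(starts)  # ['War Horse', 'War']
print(ends)    # ['This Means War', 'War']
print(both)    # ['War']
```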

## The asterisk metacharacter `*`
The **asterisk** or **star** metacharacter matches the previous character 0 or more times. For instance, the regex, **`'Ah* No'`** will look for strings that have an uppercase 'A' followed by 0 or more lowercase 'h' followed by ' No'. 

Let's see how this works on Series of fake data:

In [None]:
# Create Series of fake data
s = pd.Series(['Ouch', 'Ah No', 'Ahh', 'Nooo', 'Ahhhhhhh No', 'A No', 'A'])
s

In [None]:
pattern = 'Ah* No'
s[s.str.contains(pattern)]

Without the ' No' at the end, it would match two more values:

In [None]:
pattern = 'Ah*'
s[s.str.contains(pattern)]

## The plus metacharacter `+`
The **plus** metacharacter is very similar to the asterisk, except that it matches 1 or more of the previous character. Thus for the regex **`'Ah+ No'`**, the 'h' must appear at least once.

In [None]:
pattern = 'Ah+ No'
s[s.str.contains(pattern)]

## The question mark metacharacter `?`
The question mark is similar to both the asterisk and the plus, except that it matches the previous character either 0 or 1 times. For instance, the regex **`'Mea?n'`** will match both 'Mean' and 'Men'. Basically, the character before the question mark is **optional**.

In [None]:
pattern = 'Mea?n'
title[title.str.contains(pattern)]
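
To compare the three quantifiers side by side, here is a short sketch that applies `*`, `+`, and `?` to the same fake data used in the asterisk section. It uses Python's built-in `re` module, and `matching` is a hypothetical helper name introduced only for this illustration.

```python
import re

values = ['Ouch', 'Ah No', 'Ahh', 'Nooo', 'Ahhhhhhh No', 'A No', 'A']

def matching(pattern):
    """Return the values that contain a match for pattern."""
    return [v for v in values if re.search(pattern, v)]

print(matching('Ah* No'))  # ['Ah No', 'Ahhhhhhh No', 'A No']
print(matching('Ah+ No'))  # ['Ah No', 'Ahhhhhhh No']
print(matching('Ah? No'))  # ['Ah No', 'A No']
```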

## The curly braces metacharacter `{m, n}`
The curly braces metacharacter matches the previous character a given number of times. There are three different ways to use the curly braces:

* a single integer **`a{3}`**
* a single integer followed by a comma **`a{3,}`**
* two integers separated by a comma **`a{3,5}`**

**`a{3}`** matches exactly three consecutive a's. **`a{3,}`** matches 3 or more consecutive a's. **`a{3,5}`** matches between 3 and 5 consecutive a's.

Let's create another Series by hand and match all the strings that have an 'A' followed by the letter 'h' repeated between 2 and 5 times and then ' No'.

In [None]:
s = pd.Series(['Ouch', 'Ahhh No', 'Ahh No', 'Nooo', 'Ahhhhhhh No', 'A No', 'A', 'Ahhh'])
s

In [None]:
pattern = 'Ah{2,5} No'
s[s.str.contains(pattern)]
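
The three forms of the curly braces can be compared with a small sketch. It uses `re.fullmatch` from Python's built-in `re` module so that the whole string must match, and `full_matches` is a hypothetical helper name used only here.

```python
import re

values = ['aa', 'aaa', 'aaaa', 'aaaaa', 'aaaaaa']

def full_matches(pattern):
    """Return the values that the pattern matches from start to end."""
    return [v for v in values if re.fullmatch(pattern, v)]

print(full_matches('a{3}'))    # ['aaa']
print(full_matches('a{3,}'))   # ['aaa', 'aaaa', 'aaaaa', 'aaaaaa']
print(full_matches('a{3,5}'))  # ['aaa', 'aaaa', 'aaaaa']
```

Keep in mind that `contains` searches anywhere in the string, so `'a{3}'` on its own would also flag `'aaaa'`, which has three a's inside it. Surrounding literal characters or anchors are what enforce an upper limit.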

## The pipe metacharacter `|`
The pipe metacharacter is equivalent to an **or** condition. It matches either the entire pattern before the pipe or the entire pattern after it. The regex **`'Friend|Enemy'`** matches any string with 'Friend' or 'Enemy' in it.

In [None]:
pattern = 'Friend|Enemy'
title[title.str.contains(pattern)]

You can add as many pipes as you please:

In [None]:
pattern = 'Friend|Enemy|Good|Evil'
title[title.str.contains(pattern)].head()

## The brackets metacharacter `[ ]`
The brackets metacharacter allows you to match one of several characters at a single particular position. As we saw with the very first example, **`'[xyz]'`** matches any single 'x', 'y', or 'z'.

Another example, **`'T[aeiou]d'`**, matches any string that contains a 'T' followed by exactly one lowercase vowel and then a 'd'. The brackets contain all the possible matches for a single character.

Concretely, it matches the following: 'Tad', 'Ted', 'Tid', 'Tod', and 'Tud'.

In [None]:
pattern = 'T[aeiou]d'
title[title.str.contains(pattern)]

### Entire character classes with the brackets
Let's say you want to match all the lowercase letters 'a' through 'z'. You could write each letter within the brackets. Thankfully, there is a much easier way with **character classes**.

Character classes are special notation within the brackets that can be used to denote entire subsets of characters. Take the following:
* **`'[0-9]'`** represents all digits 0 through 9
* **`'[a-z]'`** represents all lowercase letters
* **`'[A-Z]'`** represents all uppercase letters
* **`'[a-zA-Z]'`** represents all lowercase and uppercase letters
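
As a quick illustration of these ranges, the sketch below uses `re.findall` from Python's built-in `re` module to list every character that each class matches in a made-up sample string.

```python
import re

text = 'Se7en X2'  # made-up sample string

digits = re.findall('[0-9]', text)
upper = re.findall('[A-Z]', text)
lower = re.findall('[a-z]', text)
letters = re.findall('[a-zA-Z]', text)

print(digits)   # ['7', '2']
print(upper)    # ['S', 'X']
print(lower)    # ['e', 'e', 'n']
print(letters)  # ['S', 'e', 'e', 'n', 'X']
```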

### Digits in movies
Let's match all movies with a digit in them.

In [None]:
pattern = '[0-9]'
title[title.str.contains(pattern)].head()

### Matching movies with 2 digits in a row
We can match movies with two digits in a row by using the digits character class twice.

In [None]:
pattern = '[0-9][0-9]'
title[title.str.contains(pattern)].head()

## Combining Special Characters
You are allowed to combine any number of literal and special characters together within your regex. For instance, matching movies with two or more digits in a row could have been done by using the curly braces for repeats like this:

In [None]:
pattern = '[0-9]{2,}'
title[title.str.contains(pattern)].head()

#### Find all movies that begin with exactly 4 digits in a row
We can use the caret to anchor the digits to the start and the curly braces to match exactly 4 digits.

In [None]:
pattern = '^[0-9]{4}'
title[title.str.contains(pattern)].head()

#### Find all movies that begin with 'The' and end with 'Movie'
We anchor 'The' to the beginning with the caret and 'Movie' to the end with the dollar symbol. We use **`.*`** in the middle to represent any character repeated 0 or more times.

In [None]:
pattern = '^The .* Movie$'
title[title.str.contains(pattern)]

#### Find all movies that are exactly 10 characters long
**`.{10}`** matches exactly any 10 characters in a row. We must anchor it to the beginning and end to ensure that the string is exactly 10 characters in length.

In [None]:
pattern = '^.{10}$'
title[title.str.contains(pattern)].head()

## More Complex Character Classes
Most special characters lose their special meaning within the brackets. For instance, **`[.]`** matches the literal dot and **`[()*$]`** matches any string with a literal parenthesis, asterisk, or dollar sign.

Let's match movies with an asterisk in them:

In [None]:
pattern = '[*]'
title[title.str.contains(pattern)]

Match movies with either an asterisk or dollar sign.

In [None]:
pattern = '[*$]'
title[title.str.contains(pattern)]

### Exclude characters with caret
It is possible to exclude character sets by putting a caret as the first character inside the brackets. For instance, **`Z[^aeiou]`** matches strings that have a 'Z' followed by any character that is not a lowercase vowel.

In [None]:
pattern = 'Z[^aeiou]'
title[title.str.contains(pattern)]

Find all movies that have an uppercase 'T' followed by any character that is not a lowercase letter.

In [None]:
pattern = 'T[^a-z]'
title[title.str.contains(pattern)]
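
One detail worth seeing in action is that a negated set still has to consume a character. The sketch below uses Python's built-in `re` module on made-up sample values.

```python
import re

words = ['Zoo', 'Zzz', 'Z-Boys', 'Z']  # made-up sample values

matched = [w for w in words if re.search('Z[^aeiou]', w)]

print(matched)  # ['Zzz', 'Z-Boys']
```

`'Zoo'` is rejected because 'o' is a vowel, while the lone `'Z'` is rejected because nothing follows it. "Not a vowel" also covers punctuation such as the hyphen in `'Z-Boys'`.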

## The backslash `\` metacharacter
The backslash metacharacter is used in conjunction with the very next character to change its meaning.

* `\d` - all digits, equivalent to `[0-9]`
* `\D` - any non-digit.
* `\s` - any single whitespace character, including normal spaces, tabs, and newlines
* `\S` - any non-whitespace
* `\w` - any 'word' character, which is any upper or lowercase letter, digit or underscore. For ASCII text, equivalent to `[A-Za-z0-9_]`
* `\W` - any non-word character

For instance, **`^\W`** matches all strings that begin with a non-word character.
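
The shorthand classes can be tried out on a made-up sample string with `re.findall` from Python's built-in `re` module. The patterns are written with an `r` prefix, which the following section on raw strings explains.

```python
import re

text = 'Room 237, 2nd_floor!'  # made-up sample string

digits = re.findall(r'\d', text)     # each single digit
spaces = re.findall(r'\s', text)     # each single whitespace character
non_word = re.findall(r'\W', text)   # each character that is not a word character
words = re.findall(r'\w+', text)     # runs of one or more word characters

print(digits)    # ['2', '3', '7', '2']
print(spaces)    # [' ', ' ']
print(non_word)  # [' ', ',', ' ', '!']
print(words)     # ['Room', '237', '2nd_floor']
```

Each shorthand matches a single character. It only matches a run of characters when combined with a quantifier, as in `\w+`.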

### Prefix the string with `r` to make it a raw string

The backslash is a special character in normal Python strings. **`\n`** represents a newline character, **`\t`** represents a tab. To be sure your regex is exactly what you see, it's best to use **raw** Python strings. Prepend the string with an r outside of the quotation marks to make it a raw string. Python will treat the backslash as a literal backslash without any special meaning.

In [None]:
pattern = r'^\W+'
title[title.str.contains(pattern)]
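
The difference between a normal and a raw string is easy to see by measuring their lengths. This small sketch needs nothing beyond plain Python.

```python
normal = '\t'   # one character: a tab
raw = r'\t'     # two characters: a backslash followed by the letter t

print(len(normal), len(raw))  # 1 2

# Without the r prefix, a backslash meant for the regex engine must be doubled
escaped = '\\d+'
raw_pattern = r'\d+'
print(escaped == raw_pattern)  # True
```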

### Backslash escapes special characters
As we just saw, the special characters lose their special ability within the brackets. Preceding a special character by a backslash has the same effect. For instance **`\*`** represents a literal asterisk and is the same as **`[*]`**

In [None]:
pattern = r'\*'
title[title.str.contains(pattern)]

## The parentheses metacharacters `( )`
The parentheses metacharacters are used to **group** together parts of the regular expression. For instance, let's say we want to find all movies that begin with the word 'In' or 'My'. You might think about using **`'^In|My'`**:

In [None]:
pattern = '^In|My'
title[title.str.contains(pattern)].head(10)

### The meaning of `^In|My`
There are a couple of things wrong with this regex. First, we are getting movies that begin with words that begin with 'In' such as 'Indiana' or 'Inside' instead of just the word 'In'.

Second, the movie, 'Journey 2: The Mysterious Island' has 'My' within the name and not at the beginning. This mistake is happening because of **operator precedence** within the regex. 

`^In|My` matches movies that begin with the letters 'In' or have 'My' anywhere inside it. The caret is only anchoring 'In'.

## Using parentheses to group
We can use parentheses to change the operator precedence just as we do in mathematical expressions. Let's modify our expression to `'^(In|My)'`.

In [None]:
pattern = '^(In|My)'
title[title.str.contains(pattern)].head(15)
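
The effect of the parentheses can be checked on a few made-up sample values. This sketch uses Python's built-in `re` module and compares the ungrouped and grouped patterns directly.

```python
import re

titles = ['In Time', 'Indiana Jones', 'Oh My', 'My Girl']  # made-up sample values

# Ungrouped: the caret only applies to 'In'; 'My' may appear anywhere
ungrouped = [t for t in titles if re.search('^In|My', t)]

# Grouped: the caret applies to the whole alternation
grouped = [t for t in titles if re.search('^(In|My)', t)]

print(ungrouped)  # ['In Time', 'Indiana Jones', 'Oh My', 'My Girl']
print(grouped)    # ['In Time', 'Indiana Jones', 'My Girl']
```

`'Oh My'` drops out once the alternation is grouped, while `'Indiana Jones'` remains because nothing yet requires a space after 'In'.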

### Getting closer
We grouped `In|My` together so the movie must begin with one of them. We are still lacking a space after them. We can do this with three slightly different regexes:
* `'^(In|My)\s'`
* `'^(In|My) '`
* `'^(In |My )'`

The `\s` matches a single whitespace character.

In [None]:
pattern = r'^(In|My)\s'
title[title.str.contains(pattern)].head(15)

### Why are we getting `UserWarning: This pattern has match groups`?
Besides operator precedence, grouping has an alternative function and that is to extract specific text from a string. In regex terminology, we call this a **capturing group**. This warning is alerting us that we have used a capture group and if we wanted to extract this group then we should be using the **`extract`** string method.

### Specifying a non-capturing group
Our regular expression is valid the way it is. We can signal inside of our regular expression that this is a **non-capturing group** by placing a **`?:`** as the first two characters inside of the parentheses. This will eliminate the warning.

In [None]:
pattern = r'^(?:In|My)\s'
title[title.str.contains(pattern)].head(15)
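
To see what "capturing" means, the sketch below runs both versions of the pattern through `re.search` from Python's built-in `re` module on a made-up sample value and inspects the captured groups.

```python
import re

capturing = re.search(r'^(In|My)\s', 'In Time')
non_capturing = re.search(r'^(?:In|My)\s', 'In Time')

# Both patterns match exactly the same text
same_text = capturing.group(0) == non_capturing.group(0)

print(same_text)               # True
print(capturing.groups())      # ('In',)
print(non_capturing.groups())  # ()
```

The match itself is identical. The only difference is whether the text inside the parentheses is stored for later extraction.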

## Using capture groups with the `extract` string method
We can use the exact same pattern with the **`extract`** string method to extract the group.

In [None]:
pattern = r'^(In|My)\s'
title.str.extract(pattern).head()

### Why are all the values missing?
Only a small fraction of the movie titles begin with 'In' or 'My'. Let's drop the missing values and see the extracted text:

In [None]:
pattern = r'^(In|My)\s'
title.str.extract(pattern).dropna().head()

### Extracting the fourth word of movie titles that begin with 'In' or 'My'
Let's try something a bit more complex and extract the fourth word of all movies that begin with the words 'In' or 'My'. For instance, the movie, 'In the Heart of the Sea' meets our criteria. The word 'of' would be extracted from it.

To accomplish this, we need to match movies that begin with 'In' or 'My' and then match two words, before capturing the fourth word.

We already saw that `^(?:In|My)` completes the first part of this task. We can then add on `(?:\s\S+){2}` which is a non-capturing group that matches a space followed by one or more non-space characters. We use `{2}` to match two of these in a row. We then need to match one more space and then capture the next word. We do this with `\s(\S+)`.

Remember, only the matched text in the parentheses is extracted. The parentheses that begin with `?:` are also not extracted.

In [None]:
pattern = r'^(?:In|My)(?:\s\S+){2}\s(\S+)'
title.str.extract(pattern).dropna()
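
The pattern can be traced on the example title mentioned above. This sketch uses Python's built-in `re` module, and `'My Girl'` is a made-up second value showing what happens when a title has fewer than four words.

```python
import re

pattern = r'^(?:In|My)(?:\s\S+){2}\s(\S+)'

match = re.search(pattern, 'In the Heart of the Sea')
whole = match.group(0)   # everything the pattern consumed
fourth = match.group(1)  # only the single capture group

print(whole)   # In the Heart of
print(fourth)  # of

# A title with fewer than four words does not match at all
too_short = re.search(pattern, 'My Girl')
print(too_short)  # None
```

With `extract`, a title that does not match produces a missing value, which is why the cell above calls `dropna`.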

### `extract` must have capture groups
The regex used with the **`extract`** string method must have capture groups. If not, an error will be raised.


### Multiple capture groups for `extract`
You can capture more than one group with **`extract`**. Take a look at the following regex, which captures the word following 'The' at the start of a title and the word following 'of' later in the title.

In [None]:
pattern = r'^The (\S+) .*of (\S+)'
title.str.extract(pattern).dropna().head()
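
Here is a short sketch of the same two-group pattern using Python's built-in `re` module. The first string is a well-known title and the second is made up to show how the greedy `.*` behaves when 'of' appears twice.

```python
import re

pattern = r'^The (\S+) .*of (\S+)'

single = re.search(pattern, 'The Lord of the Rings').groups()
print(single)  # ('Lord', 'the')

# '.*' is greedy, so with two occurrences the second group follows the last 'of'
double = re.search(pattern, 'The Tale of Two Sides of Town').groups()
print(double)  # ('Tale', 'Town')
```

With `extract`, each capture group becomes its own column in the returned DataFrame.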

## Many other string methods take regexes
You can use regular expressions in several other Series string methods such as **`count`**, **`replace`** and **`split`**. For instance, the following counts how many times two consecutive lowercase vowels appear in each string. We then find the maximum number of times this happens within the movie titles.

In [None]:
title.str.count('[aeiou]{2}').max()
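
To see how such a count comes about for individual strings, this sketch uses `re.findall` from Python's built-in `re` module on made-up sample values. It assumes, as the pandas documentation describes, that `count` tallies non-overlapping matches in the same way.

```python
import re

titles = ['Beautiful Queen', 'Sky', 'Moon']  # made-up sample values

# findall returns every non-overlapping match, so its length is the count
counts = [len(re.findall('[aeiou]{2}', t)) for t in titles]

print(counts)       # [2, 0, 1]
print(max(counts))  # 2
```

In `'Beautiful Queen'` the two matches are 'ea' and 'ue'. The 'u' after 'ea' is not counted because it stands alone once 'ea' has been consumed.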

# Other Flavors of Regex
Regular expressions are not quite standardized for every single programming language, so you will need to ensure you are implementing the right 'flavor' for each language.

# More to Regex
There is a lot more to regular expressions that was not covered in this notebook.

* [Official Python Documentation][1]
* [Thorough Online Tutorial][2]
* [Practice with explanations][3] - make sure to choose Python

[1]: https://docs.python.org/3/howto/regex.html
[2]: https://www.regular-expressions.info/
[3]: https://regex101.com/

# Regex Summary
* Literal characters represent themselves
* Special or metacharacters represent something entirely different
* The primary use of a regex is to either match a particular string or extract a substring
* Many Pandas string methods accept regular expressions but you will primarily be using **`contains`** and **`extract`**
* Use raw Python strings when writing regex. Raw strings have 'r' prepended to them.

### Metacharacter Summary
`. ^ $ * + ? { } [ ] \ | ( )`
* `.` - Matches any character
* `^` - Anchors next characters to beginning
* `$` - Anchors previous characters to end
* `*` - Matches 0 or more occurrences of previous character
* `+` - Matches 1 or more occurrences of previous character
* `?` - Matches 0 or 1 occurrences of previous character
* `{m}`, `{m,}`, `{m,n}` - Matches exactly m, m or more, or between m and n repeats of previous character
* `[]` - A character set to match one out of many characters. `[aeiou]` matches a single vowel
* `[a-z]`, `[A-Z]`, `[0-9]` - Character sets for lowercase, uppercase, and digits
* `[^abc]` - Use caret at beginning of bracket to match anything but these characters
* `\` - backslash changes meaning of next character
* `\s` - whitespace
* `\S` - non-whitespace
* `\w` - word characters: lower/uppercase letters, digits, and underscore
* `\W` - everything but `\w`
* `\d` - digits
* `\D` - non-digits
* `\.` - A backslash before a special character escapes it. Here it matches a literal dot.
* `|` - Or clause. Matches when either left or right set of characters match. `cat|dog` matches either 'cat' or 'dog'
* `()` - Groups together parts of regex like mathematical parentheses to achieve different operator precedence
* `()` - Also represents capture groups for extracting text. Use `(?:)` to signal non-capturing group.

# Exercises

### Problem 1
<span  style="color:green; font-size:16px">Find all movies that begin with 'The' followed by the next word that begins with digits.</span>

In [None]:
# your code here

### Problem 2
<span  style="color:green; font-size:16px">For all movies that begin with 'The' and are followed by the next word that begins with a digit, extract just the digits part of this word.</span>

In [None]:
# your code here

### Problem 3
<span  style="color:green; font-size:16px">Find all movies that have three consecutive capital letters in them.</span>

In [None]:
# your code here

### Problem 4
<span  style="color:green; font-size:16px">Find all movies that have two separate numbers in them. An example would be, '7 days and 7 nights'.</span>

In [None]:
# your code here

### Problem 5
<span  style="color:green; font-size:16px">Find all movies that begin and end with a capital letter.</span>

In [None]:
# your code here

### Problem 6
<span  style="color:green; font-size:16px">Find all the movies that have 6 or more non-vowel and non-space characters in a row.</span>

In [None]:
# your code here

### Problem 7
<span  style="color:green; font-size:16px">Find all the movies that have a digit followed by a comma followed by a digit.</span>

In [None]:
# your code here

### Problem 8
<span  style="color:green; font-size:16px">Find all the movies that have either an ampersand or a question mark in them.</span>

In [None]:
# your code here

### Problem 9
<span  style="color:green; font-size:16px">Which movie has the most ampersands, question marks, and periods in it?</span>

In [None]:
# your code here

### Problem 10
<span  style="color:green; font-size:16px">Extract the very next character after 't' or 'T' for each movie.</span>

In [None]:
# your code here

### Problem 11
<span  style="color:green; font-size:16px">What is the most common character after 't' or 'T'?</span>

In [None]:
# your code here

### Problem 12
<span style="color:green; font-size:16px">Extract all the words that begin with 'T' or 't' and end in 'e' then find their frequency. Research the word boundary special character.</span>

In [None]:
# your code here