> **Jupyter slideshow:** This notebook can be displayed as slides. To view it as a slideshow in your browser, type the following in the console:


> `> jupyter nbconvert [this_notebook.ipynb] --to slides --post serve`


> To toggle off the slideshow cell formatting, click the `CellToolbar` button, then `View --> Cell Toolbar --> None`.

<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Regular Expressions

_Authors: Alex Combs (NYC)_

---

<a id="learning-objectives"></a>
### Learning Objectives
*After this lesson, you will be able to:*
- Define regular expressions.
- Use regular expressions to match text.
- Demonstrate how to use capturing and non-capturing groups.
- Use advanced methods such as lookaheads.

### Lesson Guide
- [Exploring `regex`](#exploring-regex)
- [The Most Famous Quote in `regex-dom`](#most-famous-quote-in-regex-dom)
- [So, What Does a Regular Expression Look Like?](#so-what-does-a-regular-expression-look-like)
- [The History of Regular Expressions](#the-history-of-regular-expressions)
- [Where Is `regex` Implemented?](#where-is-regex-implemented)
- [Basic Regular Expression Syntax](#basic-regular-expression-syntax)
	- [Literals](#literals)
	- [Character Classes](#character-classes)
	- [Character Classes Can Also Accept Certain Ranges](#character-classes-can-also-accept-certain-ranges)
	- [Character Class Negation](#character-class-negation)
- [Exercise #1](#exercise-)
	- [What Happens if We Put Two Character Class Brackets Back to Back?](#what-happens-if-we-put-two-character-class-brackets-back-to-back)
- [Shorthand for Character Classes](#shorthand-for-character-classes)
- [Special Characters](#special-characters)
- [Exercise #2](#exercise-2)
- [The Dot](#the-dot)
- [Anchors](#anchors)
- [Exercise #3](#exercise-3)
- [Modifiers](#modifiers)
- [Quantifiers](#quantifiers)
- [Greedy and Lazy Matching](#greedy-and-lazy-matching)
- [Exercise #4](#exercise-4)
- [Groups and Capturing](#groups-and-capturing)
- [Exercise #5](#exercise-5)
- [Alternation](#alternation)
- [Word Border](#word-border)
- [Lookahead](#lookahead)
- [Exercise #6](#exercise-6)
- [`regex` in Python and `pandas`](#regex-in-python-and-pandas)
	- [`re`'s `.search()` Method](#regex-search-method)
	- [`re`'s `.findall()` Method](#regex-findall-method)
	- [Using `pandas`](#using-pandas)
	- [`str.contains`](#strcontains)
	- [`str.extract`](#strextract)
- [Independent Practice](#independent-practice)
- [Extra Practice](#extra-practice)


<img src="../assets/regex1.png">
<br>
<center>**as in**</center>
<br>
<img src="../assets/regex2.png">
<br>
<center>**not as in**</center>
<img src="../assets/regex3.png">

In [15]:
from __future__ import division

from IPython.display import Image

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

<a id="exploring-regex"></a>
## Exploring `regex`

---

[RegEx101](https://regex101.com/) lets you explore `regex`.

- Copy the text in the cell below into the body of the website linked above.

- Click the flags control to the right of the main text field and make sure that **g** (global) and **m** (multiline) are selected.

```
1. This is a string
2. That is also a string
3. This is an illusion
4. THIS IS LOUD
that isn't thus
bob this is bob
bob bob_ ralph_ bobbobbobbybobbob
ababababab
6. tHiS	iS	CoFu SEd
777. THIS IS 100%-THE-BEST!!!
8888. this_is_a_fiiile.py
hidden bob
```

<a id="most-famous-quote-in-regex-dom"></a>
## The Most Famous Quote in `regex-dom`

>"Some people, when confronted with a problem, think 
'I know, I'll use regular expressions.'  Now they have two problems." — Jamie Zawinski (Netscape engineer)

<a id="so-what-does-a-regular-expression-look-like"></a>
## So, What Does a Regular Expression Look Like?

## ```/^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/```



<img src="../assets/regex4.png">
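As a rough illustration, the pattern above can be tried in Python's built-in `re` module once the surrounding `/` delimiters are dropped (they mark the start and end of the pattern and are not part of it). The address below is made up for the example, and the pattern is a teaching sketch rather than a complete email validator.

```python
import re

# The pattern from the heading, without the leading and trailing "/" delimiters.
email_pattern = r"^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$"

# Hypothetical address used only for illustration.
match = re.search(email_pattern, "alex.combs@example.com")

# Each pair of parentheses captures one part of the address.
user, domain, suffix = match.groups()
print(user, domain, suffix)  # alex.combs example com
```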

<a id="the-history-of-regular-expressions"></a>
## The History of Regular Expressions

---

Regular expressions and neural nets have a common ancestry in the work of McCulloch and Pitts (1943) and their attempt to computationally represent a model of a neuron.

This work was picked up by Stephen Kleene (Mr. \*) and developed further into the idea of regular expressions. His idea was then popularized by its inclusion in Unix in the 1970s, in the form of [**grep**](http://opensourceforu.com/2012/06/beginners-guide-gnu-grep-basics-regular-expressions/). Its inclusion in Perl in the 1980s cemented its popularity.

Here's [the story of Walter Pitts](http://nautil.us/issue/21/information/the-man-who-tried-to-redeem-the-world-with-logic).

<a id="where-is-regex-implemented"></a>
## Where Is `regex` Implemented?

---

Regular expressions can be run in many places: your text editor, the `bash` shell, Python, and even SQL. Support for them is typically baked into the standard library of a programming language.

In Python, it can be imported like so:

```python
import re
```

<a id="basic-regular-expression-syntax"></a>
## Basic Regular Expression Syntax
---

<a id="literals"></a>
### Literals

Literals are essentially just what you think of as characters in a string. For example:

```
a
b
c
X
Y
Z
1
5
100
``` 

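These are all considered literals. A multi-character pattern such as `100` is simply the literals `1`, `0`, and `0` matched in sequence.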

<a id="character-classes"></a>
### Character Classes

A character class is a set of characters matched as an "or."

```
[io]
```

So, this class would run as "match either i or o."

You can include as many characters as you like in between the brackets.

Character classes match only a single character.
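A minimal sketch of a character class in action, assuming Python's built-in `re` module and a made-up sample string:

```python
import re

text = "tip top tap tup"

# [io] stands for exactly one character: either "i" or "o".
matches = re.findall(r"t[io]p", text)
print(matches)  # ['tip', 'top']
```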

<a id="character-classes-can-also-accept-certain-ranges"></a>
### Character Classes Can Also Accept Certain Ranges

For example, the following will all work:
    
```
[a-f]
[a-z]
[A-Z]
[a-zA-Z]
[1-4]
[a-c1-3]
```
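The ranges above can be tried out in Python. This is a small sketch using the built-in `re` module; the sample string is invented for illustration.

```python
import re

text = "Room B12, seat f7, row Z"

# A single range: any uppercase letter.
uppercase = re.findall(r"[A-Z]", text)
print(uppercase)  # ['R', 'B', 'Z']

# Two ranges in one class: a through c, or 1 through 3.
mixed = re.findall(r"[a-c1-3]", text)
print(mixed)  # ['1', '2', 'a']
```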

<a id="character-class-negation"></a>
### Character Class Negation

We can also add **negation** to character classes. For example:

```
[^a-z]
```

This means match any single character that is *NOT* in the range `a` through `z`.
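A short sketch of negation, assuming Python's `re` module and a made-up string. Note that uppercase letters, digits, and punctuation all count as "not `a` through `z`."

```python
import re

text = "abc-123_XYZ"

# Every single character that is not a lowercase letter.
not_lowercase = re.findall(r"[^a-z]", text)
print(not_lowercase)  # ['-', '1', '2', '3', '_', 'X', 'Y', 'Z']
```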

<a id="exercise-"></a>
## Exercise #1

---

<a id="what-happens-if-we-put-two-character-class-brackets-back-to-back"></a>
### What Happens If We Put Two Character Class Brackets Back to Back?

Using RegEx101 and the text snippet from earlier, match **"That", "that"**, and **"thus"** — but not **"This"** and **"this"** — using the following:
- One literal
- Two character classes (no negation)
- One negation in a character class

In [None]:
# A:            

<a id="shorthand-for-character-classes"></a>
## Shorthand for Character Classes
---

```
\w - Matches word characters (includes digits and underscores)
\W - Matches what \w doesn't — non-word characters
\d - Matches all digit characters
\D - Matches all non-digit characters
\s - Matches whitespace (including tabs)
\S - Matches non-whitespace
\n - Matches new lines
\r - Matches carriage returns
\t - Matches tabs
```

These can also be placed into brackets like so:

```
[\d\t]
[^\d\t]
```
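The shorthand classes can be compared side by side in Python. This is a sketch with the built-in `re` module; the sample string (which contains one tab and one space) is made up.

```python
import re

text = "id_42\tok 7"

word_chars = re.findall(r"\w", text)
print(word_chars)  # ['i', 'd', '_', '4', '2', 'o', 'k', '7']

digits = re.findall(r"\d", text)
print(digits)  # ['4', '2', '7']

whitespace = re.findall(r"\s", text)
print(whitespace)  # ['\t', ' ']

# Shorthand inside a negated class: anything that is neither a digit nor a tab.
neither = re.findall(r"[^\d\t]", text)
print(neither)  # ['i', 'd', '_', 'o', 'k', ' ']
```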

<a id="special-characters"></a>
## Special Characters
---

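Certain characters have a special meaning in `regex` and must be escaped with a backslash (`\`) to be matched literally.

These include the following:

```
.
^
$
*
+
?
{
}
[
]
\
|
(
)
```

The hyphen (`-`) only needs escaping inside a character class, where it otherwise denotes a range. Characters such as `&`, `<`, and `>` are ordinary literals in Python's `re` module and need no escaping there.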
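To see why escaping matters, here is a small sketch in Python's `re` module comparing an escaped and an unescaped dot. The sample strings are invented; `re.escape()` is the standard-library helper that adds the backslashes for you.

```python
import re

sample = "3.50 3x50"

# Unescaped, the dot is a wildcard, so "3x50" matches too.
wildcard = re.findall(r"\d.\d\d", sample)
print(wildcard)  # ['3.50', '3x50']

# Escaped, the dot only matches a literal period.
literal = re.findall(r"\d\.\d\d", sample)
print(literal)  # ['3.50']

# re.escape() escapes every special character in a string.
escaped = re.escape("(approx.)")
print(escaped)  # \(approx\.\)

found = re.search(escaped, "Total: 3.50 (approx.)").group()
print(found)  # (approx.)
```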

<a id="exercise-2"></a>
## Exercise #2

---

Use RegEx101 and our text snippet to match all digits. Do this three ways:

```
- First, with character classes
- Second, with the shorthand character classes
- Third, using only negation
```

In [None]:
# A:

<a id="the-dot"></a>
## The Dot

---

The dot (`.`) matches any single character except, by default, a newline.
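A quick sketch in Python's `re` module, using a made-up string. By default the dot does not match a newline; the `re.DOTALL` flag changes that.

```python
import re

text = "bob bab b_b b\nb"

# The dot stands in for any one character between the two b's.
default = re.findall(r"b.b", text)
print(default)  # ['bob', 'bab', 'b_b']

# With re.DOTALL, the dot also matches the newline character.
with_dotall = re.findall(r"b.b", text, flags=re.DOTALL)
print(with_dotall)  # ['bob', 'bab', 'b_b', 'b\nb']
```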

<a id="anchors"></a>
## Anchors

---

Anchors are used to denote the start and end of a line.

```
^ - Matches the start of the line
$ - Matches the end of the line
```

Example:

```
^Now - Matches the word "Now" when it occurs at the beginning of a line.  
country$ - Matches the word "country" when it occurs at the end of a line.
```

You can also anchor a match to the beginning or end of a word with `\b`.
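The anchors can be sketched in Python as follows. This assumes the built-in `re` module and a made-up three-line string. In Python, `^` and `$` only match at the start and end of the whole string unless the `re.MULTILINE` flag is set.

```python
import re

text = "Now is the time\nfor all good men\nto come to the aid of their country"

starts_with_now = re.findall(r"^Now", text)
print(starts_with_now)  # ['Now']

ends_with_country = re.findall(r"country$", text)
print(ends_with_country)  # ['country']

# Without re.MULTILINE, ^ matches only at the very start of the string.
first_word = re.findall(r"^\w+", text)
print(first_word)  # ['Now']

# With re.MULTILINE, ^ matches at the start of every line.
first_words = re.findall(r"^\w+", text, flags=re.MULTILINE)
print(first_words)  # ['Now', 'for', 'to']
```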

<a id="exercise-3"></a>
## Exercise #3

---

1. Use an anchor and a character class to find the **bab** and the **bob** at the end of the line, but not elsewhere.
2. Match all numbers at the beginning of a line.

In [1]:
# A:

<a id="modifiers"></a>
## Modifiers

---

Modifiers control the following:
    
```
g - Global match (matches every occurrence in the text, rather than just the first)
i - Case insensitivity
m - Multiline (modifies how ^ and $ work)
```
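Python's `re` module has no `g` flag; roughly speaking, `re.search()` returns only the first match while `re.findall()` returns every match. The other two modifiers are available as `re.IGNORECASE` and `re.MULTILINE`, or inline as `(?i)` and `(?m)`. A sketch with a made-up string:

```python
import re

text = "This is one line\nthis is another"

# Closest equivalent of a non-global match: the first occurrence only.
first = re.search(r"is", text).group()
print(first)  # is

# Closest equivalent of "g": every occurrence.
every = re.findall(r"is", text)
print(len(every))  # 4

# "i" and "m" combined with the | operator.
line_starts = re.findall(r"^this", text, flags=re.IGNORECASE | re.MULTILINE)
print(line_starts)  # ['This', 'this']

# The same flags written inline at the start of the pattern.
inline = re.findall(r"(?im)^this", text)
print(inline)  # ['This', 'this']
```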

<a id="quantifiers"></a>
## Quantifiers

---

Quantifiers adjust how many items are matched.

```
* - Zero or more
+ - One or more
? - Zero or one
{n} - Exactly 'n' number
{n,} - Matches 'n' or more occurrences
{n,m} - Between 'n' and 'm'
```
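Each quantifier can be sketched in Python with the built-in `re` module. The sample strings are made up, and `\b` (a word boundary) is used only to keep each match to a whole word.

```python
import re

# ? : zero or one "u"
optional = re.findall(r"colou?r", "color colour")
print(optional)  # ['color', 'colour']

# + : one or more "o"
one_or_more = re.findall(r"go+l", "gl gol gooool")
print(one_or_more)  # ['gol', 'gooool']

# * : zero or more "o"
zero_or_more = re.findall(r"go*l", "gl gol gooool")
print(zero_or_more)  # ['gl', 'gol', 'gooool']

text = "a aa aaa aaaa"

exactly_two = re.findall(r"\ba{2}\b", text)
print(exactly_two)  # ['aa']

two_or_more = re.findall(r"\ba{2,}\b", text)
print(two_or_more)  # ['aa', 'aaa', 'aaaa']

two_to_three = re.findall(r"\ba{2,3}\b", text)
print(two_to_three)  # ['aa', 'aaa']
```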

## Exercise
---
Copy the following molecular formulas into RegEx101. Create a regular expression to isolate each atom (with its number, if provided).

```
CH4 (Methane)
H2O (Water)
HCl (Hydrochloric Acid)
C3H8 (Propane)
ClF3 (Chlorine trifluoride)
Cl2O7 (Dichlorine heptoxide)
```

<a id="greedy-and-lazy-matching"></a>
## Greedy and Lazy Matching

---


By default, quantifiers such as `.+` and `.*` are *greedy* matchers. This means they will match as many characters as possible (i.e., the longest match).

This can be flipped to lazy matching (the shortest match) by adding a question mark (`?`) after the quantifier, as in `.+?` or `.*?`.
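The difference is easiest to see on text with repeated delimiters. A minimal sketch in Python's `re` module, using a made-up HTML-like string:

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: runs from the first "<" to the last ">".
greedy = re.findall(r"<.+>", html)
print(greedy)  # ['<b>bold</b> and <i>italic</i>']

# Lazy: stops at the first ">" it can.
lazy = re.findall(r"<.+?>", html)
print(lazy)  # ['<b>', '</b>', '<i>', '</i>']
```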


<a id="groups-and-capturing"></a>
## Groups and Capturing

---

In `regex`, parentheses — `()` — denote groupings. These groups can then be quantified.

Additionally, these groups can be designated as either "capture" or "non-capture."

To mark a group as a capture group, just put it in parentheses: `(match_phrase)`.

To mark it as a non-capture group, punctuate it like so: `(?:match_phrase)`.

Each capture group is assigned a consecutive number that can be referenced (e.g., `$1`, `$2`, ... in many tools, or `\1`, `\2`, ... in Python's `re` module).
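Here is a small sketch of capture groups, non-capture groups, and numbered references in Python's `re` module. The date string is made up for illustration.

```python
import re

text = "2017-11-14"
date_pattern = r"(\d{4})-(\d{2})-(\d{2})"

m = re.search(date_pattern, text)
print(m.group(0))  # 2017-11-14  (the whole match)
print(m.group(1))  # 2017        (first capture group)
print(m.groups())  # ('2017', '11', '14')

# A non-capture group still has to match, but it is not numbered or returned.
m2 = re.search(r"(?:\d{4})-(\d{2})-(\d{2})", text)
print(m2.groups())  # ('11', '14')

# Groups can be referenced by number in a replacement string.
swapped = re.sub(date_pattern, r"\2/\3/\1", text)
print(swapped)  # 11/14/2017

# A group can be quantified as a unit.
repeated = re.search(r"(?:ab)+", "xxababab").group()
print(repeated)  # ababab
```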

<a id="exercise-5"></a>
## Exercise #5

---

The following is a list of facts about the Boston Celtics, Atlanta Hawks, San Antonio Spurs, NY Knicks, and Chicago Bulls, as they appeared on https://www.basketball-reference.com/.

Celtics:
```
Record: 12-2, 1st in NBA Eastern Conference
Last Game: W 95-94 vs. TOR
Next Game: Tuesday, Nov. 14 at BRK
Coach: Brad Stevens (12-2)
Executive: Danny Ainge
PTS/G: 102.2 (24th of 30) Opp PTS/G: 94.0 (1st of 30)
SRS: 7.38 (3rd of 30) Pace: 96.4 (26th of 30)
Off Rtg: 106.0 (17th of 30) Def Rtg: 97.5 (1st of 30)
```


Hawks:
```
Record: 2-11, 15th in NBA Eastern Conference
Last Game: L 94-113 at WAS
Next Game: Monday, Nov. 13 at NOP
Coach: Mike Budenholzer (2-11)
Executive: Travis Schlenk
PTS/G: 102.5 (23rd of 30) Opp PTS/G: 110.8 (26th of 30)
SRS: -8.59 (27th of 30) Pace: 100.0 (9th of 30)
Off Rtg: 102.4 (25th of 30) Def Rtg: 110.8 (27th of 30)
```


Spurs:
```
Record: 8-5, 3rd in NBA Western Conference
Last Game: W 133-94 vs. CHI
Next Game: Tuesday, Nov. 14 at DAL
Coach: Gregg Popovich (8-5)
Executive: R.C. Buford
PTS/G: 103.0 (22nd of 30) Opp PTS/G: 99.6 (5th of 30)
SRS: 3.04 (9th of 30) Pace: 96.4 (27th of 30)
Off Rtg: 106.9 (16th of 30) Def Rtg: 103.4 (6th of 30)
```


Knicks:
```
Record: 7-5, 4th in NBA Eastern Conference
Last Game: W 118-91 vs. SAC
Next Game: Monday, Nov. 13 vs. CLE
Coach: Jeff Hornacek (7-5)
Executive: Steve Mills
PTS/G: 106.4 (16th of 30) Opp PTS/G: 105.0 (13th of 30)
SRS: 1.12 (14th of 30) Pace: 96.5 (25th of 30)
Off Rtg: 110.2 (6th of 30) Def Rtg: 108.8 (21st of 30)
```


Bulls:
```
Record: 2-9, 14th in NBA Eastern Conference
Last Game: L 94-133 at SAS
Next Game: Wednesday, Nov. 15 at OKC
Coach: Fred Hoiberg (2-9)
Executive: Gar Forman
PTS/G: 93.6 (30th of 30) Opp PTS/G: 103.9 (10th of 30)
SRS: -9.69 (29th of 30) Pace: 96.1 (29th of 30)
Off Rtg: 96.6 (30th of 30) Def Rtg: 107.2 (19th of 30)
```

Using match groups, create regular expressions to isolate the following:
1. The executive's name
2. The coach's name
3. Number of wins
4. Number of losses
5. Their conference
6. Their rank within their respective conference
7. Pace
8. The date of their next game
9. Points allowed (Opp PTS/G)
10. Rank of points allowed

<a id="alternation"></a>
## Alternation

---

The pipe character (`|`) denotes an OR relation between alternatives, similar in spirit to `or` in Python: the pattern matches if either side matches.

For example, `(bob|bab)` or `(b(o|a)b)`.
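A brief sketch of alternation in Python's `re` module, with made-up strings. Because `re.findall()` returns only the capture groups when a pattern contains them, the second example uses a non-capture group to get the whole match back.

```python
import re

# Either whole alternative may match.
pets = re.findall(r"cat|dog", "cat dog bird catdog")
print(pets)  # ['cat', 'dog', 'cat', 'dog']

# Alternation limited to one position by a (non-capture) group.
spellings = re.findall(r"gr(?:a|e)y", "gray grey griy")
print(spellings)  # ['gray', 'grey']

# With a capture group, findall returns just the captured letter.
letters = re.findall(r"gr(a|e)y", "gray grey griy")
print(letters)  # ['a', 'e']
```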

<a id="word-border"></a>
## Word Border

---

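The word border (`\b`), also called a word boundary, matches the position between a word character (`\w`) and a non-word character, or the start or end of the text. It limits matches to those that begin or end at the edge of a word.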

These borders can be used on both the left and right sides of the match.
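The effect of `\b` on each side can be sketched in Python's `re` module with a made-up string. Use a raw string (`r"..."`) for the pattern, because in an ordinary Python string `"\b"` means a backspace character.

```python
import re

text = "cat concat category cat."

# No borders: "cat" is found inside longer words too.
anywhere = re.findall(r"cat", text)
print(len(anywhere))  # 4

# Borders on both sides: only the standalone word.
whole_word = re.findall(r"\bcat\b", text)
print(whole_word)  # ['cat', 'cat']

# Border on the left only: words that start with "cat".
starts_with = re.findall(r"\bcat\w*", text)
print(starts_with)  # ['cat', 'category', 'cat']

# Border on the right only: words that end with "cat".
ends_with = re.findall(r"\w*cat\b", text)
print(ends_with)  # ['cat', 'concat', 'cat']
```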

## Exercise
---
Look at the documentation for sklearn's `CountVectorizer` class. 

1. Explain the regex that the vectorizer uses to split up words.
2. What might be some flaws with the splitting logic?
3. Create your own tokenizer regex and defend your rationale.

<a id="lookahead"></a>
## Lookahead
---

There are two types of lookaheads: positive and negative.

```    
(?=match_text) - A positive lookahead says, "only match the current pattern if it is followed by another pattern."
(?!match_text) - A negative lookahead says the opposite.

Examples:
- that(?=guy) - Only match "that" if it is followed by "guy."
- these(?!guys) - Only match "these" if it is NOT followed by "guys."
```
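The two examples above can be sketched in Python's `re` module. The sample string is made up, and match positions are printed so it is clear which occurrence was found.

```python
import re

text = "thatguy thatgirl theseguys thesefolks"

# Positive lookahead: "that" only where "guy" comes next.
positive = [m.start() for m in re.finditer(r"that(?=guy)", text)]
print(positive)  # [0]

# Negative lookahead: "these" only where "guys" does NOT come next.
negative = [m.start() for m in re.finditer(r"these(?!guys)", text)]
print(negative)  # [27]

# The lookahead text is checked but is not part of the match itself.
matched_text = re.search(r"that(?=guy)", text).group()
print(matched_text)  # that
```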

<a id="exercise-6"></a>
## Exercise #6
---
Using the following text:

```
1. This is a string
2. That is also a string
3. This is an illusion
4. THIS IS LOUD
that isn't thus
bob this is bob
bob bob_ ralph_ bobbobbobbybobbob
ababababab
6. tHiS    iS    CoFu SEd
777. THIS IS 100%-THE-BEST!!!
8888. this_is_a_fiiile.py
hidden bob
```

1. Match **bob** only if it is followed by "_".
2. Match **bob** if it is followed by "_" or a new line character (Hint: How do we specify "or" in `regex`?).
3. Match **bob** only if it isn't followed by a space or a new line character.

In [None]:
# A:

<a id="regex-in-python-and-pandas"></a>
## Regex in Python and `pandas`

---

Let's practice working with `regex` in Python and `pandas` using the string below.

In [3]:

my_string = """
I said a hip hop,
The hippie, the hippie,
To the hip, hip hop, and you don't stop, a rock it
To the bang bang boogie, say, up jump the boogie,
To the rhythm of the boogie, the beat.
"""

In [4]:
# Import Python's built-in regular expression module, `re`.
import re

<a id="regex-search-method"></a>
### `re`'s `.search()` Method

In [5]:
# `.search()` returns a match object.
mo = re.search('h([io])p', my_string)

In [6]:
# Everything that matches the expression:
mo.group()

'hip'

In [7]:
# The match groups (like $1, $2):
mo.group(1)

'i'
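Two further details of `re.search()` are worth a quick sketch: the match object also records where the match was found, and the function returns `None` when nothing matches, so it is safest to check before calling `.group()`. The short strings here are taken from the lyric only for illustration.

```python
import re

lyric = "I said a hip hop"

mo = re.search(r"h([io])p", lyric)
print(mo.span())  # (9, 12)  start and end positions of the first match

# No match: re.search() returns None rather than a match object.
no_match = re.search(r"h([io])p", "rhythm of the boogie")
print(no_match)  # None

# Guard against None before using the result.
result = no_match.group() if no_match is not None else "nothing found"
print(result)  # nothing found
```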

<a id="regex-findall-method"></a>
### `re`'s `.findall()` Method

In [8]:
mo = re.findall('h[io]p', my_string)

In [9]:
mo

['hip', 'hop', 'hip', 'hip', 'hip', 'hip', 'hop']

In [10]:
# `.findall()` will return only the capture groups, if included.
mo = re.findall('h([io])p', my_string)

In [11]:
mo

['i', 'o', 'i', 'i', 'i', 'i', 'o']
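A short sketch of how the return value of `.findall()` changes with the number of capture groups, plus `re.finditer()` as one way to keep both the whole match and its groups. The shortened lyric is used only for illustration.

```python
import re

lyric = "hip hop hippie"

# Two capture groups: each result is a tuple of the groups.
pairs = re.findall(r"(h)([io])p", lyric)
print(pairs)  # [('h', 'i'), ('h', 'o'), ('h', 'i')]

# A non-capture group: the whole match is returned again.
whole = re.findall(r"h(?:[io])p", lyric)
print(whole)  # ['hip', 'hop', 'hip']

# finditer yields match objects, giving access to both.
both = [(m.group(0), m.group(1)) for m in re.finditer(r"h([io])p", lyric)]
print(both)  # [('hip', 'i'), ('hop', 'o'), ('hip', 'i')]
```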

<a id="using-pandas"></a>
### Using `pandas`

In [16]:
fish = pd.Series(['onefish', 'twofish','redfish', 'bluefish'])
fish

0     onefish
1     twofish
2     redfish
3    bluefish
dtype: object

<a id="strcontains"></a>
### `str.contains`

In [17]:
# Get all fish that start with "b."
fish[fish.str.contains('^b')]

3    bluefish
dtype: object

<a id="strextract"></a>
### `str.extract`

In [18]:
# `.extract()` maps capture groups to new Series.
fish.str.extract('(.*)fish', expand=False)

0     one
1     two
2     red
3    blue
dtype: object

<a id="independent-practice"></a>
## Independent Practice
---

1. Load in the Titanic dataset from Kaggle (train.csv).
2. Extract the title (Miss, Mr, Mme, etc.) from the passenger's name into its own column.

<a id="extra-practice"></a>
## Extra Practice

---

Pull up the [Regex Golf](http://regex.alf.nu/) website and solve as many as you can!

If you get bored, try [Regex Crossword](https://regexcrossword.com/).