> **Jupyter slideshow:** This notebook can be displayed as slides. To view it as a slideshow in your browser, type the following in the console:


> `> jupyter nbconvert [this_notebook.ipynb] --to slides --post serve`


> To toggle off the slideshow cell formatting, click the `CellToolbar` button, then `View --> Cell Toolbar --> None`

<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Regular Expressions

_Authors: Alex Combs (NYC)_

---

<a id="learning-objectives"></a>
### Learning Objectives
*After this lesson, you will be able to:*
- Explain what regular expressions are
- Use regular expressions to match text
- Demonstrate how to use capturing and non-capturing groups
- Use advanced methods such as look-aheads

### Lesson Guide
- [Exploring regex](#exploring-regex)
- [Most famous quote in regex-dom](#most-famous-quote-in-regex-dom)
- [So what does a regular expression look like?](#so-what-does-a-regular-expression-look-like)
- [The history of regular expressions](#the-history-of-regular-expressions)
- [Where is regex implemented?](#where-is-regex-implemented)
- [Basic regular expression syntax](#basic-regular-expression-syntax)
	- [Literals](#literals)
	- [Character classes](#character-classes)
	- [Character classes can also accept certain ranges](#character-classes-can-also-accept-certain-ranges)
	- [Character class negation](#character-class-negation)
- [Exercise 1](#exercise-)
	- [What happens if we put two character class brackets back to back?](#what-happens-if-we-put-two-character-class-brackets-back-to-back)
- [Shorthand for character classes](#shorthand-for-character-classes)
- [Special Characters](#special-characters)
- [Exercise 2](#exercise-2)
- [The dot](#the-dot)
- [Anchors](#anchors)
- [Exercise 3](#exercise-3)
- [Modifiers](#modifiers)
- [Quantifiers](#quantifiers)
- [Greedy and Lazy Matching](#greedy-and-lazy-matching)
- [Exercise 4](#exercise-4)
- [Groups and capturing](#groups-and-capturing)
- [Exercise 5](#exercise-5)
- [Alternation](#alternation)
- [Word border](#word-border)
- [Lookahead](#lookahead)
- [Exercise 6](#exercise-6)
- [Regex in Python and Pandas](#regex-in-python-and-pandas)
	- [Regex search method](#regex-search-method)
	- [Regex findall method](#regex-findall-method)
	- [Using pandas](#using-pandas)
	- [`str.contains`](#strcontains)
	- [`str.extract`](#strextract)
- [Independent practice](#independent-practice)
- [Extra practice](#extra-practice)


<img src="../assets/regex1.png">
<br>
<center>**as in**</center>
<br>
<img src="../assets/regex2.png">
<br>
<center>**not as in**</center>
<img src="../assets/regex3.png">

In [15]:
from __future__ import division

from IPython.core.display import Image

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

<a id="exploring-regex"></a>
## Exploring regex

---

The [RegExr](http://regexr.com/) lets you explore regex.

- Copy the text in the cell below into the body of the site above.

- Click on flags in the upper-right hand corner and make sure that **g** and **m** are selected.

```
1. This is a string

2. That is also a string

3. This is an illusion

4. THIS IS LOUD

that isn't thus

bob this is bob
bob bob_ ralph_ bobbobbobbybobbob
ababababab

6. tHiS	iS	CoFu SEd

777. THIS IS 100%-THE-BEST!!!

8888. this_is_a_fiiile.py

hidden bob
```

<a id="most-famous-quote-in-regex-dom"></a>
## Most famous quote in regex-dom:

>Some people, when confronted with a problem, think 
“I know, I'll use regular expressions.”   Now they have two problems. -Jamie Zawinski (Netscape Engineer)

<a id="so-what-does-a-regular-expression-look-like"></a>
## What does a regular expression look like?

## ```/^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/```



<img src="../assets/regex4.png">

<a id="the-history-of-regular-expressions"></a>
## The history of regular expressions

---

Regular expressions and neural nets have a common ancestry in the work of McCulloch and Pitts (1943) in their attempt to computationally represent a model of a neuron. 

This work was picked up by Stephen Kleene (Mr. \*) and developed further into the idea of regular expressions. This was then popularized by its inclusion in Unix in the 1970s in the form of [**grep**](http://opensourceforu.com/2012/06/beginners-guide-gnu-grep-basics-regular-expressions/). Its inclusion in Perl in the 1980s cemented its popularity.

[The Story of Walter Pitts](http://nautil.us/issue/21/information/the-man-who-tried-to-redeem-the-world-with-logic)

<a id="where-is-regex-implemented"></a>
## Where is regex implemented?

---

There are any number of places where regexes can be run: from your text editor, to the bash shell, to Python and even SQL. Regex support is typically baked into the standard library of most programming languages.

In Python it can be imported like so:

```python
import re
```

<a id="basic-regular-expression-syntax"></a>
## Basic regular expression syntax
---

<a id="literals"></a>
### Literals

Literals are essentially just what you think of as characters in a string. For example:

```
a
b
c
X
Y
Z
1
5
100
``` 

are all examples of literals.

<a id="character-classes"></a>
### Character classes

A character class is a set of characters, any one of which can match (an "or").

```
[io]
```

So this would run as "match either i or o".

You can put as many characters as you like in between the brackets.

Character classes match only a single character.
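As a minimal sketch of how this looks in Python (the sample strings are illustrative, and `re` is the standard library module imported earlier), `re.findall` returns each single character or word the class matches:

```python
import re

text = "This is a string"

# [io] matches ONE character that is either "i" or "o"
single_chars = re.findall(r"[io]", text)
print(single_chars)  # ['i', 'i', 'i']

# Combined with literals: "h", then "i" or "o", then "p"
words = re.findall(r"h[io]p", "hip hop hap")
print(words)  # ['hip', 'hop']
```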

<a id="character-classes-can-also-accept-certain-ranges"></a>
### Character classes can also accept certain ranges

For example, the following all work:
    
```
[a-f]
[a-z]
[A-Z]
[a-zA-Z]
[1-4]
[a-c1-3]
```
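A short Python sketch of these ranges, run against a made-up sample string, might look like this:

```python
import re

text = "abc123 XYZ 789"

upper = re.findall(r"[A-Z]", text)       # uppercase letters only
low_digits = re.findall(r"[1-4]", text)  # digits 1 through 4 only
mixed = re.findall(r"[a-c1-3]", text)    # two ranges in one class

print(upper)       # ['X', 'Y', 'Z']
print(low_digits)  # ['1', '2', '3']
print(mixed)       # ['a', 'b', 'c', '1', '2', '3']
```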

<a id="character-class-negation"></a>
### Character class negation

We can also add **negation** to character classes. For example:

```
[^a-z]
```

This means match any single character that's *NOT* `a` through `z`.
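For instance, a small Python sketch (with an illustrative sample string) shows that a negated class still matches exactly one character at a time, including spaces and digits:

```python
import re

# Everything that is not a lowercase letter matches, one character at a time
non_lower = re.findall(r"[^a-z]", "tHiS 1")
print(non_lower)  # ['H', 'S', ' ', '1']
```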

<a id="exercise-"></a>
## Exercise 1

---

<a id="what-happens-if-we-put-two-character-class-brackets-back-to-back"></a>
### What happens if we put two character class brackets back to back?

Using regexr and the text snippet from earlier, match **"That", "that"**, and **"thus"** but not **"This"** and **"this"** using the following:
- one literal
- two character classes (no negation)
- one negation in a character class.

#### Solution

`[Tt]h[^i][st]`

**Solution Breakdown**  

`[Tt]` = _'T' or 't'_  
`h`    = _'h'_  
`[^i]` = *anything that's _not_ 'i'*  
`[st]` = _'s' or 't'_  
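One way to check the solution outside of RegExr is a quick Python sketch like the one below; the `sample` string is just the first few lines of the text snippet from earlier:

```python
import re

sample = """1. This is a string
2. That is also a string
3. This is an illusion
4. THIS IS LOUD
that isn't thus
"""

# One literal (h), two plain character classes, one negated class
matches = re.findall(r"[Tt]h[^i][st]", sample)
print(matches)  # ['That', 'that', 'thus']
```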

<a id="shorthand-for-character-classes"></a>
## Shorthand for character classes
---

```
\w - matches word characters (includes digits and underscores)
\W - matches what that one doesn't - non-word characters
\d - matches digit characters
\D - matches all non-digit characters
\s - matches whitespace (including tabs and new lines)
\S - matches non-whitespace
\n - matches new lines
\r - matches carriage returns
\t - matches tabs
```

These can also be placed into brackets like so:

```
[\d\t]
[^\d\t]
```
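The sketch below tries a few of these shorthands in Python on an illustrative string containing a tab. Raw strings (`r"..."`) are used for the patterns so that Python passes the backslashes through to the regex engine unchanged:

```python
import re

text = "tHiS\tiS 100%"  # \t here is a real tab character

word_runs = re.findall(r"\w+", text)             # runs of word characters
non_word = re.findall(r"\W", text)               # tab, space, percent sign
digit_or_tab = re.findall(r"[\d\t]", text)       # shorthand inside a class
the_rest = "".join(re.findall(r"[^\d\t]", text)) # negated class with shorthand

print(word_runs)     # ['tHiS', 'iS', '100']
print(non_word)      # ['\t', ' ', '%']
print(digit_or_tab)  # ['\t', '1', '0', '0']
print(the_rest)      # tHiSiS %
```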

<a id="special-characters"></a>
## Special Characters
---

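Certain characters have a special meaning in regex and must be escaped with a backslash, "`\`", to be matched literally.

These include the following:

```
.
?
*
+
^
$
|
\
{
}
(
)
[
]
```

The hyphen, `-`, only needs escaping inside a character class, where it otherwise denotes a range. Characters such as `&`, `<`, and `>` match themselves in Python and JavaScript regex and do not need to be escaped there.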
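To illustrate why escaping matters, here is a small Python sketch on a made-up string. It also uses `re.escape`, a standard library helper that escapes special characters for you:

```python
import re

text = "file.py filexpy 100%-THE-BEST!!!"

# Unescaped, the dot is a wildcard and also matches the "x"
unescaped = re.findall(r"file.py", text)
print(unescaped)  # ['file.py', 'filexpy']

# Escaped, the dot only matches a literal "."
escaped = re.findall(r"file\.py", text)
print(escaped)  # ['file.py']

# re.escape builds a pattern that matches the given string literally
literal_pattern = re.escape("100%-THE-BEST!!!")
literal_hits = re.findall(literal_pattern, text)
print(literal_hits)  # ['100%-THE-BEST!!!']
```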

<a id="exercise-2"></a>
## Exercise 2

---

Use regexr and the text to match all the digits. Do it three ways:

```
1st with character classes
2nd with the shorthand character classes
3rd with only negation
```

#### Solution

1. `[0-9]`
2. `\d`
3. `[^\D]` **or** `[^a-zA-Z\s\%\'!\-\._]`  
>_The latter option of solution 3 is specific to our text blob as we explicitly specify the special characters to exclude._

<a id="the-dot"></a>
## The dot

---

The dot, `.`, matches any single character except, by default, a new line.
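A quick Python sketch of the dot on an illustrative string: by default it does not match a new line, and the `re.DOTALL` flag changes that:

```python
import re

text = "bob bab b_b b\nb"

# "b", any one character, "b"
default = re.findall(r"b.b", text)
print(default)  # ['bob', 'bab', 'b_b']

# With re.DOTALL the dot also matches the newline
with_dotall = re.findall(r"b.b", text, flags=re.DOTALL)
print(with_dotall)  # ['bob', 'bab', 'b_b', 'b\nb']
```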

<a id="anchors"></a>
## Anchors

---

Anchors are used to denote the start and end of a line.

```
^ - matches the start of the line
$ - matches the end of the line
```

Example:

```bash
^Now - matches the word "Now" when it occurs at the beginning of a line.  
country$ - matches the word "country" when it occurs at the end of a line.
```
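In Python, `^` and `$` anchor to the start and end of the whole string unless the `re.MULTILINE` flag is passed. A minimal sketch, using two lines from the text snippet:

```python
import re

text = "bob this is bob\nhidden bob"

# Without MULTILINE, $ only matches at the very end of the string
end_default = re.findall(r"bob$", text)
print(end_default)  # ['bob']

# With MULTILINE, $ matches at the end of every line
end_multiline = re.findall(r"bob$", text, flags=re.MULTILINE)
print(end_multiline)  # ['bob', 'bob']

# With MULTILINE, ^ matches at the start of every line
line_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)
print(line_starts)  # ['bob', 'hidden']
```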

<a id="exercise-3"></a>
## Exercise 3

---

Use an anchor and character class to find the **bab** and the **bob** at the end of the line, but not elsewhere.

#### Solution

`b[oa]b$`

<a id="modifiers"></a>
## Modifiers

---

Modifiers control the following:
    
```
g - global match (matches every occurrence in the text rather than the first)
i - case insensitivity
m - multiline (modifies how ^ and $ work)
```
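Python's `re` module has no `g` flag. Roughly speaking, `re.search` returns only the first match while `re.findall` returns all of them, and the other two modifiers map to `re.IGNORECASE` and `re.MULTILINE`. A sketch on an illustrative three-line string:

```python
import re

text = "This is a string\nTHIS IS LOUD\nthis_is_a_fiiile.py"

# First match only (like leaving "g" off)
first = re.search(r"this", text, flags=re.IGNORECASE).group()
print(first)  # This

# Every match (like "g"), ignoring case (like "i")
all_hits = re.findall(r"this", text, flags=re.IGNORECASE)
print(all_hits)  # ['This', 'THIS', 'this']

# Flags are combined with |; without MULTILINE, ^ only matches at the string start
one_line = re.findall(r"^this", text, flags=re.IGNORECASE)
every_line = re.findall(r"^this", text, flags=re.IGNORECASE | re.MULTILINE)
print(one_line)    # ['This']
print(every_line)  # ['This', 'THIS', 'this']
```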

<a id="quantifiers"></a>
## Quantifiers

---

Quantifiers adjust how many times the preceding item is matched.

```
* - zero or more
+ - one or more
? - zero or one
{n} - exactly n occurrences
{n,} - n or more occurrences
{n,m} - between n and m occurrences (inclusive)
```
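The Python sketch below applies each quantifier to the letter `o` in a made-up string, so the differences are easy to compare side by side:

```python
import re

text = "b bo boo booo"

star = re.findall(r"bo*", text)           # zero or more "o"
plus = re.findall(r"bo+", text)           # one or more "o"
optional = re.findall(r"bo?", text)       # zero or one "o"
exactly_two = re.findall(r"bo{2}", text)  # exactly two "o"
two_plus = re.findall(r"bo{2,}", text)    # two or more "o"
one_to_two = re.findall(r"i{1,2}", "fiiile")  # between one and two "i"

print(star)         # ['b', 'bo', 'boo', 'booo']
print(plus)         # ['bo', 'boo', 'booo']
print(optional)     # ['b', 'bo', 'bo', 'bo']
print(exactly_two)  # ['boo', 'boo']
print(two_plus)     # ['boo', 'booo']
print(one_to_two)   # ['ii', 'i']
```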

<a id="greedy-and-lazy-matching"></a>
## Greedy and Lazy Matching

---


`.+` and `.*` are by nature *greedy* matchers. This means they will match as many characters as possible (longest match).

This can be flipped to lazy matching (shortest match) by adding a question mark after the quantifier, for example `.+?` or `.*?`.
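A short Python sketch of the difference, using a shortened line from the text snippet:

```python
import re

text = "bob bob_ ralph_ bob"

# Greedy: runs from the first "b" to the LAST "b" it can reach
greedy = re.findall(r"b.+b", text)
print(greedy)  # ['bob bob_ ralph_ bob']

# Lazy: stops at the FIRST "b" that completes the match
lazy = re.findall(r"b.+?b", text)
print(lazy)  # ['bob', 'bob', 'bob']
```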


<a id="exercise-4"></a>
## Exercise 4
---

1. Find **bob** only if it occurs three times in a row without any spaces.
2. Find **bob** if it occurs twice in a row with or without spaces.

#### Solution

1. `(bob){3}`
2. `(bob)( )?(bob)` **or**  `(bob ?){2}`

<a id="groups-and-capturing"></a>
## Groups and capturing

---

In regex, parentheses, `()`, denote groupings. These groupings can then be quantified.

Additionally, these groups can be designated as either "capture" groups or "non-capture" groups.

To mark a group as a capture group, just put it in parentheses: `(match_phrase)`.

To mark it as a non-capturing group, mark it like the following: `(?:match_phrase)`.

Each capture group is assigned a consecutive number that may be referenced, e.g., ```$1, $2...``` (in Python, these are written `\1`, `\2`, and so on).
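In Python, numbered groups are read from a match object with `group(n)` and referenced in replacement strings as `\1`, `\2`, and so on, instead of `$1`, `$2`. The sketch below uses one line of the text snippet; the pattern itself is only an illustration:

```python
import re

text = "8888. this_is_a_fiiile.py"

# Two capture groups: the file name and the extension
mo = re.search(r"(\w+)\.(py)", text)
whole = mo.group(0)  # the entire match
name = mo.group(1)   # first capture group
ext = mo.group(2)    # second capture group
print(whole, name, ext)  # this_is_a_fiiile.py this_is_a_fiiile py

# Making the first group non-capturing leaves only one numbered group
mo_nc = re.search(r"(?:\w+)\.(py)", text)
print(mo_nc.groups())  # ('py',)

# Groups can be referenced in a replacement string with \1 and \2
swapped = re.sub(r"(\w+)\.(py)", r"\2:\1", text)
print(swapped)  # 8888. py:this_is_a_fiiile
```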

<a id="exercise-5"></a>
## Exercise 5

---

1. Run the following in regexr: ```(bob.?) (bob.?)```
2. Then click on "list" in the bottom to open the tab and try entering ```$1```
3. Now enter ```$2``` instead - what is the difference?
4. Change the code to make the first one a non-capturing group 
5. Enter ```$1``` again - what has changed?

<a id="alternation"></a>
## Alternation

---

The pipe character, `|`, can be used to denote an OR relation between alternatives, similar to the `|` operator in Python.

For example, `(bob|bab)` or `(b(o|a)b)`
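A minimal Python sketch of both forms on an illustrative string. Note that `re.findall` returns only the contents of a capture group when one is present, so the nested alternation is made non-capturing to get whole words back:

```python
import re

text = "bob bab bib"

# Alternation between two whole words
whole_words = re.findall(r"bob|bab", text)
print(whole_words)  # ['bob', 'bab']

# Alternation inside a non-capturing group
nested = re.findall(r"b(?:o|a)b", text)
print(nested)  # ['bob', 'bab']

# With a capturing group, findall returns only what the group captured
captured = re.findall(r"b(o|a)b", text)
print(captured)  # ['o', 'a']
```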

<a id="word-border"></a>
## Word border

---

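The word border (more commonly called a word boundary), `\b`, matches the position between a word character (`\w`) and a non-word character, which limits matches to the edges of words.

It can be used on both the left and the right side of the match.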
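The following Python sketch shows `\b` on an illustrative string. A raw string is needed here because in an ordinary Python string `"\b"` means a backspace character:

```python
import re

text = "bob bob_ bobby hidden bob"

# Boundary on both sides: only stand-alone "bob" ("_" counts as a word character)
both_sides = re.findall(r"\bbob\b", text)
print(both_sides)  # ['bob', 'bob']

# Boundary on the left only: any word that starts with "bob"
left_only = re.findall(r"\bbob\w*", text)
print(left_only)  # ['bob', 'bob_', 'bobby', 'bob']
```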

<a id="lookahead"></a>
## Lookahead
---

There are two types of lookaheads: positive and negative.

```    
(?=match_text) - a positive lookahead says only match the current pattern if it is followed by another pattern.
(?!match_text) - a negative lookahead is the opposite: only match if it is NOT followed by the pattern.

Examples:
that(?=guy) - only match "that" if it is followed by "guy"
these(?!guys) - only match "these" if it is NOT followed by "guys"
```
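The two examples above can be sketched in Python as follows (the sample string is made up). The lookahead text is checked but is not part of the match, which the match span makes visible:

```python
import re

text = "thatguy thatgirl theseguys thesedays"

# Positive lookahead: "that" only when "guy" comes next
positive = re.search(r"that(?=guy)", text)
print(positive.group(), positive.span())  # that (0, 4)

# Adding \w* after the lookahead shows which words qualified
pos_words = re.findall(r"that(?=guy)\w*", text)
print(pos_words)  # ['thatguy']

# Negative lookahead: "these" only when "guys" does NOT come next
neg_words = re.findall(r"these(?!guys)\w*", text)
print(neg_words)  # ['thesedays']
```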

<a id="exercise-6"></a>
## Exercise 6
---

1. Match bob only if it is followed by "_".
2. Match bob if it is followed by "_" or a new line character (hint: how do we specify "or" in regex?).
3. Match bob only if it isn't followed by a space or a new line character.

#### Solution

1. `(bob)(?=_)`
2. `(bob)(?=_|\n)`
3. `(bob)(?!( |\n))`

<a id="regex-in-python-and-pandas"></a>
## Regex in Python and Pandas

---

Let's practice using regex in Python and pandas using the string below.

In [1]:

my_string = """
I said a hip hop,
The hippie, the hippie,
To the hip, hip hop, and you don't stop, a rock it
To the bang bang boogie, say, up jump the boogie,
To the rhythm of the boogie, the beat.
"""

In [2]:
# import the regex module
import re

<a id="regex-search-method"></a>
### Regex search method

In [3]:
# search returns a match object
mo = re.search('h([io])p', my_string)

In [4]:
# everything that matches the expression
mo.group()

'hip'

In [8]:
# the match groups (like $1, $2)
mo.group(1)

'i'

<a id="regex-findall-method"></a>
### Regex findall method

In [10]:
mo = re.findall('h[io]p', my_string)

In [11]:
mo

['hip', 'hop', 'hip', 'hip', 'hip', 'hip', 'hop']

In [12]:
# findall will return only the capture groups if included
mo = re.findall('h([io])p', my_string)

In [13]:
mo

['i', 'o', 'i', 'i', 'i', 'i', 'o']

<a id="using-pandas"></a>
### Using pandas

In [16]:
fish = pd.Series(['onefish', 'twofish','redfish', 'bluefish'])
fish

0     onefish
1     twofish
2     redfish
3    bluefish
dtype: object

<a id="strcontains"></a>
### `str.contains`

In [17]:
# get all fish that start with b
fish[fish.str.contains('^b')]

3    bluefish
dtype: object

<a id="strextract"></a>
### `str.extract`

In [18]:
# extract maps capture groups to new series
fish.str.extract('(.*)fish', expand=False)

0     one
1     two
2     red
3    blue
dtype: object

<a id="independent-practice"></a>
## Independent practice
---

Pull up the following tutorials for regular expressions in Python. 

[Tutorialspoint](http://www.tutorialspoint.com/python/python_reg_expressions.htm)  
[Google Regex Tutorial](https://developers.google.com/edu/python/regular-expressions) (findall)

In the cells below, import Python's regex library and experiment with matching on the `test` string defined below.

Try some of the following:
- Match with and without case sensitivity
- Match using word borders (try "bob")
- Use positive and negative lookaheads
- Experiment with the multi-line flag
- Try matching the 2nd or 3rd instance of a repetitive pattern (ab or bob for example)
- Try using re.sub to replace a matching string
- Note the difference between search and match - what makes them different?
- What happens to the order of groups if they are nested?

In [14]:
test = """
1. This is a string

2. That is also a string

3. This is an illusion

4. THIS IS LOUD

that isn't thus

bob this is bob
bob bob_ ralph_ bobbobbobbybobbob
ababababab

6. tHiS	iS	CoFu SEd

777. THIS IS 100%-THE-BEST!!!

8888. this_is_a_fiiile.py

hidden bob

"""

<a id="extra-practice"></a>
## Extra practice

---

Pull up the site [Regex Golf](http://regex.alf.nu/) and solve as many as you can!

If you get bored with that, try [Regex Crossword](https://regexcrossword.com/)!