> **Jupyter slideshow:** This notebook can be displayed as slides. To view it as a slideshow in your browser, type the following in the console:


> `> jupyter nbconvert [this_notebook.ipynb] --to slides --post serve`


> To toggle off the slideshow cell formatting, click the `CellToolbar` button, then `View --> Cell Toolbar --> None`.

<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Regular Expressions

_Authors: Alex Combs (NYC)_

---

<a id="learning-objectives"></a>
### Learning Objectives
*After this lesson, you will be able to:*
- Define regular expressions.
- Use regular expressions to match text.
- Demonstrate how to use capturing and non-capturing groups.
- Use advanced methods such as lookaheads.

## Understanding Regex
---
- Regex → Regular expressions
- Use case: extracting/subsetting pieces of information from any text/string
- Not a library or programming language
- A regex is a sequence of characters that specifies a search pattern to apply to any given text (string)
    - Text: any mix of letters, digits, and other characters
- Examples: phone numbers, email addresses, URLs, etc.

<a id="so-what-does-a-regular-expression-look-like"></a>
## So, What Does a Regular Expression Look Like?
---

## ```/^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/```
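To make the pattern above less abstract, here is a minimal sketch of how it could be tried in Python. The surrounding `/` delimiters are dropped because they are notation from other tools, not part of the pattern itself; the email address is a made-up example.

```python
import re

# Same pattern as in the heading, minus the / delimiters
email_pattern = re.compile(r"^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$")

# Hypothetical address, used only for illustration
match = email_pattern.search("jane.doe@example.com")
print(match.groups())   # ('jane.doe', 'example', 'com')

# A string that does not fit the pattern gives no match object
no_match = email_pattern.search("not an email")
print(no_match)         # None
```

The three pairs of parentheses are capture groups, which is why the user name, domain, and top-level domain come back as separate pieces.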



<a id="where-is-regex-implemented"></a>
## Where Is `regex` Implemented?

---

Regular expressions can be run in many places: your text editor, the `bash` shell, Python, and even SQL. Support is typically built into a programming language's standard library.

***In Python, it can be imported like so:***

```python
import re
```

<a id="basic-regular-expression-syntax"></a>
## Basic Regular Expression Syntax
---

<a id="literals"></a>
### Literals
---
Literals are essentially just what you think of as **characters in a string**. For example:

```
a
b
c
X
Y
Z
1
5
100
``` 

These are all considered literals.

<a id="character-classes"></a>
### Character Classes
---

A character class is a set of **characters matched as an "or."** In other words, matching *<b>only one</b> out of several characters*.

```
[io]
```

So, this class reads as "**match either i or o**."

You can include *as many characters* as you like between the brackets.

Character classes match <b>only a single character</b> _because of the "or" criteria_.
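As a quick illustration in Python (the sample string is made up), the class `[io]` sits between two literals and matches exactly one character:

```python
import re

text = "hip hop hap"

# [io] matches one character: either "i" or "o", so "hap" is skipped
matches = re.findall(r"h[io]p", text)
print(matches)   # ['hip', 'hop']
```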

<a id="character-classes-can-also-accept-certain-ranges"></a>

### Character Classes + Ranges
---

Character classes can also accept ranges, as shown below:
    
```
[a-f] --> this still matches a "single lower case character", between the range: a to f
[a-z] --> this still matches a "single lower case character", between the range: a to z
[A-Z] --> this still matches a "single upper case character", between the range: A to Z
[a-zA-Z] --> this still matches a "single character either lower case or upper case", but between the range: a to z or A to Z
[1-4] --> a single digit in the range 1 to 4
[a-c1-3] --> a single lowercase character in the range a to c, or a single digit in the range 1 to 3
```
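A small Python sketch of a few of these ranges, run against a made-up sample string:

```python
import re

sample = "aB3 xZ9"

uppercase = re.findall(r"[A-Z]", sample)     # one uppercase letter at a time
letters = re.findall(r"[a-zA-Z]", sample)    # lowercase or uppercase letters
mixed = re.findall(r"[a-c1-3]", sample)      # a to c, or 1 to 3

print(uppercase)  # ['B', 'Z']
print(letters)    # ['a', 'B', 'x', 'Z']
print(mixed)      # ['a', '3']
```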

<a id="character-class-negation"></a>
### Character Class Negation
---

We can also add **negation** to character classes. For example:

```
[^a-z]
```

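This means match *ANY single character* that is *NOT* a lowercase letter `a` through `z`. So it is the **inverse** of the character class without negation. The `^` only negates when it is the first character inside the brackets.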
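A short Python check of the negated class, using a made-up string:

```python
import re

text = "abc 123!"

# Everything that is not a lowercase letter, one character per match
non_lowercase = re.findall(r"[^a-z]", text)
print(non_lowercase)   # [' ', '1', '2', '3', '!']
```

Note that the space counts too: negation matches anything outside the class, not just letters and digits.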

<a id="shorthand-for-character-classes"></a>
### Shorthand for Character Classes
---

```
\w - Matches word characters (includes digits and underscores)
\W - Matches what \w doesn't — non-word characters
\d - Matches all digit characters
\D - Matches all non-digit characters
\s - Matches whitespace (including tabs)
\S - Matches non-whitespace
\n - Matches new lines
\r - Matches carriage returns
\t - Matches tabs
```

These can also be placed into brackets like so:

```
[\d\t] --> matches single digits or tabs
[^\d\t] --> matches anything other than digits or tabs
```

<a id="special-characters"></a>
### Special Characters
---

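Certain characters have a special meaning in a pattern. To match them literally, escape them by putting a backslash (`\`) in front of them.

These include the following:

```
. --> for example, to match a literal . you must write \.
?
\ --> a literal \ is written as \\
{
}
(
)
[
]
+
*
^
$
|
```

The hyphen (`-`) only needs escaping inside a character class, where it otherwise denotes a range. Characters such as `&`, `<`, and `>` have no special meaning on their own in Python's `re` syntax and do not need a backslash there.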
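The effect of escaping can be seen in a small Python sketch. The strings are made up, and `re.escape()` is shown as one convenient way to add the backslashes automatically:

```python
import re

prices = "3.50 3x50"

# Unescaped, the dot is a wildcard, so "3x50" matches as well
unescaped = re.findall(r"3.50", prices)
# Escaped, the dot only matches a literal "."
escaped = re.findall(r"3\.50", prices)

print(unescaped)  # ['3.50', '3x50']
print(escaped)    # ['3.50']

# re.escape() builds a pattern that matches the given text literally
literal_pattern = re.escape("(approx.)")
found = re.search(literal_pattern, "Total: 3.50 (approx.)")
print(found.group())  # (approx.)
```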

<a id="the-dot"></a>
### The Dot

---

The dot matches **any single character** except, by default, a newline.
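In Python, the dot skips newlines unless the `re.DOTALL` flag is set. A minimal sketch with a made-up string:

```python
import re

text = "a1c a\nc"   # the second "a...c" has a newline in the middle

default = re.findall(r"a.c", text)
with_dotall = re.findall(r"a.c", text, flags=re.DOTALL)

print(default)      # ['a1c']
print(with_dotall)  # ['a1c', 'a\nc']
```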

<a id="anchors"></a>
### Anchors

---

Anchors are used to denote the start and end of a line.

```
^ - Matches the START of the line.
$ - Matches the END of the line
```

Example:

```bash
^Now - Matches the word "Now" when it occurs at the BEGINNING of a line.  
country$ - Matches the word "country" when it occurs at the END of a line.
```

**You can also anchor a match to the beginning or end of a word with `\b`.**
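One detail worth knowing when moving to Python: `^` and `$` anchor to the whole string unless `re.MULTILINE` is passed. The sketch below uses a few short lines in the spirit of the lesson's sample text:

```python
import re

lines = "bob this is bob\nhidden bob\nbobby"

# Without re.MULTILINE, ^ only matches at the very start of the string
whole = re.findall(r"^bob", lines)

# With re.MULTILINE, ^ and $ match at the start and end of every line
starts = re.findall(r"^bob", lines, flags=re.MULTILINE)
ends = re.findall(r"bob$", lines, flags=re.MULTILINE)

print(whole)   # ['bob']
print(starts)  # ['bob', 'bob']  (lines 1 and 3)
print(ends)    # ['bob', 'bob']  (lines 1 and 2)
```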

<a id="exploring-regex"></a>
## Exploring `regex` 
*(The following exercises use [RegEx101](https://regex101.com/), a platform for writing and exploring `regex` that visualizes how a pattern matches text. We'll cover a couple of initial examples, then move on to the Python implementation. As a general tip, you will often need to search online for a regex that fits your task; you can try candidates by typing them into the REGULAR EXPRESSION bar on the site.)*

---

 

- Copy the text from the cell below into the body of the website linked above. *(paste within TEST STRING)*

- Make sure to click on flags to the right of the main text field, and check that **g** and **m** are clicked. *(check on extreme right of REGULAR EXPRESSION bar to confirm /gm is shown)*

```
1. This is a string
2. That is also a string
3. This is an illusion
4. THIS IS LOUD
that isn't thus
bob this is bob
bob bob_ ralph_ bobbobbobbybobbob
ababababab
6. tHiS	iS	CoFu SEd
777. THIS IS 100%-THE-BEST!!!
8888. this_is_a_fiiile.py
hidden bob
```

<a id="exercise-"></a>
### Exercise #1

---

<a id="what-happens-if-we-put-two-character-class-brackets-back-to-back"></a>
### What Happens If We Put Two Character Class Brackets Back to Back?

Match **"That", "that"**, and **"thus"** — but not **"This"** and **"this"** 

<a id="exercise-2"></a>
### Exercise #2

---

Use RegEx101 and our text snippet to match all digits. Do this three ways:

```
- First, with character classes
- Second, with the shorthand character classes
- Third, using only negation
```

<a id="exercise-3"></a>
### Exercise #3

---

1. Use an anchor and a character class to find the **bab** and the **bob** at the end of the line, but not elsewhere.
2. Match all numbers at the beginning of a line

## More Regular Expression Syntax
---

<a id="modifiers"></a>
### Modifiers

---

Modifiers control the following:
    
```
g - Global match (matches every occurrence in the text, rather than just the first)
i - Case insensitivity
m - Multiline (modifies how ^ and $ work)
```
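These modifiers map onto Python roughly as follows. Python's `re` module has no `g` flag: `re.search()` returns only the first match, while `re.findall()` returns all of them. The `i` and `m` modifiers correspond to `re.IGNORECASE` and `re.MULTILINE`. A small sketch with a made-up string:

```python
import re

text = "This is a string\nthis is loud"

# No "g": re.search stops at the first (case-sensitive) match
first_only = re.search(r"this", text).group()

# "g" + "i": all matches, ignoring case
ignore_case = re.findall(r"this", text, flags=re.IGNORECASE)

# "m" changes ^ from "start of string" to "start of each line"
string_start = re.findall(r"^this", text, flags=re.IGNORECASE)
line_starts = re.findall(r"^this", text, flags=re.IGNORECASE | re.MULTILINE)

print(first_only)    # this
print(ignore_case)   # ['This', 'this']
print(string_start)  # ['This']
print(line_starts)   # ['This', 'this']
```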

<a id="quantifiers"></a>
### Quantifiers

---

Quantifiers adjust **how many** items are matched.

```
* - Zero or more
+ - One or more
? - Zero or one
{n} - Exactly 'n' number
{n,} - Matches 'n' or more occurrences
{n,m} - Between 'n' and 'm'
```
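To see how each quantifier behaves, this sketch applies them to the letter `o` in a made-up string:

```python
import re

text = "b bo boo booo"

zero_or_more = re.findall(r"bo*", text)
one_or_more = re.findall(r"bo+", text)
zero_or_one = re.findall(r"bo?", text)
exactly_two = re.findall(r"bo{2}", text)
two_or_more = re.findall(r"bo{2,}", text)
one_to_two = re.findall(r"\bbo{1,2}\b", text)   # \b keeps "booo" from matching partially

print(zero_or_more)  # ['b', 'bo', 'boo', 'booo']
print(one_or_more)   # ['bo', 'boo', 'booo']
print(zero_or_one)   # ['b', 'bo', 'bo', 'bo']
print(exactly_two)   # ['boo', 'boo']
print(two_or_more)   # ['boo', 'booo']
print(one_to_two)    # ['bo', 'boo']
```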

### Exercise 
_[optional: try during flextime]_

---
Copy the following molecular formulas into Regex101. Create a regular expression to isolate each atom (with their number, if provided).

```
CH4 (Methane)
H2O (Water)
HCl (Hydrochloric Acid)
C3H8 (Propane)
ClF3 (Chlorine trifluoride)
Cl2O7 (Dichlorine heptoxide)
```

<a id="greedy-and-lazy-matching"></a>
### Greedy and Lazy Matching

---


By nature, `.+` and `.*` are *greedy* matchers. This means they will **match as many characters as possible** (i.e., the longest match).

This can be flipped to *lazy* matching (the shortest match) by adding a question mark after the quantifier: `.+?` or `.*?`.
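The difference is easiest to see on text with repeated delimiters. In this sketch the HTML-like string is made up for illustration:

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ runs to the last ">" in the string
greedy = re.findall(r"<.+>", html)
# Lazy: .+? stops at the first ">" it can
lazy = re.findall(r"<.+?>", html)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```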


<a id="groups-and-capturing"></a>
### Groups and Capturing

---

In `regex`, parentheses — `()` — denote *groupings*. These groups can then be quantified.

Additionally, these groups can be designated as either "capture" or "non-capture."

To mark a group as a capture group, just put it in parentheses: `(match_phrase)`.

To mark it as a non-capture group, punctuate it like so: `(?:match_phrase)`.

Each capture group is assigned a consecutive number that can be referenced later (e.g., `$1`, `$2`, ... in many regex tools, or `\1`, `\2`, ... in Python).
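A minimal Python sketch of capturing, non-capturing, and quantified groups. The date string is a made-up example:

```python
import re

date = "2018-05-25"

# Three capture groups: year, month, day
captured = re.search(r"(\d{4})-(\d{2})-(\d{2})", date).groups()

# (?:...) still groups, but the year is no longer captured
partly = re.search(r"(?:\d{4})-(\d{2})-(\d{2})", date).groups()

# Groups can be quantified: "ab" repeated one or more times
repeated = re.search(r"(?:ab)+", "ababababab").group()

# In re.sub, \1, \2, \3 refer back to the numbered capture groups
reordered = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", date)

print(captured)   # ('2018', '05', '25')
print(partly)     # ('05', '25')
print(repeated)   # ababababab
print(reordered)  # 25/05/2018
```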

<a id="exercise-5"></a>
### Exercise
_[optional: try during flextime]_

---

The following is a list of facts about the Boston Celtics, Atlanta Hawks, NY Knicks, Chicago Bulls and San Antonio Spurs, as they appear on https://www.basketball-reference.com/

### East Coast

Celtics:
```
Record: 55-27, 2nd in NBA Eastern Conference
Last Game: W 96-83 vs. CLE
Next Game: Friday, May. 25 at CLE
Coach: Brad Stevens (55-27)
Executive: Danny Ainge
PTS/G: 104.0 (20th of 30) Opp PTS/G: 100.4 (3rd of 30)
SRS: 3.23 (7th of 30) Pace: 96.0 (22nd of 30)
Off Rtg: 107.6 (18th of 30) Def Rtg: 103.9 (1st of 30)
```


Hawks:
```
Record: 24-58, 15th in NBA Eastern Conference
Last Game: L 113-121 vs. PHI
Coach: Mike Budenholzer (24-58)
Executive: Travis Schlenk
PTS/G: 103.4 (25th of 30) Opp PTS/G: 108.8 (23rd of 30)
SRS: -5.30 (26th of 30) Pace: 98.3 (10th of 30)
Off Rtg: 105.0 (26th of 30) Def Rtg: 110.6 (21st of 30)
Expected W-L: 27-55 (26th of 30)
```


Spurs:
```
Record: 47-35, 7th in NBA Western Conference
Last Game: L 91-99 at GSW
Coach: Gregg Popovich (47-35)
Executive: R.C. Buford
PTS/G: 102.7 (27th of 30) Opp PTS/G: 99.8 (1st of 30)
SRS: 2.89 (8th of 30) Pace: 95.0 (28th of 30)
Off Rtg: 107.9 (17th of 30) Def Rtg: 104.8 (3rd of 30)
Expected W-L: 49-33 (8th of 30)
```


Knicks:
```
Record: 29-53, 11th in NBA Eastern Conference
Last Game: W 110-98 at CLE
Coach: Jeff Hornacek (29-53)
Executive: Steve Mills
PTS/G: 104.5 (18th of 30) Opp PTS/G: 108.0 (20th of 30)
SRS: -3.53 (23rd of 30) Pace: 96.8 (15th of 30)
Off Rtg: 107.1 (20th of 30) Def Rtg: 110.7 (23rd of 30)
Expected W-L: 32-50 (23rd of 30)
```


Bulls:
```
Record: 27-55, 13th in NBA Eastern Conference
Last Game: L 87-119 vs. DET
Coach: Fred Hoiberg (27-55)
Executive: Gar Forman
PTS/G: 102.9 (26th of 30) Opp PTS/G: 110.0 (27th of 30)
SRS: -6.84 (29th of 30) Pace: 98.3 (9th of 30)
Off Rtg: 103.7 (28th of 30) Def Rtg: 110.8 (24th of 30)
Expected W-L: 23-59 (28th of 30)
```

### West Coast
Seattle :(
```
Record: 48-34, 4th in NBA Western Conference
Last Game: L 91-96 at UTA
Coach: Billy Donovan (48-34)
Executive: Sam Presti
PTS/G: 107.9 (12th of 30) Opp PTS/G: 104.4 (10th of 30)
SRS: 3.42 (6th of 30) Pace: 96.7 (17th of 30)
Off Rtg: 110.7 (7th of 30) Def Rtg: 107.2 (9th of 30)
Expected W-L: 50-32 (7th of 30)
```

Denver:
```
Record: 46-36, 9th in NBA Western Conference
Last Game: L 106-112 at MIN
Coach: Mike Malone (46-36)
Executive: Tim Connelly
PTS/G: 110.0 (6th of 30) Opp PTS/G: 108.5 (22nd of 30)
SRS: 1.57 (11th of 30) Pace: 96.8 (16th of 30)
Off Rtg: 112.5 (6th of 30) Def Rtg: 110.9 (25th of 30)
Expected W-L: 45-37 (11th of 30)
```

Golden State:
```
Record: 58-24, 2nd in NBA Western Conference
Last Game: L 92-95 vs. HOU
Next Game: Thursday, May. 24 at HOU
Coach: Steve Kerr (58-24)
Executive: Bob Myers
PTS/G: 113.5 (1st of 30) Opp PTS/G: 107.5 (18th of 30)
SRS: 5.79 (3rd of 30) Pace: 99.6 (5th of 30)
Off Rtg: 113.6 (3rd of 30) Def Rtg: 107.7 (11th of 30)
```

L.A.:
```
Record: 35-47, 11th in NBA Western Conference
Last Game: W 115-100 at LAC
Coach: Luke Walton (35-47)
Executive: Magic Johnson
PTS/G: 108.1 (11th of 30) Opp PTS/G: 109.6 (25th of 30)
SRS: -1.44 (21st of 30) Pace: 100.3 (2nd of 30)
Off Rtg: 106.5 (23rd of 30) Def Rtg: 108.0 (12th of 30)
Expected W-L: 37-45 (21st of 30)
```

Using match groups, create regular expressions to isolate the following:
1. The executive's name
2. The coach's name
3. Number of wins
4. Number of losses
5. Their conference
6. Their rank within their respective conference
7. Pace
8. The date of their next game
9. Points allowed (Opp PTS/G)
10. Rank of points allowed

<a id="alternation"></a>
### Alternation

---

The pipe character — `|` — can be used to denote an OR relation, just like in Python.

For example, `(bob|bab)` or `(b(o|a)b)`
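Both forms can be checked in Python. One thing to watch, shown in this sketch with a made-up string: when the pattern contains a capture group, `re.findall()` returns the group contents rather than the whole match, so a non-capture group is often the better fit.

```python
import re

text = "bob bab bib"

whole_words = re.findall(r"bob|bab", text)        # alternation between full words
inner_choice = re.findall(r"b(?:o|a)b", text)     # alternation inside a non-capture group
captured_only = re.findall(r"b(o|a)b", text)      # capture group: only the middle letter comes back

print(whole_words)    # ['bob', 'bab']
print(inner_choice)   # ['bob', 'bab']
print(captured_only)  # ['o', 'a']
```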

<a id="word-border"></a>
### Word Border

---

The word border — `\b` — limits matches to those that mark the boundaries of words.

These borders can be used on both the left and right sides of the match.
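A short Python sketch of `\b` on the left, the right, and both sides. The sample string is made up; note that the underscore counts as a word character, so `bob_` has no boundary after `bob`.

```python
import re

text = "bob bobby kabob bob_"

anywhere = re.findall(r"bob", text)          # every occurrence
left_only = re.findall(r"\bbob", text)       # "bob" at the start of a word
right_only = re.findall(r"bob\b", text)      # "bob" at the end of a word
both_sides = re.findall(r"\bbob\b", text)    # "bob" as a whole word

print(len(anywhere))    # 4
print(len(left_only))   # 3  (bob, bobby, bob_)
print(len(right_only))  # 2  (bob, kabob)
print(both_sides)       # ['bob']
```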

### Exercise

_[optional: try during flextime]_

---
Look at the documentation for sklearn's `CountVectorizer` class. 

1. Explain the regex that the vectorizer uses to split up words
2. What might be some flaws with the splitting logic?
3. Create your own tokenizer regex and defend your rationale

<a id="lookahead"></a>
### Lookahead
---

There are two types of lookaheads: positive and negative.

```    
(?=match_text) - A positive lookahead says, "only match the current pattern if it is followed by another pattern."
(?!match_text) - A negative lookahead says the opposite.

Examples:
- that(?=guy) - Only match "that" if it is followed by "guy."
- these(?!guys) - Only match "these" if it is NOT followed by "guys."
- (?=.*guys)this - Only match "this" if "guys" appears somewhere after it on the same line

```
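The three examples above can be run in Python as a quick check. The test strings are made up; the point to notice is that a lookahead only inspects the following text and never includes it in the match.

```python
import re

# Positive lookahead: "that" matches only where "guy" follows
followed = re.findall(r"that(?=guy)", "thatguy thatgirl")

# Negative lookahead: record where "these" matches (only before "gals")
not_followed = [m.start() for m in re.finditer(r"these(?!guys)", "theseguys thesegals")]

# Lookahead placed first: "guys" must appear later on the same line
with_guys = re.search(r"(?=.*guys)this", "this is for the guys")
without_guys = re.search(r"(?=.*guys)this", "this is for everyone")

print(followed)           # ['that']
print(not_followed)       # [10]
print(with_guys.group())  # this
print(without_guys)       # None
```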

### Exercise
_[optional: try during flextime]_

---
From Codewars: https://www.codewars.com/kata/regex-password-validation

Copy the following passwords into Regex101. 

```
Valid:
djI38D55
fjd3IR9
4fdg5Fj3
123abcABC
ABC123abc
Password123

Invalid:
ghdfj32
DSJKHD23
dsF43
DHSJdhjsU
fjd3IR9.;
fjd3  IR9
a2.d412
JHD5FJ53
!fdjn345
jfkdfj3j
123
abc
```

Create a regex to highlight only valid passwords. A valid password must:

1. Be at least six characters long (ONLY letters and numbers)
2. Contain a lowercase letter
3. Contain an uppercase letter
4. Contain a number

**NOTE**: Make sure the multiline option is set in Regex101


<a id="exercise-6"></a>
### Exercise
_[optional: try during flextime]_

---
Using the following text:

```
1. This is a string
2. That is also a string
3. This is an illusion
4. THIS IS LOUD
that isn't thus
bob this is bob
bob bob_ ralph_ bobbobbobbybobbob
ababababab
6. tHiS    iS    CoFu SEd
777. THIS IS 100%-THE-BEST!!!
8888. this_is_a_fiiile.py
hidden bob
```

1. Match **bob** only if it is followed by "_".
2. Match **bob** if it is followed by "_" or a new line character (Hint: How do we specify "or" in `regex`?).
3. Match **bob** only if it isn't followed by a space or a new line character.
4. Match **bob** only if the letter y occurs somewhere in the word

<a id="regex-in-python-and-pandas"></a>
## Regex in Python and `pandas`

---

Let's practice working with `regex` in Python and `pandas` using the string below.

In [1]:
my_string = """
I said a hip hop,
The hippie, the hippie,
To the hip, hip hop, and you don't stop, a rock it
To the bang bang boogie, say, up jump the boogie,
To the rhythm of the boogie, the beat.
"""

In [2]:
# Import Python's `re` (regex) module, as shown at the beginning of this lesson
import re

#### Popular Python `regex` methods:
- search()
- findall()
- sub()

#### Pandas:
- str.contains
- str.extract

<a id="regex-search-method"></a>
### The `re.search()` Method

In [3]:
# `.search()` returns a match object if there is a match anywhere in the string
# () --> Capture and group
mo = re.search('h([io])p', my_string)

In [4]:
# .group() lets us pick out parts of the matching text
# calling it with no arguments, as below, returns the whole matched text
mo.group() # returns 'hip': [io] matches either i or o, and 'hip' appears before 'hop' in my_string

'hip'

In [5]:
# The match groups: group(0) is the same as group() and returns the entire match
# group(1) returns the first parenthesized subgroup; use group(2), group(3), ... when the pattern has more groups
mo.group(1) # only 'i' is returned, the part matched by the capture group ([io])

'i'

<a id="regex-findall-method"></a>
### The `re.findall()` Method

In [6]:
# returns a list containing all matches
# below matches ALL occurrences of 'hip' in the string 'my_string', including those inside the word 'hippie'
test = re.findall('hip', my_string)
test

['hip', 'hip', 'hip', 'hip', 'hip']

In [7]:
# matching all occurrences of 'hip' and 'hop', in the order they occur in the string 'my_string'
mo = re.findall('h[io]p', my_string)
mo

['hip', 'hop', 'hip', 'hip', 'hip', 'hip', 'hop']

In [8]:
# when the pattern contains a capture group (), .findall returns only the captured part of each match
mo = re.findall('h([io])p', my_string)
mo

['i', 'o', 'i', 'i', 'i', 'i', 'o']

<a id="regex-findall-method"></a>
### The `re.sub()` Method
---

`re.sub()` works like a *find and replace*. The syntax is as follows:

`re.sub(regex, replacement, string_to_search)`

where:
- `regex` is the pattern to find
- `replacement` can either be a string or a function (such as a lambda) that accepts a match object as a parameter
- `string_to_search` is the source string in which to find and replace, like `my_string` above

In [9]:
# the example below replaces every whitespace character in the string 'txt' with the digit '9'
# as introduced above, \s matches whitespace
txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt) # raw string (r"...") so Python passes the backslash through to the regex engine

print(f'string pre-replacement: {txt}')
print(f'string post-replacement: {x}')

string pre-replacement: The rain in Spain
string post-replacement: The9rain9in9Spain


#### We can build on this to write more complex regex, as in this exercise:

Let's use `re.sub` to remove the params (everything after the ?) from the URL below, transforming it:

From:
```python
'https://www.google.com/search?q=data+science+jobs'
```

To:
```python
'https://www.google.com/search'
```

In [10]:
sfrom = 'https://www.google.com/search?q=data+science+jobs'
# in our regex, special characters like '?' need to be escaped with '\', as introduced above
# once the '?' has been located, '.*' does the 'greedy match' introduced previously,
# matching as many characters as possible

sto = re.sub(r"\?.*", "", sfrom)
sto

'https://www.google.com/search'

#### Another Exercise example:
---

Use the `re.sub()` method to convert this string:

```python
'The-quick-brOwn_fox_juMped_over_The-lazy-dog'
```

Into camelcase:
```python
'theQuickBrownFoxJumpedOverTheLazyDog'
```

In [11]:
txt = 'The-quick-brOwn_fox_juMped_over_The-lazy-dog'
# multiple steps are required here, as we can see the capitalization is incorrect and,
# there are special characters like dashes and underscores to be managed as well

txt = re.sub(r"[A-Z]", lambda x: x.group().lower(), txt) # first, match uppercase letters and convert them to lowercase
txt

'the-quick-brown_fox_jumped_over_the-lazy-dog'

In [12]:
txt = re.sub(r"(?<=[-_])\w", lambda x: x[0].upper(), txt) # next, capitalize the character after each - or _ (a lookbehind) to get camelcase output
txt

'the-Quick-Brown_Fox_Jumped_Over_The-Lazy-Dog'

In [13]:
txt = re.sub(r"[-_]", "", txt) # finally, remove any dashes and underscores
txt

'theQuickBrownFoxJumpedOverTheLazyDog'

<a id="using-pandas"></a>
### Using `pandas`

In [14]:
import pandas as pd
fish = pd.Series(['onefish', 'twofish','redfish', 'bluefish']) # defining a Series dtype
fish

0     onefish
1     twofish
2     redfish
3    bluefish
dtype: object

<a id="strcontains"></a>
#### `str.contains`

In [15]:
# Get all fish that start with "b"
fish[fish.str.contains('^b')] # '^' is the anchor introduced earlier, matching the start of the string

3    bluefish
dtype: object

<a id="strextract"></a>
#### `str.extract`

In [16]:
# `.extract()` maps capture groups to new Series.
# below regex will do a 'greedy match' to capture whatever precedes the word 'fish'
fish.str.extract('(.*)fish', expand=False) # expand=True would return the results as a one-column DataFrame

0     one
1     two
2     red
3    blue
dtype: object

<a id="independent-practice"></a>
## Independent Practice
_[optional: try during flextime]_

---

1. Load in the Titanic dataset from Kaggle (train.csv).
2. Extract the title (Miss, Mr, Mme, etc) from the passenger's name into its own column

In [18]:
df = pd.read_csv('train.csv')
df['title'] = df['Name'].str.extract(r'(\w*(?=\.))', expand=True) # word characters directly followed by a period
df['title'].head(10)

0        Mr
1       Mrs
2      Miss
3       Mrs
4        Mr
5        Mr
6        Mr
7    Master
8       Mrs
9       Mrs
Name: title, dtype: object

In [19]:
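# collapse a word that is immediately repeated; \1 is a backreference to the first capture group
# the raw-string version (first line) and the double-backslash version (second line) are equivalent
print(re.sub(r'(\b\w+)(\s+\1\b)+', r'\1', 'hello     there      there'))
print(re.sub('(\\b\\w+)(\\s+\\1\\b)+', '\\1', 'hello     there      there'))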

hello     there
hello     there


<a id="extra-practice"></a>
### Extra Practice
_[optional: try during flextime]_

---

Pull up the [Regex Golf](http://regex.alf.nu/) website and solve as many as you can!

If you get bored, try [Regex Crossword](https://regexcrossword.com/).

# Final Advice
- Regex is one of those tools that you will occasionally need when working with text data.
- Most of the time, you can search Google for a regex pattern that does what you want and try out examples from Stack Overflow.
- As we saw in the 'The-quick-brOwn_fox_juMped_over_The-lazy-dog' example, you will get better search results if you break your task down into smaller components and search for each component independently.