# Regular Expressions

## Announcements

## Review

## Looking Back

*You're killing it!*

- SQL, Relational Databases
   - Selecting, sorting, limiting, joins
   - Using SQLite in Jupyter with `%sql`

- Data manipulation with Pandas
   - DataFrames and Series
   - Importing from and exporting to SQL
   - Pulling tables from web
   - Selection, sorting, counting

- Split-Apply-Combine
   - Groupby in Pandas

- Visualization

- Semi-Structured data in MongoDB
    - JSON
    - selection, sorting
    - Aggregations
    - MapReduce concepts

- String pattern matching and extraction with regular expressions

- Next week
    - Advanced Pandas (rolling, dates)
    - Web scraping

## Final Project Updates

## <center>Regular Expressions</center>
### <center>aka *regex*</center>

### Overview

Regular Expressions help you work with strings

*Pattern Matching*

e.g. Find all phone numbers on a web page

*Manipulation*

e.g. Match "{Lastname}, {Firstname}" in a set of records and rewrite it as "{Firstname} {Lastname}"
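A minimal sketch of that rewrite in Python, using `re.sub` with two capturing groups. The sample names are made up for illustration, and the pattern assumes each record is exactly one word, a comma and a space, then one word.

```python
import re

# Hypothetical sample records in "Lastname, Firstname" form
records = ["Lovelace, Ada", "Hopper, Grace", "Turing, Alan"]

# Each (\w+) captures a run of word characters;
# \1 and \2 in the replacement refer back to those captures
pattern = r'^(\w+), (\w+)$'
rewritten = [re.sub(pattern, r'\2 \1', record) for record in records]

print(rewritten)
# ['Ada Lovelace', 'Grace Hopper', 'Alan Turing']
```

Names with spaces or hyphens would need a more permissive pattern than `\w+`.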

## Why?

- Checking whether an input is valid (e.g. password, phone number, email)
- Cleaning data
- More complex data subsetting
- Working with user inputs or other unstructured data
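As a rough illustration of the first point, input validation can be sketched with `re.fullmatch`, which only succeeds when the whole string fits the pattern. The helper name `looks_like_phone` and the `555-555-5555` format are assumptions made for this example, not a general phone number validator.

```python
import re

def looks_like_phone(value):
    """Return True if value has the assumed form ddd-ddd-dddd."""
    # \d matches one digit; fullmatch requires the entire string to match
    return re.fullmatch(r'\d\d\d-\d\d\d-\d\d\d\d', value) is not None

print(looks_like_phone("303-555-0142"))       # True
print(looks_like_phone("303-555-014"))        # False
print(looks_like_phone("call 303-555-0142"))  # False
```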

### Q: Where can you use regular expressions?

### A: Many, many places!

## In Python

In [1]:
import re
comment = "It was a dark and stormy night."

Find a simple string:

In [2]:
re.findall('dark', comment)

['dark']

Find all sequences of one or more word characters:

In [3]:
re.findall(r'\w+', comment)

['It', 'was', 'a', 'dark', 'and', 'stormy', 'night']

## In SQL

SQLite has no built-in regular expression function by default, but other databases do.

**MySQL**

Select rows where the column contains only alphanumeric characters:

```
SELECT * FROM table WHERE column REGEXP '^[A-Za-z0-9]+$';
```

**PostgreSQL**

Match strings that include foo, bar, or baz:

```
SELECT * FROM table WHERE value ~ 'foo|bar|baz';
```
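For SQLite, one possible workaround from Python is to register your own function under the name `REGEXP` through the standard `sqlite3` module. The sketch below is only that: the `items` table and its rows are invented for illustration, and the helper simply delegates to `re.search`.

```python
import re
import sqlite3

def regexp(pattern, value):
    # SQLite evaluates "value REGEXP pattern" by calling regexp(pattern, value)
    return value is not None and re.search(pattern, value) is not None

conn = sqlite3.connect(":memory:")
conn.create_function("REGEXP", 2, regexp)

# Hypothetical table and data, just to have something to query
conn.execute("CREATE TABLE items (name TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?)",
    [("abc123",), ("hello world",), ("foo_bar",), ("XYZ",)],
)

rows = conn.execute(
    "SELECT name FROM items WHERE name REGEXP '^[A-Za-z0-9]+$' ORDER BY name"
).fetchall()
print(rows)
# [('XYZ',), ('abc123',)]
```

The registration lasts only for that connection, so it has to be repeated each time a new connection is opened.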

## In Pandas

In [4]:
import pandas as pd
movies = pd.read_csv('https://raw.githubusercontent.com/organisciak/Scripting-Course/master/data/movielens_small.csv')
movies.sample()

Unnamed: 0,userId,rating,title,genres,timestamp,year
52455,150,2.5,Deep Blue Sea,Action,1114308289,1999


Find movies where there is a digit (`\d`) right before the end of the string (`$`):

In [64]:
matches = movies.title.str.contains(r'\d$')
movies[matches].sample(10)

Unnamed: 0,userId,rating,title,genres,timestamp,year
73693,654,4.0,Predator 2,Action,1145393723,1990
97753,452,1.0,Smokey and the Bandit III,Action,1016590031,1983
47841,339,3.5,Die Hard 2,Action,1446663933,1990
77626,467,3.0,54,Drama,939063634,1998
91148,468,2.5,$9.99,Animation,1296189776,2008
41665,165,0.5,Mission: Impossible II,Action,1111479113,2000
2108,282,4.5,Apollo 13,Adventure,1111493750,1995
41352,481,3.0,Harry Potter and the Deathly Hallows: Part 1,Action,1437003549,2010
20206,550,2.0,Lethal Weapon 4,Action,943373045,1998
84657,457,2.5,Friday the 13th Part 2,Horror,1471385314,1981


Find movies where the substring ' Part ' exists:

In [6]:
matches = movies.title.str.contains(' Part ')
movies[matches].sample(10)

Unnamed: 0,userId,rating,title,genres,timestamp,year
41898,150,2.5,Father of the Bride Part II,Comedy,1114308628,1995
41356,570,4.5,Harry Potter and the Deathly Hallows: Part 1,Action,1475783785,2010
73544,654,5.0,History of the World: Part I,Comedy,1145394077,1981
41937,650,3.0,Father of the Bride Part II,Comedy,844883711,1995
25179,494,5.0,"Godfather: Part II, The",Crime,1342747453,1974
25123,215,4.0,"Godfather: Part II, The",Crime,860561181,1974
84670,260,3.0,Friday the 13th Part VI: Jason Lives,Horror,1207886252,1986
50272,232,3.0,Back to the Future Part III,Adventure,955086621,1990
50314,518,3.0,Back to the Future Part III,Adventure,945364886,1990
71101,564,5.0,Wes Craven's New Nightmare (Nightmare on Elm S...,Drama,974716031,1994


Find movies that are named "The ... of ..."

In [65]:
matches = movies.title.str.contains('^The .+ of ')
movies[matches].sample(10)

Unnamed: 0,userId,rating,title,genres,timestamp,year
95699,299,4.5,The Lair of the White Worm,Comedy,1344178630,1988
36099,212,4.0,The Count of Monte Cristo,Action,1228789284,2002
68258,624,3.5,The Theory of Everything,Drama,1449334366,2014
36111,580,2.5,The Count of Monte Cristo,Action,1167160889,2002
36105,386,3.0,The Count of Monte Cristo,Action,1047028511,2002
68254,333,4.0,The Theory of Everything,Drama,1441197950,2014
36114,607,4.0,The Count of Monte Cristo,Action,1151425776,2002
68256,378,3.5,The Theory of Everything,Drama,1443292443,2014
68249,73,4.0,The Theory of Everything,Drama,1457597352,2014
36104,382,2.5,The Count of Monte Cristo,Action,1371825566,2002


## In MongoDB

In [50]:
from pymongo import MongoClient
client = MongoClient()
db = client.week7
collection = db.cooking

Find a recipe with an ingredient called "yellow ..."

In [55]:
collection.find_one({
    "ingredients": {"$regex": "yellow .*"}
})

{'_id': ObjectId('5af1b7634b6d022f8c977dd0'),
 'cuisine': 'southern_us',
 'id': 25693,
 'ingredients': ['plain flour',
  'ground pepper',
  'salt',
  'tomatoes',
  'ground black pepper',
  'thyme',
  'eggs',
  'green tomatoes',
  'yellow corn meal',
  'milk',
  'vegetable oil']}

After unwinding the recipes to one doc per ingredient, find ingredients with a qualified salt:

In [63]:
pipeline = [
    { "$unwind": "$ingredients" },
    { "$project": {"ingredients": 1, "_id":0} },
    { "$match":{
        "ingredients": {"$regex": "^.+ salt" }
        }
    },
    { "$limit": 5 }
]
results = collection.aggregate(pipeline)
list(results)

[{'ingredients': 'sea salt'},
 {'ingredients': 'kosher salt'},
 {'ingredients': 'fine sea salt'},
 {'ingredients': 'kosher salt'},
 {'ingredients': 'kosher salt'}]

Count the qualified salt types:

In [73]:
pipeline = [
    { "$unwind": "$ingredients" },
    { "$project": {"ingredients": 1, "_id":0} },
    { "$match":{ "ingredients": {"$regex": "^.+ salt$" } } },
    { "$group":{
        "_id": "$ingredients", "count": {"$sum": 1} } 
    },
    { "$sort": { "count": -1} },
    { "$limit": 20 }
]
results = collection.aggregate(pipeline)
list(results)

[{'_id': 'kosher salt', 'count': 3113},
 {'_id': 'sea salt', 'count': 940},
 {'_id': 'coarse salt', 'count': 578},
 {'_id': 'fine sea salt', 'count': 285},
 {'_id': 'garlic salt', 'count': 240},
 {'_id': 'seasoning salt', 'count': 131},
 {'_id': 'table salt', 'count': 79},
 {'_id': 'coarse sea salt', 'count': 68},
 {'_id': 'coarse kosher salt', 'count': 64},
 {'_id': 'celery salt', 'count': 52},
 {'_id': 'fine salt', 'count': 24},
 {'_id': 'onion salt', 'count': 15},
 {'_id': 'rock salt', 'count': 14},
 {'_id': 'pickling salt', 'count': 12},
 {'_id': 'black salt', 'count': 12},
 {'_id': 'Himalayan salt', 'count': 11},
 {'_id': 'celtic salt', 'count': 9},
 {'_id': 'maldon sea salt', 'count': 8},
 {'_id': 'smoked sea salt', 'count': 6},
 {'_id': 'iodized salt', 'count': 4}]

### Note on variation

- Regular expression syntax is *close* to standard, but each implementation (Python, MySQL, PostgreSQL, MongoDB, ...) differs slightly in the syntax and features it supports.
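Variation shows up even inside Python. As a small sketch, the same `\w+` pattern gives different results depending on whether the `re.ASCII` flag is set. The sample text is made up, and the accented letters are written as escapes so the example does not depend on file encoding.

```python
import re

text = "na\u00efve caf\u00e9"  # "naïve café"

# Default for str patterns in Python 3: accented letters count as word characters
unicode_words = re.findall(r'\w+', text)

# re.ASCII restricts \w to [a-zA-Z0-9_]
ascii_words = re.findall(r'\w+', text, flags=re.ASCII)

print(unicode_words)  # ['naïve', 'café']
print(ascii_words)    # ['na', 've', 'caf']
```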

## Basics of Regular Expressions

In this class: we'll cover the basics, practiced in Python and Pandas.

To follow along:

In [75]:
import re

## Wild Cards

`a` - Match the letter 'a'. Same for most other characters

In [66]:
text = "Colorado"
re.findall('o', text)

['o', 'o', 'o']

In [77]:
text = "Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo"
re.findall('Buffalo buffalo', text)

['Buffalo buffalo', 'Buffalo buffalo', 'Buffalo buffalo']

`.` - Match any single character

In [69]:
text = "who, what, where, why, and how"
re.findall('wh.', text)

['who', 'wha', 'whe', 'why']

In [83]:
text = "who, what, where, why, and how"
re.findall('wh.,', text)

['who,', 'why,']

- `\w` - Match any word character (letters, digits, and underscore; support for non-English characters varies)
- `\W` - Match any non-word characters

In [73]:
text = "Who, what, where, why, and how"
re.findall(r'\w\w\w,', text)

['Who,', 'hat,', 'ere,', 'why,']

In [88]:
text = "Who, what, where, why, and how"
re.findall(r'\w', text)

['W',
 'h',
 'o',
 'w',
 'h',
 'a',
 't',
 'w',
 'h',
 'e',
 'r',
 'e',
 'w',
 'h',
 'y',
 'a',
 'n',
 'd',
 'h',
 'o',
 'w']

`\d` - Match any digit

In [89]:
text = "Party like it's 1999"
re.findall(r'\d', text)

['1', '9', '9', '9']

In [90]:
text = "Party like it's 1999"
re.findall(r'\d\d\d\d', text)

['1999']

`\s` - Match any whitespace character (spaces, tabs, and usually line breaks)

*What will this return?*

In [76]:
text = "The quick brown fox jumped over the lazy yellow dog"
re.findall(r'\s....\s', text)

[' over ', ' lazy ']

`[ab]` - Character set: match any one of the listed characters - in this case 'a' or 'b'

In [78]:
text = "The quick brown fox jumped over the lazy yellow dog"
re.findall('[Tt]he', text)

['The', 'the']

### *What if I want to match an actual backslash or period?*

This is a problem:

In [114]:
text = "Dr. Jones Drinks Too Much"
re.findall('Dr.', text)

['Dr.', 'Dri']

Precede the character with a backslash

E.g.

- `.` - Matches *any* character
- `\.` - Matches a period

In [115]:
re.findall(r'Dr\.', text)

['Dr.']
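When the literal text comes from a variable, escaping by hand gets error-prone. One option, sketched here, is the standard library's `re.escape`, which adds the backslashes for you. The variable names are just for illustration.

```python
import re

text = "Dr. Jones Drinks Too Much"
literal = "Dr."

# re.escape backslash-escapes characters that are special in a pattern
pattern = re.escape(literal)
matches = re.findall(pattern, text)

print(pattern)  # Dr\.
print(matches)  # ['Dr.']
```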

- `[a-z]` matches any character from a to z
- `[A-Z]` matches any character from A to Z

In [144]:
text = "text 1-800-SPAM for more information"
re.findall('[A-Z]+', text)

['SPAM']

Square brackets work the same way as before, so you can combine a range like `A-Z` with other characters.

e.g. Match capital letters, digits, or hyphens:

In [145]:
text = "text 1-800-SPAM for more information"
re.findall(r'[\d\-A-Z]+', text)

['1-800-SPAM']

*Note that inside square brackets a hyphen is special, because it defines a range, so a literal `-` is matched with `\-`.*
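As a side note, Python also treats a hyphen as literal when it is the last character in the brackets, since it cannot form a range there. This sketch compares the two spellings on the same text; other regex engines may behave differently.

```python
import re

text = "text 1-800-SPAM for more information"

# Escaped hyphen in the middle of the set
escaped = re.findall(r'[\d\-A-Z]+', text)

# Unescaped hyphen placed last, where it cannot define a range
trailing = re.findall(r'[\dA-Z-]+', text)

print(escaped)   # ['1-800-SPAM']
print(trailing)  # ['1-800-SPAM']
```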

Returning to the earlier data.

In [92]:
titles = movies.title.drop_duplicates()

"The (single word) of ..."

In [15]:
matches = titles.str.contains(r'^The \w+ of ')
titles[matches].sample(10)

99139                The End of the Tour
94735          The Plague of the Zombies
97198                     The Best of Me
99821               The Face of an Angel
88980       The Earrings of Madame de...
68702               The Legend of Tarzan
75357    The Importance of Being Earnest
93216            The Diary of Anne Frank
97199                   The Book of Life
68479                 The Age of Adaline
Name: title, dtype: object

In [16]:
matches = titles.str.contains(':')
titles[matches].sample(10)

67847                  Captain America: The Winter Soldier
29299    Léon: The Professional (a.k.a. The Professiona...
96755    Will Ferrell: You're Welcome America - A Final...
88694         Nightmare on Elm Street 3: Dream Warriors, A
82307                   Police Academy 6: City Under Siege
92984                             Exorcist II: The Heretic
75368                      Tabu: A Story of the South Seas
94572                       Sherlock: The Abominable Bride
93522       Librarian, The: The Curse of the Judas Chalice
76784         City Slickers II: The Legend of Curly's Gold
Name: title, dtype: object

In [93]:
matches = titles.str.contains(r"^\w+\-\w+$")
titles[matches]

259                  Ben-Hur
12000             Spider-Man
40269                  X-Men
55032                  U-571
58796             Scooby-Doo
61252              Fail-Safe
65729                G-Force
66092               Kick-Ass
68396                Ant-Man
69332            Re-Animator
69765    Slaughterhouse-Five
81831                  K-PAX
83228                 BURN-E
83394               Non-Stop
83908               Bio-Dome
89557            Topsy-Turvy
93256               Cry-Baby
94106              She-Devil
95079               Kon-Tiki
96155              De-Lovely
96602               Catch-22
96617                Ben-hur
96717               Semi-Pro
98638                  T-Men
99056     Shakespeare-Wallah
99971        Straight-Jacket
Name: title, dtype: object

## Exercises

Reference: 
    
- `a` - Match the letter 'a'. Same for most other characters
- `.` - Match any single character
- `\.` - Match a period. Same for other 'special' characters
- `\w` - Match any word character
- `\d` - Match any digit
- `\s` - Match any whitespace character
- `[ab]` - Match any one of the listed characters
- `[a-z]` - Match any character from a to z
- `[A-Z]` - Match any character from A to Z

## Repetition

`?` - One or zero of the preceding match

In [81]:
text = "color colour"
re.findall('colou?r', text)

['color', 'colour']

- `+` - One or more of the preceding match
- `*` - Zero or more of the preceding match

In [135]:
text = "GOAL GOOOOOOOOOAAAAAAL"
re.findall('GO+A+L', text)

['GOAL', 'GOOOOOOOOOAAAAAAL']

In [136]:
text = "GOAL"
re.findall('GO+A+L', text)

['GOAL']

`*` and `+` are *greedy* in Python. They will grab as much as possible. 

In [82]:
text = "<p>Something or other</p><p>Yet more junk.</p>" 
re.findall('<p>.*</p>', text)

['<p>Something or other</p><p>Yet more junk.</p>']

In [124]:
text = "foo1@gmail.com;b-a-r@gmail.com;baz@gmail.com" 
re.findall(r'\w.*@gmail.com', text)

['foo1@gmail.com;b-a-r@gmail.com;baz@gmail.com']

`*?` is the *lazy* alternative, it will grab as little as possible.

In [123]:
re.findall(r'\w.*?@gmail.com', text)

['foo1@gmail.com', 'b-a-r@gmail.com', 'baz@gmail.com']
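The same lazy quantifier can be applied to the earlier `<p>` example. This short sketch puts the greedy and lazy versions side by side.

```python
import re

text = "<p>Something or other</p><p>Yet more junk.</p>"

# Greedy: .* runs to the last </p> in the string
greedy = re.findall(r'<p>.*</p>', text)

# Lazy: .*? stops at the first </p> it can reach
lazy = re.findall(r'<p>.*?</p>', text)

print(greedy)  # ['<p>Something or other</p><p>Yet more junk.</p>']
print(lazy)    # ['<p>Something or other</p>', '<p>Yet more junk.</p>']
```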

## Start and End of Line

`^` - Start of line (in Python, the start of the string unless the `re.MULTILINE` flag is set)

In [100]:
text = "The quick brown fox jumped over the lazy yellow dog"
re.findall('^quick', text)

[]

In [105]:
re.findall('^The', text)

['The']

In [109]:
re.findall('^.*fox', text)

['The quick brown fox']

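`$` - End of line (in Python, only at the end of the string, or just before a final line break, unless `re.MULTILINE` is set)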

In [104]:
text = "The quick brown fox jumped over the lazy yellow dog"
re.findall('.......$', text)

['low dog']

In [94]:
text = "The quick brown fox jumped over the lazy yellow dog"
re.findall("^.*$", text)

['The quick brown fox jumped over the lazy yellow dog']
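The examples above all use a single-line string. For text with several lines, Python needs the `re.MULTILINE` flag before `^` and `$` match at every line. A small sketch with a made-up three-line string:

```python
import re

text = "first line\nsecond line\nthird line"

# Without flags, ^ only matches at the very start of the string
default = re.findall(r'^\w+', text)

# With re.MULTILINE, ^ also matches right after each line break
per_line = re.findall(r'^\w+', text, flags=re.MULTILINE)

print(default)   # ['first']
print(per_line)  # ['first', 'second', 'third']
```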

## Reference

- `a` - Match the letter 'a'. Same for most other characters
- `.` - Match any single character
- `\.` - Match a period. Same for other 'special' characters
- `\w` - Match any word character
- `\d` - Match any digit
- `\s` - Match any whitespace character
- `[ab]` - Match any one of the listed characters
- `[a-z]` - Match any character from a to z
- `[A-Z]` - Match any character from A to Z
- `?` - One or zero of the preceding match
- `+` - One or more of the preceding match (greedy)
- `*` - Zero or more of the preceding match (greedy)
- `+?` - One or more of the preceding match (lazy)
- `*?` - Zero or more of the preceding match (lazy)
- `^` - Start of line
- `$` - End of line

## Additional Tips

Choose a range for repetition with `{min,max}`. e.g.

In [49]:
text = "YOLO"
re.search('YOLO{1,3}$', text) 

<_sre.SRE_Match object; span=(0, 4), match='YOLO'>

In [50]:
text = "YOLOOO"
re.search('YOLO{1,3}$', text)

<_sre.SRE_Match object; span=(0, 6), match='YOLOOO'>

In [51]:
text = "YOLOOOOOO"
re.search('YOLO{1,3}$', text)

No output here: `re.search` returns `None`, because six O's exceed the maximum of three.
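A single number in the braces, `{n}`, asks for exactly that many repetitions. As a sketch, this is one way to approach the "find all phone numbers" task from the overview, assuming numbers are written as `ddd-ddd-dddd`. The sample sentence is invented.

```python
import re

text = "Call 303-555-0142 or 1-800-555-0199, not 555-01."

# \d{3} means exactly three digits, \d{4} exactly four
numbers = re.findall(r'\d{3}-\d{3}-\d{4}', text)

print(numbers)
# ['303-555-0142', '800-555-0199']
```

Note that the leading `1-` of the second number is left out, because the pattern says nothing about a country code.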

*Negation*
    
Use the caret in square brackets: `[^aeiou]` means *not* a, e, i, o, or u
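A quick sketch of negation on a made-up string. Note that a negated set matches anything not listed, so spaces count too unless you exclude them.

```python
import re

text = "banana split"

# Any single character that is not a lowercase vowel (the space is included)
non_vowels = re.findall(r'[^aeiou]', text)

# Runs of characters that are neither lowercase vowels nor whitespace
consonant_runs = re.findall(r'[^aeiou\s]+', text)

print(non_vowels)      # ['b', 'n', 'n', ' ', 's', 'p', 'l', 't']
print(consonant_runs)  # ['b', 'n', 'n', 'spl', 't']
```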

*Groups*
    
Use parentheses. e.g:

In [53]:
text = "banana"
re.search('^ba(na)+$', text)

<_sre.SRE_Match object; span=(0, 6), match='banana'>

In [54]:
text = "lololololololololololol"
re.search('^l(ol)+$', text)

<_sre.SRE_Match object; span=(0, 23), match='lololololololololololol'>

Capturing groups:

In [61]:
text = "Ketchup Catsup"
re.findall('(Ketch|Cats)up', text)

['Ketch', 'Cats']
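Notice that `re.findall` returned only the captured part, not the full match. The sketch below contrasts that with a non-capturing group, written `(?:...)`, and with a pattern that has two capturing groups. The names in the last example are made up.

```python
import re

text = "Ketchup Catsup"

# One capturing group: findall returns just what the group captured
captured = re.findall(r'(Ketch|Cats)up', text)

# Non-capturing group: findall returns the whole match
whole = re.findall(r'(?:Ketch|Cats)up', text)

# Two capturing groups: findall returns one tuple per match
pairs = re.findall(r'(\w+), (\w+)', "Lovelace, Ada; Hopper, Grace")

print(captured)  # ['Ketch', 'Cats']
print(whole)     # ['Ketchup', 'Catsup']
print(pairs)     # [('Lovelace', 'Ada'), ('Hopper', 'Grace')]
```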