# Regex

- What is a regular expression?
- When are regular expressions useful?

In [1]:
import pandas as pd
import re # part of the python stdlib

In [2]:
log_file_lines = '''
76.185.131.226 - - [11/May/2020:14:25:53 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:25:46 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:25:58 +0000] "GET / HTTP/1.1" 200 42 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
76.185.131.226 - - [11/May/2020:16:25:58 +0000] "GET /favicon.ico HTTP/1.1" 200 162 "https://python.zach.lol/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
104.5.217.57 - - [11/May/2020:16:26:27 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:26:46 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:26:54 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
104.5.217.57 - - [11/May/2020:16:27:04 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:27:05 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:27:10 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
'''

In [None]:
# each named group (?P<name>...) becomes a column name when used with .str.extract
regex = r'(?P<ip>.*?)\s.*?\[(?P<timestamp>.*?)\]\s+"(?P<method>[A-Z]+)\s(?P<path>.*?)\sHTTP/1\.1"\s(?P<status>\d+)\s(?P<bytes_sent>\d+)\s"(?P<referrer>.*?)"\s"(?P<user_agent>.*?)"'
regex = re.compile(regex, re.VERBOSE)

In [4]:
lines = pd.Series(log_file_lines.strip().split('\n'))

In [5]:
lines

0    76.185.131.226 - - [11/May/2020:14:25:53 +0000...
1    76.185.131.226 - - [11/May/2020:16:25:46 +0000...
2    76.185.131.226 - - [11/May/2020:16:25:58 +0000...
3    76.185.131.226 - - [11/May/2020:16:25:58 +0000...
4    104.5.217.57 - - [11/May/2020:16:26:27 +0000] ...
5    76.185.131.226 - - [11/May/2020:16:26:46 +0000...
6    76.185.131.226 - - [11/May/2020:16:26:54 +0000...
7    104.5.217.57 - - [11/May/2020:16:27:04 +0000] ...
8    76.185.131.226 - - [11/May/2020:16:27:05 +0000...
9    76.185.131.226 - - [11/May/2020:16:27:10 +0000...
dtype: object

In [3]:
lines.str.extract(regex)

Unnamed: 0,ip,timestamp,method,path,status,bytes_sent,referrer,user_agent
0,76.185.131.226,11/May/2020:14:25:53 +0000,GET,/,200,42,-,python-requests/2.23.0
1,76.185.131.226,11/May/2020:16:25:46 +0000,GET,/,200,42,-,python-requests/2.23.0
2,76.185.131.226,11/May/2020:16:25:58 +0000,GET,/,200,42,-,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6...
3,76.185.131.226,11/May/2020:16:25:58 +0000,GET,/favicon.ico,200,162,https://python.zach.lol/,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6...
4,104.5.217.57,11/May/2020:16:26:27 +0000,GET,/,200,42,-,python-requests/2.23.0
5,76.185.131.226,11/May/2020:16:26:46 +0000,GET,/documentation,200,348,-,python-requests/2.23.0
6,76.185.131.226,11/May/2020:16:26:54 +0000,GET,/documentation,200,348,-,python-requests/2.23.0
7,104.5.217.57,11/May/2020:16:27:04 +0000,GET,/documentation,200,348,-,python-requests/2.23.0
8,76.185.131.226,11/May/2020:16:27:05 +0000,GET,/documentation,200,348,-,python-requests/2.23.0
9,76.185.131.226,11/May/2020:16:27:10 +0000,GET,/documentation,200,348,-,python-requests/2.23.0


- `re.search`: returns the first match for a regex in a subject (or `None` if there is no match)
- `re.findall`: returns a list of *all* the matches for a regex in a subject
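
As a minimal sketch of the difference between the two functions (the subject string and the pattern `bo` are chosen only for illustration):

```python
import re

subject = 'Bounce that booger on the bowl back behind the bookshelf'

# re.search returns a Match object for the first match only
first = re.search(r'bo', subject)
print(first.span(), first.group())

# re.findall returns a list of strings, one per non-overlapping match
every = re.findall(r'bo', subject)
print(every)

# when nothing matches, search gives None and findall gives an empty list
missing_search = re.search(r'z', subject)
missing_findall = re.findall(r'z', subject)
print(missing_search, missing_findall)
```

Because `re.search` can return `None`, it is usually worth checking the result before calling methods such as `.group()` on it.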

### Literals

In [15]:
# a literal: a single character that matches itself
# this regular expression looks for the letter b
regexp = r'b'

In [16]:
subject = 'abc'
subject2 = 'Bounce that booger on the bowl back behind the bookshelf'
re.search(regexp, subject2)

<re.Match object; span=(12, 13), match='b'>

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <ol>
        <li>Change your regular expression to match the literal character "c". What do you notice?</li>
        <li>Change your regular expression to match the literal string "ab". What do you notice?</li>
        <li>Change your regular expression to match the literal "d". What do you notice?</li>
        <li>Use <code>re.findall</code> instead of <code>re.search</code>. How do the results differ?</li>
    </ol>
</div>

In [35]:
regexp = r'book'
subject = 'abc'
re.search(regexp, subject2)

<re.Match object; span=(47, 51), match='book'>

In [36]:
# the span is where the match starts and ends in the subject (the end index is exclusive)
len(subject2)

56

### Metacharacters

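- `.`: any character (except a newline)
- `\w`: any letter, number, or underscore; `\W`: anything that is *not* a letter, number, or underscore
- `\s`: any type of whitespace; `\S`: anything that is *not* whitespace
- `\d`: any digit; `\D`: anything that is *not* a digit
- Capital variants match the opposite of their lowercase counterparts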
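
A short sketch of how each lowercase metacharacter and its capital variant split the same subject (the subject string here is made up for illustration):

```python
import re

subject = 'Room 42, Floor_3'

digits = re.findall(r'\d', subject)        # only the numerals
non_digits = re.findall(r'\D', subject)    # everything that is not a numeral
word_chars = re.findall(r'\w', subject)    # letters, numbers, and the underscore
non_word = re.findall(r'\W', subject)      # spaces and punctuation
whitespace = re.findall(r'\s', subject)    # just the spaces

print(digits)
print(non_word)
print(whitespace)
```

Every character in the subject lands in exactly one of `digits` or `non_digits`, and likewise for `word_chars` and `non_word`.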

In [49]:
regexp = r'\w\D\s'
re.findall(regexp, subject2)

['ce ', 'at ', 'er ', 'on ', 'he ', 'wl ', 'ck ', 'nd ', 'he ']

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>Continue to use the same subject variable from above.</p>
    <ol>
        <li>Use all of the above metacharacters with <code>re.findall</code>. What do you notice?</li>
        <li>What does the regular expression <code>\w\w</code> match?</li>
        <li>Use only metacharacters to write a regular expression to match "c 1".</li>
        <li>Use a combination of metacharacters to match 3 digits in a row.</li>
    </ol>
</div>

In [53]:
# 1
regexp = r'.'
subject = 'abc'
re.findall(regexp, subject)

['a', 'b', 'c']

In [55]:
# 1
regexp = r'\w'
subject = 'abc'
re.findall(regexp, subject)

['a', 'b', 'c']

In [56]:
# 2
regexp = r'\w\w'
subject = 'abc'
re.findall(regexp, subject)

['ab']

In [57]:
# 3
regexp = r'\w\s\d'
subject = 'c 1'
re.findall(regexp, subject)

['c 1']

In [61]:
# 4 (this pattern also matches non-digit strings such as 'a1!'; \d\d\d is stricter)
regexp = r'\w\d\S'
subject = '123'
re.findall(regexp, subject)

['123']

### Repeating

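- `{}`: custom number of repetitions
    - `{x}`: exactly x repetitions
    - `{x,}`: x or more
    - `{x,y}`: between x and y repetitions
- `*`: zero or more
- `+`: one or more
- `?`: optional (zero or one)
- `?` after another quantifier (e.g. `*?`, `+?`): makes it non-greedy, so it matches as few characters as possible; quantifiers are greedy by default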
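
The following sketch illustrates the quantifiers listed above, including the difference between greedy and non-greedy matching. The subject strings are made up for illustration.

```python
import re

# {x}: exactly three digits in a row
three_digits = re.findall(r'\d{3}', '12 345 6789')

# ?: the "u" is optional
optional_u = re.findall(r'colou?r', 'color colour')

subject = '<b>bold</b> and <i>italic</i>'

# greedy: .+ takes as much as it can, so the match runs to the last ">"
greedy = re.findall(r'<.+>', subject)

# non-greedy: .+? stops at the first ">" that lets the pattern match
non_greedy = re.findall(r'<.+?>', subject)

print(three_digits)
print(optional_u)
print(greedy)
print(non_greedy)
```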

In [64]:
subject = 'paddingc 1 padding 777 66 hello'
regexp = r'\d{2,3}'
re.findall(regexp, subject)

['777', '66']

In [70]:
subject = 'paddingc 1 padding !!.. 777 66 hello'
regexp = r'\w+'
re.findall(regexp, subject)

['paddingc', '1', 'padding', '777', '66', 'hello']

In [71]:
subject = 'paddingc 1 padding !!.. 777 66 hello'
regexp = r'.*'
re.findall(regexp, subject)

['paddingc 1 padding !!.. 777 66 hello', '']

In [72]:
subject = 'paddingc 1 padding !!.. 777 66 hello'
regexp = r'.+'
re.findall(regexp, subject)

['paddingc 1 padding !!.. 777 66 hello']

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>Use the string below as your subject for this exercise.</p>
    <pre><code>Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, San Antonio, TX 78230. You can find us online at http://codeup.com and our alumni portal is located at https://alumni.codeup.com.</code></pre>
    <ol>
        <li>Write a regular expression that matches all the numbers.</li>
        <li>Write a regular expression that matches a 5 digit number, but not a number with fewer digits.</li>
        <li>Write a regular expression that matches <code>http://</code> or <code>https://</code>.</li>
        <li>Write a regular expression that matches all of the words.</li>
    </ol>
</div>

In [137]:

subject = (
    'Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, '
    'San Antonio, TX 78230. You can find us online at http://codeup.com '
    'and our alumni portal is located at https://alumni.codeup.com. '
    "It's a great school!"
)

In [153]:
regexp = r"\d+\.?\d"
re.findall(regexp, subject)

['2014', '600', '350', '78230']

In [141]:
regexp = r"\d{5}"
re.findall(regexp, subject)

['78230']

In [160]:
regexp = r"\w{4,5}\W{3}"
re.findall(regexp, subject)

['http://', 'https://']

In [152]:
regexp = r"https?://"
re.findall(regexp, subject)

['http://', 'https://']

In [161]:
regexp = r"\w+://"
re.findall(regexp, subject)

['http://', 'https://']

In [169]:
regexp = r"\w+"
re.findall(regexp, subject)

['Codeup',
 'founded',
 'in',
 '2014',
 'is',
 'located',
 'at',
 '600',
 'Navarro',
 'St',
 'Suite',
 '350',
 'San',
 'Antonio',
 'TX',
 '78230',
 'You',
 'can',
 'find',
 'us',
 'online',
 'at',
 'http',
 'codeup',
 'com',
 'and',
 'our',
 'alumni',
 'portal',
 'is',
 'located',
 'at',
 'https',
 'alumni',
 'codeup',
 'com',
 'It',
 's',
 'a',
 'great',
 'school']

### Any/None Of

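We can use `[]` to define a set of characters; the set matches any *one* of the characters listed inside the brackets.

- `-` between two characters denotes a range of characters (e.g. `a-z`)
- `^` as the first character inside the brackets acts as "not"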
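
A minimal sketch of a plain set, a range, and a negated set, using a made-up subject:

```python
import re

subject = 'Suite 350, San Antonio, TX 78230'

# a set: any one lowercase vowel
vowels = re.findall(r'[aeiou]', subject)

# a range with a quantifier: runs of capital letters
capitals = re.findall(r'[A-Z]+', subject)

# a negated set: runs of anything that is not a letter or whitespace
not_letters = re.findall(r'[^A-Za-z\s]+', subject)

print(vowels)
print(capitals)
print(not_letters)
```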

In [178]:
regexp = r'[A-Za-z\']+'
re.findall(regexp, subject)

['Codeup',
 'founded',
 'in',
 'is',
 'located',
 'at',
 'Navarro',
 'St',
 'Suite',
 'San',
 'Antonio',
 'TX',
 'You',
 'can',
 'find',
 'us',
 'online',
 'at',
 'http',
 'codeup',
 'com',
 'and',
 'our',
 'alumni',
 'portal',
 'is',
 'located',
 'at',
 'https',
 'alumni',
 'codeup',
 'com',
 "It's",
 'a',
 'great',
 'school']

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>For this exercise you should make up various subjects and test them with your regular expressions.</p>
    <ol>
        <li>Write a regular expression that matches even numbers.</li>
        <li>Write a regular expression that matches 2 or more odd numbers in a row.</li>
        <li>Write a regular expression that matches any word with a vowel in it.</li>
    </ol>
</div>

In [220]:
# 1 (first attempt: [1-9] excludes 0, so numbers that contain a 0 are split apart)
regexp = r'[1-9]*[02468]'
re.findall(regexp, subject)

['20', '14', '60', '0', '350', '78230']

In [212]:
# 1 (\d allows any digits, including 0, before the final even digit)
regexp = r'\d*[02468]'
re.findall(regexp, subject)

['2014', '600', '350', '78230']

In [184]:
# 2
regexp = r'[13579]{2,}'
re.findall(regexp, subject)

['35']

In [226]:
# 2
regexp = r'\d*[13579]\W\d*[13579]'
# a separate name keeps the Codeup subject available for the next cell
odd_subject = 'hey there are 77 85 and 5.35 of all of that'
re.findall(regexp, odd_subject)

['77 85', '5.35']

In [211]:
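# 3 (note: this attempt does not require a vowel, so it also returns 'St' and 'TX',
#    and the trailing .? picks up extra characters such as spaces)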
regexp = r'\b[A-Za-z]*?[aeiou]*[a-zA-Z].?\b'
re.findall(regexp, subject)

['Codeup',
 'founded ',
 'in ',
 'is ',
 'located ',
 'at ',
 'Navarro',
 'St',
 'Suite',
 'San ',
 'Antonio',
 'TX',
 'You ',
 'can ',
 'find',
 'us ',
 'online',
 'at ',
 'http',
 'codeup.',
 'com ',
 'and',
 'our ',
 'alumni',
 'portal ',
 'is ',
 'located ',
 'at ',
 'https',
 'alumni',
 'codeup.',
 'com',
 'It',
 's ',
 'a ',
 'great ',
 'school']
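
One possible way to require at least one vowel is to place a vowel set between two optional runs of letters. This is a sketch rather than the only solution, and the shortened subject is made up for illustration:

```python
import re

subject = 'Navarro St. Suite 350, San Antonio, TX 78230'

# optional letters, then one required vowel, then optional letters,
# all between word boundaries
regexp = r'\b[A-Za-z]*[AEIOUaeiou][A-Za-z]*\b'
words_with_vowel = re.findall(regexp, subject)

print(words_with_vowel)
```

Words without a vowel, such as `St` and `TX`, are skipped because the vowel set in the middle has nothing to match.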

### Anchors

- `^`: starts with
- `$`: ends with
- `\b`: word boundary

In [239]:
some_strings = ['aardvark', 'racoon', 'possum', 'snake', 'armadillo']

In [241]:
regexp = r'^a[a-z]+[ok]$'

for animal in some_strings:
    print(re.search(regexp, animal))

<re.Match object; span=(0, 8), match='aardvark'>
None
None
None
<re.Match object; span=(0, 9), match='armadillo'>


<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>For this exercise you should make up various subjects and test them with your regular expressions.</p>
    <ol>
        <li>Write a regular expression that matches if a word starts with a vowel.</li>
        <li>Write a regular expression that matches if a word starts with a capital letter.</li>
        <li>Write a regular expression that matches if a word ends with a capital letter.</li>        
        <li>Write a regular expression that matches if a word starts <b>and</b> ends with a capital letter.</li>
    </ol>
</div>

In [251]:
regexp = r'^[aeiouAEIOU][a-zA-Z]*\b'

In [252]:
regexp = r'^[A-Z][a-zA-Z]*\b'

In [254]:
regexp = r'^[A-Z][a-zA-Z]*[A-Z]$|^[A-Z]$'
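
The patterns above can be tried against a few made-up words. The sketch below also includes one possible pattern for the "ends with a capital letter" exercise; the word list, the `matching` helper, and the `ends_with_capital` pattern are assumptions added for illustration.

```python
import re

words = ['Apple', 'banana', 'kiwI', 'OrangE', 'Q']

starts_with_vowel = r'^[aeiouAEIOU][a-zA-Z]*\b'
starts_with_capital = r'^[A-Z][a-zA-Z]*\b'
ends_with_capital = r'\b[a-zA-Z]*[A-Z]$'  # hypothetical answer, not from the lesson
starts_and_ends_with_capital = r'^[A-Z][a-zA-Z]*[A-Z]$|^[A-Z]$'


def matching(regexp):
    """Return the words for which re.search finds a match."""
    return [word for word in words if re.search(regexp, word)]


print(matching(starts_with_vowel))
print(matching(starts_with_capital))
print(matching(ends_with_capital))
print(matching(starts_and_ends_with_capital))
```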

### Capture Groups
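Capture groups are denoted by parentheses. When a pattern contains groups, `re.findall` returns a tuple of the captured groups for each match.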

In [243]:
regexp = r'([abc]+) ([abc]+)'
subject = 'abbbc cbbaa bbbbaa ham sandwhich bbba abbb'
re.findall(regexp, subject)

[('abbbc', 'cbbaa'), ('bbba', 'abbb')]
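
With `re.search`, the groups are read from the match object instead. A minimal sketch, using a made-up subject:

```python
import re

subject = 'ham sandwich'
match = re.search(r'(\w+) (\w+)', subject)

whole = match.group(0)    # group 0 is the entire match
first = match.group(1)    # groups are numbered from 1, left to right
second = match.group(2)
both = match.groups()     # all captured groups as a tuple

print(whole, first, second, both)
```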

## `re.sub`

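- removing: replace each match with an empty string
- substitution: replace each match with new text; `\1`, `\2`, ... in the replacement refer to capture groups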

In [244]:
subject = 'Hello I would like a ham sandwich please'
regexp = r'sandwich'
re.sub(regexp, 'slamwich', subject)

'Hello I would like a ham slamwich please'

In [250]:
subject = 'Hello I would like a ham sandwich please'
regexp = r'(ham)\s+(\w+)\s'
re.sub(regexp, r'\2 of \1 ', subject)

'Hello I would like a sandwich of ham please'

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>Use the code below to get started on this exercise.</p>
    <pre><code>dates = pd.Series(['2020-11-12', '2020-07-13', '2021-01-12'])</code></pre>
    <p>Use regular expression substitution to reformat the dates in the format common in the US: m/d/y.</p>
</div>
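
One possible approach to this exercise, sketched with plain `re.sub` on a list of strings so that it runs without pandas: capture the year, month, and day, then reorder them in the replacement. With the `dates` Series from the exercise, the same pattern and replacement should work with `dates.str.replace(regexp, replacement, regex=True)`.

```python
import re

dates = ['2020-11-12', '2020-07-13', '2021-01-12']

# group 1 = year, group 2 = month, group 3 = day
regexp = r'(\d{4})-(\d{2})-(\d{2})'
replacement = r'\2/\3/\1'  # month/day/year

us_dates = [re.sub(regexp, replacement, date) for date in dates]
print(us_dates)
```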

## Misc

### Pandas Usage

- `.str`
    - `.extract`
    - `.count`
    - `.contains`
    - `.replace`
- extract + concat
- named groups

In [255]:
df = pd.DataFrame()
df['text'] = pd.Series([
    'You should go check out https://regex101.com, it is a great website!',
    'My favorite search engine is https://duckduckgo.com',
    'If you have a question, you can get it answered through http://askjeeves.com, it is great!',
])


In [258]:
df

Unnamed: 0,text
0,"You should go check out https://regex101.com, ..."
1,My favorite search engine is https://duckduckg...
2,"If you have a question, you can get it answere..."


In [261]:
url_parts = df.text.str.extract(r'(?P<protocol>https?)://(\w+)\.(\w+)')
url_parts

Unnamed: 0,protocol,1,2
0,https,regex101,com
1,https,duckduckgo,com
2,http,askjeeves,com


In [264]:
url_parts.columns = ['protocol', 'base_domain', 'tld' ]
pd.concat([df, url_parts], axis=1)

Unnamed: 0,text,protocol,base_domain,tld
0,"You should go check out https://regex101.com, ...",https,regex101,com
1,My favorite search engine is https://duckduckg...,https,duckduckgo,com
2,"If you have a question, you can get it answere...",http,askjeeves,com


### Interactive Regex Tool

To install the `hlre` tool:

```
python -m pip install hlre
```

[For more documentation and the source](https://github.com/zgulde/hlre)

See also [regex101](https://regex101.com) (make sure to select the Python flavor)

### Named capture groups

In [None]:
text = 'You should go check out https://regex101.com, it is a great website!'

match = re.search(r'(?P<protocol>https?)://(?P<base_domain>\w+)\.(?P<tld>\w+)', text)
match.groupdict()

In [None]:
df.text.str.extract(r'(?P<protocol>https?)://(?P<base_domain>\w+)\.(?P<tld>\w+)')

### Verbose regular expressions

- `re.VERBOSE`
- `(?# this is a comment)`

In [265]:
text = 'You should go check out https://regex101.com, it is a great website!'

regexp = r'''
(?P<protocol>https?)
:// (?# match the :// that separates protocol from domain, without capturing it)
(?P<base_domain>\w+)
\.
(?P<tld>\w+)
'''
match = re.search(regexp, text, re.VERBOSE) # whitespace in the regex is ignored
match.groupdict()

{'protocol': 'https', 'base_domain': 'regex101', 'tld': 'com'}