# Regex

- What is a regular expression?
    - a mini language for describing patterns in text
    - bigger than Python: many languages and tools support regex, and here we use the Python flavor
- When are regular expressions useful?
    - extracting (parsing) regular text, e.g. log files
    - data wrangling
    - scope can be small or large
- When should you not use regex?
    - data more complex than the lines of a log file
    - HTML

In [3]:
import pandas as pd
import re

In [4]:
log_file_lines = '''
76.185.131.226 - - [11/May/2020:14:25:53 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:25:46 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:25:58 +0000] "GET / HTTP/1.1" 200 42 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
76.185.131.226 - - [11/May/2020:16:25:58 +0000] "GET /favicon.ico HTTP/1.1" 200 162 "https://python.zach.lol/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
104.5.217.57 - - [11/May/2020:16:26:27 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:26:46 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:26:54 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
104.5.217.57 - - [11/May/2020:16:27:04 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:27:05 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:27:10 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
'''
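
As a preview of where these notes are heading, here is a minimal sketch of pulling fields out of lines like the ones above. The names `sample_log` and `log_pattern`, and the pattern itself, are assumptions made for illustration and are not part of the original lesson. The pattern relies on anchors and capture groups, which are covered further down.

```python
import re

# Two sample lines in the same format as log_file_lines above
sample_log = '''
76.185.131.226 - - [11/May/2020:14:25:53 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"
104.5.217.57 - - [11/May/2020:16:27:04 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
'''

# Groups, in order: ip, timestamp, HTTP method, path, status code
log_pattern = r'^(\S+) - - \[(.+?)\] "(\w+) (\S+) [^"]*" (\d{3})'

# re.MULTILINE lets ^ match at the start of every line, not just the whole string
parsed = re.findall(log_pattern, sample_log, re.MULTILINE)
for ip, timestamp, method, path, status in parsed:
    print(ip, timestamp, method, path, status)
```

With more than one capture group, `re.findall` returns one tuple per line, which is convenient to hand to `pd.DataFrame`.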

In [5]:
import re # part of the python stdlib

- search: returns the first match for a regex (or `None` if there is no match)
- findall: returns a list of *all* the matches for a regex in a subject (the text that you are applying the regex to)
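
A small sketch of the difference between the two functions, using a made-up subject with two hits so the contrast is visible:

```python
import re

subject = 'abc abc'

first = re.search(r'ab', subject)        # a Match object for the first hit only
every = re.findall(r'ab', subject)       # a list of strings, one per hit
missing = re.search(r'd', subject)       # None when nothing matches
none_found = re.findall(r'd', subject)   # an empty list when nothing matches

print(first.span(), every, missing, none_found)
```

Because `re.search` can return `None`, check the result before calling methods such as `.span()` on it.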

### Literals

In [24]:
regexp = r'ab' # r makes this a raw string, so backslashes reach the regex engine unchanged
subject = 'abc'

re.search(regexp, subject)

# re.findall(regexp, subject) # Shows list of strings

<re.Match object; span=(0, 2), match='ab'>

**Span**: the index where the match starts and the index just past where it ends (the end is exclusive).  
1. Change your regular expression to match the literal character "b". What do you notice? 
- the span changes: (0, 1) for a, (1, 2) for b, and (2, 3) for c.
2. Change your regular expression to match the literal string "ab". What do you notice?
- the span changes to (0, 2)
3. Change your regular expression to match the literal "d". What do you notice?
4. Use re.findall instead of re.search. How do the results differ?
5. Change your regular expression to just the "." character. What are the results?
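
To make the span concrete, here is a short sketch (the variable names are only for illustration). Slicing the subject with the span gives back the matched text. It also shows that a bare `.` is not a literal and has to be escaped to match an actual period.

```python
import re

subject = 'abc'
match = re.search(r'b', subject)
start, end = match.span()              # (1, 2): end is one past the last matched index
print(start, end, subject[start:end])  # slicing with the span recovers the match

dots_any = re.findall(r'.', 'a.c')       # . matches every character here
dots_literal = re.findall(r'\.', 'a.c')  # \. matches only the period
print(dots_any, dots_literal)
```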

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <ol>
        <li>Change your regular expression to match the literal character "b". What do you notice?</li>
        <li>Change your regular expression to match the literal string "ab". What do you notice?</li>
        <li>Change your regular expression to match the literal "d". What do you notice?</li>
        <li>Use <code>re.findall</code> instead of <code>re.search</code>. How do the results differ?</li>
        <li>Change your regular expression to just the "." character. What are the results?</li>
    </ol>
</div>

### Metacharacters

- `.` : any character except a newline
- `\w` : any alphanumeric character or underscore
- `\s` : any whitespace character
- `\d` : any digit
- Capital variants (`\W`, `\S`, `\D`) match the opposite of the lowercase version
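
A quick sketch of the capital variants on the same `'abc 123'` subject used in this section, plus a check that `\w` includes the underscore:

```python
import re

subject = 'abc 123'

non_word = re.findall(r'\W', subject)    # everything \w does not match
non_digit = re.findall(r'\D', subject)   # everything \d does not match
non_space = re.findall(r'\S', subject)   # everything \s does not match
word_chars = re.findall(r'\w', 'a_1')    # letters, digits, and underscore

print(non_word, non_digit, non_space, word_chars)
```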

In [45]:
regexp = r'\d\d\d'
subject = 'abc 123'

re.findall(regexp, subject)

['123']

- `\w\w` gives you ['ab', '12']
- `\w\s\d` gives you ['c 1'] (`c\s` on its own only matches 'c ' and uses a literal)
- `\d\d\d` gives you ['123']

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>Continue to use the same subject variable from above.</p>
    <ol>
        <li>Use all of the above metacharacters with <code>re.findall</code>. What do you notice?</li>
        <li>What does the regular expression <code>\w\w</code> match?</li>
        <li>Use only metacharacters to write a regular expression to match "c 1".</li>
        <li>Use a combination of metacharacters to match 3 digits in a row.</li>
    </ol>
</div>

### Repeating

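- `{}` : a specific number of repetitions
- `*` : 0 or more
- `+` : 1 or more
- `?` : 0 or 1, i.e. optional
- `?` after another repetition operator (`*?`, `+?`, `{5,}?`) : makes it non-greedy, so it matches as little as possible instead of as much as possible
- `,` inside braces : "or more", e.g. `{5,}`, or a range, e.g. `{5,8}`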
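
The operators above can be sketched on a few made-up subjects. The greedy versus non-greedy pair is the one worth staring at:

```python
import re

quoted = '"one" and "two"'
greedy = re.findall(r'".+"', quoted)   # .+ grabs as much as it can
lazy = re.findall(r'".+?"', quoted)    # .+? stops at the first closing quote

optional = re.findall(r'colou?r', 'color colour')  # the u may or may not be there
ranged = re.findall(r'\d{2,3}', '1 22 333 4444')   # 2 to 3 digits at a time

print(greedy, lazy, optional, ranged)
```

Note that `{2,3}` still finds a match inside `4444`: it takes the first three digits and leaves the fourth unmatched.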

In [171]:
regexp = r'(https?://.+?com)'
subject = 'Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, San Antonio, TX 78230. You can find us online at http://codeup.com and our alumni portal is located at https://alumni.codeup.com.'

print(re.search(regexp, subject))
print(re.findall(regexp, subject))

<re.Match object; span=(115, 132), match='http://codeup.com'>
['http://codeup.com', 'https://alumni.codeup.com']


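1. `regexp = r'\d{3,}'` (works here because every number in the subject has 3 or more digits; `r'\d+'` is the general form)
2. `regexp = r'\d{5,}?'` (the non-greedy `?` makes this stop at 5 digits, so it behaves like `r'\d{5}'`)
3. `regexp = r'(h\w*...\w*.\w*\w.com)'` (a first attempt)
4. `regexp = r'(https?://.+?com)'`: `.+` means everything up to com, and the `?` after it makes the match stop at the first com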

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>Use the string below as your subject for this exercise.</p>
    <pre><code>Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, San Antonio, TX 78205. You can find us online at http://codeup.com and our alumni portal is located at https://alumni.codeup.com.</code></pre>
    <ol>
        <li>Write a regular expression that matches all the numbers.</li>
        <li>Write a regular expression that matches a 5 digit number, but not a number with fewer digits.</li>
        <li>Write a regular expression that matches any urls in the subject.</li>
    </ol>
</div>

### Any/None Of

In [176]:
regexp = r'[a1][b2][c3]' # each pair of brackets matches any one of the characters inside it
subject = 'abc 123'

re.match(regexp, subject)

<re.Match object; span=(0, 3), match='abc'>

In [175]:
subject = '123abc'

re.match(regexp, subject)

<re.Match object; span=(0, 3), match='123'>
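
A sketch of a few more bracket forms (ranges and the inverted set), with made-up subjects. It also shows why the cells above behave the way they do: `re.match` only looks at the start of the subject, while `re.search` looks anywhere.

```python
import re

vowels = re.findall(r'[aeiou]', 'regex')                # any one of the listed characters
hex_runs = re.findall(r'[0-9a-f]+', 'ff00 zz 1a')       # ranges inside brackets
not_lower_or_space = re.findall(r'[^a-z ]', 'abc 123')  # ^ inside brackets inverts the set

at_start = re.match(r'[a1][b2][c3]', 'xx abc')   # None: the subject does not start with a match
anywhere = re.search(r'[a1][b2][c3]', 'xx abc')  # finds 'abc' later in the subject

print(vowels, hex_runs, not_lower_or_space, at_start, anywhere.span())
```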

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>For this exercise you should make up various subjects and test them with your regular expressions.</p>
    <ol>
        <li>Write a regular expression that matches even numbers.</li>
        <li>Write a regular expression that matches 2 or more odd numbers in a row.</li>
        <li>Write a regular expression that matches any word with a vowel in it.</li>
    </ol>
</div>

In [191]:
subject = '123 bed 456 13 abc'

In [196]:
#1.
re.search(r'\b\d*[02468]\b', subject) # \b is a word boundary, so only whole numbers match; \d* also allows single digits

<re.Match object; span=(8, 11), match='456'>

In [198]:
#2
re.findall(r'[13579]{2,}', subject)

['13']

In [203]:
#3
re.search(r'[a-zA-Z]+[aeiou][a-zA-Z]+', 'matchstick')

<re.Match object; span=(0, 10), match='matchstick'>

In [204]:
#3 again, with * on both sides so the vowel can be the first or last letter
re.search(r'[a-zA-Z]*[aeiou][a-zA-Z]*', 'and')

<re.Match object; span=(0, 3), match='and'>

### Anchors

- `^` : starts with, e.g. `^b` matches a subject that starts with b
- `^` inside brackets inverts the set, e.g. `[^a-z]` matches anything that is not a lowercase letter
- `$` : ends with

In [209]:
regexp = r'\d$'
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(6, 7), match='3'>

In [223]:
regexp = r'^[aeiou]'
subject = 'how does this seem to work with 343553 and 232'

re.search(regexp, subject)
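
The cell above returns `None` because `^` anchors to the start of the whole subject, not to each word. For word-level questions, one option is the word-boundary anchor `\b`. It is not in the list above, so treat this as a sketch that goes slightly beyond these notes; the subjects are made up.

```python
import re

# ^ only looks at the very start of the subject
no_match = re.search(r'^[aeiou]', 'how does this seem to work')
starts_with_vowel = re.search(r'^[aeiou]', 'each word')

# \b matches the boundary between a word character and a non-word character
vowel_words = re.findall(r'\b[aeiouAEIOU]\w*', 'An apple fell under the Oak')

print(no_match, starts_with_vowel.group(), vowel_words)
```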

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>For this exercise you should make up various subjects and test them with your regular expressions.</p>
    <ol>
        <li>Write a regular expression that matches if a word starts with a vowel.</li>
        <li>Write a regular expression that matches if a word starts with a capital letter.</li>
        <li>Write a regular expression that matches if a word ends with a capital letter.</li>        
        <li>Write a regular expression that matches if a word starts <b>and</b> ends with a capital letter.</li>
    </ol>
</div>

### Capture Groups

In [226]:
regexp = r'.*?(\d+)' # raw string so \d is passed to the regex engine as written
df = pd.DataFrame({'text': ['abc', 'abc123', '123']})
df['match'] = df.text.str.extract(regexp)

In [227]:
df

     text match
0     abc   NaN
1  abc123   123
2     123   123
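
Outside of pandas, the same capture groups are available on the match object. A minimal sketch with a made-up pattern and subject:

```python
import re

match = re.search(r'([a-z]+)(\d+)', 'abc123')

whole = match.group(0)    # group 0 is always the entire match
letters = match.group(1)  # groups are numbered by their opening parenthesis
digits = match.group(2)
both = match.groups()     # all capture groups as a tuple

# With more than one group, findall returns a list of tuples
pairs = re.findall(r'([a-z]+)(\d+)', 'abc123 de45')

print(whole, letters, digits, both, pairs)
```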


## `re.sub`

- removing
- substitution

In [11]:
# removing one or more digits
regexp = r'\d+'
subject = 'abc123'

re.sub(regexp, '', subject)

'abc'

In [230]:
# Replace the whole match with just the contents of capture group 2
regexp = r'([a-z]+)(\d+)'
subject = 'abc123'

re.sub(regexp, r'\2', subject)

'123'

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>Use the code below to get started on this exercise.</p>
    <pre><code>dates = pd.Series(['2020-11-12', '2020-07-13', '2021-01-12'])</code></pre>
    <p>Use regular expression substitution to reformat the dates in the format common in the US: m/d/y.</p>
</div>

In [231]:
dates = pd.Series(['2020-11-12', '2020-07-13', '2021-01-12'])

In [232]:
dates

0    2020-11-12
1    2020-07-13
2    2021-01-12
dtype: object

In [237]:
dates.str.replace(r'(\d+)-(\d+)-(\d+)', r'\2/\3/\1', regex=True) # regex first, then replacement; regex=True is required in pandas 2.0+

0    11/12/2020
1    07/13/2020
2    01/12/2021
dtype: object

In [239]:
subject = '2021-01-12'
regexp = r'2021'
replacement = r''

re.sub(regexp, replacement, subject)

'-01-12'

In [240]:
replacement  = r'On the \2 day of the \3 month in the year \1'

In [241]:
regexp = r'(a)(b)(c)'
subject = 'abc'

re.sub(regexp, r'\3 and then \2 and then \1', subject)

'c and then b and then a'

In [245]:
regexp = r'(.)(.)(.)'
subject = 'def'

re.search(regexp, subject).groups()

('d', 'e', 'f')

In [246]:
re.sub(regexp, r'\3 and then \2 and then \1', subject)

'f and then e and then d'
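
The same backreference idea in plain `re.sub`, as a small sketch on a single date string. The `count` argument is an extra that the cells above do not use; it limits how many matches get replaced.

```python
import re

date = '2021-01-12'
date_pattern = r'(\d+)-(\d+)-(\d+)'  # groups: year, month, day

us_date = re.sub(date_pattern, r'\2/\3/\1', date)
sentence = re.sub(date_pattern, r'month \2, day \3, year \1', date)

# count=1 replaces only the first match
first_only = re.sub(r'\d', '#', 'abc123', count=1)

print(us_date, sentence, first_only)
```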

## Misc

### Pandas Usage

- `.str`
    - `.extract`
    - `.count`
    - `.contains`
    - `.replace`
- extract + concat
- named groups
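
A minimal sketch of the `.str` methods listed above, on a made-up Series. The column name `first_number` is an assumption chosen for the example.

```python
import pandas as pd

text = pd.Series(['abc123', 'abc', '45 and 67'])

digit_runs = text.str.count(r'\d+')                     # number of matches per row
has_digit = text.str.contains(r'\d')                    # True/False per row
no_digits = text.str.replace(r'\d+', '', regex=True)    # regex substitution per row

# extract + concat: pull out the first number and attach it next to the original text
extracted = text.str.extract(r'(\d+)')
extracted.columns = ['first_number']
combined = pd.concat([text.rename('text'), extracted], axis=1)

print(combined)
```

Rows with no match come back as `NaN` in the extracted column, so the combined frame keeps the same number of rows as the original.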

In [12]:
df = pd.DataFrame()
df['text'] = pd.Series([
    'You should go check out https://regex101.com, it is a great website!',
    'My favorite search engine is https://duckduckgo.com',
    'If you have a question, you can get it answered through http://askjeeves.com, it is great!',
])
df

                                                text
0  You should go check out https://regex101.com, ...
1  My favorite search engine is https://duckduckg...
2  If you have a question, you can get it answere...


In [251]:
matches = df.text.str.extract(r'(https?)://(\w+)\.(\w+)')
matches.columns = ['protocol', 'base_domain', 'tld']
matches

  protocol base_domain  tld
0    https    regex101  com
1    https  duckduckgo  com
2     http   askjeeves  com


### Interactive Regex Tool

To install the `hlre` tool:

```
python -m pip install hlre
```

[For more documentation and the source](https://github.com/zgulde/hlre)

See also [regex101](https://regex101.com) (make sure to select the Python flavor)

### Named capture groups

In [14]:
text = 'You should go check out https://regex101.com, it is a great website!'

match = re.search(r'(?P<protocol>https?)://(?P<base_domain>\w+)\.(?P<tld>\w+)', text)
match.groupdict()

{'protocol': 'https', 'base_domain': 'regex101', 'tld': 'com'}

In [15]:
df.text.str.extract(r'(?P<protocol>https?)://(?P<base_domain>\w+)\.(?P<tld>\w+)')

  protocol base_domain  tld
0    https    regex101  com
1    https  duckduckgo  com
2     http   askjeeves  com
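
Named groups can also be read one at a time and referenced in a replacement with `\g<name>`. A short sketch reusing the pattern above on one of the sample sentences:

```python
import re

pattern = r'(?P<protocol>https?)://(?P<base_domain>\w+)\.(?P<tld>\w+)'
text = 'My favorite search engine is https://duckduckgo.com'

match = re.search(pattern, text)
base_domain = match.group('base_domain')  # look a group up by name
tld = match['tld']                        # indexing the match works too

# \g<name> refers to a named group inside the replacement string
rewritten = re.sub(pattern, r'\g<base_domain> (\g<tld>)', text)

print(base_domain, tld, rewritten)
```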


### Verbose regular expressions

- `re.VERBOSE`
- `(?# this is a comment)`

In [16]:
text = 'You should go check out https://regex101.com, it is a great website!'

regexp = r'''
(?P<protocol>https?)
:// (?# ignore the :// that separates protocol from domain)
(?P<base_domain>\w+)
\.
(?P<tld>\w+)
'''
match = re.search(regexp, text, re.VERBOSE) # whitespace in the regex is ignored
match.groupdict()

{'protocol': 'https', 'base_domain': 'regex101', 'tld': 'com'}
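
Two more things that are handy with `re.VERBOSE`, sketched with a made-up date pattern: `#` starts a comment that runs to the end of the line, and a literal space has to be escaped because unescaped whitespace is ignored.

```python
import re

date_pattern = re.compile(r'''
    (?P<year>\d{4})    # four-digit year
    -
    (?P<month>\d{2})   # two-digit month
    -
    (?P<day>\d{2})     # two-digit day
''', re.VERBOSE)

parts = date_pattern.search('released on 2021-01-12').groupdict()

# Unescaped whitespace is dropped, so 'a b' is read as the pattern 'ab'
ignored_space = re.search(r'a b', 'a b', re.VERBOSE)
escaped_space = re.search(r'a\ b', 'a b', re.VERBOSE)

print(parts, ignored_space, escaped_space.group())
```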