# Regex

- What is a regular expression?
- When are regular expressions useful?

In [34]:
import pandas as pd
import re # part of the python stdlib

In [2]:
log_file_lines = '''
76.185.131.226 - - [11/May/2020:14:25:53 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:25:46 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:25:58 +0000] "GET / HTTP/1.1" 200 42 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
76.185.131.226 - - [11/May/2020:16:25:58 +0000] "GET /favicon.ico HTTP/1.1" 200 162 "https://python.zach.lol/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
104.5.217.57 - - [11/May/2020:16:26:27 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:26:46 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:26:54 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
104.5.217.57 - - [11/May/2020:16:27:04 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:27:05 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:27:10 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
'''

In [3]:
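# the 'r' prefix makes the string a raw string, so backslashes reach the regex engine unchanged
# it works with any quote style:
# r''
# r""
# r''''''

# the '.' in HTTP/1.1 is escaped so it matches a literal period rather than any character
regex = r'(?P<ip>.*?)\s.*?\[(?P<timestamp>.*?)\]\s+"(?P<method>[A-Z]+)\s(?P<path>.*?)\sHTTP/1\.1"\s(?P<status>\d+)\s(?P<bytes_sent>\d+)\s"(?P<referrer>.*?)"\s"(?P<user_agent>.*?)"'
regex = re.compile(regex, re.VERBOSE)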

lines = pd.Series(log_file_lines.strip().split('\n'))
lines.str.extract(regex)

Unnamed: 0,ip,timestamp,method,path,status,bytes_sent,referrer,user_agent
0,76.185.131.226,11/May/2020:14:25:53 +0000,GET,/,200,42,-,python-requests/2.23.0
1,76.185.131.226,11/May/2020:16:25:46 +0000,GET,/,200,42,-,python-requests/2.23.0
2,76.185.131.226,11/May/2020:16:25:58 +0000,GET,/,200,42,-,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6...
3,76.185.131.226,11/May/2020:16:25:58 +0000,GET,/favicon.ico,200,162,https://python.zach.lol/,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6...
4,104.5.217.57,11/May/2020:16:26:27 +0000,GET,/,200,42,-,python-requests/2.23.0
5,76.185.131.226,11/May/2020:16:26:46 +0000,GET,/documentation,200,348,-,python-requests/2.23.0
6,76.185.131.226,11/May/2020:16:26:54 +0000,GET,/documentation,200,348,-,python-requests/2.23.0
7,104.5.217.57,11/May/2020:16:27:04 +0000,GET,/documentation,200,348,-,python-requests/2.23.0
8,76.185.131.226,11/May/2020:16:27:05 +0000,GET,/documentation,200,348,-,python-requests/2.23.0
9,76.185.131.226,11/May/2020:16:27:10 +0000,GET,/documentation,200,348,-,python-requests/2.23.0


- search: returns the first match for a regex in a subject (or `None` if there is no match)
- findall: returns a list of *all* the matches for a regex in a subject
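
The difference between the two functions can be sketched as follows. This is a minimal illustration on a made-up subject string (not one from the notebook); it also shows what the `span` on a match object means.

```python
import re

subject = 'class blob class'  # hypothetical subject, for illustration only

# re.search returns a Match object for the first match (or None)
first = re.search(r'cl', subject)
print(first.group())  # the matched text: 'cl'
print(first.span())   # (start, end) indexes into subject, end exclusive: (0, 2)
print(subject[first.start():first.end()])  # slicing with the span gives 'cl' again

# re.findall returns a list of every non-overlapping match
all_matches = re.findall(r'cl', subject)
print(all_matches)    # ['cl', 'cl']

# when nothing matches, search gives None and findall gives an empty list
missing = re.search(r'z', subject)
print(missing, re.findall(r'z', subject))  # None []
```

Because `re.search` can return `None`, it is usually worth checking the result before calling `.group()` on it.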

### Literals

In [13]:
# single characters work the way you'd expect
# this regular expression looks for the pattern of the letter b

regexp = r'b'
subject = 'abc'
subject2= 'Hello this is time for class blob yes class time is now'

re.search(regexp, subject)

<re.Match object; span=(1, 2), match='b'>

In [14]:
re.search(regexp, subject2)

<re.Match object; span=(29, 30), match='b'>

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <ol>
        <li>Change your regular expression to match the literal character "c". What do you notice?</li>
        <li>Change your regular expression to match the literal string "ab". What do you notice?</li>
        <li>Change your regular expression to match the literal "d". What do you notice?</li>
        <li>Use <code>re.findall</code> instead of <code>re.search</code>. How do the results differ?</li>
    </ol>
</div>

In [15]:
# 1
regexp = r'c'

re.search(regexp, subject)

<re.Match object; span=(2, 3), match='c'>

In [16]:
# 2
regexp = r'ab'

re.search(regexp, subject)

# span is the (start, end) index range of the match within the subject; the end index is exclusive

<re.Match object; span=(0, 2), match='ab'>

In [36]:
# 3
regexp = r'd'

re.search(regexp, subject)

# there is no 'd' in the subject, so re.search returns None and the cell shows no output

In [21]:
# 4

regexp = r'a'

re.findall(regexp, subject2)

# returns a list

regexp2 = r'cl'

re.findall(regexp2, subject2)

['cl', 'cl']

In [30]:
print('\\n\n\n')
print('\\n')

\n


\n
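
The escaping shown above is the reason the patterns in this notebook use the `r` prefix. A small sketch of the difference, assuming nothing beyond the standard library (the `'cat concat cat'` subject is made up for illustration):

```python
import re

# In a normal string '\n' is one character (a newline);
# in a raw string r'\n' is two characters: a backslash and the letter n.
lengths = (len('\n'), len(r'\n'))
print(lengths)  # (1, 2)

# '\b' in a normal string is a backspace character, so the regex engine
# never sees the two-character sequence \b that means "word boundary".
without_raw = re.findall('\bcat\b', 'cat concat cat')
with_raw = re.findall(r'\bcat\b', 'cat concat cat')
print(without_raw)  # []
print(with_raw)     # ['cat', 'cat']
```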


In [37]:
pattern = re.compile('d')

pattern.findall('dog')

['d']

In [38]:
pattern.search('dog', 0)

<re.Match object; span=(0, 1), match='d'>
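
A compiled pattern exposes the same functions as methods, and the second argument to `search` is the index to start searching from. A brief sketch with a made-up subject:

```python
import re

pattern = re.compile(r'd')

# same behavior as re.findall(r'd', 'daddy')
found = pattern.findall('daddy')
print(found)  # ['d', 'd', 'd']

# the second argument (pos) skips the start of the string,
# so the 'd' at index 0 is ignored and the next one is found
later = pattern.search('daddy', 1)
print(later.span())  # (2, 3)
```

Compiling is mostly a convenience when the same pattern is reused many times.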

### Metacharacters

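- `.`: any character except a newline
- `\w`: any letter, number, or underscore; `\W`: anything that is *not* a letter, number, or underscore
- `\s`: any type of whitespace; `\S`: anything that is *not* whitespace
- `\d`: any numeral; `\D`: anything that is *not* a numeral
- Capital variants match the opposite of their lowercase counterparts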
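
As a quick illustration of the list above, here is each metacharacter applied to one short, made-up subject (the string is an assumption chosen to contain letters, digits, an underscore, whitespace, and punctuation):

```python
import re

subject = 'Room 42_b, floor 3!'  # hypothetical subject

digits = re.findall(r'\d', subject)
words = re.findall(r'\w+', subject)     # note that '_' counts as a word character
spaces = re.findall(r'\s', subject)
non_words = re.findall(r'\W', subject)
any_char = re.findall(r'.', 'a\nb')     # '.' does not match the newline

print(digits)     # ['4', '2', '3']
print(words)      # ['Room', '42_b', 'floor', '3']
print(spaces)     # [' ', ' ', ' ']
print(non_words)  # [' ', ',', ' ', ' ', '!']
print(any_char)   # ['a', 'b']
```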

In [39]:
regexp = r'\w'

re.findall(regexp, subject2)

# returns every character that is a letter, number, or underscore

['H',
 'e',
 'l',
 'l',
 'o',
 't',
 'h',
 'i',
 's',
 'i',
 's',
 't',
 'i',
 'm',
 'e',
 'f',
 'o',
 'r',
 'c',
 'l',
 'a',
 's',
 's',
 'b',
 'l',
 'o',
 'b',
 'y',
 'e',
 's',
 'c',
 'l',
 'a',
 's',
 's',
 't',
 'i',
 'm',
 'e',
 'i',
 's',
 'n',
 'o',
 'w']

In [40]:
regexp = r'\s'

re.findall(regexp, subject2)

[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']

In [42]:
regexp = r'\d'

re.findall(regexp, subject2)

[]

In [41]:
regexp = r'\W'

re.findall(regexp, subject2)

[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>Continue to use the same subject variable from above.</p>
    <ol>
        <li>Use all of the above metacharacters with <code>re.findall</code>. What do you notice?</li>
        <li>What does the regular expression <code>\w\w</code> match?</li>
        <li>Use only metacharacters to write a regular expression to match "c 1".</li>
        <li>Use a combination of metacharacters to match 3 digits in a row.</li>
    </ol>
</div>

In [43]:
subject3 = r"I'm a little teapot, short and shout_. Here is my handle; here is my spout!"

regexp = r'.'

re.findall(regexp, subject3)

['I',
 "'",
 'm',
 ' ',
 'a',
 ' ',
 'l',
 'i',
 't',
 't',
 'l',
 'e',
 ' ',
 't',
 'e',
 'a',
 'p',
 'o',
 't',
 ',',
 ' ',
 's',
 'h',
 'o',
 'r',
 't',
 ' ',
 'a',
 'n',
 'd',
 ' ',
 's',
 'h',
 'o',
 'u',
 't',
 '_',
 '.',
 ' ',
 'H',
 'e',
 'r',
 'e',
 ' ',
 'i',
 's',
 ' ',
 'm',
 'y',
 ' ',
 'h',
 'a',
 'n',
 'd',
 'l',
 'e',
 ';',
 ' ',
 'h',
 'e',
 'r',
 'e',
 ' ',
 'i',
 's',
 ' ',
 'm',
 'y',
 ' ',
 's',
 'p',
 'o',
 'u',
 't',
 '!']

In [44]:
regexp = r'\w'

re.findall(regexp, subject3)

['I',
 'm',
 'a',
 'l',
 'i',
 't',
 't',
 'l',
 'e',
 't',
 'e',
 'a',
 'p',
 'o',
 't',
 's',
 'h',
 'o',
 'r',
 't',
 'a',
 'n',
 'd',
 's',
 'h',
 'o',
 'u',
 't',
 '_',
 'H',
 'e',
 'r',
 'e',
 'i',
 's',
 'm',
 'y',
 'h',
 'a',
 'n',
 'd',
 'l',
 'e',
 'h',
 'e',
 'r',
 'e',
 'i',
 's',
 'm',
 'y',
 's',
 'p',
 'o',
 'u',
 't']

In [45]:
regexp = r'\W'

re.findall(regexp, subject3)

["'",
 ' ',
 ' ',
 ' ',
 ',',
 ' ',
 ' ',
 ' ',
 '.',
 ' ',
 ' ',
 ' ',
 ' ',
 ';',
 ' ',
 ' ',
 ' ',
 ' ',
 '!']

In [46]:
regexp = r'\s'

re.findall(regexp, subject3)

[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']

In [47]:
regexp = r'\d'

re.findall(regexp, subject3)

[]

In [48]:
regexp = r'\D'

re.findall(regexp, subject3)

['I',
 "'",
 'm',
 ' ',
 'a',
 ' ',
 'l',
 'i',
 't',
 't',
 'l',
 'e',
 ' ',
 't',
 'e',
 'a',
 'p',
 'o',
 't',
 ',',
 ' ',
 's',
 'h',
 'o',
 'r',
 't',
 ' ',
 'a',
 'n',
 'd',
 ' ',
 's',
 'h',
 'o',
 'u',
 't',
 '_',
 '.',
 ' ',
 'H',
 'e',
 'r',
 'e',
 ' ',
 'i',
 's',
 ' ',
 'm',
 'y',
 ' ',
 'h',
 'a',
 'n',
 'd',
 'l',
 'e',
 ';',
 ' ',
 'h',
 'e',
 'r',
 'e',
 ' ',
 'i',
 's',
 ' ',
 'm',
 'y',
 ' ',
 's',
 'p',
 'o',
 'u',
 't',
 '!']

In [49]:
regexp = r'\w\w'

re.findall(regexp, subject3)

['li',
 'tt',
 'le',
 'te',
 'ap',
 'ot',
 'sh',
 'or',
 'an',
 'sh',
 'ou',
 't_',
 'He',
 're',
 'is',
 'my',
 'ha',
 'nd',
 'le',
 'he',
 're',
 'is',
 'my',
 'sp',
 'ou']

In [50]:
regexp = r'\S'

re.findall(regexp, subject3)

['I',
 "'",
 'm',
 'a',
 'l',
 'i',
 't',
 't',
 'l',
 'e',
 't',
 'e',
 'a',
 'p',
 'o',
 't',
 ',',
 's',
 'h',
 'o',
 'r',
 't',
 'a',
 'n',
 'd',
 's',
 'h',
 'o',
 'u',
 't',
 '_',
 '.',
 'H',
 'e',
 'r',
 'e',
 'i',
 's',
 'm',
 'y',
 'h',
 'a',
 'n',
 'd',
 'l',
 'e',
 ';',
 'h',
 'e',
 'r',
 'e',
 'i',
 's',
 'm',
 'y',
 's',
 'p',
 'o',
 'u',
 't',
 '!']

In [52]:
regexp = r'\w\s\d'
subject4 = 'c 1'
re.findall(regexp, subject4)

['c 1']

In [53]:
subject5 = r"1, 2, 3 strikes you're out at the old ballgame"

regexp = r'\d'

re.findall(regexp, subject5)

['1', '2', '3']

### Repeating

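- `{}`: custom number of repetitions
    - `{x}`: exactly x repetitions
    - `{x,}`: x or more
    - `{x,y}`: between x and y repetitions
- `*`: zero or more
- `+`: one or more
- `?`: optional (zero or one)
- `?` after another quantifier (`*?`, `+?`, `{x,y}?`): makes it non-greedy, so it matches as little as possible instead of as much as possible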
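
The quantifiers above can be sketched on a couple of made-up subjects. The strings here are assumptions for illustration, chosen so the greedy and non-greedy results differ visibly:

```python
import re

numbers = '1 22 333 4444'
two_to_three = re.findall(r'\d{2,3}', numbers)
three_or_more = re.findall(r'\d{3,}', numbers)
optional_u = re.findall(r'colou?r', 'color colour')

print(two_to_three)   # ['22', '333', '444']  (the last '4' is left over)
print(three_or_more)  # ['333', '4444']
print(optional_u)     # ['color', 'colour']

html = '<b>bold</b> and <i>italic</i>'
greedy = re.findall(r'<.+>', html)   # .+ grabs as much as it can
lazy = re.findall(r'<.+?>', html)    # .+? stops at the first '>'

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```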

In [56]:
regexp = r'\d{1,3}'

re.findall(regexp, subject5)

['1', '2', '3']

In [57]:
regexp = r'\d{1}'

re.findall(regexp, subject5)

['1', '2', '3']

In [58]:
regexp = r'\d{2}'

re.findall(regexp, subject5)

[]

In [59]:
# remember that \w matches a single letter, number, or underscore, so \w+ matches a whole run of them
# and stops when it reaches a character that is not a letter, number, or underscore

regexp = r'\w+'

re.findall(regexp, subject5)

['1', '2', '3', 'strikes', 'you', 're', 'out', 'at', 'the', 'old', 'ballgame']

In [60]:
regexp = r'\w*'

re.findall(regexp, subject5)

['1',
 '',
 '',
 '2',
 '',
 '',
 '3',
 '',
 'strikes',
 '',
 'you',
 '',
 're',
 '',
 'out',
 '',
 'at',
 '',
 'the',
 '',
 'old',
 '',
 'ballgame',
 '']

In [61]:
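# escaping the question mark makes it a literal '?' instead of a quantifier;
# the subject contains no question marks, so nothing matches
regexp = r'\?'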

re.findall(regexp, subject5)

[]

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>Use the string below as your subject for this exercise.</p>
    <pre><code>Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, San Antonio, TX 78230. You can find us online at http://codeup.com and our alumni portal is located at https://alumni.codeup.com.</code></pre>
    <ol>
        <li>Write a regular expression that matches all the numbers.</li>
        <li>Write a regular expression that matches a 5 digit number, but not a number with fewer digits.</li>
        <li>Write a regular expression that matches <code>http://</code> or <code>https://</code>.</li>
        <li>Write a regular expression that matches all of the words.</li>
    </ol>
</div>

In [89]:
regexp = r"\d+"
subject = (
    'Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, '
    'San Antonio, TX 78230. You can find us online at http://codeup.com '
    'and our alumni portal is located at https://alumni.codeup.com. '
    "It's a great school!"
)

re.findall(regexp, subject)

['2014', '600', '350', '78230']

In [90]:
# what if we wanted to account for float numbers?

subject2 = 'Codeup is located at 600 Navarro St. Suite 350, San Antonio, TX 78230. I have a 3.5 GPA.'

regexp = r'\d+\.\d+'

re.findall(regexp, subject2)

# this only matches the float and skips the whole numbers, which is not what we want

['3.5']

In [92]:
regexp = r'\d+\.?\d*'

re.findall(regexp, subject2)

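# close, but the optional '.' also grabs the sentence-ending period after 78230;
# r'\d+(?:\.\d+)?' only accepts a '.' when digits follow it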

['600', '350', '78230.', '3.5']
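
One way to avoid picking up the trailing period is to make the whole decimal part optional as a unit. A minimal sketch, using the same subject as above; the non-capturing group `(?:...)` is used so that `re.findall` still returns the full match rather than the group's contents:

```python
import re

subject2 = 'Codeup is located at 600 Navarro St. Suite 350, San Antonio, TX 78230. I have a 3.5 GPA.'

# digits, optionally followed by a '.' AND more digits
number = r'\d+(?:\.\d+)?'
numbers_found = re.findall(number, subject2)
print(numbers_found)  # ['600', '350', '78230', '3.5']
```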

In [65]:
regexp = r'\d{5}'

re.findall(regexp, subject)

['78230']

In [96]:
regexp = r'https?://'

re.findall(regexp, subject)

['http://', 'https://']

In [86]:
regexp = r'\w+'

re.findall(regexp, subject)

['Codeup',
 'founded',
 'in',
 '2014',
 'is',
 'located',
 'at',
 '600',
 'Navarro',
 'St',
 'Suite',
 '350',
 'San',
 'Antonio',
 'TX',
 '78230',
 'You',
 'can',
 'find',
 'us',
 'online',
 'at',
 'http',
 'codeup',
 'com',
 'and',
 'our',
 'alumni',
 'portal',
 'is',
 'located',
 'at',
 'https',
 'alumni',
 'codeup',
 'com',
 'It',
 's',
 'a',
 'great',
 'school']

### Any/None Of

In [None]:
# we can use brackets '[]' to match any one character out of a set of characters

# '-' between two characters denotes a range of characters, e.g. [a-z]
# '^' as the first character inside brackets acts as a NOT, e.g. [^a-z]
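
A short sketch of sets, ranges, and negation, using a made-up subject string (an assumption for illustration):

```python
import re

subject = 'TX 78230, Suite 350'  # hypothetical subject

vowels = re.findall(r'[aeiou]', subject)         # any one of the listed characters
capitals = re.findall(r'[A-Z]+', subject)        # a range, repeated
digit_runs = re.findall(r'[0-9]+', subject)      # same idea as \d+
not_listed = re.findall(r'[^0-9 ,]+', subject)   # anything except digits, spaces, commas

print(vowels)      # ['u', 'i', 'e']
print(capitals)    # ['TX', 'S']
print(digit_runs)  # ['78230', '350']
print(not_listed)  # ['TX', 'Suite']
```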

In [97]:
regexp = r'[a-z]'

re.findall(regexp, subject)

# returns any lower case letter

['o',
 'd',
 'e',
 'u',
 'p',
 'f',
 'o',
 'u',
 'n',
 'd',
 'e',
 'd',
 'i',
 'n',
 'i',
 's',
 'l',
 'o',
 'c',
 'a',
 't',
 'e',
 'd',
 'a',
 't',
 'a',
 'v',
 'a',
 'r',
 'r',
 'o',
 't',
 'u',
 'i',
 't',
 'e',
 'a',
 'n',
 'n',
 't',
 'o',
 'n',
 'i',
 'o',
 'o',
 'u',
 'c',
 'a',
 'n',
 'f',
 'i',
 'n',
 'd',
 'u',
 's',
 'o',
 'n',
 'l',
 'i',
 'n',
 'e',
 'a',
 't',
 'h',
 't',
 't',
 'p',
 'c',
 'o',
 'd',
 'e',
 'u',
 'p',
 'c',
 'o',
 'm',
 'a',
 'n',
 'd',
 'o',
 'u',
 'r',
 'a',
 'l',
 'u',
 'm',
 'n',
 'i',
 'p',
 'o',
 'r',
 't',
 'a',
 'l',
 'i',
 's',
 'l',
 'o',
 'c',
 'a',
 't',
 'e',
 'd',
 'a',
 't',
 'h',
 't',
 't',
 'p',
 's',
 'a',
 'l',
 'u',
 'm',
 'n',
 'i',
 'c',
 'o',
 'd',
 'e',
 'u',
 'p',
 'c',
 'o',
 'm',
 't',
 's',
 'a',
 'g',
 'r',
 'e',
 'a',
 't',
 's',
 'c',
 'h',
 'o',
 'o',
 'l']

In [98]:
regexp = r'[a-zA-Z]'

re.findall(regexp, subject)

# returns all the letters

['C',
 'o',
 'd',
 'e',
 'u',
 'p',
 'f',
 'o',
 'u',
 'n',
 'd',
 'e',
 'd',
 'i',
 'n',
 'i',
 's',
 'l',
 'o',
 'c',
 'a',
 't',
 'e',
 'd',
 'a',
 't',
 'N',
 'a',
 'v',
 'a',
 'r',
 'r',
 'o',
 'S',
 't',
 'S',
 'u',
 'i',
 't',
 'e',
 'S',
 'a',
 'n',
 'A',
 'n',
 't',
 'o',
 'n',
 'i',
 'o',
 'T',
 'X',
 'Y',
 'o',
 'u',
 'c',
 'a',
 'n',
 'f',
 'i',
 'n',
 'd',
 'u',
 's',
 'o',
 'n',
 'l',
 'i',
 'n',
 'e',
 'a',
 't',
 'h',
 't',
 't',
 'p',
 'c',
 'o',
 'd',
 'e',
 'u',
 'p',
 'c',
 'o',
 'm',
 'a',
 'n',
 'd',
 'o',
 'u',
 'r',
 'a',
 'l',
 'u',
 'm',
 'n',
 'i',
 'p',
 'o',
 'r',
 't',
 'a',
 'l',
 'i',
 's',
 'l',
 'o',
 'c',
 'a',
 't',
 'e',
 'd',
 'a',
 't',
 'h',
 't',
 't',
 'p',
 's',
 'a',
 'l',
 'u',
 'm',
 'n',
 'i',
 'c',
 'o',
 'd',
 'e',
 'u',
 'p',
 'c',
 'o',
 'm',
 'I',
 't',
 's',
 'a',
 'g',
 'r',
 'e',
 'a',
 't',
 's',
 'c',
 'h',
 'o',
 'o',
 'l']

In [103]:
regexp = r'\w+[a-zA-Z]'

re.findall(regexp, subject)

# matches runs of two or more word characters that end in a letter,
# so the numbers and the single-letter words ('a', 's') are left out

['Codeup',
 'founded',
 'in',
 'is',
 'located',
 'at',
 'Navarro',
 'St',
 'Suite',
 'San',
 'Antonio',
 'TX',
 'You',
 'can',
 'find',
 'us',
 'online',
 'at',
 'http',
 'codeup',
 'com',
 'and',
 'our',
 'alumni',
 'portal',
 'is',
 'located',
 'at',
 'https',
 'alumni',
 'codeup',
 'com',
 'It',
 'great',
 'school']

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>For this exercise you should make up various subjects and test them with your regular expressions.</p>
    <ol>
        <li>Write a regular expression that matches even numbers.</li>
        <li>Write a regular expression that matches 2 or more odd numbers in a row.</li>
        <li>Write a regular expression that matches any word with a vowel in it.</li>
    </ol>
</div>

In [111]:
subject = "I graduated highschool in 2013. My GPA was 3.36. I joined the Army at age 17. I can't remember when I started basic Training, but I know I graduated on October 31, 2013."
subject2 = 'I started playing softball when I was 10 I think. My number was 6. I usually batted 1st in the lineup.'

In [121]:
regexp = r'\d*[02468]'

re.findall(regexp, subject)

['20', '36', '20']

In [113]:
re.findall(regexp, subject2)

['10', '6']

In [123]:
regexp = r'\d*[13579]{2}'

re.findall(regexp, subject2)

[]

In [124]:
re.findall(regexp, subject)

['2013', '17', '31', '2013']

In [128]:
# the match stops at the last vowel in each word; add a trailing \w* to capture whole words
regexp = r'\w*[aeiouAEIOU]'
re.findall(regexp, subject)

['I',
 'graduate',
 'highschoo',
 'i',
 'GPA',
 'wa',
 'I',
 'joine',
 'the',
 'A',
 'a',
 'age',
 'I',
 'ca',
 'remembe',
 'whe',
 'I',
 'starte',
 'basi',
 'Traini',
 'bu',
 'I',
 'kno',
 'I',
 'graduate',
 'o',
 'Octobe']

### Anchors

- `^`: starts with
- `$`: ends with
- `\b`: word boundary
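
The anchors above can be sketched like this. The subject strings are made up for illustration, and the patterns are one possible approach to the kinds of questions in this notebook rather than the only answers:

```python
import re

subject = 'An Apple a day keeps EVERY doctor awaY'  # hypothetical subject

# ^ and $ anchor to the start and end of the whole subject
starts = re.search(r'^An', subject) is not None
ends = re.search(r'awaY$', subject) is not None
not_at_start = re.search(r'^Apple', subject)

# \b anchors to the edge of a word, so it can test every word
vowel_start = re.findall(r'\b[aeiouAEIOU]\w*', subject)
capital_start = re.findall(r'\b[A-Z]\w*', subject)
capital_end = re.findall(r'\b\w*[A-Z]\b', subject)
capital_both = re.findall(r'\b[A-Z]\w*[A-Z]\b', subject)

print(starts, ends, not_at_start)  # True True None
print(vowel_start)    # ['An', 'Apple', 'a', 'EVERY', 'awaY']
print(capital_start)  # ['An', 'Apple', 'EVERY']
print(capital_end)    # ['EVERY', 'awaY']
print(capital_both)   # ['EVERY']

# word boundaries also stop a pattern from matching only part of a number
evens = re.findall(r'\b\d*[02468]\b', 'born 2013, age 17, room 36, 10 items')
print(evens)  # ['36', '10']
```

Note that the "starts and ends with a capital" pattern needs at least two characters, so a one-letter word such as `I` would not match it.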

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>For this exercise you should make up various subjects and test them with your regular expressions.</p>
    <ol>
        <li>Write a regular expression that matches if a word starts with a vowel.</li>
        <li>Write a regular expression that matches if a word starts with a capital letter.</li>
        <li>Write a regular expression that matches if a word ends with a capital letter.</li>        
        <li>Write a regular expression that matches if a word starts <b>and</b> ends with a capital letter.</li>
    </ol>
</div>

In [None]:
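# ^ anchors to the start of the whole subject, so this only checks the first word;
# use \b instead to test the start of every word
regexp = r'^[aeiouAEIOU]'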

### Capture Groups

## `re.sub`

- removing
- substitution

In [129]:
# capture groups are denoted by parentheses

regexp = r'([abc]+) ([abc]+)'
subject = 'abbbbbc cbbaa bbbba sandwich baba ab zz'

re.findall(regexp, subject)

[('abbbbbc', 'cbbaa'), ('baba', 'ab')]

In [130]:
regexp = 'sandwich'

re.sub(regexp, 'slamwich', subject)

'abbbbbc cbbaa bbbba slamwich baba ab zz'

In [139]:
# capture groups are numbered starting at 1 and can be referenced in the replacement as \1, \2, ...

subject = 'Hello, I would like a ham sandwich please'

regexp = r'(ham)\s+(\w+)'
re.sub(regexp, r'\2 of \1', subject)

'Hello, I would like a sandwich of ham please'

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>Use the code below to get started on this exercise.</p>
    <pre><code>dates = pd.Series(['2020-11-12', '2020-07-13', '2021-01-12'])</code></pre>
    <p>Use regular expression substitution to reformat the dates in the format common in the US: m/d/y.</p>
</div>
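
One possible approach to the date exercise, sketched with plain `re.sub` on a Python list holding the same values as the series. Each part of the date is captured in a group and the groups are reordered in the replacement:

```python
import re

dates = ['2020-11-12', '2020-07-13', '2021-01-12']

# group 1 = year, group 2 = month, group 3 = day
pattern = r'(\d{4})-(\d{2})-(\d{2})'
us_dates = [re.sub(pattern, r'\2/\3/\1', d) for d in dates]

print(us_dates)  # ['11/12/2020', '07/13/2020', '01/12/2021']
```

The same pattern and replacement string can be passed to `Series.str.replace` with `regex=True` to do this on the pandas series directly.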

## Misc

### Pandas Usage

- `.str`
    - `.extract`
    - `.count`
    - `.contains`
    - `.replace`
- extract + concat
- named groups

In [140]:
df = pd.DataFrame()
df['text'] = pd.Series([
    'You should go check out https://regex101.com, it is a great website!',
    'My favorite search engine is https://duckduckgo.com',
    'If you have a question, you can get it answered through http://askjeeves.com, it is great!',
])


In [141]:
df

Unnamed: 0,text
0,"You should go check out https://regex101.com, ..."
1,My favorite search engine is https://duckduckg...
2,"If you have a question, you can get it answere..."
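
A sketch of the `.str` methods listed above, applied to the same three strings. It assumes pandas is installed; the intermediate variable names (`has_url`, `great_count`, `without_protocol`, `parts`, `combined`) are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'text': [
    'You should go check out https://regex101.com, it is a great website!',
    'My favorite search engine is https://duckduckgo.com',
    'If you have a question, you can get it answered through http://askjeeves.com, it is great!',
]})

# .contains: does each row match the pattern at least once?
has_url = df.text.str.contains(r'https?://\w+\.\w+')

# .count: how many times does the pattern match in each row?
great_count = df.text.str.count(r'great')

# .replace: regex substitution, here stripping the protocol
without_protocol = df.text.str.replace(r'https?://', '', regex=True)

# .extract with named groups gives one column per group;
# concat puts those columns next to the original text
parts = df.text.str.extract(r'(?P<protocol>https?)://(?P<base_domain>\w+)\.(?P<tld>\w+)')
combined = pd.concat([df, parts], axis=1)

print(has_url.tolist())      # [True, True, True]
print(great_count.tolist())  # [1, 0, 1]
print(combined[['protocol', 'base_domain', 'tld']])
```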


### Interactive Regex Tool

To install the `hlre` tool:

```
python -m pip install hlre
```

[For more documentation and the source](https://github.com/zgulde/hlre)

See also [regex101](https://regex101.com) (make sure to select the Python flavor)

### Named capture groups

In [None]:
text = 'You should go check out https://regex101.com, it is a great website!'

match = re.search(r'(?P<protocol>https?)://(?P<base_domain>\w+)\.(?P<tld>\w+)', text)
match.groupdict()

In [None]:
df.text.str.extract(r'(?P<protocol>https?)://(?P<base_domain>\w+)\.(?P<tld>\w+)')

### Verbose regular expressions

- `re.VERBOSE`
- `(?# this is a comment)`

In [None]:
text = 'You should go check out https://regex101.com, it is a great website!'

regexp = r'''
(?P<protocol>https?)
:// (?# ignore the :// that separates protocol from domain)
(?P<base_domain>\w+)
\.
(?P<tld>\w+)
'''
match = re.search(regexp, text, re.VERBOSE) # whitespace in the regex is ignored
match.groupdict()