# Regex

- What is a regular expression?
    - a language for describing patterns in text.
    - a way to describe things like "a series of digits, followed by whitespace, followed by the letters a-z"
    - focusing on python-flavored regex here, but regular expressions as a concept are larger in scope
        - similar pattern matching shows up in SQL (the LIKE operator w/ wildcards is a much simpler cousin; many databases also have regex operators)
- When are regular expressions useful?
    - when parsing regular text 
        - (parsing = extracting meaning from something)
        - break text down to its parts (component pieces)
    - text that has some sort of structure
    - data acquisition, data preparation, maybe even a bit in exploration
    - any time we deal w/ or manipulate text data
    - both simple and complex operations (quickly identifying things, replacing text)
    - normalizing whitespace
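
As a small illustration of the whitespace normalization mentioned above, here is a minimal sketch using `re.sub` from the standard library. The `messy` string is made up for this example.

```python
import re

# made-up subject with leading, trailing, and repeated whitespace
messy = '  regex   is \t handy\n'

# \s+ matches each run of whitespace; every run is replaced with a single space
clean = re.sub(r'\s+', ' ', messy).strip()
print(clean)
```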

In [1]:
import pandas as pd
import re

In [2]:
log_file_lines = '''
76.185.131.226 - - [11/May/2020:14:25:53 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:25:46 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:25:58 +0000] "GET / HTTP/1.1" 200 42 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
76.185.131.226 - - [11/May/2020:16:25:58 +0000] "GET /favicon.ico HTTP/1.1" 200 162 "https://python.zach.lol/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
104.5.217.57 - - [11/May/2020:16:26:27 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:26:46 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:26:54 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
104.5.217.57 - - [11/May/2020:16:27:04 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:27:05 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:27:10 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
'''

In [3]:
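#turn the log_file_lines into a df: each named group (?P<name>...) becomes a column
#the '.' in HTTP/1\.1 is escaped so it only matches a literal dot
regex = r'(?P<ip>.*?)\s.*?\[(?P<timestamp>.*?)\]\s+"(?P<method>[A-Z]+)\s(?P<path>.*?)\sHTTP/1\.1"\s(?P<status>\d+)\s(?P<bytes_sent>\d+)\s"(?P<referrer>.*?)"\s"(?P<user_agent>.*?)"'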
regex = re.compile(regex,re.VERBOSE)

lines = pd.Series(log_file_lines.strip().split('\n'))
lines.str.extract(regex)

Unnamed: 0,ip,timestamp,method,path,status,bytes_sent,referrer,user_agent
0,76.185.131.226,11/May/2020:14:25:53 +0000,GET,/,200,42,-,python-requests/2.23.0
1,76.185.131.226,11/May/2020:16:25:46 +0000,GET,/,200,42,-,python-requests/2.23.0
2,76.185.131.226,11/May/2020:16:25:58 +0000,GET,/,200,42,-,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6...
3,76.185.131.226,11/May/2020:16:25:58 +0000,GET,/favicon.ico,200,162,https://python.zach.lol/,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6...
4,104.5.217.57,11/May/2020:16:26:27 +0000,GET,/,200,42,-,python-requests/2.23.0
5,76.185.131.226,11/May/2020:16:26:46 +0000,GET,/documentation,200,348,-,python-requests/2.23.0
6,76.185.131.226,11/May/2020:16:26:54 +0000,GET,/documentation,200,348,-,python-requests/2.23.0
7,104.5.217.57,11/May/2020:16:27:04 +0000,GET,/documentation,200,348,-,python-requests/2.23.0
8,76.185.131.226,11/May/2020:16:27:05 +0000,GET,/documentation,200,348,-,python-requests/2.23.0
9,76.185.131.226,11/May/2020:16:27:10 +0000,GET,/documentation,200,348,-,python-requests/2.23.0
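
The same idea also works without pandas. The sketch below parses one of the log lines above with the standard library only. The pattern is a simplified variant written for this example (it captures fewer fields than the one above), so treat it as an illustration rather than a general log parser.

```python
import re

# one line copied from the sample log above
line = '76.185.131.226 - - [11/May/2020:14:25:53 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"'

# simplified pattern for illustration: only some of the fields are captured
pattern = re.compile(
    r'(?P<ip>\S+)\s.*?\[(?P<timestamp>[^\]]+)\]\s"(?P<method>[A-Z]+)\s(?P<path>\S+)\s'
    r'HTTP/\d\.\d"\s(?P<status>\d+)\s(?P<bytes_sent>\d+)'
)

# groupdict() maps each group name to the text it captured
record = pattern.search(line).groupdict()

# captured values are always strings, so convert the numeric fields by hand
record['status'] = int(record['status'])
record['bytes_sent'] = int(record['bytes_sent'])
print(record)
```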


In [4]:
#regular expression module that is part of the python standard library
import re 

### Functions from re module that allow us to apply regular expressions to text
- **re.search**: returns a match object for the *first* match of a regex in a subject (or `None` if there is no match)
- **re.findall**: returns a list of *ALL* the non-overlapping matches for a regex in a subject
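
A minimal sketch of the difference between the two functions, using a made-up subject. `.group()` and `.span()` are methods of the match object that return the matched text and its position.

```python
import re

subject = 'abc abc'

first = re.search(r'b', subject)     # match object for the first 'b' only
every = re.findall(r'b', subject)    # list with one string per match
missing = re.search(r'z', subject)   # no 'z' in the subject, so this is None

print(first.group(), first.span())
print(every)
print(missing)
```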

***
Side note:
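- `r'...'` - means raw: backslashes are kept as-is instead of starting an escape sequence. by convention we write our regular expressions as raw strings.
- `u'...'` - is for unicode (in python 3 every string is already unicode, so the prefix has no effect)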

In [5]:
#a raw string is still a python string; the r prefix only changes how backslashes are read (so r'a' is just 'a')
r'a'

'a'

In [6]:
#string is read as 2 characters: the backslash and the 'n'
r"\n"

'\\n'

In [7]:
#2 characters
len(r"\n")

2

In [8]:
#this is read as a new line character in python
"\n"

'\n'

In [9]:
#1 character
len("\n")

1
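
Why this matters for regular expressions: some regex escapes collide with Python string escapes. A small sketch with a made-up subject, using `\b`, which is a word boundary in a regex but the backspace character in a regular Python string:

```python
import re

subject = 'cat concat cat'

# without the r prefix, python turns '\b' into a backspace character
# before the regex engine ever sees it, so nothing matches
plain = re.findall('\bcat\b', subject)

# with the r prefix, the regex engine receives \b and treats it as a word boundary
raw = re.findall(r'\bcat\b', subject)

print(plain)
print(raw)
```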

***

## Literals
- literals: any character that matches itself (a-z, A-Z, 0-9)
- any letter or number by itself will match exactly that same character in the subject that we're applying the regular expression to.

In [10]:
#generally, we have 2 variables: a regular expression and a subject (piece of text that we will apply the regexp to)
regexp = r'a'
subject = 'abc'

#get back a match object that represents the result of applying regex to the text
re.search(regexp, subject)

<re.Match object; span=(0, 1), match='a'>

- `match` is the 'a' character
- `span` is from 0 to 1
- meaning that the match starts at index 0 and ends just before index 1 in the subject (the end is exclusive, like a slice)
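
One way to check how `span` lines up with the subject is to slice the subject with it. A minimal sketch:

```python
import re

subject = 'abc 123'
match = re.search(r'\d\d', subject)

# span() gives (start, end); the end index is exclusive, just like a slice
start, end = match.span()
print(start, end)
print(subject[start:end])   # same text as match.group()
```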

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>Use the cell above to start experimenting with regular expressions.</p>
    <ol>
        <li>Change your regular expression to match the literal character "b". What do you notice?
            <p>span=(1, 2), match='b'</p></li>
        <li>Change your regular expression to match the literal string "ab". What do you notice?
            <p>span=(0, 2), match='ab'</p></li>
        <li>Change your regular expression to match the literal "d". What do you notice?
            <p>Nothing comes out since there is no 'd' in the subject.</p></li>
        <li>Use <code>re.findall</code> instead of <code>re.search</code>. How do the results differ?
            <p>[] Empty brackets since there is no 'd' in the subject.</p></li>
        <li>Change your regular expression to just the "." character. What are the results?
            <p>['a', 'b', 'c'] with .findall</p>
            <p>span=(0, 1), match='a' with .search</p>
            <p>the '.' is a metacharacter that matches anything</p></li>
    </ol>
</div>

In [11]:
#1. Change your regular expression to match the literal character "b". 
regexp = r'b'
subject = 'abc'

re.search(regexp, subject)

<re.Match object; span=(1, 2), match='b'>

In [12]:
#2. Change your regular expression to match the literal string "ab". 
regexp = r'ab'
subject = 'abc'

re.search(regexp, subject)


<re.Match object; span=(0, 2), match='ab'>

In [13]:
#3. Change your regular expression to match the literal "d". 
regexp = r'd'
subject = 'abc'

re.search(regexp, subject)

In [14]:
#4. Use re.findall instead of re.search. 
regexp = r'd'
subject = 'abc'

re.findall(regexp, subject)

[]

In [15]:
#switch to findall and it gives a list w/ a single string b/c it will produce all the matches in the string
regexp = r'a'
subject = 'abc'

re.findall(regexp, subject)

['a']

In [16]:
#example how it produces all the matches in the string
regexp = r'a'
subject = 'abc abc'

re.findall(regexp, subject)

['a', 'a']

In [17]:
#5. Change your regular expression to just the "." character. 
regexp = r'.'
subject = 'abc'

re.findall(regexp, subject)

['a', 'b', 'c']

In [18]:
#regular expression to just the "." character w/ search func 
regexp = r'.'
subject = 'abc'

re.search(regexp, subject)

<re.Match object; span=(0, 1), match='a'>

### Metacharacters: have a special meaning and match a whole class of character

- `.`: anything (matches any character in the subject except a newline)
- `\w`: any word character (letters, digits, and the underscore)
- `\s`: whitespace
- `\d`: digits
- Capital variants (`\W`, `\S`, `\D`): match anything that is **not** described by the lower case variant
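
Two details of these metacharacters are easy to miss: `\w` includes the underscore, and `.` stops at a newline unless the `re.DOTALL` flag is passed. A short sketch with a made-up subject:

```python
import re

subject = 'snake_case\n42'

# the underscore counts as a word character, so 'snake_case' stays in one piece
words = re.findall(r'\w+', subject)

# by default '.' does not match the newline, so each line is matched separately
per_line = re.findall(r'.+', subject)

# with re.DOTALL the '.' also matches the newline
everything = re.findall(r'.+', subject, re.DOTALL)

print(words)
print(per_line)
print(everything)
```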

In [19]:
#matches any word characters 
#looks for anything a-z, A-Z, 0-9, or an underscore (_)
regexp = r'\w'
subject = 'abc 123'

re.search(regexp, subject)
#.search matches the first instance of a word character ('a')

<re.Match object; span=(0, 1), match='a'>

In [20]:
#matches whitespace
regexp = r'\s'
subject = 'abc 123'

re.search(regexp, subject)
#spans from 3 to 4 and matches the space character

<re.Match object; span=(3, 4), match=' '>

In [21]:
#matches any digit 0-9
regexp = r'\d'
subject = 'abc 123'

re.search(regexp, subject)
#.search matches the first instance of a digit

<re.Match object; span=(4, 5), match='1'>

In [22]:
#capital D matches anything that is NOT a digit
regexp = r'\D'
subject = 'abc 123'

re.search(regexp, subject)
#.search matches the 'a' b/c that is the first character that is not a digit

<re.Match object; span=(0, 1), match='a'>

In [23]:
#capital W matches anything that is NOT a word character
regexp = r'\W'
subject = 'abc 123'

re.search(regexp, subject)
#.search matches the whitespace since it's the first non-word character

<re.Match object; span=(3, 4), match=' '>

In [24]:
#capital S matches anything that is NOT whitespace
regexp = r'\S'
subject = 'abc 123'

re.search(regexp, subject)
#.search matches the 'a' b/c it is the first instance of NON whitespace

<re.Match object; span=(0, 1), match='a'>

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>Continue to use the same subject variable from above.</p>
    <ol>
        <li>Use all of the above metacharacters with <code>re.findall</code>. What do you notice?
            <p>gives back a list of strings w/ each string representing one match of the regular expression.</p></li>
        <li>What does the regular expression <code>\w\w</code> match?
            <p>2 word/alphanumeric characters next to each other.</p></li>
        <li>Use only metacharacters to write a regular expression to match "c 1".</li>
        <li>Use a combination of metacharacters to match 3 digits in a row.</li>
    </ol>
</div>

#### 1. Use all of the above metacharacters with re.findall

In [25]:
#everything that is a word character is returned
regexp = r'\w'
subject = 'abc 123'

re.findall(regexp, subject)

['a', 'b', 'c', '1', '2', '3']

In [26]:
#everything that is whitespace is returned back
regexp = r'\s'
subject = 'abc 123'

re.findall(regexp, subject)

[' ']

In [27]:
#another example w/ more whitespace in the subject: .findall r'\s' returns one ' ' per space
regexp = r'\s'
subject = 'abc easy as 123'

re.findall(regexp, subject)

[' ', ' ', ' ']

In [28]:
#returns anything that is a digit
regexp = r'\d'
subject = 'abc 123'

re.findall(regexp, subject)

['1', '2', '3']

In [29]:
#returns anything that is not a word character
regexp = r'\W'
subject = 'abc 123'

re.findall(regexp, subject)

[' ']

In [30]:
#everything that is not whitespace is returned back 
regexp = r'\S'
subject = 'abc 123'

re.findall(regexp, subject)

['a', 'b', 'c', '1', '2', '3']

In [31]:
#returns everything that is not a digit (notice how it also includes the whitespace)
regexp = r'\D'
subject = 'abc 123'

re.findall(regexp, subject)

['a', 'b', 'c', ' ']

#### 2. What does the regular expression \w\w match?

In [32]:
#gives back 'ab' b/c it's the first time 2 word characters are next to each other
#findall continues after 'ab': 'c' followed by whitespace does not match (whitespace is not a word character)
#'12' then matches as well b/c there's another 2 word chars next to each other
#'3' doesn't show b/c it's just a single character w/ nothing after it at the end of the string
regexp = r'\w\w'
subject = 'abc 123'

re.findall(regexp, subject)

['ab', '12']

In [33]:
#another example of returning 2 word characters next to each other
regexp = r'\w\w'
subject = 'abcdefg 1234 567 890'

re.findall(regexp, subject)

['ab', 'cd', 'ef', '12', '34', '56', '89']

In [34]:
#returns '12' b/c its a sequence of one digit and then another digit
regexp = r'\d\d'
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(4, 6), match='12'>

In [35]:
#another example of 2 digits next to each other
regexp = r'\d\d'
subject = 'abc 12345 6 789 0'

re.findall(regexp, subject)

['12', '34', '78']

In [36]:
#1 more example
regexp = r'\d\d'
subject = 'abc 123456 789 0'

re.findall(regexp, subject)

['12', '34', '56', '78']

#### 3. Use only metacharacters to write a regular expression to match "c 1"

In [37]:
#word character, then whitespace, then another word character
regexp = r'\w\s\w' 
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(2, 5), match='c 1'>

In [38]:
#another way
regexp = r'\w\s\d'
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(2, 5), match='c 1'>

In [39]:
#many other ways
regexp = r'..\d'
subject = 'abc 123'

re.findall(regexp, subject)

['c 1']

In [40]:
#another way
regexp = r'\S\s\S'
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(2, 5), match='c 1'>

In [41]:
#another way
regexp = r'c..'
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(2, 5), match='c 1'>

#### 4. Use a combination of metacharacters to match 3 digits in a row.

In [42]:
#a digit followed by 2 word characters: returns '123' here, but would also match e.g. '1ab'
regexp = r'\d\w\w'
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(4, 7), match='123'>

In [43]:
#another way
regexp = r'\d\d\d'
subject = 'abc 123'

re.findall(regexp, subject)

['123']

In [44]:
#another way
#digit, word character, digit (would also match e.g. '1a3')
regexp = r'\d\w\d'
subject = 'abc 123'

re.findall(regexp, subject)

['123']

### Repeating: modify whatever comes before them

- `{}`: specific number of repetitions
- `*`: zero or more
- `+`: one or more
- `?`: optional
- greedy + non-greedy

In [45]:
#'\w' matches any word char 
#'+' modifies anything that comes before it (the word char)
#returns one or more('+') word characters('\w')
regexp = r'\w+'
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(0, 3), match='abc'>

In [46]:
#common use of '*' (zero or more) is when matching text that might have a lot of whitespace in it
#zero or more space characters, and then the word character
regexp = r'\s*\w'
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(0, 1), match='a'>

In [47]:
#('*') matches all the space characters (zero or more), and then the word character
regexp = r'\s*\w'
subject = '       abc 123'

re.search(regexp, subject)

<re.Match object; span=(0, 8), match='       a'>

In [48]:
#refers to one or more digits
regexp = r'\d+'
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(4, 7), match='123'>

In [49]:
#can specify repetitions w/ the {}
#returns 2 digits in a row (2 repetitions of a digit)
regexp = r'\d{2}' 
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(4, 6), match='12'>

In [50]:
#this regular expression does not match the subject b/c there are no 2 whitespace characters in a row
regexp = r'\s{2}'
subject = 'abc 123'

re.search(regexp, subject)

In [51]:
#inside {} we can use a comma to give a lower and an upper bound: {2,3} means 2 to 3 repetitions
#this matches 2 or 3 word characters, one after the other (as many as it can, since {} is greedy)
regexp = r'\w{2,3}'
subject = 'abc 123'

re.findall(regexp, subject)

['abc', '123']

In [52]:
#example with .search
regexp = r'\w{2,3}'
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(0, 3), match='abc'>

In [53]:
#another example: 2 to 5 word characters
regexp = r'\w{2,5}'
subject = 'abc 123456 you and me supercalifragilisticexpialidocious'

re.findall(regexp, subject)

['abc',
 '12345',
 'you',
 'and',
 'me',
 'super',
 'calif',
 'ragil',
 'istic',
 'expia',
 'lidoc',
 'ious']

In [54]:
#example w/ .search
regexp = r'\w{2,5}'
subject = 'abc 123456 you and me supercalifragilisticexpialidocious'

re.search(regexp, subject)

<re.Match object; span=(0, 3), match='abc'>

In [55]:
#whatever comes before the '?' is optional (can be matched or not)
regexp = r'abcd?'
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(0, 3), match='abc'>

In [56]:
#if you have 'd' or not in subject, it'll still match cause of the '?'
regexp = r'abcd?'
subject = 'abcd 123'

re.search(regexp, subject)

<re.Match object; span=(0, 4), match='abcd'>

In [57]:
#see the difference w/out '?' (doesn't match anything)
regexp = r'abcd'
subject = 'abc 123'

re.search(regexp, subject)

In [58]:
#`+` is greedy by default: it matches as much as it can
regexp = r'\w+' #one or more word characters
subject = 'abc 123'

re.findall(regexp, subject)

['abc', '123']

In [59]:
#it will match one or more word characters until it hits something that is not a word character
regexp = r'\w+'
subject = 'abcde 123 ^'

re.findall(regexp, subject)

['abcde', '123']

In [60]:
#'?' says: make your match, nongreedily
#nongreedy matches as little as possible
regexp = r'\w+?'
subject = 'abc 123'

re.findall(regexp, subject)
#least possible match for one or more word characters is every char by itself

['a', 'b', 'c', '1', '2', '3']

**A `?` after a literal character or metacharacter will make it optional.**


**A `?` after a repetition operator, makes that repetition *nongreedy***
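
Both roles of `?` can be seen side by side in a short sketch. The subjects here are made up for illustration.

```python
import re

# '?' after a literal: the 'u' is optional, so both spellings match
spellings = re.findall(r'colou?r', 'color and colour')

subject = '<b>bold</b>'

# greedy: '.+' runs to the LAST '>' in the subject
greedy = re.findall(r'<.+>', subject)

# '?' after '+': nongreedy, so each match stops at the FIRST '>' it can reach
lazy = re.findall(r'<.+?>', subject)

print(spellings)
print(greedy)
print(lazy)
```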

In [61]:
#matches the whole thing: as much as you possibly can ('+') of anything ('.')
regexp = r'.+'
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(0, 7), match='abc 123'>

In [62]:
#'.+' first grabs as much as it possibly can (the whole subject)
#then it backs off one char at a time until '\d' can match
#so the match ends at the LAST digit in the subject
regexp = r'.+\d'
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(0, 7), match='abc 123'>

In [63]:
#add an 'a' to the subject: the match still ends at the last digit, so the output is the same as above
regexp = r'.+\d'
subject = 'abc 123a'

re.search(regexp, subject)

<re.Match object; span=(0, 7), match='abc 123'>

In [64]:
#'?' after the '+' makes it nongreedy so it's as little as you can match, and then a digit
regexp = r'.+?\d'
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(0, 5), match='abc 1'>

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>Use the string below as your subject for this exercise.</p>
    <pre><code>Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, San Antonio, TX 78230. You can find us online at http://codeup.com and our alumni portal is located at https://alumni.codeup.com.</code></pre>
    <ol>
        <li>Write a regular expression that matches all the numbers.</li>
        <li>Write a regular expression that matches a 5 digit number, but not a number with fewer digits.</li>
        <li>Write a regular expression that matches any urls in the subject.</li>
    </ol>
</div>

In [65]:
#1. Write a regular expression that matches all the numbers.
regexp = r'\d+' #any sequence of 1 or more digits
subject = 'Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, San Antonio, TX 78230. You can find us online at http://codeup.com and our alumni portal is located at https://alumni.codeup.com.'

re.findall(regexp, subject)


['2014', '600', '350', '78230']

In [66]:
#2. Write a regular expression that matches a 5 digit number, but not a number with fewer digits.
regexp = r'\d{5}' #specifies 5 digits
subject = 'Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, San Antonio, TX 78230. You can find us online at http://codeup.com and our alumni portal is located at https://alumni.codeup.com.'

re.findall(regexp, subject)

['78230']

In [67]:
#3. Write a regular expression that matches any urls in the subject.
regexp = r'https?://.+?com' #the first '?' makes the 's' optional; the '?' after '+' makes it nongreedy (matching as little as possible)
subject = 'Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, San Antonio, TX 78230. You can find us online at http://codeup.com and our alumni portal is located at https://alumni.codeup.com.'

re.findall(regexp, subject)

['http://codeup.com', 'https://alumni.codeup.com']

In [68]:
#in 'https?' the '?' makes the 's' optional
#'://' are the literal characters from the url
#'.+' is one or more of anything, up until a whitespace char ('\s')
#and matching as much as it possibly can (output has text 'and our alumni portal is located at ')
regexp = r'https?://.+\s'
subject = 'Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, San Antonio, TX 78230. You can find us online at http://codeup.com and our alumni portal is located at https://alumni.codeup.com.'

re.findall(regexp, subject)

['http://codeup.com and our alumni portal is located at ']

In [69]:
#only returns 1 (codeup.com) b/c the 2nd url is followed by '.' and the end of the string, not a whitespace char
#'.+?\s' means one or more of anything nongreedily up until whitespace char
#so the '?' after '+' is matching as little as possible
regexp = r'https?://.+?\s'
subject = 'Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, San Antonio, TX 78230. You can find us online at http://codeup.com and our alumni portal is located at https://alumni.codeup.com.'

re.findall(regexp, subject)

['http://codeup.com ']

In [70]:
#pulls both urls as ONE match, including the text in between, b/c there is no '?' after the '+', so it is matching greedily
regexp = r'https?://.+com'
subject = 'Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, San Antonio, TX 78230. You can find us online at http://codeup.com and our alumni portal is located at https://alumni.codeup.com.'

re.findall(regexp, subject)

['http://codeup.com and our alumni portal is located at https://alumni.codeup.com']
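
The patterns above rely on every URL ending in `com`. One possible alternative, sketched here with a made-up subject, describes which characters a URL may contain instead. The character class `[\w./-]` is an assumption chosen for these simple URLs and is not a complete URL grammar.

```python
import re

subject = 'Find us at http://codeup.com and https://alumni.codeup.com. Docs: https://docs.python.org, etc.'

# [\w./-]* : any run of word chars, dots, slashes, and dashes
# \w       : the url must END on a word char, which drops a trailing '.'
regexp = r'https?://[\w./-]*\w'

urls = re.findall(regexp, subject)
print(urls)
```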

### Any/None Of

In [71]:
#matches a single char that is anything that is inside of the brackets
regexp = r'[ab]'
subject = 'abc 123'

re.findall(regexp, subject)

['a', 'b']

In [72]:
#matches a single char that is anything that is inside of the brackets
regexp = r'[ac]'
subject = 'abc 123'

re.findall(regexp, subject)

['a', 'c']

In [73]:
#matches a single char that is anything that is inside of the brackets
regexp = r'[c2]'
subject = 'abc 123'

re.findall(regexp, subject)

['c', '2']

In [74]:
#specify a range of chars
regexp = r'[2-6]'
subject = 'abc 123'

re.findall(regexp, subject)

['2', '3']

In [75]:
#can invert the match: a '^' at the start of the brackets means any char NOT listed

regexp = r'[^2-6]'
subject = 'abc 123'

re.findall(regexp, subject)

['a', 'b', 'c', ' ', '1']

In [76]:
#can use repetition operators
regexp = r'[^c3]+'
subject = 'abc 123'

re.findall(regexp, subject)

['ab', ' 12']

In [77]:

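#3 character classes in a row: one of a/1, then one of b/2, then one of c/3
#re.match is like re.search, but only matches at the very start of the subject
regexp = r'[a1][b2][c3]'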
subject = 'abc 123'

re.match(regexp, subject)

<re.Match object; span=(0, 3), match='abc'>

In [78]:
#change subject
subject = '123abc'

re.match(regexp, subject)

<re.Match object; span=(0, 3), match='123'>
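
The two cells above use `re.match` rather than `re.search`. The difference only shows up when the match is not at the start of the subject, as in this sketch with a made-up subject:

```python
import re

regexp = r'[a1][b2][c3]'
subject = 'xyz abc'

# re.match only tries the very start of the subject ('xyz' does not fit)
at_start = re.match(regexp, subject)

# re.search scans forward until it finds a match anywhere in the subject
anywhere = re.search(regexp, subject)

print(at_start)
print(anywhere.span(), anywhere.group())
```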

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>For this exercise you should make up various subjects and test them with your regular expressions.</p>
    <ol>
        <li>Write a regular expression that matches even numbers.</li>
        <li>Write a regular expression that matches 2 or more odd numbers in a row.</li>
        <li>Write a regular expression that matches any word with a vowel in it.</li>
    </ol>
</div>

In [79]:
#1. Write a regular expression that matches even numbers.
regexp = r'[02468]'
subject = '123456 789'

re.findall(regexp, subject)

['2', '4', '6', '8']

In [80]:
#2. Write a regular expression that matches 2 or more odd numbers in a row.
regexp = r'[13579]{2,}'#'2,' to say 2 or more of these
subject = '123456 789 110 133 333'

re.findall(regexp, subject)

['11', '133', '333']

In [81]:
#3. Write a regular expression that matches any word with a vowel in it.
#[aeiou] matches the vowels themselves, which is enough to check whether a vowel is present
regexp = r'[aeiou]'
subject = 'do re mi'

re.findall(regexp, subject)

['o', 'e', 'i']

In [82]:
#.match returns nothing here b/c it only looks at the start of the subject, and 'd' is not a vowel
regexp = r'[aeiou]'
subject = 'do re mi'

re.match(regexp, subject)

In [83]:
#.search scans the whole subject and finds the first vowel
regexp = r'[aeiou]'
subject = 'do re mi'

re.search(regexp, subject)

<re.Match object; span=(1, 2), match='o'>
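
To get back the whole words that contain a vowel, rather than the vowels themselves, one possible sketch wraps the character class in `\w*` on both sides. The subject is made up, and `y` is deliberately not treated as a vowel here.

```python
import re

# \b     : word boundary
# \w*    : zero or more word chars before the vowel
# [aeiou]: at least one vowel
# \w*    : zero or more word chars after the vowel
regexp = r'\b\w*[aeiou]\w*\b'
subject = 'do re mi hmm why'

words = re.findall(regexp, subject, re.IGNORECASE)
print(words)
```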

Sidenote: we can use the result of re.search as a boolean value: a match object is truthy, and no match gives None, which is falsy

In [84]:
regexp = r'[aeiou]'
subject = 'bbbb'

if re.search(regexp, subject):
    print('Found a vowel')
else:
    print('No vowels')

No vowels


In [85]:
regexp = r'[aeiou]'
subject = 'abc easy as 123'

if re.search(regexp, subject):
    print('Found a vowel')
else:
    print('No vowels')

Found a vowel


### Anchors

- `^`: starts with (anchors the match to the start of the subject)
- `$`: ends with (anchors the match to the end of the subject)

In [86]:
regexp = r'b'
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(1, 2), match='b'>

In [87]:
#gives back nothing (does the subject start w/ a b)
regexp = r'^b'
subject = 'abc 123'

re.search(regexp, subject)

In [88]:
#start w/ word char
regexp = r'^\w'
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(0, 1), match='a'>

In [89]:
regexp = r'^[a-z]'
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(0, 1), match='a'>

In [90]:
#does the subject end w/ a lowercase letter? no, it ends w/ a digit, so nothing comes back
regexp = r'[a-z]$'
subject = 'abc 123'

re.search(regexp, subject)

In [91]:
regexp = r'\d$'
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(6, 7), match='3'>

In [92]:
#'^$' means start immediately followed by end, so it only matches an empty subject
regexp = r'^$'
subject = 'abc 123'

re.search(regexp, subject)

In [93]:
#zero or more of anything, then a digit at the end; an empty subject has no digit, so no match
regexp = r'.*\d$'
subject = ''

re.search(regexp, subject)

In [94]:
regexp = r'.*\d$'
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(0, 7), match='abc 123'>
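
By default `^` and `$` anchor to the start and end of the whole subject. When the subject has several lines, the `re.MULTILINE` flag makes them anchor to each line instead. A minimal sketch with a made-up two-line subject:

```python
import re

subject = 'abc 123\ndef 456'

# default: '^' only matches at the very start of the subject
first_only = re.findall(r'^\w+', subject)

# re.MULTILINE: '^' also matches right after each newline
each_line_start = re.findall(r'^\w+', subject, re.MULTILINE)

# re.MULTILINE: '$' also matches right before each newline
each_line_end = re.findall(r'\d+$', subject, re.MULTILINE)

print(first_only)
print(each_line_start)
print(each_line_end)
```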

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>For this exercise you should make up various subjects and test them with your regular expressions.</p>
    <ol>
        <li>Write a regular expression that matches if a word starts with a vowel.</li>
        <li>Write a regular expression that matches if a word starts with a capital letter.</li>
        <li>Write a regular expression that matches if a word ends with a capital letter.</li>        
        <li>Write a regular expression that matches if a word starts <b>and</b> ends with a capital letter.</li>
    </ol>
</div>

In [95]:
#1. Write a regular expression that matches if a word starts with a vowel.
#('^' anchors to the start of the subject, so this checks the first word only)
regexp = r'^[aeiou]'
subject = 'easy as 123'

re.search(regexp, subject)

<re.Match object; span=(0, 1), match='e'>

In [96]:
regexp = r'^[aeiou]'
subject = 'banana'

re.search(regexp, subject)

In [97]:
regexp = r'^[aeiouAEIOU]'
subject = 'easy as 123'

re.findall(regexp, subject)

['e']

In [98]:
#2. Write a regular expression that matches if a word starts with a capital letter.
regexp = r'^[A-Z]'
subject = 'EASY AS 123'

re.search(regexp, subject)

<re.Match object; span=(0, 1), match='E'>

In [99]:
#3. Write a regular expression that matches if a word ends with a capital letter.
regexp = r'[A-Z]$'
subject = 'EASY AS 123 YEAH'

re.search(regexp, subject)

<re.Match object; span=(15, 16), match='H'>

In [100]:
#4. Write a regular expression that matches if a word starts and ends with a capital letter.
regexp = r'^[A-Z].*[A-Z]$'
subject = 'EASY AS 123 YEAH'

re.search(regexp, subject)

<re.Match object; span=(0, 16), match='EASY AS 123 YEAH'>

In [101]:
regexp = r'^[A-Z].*[A-Z]$'
subject = 'BananA'

re.search(regexp, subject)

<re.Match object; span=(0, 6), match='BananA'>

In [102]:
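#inside [] the '$' is a literal dollar sign, not an anchor: the last char can be a capital letter or '$'
regexp = r'^[A-Z].*[A-Z$]'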
subject = 'BananA$'

re.search(regexp, subject)

<re.Match object; span=(0, 7), match='BananA$'>
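
The answers above use `^` and `$`, so they test the subject as a whole. To test every word inside a longer subject, one option is the word boundary `\b`, which is not covered above. The sketch below is one possible approach, with a made-up subject.

```python
import re

subject = 'Easy does it, BananA and Umbrella'

# \b marks the edge of a word, so each word in the subject is checked
starts_with_vowel = re.findall(r'\b[aeiouAEIOU]\w*', subject)
starts_with_capital = re.findall(r'\b[A-Z]\w*', subject)
capital_both_ends = re.findall(r'\b[A-Z]\w*[A-Z]\b', subject)

print(starts_with_vowel)
print(starts_with_capital)
print(capital_both_ends)
```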

### Capture Groups
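Wrapping any part of a regular expression in parentheses creates a capture group, which lets us pull out just that part of the match.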

In [103]:
#simple example
regexp = r'abc'
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(0, 3), match='abc'>

In [104]:
regexp = r'[a-z]+\s\d+'
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(0, 7), match='abc 123'>

In [105]:
#capture groups
regexp = r'([a-z]+)\s(\d+)'
subject = 'abc 123'

match = re.search(regexp, subject)
match.groups()

('abc', '123')
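
Besides `.groups()`, individual groups can be read with `.group(n)`, where group 0 is the whole match. `re.findall` also changes shape once a pattern has more than one group: it returns one tuple per match. A small sketch with a made-up second subject:

```python
import re

regexp = r'([a-z]+)\s(\d+)'

match = re.search(regexp, 'abc 123')
whole = match.group(0)     # the entire match
letters = match.group(1)   # first capture group
digits = match.group(2)    # second capture group

# with 2 groups in the pattern, findall returns a list of tuples
pairs = re.findall(regexp, 'abc 123, def 456')

print(whole, letters, digits)
print(pairs)
```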

In [106]:
regexp = r'.*?(\d+)'
s = pd.Series(['abc', 'abc123', '123'])
s.str.extract(regexp)

Unnamed: 0,0
0,NaN
1,123
2,123


In [107]:
s = pd.Series(['abc', 'abc123', '123'])
s

0       abc
1    abc123
2       123
dtype: object

In [108]:
s.str.extract(r'.*?(\d+)')

Unnamed: 0,0
0,NaN
1,123
2,123


## `re.sub`

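- removing: replace each match with the empty string
- substitution: replace each match with new text; `\1`, `\2`, ... refer back to the capture groups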

In [109]:
regexp = r'\d+'
subject = 'abc123'

re.sub(regexp, r'', subject)

'abc'

In [110]:
regexp = r'[a-z]+\d+'
subject = 'abc123'

re.sub(regexp, r'', subject)

''

In [111]:
regexp = r'([a-z]+)(\d+)'
subject = 'abc123'

re.sub(regexp, r'', subject)

''

In [112]:
regexp = r'([a-z]+)(\d+)'
subject = 'abc123'

re.sub(regexp, r'\2', subject)

'123'

In [113]:
regexp = r'([a-z]+)(\d+)'
subject = 'abc123'

re.sub(regexp, r'\1', subject)

'abc'

In [114]:
regexp = r'([a-z]+)(\d+)'
subject = 'abc123'

re.sub(regexp, r'\2\1', subject)

'123abc'
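
Numbered backreferences get hard to read as patterns grow. `re.sub` also accepts `\g<name>` in the replacement to refer to a named group. A minimal sketch, where the group names `letters` and `digits` are made up for this example:

```python
import re

regexp = r'(?P<letters>[a-z]+)(?P<digits>\d+)'
subject = 'abc123 def456'

# \g<name> inserts whatever the named group captured
swapped = re.sub(regexp, r'\g<digits>-\g<letters>', subject)
print(swapped)
```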

In [115]:
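#example w/ pandas (regex=True is needed in pandas 2.0+, where .str.replace treats the pattern as literal text by default)
s.str.replace(r'\d+', '', regex=True)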

0    abc
1    abc
2       
dtype: object

In [116]:
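#the replacement here is the literal string '1' (a backreference to the group would be written r'\1')
s.str.replace(r'.*(\d).*', r'1', regex=True)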

0    abc
1      1
2      1
dtype: object

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>Use the code below to get started on this exercise.</p>
    <pre><code>dates = pd.Series(['2020-11-12', '2020-07-13', '2021-01-12'])</code></pre>
    <p>Use regular expression substitution to reformat the dates in the format common in the US: m/d/y.</p>
</div>

In [117]:
dates = pd.Series(['2020-11-12', '2020-07-13', '2021-01-12'])
dates

0    2020-11-12
1    2020-07-13
2    2021-01-12
dtype: object

In [118]:
#Use regular expression substitution to reformat the dates in the format common in the US: m/d/y.
dates.str.replace(r'\d{4}-\d{2}-\d{2}', '', regex=True)

0    
1    
2    
dtype: object

In [119]:
dates.str.replace(r'\d{4}-\d{2}-\d{2}', 'X', regex=True)

0    X
1    X
2    X
dtype: object

In [120]:
#add capture groups
dates.str.replace(r'(\d{4})-(\d{2})-(\d{2})', r'\2', regex=True)

0    11
1    07
2    01
dtype: object

In [121]:
dates.str.replace(r'(\d{4})-(\d{2})-(\d{2})', r'\2/\1', regex=True)

0    11/2020
1    07/2020
2    01/2021
dtype: object

In [122]:
dates.str.replace(r'(\d{4})-(\d{2})-(\d{2})', r'\2/\3', regex=True)

0    11/12
1    07/13
2    01/12
dtype: object

In [123]:
dates.str.replace(r'(\d{4})-(\d{2})-(\d{2})', r'\2/\3/\1', regex=True)

0    11/12/2020
1    07/13/2020
2    01/12/2021
dtype: object

## Misc

### Pandas Usage

- `.str`
    - `.extract`
    - `.count`
    - `.contains`
    - `.replace`
- extract + concat
- named groups
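
A short sketch of the items in this list that are not demonstrated elsewhere: `.str.contains`, `.str.count`, and combining `.str.extract` with `pd.concat`. The series contents and the group names `protocol` and `domain` are made up for this example, and pandas is assumed to be installed.

```python
import pandas as pd

text = pd.Series([
    'see https://regex101.com',
    'no link here',
    'http://askjeeves.com and https://duckduckgo.com',
], name='text')

# .str.contains: True where the pattern matches anywhere in the string
has_url = text.str.contains(r'https?://')

# .str.count: number of matches per string
n_urls = text.str.count(r'https?://')

# .str.extract: one column per capture group, first match only (NaN if no match)
parts = text.str.extract(r'(?P<protocol>https?)://(?P<domain>[\w.]+)')

# extract + concat: put the extracted columns next to the original text
combined = pd.concat([text, parts], axis=1)
print(combined)
```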

In [124]:
df = pd.DataFrame()
df['text'] = pd.Series([
    'You should go check out https://regex101.com, it is a great website!',
    'My favorite search engine is https://duckduckgo.com',
    'If you have a question, you can get it answered through http://askjeeves.com, it is great!',
])
df

Unnamed: 0,text
0,"You should go check out https://regex101.com, ..."
1,My favorite search engine is https://duckduckg...
2,"If you have a question, you can get it answere..."


In [125]:
df.text.str.extract(r'(https?)://(\w+)\.(\w+)')

Unnamed: 0,0,1,2
0,https,regex101,com
1,https,duckduckgo,com
2,http,askjeeves,com


### Interactive Regex Tool

To install the `hlre` tool:

```
python -m pip install hlre
```

[For more documentation and the source](https://github.com/zgulde/hlre)

See also [regex101](https://regex101.com) (make sure to select the Python flavor)

### Named capture groups

In [126]:
text = 'You should go check out https://regex101.com, it is a great website!'

match = re.search(r'(?P<protocol>https?)://(?P<base_domain>\w+)\.(?P<tld>\w+)', text)
match.groupdict()

{'protocol': 'https', 'base_domain': 'regex101', 'tld': 'com'}

In [127]:
df.text.str.extract(r'(?P<protocol>https?)://(?P<base_domain>\w+)\.(?P<tld>\w+)')

Unnamed: 0,protocol,base_domain,tld
0,https,regex101,com
1,https,duckduckgo,com
2,http,askjeeves,com


### Verbose regular expressions

- `re.VERBOSE`
- `(?# this is a comment)`

In [128]:
text = 'You should go check out https://regex101.com, it is a great website!'

regexp = r'''
(?P<protocol>https?)
:// (?# ignore the :// that separates protocol from domain)
(?P<base_domain>\w+)
\.
(?P<tld>\w+)
'''
match = re.search(regexp, text, re.VERBOSE) # whitespace in the regex is ignored
match.groupdict()

{'protocol': 'https', 'base_domain': 'regex101', 'tld': 'com'}
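
With `re.VERBOSE`, a `#` also starts a comment that runs to the end of the line, which is often easier to read than `(?# ...)`. A minimal sketch on a made-up date subject, with hypothetical group names:

```python
import re

pattern = re.compile(r'''
    (?P<year>\d{4})    # four-digit year
    -
    (?P<month>\d{2})   # two-digit month
    -
    (?P<day>\d{2})     # two-digit day
''', re.VERBOSE)

match = pattern.search('released on 2020-11-12')
print(match.groupdict())
```

Because whitespace in the pattern is ignored in this mode, a literal space has to be written as `\s`, as an escaped space, or inside a character class such as `[ ]`.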