# Regex

- What is a regular expression?
- When are regular expressions useful?

In [2]:
import pandas as pd


In [3]:
log_file_lines = '''
76.185.131.226 - - [11/May/2020:14:25:53 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:25:46 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:25:58 +0000] "GET / HTTP/1.1" 200 42 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
76.185.131.226 - - [11/May/2020:16:25:58 +0000] "GET /favicon.ico HTTP/1.1" 200 162 "https://python.zach.lol/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
104.5.217.57 - - [11/May/2020:16:26:27 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:26:46 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:26:54 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
104.5.217.57 - - [11/May/2020:16:27:04 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:27:05 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:27:10 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
'''

In [4]:
import re # part of the python stdlib

In [5]:
regex = r'(?P<ip>.*?)\s.*?\[(?P<timestamp>.*?)\]\s+"(?P<method>[A-Z]+)\s(?P<path>.*?)\sHTTP/1\.1"\s(?P<status>\d+)\s(?P<bytes_sent>\d+)\s"(?P<referrer>.*?)"\s"(?P<user_agent>.*?)"'
regex = re.compile(regex)

lines = pd.Series(log_file_lines.strip().split('\n'))
lines.str.extract(regex)

Unnamed: 0,ip,timestamp,method,path,status,bytes_sent,referrer,user_agent
0,76.185.131.226,11/May/2020:14:25:53 +0000,GET,/,200,42,-,python-requests/2.23.0
1,76.185.131.226,11/May/2020:16:25:46 +0000,GET,/,200,42,-,python-requests/2.23.0
2,76.185.131.226,11/May/2020:16:25:58 +0000,GET,/,200,42,-,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6...
3,76.185.131.226,11/May/2020:16:25:58 +0000,GET,/favicon.ico,200,162,https://python.zach.lol/,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6...
4,104.5.217.57,11/May/2020:16:26:27 +0000,GET,/,200,42,-,python-requests/2.23.0
5,76.185.131.226,11/May/2020:16:26:46 +0000,GET,/documentation,200,348,-,python-requests/2.23.0
6,76.185.131.226,11/May/2020:16:26:54 +0000,GET,/documentation,200,348,-,python-requests/2.23.0
7,104.5.217.57,11/May/2020:16:27:04 +0000,GET,/documentation,200,348,-,python-requests/2.23.0
8,76.185.131.226,11/May/2020:16:27:05 +0000,GET,/documentation,200,348,-,python-requests/2.23.0
9,76.185.131.226,11/May/2020:16:27:10 +0000,GET,/documentation,200,348,-,python-requests/2.23.0


- `re.search`: returns the first match for a regex in a subject (or `None` if there is none)
- `re.findall`: returns a list of *all* the matches for a regex in a subject
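
The difference between the two functions can be sketched as follows. The subject string is the same one used in the cells below; the variable names are only illustrative.

```python
import re

subject = 'abcaaaa'

# re.search returns a Match object for the first match, or None if nothing matches
first = re.search(r'a', subject)
print(first.span(), first.group())   # (0, 1) a
print(re.search(r'z', subject))      # None

# re.findall returns a list of every non-overlapping match (empty if nothing matches)
all_matches = re.findall(r'a', subject)
print(all_matches)                   # ['a', 'a', 'a', 'a', 'a']
print(re.findall(r'z', subject))     # []
```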

### Literals

In [9]:
regexp =  r'a'
subject = 'abcaaaa'

re.search(regexp, subject)

<re.Match object; span=(0, 1), match='a'>

In [10]:
re.search(regexp, subject)[0]

'a'

In [11]:
re.findall(regexp, subject)

['a', 'a', 'a', 'a', 'a']

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>Use the cell above to start experimenting with regular expressions.</p>
    <ol>
        <li>Change your regular expression to match the literal character "b". What do you notice?</li>
        <li>Change your regular expression to match the literal string "ab". What do you notice?</li>
        <li>Change your regular expression to match the literal "d". What do you notice?</li>
        <li>Use <code>re.findall</code> instead of <code>re.search</code>. How do the results differ?</li>
        <li>Change your regular expression to just the "." character. What are the results?</li>
    </ol>
</div>

**1. Change your regular expression to match the literal character "b". What do you notice?**

In [12]:
regexp =  r'b'


In [13]:
re.search(regexp, subject)

<re.Match object; span=(1, 2), match='b'>

**Note: the span changed to (1, 2) because "b" sits at index 1 of the subject.**

**2. Change your regular expression to match the literal string "ab". What do you notice?**

In [14]:
regexp =  r'ab'
re.search(regexp, subject)

<re.Match object; span=(0, 2), match='ab'>

**Note: the span is now (0, 2) because the match covers two characters.**

**3. Change your regular expression to match the literal "d". What do you notice?**

In [15]:
regexp =  r'd'
re.search(regexp, subject)

**`re.search` returns `None` (so the notebook shows no output) because there is no "d" in the subject.**

**4. Use re.findall instead of re.search. How do the results differ?**

In [16]:
regexp =  r'b'
re.findall(regexp, subject)

['b']

In [17]:
regexp =  r'd'
re.findall(regexp, subject)

[]

**5. Change your regular expression to just the "." character. What are the results?**


In [18]:
regexp =  r'.'
re.search(regexp, subject)

<re.Match object; span=(0, 1), match='a'>

**`.` matches any single character (except a newline), so `re.search` returns the first character, in this case "a".**

### Metacharacters

- `.`
- `\w`
- `\s`
- `\d`
- Capital variants (`\W`, `\S`, `\D`)
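
As a quick sketch of the capital variants: each one matches the opposite of its lowercase counterpart. The subject is the same one used in the cells below; the variable names are only for illustration.

```python
import re

subject = ';abc 123'

# \W: any character that is NOT a word character (letter, digit, or underscore)
non_word = re.findall(r'\W', subject)
print(non_word)    # [';', ' ']

# \S: any character that is NOT whitespace
non_space = re.findall(r'\S', subject)
print(non_space)   # [';', 'a', 'b', 'c', '1', '2', '3']

# \D: any character that is NOT a digit
non_digit = re.findall(r'\D', subject)
print(non_digit)   # [';', 'a', 'b', 'c', ' ']
```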

In [19]:
regexp =  r'\w'
subject = ';abc 123'
re.search(regexp, subject)

<re.Match object; span=(1, 2), match='a'>

In [20]:
regexp =  r'\w'
subject = '1abc 123'
re.search(regexp, subject)

<re.Match object; span=(0, 1), match='1'>

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>Continue to use the same subject variable from above.</p>
    <ol>
        <li>Use all of the above metacharacters with <code>re.findall</code>. What do you notice?</li>
        <li>What does the regular expression <code>\w\w</code> match?</li>
        <li>Use only metacharacters to write a regular expression to match "c 1".</li>
        <li>Use a combination of metacharacters to match 3 digits in a row.</li>
    </ol>
</div>

**1. Use all of the above metacharacters with re.findall. What do you notice?**

In [21]:
regexp =  r'\w'
subject = ';abc 123'
re.findall(regexp, subject)

['a', 'b', 'c', '1', '2', '3']

In [23]:
regexp =  r'\s'
subject = ';abc 123'
re.findall(regexp, subject)

[' ']

In [23]:
regexp =  r'\d'
subject = ';abc 123'
re.findall(regexp, subject)

['1', '2', '3']

**2. What does the regular expression \w\w match?**

In [24]:
regexp =  r'\w\w'
subject = ';abc 123'
re.findall(regexp, subject)

['ab', '12']

In [22]:
#\w\w
regexp = r'\w\w'
subject = 'a bc 123'

re.search(regexp, subject)

<re.Match object; span=(2, 4), match='bc'>

**3. Use only metacharacters to write a regular expression to match "c 1".**

In [56]:
# getting "c 1" with just metacharacters
# what we want to find:
#   some letter (or digit or underscore): \w
#   followed by a whitespace character: \s
#   followed by a digit: \d
# re.findall returns every non-overlapping match of those three things
# in the order that we asked for them
regexp =  r'\w\s\d'
subject = 'abc 123'
re.findall(regexp, subject)

['c 1']

In [57]:
regexp =  r'''..\d'''
subject = ';abc 123'
re.findall(regexp, subject)

['c 1']

In [58]:
regexp =  r'''\D \d'''
subject = ';abc 123'
re.findall(regexp, subject)

['c 1']

**4. Use a combination of metacharacters to match 3 digits in a row.**

In [59]:
regexp =  r'\d\d\d'
subject = ';abc 123'
re.findall(regexp, subject)

['123']

### Repeating

- `{}`
- `*`
- `+`
- `?`
- greedy + non-greedy
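
A minimal sketch of the quantifiers listed above, including the difference between greedy and non-greedy matching. The subjects here are made up for illustration and are not part of the lesson.

```python
import re

subject = 'a1 b22 c333'  # made-up subject

# + means one or more, {2,} means two or more
one_or_more = re.findall(r'\d+', subject)
print(one_or_more)    # ['1', '22', '333']
two_or_more = re.findall(r'\d{2,}', subject)
print(two_or_more)    # ['22', '333']

# ? makes the preceding character optional
optional = re.findall(r'colou?r', 'color colour')
print(optional)       # ['color', 'colour']

# greedy: .+ grabs as much as it can; non-greedy: .+? grabs as little as it can
html = '<b>bold</b> and <i>italic</i>'
greedy = re.findall(r'<.+>', html)
print(greedy)         # ['<b>bold</b> and <i>italic</i>']
lazy = re.findall(r'<.+?>', html)
print(lazy)           # ['<b>', '</b>', '<i>', '</i>']
```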

In [62]:
# two or more digits in a row:
# exactly n: {n}
# from zero to n: {,n}
# from n to however many: {n,}
# range from n to m: {n,m}
regexp =  r'\d{2,}'
subject = 'abc 123'
re.findall(regexp, subject)

['123']

In [60]:
regexp =  r'\d{3}'
subject = ';abc 123'
re.findall(regexp, subject)

['123']

In [65]:
regexp =  r'\w+'
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(0, 3), match='abc'>

In [66]:
regexp =  r'\w+ \w+'
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(0, 7), match='abc 123'>

In [67]:
regexp =  r'\w+\s\w+'
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(0, 7), match='abc 123'>

In [71]:
regexp =  r'\w+\s\w+?'
subject = 'xabc 123'

re.search(regexp, subject)

<re.Match object; span=(0, 6), match='xabc 1'>

In [24]:
regexp = r'ham sandwix?ch'
subject = 'hi yes i would like some ham sandwich'

re.search(regexp, subject)

<re.Match object; span=(25, 37), match='ham sandwich'>

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>Use the string below as your subject for this exercise.</p>
    <pre><code>Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, San Antonio, TX 78230. You can find us online at http://codeup.com and our alumni portal is located at https://alumni.codeup.com.</code></pre>
    <ol>
        <li>Write a regular expression that matches all the numbers.</li>
        <li>Write a regular expression that matches a 5 digit number, but not a number with fewer digits.</li>
        <li>Write a regular expression that matches any urls in the subject.</li>
    </ol>
</div>

In [25]:
subject = 'Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, San Antonio, TX 78230. You can find us online at http://codeup.com and our alumni portal is located at https://alumni.codeup.com'

**1. Write a regular expression that matches all the numbers.**

In [27]:
regexp =  r'\d+'

re.findall(regexp, subject)

['2014', '600', '350', '78230']

**2. Write a regular expression that matches a 5 digit number, but not a number with fewer digits.**

In [97]:
regexp =  r'\d{5,}'

re.findall(regexp, subject)

['78230']
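
Note that `\d{5,}` also matches numbers with more than five digits. If exactly five digits are wanted, one possible approach (an assumption beyond the exercise, using the `\b` word boundary, which is not covered elsewhere in this lesson) is sketched below with a made-up subject.

```python
import re

subject = 'zip 78230, phone 2105551234, year 2014'  # made-up subject

# five or more digits: also picks up the ten digit number
at_least_five = re.findall(r'\d{5,}', subject)
print(at_least_five)   # ['78230', '2105551234']

# \b requires a word boundary on both sides, so only a standalone 5 digit run matches
exactly_five = re.findall(r'\b\d{5}\b', subject)
print(exactly_five)    # ['78230']
```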

**3. Write a regular expression that matches any urls in the subject.**

In [None]:
# parts of our url:
# starts with: http
# optionally followed by s (one url has it, one doesn't)
# followed by //
# followed by some amount of letter characters
# then a dot
# then maybe another set of letters and a dot
# then com

In [28]:
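# raw string so \w is passed to the regex engine untouched; \. is a literal dot
# (?:\w+\.)+ groups "letters then a dot" without capturing, repeated one or more times
regexp = r'https?://(?:\w+\.)+com'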

re.findall(regexp, subject)

['http://codeup.com', 'https://alumni.codeup.com']

### Any/None Of

In [None]:
# brackets: match any one of the characters listed inside them
# re.match (used below) only matches at the start of the subject

In [29]:
regexp = '[a-z]+' 
subject = 'abc 123'

re.match(regexp, subject)

<re.Match object; span=(0, 3), match='abc'>

In [30]:
regexp = '[a1][b2][c3]' 
subject = 'abc 123'

re.match(regexp, subject)

<re.Match object; span=(0, 3), match='abc'>

In [31]:
subject = '123abc'

re.match(regexp, subject)

<re.Match object; span=(0, 3), match='123'>
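
The "none of" half of this section can be sketched with a negated character class: a `^` as the first character inside the brackets inverts the set. The variable names are only illustrative.

```python
import re

subject = 'abc 123'

# [^0-9] matches any single character that is NOT a digit
not_digits = re.findall(r'[^0-9]+', subject)
print(not_digits)   # ['abc ']

# [^aeiou\s] matches anything that is neither a lowercase vowel nor whitespace
not_vowels = re.findall(r'[^aeiou\s]', subject)
print(not_vowels)   # ['b', 'c', '1', '2', '3']
```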

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>For this exercise you should make up various subjects and test them with your regular expressions.</p>
    <ol>
        <li>Write a regular expression that matches even digits.</li>
        <li>Write a regular expression that matches 2 or more odd digits in a row.</li>
        <li>Write a regular expression that matches any word with a vowel in it.</li>
    </ol>
</div>

1. Write a regular expression that matches even digits.

In [32]:
subj = '579480984  55554'
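# matches a whole number ending in an even digit at the end of the subject;
# r'[02468]' on its own would match each even digit individually
regexp = r'\d*[24680]$'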
re.findall(regexp, subj)

['55554']

2. Write a regular expression that matches 2 or more odd digits in a row.

In [34]:
subj = '5794809555845'
regexp = r'[13579]{2,}'
re.findall(regexp, subj)

['579', '9555']

3. Write a regular expression that matches any word with a vowel in it.

In [35]:
subj = 'whch of these wrds hs a vowel?'
regexp = r'[A-Za-z]*[aeiouAEIOU]+[A-Za-z]*'
re.findall(regexp, subj)

['of', 'these', 'a', 'vowel']

### Anchors

- `^`
- `$`

In [36]:
# even numbers:
subj = '579480984  55554'
regexp = r'\d*[24680]$'
re.findall(regexp, subj)

['55554']

In [None]:
# match at the start of the subject: '^thing_youre_looking_for'
# match at the end of the subject: 'thing_youre_looking_for$'

In [37]:
subj = '579480984  55554'
regexp = r'^5\d*'
re.findall(regexp, subj)

['579480984']
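
One point worth illustrating: `^` and `$` anchor to the start and end of the whole subject, not to each word in it. A small sketch with a made-up two-word subject:

```python
import re

subject = 'apple Egg'  # made-up subject

# ^ only matches at the very start of the subject
starts = re.findall(r'^[aeiouAEIOU]\w*', subject)
print(starts)      # ['apple']

# $ only matches at the very end of the subject
ends = re.findall(r'[A-Z]\w*$', subject)
print(ends)        # ['Egg']

# with both anchors the whole subject must fit the pattern; the space prevents that here
whole = re.findall(r'^\w+$', subject)
print(whole)       # []
```

This is why the exercise below tests one word at a time.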

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>For this exercise you should make up various subjects and test them with your regular expressions.</p>
    <ol>
        <li>Write a regular expression that matches if a word starts with a vowel.</li>
        <li>Write a regular expression that matches if a word starts with a capital letter.</li>
        <li>Write a regular expression that matches if a word ends with a capital letter.</li>        
        <li>Write a regular expression that matches if a word starts <b>and</b> ends with a capital letter.</li>
    </ol>
</div>

1. Write a regular expression that matches if a word starts with a vowel.

In [43]:

subjects = ['apple', 'Ham', 'grouper_', 'AnchorS', 'Z', 'pokemon', 'christmas', 'bash']

In [39]:
regexp =  r'^[aeiouAEIOU][a-zA-Z]*$'
[re.findall(regexp, word) for word in subjects]

[['apple'], [], [], ['AnchorS'], [], [], [], []]

2. Write a regular expression that matches if a word starts with a capital letter.

In [40]:
regexp =  r'^[A-Z][a-zA-Z]*$'


[re.findall(regexp, word) for word in subjects]

[[], ['Ham'], [], ['AnchorS'], ['Z'], [], [], []]

3. Write a regular expression that matches if a word ends with a capital letter.

In [166]:
regexp =  r'^[a-zA-Z]*[A-Z]$'


[re.findall(regexp, word) for word in subjects]

[[], [], [], ['AnchorS'], ['Z'], [], [], []]

4. Write a regular expression that matches if a word starts and ends with a capital letter.

In [42]:

regexp = r'^[A-Z][a-zA-Z]*[A-Z]$'
[re.findall(regexp, word) for word in subjects]

[[], [], [], ['AnchorS'], [], [], [], []]

### Capture Groups

In [45]:
regexp = r'.*?(\d+)'
s = pd.Series(['abc','abc123','123'])
s.str.extract(regexp)


Unnamed: 0,0
0,NaN
1,123
2,123


In [46]:
regexp = '([A-Z]+) ([a-z]+)'
subject = 'KJHDSFLKJSH ksdjflsdkjf'
re.sub(regexp, r'\2 \1', subject)

'ksdjflsdkjf KJHDSFLKJSH'
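
Outside of `re.sub` and pandas, captured groups can also be read directly from a match object. A minimal sketch, reusing the subject above; the second subject is made up for illustration.

```python
import re

subject = 'KJHDSFLKJSH ksdjflsdkjf'
match = re.search(r'([A-Z]+) ([a-z]+)', subject)

print(match.group(0))   # the whole match: KJHDSFLKJSH ksdjflsdkjf
print(match.group(1))   # first group: KJHDSFLKJSH
print(match.group(2))   # second group: ksdjflsdkjf
print(match.groups())   # ('KJHDSFLKJSH', 'ksdjflsdkjf')

# when the pattern has several groups, re.findall returns a list of tuples
pairs = re.findall(r'([a-z]+)(\d+)', 'abc123 de45')
print(pairs)            # [('abc', '123'), ('de', '45')]
```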

## `re.sub`

- removing
- substitution

In [47]:
regexp = r'\d+'
subject = 'abc123'
re.sub(regexp,'nope',subject)

'abcnope'

In [48]:
re.sub(regexp,'',subject)

'abc'

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>Use the code below to get started on this exercise.</p>
    <pre><code>dates = pd.Series(['2020-11-12', '2020-07-13', '2021-01-12'])</code></pre>
    <p>Use regular expression substitution to reformat the dates in the format common in the US: m/d/y.</p>
</div>

In [50]:
dates = pd.Series(['2020-11-12', '2020-07-13', '2021-01-12'])

In [51]:
date_reg = r'(\d+)-(\d+)-(\d+)'
[re.sub(date_reg, r'\2/\3/\1', date) for date in dates]

['11/12/2020', '07/13/2020', '01/12/2021']
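
As an alternative sketch, the same substitution can be written with named groups (see "Named capture groups" later in this notebook), which `re.sub` references as `\g<name>` in the replacement. The group names `year`, `month`, and `day` are chosen here for illustration, and a plain list stands in for the Series.

```python
import re

dates = ['2020-11-12', '2020-07-13', '2021-01-12']

# named groups make the replacement string easier to read than \2/\3/\1
date_reg = r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'
us_dates = [re.sub(date_reg, r'\g<month>/\g<day>/\g<year>', date) for date in dates]
print(us_dates)   # ['11/12/2020', '07/13/2020', '01/12/2021']
```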

## Misc

### Pandas Usage

- `.str`
    - `.extract`
    - `.count`
    - `.contains`
    - `.replace`
- extract + concat
- named groups
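
A rough sketch of the `.str` methods listed above, assuming a reasonably recent pandas. The two-row Series is made up for illustration; the column name `text` and the group names are assumptions, not part of the lesson.

```python
import pandas as pd

text = pd.Series([
    'You should go check out https://regex101.com, it is a great website!',
    'No links here',
])
url = r'https?://\w+\.\w+'

# .contains: does each string have a match?
has_url = text.str.contains(url)
# .count: how many matches in each string?
n_digits = text.str.count(r'\d')
# .replace: substitute matches (regex=True treats the pattern as a regex)
redacted = text.str.replace(url, '<url>', regex=True)

# .extract with named groups gives one column per group (NaN where there is no match)
parts = text.str.extract(r'(?P<protocol>https?)://(?P<base_domain>\w+)\.(?P<tld>\w+)')
# extract + concat: attach the extracted columns to the original text
combined = pd.concat([text.rename('text'), parts], axis=1)
print(combined)
```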

In [52]:
df = pd.DataFrame()
df['text'] = pd.Series([
    'You should go check out https://regex101.com, it is a great website!',
    'My favorite search engine is https://duckduckgo.com',
    'If you have a question, you can get it answered through http://askjeeves.com, it is great!',
])
df

Unnamed: 0,text
0,"You should go check out https://regex101.com, ..."
1,My favorite search engine is https://duckduckg...
2,"If you have a question, you can get it answere..."


In [53]:
df.text.str.extract(r'(https?)://(\w+)\.(\w+)')

Unnamed: 0,0,1,2
0,https,regex101,com
1,https,duckduckgo,com
2,http,askjeeves,com


### Interactive Regex Tool

To install the `hlre` tool:

```
python -m pip install hlre
```

[For more documentation and the source](https://github.com/zgulde/hlre)

See also [regex101](https://regex101.com) (make sure to select the Python flavor)

### Named capture groups

In [54]:
text = 'You should go check out https://regex101.com, it is a great website!'

match = re.search(r'(?P<protocol>https?)://(?P<base_domain>\w+)\.(?P<tld>\w+)', text)
match.groupdict()

{'protocol': 'https', 'base_domain': 'regex101', 'tld': 'com'}

In [55]:
df.text.str.extract(r'(?P<protocol>https?)://(?P<base_domain>\w+)\.(?P<tld>\w+)')

Unnamed: 0,protocol,base_domain,tld
0,https,regex101,com
1,https,duckduckgo,com
2,http,askjeeves,com


### Verbose regular expressions

- `re.VERBOSE`
- `(?# this is a comment)`

In [56]:
text = 'You should go check out https://regex101.com, it is a great website!'

regexp = r'''
(?P<protocol>https?)
:// (?# ignore the :// that separates protocol from domain)
(?P<base_domain>\w+)
\.
(?P<tld>\w+)
'''
match = re.search(regexp, text, re.VERBOSE) # whitespace in the regex is ignored
match.groupdict()

{'protocol': 'https', 'base_domain': 'regex101', 'tld': 'com'}
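
Two more details of `re.VERBOSE` that may be useful, sketched with a made-up pattern and subject: `#` starts a comment that runs to the end of the line, and because whitespace is ignored, a literal space has to be written inside a character class (or escaped).

```python
import re

date_time = re.compile(r'''
    (?P<date>\d{4}-\d{2}-\d{2})   # ISO style date
    [ ]                           # a literal space (a bare space would be ignored)
    (?P<time>\d{2}:\d{2})         # hours and minutes
''', re.VERBOSE)

match = date_time.search('logged at 2020-05-11 14:25 UTC')
print(match.groupdict())   # {'date': '2020-05-11', 'time': '14:25'}

# the bare space is dropped in verbose mode, so this pattern is really \d\d
no_match = re.search(r'\d \d', '1 2', re.VERBOSE)
print(no_match)            # None
```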