# Regex

## Context

1. What is a regex? Language for describing *regular* text
    - bigger than Python (many languages and tools support regex), but we use the Python flavor
    - **regex** *matches* a piece of text, the **subject**
    - when to use regex?
1. Demo
1. `re.search`, `re.findall`, (`re.sub` later)
1. `hlre`
1. Regex Parts (see below)
1. `re.sub` + capture groups
1. pandas methods, `pd.Series.str` -- `.contains`, `.count`, `.extract`, `.replace`

## Parts of Regular Expressions

1. literals
1. metachars, char classes: `.`, `\w`, `\s`, `\d` (+ caps variants)
1. repeating `*`, `+`, `{m[,[n]]}`, `?`
1. any/none of `[]`
1. anchors, `^` + `$`, `\b`
1. capture groups, group referencing
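
A quick tour of the parts listed above, as a minimal sketch. The `subject` string is made up for illustration and is not part of the lesson data.

```python
import re

# Made-up subject string, used only for this illustration.
subject = 'Order 66 shipped to bay_7 at 10:45am.'

literal = re.findall(r'bay', subject)             # literal characters
numbers = re.findall(r'\d+', subject)             # char class + repetition
two_digits = re.findall(r'\d{2}', subject)        # exactly two digits
punctuation = re.findall(r'[^\w\s]', subject)     # none of: word chars, whitespace
first_word = re.findall(r'^\w+', subject)         # anchored to the start
s_words = re.findall(r'\bs\w+', subject)          # word boundary, then "s..."
time_parts = re.findall(r'(\d+):(\d+)', subject)  # capture groups
doubled = re.findall(r'(\w)\1', subject)          # group reference: same char twice

print(numbers)     # ['66', '7', '10', '45']
print(time_parts)  # [('10', '45')]
print(doubled)     # ['6', 'p']
```

Note that `re.findall` returns the group contents (not the whole match) as soon as the pattern contains capture groups.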

In [1]:
import pandas as pd
import re

In [2]:
log_file_lines = '''
76.185.131.226 - - [11/May/2020:14:25:53 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:25:46 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:25:58 +0000] "GET / HTTP/1.1" 200 42 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
76.185.131.226 - - [11/May/2020:16:25:58 +0000] "GET /favicon.ico HTTP/1.1" 200 162 "https://python.zach.lol/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
104.5.217.57 - - [11/May/2020:16:26:27 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:26:46 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:26:54 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
104.5.217.57 - - [11/May/2020:16:27:04 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:27:05 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:27:10 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
'''

In [3]:
regex = r'(?P<ip>.*?)\s.*?\[(?P<timestamp>.*?)\]\s+"(?P<method>[A-Z]+)\s(?P<path>.*?)\sHTTP/1.1"\s(?P<status>\d+)\s(?P<bytes_sent>\d+)\s"(?P<referrer>.*?)"\s"(?P<user_agent>.*?)"'
# the pattern has no literal whitespace or '#', so re.VERBOSE does not change what it matches
regex = re.compile(regex, re.VERBOSE)

In [4]:
lines = pd.Series(log_file_lines.strip().split('\n'))
lines

0    76.185.131.226 - - [11/May/2020:14:25:53 +0000...
1    76.185.131.226 - - [11/May/2020:16:25:46 +0000...
2    76.185.131.226 - - [11/May/2020:16:25:58 +0000...
3    76.185.131.226 - - [11/May/2020:16:25:58 +0000...
4    104.5.217.57 - - [11/May/2020:16:26:27 +0000] ...
5    76.185.131.226 - - [11/May/2020:16:26:46 +0000...
6    76.185.131.226 - - [11/May/2020:16:26:54 +0000...
7    104.5.217.57 - - [11/May/2020:16:27:04 +0000] ...
8    76.185.131.226 - - [11/May/2020:16:27:05 +0000...
9    76.185.131.226 - - [11/May/2020:16:27:10 +0000...
dtype: object

In [5]:
lines.str.extract(regex)

Unnamed: 0,ip,timestamp,method,path,status,bytes_sent,referrer,user_agent
0,76.185.131.226,11/May/2020:14:25:53 +0000,GET,/,200,42,-,python-requests/2.23.0
1,76.185.131.226,11/May/2020:16:25:46 +0000,GET,/,200,42,-,python-requests/2.23.0
2,76.185.131.226,11/May/2020:16:25:58 +0000,GET,/,200,42,-,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6...
3,76.185.131.226,11/May/2020:16:25:58 +0000,GET,/favicon.ico,200,162,https://python.zach.lol/,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6...
4,104.5.217.57,11/May/2020:16:26:27 +0000,GET,/,200,42,-,python-requests/2.23.0
5,76.185.131.226,11/May/2020:16:26:46 +0000,GET,/documentation,200,348,-,python-requests/2.23.0
6,76.185.131.226,11/May/2020:16:26:54 +0000,GET,/documentation,200,348,-,python-requests/2.23.0
7,104.5.217.57,11/May/2020:16:27:04 +0000,GET,/documentation,200,348,-,python-requests/2.23.0
8,76.185.131.226,11/May/2020:16:27:05 +0000,GET,/documentation,200,348,-,python-requests/2.23.0
9,76.185.131.226,11/May/2020:16:27:10 +0000,GET,/documentation,200,348,-,python-requests/2.23.0


In [6]:
import re # part of the python stdlib

- `re.search`: returns the first match for a regex in a subject (a match object), or `None` if nothing matches
- `re.findall`: returns a list of *all* the matches for a regex in a subject
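
Because `re.search` can return `None`, it is worth checking the result before using it. A small sketch (the strings here are made up for illustration):

```python
import re

subject = 'That leaves 23 days.'

match = re.search(r'\d+', subject)
first_number = match.group() if match else None  # the matched text
position = match.span() if match else None       # (start, end) indexes

# No digits at all: search gives None, findall gives an empty list.
missing = re.search(r'\d+', 'no digits here')
none_found = re.findall(r'\d+', 'no digits here')

print(first_number, position)  # 23 (12, 14)
print(missing, none_found)     # None []
```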

In [7]:
text = '''
At the end of today, we will have completed 504.5 / 670 hours.
That's 75 days done and 23 days left.
Yall are 76.5% finished with the course.
'''

In [8]:
re.search(r'\d+', text)

<re.Match object; span=(45, 48), match='504'>

In [9]:
re.findall(r'\d+', text)

['504', '5', '670', '75', '23', '76', '5']

To install the `hlre` tool:

```
python -m pip install hlre
```

[For more documentation and the source](https://github.com/zgulde/hlre)

See also [regex101](https://regex101.com) (make sure to select the Python flavor)

In [10]:
re.findall(r'\d+', text)

['504', '5', '670', '75', '23', '76', '5']

In [13]:
re.search(r'^.', text)
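
The `re.search(r'^.', text)` call returns `None`, which is why the notebook shows no output: by default `^` only matches at the very start of the subject, the subject starts with a newline, and `.` does not match a newline. A minimal sketch of the two standard flags that change this, using a shortened stand-in for `text`:

```python
import re

# Shortened stand-in for the lesson's `text`; note the leading newline.
sample = '\nAt the end of today.\nThat is 75 days done.\n'

no_match = re.search(r'^.', sample)                    # None
first_char = re.search(r'^.', sample, re.MULTILINE)    # ^ also matches after each newline
line_starts = re.findall(r'^.', sample, re.MULTILINE)  # first char of every non-empty line
newline = re.search(r'^.', sample, re.DOTALL)          # . is allowed to match '\n'

print(no_match)            # None
print(first_char.group())  # A
print(line_starts)         # ['A', 'T']
```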

In [14]:
text

"\nAt the end of today, we will have completed 504.5 / 670 hours.\nThat's 75 days done and 23 days left.\nYall are 76.5% finished with the course.\n"

## `re.sub`

For re-arranging / slicing up / extracting bits of a string:

In [16]:
date = '2020-05-22'
date

'2020-05-22'

Express the date in freedom format: m/d/y

In [20]:
freedom_formatted_date = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\2/\3/\1', date)
freedom_formatted_date

'05/22/2020'
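
The same substitution can be written with named groups, which some find easier to read than `\1`, `\2`, `\3`. In the replacement string, `\g<name>` refers to a named group. The group names below are our own choice:

```python
import re

date = '2020-05-22'

# Group names (year, month, day) are illustrative, not from the lesson.
pattern = r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'
us_date = re.sub(pattern, r'\g<month>/\g<day>/\g<year>', date)

print(us_date)  # 05/22/2020
```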

Removing parts of a string we don't want

In [22]:
text

"\nAt the end of today, we will have completed 504.5 / 670 hours.\nThat's 75 days done and 23 days left.\nYall are 76.5% finished with the course.\n"

In [24]:
re.sub(r'[^\w\s]', '', text)

'\nAt the end of today we will have completed 5045  670 hours\nThats 75 days done and 23 days left\nYall are 765 finished with the course\n'

In [29]:
text = 'You should go check out https://regex101.com, it is a great website!'

match = re.search(r'(https?)://(\w+)\.(\w+)', text)
match.groups()

('https', 'regex101', 'com')

In [31]:
emails = [
    'You should go check out https://regex101.com, it is a great website!',
    'My favorite search engine is https://duckduckgo.com',
    'If you have a question, you can get it answered through http://askjeeves.com, it is great!',
]

df = pd.DataFrame()
df['original_text'] = pd.Series(emails)
df

Unnamed: 0,original_text
0,"You should go check out https://regex101.com, ..."
1,My favorite search engine is https://duckduckg...
2,"If you have a question, you can get it answere..."


In [33]:
df.original_text.str.extract(r'(https?)://(\w+)\.(\w+)')

Unnamed: 0,0,1,2
0,https,regex101,com
1,https,duckduckgo,com
2,http,askjeeves,com


Named capture groups

In [37]:
text = 'You should go check out https://regex101.com, it is a great website!'

match = re.search(r'(?P<protocol>https?)://(?P<base_domain>\w+)\.(?P<tld>\w+)', text)
match.groupdict()

{'protocol': 'https', 'base_domain': 'regex101', 'tld': 'com'}

In [39]:
df.original_text.str.extract(r'(?P<protocol>https?)://(?P<base_domain>\w+)\.(?P<tld>\w+)')

Unnamed: 0,protocol,base_domain,tld
0,https,regex101,com
1,https,duckduckgo,com
2,http,askjeeves,com
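
The other `pd.Series.str` methods from the outline (`.contains`, `.count`, `.replace`) take regexes too. A minimal sketch on a made-up Series; `regex=True` is passed explicitly to `.replace` because newer pandas versions treat the pattern as a literal string by default:

```python
import pandas as pd

# Made-up example data, not the lesson's DataFrame.
messages = pd.Series([
    'My favorite search engine is https://duckduckgo.com',
    'No links here, just 2 numbers: 10 and 20',
])

has_url = messages.str.contains(r'https?://')           # boolean Series
digit_runs = messages.str.count(r'\d+')                 # number of matches per row
masked = messages.str.replace(r'\d+', '#', regex=True)  # like re.sub, row by row

print(has_url.tolist())     # [True, False]
print(digit_runs.tolist())  # [0, 3]
print(masked[1])            # No links here, just # numbers: # and #
```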


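- `re.VERBOSE` ignores whitespace in the regexp (except inside a character class or when escaped) and treats `#` to the end of the line as a comment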
- `(?# this is a comment)`: allows you to document your regexes inline

In [43]:
text = 'You should go check out https://regex101.com, it is a great website!'

regexp = r'''
(?P<protocol>https?)
:// (?# ignore the :// that separates protocol from domain)
(?P<base_domain>\w+)
\.
(?P<tld>\w+)
'''
match = re.search(regexp, text, re.VERBOSE) # whitespace in the regex is ignored
match.groupdict()

{'protocol': 'https', 'base_domain': 'regex101', 'tld': 'com'}
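
One consequence of verbose mode: a space you actually want to match has to be written as `[ ]` (or escaped). A small sketch that also uses `#` comments; the pattern and group names are ours, for illustration:

```python
import re

# Illustrative pattern; group names are not from the lesson.
pattern = r'''
    (?P<hours>\d+(?:\.\d+)?)  # a number, optionally with a decimal part
    [ ]/[ ]                   # a literal slash with one space on each side
    (?P<total>\d+)            # the total number of hours
'''
match = re.search(pattern, 'completed 504.5 / 670 hours', re.VERBOSE)

print(match.groupdict())  # {'hours': '504.5', 'total': '670'}
```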

In [52]:
lines[0]

'76.185.131.226 - - [11/May/2020:14:25:53 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"'

Non-greedy regex: a `?` after a repetition operator makes it match as little as possible.
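
To see the difference, compare `.+` with `.+?` between double quotes. The subject here is a fragment of one of the log lines:

```python
import re

line = '"GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"'

# Greedy: runs to the LAST closing quote on the line.
greedy = re.search(r'"(.+)"', line).group(1)

# Non-greedy: stops at the FIRST closing quote.
lazy = re.search(r'"(.+?)"', line).group(1)

print(greedy)  # GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0
print(lazy)    # GET / HTTP/1.1
```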

In [56]:
regexp = r'''
^
(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})
\s-\s-\s
\[
(?P<timestamp>.+)
\]
\s"
(?P<request_info>.+?)
"
.+
"
(?P<user_agent>.+)
"
$
'''
lines.str.extract(regexp, flags=re.VERBOSE)  # lines is already a Series

Unnamed: 0,ip,timestamp,request_info,user_agent
0,76.185.131.226,11/May/2020:14:25:53 +0000,GET / HTTP/1.1,python-requests/2.23.0
1,76.185.131.226,11/May/2020:16:25:46 +0000,GET / HTTP/1.1,python-requests/2.23.0
2,76.185.131.226,11/May/2020:16:25:58 +0000,GET / HTTP/1.1,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6...
3,76.185.131.226,11/May/2020:16:25:58 +0000,GET /favicon.ico HTTP/1.1,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6...
4,104.5.217.57,11/May/2020:16:26:27 +0000,GET / HTTP/1.1,python-requests/2.23.0
5,76.185.131.226,11/May/2020:16:26:46 +0000,GET /documentation HTTP/1.1,python-requests/2.23.0
6,76.185.131.226,11/May/2020:16:26:54 +0000,GET /documentation HTTP/1.1,python-requests/2.23.0
7,104.5.217.57,11/May/2020:16:27:04 +0000,GET /documentation HTTP/1.1,python-requests/2.23.0
8,76.185.131.226,11/May/2020:16:27:05 +0000,GET /documentation HTTP/1.1,python-requests/2.23.0
9,76.185.131.226,11/May/2020:16:27:10 +0000,GET /documentation HTTP/1.1,python-requests/2.23.0


Be careful using regex to find:

- urls
- html
- email addresses
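
As an illustration of why, the simple URL regex from earlier breaks on a URL with a subdomain. One alternative for URLs, sketched here, is the standard library's `urllib.parse`; the example URL is our own choice:

```python
import re
from urllib.parse import urlsplit

url = 'https://docs.python.org/3/library/re.html'

# The earlier pattern treats "python" as the TLD, which is wrong here.
by_regex = re.search(r'(https?)://(\w+)\.(\w+)', url).groups()

# A dedicated parser handles subdomains, paths, ports, and so on.
parts = urlsplit(url)

print(by_regex)                                  # ('https', 'docs', 'python')
print(parts.scheme, parts.hostname, parts.path)  # https docs.python.org /3/library/re.html
```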

In [59]:
text = "That's what word boundaries are."

re.findall(r'.\b', text)

['t', "'", 's', ' ', 't', ' ', 'd', ' ', 's', ' ', 'e']
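
A more typical use of `\b` is matching whole words only. A short sketch with a made-up sentence:

```python
import re

sentence = 'The cat scattered the catalog.'

anywhere = re.findall(r'cat', sentence)          # also inside "scattered" and "catalog"
whole_word = re.findall(r'\bcat\b', sentence)    # only the standalone word
word_start = re.findall(r'\bcat\w*', sentence)   # words that begin with "cat"

print(anywhere)    # ['cat', 'cat', 'cat']
print(whole_word)  # ['cat']
print(word_start)  # ['cat', 'catalog']
```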

---

Get rid of the 0 at the start of the month.
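
One possible approach, assuming the date is in the m/d/y format produced earlier, so the month is at the very start of the string. The helper name `strip_month_zero` is hypothetical:

```python
import re

def strip_month_zero(date):
    """Drop a single leading 0 from an m/d/y date string (hypothetical helper)."""
    # ^ anchors to the start of the string, where the month lives.
    return re.sub(r'^0', '', date)

print(strip_month_zero('05/22/2020'))  # 5/22/2020
print(strip_month_zero('11/03/2020'))  # 11/03/2020 (day is left alone)
```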