## Regular Expressions
- A regular expression is a sort of **meta-language that can be used to describe patterns in text.**
    - a mini language for describing text
    - bigger than Python (many languages and tools support regexes), but here we use the Python flavor

- Regexes are most commonly used in one of two ways:
    - To find/extract text that matches a pattern.
    - To replace/substitute text that matches a pattern.

### The re module
- To demonstrate regular expressions, we'll be using the `re` module from the Python standard library, and its `findall` function. Many other libraries also work with regular expressions.

- This function accepts a string that is a regular expression, the `pattern`, and another string that is the string to be searched. `findall` returns a list of every non-overlapping match of the regular expression in the string.

**Raw Strings**
- Any string in Python prefixed with an `r` is a raw string. This means that **backslashes will be included in the string verbatim, and don't carry special meaning** to Python itself. It is very common to use raw strings when creating a regular expression.
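
As a small sketch of why raw strings matter for regexes, the cell below compares a normal and a raw string. The variable names and the `'cat concat'` subject are made up for illustration.

```python
import re

# In a normal string, '\n' is a single newline character.
normal = '\n'
# In a raw string the backslash is kept, giving two characters: '\' and 'n'.
raw = r'\n'
print(len(normal), len(raw))

# Without the r prefix, a backslash meant for the regex has to be doubled.
doubled = '\\bcat\\b'
raw_pattern = r'\bcat\b'
print(doubled == raw_pattern)

# '\b' in a normal string is a backspace character, not a word boundary,
# so the first pattern finds nothing.
no_prefix_matches = re.findall('\bcat\b', 'cat concat')
raw_matches = re.findall(raw_pattern, 'cat concat')
print(no_prefix_matches, raw_matches)
```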

### Basic Regexes
At its most basic, any alphanumeric character is a valid regular expression that matches itself.

In [1]:
import re

re.findall(r'b', 'abcd')

['b']

We'll define a function here **to simplify the process of showing many results from regular expressions.**

In [2]:
def show_all_matches(regexes, subject, re_length=6):
    print('Sentence:')
    print()
    print('    {}'.format(subject))
    print()
    print(' regexp{} | matches'.format(' ' * (re_length - 6)))
    print(' ------{} | -------'.format(' ' * (re_length - 6)))
    for regexp in regexes:
        fmt = ' {:<%d} | {!r}' % re_length
        matches = re.findall(regexp, subject)
        if len(matches) > 8:
            matches = matches[:8] + ['...']
        print(fmt.format(regexp, matches))

In [7]:
sentence = 'Mary had a little lamb. 1 little lamb. Not 10, not 12, not 223, just one.'

show_all_matches([
    r'a',
    r'm',
    r'M',
    r'Mary',
    r'little',
    r'1',
    r'10',
    r'22'
], sentence) # Pass a list of regexes and the string to search

Sentence:

    Mary had a little lamb. 1 little lamb. Not 10, not 12, not 223, just one.

 regexp | matches
 ------ | -------
 a      | ['a', 'a', 'a', 'a', 'a']
 m      | ['m', 'm']
 M      | ['M']
 Mary   | ['Mary']
 little | ['little', 'little']
 1      | ['1', '1', '1']
 10     | ['10']
 22     | ['22']


### Metacharacters and Character Classes

- In addition to letters and numbers, there are special **metacharacters** in regular expressions. A metacharacter does not match itself literally; instead it stands for a kind of character or changes how the pattern is interpreted. Metacharacters must be **escaped** with a backslash to match the character itself.

Here are several metacharacters that represent various **character classes**.

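- `.`: any character (except a newline, by default)
- `\w`: any letter, digit, or underscore
- `\W`: anything that's not a letter, digit, or underscore
- `\d`: any digit
- `\D`: anything that's not a digit
- `\s`: any whitespace character
- `\S`: anything that's not whitespace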

In [8]:
res = [r'\w', 
       r'\d', 
       r'\s', 
       r'.',  # matches every character
       r'\.'] # a literal period

show_all_matches(res, sentence)

Sentence:

    Mary had a little lamb. 1 little lamb. Not 10, not 12, not 223, just one.

 regexp | matches
 ------ | -------
 \w     | ['M', 'a', 'r', 'y', 'h', 'a', 'd', 'a', '...']
 \d     | ['1', '1', '0', '1', '2', '2', '2', '3']
 \s     | [' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '...']
 .      | ['M', 'a', 'r', 'y', ' ', 'h', 'a', 'd', '...']
 \.     | ['.', '.', '.']


These can be combined.

In [11]:
show_all_matches([r'l\w\w\w\W', r'\d\d'], sentence, re_length=9)

Sentence:

    Mary had a little lamb. 1 little lamb. Not 10, not 12, not 223, just one.

 regexp    | matches
 ------    | -------
 l\w\w\w\W | ['lamb.', 'lamb.']
 \d\d      | ['10', '12', '22']


### Repeating
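All of the metacharacters in the list below match the preceding character (or group) a repeated number of times.
- `*`: zero or more
- `+`: one or more
- `{n}`: exactly n repetitions
- `{n,}`: n or more repetitions
- `{n,m}`: between n and m repetitions (write it without a space; Python's `re` treats `{n, m}` as literal text)
- `?`: makes the preceding character optional; placed after another repetition operator, makes it non-greedy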
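
A brief sketch of the optional `?` and the difference between `*` and `+`. The subjects here are invented for illustration.

```python
import re

# ? makes the preceding character optional: both spellings match.
optional = re.findall(r'colou?r', 'color colour colr')
print(optional)

# * allows zero repetitions of 'b', + requires at least one.
star = re.findall(r'ab*', 'a ab abbb')
plus = re.findall(r'ab+', 'a ab abbb')
print(star)
print(plus)

# {2,} means two or more of the preceding character.
counted = re.findall(r'a{2,}', 'a aa aaa')
print(counted)
```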

In [12]:
show_all_matches([
    r'\d+'
], sentence)

Sentence:

    Mary had a little lamb. 1 little lamb. Not 10, not 12, not 223, just one.

 regexp | matches
 ------ | -------
 \d+    | ['1', '10', '12', '223']


In [13]:
show_all_matches([
    r'a{2,}',
    r'a{2}',
    r'a{3,4}'
], 'aabbaaaa')

Sentence:

    aabbaaaa

 regexp | matches
 ------ | -------
 a{2,}  | ['aa', 'aaaa']
 a{2}   | ['aa', 'aa', 'aa']
 a{3,4} | ['aaaa']


### Any of or None of
- The square brackets in a regular expression represent **a single character that will match any of the values within the square brackets.** For example, [ab] will match either an 'a' or a 'b'.

- If the first character inside of the square brackets is a caret, ^, then **anything that is not inside of the square brackets will be matched.** For example, [^ab] will match any character that is neither 'a' nor 'b'.

- Inside of square brackets, ranges of letters and numbers can be abbreviated with a hyphen.

In [14]:
show_all_matches([
    r'[lt]',
    r'[lt]+',
    r'[^aeiou\s\.]', # anything that's not a lowercase vowel, whitespace, or a period
    r'[a-d]'
], sentence, re_length=12)

Sentence:

    Mary had a little lamb. 1 little lamb. Not 10, not 12, not 223, just one.

 regexp       | matches
 ------       | -------
 [lt]         | ['l', 't', 't', 'l', 'l', 'l', 't', 't', '...']
 [lt]+        | ['l', 'ttl', 'l', 'l', 'ttl', 'l', 't', 't', '...']
 [^aeiou\s\.] | ['M', 'r', 'y', 'h', 'd', 'l', 't', 't', '...']
 [a-d]        | ['a', 'a', 'd', 'a', 'a', 'b', 'a', 'b']


### Anchors
There are several special metacharacters that don't match any individual characters, but serve as an "anchor" for the rest of the regular expression.
- `^`: the start of the string/line
- `$`: the end of the string/line
- `\b`: a word boundary

In [15]:
show_all_matches([
    r'\bo\w+', # any word that starts with an 'o'
    r'^\s', # starts with a space
    r'^M', # starts with 'M'
    r'\.$', # ends with a period
], sentence)

Sentence:

    Mary had a little lamb. 1 little lamb. Not 10, not 12, not 223, just one.

 regexp | matches
 ------ | -------
 \bo\w+ | ['one']
 ^\s    | []
 ^M     | ['M']
 \.$    | ['.']


### Other Common Functions
- `match`: Matches from the start of the string.
- `search`: Find the first instance of the regular expression.
- `sub`: Make substitutions with a regular expression.
- `compile`: Prepare a regular expression for use ahead of time.

Now we'll take a look at using search and sub in more detail.
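
As a quick sketch of how `match`, `search`, and `compile` differ, using made-up subjects:

```python
import re

subject = 'abc 123'

# match only succeeds if the pattern matches at the very start of the string.
at_start = re.match(r'\d+', subject)
print(at_start)  # None, because the string starts with 'abc'

# search scans forward and returns the first match anywhere in the string.
anywhere = re.search(r'\d+', subject)
print(anywhere[0])

# compile builds a reusable pattern object that offers the same methods.
digits = re.compile(r'\d+')
found = digits.findall('10 apples, 20 pears')
masked = digits.sub('#', '10 apples, 20 pears')
print(found)
print(masked)
```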

### Capture Groups

- We can define groups in our regular expressions called **capture groups.** This allows us to **reference the groups later on in the regular expression, or apply repetition to the group as a whole.**

- Note that when we include capture groups in our regular expressions, findall will return only the matched groups, not the entire text that was matched.
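
The two points above can be sketched as follows. The subjects are invented for illustration; `\1` is the standard way to refer back to the first group inside the pattern.

```python
import re

# \1 refers back to whatever the first group captured,
# so this finds any word character that appears twice in a row.
doubled = re.search(r'(\w)\1', 'little lamb')
print(doubled[0])

# Repetition applied to a whole group: 'ab' repeated two or more times.
repeated = re.search(r'(ab){2,}', 'xxababab')
print(repeated[0])

subject = 'x=1, y=22'
# No groups: findall returns the full text of each match.
no_groups = re.findall(r'\w=\d+', subject)
# One group: only the captured text is returned.
one_group = re.findall(r'\w=(\d+)', subject)
# Several groups: each match becomes a tuple of the captured texts.
two_groups = re.findall(r'(\w)=(\d+)', subject)
print(no_groups, one_group, two_groups)
```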

In [16]:
sentence = '''
You can find us on the web at https://codeup.com. Our ip address is 123.123.123.123 (maybe).
'''.strip()

sentence

'You can find us on the web at https://codeup.com. Our ip address is 123.123.123.123 (maybe).'

In [17]:
ip_re = r'\d+(\.\d+){3}'

match = re.search(ip_re, sentence)
match[0]

'123.123.123.123'

In [18]:
# simplified for demonstration; a real regex for parsing urls would be much more complex
url_re = r'(https?)://(\w+)\.(\w+)'

protocol, domain, tld = re.search(url_re, sentence).groups()

print(f'''
protocol: {protocol}
domain:   {domain}
tld:      {tld}
''')


protocol: https
domain:   codeup
tld:      com



You can create non-capturing (aka shy) groups by adding `?:` to the beginning of the group, and groups can be named by adding `?P<name>`.

In [19]:
url_re = r'(?P<protocol>https?)://(?:\w+)\.(?P<tld>\w+)'

match = re.search(url_re, sentence)

print(f'''
groups: {match.groups()}
referencing a group by name: {match.group('tld')}
group dictionary: {match.groupdict()}
''')


groups: ('https', 'com')
referencing a group by name: com
group dictionary: {'protocol': 'https', 'tld': 'com'}



### Substitution
We can use a regular expression **to replace or remove parts of a string.** In addition, if the supplied regular expression has capture groups in it, the text captured can be referenced when making the substitution.

In [20]:
# remove anything that's not a digit
re.sub(r'\D', '', 'abc 123')

'123'

In [21]:
# remove anything that's not a letter
re.sub(r'[^a-z]', '', 'abc 123')

'abc'

In [24]:
re.sub(r'.(.).', r'\1', 'abcde')

'bde'

In [29]:
re.sub(r'(.)(.)(.)', r'\3\2\1', 'abcde')

'cbade'

In [31]:
# Replace the last two characters with a single X
re.sub(r'.{2}$', 'X', 'abcdefg')

'abcdeX'

### Regex Flags
- Pass any of the flags below as the `flags` argument to any of the regular expression functions mentioned in this lesson, and that behavior will be enabled for that use of the regular expression.

    - `re.MULTILINE`: The `^` and `$` anchors will apply line by line, instead of applying to start and end of the string.

    - `re.IGNORECASE`: Ignore character casing when matching.
    - `re.VERBOSE`: Ignore any whitespace in the regular expression. This can be useful to make more readable regular expressions, especially when combined with comment groups, `(?# ...)`.
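
A short sketch of `re.MULTILINE` and `re.IGNORECASE` in use. The three-line `text` is made up for illustration. Flags are passed by keyword here because, with `re.sub`, the fourth positional argument is `count`, not `flags`.

```python
import re

text = 'first line\nSecond line\nthird line'

# Without MULTILINE, ^ only matches at the very start of the string.
start_only = re.findall(r'^\w+', text)
# With MULTILINE, ^ also matches right after each newline.
each_line = re.findall(r'^\w+', text, flags=re.MULTILINE)
print(start_only)
print(each_line)

# IGNORECASE lets a lowercase pattern match uppercase text.
any_case = re.findall(r'second', text, flags=re.IGNORECASE)
print(any_case)

# Flags can be combined with the | operator.
combined = re.findall(r'^s\w+', text, flags=re.MULTILINE | re.IGNORECASE)
print(combined)

# With re.sub, pass flags by keyword.
replaced = re.sub(r'LINE', 'row', text, flags=re.IGNORECASE)
print(replaced)
```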

In [32]:
regexp = r'''
[aeiou] (?# any vowel)
[^aeiou] (?# followed by a non-vowel)
'''

The above is equivalent to the following.

In [33]:
regexp = r'[aeiou][^aeiou]'

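When the `VERBOSE` flag is set, the whitespace and newlines in the first pattern are ignored, so both patterns match the same text.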
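
The equivalence can be checked directly. This is a minimal sketch; the names `verbose_re` and `compact_re` and the subject are chosen only for this example.

```python
import re

verbose_re = r'''
[aeiou] (?# any vowel)
[^aeiou] (?# followed by a non-vowel)
'''
compact_re = r'[aeiou][^aeiou]'

subject = 'regular expression'

with_flag = re.findall(verbose_re, subject, flags=re.VERBOSE)
compact = re.findall(compact_re, subject)
print(with_flag == compact)

# Without the flag, the newlines and spaces are matched literally,
# so the same pattern text finds nothing in this subject.
without_flag = re.findall(verbose_re, subject)
print(without_flag)
```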

# Regex

- What is a regular expression?
    
- When are regular expressions useful?
    - parsing regular text, e.g. log files
    - wrangle
    - scope
- When should you not use Regex?
    - too complicated?
    - when there is a tool already built to parse the data

In [37]:
import pandas as pd
import re

In [38]:
log_file_lines = '''
76.185.131.226 - - [11/May/2020:14:25:53 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:25:46 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:25:58 +0000] "GET / HTTP/1.1" 200 42 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
76.185.131.226 - - [11/May/2020:16:25:58 +0000] "GET /favicon.ico HTTP/1.1" 200 162 "https://python.zach.lol/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
104.5.217.57 - - [11/May/2020:16:26:27 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:26:46 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:26:54 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
104.5.217.57 - - [11/May/2020:16:27:04 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:27:05 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:27:10 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
'''
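
As a minimal sketch of the log-parsing use case, the pattern below pulls a few fields out of two of the sample lines. The group names (`ip`, `timestamp`, `method`, `path`, `status`, `size`) and the pattern itself are assumptions for illustration, not a complete parser for this log format.

```python
import re

# Two lines copied from the sample log above.
sample = '''
76.185.131.226 - - [11/May/2020:14:25:53 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"
104.5.217.57 - - [11/May/2020:16:27:04 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
'''

log_re = r'''
^(?P<ip>\d+\.\d+\.\d+\.\d+)    (?# client ip address)
\s-\s-\s
\[(?P<timestamp>[^\]]+)\]      (?# everything between the square brackets)
\s"(?P<method>[A-Z]+)\s(?P<path>\S+)\s[^"]+"
\s(?P<status>\d{3})
\s(?P<size>\d+)
'''

# MULTILINE lets ^ match at the start of every line;
# VERBOSE lets the pattern be spread over several lines.
rows = [
    m.groupdict()
    for m in re.finditer(log_re, sample, flags=re.VERBOSE | re.MULTILINE)
]

for row in rows:
    print(row)
```

Every captured value is a string, so `status` and `size` would need converting with `int` before doing arithmetic on them.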

In [3]:
import re # part of the python stdlib

- search: shows a single match for a regex
- findall: shows *all* the matches for a regex in a subject

### Literals

In [7]:
regexp = r'a' # r means raw string
subject = 'abc'

re.search(regexp, subject)

<re.Match object; span=(0, 1), match='a'>

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <ol>
        <li>Change your regular expression to match the literal character "b". What do you notice?</li>
        <li>Change your regular expression to match the literal string "ab". What do you notice?</li>
        <li>Change your regular expression to match the literal "d". What do you notice?</li>
        <li>Use <code>re.findall</code> instead of <code>re.search</code>. How do the results differ?</li>
        <li>Change your regular expression to just the "." character. What are the results?</li>
    </ol>
</div>

In [8]:
regexp = r'b'
subject = 'abc'

re.search(regexp, subject)

<re.Match object; span=(1, 2), match='b'>

In [9]:
regexp = r'ab'
subject = 'abc'

re.search(regexp, subject)

<re.Match object; span=(0, 2), match='ab'>

In [15]:
regexp = r'd'
subject = 'abc'

type(re.search(regexp, subject))

NoneType

In [12]:
regexp = r'b'
subject = 'abc'

re.findall(regexp, subject)

['b']

In [13]:
regexp = r'.'
subject = 'abc'

re.findall(regexp, subject)

['a', 'b', 'c']

In [14]:
regexp = r'.'
subject = 'abc'

re.search(regexp, subject)

<re.Match object; span=(0, 1), match='a'>

### Metacharacters

- `.` -- any character
- `\w` any alphanumeric
- `\s` whitespace
- `\d` digits
- Capital variants invert: `\W`, `\S`, and `\D` match the opposite of the lowercase version

In [16]:
regexp = r'\w'
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(0, 1), match='a'>

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>Continue to use the same subject variable from above.</p>
    <ol>
        <li>Use all of the above metacharacters with <code>re.findall</code>. What do you notice?</li>
        <li>What does the regular expression <code>\w\w</code> match?</li>
        <li>Use only metacharacters to write a regular expression to match "c 1".</li>
        <li>Use a combination of metacharacters to match 3 digits in a row.</li>
    </ol>
</div>

### Exercise 1

In [17]:
regexp = r'.'
subject = 'abc 123'

print(re.search(regexp, subject))
print(re.findall(regexp, subject))

<re.Match object; span=(0, 1), match='a'>
['a', 'b', 'c', ' ', '1', '2', '3']


In [18]:
regexp = r'\w'
subject = 'abc 123'

print(re.search(regexp, subject))
print(re.findall(regexp, subject))

<re.Match object; span=(0, 1), match='a'>
['a', 'b', 'c', '1', '2', '3']


In [19]:
regexp = r'\s'
subject = 'abc 123'

print(re.search(regexp, subject))
print(re.findall(regexp, subject))

<re.Match object; span=(3, 4), match=' '>
[' ']


In [20]:
regexp = r'\d'
subject = 'abc 123'

print(re.search(regexp, subject))
print(re.findall(regexp, subject))

<re.Match object; span=(4, 5), match='1'>
['1', '2', '3']


In [21]:
regexp = r'\W'
subject = 'abc 123'

print(re.search(regexp, subject))
print(re.findall(regexp, subject))

<re.Match object; span=(3, 4), match=' '>
[' ']


In [22]:
regexp = r'\S'
subject = 'abc 123'

print(re.search(regexp, subject))
print(re.findall(regexp, subject))

<re.Match object; span=(0, 1), match='a'>
['a', 'b', 'c', '1', '2', '3']


In [23]:
regexp = r'\D'
subject = 'abc 123'

print(re.search(regexp, subject))
print(re.findall(regexp, subject))

<re.Match object; span=(0, 1), match='a'>
['a', 'b', 'c', ' ']


### Exercise 2

In [24]:
regexp = r'\w\w'
subject = 'abc 123'

print(re.search(regexp, subject))
print(re.findall(regexp, subject))

<re.Match object; span=(0, 2), match='ab'>
['ab', '12']


### Exercise 3

In [25]:
regexp = r'\w\s\d'
subject = 'abc 123'

print(re.search(regexp, subject))
print(re.findall(regexp, subject))

<re.Match object; span=(2, 5), match='c 1'>
['c 1']


### Exercise 4

In [26]:
regexp = r'\d\d\d'
subject = 'abc 123'

print(re.search(regexp, subject))
print(re.findall(regexp, subject))

<re.Match object; span=(4, 7), match='123'>
['123']


### Repeating

- `{}`: a specific number of repetitions, e.g. `{5}`, `{5,8}`, `{3,}`
- `*`: zero or more
- `+`: one or more
- `?`: optional or non-greedy
- greedy vs. non-greedy


- `r'.+'` is greedy: it matches as much as it can
- `r'.{3,}?'` is non-greedy: it matches as little as it can (here, 3 characters)
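
A small sketch of greedy versus non-greedy matching. The HTML-like subject is invented for illustration.

```python
import re

subject = '<b>bold</b> and <i>italic</i>'

# Greedy: .+ grabs as much as it can, from the first '<' to the last '>'.
greedy = re.findall(r'<.+>', subject)
# Non-greedy: .+? stops at the first '>' that allows a match.
lazy = re.findall(r'<.+?>', subject)
print(greedy)
print(lazy)

# The same idea with a counted repetition.
most = re.search(r'.{3,}', 'abcdef')[0]    # as many characters as possible
fewest = re.search(r'.{3,}?', 'abcdef')[0]  # as few as possible: three
print(most, fewest)
```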

In [27]:
regexp = r'\w+'
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(0, 3), match='abc'>

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>Use the string below as your subject for this exercise.</p>
    <pre><code>Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, San Antonio, TX 78205. You can find us online at http://codeup.com and our alumni portal is located at https://alumni.codeup.com.</code></pre>
    <ol>
        <li>Write a regular expression that matches all the numbers.</li>
        <li>Write a regular expression that matches a 5 digit number, but not a number with fewer digits.</li>
        <li>Write a regular expression that matches any urls in the subject.</li>
    </ol>
</div>

In [39]:
# Create a subject
subject = 'Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, San Antonio, TX 78205. You can find us online at http://codeup.com and our alumni portal is located at https://alumni.codeup.com.'

In [31]:
# 1. Write a regular expression that matches all the numbers.

regexp = r'\d+'

re.findall(regexp, subject)

['2014', '600', '350', '78205']

In [32]:
# 2. Write a regular expression that matches a 5 digit number, but not a number with fewer digits.

regexp = r'\d{5,}' # 5 or more digits; r'\b\d{5}\b' would match exactly 5

re.findall(regexp, subject)

['78205']

In [41]:
# 3. Write a regular expression that matches any urls in the subject.

regexp = r'https?://.+?com'

re.findall(regexp, subject)

['http://codeup.com', 'https://alumni.codeup.com']

### Any/None Of
- `[ ]`: any one of the enclosed characters, e.g. `[a1][b2][c3]` matches a or 1, then b or 2, then c or 3

In [7]:
regexp = r'[a1][b2][c3]'
subject = 'abc 123'

re.match(regexp, subject)

<re.Match object; span=(0, 3), match='abc'>

In [8]:
subject = '123abc'

re.match(regexp, subject)

<re.Match object; span=(0, 3), match='123'>

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>For this exercise you should make up various subjects and test them with your regular expressions.</p>
    <ol>
        <li>Write a regular expression that matches even numbers.</li>
        <li>Write a regular expression that matches 2 or more odd numbers in a row.</li>
        <li>Write a regular expression that matches any word with a vowel in it.</li>
    </ol>
</div>

In [53]:
regexp = r'\d+[13579]' # two or more digits, the last of which is odd

re.search(regexp, '34343')

<re.Match object; span=(0, 5), match='34343'>

In [54]:
regexp = r'\d+[13579]'

re.findall(regexp, '343, 43')

['343', '43']

In [49]:
regexp = r'\b\d*[02468]\b' # \b on both sides so only whole even numbers match

re.search(regexp, '123 bcd 456 13 abc')

<re.Match object; span=(8, 11), match='456'>
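
One possible set of answers to the mini exercise, as a sketch rather than the only solution. It reads "odd numbers in a row" as consecutive odd digits, which is an assumption, and it does not count `y` as a vowel. The subjects are made up.

```python
import re

subject = '12 7 135 48 9 3157 abc'

# Even numbers: a whole run of digits whose last digit is even.
even_re = r'\b\d*[02468]\b'
# Two or more odd digits in a row.
odd_run_re = r'[13579]{2,}'
# Any word containing at least one vowel.
vowel_word_re = r'\b\w*[aeiouAEIOU]\w*\b'

evens = re.findall(even_re, subject)
odd_runs = re.findall(odd_run_re, subject)
vowel_words = re.findall(vowel_word_re, 'try my regular expression')

print(evens)
print(odd_runs)
print(vowel_words)
```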

### Anchors

- `^`: starts with
- `$`: ends with

In [9]:
regexp = r'b'
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(1, 2), match='b'>

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>For this exercise you should make up various subjects and test them with your regular expressions.</p>
    <ol>
        <li>Write a regular expression that matches if a word starts with a vowel.</li>
        <li>Write a regular expression that matches if a word starts with a capital letter.</li>
        <li>Write a regular expression that matches if a word ends with a capital letter.</li>        
        <li>Write a regular expression that matches if a word starts <b>and</b> ends with a capital letter.</li>
    </ol>
</div>

In [57]:
regexp = r'^[aeiouAEIOU]'

re.search(regexp, 'Egg')

<re.Match object; span=(0, 1), match='E'>

In [60]:
regexp = r'^[A-Z]'

re.search(regexp, 'Abc')

<re.Match object; span=(0, 1), match='A'>

In [62]:
regexp = r'[A-Z]$'

re.search(regexp, 'acB')

<re.Match object; span=(2, 3), match='B'>

In [67]:
regexp = r'^[A-Z]\w+[A-Z]$'

re.search(regexp, 'AcC')

<re.Match object; span=(0, 3), match='AcC'>

In [69]:
regexp = r'^[A-Z][a-zA-Z]+[A-Z]$'

re.search(regexp, 'A1C') # returns None: '1' is not a letter, so there is no output
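
The answers above use `^` and `$`, so they test a subject that is a single word. As a sketch of the same four exercises applied to words inside a longer sentence, `\b` can stand in for the anchors. The sentence and variable names are invented for illustration.

```python
import re

subject = 'Alice met BoB and an Owl'

starts_with_vowel = re.findall(r'\b[aeiouAEIOU]\w*', subject)
starts_with_capital = re.findall(r'\b[A-Z]\w*', subject)
ends_with_capital = re.findall(r'\w*[A-Z]\b', subject)
# Needs at least two characters: a capital, anything, then a capital.
starts_and_ends = re.findall(r'\b[A-Z]\w*[A-Z]\b', subject)

print(starts_with_vowel)
print(starts_with_capital)
print(ends_with_capital)
print(starts_and_ends)
```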

### Capture Groups

In [70]:
regexp = r'.*?(\d+)'
s = pd.Series(['abc', 'abc123', '123'])
s.str.extract(regexp)

Unnamed: 0,0
0,
1,123
2,123


## `re.sub`

- removing
- substitution

In [11]:
regexp = r'\d+'
subject = 'abc123'

re.sub(regexp, '', subject)

'abc'

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>Use the code below to get started on this exercise.</p>
    <pre><code>dates = pd.Series(['2020-11-12', '2020-07-13', '2021-01-12'])</code></pre>
    <p>Use regular expression substitution to reformat the dates in the format common in the US: m/d/y.</p>
</div>

In [71]:
dates = pd.Series(['2020-11-12', '2020-07-13', '2021-01-12'])
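
One way to approach the exercise, sketched with plain `re.sub` on a list so it runs without pandas. The names `date_re` and `us_dates` are made up for this example.

```python
import re

# Same values as the dates Series above, as a plain list.
dates = ['2020-11-12', '2020-07-13', '2021-01-12']

# Capture year, month, and day, then reorder them in the replacement.
date_re = r'(\d{4})-(\d{2})-(\d{2})'
us_dates = [re.sub(date_re, r'\2/\3/\1', d) for d in dates]

print(us_dates)
```

With the pandas Series, the same pattern and replacement should work through `dates.str.replace(date_re, r'\2/\3/\1', regex=True)`.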

## Misc

### Pandas Usage

- `.str`
    - `.extract`
    - `.count`
    - `.contains`
    - `.replace`
- extract + concat
- named groups

In [72]:
df = pd.DataFrame()
df['text'] = pd.Series([
    'You should go check out https://regex101.com, it is a great website!',
    'My favorite search engine is https://duckduckgo.com',
    'If you have a question, you can get it answered through http://askjeeves.com, it is great!',
])
df

Unnamed: 0,text
0,"You should go check out https://regex101.com, ..."
1,My favorite search engine is https://duckduckg...
2,"If you have a question, you can get it answere..."


In [13]:
df.text.str.extract(r'(https?)://(\w+)\.(\w+)')

Unnamed: 0,0,1,2
0,https,regex101,com
1,https,duckduckgo,com
2,http,askjeeves,com


### Interactive Regex Tool

To install the `hlre` tool:

```
python -m pip install hlre
```

[For more documentation and the source](https://github.com/zgulde/hlre)

See also [regex101](https://regex101.com) (make sure to select the Python flavor)

### Named capture groups

In [14]:
text = 'You should go check out https://regex101.com, it is a great website!'

match = re.search(r'(?P<protocol>https?)://(?P<base_domain>\w+)\.(?P<tld>\w+)', text)
match.groupdict()

{'protocol': 'https', 'base_domain': 'regex101', 'tld': 'com'}

In [15]:
df.text.str.extract(r'(?P<protocol>https?)://(?P<base_domain>\w+)\.(?P<tld>\w+)')

Unnamed: 0,protocol,base_domain,tld
0,https,regex101,com
1,https,duckduckgo,com
2,http,askjeeves,com


### Verbose regular expressions

- `re.VERBOSE`
- `(?# this is a comment)`

In [16]:
text = 'You should go check out https://regex101.com, it is a great website!'

regexp = r'''
(?P<protocol>https?)
:// (?# ignore the :// that separates protocol from domain)
(?P<base_domain>\w+)
\.
(?P<tld>\w+)
'''
match = re.search(regexp, text, re.VERBOSE) # whitespace in the regex is ignored
match.groupdict()

{'protocol': 'https', 'base_domain': 'regex101', 'tld': 'com'}