# Regex

### Online tester: https://regex101.com/

`\d` ---------- Any numeric digit from 0 to 9.

`\D` ---------- Any character that is not a numeric digit from 0 to 9.

`\w` ---------- Any letter, numeric digit, or the underscore character.

`\W` ---------- Any character that is not a letter, numeric digit, or the underscore character.

`\s` ---------- Any space, tab, or newline character.

`\S` ---------- Any character that is not a space, tab, or newline.
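A quick sketch of the shorthand classes in action. The sample string and variable names are made up for illustration.

```python
import re

# Hypothetical sample text (note the tab between 45 and "on")
sample = "Order 66 shipped to bay_7 at 10:45\ton Monday"

digits = re.findall(r"\d+", sample)          # runs of digits
words = re.findall(r"\w+", sample)           # runs of letters, digits, underscores
whitespace = re.findall(r"\s", sample)       # each whitespace character
non_digits = re.findall(r"\D+", "a1b22c")    # runs of non-digit characters

print(digits)      # ['66', '7', '10', '45']
print(words)       # ['Order', '66', 'shipped', 'to', 'bay_7', 'at', '10', '45', 'on', 'Monday']
print(whitespace)  # seven spaces and one tab
print(non_digits)  # ['a', 'b', 'c']
```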


### Regex Symbols

* The `?` matches zero or one of the preceding group.
* The `*` matches zero or more of the preceding group.
* The `+` matches one or more of the preceding group.
* The `{n}` matches exactly n of the preceding group.
* The `{n,}` matches n or more of the preceding group.
* The `{,m}` matches 0 to m of the preceding group.
* The `{n,m}` matches at least n and at most m of the preceding group.
* `{n,m}?`, `*?`, or `+?` performs a non-greedy match of the preceding group.
* `^spam` means the string must begin with spam.
* `spam$` means the string must end with spam.
* The `.` matches any character except a newline.
* `[abc]` matches any single character listed between the brackets (a, b, or c).
* `[^abc]` matches any single character that isn't listed between the brackets.

### Escape these characters with a backslash to match them as literal text

In [2]:
print(r"\.  \^  \$  \*  \+  \?  \{  \}  \[  \]  \\  \|  \(  \)")

\.  \^  \$  \*  \+  \?  \{  \}  \[  \]  \\  \|  \(  \)
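A small sketch of escaping in practice. The sample text is invented. `re.escape()` is the standard-library helper that adds the backslashes for you when the literal text is not known in advance.

```python
import re

# Hypothetical sample text
text = "Total (incl. tax): $4.99"

# Escaped by hand: \$ and \. match a literal dollar sign and a literal dot
price = re.search(r"\$\d+\.\d{2}", text)
print(price.group())  # $4.99

# re.escape() backslash-escapes every special character in a literal string
literal = "(incl. tax)"
escaped = re.escape(literal)
found = re.search(escaped, text)
print(found.group())  # (incl. tax)
```

Without the backslashes, `$` would mean "end of string" and `(`...`)` would open a group, so neither pattern would find the literal text.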


***********
### Creating Regex Objects
* Passing a string value representing your regular expression to `re.compile()` returns a Regex pattern object

In [3]:
import re

phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
# the phoneNumRegex variable contains a Regex object.

*************
### Matching Regex Objects

In [4]:
mo = phoneNumRegex.search('My number is 415-555-4242.')
print('Phone number found: ' + mo.group())

Phone number found: 415-555-4242
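`search()` returns `None` when nothing matches, so calling `group()` on the result directly raises an `AttributeError`. A minimal sketch of guarding against that; `find_phone` is a hypothetical helper name, not something from the notes above.

```python
import re

phone_regex = re.compile(r"\d\d\d-\d\d\d-\d\d\d\d")

def find_phone(text):
    """Return the first phone number in text, or None if there is none."""
    mo = phone_regex.search(text)
    # Only call group() when search() actually found something
    return mo.group() if mo is not None else None

print(find_phone("My number is 415-555-4242."))  # 415-555-4242
print(find_phone("No number here."))             # None
```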


**********
### Grouping with Parentheses

* Passing `0` or nothing to the `group()` method returns the entire matched text; passing `1`, `2`, and so on returns the text of that group.
* `search()` returns only the first match, not every match in the string.

In [14]:
phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My number is 415-555-4242. 848-404-0480')
print(mo.group(1))

print(mo.group(2))

print(mo.group(0))

print(mo.group())

415
555-4242
415-555-4242
415-555-4242


Finding all matches with `findall()`

In [19]:
import re

phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.findall('My number is 415-555-4242. 848-404-0480')

for match in mo:
    print("Group 1:", match[0])  # Area code
    print("Group 2:", match[1])  # Rest of the number

print('\n')

group1_list = [match[0] for match in mo]
print(group1_list)

print('\n')

group2_list = [match[1] for match in mo]
print(group2_list)


Group 1: 415
Group 2: 555-4242
Group 1: 848
Group 2: 404-0480


['415', '848']


['555-4242', '404-0480']


### To retrieve all the groups at once, use the `groups()` method

In [6]:
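# groups() needs a Match object from search(); the list returned by findall() has no groups()
mo = phoneNumRegex.search('My number is 415-555-4242. 848-404-0480')
print(mo.groups())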

areaCode, mainNumber = mo.groups()
print(areaCode)
print(mainNumber)

('415', '555-4242')
415
555-4242


************
### The Pipe `|`

* The regular expression r'Batman|Tina Fey' will match either 'Batman' or 'Tina Fey'.

`search()` returns only the first match:

In [7]:
heroRegex = re.compile(r'Batman|Tina Fey')
mo1 = heroRegex.search('Batman and Tina Fey')
mo1.group()

'Batman'

Using `findall()` returns all the matches:

In [11]:
heroRegex = re.compile(r'Batman|Tina Fey')
mo1 = heroRegex.findall('Batman and Tina Fey')
mo1


['Batman', 'Tina Fey']
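The pipe can also sit inside parentheses so that several alternatives share a common prefix. A short sketch, with an invented sample sentence:

```python
import re

# "Bat" is written once; the group lists the possible endings
bat_regex = re.compile(r"Bat(man|mobile|copter|bat)")
mo = bat_regex.search("Batmobile lost a wheel")

print(mo.group())   # Batmobile  (the whole match)
print(mo.group(1))  # mobile     (only the alternative inside the parentheses)
```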

*********
### Optional Matching with the Question Mark `?`

In [23]:
batRegex = re.compile(r'Bat(wo)?man')
mo1 = batRegex.search('The Adventures of Batman and Batwoman')

print(mo1.group())

Batman


In [24]:
batRegex = re.compile(r'Bat(?:wo)?man')  # non-capturing group
mo2 = batRegex.findall('The Adventures of Batman and Batwoman')
print(mo2)

['Batman', 'Batwoman']
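Why the non-capturing group `(?:...)` matters with `findall()`: when the pattern has a capturing group, `findall()` returns the group's text rather than the whole match. A side-by-side sketch (variable names are illustrative):

```python
import re

text = "The Adventures of Batman and Batwoman"

capturing = re.compile(r"Bat(wo)?man")
non_capturing = re.compile(r"Bat(?:wo)?man")

# With a capturing group, findall() returns what group 1 captured for each match
capturing_result = capturing.findall(text)
# With a non-capturing group, findall() returns the full text of each match
non_capturing_result = non_capturing.findall(text)

print(capturing_result)      # ['', 'wo']
print(non_capturing_result)  # ['Batman', 'Batwoman']
```

The empty string in the first result is the `Batman` match, where the optional `(wo)` group did not take part.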


*********
### Matching Zero or More with the `*` Star

In [25]:
batRegex = re.compile(r'Bat(wo)*man')
mo1 = batRegex.search('The Adventures of Batman')
print(mo1.group())


mo2 = batRegex.search('The Adventures of Batwoman')
print(mo2.group())


mo3 = batRegex.search('The Adventures of Batwowowowoman')
print(mo3.group())

Batman
Batwoman
Batwowowowoman


********
### Matching One or More with the `+` Plus

In [30]:
batRegex = re.compile(r'Bat(wo)+man')
mo1 = batRegex.search('The Adventures of Batwoman')
print(mo1.group())


mo2 = batRegex.search('The Adventures of Batwowowowoman')
print(mo2.group())


mo3 = batRegex.search('The Adventures of Batman')
mo3 is None

Batwoman
Batwowowowoman


True

**********
### Matching Specific Repetitions with Braces

In [33]:
haRegex = re.compile(r'(Ha){3}')
mo1 = haRegex.search('HaHaHa')
print(mo1.group())

mo2 = haRegex.search('Ha')
mo2 is None

HaHaHa


True
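The open-ended forms `{n,}` and `{,m}` work the same way. A minimal sketch; `^` and `$` are added here only so that each test checks the whole string.

```python
import re

# {2,} : two or more repetitions of the group
at_least_two = re.compile(r"^(Ha){2,}$")
# {,3} : zero to three repetitions of the group
at_most_three = re.compile(r"^(Ha){,3}$")

results = [
    bool(at_least_two.search("Ha")),         # one repetition: too few
    bool(at_least_two.search("HaHaHaHa")),   # four repetitions: fine
    bool(at_most_three.search("")),          # zero repetitions: fine
    bool(at_most_three.search("HaHaHaHa")),  # four repetitions: too many
]
print(results)  # [False, True, True, False]
```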

********
### Greedy and Non-greedy Matching

In [36]:
greedyHaRegex = re.compile(r'(Ha){3,5}')
mo1 = greedyHaRegex.search('HaHaHaHaHa')
print(mo1.group())

HaHaHaHaHa


In [37]:
nongreedyHaRegex = re.compile(r'(Ha){3,5}?')
mo2 = nongreedyHaRegex.search('HaHaHaHaHa')
print(mo2.group())

HaHaHa


***********
### The `findall()` Method

In [39]:
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d') # has no groups
mo = phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')
mo

['415-555-9999', '212-555-0000']

In [46]:
phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)') # has groups
mo1 = phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')
print(mo1)
print(list(mo1[0]))
print(list(mo1[1]))

[('415', '555-9999'), ('212', '555-0000')]
['415', '555-9999']
['212', '555-0000']


***********
### Making Your Own Character Classes using `[]` and `^`

In [55]:
digitRegex = re.compile(r'[0-9]+')  # one or more characters in the range 0-9
digitRegex.findall('My cell number is 848-404-0480 and my extension is 09')

['848', '404', '0480', '09']

In [53]:
nonDigitRegex = re.compile(r'[^0-9]')  # any single character that is not 0-9
print(nonDigitRegex.findall('My cell number is 848-404-0480 and my extension is 09'))

['M', 'y', ' ', 'c', 'e', 'l', 'l', ' ', 'n', 'u', 'm', 'b', 'e', 'r', ' ', 'i', 's', ' ', '-', '-', ' ', 'a', 'n', 'd', ' ', 'm', 'y', ' ', 'e', 'x', 't', 'e', 'n', 's', 'i', 'o', 'n', ' ', 'i', 's', ' ']
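Character classes are not limited to ranges. As a further sketch, here is a class that lists vowels explicitly, together with its negation (the sample text is made up):

```python
import re

vowel_regex = re.compile(r"[aeiouAEIOU]")       # any one listed vowel
non_vowel_regex = re.compile(r"[^aeiouAEIOU]")  # any one character that is not a listed vowel

text = "Hello World"
vowels = vowel_regex.findall(text)
non_vowels = non_vowel_regex.findall(text)

print(vowels)      # ['e', 'o', 'o']
print(non_vowels)  # ['H', 'l', 'l', ' ', 'W', 'r', 'l', 'd']
```

Note that the negated class also matches the space, because a space is not one of the listed characters.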


***********
### The Caret and Dollar Sign Characters
* Use the caret symbol `^` at the start of a regex to indicate that a match must occur at the beginning of the searched text.
* Use a dollar sign `$` at the end of the regex to indicate that the string must end with this regex pattern.
* Use `^` and `$` together to indicate that the entire string must match the regex; a match on only part of the string is not enough.

In [56]:
beginsWithHello = re.compile(r'^Hello')
beginsWithHello.search('Hello, world!')

<re.Match object; span=(0, 5), match='Hello'>

In [66]:
endsWithNumber = re.compile(r'\d$')
x = endsWithNumber.search('Your number is 42')
print(x)

<re.Match object; span=(16, 17), match='2'>


In [61]:
wholeStringIsNum = re.compile(r'^\d+$')
wholeStringIsNum.search('1234567890')

<re.Match object; span=(0, 10), match='1234567890'>
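The failing cases are worth seeing too. A sketch with invented inputs, which also shows `fullmatch()` as a possible alternative to writing `^` and `$` by hand:

```python
import re

whole_string_is_num = re.compile(r"^\d+$")

all_digits = whole_string_is_num.search("1234567890")
letters_inside = whole_string_is_num.search("12345xyz67890")
space_inside = whole_string_is_num.search("12 34567890")

print(all_digits is not None)  # True
print(letters_inside is None)  # True
print(space_inside is None)    # True

# $ also matches just before a trailing newline; fullmatch() does not allow that
trailing_newline = whole_string_is_num.search("123\n")
strict = re.fullmatch(r"\d+", "123\n")
print(trailing_newline is not None)  # True
print(strict is None)                # True
```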

**********
### The Wildcard Character `.`

In [67]:
nameRegex = re.compile(r'First Name: (.*) Last Name: (.*)')
mo = nameRegex.search('First Name: Al Last Name: Sweigart')
print(mo.group(1))
print(mo.group(2))

Al
Sweigart


In [69]:
# Non-Greedy way

nongreedyRegex = re.compile(r'<.*?>')
mo = nongreedyRegex.search('<To serve man> for dinner.>')
print(mo.group())

# Greedy Way

greedyRegex = re.compile(r'<.*>')
mo = greedyRegex.search('<To serve man> for dinner.>')
print(mo.group())


<To serve man>
<To serve man> for dinner.>


*********
### Matching Newlines with the Dot Character `.` using `re.DOTALL`

In [70]:
noNewlineRegex = re.compile('.*')
noNewlineRegex.search('Serve the public trust.\nProtect the innocent.\nUphold the law.').group()

'Serve the public trust.'

In [71]:
newlineRegex = re.compile('.*', re.DOTALL)
newlineRegex.search('Serve the public trust.\nProtect the innocent.\nUphold the law.').group()

'Serve the public trust.\nProtect the innocent.\nUphold the law.'

In [72]:
newlineRegex = re.compile('.*', re.DOTALL)
newlineRegex.search("""Serve the public trust.
                    
                    \nProtect the innocent.
                    
                    \nUphold the law.""").group()

'Serve the public trust.\n                    \n                    \nProtect the innocent.\n                    \n                    \nUphold the law.'

**********
### Case-Insensitive Matching `re.IGNORECASE` or `re.I`

In [None]:
robocop = re.compile(r'robocop', re.I)
robocop.search('RoboCop is part man, part machine, all cop.').group()

'RoboCop'
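Flags can be combined with the bitwise OR operator `|`. A short sketch that uses `re.IGNORECASE` and `re.DOTALL` together; the pattern and text are invented for illustration.

```python
import re

text = "RoboCop protects the innocent.\nHe upholds the LAW."

# re.IGNORECASE ignores case; re.DOTALL lets "." match newlines as well
both_flags = re.compile(r"robocop.*law", re.IGNORECASE | re.DOTALL)
only_ignorecase = re.compile(r"robocop.*law", re.IGNORECASE)

mo = both_flags.search(text)
print(repr(mo.group()))              # 'RoboCop protects the innocent.\nHe upholds the LAW'
print(only_ignorecase.search(text))  # None, because "." stops at the newline
```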

**********
### Substituting or replacing Strings with the `sub()` Method

In [75]:
namesRegex = re.compile(r'Agent \w+')
namesRegex.sub('CENSORED', 'Agent Alice gave the secret documents to Agent Bob.')

'CENSORED gave the secret documents to CENSORED.'

To use the matched text itself as part of the substitution, refer to a group by number in the replacement string:

In [78]:
agentNamesRegex = re.compile(r'Agent (\w)\w*')
agentNamesRegex.sub(r'\1****', 'Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.')

'A**** told C**** that E**** knew B**** was a double agent.'

In the first argument to `sub()`, you can type `\1`, `\2`, `\3`, and so on, to mean "insert the text of group 1, 2, 3, and so on, in the substitution."

In [80]:
agentNamesRegex = re.compile(r'Agent (\w{2})\w*')
agentNamesRegex.sub(r'\1****', 'Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.')

'Al**** told Ca**** that Ev**** knew Bo**** was a double agent.'
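A sketch with more than one group reference, plus the `\g<1>` form, which is useful when the replacement continues with a digit right after the group number. Names and sample strings are illustrative.

```python
import re

# Two groups: (1) first name, (2) last name; the replacement swaps them
name_regex = re.compile(r"(\w+) (\w+)")
swapped = name_regex.sub(r"\2, \1", "Al Sweigart")
print(swapped)  # Sweigart, Al

# \g<1> makes the group number unambiguous when digits follow it
agent_regex = re.compile(r"Agent (\w)\w*")
masked = agent_regex.sub(r"\g<1>0000", "Agent Alice met Agent Bob.")
print(masked)  # A0000 met B0000.
```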

*************
### Managing Complex Regexes using `re.VERBOSE`

In [None]:
phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?            # area code
    (\s|-|\.)?                    # separator
    \d{3}                         # first 3 digits
    (\s|-|\.)                     # separator
    \d{4}                         # last 4 digits
    (\s*(ext|x|ext\.)\s*\d{2,5})? # extension (dot escaped so it matches a literal ".")
    )''', re.VERBOSE)
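A possible way to use the verbose pattern, sketched with an invented sample sentence. The pattern is repeated so the block runs on its own, with the dot in `ext\.` escaped. Because the pattern contains groups, `findall()` returns tuples; the outermost parentheses are group 1, so element `0` of each tuple is the full number.

```python
import re

phone_regex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?             # area code
    (\s|-|\.)?                     # separator
    \d{3}                          # first 3 digits
    (\s|-|\.)                      # separator
    \d{4}                          # last 4 digits
    (\s*(ext|x|ext\.)\s*\d{2,5})?  # extension
    )''', re.VERBOSE)

# Hypothetical sample text
text = "Call 415-555-1011 or (415) 555-9999 ext. 123 tomorrow."

# Keep only the outermost group (the whole phone number) from each tuple
numbers = [groups[0] for groups in phone_regex.findall(text)]
print(numbers)  # ['415-555-1011', '(415) 555-9999 ext. 123']
```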