## Regular expressions in Python are available through the built-in "re" module:

In [1]:
import re

### Note: Use raw strings (r'...') for search patterns, so that Python does not interpret backslashes as string escapes before the regex engine sees them:

In [44]:
## Use raw strings for the search pattern
a = '\tHello'
b = r'\tHello'
print(a)
print(b)

	Hello
\tHello
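
The difference matters as soon as a pattern contains an escape that Python itself understands. The following is a minimal sketch (the sample text is an assumption chosen for illustration) of how '\b' silently turns into a backspace character when the r prefix is missing:

```python
import re

text = 'rain in Spain'  # sample text, chosen only for illustration

# In a normal string literal, Python converts '\b' to a backspace (\x08)
# before the regex engine sees it, so the pattern never matches.
print(re.findall('\bin', text))

# In a raw string, the backslash is kept and \b means "word boundary".
print(re.findall(r'\bin', text))
```

The first call prints an empty list, the second finds the standalone word 'in'.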


## Performing matches with compiled objects
**Once we have compiled our pattern, we can search for it in the text / string that we want to look up.**

**match():** Determine if the RE matches at the beginning of the string. Returns a Match object, or None if there is no match.

**search():** Scan through a string, looking for the first location where this RE matches. Returns a Match object, or None if there is no match.

**findall():** Find all substrings where the RE matches, and return them as a list.

**finditer():** Find all substrings where the RE matches, and return them as an iterator.

In [45]:
my_string = 'abc123ABC123abc'

## finditer() - Find all substrings where the RE matches, and return them as an iterator of Match objects.

In [46]:
pattern = re.compile(r'123')
matches = pattern.finditer(my_string)
for match in matches:
    print(match)

<re.Match object; span=(3, 6), match='123'>
<re.Match object; span=(9, 12), match='123'>


## findall() - Find all substrings where the RE matches, and return them as a list of strings.

In [47]:
pattern = re.compile(r'123')
matches = pattern.findall(my_string)

print(matches)

for match in matches:
    print(match)

['123', '123']
123
123


## match() - Determine if the RE matches at the beginning of the string.

In [48]:
pattern = re.compile(r'123')
# my_string starts with 'abc', not with '123', so match() returns None
match = pattern.match(my_string)
print(match)

None


## search() - Scan through a string, looking for any location where this RE matches.

In [49]:
match = pattern.search(my_string)
print(match)

<re.Match object; span=(3, 6), match='123'>
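
Because match() and search() return None when nothing is found, it is usually safer to check the result before calling methods on it. A small sketch of that check, reusing the string and pattern from the cells above:

```python
import re

pattern = re.compile(r'123')
my_string = 'abc123ABC123abc'

for name, method in [('match', pattern.match), ('search', pattern.search)]:
    result = method(my_string)
    if result is None:
        # Calling result.group() here would raise an AttributeError
        print(name, '-> no match')
    else:
        print(name, '->', result.group(), result.span())
```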


### Note: These functions can also be called directly on the re module, with the pattern as the first argument. The difference is small because the re module caches recently compiled patterns, but some people prefer explicitly precompiling and binding the pattern to a reusable variable.

In [53]:
matches = re.finditer(r'abc', my_string)

for match in matches:
    print(match)

<re.Match object; span=(0, 3), match='abc'>
<re.Match object; span=(12, 15), match='abc'>


## Methods on a Match object
**group():** Return the string matched by the RE

**start():** Return the starting position of the match

**end():** Return the ending position of the match

**span():** Return a tuple containing the (start, end) positions of the match

In [54]:
test_string = '123abc456789abc123ABC'

In [55]:
pattern = re.compile(r'abc')
matches = pattern.finditer(test_string)
for match in matches:
    print(match)
    print(match.span(), match.start(), match.end())
    print(match.group()) # returns the string

<re.Match object; span=(3, 6), match='abc'>
(3, 6) 3 6
abc
<re.Match object; span=(12, 15), match='abc'>
(12, 15) 12 15
abc
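
The positions returned by span() are ordinary string indices, so they can be used for slicing. A short sketch using the same test string (only the first match is inspected here):

```python
import re

test_string = '123abc456789abc123ABC'
first = re.search(r'abc', test_string)

start, end = first.span()

# Slicing with the span gives back exactly what group() returns
print(test_string[start:end] == first.group())

# Everything before and after the first match
print(test_string[:start], test_string[end:])
```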


# Metacharacters
**Metacharacters are characters with a special meaning.**

**All metacharacters:** . ^ $ * + ? { } [ ] \ | ( )
### Metacharacters need to be escaped (with \) if we actually want to search for the character itself.

. Any character (except the newline character) "he..o"

^ Starts with "^hello"

$ (dollar sign) Ends with "world$"

* (asterisk) Zero or more occurrences "aix*"

+ (plus) One or more occurrences "aix+"

{ } Exactly the specified number of occurrences "al{2}"

[] A set of characters "[a-m]"

\ Signals a special sequence (can also be used to escape special characters) "\d"

| Either or "falls|stays"

( ) Capture and group

In [56]:
test_string = 'python-engineer.com'
pattern = re.compile(r'\-')
matches = pattern.finditer(test_string)
for match in matches:
    print(match)

<re.Match object; span=(6, 7), match='-'>
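
Escaping matters most for characters such as the dot, which would otherwise match almost anything. The sketch below contrasts the escaped and unescaped dot on the same test string, and uses re.escape() for text that should be matched literally (the '1+1' sample is an assumption for illustration):

```python
import re

test_string = 'python-engineer.com'

# Unescaped, the dot is a metacharacter and matches every character
print(len(re.findall(r'.', test_string)))

# Escaped, it only matches a literal dot
print(re.findall(r'\.', test_string))

# re.escape() escapes metacharacters in text that should be matched literally
literal = '1+1'
print(re.search(re.escape(literal), 'so 1+1=2').group())
```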


## More Metacharacters / Special Sequences
#### A special sequence is a \ followed by one of the characters in the list below, and has a special meaning:

**\d : Matches any decimal digit; for ASCII text this is equivalent to the class [0-9].**

**\D : Matches any non-digit character; for ASCII text this is equivalent to the class [^0-9].**

**\s : Matches any whitespace character.**

**\S : Matches any non-whitespace character.**

**\w : Matches any alphanumeric (word) character; for ASCII text this is equivalent to the class [a-zA-Z0-9_].**

**\W : Matches any non-alphanumeric character; for ASCII text this is equivalent to the class [^a-zA-Z0-9_].**

**\b : Matches where the specified characters are at the beginning or at the end of a word, e.g. r"\bain", r"ain\b".**

**\B : Matches where the specified characters are present, but NOT at the beginning (or at the end) of a word, e.g. r"\Bain", r"ain\B".**

**\A : Matches if the specified characters are at the beginning of the string, e.g. r"\AThe".**

**\Z : Matches if the specified characters are at the end of the string, e.g. r"Spain\Z".**

**Note: For str patterns, \d, \w and \s also match non-ASCII Unicode digits, word characters and whitespace, unless the ASCII flag is used.**

In [7]:
test_string = 'hello 123_ heyho hohey'

In [8]:
# \d : Matches any decimal digit; this is equivalent to the class [0-9].
pattern = re.compile(r'\d')
matches = pattern.finditer(test_string)
for match in matches:
    print(match)

<re.Match object; span=(6, 7), match='1'>
<re.Match object; span=(7, 8), match='2'>
<re.Match object; span=(8, 9), match='3'>


In [9]:
# \D : Matches any non-digit character; this is equivalent to the class [^0-9].
pattern = re.compile(r'\D')
matches = pattern.finditer(test_string)
for match in matches:
    print(match)

<re.Match object; span=(0, 1), match='h'>
<re.Match object; span=(1, 2), match='e'>
<re.Match object; span=(2, 3), match='l'>
<re.Match object; span=(3, 4), match='l'>
<re.Match object; span=(4, 5), match='o'>
<re.Match object; span=(5, 6), match=' '>
<re.Match object; span=(9, 10), match='_'>
<re.Match object; span=(10, 11), match=' '>
<re.Match object; span=(11, 12), match='h'>
<re.Match object; span=(12, 13), match='e'>
<re.Match object; span=(13, 14), match='y'>
<re.Match object; span=(14, 15), match='h'>
<re.Match object; span=(15, 16), match='o'>
<re.Match object; span=(16, 17), match=' '>
<re.Match object; span=(17, 18), match='h'>
<re.Match object; span=(18, 19), match='o'>
<re.Match object; span=(19, 20), match='h'>
<re.Match object; span=(20, 21), match='e'>
<re.Match object; span=(21, 22), match='y'>


In [10]:
# \s : Matches any whitespace character
pattern = re.compile(r'\s')
matches = pattern.finditer(test_string)
for match in matches:
    print(match)

<re.Match object; span=(5, 6), match=' '>
<re.Match object; span=(10, 11), match=' '>
<re.Match object; span=(16, 17), match=' '>


In [11]:
# \S : Matches any non-whitespace character
pattern = re.compile(r'\S')
matches = pattern.finditer(test_string)
for match in matches:
    print(match)

<re.Match object; span=(0, 1), match='h'>
<re.Match object; span=(1, 2), match='e'>
<re.Match object; span=(2, 3), match='l'>
<re.Match object; span=(3, 4), match='l'>
<re.Match object; span=(4, 5), match='o'>
<re.Match object; span=(6, 7), match='1'>
<re.Match object; span=(7, 8), match='2'>
<re.Match object; span=(8, 9), match='3'>
<re.Match object; span=(9, 10), match='_'>
<re.Match object; span=(11, 12), match='h'>
<re.Match object; span=(12, 13), match='e'>
<re.Match object; span=(13, 14), match='y'>
<re.Match object; span=(14, 15), match='h'>
<re.Match object; span=(15, 16), match='o'>
<re.Match object; span=(17, 18), match='h'>
<re.Match object; span=(18, 19), match='o'>
<re.Match object; span=(19, 20), match='h'>
<re.Match object; span=(20, 21), match='e'>
<re.Match object; span=(21, 22), match='y'>


In [12]:
# \w : Matches any alphanumeric (word) character; this is equivalent to the class [a-zA-Z0-9_]
pattern = re.compile(r'\w')
matches = pattern.finditer(test_string)
for match in matches:
    print(match)

<re.Match object; span=(0, 1), match='h'>
<re.Match object; span=(1, 2), match='e'>
<re.Match object; span=(2, 3), match='l'>
<re.Match object; span=(3, 4), match='l'>
<re.Match object; span=(4, 5), match='o'>
<re.Match object; span=(6, 7), match='1'>
<re.Match object; span=(7, 8), match='2'>
<re.Match object; span=(8, 9), match='3'>
<re.Match object; span=(9, 10), match='_'>
<re.Match object; span=(11, 12), match='h'>
<re.Match object; span=(12, 13), match='e'>
<re.Match object; span=(13, 14), match='y'>
<re.Match object; span=(14, 15), match='h'>
<re.Match object; span=(15, 16), match='o'>
<re.Match object; span=(17, 18), match='h'>
<re.Match object; span=(18, 19), match='o'>
<re.Match object; span=(19, 20), match='h'>
<re.Match object; span=(20, 21), match='e'>
<re.Match object; span=(21, 22), match='y'>


In [14]:
# \W : Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_]
pattern = re.compile(r'\W')
matches = pattern.finditer(test_string)
for match in matches:
    print(match)

<re.Match object; span=(5, 6), match=' '>
<re.Match object; span=(10, 11), match=' '>
<re.Match object; span=(16, 17), match=' '>


In [18]:
# \b Returns a match where the specified characters are at the beginning or at the end of a word r"\bain" r"ain\b"
pattern = re.compile(r'\bhey')
matches = pattern.finditer('heyho hohey')
for match in matches:
    print(match)

<re.Match object; span=(0, 3), match='hey'>


In [19]:
pattern = re.compile(r'hey\b')
matches = pattern.finditer('heyho hohey')
for match in matches:
    print(match)

<re.Match object; span=(8, 11), match='hey'>


In [25]:
# \B Returns a match where the specified characters are present, but NOT at the beginning (or at the end) of a word r"\Bain" r"ain\B"
pattern = re.compile(r'\Bhey')
# 'ho-hey' or 'ho\nhey' would NOT match: '-' and '\n' are non-word characters,
# so 'hey' would start at a word boundary there
matches = pattern.finditer('hohey heyhey heyho')
for match in matches:
    print(match)

<re.Match object; span=(2, 5), match='hey'>
<re.Match object; span=(9, 12), match='hey'>


In [24]:
pattern = re.compile(r'hey\B')
# only a 'hey' that is directly followed by another word character matches
matches = pattern.finditer('hohey heyhey heyho')
for match in matches:
    print(match)

<re.Match object; span=(6, 9), match='hey'>
<re.Match object; span=(13, 16), match='hey'>


In [26]:
# \A Returns a match if the specified characters are at the beginning of the string "\AThe"
pattern = re.compile(r'\Ahello')
matches = pattern.finditer(test_string)
for match in matches:
    print(match)

<re.Match object; span=(0, 5), match='hello'>


In [31]:
# \Z Returns a match if the specified characters are at the end of the string "Spain\Z"
pattern = re.compile(r'hey\Z')
matches = pattern.finditer(test_string)
for match in matches:
    print(match)

<re.Match object; span=(19, 22), match='hey'>
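
\Z is stricter than the $ metacharacter when the string ends with a newline. A minimal sketch of the difference (the sample string with a trailing newline is an assumption for illustration):

```python
import re

line = 'hohey\n'  # sample string with a trailing newline

# $ also matches just before a newline at the very end of the string
print(re.search(r'hey$', line))

# \Z only matches at the very end of the string, so the newline prevents a match
print(re.search(r'hey\Z', line))
```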


## Regex: Sets

### A set is a group of characters inside a pair of square brackets [] with a special meaning. Multiple conditions can be appended back to back, e.g. [aA-Z].

A ^ (caret) at the start of a set negates the expression.

A - (dash) in a set specifies a range if it is between two characters; otherwise it matches the dash itself.

Examples:
- [arn] Returns a match where one of the specified characters (a, r, or n) is present
- [a-n] Returns a match for any lower case character, alphabetically between a and n
- [^arn] Returns a match for any character EXCEPT a, r, and n
- [0123] Returns a match where any of the specified digits (0, 1, 2, or 3) is present
- [0-9] Returns a match for any digit between 0 and 9
- [0-5][0-9] Returns a match for any two-digit number from 00 to 59
- [a-zA-Z] Returns a match for any character alphabetically between a and z, lower case OR upper case

In [2]:
test_string = 'hello 123_'

In [3]:
pattern = re.compile(r'[a-z]')
matches = pattern.finditer(test_string)
for match in matches:
    print(match)

<re.Match object; span=(0, 1), match='h'>
<re.Match object; span=(1, 2), match='e'>
<re.Match object; span=(2, 3), match='l'>
<re.Match object; span=(3, 4), match='l'>
<re.Match object; span=(4, 5), match='o'>
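
A few more set behaviours can be checked quickly with findall(). This is only a sketch; the sample strings are assumptions chosen so that each rule is visible:

```python
import re

# Two sets back to back: first digit 0-5, second digit 0-9
print(re.findall(r'\b[0-5][0-9]\b', '07 59 60 5 123'))

# A caret at the start of the set negates it: anything except a-z and space
print(re.findall(r'[^a-z ]', 'hello 123_'))

# A dash that is first (or last) in the set is a literal dash
print(re.findall(r'[-_]', 'a-b_c'))
```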


In [17]:
dates = '''
01.04.2020

2020.04.01

2020-04-01
2020-05-23
2020-06-11
2020-07-11
2020-08-11

2020/04/02

2020_04_04
2020_04_04
'''

In [18]:
print('All dates with a character in between')
pattern = re.compile(r'\d\d\d\d.\d\d.\d\d')
matches = pattern.finditer(dates)

for match in matches:
    print(match)

All dates with a character in between
<re.Match object; span=(13, 23), match='2020.04.01'>
<re.Match object; span=(25, 35), match='2020-04-01'>
<re.Match object; span=(36, 46), match='2020-05-23'>
<re.Match object; span=(47, 57), match='2020-06-11'>
<re.Match object; span=(58, 68), match='2020-07-11'>
<re.Match object; span=(69, 79), match='2020-08-11'>
<re.Match object; span=(81, 91), match='2020/04/02'>
<re.Match object; span=(93, 103), match='2020_04_04'>
<re.Match object; span=(104, 114), match='2020_04_04'>


In [19]:
print('Only dates with - or . in between')
pattern = re.compile(r'\d\d\d\d[-.]\d\d[-.]\d\d')
matches = pattern.finditer(dates)

for match in matches:
    print(match)

Only dates with - or . in between
<re.Match object; span=(13, 23), match='2020.04.01'>
<re.Match object; span=(25, 35), match='2020-04-01'>
<re.Match object; span=(36, 46), match='2020-05-23'>
<re.Match object; span=(47, 57), match='2020-06-11'>
<re.Match object; span=(58, 68), match='2020-07-11'>
<re.Match object; span=(69, 79), match='2020-08-11'>


In [20]:
print('Only dates with - or . in between in May or June')
pattern = re.compile(r'\d\d\d\d[-.]0[56][-.]\d\d')
matches = pattern.finditer(dates)

for match in matches:
    print(match)

Only dates with - or . in between in May or June
<re.Match object; span=(36, 46), match='2020-05-23'>
<re.Match object; span=(47, 57), match='2020-06-11'>


In [23]:
print('Only dates with - or . in between in May, June, July')
pattern = re.compile(r'\d\d\d\d[-.]0[5-7][-.]\d\d')
matches = pattern.finditer(dates)

for match in matches:
    print(match)

Only dates with - or . in between in May, June, July
<re.Match object; span=(36, 46), match='2020-05-23'>
<re.Match object; span=(47, 57), match='2020-06-11'>
<re.Match object; span=(58, 68), match='2020-07-11'>


## Quantifiers

- asterisk (*) : 0 or more
- plus (+) : 1 or more
- ? : 0 or 1, used when a character can be optional
- {4} : exact number
- {4,6} : range of numbers (min, max)

In [24]:
my_string = 'hello_123'

In [25]:
pattern = re.compile(r'\d*')
matches = pattern.finditer(my_string)
for match in matches:
    print(match)

<re.Match object; span=(0, 0), match=''>
<re.Match object; span=(1, 1), match=''>
<re.Match object; span=(2, 2), match=''>
<re.Match object; span=(3, 3), match=''>
<re.Match object; span=(4, 4), match=''>
<re.Match object; span=(5, 5), match=''>
<re.Match object; span=(6, 9), match='123'>
<re.Match object; span=(9, 9), match=''>


In [26]:
pattern = re.compile(r'\d+')
matches = pattern.finditer(my_string)
for match in matches:
    print(match)

<re.Match object; span=(6, 9), match='123'>


In [35]:
my_string = 'hello_1_2-3'
pattern = re.compile(r'_?\d')
matches = pattern.finditer(my_string)
for match in matches:
    print(match)

<re.Match object; span=(5, 7), match='_1'>
<re.Match object; span=(7, 9), match='_2'>
<re.Match object; span=(10, 11), match='3'>


In [31]:
my_string = '2020-04-01'
pattern = re.compile(r'\d{4}') # or if you need a range r'\d{3,5}'
matches = pattern.finditer(my_string)
for match in matches:
    print(match)

<re.Match object; span=(0, 4), match='2020'>


In [32]:
pattern = re.compile(r'\d{4}.\d{2}.\d{2}')
matches = pattern.finditer(dates)
for match in matches:
    print(match)

<re.Match object; span=(13, 23), match='2020.04.01'>
<re.Match object; span=(25, 35), match='2020-04-01'>
<re.Match object; span=(36, 46), match='2020-05-23'>
<re.Match object; span=(47, 57), match='2020-06-11'>
<re.Match object; span=(58, 68), match='2020-07-11'>
<re.Match object; span=(69, 79), match='2020-08-11'>
<re.Match object; span=(81, 91), match='2020/04/02'>
<re.Match object; span=(93, 103), match='2020_04_04'>
<re.Match object; span=(104, 114), match='2020_04_04'>


In [33]:
pattern = re.compile(r'\d+.\d+.\d+')
matches = pattern.finditer(dates)
for match in matches:
    print(match)

<re.Match object; span=(1, 11), match='01.04.2020'>
<re.Match object; span=(13, 23), match='2020.04.01'>
<re.Match object; span=(25, 35), match='2020-04-01'>
<re.Match object; span=(36, 46), match='2020-05-23'>
<re.Match object; span=(47, 57), match='2020-06-11'>
<re.Match object; span=(58, 68), match='2020-07-11'>
<re.Match object; span=(69, 79), match='2020-08-11'>
<re.Match object; span=(81, 91), match='2020/04/02'>
<re.Match object; span=(93, 103), match='2020_04_04'>
<re.Match object; span=(104, 114), match='2020_04_04'>
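
The {min,max} form is only mentioned in a comment above, so here is a small sketch of how it behaves. The sample data is an assumption; note that the quantifier is greedy and will also match inside longer numbers unless word boundaries are added:

```python
import re

numbers = '1 22 333 4444 55555'  # sample data, chosen only for illustration

# {2,3} is greedy: it takes 3 digits when it can, otherwise 2
print(re.findall(r'\d{2,3}', numbers))

# Adding word boundaries keeps only numbers that are 2 or 3 digits long
print(re.findall(r'\b\d{2,3}\b', numbers))
```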


## Conditions
### Use | for an either-or condition.

In [5]:
my_string = """
Mr Simpson
Mrs Simpson
Mr. Brown
Ms Smith
Mr. T
"""

In [11]:
pattern = re.compile(r'Mr\.?\s\w+')
matches = pattern.finditer(my_string)
for match in matches:
    print(match)

<re.Match object; span=(1, 11), match='Mr Simpson'>
<re.Match object; span=(24, 33), match='Mr. Brown'>
<re.Match object; span=(43, 48), match='Mr. T'>


In [12]:
pattern = re.compile(r'(Mr|Ms|Mrs)\.?\s\w+')
matches = pattern.finditer(my_string)
for match in matches:
    print(match)

<re.Match object; span=(1, 11), match='Mr Simpson'>
<re.Match object; span=(12, 23), match='Mrs Simpson'>
<re.Match object; span=(24, 33), match='Mr. Brown'>
<re.Match object; span=(34, 42), match='Ms Smith'>
<re.Match object; span=(43, 48), match='Mr. T'>
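
The alternatives are tried from left to right, and the first one that lets the overall pattern succeed wins. In the pattern above, 'Mrs Simpson' is still found because the rest of the pattern fails after 'Mr', which makes the engine try the other alternatives. Without anything following the alternation, the order becomes visible, as this sketch shows:

```python
import re

# 'Mr' is tried first and succeeds, so 'Mrs' is never attempted
print(re.findall(r'Mr|Mrs', 'Mrs Simpson'))

# Putting the longer alternative first changes the result
print(re.findall(r'Mrs|Mr', 'Mrs Simpson'))
```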


## Grouping
### ( ) is used to capture parts of a match as groups, which can then be retrieved individually.

In [17]:
emails = """
Mr Simpson
Mrs Simpson
pythonengineer@gmail.com
Python-engineer@gmx.de
python-engineer123@my-domain.org
"""

In [18]:
pattern = re.compile(r'[a-zA-Z0-9-]+@[a-zA-Z-]+\.[a-zA-Z]+')
matches = pattern.finditer(emails)
for match in matches:
    print(match)

<re.Match object; span=(24, 48), match='pythonengineer@gmail.com'>
<re.Match object; span=(49, 71), match='Python-engineer@gmx.de'>
<re.Match object; span=(72, 104), match='python-engineer123@my-domain.org'>


In [27]:
# capture the three parts of the address with groups ()
pattern = re.compile(r'([a-zA-Z0-9-]+)@([a-zA-Z-]+)\.([a-zA-Z]+)')
matches = pattern.finditer(emails)
for match in matches:
    print(match)
    # group(0) is the whole match; group(1), group(2) and group(3) are the captured groups
    print(match.group(1))

<re.Match object; span=(24, 48), match='pythonengineer@gmail.com'>
pythonengineer
<re.Match object; span=(49, 71), match='Python-engineer@gmx.de'>
Python-engineer
<re.Match object; span=(72, 104), match='python-engineer123@my-domain.org'>
python-engineer123
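
Besides group(n), a Match object offers a few other ways to read the captured groups. The sketch below uses the same email pattern on a single address; the group names user, domain and tld in the second pattern are hypothetical labels introduced only for this example:

```python
import re

pattern = re.compile(r'([a-zA-Z0-9-]+)@([a-zA-Z-]+)\.([a-zA-Z]+)')
match = pattern.search('Python-engineer@gmx.de')

print(match.group(0))     # the whole match
print(match.groups())     # all captured groups as a tuple
print(match.group(2, 3))  # several groups at once

# Optional: groups can be named with (?P<name>...) and read by name
named = re.compile(
    r'(?P<user>[a-zA-Z0-9-]+)@(?P<domain>[a-zA-Z-]+)\.(?P<tld>[a-zA-Z]+)'
)
print(named.search('Python-engineer@gmx.de').groupdict())
```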


## Modifying strings
**split():** Split the string into a list, splitting it wherever the RE matches

**sub():** Find all substrings where the RE matches, and replace them with a different string

In [2]:
my_string = 'abc123ABCDEF123abc'

In [6]:
#split() method
pattern = re.compile(r'123')
matches = pattern.split(my_string)
print(matches)

['abc', 'ABCDEF', 'abc']


In [7]:
#sub() method
my_string = "hello world, you are the best world"

In [8]:
pattern = re.compile(r'world')
subbed_string = pattern.sub(r'planet', my_string)
print(subbed_string)

hello planet, you are the best planet
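
sub() replaces every occurrence by default. As a small sketch of two related options, the count argument limits the number of replacements and subn() additionally reports how many were made:

```python
import re

my_string = 'hello world, you are the best world'
pattern = re.compile(r'world')

# count=1 replaces only the first occurrence
print(pattern.sub('planet', my_string, count=1))

# subn() returns the new string together with the number of replacements
print(pattern.subn('planet', my_string))
```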


In [17]:
urls = """
http://python-engineer.com
https://www.python-engineer.org
http://www.pyeng.net
"""

In [31]:
pattern = re.compile(r"https?://(www\.)?([a-zA-Z-]+)(\.[a-zA-Z]+)")
matches = pattern.finditer(urls)
for match in matches:
    print("match:", match)
    print("match.group:", match.group(1))

match: <re.Match object; span=(1, 27), match='http://python-engineer.com'>
match.group: None
match: <re.Match object; span=(28, 59), match='https://www.python-engineer.org'>
match.group: www.
match: <re.Match object; span=(60, 80), match='http://www.pyeng.net'>
match.group: www.


In [33]:
subbed_urls = pattern.sub(r"\3", urls) # \3 refers to the 3rd group; each matched URL is replaced by it
print(subbed_urls)


.com
.org
.net
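
Several group references can be combined in one replacement string. The following sketch reuses the URL pattern from above to keep only the domain name and the top-level domain:

```python
import re

urls = """
http://python-engineer.com
https://www.python-engineer.org
http://www.pyeng.net
"""

pattern = re.compile(r"https?://(www\.)?([a-zA-Z-]+)(\.[a-zA-Z]+)")

# Keep group 2 (domain name) and group 3 (top-level domain), drop the rest
print(pattern.sub(r"\2\3", urls))

# \g<2> is an equivalent, unambiguous way to write \2
print(pattern.sub(r"\g<2>", urls))
```

The \g<n> form is handy when the reference is directly followed by a digit, where \2 followed by 0 would be read as group 20.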



## Compilation Flags
**ASCII, A :** Makes several escapes like \w, \b, \s and \d match only on ASCII characters with the respective property.

**DOTALL, S :** Make . match any character, including newlines.

**IGNORECASE, I :** Do case-insensitive matches.

**LOCALE, L :** Do a locale-aware match.

**MULTILINE, M :** Multi-line matching, affecting ^ and $.

**VERBOSE, X (for ‘extended’) :** Enable verbose REs, which can be organized more cleanly and understandably.

In [34]:
my_string = "Hello World"

In [39]:
pattern = re.compile(r"world", re.I)
matches = pattern.finditer(my_string)
for match in matches:
    print(match)

<re.Match object; span=(6, 11), match='World'>
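
Two of the other flags from the list can be illustrated in the same way. This is a sketch with sample strings chosen for the purpose; the non-ASCII text is written with escapes so the snippet does not depend on the file encoding:

```python
import re

# DOTALL: by default . does not match a newline
print(re.findall(r'a.b', 'a\nb'))
print(re.findall(r'a.b', 'a\nb', re.DOTALL))

# ASCII: by default \w also matches non-ASCII letters in str patterns
text = 'na\u00efve caf\u00e9'  # 'naïve café', written with escapes
print(re.findall(r'\w+', text))
print(re.findall(r'\w+', text, re.ASCII))
```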


In [47]:
my_string = '''
hello
cool
Hello.
'''

In [50]:
# line starts with ...
# without the M flag, ^ only matches at the very start of the string, so there would be no match here
pattern = re.compile(r'^[a-zA-Z]', re.MULTILINE)
matches = pattern.finditer(my_string)
for match in matches:
    print(match)

<re.Match object; span=(1, 2), match='h'>
<re.Match object; span=(7, 8), match='c'>
<re.Match object; span=(12, 13), match='H'>
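
Flags can be combined with the | operator. As a sketch of the VERBOSE flag, the email pattern from the grouping section can be spread over several lines with comments; whitespace and # comments inside the pattern are then ignored by the regex engine:

```python
import re

email_pattern = re.compile(r"""
    ([a-zA-Z0-9-]+)   # group 1: part before the @
    @
    ([a-zA-Z-]+)      # group 2: domain name
    \.
    ([a-zA-Z]+)       # group 3: top-level domain
""", re.VERBOSE | re.IGNORECASE)

match = email_pattern.search('Contact: Python-engineer@gmx.de')
print(match.groups())
```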
