Title: Regular Expressions in Python - FULL COURSE (1 HOUR) - Programming Tutorial

Source: Patrick Loeber YouTube Channel

Author (Original Tutorial): Patrick Loeber

URL: https://www.youtube.com/watch?v=AEE9ecgLgdQ&list=PLJjO7syMbvVMYdDYNLvtjS3BQkRoBnifP&index=8

Date of Implementation: 2025-01-05

Description: 

In this Python Tutorial, we will be learning about Regular Expressions (Regex) in Python. Regular expressions are a powerful language for matching text patterns. Possible pattern examples for searches are e-mail addresses or domain names. This video covers all you need to know to understand any regex expression! I go over all important concepts and mix examples in between.

Written tutorial: https://www.python-engineer.com/posts/regular-expressions/

0) re Module 
1) Methods to search for matches 
2) Methods on a match object 
3) Meta Characters 
4) Special Sequences 
5) Sets
6) Quantifier 
7) Conditions 
8) Grouping 
9) Modification 
10) Compilation Flags 
    

# 0) re module

In [1]:
import re

# 1) Methods to search for matches
- finditer()
- match()
- search()
- findall()

In [2]:
test_string1 = '123abc456789abc1011ABC'
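# Two styles are possible: with or without a compiled pattern object.

# Style 1: explicitly compile the pattern, then call methods on the pattern object
#pattern = re.compile(r"abc") # case sensitive
#matches = pattern.finditer(test_string1) # returns an iterator of match objects

# Style 2: pass the pattern directly to the module-level function
# the r prefix makes it a raw string, so backslashes are not treated as escapes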
matches = re.finditer(r"abc", test_string1)

for match in matches:
    print(match)

<re.Match object; span=(3, 6), match='abc'>
<re.Match object; span=(12, 15), match='abc'>


In [3]:
matches = re.findall(r"abc", test_string1)
for match in matches:
    print(match)

abc
abc


In [4]:
# match() only checks the beginning of the string; it returns None if the pattern does not match there
match = re.match(r"abc", test_string1)
print(match)
match = re.match(r"123", test_string1)
print(match)

None
<re.Match object; span=(0, 3), match='123'>


In [5]:
# Search looks for any location where the re matches, only first match
match = re.search(r"abc", test_string1)

print(match)

<re.Match object; span=(3, 6), match='abc'>
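Since `match()` and `search()` return `None` when nothing matches, it can help to check the result before calling methods on it. A minimal sketch; the helper name `first_span` is hypothetical and not part of the tutorial:

```python
import re

test_string = '123abc456789abc1011ABC'

def first_span(pattern, text):
    """Return the span of the first match anywhere in text, or None."""
    m = re.search(pattern, text)
    # Guard against None: calling m.span() on None would raise AttributeError
    return m.span() if m is not None else None

print(first_span(r"abc", test_string))  # (3, 6)
print(first_span(r"xyz", test_string))  # None
```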


# 2) Methods on a match object
- group
- start
- end
- span

In [6]:
# Continue with finditer
matches = re.finditer(r"abc", test_string1)

# group, start, end, span
for match in matches:
    print("Just Object:", match)
    print("Span:", match.span())
    print("Start:", match.start())
    print("End:", match.end())
    print("Group:", match.group(0)) # group() takes a group number; 0 (the default) is the whole match
    break


Just Object: <re.Match object; span=(3, 6), match='abc'>
Span: (3, 6)
Start: 3
End: 6
Group: abc


# 3) Meta characters

All meta characters:

## . ^ $ * + ? { } [ ] \ | ( )

. Any character (except newline character)

^ Starts with "^hello"

\$ Ends with "world$"

----

### Quantifiers:

\* Zero or more occurrences "aix*"

\+ One or more occurrences "aix+"

{ } Exactly the specified number of occurrences "al{2}"

---

[ ] A set of characters "[a-m]"

\ Special sequence (or escape special characters) "\d"

| Either or "falls|stays"

( ) Capture and group

In [7]:
matches = re.finditer(r".", test_string1)

for match in matches:
    print(match.group())

1
2
3
a
b
c
4
5
6
7
8
9
a
b
c
1
0
1
1
A
B
C


In [8]:
test_string1 = '123abc456789abc1011ABC.'
matches = re.finditer(r"\.", test_string1) # escape dot

for match in matches:
    print(match.group())

.


In [9]:
matches = re.finditer(r"^123", test_string1) 

for match in matches:
    print(match)

<re.Match object; span=(0, 3), match='123'>


In [10]:
matches = re.finditer(r"^abc", test_string1) 

for match in matches:
    print(match)

In [11]:
test_string1 = '123abc456789abc1011ABC'
matches = re.finditer(r"ABC$", test_string1) 

for match in matches:
    print(match)

<re.Match object; span=(19, 22), match='ABC'>
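The quoted example patterns from the meta character list ("aix*", "aix+", "al{2}", "falls|stays") can be tried out as follows. The sample text is hypothetical and was chosen only to exercise these patterns:

```python
import re

# Hypothetical sample text, not from the original tutorial
text = "ai aix aixxx all alll falls stays"

star = re.findall(r"aix*", text)        # "ai" followed by zero or more "x"
plus = re.findall(r"aix+", text)        # "ai" followed by one or more "x"
exact = re.findall(r"al{2}", text)      # "a" followed by exactly two "l"
either = re.findall(r"falls|stays", text)  # either alternative

print(star)    # ['ai', 'aix', 'aixxx']
print(plus)    # ['aix', 'aixxx']
print(exact)   # ['all', 'all', 'all']
print(either)  # ['falls', 'stays']
```

Note that "al{2}" also matches inside "alll" and "falls", because the pattern is not anchored to word boundaries.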


# 4) More special sequences

\d: Matches any decimal digit; [0-9] for ASCII input

\D: Matches any non-digit character;

\s: Matches any whitespace character; (space " ", tab "\t", newline "\n")

\S: Matches any non-whitespace character;

\w: Matches any alphanumeric (word) character; [a-zA-Z0-9_] for ASCII input

\W: Matches any non-alphanumerical character;

\b: Matches where specified characters are at the beginning or the end of a word; r"\bain" r"ain\b"

\B: Matches where specified characters are present, but NOT at the beginning (or the end) of a word; r"\Bain" r"ain\B"

In [12]:
test_string2 = 'hello 123_ heyho\nhohey'

matches = re.finditer(r"\d", test_string2)
for match in matches:
    print(match)

<re.Match object; span=(6, 7), match='1'>
<re.Match object; span=(7, 8), match='2'>
<re.Match object; span=(8, 9), match='3'>


In [13]:
matches = re.finditer(r"\s", test_string2)
for match in matches:
    print(match)

<re.Match object; span=(5, 6), match=' '>
<re.Match object; span=(10, 11), match=' '>
<re.Match object; span=(16, 17), match='\n'>


In [14]:
matches = re.finditer(r"\w", test_string2)
for match in matches:
    print(match)

<re.Match object; span=(0, 1), match='h'>
<re.Match object; span=(1, 2), match='e'>
<re.Match object; span=(2, 3), match='l'>
<re.Match object; span=(3, 4), match='l'>
<re.Match object; span=(4, 5), match='o'>
<re.Match object; span=(6, 7), match='1'>
<re.Match object; span=(7, 8), match='2'>
<re.Match object; span=(8, 9), match='3'>
<re.Match object; span=(9, 10), match='_'>
<re.Match object; span=(11, 12), match='h'>
<re.Match object; span=(12, 13), match='e'>
<re.Match object; span=(13, 14), match='y'>
<re.Match object; span=(14, 15), match='h'>
<re.Match object; span=(15, 16), match='o'>
<re.Match object; span=(17, 18), match='h'>
<re.Match object; span=(18, 19), match='o'>
<re.Match object; span=(19, 20), match='h'>
<re.Match object; span=(20, 21), match='e'>
<re.Match object; span=(21, 22), match='y'>


In [15]:
# \b matches at a word boundary: between a word character (\w) and a non-word character,
# or at the start/end of the string
matches = re.finditer(r"\bhey", test_string2)
for match in matches:
    print(match)

<re.Match object; span=(11, 14), match='hey'>


In [16]:
matches = re.finditer(r"hey\b", test_string2)
for match in matches:
    print(match)

<re.Match object; span=(19, 22), match='hey'>


In [17]:
matches = re.finditer(r"el\B", test_string2)
for match in matches:
    print(match)

<re.Match object; span=(1, 3), match='el'>
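The negated sequences \D, \S and \W, and the "ain" boundary patterns from the list in this section, are not run above. A short sketch with hypothetical sample strings:

```python
import re

# Hypothetical sample strings, not from the original tutorial
sample = "a1 b_2!"
words = "ain rain raining"

non_digits = re.findall(r"\D", sample)  # everything except 0-9
non_space = re.findall(r"\S", sample)   # everything except whitespace
non_word = re.findall(r"\W", sample)    # everything except letters, digits, underscore

print(non_digits)  # ['a', ' ', 'b', '_', '!']
print(non_space)   # ['a', '1', 'b', '_', '2', '!']
print(non_word)    # [' ', '!']

# Spans of "ain" depending on where the boundary assertion is placed
starts_word = [m.span() for m in re.finditer(r"\bain", words)]
ends_word = [m.span() for m in re.finditer(r"ain\b", words)]
not_at_start = [m.span() for m in re.finditer(r"\Bain", words)]
not_at_end = [m.span() for m in re.finditer(r"ain\B", words)]

print(starts_word)   # [(0, 3)]
print(ends_word)     # [(0, 3), (5, 8)]
print(not_at_start)  # [(5, 8), (10, 13)]
print(not_at_end)    # [(10, 13)]
```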


# 5) Sets

In [18]:
test_string3 = 'hello 123-_'

# a set is a pattern between square brackets [ ]
# in a set we can define multiple characters to search for
# [lo] matches any single character that is in the set
matches = re.finditer(r"[lo]", test_string3)
for match in matches:
    print(match)

<re.Match object; span=(2, 3), match='l'>
<re.Match object; span=(3, 4), match='l'>
<re.Match object; span=(4, 5), match='o'>


In [19]:
# specify ranges (very common)
matches = re.finditer(r"[a-z]", test_string3)
for match in matches:
    print(match)

<re.Match object; span=(0, 1), match='h'>
<re.Match object; span=(1, 2), match='e'>
<re.Match object; span=(2, 3), match='l'>
<re.Match object; span=(3, 4), match='l'>
<re.Match object; span=(4, 5), match='o'>


In [20]:
# look for digits in range
matches = re.finditer(r"[2-3]", test_string3)
for match in matches:
    print(match)

<re.Match object; span=(7, 8), match='2'>
<re.Match object; span=(8, 9), match='3'>


In [21]:
# also match a literal dash by placing it at the end of the set, after the range
matches = re.finditer(r"[0-9-]", test_string3)
for match in matches:
    print(match)

<re.Match object; span=(6, 7), match='1'>
<re.Match object; span=(7, 8), match='2'>
<re.Match object; span=(8, 9), match='3'>
<re.Match object; span=(9, 10), match='-'>


In [22]:
test_string4 = 'helloHELLO 123-_'
# lower case, upper case, and digit ranges combined in one set
matches = re.finditer(r"[a-zA-Z0-9]", test_string4)
for match in matches:
    print(match)

<re.Match object; span=(0, 1), match='h'>
<re.Match object; span=(1, 2), match='e'>
<re.Match object; span=(2, 3), match='l'>
<re.Match object; span=(3, 4), match='l'>
<re.Match object; span=(4, 5), match='o'>
<re.Match object; span=(5, 6), match='H'>
<re.Match object; span=(6, 7), match='E'>
<re.Match object; span=(7, 8), match='L'>
<re.Match object; span=(8, 9), match='L'>
<re.Match object; span=(9, 10), match='O'>
<re.Match object; span=(11, 12), match='1'>
<re.Match object; span=(12, 13), match='2'>
<re.Match object; span=(13, 14), match='3'>


# 6) Quantifier
- \* : 0 or more
- \+ : 1 or more
- ? : 0 or 1, -> optional character
- {4} : exact number
- {4,6} : range numbers (min, max)

In [23]:
test_string4 = 'hello_123'
# one or more numerical characters, giving a single combined match of 123
matches = re.finditer(r"\d+", test_string4)
for match in matches:
    print(match)

<re.Match object; span=(6, 9), match='123'>


In [24]:
# optional underscore (we don't know whether _ is in the string or not)
matches = re.finditer(r"_?\d+", test_string4)
for match in matches:
    print(match)

<re.Match object; span=(5, 9), match='_123'>


In [25]:
# has to be a numerical sequence exactly 3 digits long
matches = re.finditer(r"\d{3}", test_string4)
for match in matches:
    print(match)

<re.Match object; span=(6, 9), match='123'>


In [26]:
# can be 1 to 3 digits long numerical sequence
matches = re.finditer(r"\d{1,3}", test_string4)
for match in matches:
    print(match)

<re.Match object; span=(6, 9), match='123'>


## Example

In [27]:
dates = """
hello
01.04.2020

2021.04.02

2021-05-03
2020-04-23
2020-06-11
2020-07-11
2020-08-11

2020/04/02

2020_04_04
2020_04_04
"""
# YYYY-MM-DD this should be the desired format

In [28]:
matches = re.finditer(r"\d\d\d\d.\d\d.\d\d", dates)
for match in matches:
    print(match)

<re.Match object; span=(19, 29), match='2021.04.02'>
<re.Match object; span=(31, 41), match='2021-05-03'>
<re.Match object; span=(42, 52), match='2020-04-23'>
<re.Match object; span=(53, 63), match='2020-06-11'>
<re.Match object; span=(64, 74), match='2020-07-11'>
<re.Match object; span=(75, 85), match='2020-08-11'>
<re.Match object; span=(87, 97), match='2020/04/02'>
<re.Match object; span=(99, 109), match='2020_04_04'>
<re.Match object; span=(110, 120), match='2020_04_04'>


In [29]:
# using set to include / as valid date
matches = re.finditer(r"\d\d\d\d[-/]\d\d[-/]\d\d", dates)
for match in matches:
    print(match)

<re.Match object; span=(31, 41), match='2021-05-03'>
<re.Match object; span=(42, 52), match='2020-04-23'>
<re.Match object; span=(53, 63), match='2020-06-11'>
<re.Match object; span=(64, 74), match='2020-07-11'>
<re.Match object; span=(75, 85), match='2020-08-11'>
<re.Match object; span=(87, 97), match='2020/04/02'>


In [30]:
# only dates in May and June
matches = re.finditer(r"\d\d\d\d[-/]0[56][-/]\d\d", dates)
for match in matches:
    print(match)

<re.Match object; span=(31, 41), match='2021-05-03'>
<re.Match object; span=(53, 63), match='2020-06-11'>


In [31]:
# only dates in May, June and July (using quantifiers instead of repeated \d)
matches = re.finditer(r"\d{4}[-/]0[5-7][-/]\d{2}", dates)
for match in matches:
    print(match)

<re.Match object; span=(31, 41), match='2021-05-03'>
<re.Match object; span=(53, 63), match='2020-06-11'>
<re.Match object; span=(64, 74), match='2020-07-11'>
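One possible way to bring year-first dates into the desired YYYY-MM-DD format is `re.sub()` with groups (both are covered in sections 8 and 9). This is a sketch under the assumption that `.`, `/` and `_` separators should all be normalized; the shortened `dates_sample` string is hypothetical:

```python
import re

# Hypothetical, shortened version of the dates string
dates_sample = "2021-05-03\n2020/04/02\n2020_04_04\n01.04.2020"

# Capture year, month and day; accept -, /, . or _ as separator
normalized = re.sub(r"(\d{4})[-/._](\d{2})[-/._](\d{2})", r"\1-\2-\3", dates_sample)
print(normalized)
# 2021-05-03
# 2020-04-02
# 2020-04-04
# 01.04.2020
```

The day-first date `01.04.2020` is left unchanged because the pattern requires four digits before the first separator.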


# 7) Conditions

In [32]:
test_string5 = """
hello world
123 23
2022-04-02
Mr Simpson
Mrs Simpson
Mr. Brown
Ms Smith
Mr. T
"""
# extract only the names together with their title
# the alternatives Mr, Ms and Mrs are combined with | inside a group ( )
matches = re.finditer(r"(Mr|Ms|Mrs)\.?\s\w+", test_string5)
for match in matches:
    print(match)

<re.Match object; span=(31, 41), match='Mr Simpson'>
<re.Match object; span=(42, 53), match='Mrs Simpson'>
<re.Match object; span=(54, 63), match='Mr. Brown'>
<re.Match object; span=(64, 72), match='Ms Smith'>
<re.Match object; span=(73, 78), match='Mr. T'>


In [33]:
test_string6 = """
hello world
123 23
2022-04-02
Mr Simpson
Mrs Simpson
Mr. Brown
Ms Smith
Mr. T
pythonengineer@gmail.com
Python-engineer@gmx.de
python-engineer123@my-domain.org
"""
# use multiple sets to extract email addresses
matches = re.finditer(r"[a-zA-Z0-9-]+@[a-zA-Z-]+\.[a-zA-Z]+", test_string6)
for match in matches:
    print(match.group(0)) # group 0 is the whole match; no other groups are accessible without ( )

pythonengineer@gmail.com
Python-engineer@gmx.de
python-engineer123@my-domain.org


# 8) Grouping
- use parentheses to create groups in order to access them separately

In [34]:
# explicitly split the match into substrings
# create three groups with ( ) to access them separately
matches = re.finditer(r"([a-zA-Z0-9-]+)@([a-zA-Z-]+)\.([a-zA-Z]+)", test_string6)
for match in matches:
    print(match.group(1))

pythonengineer
Python-engineer
python-engineer123
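The remaining groups can be read the same way. A short sketch using the same pattern on a shortened copy of the email lines; the variable names `emails` and `parts` are only for illustration:

```python
import re

# Shortened copy of the email lines from test_string6
emails = """
pythonengineer@gmail.com
Python-engineer@gmx.de
python-engineer123@my-domain.org
"""

pattern = re.compile(r"([a-zA-Z0-9-]+)@([a-zA-Z-]+)\.([a-zA-Z]+)")

# groups() returns all captured groups of one match as a tuple
parts = [m.groups() for m in pattern.finditer(emails)]
for user, domain, tld in parts:
    print(user, domain, tld)

# group() also accepts several group numbers at once
first = pattern.search(emails)
print(first.group(2, 3))  # ('gmail', 'com')
```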


# 9) Modification
Two methods to modify the string:
1. split(): split the string into a list wherever the regex matches
2. sub(): find every place where the regex matches and replace it with a different string


In [35]:
test_string7 = '123abc456789abc123ABC'
splitted = re.split(r"abc", test_string7)
print(splitted)

['123', '456789', '123ABC']


In [36]:
test_string8 = "hello world, whats up?"
subbed_string = re.sub("world", "planet", test_string8)
print(subbed_string)

hello planet, whats up?
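Both functions also accept an optional limit (`maxsplit` for `split()`, `count` for `sub()`), and `split()` keeps the separators when the pattern contains a capturing group. A brief sketch, reusing the string from the split example:

```python
import re

text = '123abc456789abc123ABC'

split_once = re.split(r"abc", text, maxsplit=1)  # stop after the first split
split_keep = re.split(r"(abc)", text)            # the group keeps the separators
sub_all = re.sub(r"\d+", "#", text)              # replace every run of digits
sub_first = re.sub(r"\d+", "#", text, count=1)   # replace only the first run

print(split_once)  # ['123', '456789abc123ABC']
print(split_keep)  # ['123', 'abc', '456789', 'abc', '123ABC']
print(sub_all)     # #abc#abc#ABC
print(sub_first)   # #abc456789abc123ABC
```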


## Final example

In [37]:
urls = """
hello
2022-12-23
https://python-engineer.com
https://www.youtube.com
http://www.pyeng.net
"""

pattern = re.compile(r"https?://(www\.)?([a-zA-Z-]+)\.([a-zA-Z]+)")
matches = pattern.finditer(urls)
for match in matches:
    print(match.group(0))

# insert groups 2 and 3 into the replacement via backreferences "\2\3" (the dot between them is dropped)
subbed_urls = pattern.sub(r"\2\3", urls)
print(subbed_urls)

# or by one-liner again
subbed_urls = re.sub(r"https?://(www\.)?([a-zA-Z-]+)\.([a-zA-Z]+)", r"\2\3", urls)
print(subbed_urls)

https://python-engineer.com
https://www.youtube.com
http://www.pyeng.net

hello
2022-12-23
python-engineercom
youtubecom
pyengnet


hello
2022-12-23
python-engineercom
youtubecom
pyengnet



# 10) Compilation flags
When we compile the pattern, we can choose among different compilation flags:
- ASCII, A: Makes several escapes like \w, \b, \s and \d match only ASCII characters
- DOTALL, S: Make . match any character, including newlines.
- IGNORECASE, I: Do case-insensitive matches.
- LOCALE, L: Do a locale-aware match.
- MULTILINE, M: Multi-line matching, affecting ^ and $.
- VERBOSE, X (for 'extended'): Enable verbose REs, which can be organized more cleanly and understandably.

In [38]:
my_string = "Hello World"
# ignore case sensitivity flag 
pattern = re.compile(r"world", re.I)
matches = pattern.finditer(my_string)
for match in matches:
    print(match)

<re.Match object; span=(6, 11), match='World'>
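The MULTILINE, DOTALL and VERBOSE flags from the list can be sketched in the same way. The sample strings are hypothetical; several flags can be combined with `|`, for example `re.I | re.M`:

```python
import re

# Hypothetical sample text, not from the original tutorial
text = "first line\nsecond line"

# Without re.M, ^ only matches at the very start of the string
no_flag = re.findall(r"^\w+", text)
multiline = re.findall(r"^\w+", text, re.M)
print(no_flag)    # ['first']
print(multiline)  # ['first', 'second']

# Without re.S, the dot does not match the newline between the lines
print(re.search(r"line.second", text))               # None
dotall = re.search(r"line.second", text, re.S)
print(repr(dotall.group()))                          # 'line\nsecond'

# re.X ignores whitespace in the pattern and allows comments
date_pattern = re.compile(r"""
    \d{4}    # year
    [-/]     # separator
    \d{2}    # month
    [-/]     # separator
    \d{2}    # day
""", re.X)
found = date_pattern.findall("2021-05-03 and 2020/04/02")
print(found)  # ['2021-05-03', '2020/04/02']
```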
