# Regex

Regular expressions, often abbreviated as regex or regexp, are sequences of characters that define a search pattern. They are widely used in computer science and programming for string manipulation, pattern matching, and text processing. In Python, the re module provides support for regular expressions.

Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. Python string literals use the same character for their own escape sequences, so a backslash meant for the regex engine would otherwise have to be doubled. To avoid this, regex patterns in Python are usually written in raw string notation.


A string becomes a raw string if it is prefixed with r or R before the opening quotation mark. Hence 'Hello World' is a normal string and r'Hello World' is a raw string.

Under normal circumstances, there is no difference between the two. However, when an escape sequence is embedded in the string, the normal string interprets it, whereas the raw string leaves the backslash and the following character untouched.

In [4]:
normal="Hello\nWorld"

print(normal) 

# Output: 
# Hello
# World

raw=r"Hello\nWorld"

print(raw) # Output: Hello\nWorld




Hello
World
Hello\nWorld
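
The difference matters as soon as a pattern contains a regex escape such as \b. The following is a minimal sketch; the sample text and variable names are made up for illustration.

```python
import re

text = "a word boundary test"

# Without the r prefix, Python turns \b into a backspace character (\x08)
# before the regex engine sees it, so the pattern cannot match this text.
print(re.search('\bword\b', text))            # None

# With a raw string, the regex engine receives \b and treats it as a word boundary.
print(re.search(r'\bword\b', text).group())   # word

# Matching a literal backslash: the regex needs \\, written r'\\' as a raw
# string or '\\\\' as a normal string.
path = "C:\\temp"                             # the actual text is C:\temp
print(re.search(r'\\', path).group() == re.search('\\\\', path).group())   # True
```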


**Metacharacters**

Most letters and characters simply match themselves. However, some characters are special metacharacters and don't match themselves. Metacharacters are characters with a special meaning, similar to * in a wildcard pattern.

Here's a complete list of the metacharacters:

. ^ $ * + ? { } [ ] \ | ( )

The square brackets [ ] indicate a set of characters that you wish to match. Characters can be listed individually, or as a range by giving the first and last character separated by a '-'.

[abc]: match any of the characters a, b, or c

[a-c]: match any of the characters a, b, or c; this uses a range to express the same set of characters.

[a-z]: match only lowercase letters.

[0-9]: match only digits.

'^': complements the character set in []. [^5] will match any character except '5'.
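
The character sets above can be tried out with re.findall, which returns every match as a list. This is a small sketch; the sample strings are made up for illustration.

```python
import re

sample = "a1b2c3d4e5"

print(re.findall(r'[abc]', sample))   # ['a', 'b', 'c']
print(re.findall(r'[a-c]', sample))   # ['a', 'b', 'c']
print(re.findall(r'[0-9]', sample))   # ['1', '2', '3', '4', '5']

# A leading ^ inside the brackets inverts the set: everything except '5'
print(re.findall(r'[^5]', "1525"))    # ['1', '2']
```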


'\' is the escaping metacharacter. Followed by certain characters, it forms special sequences such as \d or \s. If you need to match a literal [ or \, you can precede it with a backslash to remove its special meaning: \[ or \\.

\d : Matches any decimal digit; this is equivalent to the class [0-9]

\D : Matches any non-digit character; this is equivalent to the class [^0-9]

\s : Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v]

\S : Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v]

\w : Matches any alphanumeric character or underscore; this is equivalent to the class [a-zA-Z0-9_]

\W : Matches any character that is not alphanumeric or an underscore; this is equivalent to the class [^a-zA-Z0-9_]

. : Matches with any single character except newline '\n'

? : Matches 0 or 1 occurrence of the pattern

+ : Matches 1 or more occurrences of the pattern. For example, \d+ matches a whole run of consecutive digits rather than a single digit.

* : Matches 0 or more occurrences of the pattern.

[..] : Matches any single character in a square bracket and [^..] matches any single character not in square bracket

\ : It is used for special meaning characters like \. to match a period or \+ for plus sign

\b : Boundary between word and non-word

{n} : Matches exactly n occurrences

{n,} : At least n occurrences

{,m} : At most m occurrences

{n,m} : Between n and m occurrences (inclusive)

a|b : Matches either a or b

^ : Matches the start of the string, and in MULTILINE mode also matches immediately after each newline

$ : Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline
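
A few of the quantifiers and anchors from the list above, sketched on made-up sample strings:

```python
import re

# {n,m}: runs of 2 to 3 digits, kept to whole "words" by \b
print(re.findall(r'\b\d{2,3}\b', "7 42 123 4567"))   # ['42', '123']

# ?: the "u" is optional
print(re.findall(r'colou?r', "color colour"))        # ['color', 'colour']

# a|b: either alternative
print(re.findall(r'cat|dog', "cat, dog, bird"))      # ['cat', 'dog']

# ^ and $: anchor the pattern to the start and end of the string
print(bool(re.search(r'^\d+$', "12345")))            # True
print(bool(re.search(r'^\d+$', "123a5")))            # False
```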


The re.match function in Python is used to determine if the regular expression pattern matches at the beginning of a string. It checks for a match only at the beginning of the string, unlike re.search, which scans the entire string for a match.

In [6]:
#import re

#re.match(pattern, string, flags=0)

# pattern: This is the regular expression to be matched.

# string: This is the string, which would be searched to match the pattern at the beginning of string.

# flags: Optional flags to control the behavior of the regular expression. E.g. you can use re.IGNORECASE to perform a case-insensitive match.

In [12]:
import re

line = "Cats are smarter than dogs"

matchObj = re.match(r'Cats', line)

print(matchObj.start(), matchObj.end()) # Output: 0 4

print(matchObj.group()) # Output: Cats

print(matchObj) # Output: <re.Match object; span=(0, 4), match='Cats'>



0 4
Cats
<re.Match object; span=(0, 4), match='Cats'>


Example 1: Matching Digits at the Beginning of a String

In [13]:
# Using re.match to check if the pattern matches at the beginning of the string
result = re.match(r'\d+', "123abc")

if result:
    print("Match found:", result.group())
else:
    print("No match")

Match found: 123
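
To see the difference between re.match and re.search directly, here is a short sketch that looks for a word that is not at the start of the string. It reuses the sentence from the earlier cell; the variable name found is made up here.

```python
import re

line = "Cats are smarter than dogs"

# "dogs" is not at the beginning, so re.match returns None...
print(re.match(r'dogs', line))        # None

# ...while re.search scans forward and finds it.
found = re.search(r'dogs', line)
print(found.group(), found.span())    # dogs (22, 26)
```

Because a failed match returns None, calling .group() on the result without checking it first raises an AttributeError, which is why the examples test `if result:` before using it.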


In Python's re module, re.compile() is a function that compiles a regular expression pattern into a regular expression object. This compilation step allows you to reuse the compiled pattern, which can improve the performance of repeated pattern matching operations.

In [15]:
# Creating a regular expression pattern to match digits
pattern = re.compile(r'\d+')

# Using the compiled pattern for matching
result = pattern.match("123abc")

if result:
    print("Match found:", result.group())
else:
    print("No match")

Match found: 123
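
Reuse is where a compiled pattern pays off: the same object can be applied to many strings. A minimal sketch, with made-up names (digits, samples, results):

```python
import re

digits = re.compile(r'\d+')

samples = ["123abc", "abc123", "42"]
results = {}

for s in samples:
    m = digits.match(s)                 # match only at the beginning of s
    results[s] = m.group() if m else None
    print(s, "->", results[s])
```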


Example 2: Case-Insensitive Match

In [16]:
# Using re.match for a case-insensitive match
result = re.match(r'hello', "Hello World", flags=re.IGNORECASE)

if result:
    print("Match found:", result.group())
else:
    print("No match")


#OR

pattern = re.compile(r'hello', flags=re.IGNORECASE)

# Using re.match for a case-insensitive match
result = pattern.match("Hello World")

if result:
    print("Match found:", result.group())
else:
    print("No match")

Match found: Hello
Match found: Hello


Example 3: Matching a Specific Word at the Beginning

In [24]:
# Using re.match to check if the pattern matches at the beginning of the string
result = re.match(r'apple', "apple pie")

if result:
    print("Match found:", result.group())
else:
    print("No match")


# OR


# Using a compiled pattern to check if the pattern matches at the beginning of the string
pattern = re.compile(r'apple')
result = pattern.match("apple pie")

if result:
    print("Match found:", result.group())
else:
    print("No match")

Match found: apple
Match found: apple


The re.search() function in Python's re module is used to search for a pattern within a string. Unlike re.match(), which only checks for a match at the beginning of the string, re.search() scans the entire string for a match.

re.search(pattern, string, flags=0)

pattern: The regular expression pattern to search for.

string: The input string in which the pattern is searched.

flags: Optional flags to control the behavior of the regular expression.

In [52]:
pattern = re.compile(r'\d+')

# Using re.search to search for the pattern in the string
result = pattern.search("The price is $123.45")

if result:
    print("Match found:", result.group())
else:
    print("No match")


pattern = re.compile(r'\d*')

# Using re.search to search for the pattern in the string
result = pattern.search("The price is $123.45")

if result:
    print("Match found:", result.group())
else:
    print("No match")

# Output: ''

# The pattern r'\d*' is looking for zero or more occurrences of digits (\d). The * quantifier allows for the possibility of zero digits.

# \d: Matches a digit.
# *: Matches zero or more occurrences of the preceding character (in this case, a digit).

# Now, let's look at the provided string "The price is $123.45". 
# The pattern is able to match at the start of the string where there are zero digits (empty string). 
# The search function finds this match, and the condition if result: evaluates to True.

# As a result, the code prints "Match found:" along with the empty string that was matched at the beginning of the string. 
# If you want to ensure that there is at least one digit in the match, you can modify the pattern to use the + quantifier

Match found: 123
Match found: 


Example 1: Matching a Specific Word

In [53]:
text = "The quick brown fox jumps over the lazy dog."

pattern = re.compile(r'\bfox\b')

result = pattern.search(text)

if result:
    print("Match found:", result.group())
else:
    print("No match")

Match found: fox


Example 2: Finding the First Email Address

In [54]:
text = "Contact us at support@example.com or sales@company.com for assistance."

pattern = re.compile(r'\b\w+@\w+\.\w*\b')

result = pattern.search(text)

if result:
    print("Match found:", result.group())
else:
    print("No match")

Match found: support@example.com


In [70]:
# Suppose we have a text with many email addresses
text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

# Here re.findall() returns a list of all the found email strings
emails = re.findall(r'[\w\.-]+@[\w\.-]+', text)  # ['alice@google.com', 'bob@abc.com']
for email in emails:
    # do something with each found email string
    print(email)


# re.search() returns only the first match; the parentheses capture its parts
match = re.search(r'([\w.-]+)@([\w.-]+)', text)

if match:
    print(match.group())   # 'alice@google.com' (the whole match)
    print(match.group(1))  # 'alice' (the username, group 1)
    print(match.group(2))  # 'google.com' (the host, group 2)

alice@google.com
bob@abc.com
alice@google.com
alice
google.com


Example 3: Searching for a Date

In [55]:
text = "The meeting is scheduled for 2023-12-25."

pattern = re.compile(r'\b\d{4}-\d{2}-\d{2}\b')

result = pattern.search(text)

if result:
    print("Match found:", result.group())
else:
    print("No match")

Match found: 2023-12-25


Multiline mode (re.M):

In [57]:
text = "Line 1\nLine 2\nLine 3"

pattern = re.compile(r'^Line', flags=re.M)

matches = pattern.findall(text)

print(matches)



['Line', 'Line', 'Line']
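
For comparison, the sketch below runs the same text with and without re.M, and also shows the effect on $.

```python
import re

text = "Line 1\nLine 2\nLine 3"

# Without re.M, ^ matches only at the very start of the string.
print(re.findall(r'^Line', text))                # ['Line']

# With re.M, ^ also matches right after each newline...
print(re.findall(r'^Line', text, flags=re.M))    # ['Line', 'Line', 'Line']

# ...and $ also matches right before each newline.
print(re.findall(r'\d$', text))                  # ['3']
print(re.findall(r'\d$', text, flags=re.M))      # ['1', '2', '3']
```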


Case Insensitive Mode (re.I):

In [58]:
text = "The Quick Brown Fox"

pattern = re.compile(r'quick', flags=re.I)

result = pattern.search(text)

if result:
    print("Match found:", result.group())
else:
    print("No match")

Match found: Quick


The re.findall() function in Python's re module finds all non-overlapping occurrences of a pattern in a string, scanning from left to right, and returns them as a list. If the pattern has no capturing groups, the list contains the matched strings; with one group it contains that group's text, and with several groups it contains tuples of the groups.

re.findall(pattern, string, flags=0)

pattern: The regular expression pattern to search for.

string: The input string in which the pattern is searched.

flags: Optional flags to control the behavior of the regular expression.

In [185]:
text = "The cat and the dog chased a cat."

pattern = re.compile(r'\b\w+at\b')

matches = pattern.findall(text)

print(matches) # Output: ['cat', 'cat']


string="Simple is better than complex."

obj=re.findall(r"ple", string) 

print(obj) # Output: ['ple', 'ple']


string="Simple is better than complex."


obj2=re.findall(r"\w+", string)

print(obj2) # Output: ['Simple', 'is', 'better', 'than', 'complex']


obj3=re.findall(r"\w*", string)

print(obj3) # Output: ['Simple', '', 'is', '', 'better', '', 'than', '', 'complex', '', '']


string2 = "It is just a simple test, which I want to try."

obj4 = re.findall(r"\b\w*t\w*t\w*\b", string2)

print(obj4)



['cat', 'cat']
['ple', 'ple']
['Simple', 'is', 'better', 'than', 'complex']
['Simple', '', 'is', '', 'better', '', 'than', '', 'complex', '', '']
['test']


In [235]:
regexp = re.compile(r"\w*(.)\1\w*")

# The expression in parentheses counts as a group. Since there is only one group, we can refer back to it as \1,
# so (.)\1 matches any character that is immediately repeated.

data = ["parrot","follia","carrot","mattia","rettoo","melone"]

for word in data:
    match = regexp.search(word)
    if match:
        print(match.group())
    else:
        print("")

parrot
follia
carrot
mattia
rettoo



In [273]:
pattern = r'(\w*(\w)\2\w*)'

text = "hello, book, balloon, bee, success, llama, rabbit, cheese, pumpkin, door"

matches = re.findall(pattern, text)

for match in matches:
    print(match[0])
    print(match[1])

# What happens here? When the pattern contains groups, re.findall returns the captured groups rather than the whole match.
# With only one group, it returns just that group's text:


pattern = r'\w*(\w)\1\w*'

text = "hello, book, balloon, bee, success, llama, rabbit, cheese, pumpkin, door"

matches = re.findall(pattern, text)

print(matches)


# To avoid this, the first pattern above uses two groups: the outer parentheses (group 1) and the inner parentheses (group 2).
# Both captured texts are stored, e.g. 'hello' and 'l', so we can choose which one we want to see.



hello
l
book
o
balloon
o
bee
e
success
s
llama
l
rabbit
b
cheese
e
door
o
['l', 'o', 'o', 'e', 's', 'l', 'b', 'e', 'o']


In [265]:
pattern = r'\d+'
text = "There are 42 apples and 123 oranges."

matches = re.findall(pattern, text)

print(matches)

['42', '123']


In [266]:
pattern = r'(\w+)-(\d+)'
text = "apple-42, orange-123, banana-7"

matches = re.findall(pattern, text)

print(matches)
# Output: [('apple', '42'), ('orange', '123'), ('banana', '7')]

[('apple', '42'), ('orange', '123'), ('banana', '7')]


In [267]:
pattern = r'apple'
text = "There is an Apple in the basket."

matches = re.findall(pattern, text, flags=re.IGNORECASE)

print(matches)
# Output: ['Apple']

['Apple']


(?: ... ): This is a non-capturing group. It groups the patterns inside the parentheses without creating a separate capturing group. The ?: at the beginning of the group makes it non-capturing.

In [279]:
pattern = r'(?:Mr|Ms)\. (\w+)'
text = "Mr. Smith, Ms. Johnson, Mr. Brown"

matches = re.findall(pattern, text)

print(matches)
# Output: ['Smith', 'Johnson', 'Brown']



pattern = r'(?:Mr|Ms)\. \w+'
text = "Mr. Smith, Ms. Johnson, Mr. Brown"

matches = re.findall(pattern, text)

print(matches)
# Output: ['Mr. Smith', 'Ms. Johnson', 'Mr. Brown']

['Smith', 'Johnson', 'Brown']
['Mr. Smith', 'Ms. Johnson', 'Mr. Brown']


In [282]:
pattern = r'\d+%'

text = "42% discount, 20% off, 75% sale"

matches = re.findall(pattern, text)

print(matches)
# Output: ['42%', '20%', '75%']


# (?=%) is a lookahead: a % must follow the digits, but it is not part of the match
pattern = r'\d+(?=%)'
text = "42% discount, 20% off, 75% sale"

matches = re.findall(pattern, text)

print(matches)
# Output: ['42', '20', '75']


['42%', '20%', '75%']
['42', '20', '75']
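
The two patterns above differ only in whether the % is consumed. A lookahead such as (?=%) checks what follows without including it in the match, which the spans in this small sketch make visible. The variable names are made up for illustration.

```python
import re

text = "42% discount"

with_percent = re.search(r'\d+%', text)
lookahead = re.search(r'\d+(?=%)', text)

print(with_percent.group(), with_percent.span())   # 42% (0, 3)
print(lookahead.group(), lookahead.span())         # 42 (0, 2)

# Without a following %, the lookahead fails and there is no match.
print(re.search(r'\d+(?=%)', "42 apples"))         # None
```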


The re.sub() function in Python is used for replacing substrings that match a specified pattern with a replacement string. Its basic syntax is:

re.sub(pattern, replacement, string, count=0, flags=0)

pattern: The regular expression pattern to search for.

replacement: The string to replace the matched occurrences with.

string: The input string where replacements will be made.

count: Optional. The maximum number of occurrences to replace. If not specified or 0, all occurrences will be replaced.

flags: Optional flags that modify the behavior of the regex search.
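
The count parameter is the one argument the examples below do not exercise, so here is a short sketch of it on a made-up sentence:

```python
import re

text = "one cat, two cats, three cats"

# Default count=0 replaces every occurrence
print(re.sub(r'cat', 'dog', text))            # one dog, two dogs, three dogs

# count=1 replaces only the first occurrence
print(re.sub(r'cat', 'dog', text, count=1))   # one dog, two cats, three cats
```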

Example 1: Basic Substitution

In [283]:
text = "The cat and the hat are sitting on the mat."

# Replace "cat" with "dog"
new_text = re.sub(r'cat', 'dog', text)

print(new_text)

The dog and the hat are sitting on the mat.


Example 2: Case-Insensitive Substitution

In [284]:
text = "The cat and the hat are sitting on the mat."

# Replace "Cat" with "dog" (case-insensitive)
new_text = re.sub(r'Cat', 'dog', text, flags=re.IGNORECASE)

print(new_text)
# Output: "The dog and the hat are sitting on the mat."

The dog and the hat are sitting on the mat.


Example 3: Using Capturing Groups in Replacement

In [285]:
text = "Name: John, Age: 25, Name: Jane, Age: 30"

# Swap the names and ages
new_text = re.sub(r'Name: (\w+), Age: (\d+)', r'Age: \2, Name: \1', text)

print(new_text)
# Output: "Age: 25, Name: John, Age: 30, Name: Jane"

Age: 25, Name: John, Age: 30, Name: Jane


Example 4: Removing HTML Tags

In [298]:
html_text = "<p>This is <b>bold</b> and <i>italic</i> text.</p>"

# Remove HTML tags
plain_text = re.sub(r'<.*?>', '', html_text)

print(plain_text)
# Output: "This is bold and italic text."

This is bold and italic text.


On greedy vs non-greedy

Repetition in regex is greedy by default: a quantifier tries to match as many repetitions as possible, and when this doesn't work and it has to backtrack, it tries one fewer repetition at a time until the whole pattern matches. As a result, when a match finally happens, a greedy repetition has matched as many repetitions as possible.

Adding ? after a repetition quantifier (as in *?, +?, ?? or {n,m}?) makes it non-greedy, also called reluctant (in e.g. Java) or sometimes "lazy". A non-greedy repetition first tries to match as few repetitions as possible, and when this doesn't work and it has to backtrack, it tries one more repetition at a time. As a result, when a match finally happens, a reluctant repetition has matched as few repetitions as possible.

In [301]:
# Greedy

text = "eeeAiiZuuuuAoooZeeee"

text_1 = re.sub(r"A.*Z", "", text)

print(text_1)


# Not greedy

text = "eeeAiiZuuuuAoooZeeee"

text_2 = re.sub(r"A.*?Z", "", text)

print(text_2)

eeeeeee
eeeuuuueeee
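
The same contrast shows up with re.findall, which makes it easy to see where each match starts and stops. A small sketch on a made-up HTML snippet:

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .* runs to the last > in the string, so there is one long match.
print(re.findall(r'<.*>', html))    # ['<b>bold</b> and <i>italic</i>']

# Non-greedy: .*? stops at the first > it can, so each tag matches separately.
print(re.findall(r'<.*?>', html))   # ['<b>', '</b>', '<i>', '</i>']
```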
