## Regex

- Regular expressions (regex) are a powerful tool for pattern matching and string manipulation in Python.
- A regex pattern is built from two kinds of elements:

1. literal characters, which match themselves in a string
2. metacharacters and special sequences, which describe more complex patterns


### Regex methods

| Method       | Description                                                                                                                          |
| ------------ | ------------------------------------------------------------------------------------------------------------------------------------ |
| re.search()  | Searches the string for a match to the regex pattern and returns the first match found.                                              |
| re.findall() | Returns a list of all non-overlapping matches of the regex pattern in the string.                                                    |
| re.match()   | Returns a match object if the pattern is found at the beginning of the string. If the pattern is not found, the method returns None. |
| re.sub()     | Substitutes all occurrences of the regex pattern in the string with a replacement string.                                            |
| re.split()   | Splits the string at each occurrence of the regex pattern and returns a list of the resulting substrings.                            |
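
To make the table concrete, here is a minimal sketch that runs each of the five functions on one sample string. The string and the variable names are made up for illustration.

```python
import re

text = "cat bat rat"  # sample string, assumed for this example

# re.search: first match anywhere in the string
first = re.search(r"[br]at", text)
print(first.group())   # bat

# re.findall: every non-overlapping match, returned as a list of strings
all_words = re.findall(r"\w+at", text)
print(all_words)       # ['cat', 'bat', 'rat']

# re.match: succeeds only if the pattern matches at the start of the string
at_start = re.match(r"bat", text)
print(at_start)        # None

# re.sub: replace every match with the replacement string
replaced = re.sub(r"[cbr]at", "dog", text)
print(replaced)        # dog dog dog

# re.split: cut the string wherever the pattern matches
parts = re.split(r"\s+", text)
print(parts)           # ['cat', 'bat', 'rat']
```

Note that `re.search()` and `re.match()` return a match object (or `None`), while `re.findall()` and `re.split()` return plain lists of strings.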


**Characters**: These are literal characters that match themselves.
For example, the regex **a** matches the character "a" in a string.


In [7]:
import re
matches = re.search('a', 'Ham')
print(matches)


<re.Match object; span=(1, 2), match='a'>


**Metacharacters**: These are characters that have a special meaning in regex instead of matching themselves.
Some common metacharacters include:


| Character         | Description                                                            |
| ----------------- | ---------------------------------------------------------------------- |
| . (dot)           | Matches any character except a newline.                                |
| ^ (caret)         | Matches the beginning of a string.                                     |
| $ (dollar sign)   | Matches the end of a string.                                           |
| \* (asterisk)     | Matches zero or more occurrences of the preceding element.             |
| + (plus sign)     | Matches one or more occurrences of the preceding element.              |
| ? (question mark) | Matches zero or one occurrence of the preceding element.               |
| {m}               | Matches exactly m occurrences of the preceding element.                |
| {m,n}             | Matches between m and n occurrences of the preceding element.          |


In [9]:
# Dot (.)
match_obj = re.match('a.', 'abc')
print(match_obj)  # <re.Match object; span=(0, 2), match='ab'>
# The regex 'a.' matches the first 'a' character and any character after it, so the match object contains 'ab'.

match_obj = re.match('a.', 'a')
print(match_obj)  # None
# The regex 'a.' requires at least two characters in the string to match, so None is returned.

# Caret (^)
match_obj = re.match('^a', 'abc')
print(match_obj)  # <re.Match object; span=(0, 1), match='a'>
# The regex '^a' matches the first character of the string 'abc', so a match object is returned.

match_obj = re.match('^b', 'abc')
print(match_obj)  # None
# The regex '^b' does not match the beginning of the string 'abc', so None is returned.

# Dollar Sign ($)
match_obj = re.match('c$', 'abc')
print(match_obj)  # None
# re.match() only tries to match at the beginning of the string. 'abc' starts with 'a', not 'c',
# so 'c$' fails here. re.search('c$', 'abc') would find the 'c' at the end of the string.

match_obj = re.match('a$', 'abc')
print(match_obj)  # None
# The regex 'a$' requires the string to end right after 'a', but 'bc' follows, so None is returned.

# Asterisk (*)
match_obj = re.match('a*', 'aaa')
print(match_obj)  # <re.Match object; span=(0, 3), match='aaa'>
# The regex 'a*' matches zero or more 'a' characters at the beginning of the string 'aaa', so a match object is returned.

match_obj = re.match('a*', 'bbb')
print(match_obj)  # <re.Match object; span=(0, 0), match=''>
# The regex 'a*' matches zero 'a' characters at the start of 'bbb', so a match object with an empty match ('') is returned.

# Plus Sign (+)
match_obj = re.match('a+', 'aaa')
print(match_obj)  # <re.Match object; span=(0, 3), match='aaa'>
# The regex 'a+' matches one or more 'a' characters at the beginning of the string 'aaa', so a match object is returned.

match_obj = re.match('a+', 'bbb')
print(match_obj)  # None
# The regex 'a+' does not match any character in the string 'bbb', so None is returned.

# Question Mark (?)
match_obj = re.match('a?', 'aaa')
print(match_obj)  # <re.Match object; span=(0, 1), match='a'>
# The regex 'a?' matches zero or one 'a' character at the beginning of the string 'aaa', so a match object is returned.

match_obj = re.match('a?', 'bbb')
print(match_obj)  # <re.Match object; span=(0, 0), match=''>
# The regex 'a?' matches zero 'a' characters at the start of 'bbb', so a match object with an empty match ('') is returned.

# {m}
match_obj = re.match('a{2}', 'aaa')
print(match_obj)  # <re.Match object; span=(0, 2), match='aa'>
# The regex 'a{2}' matches exactly two 'a' characters at the beginning of the string 'aaa', so a match object is returned.

match_obj = re.match('a{2}', 'a')
print(match_obj)  # None
# The regex 'a{2}' requires at least two 'a' characters to match, so None is returned.

# {m,n}
match_obj = re.match('a{2,4}', 'aaaaa')
print(match_obj)  # <re.Match object; span=(0, 4), match='aaaa'>
# The regex 'a{2,4}' matches between two and four 'a' characters at the beginning of the string 'aaaaa', so a match object is returned.

match_obj = re.match('a{2,4}', 'a')
print(match_obj)  # None
# The regex 'a{2,4}' requires at least two 'a' characters to match, so None is returned.

match_obj = re.match('a{2,4}', 'aaaaaaa')
print(match_obj)  # <re.Match object; span=(0, 4), match='aaaa'>
# The regex 'a{2,4}' matches only the first four 'a' characters at the beginning of the string 'aaaaaaa', so a match object is returned.


<re.Match object; span=(0, 2), match='ab'>
None
<re.Match object; span=(0, 1), match='a'>
None
None
None
<re.Match object; span=(0, 3), match='aaa'>
<re.Match object; span=(0, 0), match=''>
<re.Match object; span=(0, 3), match='aaa'>
None
<re.Match object; span=(0, 1), match='a'>
<re.Match object; span=(0, 0), match=''>
<re.Match object; span=(0, 2), match='aa'>
None
<re.Match object; span=(0, 4), match='aaaa'>
None
<re.Match object; span=(0, 4), match='aaaa'>
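
The two `None` results for the dollar-sign examples in the output above come from `re.match()`, which only attempts a match at position 0. A short sketch of the same patterns with `re.search()`, which scans the whole string, may help show what `$` does on its own:

```python
import re

# re.match only tries position 0, so 'c$' cannot match 'abc' there
start_only = re.match(r"c$", "abc")
print(start_only)        # None

# re.search scans forward and finds 'c' right before the end of the string
end_match = re.search(r"c$", "abc")
print(end_match.span())  # (2, 3)

# 'a' is not the last character, so even re.search finds nothing
no_match = re.search(r"a$", "abc")
print(no_match)          # None
```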


**Special sequences**: These are shorthand codes for commonly used patterns. Some common special sequences include:


![image.png](character_classes.png)


In [18]:
import re

# \d
# Patterns are written as raw strings (r'...') so Python does not treat the backslash as a string escape.
match_obj = re.match(r'\d', '123')
print(match_obj)  # <re.Match object; span=(0, 1), match='1'>
# The regex '\d' matches any digit character, so the first digit '1' at the beginning of the string '123' is matched
# and returned as a match object.

match_obj = re.match(r'\d', 'abc')
print(match_obj)  # None
# The regex '\d' matches only digit characters, so None is returned because there is no digit at the beginning of the string 'abc'.

# \s
match_obj = re.match(r'\s', ' hello')
print(match_obj)  # <re.Match object; span=(0, 1), match=' '>
# The regex '\s' matches any whitespace character, so the space at the beginning of the string ' hello' is matched and returned as a match object.

match_obj = re.match(r'\s', 'hello')
print(match_obj)  # None
# The regex '\s' matches only whitespace characters, so None is returned because the string 'hello' does not start with whitespace.

# \w
match_obj = re.match(r'\w', 'hello')
print(match_obj)  # <re.Match object; span=(0, 1), match='h'>
# The regex '\w' matches any word character (a letter, a digit, or an underscore), so the 'h' at the beginning of the string 'hello' is matched and returned as a match object.

match_obj = re.match(r'\w', '$hello')
print(match_obj)  # None
# The regex '\w' matches only word characters, so None is returned because '$' at the beginning of the string '$hello' is not a word character.



<re.Match object; span=(0, 1), match='1'>
None
<re.Match object; span=(0, 1), match=' '>
None
<re.Match object; span=(0, 1), match='h'>
None
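
The cell above matches one character at a time. As a rough sketch, special sequences are usually combined with a quantifier such as `+` to pull out whole tokens. The sample string is invented, and `\D` (any non-digit) is the standard negation of `\d` in Python's `re` module; whether it appears in the image above is an assumption.

```python
import re

sample = "user_42 paid 3 times"  # hypothetical input

digits = re.findall(r"\d+", sample)      # runs of digits
words = re.findall(r"\w+", sample)       # runs of letters, digits, underscores
fields = re.split(r"\s+", sample)        # split on runs of whitespace
non_digits = re.findall(r"\D+", sample)  # runs of anything that is not a digit

print(digits)      # ['42', '3']
print(words)       # ['user_42', 'paid', '3', 'times']
print(fields)      # ['user_42', 'paid', '3', 'times']
print(non_digits)  # ['user_', ' paid ', ' times']
```

`user_42` stays in one piece for `\w+` because the underscore counts as a word character.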


In [19]:
# [...] (character class)
# Match any single character that is 'x', 'y', or 'z'
pattern1 = '[xyz]'
text1 = 'hello xyz world'

# Find all matches of pattern1 in text1
matches1 = re.findall(pattern1, text1)
print(matches1)  # Output: ['x', 'y', 'z']

# Match any single digit that is either 0, 1, 2, or 3
pattern2 = '[0123]'
text2 = '12345'

# Find all matches of pattern2 in text2
matches2 = re.findall(pattern2, text2)
print(matches2)  # Output: ['1', '2', '3']

# Match any single alphabetic character that is between 'a' and 'e', inclusive
pattern3 = '[a-e]'
text3 = 'abcdefg12345'

# Find all matches of pattern3 in text3
matches3 = re.findall(pattern3, text3)
print(matches3)  # Output: ['a', 'b', 'c', 'd', 'e']

# Match any single digit that is between 0 and 5, inclusive
pattern4 = '[0-5]'
text4 = '12345'

# Find all matches of pattern4 in text4
matches4 = re.findall(pattern4, text4)
print(matches4)  # Output: ['1', '2', '3', '4', '5']

# Match any single character that is not 'x', 'y', or 'z'
pattern5 = '[^xyz]'
text5 = 'hello xyz world'

# Find all matches of pattern5 in text5
matches5 = re.findall(pattern5, text5)
print(matches5)  # Output: ['h', 'e', 'l', 'l', 'o', ' ', ' ', 'w', 'o', 'r', 'l', 'd']

# Match any single alphabetic character that is between 'a' and 'e', inclusive, or between 'A' and 'E', inclusive
pattern6 = '[a-eA-E]'
text6 = 'abcDEFG12345'

# Find all matches of pattern6 in text6
matches6 = re.findall(pattern6, text6)
print(matches6)  # Output: ['a', 'b', 'c', 'D', 'E']

# Match any two-digit number that is between 00 and 59, inclusive
pattern7 = '[0-5][0-9]'
text7 = '12 34 56 78 90'

# Find all matches of pattern7 in text7
matches7 = re.findall(pattern7, text7)
print(matches7)  # Output: ['12', '34', '56']

# Match the plus sign (+) character
pattern8 = '[+]'
text8 = '1+2=3'

# Find all matches of pattern8 in text8
matches8 = re.findall(pattern8, text8)
print(matches8)  # Output: ['+']


['x', 'y', 'z']
['1', '2', '3']
['a', 'b', 'c', 'd', 'e']
['1', '2', '3', '4', '5']
['h', 'e', 'l', 'l', 'o', ' ', ' ', 'w', 'o', 'r', 'l', 'd']
['a', 'b', 'c', 'D', 'E']
['12', '34', '56']
['+']
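
Character classes become more useful once they are combined with quantifiers or used in `re.sub()`. The following is a small sketch along those lines; the sample sentence and the idea of "order codes" are made up for the example.

```python
import re

text = "Order A17 shipped, order b203 pending."  # hypothetical input

# One letter followed by one or more digits
codes = re.findall(r"[A-Za-z][0-9]+", text)
print(codes)    # ['A17', 'b203']

# Inside [...] most metacharacters are literal, so '.' means a real dot here
symbols = re.findall(r"[.,]", text)
print(symbols)  # [',', '.']

# A negated class removes everything that is not a letter or a space
letters_only = re.sub(r"[^A-Za-z ]", "", text)
print(letters_only)  # Order A shipped order b pending
```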


![image.png](Repetition_char.png)


![image.png](groups.png)


To rearrange or replace the text captured by regex groups, use the re.sub() function from the re module. In the replacement string, \1, \2, \3 refer to the first, second, and third captured group. Here's an example:

In [42]:
import re

# Define the input string
input_str = "John Smith, 25 years old"

# Define the regex pattern with a group
regex_pattern = r"(\w+)\s(\w+),\s(\d+)\syears\sold"

# Rebuild the string from the captured groups, moving the second group (the last name) to the end
output_str = re.sub(regex_pattern, r"\1, \3 years old (\2)", input_str)

# Print the output string
print(output_str)


John, 25 years old (Smith)


In [43]:
import re

# Define the input string
input_str = "John Smith, 25 years old"

# Define the regex pattern with a group
regex_pattern = r"(\w+)\s(\w+),\s(\d+)\syears\sold"

# Pass a function (here a lambda) as the replacement to transform the groups, e.g. upper-casing the first name
output_str = re.sub(regex_pattern, lambda match: f"{match.group(1).upper()}, {match.group(3)} years old ({match.group(2)})", input_str)

# Print the output string
print(output_str)


JOHN, 25 years old (Smith)
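
Numbered references such as `\1` and `\3` get hard to read as patterns grow. One alternative, sketched below, is to name the groups with `(?P<name>...)` and refer to them with `\g<name>`. The group names `first`, `last`, and `age` are chosen here for illustration and are not part of the original examples.

```python
import re

input_str = "John Smith, 25 years old"

# Same pattern as above, with names attached to the three groups
pattern = r"(?P<first>\w+)\s(?P<last>\w+),\s(?P<age>\d+)\syears\sold"

match = re.search(pattern, input_str)
info = match.groupdict()
print(info)     # {'first': 'John', 'last': 'Smith', 'age': '25'}

# In the replacement string, \g<name> inserts the text of a named group
swapped = re.sub(pattern, r"\g<last>, \g<first> (\g<age>)", input_str)
print(swapped)  # Smith, John (25)
```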


![image.png](anchor_char.png)


In [17]:
# \b
match_obj = re.search(r'\bworld', 'hello world')
print(match_obj)  # <re.Match object; span=(6, 11), match='world'>
# The regex '\b' matches a word boundary,
# which is the empty string between a word character (\w) and a non-word character.
# In this case, it matches the empty string between the space character and the 'w' character in the string 'hello world',
# and returns the match object for the word 'world'.

match_obj = re.search(r'\bworld', 'helloworld')
print(match_obj)  # None
# The regex '\b' matches only at a word boundary, 
# so None is returned because there is no word boundary before the 'w' character in the string 'helloworld'.


<re.Match object; span=(6, 11), match='world'>
None
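
A common use of `\b` is to put it on both sides of a word so that only whole words match. The sketch below uses an invented sample string to contrast a whole-word match with a prefix match.

```python
import re

text = "cat concatenate cat's category"  # hypothetical input

# \b on both sides: 'cat' must stand alone as a word
whole_words = re.findall(r"\bcat\b", text)
print(whole_words)   # ['cat', 'cat']

# \b only at the start: any word that begins with 'cat'
prefix_words = re.findall(r"\bcat\w*", text)
print(prefix_words)  # ['cat', 'cat', 'category']
```

The second whole-word match comes from `cat's`: the apostrophe is not a word character, so there is a boundary between `t` and `'`. The `cat` inside `concatenate` is never matched because it has word characters on both sides.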


## Lookahead & Lookbehind

In [52]:
import re

text = "I love eating sushi, but I hate eating wasabi with it."

# Positive Lookahead
#pattern = r'(\w+)(?:\s+)(?=sushi)'
pattern = r'\w+\s+(?=sushi)' # A(?=B) Find expression A where expression B follows
matches = re.findall(pattern, text)
print("Positive Lookahead:", matches)  # Output: Positive Lookahead: ['eating ']
# Positive Lookahead (?=sushi): This pattern matches a word plus its trailing whitespace, but only where 'sushi' comes next.
# The lookahead consumes nothing, so 'sushi' is not part of the match. Only 'eating ' (with its trailing space) is followed by 'sushi', which is why it is the only match.

# Positive Lookbehind
pattern = r'(?<=eating )\w+' # (?<=B)A Find expression A where expression B precedes
matches = re.findall(pattern, text)
print("Positive Lookbehind:", matches)  # Output: Positive Lookbehind: ['sushi', 'wasabi']
# Positive Lookbehind (?<=eating ): This pattern matches a run of word characters that is immediately preceded by 'eating ' (including the space).
# In the given text, 'sushi' and 'wasabi' both follow 'eating ', which is why they are both matches.


Positive Lookahead: ['eating ']
Positive Lookbehind: ['sushi', 'wasabi']


In [63]:
text = "I love eating sushi, but I hate eating wasabi with it."
pattern = r'\w+\s+(?!sushi)' # A(?!B) Find expression A where expression B does not follow
matches = re.findall(pattern, text)
print("negative Lookahead:", matches)

negative Lookahead: ['I ', 'love ', 'but ', 'I ', 'hate ', 'eating ', 'wasabi ', 'with ']
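
Two variations on the cells above, offered as a sketch rather than the only way to write them. First, moving the whitespace inside the lookahead keeps the trailing space out of the match. Second, a negative lookbehind `(?<!B)A` finds A where B does not precede it. The variable names are made up for this example.

```python
import re

text = "I love eating sushi, but I hate eating wasabi with it."

# Whitespace is checked by the lookahead instead of being consumed,
# so the match is the bare word
trimmed = re.findall(r"\w+(?=\s+sushi)", text)
print(trimmed)  # ['eating']

# Negative lookbehind: whole words that are NOT directly preceded by 'eating '
not_after_eating = re.findall(r"(?<!eating )\b\w+", text)
print(not_after_eating)
# ['I', 'love', 'eating', 'but', 'I', 'hate', 'eating', 'with', 'it']
```

The `\b` in the second pattern matters: without it, the regex could skip the first letter of `sushi` and match `ushi` instead.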


In [60]:
# Positive lookbehind
address = 'DE 33333'
pattern = r'(?<=[A-Z]{2} )\d{5}' # (?<=B)A Find expression A where expression B precedes
re.findall(pattern, address)

['33333']
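
One detail worth knowing about lookbehind in Python's `re` module: the pattern inside `(?<=...)` must have a fixed width. `[A-Z]{2} ` is always three characters, so it is accepted and the five digits after `DE ` are matched; a variable-width version such as `[A-Z]+ ` is rejected when the pattern is compiled. A minimal sketch, with the variable names `found` and `rejected` chosen for this example:

```python
import re

address = "DE 33333"

# Fixed-width lookbehind (two capitals and a space): accepted
found = re.findall(r"(?<=[A-Z]{2} )\d{5}", address)
print(found)  # ['33333']

# Variable-width lookbehind: re raises an error at compile time
try:
    re.compile(r"(?<=[A-Z]+ )\d{5}")
    rejected = False
except re.error as exc:
    rejected = True
    print("Rejected:", exc)
```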

## Raw strings

In [66]:
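# Using a regular string
# str1 = "C:\Users\John"  # SyntaxError: '\U' starts a Unicode escape in a regular string

# Using a raw string: backslashes are kept as they are written
str2 = r"C:\Users\John"

# print(str1)  # not run, because the line that defines str1 does not compile
print(str2)  # Output: C:\Users\John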


C:\Users\John
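
Raw strings matter for regex patterns in particular, because several regex sequences collide with Python's own string escapes. As a small sketch: in a regular string `"\b"` is a backspace character, while in a raw string it stays a backslash followed by `b`, which `re` reads as a word boundary.

```python
import re

plain = "\bworld"   # '\b' becomes a single backspace character (0x08)
raw = r"\bworld"    # backslash + 'b' are kept, so re sees a word boundary

print(len(plain), len(raw))  # 6 7

plain_result = re.search(plain, "hello world")
raw_result = re.search(raw, "hello world")

print(plain_result)        # None
print(raw_result.group())  # world
```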
