<div style="display: flex; align-items: center;">
    <img src="../img/es_logo.png" alt="title" style="margin-right: 20px;">
    <h1>Regex in Python</h1>
</div>

Regular expressions (regex) are a powerful tool for working with strings. They let you search for patterns in text and extract the information you need. In Python, the `re` module provides support for working with regular expressions.

### Why use regex?

- **Pattern matching**: You can search for specific patterns in text, such as phone numbers, email addresses, or URLs.
- **Data extraction**: You can extract specific information from text, such as dates, numbers, or names.
- **Data validation**: You can validate user input to ensure it matches a specific format or pattern.
- **Replacing text**: You can replace text that matches a pattern with other text.

#### Basic syntax

1. **literal characters**: `a`, `b`, `c`, etc. match themselves.
2. **metacharacters**: `.` (dot), `^`, `$`, `*`, `+`, `?`, `{}`, `[]`, `()`, etc. have special meanings.
3. **character classes**: `\d`, `\w`, `\s`, etc. match digits, word characters, whitespace, etc.
4. **quantifiers**: `*`, `+`, `?`, `{}`, etc. match zero or more, one or more, zero or one, or a specific number of times.
5. **anchors**: `^`, `$`, `\b`, etc. match the start or end of a string, or a word boundary.
6. **groups**: `()` groups multiple tokens together to create a subpattern.
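
As a small sketch of how these building blocks combine, the pattern below uses literal characters, a character class, a quantifier, anchors, and a group in one expression. The sample string and variable names are made up for illustration.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "order 42 shipped"

# ^ and $ anchor the pattern to the whole string, "order " is literal text,
# (\d+) is a group capturing one or more digits, and \w+ matches the last word.
pattern = r"^order (\d+) \w+$"

m = re.search(pattern, text)
if m:
    print(m.group(0))  # the whole match: order 42 shipped
    print(m.group(1))  # the first group: 42
```

`group(0)` is always the full match, while `group(1)` and higher return the text captured by each pair of parentheses.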


In [14]:
# regex literal
import re

s = r"abc 123 def 456"

# search: this will find the first occurrence of the pattern
m = re.search(r"abc", s)
print(m.group(0))


abc


#### Regex character classes

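- `\d`: Matches any digit (0-9; for `str` patterns also other Unicode decimal digits, unless `re.ASCII` is set).
- `\D`: Matches any non-digit.
- `\w`: Matches any word character (alphanumeric + underscore; Unicode letters and digits count too for `str` patterns).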
- `\W`: Matches any non-word character.
- `\s`: Matches any whitespace character (space, tab, newline).
- `\S`: Matches any non-whitespace character.
- `.`: Matches any character except newline.
- `[abc]`: Matches any character `a`, `b`, or `c`.
- `[^abc]`: Matches any character except `a`, `b`, or `c`.
- `[a-z]`: Matches any lowercase letter.
- `[A-Z]`: Matches any uppercase letter.
- `[0-9]`: Matches any digit.
- `[a-zA-Z0-9]`: Matches any alphanumeric character.
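
The following sketch illustrates a few of the classes above that are easy to confuse: `\w`, `\s`, and a negated set. The sample string is invented for the example.

```python
import re

# Hypothetical sample; "\t" is a real tab because this is not a raw string.
sample = "user_1 paid $20.50\ton 2024-01-05"

# \w covers letters, digits, and the underscore, so "user_1" stays together.
words = re.findall(r"\w+", sample)
print(words)  # ['user_1', 'paid', '20', '50', 'on', '2024', '01', '05']

# \s matches each whitespace character, including the tab.
spaces = re.findall(r"\s", sample)
print(spaces)  # [' ', ' ', '\t', ' ']

# A negated set: anything that is neither alphanumeric nor whitespace.
others = re.findall(r"[^a-zA-Z0-9\s]", sample)
print(others)  # ['_', '$', '.', '-', '-']
```

Note that the underscore shows up in the last result: it is a word character for `\w`, but it is not in `[a-zA-Z0-9]`.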

In [28]:
s = """some text
abc 123 @ def 456
more text"""

r = r"\d+"
m = re.findall(r, s)
for match in m:
    print(match)

print("------------------")

r = r"\D+"
m = re.findall(r, s)
for match in m:
    print(match)

print("------------------")

r = r"[a-zA-Z]+"
m = re.findall(r, s)
for match in m:
    print(match)

123
456
------------------
some text
abc 
 @ def 

more text
------------------
some
text
abc
def
more
text


### Meta characters

- `.`: Matches any character except newline.
- `|`: Matches either the expression before or after it.
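
A minimal sketch of both metacharacters, using an invented sample string. The last pattern also uses a non-capturing group `(?:...)`, which is not covered above but is part of the standard `re` syntax.

```python
import re

text = "cat cot cut dog"

# "." stands for any single character except a newline.
dot_matches = re.findall(r"c.t", text)
print(dot_matches)  # ['cat', 'cot', 'cut']

# "|" matches either alternative.
alt_matches = re.findall(r"cat|dog", text)
print(alt_matches)  # ['cat', 'dog']

# Alternation applies to everything on each side of "|",
# so a group is needed to limit it to part of the pattern.
grouped = re.findall(r"c(?:a|u)t", text)
print(grouped)  # ['cat', 'cut']
```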


#### Quantifiers

- `*`: Matches zero or more occurrences of the preceding token (a character, character class, or group).
- `+`: Matches one or more occurrences of the preceding token.
- `?`: Matches zero or one occurrence of the preceding token.
- `{n}`: Matches exactly `n` occurrences of the preceding token.
- `{n,}`: Matches `n` or more occurrences of the preceding token.
- `{n,m}`: Matches between `n` and `m` occurrences (inclusive) of the preceding token.
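
To make the differences concrete, here is a short sketch that runs each quantifier against the same invented string. The `\b` word boundaries in the counted examples keep a match from stopping in the middle of a longer run of `b`s.

```python
import re

text = "a ab abb abbb abbbb"

star = re.findall(r"ab*", text)        # zero or more b
print(star)      # ['a', 'ab', 'abb', 'abbb', 'abbbb']

plus = re.findall(r"ab+", text)        # one or more b
print(plus)      # ['ab', 'abb', 'abbb', 'abbbb']

optional = re.findall(r"ab?", text)    # zero or one b
print(optional)  # ['a', 'ab', 'ab', 'ab', 'ab']

exactly = re.findall(r"\bab{2}\b", text)     # exactly two b
print(exactly)   # ['abb']

at_least = re.findall(r"\bab{2,}\b", text)   # two or more b
print(at_least)  # ['abb', 'abbb', 'abbbb']

between = re.findall(r"\bab{2,3}\b", text)   # two or three b
print(between)   # ['abb', 'abbb']

# A quantifier after a group repeats the whole group, not just one character.
repeated = re.findall(r"(?:ab)+", "ababab ab")
print(repeated)  # ['ababab', 'ab']
```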

#### Anchors

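- `^`: Matches the start of the string (and the start of each line when `re.MULTILINE` is set).
- `$`: Matches the end of the string or just before a trailing newline (and the end of each line when `re.MULTILINE` is set).
- `\b`: Matches a word boundary.
- `\B`: Matches a position that is not a word boundary.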
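
The sketch below shows how the anchors behave on a two-line string, with and without the `re.MULTILINE` flag. The sample text is invented for illustration.

```python
import re

text = "cat concat\ncatalog cat"

# Without flags, ^ and $ refer to the string as a whole.
first_word = re.findall(r"^\w+", text)
print(first_word)   # ['cat']
last_word = re.findall(r"\w+$", text)
print(last_word)    # ['cat']

# With re.MULTILINE, ^ and $ also match at the start and end of each line.
line_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)
print(line_starts)  # ['cat', 'catalog']
line_ends = re.findall(r"\w+$", text, flags=re.MULTILINE)
print(line_ends)    # ['concat', 'cat']

# \b requires a word boundary on both sides, so only the standalone "cat" matches.
whole_words = re.findall(r"\bcat\b", text)
print(whole_words)  # ['cat', 'cat']

# \B requires that there is no boundary, so this finds the "cat" inside "concat".
inner = re.search(r"\Bcat", text)
print(inner.start())  # 7
```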

In [52]:
s = """A long string with some numbers like 1234567890 and some special characters like !@#$%^&*()
and some spaces                       
and some new lines


and some tabs       

and some more text"""

r = r".+"
m = re.findall(r, s)
for match in m:
    print(match)

print("------------------")

r = r"long string|\d+|spec..\B"
m = re.findall(r, s)
for match in m:
    print(match)

print("------------------")

r = r"\s+[a-zA-Z0-9]+$"
m = re.findall(r, s)
for match in m:
    print(match)

print("------------------")

r = r"\s+[a-zA-Z0-9]+\s+"
m = re.findall(r, s)
for match in m:
    print(match)

A long string with some numbers like 1234567890 and some special characters like !@#$%^&*()
and some spaces                       
and some new lines
and some tabs       
and some more text
------------------
long string
1234567890
specia
------------------
 text
------------------
 long 
 with 
 numbers 
 1234567890 
 some 
 characters 

and 
 spaces                       

 some 
 lines



 some 
       

and 
 more 


### Regex functions in Python

- `re.match()`: Matches the pattern at the beginning of the string.
- `re.search()`: Searches for the pattern anywhere in the string.
- `re.findall()`: Finds all occurrences of the pattern in the string.
- `re.split()`: Splits the string based on the pattern.
- `re.sub()`: Replaces the pattern with a new string.
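
One practical detail: `re.match()` and `re.search()` return `None` when nothing matches, so calling `.group()` on the result without a check raises an `AttributeError`. A minimal sketch of guarding against that, using a hypothetical helper named `first_number`:

```python
import re

text = "abc 123 def 456"

def first_number(s):
    """Return the first run of digits in s, or None if there is none."""
    m = re.search(r"\d+", s)
    return m.group(0) if m else None

print(first_number(text))         # 123
print(first_number("no digits"))  # None

# re.match only tries the start of the string, so it finds nothing here
# even though the string contains digits further along.
print(re.match(r"\d+", text))     # None
```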

In [62]:
# match: this will match the pattern from the beginning of the string
m = re.match(r"A", s)
print(m)
m = re.match(r"long", s)
print(m)

# search: this will find the first occurrence of the pattern
m = re.search(r"long", s)
print(m.group(0))

# findall: this will return all the occurrences of the pattern
m = re.findall(r"long", s)
print(m)

# split: this will split the string based on the pattern
m = re.split(r"\s+", s)
print(m)

# sub: this will substitute the pattern with the given string
m = re.sub(r"\s+", " ", s)
print(m)

<re.Match object; span=(0, 1), match='A'>
None
long
['long']
['A', 'long', 'string', 'with', 'some', 'numbers', 'like', '1234567890', 'and', 'some', 'special', 'characters', 'like', '!@#$%^&*()', 'and', 'some', 'spaces', 'and', 'some', 'new', 'lines', 'and', 'some', 'tabs', 'and', 'some', 'more', 'text']
A long string with some numbers like 1234567890 and some special characters like !@#$%^&*() and some spaces and some new lines and some tabs and some more text


### Validation

Regex can be used to validate user input, such as email addresses, phone numbers, or URLs. For example, the following function checks a phone number against a fixed set of prefixes and digit layouts.

In [71]:
# validation example

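def validate_phone_number(phone_number):
    # Accepted prefix (+9627, 009627, or 07), then one digit from 7 to 9,
    # then seven digits written either together or as -ddd-dddd.
    r = r"^(\+9627|009627|07)[7-9](\d{7}|-\d{3}-\d{4})$"

    # re.match returns a Match object on success and None otherwise
    return re.match(r, phone_number) is not None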
    
phone_number = "0791234567"
print(validate_phone_number(phone_number))

phone_number = "079-123-4567"
print(validate_phone_number(phone_number))

phone_number = "+962791234567"
print(validate_phone_number(phone_number))
    

True
True
True
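
The same approach can be sketched for email addresses. The pattern below is a deliberately simplified assumption, not a complete or standards-compliant email check, and `validate_email` is a hypothetical helper name. It uses `re.fullmatch()`, which requires the whole string to match and so also rejects a trailing newline.

```python
import re

# Simplified pattern (an assumption for illustration): a local part,
# an "@", and a domain made of at least two dot-separated labels.
EMAIL_PATTERN = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"

def validate_email(address):
    """Return True if address matches the simplified pattern above."""
    return re.fullmatch(EMAIL_PATTERN, address) is not None

print(validate_email("user@example.com"))    # True
print(validate_email("user@example"))        # False (no dot in the domain)
print(validate_email("user example.com"))    # False (no "@")
print(validate_email("user@example.com\n"))  # False (trailing newline)
```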


A very useful tool for creating and testing regex patterns interactively is [RegExr](https://regexr.com/).