# Introduction to Regular Expressions in Python

In this notebook, we will learn the basics of **Regular Expressions (regex)**, which are sequences of characters that define a search pattern. We will use Python's built-in `re` module to perform regex operations. This notebook will guide you through various regex concepts and demonstrate how to use them in practice.


The topics covered are:
- Simple pattern matching
- Special characters in regex
- Using grouping and capturing
- Advanced concepts like lookahead and lookbehind
- Practical use cases like email validation

In [12]:
import re

### 1. Basic Patterns and Matching

A basic regex pattern can be a simple string that matches exactly that sequence of characters. Let's start with a simple example.


In [13]:
# Simple match: Searching for an exact string
pattern = r'hello'  # Raw string to avoid backslashes being treated as escape characters
text = "hello world"

# re.search() looks for the pattern in the given string and returns a match object if found
match = re.search(pattern, text)

if match:
    print("Match found:", match.group())  # match.group() returns the matched text
else:
    print("No match found")


Match found: hello
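`re.search()` is one of several matching functions in the `re` module. As a small sketch of how it differs from `re.match()` and `re.fullmatch()`, the snippet below runs the same pattern through all three. The sample strings and variable names are made up for illustration.

```python
import re

text = "say hello world"  # illustrative sample text

# re.search() scans the whole string for the first occurrence
found = re.search(r'hello', text)
print(found.span())   # (4, 9)

# re.match() only tries to match at the very start of the string
at_start = re.match(r'hello', text)
print(at_start)       # None, because the text starts with "say"

# re.fullmatch() requires the pattern to cover the entire string
whole = re.fullmatch(r'hello', "hello")
partial = re.fullmatch(r'hello', "hello world")
print(whole is not None, partial)  # True None
```

For "is this pattern anywhere in the text?" questions, `re.search()` is usually the right choice; `re.fullmatch()` fits validation-style checks.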


### 2. Special Characters in Regex

Regular expressions use special characters to define complex patterns. Below are some key special characters:

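- `.`: Matches any single character except a newline (unless the `re.DOTALL` flag is set)
- `^`: Matches the start of a string (and the start of each line when `re.MULTILINE` is set)
- `$`: Matches the end of a string, or the position just before a trailing newline
- `*`: Matches 0 or more repetitions of the preceding character or group
- `+`: Matches 1 or more repetitions of the preceding character or group
- `?`: Matches 0 or 1 repetition of the preceding character or group
- `\d`: Matches any decimal digit. For `str` patterns this includes non-ASCII Unicode digits; use `[0-9]` or the `re.ASCII` flag to match only 0-9
- `\w`: Matches any word character (letters, digits, or underscore; for `str` patterns this includes Unicode letters)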

Let's see these special characters in action.


In [14]:
# Example of '.' (dot) which matches any character except newline
pattern = r'h.llo'
text = "hello"
match = re.search(pattern, text)

if match:
    print("Match found:", match.group())
else:
    print("No match found")


Match found: hello


In [15]:
# Example of '^' to match the beginning of a string
pattern = r'^hello'  # '^hello' matches if the string starts with 'hello'
text = "hello world"
match = re.search(pattern, text)

if match:
    print("Match found:", match.group())
else:
    print("No match found")


Match found: hello


In [16]:
# Example of '$' to match the end of a string
pattern = r'world$'  # 'world$' matches if the string ends with 'world'
text = "hello world"
match = re.search(pattern, text)

if match:
    print("Match found:", match.group())
else:
    print("No match found")


Match found: world
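The anchors `^` and `$` behave slightly differently on multi-line text. The following sketch, using a made-up two-line string, shows the effect of the `re.MULTILINE` flag and how `$` treats a trailing newline.

```python
import re

text = "hello world\nhello again"  # illustrative two-line string

# Without flags, '^' anchors only at the very start of the string
default_hits = re.findall(r'^hello', text)
# With re.MULTILINE, '^' also matches right after each newline
multiline_hits = re.findall(r'^hello', text, flags=re.MULTILINE)

print(default_hits)    # ['hello']
print(multiline_hits)  # ['hello', 'hello']

# '$' also matches just before a single trailing newline
trailing_ok = re.search(r'world$', "hello world\n") is not None
print(trailing_ok)     # True
```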


In [17]:
# Example of '*' to match 0 or more repetitions of a character
pattern = r'lo*'  # Matches 'l' followed by 0 or more 'o's
text = "loooo"
match = re.search(pattern, text)

if match:
    print("Match found:", match.group())
else:
    print("No match found")


Match found: loooo
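The remaining characters from the list (`+`, `?`, `\d`, and `\w`) can be tried out the same way. This is a minimal sketch with made-up sample strings.

```python
import re

# '+' requires at least one repetition, unlike '*'
plus_none = re.search(r'lo+', "l")
plus_many = re.search(r'lo+', "loooo")
print(plus_none)            # None: there is no 'o' after 'l'
print(plus_many.group())    # loooo

# '?' makes the preceding character optional
optional_u = [bool(re.fullmatch(r'colou?r', w)) for w in ["color", "colour"]]
print(optional_u)           # [True, True]

# '\d' matches a single digit; '\w+' matches a run of word characters
digits = re.findall(r'\d', "a1b22")
words = re.findall(r'\w+', "snake_case and CamelCase!")
print(digits)               # ['1', '2', '2']
print(words)                # ['snake_case', 'and', 'CamelCase']
```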


### 3. Grouping and Capturing

You can group parts of a pattern using parentheses to capture sub-patterns. Let's look at an example:

In [19]:
# Grouping with parentheses
pattern = r'(\d+)\s+(\w+)'  # Matches a number, one or more whitespace characters, then a word
text = "123 abc"
match = re.search(pattern, text)

if match:
    print("Match found:", match.group())  # Full match
    print("Group 1 (Number):", match.group(1))  # First captured group (the number)
    print("Group 2 (Word):", match.group(2))  # Second captured group (the word)
else:
    print("No match found")


Match found: 123 abc
Group 1 (Number): 123
Group 2 (Word): abc
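Groups can also be given names with the `(?P<name>...)` syntax, which can make longer patterns easier to read. The sketch below reuses the same sample text; the group names `number` and `word` are chosen here purely for illustration.

```python
import re

# Same pattern as above, but each group has a name
pattern = r'(?P<number>\d+)\s+(?P<word>\w+)'
match = re.search(pattern, "123 abc")

if match:
    all_groups = match.groups()       # every captured group as a tuple
    number = match.group('number')    # look a group up by name
    as_dict = match.groupdict()       # names mapped to captured text
    print(all_groups)   # ('123', 'abc')
    print(number)       # 123
    print(as_dict)      # {'number': '123', 'word': 'abc'}
```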


### 4. Lookahead and Lookbehind

Lookahead and lookbehind are advanced regex features that allow you to assert whether a pattern is (or is not) followed or preceded by another pattern. These are called **lookaround assertions**.

- Positive Lookahead `(?=...)`: Asserts that what follows the current position matches the pattern inside the lookahead.
- Negative Lookahead `(?!...)`: Asserts that what follows does **not** match the pattern inside the lookahead.
- Positive Lookbehind `(?<=...)`: Asserts that what precedes the current position matches the pattern inside the lookbehind.
- Negative Lookbehind `(?<!...)`: Asserts that what precedes does **not** match the pattern inside the lookbehind.

Let's look at an example.


In [21]:
# Positive Lookahead Example
pattern = r'\d+(?=\s+abc)'  # Matches digits only if followed by whitespace and 'abc'; the lookahead text is not part of the match
text = "123 abc"
match = re.search(pattern, text)

if match:
    print("Match found:", match.group())
else:
    print("No match found")


Match found: 123


In [22]:
# Negative Lookahead Example
pattern = r'\d+(?!\s+abc)'  # Matches digits that end at a position not followed by whitespace and 'abc'
text = "123 def"
match = re.search(pattern, text)

if match:
    print("Match found:", match.group())
else:
    print("No match found")


Match found: 123
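The two lookbehind forms listed above work the same way, but look to the left of the current position. The sketch below shows both on a made-up string, and then a caveat with negative lookahead: because `\d+` can give back digits, the pattern may still match a shorter run of digits.

```python
import re

text = "cost $40, weight 15"  # illustrative sample text

# Positive lookbehind: digits preceded by a dollar sign
priced = re.findall(r'(?<=\$)\d+', text)
print(priced)      # ['40']

# Negative lookbehind: digits not preceded by '$' or by another digit
unpriced = re.findall(r'(?<![\$\d])\d+', text)
print(unpriced)    # ['15']

# Caveat: '\d+' can backtrack, so the negative lookahead alone still matches
shortened = re.search(r'\d+(?!\s+abc)', "123 abc")
print(shortened.group())   # 12

# Adding (?!\d) prevents the run of digits from being cut short
strict = re.search(r'\d+(?!\d)(?!\s+abc)', "123 abc")
print(strict)      # None
```

In the caveat, the engine first tries `123`, sees that ` abc` follows, then retries with `12`, which is followed by `3` rather than whitespace, so the lookahead succeeds.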


### 5. Practical Use Case: Email Validation

Let's use regular expressions to validate email addresses. A commonly used simplified pattern, which does not cover every address the email standards allow, checks for:

- One or more letters, digits, or the characters `.`, `_`, `%`, `+`, `-` (the local part)
- Followed by an `@` symbol
- Then one or more letters, digits, dots, or hyphens (the domain name)
- Finally, a period (`.`) followed by a domain extension of at least two letters

Here’s an example of using regex for email validation.


In [23]:
# Email validation pattern
pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'  # Basic email pattern
email = "example@domain.com"

if re.fullmatch(pattern, email):  # fullmatch also rejects a trailing newline, which '$' alone would allow
    print(f"'{email}' is a valid email address.")
else:
    print(f"'{email}' is not a valid email address.")


'example@domain.com' is a valid email address.
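To see how the basic pattern behaves on more than one input, it can be wrapped in a small helper. This is only a sketch: the function name `looks_like_email` and the candidate addresses are hypothetical, and `re.fullmatch()` is used here so the anchors are not needed.

```python
import re

# Same character classes as the basic pattern, compiled once for reuse
EMAIL_PATTERN = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')

def looks_like_email(address):
    """Return True if the address fits the basic pattern (hypothetical helper)."""
    return EMAIL_PATTERN.fullmatch(address) is not None

candidates = [
    "example@domain.com",           # fits the pattern
    "first.last@mail.example.org",  # subdomains are allowed
    "no-at-sign.com",               # missing '@'
    "user@domain",                  # missing domain extension
    "user@domain..com",             # accepted, although the domain is malformed
]
results = {c: looks_like_email(c) for c in candidates}
for address, ok in results.items():
    print(address, ok)
```

The last candidate shows a limit of the basic pattern: it accepts consecutive dots in the domain, so a stricter check would need a more detailed pattern or a dedicated validation library.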


### 6. Using Regex to Find All Matches

Sometimes you may want to find **all** occurrences of a pattern in a string. You can use `re.findall()` for this purpose, which returns a list of all non-overlapping matches as strings.

Let’s look at an example of finding all numbers (runs of digits) in a string.


In [24]:
# Find all runs of digits in a string
pattern = r'\d+'  # Matches one or more digits
text = "The price is 100 dollars and the discount is 20%"

matches = re.findall(pattern, text)
print("All digit matches:", matches)


All digit matches: ['100', '20']
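Two related details are worth trying out: when the pattern contains capturing groups, `re.findall()` returns the groups instead of the whole match, and `re.finditer()` yields match objects that also carry positions. The sketch below reuses the same sample text; the grouped pattern is made up for illustration.

```python
import re

text = "The price is 100 dollars and the discount is 20%"

# With capturing groups, findall() returns a tuple of groups per match
pairs = re.findall(r'(\w+) is (\d+)', text)
print(pairs)  # [('price', '100'), ('discount', '20')]

# finditer() yields match objects, which expose the matched text and its span
spans = []
for m in re.finditer(r'\d+', text):
    spans.append((m.group(), m.span()))
    print(m.group(), "at", m.span())
```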
