
## Regular expressions

- Regular expressions (regex) in Python are a powerful tool for pattern matching and manipulation of strings.

- They provide a way to search, match, and manipulate text based on specific patterns, making them invaluable for tasks such as data validation, parsing, and transformation.

### Key Concepts of Regex

1. **Pattern**: A regex pattern is a sequence of characters that defines a search criterion. Patterns can include literal characters, special characters, and various syntax constructs.

2. **Special Characters**:
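   - `.`: Matches any character except a newline (or any character at all when `re.DOTALL` is set).
   - `^`: Matches the start of a string (and the start of each line when `re.MULTILINE` is set).
   - `$`: Matches the end of a string or just before a trailing newline (and the end of each line when `re.MULTILINE` is set).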
   - `*`: Matches 0 or more repetitions of the preceding element.
   - `+`: Matches 1 or more repetitions of the preceding element.
   - `?`: Matches 0 or 1 repetition of the preceding element (makes it optional).
   - `{n}`: Matches exactly n repetitions of the preceding element.
   - `{n,}`: Matches n or more repetitions.
   - `{n,m}`: Matches between n and m repetitions.
   - `[]`: Matches any one of the characters inside the brackets.
   - `|`: Acts as a logical OR between patterns.
   - `\`: Escapes special characters.

3. **Character Classes**:
   - `\d`: Matches any digit (equivalent to `[0-9]` for ASCII text; with `str` patterns it also matches other Unicode decimal digits unless `re.ASCII` is set).
   - `\D`: Matches any non-digit.
   - `\w`: Matches any word character (alphanumeric plus underscore).
   - `\W`: Matches any non-word character.
   - `\s`: Matches any whitespace character (spaces, tabs, newlines).
   - `\S`: Matches any non-whitespace character.
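
The special characters and character classes listed above can be tried out in a few lines. This is a minimal sketch; the sample strings and variable names are made up for illustration.

```python
import re

# Optional element: "u?" makes the "u" optional.
optional = re.findall(r"colou?r", "color and colour")

# Bounded repetition: whole tokens of 2 to 3 digits.
bounded = re.findall(r"\b\d{2,3}\b", "7 42 123 4567")

# Character set and alternation.
vowels = re.findall(r"[aeiou]", "regex")
pets = re.findall(r"cat|dog", "cat, dog, bird")

# Escaping: r"\." is a literal dot, while a bare "." matches any character.
literal_dot = re.findall(r"a\.b", "a.b axb")
any_char = re.findall(r"a.b", "a.b axb")

# Anchors: "^" and "$" tie the pattern to the start and end of the string.
anchored = bool(re.search(r"^\d+$", "12345"))
not_anchored = bool(re.search(r"^\d+$", "123 45"))

# Character classes.
numbers = re.findall(r"\d+", "Room 101, floor 3")
tokens = re.split(r"\s+", "a b\tc\nd")

print(optional)      # ['color', 'colour']
print(bounded)       # ['42', '123']
print(vowels, pets)  # ['e', 'e'] ['cat', 'dog']
print(literal_dot)   # ['a.b']
print(any_char)      # ['a.b', 'axb']
print(anchored, not_anchored)  # True False
print(numbers)       # ['101', '3']
print(tokens)        # ['a', 'b', 'c', 'd']
```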

### Built-in `re` Module Functions

1. **`re.match()`**: Checks for a match only at the beginning of the string.
2. **`re.search()`**: Scans the string and returns the first match found anywhere in it.
3. **`re.findall()`**: Returns a list of all matches in the string.
4. **`re.finditer()`**: Returns an iterator yielding match objects for all matches.
5. **`re.sub()`**: Replaces occurrences of the pattern with a specified string.
6. **`re.split()`**: Splits the string by occurrences of the pattern.
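
As a quick illustration of how `re.match()`, `re.search()`, and `re.finditer()` differ, the sketch below runs the same pattern through all three. The sample sentence is invented for this example.

```python
import re

text = "Order 66 shipped on day 7 of week 12."

# re.match() only tries the start of the string, so it finds nothing here.
at_start = re.match(r"\d+", text)

# re.search() scans forward and returns the first match.
first = re.search(r"\d+", text)

# re.finditer() yields one match object per match, including its position.
positions = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\d+", text)]

print(at_start)       # None
print(first.group())  # 66
print(positions)      # [('66', 6, 8), ('7', 24, 25), ('12', 34, 36)]
```

Match objects carry the matched text and its position, which `re.findall()` does not return.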

In [1]:
#### 1. Basic Matching
import re

text = "The rain in Spain stays mainly in the plain."

# Match a pattern at the start of the string
match = re.match(r"The", text)
if match:
    print("Match found:", match.group())

# Search for a pattern anywhere in the string
search = re.search(r"Spain", text)
if search:
    print("Search found:", search.group())

Match found: The
Search found: Spain


In [4]:
#### 2. Finding All Matches

text = "Cats are great pets. Dogs are great pets too."

# Find all occurrences of the word "great"
matches = re.findall(r"great", text)
print("All matches:", matches)

All matches: ['great', 'great']


In [5]:
#### 3. Substitution

text = "Hello, John Doe. Welcome, John!"

# Replace "John" with "Jane"
new_text = re.sub(r"John", "Jane", text)
print("New text:", new_text)

New text: Hello, Jane Doe. Welcome, Jane!


In [None]:
#### 4. Splitting Strings

text = "apple, banana; cherry orange: grape"

# Split the string by various delimiters
fruits = re.split(r"[;,: ]+", text)
print("Fruits list:", fruits)
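
Two variations on the split above may be useful: limiting the number of splits with `maxsplit`, and keeping the delimiters by wrapping them in a capturing group. This is a minimal sketch; the variable names are illustrative.

```python
import re

text = "apple, banana; cherry orange: grape"

# Split on runs of commas, semicolons, colons, and spaces.
fruits = re.split(r"[;,: ]+", text)

# maxsplit=2 stops after two splits and leaves the remainder untouched.
first_two = re.split(r"[;,: ]+", text, maxsplit=2)

# A capturing group in the pattern keeps the delimiters in the result.
with_delims = re.split(r"([;,:])\s*", "a, b; c")

print(fruits)       # ['apple', 'banana', 'cherry', 'orange', 'grape']
print(first_two)    # ['apple', 'banana', 'cherry orange: grape']
print(with_delims)  # ['a', ',', 'b', ';', 'c']
```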

### Grouping and Capturing

You can use parentheses to create groups within patterns, allowing you to capture specific parts of a match.

In [2]:
text = "John's phone number is 123-456-7890."

# Match and capture the phone number format
match = re.search(r"(\d{3})-(\d{3})-(\d{4})", text)
if match:
    print("Area Code:", match.group(1))
    print("Central Office Code:", match.group(2))
    print("Line Number:", match.group(3))

Area Code: 123
Central Office Code: 456
Line Number: 7890
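
Groups can also be named with the `(?P<name>...)` syntax, which tends to read better than numeric indexes. The sketch below reuses the phone number example; the group names `area`, `office`, and `line` are chosen here for illustration.

```python
import re

text = "John's phone number is 123-456-7890."

# (?P<name>...) assigns a name to each capturing group.
pattern = r"(?P<area>\d{3})-(?P<office>\d{3})-(?P<line>\d{4})"
match = re.search(pattern, text)

if match:
    area = match.group("area")   # access a group by name
    parts = match.groupdict()    # all named groups as a dict
    in_order = match.groups()    # all groups as a tuple
    print(area)      # 123
    print(parts)     # {'area': '123', 'office': '456', 'line': '7890'}
    print(in_order)  # ('123', '456', '7890')
```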


In [3]:
# Validate an email address against a simple pattern
import re

email_pattern = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
email = "example@example.com"

if re.match(email_pattern, email):
    print("Valid email address.")
else:
    print("Invalid email address.")


Valid email address.
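
One subtlety of anchoring with `$` is that it also matches just before a trailing newline, so `re.match()` with `^...$` accepts `"example@example.com\n"`. A possible alternative is `re.fullmatch()`, which requires the whole string to match. The helper name `is_valid_email` and the constant `EMAIL_RE` below are hypothetical, and the pattern is the same simplified one as above, not a complete email validator.

```python
import re

# Same simplified pattern as above, without the ^ and $ anchors.
EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

def is_valid_email(candidate):
    """Return True if the entire string matches the simplified pattern."""
    return EMAIL_RE.fullmatch(candidate) is not None

print(is_valid_email("example@example.com"))    # True
print(is_valid_email("example@example.com\n"))  # False
print(is_valid_email("not-an-email"))           # False
```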


In [4]:
text = "Important dates: 2023-11-01, 11/02/2023, and 01-Nov-2023."

# Pattern to match dates in YYYY-MM-DD, DD/MM/YYYY, and DD-MMM-YYYY formats
date_pattern = r"(\d{4}-\d{2}-\d{2}|\d{2}/\d{2}/\d{4}|\d{2}-[A-Za-z]{3}-\d{4})"
dates = re.findall(date_pattern, text)
print("Extracted dates:", dates)


Extracted dates: ['2023-11-01', '11/02/2023', '01-Nov-2023']


In [5]:
text = "This is a test of regex functionality."

# Find all words of exactly 4 word characters (\w also matches digits and underscores)
four_letter_words = re.findall(r"\b\w{4}\b", text)
print("Four-letter words:", four_letter_words)


Four-letter words: ['This', 'test']


In [6]:
# Match http:// or https://, a host name, and an optional path
url_pattern = r"https?://[a-zA-Z0-9.-]+(?:/[^\s]*)?"

text = "Visit our website at https://www.example.com for more info."

if re.search(url_pattern, text):
    print("URL found in text.")
else:
    print("No URL found.")


URL found in text.
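
The `(?:...)` in the URL pattern is a non-capturing group: it groups the optional path without creating a capture. This matters for `re.findall()`, which returns group contents instead of whole matches once a pattern contains a capturing group. The sketch below shows the difference on a made-up sentence.

```python
import re

text = "Docs: http://example.org/guide and https://www.example.com for more info."

# Non-capturing group: findall returns the full matched URLs.
url_pattern = r"https?://[a-zA-Z0-9.-]+(?:/[^\s]*)?"
urls = re.findall(url_pattern, text)

# Capturing group: findall returns only what the group captured.
capturing_pattern = r"https?://[a-zA-Z0-9.-]+(/[^\s]*)?"
paths_only = re.findall(capturing_pattern, text)

print(urls)        # ['http://example.org/guide', 'https://www.example.com']
print(paths_only)  # ['/guide', '']
```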


In [7]:
text = "Call me at 123-456-7890 or (987) 654-3210. My office number is 555.123.4567."

# Pattern to match different phone number formats:
# "(123) 456-7890", or three digit groups separated by "-", ".", or whitespace
phone_pattern = r"(\(\d{3}\)\s?\d{3}[-.\s]?\d{4}|\d{3}[-.\s]?\d{3}[-.\s]?\d{4})"
phone_numbers = re.findall(phone_pattern, text)
print("Extracted phone numbers:", phone_numbers)


Extracted phone numbers: ['123-456-7890', '(987) 654-3210', '555.123.4567']


In [8]:
text = "Madam, in Eden I'm Adam. Racecar is also a palindrome."

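# Extract alphabetic words, then keep those that read the same forwards and
# backwards (case-insensitive); single letters qualify trivially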
palindrome_pattern = r"\b([A-Za-z]+)\b"
words = re.findall(palindrome_pattern, text)

palindromes = [word for word in words if word.lower() == word[::-1].lower()]
print("Palindromic words:", palindromes)


Palindromic words: ['Madam', 'I', 'm', 'Racecar', 'a']
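
The result above includes the single letters `I`, `m`, and `a`, which are palindromes only in a trivial sense. If those are unwanted, one option is to require a minimum length, as in this sketch (the two-character threshold is an assumption for this example).

```python
import re

text = "Madam, in Eden I'm Adam. Racecar is also a palindrome."

# Extract alphabetic words.
words = re.findall(r"\b[A-Za-z]+\b", text)

# Keep words of at least two letters that read the same reversed.
palindromes = [
    word for word in words
    if len(word) > 1 and word.lower() == word.lower()[::-1]
]
print("Palindromic words:", palindromes)  # Palindromic words: ['Madam', 'Racecar']
```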
