## 1: Introduction to Regular Expressions

### What are Regular Expressions?
- Regular expressions, often abbreviated as regex, are powerful tools for **pattern matching** and **text manipulation**.
- They provide a concise and flexible way to search, extract, and manipulate strings based on specific patterns.

- A regular expression is a sequence of characters that defines a search pattern.
- This pattern can be used to match and manipulate text in various contexts, such as searching for specific words or patterns in a document, validating input strings, or extracting specific information from a larger dataset.

### Benefits of Using Regular Expressions
Regular expressions offer several benefits when working with text data:

- **Pattern matching**: Regular expressions allow you to define complex patterns to match specific strings or patterns within larger texts. This makes it easier to identify and extract relevant information from text data.

- **Flexibility**: Regular expressions provide a flexible syntax that allows you to define intricate patterns and handle a wide range of text manipulation tasks. They can be used to search, replace, or validate strings based on complex rules and conditions.

- **Efficiency**: Regular expression engines are generally fast for common search and replace tasks and can scan large amounts of text quickly. Poorly written patterns (for example, ones with nested quantifiers) can still cause heavy backtracking, so performance depends on how the pattern is written.

### Basic Syntax of Regular Expressions

The most commonly used metacharacters are:


- `.` (dot): Matches any single character except a newline.
- `*` (asterisk): Matches zero or more occurrences of the preceding character or group.
- `+` (plus): Matches one or more occurrences of the preceding character or group.
- `?` (question mark): Matches zero or one occurrence of the preceding character or group.
- `[ ]` (square brackets): Matches any single character within the specified range or set.
- `|` (pipe): Acts as an OR operator, matching either the pattern before or after the pipe.
- `^` (caret): Matches the beginning of a line or string.
- `$` (dollar sign): Matches the end of a line or string.

Regular expressions also support various special sequences, such as `\d` for matching digits, `\w` for matching word characters, and `\s` for matching whitespace characters.

In [4]:
import re

text = "Hello, 123 World!456,7"
pattern = r"\d+"
matches = re.findall(pattern, text)
print(matches)

#The regular expression pattern r"\d+" matches one or more digits.
#The findall() function returns a list of all matches found in the text.

['123', '456', '7']
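
As a rough illustration of the metacharacters listed above, the sketch below runs each one against a small sample string. The sample strings are made up for this example and do not come from the original text.

```python
import re

# Each tuple is (pattern, sample text); the samples are illustrative only.
examples = [
    (r"c.t", "cat cot c\nt"),       # dot: any character except a newline
    (r"ab*", "a ab abbb"),          # asterisk: zero or more b's
    (r"ab+", "a ab abbb"),          # plus: one or more b's
    (r"colou?r", "color colour"),   # question mark: optional u
    (r"[xyz]", "lazy fox"),         # square brackets: any one of x, y, z
    (r"cat|dog", "cat dog bird"),   # pipe: either alternative
    (r"^The", "The end of The"),    # caret: start of the string
    (r"end$", "end to end"),        # dollar sign: end of the string
]

results = {pattern: re.findall(pattern, text) for pattern, text in examples}
for pattern, found in results.items():
    print(f"{pattern!r:12} -> {found}")
```

Note that `c.t` does not match `"c\nt"`, because the dot skips newlines by default.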


## 2: Matching Patterns

### Matching Text Literals
- Regular expressions can be used to match specific text literals in a string. Text literals are simply the exact characters you want to match.

In [5]:
import re

# Example: Matching text literals
text = "Hello, World!"
pattern = r"Hello"
matches = re.findall(pattern, text)
print(matches)  # Output: ['Hello']

['Hello']


- The regular expression pattern `r"Hello"` is used to match the text literal "Hello" in the `text` string.
- The `findall()` function returns a list containing all occurrences of the text literal found in the text.

### Matching Character Classes
- Character classes allow you to match a specific set of characters. They are defined using square brackets `[ ]` and can include a range of characters or individual characters.

- Here's an example that demonstrates matching character classes using regular expressions:

In [7]:
import re

# Example: Matching character classes
text = "Hello, World!"
pattern = r"[aeiou]"
matches = re.findall(pattern, text)
print(matches)  # Output: ['e', 'o', 'o']

['e', 'o', 'o']


- In this example, the regular expression pattern `r"[aeiou]"` matches any lowercase vowel character (a, e, i, o, u) in the `text` string. The `findall()` function returns a list of all vowel characters found in the text.

### Metacharacters and Escaping
- Metacharacters have special meanings in regular expressions.
- If you want to match a metacharacter as a literal character, you need to escape it using a backslash `\`.

- Here's an example that demonstrates escaping metacharacters using regular expressions:

In [18]:
import re

# Example: Escaping metacharacters
text = "Hello, World* ."
pattern = r"[\*\.]" #any * or .
matches = re.findall(pattern, text)
print(matches)  

['*', '.']


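In this example, the regular expression pattern `r"[\*\.]"` is a character set that matches a literal asterisk or a literal period (dot) in the `text` string. The backslash `\` before each metacharacter indicates that it should be treated as a literal character. Inside square brackets, `*` and `.` are already literal, so the escapes are optional here; outside a character set they are required, as in `r"\."`.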

### Quantifiers and Repetition
- Quantifiers allow you to specify the number of occurrences of a character or group in a pattern.
- They control the repetition of the preceding element.

- Here's an example that demonstrates using quantifiers for repetition using regular expressions:

In [28]:
import re

# Example: Using quantifiers for repetition
text = "Hellooo, World!"
matches = re.findall(r"l{2}", text) # 2 consecutive l's
print(matches)  
matches = re.findall(r"o{2,}", text) # 2 or more consecutive o's
print(matches)  


['ll']
['ooo']


In this example, the pattern `r"l{2}"` matches exactly two consecutive "l" characters, and the pattern `r"o{2,}"` matches two or more consecutive "o" characters in the `text` string. The `{2,}` quantifier specifies that the preceding "o" should occur at least twice.
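
The remaining brace forms work the same way. The following is a minimal sketch on a made-up sample string; it uses `\b` (word boundary) and a non-capturing group `(?:...)`, both of which are assumptions of this example rather than part of the cell above.

```python
import re

# Hypothetical sample: runs of the letter "a" of different lengths.
text = "a aa aaa aaaa"

# \b anchors each run to a whole word, so a longer run is not
# partially matched by a shorter quantifier.
exactly_two = re.findall(r"\ba{2}\b", text)      # exactly 2
two_to_three = re.findall(r"\ba{2,3}\b", text)   # between 2 and 3
one_or_two = re.findall(r"\ba{1,2}\b", text)     # between 1 and 2

# A quantifier after a group repeats the whole group.
# (?:...) does not capture, so findall returns the full match.
pairs = re.findall(r"(?:ab){2}", "ababab")

print(exactly_two)   # ['aa']
print(two_to_three)  # ['aa', 'aaa']
print(one_or_two)    # ['a', 'aa']
print(pairs)         # ['abab']
```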

### Anchors and Boundary Matchers
- Anchors and boundary matchers allow you to specify the position of a pattern in a string.
- They are used to match patterns at the beginning, end, or specific positions within a string.
- Here's an example that demonstrates using anchors and boundary matchers using regular expressions:

In [38]:
import re

# Example: Using anchors and boundary matchers
text = "Hello, World!"
matches = re.findall(r"^Hello", text) # beginning with Hello
print(matches)  
matches = re.findall(r"World!$", text) # ending with World!
print(matches)  

['Hello']
['World!']


In this example, the regular expression pattern `r"^Hello"` matches "Hello" only at the beginning of the `text` string, and `r"World!$"` matches "World!" only at the end. The `findall()` function returns a list containing the matched text in each case.

## 3: Character Sets and Groups

### Defining Character Sets
- Character sets allow you to specify a set of characters that you want to match. 
- They are defined using square brackets `[ ]` and can include individual characters, ranges of characters, or a combination of both.

- Here's an example that demonstrates defining character sets using regular expressions:

In [39]:
import re

# Example: Defining character sets
text = "Hello, World!"
pattern = r"[aeiou]"
matches = re.findall(pattern, text)
print(matches) 

['e', 'o', 'o']


In this example, the regular expression pattern `r"[aeiou]"` matches any lowercase vowel character (a, e, i, o, u) in the `text` string. The `findall()` function returns a list of all vowel characters found in the text.
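
Character sets can also contain ranges. A short sketch, using a sample string assumed for illustration:

```python
import re

text = "Room 42B, Floor 7"  # sample string assumed for illustration

uppercase = re.findall(r"[A-Z]", text)       # range: any uppercase ASCII letter
numbers = re.findall(r"[0-9]+", text)        # range with a quantifier: runs of digits
alnum = re.findall(r"[A-Za-z0-9]+", text)    # several ranges combined in one set

print(uppercase)  # ['R', 'B', 'F']
print(numbers)    # ['42', '7']
print(alnum)      # ['Room', '42B', 'Floor', '7']
```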

### Negating Character Sets
- You can also negate a character set by including a caret `^` at the beginning of the set.
- This will match any character that is not in the set.

- Here's an example that demonstrates negating character sets using regular expressions:

In [40]:
import re

# Example: Negating character sets
text = "Hello, World!"
pattern = r"[^aeiou]"
matches = re.findall(pattern, text)
print(matches)  # Output: ['H', 'l', 'l', ',', ' ', 'W', 'r', 'l', 'd', '!']

['H', 'l', 'l', ',', ' ', 'W', 'r', 'l', 'd', '!']


In this example, the regular expression pattern `r"[^aeiou]"` matches any character in the `text` string that is not a lowercase vowel. The `findall()` function returns a list of all non-vowel characters found in the text.

### Special Character Sets
- Regular expressions provide special character sets that match common character groups. Some commonly used special character sets include:

> - `\d`: Matches any digit character (0-9).
> 
> - `\D`: Matches any non-digit character.
> 
> - `\w`: Matches any word character (alphanumeric and underscore).
> 
> - `\W`: Matches any non-word character.
> 
> - `\s`: Matches any whitespace character (space, tab, newline).
> 
> - `\S`: Matches any non-whitespace character.

- Here's an example that demonstrates using special character sets using regular expressions:

In [48]:
import re

# Example: Using special character sets
text = "H3ll0, 123 W0r1d!"
pattern = r"\d+"
matches = re.findall(pattern, text)
print(matches)  # Output: ['3', '0', '123', '0', '1']

['3', '0', '123', '0', '1']


In this example, the regular expression pattern `r"\d+"` matches one or more digit characters in the `text` string. The `findall()` function returns a list of all digit sequences found in the text.
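
The other special sets can be tried the same way. This is a small sketch on a made-up string, not an example from the original notebook:

```python
import re

text = "user_1 paid $25.50 on 2024-01-15"  # made-up sample

words = re.findall(r"\w+", text)        # runs of letters, digits, and underscores
tokens = re.findall(r"\S+", text)       # runs of non-whitespace characters
digits_only = re.sub(r"\D", "", text)   # remove every non-digit character

print(words)        # ['user_1', 'paid', '25', '50', 'on', '2024', '01', '15']
print(tokens)       # ['user_1', 'paid', '$25.50', 'on', '2024-01-15']
print(digits_only)  # 1255020240115
```

The difference between `\w+` and `\S+` shows up at punctuation: `\w+` splits `$25.50` into `25` and `50`, while `\S+` keeps it whole.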

### Grouping and Capturing
- You can use parentheses `( )` to group patterns together.
- This allows you to apply quantifiers or other operations to the group as a whole.
- Additionally, groups can be used to capture specific portions of the matched text.

- Here's an example that demonstrates grouping and capturing using regular expressions:

In [57]:
import re

# Example: Grouping and capturing
text = "John Doe, Jane Smith"
pattern = r"(\w+) (\w+)"
matches = re.findall(pattern, text)
print(matches)  # Output: [('John', 'Doe'), ('Jane', 'Smith')]

[('John', 'Doe'), ('Jane', 'Smith')]


In this example, the regular expression pattern `r"(\w+) (\w+)"` matches two groups of word characters separated by a space in the `text` string. The `findall()` function returns a list of tuples, where each tuple represents a match and contains the captured groups.

Grouping and capturing are useful for extracting specific parts of the matched text or applying operations to grouped patterns within a regular expression.
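
Groups can also be given names, which tends to make the extraction code easier to read. The sketch below reuses the same text; the group names `first` and `last` are chosen for this example.

```python
import re

text = "John Doe, Jane Smith"

# (?P<name>...) captures a group under a name instead of only a number.
name_pattern = re.compile(r"(?P<first>\w+) (?P<last>\w+)")

# finditer yields match objects, so groups can be read by name.
people = [m.groupdict() for m in name_pattern.finditer(text)]
print(people)

# (?:...) groups without capturing, so the quantifier applies to the
# whole group and findall returns the complete match.
repeated = re.findall(r"(?:ha)+", "ha haha hahaha")
print(repeated)  # ['ha', 'haha', 'hahaha']
```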

## 4: Advanced Pattern Matching

### Alternation and Branching
- Alternation allows you to match one pattern out of several possible patterns.
- It is represented by the pipe symbol `|` and behaves like a logical OR operator.

- Here's an example that demonstrates alternation and branching using regular expressions:

In [58]:
import re

# Example: Alternation and branching
text = "apple banana cherry"
pattern = r"apple|banana|cherry"
matches = re.findall(pattern, text)
print(matches) 

['apple', 'banana', 'cherry']


In this example, the regular expression pattern `r"apple|banana|cherry"` matches any occurrence of "apple", "banana", or "cherry" in the `text` string. The `findall()` function returns a list of all matched alternatives.
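
Two details of alternation are easy to miss: `|` applies to everything on either side of it unless a group limits its scope, and alternatives are tried from left to right. A minimal sketch with made-up strings:

```python
import re

text = "gray grey griy"

# The group limits the alternation to the single letter a or e.
grouped = re.findall(r"gr(?:a|e)y", text)

# Without a group, | splits the whole pattern: "gra" OR "ey".
ungrouped = re.findall(r"gra|ey", text)

# Alternatives are tried left to right; the first one that matches wins.
short_first = re.findall(r"cat|category", "category")
long_first = re.findall(r"category|cat", "category")

print(grouped)      # ['gray', 'grey']
print(ungrouped)    # ['gra', 'ey']
print(short_first)  # ['cat']
print(long_first)   # ['category']
```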

### Greedy vs. Non-Greedy Matching
- By default, regular expressions perform greedy matching, where they try to match as much as possible.
- However, you can append `?` to a quantifier (for example `*?`, `+?`, or `{2,}?`) to perform non-greedy matching, where the quantifier tries to match as little as possible.

- Here's an example that demonstrates greedy and non-greedy matching using regular expressions:

In [59]:
import re

# Example: Greedy and non-greedy matching
text = "ababab"
pattern_greedy = r"aba.*b"
pattern_non_greedy = r"aba.*?b"

matches_greedy = re.findall(pattern_greedy, text)
matches_non_greedy = re.findall(pattern_non_greedy, text)

print(matches_greedy)  # Output: ['ababab']
print(matches_non_greedy)  # Output: ['abab']

['ababab']
['abab']


In this example, the regular expression pattern `r"aba.*b"` matches the longest possible sequence starting with "aba" and ending with "b". This results in a greedy match, matching the entire `text` string. On the other hand, the pattern `r"aba.*?b"` performs non-greedy matching and matches the shortest possible sequence, resulting in a partial match.
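
The difference is often easier to see on delimited text. The following sketch uses a made-up HTML-like string; it illustrates greediness only and is not a suggestion to parse HTML with regular expressions.

```python
import re

html = "<b>bold</b> and <i>italic</i>"  # made-up sample

greedy = re.findall(r"<.*>", html)       # runs to the last ">" in the string
lazy = re.findall(r"<.*?>", html)        # stops at the first ">" each time
negated = re.findall(r"<[^>]*>", html)   # same result using a negated set

print(greedy)   # ['<b>bold</b> and <i>italic</i>']
print(lazy)     # ['<b>', '</b>', '<i>', '</i>']
print(negated)  # ['<b>', '</b>', '<i>', '</i>']
```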

### Lookahead and Lookbehind Assertions
- Lookahead and lookbehind assertions allow you to specify conditions that must be satisfied for a match to occur.
- They are used to look ahead or behind the current position in the string without including the matched text in the result.

- Here's an example that demonstrates lookahead and lookbehind assertions using regular expressions:

In [61]:
import re

# Example: Lookahead and lookbehind assertions
text = "apple orange apple banana"
pattern_lookahead = r"\w+(?= orange)"
pattern_lookbehind = r"(?<=apple )\w+"

matches_lookahead = re.findall(pattern_lookahead, text)
matches_lookbehind = re.findall(pattern_lookbehind, text)

print(matches_lookahead)  # Output: ['apple']
print(matches_lookbehind)  # Output: ['orange', 'banana']

['apple']
['orange', 'banana']


In this example, the regular expression pattern `r"\w+(?= orange)"` uses a positive lookahead assertion to match any word that is followed by " orange" in the `text` string. The pattern `r"(?<=apple )\w+"` uses a positive lookbehind assertion to match any word that is preceded by "apple ". Because "apple " occurs twice, the lookbehind pattern matches both "orange" and "banana".
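
Both assertions also have negative forms, `(?!...)` and `(?<!...)`, which require that the given text is not present. A minimal sketch, assuming a made-up price string for the lookbehind cases:

```python
import re

text = "price: $100, cost: 200, fee: $30"  # made-up sample

# Positive lookbehind: digits that directly follow a dollar sign.
with_dollar = re.findall(r"(?<=\$)\d+", text)

# Negative lookbehind: digits NOT preceded by "$" or by another digit
# (the digit check stops a match from starting in the middle of a number).
without_dollar = re.findall(r"(?<![$\d])\d+", text)

# Negative lookahead: "apple" only when it is NOT followed by " orange".
fruit = "apple orange apple banana"
positions = [m.start() for m in re.finditer(r"apple(?! orange)", fruit)]

print(with_dollar)     # ['100', '30']
print(without_dollar)  # ['200']
print(positions)       # [13]
```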

### Backreferences and Substitutions
- Backreferences allow you to refer back to captured groups within a regular expression. 
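- Inside a pattern, a backreference is written as a backslash `\` followed by the group number (for example `\1`), or as `(?P=name)` for a named group. In a replacement string, use `\1` or `\g<name>`.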

- Here's an example that demonstrates backreferences and substitutions using regular expressions:

In [63]:
import re

# Example: Backreferences and substitutions
text = "Hello, John Doe!"
pattern = r"(Hello, )(\w+) (\w+)"
replacement = r"\1Mr. \3"
new_text = re.sub(pattern, replacement, text)
print(new_text)  # Output: "Hello, Mr. Doe!"

Hello, Mr. Doe!


In this example, the regular expression pattern `r"(Hello, )(\w+) (\w+)"` matches the greeting "Hello, " followed by two words separated by a space; the greeting and each word are captured as groups 1, 2, and 3. The replacement pattern `r"\1Mr. \3"` replaces the match with "Hello, " (the contents of group 1), followed by "Mr. " and the second word (the contents of group 3).

The `re.sub()` function performs the substitution, replacing the matched pattern in the `text` string with the replacement pattern. The resulting text, "Hello, Mr. Doe!", is then printed.
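
Backreferences can also be used inside the pattern itself, for instance to find a word that is immediately repeated. The sketch below uses made-up sentences; the group names `first` and `last` are chosen for this example.

```python
import re

text = "This is is a test test sentence"  # made-up sample

# \1 inside the pattern must match the same text that group 1 captured.
doubled = re.findall(r"\b(\w+) \1\b", text)

# Replace each doubled word with a single copy.
deduped = re.sub(r"\b(\w+) \1\b", r"\1", text)

# Named groups are referenced as \g<name> in the replacement string.
swapped = re.sub(r"(?P<first>\w+) (?P<last>\w+)", r"\g<last>, \g<first>", "John Doe")

print(doubled)  # ['is', 'test']
print(deduped)  # This is a test sentence
print(swapped)  # Doe, John
```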

## 5: Modifiers and Flags

### Case Insensitivity

- The case insensitivity modifier allows you to perform case-insensitive matching in regular expressions.
- By default, regular expressions are case sensitive, meaning that uppercase and lowercase letters are treated differently.
- However, with the case insensitivity modifier, you can ignore the case of letters when making matches.
- Here's an example that demonstrates case insensitivity using regular expressions:

In [64]:
import re

# Example: Case insensitivity
text = "Hello, World!"
pattern = r"hello"
matches = re.findall(pattern, text, re.IGNORECASE)
print(matches)  # Output: ['Hello']

['Hello']


In this example, the regular expression pattern `r"hello"` is matched against the `text` string using the `findall()` function with the `re.IGNORECASE` flag. The flag enables case insensitivity, allowing "Hello" to be matched even though the pattern uses lowercase letters.
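
The same flag can be written inside the pattern or passed to `re.compile()`. A brief sketch with a made-up string:

```python
import re

text = "Hello, World! hello again, HELLO"  # made-up sample

# Inline flag: (?i) at the start of the pattern turns on case-insensitive matching.
inline = re.findall(r"(?i)hello", text)

# re.I is the short alias for re.IGNORECASE; flags can also be passed to re.compile.
compiled = re.compile(r"hello", re.I)
from_compiled = compiled.findall(text)

print(inline)         # ['Hello', 'hello', 'HELLO']
print(from_compiled)  # ['Hello', 'hello', 'HELLO']
```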

### Multiline Mode
- The multiline mode modifies how the `^` and `$` anchors behave in regular expressions.
- By default, these anchors match the start and end of the entire string.
- However, in multiline mode, they also match the start and end of each line within a multiline string.
- Here's an example that demonstrates multiline mode using regular expressions:

In [66]:
import re

# Example: Multiline mode
text = "Line 1\nLine 2\nLine 3"
pattern = r"^Line \d+"

matches = re.findall(pattern, text) #Whole String
print(matches)

matches = re.findall(pattern, text, re.MULTILINE) #Multiline
print(matches)

['Line 1']
['Line 1', 'Line 2', 'Line 3']


In this example, the regular expression pattern `r"^Line \d+"` is matched against the `text` string twice. Without a flag, `^` matches only at the start of the string, so `findall()` returns `['Line 1']`. With the `re.MULTILINE` flag, `^` also matches at the start of each line, so every line that begins with "Line" followed by one or more digits is returned.
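
The `$` anchor is affected in the same way, while `\A` always refers to the start of the whole string. A short sketch on the same text:

```python
import re

text = "Line 1\nLine 2\nLine 3"

# Without the flag, $ matches only at the end of the string.
last_only = re.findall(r"\d+$", text)

# With re.MULTILINE, $ also matches at the end of each line.
every_line = re.findall(r"\d+$", text, re.MULTILINE)

# \A matches only at the very start of the string, even in multiline mode.
string_start = re.findall(r"\ALine \d+", text, re.MULTILINE)

print(last_only)     # ['3']
print(every_line)    # ['1', '2', '3']
print(string_start)  # ['Line 1']
```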

### Dotall Mode
- The dotall mode modifies how the dot `.` metacharacter behaves in regular expressions.
- By default, the dot matches any character except a newline. 
- However, in dotall mode, the dot matches any character, including a newline.
- Here's an example that demonstrates dotall mode using regular expressions:

In [67]:
import re

# Example: Dotall mode
text = "Hello\nWorld"
pattern = r"Hello.World"


matches = re.findall(pattern, text)
print(matches)  # Output: [] (the dot does not match "\n" by default)

matches = re.findall(pattern, text, re.DOTALL)
print(matches)

[]
['Hello\nWorld']


In this example, the regular expression pattern `r"Hello.World"` is matched against the `text` string twice. Without a flag, the dot cannot match the newline, so `findall()` returns an empty list. With the `re.DOTALL` flag, the dot matches the newline character between "Hello" and "World".
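
Flags can be combined with the `|` operator, or written together as an inline group such as `(?is)`. A minimal sketch using the same text:

```python
import re

text = "Hello\nWorld"

# Case-insensitive AND dotall: the lowercase pattern matches across the newline.
combined = re.findall(r"hello.world", text, re.IGNORECASE | re.DOTALL)

# With only one of the two flags, the dot still cannot match "\n".
ignorecase_only = re.findall(r"hello.world", text, re.IGNORECASE)

# Inline equivalent: i = IGNORECASE, s = DOTALL.
inline = re.findall(r"(?is)hello.world", text)

print(combined)         # ['Hello\nWorld']
print(ignorecase_only)  # []
print(inline)           # ['Hello\nWorld']
```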

### Verbose Mode

- The verbose mode allows you to write more readable and organized regular expressions by ignoring whitespace and comments within the pattern.
- It enables you to split the pattern into multiple lines and add comments to explain the different parts of the expression.
- Here's an example that demonstrates verbose mode using regular expressions:

In [70]:
import re

# Example: Verbose mode
text = "Hello,  World!"
pattern = r"""
    Hello  # Match "Hello"
    ,      # Match comma
    \s{2,}   # Match two or more whitespace characters
    World  # Match "World"
"""
matches = re.findall(pattern, text, re.VERBOSE)
print(matches)  # Output: ['Hello,  World'] (two spaces)

['Hello,  World']


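In this example, the regular expression pattern is written using the `re.VERBOSE` flag, enabling verbose mode. The pattern is split into multiple lines, and comments are added using the `#` symbol. Because whitespace inside the pattern is ignored in this mode, the two spaces in the text are matched with `\s{2,}` rather than typed literally.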

## 6: Using Regular Expressions in Python

### The re Module
- In Python, regular expressions are handled using the `re` module, which provides various functions and methods for working with regular expressions.

- Here's an example that demonstrates the basic usage of the `re` module:

In [71]:
import re

# Example: Using the re module
text = "Hello, World!"
pattern = r"World"
matches = re.findall(pattern, text)
print(matches)

['World']


In this example, the `re` module is imported, and the `findall()` function is used to find all occurrences of the pattern `"World"` in the `text` string. The matches are stored in the `matches` list and then printed.

### Compiling and Matching Patterns
The `re` module provides a `compile()` function that allows you to compile a regular expression pattern into a regular expression object. This object can then be used for matching operations.

Here's an example that demonstrates compiling and matching patterns using the `re` module:

In [72]:
import re

# Example: Compiling and matching patterns
text = "Hello, World!"
pattern = r"World"
regex = re.compile(pattern)
matches = regex.findall(text)
print(matches) 

['World']


In this example, the pattern `"World"` is compiled into a regular expression object using the `re.compile()` function. The resulting object `regex` is then used to perform the matching operation using the `findall()` method, similar to the previous example.
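
Compiling is mainly useful when the same pattern is applied many times. The sketch below reuses one compiled object across several strings; the pattern and the input lines are made up for illustration.

```python
import re

# Compiled once, reused for every line: words that start with a capital letter.
WORD_RE = re.compile(r"\b[A-Z]\w*")

lines = ["Hello, World!", "no capitals here", "Python and Regex"]
capitalized = [WORD_RE.findall(line) for line in lines]

print(capitalized)      # [['Hello', 'World'], [], ['Python', 'Regex']]
print(WORD_RE.pattern)  # the original pattern string is kept on the object
```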

### Searching and Extracting Text
- The `re` module provides several functions for searching and extracting text based on regular expressions. 
- Some commonly used functions include `search()`, `match()`, and `findall()`.
- Here's an example that demonstrates searching and extracting text using regular expressions:

In [76]:
import re

# Example: Searching and extracting text
text = "Hello, World!"
pattern = r"\w+"
match = re.search(pattern, text)
if match:
    print("Match found:", match.group())
else:
    print("No match found")

Match found: Hello


In this example, the `search()` function is used to find the first occurrence of the pattern `r"\w+"` (one or more word characters) in the `text` string, which is "Hello". If a match is found, the `group()` method is used to retrieve the matched text. If no match is found, an appropriate message is printed.
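
The functions differ mainly in where they are allowed to match. A short comparison on the same text, including `fullmatch()` and `finditer()`, which are also part of the `re` module:

```python
import re

text = "Hello, World!"

at_start = re.match(r"World", text)         # match() only tries the start of the string
anywhere = re.search(r"World", text)        # search() scans the whole string
whole = re.fullmatch(r"Hello, \w+!", text)  # fullmatch() must cover the entire string

# finditer() yields match objects, which carry positions as well as text.
spans = [(m.group(), m.start()) for m in re.finditer(r"\w+", text)]

print(at_start)          # None
print(anywhere.group())  # World
print(whole is not None) # True
print(spans)             # [('Hello', 0), ('World', 7)]
```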

### Substitution and Replacing Text
- The `re` module provides the `sub()` function for substituting and replacing text based on regular expressions.
- This function allows you to replace matched patterns with new text.
- Here's an example that demonstrates substitution and replacing text using regular expressions:

In [77]:
import re

# Example: Substitution and replacing text
text = "Hello, World!"
pattern = r"World"
replacement = "Python"
new_text = re.sub(pattern, replacement, text)
print(new_text)  # Output: "Hello, Python!"

Hello, Python!


In this example, the `sub()` function is used to substitute the pattern `"World"` with the replacement text `"Python"` in the `text` string. The resulting new text is stored in the `new_text` variable and then printed.

These are some of the common operations you can perform using regular expressions in Python using the `re` module.
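
`sub()` also accepts a function as the replacement and a `count` argument that limits the number of replacements; `subn()` additionally reports how many replacements were made. A minimal sketch with a made-up string and a hypothetical helper named `double`:

```python
import re

text = "item1 costs 10 and item2 costs 20"  # made-up sample

def double(match):
    """Hypothetical helper: return the matched number multiplied by two."""
    return str(int(match.group()) * 2)

# \b keeps the digits inside "item1" and "item2" from matching.
doubled = re.sub(r"\b\d+\b", double, text)

# count=1 replaces only the first occurrence.
first_only = re.sub(r"costs", "is", text, count=1)

# subn() returns the new string and the number of replacements.
replaced, n = re.subn(r"costs", "is", text)

print(doubled)      # item1 costs 20 and item2 costs 40
print(first_only)   # item1 is 10 and item2 costs 20
print(replaced, n)  # item1 is 10 and item2 is 20 2
```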

## 7: Practical Examples and Use Cases

### Validating Input Data
- Regular expressions are commonly used for validating input data, such as validating email addresses, phone numbers, or passwords.
- By defining specific patterns and constraints, regular expressions can help ensure that the input data meets the required format.
- Here's an example of validating an email address using a regular expression:

In [78]:
import re

# Example: Validating an email address
email = "example@example.com"
pattern = r"^[\w\.-]+@[\w\.-]+\.\w+$"
is_valid = re.match(pattern, email)
if is_valid:
    print("Email is valid")
else:
    print("Email is not valid")

Email is valid


In this example, the regular expression pattern `r"^[\w\.-]+@[\w\.-]+\.\w+$"` is used to validate the email address. The pattern checks for one or more word characters, dots, or hyphens before and after the `@` symbol, followed by a dot and a final run of word characters such as `com`. If the email matches the pattern, it is considered valid. This is a simplified format check; it does not cover every address allowed by the email standards and accepts some malformed ones.
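
One way to see what a validation pattern accepts is to run it over a handful of candidates. The sketch below wraps the same simplified pattern in a hypothetical helper, `looks_like_email`; the candidate strings are made up, and the last one shows a malformed address that this pattern still accepts.

```python
import re

# Same simplified pattern as above, without the anchors because
# fullmatch() already requires the whole string to match.
EMAIL_RE = re.compile(r"[\w.-]+@[\w.-]+\.\w+")

def looks_like_email(value: str) -> bool:
    """Hypothetical helper: True if value fits the simplified pattern."""
    return EMAIL_RE.fullmatch(value) is not None

candidates = [
    "example@example.com",  # accepted
    "no-at-sign.com",       # rejected: no @
    "user@domain",          # rejected: no dot after the @
    "a b@example.com",      # rejected: contains a space
    "user@example..com",    # accepted, although the double dot is not valid
]

results = {c: looks_like_email(c) for c in candidates}
for candidate, ok in results.items():
    print(f"{candidate!r}: {ok}")
```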

### Parsing and Extracting Information
- Regular expressions can be used to parse and extract specific information from text data. 
- This is particularly useful when dealing with structured data formats or log files.
- Here's an example of extracting phone numbers from a text using regular expressions:

In [80]:
import re

# Example: Extracting phone numbers
text = "Contact us at +1-123-456-7890 or +44-987-654-3210"
pattern = r"\+\d{1,3}-\d{3}-\d{3}-\d{4}"
phone_numbers = re.findall(pattern, text)
print(phone_numbers) 

['+1-123-456-7890', '+44-987-654-3210']


In this example, the regular expression pattern `r"\+\d{1,3}-\d{3}-\d{3}-\d{4}"` is used to extract phone numbers that consist of a `+`, a country code of one to three digits, and then groups of three, three, and four digits separated by hyphens. The `findall()` function is used to find all occurrences of the pattern in the `text` string, and the extracted phone numbers are stored in the `phone_numbers` list.
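
For log files, named groups combined with `re.MULTILINE` can turn each line into a small record. The log format below (date, time, level, message) is invented for this sketch and would need adjusting for a real log.

```python
import re

# Hypothetical log lines; the layout is an assumption of this example.
log = "\n".join([
    "2024-01-15 10:32:01 ERROR Disk full",
    "2024-01-15 10:32:05 INFO Retry scheduled",
    "2024-01-15 10:33:10 ERROR Disk full again",
])

LINE_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<message>.*)$",
    re.MULTILINE,  # let ^ and $ match at every line
)

# One dictionary per line, keyed by group name.
records = [m.groupdict() for m in LINE_RE.finditer(log)]
errors = [r["message"] for r in records if r["level"] == "ERROR"]

print(len(records))  # 3
print(errors)        # ['Disk full', 'Disk full again']
```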

### Searching and Filtering Text
- Regular expressions are powerful tools for searching and filtering text data based on specific patterns.
- They can be used to identify specific words, patterns, or structures within a larger body of text.
- Here's an example of searching for specific words in a text using regular expressions:

In [81]:
import re

# Example: Searching for specific words
text = "The quick brown fox jumps over the lazy dog"
pattern = r"\b\w{5}\b"
matches = re.findall(pattern, text)
print(matches) 

['quick', 'brown', 'jumps']


In this example, the regular expression pattern `r"\b\w{5}\b"` is used to search for words that are exactly 5 characters long. The `\b` metacharacters represent word boundaries, and `\w{5}` matches exactly 5 word characters (letters, digits, or underscores). The `findall()` function returns all the matching words found in the `text` string.
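
Filtering usually means keeping whole lines that contain a match rather than extracting the match itself. A minimal sketch, with made-up input lines:

```python
import re

lines = [
    "error: disk full",
    "warning: low memory",
    "ERROR: timeout",
    "info: ok",
]

# Lines that start with the word "error", in any letter case.
ERROR_RE = re.compile(r"^error\b", re.IGNORECASE)

errors = [line for line in lines if ERROR_RE.search(line)]
print(errors)  # ['error: disk full', 'ERROR: timeout']
```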

### Text Manipulation and Formatting

- Regular expressions can be used to manipulate and format text data by replacing or rearranging specific patterns.
- This is useful for tasks such as data cleaning, formatting, or generating new text based on specific rules.
- Here's an example of replacing specific words in a text using regular expressions:

In [82]:
import re

# Example: Replacing words
text = "The cat chased the mouse not the notcat"
pattern = r"\bcat\b"
replacement = "dog"
new_text = re.sub(pattern, replacement, text)
print(new_text) # Output: "The dog chased the mouse not the notcat"

The dog chased the mouse not the notcat

In this example, the regular expression pattern `r"\bcat\b"` matches "cat" only as a whole word. The word boundaries `\b` prevent the "cat" inside "notcat" from being replaced, so only the standalone word becomes "dog".


## 8: Regular Expressions Best Practices

### Keeping Regular Expressions Simple

Here are some best practices for keeping regular expressions simple:

- Use **clear and descriptive** pattern names or comments to explain the purpose of the regular expression.
- **Break down complex patterns** into smaller, more manageable parts. This improves readability and makes it easier to debug.
- **Avoid unnecessary complexity** by using simpler alternatives when they achieve the same result.
- **Don't overuse metacharacters** or complex features if they are not required.
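
One possible way to break a pattern into parts is to name each piece and combine them. The sketch below validates a 24-hour `HH:MM` time; the names `HOUR`, `MINUTE`, and `TIME_RE` are chosen for this example.

```python
import re

# Each part is small enough to read and test on its own.
HOUR = r"(?:[01]\d|2[0-3])"   # 00 to 23
MINUTE = r"[0-5]\d"           # 00 to 59

TIME_RE = re.compile(rf"{HOUR}:{MINUTE}")

samples = ["09:30", "23:59", "24:00", "12:60"]
valid = [s for s in samples if TIME_RE.fullmatch(s)]

print(valid)  # ['09:30', '23:59']
```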

### Testing and Debugging Regular Expressions

Here are some tips for testing and debugging regular expressions:

- Test your regular expressions with various input samples, including both expected matches and non-matches.
- Use online regex testers or regex debugging tools to visualize and validate your regular expressions.
- Make use of the debugging capabilities provided by regex libraries or programming environments.
- Pay attention to edge cases and special scenarios to ensure your regular expression handles them correctly.
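
A small table of inputs with expected results is one simple way to apply these tips. The pattern under test here, a US ZIP code with an optional four-digit suffix, is only an assumed example.

```python
import re

# Hypothetical pattern under test: 5 digits, optionally followed by -4 digits.
ZIP_RE = re.compile(r"\d{5}(?:-\d{4})?")

# (input, should it match?) pairs, including non-matches and edge cases.
cases = [
    ("12345", True),
    ("12345-6789", True),
    ("1234", False),        # too short
    ("12345-678", False),   # suffix too short
    ("", False),            # empty string
    ("12345\n", False),     # trailing newline
]

failures = [
    (text, expected)
    for text, expected in cases
    if (ZIP_RE.fullmatch(text) is not None) != expected
]

print("all cases passed" if not failures else failures)
```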

### Handling Edge Cases and Special Characters

Consider the following best practices for handling edge cases and special characters:

- Escape special characters with a backslash (`\`) when you want to match them literally in the text.
- Be aware of characters with special meanings, such as `.`, `*`, `+`, `?`, `{}`, `()`, `[]`, `\`, `^`, `$`, `|`, and `\b`. Use appropriate escaping or character classes when needed.
- Consider different scenarios, such as empty strings, whitespace, or input containing line breaks, and ensure your regular expression handles them correctly.
- Use character classes or predefined character sets (e.g., `\w`, `\d`, `\s`) to match specific types of characters, rather than listing individual characters.
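
Two of these points can be sketched briefly: `re.escape()` escapes every metacharacter in a string so that it can be matched literally, and `$` accepts a trailing newline where `fullmatch()` does not. The strings are made up for illustration.

```python
import re

user_input = "1+1=2 (really?)"            # text that happens to contain metacharacters
text = "Is 1+1=2 (really?) true"

# Used as-is, "+", "(", ")" and "?" are interpreted as regex syntax.
unescaped = re.search(user_input, text)

# re.escape() makes every character literal.
escaped = re.search(re.escape(user_input), text)

# $ also matches just before a trailing newline; fullmatch() does not allow one.
dollar_accepts_newline = re.match(r"^\d+$", "123\n") is not None
fullmatch_accepts_newline = re.fullmatch(r"\d+", "123\n") is not None

# Empty input: * allows it, + does not.
star_on_empty = re.fullmatch(r"\d*", "") is not None
plus_on_empty = re.fullmatch(r"\d+", "") is not None

print(unescaped)                  # None
print(escaped.group())            # 1+1=2 (really?)
print(dollar_accepts_newline)     # True
print(fullmatch_accepts_newline)  # False
print(star_on_empty, plus_on_empty)  # True False
```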

### Optimizing Regular Expressions for Performance


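- Choose quantifiers deliberately. Non-greedy quantifiers (`*?`, `+?`, `??`, `{n,m}?`) change what is matched, not necessarily how fast it is matched. A specific character class such as `[^>]*` is often a better way to limit a match than `.*?`, and nested quantifiers such as `(a+)+` can cause excessive backtracking.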
- Be mindful of the order and arrangement of alternations (`|`) within your patterns, as it can impact performance.
- Take advantage of character classes and predefined character sets to match specific types of characters efficiently.
- Compile patterns that are used repeatedly with `re.compile()` and reuse the resulting object, which avoids looking up or recompiling the pattern on every call.


### Exercises

- `Easy`: Match a valid email address pattern.

- `Easy`: Match a string containing only alphabetic characters (both uppercase and lowercase).

- `Easy`: Match a string consisting of digits only.

In [107]:
import re

pattern = r"^\d+$"
text = "99129"
result = re.findall(pattern,text)

print(result)

['99129']


- `Moderate`: Match a string that starts with a capital letter followed by lowercase letters.
- `Moderate`: Match a string containing a specific word surrounded by word boundaries.

- `Moderate`: Match a string that represents a valid URL (starting with "http://" or "https://").

In [106]:
import re

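pattern = r"^https?://\S+$"  # "http://" or "https://" followed by non-whitespace
text = "https://example.com"  # sample URL for illustration
result = re.findall(pattern, text)

print(result)

['https://example.com']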


- `Hard`: Match a string representing a valid IPv4 address (e.g., "192.168.0.1").
- `Hard`: Match a string that represents a valid date in the format "YYYY-MM-DD".
- `Hard`: Match a string that represents a valid phone number in a specific format (e.g., "(123) 456-7890").
- `Hard`: Match a string that represents a valid credit card number (Visa, Mastercard, or American Express).