
# Learn Regular Expressions from Beginner to Advanced
----
<h2 style="margin-left:230px;"><a href="https://telegram.me/epythonlab/">Join Epythonlab</a></h2>

----

# Description:

In this tutorial, you will learn everything you need to know about regular expressions, from beginner to advanced. I will cover the basics of regular expressions, as well as more advanced topics with practical examples. By the end of this tutorial, you will be able to use regular expressions to solve a variety of problems.


# I. Introduction to Regular Expressions


A. What are regular expressions?
- Regular expressions `(regex)` are sequences of characters that define a search pattern.
- They are widely used for pattern matching and text manipulation tasks.

B. Why are regular expressions useful?
- Regular expressions offer a powerful and flexible way to search, validate, and manipulate text data.
- They can be applied in various programming languages, text editors, and command-line tools.

C. How are regular expressions applied in various fields?
- Regular expressions find applications in fields like data validation, web scraping, text mining, log analysis, search and replace operations, and more.

# II. Regex Basics


Importing the `re` Module

In Python, regular expressions are handled by the built-in `re` module. Before using regular expressions, you need to import this module.

In [None]:
import re

## `re` module provides a set of functions that allow you to work with regex

- `re.match(pattern, string, flags=0)`: This function attempts to match the pattern only at the beginning of the string. If the pattern matches, it returns a match object; otherwise, it returns None. Use this function if you want to check whether a pattern occurs at the beginning of a string.

- `re.search(pattern, string, flags=0)`: This function searches the entire string for a match to the pattern. It returns the first match found as a match object. If no match is found, it returns None. Use this function when you want to find the first occurrence of a pattern anywhere in the string.

- `re.findall(pattern, string, flags=0)`: This function finds all non-overlapping occurrences of the pattern in the string and returns them as a list of substrings. If no matches are found, it returns an empty list. Use this function when you want to find all instances of a pattern in the string.

- `re.finditer(pattern, string, flags=0)`: This function is similar to re.findall(), but instead of returning a list of substrings, it returns an iterator of match objects. Each match object contains information about the matched substring, such as the start and end indices.

- `re.sub(pattern, replacement, string, count=0, flags=0)`: This function replaces all occurrences of the pattern in the string with the specified replacement. It returns a new string with the substitutions. The count parameter specifies the maximum number of substitutions to make (default is 0, which means all occurrences are replaced).

- `re.split(pattern, string, maxsplit=0, flags=0)`: This function splits the input string by occurrences of the pattern and returns a list of substrings. The maxsplit parameter specifies the maximum number of splits to perform (default is 0, which means all occurrences are split).

These functions are the core tools provided by the `re` module to work with regular expressions in Python. They allow you to perform various operations, such as searching, matching, replacing, and splitting strings based on complex patterns. Regular expressions are very powerful, and mastering their use can greatly enhance your string manipulation capabilities.
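
The differences between these functions are easiest to see side by side. The following is a minimal sketch on a made-up sample string; the variable names are illustrative only.

```python
import re

text = "cat bat cat"

# match() only looks at the start of the string.
print(re.match(r"cat", text))           # a match object for 'cat' at position 0
print(re.match(r"bat", text))           # None: 'bat' is not at the start

# search() returns the first match anywhere in the string.
first = re.search(r"bat", text)
print(first.group(), first.span())      # bat (4, 7)

# findall() returns the matched substrings; finditer() yields match objects.
all_cats = re.findall(r"cat", text)
cat_starts = [m.start() for m in re.finditer(r"cat", text)]
print(all_cats, cat_starts)             # ['cat', 'cat'] [0, 8]

# sub() replaces matches; count limits the number of replacements.
replaced_once = re.sub(r"cat", "dog", text, count=1)
print(replaced_once)                    # dog bat cat

# split() cuts the string wherever the pattern matches.
parts = re.split(r"\s+", text)
print(parts)                            # ['cat', 'bat', 'cat']
```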

## A. Literal Matches
- Matching literal characters directly.

### Example:
- The regex pattern `cat` matches the word "cat" in the input text.

In [None]:
text = "The cat is on the mat."
pattern = "cat"

result = re.findall(pattern, text)
print(result)  # Output: ['cat']

- `re.findall(pattern, string)`: Returns all non-overlapping matches of the pattern in the string as a list.

## B. Metacharacters and Escaping
- Special characters in regex that have a predefined meaning. Escaping them allows treating them as literal characters.
### Example:
- To match a dot `(.)`, you need to escape it like `\.`

In [None]:
text = "I have a dot in my sentence."
pattern = r"\."

result = re.findall(pattern, text)
print(result)  # Output: ['.']

- `. (dot)`: Matches any single character except a newline.
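
When the text to match comes from a variable, escaping by hand is error-prone. As a small sketch (the sample sentence is made up), `re.escape()` can add the backslashes for you:

```python
import re

text = "Version 3.14 is newer than 3x14."

# Unescaped: the dot matches any character, so '3x14' also matches.
unescaped = re.findall(r"3.14", text)
print(unescaped)   # ['3.14', '3x14']

# re.escape() adds the backslashes needed to treat a string literally.
literal = re.escape("3.14")
escaped = re.findall(literal, text)
print(escaped)     # ['3.14']
```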

## C. Character Classes and Ranges
- Matching a specific set of characters or ranges.
### Example:
- The regex pattern `[aeiou]` matches any vowel in the input text.

In [None]:
text = "I love apples and oranges."
pattern = "[aeiou]"

result = re.findall(pattern, text)
print(result)  # Output: ['o', 'e', 'a', 'e', 'a', 'o', 'a', 'e']

- `[ ] (square brackets)`: Matches any character within the brackets.
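
Character classes also accept ranges and negation. The sketch below, on a made-up string, shows a digit range, a combined letter range, and a negated class:

```python
import re

text = "Room 42, Floor 7"

# A range inside brackets: any single digit from 0 to 9.
digits = re.findall(r"[0-9]", text)
print(digits)      # ['4', '2', '7']

# Ranges can be combined: any ASCII letter, upper or lower case.
letters = re.findall(r"[A-Za-z]+", text)
print(letters)     # ['Room', 'Floor']

# A leading ^ negates the class: anything that is NOT a letter or a space.
others = re.findall(r"[^A-Za-z ]", text)
print(others)      # ['4', '2', ',', '7']
```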

## D. Quantifiers
- Specifying the number of occurrences of a character or group.
### Example:
- The regex pattern `a{2,4}b` matches two to four "a" characters followed by "b": "aab," "aaab," or "aaaab."


In [None]:
text = "aaa ab aab aaab"
pattern = "a{2,4}b"

result = re.findall(pattern, text)
print(result)  # Output: ['aab', 'aaab']

Here's a breakdown of the pattern:

- `a{2,4}`: The curly braces `{2,4}` are a quantifier. They specify that the preceding character "a" must occur between 2 and 4 times in a row, so this part matches "aa", "aaa", or "aaaa", but not a single "a". In a longer run such as "aaaaa", it matches at most four of the characters.
- `b`: Matches the character "b" immediately following the "a" sequence, so every match ends with "b".

In the example, "aaa" is skipped because no "b" follows it, and "ab" is skipped because it contains only one "a".
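
The other common quantifiers are `?`, `*`, `+`, and `{n}`. A short sketch on a made-up string compares them:

```python
import re

text = "color colour colouur clr"

# ? : zero or one 'u'
optional_u = re.findall(r"colou?r", text)
print(optional_u)   # ['color', 'colour']

# * : zero or more 'u'
any_u = re.findall(r"colou*r", text)
print(any_u)        # ['color', 'colour', 'colouur']

# + : one or more 'u'
some_u = re.findall(r"colou+r", text)
print(some_u)       # ['colour', 'colouur']

# {2} : exactly two 'u'
two_u = re.findall(r"colou{2}r", text)
print(two_u)        # ['colouur']
```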

## E. Anchors
- Matching specific positions in the input text.
### Example:
- The regex pattern `^start` matches "start" at the beginning of a line.

In [None]:
text = "Start with a newline\nstart with a word"
pattern = "^start"

result = re.findall(pattern, text, re.MULTILINE)
print(result)  # Output: ['start']
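
Besides `^`, the anchors `$` (end of string or line) and `\b` (word boundary) are used often. The following sketch, on a made-up string, reports where each pattern matches:

```python
import re

text = "cat concat cat"

# $ anchors the match to the end of the string (or line, with re.MULTILINE).
at_end = [m.start() for m in re.finditer(r"cat$", text)]
print(at_end)          # [11]

# \b matches at a word boundary, so 'cat' inside 'concat' is skipped.
whole_words = [m.start() for m in re.finditer(r"\bcat\b", text)]
print(whole_words)     # [0, 11]

# Without anchors, every occurrence is found.
everywhere = [m.start() for m in re.finditer(r"cat", text)]
print(everywhere)      # [0, 7, 11]
```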

# III. Intermediate Regex Concepts


## A. Alternation
- Matching multiple alternatives using the OR operator `(|)`.
### Example:
- The regex pattern `apple|orange` matches either "apple" or "orange."


In [None]:
text = "I have an apple and an orange."
pattern = "apple|orange"

result = re.findall(pattern, text)
print(result)  # Output: ['apple', 'orange']

- `|(pipe)`: Matches either the expression before or after the pipe.
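
Alternation has low precedence, so it usually needs a group to limit its scope. A small sketch with made-up words:

```python
import re

text = "gray grey griy"

# Parentheses limit the alternation to one part of the pattern.
scoped = re.findall(r"gr(?:a|e)y", text)
print(scoped)        # ['gray', 'grey']

# Without them, | splits the whole pattern: 'gra' OR 'ey'.
unscoped = re.findall(r"gra|ey", text)
print(unscoped)      # ['gra', 'ey']
```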

## B. Grouping and Capturing
-  Grouping parts of a regex pattern and capturing matched substrings.
###  Example:
- The regex pattern `(ab)+` matches "ab," "abab," or "ababab."


In [None]:
import re
text = "abab abc ab"
pattern = "(ab)+"

result = re.findall(pattern, text)
print(result)  # Output: ['ab', 'ab', 'ab']

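- `( ) (parentheses)`: Group part of a pattern so that a quantifier applies to the whole group, and capture the text the group matched.
- `* (asterisk)`: Matches zero or more occurrences of the preceding character or group.
- `+ (plus)`: Matches one or more occurrences of the preceding character or group.
- When a pattern contains one capturing group, `re.findall()` returns the text captured by the group rather than the whole match. This is why "abab" appears as `'ab'` in the output.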
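
To get the whole repeated run instead of the group's capture, one option is a non-capturing group `(?:...)`. The sketch below reuses the sample text from the example above:

```python
import re

text = "abab abc ab"

# With a capturing group, findall() returns the group's last capture.
captured = re.findall(r"(ab)+", text)
print(captured)       # ['ab', 'ab', 'ab']

# (?:...) groups without capturing, so findall() returns whole matches.
whole = re.findall(r"(?:ab)+", text)
print(whole)          # ['abab', 'ab', 'ab']

# A match object gives access to both the whole match and each group.
m = re.search(r"(ab)+", text)
print(m.group(0), m.group(1))   # abab ab
```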

## C. Backreferences
-  Referring to previously captured groups within the same regex pattern.
### Example:
- The regex pattern `(\d{2})-\1` matches "22-22" but not "22-33."


In [None]:
text = "22-22 22-33"
pattern = r"(\d{2})-\1"

result = re.findall(pattern, text)
print(result)  # Output: ['22']

- `\d` matches any digit
- `{2}` specifies that the preceding expression should occur exactly two times.
- `-` matches a hyphen.
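- `\1` is a backreference: it matches the same text that the first group captured. "22-22" matches because both halves are identical; "22-33" does not. Because the pattern contains a capturing group, `re.findall()` returns the group's text (`'22'`) rather than the whole match.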

The regular expression pattern `r"(\d{2})-\1"` matches two consecutive digits followed by a hyphen, and then matches the same two-digit sequence again. It can be used to find repeated digit patterns separated by a hyphen.
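
A typical use of backreferences is spotting accidentally doubled words. The following sketch uses a made-up sentence, and also shows the named form `(?P<name>...)` with `(?P=name)`:

```python
import re

text = "This is is a test of the the parser"

# \1 must match the same text that group 1 captured.
doubled = re.findall(r"\b(\w+) \1\b", text)
print(doubled)        # ['is', 'the']

# Named groups can be referenced with (?P=name) instead of a number.
named = re.search(r"(?P<pair>\d{2})-(?P=pair)", "22-22 22-33")
print(named.group())  # 22-22
```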

## D. Lookaheads and Lookbehinds
- Matching based on the presence or absence of certain patterns ahead or behind the current position.
### Example:
- The regex pattern `(?<=prefix)\w+` matches a word preceded by the word "prefix."


In [None]:
text = "prefixword suffix"
pattern = r"(?<=prefix)\w+"

result = re.findall(pattern, text)
print(result)  # Output: ['word']

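- `(?<=prefix)`: A positive lookbehind. It succeeds only if the text immediately before the current position is "prefix", and it does not include that text in the match. Here the `?` is part of the lookbehind syntax, not a quantifier.
- `\w+` matches one or more word characters (alphanumeric and underscore).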
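
Lookarounds come in four forms: positive and negative, ahead and behind. The sketch below, on a made-up string, shows three of them:

```python
import re

text = "price: $100, discount: 20, total: $80"

# Positive lookbehind: digits preceded by a dollar sign.
dollars = re.findall(r"(?<=\$)\d+", text)
print(dollars)        # ['100', '80']

# Negative lookbehind: digits NOT preceded by a dollar sign or another digit.
plain = re.findall(r"(?<![\$\d])\d+", text)
print(plain)          # ['20']

# Positive lookahead: a word that is followed by a colon.
labels = re.findall(r"\w+(?=:)", text)
print(labels)         # ['price', 'discount', 'total']
```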


## E. Greedy vs. Lazy Matching
- Controlling whether a quantifier is greedy (matches as much as possible) or lazy (matches as little as possible).
### Example:
- In "aaabbbbbaabbbb", the greedy pattern `a.+b` matches the entire string, while the lazy pattern `a.+?b` matches "aaab" and then "aab."

In [None]:
text = "aaabbbbbaabbbb"
pattern = r"a.+b"

result = re.findall(pattern, text)
print(result)  # Output: ['aaabbbbbaabbbb']



In [None]:
pattern = r"a.+?b"

result = re.findall(pattern, text)
print(result)  # Output: ['aaab', 'aab']

- `a`: Matches the character "a" literally.
- `.+?`: Matches one or more occurrences of any character (except a newline) in a non-greedy fashion.
- The `+` symbol means one or more occurrences, and the `?` symbol makes it non-greedy, meaning it will match as few characters as possible.
- `b`: Matches the character "b" literally.
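
The difference matters most when a pattern has delimiters that repeat. As an illustration on a made-up HTML fragment:

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ runs to the last '>' in the string.
greedy = re.findall(r"<.+>", html)
print(greedy)    # ['<b>bold</b> and <i>italic</i>']

# Lazy: .+? stops at the first '>' it can.
lazy = re.findall(r"<.+?>", html)
print(lazy)      # ['<b>', '</b>', '<i>', '</i>']
```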

# IV. Advanced Regex Techniques


## A. Advanced Character Classes
- Utilizing advanced character class constructs for matching specific character types.
### Example:
- The regex pattern `[^\W\d_]` matches any Unicode letter.


In [None]:
text = "Hello, こんにちは, مرحبًا"
# \w minus digits (\d) and the underscore leaves letters only.
pattern = r"[^\W\d_]"

result = re.findall(pattern, text)
print(result)  # Output: ['H', 'e', 'l', 'l', 'o', 'こ', 'ん', 'に', 'ち', 'は', 'م', 'ر', 'ح', 'ب', 'ا']


- `\p{L}`: In many regex engines, this Unicode property escape matches any letter from any script, such as Latin, Cyrillic, Arabic, or Chinese. Python's built-in `re` module does not support the `\p{...}` syntax and raises an error if a pattern uses it.
- `[^\W\d_]`: A common workaround in `re` and a close approximation of `\p{L}`. For `str` patterns in Python 3, `\w` is Unicode-aware by default, so the `re.UNICODE` flag is not needed. Excluding digits (`\d`) and the underscore from `\w` leaves the letters. The third-party `regex` package supports `\p{L}` directly.

## B. Modifiers and Flags
- Adding modifiers or flags to change regex behavior (e.g., case-insensitive matching).
### Example:
- The regex pattern `(?i)pattern` matches "pattern" case-insensitively.


In [None]:
text = "Pattern matching using the pattern flag (?i)"
pattern = r"(?i)pattern"  # the inline flag must come first

matches = re.findall(pattern, text)
print(matches)  # Output: ['Pattern', 'pattern']

- `(?i)`: This is an inline flag that modifies the behavior of the whole regular expression. The `i` flag stands for "ignore case," so the pattern matches "pattern," "Pattern," "PATTERN," and any other variation of the word regardless of its letter casing. A global inline flag must appear at the start of the pattern; Python 3.11 and later raise an error if it appears anywhere else.

- `pattern`: This part matches the word "pattern" literally. Without the flag, it would be case-sensitive.
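
Flags can also be passed as an argument instead of being written inside the pattern. The sketch below uses made-up sample strings and shows `re.IGNORECASE` on its own, then combined with `re.VERBOSE` in a compiled pattern:

```python
import re

text = "Pattern matching using the PATTERN flag"

# Pass the flag as an argument instead of embedding (?i) in the pattern.
matches = re.findall(r"pattern", text, flags=re.IGNORECASE)
print(matches)        # ['Pattern', 'PATTERN']

# Flags combine with |. re.VERBOSE allows whitespace and comments in a pattern.
date_re = re.compile(
    r"""
    (\d{4})   # year
    -
    (\d{2})   # month
    """,
    re.VERBOSE | re.IGNORECASE,
)
date_parts = date_re.findall("Released 2023-07 and 2024-01")
print(date_parts)     # [('2023', '07'), ('2024', '01')]
```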



## C. Conditional Matching
-  Matching patterns conditionally based on certain criteria.
### Example:
- The regex pattern `(?(1)yes|no)` tries to match `yes` if capture group 1 took part in the match; otherwise, it tries to match `no`. The `|no` part is optional.


In [None]:
# An opening parenthesis is optional, but if present it must be closed.
pattern = r"(\()?\d+(?(1)\))"

for text in ["(10)", "10", "(10"]:
    print(text, bool(re.fullmatch(pattern, text)))
# Output:
# (10) True
# 10 True
# (10 False

- `(\()?`: Capture group 1. It optionally matches an opening parenthesis.

- `\d+`: Matches one or more digits.

- `(?(1)\))`: This is the conditional construct. It checks whether capture group 1 matched. If it did, a closing parenthesis `\)` is required. If it did not, nothing more is required, because the "no" branch is omitted.

## D. Recursive Patterns
-  Creating patterns that can match nested or repetitive structures.
### Example:
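- The regex pattern `(?R)` matches the entire pattern recursively. Python's built-in `re` module does not support recursion, so this example uses the third-party `regex` package (`pip install regex`).


In [None]:
import regex  # third-party package, not part of the standard library

text = "Nested brackets: (((text)))"
# An opening bracket, then any mix of non-bracket text or a nested
# bracket pair (the recursion), then a closing bracket.
pattern = r"\((?:[^()]+|(?R))*\)"

result = regex.findall(pattern, text)
print(result)  # Output: ['(((text)))']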

## E. Assertions
-  Making assertions about the surrounding text without including it in the actual match.
### Example:
- The regex pattern `word(?=\W|$)` matches "word" only if it's followed by a non-word character or the end of the string.

In [None]:
text = "word, word1, word2"
pattern = r"word(?=\W|$)"

result = re.findall(pattern, text)
print(result)  # Output: ['word']

- `word`: Matches the characters "word" literally.

- `(?=\W|$)`: This is a positive lookahead assertion that ensures the presence of a non-word character `(\W)` or the end of the string `($)` immediately after the word "word".
- The `(?=...)` construct is used for lookahead assertions, and `\W` represents any non-word character.
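
Because assertions consume no characters, several can be stacked at the same position. The following is an illustrative sketch with a hypothetical password rule (at least 8 characters, with a lowercase letter, an uppercase letter, and a digit); the rule itself is an assumption for the example:

```python
import re

# Each lookahead checks one requirement without consuming any characters.
password_re = re.compile(r"^(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{8,}$")

candidates = ["Secret123", "secret123", "Sec1"]
results = {pw: bool(password_re.match(pw)) for pw in candidates}
print(results)   # {'Secret123': True, 'secret123': False, 'Sec1': False}
```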

# V. Practical Examples and Demo Exercises


## A. Validating Email Addresses
- Using regex to validate the format of an email address.
### Exercise:
- Write a regex pattern to validate email addresses of the form "username@domain.tld", such as "username@epythonlab.com".


In [None]:
emails = [
    "john_doe@example.com",
    "alice123@mydomain.org",
    "support@website",
    "invalid-email",
    "admin@com",
]
pattern = r"^\w+@[a-zA-Z_]+?\.[a-zA-Z]{2,3}$"

for email in emails:
    if re.match(pattern, email):
        print(f"{email} is a valid email address.")
    else:
        print(f"{email} is not a valid email address.")

- `^`: The caret symbol indicates the start of the string.

- `\w+`: This part of the pattern matches one or more word characters (letters, digits, or underscores). It matches the username part of the email address.

- `@`: Matches the "@" symbol.

- `[a-zA-Z_]+?`: This part matches one or more letters (both uppercase and lowercase) and underscores. The `+?` makes it non-greedy, meaning it matches as few characters as possible. It represents the domain name part of the email address.

- `\.`: Escapes the period (dot) character so that it matches a literal dot in the email address.

- `[a-zA-Z]{2,3}`: This part matches two or three letters (both uppercase and lowercase). It represents the top-level domain (TLD) part of the email address, like ".com" or ".org".

- `$`: The dollar sign indicates the end of the string.
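
The pattern above rejects addresses with dots in the username or with subdomains. One possible, slightly more permissive variant is sketched below. `EMAIL_RE` and `is_valid_email` are hypothetical names for this example, and the pattern is still a simplification rather than a full validator.

```python
import re

# Hypothetical, slightly more permissive pattern; still a simplification.
EMAIL_RE = re.compile(r"[\w.+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}")

def is_valid_email(address):
    """Return True if the whole string matches EMAIL_RE."""
    return EMAIL_RE.fullmatch(address) is not None

samples = ["john.doe@example.com", "alice@mail.example.co.uk", "admin@com", "a b@example.com"]
checked = {s: is_valid_email(s) for s in samples}
print(checked)
```

Using `fullmatch()` removes the need for `^` and `$` in the pattern.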

## B. Extracting URLs from Text
- Extracting URLs from a text document using regex.
### Exercise:
- Write a regex pattern to extract URLs from a given text input.


In [None]:
text = "Visit my website at https://www.example.com"
pattern = r"https?://[\w./-]+"

result = re.findall(pattern, text)
print(result)  # Output: ['https://www.example.com']

- `https?://`: This part matches "http://" or "https://". The `s?` makes the "s" optional, so it will match both "http://" and "https://".

- `[\w./-]+`: This part matches one or more occurrences of word characters, dots, slashes, and hyphens.
  - The `\w` is a shorthand character class for word characters (letters, digits, and underscores).
  - The square brackets `[]` define a character class, which allows any character inside it.
  - The character class includes `\w` (word characters), `.` (dot), `/` (forward slash), and `-` (hyphen).
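
Because the character class includes the dot, a URL at the end of a sentence picks up the trailing period. The sketch below uses made-up sample text and shows one simple way to trim it:

```python
import re

text = "Docs: https://docs.python.org/3/library/re.html, mirror: http://example.com/re."

# The character class also matches dots, so trailing punctuation is included.
raw = re.findall(r"https?://[\w./-]+", text)
print(raw)       # ['https://docs.python.org/3/library/re.html', 'http://example.com/re.']

# One way to trim it: strip trailing dots after matching.
urls = [u.rstrip(".") for u in raw]
print(urls)      # ['https://docs.python.org/3/library/re.html', 'http://example.com/re']
```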

## C. Parsing HTML/XML Tags
- Parsing and extracting information from HTML/XML tags using regex.
### Exercise:
- Write a regex pattern to extract the content within HTML `<title>` tags.


In [None]:
html = "<html><head><title>Regex tutorial</title></head><body><h1>This is regex tutorial home page</h1></body></html>"
pattern = r"<title>(.*?)<\/title>"

match = re.search(pattern, html)
if match:
    title_content = match.group(1)
    print(f"The title is: {title_content}")
else:
    print("Title not found.")


- `<title>`: This part matches the opening title tag literally. It looks for the substring `"<title>"` in the text.

- `(.*?)`: This is a non-greedy capture group enclosed in parentheses. The `.*?` matches any character (except a newline) zero or more times in a non-greedy manner. The non-greedy behavior means it will match as few characters as possible. This group is used to capture the content within the title tags.

- `<\/title>`: This part matches the closing title tag `"</title>"` literally. The backslash before the forward slash is harmless but not required in Python; `</title>` works the same way. The escape is a habit from languages that delimit regex literals with `/`.

- `re.search()` is a function provided by the Python re module, which is used for searching a string for a match to a regular expression pattern. It scans the entire input string and tries to find the first location where the pattern matches.
- `pattern`: The regular expression pattern that you want to search for in the string.
- `string`: The input string in which you want to search for the pattern.
- `flags`: (Optional) Additional flags that modify the behavior of the regular expression. For example, `re.IGNORECASE` makes the pattern case-insensitive.
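
Real HTML often spreads a tag over several lines or uses different letter casing. As a sketch on a made-up snippet, the flags `re.DOTALL` and `re.IGNORECASE` handle both cases; for anything beyond simple extraction, a dedicated HTML parser is usually more robust.

```python
import re

html = "<TITLE>\n  Regex tutorial\n</TITLE>"

# Without flags, '.' stops at newlines and the tag case must match exactly.
no_flags = re.search(r"<title>(.*?)</title>", html)
print(no_flags)          # None

# re.DOTALL lets '.' match newlines; re.IGNORECASE ignores tag case.
m = re.search(r"<title>(.*?)</title>", html, re.DOTALL | re.IGNORECASE)
title = m.group(1).strip()
print(title)             # Regex tutorial
```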

## D. Formatting and Manipulating Text
- Using regex to format and manipulate text strings.
### Exercise:
- Write a regex pattern to remove all non-alphanumeric characters from a given text.


In [None]:
text = "Remove!@#$non-alphanumeric%^characters"
pattern = r"\W+"

result = re.sub(pattern, "", text)
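print(result)  # Output: Removenonalphanumericcharacters

- `\W+`: Matches one or more non-word characters. Because `\w` includes the underscore, this pattern leaves underscores in place; use `[^a-zA-Z0-9]+` instead to keep only ASCII letters and digits.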
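
`re.sub()` can also rearrange text rather than just delete it. The sketch below, on made-up strings, uses group references in the replacement and then a replacement function (`shout` is a hypothetical helper):

```python
import re

text = "Due 2024-03-15, shipped 2024-04-01"

# \1, \2, \3 in the replacement refer to the captured groups.
reordered = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", text)
print(reordered)     # Due 15/03/2024, shipped 01/04/2024

# The replacement can also be a function that receives each match object.
def shout(match):
    return match.group(0).upper()

shouted = re.sub(r"\b[a-z]+\b", shout, "Due soon, shipped later")
print(shouted)       # Due SOON, SHIPPED LATER
```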

## E. Data Extraction and Transformation
-  Applying regex to extract and transform data from structured or semi-structured text.
### Exercise:
- Write a regex pattern to extract a phone number and an email address from a text document.

In [None]:
text = "Contact us at: Phone: 123-456-7890 Email: info@example.com"
pattern = r"Phone: (\d{3}-\d{3}-\d{4}) Email: (\w+@\w+\.\w+)"

result = re.findall(pattern, text)
print(result)  # Output: [('123-456-7890', 'info@example.com')]

- `Phone:`: This part matches the text "Phone:" literally. It will look for the substring "Phone:" in the text.

- `(\d{3}-\d{3}-\d{4})`: This is a capture group enclosed in parentheses. It matches a phone number in the format "xxx-xxx-xxxx," where "x" represents any digit. The `\d` is a shorthand character class for digits `(0-9)`. The curly braces `{3}` specify that the preceding digit pattern `\d` should occur exactly three times.

- `Email:`: This part matches the text "Email:" literally. It will look for the substring "Email:" in the text.

- `(\w+@\w+\.\w+)`: This is another capture group enclosed in parentheses. It matches an email address in the format "username@domain.com," where "username" can contain one or more word characters (letters, digits, or underscores), and "domain.com" can contain one or more word characters separated by a dot. The `\w` is a shorthand character class for word characters, and `\.` matches a literal dot.
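
When a pattern has several groups, naming them makes the result easier to work with. The following sketch reuses the sample text above; the group names `phone` and `email` are chosen for this example.

```python
import re

text = "Contact us at: Phone: 123-456-7890 Email: info@example.com"

# (?P<name>...) gives each group a name instead of a number.
contact_re = re.compile(
    r"Phone: (?P<phone>\d{3}-\d{3}-\d{4}) Email: (?P<email>\w+@\w+\.\w+)"
)

m = contact_re.search(text)
record = m.groupdict()
print(record)   # {'phone': '123-456-7890', 'email': 'info@example.com'}
```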

# Conclusion:


In conclusion, this comprehensive regex tutorial covered various topics from basics to advanced techniques. You learned how regex empowers you to tackle pattern matching and text manipulation tasks. With practical examples and exercises, you gained hands-on experience. Keep practicing and exploring the vast world of regex for endless possibilities in text-related problem-solving.
