# Regular Expressions in Python

Regular expressions (also known as regex or regexp) are a powerful tool for searching and manipulating text. They allow you to define a pattern or set of rules that describe a particular string of characters, and then search for or manipulate any text that matches that pattern.

Regular expressions are commonly used in programming, particularly for tasks like data validation, searching and replacing text, and parsing strings. They are also useful in text editors, command-line tools, and other applications that involve working with text.

Some of the benefits of using regular expressions include:

- **Flexibility:** Regular expressions are incredibly flexible and can match a wide range of patterns, from simple strings to complex sequences of characters.
- **Efficiency:** Regular expressions can be faster than hand-written loops for complex pattern matching over large amounts of text, though plain string methods such as `str.find()` are usually quicker for simple literal searches.
- **Accuracy:** Regular expressions are very precise and can be used to match specific patterns, ensuring that you only work with the data that you need.
- **Standardization:** Regular expression syntax is broadly similar across languages and tools (with some differences between flavors), making it easier to share and collaborate on code that involves text processing.

```
https://regexr.com/
```

In [1]:
import re

## Python RegEx Methods

Python's built-in `re` module provides functions for working with regular expressions, including:

### 1. re.search(pattern, string) 

The `re.search()` function scans a string for the first location where the pattern matches and returns a match object for it, or `None` if no match is found. It is loosely analogous to the `in` operator on Python strings, except that `in` tests only for a literal substring and returns a boolean. Because the result is either a match object (which is truthy) or `None`, it can be used directly in conditional expressions.

- `pattern`: The regular expression pattern to search for
- `string`: The string to search in

It's a good idea to use raw strings (written as `r'...'`) to define regular expression patterns. In a raw string Python leaves backslashes untouched, so sequences such as `\d` or `\b` reach the regex engine exactly as written.
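As a minimal sketch of why raw strings matter, the snippet below compares the same pattern written as a normal string and as a raw string. The sample text and variable names are made up for illustration.

```python
import re

# In a normal string literal, "\b" is the backspace character (one character).
plain = "\bthe\b"
# In a raw string literal, the backslash is kept, so the regex engine
# receives \b, which it reads as a word boundary.
raw = r"\bthe\b"

print(len(plain))  # 5
print(len(raw))    # 7

# Hypothetical sample text, chosen so that "other" contains "the".
text = "In the documents, the FBI found other names."

raw_matches = re.findall(raw, text)      # whole-word "the" only
plain_matches = re.findall(plain, text)  # looks for literal backspaces

print(raw_matches)    # ['the', 'the']
print(plain_matches)  # []
```

The raw version skips the "the" inside "other" because of the word boundaries, while the plain version finds nothing because the text contains no backspace characters.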

The match object contains information about the match. Some of the useful methods and attributes of the match object are:

- `group()`: Returns the matched string
- `start()`: Returns the starting index of the match
- `end()`: Returns the index just past the last matched character, so `string[start:end]` is the matched text
- `span()`: Returns a tuple containing the starting and ending indices of the match

In [3]:
string = "In the documents, the FBI found dozens of high-profile names, including Trump's. A unit of FOIA officers, citing exemptions in FOIA law, redacted Trump's name because, although he was then a sitting president, he had been a private citizen when the 2006 federal investigation into Epstein began.[85] On May 18, Patel and Deputy FBI Director Dan Bongino told Fox News that Epstein had died by suicide."
print(string)

In the documents, the FBI found dozens of high-profile names, including Trump's. A unit of FOIA officers, citing exemptions in FOIA law, redacted Trump's name because, although he was then a sitting president, he had been a private citizen when the 2006 federal investigation into Epstein began.[85] On May 18, Patel and Deputy FBI Director Dan Bongino told Fox News that Epstein had died by suicide.


In [8]:
pattern = r"the"

match = re.search(pattern, string)

In [11]:
if match:
    print("Match Start:", match.start())
    print("Match End:", match.end())

Match Start: 3
Match End: 6


In [12]:
string[3:6]

'the'
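The remaining match object methods, and the `None` result for a failed search, can be sketched in the same way. The `sample` text here is a short stand-in for the longer string above, not part of the original notebook.

```python
import re

# Hypothetical short sample text.
sample = "Deputy FBI Director"

match = re.search(r"[A-Z]{2,}", sample)
if match:
    matched_text = match.group()  # the text that matched
    start, end = match.span()     # same values as match.start(), match.end()
    print(matched_text)           # FBI
    print(start, end)             # 7 10
    print(sample[start:end])      # FBI

# When nothing matches, re.search() returns None instead of a match object.
no_match = re.search(r"\d+", sample)
print(no_match)                   # None

# The `in` operator only tests for a literal substring and gives a boolean.
print("FBI" in sample)            # True
```

Checking `if match:` before calling `group()` avoids an `AttributeError` when the search fails.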

### 2. re.findall(pattern, string)

The `re.findall()` function finds all non-overlapping occurrences of a regular expression pattern in a string. It takes the same parameters as `re.search()` and returns a list of all matches found (if the pattern contains capturing groups, the list holds the groups rather than the whole matches). The example below is deliberately simple; pattern design is discussed later.

In [15]:
pattern = r"[A-Z]{2,}"

match = re.findall(pattern, string)

In [16]:
match

['FBI', 'FOIA', 'FOIA', 'FBI']
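One detail worth seeing in code is how capturing groups change the shape of the `re.findall()` result. This is a small sketch on an invented sentence; the patterns are illustrative only.

```python
import re

# Hypothetical sample sentence with two "Month day" dates.
sample = "On May 18, Patel spoke; on June 3, a report followed."

# No capturing groups: each list item is the whole matched text.
whole = re.findall(r"[A-Z][a-z]+ \d+", sample)
print(whole)  # ['May 18', 'June 3']

# Two capturing groups: each list item is a tuple of the groups.
parts = re.findall(r"([A-Z][a-z]+) (\d+)", sample)
print(parts)  # [('May', '18'), ('June', '3')]
```

"On" is not returned because it is followed by a word rather than digits, so the full pattern does not match there.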

### 3. re.sub(pattern, repl, string, count=0)

`re.sub()` replaces occurrences of a pattern in a string with a replacement and returns a new string with the replacements made; the original string is left unchanged. Here are the parameters used:

- `pattern`: The regular expression pattern to search for
- `repl`: The replacement to use in place of each match (a string, or a function that receives the match object and returns a string)
- `string`: The string to search in
- `count`: Maximum number of replacements to make; the default `0` replaces every occurrence
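The effect of `count` and of a function used as `repl` can be sketched on a short made-up string (the `sample` text and variable names below are illustrative, not from the notebook):

```python
import re

# Hypothetical sample text containing three all-caps acronyms.
sample = "the FBI and FOIA officers met FBI staff"
pattern = r"[A-Z]{2,}"

# count=0 (the default) replaces every match.
all_replaced = re.sub(pattern, "---", sample)
print(all_replaced)  # the --- and --- officers met --- staff

# count=1 stops after the first replacement.
first_only = re.sub(pattern, "---", sample, count=1)
print(first_only)    # the --- and FOIA officers met FBI staff

# repl may be a function: it receives each match object and
# returns the replacement text for that match.
lowered = re.sub(pattern, lambda m: m.group().lower(), sample)
print(lowered)       # the fbi and foia officers met fbi staff
```

Passing `count` by keyword keeps the call readable and avoids confusing it with the `flags` argument.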


In [20]:
pattern = r"[A-Z]{2,}"

repl = "---"

new_string = re.sub(pattern, repl, string, count=0)

In [21]:
new_string

"In the documents, the --- found dozens of high-profile names, including Trump's. A unit of --- officers, citing exemptions in --- law, redacted Trump's name because, although he was then a sitting president, he had been a private citizen when the 2006 federal investigation into Epstein began.[85] On May 18, Patel and Deputy --- Director Dan Bongino told Fox News that Epstein had died by suicide."

### 4. re.split(pattern, string, maxsplit=0)

`re.split()` splits a string into a list of substrings wherever a regular expression pattern matches, and returns that list. It is similar to the `split()` method of Python `str` objects, except that the separator is a pattern rather than a fixed string. Let's see how each parameter works:

- `pattern`: The regular expression pattern to split on
- `string`: The string to split
- `maxsplit`: Maximum number of splits to make; the default `0` means no limit


In [23]:
pattern = r"\[\d+\]"

In [24]:
segments = re.split(pattern, string)

In [25]:
segments

["In the documents, the FBI found dozens of high-profile names, including Trump's. A unit of FOIA officers, citing exemptions in FOIA law, redacted Trump's name because, although he was then a sitting president, he had been a private citizen when the 2006 federal investigation into Epstein began.",
 ' On May 18, Patel and Deputy FBI Director Dan Bongino told Fox News that Epstein had died by suicide.']

In [26]:
file_path = "amazonaws.com/clip_test_videos/5min_test.mp4"

In [27]:
file_path.split("/")

['amazonaws.com', 'clip_test_videos', '5min_test.mp4']

In [29]:
re.split(r"/", file_path)

['amazonaws.com', 'clip_test_videos', '5min_test.mp4']
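For a single fixed separator such as `/`, `str.split()` and `re.split()` give the same result, as seen above. Where `re.split()` helps is when the separator varies. The following is a small sketch on an invented `record` string, also showing `maxsplit` and a capturing group.

```python
import re

# Hypothetical record with inconsistent separators and spacing.
record = "alpha, beta;gamma ;  delta"
separator = r"\s*[,;]\s*"  # a comma or semicolon with optional spaces around it

fields = re.split(separator, record)
print(fields)     # ['alpha', 'beta', 'gamma', 'delta']

# maxsplit=1 makes only the first split; the rest stays in one piece.
first_two = re.split(separator, record, maxsplit=1)
print(first_two)  # ['alpha', 'beta;gamma ;  delta']

# A capturing group in the pattern keeps the separators in the result.
with_refs = re.split(r"(\[\d+\])", "began.[85] On May 18")
print(with_refs)  # ['began.', '[85]', ' On May 18']
```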