# RegEx (Regular Expressions) in Python

- What is RegEx?
- Metacharacters used in RegEx
- The `re` module
- Usage of the `re` module and available functions

## What is RegEx?

- A regular expression (RegEx) is a sequence of characters that defines a search pattern.
- The pattern mixes letters, numbers and symbols to describe the search term or criteria.
- **RegEx is case-sensitive by default**

    **Example**

    ```regex
    ^a...s$
    ```

    - The code above defines a pattern. The pattern is **any five-character string starting with `a` and ending with `s`** (each `.` stands for any single character, not only a letter).

    | Expression | String | Matched? |
    |------------|--------|----------|
    | `^a...s$` | abs | No match |
    | `^a...s$` | alias | Match | 
    | `^a...s$` | abyss | Match |
    | `^a...s$` | Alias | No match |
    | `^a...s$` | An abacus | No match |
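
The table above can be checked in Python. The following is a minimal sketch using `re.search()` from the standard `re` module; the variable names (`pattern`, `candidates`, `results`) are illustrative only, and the `re.IGNORECASE` flag is shown as one way to relax the default case sensitivity.

```python
import re

pattern = r"^a...s$"
candidates = ["abs", "alias", "abyss", "Alias", "An abacus"]

# re.search() returns a match object on success and None otherwise
results = {text: bool(re.search(pattern, text)) for text in candidates}

for text, matched in results.items():
    print(f"{text!r}: {'Match' if matched else 'No match'}")

# Matching is case-sensitive unless a flag such as re.IGNORECASE is passed
ignore_case = bool(re.search(pattern, "Alias", flags=re.IGNORECASE))
print(ignore_case)  # True
```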

## Specifying patterns using RegEx

**Metacharacters:**

- Are interpreted in a special way by the RegEx engine when it builds the search pattern

`[] . ^ $ * + ? {} () \ |`

1. `[]` - Square brackets

    - Specify a set of characters that you wish to match
    - They do not have to appear in sequence in the string
    - Each occurrence of a specified character counts as one match

    **Usage:**

    | Expression | String | Matched? |
    |------------|--------|----------|
    | `[abc]` | a | 1 match |
    | `[abc]` | ac | 2 matches |
    | `[abc]` | Hey Jude | No match |
    | `[abc]` | abc de ca | 5 matches |

    - You can also specify a range of characters using `-` inside the square brackets:

        - `[a-e]` is the same as `[abcde]`
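        - `[1-4]` is the same as `[1234]`
        - `[0-39]` is the same as `[01239]`: the range `0-3` plus the single character `9`, because a set always matches one character at a time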

    <hr>

2. `.` - Period

    - Matches any single character (except newline `\n`)
    - The string must contain at least as many characters as there are periods in the pattern

    **Usage:**

    | Expression | String | Matched? |
    |------------|--------|----------|
    | `..` | a | No match |
    | `..` | ac | 1 match |
    | `..` | acd | 1 match |
    | `..` | acde | 2 matches |

    <hr>

3. `^` -  Caret

    - Used to check if a string starts with a certain character(s)
    - The characters after `^` must appear in the same order at the start of the string to return a match

    **Usage:**

    | Expression | String | Matched? |
    |------------|--------|----------|
    | `^a` | a | 1 match |
    | `^a` | abc | 1 match |
    | `^ab` | abc | 1 match |
    | `^a` | bac | No match |
    | `^ab` | abab | 1 match |
    | `^ab` | acb | No match |

    <hr>

4. `$` - Dollar

    - Used to check if a string ends with a certain character(s)
    - The characters before `$` must appear in the same order at the end of the string to return a match

    **Usage:**

    | Expression | String | Matched? |
    |------------|--------|----------|
    | `a$` | a | 1 match |
    | `a$` | formula | 1 match |
    | `a$` | formula one | No match | 
    | `re$` | fire | 1 match |

    <hr>

5. `*` - Asterisk (Star)

    - It matches zero or more occurrences of the pattern to the left of it.

    **Usage:**

    | Expression | String | Matched? |
    |------------|--------|----------|
    | `ma*n` | mn | 1 match |
    | `ma*n` | man | 1 match |
    | `ma*n` | maaaan | 1 match |
    | `ma*n` | main | No match (a is not followed by n) | 
    | `ma*n` | woman | 1 match |

    <hr>

6. `+` - Plus

    - Matches one or more occurrences of the pattern to the left of it

    **Usage:**

    | Expression | String | Matched? |
    |------------|--------|----------|
    | `ma+n` | mn | No match |
    | `ma+n` | man | 1 match |
    | `ma+n` | mailed | No match |
    | `ma+n` | many | 1 match |
    | `ma+n` | mason | No match |

    <hr>

7. `?` - Question Mark

    - Matches zero or one occurrences of the pattern to the left of it

    **Usage:**

    | Expression | String | Matched? |
    |------------|--------|----------|
    | `ma?n` | mn | 1 match |
    | `ma?n` | man | 1 match |
    | `ma?n` | maaaaan | No match (a appears more than once) |
    | `ma?n` | main | No match | 
    | `ma?n` | woman | 1 match |

    <hr>

8. `{}` - Braces 

    - The syntax is `{n,m}`: at least `n` and at most `m` repetitions of the pattern to the left of it.
    - Do not put a space after the comma: Python's `re` module treats `a{2, 4}` as literal text rather than as a quantifier.

    **Usage:**

    | Expression | String | Matched? |
    |------------|--------|----------|
    | `a{2,4}` | abc dat | No match |
    | `a{2,4}` | abc daat | 1 match (at d**aa**t) |
    | `a{2,4}` | aabc daaat | 2 matches |
    | `a{2,4}` | aabc daaaat | 2 matches |

    <hr>

9. `|` - Alternation / Vertical bar / Pipe symbol

    - Used as an `or` operator

    **Usage:**

    | Expression | String | Matched? |
    |------------|--------|----------|
    | `a\|b` | cde | No match |
    | `a\|b` | ade | 1 match |
    | `a\|b` | acdbea | 3 matches |

    <hr>

10. `()` - Group

    - Used to group sub-patterns.

    **Usage:**

    | Expression | String | Matched? |
    |------------|--------|----------|
    | `(a\|b\|c)xz` | ab xz | No match | 
    | `(a\|b\|c)xz` | abxz | 1 match |
    | `(a\|b\|c)xz` | axz cabxz | 2 matches |
    | `(a\|b\|c)xz` | adxz | No match |

    <hr>

11. `\` - Backslash 

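    - Used to escape various characters, including all the metacharacters.
    - If you are unsure whether a punctuation character has a special meaning, you can put a backslash in front of it. Avoid doing this with letters, since sequences such as `\A` have a meaning of their own.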

    **Usage:**

    | Expression | Resulting pattern |
    |------------|-------------------|
    | `\$` | Includes `$` as part of the string |
    | `\[\]` | Include `[]` as part of the string/pattern |
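
The match counts in the metacharacter tables above can be reproduced with a short script. This is a sketch rather than part of the original notes: the helper `count_matches` and the `cases` list are hypothetical names, and counting is done with `re.finditer()`, which yields non-overlapping matches from left to right.

```python
import re

# (pattern, text, expected number of non-overlapping matches)
cases = [
    (r"[abc]", "abc de ca", 5),
    (r"..", "acde", 2),
    (r"^ab", "abab", 1),
    (r"a$", "formula one", 0),
    (r"ma*n", "mn", 1),
    (r"ma*n", "main", 0),
    (r"ma+n", "many", 1),
    (r"ma?n", "woman", 1),
    (r"a{2,4}", "aabc daaaat", 2),
    (r"a|b", "acdbea", 3),
    (r"(a|b|c)xz", "axz cabxz", 2),
    (r"\$", "cost: $5", 1),
]


def count_matches(pattern, text):
    """Count non-overlapping matches of pattern in text."""
    return sum(1 for _ in re.finditer(pattern, text))


counts = [count_matches(pattern, text) for pattern, text, _ in cases]

for (pattern, text, expected), found in zip(cases, counts):
    print(f"{pattern!r:<14} {text!r:<14} -> {found} (expected {expected})")
```

Raw strings (`r"..."`) are used for the patterns so that backslashes such as the one in `\$` reach the RegEx engine unchanged.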

#### Special Sequences

- Make commonly used patterns easier to write

`\A` - Matches only if the specified character(s) are at the start of the string.

**Usage:**

| Expression | String | Matched? |
|------------|--------|----------|
| `\Athe` | the sun | Match |
| `\Athe` | The sun | No match |
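
As a hedged illustration of how `\A` differs from `^`, the sketch below uses the `re.MULTILINE` flag, which is not covered in these notes: with that flag `^` also matches at the start of every line, while `\A` still matches only at the very start of the string. The variable names are illustrative.

```python
import re

text = "the sun\nthe moon"

# With re.MULTILINE, ^ matches at the start of each line
caret_hits = re.findall(r"^the", text, flags=re.MULTILINE)
print(caret_hits)  # ['the', 'the']

# \A only ever matches at the start of the whole string
anchor_hits = re.findall(r"\Athe", text, flags=re.MULTILINE)
print(anchor_hits)  # ['the']

# Matching stays case-sensitive, as in the table above
case_hit = re.search(r"\Athe", "The sun")
print(case_hit)  # None
```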

**Tools/Resources:**

- RegEx checker - [https://regex101.com/](https://regex101.com/)

### The `re` module in Python

- What is the `re` module?
- Utilizing the `re` module
- `re` functions

#### What is the `re` module?

- It is a built-in Python module for working with regular expressions.
- It defines several functions and constants to work with RegEx.
- To use it, simply `import re`

#### Utilizing the `re` module

- Use the import statement to get the `re` module into the file

    ```python
    import re
    ```

1. `re.findall()`

    - Returns a list of strings containing all matches

        **Syntax:**

        ```python
        re.findall(<pattern>, <string>)
        ```

        ```python
        # `findall()`
        string_1 = "hello 12 hi. 89 how a876e you? 56"
        pattern_1 = r"[0-9]+"  # Search for 1 or more occurrences of a digit

        string_2 = "johnny76@mail.com"
        # The dot before the domain suffix is escaped, and the suffix is one or more letters
        pattern_2 = r"^[a-zA-Z0-9+_.-]+@[a-zA-Z0-9.-]+\.[a-z]+$"

        result = re.findall(pattern=pattern_1, string=string_1)
        result_2 = re.findall(pattern=pattern_2, string=string_2)

        print(result)    # Output: ['12', '89', '876', '56']
        print(result_2)  # Output: ['johnny76@mail.com']
        ```

2. `re.split()`

    - Splits the string wherever the pattern matches.
    - Returns a list of the substrings between the matches.

        **Syntax:**

        ```python
        re.split(<pattern>, <string>)
        ```

        ```python
        # `split()`

        pattern_3 = "[0-9]+"

        string_3 = "hello 12 hi. 89 how a876e you? 56"
        string_4 = "Twelve:12 Eighty-nine:89 seventy-six:76"

        result_3 = re.split(pattern=pattern_3, string=string_3)
        result_4 = re.split(pattern=pattern_3, string=string_4)

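        print(result_3)  # Output: ['hello ', ' hi. ', ' how a', 'e you? ', '']
        print(result_4)  # Output: ['Twelve:', ' Eighty-nine:', ' seventy-six:', '']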
        ```

3. `re.sub()`

    - Returns a string where matched occurrences are replaced with the replacement string (the `replace` variable in the example).
    - The optional `count` argument limits how many occurrences are replaced. It is 0 by default, which means all occurrences are replaced.

        **Syntax:**

        ```python
        re.sub(<pattern>, <replacement-string>, <string>, <count:optional>)
        ```

        ```python
        # Program to remove all whitespace
        string_5 = "abc 12\n def      15 \n 56 \nghi"

        pattern_4 = r"\s+"  # Raw string, so `\s` reaches the RegEx engine unchanged
        replace = ""

        result_5 = re.sub(pattern_4, replace, string_5)

        print(result_5)  # Output: abc12def1556ghi
        ```

4. `re.search()`

    - The function looks for the first location where the RegEx pattern produces a match with the string.
    - If the search is successful it returns a match object. If not, it returns `None`.

        **Syntax:**

        ```python
        re.search(<pattern>, <string>)
        ```

        ```python
        # `search()`
        # Check if a phone number starts with the country code
        string_6 = "+26 5 899654937"
        string_7 = "08754456323"

        pattern_5 = "^[+][0-9]{1,3}" # Alternatives: "^[+]{1}[0-9]{1,3}", "^(\+|(00))[0-9]{1,3}"

        result_6 = re.search(pattern_5, string_6)
        result_7 = re.search(pattern_5, string_7)

        print(result_6)  # Output: <re.Match object; span=(0, 3), match='+26'>
        print(result_7)  # Output: None
        ```

5. `re.search().group()`

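    - The `group()` method works on a match object, such as the one returned by `re.search()`. It is not available when the search returns `None`.

    - The `group()` method returns the part of the string where the match is. Passing group numbers returns the text matched by the corresponding parenthesized groups.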

        **Syntax:**
        
        ```python
        group()
        ```
        
        ```python
        # `search().group()`
        string_8 = "39801 356, 2102 1111"
        pattern_6 = "([0-9]{3}) ([0-9]{2})"  # The parentheses define two separate groups in the pattern

        result_8 = re.search(pattern_6, string_8).group(1, 2)

        print(result_8)  # Output: ('801', '35')
        ```
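
Because `re.search()` returns `None` when nothing matches, chaining `.group()` directly onto it raises an `AttributeError` for non-matching input. A minimal sketch of a guarded version follows; the function name `country_code` is hypothetical, and the pattern is a variant of the phone number pattern above with the digits placed in a group. The last lines also illustrate the `count` argument of `re.sub()`.

```python
import re


def country_code(phone):
    """Return the digits after a leading '+', or None if there is no country code."""
    match = re.search(r"^\+([0-9]{1,3})", phone)
    if match is None:  # nothing matched, so there is no match object to query
        return None
    return match.group(1)


codes = [country_code(number) for number in ["+26 5 899654937", "08754456323"]]
print(codes)  # ['26', None]

# count=1 replaces only the first run of whitespace
first_only = re.sub(r"\s+", "", "abc 12 def", count=1)
print(first_only)  # abc12 def
```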

#### Raw string processing using `r` prefix

- The `r` or `R` prefix on a string literal marks it as a raw string, in which Python does not treat backslashes as escape characters.
- For example, `"\n"` is a single newline character, whereas `r"\n"` is two characters: a backslash and `n`.

```python
# Raw string prefixing

string_9 = "\n and \r are escape + sequences()[]"
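pattern_7 = r"[\n+()\[\]]"  # `[` and `]` are escaped so they are literal members of the set

result_9 = re.findall(pattern_7, string_9)
print(result_9)  # Output: ['\n', '+', '(', ')', '[', ']']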
```
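
The difference between a normal and a raw string can be seen directly by comparing lengths. The snippet below is a small sketch with illustrative variable names; it also shows that a raw pattern is equivalent to a normal string with every backslash doubled.

```python
import re

normal = "\n"  # one character: a newline
raw = r"\n"    # two characters: a backslash followed by n
lengths = (len(normal), len(raw))
print(lengths)  # (1, 2)

# Without the r prefix, each backslash meant for the RegEx engine must be doubled
same_pattern = r"\s+" == "\\s+"
print(same_pattern)  # True

words = re.split(r"\s+", "abc 12\n def")
print(words)  # ['abc', '12', 'def']
```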


