<a href="https://colab.research.google.com/github/docindata/regex/blob/main/Regex_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 1. Regular Expressions (Regex)

### 1.1 What is Regex?

Regex is a tool available, with similar syntax, in many programming languages. It is used on strings to find patterns of text, numbers and other characters. Matches may be found, removed, manipulated, extracted, and so on.

### 1.2 How to start Regex in Python?

First of all, load the library __re__
```python
import re
```
Then use some of its functions on a pre-defined string. 

### 1.3 The Most Important Functions in re


1. `re.findall` is the most basic: it scans the whole string and returns every non-overlapping match of the pattern as a list. It is not always optimal if the string is very long and we're looking for one match only.
2. `re.match` looks for a match only at the beginning of the string. It returns a single match object, or `None` if the string does not start with the pattern.
3. `re.search` looks for the first match anywhere in the string and stops after finding it.
4. `re.finditer` returns an iterator that yields one match at a time, so the search only continues when the next item is requested. The iteration can be conditional, for example stopping after 4 matches.
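
The four search functions above can be compared side by side. This is a minimal sketch; the string `text` and the variable names are illustrative and not part of the original tutorial.

```python
import re

# Illustrative input, not taken from the tutorial
text = "one 1 two 22 three 333"

# findall: every non-overlapping match, returned as a list of strings
all_numbers = re.findall(r"\d+", text)
print(all_numbers)            # ['1', '22', '333']

# match: only succeeds if the pattern matches at the very start of the string
at_start = re.match(r"\d+", text)
print(at_start)               # None, because the string starts with "one"

# search: first match anywhere in the string
first_number = re.search(r"\d+", text)
print(first_number.group())   # 1

# finditer: lazily yields match objects, so we can stop early
first_two = []
for m in re.finditer(r"\d+", text):
    first_two.append(m.group())
    if len(first_two) == 2:
        break
print(first_two)              # ['1', '22']
```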

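1. `Match.group` is used to access sub-groups within the found pattern. It is a method of the match object returned by `re.match`, `re.search` and `re.finditer`, not a function of `re` itself. We'll go into groups further on.
2. `re.sub` finds the pattern and substitutes a replacement for every match.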
3. `re.split` splits the string at the place of a specific pattern.
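
A short sketch of these three operations, assuming an illustrative date string that does not come from the tutorial. Note that `group` is called on the match object rather than on `re`.

```python
import re

# Illustrative input, not taken from the tutorial
text = "2021-03-15, 2022-11-02"

# group: access the whole match (0) or a parenthesised sub-group (1, 2, ...)
match = re.search(r"(\d{4})-(\d{2})-(\d{2})", text)
whole = match.group(0)
year = match.group(1)
print(whole, year)   # 2021-03-15 2021

# sub: replace every match of the pattern
masked = re.sub(r"\d", "#", text)
print(masked)        # ####-##-##, ####-##-##

# split: cut the string wherever the pattern matches
parts = re.split(r",\s*", text)
print(parts)         # ['2021-03-15', '2022-11-02']
```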

### 1.4 The Regex Syntax

First of all, a regex should be written in __raw__ string format.
```python
r"regex_body"
```
and not as a normal string
```python
"regex_body"
```
because the backslash `\` is used a lot in regex, and in a normal string Python interprets sequences such as `\b` or `\n` as escape characters, which alters the pattern before the regex engine sees it.
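
The difference can be seen with `\b`, which is a backspace character in a normal Python string but a word boundary in a regex. A minimal sketch, with an illustrative sentence as input:

```python
import re

plain = "\bcat\b"    # Python turns each "\b" into a single backspace character
raw = r"\bcat\b"     # the backslashes reach the regex engine untouched

print(len(plain), len(raw))             # 5 7

# The plain pattern looks for backspace + "cat" + backspace, which is not in the text
plain_result = re.findall(plain, "a cat sat")
# The raw pattern looks for the whole word "cat"
raw_result = re.findall(raw, "a cat sat")
print(plain_result, raw_result)         # [] ['cat']
```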

Second, to leverage the full power of regex, __metacharacters__ are used to look for patterns, for example
```python
r"\w+\s\d+"
```
which will look for one or more word characters, followed by a single whitespace character and one or more digits.
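
Applied to a sample sentence (an illustrative string, not from the tutorial), the pattern behaves as follows:

```python
import re

# Illustrative input, not taken from the tutorial
text = "room 101 and floor 7"

# word characters, one whitespace character, then digits
found = re.findall(r"\w+\s\d+", text)
print(found)   # ['room 101', 'floor 7']
```

The word "and" is skipped because it is followed by a space and a letter, not by digits.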

Here is a table of character classes, the metacharacters used to match any single character of a certain type.  

#### Character classes

| Character class | Use |
| --- | --- |
|. | matches any character except \n. This is called the wildcard |
|\w | matches a word character: 0-9, a-z, A-Z and _ |
|\d | matches a digit 0-9 |
|\s | matches whitespace, including a newline |
|\W | matches any character that is NOT 0-9, a-z, A-Z or _ |
|\D | matches any character that is NOT a digit 0-9 |
|\S | matches any character that is NOT whitespace |  
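
Each class in the table matches one character at a time, which a quick sketch makes visible. The sample strings are illustrative only.

```python
import re

sample = "a_1 !"   # illustrative input, not from the tutorial

any_char = re.findall(r".", "a\nb")             # wildcard skips the newline
word_chars = re.findall(r"\w", sample)
non_word = re.findall(r"\W", sample)
digits = re.findall(r"\d", sample)
non_digits = re.findall(r"\D", sample)
whitespace = re.findall(r"\s", "a b\tc\n")      # space, tab and newline all count
non_whitespace = re.findall(r"\S", "a b\n")

print(any_char)         # ['a', 'b']
print(word_chars)       # ['a', '_', '1']
print(non_word)         # [' ', '!']
print(digits)           # ['1']
print(non_digits)       # ['a', '_', ' ', '!']
print(whitespace)       # [' ', '\t', '\n']
print(non_whitespace)   # ['a', 'b']
```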

Anchors can be used to pin the match to a specific position, e.g. the beginning of the string.  

#### Anchors

| Anchors | Use |
| --- | --- |
|^ | matches beginning of str, or of each line with `re.MULTILINE` |
|$ | matches end of str, or of each line with `re.MULTILINE` | 
|\b | matches at a word boundary (beginning or end of a word) |
|\A | matches beginning of str only |
|\B | matches where there is NO word boundary |
|\Z | matches end of str only |
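
The anchors can be told apart on a two-line string. This is a sketch with an illustrative input; `re.MULTILINE` is the standard flag that makes `^` and `$` work per line.

```python
import re

# Illustrative two-line input, not taken from the tutorial
text = "cat concat\ncat"

start_default = re.findall(r"^cat", text)                    # start of string only
start_multiline = re.findall(r"^cat", text, re.MULTILINE)    # start of every line
end_default = re.findall(r"cat$", text)                      # end of string only
end_multiline = re.findall(r"cat$", text, re.MULTILINE)      # end of every line
whole_word = re.findall(r"\bcat\b", text)                    # "cat" as a separate word
inside_word = re.findall(r"\Bcat", text)                     # "cat" glued to a preceding letter
string_start = re.findall(r"\Acat", text, re.MULTILINE)      # \A ignores MULTILINE
string_end = re.findall(r"cat\Z", text, re.MULTILINE)        # \Z ignores MULTILINE

print(len(start_default), len(start_multiline))   # 1 2
print(len(end_default), len(end_multiline))       # 1 2
print(len(whole_word), len(inside_word))          # 2 1
print(len(string_start), len(string_end))         # 1 1
```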

Note that a character class on its own matches exactly one character. To match repetitions we can use quantifiers.

#### Quantifiers

| Quantifier | Use |
| --- | --- |
| {n} | occurs exactly n times |
| {n,} | occurs n times or more |
| {m,n} | occurs between m and n times (write it without a space after the comma) |
| + | occurs once or more |
| * | occurs zero times or more |
| ? | occurs zero times or once |
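
A sketch of each quantifier on short illustrative strings. The word boundaries `\b` are added here so that each run of letters is tested as a whole.

```python
import re

text = "a aa aaa aaaa"   # illustrative input, not from the tutorial

exactly_two = re.findall(r"\ba{2}\b", text)
two_or_more = re.findall(r"\ba{2,}\b", text)
two_to_three = re.findall(r"\ba{2,3}\b", text)
one_or_more = re.findall(r"a+", text)
zero_or_more = re.findall(r"ab*", "a ab abb")         # "a" followed by any number of "b"
zero_or_one = re.findall(r"colou?r", "color colour")  # the "u" is optional

print(exactly_two)    # ['aa']
print(two_or_more)    # ['aa', 'aaa', 'aaaa']
print(two_to_three)   # ['aa', 'aaa']
print(one_or_more)    # ['a', 'aa', 'aaa', 'aaaa']
print(zero_or_more)   # ['a', 'ab', 'abb']
print(zero_or_one)    # ['color', 'colour']
```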

#### Looking for multiple patterns with sets

| Set | Use |
| --- | --- |
| [abc] | looks for a or b or c |
| [^abc] | looks for any character but a, b or c (negation if ^ is first) |
| [a^bc] | looks for a, ^, b or c |
| [a-z] | looks for all lowercase letters |
| [a-zA-Z] | looks for all letters |
| [a-zA-Z0-9] | looks for all letters and digits |
| [a-z][0-9] | looks for a lowercase letter followed by a digit |
| [a-z][0-9]+ | looks for a lowercase letter followed by one or more digits |
| [a-z]{3,}[0-9]+ | looks for at least 3 lowercase letters followed by one or more digits |
| [a-z][0-9]{2} | looks for a lowercase letter followed by exactly two digits, e.g. a12 (matching a1b2 would need ([a-z][0-9]){2}) |
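
Some of the sets from the table, tried on illustrative strings that are not part of the tutorial:

```python
import re

chars = "cab^d"              # illustrative input
codes = "a1 b22 abc333 X9"   # illustrative input

any_of = re.findall(r"[abc]", chars)
none_of = re.findall(r"[^abc]", chars)        # ^ first: negation
literal_caret = re.findall(r"[a^bc]", chars)  # ^ elsewhere: a plain character
letter_digits = re.findall(r"[a-z][0-9]+", codes)
three_letters = re.findall(r"[a-z]{3,}[0-9]+", codes)
two_digits = re.findall(r"[a-z][0-9]{2}", codes)

print(any_of)          # ['c', 'a', 'b']
print(none_of)         # ['^', 'd']
print(literal_caret)   # ['c', 'a', 'b', '^']
print(letter_digits)   # ['a1', 'b22', 'c333']
print(three_letters)   # ['abc333']
print(two_digits)      # ['b22', 'c33']
```

`X9` is never matched because the sets only contain lowercase letters.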

#### Looking for characters that are regex special characters

| Syntax | Use |
| --- | --- |
| [.] | looks for . as a literal |
| \\. | looks for . as a literal |
| note | the same applies to ?, *, \ and all other special characters |
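
The effect of escaping can be checked on a small illustrative example. `re.escape` is not mentioned in the tutorial; it is a standard `re` helper, shown here as an optional extra that escapes a whole literal string at once.

```python
import re

text = "file.txt fileAtxt"   # illustrative input, not from the tutorial

unescaped = re.findall(r"file.txt", text)     # "." matches any character
backslash = re.findall(r"file\.txt", text)    # "\." matches a literal dot
in_a_set = re.findall(r"file[.]txt", text)    # "[.]" matches a literal dot

print(unescaped)   # ['file.txt', 'fileAtxt']
print(backslash)   # ['file.txt']
print(in_a_set)    # ['file.txt']

# Optional: let re.escape add the backslashes for a literal search term
literal = re.findall(re.escape("1+1=2?"), "is 1+1=2? yes")
print(literal)     # ['1+1=2?']
```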

#### Optional matches

To match both car and carpet with one pattern, group the optional part and follow it with `?`: `car(pet)?`
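
A sketch of the optional match in practice. One detail worth knowing: when the pattern contains a capturing group, `re.findall` returns the group contents rather than the whole match, so a non-capturing group `(?:...)` or `re.finditer` is used below to get the full words. The input string is illustrative.

```python
import re

text = "car and carpet"   # illustrative input

# With a capturing group, findall returns only what the group captured
groups_only = re.findall(r"car(pet)?", text)
print(groups_only)    # ['', 'pet']

# A non-capturing group keeps the grouping but returns the whole match
whole_words = re.findall(r"car(?:pet)?", text)
print(whole_words)    # ['car', 'carpet']

# finditer gives match objects, whose group() is always the whole match
via_finditer = [m.group() for m in re.finditer(r"car(pet)?", text)]
print(via_finditer)   # ['car', 'carpet']
```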

### 1.5 The Regular Expression Engine

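There are __5 key concepts__ (plus one extra) to understand how the engine works behind the scenes. 

1. _One character at a time_: The engine evaluates and detects one matching character at a time. Quantifiers are used to modify this behaviour. 
2. _Left to right_: The engine scans the string from left to right and tries the alternatives of `|` in the order they are written, keeping the first one that succeeds. Specify boundaries using `\bword\b`, or put the specific alternative first and THEN the general one.
3. _Greedy, lazy and backtracking_: Quantifiers are greedy by default and take as many characters as they can. Adding `?` after a quantifier (e.g. `+?`) makes it lazy. When the rest of the pattern fails, the engine backtracks and gives characters back.
4. _Groups_
5. _Look ahead and look behind_

6. __Extra!__ _What is matched once is not queried again_: Matches do not overlap, so the search resumes after the end of the previous match.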


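```python
import re

# Matches do not overlap: each "aa" is consumed before the search resumes
string = "aaaaa"
regex = r"aa"
re.findall(regex, string)
# ['aa', 'aa']
```

```python
# Alternatives are tried left to right, so "car" succeeds before "carpet" is tried
string = "carpet and car"
regex = r"car|carpet"

re.findall(regex, string)
# ['car', 'car']
```

```python
# Word boundaries force the engine to backtrack and try "carpet"
string = "carpet and car"
regex = r"\b(car|carpet)\b"

re.findall(regex, string)
# ['carpet', 'car']
```

```python
# Specific alternative first, THEN the general one
string = "carpet and car"
regex = r"carpet|car"

re.findall(regex, string)
# ['carpet', 'car']
```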
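
The remaining concepts from the list (greedy versus lazy matching, groups, and look ahead / look behind) can be sketched in the same way. The input strings and variable names below are illustrative assumptions, not part of the original notebook.

```python
import re

# Greedy vs lazy: ".+" takes as much as possible, ".+?" as little as possible
html = "<b>bold</b> and <i>italic</i>"
greedy = re.findall(r"<.+>", html)
lazy = re.findall(r"<.+?>", html)
print(greedy)   # ['<b>bold</b> and <i>italic</i>']
print(lazy)     # ['<b>', '</b>', '<i>', '</i>']

# Groups: parentheses capture parts of the match for later access
match = re.search(r"(\w+)@(\w+)\.com", "mail bob@example.com")
user, domain = match.groups()
print(user, domain)   # bob example

# Look ahead: digits only if they are followed by " USD" (which is not consumed)
usd_amounts = re.findall(r"\d+(?= USD)", "10 USD, 20 EUR, 30 USD")
print(usd_amounts)    # ['10', '30']

# Look behind: digits only if they are preceded by "$" (which is not consumed)
dollar_amounts = re.findall(r"(?<=\$)\d+", "$5 and 7 and $12")
print(dollar_amounts) # ['5', '12']
```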