## 1. Regular Expressions (Regex)

These are my personal notes from learning Regex. They are by no means a comprehensive guide.

### 1.1 What is Regex?

Regex is a pattern-matching tool available, with similar syntax, in quite a few programming languages. It is used on strings to find patterns of text, numbers and other characters. Matches can be found, removed, manipulated, extracted, etc.

### 1.2 How to start Regex in Python?

First of all, import the __re__ module, which is part of Python's standard library
```python
import re
```
Then use some of its functions on a pre-defined string. 

### 1.3 The Most Important Functions in the re module


1. `re.findall` is the most basic. It scans the whole string and returns every non-overlapping match as a list. Not always optimal if the string is very long and we're looking for one match only.
2. `re.match` looks for a single match, and only at the beginning of the string. It returns a match object, or `None` if the string does not start with the pattern.
3. `re.search` looks for a single match anywhere in the string. Stops after finding one match.
4. `re.finditer` returns an iterator that yields one match object at a time; the next match is only looked for when it is iterated on again. The iteration can be conditional, for example stopping after 4 matches.

The following are used to work with what was found:

1. `.group()` is a method of the match object (as returned by `re.match`, `re.search` and `re.finditer`), not a function of the `re` module. It is used to access sub-groups within the found pattern. We'll go into groups further on.
2. `re.sub` finds the pattern and substitutes it with a replacement.
3. `re.split` splits the string wherever the pattern matches.
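
To see how these functions differ, here is a small sketch run on one made-up sample string (`text` and the patterns are only for illustration):

```python
import re
from itertools import islice

text = "cat bat cat"  # made-up sample string

# findall returns every non-overlapping match as a list of strings
all_cats = re.findall(r"cat", text)

# match only succeeds if the pattern matches at position 0
at_start = re.match(r"bat", text)      # None, because "bat" is not at the start

# search scans forward and returns the first match anywhere
first_bat = re.search(r"bat", text)    # a match object

# finditer yields match objects lazily, one per iteration
spans = [m.span() for m in re.finditer(r"cat", text)]

# the iteration can be cut short, here after 2 matches
first_two = [m.group() for m in islice(re.finditer(r"\wat", text), 2)]

# sub replaces every match, split cuts the string at every match
replaced = re.sub(r"cat", "dog", text)
pieces = re.split(r"\s", text)

print(all_cats, at_start, first_bat.span(), spans, first_two, replaced, pieces)
```

Note that `re.match` and `re.search` return a match object (or `None`), while `re.findall` returns plain strings.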

### 1.4 The Regex Syntax

First of all, a regex should be written as a __raw__ string.
```python
r"regex_body"
```
and not as a normal string
```python
"regex_body"
```
because the backslash `\` is used a lot in regex, and in normal strings Python treats `\` as the start of an escape sequence (for example `\b` becomes a backspace character), which alters the pattern before the regex engine ever sees it.
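
A minimal sketch of the difference, using `\b` as the example (the sample sentence is made up):

```python
import re

# In a normal string "\b" is turned into a single backspace character
normal = "\bcat\b"
# In a raw string the backslash survives, so the regex engine sees \b (word boundary)
raw = r"\bcat\b"

len_normal = len(normal)   # backspace + c + a + t + backspace
len_raw = len(raw)         # \ b c a t \ b

found_normal = re.findall(normal, "a cat sat")   # looks for literal backspaces
found_raw = re.findall(raw, "a cat sat")         # looks for the whole word cat

print(len_normal, len_raw, found_normal, found_raw)
```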

Second of all, to leverage the full power of regex, __metacharacters__ are used to look for patterns, see below
```python
r"\w+\s\d+"
```
which will look for one or more word characters, followed by a single whitespace character and one or more digits.
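
Applied to a made-up sentence, that pattern could be tried like this (only the sample `text` is my own assumption):

```python
import re

text = "room 101 and floor 7"  # made-up sample string

# word characters, one whitespace character, then digits
matches = re.findall(r"\w+\s\d+", text)
print(matches)
```

The word `and` is skipped because it is followed by a space and then a letter, not a digit.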

Here is a table of some metacharacters known as character classes, which are used to find any character, or characters of a certain type.  

#### Character classes

| Character class | Use |
| --- | --- |
|. | matches any character except \n. This is called the wildcard |
|\w | matches a word character: 0-9, a-z, A-Z and _ |
|\d | matches a digit 0-9 |
|\s | matches whitespace, including newline |
|\W | matches anything that is NOT 0-9, a-z, A-Z or _ |
|\D | matches anything that is NOT a digit 0-9 |
|\S | matches anything that is NOT whitespace |  
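
A quick sketch of the character classes on a made-up two-line string (the `+` quantifier is only there to group neighbouring characters into one match):

```python
import re

text = "Order_7 shipped\non 2022-06-20"  # made-up sample, note the newline

words = re.findall(r"\w+", text)       # runs of 0-9, a-z, A-Z and _
numbers = re.findall(r"\d+", text)     # runs of digits
whitespace = re.findall(r"\s", text)   # spaces and the newline
non_word = re.findall(r"\W", text)     # everything \w does not match
lines = re.findall(r".+", text)        # the wildcard stops at \n

print(words, numbers, whitespace, non_word, lines)
```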

Anchors can be used to pin the match to a specific position, e.g. the beginning of the string.  

#### Anchors

| Anchors | Use |
| --- | --- |
|^ | matches beginning of str, or of each line with re.MULTILINE |
|$ | matches end of str, or of each line with re.MULTILINE |
|\b | matches at a word boundary, i.e. the beginning or end of a word |
|\A | matches beginning of str only |
|\B | matches anywhere that is NOT a word boundary |
|\Z | matches end of str only |
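
The anchors can be compared on a made-up two-line string. This is only a sketch; `re.MULTILINE` is the standard flag that makes `^` and `$` work per line:

```python
import re

text = "cat concat\ncatalog cat"  # made-up sample, two lines

# ^ anchors to the start of the whole string by default...
start_default = re.findall(r"^cat", text)
# ...and to the start of every line with re.MULTILINE
start_multiline = re.findall(r"^cat", text, re.MULTILINE)

# \b on both sides: only the standalone word "cat"
whole_words = re.findall(r"\bcat\b", text)

# \B before "cat": only a "cat" that sits inside a word (here "concat")
inside_word = re.findall(r"\Bcat", text)

# \Z only matches at the very end, even with re.MULTILINE
end_only = re.findall(r"cat\Z", text, re.MULTILINE)

print(start_default, start_multiline, whole_words, inside_word, end_only)
```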

Note that a character class on its own matches exactly one character. To match several in a row we can use quantifiers.

#### Quantifiers

| Quantifier | Use |
| --- | --- |
| {n} | occurs exactly n times |
| {n,} | occurs n times or more |
| {m,n} | occurs between m and n times (written without a space after the comma) |
| + | occurs once or more |
| * | occurs zero times or more |
| ? | occurs zero times or once |
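
A small sketch of the quantifiers in action (the sample strings are made up, and `\b` is used so each run of letters is tested as a whole word):

```python
import re

text = "a aa aaa aaaa"  # made-up sample string

exactly_two = re.findall(r"\ba{2}\b", text)
two_or_more = re.findall(r"\ba{2,}\b", text)
two_to_three = re.findall(r"\ba{2,3}\b", text)
one_or_more = re.findall(r"a+", text)

# * allows zero occurrences, ? allows zero or one
zero_or_more = re.findall(r"ab*", "a ab abb")
optional_u = re.findall(r"colou?r", "color colour")

print(exactly_two, two_or_more, two_to_three, one_or_more, zero_or_more, optional_u)
```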

#### Looking for multiple patterns with sets

| Set | Use |
| --- | --- |
| [abc] | looks for a or b or c |
| [^abc] | looks for any character except a, b or c (negation if ^ is first) |
| [a^bc] | looks for a, ^, b or c |
| [a-z] | looks for any lowercase letter |
| [a-zA-Z] | looks for any letter |
| [a-zA-Z0-9] | looks for any letter or digit |
| [a-z][0-9] | looks for a lowercase letter followed by a digit |
| [a-z][0-9]+ | looks for a lowercase letter followed by one or more digits |
| [a-z]{3,}[0-9]+ | looks for at least 3 lowercase letters followed by one or more digits |
| [a-z][0-9]{2} | looks for a lowercase letter followed by exactly two digits, e.g. a12 (to match a1b2, group the pair: ([a-z][0-9]){2}) |
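
Some of the sets from the table, tried on a made-up string (a sketch only):

```python
import re

text = "a1 b22 ccc333 ^ D4"  # made-up sample string

letter_digit = re.findall(r"[a-z][0-9]", text)
letter_digits = re.findall(r"[a-z][0-9]+", text)
three_letters = re.findall(r"[a-z]{3,}[0-9]+", text)
two_digits = re.findall(r"[a-z][0-9]{2}", text)

# ^ negates only when it comes first inside the brackets
negated = re.findall(r"[^a-z0-9 ]", text)
literal_caret = re.findall(r"[a^]", text)

print(letter_digit, letter_digits, three_letters, two_digits, negated, literal_caret)
```

`D4` is never matched by the `[a-z]` patterns because the set only contains lowercase letters.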

#### Looking for characters that are regex special characters

| Syntax | Use |
| --- | --- |
| [.] | looks for . as a literal |
| \\. | looks for . as a literal |
| note | the same applies to ?, *, \ and all other special characters |
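
A sketch of what escaping changes, on a made-up string. `re.escape` is a standard helper in the `re` module that escapes a whole literal string; it is not mentioned in the table and is added here as a convenience:

```python
import re

text = "version 3.14 or 3x14?"  # made-up sample string

unescaped = re.findall(r"3.14", text)    # . is a wildcard, so 3x14 matches too
with_set = re.findall(r"3[.]14", text)   # . inside a set is a literal
with_slash = re.findall(r"3\.14", text)  # \. is a literal

# re.escape builds the escaped pattern for us
escaped = re.escape("3x14?")
found_escaped = re.findall(escaped, text)

print(unescaped, with_set, with_slash, escaped, found_escaped)
```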

#### Optional matches

To match both car and carpet, make the pet part optional by putting it in a group followed by `?`: `car(pet)?`
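
A minimal sketch of the optional group (the sample sentence is made up). Note that `re.findall` returns the content of a capturing group rather than the whole match, so a non-capturing group `(?:...)` is used to get the full words back:

```python
import re

text = "a car on a carpet"  # made-up sample string

# non-capturing group: findall returns the whole match
whole = re.findall(r"car(?:pet)?", text)

# capturing group: findall returns only what the group caught
captured = re.findall(r"car(pet)?", text)

print(whole, captured)
```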

### 1.5 The Regular Expression Engine

There are a few __key concepts__ that explain how the engine works behind the scenes. 

#### 1 _One character at a time_

The engine evaluates and detects one matching character at a time. Quantifiers are used to modify this behaviour when it is not desired.

#### 2 _Left to right_

The regex engine reads both the string and the pattern from left to right, which implies:
* When searching for sun|sunset, the alternatives are tried in the order they are written at each position in the string: sun first, and sunset only if sun fails there.
* The pattern sun is a substring of the word sunset, so wherever sunset occurs, sun already matches and the engine moves on past it. Text that was matched by the first alternative, sun, is not matched again by the second alternative, sunset. So it's possible that sunset won't be found even though it is present in the string.

This can be solved by putting the more specific pattern before the more general one. 

Example below

```python
import re
string = "sunset is when the sun is gone"
regex = r"sun|sunset"

print(re.findall(regex, string), "\n\nnote that sunset wasn't found even though it was looked for\n\nSo let's try again")
```

```
['sun', 'sun'] 

note that sunset wasn't found even though it was looked for

So let's try again
```

```python
regex = r"sunset|sun"
print(re.findall(regex, string), "\n\nNow we found both sun and sunset\n\nLet's try a different method")
```

```
['sunset', 'sun'] 

Now we found both sun and sunset

Let's try a different method
```

```python
regex = r"\b(sun|sunset)\b"
print(re.findall(regex, string), "\n\nThis worked because we used the anchor boundary \\b")
```

```
['sunset', 'sun'] 

This worked because we used the anchor boundary \b
```


#### 3 _Greedy, lazy and backtracking_

Quantifiers are greedy by default. When a search is greedy it takes as much of the string as it can before stopping, and then gives characters back (backtracks) until the rest of the pattern fits. A good analogy is that if you're greedy and driving you'd go all the way until the highway is over and then drive back to look for your exit. If you're lazy you'd just check for your exit each time you see one, and when you find it you'd drive out and stop looking.

```python
string = "Will the algo stop at this dot.? Maybe not."
regex = r"algo.+\."
print(re.findall(regex, string), "\n\nThe greedy quantifier went too far")
```

```
['algo stop at this dot.? Maybe not.'] 

The greedy quantifier went too far
```

```python
# Below we'll add ? to the + quantifier to make it lazy
regex = r"algo.+?\."
print(re.findall(regex, string),  "\n\nThe lazy quantifier stopped at the first dot, like we wanted")
```

```
['algo stop at this dot.'] 

The lazy quantifier stopped at the first dot, like we wanted
```


Let's do another more extensive example where metacharacters are used

```python
string = "start with 123. end with 789."
regex = r".+(\d+)[.]"

print(re.findall(regex, string), "\n\nQuite unexpected, I thought the output would be 789")
```

```
['9'] 

Quite unexpected, I thought the output would be 789
```


This must be because the greedy wildcard `.+` caught the digits 78 as part of its own match. It went all the way to the end of the string, backtracked one step to release the final dot, then backtracked one more step to release the 9. That single digit was enough to satisfy `\d+`, so the engine continued forward again and matched the dot.
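
One way to check this explanation is to put the wildcard in a group of its own and look at what each part ended up with. A small sketch (the extra parentheses around `.+` are my addition, for inspection only):

```python
import re

string = "start with 123. end with 789."

# same pattern as above, with the wildcard captured as group 1
m = re.search(r"(.+)(\d+)[.]", string)

wildcard_part = m.group(1)   # everything the greedy .+ kept
digit_part = m.group(2)      # what was left over for \d+

print(repr(wildcard_part), repr(digit_part))
```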

```python
regex = r".+?(\d+)[.]"

print(re.findall(regex, string), "\n\nWorked just like we wished")
```

```
['123', '789'] 

Worked just like we wished
```


Another way, rather than adjusting the wildcard, is to look for any non-digits followed by digits, by using the negation mark ^ within a set.

```python
regex = r"[^\d]*(\d+)[.]"

print(re.findall(regex, string))
```

```
['123', '789']
```


What if we just want to catch the last digits?

```python
regex = r"[^\d]*(\d+)[.]$"

print(re.findall(regex, string))
```

```
['789']
```


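Alternation can also be nested inside groups. When the pattern has several groups, `re.findall` returns one tuple per match with one entry per group (an empty string if that group did not take part in the match).

```python
string = "sunshining at sunshine, and shunny at sundown"

regex = r"\bsun(shin(ing|e)|down)\b"

re.findall(regex, string)
```

```
[('shining', 'ing'), ('shine', 'e'), ('down', '')]
```

```python
string = "twenty popular recipes to make applepaste at the applepastry that sells applepie"

regex = r"\b(apple(past(ry|e)|pie))\b"

re.findall(regex, string)
```

```
[('applepaste', 'paste', 'e'),
 ('applepastry', 'pastry', 'ry'),
 ('applepie', 'pie', '')]
```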

#### 4 _Groups_

A group is any part of the pattern enclosed in parentheses `()`. Groups let you break up text into subpatterns and selectively extract/capture what you want within a bigger pattern. Groups can be accessed with backreferences, either through an automatically assigned index or by name.
Below is a table with some group-related syntax.
Naming the groups is safer, as indexes will change if you add something to the pattern later.

|Regex|Use|
|---|---|
|`(exp)`|Creates a capturing, indexed group|
|`(?:exp)`|Non-capturing group|
|`(?P<name>exp)`|Creates a named group|
|`(?P=name)`|Refers back to a group by name, inside the pattern|
|`\1`, `\2`, ...|Refers back to a group by number|
|`\g<name>`|Inserts the named group's text in a `re.sub` replacement string|

```python
string = "2022-06-20"
regex = r"(?P<year>\b\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
date = re.findall(regex, string)
```
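
`re.findall` returns the groups as plain tuples, so the names are lost. To read the groups by name or by index, a match object from `re.search` can be used instead. A minimal sketch with the same date pattern:

```python
import re

string = "2022-06-20"
regex = r"(?P<year>\b\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"

# findall: one tuple per match, groups in index order
as_tuples = re.findall(regex, string)

# search: a match object whose groups can be read by name or by index
m = re.search(regex, string)
year = m.group("year")
month = m.group(2)        # groups are numbered from 1, group 0 is the whole match
parts = m.groupdict()     # all named groups as a dict

print(as_tuples, year, month, parts)
```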

```python
string = "Are there any duplicated words words?"
# regex that finds duplicated word by index number
regex = r"(\b\w+\b)\s+\1"

print(re.findall(regex, string))
```

```
['words']
```


Now the same just using named groups instead of numbered.

```python
# regex that finds the duplicated word and names it word
regex = r"(?P<word>\w+)\s+(?P=word)"

print(re.findall(regex, string))
```

```
['words']
```


The one below deletes the duplicated word.

```python
re.sub(regex, r"\g<word>", string)
```

```
'Are there any duplicated words?'
```

### 1.6 Final Words

I hope this has been helpful to any reader.