# Regular expressions, interactive examples

In [16]:
import sys
sys.path.append("../src/")

In [17]:
# Use the autoreload extension so that imported modules stay up to date without restarting the notebook
%load_ext autoreload
%autoreload 2


In [18]:
from regex_intro import show_regex_search_result

## Part 0: The math

Regular expressions were introduced by the mathematician Stephen Kleene, who used them to describe what he called regular languages. They are defined as follows:

Given a finite alphabet $\Sigma$, the following are regular expressions:

- $\emptyset$, denoting the empty set, is a regular expression
- For every $a_i \in \Sigma$, **$a_i$** is a regular expression
- If $x$ and $y$ are regular expressions, then $(x|y)$ (alternation), $(xy)$ (concatenation) and $(x^*)$ (zero or more repetitions) are regular expressions
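The three operations in this definition map directly onto the syntax of Python's `re` module. The following is a minimal sketch, assuming the alphabet $\{a, b\}$; the patterns and sample strings are illustrative and not part of the original notebook.

```python
import re

# Alternation (x|y), concatenation (xy) and Kleene star (x*) over the alphabet {a, b}
union = re.compile(r"a|b")
concatenation = re.compile(r"ab")
star = re.compile(r"(ab)*")

# fullmatch() only succeeds if the whole string belongs to the language
print(bool(union.fullmatch("a")))           # True
print(bool(union.fullmatch("ab")))          # False
print(bool(concatenation.fullmatch("ab")))  # True
print(bool(star.fullmatch("")))             # True, zero repetitions are allowed
print(bool(star.fullmatch("ababab")))       # True
print(bool(star.fullmatch("aba")))          # False
```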

## Part 1: Searching

### Simply searching for a string

Similarly to a normal search bar, you can of course just search for a given string.

In [19]:
regex = r"test"
test_string = "This will match test in this string"

show_regex_search_result(regex, test_string)  # this basically calls re.compile and then highlights all occurrences of the pattern in the given text

This will match [92m[4mtest[0m in this string
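The helper `show_regex_search_result` lives in `regex_intro` and its source is not shown here. As a rough sketch of what such a helper might do, the hypothetical function `highlight_matches` below wraps every match in ANSI escape codes for green, underlined text. The function name and the exact codes are assumptions, chosen to resemble the `[92m[4m...[0m` markers visible in the outputs.

```python
import re

# Assumed ANSI codes: bright green + underline, then reset
GREEN_UNDERLINE = "\033[92m\033[4m"
RESET = "\033[0m"


def highlight_matches(pattern: str, text: str) -> str:
    """Return `text` with every non-overlapping match wrapped in ANSI codes."""
    compiled = re.compile(pattern)
    # sub() with a callable rewrites each match found while scanning left to right
    return compiled.sub(lambda m: f"{GREEN_UNDERLINE}{m.group(0)}{RESET}", text)


print(highlight_matches(r"test", "This will match test in this string"))
```

In a terminal or notebook that interprets ANSI codes, the matched text is shown highlighted; in a plain-text export the codes appear literally.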


### Special character classes

You can match any one character out of a set using `[]` brackets. For example:

```
[xy] - Matches a single x or y
[a-e] - Matches a single lowercase letter between a and e
```

The same is possible with digits, of course.
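As a small illustration of ranges over digits, here is a sketch using `re.findall` from the standard library. The sample text is made up for this example.

```python
import re

text = "Room 42, floor 7, building B3"

# Every single digit
print(re.findall(r"[0-9]", text))     # ['4', '2', '7', '3']
# Only the digits 0 through 4
print(re.findall(r"[0-4]", text))     # ['4', '2', '3']
# Ranges can be combined inside one pair of brackets
print(re.findall(r"[A-C0-9]", text))  # ['4', '2', '7', 'B', '3']
```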

In [20]:
regex = r"[es]"

test_string = "This searches for all occurences of es in this string, in combination and on its own"

show_regex_search_result(regex, test_string)

Thi[92m[4ms[0m [92m[4ms[0m[92m[4me[0march[92m[4me[0m[92m[4ms[0m for all occur[92m[4me[0mnc[92m[4me[0m[92m[4ms[0m of [92m[4me[0m[92m[4ms[0m in thi[92m[4ms[0m [92m[4ms[0mtring, in combination and on it[92m[4ms[0m own


In [21]:
regex = r"[a-k]"

test_string = "This searches for all occurences between a and k in a given string"

show_regex_search_result(regex,test_string)

T[92m[4mh[0m[92m[4mi[0ms s[92m[4me[0m[92m[4ma[0mr[92m[4mc[0m[92m[4mh[0m[92m[4me[0ms [92m[4mf[0mor [92m[4ma[0mll o[92m[4mc[0m[92m[4mc[0mur[92m[4me[0mn[92m[4mc[0m[92m[4me[0ms [92m[4mb[0m[92m[4me[0mtw[92m[4me[0m[92m[4me[0mn [92m[4ma[0m [92m[4ma[0mn[92m[4md[0m [92m[4mk[0m [92m[4mi[0mn [92m[4ma[0m [92m[4mg[0m[92m[4mi[0mv[92m[4me[0mn str[92m[4mi[0mn[92m[4mg[0m


So to match all word characters, we would need to write `[a-zA-Z0-9_]`. That gets tedious to write every time, so let's use some abbreviations:

```
\w - Matches any word character
\d - Matches a digit
\s - Matches a whitespace character
.  - Matches any character (except a newline)
```
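One caveat worth knowing: for `str` patterns in Python 3, `\w` is Unicode-aware by default, so it matches more than `[a-zA-Z0-9_]`. The sketch below uses the `re.ASCII` flag to restrict it to the ASCII set; the sample text is an assumption made for illustration.

```python
import re

text = "na\u00efve_user 42"  # "naïve_user 42"

# By default, \w also matches non-ASCII letters such as ï
print(re.findall(r"\w+", text))            # ['naïve_user', '42']
# With re.ASCII, \w behaves exactly like [a-zA-Z0-9_]
print(re.findall(r"\w+", text, re.ASCII))  # ['na', 've_user', '42']
print(re.findall(r"[a-zA-Z0-9_]+", text))  # ['na', 've_user', '42']
```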

In [22]:
regex = r"\w"

test_string = "This searches for all Word characters, so basically everything"

show_regex_search_result(regex,test_string)

[92m[4mT[0m[92m[4mh[0m[92m[4mi[0m[92m[4ms[0m [92m[4ms[0m[92m[4me[0m[92m[4ma[0m[92m[4mr[0m[92m[4mc[0m[92m[4mh[0m[92m[4me[0m[92m[4ms[0m [92m[4mf[0m[92m[4mo[0m[92m[4mr[0m [92m[4ma[0m[92m[4ml[0m[92m[4ml[0m [92m[4mW[0m[92m[4mo[0m[92m[4mr[0m[92m[4md[0m [92m[4mc[0m[92m[4mh[0m[92m[4ma[0m[92m[4mr[0m[92m[4ma[0m[92m[4mc[0m[92m[4mt[0m[92m[4me[0m[92m[4mr[0m[92m[4ms[0m, [92m[4ms[0m[92m[4mo[0m [92m[4mb[0m[92m[4ma[0m[92m[4ms[0m[92m[4mi[0m[92m[4mc[0m[92m[4ma[0m[92m[4ml[0m[92m[4ml[0m[92m[4my[0m [92m[4me[0m[92m[4mv[0m[92m[4me[0m[92m[4mr[0m[92m[4my[0m[92m[4mt[0m[92m[4mh[0m[92m[4mi[0m[92m[4mn[0m[92m[4mg[0m


In [23]:
regex = r"\d"

test_string = "This searches for digits, for example 5 or 8"

show_regex_search_result(regex,test_string)

This searches for digits, for example [92m[4m5[0m or [92m[4m8[0m


In [24]:
regex = r"\s"

test_string = "This text also has whitespaces, which can be matched like this"

show_regex_search_result(regex,test_string)

This[92m[4m [0mtext[92m[4m [0malso[92m[4m [0mhas[92m[4m [0mwhitespaces,[92m[4m [0mwhich[92m[4m [0mcan[92m[4m [0mbe[92m[4m [0mmatched[92m[4m [0mlike[92m[4m [0mthis


In [25]:
regex = r"."

test_string = "This matches everything, also numbers like 9, but not the \n line return"

show_regex_search_result(regex,test_string)

[92m[4mT[0m[92m[4mh[0m[92m[4mi[0m[92m[4ms[0m[92m[4m [0m[92m[4mm[0m[92m[4ma[0m[92m[4mt[0m[92m[4mc[0m[92m[4mh[0m[92m[4me[0m[92m[4ms[0m[92m[4m [0m[92m[4me[0m[92m[4mv[0m[92m[4me[0m[92m[4mr[0m[92m[4my[0m[92m[4mt[0m[92m[4mh[0m[92m[4mi[0m[92m[4mn[0m[92m[4mg[0m[92m[4m,[0m[92m[4m [0m[92m[4ma[0m[92m[4ml[0m[92m[4ms[0m[92m[4mo[0m[92m[4m [0m[92m[4mn[0m[92m[4mu[0m[92m[4mm[0m[92m[4mb[0m[92m[4me[0m[92m[4mr[0m[92m[4ms[0m[92m[4m [0m[92m[4ml[0m[92m[4mi[0m[92m[4mk[0m[92m[4me[0m[92m[4m [0m[92m[4m9[0m[92m[4m,[0m[92m[4m [0m[92m[4mb[0m[92m[4mu[0m[92m[4mt[0m[92m[4m [0m[92m[4mn[0m[92m[4mo[0m[92m[4mt[0m[92m[4m [0m[92m[4mt[0m[92m[4mh[0m[92m[4me[0m[92m[4m [0m
[92m[4m [0m[92m[4ml[0m[92m[4mi[0m[92m[4mn[0m[92m[4me[0m[92m[4m [0m[92m[4mr[0m[92m[4me[0m[92m[4mt[0m[92m[4mu[0m[92m[4mr[0m[92m[4mn[0m


### Quantifiers

You can quantify a given search pattern using special characters:

```
+     - Match 1 or more
*     - Match 0 or more
?     - Match 0 or 1
{x}   - Match exactly x times
{x,}  - Match x times or more
{x,y} - Match between x and y times
```
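To compare the quantifiers side by side, the following sketch applies each of them to the letter `a` in one made-up test string and lists the matches with `re.findall`.

```python
import re

text = "ct cat caat caaat caaaat"

print(re.findall(r"ca+t", text))      # ['cat', 'caat', 'caaat', 'caaaat']
print(re.findall(r"ca*t", text))      # ['ct', 'cat', 'caat', 'caaat', 'caaaat']
print(re.findall(r"ca?t", text))      # ['ct', 'cat']
print(re.findall(r"ca{2}t", text))    # ['caat']
print(re.findall(r"ca{2,}t", text))   # ['caat', 'caaat', 'caaaat']
print(re.findall(r"ca{2,3}t", text))  # ['caat', 'caaat']
```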

In [26]:
# Match 1 or more
regex = r"\w+t"

test_string = "This will match all all t's that have characters before them"

show_regex_search_result(regex, test_string)

This will [92m[4mmat[0mch all all t's [92m[4mthat[0m have [92m[4mcharact[0mers before them


In [27]:
# Match 0 or more
regex = r"\w*t"

test_string = "This will match all ts as well as the characters before t"

show_regex_search_result(regex, test_string)

This will [92m[4mmat[0mch all [92m[4mt[0ms as well as [92m[4mt[0mhe [92m[4mcharact[0mers before [92m[4mt[0m


In [28]:
# Match 0 or 1
regex = r"\w?t"

test_string = "This will match all ts as well as the character before it, if it exists"

show_regex_search_result(regex, test_string)

This will m[92m[4mat[0mch all [92m[4mt[0ms as well as [92m[4mt[0mhe chara[92m[4mct[0mer before [92m[4mit[0m, if [92m[4mit[0m exi[92m[4mst[0ms


In [29]:
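# Match exactly 2, 2 or more, and between 2 and 4 word characters before the t
regex_2 = r"\w{2}t"
regex_2_ = r"\w{2,}t"
regex_2_4 = r"\w{2,4}?t"  # the trailing ? makes {2,4} lazy (it prefers fewer repetitions)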

test_string = "This will match all ts if they fulfill the specific boundary condition of items before it"

show_regex_search_result(regex_2, test_string)
show_regex_search_result(regex_2_, test_string)
show_regex_search_result(regex_2_4, test_string)

This will [92m[4mmat[0mch all ts if they fulfill the specific boundary con[92m[4mdit[0mion of items before it
This will [92m[4mmat[0mch all ts if they fulfill the specific boundary [92m[4mcondit[0mion of items before it
This will [92m[4mmat[0mch all ts if they fulfill the specific boundary c[92m[4mondit[0mion of items before it
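Note that the last pattern, `\w{2,4}?t`, puts a `?` directly after the quantifier. In Python's `re` this makes the quantifier lazy: it tries the fewest repetitions first instead of the most. The sketch below, with made-up sample strings, shows where the two forms differ and why they give the same result when a fixed `t` has to follow.

```python
import re

# Greedy takes as many characters as allowed, lazy takes as few as allowed
print(re.findall(r"\w{2,4}", "abcdef"))   # ['abcd', 'ef']
print(re.findall(r"\w{2,4}?", "abcdef"))  # ['ab', 'cd', 'ef']

# With a required t afterwards, both variants end up with the same match
print(re.findall(r"\w{2,4}t", "condition"))   # ['ondit']
print(re.findall(r"\w{2,4}?t", "condition"))  # ['ondit']

# A typical case where laziness matters
html = "<b>bold</b> and <i>italic</i>"
print(re.findall(r"<.+>", html))   # ['<b>bold</b> and <i>italic</i>']
print(re.findall(r"<.+?>", html))  # ['<b>', '</b>', '<i>', '</i>']
```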


### Some special search patterns

You can also match other special characters and positions, or modify a pattern with additional characters (incomplete list):

```
^           - Negates a character class when it comes first inside [] brackets; otherwise matches the start of the string
\Capital    - Matches everything but the given class, e.g. \D matches any character that is not a digit
\b          - Word boundary, i.e. start or end of a word (Python's re does not support \< and \>)
\t          - Tab
\n          - Newline
\r          - Carriage return
```
```
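A short sketch of anchors, word boundaries and negation in Python's `re` module, where a word boundary is written `\b`. The sample text is made up for illustration.

```python
import re

text = "cat concat cats"

# ^ outside brackets anchors the match to the start of the string
print(re.findall(r"^cat", text))     # ['cat']
# \b on both sides matches "cat" only as a whole word
print(re.findall(r"\bcat\b", text))  # ['cat']
# \b at the end only: matches in "cat" and "concat", but not in "cats"
print(re.findall(r"cat\b", text))    # ['cat', 'cat']
# ^ as the first character inside brackets negates the class
print(re.findall(r"[^a-z]", text))   # [' ', ' ']
# \S is the negation of \s
print(re.findall(r"\S+", text))      # ['cat', 'concat', 'cats']
```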


In [30]:
regex = r"\D"

test_string = "This searches for all non digit characters so 9 wouldn't be matched"

show_regex_search_result(regex,test_string)

regex = r"[^\D]"

test_string = "You can write dirty dirty regex like this, so that 9 is matched."

show_regex_search_result(regex,test_string)

[92m[4mT[0m[92m[4mh[0m[92m[4mi[0m[92m[4ms[0m[92m[4m [0m[92m[4ms[0m[92m[4me[0m[92m[4ma[0m[92m[4mr[0m[92m[4mc[0m[92m[4mh[0m[92m[4me[0m[92m[4ms[0m[92m[4m [0m[92m[4mf[0m[92m[4mo[0m[92m[4mr[0m[92m[4m [0m[92m[4ma[0m[92m[4ml[0m[92m[4ml[0m[92m[4m [0m[92m[4mn[0m[92m[4mo[0m[92m[4mn[0m[92m[4m [0m[92m[4md[0m[92m[4mi[0m[92m[4mg[0m[92m[4mi[0m[92m[4mt[0m[92m[4m [0m[92m[4mc[0m[92m[4mh[0m[92m[4ma[0m[92m[4mr[0m[92m[4ma[0m[92m[4mc[0m[92m[4mt[0m[92m[4me[0m[92m[4mr[0m[92m[4ms[0m[92m[4m [0m[92m[4ms[0m[92m[4mo[0m[92m[4m [0m9[92m[4m [0m[92m[4mw[0m[92m[4mo[0m[92m[4mu[0m[92m[4ml[0m[92m[4md[0m[92m[4mn[0m[92m[4m'[0m[92m[4mt[0m[92m[4m [0m[92m[4mb[0m[92m[4me[0m[92m[4m [0m[92m[4mm[0m[92m[4ma[0m[92m[4mt[0m[92m[4mc[0m[92m[4mh[0m[92m[4me[0m[92m[4md[0m
You can write dirty dirty regex like this, so that [92m[4m9[0m is match