# Regular expressions, interactive examples

In [1]:
import sys
sys.path.append("../src/")

In [2]:
# Use the autoreload extension so that edited modules are reloaded without restarting the notebook
%load_ext autoreload
%autoreload 2

In [3]:
from regex_intro import (
    show_regex_search_result,
    show_regex_search_result_group_highlighting,
    show_regex_substitution_highlighting,
)

## Part 2: Advanced usage

### Grouping of search patterns

You can of course use the replace feature of your regex library to replace an entire match. Often, however, that is not what you want: you only want to replace or reuse certain parts of what you matched. This is what grouping is for. Wrap part of the pattern in parentheses `()` to capture it as a group.

In [4]:
# Group search results
regex = r"\w+(\d)(\w+)"

test_string = "This will match all all words like 485, and will group 8 and 5 in the previous number, or 2 and 4 in a24."

show_regex_search_result_group_highlighting(regex, test_string)

This will match all all words like 485, and will group 8 and 5 in the previous number, or 2 and 4 in a24.
Group 1: (\d)
Group 2: (\w+)


Now you can back-reference a group in a substitution string. Say we want to put `dummy` before the first group and `text` between the first and second group. We can back-reference a group using `\1` (first group), `\2` (second group), and so on. Some other tools use `$1`, `$2` for the same purpose. Note that anything matched outside the groups, here the leading `\w+`, is dropped by the replacement.

In [5]:
show_regex_substitution_highlighting(regex,r"dummy\1text\2",test_string)

This will match all all words like 485, and will group 8 and 5 in the previous number, or 2 and 4 in a24.
                                   ^^^^                                                              ^^^^
This will match all all words like dummy8text5, and will group 8 and 5 in the previous number, or 2 and 4 in dummy2text4.
                                   ^^^^^^^^^^^^                                                              ^^^^^^^^^^^^
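For reference, the same substitution can be sketched with `re.sub` from Python's standard library. The input strings here are assumptions chosen to keep the example short. The second call shows the `\g<1>` form, which is useful when a digit follows the group reference directly.

```python
import re

regex = r"\w+(\d)(\w+)"

# \1 and \2 insert the text captured by the first and second group
replaced = re.sub(regex, r"dummy\1text\2", "Replace 485 and a24 here.")
print(replaced)

# \g<1> is an unambiguous way to write \1; r"\10" would mean group 10
padded = re.sub(r"(\d)", r"\g<1>0", "a2")
print(padded)
```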


### Backreferences

Backreferences can also be used inside the regular expression itself. Say we want to match a string that starts with one or more word characters followed by two digits, but only when the two digits are the same:

In [6]:
regex = r"\w+(\d)\1"  # \1 refers back to group 1 (Python syntax; other tools may differ)

test_string = "This will match for example ab44, but not ab43"

show_regex_search_result(regex, test_string)

This will match for example ab44, but not ab43
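To check the behavior without the notebook's helper, here is a small sketch with the standard `re` module. The word list is an assumption made for illustration.

```python
import re

regex = r"\w+(\d)\1"

# Only words that end in a repeated digit should match
text = "ab44 ab43 x99 y12"
matches = [m.group(0) for m in re.finditer(regex, text)]
print(matches)
```

`\1` must match exactly the text that group 1 captured, so `ab43` and `y12` are rejected.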


### Alternation in groupings

Groups can also contain an or operator. Building on the example above, say we want to match one or more word characters followed by either two digits or two `a` characters. Use the `|` operator inside the group for that:

In [7]:
regex = r"\w+(\d{2}|a{2})"

test_string = "This will match for example ab44 and abaa, but not ab4a or abab"

show_regex_search_result(regex, test_string)

This will match for example ab44 and abaa, but not ab4a or abab
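A short sketch with the standard `re` module shows which alternative was taken for each match. The input string is an assumption for illustration.

```python
import re

regex = r"\w+(\d{2}|a{2})"
text = "ab44 abaa ab4a abab"

# group(0) is the full match, group(1) is the alternative that matched
found = [(m.group(0), m.group(1)) for m in re.finditer(regex, text)]
print(found)
```

Note that `re.findall` with this pattern would return only the group contents (`"44"` and `"aa"`), which is why the sketch uses `re.finditer` to get the full matches as well.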


### Lookahead and lookbehind

Lookaheads let us peek at what follows the current position in a regex and accept or reject the match depending on whether it fulfills a given criterion, without consuming those characters. Simple cases can often be solved with the concepts already covered, but lookaheads become necessary for more advanced patterns. To perform a lookahead, use `(?=x)` (positive lookahead) or `(?!x)` (negative lookahead).

An example: say we want to match all numbers strictly between 4000 and 5000, that is, 4001 to 4999.

In [8]:
failed_regex = r"4\d{3}"

test_string = "This looks promising with 4021 but unfortunately also matches 4000."

show_regex_search_result(failed_regex, test_string)

This looks promising with 4021 but unfortunately also matches 4000.


In [9]:
correct_regex = r"4([1-9]\d\d|\d[1-9]\d|\d\d[1-9])\b"

test_string = "Now we will match 4010 but not 4000."

show_regex_search_result(correct_regex, test_string)

Now we will match 4010 but not 4000.


In [10]:
regex_with_lookahead = r"4(?!000)\d{3}"

show_regex_search_result(regex_with_lookahead, test_string)

Now we will match 4010 but not 4000.
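As a rough check of the negative lookahead at the boundaries, the sketch below runs it against a few hand-picked numbers using the standard `re` module. The second part illustrates a positive lookahead. Its currency example is an assumption for illustration and is not part of the original notebook.

```python
import re

# Negative lookahead: a 4 that is NOT followed by 000, then three digits
lookahead = re.compile(r"4(?!000)\d{3}")
numbers = ["3999", "4000", "4001", "4010", "4999", "5000"]
matched = [n for n in numbers if lookahead.fullmatch(n)]
print(matched)

# Positive lookahead: digits that ARE followed by " EUR";
# the " EUR" part is checked but not included in the match
amounts = re.findall(r"\d+(?= EUR)", "10 EUR, 20 USD, 30 EUR")
print(amounts)
```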


Lookbehinds work the same way but add a `<` character: `(?<!x)` is the negative lookbehind, `(?<=x)` the positive lookbehind.

**Lookbehinds are not part of the match; the engine only checks whether they are present.**

Say we want to find all instances of Florian where Florian is a surname, that is, where it is preceded by a capitalized first name.

In [11]:
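# Note: Python's built-in re module rejects variable-width lookbehinds such as \w*;
# this assumes the helper uses an engine that supports them (e.g. the third-party regex package)
regex = r"(?<=[A-Z]\w* )Florian"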

test_string = "The name Hans Florian would be matched. Florian Bauer wouldn't be"

show_regex_search_result(regex, test_string)

The name Hans Florian would be matched. Florian Bauer wouldn't be
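One caveat: Python's built-in `re` module only accepts fixed-width lookbehinds, so a pattern containing `\w*` inside `(?<=...)` raises an error there. The following is a sketch of one possible workaround with the standard library, not the notebook's own approach: match the first name as part of the pattern and capture only the surname in a group.

```python
import re

text = "The name Hans Florian would be matched. Florian Bauer wouldn't be"

# The variable-width lookbehind is rejected by the built-in re module
try:
    re.compile(r"(?<=[A-Z]\w* )Florian")
    variable_width_supported = True
except re.error:
    variable_width_supported = False
print("variable-width lookbehind supported:", variable_width_supported)

# Workaround: match the first name normally and capture only the surname
workaround = re.compile(r"\b[A-Z]\w* (Florian)")
spans = [m.span(1) for m in workaround.finditer(text)]
print(spans)
```

Unlike a lookbehind, this consumes the first name as part of the overall match, so use `span(1)` or `group(1)` to get just the surname.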
