# Regex

Regex is a *language describing patterns of text.* 

A regex typically consists of a search pattern, which is used by string-searching algorithms to "find" or "find and replace" sequences in a string.
 
Regex is introduced in more detail by Simon in the video lectures:
1. [Regex 1](https://www.youtube.com/watch?v=ma93hpNFXZM)
2. [Regex 2](https://www.youtube.com/watch?v=B6XoKtQA2Fc)
3. [Regex 3](https://www.youtube.com/watch?v=jxNLY0L_N78)
4. [Regex 4](https://www.youtube.com/watch?v=j1jW5EF5jfs)

However, regex is difficult to learn just by watching videos. **It is one of those things you learn by doing!**

In the live lecture, we will practice regex together. 

I encourage you to skim the video lectures, then practice basic regex, and finally rewatch the lecture videos.

***
***

## Basic regex syntax

Regex is used across most programming languages. In this lecture, we will be working with regex in [regex101](https://regex101.com/) and Python. First, let us learn basic regex syntax by playing around in regex101.


***
**Character classes**
- `\d` : matches a single character that is a digit
- `\w` : matches a word character
- `\s` : matches a whitespace character
- `.`  : matches any character

Negations: `\D`, `\W`, `\S` are the *negations* of `\d`, `\w` and `\s`.
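
The same classes can be tried from Python. The snippet below is a minimal sketch using the standard `re` module; the test string `"A1 b2"` is just an assumption chosen for illustration.

```python
import re

text = "A1 b2"  # hypothetical test string

digits = re.findall(r"\d", text)       # single digits
word_chars = re.findall(r"\w", text)   # letters, digits and underscore
spaces = re.findall(r"\s", text)       # whitespace characters
non_digits = re.findall(r"\D", text)   # negation of \d
anything = re.findall(r".", text)      # any character except a newline

print(digits)      # ['1', '2']
print(word_chars)  # ['A', '1', 'b', '2']
print(spaces)      # [' ']
print(non_digits)  # ['A', ' ', 'b']
print(anything)    # ['A', '1', ' ', 'b', '2']
```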


***
**Anchors: `^`, `$` and `\b`**
- `^Hat` matches any string that starts with "Hat"
- `Pay$` matches any string that ends with "Pay"
- `\b` matches a word boundary
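
As a rough illustration, the anchors behave like this in Python's `re` module. The test strings are made up for this sketch.

```python
import re

# ^ anchors the pattern to the start of the string.
starts_with_hat = bool(re.search(r"^Hat", "Hat trick"))
not_at_start = bool(re.search(r"^Hat", "Top Hat"))

# $ anchors the pattern to the end of the string.
ends_with_pay = bool(re.search(r"Pay$", "Apple Pay"))
not_at_end = bool(re.search(r"Pay$", "Pay later"))

# \b only matches between a word character and a non-word character,
# so the "cat" inside "concat" is skipped.
whole_words = re.findall(r"\bcat\b", "cat concat cat.")

print(starts_with_hat, not_at_start)  # True False
print(ends_with_pay, not_at_end)      # True False
print(whole_words)                    # ['cat', 'cat']
```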
***

Try it out! Go to [regex101](https://regex101.com/) and play around in the field where it says "insert your regular expression here".

![alt text](regex101.png)

Questions: 
- How do you match the "4"?
- How do you match everything except the "4"?
- How do you match the *first* "hat"?

***
**Special characters**

Special characters are reserved as keywords. They are:
   `^`   `[`  `.` `$`     `{`    `*`   `(`  `\`  `+`    `)`   `|`   `?` 

Literal versions of these characters have to be *escaped* by adding a backslash (`\`) in front of them.

For example, `^` is escaped by writing `\^`.
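
A small Python sketch of escaping follows. The string `price` is a hypothetical example; `re.escape` is the standard-library helper that adds the backslashes for you.

```python
import re

price = "Total: $5.00 (approx.)"  # hypothetical test string

# Escaped, $ and . lose their special meaning and match literally.
dollars = re.findall(r"\$", price)
dots = re.findall(r"\.", price)

# re.escape builds an escaped pattern from a literal string.
escaped = re.escape("$5.00")
literal_hits = re.findall(escaped, price)

print(dollars)       # ['$']
print(dots)          # ['.', '.']
print(literal_hits)  # ['$5.00']
```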

Question:
- In our regex101 example, can you match the $?

***
**Flags**

In our example, try matching `store.$`.

From the material we have covered so far, you would not expect any matches... Yet we get two!
![alt text](regex101-store.png)

That is because regex101 is using the multiline flag, under which `^` and `$` match the beginning and end of each line.
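
The same effect can be reproduced in Python with `re.MULTILINE` (short form `re.M`). This is a sketch on a made-up two-line string, not the text from the screenshot.

```python
import re

text = "first store.\nsecond store."  # hypothetical two-line test string

# Without the flag, $ only matches at the very end of the string.
without_flag = re.findall(r"store.$", text)

# With re.MULTILINE, $ also matches at the end of every line.
with_flag = re.findall(r"store.$", text, flags=re.MULTILINE)

print(without_flag)  # ['store.']
print(with_flag)     # ['store.', 'store.']
```

Note that `re.findall` always returns every match, so Python has no separate counterpart to regex101's global flag.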

Question:
- Can you turn off multiline? (Tip: Hover your mouse over "mg")
- What happens when you match "store." with the multiline flag on and the global flag off?

***
**Quantifiers**

The characters `*`, `+`, `?` and `{}` are reserved as quantifiers.

For example, 
- `abc*` matches a string that has ab followed by zero or more c.
- `abc+` matches a string that has ab followed by one or more c
- `abc?` matches a string that has ab followed by zero or one c
- `abc{2}` matches a string that has ab followed by 2 c
- `abc{2,}` matches a string that has ab followed by 2 or more c
- `abc{2,5}` matches a string that has ab followed by 2 to 5 c (note: no space after the comma)
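
To check the list above, here is a minimal sketch that tests each pattern against a few candidate strings with `re.fullmatch`, which only succeeds when the whole string matches. The `candidates` list and the helper `full_matches` are hypothetical names introduced for this example.

```python
import re

# Hypothetical candidates: ab followed by 0, 1, 2 and 6 c's.
candidates = ["ab", "abc", "abcc", "ab" + "c" * 6]

def full_matches(pattern):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

print(full_matches(r"abc*"))      # ['ab', 'abc', 'abcc', 'abcccccc']
print(full_matches(r"abc+"))      # ['abc', 'abcc', 'abcccccc']
print(full_matches(r"abc?"))      # ['ab', 'abc']
print(full_matches(r"abc{2}"))    # ['abcc']
print(full_matches(r"abc{2,}"))   # ['abcc', 'abcccccc']
print(full_matches(r"abc{2,5}"))  # ['abcc']
```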

Question:
- How many regexes can you make that match "ccc" in "abccc"?

**Greedy and lazy operators**

The quantifiers `*`, `+`, `?` and `{}` are all *greedy*, meaning they match as many characters as they can.

To make them *lazy*, we add a question mark: `*?`, `+?`, `??` and `{}?`
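
A short sketch of the difference, using a made-up string with HTML-like tags:

```python
import re

text = "<b>bold</b> and <i>italic</i>"  # hypothetical test string

# Greedy: .+ runs to the LAST > it can reach.
greedy = re.findall(r"<.+>", text)

# Lazy: .+? stops at the FIRST > it can reach.
lazy = re.findall(r"<.+?>", text)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```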

Question:
- Take as your test string: `Norwegian: "God dag". Italian: "Buongiorno"`
Can you match the greetings?


***
**OR operator - `|` or `[]`**
- `a(b|c)` matches a string that has a followed by b or c
- `a[bc]` matches a string that has a followed by b or c

Example: `gr[ae]y` matches gray and grey.

We can also negate a character set using `^`. For example, `[^aeiou]` matches any single character that is *not* one of the letters a, e, i, o and u.
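
The following sketch tries both forms in Python. The test strings are assumptions made for illustration.

```python
import re

text = "gray grey groy"  # hypothetical test string

# A character set picks exactly one character from the listed options.
set_matches = re.findall(r"gr[ae]y", text)

# Alternation inside a group does the same job for this pattern.
alt_matches = [m.group() for m in re.finditer(r"gr(a|e)y", text)]

# A negated set matches one character that is NOT listed.
non_vowels = re.findall(r"[^aeiou]", "regex")

print(set_matches)  # ['gray', 'grey']
print(alt_matches)  # ['gray', 'grey']
print(non_vowels)   # ['r', 'g', 'x']
```

`re.finditer` is used for the alternation because `re.findall` would return only the contents of the parentheses (`['a', 'e']`) rather than the full match.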
***


***


**Grouping and capturing**

Placing parts of the regex inside `()` will group that part of the regex together. 

Example: `a(bc)+` matches strings that have a followed by one or more bc

The parentheses also *capture* the corresponding match and allow us to *backreference* it using `\1`, `\2` etc.

Example: `a(bc)\1` will match abcbc.
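
In Python, the captured text is available on the match object. A minimal sketch, with made-up test strings:

```python
import re

# Grouping: the + applies to the whole group (bc), not just to c.
repeated = re.fullmatch(r"a(bc)+", "abcbcbc")
print(repeated.group(0))  # abcbcbc

# Capturing and backreferencing: \1 must repeat what group 1 matched.
match = re.search(r"a(bc)\1", "xxabcbcxx")
print(match.group(0))  # abcbc (the whole match)
print(match.group(1))  # bc    (what the first group captured)

# "bd" is not a repeat of the captured "bc", so there is no match.
no_match = re.search(r"a(bc)\1", "abcbd")
print(no_match)  # None
```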

Question: 
- Can you make a regex that matches any character repeated twice?
For example, your regex should match "ee" in "Rita Skeeter".
- Can you make a regex that matches any repeated character?
For example, your regex should match "sss" in "headmistressship".
- Bonus: If you work on a text for a long time, you eventually become completely blind to your own mistakes. For example, my master's thesis contains a doubled "the the" three times. Can you make a regex that finds repeated words?

That concludes basic regex syntax! Now onto python and regex. 


*** 
***
***
## Python and regex
We start by importing Python's regex module, `re`:

In [26]:
import re

The `re` module has several functions that allow us to search through a string:
- `findall` Returns a list containing all matches
- `search` Returns a Match object if there is a match anywhere in the string
- `split` Returns a list where the string has been split at each match
- `sub` Replaces one or many matches with a string
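
Here is a quick sketch of all four functions on one made-up string, so you can compare what each of them returns.

```python
import re

text = "apples: 3, pears: 12"  # hypothetical test string

all_numbers = re.findall(r"\d+", text)   # list of every match
first = re.search(r"\d+", text)          # Match object for the first match
parts = re.split(r",\s*", text)          # split at each comma (plus spaces)
masked = re.sub(r"\d+", "N", text)       # replace every match with "N"

print(all_numbers)                   # ['3', '12']
print(first.group(), first.span())   # 3 (8, 9)
print(parts)                         # ['apples: 3', 'pears: 12']
print(masked)                        # apples: N, pears: N
```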

We will test out `findall`. Let's read the docstring:

In [32]:
re.findall?

Signature: re.findall(pattern, string, flags=0)
Docstring:
Return a list of all non-overlapping matches in the string.

If one or more capturing groups are present in the pattern, return
a list of groups; this will be a list of tuples if the pattern
has more than one group.

Empty matches are included in the result.
File:      ~/anaconda3/envs/py38/lib/python3.8/re.py
Type:      function


We can try it out on a list of files:

In [43]:
files = r'''transcript.pdf
           flower_picture.png
           thesis.tex'''

regex = r'(\w+)\.(\w+)'  # raw string; \. matches a literal dot instead of any character
matches = re.findall( regex, files, re.M)
print(matches)

[('transcript', 'pdf'), ('flower_picture', 'png'), ('thesis', 'tex')]


Question: Can you loop through the matches to make a nice formatted list of files and extensions?

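Note: We put the prefix `r` in front of the files string. This tells Python we want a raw string, in which backslashes are kept as literal characters instead of starting escape sequences such as `\n`. Regex patterns are full of backslashes, so it is a good habit to write them as raw strings too.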

In [44]:
str_ = 'Hello, \n I am a \n normal string'
raw_str = r'While \n I am a \n raw string'
print(str_)
print(raw_str)

Hello, 
 I am a 
 normal string
While \n I am a \n raw string


***

**Lookahead and lookbehind**

The regex engine marches along the string looking for matches to the regex. With the lookahead `(?=foo)` and lookbehind `(?<=foo)` we can ask it to check for matches *without* moving along the string.

Example: Say we want to match all the lowercase consonants in a string. Then we can use `[a-z](?<=[^aeiou])`: `[a-z]` consumes a letter, and the lookbehind then checks that the letter just consumed is not a vowel.
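
The sketch below tries a lookahead, a lookbehind and the consonant example in Python. The test strings (prices in "kr" and dollars) are invented for illustration.

```python
import re

# Lookahead: digits that are followed by " kr"; the " kr" is checked
# but is not part of the match.
kroner = re.findall(r"\d+(?= kr)", "10 kr, 20 eur, 30 kr")

# Lookbehind: digits that are preceded by a dollar sign.
dollars = re.findall(r"(?<=\$)\d+", "$5 and 7 and $12")

# The consonant example from the text above.
consonants = re.findall(r"[a-z](?<=[^aeiou])", "regex")

print(kroner)      # ['10', '30']
print(dollars)     # ['5', '12']
print(consonants)  # ['r', 'g', 'x']
```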


***

For the assignments, you will be working with more complicated input strings and regexes. You can keep using regex101 while you work out the correct regex, just be sure to switch the flavor to Python! There are small syntax differences between different regex engines. For example, with the Python flavor selected, regex101 makes you write `\"` if you want to match `"`.

<img src="regex101-python.png" alt="Drawing" style="width: 400px;"/>

That concludes the live lecture! Python and regex are beautifully covered by Simon in the video lectures. 

By now you should be comfortable with basic regex and ready to start Assignment 5 (as soon as it is published).






