## Importing the interpreter in `python`

Python's module for implementing regex is called `re`: <https://docs.python.org/3/library/re.html#module-re>

It is part of the standard library, so no installation is needed!

Another popular module is `regex` (<https://pypi.org/project/regex/>), which offers some additional functionality, most notably more complete Unicode support (for example, matching characters by Unicode property).
It has to be installed explicitly (e.g., through `pip`).

In [2]:
import re

We will use the `findall` function:
<https://docs.python.org/3/library/re.html#re.findall>
```
re.findall(pattern, string, flags=0)
 	Return all non-overlapping matches of pattern in string, as a list of strings or tuples. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result.
	The result depends on the number of capturing groups in the pattern. If there are no groups, return a list of strings matching the whole pattern. If there is exactly one group, return a list of strings matching that group. If multiple groups are present, return a list of tuples of strings matching the groups. Non-capturing groups do not affect the form of the result.
```
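
The effect of capturing groups on the shape of the result can be seen with a minimal sketch. The toy string and the variable names are made up for illustration; parentheses create a capturing group and `(?:...)` a non-capturing one.

```python
import re

text = "ab ab"

# No capturing group: each element is the whole match.
whole = re.findall(r"ab", text)

# Exactly one capturing group: each element is only what the group matched.
one_group = re.findall(r"(a)b", text)

# Several capturing groups: each element is a tuple, one item per group.
two_groups = re.findall(r"(a)(b)", text)

# A non-capturing group (?:...) does not change the shape of the result.
non_capturing = re.findall(r"(?:a)b", text)

print(whole)          # ['ab', 'ab']
print(one_group)      # ['a', 'a']
print(two_groups)     # [('a', 'b'), ('a', 'b')]
print(non_capturing)  # ['ab', 'ab']
```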

## A string is itself a regular expression

The simplest regular expression is a plain string.

e.g., `r"contro"` which represents the regular language `L = {"contro"}`

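NB: the `r` placed right before the Python string literal makes it a *raw string*: backslashes are kept as they are instead of being processed as Python escape sequences, so they reach the regex interpreter untouched. The prefix does not by itself turn the string into a regular expression: any string passed to a function such as `re.findall` is interpreted as a pattern.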
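
A small sketch of what the `r` prefix changes. The example relies on `\b`, which is not covered in this section: in a regex it marks a word boundary, while in an ordinary Python string it is the backspace character. The sample text is made up for illustration.

```python
import re

# Without the prefix, Python itself turns \t into one TAB character;
# with the prefix, the backslash and the t are both kept.
print(len("\t"), len(r"\t"))  # 1 2
print(r"\t" == "\\t")         # True

text = "stasera sera"

# "\b" is a backspace character here, so the pattern looks for
# backspace + "sera" + backspace and finds nothing.
without_prefix = re.findall("\bsera\b", text)

# r"\b" reaches the regex interpreter as a word boundary.
with_prefix = re.findall(r"\bsera\b", text)

print(without_prefix)  # []
print(with_prefix)     # ['sera']
```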

In [None]:
text = "alla fine è uscito il sole, contro ogni aspettativa"
re.findall(r"contro", text)

['contro']

In [3]:
text = "per favore, controlla se è arrivata la posta!"
re.findall(r"contro", text)

['contro']

## Alternation

We can extend our language by introducing more options...

`|` indicates the union of two matches (i.e., `or`)

In [4]:
text = "Filastrocca di primavera più lungo è il giorno, più dolce la sera."

re.findall(r"giorno|sera", text)

['giorno', 'sera']

In [5]:
text = "per favore, controlla se è arrivata la posta stasera!"
re.findall(r"giorno|sera", text)

['sera']

## Quantifiers

The easiest way to extend the language is to repeat characters:

- `*` $\to$ zero or more occurrences
- `+` $\to$ one or more occurrences
- `{x,y}` $\to$ from x to y occurrences (we can also have `{x}` for exactly `x`, `{,y}` for at most `y` or `{x,}` for at least `x`). Write no spaces inside the braces: Python's `re` does not read `{x, y}` as a quantifier
- `?` is a shortcut for `{0,1}`

e.g. `r"mu?ore"` which represents the regular language `L = {"muore", "more"}`
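
The optional `u` and the bounded quantifier can be checked with a short sketch. The test strings are made up for illustration; the last call shows the pitfall of writing a space inside the braces.

```python
import re

text = "more muore muuore"

# u? accepts zero or one "u", so "muuore" is not in the language.
optional_u = re.findall(r"mu?ore", text)
print(optional_u)  # ['more', 'muore']

runs = "a aa aaa aaaa"

# Between 2 and 3 repetitions; the run of four yields "aaa" and the
# leftover single "a" is too short to match.
bounded = re.findall(r"a{2,3}", runs)
print(bounded)  # ['aa', 'aaa', 'aaa']

# With a space the braces are treated as literal characters,
# so the pattern looks for the text "a{2, 3}" and finds nothing.
with_space = re.findall(r"a{2, 3}", runs)
print(with_space)  # []
```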

In [7]:
text = "a aa aaa aaaa aaaaa aaaaaa aaaaaaa"

re.findall(r"a+", text)

['a', 'aa', 'aaa', 'aaaa', 'aaaaa', 'aaaaaa', 'aaaaaaa']

In [8]:
re.findall(r"a*", text)

['a',
 '',
 'aa',
 '',
 'aaa',
 '',
 'aaaa',
 '',
 'aaaaa',
 '',
 'aaaaaa',
 '',
 'aaaaaaa',
 '']

In [9]:
re.findall(r"a{2,}", text)

['aa', 'aaa', 'aaaa', 'aaaaa', 'aaaaaa', 'aaaaaaa']

In [10]:
re.findall(r"a{2,5}", text)

['aa', 'aaa', 'aaaa', 'aaaaa', 'aaaaa', 'aaaaa', 'aa']

## Special characters in regular expressions

Now we know how to match more or less any character on our keyboard.

There's something more we need to know:

- in principle it's possible to match a tab by typing the actual `[TAB]` character in the pattern, but not all editors allow it and an invisible character makes the expression hard to read, so the `[TAB]` character is expressed by the regex `r"\t"`
- many regex tools process their input line by line, and in any case a `newline` needs a special sequence in order to be matched, and that is `r"\n"`
- Finally, can you match a full stop?
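
Before turning to the full stop, the first two points can be tried on a small made-up string (the tab-separated sample below is only an illustration):

```python
import re

# Two lines of tab-separated values; the string literal is not raw,
# so Python turns \t and \n into real tab and newline characters.
text = "name\tage\nJohn\t30\n"

tabs = re.findall(r"\t", text)
newlines = re.findall(r"\n", text)

# A pattern can also span a line break explicitly.
across_lines = re.findall(r"age\nJohn", text)

print(tabs)          # ['\t', '\t']
print(newlines)      # ['\n', '\n']
print(across_lines)  # ['age\nJohn']
```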

In [9]:
text = """
I wandered lonely as a cloud
That floats on high o'er vales and hills,
When all at once I saw a crowd,
A host, of golden daffodils;
Beside the lake, beneath the trees,
Fluttering and dancing in the breeze.
"""

re.findall(r",", text)

[',', ',', ',', ',', ',']

In [10]:
re.findall(r";", text)


[';']

In [11]:
re.findall(r".", text)

['I',
 ' ',
 'w',
 'a',
 'n',
 'd',
 'e',
 'r',
 'e',
 'd',
 ' ',
 'l',
 'o',
 'n',
 'e',
 'l',
 'y',
 ' ',
 'a',
 's',
 ' ',
 'a',
 ' ',
 'c',
 'l',
 'o',
 'u',
 'd',
 'T',
 'h',
 'a',
 't',
 ' ',
 'f',
 'l',
 'o',
 'a',
 't',
 's',
 ' ',
 'o',
 'n',
 ' ',
 'h',
 'i',
 'g',
 'h',
 ' ',
 'o',
 "'",
 'e',
 'r',
 ' ',
 'v',
 'a',
 'l',
 'e',
 's',
 ' ',
 'a',
 'n',
 'd',
 ' ',
 'h',
 'i',
 'l',
 'l',
 's',
 ',',
 'W',
 'h',
 'e',
 'n',
 ' ',
 'a',
 'l',
 'l',
 ' ',
 'a',
 't',
 ' ',
 'o',
 'n',
 'c',
 'e',
 ' ',
 'I',
 ' ',
 's',
 'a',
 'w',
 ' ',
 'a',
 ' ',
 'c',
 'r',
 'o',
 'w',
 'd',
 ',',
 'A',
 ' ',
 'h',
 'o',
 's',
 't',
 ',',
 ' ',
 'o',
 'f',
 ' ',
 'g',
 'o',
 'l',
 'd',
 'e',
 'n',
 ' ',
 'd',
 'a',
 'f',
 'f',
 'o',
 'd',
 'i',
 'l',
 's',
 ';',
 'B',
 'e',
 's',
 'i',
 'd',
 'e',
 ' ',
 't',
 'h',
 'e',
 ' ',
 'l',
 'a',
 'k',
 'e',
 ',',
 ' ',
 'b',
 'e',
 'n',
 'e',
 'a',
 't',
 'h',
 ' ',
 't',
 'h',
 'e',
 ' ',
 't',
 'r',
 'e',
 'e',
 's',
 ',',
 'F',
 'l',
 'u',
 't',
 ...]

Inside a regex the character `.` has a special meaning: it matches `any character but newline`. This is why no `'\n'` appears in the output above.
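
A minimal sketch of this behaviour on a made-up two-line string. The `re.DOTALL` flag, not used elsewhere in this section, makes `.` match newlines as well.

```python
import re

text = "a\nb."

# By default the newline is skipped.
default = re.findall(r".", text)

# With re.DOTALL the newline is matched like any other character.
dotall = re.findall(r".", text, flags=re.DOTALL)

print(default)  # ['a', 'b', '.']
print(dotall)   # ['a', '\n', 'b', '.']
```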

So what if we wanted to match `breeze.`, for instance?

We should add an escape character (`\`) before the full stop!

Think about it, we already used the escape character before.
`\t` tells the interpreter: "do not match `t` as you usually would, but interpret it in some other way"

This is also true for some other characters, like `(`, `)`, `[`, `]`, `{`, `}`, `+`, `*`, that bear a special meaning within the regex language. We already saw how to use `{`, `}`, `+`, `*`; we'll get to the others shortly.
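
The same escaping works for the other special characters, as in this sketch on a made-up sentence. `re.escape` is a standard-library helper that adds the backslashes for you when the whole string has to be matched literally.

```python
import re

text = "Is 2+2 (really) 4? 22 is not."

# Unescaped, + is a quantifier: "one or more 2s followed by a 2".
unescaped = re.findall(r"2+2", text)

# Escaped, + and the parentheses are matched as ordinary characters.
escaped_plus = re.findall(r"2\+2", text)
escaped_parens = re.findall(r"\(really\)", text)

# re.escape builds the escaped pattern from a literal string.
via_escape = re.findall(re.escape("(really)"), text)

print(unescaped)       # ['22']
print(escaped_plus)    # ['2+2']
print(escaped_parens)  # ['(really)']
print(via_escape)      # ['(really)']
```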

In [12]:
re.findall(r"\.", text)

['.']

### Questions:

- What happens if you match `r"breeze."` without escaping the `.`?
- Can you demonstrate your hypothesis, by building some ad-hoc text?

## Classes of characters

Matching explicitly is not very useful if you want to express a language like `L = {all words of length 3}`.

At the moment, if we want to express `any letter in the alphabet` we'd need to write an expression like `r"a|b|c|d|e|f..."`. Quite inconvenient.

Just like `.` means any character, regex interpreters also understand some special sequences that stand for commonly used classes of characters:

- `r"\w"` matches any alphanumeric character or the underscore `_`
- `r"\d"` matches any digit
- `r"\s"` matches any whitespace character (i.e., tabs, standard spaces, newlines)


Always check [syntax specifications](https://docs.python.org/3/library/re.html#regular-expression-syntax) of the interpreter you're using!
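
A quick sketch of the three classes on a made-up string. Note that in Python `\w` also matches the underscore.

```python
import re

text = "a_1 b\t2"

word_chars = re.findall(r"\w", text)
digits = re.findall(r"\d", text)
spaces = re.findall(r"\s", text)

# Combined with a quantifier, \w+ extracts whole words.
words = re.findall(r"\w+", "My name is John.")

print(word_chars)  # ['a', '_', '1', 'b', '2']
print(digits)      # ['1', '2']
print(spaces)      # [' ', '\t']
print(words)       # ['My', 'name', 'is', 'John']
```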

In [15]:
text = "My name is John. I was born on Jan 1st 1990."

re.findall(r"\d", text)


['1', '1', '9', '9', '0']

### Questions:

Consider this text:
```
text = """
I wandered lonely as a cloud
That floats on high o'er vales and hills,
When all at once I saw a crowd,
A host, of golden daffodils;
Beside the lake, beneath the trees,
Fluttering and dancing in the breeze.
"""
```


- try and match any sequence of alphanumeric characters of length at most 3
- any sequence of characters of length exactly 3, preceded and followed by spaces
- any sequence composed of a space, a variable number of alphanumeric characters and a comma

We can also define a personalized class of characters, by using square brackets.

For instance, there's no pre-defined meta sequence for vowels but we can define:

`r"[aeiou]"` for `r"a|e|i|o|u"`
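
A minimal sketch showing that the class and the explicit alternation find the same matches (the sample words are taken from the poem above):

```python
import re

text = "daffodils"

with_class = re.findall(r"[aeiou]", text)
with_alternation = re.findall(r"a|e|i|o|u", text)

# A class is a single unit, so a quantifier applies to it as a whole.
vowel_runs = re.findall(r"[aeiou]+", "breeze")

print(with_class)        # ['a', 'o', 'i']
print(with_alternation)  # ['a', 'o', 'i']
print(vowel_runs)        # ['ee', 'e']
```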


Square brackets also support ranges:

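- `r"[0-9]"` matches the same ASCII digits as `r"\d"` (on Python strings `\d` also accepts digits from other scripts), but we can also define `r"[0-3]"` for instance
- `[a-z]` means all alphabetical letters from `a` to `z` (lowercase only, and pay attention to encoding: accented letters such as `è` are outside the range)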
- and so on...
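
The ranges can be tried out with a short sketch. The strings reuse fragments of the examples above; `"\u0663"` is the Arabic-Indic digit three, chosen here only to show where `\d` and `[0-9]` differ.

```python
import re

# Only the digits from 0 to 3 are kept.
low_digits = re.findall(r"[0-3]", "Jan 1st 1990")
print(low_digits)  # ['1', '1', '0']

# The accented "ù" is outside a-z, so "più" is cut short.
lowercase_runs = re.findall(r"[a-z]+", "più dolce la sera")
print(lowercase_runs)  # ['pi', 'dolce', 'la', 'sera']

# On Python strings, \d also accepts non-ASCII decimal digits.
mixed = "7 \u0663"
unicode_digits = re.findall(r"\d", mixed)
ascii_digits = re.findall(r"[0-9]", mixed)
print(len(unicode_digits), len(ascii_digits))  # 2 1
```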

### Questions

