## Importing the interpreter in `python`

Python's regular expression module is called `re`: <https://docs.python.org/3/library/re.html#module-re>

It is part of the standard library, so no installation is needed!

Another popular module is `regex` (<https://pypi.org/project/regex/>), which offers some additional functionality, such as more complete Unicode support (for example, Unicode property classes like `\p{L}`).
It has to be installed explicitly (e.g., through `pip`).

In [3]:
import re

We will use the `findall` function:
<https://docs.python.org/3/library/re.html#re.findall>
```
re.findall(pattern, string, flags=0)

    Return all non-overlapping matches of pattern in string, as a list of strings or tuples. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result.

    The result depends on the number of capturing groups in the pattern. If there are no groups, return a list of strings matching the whole pattern. If there is exactly one group, return a list of strings matching that group. If multiple groups are present, return a list of tuples of strings matching the groups. Non-capturing groups do not affect the form of the result.
```
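
To make the second paragraph of the documentation more concrete, here is a minimal sketch of how the shape of the result changes with the number of groups. The sample text and the variable names are made up for illustration; groups (round parentheses) are discussed in more detail later on.

```python
import re

text = "il sole, il mare, il sole"

# No groups: each item is the whole match.
no_groups = re.findall(r"il sole", text)

# One group: each item is only the text inside the parentheses.
one_group = re.findall(r"il (sole)", text)

# Two groups: each item is a tuple with one string per group.
two_groups = re.findall(r"(il) (sole)", text)

print(no_groups)   # ['il sole', 'il sole']
print(one_group)   # ['sole', 'sole']
print(two_groups)  # [('il', 'sole'), ('il', 'sole')]
```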

## A string is itself a regular expression

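The simplest regular expression is a plain string,

e.g., `r"contro"`, which represents the regular language `L = {"contro"}`.

NB: the `r` placed right before the opening quote makes this a Python `raw string`: backslashes are passed to the regex engine as they are, instead of being interpreted by Python as escape sequences. The prefix is not what turns the string into a regular expression, but it is good practice for any pattern that contains a backslash.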
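
As a small sketch of what the `r` prefix changes, we can compare a plain and a prefixed string (the file path below is a made-up sample). Without the prefix, matching a single backslash takes four backslashes in the source code; with the prefix, two are enough.

```python
import re

plain = "\t"   # Python turns this into ONE character: a tab
raw = r"\t"    # with the prefix: TWO characters, a backslash and a "t"
print(len(plain), len(raw))  # 1 2

path = r"C:\temp\notes.txt"  # made-up sample containing two backslashes

# Both patterns below reach the regex engine as the same two characters: \\
without_prefix = re.findall("\\\\", path)
with_prefix = re.findall(r"\\", path)

print(without_prefix == with_prefix)  # True
print(len(with_prefix))               # 2
```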

In [4]:
text = "alla fine è uscito il sole, contro ogni aspettativa"
re.findall(r"contro", text)

['contro']

In [5]:
text = "per favore, controlla se è arrivata la posta!"
re.findall(r"contro", text)

['contro']

## Alternation

We can extend our language by introducing more options...

`|` indicates the union of two patterns (i.e., `or`): the expression matches if either side matches

In [6]:
text = "Filastrocca di primavera più lungo è il giorno, più dolce la sera."

re.findall(r"giorno|sera", text)

['giorno', 'sera']

In [7]:
text = "per favore, controlla se è arrivata la posta stasera!"
re.findall(r"giorno|sera", text)

['sera']

## Quantifiers

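The easiest way to extend the language is to repeat characters. A quantifier applies to the element right before it:

- `*` $\to$ 0 or more occurrences
- `+` $\to$ 1 or more occurrences
- `{x,y}` $\to$ from x to y occurrences (we can also have `{x}` for exactly `x`, `{,y}` for at most `y` or `{x,}` for at least `x`). Do not put spaces inside the braces: Python's `re` would then treat them as literal characters
- `?` is a shortcut for `{0,1}`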

e.g. `r"mu?ore"` which represents the regular language `L = {"muore", "more"}`
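
The `mu?ore` example and the brace syntax can be checked with a short sketch. It uses `re.fullmatch`, which is not covered in this section: it succeeds only when the whole string matches the pattern. The sample strings and variable names are made up for illustration.

```python
import re

# `u?` makes the "u" optional, so both spellings are found.
optional_u = re.findall(r"mu?ore", "muore more muuore")
print(optional_u)  # ['muore', 'more']

# `{,2}` means "at most 2": zero, one or two occurrences of "a".
candidates = ["", "a", "aa", "aaa"]
at_most_two = [s for s in candidates if re.fullmatch(r"a{,2}", s)]
print(at_most_two)  # ['', 'a', 'aa']

# With a space inside the braces, the braces are no longer a quantifier:
# the pattern only matches the literal text "a{2, 3}".
space_on_aaa = re.fullmatch(r"a{2, 3}", "aaa")
space_on_literal = re.fullmatch(r"a{2, 3}", "a{2, 3}")
print(space_on_aaa is None)          # True
print(space_on_literal is not None)  # True
```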

In [8]:
text = "a aa aaa aaaa aaaaa aaaaaa aaaaaaa"

re.findall(r"a+", text)

['a', 'aa', 'aaa', 'aaaa', 'aaaaa', 'aaaaaa', 'aaaaaaa']

In [9]:
re.findall(r"a*", text)

['a',
 '',
 'aa',
 '',
 'aaa',
 '',
 'aaaa',
 '',
 'aaaaa',
 '',
 'aaaaaa',
 '',
 'aaaaaaa',
 '']

In [10]:
re.findall(r"a{2,}", text)

['aa', 'aaa', 'aaaa', 'aaaaa', 'aaaaaa', 'aaaaaaa']

In [11]:
re.findall(r"a{2,5}", text)

['aa', 'aaa', 'aaaa', 'aaaaa', 'aaaaa', 'aaaaa', 'aa']

## Special characters in regular expressions

Now we know how to match more or less any character on our keyboard.

There's something more we need to know:

- in principle it's possible to match a tab by typing the actual `[TAB]` character in the pattern, but not all editors allow this and the resulting expression is hard to read, so the `[TAB]` character is expressed by the regex `r"\t"`
- a `newline` cannot be typed conveniently inside a pattern either, so it also gets a special sequence: `r"\n"`
- Finally, can you match a full stop? 

In [12]:
text = """
I wandered lonely as a cloud
That floats on high o'er vales and hills,
When all at once I saw a crowd,
A host, of golden daffodils;
Beside the lake, beneath the trees,
Fluttering and dancing in the breeze.
"""

re.findall(r",", text)

[',', ',', ',', ',', ',']

In [13]:
re.findall(r";", text)


[';']

In [14]:
re.findall(r".", text)

['I',
 ' ',
 'w',
 'a',
 'n',
 'd',
 'e',
 'r',
 'e',
 'd',
 ' ',
 'l',
 'o',
 'n',
 'e',
 'l',
 'y',
 ' ',
 'a',
 's',
 ' ',
 'a',
 ' ',
 'c',
 'l',
 'o',
 'u',
 'd',
 'T',
 'h',
 'a',
 't',
 ' ',
 'f',
 'l',
 'o',
 'a',
 't',
 's',
 ' ',
 'o',
 'n',
 ' ',
 'h',
 'i',
 'g',
 'h',
 ' ',
 'o',
 "'",
 'e',
 'r',
 ' ',
 'v',
 'a',
 'l',
 'e',
 's',
 ' ',
 'a',
 'n',
 'd',
 ' ',
 'h',
 'i',
 'l',
 'l',
 's',
 ',',
 'W',
 'h',
 'e',
 'n',
 ' ',
 'a',
 'l',
 'l',
 ' ',
 'a',
 't',
 ' ',
 'o',
 'n',
 'c',
 'e',
 ' ',
 'I',
 ' ',
 's',
 'a',
 'w',
 ' ',
 'a',
 ' ',
 'c',
 'r',
 'o',
 'w',
 'd',
 ',',
 'A',
 ' ',
 'h',
 'o',
 's',
 't',
 ',',
 ' ',
 'o',
 'f',
 ' ',
 'g',
 'o',
 'l',
 'd',
 'e',
 'n',
 ' ',
 'd',
 'a',
 'f',
 'f',
 'o',
 'd',
 'i',
 'l',
 's',
 ';',
 'B',
 'e',
 's',
 'i',
 'd',
 'e',
 ' ',
 't',
 'h',
 'e',
 ' ',
 'l',
 'a',
 'k',
 'e',
 ',',
 ' ',
 'b',
 'e',
 'n',
 'e',
 'a',
 't',
 'h',
 ' ',
 't',
 'h',
 'e',
 ' ',
 't',
 'r',
 'e',
 'e',
 's',
 ',',
 'F',
 'l',
 'u',
 't'

The character `.` has a special meaning when it's expressed within a regex, and it means `any character but newline`.

So what if we wanted to match `breeze.`, for instance?

We should add an escape character (`\`) before the full stop!

Think about it, we already used the escape character before.
`\t` tells the interpreter: "do not match `t` as you usually would, but interpret it in some other way"

The escape character also works the other way around: characters like `(`, `)`, `[`, `]`, `{`, `}`, `+`, `*`, `?`, `|` bear a special meaning within the regex language, and they have to be escaped in order to be matched literally. We already saw when to use `{`, `}`, `+`, `*`, `?`, `|`, we'll see shortly about the others.

In [15]:
re.findall(r"\.", text)

['.']
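
The same idea can be sketched for the other sequences and special characters mentioned above. The sample strings are made up for illustration; `re.escape` is a standard-library helper, not covered in this section, that adds the backslashes for us.

```python
import re

table = "name\tage\nAda\t36\n"  # made-up sample with tabs and newlines

tabs = re.findall(r"\t", table)
newlines = re.findall(r"\n", table)
print(len(tabs), len(newlines))  # 2 2

expr = "(1+2)*3 vs 1+2*3"

# Escaped by hand: every special character gets its own backslash.
by_hand = re.findall(r"\(1\+2\)\*3", expr)

# Escaped automatically: re.escape returns a pattern matching the literal text.
automatic = re.findall(re.escape("(1+2)*3"), expr)

print(by_hand)    # ['(1+2)*3']
print(automatic)  # ['(1+2)*3']
```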

### Questions:

- what happens if you match `r"breeze."` without escaping the `.`?
- Can you demonstrate your hypothesis, by building some ad-hoc text?

## Classes of characters

Matching explicitly is not very useful if you want to express a language like `L = {all words of length 3}`.

At the moment, if we want to express `any letter in the alphabet` we'd need to write an expression like `r"a|b|c|d|e|f..."`. Quite inconvenient.

Just like `.` means any character, regex interpreters are also able to interpret some special sequences associated with classes of characters that are often matched by users:

- `r"\w"` matches any alphanumeric character, plus the underscore `_`
- `r"\d"` matches any digit
- `r"\s"` matches any whitespace character (i.e., tabs, standard spaces, newlines)


Always check [syntax specifications](https://docs.python.org/3/library/re.html#regular-expression-syntax) of the interpreter you're using!

In [16]:
text = "My name is John. I was born on Jan 1st 1990."

re.findall(r"\d", text)


['1', '1', '9', '9', '0']
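
The other classes behave in the same way, and they become more useful when combined with quantifiers. A minimal sketch on the same sentence (the two extra sample strings are made up for illustration):

```python
import re

text = "My name is John. I was born on Jan 1st 1990."

# One or more word characters in a row: roughly, the "words" of the text.
words = re.findall(r"\w+", text)

# One or more digits in a row: whole numbers instead of single digits.
numbers = re.findall(r"\d+", text)

# Space, tab and newline all count as whitespace.
spaces = re.findall(r"\s", "a b\tc\nd")

# The underscore is matched by \w as well.
identifier = re.findall(r"\w+", "snake_case")

print(words)
print(numbers)     # ['1', '1990']
print(spaces)      # [' ', '\t', '\n']
print(identifier)  # ['snake_case']
```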

### Questions:

Consider this text:
```
text = """
I wandered lonely as a cloud
That floats on high o'er vales and hills,
When all at once I saw a crowd,
A host, of golden daffodils;
Beside the lake, beneath the trees,
Fluttering and dancing in the breeze.
"""
```


- try and match any sequence of alphanumeric characters of length at most 3
- any sequence of characters of length exactly 3, preceded and followed by spaces
- any sequence composed of a space, a variable number of alphanumeric characters and a comma

We can also define a personalized class of characters, by using square brackets.

For instance, there's no pre-defined meta sequence for vowels but we can define:

`r"[aeiou]"` for `r"a|e|i|o|u"`


Square brackets also support ranges:

- `r"[0-9]"` matches the same ASCII digits as `r"\d"` (which, on Python 3 strings, also accepts digits from other scripts), but we can also define `r"[0-3]"` for instance
- `[a-z]` means all alphabetical letters from `a` to `z` (lowercased, pay attention to encoding!)
- and so on...
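
Here is a short sketch of custom classes and ranges at work, on a made-up sample string. Note how the accented `ù` falls outside both `[aeiou]` and `[a-z]`, which is why encoding deserves attention.

```python
import re

text = "più dolce la sera, 1990"

# Only the five listed vowels: the accented "ù" is not one of them.
vowels = re.findall(r"[aeiou]", text)

# A narrower range of digits than \d.
low_digits = re.findall(r"[0-3]", text)

# Runs of unaccented lowercase letters: "più" is cut short at "pi".
lowercase_runs = re.findall(r"[a-z]+", text)

print(vowels)          # ['i', 'o', 'e', 'a', 'e', 'a']
print(low_digits)      # ['1', '0']
print(lowercase_runs)  # ['pi', 'dolce', 'la', 'sera']
```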

### Questions

- a date between Jan 1st 1937 and Jan 29th 1937, expressed as DD/MM/YYYY
- A sequence of alphanumeric characters, starting and ending with spaces, and containing two or more contiguous `n`s (e.g., `anno`, `cannone`...)
- A sequence of alphanumeric characters, starting and ending with spaces, and containing two or more contiguous vowels (e.g., `aiuola`, `meteorologo`...)
- A sequence of alphanumeric characters containing at least two `n`s, not necessarily contiguous (e.g., `nano`, `panettone`...)
- `L = {"care", "mare", "fare", "rare", "pare", "gare"}`
- sequences of characters that resemble Italian past participles
- Even numbers in a text

We can also negate the content of a character class in order to match `any character, but those expressed in square brackets`. 

This is achieved by adding a `^` as the first character inside the square brackets.

So, `r"[^ab]"` means "any character, except `a` or `b`"

In [17]:
text = "Filastrocca di primavera più lungo è il giorno, più dolce la sera."

re.findall(r"[^aeiou]", text)

['F',
 'l',
 's',
 't',
 'r',
 'c',
 'c',
 ' ',
 'd',
 ' ',
 'p',
 'r',
 'm',
 'v',
 'r',
 ' ',
 'p',
 'ù',
 ' ',
 'l',
 'n',
 'g',
 ' ',
 'è',
 ' ',
 'l',
 ' ',
 'g',
 'r',
 'n',
 ',',
 ' ',
 'p',
 'ù',
 ' ',
 'd',
 'l',
 'c',
 ' ',
 'l',
 ' ',
 's',
 'r',
 '.']
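
As the output shows, a negated class matches spaces, punctuation and accented letters too, since they are simply "not in the list". The position of `^` also matters, as this small sketch on made-up strings suggests:

```python
import re

# `^` in first position negates the class: anything except "a" or "b".
not_ab = re.findall(r"[^ab]", "abc ab d")

# `^` anywhere else is just a literal caret inside the class.
a_or_caret = re.findall(r"[a^]", "a^b")

print(not_ab)      # ['c', ' ', ' ', 'd']
print(a_or_caret)  # ['a', '^']
```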

### Questions

- match sequences of alphanumeric characters that do not end with a punctuation mark
- in Italian some consonants like `r` and `l` can be followed both by a vowel and a consonant. Write a regular expression to match all instances of `r` or `l` when followed by a consonant.

## Groups (1)

We saw how quantifiers work on single characters:

`r"a+b" = {"ab", "aab", "aaab", "aaaab", ...}`

`r"[ae]+b" = {"ab", "eb", "aab", "aeb", "eeb", "aaab", "aaeb", "aeab", "eaab", "eeab", "eaeb", "eeeb", ...}`


Quantifiers can actually be applied to groups of characters rather than single ones.

Think about the language `L = {ab, abab, ababab, abababab, ...}`.
Here, it's the entire `ab` sequence that can be repeated from 1 to many times. In a regular expression, we write this as `r"(?:ab)+"`, meaning that the whole subsequence `ab` is repeated 1 to n times.

Groups can also be used to delimit the alternatives of the `or` (pipe `|`) operator.

`r"(?:a|uo)\w+"` will match words that start either with `a` or with the sequence `uo`

IMPORTANT: 

- when a pattern contains a group, `re.findall` returns only what's inside the group and not the whole matching string. To get the whole match, make the group non-capturing by placing `?:` right after the opening parenthesis, as in `(?:...)`.

In [32]:
text = "aria uovo uso"

re.findall(r"(a|uo)\w+", text)

['a', 'uo']

In [33]:
re.findall(r"(?:a|uo)\w+", text)

['aria', 'uovo']
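
The same distinction applies to the `(?:ab)+` pattern introduced above. A minimal sketch on a made-up string, comparing a quantifier applied to a group, to a single character, and to a capturing group:

```python
import re

text = "ab abab abb"

# The quantifier applies to the whole group "ab".
group_repeated = re.findall(r"(?:ab)+", text)

# Without a group, `+` only applies to the "b" right before it.
char_repeated = re.findall(r"ab+", text)

# With a capturing group, findall returns the group, not the whole match:
# for a repeated group, that is its last repetition.
capturing = re.findall(r"(ab)+", text)

print(group_repeated)  # ['ab', 'abab', 'ab']
print(char_repeated)   # ['ab', 'ab', 'ab', 'abb']
print(capturing)       # ['ab', 'ab', 'ab']
```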

### Questions

- can you match any valid number, expressed in Roman format, between 1 and 10? (i.e., I, II, III, IV, V, VI, VII, VIII, IX, X). Try and be as compact as possible!
- can you match any date in the format DD/MM/YYYY?
- can you match only words (i.e., sequences surrounded by spaces) that are made up of an odd number of characters?

## Introducing `re.finditer`

The function `re.finditer` ([docs](https://docs.python.org/3/library/re.html#re.finditer)) is slightly more complex than `re.findall`, but it is more flexible when it comes to handling groups and it also gives us a more informative output.

```
re.finditer(pattern, string, flags=0)

    Return an iterator yielding Match objects over all non-overlapping matches for the RE pattern in string. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result.
```

In [36]:
text = "aria uovo uso"

re.finditer(r"(a|uo)\w+", text)

<callable_iterator at 0x767595552ec0>

The function returned an `iterator`, an object that produces our results one at a time as we ask for them (for instance, in a `for` loop).
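
A small sketch of what "one at a time" means in practice, using Python's built-in `next` function (the variable names are made up for illustration). Once the results have been consumed, the iterator is empty and `re.finditer` has to be called again.

```python
import re

text = "aria uovo uso"
matches = re.finditer(r"(a|uo)\w+", text)

# Ask for the first result only.
first = next(matches)
print(first.group(0))  # aria

# Looping over the iterator continues from where we stopped.
remaining = [m.group(0) for m in matches]
print(remaining)  # ['uovo']

# Nothing is left now.
leftover = list(matches)
print(leftover)  # []
```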

Each result is a `Match` object ([docs](https://docs.python.org/3/library/re.html#match-objects)) which has many useful methods.
For instance:

- `Match.group`: group `0` corresponds to the entire match, while groups `1` to `99` correspond to the substrings matched by each pair of round parentheses, numbered from left to right.
- If a group matches multiple times, only the last match is accessible
- `Match.start` and `Match.end` return the indices of the start and end of the substring matched by a group. This way we can know how long our match is, for instance!

In [42]:
for match in re.finditer(r"(a|uo)\w+", text):
	print(match.group(0))

aria
uovo


In [44]:
for match in re.finditer(r"(a|uo)\w+", text):
	print(match.group(1), match.start(1), match.end(1))

a 0 1
uo 5 7
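
Two of the points listed above can be sketched a bit further: computing the length of each match, and what happens when a group matches multiple times. The sketch uses `Match.span` (which returns start and end together) and `re.search` (which returns the first match only); neither is covered in this section, and the variable names are made up for illustration.

```python
import re

text = "aria uovo uso"

# Collect (matched text, start, end, length) for every match.
summary = []
for match in re.finditer(r"(a|uo)\w+", text):
    start, end = match.span()  # same as (match.start(), match.end())
    summary.append((match.group(0), start, end, end - start))

print(summary)  # [('aria', 0, 4, 4), ('uovo', 5, 9, 4)]

# A group repeated three times: only its last repetition is kept.
repeated = re.search(r"(ab)+", "ababab")
print(repeated.group(0))  # ababab
print(repeated.group(1))  # ab
print(repeated.span(1))   # (4, 6)
```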
