## Anchors

We're still missing some tools!
At the moment we're using spaces to isolate "words", but this causes problems, because the words we're interested in are not always surrounded by spaces.

In [3]:
import re

text = """
I wandered lonely as a cloud
That floats on high o'er vales and hills,
When all at once I saw a crowd,
A host, of golden daffodils;
Beside the lake, beneath the trees,
Fluttering and dancing in the breeze.
"""

re.findall(r" \w+ ", text)

[' wandered ',
 ' as ',
 ' floats ',
 ' high ',
 ' vales ',
 ' all ',
 ' once ',
 ' saw ',
 ' of ',
 ' the ',
 ' beneath ',
 ' and ',
 ' in ']

This generates two issues:

1. not all words can be matched, because not all words are both preceded and followed by a space (e.g., `That` and `hills` on the second line of the poem: `That` follows a newline and `hills` is followed by a comma)
2. the space "consumes" a character, so for instance `wandered` is matched but `lonely` isn't: matches of `r" \w+ "` cannot overlap, and the trailing space of ` wandered ` is the same character as the leading space that ` lonely ` would need.

In order to solve these issues, we need to introduce the concept of an **anchor**: a special symbol that does not correspond to a character but matches a **position** in a string.

The most popular anchors are:

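- `^` matches at the beginning of the string (and, with `re.MULTILINE`, at the beginning of each line)
- `$` matches at the end of the string or just before a final newline (and, with `re.MULTILINE`, at the end of each line)
- `\b` matches at a word boundary (the position between a word character, i.e. one matched by `\w`, and a non-word character or the start or end of the string)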


IMPORTANT: anchors do not match any character! They just determine whether a match is valid depending on the position where the match happens.
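
The difference between consuming a character and matching a position can be checked directly. The following is a minimal sketch that uses only the first line of the poem; the variable names (`line`, `with_spaces`, `boundary`, `words`) are chosen for this example.

```python
import re

line = "I wandered lonely as a cloud"

# Literal spaces are part of each match, so two neighbouring words
# cannot both match: they would have to share the space between them.
with_spaces = re.findall(r" \w+ ", line)
print(with_spaces)  # [' wandered ', ' as ']

# \b matches a position: the match is empty and its span has zero length.
boundary = re.search(r"\b", line)
print(boundary.span(), repr(boundary.group()))  # (0, 0) ''

# Since \b consumes nothing, every word can be matched.
words = re.findall(r"\b\w+\b", line)
print(words)  # ['I', 'wandered', 'lonely', 'as', 'a', 'cloud']
```

The empty match returned for `\b` alone is the reason anchors can sit next to each other, or next to any character, without "using up" part of the string.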

In [None]:
re.findall(r"\b\w+\b", text)

['I',
 'wandered',
 'lonely',
 'as',
 'a',
 'cloud',
 'That',
 'floats',
 'on',
 'high',
 'o',
 'er',
 'vales',
 'and',
 'hills',
 'When',
 'all',
 'at',
 'once',
 'I',
 'saw',
 'a',
 'crowd',
 'A',
 'host',
 'of',
 'golden',
 'daffodils',
 'Beside',
 'the',
 'lake',
 'beneath',
 'the',
 'trees',
 'Fluttering',
 'and',
 'dancing',
 'in',
 'the',
 'breeze']

In [11]:
# re.MULTILINE is needed so that ^ corresponds to the start of each line
# (i.e., immediately after each \n, in addition to the start of the string).

re.findall(r"^\w+\b", text, flags=re.MULTILINE)

['I', 'That', 'When', 'A', 'Beside', 'Fluttering']
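
To see what the flag changes, and to try `$` as well, here is a small sketch on the first two lines of the poem. The variable names and the `[.,;]?` class for an optional trailing punctuation mark are choices made for this example only.

```python
import re

# Like the original text, this string starts with a newline.
text = """
I wandered lonely as a cloud
That floats on high o'er vales and hills,
"""

# Without re.MULTILINE, ^ only matches at the very start of the string,
# where the next character is "\n", so no word is found.
no_flag = re.findall(r"^\w+\b", text)
print(no_flag)  # []

# With re.MULTILINE, ^ also matches right after each newline.
first_words = re.findall(r"^\w+\b", text, flags=re.MULTILINE)
print(first_words)  # ['I', 'That']

# With re.MULTILINE, $ matches right before each newline: this finds the
# last word of each line, together with an optional punctuation mark.
last_words = re.findall(r"\b\w+\b[.,;]?$", text, flags=re.MULTILINE)
print(last_words)  # ['cloud', 'hills,']
```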

### Questions

- A word of at least two alphabetic characters followed by a colon
- Words starting with capital `P` at the beginning of the line
- Lines containing at most 3 words and ending with a question mark
- Sequences of words enclosed by parentheses: `(lorem ipsum dolor)`
- Lists of words separated by comma $\to$ `apple, kiwi, pear, orange, ...`
- Acronyms (`U.S.A`, `O.N.U`, etc.)

## Groups (2) and Backreferences

Once defined, a group can also be referenced later in the regular expression.

For example, we might want to look for words starting and ending with the same letter $\to$ `arpa, oggetto, pulp, snakes`

    (a letter) (the rest of the string) (the initial letter)

    (a) (rp) (a)


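Groups are numbered from 1 to n, in the order of their opening parentheses, so that the text matched by a group can be reused later in the pattern as `\1`, `\2`, and so on.

`r"\b(\w)\w+\1\b"` matches all words (of three or more characters) that start and end with the same letter.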
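
As a quick check of the pattern above, here is a sketch on a short sample string (the sample and the variable names are made up for illustration). One thing to keep in mind: when a pattern contains a group, `re.findall` returns the text captured by the group rather than the whole match, so `re.finditer` with `group(0)` is used here to recover the full words.

```python
import re

sample = "arpa, oggetto, pulp, snakes, cloud, host"
pattern = r"\b(\w)\w+\1\b"

# findall returns only what group 1 captured: the first letter of each match.
captured = re.findall(pattern, sample)
print(captured)  # ['a', 'o', 'p', 's']

# group(0) is the whole match, i.e. the complete word.
words = [m.group(0) for m in re.finditer(pattern, sample)]
print(words)  # ['arpa', 'oggetto', 'pulp', 'snakes']
```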

### Questions

- match words containing a double consonant
- match palindrome words of at most 6 characters
- match instances of reduplication or quasi-reduplication (e.g., `caffè-caffè`, `tessuto-non-tessuto`...)
- lines where the same word (at least 4 characters long) appears twice, possibly with a different word ending (e.g., `Chi portava il bambino o anche certamente la bambina poteva così ...` should match `bambino o anche certamente la bambina`)