In [2]:
import re

- What happens if you match `r"breeze."` without escaping the `.`?
- Can you demonstrate your hypothesis by building some ad-hoc text?

In [None]:
# An unescaped `.` matches any single character except a newline, so r"breeze."
# matches the literal `breeze.` but also `breeze` followed by any other
# character (e.g. `breezes`). Escaping the dot restricts the match to `breeze.`

text = """
I wandered lonely as a cloud
That floats on high o'er vales and hills,
When all at once I saw a crowd,
A host, of golden daffodils;
Beside the lake, beneath the trees,
Fluttering and dancing in the breeze.

Beside the lake, beneath the trees,
Fluttering and dancing in the breezes
"""

print("Matching 'breeze.' returns:", re.findall(r"breeze.", text))

# raw string for the label too: "\." is an invalid escape sequence in a normal string
print(r"Matching 'breeze\.' returns:", re.findall(r"breeze\.", text))

Matching 'breeze.' returns: ['breeze.', 'breezes']
Matching 'breeze\.' returns: ['breeze.']
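
When the literal text comes from a variable, escaping by hand is easy to get wrong. The sketch below assumes the standard-library helper `re.escape` is acceptable for this exercise; the `sample` text and the variable names are made up for illustration.

```python
import re

# Hypothetical ad-hoc text: one true "breeze." and two near misses.
sample = "breeze. breezes breeze!"

literal = "breeze."
# re.escape backslash-escapes regex metacharacters, so "." is matched literally.
escaped = re.escape(literal)

unescaped_hits = re.findall(literal, sample)
escaped_hits = re.findall(escaped, sample)

print(escaped)
print(unescaped_hits)
print(escaped_hits)
```

The unescaped pattern reports all three tokens, while the escaped one keeps only the token that really ends with a full stop.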


- Try to match any sequence of alphanumeric characters of length at most 3
- Any sequence of characters of length exactly 3, preceded and followed by spaces
- Any sequence composed of a space, a variable number of alphanumeric characters and a comma

In [5]:
text = """
I wandered lonely as a cloud
That floats on high o'er vales and hills,
When all at once I saw a crowd,
A host, of golden daffodils;
Beside the lake, beneath the trees,
Fluttering and dancing in the breeze.
"""

print(re.findall(r"\w{,3}", text))

print(re.findall(r" \w{3} ", text))

print(re.findall(r" \w+,", text))

['', 'I', '', 'wan', 'der', 'ed', '', 'lon', 'ely', '', 'as', '', 'a', '', 'clo', 'ud', '', 'Tha', 't', '', 'flo', 'ats', '', 'on', '', 'hig', 'h', '', 'o', '', 'er', '', 'val', 'es', '', 'and', '', 'hil', 'ls', '', '', 'Whe', 'n', '', 'all', '', 'at', '', 'onc', 'e', '', 'I', '', 'saw', '', 'a', '', 'cro', 'wd', '', '', 'A', '', 'hos', 't', '', '', 'of', '', 'gol', 'den', '', 'daf', 'fod', 'ils', '', '', 'Bes', 'ide', '', 'the', '', 'lak', 'e', '', '', 'ben', 'eat', 'h', '', 'the', '', 'tre', 'es', '', '', 'Flu', 'tte', 'rin', 'g', '', 'and', '', 'dan', 'cin', 'g', '', 'in', '', 'the', '', 'bre', 'eze', '', '', '']
[' and ', ' all ', ' saw ', ' the ', ' the ', ' and ', ' the ']
[' hills,', ' crowd,', ' host,', ' lake,', ' trees,']
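
The empty strings in the first result appear because `{,3}` also allows zero repetitions, and longer words are cut into chunks of three. A possible refinement, assuming "length at most 3" is meant to apply to whole words, is sketched below on a made-up `sample` string.

```python
import re

# Hypothetical short text, used only for this illustration.
sample = "I saw a crowd, a host"

# {,3} allows zero repetitions, so the pattern also succeeds on empty strings
# and splits longer words into pieces of up to three characters.
at_most_three = re.findall(r"\w{,3}", sample)

# At least one character, with word boundaries on both sides:
# only whole words of one to three characters are kept.
short_words = re.findall(r"\b\w{1,3}\b", sample)

print(at_most_three)
print(short_words)
```

Which reading is intended depends on the exercise; the second pattern is only one interpretation.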


- a date between Jan 1st 1937 and Jan 29th 1937, expressed as DD/MM/YYYY
- A sequence of alphanumeric characters, starting and ending with spaces, and containing two or more contiguous `n`s (e.g., `anno`, `cannone`...)
- A sequence of alphanumeric characters, starting and ending with spaces, and containing two or more contiguous vowels (e.g., `aiuola`, `meteorologo`...)
- A sequence of alphanumeric characters containing at least two `n`s, not necessarily contiguous (e.g., `nano`, `panettone`...)
- `L = {"care", "mare", "fare", "rare", "pare", "gare"}`
- Sequences of characters that resemble Italian past participles
- Even numbers in a text

In [8]:
text = """
C'era una volta un nano, nato il 12/01/1937, che si chiamava Nino.
Nino aveva un'amica, una fata, che si chiamava Nina.
Con l'avvicinarsi della Pasqua, Nino e nina si misero a fare un panettone.
Dopo averlo cucinato, Nino guardò il calendario e disse: "Nina, mi pare che il nostro panettone
sia fuori tempo massimo! È passato quasi un anno intero dal Natale ormai!
"""

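# Days 1-29 with an optional leading zero, so single-digit days such as 05 are covered too
print(re.findall(r"\b(?:0?[1-9]|[12][0-9])/0?1/1937", text))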

print(re.findall(r" \w*nn\w* ", text))

print(re.findall(r" \w*[aeiou]{2}\w* ", text))

print(re.findall(r"\w*n\w*n\w*", text))

print(re.findall(r"[cmfrpg]are", text))

print(re.findall(r"\w+[iua]t[oaei]", text))

# \b on both sides keeps the match on whole numbers, so the `0` inside `01` is not reported
print(re.findall(r"\b\d*[02468]\b", text))

['12/01/1937']
[' anno ']
[' chiamava ', ' chiamava ', ' guardò ', ' calendario ', ' fuori ', ' quasi ']
['nano', 'nina', 'panettone', 'panettone', 'anno']
['fare', 'pare']
['nato', 'fata', 'cucinato', 'passato', 'Nata']
['12']
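
The story contains a single date, so it says little about the edges of the range. One way to probe a date pattern is to run it with `fullmatch` over hand-picked candidates. The pattern and the candidate list below are illustrative assumptions, not part of the original exercise.

```python
import re

# Day 1-29 (leading zero optional), month January, year 1937.
january_1937 = re.compile(r"(?:0?[1-9]|[12][0-9])/0?1/1937")

# Hypothetical candidates covering both ends of the range and a few invalid dates.
candidates = [
    "01/01/1937", "9/1/1937", "12/01/1937", "29/01/1937",
    "30/01/1937", "00/01/1937", "12/02/1937",
]

# fullmatch succeeds only if the whole string is a valid date.
accepted = [d for d in candidates if january_1937.fullmatch(d)]
print(accepted)
```

Using `fullmatch` instead of `findall` avoids partial hits such as a valid date hiding inside a longer digit run.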


- Match sequences of alphanumeric characters that do not end with a punctuation mark
- In Italian, some consonants like `r` and `l` can be followed by either a vowel or a consonant. Write a regular expression to match all instances of `r` or `l` when followed by a consonant.

In [9]:
text = """
I wandered lonely as a cloud
That floats on high o'er vales and hills,
When all at once I saw a crowd,
A host, of golden daffodils;
Beside the lake, beneath the trees,
Fluttering and dancing in the breeze.
"""

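# Note: \w+ can backtrack by one character, so a word followed by punctuation
# (e.g. `hills,`) is still matched: \w+ takes `hill` and the character class takes `s`
print(re.findall(r"\w+[^.,!?;:]", text))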

print(re.findall(r"[rl][^aeiou ]", text))

['I ', 'wandered ', 'lonely ', 'as ', 'a ', 'cloud\n', 'That ', 'floats ', 'on ', 'high ', "o'", 'er ', 'vales ', 'and ', 'hills', 'When ', 'all ', 'at ', 'once ', 'I ', 'saw ', 'a ', 'crowd', 'A ', 'host', 'of ', 'golden ', 'daffodils', 'Beside ', 'the ', 'lake', 'beneath ', 'the ', 'trees', 'Fluttering ', 'and ', 'dancing ', 'in ', 'the ', 'breeze']
['ly', 'll', 'll', 'ld', 'ls']
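
Both patterns above consume the character that follows the match. A lookahead checks that character without consuming it. The sketch below assumes lookaheads are allowed in this exercise and uses a made-up `sample` string; the consonant class is restricted to lowercase ASCII letters.

```python
import re

# Hypothetical short text, used only for this illustration.
sample = "vales and hills, a crowd"

# Character-class version: \w+ backtracks one character so [^.,!?;:] can match "s".
consuming = re.findall(r"\w+[^.,!?;:]", sample)

# Lookahead version: \b pins the end of the word, (?!...) rejects a following punctuation mark.
lookahead = re.findall(r"\b\w+\b(?![.,!?;:])", sample)

# r or l only when the next character is a lowercase consonant; the consonant is not consumed.
liquids = re.findall(r"[rl](?=[b-df-hj-np-tv-z])", sample)

print(consuming)
print(lookahead)
print(liquids)
```

With the lookahead, `hills` is dropped entirely because it is followed by a comma, and both `l`s of `hills` are reported since the match no longer swallows the following letter.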


- Can you match any valid number, expressed in Roman format, between 1 and 10 (i.e., I, II, III, IV, V, VI, VII, VIII, IX, X)? Try to be as compact as possible!
- Can you match any date in the format DD/MM/YYYY, only up to 2025? (You can ignore the fact that different months have a different number of days, but you should consider a date like 52/3/2027 impossible.)
- Can you match only words (i.e., sequences surrounded by spaces) that are made up of an odd number of characters?

In [24]:
text = """
Nel regno di Numeralia, tutto si contava in numeri romani.
Le settimane duravano X giorni, le ore erano solo VI, e ogni mese aveva XXV lune.

Il giovane scudiero Elio, nato il 03/04/2003, sognava di partecipare al Torneo dei Numeri, che si teneva ogni VII anni nel giorno 08/03/2024.

«Per diventare cavaliere,» gli disse il Mago Quadratus, «devi superare III prove:
una di forza, una di mente e una di coraggio.»

La prima prova fu una corsa contro V draghi. La seconda, un enigma con IX chiavi dorate. La terza, una sfida contro il silenzio lungo IV notti.

Elio completò tutto entro il 10/03/2024, e fu nominato Cavaliere del Numero Perfetto il giorno 11/03/2024, al suono di II campane d’argento.

Da allora, ogni anno, tra il I e il X marzo, i bambini del regno si esercitano a contare in numeri romani, sognando anche loro di diventare eroi…
proprio come succederà a Elio il Decimo il 34/12/2025.
"""

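# Day 01-31, month 1-12 (leading zero optional), year 1900-2025:
# the year is split into 19xx, 2000-2019 and 2020-2025 so that e.g. 2016 is accepted and 2026 is not
print(re.findall(r"\b(?:0[1-9]|[12][0-9]|3[01])/(?:0?[1-9]|1[0-2])/(?:19[0-9]{2}|20[01][0-9]|202[0-5])\b", text))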

print(re.findall(r" V?I{1,3} | I?[XV] ", text))

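# Note: (?:\w\w)+ matches an EVEN number of characters, as the output below shows;
# an odd-length word needs one extra character: r" \w(?:\w\w)* "
print(re.findall(r" (?:\w\w)+ ", text))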

['03/04/2003', '08/03/2024', '10/03/2024', '11/03/2024']
[' X ', ' VII ', ' III ', ' V ', ' IX ', ' IV ', ' II ', ' I ', ' X ']
[' di ', ' si ', ' in ', ' duravano ', ' le ', ' solo ', ' ogni ', ' scudiero ', ' nato ', ' di ', ' al ', ' si ', ' ogni ', ' anni ', ' giorno ', ' il ', ' superare ', ' di ', ' di ', ' di ', ' fu ', ' contro ', ' La ', ' un ', ' IX ', ' La ', ' contro ', ' silenzio ', ' IV ', ' completò ', ' il ', ' fu ', ' Numero ', ' il ', ' al ', ' di ', ' ogni ', ' il ', ' il ', ' si ', ' in ', ' sognando ', ' loro ', ' come ', ' Elio ', ' Decimo ']
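
The parity of a word can be controlled by pairing characters: `(?:\w\w)+` gives an even length, `\w(?:\w\w)*` an odd one. Because a pattern that consumes the surrounding spaces can skip the next word, the sketch below replaces them with the lookarounds `(?<!\S)` and `(?!\S)`. This is an assumption about what the exercise allows, and the `sample` string is made up.

```python
import re

# Hypothetical short text: word lengths are 2, 3, 5, 4, 2, 1, 4, 4.
sample = "le ore erano solo VI e ogni mese"

# One character plus any number of pairs: odd length.
odd_words = re.findall(r"(?<!\S)\w(?:\w\w)*(?!\S)", sample)

# One or more pairs: even length.
even_words = re.findall(r"(?<!\S)(?:\w\w)+(?!\S)", sample)

print(odd_words)
print(even_words)
```

The lookarounds only assert that no non-space character touches the word, so adjacent words are all tested and words at the start or end of the text are included.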


- A word of at least two alphabetical tokens followed by a colon
- Words starting with capital `P` at the beginning of the line
- Lines containing at most 3 words and ending with a question mark
- Sequences of words enclosed by parentheses: `(lorem ipsum dolor)`
- Lists of words separated by commas $\to$ `apple, kiwi, pear, orange, ...`
- Acronyms (`U.S.A`, `O.N.U`, etc.)

In [None]:
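# Word of at least two letters, immediately followed by a colon
print(re.findall(r"\b[A-Za-z]{2,}:", text))

# re.MULTILINE makes ^ and $ match at every line boundary, not only at the ends of the text
print(re.findall(r"^P\w*", text, flags=re.MULTILINE))

print(re.findall(r"^\w+(?: \w+){,2}\?$", text, flags=re.MULTILINE))

# Letters and spaces enclosed by literal parentheses
print(re.findall(r"\([A-Za-z ]+\)", text))

# At least two words, separated by a comma and an optional space
print(re.findall(r"\b\w+(?:, ?\w+)+", text))

# Capital letters separated by dots, with an optional trailing dot
print(re.findall(r"\b[A-Z](?:\.[A-Z])+\.?", text))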
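
The story used above exercises few of these constructs, so ad-hoc text is a convenient way to check each pattern. The `sample` text and the patterns below are one possible reading of the six exercises, written as a sketch rather than a reference solution.

```python
import re

# Hypothetical text with one instance of each construct.
sample = """Pasta is ready
Note: buy apple, kiwi, pear
Is it late?
Perhaps (lorem ipsum dolor) was sent by the O.N.U
"""

with_colon = re.findall(r"\b[A-Za-z]{2,}:", sample)
# re.MULTILINE lets ^ and $ match at the start and end of every line.
p_words = re.findall(r"^P\w*", sample, flags=re.MULTILINE)
short_questions = re.findall(r"^\w+(?: \w+){,2}\?$", sample, flags=re.MULTILINE)
parenthesised = re.findall(r"\([A-Za-z ]+\)", sample)
comma_lists = re.findall(r"\b\w+(?:, ?\w+)+", sample)
acronyms = re.findall(r"\b[A-Z](?:\.[A-Z])+\.?", sample)

for result in (with_colon, p_words, short_questions,
               parenthesised, comma_lists, acronyms):
    print(result)
```

Without `re.MULTILINE`, `^` would match only at the very start of the text and `$` only at its end, so only `Pasta` would be found and the question line would not match at all.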