# Regular expressions (RegEx)

- Online regex tester: https://regex101.com
- w3schools: https://www.w3schools.com/python/python_regex.asp
- re module documentation: https://docs.python.org/3/library/re.html

### Syntax

- **Literals** `a` 
- **Alternation** `a|b`
- **Character sets** `[ab]`, `[^ab]`
- **Wildcards** `.`
- **Escape special characters** `\` (`?`, `*`, `+`, `^`, `$`)
- **Ranges** `[a-d]`, `[1-9]`
- **Character classes** `\w`, `\d`, `\s`, `\n`, `\W`, `\D`, `\S`
- **Quantifiers** `{2}`, `{2,}`, `{2,4}`, `?`, `*`, `+`
- **Grouping** `()`
- **Anchors** `^`, `$`

### Methods

- **re.findall()**
- **re.sub()**
- **re.search()**
- **re.match()**
- **re.split()**

In [1]:
import re

In [2]:
text = "Pepe, Pepa and Luis are 22, 34, and 56 years old, respectively?"

In [3]:
# literals
# re.sub
text1 = re.sub("Luis", "Lola", text)
print(text1)

Pepe, Pepa and Lola are 22, 34, and 56 years old, respectively?


In [4]:
# alternation

text2 = re.sub("22|34", "40" , text)
print(text2)

Pepe, Pepa and Luis are 40, 40, and 56 years old, respectively?


In [9]:
# character sets
text3 = re.sub("[234]", "5", text)
print(text3)

Pepe, Pepa and Luis are 55, 55, and 56 years old, respectively?


In [10]:
# wildcards
text4 = re.sub("Pep.", "Felipe", text)
print(text4)

Felipe, Felipe and Luis are 22, 34, and 56 years old, respectively?


In [12]:
# escape special characters
# a raw string (r"...") keeps Python from interpreting the backslash itself
text5 = re.sub(r"\?", "!", text)
print(text5)

Pepe, Pepa and Luis are 22, 34, and 56 years old, respectively!
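
When a literal string contains several special characters, escaping each one by hand is error-prone. A minimal sketch, assuming the same `text` as above, that lets `re.escape()` build the escaped pattern (the replacement string is made up for illustration):

```python
import re

text = "Pepe, Pepa and Luis are 22, 34, and 56 years old, respectively?"

# re.escape() backslash-escapes the regex metacharacters in a string,
# so the resulting pattern matches that string literally.
pattern = re.escape("respectively?")
print(pattern)

# Illustrative replacement text, not part of the original notebook.
text_escaped = re.sub(pattern, "in that order!", text)
print(text_escaped)
```

This is mostly useful when the string to search for comes from user input or another variable.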


In [18]:
# ranges
# re.findall
print(re.findall("[a-df-z]", text))
print(re.findall("[A-Z]", text))
print(re.findall("[0-9]", text))

['p', 'p', 'a', 'a', 'n', 'd', 'u', 'i', 's', 'a', 'r', 'a', 'n', 'd', 'y', 'a', 'r', 's', 'o', 'l', 'd', 'r', 's', 'p', 'c', 't', 'i', 'v', 'l', 'y']
['P', 'P', 'L']
['2', '2', '3', '4', '5', '6']


In [28]:
# character classes
print(re.findall(r"\w+", text))

['Pepe', 'Pepa', 'and', 'Luis', 'are', '22', '34', 'and', '56', 'years', 'old', 'respectively']
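
The other character classes from the syntax list work the same way. The sketch below, again assuming the same `text`, pulls out the numbers with `\d` and uses a negated set that combines `\w` and `\s` to find punctuation:

```python
import re

text = "Pepe, Pepa and Luis are 22, 34, and 56 years old, respectively?"

# \d+ matches runs of one or more digits
numbers = re.findall(r"\d+", text)
print(numbers)

# [^\w\s] matches any character that is neither a word character
# nor whitespace, which here means the punctuation marks
punctuation = re.findall(r"[^\w\s]", text)
print(punctuation)
```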


In [38]:
# quantifiers
text = "baa ba b a aa aaa aaaa aaaaa"
print(re.findall("a", text))
print(re.findall("a{2}", text))
print(re.findall("a{2,}", text))
print(re.findall("a*", text))
print(re.findall("a+", text))
print(re.findall("ba+", text))
print(re.findall("a?", text))

['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a']
['aa', 'aa', 'aa', 'aa', 'aa', 'aa', 'aa']
['aa', 'aa', 'aaa', 'aaaa', 'aaaaa']
['', 'aa', '', '', 'a', '', '', '', 'a', '', 'aa', '', 'aaa', '', 'aaaa', '', 'aaaaa', '']
['aa', 'a', 'a', 'aa', 'aaa', 'aaaa', 'aaaaa']
['baa', 'ba']
['', 'a', 'a', '', '', 'a', '', '', '', 'a', '', 'a', 'a', '', 'a', 'a', 'a', '', 'a', 'a', 'a', 'a', '', 'a', 'a', 'a', 'a', 'a', '']
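
The bounded form `{2,4}` from the syntax list is not exercised above. A short sketch on the same string, which also shows `?` applied after a literal prefix:

```python
import re

text = "baa ba b a aa aaa aaaa aaaaa"

# a{2,4} matches between two and four consecutive 'a' characters;
# in "aaaaa" it takes four and the leftover single 'a' does not match
bounded = re.findall(r"a{2,4}", text)
print(bounded)

# ba? matches 'b' followed by an optional single 'a'
optional = re.findall(r"ba?", text)
print(optional)
```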


In [60]:
# grouping
# re.search
text = "abctrc abc"

print(re.search(r"([a-z]{2}c){2}\sabc", text))


<re.Match object; span=(0, 10), match='abctrc abc'>
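
The match object returned by `re.search()` also exposes what each group captured. A small sketch using the same pattern; note that a group inside a repetition keeps only its last iteration:

```python
import re

text = "abctrc abc"
match = re.search(r"([a-z]{2}c){2}\sabc", text)

whole = match.group(0)   # the entire matched text
last = match.group(1)    # group 1 repeated twice ("abc", then "trc"); only the last is kept
print(whole)
print(last)
print(match.span())
```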


In [64]:
# anchors
text = "Ironhack is the best school"
inverse = "The best school is Ironhack"

print(re.search("^Ironhack", text))
print(re.search("^Ironhack", inverse))
print(re.search("Ironhack$", text))
print(re.search("Ironhack$", inverse))

<re.Match object; span=(0, 8), match='Ironhack'>
None
None
<re.Match object; span=(19, 27), match='Ironhack'>
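
Using both anchors in one pattern is a common way to check that a whole string has a given shape. A minimal sketch; the helper name `is_two_digit_age` is hypothetical and only serves this illustration:

```python
import re

def is_two_digit_age(value):
    """Return True if the string consists of exactly two digits."""
    # ^ pins the match to the start and $ to the end,
    # so nothing else may surround the two digits
    return re.search(r"^\d{2}$", value) is not None

print(is_two_digit_age("34"))
print(is_two_digit_age("340"))
print(is_two_digit_age("age 34"))
```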


In [48]:
# re.search scans the whole string for the first match
# re.match only matches at the beginning of the string
if re.search("a", "hola"):
    print("found!")

print(re.search("a", "hola"))
print(re.match("a", "a"))

found!
<re.Match object; span=(3, 4), match='a'>
<re.Match object; span=(0, 1), match='a'>
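
The difference between the two functions only shows when the pattern does not sit at the start of the string. A short sketch on the same word:

```python
import re

found_anywhere = re.search("a", "hola")  # 'a' is found at index 3
found_at_start = re.match("a", "hola")   # None: "hola" does not start with 'a'
prefix = re.match("ho", "hola")          # matches, because "hola" starts with "ho"

print(found_anywhere)
print(found_at_start)
print(prefix)
```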


In [69]:
# re.split
text = "Pepe, Pepa and Luis are 22, 34, and 56 years old, respectively?"
print(re.split(r"\d\d", text))

['Pepe, Pepa and Luis are ', ', ', ', and ', ' years old, respectively?']
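
`re.split()` becomes more useful when the separator varies. The sketch below splits a shortened version of the sentence on either a comma or the word "and", and then uses the `maxsplit` argument to stop after the first split:

```python
import re

names = "Pepe, Pepa and Luis"

# split on a comma plus optional spaces, or on " and " surrounded by whitespace
parts = re.split(r",\s*|\s+and\s+", names)
print(parts)

text = "Pepe, Pepa and Luis are 22, 34, and 56 years old, respectively?"

# maxsplit=1 splits only at the first two-digit number
first_split = re.split(r"\d\d", text, maxsplit=1)
print(first_split)
```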


### Let's practice

You work for a very big company and you are assigned the task of verifying information from the 200 most important clients in Europe for a meeting with the board of directors in an hour. Execute the code below to see your dataframe.

In [None]:
# loading dataframe
import pandas as pd
a = pd.read_csv('db.csv')
display(a)

Oh no! It seems that one of the interns has messed up the `.csv` file, there is no backup, and you are to blame. In order to keep your job, you must find a way to restore the original data. But wait! There is no time to go through all the data manually. Good thing you know how to use `regular expressions`.

First of all, let's import the `re` module and load our text file into a variable.
This is the data we must recover:
- name
- phone
- email
- date
- contract_value
- creditcard
- country
- postalcode
- address

In [None]:
import re

In [None]:
with open('db.csv') as file:
    db = file.read()
print(db)
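
The contents of `db.csv` are not shown here, so the following is only a sketch of the general approach on a made-up line. The sample string, its separators, and both patterns are assumptions and will likely need adjusting to the real data:

```python
import re

# Hypothetical sample line; the real db.csv may look quite different.
sample = "John Smith;john.smith@example.com;+34 600 123 456;2021-03-15"

# Assumed email shape: local part, '@', then a domain containing at least one dot
email_pattern = r"[\w.+-]+@[\w-]+\.[\w.-]+"
# Assumed date shape: YYYY-MM-DD
date_pattern = r"\d{4}-\d{2}-\d{2}"

emails = re.findall(email_pattern, sample)
dates = re.findall(date_pattern, sample)
print(emails)
print(dates)
```

Running `re.findall()` with each pattern over the full `db` string would then return one list per field.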

In [None]:
#name


In [None]:
#phone


In [None]:
#email


In [None]:
#date


In [None]:
#contract_value


In [None]:
#creditcard


In [None]:
#country


In [None]:
#postalcode


In [None]:
#address
