# Regular Expressions (RegEx)

Regular expressions (regex) are strings written according to syntactic rules that describe sequences of characters. The sets of strings they describe are regular languages, a subclass of formal languages that is of great importance in information technology and, especially, in software development.

A regular expression can consist either exclusively of ordinary characters (such as `abc`) or of a combination of ordinary characters and metacharacters (such as `ab*c`). Metacharacters describe particular constructions or arrangements of characters: for example, whether a character must appear at the start of a line, or whether it must or may appear exactly once, several times, or not at all.
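As a minimal sketch of the difference between a purely literal pattern and one with a metacharacter, the snippet below tests a few made-up strings against `abc` and `ab*c`. The list `candidates` is only an illustration.

```python
import re

# Made-up strings to test against both patterns
candidates = ["ac", "abc", "abbbc", "abd"]

# Literal pattern: only the exact string "abc" matches in full
literal_hits = [s for s in candidates if re.fullmatch(r"abc", s)]

# With the metacharacter *: "a", then zero or more "b", then "c"
meta_hits = [s for s in candidates if re.fullmatch(r"ab*c", s)]

print(literal_hits)  # ['abc']
print(meta_hits)     # ['ac', 'abc', 'abbbc']
```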

![regex](https://miro.medium.com/max/1200/1*ZVlIZ1ZYC6rASz-dYPzhZQ.jpeg)

**references**

- https://docs.python.org/3/howto/regex.html
- https://www.datacamp.com/community/tutorials/python-regular-expression-tutorial
- https://www.dataquest.io/blog/regular-expressions-data-scientists/
- https://www.kaggle.com/rtatman/fraudulent-email-corpus

**may save your life**

- https://regex101.com/

### First things first

For the standard case, **import re** is enough, because `re` ships with Python. For the third-party `regex` module, **pip3 install regex** should install it.

In [1]:
import re
import numpy as np
import pandas as pd

## Syntax
### Special Characters:
- `.` Matches any character except a newline.
- `^` Matches the start of the string.
- `$` Matches the end of the string or just before the newline at the end of the string.
- `*` Matches 0 or more repetitions of the preceding RE.
- `+` Matches 1 or more repetitions of the preceding RE.
- `?` Matches 0 or 1 repetitions of the preceding RE.
- NOTE: with the `re.M` flag (multiline mode), `^` and `$` also match at the start and end of each line.

https://docs.python.org/3/library/re.html#re-syntax
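The special characters listed above can be tried out on small inputs. The following is a sketch with made-up strings (`log` and the word lists are assumptions for illustration) that shows how `^` and `$` behave with and without `re.M`, and how `?`, `+` and `.` behave.

```python
import re

log = "error: disk\nwarning: cpu\nerror: net"  # made-up three-line text

# ^ anchors to the start of the string; with re.M, to the start of every line
start_default = re.findall(r"^error", log)
start_multiline = re.findall(r"^error", log, flags=re.M)

# $ anchors to the end of the string; with re.M, to the end of every line
end_default = re.findall(r"\w+$", log)
end_multiline = re.findall(r"\w+$", log, flags=re.M)

# ? makes the preceding "u" optional (0 or 1 times)
optional_u = re.findall(r"colou?r", "color colour colouur")

# + requires at least one "o"
one_or_more = re.findall(r"go+l", "gl gol gooool")

# . matches any character except a newline
any_char = re.findall(r"a.c", "abc a\nc a-c")

print(start_default, start_multiline)  # ['error'] ['error', 'error']
print(end_default, end_multiline)      # ['net'] ['disk', 'cpu', 'net']
print(optional_u)                      # ['color', 'colour']
print(one_or_more)                     # ['gol', 'gooool']
print(any_char)                        # ['abc', 'a-c']
```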



### Special Sequences:

- **Literals** `a` 
- **Alternation** `a|b`
- **Character sets** `[ab]`, `[^ab]`
- **Wildcards** `.`
- **Escape special characters** `\` (?,*,+,^,$)
- **Ranges** `[a-d]`, `[1-9]`

- **Quantifiers** `{2}`, `{2,}`, `{2,4}`, `?`, `*`, `+`
- **Grouping** `()`
- **Anchors** `^`, `$`
- **Character classes** `\w`, `\d`, `\s`, `\n`, `\W`, `\D`, `\S`
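Each building block in the list above can be checked in isolation. This is a small sketch on made-up inputs; the variable names are only for illustration.

```python
import re

alternation = re.findall(r"cat|dog", "cat dog bird")     # either alternative
char_set = re.findall(r"[ab]", "cab")                     # any one of a, b
negated_set = re.findall(r"[^ab]", "cab")                 # anything except a, b
char_range = re.findall(r"[a-d]+", "abcdefg")             # run of letters a to d
escaped = re.findall(r"\?", "why? how?")                  # a literal question mark
exact = re.findall(r"\d{2}", "12345")                     # exactly two digits
bounded = re.findall(r"\d{2,4}", "1 22 333 4444 55555")   # two to four digits
grouped = re.search(r"(ab)+", "ababab").group()           # the group repeated

print(alternation)  # ['cat', 'dog']
print(char_set)     # ['a', 'b']
print(negated_set)  # ['c']
print(char_range)   # ['abcd']
print(escaped)      # ['?', '?']
print(exact)        # ['12', '34']
print(bounded)      # ['22', '333', '4444', '5555']
print(grouped)      # ababab
```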

**\w** - Matches any alphanumeric character (letters and digits) plus the underscore `_`. Equivalent to `[a-zA-Z0-9_]` for ASCII text. With Python 3 `str` patterns it also matches Unicode letters and digits such as `ó`, unless the `re.ASCII` flag is set.

**\d** - Matches any digit. Equivalent to `[0-9]` 

**\s** - Matches any whitespace character. Equivalent to `[ \t\n\r\f\v]`

**\W** - Matches any non-alphanumeric character. Equivalent to `[^a-zA-Z0-9_]`

**\D** - Matches any non digit. Equivalent to `[^0-9]` 

**\S** - Matches any non-whitespace character. Equivalent to `[^ \t\n\r\f\v]`
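To see the character classes in action, here is a short sketch on a made-up sentence (`sample` is an assumption for illustration). The last two lines show that `\w` follows Unicode rules by default and only falls back to `[a-zA-Z0-9_]` when `re.ASCII` is passed.

```python
import re

sample = "Room 42, floor_3 at 9am"  # made-up sentence

digits = re.findall(r"\d+", sample)      # runs of digits
words = re.findall(r"\w+", sample)       # runs of word characters
non_words = re.findall(r"\W+", sample)   # runs of everything else
tokens = re.split(r"\s+", sample)        # split on whitespace

# \w on a str pattern is Unicode-aware unless re.ASCII is set
unicode_word = re.findall(r"\w+", "M\u00f3nica")
ascii_word = re.findall(r"\w+", "M\u00f3nica", flags=re.ASCII)

print(digits)        # ['42', '3', '9']
print(words)         # ['Room', '42', 'floor_3', 'at', '9am']
print(non_words)     # [' ', ', ', ' ', ' ']
print(tokens)        # ['Room', '42,', 'floor_3', 'at', '9am']
print(unicode_word)  # ['Mónica']
print(ascii_word)    # ['M', 'nica']
```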



### Methods

### re.sub(pattern, repl, string, count=0)
Replaces one or many matches with a string
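Beyond plain replacement, the `count` parameter from the signature limits how many matches are replaced, and `repl` may contain backreferences or be a function. A minimal sketch on made-up strings:

```python
import re

# Collapse any run of whitespace into a single space
collapsed = re.sub(r"\s+", " ", "too    many \n spaces")

# count=2 replaces only the first two matches
first_two = re.sub(r"a", "A", "banana", count=2)

# \1 and \2 in the replacement refer to the captured groups
swapped = re.sub(r"(\w+)@(\w+)", r"\2@\1", "user@host")

# repl can also be a function that receives the match object
capitalized = re.sub(r"\b[a-z]", lambda m: m.group().upper(), "gabriel and clara")

print(collapsed)    # too many spaces
print(first_two)    # bAnAna
print(swapped)      # host@user
print(capitalized)  # Gabriel And Clara
```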

In [12]:
txt = "gabriel, Mónicaç, Borja and Clara are TA's??"

In [7]:
#re.sub
#Literals
re.sub(r"g",'G',txt)

"Gabriel, Mónicaç, Borja and Clara are TA's??"

In [6]:
#Ranges
re.sub(r'[A-Z]','',txt)

"gabriel, ónicaç, orja and lara are 's??"

In [13]:
#Escape special character, quantifiers (raw string so that \? reaches the regex engine untouched)
re.sub(r'\?{2}','.',txt)

"gabriel, Mónicaç, Borja and Clara are TA's."

### re.search(pattern, string, flags=0)
Scans through a string, looking for the first location where this RE matches. If the search is successful, `re.search()` returns a match object. Otherwise, it returns `None`.

In [14]:
#re.search
txt = "The rain in Spain"
x = re.search(r"^The.*Spain$", txt) 
x

<re.Match object; span=(0, 17), match='The rain in Spain'>

In [15]:
y = re.search("^Hola.*", txt) 
print(y)

None


In [18]:
#\b matches at a word boundary, so this finds a whole word starting with "S"
x = re.search(r"\bS\w+", txt)
print(x)
print(x.span())    # tuple containing the start and end positions of the match
print(x.start())   # start position of the match
print(x.end())     # end position of the match
print(x.string)    # the string passed into the function (variable 'txt')
print(x.group())   # the part of the string where there was a match

<re.Match object; span=(12, 17), match='Spain'>
(12, 17)
12
17
The rain in Spain
Spain


In [19]:
print(re.search(r'r\w*', txt))
print(re.search(r'R\w*', txt))
print(re.search(r'^T\w*', txt))
print(re.search(r'^t\w*', txt))

<re.Match object; span=(4, 8), match='rain'>
None
<re.Match object; span=(0, 3), match='The'>
None
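The `None` results above come from case sensitivity. The `flags` parameter in the signature accepts `re.IGNORECASE` to relax that. The sketch below also shows a guard against `None` before calling `.group()`; `first_word_starting_with` is a hypothetical helper, not part of `re`.

```python
import re

txt = "The rain in Spain"

# Case-sensitive search finds nothing; re.IGNORECASE does
case_sensitive = re.search(r"R\w*", txt)
ignore_case = re.search(r"R\w*", txt, flags=re.IGNORECASE)


def first_word_starting_with(letter, text):
    """Hypothetical helper: first word starting with `letter`, or None."""
    m = re.search(rf"\b{re.escape(letter)}\w*", text, flags=re.IGNORECASE)
    # Guard against None before calling .group()
    return m.group() if m else None


print(case_sensitive)                      # None
print(ignore_case.group())                 # rain
print(first_word_starting_with("s", txt))  # Spain
print(first_word_starting_with("z", txt))  # None
```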


### re.match(pattern, string)
Determine if the RE matches at the **beginning** of the string.

In [22]:
#re.match
pattern = r"Cookie"

sequence = "I want a Cookie"
sequence2= "Cookie, I want you!"

if re.match(pattern, sequence):
    print("Match!")
else: 
    print("Not a match!")

Not a match!


In [23]:
txt = "The rain in Madrid"
#matches at the beginning of the string
print(re.match(r'r\w*', txt))
print(re.match(r'^r\w*', txt))
print(re.match(r'^T\w*', txt))
print(re.match(r'T\w*', txt))

None
None
<re.Match object; span=(0, 3), match='The'>
<re.Match object; span=(0, 3), match='The'>


In [24]:
email_address = 'Please contact us at: support@thebridge.com'

match = re.search(r'([\.\w-]+)@([\w\.]+)', email_address)

if match:
    print(match.group()) # The whole matched text
    print(match.group(1)) # The username (group 1)
    print(match.group(2)) # The host (group 2)

support@thebridge.com
support
thebridge.com
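Numbered groups can get hard to read as patterns grow. As a sketch, the same email example can use named groups with `(?P<name>...)`; the group names `user` and `host` are chosen here for illustration.

```python
import re

email_address = "Please contact us at: support@thebridge.com"

# Named groups make the intent of each part explicit
match = re.search(r"(?P<user>[\w.-]+)@(?P<host>[\w.-]+)", email_address)

# groupdict() maps each group name to the text it captured
parts = match.groupdict() if match else {}

print(match.group("user"))  # support
print(match.group("host"))  # thebridge.com
print(parts)                # {'user': 'support', 'host': 'thebridge.com'}
```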


### re.fullmatch(pattern, string)
Returns a match object only if the **whole** string matches the RE. Otherwise, it returns `None`.

In [25]:
class_names = ["Karina", "Marina", "Isabel", "Gina", "Xinru", "Sonia", "Marycruz"]

for name in class_names:
    if re.fullmatch("Isabel", name):
        print(f"{name} is desired name")
    else:
        print(f"{name} is not desired name")

Karina is not desired name
Marina is not desired name
Isabel is desired name
Gina is not desired name
Xinru is not desired name
Sonia is not desired name
Marycruz is not desired name
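To make the difference between the three functions concrete, here is a sketch that applies `re.match`, `re.search` and `re.fullmatch` to the same name, and then uses `re.fullmatch` with a pattern instead of a literal.

```python
import re

class_names = ["Karina", "Marina", "Isabel", "Gina", "Xinru", "Sonia", "Marycruz"]
word = "Marina"

at_start = re.match(r"Mar", word)         # matches: "Mar" is at the beginning
anywhere = re.search(r"rin", word)        # matches: "rin" appears inside
whole_string = re.fullmatch(r"Mar", word) # None: "Mar" is not the whole string

# fullmatch with a pattern: names made of word characters ending in "ina"
ends_in_ina = [n for n in class_names if re.fullmatch(r"\w+ina", n)]

print(bool(at_start), bool(anywhere), bool(whole_string))  # True True False
print(ends_in_ina)  # ['Karina', 'Marina', 'Gina']
```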


### re.findall (pattern, string)
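Finds all substrings where the RE matches and returns them as a list. If the pattern contains capture groups, it returns the groups instead (as tuples when there is more than one group).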

In [86]:
#re.findall
email_address = "Please contact us at: support.data@data-science.com, xyz@thebridge.com"

#'addresses' is a list that stores all the matches found
addresses = re.findall(r'(([\w\.]+)@([\w\.-]+))', email_address)
addresses

[('support.data@data-science.com', 'support.data', 'data-science.com'),
 ('xyz@thebridge.com', 'xyz', 'thebridge.com')]

In [44]:
addresses2 = re.findall(r'[\w\.]+@[\w\.-]+', email_address)
addresses2

['support.data@data-science.com', 'xyz@thebridge.com']

In [46]:
lista = []
for a in addresses2:
    lista.append(re.search(r"[\w\.-]+", a).group())
lista

['support.data', 'xyz']

In [30]:
print(re.findall(r'[^aeiou\s]+',email_address))

['Pl', 's', 'c', 'nt', 'ct', 's', 't:', 's', 'pp', 'rt.d', 't', '@d', 't', '-sc', 'nc', '.c', 'm,', 'xyz@th', 'br', 'dg', '.c', 'm']


**What will the following two lines of code return?**



In [49]:
email_address = "Please contact us at: support.data@data-science.com, xyz@thebridge.com"

In [52]:
print(re.findall(r'c\w*\s', email_address))

['contact ']


In [53]:
print(re.findall(r'^P\w*', email_address))

['Please']


------------------------------------------------------------------------------

In [81]:
with open('info.txt', 'r') as file:
    client_info = file.read()

In [82]:
client_info[:150]

'Wyoming Dudley sit.amet.metus@egestasnunc.ca 2518 Nulla Road 34-739-941-941 Aristotle Grant arcu.Aliquam.ultrices@vestibulumneceuismod.co.uk 8453 Nost'

In [83]:
emails_clients=re.findall(r"[\w.]+@[\w.]+", client_info)
print(emails_clients[:5])

['sit.amet.metus@egestasnunc.ca', 'arcu.Aliquam.ultrices@vestibulumneceuismod.co.uk', 'Nulla.eu.neque@Fuscealiquetmagna.com', 'Nullam.velit@non.ca', 'In@gravidamolestiearcu.co.uk']


In [87]:
client_numbers = re.findall(r"((00)?\+?\d{2}-\d{3}-\d{3}-\d{3})", client_info)

#when the pattern contains a capture group, findall returns the groups instead of
#the whole match, so we wrap the entire regex in parentheses to capture it as well

In [91]:
print([tupla[0] for tupla in client_numbers][:5])

['34-739-941-941', '34-278-870-242', '34-999-876-292', '34-345-887-949', '34-905-089-682']
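An alternative to wrapping the whole regex in an extra pair of parentheses is a non-capturing group, `(?:...)`, or `re.finditer`. The sketch below uses a made-up `text` with two phone numbers to compare the three variants.

```python
import re

text = "Call 34-739-941-941 or 0034-278-870-242"  # made-up sample

# With a capturing group, findall returns only what the group captured
only_group = re.findall(r"(00)?\+?\d{2}-\d{3}-\d{3}-\d{3}", text)

# A non-capturing group (?:...) keeps the grouping but returns the full match
full_match = re.findall(r"(?:00)?\+?\d{2}-\d{3}-\d{3}-\d{3}", text)

# finditer yields match objects, so .group() is always the full match
via_finditer = [m.group() for m in re.finditer(r"(00)?\+?\d{2}-\d{3}-\d{3}-\d{3}", text)]

print(only_group)    # ['', '00']
print(full_match)    # ['34-739-941-941', '0034-278-870-242']
print(via_finditer)  # ['34-739-941-941', '0034-278-870-242']
```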


### re.split(pattern, string, maxsplit=0)
Returns a list where the string has been split at each match

In [92]:
#re.split
sente = "Hello,\n Please, contact me the sooner.\n Thank you,\n Me"
print(sente)

Hello,
 Please, contact me the sooner.
 Thank you,
 Me


In [93]:
reg = re.split(r"\n ", sente)
reg

['Hello,', 'Please, contact me the sooner.', 'Thank you,', 'Me']

In [95]:
" ".join(reg)

'Hello, Please, contact me the sooner. Thank you, Me'
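The `maxsplit` parameter in the signature limits the number of splits, and a capturing group in the pattern keeps the separators in the result. A minimal sketch with a made-up `csv_line`:

```python
import re

csv_line = "a, b,c,   d"  # made-up input with irregular spacing

# Split on a comma followed by any amount of whitespace
fields = re.split(r",\s*", csv_line)

# maxsplit=1 stops after the first split
head_tail = re.split(r",\s*", csv_line, maxsplit=1)

# A capturing group in the pattern keeps the separators in the output
with_separators = re.split(r"(\n)", "x\ny")

print(fields)           # ['a', 'b', 'c', 'd']
print(head_tail)        # ['a', 'b,c,   d']
print(with_separators)  # ['x', '\n', 'y']
```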

### re.compile(pattern)
Compiles a RE into a regular expression object.

In [96]:
name_check = re.compile(r"[^A-Za-z ]")

In [100]:
name = input("Please insert your name:")
while name_check.search(name):
    # keep looping while the name contains a character that is not a letter or a space
    print("Please enter your name correctly!")
    name = input("Please insert your name:")
print("Finally mate, I thought you'd never do it")

Please insert your name:Clara
Finally mate, I thought you'd never do it
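The same compiled pattern can be reused without `input()`, which makes it easier to test. This is a sketch; `is_valid_name` is a hypothetical helper and the names are made up. Note that the pattern only allows unaccented ASCII letters and spaces, so a name like "Mónica" is rejected.

```python
import re

# Compile once, reuse many times
NAME_CHECK = re.compile(r"[^A-Za-z ]")


def is_valid_name(name):
    """Hypothetical helper: True when no forbidden character is found."""
    return NAME_CHECK.search(name) is None


results = {n: is_valid_name(n) for n in ["Clara", "Cl4ra", "Ana Maria", "Borja!"]}
print(results)
# {'Clara': True, 'Cl4ra': False, 'Ana Maria': True, 'Borja!': False}
```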


-----------------------------------------------------------------------------------------------------------


## Some practice 
Now it's your turn.

#### Simple validation of a username
https://www.codewars.com/kata/56a3f08aa9a6cc9b75000023
    
Write a simple regex to validate a username. Allowed characters are:

- lowercase letters,
- numbers,
- underscore

Length should be between 4 and 16 characters (both included).


In [None]:
# your solution

In [None]:
def validate_usr(username):
    #your code here
    pass

In [104]:
check_usernames = ["unusuario", "UNUSUARIO", "UN_usuario", "un_usuario76", "8934_38875aa"]

In [None]:
# Mauro's team's solution

In [113]:
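import re

def validate_usr(username):
    # fullmatch requires the whole username to fit the pattern, so extra
    # characters or more than 16 characters are rejected (re.match only
    # checks the beginning of the string)
    patron = r'[a-z0-9_]{4,16}'
    busqueda = re.fullmatch(patron, username)
    return bool(busqueda)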

In [114]:
check_usernamesbool = [validate_usr(e) for e in check_usernames]
check_usernamesbool

[True, False, False, True, True]

In [None]:
# Miguel's team's solution

In [119]:
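def validateusr(username):
    # any character outside a-z, 0-9 and _ makes the username invalid
    # (a "+" inside [] is a literal plus sign, so it is left out of the set)
    res = re.compile(r"[^a-z0-9_]")
    if res.search(username) or len(username) < 4 or len(username) > 16:
        return False
    else:
        return True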

In [120]:
check_usernamesbool = [validateusr(e) for e in check_usernames]
check_usernamesbool

[True, False, False, True, True]
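A validator for this kata is worth stress-testing with inputs that have a valid prefix but are invalid as a whole. The sketch below compares `re.match` and `re.fullmatch` on the same pattern; the `tricky` list is made up for illustration.

```python
import re

PATTERN = r"[a-z0-9_]{4,16}"

# Made-up edge cases: bad trailing character, too long, too short, valid
tricky = ["abcdE", "a" * 17, "abc", "valid_name1"]

# re.match only needs the beginning of the string to fit the pattern
with_match = [bool(re.match(PATTERN, u)) for u in tricky]

# re.fullmatch needs the entire string to fit the pattern
with_fullmatch = [bool(re.fullmatch(PATTERN, u)) for u in tricky]

print(with_match)      # [True, True, False, True]
print(with_fullmatch)  # [False, False, False, True]
```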

#### Regex validate PIN code 

https://www.codewars.com/kata/55f8a9c06c018a0d6e000132

ATM machines allow 4 or 6 digit PIN codes and PIN codes cannot contain anything but exactly 4 digits or exactly 6 digits.

If the function is passed a valid PIN string, return true, else return false.

Examples:
```python
"1234"   -->  True
"12345"  -->  False
"a234"   -->  False
```

In [None]:
# your solution

In [137]:
pins = ["8475","fklajkfd"]

In [128]:
def validate_pin(pin):
    #return true or false
    pass

In [129]:
# Adri's team's solution

In [141]:
def validate_pin(pin):
    return bool(re.fullmatch(r'(\d{4}|\d{6})', pin))

In [142]:
check_pinbool = [validate_pin(e) for e in pins]
check_pinbool

[True, False]

In [None]:
# Marycruz's team's solution

In [143]:
def validatepin(pin):
    #return true or false
    regex = re.compile(r'^\d{4}$|^\d{6}$')
    if regex.fullmatch(str(pin)):
        return True
    else:
        return False

In [144]:
check_pinbool = [validatepin(e) for e in pins]
check_pinbool

[True, False]
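Two edge cases are worth knowing for this kata. `$` also matches just before a trailing newline, and `\d` on a `str` pattern matches non-ASCII decimal digits too. The sketch below illustrates both with made-up inputs; whether the kata tests them is an assumption.

```python
import re

trailing_newline = "1234\n"

# $ matches before a trailing newline, so an anchored search accepts it
anchored_search = bool(re.search(r"^\d{4}$", trailing_newline))

# fullmatch must consume the newline as well, so it rejects it
full = bool(re.fullmatch(r"\d{4}", trailing_newline))

# Arabic-Indic digits 1 to 4: \d accepts them unless re.ASCII is set
arabic_digits = "\u0661\u0662\u0663\u0664"
unicode_digits_ok = bool(re.fullmatch(r"\d{4}", arabic_digits))
ascii_only = bool(re.fullmatch(r"\d{4}", arabic_digits, flags=re.ASCII))

print(anchored_search, full)          # True False
print(unicode_digits_ok, ascii_only)  # True False
```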

-----------------------------------------------------------------------------------------------------------

# The FBI challenge

- https://www.fbi.gov/scams-and-safety/common-fraud-schemes/nigerian-letter-or-419-fraud
- https://www.kaggle.com/rtatman/fraudulent-email-corpus

It's your first day at the FBI office and your boss has just sent you a text file, `emails.txt`. She has asked you to run some analysis, but first of all you need to build a DataFrame with one row per email and the columns `sender_email`, `sender_name`, `date_sent`, `time_sent` and `subject`. You'll need some Python knowledge and some regex for that goal.

---------------------------------------------------------------------------------------------------------------------------

#### Since we are good people, here you have a proposed solution

In [163]:
emails_info={}

In [164]:
# read the whole file and make sure it is closed afterwards
with open("emails.txt", "r") as file:
    fh = file.read()

In [165]:
fh.count("From r")

3977

In [166]:
contents = re.split(r"From r", fh)

In [149]:
contents[0]

''

In [167]:
contents.pop(0)

''
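Splitting on the bare text `From r` can also cut inside a message body if that text happens to appear there. As a hedged sketch, anchoring the separator to the start of a line with `^` and `re.M` avoids that. The `raw` string below is a made-up miniature of the file, not its real content.

```python
import re

# Made-up miniature mailbox; the real emails.txt may look different
raw = (
    "From r  Wed Oct 30 21:41:56 2002\n"
    "From: a@example.com\n"
    "Subject: Hello\n"
    "Greetings. From reliable sources I heard...\n"
    "From r  Thu Oct 31 08:11:39 2002\n"
    "From: b@example.com\n"
    "Subject: Hi again\n"
)

# Naive split also cuts at "From reliable" inside the first body
naive = re.split(r"From r", raw)

# Anchored split only cuts where "From r" starts a line
anchored = re.split(r"^From r", raw, flags=re.M)

# Drop the empty piece that precedes the first separator
messages = [chunk for chunk in anchored if chunk.strip()]

print(len(naive))     # 4
print(len(messages))  # 2
```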

Fields to extract from each email:

- sender_email
- sender_name
- date_sent
- time_sent
- subject

### Info Sender

In [168]:
info_sender=[]
for e in contents:
    # first "From:" header line of the email, or a placeholder if there is none
    match = re.search(r"From:.*", e)
    if match:
        info_sender.append(match.group())
    else:
        info_sender.append("not found")

In [152]:
len(info_sender)

3977

In [153]:
#sender_email
emails_info['sender_email']=[]
for line in info_sender:
    res=re.findall(r'[\w\.]+@[\w\.-]+', line)
    if res:
        emails_info['sender_email'].append(res[0])
    else:
        emails_info['sender_email'].append(np.nan)
        
len(emails_info['sender_email'])

3977

In [154]:
#sender name
emails_info['sender_name']=[]
for line in info_sender:
    res=re.findall(r':.*<', line)
    if res:
        emails_info['sender_name'].append(res[0][1:-1])
    else:
        emails_info['sender_name'].append(np.nan)
len(emails_info['sender_name'])

3977

### Info Dates

In [155]:
#DATES
dates=[]
for e in contents:
    # first "Date:" header line of the email, or a placeholder if there is none
    match = re.search(r"Date:.*", e)
    if match:
        dates.append(match.group())
    else:
        dates.append("not found")
len(dates)

3977

In [156]:
#email date
emails_info['date_sent']=[]
for dat in dates:
    res=re.findall(r"\d+\s\w{3}\s\d+", dat)
    if res:
        emails_info['date_sent'].append(res[0])
    else:
        emails_info['date_sent'].append(np.nan)

len(emails_info['date_sent'])

3977

In [157]:
emails_info['time_sent']=[]
for dat in dates:
    res=re.findall(r"\d{2}:\d{2}:\d{2}", dat)
    if res:
        emails_info['time_sent'].append(res[0])
    else:
        emails_info['time_sent'].append(np.nan)

len(emails_info['time_sent'])

3977

### Subject

In [158]:
subject=[]
for e in contents:
    # first "Subject:" header line of the email, or a placeholder if there is none
    match = re.search(r"Subject:.*", e)
    if match:
        subject.append(match.group())
    else:
        subject.append("not found")
len(subject)

3977

In [159]:
emails_info['subject']=[]
for sub in subject:
    res=re.findall(r":.*", sub)
    if res:
        emails_info['subject'].append(res[0][2:])
    else:
        emails_info['subject'].append(np.nan)

len(emails_info['subject'])

3977
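The four extraction loops share the same shape, so they could be folded into one function that returns a row per email. This is only a sketch under assumptions: `FIELD_PATTERNS` and `parse_email` are hypothetical names, the patterns are adapted from the cells above to use one capture group each, and `sample` is a made-up header block built from the first row of the final table (the timezone is invented).

```python
import re

# One pattern per column; group 1 holds the value to keep
FIELD_PATTERNS = {
    "sender_email": r"From:.*?([\w.]+@[\w.-]+)",
    "sender_name": r"From:\s*(.*?)\s*<",
    "date_sent": r"Date:.*?(\d+\s\w{3}\s\d+)",
    "time_sent": r"Date:.*?(\d{2}:\d{2}:\d{2})",
    "subject": r"Subject:\s*(.*)",
}


def parse_email(text):
    """Hypothetical helper: extract every field, using None when missing."""
    row = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, text)
        row[field] = match.group(1) if match else None
    return row


# Made-up header block for illustration
sample = (
    'From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>\n'
    "Date: Thu, 31 Oct 2002 02:38:20 +0000\n"
    "Subject: URGENT BUSINESS ASSISTANCE AND PARTNERSHIP\n"
)

row = parse_email(sample)
partial = parse_email("Subject: no sender here\n")
print(row)
print(partial)
```

A list of such rows can be passed directly to `pd.DataFrame`, with `None` showing up as a missing value.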

### Creating DataFrame

In [160]:
df=pd.DataFrame(emails_info)
df.isnull().sum()

sender_email    476
sender_name     837
date_sent       614
time_sent       618
subject          27
dtype: int64

In [170]:
df.head()

| | sender_email | sender_name | date_sent | time_sent | subject |
|---|---|---|---|---|---|
| 0 | james_ngola2002@maktoob.com | "MR. JAMES NGOLA." | 31 Oct 2002 | 02:38:20 | URGENT BUSINESS ASSISTANCE AND PARTNERSHIP |
| 1 | bensul2004nng@spinfinder.com | "Mr. Ben Suleman" | 31 Oct 2002 | 05:10:00 | URGENT ASSISTANCE /RELATIONSHIP (P) |
| 2 | obong_715@epatra.com | "PRINCE OBONG ELEME" | 31 Oct 2002 | 22:17:55 | GOOD DAY TO YOU |
| 3 | obong_715@epatra.com | "PRINCE OBONG ELEME" | 31 Oct 2002 | 22:44:20 | GOOD DAY TO YOU |
| 4 | m_abacha03@www.com | "Maryam Abacha" | 1 Nov 2002 | 01:45:04 | I Need Your Assistance. |


### Now you can start your analysis!