# Regular Expressions (RegEx)

A regular expression (regex or regexp) is a sequence of characters that defines a search pattern.

![regex](https://miro.medium.com/max/1200/1*ZVlIZ1ZYC6rASz-dYPzhZQ.jpeg)

**references**

- https://docs.python.org/3/howto/regex.html
- https://www.datacamp.com/community/tutorials/python-regular-expression-tutorial
- https://www.dataquest.io/blog/regular-expressions-data-scientists/
- https://www.kaggle.com/rtatman/fraudulent-email-corpus

**may save your life**

- https://regex101.com/

### First things first

The standard-library module covers everything in this notebook, so **import re** is enough. If you want the third-party **regex** module instead, **pip3 install regex** should install it.

In [2]:
import re
import numpy as np
import pandas as pd

## Syntax
### Special Characters:
- `.` Matches any character except a newline.
- `^` Matches the start of the string.
- `$` Matches the end of the string or just before the newline at the end of the string.
- `*` Matches 0 or more repetitions of the preceding RE.
- `+` Matches 1 or more repetitions of the preceding RE.
- `?` Matches 0 or 1 repetitions of the preceding RE.
- NOTE: the `re.M` flag enables multiline mode, in which `^` and `$` also match at the start and end of each line.

https://docs.python.org/3/library/re.html#re-syntax



- With `.` we match any single character.
- With `^` we match what the string starts with: `^m` matches an `m` at the start.
- With `$` we match at the end of the string.
- `*` matches between 0 and many repetitions; for example, `b*` matches a `b` that may appear zero times or many times.
- `+` matches something that appears at least once, possibly many times.
- `?` matches something that appears zero times or once.
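
The special characters above are easier to remember after seeing them side by side. This is a small sketch on made-up strings (the variable names are only for illustration), including the effect of `re.M` on the anchors.

```python
import re

text = "mat\nmoon\nbat"

# '.' matches any character except a newline
any_char = re.findall(r"m.", text)

# '^' anchors to the start of the string; with re.M it also matches after each newline
start_only = re.findall(r"^m\w*", text)
start_each_line = re.findall(r"^m\w*", text, flags=re.M)

# '$' anchors to the end; with re.M it also matches before each newline
end_each_line = re.findall(r"\w+t$", text, flags=re.M)

# Quantifiers apply to the preceding element, here the letter 'b'
zero_or_more = re.findall(r"ab*", "a ab abbb")  # ['a', 'ab', 'abbb']
one_or_more = re.findall(r"ab+", "a ab abbb")   # ['ab', 'abbb']
zero_or_one = re.findall(r"ab?", "a ab abbb")   # ['a', 'ab', 'ab']

print(any_char, start_only, start_each_line, end_each_line)
print(zero_or_more, one_or_more, zero_or_one)
```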

### Special Sequences:

- **Literals** `a` (matches that exact character)
- **Alternation** `a|b` (matches an `a` or a `b`)
- **Character sets** `[ab]`, `[^ab]` (matches an `a` or a `b`; the caret means neither `a` nor `b`)
- **Wildcards** `.` 
- **Escape special characters** `\` (`.`, `?`, `*`, `+`, `^`, `$`) (to match these symbols literally, put a backslash in front)
- **Ranges** `[a-d]`, `[1-9]` (we are giving a range)

- **Quantifiers** `{2}`, `{2,}`, `{2,4}`, `?`, `*`, `+` 

    Example: `[0-9]{2}` (without a quantifier the set matches one character at a time; with `{2}` it matches two consecutive digits from 0 to 9)

- **Grouping** `()` (captures a group that we can extract on its own later)


- **Anchors** `^`, `$`
- **Character classes** `\w`, `\d`, `\s`, `\n`, `\W`, `\D`, `\S`


**\w** - Matches any word character (letters, digits and the underscore `_`). Equivalent to `[a-zA-Z0-9_]` for ASCII text; with Python 3 `str` patterns it also matches Unicode letters such as `ó` unless the `re.ASCII` flag is set.

**\d** - Matches any digit. Equivalent to `[0-9]` for ASCII text.

**\s** - Matches any whitespace character, such as a space, a tab or a newline. Equivalent to `[ \t\n\r\f\v]` for ASCII text.

**\W** - Matches any non-word character. Equivalent to `[^a-zA-Z0-9_]` for ASCII text.

**\D** - Matches any non-digit. Equivalent to `[^0-9]` for ASCII text.

**\S** - Matches any non-whitespace character. Equivalent to `[^ \t\n\r\f\v]` for ASCII text.
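
To see sets, ranges, quantifiers and character classes in action, here is a short sketch on invented sample strings. It also shows the Unicode behaviour of `\w` in Python 3 and how the `re.ASCII` flag changes it.

```python
import re

sample = "Room 12, code 4471"

digits = re.findall(r"\d", sample)      # one digit per match
pairs = re.findall(r"\d{2}", sample)    # exactly two consecutive digits per match
runs = re.findall(r"\d{2,}", sample)    # two or more consecutive digits per match

in_range = re.findall(r"[a-d]", "abcdef")          # any letter from a to d
not_in_set = re.findall(r"[^ab]", "abcab")         # anything that is not a or b
either = re.findall(r"cat|dog", "cat, dog, bird")  # alternation

words = re.findall(r"\w+", "snake_case rocks!")    # the underscore counts as a word character
non_words = re.findall(r"\W", "hi!")               # everything \w does not match

# In Python 3, \w matches Unicode letters unless re.ASCII is set
unicode_word = re.findall(r"\w+", "Mónica")
ascii_word = re.findall(r"\w+", "Mónica", flags=re.ASCII)

print(digits, pairs, runs)
print(in_range, not_in_set, either)
print(words, non_words, unicode_word, ascii_word)
```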



### Methods

### re.sub(pattern, repl, string, count=0)
Replaces one or many matches with a string

In [3]:
txt = "gabriel, Mónica, Borja and Clara are TA's??"

In [6]:
#re.sub
#Literals
re.sub('g','G',txt) # Find the letter g and replace it with G. The pattern (the regex) is the first argument; the "G" is a plain replacement string, not a regex

"Gabriel, Mónica, Borja and Clara are TA's??"

In [7]:
#Ranges
re.sub(r'[A-Z]','',txt) # The r prefix marks a raw string (backslashes are kept as-is). This pattern has no backslashes, so it is equivalent to re.sub('[A-Z]','',txt)

"gabriel, ónica, orja and lara are 's??"

In [5]:
#Escape special character, quantifiers
re.sub(r'\?{2}','.',txt) # Escape the question mark to match it literally; {2} requires two in a row

"gabriel, Mónica, Borja and Clara are TA's."

In [8]:
re.sub(r'\?','.',txt) # Escape the question mark to match it literally

"gabriel, Mónica, Borja and Clara are TA's.."

In [9]:
txt = "gabriel, Mónica, Borja and Clara are TA's?"
re.sub(r'\?{2}','.',txt) # No match here, because there are not two question marks in a row

"gabriel, Mónica, Borja and Clara are TA's?"

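The signature above also has a `count` argument, and `repl` does not have to be a plain string. A minimal sketch on invented strings, showing `count`, a backreference in the replacement, and a function as replacement:

```python
import re

question = "Really? Sure? Yes?"

# count=1 replaces only the first match; the default count=0 replaces all of them
first_only = re.sub(r"\?", ".", question, count=1)

# \1 and \2 in the replacement refer to the captured groups
swapped = re.sub(r"(\w+)@(\w+)", r"\2 at \1", "user@host")

# repl can be a function that receives each match object
capitalised = re.sub(r"\b[a-z]", lambda m: m.group().upper(), "gabriel and clara")

print(first_only)
print(swapped)
print(capitalised)
```
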
### re.search(pattern, string, flags=0)
Scan through a string, looking for any location where this RE matches. If the search is successful, `re.search()` returns a match object. Otherwise, it returns `None`.

In [10]:
#re.search
txt = "The rain in Spain"
x = re.search("^The.*Spain$", txt) # To match, the string must start with The and end with Spain; .* covers whatever is in between
x

<re.Match object; span=(0, 17), match='The rain in Spain'>

In [7]:
y = re.search("^Hola.*", txt) # Starts with Hola, and .* is anything that comes after it
print(y)

None


In [8]:
txt = "The rain in Spain"
#\b matches at a word boundary
x = re.search(r"\bS\w+", txt) # \b is a word boundary: match a word that starts with S followed by one or more word characters
print(x)
print(x.span())
#returns a tuple containing the start-, and end positions of the match
print(x.start())
#contains the start position of the match
print(x.end())
#contains the end position of the match
print(x.string)
#print the string passed into the function (variable 'txt')
print(x.group())
#Print the part of the string where there was a match

<re.Match object; span=(12, 17), match='Spain'>
(12, 17)
12
17
The rain in Spain
Spain


In [9]:
print(re.search(r'r\w*', txt))
print(re.search(r'R\w*', txt))
print(re.search(r'^T\w*', txt))
print(re.search(r'^t\w*', txt))

<re.Match object; span=(4, 8), match='rain'>
None
<re.Match object; span=(0, 3), match='The'>
None
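
The `None` results above come from case sensitivity. The `flags` argument in the signature can relax that; this sketch reuses the same sentence with `re.IGNORECASE` (short form `re.I`).

```python
import re

txt = "The rain in Spain"

# Without a flag the pattern is case-sensitive, so 'R' does not match 'r'
no_flag = re.search(r"R\w*", txt)

# re.IGNORECASE makes letters match regardless of case
with_flag = re.search(r"R\w*", txt, flags=re.IGNORECASE)
start_flag = re.search(r"^t\w*", txt, flags=re.I)

print(no_flag)
print(with_flag.group())
print(start_flag.group())
```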


### re.match(pattern, string)
Determine if the RE matches at the **beginning** of the string.

In [12]:
#re.match
pattern = r"Cookie"
sequence = "I want a Cookie"
sequence2= "Cookie, I want you!"
if re.match(pattern, sequence): # the pattern has to match, and the match has to be at the beginning
    print("Match!")
else: 
    print("Not a match!")

Not a match!


In [66]:
txt = "The rain in Madrid"
#matches at the beginning of the string
print(re.match(r'r\w*', txt))
print(re.match(r'^r\w*', txt))
print(re.match(r'^T\w*', txt))
print(re.match(r'T\w*', txt))

None
None
<re.Match object; span=(0, 3), match='The'>
<re.Match object; span=(0, 3), match='The'>


In [12]:
email_address = 'Please contact us at: support@thebridge.com'
match = re.search(r'(\w+)@([\w\.]+)', email_address) # The first + requires at least one word character before the @. The brackets are a set of options: \w is any word character and \. is a literal dot. The final + requires at least one character after the @
if match:
    print(match.group()) # The whole matched text
    print(match.group(1)) # The username (group 1)
    print(match.group(2)) # The host (group 2)

support@thebridge.com
support
thebridge.com
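
Numbered groups get hard to read as patterns grow. As an alternative sketch, the same search can use named groups with the `(?P<name>...)` syntax; the names `user` and `host` are chosen here for illustration.

```python
import re

email_address = 'Please contact us at: support@thebridge.com'

# (?P<name>...) gives each group a name in addition to its number
match = re.search(r'(?P<user>\w+)@(?P<host>[\w.]+)', email_address)

# groupdict() returns all named groups as a dictionary
parts = match.groupdict() if match else {}

print(parts)
if match:
    print(match.group('user'))  # same value as match.group(1)
```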


### re.fullmatch(pattern, string)
Determine if the RE matches the **whole** string.

In [13]:
class_names = ["Karina", "Marina", "Isabel", "Gina", "Xinru", "Sonia", "Marycruz"]
for name in class_names:
    if re.fullmatch("Isabel", name):
        print(f"{name} is desired name")
    else:
        print(f"{name} is not desired name")

Karina is not desired name
Marina is not desired name
Isabel is desired name
Gina is not desired name
Xinru is not desired name
Sonia is not desired name
Marycruz is not desired name
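
The difference between `search`, `match` and `fullmatch` shows up when the name is only part of the string. A small comparison sketch on the invented name "Isabella":

```python
import re

name = "Isabella"

results = {
    "search_inner": bool(re.search("sabel", name)),         # found anywhere in the string
    "match_start": bool(re.match("Isabel", name)),          # found at the beginning
    "match_inner": bool(re.match("sabel", name)),           # not at the beginning
    "fullmatch_part": bool(re.fullmatch("Isabel", name)),   # 'la' is left over
    "fullmatch_all": bool(re.fullmatch(r"Isabel\w*", name)) # pattern covers the whole string
}

for check, outcome in results.items():
    print(check, outcome)
```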


### re.findall (pattern, string)
Finds all substrings where the RE matches and returns them as a list.

In [14]:
#re.findall
email_address = "Please contact us at: support.data@data-science.com, xyz@thebridge.com"

#'addresses' is a list that stores all the matches
addresses = re.findall(r'[\w\.]+@[\w\.-]+', email_address)
addresses

['support.data@data-science.com', 'xyz@thebridge.com']

In [20]:
#re.findall
email_address = "Please contact us at: support.data@data-science.com, xyz@thebridge.com"

#with groups in the pattern, 'addresses' is a list of tuples, one tuple of groups per match
addresses = re.findall(r'([\w\.]+)@([\w\.-]+)', email_address)
addresses

[('support.data', 'data-science.com'), ('xyz', 'thebridge.com')]

In [15]:
print(re.findall(r'[^aeiou\s]',email_address)) # every character that is neither a lowercase vowel nor whitespace

['P', 'l', 's', 'c', 'n', 't', 'c', 't', 's', 't', ':', 's', 'p', 'p', 'r', 't', '.', 'd', 't', '@', 'd', 't', '-', 's', 'c', 'n', 'c', '.', 'c', 'm', ',', 'x', 'y', 'z', '@', 't', 'h', 'b', 'r', 'd', 'g', '.', 'c', 'm']


In [17]:
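# Keep only the user part of each full address (the text before the @)
full_addresses = re.findall(r'[\w\.]+@[\w\.-]+', email_address)
lista = []
for a in full_addresses:
    lista.append(re.search(r"[\w\.-]+", a).group())
lista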


['support.data', 'xyz']

**What will the following two lines of code return?**



In [21]:
print(re.findall(r'\sc\w*', email_address))
print(re.findall(r'^P\w*', email_address))

[' contact']
['Please']


In [22]:
print(re.findall(r'c\w*', email_address))

['contact', 'cience', 'com', 'com']
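
`re.findall` returns only the matched text. When the position of each match is also needed, `re.finditer` yields match objects instead. A minimal sketch on the same sentence (the variable names are illustrative):

```python
import re

email_text = "Please contact us at: support.data@data-science.com, xyz@thebridge.com"

# finditer yields one match object per match, so text and position are both available
found = [(m.group(), m.start()) for m in re.finditer(r"[\w.]+@[\w.-]+", email_text)]

for address, position in found:
    print(f"{address} starts at index {position}")
```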


------------------------------------------------------------------------------

In [23]:
with open('info.txt', 'r') as file:
    client_info = file.read()

In [25]:
print(type(client_info))
client_info

<class 'str'>


'Wyoming Dudley sit.amet.metus@egestasnunc.ca 2518 Nulla Road 34-739-941-941 Aristotle Grant arcu.Aliquam.ultrices@vestibulumneceuismod.co.uk 8453 Nostra, St. 34-278-870-242 Zephania Copeland Nulla.eu.neque@Fuscealiquetmagna.com P.O. Box 733, 3179 Ligula. Av. 34-999-876-292 Scarlett Ortiz Nullam.velit@non.ca 6082 Massa Road 34-345-887-949 Ocean Bell In@gravidamolestiearcu.co.uk P.O. Box 370, 440 Suspendisse Rd. 34-905-089-682 Meghan Henson non.bibendum@ipsumdolorsit.edu Ap #393-8999 Maecenas Rd. 34-773-463-479 Cole Cantrell rutrum.Fusce.dolor@purusNullamscelerisque.ca Ap #897-8561 Vitae, Rd. 34-017-915-525 Leandra Shaw quam.quis@ac.net P.O. Box 345, 1982 Ipsum Road 34-274-204-840 Michelle Rollins Nulla.eu.neque@idmollis.com 2891 Eget St. 34-575-459-881 Pamela Webster lacus.Cras@quisaccumsan.net 1418 Non Avenue 34-249-358-256 Kieran Aguilar commodo.at@sit.org 393-8798 Phasellus Rd. 34-299-478-659 Harrison Bartlett libero.et.tristique@sodaleseliterat.com Ap #803-4228 Accumsan Rd. 34-094-

In [18]:
emails_clients=re.findall(r"[\w.]+@+[\w.]+", client_info)
print(emails_clients[:5])

['sit.amet.metus@egestasnunc.ca', 'arcu.Aliquam.ultrices@vestibulumneceuismod.co.uk', 'Nulla.eu.neque@Fuscealiquetmagna.com', 'Nullam.velit@non.ca', 'In@gravidamolestiearcu.co.uk']


In [26]:
client_numbers = re.findall(r"[0-9]{2}-\d{3}-\d{3}-\d{3}", client_info)

In [28]:
print(client_numbers[:5])

['34-739-941-941', '34-278-870-242', '34-999-876-292', '34-345-887-949', '34-905-089-682']


In [31]:
client_numbers = re.findall(r"\+?\d{2}-\d{3}-\d{3}-\d{3}", client_info) # There may or may not be a + in front of the first digits
client_numbers[:5]

['34-739-941-941',
 '34-278-870-242',
 '34-999-876-292',
 '34-345-887-949',
 '34-905-089-682']

In [32]:
client_numbers = re.findall(r"((00)?\+?\d{2}-\d{3}-\d{3}-\d{3})", client_info) # With groups in the pattern, findall returns one tuple per match: (whole number, optional 00 prefix)
client_numbers[:5]

[('34-739-941-941', ''),
 ('34-278-870-242', ''),
 ('34-999-876-292', ''),
 ('34-345-887-949', ''),
 ('34-905-089-682', '')]
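
The empty second element in each tuple comes from the capturing group `(00)?`. If only the full number is wanted, a non-capturing group `(?:...)` keeps the grouping without changing what `findall` returns. A sketch on a made-up string (the `00` and `+` prefixes are invented to exercise the optional parts):

```python
import re

# Invented sample text, not taken from info.txt
sample = "Call 34-739-941-941 or 0034-278-870-242 or +34-999-876-292"

# Capturing groups: findall returns a tuple of groups for every match
with_groups = re.findall(r"((00)?\+?\d{2}-\d{3}-\d{3}-\d{3})", sample)

# Non-capturing group: findall returns the whole match as a string
without_groups = re.findall(r"(?:00)?\+?\d{2}-\d{3}-\d{3}-\d{3}", sample)

print(with_groups)
print(without_groups)
```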

### re.split(pattern, string, maxsplit=0)
Returns a list where the string has been split at each match

In [35]:
#re.split
sente = "Hello,(a)\n Please, contact me the sooner.\n Thank you,\n Me"
sente

'Hello,(a)\n Please, contact me the sooner.\n Thank you,\n Me'

In [36]:
reg = re.split("\n ", sente)
reg

['Hello,(a)', 'Please, contact me the sooner.', 'Thank you,', 'Me']

In [37]:
"".join(reg)

'Hello,(a)Please, contact me the sooner.Thank you,Me'
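
Two more things `re.split` can do, sketched on the same sentence and on an invented list of fields: limit the number of splits with `maxsplit`, and split on several delimiters at once with a character set.

```python
import re

sente = "Hello,(a)\n Please, contact me the sooner.\n Thank you,\n Me"

# maxsplit=1 splits only at the first match and leaves the rest untouched
first_split = re.split(r"\n ", sente, maxsplit=1)

# A character set splits on either delimiter; \s* also swallows the spaces after it
fields = re.split(r"[,;]\s*", "a, b;c,  d")

print(first_split)
print(fields)
```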

### re.compile(pattern, flags=0)
Compiles a RE into a regular expression object.

In [42]:
name_check = re.compile(r"[^A-Za-z ]")

In [43]:
name = input("Please insert your name:")
while name_check.search(name):
    # loops while it finds a character that is not a letter or a space
    print("Please enter your name correctly!")
    name = input("Please insert your name:")
print("Finally mate, I thought you'd never do it")

Please enter your name correctly!
Finally mate, I thought you'd never do it
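
The loop above needs keyboard input, which makes it awkward to test. As a sketch, the same compiled pattern can sit inside a plain function; `is_valid_name` is a hypothetical helper name, and rejecting empty names is an extra assumption not present in the loop.

```python
import re

# Compiled once, reused on every call
name_check = re.compile(r"[^A-Za-z ]")

def is_valid_name(name):
    """Return True when name is non-empty and has only ASCII letters and spaces."""
    # search() returns None when no forbidden character is found
    return bool(name) and name_check.search(name) is None

for candidate in ["Ada Lovelace", "R2D2", "", "Mónica"]:
    print(repr(candidate), is_valid_name(candidate))
```

Note that the set `[A-Za-z ]` only covers ASCII letters, so an accented name such as "Mónica" is rejected by this pattern.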


-----------------------------------------------------------------------------------------------------------


## Some practice 
Now it's your turn.

#### Simple validation of a username
https://www.codewars.com/kata/56a3f08aa9a6cc9b75000023
    
Write a simple regex to validate a username. Allowed characters are:

- lowercase letters,
- numbers,
- underscore

Length should be between 4 and 16 characters (both included).


In [51]:
# your solution
username_check = re.compile(r"[^a-z\d_]") # any character that is not a lowercase letter, a digit or an underscore

In [57]:
def validate_usr(username):
    while username_check.search(username) or (len(username)< 4 or len(username)>16): # invalid if it has a forbidden character OR a wrong length
        print("Please enter your username correctly!")
        username = input("Please insert your username:")
    print("Finally mate, I thought you'd never do it")

validate_usr(input("Please enter your user name"))

Finally mate, I thought you'd never do it


In [62]:
def function_validateUsr(username):
  res = re.compile(r"[^a-z\d\_]")
  if res.search(username) or (len(username) < 4 or len(username) > 16):
    return False
  else:
    return True

function_validateUsr("azz999z___")



True

In [None]:
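def function_validateUsr(username):
  # One pattern checks both the allowed characters and the length (4 to 16)
  res = re.compile(r"[a-z0-9_]{4,16}")
  return bool(res.fullmatch(username))
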

#### Regex validate PIN code 

https://www.codewars.com/kata/55f8a9c06c018a0d6e000132

ATM machines allow 4 or 6 digit PIN codes and PIN codes cannot contain anything but exactly 4 digits or exactly 6 digits.

If the function is passed a valid PIN string, return true, else return false.

Examples:
```python
"1234"   -->  True
"12345"  -->  False
"a234"   -->  False
```

In [None]:
# your solution
import re
def validate_pin(pin):
    return bool(re.fullmatch(r"(\d{4}|\d{6})",pin))

In [None]:
def validate_pin(pin):
    #return true or false
    regex = re.compile(r'^\d{4}$|^\d{6}$')
    if regex.fullmatch(str(pin)):
        return True
    else:
        return False
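
One edge case worth knowing for this kata: `$` also matches just before a trailing newline, so a pattern anchored with `^...$` and used with `re.match` accepts `"1234\n"`. `re.fullmatch` does not. The sketch below uses a hypothetical function name, `validate_pin_strict`, and `[0-9]` instead of `\d` so that only ASCII digits are accepted.

```python
import re

def validate_pin_strict(pin):
    """Return True only for strings made of exactly 4 or exactly 6 ASCII digits."""
    return re.fullmatch(r"[0-9]{4}|[0-9]{6}", pin) is not None

# '$' matches before a trailing newline, so match() with ^...$ lets this through
anchored_accepts_newline = bool(re.match(r"^\d{4}$", "1234\n"))

for pin in ["1234", "12345", "a234", "123456", "1234\n"]:
    print(repr(pin), validate_pin_strict(pin))
print("anchored match accepts '1234\\n':", anchored_accepts_newline)
```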

-----------------------------------------------------------------------------------------------------------

# The FBI challenge

- https://www.fbi.gov/scams-and-safety/common-fraud-schemes/nigerian-letter-or-419-fraud
- https://www.kaggle.com/rtatman/fraudulent-email-corpus

It's your first day at the FBI office and your boss has just sent you a `txt` file, `emails.txt`. She asked you to run some analysis, but first of all you need to build a dataframe like the one shown at the end of this section. You'll need some Python knowledge and some regex for that goal.

---------------------------------------------------------------------------------------------------------------------------

#### Since we are good people, here you have a proposed solution

In [1]:
emails_info={}

In [2]:
with open("emails.txt", "r") as file:
    fh = file.read()
fh

f kin or relation that knows about the account=\n information died alongside with him at the plane crash leaving nobody behi=\nnd for the claim.</DIV>\n<DIV>&nbsp;</DIV>\n<DIV>It is therefore upon this discovery that I now decided to make this bu=\nsiness proposal to you and for the money to be release to you as the next o=\nf kin or relation to the deceased, since nobody will ever come for it, we d=\no not want this money to go into the treasury of the bank as unclaimed bill=\n or fund.</DIV>\n<DIV>&nbsp;</DIV>\n<DIV>the bank law and guideline here stipulates that if such money remained=\n unclaimed after six years, the money will be transferred into the bank tre=\nasury account as unclaimed fund. The request of foreigner as next of kin in=\n this transaction is occasioned by the fact that the customer was a foreign=\ner and a citizen cannot stand as next of kin to a foreigner.</DIV>\n<DIV>In subsequent disbursement of the money, I agree that 40% of this mone=\ny will be for you in re

In [3]:
fh.count("From r")

3977

In [5]:
import re
contents = re.split(r"From r", fh)
contents

one connecting \nVery good connecting with the famous stone companies as for growth up toge=\nther. \n\n\n\n\n\n\n\n------=_NextPart_001_0007_2C040077.840C0165\nContent-Type: text/html; charset="gb2312"\nContent-Transfer-Encoding: quoted-printable\n\n\n<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">\n<HTML>\n<HEAD>\n<META http-equiv=3DContent-Type content=3D"text/html; charset=3Dgb2312">\n<META content=3D"MSHTML 6.00.2600.0" name=3DGENERATOR>\n<STYLE></STYLE>\n</HEAD>\n<BODY bgColor=3D#ffffff leftMargin=3D10 bottomMargin=3D15 topMargin=3D15 r=\nightMargin=3D10>\n<DIV align=3Dleft><FONT size=3D2 face=3DArial color=3D#000000>Kind&nbsp;At=\ntn:&nbsp;CEO/&nbsp;Manager&nbsp;Director&nbsp;<BR></FONT></DIV>\n<DIV align=3Dleft><FONT size=3D2 face=3DArial color=3D#000000><BR></FONT><=\n/DIV>\n<DIV align=3Dleft><FONT size=3D2 face=3DArial color=3D#0000ff><STRONG><U>S=\nupply&nbsp;Quality&nbsp;China\'s&nbsp;EXCLUSIVE&nbsp;dimensions&nbsp;at&nbsp=\n;Unbeatable&nbsp;Price.<BR></U></S

In [71]:
contents[0]

''

In [72]:
contents.pop(0)

''

Fields to extract from each message:

- sender_email
- sender_name
- date_sent
- time_sent
- subject

### Info Sender

In [73]:
info_sender=[]
for e in contents:
    # re.search returns None when the message has no "From:" line
    found = re.search("From:.*", e)
    info_sender.append(found.group() if found else "not found")

In [74]:
len(info_sender)

3977

In [75]:
#sender_email
emails_info['sender_email']=[]
for line in info_sender:
    res=re.findall(r'[\w\.]+@[\w\.-]+', line)
    if res:
        emails_info['sender_email'].append(res[0])
    else:
        emails_info['sender_email'].append(np.nan)
        
len(emails_info['sender_email'])

3977

In [76]:
#sender name
emails_info['sender_name']=[]
for line in info_sender:
    res=re.findall(r':.*<', line)
    if res:
        emails_info['sender_name'].append(res[0][1:-1])
    else:
        emails_info['sender_name'].append(np.nan)
len(emails_info['sender_name'])

3977

### Info Dates

In [77]:
#DATES
dates=[]
for e in contents:
    # re.search returns None when the message has no "Date:" line
    found = re.search("Date:.*", e)
    dates.append(found.group() if found else "not found")
len(dates)

3977

In [78]:
#email date
emails_info['date_sent']=[]
for dat in dates:
    res=re.findall(r"\d+\s\w{3}\s\d+", dat)
    if res:
        emails_info['date_sent'].append(res[0])
    else:
        emails_info['date_sent'].append(np.nan)

len(emails_info['date_sent'])

3977

In [79]:
emails_info['time_sent']=[]
for dat in dates:
    res=re.findall(r"\d{2}:\d{2}:\d{2}", dat)
    if res:
        emails_info['time_sent'].append(res[0])
    else:
        emails_info['time_sent'].append(np.nan)

len(emails_info['time_sent'])

3977

### Subject

In [80]:
subject=[]
for e in contents:
    # re.search returns None when the message has no "Subject:" line
    found = re.search("Subject:.*", e)
    subject.append(found.group() if found else "not found")
len(subject)

3977

In [81]:
emails_info['subject']=[]
for sub in subject:
    res=re.findall(r":.*", sub)
    if res:
        emails_info['subject'].append(res[0][2:])
    else:
        emails_info['subject'].append(np.nan)

len(emails_info['subject'])

3977
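
The field-by-field loops above all follow the same shape: find a header line, then pull one piece out of it. As a compact sketch of that idea, the block below uses a hypothetical helper, `first_match`, on a single synthetic message (invented for illustration, not taken from `emails.txt`). The patterns are variations on the ones above with a capturing group added, and would need checking against the real corpus.

```python
import re

# Synthetic message, invented for illustration
sample_email = (
    'From: "Jane Doe" <jane.doe@example.com>\n'
    'Date: Thu, 31 Oct 2002 02:38:20 +0000\n'
    'Subject: Hello there\n'
    '\n'
    'Body text\n'
)

def first_match(pattern, text):
    """Return group 1 of the first match, or None when nothing matches."""
    match = re.search(pattern, text)
    return match.group(1) if match else None

# One entry per target column; .*? is lazy so the group starts as early as possible
record = {
    "sender_email": first_match(r"From:.*?([\w.-]+@[\w.-]+)", sample_email),
    "sender_name": first_match(r"From:\s*(.*?)\s*<", sample_email),
    "date_sent": first_match(r"Date:.*?(\d+\s\w{3}\s\d+)", sample_email),
    "time_sent": first_match(r"Date:.*?(\d{2}:\d{2}:\d{2})", sample_email),
    "subject": first_match(r"Subject:\s*(.*)", sample_email),
}

for field, value in record.items():
    print(field, "->", value)
```

A list of such dictionaries, one per message, can be passed to `pd.DataFrame` directly; missing fields come through as `None`.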

### Creating DataFrame

In [82]:
df=pd.DataFrame(emails_info)
df.isnull().sum()

sender_email    476
sender_name     837
date_sent       614
time_sent       618
subject          27
dtype: int64

In [83]:
df.head()

|   | sender_email | sender_name | date_sent | time_sent | subject |
|---|---|---|---|---|---|
| 0 | james_ngola2002@maktoob.com | "MR. JAMES NGOLA." | 31 Oct 2002 | 02:38:20 | URGENT BUSINESS ASSISTANCE AND PARTNERSHIP |
| 1 | bensul2004nng@spinfinder.com | "Mr. Ben Suleman" | 31 Oct 2002 | 05:10:00 | URGENT ASSISTANCE /RELATIONSHIP (P) |
| 2 | obong_715@epatra.com | "PRINCE OBONG ELEME" | 31 Oct 2002 | 22:17:55 | GOOD DAY TO YOU |
| 3 | obong_715@epatra.com | "PRINCE OBONG ELEME" | 31 Oct 2002 | 22:44:20 | GOOD DAY TO YOU |
| 4 | m_abacha03@www.com | "Maryam Abacha" | 1 Nov 2002 | 01:45:04 | I Need Your Assistance. |


### Now you can start your analysis!