# Regular expressions (regex) : love or hate?

![commit strip](http://www.commitstrip.com/wp-content/uploads/2014/02/Strips-Le-dernier-des-vrais-codeurs-650-finalenglsih.jpg)

Regular expressions are available in almost every programming language. They are a powerful tool for checking whether the content of a variable has the shape you expect.

For example, if you retrieve a phone number, you expect the variable to be composed of numbers and spaces (or dashes) but nothing more. 

Regular expressions can not only detect unwanted characters, they can also remove or replace them.
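
As a minimal sketch of both ideas, the snippet below first checks the shape of a phone number and then strips everything that is not a digit. The sample value `phone` is made up for illustration.

```python
import re

phone = "0475/12-34 56"  # hypothetical input, not taken from real data

# Shape check: the whole string may only contain digits, spaces or dashes.
is_clean = re.fullmatch(r"[0-9 -]+", phone) is not None

# Clean-up: replace every non-digit character with an empty string.
digits_only = re.sub(r"\D", "", phone)

print(is_clean)     # False, because of the "/"
print(digits_only)  # 0475123456
```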


**There are two ways to use regular expressions:**
* The first is to call a module-level function such as `re.match`, with the pattern as the first parameter and the string to analyze as the second parameter.
* The second is to compile the regex with `re.compile`, then call the methods of the resulting object on the string to analyze. This speeds up processing when the same regex is used several times.  

In [1]:
import re

In [14]:
pattern = "[ ]"
string = "I am fine ! There are still 6 months left :()"

# `re.findall` returns a list of all non-overlapping matches of the pattern in the string
# (an empty list if there are none). Here we count the spaces.
print(len(re.findall(pattern, string)))

10


In [9]:
pattern = "[ ]"
string = "I am fine ! There are still 6 months left :()"

# Cuts the string according to the occurrence of the pattern.
print(re.split(pattern, string))

['I', 'am', 'fine', '!', 'There', 'are', 'still', '6', 'months', 'left', ':()']


### A little syntax

    [xy]  A set of possible characters. Example: [abc] matches a, b or c.

    (x|y) A choice between alternatives. Example: (ps|ump) matches "ps" OR "ump".

    \d    A digit, which is equivalent to [0-9] for ASCII text.

    \D    Anything that is not a digit, which is equivalent to [^0-9].

    \s    A whitespace character, which is equivalent to [ \t\n\r\f\v].

    \S    Anything that is not a whitespace character, which is equivalent to [^ \t\n\r\f\v].

    \w    An alphanumeric character or underscore, which is equivalent to [a-zA-Z0-9_] for ASCII text.

    \W    Anything that is not an alphanumeric character or underscore, i.e. [^a-zA-Z0-9_].

    \     The escape character. It removes the special meaning of a reserved character so that it is matched literally. Example: \. matches a literal dot.
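
The classes above can be tried directly with `re.findall`. This is only a small sketch; the strings are invented for illustration.

```python
import re

text = "Room 42, price: 3.50 EUR (approx.)"  # made-up sample text

digits = re.findall(r"\d+", text)                  # runs of digits
vowels = re.findall(r"[aeiou]", "regex")           # any one character from the set
animals = re.findall(r"(cat|dog)", "cat, dog, cow")  # one alternative or the other
literal_dots = re.findall(r"\.", text)             # escaped dot: a literal "."
any_chars = re.findall(r".", "a.b")                # unescaped dot: any character

print(digits)        # ['42', '3', '50']
print(vowels)        # ['e', 'e']
print(animals)       # ['cat', 'dog']
print(literal_dots)  # ['.', '.']
print(any_chars)     # ['a', '.', 'b']
```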

### Let's try it.

If the result is not `None`, the pattern matched. `GREY` does indeed begin with `GR`, followed by one character, and end with `Y`.

In [15]:
print(re.match("GR(.)?Y", "GREY"))
# `.` means any character, and the `?` that follows makes it optional,
# so `(.)?` matches 0 or 1 character.

<re.Match object; span=(0, 4), match='GREY'>


In [16]:
pattern = "GR(.)?Y"
string = "GREY"

result = re.match(pattern, string)
print(result)

# It is equal to
compiled = re.compile(pattern)
result = compiled.match(string)
print(result)

<re.Match object; span=(0, 4), match='GREY'>
<re.Match object; span=(0, 4), match='GREY'>


In [19]:
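# In a loop, the compiled form is more convenient: compile once, reuse many times.
# `match` only looks at the beginning of the string, which is why "A GREY" gives None.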
pattern = "GR(.)?Y"
compiled = re.compile(pattern)
l = ["GREY 'S", "GRAY", "GREYISH", "A GREY"]

for elem in l:
    result = compiled.match(elem)
    print(elem, result)

GREY 'S <re.Match object; span=(0, 4), match='GREY'>
GRAY <re.Match object; span=(0, 4), match='GRAY'>
GREYISH <re.Match object; span=(0, 4), match='GREY'>
A GREY None


In the following, we search for specific expressions in a string.

In [20]:
print(re.findall("GR(.)?Y", "GREY"))
# When the pattern contains a capturing group, findall returns what the group captured:
# here, the single character matched by (.)? between GR and Y.

['E']


In [21]:
# Same idea with two groups: findall returns one tuple per match.
re.findall("G(.)?(.)?Y", "GREY")

[('R', 'E')]

To keep only numbers. 

In [31]:
# Only numbers
print(re.findall(r"([\d]+)", "Hello I live on the 7th floor of 220 street of sims"))
# "+" Means 1 or more characters

['7', '220']


And conversely, if you only want to keep the words. 

In [39]:
# Only words (\w also matches digits and underscores, hence '7th' and '220' in the result)
print(re.findall(r"[\w]+", "Hello I live on the 7th floor of 220 street of sims"))

['Hello', 'I', 'live', 'on', 'the', '7th', 'floor', 'of', '220', 'street', 'of', 'sims']


### Stop, let's recap!

Character | Meaning   
:-------------------------:|:-------------------------:
**.** | Refers to any character.
**^** | Indicates the beginning of the string.<br />For example, _^a_ matches _ab_ but not _ba_. 
**$** | Indicates the end of the string.<br />For example, _a$_ matches _ba_ but not _ab_. 
**?**| The previous element can appear zero or one time.<br /> For example, _ab?_ matches _ab_ and _a_.
**\***| The previous element can be repeated zero or more times. <br />For example, _ab\*_ matches _a_, _ab_, or _a_ followed by any number of _b_.
**+**| The previous element can be repeated one or more times. <br/>For example, _ab+_ matches an _a_ followed by at least one _b_.
**{n}**| Indicates that the previous element must be repeated exactly _n_ times.
**{n,m}**| Indicates that the previous element must be repeated between _n_ and _m_ times (no space after the comma).
**\w** | Matches any alphanumeric character or underscore; for ASCII text it is equivalent to _[a-zA-Z0-9\_]_.
**\W** | Matches everything that is not an alphanumeric character or underscore.
**\d** | It corresponds to any numeric character, i.e. it is equivalent to _[0-9]_.
**\D** | It corresponds to everything that is not a numeric character.
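
The quantifiers and anchors in this table can be checked against a few short strings. The helper `full_matches` and the list `candidates` below are illustrative names, not part of the `re` module.

```python
import re

candidates = ["a", "ab", "abb", "abbb"]


def full_matches(pattern):
    """Return the candidates that the pattern matches from start to end."""
    return [s for s in candidates if re.fullmatch(pattern, s)]


print(full_matches(r"ab?"))      # ['a', 'ab']
print(full_matches(r"ab*"))      # ['a', 'ab', 'abb', 'abbb']
print(full_matches(r"ab+"))      # ['ab', 'abb', 'abbb']
print(full_matches(r"ab{2}"))    # ['abb']
print(full_matches(r"ab{2,3}"))  # ['abb', 'abbb']

# Anchors: ^ ties the match to the start of the string, $ to the end.
starts_with_a = [s for s in ["ab", "ba"] if re.search(r"^a", s)]
ends_with_a = [s for s in ["ab", "ba"] if re.search(r"a$", s)]
print(starts_with_a)  # ['ab']
print(ends_with_a)    # ['ba']
```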

<img src="https://i.redd.it/nac35ntlfg831.jpg" width="400">


### Some useful resources
[Regex quickstart](http://www.rexegg.com/regex-quickstart.html): the Regex cheat sheet

[Dreambank Regex](http://www.dreambank.net/regex.html#examples): some examples of regex behaviour

[Pythex](https://pythex.org/): a real-time regular expression editor for Python, a quick way to test your regular expressions.  

[Regex101](https://regex101.com/): online regex editor and debugger. Regex101 allows you to create, debug, test and have your expressions explained for PHP, PCRE, Python, Golang and JavaScript. The website also features a community where you can share useful expressions.

##### And just for fun...
[Regex Crosswords](https://regexcrossword.com/): some crossword puzzles to test your Regex knowledge


#### How to check that the entered string is a number?

In [49]:
number = input("Your number : ")
if re.match("^[0-9]+$", number):
    print("The string entered is a number.")
else:
    print("The string entered is NOT a number.")

Your number :  543


The string entered is a number.


Another way

In [50]:
compiled = re.compile("^[0-9]+$")
if compiled.search(number) is not None:
    print("The string entered is a number.")
else:
    print("The string entered is NOT a number")

The string entered is a number.
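
The two cells above rely on `^` and `$` to cover the whole string. As a side-by-side sketch, here is how `match`, `search` and `fullmatch` behave with the same unanchored pattern; the candidate strings are made up.

```python
import re

digits = re.compile(r"[0-9]+")  # no ^ or $ on purpose

results = {}
for candidate in ["543", "543abc", "abc543"]:
    results[candidate] = (
        digits.match(candidate) is not None,      # must match at the beginning
        digits.search(candidate) is not None,     # may match anywhere
        digits.fullmatch(candidate) is not None,  # must cover the whole string
    )
    print(candidate, results[candidate])

# `$` also matches just before a trailing newline, `fullmatch` does not.
print(re.match(r"^[0-9]+$", "543\n") is not None)  # True
print(digits.fullmatch("543\n") is not None)       # False
```

`fullmatch` is therefore a convenient alternative when the entire input has to be validated.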


### Drill


**1. Create a regex that finds integers without size limit.**

In [69]:
s = "sssgdds86sfsf6s"
compiled = re.compile(r"\d+")
result = compiled.findall(s)
print(result)  # findall always returns a list (possibly empty), never None

['86', '6']


**2. Create a regex that finds negative integers without size limit.**

In [66]:
s = "sssgdds-8sfs4f-87s"
compiled = re.compile(r"-\d+")
result = compiled.findall(s)
print(result)  # findall always returns a list (possibly empty), never None

['-8', '-87']


**3. Create a regex that finds (positive or negative) integers without size limit.**

In [76]:
s = "sssg-6dds-8s8fsf87s"
compiled = re.compile(r"-?\d+")
result = compiled.findall(s)
print(result)  # findall always returns a list (possibly empty), never None

['-6', '-8', '8', '87']


**4. Capture all the numbers of the following sentence :**

In [97]:
text = "21 scouts and 3 tanks fought against 4,003 protestors, so the manager was not 100.00% happy."
searchfor = re.compile(r"[+-]?\d{1,3}(?:,\d{3})*(?:\.\d+)?")
result = searchfor.findall(text)
print(result)  # findall always returns a list (possibly empty), never None

['21', '3', '4,003', '100.00']
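
The pattern used in this drill is dense, so here is the same expression spelled out with `re.VERBOSE`, which lets a pattern carry whitespace and comments. The variable names are illustrative.

```python
import re

number = re.compile(
    r"""
    [+-]?          # optional sign
    \d{1,3}        # one to three leading digits
    (?:,\d{3})*    # zero or more groups made of a comma and three digits
    (?:\.\d+)?     # optional decimal part
    """,
    re.VERBOSE,
)

text = "21 scouts and 3 tanks fought against 4,003 protestors, so the manager was not 100.00% happy."
found = number.findall(text)
print(found)  # ['21', '3', '4,003', '100.00']

# Caveat of this pattern: a long number written without separators is cut into pieces.
print(number.findall("12345"))  # ['123', '45']
```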


**5. Find all words that end with 'ly'.**

In [109]:
text = "He had prudently disguised himself but was quickly captured by the police."
pattern = re.compile(r"\b\w+ly\b")
result = pattern.findall(text)
print(result)

['prudently', 'quickly']


**6. License plate number**  
A Belgian license plate consists of 1 digit (0, 1 or 2), a dash ('-'), 3 capital letters, a dash ('-') and finally 3 digits. Write a script to check that an input string is a valid license plate.  
If it's correct, print `"good"`. If it's not correct, print `"Not good"`.

In [123]:
plate = input("Enter your license plate number: ")
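# One digit among 0, 1 or 2, a dash, 3 capital letters, a dash, 3 digits.
pattern = re.compile(r"[0-2]-[A-Z]{3}-\d{3}")
# fullmatch requires the whole input to fit the pattern, not just a part of it.
if pattern.fullmatch(plate) is not None:
    print("good")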
else: 
    print("Not good")

Enter your license plate number:  1-AAA-PPP


Not good


**7. IPv4 address**  
An IPv4 address is composed of 4 numbers between 0 and 255 separated by '.'   
Write a script to verify that a string entered is a valid IPv4 address.

In [134]:
ip = input("Enter your IP address :")
pattern = re.compile(r"\b(25[0-5]|2[0-4][0-9]|1?[0-9][0-9]?)\."
                     r"(25[0-5]|2[0-4][0-9]|1?[0-9][0-9]?)\."
                     r"(25[0-5]|2[0-4][0-9]|1?[0-9][0-9]?)\."
                     r"(25[0-5]|2[0-4][0-9]|1?[0-9][0-9]?)\b")
# fullmatch rejects inputs with extra text around the address, such as 1.1.1.1.1
if pattern.fullmatch(ip) is not None:
    print("Valid IP address")
else:
    print("Invalid IP address")

Enter your IP address : 111.111.111.111


Valid IP address
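
Since the same octet expression appears four times, one possible variation is to write it once and repeat it with `{3}`. The names `OCTET`, `IPV4` and `is_ipv4` are hypothetical helpers introduced for this sketch.

```python
import re

# One number between 0 and 255, written as a non-capturing group.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|1?[0-9][0-9]?)"

# One octet, then three times "dot + octet". The doubled braces produce a literal {3}.
IPV4 = re.compile(rf"{OCTET}(?:\.{OCTET}){{3}}")


def is_ipv4(candidate):
    """Return True if the whole string looks like an IPv4 address."""
    return IPV4.fullmatch(candidate) is not None


for candidate in ["111.111.111.111", "255.255.255.255", "256.1.1.1", "1.1.1", "1.1.1.1.1"]:
    print(candidate, is_ipv4(candidate))
```

Only the first two candidates are accepted: `256` is out of range, and the last two have the wrong number of parts.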


**8. Valid Mail**  
An email is composed of alphanumeric characters followed by `@` and a domain name.  
Write a script that checks that the string entered by a user is a valid email address; if it is not, ask them to enter it again until it is valid.

In [168]:
pattern = re.compile(r"^[a-zA-Z0-9._-]+@[a-z.-]+\.[a-z]{2,3}$")
mail = input("Enter your email :")
# Keep asking until the input matches the pattern.
while pattern.search(mail) is None:
    print("Invalid e-mail")
    mail = input("Enter your email :")
print("Valid e-mail")

Enter your email : elsa@gm.com


Valid e-mail


**9. Valid Password**  
Write an additional script that checks the password, but only if the email is valid. The only requirement is that the password contains at least 6 characters.

In [157]:
password = input("Enter your password :")
if pattern.search(mail) is not None and len(password) >= 6:
    print('ok')
else:
    print('ko')
    

Enter your password : ytrezsdc


ok


**10. Valid Password bis**  
The password must now contain at least 6 characters AND  

- at least one lowercase letter AND 
- at least one uppercase letter AND 
- at least one number AND 
- at least one special character (among `$#@`).

In [180]:
mail = input("Enter your email :")
mail_pattern = re.compile(r"^[a-zA-Z0-9._-]+@[a-z.-]+\.[a-z]{2,3}$")
if mail_pattern.search(mail) is not None: 
    print("-> Valid e-mail")
else:
    print("-> Invalid e-mail")
    


if mail_pattern.search(mail):
    password = input("Enter your password :")
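    # One lookahead per rule (lowercase, uppercase, digit, special character),
    # then at least 6 characters in total.
    password_pattern = re.compile(r"^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[$#@]).{6,}$")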
    if password_pattern.search(password) is not None:
        print('-> ok')
    else:
        print('-> ko')

Enter your email : e@g.maaaa


-> Invalid e-mail
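
A single pattern built from lookaheads answers only "valid or not". As an alternative sketch, each rule can be checked with its own small regex, which makes it possible to say which rule failed. The names `RULES`, `COMBINED` and `missing_rules` are hypothetical.

```python
import re

# One small pattern per rule from the exercise statement.
RULES = {
    "at least 6 characters": re.compile(r".{6,}"),
    "a lowercase letter": re.compile(r"[a-z]"),
    "an uppercase letter": re.compile(r"[A-Z]"),
    "a digit": re.compile(r"\d"),
    "a special character among $#@": re.compile(r"[$#@]"),
}

# The same rules combined into one pattern with lookaheads.
COMBINED = re.compile(r"^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[$#@]).{6,}$")


def missing_rules(password):
    """Return the labels of the rules that the password does not satisfy."""
    return [label for label, rule in RULES.items() if rule.search(password) is None]


for password in ["ytrezsdc", "Abc1$x", "Ab1$"]:
    print(password, missing_rules(password), COMBINED.search(password) is not None)
```

A password is accepted by the combined pattern exactly when `missing_rules` returns an empty list for these samples.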


**11. Search by groups**  
It is possible to search by groups, and it is very powerful!  
`(?P<x>\w+)` captures a "group" named `x`; this group is composed of at least one (`+`) alphanumeric character (`\w`).

In [184]:
m = re.search(
    r"Welcome to (?P<where>\w+) ! You are (?P<age>\d+) years old ?",
    "Welcome to Olivier ! You are 32 years old ?",
)
print(m.group("where"))
print(m.group("age"))

Olivier
32


In [187]:
# Another Example
m = re.search(
    r"^(?P<who>\w*)[.]?(?P<who2>\w*)@(?P<operator>\w*)[.](?P<zone>\w*$)",
    "audrey.boulevart@benextcomapgny.com",
)
if m is not None:
    print(m.group("who"))
    print(m.group("who2"))
    print(m.group("operator"))
    print(m.group("zone"))

audrey
boulevart
benextcomapgny
com


Load the file `./data/mail.txt` and clean it with the regex. The goal is to retrieve the last name, first name, operator and zone, as in the previous example. Store each of those into their own separate list.

In [223]:
with open("./data/mail.txt") as file: 
    list_mail = file.readlines()
    #print(list_mail)
    
pattern = r"^(?P<name>\w+)[._-](?P<last_name>\w+)@(?P<operator>\w+)[.](?P<zone>\w{2,6})$"
pattern2 = r"^(?P<name>\w+)@(?P<operator>\w+)\.(?P<zone>\w{2,6})$"

for email in list_mail:
    research_email = re.search(pattern, email)
    research_email2 = re.search(pattern2, email)
    
    if research_email: 
        # Single quotes inside the f-string keep this line valid on Python versions before 3.12.
        print(f"Match found in {email} Name : {research_email.group('name').capitalize()}, surname : {research_email.group('last_name').capitalize()}")

    else: 
        print(f"No match found in {email}")

No match found in vogal-roger@mail.adlp.tech

Match found in aikin.joe@odul.xyz
 Name : Aikin, surname : Joe
No match found in moore@imail.gov

No match found in halknutson@email.xyz

No match found in alexnorquist@proton.com

No match found in matthewlulloff@mail.adlp.me

Match found in jenson-thomas@belgaximus.org
 Name : Jenson, surname : Thomas
No match found in mark4451@mail.bocode.gov

Match found in monte.hylan@belgaximus.net
 Name : Monte, surname : Hylan
No match found in dan80@mail.adlp.com

Match found in schmitt-steve@napster.org
 Name : Schmitt, surname : Steve
No match found in knutsondan@belgaximus.org

No match found in lepage@gaagle.xyz

Match found in chapman_ben@napster.hu
 Name : Chapman, surname : Ben
No match found in upson1544@youhoo.hu

No match found in 552966959@belgaximus.hu

Match found in soloman_ziegler@email.tech
 Name : Soloman, surname : Ziegler
Match found in ortiz.mark@napster.me
 Name : Ortiz, surname : Mark
Match found in ashwoon_hank@imail.io
 Name

In [236]:
# Open the file and read its contents line by line
with open("./data/mail.txt") as file:
    list_mail = file.readlines()  # This gives you a list of lines

# Define the corrected regex pattern
pattern = r"^(?P<name>\w+)[._-](?P<last_name>\w+)@(?P<operator>\w+)(?P<zone>(\.\w+)+)$"
pattern2 = r"^(?P<name>\w+)@(?P<operator>\w+)(?P<zone>(\.\w+)+)$"

# Loop through each line (email)
for email in list_mail:
    email = email.strip()  # Remove any extra whitespace or newline characters
    research_email = re.search(pattern, email)
    research_email2 = re.search(pattern2, email)
    
    if research_email:
        print(f"Match found for {email}: {research_email.groupdict()}")
    elif research_email2:
        print(f"Match found for {email}: {research_email2.groupdict()}")
    else:
        print(f"No match for: {email}")


Match found for vogal-roger@mail.adlp.tech: {'name': 'vogal', 'last_name': 'roger', 'operator': 'mail', 'zone': '.adlp.tech'}
Match found for aikin.joe@odul.xyz: {'name': 'aikin', 'last_name': 'joe', 'operator': 'odul', 'zone': '.xyz'}
Match found for moore@imail.gov: {'name': 'moore', 'operator': 'imail', 'zone': '.gov'}
Match found for halknutson@email.xyz: {'name': 'halknutson', 'operator': 'email', 'zone': '.xyz'}
Match found for alexnorquist@proton.com: {'name': 'alexnorquist', 'operator': 'proton', 'zone': '.com'}
Match found for matthewlulloff@mail.adlp.me: {'name': 'matthewlulloff', 'operator': 'mail', 'zone': '.adlp.me'}
Match found for jenson-thomas@belgaximus.org: {'name': 'jenson', 'last_name': 'thomas', 'operator': 'belgaximus', 'zone': '.org'}
Match found for mark4451@mail.bocode.gov: {'name': 'mark4451', 'operator': 'mail', 'zone': '.bocode.gov'}
Match found for monte.hylan@belgaximus.net: {'name': 'monte', 'last_name': 'hylan', 'operator': 'belgaximus', 'zone': '.net'}
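
The exercise asks for one list per field. Here is a possible sketch, assuming a handful of addresses copied from the outputs above instead of the real file, and a single pattern with an optional last-name group. This is a variation on the two patterns used above, not the notebook's own solution.

```python
import re

# A few addresses taken from the outputs above, standing in for ./data/mail.txt.
sample = [
    "vogal-roger@mail.adlp.tech",
    "aikin.joe@odul.xyz",
    "moore@imail.gov",
    "chapman_ben@napster.hu",
]

# [^\W_] means "a word character except the underscore", so that "_" can act
# as a separator between the two parts of the name.
pattern = re.compile(
    r"^(?P<name>[^\W_]+)(?:[._-](?P<last_name>[^\W_]+))?"
    r"@(?P<operator>\w+)(?P<zone>(?:\.\w+)+)$"
)

names, last_names, operators, zones = [], [], [], []
for email in sample:
    m = pattern.search(email.strip())
    if m is None:
        continue  # skip lines that do not look like an address
    names.append(m.group("name"))
    last_names.append(m.group("last_name"))  # None when there is no separator
    operators.append(m.group("operator"))
    zones.append(m.group("zone").lstrip("."))

print(names)
print(last_names)
print(operators)
print(zones)
```

Appending `None` for a missing last name keeps the four lists the same length, so index `i` always refers to the same address.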


**12. Another way of doing things.**

In [238]:
mail = "audrey.boulevart@benextcomapgny.com"
# Replace the dots with spaces, then separate the local part from the domain.
splitMail = mail.replace(".", " ").split("@")

firstName = []
name = []
ope = []
zone = []

print(splitMail)
firstName.append(splitMail[0].split()[0])
name.append(splitMail[0].split()[-1])
ope.append(splitMail[1].split()[0])
zone.append(splitMail[1].split()[-1])

firstName, name, ope, zone

['audrey boulevart', 'benextcomapgny com']


(['audrey'], ['boulevart'], ['benextcomapgny'], ['com'])

In [273]:
with open("./data/mail.txt") as file:
    list_mail = file.readlines() 

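# The lists must exist before the loop, and the appends must happen inside it;
# otherwise only the last address of the file is kept.
firstName = []
name = []
ope = []
zone = []

for email in list_mail:
    # Replace the separators with spaces, then separate the local part from the domain.
    split_mail = email.replace("-", " ").replace(".", " ").replace("_", " ").strip().split("@")
    if len(split_mail) != 2:
        continue  # skip blank or malformed lines
    firstName.append(split_mail[0].split()[0])
    name.append(split_mail[0].split()[-1])
    ope.append(split_mail[1].split()[0])
    zone.append(split_mail[1].split()[-1])

# First name taken from the last address of the file.
print(firstName[-1])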

jack


Repeat the previous exercise with this new approach and compare the length of your lists with those of the previous exercise.  
What do you notice?