# Regular Expressions 
A regular expression (also known as a regex) is a sequence of characters that defines a search pattern.

https://docs.python.org/3/library/re.html

https://regex101.com/



In [1]:
import re

## Functions

### re.findall(pattern, string, flags=0)
Returns a list of strings containing all non-overlapping matches of the pattern in the string.


In [2]:
juan = re.findall("Juan", "El miercoles fue el cumpleaños de Juan")
print(juan)

['Juan']


In [3]:
names_ages = "Adriana is 22 years old, Juan turned 23 years old on wednesday while Felipe is 30"
nombre = re.findall(r"[A-Z]+[a-z]*", names_ages)
print(nombre)

['Adriana', 'Juan', 'Felipe']


In [4]:
information = "Javier Perez: javier.perez@gmail.com, Patricia Lopez: patricia_lopez@hotmail.com, Rocio García: rocio-garcia@yahoo.es"


In [5]:
names = re.findall(r"[A-Z]+[a-z]* [A-Z]+[a-z]*", information)
print(names)

['Javier Perez', 'Patricia Lopez', 'Rocio Garc']
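
The last name in the output above is cut off at 'Garc' because [a-z] only covers unaccented ASCII letters, so the match stops at the í. One possible fix, shown here as a minimal sketch, is the class [^\W\d_] (a word character that is neither a digit nor an underscore), which matches any letter. The variable name names_unicode is only illustrative, and the pattern still assumes that every name is exactly two capitalized words.

```python
import re

information = (
    "Javier Perez: javier.perez@gmail.com, "
    "Patricia Lopez: patricia_lopez@hotmail.com, "
    "Rocio García: rocio-garcia@yahoo.es"
)

# [A-Z] requires an uppercase ASCII initial; [^\W\d_] matches any letter,
# accented or not, so the rest of each word is no longer cut at "í"
names_unicode = re.findall(r"[A-Z][^\W\d_]* [A-Z][^\W\d_]*", information)
print(names_unicode)
# ['Javier Perez', 'Patricia Lopez', 'Rocio García']
```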


In [6]:
emails = re.findall(r"[\w.-]+@+[\w.-]+", information)
print(emails)

['javier.perez@gmail.com', 'patricia_lopez@hotmail.com', 'rocio-garcia@yahoo.es']


In [7]:
with open('info.txt', 'r') as file:
    client_info = file.read()

In [8]:
emails_clients=re.findall(r"[\w.-]+@+[\w.-]+", client_info)
print(emails_clients[:5])

['sit.amet.metus@egestasnunc.ca', 'arcu.Aliquam.ultrices@vestibulumneceuismod.co.uk', 'Nulla.eu.neque@Fuscealiquetmagna.com', 'Nullam.velit@non.ca', 'In@gravidamolestiearcu.co.uk']


In [9]:
client_numbers = re.findall(r"[0-9]{2}-\d{3}-\d{3}-\d{3}", client_info)

In [10]:
print(client_numbers)

['34-739-941-941', '34-278-870-242', '34-999-876-292', '34-345-887-949', '34-905-089-682', '34-773-463-479', '34-017-915-525', '34-274-204-840', '34-575-459-881', '34-249-358-256', '34-299-478-659', '34-094-099-748', '34-236-498-114', '34-541-455-803', '34-274-768-546', '34-850-484-655', '34-193-830-599', '34-768-704-320', '34-960-058-312', '34-835-461-291', '34-524-499-405', '34-553-655-405', '34-193-752-726', '34-165-726-657', '34-172-146-895', '34-309-707-078', '34-289-368-945', '34-432-424-781', '34-880-153-396', '34-876-903-767', '34-508-574-378', '34-219-498-365', '34-413-279-781', '34-789-736-506', '34-701-997-370', '34-912-146-256', '34-550-871-297', '34-818-230-259', '34-707-183-700', '34-006-975-807', '34-975-336-347', '34-208-023-425', '34-810-185-675', '34-318-817-026', '34-229-857-982', '34-415-168-417', '34-595-803-021', '34-620-827-404', '34-711-768-527', '34-159-329-878', '34-619-616-824', '34-865-861-872', '34-294-644-638', '34-439-853-222', '34-215-852-041', '34-425-5

### re.split(pattern, string, maxsplit=0, flags=0)
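Splits the string at each match of the pattern and returns the resulting pieces as a list of strings. In the example below, the lookbehind (?<=...) matches the empty position right after each phone number, so the number stays attached to its record; splitting on an empty match requires Python 3.7 or later.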

In [11]:
client_list = re.split(r"(?<=[0-9]{2}-\d{3}-\d{3}-\d{3})",client_info)
print(client_list[:5])

['Wyoming Dudley sit.amet.metus@egestasnunc.ca 2518 Nulla Road 34-739-941-941', ' Aristotle Grant arcu.Aliquam.ultrices@vestibulumneceuismod.co.uk 8453 Nostra, St. 34-278-870-242', ' Zephania Copeland Nulla.eu.neque@Fuscealiquetmagna.com P.O. Box 733, 3179 Ligula. Av. 34-999-876-292', ' Scarlett Ortiz Nullam.velit@non.ca 6082 Massa Road 34-345-887-949', ' Ocean Bell In@gravidamolestiearcu.co.uk P.O. Box 370, 440 Suspendisse Rd. 34-905-089-682']
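
For comparison, here is a small sketch of re.split() with an ordinary separator pattern, with maxsplit, and with a capturing group. The sample string and variable names are made up for illustration and are not taken from info.txt.

```python
import re

# Hypothetical sample text with inconsistent separators
text = "apples, pears;oranges ,  kiwis"

# Split on a comma or semicolon, swallowing any surrounding whitespace
fields = re.split(r"\s*[,;]\s*", text)
print(fields)  # ['apples', 'pears', 'oranges', 'kiwis']

# maxsplit=1 stops after the first split; the remainder is left untouched
head_and_rest = re.split(r"\s*[,;]\s*", text, maxsplit=1)
print(head_and_rest)  # ['apples', 'pears;oranges ,  kiwis']

# A capturing group in the pattern keeps the separators in the result
with_separators = re.split(r"\s*([,;])\s*", "a, b;c")
print(with_separators)  # ['a', ',', 'b', ';', 'c']
```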


### re.sub(pattern, repl, string, count=0, flags=0)
Returns a copy of the string in which each match of the pattern is replaced by repl. The repl argument can be a string or a function that receives the match object and returns the replacement text.

#### re.subn(pattern, repl, string, count=0, flags=0)
Similar to re.sub(), except that it returns a tuple of two items: the new string and the number of substitutions made.

In [12]:
information = "Javier Perez: javier.perez@gmail.com, Patricia Lopez: patricia_lopez@hotmail.com, Rocio García: rocio-garcia@yahoo.es"
# Escape the dot so it matches a literal "." instead of any character
emails_corrected = re.sub(r"hotmail\.com", "gmail.com", information)
print(emails_corrected)

# A replacement string is inserted literally, so "[a-z]" cannot lowercase anything;
# pass a function that receives each match and returns its lowercase form
emails_lowercase = re.sub(r"[A-Z]", lambda m: m.group().lower(), information)
print(emails_lowercase)


Javier Perez: javier.perez@gmail.com, Patricia Lopez: patricia_lopez@gmail.com, Rocio García: rocio-garcia@yahoo.es
javier perez: javier.perez@gmail.com, patricia lopez: patricia_lopez@hotmail.com, rocio garcía: rocio-garcia@yahoo.es
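
re.subn() takes the same arguments as re.sub() but also reports how many replacements it made, which is handy for checking that a clean-up step actually changed something. The following is a minimal sketch; the names updated, n_changes, first_only and n_first, and the example.com domain, are illustrative.

```python
import re

information = (
    "Javier Perez: javier.perez@gmail.com, "
    "Patricia Lopez: patricia_lopez@hotmail.com, "
    "Rocio García: rocio-garcia@yahoo.es"
)

# Replace two different domains and count how many replacements were made
updated, n_changes = re.subn(r"hotmail\.com|yahoo\.es", "gmail.com", information)
print(n_changes)  # 2
print(updated)

# count=1 limits the work to the first match only
first_only, n_first = re.subn(r"@[\w.-]+", "@example.com", information, count=1)
print(n_first)  # 1
```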


## Methods and Match objects

**.span()** returns a tuple containing the start and end positions of the match  
**.string** returns the string passed into the function  
**.group()** returns the part of the string where there was a match  

### re.search(pattern, string, flags=0)  
Scans the string for the first location where the regex pattern produces a match. If the search succeeds, re.search() returns a match object; otherwise it returns None. Besides the pattern and the string, it accepts optional flags.  

In [13]:
ironhack = "Ironhack courses: Data Analytics, Web Development, UX UI Design "



In [14]:
courses = ["Statistics", "Data Analytics", "Economics", "Web Development"]
for course in courses:
    # re.escape() keeps any special characters in a course name literal
    if re.search(re.escape(course), ironhack):
        print(f"{course} is in Ironhack")
    else:
        print(f"{course} is not in Ironhack")

Statistics is not in Ironhack
Data Analytics is in Ironhack
Economics is not in Ironhack
Web Development is in Ironhack


In [15]:
re.search("Data Analytics", ironhack).span()

(18, 32)
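
The match object returned by re.search() exposes the attributes listed at the start of this section. A short sketch using the same ironhack string follows; because re.search() returns None when nothing matches, the result is checked before it is used. The variable names found and missing are illustrative.

```python
import re

ironhack = "Ironhack courses: Data Analytics, Web Development, UX UI Design "

found = re.search("Data Analytics", ironhack)
if found is not None:
    print(found.group())               # the matched text: Data Analytics
    print(found.span())                # (18, 32)
    print(found.start(), found.end())  # 18 32
    print(found.string == ironhack)    # True: the string that was searched

# No match: calling .span() on this result would raise AttributeError
missing = re.search("Statistics", ironhack)
print(missing)  # None
```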

### re.match(pattern, string, flags=0)
Returns a match object if zero or more characters at the beginning of the string match the pattern; otherwise it returns None. Unlike re.search(), it only looks at the start of the string.

In [16]:
class_names = ["Pepe", "Lucia", "Paula", "Roberto"]


In [17]:
for name in class_names:
    if re.match(r"^P", name):
        print(f"{name} in class")
    else:
        print(f"{name} not in class")

Pepe in class
Lucia not in class
Paula in class
Roberto not in class


### re.fullmatch(pattern, string, flags=0)
Returns a match object only if the entire string matches the pattern; otherwise it returns None.

In [18]:
class_names = ["Pepe", "Lucia", "Paula", "Roberto", "Maria", "Jose Maria"]
for name in class_names:
    if re.fullmatch("Maria", name):
        print(f"{name} is desired name")
    else:
        print(f"{name} is not desired name")

Pepe is not desired name
Lucia is not desired name
Paula is not desired name
Roberto is not desired name
Maria is desired name
Jose Maria is not desired name
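
The difference between re.match(), re.search() and re.fullmatch() is easiest to see by running the same pattern through all three. This is a small sketch; the candidate strings and the results dictionary are illustrative.

```python
import re

candidates = ["Maria", "Maria Jose", "Jose Maria"]

results = {}
for text in candidates:
    results[text] = {
        "match": bool(re.match("Maria", text)),          # only at the start
        "search": bool(re.search("Maria", text)),        # anywhere in the string
        "fullmatch": bool(re.fullmatch("Maria", text)),  # the whole string
    }
    print(text, results[text])

# Maria      -> match: True,  search: True, fullmatch: True
# Maria Jose -> match: True,  search: True, fullmatch: False
# Jose Maria -> match: False, search: True, fullmatch: False
```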


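### re.finditer(pattern, string, flags=0)
Returns an iterator that yields a match object for each non-overlapping match, in the order the matches are found.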

In [19]:
text = 'This is a text which we will test to see how many times is the word is in the text'
pattern = ' is '
for match in re.finditer(pattern, text):
    print(match.span())

(4, 8)
(55, 59)
(67, 71)
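
The pattern ' is ' relies on a space on both sides, so it would miss the word at the very start or end of the text or next to punctuation. A possible alternative, sketched here, uses the word boundary \b and collects the position and text of each match; the variable name positions is illustrative.

```python
import re

text = 'This is a text which we will test to see how many times is the word is in the text'

# \bis\b matches "is" only as a whole word, so the "is" inside "This" is skipped
positions = [(m.start(), m.group()) for m in re.finditer(r"\bis\b", text)]
print(positions)  # [(5, 'is'), (56, 'is'), (68, 'is')]
print(len(positions))  # 3
```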


### re.compile(pattern, flags=0)

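Compiles a regex into a regular expression object. The compiled object has its own search(), match(), findall() and similar methods, so the pattern can be reused without being passed again.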

In [20]:
# Inside square brackets, a leading ^ negates the set, so [^A-Za-z ] matches
# any character that is NOT an uppercase letter (A-Z), a lowercase letter (a-z)
# or a space (the space after z allows spaces in the name)

name_check = re.compile(r"[^A-Za-z ]")


In [21]:
name = input("Please insert your name:")
while name_check.search(name):
    print("Please enter your name correctly!")
    name = input("Please insert your name:")

Please insert your name:adriana.coca
Please enter your name correctly!
Please insert your name:Adriana_Coca
Please enter your name correctly!
Please insert your name:Adriana Coca
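
The same check can be written without input() so that it is easy to test. This sketch flips the logic: instead of searching for a forbidden character, it requires the whole string to consist of allowed characters with fullmatch(). The names VALID_NAME and is_valid_name are illustrative, and unlike the loop above this version also rejects an empty string.

```python
import re

# Compile once, reuse many times: one or more letters or spaces
VALID_NAME = re.compile(r"[A-Za-z ]+")

def is_valid_name(name):
    """Return True if name contains only ASCII letters and spaces."""
    return VALID_NAME.fullmatch(name) is not None

for candidate in ["adriana.coca", "Adriana_Coca", "Adriana Coca", ""]:
    print(repr(candidate), is_valid_name(candidate))
```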


## Summary of functions

**re.findall()** - Returns a list of all regex matches in a string  
**re.split()** - Returns a list split by pattern  
**re.sub()** - Returns a string with the pattern substituted  
**re.search()** - Scans a string for a regex match  
**re.match()** - Looks for a regex match at the beginning of a string  
**re.fullmatch()** - Looks for a regex match on an entire string  
**re.finditer()** - Returns an iterator that yields regex matches from a string  
**re.compile()** - Compiles a regex into a regular expression object  

## Sets

**[a-e]** - Matches any one of the characters a, b, c, d or e  

**[1-4]** - Matches any one of the digits 1, 2, 3 or 4  

**[^abc]** - Matches any character except a, b or c  

**[^0-9]** - Matches any character that is not a digit  
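
A quick sketch of the four sets above applied with re.findall(); the sample strings and variable names are made up for illustration.

```python
import re

in_a_to_e = re.findall(r"[a-e]", "regex")     # only the two "e" characters qualify
in_1_to_4 = re.findall(r"[1-4]", "0123456")   # digits 1 to 4
not_abc = re.findall(r"[^abc]", "aXbYc")      # everything except a, b, c
not_digit = re.findall(r"[^0-9]", "a1b2")     # everything except digits

print(in_a_to_e)   # ['e', 'e']
print(in_1_to_4)   # ['1', '2', '3', '4']
print(not_abc)     # ['X', 'Y']
print(not_digit)   # ['a', 'b']
```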

## Metacharacters

**^** - When added as the first character in a regex pattern, it means "the beginning of the searched string". When used as the first character between square brackets, it means "any character except"

**$** - Matches at the end of the string; used to check whether a string ends with a certain pattern

**\*** - Matches zero or more occurrences of the pattern to its left

**+** - Matches one or more occurrences of the pattern to its left

**?** - Matches zero or one occurrence of the pattern to its left

**{}** - Consider this code: {n,m}. This means at least n, and at most m repetitions of the pattern to its left

**|** - Used for alternation (or operator).

**()** - Used to group sub-patterns. For example, (a|b|c)xz matches any string that contains a, b or c followed by xz

**\\** - Used to escape characters, including all metacharacters, so that they are matched literally. For example, **\\$a** matches a dollar sign followed by a. Here, $ is not interpreted by the regex engine in a special way.

**.** - Matches any character except a newline ("\n")  

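If you are unsure whether a punctuation character has a special meaning, you can put \ in front of it. This makes sure the character is treated literally. Do not do this with letters or digits, because a backslash followed by one of them can form a special sequence such as \d or \b.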
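
The metacharacters above can be tried out on small strings. The following sketch uses made-up sample strings and variable names. Note that it writes the group as (?:a|b|c) because re.findall() returns only the group contents when the pattern contains a capturing group.

```python
import re

sample = "a ab abb"
star = re.findall(r"ab*", sample)       # zero or more b: ['a', 'ab', 'abb']
plus = re.findall(r"ab+", sample)       # one or more b:  ['ab', 'abb']
optional = re.findall(r"ab?", sample)   # zero or one b:  ['a', 'ab', 'ab']

# {2,3}: at least 2 and at most 3 digits per match
bounded = re.findall(r"\d{2,3}", "1 12 123 1234")    # ['12', '123', '123']

# Alternation inside a non-capturing group
grouped = re.findall(r"(?:a|b|c)xz", "axz bxz dxz")  # ['axz', 'bxz']

# ^ and $ anchor the pattern to the start and end of the string
starts = bool(re.search(r"^Hello", "Hello world"))   # True
ends = bool(re.search(r"world$", "Hello world"))     # True

# . matches any character except a newline
dot = re.findall(r"a.c", "abc a\nc a-c")             # ['abc', 'a-c']

# \$ matches a literal dollar sign; re.escape() adds the backslash for you
dollar = re.findall(r"\$a", "$a and a$")             # ['$a']
escaped = re.escape("$a")                            # '\\$a'

for value in (star, plus, optional, bounded, grouped, starts, ends, dot, dollar, escaped):
    print(value)
```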

## Special Sequences 

**\A** - Matches if the specified characters are at the start of a string.  
**\b** - Matches if the specified characters are at the beginning or end of a word.  
**\B** - Opposite of \b. Matches if the specified characters are not at the beginning or end of a word.  
**\d** - Matches any digit. Equivalent to [0-9] when the re.ASCII flag is used; for str patterns it also matches other Unicode decimal digits  
**\D** - Matches any non-digit character; the opposite of \d  
**\s** - Matches any whitespace character. Equivalent to [ \t\n\r\f\v] when the re.ASCII flag is used; for str patterns it also matches other Unicode whitespace  
**\S** - Matches any non-whitespace character; the opposite of \s  
**\w** - Matches any word character: letters, digits and the underscore _. Equivalent to [a-zA-Z0-9_] when the re.ASCII flag is used; for str patterns it also matches letters such as í or ñ  
**\W** - Matches any non-word character; the opposite of \w  
**\Z** - Matches if the specified characters are at the end of a string.  
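
A short sketch of the special sequences in action, including the difference between ^ and \A under re.MULTILINE and the effect of re.ASCII on \w. All sample strings and variable names are illustrative.

```python
import re

digits = re.findall(r"\d+", "Room 101, floor 3")          # ['101', '3']
pieces = re.split(r"\s+", "a  b\tc\nd")                   # ['a', 'b', 'c', 'd']
words = re.findall(r"\w+", "snake_case and kebab-case")   # ['snake_case', 'and', 'kebab', 'case']

# \b requires a word boundary, \B requires the absence of one
whole_word = re.findall(r"\bcat\b", "cat concat cat.")    # ['cat', 'cat']
inside_word = re.findall(r"\Bcat", "cat concat")          # ['cat'] (the one inside "concat")

# With re.MULTILINE, ^ matches at every line start, \A only at the string start
line_starts = re.findall(r"^\w+", "one\ntwo", flags=re.MULTILINE)     # ['one', 'two']
string_start = re.findall(r"\A\w+", "one\ntwo", flags=re.MULTILINE)   # ['one']

# For str patterns \w is Unicode-aware unless re.ASCII is given
unicode_word = re.findall(r"\w+", "García")                  # ['García']
ascii_word = re.findall(r"\w+", "García", flags=re.ASCII)    # ['Garc', 'a']

for value in (digits, pieces, words, whole_word, inside_word,
              line_starts, string_start, unicode_word, ascii_word):
    print(value)
```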

https://www.codewars.com/kata/56a3f08aa9a6cc9b75000023

https://www.codewars.com/kata/55f8a9c06c018a0d6e000132