# Python: Regular Expressions

Regular expressions are primarily used to search for specific patterns in a text or data file. The table below gives a quick overview of the most commonly used expressions:
![Table](regex.png) 

**TIPS:** 

Always use raw strings for regular expressions in Python, by adding an 'r' before the string (r"ciao")

The $ to indicate end of line should be inserted after the last character in the regex
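
As a minimal sketch of why these two tips matter, the snippet below compares a normal string with a raw string and shows the effect of the $ anchor. The sample sentence and the variable names (plain, raw) are made up for illustration.

```python
import re

# In a normal string, "\b" is a backspace character; in a raw string it stays
# backslash + b, which the regex engine reads as a word boundary.
plain = "\bcat\b"
raw = r"\bcat\b"

print(len(plain), len(raw))                    # 5 7
print(re.search(plain, "a cat sat"))           # None
print(re.search(raw, "a cat sat").group())     # cat

# $ anchors the match to the end of the string.
print(bool(re.search(r"sat$", "a cat sat")))   # True
print(bool(re.search(r"cat$", "a cat sat")))   # False
```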

#### The \
Using \ before a special character lets you search for the actual character, not its regex meaning (\. lets you search for a literal dot)

Using \w lets you search for alphanumeric characters (letters, digits and the underscore)

Using \d for digits

Using \b for word boundaries

Using \S+ (one or more non-whitespace characters) at the beginning and end allows you to search for words.
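
The escapes above can be tried out in a short sketch like the following. The sample text and email address are invented for illustration only.

```python
import re

# Made-up sample text for illustration
text = "Order 66 shipped to j.doe@example.com on 2021-03-04."

print(re.findall(r"\d+", text))       # ['66', '2021', '03', '04']
print(len(re.findall(r"\.", text)))   # 3 literal dots
print(re.findall(r"\bon\b", text))    # ['on'] (whole word only, not the "on" inside other words)
print(re.findall(r"\S+@\S+", text))   # ['j.doe@example.com']

# \w also matches the underscore, but not the hyphen
print(re.findall(r"\w+", "snake_case-name 42"))  # ['snake_case', 'name', '42']
```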

More information:

#### [Cheat Sheet](https://cheatography.com/davechild/cheat-sheets/regular-expressions/pdf/)

#### http://www.regex101.com
(Allows you to test your regex and debug any problems)


In [None]:
#You must first import the re module from the standard library
import re

#You can then use re.search(r'pattern', text) to see if a string matches a regular expression (regex)
result = re.search(r'aza', 'bazaar')
#The function returns a Match object describing the first match and its position
# (span=(1, 4), match='aza'), or None if nothing matches

#To ignore case
result = re.search(r'aza', 'BAzaar', re.IGNORECASE)

#Looking for lines starting with From: (the line below is a made-up sample value)
line = 'From: ada@example.com'
if re.search(r'^From:', line):
    print(line)

#Use re.findall() to find all portions of a string that match your regex (x is a made-up sample value)
x = 'From ada@example.com Sat Jan 5 09:14:16 2008'
y = re.findall(r'[0-9]+', x) # returns all numbers in the string x

#By adding parentheses to your pattern, you can define which part of each match you'd like to extract
y = re.findall(r'From (\S+@\S+)', x) # returns ['ada@example.com']

#### Character classes

You can specify a class of characters to search for by adding them in [] 

In [None]:
import re

x = re.search(r'[Bb]e',"Ciao Bella, bella donna")

#You can also add a range inside the brackets (like [a-zA-Z]).
y = re.search(r'[a-cA-C]e',"Ciao Bella, bella donna")

#You can also specify characters to exclude using [^abc]
z = re.search(r'[^B]a',"Ciao Bella, bella donna")

#The | (pipe) symbol allows us to make an either-or
a = re.findall(r'cat|dog',"I like my dog")

print(a)

#### Repetition qualifiers and greedy behavior
Adding a * means repeating the preceding character zero or more times (adding + means one or more times)

In [None]:
import re

z = re.search(r'C.*a',"Ciao Bella, bella donna")
#This returns the whole sentence, because it's greedy!
print(z)

w = re.search(r'C.*?a',"Ciao Bella, bella donna")
#This returns just Cia, because it's non-greedy!
print(w)

#To just find 5-10 letter words;
v = re.findall(r'\b[A-Za-z]{5,10}\b',"Ciao Bella, bella donna")
print(v)

#### Storing and using results

In [None]:
import re

#re.search returns a Match object. If your pattern contains capturing groups (parentheses),
# index 0 of the Match object is the full match, index 1 is the first group, index 2 the second, etc. Example:
result = re.search(r"^(\w*), (\w*)$", "LoveLace, Ada")
print(result[0], result[1], result[2], sep="\n")
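
Because re.search returns None when nothing matches, indexing the result directly can raise an error. A small sketch of one way to guard against that, using a hypothetical helper name (rearrange_name) and the same pattern as above:

```python
import re

def rearrange_name(name):
    """Turn 'Last, First' into 'First Last'; return the input unchanged if it does not match."""
    match = re.search(r"^(\w*), (\w*)$", name)
    if match is None:
        # No match: there are no groups to read
        return name
    # groups() returns all captured groups as a tuple
    last, first = match.groups()
    return f"{first} {last}"

print(rearrange_name("Lovelace, Ada"))  # Ada Lovelace
print(rearrange_name("Ada"))            # Ada
```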

#### Splitting and replacing
The re.split function splits a string wherever the regex matches

In [None]:
import re

result = re.split(r"[.,!?]","This is a test without punctuation in the result. Did you know?")
print(result)

result = re.split(r"[.,!?]","This is a test. Did you know?")
print(result)

#You can replace things with the re.sub function (\2 indicating the second captured group, etc.)
re.sub(r"^(\w*), (\w*)$",r"\2 \1","LoveLace, Ada")

#### Examples of REGEX in action

- to validate whether a text string is a valid variable name
- to validate whether a string is a valid website link
- to find matches independent of case and punctuation
- to find a list of matches
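
The first and third use cases in the list could be sketched as follows. The function name is_valid_variable is hypothetical, and the check is a simplification: it assumes ASCII-only names and does not exclude Python keywords.

```python
import re

def is_valid_variable(name):
    """Rough check: starts with a letter or underscore, then letters, digits or underscores."""
    return re.search(r"^[A-Za-z_][A-Za-z0-9_]*$", name) is not None

print(is_valid_variable("my_var1"))  # True
print(is_valid_variable("1st_var"))  # False
print(is_valid_variable("my-var"))   # False

# Find a word regardless of case and surrounding punctuation
matches = re.findall(r"\bbella\b", "Ciao Bella, bella donna!", re.IGNORECASE)
print(matches)  # ['Bella', 'bella']
```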

#### Example of validating a top-level website address:

In [None]:
import re
def check_web_address(text):
  pattern = r"^[A-Za-z0-9_-]+\.[A-Za-z0-9\._-]+$"
  result = re.search(pattern, text)
  return result is not None

print(check_web_address("gmail.com")) # True
print(check_web_address("www@google")) # False
print(check_web_address("www.Coursera.org")) # True
print(check_web_address("web-address.com/homepage")) # False
print(check_web_address("My_Favorite-Blog.US")) # True

#### Example of validating time

In [None]:
import re
def check_time(text):
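  # Hours 1-12, minutes 00-59, optional whitespace, then AM or PM.
  # The parentheses stop | from splitting the whole pattern into two alternatives.
  pattern = r"^(1[0-2]|[1-9]):[0-5][0-9]\s*(AM|PM)$"
  result = re.search(pattern, text, re.IGNORECASE)
  return result is not None

print(check_time("12:45pm")) # True
print(check_time("9:59 AM")) # True
print(check_time("6:60am")) # False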
print(check_time("five o'clock")) # False