In [2]:
import re
phoneRegEx = re.compile(r"\d\d\d\d\d\d\d\d\d\d\d")

Passing a string containing the regular expression to re.compile() returns a Regex pattern object.
A `\d` in a regex stands for a digit character. Sequences like `\d` have a special meaning in RegEx, so to keep Python from interpreting the backslash itself, we write patterns as raw strings: r'expression'. The RegEx `\d\d\d\d\d\d\d\d\d\d\d` matches an 11-digit Nigerian phone number. What happens when we want to match a 100-digit number? Do we write `\d` 100 times? RegEx offers a shorter form: adding 11 in braces (`{11}`) after a pattern means “Match this pattern eleven times.” So the shorter regex `\d{11}` also matches an 11-digit Nigerian phone number.
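As a quick check that the two spellings are interchangeable, the sketch below compiles both and searches the same string. The names `long_form` and `short_form` and the sample text are illustrative assumptions, not part of the original notebook.

```python
import re

# Eleven explicit \d tokens versus the {11} quantifier.
long_form = re.compile(r"\d\d\d\d\d\d\d\d\d\d\d")
short_form = re.compile(r"\d{11}")

sample = "Call 08057343365 today"  # hypothetical sample text

long_match = long_form.search(sample).group()
short_match = short_form.search(sample).group()

# Both patterns should find the same 11-digit run.
print(long_match == short_match)
```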

  

## MATCHING REGEX OBJECTS
The search() method searches a string for the RegEx pattern. It returns None if the string does not contain the pattern; otherwise it returns a match object for the first occurrence. The match object has a group() method that retrieves the text that matched. The findall() method returns a list of all non-overlapping occurrences in the string.

In [7]:
matchObject = phoneRegEx.search('You can reach me on 08057343365 or 08069690994')
print(f'The phone number found is {matchObject.group()}')

The phone number found is 08057343365


In [8]:
matchObject

<re.Match object; span=(20, 31), match='08057343365'>

In [5]:
matchObject = phoneRegEx.findall('You can reach me on 08057343365 or 07065432188')
print(f'The phone numbers found are {",".join(matchObject)}')

The phone numbers found are 08057343365,07065432188
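Because search() returns None when nothing matches, calling group() on the result without a check raises an AttributeError. A minimal sketch of a guarded lookup follows; the helper name `first_phone_number` is hypothetical.

```python
import re

phone_pattern = re.compile(r"\d{11}")

def first_phone_number(text):
    """Return the first 11-digit run in text, or None if there is none."""
    match = phone_pattern.search(text)
    # search() returns None when nothing matches, so check before calling group().
    return match.group() if match else None

found = first_phone_number("You can reach me on 08057343365")
missing = first_phone_number("No number here")
print(found)
print(missing)
```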


## GROUPING WITH PARENTHESES
Adding parentheses within a regex pattern creates a group. For example, we can use groups to separate the area code from the rest of the phone number. The first set of parentheses in a regex is group 1 and the second set is group 2. We can retrieve different parts of the matched text by passing the group number to the match object's group() method; passing 0 or no argument returns the entire match.

In [9]:
phoneNumRegEx = re.compile(r"(\d{3})-(\d{10})")
phoneMatchObject = phoneNumRegEx.search("For more enquiries, Please contact 234-8067564321")
print(f"Area code: {phoneMatchObject.group(1)}")
print(f"Phone Number: {phoneMatchObject.group(2)}")
print(phoneMatchObject.group(0))

Area code: 234
Phone Number: 8067564321
234-8067564321


In [13]:
phoneMatchObject.group(1)

'234'

Since parentheses have a special meaning in regex, we have to escape them with a backslash (`\(` and `\)`) when we need to match literal parentheses. The groups() method retrieves all groups at once as a tuple.

In [12]:
phoneNumRegEx = re.compile(r"(\(\d{3}\))-(\d{10})")
phoneMatchObject = phoneNumRegEx.search("For more enquiries, Please contact (234)-8067564321")
print(phoneMatchObject.group())

(234)-8067564321


In [14]:
area_code, number = phoneMatchObject.groups()
print(area_code)
print(number)

(234)
8067564321


The following characters `.  ^  $  *  +  ?  {  }  [  ]  \  |  (  )` have a special meaning in Regex. We must escape each of these characters with a backslash if we need to match it literally in our regex.
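When the text to match is known in advance, the standard library's re.escape() can add the backslashes for us. The sketch below is one way to use it; the variable names and the sample sentence are assumptions made for illustration.

```python
import re

literal = "(234)-8067564321"   # text we want to match exactly as written
escaped = re.escape(literal)   # backslash-escapes the special characters
pattern = re.compile(escaped)

match = pattern.search("Please contact (234)-8067564321 for enquiries")
print(escaped)
print(match.group())
```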

## THE PIPE CHARACTER
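The pipe character `|` is used to match one of several alternative expressions. If more than one alternative occurs in the string, the match that starts earliest in the string is returned. Note that spaces around the pipe are part of the alternatives: in the first example below, the pattern matches either `Brian ` (with a trailing space) or ` Thompson` (with a leading space), which is why the result begins with a space.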

In [26]:
import re
nameRegEx = re.compile(r"Brian | Thompson")
mo = nameRegEx.search('Bria Thompson is an amateur')
print(mo.group())

 Thompson


In [27]:
mo

<re.Match object; span=(4, 13), match=' Thompson'>

In [28]:
BrRegEx = re.compile(r"Br(own|ian|ead|eathe)")
BrMo = BrRegEx.search("Brown bread is a healthy alternative to white bread")
print(BrMo.group())

Brown
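Whitespace around the pipe is matched literally, which is easy to overlook. The following sketch compares a pattern written with spaces around `|` to one without; the variable names are illustrative.

```python
import re

text = "Brian Thompson is an amateur"

# Alternatives here are "Brian " (trailing space) and " Thompson" (leading space).
with_spaces = re.compile(r"Brian | Thompson")
# Alternatives here are exactly "Brian" and "Thompson".
without_spaces = re.compile(r"Brian|Thompson")

spaced_result = with_spaces.search(text).group()
plain_result = without_spaces.search(text).group()

# repr() makes the trailing space visible.
print(repr(spaced_result))
print(repr(plain_result))
```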


## THE ?, + and * OPERATORS
The `?` operator makes the preceding pattern optional, i.e., it matches zero or one instance of the pattern, so a match is returned whether or not the optional part is found in the string.  
The `*` operator matches zero or more instances of the preceding pattern.  
The `+` operator matches one or more instances of the preceding pattern.

In [41]:
phoneRegEx = re.compile(r"(\+\d{3}-)?(0)?\d{10}")
match_obj = phoneRegEx.finditer("Please call me on +234-8057574332 or 08037584333")
for mo in match_obj:
    print(mo.group())

+234-8057574332
08037584333


In [34]:
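# finditer() returns an iterator that can only be consumed once.
# If match_obj was already looped over above, this loop prints nothing.
for i in match_obj:
    print(i.group())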

In [43]:
genderRegEx = re.compile(r"(fe)*male")
genderMo = genderRegEx.search('Amaka is a female')
print(genderMo.group())
genderMo = genderRegEx.search('Ridwan is a male')
print(genderMo.group())

female
male


In [45]:
haRegEx = re.compile('(ha)+')
haMO = haRegEx.search('He laughed hahahahahahaha')
print(haMO.group())

hahahahahahaha
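To see the three operators side by side, the sketch below applies `?`, `*` and `+` to the same letter and checks which sample strings each pattern accepts. It uses the pattern method fullmatch(), which succeeds only if the whole string matches; the sample strings are made up for illustration.

```python
import re

samples = ["ct", "cat", "caaat"]

patterns = {
    "ca?t": re.compile(r"ca?t"),  # zero or one "a"
    "ca*t": re.compile(r"ca*t"),  # zero or more "a"
    "ca+t": re.compile(r"ca+t"),  # one or more "a"
}

# For each pattern, keep the samples that match in their entirety.
results = {
    name: [s for s in samples if pattern.fullmatch(s)]
    for name, pattern in patterns.items()
}

for name, matched in results.items():
    print(name, matched)
```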


## GREEDY AND NON GREEDY MATCHING

As seen before, we can match a regex pattern a fixed number of times by putting the count in curly braces `{}`. Besides a single number, we can also specify a range. For example, `(ha){3,5}` matches hahaha, hahahaha and hahahahaha. We can also leave the minimum or maximum unbounded by omitting the first or second number: `(ha){,3}` matches zero to three instances of ha and `(ha){5,}` matches five or more instances of ha.

In [50]:
haRegEx = re.compile(r"(ha){2,4}")
haMo = haRegEx.search("hahahaha")
print(haMo.group())

hahahaha


As seen in the code above, the group() method returned the longer possibility (4 instances of ha) instead of the shorter one (2 instances of ha). This is because regex quantifiers are greedy by default: they match as much text as they can. To get the shortest possible match instead, we put the `?` operator after the closing curly brace, which makes the quantifier non-greedy.

In [51]:
haRegEx = re.compile(r"(ha){2,4}?")
haMo = haRegEx.search("hahahaha")
print(haMo.group())

haha
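The difference also shows up with findall(), because each match consumes the text it covers. A small sketch, assuming a made-up digit string:

```python
import re

digits = "123456"

greedy = re.compile(r"\d{2,4}")   # takes up to four digits at a time
lazy = re.compile(r"\d{2,4}?")    # takes only the minimum of two digits

greedy_all = greedy.findall(digits)
lazy_all = lazy.findall(digits)

print(greedy_all)
print(lazy_all)
```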


## THE CARET `(^)` AND DOLLAR `($)` CHARACTERS

The caret (`^`) symbol is used at the start of a regex to indicate that the match must begin at the start of the searched string, while the dollar (`$`) character is used at the end of a regex to indicate that the match must end at the end of the searched string. We can use both together to indicate that the entire string must match the pattern.


In [54]:
caretRegEx = re.compile(r"^Dear")
caretMo = caretRegEx.search('Dear Brian,')
if caretMo:
    print(caretMo.group())

Dear


In [58]:
dollarRegEx = re.compile(r"Best wishes$")
dollarMo = dollarRegEx.search("......Best wishes")
print(dollarMo.group())

Best wishes


In [59]:
numberRegEx = re.compile(r"^(\d)+$")
numberMo = numberRegEx.search("123456789")
print(numberMo.group())

123456789
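Anchoring both ends is a common way to validate input rather than search inside it. The sketch below checks whether a string consists of exactly eleven digits; the helper name `is_phone_number` and the test strings are hypothetical.

```python
import re

phone_only = re.compile(r"^\d{11}$")

def is_phone_number(text):
    """Return True when text is exactly eleven digits and nothing else."""
    return phone_only.search(text) is not None

exact = is_phone_number("08057343365")          # eleven digits only
embedded = is_phone_number("Call 08057343365")  # extra text before the digits
too_long = is_phone_number("080573433650")      # twelve digits

print(exact, embedded, too_long)
```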


## THE WILDCARD CHARACTER

The dot (`.`) character is known as the wildcard in RegEx. It matches any single character except a newline. To match any number of such characters, we use dot-star (`.*`).

In [60]:
dotRegEx = re.compile(r".at")
dotMo = dotRegEx.findall('The fat cat sat on the flat mat')
print(dotMo)

['fat', 'cat', 'sat', 'lat', 'mat']


In [61]:
personRegEx = re.compile(r"Name: (.*) Email: (.*) Phone: (.*)")
personMo = personRegEx.finditer("Name: Brian Thomas Email: brianthomas@yahoo.com Phone: 08067543211")
for person in personMo:    
    print(person.group())

Name: Brian Thomas Email: brianthomas@yahoo.com Phone: 08067543211


`.*` matches everything except a newline. To make the dot match newline characters as well, we pass re.DOTALL as the second argument to the compile() method.

In [62]:
dotStarRegEx = re.compile(r".*", re.DOTALL)
mo = dotStarRegEx.search("The name of my dog is Bingo.\nIt loves eating bones.")
print(mo.group())

The name of my dog is Bingo.
It loves eating bones.
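For contrast, the sketch below runs the same pattern on the same text with and without re.DOTALL. The variable names are illustrative.

```python
import re

text = "The name of my dog is Bingo.\nIt loves eating bones."

# Without re.DOTALL the dot stops at the first newline.
default_match = re.compile(r".*").search(text).group()
# With re.DOTALL the dot also matches the newline, so the match spans both lines.
dotall_match = re.compile(r".*", re.DOTALL).search(text).group()

print(repr(default_match))
print(repr(dotall_match))
```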


## CASE INSENSITIVE MATCHING

RegEx is case sensitive by default. For case-insensitive matching, we pass re.IGNORECASE (or its short form re.I) as the second argument to the compile() method.

In [63]:
import re
caseRegEx = re.compile(r"Artificial Intelligence", re.I)
mo = caseRegEx.search('artificial IntelligEnce is a branch of computer science')
print(mo.group())

artificial IntelligEnce


## CHARACTER CLASS

 So far we have seen the `\d` character class. The common shorthand character classes and their meanings are:  
 
 |**CHARACTER CLASS**|**MEANING**|
 |-------------------|-----------|
 |`\d`|Any numeric digit between 0 and 9|
 |`\D`|Any character that is not a numeric digit|
 |`\w`|Any letter, numeric digit, or the underscore character. (Think of this as matching “word” characters.)|
 |`\W`|Any character that is not a letter, numeric digit, or the underscore character.|
 |`\s`|Any space, tab, or newline character. (Think of this as matching “space” characters.)|
 |`\S`|Any character that is not a space, tab, or newline.|



In [22]:
dateRegex = re.compile(r"(\d){1,2}\s(\w)+(\.)?,\s(\d){2,4}")
dateMo = dateRegex.search('I have a meeting on 8 March, 2020')
print(dateMo.group())

8 March, 2020
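To get a feel for the classes in the table, the sketch below runs several of them against one sentence with findall(). The sample text and variable names are assumptions made for illustration.

```python
import re

text = "Order 66 shipped on 8 March"

digit_runs = re.compile(r"\d+").findall(text)      # runs of digits
non_digit_runs = re.compile(r"\D+").findall(text)  # runs of anything that is not a digit
word_runs = re.compile(r"\w+").findall(text)       # runs of letters, digits or underscores
non_space_runs = re.compile(r"\S+").findall(text)  # runs of anything that is not whitespace

print(digit_runs)
print(non_digit_runs)
print(word_runs)
print(non_space_runs)
```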


## USER DEFINED CHARACTER CLASS

Shorthand character classes are broad. We can narrow the set of characters we want to match by defining our own character class using square brackets `[]`.

In [23]:
#matches vowels both lowercase and uppercase in the searched string
vowelsRegEx = re.compile(r"[aeiouAEIOU]")
vowelsMo = vowelsRegEx.findall("My friend went to the market to buy Oranges, Apples and Eggs.")
print(vowelsMo)

['i', 'e', 'e', 'o', 'e', 'a', 'e', 'o', 'u', 'O', 'a', 'e', 'A', 'e', 'a', 'E']


In [66]:
userCharRegEx = re.compile(r"[a-zA-Z0-9]+")
mo = userCharRegEx.findall("Password: User123")
print(mo)

['Password', 'User123']


The caret (`^`) symbol, when placed first inside a character class, creates a negative character class, which matches any character that is not in the class. Note that this includes spaces and punctuation, not only consonants, as the output below shows.

In [25]:
consonantsRegEx = re.compile(r"[^aeiouAEIOU]")
consonantsMo = consonantsRegEx.findall("My friend went to the market to buy Oranges, Apples and Eggs.")
print(consonantsMo)

['M', 'y', ' ', 'f', 'r', 'n', 'd', ' ', 'w', 'n', 't', ' ', 't', ' ', 't', 'h', ' ', 'm', 'r', 'k', 't', ' ', 't', ' ', 'b', 'y', ' ', 'r', 'n', 'g', 's', ',', ' ', 'p', 'p', 'l', 's', ' ', 'n', 'd', ' ', 'g', 'g', 's', '.']
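If only consonant letters are wanted, one option is to list the consonant ranges explicitly instead of negating the vowels. This is a sketch of that alternative, not the notebook's own approach; it treats `y` as a consonant and uses re.I to cover both cases.

```python
import re

# Letter ranges that skip a, e, i, o and u; re.I makes the class case-insensitive.
consonant_pattern = re.compile(r"[b-df-hj-np-tv-z]", re.I)

consonants = consonant_pattern.findall("Apples and Eggs.")
print(consonants)
```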


## THE SUB() METHOD

The sub() method replaces every match of the regex with a new string. It takes two arguments: the first is the replacement string and the second is the string to search. It returns the resulting string.

In [76]:
subRegex = re.compile(r"Inspector \w+")
subRegex.sub("Alice", "Inspector wale gave me a new shirt.")

'Alice gave me a new shirt.'
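Since sub() replaces every match, it is handy for masking sensitive values. The sketch below hides each 11-digit number in a message; the placeholder text `[hidden]` is an arbitrary choice.

```python
import re

phone_pattern = re.compile(r"\d{11}")
message = "You can reach me on 08057343365 or 08069690994"

# Every 11-digit run is replaced; the original string is left unchanged.
masked = phone_pattern.sub("[hidden]", message)
print(masked)
```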

## COMPLEX REGEX

Regex can be long and complicated. To keep such a regex readable, it is best to spread it across multiple lines with comments. Passing re.VERBOSE to re.compile() makes it ignore whitespace and comments inside the pattern.

In [68]:
phoneRegex = re.compile(r"""(\+\d{3}|\(\+\d{3}\))? #area code
                        (\s|-|\.)? #separator
                        (\d)? #optional leading 0
                        (\d{10})
                        """, re.VERBOSE)
mo = phoneRegex.search("Phone numbers: +234-8034342311")
#mo =  phoneRegex.search("Phone numbers:  (+234)8098765432")                       
#mo = phoneRegex.search("Phone numbers: 09087659987")  
print(mo.group())

+234-8034342311


In [69]:

mo = phoneRegex.findall("Phone numbers: +234 8034342311, (+234)8098765432, 09087659987")

for group in mo:
    phoneNumber = "".join([group[0], group[2], group[3]])
    print(phoneNumber)

+2348034342311
(+234)8098765432
09087659987


In [32]:
match_object = phoneRegex.finditer("Phone numbers: +234 8034342311, (+234)8098765432, 09087659987")
for match in match_object:
    print(match.group())

+234 8034342311
(+234)8098765432
 09087659987


## COMBINING RE.IGNORECASE, RE.VERBOSE AND RE.DOTALL

The second argument of re.compile() is a single flags value. Sometimes we want several options at once, for example matching every character including a newline while also ignoring case. To combine re.IGNORECASE, re.DOTALL and re.VERBOSE, we join them with the bitwise OR operator `|`.

In [70]:
regex = re.compile(r"foo.*",re.I|re.DOTALL)
mo = regex.search("Foo bar\nfOO BaR")
print(mo.group())

Foo bar
fOO BaR
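All three flags can be combined in the same way. The sketch below rewrites the pattern above in verbose form with comments; the layout of the pattern is an illustrative choice.

```python
import re

pattern = re.compile(
    r"""
    foo     # literal text, matched case-insensitively because of re.I
    .*      # everything after it, including newlines because of re.DOTALL
    """,
    re.I | re.DOTALL | re.VERBOSE,
)

match = pattern.search("Foo bar\nfOO BaR")
print(match.group())
```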


## EXERCISE

Write a program that retrieves all email addresses in the text variable below.

In [79]:
text = """ At RAIN, we have friendly course facilitators. Feel free to reach any of us on 
Ade: Aderinoye@gmail.com Aminat: Aminat123@yahoo.co.uk"""

emailRegex = re.compile(r"""( [a-zA-Z._0-9]+ #username
@ #at symbol
[a-zA-Z]+ #domain
[.a-zA-Z]{2,4} #dot something
[.a-zA-Z]{2,4}?)
""", re.VERBOSE)
mo = emailRegex.findall(text)
print(mo)

['Aderinoye@gmail.com', 'Aminat123@yahoo.co.uk']


Find website URLs that begin with http:// or https://


In [80]:
websiteRegex = re.compile(r"""(http://|https://)
[a-zA-Z0-9./_]+""",re.VERBOSE)
websites = "https://www.tutorialspoint.com/python/python_reg_expressions.htm  www.google.com http://utest.com"
website_list = websiteRegex.finditer(websites)
#print(website_list)
for website in website_list:
    print(website.group())

https://www.tutorialspoint.com/python/python_reg_expressions.htm
http://utest.com
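Since the two alternatives differ only by an optional `s`, the `?` operator gives a shorter way to write the same idea. This is one possible variant, using findall() on a pattern without groups so that it returns the full matches.

```python
import re

# "s?" makes the s optional, covering both http:// and https://.
url_pattern = re.compile(r"https?://[a-zA-Z0-9./_]+")

websites = "https://www.tutorialspoint.com/python/python_reg_expressions.htm  www.google.com http://utest.com"
urls = url_pattern.findall(websites)
print(urls)
```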
