# RegEx

A RegEx, or Regular Expression, is a sequence of characters that forms a search pattern.

RegEx can be used to check if a string contains the specified search pattern.

Python has a built-in module called re for working with regular expressions.

The module can compile patterns, search for matches, split strings, and substitute text.

### Functions

The re module offers a set of functions for searching a string for matches:

    re.findall()    Returns a list containing all matches
    re.search()     Returns a Match object if there is a match anywhere in the string
    re.split()      Returns a list where the string has been split at each match
    re.sub()        Replaces one or more matches with a string
    re.compile()    Compiles a regular expression pattern, returning a Pattern object
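
As a minimal sketch of how the module-level functions relate to a compiled pattern, the example below runs the same search both ways. The sample string and variable names are illustrative assumptions, not part of the original notebook.

```python
import re

sample = "rain in spain"  # hypothetical sample text

# Module-level call: the pattern string is passed on every call.
direct = re.findall("ai", sample)

# Compiled pattern: compile once, then call the same-named method on the Pattern object.
compiled = re.compile("ai")
reused = compiled.findall(sample)

print(direct)
print(direct == reused)
```

Compiling is mostly a convenience when the same pattern is reused many times, as in the later examples.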

In [49]:
import re
txt = "There is heavy rain in spain"

In [50]:
"""The findall() function returns a list containing all matches.
If no matches are found, an empty list is returned."""

x = re.findall("ai", txt)
print(x)

['ai', 'ai']


In [51]:
"""The search() function searches the string for a match, and returns a Match object if there is a match.
If there is more than one match, only the first occurrence of the match will be returned.
If no matches are found, the value None is returned."""

x = re.search(" ", txt)

print("The first white-space character is located in position:", x.start())

The first white-space character is located in position: 5


In [52]:
"""The split() function returns a list where the string has been split at each match.
You can control the number of occurrences by specifying the maxsplit parameter."""

x = re.split(" ", txt, maxsplit=2)
print(x)

['There', 'is', 'heavy rain in spain']


In [53]:
"""The sub() function replaces the matches with the text of your choice.
You can control the number of replacements by specifying the count parameter."""

x = re.sub(" ", "---", txt, count=2)
print(x)

There---is---heavy rain in spain


### Metacharacters:

Metacharacters are characters with a special meaning. They are the building blocks of regular expressions: each one defines part of the search criteria that a pattern uses to match character combinations in a string.

    Character   Description                                                   Example
    \d          A single digit (0-9)                                          \d matches 7, \d\d matches 77
    \w          A single word character (letter, digit, or underscore)        \w\w\w\w matches geek, \w\w\w matches only gee
    \W          A single non-word character, such as a symbol                 \W matches %
    []          A set of characters                                           "[a-m]", geek[sy] matches geeks or geeky
    \           Signals a special sequence (also escapes special characters)  "\d"
    .           Any character (except newline character)                      "he..o"
    ^           Starts with                                                   "^hello"
    $           Ends with                                                     "world$"
    *           Zero or more occurrences of the preceding element             "aix*", s* matches an empty string, s, ss, sss, ...
    +           One or more occurrences of the preceding element              "aix+", s+ matches s, ss, sss, ...
    ?           Zero or one occurrence of the preceding element               s? matches an empty string or s
    {m}         Exactly m occurrences of the preceding element                "al{2}", sd{3} matches sddd
    {m,n}       Between m and n occurrences of the preceding element          sd{2,3} matches sdd or sddd
    |           Either or                                                     "falls|stays"
    ()          Capture and group
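
The anchors, the dot, and alternation from the table can be checked with a short sketch. The sample strings are hypothetical and chosen only to show when each pattern does and does not match.

```python
import re

# Hypothetical sample strings, chosen only for illustration.
samples = ["hello", "help", "hello world", "heLLo"]

results = {}
for sample in samples:
    results[sample] = (
        bool(re.search(r"he..o", sample)),   # . stands for any one character except newline
        bool(re.search(r"^hello", sample)),  # ^ anchors the pattern at the start
        bool(re.search(r"world$", sample)),  # $ anchors the pattern at the end
    )

for sample, flags in results.items():
    print(sample, flags)

# | matches either the left or the right alternative.
either = re.findall(r"falls|stays", "The rain falls and stays")
print(either)
```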

### Special Sequence:

A special sequence is a \ followed by one of the characters in the list below, and has a special meaning:

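    \A	Beginning of the string
    \Z	End of the string

    \b	Word boundary (beginning or end of a word)
    \B	Not a word boundary

    \d	Digit
    \D	Not a digit

    \s	Whitespace
    \S	Not whitespace

    \w	Any word character (a-z, A-Z, 0-9, underscore)
    \W	Any non-word character (symbols, punctuation, whitespace)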
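
A small sketch of the position-based sequences, assuming a made-up sample string. It shows how \b and \B differ on the same text.

```python
import re

text = "rain in Spain"  # hypothetical sample text

starts_with_rain = re.search(r"\Arain", text) is not None   # \A: only at the very beginning
ends_with_spain = re.search(r"Spain\Z", text) is not None   # \Z: only at the very end

# "ain" never begins a word here, so \bain finds nothing,
# while \Bain finds every "ain" that sits inside a word.
at_word_start = re.findall(r"\bain", text)
inside_word = re.findall(r"\Bain", text)

print(starts_with_rain, ends_with_spain)
print(at_word_start, inside_word)
```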

### Set

A set is a set of characters inside a pair of square brackets [] with a special meaning:

    Set           Description
    [arn]         Returns a match where one of the specified characters (a, r, or n) is present
    [a-n]         Returns a match for any lower case character, alphabetically between a and n
    [^arn]        Returns a match for any character EXCEPT a, r, and n
    [0123]        Returns a match where any of the specified digits (0, 1, 2, or 3) is present
    [0-9]         Returns a match for any digit between 0 and 9

    [0-5][0-9]    Returns a match for any two-digit number from 00 to 59
    [a-zA-Z]      Returns a match for any character alphabetically between a and z, lower case OR upper case

    [+]           In sets, +, *, ., |, (), $, {} have no special meaning, so [+] means: return a match for any + character in the string
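
The set rules above can be tried out with a few one-liners. The input strings are hypothetical and kept short so the results are easy to follow.

```python
import re

# Hypothetical inputs, chosen only for illustration.
in_set = re.findall(r"[arn]", "rain")                    # characters that are a, r, or n
not_in_set = re.findall(r"[^arn]", "rain")               # characters that are NOT a, r, or n
minutes = re.findall(r"[0-5][0-9]", "08:45 or 99:75")    # two-digit values from 00 to 59
literal_plus = re.findall(r"[+]", "1+1=2")               # + is an ordinary character inside a set

print(in_set)
print(not_in_set)
print(minutes)
print(literal_plus)
```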

### Quantifiers

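    * - 0 or more occurrences
    + - 1 or more occurrences
    ? - 0 or 1 occurrence
    {3} - Exact number of occurrences
    {3,4} - Range of occurrences (Min,Max)

Alternation and grouping are not quantifiers, but they are often combined with them:

    | - Either or
    () - Group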
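
To see what each quantifier accepts, the sketch below tests a handful of candidate strings against the sd patterns used earlier. It relies on re.fullmatch (which requires the whole string to match) so that the counts are checked exactly; that function and the candidate list are choices made for this illustration, not part of the original notebook.

```python
import re

# Hypothetical candidates: "s" followed by zero to four "d" characters.
candidates = ["s", "sd", "sdd", "sddd", "sdddd"]

def accepted(pattern):
    """Return the candidates that the pattern matches in full."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

star = accepted(r"sd*")        # zero or more d
plus = accepted(r"sd+")        # one or more d
optional = accepted(r"sd?")    # zero or one d
exact = accepted(r"sd{3}")     # exactly three d
ranged = accepted(r"sd{2,3}")  # two or three d

for name, result in [("*", star), ("+", plus), ("?", optional), ("{3}", exact), ("{2,3}", ranged)]:
    print(name, result)
```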

### Match Object
A Match object contains information about the search and its result. If there is no match, None is returned instead of a Match object.

The Match object has properties and methods for retrieving information about the search and the result:

    .span()    returns a tuple containing the start and end positions of the match
    .string    returns the string passed into the function
    .group()   returns the part of the string where there was a match
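
Because a failed search returns None, calling .span() or .group() on the result without checking raises an AttributeError. The sketch below wraps the check in a helper; the name describe is hypothetical and used only for this illustration.

```python
import re

txt = "There is heavy rain in Spain"

def describe(pattern, text):
    """Return (span, matched text) for the first match, or None when nothing matches."""
    match = re.search(pattern, text)
    if match is None:
        return None
    return match.span(), match.group()

found = describe(r"\bS\w+", txt)      # a word starting with an upper case S
missing = describe(r"Portugal", txt)  # not present in txt

print(found)
print(missing)
```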

In [56]:
import re
txt = "There is heavy rain in Spain"

x = re.search("ai", txt)
print(x) #this will print an object

<re.Match object; span=(16, 18), match='ai'>


In [57]:
# the first word that starts with an upper case "S"

x = re.search(r"\bS\w+", txt)
print(x.span())

(23, 28)


In [58]:
# Print the string passed into the function

x = re.search(r"\bS\w+", txt)
print(x.string)

There is heavy rain in Spain


In [59]:
# the first word that starts with an upper case "S"

x = re.search(r"\bS\w+", txt)
print(x.group())

Spain


# EXAMPLES

### Finding Pattern Ex.1

In [60]:
import re

text_to_search = r'''
abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
1234567890
Ha HaHa
MetaCharacters (Need to be escaped):
. ^ $ * + ? { } [ ] \ | ( )
coreyms.com
321-555-4321
321-555-4321123
123.555.1234
123*555*12
700*788*7785
800-555-1234
900-555-1234
Mr. Sachin
Mr Kevin
Ms David
Mrs. Raonsla
Mr. Tiago
enjoying@gmail.com
'''

In [63]:
pattern = re.compile(r'\d\d\d.\d\d\d.\d\d\d\d')  # the unescaped . matches any separator character
matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

<re.Match object; span=(151, 163), match='321-555-4321'>
<re.Match object; span=(164, 176), match='321-555-4321'>
<re.Match object; span=(180, 192), match='123.555.1234'>
<re.Match object; span=(204, 216), match='700*788*7785'>
<re.Match object; span=(217, 229), match='800-555-1234'>
<re.Match object; span=(230, 242), match='900-555-1234'>


In [12]:
#pattern = re.compile(r'abc')            # literal text
#pattern = re.compile(r'.')              # any character except newline
#pattern = re.compile(r'\.')             # a literal period
#pattern = re.compile(r'coreyms\.com')   # a literal domain name
#pattern = re.compile(r'\d')             # digit
#pattern = re.compile(r'\d\d\d')         # three consecutive digits
#pattern = re.compile(r'\D')             # non-digit
#pattern = re.compile(r'\w')             # word character
#pattern = re.compile(r'\W')             # non-word character
#pattern = re.compile(r'\s')             # whitespace: space, tab, newline
#pattern = re.compile(r'\S')             # not whitespace

In [13]:
#pattern = re.compile(r'\bHa')       # word boundary
#pattern = re.compile(r'\BHa')       # not a word boundary
#pattern = re.compile(r'^Tata')      # beginning of a string
#pattern = re.compile(r'Tata$')      # end of a string
#pattern = re.compile(r'[a-z]')      # any lower case letter
#pattern = re.compile(r'[a-zA-Z]')   # any letter
#pattern = re.compile(r'[1-9]')      # any digit from 1 to 9
#pattern = re.compile(r'[^a-z]')     # anything except a lower case letter
#pattern = re.compile(r'[^b]at')     # "at" preceded by any character except b

In [14]:
#pattern = re.compile(r'\d\d\d.\d\d\d.\d\d\d\d')  # the same pattern written with quantifiers:
pattern = re.compile(r'\d{3}.\d{3}.\d{4}')

In [64]:
#pattern = re.compile(r'Mr\.')  # \. matches a literal full stop
pattern = re.compile(r'Mr\.?\s[A-Z]\w*')  # the ? makes the full stop optional
matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

<re.Match object; span=(243, 253), match='Mr. Sachin'>
<re.Match object; span=(254, 262), match='Mr Kevin'>
<re.Match object; span=(285, 294), match='Mr. Tiago'>


In [65]:
#Use of groups
pattern = re.compile(r'M(r|s|rs)\.?\s[A-Z]\w*')
matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

<re.Match object; span=(243, 253), match='Mr. Sachin'>
<re.Match object; span=(254, 262), match='Mr Kevin'>
<re.Match object; span=(263, 271), match='Ms David'>
<re.Match object; span=(272, 284), match='Mrs. Raonsla'>
<re.Match object; span=(285, 294), match='Mr. Tiago'>


### Finding Pattern Ex.2

In [66]:
import re
emails="""
CandySofty@gmail.com
candy.softy@univrsity.edu
candy-321-softy@my-work.net
"""
pattern = re.compile(r'[a-zA-Z]+@[a-z]+\.com') #Compile a regular expression pattern, returning a Pattern object.
matches = pattern.finditer(emails) #Return an iterator over all non-overlapping matches for the RE pattern 
for match in matches:
    print(match)

<re.Match object; span=(1, 21), match='CandySofty@gmail.com'>


In [67]:
pattern = re.compile(r'[a-zA-Z.]+@[a-z]+\.(com|edu)')
matches = pattern.finditer(emails) 
for match in matches:
    print(match)

<re.Match object; span=(1, 21), match='CandySofty@gmail.com'>
<re.Match object; span=(22, 47), match='candy.softy@univrsity.edu'>


In [68]:
pattern = re.compile(r'[a-zA-Z0-9.-]+@[a-zA-Z-]+\.(com|edu|net)')
matches = pattern.finditer(emails) 
for match in matches:
    print(match)

<re.Match object; span=(1, 21), match='CandySofty@gmail.com'>
<re.Match object; span=(22, 47), match='candy.softy@univrsity.edu'>
<re.Match object; span=(48, 75), match='candy-321-softy@my-work.net'>


### Finding Pattern Ex.3 : How to match and find Regex in groups

In [69]:
import re
urls="""
https://www.google.com
http://candy.com
https://youtube.com
https://www.nasa.gov
"""
#pattern=re.compile(r'https?://(www\.)?\w+\.\w+') OR
pattern=re.compile(r'https?://(www\.)?(\w+)(\.\w+)')
matches=pattern.finditer(urls)

for match in matches:
    print(match.group(0))

https://www.google.com
http://candy.com
https://youtube.com
https://www.nasa.gov


In [70]:
pattern=re.compile(r'https?://(www\.)?(\w+)(\.\w+)')
matches=pattern.finditer(urls)

for match in matches:
    print(match.group(1))

www.
None
None
www.


In [71]:
pattern=re.compile(r'https?://(www\.)?(\w+)(\.\w+)')
matches=pattern.finditer(urls)

for match in matches:
    print(match.group(2))

google
candy
youtube
nasa


In [72]:
pattern=re.compile(r'https?://(www\.)?(\w+)(\.\w+)')
matches=pattern.finditer(urls)

for match in matches:
    print(match.group(3))

.com
.com
.com
.gov


In [75]:
pattern=re.compile(r'https?://(www\.)?(\w+)(\.\w+)')
subbed_urls = pattern.sub(r"\2\3", urls)

print(subbed_urls)


google.com
candy.com
youtube.com
nasa.gov
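
The three groups can also be read in a single call with the Match object's groups() method, which returns them as a tuple. This is a minimal sketch reusing the same pattern; the variable names are assumptions made for illustration.

```python
import re

pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')

# groups() returns every captured group at once; an optional group
# that did not take part in the match is reported as None.
with_www = pattern.search("https://www.nasa.gov").groups()
without_www = pattern.search("http://candy.com").groups()

print(with_www)
print(without_www)

prefix, domain, tld = with_www
print(domain + tld)
```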



In [55]:
import re

text_to_search=r"""
abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
123-555-1234
321.555.4321
231*569*778
Ha HaHa
MetaCharacters (Need to be escaped):
. ^ $ * + ? {} [] \ | ()
coreyms.com
Mr. Schafer
Mr Davis
Ms Smith
Mrs. Robinson
Mr. T

cat mat pat
"""

sentence="Start a sentence and then bring it to an end"

In [50]:
pattern = re.compile(r"\d{3}.\d{3}.\d{4}")
matches=pattern.findall(text_to_search)
for match in matches:
    print(match)

123-555-1234
321.555.4321


In [56]:
pattern = re.compile(r"Start")
matches=pattern.match(sentence) #Searches for the match only at start of string
print(matches)

<re.Match object; span=(0, 5), match='Start'>


In [57]:
pattern = re.compile(r"sentence")
matches=pattern.match(sentence)
print(matches)

None


In [59]:
pattern = re.compile(r"sentence")
matches=pattern.search(sentence) #Searches throughout the string for the match
print(matches)

<re.Match object; span=(8, 16), match='sentence'>


#### Flags

In [63]:
import re
sentence="Start a sentence and then bring it to an end"

pattern=re.compile(r"start",re.IGNORECASE)
#OR pattern=re.compile(r"start",re.I)

matches=pattern.search(sentence)
print(matches)

<re.Match object; span=(0, 5), match='Start'>
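
The effect of the flag is easiest to see by running the same pattern with and without it. The sketch below uses a short hypothetical string and passes the flag by keyword.

```python
import re

text = "Ha HaHa"  # hypothetical sample text

case_sensitive = re.findall(r"ha", text)                           # lower case "ha" never occurs
case_insensitive = re.findall(r"ha", text, flags=re.IGNORECASE)    # matches regardless of case
short_form = re.findall(r"ha", text, flags=re.I)                   # re.I is an alias for re.IGNORECASE

print(case_sensitive)
print(case_insensitive)
print(short_form == case_insensitive)
```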


### Finding pattern Ex.4

In [76]:
import re

text_to_search = r'''
abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
1234567890
Ha HaHa
MetaCharacters (Need to be escaped):
. ^ $ * + ? { } [ ] \ | ( )
coreyms.com
321-555-4321
321-555-4321123
123.555.1234
123*555*12
700*788*7785
800-555-1234
900-555-1234
Mr. Sachin
Mr Kevin
Ms David
Mrs. Raonsla
Mr. Tiago
enjoying@gmail.com
'''
sent = "she sells the sea shells on the sea shore @$12 123.555.1234"

exp = r'.'
print (re.findall(exp, sent))

['s', 'h', 'e', ' ', 's', 'e', 'l', 'l', 's', ' ', 't', 'h', 'e', ' ', 's', 'e', 'a', ' ', 's', 'h', 'e', 'l', 'l', 's', ' ', 'o', 'n', ' ', 't', 'h', 'e', ' ', 's', 'e', 'a', ' ', 's', 'h', 'o', 'r', 'e', ' ', '@', '$', '1', '2', ' ', '1', '2', '3', '.', '5', '5', '5', '.', '1', '2', '3', '4']


In [19]:
exp1 = r'\d'
print (re.findall(exp1, sent))

['1', '2', '1', '2', '3', '5', '5', '5', '1', '2', '3', '4']


In [20]:
exp2 = r'\D'
print (re.findall(exp2, sent))

['s', 'h', 'e', ' ', 's', 'e', 'l', 'l', 's', ' ', 't', 'h', 'e', ' ', 's', 'e', 'a', ' ', 's', 'h', 'e', 'l', 'l', 's', ' ', 'o', 'n', ' ', 't', 'h', 'e', ' ', 's', 'e', 'a', ' ', 's', 'h', 'o', 'r', 'e', ' ', '@', '$', ' ', '.', '.']


In [21]:
exp3 = r'\w'
print (re.findall(exp3, sent))

['s', 'h', 'e', 's', 'e', 'l', 'l', 's', 't', 'h', 'e', 's', 'e', 'a', 's', 'h', 'e', 'l', 'l', 's', 'o', 'n', 't', 'h', 'e', 's', 'e', 'a', 's', 'h', 'o', 'r', 'e', '1', '2', '1', '2', '3', '5', '5', '5', '1', '2', '3', '4']


In [22]:
exp4 = r'\W'
print (re.findall(exp4, sent))

[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '@', '$', ' ', '.', '.']


In [23]:
exp5 = r'\s'
print (re.findall(exp5, sent))

[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']


In [24]:
exp6 = r'\S'
print (re.findall(exp6, sent))

['s', 'h', 'e', 's', 'e', 'l', 'l', 's', 't', 'h', 'e', 's', 'e', 'a', 's', 'h', 'e', 'l', 'l', 's', 'o', 'n', 't', 'h', 'e', 's', 'e', 'a', 's', 'h', 'o', 'r', 'e', '@', '$', '1', '2', '1', '2', '3', '.', '5', '5', '5', '.', '1', '2', '3', '4']


In [25]:
exp7 = r'\bshe\b'
print (re.findall(exp7, sent))

['she']


In [26]:
exp8 = r'she\B'   # "she" followed by another word character, as in "shells"
print (re.findall(exp8, sent))

['she']


In [27]:
exp9 = r'^s'
print (re.findall(exp9, sent))

['s']


In [28]:
exp10 = r'[a-gA-Z]'
print (re.findall(exp10, sent))

['e', 'e', 'e', 'e', 'a', 'e', 'e', 'e', 'a', 'e']


In [29]:
exp11 = r'\d{3}.\d{3}.\d{4}'
print (re.findall(exp11, sent))

['123.555.1234']


In [30]:
exp12 = r'[\w.%_-]{2,20}@\w+\.[\w.]{2,20}'   # hyphen placed last in the set so it is literal; \. is a literal period
print (re.findall(exp12, text_to_search))

['enjoying@gmail.com']


In [31]:
exp13 = r'\d{3}.\d{3}.\d{4}'
print (re.findall(exp13,text_to_search))

['321-555-4321', '321-555-4321', '123.555.1234', '700*788*7785', '800-555-1234', '900-555-1234']


In [32]:
exp14 = r'[Mrs.]+\s[A-Z][a-z]*'   # [Mrs.]+ is a set: any run of M, r, s or . ; * --> 0 or more
print (re.findall(exp14,text_to_search))

['Mr. Sachin', 'Mr Kevin', 'Ms David', 'Mrs. Raonsla', 'Mr. Tiago']


In [33]:
exp15 = r'M[r-s].*\S.*'   # M followed by r or s, then the rest of the line; * --> 0 or more
print (re.findall(exp15,text_to_search))

['Mr. Sachin', 'Mr Kevin', 'Ms David', 'Mrs. Raonsla', 'Mr. Tiago']


In [34]:
exp16 = r'\S+@\S+'   # + --> 1 or more
print (re.findall(exp16,text_to_search))

['enjoying@gmail.com']
