## Tutorial on regular expressions

#### What is re?

##### Regular expressions are used to check whether a pattern exists in a given sequence of characters (a string) and to locate where the pattern occurs in a body of text. They help in manipulating textual data, which is often a prerequisite for data science projects that involve text analytics.

#### As with any Python module, we first need to import the re module

In [1]:
import re 


## The ‘match’ method

In [5]:
pattern =r'Python'
string1='Python'

#print(re.match(pattern,string1))

if re.match(pattern,string1) is not None:
    print("Matches!")
else :
    print("Does not match .")

Matches!


In [6]:
pattern =r'Python'
string2='python'

if re.match(pattern,string2) is not None:
    print("Matches!")
else :
    print("Does not match .")

Does not match .
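
##### The second match fails because matching is case-sensitive by default. As a minimal sketch (the variable names are only illustrative), the re.IGNORECASE flag relaxes this:

```python
import re

pattern = r'Python'

# Default behaviour: 'P' and 'p' are different characters.
case_sensitive = re.match(pattern, 'python')

# With re.IGNORECASE, letters match regardless of case.
case_insensitive = re.match(pattern, 'python', flags=re.IGNORECASE)

print(case_sensitive)             # None
print(case_insensitive.group())   # python
```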


### A compiled program
#### Instead of repeating the code, we can use compile to create a regex program and use built-in methods.

In [10]:
prog = re.compile(pattern)
prog.match(string1)

<re.Match object; span=(0, 6), match='Python'>

##### Here span gives the start index (inclusive) and the end index (exclusive) of the match within the string
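
##### A small sketch of reading the span from a match object. The sample string 'Python rocks' is made up for illustration; span(), start() and end() are standard match-object methods.

```python
import re

text = 'Python rocks'
m = re.compile(r'Python').match(text)

# span() returns (start, end); end is exclusive, so it works directly as a slice.
start, end = m.span()
print(start, end)            # 0 6
print(m.start(), m.end())    # 0 6
print(text[start:end])       # Python
```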


In [11]:
if prog.match(string1) is not None:
    print("Matches ")
else :
    print("Doesn't match")

Matches 


### Positional matching

In [13]:
prog= re.compile(r'P')
prog.match('Python')

<re.Match object; span=(0, 1), match='P'>

##### The match method of a compiled pattern accepts an optional pos argument: the index in the string at which matching should start
##### By default pos is 0, so match succeeds only when the pattern occurs at the very beginning of the string

In [17]:
prog=re.compile(r'thon')
prog.match('thon123')

<re.Match object; span=(0, 4), match='thon'>

In [20]:
prog.match('marathon marathon')

In [19]:
prog.match('marathon marathon',pos=13)

<re.Match object; span=(13, 17), match='thon'>
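
##### A short sketch of pos together with its companion endpos, which limits how far the match may extend. Both are passed positionally here; the values 4 and 7 are chosen only for this example string.

```python
import re

prog = re.compile(r'thon')
text = 'marathon marathon'

no_pos = prog.match(text)            # starts at index 0, where the text is 'm...'
with_pos = prog.match(text, 4)       # starts at index 4, the first 'thon'
with_endpos = prog.match(text, 4, 7) # the string is treated as ending at index 7 ('tho')

print(no_pos)            # None
print(with_pos.span())   # (4, 8)
print(with_endpos)       # None
```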

### A simple use case 

In [21]:
prog=re.compile(r'ing')
words =['Spring','Cycling','Ringtone']
for w in words : 
    if prog.match(w,pos=len(w)-3) is not None:
        print("{} has the last 3 letters 'ing' ".format(w))
    else :
        print("{} does not have the last 3 letters 'ing' ".format(w))

Spring has the last 3 letters 'ing' 
Cycling has the last 3 letters 'ing' 
Ringtone does not have the last 3 letters 'ing' 


### The search method

In [22]:
prog =re.compile(r'ing')
if prog.match('Spring') is None:
    print("None")

None


In [25]:
prog.search('Spring')

<re.Match object; span=(3, 6), match='ing'>
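
##### To make the difference concrete, here is a side-by-side sketch of match, search and fullmatch on the same compiled pattern. fullmatch is not used elsewhere in this tutorial; it succeeds only when the whole string matches.

```python
import re

prog = re.compile(r'ing')

at_start = prog.match('Spring')       # pattern must begin at index 0
anywhere = prog.search('Spring')      # pattern may begin at any index
whole = prog.fullmatch('Spring')      # pattern must cover the entire string
whole_ok = prog.fullmatch('ing')

print(at_start)          # None
print(anywhere.span())   # (3, 6)
print(whole)             # None
print(whole_ok.span())   # (0, 3)
```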

#### How to get the position of the pattern found by the search method

In [26]:
prog=re.compile(r'ing')
words =['Spring','Cycling','Ringtone']
for w in words : 
    mt=prog.search(w)
    #span returns a tuple of the start and end positions of the match
    start_pos=mt.span()[0]
    end_pos=mt.span()[1]
    print("The word '{}'contains 'ing' in the position {} - {}".format(w,start_pos,end_pos))

The word 'Spring'contains 'ing' in the position 3 - 6
The word 'Cycling'contains 'ing' in the position 4 - 7
The word 'Ringtone'contains 'ing' in the position 1 - 4
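
##### The loop above assumes every word contains the pattern. search returns None when there is no match, so calling span() on the result would raise an AttributeError. A hedged sketch with a guard (the extra word 'Python' and the results dictionary are added only for illustration):

```python
import re

prog = re.compile(r'ing')
results = {}

for w in ['Spring', 'Cycling', 'Ringtone', 'Python']:
    mt = prog.search(w)
    if mt is None:
        # No match: there is no span to report.
        results[w] = None
        print("The word '{}' does not contain 'ing'".format(w))
    else:
        results[w] = mt.span()
        print("The word '{}' contains 'ing' at positions {} - {}".format(w, *mt.span()))
```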


#### Search and Replace 

In [27]:
phone = "2004-959-559 # This is Phone Number"

# Remove anything other than digits
# \d gives digits and \D gives everything other than digits 
num = re.sub(r'\D', "", phone)    
print (num)

2004959559
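
##### Two related options, sketched on a shortened version of the phone string: the count argument of re.sub limits how many replacements are made, and re.subn also reports how many were made.

```python
import re

phone = "2004-959-559"

all_dashes = re.sub(r'-', ' ', phone)             # replace every dash
first_dash = re.sub(r'-', ' ', phone, count=1)    # replace only the first dash
digits, n_removed = re.subn(r'\D', '', phone)     # returns (new_string, number_of_replacements)

print(all_dashes)          # 2004 959 559
print(first_dash)          # 2004 959-559
print(digits, n_removed)   # 2004959559 2
```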


#### The findall method 

###### The search method is powerful, but it only finds the first occurrence. To find all the occurrences of a pattern we use the findall method

In [31]:
prog=re.compile(r'ing')
prog.findall('Ringtone of spring1 in ')

['ing', 'ing']

In [32]:
lst_of_ing =prog.findall('The phone is singing the ringtone of spring')
print("There are {} occurrences of 'ing' in the string".format(len(lst_of_ing)))

There are 4 occurrences of 'ing' in the string


In [33]:
for i in prog.finditer('The phone is singing the ringtone of spring'):
    print(i)


<re.Match object; span=(14, 17), match='ing'>
<re.Match object; span=(17, 20), match='ing'>
<re.Match object; span=(26, 29), match='ing'>
<re.Match object; span=(40, 43), match='ing'>
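
##### findall returns only the matched text, while finditer yields match objects. A minimal sketch that uses finditer to collect just the start positions:

```python
import re

prog = re.compile(r'ing')
text = 'The phone is singing the ringtone of spring'

# Each item produced by finditer is a match object, so start() is available.
starts = [m.start() for m in prog.finditer(text)]
print(starts)   # [14, 17, 26, 40]
```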


#### Wildcard Matching
##### In all the above cases we knew the exact pattern we wanted to search for. A '.' in a pattern matches any single character except a newline. It acts like a don't-care position.

In [38]:
prog=re.compile(r'py.')
#print(prog.search('pygmy'))
print(prog.search('Jupyter').group())

pyt
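
##### One caveat, shown as a small sketch: by default '.' does not match a newline character. The re.DOTALL flag changes that. The sample text is made up for illustration.

```python
import re

text = 'py\nthon'

default = re.search(r'py.', text)                    # '.' refuses to match '\n'
dotall = re.search(r'py.', text, flags=re.DOTALL)    # '.' now matches '\n' too

print(default)                 # None
print(repr(dotall.group()))    # 'py\n'
```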


##### The wildcard matches any character. What if we want to restrict the match to a class of characters? For example, \w matches any word character: a letter, a digit or an underscore

In [84]:
prog=re.compile(r'c\wm')
print(prog.search('comedy').group())
print(prog.search('camera').group())

com
cam


##### The above case matched word characters. Everything that \w does not match (punctuation, whitespace and so on) is matched by \W

In [48]:
prog=re.compile(r'9\W11')
print(prog.search('9/11 was a terrible day!').group())
print(prog.search('9-11 was a terrible day!').group())
print(prog.search('9.11 was a terrible day!').group())

9/11
9-11
9.11
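
##### A quick check of what \w and \W actually cover. Note that \w also matches digits and the underscore, not only letters. The test string 'a_1-b' is arbitrary.

```python
import re

sample = 'a_1-b'

word_chars = re.findall(r'\w', sample)      # letters, digits and underscore
non_word_chars = re.findall(r'\W', sample)  # everything else
digit_between = re.search(r'c\wm', 'c3m')   # a digit satisfies \w as well

print(word_chars)             # ['a', '_', '1', 'b']
print(non_word_chars)         # ['-']
print(digit_between.group())  # c3m
```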


##### Whitespace can be matched by the . wildcard, but to match whitespace specifically we must use \s

In [52]:
prog=re.compile(r'Data\swrangling')
print(prog.search("Data wrangling is cool").group())

Data wrangling


#####  \d matches numerical digits 0-9

In [57]:
prog=re.compile(r'score was \d\d')
print(prog.search("My score was 67 ").group())
#print(prog.search("Your score was 73").group())

score was 67


##### This can be used in an application where marks are entered by hand and, let's say, by mistake we entered marks greater than 100

In [59]:
text="Jack got 67 marks. I got 78 marks. Ronnie was close to me with marks 722. Sandra scored a whopping 95."
digit2=re.compile(r'\d\d')
digit3=re.compile(r'\d\d\d')
lines=text.split('.')

# Check for three digits first: the two-digit pattern would also match inside '722'
for l in lines:
    if digit3.search(l) is not None:
        print("There is a typo   : ",digit3.search(l).group())
    elif digit2.search(l) is not None:
        print("this is a valid score:",digit2.search(l).group())

this is a valid score: 67
this is a valid score: 78
There is a typo   :  722
this is a valid score: 95
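
##### An alternative sketch, assuming the rule is simply "no score above 100": extract every run of digits with \d+ (the + quantifier is covered later in this tutorial), convert to int and compare numerically. This is one possible approach, not the only one.

```python
import re

text = ("Jack got 67 marks. I got 78 marks. "
        "Ronnie was close to me with marks 722. Sandra scored a whopping 95.")

# \d+ grabs each complete number, however many digits it has.
scores = [int(s) for s in re.findall(r'\d+', text)]

valid = [s for s in scores if s <= 100]
typos = [s for s in scores if s > 100]

print(valid)   # [67, 78, 95]
print(typos)   # [722]
```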


#### Start of a string
##### The ^ (caret) matches a pattern at the beginning of a string

In [61]:
def print_match(s):
    # Uses whichever compiled pattern is currently bound to the global name prog
    m=prog.search(s)
    if m is None:
        print ("No match")
    else:
        print(m.group())

In [64]:
prog=re.compile(r'^India')

print_match("Russia implemented this law")
print_match("India implemented this law")
print_match("Bangladesh and India share a physical boundary")

No match
India
No match
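
##### By default ^ anchors only at the very start of the string. A small sketch with a made-up multi-line text shows how the re.MULTILINE flag makes ^ match at the start of every line as well:

```python
import re

text = "India won\nRussia lost\nIndia drew"

start_only = re.findall(r'^India', text)
each_line = re.findall(r'^India', text, flags=re.MULTILINE)

print(start_only)   # ['India']
print(each_line)    # ['India', 'India']
```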


#### End of a string 
##### The $ (dollar sign) matches a pattern at the end of the string

In [65]:
patent_company=re.compile(r'Apple$')
patent_number=re.compile(r'\d\d\d\d\d\d')

s1="Patent no 123456 belongs to Apple"
s2="Patent no 987654 belongs to Samsung"
s3="Patent no 753159 belongs to One+"

for s in [s1,s2,s3]:
    if patent_company.search(s) is not None:
        print("Found a patent of Apple")
        print("Patent number :",patent_number.search(s).group())
    else :
        print("Patent number {} is not a patent of Apple".format(patent_number.search(s).group()))

Found a patent of Apple
Patent number : 123456
Patent number 987654 is not a patent of Apple
Patent number 753159 is not a patent of Apple


##### All these techniques match a single occurrence. What if we want to match more than one occurrence?

#### Matching 0 or more repetitions of the regular expression can be done using *

In [66]:
prog=re.compile(r'ab*')

print_match("a")
print_match("ab")
print_match("abbb")
print_match("b")
print_match("bbab")
print_match("something_abb_something")

a
ab
abbb
No match
ab
abb


##### Matching 1 or more repetitions
##### This can be done using + 

In [67]:
prog=re.compile(r'ab+')

print_match("a")
print_match("ab")
print_match("abbb")
print_match("b")
print_match("bbab")
print_match("something_abb_something")

No match
ab
abbb
No match
ab
abb


##### Matching 0 or 1 repetition
##### This can be done using ? 

In [68]:
prog=re.compile(r'ab?')

print_match("a")
print_match("ab")
print_match("abbb")
print_match("b")
print_match("bbab")
print_match("something_abb_something")


a
ab
ab
No match
ab
ab
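
##### To compare the three quantifiers directly, this sketch runs *, + and ? over the same inputs and collects the results in a dictionary (the name table is only illustrative):

```python
import re

samples = ["a", "ab", "abbb", "b"]
table = {}

for pat in [r'ab*', r'ab+', r'ab?']:
    row = []
    for s in samples:
        m = re.search(pat, s)
        row.append(m.group() if m else None)
    table[pat] = row
    print(pat, row)

# ab* ['a', 'ab', 'abbb', None]
# ab+ [None, 'ab', 'abbb', None]
# ab? ['a', 'ab', 'ab', None]
```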


#### Controlling how many repetitions to match

In [69]:
prog = re.compile(r'A{3}')

print_match("ccAAAdd")
print_match("ccAAAAdd")
print_match("ccAAdd")

AAA
AAA
No match


#### To match a range of repetition counts we use {m,n}, which matches from m to n copies of the preceding RE, taking as many as possible. Omitting m specifies a lower bound of zero, and omitting n specifies an infinite upper bound.


In [70]:
prog = re.compile(r'A{2,4}B')

print_match("ccAAABdd")
print_match("ccABdd")
print_match("ccAABBBdd")
print_match("ccAAAAABdd")


AAAB
No match
AAB
AAAAB


In [71]:
prog = re.compile(r'A{,3}B')

print_match("ccAAABdd")
print_match("ccABdd")
print_match("ccAABBBdd")
print_match("ccAAAAABdd")

AAAB
AB
AAB
AAAB


In [72]:
prog = re.compile(r'A{3,}B')

print_match("ccAAABdd")
print_match("ccABdd")
print_match("ccAABBBdd")
print_match("ccAAAAABdd")

AAAB
No match
No match
AAAAAB
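
##### With the bounds in hand, the earlier quantifiers can be seen as shorthand: * behaves like {0,}, + like {1,} and ? like {0,1}. A small sketch that checks this on an arbitrary test string:

```python
import re

text = "a ab abbb b"

star = re.findall(r'ab*', text)
same_star = star == re.findall(r'ab{0,}', text)
same_plus = re.findall(r'ab+', text) == re.findall(r'ab{1,}', text)
same_opt = re.findall(r'ab?', text) == re.findall(r'ab{0,1}', text)

print(star)                             # ['a', 'ab', 'abbb']
print(same_star, same_plus, same_opt)   # True True True
```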


##### {m,n}? specifies m to n copies of RE to match in a non-greedy fashion.

In [73]:
prog=re.compile(r'A{2,4}')
print_match("AAAAAAA")

prog=re.compile(r'A{2,4}?')
print_match("AAAAAAA")


AAAA
AA
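
##### The difference is easier to see with findall, which keeps matching after each hit. The same ? suffix also works on * and +. The second example, with a made-up tag string, is a common illustration of this.

```python
import re

greedy = re.findall(r'A{2,4}', 'AAAAAAA')    # takes as many A's as allowed each time
lazy = re.findall(r'A{2,4}?', 'AAAAAAA')     # takes as few A's as allowed each time

tags_greedy = re.findall(r'<.*>', '<a><b>')  # .* runs to the last '>'
tags_lazy = re.findall(r'<.*?>', '<a><b>')   # .*? stops at the first '>'

print(greedy)        # ['AAAA', 'AAA']
print(lazy)          # ['AA', 'AA', 'AA']
print(tags_greedy)   # ['<a><b>']
print(tags_lazy)     # ['<a>', '<b>']
```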


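#### Sets of matching characters: [xyz] matches any single one of x, y or z. There is no priority among them, and no separators are needed: a comma inside the brackets is itself a character to match

In [75]:
prog = re.compile(r'[AB]')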
print_match("ccAd")
print_match("ccABd")
print_match("ccXdB")
print_match("ccXdZ")

A
A
B
No match
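
##### Two details about sets, sketched on made-up strings: a comma inside the brackets is matched literally, and a leading ^ inside the brackets negates the set.

```python
import re

with_comma = re.findall(r'[A,B]', 'A,B')    # the comma is part of the set
without_comma = re.findall(r'[AB]', 'A,B')  # only A and B are in the set
negated = re.findall(r'[^AB]', 'ccABd')     # anything except A or B

print(with_comma)      # ['A', ',', 'B']
print(without_comma)   # ['A', 'B']
print(negated)         # ['c', 'c', 'd']
```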


#### Range of characters inside a set 

##### A range of characters can be matched inside the set. This is one of the most widely used regex techniques. We denote range by using a  " - " . For example, a-z or A-Z will match anything between a and z or A and Z i.e. the entire English alphabet.

##### Let’s suppose we want to extract an email id. We start with a pattern of one or more letters, then @, then one or more letters, then .com. This pattern cannot catch an email id that has numerical digits in it.

In [76]:
prog=re.compile(r'[a-zA-Z]+@[a-zA-Z]+\.com')

print_match("My email is smitshah@gmail.com")
print_match("My email is smitshah222@gmail.com")


smitshah@gmail.com
No match


In [77]:
prog=re.compile(r'[a-zA-Z0-9]+@[a-zA-Z]+\.com')

print_match("My email is smitshah@gmail.com")
print_match("My email is smitshah222@gmail.com")


smitshah@gmail.com
smitshah222@gmail.com


In [78]:
prog=re.compile(r'[a-zA-Z0-9]+@[a-zA-Z]+\.com')
print_match("My email is smitshah222@gmail.com")
print_match("My email is smitshah222@gmail.org")


smitshah222@gmail.com
No match


In [79]:
prog=re.compile(r'\w+@\w+\.[a-z]{2,4}')
print_match("My email is coolguy12@xyz.org")
print_match("My email is coolguy12@xyz.com")


coolguy12@xyz.org
coolguy12@xyz.com
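
##### The patterns above still miss addresses with dots in the name or with subdomains. The following is only a rough sketch under that assumption, not a full email validator. The pattern, the helper find_email and the sample address are hypothetical, and (?:...) is a non-capturing group, which this tutorial does not otherwise cover.

```python
import re

# Illustrative pattern only; real-world address rules are considerably stricter.
email = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')

def find_email(s):
    """Return the first email-like substring of s, or None."""
    m = email.search(s)
    return m.group() if m else None

found = find_email("My email is first.last@mail.example.org.")
missing = find_email("No address here")

print(found)     # first.last@mail.example.org
print(missing)   # None
```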


#### Combining the power of Regex by OR-ing
##### Regex supports alternation: the | operator combines individual patterns so that a match of any one of them counts. OR-ing patterns this way is a convenient way to cover several formats with one expression.

##### For example, if we want to find phone numbers with the ‘312’ area code, the following pattern fails to extract the number from the second string because it does not allow for the dashes.

In [80]:
prog=re.compile(r'[0-9]{10}')
print_match("3121233121")
print_match("312-312-3121")

3121233121
No match


In [82]:
p0=r'\+*\d*\s[0-9]{3}-[0-9]{3}-[0-9]{4}'
p1=r'[0-9]{10}'
p2=r'[0-9]{3}-[0-9]{3}-[0-9]{4}'
p3=r'\([0-9]{3}\)[0-9]{3}-[0-9]{4}'
p4=r'[0-9]{3}\.[0-9]{3}\.[0-9]{4}'
pattern =p0+'|'+p1+'|'+p2+'|'+p3+'|'+p4
prog=re.compile(pattern)

print_match("3123123121")      #this is p1 pattern
print_match("312-123-2121")   #this is p2 pattern
print_match("(312)312-3121")  #this is p3 pattern
print_match("312.312.3121")   # this is p4 pattern
print_match("+22 312-312-3121") #this is p0 pattern

3123123121
312-123-2121
(312)312-3121
312.312.3121
+22 312-312-3121
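
##### Two practical notes, sketched below with made-up inputs: alternatives are tried from left to right and the first one that matches wins, so their order can matter. A list of patterns can also be combined with '|'.join instead of manual string concatenation.

```python
import re

# Left-to-right: 'a' succeeds first, so 'ab' is never tried at that position.
short_first = re.search(r'a|ab', 'ab').group()
long_first = re.search(r'ab|a', 'ab').group()

# Building the alternation from a list of patterns.
parts = [r'[0-9]{3}-[0-9]{3}-[0-9]{4}', r'[0-9]{10}']
phone_prog = re.compile('|'.join(parts))
dashed = phone_prog.search('call 312-123-2121').group()
plain = phone_prog.search('call 3121232121').group()

print(short_first, long_first)   # a ab
print(dashed)                    # 312-123-2121
print(plain)                     # 3121232121
```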


#### Example of finding a valid phone number within a string

In [83]:
## Here are some phone numbers. Some are valid, some are not. Pick out the valid phone numbers from the given string.
## For this example a valid phone number is a number with area code 312
## 312-xxx-xxxx or 312.xxx.xxxx
phn_number="312-423-3245,456-334-6712,312.312.1231,312.1.1 423.NUMBER"
# Raw string; inside a character set the dot is literal and needs no backslash
re.findall(r'312[-.][0-9]{3}[-.][0-9]{4}',phn_number)

['312-423-3245', '312.312.1231']
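
##### The pattern above would also accept a number that mixes separators, such as 312-423.3245. If that should be rejected, one possible sketch uses a capturing group and the backreference \1, which are not covered elsewhere in this tutorial. The input string is made up for illustration. finditer is used because findall would return only the captured separator.

```python
import re

# ([-.]) captures the first separator; \1 requires the second one to be identical.
same_sep = re.compile(r'312([-.])[0-9]{3}\1[0-9]{4}')

numbers = "312-423-3245,312.312.1231,312-423.3245"
found = [m.group() for m in same_sep.finditer(numbers)]

print(found)   # ['312-423-3245', '312.312.1231']
```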