## Regular Expressions
A regular expression is a sequence of characters, called the pattern, that is used to find matching substrings in a given string.

For example, suppose you have a dataset of customer reviews of your restaurant and you want to extract the emojis from the reviews, because they are a good predictor of the sentiment of a review.
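As a rough sketch of that idea, the snippet below pulls emoji out of a review with `re.findall`. The sample review, the helper name `extract_emojis` and the code point ranges are assumptions made for illustration; the ranges cover many common emoji but are not a complete definition.

```python
import re

# Assumed ranges: common pictographs and emoticons, plus miscellaneous symbols and dingbats.
# This is an approximation, not a complete list of emoji.
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def extract_emojis(review):
    """Return every character in the review that falls in the assumed emoji ranges."""
    return EMOJI_PATTERN.findall(review)

# Hypothetical review text
sample_review = "Loved the pasta \U0001F60D but the service was slow \U0001F621"
print(extract_emojis(sample_review))
```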

As another example, virtual assistants such as Siri and Google Now use information retrieval to give you better results. When you ask them a query, or ask them to search for something interesting on the screen, they look for common patterns such as emails, phone numbers, place names, dates and times. Having found these, the assistant can automatically make a booking or suggest that you call the restaurant to make one.

Regular expressions are a very powerful tool in text processing. They help you clean and handle your text much more effectively.

### Let's import the regular expression library in Python.

In [1]:
import re

Let's do a quick search using a pattern.

In [2]:
re.search('Ravi', 'Ravi is an exceptional student!')

<re.Match object; span=(0, 4), match='Ravi'>

In [3]:
# print output of re.search()
match = re.search('Ravi', 'Ravi is an exceptional student!')
print(match.group())
print(match.span())

Ravi
(0, 4)


Let's define a function to match regular expression patterns

In [4]:
def find_pattern(text, patterns):
    # run the search once and reuse the result
    match = re.search(patterns, text)
    if match:
        return match
    else:
        return 'Not Found!'

### Quantifiers

In [86]:
# '*': Zero or more 
print(find_pattern("ac", "ab*"))
print(find_pattern("abc", "ab*"))
print(find_pattern("abc", "abc*"))
print(find_pattern("abc", "abd*"))
print(find_pattern("abbc", "ab*"))

<re.Match object; span=(0, 1), match='a'>
<re.Match object; span=(0, 2), match='ab'>
<re.Match object; span=(0, 3), match='abc'>
<re.Match object; span=(0, 2), match='ab'>
<re.Match object; span=(0, 3), match='abb'>


In [83]:
# '?': Zero or one (tells whether a pattern is absent or present)
print(find_pattern("ac", "ab?"))
print(find_pattern("abc", "ab?"))
print(find_pattern("abc", "abc?"))
print(find_pattern("abc", "abcd?"))
print(find_pattern("abbc", "ab?"))

<re.Match object; span=(0, 1), match='a'>
<re.Match object; span=(0, 2), match='ab'>
<re.Match object; span=(0, 3), match='abc'>
<re.Match object; span=(0, 3), match='abc'>
<re.Match object; span=(0, 2), match='ab'>


In [79]:
# '+': One or more
print(find_pattern("ac", "ab+"))
print(find_pattern("abc", "abc+"))
print(find_pattern("abc", "abcd+"))
print(find_pattern("abbc", "ab+"))

Not Found!
<re.Match object; span=(0, 3), match='abc'>
Not Found!
<re.Match object; span=(0, 3), match='abb'>


In [8]:
# {n}: Matches if the preceding character is present exactly n times
print(find_pattern("abbc", "ab{2}"))


<re.Match object; span=(0, 3), match='abb'>


In [90]:
# {m,n}: Matches if the preceding character is present from m to n times
print(find_pattern("aabbbbbbc", "ab{3,5}"))   # matches: 'a' followed by 3-5 'b's (greedy, so 5 are taken)
print(find_pattern("aacbbbbbbc", "ab{3,5}"))  # no match: 'c' sits between 'a' and the 'b's
print(find_pattern("aabbbbbbc", "ab{7,10}"))  # no match: only 6 'b's follow 'a', 7-10 are required
print(find_pattern("aabbbbbbc", "ab{,10}"))   # matches: 'a' followed by at most 10 'b's (the first 'a' has zero)
print(find_pattern("aabbbbbbc", "ab{10,}"))   # no match: at least 10 'b's are required

<re.Match object; span=(1, 7), match='abbbbb'>
Not Found!
Not Found!
<re.Match object; span=(0, 1), match='a'>
Not Found!
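The examples above rely on two details that are easy to miss: a quantifier applies only to the single token before it, and `re.search` is satisfied by a match anywhere in the string. The following sketch (variable names are illustrative) contrasts this with a parenthesised group and with `re.fullmatch`, which requires the whole string to fit the pattern.

```python
import re

# The quantifier binds to the single preceding character, here 'b'
only_b_repeated = re.search(r"ab{2}", "abab")      # needs 'abb', so there is no match
# Parentheses make the quantifier apply to the whole group 'ab'
group_repeated = re.search(r"(ab){2}", "abab")     # matches 'abab'

# search() accepts a match anywhere; fullmatch() requires the whole string to fit
partial = re.search(r"ab{3,5}", "abbbbbb")         # finds 'abbbbb' inside the string
whole = re.fullmatch(r"ab{3,5}", "abbbbbb")        # six 'b's exceed the limit of five

print(only_b_repeated, group_repeated.group(), partial.group(), whole)
```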


### Anchors

In [10]:
# '^': Indicates start of a string
# '$': Indicates end of string

print(find_pattern("James", "^J"))   # matches: the string starts with 'J'
print(find_pattern("Pramod", "^J"))  # no match: the string does not start with 'J'
print(find_pattern("India", "a$"))   # matches: the string ends with 'a'
print(find_pattern("Japan", "a$"))   # no match: the string ends with 'n'


<re.Match object; span=(0, 1), match='J'>
Not Found!
<re.Match object; span=(4, 5), match='a'>
Not Found!
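By default the anchors refer to the start and end of the whole string. A small sketch, using an assumed two-line sample text, shows how the `re.MULTILINE` flag makes them apply to every line instead.

```python
import re

text = "first line\nsecond line"   # sample text, chosen for illustration

# By default '^' anchors to the start of the whole string only
default_starts = re.findall(r"^\w+", text)
# With re.MULTILINE, '^' and '$' also anchor at the start and end of each line
multiline_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)
multiline_ends = re.findall(r"\w+$", text, flags=re.MULTILINE)

print(default_starts)
print(multiline_starts)
print(multiline_ends)
```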


### Wildcard

In [11]:
# '.': Matches any character except a newline (by default)
print(find_pattern("a", "."))
print(find_pattern("#", "."))


<re.Match object; span=(0, 1), match='a'>
<re.Match object; span=(0, 1), match='#'>
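Two caveats about the wildcard are worth a quick check: it skips newline characters unless the `re.DOTALL` flag is given, and a literal dot has to be escaped. The inputs below are made up for illustration.

```python
import re

dot_vs_newline = re.search(r".", "\n")                     # '.' does not match a newline by default
dot_with_dotall = re.search(r".", "\n", flags=re.DOTALL)   # DOTALL lets '.' match '\n' as well

# An unescaped '.' matches any character; r'\.' matches only a literal dot
loose = re.search(r"3.14", "3x14")
strict = re.search(r"3\.14", "3x14")

print(dot_vs_newline, dot_with_dotall, loose, strict)
```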


### Character sets

In [12]:
# Now we will look at '[' and ']'.
# They're used for specifying a character class, which is a set of characters that you wish to match.
# Characters can be listed individually as follows
print(find_pattern("a", "[abc]"))

# Or a range of characters can be indicated by giving two characters and separating them by a '-'.
print(find_pattern("c", "[a-c]"))  # same as above

<re.Match object; span=(0, 1), match='a'>
<re.Match object; span=(0, 1), match='c'>


In [13]:
# '^' is used inside character set to indicate complementary set
print(find_pattern("a", "[^abc]"))  # matches any character other than a, b or c; 'a' is excluded, so there is no match

Not Found!


### Character sets
| Pattern  | Matches                                                                                    |
|----------|--------------------------------------------------------------------------------------------|
| [abc]    | Matches either an a, b or c character                                                      |
| [abcABC] | Matches either an a, A, b, B, c or C character                                             |
| [a-z]    | Matches any lowercase letter from a to z, inclusive                                        |
| [A-Z]    | Matches any uppercase letter from A to Z, inclusive                                        |
| [a-zA-Z] | Matches any letter from a to z, in either case                                             |
| [0-9]    | Matches any digit from 0 to 9                                                              |

### Meta sequences

| Pattern  | Equivalent to (with `re.ASCII`) |
|----------|---------------------------------|
| \s       | [ \t\n\r\f\v]    |
| \S       | [^ \t\n\r\f\v]   |
| \d       | [0-9]            |
| \D       | [^0-9]           |
| \w       | [a-zA-Z0-9_]     |
| \W       | [^a-zA-Z0-9_]    |
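In Python 3 these equivalences hold exactly only for ASCII matching. For ordinary `str` patterns, `\d`, `\w` and `\s` also accept Unicode digits, letters and whitespace unless the `re.ASCII` flag is passed. The sample strings below are assumptions chosen to show the difference.

```python
import re

arabic_indic_three = "\u0663"    # a decimal digit from another script
french_text = "caf\u00e9 au lait"

unicode_digit = re.search(r"\d", arabic_indic_three)                 # matches by default
ascii_digit = re.search(r"\d", arabic_indic_three, flags=re.ASCII)   # restricted to [0-9]

unicode_words = re.findall(r"\w+", french_text)                      # the accented letter counts as a word character
ascii_words = re.findall(r"\w+", french_text, flags=re.ASCII)        # the accented letter is excluded

print(unicode_digit is not None, ascii_digit)
print(unicode_words, ascii_words)
```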

### Greedy vs non-greedy regex

In [14]:
print(find_pattern("aabbbbbb", "ab{3,5}")) # 'a' followed by 3-5 'b's, GREEDY: takes as many 'b's as allowed

<re.Match object; span=(1, 7), match='abbbbb'>


In [15]:
print(find_pattern("aabbbbbb", "ab{3,5}?")) # 'a' followed by 3-5 'b's, NON-GREEDY: takes as few 'b's as allowed

<re.Match object; span=(1, 5), match='abbb'>


In [16]:
# Example of HTML code: the greedy '.*' runs to the last '>'
print(re.search("<.*>","<HTML><TITLE>My Page</TITLE></HTML>"))

<re.Match object; span=(0, 35), match='<HTML><TITLE>My Page</TITLE></HTML>'>


In [17]:
# Example of HTML code: the non-greedy '.*?' stops at the first '>'
print(re.search("<.*?>","<HTML><TITLE>My Page</TITLE></HTML>"))

<re.Match object; span=(0, 6), match='<HTML>'>
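To see the effect on every tag rather than just the first match, the same HTML string can be passed to `re.findall`. The sketch also includes a negated character set, a common alternative to the non-greedy quantifier; it is shown here as one option, not as the only correct approach.

```python
import re

html = "<HTML><TITLE>My Page</TITLE></HTML>"

greedy_tags = re.findall(r"<.*>", html)        # one long match from the first '<' to the last '>'
lazy_tags = re.findall(r"<.*?>", html)         # each tag is matched separately
negated_tags = re.findall(r"<[^>]*>", html)    # same result without a non-greedy quantifier

print(greedy_tags)
print(lazy_tags)
print(negated_tags)
```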


### The five most important re functions, which you will use most of the time, are

match() Determine if the RE matches at the beginning of the string

search() Scan through a string, looking for any location where this RE matches

findall() Find all the substrings where the RE matches, and return them as a list

finditer() Find all substrings where the RE matches, and return them as an iterator

sub() Find all substrings where the RE matches and substitute them with the given string

Other functions and match-object methods used in this notebook: span(), compile(), group(), start()

In [18]:
# This function uses re.match(); the cells below show how it differs from re.search()
def match_pattern(text, patterns):
    # run the match once and reuse the result
    match = re.match(patterns, text)
    if match:
        return match
    else:
        return 'Not found!'

In [19]:
print(find_pattern("abbc", "b+"))

<re.Match object; span=(1, 3), match='bb'>


In [20]:
print(match_pattern("abbc", "b+"))

Not found!
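The difference comes down to where the match is allowed to start and end. The short comparison below, with illustrative variable names, puts `re.match`, `re.search` and `re.fullmatch` side by side on the same string.

```python
import re

text = "abbc"

with_match = re.match(r"b+", text)          # None: 'b+' does not match at index 0
with_anchor = re.search(r"^b+", text)       # None: '^' gives search() the same restriction here
with_search = re.search(r"b+", text)        # finds 'bb' at span (1, 3)
prefix_only = re.match(r"ab+", text)        # 'abb': match() does not need to reach the end
whole_string = re.fullmatch(r"ab+", text)   # None: the trailing 'c' is left over

print(with_match, with_anchor, with_search.span(), prefix_only.group(), whole_string)
```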


In [31]:
## Example usage of the sub() function. Replace Road with Rd.

street = '21 Ramakrishna Road roumers'
print(re.sub('Road', 'Rd', street))

21 Ramakrishna Rd roumers


In [32]:
print(re.sub(r'R\w+', 'Rd', street))  # replaces every 'R' together with the word characters after it

21 Rd Rd roumers


In [33]:
print(re.sub(r'r\w+', 'Rd', street))  # 'r' can match mid-word: 'rishna' inside 'Ramakrishna' is replaced

21 RamakRd Road Rd
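The mid-word replacement above is usually not what is wanted. A few refinements of `re.sub` are sketched below: a word boundary `\b`, the `count` argument, and group references in the replacement string. The date sentence is a made-up input.

```python
import re

street = '21 Ramakrishna Road roumers'

# \b restricts the match to whole words that start with a lowercase 'r'
whole_words = re.sub(r'\br\w+', 'Rd', street)

# count=1 replaces only the first match
first_only = re.sub(r'R\w+', 'Rd', street, count=1)

# \1, \2 and \3 in the replacement refer to the captured groups
reordered = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\3/\2/\1', 'Booked on 2017-10-28')

print(whole_words)
print(first_only)
print(reordered)
```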


In [23]:
## Example usage of finditer(). Find all occurrences of the word 'festival' in the given sentence

text = 'Diwali is a festival of lights, Holi is a festival of colors!'
pattern = 'festival'
for match in re.finditer(pattern, text):
    print('START -', match.start(), end=" ")
    print('END -', match.end())

# finditer() itself returns an iterator, not a list
re.finditer(pattern, text)

START - 12 END - 20
START - 42 END - 50


<callable_iterator at 0x25bb589a0a0>

In [38]:
# Example usage of findall(). Find the dates in the given URL
url = "http://www.telegraph.co.uk/formula-1/2017/10/28/mexican-grand-prix-2017-time-does-start-tv-channel-odds-lewisl/2017/05/12"
date_regex = r'/(\d{4})/(\d{1,2})/(\d{1,2})/'
# the pattern requires a trailing '/', so the final '/2017/05/12' is not matched
print(re.findall(date_regex, url))

[('2017', '10', '28')]
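Only one date is returned because the pattern insists on a `/` after the day, and the URL ends without one. One possible way to pick up the last date as well is a lookahead, which checks what follows without consuming it. This is a sketch of that variation, not the pattern used elsewhere in this notebook.

```python
import re

url = "http://www.telegraph.co.uk/formula-1/2017/10/28/mexican-grand-prix-2017-time-does-start-tv-channel-odds-lewisl/2017/05/12"

# (?=/|$) requires a '/' or the end of the string after the day, without consuming it
date_regex_lookahead = r'/(\d{4})/(\d{1,2})/(\d{1,2})(?=/|$)'
dates = re.findall(date_regex_lookahead, url)
print(dates)
```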


In [39]:
## Exploring Groups
m1 = re.search(date_regex, url)
print(m1.group())  ## with no argument, group() returns the whole match

/2017/10/28/


In [26]:
print(m1.group(1)) # - Print first group

2017


In [27]:
print(m1.group(2)) # - Print second group

10


In [28]:
print(m1.group(3)) # - Print third group

28


In [29]:
print(m1.group(0)) # - Print zero or the default group

/2017/10/28/
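Numbered groups become hard to read as patterns grow. Python's `re` module also supports named groups with the `(?P<name>...)` syntax. The sketch below applies it to a shortened version of the URL; the group names `year`, `month` and `day` are chosen here for illustration.

```python
import re

named_date_regex = r'/(?P<year>\d{4})/(?P<month>\d{1,2})/(?P<day>\d{1,2})/'
m = re.search(named_date_regex, "/formula-1/2017/10/28/mexican-grand-prix")

year = m.group('year')     # access a group by name
parts = m.groupdict()      # all named groups as a dictionary
as_tuple = m.groups()      # all groups as a tuple, in order

print(year)
print(parts)
print(as_tuple)
```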


In [38]:
re.search("tree?",'The trees stands tall.')  # 'e?' makes the final 'e' optional, so 'tre' or 'tree' can match

# another sentence to try: 'There are a lot of trees in the forest.'

<re.Match object; span=(4, 8), match='tree'>

In [69]:
re.search("101[01]+","1011001")  # [01]+ is a character set: any mix of 0s and 1s, one or more times

<re.Match object; span=(0, 7), match='1011001'>

In [65]:
re.search("101(01)+","101010101") # (01)+ is a group: the pair '01' must repeat as a unit

<re.Match object; span=(0, 9), match='101010101'>

In [55]:
re.search("101{2}","1011")

<re.Match object; span=(0, 4), match='1011'>

In [58]:
re.search("1010*","1010000")

<re.Match object; span=(0, 7), match='1010000'>

In [71]:
re.split(",","pravesh,hari,aman,sahil,gautam")

['pravesh', 'hari', 'aman', 'sahil', 'gautam']

In [73]:
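import string
# Python 3's string module has no split() function, so this call raises an error;
# use the str method instead: "pravesh,hari,aman,sahil,gautam".split(",")
string.split(",","pravesh,hari,aman,sahil,gautam")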

AttributeError: module 'string' has no attribute 'split'
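Where `re.split` earns its place over `str.split` is with separators that vary. A minimal sketch, using an assumed messy version of the name list:

```python
import re

names = "pravesh, hari;aman  sahil,gautam"   # mixed separators, made up for illustration

# one or more commas, semicolons or whitespace characters act as a single separator
parts = re.split(r"[,;\s]+", names)

# maxsplit limits the number of splits; the remainder is returned unsplit
first_two = re.split(r"[,;\s]+", names, maxsplit=2)

print(parts)
print(first_two)
```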

In [100]:
print(find_pattern("huray", "hur{2,5}ay"))

Not Found!


In [101]:
print(find_pattern("hurray", "hur{2,5}ay"))

<re.Match object; span=(0, 6), match='hurray'>


In [102]:
print(find_pattern("hurrrrray", "hur{2,5}ay"))

<re.Match object; span=(0, 9), match='hurrrrray'>


In [103]:
print(find_pattern("hurrrrrray", "hur{2,5}ay"))

Not Found!


In [121]:
print(find_pattern("Awesomeee", "Awesome{3,}"))

<re.Match object; span=(0, 9), match='Awesomeee'>


In [115]:
print(find_pattern("0000", "000+1+"))

Not Found!


In [23]:
import re
re.search(r"[\w\*]+","3*5*6*6*25")  # a backslash before a special character matches it literally (inside [...] the '*' is already literal)

<re.Match object; span=(0, 10), match='3*5*6*6*25'>

In [6]:
re.search("(d|g)one","done")   # '|' is the pipe (alternation) operator: matches 'd' or 'g' before 'one'

<re.Match object; span=(0, 4), match='done'>

In [7]:
re.search("(d|g)one","gone")

<re.Match object; span=(0, 4), match='gone'>
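The parentheses matter here: without them the pipe splits the entire pattern into two alternatives. The sketch below shows the difference, along with a non-capturing group `(?:...)` for cases where the grouping is needed but the captured text is not.

```python
import re

ungrouped = re.search(r"d|gone", "done")          # alternatives are 'd' and 'gone'
grouped = re.search(r"(d|g)one", "done")          # alternatives are 'd' and 'g', followed by 'one'
non_capturing = re.search(r"(?:d|g)one", "gone")  # groups without capturing

print(ungrouped.group())
print(grouped.group(), grouped.group(1))
print(non_capturing.group(), non_capturing.groups())
```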

In [42]:
Str1 = "robin\nathan"     #normal_string
Str1

'robin\nathan'

In [35]:
Str2 = r"robin\nathan" 
Str2

'robin\\nathan'
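Raw strings matter for regular expressions because some regex escapes collide with Python's own string escapes. A small check with the word boundary `\b`, which in a normal string literal is the backspace character (the sample sentence is made up):

```python
import re

plain = re.search('\bcat\b', 'a cat sat')    # '\b' is a backspace character here, so nothing matches
raw = re.search(r'\bcat\b', 'a cat sat')     # r'\b' reaches the regex engine as a word boundary

# The same difference in terms of string length
newline_length = len('\n')       # one character: a newline
raw_length = len(r'\n')          # two characters: a backslash and 'n'

print(plain, raw.span(), newline_length, raw_length)
```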

In [71]:
# b or B -  used to specify the string as a byte string

var = b'Hey I am a Byte String'
print(var)
print(type(var))

var = 'Hey I am a String'.encode('ASCII')
print(var)
print(type(var))

var = b'Hey I am a Byte String'.decode('ASCII')
print(var)
print(type(var))


# f or F - used before the string for formatting the string
m=5
n=3
p=f"the sum of {m} and {n} is {m+n}"
print(p)


# r or R - a Python raw string treats the backslash character (\) as a literal character
n = 'Hi\nHello' # \n starts a new line
t = 'Hi\tHello' # \t inserts a tab
print(n)
print(t)
raw_s = r'Hi\nHello'
print(raw_s)
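# \d and \c are not valid escape sequences, so recent Python versions emit a warning for this literal
s = "\\examplehost\digitalocean\content\\"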
print(s)
s = r"\\examplehost\digitalocean\content\\"
print(s)
s = R'ab\\'
print(s)

# u - marks a unicode string. In Python 3 every str is already unicode, but in Python 2 the prefix was needed to make it unicode
s=u"pravesh" 
print(s)
print(type(s))

b'Hey I am a Byte String'
<class 'bytes'>
b'Hey I am a String'
<class 'bytes'>
Hey I am a Byte String
<class 'str'>
the sum of 5 and 3 is 8
Hi
Hello
Hi	Hello
Hi\nHello
\examplehost\digitalocean\content\
\\examplehost\digitalocean\content\\
ab\\
pravesh
<class 'str'>


In [70]:
# even though ^ is not being used as an anchor here, it won't be matched literally
print(bool(re.search('b^2', 'a^2 + b^2 - C*3')))      # False
# escaping will work
print(bool(re.search(r'b\^2', 'a^2 + b^2 - C*3')))    # True

# match ( or ) literally
print(re.sub(r'\(|\)', '', '(a*b) + c'))              # a*b + c

# note that the input string is also a raw string here
re.sub(r'\\', '/', r'\learn\by\example')

False
True
a*b + c


'/learn/by/example'

In [65]:
re.sub('\t', ':', 'a\tb\tc')


'a:b:c'

In [66]:
re.sub('\n', ' ', '1\n2\n3')

'1 2 3'

In [72]:
# re.compile() compiles a regular expression into a pattern object that can be reused, which is convenient, and can be slightly faster, when the same pattern is used many times.

# without re.compile() function
result = re.search("a+", "abc")
print(result)
# using the re.compile() function
pattern = re.compile("a+")
print(pattern)
result = pattern.search("abc")
print(result)

<re.Match object; span=(0, 1), match='a'>
re.compile('a+')
<re.Match object; span=(0, 1), match='a'>
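A compiled pattern is most useful when the same expression is applied to many strings, and it can carry flags such as `re.IGNORECASE`. The sentences and the name `festival_re` below are assumptions for the sake of the example.

```python
import re

# Compile once, with a flag, then reuse the pattern object
festival_re = re.compile(r"festival", flags=re.IGNORECASE)

sentences = [
    "Diwali is a Festival of lights",
    "Holi is a festival of colors",
    "No match here",
]

hits = [s for s in sentences if festival_re.search(s)]
starts = [festival_re.search(s).start() for s in hits]

print(hits)
print(starts)
```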


In [114]:
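string="Balasubrahmamyhu"
pattern = r"^(?=.*[A-Za-z])[A-Za-z\d]{3,15}$" # write your regex here
pattern = r'^.{3,15}$'  # this second assignment overrides the first; the string has 16 characters, so the search fails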
# check whether pattern is present in string or not
result = re.search(pattern, string)

# evaluate result - don't change the following piece of code, it is used to evaluate your regex
if result != None:
    print(True)
else:
    print(False)

False
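The first pattern in the cell above uses a lookahead, `(?=.*[A-Za-z])`, which asserts that a letter occurs somewhere ahead without consuming any characters. A minimal sketch of how it behaves, wrapped in a hypothetical helper `is_valid_username` (the "username" rule of 3 to 15 letters or digits with at least one letter is an assumed reading of that pattern):

```python
import re

# (?=.*[A-Za-z]) : somewhere ahead there must be at least one letter
# [A-Za-z\d]{3,15} : the whole string is 3 to 15 letters or digits
USERNAME_RE = re.compile(r"^(?=.*[A-Za-z])[A-Za-z\d]{3,15}$")

def is_valid_username(name):
    """Return True if the name satisfies the assumed rule above."""
    return USERNAME_RE.search(name) is not None

for candidate in ["abc1", "12345", "ab", "user_1", "Balasubrahmamyhu"]:
    print(candidate, is_valid_username(candidate))
```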


In [125]:
string="amoplk62340"
pattern = r"^[a-z]{1,10}[\d]{4}" # write your regex here
pattern = r"^(?=.*[a-z])[a-z]{1,10}(?=.*\d)[0-9]{4}$"  # overrides the first; five digits follow the letters, so the search fails
# check whether pattern is present in string or not
result = re.search(pattern, string)

# evaluate result - don't change the following piece of code, it is used to evaluate your regex
if result != None:
    print(True)
else:
    print(False)

False


In [8]:
import re
re.sub(r"^[a-zA-Z]{1}","$","my name is pravesh kumar")  # replaces the first letter of the string with $

'$y name is pravesh kumar'

In [22]:
import re
for i in re.finditer(r"[\w\s\.-]","Do not compare apples with-oranges. Compare apples with apples"):
    print(i.group(),end="")

Do not compare apples with-oranges. Compare apples with apples

In [33]:
import re
import ast, sys
string = sys.stdin.read()
string="Do not compare apples with-oranges. Compare apples with apples"

# regex pattern
pattern = r"[\w]+" # regex to extract all the words from a given piece of text

# store results in the list 'result'
result = []

# iterate over the matches
for match in re.finditer(pattern,string): # finditer yields one match object per word
    # keep only words with 5 or more characters
    if len(match.group()) >= 5:
        result.append(match)

# evaluate result - don't change the
print(result)

[<re.Match object; span=(7, 14), match='compare'>, <re.Match object; span=(15, 21), match='apples'>, <re.Match object; span=(27, 34), match='oranges'>, <re.Match object; span=(36, 43), match='Compare'>, <re.Match object; span=(44, 50), match='apples'>, <re.Match object; span=(56, 62), match='apples'>]
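If only the words themselves are needed rather than match objects, the length filter can be moved into the pattern. This is an alternative sketch, not the exercise's expected answer: `\w{5,}` requires at least five word characters and `\b` keeps the match aligned to whole words.

```python
import re

text = "Do not compare apples with-oranges. Compare apples with apples"

# \b...\b matches whole words; {5,} keeps only those with five or more characters
long_words = re.findall(r"\b\w{5,}\b", text)
print(long_words)
```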


In [37]:
import re
import ast, sys
string = sys.stdin.read()
string="Playing outdoor games when its raining outside is always fun!"

# regex pattern
pattern = r"\w+ing\b" # regex to extract words ending with 'ing'; \b ensures the word ends right after 'ing'

# store results in the list 'result'
result =re.findall(pattern,string) # extract words having the required pattern, using the findall function

# evaluate result - don't change the following piece of code, it is used to evaluate your regex
print(len(result))

2


In [44]:
import re
import ast, sys
string = sys.stdin.read()
string="Today’s date is 18-05-2018."
# regex pattern
pattern = r"(\d{1,2})-(\d{1,2})-(\d{4})" # regex to extract a date in DD-MM-YYYY format

# store result
result = re.search(pattern,string)  # pass the parameters to the re.search() function

# evaluate result - don't change the following piece of code, it is used to evaluate your regex
if result != None:
    print(result.group(0))  # result.group(0) will output the entire match
else:
    print(False)

18-05-2018
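The pattern checks only the shape of the date, so a string like 45-13-2018 would also match. One way to add a real calendar check, sketched here with a hypothetical helper `extract_valid_date`, is to hand the matched text to `datetime.strptime`.

```python
import re
from datetime import datetime

def extract_valid_date(text):
    """Return the first DD-MM-YYYY date in text as a date object, or None."""
    match = re.search(r"\b(\d{1,2})-(\d{1,2})-(\d{4})\b", text)
    if match is None:
        return None
    try:
        # strptime raises ValueError for impossible dates such as day 45 or month 13
        return datetime.strptime(match.group(0), "%d-%m-%Y").date()
    except ValueError:
        return None

print(extract_valid_date("Today's date is 18-05-2018."))
print(extract_valid_date("The form said 45-13-2018."))
print(extract_valid_date("No date in this sentence."))
```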


In [48]:
import re
import ast, sys
string = sys.stdin.read()
string="praveshbhagle@gmail.com"
# regex pattern
pattern = r"([\w\-]+)@([\w\.-]+)" # regex to extract the email; groups are used to extract the domain name of the mail

# store result
result = re.search(pattern, string)

# extract domain using group command
if result != None:
    domain =result.group(2) # use group to extract the domain from result
else:
    domain = "NA"

# evaluate result - don't change the following piece of code, it is used to evaluate your regex
print(domain)

gmail.com


In [1]:
import re
import ast, sys
string = "praveshbhagle@gmail.com"

# regex pattern
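pattern = r"\w+@([A-Za-z]+\.com)"  # [A-Za-z] matches letters only; the range [A-z] would also include characters such as '[' and '_'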

# store result
result = re.search(pattern, string)

# extract domain using group command
if result != None:
    domain = result.group(1)
else:
    domain = "NA"

# evaluate result - don't change the following piece of code, it is used to evaluate your regex
print(domain)

gmail.com
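The same grouping idea extends to text that contains several addresses. With one capturing group in the pattern, `re.findall` returns just that group for each match. The pattern below is a simplified sketch and does not cover every valid email address; the addresses are made up.

```python
import re

text = "Contact a.b@example.com or sales@shop.co.in."   # hypothetical addresses

# local part, '@', then a domain made of dot-separated labels (captured)
email_domain_regex = r"[\w.-]+@([\w-]+(?:\.[\w-]+)+)"
domains = re.findall(email_domain_regex, text)
print(domains)
```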


In [8]:
re.search(r"[\d\s]+","4")

<re.Match object; span=(0, 1), match='4'>