## Regular Expressions
A regular expression is a sequence of characters, called a pattern, that is used to find matching substrings in a given string.

For example, suppose you have a dataset of customer reviews of your restaurant and you want to extract the emojis from the reviews, because they are a good predictor of the sentiment of a review.

As another example, virtual assistants such as Siri and Google Now use information retrieval to give you better results. When you ask them a question, or ask them to search for something interesting on the screen, they look for common patterns such as emails, phone numbers, place names, dates and times. Having found these, the assistant can automatically make a booking or prompt you to call the restaurant to make one.

Regular expressions are a very powerful tool in text processing. They will help you clean and handle your text much more effectively.

### Let's import the regular expression library in Python.

In [3]:
import re

Let's do a quick search using a pattern.

In [4]:
re.search('Suraaj', 'Suraaj is an exceptional student!')

<re.Match object; span=(0, 6), match='Suraaj'>

In [5]:
# print output of re.search()
match = re.search('Suraaj', 'Suraaj is an exceptional student!')
print(match.group())

Suraaj
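
`re.search()` returns `None` when the pattern does not occur anywhere in the string, so calling `.group()` directly on the result can raise an `AttributeError`. Below is a minimal sketch of a safer lookup. The helper name `first_match` and the second name are hypothetical, chosen only for illustration.

```python
import re

def first_match(pattern, text):
    """Return the matched substring, or None when the pattern is absent."""
    match = re.search(pattern, text)
    return match.group() if match is not None else None

found = first_match('Suraaj', 'Suraaj is an exceptional student!')
missing = first_match('Rahul', 'Suraaj is an exceptional student!')
print(found)    # Suraaj
print(missing)  # None
```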


Let's define a function to match regular expression patterns

In [6]:
def find_pattern(text, patterns):
    # Search once, ignoring case (re.I) and letting '^' and '$' match on every line (re.M)
    match = re.search(patterns, text, flags=re.I | re.M)
    if match:
        print("starting point: ", match.start())
        print("ending point: ", match.end())
        return match
    else:
        return 'Not Found!'

In [7]:
find_pattern('My name is Suraaj hasija','suraaj')

starting point:  11
ending point:  17
<re.Match object; span=(11, 17), match='Suraaj'>
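
The two flags used in `find_pattern` can be seen in isolation. In this small sketch (the sample text is made up for illustration), `re.I` makes the match case-insensitive, and `re.M` lets `^` and `$` match at the start and end of every line rather than only at the start and end of the whole string.

```python
import re

text = "first line\nSecond line"

# Without re.I, 'second' does not match 'Second'.
case_sensitive = re.search('second', text)
case_insensitive = re.search('second', text, flags=re.I)

# Without re.M, '^' only matches at the very start of the string.
single_line = re.search('^Second', text)
multi_line = re.search('^Second', text, flags=re.M)

print(case_sensitive)            # None
print(case_insensitive.group())  # Second
print(single_line)               # None
print(multi_line.span())         # (11, 17)
```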


### Quantifiers

In [8]:
# '*': Zero or more 
print(find_pattern("ac", "ab*"))
print(find_pattern("abc", "ab*"))
print(find_pattern("abbc", "ab*"))

starting point:  0
ending point:  1
<re.Match object; span=(0, 1), match='a'>
starting point:  0
ending point:  2
<re.Match object; span=(0, 2), match='ab'>
starting point:  0
ending point:  3
<re.Match object; span=(0, 3), match='abb'>


In [9]:
# '?': Zero or one (tells whether a pattern is absent or present)
print(find_pattern("ac", "ab?"))
print(find_pattern("abc", "ab?"))
print(find_pattern("abbc", "ab?"))

starting point:  0
ending point:  1
<re.Match object; span=(0, 1), match='a'>
starting point:  0
ending point:  2
<re.Match object; span=(0, 2), match='ab'>
starting point:  0
ending point:  2
<re.Match object; span=(0, 2), match='ab'>


In [10]:
# '+': One or more
print(find_pattern("ac", "ab+"))
print(find_pattern("abc", "ab+"))
print(find_pattern("abbc", "ab+"))

Not Found!
starting point:  0
ending point:  2
<re.Match object; span=(0, 2), match='ab'>
starting point:  0
ending point:  3
<re.Match object; span=(0, 3), match='abb'>


In [11]:
# {n}: Matches exactly n repetitions of the preceding character
print(find_pattern("abbc", "ab{2}"))


starting point:  0
ending point:  3
<re.Match object; span=(0, 3), match='abb'>


In [12]:
# {m,n}: Matches from m to n repetitions of the preceding character
print(find_pattern("aabbbbbbc", "ab{3,5}"))   # matches if 'a' is followed by 3 to 5 'b's
print(find_pattern("aabbbbbbc", "ab{7,10}"))  # matches if 'a' is followed by 7 to 10 'b's
print(find_pattern("aabbbbbbc", "ab{,10}"))   # matches if 'a' is followed by at most 10 'b's (including none)
print(find_pattern("aabbbbbbc", "ab{10,}"))   # matches if 'a' is followed by at least 10 'b's

starting point:  1
ending point:  7
<re.Match object; span=(1, 7), match='abbbbb'>
Not Found!
starting point:  0
ending point:  1
<re.Match object; span=(0, 1), match='a'>
Not Found!
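
The shorthand quantifiers are special cases of the `{m,n}` form: `*` behaves like `{0,}`, `+` like `{1,}`, and `?` like `{0,1}`. The sketch below checks this on a few short sample strings, using `re.fullmatch()` so that the whole string has to match the pattern.

```python
import re

samples = ["a", "ab", "abb", "abbb"]

# Each shorthand quantifier paired with its explicit {m,n} form.
equivalents = [("ab*", "ab{0,}"), ("ab+", "ab{1,}"), ("ab?", "ab{0,1}")]

results = {}
for short, explicit in equivalents:
    short_hits = [s for s in samples if re.fullmatch(short, s)]
    explicit_hits = [s for s in samples if re.fullmatch(explicit, s)]
    results[short] = (short_hits, explicit_hits)
    print(short, short_hits, "|", explicit, explicit_hits)
```

Each pair prints the same list twice, which is what the equivalence predicts.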


In [13]:
#grouping, OR concepts
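
Here is a brief sketch of the two ideas named in the cell above, using made-up sample strings. Parentheses group several characters so that a quantifier applies to the whole group, and `|` matches either of the alternatives on its two sides.

```python
import re

# Without a group, '+' applies only to 'b'; with a group it applies to 'ab'.
ungrouped = re.search("ab+", "ababab").group()
grouped = re.search("(ab)+", "ababab").group()
print(ungrouped)  # ab
print(grouped)    # ababab

# '|' means OR: match either alternative.
either = re.search("cat|dog", "I have a dog").group()
print(either)     # dog

# A group limits how far the alternation reaches.
spelling = re.search("gr(a|e)y", "a grey sky").group()
print(spelling)   # grey
```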

In [14]:
# without re.compile() function
result = re.search("a+", "abc")

result

<re.Match object; span=(0, 1), match='a'>

In [15]:
# using the re.compile() function
pattern = re.compile("a+")
result = pattern.search("abc")
result

<re.Match object; span=(0, 1), match='a'>
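
`re.compile()` is mostly useful when the same pattern is applied many times, because the pattern object is built once and then reused. The `re` module also caches recently used patterns internally, so the gain for a handful of calls is small. A minimal sketch, with a made-up word list:

```python
import re

# Compile once, then reuse the same pattern object for every string.
pattern = re.compile("a+")

words = ["abc", "banana", "xyz", "aaa"]
found = []
for word in words:
    match = pattern.search(word)
    found.append(match.group() if match else None)

print(found)  # ['a', 'a', None, 'aaa']
```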

#### Q: Write a regular expression that matches any string that starts with one or more ‘1’s, followed by three or more ‘0’s, followed by any number of ones (zero or more), followed by ‘0’s (from one to seven), and then ends with either two or three ‘1’s.

In [16]:
text='11000011100111'
pattern = '^1+0{3,}1*0{1,7}1{2,3}$'  # write your regex here; '$' makes sure the string ends with the two or three 1s

# check whether pattern is present in string or not
result = re.search(pattern, text)
result

<re.Match object; span=(0, 14), match='11000011100111'>

### Anchors

In [17]:
# '^': Indicates start of a string
# '$': Indicates end of string

print(find_pattern("James", "^J"))   # matches: the string starts with 'J'
print(find_pattern("Pramod", "^J"))  # no match: the string does not start with 'J'
print(find_pattern("India", "a$"))   # matches: the string ends with 'a'
print(find_pattern("Japan", "a$"))   # no match: the string does not end with 'a'


starting point:  0
ending point:  1
<re.Match object; span=(0, 1), match='J'>
Not Found!
starting point:  4
ending point:  5
<re.Match object; span=(4, 5), match='a'>
Not Found!


##### Note: Suppose you are asked to write a regex pattern that matches a string starting with any four characters, followed by three 0s and two 1s, followed by any two characters. Valid strings include abcd00011ft, jkds00011hf, etc. A pattern that satisfies this condition is
**‘.{4}0{3}1{2}.{2}’**.

You can also use ‘....00011..’, where each dot acts as a placeholder for any single character. Both are correct regex patterns.
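
As a quick check, the two patterns from the note can be run against the valid strings and against two invalid ones. The invalid strings are made up for illustration, and `re.fullmatch()` is used here so that the whole string must fit the pattern.

```python
import re

patterns = [r".{4}0{3}1{2}.{2}", r"....00011.."]
candidates = ["abcd00011ft", "jkds00011hf", "abc00011ft", "abcd0011ft"]

verdicts = {}
for candidate in candidates:
    # One True/False per pattern, in the same order as 'patterns'.
    verdicts[candidate] = [bool(re.fullmatch(p, candidate)) for p in patterns]
    print(candidate, verdicts[candidate])
```

Both patterns accept the first two strings and reject the last two, which have too few leading characters and too few zeros respectively.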




### Wildcard

In [18]:
# '.': Matches any character (except a newline, by default)
print(find_pattern("a", "."))
print(find_pattern("#", "."))


starting point:  0
ending point:  1
<re.Match object; span=(0, 1), match='a'>
starting point:  0
ending point:  1
<re.Match object; span=(0, 1), match='#'>
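
One caveat worth a short sketch: by default the dot does not match a newline character. Passing the `re.DOTALL` flag changes this.

```python
import re

no_newline = re.search(".", "\n")                     # '.' skips the newline
with_dotall = re.search(".", "\n", flags=re.DOTALL)   # now the newline matches
per_char = re.findall(".", "a\nb")                    # the newline is left out

print(no_newline)            # None
print(with_dotall.span())    # (0, 1)
print(per_char)              # ['a', 'b']
```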


**Q. Write a regular expression to match first names (consider only first names, i.e. there are no spaces in a name) that have length between three and fifteen characters**

In [19]:
text='Balasubrahmanyam'
pattern = '^.{3,15}$'# write your regex here

# check whether pattern is present in string or not
result = re.search(pattern, text)
if result is not None:
    print(True)
else:
    print(False)

False


### Character sets

In [20]:
# Now we will look at '[' and ']'.
# They're used for specifying a character class, which is a set of characters that you wish to match.
# Characters can be listed individually as follows
print(find_pattern("a", "[abc]"))

# Or a range of characters can be indicated by giving two characters and separating them by a '-'.
print(find_pattern("c", "[a-c]"))  # same as above

starting point:  0
ending point:  1
<re.Match object; span=(0, 1), match='a'>
starting point:  0
ending point:  1
<re.Match object; span=(0, 1), match='c'>


In [21]:
# '^' is used inside character set to indicate complementary set
print(find_pattern("jjj", "[^abc]"))  # matches any character that is not a, b or c

starting point:  0
ending point:  1
<re.Match object; span=(0, 1), match='j'>


### Character sets
| Pattern  | Matches                                                                                    |
|----------|--------------------------------------------------------------------------------------------|
| [abc]    | Matches either an a, b or c character                                                      |
| [abcABC] | Matches either an a, A, b, B, c or C character                                             |
| [a-z]    | Matches any characters between a and z, including a and z                                  |
| [A-Z]    | Matches any characters between A and Z, including A and Z                                  |
| [a-zA-Z] | Matches any letter between a and z, in either lowercase or uppercase                       |
| [0-9]    | Matches any character which is a number between 0 and 9                                    |

### Meta sequences

| Pattern  | Equivalent to    |
|----------|------------------|
| \s       | [ \t\n\r\f\v]    |
| \S       | [^ \t\n\r\f\v]   |
| \d       | [0-9]            |
| \D       | [^0-9]           |
| \w       | [a-zA-Z0-9_]     |
| \W       | [^a-zA-Z0-9_]    |
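
The equivalences in the table hold exactly when the `re.ASCII` flag is set. By default, Python 3 string patterns are Unicode-aware, so `\d`, `\w` and `\s` also match digits, word characters and whitespace from other scripts. A short sketch, using a made-up sample sentence:

```python
import re

text = "Order 66 shipped to flat_7B on 2017-10-28"

digits = re.findall(r"\d+", text)   # runs of digits
words = re.findall(r"\w+", text)    # runs of letters, digits and underscores
chunks = re.findall(r"\S+", text)   # runs of non-whitespace characters
print(digits)
print(words)
print(chunks)

# "\u0663" is the Arabic-Indic digit three.
unicode_digit = re.search(r"\d", "\u0663")
ascii_digit = re.search(r"\d", "\u0663", flags=re.ASCII)
print(unicode_digit is not None)  # True
print(ascii_digit is not None)    # False
```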

**Write a regular expression, using meta-sequences, that matches the usernames in a database. A username starts with one to ten letters, followed by a four-digit number.**

In [22]:
string='suraaj199'
pattern = r'^[a-z]{1,10}\d{4}$'  # write your regex here

# check whether pattern is present in string or not
result = re.search(pattern, string, re.I)

if result is not None:
    print('Found')
else:
    print('Not Found')

Not Found


### Greedy vs non-greedy regex

In [23]:
print(find_pattern("aabbbbbb", "ab{3,5}")) # 'a' followed by 3 to 5 'b's; takes as many 'b's as possible (GREEDY)

starting point:  1
ending point:  7
<re.Match object; span=(1, 7), match='abbbbb'>


In [24]:
print(find_pattern("aabbbbbb", "ab{3,5}?")) # 'a' followed by 3 to 5 'b's; takes as few 'b's as possible (NON-GREEDY)

starting point:  1
ending point:  5
<re.Match object; span=(1, 5), match='abbb'>


In [25]:
# HTML example, greedy: '.*' consumes as much as possible, so the match runs to the last '>'
print(re.search("<.*>","<HTML><TITLE>My Page</TITLE></HTML>"))

<re.Match object; span=(0, 35), match='<HTML><TITLE>My Page</TITLE></HTML>'>


In [26]:
# HTML example, non-greedy: '.*?' consumes as little as possible, so the match stops at the first '>'
print(re.search("<.*?>","<HTML><TITLE>My Page</TITLE></HTML>"))

<re.Match object; span=(0, 6), match='<HTML>'>
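
The difference becomes clearer when every match is collected. In this sketch, `re.findall()` is applied to the same HTML string with both versions of the pattern.

```python
import re

html = "<HTML><TITLE>My Page</TITLE></HTML>"

# Greedy: one match that swallows the whole string.
greedy_tags = re.findall(r"<.*>", html)

# Non-greedy: one match per tag.
lazy_tags = re.findall(r"<.*?>", html)

print(greedy_tags)  # ['<HTML><TITLE>My Page</TITLE></HTML>']
print(lazy_tags)    # ['<HTML>', '<TITLE>', '</TITLE>', '</HTML>']
```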


### The five most important re functions, which you will use most of the time, are

match() Determine if the RE matches at the beginning of the string

search() Scan through a string, looking for any location where this RE matches

findall() Find all the substrings where the RE matches, and return them as a list

finditer() Find all the substrings where the RE matches, and return them as an iterator

sub() Find all the substrings where the RE matches, and substitute them with the given string

In [27]:
# This function uses re.match(); let's see how it differs from re.search()
def match_pattern(text, patterns):
    match = re.match(patterns, text)  # only matches at the beginning of the string
    if match:
        return match
    else:
        return 'Not found!'

In [28]:
print(find_pattern("abbc", "b+"))

starting point:  1
ending point:  3
<re.Match object; span=(1, 3), match='bb'>


In [29]:
print(match_pattern("abbc", "b+")) # no match, because the string doesn't start with 'b'

Not found!
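
For comparison, the sketch below puts `re.match()` and `re.search()` side by side on the same string. It also includes `re.fullmatch()`, which is not covered in this section and succeeds only when the entire string fits the pattern.

```python
import re

text = "abbc"

at_start = re.match("b+", text)          # 'b+' is not at the beginning
anywhere = re.search("b+", text)         # found further into the string
prefix = re.match("ab+", text)           # matches the prefix 'abb'
whole_fail = re.fullmatch("ab+", text)   # the trailing 'c' is left over
whole_ok = re.fullmatch("ab+c", text)    # the whole string fits

print(at_start)          # None
print(anywhere.span())   # (1, 3)
print(prefix.group())    # abb
print(whole_fail)        # None
print(whole_ok.group())  # abbc
```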


In [30]:
## Example usage of the sub() function. Replace Road with Rd.

street = '21 Ramakrishna Road'
print(re.sub('Road', 'Rd', street))

21 Ramakrishna Rd


In [31]:
print(re.sub(r'R\w+', 'Rd', street))  # replaces every word starting with 'R', not just 'Road'

21 Rd Rd


In [32]:
pattern = r"\d"
replacement = "X"
string = "My address is 13B, Baker Street"

re.sub(pattern, replacement, string)

'My address is XXB, Baker Street'
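
`re.sub()` has two further features that are often handy: the replacement string can refer back to captured groups with `\1`, `\2`, and so on, and the optional `count` argument limits how many matches are replaced. A small sketch with made-up inputs:

```python
import re

# \1, \2 and \3 refer to the first, second and third groups of the pattern.
date = "2017/10/28"
reordered = re.sub(r"(\d{4})/(\d{1,2})/(\d{1,2})", r"\3-\2-\1", date)
print(reordered)    # 28-10-2017

# count=1 replaces only the first match.
address = "My address is 13B, Baker Street"
masked_once = re.sub(r"\d", "X", address, count=1)
print(masked_once)  # My address is X3B, Baker Street
```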

In [33]:
## Example usage of finditer(). Find all occurrences of the word 'festival' in the given sentence

text = 'Diwali is a festival of lights, Holi is a festival of colors!'
pattern = 'festival'
for match in re.finditer(pattern, text):
    print('START -', match.start(), 'END -', match.end())

START - 12 END - 20
START - 42 END - 50


In [46]:
# Example usage of findall(). In the given URL find all dates
url = "http://www.telegraph.co.uk/formula-1/2017/10/28/mexican-grand-prix-2017-time-does-start-tv-channel-odds-lewisl/2017/05/12"
date_regex = r'/(\d{4})/(\d{1,2})/(\d{1,2})'
final=re.findall(date_regex, url)
print(final)

[('2017', '10', '28'), ('2017', '05', '12')]
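
The shape of what `re.findall()` returns depends on the number of groups in the pattern: with no groups it returns the whole matches, with one group it returns that group, and with several groups it returns tuples. A sketch on a shortened, made-up URL path:

```python
import re

path = "/2017/10/28/some-title/2017/05/12"

no_groups = re.findall(r"/\d{4}/\d{1,2}/\d{1,2}", path)
one_group = re.findall(r"/(\d{4})/\d{1,2}/\d{1,2}", path)
three_groups = re.findall(r"/(\d{4})/(\d{1,2})/(\d{1,2})", path)

print(no_groups)     # ['/2017/10/28', '/2017/05/12']
print(one_group)     # ['2017', '2017']
print(three_groups)  # [('2017', '10', '28'), ('2017', '05', '12')]
```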


#### Q. Write a regular expression to extract all the words from a given sentence. Then use the re.finditer() function and store all the matched words that are of length more than or equal to 5 letters in a separate list called result.



In [35]:
string="Do not compare apples with oranges. Compare apples with apples"
# regex pattern
pattern = r'\w+' # regex to extract all the words from a given piece of text

# store results in the list 'result'
result = []

# iterate over the matches
for match in re.finditer(pattern, string):
    # keep only the words that are at least 5 letters long
    if len(match.group()) >= 5:
        result.append(match.group())


In [36]:
result

['compare', 'apples', 'oranges', 'Compare', 'apples', 'apples']

In [37]:
## Exploring Groups
m1 = re.search(date_regex, url)
print(m1.group())  ## print the whole match (same as group(0))

/2017/10/28


In [38]:
print(m1.group(1)) # - Print first group

2017


In [39]:
print(m1.group(2)) # - Print second group

10


In [40]:
print(m1.group(3)) # - Print third group

28


In [41]:
print(m1.group(0)) # - Print group zero, i.e. the whole match (the default)

/2017/10/28
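
Groups can also be given names with the `(?P<name>...)` syntax, which tends to be easier to read than positional numbers. In this sketch the group names `year`, `month` and `day` are illustrative choices, and the path is a shortened version of the URL used earlier.

```python
import re

named_date_regex = r"/(?P<year>\d{4})/(?P<month>\d{1,2})/(?P<day>\d{1,2})"
m2 = re.search(named_date_regex, "/formula-1/2017/10/28/mexican-grand-prix")

print(m2.group("year"))  # 2017
print(m2.group(1))       # 2017 (named groups keep their numbers too)
print(m2.groups())       # ('2017', '10', '28')
print(m2.groupdict())    # {'year': '2017', 'month': '10', 'day': '28'}
```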


#### Q: Write a regular expression to extract the domain name from an email address. The format of the email is simple: the part before the ‘@’ symbol contains letters, numbers and underscores. The part after the ‘@’ symbol contains only letters, followed by a dot, followed by ‘com’.


Sample input: 
user_name_123@gmail.com 
 
Expected output: 
gmail.com

In [42]:
string='suraaj.hasija@mastercard.com'
# regex pattern
pattern = r"\w+@([A-Za-z]+\.com)"  # [A-Za-z], because [A-z] would also match some punctuation characters

# store result
result = re.search(pattern, string)

# extract domain using group command
if result is not None:
    domain = result.group(1)
else:
    domain = "NA"

# evaluate result - don't change the following piece of code, it is used to evaluate your regex
print(domain)

mastercard.com


#### Q: Find all the file names that end with '.jpg'

In [43]:
# items contains all the files and folders of current directory
items = ['photos', 'documents', 'videos', 'image001.jpg','image002.jpg','image005.jpg', 'wallpaper.jpg',
         'flower.jpg', 'earth.jpg', 'monkey.jpg', 'image002.png']

# create an empty list to store resultant files
images = []

# regex pattern to extract files that end with '.jpg'
pattern = r".*\.jpg$"

for item in items:
    if re.search(pattern, item):
        images.append(item)

# print result
print(images)

['image001.jpg', 'image002.jpg', 'image005.jpg', 'wallpaper.jpg', 'flower.jpg', 'earth.jpg', 'monkey.jpg']
