# Regular Expressions
A regular expression is a sequence of characters, called the pattern, that is used to find substrings in a given string.

For example, suppose we have a dataset of customer reviews about a restaurant. Say we want to extract the emojis from the reviews because they are a good predictor of the sentiment of the review.

Another example: virtual assistants such as Siri and Google Now use information retrieval to give us better results. When we ask them a query or ask them to search for something interesting on the screen, they look for common patterns such as emails, phone numbers, place names, dates and times, and so on. The assistant can then automatically make a booking or prompt us to call the restaurant to make one.

Regular expressions are a very powerful tool in text processing. They help you clean and handle your text in a much better way.

<a href="https://docs.python.org/3/library/re.html">Regular Expression Official Documentation</a>

In [1]:
# import library for regular expressions
import re

In [2]:
print(re.search('Neha','Neha is a Data Enthusiast!'))

<re.Match object; span=(0, 4), match='Neha'>


In [3]:
print(re.search('Puja','Neha is a Data Enthusiast!'))

None


#### The 're.search()' method returns a Match object if the pattern is found in the string; otherwise it returns None.

#### Finding the start and end position of a pattern match

In [4]:
string = 'The roots of education are bitter, but the fruit is sweet.'
pattern = 'education'

result = re.search(pattern,string)

print('Starting position of pattern match is {0}'.format(result.start()))
print('End position of pattern match is {0}'.format(result.end()))

Starting position of pattern match is 13
End position of pattern match is 22
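
Because `re.search()` returns `None` when nothing matches, calling `.start()` directly on the result would raise an `AttributeError` in that case. The following is a minimal sketch of a guarded lookup; the helper name `describe_match` is hypothetical and used only for illustration.

```python
import re

def describe_match(pattern, text):
    """Return (matched_text, start, end), or None when the pattern is absent."""
    result = re.search(pattern, text)
    if result is None:
        # No match: there is no Match object to inspect
        return None
    # group() gives the matched substring; span() gives (start, end)
    start, end = result.span()
    return result.group(), start, end

sentence = 'The roots of education are bitter, but the fruit is sweet.'
print(describe_match('education', sentence))  # ('education', 13, 22)
print(describe_match('physics', sentence))    # None
```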


### Quantifiers in Regular Expressions

Quantifiers let us specify how many times the character(s) in our pattern should occur.

Let’s take an example. Suppose we have some data which have the word ‘awesome’ in it. The list might look like - [‘awesome’, ‘awesomeeee’, ‘awesomee’]. We decide to extract only those elements which have more than one ‘e’ at the end of the word ‘awesome’. This is where quantifiers come into the picture. They let us handle these tasks.

Types of quantifiers:

- The ‘?’ operator
- The ‘*’ operator
- The ‘+’ operator
- The ‘{m,n}’ operator

In [5]:
# Return the match object if the pattern is found, else the string 'Pattern Not Found!'

def find_pattern(pattern,text):
    # search once and reuse the result instead of calling re.search() twice
    result = re.search(pattern,text)
    if result:
        return result
    else:
        return 'Pattern Not Found!'

### '?': Character or character set that it follows can appear zero or one time

In [6]:
print(find_pattern('ab?','ac'))

<re.Match object; span=(0, 1), match='a'>


In [7]:
print(find_pattern('ab?','abc'))

<re.Match object; span=(0, 2), match='ab'>


In [8]:
print(find_pattern('ab?','abbc'))

<re.Match object; span=(0, 2), match='ab'>


#### The ‘?’ can be used where we want the preceding character of our pattern to be an optional character in the string.

For example, if we want to write a regex that matches both ‘car’ and ‘cars’, the corresponding regex will be ‘cars?’. ‘s’ followed by ‘?’ means that ‘s’ can be absent or present, i.e. it can be present zero or one time.
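
As a small sketch of the ‘cars?’ example, the snippet below runs the pattern against a few sample phrases. The sample phrases and the helper name `first_match` are assumptions made for illustration.

```python
import re

pattern = 'cars?'

def first_match(text):
    """Return the text matched by 'cars?' in text, or None if there is no match."""
    match = re.search(pattern, text)
    return match.group() if match else None

for text in ['one car', 'two cars', 'a bus']:
    print(repr(text), '->', first_match(text))
```

The optional ‘s’ is included in the match when it is present and simply skipped when it is not.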

#### Question: Write a regular expression that matches the following words: 
xyz 
xy 
xz 
x

Make sure that the regular expression doesn’t match the following words: 
Xyyz 
Xyzz
Xyy 
Xzz 
Yz 

In [9]:
print(find_pattern('xy?z?','xyz'))
print(find_pattern('xy?z?','xy'))
print(find_pattern('xy?z?','xz'))
print(find_pattern('xy?z?','x'))
print(find_pattern('xy?z?','Xyyz'))
print(find_pattern('xy?z?','Xyzz'))
print(find_pattern('xy?z?','Xyy'))
print(find_pattern('xy?z?','Xzz'))
print(find_pattern('xy?z?','Yz'))

<re.Match object; span=(0, 3), match='xyz'>
<re.Match object; span=(0, 2), match='xy'>
<re.Match object; span=(0, 2), match='xz'>
<re.Match object; span=(0, 1), match='x'>
Pattern Not Found!
Pattern Not Found!
Pattern Not Found!
Pattern Not Found!
Pattern Not Found!
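
Note that `re.search()` reports a match as soon as any substring fits the pattern, so the negative cases above fail mainly because they begin with an uppercase letter. One way to test whole strings is `re.fullmatch()` from the standard `re` module, which requires the entire string to match. In the sketch below, the lowercase negative cases are assumed variants of the list above, not part of the original exercise.

```python
import re

pattern = 'xy?z?'
should_match = ['xyz', 'xy', 'xz', 'x']
# Lowercase variants of the negative cases (an assumption for illustration)
should_not_match = ['xyyz', 'xyzz', 'xyy', 'xzz', 'yz']

for word in should_match + should_not_match:
    # fullmatch() succeeds only if the pattern consumes the whole string,
    # whereas search() is satisfied by any matching substring
    whole = bool(re.fullmatch(pattern, word))
    partial = bool(re.search(pattern, word))
    print(word, '| fullmatch:', whole, '| search:', partial)
```

With `re.search()`, a string such as 'xyyz' still matches because its prefix 'xy' fits the pattern; `re.fullmatch()` rejects it.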


### '*': Character or character set that it follows can appear zero or more times

In [10]:
print(find_pattern('ab*','ac'))

<re.Match object; span=(0, 1), match='a'>


In [11]:
print(find_pattern('ab*','abc'))

<re.Match object; span=(0, 2), match='ab'>


In [12]:
print(find_pattern('ab*','abbc'))

<re.Match object; span=(0, 3), match='abb'>


#### A ‘*’ quantifier matches the preceding character any number of times, including zero.

#### Question: Match a binary number that starts with 101 and is followed by zero or more zeroes.

Sample positive cases (pattern should match all of these):
1010
10100
101000
101

Sample negative cases (shouldn’t match any of these):
10
100
1

In [13]:
print(find_pattern('1010*','1010'))
print(find_pattern('1010*','10100'))
print(find_pattern('1010*','101000'))
print(find_pattern('1010*','101'))
print(find_pattern('1010*','10'))
print(find_pattern('1010*','100'))
print(find_pattern('1010*','1'))

<re.Match object; span=(0, 4), match='1010'>
<re.Match object; span=(0, 5), match='10100'>
<re.Match object; span=(0, 6), match='101000'>
<re.Match object; span=(0, 3), match='101'>
Pattern Not Found!
Pattern Not Found!
Pattern Not Found!


#### Question: Write a pattern that starts with 1 and ends with zero but has an arbitrary number of 1s (zero or more) in between

Sample positive cases (should match all of these):
110
11111110
10

Sample negative cases (shouldn't match any of these): 
11
00
1
0

In [14]:
print(find_pattern('11*0','110'))
print(find_pattern('11*0','11111110'))
print(find_pattern('11*0','10'))
print(find_pattern('11*0','11'))
print(find_pattern('11*0','00'))
print(find_pattern('11*0','1'))
print(find_pattern('11*0','0'))

<re.Match object; span=(0, 3), match='110'>
<re.Match object; span=(0, 8), match='11111110'>
<re.Match object; span=(0, 2), match='10'>
Pattern Not Found!
Pattern Not Found!
Pattern Not Found!
Pattern Not Found!


### '+': Character or character set that it follows can appear one or more times

In [15]:
print(find_pattern('ab+','abbc'))

<re.Match object; span=(0, 3), match='abb'>


In [16]:
print(find_pattern('ab+','abc'))

<re.Match object; span=(0, 2), match='ab'>


In [17]:
print(find_pattern('ab+','ac'))

Pattern Not Found!


#### The ‘+’ quantifier matches the preceding character one or more times. That means the preceding character has to be present at least once for the pattern to match the string.

Thus, the only difference between '+' and '*' is that the '+' needs a character to be present at least once, while the '*' does not.
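
The difference can be seen by running both quantifiers on the same inputs. This is a minimal sketch; the helper name `matched_text` is hypothetical.

```python
import re

def matched_text(pattern, text):
    """Return the matched substring, or None when there is no match."""
    match = re.search(pattern, text)
    return match.group() if match else None

for text in ['ac', 'abc', 'abbc']:
    # 'ab*' accepts a missing 'b'; 'ab+' requires at least one 'b'
    print(text, '| ab* ->', matched_text('ab*', text),
          '| ab+ ->', matched_text('ab+', text))
```

Only the first input, 'ac', distinguishes the two: 'ab*' matches 'a', while 'ab+' finds nothing.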

#### Question: Write a pattern that matches numbers that are powers of 10.

Sample positive matches (should match all of the following):
10
100
1000

Sample negative matches (shouldn't match any of these):
0
1
15


In [18]:
print(find_pattern('10+','10'))
print(find_pattern('10+','100'))
print(find_pattern('10+','1000'))
print(find_pattern('10+','0'))
print(find_pattern('10+','1'))
print(find_pattern('10+','15'))

<re.Match object; span=(0, 2), match='10'>
<re.Match object; span=(0, 3), match='100'>
<re.Match object; span=(0, 4), match='1000'>
Pattern Not Found!
Pattern Not Found!
Pattern Not Found!


### '{m,n}': Character or character set that it follows can appear between m and n times

In [19]:
print(find_pattern('ab{3,5}','abbc'))

Pattern Not Found!


In [20]:
print(find_pattern('ab{3,5}','abbbc'))

<re.Match object; span=(0, 4), match='abbb'>


In [21]:
print(find_pattern('ab{3,5}','abbbbbbbabbbbbbc'))

<re.Match object; span=(0, 6), match='abbbbb'>


In [22]:
print(find_pattern('ab{4}','abbbbbbbbbbbbbc'))

<re.Match object; span=(0, 5), match='abbbb'>


#### There are four variants of the quantifier that you just saw:

- {m,n}: Matches the preceding character ‘m’ times to ‘n’ times.
- {m,}: Matches the preceding character ‘m’ times to infinite times, i.e. there is no upper limit to the occurrence of the preceding character.
- {,n}: Matches the preceding character from zero to ‘n’ times, i.e. the upper limit is fixed regarding the occurrence of the preceding character.
- {n}: Matches if the preceding character occurs exactly ‘n’ number of times.
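
The four variants can be compared on the same set of inputs. The sketch below uses `re.fullmatch()` so that the whole string must satisfy the count; the sample strings and the helper name `accepted` are assumptions for illustration.

```python
import re

samples = ['', 'b', 'bb', 'bbb', 'bbbb']

def accepted(pattern):
    """Return the sample strings that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

print('b{2,3}', accepted('b{2,3}'))  # two to three times
print('b{2,} ', accepted('b{2,}'))   # two or more times
print('b{,3} ', accepted('b{,3}'))   # zero to three times
print('b{2}  ', accepted('b{2}'))    # exactly two times
```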

#### Note that while specifying the {m,n} notation, avoid using a space after the comma, i.e. use {m,n} rather than {m, n}

In [23]:
print(find_pattern('ab{3, 5}','abbbc'))

Pattern Not Found!


#### An interesting thing to note is that this quantifier can replace the ‘?’, ‘*’ and the ‘+’ quantifier. That is because:

- '?' is equivalent to zero or once, or {0,1}
- '*' is equivalent to zero or more times, or {0,}
- '+' is equivalent to one or more times, or {1,}
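
These equivalences can be checked empirically. The sketch below compares each shorthand with its brace form (written without a space after the comma) on a few sample strings; the samples and the helper name `matched_text` are assumptions for illustration.

```python
import re

def matched_text(pattern, text):
    """Return the matched substring, or None when there is no match."""
    match = re.search(pattern, text)
    return match.group() if match else None

equivalents = [('ab?', 'ab{0,1}'), ('ab*', 'ab{0,}'), ('ab+', 'ab{1,}')]
samples = ['ac', 'abc', 'abbc', 'xyz']

for shorthand, braces in equivalents:
    # Both spellings should produce the same result on every sample
    same = all(matched_text(shorthand, s) == matched_text(braces, s) for s in samples)
    print(shorthand, 'behaves like', braces, ':', same)
```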

#### Question: Write a regular expression which matches variants of the word ‘awesome’ where there are more than two ‘e’s at the end of the word.

The following strings should match:
awesomeee
awesomeeee

The following strings shouldn’t match:
awesom
awesome
awesomee

In [24]:
print(find_pattern('awesome{3,}','awesomeee'))
print(find_pattern('awesome{3,}','awesomeeee'))
print(find_pattern('awesome{3,}','awesom'))
print(find_pattern('awesome{3,}','awesome'))
print(find_pattern('awesome{3,}','awesomee'))

<re.Match object; span=(0, 9), match='awesomeee'>
<re.Match object; span=(0, 10), match='awesomeeee'>
Pattern Not Found!
Pattern Not Found!
Pattern Not Found!


## Use of Parentheses

Till now, we have placed quantifiers after a single character, which meant that the character followed by the quantifier could repeat a specified number of times. If we put parentheses around some characters, the quantifier looks for repetitions of the whole group of characters rather than just the preceding character. This concept is called grouping in regular expression jargon. For example, the pattern ‘(abc){1,3}’ will match the following strings:

- abc
- abcabc
- abcabcabc

Similarly, the pattern (010)+ will match:

- 010
- 010010
- 010010010, and so on.
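
To see how parentheses change what a quantifier applies to, the sketch below contrasts a grouped pattern with an ungrouped one. The input strings and the helper name `matched_text` are assumptions for illustration.

```python
import re

def matched_text(pattern, text):
    """Return the matched substring, or None when there is no match."""
    match = re.search(pattern, text)
    return match.group() if match else None

# Grouped: the whole 'abc' repeats, at most three times
print(matched_text('(abc){1,3}', 'abcabcabcabc'))  # abcabcabc
# Ungrouped: only the 'c' repeats, at most three times
print(matched_text('abc{1,3}', 'abcccc'))          # abccc
# Grouped with '+': one or more repetitions of '010'
print(matched_text('(010)+', '0100100'))           # 010010
```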

#### Question: Write a regular expression which matches a string where '23' occurs one or more times followed by occurrence of '78' one or more times

Sample positive matches (should match all of these):

- 2378
- 23237878
- 232323237878

Sample negative matches (shouldn't match any of these):
- 23
- 78
- 23378
- 223378
- 22337788

In [25]:
print(find_pattern('(23){1,}(78){1,}','2378'))
print(find_pattern('(23){1,}(78){1,}','23237878'))
print(find_pattern('(23){1,}(78){1,}','232323237878'))
print(find_pattern('(23){1,}(78){1,}','23'))
print(find_pattern('(23){1,}(78){1,}','78'))
print(find_pattern('(23){1,}(78){1,}','23378'))
print(find_pattern('(23){1,}(78){1,}','223378'))
print(find_pattern('(23){1,}(78){1,}','22337788'))

<re.Match object; span=(0, 4), match='2378'>
<re.Match object; span=(0, 8), match='23237878'>
<re.Match object; span=(0, 12), match='232323237878'>
Pattern Not Found!
Pattern Not Found!
Pattern Not Found!
Pattern Not Found!
Pattern Not Found!


## Use of Pipe Operator

The pipe operator ‘|’ is used as an OR operator. It is usually placed inside parentheses so that the alternation applies only to that group; without parentheses, it splits the entire pattern into alternatives. For example, the pattern ‘(d|g)one’ will match both the strings - ‘done’ and ‘gone’. The pipe operator tells that the place inside the parentheses can be either ‘d’ or ‘g’.

Similarly, the pattern ‘(ICICI|HDFC) Bank’ will match the strings ‘ICICI Bank’ and ‘HDFC Bank’. We can also use quantifiers after the parentheses as usual even when there is a pipe operator inside. Not only that, there can be any number of pipe operators inside the parentheses. The pattern ‘(0|1|2){2}’ means 'exactly two occurrences of either of 0, 1 or 2', and it will match these strings - ‘00’, ‘01’, ‘02’, ‘10’, ‘11’, ‘12’, ‘20’, ‘21’ and ‘22’.
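
The sketch below illustrates these patterns. It also shows what happens when the parentheses are left out, and it enumerates the two-character strings accepted by ‘(0|1|2){2}’ using `re.fullmatch()`. The candidate alphabet '0123' and the helper name `matched_text` are assumptions for illustration.

```python
import re
from itertools import product

def matched_text(pattern, text):
    """Return the matched substring, or None when there is no match."""
    match = re.search(pattern, text)
    return match.group() if match else None

# With parentheses, the alternation covers only 'd' or 'g'
print(matched_text('(d|g)one', 'gone'))   # gone
# Without parentheses, the alternatives are 'd' and 'gone'
print(matched_text('d|gone', 'done'))     # d

# Every two-character string over the digits 0-3
two_digit_strings = [''.join(pair) for pair in product('0123', repeat=2)]
accepted = [s for s in two_digit_strings if re.fullmatch('(0|1|2){2}', s)]
print(accepted)
```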

#### Question: Write a regular expression that matches the following strings: 

- Basketball 
- Baseball 
- Volleyball 
- Softball 
- Football

In [29]:
print(find_pattern('(Basket|Base|Volley|Soft|Foot)ball','BaseballCricket'))

<re.Match object; span=(0, 8), match='Baseball'>
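
The cell above checks only one input. As a sketch, the same pattern can be run against all five words from the question; 'Handball' is an assumed negative case added for contrast.

```python
import re

pattern = '(Basket|Base|Volley|Soft|Foot)ball'
sports = ['Basketball', 'Baseball', 'Volleyball', 'Softball', 'Football']

for word in sports + ['Handball']:
    # fullmatch() requires the whole word to fit the pattern
    print(word, '->', bool(re.fullmatch(pattern, word)))
```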


## Special Characters

Lastly, we will often find ourselves in situations where we need to mention characters such as ‘?’, ‘*’, ‘+’, ‘(‘, ‘)’, ‘{‘, etc. in our regular expressions. These are called special characters since they have special meanings when they appear inside a regex pattern.

Suppose we want to extract all the questions from a document, and we assume that all questions end with a question mark - ‘?’. So we would need to use the ‘?’ in the regular expression. Now, we already know that ‘?’ has a special meaning in regular expressions.

In situations such as these, we’ll need to use escape sequences. The escape sequence, denoted by a backslash ‘\’, is used to escape the special meaning of the special characters.

The '\' itself is a special character, and to match the ‘\’ character literally, we need to escape it too. We can use the pattern ‘\\’ to escape the backslash.
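
In Python, escaped patterns are usually written as raw strings (prefixed with `r`) so that Python itself leaves the backslashes untouched. The sketch below matches a literal ‘?’ and a literal ‘\’, and uses `re.escape()` from the standard library to escape a whole string automatically. The sample text is an assumption for illustration.

```python
import re

# Raw string so that the backslash in the sample text is kept literally
text = r'Is C:\temp the right folder? Yes (probably).'

question_mark = re.search(r'\?', text)   # escaped '?': a literal question mark
backslash = re.search(r'\\', text)       # escaped '\': a literal backslash

print(question_mark.group(), question_mark.start())
print(backslash.group(), backslash.start())

# re.escape() adds the backslashes needed to match a string literally
literal = re.escape('(probably).')
print(literal)
print(re.search(literal, text).group())
```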

#### Question: Write a regular expression that returns True when passed a multiplication equation. For any other equation, it should return False. In other words, it should return True if there is an asterisk - ‘*’ - present in the equation.

Sample positive cases (should match all of these):
- 3a*4b
- 3*2
- 4 * 5 * 6=120

Sample negative cases (shouldn't match either of these):
- 5+3=8 
- 3%2=1


In [33]:
import warnings
warnings.filterwarnings('ignore')

In [34]:
def validate_multiplication(text):
    # raw string: the backslash escapes '*' so it is matched as a literal asterisk
    if re.search(r'\*',text):
        return 'True'
    else:
        return 'False'

In [35]:
validate_multiplication('3a*4b')

'True'

In [36]:
validate_multiplication('5+3')

'False'
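
The function above returns the strings 'True' and 'False'. A possible variant, sketched below under the hypothetical name `is_multiplication`, returns real booleans and is checked against all the sample cases from the question.

```python
import re

def is_multiplication(equation):
    """Return True if the equation contains a literal asterisk."""
    # r'\*' is a raw string; the backslash removes the special meaning of '*'
    return re.search(r'\*', equation) is not None

positive_cases = ['3a*4b', '3*2', '4 * 5 * 6=120']
negative_cases = ['5+3=8', '3%2=1']

for equation in positive_cases + negative_cases:
    print(equation, '->', is_multiplication(equation))
```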

## Regex flags

A flag modifies how the regex engine interprets a pattern. For example, if we want our regex to ignore the case of the text, we can pass the 're.I' flag. Similarly, the re.M flag enables multi-line matching: it makes the anchors ‘^’ and ‘$’ match at the start and end of every line of the input text, rather than only at the start and end of the whole string. We can pass these flags to the re.search() function, combining multiple flags with the ‘|’ operator. The syntax to pass multiple flags is:

re.search(pattern, string, flags=re.I | re.M)

In [50]:
string = 'Neha is a data Enthusiast'
pattern = 'Data'

result1 = re.search(pattern,string, flags=re.I)

print('Case insensitive search result is {0}'.format(result1))

result2 = re.search(pattern,string)

print('Case sensitive search result is {0}'.format(result2))

Case insensitive search result is <re.Match object; span=(10, 14), match='data'>
Case sensitive search result is None
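
As a sketch of combining flags, the example below searches a two-line string for a line that starts with 'neha', ignoring case. It relies on the ‘^’ anchor, which matches at the start of the string and, with re.M, at the start of each line. The sample text is an assumption for illustration.

```python
import re

text = 'Profile\nNeha is a Data Enthusiast!'
pattern = '^neha'

no_flags = re.search(pattern, text)                      # wrong case, wrong position
only_case = re.search(pattern, text, flags=re.I)         # '^' still means start of string
only_multi = re.search(pattern, text, flags=re.M)        # line start found, but case differs
both = re.search(pattern, text, flags=re.I | re.M)       # both conditions relaxed

print(no_flags, only_case, only_multi)
print(both.group(), both.span())
```

Only the call that passes both flags finds the match on the second line.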
