# Regular Expressions
A regular expression is a sequence of characters, called the pattern, that helps in finding substrings in a given string. The pattern describes the substrings we want to detect.

For example, suppose we have a dataset of customer reviews of a restaurant. Say we want to extract the emojis from the reviews because they are a good predictor of the sentiment of the review.

As another example, virtual assistants such as Siri and Google Now use information retrieval to give us better results. When we ask them a query or ask them to search for something interesting on the screen, they look for common patterns such as emails, phone numbers, place names, dates and times, and so on. The assistant can then automatically make a booking or suggest that we call the restaurant to make one.

Regular expressions are a very powerful tool in text processing. They help us clean and handle our text in a much better way.

<a href="https://docs.python.org/3/library/re.html">Regular Expression Official Documentation</a>

In [1]:
# import library for regular expressions
import re

In [2]:
print(re.search('Neha','Neha is a Data Enthusiast!'))

<re.Match object; span=(0, 4), match='Neha'>


In [3]:
print(re.search('Puja','Neha is a Data Enthusiast!'))

None


#### The 're.search()' method returns a Match object if the pattern is found in the string; otherwise it returns None.
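
Because a failed search yields None, it is safer to test the result before calling methods on it. The following is a minimal sketch of that check; the helper name `first_match_text` is hypothetical and not part of the notebook.

```python
import re

def first_match_text(pattern, text):
    """Return the matched substring, or None if the pattern is absent."""
    result = re.search(pattern, text)
    if result is None:
        return None
    return result.group()  # group() with no arguments returns the whole match

print(first_match_text('Neha', 'Neha is a Data Enthusiast!'))  # Neha
print(first_match_text('Puja', 'Neha is a Data Enthusiast!'))  # None
```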

#### Finding the start and end position of a pattern match

In [4]:
string = 'The roots of education are bitter, but the fruit is sweet.'
pattern = 'education'

result = re.search(pattern,string)

print('Starting position of pattern match is {0}'.format(result.start()))
print('End position of pattern match is {0}'.format(result.end()))

Starting position of pattern match is 13
End position of pattern match is 22


### Quantifiers in Regular Expressions

Quantifiers let us control how many times the character(s) in our pattern may occur.

Let’s take an example. Suppose we have some data which have the word ‘awesome’ in it. The list might look like - [‘awesome’, ‘awesomeeee’, ‘awesomee’]. We decide to extract only those elements which have more than one ‘e’ at the end of the word ‘awesome’. This is where quantifiers come into the picture. They let us handle these tasks.

Types of quantifiers:

- The ‘?’ operator
- The ‘*’ operator
- The ‘+’ operator
- The ‘{m,n}’ operator

In [5]:
# Return the match object if the pattern is found, else the string 'Pattern Not Found!'

def find_pattern(pattern,text):
    result = re.search(pattern,text)  # search once and reuse the result
    if result:
        return result
    else:
        return 'Pattern Not Found!'

### '?': Character or character set that it follows can appear zero or one time

In [6]:
print(find_pattern('ab?','ac'))

<re.Match object; span=(0, 1), match='a'>


In [7]:
print(find_pattern('ab?','abc'))

<re.Match object; span=(0, 2), match='ab'>


In [8]:
print(find_pattern('ab?','abbc'))

<re.Match object; span=(0, 2), match='ab'>


#### The ‘?’  can be used where we want the preceding character of our pattern to be an optional character in the string. 

For example, if we want to write a regex that matches both ‘car’ and ‘cars’, the corresponding regex will be ’cars?’. ‘s’ followed by ‘?’ means that ‘s’ can be absent or present, i.e. it can be present zero or one time. 

#### Question: Write a regular expression that matches the following words: 
xyz 
xy 
xz 
x

Make sure that the regular expression doesn’t match the following words: 
Xyyz 
Xyzz
Xyy 
Xzz 
Yz 

In [9]:
print(find_pattern('xy?z?','xyz'))
print(find_pattern('xy?z?','xy'))
print(find_pattern('xy?z?','xz'))
print(find_pattern('xy?z?','x'))
print(find_pattern('xy?z?','Xyyz'))
print(find_pattern('xy?z?','Xyzz'))
print(find_pattern('xy?z?','Xyy'))
print(find_pattern('xy?z?','Xzz'))
print(find_pattern('xy?z?','Yz'))

<re.Match object; span=(0, 3), match='xyz'>
<re.Match object; span=(0, 2), match='xy'>
<re.Match object; span=(0, 2), match='xz'>
<re.Match object; span=(0, 1), match='x'>
Pattern Not Found!
Pattern Not Found!
Pattern Not Found!
Pattern Not Found!
Pattern Not Found!
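
Note that the negative cases above are rejected only because they begin with an uppercase letter: `re.search()` looks for the pattern anywhere in the string, so a lowercase 'xyyz' would still produce the partial match 'xy'. If the intent is to accept or reject whole words, one option is `re.fullmatch()`. The sketch below assumes lowercase inputs, and the helper name `is_whole_word_match` is made up for illustration.

```python
import re

def is_whole_word_match(pattern, word):
    # re.fullmatch succeeds only if the entire string matches the pattern
    return re.fullmatch(pattern, word) is not None

words = ['xyz', 'xy', 'xz', 'x', 'xyyz', 'xyzz', 'xyy', 'xzz', 'yz']
results = {word: is_whole_word_match('xy?z?', word) for word in words}

for word, matched in results.items():
    print(word, matched)

# For comparison, an unanchored search finds a partial match in 'xyyz'
partial = re.search('xy?z?', 'xyyz').group()
print(partial)  # xy
```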


### '*': Character or character set that it follows can appear zero or more number of times

In [10]:
print(find_pattern('ab*','ac'))

<re.Match object; span=(0, 1), match='a'>


In [11]:
print(find_pattern('ab*','abc'))

<re.Match object; span=(0, 2), match='ab'>


In [12]:
print(find_pattern('ab*','abbc'))

<re.Match object; span=(0, 3), match='abb'>


#### A ‘*’ quantifier matches the preceding character any number of times.

#### Question: Match a binary number that starts with 101 and ends with zero or more number of zeroes.

Sample positive cases (pattern should match all of these):
1010
10100
101000
101

Sample negative cases (shouldn’t match any of these):
10
100
1

In [13]:
print(find_pattern('1010*','1010'))
print(find_pattern('1010*','10100'))
print(find_pattern('1010*','101000'))
print(find_pattern('1010*','101'))
print(find_pattern('1010*','10'))
print(find_pattern('1010*','100'))
print(find_pattern('1010*','1'))

<re.Match object; span=(0, 4), match='1010'>
<re.Match object; span=(0, 5), match='10100'>
<re.Match object; span=(0, 6), match='101000'>
<re.Match object; span=(0, 3), match='101'>
Pattern Not Found!
Pattern Not Found!
Pattern Not Found!


#### Question: Write a pattern that starts with 1 and ends with zero but has arbitrary number of 1s (zero or more) in between

Sample positive cases (should match all of these):
110
11111110
10

Sample negative cases (shouldn't match any of these): 
11
00
1
0

In [14]:
print(find_pattern('11*0','110'))
print(find_pattern('11*0','11111110'))
print(find_pattern('11*0','10'))
print(find_pattern('11*0','11'))
print(find_pattern('11*0','00'))
print(find_pattern('11*0','1'))
print(find_pattern('11*0','0'))

<re.Match object; span=(0, 3), match='110'>
<re.Match object; span=(0, 8), match='11111110'>
<re.Match object; span=(0, 2), match='10'>
Pattern Not Found!
Pattern Not Found!
Pattern Not Found!
Pattern Not Found!


### '+': Character or character set that it follows can appear one or more number of times

In [15]:
print(find_pattern('ab+','abbc'))

<re.Match object; span=(0, 3), match='abb'>


In [16]:
print(find_pattern('ab+','abc'))

<re.Match object; span=(0, 2), match='ab'>


In [17]:
print(find_pattern('ab+','ac'))

Pattern Not Found!


#### The ‘+’ quantifier matches the preceding character one or more times. That means the preceding character has to be present at least once for the pattern to match the string.

Thus, the only difference between '+' and '*' is that the '+' needs a character to be present at least once, while the '*' does not.

#### Question: Write a pattern that matches numbers that are powers of 10.

Sample positive matches (should match all of the following):
10
100
1000

Sample negative matches (shouldn't match either of these):
0
1
15


In [18]:
print(find_pattern('10+','10'))
print(find_pattern('10+','100'))
print(find_pattern('10+','1000'))
print(find_pattern('10+','0'))
print(find_pattern('10+','1'))
print(find_pattern('10+','15'))

<re.Match object; span=(0, 2), match='10'>
<re.Match object; span=(0, 3), match='100'>
<re.Match object; span=(0, 4), match='1000'>
Pattern Not Found!
Pattern Not Found!
Pattern Not Found!


### '{m,n}': Character or character set that it follows can appear between m and n times (both inclusive)

In [19]:
print(find_pattern('ab{3,5}','abbc'))

Pattern Not Found!


In [20]:
print(find_pattern('ab{3,5}','abbbc'))

<re.Match object; span=(0, 4), match='abbb'>


In [21]:
print(find_pattern('ab{3,5}','abbbbbbbabbbbbbc'))

<re.Match object; span=(0, 6), match='abbbbb'>


In [22]:
print(find_pattern('ab{4}','abbbbbbbbbbbbbc'))

<re.Match object; span=(0, 5), match='abbbb'>


#### There are four variants of the quantifier that we just saw:

- {m,n}: Matches the preceding character ‘m’ times to ‘n’ times.
- {m,}: Matches the preceding character ‘m’ times to infinite times, i.e. there is no upper limit to the occurrence of the preceding character.
- {,n}: Matches the preceding character from zero to ‘n’ times, i.e. the upper limit is fixed regarding the occurrence of the preceding character.
- {n}: Matches if the preceding character occurs exactly ‘n’ number of times.

#### Note that while specifying the {m,n} notation, avoid using a space after the comma, i.e. use {m,n} rather than {m, n}

In [23]:
print(find_pattern('ab{3, 5}','abbbc'))

Pattern Not Found!


#### An interesting thing to note is that this quantifier can replace the ‘?’, ‘*’ and the ‘+’ quantifier. That is because:

- '?' is equivalent to zero or once, or {0,1}
- '*' is equivalent to zero or more times, or {0,}
- '+' is equivalent to one or more times, or {1,}
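
The equivalences listed above can be checked directly. This is a small sketch using the same 'abbc' test string as the earlier cells; the variable names are illustrative.

```python
import re

text = 'abbc'
# Each shorthand quantifier paired with its brace form (no space after the comma)
pairs = [('ab?', 'ab{0,1}'), ('ab*', 'ab{0,}'), ('ab+', 'ab{1,}')]

matches = []
for shorthand, braces in pairs:
    m1 = re.search(shorthand, text).group()
    m2 = re.search(braces, text).group()
    matches.append((m1, m2))
    print(shorthand, '->', m1, '|', braces, '->', m2)
```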

#### Write a regular expression which matches variants of the word ‘awesome’ where there are more than two ‘e’s at the end of the word.

The following strings should match:
awesomeee
awesomeeee

The following strings shouldn’t match:
awesom
awesome
awesomee

In [24]:
print(find_pattern('awesome{3,}','awesomeee'))
print(find_pattern('awesome{3,}','awesomeeee'))
print(find_pattern('awesome{3,}','awesom'))
print(find_pattern('awesome{3,}','awesome'))
print(find_pattern('awesome{3,}','awesomee'))

<re.Match object; span=(0, 9), match='awesomeee'>
<re.Match object; span=(0, 10), match='awesomeeee'>
Pattern Not Found!
Pattern Not Found!
Pattern Not Found!


## Use of Parentheses

Till now, we have placed quantifiers after a single character, which meant that the character preceding the quantifier can repeat a specified number of times. If we put parentheses around some characters, the quantifier applies to the whole group of characters rather than just the preceding character. This concept is called grouping in regular expression jargon. For example, the pattern ‘(abc){1,3}’ will match the following strings:

- abc
- abcabc
- abcabcabc

Similarly, the pattern (010)+ will match:

- 010
- 010010
- 010010010, and so on.
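
A short sketch of how parentheses change what a quantifier repeats. It uses `re.fullmatch()` so that the whole string must match; the variable names are chosen for illustration only.

```python
import re

# Parentheses make the quantifier apply to the whole group 'abc'
two_reps = re.fullmatch('(abc){1,3}', 'abcabc')
four_reps = re.fullmatch('(abc){1,3}', 'abcabcabcabc')
# Without parentheses, only the last character 'c' is repeated
no_group = re.fullmatch('abc{1,3}', 'abccc')
repeated = re.search('(010)+', '010010010')

print(two_reps is not None)    # True
print(four_reps is not None)   # False: four repetitions exceed the upper limit
print(no_group is not None)    # True
print(repeated.group())        # 010010010
```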

#### Question: Write a regular expression which matches a string where '23' occurs one or more times followed by occurrence of '78' one or more times

Sample positive matches (should match all of these):

- 2378
- 23237878
- 232323237878

Sample negative matches (shouldn't match either of these):
- 23
- 78
- 23378
- 223378
- 22337788

In [25]:
print(find_pattern('(23){1,}(78){1,}','2378'))
print(find_pattern('(23){1,}(78){1,}','23237878'))
print(find_pattern('(23){1,}(78){1,}','232323237878'))
print(find_pattern('(23){1,}(78){1,}','23'))
print(find_pattern('(23){1,}(78){1,}','78'))
print(find_pattern('(23){1,}(78){1,}','23378'))
print(find_pattern('(23){1,}(78){1,}','223378'))
print(find_pattern('(23){1,}(78){1,}','22337788'))

<re.Match object; span=(0, 4), match='2378'>
<re.Match object; span=(0, 8), match='23237878'>
<re.Match object; span=(0, 12), match='232323237878'>
Pattern Not Found!
Pattern Not Found!
Pattern Not Found!
Pattern Not Found!
Pattern Not Found!


## Use of Pipe Operator

The pipe operator is used as an OR operator. It is usually placed inside parentheses, which limit the alternatives to the enclosed part of the pattern. For example, the pattern ‘(d|g)one’ will match both the strings - ‘done’ and ‘gone’. The pipe operator tells that the place inside the parentheses can be either ‘d’ or ‘g’.

Similarly, the pattern ‘(ICICI|HDFC) Bank’ will match the strings ‘ICICI Bank’ and ‘HDFC Bank’. We can also use quantifiers after the parentheses as usual even when there is a pipe operator inside. Not only that, there can be any number of pipe operators inside the parentheses. The pattern ‘(0|1|2){2}’ means 'exactly two occurrences of either of 0, 1 or 2', and it will match these strings - ‘00’, ‘01’, ‘02’, ‘10’, ‘11’, ‘12’, ‘20’, ‘21’ and ‘22’.
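
The patterns from the two paragraphs above can be tried out as follows. This is a sketch; the sample sentences and variable names are assumptions made for illustration.

```python
import re

gone = re.search('(d|g)one', 'She has gone home').group()
bank = re.search('(ICICI|HDFC) Bank', 'I visited HDFC Bank today').group()

# Keep only the strings that consist of exactly two characters from 0, 1 or 2
two_digits = re.compile('(0|1|2){2}')
accepted = [s for s in ['00', '12', '21', '23', '3'] if two_digits.fullmatch(s)]

# Without parentheses the alternation covers the whole pattern: 'd' OR 'gone'
unscoped = re.search('d|gone', 'done').group()

print(gone)      # gone
print(bank)      # HDFC Bank
print(accepted)  # ['00', '12', '21']
print(unscoped)  # d
```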

#### Question: Write a regular expression that matches the following strings: 

- Basketball 
- Baseball 
- Volleyball 
- Softball 
- Football

In [26]:
print(find_pattern('(Basket|Base|Volley|Soft|Foot)ball','BaseballCricket'))

<re.Match object; span=(0, 8), match='Baseball'>


## Special Characters

Lastly, we will often find ourselves in situations where we need to use characters such as ‘?’, ‘*’, ‘+’, ‘(‘, ‘)’, ‘{‘, etc. literally in our regular expressions. These are called special characters since they have special meanings when they appear inside a regex pattern.

Suppose we want to extract all the questions from a document, and we assume that all questions end with a question mark - ‘?’. So we would need to use the ‘?’ in the regular expression. Now, we already know that ‘?’ has a special meaning in regular expressions.

In situations such as these, we’ll need to use escape sequences. The escape sequence, denoted by a backslash ‘\’, is used to escape the special meaning of the special characters.

The '\' itself is a special character, and to match the ‘\’ character literally, we need to escape it too. We can use the pattern ‘\\’ to escape the backslash.
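
A brief sketch of escaping in practice. It uses Python raw strings (the `r'...'` prefix) so that the backslashes reach the regex engine unchanged, and `re.escape()`, which escapes the special characters in a literal string for us. The sample strings are made up for illustration.

```python
import re

# r'\?' matches a literal question mark
question_marks = re.findall(r'\?', 'Really? Are you sure?')
print(question_marks)  # ['?', '?']

# r'\\' matches a single literal backslash
backslash = re.search(r'\\', r'C:\data')
print(backslash.group())

# re.escape builds a pattern that matches the given text literally
literal = '3*2=6?'
escaped = re.escape(literal)
whole = re.fullmatch(escaped, literal)
print(whole is not None)  # True
```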

#### Question: Write a regular expression that returns True when passed a multiplication equation. For any other equation, it should return False. In other words, it should return True if there is an asterisk - ‘*’ - present in the equation.

Sample positive cases (should match all of these):
- 3a*4b
- 3*2
- 4 * 5 * 6=120

Sample negative cases (shouldn't match either of these):
- 5+3=8 
- 3%2=1


In [27]:
import warnings
warnings.filterwarnings('ignore')

In [28]:
def validate_multiplication(text):
    if re.search(r'\*',text):  # raw string keeps the backslash intact for the regex engine
        return 'True'
    else:
        return 'False'

In [29]:
validate_multiplication('3a*4b')

'True'

In [30]:
validate_multiplication('5+3')

'False'

## Regex flags

Flags modify how the regex engine interprets a pattern. For example, if we want our regex to ignore the case of the text, we can pass the 're.I' flag. Similarly, the re.M flag makes the anchors ‘^’ and ‘$’ (covered later in this notebook) match at the start and end of every line rather than only at the start and end of the whole string, which is useful when the input text has multiple lines. We can pass these flags to the re.search() function. Multiple flags are combined with the ‘|’ operator:

re.search(pattern, string, flags=re.I | re.M)
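
As a small sketch of combining flags, the example below searches a two-line string for a word at the start of a line. The sample text and variable names are assumptions for illustration.

```python
import re

text = 'first line\nData on the second line'

# Without re.M, '^' matches only at the very start of the whole string
single_line = re.search('^data', text, flags=re.I)
# With re.M, '^' also matches right after each newline
multi_line = re.search('^data', text, flags=re.I | re.M)

print(single_line)         # None
print(multi_line.group())  # Data
```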

In [31]:
string = 'Neha is a data Enthusiast'
pattern = 'Data'

result1 = re.search(pattern,string, flags=re.I)

print('Case insensitive search result is {0}'.format(result1))

result2 = re.search(pattern,string)

print('Case sensitive search result is {0}'.format(result2))

Case insensitive search result is <re.Match object; span=(10, 14), match='data'>
Case sensitive search result is None


## re.compile() v/s re.search()

<b>re.compile()</b>

 - <b>Purpose:</b> Creates a compiled regular expression object that can be reused for multiple operations (e.g., searching, matching, finding all occurrences, etc.).

 - <b>Usage:</b> It is used when the same regular expression will be applied multiple times. The regex is compiled once and reused, which can be more efficient; note that the re module also caches recently used patterns, so the gain is often modest.

<b>re.search()</b>

 - <b>Purpose:</b> Searches for the first occurrence of a pattern in a string.
 
 - <b>Usage:</b> It is used for a one-time search in a given string. It directly returns a match object or None.

In [32]:
# Compile the regex
pattern = re.compile('data', flags=re.I)

# Reuse the compiled pattern for multiple operations
result1 = pattern.search("Neha loves Data Science")
result2 = pattern.search("Data drives insights")

print(result1.group())  # Output: Data
print(result2.group())  # Output: Data


Data
Data


In [33]:
# Directly use re.search
result = re.search('data', "Neha loves Data Science", flags=re.I)

if result:
    print(result.group())  # Output: Data


Data


## Anchors

Anchors are used to specify the start and end of the string. 

- <b>'^'</b>

    It specifies the start of the string. The character that follows the ‘^’ in the pattern should be the first character of the string in order for the string to match the pattern.

- <b>'$'</b>

    It specifies the end of the string. The character that precedes the ‘$’ in the pattern should be the last character in the string in order for the string to match the pattern.


Both anchors can be used in a single regular expression. For example, the pattern ‘^01*0$’ will match any string that starts with a zero and ends with a zero, with any number of 1s (including none) between them.
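
The behaviour of the ‘^01*0$’ pattern can be sketched as follows; the list of sample strings is made up for illustration.

```python
import re

pattern = '^01*0$'
samples = ['00', '010', '01110', '0110a', '1010', '0100']

# Map each sample to True/False depending on whether the anchored pattern matches
results = {s: bool(re.search(pattern, s)) for s in samples}

for s, matched in results.items():
    print(s, matched)
```

Only '00', '010' and '01110' match: the other strings either contain an extra character or do not have a zero at both ends with only 1s in between.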

#### Question: Write a pattern that matches all the dictionary words that start with ‘A’

Positive matches (should match all of these):
- Avenger
- Acute
- Altruism

Negative match (shouldn’t match any of these):
- Bribe
- 10
- Zenith

In [34]:
def find_pattern_ignorecase(pattern,text):
    if re.search(pattern,text,flags=re.I):
        return re.search(pattern,text,flags=re.I)
    else:
        return 'Pattern Not Found'

In [35]:
print(find_pattern_ignorecase('^A','Avenger'))
print(find_pattern_ignorecase('^A','acute'))
print(find_pattern_ignorecase('^A','10'))

<re.Match object; span=(0, 1), match='A'>
<re.Match object; span=(0, 1), match='a'>
Pattern Not Found


#### Question: Write a pattern which matches a word that ends with ‘ing’. Words such as ‘playing’, ‘growing’, ‘raining’, etc. should match while words that don’t have ‘ing’ at the end shouldn’t match.

In [36]:
print(find_pattern_ignorecase('ing$','playing'))
print(find_pattern_ignorecase('ing$','sothig'))

<re.Match object; span=(4, 7), match='ing'>
Pattern Not Found


#### Question: Write a regular expression that matches any string that starts with one or more ‘1’s, followed by three or more ‘0’s, followed by any number of ones (zero or more), followed by ‘0’s (from one to seven), and then ends with either two or three ‘1’s.

In [37]:
print(find_pattern('^1{1,}0{3,}1*0{1,7}1{2,3}$','00001100011111'))
print(find_pattern('^1{1,}0{3,}1*0{1,7}1{2,3}$','11000011000111'))

Pattern Not Found!
<re.Match object; span=(0, 14), match='11000011000111'>


## Wildcards

There is one special character in regular expressions that acts as a placeholder and can match any single character in the given input string, except a newline by default. It’s the ‘.’ (dot) character, which is also called the wildcard character. Passing the re.S (re.DOTALL) flag makes it match newlines too.

In [38]:
print(find_pattern_ignorecase('.','adaadad'))
print(find_pattern_ignorecase('.','#'))
print(find_pattern_ignorecase('.','123233'))
print(find_pattern_ignorecase('.',''))

<re.Match object; span=(0, 1), match='a'>
<re.Match object; span=(0, 1), match='#'>
<re.Match object; span=(0, 1), match='1'>
Pattern Not Found
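
One detail worth checking is how the dot treats a newline character. By default it does not match one; the `re.S` (also spelled `re.DOTALL`) flag changes that. A minimal sketch with a made-up input:

```python
import re

text = 'a\nb'

default_dot = re.search('a.b', text)              # '.' does not match '\n'
dotall_dot = re.search('a.b', text, flags=re.S)   # with re.S it does

print(default_dot)             # None
print(dotall_dot is not None)  # True
```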


#### Question: Write a regular expression to match first names (consider only first names, i.e. there are no spaces in a name) that have length between three and fifteen characters.

Sample positive match:
Amandeep
Krishna

Sample negative match:
Balasubrahmanyam

In [39]:
print(find_pattern('^.{3,15}$','Amandeep'))
print(find_pattern('^.{3,15}$','Balasubrahmanyam'))

<re.Match object; span=(0, 8), match='Amandeep'>
Pattern Not Found!


## Character Sets

Say we want to match phone numbers in a large document. We know that the numbers may contain hyphens, a plus symbol, etc. (e.g. +91-9930839123), but they will not contain any letters. We need to somehow specify that we are looking only for digits and a few other symbols, and not letters.

To handle such situations, we can use what are called character sets in regular expression jargon. A character set is written inside square brackets and matches any one of the characters listed in it; a hyphen between two characters, as in [a-c], denotes a range.
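
For the phone number scenario, a rough sketch might look like the following. The pattern and the minimum length of ten characters are assumptions chosen for this example, not a general phone number validator.

```python
import re

text = 'Call me at +91-9930839123 after 5 pm'

# A run of at least ten characters, each a digit, a plus sign or a hyphen.
# Inside the brackets '+' is literal, and a '-' placed last is literal too.
phone_numbers = re.findall('[0-9+-]{10,}', text)

print(phone_numbers)  # ['+91-9930839123']
```

The lone '5' is skipped because it is shorter than ten characters.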

In [40]:
print(find_pattern('[abc]','a'))
print(find_pattern('[a-c]','c'))
print(find_pattern('[a-c]','d'))
print(find_pattern('[a-z]ed','ted'))
print(find_pattern('[a-z]ed','ated'))

<re.Match object; span=(0, 1), match='a'>
<re.Match object; span=(0, 1), match='c'>
Pattern Not Found!
<re.Match object; span=(0, 3), match='ted'>
<re.Match object; span=(1, 4), match='ted'>


In [41]:
# character set with anchors
print(find_pattern('^[a-z]ed','ated'))
print(find_pattern('^[a-z]ed','ted'))

Pattern Not Found!
<re.Match object; span=(0, 3), match='ted'>


#### Note that a quantifier loses its special meaning when it’s present inside the character set. Inside square brackets, it is treated as any other character.

In [42]:
print(find_pattern('[a-z*]ed','ated'))
print(find_pattern('[a-z*]ed','at*ed'))

<re.Match object; span=(1, 4), match='ted'>
<re.Match object; span=(2, 5), match='*ed'>


#### Complement operator in a character set

When ^ is used as the first character inside a character set, it acts as a complement operator, i.e. the set matches any character other than the ones listed inside it.

In [43]:
print(find_pattern('[^abc]','a'))  # no match: the only character is 'a', which is excluded
print(find_pattern('[^abc]','d'))  # match: 'd' is a character other than a, b or c

Pattern Not Found!
<re.Match object; span=(0, 1), match='d'>


### Character sets
| Pattern  | Matches                                                                                    |
|----------|--------------------------------------------------------------------------------------------|
| [abc]    | Matches either an a, b or c character                                                      |
| [abcABC] | Matches either an a, A, b, B, c or C character                                             |
| [a-z]    | Matches any characters between a and z, including a and z                                  |
| [A-Z]    | Matches any characters between A and Z, including A and Z                                  |
| [a-zA-Z] | Matches any letter from a to z, in either lowercase or uppercase                           |
| [0-9]    | Matches any character which is a number between 0 and 9                                    |

## Meta Sequences

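Meta-sequences are a shorthand way to write commonly used character sets in regular expressions. The equivalences below hold exactly for ASCII text; with Python 3 string patterns, \d, \w and \s also match Unicode digits, letters and whitespace unless the re.ASCII flag is passed.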

| Pattern  | Equivalent to    |
|----------|------------------|
| \s       | [ \t\n\r\f\v]    |
| \S       | [^ \t\n\r\f\v]   |
| \d       | [0-9]            |
| \D       | [^0-9]           |
| \w       | [a-zA-Z0-9_]     |
| \W       | [^a-zA-Z0-9_]    |


We can use meta-sequences in two ways:


- We can use them without square brackets. For example, the pattern ‘\w+’ will match one or more word characters (letters, digits or underscores).

- Or we can use them inside square brackets. For example, the pattern ‘[\w]+’ is the same as ‘\w+’. Inside square brackets, meta-sequences are commonly combined with other meta-sequences or characters. For example, ‘[\w\s]+’ matches a run of word characters and whitespace. The square brackets combine the two meta-sequences into one character set.
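
The difference between the two usages can be sketched as follows, using a made-up sample string and raw strings for the patterns.

```python
import re

text = 'Data Science, 2025!'

first_word = re.search(r'\w+', text).group()        # stops at the first space
words_and_spaces = re.search(r'[\w\s]+', text).group()  # stops at the comma
all_words = re.findall(r'\w+', text)

print(first_word)        # Data
print(words_and_spaces)  # Data Science
print(all_words)         # ['Data', 'Science', '2025']
```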

In [44]:
print(find_pattern(r'\s+','Data'))
print(find_pattern(r'\s+','Data Science'))
print(find_pattern(r'\d+','Data'))
print(find_pattern(r'\d+','Data Science in 2025'))
print(find_pattern(r'[\s\d+]','Data Science in 2025'))  # inside [], '+' is a literal plus sign

Pattern Not Found!
<re.Match object; span=(4, 5), match=' '>
Pattern Not Found!
<re.Match object; span=(16, 20), match='2025'>
<re.Match object; span=(4, 5), match=' '>


#### Question: Write a regular expression with the help of meta-sequences that matches usernames of the users of a database. A username starts with one to ten letters, followed by a number that is exactly four digits long.

Sample positive matches:

- sam2340 
- irfann2590 

Sample negative matches:

- 8730 
- bobby9073834 
- sameer728 
- radhagopalaswamy7890 

In [45]:
print(find_pattern(r'^[a-zA-Z]{1,10}\d{4}$','sam2340'))
print(find_pattern(r'^[a-zA-Z]{1,10}\d{4}$','irfann2590'))
print(find_pattern(r'^[a-zA-Z]{1,10}\d{4}$','8730'))
print(find_pattern(r'^[a-zA-Z]{1,10}\d{4}$','bobby9073834'))
print(find_pattern(r'^[a-zA-Z]{1,10}\d{4}$','sameer728'))
print(find_pattern(r'^[a-zA-Z]{1,10}\d{4}$','radhagopalaswamy7890'))

<re.Match object; span=(0, 7), match='sam2340'>
<re.Match object; span=(0, 10), match='irfann2590'>
Pattern Not Found!
Pattern Not Found!
Pattern Not Found!
Pattern Not Found!


## Greedy v/s Non-Greedy Search

When we use a regular expression to match a string, the regex greedily tries to look for the longest pattern possible in the string. For example, when we specify the pattern 'ab{3,5}' to match the string 'abbbbb', it will look for the maximum number of occurrences of 'b' (in this case 5).

This is called a 'greedy approach'. By default, a regular expression is greedy in nature.

There is another approach called the non-greedy approach, also called the lazy approach, where the regex stops looking for the pattern once a particular condition is satisfied.

In [46]:
# Greedy Search
print(find_pattern('ab{3,5}','aabbbbbbc'))

<re.Match object; span=(1, 7), match='abbbbb'>


In [47]:
# Non-Greedy Search - achieved with ?
print(find_pattern('ab{3,5}?','aabbbbbbc'))

<re.Match object; span=(1, 5), match='abbb'>


In [48]:
# In practice, this can be used to search for text in a web page

print(re.search("<.*>","<HTML><TITLE>My Page</TITLE></HTML>"))

<re.Match object; span=(0, 35), match='<HTML><TITLE>My Page</TITLE></HTML>'>


In [49]:
# The sample HTML above is short, so retrieving the entire text is fine, but a real page could have thousands of characters. A lazy search stops at the first closing '>' instead:

print(re.search("<.*?>","<HTML><TITLE>My Page</TITLE></HTML>"))

<re.Match object; span=(0, 6), match='<HTML>'>


#### It is important to not confuse the greedy approach with matching multiple strings in a large piece of text - these are different use cases. Similarly,  the lazy approach is different from matching only the first match.

For example, take the string ‘One batsman among many batsmen.’. If we run the patterns ‘bat*’ and ‘bat*?’ on this text, the pattern ‘bat*’ will match the substring ‘bat’ in ‘batsman’ and ‘bat’ in ‘batsmen’ while the pattern ‘bat*?’ will match the substring ‘ba’ in batsman and ‘ba’ in ‘batsmen’. The pattern ‘bat*’ means look for the term ‘ba’ followed by zero or more ‘t’s so it greedily looks for as many ‘t’s as possible and the search ends at the substring ‘bat’. On the other hand, the pattern ‘bat*?’ will look for as few ‘t’s as possible. Since ‘*’ indicates zero or more, the lazy approach stops the search at ‘ba’.

In [50]:
print(find_pattern('bat*','One batsman among many batsmen.'))

<re.Match object; span=(4, 7), match='bat'>


In [51]:
print(find_pattern('bat*?','One batsman among many batsmen.'))

<re.Match object; span=(4, 6), match='ba'>
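
Since `re.search()` reports only the first match, the claim about both words can be checked with `re.findall()`, which returns every match. A short sketch:

```python
import re

text = 'One batsman among many batsmen.'

greedy = re.findall('bat*', text)   # as many 't's as possible
lazy = re.findall('bat*?', text)    # as few 't's as possible

print(greedy)  # ['bat', 'bat']
print(lazy)    # ['ba', 'ba']
```

Both patterns find two matches; greediness changes only how long each match is, not how many matches are returned.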


## Functions of re

- match() Determine if the regular expression matches at the beginning of the string

- search() Scan through a string, looking for any location where this regular expression matches

- sub() Find all substrings where the regular expression matches and substitute them with the given string

- finditer() Find all substrings where regular expression matches and return them as an iterator

- findall() Find all the substrings where the regular expression matches, and return them as a list
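
To compare the five functions side by side, the sketch below applies each of them to the same input. The sample text and pattern are made up for illustration.

```python
import re

text = 'cat bat rat'
pattern = '[br]at'

at_start = re.match(pattern, text)      # None: the string starts with 'cat'
first = re.search(pattern, text)        # first match anywhere in the string
replaced = re.sub(pattern, 'X', text)   # every match replaced with 'X'
all_matches = re.findall(pattern, text) # list of matched substrings
spans = [m.span() for m in re.finditer(pattern, text)]  # iterator of match objects

print(at_start)        # None
print(first.group())   # bat
print(replaced)        # cat X X
print(all_matches)     # ['bat', 'rat']
print(spans)           # [(4, 7), (8, 11)]
```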

#### re.search() v/s re.match()

In [52]:
def match_pattern(pattern, text):
    if re.match(pattern, text):
        return re.match(pattern, text)
    else:
        return ('Match Not found!')

In [53]:
# using re.search()
print(find_pattern('b+','abbc'))

<re.Match object; span=(1, 3), match='bb'>


In [54]:
# using re.match()
print(match_pattern('b+','abbc')) 

Match Not found!


#### re.sub()

The re.sub() function replaces parts of a string that match a regex pattern. It is useful when the substrings we want to replace follow a pattern that the regex engine can match; re.sub() replaces every such match with the given replacement string.

In [55]:
# Replace text Road with Rd
address = '16D Rajajinagar Main Road'
re.sub('Road','Rd', address)

'16D Rajajinagar Main Rd'

In [56]:
# Lets try with a pattern for above example
address = '16D Rajajinagar Main Road'
re.sub(r'R\w+','Rd', address)

'16D Rd Main Rd'

#### re.finditer() & re.findall()

The match() and search() functions return only one match. But we often need to extract all the matches rather than only the first one, and that's when we use the other functions - findall() and finditer().

The result of the findall() function is a list of all the matches and the finditer() function is used in a 'for' loop to iterate through each separate match one by one.

In [57]:
# Example usage of finditer(). Find all occurrences of word Festival in given sentence

text = 'Diwali is a festival of lights, Holi is a festival of colors!'
pattern = 'festival'

for match in re.finditer(pattern,text):
    print('START -', match.start(), end=" ")
    print('END -', match.end())

START - 12 END - 20
START - 42 END - 50


In [58]:
# using findall
re.findall(pattern,text)

['festival', 'festival']

In [59]:
# Another example usage of findall(). In the given URL find all dates
url = "http://www.telegraph.co.uk/formula-1/2017/10/28/mexican-grand-prix-2017-time-does-start-tv-channel-odds-lewisl/2017/05/12"
date_regex = r'(\d{4})/(\d{1,2})/(\d{1,2})'
print(re.findall(date_regex, url))

[('2017', '10', '28'), ('2017', '05', '12')]


## Regular Expressions Grouping

Sometimes we need to extract sub-patterns out of a larger pattern. This can be done by using grouping. Suppose we have textual data with dates in it and we want to extract only the year from the dates. We can use a regular expression pattern with grouping to match dates and then we can extract the component elements such as the day, month or the year from the date.

Grouping is achieved using parentheses.

In [60]:
result = re.search(date_regex, url)

In [61]:
result.group(0) # returns the entire matched substring

'2017/10/28'

In [62]:
result.group(1)

'2017'

In [63]:
result.group(2)

'10'

In [64]:
result.group(3)

'28'

In [65]:
# Another example
result = re.search(r'(\d{2})-(\d{2})-(\d{4})','I have a flight on 14-09-2018')

In [66]:
result.group(3)

'2018'
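
To pull out only the year from every date in a text, grouping can be combined with `finditer()`. The sketch below also uses the match object's `groups()` method, which returns all captured groups as a tuple. The sample sentence and variable names are assumptions for illustration.

```python
import re

text = 'Flights on 14-09-2018 and 02-01-2019'
date_pattern = r'(\d{2})-(\d{2})-(\d{4})'

# groups() returns every captured group of the first match as a tuple
first = re.search(date_pattern, text)
day, month, year = first.groups()

# group(3) is the third pair of parentheses, i.e. the year
years = [m.group(3) for m in re.finditer(date_pattern, text)]

print(day, month, year)  # 14 09 2018
print(years)             # ['2018', '2019']
```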