In [1]:
import re


# re.match(pattern, string):

This method finds a match only if it occurs at the start of the string. For example, calling match() on the string ‘AV Analytics Vidhya AV’ with the pattern ‘AV’ will match. However, if we look for ‘Analytics’ instead, the pattern will not match, because the string does not start with it. Let’s try it in Python now.

In [2]:
result = re.match(r'AV', 'AV Analytics Vidhya AV')
print(result)

<_sre.SRE_Match object at 0x00000000063F67E8>


The output above shows that a match has been found. To print the matched text, we use the group() method of the match object. The “r” prefix on the pattern string marks it as a Python raw string, so backslashes in the pattern are not treated as string escapes.

In [3]:
print (result.group(0))

AV


Let’s now look for ‘Analytics’ in the same string. The string does not start with ‘Analytics’, so match() should return no match. Let’s see what we get:

In [4]:
result = re.match(r'Analytics','AV Analytics Vidhya AV')
print(result)

None
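
Because match() returns None when the pattern is not found at the start, calling group() directly on that result would raise an AttributeError. The following is a minimal sketch of guarding against that; the helper name `first_match` is hypothetical and not part of the re module.

```python
import re

def first_match(pattern, text):
    """Return the text matched at the start of `text`, or None if there is no match."""
    m = re.match(pattern, text)
    # m is None when the pattern does not match at position 0
    return m.group(0) if m is not None else None

print(first_match(r'AV', 'AV Analytics Vidhya AV'))         # AV
print(first_match(r'Analytics', 'AV Analytics Vidhya AV'))  # None
```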


# re.search(pattern, string):

It is similar to match(), but it does not restrict us to finding matches at the beginning of the string. Unlike the previous method, searching for the pattern ‘Analytics’ here will return a match.

In [5]:
result = re.search(r'Analytics','AV Analytics Vidhya AV')
print(result.group(0))



Analytics


Here you can see that search() is able to find a pattern at any position in the string, but it returns only the first occurrence of the pattern.
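
To see where the match was found, the match object also exposes span(), start() and end(). A small sketch using the same sample string:

```python
import re

text = 'AV Analytics Vidhya AV'

# 'AV' occurs twice, but search() reports only the first occurrence
m = re.search(r'AV', text)
print(m.group(0))  # AV
print(m.span())    # (0, 2)

# 'Analytics' is found even though it is not at the start of the string
m2 = re.search(r'Analytics', text)
print(m2.start(), m2.end())  # 3 12
```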

# re.findall(pattern, string):

It returns a list of all non-overlapping matches of the pattern. It is not constrained to the start or end of the string. If we use findall() to search for ‘AV’ in the given string, it will return both occurrences of ‘AV’. When searching a string, I would recommend re.findall() as a default choice, since it can cover the work of both re.search() and re.match(). Note, however, that it returns a list of strings rather than match objects.

In [7]:
result = re.findall(r'AV','AV Analytics Vidhya AV')
print(result)

['AV', 'AV']
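
As a rough illustration of how findall() can stand in for the other two methods: taking the first element of the list mimics search(), and anchoring the pattern with “^” mimics match(). This is only a sketch; the variable names are illustrative.

```python
import re

text = 'AV Analytics Vidhya AV'

# All occurrences, as a list of strings
all_hits = re.findall(r'AV', text)
print(all_hits)  # ['AV', 'AV']

# search()-like behaviour: keep only the first hit, or None if the list is empty
first_hit = all_hits[0] if all_hits else None
print(first_hit)  # AV

# match()-like behaviour: anchor the pattern at the start with ^
at_start = re.findall(r'^AV', text)
not_at_start = re.findall(r'^Analytics', text)
print(at_start)      # ['AV']
print(not_at_start)  # []
```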


# re.sub(pattern, repl, string):

It searches for a pattern and replaces every occurrence with a new substring. If the pattern is not found, the string is returned unchanged.

In [8]:
result = re.sub('India','the World', 'AV is largest Analytics community of India')
print(result)

AV is largest Analytics community of the World
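
The two behaviours described above can be checked directly. The sketch below also uses the optional `count` argument of re.sub() to limit the number of replacements; the pattern ‘Brazil’ is just an arbitrary string that does not occur in the text.

```python
import re

text = 'AV is largest Analytics community of India'

# Pattern not present: the original string comes back unchanged
unchanged = re.sub(r'Brazil', 'the World', text)
print(unchanged == text)  # True

# count=1 replaces only the first occurrence of the pattern
first_only = re.sub(r'a', 'A', text, count=1)
print(first_only)  # AV is lArgest Analytics community of India
```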


# re.compile(pattern):

We can compile a regular expression pattern into a pattern object, which can then be used for matching. It also lets us reuse the same pattern without rewriting it.

In [9]:
pattern = re.compile('AV')
result = pattern.findall('AV Analytics Vidhya AV')
print(result)

['AV', 'AV']


In [10]:
result2 = pattern.findall('AV is largest Analytics community of India')
print(result2)

['AV']
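
A compiled pattern object offers the same operations as the module-level functions (search, findall, sub and so on), with the pattern already bound. A short sketch reusing one compiled pattern for several operations:

```python
import re

pattern = re.compile(r'AV')
text = 'AV Analytics Vidhya AV'

# The same compiled pattern is reused for different operations
print(pattern.search(text).span())  # (0, 2)
print(pattern.findall(text))        # ['AV', 'AV']

replaced = pattern.sub('Analytics Vidhya', 'AV is a community')
print(replaced)  # Analytics Vidhya is a community
```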


# Quick Recap of various methods:

So far, we have looked at various regular expression methods using a constant pattern (fixed characters). But what if we do not have a constant search pattern and want to return a specific set of characters (defined by a rule) from a string? Don’t be intimidated.

This can easily be solved by defining an expression with the help of pattern operators (meta and literal characters). Let’s look at the most common pattern operators.

# What are the most commonly used operators?

Regular expressions can specify patterns, not just fixed characters. Here are the most commonly used operators that help build an expression to represent the required characters in a string or file. They are commonly used in web scraping and text mining to extract required information.

| Operators | Description |
|---|---|
| `.` | Matches any single character except newline ‘\n’. |
| `?` | Matches 0 or 1 occurrence of the pattern to its left. |
| `+` | Matches 1 or more occurrences of the pattern to its left. |
| `*` | Matches 0 or more occurrences of the pattern to its left. |
| `\w` | Matches an alphanumeric character or underscore, whereas `\W` (upper case W) matches any other character. |
| `\d` | Matches digits [0-9], and `\D` (upper case D) matches non-digits. |
| `\s` | Matches a single whitespace character (space, newline, return, tab, form feed), and `\S` (upper case S) matches any non-whitespace character. |
| `\b` | Matches the boundary between a word and a non-word character, and `\B` is the opposite of `\b`. |
| `[..]` | Matches any single character in the square brackets, and `[^..]` matches any single character not in the square brackets. |
| `\` | Escapes characters with a special meaning, like `\.` to match a period or `\+` to match a plus sign. |
| `^` and `$` | `^` and `$` match the start and the end of the string respectively. |
| `{n,m}` | Matches at least n and at most m occurrences of the preceding expression. Written as `{,m}`, it matches from 0 up to m occurrences. |
| `a\|b` | Matches either a or b. |
| `( )` | Groups regular expressions and returns the matched text. |
| `\t`, `\n`, `\r` | Match tab, newline, and return. |
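
A few of these operators in action. This is only an illustrative sketch; the sample strings are made up for the demonstration.

```python
import re

# ? makes the preceding 'u' optional
optional = re.findall(r'colou?r', 'color colour')
print(optional)  # ['color', 'colour']

# {2,3} matches runs of two or three digits (greedy, so '4444' yields '444')
ranged = re.findall(r'\d{2,3}', '1 22 333 4444')
print(ranged)  # ['22', '333', '444']

# | matches either alternative
either = re.findall(r'cat|dog', 'cat, dog, bird')
print(either)  # ['cat', 'dog']

# \S+ matches runs of non-whitespace, whatever the separator is
tokens = re.findall(r'\S+', 'a\tb\nc d')
print(tokens)  # ['a', 'b', 'c', 'd']

# \. matches a literal period
decimals = re.findall(r'\d+\.\d+', 'pi is 3.14, e is 2.71')
print(decimals)  # ['3.14', '2.71']
```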

# Some Examples of Regular Expressions

# Problem 1: Return the first word of a given string

In [12]:
import re
# . matches any single character except newline '\n'
result = re.findall(r'.', 'AV is largest Analytics community of India')
print(result)

['A', 'V', ' ', 'i', 's', ' ', 'l', 'a', 'r', 'g', 'e', 's', 't', ' ', 'A', 'n', 'a', 'l', 'y', 't', 'i', 'c', 's', ' ', 'c', 'o', 'm', 'm', 'u', 'n', 'i', 't', 'y', ' ', 'o', 'f', ' ', 'I', 'n', 'd', 'i', 'a']


Above, the spaces are extracted as well. To avoid that, use “\w” instead of “.”.

In [14]:
# \w matches an alphanumeric character or underscore, i.e. [a-zA-Z0-9_]
result = re.findall(r'\w','AV is the largest community of India')
print(result)

['A', 'V', 'i', 's', 't', 'h', 'e', 'l', 'a', 'r', 'g', 'e', 's', 't', 'c', 'o', 'm', 'm', 'u', 'n', 'i', 't', 'y', 'o', 'f', 'I', 'n', 'd', 'i', 'a']


Solution-2: Extract each word (using “*” or “+”)

In [15]:
# * matches 0 or more occurrences of the pattern to its left
result = re.findall(r'\w*','AV is the largest community of India')
print(result)

['AV', '', 'is', '', 'the', '', 'largest', '', 'community', '', 'of', '', 'India', '']


In [16]:
# + matches 1 or more occurrences of the pattern to its left
result = re.findall(r'\w+','AV is the largest community of India')
print(result)

['AV', 'is', 'the', 'largest', 'community', 'of', 'India']


In [17]:
result = re.findall(r'^\w+','AV is the largest community of India')
print(result)

['AV']


In [18]:
result = re.findall(r'\w+$','AV is the largest community of India')
print(result)

['India']
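
By default “^” and “$” anchor to the start and end of the whole string. If the text spans several lines, the re.MULTILINE flag makes them match at each line instead. A small sketch, assuming a two-line variant of the sample sentence:

```python
import re

text = 'AV is the largest\ncommunity of India'

# Default: ^ and $ refer to the whole string
first_word = re.findall(r'^\w+', text)
last_word = re.findall(r'\w+$', text)
print(first_word)  # ['AV']
print(last_word)   # ['India']

# With re.MULTILINE: ^ and $ also match at the start and end of each line
first_per_line = re.findall(r'^\w+', text, flags=re.MULTILINE)
last_per_line = re.findall(r'\w+$', text, flags=re.MULTILINE)
print(first_per_line)  # ['AV', 'community']
print(last_per_line)   # ['largest', 'India']
```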


# Problem 2: Return the first two characters of each word

In [19]:
result=re.findall(r'\w\w','AV is largest Analytics community of India')
print(result)

['AV', 'is', 'la', 'rg', 'es', 'An', 'al', 'yt', 'ic', 'co', 'mm', 'un', 'it', 'of', 'In', 'di']


Extract the two consecutive characters at the start of each word, using the word boundary “\b”.

In [20]:
result=re.findall(r'\b\w.','AV is largest Analytics community of India')
print(result)

['AV', 'is', 'la', 'An', 'co', 'of', 'In']
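
One thing to keep in mind is that “.” matches any character, including a space, so the pattern `\b\w.` can pick up a trailing space after a one-letter word. A sketch comparing it with `\b\w\w` on a made-up sentence:

```python
import re

text = 'I am here'

# '.' also matches the space that follows the one-letter word 'I'
with_dot = re.findall(r'\b\w.', text)
print(with_dot)  # ['I ', 'am', 'he']

# Requiring two word characters skips one-letter words entirely
two_word_chars = re.findall(r'\b\w\w', text)
print(two_word_chars)  # ['am', 'he']
```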


# Problem 3: Return the domain type of given email-ids

Extract all characters after “@”

In [21]:
result=re.findall(r'@\w+','abc.test@gmail.com, xyz@test.in, test.first@analyticsvidhya.com, first.test@rest.biz')
print(result)

['@gmail', '@test', '@analyticsvidhya', '@rest']


Above, you can see that the “.com” and “.in” parts are not extracted. To include them, we extend the pattern, escaping the dot as “\.” so that it matches a literal period.

In [22]:
result=re.findall(r'@\w+\.\w+','abc.test@gmail.com, xyz@test.in, test.first@analyticsvidhya.com, first.test@rest.biz')
print(result)

['@gmail.com', '@test.in', '@analyticsvidhya.com', '@rest.biz']


Extract only the domain type using “( )”

In [None]:
result=re.findall(r'@\w+\.(\w+)','abc.test@gmail.com, xyz@test.in, test.first@analyticsvidhya.com, first.test@rest.biz')
print(result)

['com', 'in', 'com', 'biz']

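
An unescaped “.” matches any character, not only a literal period, so a pattern such as `@\w+.\w+` can accept malformed input. The sketch below contrasts it with the escaped form `\.`; the third address is a made-up malformed example.

```python
import re

emails = 'abc.test@gmail.com, xyz@test.in, oops@gmail com'

# Unescaped dot: the space in 'gmail com' is accepted as well
loose = re.findall(r'@\w+.\w+', emails)
print(loose)  # ['@gmail.com', '@test.in', '@gmail com']

# Escaped dot: only a literal period is accepted
strict = re.findall(r'@\w+\.\w+', emails)
print(strict)  # ['@gmail.com', '@test.in']

# With a group, findall() returns only the captured domain type
domain_types = re.findall(r'@\w+\.(\w+)', emails)
print(domain_types)  # ['com', 'in']
```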

# Problem 4: Return date from given string

Here we will use “\d” to extract digits.

In [24]:
result=re.findall(r'\d{2}-\d{2}-\d{4}','Amit 34-3456 12-05-2007, XYZ 56-4532 11-11-2011, ABC 67-8945 12-01-2009')
print(result)

['12-05-2007', '11-11-2011', '12-01-2009']


If you want to extract only the year, parentheses “( )” will again help you.

In [26]:
result=re.findall(r'\d{2}-\d{2}-(\d{4})','Amit 34-3456 12-05-2007, XYZ 56-4532 11-11-2011, ABC 67-8945 12-01-2009')
print(result)

['2007', '2011', '2009']
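
When the pattern contains more than one group, findall() returns a list of tuples, one element per group. A sketch that captures day, month and year separately (the variable names are illustrative):

```python
import re

text = 'Amit 34-3456 12-05-2007, XYZ 56-4532 11-11-2011, ABC 67-8945 12-01-2009'

# Three groups: each match becomes a (day, month, year) tuple
parts = re.findall(r'(\d{2})-(\d{2})-(\d{4})', text)
print(parts)  # [('12', '05', '2007'), ('11', '11', '2011'), ('12', '01', '2009')]

# The captured values are strings; convert them if numbers are needed
years = [int(year) for _, _, year in parts]
print(years)  # [2007, 2011, 2009]
```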


# Problem 5: Return all words of a string that start with a vowel

In [27]:
result=re.findall(r'\w+','AV is largest Analytics community of India')
print(result)

['AV', 'is', 'largest', 'Analytics', 'community', 'of', 'India']


Return words that start with a vowel (using [])

In [28]:
# [aeiouAEIOU]\w+ matches one of the letters in the square brackets followed by one or more word characters
result=re.findall(r'[aeiouAEIOU]\w+','AV is largest Analytics community of India')
print(result)

['AV', 'is', 'argest', 'Analytics', 'ommunity', 'of', 'India']


Above, you can see that it has returned “argest” and “ommunity” from the middle of words. To drop these two, we need to use “\b” for the word boundary.

In [29]:
result=re.findall(r'\b[aeiouAEIOU]\w+','AV is largest Analytics community of India')
print(result)

['AV', 'is', 'Analytics', 'of', 'India']


In a similar way, we can extract words that start with a consonant by using “^” within the square brackets.

In [30]:
result=re.findall(r'\b[^aeiouAEIOU]\w+','AV is largest Analytics community of India')
print(result)

[' is', ' largest', ' Analytics', ' community', ' of', ' India']


Above, you can see that it has returned words starting with a space. To drop these from the output, include a space inside the square brackets [].

In [31]:
result=re.findall(r'\b[^aeiouAEIOU ]\w+','AV is largest Analytics community of India')
print(result)

['largest', 'community']
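
A negated class such as `[^aeiouAEIOU ]` still accepts digits and punctuation as a first character. An alternative sketch lists the consonants explicitly and uses the re.IGNORECASE flag so that only lower-case letters need to be written; here “y” is treated as a consonant, which is an assumption of this example.

```python
import re

text = 'AV is largest Analytics community of India'

# Words starting with a vowel, case-insensitive
vowel_words = re.findall(r'\b[aeiou]\w*', text, flags=re.IGNORECASE)
print(vowel_words)  # ['AV', 'is', 'Analytics', 'of', 'India']

# Words starting with a consonant: letter ranges that skip a, e, i, o, u
consonant_words = re.findall(r'\b[b-df-hj-np-tv-z]\w*', text, flags=re.IGNORECASE)
print(consonant_words)  # ['largest', 'community']
```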


# Problem 6: Validate a phone number (phone number must be of 10 digits and start with 8 or 9)

We have a list of phone numbers in the list “li”, and here we will validate them using regular expressions.

In [33]:
import re
li=['9999999999','999999-999','99999x9999']

for val in li:
    # match() anchors only at the start, so the length check rejects longer strings
    if re.match(r'[89][0-9]{9}', val) and len(val) == 10:
        print('Yes')
    else:
        print('No')

Yes
No
No
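
The separate length check is needed because match() only anchors the start of the string. As an alternative sketch, re.fullmatch() (available in Python 3.4 and later) requires the whole string to match, so the length test becomes unnecessary. The helper name `is_valid_phone` and the two extra test numbers are illustrative.

```python
import re

def is_valid_phone(number):
    """True if `number` is exactly 10 digits and starts with 8 or 9."""
    # fullmatch() succeeds only when the entire string matches the pattern
    return re.fullmatch(r'[89][0-9]{9}', number) is not None

numbers = ['9999999999', '999999-999', '99999x9999', '7999999999', '99999999999']
results = [is_valid_phone(n) for n in numbers]
print(results)  # [True, False, False, False, False]
```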


# Problem 8: Retrieve Information from HTML file

I want to extract information from an HTML file (see the sample data below). Here we need to extract the information available between <td> and </td>, except the first numerical index. I have assumed here that the HTML code below is stored in a string str.

In [34]:
str ='''<tr align="center"><td>1</td> <td>Noah</td> <td>Emma</td></tr>
<tr align="center"><td>2</td> <td>Liam</td> <td>Olivia</td></tr>
<tr align="center"><td>3</td> <td>Mason</td> <td>Sophia</td></tr>
<tr align="center"><td>4</td> <td>Jacob</td> <td>Isabella</td></tr>
<tr align="center"><td>5</td> <td>William</td> <td>Ava</td></tr>
<tr align="center"><td>6</td> <td>Ethan</td> <td>Mia</td></tr>
<tr align="center"><td>7</td> <td HTML>Michael</td> <td>Emily</td></tr>'''

In [36]:
result=re.findall(r'<td>\w+</td>\s<td>(\w+)</td>\s<td>(\w+)</td>',str)
print(result)

[('Noah', 'Emma'), ('Liam', 'Olivia'), ('Mason', 'Sophia'), ('Jacob', 'Isabella'), ('William', 'Ava'), ('Ethan', 'Mia')]
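
Note that the seventh row is missing from the result: its second cell opens with `<td HTML>`, which the literal `<td>` in the pattern does not match. A more tolerant sketch allows optional attributes inside the opening tag with `[^>]*`. Regular expressions remain brittle for HTML in general, so treat this as an illustration rather than a robust parser; only the last two rows of the sample are used here.

```python
import re

rows = '''<tr align="center"><td>6</td> <td>Ethan</td> <td>Mia</td></tr>
<tr align="center"><td>7</td> <td HTML>Michael</td> <td>Emily</td></tr>'''

# Literal <td>: the row containing <td HTML> is skipped
strict = re.findall(r'<td>\w+</td>\s<td>(\w+)</td>\s<td>(\w+)</td>', rows)
print(strict)  # [('Ethan', 'Mia')]

# <td[^>]*> also accepts anything between 'td' and the closing '>'
tolerant = re.findall(
    r'<td[^>]*>\w+</td>\s<td[^>]*>(\w+)</td>\s<td[^>]*>(\w+)</td>', rows)
print(tolerant)  # [('Ethan', 'Mia'), ('Michael', 'Emily')]
```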
