### The most common uses of regular expressions are:
* Searching a string (search and match)
* Finding all occurrences of a pattern (findall)
* Breaking a string into substrings (split)
* Replacing part of a string (sub)

### The various methods of Regular Expressions

* re.match()
* re.search()
* re.findall()
* re.split()
* re.sub()
* re.compile()

re.match() finds a match only if the pattern occurs at the start of the string; otherwise it returns None.

In [1]:
import re
result = re.match(r'AV', 'AV Analytics Vidhya AV')
print(result)

<re.Match object; span=(0, 2), match='AV'>


In [6]:
# to view the matched string, use the group() method
print(result.group(0))

AV


In [13]:
# The start() and end() methods give the start and end positions
# of the matched pattern in the string
print(result.start())
print(result.end())

0
2
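
Because match() only looks at the start of the string, it is worth checking the result before calling group(), start() or end(). A minimal sketch, reusing the sample string from above (the variable names `miss` and `hit` are only illustrative):

```python
import re

text = 'AV Analytics Vidhya AV'

# 'Analytics' is in the string, but not at position 0, so match() gives None
miss = re.match(r'Analytics', text)
print(miss)  # None

# guard against None before using the match object
hit = re.match(r'AV', text)
if hit is not None:
    print(hit.group(0), hit.span())  # AV (0, 2)
```

Calling `miss.group(0)` here would raise an AttributeError, which is why the None check comes first.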


The search() method can find a pattern at any position in the string, but it returns only the first occurrence of the pattern.

In [15]:
# using re.search(pattern, string)
# unlike match(), this is not restricted to the beginning of the string
result = re.search(r'Vidhya', 'AV Analytics Vidhya AV ')
print(result.group(0))

Vidhya
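
The difference between match() and search() is easiest to see when the same pattern is passed to both. A small sketch on the sample string used above (variable names are illustrative):

```python
import re

text = 'AV Analytics Vidhya AV'

# 'Vidhya' is not at the start, so match() finds nothing
at_start = re.match(r'Vidhya', text)
print(at_start)  # None

# search() scans the whole string and reports where the pattern was found
found = re.search(r'Vidhya', text)
print(found.span())  # (13, 19)

# 'AV' occurs twice, but search() stops at the first occurrence
first_av = re.search(r'AV', text)
print(first_av.span())  # (0, 2)
```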


Using re.findall(pattern, string): it returns a list of all non-overlapping matches. It is not restricted to the start or end of the string, so it can cover the cases handled by search() and match() when only the matched text is needed.

In [16]:
result = re.findall(r'AV', 'AV Analytics Vidhya AV')
print(result)

['AV', 'AV']
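
findall() returns only the matched text. When the positions are needed as well, re.finditer() yields a match object for each occurrence. A short sketch on the same string (the `positions` list is just one way to collect the results):

```python
import re

text = 'AV Analytics Vidhya AV'

# each item from finditer() is a match object with group(), start() and end()
positions = [(m.group(0), m.start(), m.end()) for m in re.finditer(r'AV', text)]
print(positions)  # [('AV', 0, 2), ('AV', 20, 22)]
```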


Using re.split(pattern, string, maxsplit=0):
This method splits the string at each occurrence of the given pattern and returns the pieces as a list.



In [17]:
result = re.split(r'y','Analytics')
print(result)

['Anal', 'tics']


More operations on split()

In [3]:
result = re.split(r'i', 'Analytics Vidhya')
print(result)

['Analyt', 'cs V', 'dhya']


In [4]:
# using the maxsplit 
result = re.split(r'i', 'Analytics Vidhya', maxsplit=1)
result

['Analyt', 'cs Vidhya']
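
Since the first argument is a pattern rather than a fixed string, split() can also break on several delimiters at once. A minimal sketch, assuming a made-up input that mixes commas and semicolons:

```python
import re

# split on a comma or a semicolon, together with any spaces that follow it
parts = re.split(r'[,;]\s*', 'AV, Analytics;Vidhya,  AV')
print(parts)  # ['AV', 'Analytics', 'Vidhya', 'AV']

# wrapping the pattern in () keeps the delimiter in the result
kept = re.split(r'(i)', 'Analytics')
print(kept)  # ['Analyt', 'i', 'cs']
```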

Using the re.sub(pattern, repl, string) method:
It searches for a pattern and replaces each match with a new substring. If the pattern is not found, the string is returned unchanged.

In [5]:
result = re.sub(r'India', 'the World', 'AV is largest Analytics Community of India')
result

'AV is largest Analytics Community of the World'
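
By default sub() replaces every match. The sketch below shows the optional `count` argument, the unchanged result when nothing matches, and re.subn(), which also reports how many replacements were made. The replacement text 'XX' is an arbitrary placeholder:

```python
import re

text = 'AV Analytics Vidhya AV'

# every occurrence is replaced by default
all_replaced = re.sub(r'AV', 'XX', text)
print(all_replaced)  # XX Analytics Vidhya XX

# count=1 limits the replacement to the first occurrence
first_only = re.sub(r'AV', 'XX', text, count=1)
print(first_only)  # XX Analytics Vidhya AV

# no match: the original string comes back unchanged
untouched = re.sub(r'India', 'the World', text)
print(untouched == text)  # True

# subn() returns the new string and the number of replacements
new_text, n = re.subn(r'AV', 'XX', text)
print(n)  # 2
```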

Understanding the re.compile() method:
It compiles a regular expression pattern into a pattern object that can be used for pattern matching. This lets us reuse the same pattern without rewriting it.

In [6]:
# defining the pattern - 'AV'
pattern = re.compile(r'AV')
result = pattern.findall('AV Analytics Vidhya AV')
print(result)


['AV', 'AV']


In [8]:
result2 = pattern.findall('AV is largest analytics community of India')
print(result2)

['AV']
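
A compiled pattern object offers the same operations as the module-level functions (findall, search, sub and so on), and flags can be fixed once at compile time. A small sketch, using re.IGNORECASE as an example flag and a lowercase pattern chosen only for illustration:

```python
import re

# the flag is stored with the pattern, so every later call ignores case
pattern = re.compile(r'av', re.IGNORECASE)

matches = pattern.findall('AV Analytics Vidhya av')
print(matches)  # ['AV', 'av']

replaced = pattern.sub('**', 'AV Analytics Vidhya AV')
print(replaced)  # ** Analytics Vidhya **

nothing = pattern.search('Analytics Vidhya')
print(nothing)  # None
```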


Working with pattern operators

In [2]:
# using the '.' operator: it matches any single character except newline \n
# this returns every character, including the spaces
result = re.findall(r'.','AV is largest Analytics Community of India')
print(result)

['A', 'V', ' ', 'i', 's', ' ', 'l', 'a', 'r', 'g', 'e', 's', 't', ' ', 'A', 'n', 'a', 'l', 'y', 't', 'i', 'c', 's', ' ', 'C', 'o', 'm', 'm', 'u', 'n', 'i', 't', 'y', ' ', 'o', 'f', ' ', 'I', 'n', 'd', 'i', 'a']


Note: the match above also extracts the spaces. To avoid that, use the \w operator instead of the '.' operator.


In [6]:
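# returning each character in the string, without the spaces
# \w matches a single alphanumeric character or underscore
result = re.findall(r'\w','AV is largest Analytics Community of India 34')
print(result)

['A', 'V', 'i', 's', 'l', 'a', 'r', 'g', 'e', 's', 't', 'A', 'n', 'a', 'l', 'y', 't', 'i', 'c', 's', 'C', 'o', 'm', 'm', 'u', 'n', 'i', 't', 'y', 'o', 'f', 'I', 'n', 'd', 'i', 'a', '3', '4']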


Let's extract each word using the '*' or '+' operator

* '*': matches 0 or more occurrences of the pattern to its left
* '+': matches 1 or more occurrences of the pattern to its left


In [4]:
result = re.findall(r'\w*','AV is largest Analytics Community of India')
print(result)

['AV', '', 'is', '', 'largest', '', 'Analytics', '', 'Community', '', 'of', '', 'India', '']


In [5]:
# the '' entries above are empty matches; use the + operator to drop them
result = re.findall(r'\w+','AV is largest Analytics community of India')
print(result)

['AV', 'is', 'largest', 'Analytics', 'community', 'of', 'India']


In [7]:
# returning the first word using the ^ operator
result = re.findall(r'^\w+','AV is largest Analytics community of India')
print(result)

['AV']


In [7]:
# returning the last word using the $ operator
result = re.findall(r'\w+$','AV is largest Analytics community of India')
print(result)

['India']
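
By default ^ and $ anchor to the start and end of the whole string. With the re.MULTILINE flag they anchor to each line instead. A brief sketch, assuming a made-up three-line version of the sample sentence:

```python
import re

# hypothetical multi-line input, used only for this illustration
text = 'AV is largest\nAnalytics community\nof India'

# without the flag, ^ matches only at the very start of the string
first_word = re.findall(r'^\w+', text)
print(first_word)  # ['AV']

# with re.MULTILINE, ^ and $ match at the start and end of every line
line_starts = re.findall(r'^\w+', text, flags=re.MULTILINE)
print(line_starts)  # ['AV', 'Analytics', 'of']

line_ends = re.findall(r'\w+$', text, flags=re.MULTILINE)
print(line_ends)  # ['largest', 'community', 'India']
```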


### Let's return the first two characters of each word

In [8]:
# extracting consecutive two characters of each word
# excluding spaces(using the '\w')
result = re.findall(r'\w\w','AV is largest Analytics community of India')
print(result)

['AV', 'is', 'la', 'rg', 'es', 'An', 'al', 'yt', 'ic', 'co', 'mm', 'un', 'it', 'of', 'In', 'di']


In [9]:
# extract only the two characters at the start of each word
# using the '\b' (word boundary) operator
result = re.findall(r'\b\w.','AV is largest Analytics community of India')
print(result)

['AV', 'is', 'la', 'An', 'co', 'of', 'In']
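
One thing to keep in mind with `\b\w.` is that '.' matches any character, so a one-letter word is returned together with the space after it. A short sketch on a made-up string; `\w{1,2}` is shown as one possible alternative, not the only one:

```python
import re

# hypothetical input containing a one-letter word
text = 'a big AV'

# '.' happily matches the space that follows 'a'
with_dot = re.findall(r'\b\w.', text)
print(with_dot)  # ['a ', 'bi', 'AV']

# \w{1,2} takes one or two word characters and never a space
with_range = re.findall(r'\b\w{1,2}', text)
print(with_range)  # ['a', 'bi', 'AV']
```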


### Return the domain of the given email IDs


In [11]:
# Extract "@" and the word characters that follow it (\w+ stops at the first '.')
result = re.findall(r'@\w+','abc.test@gmail.com, xyz@test.in, test.first@analyticsvidhya.com, first.test@rest.biz')
print(result)

['@gmail', '@test', '@analyticsvidhya', '@rest']


In [12]:
# Extract "@", the domain name, and the suffix after the dot (.com, .in, ...)
# the dot is escaped as \. so that it matches only a literal '.'
result = re.findall(r'@\w+\.\w+','abc.test@gmail.com, xyz@test.in, test.first@analyticsvidhya.com, first.test@rest.biz')
print(result)

['@gmail.com', '@test.in', '@analyticsvidhya.com', '@rest.biz']


In [13]:
# extract only the domain type (com, in, biz) using the () operator
# findall() returns just the text captured by the group
result = re.findall(r'@\w+\.(\w+)','abc.test@gmail.com, xyz@test.in, test.first@analyticsvidhya.com, first.test@rest.biz')
print(result)

['com', 'in', 'com', 'biz']
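
In patterns like these, the '.' between the domain name and its suffix is worth escaping as `\.`, because a bare '.' matches any character, including a space. A small sketch on a made-up string that has no dot after the domain:

```python
import re

# hypothetical input: the address is cut off and followed by ordinary words
text = 'mail xyz@test in progress'

# the bare '.' matches the space, so unrelated text is picked up
loose = re.findall(r'@\w+.\w+', text)
print(loose)  # ['@test in']

# with \. a literal dot is required, so nothing matches here
strict = re.findall(r'@\w+\.\w+', text)
print(strict)  # []

# on well-formed addresses the escaped pattern still finds every domain
emails = 'abc.test@gmail.com, xyz@test.in, test.first@analyticsvidhya.com, first.test@rest.biz'
domains = re.findall(r'@\w+\.\w+', emails)
print(domains)  # ['@gmail.com', '@test.in', '@analyticsvidhya.com', '@rest.biz']
```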


In [14]:
# Return the dates from the given string
# here we use the \d operator to match digits; {n} repeats it exactly n times
result = re.findall(r'\d{2}-\d{2}-\d{4}','Amit 34-3456 12-05-2007, XYZ 56-4532 11-11-2011, ABC 67-8945 12-01-2009')

In [15]:
print(result)

['12-05-2007', '11-11-2011', '12-01-2009']


In [16]:
# to extract only the year using the () operator
result = re.findall(r'\d{2}-\d{2}-(\d{4})','Amit 34-3456 12-05-2007, XYZ 56-4532 11-11-2011, ABC 67-8945 12-01-2009')
print(result)

['2007', '2011', '2009']
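
When a pattern contains more than one group, findall() returns a list of tuples with one entry per group. A minimal sketch on the same string; treating the fields as day, month and year is an assumption about the data made only to name the variables:

```python
import re

text = 'Amit 34-3456 12-05-2007, XYZ 56-4532 11-11-2011, ABC 67-8945 12-01-2009'

# three groups, so each match becomes a 3-tuple of strings
dates = re.findall(r'(\d{2})-(\d{2})-(\d{4})', text)
print(dates)  # [('12', '05', '2007'), ('11', '11', '2011'), ('12', '01', '2009')]

# the captured pieces are strings; convert them if numbers are needed
years = [int(year) for _day, _month, year in dates]
print(years)  # [2007, 2011, 2009]
```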


In [17]:
# Return words that start with a vowel (using [])
# without a word boundary, this also matches from a vowel in the middle of a word
result = re.findall(r'[aeiouAEIOU]\w+','AV is largest Analytics community of India')
print(result)

['AV', 'is', 'argest', 'Analytics', 'ommunity', 'of', 'India']


In [11]:
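# adding \b anchors the match to the start of a word,
# so only words that begin with a vowel are returned
result = re.findall(r'\b[aeiouAEIOU]\w+','AV is largest Analytics community of India')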
print(result)

['AV', 'is', 'Analytics', 'of', 'India']
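
The same idea can be turned around with a negated character class, `[^...]`, to keep the words that do not start with a vowel. A short sketch; `\W` is included inside the brackets so that spaces and punctuation are excluded as well:

```python
import re

text = 'AV is largest Analytics community of India'

# [^aeiouAEIOU\W] means: a word character that is not a vowel
consonant_words = re.findall(r'\b[^aeiouAEIOU\W]\w+', text)
print(consonant_words)  # ['largest', 'community']
```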
