In [1]:
import re

The most common uses of regular expressions are:

    1. Searching a string for a pattern (search and match)
    2. Finding every occurrence of a pattern (findall)
    3. Breaking a string into substrings (split)
    4. Replacing part of a string (sub)


The 're' package provides several functions for querying an input string. Here are the most commonly used ones:
  1. re.match()
  2. re.search()
  3. re.findall()
  4. re.split()
  5. re.sub()
  6. re.compile()
  

### re.match(pattern, string): 
This function returns a match only if the pattern occurs at the start of the string.

In [2]:
result = re.match(r'AV', 'AV Avishek Benarji')
print(result)

<re.Match object; span=(0, 2), match='AV'>


The output shows that a match was found.

To print the matched string, use the match object's group() method.

In [3]:
result = re.match(r'AV', 'AV Avishek Benarji')
print(result.group(0))

AV


Now let's look for 'Avishek' in the same string. The string does not start with 'Avishek', so match() returns None.

In [4]:
result = re.match(r'Avishek', 'AV Avishek Benarji')
print(result)

None


The start() and end() methods return the start and end positions of the match in the string.

In [5]:
result = re.match(r'AV', 'AV Avishek Benarji')
print(result.start())
print(result.end())

0
2
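
Because match() returns None when the pattern is not at the start of the string, calling group() directly on the result would raise an AttributeError in that case. A minimal sketch of a guarded lookup is given here; the helper name describe_match is made up for this illustration and is not part of the 're' package.

```python
import re

def describe_match(pattern, text):
    """Describe a match at the start of text, or report that there is none."""
    result = re.match(pattern, text)
    if result is None:
        # re.match found nothing at position 0, so there is no match object to query.
        return 'no match'
    return f'{result.group(0)!r} at {result.span()}'

print(describe_match(r'AV', 'AV Avishek Benarji'))       # 'AV' at (0, 2)
print(describe_match(r'Avishek', 'AV Avishek Benarji'))  # no match
```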


### re.search(pattern, string):


It is similar to match(), but it does not restrict the match to the beginning of the string. The search() function can find a pattern at any position in the string, but it returns only the first occurrence.

In [7]:
result = re.search(r'Avishek', 'AV Avishek Benarji Avishek')
print(result.group(0))

Avishek


In [8]:
result = re.search(r'Avishek', 'AV Avishek Benarji Avishek')
print(result.start())
print(result.end())

3
10
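
To make the difference between the two functions concrete, the short sketch below runs both on the same input. The variable names are illustrative only.

```python
import re

text = 'AV Avishek Benarji Avishek'

# match() only looks at the start of the string, where 'AV ' comes first.
at_start = re.match(r'Avishek', text)

# search() scans forward and stops at the first occurrence it finds.
anywhere = re.search(r'Avishek', text)

print(at_start)         # None
print(anywhere.span())  # (3, 10)
```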


### re.findall(pattern, string):

It returns a list of all non-overlapping matches of the pattern. The matches can occur anywhere in the string.

In [10]:
result = re.findall(r'Avishek', 'AV Avishek Benarji Avishek')
print(result)

['Avishek', 'Avishek']
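
findall() returns only the matched text. When the positions are needed as well, one option is re.finditer(), which yields a match object for each occurrence. This goes slightly beyond the functions listed above and is offered as a sketch rather than a required step.

```python
import re

text = 'AV Avishek Benarji Avishek'

# Each item produced by finditer() is a match object, so span() is available.
spans = [m.span() for m in re.finditer(r'Avishek', text)]
print(spans)  # [(3, 10), (19, 26)]

# When nothing matches, findall() returns an empty list rather than None.
print(re.findall(r'xyz', text))  # []
```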


### re.split(pattern, string, [maxsplit=0]):

This function splits the string at each occurrence of the given pattern.

In [11]:
result=re.split(r'i','Vikings')
result

['V', 'k', 'ngs']

If we set maxsplit, the string is split at most that many times, and the rest of the string is returned as the final element.

In [12]:
result=re.split(r'i','Vikings',maxsplit=1)
result

['V', 'kings']
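
The effect of maxsplit can be compared across a few values in one loop. In this sketch the dictionary name splits is arbitrary; a maxsplit of 0 is the default and means "no limit".

```python
import re

# Map each maxsplit value to the resulting list of pieces.
splits = {limit: re.split(r'i', 'Vikings', maxsplit=limit) for limit in (0, 1, 2)}

for limit, pieces in splits.items():
    print(limit, pieces)
# 0 ['V', 'k', 'ngs']
# 1 ['V', 'kings']
# 2 ['V', 'k', 'ngs']
```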

### re.sub(pattern, repl, string):

It searches for a pattern and replaces each match with a new substring. If the pattern is not found, the string is returned unchanged.

In [13]:
result=re.sub(r'India','the World','Bangalore is the largest IT hub of India')
result

'Bangalore is the largest IT hub of the World'
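
Two behaviours of re.sub() are easy to check directly: the input comes back unchanged when nothing matches, and the optional count argument limits how many matches are replaced. The sample strings below are made up for the illustration.

```python
import re

text = 'Bangalore is the largest IT hub of India'

# No occurrence of 'China', so the original string is returned as-is.
unchanged = re.sub(r'China', 'the World', text)
print(unchanged == text)  # True

# count=1 replaces only the first match; the default (0) replaces all of them.
first_only = re.sub(r'a', 'A', 'banana', count=1)
every_one = re.sub(r'a', 'A', 'banana')
print(first_only)  # bAnana
print(every_one)   # bAnAnA
```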

### re.compile(pattern, flags=0):

We can compile a regular expression pattern into a pattern object, which can then be used for matching. This lets us reuse the same pattern without rewriting it.

In [15]:
import re
pattern=re.compile('Av')
result=pattern.findall('AV Avishek kumar Avishek')
print(result)

['Av', 'Av']


In [16]:
result2=pattern.findall('Avishek is a good guy, Avishek, Avishek')

In [17]:
print(result2)

['Av', 'Av', 'Av']
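
A compiled pattern object offers the same operations as the module-level functions (search, findall, sub, and so on). The sketch below reuses one pattern for several of them and also shows the flags argument with re.IGNORECASE; the variable names are illustrative.

```python
import re

text = 'AV Avishek kumar Avishek'
pattern = re.compile(r'Av')

# The compiled object exposes search() and sub() just like the re module does.
first_span = pattern.search(text).span()
lowered = pattern.sub('av', text)
print(first_span)  # (3, 5)
print(lowered)     # AV avishek kumar avishek

# Flags are supplied once at compile time and apply to every later call.
any_case = re.compile(r'av', re.IGNORECASE)
print(any_case.findall(text))  # ['AV', 'Av', 'Av']
```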


Regular expressions can specify patterns, not just fixed characters. Here are the most commonly used operators for building an expression that represents the required characters in a string or file. They are commonly used in web scraping and text mining to extract the required information.

| Operators | Description |
|---|---|
| `.` | Matches any single character except a newline. |
| `?` | Matches 0 or 1 occurrence of the pattern to its left. |
| `+` | Matches 1 or more occurrences of the pattern to its left. |
| `*` | Matches 0 or more occurrences of the pattern to its left. |
| `\w` | Matches an alphanumeric character (or underscore); `\W` (upper case W) matches a non-alphanumeric character. |
| `\d` | Matches a digit [0-9]; `\D` (upper case D) matches a non-digit. |
| `\s` | Matches a single whitespace character (space, newline, return, tab, form feed); `\S` (upper case S) matches any non-whitespace character. |
| `\b` | Matches the boundary between a word and a non-word character; `\B` is the opposite of `\b`. |
| `[..]` | Matches any single character listed in the square brackets; `[^..]` matches any single character not listed in the square brackets. |
| `\` | Escapes characters with a special meaning, e.g. `\.` matches a period and `\+` a plus sign. |
| `{n,m}` | Matches at least n and at most m occurrences of the preceding expression; written as `{,m}`, it matches from 0 up to m occurrences. |
| `a\|b` | Matches either a or b. |
| `( )` | Groups regular expressions and returns the matched text. |
| `\t`, `\n`, `\r` | Matches tab, newline, return. |
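
A few of the operators above can be tried out quickly with findall(). The inputs and dictionary keys in this sketch are invented for the illustration; only the operators themselves come from the list above.

```python
import re

# Each entry pairs one operator with a small made-up input string.
examples = {
    'question_mark': re.findall(r'colou?r', 'color colour'),         # optional 'u'
    'plus':          re.findall(r'\d+', 'room 12, floor 3'),         # runs of digits
    'braces':        re.findall(r'\d{2,4}', '1 22 333 4444 55555'),  # 2 to 4 digits
    'alternation':   re.findall(r'cat|dog', 'cat dog bird'),         # either word
    'boundary':      re.findall(r'\bcat\b', 'cat concat cats'),      # whole word only
    'negated_class': re.findall(r'[^aeiou\s]', 'a big'),             # not vowel, not space
    'escaped_dot':   re.findall(r'3\.14', '3.14 3x14'),              # literal period
}

for label, matches in examples.items():
    print(f'{label}: {matches}')
```

Note that `{2,4}` is greedy: on '55555' it consumes four digits, and the single digit left over is too short to match again.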

### Problem 1: Return the first word of a given string

Extract each character (using “.”)

In [2]:
import re
result=re.findall(r'.','I am  a student of Data-Scienc')
print(result)

['I', ' ', 'a', 'm', ' ', ' ', 'a', ' ', 's', 't', 'u', 'd', 'e', 'n', 't', ' ', 'o', 'f', ' ', 'D', 'a', 't', 'a', '-', 'S', 'c', 'i', 'e', 'n', 'c']


In the output above, spaces are also extracted. To avoid that, use “\w” instead of “.”.

In [3]:
result=re.findall(r'\w','I am  a student of Data-Scienc')
print(result)

['I', 'a', 'm', 'a', 's', 't', 'u', 'd', 'e', 'n', 't', 'o', 'f', 'D', 'a', 't', 'a', 'S', 'c', 'i', 'e', 'n', 'c']


Extract each word (using “\w*”)

In [4]:
result=re.findall(r'\w*','I am  a student of Data-Scienc')
print(result)

['I', '', 'am', '', '', 'a', '', 'student', '', 'of', '', 'Data', '', 'Scienc', '']


Again the output is not clean: it contains empty strings, because “*” matches zero or more occurrences of the pattern to its left and therefore also matches the empty string wherever there is no word character. To remove them, use “+” instead.

In [5]:
result=re.findall(r'\w+','I am  a student of Data-Scienc')
print(result)

['I', 'am', 'a', 'student', 'of', 'Data', 'Scienc']


Extract the first word (using “^”)

In [7]:
result=re.findall(r'^\w+','I am  a student of Data-Scienc')
print(result)

['I']


In [8]:
# Using '$' instead of '^' anchors the pattern at the end of the string, so this returns the last word.

result=re.findall(r'\w+$','I am  a student of Data-Scienc')
print(result)

['Scienc']
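
The two anchors can be wrapped in small helpers. The function names first_word and last_word are hypothetical and exist only for this sketch; both return None when the string does not begin (or end) with a word character.

```python
import re

def first_word(text):
    """Return the run of word characters at the start of text, or None."""
    found = re.search(r'^\w+', text)
    return found.group(0) if found else None

def last_word(text):
    """Return the run of word characters at the end of text, or None."""
    found = re.search(r'\w+$', text)
    return found.group(0) if found else None

sentence = 'I am  a student of Data-Scienc'
print(first_word(sentence))               # I
print(last_word(sentence))                # Scienc
print(first_word('Regular expressions'))  # Regular
print(last_word('Ends with a period.'))   # None, because '.' is not a word character
```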


### Return the first two characters of each word

Extract two consecutive characters from each word, excluding spaces (using “\w”)


In [9]:
result=re.findall(r'\w\w','I am  a student of Data-Scienc')
print(result)

['am', 'st', 'ud', 'en', 'of', 'Da', 'ta', 'Sc', 'ie', 'nc']


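Extract the two characters at the start of each word boundary (using “\b”). Because “.” matches any character, one-letter words such as 'I' and 'a' are returned together with the space that follows them.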

In [11]:
result = re.findall(r'\b\w.', 'I am a student of Data-Science')
print(result)

['I ', 'am', 'a ', 'st', 'of', 'Da', 'Sc']
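
If the trailing spaces are unwanted, one possible variation is to require word characters only, using the “{n,m}” quantifier. This is a sketch of an alternative, not the only way to do it.

```python
import re

text = 'I am a student of Data-Science'

# Exactly two word characters after a boundary: one-letter words are skipped.
exactly_two = re.findall(r'\b\w{2}', text)
print(exactly_two)  # ['am', 'st', 'of', 'Da', 'Sc']

# One or two word characters: one-letter words are kept, without any space.
up_to_two = re.findall(r'\b\w{1,2}', text)
print(up_to_two)    # ['I', 'am', 'a', 'st', 'of', 'Da', 'Sc']
```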


### Return the domain type of given email-ids

Extract the run of word characters that follows “@”

In [12]:
result=re.findall(r'@\w+','abc.test@gmail.com, xyz@test.in, test.first@analyticsvidhya.com, first.test@rest.biz') 
print(result) 

['@gmail', '@test', '@analyticsvidhya', '@rest']


Above, the “.com” and “.in” parts are not extracted. To include them, extend the pattern with an escaped dot “\.” followed by “\w+”.

In [13]:
result=re.findall(r'@\w+\.\w+','abc.test@gmail.com, xyz@test.in, test.first@analyticsvidhya.com, first.test@rest.biz')
print(result)

['@gmail.com', '@test.in', '@analyticsvidhya.com', '@rest.biz']


Extract only the domain type (the part after the dot) using “( )”

In [14]:
result=re.findall(r'@\w+\.(\w+)','abc.test@gmail.com, xyz@test.in, test.first@analyticsvidhya.com, first.test@rest.biz')
print(result)

['com', 'in', 'com', 'biz']
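
In a pattern, an unescaped “.” matches any character, so writing “\.” matters whenever a literal period is meant. The sketch below shows the difference on a made-up input, and also that findall() returns tuples when the pattern contains more than one group.

```python
import re

emails = 'abc.test@gmail.com, xyz@test.in, test.first@analyticsvidhya.com, first.test@rest.biz'

# Two groups: findall() returns one (name, type) tuple per address.
pairs = re.findall(r'@(\w+)\.(\w+)', emails)
print(pairs)
# [('gmail', 'com'), ('test', 'in'), ('analyticsvidhya', 'com'), ('rest', 'biz')]

# With an unescaped dot, a space is accepted where the period should be.
loose = re.findall(r'@\w+.\w+', 'user@host name')
strict = re.findall(r'@\w+\.\w+', 'user@host name')
print(loose)   # ['@host name']
print(strict)  # []
```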


### Return date from given string

Here we use “\d” to match digits.

In [15]:
result=re.findall(r'\d{2}-\d{2}-\d{4}','Amit 34-3456 12-05-2007, XYZ 56-4532 11-11-2011, ABC 67-8945 12-01-2009')
print(result)

['12-05-2007', '11-11-2011', '12-01-2009']


To extract only the year, parentheses “( )” help again.

In [16]:
result=re.findall(r'\d{2}-\d{2}-(\d{4})','Amit 34-3456 12-05-2007, XYZ 56-4532 11-11-2011, ABC 67-8945 12-01-2009')
print(result)

['2007', '2011', '2009']
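
Putting parentheses around every field returns all three parts of each date at once, as a tuple. The sketch below does not assume which two-digit field is the day and which is the month, since the text does not say.

```python
import re

records = 'Amit 34-3456 12-05-2007, XYZ 56-4532 11-11-2011, ABC 67-8945 12-01-2009'

# Three groups: each match becomes a (first, second, year) tuple of strings.
parts = re.findall(r'(\d{2})-(\d{2})-(\d{4})', records)
print(parts)
# [('12', '05', '2007'), ('11', '11', '2011'), ('12', '01', '2009')]

# The years can then be picked out of the tuples.
years = [year for _, _, year in parts]
print(years)  # ['2007', '2011', '2009']
```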


### Return all words of a string that start with a vowel

Return each word

In [17]:
result = re.findall(r'\w+', 'I am a student of Data-Science')
print(result)

['I', 'am', 'a', 'student', 'of', 'Data', 'Science']


Return words that start with a vowel (using “[]”). Without a word boundary, the pattern also matches from a vowel in the middle of a word.

In [19]:
result = re.findall(r'[aeiouAEIOU]\w*', 'I am a student of Data-Science')
print(result)

['I', 'am', 'a', 'udent', 'of', 'ata', 'ience']

With “+” instead of “*”, at least one more word character must follow the vowel, so the one-letter words 'I' and 'a' are no longer returned.


In [20]:
result = re.findall(r'[aeiouAEIOU]\w+', 'I am a student of Data-Science')
print(result)

['am', 'udent', 'of', 'ata', 'ience']

The results above still contain word fragments such as 'udent' and 'ience'. To drop them, anchor the pattern at a word boundary with “\b”.


In [21]:
result = re.findall(r'\b[aeiouAEIOU]\w+', 'I am a student of Data-Science')
print(result)

['am', 'of']

In a similar way, we can extract words that start with a consonant by negating the character class with “^” inside the square brackets.


In [24]:
result = re.findall(r'\b[^aeiouAEIOU]\w+', 'I am a student of Data-Science')
print(result)

[' am', ' a', ' student', ' of', ' Data', '-Science']


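Above, the returned words start with a space, because a space is also a non-vowel character. To drop it from the output, include the space inside the square brackets []. The hyphen is not excluded, so '-Science' keeps its leading “-”.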

In [23]:
result = re.findall(r'\b[^aeiouAEIOU ]\w+', 'I am a student of Data-Science')
print(result)

['student', 'Data', '-Science']
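
Two variations on the patterns above may be useful. Using “\w*” after the boundary keeps one-letter vowel words, and the class “[^\WaeiouAEIOU]” (not a non-word character and not a vowel) restricts the first character to word characters, so punctuation such as “-” cannot start a match. Note that this class still admits digits and underscores; it is a sketch, not a strict consonant test.

```python
import re

text = 'I am a student of Data-Science'

# Boundary + vowel + zero or more word characters: 'I' and 'a' are included.
vowel_words = re.findall(r'\b[aeiouAEIOU]\w*', text)
print(vowel_words)      # ['I', 'am', 'a', 'of']

# First character must be a word character that is not a vowel.
consonant_words = re.findall(r'\b[^\WaeiouAEIOU]\w*', text)
print(consonant_words)  # ['student', 'Data', 'Science']
```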


### Validate a phone number (the number must have 10 digits and start with 8 or 9)

In [25]:
import re
li = ['9999999999', '999999-999', '99999x9999']
for val in li:
    # The pattern checks the first 10 characters; the length test rejects longer strings.
    if re.match(r'[89][0-9]{9}', val) and len(val) == 10:
        print('yes')
    else:
        print('no')

yes
no
no
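
An alternative to combining match() with a length check is fullmatch(), which succeeds only when the whole string fits the pattern. The sketch below wraps it in a helper; the names PHONE_PATTERN and is_valid_phone are hypothetical, and the rule (10 digits, starting with 8 or 9) is the one stated in the heading above.

```python
import re

# 8 or 9 first, then exactly nine more digits.
PHONE_PATTERN = re.compile(r'[89][0-9]{9}')

def is_valid_phone(number):
    """Return True if number is exactly 10 digits and starts with 8 or 9."""
    return PHONE_PATTERN.fullmatch(number) is not None

for val in ['9999999999', '999999-999', '99999x9999', '7999999999', '89999999990']:
    print(val, 'yes' if is_valid_phone(val) else 'no')
```

The last two inputs fail for different reasons: one starts with 7, and the other has 11 digits, which fullmatch() rejects without a separate length test.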


### Split a string with multiple delimiters

In [26]:
import re
line = 'asdf fjdk;afed,fjek,asdf,foo' # String has multiple delimiters (";",","," ").
result= re.split(r'[;,\s]', line)
print(result)

['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']


We can also use re.sub() to replace each of these delimiters with a single space “ ”.

In [27]:
import re
line = 'asdf fjdk;afed,fjek,asdf,foo'
result= re.sub(r'[;,\s]',' ', line)
print(result)


asdf fjdk afed fjek asdf foo
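
When delimiters appear back to back (for example a comma followed by a space), splitting on a single character leaves empty strings in the result. Adding “+” to the character class treats each run of delimiters as one separator. The input string below is a made-up variation of the one used above.

```python
import re

line = 'asdf fjdk; afed, fjek,asdf,  foo'  # some delimiters are doubled up

# One delimiter at a time: empty strings appear between adjacent delimiters.
single = re.split(r'[;,\s]', line)
print(single)

# One or more delimiters at a time: each run counts as a single separator.
collapsed = re.split(r'[;,\s]+', line)
print(collapsed)  # ['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']

# The same idea with sub(): every run of delimiters becomes one space.
normalized = re.sub(r'[;,\s]+', ' ', line)
print(normalized)  # asdf fjdk afed fjek asdf foo
```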


### Retrieve Information from HTML file

