# Learn to Use Regular Expressions (RegEx)

Python has a built-in module to work with regular expressions called **re**. Some of the commonly used methods from the **re** module are listed below:

1. re.match( )
2. re.search( )
3. re.findall( )
4. re.sub( )

<br>

**Resources:**

- [Beginners Tutorial for Regular Expressions in Python](https://www.analyticsvidhya.com/blog/2015/06/regular-expression-python/)
- [4 Applications of Regular Expressions that every Data Scientist should know](https://www.analyticsvidhya.com/blog/2020/01/4-applications-of-regular-expressions-that-every-data-scientist-should-know-with-python-code/)

Let us look at each method with the help of an example.

**1. re.match()**

The re.match() function checks for the pattern only at the beginning of the string. It returns a match object on success and None on failure.

In [3]:
# import re library
import re

#match a word at the beginning of a string

result = re.match('Analytics','Analytics Vidhya is the largest data science community of India') 
print(result)

<re.Match object; span=(0, 9), match='Analytics'>


In [5]:
result_2 = re.match('largest','Analytics Vidhya is the largest data science community of India') 
print(result_2)

None


Since the output of re.match() is a match object, we use the *group()* method of that object to get the matched text.

In [2]:
print(result.group())  # returns the matched text

Analytics
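Because re.match() returns None when the pattern is not found at the start of the string, calling group() on a failed match raises an AttributeError. The sketch below shows one way to guard against that; the helper name `match_at_start` is hypothetical and only used for illustration.

```python
import re

sentence = 'Analytics Vidhya is the largest data science community of India'

def match_at_start(pattern, text):
    """Return the matched text if `pattern` matches at the start of `text`, else None."""
    m = re.match(pattern, text)
    # re.match returns None on failure, so check before calling .group()
    return m.group() if m is not None else None

hit = match_at_start('Analytics', sentence)
miss = match_at_start('largest', sentence)

print(hit)   # Analytics
print(miss)  # None
```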


<br>

**2. re.search()**

Returns a match object for the **first** occurrence of a pattern, searching the entire string.

In [6]:
# search for the pattern "founded" in a given string
result = re.search('founded','Andrew NG founded Coursera. He also founded deeplearning.ai')
print(result.group())

founded
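To make the difference from re.match() concrete, the short sketch below runs both functions on the same sentence and also reads the position of the first occurrence from the match object with span(). The variable names are illustrative.

```python
import re

text = 'Andrew NG founded Coursera. He also founded deeplearning.ai'

found = re.search('founded', text)
first_span = found.span()             # (start, end) indices of the first occurrence

at_start = re.match('founded', text)  # None: the string does not begin with "founded"

print(first_span)  # (10, 17)
print(at_start)    # None
```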


<br>

**3. re.findall()**

It returns all non-overlapping occurrences of the pattern in the string as a list. Because it scans the whole string, *re.findall()* can cover many of the use cases of *re.search()* and *re.match()*, but note that it returns plain strings rather than match objects.

In [7]:
result = re.findall('founded','Andrew NG founded Coursera. He also founded deeplearning.ai')  
print(result)

['founded', 'founded']
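When the position of each occurrence matters, re.finditer() is a possible alternative: it yields one match object per occurrence instead of a plain string. A minimal sketch on the same sentence:

```python
import re

text = 'Andrew NG founded Coursera. He also founded deeplearning.ai'

# re.findall returns strings; re.finditer yields match objects with positions
positions = [m.start() for m in re.finditer('founded', text)]

# When nothing matches, re.findall returns an empty list rather than None
no_hits = re.findall('acquired', text)

print(positions)  # [10, 36]
print(no_hits)    # []
```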


**4. re.sub()**

This method returns a string in which the matched occurrences are replaced with new text.

In [None]:
result = re.sub('He', 'Andrew NG', 'Andrew NG founded Coursera. He also founded deeplearning.ai')  
print(result)

Andrew NG founded Coursera. Andrew NG also founded deeplearning.ai


In [None]:
result = re.sub('also', '', 'Andrew NG founded Coursera. He also founded deeplearning.ai')  
print(result)

Andrew NG founded Coursera. He  founded deeplearning.ai
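Removing "also" on its own leaves two consecutive spaces in the result. The sketch below shows one possible way to avoid that by including the preceding whitespace in the pattern, and also illustrates the optional `count` argument of re.sub(), which limits how many occurrences are replaced.

```python
import re

text = 'Andrew NG founded Coursera. He also founded deeplearning.ai'

# Remove "also" together with the whitespace before it, so no double space remains
no_also = re.sub(r'\s+also\b', '', text)

# count=1 replaces only the first occurrence of the pattern
first_only = re.sub('founded', 'started', text, count=1)

print(no_also)     # Andrew NG founded Coursera. He founded deeplearning.ai
print(first_only)  # Andrew NG started Coursera. He also founded deeplearning.ai
```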


## Special Sequences

1. **\b** returns a match where the specified pattern is at the beginning or at the end of a word.

In [9]:
# Check if there is any word that ends with "ics"
x = re.findall(r"ics\b", "Analytics Vidhya is one of the largest data science communities")
print(x)

['ics']


In [None]:
# Check if there is any word that ends with "est"
x = re.findall(r"est\b", "Analytics Vidhya is one of the largest data science communities")
print(x)

['est']


It returns the last three characters of the word "largest".
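The examples above place \b after the pattern. As a small sketch, \b can also be placed before the pattern to anchor it to the start of a word, or on both sides to match a whole word only:

```python
import re

text = "Analytics Vidhya is one of the largest data science communities"

# Without \b, "on" is found inside the word "one"
loose = re.findall(r"on", text)

# With \b on both sides, only the standalone word "on" would match
whole_word = re.findall(r"\bon\b", text)

# \b in front of the pattern anchors it to the start of a word
word_start = re.findall(r"\bco\w*", text)

print(loose)       # ['on']
print(whole_word)  # []
print(word_start)  # ['communities']
```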

2. **\d** returns a match when the string contains digits (numbers from 0-9)

In [10]:
str = "2 million monthly visits in Jan'19."

# Check if the string contains any digits (numbers from 0-9):
x = re.findall(r"\d", str)

print(x)

['2', '1', '9']


In [11]:
str = "2 million monthly visits in Jan'19."

# Check if the string contains any digits (numbers from 0-9):
# adding '+' after '\d' keeps extracting digits until a non-digit character is encountered
x = re.findall(r"\d+", str)

print(x)

['2', '19']


We can infer that **\d+** matches one or more consecutive occurrences of **\d** and stops at the first non-matching character, whereas **\d** matches one digit at a time.

The special character "+" matches the expression to its left 1 or more times.
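Since re.findall() returns the digits as strings, a common follow-up step is converting them to integers. The sketch below assumes the same sample sentence and also shows \D, the complement of \d, which matches any non-digit character.

```python
import re

text = "2 million monthly visits in Jan'19."

# Each run of digits becomes one string, which int() can convert
numbers = [int(token) for token in re.findall(r"\d+", text)]

# \D matches every non-digit character, so removing them keeps only the digits
digits_only = re.sub(r"\D", "", text)

print(numbers)      # [2, 19]
print(digits_only)  # 219
```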

3. **\w** matches word characters only: letters (a-z and A-Z), digits (0-9), and the underscore (_)


In [12]:
str = "2 million monthly visits!"

x = re.findall(r"\w+", str)

print(x)

['2', 'million', 'monthly', 'visits']


Note that the ! is not matched, because it is not a word character.

**\S** matches non-whitespace characters, while **\s** matches whitespace characters.
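A brief sketch of the whitespace sequences on the same sentence: \S+ collects runs of non-whitespace characters (so the "!" stays attached to its word), and \s matches each individual whitespace character.

```python
import re

text = "2 million monthly visits!"

tokens = re.findall(r"\S+", text)  # runs of non-whitespace characters
spaces = re.findall(r"\s", text)   # each whitespace character separately

print(tokens)       # ['2', 'million', 'monthly', 'visits!']
print(len(spaces))  # 3
```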

## Metacharacters

Metacharacters are characters that have a special meaning inside a pattern.

1. **(.)** matches any character (except newline character)

In [13]:
str = "rohan and rohit recently published a research paper!" 

# search for a string that starts with "ro", followed by 1 character
x = re.findall("ro.", str)

print(x)

['roh', 'roh']


In [14]:
# search for a string that starts with "ro", followed by three characters
x2 = re.findall("ro...", str)

print(x2)

['rohan', 'rohit']
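The "except newline" part can be checked with a small sketch. The test string here is made up for illustration; the re.DOTALL flag makes the dot match newline characters as well.

```python
import re

text = "ro\nhan"  # illustrative string with a newline right after "ro"

default = re.findall("ro.", text)            # "." does not match "\n"
dotall = re.findall("ro.", text, re.DOTALL)  # with re.DOTALL it does

print(default)  # []
print(dotall)   # ['ro\n']
```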


2. **(^)** starts with: matches the expression to its right at the start of a string

In [15]:
str = "Data Science"

#Check if the string starts with 'Data':
x = re.findall("^Data", str)

if (x):
  print("Yes, the string starts with 'Data'")
else:
  print("No match")
  
#print(x)  

Yes, the string starts with 'Data'


In [16]:
# try with a different string
str2 = "Big Data"

#Check if the string starts with 'Data':
x2 = re.findall("^Data", str2)

if (x2):
  print("Yes, the string starts with 'Data'")
else:
  print("No match")
  
#print(x2)  

No match
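By default ^ matches only at the very start of the string. As a sketch, with a made-up two-line string, the re.MULTILINE flag makes ^ match at the start of every line as well:

```python
import re

text = "Big Data\nData Science"  # illustrative two-line string

single = re.findall("^Data", text)               # only the start of the string is checked
multi = re.findall("^Data", text, re.MULTILINE)  # the start of each line is checked

print(single)  # []
print(multi)   # ['Data']
```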


3. **($)** ends with

$ matches the expression to its left at the end of a string.

In [None]:
str = "Data Science"

#Check if the string ends with 'Science':

x = re.findall("Science$", str)

if (x):
  print("Yes, the string ends with 'Science'")

else:
  print("No match")
  
#print(x)

Yes, the string ends with 'Science'


In [None]:
str = "Big Data"

#Check if the string ends with 'Science':

x = re.findall("Science$", str)

if (x):
  print("Yes, the string ends with 'Science'")

else:
  print("No match")
  
#print(x)

No match
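Combining ^ and $ gives a simple whole-string check. The sketch below uses an illustrative digits-only pattern and also shows a detail worth knowing: $ still matches just before a single trailing newline, whereas \Z matches only at the very end of the string.

```python
import re

only_digits = r"^\d+$"  # ^ and $ together: the string must consist of digits

plain = bool(re.search(only_digits, "2019"))
mixed = bool(re.search(only_digits, "Jan'19"))
trailing_newline = bool(re.search(only_digits, "2019\n"))  # $ matches before a final "\n"
strict_end = bool(re.search(r"^\d+\Z", "2019\n"))          # \Z matches only at the very end

print(plain, mixed, trailing_newline, strict_end)  # True False True False
```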


4. **(\*)** matches zero or more occurrences of the pattern to its left.

In [17]:
str = "easy easssy eay eaty"

#Check if the string contains "ea" followed by 0 or more "s" characters and ending with y
x = re.findall("eas*y", str)

print(x)

['easy', 'easssy', 'eay']


Which pattern will match the strings "invent", "invention", "inventory" and "inventories"?

invent[a-z]*

The subword "invent" is common to all four strings. After it, the set [a-z] followed by * matches zero or more characters between a and z, so all the required strings are matched.
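The answer can be verified with a few lines of code. This sketch uses re.fullmatch() so that the whole word has to match, and contrasts * with +, which would require at least one extra letter and therefore reject "invent" itself.

```python
import re

words = ["invent", "invention", "inventory", "inventories"]

# * allows zero extra letters, so the bare word "invent" is accepted too
with_star = [w for w in words if re.fullmatch("invent[a-z]*", w)]

# + requires at least one extra letter, so "invent" is rejected
with_plus = [w for w in words if re.fullmatch("invent[a-z]+", w)]

print(with_star)  # ['invent', 'invention', 'inventory', 'inventories']
print(with_plus)  # ['invention', 'inventory', 'inventories']
```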

## Sets

1. A set is a group of characters inside a pair of square brackets [ ]; it matches any one of those characters.

In [18]:
str = "Analytics Vidhya is one of the largest data science communities"

#Check for the characters y, d, or h, in the above string
x = re.findall("[ydh]", str)

print(x)

['y', 'd', 'h', 'y', 'h', 'd']


In [19]:
str = "Analytics Vidhya is the one of the largest data science communities"

#Check for the characters between a and g, in the above string
x = re.findall("[a-g]", str)

print(x)

['a', 'c', 'd', 'a', 'e', 'e', 'f', 'e', 'a', 'g', 'e', 'd', 'a', 'a', 'c', 'e', 'c', 'e', 'c', 'e']


2. **[^]** matches any character other than those listed after ^

In [20]:
str = "Analytics Vidhya is one of the largest data science communities"

#Find every character other than y, d, or h

x = re.findall("[^ydh]", str)

print(x)

['A', 'n', 'a', 'l', 't', 'i', 'c', 's', ' ', 'V', 'i', 'a', ' ', 'i', 's', ' ', 'o', 'n', 'e', ' ', 'o', 'f', ' ', 't', 'e', ' ', 'l', 'a', 'r', 'g', 'e', 's', 't', ' ', 'a', 't', 'a', ' ', 's', 'c', 'i', 'e', 'n', 'c', 'e', ' ', 'c', 'o', 'm', 'm', 'u', 'n', 'i', 't', 'i', 'e', 's']


In [21]:
str = "@AnalyticsVidhya"

x = re.findall("[^@]", str)

print(x)

['A', 'n', 'a', 'l', 'y', 't', 'i', 'c', 's', 'V', 'i', 'd', 'h', 'y', 'a']
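Sets can also hold several ranges at once, and adding + after a set groups consecutive matching characters into a single result instead of returning them one by one. A short sketch on the strings used above:

```python
import re

text = "Analytics Vidhya is one of the largest data science communities"

# A range of uppercase letters
capitals = re.findall("[A-Z]", text)

# Several ranges in one set; + joins consecutive matches into one string
handle = re.findall("[A-Za-z0-9]+", "@AnalyticsVidhya")

print(capitals)  # ['A', 'V']
print(handle)    # ['AnalyticsVidhya']
```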


---
## Solve Some Queries

Let us try solving some queries that we are likely to come across while working with real-world text datasets.



### Eliminating Unwanted Terms

In [22]:
str = "@AV a Data Science community #AV!!"

# remove every character that is not a letter or a space
x = re.sub("[^a-zA-Z ]", "",str)

print(x)

AV a Data Science community AV


In [23]:
str = "@AV a Data Science community #AV!!"

# remove words that start with a special character
# \w matches any alphanumeric character or underscore
# + repeats the preceding token one or more times
x = re.sub(r"[^a-zA-Z ]\w+", "", str)

print(x)

 a Data Science community !!
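The two substitutions above can be chained into a small cleaning step. The sketch below is one possible combination, not the only one; the helper name `clean_text` is hypothetical, and it assumes that only @ and # mark unwanted terms.

```python
import re

def clean_text(text):
    """Hypothetical helper: drop @/# terms, then stray symbols, then extra spaces."""
    text = re.sub(r"[@#]\w+", "", text)       # remove words starting with @ or #
    text = re.sub(r"[^a-zA-Z ]", "", text)    # remove remaining non-letter characters
    return re.sub(r"\s+", " ", text).strip()  # collapse repeated spaces and trim the ends

cleaned = clean_text("@AV a Data Science community #AV!!")
print(cleaned)  # a Data Science community
```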


### Finding Email IDs

In [24]:
str = 'Send a mail to rohan.1997@gmail.com, smith_david34@yahoo.com and priya@yahoo.com about the meeting @2PM'
  
# [a-zA-Z0-9._-]+ matches the part before "@": letters, digits, ".", "_" and "-"
# \w+ matches the domain name; \.com matches a literal ".com"
x = re.findall(r'[a-zA-Z0-9._-]+@\w+\.com', str)
  
# Printing of List 
print(x) 

['rohan.1997@gmail.com', 'smith_david34@yahoo.com', 'priya@yahoo.com']


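The set ([a-zA-Z0-9._-]+) searches for one or more alphanumeric characters, along with the special characters ".", "_" and "-", before the "@". After the "@", (\w+) matches the domain name, followed by an escaped dot and "com". Note that "@2PM" is not returned, because no allowed character appears directly before its "@".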
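The pattern above only finds addresses ending in ".com". As a hedged sketch, it can be loosened to accept other endings; the pattern below is a simplified illustration and not a full email validator, and the sample sentence is altered to include a made-up ".org" address.

```python
import re

# Sample text altered for illustration: the last address uses a made-up .org domain
text = ('Send a mail to rohan.1997@gmail.com, smith_david34@yahoo.com '
        'and priya@example.org about the meeting @2PM')

# local part, "@", domain (letters, digits, "." and "-"), then a dot and 2+ letters
email_pattern = r'[a-zA-Z0-9._-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

emails = re.findall(email_pattern, text)
print(emails)
# ['rohan.1997@gmail.com', 'smith_david34@yahoo.com', 'priya@example.org']
```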