
# Learn to Use Regular Expressions (RegEx)

Python has a built-in module to work with regular expressions called **re**. Some of the commonly used methods from the **re** module are listed below:

1. re.match()

2. re.search()

3. re.findall()

4. re.sub()

<br>

Let us look at each method with the help of an example.

**1. re.match()**

The *re.match()* function checks for a match only at the beginning of the string. It returns a match object on success and None on failure.

In [8]:
# import re library
import re

#match a word at the beginning of a string

result = re.match(' Vidhya',' Vidhya is the largest data science Analytics community of India')
print(result)

result_2 = re.match('largest','Analytics Vidhya is the largest data science community of India')
print(result_2)

<re.Match object; span=(0, 7), match=' Vidhya'>
None


Since the output of *re.match()* is a match object, we use the *group()* method of that object to get the matched text.

In [9]:
print(result.group())  #returns the entire matched text

 Vidhya
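
Because *re.match()* returns None when the pattern is not at the start of the string, calling *group()* on the result without a check raises an AttributeError. A minimal sketch of a guarded call, using a variant of the sentence above (the variable names are illustrative only):

```python
import re

text = "Analytics Vidhya is the largest data science community of India"

# re.match only looks at the start of the string.
at_start = re.match("Analytics", text)
not_at_start = re.match("largest", text)

# Check for None before calling group() or span().
for label, m in [("Analytics", at_start), ("largest", not_at_start)]:
    if m is not None:
        print(label, "->", m.group(), m.span())
    else:
        print(label, "-> no match at the start of the string")
```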


<br>

**2. re.search()**

Scans the entire string and returns a match object for the first occurrence of the pattern, or None if the pattern is not found.

In [13]:
# search for the pattern "founded" in a given string
result = re.search('founded','Andrew NG founded Coursera. He also founded deeplearning.ai')
print(result.group())

founded
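
The match object returned by *re.search()* also records where the first occurrence sits. The sketch below reads that position and shows the None result for a word that is absent ("acquired" is just an illustrative pattern):

```python
import re

text = "Andrew NG founded Coursera. He also founded deeplearning.ai"

# Only the first occurrence is reported, even though the word appears twice.
first = re.search("founded", text)
print(first.start(), first.end())

# A pattern that does not occur anywhere yields None.
missing = re.search("acquired", text)
print(missing)
```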


<br>

**3. re.findall()**

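It returns all non-overlapping occurrences of the pattern in the string as a list. *re.findall()* is a convenient default because it scans the whole string, but it returns plain strings rather than match objects, so use *re.search()* or *re.match()* when you need the position of a match or want to test only the start of the string.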

In [16]:
result = re.findall('founde','Andrew NG founded Coursera. He also founded deeplearning.ai')
print(result)

['founde', 'founde']
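
When the positions of all occurrences matter, *re.finditer()* from the same module yields one match object per occurrence. A small sketch comparing it with *re.findall()* on the sentence above:

```python
import re

text = "Andrew NG founded Coursera. He also founded deeplearning.ai"

# findall returns only the matched strings.
all_hits = re.findall("founded", text)

# finditer yields match objects, so start positions are available too.
positions = [m.start() for m in re.finditer("founded", text)]

print(all_hits)
print(positions)
```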


**4. re.sub()**

This method returns a new string in which every matched occurrence is replaced with the given replacement text.

In [17]:
result = re.sub('He', 'Andrew NG', 'Andrew NG founded Coursera. He also founded deeplearning.ai')
print(result)

Andrew NG founded Coursera. Andrew NG also founded deeplearning.ai


In [18]:
result = re.sub('also', '', 'Andrew NG founded Coursera. He also founded deeplearning.ai')
print(result)

Andrew NG founded Coursera. He  founded deeplearning.ai
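
Deleting a word with an empty replacement leaves the surrounding spaces behind, which is why the output above contains a double space. One possible way around this, sketched below, is to match the whitespace before the word as well; the example also shows the optional *count* argument of *re.sub()*, which limits the number of replacements ("started" is an arbitrary replacement word):

```python
import re

text = "Andrew NG founded Coursera. He also founded deeplearning.ai"

# Remove the word together with the whitespace in front of it.
cleaned = re.sub(r"\s+also\b", "", text)
print(cleaned)

# count=1 replaces only the first occurrence.
first_only = re.sub("founded", "started", text, count=1)
print(first_only)
```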


## Special sequences

1. **\b** returns a match where the specified pattern is at the beginning or at the end of a word.

In [22]:
# Check if there is any word that ends with "ne"
x = re.findall(r"ne\b", "Analytics Vidhya is one of the largest data science communities")
print(x)

['ne']


In [23]:
# Check if there is any word that ends with "ies"
x = re.findall(r"ies\b", "Analytics Vidhya is one of the largest data science communities")
print(x)

['ies']


It returns the last three characters of the word "communities".
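
The same **\b** marker works at the beginning of a word. The sketch below, which reuses the sentence from the cells above, contrasts a pattern with and without the boundary:

```python
import re

text = "Analytics Vidhya is one of the largest data science communities"

# \b before "d" keeps only words that start with "d".
with_boundary = re.findall(r"\bd\w*", text)

# Without \b, the "d" in the middle of "Vidhya" also starts a match.
without_boundary = re.findall(r"d\w*", text)

print(with_boundary)
print(without_boundary)
```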

2. **\d** returns a match when the string contains digits (numbers from 0-9)

In [24]:
str = "2 million monthly visits in Jan'19."

# Check if the string contains any digits (numbers from 0-9):
x = re.findall(r"\d", str)

print(x)

['2', '1', '9']


In [28]:
str = "2mi1lion monthly visits in Jan'19."

# Check if the string contains any digits (numbers from 0-9):
# adding '+' after '\d' extracts consecutive digits until a non-digit character is encountered
x = re.findall(r"\d+", str)

print(x)

['2', '1', '19']


We can infer that **\d+** matches one or more consecutive occurrences of **\d** until a non-matching character is found, whereas **\d** matches one digit at a time.
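
Since **\d+** groups consecutive digits, its results can be converted to integers directly. A short sketch on the first example string:

```python
import re

text = "2 million monthly visits in Jan'19."

# Each match is a run of digits, so int() can parse it.
numbers = [int(n) for n in re.findall(r"\d+", text)]
print(numbers)
```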

3. **\w** matches word characters only (letters from a to z and A to Z, digits from 0-9, and the underscore _ character)


In [30]:
str = "2 millionmonthly visits!"

x = re.findall(r"\w+", str)

print(x)

['2', 'millionmonthly', 'visits']


## Metacharacters

Metacharacters are characters that have a special meaning inside a pattern instead of matching themselves.

1. **(.)** matches any character (except newline character)

In [35]:
str = "rohan and rohit recently published a research paper!"

# search for a string that starts with "r", followed by any five characters
x = re.findall("r.....", str)

print(x)

['rohan ', 'rohit ', 'recent', 'resear']


In [36]:
# search for a string that starts with "ro", followed by three characters
x2 = re.findall("ro...", str)

print(x2)

['rohan', 'rohit']


2. **(^)** matches at the start of the string (starts with)

In [38]:
str = "Data Science"

#Check if the string starts with 'ata':
x = re.findall("^ata", str)

if (x):
  print("Yes, the string starts with 'ata'")
else:
  print("No match")

#print(x)

No match


In [39]:
# try with a different string
str2 = "Big Data"

#Check if the string starts with 'Data':
x2 = re.findall("^Data", str2)

if (x2):
  print("Yes, the string starts with 'Data'")
else:
  print("No match")

#print(x2)

No match


3. **($)** matches at the end of the string (ends with)

In [42]:
str = "Data Science"

#Check if the string ends with 'cien':

x = re.findall("cien$", str)

if (x):
  print("Yes, the string ends with 'cien'")

else:
  print("No match")

#print(x)

No match


In [43]:
str = "Big Data"

#Check if the string ends with 'Science':

x = re.findall("Science$", str)

if (x):
  print("Yes, the string ends with 'Science'")

else:
  print("No match")

#print(x)

No match
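
All four cells above print "No match", so it may help to see the anchors succeed as well. The sketch below checks the same two strings against **^** and **$** patterns and stores the results in dictionaries (the dictionary names are illustrative):

```python
import re

samples = ["Data Science", "Big Data"]

# ^ anchors the pattern to the start of the string, $ to the end.
starts_with_data = {s: bool(re.search(r"^Data", s)) for s in samples}
ends_with_science = {s: bool(re.search(r"Science$", s)) for s in samples}
ends_with_data = {s: bool(re.search(r"Data$", s)) for s in samples}

print(starts_with_data)
print(ends_with_science)
print(ends_with_data)
```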


4. **(\*)** matches zero or more occurrences of the pattern to the left of it.

In [48]:
str = "easy eatsssy eaoyt eaty"

#Check if the string contains "ea" followed by 0 or more "s" characters and ending with y
x = re.findall("eas*y", str)

print(x)

['easy']
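
"Zero or more" means the character may be missing entirely. The sketch below uses a made-up string to compare **\*** with **+** (one or more), which was used earlier in **\d+**:

```python
import re

# Hypothetical test string: "eay" has no "s" at all.
text = "eay easy eassy eaty"

# s* accepts zero, one, or many "s" characters.
zero_or_more = re.findall(r"eas*y", text)

# s+ requires at least one "s".
one_or_more = re.findall(r"eas+y", text)

print(zero_or_more)
print(one_or_more)
```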


## Sets

1. A set is a bunch of characters inside a pair of square brackets [ ] with a special meaning.

In [49]:
str = "Analytics Vidhya is one of the largest data science communities"

#Check for the characters y, d, or h, in the above string
x = re.findall("[ydh]", str)

print(x)

['y', 'd', 'h', 'y', 'h', 'd']


In [50]:
str = "Analytics Vidhya is the one of the largest data science communities"

#Check for the characters between a and g, in the above string
x = re.findall("[a-g]", str)

print(x)

['a', 'c', 'd', 'a', 'e', 'e', 'f', 'e', 'a', 'g', 'e', 'd', 'a', 'a', 'c', 'e', 'c', 'e', 'c', 'e']
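
Several ranges can be combined inside one set, and a quantifier can follow the set. A brief sketch under that assumption (the identifier "rohan_1997" is made up for illustration):

```python
import re

text = "Analytics Vidhya is one of the largest data science communities"

# A range of uppercase letters.
uppercase = re.findall(r"[A-Z]", text)

user = "rohan_1997"  # hypothetical identifier

# Letters only: the underscore and digits end the run.
letters = re.findall(r"[A-Za-z]+", user)

# Letters and digits: only the underscore ends the run.
alnum = re.findall(r"[A-Za-z0-9]+", user)

print(uppercase, letters, alnum)
```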


2. **[^]** matches any character other than the ones listed after ^

In [52]:
str = "Analytics Vidhya is one of the largest data sciece communities"

#Find every character other than s, d, or h

x = re.findall("[^sdh]", str)

print(x)

['A', 'n', 'a', 'l', 'y', 't', 'i', 'c', ' ', 'V', 'i', 'y', 'a', ' ', 'i', ' ', 'o', 'n', 'e', ' ', 'o', 'f', ' ', 't', 'e', ' ', 'l', 'a', 'r', 'g', 'e', 't', ' ', 'a', 't', 'a', ' ', 'c', 'i', 'e', 'c', 'e', ' ', 'c', 'o', 'm', 'm', 'u', 'n', 'i', 't', 'i', 'e']


In [53]:
str = "@AnalyticsVidhya"

x = re.findall("[^@]", str)

print(x)

['A', 'n', 'a', 'l', 'y', 't', 'i', 'c', 's', 'V', 'i', 'd', 'h', 'y', 'a']
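
A negated set pairs naturally with *re.sub()* to strip everything except the characters of interest. The sketch below also shows that **^** only negates when it is the first character inside the brackets (the string "a^b@c" is a made-up test input):

```python
import re

# Keep digits only by deleting every non-digit character.
digits_only = re.sub(r"[^0-9]", "", "2 million monthly visits in Jan'19.")

# Keep letters only.
letters_only = re.sub(r"[^a-zA-Z]", "", "@AnalyticsVidhya")

# When ^ is not the first character in the set, it is a literal caret.
literal_caret = re.findall(r"[@^]", "a^b@c")

print(digits_only, letters_only, literal_caret)
```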


In [61]:
string = "Contact us on training_queries@analyticsvidhya.com"

# group 1 captures the user name, group 2 the domain name up to the first dot
match = re.search(r"([\w.]+)@([\w_]+)", string)
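
The parentheses in the pattern above create capture groups, which can be read back from the match object by number. A minimal sketch of how the result of that search might be inspected:

```python
import re

string = "Contact us on training_queries@analyticsvidhya.com"

match = re.search(r"([\w.]+)@([\w_]+)", string)

if match:
    # group() with no argument is the whole match; it stops before ".com"
    # because the second set does not include the dot.
    print(match.group())
    print(match.group(1))  # text captured by the first pair of parentheses
    print(match.group(2))  # text captured by the second pair
```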

---
## Solve Some Queries

Let us try solving some queries that we are likely to come across while working with real-world text datasets.



### Eliminating Unwanted Terms

In [54]:
str = "@AV a Data Science community #AV!!"

# remove every character that is not a letter or a space
x = re.sub("[^a-zA-Z ]", "", str)

print(x)

AV a Data Science community AV


In [55]:
str = "@AV a Data Science community #AV!!"

# remove words that start with a special character
# \w matches any alphanumeric character or underscore
# + repeats the preceding element one or more times
x = re.sub(r"[^a-zA-Z ]\w+", "", str)

print(x)

 a Data Science community !!
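
Sometimes the tagged words are worth keeping rather than discarding. The sketch below first collects the mentions and hashtags with *re.findall()*, then removes them and the leftover punctuation in two passes; this two-step approach is one option among several, not the only way to clean such text:

```python
import re

text = "@AV a Data Science community #AV!!"

# Collect words that start with @ or #.
tags = re.findall(r"[@#]\w+", text)

# Remove the tagged words, then any remaining non-letter characters,
# and trim the spaces left at both ends.
without_tags = re.sub(r"[@#]\w+", "", text)
cleaned = re.sub(r"[^a-zA-Z ]", "", without_tags).strip()

print(tags)
print(cleaned)
```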


### Finding Email IDs

In [56]:
str = 'Send a mail to rohan.1997@gmail.com, smith_david34@yahoo.com and priya@yahoo.com about the meeting @2PM'

# \w matches any alphanumeric character or underscore
# + repeats the preceding element one or more times
x = re.findall(r'[a-zA-Z0-9._-]+@\w+\.com', str)

# Printing of List
print(x)

['rohan.1997@gmail.com', 'smith_david34@yahoo.com', 'priya@yahoo.com']
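
The pattern above only recognises addresses that end in ".com". A slightly more permissive sketch is shown below; it accepts one or more dot-separated domain parts. The pattern is an illustrative assumption rather than a full email validator, and the addresses in the sample text are made up:

```python
import re

# Hypothetical sample text with different domain endings.
text = "Write to rohan.1997@gmail.com, priya@yahoo.co.in or help@example.org."

# Local part, "@", a domain label, then one or more ".label" parts.
# (?:...) groups without capturing, so findall returns the whole match.
pattern = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"

emails = re.findall(pattern, text)
print(emails)
```

The trailing full stop of the sentence is not included in the last address because each dot in the domain must be followed by at least one word character.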
