# Regular Expressions in Python
### Table of contents:
1. **match()**
2. **search()**
3. **findall()**
4. **finditer()**
5. **sub()**
6. **split()**
7. **Groups**

In [2]:
import re

## 1. match()
`re.match()` checks for a match only at the beginning of the string. It returns a match object on success and `None` otherwise.

In [3]:
# Defining a string
string = "Tiger is the national animal of India. Tiger lives in Forest."

# Defining the pattern
pattern = "Tiger"

# Running match() on a string
result = re.match(pattern, string)

# Printing the result
print(result)

<re.Match object; span=(0, 5), match='Tiger'>


In [4]:
# Defining a string
string = "Tiger is the national animal of India. Tiger lives in Forest."

# Defining the pattern
pattern = "Tiger"

# Extracting String from a match object
result = re.match(pattern, string).group()

# Printing the result
print(result)

Tiger


In [5]:
string = "The national animal of India is Tiger. Tiger lives in Forest."
pattern = "Tiger"

# Checking for match
result = re.match(pattern, string)
print(result)

None
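Because `re.match()` returns `None` when the pattern is not found at the start, calling `.group()` directly on the result (as in the earlier cell) would raise an `AttributeError` for a non-matching string. A minimal sketch of a guard follows; the helper name `match_at_start` is hypothetical and not part of the `re` module.

```python
import re

def match_at_start(pattern, text):
    """Return the matched text if `pattern` matches at the start of `text`, else None."""
    m = re.match(pattern, text)
    # re.match() returns None on failure, so check before calling .group()
    return m.group() if m is not None else None

starts_with_tiger = match_at_start("Tiger", "Tiger is the national animal of India.")
no_match = match_at_start("Tiger", "The national animal of India is Tiger.")

print(starts_with_tiger)
print(no_match)
```

The same check applies to `re.search()`, which also returns `None` when nothing matches.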


## 2. search()
`re.search()` scans the whole string and returns a match object for the first sub-string that matches the RegEx pattern, or `None` if there is no match.

In [6]:
string = "The national animal of India is Tiger. Tiger lives in Forest."
pattern = "Tiger"

# Searching a substring using search()
result = re.search(pattern, string)
print(result)

<re.Match object; span=(32, 37), match='Tiger'>


In [9]:
string = "The national animal of India is Tiger. Tiger lives in Forest."
pattern = "Tiger"

# Extracting searched string
result = re.search(pattern, string).group()
print(result)

Tiger
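Note that `search()` stops at the first occurrence even when the pattern appears several times. The sketch below, which reuses the sample sentence from this section, reads the position of that first match through `span()`, `start()` and `end()`.

```python
import re

text = "The national animal of India is Tiger. Tiger lives in Forest."
m = re.search("Tiger", text)

# search() reports only the first occurrence, although "Tiger" appears twice
if m is not None:
    first_span = m.span()                # (start, end) of the first match
    matched = text[m.start():m.end()]    # slicing with start/end recovers the match
    print(first_span, matched)
```

To obtain every occurrence rather than just the first, `findall()` or `finditer()` is the better fit.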


## 3. findall()
`re.findall()` returns a list of all non-overlapping sub-strings that match the RegEx pattern.

In [13]:
string = "The national animal of India is Tiger. Tiger lives in Forest."
pattern = "Tiger"

# Using findall() on a string
result = re.findall(pattern, string)
print(result)

['Tiger', 'Tiger']


In [16]:
# Defining the string
text = "India got freedom on 15-08-1947, and it is celebrated as Independence Day.\
        The Indian Constitution came into effect on 26-01-1950, and it is celebrated as Republic Day."

# Defining the pattern: two digits, two digits and four digits separated by hyphens
date_pattern = r'\d{2}-\d{2}-\d{4}'

# Extracting dates using findall()
re.findall(date_pattern, text)

['15-08-1947', '26-01-1950']
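Real-world text often contains stray spaces around the separators, and a strict date pattern silently skips such dates. The following is a sketch of a more tolerant pattern under the assumption that only whitespace around the hyphens varies; the sample string `messy_text` is made up for illustration.

```python
import re

# Hypothetical sample with inconsistent spacing around the hyphens
messy_text = "Dates: 15 -08-1947, 26-01-1950 and 02- 10 -1869."

# \s* allows zero or more whitespace characters around each hyphen
tolerant_date = r'\d{2}\s*-\s*\d{2}\s*-\s*\d{4}'
raw_dates = re.findall(tolerant_date, messy_text)

# Remove the internal whitespace so every date has the same DD-MM-YYYY shape
clean_dates = [re.sub(r'\s+', '', d) for d in raw_dates]
print(clean_dates)
```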

## 4. finditer()
`re.finditer()` is similar to `findall()`, but it returns an iterator of match objects instead of a list of strings.

In [18]:
string = "The national animal of India is Tiger. Tiger lives in Forest."
pattern = "Tiger"

# Using finditer() on a string
result = re.finditer(pattern, string)
print(result)

# Iterating over the iterator
for m in result:
    # Printing match object
    print(m)
    # Printing starting and ending index with matched substring
    print('Start:',m.start(),' End:',m.end(),' Sub-string:',m.group())

<callable_iterator object at 0x7aa153c1f4f0>
<re.Match object; span=(32, 37), match='Tiger'>
Start: 32  End: 37  Sub-string: Tiger
<re.Match object; span=(39, 44), match='Tiger'>
Start: 39  End: 44  Sub-string: Tiger
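One practical consequence of getting an iterator is that it can be consumed only once. The sketch below illustrates this with the sample sentence from this section: a second pass over the same iterator yields nothing, so wrapping the call in `list()` is a reasonable choice when the matches are needed more than once.

```python
import re

text = "The national animal of India is Tiger. Tiger lives in Forest."

matches = re.finditer("Tiger", text)
first_pass = [m.span() for m in matches]
second_pass = [m.span() for m in matches]   # the iterator is already exhausted

# Materializing the iterator keeps the match objects available for reuse
stored = list(re.finditer("Tiger", text))

print(first_pass)
print(second_pass)
print(len(stored))
```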


## 5. sub()
`re.sub()` replaces every sub-string that matches the pattern with another string.

In [20]:
text = "Analytics Vidhya is the largest Analytics community of India."

# Replacing a substring using sub()
result = re.sub('India', 'the World', text)
print(result)

Analytics Vidhya is the largest Analytics community of the World.
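By default `sub()` replaces all occurrences. As a short sketch, the `count` argument limits the number of replacements, and `re.subn()` additionally reports how many were made. The sample sentence and the replacement word are chosen only for illustration.

```python
import re

text = "Tiger is the national animal of India. Tiger lives in Forest."

# count=1 replaces only the first occurrence
first_only = re.sub("Tiger", "Lion", text, count=1)

# subn() returns a (new_string, number_of_replacements) tuple
replaced, n = re.subn("Tiger", "Lion", text)

print(first_only)
print(replaced, n)
```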


## 6. split()
`re.split()` splits the text at each match of the given RegEx pattern.

In [22]:
line = "I have a big test tomorrow; I can't go out tonight."

# Splitting a string into multiple substrings at each semicolon
re.split(r'[;]', line)

['I have a big test tomorrow', " I can't go out tonight."]
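When the pattern does not occur in the text, `split()` returns the whole string as a single-element list. The sketch below uses a made-up `line` to show three variations: splitting on several delimiters at once, limiting the number of splits with `maxsplit`, and keeping the delimiters by wrapping them in a capturing group.

```python
import re

# Hypothetical sample with mixed delimiters
line = "apples; oranges, pears;grapes"

# Split on a semicolon or comma followed by optional whitespace
parts = re.split(r'[;,]\s*', line)

# maxsplit limits the number of splits; the remainder is kept intact
head_and_rest = re.split(r'[;,]\s*', line, maxsplit=1)

# A capturing group keeps the delimiters in the result
with_delims = re.split(r'([;,])\s*', line)

print(parts)
print(head_and_rest)
print(with_delims)
```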

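## 7. Groups
Parentheses in a pattern create groups, which capture parts of a match so that they can be extracted separately.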

In [24]:
# Running a simple pattern on some text
string="Ajay credited $500 to your account on 13-08-2020.\
      Anmol debited $1,700 from your account on 14-08-2020.\
      Alex debited $100 on 16-08-2020 from your account."

# Raw string so that \w, \d and \$ reach the regex engine unchanged
pattern=r"[\w]+ [\w]+ \$[\d,]+ [a-zA-Z ]+ \d{2}-\d{2}-\d{4}"

result=re.findall(pattern,string)

print(result)

['Ajay credited $500 to your account on 13-08-2020', 'Anmol debited $1,700 from your account on 14-08-2020', 'Alex debited $100 on 16-08-2020']


In [26]:
string="Ajay credited $500 to your account on 13-08-2020.\
      Anmol debited $1,700 from your account on 14-08-2020.\
      Alex debited $100 on 16-08-2020 from your account."

# Creating groups in the previous pattern: name, type, amount and date
pattern=r"([\w]+) ([\w]+) (\$[\d,]+) [a-zA-Z ]+ (\d{2}-\d{2}-\d{4})"

# With groups in the pattern, findall() returns one tuple per match
result=re.findall(pattern,string)

print(result)

[('Ajay', 'credited', '$500', '13-08-2020'), ('Anmol', 'debited', '$1,700', '14-08-2020'), ('Alex', 'debited', '$100', '16-08-2020')]


In [27]:
import pandas as pd

# Creating a dataframe with one column per group
df=pd.DataFrame(result,columns=['Name','Type','Amount','Date'])
df

|   | Name  | Type     | Amount | Date       |
|---|-------|----------|--------|------------|
| 0 | Ajay  | credited | $500   | 13-08-2020 |
| 1 | Anmol | debited  | $1,700 | 14-08-2020 |
| 2 | Alex  | debited  | $100   | 16-08-2020 |


In [28]:
# Using finditer() for getting match objects
string="Ajay credited $500 to your account on 13-08-2020.\
      Anmol debited $1,700 from your account on 14-08-2020.\
      Alex debited $100 on 16-08-2020 from your account."

pattern=r"([\w]+) ([\w]+) (\$[\d,]+) [a-zA-Z ]+ (\d{2}-\d{2}-\d{4})"

result=re.finditer(pattern,string)

# Accessing groups separately
for i in result:
    print(i.group(0),'=>',i.group(1),'=>',i.group(2),'=>',i.group(3),'=>',i.group(4))

Ajay credited $500 to your account on 13-08-2020 => Ajay => credited => $500 => 13-08-2020
Anmol debited $1,700 from your account on 14-08-2020 => Anmol => debited => $1,700 => 14-08-2020
Alex debited $100 on 16-08-2020 => Alex => debited => $100 => 16-08-2020
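Every pair of parentheses creates a numbered group, and `groups()` returns all of them at once as a tuple. When parentheses are needed only to hold an alternation together, a non-capturing group `(?:...)` avoids adding an extra entry. The sketch below assumes a simplified pattern that extracts only the name and the amount.

```python
import re

text = "Anmol debited $1,700 from your account on 14-08-2020."

# (?:credited|debited) groups the alternation without capturing it,
# so only the name and the amount appear in groups()
pattern = r'(\w+) (?:credited|debited) (\$[\d,]+)'

m = re.search(pattern, text)
captured = m.groups() if m is not None else ()
print(captured)
```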


**Note:** Syntax for naming groups: `(?P<group_name>pattern)`. The group name must be a valid Python identifier, so it cannot contain spaces.

In [29]:
string="Ajay credited $500 to your account on 13-08-2020.\
      Anmol debited $1,700 from your account on 14-08-2020.\
      Alex debited $100 on 16-08-2020 from your account."

# Naming Groups
pattern=r"(?P<Name>[\w]+) (?P<Type>[\w]+) (?P<Amount>\$[\d,]+) [a-zA-Z ]+ (?P<Date>\d{2}-\d{2}-\d{4})"

result=list(re.finditer(pattern,string))

In [30]:
# Accessing data by group names
for i in result:
    print(i.group('Name'),'=>',i.group('Amount'),'=>',i.group('Date'),'=>',i.group('Type'))

Ajay => $500 => 13-08-2020 => credited
Anmol => $1,700 => 14-08-2020 => debited
Alex => $100 => 16-08-2020 => debited


In [31]:
# Printing data with group names
for i in result:
    print(i.groupdict())

{'Name': 'Ajay', 'Type': 'credited', 'Amount': '$500', 'Date': '13-08-2020'}
{'Name': 'Anmol', 'Type': 'debited', 'Amount': '$1,700', 'Date': '14-08-2020'}
{'Name': 'Alex', 'Type': 'debited', 'Amount': '$100', 'Date': '16-08-2020'}
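The dictionaries returned by `groupdict()` hold every value as a string. As a sketch of one possible next step, the code below converts the amount to an integer and the date to a `datetime.date`. It assumes whole-dollar amounts and the DD-MM-YYYY order used in the sample text; the `records` list is introduced here only for illustration.

```python
import re
from datetime import datetime

text = "Anmol debited $1,700 from your account on 14-08-2020."
pattern = (r'(?P<Name>\w+) (?P<Type>\w+) (?P<Amount>\$[\d,]+) '
           r'[a-zA-Z ]+ (?P<Date>\d{2}-\d{2}-\d{4})')

records = []
for m in re.finditer(pattern, text):
    row = m.groupdict()
    # Strip the currency symbol and thousands separators before converting
    row['Amount'] = int(row['Amount'].replace('$', '').replace(',', ''))
    # Assumes the DD-MM-YYYY order used in the sample text
    row['Date'] = datetime.strptime(row['Date'], '%d-%m-%Y').date()
    records.append(row)

print(records)
```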


In [51]:
string="Sam started learning NLP on 02-01-2020. He created his first self project on 18-02-2019. After this, he worked hard and got an internship at ABC Pvt. Ltd. on 10-06-2019. Finally, he got his first job at XYZ Pvt. Ltd. on 22-10-2019."
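# Three groups: day, month and year
pattern = r'(\b\d{2})-(\d{2})-(\d{4}\b)'
pattern2 = r'\b\d{2}-\d{2}-\d{4}\b'

# \1 and \2 reuse the captured day and month; the year is replaced with 2020
result=re.sub(pattern, r'\1-\2-2020', string)
# findall() on the original string lists the dates before substitution
result2 = re.findall(pattern2, string)
print(result)
print(result2)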

Sam started learning NLP on 02-01-2020. He created his first self project on 18-02-2020. After this, he worked hard and got an internship at ABC Pvt. Ltd. on 10-06-2020. Finally, he got his first job at XYZ Pvt. Ltd. on 22-10-2020.
['02-01-2020', '18-02-2019', '10-06-2019', '22-10-2019']
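Backreferences in the replacement string can also refer to named groups through the `\g<name>` syntax, which tends to be easier to read than `\1` and `\2`. The sketch below reorders DD-MM-YYYY dates into YYYY-MM-DD; the group names `day`, `month` and `year` and the shortened sample text are assumptions made for this example.

```python
import re

text = "He got an internship on 10-06-2019 and his first job on 22-10-2019."

# Named groups make the replacement template easier to read
pattern = r'\b(?P<day>\d{2})-(?P<month>\d{2})-(?P<year>\d{4})\b'

# \g<name> refers to a named group inside the replacement string
iso_style = re.sub(pattern, r'\g<year>-\g<month>-\g<day>', text)
print(iso_style)
```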
