# Regular Expressions

A regular expression is a sequence of characters that forms a search pattern. When you search for data in a text, you can use this search pattern to describe what you are searching for. A regular expression can be a single character or a more complicated pattern, and it can be used to perform all kinds of text search and text replace operations.

## Importing libraries

In [1]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import re

import warnings
warnings.filterwarnings('ignore')

import matplotlib.pyplot as plt
%matplotlib inline

print(pd.__version__)
print(np.__version__)

2.1.3
1.26.1


## Basics of Regular Expressions

### 'r' expression
The **r** prefix creates a raw string. A Python raw string treats the backslash (\\) as a literal character instead of the start of an escape sequence.

In [2]:
# Normal string vs Raw string
path = "C:\\Users\\Dhruv"  #string
print('String:',path)

String: C:\Users\Dhruv


In [3]:
path= r"C:\Users\Dhruv"  #raw string
print('Raw String:', path)

Raw String: C:\Users\Dhruv
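
The difference matters most when a backslash is followed by a letter that Python itself understands. The following is a small sketch (the sample text `'this is it'` is made up for illustration) showing how a normal string silently changes a pattern before `re` ever sees it:

```python
import re

# A normal string turns "\n" into one newline character;
# a raw string keeps the backslash and the letter n as two characters.
normal = "\n"
raw = r"\n"
print(len(normal), len(raw))

# In a normal string "\b" is the backspace character, so the regex engine
# never receives the word-boundary escape. The raw string passes it through.
text = 'this is it'
no_raw = re.findall("\bis\b", text)
with_raw = re.findall(r"\bis\b", text)
print(no_raw)    # []
print(with_raw)  # ['is']
```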


### Methods of re module

It is always recommended to write patterns as raw strings while dealing with regular expressions, because patterns often contain backslashes.

#### re.match()

The match() function checks for a match only at the beginning of the string. It returns a match object on success and `None` on failure.

In [4]:
# Match a word at the beginning of a string
result_1 = re.match('Dhruv', r'Dhruv Gupta is studying from Delhi Technological University!') 
print('Result 1:', result_1)

result_2 = re.match('Delhi', r'Dhruv Gupta is studying from Delhi Technological University!') 
print('Result 2:', result_2)

Result 1: <re.Match object; span=(0, 5), match='Dhruv'>
Result 2: None


In [5]:
# Use the group() function to get the matched expression.
print(result_1.group())  # returns the entire matched text

Dhruv
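
Because `re.match()` returns `None` on failure, calling `group()` on the result without a check raises an `AttributeError`. A minimal sketch of a safe pattern follows; the helper name `first_word_is` is hypothetical and only for illustration:

```python
import re

text = 'Dhruv Gupta is studying from Delhi Technological University!'

def first_word_is(word, text):
    """Return True when `text` begins with `word` (hypothetical helper)."""
    # re.escape protects against regex metacharacters inside `word`
    return re.match(re.escape(word), text) is not None

m = re.match(r'Dhruv', text)
if m is not None:
    # group() is the matched text, span() is its (start, end) position
    print(m.group(), m.span())

print(first_word_is('Dhruv', text))
print(first_word_is('Delhi', text))
```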


#### re.search()

Scans the entire string and returns a match object for the first occurrence of the pattern, or `None` if the pattern is not found.

In [6]:
# search for the pattern "is" in a given string
result = re.search('is', r'Dhruv Gupta is studying from Delhi Technological University! Dhruv is 23 years old!')
print(result.group())

is
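
To make the contrast with `re.match()` concrete, here is a short sketch that runs both functions on the same pattern. The word 'Mumbai' is just an arbitrary example of text that is absent from the string:

```python
import re

text = 'Dhruv Gupta is studying from Delhi Technological University!'

at_start = re.match(r'Delhi', text)    # only tries position 0, so no match
anywhere = re.search(r'Delhi', text)   # scans forward to the first match

print(at_start)
print(anywhere.group(), anywhere.span())

# re.search also returns None when nothing matches, so guard before group()
missing = re.search(r'Mumbai', text)
print(missing.group() if missing else 'no match')
```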


#### re.findall()

Returns a list of all non-overlapping occurrences of the pattern in the string. Unlike **`re.search()`** and **`re.match()`**, which stop at the first match, it scans the whole string.

In [7]:
# Search for all the occurrences of the pattern 'is' in a given string
result = re.findall('is', r'Dhruv Gupta is studying from Delhi Technological University! Dhruv is 23 years old!') 
print(result)

['is', 'is']
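
`re.findall()` returns plain strings, so the positions of the matches are lost. If positions are needed, `re.finditer()` yields match objects instead. The sketch below also shows that a capturing group changes what `findall()` returns:

```python
import re

text = ('Dhruv Gupta is studying from Delhi Technological University! '
        'Dhruv is 23 years old!')

# finditer yields match objects, so the start index of each hit is available
hits = [(m.group(), m.start()) for m in re.finditer(r'is', text)]
print(hits)

# With one capturing group, findall returns the group's text,
# not the whole match
subjects = re.findall(r'(\w+) is', text)
print(subjects)  # ['Gupta', 'Dhruv']
```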


### Special sequences

#### \A sequence

Returns a match only if the specified pattern is at the very beginning of the string. This is useful when you have multiple strings of text and want to extract the first word only when it is a specific word, such as 'Dhruv'. If the string starts with any other word, `re.findall()` returns an empty list.

In [8]:
# Match the pattern at the start of a given string
result_1 = re.findall(r'\ADhruv', r'Dhruv Gupta is studying from Delhi Technological University!') 
print('Result 1:', result_1)

result_2 = re.findall(r'\ADelhi', r'Dhruv Gupta is studying from Delhi Technological University!') 
print('Result 2:', result_2)

Result 1: ['Dhruv']
Result 2: []


#### \b sequence 

Returns a match where the specified pattern is at the beginning or at the end of a word.

In [9]:
# Check if there is any word that ends with "ity"
result = re.findall(r'ity\b', r'Dhruv Gupta is studying from Delhi Technological University! This University is the best!')
print(result)

['ity', 'ity']
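
A common use of `\b` is to match a whole word rather than a fragment inside longer words. A small sketch, using a made-up sentence:

```python
import re

text = 'This is his thesis'

loose = re.findall(r'is', text)       # every 'is', even inside other words
whole = re.findall(r'\bis\b', text)   # only the standalone word 'is'

print(len(loose))
print(whole)  # ['is']
```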


#### \B sequence

Returns a match where the specified pattern is present, but NOT at the beginning (or at the end) of a word.

In [10]:
# Find every 'i' that is NOT at the beginning of a word
result = re.findall(r'\Bi', r'Dhruv Gupta is studying from Delhi Technological University!')
print(result)

['i', 'i', 'i', 'i', 'i']


#### \d and \d+ sequences

`\d` matches any single digit (0-9). `\d+` matches a run of one or more consecutive digits.

In [11]:
# Check if the string contains any digits (numbers from 0-9)
string = r'21 million cases occurred in 2020'

result_1 = re.findall(r'\d', string)

# adding '+' after '\d' keeps extending the match until a non-digit character is encountered
result_2 = re.findall(r'\d+', string)

print(result_1)
print(result_2, end = '\n')

if (result_1 and result_2):
    print('Numbers are there!')
else:
    print('No numbers!')

['2', '1', '2', '0', '2', '0']
['21', '2020']
Numbers are there!
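
The values returned by `re.findall()` are strings, so they usually need converting before any arithmetic. A minimal sketch; the second string, `'season 2020-21'`, is an invented example showing that a run of digits ends at any non-digit, not only at a space:

```python
import re

string = '21 million cases occurred in 2020'

# Convert each run of digits to an integer
numbers = [int(token) for token in re.findall(r'\d+', string)]
print(numbers)  # [21, 2020]

# A hyphen also ends a run of digits
season = re.findall(r'\d+', 'season 2020-21')
print(season)   # ['2020', '21']
```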


#### \D and \D+ sequences

`\D` matches any single character that is not a digit. `\D+` matches a run of one or more consecutive non-digit characters.

In [12]:
string = r'21 million cases occurred in 2020'

result_1 = re.findall(r'\D', string)

# adding '+' after '\D' keeps extending the match until a digit is encountered
result_2 = re.findall(r'\D+', string)

print(result_1)
print(result_2, end = '\n')

# Both lists are non-empty whenever the string has at least one non-digit character
if (result_1 and result_2):
    print('Non-digit characters are there!')
else:
    print('Only digits!')

[' ', 'm', 'i', 'l', 'l', 'i', 'o', 'n', ' ', 'c', 'a', 's', 'e', 's', ' ', 'o', 'c', 'c', 'u', 'r', 'r', 'e', 'd', ' ', 'i', 'n', ' ']
[' million cases occurred in ']
Non-digit characters are there!
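
`\D` is handy for stripping everything except digits. The sketch below uses `re.sub()`, which replaces every match with a given string; the phone number is made up for illustration:

```python
import re

raw_phone = '+91 (987) 654-3210'   # made-up number

# Replace every non-digit character with an empty string
digits_only = re.sub(r'\D', '', raw_phone)
print(digits_only)  # 919876543210
```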


#### \w and \w+ sequences 

Matches word characters only (letters a-z and A-Z, digits 0-9, and the underscore _ character).

In [13]:
string = r'21 million cases occurred in 2020'

result_1 = re.findall(r'\w', string)

# adding '+' after '\w' keeps extending the match until a non-word character is encountered
result_2 = re.findall(r'\w+', string)

print(result_1)
print(result_2, end = '\n')

['2', '1', 'm', 'i', 'l', 'l', 'i', 'o', 'n', 'c', 'a', 's', 'e', 's', 'o', 'c', 'c', 'u', 'r', 'r', 'e', 'd', 'i', 'n', '2', '0', '2', '0']
['21', 'million', 'cases', 'occurred', 'in', '2020']
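
Since the underscore counts as a word character but the hyphen does not, `\w+` splits some identifiers and keeps others whole. A quick sketch with an invented input string:

```python
import re

# '_' is a word character, '-' is not
tokens = re.findall(r'\w+', 'snake_case var-name x2')
print(tokens)  # ['snake_case', 'var', 'name', 'x2']
```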


#### \W sequence 

Returns a match at every non-word character, that is, anything other than letters, digits, and the underscore.

In [14]:
string = r'21 million cases occurred in 2020!!'

# Returns a match at every non-word character (anything other than a-z, A-Z, 0-9, and "_", such as "!", "?", and white-space)
result = re.findall(r'\W', string)
print(result)

[' ', ' ', ' ', ' ', ' ', '!', '!']


### Metacharacters

Metacharacters are characters with a special meaning.

#### (.) character

Matches any character (except newline character).

In [15]:
string = r'Dhruv Gupta is studying from Delhi Technological University!' 

# Search for "Uni" followed by one (any) character, and for "Univ" followed by three (any) characters
result_1 = re.findall('Uni.', string)
result_2 = re.findall('Univ...', string)

print(result_1)
print(result_2, end = '\n')

['Univ']
['Univers']


#### (^) character

Matches only if the pattern occurs at the start of the string.

In [16]:
string = r'Dhruv Gupta is studying from Delhi Technological University!'

# Check if the string starts with 'Dhruv'
result_1 = re.findall('^Dhruv', string)

# Check if the string starts with 'Gupta'
result_2 = re.findall('^Gupta', string)

print(result_1)
print(result_2, end = '\n')

['Dhruv']
[]


#### ($) character

Matches only if the pattern occurs at the end of the string.

In [17]:
string = r'Dhruv Gupta is studying from Delhi Technological University!'

# Check if the string ends with 'University!'
result_1 = re.findall('University!$', string)

# Check if the string ends with 'University'
result_2 = re.findall('University$', string)

print(result_1)
print(result_2, end = '\n')

['University!']
[]


#### (*) character 

Matches zero or more occurrences of the pattern to the left of it.

In [18]:
string = 'easy easssy eay ey'

# Check if the string contains 'ea' followed by 0 or more 's' characters and ending with y
result = re.findall('eas*y', string)

print(result)

['easy', 'easssy', 'eay']


#### (+) character 

Matches one or more occurrences of the pattern to the left of it.

In [19]:
string = 'easy easssy eay ey'

# Check if the string contains 'ea' followed by 1 or more 's' characters and ending with y
result = re.findall('eas+y', string)

print(result)

['easy', 'easssy']


#### (?) character 

Matches zero or one occurrence of the pattern to the left of it.

In [20]:
string = 'easy easssy eay ey'

# Check if the string contains 'ea' followed by 0 or 1 's' character and ending with y
result = re.findall('eas?y', string)

print(result)

['easy', 'eay']
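
The three quantifiers above are easier to compare side by side. This sketch runs them on the same string and adds the `{m,n}` form, which matches between m and n repetitions:

```python
import re

string = 'easy easssy eay ey'

patterns = [r'eas*y', r'eas+y', r'eas?y', r'eas{2,3}y']

# Map each pattern to the list of matches it produces
results = {p: re.findall(p, string) for p in patterns}

for pattern, matches in results.items():
    print(pattern, matches)
```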


#### (|) character

Matches either the pattern on its left or the pattern on its right.

In [21]:
string = r'Dhruv Gupta is studying from Delhi Technological University!'

# Check if the string contains either 'Dhruv' or 'Gupta'.
# Spaces around '|' would become part of the alternatives, so none are used.
result_1 = re.findall(r'Dhruv|Gupta', string)
result_2 = re.findall(r'is|New', string)

print(result_1)
print(result_2)

['Dhruv', 'Gupta']
['is']

### Set operations for regular expressions

A set is a group of characters inside a pair of square brackets [ ]. It matches any single character from that group.

In [23]:
string = r'Dhruv Gupta is studying from Delhi Technological University!'

# Check for the characters u, G, or o, in the above string
result_1 = re.findall('[uGo]', string)

# Check for the characters between A and Z, in the above string
result_2 = re.findall('[A-Z]', string)

print('Result 1:', result_1)
print('Result 2:', result_2)

Result 1: ['u', 'G', 'u', 'u', 'o', 'o', 'o']
Result 2: ['D', 'G', 'D', 'T', 'U']
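
Two details about sets are worth a quick sketch: ranges can be combined in one set, and most metacharacters such as `.`, `+`, and `*` are treated as literal characters inside the brackets. The input strings here are invented for illustration:

```python
import re

# Inside [ ], the characters . + * lose their special meaning
symbols = re.findall(r'[.+*]', '1+1=2. 2*3=6.')
print(symbols)  # ['+', '.', '*', '.']

# Two ranges combined in one set, repeated with '+'
letters = re.findall(r'[A-Za-z]+', 'Delhi 2019 DTU')
print(letters)  # ['Delhi', 'DTU']
```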


In [24]:
string = r'Dhruv Gupta is studying from Delhi Technological University since 2019. He is 23 years old now and doing a job!'

# Extract the numbers that start with a digit from 0 to 4 and have at least two digits
result = re.findall(r'\b[0-4]\d+', string)

print(result)

['2019', '23']


**[^]:** Matches any single character other than the ones listed after ^.

In [26]:
string = r'Dhruv Gupta is studying from Delhi Technological University!'

# Extract every character other than u, G, or o
result = re.findall('[^uGo]', string)

print(result)

['D', 'h', 'r', 'v', ' ', 'p', 't', 'a', ' ', 'i', 's', ' ', 's', 't', 'd', 'y', 'i', 'n', 'g', ' ', 'f', 'r', 'm', ' ', 'D', 'e', 'l', 'h', 'i', ' ', 'T', 'e', 'c', 'h', 'n', 'l', 'g', 'i', 'c', 'a', 'l', ' ', 'U', 'n', 'i', 'v', 'e', 'r', 's', 'i', 't', 'y', '!']


**[a-zA-Z0-9]:** Matches any single alphanumeric character. With a leading ^ and an added space, as used below, the set matches any character that is neither alphanumeric nor a space.

In [27]:
string = r'!!Dhruv Gupta is studying from @Delhi &&Technological %#University!'

# Extract words that start with a special character
result = re.findall(r"[^a-zA-Z0-9 ]\w+", string)

print(result)

['!Dhruv', '@Delhi', '&Technological', '#University']


### Extracting Email IDs



In [30]:
string = 'Send a mail to dhruvg029@gmail.com with dhruvgupta948@gmail.com as cc.'
  
# [a-zA-Z0-9._-]+ matches one or more of the allowed characters before '@'
# \w+ matches the domain name, and \.com matches a literal '.com'
result = re.findall(r'[a-zA-Z0-9._-]+@\w+\.com', string)

print(result) 

['dhruvg029@gmail.com', 'dhruvgupta948@gmail.com']
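
The pattern above only finds addresses that end in `.com`. A somewhat more general sketch follows. It is an approximation, not a full validator of the email standard, and the addresses and the name `EMAIL` are invented for illustration:

```python
import re

# local part, '@', one or more dot-separated domain labels,
# then a final label of at least two letters
EMAIL = re.compile(
    r'[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}'
)

text = ('Write to first.last@example.org or admin@mail.example.co.in, '
        'not to user@localhost.')

emails = EMAIL.findall(text)
print(emails)
```

Compiling the pattern once with `re.compile()` is a convenience when the same pattern is reused on many strings.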


### Extracting Dates

In [33]:
string = 'Today is 09-07-2024 and I was born on 06-07-2001. So that is a huge difference.'

# '\d{2}' matches exactly two digits and '\d{4}' exactly four; '.' matches any single separator character
result = re.findall(r'\d{2}.\d{2}.\d{4}', string)

print(result)

['09-07-2024', '06-07-2001']
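
Because `.` matches any character, including a digit, the pattern above can also pick up text that is not a date. The sketch below compares it with a stricter variant that requires literal hyphens and word boundaries. The order number in the sample text is made up:

```python
import re

text = 'Order 12345678901 shipped on 09-07-2024'

loose = re.findall(r'\d{2}.\d{2}.\d{4}', text)
strict = re.findall(r'\b\d{2}-\d{2}-\d{4}\b', text)

print(loose)   # ['1234567890', '09-07-2024']
print(strict)  # ['09-07-2024']
```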


In [34]:
string = 'Today is 09 July 2024 and I was born on 06 July 2001. So that is a huge difference.'

# '\w{4}' matches exactly four word characters, so this only fits four-letter month names such as 'July'
result = re.findall(r'\d{2}.\w{4}.\d{4}', string)

print(result)

['09 July 2024', '06 July 2001']


In [35]:
# Extract dates with varying lengths
string = 'Paris Olympics will happen from 26 July 2024 to 11 August 2024.'

# '\w{3,10}' matches between 3 and 10 word characters
result = re.findall(r'\d{2}.\w{3,10}.\d{4}', string)

print(result)

['26 July 2024', '11 August 2024']
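
A regex only checks the shape of a date, not whether it is a real one. One possible follow-up, sketched here, is to pass each candidate to `datetime.strptime()`. This assumes English month names, since `%B` depends on the active locale:

```python
import re
from datetime import datetime

string = 'Paris Olympics will happen from 26 July 2024 to 11 August 2024.'

# Two digits, a month name of 3 to 9 letters, then a four-digit year
candidates = re.findall(r'\b\d{2} [A-Za-z]{3,9} \d{4}\b', string)

# strptime raises ValueError for strings that are not valid dates
parsed = [datetime.strptime(c, '%d %B %Y').date() for c in candidates]

print(candidates)
print(parsed)
```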


## Example of Regex in a real world dataset

In [36]:
# load dataset
data = pd.read_csv('datasets/titanic.csv')

# Check the data
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [37]:
# Print a few passenger names
data['Name'].head()

0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                             Allen, Mr. William Henry
Name: Name, dtype: object

### Extract the title from the names 

#### Method 1: Use split on the pandas dataframe and extract the title

In [38]:
name = 'Allen, Mr. William Henry'
split_name = name.split('.')

print(split_name)

['Allen, Mr', ' William Henry']


In [39]:
# Split the first value with ','
print(split_name[0].split(','))

['Allen', ' Mr']


In [43]:
# Apply the same two splits to every name and strip the leading space
titles = data['Name'].apply(lambda x: x.split('.')[0].split(',')[1].strip())

# Check the data
titles.head()

0       Mr
1      Mrs
2     Miss
3      Mrs
4       Mr
Name: Name, dtype: object

#### Method 2: Use Regex to extract titles

In [46]:
def title(name):
    # The title is a run of word characters followed by a literal '.'; return the first such match
    return re.findall(r'\w+\.', name)[0]

titles = data['Name'].apply(title)
titles.head()

0      Mr.
1     Mrs.
2    Miss.
3     Mrs.
4      Mr.
Name: Name, dtype: object
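
A possible variation, sketched below in plain Python so it runs without the dataset, uses a capturing group to return the title without the trailing dot and returns `None` instead of raising an error when no title is found. The names `TITLE` and `extract_title` are hypothetical, and the last two entries in `names` are invented to show a multi-word title and a missing title:

```python
import re

# After the comma: optional spaces, then everything up to the first '.'
TITLE = re.compile(r',\s*([^.]+)\.')

def extract_title(name):
    """Return the title between ',' and '.', or None if absent."""
    m = TITLE.search(name)
    return m.group(1).strip() if m else None

names = [
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
    'Smith, the Countess. of Somewhere',   # invented multi-word title
    'no title here',                       # invented, has no title
]

titles = [extract_title(n) for n in names]
print(titles)
```

With a pandas Series, the same function could be passed to `apply`, as in the cells above.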