# Regular Expressions

# Introduction To Regular Expressions:

## What are Regular Expressions?
A regular expression defines a search pattern for text as a sequence of characters. Patterns range from simple ones, such as a specific word, to complex ones, such as validating email addresses or phone numbers.<br>

Regular expressions (regex) are a powerful tool for matching text patterns. They are used across various programming environments, including Python and its libraries like Pandas, for tasks such as searching, replacing, and parsing text.<br>

Reference Material: https://www.w3schools.com/python/python_regex.asp


## RegEx Module

Python has a built-in package called `re`, which can be used to work with Regular Expressions.

In [1]:
import re # library for Regular Expressions
import pandas as pd

In [2]:
# An Example in Python
text = "Summer is beautiful in Rochester"
x = re.findall("^Summer.*Rochester$",text) # pattern: the string starts with "Summer" and ends with "Rochester"
print(x)

['Summer is beautiful in Rochester']


In [3]:
# A Pandas Example
df = pd.DataFrame({'text': ['bat', 'cat', 'rat']})

# Filtering rows that match a regex
filtered_df = df[df['text'].str.contains('^[bc]at$')]
print(filtered_df)

  text
0  bat
1  cat


## RegEx Functions

The `re` module offers a set of functions that allow us to search a string for a match:<br>
* `findall` : Returns a list containing all matches.
* `search` : Returns a *Match object* if there is a match anywhere in the string.
* `split` : Returns a list where the string has been split at each match.
* `sub` : Replaces one or many matches with a string.

Examples follow after we cover some basic elements of regular expressions.

## Meta Characters
Meta characters are characters with a special meaning.

![class 7](meta_characters.png)
Source: W3Schools

In [4]:
txt = "We are studying regular expresssions in python and the expressions are funky"
print(f'characters from q-z: {re.findall("[q-z]",txt)}')
print(f'Any character \".\" : {re.findall("r..",txt)}')
print(f'Specified number of occurrences : {re.findall("expre.{2}ions",txt)}')
print(f'Either: {re.findall("expressions|expresssion",txt)}')

characters from q-z: ['r', 's', 't', 'u', 'y', 'r', 'u', 'r', 'x', 'r', 's', 's', 's', 's', 'y', 't', 't', 'x', 'r', 's', 's', 's', 'r', 'u', 'y']
Any character "." : ['re ', 'reg', 'r e', 'res', 'res', 're ']
Specified number of occurrences : ['expressions']
Either: ['expresssion', 'expressions']


In [5]:
m = re.findall("^We",txt)
if m:
    print("Text starts with We")

m = re.findall("python$",txt)
if m:
    print("The text ends with Python")  # not printed: txt ends with "funky", not "python"

Text starts with We


In [6]:
txt = "Do you know the top 5, 10, 100, 1000 trends in the market?"
print(f'Find 0 or more occurrences: {re.findall("o*",txt)}')
print(f'Find 1 or more occurrences: {re.findall("10+",txt)}')
print(f'Find 0 or 1 occurrence: {re.findall("tr?",txt)}')

Find 0 or more occurrences: ['', 'o', '', '', 'o', '', '', '', '', 'o', '', '', '', '', '', '', '', 'o', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
Find 1 or more occurrences: ['10', '100', '1000']
Find 0 or 1 occurrence: ['t', 't', 'tr', 't', 't']
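The meta characters above can be combined in one pattern. The following is a minimal sketch on a made-up sample string (`sample` and the variable names are illustrative, not part of the original notebook); it also shows that a backslash turns a meta character such as `.` back into a literal character.

```python
import re

sample = "cat cot coat c.t"  # hypothetical sample text

# "." matches any single character (except a newline)
any_char = re.findall(r"c.t", sample)

# "\." escapes the dot, so only a literal period matches
literal_dot = re.findall(r"c\.t", sample)

# "{1,2}" allows one or two repetitions of the preceding set
one_or_two = re.findall(r"c[oa]{1,2}t", sample)

# "|" matches either alternative
either = re.findall(r"cat|coat", sample)

print(any_char)     # ['cat', 'cot', 'c.t']
print(literal_dot)  # ['c.t']
print(one_or_two)   # ['cat', 'cot', 'coat']
print(either)       # ['cat', 'coat']
```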


## Sets

A set is a group of characters inside a pair of square brackets `[]`; it matches any one of those characters.<br>

![class 7](sets.png)
Source: W3Schools


In [7]:
txt = "I stay in apartment 3A,90 Bedford St, New York, NY 10014 "
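print(re.findall("[apy]", txt))        # any one of a, p, or y
print(re.findall("[^ apy]", txt))      # any character except space, a, p, or y
print(re.findall("[0-5]", txt))        # any digit from 0 to 5
print(re.findall("[a-mB-Y]", txt))     # a lowercase letter a-m or an uppercase letter B-Y
print(re.findall("[0-5][1-4]", txt))   # a digit 0-5 followed by a digit 1-4
print(re.findall("[0-3][A-Z]", txt))   # a digit 0-3 followed by an uppercase letter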

['a', 'y', 'a', 'p', 'a']
['I', 's', 't', 'i', 'n', 'r', 't', 'm', 'e', 'n', 't', '3', 'A', ',', '9', '0', 'B', 'e', 'd', 'f', 'o', 'r', 'd', 'S', 't', ',', 'N', 'e', 'w', 'Y', 'o', 'r', 'k', ',', 'N', 'Y', '1', '0', '0', '1', '4']
['3', '0', '1', '0', '0', '1', '4']
['I', 'a', 'i', 'a', 'a', 'm', 'e', 'B', 'e', 'd', 'f', 'd', 'S', 'N', 'e', 'Y', 'k', 'N', 'Y']
['01']
['3A']
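Sets become more useful when combined with a repetition count. As a rough sketch on the same address (the patterns are illustrative only; for instance, five digits in a row is just an approximation of a US ZIP code):

```python
import re

address = "I stay in apartment 3A,90 Bedford St, New York, NY 10014 "

# Five consecutive digits: a rough match for a ZIP code
zip_codes = re.findall(r"[0-9]{5}", address)

# Two consecutive uppercase letters, such as a state abbreviation
states = re.findall(r"[A-Z]{2}", address)

# Anything that is not a letter, a digit, or a space (here: the commas)
punctuation = re.findall(r"[^0-9A-Za-z ]", address)

print(zip_codes)    # ['10014']
print(states)       # ['NY']
print(punctuation)  # [',', ',', ',']
```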


## Special Sequences

A special sequence is a `\` followed by one of the characters in the list below, and has a special meaning.

![class 7](special_sequence.png)

Source: W3Schools

In [8]:
txt = "Meet me at 1234 Main St. at 12:00 pm."
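print(re.search(r'\AMeet', txt).string)  # \A: match only at the start of the string
print(re.findall(r'\bme\b', txt))        # \b: word boundary, so "me" as a whole word
print(re.findall(r'\BSt\b', txt))        # \B: not a word boundary; empty because "St" starts a word
print(re.findall(r'\d', txt))            # \d: digits
print(re.findall(r'\D', txt))            # \D: non-digits
print(re.findall(r'\s', txt))            # \s: whitespace characters
print(re.findall(r'\S', txt))            # \S: non-whitespace characters
print(re.findall(r'\w', txt))            # \w: word characters (letters, digits, underscore)
print(re.findall(r'\W', txt))            # \W: non-word characters
print(re.findall(r'pm.\Z', txt))         # \Z: match only at the end of the string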

Meet me at 1234 Main St. at 12:00 pm.
['me']
[]
['1', '2', '3', '4', '1', '2', '0', '0']
['M', 'e', 'e', 't', ' ', 'm', 'e', ' ', 'a', 't', ' ', ' ', 'M', 'a', 'i', 'n', ' ', 'S', 't', '.', ' ', 'a', 't', ' ', ':', ' ', 'p', 'm', '.']
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']
['M', 'e', 'e', 't', 'm', 'e', 'a', 't', '1', '2', '3', '4', 'M', 'a', 'i', 'n', 'S', 't', '.', 'a', 't', '1', '2', ':', '0', '0', 'p', 'm', '.']
['M', 'e', 'e', 't', 'm', 'e', 'a', 't', '1', '2', '3', '4', 'M', 'a', 'i', 'n', 'S', 't', 'a', 't', '1', '2', '0', '0', 'p', 'm']
[' ', ' ', ' ', ' ', ' ', '.', ' ', ' ', ':', ' ', '.']
['pm.']
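Special sequences are usually combined rather than used one at a time. A small sketch on the same sentence, assuming the time is always written as `hh:mm am/pm` (the variable names are illustrative):

```python
import re

txt = "Meet me at 1234 Main St. at 12:00 pm."

# A time such as "12:00 pm": digits, a colon, digits, whitespace, am/pm
times = re.findall(r"\b\d{1,2}:\d{2}\s[ap]m\b", txt)

# A run of exactly four digits bounded by word boundaries
street_numbers = re.findall(r"\b\d{4}\b", txt)

# Every whole word (a run of word characters)
words = re.findall(r"\b\w+\b", txt)

print(times)           # ['12:00 pm']
print(street_numbers)  # ['1234']
print(words)           # ['Meet', 'me', 'at', '1234', 'Main', 'St', 'at', '12', '00', 'pm']
```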


# RegEx Functions - How to use them.

## findall()

The findall() function in Python is used to find all occurrences of a pattern in a string. It returns a list of all matches found. If the pattern is not found, it returns an empty list.

In [9]:
txt = "I want to watch Deadpool and wolverine this weekend."
x = re.findall("wa",txt)
print(x)

address = "90 Bedford St, New York, NY 10014"
x = re.findall(r'\d+',address)
print(x)

['wa', 'wa']
['90', '10014']
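One detail worth knowing: when the pattern contains capture groups (parentheses), `findall()` returns the groups instead of the whole match. A short sketch on the same address string (variable names are illustrative):

```python
import re

address = "90 Bedford St, New York, NY 10014"

# No group: each item is the whole match
whole = re.findall(r"[A-Z]{2} \d{5}", address)

# One group: each item is the text captured by that group
zip_only = re.findall(r"[A-Z]{2} (\d{5})", address)

# Several groups: each item is a tuple with one entry per group
pairs = re.findall(r"([A-Z]{2}) (\d{5})", address)

print(whole)     # ['NY 10014']
print(zip_only)  # ['10014']
print(pairs)     # [('NY', '10014')]
```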


## search()
Searches the string for a match, and returns a `Match object` if there is a match.
If there is more than one match, only the first occurrence of the match will be returned.

The Match object has properties and methods used to retrieve information about the search, and the result:
* `span()` returns a tuple containing the start and end positions of the match.
* `string` returns the string passed into the function.
* `group()` returns the part of the string where there was a match.


In [10]:
txt = "classic burgers are nice but cheese burgers are amazing."
x = re.search("burger",txt)
if x:
    print("text:", x.group()) # try replacing "burger" with a word that does not exist in the string
    print(x.span())
    print(x.string)

numbers = "New York, NY 10014"
x = re.search(r'\d+',numbers)
if x:
    print("number:", x.group())

print(x) #match object

text: burger
(8, 14)
classic burgers are nice but cheese burgers are amazing.
number: 10014
<re.Match object; span=(13, 18), match='10014'>
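The Match object also exposes capture groups and positions, and `search()` returns `None` when nothing matches, which is why the result is checked with `if x:` above. A minimal sketch (the pattern and variable names are illustrative):

```python
import re

txt = "classic burgers are nice but cheese burgers are amazing."

# Capture the word that comes right before the first "burgers"
m = re.search(r"(\w+) burgers", txt)
whole_match = m.group(0)   # the full matched text
first_kind = m.group(1)    # the text captured by the first group
position = (m.start(), m.end())

# No match: search() returns None, so guard before calling .group()
missing = re.search("pizza", txt)

print(whole_match)  # classic burgers
print(first_kind)   # classic
print(position)     # (0, 15)
print(missing)      # None
```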


## match()
The match() function in Python's re module checks whether a regex pattern matches at the beginning of a string. It returns a Match object if it does, or `None` if it doesn't. Unlike search(), which looks for a match anywhere in the string, match() only checks at the start.

In [11]:
txt = "classic burgers are nice but cheese burgers are amazing."
x = re.match("classic",txt)
if x:
    print("text begins with:", x.group()) 

numbers = "123 New York, NY 10014"
x = re.match(r'\d+',numbers)
if x:
    print("number at start:", x.group()) # remove 123 from the string and check the answer

text begins with: classic
number at start: 123


Key Differences between search() and match()

1. Position Sensitivity:
    * **re.search()** looks through the whole string for a match.
    * **re.match()** only checks from the beginning of the string.

2. Common Use:
    * **re.search()** is used when the location of the match in the string is not important.
    * **re.match()** is used when the match must be anchored at the start of the string.
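The difference can be seen by running the same patterns through both functions. The sketch below also uses `re.fullmatch()`, which is not covered above and is included only for comparison: it requires the whole string to match.

```python
import re

txt = "123 New York, NY 10014"

# match(): anchored at the start of the string
start_number = re.match(r"\d+", txt).group()

# search(): finds the first match anywhere in the string
zip_code = re.search(r"\d{5}", txt).group()

# match() fails here because the string does not start with five digits
no_match = re.match(r"\d{5}", txt)

# fullmatch(): the entire string must match the pattern
whole_string = re.fullmatch(r"\d+", txt)
only_digits = re.fullmatch(r"\d+", "10014").group()

print(start_number)  # 123
print(zip_code)      # 10014
print(no_match)      # None
print(whole_string)  # None
print(only_digits)   # 10014
```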

## split()

Returns a list where the string has been split at each match.

In [12]:
txt = "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of light, it was the season of darkness, it was the spring of hope, it was the winter of despair."
x = re.split(',',txt)
print(x)

x = re.split(',',txt,maxsplit=2) # controlling the number of splits
print(x)

['It was the best of times', ' it was the worst of times', ' it was the age of wisdom', ' it was the age of foolishness', ' it was the epoch of belief', ' it was the epoch of incredulity', ' it was the season of light', ' it was the season of darkness', ' it was the spring of hope', ' it was the winter of despair.']
['It was the best of times', ' it was the worst of times', ' it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of light, it was the season of darkness, it was the spring of hope, it was the winter of despair.']
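Because the separator is itself a regular expression, it can absorb the whitespace that follows each comma, so the pieces come back without leading spaces. A minimal sketch on a shortened version of the text:

```python
import re

txt = "It was the best of times, it was the worst of times, it was the age of wisdom"

# Split on a comma followed by any amount of whitespace
parts = re.split(r",\s*", txt)

# maxsplit limits the number of splits; the remainder stays in the last item
first_split = re.split(r",\s*", txt, maxsplit=1)

print(parts)
# ['It was the best of times', 'it was the worst of times', 'it was the age of wisdom']
print(first_split)
# ['It was the best of times', 'it was the worst of times, it was the age of wisdom']
```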


## sub()

The sub() function replaces the matches with the text of your choice.

In [13]:
txt = "I shop at Wegmans"
x = re.sub(r"\s","0",txt) # raw string so that \s reaches the regex engine unchanged
print(x)

x = re.sub(r"\s","0",txt,count=1) # controlling the number of substitutions
print(x)

I0shop0at0Wegmans
I0shop at Wegmans
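The replacement string can also refer back to capture groups with `\1`, `\2`, and so on. The sketch below first collapses repeated whitespace and then reorders a name; both input strings are made up for illustration.

```python
import re

# Collapse any run of whitespace into a single space
messy = "I   shop  at Wegmans"
collapsed = re.sub(r"\s+", " ", messy)

# Swap "Last, First" into "First Last" using backreferences to the groups
name = "Hungerford, Timothy"
swapped = re.sub(r"(\w+), (\w+)", r"\2 \1", name)

print(collapsed)  # I shop at Wegmans
print(swapped)    # Timothy Hungerford
```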


# RegEx in Pandas

To use regular expressions (regex) with Pandas, you typically use the `.str` accessor of a Series with methods such as:<br>
`str.contains()`<br>
`str.extract()`<br> 
`str.findall()`<br>
`str.replace()`<br> 
`str.match()`<br>
`str.count()`<br>


In [14]:
df = pd.read_excel("simon_fall_2024.xlsx")

## str.contains()

The str.contains() function in Pandas is used to determine if each string in a Series contains a particular substring or pattern, often defined by a regular expression. It returns a Series of boolean values (True or False), indicating whether the pattern is found in each string.

In [15]:
# na=False treats missing meeting patterns as "no match" before converting to 0/1
df['Mon'] = df['Meeting Patterns'].str.contains(r'\bMon', na=False).astype(int)
df['Tue'] = df['Meeting Patterns'].str.contains(r'\bTues', na=False).astype(int)
df['Wed'] = df['Meeting Patterns'].str.contains(r'\bWed', na=False).astype(int)
df['Thu'] = df['Meeting Patterns'].str.contains(r'\bThurs', na=False).astype(int)

In [16]:
df

Unnamed: 0,Course Subjects,Academic Period,Section,Instructors,Format,Delivery Mode,Meeting Patterns,Times,Course Tags,Units,Mon,Tue,Wed,Thu
0,Accounting,Fall A 2024 Simon,ACC 401-11A - Corporate Financial Accounting,Timothy Hungerford,Lecture,In-Person,Mon/Wed,8:40 AM - 10:10 AM,,2.5,1,0,1,0
1,Accounting,Fall A 2024 Simon,ACC 401-12A - Corporate Financial Accounting,Timothy Hungerford,Lecture,In-Person,Mon/Wed,10:30 AM - 12:00 PM,,2.5,1,0,1,0
2,Accounting,Fall A 2024 Simon,ACC 401-13A - Corporate Financial Accounting,Vivek Pandey,Lecture,In-Person,Tues/Thurs,10:30 AM - 12:00 PM,,2.5,0,1,0,1
3,Accounting,Fall A 2024 Simon,ACC 401-14A - Corporate Financial Accounting,Vivek Pandey,Lecture,In-Person,Tues/Thurs,8:40 AM - 10:10 AM,,2.5,0,1,0,1
4,Accounting,Fall A 2024 Simon,ACC 410-11A - Managerial Accounting and Perfor...,Sudarshan Jayaraman,Lecture,In-Person,Tues/Thurs,10:30 AM - 12:00 PM,STEM,2.5,0,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
131,Competitive and Organization Strategy,Fall B 2024 Simon,STR 423-14B - Pricing Policies,Greg Shaffer,Lecture,In-Person,Tues/Thurs,1:40 PM - 3:10 PM,STEM,2.5,0,1,0,1
132,Competitive and Organization Strategy,Fall B 2024 Simon,STR 423-Lab14 - Pricing Policies-Lab,,Laboratory,In-Person,Thursday,3:30 PM - 4:30 PM,STEM,0.0,0,0,0,1
133,Competitive and Organization Strategy,Fall B 2024 Simon,STR 424-80B - Human Resource Strategy,Barry Friedman,Lecture,Hybrid,Wednesday,5:40 PM - 9:00 PM,,2.5,0,0,1,0
134,Competitive and Organization Strategy,Fall A 2024 Simon,STR 427-80A - Organizational Behavior,Barry Friedman,Lecture,Hybrid,Wednesday,5:40 PM - 9:00 PM,,2.5,0,0,1,0
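Since the Excel file may not be at hand, here is a self-contained sketch of the same idea on a small, made-up Series standing in for the `Meeting Patterns` column. The `na=False` argument makes missing values count as "no match".

```python
import pandas as pd

# Hypothetical stand-in for the 'Meeting Patterns' column
patterns = pd.Series(["Mon/Wed", "Tues/Thurs", "Wednesday", None])

# True where a word starting with "Wed" appears; missing values become False
has_wed = patterns.str.contains(r"\bWed", na=False)

# Convert the booleans to 0/1 indicator values
wed_flag = has_wed.astype(int)

print(has_wed.tolist())   # [True, False, True, False]
print(wed_flag.tolist())  # [1, 0, 1, 0]
```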


## str.extract()

The str.extract() function in Pandas is used to extract specific patterns from each string in a Series using regular expressions. It captures the matched groups in the pattern and returns them as a DataFrame. This function is particularly useful when you need to extract specific parts of a string, such as dates, phone numbers, or any other structured information.

In [17]:
df['Start Time'] = df['Times'].str.extract(r'(\d{1,2}:\d{2}\s[AP]M)')  # first time in the string, e.g. "8:40 AM"

In [18]:
df['First Name'] = df['Instructors'].str.extract(r'^(\w+)')
df['Last Name'] = df['Instructors'].str.extract(r'(\w+)$')
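With more than one capture group, `str.extract()` returns one column per group, and named groups (`(?P<name>...)`) become the column names. A minimal sketch on made-up values in the same format as the `Times` column (the group names `start` and `end` are chosen here for illustration):

```python
import pandas as pd

# Hypothetical values in the same format as the 'Times' column
times = pd.Series(["8:40 AM - 10:10 AM", "5:40 PM - 9:00 PM"])

# Two named groups give a DataFrame with columns 'start' and 'end'
time_pattern = r"(?P<start>\d{1,2}:\d{2}\s[AP]M)\s-\s(?P<end>\d{1,2}:\d{2}\s[AP]M)"
parts = times.str.extract(time_pattern)

print(list(parts.columns))      # ['start', 'end']
print(parts["start"].tolist())  # ['8:40 AM', '5:40 PM']
print(parts["end"].tolist())    # ['10:10 AM', '9:00 PM']
```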

## str.findall()

The str.findall() function in Pandas is used to find all occurrences of a pattern in each string of a Series using regular expressions. It returns a Series of lists, where each list contains all the matches found in the corresponding string.

In [19]:
df['Instructors Initials'] = df['Instructors'].str.findall(r'\b\w').apply(lambda x: ''.join(x) if isinstance(x, list) else '')
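To see the intermediate result, the sketch below runs the same pattern on a small made-up Series. As an alternative to `apply` with a lambda, it joins each list with `str.join()` and fills missing values afterwards; this is one possible variant, not the only way to do it.

```python
import pandas as pd

# Hypothetical instructor names, including a missing value
names = pd.Series(["Timothy Hungerford", "Vivek Pandey", None])

# Each element becomes a list with the first character of every word
initial_lists = names.str.findall(r"\b\w")

# Join each list into one string; missing values stay missing until filled
initials = initial_lists.str.join("").fillna("")

print(initial_lists.tolist()[:2])  # [['T', 'H'], ['V', 'P']]
print(initials.tolist())           # ['TH', 'VP', '']
```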

## str.replace()

The str.replace() function in Pandas is used to replace occurrences of a specified pattern (or substring) within each string in a Series. It can be used with a literal string or a regular expression; since pandas 2.0 the pattern is treated as a literal string unless `regex=True` is passed.

In [20]:
df['Format'] = df['Format'].str.replace('Laboratory', 'Lab', case=False, regex=True)

In [21]:
df.head(5)

Unnamed: 0,Course Subjects,Academic Period,Section,Instructors,Format,Delivery Mode,Meeting Patterns,Times,Course Tags,Units,Mon,Tue,Wed,Thu,Start Time,First Name,Last Name,Instructors Initials
0,Accounting,Fall A 2024 Simon,ACC 401-11A - Corporate Financial Accounting,Timothy Hungerford,Lecture,In-Person,Mon/Wed,8:40 AM - 10:10 AM,,2.5,1,0,1,0,8:40 AM,Timothy,Hungerford,TH
1,Accounting,Fall A 2024 Simon,ACC 401-12A - Corporate Financial Accounting,Timothy Hungerford,Lecture,In-Person,Mon/Wed,10:30 AM - 12:00 PM,,2.5,1,0,1,0,10:30 AM,Timothy,Hungerford,TH
2,Accounting,Fall A 2024 Simon,ACC 401-13A - Corporate Financial Accounting,Vivek Pandey,Lecture,In-Person,Tues/Thurs,10:30 AM - 12:00 PM,,2.5,0,1,0,1,10:30 AM,Vivek,Pandey,VP
3,Accounting,Fall A 2024 Simon,ACC 401-14A - Corporate Financial Accounting,Vivek Pandey,Lecture,In-Person,Tues/Thurs,8:40 AM - 10:10 AM,,2.5,0,1,0,1,8:40 AM,Vivek,Pandey,VP
4,Accounting,Fall A 2024 Simon,ACC 410-11A - Managerial Accounting and Perfor...,Sudarshan Jayaraman,Lecture,In-Person,Tues/Thurs,10:30 AM - 12:00 PM,STEM,2.5,0,1,0,1,10:30 AM,Sudarshan,Jayaraman,SJ
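The difference between a literal and a regex replacement is easier to see on a small example. The Series below is made up for illustration and is not taken from the course data.

```python
import pandas as pd

# Hypothetical values for a 'Format'-like column
fmt = pd.Series(["Lecture", "Laboratory", "laboratory session"])

# Literal replacement: only the exact, case-sensitive text is replaced
literal = fmt.str.replace("Laboratory", "Lab", regex=False)

# Regex replacement: a word starting with "lab" at the beginning, ignoring case
by_pattern = fmt.str.replace(r"^lab\w*", "Lab", case=False, regex=True)

print(literal.tolist())     # ['Lecture', 'Lab', 'laboratory session']
print(by_pattern.tolist())  # ['Lecture', 'Lab', 'Lab session']
```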


## str.match()

Checks whether each string in the Series/Index matches a regular expression at its beginning, like `re.match()`, and returns a boolean Series.

In [22]:
df['Section'].str.match(r'^ACC')

0       True
1       True
2       True
3       True
4       True
       ...  
131    False
132    False
133    False
134    False
135    False
Name: Section, Length: 136, dtype: bool
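A side-by-side comparison may help here. The sketch below uses a small made-up Series and also calls `str.fullmatch()`, which is not covered above and is shown only for contrast: it requires the entire string to match.

```python
import pandas as pd

# Hypothetical section labels
sections = pd.Series(["ACC 401-11A", "STR 423-14B", "MBA ACC elective"])

# str.match(): the pattern must match at the start of each string
starts_with_acc = sections.str.match(r"ACC")

# str.contains(): the pattern may match anywhere in each string
contains_acc = sections.str.contains(r"ACC")

# str.fullmatch(): the whole string must match the pattern
is_section_code = sections.str.fullmatch(r"[A-Z]{3} \d{3}-\d{2}[AB]")

print(starts_with_acc.tolist())  # [True, False, False]
print(contains_acc.tolist())     # [True, False, True]
print(is_section_code.tolist())  # [True, True, False]
```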

## str.count()

Counts occurrences of a pattern in each string in the Series/Index.

In [23]:
df['Section'].str.count(r'a')

0      3
1      3
2      3
3      3
4      6
      ..
131    0
132    2
133    2
134    4
135    2
Name: Section, Length: 136, dtype: int64
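The argument to `str.count()` is a regular expression, so it can count classes of characters as well as single letters. A minimal sketch on made-up section labels:

```python
import pandas as pd

# Hypothetical section labels
sections = pd.Series(["ACC 401-11A", "STR 423-Lab14", "FIN 402"])

# Count the digits in each string
digit_counts = sections.str.count(r"\d")

# Count the uppercase letters in each string
upper_counts = sections.str.count(r"[A-Z]")

print(digit_counts.tolist())  # [5, 5, 3]
print(upper_counts.tolist())  # [4, 4, 3]
```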

## str.find()
Returns the index of the first occurrence of a substring in each string. If not found, returns -1. Unlike the methods above, it takes a literal substring, not a regular expression.

In [24]:
df['Format'].str.find('e')

0      1
1      1
2      1
3      1
4      1
      ..
131    1
132   -1
133    1
134    1
135    1
Name: Format, Length: 136, dtype: int64

## str.rfind()
Returns the index of the last occurrence of a substring in each string. If not found, returns -1. Like `str.find()`, it takes a literal substring, not a regular expression.

In [25]:
df['Format'].str.rfind('e')

0      6
1      6
2      6
3      6
4      6
      ..
131    6
132   -1
133    6
134    6
135    6
Name: Format, Length: 136, dtype: int64
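That both methods work on literal substrings can be checked with a character that would be special in a regex. In the sketch below (values made up for illustration), `"."` is searched for as an actual period rather than as "any character".

```python
import pandas as pd

# Hypothetical values; the last one contains a literal period
fmt = pd.Series(["Lecture", "Lab", "a.b"])

first_e = fmt.str.find("e")     # index of the first "e", or -1
last_e = fmt.str.rfind("e")     # index of the last "e", or -1
first_dot = fmt.str.find(".")   # "." is a plain character here, not a regex

print(first_e.tolist())    # [1, -1, -1]
print(last_e.tolist())     # [6, -1, -1]
print(first_dot.tolist())  # [-1, -1, 1]
```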