# Regular Expressions

In [1]:
import re

Search string or pattern

In [2]:
txt = "Across the Universe"
'''
Check if the string starts with (^) the word "Across" and ends with ($)
the letter "e". The .* is for any other characters.
'''

x = re.search("^Across.*e$", txt)

if x:
  print("Yes! We have a match!")
else:
  print("No match")

Yes! We have a match!


Split sentence

In [3]:
txt = "Across the Universe"

# Split the string at every run of whitespace characters:
print(re.split(r"\s+", txt))

# Split the string at the first whitespace character only:
print(re.split(r"\s", txt, maxsplit=1))

['Across', 'the', 'Universe']
['Across', 'the Universe']


Find pattern using findall

In [4]:
txt = 'The heart is a bloom, shoots up through the stony ground'
print(re.findall("oo", txt))

['oo', 'oo']


Substitute pattern

In [5]:
txt = "But in the end, it doesn't even matter"
print(re.sub("doesn't even", "does really", txt))

But in the end, it does really matter


Use raw string for regex

In [6]:
# normal string vs raw string
#string
path = ("C:\\Desktop\nigel")
print("string:",path)

# In the normal string above, \n in \nigel is interpreted as a newline.
# Always use raw strings when writing regular expressions.

#raw string
path = (r"C:\Desktop\nigel")
print("raw string:",path)

string: C:\Desktop
igel
raw string: C:\Desktop\nigel
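
The same issue affects regex patterns themselves. As a minimal sketch (the sample sentence is made up for illustration), the snippet below shows that "\b" in a normal string is a backspace character, so the word-boundary pattern only works when written as a raw string:

```python
import re

text = "the other theme, the end"

# In a normal string "\b" is the backspace character, so the regex engine
# never sees the word-boundary escape and nothing matches.
plain = re.findall("\bthe\b", text)

# In a raw string the backslashes reach the regex engine unchanged,
# so \b works as a word boundary.
raw = re.findall(r"\bthe\b", text)

# Doubling the backslash in a normal string gives the same pattern
# as the raw string.
same_pattern = "\\d+" == r"\d+"

print(plain)
print(raw)
print(same_pattern)
```

Only the raw-string pattern finds the two standalone occurrences of "the".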


Regex example to remove all numbers from a text string

In [7]:
string_with_numbers = "Hello 2023 World 123"

cleaned_string = re.sub(r"\d+", "", string_with_numbers) 
print(cleaned_string)

Hello  World 


The regex pattern \d+ matches one or more digits. By replacing those matches with an empty string "", we have removed all the numbers from the original string.

Some key things to note about this regex:
1. \d matches any digit character 
2. + means match one or more repetitions of the preceding element
3. r"" defines a raw string literal so backslashes don't need to be escaped
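
As a small follow-up sketch, the snippet below contrasts \d with \d+ on the same string. The final whitespace clean-up is an extra step added here for illustration, since removing the numbers leaves a doubled space and a trailing space behind.

```python
import re

string_with_numbers = "Hello 2023 World 123"

# \d matches a single digit, so each digit is a separate match.
single_digits = re.findall(r"\d", string_with_numbers)

# \d+ matches each run of consecutive digits as one match.
digit_runs = re.findall(r"\d+", string_with_numbers)

# Remove the numbers, then collapse whitespace runs and strip the ends
# (the clean-up is an added step, not part of the original example).
no_numbers = re.sub(r"\d+", "", string_with_numbers)
tidy = re.sub(r"\s+", " ", no_numbers).strip()

print(single_digits)
print(digit_runs)
print(repr(tidy))
```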

Additional regex tips for effective data cleaning:
1. Use anchors (^ and $) to match the start and end of strings 
2. Square brackets [] define a character class: a set or range of characters, any one of which may match
3. Parentheses () group parts of the pattern 
4. The pipe | acts as an OR operator
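
The four tips above can be sketched as follows. All sample strings, including the e-mail address, are made up for illustration.

```python
import re

# Anchors: ^ and $ force the whole string to match, not just part of it.
is_year = re.search(r"^\d{4}$", "2023") is not None
not_year = re.search(r"^\d{4}$", "year 2023") is not None

# Square brackets: a character class matching any one listed character.
vowels = re.findall(r"[aeiou]", "regex")

# Parentheses: capture groups that can be read back separately.
match = re.search(r"(\w+)@(\w+)\.com", "contact: nigel@example.com")
user, domain = match.groups()

# Pipe: alternation, matching either the left or the right side.
pets = re.findall(r"cat|dog", "one cat, two dogs, one bird")

print(is_year, not_year)
print(vowels)
print(user, domain)
print(pets)
```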

### Clean invalid characters
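Stray special characters such as ^, $, and & can cause issues in data analysis.
Regex lets you state exactly which characters to keep: the pattern below keeps letters, digits, spaces, newlines, and periods, and removes everything else.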

In [8]:
string = ")Th@is$ is# an^ inv?alid &te/`/'xt *and! it (n[e|e/ds *cl&ea,n*****ing}"
cleaned = re.sub(r'[^a-zA-Z0-9 \n.]', '', string)
print(cleaned)

This is an invalid text and it needs cleaning
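
When the same clean-up runs over many strings, one option is to compile the pattern once and wrap it in a function. This is a minimal sketch: the names INVALID_CHARS and clean_text and the sample strings are hypothetical, and the allowed set (letters, digits, spaces, newlines, periods) mirrors the cell above.

```python
import re

# Compile once: match any character that is NOT an ASCII letter, digit,
# space, newline, or period.
INVALID_CHARS = re.compile(r"[^a-zA-Z0-9 \n.]")


def clean_text(text):
    """Remove every character outside the allowed set above."""
    return INVALID_CHARS.sub("", text)


# Made-up sample strings for illustration.
samples = ["He@llo, wor#ld!", "pri$ce: 9.99", "plain text"]
cleaned_samples = [clean_text(s) for s in samples]
print(cleaned_samples)
```

Note that the period is kept, so decimal values such as 9.99 survive the cleaning.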


Extract substrings

In [9]:
phone = "My phone no. is 987-666-5432132443!" 
number = re.search(r'\d{3}-\d{3}-\d{4}', phone).group()
print(number)

987-666-5432
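
re.search returns None when nothing matches, so calling .group() directly raises an AttributeError on text without a phone number. A minimal sketch of a safer variant follows; the helper name extract_phone and the extra sample strings are assumptions for illustration.

```python
import re

PHONE = re.compile(r"\d{3}-\d{3}-\d{4}")


def extract_phone(text):
    """Return the first phone-like substring, or None if there is none."""
    match = PHONE.search(text)
    return match.group() if match else None


found = extract_phone("My phone no. is 987-666-5432132443!")
missing = extract_phone("No number here")

# findall returns every non-overlapping match instead of only the first.
all_numbers = PHONE.findall("Call 123-456-7890 or 555-000-1111.")

print(found)
print(missing)
print(all_numbers)
```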


Extract timestamps, for example from logfiles

In [10]:
from datetime import datetime

timelog = "21/May/2023:12:14:11" 

timestamp = re.search(r'(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2})', timelog).group(1)
print(datetime.strptime(timestamp, "%d/%b/%Y:%H:%M:%S"))

2023-05-21 12:14:11
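
In a real logfile the timestamp usually sits inside a longer line, and some lines may have none. The sketch below applies the same pattern to several lines. The log lines are made up in a common access-log style, and %b assumes English month abbreviations (the default C locale).

```python
import re
from datetime import datetime

# Hypothetical log lines, invented for this sketch.
log_lines = [
    '127.0.0.1 - - [21/May/2023:12:14:11 +0000] "GET / HTTP/1.1" 200',
    '127.0.0.1 - - [21/May/2023:12:15:42 +0000] "GET /about HTTP/1.1" 404',
    "malformed line without a timestamp",
]

TIMESTAMP = re.compile(r"\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2}")

timestamps = []
for line in log_lines:
    match = TIMESTAMP.search(line)
    if match:  # skip lines that contain no timestamp
        timestamps.append(
            datetime.strptime(match.group(), "%d/%b/%Y:%H:%M:%S")
        )

for ts in timestamps:
    print(ts)
```

The timezone offset is ignored here. Parsing it as well would need a wider pattern and the %z format code.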
