# What is Regex?

# A regular expression (regex) is a sequence of characters that defines a search pattern. It’s commonly used for string manipulation tasks like searching, matching, and replacing.

# Put simply, a regular expression matches a pattern in text so that key information can be retrieved from it.

# Tip: For anyone starting out in NLP, regular expressions are the first thing I would advise learning. Many NLP problems can be solved entirely with regular expressions, and in those cases no machine learning is needed.

# Use Cases: Commonly used for validating input, searching and replacing substrings, and extracting information from text.

# In Python, isdecimal() is a string method that checks whether all the characters in a string are decimal characters. Decimal characters are those in the Unicode "Decimal_Number" (Nd) category: the digits 0 to 9 and their equivalents in other scripts.

# Returns True if all characters in the string are decimal characters.
# Returns False if the string contains non-decimal characters or is empty.
# Works only on strings, not on other data types.

In [1]:
def isPhoneNumber(text):
    if len(text) != 12:
        return False
    for i in range(0, 3):
        if not text[i].isdecimal():
            return False
    if text[3] != '-':
        return False
    for i in range(4, 7):
        if not text[i].isdecimal():
            return False
    if text[7] != '-':
        return False
    for i in range(8, 12):
        if not text[i].isdecimal():
            return False
    return True

In [7]:
print(isPhoneNumber("(415) 555-1234"))  # False: wrong length and layout
print(isPhoneNumber("415-555-1234"))    # True
print(isPhoneNumber("415 555-1234"))    # False: space instead of '-'

False
True
False


In [6]:
# Example 1: All decimal characters
print("12345".isdecimal()) 
# Example 2: Contains non-decimal characters
print("123.45".isdecimal())  
# Example 3: Contains letters
print("123abc".isdecimal())  
# Example 4: Empty string
print("".isdecimal())  
# Example 5: Letters and operator symbols are not decimal characters
print("x**2".isdecimal())

True
False
False
False
False


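# Drawbacks of the hand-written check: it is tedious to write a block like this for every pattern.

# It is also inflexible: a valid number in another format, such as (415) 555-1234, is rejected.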
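Both drawbacks can be addressed with a pattern. The following is a minimal sketch, not the notebook's own code: the helper name `is_phone_number_regex` and the second pattern `FLEXIBLE` are assumptions introduced for illustration. It uses `re.fullmatch` so the entire string has to fit the pattern, and `[0-9]` rather than `\d` so that only ASCII digits are accepted.

```python
import re

# Pattern for ddd-ddd-dddd, ASCII digits only
DASHED = re.compile(r"[0-9]{3}-[0-9]{3}-[0-9]{4}")
# Hypothetical extension: the area code may also be written as "(ddd) "
FLEXIBLE = re.compile(r"(?:[0-9]{3}-|\([0-9]{3}\) )[0-9]{3}-[0-9]{4}")


def is_phone_number_regex(text, pattern=DASHED):
    """Return True when the whole string matches the given phone pattern."""
    return pattern.fullmatch(text) is not None


print(is_phone_number_regex("415-555-1234"))              # True
print(is_phone_number_regex("(415) 555-1234"))            # False
print(is_phone_number_regex("(415) 555-1234", FLEXIBLE))  # True
```

Supporting another format now means changing one pattern rather than adding another block of index checks.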

In [9]:
import re
random_text='''
Dr Ahmed's phone number is 123-88271672, call me if you have any questions on subject matters. 
Call CWNU if you have any questions related to school matters contact person contact (999)-333-7777
'''

# In Python, the re module provides support for working with regex.

# The re.findall() function returns every non-overlapping occurrence of a pattern in a string as a list. The next cells use it to extract digits, starting with a single digit (\d).

# Common Functions in re:
# re.match(): Checks for a match only at the beginning of the string.
# re.search(): Searches the entire string for the first match.
# re.findall(): Returns all matches as a list.
# re.sub(): Replaces matches with a specified string.
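The difference between these four functions is easiest to see when they are applied to the same string. A small sketch, assuming a made-up sample sentence and variable names:

```python
import re

sample = "Room 101 is next to room 202."

# re.match only succeeds if the pattern matches at position 0
at_start = re.match(r"\d+", sample)        # None: the string starts with "Room"
# re.search scans forward and returns the first match anywhere in the string
first = re.search(r"\d+", sample)          # match object for "101"
# re.findall returns every non-overlapping match as a list of strings
all_numbers = re.findall(r"\d+", sample)   # ['101', '202']
# re.sub returns a new string with every match replaced
masked = re.sub(r"\d+", "###", sample)

print(at_start)
print(first.group())
print(all_numbers)
print(masked)
```

`re.match` and `re.search` return `None` when nothing matches, so their result should be checked before calling `.group()`.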

In [10]:
pattern=r'\d'
matches=re.findall(pattern,random_text)
matches

['1',
 '2',
 '3',
 '8',
 '8',
 '2',
 '7',
 '1',
 '6',
 '7',
 '2',
 '9',
 '9',
 '9',
 '3',
 '3',
 '3',
 '7',
 '7',
 '7',
 '7']

# Extracting two consecutive digits

In [15]:

pattern=r'\d\d'
matches=re.findall(pattern,random_text)
matches

['12', '88', '27', '16', '72', '99', '33', '77', '77']

# Extracting exactly three consecutive digits

In [16]:

pattern=r'\d{3}'
matches=re.findall(pattern,random_text)
matches

['123', '882', '716', '999', '333', '777']

# Extracting a phone number in the format ddd-ddd-dddd (type 1)

In [21]:
import re
phoneNumRegex = re.compile(r'\d{3}-\d{3}-\d{4}')
mo = phoneNumRegex.search('My number is 415-555-4242.')
print('Phone number found:', mo.group())

Phone number found: 415-555-4242
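The cell above calls `mo.group()` directly, which raises `AttributeError` if the text contains no phone number, because `search()` then returns `None`. A minimal sketch of a guarded version, assuming a hypothetical helper name `find_phone`:

```python
import re

phone_num_regex = re.compile(r"\d{3}-\d{3}-\d{4}")


def find_phone(text):
    """Return the first ddd-ddd-dddd number in text, or None if there is none."""
    mo = phone_num_regex.search(text)
    # search() returns None when nothing matches, so check before using the match
    return mo.group() if mo else None


print(find_phone("My number is 415-555-4242."))  # 415-555-4242
print(find_phone("No number here."))             # None
```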


<!-- | Symbol               | Meaning                            |
| -------------------- | ---------------------------------- |
| `.`                  | Any character (except newline)     |
| `\d`                 | Digit (0-9)                        |
| `\D`                 | Not a digit                        |
| `\w`                 | Word character (a-z, A-Z, 0-9, \_) |
| `\W`                 | Not a word character               |
| `\s`                 | Whitespace                         |
| `\S`                 | Non-whitespace                     |
| `^` / `$`            | Start / End of string              |
| `*`, `+`, `?`, `{n}` | Quantifiers                        |
| `[]`                 | Character classes                  |
| `()`                 | Grouping                           |
 -->

# Special Characters
.: Matches any character except a newline.

^: Anchors the match at the start of a string.

$: Anchors the match at the end of a string.

*: Matches zero or more of the preceding element.

+: Matches one or more of the preceding element.

?: Matches zero or one of the preceding element (optional).

{n}: Matches exactly n occurrences of the preceding element.

{n,}: Matches n or more occurrences.

{n,m}: Matches between n and m occurrences.
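The quantifiers and anchors listed above can be compared side by side. This sketch uses a made-up word list and applies each pattern with `re.fullmatch`, so a word only counts if the whole word fits:

```python
import re

words = ["ct", "cat", "caat", "caaat"]

# Pattern -> meaning of the quantifier applied to the letter 'a'
patterns = {
    "ca*t": "zero or more 'a'",
    "ca+t": "one or more 'a'",
    "ca?t": "zero or one 'a'",
    "ca{2}t": "exactly two 'a'",
    "ca{2,}t": "two or more 'a'",
    "ca{1,2}t": "one to two 'a'",
}

results = {p: [w for w in words if re.fullmatch(p, w)] for p in patterns}
for p, meaning in patterns.items():
    print(f"{p:10} {meaning:20} {results[p]}")

# ^ and $ tie a search to the start and end of the string
starts_with_cat = bool(re.search(r"^cat", "catalog"))  # True
ends_with_cat = bool(re.search(r"cat$", "catalog"))    # False

# . matches any single character except a newline
dot_matches = re.findall(r"c.t", "cat cot c\nt")       # ['cat', 'cot']
print(starts_with_cat, ends_with_cat, dot_matches)
```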

# Character Classes
[abc]: Matches any single character inside the brackets (a, b, or c).

[^abc]: Matches any character not in the brackets.

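\d: Matches any digit (equivalent to [0-9] for ASCII text; with str patterns it also matches decimal digits from other scripts unless re.ASCII is set).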

\D: Matches any non-digit.

\w: Matches any word character (alphanumeric + underscore).

\W: Matches any non-word character.

\s: Matches any whitespace character (spaces, tabs, newlines).

\S: Matches any non-whitespace character.
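A short sketch of the character classes above, run against made-up sample strings. The last two lines illustrate that, for str patterns, `\d` also matches decimal digits outside 0-9 unless the `re.ASCII` flag is passed:

```python
import re

sample = "Order #42: 3 apples, 2_kg\tbag"

vowels = re.findall(r"[aeiou]", "regex")       # ['e', 'e']
non_vowels = re.findall(r"[^aeiou]", "regex")  # ['r', 'g', 'x']
digits = re.findall(r"\d", sample)             # ['4', '2', '3', '2']
words = re.findall(r"\w+", sample)             # runs of letters, digits, underscore
spaces = re.findall(r"\s", sample)             # four spaces and one tab
non_digits = re.findall(r"\D+", "a1b22c")      # ['a', 'b', 'c']

# "\u0663" is ARABIC-INDIC DIGIT THREE, a Unicode decimal digit
unicode_digits = re.findall(r"\d", "7\u0663")
ascii_digits = re.findall(r"\d", "7\u0663", re.ASCII)

print(vowels, non_vowels, digits)
print(words)
print(spaces, non_digits)
print(unicode_digits, ascii_digits)
```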

# Advanced Features
Grouping and Capturing

Parentheses (): Create groups for capturing and applying quantifiers.

Example: (abc)+ matches "abc", "abcabc", etc.

Named Groups: Use (?P<name>...) to define a named group.
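As an illustration of named groups, the sketch below labels the three parts of a phone number. The group names `area`, `exchange`, and `line` are assumptions chosen for this example:

```python
import re

phone = re.compile(r"(?P<area>\d{3})-(?P<exchange>\d{3})-(?P<line>\d{4})")
mo = phone.search("My number is 415-555-4242")

# Named groups can be read by name, by position, or all at once as a dict
area = mo.group("area")
same_area = mo.group(1)
parts = mo.groupdict()

# A quantifier after a group repeats the whole group as a unit
repeated = re.fullmatch(r"(abc)+", "abcabc") is not None

print(area, same_area)
print(parts)
print(repeated)
```

Names make the code that consumes a match easier to read than positional indexes, especially when a pattern has many groups.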

# Alternation
|: Acts as a logical OR operator.

Example: cat|dog matches either "cat" or "dog".

# Lookaheads and Lookbehinds

Lookahead (?=...): Assert that what follows matches the pattern.

Negative Lookahead (?!...): Assert that what follows does not match the pattern.

Lookbehind (?<=...): Assert that what precedes matches the pattern.

Negative Lookbehind (?<!...): Assert that what precedes does not match the pattern.
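Lookarounds check the surrounding text without including it in the match. A minimal sketch of all four forms, assuming a made-up price string:

```python
import re

prices = "apple $30, pear 45 USD, fig $7"

# Lookbehind: digits preceded by "$"; the "$" itself is not part of the match
after_dollar = re.findall(r"(?<=\$)\d+", prices)         # ['30', '7']
# Lookahead: digits followed by " USD"
before_usd = re.findall(r"\d+(?= USD)", prices)          # ['45']
# Negative lookbehind: digits not preceded by "$" or by another digit
no_dollar = re.findall(r"(?<![\$\d])\d+", prices)        # ['45']
# Negative lookahead: words starting with "foo" that do not continue with "bar"
not_foobar = re.findall(r"\bfoo(?!bar)\w*", "foobar foobaz food")

print(after_dollar, before_usd, no_dollar, not_foobar)
```

In Python's `re` module a lookbehind pattern must have a fixed width, so `(?<=\$)` is allowed but `(?<=\$+)` is not.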

In [9]:
# Match 3 digits followed by a word
re.search(r'\d{3} \w+', 'Call 415 Bob')  # returns a Match object for '415 Bob'

<re.Match object; span=(5, 12), match='415 Bob'>

In [10]:
import re
phoneRegex = re.compile(r'(\d{3})-(\d{3})-(\d{4})')
mo = phoneRegex.search('My number is 415-555-4242')
print(mo.group(1))  # 415
print(mo.group(2))  # 555
print(mo.group(3))  # 4242
print(mo.groups())  # ('415', '555', '4242')

415
555
4242
('415', '555', '4242')


In [11]:
heroRegex = re.compile(r'Batman|Superman')
mo = heroRegex.search('Batman and Superman saved the day.')
print(mo.group())  # Batman

Batman


In [12]:
batRegex = re.compile(r'Bat(wo)?man')
print(batRegex.search('The Adventures of Batman').group())   # Batman
print(batRegex.search('The Adventures of Batwoman').group()) # Batwoman


Batman
Batwoman


# Matching a Pattern

In [None]:

import re

pattern = r"\d+"  # Matches one or more digits
text = "I have 2 apples and 10 oranges."

match = re.search(pattern, text)
if match:
    print("Found:", match.group())  # Output: Found: 2

# Finding All Matches

In [None]:
import re
pattern = r"\d+"  # Matches one or more digits
text = "I have 2 apples and 10 oranges."

matches = re.findall(pattern, text)
print("All Matches:", matches)  # Output: All Matches: ['2', '10']

# Replacing Text

In [37]:
import re

pattern = r"\d+"  # Matches one or more digits
text = "I have 2 apples and 10 oranges."

result = re.sub(pattern, "many", text)
print("Replaced Text:", result) 

Replaced Text: I have many apples and many oranges.


In [26]:
text = '''
Note 1 - Summary of Significant Accounting Policies
Unauditad Interen Financial Statements
The consolidated financial statements of Tesla, Inc. (Tesla the Company, we us" or "our), including the consolidated balance sheet as of March 31, 2024, the consolidated statements of operations, the consobdated statements of comprehensive income, the
consolidated statements of redeemable noncontrolling interests and equity, and the consolidated statements of cash flows for the three months ended March 31, 2024 and 2023, as well as other information disclosed in the accompanying notes, are unaudited. The consolidated
balance sheet as of December 31, 2023 was derived from the audited consolidated financial statements as of that date. The interim consolidated financial statements and the accompanying notes should be read in conjunction with the annual consolidated financial statements and the
accompanying notes contained in our Annual Report on Form 10-K for the year ended December 31, 2023 The interim consolidated financial statements and the accompanying notes have been prepared on the same basas as the annual consolidated financial statements and, in the opinion of management, reflect all adjustments, which mclude only normal recurring adjustments,
necessary for a fair statement of the results of operations for the periods presented. The consolidated results of operations for any interim period are not necessarily indicative of the results to be expected for the full your or for any other future years or interim periods
Note 2 - Fair Value of Financial Instruments
should be used in ASC 820, Fair Value Measurements ("ASC 820") states that fair value is an exat price, representing the amount that would be recerved to sell an asset or paid to transfer a liability in an orderly transaction between market participants. As such, fair value is a market-based measurement that should be determined 1 based based on on assumptions that market participants would use in pricing an asset or a liability. lability The three-tiered fair value hierarchy. hierarchy, which which prioritizes pr which hich inputs measuring fair value, is comprised of (Level 1) observable inputs such am quoted prices in active markets, (Level II) imputs other than quoted prices in active maricets that are observable either directly or indirectly and (Level III) unobservable inputs for which there is little or no market data. The fair value hierarchy requires the use of observable market data when available in determining fair value. Our assets and liabilities that were measured at fair value on a recurring basis were as follows (in millions)
Note: 3 - I am going to writing a summary on financial policies
Note 8 - Equity Incentive Plans
Other Performance-Based Grants
From time to time, the Compensation Committee of our Board of Directors grants certain employees performance-based restricted stock units and stock options
As of March 31, 2024, we had unrecognized stock-based compensation expense of $613 million under these grants to purchase or receive an aggregate 5.0 million shares of our common stock For awards probable of achievement, we estimate the unrecognized stock-based compensation expense of $104 million will be recognized over a weighted-average period of 4.8 vears.
For the three months ended March 31, 2024 and 2023, stock-based compensation expense related to these grants, net of forfeitures, ware immaterial.
'''

In [35]:
# extracting note numbers
# 'Note: 3 ' is not matched because of the colon after "Note"
pattern=r'Note \d '
matches=re.findall(pattern, text)
matches

['Note 1 ', 'Note 2 ', 'Note 8 ']

# https://www.kaggle.com/code/krishd123/regular-expressions-in-python

# Online regex tester: https://regex101.com/

In [14]:
ChengduText = '''Awais and his family visited Chengdu City on 12/25/2024 and again on 01/01/2025. 
 They posted on social media using hashtags like #HolidayTrip and #CTUAdventures.
 Their contact numbers are 415-555-1234, (212) 666-9999, and 202.333.4444 x567.
 Emails like awais.ahmed@example.com and familyAdd@example.com and familyAdd@example-1.com were listed in the directory.
 They browsed https://travel.example.com and https://www.tripadvisor.com/ for planning. 
 Awais's SSN was mistakenly visible as 123-45-6789 in the shared file.
 They mentioned their backup plan via email to admin@company.net.
 Conclusion: "It was very cold and the food food was amazing!"
 The event took place at: <div class="event">Winter Gala</div> on December 31.''' 

In [18]:
mobileNum = re.findall(r"\d{3}-\d{3}-\d{4}",ChengduText)
print(mobileNum)

['415-555-1234']


In [21]:
p = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
emails = re.findall(p, ChengduText)
print(emails)


['awais.ahmed@example.com', 'familyAdd@example.com', 'familyAdd@example-1.com', 'admin@company.net']


In [22]:
import re

ChengduText = """
Awais and his family visited Chengdu City on 12/25/2024 and again on 01/01/2025. 
They posted on social media using hashtags like #HolidayTrip and #CTUAdventures.
Their contact numbers are 415-555-1234, (212) 666-9999, and 202.333.4444 x567.
Emails like awais.ahmed@example.com and familyAdd@example.com and familyAdd@example-1.com were listed in the directory.
They browsed https://travel.example.com and https://www.tripadvisor.com/ for planning. 
Awais's SSN was mistakenly visible as 123-45-6789 in the shared file.
They mentioned their backup plan via email to admin@company.net.
Conclusion: "It was very cold and the food food was amazing!"
The event took place at: <div class="event">Winter Gala</div> on December 31.
"""

# Q1: US phone numbers xxx-xxx-xxxx
q1 = re.findall(r'\b\d{3}-\d{3}-\d{4}\b', ChengduText)
print("Q1:", q1)

# Q2: All email addresses
q2 = re.findall(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]+', ChengduText)
print("Q2:", q2)

# Q3: Dates in MM/DD/YYYY
q3 = re.findall(r'\b\d{2}/\d{2}/\d{4}\b', ChengduText)
print("Q3:", q3)

# Q4: URLs
q4 = re.findall(r'https?://[^\s]+', ChengduText)
print("Q4:", q4)

# Q5: Extract area code and number
q5 = re.findall(r'(\d{3})-(\d{3}-\d{4})', ChengduText)
print("Q5:", q5)

# Q6: Capitalized words
q6 = re.findall(r'\b[A-Z][a-zA-Z]+\b', ChengduText)
print("Q6:", q6)

# Q7: Hashtags
q7 = re.findall(r'#\w+', ChengduText)
print("Q7:", q7)

# Q8: Password validation regex (not extractable from this text)
q8 = r'^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[@$!%*?&])[A-Za-z\d@$!%*?&]{8,}$'
print("Q8:", q8)

# Q9: Phone numbers with extensions
q9 = re.findall(r'\d{3}\.\d{3}\.\d{4}\sx\d+', ChengduText)
print("Q9:", q9)

# Q10: Multiple phone number formats
q10 = re.findall(r'\d{3}-\d{3}-\d{4}|\(\d{3}\)\s\d{3}-\d{4}|\d{3}\.\d{3}\.\d{4}', ChengduText)
print("Q10:", q10)

# Q11: Repeated words
q11 = re.findall(r'\b(\w+)\s+\1\b', ChengduText)
print("Q11:", q11)

# Q12: Replace all email domains with @example.com
q12 = re.sub(r'@[\w.-]+\.[a-zA-Z]+', '@example.com', ChengduText)
print("Q12:\n", q12)

# Q13: Remove all HTML tags
q13 = re.sub(r'<.*?>', '', ChengduText)
print("Q13:\n", q13)

# Q14: SSNs
q14 = re.findall(r'\b\d{3}-\d{2}-\d{4}\b', ChengduText)
print("Q14:", q14)

# Q15: Extract email domain only
q15 = re.findall(r'@([\w.-]+\.[a-zA-Z]+)', ChengduText)
print("Q15:", q15)

# Q16: IPv4 (not matched, only regex)
q16 = r'\b(?:\d{1,3}\.){3}\d{1,3}\b'
print("Q16:", q16)

# Q17: CSV row format regex
q17 = r'"[^"]*",\d+,"[^"]*"'
print("Q17:", q17)

Q1: ['415-555-1234']
Q2: ['awais.ahmed@example.com', 'familyAdd@example.com', 'familyAdd@example-1.com', 'admin@company.net']
Q3: ['12/25/2024', '01/01/2025']
Q4: ['https://travel.example.com', 'https://www.tripadvisor.com/']
Q5: [('415', '555-1234')]
Q6: ['Awais', 'Chengdu', 'City', 'They', 'HolidayTrip', 'CTUAdventures', 'Their', 'Emails', 'They', 'Awais', 'SSN', 'They', 'Conclusion', 'It', 'The', 'Winter', 'Gala', 'December']
Q7: ['#HolidayTrip', '#CTUAdventures']
Q8: ^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[@$!%*?&])[A-Za-z\d@$!%*?&]{8,}$
Q9: ['202.333.4444 x567']
Q10: ['415-555-1234', '(212) 666-9999', '202.333.4444']
Q11: ['food']
Q12:
 
Awais and his family visited Chengdu City on 12/25/2024 and again on 01/01/2025. 
They posted on social media using hashtags like #HolidayTrip and #CTUAdventures.
Their contact numbers are 415-555-1234, (212) 666-9999, and 202.333.4444 x567.
Emails like awais.ahmed@example.com and familyAdd@example.com and familyAdd@example.com were listed in the dir
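Q8, Q16, and Q17 only print their patterns because ChengduText contains nothing for them to match. The sketch below applies the same three patterns to made-up inputs; the sample strings and variable names are assumptions, not part of the exercise text.

```python
import re

password_re = re.compile(
    r'^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[@$!%*?&])[A-Za-z\d@$!%*?&]{8,}$'
)
ipv4_re = re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b')
csv_row_re = re.compile(r'"[^"]*",\d+,"[^"]*"')

# Q8: each lookahead demands one character type somewhere in the string
strong = bool(password_re.match("Passw0rd!"))   # True
weak = bool(password_re.match("password"))      # False: no uppercase, digit, or symbol

# Q16: the pattern checks the shape only, so an octet such as 256 still matches
ips = ipv4_re.findall("Servers: 192.168.0.1 and 10.0.0.256")

# Q17: a quoted field, an integer, and another quoted field
row_ok = bool(csv_row_re.fullmatch('"Alice",30,"Chengdu"'))
row_bad = bool(csv_row_re.fullmatch('Alice,30,Chengdu'))

print(strong, weak)
print(ips)
print(row_ok, row_bad)
```

Because Q16 accepts any one to three digits per octet, a real validator would need an extra range check on each number.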

In [None]:
# Input Validation
# Check if an email address is valid: ^[\w\.-]+@[\w\.-]+\.[a-z]{2,}$
# Validate phone numbers: ^\d{3}-\d{3}-\d{4}$
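The two validation patterns above can be wrapped in small helper functions. This is a sketch under stated assumptions: the function names `is_valid_email` and `is_valid_phone` and the test inputs are hypothetical, and the email pattern is the simplified one from the cell above rather than a full address validator.

```python
import re

EMAIL_RE = re.compile(r"^[\w\.-]+@[\w\.-]+\.[a-z]{2,}$")
PHONE_RE = re.compile(r"^\d{3}-\d{3}-\d{4}$")


def is_valid_email(value):
    """True if value fits the simplified email pattern."""
    return EMAIL_RE.match(value) is not None


def is_valid_phone(value):
    """True if value has the form ddd-ddd-dddd."""
    return PHONE_RE.match(value) is not None


print(is_valid_email("admin@company.net"))   # True
print(is_valid_email("admin@company"))       # False: no dot-suffix
print(is_valid_email("Admin@Example.COM"))   # False: [a-z] rejects an uppercase suffix
print(is_valid_phone("415-555-1234"))        # True
print(is_valid_phone("415-555-12345"))       # False: $ rejects the extra digit
```

Compiling `EMAIL_RE` with `re.IGNORECASE` would accept the uppercase suffix if that is the desired behaviour.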

In [None]:
# Text Processing
# Extract URLs from text: https?://[^\s]+
# Find a character repeated twice in a row (e.g. the "oo" in "food"): (\w)\1
# Search and Replace
# Use the sub() method to replace matched patterns in strings.
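A short sketch of these text-processing patterns in use, assuming made-up sample sentences. It also shows two forms of replacement that `re.sub` accepts: a string containing a backreference, and a function that receives each match.

```python
import re

sentence = "The committee will see the balloon soon."

# Keep the words that contain a doubled character, using (\w)\1
doubled = [w for w in re.findall(r"\w+", sentence) if re.search(r"(\w)\1", w)]

# Backreference in the replacement: collapse a repeated word into one copy
deduped = re.sub(r"\b(\w+)\s+\1\b", r"\1", "the food food was amazing")

# Function as the replacement: double every number found in the text
scaled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "2 apples and 10 oranges")

print(doubled)
print(deduped)
print(scaled)
```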

In [36]:
# import re, pyperclip

# phoneRegex = re.compile(r'''(
#     (\d{3}|\(\d{3}\))?              # area code
#     (\s|-|\.)?                      # separator
#     \d{3}                           # first 3 digits
#     (\s|-|\.)                       # separator
#     \d{4}                           # last 4 digits
#     (\s*(ext\.?|x)\s*\d{2,5})?      # extension
# )''', re.VERBOSE)

# emailRegex = re.compile(r'''(
#     [a-zA-Z0-9._%+-]+      # username
#     @                      # @ symbol
#     [a-zA-Z0-9.-]+         # domain name
#     (\.[a-zA-Z]{2,4})      # dot-something
# )''', re.VERBOSE)

# text = pyperclip.paste()
# matches = []
# for groups in phoneRegex.findall(text):
#     matches.append(groups[0])
# for groups in emailRegex.findall(text):
#     matches.append(groups[0])

# if matches:
#     pyperclip.copy('\n'.join(matches))
#     print('Copied to clipboard:')
#     print('\n'.join(matches))
# else:
#     print('No phone numbers or email addresses found.')
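The commented-out script above depends on the third-party pyperclip package and on the clipboard contents. As a sketch of the same idea without that dependency, the variant below takes the text as a function argument. The function name `extract_contacts` and the sample string are assumptions, and the groups are made non-capturing so that `findall` returns whole matches instead of tuples.

```python
import re

phone_regex = re.compile(r'''
    (?:\d{3}|\(\d{3}\))?              # optional area code
    (?:\s|-|\.)?                      # optional separator
    \d{3}                             # first 3 digits
    (?:\s|-|\.)                       # separator
    \d{4}                             # last 4 digits
    (?:\s*(?:ext\.?|x)\s*\d{2,5})?    # optional extension
''', re.VERBOSE)

email_regex = re.compile(r'''
    [a-zA-Z0-9._%+-]+      # username
    @                      # @ symbol
    [a-zA-Z0-9.-]+         # domain name
    \.[a-zA-Z]{2,4}        # dot-something
''', re.VERBOSE)


def extract_contacts(text):
    """Return all phone numbers, then all email addresses, found in text."""
    return phone_regex.findall(text) + email_regex.findall(text)


sample = "Call 415-555-1234 or (212) 666-9999, or write to admin@company.net."
contacts = extract_contacts(sample)
print("\n".join(contacts) if contacts else "No phone numbers or email addresses found.")
```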