<p style="text-align: center; font-size:24px;"><b>Regular Expressions</b></p>


# Literals
Literals are the most straightforward way to search with regular expressions. A literal is simply the exact text you want to find. For example, the expression `a` matches the character `a`, and `bananas` matches exactly the word “bananas.” Literals match parts of a text as well as whole words: searching for `monkey` finds “monkey” inside a longer sentence such as “The monkeys like to eat bananas.” This behavior isn’t limited to letters; digits and combinations of characters work as well. The literal `3` matches the digit “3” in “34,” and `5 gibbons` matches exactly those characters in that order. Under the hood, a regex engine scans the text from left to right, one character at a time. When the first character of the pattern matches, the engine checks whether the following characters fit the rest of the expression; if they do, the result is one continuous match, and if not, it resumes scanning from the next position.

In [7]:
import pandas as pd

# Small dataset with text
df = pd.DataFrame({
    "text": ["monkey", "bananas", "5 gibbons", "34 elephants"]
})

# Match the literal 'monkey'
print(df[df["text"].str.contains("monkey")])

# Match the literal '5 gibbons'
print(df[df["text"].str.contains("5 gibbons")])

# Match the literal '3'
print(df[df["text"].str.contains("3")])


     text
0  monkey
        text
2  5 gibbons
           text
3  34 elephants
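
The pandas filter above only reports whether a row contains the literal. As a minimal sketch of where the match lands, the built-in `re` module can return the position of the first match; the sample sentence is the one from the paragraph above, and the variable names are illustrative.

```python
import re

sentence = "The monkeys like to eat bananas."

# re.search scans left to right and returns the first match, or None.
m = re.search("monkey", sentence)
literal_span = m.span()    # (start, end) offsets of the matched text
literal_text = m.group()   # the matched substring itself
print(literal_span, literal_text)

# A literal digit also matches inside a longer number.
digit_match = re.search("3", "34")
print(digit_match.span())
```

The span shows that the literal matched inside “monkeys” without consuming the trailing “s”.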


# Alternation
Alternation in regular expressions is the way to express a choice between two or more options. It uses the pipe symbol `|`, which you can think of as a logical “or.” For example, the pattern `baboons|gorillas` looks for either the word “baboons” or the word “gorillas.” If the text is “I love baboons,” the regex matches “baboons,” and if the text is “I love gorillas,” it matches “gorillas.” What matters is that one of the alternatives appears. At this stage, the alternation applies only to the alternatives themselves, not to the sentence around them: `baboons|gorillas` matches just the single word, not the whole phrase “I love baboons” or “I love gorillas.” Later on, you can build more complex expressions that capture the full sentence, but alternation gives you the first step: a way to handle multiple possibilities with a single pattern.


In [8]:
# Simple dataset
df = pd.DataFrame({
    "text": ["I love baboons", "I love gorillas", "I love monkeys"]
})

# Match either 'baboons' OR 'gorillas'
print(df[df["text"].str.contains("baboons|gorillas")])


              text
0   I love baboons
1  I love gorillas
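
To make visible that only the alternative itself is matched, not the surrounding sentence, here is a small sketch using the built-in `re` module on the same three sentences. The list and variable names are illustrative.

```python
import re

pattern = re.compile(r"baboons|gorillas")
texts = ["I love baboons", "I love gorillas", "I love monkeys"]

# group() returns only the part of the text that the pattern consumed.
matched_words = []
for t in texts:
    m = pattern.search(t)
    matched_words.append(m.group() if m else None)

print(matched_words)
```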


# Character sets 
Character sets in regular expressions give you flexibility when you want to allow several possible characters in the same position. Instead of writing a rigid expression that accepts only one exact spelling, you can use square brackets to list the alternatives. For example, `con[sc]en[sc]us` matches the correct spelling “consensus,” but also the common misspellings “concensus,” “consencus,” and “concencus.” The brackets `[sc]` mean “this position can be either s or c.” Character sets work at the single-character level: `[cat]` does not mean “cat,” but rather “c or a or t.” You can also reverse the meaning with the caret symbol `^`. Placed immediately after the opening bracket, as in `[^cat]`, it makes the set match any single character that is not c, a, or t; for instance, `[^cat]` matches each of the letters d, o, and g. With character sets and their negated form, regular expressions become much more forgiving, letting you account for variation or error in text.

In [9]:
# Example dataset with different spellings
df = pd.DataFrame({
    "text": ["consensus", "concensus", "consencus", "concencus", "dog"]
})

# Regex with character sets to catch all spelling variants
print(df[df["text"].str.contains("con[sc]en[sc]us")])

# Regex with a negated character set: match strings whose first character is NOT c, a, or t.
# The leading ^ (outside the brackets) ties the match to the start of the string.
print(df[df["text"].str.contains("^[^cat]")])


        text
0  consensus
1  concensus
2  consencus
3  concencus
  text
4  dog
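
The single-character nature of a set is easier to see when each match is listed separately. The following sketch uses `re.findall` and `fullmatch` from the built-in `re` module; the test word “conzensus” is a made-up example, not taken from the dataset.

```python
import re

spelling = re.compile(r"con[sc]en[sc]us")
variants = ["consensus", "concensus", "consencus", "concencus"]

# Every variant fits, because each bracketed position accepts 's' or 'c'.
all_match = all(spelling.fullmatch(w) for w in variants)
# 'z' is not in [sc], so this hypothetical misspelling is rejected.
rejected = spelling.fullmatch("conzensus")

# [cat] yields three separate one-character matches, not the word "cat".
single_chars = re.findall(r"[cat]", "cat")
# [^cat] keeps only the characters outside the set.
outside_set = re.findall(r"[^cat]", "dogcat")

print(all_match, rejected, single_chars, outside_set)
```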


# Wildcards 
Wildcards in regular expressions are a way to say, “I don’t care what character appears here, as long as something does.” The symbol `.` acts as this placeholder, and it can represent any single character: a letter, a digit, a symbol, or even a space (by default, anything except a newline). For example, nine dots in a row, `.........`, match any nine consecutive characters, such as “orangutan” or “marsupial.” You can also mix wildcards with normal text. The regex `I ate . bananas` matches “I ate 3 bananas” or “I ate 8 bananas,” since the dot stands in for the digit. But what if you want to look for an actual period rather than use it as a wildcard? That’s where the backslash escape comes in. If you write `\.`, the dot loses its special meaning and is treated as a literal period. So `Howler monkeys are really lazy\.` matches the sentence “Howler monkeys are really lazy.” including the period at the end. Wildcards are a powerful way to allow flexibility when the exact character is not important.

In [10]:
# Example dataset
df = pd.DataFrame({
    "text": [
        "orangutan",
        "marsupial",
        "I ate 3 bananas",
        "I ate 8 bananas",
        "Howler monkeys are really lazy."
    ]
})

# 1. Match strings of exactly 9 characters: 9 wildcards between ^ (start) and $ (end)
print(df[df["text"].str.contains("^.........$")])

# 2. Match 'I ate . bananas' → the dot matches any single character
print(df[df["text"].str.contains("I ate . bananas")])

# 3. Match a literal period at the end using \.
print(df[df["text"].str.contains("Howler monkeys are really lazy\\.")])


        text
0  orangutan
1  marsupial
              text
2  I ate 3 bananas
3  I ate 8 bananas
                              text
4  Howler monkeys are really lazy.
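
The difference between the wildcard dot and the escaped dot shows up when the final character varies. This is a small sketch with the built-in `re` module; the three sample strings are invented for illustration.

```python
import re

# Unescaped, the dot accepts any single character (except a newline by default).
loose = re.compile(r"lazy.")
# Escaped, it accepts only a literal period.
strict = re.compile(r"lazy\.")

samples = ["really lazy.", "really lazy!", "really lazy"]
loose_hits = [bool(loose.search(s)) for s in samples]
strict_hits = [bool(strict.search(s)) for s in samples]

print(loose_hits)
print(strict_hits)
```

The last sample fails both patterns because the dot, escaped or not, still requires one character to be present after “lazy”.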


# Ranges
Ranges let a character set cover a whole span of characters without listing each one. Inside square brackets, a hyphen between two characters stands for everything between them: `[A-Z]` is any uppercase letter, `[a-z]` any lowercase letter, and `[0-9]` any digit. Ranges can be combined in one set, so `[A-Za-z0-9]` matches any single letter or digit. A practical use is to locate and capture a hidden part of a URL, such as a key, and then add it to the end of another base URL. Conceptually, the hidden key is just a sequence of characters embedded in a longer string. For example, if your browser shows a URL with an embedded key like `https://example.com/page?hidden_key=XYZ123`, the pattern `hidden_key=([A-Za-z0-9]+)` captures the `XYZ123` part (the `+` means “one or more of the preceding set”), which you can then append to another given URL. This demonstrates the practical use of regex: finding a specific chunk of text even when it’s surrounded by other characters.

In [11]:
import re

# Example: a browser URL that contains a hidden key
browser_url = "https://example.com/page?hidden_key=XYZ123"

# Base URL where we want to append the key
base_url = "https://myapp.com/start?key="

# Use regex to extract the hidden key after 'hidden_key='
match = re.search(r"hidden_key=([A-Za-z0-9]+)", browser_url)

if match:
    hidden_key = match.group(1)  # Extract the key (XYZ123)
    final_url = base_url + hidden_key
    print("Final URL:", final_url)


Final URL: https://myapp.com/start?key=XYZ123
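
The set `[A-Za-z0-9]` used above combines three ranges. The sketch below takes them apart on a short made-up string and then shows the combined set stopping at the first character outside it. The `&lang=en` suffix on the URL is a hypothetical addition for illustration.

```python
import re

sample = "aB3_"

# Each range covers one contiguous run of characters.
lower = re.findall(r"[a-z]", sample)
upper = re.findall(r"[A-Z]", sample)
digits = re.findall(r"[0-9]", sample)
print(lower, upper, digits)

# Combined ranges: the capture ends at '&', which is not a letter or digit.
url = "https://example.com/page?hidden_key=XYZ123&lang=en"  # hypothetical URL
key = re.search(r"hidden_key=([A-Za-z0-9]+)", url)
print(key.group(1))
```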


# Shorthand character classes 
Shorthand character classes make regular expressions much easier to write when you are dealing with common types of characters. Instead of spelling out long ranges every time, you can use compact symbols. For example, `\w` stands for “word characters,” meaning any letter (uppercase or lowercase), digit, or underscore. Similarly, `\d` matches any digit from 0 to 9, and `\s` matches any whitespace, such as a space, tab, or newline. With these, you can quickly express patterns like “a digit, followed by a space, followed by several letters.” For instance, the regex `\d\s\w\w\w\w\w\w\w` matches the text “3 monkeys”: a digit (3), a whitespace character (the space), and then seven letters.
There are also negated versions: `\W` matches any character that is not a word character, `\D` matches anything that is not a digit, and `\S` matches anything that is not whitespace. These let you explicitly exclude whole categories of characters. Together, shorthand character classes are essential tools for making regex patterns both concise and clear.

In [12]:
# Example dataset
df = pd.DataFrame({
    "text": [
        "3 monkeys",
        "12_apples",
        "hello world",
        "42",
        "no_digits_here!"
    ]
})

# 1. Match a digit, a whitespace character, then 7 word characters
#    (\w{7} is a compact way to write seven \w in a row)
print("Match '\\d\\s\\w{7}':")
print(df[df["text"].str.contains(r"\d\s\w{7}")], "\n")

# 2. Match any text containing at least one digit
print("Match '\\d':")
print(df[df["text"].str.contains(r"\d")], "\n")

# 3. Match any text containing a non-digit
print("Match '\\D':")
print(df[df["text"].str.contains(r"\D")], "\n")

# 4. Match any text containing whitespace
print("Match '\\s':")
print(df[df["text"].str.contains(r"\s")])


Match '\d\s\w{7}':
        text
0  3 monkeys 

Match '\d':
        text
0  3 monkeys
1  12_apples
3         42 

Match '\D':
              text
0        3 monkeys
1        12_apples
2      hello world
4  no_digits_here! 

Match '\s':
          text
0    3 monkeys
2  hello world
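
As a rough check of what the shorthands stand for, the sketch below compares `\w` with the explicit set `[A-Za-z0-9_]` on a made-up ASCII string. One caveat: in Python 3, `\w` and `\d` also match non-ASCII letters and digits unless the `re.ASCII` flag is passed, so the equivalence shown here is for plain ASCII text only.

```python
import re

text = "3 monkeys, 12_apples!"

# For ASCII input, \w finds the same characters as the spelled-out set.
word_chars = re.findall(r"\w", text)
explicit = re.findall(r"[A-Za-z0-9_]", text)

# \d picks out digits; \W picks out everything that is not a word character.
digits_found = re.findall(r"\d", text)
non_word = re.findall(r"\W", text)

print(word_chars == explicit, digits_found, non_word)
```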


# Grouping 
Grouping in regular expressions lets you structure patterns so that alternation or repetition applies only to part of the regex. Without grouping, the `|` alternation symbol splits the entire expression into everything before it and everything after it, which can lead to unexpected matches. For instance, `I love baboons|gorillas` correctly matches “I love baboons,” but in “I love gorillas” it matches only “gorillas,” because the `|` divides the whole expression into “I love baboons” or “gorillas.” By adding parentheses, you can control the scope: `I love (baboons|gorillas)` matches “I love baboons” or “I love gorillas.” This way, the alternation is limited to the words inside the group. Groups also act as capture groups, meaning they record the portion of text they matched for later use, which makes them useful not only for matching but also for data extraction.

In [14]:
# Example dataset
df = pd.DataFrame({
    "text": [
        "I love baboons",
        "I love gorillas",
        "I love monkeys",
        "gorillas"
    ]
})

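# Regex with grouping to match full phrases.
# Note: pandas emits a UserWarning when str.contains is given a pattern with
# capture groups; the filtering result is unaffected.
print("Match 'I love (baboons|gorillas)':")
print(df[df["text"].str.contains(r"I love (baboons|gorillas)")], "\n")

# Capture group example: extract which animal was matched.
# expand=False returns a Series, which assigns cleanly to a single column.
df["animal"] = df["text"].str.extract(r"I love (baboons|gorillas)", expand=False)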
print("Extracted animals:")
print(df)


Match 'I love (baboons|gorillas)':
              text
0   I love baboons
1  I love gorillas 

Extracted animals:
              text    animal
0   I love baboons   baboons
1  I love gorillas  gorillas
2   I love monkeys       NaN
3         gorillas       NaN
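
The effect of the parentheses on scope can be checked directly by comparing what each pattern consumes. This is a minimal sketch with the built-in `re` module; the variable names are illustrative.

```python
import re

text = "I love gorillas"

# Without a group, '|' splits the whole pattern in two.
ungrouped = re.search(r"I love baboons|gorillas", text)
# With a group, the alternation is confined to the animal name.
grouped = re.search(r"I love (baboons|gorillas)", text)

print(ungrouped.group())   # text consumed by the ungrouped pattern
print(grouped.group())     # text consumed by the grouped pattern
print(grouped.group(1))    # text captured by the first group
```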



# Fixed quantifiers 
Fixed quantifiers let you move beyond matching text character by character. Instead of writing the same symbol multiple times in a row, you can use curly braces to say exactly how many times something should appear, or within what range. For example, `\w{3}` matches exactly three word characters, while `\w{4,7}` matches a run of at least four but no more than seven word characters. This is more compact and flexible than repeating `\w` over and over. Quantifiers also work inside larger patterns: `roa{3}r` matches “roaaar,” and `roa{3,7}r` matches “roaaar” through “roaaaaaaar,” since it accepts between three and seven a’s. One key point is that quantifiers are greedy: they try to match as much as possible. For instance, `mo{2,4}` applied to “moooo” returns the longest valid option, “moooo” with four o’s, rather than the shorter “moo” or “mooo.” Fixed quantifiers make regex far more expressive and efficient when describing patterns of repetition.

In [15]:
# Example dataset
df = pd.DataFrame({
    "text": [
        "rhesus monkey",
        "roaaar",
        "roaaaaar",
        "roaaaaaaar",
        "moooo"
    ]
})

# 1. Match exactly 6 letters, a space, and 6 more letters (like 'rhesus monkey')
print("Match '\\w{6}\\s\\w{6}':")
print(df[df["text"].str.contains(r"\w{6}\s\w{6}")], "\n")

# 2. Match 'ro' + between 3 and 7 'a's + 'r'
print("Match 'roa{3,7}r':")
print(df[df["text"].str.contains(r"roa{3,7}r")], "\n")

# 3. Show greedy behavior: 'mo{2,4}' on 'moooo'
print("Match 'mo{2,4}':")
print(df[df["text"].str.contains(r"mo{2,4}")])


Match '\w{6}\s\w{6}':
            text
0  rhesus monkey 

Match 'roa{3,7}r':
         text
1      roaaar
2    roaaaaar
3  roaaaaaaar 

Match 'mo{2,4}':
    text
4  moooo
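
Because `str.contains` only says whether a row matches, it does not show how much text a greedy quantifier took. The sketch below prints the matched text itself using the built-in `re` module. The test strings are built with string repetition so the number of repeated letters is explicit, and the inputs with seven o’s and eight a’s are made-up cases.

```python
import re

quantified = re.compile(r"mo{2,4}")

# Greedy: with four 'o's available, all four are taken.
exact_four = quantified.search("m" + "o" * 4).group()
# With seven 'o's available, the match still stops at the upper bound of four.
capped = quantified.search("m" + "o" * 7).group()
# A single 'o' is below the lower bound, so there is no match.
too_short = quantified.search("mo")

# roa{3,7}r needs between three and seven 'a's between "ro" and "r".
roar = re.compile(r"roa{3,7}r")
roar_hits = [bool(roar.search("ro" + "a" * n + "r")) for n in (2, 3, 7, 8)]

print(exact_four, capped, too_short, roar_hits)
```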


# Fixed quantifiers 
Fixed quantifiers are a way to specify exactly how many times a character or group of characters should repeat, rather than writing each occurrence out manually. With curly braces `{}`, you can ask for a precise number of repetitions or define a range. For instance, `\w{6}\s\w{6}` means “match six word characters, followed by a whitespace character, then six more word characters,” which fits the string “rhesus monkey.” Similarly, `roa{3,7}r` matches “ro” followed by between three and seven “a” characters, then “r,” so it will catch “roaaar,” “roaaaaar,” and “roaaaaaaar.” Quantifiers are greedy by default, which means they take as many characters as they can while still fulfilling the pattern. So `mo{2,4}` applied to “moooo” doesn’t stop at “moo” or “mooo” but instead matches the full “moooo” with four “o”s.


# Optional quantifiers 
Optional quantifiers make a character or group of characters flexible, allowing it to appear once or not at all. This is especially useful when dealing with variations in spelling or wording. For example, the regex `humou?r` matches both “humor” and “humour,” because the u is marked optional: it may appear zero or one time. The `?` applies only to the character (or group, if parentheses are used) immediately before it. With grouping, you can extend this to longer sequences. For instance, `The monkey ate a (rotten )?banana` matches both “The monkey ate a banana” and “The monkey ate a rotten banana,” since the group `(rotten )` is treated as optional. Because `?` itself is a special symbol, you must escape it with a backslash if you want to search for an actual question mark. So `Aren't owl monkeys beautiful\?` matches the sentence including the final question mark. Optional quantifiers give your regex a balance of precision and flexibility, making it robust against natural variations in text.

In [17]:
# Example dataset
df = pd.DataFrame({
    "text": [
        "humor",
        "humour",
        "The monkey ate a banana",
        "The monkey ate a rotten banana",
        "Aren't owl monkeys beautiful?"
    ]
})

# 1. Match 'humor' or 'humour'
print("Match 'humou?r':")
print(df[df["text"].str.contains(r"humou?r")], "\n")

# 2. Match with optional group '(rotten )'
print("Match 'The monkey ate a (rotten )?banana':")
print(df[df["text"].str.contains(r"The monkey ate a (rotten )?banana")], "\n")

# 3. Match a literal question mark at the end
print("Match 'Aren't owl monkeys beautiful\\?':")
print(df[df["text"].str.contains(r"Aren't owl monkeys beautiful\?")])


Match 'humou?r':
     text
0   humor
1  humour 

Match 'The monkey ate a (rotten )?banana':
                             text
2         The monkey ate a banana
3  The monkey ate a rotten banana 

Match 'Aren't owl monkeys beautiful\?':
                            text
4  Aren't owl monkeys beautiful?
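
When an optional group is skipped, the overall pattern still matches but the group itself captures nothing. The sketch below shows this with the built-in `re` module; “humouur” is a made-up string used to show that `?` allows at most one occurrence.

```python
import re

pattern = re.compile(r"The monkey ate a (rotten )?banana")

with_group = pattern.search("The monkey ate a rotten banana")
without_group = pattern.search("The monkey ate a banana")

# group(1) holds the optional text when present, and None when it was skipped.
print(repr(with_group.group(1)))
print(without_group.group(1))

# 'u?' means zero or one 'u'; two in a row do not fit.
humor_hits = [bool(re.fullmatch(r"humou?r", w)) for w in ("humor", "humour", "humouur")]
print(humor_hits)
```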



# Anchors 
Anchors make your regular expressions more precise by tying them to the beginning or the end of a string. The caret `^` asserts that a match must start at the very beginning, while the dollar sign `$` asserts that it must end at the very last character. For example, `^Monkeys: my mortal enemy$` matches exactly the text “Monkeys: my mortal enemy” and nothing more. It won’t match longer versions like “Spider Monkeys: my mortal enemy in the wild,” because in that case the string starts with “Spider” and ends with “wild.” Without anchors, however, the same regex would still find “Monkeys: my mortal enemy” as a substring within the longer text. Anchors are invaluable when you need complete control over the placement of a match, ensuring you capture only exact strings rather than fragments buried inside larger ones. Since `^` and `$` are metacharacters, you must escape them when you want to match those symbols themselves. For instance, `My spider monkey has \$10\^6 in the bank` matches the sentence containing “$10^6” as literal characters.

In [None]:
import pandas as pd

# Example dataset
df = pd.DataFrame({
    "text": [
        "Monkeys: my mortal enemy",
        "Spider Monkeys: my mortal enemy in the wild",
        "Squirrel Monkeys: my mortal enemy in the wild",
        "My spider monkey has $10^6 in the bank"
    ]
})

# 1. Match exactly 'Monkeys: my mortal enemy' with anchors
print("Match '^Monkeys: my mortal enemy$':")
print(df[df["text"].str.contains(r"^Monkeys: my mortal enemy$")], "\n")

# 2. Match 'Monkeys: my mortal enemy' without anchors (substring match)
print("Match 'Monkeys: my mortal enemy' (no anchors):")
print(df[df["text"].str.contains(r"Monkeys: my mortal enemy")], "\n")

# 3. Match literal '$' and '^' characters
print("Match 'My spider monkey has \\$10\\^6 in the bank':")
print(df[df["text"].str.contains(r"My spider monkey has \$10\^6 in the bank")])
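
The same three cases can also be checked without pandas. The following is a small sketch using the built-in `re` module on two of the strings from the dataset above; the dictionary keys are illustrative labels.

```python
import re

short_text = "Monkeys: my mortal enemy"
long_text = "Spider Monkeys: my mortal enemy in the wild"

anchored = re.compile(r"^Monkeys: my mortal enemy$")
unanchored = re.compile(r"Monkeys: my mortal enemy")

results = {
    "anchored_short": bool(anchored.search(short_text)),
    "anchored_long": bool(anchored.search(long_text)),
    "unanchored_long": bool(unanchored.search(long_text)),
}
print(results)

# Escaped, '$' and '^' are matched as ordinary characters.
escaped = re.search(r"\$10\^6", "My spider monkey has $10^6 in the bank")
print(escaped.group())
```

Only the anchored pattern rejects the longer sentence, since the unanchored one is satisfied by the substring.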
