# Advanced Regular Expressions Lab

Complete the following set of exercises to solidify your knowledge of regular expressions.

In [1]:
import re

### 1. Use a regular expression to find and extract all vowels in the following text.

In [2]:
text = "This is going to be a sentence with a good number of vowels in it."

In [5]:
vowels = re.findall(r'[aeiouAEIOU]', text)
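
As an optional variation, the same vowels can probably be collected with a lowercase-only character class plus the `re.IGNORECASE` flag. The sketch below is self-contained; the variable names `sample`, `explicit`, and `with_flag` are illustrative and not part of the lab.

```python
import re

sample = "This is going to be a sentence with a good number of vowels in it."

# Character class that lists both lowercase and uppercase vowels.
explicit = re.findall(r'[aeiouAEIOU]', sample)

# Lowercase-only class, with case-insensitive matching switched on.
with_flag = re.findall(r'[aeiou]', sample, flags=re.IGNORECASE)

print(explicit == with_flag)  # True
```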

### 2. Use a regular expression to find and extract all occurrences and forms (singular and plural) of the word "puppy" in the text below.

In [26]:
text = "The puppy saw all the rest of the puppies playing and wanted to join them. I saw this and wanted a puppy of my own!"

In [41]:
occurrences = re.findall(r'\bpupp(?:y|ies)\b', text)

# \b: Word boundary to ensure we match the whole word and not part of a word.
# pupp: The literal string "pupp".
# (?:y|ies): A non-capturing group that matches either "y" or "ies" (to cover both singular and plural forms).
# \b: Another word boundary to complete the match.

# The re.findall function finds all occurrences of "puppy" and "puppies" (both singular and plural) in the text and returns them as a list.
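
The choice of a non-capturing group matters here because `re.findall` returns the group contents whenever the pattern contains a capturing group. A minimal sketch of the difference, using illustrative variable names:

```python
import re

sample = ("The puppy saw all the rest of the puppies playing and wanted to join them. "
          "I saw this and wanted a puppy of my own!")

# Non-capturing group: findall returns the full matches.
non_capturing = re.findall(r'\bpupp(?:y|ies)\b', sample)
print(non_capturing)  # ['puppy', 'puppies', 'puppy']

# Capturing group: findall returns only what the group matched.
capturing = re.findall(r'\bpupp(y|ies)\b', sample)
print(capturing)  # ['y', 'ies', 'y']
```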

### 3. Use a regular expression to find and extract all tenses (present and past) of the word "run" in the text below.

In [22]:
text = "I ran the relay race the only way I knew how to run it."

In [40]:
tenses = re.findall(r'\br(?:un|an)\b', text)

# \b: Word boundary to ensure we match the whole word and not part of a word.
# r: The literal character "r".
# (?:un|an): A non-capturing group that matches either "un" (present tense) or "an" (past tense).
# \b: Another word boundary to complete the match.
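
Because the two tenses differ by a single letter, a character class is one possible alternative to the alternation group. This is only a sketch of an equivalent pattern for this particular sentence, not a general way to match verb tenses:

```python
import re

sample = "I ran the relay race the only way I knew how to run it."

# Alternation inside a non-capturing group, as in the cell above.
with_group = re.findall(r'\br(?:un|an)\b', sample)

# Character class for the single letter that varies.
with_class = re.findall(r'\br[ua]n\b', sample)

print(with_group)  # ['ran', 'run']
print(with_class)  # ['ran', 'run']
```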

### 4. Use a regular expression to find and extract all words that begin with the letter "r" from the previous text.

In [39]:
words_starting_with_r = re.findall(r'\br\w+', text)

# \b: Word boundary, so the "r" must be the first character of a word.
# r: The literal lowercase character "r".
# \w+: Matches one or more word characters (letters, digits, or underscores) that follow the letter "r".
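
Two details of this pattern are easy to miss: it is case-sensitive, and `\w+` requires at least one character after the "r". The sketch below illustrates both with a made-up sentence (`other` is hypothetical test data, not part of the lab):

```python
import re

sample = "I ran the relay race the only way I knew how to run it."
lab_result = re.findall(r'\br\w+', sample)
print(lab_result)  # ['ran', 'relay', 'race', 'run']

# Hypothetical sentence with a capitalized word and a one-letter word.
other = "Rivers run rapidly; r is a letter."

# Same pattern: skips "Rivers" (uppercase R) and the lone "r".
strict = re.findall(r'\br\w+', other)
print(strict)  # ['run', 'rapidly']

# \w* allows zero following characters; IGNORECASE accepts "R" as well.
relaxed = re.findall(r'\br\w*', other, flags=re.IGNORECASE)
print(relaxed)  # ['Rivers', 'run', 'rapidly', 'r']
```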

### 5. Use a regular expression to find and substitute the letter "i" for the exclamation marks in the text below.

In [29]:
text = "Th!s !s a sentence w!th spec!al characters !n !t."

In [37]:
modified_text = re.sub(r'!', 'i', text)
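
If the number of replacements is also of interest, `re.subn` returns it alongside the new string. The sketch below also compares the result with plain `str.replace`, which is sufficient when the pattern is a fixed character; the variable names are illustrative.

```python
import re

sample = "Th!s !s a sentence w!th spec!al characters !n !t."

# re.subn returns a tuple: (new_string, number_of_substitutions).
fixed, count = re.subn(r'!', 'i', sample)
print(fixed)  # This is a sentence with special characters in it.
print(count)  # 6

# For a literal single character, str.replace gives the same string.
print(fixed == sample.replace('!', 'i'))  # True
```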

### 6. Use a regular expression to find and extract words longer than 4 characters in the text below.

In [31]:
text = "This sentence has words of varying lengths."

In [36]:
words_longer_than_4_chars = re.findall(r'\b\w{5,}\b', text)

# \b: Word boundary to ensure we match the whole word and not part of a word.
# \w{5,}: Matches five or more word characters (letters, digits, or underscores) that form a word longer than 4 characters.
# \b: Another word boundary to complete the match.
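
As a sanity check, the length constraint can also be applied outside the regular expression by extracting every word and filtering on `len`. This sketch assumes that "word" means a run of `\w` characters, as in the pattern above:

```python
import re

sample = "This sentence has words of varying lengths."

# Length constraint expressed inside the pattern.
by_regex = re.findall(r'\b\w{5,}\b', sample)

# Length constraint applied in Python after extracting every word.
by_filter = [word for word in re.findall(r'\w+', sample) if len(word) > 4]

print(by_regex)               # ['sentence', 'words', 'varying', 'lengths']
print(by_regex == by_filter)  # True
```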

### 7. Use a regular expression to find and extract all occurrences of the letter "b", some letter(s), and then the letter "t" in the sentence below.

In [33]:
text = "I bet the robot couldn't beat the other bot with a bat, but instead it bit me."

In [35]:
matches = re.findall(r'b\w+t', text)

# b: Matches the letter "b".
# \w+: Matches one or more word characters (letters, digits, or underscores) that follow the letter "b".
# t: Matches the letter "t".

['bet', 'bot', 'beat', 'bot', 'bat', 'but', 'bit']
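
Note that the first 'bot' in the output comes from inside "robot", because the pattern has no word boundaries. If only whole words were wanted, a boundary on each side would be one way to exclude it. A sketch of the comparison:

```python
import re

sample = "I bet the robot couldn't beat the other bot with a bat, but instead it bit me."

# No boundaries: "bot" is also found inside "robot".
unanchored = re.findall(r'b\w+t', sample)
print(unanchored)   # ['bet', 'bot', 'beat', 'bot', 'bat', 'but', 'bit']

# Boundaries on both sides: only whole words that start with b and end with t.
whole_words = re.findall(r'\bb\w+t\b', sample)
print(whole_words)  # ['bet', 'beat', 'bot', 'bat', 'but', 'bit']
```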

### 8. Use a regular expression to find and extract all words that contain either "ea" or "eo" in them.

In [None]:
text = "During many of the peaks and troughs of history, the people living it didn't fully realize what was unfolding. But we all know we're navigating breathtaking history: Nearly every day could be — maybe will be — a book."


In [42]:
words_with_ea_or_eo = re.findall(r'\b\w*ea\w*|\b\w*eo\w*', text)

# \b: Word boundary marking the start of a word.
# \w*: Matches zero or more word characters (letters, digits, or underscores) before the letter pair.
# ea: Matches the letters "ea".
# \w*: Matches zero or more word characters after the letter pair, up to the end of the word.
# |: The vertical bar functions as an OR operator between the two branches.
# \b\w*eo\w*: The same structure as the first branch, for words that contain "eo".
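
Since the two branches differ only in the second letter, they can likely be folded into one branch with a character class, `e[ao]`. The sketch below checks this on the first sentence of the text only (an excerpt chosen to keep the example short):

```python
import re

# First sentence of the exercise text.
sample = ("During many of the peaks and troughs of history, the people living it "
          "didn't fully realize what was unfolding.")

# Two full branches joined by alternation.
two_branches = re.findall(r'\b\w*ea\w*|\b\w*eo\w*', sample)

# One branch; the character class [ao] covers both letter pairs.
one_branch = re.findall(r'\b\w*e[ao]\w*', sample)

print(two_branches)  # ['peaks', 'people', 'realize']
print(one_branch)    # ['peaks', 'people', 'realize']
```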

### 9. Use a regular expression to find and extract all the capitalized words in the text below individually.

In [43]:
text = "Teddy Roosevelt and Abraham Lincoln walk into a bar."

In [44]:
capitalized_words = re.findall(r'\b[A-Z][a-zA-Z]*\b', text)

# \b: Word boundary to ensure we match the whole word and not part of a word.
# [A-Z]: Matches an uppercase letter (the first letter of the capitalized word).
# [a-zA-Z]*: Matches zero or more lowercase or uppercase letters that follow the initial uppercase letter.
# \b: Another word boundary to complete the match.

### 10. Use a regular expression to find and extract all the sets of consecutive capitalized words in the text above.

In [45]:
consecutive_capitalized_sets = re.findall(r'\b[A-Z][a-zA-Z]*(?:\s+[A-Z][a-zA-Z]*)*\b', text)

# \b: Word boundary to ensure we match the whole word and not part of a word.
# [A-Z]: Matches an uppercase letter (the first letter of the capitalized word).
# [a-zA-Z]*: Matches zero or more lowercase or uppercase letters that follow the initial uppercase letter.
# (?:\s+[A-Z][a-zA-Z]*)*: A non-capturing group (?:...) repeated zero or more times. Each repetition matches whitespace \s+ followed by another capitalized word. Because the group may repeat zero times, a capitalized word that stands alone also matches.
# \b: Another word boundary to complete the match.
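
The trailing `*` on the group means a lone capitalized word is also returned. If a "set" should contain at least two words, changing that quantifier to `+` is one option. The sketch below uses a slightly extended sentence (the added "in Boston" is hypothetical) to make the difference visible:

```python
import re

# Lab sentence with a hypothetical lone capitalized word appended.
sample = "Teddy Roosevelt and Abraham Lincoln walk into a bar in Boston."

# Group repeated zero or more times: single capitalized words match too.
zero_or_more = re.findall(r'\b[A-Z][a-zA-Z]*(?:\s+[A-Z][a-zA-Z]*)*\b', sample)
print(zero_or_more)  # ['Teddy Roosevelt', 'Abraham Lincoln', 'Boston']

# Group repeated one or more times: at least two consecutive capitalized words.
one_or_more = re.findall(r'\b[A-Z][a-zA-Z]*(?:\s+[A-Z][a-zA-Z]*)+\b', sample)
print(one_or_more)   # ['Teddy Roosevelt', 'Abraham Lincoln']
```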

### 11. Use a regular expression to find and extract all the quotes from the text below.

*Hint: This one is a little more complex than the single quote example in the lesson because there are multiple quotes in the text.*

In [53]:
text = 'Roosevelt says to Lincoln, "I will bet you $50 I can get the bartender to give me a free drink." Lincoln says, "I am in!"'

In [55]:
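quotes = re.findall(r'"(.*?)"', text)
quotes

# ": Matches the opening double quote.
# (.*?): A capturing group that lazily matches any characters, stopping at the first closing quote it can reach.
# ": Matches the closing double quote.

['I will bet you $50 I can get the bartender to give me a free drink.', 'I am in!']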
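
With several quotations in one string, the main thing to watch is how far the wildcard is allowed to run. The sketch below compares a lazy wildcard, a greedy wildcard, and a negated character class; the variable names are illustrative.

```python
import re

sample = 'Roosevelt says to Lincoln, "I will bet you $50 I can get the bartender to give me a free drink." Lincoln says, "I am in!"'

# Lazy wildcard: stops at the nearest closing quote, so each quotation is separate.
lazy = re.findall(r'"(.*?)"', sample)
print(len(lazy))    # 2

# Greedy wildcard: runs to the last quote in the string, merging everything.
greedy = re.findall(r'"(.*)"', sample)
print(len(greedy))  # 1

# Negated class: "anything except a quote" gives the same result as the lazy form.
negated = re.findall(r'"([^"]*)"', sample)
print(negated == lazy)  # True
```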

### 12. Use a regular expression to find and extract all the numbers from the text below.

In [56]:
text = "There were 30 students in the class. Of the 30 students, 14 were male and 16 were female. Only 10 students got A's on the exam."

In [58]:
numbers = re.findall(r'\b\d+(?:\.\d+)?\b', text)

# \b: Word boundary to ensure we match the whole number and not part of a larger number.
# \d+: Matches one or more digits (0-9) that make up the integer part of the number.
# (?:\.\d+)?: A non-capturing group (?:...) that matches the decimal part of the number, which is optional. It consists of a period \. followed by one or more digits \d+.
# \b: Another word boundary to complete the match.
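
The lab text contains only integers, so the optional decimal group never comes into play there. The sketch below uses a made-up sentence (hypothetical data) to show what that group adds compared with a bare `\d+`:

```python
import re

# Hypothetical sentence containing a decimal number.
sample = "The price rose from 3.50 to 12 in 2 days."

# Optional decimal part keeps "3.50" together.
with_decimals = re.findall(r'\b\d+(?:\.\d+)?\b', sample)
print(with_decimals)  # ['3.50', '12', '2']

# Digits only: the decimal number is split at the period.
digits_only = re.findall(r'\d+', sample)
print(digits_only)    # ['3', '50', '12', '2']
```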

### 13. Use a regular expression to find and extract all the social security numbers from the text below.

In [60]:
text = """
Henry's social security number is 876-93-2289 and his phone number is (847)789-0984.
Darlene's social security number is 098-32-5295 and her phone number is (987)222-0901.
"""

In [62]:
ssns = re.findall(r'\b\d{3}-\d{2}-\d{4}\b', text)

# \b: Word boundary to ensure we match the whole SSN and not part of a larger number.
# \d{3}: Matches exactly three digits (0-9) for the area number.
# -: Matches the hyphen character '-'.
# \d{2}: Matches exactly two digits for the group number.
# -: Matches another hyphen.
# \d{4}: Matches exactly four digits for the serial number.
# \b: Another word boundary to complete the match.
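
If the three parts of each number are needed separately, wrapping them in capturing groups makes `re.findall` return tuples. A minimal sketch on the same text, with illustrative variable names:

```python
import re

sample = """
Henry's social security number is 876-93-2289 and his phone number is (847)789-0984.
Darlene's social security number is 098-32-5295 and her phone number is (987)222-0901.
"""

# No groups: each match is returned as one string.
whole = re.findall(r'\b\d{3}-\d{2}-\d{4}\b', sample)
print(whole)  # ['876-93-2289', '098-32-5295']

# Three capturing groups: each match is returned as a tuple of its parts.
parts = re.findall(r'\b(\d{3})-(\d{2})-(\d{4})\b', sample)
print(parts)  # [('876', '93', '2289'), ('098', '32', '5295')]
```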

### 14. Use a regular expression to find and extract all the phone numbers from the text above.

In [66]:
phone_numbers = re.findall(r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}', text)

# \(?\d{3}\)?: Matches an optional opening parenthesis \(, followed by exactly three digits (0-9), and an optional closing parenthesis \) for the area code.
# [-.\s]?: Matches an optional hyphen -, dot ., or whitespace \s after the area code.
# \d{3}: Matches exactly three digits for the central office code.
# [-.\s]?: Matches an optional hyphen -, dot ., or whitespace \s after the central office code.
# \d{4}: Matches exactly four digits for the line number.
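
The optional parentheses and separators let the pattern accept several common layouts, although the lab text uses only one of them. The sketch below tries the pattern on a made-up sentence; the numbers are the lab's first phone number rewritten in hypothetical formats.

```python
import re

pattern = r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}'

# Hypothetical sentence showing one number in four layouts.
sample = "Call (847)789-0984, 847-789-0984, 847.789.0984 or 847 789 0984."

found = re.findall(pattern, sample)
print(found)
# ['(847)789-0984', '847-789-0984', '847.789.0984', '847 789 0984']
```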

### 15. Use a regular expression to find and extract all the formatted numbers (both social security and phone) from the text above.

In [68]:
formatted_numbers = re.findall(r'\b\d{3}-\d{2}-\d{4}\b|\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}', text)

# \b\d{3}-\d{2}-\d{4}\b: Matches the Social Security Number pattern from exercise 13, with a word boundary on each side.
# |: The vertical bar functions as an OR operator between the two patterns.
# \(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}: Matches the phone number pattern from exercise 14, which can have optional parentheses and various separators.
# No \b is placed in front of the phone pattern: a word boundary cannot match between a space and "(", so a leading \b would drop the opening parenthesis from the match.
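
One subtlety when combining `\b` with an optional parenthesis: a word boundary exists only between a word character and a non-word character, so it cannot match between a space and "(". The sketch below shows the effect on a shortened, hypothetical sentence; the variable names are illustrative.

```python
import re

# Hypothetical shortened sentence with one SSN and one phone number.
sample = "SSN 876-93-2289, phone (847)789-0984."

# \b in front of both alternatives: the match cannot start at "(".
leading_boundary = re.findall(
    r'\b(?:\d{3}-\d{2}-\d{4}|\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4})\b', sample)
print(leading_boundary)   # ['876-93-2289', '847)789-0984']

# \b only around the SSN alternative: the opening parenthesis is kept.
ssn_boundary_only = re.findall(
    r'\b\d{3}-\d{2}-\d{4}\b|\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}', sample)
print(ssn_boundary_only)  # ['876-93-2289', '(847)789-0984']
```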