# Section R - Regex

Regex (regular expressions) is a text-matching language available in Python through the built-in `re` module. Common uses:
* Advanced pattern matching, for when searching for a fixed string isn't enough
* Matching repeating groups with specific permutations
* Matching sub-groups of patterns in a string
* Input validation - use regex to facilitate pass-lists of options/input
  * email address validation
  * phone number validation
  * ...
* Advanced text replacement.
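
As a minimal sketch of the pass-list idea for phone numbers: the pattern below is an assumption chosen for illustration (it accepts only the NNN-NNN-NNNN form), and `is_valid_phone` is a hypothetical helper name, not something defined elsewhere in this section.

```python
import re

# Hypothetical pass-list pattern: exactly NNN-NNN-NNNN, digits only
PHONE_PATTERN = re.compile(r'\d{3}-\d{3}-\d{4}')

def is_valid_phone(text):
    """Return True only if the whole string matches the pattern."""
    return PHONE_PATTERN.fullmatch(text) is not None

for candidate in ["555-867-5309", "5558675309", "555-867-5309 ext. 2"]:
    print(candidate, "->", is_valid_phone(candidate))
```

Anything that does not fit the allowed shape is rejected, which is usually safer than trying to list every bad input.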

Resources for learning:
* http://regex101.com – live testing with explanation
* http://pythex.org – Python-flavored regex testing

Let's do an email input validation example:

In [None]:
import re

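# Escape the dot before the top-level domain so it only matches a literal '.'
valid_email = r'[a-zA-Z0-9_\-.]+@[a-zA-Z0-9\-]+\.(com|org)'
print("What's your email address?")
email = input()
print("You entered:", email)
# fullmatch requires the whole input to match, not just a prefix
if re.fullmatch(valid_email, email):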
    print("This is a valid email address!")
else:
    print("This is NOT a valid email address!")
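
Two details matter a lot in validation patterns: whether a dot is escaped, and whether the whole input has to match. The short sketch below uses a deliberately tiny pattern (`a.com`, chosen only for illustration) to show both effects.

```python
import re

# An unescaped '.' matches any character, so 'a.com' also matches 'axcom'
loose_dot = bool(re.match(r'a.com', 'axcom'))
strict_dot = bool(re.match(r'a\.com', 'axcom'))
print(loose_dot, strict_dot)  # True False

# re.match anchors only at the start, so trailing text still passes
prefix_only = bool(re.match(r'a\.com', 'a.com!!!'))
whole_string = bool(re.fullmatch(r'a\.com', 'a.com!!!'))
print(prefix_only, whole_string)  # True False
```

For input validation, `re.fullmatch` (or a pattern ending in `$`) is usually the safer choice.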




## Example regex patterns/characters
Regex uses special characters or patterns to match types of characters, whitespace, boundaries, and groups.

```
Pattern      Meaning                          Example
.            Any character except newline     a.b → acb, a7b
^ / $        Start / end of string or line    ^Hi matches text starting with "Hi"
\d / \w      Digit / word character           \d = 0–9, \w = a–z, A–Z, 0–9, _
\s           Whitespace (space, tab, etc.)
+, *, ?      1+ / 0+ / 0 or 1 repeats         a+, a*, a?
{n} / {m,n}  Exactly n / between m and n      \d{4} → 4 digits
[abc]        a, b, or c                       gr[ae]y → gray, grey
[^abc]       Not a, b, or c
( )          Grouping
```
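
A few of the table entries can be tried directly with `re.findall`. The sample strings below are made up for illustration. Note that `^` matches at the start of every line only when the `re.MULTILINE` flag is set.

```python
import re

# '.' matches any single character except a newline
dot_matches = re.findall(r'a.b', 'acb a7b ab')
print(dot_matches)  # ['acb', 'a7b']

# '^' anchors at the start of each line when re.MULTILINE is set
line_starts = re.findall(r'^Hi', 'Hi there\nHi again\nOh Hi', flags=re.MULTILINE)
print(line_starts)  # ['Hi', 'Hi']

# {n} repeats the preceding token exactly n times
years = re.findall(r'\d{4}', 'Born 1999, moved 2024')
print(years)  # ['1999', '2024']

# [ae] matches exactly one character from the set
grays = re.findall(r'gr[ae]y', 'gray grey groy')
print(grays)  # ['gray', 'grey']

# [^abc] matches one character that is not a, b, or c
others = re.findall(r'[^abc]', 'abcxyz')
print(others)  # ['x', 'y', 'z']
```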

In [None]:
# Search for a pattern in a string
m = re.search(r'\d+', 'There are 15 cats')
print(m.group())  # '15'

# Find all occurrences of a pattern in a string
nums = re.findall(r'\d+', 'There are 15 cats and 7 dogs')
print(nums)  # ['15', '7']

# Substitute a pattern in a string
clean = re.sub(r'\s+', ' ', 'Too     many   spaces')
print(clean)  # 'Too many spaces'

# Grouping data to pick out specific parts
m = re.match(r'(\w+): (\d+)', 'Age: 30')
print(m.group(1))  # 'Age'
print(m.group(2))  # '30'

# Using flags to modify regex behavior
re.findall(r'dog', 'Dog DOG dog', flags=re.IGNORECASE)
# ['Dog', 'DOG', 'dog']
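
Groups and substitution can also be combined: inside the replacement string, `\1` and `\2` refer back to the first and second captured groups. A small sketch, using made-up names as sample data:

```python
import re

names = "Lovelace, Ada; Hopper, Grace"

# Capture last and first name, then emit them in swapped order
swapped = re.sub(r'(\w+), (\w+)', r'\2 \1', names)
print(swapped)  # 'Ada Lovelace; Grace Hopper'
```

Writing the replacement as a raw string (`r'...'`) keeps `\1` from being read as a string escape.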



## Problem Set
It can help to paste the text into regex101, experiment there until the pattern matches, and then move the working regex into your code here.
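
The same try-it-first habit can be sketched in code. `check_pattern` below is a hypothetical helper, and the hex-color pattern is just a sample unrelated to the problems; the idea is to list strings that should and should not match and see which ones the pattern gets wrong.

```python
import re

def check_pattern(pattern, should_match, should_not_match):
    """Hypothetical helper: return the samples the pattern misclassifies."""
    wrong = [s for s in should_match if not re.fullmatch(pattern, s)]
    wrong += [s for s in should_not_match if re.fullmatch(pattern, s)]
    return wrong

# Sample pattern: a lowercase hex color such as #1a2b3c
failures = check_pattern(
    r'#[0-9a-f]{6}',
    should_match=['#1a2b3c', '#ffffff'],
    should_not_match=['#12345', '1a2b3c', '#GGGGGG'],
)
print("Misclassified samples:", failures)  # []
```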

In [None]:
# Extract All Numbers From a String
text = "In 2023, there were 150 cats and 30 dogs."

# your solution here
# hint \d+

print("The numbers are:", all_numbers)
if set(all_numbers) == {'2023', '150', '30'}:
    print("All numbers extracted correctly!")

In [None]:
# Extract All Words Starting with a Capital Letter
text = "Alice went to Wonderland with Bob and Charlie."

# your solution here
# hint \b[A-Z][a-z]*\b

print("The capitalized words are:", capitalized_words)
if set(capitalized_words) == {'Alice', 'Bob', 'Charlie', 'Wonderland'}:
    print("All capitalized words extracted correctly!")

In [None]:
# Find all words ending with 'ing'
text = "I am singing while walking and then running."

# your solution here
# hint \b\w+ing\b

print("The words ending with 'ing' are:", ing_words)
if set(ing_words) == {'singing', 'walking', 'running'}:
    print("All 'ing' words extracted correctly!")

In [None]:
# Replace all dashes with underscores
text = "this-is_some-kind_of-text"

# your solution here
# hint re.sub

print("The modified text is:", modified_text)
if modified_text == "this_is_some_kind_of_text":
    print("Dashes replaced with underscores correctly!")

In [None]:
# Validate US zip codes
codes = ["12345", "9876", "123456"]

# your solution here

print("Valid zip codes are:", valid_zip_codes)
if set(valid_zip_codes) == {'12345'}:
    print("All valid zip codes extracted correctly!")

In [None]:
# Use regex to pick the email addresses out from the following text
text = "Contact me at test@example.com or foo.bar99@domain.co"

# your solution here

print("The emails are:", emails)

In [None]:
# Replace dates like MM/DD/YYYY with YYYY-MM-DD.
dates = "01/15/2020, 12/31/2019, 07/04/2021"

# your solution here

print("The reformatted dates are:", reformatted_dates)
if set(reformatted_dates) == {'2020-01-15', '2019-12-31', '2021-07-04'}:
    print("The dates were reformatted correctly!")