<a href="https://colab.research.google.com/github/dsynderg/CS-479-machine-translation/blob/main/CS_479_Regex_Activity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Interactive Regex Basics in Python (15-Minute Tour)

**Goal:** Learn the fundamental concepts of Regular Expressions (Regex) and how to use them in Python using the `re` module.

**Time:** ~15 minutes

**Format:** Read the explanations and fill in the missing `pattern` variable in the code cells marked with `# TODO:`. Run each code cell to check your work against the expected output.

---

## Cell 1: Introduction & Setup

**What are Regular Expressions?**

Regular Expressions (or regex) are sequences of characters that define a search pattern. They are used to match character combinations in strings. Think of them as a super-powered "find" or "search" tool for text.

**Why use them?**

* Validating input (e.g., checking if text looks like an email address or phone number).
* Searching for specific patterns in large amounts of text.
* Extracting specific information (like dates, numbers, or names) from text.
* Replacing text that matches a pattern.

**Python's `re` Module**

Python has a built-in module called `re` to work with regular expressions. Let's import it.


In [None]:
# Import the regular expression module
import re

print("re module imported successfully!")


---

## Cell 2: Literal Characters

The simplest regex patterns match literal characters. The pattern `cat` will find the exact sequence of characters "c", "a", "t".

We'll use `re.search(pattern, text)`. This function scans the `text` looking for the *first* location where the `pattern` produces a match. It returns a special "match object" if found, otherwise it returns `None`. We use `r"..."` (raw strings) for patterns to avoid issues with backslashes.
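As a minimal sketch of how `re.search` behaves, the example below runs the `cat` pattern from above against a made-up sentence. The sentence and the variable names are illustrative assumptions, not part of the exercise.

```python
import re

sample = "The cat sat on the mat."  # illustrative text, not from the exercise

# A successful search returns a match object describing the first match
found = re.search(r"cat", sample)
print(found.group(), found.start(), found.end())  # cat 4 7

# An unsuccessful search returns None
missing = re.search(r"bird", sample)
print(missing)  # None

# Raw strings keep backslashes as written: r"\n" is two characters, "\n" is one
raw_len, plain_len = len(r"\n"), len("\n")
print(raw_len, plain_len)  # 2 1
```

Note that `end()` is exclusive, so `sample[found.start():found.end()]` gives back the matched text.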

**Exercise:** Find the word "fox" in the text below.


In [None]:
# Text to search within
text_to_search_1 = "The quick brown fox jumps over the lazy dog."

# TODO: Write the regex pattern to find the literal word "fox"
# Use a raw string: r"your_pattern_here"
pattern_1 = # Fill in the blank here

# Perform the search
match_1 = re.search(pattern_1, text_to_search_1)

# Check if a match was found and print details
if match_1:
    # match.group() returns the matched string
    # match.start() returns the starting index of the match
    # match.end() returns the ending index of the match
    print(f"Found a match: '{match_1.group()}' starting at index {match_1.start()} and ending at index {match_1.end()}")
else:
    print("No match found.")

# Expected Output: Found a match: 'fox' starting at index 16 and ending at index 19


---

## Cell 3: Metacharacters: The Dot (`.`) and Character Sets (`[]`)

Metacharacters have special meanings in regex.

* `.` (Dot): Matches *any single character* (except newline `\n`).
* `[]` (Character Set): Matches *any one* of the characters inside the brackets.
    * Example: `[aeiou]` matches any vowel.
    * Ranges: `[a-z]` matches any lowercase letter, `[0-9]` matches any digit.
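
The short sketch below illustrates the dot and character sets on made-up strings (all sample strings and variable names are assumptions for illustration). It uses `re.findall`, which returns a list of every match rather than only the first.

```python
import re

# The dot matches any single character except a newline
dot_matches = re.findall(r"h.t", "hat hot h-t")
print(dot_matches)  # ['hat', 'hot', 'h-t']

dot_newline = re.search(r"a.b", "a\nb")
print(dot_newline)  # None

# A character set matches exactly one of the listed characters
vowels = re.findall(r"[aeiou]", "regex")
print(vowels)  # ['e', 'e']

# Ranges inside a set
digits = re.findall(r"[0-9]", "a1b22")
print(digits)  # ['1', '2', '2']

lowercase = re.findall(r"[a-z]", "Hi Bob")
print(lowercase)  # ['i', 'o', 'b']
```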

**Exercise 1:** Find the first three-character sequence that starts with 'd' and ends with 'g'.


In [None]:
# Text for this cell's exercises
text_to_search_2 = "The quick brown fox jumps over the lazy dog digging a den."

# TODO: Write a pattern to match 'd', then ANY single character, then 'g'.
# Use the dot (.) metacharacter.
pattern_2a = # Fill in the blank here

match_2a = re.search(pattern_2a, text_to_search_2)

if match_2a:
    print(f"Found a match using '.': '{match_2a.group()}' at index {match_2a.start()}")
else:
    print("No match found using '.'")

# Expected Output: Found a match using '.': 'dog' at index 40


**Exercise 2:** Find any occurrence of 'b', 'f', or 'd' followed immediately by 'o'. Use `re.findall()` which returns a list of *all* non-overlapping matches.


In [None]:
# Text is the same as in Exercise 1 (text_to_search_2)

# TODO: Use a character set [] for the first letter (b, f, or d), followed by 'o'.
pattern_2b = # Fill in the blank here

# Find all matches
matches_2b = re.findall(pattern_2b, text_to_search_2)

if matches_2b:
    print(f"Found matches using '[]': {matches_2b}")
else:
    print("No matches found using '[]'")

# Expected Output: Found matches using '[]': ['fo', 'do']


---

## Cell 4: Metacharacters: Anchors (`^`, `$`) and Shorthands (`\d`, `\w`, `\s`)

* `^` (Caret): Matches the *start* of the string (or line in multi-line mode).
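* `$` (Dollar): Matches the *end* of the string, or just before a trailing newline (or the end of each line in multi-line mode).
* `\d`: Matches any *digit*. For `str` patterns in Python 3 this includes all Unicode decimal digits; it is equivalent to `[0-9]` only when the `re.ASCII` flag is set.
* `\w`: Matches any *word character*: letters, digits, and underscore `_`. For `str` patterns this also covers Unicode letters and digits; with `re.ASCII` it is exactly `[a-zA-Z0-9_]`.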
* `\s`: Matches any *whitespace character* (space ` `, tab `\t`, newline `\n`, etc.).
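
A small sketch of the anchors and shorthands listed above, using made-up strings (the sample text and variable names are illustrative assumptions). A few patterns use `+`, meaning "one or more", which is covered in the next cell.

```python
import re

line = "Order 66 shipped"          # illustrative text
two_lines = "one two\nthree four"  # illustrative text

# Anchors tie the pattern to the start or end of the string
starts_with_order = re.search(r"^Order", line) is not None
ends_with_shipped = re.search(r"shipped$", line) is not None
shipped_at_start = re.search(r"^shipped", line)
print(starts_with_order, ends_with_shipped, shipped_at_start)  # True True None

# Shorthands
digits = re.findall(r"\d", line)
words = re.findall(r"\w+", "snake_case x2")
spaces = re.findall(r"\s", "a b\tc")
print(digits)  # ['6', '6']
print(words)   # ['snake_case', 'x2']
print(spaces)  # [' ', '\t']

# By default ^ matches only at the start of the string;
# with re.MULTILINE it also matches at the start of each line
first_default = re.findall(r"^\w+", two_lines)
first_multiline = re.findall(r"^\w+", two_lines, flags=re.MULTILINE)
print(first_default)    # ['one']
print(first_multiline)  # ['one', 'three']
```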

**Exercise 1:** Check if the text *starts* with the word "The".


In [None]:
text_to_search_3 = "The answer is 42."

# TODO: Write a pattern to check if the string starts with "The"
# Use the ^ anchor. Remember a space might follow "The".
start_pattern = # Fill in the blank here (\s matches the space)

match_start = re.search(start_pattern, text_to_search_3)

if match_start:
    print(f"The text starts with 'The'. Match: '{match_start.group()}'")
else:
    print("The text does not start with 'The'.")

# Expected Output: The text starts with 'The'. Match: 'The '


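**Exercise 2:** Find the first two consecutive digits. (Two `\d` in a row would also match the first two digits of a longer number; that is fine here because `42` is the only number in the text.)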


In [None]:
# Text is the same as in Exercise 1 (text_to_search_3)

# TODO: Write a pattern to find exactly two digits in a row.
# Use the \d shorthand. You need two of them.
digit_pattern = # Fill in the blank here

match_digits = re.search(digit_pattern, text_to_search_3)

if match_digits:
    print(f"Found two digits: '{match_digits.group()}'")
else:
    print("Did not find two consecutive digits.")

# Expected Output: Found two digits: '42'


---

## Cell 5: Quantifiers (`*`, `+`, `?`, `{}`)

Quantifiers specify *how many* times the preceding character, group, or character set should occur.

* `*`: Zero or more times. (`go*gle` matches `ggle`, `google`, `gooogle`)
* `+`: One or more times. (`go+gle` matches `google`, `gooogle`, but not `ggle`)
* `?`: Zero or one time (makes it optional). (`colou?r` matches `color`, `colour`)
* `{n}`: Exactly `n` times. (`\d{3}` matches exactly 3 digits)
* `{n,}`: `n` or more times. (`\d{2,}` matches 2 or more digits)
* `{n,m}`: Between `n` and `m` times (inclusive). (`\d{2,4}` matches 2, 3, or 4 digits)
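
The examples in the list above can be checked with a short sketch. The helper `matches_whole` is a hypothetical name introduced here for illustration; it uses `re.fullmatch`, which succeeds only if the entire string matches the pattern.

```python
import re

def matches_whole(pattern, word):
    """Return True if the entire word matches the pattern."""
    return re.fullmatch(pattern, word) is not None

candidates = ["ggle", "google", "gooogle"]

star = [matches_whole(r"go*gle", w) for w in candidates]
plus = [matches_whole(r"go+gle", w) for w in candidates]
optional = [matches_whole(r"colou?r", w) for w in ["color", "colour", "colouur"]]
print(star)      # [True, True, True]
print(plus)      # [False, True, True]
print(optional)  # [True, True, False]

# Quantifiers are greedy: {2,4} takes up to 4 digits at a time,
# so a 6-digit run is split into a 4-digit and a 2-digit match
bounded = re.findall(r"\d{2,4}", "7 42 2024 123456")
print(bounded)   # ['42', '2024', '1234', '56']
```

The last line shows that `{n,m}` limits the length of each match, not the length of the digit run it is found in.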

**Exercise:** Find all sequences of 3 or more digits in the text.


In [None]:
text_to_search_4 = "Item 1 costs $99. Item 2 costs $1500. Item 3 is $7."

# TODO: Find all sequences of 3 or more digits.
# Use \d and the {n,} quantifier. Use re.findall().
digit_seq_pattern = # Fill in the blank here

matches_digits = re.findall(digit_seq_pattern, text_to_search_4)

if matches_digits:
    print(f"Found digit sequences (3+ digits long): {matches_digits}")
else:
    print("No digit sequences of length 3+ found.")

# Expected Output: Found digit sequences (3+ digits long): ['1500']


---

## Cell 6: Grouping (`()`)

Parentheses `()` create groups within your regex pattern. This is useful for:

1.  Applying quantifiers to multiple characters (e.g., `(ha)+` matches `ha`, `haha`, `hahaha`).
2.  Capturing parts of the match separately. `match.group(0)` (or `match.group()`) is the whole match, `match.group(1)` is the content of the first `()`, `match.group(2)` is the second, etc.
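
Both uses of grouping can be sketched as follows. The date example and all sample strings are assumptions chosen for illustration; they are not part of the exercise.

```python
import re

# 1. A quantifier applied to a group repeats the whole group
laugh = re.search(r"(ha)+", "She said hahaha!")
whole_laugh = laugh.group()   # the entire match
last_ha = laugh.group(1)      # a repeated group keeps only its last repetition
print(whole_laugh, last_ha)   # hahaha ha

# 2. Groups capture parts of the match separately
date = re.search(r"(\d{4})-(\d{2})-(\d{2})", "Released on 2024-03-15.")
print(date.group(0))  # 2024-03-15
print(date.group(1))  # 2024
print(date.group(2))  # 03
print(date.groups())  # ('2024', '03', '15')

# With two or more groups, re.findall returns one tuple of groups per match
pairs = re.findall(r"(\w+)=(\d+)", "width=80 height=24")
print(pairs)  # [('width', '80'), ('height', '24')]
```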

**Exercise:** Extract the protocol (http/https) and the domain name from URLs.


In [None]:
text_to_search_5 = "Visit us at http://example.com or https://test-site.org today!"

# TODO: Write a pattern with two groups:
# Group 1: Capture 'http' or 'https'. (Hint: 's?' makes 's' optional)
# Group 2: Capture the domain name (sequence of word characters, hyphens, and dots after ://).
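# Outside a character set, a literal dot must be escaped as \. (inside [] a dot is already literal, so escaping it there is optional).
# Use \w, \., and - inside a character set [] for the domain part; put - last so it is not read as a range. Use + for one or more.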
url_pattern = # Fill in the blank here

# Find all matches; findall returns tuples of the captured groups
url_matches = re.findall(url_pattern, text_to_search_5)

if url_matches:
    print("Found URL parts (protocol, domain):")
    # url_matches will be a list of tuples, e.g., [('http', 'example.com'), ...]
    for match_tuple in url_matches:
        print(f" - Protocol: {match_tuple[0]}, Domain: {match_tuple[1]}")
else:
    print("No URLs with http/https found.")

# Expected Output:
# Found URL parts (protocol, domain):
#  - Protocol: http, Domain: example.com
#  - Protocol: https, Domain: test-site.org


---

## Cell 7: Summary & Next Steps

Congratulations! You've touched on the basics of regex in Python:

* **Literal characters:** Match exact text (`fox`).
* **Metacharacters:** Special meanings (`.`, `[]`, `^`, `$`).
* **Shorthands:** Common patterns (`\d`, `\w`, `\s`).
* **Quantifiers:** Control repetition (`*`, `+`, `?`, `{}`).
* **Grouping:** Capture parts of matches (`()`).
* **Python `re` functions:** `re.search` (find first), `re.findall` (find all).

**Where to go next?**

* Explore more complex patterns (lookarounds, non-capturing groups `(?:...)`).
* Learn about `re.match` (matches *only* at the beginning of the string) and `re.sub` (find and replace).
* Use online regex testers like [regex101.com](https://regex101.com/) to build and debug patterns interactively.
* Practice on different kinds of text data!
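
As a brief preview of the items above, here is a minimal sketch of `re.match`, `re.sub`, and a non-capturing group. The sample sentence and variable names are illustrative assumptions.

```python
import re

text = "The fox has 4 legs and 1 tail."  # illustrative text

# re.match only looks at the beginning of the string; re.search scans all of it
match_fox = re.match(r"fox", text)
match_the = re.match(r"The", text).group()
search_fox = re.search(r"fox", text).group()
print(match_fox, match_the, search_fox)  # None The fox

# re.sub replaces every match of the pattern
masked = re.sub(r"\d", "#", text)
print(masked)  # The fox has # legs and # tail.

# With a capturing group, findall returns the group's content;
# a non-capturing group (?:...) groups without capturing, so findall returns the whole match
capturing = re.findall(r"(ha)+", "haha ha")
non_capturing = re.findall(r"(?:ha)+", "haha ha")
print(capturing)      # ['ha', 'ha']
print(non_capturing)  # ['haha', 'ha']
```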


In [None]:
# No exercise here, just a final message.
print("Regex basics complete! Keep practicing!")
