
# Regular Expressions

Regular expressions (regex) are a mini language for finding text.
When you need to find an ID, a username, or any other pattern inside a long string, regex makes the search fast and precise.

---


Regex lets us search by pattern, not by exact words.
We’ll start with two tools:

- re.match → checks only the start of the string.
- re.search → finds the first match anywhere in the string.

By the end, you’ll know which one to use, how to read the Match object, and how to extract the matched text.

---

## We will cover
`re.search`, `re.match`, `re.findall`, `re.finditer`, `re.sub`, plus core syntax: character classes, quantifiers, anchors, and groups.







## 1) `re.search` vs `re.match`
`match` checks from the **start** of the string; `search` finds the **first occurrence anywhere**.

In [None]:
import re  # import regex library
text = 'ID=abc123; user=me'
print(re.match(r'ID=\w+', text))      # from the start
print(re.search(r'user=\w+', text))    # anywhere

<re.Match object; span=(0, 9), match='ID=abc123'>
<re.Match object; span=(11, 18), match='user=me'>
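
The printed Match objects above hold more than their display form. The following is a minimal sketch of reading one, using the same `text`; the variable names (`m`, `matched`, `start`, `end`, `no_match`) are illustrative, and the `if m` guard is there because both functions return `None` when nothing matches.

```python
import re

text = 'ID=abc123; user=me'
m = re.search(r'user=\w+', text)

if m:  # search/match return None when there is no match
    matched = m.group(0)      # the exact text that matched
    start, end = m.span()     # where it sits in the original string
    print(matched, start, end)
    print(text[start:end] == matched)  # slicing with the span gives the same text

# match only looks at the start of the string, so this pattern is not found
no_match = re.match(r'user=\w+', text)
print(no_match)
```

In practice, `search` is the usual choice unless the pattern must begin at the first character.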


In [None]:
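import re  # import regex library
samples = ['ID=xyz789; user=anna', 'user=bob; ID=000']
for s in samples:  # loop over items
    m1 = re.match(r'ID=\w+', s)  # match at string start
    m2 = re.search(r'user=\w+', s)  # search first match
    print('\nText:', s)  # display output
    print('  match :', m1.group(0) if m1 else None)  # None when the string does not start with ID=
    print('  search:', m2.group(0) if m2 else None)  # user= is found anywhere in the string




Text: ID=xyz789; user=anna
  match : ID=xyz789
  search: user=anna

Text: user=bob; ID=000
  match : None
  search: user=bob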


## 2) Regex basics

Idea: A regex pattern says what to find, how many, and where in the string.

---

## What to find (character classes)
- \d → one digit (0–9)  
  *Example:* \d matches 5 in ID5.
- \w → one word character (letter, digit, or `_`)  
  *Not a whole word—just one character.*  
  *Example:* \w matches A in A-1.

Common combos
- \d+ → one or more digits (e.g., `123`)
- \w+ → one or more word characters (e.g., `abc_12`)
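
The classes and combos above can be tried directly. This is a small sketch with made-up sample strings; the variable names are illustrative only.

```python
import re

one_digit = re.search(r'\d', 'ID5').group(0)       # first single digit
one_word_char = re.search(r'\w', 'A-1').group(0)   # first single word character
digit_runs = re.findall(r'\d+', 'a1 b22 c333')     # each run of digits
word_runs = re.findall(r'\w+', 'abc_12, x-y')      # ',' ' ' and '-' are not word characters

print(one_digit, one_word_char)
print(digit_runs)
print(word_runs)
```

Note how `\w+` keeps `abc_12` together (the underscore counts as a word character) but splits `x-y` at the dash.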

---

## How many (quantifiers)
- `+` = one or more (≥1) → `\w+`, `\d+`
- `*` = zero or more (≥0) → `\s*`
- `?` = zero or one (optional) → `colou?r`
- `{m,n}` = between *m* and *n* → `\d{2,4}` (2 to 4 digits); `{m}` = exactly *m* → `\d{3}` (exactly 3 digits)

*Quantifiers are greedy by default (they take as much as they can and still match).*
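
Greedy behavior is easiest to see side by side with its lazy counterpart. The sketch below uses a made-up string; the lazy form `+?` (a `?` placed after a quantifier) is not covered elsewhere in this notebook and is shown here only for contrast.

```python
import re

html = '<a><b>'
greedy = re.search(r'<.+>', html).group(0)   # .+ takes as much as it can
lazy = re.search(r'<.+?>', html).group(0)    # .+? stops at the first '>'

print(greedy)
print(lazy)
```

The greedy pattern returns the whole string, while the lazy one returns only the first tag.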

---

## Where in the string (anchors)
- ^ = start of the string
- $ = end of the string (it also matches just before a trailing newline)

Examples
- ^\d+ → digits only if they’re at the start
- ^\d+$ → the entire string is digits (start to end)
- re.fullmatch(r'\d+', s) → code way to force the whole string to be digits
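
The three anchor examples above can be compared on a few sample strings. This is a sketch, with the sample strings and variable names chosen for illustration.

```python
import re

results = {}
for s in ['123', '123abc', 'abc123']:
    starts_with_digits = bool(re.match(r'\d+', s))     # digits at the start
    all_digits_anchor = bool(re.search(r'^\d+$', s))   # anchors at both ends
    all_digits_full = bool(re.fullmatch(r'\d+', s))    # whole string must match
    results[s] = (starts_with_digits, all_digits_anchor, all_digits_full)
    print(repr(s), results[s])

# One difference: $ also matches just before a trailing newline, fullmatch does not
trailing_anchor = bool(re.search(r'^\d+$', '123\n'))
trailing_full = bool(re.fullmatch(r'\d+', '123\n'))
print(trailing_anchor, trailing_full)
```

When input may carry a trailing newline (for example, lines read from a file), `re.fullmatch` is the stricter check.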

---




## Key takeaway
- \d / \w say what kind of characters.
- +, *, ?, {m,n} say how many.
- ^ and $ say where (start/end).

In [None]:
import re

text = "codes: a1, b22, c333; end"

# extract all codes made of one letter + 1–3 digits (bounded by word boundaries)
code_pat = re.compile(r"\b[A-Za-z]\d{1,3}\b")  # [A-z] would also match [ \ ] ^ _ `
print("found codes:", code_pat.findall(text))

# does the line start with 'codes:' (allow optional leading spaces)?
print("starts with 'codes:'?", bool(re.match(r"^\s*codes:", text)))


found codes: ['a1', 'b22', 'c333']
starts with 'codes:'? True


## 3) Groups (capturing data) and **named groups**


What groups do: Parentheses (...) let you grab pieces of the text that matched your pattern.

You can read them back with numbers (group(1)) or names (group('name')).

---

When you want to pull fields out of a string

Do this:
 1. Decide the pieces you need (e.g., user, id).
 2. Wrap each piece in a group: (...) or (?P<name>...).
 3. Run re.search (first match) or re.finditer (all matches).
 4. Read the results with m.group(1) / m.group('name').

In [None]:
import re
text = "user: alice, id=007"

m = re.search(r"user:\s*(?P<user>\w+),\s*id=(?P<id>\d+)", text)
print(m.group('user'), m.group('id'))
# Or: print(m.groupdict()) -> {'user': 'alice', 'id': '007'}

alice 007
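
Step 3 also mentions `re.finditer` for all matches. A minimal sketch, assuming a string that holds two records in the same format (the second record, `bob` / `042`, is invented for illustration):

```python
import re

text = "user: alice, id=007; user: bob, id=042"
pat = re.compile(r"user:\s*(?P<user>\w+),\s*id=(?P<id>\d+)")

# finditer yields one Match object per occurrence
records = [m.groupdict() for m in pat.finditer(text)]
print(records)

# Named groups are still numbered, so both access styles work
first = pat.search(text)
print(first.group(1), first.group('user'))
```

Each dictionary in `records` has one key per named group, which makes the result easy to pass to JSON or a table later.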


When you want to parse log lines

Do this: give each part a named group so your code reads like English.

Example: "2025-05-02 08:01:05 [WARN] Low disk"

In [None]:
import re
line = "2025-05-02 08:01:05 [WARN] Low disk"

pat = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2})\s+"
    r"(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"\[(?P<level>[A-Z]+)\]\s+"
    r"(?P<msg>.+)"
)

m = pat.search(line)
print(m.group('date'), m.group('time'), m.group('level'), m.group('msg'))
# 2025-05-02 08:01:05 WARN Low disk

2025-05-02 08:01:05 WARN Low disk


## 4) `findall` and `finditer` examples (emails)
`findall` returns a list of matches; `finditer` yields match objects (useful for spans).

In [None]:
import re

text = "Emails: a@x.com, b_y@z.org; invalid: c@@bad"

#  email pattern
pat = re.compile(r"\b[\w.-]+@[\w.-]+\.[A-Za-z]{2,}\b")

# 1) findall → list of the matched strings
print("findall:", pat.findall(text))        # -> ['a@x.com', 'b_y@z.org']

# 2) finditer → match objects (so you can get spans/positions)
print("finditer with spans:")
for m in pat.finditer(text):
    print(" match:", m.group(0), " span:", m.span())
    # m.group(0) → the exact text that matched.
    # m.span() → a pair (start, end) showing where it sits in the original string.

findall: ['a@x.com', 'b_y@z.org']
finditer with spans:
 match: a@x.com  span: (8, 15)
 match: b_y@z.org  span: (17, 26)


Reading the pattern piece by piece:

- `\b` → word boundary (don’t match inside longer tokens).
- `[\w.-]+` → one or more word characters, dots, or dashes (the local part, before the @).
- `@` → the at sign.
- `[\w.-]+` → one or more word characters, dots, or dashes (the domain name).
- `\.` → a literal dot before the TLD.
- `[A-Za-z]{2,}` → top-level domain with 2+ letters (com, org, …).
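
One behavior worth knowing: when the pattern contains capturing groups, `findall` returns the groups (as tuples) instead of the full match. The sketch below wraps the local part and the domain of the same email pattern in groups; this grouped variant is an illustration, not part of the original example.

```python
import re

text = "Emails: a@x.com, b_y@z.org; invalid: c@@bad"
# Same pattern, with the local part and the domain in capturing groups
pat = re.compile(r"\b([\w.-]+)@([\w.-]+\.[A-Za-z]{2,})\b")

pairs = pat.findall(text)                         # list of (local, domain) tuples
full = [m.group(0) for m in pat.finditer(text)]   # full matched text via finditer

print(pairs)
print(full)
```

If you need the whole match and the pieces together, `finditer` is usually the more convenient of the two.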


## 5) `re.sub` for masking/replacing
Use `sub` to replace sensitive data.

In [None]:
# Real Output Demo: mask digits like credit‑card fragments
import re  # import regex library
text = 'Order 123-456-7890 placed by user 42; cc 4111-1111-1111-1111'
masked = re.sub(r'\d', '•', text)  # substitute text
print(masked)  # display output


Order •••-•••-•••• placed by user ••; cc ••••-••••-••••-••••
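
Masking every digit also hides the order number and the user ID. A narrower sketch masks only the card number and keeps its last four digits through a backreference (`\1`); it assumes cards are written as four dash-separated groups of four digits, which is an assumption about the data rather than a general card format.

```python
import re

text = 'Order 123-456-7890 placed by user 42; cc 4111-1111-1111-1111'

# Assumed card format: dddd-dddd-dddd-dddd; the last group is captured
card_pat = re.compile(r'\b(?:\d{4}-){3}(\d{4})\b')
masked_card = card_pat.sub(r'••••-••••-••••-\1', text)  # \1 re-inserts the captured digits

print(masked_card)
```

Here `(?:...)` is a non-capturing group, so the only numbered group is the final four digits.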


## 6) Combine with File I/O
Create a tiny log file, then extract structured records and save them to JSON.

The steps: compile the regex pattern once, search each line for the first match, then read the named capture groups with `groupdict()`.

In [None]:
import re, json  # import regex library
from pathlib import Path  # import module(s)

rgx = re.compile(r'(?P<date>\d{4}-\d{2}-\d{2})\s+(?P<time>\d{2}:\d{2}:\d{2})\s+\[(?P<level>[A-Z]+)\]\s+(?P<msg>.+)')  # compile regex pattern
path = Path('demo_logs.txt')
path.write_text('''2025-05-02 08:00:00 [INFO] Boot
2025-05-02 08:01:05 [WARN] Low disk
this is junk
''')

records = []
for line in path.read_text().splitlines():
    m = rgx.search(line)
    if m:
        records.append(m.groupdict())

with open('demo_logs.json', 'w') as f:
    json.dump(records, f, indent=2)
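
The loop above silently drops lines that do not match. The variant below is an in-memory sketch of the same idea that also keeps the unmatched lines and round-trips the records through JSON text; the names `raw`, `skipped`, `restored`, and `warnings` are illustrative, and no file is written.

```python
import re, json

rgx = re.compile(
    r'(?P<date>\d{4}-\d{2}-\d{2})\s+(?P<time>\d{2}:\d{2}:\d{2})\s+'
    r'\[(?P<level>[A-Z]+)\]\s+(?P<msg>.+)'
)
raw = """2025-05-02 08:00:00 [INFO] Boot
2025-05-02 08:01:05 [WARN] Low disk
this is junk
"""

records, skipped = [], []
for line in raw.splitlines():
    m = rgx.search(line)
    if m:
        records.append(m.groupdict())
    else:
        skipped.append(line)  # keep unmatched lines for inspection

restored = json.loads(json.dumps(records, indent=2))  # round-trip through JSON text
warnings = [r for r in restored if r['level'] == 'WARN']

print(len(records), 'parsed;', len(skipped), 'skipped')
print(warnings)
```

Keeping the skipped lines makes it easier to notice when a pattern is too strict for real log data.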

To check the result, print the saved JSON file:

In [None]:
# Real Output Demo: show the JSON content
from pathlib import Path  # import module(s)
print(Path('demo_logs.json').read_text())  # display output
