# Lecture 4: Regular Expressions

Regular expressions (regex) are a powerful notation for matching patterns in text. This tutorial covers regex basics and Python's `re` module, with examples using the Night Vale transcript.

**Prerequisites:**
- Python and the `ling250` environment
- `Night_Vale.txt` in your `data/` folder (download from Blackboard if needed)

## Getting Started

In [None]:
import re

# Load the Night Vale transcript
with open("../data/Night_Vale.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Check it loaded correctly
print(len(text), "characters")
print(text[:500])

For many examples, we'll also use individual lines:

In [None]:
lines = text.split("\n")
print(len(lines), "lines")

## Part 1: Literal Matching

The simplest regex is a literal string. It matches exactly that sequence of characters.

In [None]:
# Find all occurrences of "Night Vale"
matches = re.findall("Night Vale", text)
print(len(matches), "matches")

**Important:** Regex is case-sensitive by default.

In [None]:
# These return different results
print(len(re.findall("night", text)))
print(len(re.findall("Night", text)))

**Note:** A pattern matches substrings too — `"the"` matches both the word "the" and part of "other".

In [None]:
# This matches "the" inside other words too
print(re.findall("the", "the other cat"))

## Part 2: Character Sets

Square brackets `[ ]` define a **set of characters**. The pattern matches any single character from the set.

In [None]:
# Match "Night" or "night"
matches = re.findall("[Nn]ight", text)
print(len(matches), "matches")
print(matches[:10])

### Ranges

Use a hyphen to specify a range of consecutive characters:

In [None]:
# Any lowercase letter
re.findall("[a-z]", "Hello World 123")

In [None]:
# Any digit
re.findall("[0-9]", "Episode 47 aired in 2014")

In [None]:
# Any letter (upper or lower)
re.findall("[a-zA-Z]", "Hello World 123")

### Negation

A caret `^` at the start of a set means "NOT these characters":

In [None]:
# Anything except digits
re.findall("[^0-9]", "Hello 123 World")

In [None]:
# Anything except a space (tabs and newlines would still match)
re.findall("[^ ]", "Hello World")
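
The set `[^ ]` excludes only the space character itself. As a minimal sketch, using a made-up sample string, here is how it differs from excluding all whitespace (the `\s` shorthand is covered in Part 6):

```python
import re

# Made-up sample containing a tab and a newline
sample = "Hello\tWorld\nagain"

# [^ ] excludes only the space character, so the tab and newline still match
not_space = re.findall(r"[^ ]", sample)

# [^\s] (the same as \S) excludes every whitespace character
not_whitespace = re.findall(r"[^\s]", sample)

print("\t" in not_space)        # True
print("\t" in not_whitespace)   # False
print("".join(not_whitespace))  # HelloWorldagain
```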

## Part 3: Quantifiers (Counters)

Quantifiers specify how many times a pattern should repeat.

| Symbol | Meaning |
|--------|---------|
| `?` | Zero or one (optional) |
| `+` | One or more |
| `*` | Zero or more |

In [None]:
# "?" - optional character
# Matches "color" or "colour"
re.findall("colou?r", "color and colour")

In [None]:
# "+" - one or more
# Matches each run of one or more "e" characters
re.findall("e+", "I see the beekeeper")

In [None]:
# "*" - zero or more
re.findall("ab*a", "aa aba abba abbba")

### Combining sets with quantifiers

In [None]:
# One or more lowercase letters (a "word")
re.findall("[a-z]+", "Hello World 123")

In [None]:
# One or more digits (a "number")
re.findall("[0-9]+", "Episode 47 aired in 2014")

### Night Vale example: Find all capitalized words

In [None]:
# Capital letter followed by lowercase letters
capitalized = re.findall("[A-Z][a-z]+", text)
print(capitalized[:20])

## Part 4: The Wildcard

The period `.` matches **any single character** (except newline).

In [None]:
# "beg" + any character + "n"
re.findall("beg.n", "begin begun began beg9n")

Combine with quantifiers:

In [None]:
# Match everything from the first "the " to the last " cat" (.+ is greedy)
re.findall("the .+ cat", "the black cat and the orange cat")

**To match a literal period**, escape it with backslash:

In [None]:
re.findall("Mr\\.", "Mr. Smith and Mrs. Jones")
# Or use a raw string (recommended):
re.findall(r"Mr\.", "Mr. Smith and Mrs. Jones")

**Tip:** Always use raw strings (`r"pattern"`) for regex to avoid escaping issues.
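
To see why raw strings matter, consider `\b`. In an ordinary Python string, `"\b"` is the backspace character, so the regex engine never sees a word-boundary anchor. A small illustration:

```python
import re

phrase = "the other cat"

# Ordinary string: "\b" is one character (backspace); raw string: two characters
print(len("\b"), len(r"\b"))  # 1 2

# Without the r prefix, the pattern looks for literal backspace characters
without_raw = re.findall("\bthe\b", phrase)

# With the r prefix, \b reaches the regex engine as a word boundary
with_raw = re.findall(r"\bthe\b", phrase)

print(without_raw)  # []
print(with_raw)     # ['the']
```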

## Part 5: Anchors

Anchors match **positions**, not characters.

| Symbol | Meaning |
|--------|---------|
| `^` | Beginning of string (of each line with `re.MULTILINE`) |
| `$` | End of string (of each line with `re.MULTILINE`) |
| `\b` | Word boundary |

In [None]:
# Lines that START with "The"
for line in lines[:100]:
    if re.search(r"^The", line):
        print(line[:60])

In [None]:
# Lines that END with a question mark
for line in lines[:100]:
    if re.search(r"\?$", line):
        if len(line) <= 60:
            print(line)
        else:
            print(line[:30] + " ... " + line[-30:])
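
The loops above test one line at a time, so `^` and `$` refer to the start and end of each line string. When searching a multi-line string in one call, `^` and `$` match only at the very start and end unless you pass `re.MULTILINE`. A sketch with a made-up three-line passage:

```python
import re

# Made-up passage with three lines
passage = "The sun rose.\nThe dog barked.\nNobody saw the dog."

# Default: ^ matches only at the start of the whole string
without_flag = re.findall(r"^The", passage)

# re.MULTILINE: ^ also matches right after each newline
with_flag = re.findall(r"^The", passage, re.MULTILINE)

# re.MULTILINE: $ also matches right before each newline
line_endings = re.findall(r"\w+\.$", passage, re.MULTILINE)

print(len(without_flag))  # 1
print(len(with_flag))     # 2
print(line_endings)       # ['rose.', 'barked.', 'dog.']
```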

### Word boundaries

`\b` matches the position between a word character (`\w`) and a non-word character, including the start or end of the string when a word character sits there.

In [None]:
# Match "the" as a whole word, not inside "other"
print(re.findall(r"\bthe\b", "the other cat"))  # ['the']
print(re.findall(r"the", "the other cat"))       # ['the', 'the']
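
Because an apostrophe is not a word character, `\b` also falls inside contractions and possessives. The sketch below uses a made-up phrase; the second pattern is one possible workaround, not the only one, and it accepts both straight and curly apostrophes:

```python
import re

# Made-up phrase with a possessive and a contraction
phrase = "Cecil's show isn't over"

# \b sees a boundary on both sides of the apostrophe, so words get split
split_words = re.findall(r"\b\w+\b", phrase)

# Workaround: allow apostrophe + word characters to continue a word
whole_words = re.findall(r"\w+(?:['’]\w+)*", phrase)

print(split_words)  # ['Cecil', 's', 'show', 'isn', 't', 'over']
print(whole_words)  # ["Cecil's", 'show', "isn't", 'over']
```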

### Night Vale example: Find lines starting with "Cecil"

In [None]:
cecil_starts = [line for line in lines if re.search(r"^Cecil", line)]
print(len(cecil_starts), "lines start with Cecil")
for line in cecil_starts[:5]:
    print(line[:70])

## Part 6: Aliases (Shorthand Classes)

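These are shortcuts for common character sets. The equivalents shown hold for ASCII text; on ordinary Python 3 strings, `\d`, `\w`, and `\s` also match Unicode digits, letters, and whitespace unless you pass `re.ASCII`.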

| Alias | Meaning | Equivalent |
|-------|---------|------------|
| `\d` | Any digit | `[0-9]` |
| `\D` | Any non-digit | `[^0-9]` |
| `\w` | Any "word" character | `[a-zA-Z0-9_]` |
| `\W` | Any non-word character | `[^a-zA-Z0-9_]` |
| `\s` | Any whitespace | `[ \t\n\r\f\v]` |
| `\S` | Any non-whitespace | `[^ \t\n\r\f\v]` |

In [None]:
# Find all numbers in the text
numbers = re.findall(r"\d+", text)
print(numbers[:20])

In [None]:
# Find all "words" (sequences of word characters)
words = re.findall(r"\w+", "Hello, World! How are you?")
print(words)
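
On Python 3 strings, `\w` and `\d` are Unicode-aware, so they match more than the ASCII sets in the table. A minimal sketch with made-up sample strings (written with escapes so the exact characters are unambiguous) showing the effect of `re.ASCII`:

```python
import re

# Made-up samples: "café naïve 42" and "42" followed by Arabic-Indic digits four-two
words_sample = "caf\u00e9 na\u00efve 42"
digits_sample = "42 \u0664\u0662"

# Default: accented letters count as word characters
unicode_words = re.findall(r"\w+", words_sample)

# re.ASCII: \w is limited to [a-zA-Z0-9_], so accented letters split the words
ascii_words = re.findall(r"\w+", words_sample, re.ASCII)

# Default \d matches any Unicode decimal digit; re.ASCII limits it to 0-9
unicode_digits = re.findall(r"\d+", digits_sample)
ascii_digits = re.findall(r"\d+", digits_sample, re.ASCII)

print(unicode_words)
print(ascii_words)   # ['caf', 'na', 've', '42']
print(len(unicode_digits), len(ascii_digits))  # 2 1
```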

### Night Vale example: Extract years

In [None]:
# Treat any standalone 4-digit number as a year (may catch non-years too)
years = re.findall(r"\b\d{4}\b", text)
print(set(years))  # unique years

## Part 7: Disjunction (OR)

The pipe `|` means "either/or":

In [None]:
# Match "cat" or "dog"
re.findall(r"cat|dog", "I have a cat and a dog")

The disjunction applies to entire patterns on each side:

In [None]:
# This matches "cat" OR "dog", not "cadog" or "catog"
re.findall(r"cat|dog", "catdog cadog")

## Part 8: Groups and Parentheses

Parentheses `( )` group parts of a pattern together.

**Important Python quirk:** Regular parentheses create a "capture group". When a pattern contains one capture group, `findall` returns only the captured text, not the whole match; with several groups, it returns one tuple per match.

In [None]:
# With capturing group - returns only what's in parentheses
re.findall(r"gupp(y|ies)", "guppy and guppies")  # ['y', 'ies']

**Use `(?:...)` for non-capturing groups** if you want the whole match:

In [None]:
# Non-capturing group - returns the whole match
re.findall(r"gupp(?:y|ies)", "guppy and guppies")  # ['guppy', 'guppies']
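
Two related behaviors are worth seeing side by side: what `findall` returns when there are several capture groups, and how to get the whole match while still capturing. A short sketch using `re.finditer`, which yields Match objects:

```python
import re

sample = "guppy and guppies"

# Two capture groups: findall returns one tuple per match
pairs = re.findall(r"(gupp)(y|ies)", sample)

# finditer yields Match objects; group(0) is the whole match, group(1) the capture
whole = [m.group(0) for m in re.finditer(r"gupp(y|ies)", sample)]
endings = [m.group(1) for m in re.finditer(r"gupp(y|ies)", sample)]

print(pairs)    # [('gupp', 'y'), ('gupp', 'ies')]
print(whole)    # ['guppy', 'guppies']
print(endings)  # ['y', 'ies']
```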

### Night Vale example: Find "Sheriff's Secret Police" variations

**Note:** The transcript uses a curly apostrophe (`’`) rather than the straight one on your keyboard (`'`). When working with text data, always check what characters are actually present!

In [None]:
# Find "Sheriff’s Secret Police" + the next word (using curly apostrophe)
sheriff = re.findall(r"Sheriff’s Secret Police \w+", text)
print(len(sheriff), "matches")
for match in set(sheriff):
    print(match)

## Part 9: Advanced Quantifiers

For more precise control over repetition:

| Syntax | Meaning |
|--------|---------|
| `{n}` | Exactly n times |
| `{n,m}` | Between n and m times |
| `{n,}` | At least n times |
| `{,m}` | At most m times |

In [None]:
# Three digits in a row (also matches inside longer numbers)
re.findall(r"\d{3}", "12 123 1234 12345")

In [None]:
# 2 to 4 digits (greedy: takes as many as it can, up to 4)
re.findall(r"\d{2,4}", "1 12 123 1234 12345")

In [None]:
# Phone number pattern (simplified)
re.findall(r"\d{3}-\d{3}-\d{4}", "Call 555-123-4567 today!")
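
A counted quantifier limits the match, not the surrounding text, so `\d{3}` happily matches the first three digits of a longer number. If you want numbers that are exactly three digits long, one option is to add word boundaries:

```python
import re

sample = "12 123 1234 12345"

# No anchors: matches three digits wherever they occur
loose = re.findall(r"\d{3}", sample)

# Word boundaries: the three digits must stand alone
strict = re.findall(r"\b\d{3}\b", sample)

print(loose)   # ['123', '123', '123']
print(strict)  # ['123']
```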

## Part 10: Python's `re` Functions

### `re.findall(pattern, string)`

Returns a **list** of all matches:

In [None]:
matches = re.findall(r"\bNight\b", text)
print(len(matches), "occurrences of 'Night'")

### `re.search(pattern, string)`

Returns a **Match object** for the first match, or `None` if no match:

In [None]:
match = re.search(r"Cecil", text)
if match:
    print("Found at position:", match.start())
    print("Matched text:", match.group())
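
A Match object also exposes its capture groups and position. The sketch below uses a made-up line and shows why the `if match:` check matters: calling `.group()` on `None` raises an `AttributeError`.

```python
import re

# Made-up line for illustration
line = "Episode 47 aired in 2014"

match = re.search(r"Episode (\d+) aired in (\d{4})", line)
if match:
    print(match.group(0))  # whole match: Episode 47 aired in 2014
    print(match.group(1))  # first capture group: 47
    print(match.group(2))  # second capture group: 2014
    print(match.span())    # (start, end) positions: (0, 24)

# No match: re.search returns None, so test before using the result
missing = re.search(r"Cecil", line)
print(missing is None)  # True
```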

### `re.sub(pattern, replacement, string)`

**Substitutes** all matches with the replacement:

In [None]:
# Censor "Dog Park"
censored = re.sub(r"Dog Park", "[REDACTED]", text[:1000])
print(censored)
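
`re.sub` has a few options that are handy in practice: a `count` limit, a function as the replacement, and the companion `re.subn`, which also reports how many replacements were made. A sketch on a made-up sentence (the helper name `mask` is only for illustration):

```python
import re

# Made-up sentence for illustration
sentence = "Do not approach the Dog Park. Do not look at the dog park."

# count=1 replaces only the first match; flags must be passed by keyword here
first_only = re.sub(r"dog park", "[REDACTED]", sentence, count=1, flags=re.I)

def mask(match):
    """Replace the matched text with asterisks of the same length."""
    return "*" * len(match.group())

# The replacement can be a function that receives each Match object
masked = re.sub(r"dog park", mask, sentence, flags=re.I)

# subn returns (new_string, number_of_replacements)
censored, n_replaced = re.subn(r"dog park", "[REDACTED]", sentence, flags=re.I)

print(first_only)
print(masked)
print(n_replaced)  # 2
```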

### Flags

Add flags to modify behavior:

In [None]:
# Case-insensitive matching
re.findall(r"night vale", text, re.IGNORECASE)

In [None]:
# Or use the short form
re.findall(r"night vale", text, re.I)
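
Flags can be combined with `|`, and a pattern you reuse can be compiled once with `re.compile`. A minimal sketch on a made-up three-line string:

```python
import re

# Made-up transcript snippet
transcript = (
    "Welcome to Night Vale.\n"
    "night vale is a friendly town.\n"
    "Goodnight, NIGHT VALE."
)

# Case-insensitive only: matches anywhere in the text
anywhere = re.findall(r"night vale", transcript, re.IGNORECASE)

# Combine flags with | and compile the pattern for reuse
line_start = re.compile(r"^night vale", re.IGNORECASE | re.MULTILINE)
at_line_start = line_start.findall(transcript)

print(anywhere)       # ['Night Vale', 'night vale', 'NIGHT VALE']
print(at_line_start)  # ['night vale']
```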

## Part 11: Putting It Together

### Example: Find weather reports

Night Vale has distinctive weather reports. Let's find lines mentioning weather:

In [None]:
# Find lines containing "weather" (case insensitive)
weather_lines = [line for line in lines if re.search(r"\bweather\b", line, re.I)]
print(len(weather_lines), "lines mention weather")
for line in weather_lines[:5]:
    print(line[:80])

### Example: Find words with unusual patterns

In [None]:
from collections import Counter

# Doubled letters inside words: findall returns the captured letter (group 1), not the word
double_letters = re.findall(r"\b\w*([a-z])\1\w*\b", text, re.I)
print("Most common double letters:", Counter(double_letters).most_common(10))
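
Because the pattern has a capture group, `findall` gives back the doubled letter rather than the word containing it. To collect the words themselves, one option is `re.finditer` with `group(0)`. A sketch on a made-up sentence:

```python
import re
from collections import Counter

# Made-up sentence for illustration
sample = "The bookkeeper will see a llama in the hall"
pattern = r"\b\w*([a-z])\1\w*\b"

# findall: only the captured letter from each matching word
letters = re.findall(pattern, sample, re.I)

# finditer + group(0): the whole word that contains a doubled letter
words = [m.group(0) for m in re.finditer(pattern, sample, re.I)]

print(letters)  # ['e', 'l', 'e', 'l', 'l']
print(words)    # ['bookkeeper', 'will', 'see', 'llama', 'hall']
print(Counter(letters).most_common(1))  # [('l', 3)]
```

Note that a word with several doubled letters, such as "bookkeeper", contributes only one letter: the greedy `\w*` leaves the last doubled pair for the capture group.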

In [None]:
# Words ending in -tion or -sion
tion_words = re.findall(r"\b\w+[ts]ion\b", text, re.I)
print("Unique -tion/-sion words:", len(set(tion_words)))
print(set(tion_words))

### Example: Find text in parentheses

In [None]:
# Text between parentheses
parenthetical = re.findall(r'\([^)]+\)', text)
print(len(parenthetical), "parenthetical phrases")
print(parenthetical[:5])

### Example: Find questions

In [None]:
# Text running up to each "?" (a match may begin with a space or newline)
questions = re.findall(r"[^.!?]*\?", text)
print(len(questions), "questions")

for q in questions[:5]:
    if len(q) <= 60:
        print(q)
    else:
        print(q[:30] + " ... " + q[-30:])

---

## Challenges

Try these on your own before looking at the solutions!

### Challenge 1: Find hyphenated words

Find all hyphenated words or phrases like "well-known" or "self-aware".

**Hint:** A hyphenated word is one or more word characters, a hyphen, then more word characters.

In [None]:
# Your code here


### Challenge 2: Find all times

Find patterns like "6:00 PM", "12:30 am", "9:00".

**Hint:** Hours are 1-2 digits, minutes are always 2 digits, AM/PM is optional.

In [None]:
# Your code here


### Challenge 3: Find repeated words

Sometimes writers accidentally repeat words ("the the cat"). Find all cases where a word is immediately repeated in the text.

**Hint:** You'll need a capture group to match a word, then reference that same text again. Look up "backreferences" in regex — the syntax is `\1` to refer to the first capture group.

In [None]:
# Your code here


---

## Quick Reference

| Pattern | Meaning |
|---------|---------|
| `abc` | Literal characters |
| `[abc]` | Any character in set |
| `[^abc]` | Any character NOT in set |
| `[a-z]` | Range of characters |
| `.` | Any character (except newline) |
| `\d` | Digit `[0-9]` |
| `\w` | Word character `[a-zA-Z0-9_]` |
| `\s` | Whitespace |
| `\b` | Word boundary |
| `^` | Start of string (each line with `re.M`) |
| `$` | End of string (each line with `re.M`) |
| `?` | Zero or one |
| `+` | One or more |
| `*` | Zero or more |
| `{n}` | Exactly n |
| `{n,m}` | Between n and m |
| `a\|b` | a OR b |
| `(...)` | Capture group |
| `(?:...)` | Non-capture group |
| `\` | Escape special character |

### Python `re` functions

| Function | Returns |
|----------|---------|
| `re.findall(pattern, string)` | List of all matches |
| `re.search(pattern, string)` | First Match object or None |
| `re.sub(pattern, repl, string)` | String with replacements |

### Useful flags

| Flag | Meaning |
|------|---------|
| `re.IGNORECASE` or `re.I` | Case-insensitive |
| `re.MULTILINE` or `re.M` | `^` and `$` match line boundaries |

---

## Further Reading

- [SLP Chapter 2](https://web.stanford.edu/~jurafsky/slp3/2.pdf) — Regular Expressions
- [Python `re` documentation](https://docs.python.org/3/library/re.html)
- [regex101.com](https://regex101.com/) — Interactive regex tester (set flavor to Python)

---

## Challenge Solutions

Scroll down only after you've attempted the challenges yourself!

<br><br><br><br><br><br><br><br><br><br>

### Solution 1: Hyphenated words

In [None]:
# (?:-\w+)+ allows more than one hyphen, so longer compounds match whole
hyphenated = re.findall(r"\b\w+(?:-\w+)+\b", text)
print(len(hyphenated), "hyphenated words")
print(hyphenated[:15])

### Solution 2: Times

In [None]:
times = re.findall(r"\d{1,2}:\d{2}(?:\s*[AaPp][Mm])?", text)
print(times[:20])
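
As a quick sanity check, the same pattern can be run on a made-up sentence containing the three formats from the challenge description:

```python
import re

# Made-up sentence containing the example formats from Challenge 2
sample = "The show starts at 6:00 PM, repeats at 12:30 am, and ends by 9:00."

# Hours (1-2 digits), colon, minutes (2 digits), optional AM/PM in any case
found = re.findall(r"\d{1,2}:\d{2}(?:\s*[AaPp][Mm])?", sample)

print(found)  # ['6:00 PM', '12:30 am', '9:00']
```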

### Solution 3: Repeated words

In [None]:
# Find repeated words (case insensitive)
repeated = re.findall(r"\b(\w+)\s+\1\b", text, re.I)
print(repeated[:20])

# See them in context
for match in re.finditer(r"\b(\w+)\s+\1\b", text, re.I):
    start = max(0, match.start() - 20)
    end = min(len(text), match.end() + 20)
    print(f"...{text[start:end]}...")