## Regex Treasure Hunt


The cell below generates a large text file (`data/regex_treasure.txt`) filled with emails, URLs, phone numbers, hashtags,
and sentinel tokens. Your job is to extract each kind of item with a regular expression and reveal a hidden message.


In [1]:
import numpy as np
from pathlib import Path
import re
import random

# Reproducibility
RNG_SEED = 123
np.random.seed(RNG_SEED)
random.seed(RNG_SEED)

DATA_DIR = Path("./data")
DATA_DIR.mkdir(exist_ok=True, parents=True)

print("Environment ready. Using data dir:", DATA_DIR.resolve())


# Generate/overwrite the treasure file deterministically
treasure_path = DATA_DIR / "regex_treasure.txt"
np.random.seed(RNG_SEED)
random.seed(RNG_SEED)

special_lines = {251: "->TIDY<-", 173: "->IT<-", 97: "->KEEP<-"}

lines = []
hashtags_bank = ["#DATA","#CLEANING","#WINS","#TODAY","#KEEP","#IT","#TIDY","#STAY","#CURIOUS","#REGEX"]

for i in range(1, 1200):
    parts = []
    if i in special_lines:
        parts.append(special_lines[i])
    # random content
    if random.random() < 0.5:
        parts.append(f"{random.randint(100,999)}-{random.randint(100,999)}-{random.randint(1000,9999)}")
    if random.random() < 0.5:
        parts.append(f"{random.choice(['http','https'])}://{random.choice(['www.',''])}{random.choice(['example','datahub','regexlab','university'])}.{random.choice(['com','org','edu','net'])}/{random.choice(['a','b','c'])}/{random.randint(10,999)}")
    if random.random() < 0.5:
        parts.append(f"{random.choice(['alex','jordan','sam','kay','taylor'])}@{random.choice(['example','datahub','regexlab','university'])}.{random.choice(['com','org','edu','net'])}")
    if random.random() < 0.05:
        parts.append(random.choice(['Unknown','NA','999']))
    if random.random() < 0.1:
        parts.append(random.choice(['ALERT','NOTICE','FLAG']))
    if random.random() < 0.2:
        parts.append(random.choice(hashtags_bank))
    lines.append(" ".join(parts) if parts else "ok")

# Add a few tricky cases
lines.append("Contact: help(at)example(dot)org #notQuiteAnEmail")
lines.append("Edge phone: (801) 245-6789 (not our pattern)")

treasure_path.write_text("\n".join(lines), encoding="utf-8")
print("Wrote", treasure_path, "with", len(lines), "lines")


Environment ready. Using data dir: C:\Users\tkerby2\Desktop\Teaching\Winter_2026\STAT_386\demos\regex_missing_walkthrough\data
Wrote data\regex_treasure.txt with 1201 lines
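The two lines appended as "tricky cases" look like negative examples: an obfuscated email and a phone number in a different layout. A minimal sketch, assuming simple email and phone patterns (these patterns are illustrative guesses, not the official solutions), shows that neither line should produce a match:

```python
import re

# The two tricky lines appended to the end of the generated file
tricky = [
    "Contact: help(at)example(dot)org #notQuiteAnEmail",
    "Edge phone: (801) 245-6789 (not our pattern)",
]

# Assumed patterns for illustration only
EMAIL = r"\b\w+@\w+\.\w+\b"
PHONE = r"\b\d{3}-\d{3}-\d{4}\b"

# Neither line contains an '@' or a ###-###-#### run, so every list is empty
email_hits = [re.findall(EMAIL, line) for line in tricky]
phone_hits = [re.findall(PHONE, line) for line in tricky]

print("email hits:", email_hits)
print("phone hits:", phone_hits)
```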



### Tasks
1. **Emails:** Extract all valid emails (simple pattern: word chars before `@`, word chars for domain, dot TLD).  
2. **URLs:** Extract `http/https` URLs up to the first whitespace or `)` char.  
3. **Phones:** Extract US-like phone numbers in `###-###-####` format.  
4. **ALL-CAPS words:** Extract all-caps tokens (ALERT/NOTICE/FLAG/etc.).  
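5. **Hashtags at line start:** Extract hashtags that begin a line (use `re.MULTILINE` so `^` matches at the start of every line).
6. **Hidden message:** Extract the `->WORD<-` sentinel tokens to reveal the hidden message.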
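One possible set of patterns for the first four tasks can be sketched on a single made-up line. The `sample` string and the `PATTERNS` dictionary are assumptions for illustration; they are deliberately simple and not the only valid answers.

```python
import re

# Hypothetical sample line that mimics the generated file's layout
sample = "->KEEP<- 555-123-4567 https://www.example.com/a/42 alex@datahub.org ALERT #REGEX"

# Candidate patterns (assumptions, not the official solutions)
PATTERNS = {
    "emails": r"\b\w+@\w+\.\w+\b",      # word chars, '@', word chars, dot, TLD
    "urls": r"https?://[^\s)]+",        # stop before whitespace or ')'
    "phones": r"\b\d{3}-\d{3}-\d{4}\b", # ###-###-####
    "allcaps": r"\b[A-Z]{2,}\b",        # two or more capitals as a whole word
}

found = {name: re.findall(pattern, sample) for name, pattern in PATTERNS.items()}
for name, matches in found.items():
    print(f"{name}: {matches}")
```

Note that this simple all-caps pattern also picks up the words inside sentinel tokens and hashtags (`KEEP` and `REGEX` here). Whether that counts as correct depends on how strictly you read the task.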


In [2]:

text = (DATA_DIR / "regex_treasure.txt").read_text(encoding="utf-8")

# 1) Emails (very simple pattern; not RFC-perfect)
emails = re.findall(r"FIXME", text)

# 2) URLs (stop before whitespace or ')')
urls = re.findall(r"FIXME", text)

# 3) Phones ###-###-####
phones = re.findall(r"FIXME", text)

# 4) ALL-CAPS tokens
allcaps = re.findall(r"FIXME", text)

# 5) Hashtags at start-of-line
hashtags_line_start = re.findall(r"^\#", text, flags=re.MULTILINE)

print("emails:", len(emails))
print("urls:", len(urls))
print("phones:", len(phones))
print("ALLCAPS:", len(allcaps))
print("hashtags(line-start):", hashtags_line_start[:10], "...")


emails: 0
urls: 0
phones: 0
ALLCAPS: 0
hashtags(line-start): ['#', '#', '#', '#', '#', '#', '#', '#', '#', '#'] ...
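The starter pattern `^\#` matches only the `#` character itself, which is why the output above is a list of bare `#` strings. A small sketch on a made-up multi-line string (the `sample` text is an assumption) shows how adding `\w+` captures the whole tag and what `re.MULTILINE` changes:

```python
import re

# Hypothetical mini-file: two hashtags start a line, one sits mid-line
sample = "#DATA\n555-123-4567 #WINS\n#TODAY\nok"

# Only the '#' character is matched, once per line that starts with it
only_hash = re.findall(r"^#", sample, flags=re.MULTILINE)

# Adding \w+ captures the full hashtag; MULTILINE lets ^ match at each line start
full_tags = re.findall(r"^#\w+", sample, flags=re.MULTILINE)

# Without MULTILINE, ^ matches only at the very start of the string
no_flag = re.findall(r"^#\w+", sample)

print(only_hash)
print(full_tags)
print(no_flag)
```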


### Reveal the hidden message from sentinel tokens

Each word of the hidden message is wrapped as `->WORD<-`, with `->` before the word and `<-` after it.
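One way to pull out just the word is a capture group: when a pattern contains a single group, `re.findall` returns the group's contents instead of the whole match. A minimal sketch on a made-up string (the `sample` text is an assumption; the pattern is one option among several):

```python
import re

# Hypothetical lines containing sentinel tokens among other content
sample = "->KEEP<- 555-123-4567\nok\n->IT<- #DATA\n->TIDY<-"

# The parentheses capture only the word between the arrows
words = re.findall(r"->(\w+)<-", sample)

# Matches come back in file order, so joining them spells out the message
message = " ".join(words)
print(words)
print(message)
```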

In [3]:
secret_message = re.findall(r"FIXME", text)
print("Secret message:", secret_message)


Secret message: []
