# Regex (Regular Expressions) for Web Scraping

An introduction to **regex** in Python, with **practical scraping-style examples** (extracting emails, prices, dates, IDs, etc.).

## Learning goals
- Understand what regex is and when to use it
- Read common patterns (character classes, quantifiers, groups)
- Use `re.search`, `re.findall`, `re.sub`, `re.split`
- Apply regex to messy text from the web (snippets, HTML, logs)

> Tip: Regex is powerful but should not replace HTML parsing. For web scraping, **use BeautifulSoup to get the right text/attributes**, then use regex to **clean** or **extract** patterns.

In [None]:
import re
from pprint import pprint

## 1) What is regex?

A **regular expression** is a pattern used to **match text**.

Common scraping tasks:
- Find **emails**, **phone numbers**, **prices**, **postal codes**
- Extract **IDs** from URLs
- Normalize text (remove extra spaces, convert formats)
- Validate or filter content

Python regex lives in the built-in module: `re`.

## 2) The 4 most used functions

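- `re.search(pattern, text)` → finds the **first** match anywhere; returns a match object, or `None` if there is no match
- `re.findall(pattern, text)` → returns **all** non-overlapping matches as a list (the captured groups, if the pattern has any)
- `re.sub(pattern, repl, text)` → **replaces** every match with `repl`
- `re.split(pattern, text)` → splits text wherever the pattern matches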


In [None]:
text = "My email is alice@example.com and my backup is bob.smith@company.co.uk"
pattern = r"[\w.-]+@[\w.-]+\.\w+"  # a simple (not perfect) email regex

m = re.search(pattern, text)
print("search ->", m.group(0))

all_emails = re.findall(pattern, text)
print("findall ->", all_emails)

masked = re.sub(pattern, "<EMAIL>", text)
print("sub ->", masked)
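
The cell above does not exercise `re.split`. Here is a minimal sketch of it; the `raw_tags` string is a made-up example of a tag list with inconsistent separators.

```python
import re

# Hypothetical scraped tag list with inconsistent separators
raw_tags = "python, regex;scraping |  pandas"

# Split on a comma, semicolon, or pipe, swallowing any surrounding whitespace
tags = re.split(r"\s*[,;|]\s*", raw_tags)
print("split ->", tags)
```

Putting `\s*` on both sides of the separator class means the pieces come back already trimmed.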


## 3) Core building blocks (cheat sheet)

### Character classes
- `.` any character (except newline)
- `\d` digit, `\D` non-digit
- `\w` word char (letters/digits/_), `\W` non-word
- `\s` whitespace, `\S` non-whitespace
- `[abc]` one of a/b/c
- `[a-z]` range
- `[^a-z]` NOT in the range

### Quantifiers
- `?` 0 or 1
- `*` 0 or more
- `+` 1 or more
- `{n}` exactly n
- `{n,}` at least n
- `{n,m}` between n and m

### Anchors
- `^` start of string (start of each line with `re.MULTILINE`)
- `$` end of string (end of each line with `re.MULTILINE`)
- `\b` word boundary

### Groups
- `( ... )` capturing group
- `(?: ... )` non-capturing group
- `(?P<name> ... )` named capturing group
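
A small sketch of how quantifiers and anchors combine. The five-digit postal-code rule and the sample strings are assumptions made up for illustration.

```python
import re

# {5} between ^ and $: the whole string must be exactly five digits
postal = r"^\d{5}$"
is_valid = bool(re.search(postal, "75001"))
too_long = bool(re.search(postal, "750012"))
print(is_valid, too_long)

# \b stops "cat" from matching inside "concatenate"
words = re.findall(r"\bcat\b", "cat concatenate cat.")
print(words)
```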


In [None]:
samples = [
    "Order #A-10293 shipped",
    "Order #B-7 shipped",
    "No order here",
]

pattern = r"#([A-Z])-([0-9]+)"  # capture letter and digits

for s in samples:
    m = re.search(pattern, s)
    if m:
        print(s, "->", m.group(0), "| groups:", m.group(1), m.group(2))
    else:
        print(s, "-> no match")
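
The same order format can also be read with named groups, which tends to be easier to follow than numbered groups once a pattern grows. This is only a sketch: the group names `series` and `number` are illustrative choices.

```python
import re

# Same order format as above, with named groups (names are illustrative)
order_pattern = r"#(?P<series>[A-Z])-(?P<number>\d+)"

m = re.search(order_pattern, "Order #A-10293 shipped")
if m:
    print(m.group("series"), m.group("number"))
    order = m.groupdict()  # all named groups as a dict
    print(order)
```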

## 4) Greedy vs lazy

Quantifiers like `*` and `+` are **greedy** by default (they match as much as possible).
Use `*?` or `+?` to make them **lazy** (match as little as possible).


In [None]:
html = "<div>Price: <b>19.99</b> EUR</div><div>Price: <b>5.00</b> EUR</div>"

greedy = re.findall(r"<b>(.*)</b>", html)
lazy = re.findall(r"<b>(.*?)</b>", html)

print("greedy:", greedy)
print("lazy:", lazy)
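
A common alternative to lazy matching is a negated character class. As a sketch on the same `html` string: `[^<]*` cannot run past the next `<`, so it stops at the closing tag on its own.

```python
import re

html = "<div>Price: <b>19.99</b> EUR</div><div>Price: <b>5.00</b> EUR</div>"

# [^<]* matches anything except "<", so it can never swallow a closing tag
negated = re.findall(r"<b>([^<]*)</b>", html)
print("negated class:", negated)
```

This only works when the captured text contains no nested tags. For nested markup, go back to BeautifulSoup.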

## 5) Flags (multiline, ignore case, dotall)

- `re.IGNORECASE` (or `re.I`): case-insensitive
- `re.MULTILINE` (or `re.M`): `^` and `$` work per line
- `re.DOTALL` (or `re.S`): `.` matches newlines too


In [None]:
text = """Name: Alice\nname: Bob\nNAME: Charlie"""
print(re.findall(r"^name:\s*(\w+)", text, flags=re.I | re.M))
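
The cell above covers `re.I` and `re.M`. Here is a minimal sketch for `re.S`, using a made-up `snippet` whose content spans two lines.

```python
import re

# Hypothetical snippet where the paragraph text contains a newline
snippet = "<p>First line\nsecond line</p>"

# Without DOTALL, "." stops at the newline, so nothing matches
without_dotall = re.findall(r"<p>(.*?)</p>", snippet)

# With DOTALL, "." also matches the newline
with_dotall = re.findall(r"<p>(.*?)</p>", snippet, flags=re.S)

print("default:", without_dotall)
print("DOTALL:", with_dotall)
```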

## 6) Practical examples

### A) Extract prices (€, $, with comma or dot)

Real-world issue: prices may appear as `9,99€`, `€9.99`, `$ 1,200.50`, etc.


In [None]:
text = "Promo: 9,99€ now! Old price: €12.50. US format: $ 1,200.50"

# Currency-first or amount-first; both branches allow optional thousand separators
price_pattern = r"(?P<currency>€|\$)\s?(?P<amount>\d{1,3}(?:[\.,]\d{3})*(?:[\.,]\d{2})?)|(?P<amount2>\d+(?:[\.,]\d{3})*(?:[\.,]\d{2})?)\s?(?P<currency2>€|\$)"

matches = list(re.finditer(price_pattern, text))
for m in matches:
    d = m.groupdict()
    currency = d.get('currency') or d.get('currency2')
    amount = d.get('amount') or d.get('amount2')
    print("price ->", currency, amount, "| raw:", m.group(0))
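
Once the amounts are extracted, they usually need converting to numbers. Below is one possible sketch, not a general solution: `to_decimal` is a hypothetical helper that assumes a final separator followed by exactly two digits is the decimal part and that any other `.` or `,` is a thousand separator.

```python
import re
from decimal import Decimal

def to_decimal(amount):
    """Hypothetical helper: '1,200.50' -> Decimal('1200.50'), '9,99' -> Decimal('9.99')."""
    # Assumption: a trailing separator + exactly two digits is the decimal part
    m = re.fullmatch(r"(?P<whole>[\d.,]*\d)[.,](?P<cents>\d{2})", amount)
    if m:
        whole, cents = m.group("whole"), m.group("cents")
    else:
        whole, cents = amount, "00"
    whole = re.sub(r"[.,]", "", whole)  # drop thousand separators
    return Decimal(f"{whole}.{cents}")

amounts = ["9,99", "12.50", "1,200.50"]
values = [to_decimal(a) for a in amounts]
print(values)
```

The heuristic breaks on formats with one or three decimal places, so check it against the sites you actually scrape.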

### B) Extract dates in multiple formats

Examples: `2026-01-20`, `20/01/2026`, `Jan 20, 2026`.

> In scraping, you often need regex **just to detect/extract**, then use Pandas to parse into datetimes.


In [None]:
text = "Release: 2026-01-20 | Updated: 20/01/2026 | Blog: Jan 20, 2026"

date_pattern = r"\b(\d{4}-\d{2}-\d{2}|\d{2}/\d{2}/\d{4}|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+\d{1,2},\s+\d{4})\b"
print(re.findall(date_pattern, text))
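
After extraction, the strings still need parsing. The sketch below uses the standard-library `datetime.strptime` rather than Pandas, only to stay dependency-free. The helper `parse_date` is hypothetical, and it assumes slash dates are day-first and that month abbreviations are English.

```python
import re
from datetime import datetime

text = "Release: 2026-01-20 | Updated: 20/01/2026 | Blog: Jan 20, 2026"
date_pattern = r"\b(\d{4}-\d{2}-\d{2}|\d{2}/\d{2}/\d{4}|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+\d{1,2},\s+\d{4})\b"

# Assumptions: slash dates are DD/MM/YYYY; "%b" relies on an English locale
FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%b %d, %Y")

def parse_date(raw):
    """Hypothetical helper: try each known format, return None if none fits."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw, fmt).date()
        except ValueError:
            continue
    return None

parsed = [parse_date(d) for d in re.findall(date_pattern, text)]
print(parsed)
```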

### C) Extract IDs from URLs

Useful for e-commerce product IDs, user IDs, etc.


In [None]:
urls = [
    "https://site.com/product/12345?ref=home",
    "https://site.com/product/98765",
    "https://site.com/about",
]

pat = r"/product/(\d+)"
for u in urls:
    m = re.search(pat, u)
    print(u, "->", m.group(1) if m else None)
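
When the same pattern runs over many URLs, it can be compiled once with `re.compile` and reused. This sketch collects the IDs as integers and skips URLs that do not match; the group name `id` is an illustrative choice, and the `:=` operator needs Python 3.8 or later.

```python
import re

urls = [
    "https://site.com/product/12345?ref=home",
    "https://site.com/product/98765",
    "https://site.com/about",
]

# Compile once, reuse for every URL
product_re = re.compile(r"/product/(?P<id>\d+)")

# Keep only the URLs that match, converting the captured digits to int
ids = [int(m.group("id")) for u in urls if (m := product_re.search(u))]
print(ids)
```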

### D) Clean messy whitespace (common after scraping)

Scraped text often contains newlines, tabs, multiple spaces.


In [None]:
raw = "\n\t  This   is\n a    messy\t\ttext.   "
clean = re.sub(r"\s+", " ", raw).strip()
print("raw:", repr(raw))
print("clean:", repr(clean))

### E) Extract phone numbers (simple FR-style demo)

**Note:** phone regex can get complex internationally; this is a *teaching example*.


In [None]:
text = "Call us: 06 12 34 56 78 or +33 6 12 34 56 78"
pat = r"(?:\+33\s?)?0?6(?:\s?\d{2}){4}"
print(re.findall(pat, text))
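
Extracted numbers often come in several spellings of the same number. Here is a rough sketch of normalizing them with `re.sub`; `normalize_fr_mobile` is a hypothetical helper that assumes its input is one of the two formats shown above.

```python
import re

found = ["06 12 34 56 78", "+33 6 12 34 56 78"]

def normalize_fr_mobile(number):
    """Hypothetical helper: reduce both demo formats to 10 digits starting with 0."""
    digits = re.sub(r"\D", "", number)    # keep digits only (drops spaces and "+")
    return re.sub(r"^33", "0", digits)    # international prefix 33 -> leading 0

normalized = [normalize_fr_mobile(n) for n in found]
print(normalized)
```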

## 7) Bonus: Lookarounds (when you want context but don't want to capture it)

- `(?=...)` positive lookahead
- `(?!...)` negative lookahead
- `(?<=...)` positive lookbehind
- `(?<!...)` negative lookbehind

Example: extract numbers **only** if followed by `€`.


In [None]:
text = "Items: 12€ and 30 dollars and 5€"
print(re.findall(r"\d+(?=€)", text))
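
The cell above shows only the positive lookahead. Here is a short sketch of a lookbehind and a negative lookahead, on a made-up sentence.

```python
import re

text = "Prices: €15, $20, 12€ and 30 dollars"

# Positive lookbehind: digits that come right after "€"
after_euro = re.findall(r"(?<=€)\d+", text)

# Negative lookahead: whole numbers NOT directly followed by "€"
not_before_euro = re.findall(r"\b\d+\b(?!€)", text)

print("after €:", after_euro)
print("not before €:", not_before_euro)
```

The `\b` after `\d+` matters in the second pattern. Without it, the regex could backtrack and match the `1` of `12€`.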

## 8) Mini scraping-style workflow (static HTML string)

In real scraping:
1. Use `requests` to download HTML
2. Use **BeautifulSoup** to select the right elements
3. Use regex to extract/clean inside that text

Here we skip steps 1 and 2 and use a plain string in place of the text BeautifulSoup would return.


In [None]:
page_text = """
Product: Super Mug
Price: 9,99€
Contact: shop@my-store.com
Product ID: SKU-AB1234
"""

email = re.search(r"[\w.-]+@[\w.-]+\.\w+", page_text).group(0)
price = re.search(r"\d+(?:[\.,]\d{2})?\s?€", page_text).group(0)
sku = re.search(r"SKU-[A-Z]{2}\d{4}", page_text).group(0)

print("email:", email)
print("price:", price)
print("sku:", sku)
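
On real pages a field is sometimes missing, and calling `.group(0)` on a failed `re.search` raises `AttributeError`. One way to guard against that is a small wrapper. `first_match` below is a hypothetical helper, not part of `re`, and the shortened `page_text` is made up to show the missing-field case.

```python
import re

def first_match(pattern, text, default=None):
    """Hypothetical helper: return the first match as a string, or `default`."""
    m = re.search(pattern, text)
    return m.group(0) if m else default

# Made-up page text with no SKU line
page_text = "Product: Super Mug\nPrice: 9,99€\nContact: shop@my-store.com"

price = first_match(r"\d+(?:[.,]\d{2})?\s?€", page_text)
sku = first_match(r"SKU-[A-Z]{2}\d{4}", page_text)  # missing on this page

print("price:", price)
print("sku:", sku)
```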

## 9) Exercises

1. Extract all hashtags from a text (e.g. `#PANDAS`, `#WEB_SCRAPING`).
2. Extract all numbers that look like percentages (e.g. `20%`, `50 %`).
3. Replace multiple spaces/newlines with a single space.
4. Extract the domain names from a list of URLs.

### Starter data


In [None]:
text = "This course covers #WEB_SCRAPING, #PANDAS, and #REGEX. Grading: 20% MCQ, 30 % CC, 50% EXAM."
urls = [
    "https://www.wikipedia.org/wiki/Pandas",
    "http://example.com/path/to/page",
    "https://sub.domain.co.uk/index.html",
]
messy = "Hello\n\n   world\t\tthis   is   messy"

### Solutions (uncomment to reveal)


In [None]:
# 1) Hashtags
# print(re.findall(r"#[A-Z_]+", text))

# 2) Percentages (no \b after %: "%" and the following space are both non-word characters)
# print(re.findall(r"\b\d+\s?%", text))

# 3) Normalize whitespace
# print(re.sub(r"\s+", " ", messy).strip())

# 4) Domains from URLs (simple)
# for u in urls:
#     m = re.search(r"https?://([^/]+)", u)
#     print(u, "->", m.group(1) if m else None)

## 10) Summary

- Use BeautifulSoup to select the right chunk of HTML/text
- Use regex to extract patterns (emails, prices, IDs, dates)
- Use `re.sub` to clean messy text
- Watch out for **greedy vs lazy** matching