# What is regex?

Regex (regular expressions) is a **text pattern-matching language**.

Example:
```r"<h1>.*?</h1>"```

Regex:
* Works on plain text
* Looks for patterns, not structure
* Does not understand HTML semantics
* Is executed by a programming language (Python, JS, etc.)

Regex is about **finding patterns in strings**.
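As a minimal sketch of the points above, the example pattern can be run through Python's `re` module. The `snippet` string is a made-up input for illustration, and the second pattern is only there to show what the `?` in `.*?` changes.

```python
import re

# Hypothetical input, not the page used later in this notebook
snippet = "<h1>Shop</h1><h1>Sale</h1>"

# Non-greedy .*? stops at the first closing tag it meets
lazy = re.findall(r"<h1>.*?</h1>", snippet)

# Greedy .* runs on to the last closing tag in the string
greedy = re.findall(r"<h1>.*</h1>", snippet)

print(lazy)    # two separate headings
print(greedy)  # one match spanning both headings
```

Regex treats the tags as plain characters here; it has no idea that `<h1>` opens an element.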

| Concept           | Role                          |
| ----------------- | ----------------------------- |
| **HTML**          | Structure the page            |
| **CSS**           | Style the page                |
| **BeautifulSoup** | Navigate HTML structure       |
| **Regex**         | Extract / clean text patterns |
| **Pandas**        | Analyze structured data       |


In [2]:
html = """
<html>
  <body>
    <h1>Shop</h1>

    <a href="/product/123" class="product" data-price="9,99€">Super Mug</a>
    <a href="/product/987" class="product" data-price="12.50€">Tea Cup</a>
    <a href="https://example.com/help" class="nav">Help</a>

    <img src="mug.jpg" alt="Photo of a red mug" class="thumb">
    <img src="cup.jpg" alt="Tea cup on a table" class="thumb">

    <p class="note">Promo ends 20/01/2026</p>
  </body>
</html>
"""

## Most used functions

- `re.search(pattern, text)` → finds the **first** match anywhere
- `re.findall(pattern, text)` → returns **all** matches
- `re.sub(pattern, repl, text)` → **replaces** matches
- `re.split(pattern, text)` → splits text by a pattern
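`re.split` is the only function in this list without a worked cell in this section, so here is a small sketch. The `raw` string and its separators are assumptions chosen for illustration.

```python
import re

# Hypothetical messy string with inconsistent separators
raw = "mug; cup ,plate;bowl"

# Split on a comma or semicolon, swallowing any whitespace around it
items = re.split(r"\s*[;,]\s*", raw)
print(items)
```

A plain `str.split(";")` would leave the stray spaces and miss the comma, which is where a pattern-based split helps.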


In [4]:
from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(html, "html.parser")

# Example: get all visible text from the page
page_text = soup.get_text(" ", strip=True)
page_text

'Shop Super Mug Tea Cup Help Promo ends 20/01/2026'

| Function     | Returns                                                  | Typical use            |
| ------------ | -------------------------------------------------------- | ---------------------- |
| `re.search`  | One `Match` object, or `None` if nothing matches         | Find first occurrence  |
| `re.findall` | List of strings (tuples if the pattern has several capture groups) | Extract many values    |
| `re.sub`     | Modified string                                          | Clean / normalize text |
| `re.split`   | List of strings                                          | Break messy text       |


In [5]:
# re.search: find the first date-like string (2 digits / 2 digits / 4 digits)

pattern = r"\b\d{2}/\d{2}/\d{4}\b"

match = re.search(pattern, page_text)
match.group(0)


'20/01/2026'

* Scans the whole text
* Stops at the first match
* Returns a `Match` object, or `None` if nothing matches
* Use `.group()` to extract the matched text
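Calling `.group()` directly raises an `AttributeError` when `re.search` finds nothing, because the result is then `None`. One way to guard against that is sketched below; the helper name `first_match` is hypothetical and the input strings are illustrative.

```python
import re

date_pattern = r"\b\d{2}/\d{2}/\d{4}\b"

def first_match(pattern, text):
    """Return the first matched string, or None when nothing matches."""
    match = re.search(pattern, text)
    return match.group(0) if match else None

found = first_match(date_pattern, "Promo ends 20/01/2026")
missing = first_match(date_pattern, "No date here")

print(found)
print(missing)
```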

In [6]:
# re.findall: collect every price-like string (digits, optional decimals, €)
pattern = r"\d+(?:[.,]\d{2})?\s?€"

prices = re.findall(pattern, page_text)
prices

[]

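* Returns a list
* No `Match` object, just the matched strings
* Most common choice for scraping lists of data
* Returns `[]` when nothing matches, as here: the prices sit in `data-price` attributes, which `get_text()` does not include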
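To reach the attribute values, one option is to run the same pattern on the raw markup instead of the visible text. This is a minimal sketch, with the two product links copied from the sample page into a separate `html_fragment` variable so the block runs on its own.

```python
import re

# Two product links copied from the sample page
html_fragment = '''
<a href="/product/123" class="product" data-price="9,99€">Super Mug</a>
<a href="/product/987" class="product" data-price="12.50€">Tea Cup</a>
'''

price_pattern = r"\d+(?:[.,]\d{2})?\s?€"

# The product IDs in href are skipped because no € follows them
attr_prices = re.findall(price_pattern, html_fragment)
print(attr_prices)
```

On a real page, reading the attribute from each tag with BeautifulSoup and then applying the regex to that value is usually the more robust route, since the pattern then never sees unrelated markup.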

In [7]:
# re.sub: collapse any run of whitespace into a single space
clean_text = re.sub(r"\s+", " ", page_text)
clean_text

'Shop Super Mug Tea Cup Help Promo ends 20/01/2026'

In [8]:
# re.sub: turn decimal commas into dots (e.g. 9,99 -> 9.99)
normalized = re.sub(r"(\d+),(\d{2})", r"\1.\2", page_text)
normalized

'Shop Super Mug Tea Cup Help Promo ends 20/01/2026'

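* Replaces every match
* Leaves the text unchanged when nothing matches (both outputs above are identical to `page_text`, which has no extra whitespace and no decimal commas)
* Extremely common in data cleaning
* Often used before `pd.to_numeric()`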
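The clean-then-convert step can be sketched as follows. It uses the built-in `float()` in place of `pd.to_numeric()` so the block needs only the standard library; the helper name `to_float` is hypothetical, and the input values are the two `data-price` strings from the sample page.

```python
import re

# Price strings as they appear in the data-price attributes
raw_prices = ["9,99€", "12.50€"]

def to_float(price):
    """Swap a decimal comma for a dot, drop the currency sign, convert."""
    with_dot = re.sub(r"(\d+),(\d{2})", r"\1.\2", price)
    digits_only = re.sub(r"[^\d.]", "", with_dot)
    return float(digits_only)

values = [to_float(p) for p in raw_prices]
print(values)
```

This sketch assumes the comma is always a decimal separator; strings with thousands separators such as `1,234.50€` would need a different rule.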

## Mini exercise
For the following text:
* extract the first price
* extract all prices
* replace decimal commas with dots

In [9]:
text = "Promo: 9,99€ | Old price: 12.50€ | Ends 20/01/2026"

In [None]:
# Extract the first price
first_price = re.search(r"\d+(?:[.,]\d{2})?\s?€", text).group(0)
first_price

In [None]:
# Extract all prices
all_prices = re.findall(r"\d+(?:[.,]\d{2})?\s?€", text)
all_prices

In [None]:
# Replace commas with dots

normalized_text = re.sub(r"(\d+),(\d{2})", r"\1.\2", text)
normalized_text