
# Module 2: Regular Expressions & NLP Preprocessing (Hands‑On)

This notebook accompanies your 1.5-hour lecture. It is **self-contained** (no internet access or large model downloads needed) and covers:

- **Regex basics**: character classes, quantifiers, anchors, groups
- **Pattern matching & extraction**: emails, phones, hashtags, dates, URLs
- **NLP preprocessing** (lightweight): tokenization, lowercasing, punctuation removal, simple stop word filtering, *toy* stemming/lemmatization rules
- **Mini exercises** with answer keys


## 0) Messy Text Sample

In [25]:

messy_text = """
Hey there!  My name’s Anna, I’m from the UK 🇬🇧 and I’ve just bought 3 iPhones for 2,499.99 USD!!!  
Can u believe it?? 😂  Email me at anna_92@example.com or contact@tech-review.co.uk.  
BTW, check out my blog @ https://techstuff.blog or follow me on Twitter #TechLife #AI #Python3. 
Order ID: #2025-00458 | Call me maybe? +44-20-7946-0958 📞  
P.S.  See you on 08/10/2025 😎
"""
print(messy_text)



Hey there!  My name’s Anna, I’m from the UK 🇬🇧 and I’ve just bought 3 iPhones for 2,499.99 USD!!!  
Can u believe it?? 😂  Email me at anna_92@example.com or contact@tech-review.co.uk.  
BTW, check out my blog @ https://techstuff.blog or follow me on Twitter #TechLife #AI #Python3. 
Order ID: #2025-00458 | Call me maybe? +44-20-7946-0958 📞  
P.S.  See you on 08/10/2025 😎




## 1) Regular Expressions — Quick Primer

**Core syntax**  
- Character classes: `[abc]`, `[a-z]`, `\d` (digit), `\w` (word character: letter, digit, or underscore), `\s` (any whitespace, including tabs and newlines)
- Quantifiers: `*` (0+), `+` (1+), `?` (0/1), `{m,n}` (range)
- Anchors: `^` (start), `$` (end), `\b` (word boundary)
- Groups & alternation: `( … )`, `|`

We'll use the Python `re` module.
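
Each item in the syntax list above can be tried on a tiny string before moving to the messy sample. The following is a minimal sketch; the `sample` string and variable names are made up for illustration.

```python
import re

sample = "cat bat 42 cats_99 at"

# Character class + quantifier: one or more digits
digits = re.findall(r"\d+", sample)

# Word boundaries: the standalone word "at", not the "at" inside "cat"
whole_at = re.findall(r"\bat\b", sample)

# Group + alternation: "cat" or "bat", optionally followed by "s"
animals = re.findall(r"\b(?:c|b)ats?", sample)

# Anchor: ^ matches only at the start of the string (without re.MULTILINE)
starts_with_cat = bool(re.search(r"^cat", sample))

print(digits)           # ['42', '99']
print(whole_at)         # ['at']
print(animals)          # ['cat', 'bat', 'cats']
print(starts_with_cat)  # True
```

Note that `\b` treats `_` as a word character, which is why `cats_99` still yields `cats` from the third pattern but would not match `\bcats\b`.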


In [28]:

import re

# A helper to pretty-print matches with their start/end indices
def show_matches(pattern, text, flags=0):
    print(f"Pattern: {pattern!r}\n")
    matches = list(re.finditer(pattern, text, flags))  # scan once, reuse below
    for m in matches:
        span = f"[{m.start()}–{m.end()}]"
        print(span, repr(m.group(0)))  # group(0) is the whole match
    if not matches:
        print("(no matches)")


# Try it out: find every run of ASCII letters (upper or lower case)
show_matches(r"\b[A-Za-z]+\b", messy_text)


Pattern: '\\b[A-Za-z]+\\b'

[1–4] 'Hey'
[5–10] 'there'
[13–15] 'My'
[16–20] 'name'
[21–22] 's'
[23–27] 'Anna'
[29–30] 'I'
[31–32] 'm'
[33–37] 'from'
[38–41] 'the'
[42–44] 'UK'
[48–51] 'and'
[52–53] 'I'
[54–56] 've'
[57–61] 'just'
[62–68] 'bought'
[71–78] 'iPhones'
[79–82] 'for'
[92–95] 'USD'
[101–104] 'Can'
[105–106] 'u'
[107–114] 'believe'
[115–117] 'it'
[123–128] 'Email'
[129–131] 'me'
[132–134] 'at'
[143–150] 'example'
[151–154] 'com'
[155–157] 'or'
[158–165] 'contact'
[166–170] 'tech'
[171–177] 'review'
[178–180] 'co'
[181–183] 'uk'
[187–190] 'BTW'
[192–197] 'check'
[198–201] 'out'
[202–204] 'my'
[205–209] 'blog'
[212–217] 'https'
[220–229] 'techstuff'
[230–234] 'blog'
[235–237] 'or'
[238–244] 'follow'
[245–247] 'me'
[248–250] 'on'
[251–258] 'Twitter'
[260–268] 'TechLife'
[270–272] 'AI'
[284–289] 'Order'
[290–292] 'ID'
[308–312] 'Call'
[313–315] 'me'
[316–321] 'maybe'
[344–345] 'P'
[346–347] 'S'
[350–353] 'See'
[354–357] 'you'
[358–360] 'on'



### Common Patterns You Can Try

- **Emails:** ``[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}``
- **Phone (with country code, flexible):** ``\+?\d{1,3}[-\s]?\d{2,4}[-\s]?\d{3,4}[-\s]?\d{3,4}``
- **Hashtag:** ``#\w+``
- **Dates (day/month/year or with dashes):** ``\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b``
- **URL (basic):** ``https?://[A-Za-z0-9./_-]+``
- **Numbers (ints/decimals):** ``\d+(?:\.\d+)?``
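
One detail matters when these patterns are passed to `re.findall`: if the pattern contains a capturing group, `findall` returns the group contents instead of the full match. The short sketch below (with an illustrative `text` string) shows the difference between `( … )` and the non-capturing `(?: … )`.

```python
import re

text = "Total: 499.99 for 3 items"

# Capturing group: findall returns only what the group captured
with_capture = re.findall(r"\d+(\.\d+)?", text)

# Non-capturing group: findall returns the whole match
without_capture = re.findall(r"\d+(?:\.\d+)?", text)

print(with_capture)     # ['.99', '']
print(without_capture)  # ['499.99', '3']
```

The empty string in the first result comes from `3`, where the optional group did not participate in the match.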


### Extraction Examples

In [29]:

patterns = {
    "email": r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}",
    "phone": r"\+?\d{1,3}[-\s]?\d{2,4}[-\s]?\d{3,4}[-\s]?\d{3,4}",
    "hashtag": r"#\w+",
    "date_dmy": r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b",
    "url": r"https?://[A-Za-z0-9./_-]+",
    "numbers": r"\d+(?:\.\d+)?"
}

for label, pat in patterns.items():
    print(f"\n--- {label.upper()} ---")
    print(re.findall(pat, messy_text))



--- EMAIL ---
['anna_92@example.com', 'contact@tech-review.co.uk']

--- PHONE ---
['+44-20-7946-0958']

--- HASHTAG ---
['#TechLife', '#AI', '#Python3', '#2025']

--- DATE_DMY ---
['44-20-7946', '08/10/2025']

--- URL ---
['https://techstuff.blog']

--- NUMBERS ---
['3', '2', '499.99', '92', '3', '2025', '00458', '44', '20', '7946', '0958', '08', '10', '2025']
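
Two of the results above are false positives: `#2025` is the start of the order ID, not a hashtag, and `44-20-7946` is a fragment of the phone number, not a date. One possible way to tighten the patterns is with lookarounds, sketched here on a shortened sample. The names `strict_hashtag` and `strict_date` are illustrative, and the rules are assumptions about what should count as a hashtag or a date in this text.

```python
import re

# Shortened sample containing the fragments that caused false positives
sample = "Order ID: #2025-00458 | Call +44-20-7946-0958 | #Python3 #AI | See you on 08/10/2025"

# Hashtag: require at least one letter, so a purely numeric ID is skipped
strict_hashtag = r"#(?=\w*[A-Za-z])\w+"

# Date: the same separator on both sides (\1), and no digit or dash touching the match
strict_date = r"(?<![\d-])\d{1,2}([/-])\d{1,2}\1\d{2,4}(?![\d-])"

hashtags = re.findall(strict_hashtag, sample)
# finditer + group(0), because the date pattern contains a capturing group
dates = [m.group(0) for m in re.finditer(strict_date, sample)]

print(hashtags)  # ['#Python3', '#AI']
print(dates)     # ['08/10/2025']
```

These patterns trade recall for precision: a genuinely numeric hashtag such as `#2025` would now be missed, so the right choice depends on the data.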



## 2) Mini Regex Exercises (Your Turn)

Using `messy_text`:

1. **Find all hashtags**.  
2. **Extract the date** (day/month/year).  
3. **Extract all email addresses**.  
4. **Extract all numbers, including decimals**.  
5. **Replace multiple spaces with a single space**.  

> ✍️ Write your code in the cell below. (Hints above 👆)



## 3) NLP Preprocessing (Lightweight, No External Downloads)

We'll implement a **simple pipeline** with only Python & regex:
- Tokenization (keep runs of ASCII letters; digits, punctuation, and emoji are dropped)
- Normalization (lowercasing)
- Stop word removal (tiny built-in list)
- *Toy* stemming/lemmatization (rule-based, just to illustrate the concept)

> For real projects, replace this section with spaCy or NLTK pipelines.


In [16]:

import re
from collections import Counter

# 1) Tokenization: keep alphabetic sequences, drop punctuation & digits
def tokenize(text):
    return re.findall(r"[A-Za-z]+", text)

# 2) Normalize: lowercase all tokens
def normalize(tokens):
    return [t.lower() for t in tokens]

# 3) Stop words: small illustrative set (extend as needed)
STOP_WORDS = {
    'a','an','the','and','or','but','if','then','else','for','of','on','in','to','is','are','am','i','you','he','she',
    'it','we','they','me','my','our','your','their','this','that','these','those','be','been','was','were','with','as',
    'at','by','from','so','than'
}

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

# 4) Toy stemmer/lemmatizer: extremely simple rules (illustrative only)
def toy_lemmatize(token):
    # very naive rules, for classroom demonstration
    if token.endswith('ies') and len(token) > 4:
        return token[:-3] + 'y'     # studies -> study
    if token.endswith('sses') and len(token) > 5:
        return token[:-2]           # classes -> class
    if token.endswith('s') and len(token) > 3:
        return token[:-1]           # cats -> cat
    if token.endswith('ing') and len(token) > 5:
        return token[:-3]           # running -> runn (doubled consonant is not undone)
    if token.endswith('ed') and len(token) > 4:
        return token[:-2]           # chased -> chas (the silent 'e' is not restored)
    return token

def toy_stem(token):
    # even simpler: chop common suffixes
    for suf in ('ing','ed','ly','es','s'):
        if token.endswith(suf) and len(token) > len(suf)+2:
            return token[:-len(suf)]
    return token

def pipeline(text, use_lemma=True):
    tokens = tokenize(text)
    tokens = normalize(tokens)
    filtered = remove_stopwords(tokens)
    if use_lemma:
        processed = [toy_lemmatize(t) for t in filtered]
    else:
        processed = [toy_stem(t) for t in filtered]
    return processed

demo_tokens = pipeline(messy_text, use_lemma=True)
print(demo_tokens[:40])
print("Token count:", len(demo_tokens))
print("Top 10:", Counter(demo_tokens).most_common(10))


['hey', 'there', 'name', 's', 'anna', 'm', 'uk', 've', 'just', 'bought', 'iphone', 'usd', 'can', 'u', 'believe', 'email', 'anna', 'example', 'com', 'contact', 'tech', 'review', 'co', 'uk', 'btw', 'check', 'out', 'blog', 'http', 'techstuff', 'blog', 'follow', 'twitter', 'techlife', 'ai', 'python', 'order', 'id', 'call', 'maybe']
Token count: 43
Top 10: [('s', 2), ('anna', 2), ('uk', 2), ('blog', 2), ('hey', 1), ('there', 1), ('name', 1), ('m', 1), ('ve', 1), ('just', 1)]
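
The stray tokens `s`, `m`, and `ve` in the output come from contractions (`name’s`, `I’m`, `I’ve`) being split at the apostrophe. A possible alternative tokenizer that keeps contractions together is sketched below. `tokenize_keep_contractions` is a hypothetical name, not part of the pipeline above, and it handles only the straight and curly apostrophe characters.

```python
import re

# Hypothetical variant of tokenize(): letter runs, optionally joined by ' or ’
CONTRACTION_TOKEN = re.compile(r"[A-Za-z]+(?:['’][A-Za-z]+)*")

def tokenize_keep_contractions(text):
    """Return alphabetic tokens, keeping contractions such as I’m in one piece."""
    return CONTRACTION_TOKEN.findall(text)

sample = "My name’s Anna, I’m from the UK and I’ve just bought 3 iPhones"
tokens = tokenize_keep_contractions(sample)
print(tokens)
# ['My', 'name’s', 'Anna', 'I’m', 'from', 'the', 'UK', 'and', 'I’ve', 'just', 'bought', 'iPhones']
```

If this tokenizer were swapped in, the stop word list would need entries such as `i’m` and `i’ve`, since the lowercase contractions no longer match `i`.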


In [23]:

# TODO: Your solutions here
import re

hashtags = re.findall(r"#\w+", messy_text)
date = re.findall(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b", messy_text)
emails = re.findall(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}", messy_text)
numbers = re.findall(r"\d+(?:\.\d+)?", messy_text)
single_spaced = re.sub(r"\s{2,}", " ", messy_text)  # collapse runs of 2+ whitespace characters (newlines included) into one space

print("hashtags:", hashtags)
print("date:", date)
print("emails:", emails)
print("numbers:", numbers[:10], "... (showing first 10 if many)")
print("\n--- Single-spaced preview ---\n", single_spaced[:250], "...")


hashtags: ['#TechLife', '#AI', '#Python3', '#2025']
date: ['44-20-7946', '08/10/2025']
emails: ['anna_92@example.com', 'contact@tech-review.co.uk']
numbers: ['3', '2', '499.99', '92', '3', '2025', '00458', '44', '20', '7946'] ... (showing first 10 if many)

--- Single-spaced preview ---
 
Hey there! My name’s Anna, I’m from the UK 🇬🇧 and I’ve just bought 3 iPhones for 2,499.99 USD!!! Can u believe it?? 😂 Email me at anna_92@example.com or contact@tech-review.co.uk. BTW, check out my blog @ https://techstuff.blog or follow me on Twitt ...



### Compare Toy Stemming vs. Toy Lemmatization


In [21]:

sample_tokens = ["studies", "running", "cats", "chased", "happily", "classes", "buses", "flying", "easily", "tries"]

print("Lemma head:", [toy_lemmatize(token) for token in sample_tokens])
print("Stem  head:", [toy_stem(token) for token in sample_tokens])


Lemma head: ['study', 'runn', 'cat', 'chas', 'happily', 'class', 'buse', 'fly', 'easily', 'try']
Stem  head: ['studi', 'runn', 'cat', 'chas', 'happi', 'class', 'bus', 'fly', 'easi', 'tri']
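
Both toy functions turn `running` into `runn` because they strip the suffix without undoing the doubled consonant. One small improvement, similar in spirit to the undoubling rule used by classic suffix-stripping stemmers such as Porter's, is sketched below. `undouble` and `toy_lemmatize_v2` are hypothetical names, and the rule set is deliberately incomplete.

```python
def undouble(stem):
    """Drop one of two identical trailing consonants, keeping ll, ss, and zz."""
    if len(stem) >= 3 and stem[-1] == stem[-2] and stem[-1] not in "aeiouslz":
        return stem[:-1]
    return stem

def toy_lemmatize_v2(token):
    """Strip -ing / -ed, then undo a doubled final consonant (illustrative only)."""
    for suffix in ("ing", "ed"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return undouble(token[: -len(suffix)])
    return token

words = ["running", "stopped", "falling", "passed", "flying", "chased"]
results = [toy_lemmatize_v2(w) for w in words]
print(results)  # ['run', 'stop', 'fall', 'pass', 'fly', 'chas']
```

`chased` still comes out as `chas`: restoring a silent `e` needs either more rules or a dictionary lookup, which is what real lemmatizers rely on.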



## 4) Combining Regex with NLP Cleaning

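Example: extract **(number, following-word)** pairs from text after light cleaning. The pattern is deliberately simple: it knows nothing about thousands separators or decimal points, so `2,499.99 USD` yields `('99', 'USD')`.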


In [24]:

clean_text = re.sub(r"\s+", " ", messy_text)  # normalize whitespace
pairs = re.findall(r"(\d+)\s+([A-Za-z]+)", clean_text)
print(pairs)


[('3', 'iPhones'), ('99', 'USD')]
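
The pair `('99', 'USD')` is only the decimal tail of `2,499.99`. A possible refinement, assuming numbers use `,` as the thousands separator and `.` as the decimal point, is sketched below on a shortened sample. The `number` sub-pattern and variable names are illustrative.

```python
import re

sample = "I bought 3 iPhones for 2,499.99 USD"

# Number: digits with ,ddd groups and an optional decimal part, or a plain int/decimal
number = r"\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?"

# (?<![\d.,]) prevents a match from starting in the middle of a number
pairs = re.findall(rf"(?<![\d.,])({number})\s+([A-Za-z]+)", sample)
print(pairs)  # [('3', 'iPhones'), ('2,499.99', 'USD')]

# Convert the captured strings to floats by dropping the thousands separators
values = [(float(num.replace(",", "")), word) for num, word in pairs]
print(values)  # [(3.0, 'iPhones'), (2499.99, 'USD')]
```

Locales that swap the roles of `,` and `.` (for example `2.499,99`) would need a different `number` pattern.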
