# Advanced Regular Expressions
In the world of data processing and text analysis, regular expressions (regex) are a versatile tool that can slice, extract, validate, and even reassemble pieces of information from raw data. Many people know them as simple pattern-finding tools, but in reality, regex has a much deeper layer of capabilities, from lookaround assertions and named groups to nested matching with recursion.

# Library and Dataset Preparation

## Import library

In [3]:
import re
import regex  # third-party module with recursion support (pip install regex)
import pandas as pd
from datetime import datetime
from dateutil import parser as dateparser
import unicodedata

## Sample Dataset

In [4]:
server_logs = '''127.0.0.1 - - [10/Oct/2023:13:55:36 +0700] "GET /index.html HTTP/1.1" 200 1024 "-" "Mozilla/5.0"
192.168.1.5 - john [11/Oct/2023:08:20:10 +0700] "POST /api/v1/login HTTP/1.1" 500 512 "-" "curl/7.64"
ERROR [2023-10-11 08:20:11] Traceback (most recent call last):\n  File "app.py", line 23, in <module>\n    main()\n'''

mixed_notes = '''Contact: Andra <andra@example.com>, tel:+62-812-3456-7890. Invoice #INV-2023/09/0001 Rp 1.250.000. Due 2023-10-20.
Big sale: 50% off! Visit https://shop.example.com/deals?id=123&src=email
DOB: 12 Aug 1990, alternative: 1990/08/12'''

html_snippets = '''<div><p>Hello <span>World <b>bold</b></span></p><div>Nested <span>deep <i>italic</i></span></div></div>'''

# Parsing Server Logs
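Captures the IP address, user, timestamp, HTTP method, path, status code, and related fields from Apache/Nginx access-log lines. Named groups map each captured field directly to a column name. Lines that do not fit the pattern, such as the ERROR traceback in the sample, are skipped by finditer.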

In [5]:
log_pattern = re.compile(r'''
    (?P<ip>\d{1,3}(?:\.\d{1,3}){3})      # IP address
    \s+-\s+                              # literal " - " separator
    (?P<user>[^\s\[]+)                   # username (no space or '[')
    \s+\[(?P<timestamp>[^\]]+)\]         # timestamp inside [ ]
    \s+"(?P<method>GET|POST|PUT|DELETE|PATCH|OPTIONS)  # HTTP method
    \s+(?P<path>[^\s]+)                  # URL path
    \s+(?P<proto>HTTP/\d\.\d)"           # HTTP protocol version
    \s+(?P<status>\d{3})                 # 3-digit status code
    \s+(?P<size>\d+|-)                   # response size (number or '-')
    \s+"(?P<ref>.*?)"                    # referer inside quotes
    \s+"(?P<agent>.*?)"                  # user-agent inside quotes
    ''', re.VERBOSE)

rows = []
for m in log_pattern.finditer(server_logs):
    d = m.groupdict()  # regex match result as a dictionary
    try:
        d['timestamp'] = dateparser.parse(d['timestamp'])  # convert to datetime
    except Exception:
        pass  # parsing failed: keep the original string
    rows.append(d)

df_logs = pd.DataFrame(rows)
print(df_logs)


            ip  user                   timestamp method           path  \
0    127.0.0.1     -  10/Oct/2023:13:55:36 +0700    GET    /index.html   
1  192.168.1.5  john  11/Oct/2023:08:20:10 +0700   POST  /api/v1/login   

      proto status  size ref        agent  
0  HTTP/1.1    200  1024   -  Mozilla/5.0  
1  HTTP/1.1    500   512   -    curl/7.64  


Meaning:
* ip → client address.
* user → username (or - if missing).
* timestamp → request time.
* method → HTTP method (GET, POST).
* path → requested URL path.
* proto → HTTP protocol.
* status → server status code.
* size → response size in bytes.
* ref → referer (a - means none was sent).
* agent → user-agent string identifying the client software.
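
In the printed DataFrame the timestamp column still holds the raw strings, which suggests that `dateparser.parse` did not recognize the `10/Oct/2023:13:55:36 +0700` layout and the fallback branch ran. A minimal sketch of an explicit alternative with `datetime.strptime` follows. The names `APACHE_TS_FORMAT` and `parse_apache_timestamp` are hypothetical, and `%b` assumes an English locale for the month abbreviation.

```python
from datetime import datetime

# Layout of the access-log timestamp: day/month-abbreviation/year:time offset
APACHE_TS_FORMAT = '%d/%b/%Y:%H:%M:%S %z'

def parse_apache_timestamp(raw):
    """Return a timezone-aware datetime, or None if the layout does not match."""
    try:
        return datetime.strptime(raw, APACHE_TS_FORMAT)
    except ValueError:
        return None

ts = parse_apache_timestamp('10/Oct/2023:13:55:36 +0700')
print(ts.isoformat())  # 2023-10-10T13:55:36+07:00
```

Returning None instead of the raw string keeps the column type uniform, at the cost of losing the original text for rows that fail.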

# Advanced Text Cleaning and Normalization
Removes invisible characters (zero-width spaces, soft hyphens), normalizes Unicode compatibility and combining forms with NFKC, and uses lookbehind/lookahead assertions to strip stray punctuation.

In [6]:
def normalize_text(s):
    s = unicodedata.normalize('NFKC', s)
    s = re.sub('[\u200B\u200C\u200D\uFEFF]', '', s)
    s = s.replace('\u00AD', '')  # soft hyphen
    return s

print(normalize_text('Andra\u200Bexample — café'))

Andraexample — café


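Description:
* 'NFKC' → Normalization Form Compatibility Composition: compatibility characters (ligatures, full-width forms) are replaced by their standard equivalents, and base letters are composed with their combining marks.
* The character class passed to re.sub removes four invisible code points:
    * \u200B (zero-width space)
    * \u200C (zero-width non-joiner)
    * \u200D (zero-width joiner)
    * \uFEFF (zero-width no-break space, also used as the byte order mark)
* s.replace('\u00AD', '') → removes soft hyphens (\u00AD), invisible hyphenation hints that are rendered only where a word breaks at the end of a line.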
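
To make the effect of NFKC concrete, here is a small sketch using only the standard library. The sample strings are illustrative choices, not taken from the dataset above.

```python
import unicodedata

samples = {
    'ligature': '\ufb01le',        # "file" written with the single-character fi ligature
    'full-width': '\uff21\uff11',  # full-width "A1"
    'combining': 'cafe\u0301',     # "e" followed by a combining acute accent
}

normalized = {}
for label, raw in samples.items():
    normalized[label] = unicodedata.normalize('NFKC', raw)
    print(f'{label}: {raw!r} -> {normalized[label]!r} '
          f'(length {len(raw)} -> {len(normalized[label])})')
```

The ligature grows from 3 to 4 characters while the combining sequence shrinks from 5 to 4, which is why length checks should run after normalization.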

## Simple lookaround example

In [7]:
s = 'Hello. world. 3.14 not removed.'
s2 = re.sub(r'(?<=\w)\.(?=\s+[a-z])', '', s)
print(s2)

Hello world. 3.14 not removed.


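Note: the period is removed only when both conditions hold:
* A word character (letter, digit, or underscore) comes directly before it: (?<=\w).
* Whitespace followed by a lowercase letter comes directly after it: (?=\s+[a-z]).

Both assertions are zero-width, so only the period itself is replaced. The period in "world." stays because a digit follows the space, and the one in "3.14" stays because no whitespace follows it.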
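
The sketch below shows where the lookaround pattern actually matches, and contrasts it with negative lookarounds. The second pattern is an illustrative variation, not part of the original cell.

```python
import re

text = 'Hello. world. 3.14 not removed.'

# Positive lookbehind + positive lookahead, as in the cell above.
pattern = re.compile(r'(?<=\w)\.(?=\s+[a-z])')
hits = [(m.start(), m.group()) for m in pattern.finditer(text)]
print(hits)  # [(5, '.')]

# Negative lookarounds: drop every period that is not inside a decimal number.
no_sentence_dots = re.sub(r'(?<!\d)\.(?!\d)', '', text)
print(no_sentence_dots)  # Hello world 3.14 not removed
```

Each match is a single character because the assertions consume nothing. The surrounding text only decides whether the period qualifies.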

# Named Entity Extraction without an NLP Library
Uses regular expressions to find emails, URLs, phone numbers, prices, and dates in raw text, with named groups giving each result a clean structure.

In [8]:
EMAIL = r'(?P<email>[\w.+-]+@[\w-]+\.[\w.-]+)'
URL = r'(?P<url>https?://[\w./?=&%#-]+)'
PHONE = r'(?P<phone>\+?\d{1,3}[\-\s]?\d{1,4}[\-\s]?\d{3,4}[\-\s]?\d{3,4})'
PRICE = r'(?P<price>Rp\s?[\d.,]+|\$\s?[\d,]+(?:\.\d{2})?)'
DATE = r'(?P<date>\b\d{1,2}[\-\s\/][A-Za-z0-9]{2,4}[\-\s\/]\d{2,4}\b|\b\d{4}-\d{2}-\d{2}\b)'

combined = re.compile('|'.join([EMAIL, URL, PHONE, PRICE, DATE]))

for m in combined.finditer(mixed_notes):
    print(m.groupdict())

{'email': 'andra@example.com', 'url': None, 'phone': None, 'price': None, 'date': None}
{'email': None, 'url': None, 'phone': '+62-812-3456-7890', 'price': None, 'date': None}
{'email': None, 'url': None, 'phone': None, 'price': 'Rp 1.250.000.', 'date': None}
{'email': None, 'url': None, 'phone': None, 'price': None, 'date': '2023-10-20'}
{'email': None, 'url': 'https://shop.example.com/deals?id=123&src=email', 'phone': None, 'price': None, 'date': None}
{'email': None, 'url': None, 'phone': None, 'price': None, 'date': '12 Aug 1990'}


Description:
1. Email
    * [\w.+-]+ → the local part before the @: letters, digits, underscores, periods, plus signs, or hyphens.
    * @ → separator.
    * [\w-]+ → domain name.
    * \. → period before the domain extension.
    * [\w.-]+ → domain extension (e.g., .com, .co.id).
2. URL
    * https?:// → http:// or https://.
    * [\w./?=&%#-]+ → URL content, can contain letters, numbers, periods, slashes, query strings, etc.
3. Phone
    * \+?\d{1,3} → optional country code (+62).
    * [\-\s]? → may contain a minus sign or space.
    * Then the numbers in the general telephone number format.
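4. Price
    * Can be Rp or $ followed by a number.
    * [\d.,]+ → digits, periods, and commas. Because a period is allowed anywhere, the sentence-ending period is captured as well, which is why the output shows 'Rp 1.250.000.'.
    * (?:\.\d{2})? → optional two decimal places for dollar prices.
5. Date
    * Matches DD-MM-YYYY or DD Mon YYYY (separated by hyphens, spaces, or slashes), or YYYY-MM-DD.
    * 1990/08/12 in the sample is not matched: the year-first branch accepts only hyphens.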
6. Why so many Nones?\
Each match comes from exactly one alternative of the combined pattern, so the other named groups are None.
7. What is this for?\
Very useful for extracting entities from raw text (e.g., parsing incoming emails, meeting notes, or free text in web scraping).
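
When only the matching entity is of interest, `m.lastgroup` names the group that matched, which avoids the dictionaries full of None. The sketch below is one possible variation: `TOKENS` and the shortened sample text are hypothetical, and the price branch is tightened (an assumption about the intended behavior) so that it must end on a digit and no longer swallows the trailing period.

```python
import re

text = 'Contact: Andra <andra@example.com>. Invoice Rp 1.250.000. Due 2023-10-20.'

TOKENS = re.compile(r'''
    (?P<email>[\w.+-]+@[\w-]+\.[\w.-]+)
  | (?P<price>Rp\s?\d+(?:[.,]\d+)*)      # separators only between digit runs
  | (?P<date>\b\d{4}-\d{2}-\d{2}\b)
''', re.VERBOSE)

# lastgroup is the name of the alternative that produced the match
entities = [(m.lastgroup, m.group()) for m in TOKENS.finditer(text)]
for kind, value in entities:
    print(kind, '->', value)
```

This relies on every alternative being exactly one named group with no capturing groups inside it, so the last matched group is always the entity type.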

# Nested Pattern Matching (recursive with the regex module)
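Handles nested structures, here nested HTML tags, much as a programming-language parser handles nested parentheses, using the third-party regex module, which supports recursive matching with (?R). Note that in the output below the match ends at the first </div>, so the outer <div> is paired with the inner closing tag. The pattern allows a lazy .*? as an alternative to the recursion, and that alternative can stop at the first closing tag of the same name, so same-named tags are not balanced.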

In [9]:
pattern = regex.compile(r'(?P<tag><(?P<name>\w+)(?:\s[^>]*)?>)(?:(?:(?R))|.*?)(?P<close></(?P=name)>)', flags=regex.DOTALL)

m = pattern.search(html_snippets)
if m:
    print('Matched block:')
    print(m.group(0))

Matched block:
<div><p>Hello <span>World <b>bold</b></span></p><div>Nested <span>deep <i>italic</i></span></div>
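
For comparison, balanced delimiters can also be found without recursion by tracking nesting depth. This is a swapped-in technique using plain Python, not the regex module's `(?R)`. `balanced_groups` is a hypothetical helper that returns only the outermost groups.

```python
def balanced_groups(text, open_ch='(', close_ch=')'):
    """Return the outermost balanced delimiter groups found in text."""
    groups, depth, start = [], 0, None
    for i, ch in enumerate(text):
        if ch == open_ch:
            if depth == 0:
                start = i          # remember where the outermost group begins
            depth += 1
        elif ch == close_ch and depth > 0:
            depth -= 1
            if depth == 0:         # the outermost group just closed
                groups.append(text[start:i + 1])
    return groups

result = balanced_groups('f(a(b)c) + g(d) + h(unclosed')
print(result)  # ['(a(b)c)', '(d)']
```

An opening delimiter that never closes is simply dropped, which makes the helper handy for cross-checking what a recursive pattern returns.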


# Password validation

In [10]:
def validate_password(pw, username=None):
    if len(pw) < 12:
        return False
    checks = [r'[A-Z]', r'[a-z]', r'\d', r'[^A-Za-z0-9]']
    if not all(re.search(c, pw) for c in checks):
        return False
    if username and username.lower() in pw.lower():
        return False
    return True

tests = [
    ('StrongPass!2023', 'andra'),
    ('Short1!', 'andra'),
    ('strongpassword2023', 'andra'),
    ('StrongPassword2023', 'andra'),
    ('StrongPass!word', 'andra'),
    ('StrongPass!2023', 'StrongPass!2023'),
    ('SuperSecure@99', 'andra'),
    ('Password!1234', 'pass'),
    ('Qwerty!Qwerty1', 'andra')
]

for pw, user in tests:
    print(pw, '->', validate_password(pw, username=user))

StrongPass!2023 -> True
Short1! -> False
strongpassword2023 -> False
StrongPassword2023 -> False
StrongPass!word -> False
StrongPass!2023 -> False
SuperSecure@99 -> True
Password!1234 -> False
Qwerty!Qwerty1 -> True


Description:
| Password | Username | Reason Valid / Invalid |
| -------------------- | ----------------- | ---------------------------------------------------------------------------------------------- |
| `StrongPass!2023` | `andra` | ✅ **Valid**: length 15, contains uppercase and lowercase letters, numbers, symbols, and does not contain "andra". |
| `Short1!` | `andra` | ❌ **Invalid**: fewer than 12 characters. |
| `strongpassword2023` | `andra` | ❌ **Invalid**: no uppercase letters, no symbols. |
| `StrongPassword2023` | `andra` | ❌ **Invalid**: no symbols. |
| `StrongPass!word` | `andra` | ❌ **Invalid**: no numbers. |
| `StrongPass!2023` | `StrongPass!2023` | ❌ **Invalid**: the password is identical to the username. |
| `SuperSecure@99` | `andra` | ✅ **Valid**: length 14, contains uppercase and lowercase letters, numbers, symbols, and does not contain "andra". |
| `Password!1234` | `pass` | ❌ **Invalid**: username `pass` appears in the password (case-insensitive). |
| `Qwerty!Qwerty1` | `andra` | ✅ **Valid**: length 14, all criteria met. |
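
The same character-class rules can be expressed as one pattern with lookahead assertions, which ties this section back to the lookaround material. This is a sketch of an equivalent formulation, not the author's implementation. `PASSWORD_RE` and `validate_password_regex` are hypothetical names.

```python
import re

PASSWORD_RE = re.compile(r'''
    ^(?=.*[A-Z])          # at least one uppercase letter
     (?=.*[a-z])          # at least one lowercase letter
     (?=.*\d)             # at least one digit
     (?=.*[^A-Za-z0-9])   # at least one symbol
     .{12,}$              # at least 12 characters in total
''', re.VERBOSE)

def validate_password_regex(pw, username=None):
    """Apply the lookahead pattern, then the username check."""
    if not PASSWORD_RE.match(pw):
        return False
    return not (username and username.lower() in pw.lower())

for pw, user in [('StrongPass!2023', 'andra'), ('StrongPass!word', 'andra'),
                 ('Password!1234', 'pass')]:
    print(pw, '->', validate_password_regex(pw, username=user))
```

Each lookahead scans from the start of the string without consuming anything, so the four requirements are checked independently before `.{12,}` enforces the length.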

# Luhn Check

In [11]:
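# Find a card number with 13 to 19 digits in total; digits may be separated by spaces or "-"
cc_re = re.compile(r'(?P<number>(?:\d[ -]*?){13,19})')

def luhn_check(number):
    # Keep only the digits and process them from right to left
    digits = [int(d) for d in re.sub(r'[^0-9]', '', number)][::-1]
    total = 0
    for i, d in enumerate(digits):
        if i % 2 == 1:      # double every second digit, counting from the right
            d = d * 2
            if d > 9:       # same as adding the two digits of the product
                d -= 9
        total += d
    return total % 10 == 0  # valid when the sum is a multiple of 10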

cards = [
    '4111 1111 1111 1111',
    '5500-0000-0000-0004',
    '378282246310005',
    '6011111111111117',
    '4111 1111 1111 1112',
    '1234 5678 9012 3456',
    '4111 1111 111',
    '4111 1111 1111 1111 1111'
]

for c in cards:
    found = cc_re.fullmatch(c)  # the whole string must be 13-19 digits, so longer numbers are rejected
    if found:
        print(c, '->', luhn_check(found.group('number')))
    else:
        print(c, '->', 'Format does not match')

4111 1111 1111 1111 -> True
5500-0000-0000-0004 -> True
378282246310005 -> True
6011111111111117 -> True
4111 1111 1111 1112 -> False
1234 5678 9012 3456 -> False
4111 1111 111 -> Format does not match
4111 1111 1111 1111 1111 -> Format does not match


Description:
| Card Number | Valid / Invalid | Reason |
| -------------------------- | ------------- | ------------------------------------------------------ |
| `4111 1111 1111 1111` | ✅ Valid | Widely used Visa test number, passes Luhn. |
| `5500-0000-0000-0004` | ✅ Valid | Widely used Mastercard test number, passes Luhn. |
| `378282246310005` | ✅ Valid | American Express test number, 15 digits, passes Luhn. |
| `6011111111111117` | ✅ Valid | Discover test number, passes Luhn. |
| `4111 1111 1111 1112` | ❌ Invalid | Failed Luhn calculation (last digit modified). |
| `1234 5678 9012 3456` | ❌ Invalid | Does not meet Luhn rules. |
| `4111 1111 111` | ❌ Invalid | Only 11 digits, less than the 13-digit minimum. |
| `4111 1111 1111 1111 1111` | ❌ Invalid | More than 19 digits. |
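
The Luhn rule can also be run in the other direction: given all digits except the last, compute the check digit that makes the number valid. The sketch below assumes the same doubling rule as `luhn_check`. `luhn_check_digit` is a hypothetical helper.

```python
def luhn_check_digit(payload):
    """Return the digit that must be appended to payload to satisfy Luhn."""
    digits = [int(ch) for ch in payload if ch in '0123456789'][::-1]
    total = 0
    for i, d in enumerate(digits):
        # The payload's rightmost digit sits next to the check digit,
        # so here the even positions are the ones that get doubled.
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return (10 - total % 10) % 10

print(luhn_check_digit('4111 1111 1111 111'))  # 1
print(luhn_check_digit('37828224631000'))      # 5
```

Both results reproduce the final digit of the corresponding test numbers in the table, which is a quick way to sanity-check the validator.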

# Regex + Pandas for Data Wrangling
Automatically split a text column containing a combination of name, invoice number, and date into separate columns using regular expressions in Pandas.

In [12]:
data = pd.DataFrame({'note':["Name: Andra; INV: INV-001; Date: 2023-10-11", "Name: Budi; INV: INV-002; Date: 2023/10/12"]})
pattern = r'Name:\s*(?P<name>[^;]+);\s*INV:\s*(?P<inv>[^;]+);\s*Date:\s*(?P<date>[^;]+)'
extracted = data['note'].str.extract(pattern)
extracted['date_norm'] = extracted['date'].apply(lambda x: dateparser.parse(x).date())
result = pd.concat([data, extracted], axis=1)
print(data)
print("-----------------")
print(result)

                                          note
0  Name: Andra; INV: INV-001; Date: 2023-10-11
1   Name: Budi; INV: INV-002; Date: 2023/10/12
-----------------
                                          note   name      inv        date  \
0  Name: Andra; INV: INV-001; Date: 2023-10-11  Andra  INV-001  2023-10-11   
1   Name: Budi; INV: INV-002; Date: 2023/10/12   Budi  INV-002  2023/10/12   

    date_norm  
0  2023-10-11  
1  2023-10-12  
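
`str.extract` yields NaN for any row that does not match the pattern, and passing that value on to a date parser would likely fail. The plain-Python sketch below shows the same named-group extraction with an explicit guard. `NOTE_RE` and `parse_note` are hypothetical names, and the date handling assumes only the two year-first layouts seen in the sample.

```python
import re
from datetime import datetime

NOTE_RE = re.compile(
    r'Name:\s*(?P<name>[^;]+);\s*INV:\s*(?P<inv>[^;]+);\s*Date:\s*(?P<date>[^;]+)'
)

def parse_note(note):
    """Return the extracted fields as a dict, or None if the note does not match."""
    m = NOTE_RE.search(note)
    if m is None:
        return None
    row = m.groupdict()
    # Assumption: dates are year-first and use either '-' or '/' as separator
    iso = row['date'].strip().replace('/', '-')
    row['date_norm'] = datetime.strptime(iso, '%Y-%m-%d').date()
    return row

print(parse_note('Name: Budi; INV: INV-002; Date: 2023/10/12'))
print(parse_note('free text without the expected fields'))
```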


# Text Redaction / Sensitive Data Masking
Automatically obscures sensitive information, here an email address and a phone number, with regular-expression replacement. The same approach extends to ID card or account numbers.

In [13]:
# mask email
s = 'andra@example.com'
masked = re.sub(r'(?P<user>[^@]{1,3})([^@]*)(?=@)', lambda m: m.group('user') + '*'*len(m.group(2)), s)
print(masked)

# mask phone: keep country code + last 2 digits
s = '+62-812-3456-7890'
masked_phone = re.sub(r'(\+?\d{1,3}[\-\s]?)([\d\-\s]+)(\d{2})$', lambda m: m.group(1) + re.sub(r'\d','*',m.group(2)) + m.group(3), s)
print(masked_phone)

and**@example.com
+62-***-****-**90


Email mask description: keep the first 1–3 characters before the @ and replace the rest with *
* (?P<user>[^@]{1,3}) → Capture the first 1–3 characters before the @ sign (the initial part of the email name).
* ([^@]*) → Capture all other characters before the @ sign (the part of the username to be masked).
* (?=@) → Lookahead to ensure we stop right before the @ without deleting it.
* m.group('user') → Returns the first part of the username (1–3 characters), which is kept as-is.
* '*' * len(m.group(2)) → Replace the rest of the username with asterisks according to its length.

Phone mask description: keep the country code and the last 2 digits and replace the rest with *
* (\+?\d{1,3}[\-\s]?) → Capture the country code (e.g., +62-).
* ([\d\-\s]+) → Capture the middle digits (all will be masked).
* (\d{2})$ → Capture the last 2 digits (keep the original).
* m.group(1) → The country code is still displayed.
* re.sub(r'\d', '*', m.group(2)) → All middle digits are replaced with asterisks, the - sign remains.
* m.group(3) → The last two digits are still displayed.
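
The email pattern above works on a string that contains only the address, because `[^@]` would also match ordinary words in a longer sentence. A sketch of a variant for addresses embedded in free text follows. `EMAIL_LOCAL` and `mask_emails` are hypothetical names, and the allowed local-part characters are an assumption borrowed from the EMAIL pattern earlier in this notebook.

```python
import re

# Local part split into a kept prefix (1-3 chars) and a masked remainder;
# the lookahead requires "@domain." to follow without consuming it.
EMAIL_LOCAL = re.compile(r'\b([\w.+-]{1,3})([\w.+-]*)(?=@[\w-]+\.)')

def mask_emails(text):
    """Mask the local part of every email address found in text."""
    return EMAIL_LOCAL.sub(lambda m: m.group(1) + '*' * len(m.group(2)), text)

masked_text = mask_emails('Contact andra@example.com or bo@example.org today')
print(masked_text)  # Contact and**@example.com or bo@example.org today
```

Local parts of three characters or fewer stay fully visible under this rule, so a stricter policy may be preferable for short usernames.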

# Simple HTML/XML parsing (recursive)
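Uses the recursion feature of the regex module to find a tag pair and its content without an external parser. As in the earlier nested example, the output shows CONTENT stopping before the inner </div>, so the outer <div> is closed by the first </div> rather than by its own closing tag. Treat this pattern as a demonstration, not as a reliable HTML parser.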

In [14]:
html = '<div><p>Visitors <b>Bold</b></p><div>Section <span>Inside</span></div></div>'
expr = regex.compile(
    r'(?P<node><(?P<tag>\w+)(?:\s[^>]*)?>)\s*'                # opening tag with optional attributes
    r'(?P<content>(?:(?R)|.*?))\s*'                           # content (can be nested tags, recursive match)
    r'(?P<close></(?P=tag)>)',                                # matching closing tag
    flags=regex.DOTALL
)

print(html)
print("-" * 5)
print(expr)
print("-" * 5)

for m in expr.finditer(html):
    print('TAG:', m.group('tag'))
    print('CONTENT:', m.group('content'))
    print('CLOSE:', m.group('close'))

<div><p>Visitors <b>Bold</b></p><div>Section <span>Inside</span></div></div>
-----
regex.Regex('(?P<node><(?P<tag>\\w+)(?:\\s[^>]*)?>)\\s*(?P<content>(?:(?R)|.*?))\\s*(?P<close></(?P=tag)>)', flags=regex.S | regex.V0)
-----
TAG: div
CONTENT: <p>Visitors <b>Bold</b></p><div>Section <span>Inside</span>
CLOSE: </div>
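
One way to pair every opening tag with its own closing tag is to use a regular expression only for tokenizing and a stack for the nesting. This is a different technique from the recursive pattern above and is offered as a sketch: `TAG` and `pair_tags` are hypothetical names, and void or self-closing tags such as `<br/>` are assumed not to occur.

```python
import re

# Matches an opening or closing tag; the 'close' group is set for </tag>
TAG = re.compile(r'<(?P<close>/)?(?P<tag>\w+)(?:\s[^>]*)?>')

def pair_tags(html):
    """Return (tag, content) pairs in the order in which the tags close."""
    stack, pairs = [], []
    for m in TAG.finditer(html):
        if m.group('close'):
            # Pop only when the closing tag matches the innermost open tag
            if stack and stack[-1][0] == m.group('tag'):
                name, content_start = stack.pop()
                pairs.append((name, html[content_start:m.start()]))
        else:
            stack.append((m.group('tag'), m.end()))
    return pairs

html = '<div><p>Visitors <b>Bold</b></p><div>Section <span>Inside</span></div></div>'
pairs = pair_tags(html)
for name, content in pairs:
    print(name, '->', content)
```

With this approach the outer `div` receives the full inner markup, including the nested `<div>...</div>`, because the stack remembers which opening tag each closing tag belongs to.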


# Performance Testing & Benchmark Suite

In [15]:
import time
pat = regex.compile(r'^(?:[a-zA-Z0-9._%+-]+)@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$')
emails = ['user{}@example.com'.format(i) for i in range(10000)]
start = time.time()
for e in emails:
    pat.match(e)
print('Time (sec):', time.time()-start)

Time (sec): 0.022815465927124023
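
A single `time.time()` measurement is sensitive to system noise. The sketch below uses `timeit.repeat` and takes the best of several runs. It also compares a precompiled pattern with module-level `re.match`. It uses the standard `re` module rather than `regex` (an assumption made to keep the example dependency-free), and the function names are hypothetical. No timings are shown because they depend on the machine.

```python
import re
import timeit

EMAIL_PATTERN = r'^(?:[a-zA-Z0-9._%+-]+)@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
emails = [f'user{i}@example.com' for i in range(10000)]
compiled = re.compile(EMAIL_PATTERN)

def run_compiled():
    # Count matches with the precompiled pattern object
    return sum(1 for e in emails if compiled.match(e))

def run_module_level():
    # Count matches through re.match, which looks the pattern up in re's cache
    return sum(1 for e in emails if re.match(EMAIL_PATTERN, e))

for fn in (run_compiled, run_module_level):
    best = min(timeit.repeat(fn, number=5, repeat=3))
    print(f'{fn.__name__}: best of 3 runs = {best:.4f} s for 5 passes')
```

Taking the minimum rather than the mean filters out runs that were slowed down by other processes.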


# Thank You