## Python - Regular Expressions
- Regular expressions (regex) are a powerful tool for text cleaning and preprocessing in Python. This notebook gives a structured overview with examples.
- https://docs.python.org/3/library/re.html

In [1]:
import re

## Basic Functions
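- re.findall(pattern, text) → returns all non-overlapping matches as a list
- re.search(pattern, text) → returns a match object for the first match, or None if there is none
- re.sub(pattern, repl, text) → replaces every match with repl
- re.split(pattern, text) → splits text wherever the pattern matches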
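
The four functions above can be compared side by side on one string. This is a minimal sketch; the sample sentence and variable names are made up for illustration.

```python
import re

text = "Order 12 shipped, order 7 pending"

# findall: every non-overlapping match, returned as a list of strings
all_numbers = re.findall(r"\d+", text)

# search: a match object for the first match, or None when nothing matches
first = re.search(r"\d+", text)
first_number = first.group() if first else None

# sub: replace every match with the replacement string
masked = re.sub(r"\d+", "N", text)

# split: cut the string wherever the pattern matches
parts = re.split(r",\s*", text)

print(all_numbers)
print(first_number)
print(masked)
print(parts)
```

Because `re.search` can return `None`, checking the result before calling `.group()` avoids an `AttributeError` on text without a match.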

### Cleaning 

In [4]:
# Remove punctuation
text = "Hello!!! How are you?? I’m fine..."
cleaned = re.sub(r"[^\w\s]", "", text)
text

'Hello!!! How are you?? I’m fine...'

In [5]:
print(cleaned)   # "Hello How are you Im fine" (the apostrophe counts as punctuation and is removed too)

Hello How are you Im fine
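
The output shows that `[^\w\s]` also strips the apostrophe in "I’m". If contractions should survive, one option is to add the apostrophe characters to the negated class. This is a sketch of that variation, not part of the original cell; the name `cleaned_keep` is made up here.

```python
import re

text = "Hello!!! How are you?? I’m fine..."

# Keep word characters, whitespace, and both straight and curly apostrophes;
# everything else is treated as punctuation and removed.
cleaned_keep = re.sub(r"[^\w\s'’]", "", text)

print(cleaned_keep)
```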


In [6]:
#  Lowercasing + remove extra spaces
text = "   THIS   is   a   TEST   "
cleaned = re.sub(r"\s+", " ", text).strip().lower()
print(text , '\n', cleaned)   # "this is a test"

   THIS   is   a   TEST    
 this is a test


In [7]:
# Remove numbers
text = "Order #123 was placed on 2025-08-15"
cleaned = re.sub(r"\d+", "", text)
print(text , '\n', cleaned)   # "Order # was placed on --"

Order #123 was placed on 2025-08-15 
 Order # was placed on --
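
Deleting digits leaves stray symbols behind (`#` and `--`). When the structure matters, a possible alternative is to replace whole tokens with placeholders, handling the more specific pattern first. The `<DATE>` and `<NUM>` placeholders below are arbitrary labels chosen for this sketch.

```python
import re

text = "Order #123 was placed on 2025-08-15"

# Replace ISO-style dates first, so their digits are not consumed piecemeal.
step1 = re.sub(r"\d{4}-\d{2}-\d{2}", "<DATE>", text)

# Then replace any remaining number, together with an optional leading '#'.
normalized = re.sub(r"#?\d+", "<NUM>", step1)

print(normalized)
```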


In [8]:
# Extract only emails
text = "Contact us at support@example.com or sales@shop.com"
emails = re.findall(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}", text)
#   local part @ domain . top-level domain (2+ letters)
print(text, '\n', emails)   # ['support@example.com', 'sales@shop.com']

Contact us at support@example.com or sales@shop.com 
 ['support@example.com', 'sales@shop.com']
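
When the same pattern is applied many times, it can be compiled once and reused. The sketch below also uses `re.IGNORECASE` so mixed-case addresses match. The constant name `EMAIL_RE` and the sample text are made up, and the pattern is a pragmatic approximation rather than a full email validator.

```python
import re

# Compile once; IGNORECASE lets the lowercase classes match uppercase too.
EMAIL_RE = re.compile(r"[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}", re.IGNORECASE)

text = "Write to Support@Example.COM or sales@shop.com."

emails = EMAIL_RE.findall(text)

# Lowercase the results so duplicates that differ only by case compare equal.
normalized = [e.lower() for e in emails]

print(emails)
print(normalized)
```

Note that the trailing period of the sentence is not captured, because the pattern requires at least two letters after the final dot.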


In [9]:
# Extract only hashtags
text = "Loving #AI and #Python for #DataScience!"
hashtags = re.findall(r"#\w+", text)
print(text, '\n', hashtags)   # ['#AI', '#Python', '#DataScience']

Loving #AI and #Python for #DataScience! 
 ['#AI', '#Python', '#DataScience']


In [10]:
# Remove HTML Tags
text = "<p>Hello <b>World</b></p>"
cleaned = re.sub(r"<.*?>", "", text)
print(text, '\n', cleaned)   # "Hello World"

<p>Hello <b>World</b></p> 
 Hello World
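
The `?` in `<.*?>` matters: it makes the quantifier non-greedy, so each match stops at the first `>`. A short comparison, as a sketch using the same sample string:

```python
import re

text = "<p>Hello <b>World</b></p>"

# Greedy: .* runs to the LAST '>', so the whole string is one match.
greedy = re.sub(r"<.*>", "", text)

# Non-greedy: .*? stops at the FIRST '>', so each tag is matched separately.
lazy = re.sub(r"<.*?>", "", text)

# Equivalent here: match anything except '>' between the angle brackets.
negated = re.sub(r"<[^>]+>", "", text)

print(repr(greedy))
print(repr(lazy))
print(repr(negated))
```

This works for simple snippets; for real-world HTML (comments, scripts, attributes containing `>`), a dedicated HTML parser is usually the safer choice.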


In [11]:
# Normalize whitespace (newlines, tabs → space)
text = "Hello\nWorld\tPython"
cleaned = re.sub(r"\s+", " ", text)
print(text, '\n', cleaned)   # "Hello World Python"

Hello
World	Python 
 Hello World Python


## Regex + Pandas (cleaning columns)

In [13]:
import pandas as pd

df = pd.DataFrame({"text": [
    "Hello!!!",
    "Python   is GREAT???",
    "<p>Data Science</p>",
]})

df

|   | text |
|---|------|
| 0 | Hello!!! |
| 1 | Python is GREAT??? |
| 2 | <p>Data Science</p> |


In [14]:
df["clean_text"] = (
    df["text"]
    .str.lower()                               # lowercase
    .str.replace(r"<.*?>", "", regex=True)     # drop HTML tags
    .str.replace(r"[^\w\s]", "", regex=True)   # drop punctuation
    .str.replace(r"\s+", " ", regex=True)      # collapse whitespace runs
)
df

|   | text | clean_text |
|---|------|------------|
| 0 | Hello!!! | hello |
| 1 | Python is GREAT??? | python is great |
| 2 | <p>Data Science</p> | data science |
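
The same steps can be wrapped in a plain function, which is easier to unit-test and reuse outside pandas. This is a sketch with made-up names (`clean_text`, `TAG_RE`, `PUNCT_RE`, `SPACE_RE`); unlike the chain above, it also strips leading and trailing whitespace.

```python
import re

# Precompiled patterns, mirroring the three replace steps in the pandas chain.
TAG_RE = re.compile(r"<.*?>")      # HTML tags
PUNCT_RE = re.compile(r"[^\w\s]")  # anything that is not a word char or whitespace
SPACE_RE = re.compile(r"\s+")      # runs of whitespace


def clean_text(value: str) -> str:
    """Lowercase, drop tags and punctuation, and collapse whitespace."""
    value = value.lower()
    value = TAG_RE.sub("", value)
    value = PUNCT_RE.sub("", value)
    return SPACE_RE.sub(" ", value).strip()


samples = ["Hello!!!", "Python   is GREAT???", "<p>Data Science</p>"]
cleaned = [clean_text(s) for s in samples]
print(cleaned)
```

On a DataFrame, a function like this could be applied with `df["text"].map(clean_text)`, at the cost of being slower than the vectorized `.str` methods on large columns.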


## Benefits
- With regex, you can flexibly extract, replace, or normalize text patterns.
- It’s especially useful in NLP, web scraping, and log cleaning.
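
As an illustration of the log-cleaning use case, named groups can turn a log line into a dictionary. The log format, the sample line, and the name `LOG_RE` below are hypothetical and chosen only for this sketch.

```python
import re

# Hypothetical format: "<YYYY-MM-DD> <LEVEL> <message>"
LOG_RE = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<level>[A-Z]+) (?P<message>.*)"
)

line = "2025-08-15 ERROR Disk quota exceeded"

match = LOG_RE.match(line)

# groupdict() maps each named group to the text it captured.
record = match.groupdict() if match else {}

print(record)
```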