
# Regular Expression

Regular expressions (often shortened to regex or regexp) are powerful tools used in Natural Language Processing (NLP) for pattern matching within text. They provide a concise and flexible way to identify, extract, or manipulate specific sequences of characters.

Here's a detailed breakdown of their importance and common use cases in NLP:

### Why are Regular Expressions Important in NLP?

1.  **Pattern Recognition**: NLP often deals with unstructured text. Regex allows you to define patterns to find specific data points, such as dates, email addresses, phone numbers, hashtags, mentions, or specific word forms.
2.  **Text Cleaning and Preprocessing**: Before any analysis can be done, text usually needs cleaning. Regex is invaluable for:
    *   **Removing unwanted characters**: Like HTML tags, punctuation (unless it's semantically important), extra spaces, or special symbols.
    *   **Standardizing text**: Converting all text to lowercase, removing numbers, or handling variations in spelling.
    *   **Tokenization**: Breaking down text into words or sentences, especially when dealing with complex rules that simple `split()` functions can't handle.
3.  **Information Extraction**: Extracting specific entities or facts from large bodies of text, such as names, locations, organizations, or product names, especially in rule-based systems.
4.  **Data Validation**: Checking if input text conforms to a certain format (e.g., ensuring an email address is valid).
5.  **Lexical Analysis**: Identifying specific lexical patterns or features in text that might be indicative of sentiment, topic, or other linguistic characteristics.
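
As a small illustration of the tokenization point above, the sketch below compares `str.split()` with a regex tokenizer. The sample sentence and the name `token_pattern` are made up for this example, and the pattern is only one of many reasonable tokenization rules.

```python
import re

sentence = "Hello, world! Isn't NLP fun?"

# str.split() only breaks on whitespace, so punctuation stays glued to words.
whitespace_tokens = sentence.split()
print(whitespace_tokens)
# ['Hello,', 'world!', "Isn't", 'NLP', 'fun?']

# Either a word (optionally with an internal apostrophe) or one punctuation mark.
token_pattern = r"\w+(?:'\w+)?|[^\w\s]"
regex_tokens = re.findall(token_pattern, sentence)
print(regex_tokens)
# ['Hello', ',', 'world', '!', "Isn't", 'NLP', 'fun', '?']
```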

### Core Components and Concepts of Regular Expressions:

Regex uses a special syntax with various characters and operators:

#### 1. Literal Characters:

Most characters match themselves. E.g., `apple` will match the word "apple".

#### 2. Metacharacters (Special Characters):

These characters have special meanings and form the core of regex power.

*   `.` (Dot): Matches any single character (except newline, by default).
    *   Example: `.at` matches "cat", "bat", "hat"; `a.t` matches "act", "aft".
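*   `^` (Caret): Matches the beginning of a string (and, with the `re.MULTILINE` flag, the beginning of each line).
    *   Example: `^Hello` matches "Hello world" but not "Hi Hello".
*   `$` (Dollar sign): Matches the end of a string, or just before a trailing newline (with `re.MULTILINE`, the end of each line).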
    *   Example: `world$` matches "Hello world" but not "world wide".
*   `*` (Asterisk): Matches zero or more occurrences of the preceding character or group.
    *   Example: `ab*c` matches "ac", "abc", "abbc", "abbbc".
*   `+` (Plus sign): Matches one or more occurrences of the preceding character or group.
    *   Example: `ab+c` matches "abc", "abbc", "abbbc" but not "ac".
*   `?` (Question mark): Matches zero or one occurrence of the preceding character or group (makes it optional).
    *   Example: `colou?r` matches "color" and "colour".
*   `{n}`: Matches exactly `n` occurrences of the preceding character or group.
    *   Example: `a{3}` matches "aaa".
*   `{n,}`: Matches `n` or more occurrences.
    *   Example: `a{2,}` matches "aa", "aaa", "aaaa".
*   `{n,m}`: Matches between `n` and `m` occurrences.
    *   Example: `a{2,4}` matches "aa", "aaa", "aaaa".
*   `[]` (Square brackets): Matches any one of the characters inside the brackets (character set).
    *   Example: `[aeiou]` matches any lowercase vowel. `[0-9]` matches any ASCII digit. `[A-Za-z]` matches any ASCII letter.
*   `|` (Pipe): Acts as an OR operator.
    *   Example: `cat|dog` matches "cat" or "dog".
*   `()` (Parentheses): Creates a capturing group. It allows you to group characters or patterns together, apply quantifiers to the entire group, or extract matched substrings.
    *   Example: `(ab)+` matches "ab", "abab", "ababab".
*   `\` (Backslash): Escapes a metacharacter, making it literal. It also introduces special sequences.
    *   Example: `\.` matches a literal dot, not any character.
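
The metacharacter examples above can be checked with `re.fullmatch`, which succeeds only when the entire string matches the pattern. The helper name `matches` is hypothetical and exists only for this sketch.

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches from start to end."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

print(matches(r"ab*c", ["ac", "abc", "abbc", "adc"]))     # ['ac', 'abc', 'abbc']
print(matches(r"ab+c", ["ac", "abc", "abbc"]))            # ['abc', 'abbc']
print(matches(r"colou?r", ["color", "colour", "colr"]))   # ['color', 'colour']
print(matches(r"a{2,4}", ["a", "aa", "aaaa", "aaaaa"]))   # ['aa', 'aaaa']
print(matches(r"(ab)+", ["ab", "abab", "aba"]))           # ['ab', 'abab']
print(matches(r"cat|dog", ["cat", "dog", "cow"]))         # ['cat', 'dog']
print(matches(r"3\.14", ["3.14", "3x14"]))                # ['3.14']
```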

#### 3. Special Sequences (Predefined Character Sets):

These are shorthand for common character classes, often prefixed with a backslash.

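*   `\d`: Matches any digit. For `str` patterns in Python 3 this includes Unicode decimal digits; it is equivalent to `[0-9]` only under the `re.ASCII` flag.
*   `\D`: Matches any non-digit (the complement of `\d`; `[^0-9]` under `re.ASCII`).
*   `\w`: Matches any word character (letters, digits, and underscore; equivalent to `[a-zA-Z0-9_]` under `re.ASCII`).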
*   `\W`: Matches any non-word character.
*   `\s`: Matches any whitespace character (space, tab, newline, etc.).
*   `\S`: Matches any non-whitespace character.
*   `\b`: Matches a word boundary (the position between a word character and a non-word character, or the beginning/end of a string).
    *   Example: `\bcat\b` matches "cat" in "The cat sat." but not in "catapult".
*   `\B`: Matches a non-word boundary.
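
A short sketch of the special sequences above follows. The sample strings are invented for illustration; the last lines show how the `re.ASCII` flag narrows `\d` and `\w` to ASCII characters.

```python
import re

text = "The cat sat near a catapult at 9:30 in room_12."

digits = re.findall(r"\d+", text)          # runs of digits
whole_word = re.findall(r"\bcat\b", text)  # "cat" as a standalone word only
substring = re.findall(r"cat", text)       # also matches inside "catapult"
pieces = re.split(r"\s+", "a  b\tc\nd")    # split on any run of whitespace

print(digits)      # ['9', '30', '12']
print(whole_word)  # ['cat']
print(substring)   # ['cat', 'cat']
print(pieces)      # ['a', 'b', 'c', 'd']

# "\u0663" is the Arabic-Indic digit three; "caf\u00e9" ends in an accented e.
unicode_digits = re.findall(r"\d", "\u0663 and 3")
ascii_digits = re.findall(r"\d", "\u0663 and 3", flags=re.ASCII)
unicode_words = re.findall(r"\w+", "room_12 caf\u00e9")
ascii_words = re.findall(r"\w+", "room_12 caf\u00e9", flags=re.ASCII)

print(len(unicode_digits), ascii_digits)  # 2 ['3']
print(ascii_words)                        # ['room_12', 'caf']
```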

#### 4. Quantifiers (Greedy vs. Non-Greedy):

By default, quantifiers (`*`, `+`, `?`, `{}`) are "greedy," meaning they try to match as much as possible.

*   To make them "non-greedy" (or "lazy"), append a `?` after the quantifier.
    *   Example: applied to `<b>hello</b>`, the greedy pattern `<.*>` matches the whole string in a single match.
    *   The non-greedy pattern `<.*?>` matches `<b>` and `</b>` as two separate matches.
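
The difference is easy to see by running both patterns on the same string. This is a minimal sketch; the negated character class shown at the end is a common alternative to a lazy quantifier, not something the text above prescribes.

```python
import re

html = "<b>hello</b>"

greedy = re.findall(r"<.*>", html)    # .* runs to the last ">" in the string
lazy = re.findall(r"<.*?>", html)     # .*? stops at the first ">" it can
negated = re.findall(r"<[^>]+>", html)  # same result without a lazy quantifier

print(greedy)   # ['<b>hello</b>']
print(lazy)     # ['<b>', '</b>']
print(negated)  # ['<b>', '</b>']

# Stripping the tags with the lazy pattern:
print(re.sub(r"<.*?>", "", html))  # hello
```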

### Example Use Cases in NLP:

*   **Email Extraction**: `\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b` (inside a character class `|` is a literal pipe, so it does not belong in `[A-Za-z]`)
*   **Hashtag Identification**: `#[A-Za-z0-9_]+`
*   **Mention Identification**: `@\w+`
*   **Date Formats (e.g., YYYY-MM-DD)**: `\d{4}-\d{2}-\d{2}`
*   **Removing Punctuation**: `[^\w\s]` (matches any character that is not a word character or whitespace)
*   **Finding all occurrences of a word (case-insensitive)**: `(?i)\bthe\b`, or `\bthe\b` with the `re.IGNORECASE` flag. `[Tt]he` covers only a capitalized first letter and also matches inside words such as "then".
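
The sketch below runs several of these patterns over one invented sample string (`post` and the dictionary keys are illustrative names). The email pattern writes its final character class as `[A-Za-z]`. Note that the plain mention pattern `@\w+` also fires inside an email address; the lookbehind variant at the end is one possible way to avoid that.

```python
import re

post = "On 2026-01-15 @asha emailed team@example.com about #NLP and #regex_101!"

patterns = {
    "email": r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b",
    "hashtag": r"#[A-Za-z0-9_]+",
    "mention": r"@\w+",
    "date": r"\d{4}-\d{2}-\d{2}",
}

found = {name: re.findall(pattern, post) for name, pattern in patterns.items()}
for name, hits in found.items():
    print(f"{name}: {hits}")
# email: ['team@example.com']
# hashtag: ['#NLP', '#regex_101']
# mention: ['@asha', '@example']
# date: ['2026-01-15']

# Require that "@" is not preceded by a word character, so emails are skipped.
strict_mentions = re.findall(r"(?<!\w)@\w+", post)
print(strict_mentions)  # ['@asha']
```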

### Libraries for Regex in Python:

Python's built-in `re` module provides comprehensive support for regular expressions:

*   `re.search()`: Scans through a string looking for the first location where the regex pattern produces a match.
*   `re.match()`: Checks for a match only at the beginning of the string.
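*   `re.findall()`: Finds all non-overlapping matches of the pattern in the string and returns them as a list: whole matches if the pattern has no capturing groups, the group's text if it has one group, and tuples of groups if it has several.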
*   `re.sub()`: Replaces occurrences of a pattern in a string with a replacement string.
*   `re.compile()`: Compiles a regex pattern into a regex object, which can be used for more efficient matching when the same pattern is used multiple times.
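
A compact side-by-side of these functions, using an invented three-word string, might look like this:

```python
import re

text = "cat bat rat"

first_bat = re.search(r"bat", text)   # scans the whole string
at_start = re.match(r"bat", text)     # only tries position 0
print(first_bat.span())               # (4, 7)
print(at_start)                       # None
print(re.match(r"cat", text).group()) # cat

all_words = re.findall(r"[cbr]at", text)
first_letters = re.findall(r"([cbr])at", text)  # one group: returns the group text
print(all_words)      # ['cat', 'bat', 'rat']
print(first_letters)  # ['c', 'b', 'r']

replaced = re.sub(r"[cbr]at", "X", text)
print(replaced)       # X X X

# A compiled pattern exposes the same methods and can be reused.
word = re.compile(r"\b\w+\b")
print(word.findall(text))  # ['cat', 'bat', 'rat']
```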

Regular expressions are a core skill for anyone working with text data: they handle much of the preprocessing, cleaning, and information extraction that unstructured text requires in NLP.

1. Search

In [None]:
import re

In [None]:
"Hello,how are you !"

'Hello,how are you !'

In [None]:
'you' in "Hello,how are you !"

True

In [None]:
srch = re.search('you' , 'Hello, how are you?')


In [None]:
srch.span()


(15, 18)

In [None]:
srch.start()


15

In [None]:
srch.end()


18

2. Findall

In [None]:
a="Hello ,How are area are you ?"
b="are"

len(re.findall(b,a))

3

3. Finditer

In [None]:

for s in re.finditer('are','Hello, how are area are you?'):
    print(s.span())

(11, 14)
(15, 18)
(20, 23)


In [None]:
import re

text = "I love NLP and NLP is powerful"

matches = re.finditer("NLP", text)

for match in matches:
    print(match.group(), match.start(), match.end())


NLP 7 10
NLP 15 18


4. Creating Regular Expressions

In [None]:
txt = 'My telephone number is 834-4324-345'

pattern = r'\d\d\d-\d\d\d\d-\d\d\d'  # raw string, so \d reaches the regex engine unchanged

re.search(pattern, txt)

<re.Match object; span=(23, 35), match='834-4324-345'>

In [None]:
txt = 'My telephone number is 834-4324-345'

pattern = r'\d{3}-\d{4}-\d{3}'  # same pattern written with {n} quantifiers

re.search(pattern, txt)

<re.Match object; span=(23, 35), match='834-4324-345'>

In [None]:
txt = 'My telephone number is 834-4324-345'

pattern = r'\d{3}-\d{4}-\d{3}'

re.search(pattern, txt).group()  # .group() returns the matched text

'834-4324-345'

In [None]:
import re

txt = 'My telephone number is 834-4324-345'
pattern = r'\d\d\d-\d\d\d\d-\d\d\d'

matches = re.finditer(pattern, txt)

for match in matches:
    print("Phone Number:", match.group())
    print("Start Index:", match.start())
    print("End Index:", match.end())


Phone Number: 834-4324-345
Start Index: 23
End Index: 35


In [None]:
txt = "Call 834-4324-345 or 999-1111-222"

for match in re.finditer(pattern, txt):
    print(match.group(), match.span())


834-4324-345 (5, 17)
999-1111-222 (21, 33)
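
Capturing groups make it possible to pull out the individual parts of each number rather than the whole match. The sketch below uses named groups; the group names `first`, `middle`, and `last` are arbitrary labels chosen for this example.

```python
import re

txt = "Call 834-4324-345 or 999-1111-222"

phone = re.compile(r"(?P<first>\d{3})-(?P<middle>\d{4})-(?P<last>\d{3})")

parts = [m.groupdict() for m in phone.finditer(txt)]
for m in phone.finditer(txt):
    print(m.group("first"), m.group("middle"), m.group("last"))
# 834 4324 345
# 999 1111 222

# With several groups, findall returns one tuple per match.
tuples = phone.findall(txt)
print(tuples)  # [('834', '4324', '345'), ('999', '1111', '222')]
```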


🏷️ Named Entity–Style Extraction

In [None]:
text = "Email me at test@gmail.com or admin@yahoo.com"

pattern = r"\b[\w.-]+@[\w.-]+\.\w+\b"

for match in re.finditer(pattern, text):
    print("Entity:", match.group(), "| Span:", match.span())


Entity: test@gmail.com | Span: (12, 26)
Entity: admin@yahoo.com | Span: (30, 45)


🧹 Text Cleaning with Index Tracking

In [None]:
text = "Remove #hashtags and @mentions"

for match in re.finditer(r"[#@]\w+", text):
    print("Remove:", match.group(), "at", match.start())


Remove: #hashtags at 7
Remove: @mentions at 21
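
Once the positions are known, the same pattern can do the actual removal with `re.sub`. This is a minimal sketch; the second substitution collapses the double space left behind.

```python
import re

text = "Remove #hashtags and @mentions"

without_tags = re.sub(r"[#@]\w+", "", text)          # drop hashtags and mentions
cleaned = re.sub(r"\s+", " ", without_tags).strip()  # tidy leftover whitespace

print(repr(without_tags))  # 'Remove  and '
print(cleaned)             # Remove and
```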


🧪 Highlighting Words (NLP Visualization idea)

In [None]:
text = "NLP makes machines understand language"

for match in re.finditer(r"\b\w{8,}\b", text):
    print(f"Long word '{match.group()}' at {match.span()}")


Long word 'machines' at (10, 18)
Long word 'understand' at (19, 29)
Long word 'language' at (30, 38)
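
To turn the idea into actual highlighting, `re.sub` can wrap or transform each match. The `**...**` markers below are just one arbitrary choice of highlight syntax.

```python
import re

text = "NLP makes machines understand language"

# \1 in the replacement refers to the text captured by the first group.
highlighted = re.sub(r"\b(\w{8,})\b", r"**\1**", text)
print(highlighted)
# NLP makes **machines** **understand** **language**

# A function can be passed as the replacement for arbitrary transformations.
shouted = re.sub(r"\b\w{8,}\b", lambda m: m.group().upper(), text)
print(shouted)
# NLP makes MACHINES UNDERSTAND LANGUAGE
```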


# 🧱 Basic Alphanumeric Patterns

In [None]:
print(re.findall('at','The rat sat on the mat and attached by a cat'))

['at', 'at', 'at', 'at', 'at']


In [None]:
print(re.findall('.at','The rat sat on the mat and attached by a cat'))

['rat', 'sat', 'mat', ' at', 'cat']


In [None]:
print(re.findall(r'\d','4 is divisible by 2 and not by 3'))

['4', '2', '3']


In [None]:
print(re.findall(r'\d$','4 is divisible by 2 and not by 3'))           # string ends with a digit
print(re.findall(r'\d$','4 is divisible by 2 and not by 3 in maths'))  # string ends with a letter

['3']
[]


In [None]:
print(re.findall('[A-Z]','Welcome to GFG 1'))
print(re.findall('[a-z]','Welcome to GFG 1'))
print(re.findall('[0-9]','Welcome to GFG 1'))
print(re.findall('[A-Za-z]','Welcome to GFG 1'))
print(re.findall('[A-Za-z0-9]','Welcome to GFG 1'))

['W', 'G', 'F', 'G']
['e', 'l', 'c', 'o', 'm', 'e', 't', 'o']
['1']
['W', 'e', 'l', 'c', 'o', 'm', 'e', 't', 'o', 'G', 'F', 'G']
['W', 'e', 'l', 'c', 'o', 'm', 'e', 't', 'o', 'G', 'F', 'G', '1']


In [None]:
import re

text = "NLP2026!"
pattern = r"[A-Za-z0-9]"

print(re.findall(pattern, text))


['N', 'L', 'P', '2', '0', '2', '6']


In [None]:
text = "NLP_2026 is powerful!"

# "_" is a word character, so there is no \b inside "NLP_2026" and that token is skipped
pattern = r"\b[A-Za-z0-9]+\b"
print(re.findall(pattern, text))


['is', 'powerful']


In [None]:
text = "User_id = user123"

print(re.findall(r"\w+", text))


['User_id', 'user123']


In [None]:
text = "OrderID A12B34 shipped on 2026"

for match in re.finditer(r"[A-Za-z0-9]+", text):
    print(match.group(), match.span())


OrderID (0, 7)
A12B34 (8, 14)
shipped (15, 22)
on (23, 25)
2026 (26, 30)


Extract Product Codes

In [None]:
text = "Products: TV123, MOB456, LAP789"

pattern = r"\b[A-Z]{2,3}\d{3}\b"
print(re.findall(pattern, text))


['TV123', 'MOB456', 'LAP789']


Remove Non-Alphanumeric Characters

In [None]:
text = "Clean!!! NLP@2026###"

clean = re.sub(r"[^A-Za-z0-9\s]", "", text)
print(clean)


Clean NLP2026


In [None]:
import re

text = "Eshant @# is happ$y!"
cleaned = re.findall(r'[^@#\$!]+', text)  # runs of characters other than @, #, $ and !
print(cleaned)
print(''.join(cleaned))

['Eshant ', ' is happ', 'y']
Eshant  is happy


In [None]:
text = "I'm Aaliya Fatma and 123"
print(re.findall(r'\D+', text))

["I'm Aaliya Fatma and "]


In [None]:
text = "Geeks-for-Geeks works-it-out wklfd-dfjgk-fjkds"
patterns = re.findall(r'\w+-\w+(?:-\w+)?', text)
print(patterns)

['Geeks-for-Geeks', 'works-it-out', 'wklfd-dfjgk-fjkds']


Accessing numbers


In [None]:
for no in ["64-6534-342", "543-5345-645", "4563-453-445", "53-5453-5345", "435-234-6324"]:
    print(re.findall(r'\d+-\d+-\d+', no)[0].replace('-', ''))

646534342
5435345645
4563453445
5354535345
4352346324


In [None]:
import re

text = "eshant@gfg.in"
pattern = r'[A-Za-z0-9]+@[\w]+\.[\w]+'
print(re.match(pattern, text))

<re.Match object; span=(0, 13), match='eshant@gfg.in'>


In [None]:
text = "Contact us at eshant@gfg.in for details"
pattern = r'[A-Za-z0-9]+@[\w]+\.[\w]+'
print(re.findall(pattern, text))

['eshant@gfg.in']


Accessing Emails

In [None]:
mail1 = "eshant@gfg.org"
mail2 = "eshant@gmail.com"

pattern = r'[A-Za-z0-9]+@gfg\.org'
print(re.match(pattern, mail1))
print(re.match(pattern, mail2))

<re.Match object; span=(0, 14), match='eshant@gfg.org'>
None


In [None]:
mail1 = "eshant@gfg.org"
mail2 = "eshant@gfg.net"
mail3 = "eshant@gfg.com"

pattern = r'[A-Za-z0-9]+@gfg\.(org|net)'
print(re.match(pattern, mail1))
print(re.match(pattern, mail2))
print(re.match(pattern, mail3))

<re.Match object; span=(0, 14), match='eshant@gfg.org'>
<re.Match object; span=(0, 14), match='eshant@gfg.net'>
None
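
One caveat when using this for validation: `re.match` anchors only the start of the string, so trailing text after a valid prefix still passes. The sketch below contrasts it with `re.fullmatch`; the address `eshant@gfg.organic` is an invented counterexample.

```python
import re

pattern = r'[A-Za-z0-9]+@gfg\.(?:org|net)'

results = {}
for mail in ["eshant@gfg.org", "eshant@gfg.organic", "eshant@gfg.com"]:
    prefix_ok = bool(re.match(pattern, mail))     # start of string only
    whole_ok = bool(re.fullmatch(pattern, mail))  # entire string must match
    results[mail] = (prefix_ok, whole_ok)
    print(mail, prefix_ok, whole_ok)
# eshant@gfg.org True True
# eshant@gfg.organic True False
# eshant@gfg.com False False
```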


1️⃣ Remove Special Characters (Keep letters + numbers + space)

In [None]:
import re

text = "Hello!!! NLP@2026 #AI"
clean_text = re.sub(r"[^A-Za-z0-9\s]", "", text)

print(clean_text)


Hello NLP2026 AI


2️⃣ Remove Special Characters (Keep only letters)

In [None]:
text = "NLP2026!!! is #Awesome"

clean_text = re.sub(r"[^A-Za-z\s]", "", text)
print(clean_text)


NLP is Awesome


3️⃣ Remove Special Characters (Keep only numbers)

In [None]:
text = "Call me at +91-98765-43210"

numbers = re.sub(r"\D", "", text)
print(numbers)


919876543210


Clean Text + Remove Extra Spaces (Best Practice)

In [None]:
def clean_text(text):
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(clean_text("Hello!!!   NLP@2026   #AI"))


Hello NLP2026 AI


In [None]:
text = "I love NLP 😊🔥"
clean = re.sub(r"[^\x00-\x7F]+", "", text)  # drop every non-ASCII character, including emoji
print(clean)


I love NLP 
