# Regular Expressions

- A regular expression, regex, or regexp (sometimes called a rational expression) is, in theoretical computer science and formal language theory, a sequence of characters that defines a search pattern.
- Usually this pattern is then used by `string` searching algorithms for "find" or "find and replace" operations on strings, or for input validation.

- The concept arose in the 1950s when the American mathematician Stephen Cole Kleene formalized the description of a regular language.
- The concept came into common use with `Unix` text-processing utilities.
- Since the 1980s, different syntaxes for writing regular expressions have existed, one being the POSIX standard and another, widely used, being the Perl syntax.

<h1 align=center>RegEx in Python</h1>

 [python documentation](https://docs.python.org/2/library/re.html#regular-expression-syntax)

[RegEx 101](https://regex101.com/)

### 🔹 *What is Regex?*

*Regular Expression (Regex)*: a sequence of characters that forms a search pattern.

It is used to `find`, `match`, and `manipulate` strings.

Example:
- Pattern: \d+
- Meaning: "one or more digits".
- Text: "I have 12 apples" → Match: 12.
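The example above can be tried directly in Python. This is a minimal sketch using the standard `re` module; the variable names are only illustrative.

```python
import re

text = "I have 12 apples"

# \d+ means "one or more digits"; findall returns every non-overlapping match
matches = re.findall(r"\d+", text)
print(matches)  # ['12']
```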

### 🔹 *Why use Regex?*
- To search inside text.
- To validate inputs (emails, phone numbers, passwords).
- To clean or extract information from text.
- To perform substitutions or formatting.


### *Regex is a powerful tool in Python for string searching, validation, and text manipulation using patterns.*

### Regex Module
Python provides a module for regular expressions called `re`.

Most used functions:
- sub() → Replace text.
- findall() → Return all matches.
- match() → Match only at the start.
- search() → Find first match anywhere.
- compile() → Save a regex pattern for reuse.
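As a quick orientation, the five functions listed above can be sketched side by side. The sample sentence and variable names are hypothetical and chosen only for illustration.

```python
import re

sentence = "cats chase cats"  # hypothetical sample text

replaced = re.sub(r"cats", "dogs", sentence)   # replace every match
all_hits = re.findall(r"cats", sentence)       # list of all matches
at_start = re.match(r"chase", sentence)        # None: "chase" is not at the start
anywhere = re.search(r"chase", sentence)       # first match anywhere in the string
pattern = re.compile(r"c\w+")                  # compiled pattern, reusable later
c_words = pattern.findall(sentence)            # words that start with "c"

print(replaced)          # dogs chase dogs
print(all_hits)          # ['cats', 'cats']
print(at_start)          # None
print(anywhere.start())  # 5
print(c_words)           # ['cats', 'chase', 'cats']
```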

## Regex in Python

We are going to use the `findall` and `sub` functions from the `re` module.

In [1]:
import re

text = "my name is Ahmed Mohamed Ahmed"

re.sub("Ahmed", "Ali", text)

'my name is Ali Mohamed Ali'

In [2]:
re.findall("Ahmed", text)

['Ahmed', 'Ahmed']

Here we used plain text for substitution and search, but we can also use a regex pattern.

## Special Characters (Regular Expressions)

Some characters have special functions and are not just literal characters, for example `\n`, which indicates a newline, and `\t`, which is a tab.

### Basic patterns that match single chars

| Character  | function |
| ------------- | ------------- |
| a-z, 0-9  | ordinary characters just match themselves exactly.|
| . (dot)  | matches any single character except newline '\n'  |
| \w | matches a "word" character: a letter or digit or underbar [a-zA-Z0-9_] |
| \W | matches any non-word character |
| \b | boundary between word and non-word |
| \s | matches a single whitespace character -- space, newline, return, tab |
| \S | matches any non-whitespace character |
| \t, \n, \r | tab, newline, return |
| \d | decimal digit [0-9] |
| ^ | matches the start of the string |
| $ | matches the end of the string |
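A few of the patterns from the table can be checked with a short sketch. The sample string and variable names are hypothetical; `\w+` uses the "one or more" quantifier `+` from the introduction.

```python
import re

sample = "Room 101B, call 555-0199\tnow"  # hypothetical sample text

digits = re.findall(r"\d", sample)            # each decimal digit on its own
words = re.findall(r"\w+", sample)            # runs of word characters
spaces = re.findall(r"\s", sample)            # whitespace, including the tab
whole_word = re.findall(r"\bcall\b", sample)  # "call" as a whole word
at_start = re.findall(r"^Room", sample)       # only at the start of the string
at_end = re.findall(r"now$", sample)          # only at the end of the string

print(digits, words, spaces, whole_word, at_start, at_end, sep="\n")
```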

## Let's mix them with normal characters

> Note that we put `r` before the pattern string to make it a raw string, so Python does not process backslash escapes in it; for example, it will not replace `\n` with a newline.
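The effect of the `r` prefix can be seen in a small sketch. The sample text is hypothetical; the point is that `"\b"` in a normal string is a backspace character, not a word boundary.

```python
import re

print(len("\n"))   # 1: Python turned \n into a single newline character
print(len(r"\n"))  # 2: a backslash followed by the letter n

text = "py is short for python"  # hypothetical sample text

# Without r, "\b" becomes a backspace character, so nothing matches
no_raw = re.findall("\bpy\b", text)
# With r, the regex engine receives \b and treats it as a word boundary
raw = re.findall(r"\bpy\b", text)

print(no_raw)  # []
print(raw)     # ['py']
```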

In [3]:
text = """regular expression is a special sequence of characters \
that helps you match or find other strings or sets of strings, \
using regular expression pattern. regular expressions are widely used in UNIX world."""

print(re.findall(r"^Regular", text))  # ^ anchors at the start; no match because the text starts with lowercase "regular"
print(re.findall(r"regular", text))   # every occurrence of "regular"

[]
['regular', 'regular', 'regular']


## Replace the first "regular" with a title-case one

In [4]:
text = """regular expression is a special sequence of characters \
that helps you match or find other strings or sets of strings, \
using regular expression pattern. regular expressions are widely used in UNIX world."""

re.sub(r"^regular", "Regular", text)

'Regular expression is a special sequence of characters that helps you match or find other strings or sets of strings, using regular expression pattern. regular expressions are widely used in UNIX world.'

# Lab 1 : Ends with number

Using a regular expression, write a script that checks whether a string ends with a number.

In [7]:
text = input("enter a string:").strip()

result = re.findall(r"[0-9]$", text)
print(result)


if result:
    print("string ends with a number")
else:
    print("it is not")

enter a string: Ali 123


['3']
string ends with a number
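The same check can also be wrapped in a function so it is easy to reuse and test. This is a sketch under the assumption that a yes/no answer is enough; the function name `ends_with_number` is hypothetical, and `re.search` is used because only the existence of a match matters.

```python
import re

def ends_with_number(text):
    """Return True when the last character of text is a digit 0-9."""
    # $ anchors the digit to the end of the string
    return re.search(r"[0-9]$", text) is not None

print(ends_with_number("Ali 123"))  # True
print(ends_with_number("Ali"))      # False
```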


# Lab 2 : Only Valid text

Write a script that uses a regular expression to check whether the user input consists only of **`alphabet letters`**, **`numbers`**, and **`_`**.

In [20]:
username = input("enter your username:").strip()

invalid_input = re.findall(r"[^a-zA-Z0-9_]", username) # for ASCII input, [^a-zA-Z0-9_] behaves like \W

if invalid_input:
    print("invalid username")
else:
    print("username {} is valid".format(username))

enter your username: Ali.ahmed123


invalid username


In [11]:
username = input("enter your username:").strip()

invalid_input = re.findall(r"\W", username) # for ASCII input, \W behaves like [^a-zA-Z0-9_]; in Python 3 it also accepts non-ASCII letters as word characters

if invalid_input:
    print("invalid username")
else:
    print("username {} is valid".format(username))

enter your username: Ali.ahmed123


invalid username
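An alternative is to describe what a valid username looks like instead of searching for invalid characters. The sketch below assumes `re.fullmatch` is acceptable for the lab; the function name `is_valid_username` is hypothetical. One difference worth noting: the `findall` check above reports an empty username as valid, because no invalid character is found in it, while this version rejects it.

```python
import re

def is_valid_username(username):
    # fullmatch succeeds only if the whole string fits the pattern;
    # the + requires at least one character, so "" is rejected
    return re.fullmatch(r"[a-zA-Z0-9_]+", username) is not None

print(is_valid_username("Ali_ahmed123"))  # True
print(is_valid_username("Ali.ahmed123"))  # False
print(is_valid_username(""))              # False
```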


## Building a bigger regular expression

You can mix and match multiple expressions and use more than one instance of each.

| Example  | description |
| --- | --- |
| [Pp]ython | Match "Python" or "python" |
| rub[ye] | Match "ruby" or "rube" |
| [aeiou] | Match any one lowercase vowel |
| [0-9] | Match any digit; same as [0123456789] |
| [a-z] | Match any lowercase ASCII letter |
| [A-Z] | Match any uppercase ASCII letter |
| [a-zA-Z0-9] | Match any ASCII letter or digit |
| [^aeiou] | Match anything other than a lowercase vowel |
| [^0-9] | Match anything other than a digit |

> We can use **OR** (the `|` character) to combine multiple regexes into one pattern.
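A minimal sketch of OR in a pattern, using a hypothetical sample sentence. The second pattern wraps the alternatives in a non-capturing group `(?:...)` so that the word boundaries apply to both options.

```python
import re

text = "I have a cat, a dog and a catfish"  # hypothetical sample text

# cat|dog matches either alternative, even inside a longer word
loose = re.findall(r"cat|dog", text)
# \b(?:cat|dog)\b only accepts the alternatives as whole words
strict = re.findall(r"\b(?:cat|dog)\b", text)

print(loose)   # ['cat', 'dog', 'cat']
print(strict)  # ['cat', 'dog']
```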




## Examples

In [8]:
import re

# Text
text = "Python and python are amazing! ruby, rube, AI-2025, Room 101B, education."

# 1. Match "Python" or "python"
print("[Pp]ython →", re.findall(r"[Pp]ython", text))

# 2. Match "ruby" or "rube"
print("rub[ye] →", re.findall(r"rub[ye]", text))

# 3. Match any one lowercase vowel(a, e, i, o, u)
print("[aeiou] →", re.findall(r"[aeiou]", text))

# 4. Match any digit; same as [0123456789]
print("[0-9] →", re.findall(r"[0-9]", text))
print()

# 5. Match any lowercase ASCII letter
print("[a-z] →", re.findall(r"[a-z]", text))
print()

# 6. Match any uppercase ASCII letter
print("[A-Z] →", re.findall(r"[A-Z]", text))
print()

# 7. Match any of the defined (alphanumeric)
print("[a-zA-Z0-9] →", re.findall(r"[a-zA-Z0-9]", text))
print()

# 8. Match anything other than a lowercase vowel
print("[^aeiou] →", re.findall(r"[^aeiou]", text))
print()

# 9. Match anything other than a digit
print("[^0-9] →", re.findall(r"[^0-9]", text))

[Pp]ython → ['Python', 'python']
rub[ye] → ['ruby', 'rube']
[aeiou] → ['o', 'a', 'o', 'a', 'e', 'a', 'a', 'i', 'u', 'u', 'e', 'o', 'o', 'e', 'u', 'a', 'i', 'o']
[0-9] → ['2', '0', '2', '5', '1', '0', '1']

[a-z] → ['y', 't', 'h', 'o', 'n', 'a', 'n', 'd', 'p', 'y', 't', 'h', 'o', 'n', 'a', 'r', 'e', 'a', 'm', 'a', 'z', 'i', 'n', 'g', 'r', 'u', 'b', 'y', 'r', 'u', 'b', 'e', 'o', 'o', 'm', 'e', 'd', 'u', 'c', 'a', 't', 'i', 'o', 'n']

[A-Z] → ['P', 'A', 'I', 'R', 'B']

[a-zA-Z0-9] → ['P', 'y', 't', 'h', 'o', 'n', 'a', 'n', 'd', 'p', 'y', 't', 'h', 'o', 'n', 'a', 'r', 'e', 'a', 'm', 'a', 'z', 'i', 'n', 'g', 'r', 'u', 'b', 'y', 'r', 'u', 'b', 'e', 'A', 'I', '2', '0', '2', '5', 'R', 'o', 'o', 'm', '1', '0', '1', 'B', 'e', 'd', 'u', 'c', 'a', 't', 'i', 'o', 'n']

[^aeiou] → ['P', 'y', 't', 'h', 'n', ' ', 'n', 'd', ' ', 'p', 'y', 't', 'h', 'n', ' ', 'r', ' ', 'm', 'z', 'n', 'g', '!', ' ', 'r', 'b', 'y', ',', ' ', 'r', 'b', ',', ' ', 'A', 'I', '-', '2', '0', '2', '5', ',', ' ', 'R', 'm', ' ', ... (output truncated)

## Example 2

In [9]:
texts = [
  "python is a great language",
  "i lov to write in py",
  "what a cool language Python is",
  "the pyramids of giza are so huge!"
]


for text in texts:
    python_detected = re.findall(r"[Pp]ython|\b[Pp]y\b", text)
    if python_detected:
        print("talking about python")
    else:
        print("something else")

talking about python
talking about python
talking about python
something else


## Repetition Cases

| Example | description |
| --- | --- |
| happy? | Match "happ" or "happy": the y is optional |
| happy* | Match "happ" plus 0 or more y(s) |
| happy+ | Match "happ" plus 1 or more y(s) |
| \d{3} | Match exactly 3 digits |
| \d{3,} | Match 3 or more digits |
| \d{3,5} | Match 3, 4, or 5 digits |

## Example:

In [10]:
text = "it's 2018, happy new year!"

print(re.findall(r"\d", text)) # compare \d, \d+, \d*, and \d{4}
print(re.findall(r"\d+", text))
print(re.findall(r"\d*", text))
print(re.findall(r"\d{4}", text))

['2', '0', '1', '8']
['2018']
['', '', '', '', '', '2018', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
['2018']


## Let's replace phone numbers with PHONE and other numbers with NUM

In [26]:
text = "this is a text tweet that contains \
multiple numbers 01112131411, 012121212 and 01010101010"

# Replace 11-digit phone numbers first, then any remaining run of digits
new_text = re.sub(r"\d{11}", " PHONE ", text)
new_text = re.sub(r"\d+", " NUM ", new_text)
print(new_text)

this is a text tweet that contains multiple numbers  PHONE ,  NUM  and  PHONE 


## Search groups

You can create search groups with parentheses and retrieve each one with the `group()` method of the match object returned by `search()`.

In [5]:
import re
# Note: (this) matches the characters t, h, i, s together, in that order (a group),
# while [this] matches any single one of the characters t, h, i, or s (a set)
text = """this is a string$$.
this is a string$$.
this is a string$$.
This is another one"""
print(re.findall("$", text))
print(re.findall(r"\$", text))
print(re.findall(".", text))
print(re.findall(r"\.", text))
print(re.findall("[this]", text))
print(re.findall("(this)", text))

['']
['$', '$', '$', '$', '$', '$']
['t', 'h', 'i', 's', ' ', 'i', 's', ' ', 'a', ' ', 's', 't', 'r', 'i', 'n', 'g', '$', '$', '.', 't', 'h', 'i', 's', ' ', 'i', 's', ' ', 'a', ' ', 's', 't', 'r', 'i', 'n', 'g', '$', '$', '.', 't', 'h', 'i', 's', ' ', 'i', 's', ' ', 'a', ' ', 's', 't', 'r', 'i', 'n', 'g', '$', '$', '.', 'T', 'h', 'i', 's', ' ', 'i', 's', ' ', 'a', 'n', 'o', 't', 'h', 'e', 'r', ' ', 'o', 'n', 'e']
['.', '.', '.']
['t', 'h', 'i', 's', 'i', 's', 's', 't', 'i', 't', 'h', 'i', 's', 'i', 's', 's', 't', 'i', 't', 'h', 'i', 's', 'i', 's', 's', 't', 'i', 'h', 'i', 's', 'i', 's', 't', 'h']
['this', 'this', 'this']


## Match Email Example

In [28]:
email_address = 'Please contact us at: support@datacamp.com  support1@datacamp.com support2@datacamp.com'
match = re.search(r'([\w\.-]+)@([\w\.-]+)\.(\w{3})', email_address)  # \. matches a literal dot
if match:
    print(match.group()) # The whole matched text
    print(match.group(1)) # The username (group 1)
    print(match.group(2)) # The host (group 2)
    print(match.group(3)) # The top-level domain (group 3)

support@datacamp.com
support
datacamp
com
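`search()` stops at the first address, although the text contains three. As a hedged sketch: when a pattern contains groups, `findall` returns one tuple of groups per match, so all addresses can be collected at once. The pattern here escapes the dot before the top-level domain, which is an assumption about the intended pattern.

```python
import re

email_address = 'Please contact us at: support@datacamp.com  support1@datacamp.com support2@datacamp.com'

# Each tuple holds (username, host, top-level domain) for one address
parts = re.findall(r'([\w.-]+)@([\w.-]+)\.(\w{3})', email_address)

for user, host, tld in parts:
    print(user, host, tld)
```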


## `match` vs `search`

The `match()` function checks for a match only at the beginning of the string whereas the `search()` function checks for a match anywhere in the string.
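The difference shows up as soon as the pattern is not at the start of the string. A minimal sketch with a hypothetical sample text:

```python
import re

text = "learn python regex"  # hypothetical sample text

print(re.match(r"python", text))          # None: "python" is not at the start
print(re.search(r"python", text).span())  # (6, 12): found later in the string
print(re.match(r"learn", text).span())    # (0, 5): "learn" is at the start
```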

## The `re.compile()`

We can compile a regular expression once and reuse it instead of writing it multiple times.

In [13]:
mail_re = re.compile(r"[\w\.-]+@[\w\.-]+\.[\w\.-]+")

mails = [
    "this is a message with an email of:example.name@company.org",
    "my-email55@yahoo.com is the email you would like to use",
    "send me an email at:shortmail@long-company.net"
]

for mail in mails:
    print(mail_re.findall(mail))

['example.name@company.org']
['my-email55@yahoo.com']
['shortmail@long-company.net']
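A compiled pattern offers the same operations as the module-level functions, as methods. The sketch below reuses the same email pattern with `search` and `sub`; the sample message and the `<hidden>` placeholder are hypothetical.

```python
import re

mail_re = re.compile(r"[\w.-]+@[\w.-]+\.[\w.-]+")

message = "write to first@site.com or second@site.org"  # hypothetical sample text

first_mail = mail_re.search(message).group()  # first address only
masked = mail_re.sub("<hidden>", message)     # replace every address

print(first_mail)  # first@site.com
print(masked)      # write to <hidden> or <hidden>
```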


# Lab 3 : Extract Hashtags

Write a script to extract the hashtags from a tweet.

In [11]:
import re
tweet = "a tweet with no hashtag, but a #HASHTAG and another #cool one #هاشتاج"

hashtags = re.findall(r"#\S+", tweet)

if hashtags:
    print(hashtags)
else:
    print("no hashtags found")

['#HASHTAG', '#cool', '#هاشتاج']
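One caveat with `#\S+` is that it also captures punctuation attached to a hashtag. A possible variant, sketched here with a hypothetical tweet, uses `#\w+` instead; in Python 3, `\w` is Unicode-aware for `str` patterns, so Arabic hashtags still match.

```python
import re

tweet = "loving #python, #regex! and #هاشتاج."  # hypothetical sample tweet

with_punctuation = re.findall(r"#\S+", tweet)  # keeps the trailing , ! and .
words_only = re.findall(r"#\w+", tweet)        # stops at the first non-word character

print(with_punctuation)
print(words_only)
```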


# Exercise 1:
- Extract Information from Text
    - Write regex patterns to extract the following from the paragraph below:
        - All email addresses
        - All dates in format DD/MM/YYYY or DD-MM-YYYY
        - All numbers greater than or equal to 100
        - Text example:
            - Contact us at info@company.com or sales@shop.org. Our next event is on 25-10-2025, and registration closes 15/10/2025.  We already have 120 participants and 99 seats left!

# Exercise 2:
- Clean and Analyze Social Media Text
    - Given the text below, use regex to:
        - Remove all punctuation.
        - Extract all hashtags (#word).
        - Extract all mentions (@username).
        - Text example:
            - "Great session on #AI and #NLP today! Follow @DataLab and @AI_Community for updates."

# Example Text From Tesla

Elon musk's phone number is 9991116666, call him if you have any questions on dodgecoin. 
Tesla's revenue is 40 billion
Tesla's CFO number (999)-333-7777

Please contact us at: support@Tesla.com

Note 1 - Overview
Tesla, Inc. (“Tesla”, the “Company”, “we”, “us” or “our”) was incorporated in the State of Delaware on July 1, 2003. We design, develop, manufacture and sell high-performance fully electric vehicles and design, manufacture, install and sell solar energy generation and energy storage
products. Our Chief Executive Officer, as the chief operating decision maker (“CODM”), organizes our company, manages resource allocations and measures performance among two operating and reportable segments: (i) automotive and (ii) energy generation and storage.
Beginning in the first quarter of 2021, there has been a trend in many parts of the world of increasing availability and administration of vaccines
against COVID-19, as well as an easing of restrictions on social, business, travel and government activities and functions. On the other hand, infection
rates and regulations continue to fluctuate in various regions and there are ongoing global impacts resulting from the pandemic, including challenges
and increases in costs for logistics and supply chains, such as increased port congestion, intermittent supplier delays and a shortfall of semiconductor
supply. We have also previously been affected by temporary manufacturing closures, employment and compensation adjustments and impediments to
administrative activities supporting our product deliveries and deployments.
Note 2 - Summary of Significant Accounting Policies
Unaudited Interim Financial Statements
The consolidated balance sheet as of September 30, 2021, the consolidated statements of operations, the consolidated statements of
comprehensive income, the consolidated statements of redeemable noncontrolling interests and equity for the three and nine months ended September
30, 2021 and 2020 and the consolidated statements of cash flows for the nine months ended September 30, 2021 and 2020, as well as other information
disclosed in the accompanying notes, are unaudited. The consolidated balance sheet as of December 31, 2020 was derived from the audited
consolidated financial statements as of that date. The interim consolidated financial statements and the accompanying notes should be read in
conjunction with the annual consolidated financial statements and the accompanying notes contained in our Annual Report on Form 10-K for the year
ended December 31, 2020.

Thanks
Belal A. Hamed 
belalahmed.h7@gmail.com