# Some Examples of Regular Expressions

# Simple


## Example 1: Email Address Validation
* `email_pattern` is a regular expression (regex) pattern that checks whether a string looks like a valid email address.
* __re.match()__ matches the pattern against the string, starting at the beginning of the string.
* The code prints "Valid email." if the string matches, and "Invalid email." if it does not.

In [1]:
import re

email_pattern = r'^[\w\.-]+@[\w\.-]+\.\w+$'
email = "example@email.com"

if re.match(email_pattern, email):
    print("Valid email.")
else:
    print("Invalid email.")


Valid email.
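
One subtlety when __re.match()__ is combined with `$`: in Python, `$` also matches just before a trailing newline, so a string that ends in `\n` still passes. The sketch below illustrates this with a made-up input string and uses `re.fullmatch()` as one possible stricter alternative; it is an illustration, not a complete validator.

```python
import re

email_pattern = r'^[\w\.-]+@[\w\.-]+\.\w+$'
email_with_newline = "example@email.com\n"  # hypothetical input, e.g. a line read from a file

# `$` matches at the end of the string OR just before a trailing newline,
# so re.match() still reports a match here.
match_result = re.match(email_pattern, email_with_newline)

# re.fullmatch() requires the pattern to cover the entire string,
# including the newline, so it returns None.
fullmatch_result = re.fullmatch(email_pattern, email_with_newline)

print("re.match:    ", bool(match_result))
print("re.fullmatch:", bool(fullmatch_result))
```

If trailing newlines are possible in the input, stripping the string first or using `re.fullmatch()` avoids the surprise.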


Explanation:
* `email_pattern`: A stricter, commonly used regex pattern for email validation.
* Before `@`, the local part may contain letters, digits, underscores `(_)`, periods `(.)`, plus signs `(+)`, and minus signs `(-)`.
* After `@`, the domain name accepts only letters, digits, and minus signs `(-)`. Because periods are not allowed in this part, an address with a subdomain such as `user@mail.example.com` does not match.
* The address must end with a period and a top-level domain (such as `.com` or `.id`) of at least 2 letters.
* The __re.match()__ function matches from the beginning of the string.

In [4]:
import re

email_pattern = r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z]{2,}$'
email = "example@email.com"

if re.match(email_pattern, email):
    print("Valid email.")
else:
    print("Invalid email.")

Valid email.


Explanation:
* In this example, `email_pattern` matches email addresses written in a common format.
* The pattern checks the characters before and after the `@` sign, as well as the top-level domain.
* If the email address matches this pattern, the code reports it as valid.
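
To see how the two email patterns differ in practice, the sketch below runs both against a few made-up addresses. The names `loose_pattern`, `strict_pattern`, and `samples` are introduced here only for illustration. Note that the second pattern is not stricter in every respect: it accepts a plus sign in the local part, which the first pattern rejects.

```python
import re

loose_pattern = r'^[\w\.-]+@[\w\.-]+\.\w+$'                          # first pattern
strict_pattern = r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z]{2,}$'  # second pattern

# Hypothetical test addresses chosen to expose the differences
samples = [
    "example@email.com",
    "user.name+tag@example.com",  # plus sign in the local part
    "user@mail.example.com",      # subdomain
    "user@example.c",             # one-letter top-level domain
]

results = {}
for address in samples:
    loose_ok = bool(re.match(loose_pattern, address))
    strict_ok = bool(re.match(strict_pattern, address))
    results[address] = (loose_ok, strict_ok)
    print(f"{address:28} first: {loose_ok!s:5}  second: {strict_ok!s:5}")
```

Neither pattern covers every valid address, so which one fits depends on the addresses the application expects.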

## Example 2: Extracting Domains from Email Addresses
* __re.search(r'@([\w\.-]+)', email)__ searches for the pattern `@` followed by one or more characters: letters, numbers, underscore `(_)`, period `(.)`, or minus sign `(-)`.
* __.group(1)__ returns the portion enclosed in parentheses `()`, that is, just the domain, without the `@`.

In [5]:
import re

email = "example@rocketmail.com"

# Find the domain part after the '@' symbol
domain = re.search(r'@([\w\.-]+)', email).group(1)

print("Domain:", domain)

Domain: rocketmail.com
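
Calling `.group(1)` directly on the result of __re.search()__ works only when a match exists; if the string has no `@`, the result is `None` and the call raises `AttributeError`. A minimal sketch of a guarded version follows, where `extract_domain` is a hypothetical helper name used only for this illustration:

```python
import re

def extract_domain(email):
    """Return the text after '@', or None when the pattern is not found."""
    match = re.search(r'@([\w\.-]+)', email)
    if match is None:
        return None
    return match.group(1)

print(extract_domain("example@rocketmail.com"))  # rocketmail.com
print(extract_domain("not-an-email"))            # None
```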


Explanation:
* `r'@([a-zA-Z0-9.-]+)'`: This pattern searches for the `@` symbol followed by one or more letters, digits, periods, or minus signs.
* `.group(1)`: Returns only the part inside the parentheses (here `rocketmail.com`, without the `@`).
* `.group()` (called without an argument): Returns the entire match, here `@rocketmail.com`.
* `re.search(...)`: Returns a match object that records the matched text and its position (`span`), or `None` if the pattern is not found.

In [6]:
import re

email = "example@rocketmail.com"

# Find the domain part after the '@' symbol
domain = re.search(r'@([a-zA-Z0-9.-]+)', email).group(1)

print("Domain:", domain)  # Prints domain only (without '@')

# Print the match object
print(re.search(r'@([a-zA-Z0-9.-]+)', email))

# Print the full match including '@'
print(re.search(r'@([a-zA-Z0-9.-]+)', email).group())

Domain: rocketmail.com
<re.Match object; span=(7, 22), match='@rocketmail.com'>
@rocketmail.com
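
The position shown as `span=(7, 22)` in the match object can also be read programmatically. The short sketch below uses the match object's `span()`, `start()`, and `end()` methods, which accept an optional group number:

```python
import re

email = "example@rocketmail.com"
match = re.search(r'@([a-zA-Z0-9.-]+)', email)

print(match.span())    # (7, 22): start and end index of the whole match
print(match.span(1))   # (8, 22): start and end index of group 1
print(email[match.start(1):match.end(1)])  # rocketmail.com
```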


# Intermediate

## Example 1: Extracting All Links from an HTML Page
* __re.findall(...)__: Returns all strings that match a regular expression pattern.
* __r'href=["\'](https?://\S+)["\']'__: Finds the value of the href attribute that contains a link starting with `http://` or `https://`.
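* `\S+`: Matches one or more non-whitespace characters. Quotation marks are non-whitespace too, so the captured link ends at the last quotation mark in that run of characters, which is the closing quotation mark in the example below.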

In [7]:
import re

html = """
<a href="https://example.com">Link 1</a>
<a href="https://openai.com">Link 2</a>
<p>Not a link</p>
"""

# Find all links within href attributes
links = re.findall(r'href=["\'](https?://\S+)["\']', html)

# Print each found link
for link in links:
    print("Link:", link)


Link: https://example.com
Link: https://openai.com


Explanation:
* __re.findall(...)__ searches for all matches of a pattern in an HTML string.
* The pattern `r'href=["\'](https?://\S+)["\']'` searches for links in the href attribute.
* `https?` includes both http and https.
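* `\S+` means one or more non-whitespace characters. Quotation marks are non-whitespace too, so the captured link ends at the last quotation mark in that run of characters, which is the closing quotation mark in this example.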
* __print(links)__ displays the entire extracted list.

In [10]:
import re

html = """
<a href="https://example.com">Link 1</a>
<a href="https://openai.com">Link 2</a>
<p>Not a link</p>
"""

# Find all URLs from href attributes starting with http or https
links = re.findall(r'href=["\'](https?://\S+)["\']', html)

# Print each link one by one
for link in links:
    print("Link:", link)

# Print the full list of links
print(links)

Link: https://example.com
Link: https://openai.com
['https://example.com', 'https://openai.com']
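
The `["\']` and `https?` parts of the pattern can be checked with a small variation of the input. The HTML below is made up for illustration: it mixes single and double quotes, and includes two links that the pattern is expected to skip.

```python
import re

# Hypothetical HTML with different quote styles and URL schemes
html = """
<a href='http://example.org'>Single quotes, plain http</a>
<a href="https://example.com">Double quotes, https</a>
<a href="ftp://example.net">Other scheme</a>
<a href="/about">Relative link</a>
"""

links = re.findall(r'href=["\'](https?://\S+)["\']', html)
print(links)  # ['http://example.org', 'https://example.com']
```

The `ftp://` and relative links are skipped because the capture group requires `http://` or `https://`.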


Explanation:
* `href=["\']`: Matches the literal text `href=` followed by a quotation mark (either `"` or `'`).
* `(https?://[^\s"\'<>]+)`:
  - `https?://`: Matches either `http://` or `https://`.
  - `[^\s"\'<>]+`: Matches one or more characters other than whitespace, quotation marks (`"` or `'`), and angle brackets (`<` or `>`), so the capture stops at the closing quotation mark.
* `["\']`: Matches the closing quotation mark.

In [11]:
import re

html = """
<a href="https://example.com">Link 1</a>
<a href="https://openai.com">Link 2</a>
<p>Not a link</p>
"""

# Find all valid URLs from href attributes
links = re.findall(r'href=["\'](https?://[^\s"\'<>]+)["\']', html)

# Print all found links
for link in links:
    print("Link:", link)


Link: https://example.com
Link: https://openai.com
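
Both patterns give the same result on the sample HTML, so the difference only shows on less tidy input. The sketch below uses a made-up tag with no whitespace between two attributes; `greedy` and `bounded` are names chosen here for the two results.

```python
import re

# Hypothetical tag with no whitespace between two attributes
html = '<a href="https://example.com"class="btn">Link</a>'

# \S+ also matches quotation marks, so it runs past the end of the URL
greedy = re.findall(r'href=["\'](https?://\S+)["\']', html)

# [^\s"\'<>]+ cannot cross a quotation mark, so it stops at the end of the URL
bounded = re.findall(r'href=["\'](https?://[^\s"\'<>]+)["\']', html)

print(greedy)   # ['https://example.com"class="btn']
print(bounded)  # ['https://example.com']
```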


## Example 2: Extracting URLs from HTML Text
* `<a [^>]*`: Matches the opening `<a ` of an anchor tag, followed by zero or more characters other than `>`.
* `href=["\']`: Matches the `href=` attribute followed by a single or double quotation mark.
* `(https?://\S+)`: Captures a link that starts with `http://` or `https://`, followed by one or more non-whitespace characters.
* `["\']`: Matches the closing quotation mark.

In [12]:
import re

html = """
Visit our website at <a href="https://example.com">Example</a>
Click <a href="https://openai.com">here</a> for more info.
"""

# Find all URLs from <a> elements containing href
links = re.findall(r'<a [^>]*href=["\'](https?://\S+)["\']', html)

# Print all found links
for link in links:
    print("Link:", link)


Link: https://example.com
Link: https://openai.com
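
The `<a [^>]*` prefix matters when other tags also carry an `href` attribute, or when the `<a>` tag has attributes before `href`. The sketch below uses made-up HTML to compare the prefixed pattern with the plain `href` pattern from the previous example; `any_href` and `anchor_href` are illustrative names.

```python
import re

# Hypothetical HTML: a <link> tag and an <a> tag that both carry href
html = """
<link rel="stylesheet" href="https://example.com/style.css">
<a class="nav" href="https://example.com">Home</a>
"""

# Plain href pattern: matches href in any tag
any_href = re.findall(r'href=["\'](https?://\S+)["\']', html)

# Prefixed pattern: matches href only inside an <a ...> tag
anchor_href = re.findall(r'<a [^>]*href=["\'](https?://\S+)["\']', html)

print(any_href)     # ['https://example.com/style.css', 'https://example.com']
print(anchor_href)  # ['https://example.com']
```

For anything beyond simple, well-formed snippets, an HTML parser is usually a more robust choice than a regex.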


# Advanced

## Example 1: Validating Phone Numbers with Different Formats
* `^` and `$`: Mark the beginning and end of the string.
* `\+`: The number must start with a `+` sign.
* `\d{1,3}`: A country code of 1 to 3 digits.
* `\s?`: An optional whitespace character after the country code.
* `\d{1,3}[-. ]\d{1,4}[-. ]\d{1,4}`: Three groups of digits separated by `-`, `.`, or a space.

In [13]:
import re

phone_numbers = ["+1 123-456-7890", "555-5555", "123.456.7890", "+44 20 7123 1234"]

# Regex pattern to match international phone numbers
phone_pattern = r'^\+\d{1,3}\s?\d{1,3}[-. ]\d{1,4}[-. ]\d{1,4}$'

# Loop through each phone number
for phone in phone_numbers:
    if re.match(phone_pattern, phone):
        print(f"Phone Number '{phone}' is valid.")
    else:
        print(f"Phone Number '{phone}' is not valid.")


Phone Number '+1 123-456-7890' is valid.
Phone Number '555-5555' is not valid.
Phone Number '123.456.7890' is not valid.
Phone Number '+44 20 7123 1234' is valid.
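
A pattern with this many parts can be easier to read when written with the `re.VERBOSE` flag, which allows whitespace and comments inside the pattern. The sketch below is an equivalent rewrite of the pattern above, not a different validation rule.

```python
import re

# Same pattern as above, written with re.VERBOSE so each part can carry a comment.
# In verbose mode, whitespace outside character classes is ignored,
# but the space inside [-. ] is still matched literally.
phone_pattern = re.compile(r"""
    ^\+         # literal plus sign at the start
    \d{1,3}     # country code: 1 to 3 digits
    \s?         # optional whitespace
    \d{1,3}     # first group: 1 to 3 digits
    [-. ]       # separator: hyphen, period, or space
    \d{1,4}     # second group: 1 to 4 digits
    [-. ]       # separator
    \d{1,4}$    # last group: 1 to 4 digits, then end of string
""", re.VERBOSE)

phone_numbers = ["+1 123-456-7890", "555-5555", "123.456.7890", "+44 20 7123 1234"]
results = [bool(phone_pattern.match(phone)) for phone in phone_numbers]
print(results)  # [True, False, False, True]
```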


The next cell checks whether the string "+1 123-456-7890" (`phone_numbers[0]`) matches the international phone number pattern `phone_pattern`. __re.match()__ returns a match object when the string matches and `None` when it does not.

In [14]:
import re

phone_numbers = ["+1 123-456-7890", "555-5555", "123.456.7890", "+44 20 7123 1234"]

phone_pattern = r'^\+\d{1,3}\s?\d{1,3}[-. ]\d{1,4}[-. ]\d{1,4}$'

print(re.match(phone_pattern, phone_numbers[0]))

<re.Match object; span=(0, 15), match='+1 123-456-7890'>


Explanation:
* `^` and `$`: ensure a match from the beginning to the end of the string
* `\+?`: optional + character at the beginning (country code)
* `(\d{1,3})?`: country code (1–3 digits) → optional
* `[-.\s]?`: optional separator (can be -, . or space)
* `(\d{1,4})?`: beginning of the phone number → optional
* `[-.\s]?`: another optional separator
* `(\d{1,4})`: second part of the number (1–4 digits, required)
* `[-.\s]?`: another optional separator
* `(\d{1,9})`: last part of the number (1–9 digits, required)

In [15]:
import re

phone_numbers = ["+1 123-456-7890", "555-5555", "123.456.7890", "+44 20 7123 1234"]

phone_pattern = r'^\+?(\d{1,3})?[-.\s]?(\d{1,4})?[-.\s]?(\d{1,4})[-.\s]?(\d{1,9})$'

for phone in phone_numbers:
    if re.match(phone_pattern, phone):
        print(f"Phone number '{phone}' is valid.")
    else:
        print(f"Phone number '{phone}' is not valid.")


Phone number '+1 123-456-7890' is valid.
Phone number '555-5555' is valid.
Phone number '123.456.7890' is valid.
Phone number '+44 20 7123 1234' is valid.
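
Making most parts optional lets the pattern accept all four sample numbers, but it also makes the pattern very permissive: only the last two digit groups are required. The sketch below probes this with a few made-up inputs; `accepted` is an illustrative name.

```python
import re

phone_pattern = r'^\+?(\d{1,3})?[-.\s]?(\d{1,4})?[-.\s]?(\d{1,4})[-.\s]?(\d{1,9})$'

# Hypothetical inputs that are clearly not complete phone numbers
accepted = {text: bool(re.match(phone_pattern, text)) for text in ["12", "1", "abc"]}
print(accepted)  # {'12': True, '1': False, 'abc': False}

# The capture groups do not necessarily line up with meaningful parts of the number
groups = re.match(phone_pattern, "555-5555").groups()
print(groups)  # ('555', '55', '5', '5')
```

Any string of two or more digits passes, so a stricter pattern or an additional length check may be needed, depending on which formats must be supported.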


## Example 2: Extracting Area Codes from Phone Numbers
* __re.search(r'\+(\d{1,3})', phone_number)__ searches for a plus sign `+` followed by 1 to 3 digits. In an international number these digits are the country calling code, which the code below labels "Area Code".
* `group(1)` retrieves the first capture group, which is the digits after the plus sign.
* In the example `+62 123-456-7890`, the `+62` part matches the pattern, so the result is `62`, the country calling code for Indonesia.

In [16]:
import re

phone_number = "+62 123-456-7890"

area_code = re.search(r'\+(\d{1,3})', phone_number).group(1)

print("Area Code:", area_code)


Area Code: 62


In [17]:
import re

phone_numbers = [
    "+62 123-456-7890",
    "+1 555-555-5555",
    "+44 20 7123 1234",
    "+81-3-1234-5678",
    "0812-3456-7890"  # No country code
]

for phone in phone_numbers:
    match = re.search(r'^\+(\d{1,3})', phone)
    if match:
        print(f"Number: {phone}, Area Code: {match.group(1)}")
    else:
        print(f"Number: {phone}, Area Code: Not found")


Number: +62 123-456-7890, Area Code: 62
Number: +1 555-555-5555, Area Code: 1
Number: +44 20 7123 1234, Area Code: 44
Number: +81-3-1234-5678, Area Code: 81
Number: 0812-3456-7890, Area Code: Not found
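
The pattern `\+(\d{1,3})` relies on a separator (space or hyphen) following the code. When a number is written without one, the greedy `\d{1,3}` simply takes the first three digits. The sketch below shows this with two made-up compact numbers:

```python
import re

# Hypothetical numbers written without a separator after the country code
compact_numbers = ["+15555555555", "+6281234567890"]

codes = [re.search(r'^\+(\d{1,3})', phone).group(1) for phone in compact_numbers]
print(codes)  # ['155', '628']
```

The intended codes here would be `1` and `62`. Telling them apart requires a list of valid country codes, which is beyond what this pattern can express on its own.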
