
Regex

# Python Regular Expressions (Regex) Special Sequences

Special sequences in Python regex consist of a backslash `\` followed by a character, and carry a special meaning instead of matching that character literally. They are most often used to match predefined sets of characters (digits, whitespace, word characters) or positions (word boundaries, start and end of the string).

| Character | Description                                                                 | Example (Regex) |
| :-------- | :-------------------------------------------------------------------------- | :-------------- |
| `\`       | Escape special characters or indicate a special sequence.                     | `\.` (matches a literal dot), `\\` (matches a literal backslash) |
| `\d`      | Matches any decimal digit. With `re.ASCII` this is exactly `[0-9]`; for `str` patterns without that flag it also matches digits from other scripts. | `\d+` (matches one or more digits) |
| `\D`      | Matches any non-digit character (the complement of `\d`). With `re.ASCII`, equivalent to `[^0-9]`. | `\D+` (matches one or more non-digits) |
| `\s`      | Matches any whitespace character (space, tab, newline, carriage return, form feed, vertical tab). | `\s+` (matches one or more whitespace characters) |
| `\S`      | Matches any non-whitespace character.                                       | `\S+` (matches one or more non-whitespace characters) |
| `\w`      | Matches any word character (alphanumeric + underscore). With `re.ASCII`, equivalent to `[a-zA-Z0-9_]`; otherwise it also matches Unicode letters and digits. | `\w+` (matches one or more word characters) |
| `\W`      | Matches any non-word character (the complement of `\w`). With `re.ASCII`, equivalent to `[^a-zA-Z0-9_]`. | `\W+` (matches one or more non-word characters) |
| `\b`      | Matches a word boundary. The position between a word character and a non-word character, or the start/end of the string. | `\bword\b` (matches the whole word "word") |
| `\B`      | Matches a non-word boundary. The position between two word characters or two non-word characters. | `\Bword\B` (matches "word" only when it has word characters on both sides, as in "keywords"; it does not match in "keyword", where "word" ends the word) |
| `\A`      | Matches the start of the string.                                            | `\AHello` (matches "Hello" only if it's at the very beginning of the string) |
| `\Z`      | Matches the end of the string.                                              | `World\Z` (matches "World" only if it's at the very end of the string) |
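
A few of the sequences from the table can be tried on small strings. This is a minimal sketch; the sample strings and variable names are made up for illustration and are not part of the notebook's data.

```python
import re

sample = "Order 66 shipped to unit_7 on 2024-05-01."

# \d+ collects each run of consecutive digits
digit_runs = re.findall(r"\d+", sample)
print(digit_runs)  # ['66', '7', '2024', '05', '01']

# \b and \B are zero-width: they match positions, not characters
words = "cat concat cats scatter"
whole_word_starts = [m.start() for m in re.finditer(r"\bcat\b", words)]
inner_starts = [m.start() for m in re.finditer(r"\Bcat\B", words)]
print(whole_word_starts)  # [0]   only the standalone "cat"
print(inner_starts)       # [17]  only the "cat" inside "scatter"

# For str patterns, \d also matches non-ASCII decimal digits unless re.ASCII is set
mixed = "3\u0663"  # ASCII 3 followed by ARABIC-INDIC DIGIT THREE
unicode_digits = re.findall(r"\d", mixed)
ascii_digits = re.findall(r"\d", mixed, flags=re.ASCII)
print(len(unicode_digits), len(ascii_digits))  # 2 1
```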

**Note:** `\Z` matches the end of the string. `$` also matches the end of the string, but it *also* matches immediately before a newline at the end of the string. For matching strictly the very end, `\Z` is more precise. Similarly, `^` matches the start of the string (like `\A`), and also immediately after a newline in multiline mode. Use `\A` and `\Z` for unambiguous start/end-of-string matching regardless of multiline flags.
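
The difference between the anchors described in the note can be checked directly. A small sketch, with illustrative strings that are not from the notebook:

```python
import re

line = "Hello World\n"

# $ tolerates one trailing newline; \Z insists on the true end of the string
dollar_matches = re.search(r"World$", line) is not None
z_matches = re.search(r"World\Z", line) is not None
print(dollar_matches, z_matches)  # True False

block = "one\ntwo\nthree"

# ^ gains extra match positions in multiline mode; \A never does
caret_default = re.findall(r"^\w+", block)
caret_multiline = re.findall(r"^\w+", block, flags=re.MULTILINE)
a_multiline = re.findall(r"\A\w+", block, flags=re.MULTILINE)
print(caret_default)    # ['one']
print(caret_multiline)  # ['one', 'two', 'three']
print(a_multiline)      # ['one']
```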

# Python Regular Expressions (Regex) Special Characters / Constructs

Beyond the special sequences (`\d`, `\s`, etc.), regex uses several special characters and constructs to define patterns for repetition, position, grouping, and alternatives. These are often combined with character classes or other patterns.

| Special Character(s) | Description                                                                 | Example (Regex)      |
| :------------------- | :-------------------------------------------------------------------------- | :------------------- |
| `.`                  | Matches any single character *except* newline by default.                    | `a.b` (matches "aXb" where X is any char except newline) |
| `^`                  | Matches the start of the string. In multiline mode, also matches after a newline. | `^Start` (matches "Start" only at the beginning) |
| `$`                  | Matches the end of the string, or just before a newline that ends the string. In multiline mode, also matches before every newline. | `End$` (matches "End" only at the end) |
| `*`                  | Matches the preceding element zero or more times (greedy).                   | `a*` (matches "", "a", "aa", "aaa", etc.) |
| `+`                  | Matches the preceding element one or more times (greedy).                    | `a+` (matches "a", "aa", "aaa", etc., but not "") |
| `?`                  | Matches the preceding element zero or one time (optional, greedy).           | `a?` (matches "a" or "") |
| `{n}`                | Matches the preceding element exactly `n` times.                             | `\d{3}` (matches exactly 3 digits) |
| `{n,}`               | Matches the preceding element `n` or more times (greedy).                    | `a{2,}` (matches "aa", "aaa", "aaaa", etc.) |
| `{n,m}`              | Matches the preceding element between `n` and `m` times, inclusive (greedy). | `[0-9]{1,5}` (matches 1 to 5 digits) |
| `*?`                 | Matches the preceding element zero or more times (non-greedy).               | `a*?` |
| `+?`                 | Matches the preceding element one or more times (non-greedy).                | `a+?` |
| `??`                 | Matches the preceding element zero or one time (optional, non-greedy).       | `a??` |
| `{n,}?`              | Matches the preceding element `n` or more times (non-greedy).                | `a{2,}?` |
| `{n,m}?`             | Matches the preceding element between `n` and `m` times, inclusive (non-greedy). | `[0-9]{1,5}?` |
| `[...]`              | Defines a character set. Matches any single character within the brackets.   | `[abc]` (matches "a", "b", or "c") |
| `[^...]`             | Defines a negated character set. Matches any single character *not* within the brackets. | `[^0-9]` (matches any single non-digit character) |
| `(` `)`              | Groups expressions together, treating them as a single unit. Also creates a capturing group. | `(ab)+` (matches "ab", "abab", "ababab", etc.) |
| `(?:` `)`            | Groups expressions together without creating a capturing group (non-capturing group). | `(?:ab)+` (matches "ab", "abab", etc., but doesn't capture the group) |
| `\|`                 | Acts as OR. Matches the expression before or after the `\|`.                 | `cat\|dog` (matches either "cat" or "dog") |
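
Several rows of the table can be exercised in a few lines. The sketch below uses made-up inputs; note in particular how `re.findall` behaves differently for capturing and non-capturing groups.

```python
import re

# {n} with fullmatch: the whole string must be exactly three digits
three_ok = re.fullmatch(r"\d{3}", "123") is not None
three_bad = re.fullmatch(r"\d{3}", "1234") is not None
print(three_ok, three_bad)  # True False

# {n,m} is greedy: it takes up to five digits at a time
chunks = re.findall(r"[0-9]{1,5}", "7 1234567")
print(chunks)  # ['7', '12345', '67']

# | tries the left alternative first at each position
animals = re.findall(r"cat|dog", "cat, dog, catdog")
print(animals)  # ['cat', 'dog', 'cat', 'dog']

# With a capturing group, findall returns the group (its last repetition),
# not the whole match; a non-capturing group returns the whole match
captured = re.findall(r"(ab)+", "ababab x ab")
whole = re.findall(r"(?:ab)+", "ababab x ab")
print(captured)  # ['ab', 'ab']
print(whole)     # ['ababab', 'ab']
```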


**Note:**
* **Greedy vs. Non-Greedy:** By default, quantifiers (`*`, `+`, `?`, `{n,}`, `{n,m}`) are "greedy". They try to match the *longest* possible string. Adding a `?` after a quantifier makes it "non-greedy" or "lazy", meaning it tries to match the *shortest* possible string.
* To match a literal special character (like `*`, `+`, `?`, `.`, `^`, `$`, `(`, `)`, `[`, `]`, `|`, `\`), you must escape it with a backslash (`\`). Example: `\$` matches a literal dollar sign.
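
Both points in the note can be seen on short strings. This is an illustrative sketch; the inputs are invented, and `re.escape` is shown as one convenient way to escape a literal rather than the only one.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ runs to the last '>' in the string
greedy = re.findall(r"<.+>", html)
# Lazy: .+? stops at the first '>' it can
lazy = re.findall(r"<.+?>", html)
print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']

# Escaping: \$ and \. match a literal dollar sign and a literal dot
price_text = "Total: $4.85 (approx.)"
prices = re.findall(r"\$\d+\.\d+", price_text)
print(prices)  # ['$4.85']

# re.escape builds the escaped form of a literal string
escaped = re.escape("$4.85")
print(escaped)  # \$4\.85
```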

In [6]:
import re

Extracting phone numbers

In [None]:
text = '''My contact number is 9999999999, and my US contact number is (910)-666-8888'''

Extracting the Indian number

In [None]:
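# Raw string (r'...') so Python passes the backslash to the regex engine unchanged
pattern = r'\d'  # a single digit
matches = re.findall(pattern, text)
print(matches)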


['9', '9', '9', '9', '9', '9', '9', '9', '9', '9', '9', '1', '0', '6', '6', '6', '8', '8', '8', '8']


In [None]:
pattern = r'\d\d\d\d'  # four consecutive digits; matches do not overlap
matches = re.findall(pattern, text)
print(matches)

['9999', '9999', '8888']


In [None]:
pattern = r'\d{10}'  # exactly ten consecutive digits
matches = re.findall(pattern, text)
print(matches)

['9999999999']
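
One caveat worth checking: `\d{10}` will also match the first ten digits of a longer digit run. A possible refinement, sketched here on a made-up string, is to wrap the pattern in `\b` word boundaries so that only standalone 10-digit numbers match. Whether that is the right rule depends on the data.

```python
import re

# Hypothetical sample: one valid 10-digit number and one 12-digit run
sample = "Valid: 9999999999, too long: 123456789012"

loose = re.findall(r"\d{10}", sample)
strict = re.findall(r"\b\d{10}\b", sample)

print(loose)   # ['9999999999', '1234567890']
print(strict)  # ['9999999999']
```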


Extracting the US number

In [None]:
# \( and \) match literal parentheses; unescaped they would create a group
pattern = r'\(\d{3}\)-\d{3}-\d{4}'
matches = re.findall(pattern, text)
print(matches)

['(910)-666-8888']


Extracting both numbers with alternation

In [None]:
# Either ten plain digits OR the (xxx)-xxx-xxxx format
pattern = r'\d{10}|\(\d{3}\)-\d{3}-\d{4}'
matches = re.findall(pattern, text)
print(matches)

['9999999999', '(910)-666-8888']
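
When two formats are combined with `|`, it can be useful to know which alternative produced each match. One way to do that, sketched below, is to give each alternative a named group and read `lastgroup` from the match object. The group names `plain` and `us` are hypothetical labels chosen for this example.

```python
import re

text = "My contact number is 9999999999, and my US contact number is (910)-666-8888"

# Each alternative is wrapped in a named group so the match can report its origin
pattern = re.compile(r"(?P<plain>\d{10})|(?P<us>\(\d{3}\)-\d{3}-\d{4})")

labelled = [(m.lastgroup, m.group()) for m in pattern.finditer(text)]
print(labelled)  # [('plain', '9999999999'), ('us', '(910)-666-8888')]
```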


Extracting Title from text

In [1]:
Text='''Model S
Main article: Tesla Model S

Tesla Model S
The Model S is a full-size car with a liftback body style and a dual motor, all-wheel drive layout. Development of the Model S began before 2007 and deliveries started in June 2012. The Model S has seen two major design refreshes, first in April 2016, which introduced a new front-end design and again in June 2021, which revised the interior. The Model S was the top-selling plug-in electric car worldwide in 2015 and 2016. More than 250,000 vehicles have been sold as of December 2018 (when Tesla merged production numbers for the Model S and Model X).

Model X
Main article: Tesla Model X

Tesla Model X
The Model X is a mid-size luxury crossover SUV offered in 5-, 6- and 7-passenger configurations with either a dual- or trimotor, all-wheel drive layout. The rear passenger doors open vertically with an articulating "falcon-wing" design. A prototype Model X was first shown in February 2012 and deliveries started in September 2015.[101] The Model X shares around 30 percent of its content with the Model S. The vehicle has seen one major design refresh in June 2021 which revised the interior.

Model 3
Main article: Tesla Model 3

Tesla Model 3
The Model 3 is a mid-size car with a fastback body style and either a dual-motor, all-wheel drive layout or a rear-motor, rear-wheel drive layout. The vehicle was designed to be more affordable than the luxury Model S sedan. A prototype Model 3 was first shown in 2016 and within a week, the company received over 325,000 paid reservations.[49] Deliveries started in July 2017.[102] The Model 3 ranked as the world's best-selling electric car from 2018 to 2021,[103][104][105] and cumulative sales passed 1 million in June 2021.[106] The vehicle has seen one major design refresh in September 2023 which revised the exterior and interior.

Model Y
Main article: Tesla Model Y

Tesla Model Y
The Model Y is a mid-size crossover SUV offered in 5- and 7-passenger configurations with a single‐motor, rear-wheel drive or a dual-motor, all-wheel drive layout. The vehicle was designed to be more affordable than the luxury Model X SUV. A prototype Model Y was first shown in March 2019,[63] and deliveries started in March 2020.[65] The Model Y shared around 75 percent of its content with the Model 3.[64] In the first quarter of 2023, the Model Y outsold the Toyota Corolla to become the world's best-selling car, the first electric vehicle to claim the title.[107]

Tesla Semi
Main article: Tesla Semi

Tesla Semi prototype
The Tesla Semi is a Class 8 semi-truck by Tesla, Inc. with a tri-motor, rear-wheel drive layout. Tesla claims that the Semi has approximately three times the power of a typical diesel semi truck, and a range of 500 miles (800 km).[108] Two prototype trucks were first shown in November 2017 and initial deliveries were made to PepsiCo on December 1, 2022.[109] Tesla stated in April 2024 that it plans full production in late 2025.[110]
 '''

In [3]:
print(Text)

Model S
Main article: Tesla Model S

Tesla Model S
The Model S is a full-size car with a liftback body style and a dual motor, all-wheel drive layout. Development of the Model S began before 2007 and deliveries started in June 2012. The Model S has seen two major design refreshes, first in April 2016, which introduced a new front-end design and again in June 2021, which revised the interior. The Model S was the top-selling plug-in electric car worldwide in 2015 and 2016. More than 250,000 vehicles have been sold as of December 2018 (when Tesla merged production numbers for the Model S and Model X).

Model X
Main article: Tesla Model X

Tesla Model X
The Model X is a mid-size luxury crossover SUV offered in 5-, 6- and 7-passenger configurations with either a dual- or trimotor, all-wheel drive layout. The rear passenger doors open vertically with an articulating "falcon-wing" design. A prototype Model X was first shown in February 2012 and deliveries started in September 2015.[101] The M

In [22]:
Text

'Model S\nMain article: Tesla Model S\n\nTesla Model S\nThe Model S is a full-size car with a liftback body style and a dual motor, all-wheel drive layout. Development of the Model S began before 2007 and deliveries started in June 2012. The Model S has seen two major design refreshes, first in April 2016, which introduced a new front-end design and again in June 2021, which revised the interior. The Model S was the top-selling plug-in electric car worldwide in 2015 and 2016. More than 250,000 vehicles have been sold as of December 2018 (when Tesla merged production numbers for the Model S and Model X).\n\nModel X\nMain article: Tesla Model X\n\nTesla Model X\nThe Model X is a mid-size luxury crossover SUV offered in 5-, 6- and 7-passenger configurations with either a dual- or trimotor, all-wheel drive layout. The rear passenger doors open vertically with an articulating "falcon-wing" design. A prototype Model X was first shown in February 2012 and deliveries started in September 2015.

In [27]:
pattern='Main article: Tesla Model .|Main article: Tesla [^\n]*'
matches=re.findall(pattern,Text)
print(matches)

['Main article: Tesla Model S', 'Main article: Tesla Model X', 'Main article: Tesla Model 3', 'Main article: Tesla Model Y', 'Main article: Tesla Semi']


In [31]:
pattern='Main article: (Tesla Model .)|Main article: (Tesla [^\n]*)'
matches=re.findall(pattern,Text)
print(matches)

[('Tesla Model S', ''), ('Tesla Model X', ''), ('Tesla Model 3', ''), ('Tesla Model Y', ''), ('', 'Tesla Semi')]
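
The tuples appear because the pattern has two capturing groups: `re.findall` returns one tuple per match with one slot per group, and the group that did not take part is an empty string. Two ways to get a flat list are sketched below on a shortened stand-in for `Text` (the variable `sample` is made up for this example).

```python
import re

sample = "Main article: Tesla Model S\nMain article: Tesla Semi\n"

pairs = re.findall(r"Main article: (Tesla Model .)|Main article: (Tesla [^\n]*)", sample)
print(pairs)  # [('Tesla Model S', ''), ('', 'Tesla Semi')]

# Option 1: keep whichever slot of each tuple is non-empty
titles = [first or second for first, second in pairs]
print(titles)  # ['Tesla Model S', 'Tesla Semi']

# Option 2: a single group that covers both cases
single = re.findall(r"Main article: (Tesla [^\n]*)", sample)
print(single)  # ['Tesla Model S', 'Tesla Semi']
```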


In [40]:

pattern='Main article: (Tesla Model .)| Main article: (Tesla [^\n]*)'
matches=re.findall(pattern,Text)
matches

[('Tesla Model S', ''),
 ('Tesla Model X', ''),
 ('Tesla Model 3', ''),
 ('Tesla Model Y', '')]
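
"Tesla Semi" is missing from this result because the second alternative now starts with a space (`| Main article: ...`). Whitespace inside a pattern is matched literally, and in the text "Main article" is preceded by a newline, not a space. The effect can be reproduced on a two-line stand-in for `Text` (the name `sample` is made up for this sketch):

```python
import re

sample = "Main article: Tesla Model Y\nMain article: Tesla Semi\n"

# Space after the | is part of the second alternative, so it never matches here
with_space = re.findall(r"Main article: (Tesla Model .)| Main article: (Tesla [^\n]*)", sample)
# Without the space, the second alternative catches the Semi line
without_space = re.findall(r"Main article: (Tesla Model .)|Main article: (Tesla [^\n]*)", sample)

print(with_space)     # [('Tesla Model Y', '')]
print(without_space)  # [('Tesla Model Y', ''), ('', 'Tesla Semi')]
```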

Extract financial periods from a company's financial reporting

In [57]:
text1 = '''
The gross cost of operating lease vehicles in FY2021 Q1 was $4.85 billion.
In the previous quarter, i.e. FY2020 Q4, it was $3 billion.'''

In [58]:
pattern = r'FY\d{4} Q[1-4]'  # "FY", a four-digit year, a space, then Q1 to Q4
matches = re.findall(pattern, text1)
matches

['FY2021 Q1', 'FY2020 Q4']

Case-insensitive pattern matching using flags

In [59]:
text1 = '''
The gross cost of operating lease vehicles in FY2021 Q1 was $4.85 billion.
In the previous quarter, i.e. fy2020 Q4, it was $3 billion.'''

In [60]:
pattern = r'FY\d{4} Q[1-4]'  # re.IGNORECASE below lets "FY" also match "fy"
matches = re.findall(pattern, text1, flags=re.IGNORECASE)
matches

['FY2021 Q1', 'fy2020 Q4']
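
The same flag can be supplied in other ways, and it applies to every letter in the pattern, not only to `FY`. The sketch below uses a made-up sentence (`report`) to compare a compiled pattern, the inline `(?i)` form, and a character-set alternative that relaxes only the prefix.

```python
import re

report = "FY2021 Q1 was higher than fy2020 q4."

# Flag passed at compile time
compiled = re.compile(r"FY\d{4} Q[1-4]", flags=re.IGNORECASE)
via_compile = compiled.findall(report)

# Inline flag at the start of the pattern
via_inline = re.findall(r"(?i)FY\d{4} Q[1-4]", report)

# Only the prefix is case-tolerant here; "q4" is therefore not matched
prefix_only = re.findall(r"[Ff][Yy]\d{4} Q[1-4]", report)

print(via_compile)  # ['FY2021 Q1', 'fy2020 q4']
print(via_inline)   # ['FY2021 Q1', 'fy2020 q4']
print(prefix_only)  # ['FY2021 Q1']
```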

In [61]:
pattern = r'FY(\d{4} Q[1-4])'  # with one group, findall returns only the captured part
matches = re.findall(pattern, text1, flags=re.IGNORECASE)
matches

['2021 Q1', '2020 Q4']
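
If the year and the quarter are needed separately, one option is to give each its own capturing group; `re.findall` then returns a tuple per match. A short sketch on a made-up sentence (`report` is not a variable from the notebook):

```python
import re

report = "FY2021 Q1 was $4.85 billion; FY2020 Q4 was $3 billion."

# Two groups: (year, quarter) for every match
pairs = re.findall(r"FY(\d{4}) Q([1-4])", report)
print(pairs)  # [('2021', '1'), ('2020', '4')]

# The captured values are strings; convert them if numbers are needed
periods = [(int(year), int(quarter)) for year, quarter in pairs]
print(periods)  # [(2021, 1), (2020, 4)]
```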

# Using Regex on a Pandas DataFrame

In [62]:
import pandas as pd

In [74]:
data = pd.DataFrame({
    "Phone_text": [
        "Ajith number is 9999999999",
        "Suriya number is 8956231245",
        "Vijay number is 6532215487",
    ]
})

In [75]:
data

|     | Phone_text                  |
| :-- | :-------------------------- |
| 0   | Ajith number is 9999999999  |
| 1   | Suriya number is 8956231245 |
| 2   | Vijay number is 6532215487  |


Pandas Extract

In [76]:
data["Phone Number"]=data["Phone_text"].str.extract(r'(\d{10})')
data.head()

|     | Phone_text                  | Phone Number |
| :-- | :-------------------------- | :----------- |
| 0   | Ajith number is 9999999999  | 9999999999   |
| 1   | Suriya number is 8956231245 | 8956231245   |
| 2   | Vijay number is 6532215487  | 6532215487   |
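
`str.extract` needs at least one capturing group and returns one column per group. Rows where the pattern does not match get `NaN` rather than raising an error. The sketch below assumes pandas is installed and uses a hypothetical frame with one row that has no number; the group names `name` and `phone` are chosen for illustration.

```python
import pandas as pd

# Hypothetical data: the second row contains no 10-digit number
frame = pd.DataFrame({
    "Phone_text": ["Ajith number is 9999999999", "No number given"]
})

# Named groups become the column names of the result
extracted = frame["Phone_text"].str.extract(r"(?P<name>\w+) number is (?P<phone>\d{10})")
print(extracted)

# Rows without a match are NaN in every extracted column
missing = extracted["phone"].isna().tolist()
print(missing)  # [False, True]
```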


# Using apply function

In [77]:
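def extract_phone_number(text):
    """Return the first 10-digit number in text, or None if there is none."""
    pattern = r'\d{10}'
    search_obj = re.search(pattern, text)
    # re.search returns None when nothing matches, so guard before calling .group()
    if search_obj is None:
        return None
    return search_obj.group()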

In [78]:
data["Apply Phone Number"]=data["Phone_text"].apply(extract_phone_number)
data.head()

|     | Phone_text                  | Phone Number | Apply Phone Number |
| :-- | :-------------------------- | :----------- | :----------------- |
| 0   | Ajith number is 9999999999  | 9999999999   | 9999999999         |
| 1   | Suriya number is 8956231245 | 8956231245   | 8956231245         |
| 2   | Vijay number is 6532215487  | 6532215487   | 6532215487         |
