# Regular Expressions in Python

```
[] A set of characters "[a-m]"
\  Signals a special sequence "\d"
.  Any character "he..o"
^  Begins with "^hello"
$  Ends with "plane$"
*  Zero or more occurrences "he.*o"
+  One or more occurrences "he.+o"
?  Zero or one occurrence "he.?o"
{} Exactly the specified number of occurrences "he.{2}o"
|  Either or
() Capture and group
```
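
The last two entries in the table, `|` and `()`, have no example pattern. A minimal sketch of both follows; the sample strings are made up for illustration and are not part of the original exercises.

```python
import re

# Hypothetical sample text, chosen only to illustrate `|` and `()`.
txt = "I have a cat, a dog, and a catfish"

# `|` matches either alternative; the non-capturing group (?:...) limits
# the alternation so that \b applies to both words.
either = re.findall(r"\b(?:cat|dog)\b", txt)

# `()` captures part of the match; with groups present, findall returns
# the captured pieces as tuples instead of the whole match.
pairs = re.findall(r"(\w+)@(\w+)\.com", "ann@example.com, bob@test.com")

print(either)  # ['cat', 'dog']
print(pairs)   # [('ann', 'example'), ('bob', 'test')]
```

Note that "catfish" is skipped because `\b` requires a word boundary right after "cat".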

In [1]:
import re
txt = "The rain in Spain"

x = re.search("^The.*Spain$", txt)
if x:
    print("Matched text: ", x.group())

Matched text:  The rain in Spain


In [2]:
import re
txt = "I gave you 50 dollars, 720 cents"
x = re.findall(r"\d\d", txt) # or "\d{2}"
print(x)

['50', '72']


In [3]:
import re
txt = "hello, I gave you 50 dollars, 72 cents"
x = re.findall(r"he..o", txt) # or "he.{2}o"
print(x)

['hello']


In [4]:
import re
txt = "hello, hecko, 20, heggo, heo wazzaap"
x = re.findall(r"he.*o", txt)
y = re.findall(r"he.+o", txt)
print(x)
print(y)

['hello, hecko, 20, heggo, heo']
['hello, hecko, 20, heggo, heo']


In [5]:
import re
txt = "hello, hecko, heggo, heo"
x = re.findall(r"he.*?o", txt)
y = re.findall(r"he.+?o", txt)
print(x)
print(y)

['hello', 'hecko', 'heggo', 'heo']
['hello', 'hecko', 'heggo']


In [6]:
import re
text = "abbbbba"
print(re.findall(r"a.*b", text))  
print(re.findall(r"a.*?b", text)) 

['abbbbb']
['ab']


In [7]:
import re
text = "abbbbba"
print(re.findall(r"a.+b", text)) 
print(re.findall(r"a.+?b", text)) 

['abbbbb']
['abb']


In [8]:
import re
text = "aaabaaa"
print(re.findall(r"a{1,3}", text))  
print(re.findall(r"a{1,3}?", text))  

['aaa', 'aaa']
['a', 'a', 'a', 'a', 'a', 'a']


## Trying Assignment Exercises

### Match with any 3 digit number

In [9]:
import re
nums = "3 48 276 1057 8542 19 652 7428 583 91 7042 163 7 8245 382 8080"
numstwo = "MCA234 48 276 BOB1057 8542 19 652 XYZ7428 583 91 7042 163 7 8245 A382 8080"

x = re.findall(r"\d{3}", nums)
y = re.findall(r"\b\d{3}\b", nums)
z = re.findall(r"(?<!\d)\d{3}(?!\d)", numstwo)

print(x)
print(y)
print(z)

['276', '105', '854', '652', '742', '583', '704', '163', '824', '382', '808']
['276', '652', '583', '163', '382']
['234', '276', '652', '583', '163', '382']


The `?` in `(?<!\d)` and `(?!\d)` is part of the lookbehind and lookahead syntax, not a quantifier. These assertions specify conditions that must hold at a position for the pattern to match, but the text they inspect is not included in the match.

Here's a breakdown of each part:

1. **Negative Lookbehind `(?<!\d)`**:
   - `?<!` is the syntax for a negative lookbehind assertion.
   - `\d` inside the lookbehind is the pattern that must not occur immediately before the current position.
   - So `(?<!\d)` means "match at this position only if it is not immediately preceded by a digit."

2. **Three-Digit Number `\d{3}`**:
   - `\d` matches any digit (0-9).
   - `{3}` specifies that exactly three digits should be matched.

3. **Negative Lookahead `(?!\d)`**:
   - `?!` is the syntax for a negative lookahead assertion.
   - `\d` inside the lookahead is the pattern that must not occur immediately after the current position.
   - So `(?!\d)` means "match at this position only if it is not immediately followed by a digit."

When combined, `(?<!\d)\d{3}(?!\d)` matches runs of exactly three digits that are not part of a longer run of digits. This ensures that:
- There is no digit immediately before the three-digit number (`(?<!\d)`).
- There is no digit immediately after the three-digit number (`(?!\d)`).

Here's the whole pattern:
- `(?<!\d)`: Negative lookbehind to ensure no digit before.
- `\d{3}`: Exactly three digits.
- `(?!\d)`: Negative lookahead to ensure no digit after.
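
To see why the lookarounds behave differently from `\b`, here is a small sketch on a made-up string (`sample` is hypothetical, not from the exercise data). It also checks that the assertions are zero-width by inspecting the span of the first match.

```python
import re

# Hypothetical sample: a letter-prefixed number, a 4-digit number,
# a 2-digit number, and a plain 3-digit number.
sample = "A382 x1234 77 583"

# \b needs a word/non-word transition; "A" and "3" are both word
# characters, so "382" inside "A382" is rejected.
word_bounded = re.findall(r"\b\d{3}\b", sample)

# The lookarounds only forbid neighbouring digits, so "A382" is accepted.
digit_bounded = re.findall(r"(?<!\d)\d{3}(?!\d)", sample)

# The assertions consume no characters: the match covers only the digits.
first = re.search(r"(?<!\d)\d{3}(?!\d)", sample)

print(word_bounded)   # ['583']
print(digit_bounded)  # ['382', '583']
print(first.span())   # (1, 4)
```

This mirrors the difference between `y` and `z` in cell 9, where only the lookaround version picks up `234` from `MCA234` and `382` from `A382`.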


### Match with this kind of number: 24BIT0090

In [10]:
import re
numthree = "24BIT0090"
numfour = "some text 24BIT0090 more letters"

y = re.findall(r"^\d{2}[A-Z]{3}\d{4}$", numthree)
z = re.findall(r"(?<!\d)\d{2}[A-Z]{3}\d{4}(?!\d)", numfour)

print(y)
print(z)

['24BIT0090']
['24BIT0090']


### Match with this kind of number: 123-5636-12-231

In [11]:
import re
numone = "123-5636-12-231"
numtwo = "some text 123-5636-12-231 more text"

w = re.findall(r"^\d{3}-\d{4}-\d{2}-\d{3}$", numone)
x = re.findall(r"(?<!\d)\d{3}-\d{4}-\d{2}-\d{3}(?!\d)", numtwo)

print(w)
print(x)

['123-5636-12-231']
['123-5636-12-231']


### Match with 10 digit mobile number with country code

```regex
^\+\d{1,4}\s?\d{10}$
```

Here's a breakdown of the pattern:
- `^` asserts the position at the start of the string.
- `\+` matches the literal "+" character (for the country code).
- `\d{1,4}` matches 1 to 4 digits (for the country code).
- `\s?` matches zero or one whitespace character (optional space between the country code and the phone number).
- `\d{10}` matches exactly 10 digits (for the phone number).
- `$` asserts the position at the end of the string.

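This pattern will match strings like "+91 1234567890" or "+911234567890". Because the country code has a variable length, a string such as "+91234567890" also matches, with "9" read as the country code.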
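
The breakdown above can be checked by adding capture groups around the two digit runs. This is only a sketch: the group names `country` and `number` and the helper `split_phone` are assumptions introduced for illustration, not part of the original exercise.

```python
import re

# Same pattern as above, with named groups added for illustration.
# `country` and `number` are hypothetical names.
PHONE = re.compile(r"^\+(?P<country>\d{1,4})\s?(?P<number>\d{10})$")

def split_phone(candidate):
    """Return (country_code, number) if the string matches, else None."""
    m = PHONE.match(candidate)
    return (m.group("country"), m.group("number")) if m else None

for s in ["+91 1234567890", "+911234567890", "+91234567890", "1234567890"]:
    print(s, "->", split_phone(s))
```

The greedy `\d{1,4}` first tries four digits and backtracks until exactly ten digits remain, which is how "+911234567890" ends up with country code "91". A string without the leading "+" does not match at all.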

In [12]:
import re

numone = "+917904198245"
numtwo = "mobile: +917904198245"

x = re.findall(r"^\+\d{1,4}\s?\d{10}$", numone)
y = re.findall(r"(?<!\d)\+\d{1,4}\s?\d{10}(?!\d)", numtwo)

print(x)
print(y)

['+917904198245']
['+917904198245']


### Match with landline numbers: 0413-2345671

In [13]:
import re
num = "0413-2345671"
x = re.findall(r"^\d{4}-\d{7}$", num)
print(x)

['0413-2345671']


### Match with email IDs

In [14]:
import re
email = "example@gmail.com"
text = "My email is test.email@example.com and another one is user123@domain.co.in"
x = re.findall(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$", email)
y = re.findall(r"\b([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})\b", text)
print(x)
print(y)

['example@gmail.com']
['test.email@example.com', 'user123@domain.co.in']


### Match with website address

In [15]:
import re
addr = "https://google.com and http://example.com or www.example.com"
x = re.findall(r"(?:https?://)?[a-zA-Z0-9._+-]+\.[a-zA-Z]{2,}",addr)
print(x)

['https://google.com', 'http://example.com', 'www.example.com']


### Match with American Express cards: starts with 34 or 37 and has 13 digits

In [3]:
import re
card = "3412345678901"
cards = ["3412345678901", "3712345678901", "3612345678901", "34123456789012"]
cardstr = "card numbers: 3412489239847 3729837398424 malformed card: 3448723"

w = re.findall(r"^3[47]\d{11}$", card)
x = [card for card in cards if re.findall(r"^3[47]\d{11}$",card)]
y = re.findall(r"(?<!\d)(?:34|37)\d{11}(?!\d)",cardstr)
z = re.findall(r"3[47]\d{11}", cardstr)

print(w)
print(x)
print(y)
print(z)

['3412345678901']
['3412345678901', '3712345678901']
['3412489239847', '3729837398424']
['3412489239847', '3729837398424']
