# Python RegEx

## What will we learn?
- What are regular expressions?
- Why do we use regular expressions?
- What are metacharacters in RegEx?
- Different types of metacharacters in Python
- Special sequences in RegEx
- Basic regular expression operations
- Phone number verification
- Email verification
- Web scraping using RegEx

### What are regular expressions?
A regular expression (RegEx) is a special text string that describes a search pattern. In Python, regular expressions are available by importing the **re** module.
    
### Why do we use regular expressions?
- to extract the date and time from a log file
- to verify the format of an email address
- to verify that a phone number is well formed
- to search for a specific pattern in a database
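
As a small illustration of the first use case, the sketch below pulls the date and time out of a single log line. The log line and its timestamp layout are made up for this example; real log files may use a different format.

```python
import re

# Hypothetical log line; the timestamp layout is an assumption for illustration.
log_line = "2023-08-14 09:15:42 ERROR Disk quota exceeded"

# Group 1: date as YYYY-MM-DD, group 2: time as HH:MM:SS.
timestamp = re.search(r"(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2})", log_line)

if timestamp:
    log_date, log_time = timestamp.groups()
    print("Date:", log_date)
    print("Time:", log_time)
```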
    

In [1]:
# In order to use RegEx module, we have to import it:

import re

In [None]:
dir(re)

In [None]:
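pattern = "^a...e$"    # starts with 'a', then any three characters, then ends with 'e'
test_string = "apple"

result = re.match(pattern, test_string)   # re.match() tries to match the pattern at the start of test_string
print(result)

if result:    # a Match object is truthy, so this branch runs for "apple"
    print('Match found')
else:
    print('Match not found')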

# Metacharacters:

To specify the pattern using regular expressions we use **metacharacters**.<br>
They are interpreted in a special way by a regex engine.<br>
**Metacharacters:** [ ] . ^ $ * + ? { } ( ) \ |

# 1. Square Brackets [ ]
- specifies a set of characters you wish to match
- you can also specify a range with a hyphen, for example [a-e]

In [2]:
print(re.findall("[a]", "abcdefaga"))

['a', 'a', 'a']


In [3]:
print(re.findall("[b-e]", "abcdefg"))

['b', 'c', 'd', 'e']


In [4]:
print(re.findall("[1-4]", "abcdefg 124589"))

['1', '2', '4']


In [5]:
print(re.findall("[1-48]", "abcdefg 124589"))

['1', '2', '4', '8']


You can also complement (invert) the character set by placing a **caret ^** at the start of the square brackets.

In [6]:
print(re.findall("[^c-f1-4]", "abcdefg 124589"))

['a', 'b', 'g', ' ', '5', '8', '9']


# 2. Period (.)
- matches any single character except a newline
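
One detail worth seeing in code: by default the period does not match a newline. The sketch below compares the default behaviour with the **re.DOTALL** flag; the sample text is made up for illustration.

```python
import re

text = "a\nb"

# By default the period stops at a newline, so 'a', newline, 'b' is not matched.
default_result = re.findall(r"a.b", text)

# With re.DOTALL the period matches the newline as well.
dotall_result = re.findall(r"a.b", text, flags=re.DOTALL)

print(default_result)
print(dotall_result)
```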

In [7]:
print(re.findall("..", "abcdefg"))

['ab', 'cd', 'ef']


In [9]:
print(re.findall("..", "a"))

[]


# 3. Caret ^
- is used to check if a string starts with a certain pattern

In [10]:
print(re.findall("^app", "apple"))

['app']


# 4. Dollar $
- is used to check if a string ends with a certain pattern

In [15]:
print(re.findall(r"....e$", "apple"))

['apple']
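
For text that spans several lines, ^ and $ anchor only at the start and end of the whole string unless the **re.MULTILINE** flag is passed. The following sketch, using a made-up three-line string, shows the difference.

```python
import re

lines = "apple\nbanana\navocado"

# Without re.MULTILINE, ^ anchors only at the very start of the string.
starts_default = re.findall(r"^a\w+", lines)

# With re.MULTILINE, ^ also anchors right after each newline ...
starts_multiline = re.findall(r"^a\w+", lines, flags=re.MULTILINE)

# ... and $ also anchors right before each newline.
ends_multiline = re.findall(r"\w+a$", lines, flags=re.MULTILINE)

print(starts_default)
print(starts_multiline)
print(ends_multiline)
```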


# 5. Star *
- '*' matches **zero or more occurrences** of the pattern to its left.

In [21]:
print(re.findall("ap*l", "apple"))

['appl']


In [23]:
print(re.match("ap*l", "aple"))

<re.Match object; span=(0, 3), match='apl'>


In [24]:
print(re.match("ap*l", "ale"))

<re.Match object; span=(0, 2), match='al'>


In [25]:
print(re.match("ap*l", "able"))

None


In [26]:
print(re.match("a*p*le", "le"))

<re.Match object; span=(0, 2), match='le'>


# 6. Plus +
- '+' matches **one or more occurrences** of the pattern to its left.

In [27]:
print(re.match("ap+l", "apple"))

<re.Match object; span=(0, 4), match='appl'>


In [28]:
print(re.match("ap+l", "aple"))

<re.Match object; span=(0, 3), match='apl'>


In [29]:
print(re.findall("ap+l", "ale"))

[]


In [30]:
print(re.findall("ap+l", "apppe"))

[]


In [31]:
print(re.findall("a+p+l", "aaaaappple"))

['aaaaapppl']


# 7. Question Mark ?
- '?' matches **zero or one occurrence** of the pattern to its left.

In [32]:
print(re.findall("ap?l", "aple"))

['apl']


In [33]:
print(re.findall("ap?l", "apple"))

[]


In [34]:
print(re.findall("ap?l", "le"))

[]


In [35]:
print(re.findall("ap?l", "pamle"))

[]


# 8. Braces {n,m}
- matches at least **n repetitions** and **at most m repetitions** of the pattern to its left

In [36]:
print(re.findall("a{2,4}", "aapple is apple"))

['aa']


In [37]:
print(re.findall("ap{2,4}", "aapppppple is apple"))

['apppp', 'app']


In [38]:
print(re.findall("ap{2,4}", "aapppppple is bpple"))

['apppp']


In [39]:
print(re.findall("a{2,5}", "aaaaaaaapppppple is aaaapple"))

['aaaaa', 'aaa', 'aaaa']


In [40]:
print(re.findall("a{2,5}p{2,3}", "aaaaaaaapppppple is aaaapple"))

['aaaaappp', 'aaaapp']


In [41]:
print(re.findall("ba{2,5}p{2,3}", "aaaaaaaapppppple is aaaapple"))

[]
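
Braces also accept a single count or an open upper bound. The sketch below shows {n} (exactly n) and {n,} (n or more) on a made-up string; it uses \b, the word-boundary sequence described under Special Sequences, so that each run of letters is tested as a whole word.

```python
import re

text = "a aa aaa aaaa"

# {2} means exactly two repetitions.
exactly_two = re.findall(r"\ba{2}\b", text)

# {2,} means two or more repetitions, with no upper limit.
two_or_more = re.findall(r"\ba{2,}\b", text)

print(exactly_two)
print(two_or_more)
```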


# 9. Alternation |
- the vertical bar is used for alternation (the OR operator)

In [42]:
print(re.findall('a|b', 'abcdefba, ade, cde'))    # any string that contains either 'a' or 'b'

['a', 'b', 'b', 'a', 'a']


In [43]:
print(re.findall('X|Y|Za', 'Mr.X has come.'))    # any string that contains either 'X' or 'Y' or 'Za'

['X']


# 10. Group ( )
- parentheses ( ) are used to group sub-patterns.
- for example: **(a|b|c)xz**


In [46]:
print(re.findall('(a|b|c)xz', 'abcz abxz'))    # 'a', 'b' or 'c' followed by 'xz'; findall returns only the captured group

['b']
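
When a pattern contains a capturing group, re.findall() returns the text captured by the group rather than the whole match. A non-capturing group, written (?:...), groups without capturing. The sketch below compares the two on a made-up string.

```python
import re

text = "abcz abxz cxz"

# With a capturing group, findall returns only the captured part.
captured = re.findall(r"(a|b|c)xz", text)

# (?:...) groups without capturing, so findall returns the whole match.
whole = re.findall(r"(?:a|b|c)xz", text)

print(captured)
print(whole)
```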


# 11. Backslash \
- is used to escape various characters including all metacharacters.

In [47]:
print(re.findall(r'\$a', 'abcd $a'))    # the backslash escapes '$', so the pattern searches for the literal text '$a'

['$a']


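## NOTE :
**If you are unsure whether a punctuation character has a special meaning, you can put '\' in front of it so that it is treated literally. Do not do this for letters or digits: a backslash followed by a letter or digit either forms a special sequence (such as \d or \b) or raises an error. To escape a whole string automatically, use re.escape().**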
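
Escaping by hand gets tedious when the text to search for contains several metacharacters. A minimal sketch using re.escape() from the standard library follows; the sample strings are invented for illustration.

```python
import re

literal = "1+1=2 (approx.)"
sentence = "We know that 1+1=2 (approx.) holds"

# re.escape() puts a backslash in front of every character that could be special.
safe_pattern = re.escape(literal)
escaped_result = re.search(safe_pattern, sentence)

# Unescaped, '+', '(', ')' and '.' keep their special meaning,
# so the literal text is not found.
unescaped_result = re.search(literal, sentence)

print(escaped_result)
print(unescaped_result)
```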

# Special Sequences :

# 1. \A
- Matches if the **specified characters are at the start of a string.**

In [48]:
print(re.match(r'\Athe', 'the sun is red the'))  # r means raw string  

<re.Match object; span=(0, 3), match='the'>


In [49]:
print(re.match(r'\AHello', 'Hi World'))    

None


# 2. \b
- Matches if the specified characters are at the **beginning or end** of a word.


In [50]:
print(re.search(r'\bfoo', 'football'))    # r specifies the raw string

<re.Match object; span=(0, 3), match='foo'>


In [51]:
print(re.search(r'\bfoo', 'a football'))    # r specifies the raw string

<re.Match object; span=(2, 5), match='foo'>


In [60]:
print(re.match(r'foo\b', 'ball is played by foo ballfoo'))    # re.match() only tries at the start of the string, which begins with 'ball', hence no match

None


In [64]:
print(re.findall(r'foo\b', 'foo ball is played by foo ball'))    # 'foo' is a complete word both times, so it ends at a word boundary twice

['foo', 'foo']


In [62]:
?re.search

In [65]:
print(re.search(r'foo\b', 'ball is afootball'))    # foo is not at the end of a word 'afootball' hence no match

None


In [67]:
print(re.search(r'\bfoo\b', 'ball is afoo'))      # creating a boundary '\b...\b', to match only for 'foo'

None


In [68]:
print(re.search(r'\bfoo\b', 'ball is foo'))      # creating a boundary '\b...\b', to match only for 'foo'

<re.Match object; span=(8, 11), match='foo'>


# 3. \B
- it is the opposite of \b
- matches if the specified characters are **not at the beginning or end** of a word.

In [69]:
print(re.search(r'\Bfoo', 'a football'))  # here 'foo' is at the beginning of a word 'football'; hence no match

None


In [70]:
print(re.search(r'foo\B', 'a football'))  # here 'foo' is not at the end, hence a match

<re.Match object; span=(2, 5), match='foo'>


In [71]:
print(re.search(r'\Bfoo\B', 'afootball'))   # 'foo' is not at the beginning nor at the end

<re.Match object; span=(1, 4), match='foo'>


# 4. \d
- Matches any decimal digit; equivalent to **[0-9]**

In [72]:
print(re.findall(r'\d', 'abcd234xyz78'))  

['2', '3', '4', '7', '8']


# 5. \D
- Matches any character that is not a decimal digit; equivalent to **[^0-9]**

In [73]:
print(re.findall(r'\D', 'abcd234xyz78'))  

['a', 'b', 'c', 'd', 'x', 'y', 'z']


In [74]:
print(re.findall(r'\D', '12345'))  

[]


# 6. \s
- Matches any **whitespace character**
- equivalent to **[ \t\n\r\f\v]**

In [76]:
print(re.findall(r'\s', 'abcd 234 xyz 78'))  

[' ', ' ', ' ']


# 7. \S
- matches any **non-whitespace character**
- equivalent to **[^ \t\n\r\f\v]**

In [75]:
print(re.findall(r'\S', 'xyz 45'))  

['x', 'y', 'z', '4', '5']


# 8. \w
- matches any **alphanumeric character** or the underscore
- equivalent to **[a-zA-Z0-9_]**

In [77]:
print(re.findall(r'\w', 'xyz 123'))  

['x', 'y', 'z', '1', '2', '3']


In [78]:
print(re.findall(r'\w', '#$%@'))  

[]
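
Two details of \w are easy to miss: the underscore counts as a word character, and for ordinary (str) patterns non-ASCII letters count too unless the **re.ASCII** flag is given. The sketch below uses a made-up sample string to show both.

```python
import re

sample = "user_name caf\u00e9 42!"

# Underscore is a word character; so is the accented letter by default.
unicode_words = re.findall(r"\w+", sample)

# re.ASCII restricts \w to [a-zA-Z0-9_].
ascii_words = re.findall(r"\w+", sample, flags=re.ASCII)

print(unicode_words)
print(ascii_words)
```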


# 9. \W
- matches any **non-alphanumeric character** (anything that \w does not match)
- equivalent to **[^a-zA-Z0-9_]**

In [79]:
print(re.findall(r'\W', '#$%@'))  

['#', '$', '%', '@']


In [80]:
print(re.findall(r'\W', 'xyz 123'))  

[' ']


# 10. \Z
- matches if the specified characters are at the **end of a string**.

In [81]:
print(re.findall(r'Python\Z', 'I love Python'))  

['Python']


In [82]:
print(re.findall(r'\ZPython', 'I love Python'))  

[]


In [83]:
print(re.findall(r'\ZPython', 'Python is my love'))  

[]


In [84]:
print(re.findall(r'Python\Z', 'Python is my lovePython'))  

['Python']


# Basic Regular Expression Operations

- re.findall()
- re.search()
- re.match()
- re.split()
- re.sub()
- re.subn()
- re.compile()
- match.group()
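
Before looking at the functions one by one, here is a compact sketch that calls each of them once on the same made-up string, so their return values can be compared side by side.

```python
import re

text = "cat bat rat"

all_matches = re.findall(r"[a-z]at", text)    # every non-overlapping match, as a list
first_b = re.search(r"b[a-z]t", text)         # first match anywhere in the string
at_start = re.match(r"b[a-z]t", text)         # None: the string does not start with 'b'
pieces = re.split(r"\s+", text)               # split on runs of whitespace
replaced = re.sub(r"[a-z]at", "dog", text)    # replace every match
counted = re.subn(r"[a-z]at", "dog", text)    # (new string, number of replacements)

word = re.compile(r"(?P<first>[a-z])at")      # precompile a pattern for reuse
m = word.search(text)

print(all_matches, first_b, at_start, pieces, replaced, counted, sep="\n")
print(m.group(), m.group("first"))            # whole match, then the named group
```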

In [None]:
import re

# re.findall()
### Find out the name and age of a person from the sample string :

In [None]:
NameAge = '''hello Siddharth is 21 
and Shravan is 40
Mahesh is 25 and Ravi is 30 R
'''

name = re.findall('[A-Z][a-z]+', NameAge)
age = re.findall(r'\d{1,3}', NameAge)   # raw string, so '\d' reaches the regex engine unchanged
print(name, age)
print()
print("The proper output is given below : ")
for i in zip(name, age):
    print(i)

### NOTE : 
#### If the pattern is not found **re.findall** returns an empty list.
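
By contrast, re.search() and re.match() return None when nothing is found, so their result should be checked before calling .group(). A short sketch with a made-up string:

```python
import re

text = "no digits here"

found_all = re.findall(r"\d+", text)   # empty list when nothing matches
found_one = re.search(r"\d+", text)    # None when nothing matches

# Guard against None before asking for the matched text.
number = found_one.group() if found_one else None

print(found_all)
print(found_one)
print(number)
```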


# re.search()
### Search for a pattern in the test string:

### Find the username and host name in a given Gmail address:


### Grouping in RegEx

The group feature of regular expressions lets you pick out parts of the matching text. Parts of a pattern enclosed in parentheses ( ) are called groups. Parentheses do not change what the expression matches; they form groups within the matched sequence.
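
Groups can be read back by number: group 0 is the whole match and groups 1, 2, ... follow the order of the opening parentheses. The sketch below uses an invented date string to illustrate this.

```python
import re

# Hypothetical input; the DD-MM-YYYY layout is an assumption for illustration.
m = re.search(r"(\d{2})-(\d{2})-(\d{4})", "Invoice date: 14-08-2023")

whole = m.group(0)    # the entire matched text
day = m.group(1)      # first pair of parentheses
parts = m.groups()    # all captured groups as a tuple

print(whole)
print(day)
print(parts)
```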

In [None]:
statement = 'Please contact us at: askjojo@gmail.com'

# re.search() returns a Match object (or None), so named groups can be read with .group()
match = re.search(r'(?P<email>(?P<username>[\w.-]+)@(?P<host>[\w.-]+))', statement)
if match:
    print("Email address:", match.group('email'))
    print("Username:", match.group('username'))
    print("Host:", match.group('host'))

In [None]:
mystr = "rat, hat, mat, cat, chat"
myst = re.findall('[a-z]at', mystr)
print(myst)

In [None]:
mystr = "rat, hat, mat, cat, chat, dog, at, pack"
myst = re.findall('[a-z]*t', mystr)
print(myst)

In [None]:
mystr = "rat, hat, mat, cat, chat, dog, at, pack"
myst = re.findall('.*k$', mystr)
print(myst)

In [None]:
#replace a string
mystr = "there is a tower issue"

alstr = re.compile('tower')
myst = alstr.sub('network', mystr)
print(myst)


In [None]:
# how to deal with white spaces

randstr = '''
keep the Indian     
flag flying high
'''
print(randstr)

a = re.compile('\n')
b = a.sub("", randstr)
print(b)

In [None]:
teststring = "abc 12\\d23 \n f456"   # '\\d' keeps a literal backslash followed by 'd'
pattern = r'\s+'                      # one or more whitespace characters
replace = ""
op = re.sub(pattern, replace, teststring)
print(op)

In [None]:
teststring = 'Shravan 40,Srinivas 60'
op = re.split(',', teststring)
print(op)

In [None]:
# match a single character

rand = '4512322222'
print('Matches:', len(re.findall('2', rand)))

In [None]:
rand2 = '123 1234 112345 123456 1234567 12345678'
print("Matches:", len(re.findall(r'\d{5,8}', rand2)))

In [None]:
mobileNum = "123456 0189342908  0123345678910"
validNum = re.search(r'\d{10}', mobileNum)   # first run of 10 consecutive digits, even inside a longer number
print(validNum)

In [None]:
op = re.split(r'\s+', mobileNum)
for num in op:
    if len(num) == 10:
        print(f"valid number is : {num}")

In [None]:
a = 'a b c  d'
op1 = re.split(' ', a)
print(op1)

In [None]:
phone = input("Enter your phone number:")

# re.fullmatch() requires the whole input to have the form NNN-NNN-NNNN
if re.fullmatch(r"\d{3}-\d{3}-\d{4}", phone):
    print(phone, "is a valid phone number")
else:
    print(phone, "is not a valid phone number")
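
For validation it matters whether the pattern may appear anywhere in the input or must cover all of it. The sketch below runs re.search() and re.fullmatch() over a few made-up inputs; the NNN-NNN-NNNN layout is the one assumed in this section, not a general phone number standard.

```python
import re

PHONE = r"\d{3}-\d{3}-\d{4}"
candidates = ["123-456-7890", "call 123-456-7890 now", "123-456-78901"]

results = []
for text in candidates:
    loose = bool(re.search(PHONE, text))       # pattern occurs somewhere in the text
    strict = bool(re.fullmatch(PHONE, text))   # the whole text is exactly the pattern
    results.append((loose, strict))
    print(f"{text!r}: search={loose}, fullmatch={strict}")
```

Only the first input passes the strict check, which is usually what a form validator wants.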

In [None]:
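email = 'Bhupendra234@gmail.co'
# 2-20 word characters, '@', 3-15 word characters, a literal dot, then 2-3 lowercase letters
if re.fullmatch(r'\w{2,20}@\w{3,15}\.[a-z]{2,3}', email):
    print(email, "is a valid email id")
else:
    print(email, "is not a valid email id")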

In [None]:
mails = 'pk@gmail.com md@aon.com raka@seo.com @mail.com dc@com'

# '@mail.com' has no username and 'dc@com' has no dot, so only three addresses match
print('EmailMatches:', len(re.findall(r'\w{2,10}@\w{1,5}\.[a-z]{2,3}', mails)))

## Read the phone numbers from the given URL:

"https://www.summet.com/dmsi/html/codesamples/addresses.html"