Metacharacters (need to be escaped to match literally): <br>
. [ { ( ) \ ^ $ | ? * + 
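
As a minimal sketch of what escaping changes, the snippet below compares an unescaped and an escaped dot, and uses `re.escape()` to escape a whole literal string. The sample text is made up for illustration.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "Price: 3.50 (approx.) or 3x50?"

# Unescaped, '.' matches any character, so the pattern also matches '3x50'.
unescaped = re.findall(r'3.50', text)

# Escaped, '\.' matches only a literal dot.
escaped = re.findall(r'3\.50', text)

# re.escape() escapes the metacharacters in a literal string for you.
literal = re.escape('(approx.)')
found = re.search(literal, text)

print(unescaped)
print(escaped)
print(found.group())
```

`re.escape()` is handy when the text to search for comes from user input and should never be interpreted as a pattern.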

In [None]:
.       - Any Character Except New Line
\d      - Digit (0-9)
\D      - Not a Digit (0-9)
\w      - Word Character (a-z, A-Z, 0-9, _)
\W      - Not a Word Character
\s      - Whitespace (space, tab, newline)
\S      - Not Whitespace (space, tab, newline)

\b      - Word Boundary (position between a word character and a non-word character, or the start/end of the string)
\B      - Not a Word Boundary
^       - Beginning of a String
$       - End of a String

[]      - Matches Characters in brackets
[^ ]    - Matches Characters NOT in brackets
|       - Either Or
( )     - Group

Quantifiers:
*       - 0 or More of the prefix
+       - 1 or More of the prefix
?       - 0 or One of the prefix
{3}     - Exact Number of the prefix
{3,4}   - Range of Numbers (Minimum, Maximum) of the prefix
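
A short sketch that exercises several entries from the cheat sheet above (character classes, word boundaries, anchors, and quantifiers). The sample string and variable names are hypothetical.

```python
import re

# Hypothetical sample text, chosen only for illustration.
sample = "Call 555-1234 or 555-98765 before 9am, room_42."

digit_runs = re.findall(r'\d+', sample)            # + : one or more digits
exact = re.findall(r'\b\d{3}-\d{4}\b', sample)     # {n} counts plus word boundaries
ranged = re.findall(r'\b\d{4,5}\b', sample)        # {min,max} range of repetitions
punct = re.findall(r'[^\w\s]', sample)             # negated set: not word, not whitespace
starts = re.search(r'^Call', sample) is not None   # ^ anchors at the start
ends = re.search(r'\.$', sample) is not None       # $ anchors at the end

print(digit_runs)
print(exact)
print(ranged)
print(punct)
print(starts, ends)
```

Note that `555-98765` is rejected by the `exact` pattern because `\b` cannot match between two digits.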


#### Sample Regexes ####

[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+
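
The pattern above looks like an email matcher. Here is a hedged sketch of how it could be applied with `re.findall`; the addresses are invented for the example.

```python
import re

email_regex = r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+'

# Hypothetical addresses, made up for this example.
text = "Contact jane.doe@example.com or bob+news@mail-server.org today"
emails = re.findall(email_regex, text)

# The last character class also accepts '.', so a sentence-ending
# period directly after an address becomes part of the match.
trailing = re.findall(email_regex, "mail me at a@b.com.")

print(emails)
print(trailing)
```

This is a loose pattern for pulling addresses out of text, not a validator for the full email address syntax.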

In [179]:
import re

with open('simple.txt', 'r') as f:
    data = f.read()

print(data)





<h3>Finding pattern</h3>

- re.findall(pattern, string, flags): Returns a list of all non-overlapping matches as strings. If the pattern contains capturing groups, only the groups are returned.
- re.finditer(pattern, string, flags): Returns an iterator that yields match objects.
- re.search(pattern, string, flags): Returns the first match anywhere in the string as a match object, or None if there is no match.
- re.match(pattern, string, flags): Applies the pattern only at the start of the string and returns a match object, or None if it does not match there.
- re.split(pattern, string, maxsplit): Splits the string at each match, performing at most maxsplit splits.
- re.sub(pattern, repl, string, count): Replaces at most count matches with the replacement string repl.
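
The `flags` argument listed above can be illustrated with a minimal sketch. It uses `re.IGNORECASE` and `re.MULTILINE` on a made-up multi-line string.

```python
import re

# Hypothetical multi-line text, made up for this example.
text = "Apple pie\napple tart\nBanana split"

# No flags: case-sensitive, and ^ matches only at the very start of the string.
default = re.findall(r'^apple', text)

# IGNORECASE: 'apple' also matches 'Apple'.
ignore_case = re.findall(r'^apple', text, flags=re.IGNORECASE)

# MULTILINE: ^ also matches right after each newline.
multiline = re.findall(r'^apple', text, flags=re.MULTILINE)

# Flags are combined with the bitwise OR operator.
both = re.findall(r'^apple', text, flags=re.IGNORECASE | re.MULTILINE)

print(default, ignore_case, multiline, both)
```

Passing `flags` by keyword is a safe habit: in `re.split` and `re.sub` the next positional argument is `maxsplit` or `count`, not the flags.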

In [188]:
phNo = r'\d+[-.]\d+[-.]\d+'
string = "something 345.just 322.like 343.this 343. by coldplay 342. and chainsmookers"
print(re.findall(phNo, data))
print(re.search(phNo, data))
print(re.match(phNo, data))
print(re.match(r'some+', string))
print(re.split(r'\d+\.', string, maxsplit = 3))
print(re.sub(r'\d+\.', 'HI', string, count = 2))

['321-555-4321', '123.555.1234']
<re.Match object; span=(140, 152), match='321-555-4321'>
None
<re.Match object; span=(0, 4), match='some'>
['something ', 'just ', 'like ', 'this 343. by coldplay 342. and chainsmookers']
something HIjust HIlike 343.this 343. by coldplay 342. and chainsmookers


- If the regular expression contains capturing groups, findall returns only the groups (a list of tuples when there is more than one group).


In [265]:
string = "something 345. just 322.like 343.this 343. by coldplay 342. just and chainsmookers"
reg = r'(\d+\.) just'
print(list(re.finditer(reg,string)))
print(re.findall(reg, string)) # just returns the group

[<re.Match object; span=(10, 19), match='345. just'>, <re.Match object; span=(55, 64), match='342. just'>]
['345.', '342.']
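
If the whole match is wanted rather than just the group, one option is a non-capturing group `(?:...)`. Another is to read `group(0)` from `finditer`. The sketch below reuses the string from the cell above.

```python
import re

string = "something 345. just 322.like 343.this 343. by coldplay 342. just and chainsmookers"

# Capturing group: findall returns only what the group captured.
captured = re.findall(r'(\d+\.) just', string)

# Non-capturing group: the grouping stays, but findall returns the whole match.
whole = re.findall(r'(?:\d+\.) just', string)

# Alternative: keep the capturing group and take group(0) from each match object.
via_iter = [m.group(0) for m in re.finditer(r'(\d+\.) just', string)]

print(captured)
print(whole)
print(via_iter)
```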


In [197]:
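# Same string as in the first example, so the spans below line up
string = "something 345.just 322.like 343.this 343. by coldplay 342. and chainsmookers"
matches = re.finditer(r'(\d)+\.', string)
for m in matches:
    print(m, m.start(), m.end(), m.span(), m.group())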

<re.Match object; span=(10, 14), match='345.'> 10 14 (10, 14) 345.
<re.Match object; span=(19, 23), match='322.'> 19 23 (19, 23) 322.
<re.Match object; span=(28, 32), match='343.'> 28 32 (28, 32) 343.
<re.Match object; span=(37, 41), match='343.'> 37 41 (37, 41) 343.
<re.Match object; span=(54, 58), match='342.'> 54 58 (54, 58) 342.


<h3>Using groups</h3>

In [274]:
urls = """
https://www.google.com
http://coreyms.com
https://youtube.com
https://www.nasa.gov
"""
urlRegex = r'https?://(www\.)?(\w+)(\.\w+)'
print(re.findall(urlRegex, urls)) # Returns groups of the matching strings

[('www.', 'google', '.com'), ('', 'coreyms', '.com'), ('', 'youtube', '.com'), ('www.', 'nasa', '.gov')]


In [276]:
list(re.finditer(urlRegex, urls))

[<re.Match object; span=(1, 23), match='https://www.google.com'>,
 <re.Match object; span=(24, 42), match='http://coreyms.com'>,
 <re.Match object; span=(43, 62), match='https://youtube.com'>,
 <re.Match object; span=(63, 83), match='https://www.nasa.gov'>]

In [278]:
[''.join(i.group(2,3)) for i in re.finditer(urlRegex, urls)]

['google.com', 'coreyms.com', 'youtube.com', 'nasa.gov']

In [281]:
print(re.sub(urlRegex, r'\2\3', urls))  # Replace each match with the contents of groups 2 and 3


google.com
coreyms.com
youtube.com
nasa.gov
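
Numbered backreferences such as `\2\3` can become hard to read. As a sketch of an alternative, the same pattern can be written with named groups; the group names `www`, `domain`, and `tld` are illustrative choices, not part of the original notebook.

```python
import re

urls = """
https://www.google.com
http://coreyms.com
https://youtube.com
https://www.nasa.gov
"""

# Same pattern as above, with illustrative names attached to each group.
named = re.compile(r'https?://(?P<www>www\.)?(?P<domain>\w+)(?P<tld>\.\w+)')

first = named.search(urls)
domain = first.group('domain')   # look a group up by name
parts = first.groupdict()        # all named groups as a dict

# \g<name> refers to a named group inside the replacement string.
cleaned = named.sub(r'\g<domain>\g<tld>', urls)

print(domain)
print(parts)
print(cleaned)
```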



In [256]:
urls = """
https://www.google.com
http://coreyms.com
https://youtube.com
https://www.nasa.gov
"""
urlRegex = r'https?://(www\.)?(\w+)(\.\w+)'
pattern = re.compile(urlRegex)
match = pattern.sub(r'\2\3', urls)
print(match)


google.com
coreyms.com
youtube.com
nasa.gov



<h3>Using re.compile()</h3>

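- re.compile(pattern, flags) returns a compiled pattern (regular expression) object.
- The pattern object can be reused to match strings against the pattern without recompiling it.
- Its methods mirror the module-level functions: search, match, and fullmatch return a match object (or None), finditer yields match objects, and findall returns a list of strings or group tuples. Match objects can then be used to extract groups.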
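
A minimal sketch of reusing one compiled pattern across several methods. The `data` string is a hypothetical stand-in for the contents of the file read earlier, and the capturing groups are added here only to show group extraction.

```python
import re

# Hypothetical stand-in for the file contents used in the notebook.
data = "Dave: 321-555-4321\nJane: 123.555.1234\n"

# Compile once, then reuse the pattern object.
phone = re.compile(r'(\d+)[-.](\d+)[-.](\d+)')

all_numbers = [m.group(0) for m in phone.finditer(data)]  # full matches
first = phone.search(data)                                # first match object
first_parts = first.groups()                              # tuple of captured groups
at_start = phone.match(data)                              # None: data starts with 'Dave'
masked = phone.sub('XXX-XXX-XXXX', data)                  # replace every match

print(all_numbers)
print(first_parts)
print(at_start)
print(masked)
```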

In [172]:
#Find all phone numbers from the given data
phNo = r'\d+[-.]\d+[-.]\d+'
pattern = re.compile(pattern = phNo) 
match = pattern.findall(data)
print(match)


['321-555-4321', '123.555.1234']


In [171]:
print('Pattern object methods')
print(dir(pattern))
print('Match object methods')
print(dir(pattern.search(data)))  # inspect a real match object; findall returns a plain list

Pattern object methods
['__class__', '__copy__', '__deepcopy__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'findall', 'finditer', 'flags', 'fullmatch', 'groupindex', 'groups', 'match', 'pattern', 'scanner', 'search', 'split', 'sub', 'subn']
Match object methods
['__class__', '__copy__', '__deepcopy__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'end', 'endpos', 'expand', 'group', 'groupdict', 'groups', 'lastgroup', 'lastindex', 'pos', 're', 'regs', 'span', 'start', 'string']


In [286]:
urls = """https://www.google.com
http://youtube.com
https://www.paytm.com
"""
reg = r'https?://(www\.)?(\w+)(\.\w+)'
for i in re.finditer(reg, urls):
    print(i)
print(re.findall(reg, urls))

<re.Match object; span=(0, 22), match='https://www.google.com'>
<re.Match object; span=(23, 41), match='http://youtube.com'>
<re.Match object; span=(42, 63), match='https://www.paytm.com'>
[('www.', 'google', '.com'), ('', 'youtube', '.com'), ('www.', 'paytm', '.com')]


In [None]:
impo