# Python Regular Expressions
*A regular expression is a sequence of characters that defines a search pattern; it is used to check whether that pattern occurs in a given text (string).*<br>
The Python module **re** provides full support for Perl-like regular expressions in Python.

## Raw Strings
To keep Python from interpreting backslashes before the regex engine sees them, we write patterns as raw strings: **r'expression'**.
> Characters that cannot be represented in a normal string (tabs, line feeds) are described using an escape sequence with a backslash \ (\t, \n).<br>
Strings prefixed with **r** or **R**, such as **r'...'** and **r"..."**, are called raw strings and treat backslashes \ as literal characters. In raw strings, escape sequences are not treated specially.

In [14]:
line = "a\tb\nA\tB"
raw_line = r"a\tb\nA\tB"
print('line:\n', line, '\nraw_line:\n', raw_line)

line:
 a	b
A	B 
raw_line:
 a\tb\nA\tB
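
Raw strings matter most for regex tokens that collide with Python's own escape sequences. A minimal sketch, using a made-up sample string, of what happens to the word-boundary token `\b` with and without the `r` prefix:

```python
import re

text = "rain in Spain"   # illustrative sample, not taken from the cells above

plain_pattern = "ain\b"    # Python converts \b to a backspace character (\x08)
raw_pattern = r"ain\b"     # the backslash is kept, so re sees the token \b

print(len(plain_pattern), len(raw_pattern))   # 4 5

plain_hits = re.findall(plain_pattern, text)  # looks for "ain" followed by a backspace
raw_hits = re.findall(raw_pattern, text)      # looks for "ain" at the end of a word
print(plain_hits)   # []
print(raw_hits)     # ['ain', 'ain']
```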


In [1]:
import re

In [3]:
someText = "111, 112, 113, 123, 120, 145, 158, 179, 111 \
           aaa, AAA, AAA BBB a123. \
           0_0, 0_^, ^_0, ^_^ \
           [] . a12 a1a c12 bAa b12"

## RegEx Functions
* .match() - match the pattern only at the beginning of a string
* .fullmatch() - match the pattern against the entire string
* .search() - return the first match of a pattern anywhere in a string
* .findall() - return all matches of a pattern in a string as a list
* .finditer() - return all matches of a pattern as an iterator of match objects
* .split() - return a list where the string has been split at each match
* .sub() - replace one or many matches with a string
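
The cells in this section do not exercise `.split()` and `.sub()`, so here is a small sketch of both. The sample string `csv_line` is invented for illustration.

```python
import re

csv_line = "111, 112,113 ,  123"   # hypothetical input with uneven spacing

# re.split: cut the string wherever the pattern matches
# (here: a comma with optional whitespace on either side).
parts = re.split(r'\s*,\s*', csv_line)
print(parts)         # ['111', '112', '113', '123']

# re.sub: replace every match with the replacement string.
masked = re.sub(r'\d', '#', csv_line)
print(masked)        # ###, ###,### ,  ###

# count limits how many matches are replaced.
masked_once = re.sub(r'\d+', 'N', csv_line, count=1)
print(masked_once)   # N, 112,113 ,  123
```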


In [45]:
re.match(r'(111)', someText).groups()

('111',)

In [44]:
result = re.search(r'111', someText)
print('Matched string:',result.group())
print('Starting position:', result.start())
print('Ending position:',result.end())
print('Positions:',result.span())
result

Matched string: 111
Starting position: 0
Ending position: 3
Positions: (0, 3)


<re.Match object; span=(0, 3), match='111'>

In [43]:
re.findall(r'111', someText)

['111', '111']

In [4]:
itera = re.finditer(r'11', someText)
for i in itera:
    print(i)

<re.Match object; span=(0, 2), match='11'>
<re.Match object; span=(5, 7), match='11'>
<re.Match object; span=(10, 12), match='11'>
<re.Match object; span=(40, 42), match='11'>


In [5]:
re.fullmatch(r'111', someText)  # returns None (no output): the pattern must match the entire string
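
To make the difference between the three anchoring behaviours concrete, the sketch below runs `match`, `search` and `fullmatch` on one short string (an illustrative value, not `someText`). All three return `None` on failure, so it is worth guarding before calling `.group()`.

```python
import re

text = "111, 112"   # short illustrative string

at_start = re.match(r'112', text)            # None: "112" is not at the beginning
anywhere = re.search(r'112', text)           # first occurrence, wherever it is
whole_short = re.fullmatch(r'111', text)     # None: pattern covers only part of the string
whole_ok = re.fullmatch(r'\d+, \d+', text)   # matches: pattern covers the whole string

print(at_start, whole_short)   # None None
print(anywhere.span())         # (5, 8)
print(whole_ok.group())        # 111, 112

# Guard against None before using the match object.
m = re.match(r'999', text)
first = m.group() if m else None
print(first)                   # None
```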

### Metacharacters
* [] - a set of characters	**"[a-m]"**	
* {} - exactly the specified number of occurrences	**"he.{2}o"**
* |	- either or	**"falls|stays"**	
* () - 	capture and group

### Character Sets [ ]: a specific set of characters to match
**r'[0-9]'** == [0123456789] <br>
**r'[a-z]'** == [abcdefghijklmnopqrstuvwxyz] <br>
**r'[A-Z]'** == [ABCDEFGHIJKLMNOPQRSTUVWXYZ] <br>

In [42]:
re.findall(r'11[23]', someText)

['112', '113']

In [41]:
# [^23] matches any character EXCEPT 2 and 3
re.findall(r'11[^23]', someText)

['111', '111']

In [40]:
re.findall(r'[a-c]12', someText)

['a12', 'a12', 'c12', 'b12']

In [39]:
re.findall(r'\[\]', someText)

['[]']

### Special Sequences

**\A** - *the specified characters are at the beginning of the string* **r"\AThe"** <br>
**\Z** - *the specified characters are at the end of the string* **r"Spain\Z"** <br>
**\b** - *the specified characters are at the beginning or at the end of a word* **r"\bain"** **r"ain\b"** <br>
**\B** - *the specified characters are present, but NOT at the beginning (or at the end) of a word* **r"\Bain"** **r"ain\B"** <br>
**\d** == [0-9] *any digit* <br>
**\D** == [^0-9] *any non-digit character* <br>
**\s** == [ \t\n\r\f\v] *any whitespace character (space, tab, newline, ...)* <br>
**\S** == [^ \t\n\r\f\v] *any character that is NOT whitespace* <br>
**\w** == [a-zA-Z0-9_] *any word character: letters, digits, and the underscore _* <br>
**\W** == [^a-zA-Z0-9_] *any non-word character* <br>
**.** - *any character except a newline* **"he..o"** <br>
**^** - *starts with* **"^hello"** <br>
**\$** - *ends with* **"planet\$"** <br>
**\*** - *zero or more occurrences* **"he.\*o"** <br>
**\+** - *one or more occurrences* **"he.+o"** <br>
**?** - *zero or one occurrence* **"he.?o"** <br>
*For str patterns, \d, \s and \w are Unicode-aware by default; the bracket forms above are their ASCII equivalents (the behaviour under the re.ASCII flag).*
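
A short sketch of the anchors on a two-line string may help, since `\A` and `^` only differ once a string has several lines. The sample text is invented, and `re.MULTILINE` is a standard `re` flag that the list above does not cover.

```python
import re

text = "The rain\nThe plain in Spain"   # illustrative two-line string

start_of_string = re.findall(r'\AThe', text)                # ['The']
caret_default = re.findall(r'^The', text)                   # ['The']
caret_multiline = re.findall(r'^The', text, re.MULTILINE)   # ['The', 'The']
end_of_string = re.findall(r'Spain\Z', text)                # ['Spain']

# \b needs a word boundary, \B needs the absence of one.
word_start = re.findall(r'\bain', text)    # []: no word starts with "ain"
inside_word = re.findall(r'\Bain', text)   # ['ain', 'ain', 'ain']: rain, plain, Spain

print(caret_default, caret_multiline)
print(word_start, inside_word)
```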

In [8]:
testText = "Beautiful 1is better than ugly.\n \
            Explicit is better than implicit.\n \
            Simple 23is better than complex.\n \
            Complex is better than complicated.\n \
            Flat is better than nested.\n \
            Sparse is better than dense.\n \
            \t Readability counts.\n \
            Special cases45 aren't special enough to break the rules.\n \
            Although practicality beats purity.\n \
            Errors should never pass silently.\n \
            Unless explicitly silenced.\n \
            In the face 758of amb34iguity, refuse the temptation to guess.\n \
            There should be one-- and preferably only one --obvious way to do it.\n \
            Although that way may not be obvious at fir4786st unless you're Dutch.\n \
            \t Now is better than never.\n \
            Although never is often better than *right* now.\n \
            If the implem7837entation is h783ard to explain, it's a bad idea.\n \
            If the implementation is easy to explain, it may be a good idea.\n \
            Namespaces are one honking great idea -- let's do more of those!"

In [60]:
re.findall(r'\AB', testText) #Beautiful

['B']

In [61]:
re.findall(r'\bc', testText) #complex, ...

['c', 'c', 'c', 'c']

In [63]:
re.findall(r'\d.', testText)

['1i', '23', '45', '75', '8o', '34', '47', '86', '78', '37', '78', '3a']

In [66]:
re.findall(r'\D\d', testText)

[' 1', ' 2', 's4', ' 7', 'b3', 'r4', 'm7', 'h7']

In [72]:
re.findall(r'\s\d', testText)

[' 1', ' 2', ' 7']

In [70]:
re.findall(r'\d\S\b', testText)  # a digit, then a non-whitespace character that ends a word

['45']

In [73]:
re.findall(r'\S$', testText)

['!']

In [83]:
re.findall(r'\d\w*', testText)

['1is', '23is', '45', '758of', '34iguity', '4786st', '7837entation', '783ard']

In [85]:
re.findall(r'.*?', testText)[0:10]  # lazy .*? takes as little as possible: empty matches alternate with single characters

['', 'B', '', 'e', '', 'a', '', 'u', '', 't']

### Mastering Quantifiers
#### Greedy: As Many As Possible 

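**{,n}** - at most n occurrences (0 to n)<br>
**{m,}** - at least m occurrences<br>
**{m,n}** - from m to n occurrences<br>
**\d{3}** == \d\d\d <br>
**\d{0,}** == \d*<br>
**\d{1,}** == \d+<br>
**\d{0,1}** == \d?<br>
**\d{2,5}** - from 2 to 5 digits, as many as possible<br>
*No spaces inside the braces: Python's re reads **\d{2, 5}** as \d followed by the literal text "{2, 5}".*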
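
The equivalences in this list can be checked directly. The sketch below uses an invented digit string and also shows that a quantifier written with a space inside the braces is not treated as a quantifier by Python's `re`.

```python
import re

sample = "7 42 123 98765"   # illustrative digits

same_star = re.findall(r'\d{0,}', sample) == re.findall(r'\d*', sample)
same_plus = re.findall(r'\d{1,}', sample) == re.findall(r'\d+', sample)
same_opt = re.findall(r'\d{0,1}', sample) == re.findall(r'\d?', sample)
print(same_star, same_plus, same_opt)   # True True True

# Greedy {2,3}: take three digits when available, otherwise two.
two_to_three = re.findall(r'\d{2,3}', sample)
print(two_to_three)   # ['42', '123', '987', '65']

# With a space, the braces are literal text, so nothing in sample matches.
with_space = re.findall(r'\d{1, }', sample)
print(with_space)     # []
```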


In [92]:
re.findall(r'\d{3}', testText)

['758', '478', '783', '783']

In [110]:
re.findall(r'\w{9,12}', testText)

['Beautiful',
 'complicated',
 'Readability',
 'practicality',
 'explicitly',
 'amb34iguity',
 'temptation',
 'preferably',
 'fir4786st',
 'implem7837en',
 'implementati',
 'Namespaces']

In [111]:
re.findall(r'\d{2,4}', testText)

['23', '45', '758', '34', '4786', '7837', '783']

In [113]:
re.findall(r'\d\d?', testText) 

['1', '23', '45', '75', '8', '34', '47', '86', '78', '37', '78', '3']

In [115]:
re.findall(r'\w\d+', testText) 

['23', 's45', '758', 'b34', 'r4786', 'm7837', 'h783']

#### Lazy: As Few As Possible
**{m,n}?** - m to n, as few as possible <br>
**{,n}?** - up to n<br>
**{m,}?** - m or more<br>
**\*?** - 0 or more<br>
**+?** - 1 or more<br>
**??** - 0 or 1<br>
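
The contrast between greedy and lazy matching is easiest to see on delimited text. A minimal sketch, assuming a small HTML-like string invented for this example:

```python
import re

html = "<b>bold</b> and <i>italic</i>"   # illustrative markup

# Greedy: .+ runs to the end and backtracks to the LAST '>'.
greedy = re.findall(r'<.+>', html)
# Lazy: .+? stops at the FIRST '>' that lets the pattern succeed.
lazy = re.findall(r'<.+?>', html)

print(greedy)   # ['<b>bold</b> and <i>italic</i>']
print(lazy)     # ['<b>', '</b>', '<i>', '</i>']
```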

In [118]:
re.findall(r'\w{9,12}?', testText)

['Beautiful',
 'complicat',
 'Readabili',
 'practical',
 'explicitl',
 'amb34igui',
 'temptatio',
 'preferabl',
 'fir4786st',
 'implem783',
 '7entation',
 'implement',
 'Namespace']

In [120]:
re.findall(r'\d\d*?', testText)[0:5]

['1', '2', '3', '4', '5']

### Capturing Groups
> A group is a part of a regex pattern enclosed in the parentheses **()** metacharacters.

**(regex)** - creates a numbered capturing group<br>
**(?P\<name\>regex)** - creates a named capturing group<br>
**(?P\<name\>regex)(?P=name)** - matches the contents of a previously captured named group<br>
**(?:regex)** - creates a non-capturing group<br>
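
The group syntaxes above can be compared side by side. This sketch uses invented sample strings; the numbered backreference `\1` is the unnamed counterpart of `(?P=name)`.

```python
import re

text = "this is is a test test"   # illustrative sentence with repeated words

# (regex): a numbered group; \1 refers back to what group 1 captured.
numbered = re.findall(r'\b(\w+)\s+\1\b', text)
print(numbered)                # ['is', 'test']

# (?P<name>regex) + (?P=name): the same idea with a named group.
named = re.search(r'\b(?P<word>\w+)\s+(?P=word)\b', text)
print(named.group(), named.group('word'))   # is is is

# (?:regex): groups for the quantifier only, nothing is captured,
# so findall returns the whole match instead of the last captured "ab".
captured = re.findall(r'(ab)+', "ababab ab")
uncaptured = re.findall(r'(?:ab)+', "ababab ab")
print(captured)                # ['ab', 'ab']
print(uncaptured)              # ['ababab', 'ab']
```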

In [6]:
re.findall(r'(\d+)(\w{2})', testText)

[('1', 'is'),
 ('23', 'is'),
 ('758', 'of'),
 ('34', 'ig'),
 ('4786', 'st'),
 ('7837', 'en'),
 ('783', 'ar')]

In [50]:
result = re.search(r'(\d{2})(\d)', testText)
result.groups()

('75', '8')

In [5]:
result = re.findall(r'(?P<repeats>\w)(?P=repeats)', testText)
result  # doubled characters, as in "better", "Errors", "Unless"

['t', 't', 't', 't', 't', 't', 'r', 's', 's', 's', 's', 't', 't', 'o']

In [14]:
result = re.search(r'(?P<word>[a-zA-Z]+)(?P<num_rep>\d+)', testText)
result.group(1)

'cases'

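#### (?(n)yes|no)
Conditional pattern: if group *n* matched, continue with the *yes* pattern; otherwise continue with the *no* pattern.<br>
if a: b<br>
else: c<br>
For example, **(a)?(?(1)b|co)** matches "ab" when group 1 captured "a", and "co" when it did not.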
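
A common use of the conditional is to require a closing delimiter only when the opening one was present. The task below (numbers with optional parentheses) is a hypothetical example, not taken from the notebook's data.

```python
import re

# (\()?      group 1: an optional opening parenthesis
# (?(1)\))   if group 1 matched, a closing parenthesis is required;
#            the "no" branch is omitted, so nothing is required otherwise
balanced = re.compile(r'(\()?\d+(?(1)\))')

samples = ["123", "(123)", "(123", "123)"]
results = {s: bool(balanced.fullmatch(s)) for s in samples}
print(results)
# {'123': True, '(123)': True, '(123': False, '123)': False}
```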

In [20]:
result = re.findall(r'(a)?(?(1)b|co)', testText)
result

['', '', 'a', '', 'a']

In [14]:
pattern = r"(\w+\s+)is(\s\w+)"
result = re.search(pattern, testText)
# for entire match
print(result.group())
# print(result.group(0)) prints the same string
 
# for the first parenthesized subgroup
print(result.group(1))
 
# for the second parenthesized subgroup
print(result.group(2))


Explicit is better
Explicit 
 better


#### Comment group (?#...)
The text inside the parentheses is ignored by the regex engine.

In [6]:
re.search( r"(\w+\s+)(?#word+space)is(\s\w+)", testText).group()

'Explicit is better'

#### Non-capturing group (?:)

In [15]:
re.findall( r"(\w+\s+)(?:is|than)(\s\w+)", testText)

[('better ', ' ugly'),
 ('Explicit ', ' better'),
 ('better ', ' complex'),
 ('Complex ', ' better'),
 ('Flat ', ' better'),
 ('Sparse ', ' better'),
 ('Now ', ' better'),
 ('never ', ' often'),
 ('implem7837entation ', ' h783ard'),
 ('implementation ', ' easy')]

In [16]:
re.findall( r"(\w+\s+)(is|than)(\s\w+)", testText)

[('better ', 'than', ' ugly'),
 ('Explicit ', 'is', ' better'),
 ('better ', 'than', ' complex'),
 ('Complex ', 'is', ' better'),
 ('Flat ', 'is', ' better'),
 ('Sparse ', 'is', ' better'),
 ('Now ', 'is', ' better'),
 ('never ', 'is', ' often'),
 ('implem7837entation ', 'is', ' h783ard'),
 ('implementation ', 'is', ' easy')]

#### Positive Lookahead	 a(?=b) 
Matches "a" only if it is followed by "b"; "b" itself is not part of the match.

In [18]:
re.findall( r"(\w+\s)(?=than)", testText)

['better ',
 'better ',
 'better ',
 'better ',
 'better ',
 'better ',
 'better ',
 'better ']

#### Negative Lookahead	 a(?!b)
Matches "a" only if it is NOT followed by "b".

In [22]:
re.findall( r"(\w{9,15}\s)(?!than)", testText)

['Beautiful ',
 'Readability ',
 'practicality ',
 'explicitly ',
 'temptation ',
 'preferably ',
 'fir4786st ',
 'lem7837entation ',
 'implementation ',
 'Namespaces ']

#### Positive Lookbehind	(?<=b)a
Matches "a" only if it is preceded by "b".

In [35]:
re.findall( r"(?<=than\s)(\w+)", testText)

['ugly', 'implicit', 'complex', 'complicated', 'nested', 'dense', 'never']

#### Negative Lookbehind	 (?<!b)a
Matches "a" only if it is NOT preceded by "b".

In [42]:
re.findall( r"(?<!\s)th\w+", testText)

['though', 'though', 'though']
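
A further sketch of both lookbehind forms on an invented price string. It also shows a limitation of the standard `re` module worth knowing: lookbehind patterns must have a fixed width, so a quantifier such as `\s+` inside one raises `re.error`.

```python
import re

prices = "cost $30 or 40"   # illustrative string

after_dollar = re.findall(r'(?<=\$)\d+', prices)          # digits preceded by "$"
not_after_dollar = re.findall(r'(?<!\$)\b\d+', prices)    # whole numbers not preceded by "$"
print(after_dollar)       # ['30']
print(not_after_dollar)   # ['40']

# Variable-width lookbehind is rejected at compile time.
try:
    re.compile(r'(?<=than\s+)\w+')
    fixed_width_required = False
except re.error:
    fixed_width_required = True
print(fixed_width_required)   # True
```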

**(?!pat1)(?=pat2)** - multiple assertions can be specified next to each other in any order,
because they mark a matching location without consuming characters.
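
Stacked assertions are often used to check several independent conditions at the same position. The password rule below is a hypothetical example chosen for illustration.

```python
import re

# Hypothetical rule: at least 8 characters, containing a digit and a
# lowercase letter, with no whitespace anywhere.
# Each assertion is tested at position 0 and consumes nothing,
# so the final .{8,} still sees the whole string.
rule = re.compile(r'(?=.*\d)(?=.*[a-z])(?!.*\s).{8,}')

candidates = ["passw0rd", "password", "pass w0rd1", "p4ss"]
checks = {pw: bool(rule.fullmatch(pw)) for pw in candidates}
print(checks)
# {'passw0rd': True, 'password': False, 'pass w0rd1': False, 'p4ss': False}
```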


**((?!pat).)\*** - negate a grouping: matches a run of characters in which pat never starts, similar to a negated character class

In [6]:
re.search( r"((?!is).)*than", testText) 

<re.Match object; span=(12, 25), match='s better than'>

### Or | Operator

In [9]:
re.findall( r"(compl)(i|e)", testText) 

[('compl', 'e'), ('compl', 'i')]
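
Alternation has the lowest precedence of all regex operators, which is why the choice in the cell above is wrapped in a group. A small sketch with invented words shows what happens when the group is left out:

```python
import re

text = "compile complex complicated"   # illustrative words

# Without a group the alternatives are the whole sides: "compli" OR "e".
ungrouped = re.findall(r'compli|e', text)
# A (non-capturing) group limits the choice to a single character.
grouped = re.findall(r'compl(?:i|e)\w*', text)

print(ungrouped)   # ['e', 'e', 'compli', 'e']
print(grouped)     # ['complex', 'complicated']
```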

### Match Objects
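* **Match.expand(template)** - return the string obtained by backslash substitution on the template (\1, \g\<name\>)
* **Match.group([group1, ...])** - return one or more subgroups of the match
* **Match.\_\_getitem\_\_(g)** - same as group(g); allows the shorthand match[g]
* **Match.groups(default=None)** - return a tuple with all subgroups of the match
* **Match.groupdict(default=None)** - return a dictionary of all named subgroups
* **Match.start([group])**, **.end([group])** - return the indices of the start and end of the substring matched by group
* **Match.span([group])** - return the tuple (start, end) for group
* **Match.pos** - the index into the string at which the RE engine started looking for a match
* **Match.endpos** - the index into the string beyond which the RE engine will not go
* **Match.lastindex** - the integer index of the last matched capturing group, or None if no group matched
* **Match.re** - the regular expression object that produced this match
* **Match.string** - the string passed to match() or search()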
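
A compact sketch of a few of these attributes on a short invented string, including `lastindex` and the `default` argument of `groups()`:

```python
import re

m = re.search(r'(?P<word>[a-zA-Z]+)(?P<num>\d+)', "Special cases45 aren't")

whole = m[0]                  # same as m.group(0)
word = m['word']              # same as m.group('word')
num_span = m.span('num')      # (start, end) of the named group
last = m.lastindex            # index of the last group that matched
swapped = m.expand(r'\g<num>-\g<word>')
print(whole, word, num_span, last, swapped)
# cases45 cases (13, 15) 2 45-cases

# Groups that did not take part in the match are None unless a default is given.
optional = re.search(r'(\d+)(px)?', "width 10")
print(optional.groups())             # ('10', None)
print(optional.groups(default=''))   # ('10', '')
```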

In [17]:
result = re.search(r"(\w+\s+)(is|than)(\s\w+)", testText)
result.expand(r'\1 - \3')

'better  -  ugly'

In [20]:
result.group(1, 2, 3)

('better ', 'than', ' ugly')

In [22]:
result = re.search(r'(?P<word>[a-zA-Z]+)(?P<num_rep>\d+)', testText)
result['word'], result['num_rep']

('cases', '45')

In [23]:
result.groups()

('cases', '45')

In [24]:
result.groupdict()

{'word': 'cases', 'num_rep': '45'}

In [62]:
result = re.search(r"\d+(?=is)", testText)
testText[:result.start()] + testText[result.end()]  # text before the match + the single character right after it

'Beautiful i'

In [77]:
result = re.finditer(r'better\s\w+\s(\w+)', testText)
for match in result:
    index = match.span(1)
    print(testText[index[0]:index[1]])

ugly
implicit
complex
complicated
nested
dense
never


In [80]:
result = re.search(r"\d+(?=is)", testText)
result.pos, result.endpos

(0, 1081)

In [92]:
result = re.search(r'(better)(\s\w+\s)(\w+)', testText)
result.re

re.compile(r'(better)(\s\w+\s)(\w+)', re.UNICODE)

In [93]:
result.string

"Beautiful 1is better than ugly.\n             Explicit is better than implicit.\n             Simple 23is better than complex.\n             Complex is better than complicated.\n             Flat is better than nested.\n             Sparse is better than dense.\n             \t Readability counts.\n             Special cases45 aren't special enough to break the rules.\n             Although practicality beats purity.\n             Errors should never pass silently.\n             Unless explicitly silenced.\n             In the face 758of amb34iguity, refuse the temptation to guess.\n             There should be one-- and preferably only one --obvious way to do it.\n             Although that way may not be obvious at fir4786st unless you're Dutch.\n             \t Now is better than never.\n             Although never is often better than *right* now.\n             If the implem7837entation is h783ard to explain, it's a bad idea.\n             If the implementation is easy to explain,