<a href = 'https://docs.python.org/3/library/re.html#re-objects'>DOCUMENTATION</a>,
<a href = 'https://docs.python.org/3/howto/regex.html#regex-howto'>Tutorial</a>

In [1]:
import re

# Syntax


The special sequences consist of "\\" and a character from the following list:
    
<table>
<tr><td>\d</td><td>  Matches any decimal digit; equivalent to the set [0-9] in bytes patterns or with the ASCII flag. In str patterns it also matches other Unicode decimal digits.</td></tr>
<tr><td>\D</td><td>  The complement of \d. It matches any non-digit character; equivalent to the set [^0-9] in ASCII mode.</td></tr>
<tr><td>\s</td><td>  Matches any whitespace character; equivalent to [ \t\n\r\f\v] in bytes patterns or with the ASCII flag. In str patterns it also matches other Unicode whitespace.</td></tr>
<tr><td>\S</td><td>  The complement of \s. It matches any non-whitespace character; equivalent to [^ \t\n\r\f\v] in ASCII mode.</td></tr>
<tr><td>\w</td><td>  Matches any word character (alphanumeric or underscore); equivalent to [a-zA-Z0-9_] in bytes patterns or with the ASCII flag. In str patterns it matches Unicode word characters. With LOCALE (bytes patterns only), it matches the underscore plus characters considered alphanumeric in the current locale.</td></tr>
<tr><td>\W</td><td>  Matches the complement of \w.</td></tr>
<tr><td>\b</td><td>  Matches the empty string, but only at the start or end of a word.</td></tr>
<tr><td>\B</td><td>  Matches the empty string, but not at the start or end of a word.</td></tr>
<tr><td>\\  </td><td>Matches a literal backslash.</td></tr>
</table>


# re.DOTALL

**`.`** matches any character except a newline (`\n`). Pass **`flags = re.DOTALL`** to make it match newlines as well.
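A small sketch of the difference; the sample string and variable names are made up for illustration.

```python
import re

text = 'first line\nsecond line'

# By default '.' stops at the newline, so only the first line is matched.
default_match = re.search(r'first.*', text).group()

# With re.DOTALL (alias re.S) '.' also matches '\n'.
dotall_match = re.search(r'first.*', text, flags=re.DOTALL).group()

print(repr(default_match))  # 'first line'
print(repr(dotall_match))   # 'first line\nsecond line'
```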

# []

Special characters such as **`*`**, **`?`**, **`+`** lose their special meaning inside **`[]`** and are matched literally. The exceptions are `\`, `]`, a leading `^`, and `-` between two characters.

In [355]:
re.findall(r'[*?+()]', '+*^()')

['+', '*', '(', ')']
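The characters that keep a special meaning inside a class depend on their position. A short sketch with made-up inputs:

```python
import re

# '^' negates the class only when it is the first character.
not_digits = re.findall(r'[^0-9]', 'a1b2')
literal_caret = re.findall(r'[0-9^]', '1^2')

# '-' is literal when placed first or last; between two characters it forms a range.
literal_hyphen = re.findall(r'[.-]', '1-2.3')

print(not_digits)      # ['a', 'b']
print(literal_caret)   # ['1', '^', '2']
print(literal_hyphen)  # ['-', '.']
```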

# Non greedy

**`*?`**, **`+?`**, **`??`**

In [3]:
html = '<a><b></b></a>'
html

'<a><b></b></a>'

In [5]:
# greedy: match as much as possible
re.search(r'<.*>', html).group()

'<a><b></b></a>'

In [7]:
# non-greedy: match as little as possible
re.search(r'<.*?>', html).group()

'<a>'

In [359]:
#greedy vs non greedy
re.search('ab+', 'abbbb').group(), re.search('ab+?', 'abbbb').group()

('abbbb', 'ab')

<hr>

**`{m,n}?`**

In [9]:
s = '123456abc'

In [12]:
#greedy
re.search(r'\d{2,5}', s).group()

'12345'

Non greedy: Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as few repetitions as possible.

In [13]:
#non greedy
re.search(r'\d{2,5}?', s).group()

'12'

# Lookahead, lookbehind assertions

**`(?=...)`**, **`(?!...)`**, **`(?<=...)`**, **`(?<!...)`**

Match every file name that does not have the bat extension, given names such as misc.bat, die.exe, king.png

In [None]:
all_file_pattern = r'.*\.[^.]*$' # match any file of the form name.extension
not_bat_pattern =  r'.*\.(?!bat$)[^.]*$' # after `name.`, the extension must not be exactly bat

In [None]:
# match all files whose extension is neither bat nor exe
pattern = r'.*\.(?!bat$|exe$)[^.]*$'
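To see the negative lookahead at work, here is a sketch that filters a list of file names. The names `archive.tar.gz` and `notes.batch` are added for illustration; note that `notes.batch` is kept because its extension is not exactly `bat`.

```python
import re

not_bat_or_exe = re.compile(r'.*\.(?!bat$|exe$)[^.]*$')

files = ['misc.bat', 'die.exe', 'king.png', 'archive.tar.gz', 'notes.batch']

# Keep only the names for which the pattern matches.
kept = [name for name in files if not_bat_or_exe.match(name)]
print(kept)  # ['king.png', 'archive.tar.gz', 'notes.batch']
```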

<hr>

Find strings that:
* begin with at least one digit
* end with at least one digit
* contain only letters (a-z, A-Z) between the digits, starting with pikachu

Return the string between the numbers,  
e.g. 31pikachuvn31 ---> pikachuvn
   

In [56]:
re.search(r'^\d+(?=pikachu)([a-zA-Z]*)\d+$', '31pikachuvnchampions53').group(1)

'pikachuvnchampions'

How does this work?
* First, the RE matches the digits at the beginning
* Then the lookahead checks whether these digits are followed by `pikachu`
* If not, there is no match
* If so, matching continues from the index right after the last digit (index 2 in this case), because a lookahead does not consume any characters
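Lookbehind assertions work the same way but look to the left of the current position. A minimal sketch, with a made-up price string:

```python
import re

prices = 'apple $30, pear 45, plum $7'

# (?<=\$) requires a '$' immediately before the digits but does not include it in the match.
with_dollar = re.findall(r'(?<=\$)\d+', prices)

# (?<![\$\d]) rejects digits that are preceded by '$' or by another digit.
without_dollar = re.findall(r'(?<![\$\d])\d+', prices)

print(with_dollar)     # ['30', '7']
print(without_dollar)  # ['45']
```

The extra `\d` in the negative lookbehind keeps the engine from matching the `0` of `$30`, which is preceded by `3` rather than by `$`.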

# Method

In [2]:
dir(re)

['A',
 'ASCII',
 'DEBUG',
 'DOTALL',
 'I',
 'IGNORECASE',
 'L',
 'LOCALE',
 'M',
 'MULTILINE',
 'Match',
 'Pattern',
 'RegexFlag',
 'S',
 'Scanner',
 'T',
 'TEMPLATE',
 'U',
 'UNICODE',
 'VERBOSE',
 'X',
 '_MAXCACHE',
 '__all__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 '__version__',
 '_cache',
 '_compile',
 '_compile_repl',
 '_expand',
 '_locale',
 '_pickle',
 '_special_chars_map',
 '_subx',
 'compile',
 'copyreg',
 'enum',
 'error',
 'escape',
 'findall',
 'finditer',
 'fullmatch',
 'functools',
 'match',
 'purge',
 'search',
 'split',
 'sre_compile',
 'sre_parse',
 'sub',
 'subn',
 'template']

# Create a Regex Object

```python
re.compile(pattern, flags=0)
```

In [3]:
pattern = re.compile(r'(\d+) ([\w\s]+)')
txt = '31 VN Pikachu'
sobj = pattern.search(txt)
sobj

<re.Match object; span=(0, 13), match='31 VN Pikachu'>

In [4]:
sobj.group()

'31 VN Pikachu'

In [5]:
sobj.group(1)

'31'

In [6]:
sobj.span(1)

(0, 2)

In [7]:
sobj.group(2)

'VN Pikachu'

# Search

Search for the first occurrence of the pattern anywhere in the string

In [8]:
txt = 'I am VN Pikachu, clan VN Champions, level 30, and I have 12 tanks'

In [9]:
re.search(r'\d+', txt)

<re.Match object; span=(42, 44), match='30'>

# Match

Match the pattern only at the beginning of the string

In [57]:
re.match('is', 'isane') # 'isane' starts with 'is'

<re.Match object; span=(0, 2), match='is'>

In [60]:
re.match('is', 'this') # no match, because 'this' does not start with 'is'
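For comparison, a sketch that runs `re.match`, `re.search`, and `re.fullmatch` on the same kind of input; the variable names are only for illustration.

```python
import re

# match: the pattern must start at index 0.
match_at_start = re.match(r'is', 'this')

# search: the pattern may start anywhere.
found_anywhere = re.search(r'is', 'this')

# fullmatch: the pattern must cover the whole string.
whole_prefix_only = re.fullmatch(r'is', 'isane')
whole_string = re.fullmatch(r'is', 'is')

print(match_at_start)         # None
print(found_anywhere.span())  # (2, 4)
print(whole_prefix_only)      # None
print(whole_string.group())   # is
```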

# Searching at the beginning and end of lines with MULTILINE

In [12]:
txt = 'VN Champions\n30.VN Pikachu\n34.Tank Cao'
print(txt)

VN Champions
30.VN Pikachu
34.Tank Cao


Search for a number at the beginning of a line

In [13]:
re.search(r'^\d+', txt) # without MULTILINE, ^ matches only at the beginning of the string

It returns nothing, even though there is a number at the beginning of the second line. This is because, by default, <code>^</code> matches only at the very beginning of the string, not at the beginning of each line.

Solution: use the flag <code>re.MULTILINE</code>
    

In [14]:
re.search(r'^\d+', txt, flags = re.MULTILINE)

<re.Match object; span=(13, 15), match='30'>

Find the leading number of every line

In [15]:
re.findall(r'^\d+', txt, re.MULTILINE)

['30', '34']

<hr>

In [9]:
clan = '31 VN Pikachu\n34 Tank Cao'
re.findall(r'^\d+', clan, flags = re.MULTILINE)

['31', '34']
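The same flag changes `$`, which then matches at the end of every line instead of only at the end of the string. A sketch that reuses the three-line sample from above:

```python
import re

txt = 'VN Champions\n30.VN Pikachu\n34.Tank Cao'

# Without MULTILINE, '$' only matches at the end of the whole string.
last_word_default = re.findall(r'\w+$', txt)

# With MULTILINE, '$' also matches just before each newline.
last_word_per_line = re.findall(r'\w+$', txt, flags=re.MULTILINE)

print(last_word_default)   # ['Cao']
print(last_word_per_line)  # ['Champions', 'Pikachu', 'Cao']
```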

# Word boundaries

`\b` and `\B`

`\b` matches at the beginning or the end of a word; `\B` matches at every other position

In [20]:
lyric = 'there is a girl'

Let's check whether the word 'the' occurs in the lyric

In [21]:
#Bad approach
re.search('the', lyric)

<re.Match object; span=(0, 3), match='the'>

Because 'the' is a substring of 'there', search returns a match object even though the word 'the' does not occur in the lyric
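A better approach is to wrap the word in `\b` on both sides. A minimal sketch; the second sentence is made up to show a positive case.

```python
import re

lyric = 'there is a girl'

# 'the' inside 'there' is followed by a word character, so the closing \b fails.
in_lyric = re.search(r'\bthe\b', lyric)

# Here 'the' stands alone as a word.
in_other = re.search(r'\bthe\b', 'she is the girl')

print(in_lyric)         # None
print(in_other.span())  # (7, 10)
```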

Find all words ending with es

In [23]:
#using word boundary
re.findall(r'\b[a-z]*es\b', 'there are boxes and 123es axes')

['boxes', 'axes']

In [24]:
#not using word boundary
re.findall(r'[a-z]*es', 'there are boxes and 123es axes')

['boxes', 'es', 'axes']

# Back reference

In [25]:
s1 = '31 31 Pikachu'
s2 = '31 32 Nanano'

Check whether the first and the second number are the same

In [33]:
re.search(r'(\d+) \1', s1)

<re.Match object; span=(0, 5), match='31 31'>

In [34]:
re.search(r'(\d+) \1', s2)

# Search Object

In [73]:
txt = 'VN Pikachu, Level 30, Clan VN Champions'
search_obj = re.search(r'\d+', txt) # this returns a re.Match object
search_obj 

<re.Match object; span=(18, 20), match='30'>

In [98]:
dir(search_obj)

['__class__',
 '__copy__',
 '__deepcopy__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'end',
 'endpos',
 'expand',
 'group',
 'groupdict',
 'groups',
 'lastgroup',
 'lastindex',
 'pos',
 're',
 'regs',
 'span',
 'start',
 'string']

```python
search_obj.span(group=0, /)
```

In [17]:
#the starting and ending index of the matched pattern
search_obj.span()

(18, 20)

```python
search_obj.start(group=0, /)
```

In [19]:
#the starting index of the matched pattern
search_obj.start()

18

In [20]:
#the ending index
search_obj.end()

20

In [21]:
# the matched text
search_obj.group()

'30'

<hr>

In [15]:
obj = re.search(r'(\d+) ([\w ]+)', '31 VN Pikchu')
obj

<re.Match object; span=(0, 12), match='31 VN Pikchu'>

In [16]:
#get the span of the first captured group
obj.span(1)

(0, 2)

In [17]:
#get the ending of the second captured group
obj.end(2)

12

In [99]:
obj.string

'31 VN Pikchu'

In [100]:
#the regular expression used
obj.re

re.compile(r'(\d+) ([\w ]+)', re.UNICODE)

In [103]:
#The value of pos which was passed to the search() or match() method of a regex object. 
#This is the index into the string at which the RE engine started looking for a match.
obj.pos

0

In [104]:
#The value of endpos which was passed to the search() or match() method of a regex object.
#This is the index into the string beyond which the RE engine will not go.
obj.endpos

12

**`Match.lastgroup`**  
The name of the last matched capturing group, or None if the group didn’t have a name, or if no group was matched at all.

In [143]:
# No group is named, so lastgroup is None
re.search(r'(\d+) ([\w ]+)', '31 VN Pikachu').lastgroup

In [142]:
# Two named groups match: the first is called level and the second is called name; lastgroup is the last one
re.search(r'(?P<level>\d+) (?P<name>[\w ]+)', '31 VN Pikachu').lastgroup

'name'

**`Match.lastindex`**

The integer index of the last matched capturing group, or None if no group was matched at all. For example, the expressions (a)b, ((a)(b)), and ((ab)) will have lastindex == 1 if applied to the string 'ab', while the expression (a)(b) will have lastindex == 2, if applied to the same string.

In [145]:
re.search(r'(\d+) ([\w ]+)', '31 VN Pikachu').lastindex

2
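The four expressions from the description can be checked directly. A sketch; the dictionary name is only for illustration.

```python
import re

patterns = [r'(a)b', r'((a)(b))', r'((ab))', r'(a)(b)']

# lastindex is the index of the last group to close, so nested groups report the outer one.
lastindex_by_pattern = {p: re.search(p, 'ab').lastindex for p in patterns}

for pattern, index in lastindex_by_pattern.items():
    print(pattern, index)
```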

# Captured Group

Let's find the account name, level, and clan name

In [83]:
search_object = re.search(r'^([\w\s]+), Level (\d+), Clan ([\w\s]+)$', txt)
search_object

<re.Match object; span=(0, 39), match='VN Pikachu, Level 30, Clan VN Champions'>

In [84]:
search_object.group()

'VN Pikachu, Level 30, Clan VN Champions'

In [85]:
search_object.group(0) # group(0) is the same as group(): it returns the entire match as a string

'VN Pikachu, Level 30, Clan VN Champions'

The first captured group

In [86]:
search_object.group(1)

'VN Pikachu'

The second captured group

In [87]:
search_object.group(2)

'30'

The third captured group

In [88]:
search_object.group(3)

'VN Champions'

<b>group(m, n)</b> is equivalent to <b>(group(m), group(n))</b>

In [89]:
search_object.group(2,3)

('30', 'VN Champions')

<hr>

You can also access the value of a captured group with **`Match[index]`** instead of **`Match.group(index)`**

In [92]:
Match = re.search(r'(\d+) ([\w ]+)', '31 VN Pikachu')
Match

<re.Match object; span=(0, 13), match='31 VN Pikachu'>

In [93]:
Match[0]

'31 VN Pikachu'

In [94]:
Match[1]

'31'

In [95]:
Match[2]

'VN Pikachu'

<hr>

In [82]:
search_obj.groups() # search_obj came from a pattern without groups, so the tuple is empty

()

### Match.groups()

In [29]:
# get all captured groups as a tuple
search_object.groups()

('VN Pikachu', '30', 'VN Champions')

## Exercise

A very intuitive example are XML or HTML tags. E.g. let's assume we have a file (called "tags.txt") with content like this:

<composer> Wolfgang Amadeus Mozart </composer>




We want to rewrite this text automatically to

composer: Wolfgang Amadeus Mozart


In [30]:
html_file = '''<composer> Wolfgang Amadeus Mozart </composer>'''

In [31]:
search_object = re.search(r'<(\w+)>(.*)</\1>', html_file)
search_object

<re.Match object; span=(0, 46), match='<composer> Wolfgang Amadeus Mozart </composer>'>

In [32]:
f'{search_object.group(1)}: {search_object.group(2)}'

'composer:  Wolfgang Amadeus Mozart '

## Exercise

In [33]:
target = ["555-8396 Neu, Allison", 
     "Burns, C. Montgomery", 
     "555-5299 Putz, Lionel",
     "555-7334 Simpson, Homer Jay"]
for info in target:
    obj = re.search(r'([\d-]*)\s?(\w+), (\w+)', info)
    print(obj.group(3), obj.group(2), obj.group(1))
  

Allison Neu 555-8396
C Burns 
Lionel Putz 555-5299
Homer Simpson 555-7334


# Exercise

Show the span of the second captured group and the ending index of the first captured group

In [34]:
obj = re.search(r'([\w\s]+), level (\d+)', 'VN Pikachu, level 30')
obj.group(1,2)

('VN Pikachu', '30')

In [35]:
obj.span(2)

(18, 20)

In [36]:
obj.end(1)

10

## Exercise: Phonebook

In [61]:
text = """Ross McFluff: 834.345.1254 155 Elm Street

Ronald Heathmore: 892.345.3428 436 Finley Avenue
Frank Burger: 925.541.7625 662 South Dogwood Way


Heather Albrecht: 548.326.4584 919 Park Place"""

In [62]:
lines = re.split(r'\n+', text)
lines

['Ross McFluff: 834.345.1254 155 Elm Street',
 'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
 'Frank Burger: 925.541.7625 662 South Dogwood Way',
 'Heather Albrecht: 548.326.4584 919 Park Place']

In [63]:
for line in lines:
    res = re.split(r':? ', line)
    print(res)

['Ross', 'McFluff', '834.345.1254', '155', 'Elm', 'Street']
['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley', 'Avenue']
['Frank', 'Burger', '925.541.7625', '662', 'South', 'Dogwood', 'Way']
['Heather', 'Albrecht', '548.326.4584', '919', 'Park', 'Place']


We want 'Elm Street', not 'Elm', 'Street'. So let's limit the maximum number of splits to 4

In [64]:
for line in lines:
    res = re.split(r':? ', line, maxsplit=4)
    print(res)

['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street']
['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue']
['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way']
['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']


# Named Groups and Backreferences

**`(?P<name>...)`**

In [18]:
data = 'VN Pikachu, level 30, clan VN Champions'
# numbered groups
obj1 = re.search(r'([\w\s]+), level (\d+), clan ([\w\s]+)', data)
obj1.group(1,2,3)

('VN Pikachu', '30', 'VN Champions')

In [19]:
# named groups
obj2 = re.search(r'(?P<name>[\w\s]+), level (?P<level>\d+), clan (?P<clan>[\w\s]+)', data)
obj2.group()

'VN Pikachu, level 30, clan VN Champions'

In [20]:
obj2.group('name')

'VN Pikachu'

In [21]:
obj2.group('clan')

'VN Champions'

In [22]:
obj2.group('level')

'30'

In [24]:
obj2.span('name')

(0, 10)

In [25]:
obj2.start('clan'), obj2.end('clan')

(27, 39)

<hr>

**`groupdict`**

In [66]:
obj2.groupdict()

{'name': 'VN Pikachu', 'level': '30', 'clan': 'VN Champions'}

<hr>

Backreference to a named group: **`(?P=name)`**

In [32]:
pattern = r'(?P<number>\d+) \w+ (?P=number)'

In [33]:
re.search(pattern, '31 Pikachu 31')

<re.Match object; span=(0, 13), match='31 Pikachu 31'>

In [34]:
re.search(pattern, '32 Pikachu 33')

# Case study: Parsing Phone Number

The problem: parsing an American phone number. The client wanted to be able to enter the number free-form (in a single field), but then wanted to store the following separately in the company’s database:
* area code
* trunk
* number
* optionally, an extension

Sample inputs
* 800-555-1212
* 800 555 1212
* 800.555.1212
* (800) 555-1212
* 1-800-555-1212
* 800-555-1212-1234
* 800-555-1212x1234
* 800-555-1212 ext. 1234
* work 1-(800) 555.1212 #1234

Quite a variety! In each of these cases, I need to know that:
* the area code was 800
* the trunk was 555
* the rest of the phone number was 1212
* For those with an extension, I need to know that the extension was 1234.

In [42]:
database = [
'800-555-1212',
'800 555 1212',
'800.555.1212',
'(800) 555-1212',
'1-800-555-1212',
'800-555-1212-1234',
'800-555-1212x1234',
'800-555-1212 ext. 1234', 
'work 1-(800) 555.1212 #1234' 
]

In [53]:
# The hyphen is placed last in each character class so it is literal and does not form a range
pattern = re.compile(r'^[a-z ]*(?:1-)?\(?(\d+)\)?[ .-](\d+)[ .-](\d+)[a-z#. -]*(\d*)$', flags = re.I)

In [54]:
number = input('Enter phone number:')
sobj = pattern.search(number)
if sobj:
    print(sobj.groups())

Enter phone number:work 1-(800) 555.1212 #1234
('800', '555', '1212', '1234')
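To check the pattern against every sample input rather than one typed number, a sketch like the following can be used. The character classes here place the hyphen last so that it is read literally; that spelling is a choice of this sketch.

```python
import re

database = [
    '800-555-1212',
    '800 555 1212',
    '800.555.1212',
    '(800) 555-1212',
    '1-800-555-1212',
    '800-555-1212-1234',
    '800-555-1212x1234',
    '800-555-1212 ext. 1234',
    'work 1-(800) 555.1212 #1234',
]

pattern = re.compile(
    r'^[a-z ]*(?:1-)?\(?(\d+)\)?[ .-](\d+)[ .-](\d+)[a-z#. -]*(\d*)$',
    flags=re.I,
)

# Parse every entry; each result is (area code, trunk, number, extension).
parsed = [pattern.search(entry).groups() for entry in database]
for entry, groups in zip(database, parsed):
    print(f'{entry!r:32} -> {groups}')
```

The first five entries have no extension, so their last group is the empty string.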


# Case study: Tokenizer

Input:


<pre>
statements = '''
    IF quantity THEN
        total := total + price * quantity;
        tax := price * 0.05;
    ENDIF;
'''
</pre>

Output:

```python
Token(type='IF', value='IF', line=2, column=4)
Token(type='ID', value='quantity', line=2, column=7)
Token(type='THEN', value='THEN', line=2, column=16)
Token(type='ID', value='total', line=3, column=8)
Token(type='ASSIGN', value=':=', line=3, column=14)
Token(type='ID', value='total', line=3, column=17)
Token(type='OP', value='+', line=3, column=23)
Token(type='ID', value='price', line=3, column=25)
Token(type='OP', value='*', line=3, column=31)
Token(type='ID', value='quantity', line=3, column=33)
Token(type='END', value=';', line=3, column=41)
Token(type='ID', value='tax', line=4, column=8)
Token(type='ASSIGN', value=':=', line=4, column=12)
Token(type='ID', value='price', line=4, column=15)
Token(type='OP', value='*', line=4, column=21)
Token(type='NUMBER', value=0.05, line=4, column=23)
Token(type='END', value=';', line=4, column=27)
Token(type='ENDIF', value='ENDIF', line=5, column=4)
Token(type='END', value=';', line=5, column=9)
```

In [154]:
statements = '''
    IF quantity THEN
        total := total + price * quantity;
        tax := price * 0.05;
    ENDIF;
'''

In [155]:
class Token:
    def __init__(self, type, value, line, column):
        self.type = type
        self.value = value
        self.line = line
        self.column = column
    def __str__(self):
        return f'Token(type={self.type}, value={self.value}, line={self.line}, column={self.column})'

In [181]:
def tokenize(s):
    keywords = {'IF', 'THEN', 'ENDIF', 'FOR', 'NEXT', 'GOSUB', 'RETURN'}
    token_specification = [
        ('NUMBER',   r'\d+(\.\d*)?'),  # Integer or decimal number
        ('ASSIGN',   r':='),           # Assignment operator
        ('END',      r';'),            # Statement terminator
        ('ID',       r'[A-Za-z]+'),    # Identifiers
        ('OP',       r'[+\-*/]'),      # Arithmetic operators
        ('NEWLINE',  r'\n'),           # Line endings
        ('SKIP',     r'[ \t]+'),       # Skip over spaces and tabs
        ('MISMATCH', r'.')             # Any other character
    ]
    line_index = 1
    line_start = 0
    regex = '|'.join('(?P<{}>{})'.format(*spec) for spec in token_specification)
    for Match in re.finditer(regex, s):
        # Exactly one of the named groups matches for each token;
        # lastgroup gives the name of that group.
        kind = Match.lastgroup
        value = Match.group()
        start, end = Match.span()

        if kind == 'NUMBER':
            # convert the text to a number so the token carries a numeric value
            value = float(value) if '.' in value else int(value)
        elif kind == 'ID' and value in keywords:
            kind = value
        elif kind == 'NEWLINE':
            line_index += 1
            line_start = end
            continue
        elif kind == 'SKIP':
            continue
        elif kind == 'MISMATCH':
            raise RuntimeError(f'{value!r} unexpected on line {line_index}')

        yield Token(kind, value, line_index, start - line_start)

In [182]:
for token in tokenize(statements):
    print(token)

Token(type=IF, value=IF, line=2, column=4)
Token(type=ID, value=quantity, line=2, column=7)
Token(type=THEN, value=THEN, line=2, column=16)
Token(type=ID, value=total, line=3, column=8)
Token(type=ASSIGN, value=:=, line=3, column=14)
Token(type=ID, value=total, line=3, column=17)
Token(type=OP, value=+, line=3, column=23)
Token(type=ID, value=price, line=3, column=25)
Token(type=OP, value=*, line=3, column=31)
Token(type=ID, value=quantity, line=3, column=33)
Token(type=END, value=;, line=3, column=41)
Token(type=ID, value=tax, line=4, column=8)
Token(type=ASSIGN, value=:=, line=4, column=12)
Token(type=ID, value=price, line=4, column=15)
Token(type=OP, value=*, line=4, column=21)
Token(type=NUMBER, value=0.05, line=4, column=23)
Token(type=END, value=;, line=4, column=27)
Token(type=ENDIF, value=ENDIF, line=5, column=4)
Token(type=END, value=;, line=5, column=9)


<hr>

<b style = 'color:red'>Solution:</b>

# VERBOSE Flag

A verbose regular expression is different from a compact regular expression in two ways:

* Whitespace is ignored. Spaces, tabs, and carriage returns are not matched as spaces, tabs, and carriage returns. They’re not matched at all. (If you want to match a space in a verbose regular expression, you’ll need to escape it by putting a backslash in front of it.)


* Comments are ignored. A comment in a verbose regular expression is just like a comment in Python code: it starts with a # character and goes until the end of the line. In this case it’s a comment within a multi-line string instead of within your source code, but it works the same way.

In [55]:
regex = r'''
^           # beginning of the string
[a-z ]*     # optional leading letters and spaces
(?:1-)?     # optional 1- prefix, not captured
\(?(\d+)\)? # capture the area code
[ .-]       # separator (hyphen last, so it is literal)
(\d+)       # capture the trunk
[ .-]       # separator
(\d+)       # capture the rest of the number
[a-z#. -]*  # separator before the extension
(\d*)$      # optional extension
'''

**`flags = re.VERBOSE`**

In [56]:
re.search(regex, 'work 1-(800) 555.1212 #1234', flags = re.VERBOSE).groups()

('800', '555', '1212', '1234')

# Replacement

**`re.sub`**

```python
re.sub(pattern, repl, string, count=0, flags=0)

----------
Docstring:
Return the string obtained by replacing the leftmost
non-overlapping occurrences of the pattern in string by the
replacement repl.  repl can be either a string or a callable;
if a string, backslash escapes in it are processed.  If it is
a callable, it's passed the Match object and must return
a replacement string to be used.
```

In [68]:
# replace every match with a string
re.sub('VNC', 'VN Champions', 'VNC Son La TF VNC VNC')

'VN Champions Son La TF VN Champions VN Champions'

In [70]:
# replace only the first match
re.sub('VNC', 'VN Champions', 'VNC Son La TF VNC VNC', count = 1)

'VN Champions Son La TF VNC VNC'

In [71]:
# replace at most the first 2 matches
re.sub('VNC', 'VN Champions', 'VNC Son La TF VNC VNC', count = 2)

'VN Champions Son La TF VN Champions VNC'

<hr>

Replacement with backreferences

In [295]:
# swap the positions of the level and the player name
re.sub(r'(\d+)(\s+)([\w ]+)', r'\3\2\1', '31 VN Pikachu')

'VN Pikachu 31'

Write a Python program to convert a date of yyyy-mm-dd format to dd-mm-yyyy format

In [365]:
re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\3-\2-\1', '2001-10-06')

'06-10-2001'
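The same conversion can be written with named groups, which the replacement string refers to as `\g<name>`. A sketch; the group names and the sample sentence are chosen for illustration.

```python
import re

iso_date = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})')

# \g<name> inserts the text captured by the named group.
converted = iso_date.sub(r'\g<day>-\g<month>-\g<year>',
                         'released 2001-10-06, updated 2002-01-15')
print(converted)  # released 06-10-2001, updated 15-01-2002
```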

<hr>

Using a function as the replacement

In [64]:
#replace with a callback function
re.sub('VNC', lambda reMatch: f'xXx-{reMatch.group()}-xXx', 'VNC Son La TF VNC')

'xXx-VNC-xXx Son La TF xXx-VNC-xXx'

**`re.subn`**: like **`re.sub`**, but returns a tuple `(new_string, number_of_substitutions_made)`.

```python
re.subn(pattern, repl, string, count=0, flags=0)
```

In [74]:
re.subn('VNC', 'VN Champions', 'VNC Son La TF VNC VNC', count = 2)

('VN Champions Son La TF VN Champions VNC', 2)

In [75]:
re.subn('VNC', 'VN Champions', 'VNC Son La TF', count = 2)

('VN Champions Son La TF', 1)

In [2]:
import re
from urllib.request import urlopen

In [5]:
with urlopen('https://www.python-course.eu/simpsons_phone_book.txt') as file:
    for line in file:
        txt = line.decode('utf-8')

<hr>

### Exercise

Convert camelCase to snake_case

In [369]:
re.sub(r'([a-z])([A-Z])', lambda m: f'{m[1]}_{m[2].lower()}', 'camelCase anotherCamelCase')

'camel_case another_camel_case'

Convert snake_case to camelCase

In [371]:
re.sub(r'(\w)_(\w)', lambda m: f'{m[1]}{m[2].upper()}', 'snake_case and_another_snake_case')

'snakeCase andAnotherSnakeCase'

# Findall

```python
re.findall(pattern, string, flags=0)
```

In [38]:
# with capturing groups, findall returns a list of tuples
re.findall(r'(\d+) ([\w ]+)', '31 VN Pikachu, 34 Hadi, 30 Ben cut')

[('31', 'VN Pikachu'), ('34', 'Hadi'), ('30', 'Ben cut')]

<hr>

**`re.finditer`**: Finds all substrings where the RE matches and returns them as an iterator.  
Each element of the iterator is a `re.Match` object for one match.
<p style = 'color:red'>
    Use <b>re.finditer</b> when you need the details of each match (e.g. span, start, end)
</p>

iterator = re.finditer(r'\d+', '12 afda 32 32')
iterator

In [42]:
type(iterator)

callable_iterator

In [43]:
next(iterator)

<re.Match object; span=(0, 2), match='12'>

In [44]:
next(iterator)

<re.Match object; span=(8, 10), match='32'>

In [45]:
next(iterator)

<re.Match object; span=(11, 13), match='32'>

Find all adverbs and their positions in the text

In [97]:
text = "He was carefully disguised but captured quickly by police."
for Match in re.finditer(r'(\w+)ly', text):
    print(Match.group(), Match.span())

carefully (7, 16)
quickly (40, 47)


# Alternations

In [39]:
pattern = re.compile(r'(VN Pikachu|Meomeo888) is the best')

In [40]:
pattern.search('VN Pikachu is the best')

<re.Match object; span=(0, 22), match='VN Pikachu is the best'>

In [41]:
pattern.search('Tank Cao is the best')

In [42]:
pattern.search('Meomeo888 is the best')

<re.Match object; span=(0, 21), match='Meomeo888 is the best'>

<hr>

In [47]:
txt = '''
From: VN pikachu\n
To: Tank Cao\n
Content: Congratulations
'''

In [51]:
# match all user names
re.findall(r'(?:^From:|^To:) ([\w ]+)', txt, flags = re.MULTILINE) # ^ here anchors the start of a line; it does not negate

['VN pikachu', 'Tank Cao']

# Flags

In [53]:
#Ignore case matching
re.search('vn PikaChu', 'VN Pikachu is the best', flags = re.I)

<re.Match object; span=(0, 10), match='VN Pikachu'>

Use bitwise OR (`|`) to combine multiple flags:

In [36]:
s = re.compile(r'^pikachu', flags = re.I | re.MULTILINE) #ignorecase, multiline
s.findall('pikachu11\nPiKachu22')

['pikachu', 'PiKachu']
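Flags can also be written inside the pattern itself as an inline group at the very start, e.g. `(?im)` for IGNORECASE plus MULTILINE. A sketch showing that both spellings give the same result:

```python
import re

text = 'pikachu11\nPiKachu22'

# Flags passed as an argument, combined with bitwise OR.
by_argument = re.findall(r'^pikachu', text, flags=re.I | re.M)

# The same flags written inline at the start of the pattern.
inline = re.findall(r'(?im)^pikachu', text)

print(by_argument)  # ['pikachu', 'PiKachu']
print(inline)       # ['pikachu', 'PiKachu']
```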

# Splitting

```python
re.split(pattern, string, maxsplit=0, flags=0)
```

In [46]:
s = '12aca13acd'

In [49]:
re.split(r'\d+', s)

['', 'aca', 'acd']

In [174]:
# to keep the delimiters in the result, put the pattern in a capturing group
re.split(r'(\d+)', s)

['', '12', 'aca', '13', 'acd']
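A pattern can also describe several delimiters at once, and `maxsplit` limits how many splits are made. A sketch with a made-up list of names:

```python
import re

names = 'VN Pikachu , Tank Cao;Hadi'

# Split on ',' or ';' and swallow any whitespace around the delimiter.
fields = re.split(r'\s*[,;]\s*', names)

# maxsplit=1 splits only at the first delimiter and leaves the rest intact.
first_and_rest = re.split(r'\s*[,;]\s*', names, maxsplit=1)

print(fields)          # ['VN Pikachu', 'Tank Cao', 'Hadi']
print(first_and_rest)  # ['VN Pikachu', 'Tank Cao;Hadi']
```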