## Regex for Python

From "Python Core Programming" - Ch1 (to be continued...)

In [1]:
import re

### 1.3.4 Use match()

Use `match()` to match a pattern at the start of a string; use `.group()` and `.groups()` on the returned match object to retrieve the matched text.

In [2]:
m = re.match(pattern = 'foo', string = 'foo on the table')
m.group()

'foo'
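The cell above only shows `.group()`. As a minimal sketch of how `.group()` and `.groups()` differ, the pattern and input string below are made up for illustration and are not from the book:

```python
import re

# Hypothetical pattern with two parenthesized subgroups: word characters, then digits.
m = re.match(r'(\w+)-(\d+)', 'abc-123 and more')

whole = m.group()    # the entire matched text
first = m.group(1)   # the first subgroup only
parts = m.groups()   # a tuple holding every subgroup

print(whole)
print(first)
print(parts)
```

`.group()` with no argument returns the whole match, while `.groups()` returns only the parenthesized parts.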

### 1.3.5 Use search()

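Use `search()` to find the first occurrence of a pattern anywhere in a string; `match()` only succeeds when the pattern occurs at the start.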

In [3]:
m = re.match(pattern = 'foo', string = 'seafood')
if m is None:
    print('m is NoneType')
else:
    m.group()

m is NoneType


In [4]:
m = re.search(pattern = 'foo', string = 'seafood')
m.group()

'foo'
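To see where `search()` found the pattern, the match object's `.span()` method can be inspected. This is a small sketch that reuses the `'seafood'` string; the variable names are chosen here for illustration:

```python
import re

text = 'seafood'

at_start = re.match('foo', text)    # match() only tries position 0
anywhere = re.search('foo', text)   # search() scans the whole string

print(at_start)
print(anywhere.span())   # (start, end) indices of the match
```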

### 1.3.6 Match multiple strings with `|`

In [5]:
bt = 'bat|bet|bit'
m = re.match(bt, 'bat')
m.group()

'bat'

In [6]:
m = re.match(bt, 'He bit me!') # no match: the string does not start with bat, bet, or bit
m is None

True

In [7]:
m = re.search(bt, 'He bit me! bit')
m.group()

'bit'

### 1.3.7 Match any single character with `.` (except `\n`)

In [8]:
anyend = '.end'
m = re.match(anyend, 'bend')
m.group()

'bend'

In [9]:
m = re.match(anyend, 'end') # no match: `.` needs one character before 'end'
m is None

True

In [10]:
m = re.match(anyend, '\nend') # no match: `.` does not match \n by default
m is None

True
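Two related cases may be worth trying here: the `re.DOTALL` flag, which lets `.` match a newline as well, and the escaped form `\.`, which matches only a literal dot. The sketch below reuses `anyend`; the other variable names are illustrative:

```python
import re

anyend = '.end'

default = re.match(anyend, '\nend')             # `.` skips \n by default
dotall = re.match(anyend, '\nend', re.DOTALL)   # with DOTALL, `.` also matches \n

literal_hit = re.match(r'\.end', '.end')    # escaped dot matches a real '.'
literal_miss = re.match(r'\.end', 'bend')   # ...and nothing else

print(default)
print(repr(dotall.group()))
print(literal_hit.group())
print(literal_miss)
```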

### 1.3.8 Use `[ ]` to create a character set

In [11]:
m = re.match('[cr][23][dp][o2]', 'c3do')
m.group()

'c3do'
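Each `[ ]` picks its character independently, so a sequence of character sets is more permissive than an alternation of whole strings. A rough comparison, assuming the intended targets were `r2d2` and `c3po` (that intent is a guess, not stated in the notes):

```python
import re

char_set = '[cr][23][dp][o2]'   # any combination of the listed characters
alternation = 'r2d2|c3po'       # only these two exact strings

results = {}
for s in ['r2d2', 'c3po', 'c2do']:
    results[s] = (bool(re.match(char_set, s)), bool(re.match(alternation, s)))
    print(s, results[s])
```

Here `'c2do'` is accepted by the character sets but rejected by the alternation.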

### 1.3.9 Repetition, special characters, and grouping

Example: matching an email address

* `\w` matches any letter, digit, or underscore; the opposite is `\W`
* `?` matches 0 or 1 time
* `*` matches 0 or more times
* `+` matches 1 or more times

In [12]:
# target form: xxxx@xxxx.xxx.com
patt = r'\w+@(\w+\.)?\w+\.(com|edu)' # raw string; use * instead of ? to allow any number of subdomains
print(re.match(patt, 'bs2996@columbia.edu').group())
print(re.match(patt, 'bangdasun94@gmail.com').group())

bs2996@columbia.edu
bangdasun94@gmail.com
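The parentheses in this pattern also capture subgroups, and swapping `?` for `*` changes how many subdomains are accepted. The following sketch uses made-up addresses (`example` domains) rather than the ones above:

```python
import re

patt_opt = r'\w+@(\w+\.)?\w+\.(com|edu)'    # at most one subdomain
patt_star = r'\w+@(\w+\.)*\w+\.(com|edu)'   # any number of subdomains

one_sub = re.match(patt_opt, 'user@cs.example.edu')
no_sub = re.match(patt_opt, 'user@example.com')
print(one_sub.groups())   # captured subdomain and top-level domain
print(no_sub.groups())    # the optional group did not take part in the match

deep = 'user@a.b.example.com'
opt_deep = re.match(patt_opt, deep)
star_deep = re.match(patt_star, deep)
print(opt_deep)
print(star_deep.group())
```

A group that did not participate in the match is reported as `None` in `.groups()`.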


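### 1.3.10 Match the beginning and end

* `^` matches at the beginning of the string; inside brackets, `[^...]` instead negates the character set
* `$` matches at the end of the string
* `\b` matches at a word boundary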

In [13]:
re.search('^The', 'The end.').group()

'The'

In [14]:
re.search('The$', 'end The').group()

'The'

In [15]:
re.search(r'\bthe', 'bite the dog').group() # 'the' starts at a word boundary

'the'

In [16]:
re.search(r'\bthe', 'bitethe dog') is None # no word boundary before 'the'

True
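The two meanings of `^` are easy to mix up, so here is a short sketch contrasting the anchor with the negated set. It also shows `\B`, the opposite of `\b`, which is not listed in the notes above. The test strings are illustrative:

```python
import re

# `^` as an anchor, followed by a negated set: first character must not be 'T'.
starts_with_T = re.search('^[^T]', 'The end.')
starts_with_t = re.search('^[^T]', 'the end.')
print(starts_with_T)
print(starts_with_t.group())

# `\B` matches where there is NO word boundary.
inside_word = re.search(r'\Bthe', 'bitethe dog')
print(inside_word.group())
```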

### 1.3.11 Use `findall()` and `finditer()` to find all matches

* `finditer()` returns an iterator of match objects instead of a list, so it uses less memory

In [17]:
re.findall('car', 'carry the barcardi to the car')

['car', 'car', 'car']
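`finditer()` is mentioned but not demonstrated above. A minimal sketch on the same string, collecting the start position of each match (the variable names are chosen here for illustration):

```python
import re

text = 'carry the barcardi to the car'

# Each item yielded by finditer() is a match object, so positions are available.
starts = [m.start() for m in re.finditer('car', text)]
print(starts)

# The matched text itself is the same as what findall() returns.
words = [m.group() for m in re.finditer('car', text)]
print(words == re.findall('car', text))
```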

### 1.3.12 Use `sub()` and `subn()` for search and replace

* `subn()` also returns the number of replacements made

In [18]:
uni = 'bs2996'
re.sub(pattern = '2996', repl = 'xxxx', string = uni)

'bsxxxx'

In [19]:
re.subn('2996', 'xxxx', 'bs2996')

('bsxxxx', 1)
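With only one occurrence to replace, the count from `subn()` is always 1. The sketch below uses a made-up string with several occurrences and also shows the optional `count` argument, which caps the number of replacements:

```python
import re

# Every 'a' is replaced; subn() reports how many replacements were made.
replaced_all = re.subn('a', 'A', 'banana')
print(replaced_all)

# count=1 limits sub() to the first occurrence.
replaced_first = re.sub('a', 'A', 'banana', count=1)
print(replaced_first)
```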

### 1.3.13 Use `split()` for separation

In [20]:
re.split(r'@|\.', 'bs2996@columbia.edu')

['bs2996', 'columbia', 'edu']
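Two variations on the same split may be useful: a character set `[@.]` as an equivalent way to write the separator, and the `maxsplit` argument to stop after the first separator. This is a sketch, not part of the book's example:

```python
import re

addr = 'bs2996@columbia.edu'

# Inside [ ], the dot is literal, so no backslash is needed.
by_set = re.split(r'[@.]', addr)
print(by_set)

# maxsplit=1 splits only at the first separator found.
user_and_host = re.split(r'@|\.', addr, maxsplit=1)
print(user_and_host)
```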