# Regular Expression

## Benefits
- Very **FAST**

## Drawbacks
- Low **SCALABILITY**: patterns become hard to read and maintain as they grow

## Symbol
- ```(?:pattern) := The group is matched but not captured.```
  - So the group gets no number and cannot be used in a back-reference
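
The difference between a capturing and a non-capturing group can be sketched as follows. The date string and the variable names are illustrative assumptions, not part of the original notes.

```python
import re

date = "2024-05-17"  # hypothetical sample input

# Every (...) is a capturing group and receives a group number.
capturing = re.search(r"(\d{4})-(\d{2})-(\d{2})", date)
print(capturing.groups())      # ('2024', '05', '17')

# The month is wrapped in (?:...): it still has to match,
# but it is skipped when the groups are numbered.
non_capturing = re.search(r"(\d{4})-(?:\d{2})-(\d{2})", date)
print(non_capturing.groups())  # ('2024', '17')
print(non_capturing.group(0))  # 2024-05-17
```

The whole match (group ```0```) is the same in both cases; only the numbered groups differ.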

## Abbreviation
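- ```\d := [0-9]; \D := [^0-9]```
- ```\s := [ \t\n\r\f\v]; \S := [^\s]```
- ```\w := [a-zA-Z0-9_]; \W := [^\w]```
  - These are the ASCII equivalents; for ```str``` patterns Python also matches Unicode digits, whitespace, and letters unless ```re.ASCII``` is set
- ```\b := Word-boundary; \B := not a word-boundary```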
  - Word-boundary is a zero-width test between two characters
  - To pass the test, one side must be a word-character and the other must be a non-word-character (or the start/end of the string)
  ```
  Hello World! (Text)
  
  (Test1) Hello W|orld! (testing W and o) = False
  (Test2) |Hello World! (testing the start of the string and H) = True
  ```
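
As a small sketch of the word-boundary test in Python, the snippet below uses an assumed sample sentence; the variable names are hypothetical.

```python
import re

text = "cat concat cats category"  # hypothetical sample input

# \bcat\b matches "cat" only when it stands alone as a whole word.
whole_words = re.findall(r"\bcat\b", text)
print(whole_words)  # ['cat']

# \Bcat matches "cat" only when a word character comes right before it,
# which here is the "cat" inside "concat".
inside = [m.start() for m in re.finditer(r"\Bcat", text)]
print(inside)  # [7]

# Positions where the zero-width \b test passes in the text used above.
boundaries = [m.start() for m in re.finditer(r"\b", "Hello World!")]
print(boundaries)  # [0, 5, 6, 11]
```

There is no boundary at the very end of "Hello World!" because ```!``` is not a word-character.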

## Back-reference
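- Once a capturing group has matched, the rest of the pattern can refer to the matched text with ```\number```
- The whole matched string is group ```0```; in Python it is read with ```match.group(0)``` (or ```\g<0>``` in a replacement string), because ```\0``` is not a back-reference
- ```\x``` is the text matched by the x-th ```(pattern)```, counted by opening parentheses from left to right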
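
A common use of back-references is detecting a repeated word. This is a minimal sketch under that assumption; the sentence and the names ```doubled``` and ```cleaned``` are made up for illustration.

```python
import re

sentence = "this is is a test test sentence"  # hypothetical sample input

# (\w+) captures a word; \1 requires the same text again after one space.
doubled = re.compile(r"\b(\w+) \1\b")
print(doubled.findall(sentence))  # ['is', 'test']

# In the replacement string, \1 inserts the captured word once.
cleaned = doubled.sub(r"\1", sentence)
print(cleaned)  # this is a test sentence
```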

In [89]:
import re

url = "https://naver.naver.com"
pattern = re.compile(r'(?:\w+)://(\w+)\.\1\.com')  # escape the dots so they only match a literal '.'
print(pattern.search(url))
print(pattern.search(url).group(1))

<re.Match object; span=(0, 23), match='https://naver.naver.com'>
naver


## Python RE module
### Compile
- Pre-compile a pattern into a reusable ```Pattern``` object with ```re.compile```

### Group
- Index ```0``` is the entire match
- Indexes from ```1``` upward are the capturing groups, numbered from left to right
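
The group indexes can be illustrated with a short sketch. The e-mail-like string and the pattern are assumptions chosen only for the example.

```python
import re

# Hypothetical input; the pattern is deliberately simplistic.
match = re.search(r"(\w+)@(\w+)\.com", "contact: alice@example.com")

print(match.group(0))  # alice@example.com  (entire match)
print(match.group(1))  # alice              (first group)
print(match.group(2))  # example            (second group)
print(match.groups())  # ('alice', 'example')
```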

### Search vs Match
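- ```search``` scans the whole string and returns a ```Match``` object for the first location where the pattern matches, or ```None```
- ```match``` only matches at the beginning of the string and returns a ```Match``` object or ```None``` (use ```fullmatch``` to require the entire string to match)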
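
The three entry points can be compared side by side. This is a small sketch with assumed inputs; the result names are hypothetical.

```python
import re

pattern = re.compile(r"\d+")

found_anywhere = pattern.search("abc123")     # matches '123' in the middle
at_start_fail = pattern.match("abc123")       # None: string does not start with a digit
at_start_ok = pattern.match("123abc")         # matches '123' at the beginning
whole_fail = pattern.fullmatch("123abc")      # None: 'abc' is left over
whole_ok = pattern.fullmatch("123")           # matches the entire string

print(found_anywhere, at_start_fail, at_start_ok, whole_fail, whole_ok, sep="\n")
```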

### Findall
- Find **ALL** non-overlapping matched sub-strings and return them as a list

### Split
- Split string by the occurrences of pattern

### Sub
- Substitute each match with a different string

In [112]:
import re

money = '$123'

pattern = re.compile(r'\d+')

print("search", pattern.search(money)) # Return 123
print("match", pattern.match(money)) # Return None

moneys = '$123 45 231 32323'
print("find", pattern.findall(moneys))

logicalExpression = 'a&a->b=>b'
print("split", re.split(r'(?:&|\||->|=>)', logicalExpression))

identityNumber = '200202-1234567'
replaced = r'\1-*******'
hiddenID = re.sub(r'(\d+)-\d+', replaced, identityNumber)
print(hiddenID)

search <re.Match object; span=(1, 4), match='123'>
match None
find ['123', '45', '231', '32323']
split ['a', 'a', 'b', 'b']
200202-*******


## Greedy vs Lazy

### Greedy
- Default behavior of quantifiers
- Matches as many characters as possible, even after a shorter match has already been found

### Lazy
- Quantifiers: ```+ * {} ?```
- Append ```?``` to a quantifier to make it lazy: ```+? *? {}? ??```
- Matches as few characters as possible, stopping at the first point where the overall pattern succeeds
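
The difference matters most when a delimiter appears more than once. Here is a minimal sketch on an assumed HTML-like string (the input is illustrative only):

```python
import re

html = "<b>bold</b> and <i>italic</i>"  # hypothetical sample input

# Greedy: .+ runs to the LAST '>' in the string.
greedy_tags = re.findall(r"<.+>", html)
print(greedy_tags)  # ['<b>bold</b> and <i>italic</i>']

# Lazy: .+? stops at the FIRST '>' that lets the pattern succeed.
lazy_tags = re.findall(r"<.+?>", html)
print(lazy_tags)    # ['<b>', '</b>', '<i>', '</i>']
```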

In [58]:
import re

data = '0123456789'
greedy = re.compile(r'\d{1,10}') # Return 0123456789
lazy = re.compile(r'\d{1,10}?') # Return 0
print(greedy.search(data))
print(lazy.search(data))

greedy = re.compile(r'\d+') # Return 0123456789
lazy = re.compile(r'\d+?') # Return 0
print(greedy.search(data))
print(lazy.search(data))

greedy = re.compile(r'\d*') # Return 0123456789
lazy = re.compile(r'\d*?') # Return ''
print(greedy.search(data))
print(lazy.search(data))

greedy = re.compile(r'\d?') # Return 0
lazy = re.compile(r'\d??') # Return ''
print(greedy.search(data))
print(lazy.search(data))

<re.Match object; span=(0, 10), match='0123456789'>
<re.Match object; span=(0, 1), match='0'>
<re.Match object; span=(0, 10), match='0123456789'>
<re.Match object; span=(0, 1), match='0'>
<re.Match object; span=(0, 10), match='0123456789'>
<re.Match object; span=(0, 0), match=''>
<re.Match object; span=(0, 1), match='0'>
<re.Match object; span=(0, 0), match=''>


## Lookaround

### Lookahead
- ```(?=pattern)``` requires ```pattern``` to follow the current position, but the text it matches is excluded from the returned **Match**

### Lookbehind
- Same as Lookahead, except that it checks the text before the current position
- ```(?<=pattern)```

### Negative Lookaround
- Succeeds only where the pattern does **not** match
- Negative Lookahead := ```(?!pattern)```
- Negative Lookbehind := ```(?<!pattern)```
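
The four variants can be sketched on inputs where both outcomes occur. The sample strings and variable names are assumptions for illustration. The negative patterns also exclude digits, since otherwise the engine could backtrack and return part of a number.

```python
import re

prices = "cost $100, qty 3, fee $25"  # hypothetical sample input
shares = "50% of 200"                 # hypothetical sample input

# Positive lookbehind: digits that come right after '$' ('$' is not returned).
dollars = re.findall(r"(?<=\$)\d+", prices)
print(dollars)        # ['100', '25']

# Negative lookbehind: numbers NOT preceded by '$' (or by another digit).
plain = re.findall(r"(?<![\$\d])\d+", prices)
print(plain)          # ['3']

# Positive lookahead: digits followed by '%' ('%' is not returned).
percent = re.findall(r"\d+(?=%)", shares)
print(percent)        # ['50']

# Negative lookahead: numbers NOT followed by '%' (or by another digit).
not_percent = re.findall(r"\d+(?!\d|%)", shares)
print(not_percent)    # ['200']
```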

In [79]:
import re

# Lookahead
print("positive", re.search(r'\w+(?=₩)', '100₩'))
print("negative", re.search(r'\w+(?!₩)', '100'))

# Lookbehind
print("positive", re.search(r'\b(?<=\$)\d+\b', '$100'))
print("negative", re.search(r'\b(?<!\$)\d+\b', '100'))

positive <re.Match object; span=(0, 3), match='100'>
negative <re.Match object; span=(0, 3), match='100'>
positive <re.Match object; span=(1, 4), match='100'>
negative <re.Match object; span=(0, 3), match='100'>
