# Regular Expression
A regular expression (shortened as regex or regexp; also referred to as a rational expression) is a sequence of characters that defines a search pattern. Such patterns are typically used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation. The technique was developed in theoretical computer science and formal language theory.

In [1]:
import re
re.__all__

['match',
 'fullmatch',
 'search',
 'sub',
 'subn',
 'split',
 'findall',
 'finditer',
 'compile',
 'purge',
 'template',
 'escape',
 'error',
 'Pattern',
 'Match',
 'A',
 'I',
 'L',
 'M',
 'S',
 'X',
 'U',
 'ASCII',
 'IGNORECASE',
 'LOCALE',
 'MULTILINE',
 'DOTALL',
 'VERBOSE',
 'UNICODE']

## 1.1 Regular Expression in Python
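* `re.compile(pattern, flags=0)`: compiles a pattern string into a `re.Pattern` object (displayed as `_sre.SRE_Pattern` before Python 3.7) that can be reused for repeated matching
* `p.search(string)`, where `p = re.compile(pattern)`: equivalent to `re.search(pattern, string)`
* `re.match(pattern, string, flags=0)`: returns a match object, or `None` if there is no match
    - `.span()`: the `(start, end)` position of the match
    - `.group()`: the matched text
    
**`match` only matches at the beginning of the string; `search` scans the whole string and can match anywhere**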

In [2]:
p = re.compile('abc')
p.search('www.abc.com')

<re.Match object; span=(4, 7), match='abc'>

In [3]:
re.search('abc','www.abc.com')

<re.Match object; span=(4, 7), match='abc'>

In [4]:
m1 = re.match('www','www.fkit.org')
print(m1.span()) # return the location of matched item
print(m1.group()) # return what is matched

(0, 3)
www


In [5]:
print(re.match('fkit','www.fkit.com')) # search from the start position
print(re.search('fkit','www.fkit.com')) # search from the middle part
print(re.search('www','www.fkit.com'))

None
<re.Match object; span=(4, 8), match='fkit'>
<re.Match object; span=(0, 3), match='www'>


**Other commonly used functions**
* `re.findall(pattern, string, flags=0)`: returns a list of all substrings that match the pattern
* `re.finditer(pattern, string, flags=0)`: same as `findall`, but returns an iterator of match objects

In [6]:
print(re.findall('fkit','FKIT is very good, Fkit.org is my favorite'))
# re.I makes the match case-insensitive
print(re.findall('fkit','FKIT is very good, Fkit.org is my favorite',re.I))

[]
['FKIT', 'Fkit']


In [7]:
it = re.finditer('fkit','FKIT is very good, Fkit.org is my favorite',re.I)
for e in it:
    print(str(e.span()) + '-->' + e.group())

(0, 4)-->FKIT
(19, 23)-->Fkit


**Other commonly used functions**
* `re.fullmatch(pattern, string, flags=0)`: requires the whole string to match the pattern; otherwise returns `None`
* `re.sub(pattern, repl, string, count=0, flags=0)`: replaces matches of the pattern in the string with `repl`; `count` limits the number of replacements (0 means replace all)
* `re.split(pattern, string, maxsplit=0, flags=0)`: splits the string wherever the pattern matches

In [8]:
import re
my_date = '2008-08-18'
print(re.sub(r'-','/',count =1,string = my_date))

2008/08-18


In [9]:
print(re.split(',','fkit,fkjava,crazyit',1))
print(re.split(',','fkit,fkjava,crazyit'))
print(re.split('a','fkit,fkjava,crazyit',))
print(re.split('x','fkit,fkjava,crazyit',1)) # no 'x' in the string, so the result is a one-element list holding the original string

['fkit', 'fkjava,crazyit']
['fkit', 'fkjava', 'crazyit']
['fkit,fkj', 'v', ',cr', 'zyit']
['fkit,fkjava,crazyit']
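
`re.fullmatch` is listed above but not demonstrated, so the following is a small sketch of how it differs from a partial match, together with `re.sub` when `count` is left at its default. The date patterns and variable names are only illustrative.

```python
import re

# fullmatch succeeds only when the entire string matches the pattern
full_ok = re.fullmatch(r'\d{4}-\d{2}-\d{2}', '2008-08-18')
# this pattern matches only a prefix of the string, so fullmatch returns None
full_bad = re.fullmatch(r'\d{4}-\d{2}', '2008-08-18')

print(full_ok.group())   # 2008-08-18
print(full_bad)          # None

# with count left at 0, sub replaces every occurrence
all_replaced = re.sub(r'-', '/', '2008-08-18')
print(all_replaced)      # 2008/08/18
```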


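### There are two major classes in the `re` module: the `re.Pattern` object returned by `compile`, and the match object (`re.Match`) returned by `match` and `search`
A `re.Pattern` object offers `match`, `search`, and `fullmatch` as methods; these also accept optional `pos` and `endpos` arguments that limit the region of the string to examine.

A match object has the following methods and attributes:
* `match.group()`: returns the matched string (or, given a group number or name, the text of that group)
* `match.groups(default=None)`: returns a tuple with the text matched by each group
* `match.groupdict(default=None)`: returns a dict with the text matched by each named group
* `match.start()`: the start index of the match
* `match.end()`: the end index of the match (exclusive)
* `match.span()`: the tuple `(start, end)`
* `match.re`: the pattern object that produced the match
* `match.string`: the original string that was searched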

In [10]:
pa = re.compile('fkit')
print(pa.match('www.fkit.org',4).span()) # start matching at index 4
print(pa.match('www.fkit.org',4,6)) # only indices 4 to 6 are examined, too short for 'fkit'
print(pa.fullmatch('www.fkit.org',4,8).span()) # the region from 4 to 8 must match entirely

(4, 8)
None
(4, 8)


In [11]:
m = re.search(r'(fkit).(org)',r'www.fkit.org is a good domain')
print(m.group(0))
print(m[0])
print(m[1])
print(m.span(0))
print(m.span(1))
print(m.group(1))
print(m.group(2))
print(m.groups())

fkit.org
fkit.org
fkit
(4, 12)
(4, 8)
fkit
org
('fkit', 'org')


**In a regular expression, `(?P<name>...)` gives a name to a group**

In [12]:
m2 = re.search(r'(?P<prefix>fkit).(?P<suffix>org)',\
              r"www.fkit.org os a good domain")
print(m2.groupdict())
print(m2.string)
print(m2.re)

{'prefix': 'fkit', 'suffix': 'org'}
www.fkit.org os a good domain
re.compile('(?P<prefix>fkit).(?P<suffix>org)')
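
Besides `groupdict()`, a named group can be read directly by name. The sketch below assumes the same kind of pattern as above (with the dot escaped so it matches a literal `.`); `m3` and the other variable names are illustrative.

```python
import re

m3 = re.search(r'(?P<prefix>fkit)\.(?P<suffix>org)',
               'www.fkit.org is a good domain')

by_name = m3.group('prefix')     # look a group up by its name
by_index = m3['suffix']          # subscript access also accepts names
suffix_span = (m3.start('suffix'), m3.end('suffix'))

print(by_name)       # fkit
print(by_index)      # org
print(suffix_span)   # (9, 12)
```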


## 1.2 Flags in regular expressions
* `re.A` or `re.ASCII`: makes `\w`, `\W`, `\b`, `\B`, `\d`, `\D`, `\s`, and `\S` match ASCII characters only; inline form `(?a)`
* `re.DEBUG`: prints debug information about the compiled expression
* `re.I` or `re.IGNORECASE`: case-insensitive matching; inline form `(?i)`

In [13]:
re.findall(r'fkit','FKIT is a good domain, FKIT is good')
re.findall(r'fkit','FKIT is a good domain, FKIT is good',re.I)

['FKIT', 'FKIT']
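
The flags listed above can also be written inline or combined with `|`. The following is a minimal sketch; the sample strings are made up for illustration, and `\u00e9` is the single code point for "é".

```python
import re

text = 'FKIT is a good domain, Fkit is good'

# (?i) at the start of the pattern behaves like passing re.I
inline = re.findall(r'(?i)fkit', text)
print(inline)          # ['FKIT', 'Fkit']

# flags are combined with |; re.M lets ^ match at the start of each line
combined = re.findall(r'^fkit', 'FKIT\nfkit', re.I | re.M)
print(combined)        # ['FKIT', 'fkit']

# by default \w matches Unicode letters; re.A restricts it to ASCII
unicode_words = re.findall(r'\w+', 'caf\u00e9 fkit')
ascii_words = re.findall(r'\w+', 'caf\u00e9 fkit', re.A)
print(unicode_words)   # ['café', 'fkit']
print(ascii_words)     # ['caf', 'fkit']
```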

## 1.3 Create regular expression
* `x`: an ordinary character `x` matches itself
* `\r`: carriage return
* `\f`: form feed (page break)
* `\n`: line feed (line break)
* `$`: matches at the end of the string (with `re.M`, at the end of each line)
* `^`: matches at the beginning of the string (with `re.M`, at the beginning of each line)
* `()`: defines a group and marks its start and end
* `[]`: defines a character class, which matches any one of the characters listed inside
* `{}`: sets how many times the preceding expression repeats
* `*`: the preceding expression occurs 0~n times
* `+`: the preceding expression occurs 1~n times
* `?`: the preceding expression occurs 0~1 times
* `.`: matches any character except `\n`, unless the `re.S` (`re.DOTALL`) flag is used
* `\d`: a digit 0~9
* `\D`: a non-digit
* `\s`: any whitespace character, including space, line break, tab, form feed, etc.
* `\S`: any non-whitespace character
* `\w`: any word character: digits 0-9, letters, and `_` (for `str` patterns, Unicode letters and digits also match unless `re.ASCII` is set)
* `\W`: any non-word character

In [14]:
# find a phone number
re.fullmatch(r'\d\d\d-\d\d\d-\d\d\d\d','789-123-8888')

<re.Match object; span=(0, 12), match='789-123-8888'>
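
A few of the symbols above are easier to see in action. This is a small sketch with made-up sample strings showing `.` with and without `re.S`, the anchors `^` and `$` under `re.M`, and `\s` used as a separator.

```python
import re

two_lines = 'first line\nsecond line'

# . does not cross a line break unless re.S (re.DOTALL) is set
no_dotall = re.search(r'first.+second', two_lines)
with_dotall = re.search(r'first.+second', two_lines, re.S)
print(no_dotall)              # None
print(with_dotall.span())     # (0, 17)

# with re.M, ^ and $ anchor at the start and end of every line
starts = re.findall(r'^\w+', two_lines, re.M)
ends = re.findall(r'\w+$', two_lines, re.M)
print(starts)                 # ['first', 'second']
print(ends)                   # ['line', 'line']

# \s+ treats any run of whitespace as one separator
tokens = re.split(r'\s+', 'a  b\tc\nd')
print(tokens)                 # ['a', 'b', 'c', 'd']
```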

#### Character classes (enumeration)
* `[abc]`: matches any one of {a,b,c}
* `[a-f]`: matches any one of {a,b,c,d,e,f}
* `[^abc]`: matches any one character **not in {a,b,c}**
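
The three forms of character class can be tried out as follows; the sample strings are chosen only for illustration.

```python
import re

# [aeiou] matches a single character from the listed set
vowels = re.findall(r'[aeiou]', 'fkit.org')
print(vowels)        # ['i', 'o']

# ranges can be combined inside one class
hex_runs = re.findall(r'[0-9a-f]+', 'ff00zz1a')
print(hex_runs)      # ['ff00', '1a']

# a leading ^ negates the class
non_digits = re.findall(r'[^0-9]+', '789-123-8888')
print(non_digits)    # ['-', '-']
```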

## 1.4 Sub-expression
Use `()` to define a sub-expression (group), and use a backslash followed by the group number (`\1`, `\2`, ...) to refer back to the text that group matched.   
For example, in the following test, `\1` refers to the first sub-expression: if it matched `98`, then the later position must also match `98`.

In [15]:
re.search(r'Windows (95|98|NT|2000)[\w ]+\1', 'Windows 98 published in 98')

<re.Match object; span=(0, 26), match='Windows 98 published in 98'>

In [16]:
re.search(r'Windows (95|98|NT|2000)[\w ]+\1', 'Windows 98 published in 95')

#### In addition, we can use `(?P<name>exp)` to create a named sub-expression
* Use `(?P=name)` to refer back to the named sub-expression (the text must be the same)

In [17]:
re.search(r'<(?P<tag>\w+)>\w+</(?P=tag)>', '<h3>xx</h3>')

<re.Match object; span=(0, 11), match='<h3>xx</h3>'>

In [18]:
re.search(r'<(?P<tag>\w+)>\w+</(?P=tag)>', '<h3>xx</h2>')
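
Group references are also available in the replacement string of `re.sub`: `\1` for numbered groups and `\g<name>` for named ones. The sketch below uses made-up date strings and group names (`y`, `m`) to show both, plus a numbered backreference used to detect a doubled word.

```python
import re

# reorder a date using numbered groups in the replacement
swapped = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\3/\2/\1', '2008-08-18')
print(swapped)           # 18/08/2008

# named groups are referenced as \g<name> in the replacement
named = re.sub(r'(?P<y>\d{4})-(?P<m>\d{2})', r'\g<m>/\g<y>', '2008-08')
print(named)             # 08/2008

# \1 inside the pattern finds a word that is immediately repeated
doubled = re.search(r'\b(\w+) \1\b', 'this is is a test')
print(doubled.group())   # is is
```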

#### Other important expression types
* `(?<=exp)`: `exp` must appear immediately to the left of the matched content (positive lookbehind)
* `(?=exp)`: `exp` must appear immediately to the right of the matched content (positive lookahead)
* `(?<!exp)`: `exp` must not appear immediately to the left of the matched content (negative lookbehind)
* `(?!exp)`: `exp` must not appear immediately to the right of the matched content (negative lookahead)
* `(?#comment)`: a comment; its content is ignored by the matcher

In [19]:
re.search(r'(?<=<h1>).+?(?=</h1>)','help! <h1><div>fkit</div></h1>! technology')

<re.Match object; span=(10, 25), match='<div>fkit</div>'>
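
The negative forms work the same way but reject a match when the surrounding text is present. This is a minimal sketch with made-up sample strings; note that none of the lookaround text becomes part of the result.

```python
import re

text = 'cost $30, tax 5, total $35'

# positive lookbehind: digits that directly follow a dollar sign
prices = re.findall(r'(?<=\$)\d+', text)
print(prices)      # ['30', '35']

# negative lookbehind: whole numbers that do not follow a dollar sign
unpriced = re.findall(r'(?<!\$)\b\d+', text)
print(unpriced)    # ['5']

# negative lookahead: 'fkit' domains whose suffix is not '.org'
not_org = re.findall(r'fkit(?!\.org)\.\w+', 'fkit.org fkit.com')
print(not_org)     # ['fkit.com']
```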

## 1.5 Greedy mode and reluctant mode
* `*`: 0~n times
* `+`: 1~n times
* `?`: 0~1 times
* `{n,m}`: n~m times
* `{n,}`: at least n times
* `{,m}`: not more than m times
* `{n}`: exactly n times

In [20]:
re.search(r'\d{3}-\d{3}-\d{4}','789-654-1234')

<re.Match object; span=(0, 12), match='789-654-1234'>

**Matching is greedy by default: a quantifier takes as many characters as it can**

In [21]:
re.search(r'@.+\.','sun@fkit.com.cn')

<re.Match object; span=(3, 13), match='@fkit.com.'>

**Add `?` after a quantifier to switch it to reluctant mode, which takes as few characters as possible**

In [22]:
re.search(r'@.+?\.','sun@fkit.com.cn')

<re.Match object; span=(3, 9), match='@fkit.'>
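
The difference matters most when a pattern can match several times in one string. The following sketch, using a made-up HTML fragment, compares the two modes with `findall` and shows that `?` also applies to a `{n,m}` quantifier.

```python
import re

html = '<b>one</b><b>two</b>'

# greedy: .+ runs to the last </b>, so everything is one match
greedy = re.findall(r'<b>.+</b>', html)
print(greedy)    # ['<b>one</b><b>two</b>']

# reluctant: .+? stops at the first </b>, giving one match per tag
lazy = re.findall(r'<b>.+?</b>', html)
print(lazy)      # ['<b>one</b>', '<b>two</b>']

# {2,4} takes up to four digits; {2,4}? is satisfied with two
digits_greedy = re.search(r'\d{2,4}', '123456').group()
digits_lazy = re.search(r'\d{2,4}?', '123456').group()
print(digits_greedy, digits_lazy)   # 1234 12
```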