# A Guide to Python's re Module (Regular Expressions)


A regular expression (RegEx) is a sequence of characters that defines a search pattern.

For example:

```bash
^a...s$
```
The pattern above matches any five-character string that starts with `a` and ends with `s`.

      
Expression: `^a...s$`

| String      | Matched? |
| :---------- | :------- |
| `abs`       | No match |
| `alias`     | Match    |
| `abyss`     | Match    |
| `Alias`     | No match |
| `An abacus` | No match |

Next, we use the `re` module to test the pattern:

In [1]:
import re
pattern = '^a...s$'
test_string = 'abyss'
result = re.match(pattern, test_string)
if result:
    print("Search successful.")
else:
    print("Search unsuccessful.")	

Search successful.
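As a quick check, the table above can be reproduced with a short loop. This is only a minimal sketch; the names `samples` and `results` are illustrative and not part of the original notebook.

```python
import re

# Same pattern as above: five characters, starting with "a" and ending with "s".
pattern = r'^a...s$'

# The sample strings from the table.
samples = ['abs', 'alias', 'abyss', 'Alias', 'An abacus']

# Map each string to True/False depending on whether the pattern matches.
results = {s: bool(re.match(pattern, s)) for s in samples}

for s, matched in results.items():
    print(f'{s!r:12} {"Match" if matched else "No match"}')
```

Only `alias` and `abyss` should be reported as matches: `abs` is too short, and the other two start with an uppercase `A`.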


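# Specifying patterns with RegEx

Patterns are specified with metacharacters; the `^` and `$` seen above are two of them.

## What is a metacharacter?
A metacharacter is a character that the RegEx engine interprets in a special way instead of matching it literally.

The metacharacters are:

`[] . ^ $ * + ? {} () \ |`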


`[]` - Square brackets

Square brackets specify a set of characters that you want to match.

Expression: `[abc]`

| String      | Matched?  |
| :---------- | :-------- |
| `a`         | 1 match   |
| `ac`        | 2 matches |
| `Hey Jude`  | No match  |
| `abc de ca` | 5 matches |

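`[abc]` matches any single character listed inside the brackets, here `a`, `b`, or `c`.

You can also specify a range of characters with `-`:

- `[a-e]` is the same as `[abcde]`.
- `[1-4]` is the same as `[1234]`.
- `[0-39]` is the same as `[01239]`.

A `^` at the start of the brackets **complements** the set, so it matches any character that is not listed:

- `[^abc]` matches any character except `a`, `b`, or `c`.
- `[^0-9]` matches any non-digit character.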
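The set, range, and complement forms can be tried out with `re.findall()`, which is described in the next section. This is a small sketch; the variable names and the sample inputs `'0123456789'` and `'a1b2'` are made up for illustration.

```python
import re

# A plain set: every a, b, or c in the string is a separate match.
set_matches = re.findall(r'[abc]', 'abc de ca')

# A range combined with a single character: 0-3 plus 9.
range_matches = re.findall(r'[0-39]', '0123456789')

# A complemented set: every character that is NOT a digit.
negated_matches = re.findall(r'[^0-9]', 'a1b2')

print(set_matches)      # ['a', 'b', 'c', 'c', 'a']
print(range_matches)    # ['0', '1', '2', '3', '9']
print(negated_matches)  # ['a', 'b']
```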



## re.findall()

`findall(pattern, string, flags=0)`

Returns a list of all non-overlapping matches in the string. If nothing matches, it returns an empty list.

In [None]:
# Program to extract numbers from a string
import re
string = 'hello 12 hi 89. Howdy 34'
pattern = r'\d+'  # raw string, so \d reaches the regex engine unchanged
result = re.findall(pattern, string) 
print(result)
# Output: ['12', '89', '34']

## re.finditer()

Returns an iterator that yields a match object for each non-overlapping match.

In [23]:
text = "He was carefully disguised but captured quickly by police."
for m in re.finditer(r"\w+ly", text):
    print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))

07-16: carefully
40-47: quickly


## re.split()

`split(pattern, string, maxsplit=0, flags=0)`

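For simple cases, the string method `str.split()` is enough.

To split text on a pattern, for example on several different delimiters, use `re.split()`. If the pattern is not found, it returns a list whose only element is the original string.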

In [2]:
import re
string = 'Twelve:12 Eighty nine:89.'
pattern = r'\d+'
result = re.split(pattern, string) 
print(result)
# Output: ['Twelve:', ' Eighty nine:', '.']

['Twelve:', ' Eighty nine:', '.']
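Splitting on several delimiters at once is where `re.split()` goes beyond `str.split()`. The sketch below uses a made-up input and a character class that treats commas, semicolons, and whitespace as separators; it also shows what comes back when the pattern never occurs.

```python
import re

# Hypothetical input mixing three kinds of separators.
text = 'apple, banana;cherry  date'

# One or more commas, semicolons, or whitespace characters act as a single delimiter.
parts = re.split(r'[,;\s]+', text)
print(parts)      # ['apple', 'banana', 'cherry', 'date']

# When the pattern is not found, the whole string is returned as the only element.
unsplit = re.split(r'\d+', 'no digits here')
print(unsplit)    # ['no digits here']
```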


If `maxsplit` is given, at most that many splits are made.

In [6]:
import re
string = 'Twelve:12 Eighty nine:89 Nine:9.'
pattern = r'\d+'
# maxsplit = 1
# split only at the first occurrence
result = re.split(pattern, string, maxsplit=1)
print(result)
# Output: ['Twelve:', ' Eighty nine:89 Nine:9.']

['Twelve:', ' Eighty nine:89 Nine:9.']


## re.sub()

`sub(pattern, repl, string, count=0, flags=0)`

Searches `string` for matches of `pattern` and replaces each match with `repl`.

In [8]:
# Program to remove all whitespaces
import re
# string literal continued across two source lines (the trailing backslash adds no newline)
string = 'abc 12\
de 23 \n f45 6'
# matches all whitespace characters
pattern = r'\s+'
# empty string
replace = ''
new_string = re.sub(pattern, replace, string)
print(new_string)
# Output: abc12de23f456

abc12de23f456


Pass the `count` argument to limit the number of replacements.

In [9]:
import re
# string literal continued across two source lines
string = 'abc 12\
de 23 \n f45 6'
# matches all whitespace characters
pattern = r'\s+'
replace = ''
# count=1: replace only the first match
new_string = re.sub(pattern, replace, string, count=1)
print(new_string)
# Output:
# abc12de 23
# f45 6

abc12de 23 
 f45 6


## re.subn()

`subn(pattern, repl, string, count=0, flags=0)`

Similar to `re.sub()`, but it returns a two-element tuple: the new string and the number of replacements made.

In [None]:
# Program to remove all whitespaces
import re
# string literal continued across two source lines
string = 'abc 12\
de 23 \n f45 6'
# matches all whitespace characters
pattern = r'\s+'
# empty string
replace = ''
# subn returns a (new_string, number_of_replacements) tuple
result = re.subn(pattern, replace, string)
print(result)
# Output: ('abc12de23f456', 4)

## re.search()

`search(pattern, string, flags=0)`

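Similar to `match()`, but `match()` only tries to match at the beginning of the string, whereas `search()` scans the whole string and returns a `match` object for the first location where the pattern matches. If nothing is found, it returns `None`.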


In [None]:
import re
string = "Python is fun"
# check if 'Python' is at the beginning
match = re.search(r'\APython', string)  # \A anchors the match at the start of the string
if match:
    print("pattern found inside the string")
else:
    print("pattern not found")  
# Output: pattern found inside the string
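The difference between `match()` and `search()` is easiest to see when the pattern occurs somewhere other than the start. A minimal sketch, with a made-up sample string:

```python
import re

# 'Python' appears in the middle of the string, not at the beginning.
text = 'Learning Python is fun'

# match() only tries at position 0, so it finds nothing here.
at_start = re.match(r'Python', text)

# search() scans forward and returns the first occurrence.
anywhere = re.search(r'Python', text)

print(at_start)          # None
print(anywhere.span())   # (9, 15)
```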

## re.compile
`compile(pattern, flags=0)`

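`compile` turns a pattern into a regular expression object. That object offers the same operations as methods (`match`, `search`, `findall`, and so on), without the `pattern` argument.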

```python
prog = re.compile(pattern)
result = prog.match(string)
```
The snippet above is equivalent to:
```python
result = re.match(pattern, string)
```

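The advantage of `compile` is that when a program uses the same pattern many times, the compiled object can be reused, which saves the time of processing the pattern again.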
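A typical use is compiling once outside a loop and reusing the object inside it. The sketch below is illustrative only; the name `word_ly` and the list `lines` are assumptions made for the example.

```python
import re

# Compile once, outside the loop.
word_ly = re.compile(r'\w+ly')

# Hypothetical input lines to scan.
lines = ['He was carefully disguised', 'but captured quickly', 'by police.']

# Reuse the same compiled object for every line.
found = [word_ly.findall(line) for line in lines]
print(found)   # [['carefully'], ['quickly'], []]
```

The module-level functions also cache recently compiled patterns, so for a program that uses only a handful of patterns the gain is mostly in readability.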

In [22]:
help(re.compile)

Help on function compile in module re:

compile(pattern, flags=0)
    Compile a regular expression pattern, returning a Pattern object.



# Match object

Some commonly used methods and attributes of a match object.

## match.group()

`group()` returns the part of the string that matched.

In [13]:
import re
string = '39801 356, 2102 1111'
# Three digit number followed by space followed by two digit number
pattern = r'(\d{3}) (\d{2})'
# match variable contains a Match object.
match = re.search(pattern, string) 
if match:
    print(match.group())
else:
    print("pattern not found")
# Output: 801 35

801 35


The pattern `(\d{3}) (\d{2})` contains two groups, each delimited by `()`. They can be retrieved individually by number:

In [14]:
match.group(1)

'801'

In [15]:
match.group(1, 2)

('801', '35')

In [16]:
match.groups()

('801', '35')

## match.start(), match.end() and match.span()

`start()` returns the index where the match begins; `end()` returns the index just past its last character.

In [17]:
match.start()

2

In [18]:
match.end()

8

`span()` returns a tuple containing the start and end indexes.

In [19]:
match.span()

(2, 8)
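Because `end()` points just past the match, the two indexes can be used directly as a slice. The following sketch repeats the search from above so that it runs on its own; `sliced` is an illustrative name. It also shows that `span()` accepts a group number.

```python
import re

string = '39801 356, 2102 1111'
match = re.search(r'(\d{3}) (\d{2})', string)

# Slicing with the span reproduces the matched text exactly.
start, end = match.span()
sliced = string[start:end]
print(sliced)           # 801 35

# Passing a group number gives the span of that group only.
print(match.span(1))    # (2, 5)
print(match.span(2))    # (6, 8)
```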

## match.re and match.string

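The `re` attribute of a match object returns the regular expression object that produced it.

The `string` attribute returns the string that was passed in.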

In [20]:
match.re

re.compile(r'(\d{3}) (\d{2})', re.UNICODE)

In [21]:
match.string

'39801 356, 2102 1111'

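# Using the r prefix before a RegEx

Prefixing a string literal with `r` or `R` makes it a raw string.

For example, `'\n'` is a single newline character, whereas `r'\n'` is two characters: a backslash `\` followed by `n`.

Normally Python treats `\` as the start of an escape sequence. With the `r` prefix the backslash is kept as an ordinary character, so it reaches the RegEx engine unchanged.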

In [None]:
import re
string = '\n and \r are escape sequences.'
result = re.findall(r'[\n\r]', string) 
print(result)
# Output: ['\n', '\r']
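The effect of the prefix can be checked by comparing string lengths, and it matters most when the pattern itself has to contain a backslash. A minimal sketch, in which the sample value `path` is made up:

```python
import re

# Without the prefix, \n is one newline character; with it, two characters.
print(len('\n'))     # 1
print(len(r'\n'))    # 2

# Hypothetical string containing one literal backslash.
path = r'C:\temp'

# To match a literal backslash the regex engine needs the two characters \\ .
with_raw = re.findall(r'\\', path)        # raw string: written as two backslashes
without_raw = re.findall('\\\\', path)    # normal string: each one must be doubled again

print(with_raw == without_raw)   # True
```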

## Writing a tokenizer


In [24]:
import collections
import re

Token = collections.namedtuple('Token', ['type', 'value', 'line', 'column'])

def tokenize(code):
    keywords = {'IF', 'THEN', 'ENDIF', 'FOR', 'NEXT', 'GOSUB', 'RETURN'}
    token_specification = [
        ('NUMBER',   r'\d+(\.\d*)?'),  # Integer or decimal number
        ('ASSIGN',   r':='),           # Assignment operator
        ('END',      r';'),            # Statement terminator
        ('ID',       r'[A-Za-z]+'),    # Identifiers
        ('OP',       r'[+\-*/]'),      # Arithmetic operators
        ('NEWLINE',  r'\n'),           # Line endings
        ('SKIP',     r'[ \t]+'),       # Skip over spaces and tabs
        ('MISMATCH', r'.'),            # Any other character
    ]
    tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
    line_num = 1
    line_start = 0
    for mo in re.finditer(tok_regex, code):
        kind = mo.lastgroup
        value = mo.group()
        column = mo.start() - line_start
        if kind == 'NUMBER':
            value = float(value) if '.' in value else int(value)
        elif kind == 'ID' and value in keywords:
            kind = value
        elif kind == 'NEWLINE':
            line_start = mo.end()
            line_num += 1
            continue
        elif kind == 'SKIP':
            continue
        elif kind == 'MISMATCH':
            raise RuntimeError(f'{value!r} unexpected on line {line_num}')
        yield Token(kind, value, line_num, column)

statements = '''
    IF quantity THEN
        total := total + price * quantity;
        tax := price * 0.05;
    ENDIF;
'''

for token in tokenize(statements):
    print(token)

Token(type='IF', value='IF', line=2, column=4)
Token(type='ID', value='quantity', line=2, column=7)
Token(type='THEN', value='THEN', line=2, column=16)
Token(type='ID', value='total', line=3, column=8)
Token(type='ASSIGN', value=':=', line=3, column=14)
Token(type='ID', value='total', line=3, column=17)
Token(type='OP', value='+', line=3, column=23)
Token(type='ID', value='price', line=3, column=25)
Token(type='OP', value='*', line=3, column=31)
Token(type='ID', value='quantity', line=3, column=33)
Token(type='END', value=';', line=3, column=41)
Token(type='ID', value='tax', line=4, column=8)
Token(type='ASSIGN', value=':=', line=4, column=12)
Token(type='ID', value='price', line=4, column=15)
Token(type='OP', value='*', line=4, column=21)
Token(type='NUMBER', value=0.05, line=4, column=23)
Token(type='END', value=';', line=4, column=27)
Token(type='ENDIF', value='ENDIF', line=5, column=4)
Token(type='END', value=';', line=5, column=9)
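The tokenizer relies on two features not covered earlier: named groups, written `(?P<name>...)`, and the `lastgroup` attribute of a match object, which holds the name of the group that matched. The reduced sketch below isolates that mechanism; the three-rule pattern and the input `'x1 42'` are made up for illustration.

```python
import re

# Three alternatives, each wrapped in a named group.
token_re = re.compile(r'(?P<NUMBER>\d+)|(?P<ID>[A-Za-z]+)|(?P<SKIP>\s+)')

# For every match, lastgroup tells which alternative was used.
kinds = [(mo.lastgroup, mo.group()) for mo in token_re.finditer('x1 42')]
print(kinds)
# [('ID', 'x'), ('NUMBER', '1'), ('SKIP', ' '), ('NUMBER', '42')]
```

Joining the alternatives with `|` means the order of the rules decides which one wins when several could match at the same position.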
