# re Module Functions and Regular Expressions

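Function | Description
:-: | :-:
compile(pattern\[, flags\]) | Compile a pattern into a regular expression object
search(pattern, string\[, flags\]) | Scan the whole string for the pattern; return a match object or None
match(pattern, string\[, flags\]) | Match the pattern at the start of the string; return a match object or None
findall(pattern, string\[, flags\]) | Return a list of all matches of the pattern in the string
split(pattern, string\[, maxsplit=0\]) | Split the string at each match of the pattern
sub(pat, repl, string\[, count=0\]) | Replace every match of pat in the string with repl
escape(string) | Escape all regex special characters in the string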
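
The table lists escape(), which none of the cells below exercise. A minimal sketch, using a made-up sample string, of how it turns arbitrary text into a literal pattern:

```python
import re

# Hypothetical sample text; '1+1=2' contains '+', which is a regex quantifier.
text = 'We all know that 1+1=2.'
literal = '1+1=2'

# Unescaped, '1+' means "one or more 1s", so the literal text is not found.
unescaped = re.search(literal, text)

# re.escape() backslash-escapes the special characters so they match literally.
escaped = re.search(re.escape(literal), text)

print(unescaped)         # None
print(escaped.group(0))  # 1+1=2
```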

### 1. Using the re module functions directly

In [1]:
import re

In [8]:
text = 'a b.. . c d'

In [9]:
text.split()

['a', 'b..', '.', 'c', 'd']

In [10]:
re.split(r'[\. ]+',text) # split on runs of dots and spaces

['a', 'b', 'c', 'd']

In [11]:
re.split(r'[\. ]',text) # without +, every single separator splits, leaving empty strings

['a', 'b', '', '', '', '', 'c', 'd']
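
The maxsplit parameter from the table caps the number of splits. A short sketch reusing the same text:

```python
import re

text = 'a b.. . c d'

# Stop after the first split; the rest of the string is returned untouched.
first_only = re.split(r'[. ]+', text, maxsplit=1)
print(first_only)  # ['a', 'b.. . c d']
```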

In [12]:
pat = '[a-zA-Z]+'

In [13]:
re.findall(pat,text) # find all words

['a', 'b', 'c', 'd']

In [14]:
pat = '{name}'

In [15]:
text  = 'Dear {name} ...'

In [16]:
re.sub(pat,'Wang Yuze',text)

'Dear Wang Yuze ...'

In [17]:
s = 'a s d'
re.sub('a|s|d','good',s)

'good good good'

In [18]:
re.sub('a',lambda x:x.group(0).upper(),'aaa abc abdc')

'AAA Abc Abdc'

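##### Backreferences in the replacement

r'\1' in the replacement string stands for the text captured by group 1. In the cell below every word is replaced by itself, so the string comes back unchanged.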

In [21]:
s = "It\'s a very good good idea"
re.sub(r'(\b\w+)',r'\1',s)

"It's a very good good idea"

In [23]:
re.sub('[a-z]',lambda x:x.group(0).upper(),'aaa bbb ccsds')

'AAA BBB CCSDS'
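
The r'\1' backreference from In [21] becomes useful once the group is also referenced inside the pattern. As a sketch (this pattern is an illustration, not part of the original notes), the following collapses immediately repeated words:

```python
import re

s = "It's a very good good idea"

# (\w+) captures a word; ( \1\b)+ matches one or more repeats of that same word.
deduped = re.sub(r'\b(\w+)( \1\b)+', r'\1', s)
print(deduped)  # It's a very good idea
```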

In [24]:
re.subn('a','dfg','aaa abc abcd') # returns the new string and the number of replacements

('dfgdfgdfg dfgbc dfgbcd', 5)

In [26]:
example = 'Beautiful is better than ugly'
re.findall('\\bb.+?\\b',example) # whole words starting with lowercase b; ? makes the match non-greedy

['better']

In [29]:
re.findall('\\bb.+\\b',example) # result of greedy matching

['better than ugly']
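
The same greedy versus non-greedy contrast shows up whenever a delimiter repeats. A small sketch on a made-up HTML fragment:

```python
import re

html = '<b>bold</b>'  # made-up sample

greedy = re.findall(r'<.+>', html)  # .+ runs to the last '>'
lazy = re.findall(r'<.+?>', html)   # .+? stops at the first '>'

print(greedy)  # ['<b>bold</b>']
print(lazy)    # ['<b>', '</b>']
```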

In [36]:
re.findall(r'\d+\.\d+\.\d+','Python 2.7.13')  
# \d matches a digit; + means one or more digits

['2.7.13']

In [37]:
re.findall(r'\d+\.\d+\.\d+','Python 2.7.13  Python 3.6.0')

['2.7.13', '3.6.0']
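
Patterns full of backslashes are safest written as raw strings (the r'...' prefix); otherwise the backslash has to be doubled, as in '\\b' above. A minimal sketch of what goes wrong with a plain '\b':

```python
import re

text = 'this is it'

# In a normal string literal '\b' is the backspace character (one character).
lengths = (len('\b'), len(r'\b'))
print(lengths)  # (1, 2)

no_raw = re.findall('\bis', text)  # looks for backspace + 'is'
raw = re.findall(r'\bis', text)    # word boundary + 'is'

print(no_raw)  # []
print(raw)     # ['is']
```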

In [44]:
s = '<html><head>This is head.</head><body>This is body.</body></html>'
pattern = r'<html><head>(.+)</head><body>(.+)</body></html>'
result = re.search(pattern,s)

In [45]:
result.group(1) # first subpattern

'This is head.'

In [46]:
result.group(2) # second subpattern

'This is body.'

In [47]:
result.group(0)

'<html><head>This is head.</head><body>This is body.</body></html>'

### 2. Using regular expression objects

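Use re.compile() to compile a regular expression into a regular expression object, then process strings with the methods that object provides.

- match(string\[, pos\[, endpos\]\]) matches at the beginning of the string or at the specified position; the pattern must occur exactly at the start of the string or at pos.

- search(string\[, pos\[, endpos\]\]) searches the whole string, or the specified range, for the first match.

- findall(string\[, pos\[, endpos\]\]) returns a list of all substrings that match the regular expression.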
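
The optional pos and endpos arguments described above can be sketched as follows; the sample string is made up for illustration:

```python
import re

pattern = re.compile(r'\d+')
s = 'abc123def456'  # made-up sample

at_start = pattern.match(s)         # None: the string does not start with a digit
at_pos = pattern.match(s, 3)        # matching begins at index 3
later = pattern.search(s, 6)        # search only from index 6 onwards
limited = pattern.findall(s, 0, 5)  # only s[0:5] == 'abc12' is examined

print(at_start)         # None
print(at_pos.group(0))  # 123
print(later.group(0))   # 456
print(limited)          # ['12']
```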

In [48]:
import re

In [49]:
example = 'ShanDong Institude of Bussiness and Technology'

In [50]:
pattern = re.compile(r'\bB\w+\b') # \w matches any letter, digit, or underscore

In [51]:
pattern.findall(example)

['Bussiness']

Note the difference between the following two patterns:

In [56]:
pattern = re.compile(r'\w+g') # no \b: matches any run of word characters ending in g, even inside a word

In [57]:
pattern.findall(example)

['ShanDong', 'Technolog']

In [60]:
pattern = re.compile(r'\w+g\b') # with \b: matches only words that end in g

In [61]:
pattern.findall(example)

['ShanDong']

In [62]:
pattern = re.compile(r'\b[a-zA-Z]{3}\b') # find words that are exactly three letters long

In [63]:
pattern.findall(example)

['and']

In [65]:
pattern.search(example)

<_sre.SRE_Match object; span=(32, 35), match='and'>

In [68]:
pattern.match(example) # matches only at the start of the string; returns None on failure

In [70]:
pattern = re.compile(r'\b\w*a\w*\b') # find all words that contain the letter a

In [71]:
pattern.findall(example)

['ShanDong', 'and']

In [72]:
text = "He was carefully disguised but captured quickly by police."

In [74]:
re.findall(r'\b\w+ly\b',text) # find all words ending in ly

['carefully', 'quickly']

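**sub() and subn()**<br>The sub(repl, string\[, count=0\]) and subn(repl, string\[, count=0\]) methods of a regular expression object perform string replacement; repl can be a string or a callable that takes a match object and returns a string.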

In [127]:
example  = '''
B b Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.'''

In [128]:
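# re.I ignores case
# * matches zero or more occurrences of the preceding character or subpattern; the three matches at the start of the first line (B, b, Beautiful) illustrate this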
pattern = re.compile(r'\bb\w*\b',re.I)

The first argument of sub() is the replacement (repl); the second is the text to process.

In [129]:
print(pattern.sub("*",example))


* * * is * than ugly.
Explicit is * than implicit.
Simple is * than complex.
Complex is * than complicated.
Flat is * than nested.
Sparse is * than dense.
Readability counts.


In [133]:
# convert every match to upper case
print(pattern.sub(lambda x:x.group(0).upper(),example))


B B BEAUTIFUL is BETTER than ugly.
Explicit is BETTER than implicit.
Simple is BETTER than complex.
Complex is BETTER than complicated.
Flat is BETTER than nested.
Sparse is BETTER than dense.
Readability counts.


In [135]:
# replace only the first match
print(pattern.sub("*",example,1))


* b Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
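
subn() on a pattern object takes the same arguments as sub() and also reports how many replacements were made. A small sketch using only the first line of the example text:

```python
import re

pattern = re.compile(r'\bb\w*\b', re.I)
line = 'B b Beautiful is better than ugly.'

# subn() returns a (new_string, number_of_replacements) tuple.
new_line, count = pattern.subn('*', line)
print(new_line)  # * * * is * than ugly.
print(count)     # 4
```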


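##### The re.M flag

In [123]:
# re.M (MULTILINE) only changes ^ and $ so that they match at every line start and end; r'\w+' uses neither, so the flag has no effect here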
re.findall(r'\w+',example,re.M)

['B',
 'b',
 'Beautiful',
 'is',
 'better',
 'than',
 'ugly',
 'Explicit',
 'is',
 'better',
 'than',
 'implicit',
 'Simple',
 'is',
 'better',
 'than',
 'complex',
 'Complex',
 'is',
 'better',
 'than',
 'complicated',
 'Flat',
 'is',
 'better',
 'than',
 'nested',
 'Sparse',
 'is',
 'better',
 'than',
 'dense',
 'Readability',
 'counts']
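
re.M only makes a difference when the pattern contains ^ or $. A minimal sketch on a made-up three-line string:

```python
import re

text = 'first line\nsecond line\nthird row'  # made-up sample

last_word = re.findall(r'\w+$', text)          # $ matches only at the very end
line_ends = re.findall(r'\w+$', text, re.M)    # $ matches at the end of each line
line_starts = re.findall(r'^\w+', text, re.M)  # ^ matches at the start of each line

print(last_word)    # ['row']
print(line_ends)    # ['line', 'line', 'row']
print(line_starts)  # ['first', 'second', 'third']
```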

The **split(string\[, maxsplit=0\])** method of a regular expression object splits a string.

In [136]:
example = r'one,two,three.four/five\six?seven[eight]nine|ten'

In [139]:
pattern = re.compile(r'[,.\\/?\[\]|]')

In [140]:
pattern.split(example)

['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten']

In [156]:
example = r'one2two3three4four5five8888'

In [157]:
pattern = re.compile(r'\d+')

In [158]:
pattern.split(example)

['one', 'two', 'three', 'four', 'five', '']
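
The trailing '' appears because the string ends with a separator. One possible way to drop it, along with a maxsplit example, sketched on the same input:

```python
import re

pattern = re.compile(r'\d+')
example = 'one2two3three4four5five8888'

# A separator at the very end leaves a trailing empty string; filter it out.
parts = [p for p in pattern.split(example) if p]
print(parts)  # ['one', 'two', 'three', 'four', 'five']

# maxsplit limits the number of splits, counted from the left.
limited = pattern.split(example, maxsplit=2)
print(limited)  # ['one', 'two', 'three4four5five8888']
```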

In [161]:
example = 'one two    three four,five .six88seven'

In [162]:
pattern = re.compile(r'[\s,.\d]+') # allow repeated separators

In [163]:
pattern.split(example)

['one', 'two', 'three', 'four', 'five', 'six', 'seven']

### 3. Subpatterns and match objects

Parentheses **()** denote a subpattern: the content inside **()** is treated as a single unit. For example, '(red)+' matches 'redred', 'redredred', and other repetitions of 'red'.
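
A quick sketch of what the parentheses change, using fullmatch() so the whole string has to match:

```python
import re

grouped = re.fullmatch(r'(red)+', 'redredred')  # + repeats the whole group
ungrouped = re.fullmatch(r'red+', 'redredred')  # + repeats only the 'd'
only_d = re.fullmatch(r'red+', 'reddd')         # 're' followed by one or more 'd'

print(grouped is not None)  # True
print(ungrouped)            # None
print(only_d is not None)   # True
```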

In [164]:
telNumber = '''
Suppose my Phone No. is 0535-1234567, 
yours is 010-12345678 ,his is 025-87654321.
'''

In [165]:
print(telNumber)


Suppose my Phone No. is 0535-1234567, 
yours is 010-12345678 ,his is 025-87654321.



In [166]:
pattern = re.compile(r'(\d{3,4})-(\d{7,8})')

In [168]:
pattern.findall(telNumber)

[('0535', '1234567'), ('010', '12345678'), ('025', '87654321')]
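
When a pattern contains groups, findall() returns tuples of the groups rather than the full matches. One way to get the full numbers as well, sketched with finditer() on a shortened sample string:

```python
import re

pattern = re.compile(r'(\d{3,4})-(\d{7,8})')
text = 'mine is 0535-1234567, yours is 010-12345678'  # shortened sample

# finditer() yields match objects; group(0) is the whole match.
full = [m.group(0) for m in pattern.finditer(text)]
print(full)  # ['0535-1234567', '010-12345678']
```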

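The match and search methods of a regular expression object return a match object when they succeed.

The main methods of a match object are:
- group(): returns the content of one or more matched subpatterns (with no argument, the whole match)
- groups(): returns a tuple containing the content of all matched subpatterns
- groupdict(): returns a dict containing the content of all named subpatterns
- start(): returns the start index of the specified subpattern's content
- end(): returns the index just past the last character of the specified subpattern's content
- span(): returns the tuple (start, end) for the specified subpattern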
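
The methods listed above can be tried on a single match. In this sketch the group names area and number are chosen for illustration only:

```python
import re

s = 'Phone No. is 0535-1234567'
# (?P<name>...) defines a named subpattern; the names here are hypothetical.
m = re.search(r'(?P<area>\d{3,4})-(?P<number>\d{7,8})', s)

print(m.group())           # 0535-1234567 (the whole match)
print(m.group(1, 2))       # ('0535', '1234567')
print(m.groups())          # ('0535', '1234567')
print(m.groupdict())       # {'area': '0535', 'number': '1234567'}
print(m.start(), m.end())  # 13 25
print(m.span('number'))    # (18, 25)

# end() is one past the last matched character, so slicing recovers the match.
print(s[m.start():m.end()] == m.group())  # True
```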